Help Summary
About Pfam
Pfam 23.0 (Jul 2008, 10340 families)
Proteins are generally composed of one or more functional regions, commonly termed domains. The presence of different domains in varying combinations in different proteins gives rise to the diverse repertoire of proteins found in nature. Identifying the domains present in a protein can provide insights into the function of that protein.
The Pfam database is a large collection of protein domain families. Each family is represented by multiple sequence alignments and hidden Markov models (HMMs).
There are two levels of quality to Pfam families: Pfam-A and Pfam-B. Pfam-A entries are derived from the underlying sequence database, known as Pfamseq, which is built from the most recent release of UniProtKB at a given time-point. Each Pfam-A family consists of a curated seed alignment containing a small set of representative members of the family, profile hidden Markov models (profile HMMs) built from the seed alignment, and an automatically generated full alignment, which contains all detectable protein sequences belonging to the family, as defined by profile HMM searches of primary sequence databases.
Pfam-B families are un-annotated and of lower quality as they are generated automatically from the non-redundant clusters of the latest ADDA release. Although of lower quality, Pfam-B families can be useful for identifying functionally conserved regions when no Pfam-A entries are found.
Pfam entries are classified in one of four ways:
- Family:
- A collection of related proteins
- Domain:
- A structural unit which can be found in multiple protein contexts
- Repeat:
- A short unit which is unstable in isolation but forms a stable structure when multiple copies are present
- Motif:
- A short unit found outside globular domains
Related Pfam entries are grouped together into clans; the relationship may be defined by similarity of sequence, structure or profile-HMM.
Pfam Changes
This section details the changes that we plan to make or have made to Pfam. This includes changes to the flatfiles, MySQL database and the public website.
Latest changes to Pfam data
Changes between Pfam releases 22 and 23
Release 23.0 contains a total of 10340 families, with 1063 new families and 52 families killed since the previous release. 73.75% of all proteins in Pfamseq contain a match to at least one Pfam domain. 51.22% of all residues in the sequence database fall within Pfam domains. Pfam 23.0 is based on UniProt release 12.5, a composite of Swiss-Prot release 54.5 and TrEMBL release 37.5.
Latest changes to website
Release 1.8 (24th March 2009)
- Integrated posts from the Xfam blog: recent blog posts are now shown in the home page
- Better validation of batch sequences: yet more improvements to the validation process for uploaded sequences. The latest changes should avoid problems with invalid header lines, which could cause searches to fail silently
- Added "image only" option to graphics generator: the Pfam graphics generator now has an option to return just the domain graphic, rather than an HTML page containing the graphic
- Improvements to database documentation: fixed some broken queries and added more explanation of some features
- Fixed formatting of structure mapping table: the table in the "Structure" tab of the family page could break badly for some families. This should now be fixed
- Fixed problems of large HMM logos crashing Mozilla browsers: very large HMM logos are known to crash the web browser in certain situations. The server now tries to detect those situations and will not load the logo by default
Getting Started using Pfam
Using the "Jump to" search
Many pages in the site include a small search box, entitled "Jump to...". The "Jump to..." box allows you to go immediately to the page for any entry in the Pfam site, including Pfam families, clans and UniProt sequence entries.
The "Jump to..." search understands accessions and IDs for
most types of entry. For example, you can enter either a Pfam family
accession, e.g. PF02171, or, if you find it easier to
remember, a family ID, such as piwi. Note that the search
is case insensitive.
Because some identifiers can be ambiguous, the "Jump to..."
search may need to test several types of identifier to find
the entry that you're looking for. For example, Pfam A family IDs (e.g.
Kazal_1) and Pfam clan IDs (e.g. Kazal) aren't easily distinguished, so
if you enter kazal, the search will first look for a
family called kazal and, if it doesn't find one, will then
look for a clan called kazal. If all of the guesses fail, you'll
see an error message saying "Entry not found".
The order in which the search tries the various types of ID and accession is given below:
- Pfam A accession, e.g. PF02171
- Pfam A identifier, e.g. piwi
- Pfam B accession, e.g. PB000001
- Pfam B identifier, e.g. Pfam-B_1
- UniProt sequence accession, e.g. P00789
- UniProt sequence ID, e.g. CANX_CHICK
- NCBI "GI" number, e.g. 113594566
- NCBI secondary accession, e.g. BAF18440.1
- Pfam clan accession, e.g. CL0005
- metaseq ID, e.g. JCVI_ORF_1096665732460
- metaseq accession, e.g. JCVI_PEP_1096665732461
- Pfam clan ID, e.g. Kazal
- PDB entry, e.g. 2abl
- Proteome species name, e.g. Homo sapiens
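The try-in-order behaviour described above can be sketched in Python. This is only an illustration: the `jump` function, the `LOOKUPS` table and its toy contents are hypothetical stand-ins for the site's real database queries, and only four of the identifier types are included.

```python
# Hypothetical, heavily reduced lookup tables. Keys are stored in lower case
# because the search is case insensitive.
LOOKUPS = [
    ("Pfam A accession", {"pf02171"}),
    ("Pfam A identifier", {"piwi", "kazal_1"}),
    ("Pfam clan accession", {"cl0005"}),
    ("Pfam clan ID", {"kazal"}),
]


def jump(entry, lookups=LOOKUPS):
    """Return (entry_type, key) for the first identifier type that matches."""
    key = entry.strip().lower()
    for entry_type, known in lookups:
        if key in known:
            return entry_type, key
    # Every guess failed.
    raise LookupError("Entry not found")


if __name__ == "__main__":
    # "kazal" is not a family ID in the toy data, so the clan ID matches.
    print(jump("kazal"))
    print(jump("PIWI"))
```

Because the tables are tried strictly in order, an identifier that is valid for two entry types always resolves to the type that appears earlier in the list.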
Keyword search
Every page in the Pfam site includes a search box in the page header. You can use this to find Pfam A families which match a particular keyword. The search includes several different areas of the Pfam database:
- text fields in Pfam entries, e.g. family descriptions
- UniProt sequence entry description and species fields
- HEADER and TITLE fields from PDB entries
- Gene Ontology IDs and terms
- InterPro entry abstracts
Each Pfam A entry is listed only once in the results table, although it might have been found in more than one area of the database.
Searching a protein sequence against Pfam
Searching a protein sequence against the Pfam library of HMMs will allow you to find out the domain architecture of the protein.
Single protein search
If your protein is present in the version of UniProt we used to make the current release of Pfam, we have already calculated its domain architecture, which you can access by entering the UniProt accession or ID on the Pfam homepage. If your protein is new, you will need to paste the protein sequence into the search page and we will search your sequence against our HMMs and display the matches for you.
Medium scale protein searches (fewer than 5,000 sequences)
If you have a few thousand protein searches to do, you can use our batch upload facility to upload a file of your sequences in FASTA format. We will run them against our HMMs and email the results back to you, usually within 48 hours. We request that you put a maximum of 1000 sequences in each file.
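If your sequences are in one large FASTA file, a short script can split them into batches that respect the 1000-sequence limit. The following is a minimal sketch; the function name `split_fasta` and the decision to return the batches as strings are assumptions made for illustration, not part of any Pfam tool.

```python
def split_fasta(text, max_per_file=1000):
    """Split FASTA text into chunks holding at most max_per_file records."""
    records, current = [], []
    for line in text.splitlines():
        if line.startswith(">"):
            # A header line starts a new record.
            if current:
                records.append("\n".join(current))
            current = [line]
        elif current and line.strip():
            # Sequence line belonging to the record being read.
            current.append(line)
    if current:
        records.append("\n".join(current))
    return [
        "\n".join(records[i:i + max_per_file]) + "\n"
        for i in range(0, len(records), max_per_file)
    ]


if __name__ == "__main__":
    demo = "".join(f">seq{i}\nMKVLAT\n" for i in range(2500))
    for number, chunk in enumerate(split_fasta(demo), start=1):
        print(number, chunk.count("\n>") + 1)
```

Each returned chunk can be written to its own file and uploaded separately.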
Large scale protein searches (more than 5,000 sequences)
If you have a large number of protein searches, it may be more convenient to run Pfam locally. To do this you will need the HMMER2 software, the Pfam HMM libraries and a couple of additional files from the Pfam website.
- Download the HMMER2 software.
-
Download the Pfam files Pfam_ls, Pfam_fs, Pfam_ls.bin, Pfam_fs.bin, Pfam_ls.bin.ssi, Pfam_fs.bin.ssi, Pfam-A.seed and Pfam-C from the ftp site.
These files contain the HMMs and additional information that is required to carry out the searches.
-
Download a copy of pfam_scan.pl from the ftp site.
This is a wrapper script around hmmpfam, the HMMER program that searches query sequences against a library of profile HMMs. If perldoc is installed, more detailed instructions on how to use pfam_scan.pl can be found by typing on the command line 'perldoc pfam_scan.pl'.
-
On the command line enter:
pfam_scan.pl -d <directory_location_of_Pfam_files> <fasta_file_of_proteome>
For example, if the files were downloaded into a folder called pfam_files, and the FASTA sequences were in a file called sequences.fasta, type:
pfam_scan.pl -d pfam_files sequences.fasta
pfam_scan.pl will search the FASTA sequences against the profile HMMs and report all matches to families that score higher than the manually set thresholds for each of the Pfam families. The output format will look something like this:
<seq_id> <seq_start> <seq_end> <hmm_acc> <hmm_start> <hmm_end> <bit_score> <evalue> <hmm_name>
e.g.
O00519 95 562 PF01425.9 1 513 526.5 2.6e-155 Amidase
O01636 22 139 PF02408.8 1 137 157.3 3.6e-44 DUF141
O03046 12 137 PF02788.5 1 132 295.2 1.1e-85 RuBisCO_large_N
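If you want to post-process these results, the nine whitespace-separated columns are straightforward to parse. The sketch below assumes the output looks exactly like the example rows above; the `PfamHit` type and `parse_hit` function are hypothetical names introduced here, and real output may contain additional lines that would need to be skipped.

```python
from typing import NamedTuple


class PfamHit(NamedTuple):
    seq_id: str
    seq_start: int
    seq_end: int
    hmm_acc: str
    hmm_start: int
    hmm_end: int
    bit_score: float
    evalue: float
    hmm_name: str


def parse_hit(line):
    """Parse one result line in the nine-column format shown above."""
    fields = line.split()
    if len(fields) != 9:
        raise ValueError(f"expected 9 columns, found {len(fields)}")
    return PfamHit(
        fields[0], int(fields[1]), int(fields[2]),
        fields[3], int(fields[4]), int(fields[5]),
        float(fields[6]), float(fields[7]), fields[8],
    )


if __name__ == "__main__":
    hit = parse_hit("O00519 95 562 PF01425.9 1 513 526.5 2.6e-155 Amidase")
    # Length of the matched region on the sequence, in residues.
    print(hit.hmm_name, hit.seq_end - hit.seq_start + 1)
```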
Proteome analysis
Pfam pre-calculates the domain architecture for the proteomes present in Integr8. To see the list of proteomes, visit the genomes page within the Pfam website. From here you will be able to access a table showing the Pfam families that are found in a particular proteome, a short description of each of those families and the number of proteins that match each family. Links from these pages will allow you to investigate particular proteins further. You can also compare two proteomes and find out which protein families are shared between the proteomes, and which are unique to a particular proteome.
The taxonomy query allows quick identification of families/domains which are present in one species but are absent from another. It can also be used to find families/domains that are unique to a particular species (note this can be very slow).
Finding proteins with a specific set of domain combinations ('architectures')
Pfam allows you to retrieve all the proteins with a particular domain combination (e.g. proteins containing both a CBS domain and an IMPDH domain) by using the domain query tool. For a more detailed study of domain architectures you should use PfamAlyzer, a tool that is hosted from the Swedish Pfam site. PfamAlyzer allows the user to find proteins which contain a specific combination of domains, and it allows the user to specify particular species and the distances allowed between domains.
- What is Pfam ?
- What is the difference between Pfam-A and Pfam-B families?
- What is a clan ?
- Why are there two HMMs for each Pfam entry ?
- Can I search DNA against Pfam ?
- How can I submit a new domain ?
- What is iPfam ?
- How can I search Pfam locally ?
- Why doesn't Pfam include my sequence ?
- What's on the family page ?
- How many accurate alignments do you have ?
- Can I search my protein against Pfam ?
- Can I search my DNA sequence in translation against Pfam ?
- What is the difference between the - and . characters in your full alignments ?
- What do the SS lines in the alignment mean ?
- You don't have domain YYYY in Pfam !
- Are there other databases which do this ?
- So which database is better ?
What is Pfam ?
Pfam is a collection of multiple sequence alignments and profile hidden Markov models (HMMs). Each Pfam HMM represents a protein family or domain. By searching a protein sequence against the Pfam library of HMMs you can find out its domain architecture. You can also use Pfam to analyse proteomes and domain architectures.
For each Pfam entry we have a family page which can be accessed in several ways: by following links for a particular family/domain on the website, by clicking on a graphical image of a domain, or by searching for a particular family using the search box on the Pfam website. From the family page you can view the alignments and annotation (including annotation from the InterPro database where available), view the species distribution of a domain, access cross-links to other databases and use other tools for protein analysis. We display structural information for each family where available.
What is the difference between Pfam-A and Pfam-B families ?
Pfam contains two types of families, Pfam-A and Pfam-B. Pfam-A families are manually curated HMM based families which we build using an alignment of a small number of representative sequences (we call this alignment the 'seed' alignment). We manually set a threshold value for each HMM, and this determines the minimum score a sequence must attain to belong to the family. We search all of our HMMs against the UniProt database, and include all sequences that score above the cut-off value for a particular family in the family's full alignment. For each family we build two HMMs, one to represent fragment matches and one to represent full length matches. We use the HMMER2 software to build and search our profile HMMs. Pfam-A matches are very unlikely to be false matches.
To complement the Pfam-A families, we automatically generate Pfam-B families using the ADDA database. Pfam-B families are formed by taking alignments of sequence segments from ADDA, and removing any Pfam-A residues from them. Some Pfam-B families are composed of low complexity regions and may not reflect true relationships; we therefore recommend that you verify that sequences in a Pfam-B family are related by using other methods such as BLAST.
All families in Pfam are non-overlapping such that no amino acid belongs to more than one family/domain. At each Pfam release we search all of our models against an updated version of UniProt, and regenerate our Pfam-B families using the most recent version of ADDA.
What is a clan ?
Some of the Pfam families are grouped into 'clans'. Pfam defines a clan as a collection of families that have arisen from a single evolutionary origin. Evidence of their evolutionary relationship can be in the form of similarity in tertiary structures, or when structures are not available, by common sequence motifs. The seed alignments for all families within a clan are aligned and the resulting alignment (called the clan alignment) can be accessed from a link on the clan page. Each clan page includes a clan alignment, a description of the clan and database links where appropriate. The clan pages can be accessed by following a link from the family page, or alternatively by clicking on 'clans' under the 'browse by' menu on any Pfam page.
Why are there two HMMs for each Pfam entry ?
For each Pfam entry we build two HMMs, one to represent full length matches (ls model), and one to represent fragment matches (fs model). When you search a protein on our site we search it against both the ls and fs models.
Can I search DNA against Pfam ?
The Wise2 software package allows the comparison of protein HMMs to genomic DNA. We use this package to allow users to search single DNA sequences against the library of Pfam HMMs. To do this, you will need to go to the Pfam website hosted at the Sanger Institute and paste your DNA sequence in the DNA search box on the search page. The results take approximately 2 minutes for a 1 kb sequence, and approximately 1 hour for an 80 kb sequence.
How can I submit a new domain ?
If you know of a domain that is not present in Pfam, you can
submit it to us by email
(pfam-help@sanger.ac.uk) and we will endeavour to build a Pfam entry
for it. We ask that you supply us with a multiple sequence
alignment of the domain, together with associated literature evidence
if available. Please send the alignment as a plain text file
(e.g. .txt) and not in the format of a specific application such as
Microsoft Word (e.g. a .doc file).
What is iPfam ?
iPfam is a resource that describes domain-domain interactions that are observed in PDB entries. Where two or more Pfam domains occur in a single structure, it analyses them to see if they are close enough to form an interaction. If they are close enough it calculates the bonds forming the interaction. Further information can be found on the iPfam help pages.
How can I search Pfam locally ?
If you have a large number of sequences or you don't want to post your sequence across the web, you can search your sequence locally. Please see the help page on getting started with Pfam.
Why doesn't Pfam include my sequence ?
Pfam is built from a fixed release of UniProt. At each Pfam release we incorporate sequences from the latest release of UniProt. This means that at any time the sequences used by Pfam could be several months behind those in the most up-to-date versions of the sequence databases. If your sequence isn't in Pfam you can still find out what domains it contains by pasting it into the search box in Pfam.
What's on the family page ?
Each family has the following data:
- A seed alignment which is a hand edited multiple alignment representing the family.
- Hidden Markov Models (HMMs) derived from the seed alignment, which can be used to find new members of the domain and also to realign a set of sequences to the model. One HMM is an ls mode (global) model, the other is an fs mode (local) model.
- A full alignment which is an automatic alignment of all the examples of the domain using the two HMMs to find and then align the sequences.
- Annotation that contains a brief description of the domain, links to other databases and some Pfam-specific data that record how the family was constructed.
How many accurate alignments do you have ?
Release 23.0 has 10340 families. Over 73.7% of the proteins in Swiss-Prot 54.5 and TrEMBL 37.5 have at least one match to a Pfam-A family.
Can I search my protein against Pfam ?
Of course! Please use this search form.
Can I search my DNA sequence in translation against Pfam ?
Yes you can, on the DNA search page.
What is the difference between the - and . characters in your full alignments ?
The '-' and '.' characters both represent gap characters. However, they do tell you some extra information about how the HMM has generated the alignment. The '-' symbols are where the alignment of the sequence has used a delete state in the HMM to jump past a match state. This means that the sequence is missing a column that the HMM was expecting to be there. The '.' character is used to pad gaps where one sequence in the alignment has sequence from the HMM's insert state. See the alignment below where both characters are used. The HMM states emitting each column are shown. Note that residues emitted from the Insert (I) state are in lower case.
FLPA_METMA/1-193    ---MPEIRQLSEGIFEVTKD.KKQLSTLNLDPGKVVYGEKLISVEGDE
FBRL_XENLA/86-317   RKVIVEPHR-HEGIFICRGK.EDALVTKNLVPGESVYGEKRISVEDGE
FBRL_MOUSE/90-321   KNVMVEPHR-HEGVFICRGK.EDALFTKNLVPGESVYGEKRVSISEGD
O75259/81-312       KNVMVEPHR-HEGVFICRGK.EDALVTKNLVPGESVYGEKRVSISEGD
FBRL_SCHPO/71-303   AKVIIEPHR-HAGVFIARGK.EDLLVTRNLVPGESVYNEKRISVDSPD
O15647/71-301       GKVIVVPHR-FPGVYLLKGK.SDILVTKNLVPGESVYGEKRYEVMTED
FBRL_TETTH/64-294   KTIIVK-HR-LEGVFICKGQ.LEALVTKNFFPGESVYNEKRMSVEENG
FBRL_LEIMA/57-291   AKVIVEPHMLHPGVFISKAK.TDSLCTLNMVPGISVYGEKRIELGATQ
Q9ZSE3/38-276       SAVVVEPHKVHAGIFVSRGKsEDSLATLNLVPGVSVYGEKRVQTETTD
HMM STATES          MMMMMMMMMMMMMMMMMMMMIMMMMMMMMMMMMMMMMMMMMMMMMMMM
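The character conventions described above can be turned into a small classifier. This is a minimal sketch using two rows from the example alignment; the function names `classify` and `insert_columns` are hypothetical and are not part of HMMER or Pfam.

```python
def classify(ch):
    """Map one alignment character to the HMM behaviour it records."""
    if ch == "-":
        return "delete"   # delete state used to skip a match column
    if ch == ".":
        return "pad"      # padding opposite another sequence's insertion
    if ch.islower():
        return "insert"   # residue emitted from an insert state
    return "match"        # upper-case residue emitted from a match state


def insert_columns(rows):
    """Return the 0-based columns in which any sequence has an inserted residue."""
    return sorted({i for row in rows for i, ch in enumerate(row) if ch.islower()})


ROWS = [
    "---MPEIRQLSEGIFEVTKD.KKQLSTLNLDPGKVVYGEKLISVEGDE",  # FLPA_METMA/1-193
    "SAVVVEPHKVHAGIFVSRGKsEDSLATLNLVPGVSVYGEKRVQTETTD",  # Q9ZSE3/38-276
]

if __name__ == "__main__":
    print(insert_columns(ROWS))
    print(classify(ROWS[0][0]), classify(ROWS[0][20]), classify(ROWS[1][20]))
```

The single insert column found this way corresponds to the one 'I' in the HMM STATES line of the example.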
What do the SS lines in the alignment mean ?
These lines are structural information. The SS stands for secondary structure, and this is taken from DSSP. The following list gives the definitions for each code letter:
- C
- Random Coil
- H
- Alpha-helix
- G
- 3(10) helix
- I
- Pi-helix
- E
- Hydrogen bonded beta-strand (extended strand)
- B
- Residue in isolated beta-bridge
- T
- H-bonded turn (3-turn, 4-turn, or 5-turn)
- S
- Bend (five-residue bend centered at residue i)
You don't have domain YYYY in Pfam !
We are very keen to be alerted to new domains. If you can provide us with a multiple alignment then we will try hard to incorporate it into the database. If you know of a domain but don't have a multiple alignment, we still want to know; for simple families just one sequence is enough. Again, e-mail pfam-help@sanger.ac.uk.
Can we have this running sensibly locally ? Do you have software to support Pfam locally ?
In terms of HMMs and formats, Pfam is based around the HMMER2 package. This will need to be installed on your local machine, and you will also need to download the Pfam HMM libraries from the FTP site. These files are called Pfam_ls (global models) and Pfam_fs (fragment models).
Are there other databases which do this ?
To a certain extent yes, there are a number of "second generation" databases which are trying to organise the protein database into evolutionarily conserved regions. Examples include:
- PROSITE
- This originally was based around regular expression patterns but now also includes profiles.
- PRINTS
- This is based around protein "finger-prints" of a series of small conserved motifs making up a domain.
- BLOCKS
- This is based around automatic ungapped alignments.
- SMART
- This is a database concentrating on extracellular modules and signaling domains.
- ADDA
- This is an automatic algorithm for domain decomposition and clustering of protein domain families.
- InterPro
- Combines information from Pfam, PRINTS, SMART, PROSITE and ProDom.
- CDD
- The Conserved Domain Database is derived from Pfam and SMART databases.
So which database is better ?
As with everything, it depends on your problem: I would certainly suggest using more than one method. Pfam is likely to provide more interpretable results with crisp definitions of domains in a protein.
Glossary of terms used in Pfam
These are some of the commonly used terms in the Pfam website.
Architecture
The collection of domains that are present on a protein.
Build method
The order in which the ls (global) and fs (fragment) matches are aligned to the model to give the full alignment. The build method can be globalfirst, where ls matches are aligned first followed by fs matches that do not overlap; byscore, where matches are aligned in order of E-value; or localfirst, where fs matches are aligned first followed by ls matches that do not overlap.
Clan
A collection of related Pfam entries. The relationship may be defined by similarity of sequence, structure or profile-HMM.
Domain
A structural unit which can be found in multiple protein contexts.
Domain score
The score of a single domain aligned to a HMM. If there is more than one domain, the sequence score is the sum of all the domain scores for that Pfam entry. If there is only a single domain, the sequence and the domain scores for the protein will be identical.
DUF
Domain of unknown function.
Family
A collection of related proteins.
Full alignment
An alignment of the set of related sequences which score higher than the manually set threshold values for the HMMs of a particular Pfam entry.
Gathering threshold (GA)
Also called the gathering cut-off, this value is the search threshold used to build the full alignment. The GA is the minimum score a sequence must attain in order to belong to the full alignment of a Pfam entry. For each Pfam HMM we have two GA cutoff values, a sequence cutoff and a domain cutoff.
Hidden Markov model (HMM)
A HMM is a probabilistic model. In Pfam we use HMMs to transform the information contained within a multiple sequence alignment into a position-specific scoring system. We search our HMMs against the (UniProt) protein database to find homologous sequences.
HMMER2
The suite of programs that Pfam uses to build and search HMMs. See the HMMER site.
iPfam
A resource that describes domain-domain interactions that are observed in PDB entries. Where two or more Pfam domains occur in a single structure, it analyses them to see if they are close enough to form an interaction. If they are close enough it calculates the bonds forming the interaction.
Metaseq
A collection of sequences derived from various metagenomics datasets.
Motif
A short unit found outside globular domains.
Noise cutoff (NC)
The bit scores of the highest scoring match not in the full alignment.
Pfam-A
A HMM based hand curated Pfam entry which is built using a small number of representative sequences. We manually set a threshold value for each HMM and search our models against the UniProt database. All of the sequences which score above the threshold for a Pfam entry are included in the entry's full alignment.
Pfam-B
An automatically generated alignment which is formed by taking an alignment from the ADDA database and removing Pfam-A residues from it. Since Pfam-B families are automatically generated we recommend that you verify that the sequences in a Pfam-B are related, using other methods such as BLAST.
Repeat
A short unit which is unstable in isolation but forms a stable structure when multiple copies are present.
Seed alignment
An alignment of a set of representative sequences for a Pfam entry. We use this alignment to construct the HMMs for the Pfam entry.
Sequence score
The total score of a sequence aligned to a HMM. If there is more than one domain, the sequence score is the sum of all the domain scores for that Pfam entry. If there is only a single domain, the sequence and the domain scores for the protein will be identical. We use the sequence score to determine whether a sequence belongs to the full alignment of a particular Pfam entry.
Trusted cutoff (TC)
The bit scores of the lowest scoring match in the full alignment.
Help With Pfam HMM scores
What Pfam HMM scores mean
Pfam-A is based around hidden Markov model (HMM) searches, as provided by the HMMER2 package. In HMMER2, like BLAST, E-values (expectation values) are calculated. The E-value is the number of hits that would be expected to have a score equal or better than this by chance alone. A good E-value is much less than 1. Around 1 is what we expect just by chance. In principle, all you need to decide on the significance of a match is the E-value.
However, there are a few complications.
The most serious complication is that there are no analytical results available for accurately determining E-values for gapped alignments, especially profile HMM alignments. HMMER uses empirical methods to estimate E-values. These methods are generally rather accurate. However, when in doubt, HMMER tends to err on the conservative side.
We use a second, and even more empirical, system in maintaining Pfam models. This system is implemented in the Pfam database rather than in the HMMER software. For each Pfam family, we record a "trusted cutoff" and a "noise cutoff", TC1 and NC1. TC1 is the lowest score for sequences we included in the family (e.g. in the Full alignment). NC1 is the highest score for sequences we did not include in the Full alignment. (Since Full alignments are produced automatically, the trusted sequence cutoff is always greater than the noise sequence cutoff.)
Therefore, we can consider a hit very significant if it scores better than the trusted cutoff, better than the noise cutoff, and has a significant E-value. Sometimes sequences score better than the cutoffs though they don't have significant E-values; these are marginal hits that we've chosen to include in the family.
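The interpretation above can be expressed as a small decision function. This is a sketch only: the name `describe_hit`, the wording of the labels, the E-value threshold of 0.01 and the cutoff values in the usage example are all assumptions made for illustration, and the two labels for scores below the trusted cutoff are not taken from the text.

```python
def describe_hit(bit_score, evalue, trusted_cutoff, noise_cutoff, max_evalue=0.01):
    """Label a hit using a family's trusted/noise cutoffs and its E-value.

    max_evalue is an assumed significance threshold; the text only says that
    a good E-value is much less than 1.
    """
    significant = evalue <= max_evalue
    if bit_score >= trusted_cutoff:
        if significant:
            return "very significant"
        return "marginal, but included in the family"
    if bit_score > noise_cutoff:
        return "between the noise and trusted cutoffs"
    return "at or below the noise cutoff"


if __name__ == "__main__":
    # Hypothetical cutoffs of 25.0 (trusted) and 20.0 (noise) bits.
    print(describe_hit(526.5, 2.6e-155, 25.0, 20.0))
    print(describe_hit(26.0, 0.5, 25.0, 20.0))
```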
Sequence versus domain scores
There's one additional wrinkle in the scoring scheme. HMMER2 calculates two kinds of scores. The "sequence classification score" is the total score of a sequence aligned to a model; if there is more than one domain, the sequence score is the sum of all the domain scores (finding multiple domains increases our confidence that the sequence belongs to that protein family, even if each domain individually is a weak match). The "domain score" is a score for a single domain (these two scores are identical for single domain proteins).
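A short worked example may help to show why the distinction matters. In the sketch below, the function names, the domain scores and the sequence cutoff of 20.0 bits are all hypothetical; the only rule taken from the text is that the sequence score is the sum of the domain scores.

```python
def sequence_score(domain_scores):
    """Sequence score: the sum of the scores of all domains found for one model."""
    return sum(domain_scores)


def passes_sequence_cutoff(domain_scores, sequence_cutoff):
    """True if the summed score reaches the (assumed) sequence cutoff."""
    return sequence_score(domain_scores) >= sequence_cutoff


if __name__ == "__main__":
    weak_repeats = [12.0, 11.5]  # two individually weak domain matches
    print(sequence_score(weak_repeats))
    print(passes_sequence_cutoff(weak_repeats, 20.0))
    print(passes_sequence_cutoff([12.0], 20.0))
```

Two weak matches that would each fall short on their own can together lift the sequence over the threshold, which is the behaviour described above for multi-domain proteins.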
References & Bibliography
Pfam References
Book Chapters on Pfam
How to link to Pfam?
Pfam is maintained by a consortium of researchers based at the Wellcome Trust Sanger Institute, Cambridge, UK (WTSI), Stockholm Bioinformatics Center, Stockholm, Sweden (SBC), and Janelia Farm, Virginia, USA. All three sites run the same Pfam website and linking to different sites only requires that you change the site name, not the parameters in the URL.
Although we have no plans to change the locations of resources within this site dramatically, webmasters are advised to link only to the following types of page within the site.
Home pages
- WTSI:
- http://pfam.sanger.ac.uk/
- SBC:
- http://pfam.sbc.su.se/
- Janelia:
- http://pfam.janelia.org/
Searching a protein sequence against Pfam
- WTSI:
- http://pfam.sanger.ac.uk/search?tab=sequenceSearchBlock
- SBC:
- http://pfam.sbc.su.se/search?tab=sequenceSearchBlock
- Janelia:
- http://pfam.janelia.org/search?tab=sequenceSearchBlock
Searching a DNA sequence against Pfam
- WTSI:
- http://pfam.sanger.ac.uk/search?tab=sequenceDnaBlock
- SBC:
- http://pfam.sbc.su.se/search?tab=sequenceDnaBlock
- Janelia:
- http://pfam.janelia.org/search?tab=sequenceDnaBlock
Linking to Pfam family pages
You can refer to Pfam families either by accession or ID. You can also refer to a family by "entry", although this is a convenience that should be used only if you're not sure if what you have is an accession or an ID.
Pfam accession numbers are more stable between releases than IDs and we strongly recommend that you link by accession number.
Here are some examples of linking to Pfam at WTSI:
- By accession:
- http://pfam.sanger.ac.uk/family?acc=PF00002
- By ID:
- http://pfam.sanger.ac.uk/family?id=7tm_2
- Using "entry":
-
http://pfam.sanger.ac.uk/family?entry=PF00002
or
http://pfam.sanger.ac.uk/family?entry=7tm_2
- Directly:
-
http://pfam.sanger.ac.uk/family/PF00002
or
http://pfam.sanger.ac.uk/family/7tm_2
You can link to Pfam family data at the other sites by changing "pfam.sanger.ac.uk" to "pfam.sbc.su.se" or "pfam.janelia.org".
Linking to protein sequence pages
As for Pfam family pages, you can refer to protein sequence pages by accession, ID or entry. Protein IDs are unstable and do change between releases, so, again, we strongly recommend that you use protein accessions where possible.
Here are some examples of linking to protein sequence pages at WTSI:
- By accession:
- http://pfam.sanger.ac.uk/protein?acc=P15498
- By ID:
- http://pfam.sanger.ac.uk/protein?id=VAV_HUMAN
- Using "entry":
-
http://pfam.sanger.ac.uk/protein?entry=P15498
or
http://pfam.sanger.ac.uk/protein?entry=VAV_HUMAN
- Directly:
-
http://pfam.sanger.ac.uk/protein/P15498
or
http://pfam.sanger.ac.uk/protein/VAV_HUMAN
Again, to generate links to the other Pfam sites, change "pfam.sanger.ac.uk" to "pfam.sbc.su.se" or "pfam.janelia.org".
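If you generate many links programmatically, the recommended accession-based form can be built with a small helper. This is a sketch under the assumption that only the host name differs between the three sites, as stated above; the `SITES` table and the `page_url` function are hypothetical names.

```python
from urllib.parse import quote

# Host names of the three Pfam sites listed above.
SITES = {
    "WTSI": "pfam.sanger.ac.uk",
    "SBC": "pfam.sbc.su.se",
    "Janelia": "pfam.janelia.org",
}


def page_url(section, accession, site="WTSI"):
    """Build an accession-based link; section is 'family' or 'protein'."""
    if section not in ("family", "protein"):
        raise ValueError("section must be 'family' or 'protein'")
    return f"http://{SITES[site]}/{section}?acc={quote(accession)}"


if __name__ == "__main__":
    print(page_url("family", "PF00002"))
    print(page_url("protein", "P15498", site="SBC"))
```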
Linking to the "jump to" search
The Pfam website features a search tool that tries to guess the type of any accession or ID that it is given. For example, if given "VAV_HUMAN", the search returns the URL for the protein sequence page for the VAV_HUMAN entry. If given "1w9h", the search returns the URL for the PDB entry (structure) 1w9h.
You can use the "jump to" search if you need to link to Pfam but
can't be sure what type of accession or ID you will be using in your link.
By default, the search returns the URL that it has found, as a simple,
plain text HTTP response. Adding the parameter redirect=1
will make the "jump to" tool redirect to the URL that it finds
or, if it couldn't find an appropriate URL, to the Pfam homepage.
- Return URL:
- http://pfam.sanger.ac.uk/search/jump?entry=P15498
- Redirect:
- http://pfam.sanger.ac.uk/search/jump?entry=P15498&redirect=1
Note that, although it may be convenient to link to Pfam using this search tool, there is no error reporting for your users if the search fails to find an appropriate URL in the Pfam site. It is much safer to link directly to the correct section of the site. Please contact us if you need help with building specific links.
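For completeness, here is a minimal sketch of building the two forms of "jump to" link shown above. The `jump_url` function is a hypothetical helper; it only constructs the URL and does not contact the Pfam site.

```python
from urllib.parse import urlencode


def jump_url(entry, redirect=False, host="pfam.sanger.ac.uk"):
    """Build a "jump to" URL, optionally asking the server to redirect."""
    params = {"entry": entry}
    if redirect:
        params["redirect"] = 1
    return f"http://{host}/search/jump?{urlencode(params)}"


if __name__ == "__main__":
    print(jump_url("P15498"))
    print(jump_url("P15498", redirect=True))
```

Using `urlencode` rather than string concatenation means that an entry containing spaces, such as a species name, is escaped correctly.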
One of the features provided by the Pfam website is a graphical representation of the features found within a sequence, termed domain graphics. There are a variety of different shapes and styles and each one has a particular meaning. This page gives an in-depth description of the elements of Pfam domain graphics.
The library which generates the images in this page and throughout the Pfam site uses an XML language to describe the domain graphic that is required. Each of the example graphics in this page is followed by a link that can be used to show the XML that produced it.
We provide a set of tools, described in the Tools & Web Services section of the help pages, that allow you to generate custom domain graphics by uploading your own XML file, or to generate graphics for a specific UniProt sequence, given the UniProt accession or ID.
The sequence
The base sequence, undecorated by any domains or features, is represented by a plain grey bar:
Show XML
The length of the domain graphic that is drawn is proportional to the length of the sequence itself. The graphics in this page are drawn with an X-scale of 0.5 pixels per amino-acid, so that a 200-residue sequence will result in a 100 pixel-wide image. Any domains or features which are drawn on the sequence are scaled by the same factor.
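The scaling described above is a single multiplication. The sketch below shows the arithmetic in Python. The function names are hypothetical, and how the real graphics library rounds fractional pixels is not stated here, so no rounding is applied.

```python
X_SCALE = 0.5  # pixels per amino-acid, the scale used for the graphics on this page


def to_pixels(residues, x_scale=X_SCALE):
    """Convert a length or position in residues to pixels."""
    return residues * x_scale


def scale_region(start, end, x_scale=X_SCALE):
    """Scale the start and end of a domain or feature by the same factor as the sequence."""
    return to_pixels(start, x_scale), to_pixels(end, x_scale)


print(to_pixels(200))         # 100.0
print(scale_region(50, 120))  # (25.0, 60.0)
```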
Pfam-A
The high quality, curated Pfam-A domains are classified into one of four different types: family, domain, repeat and motif (more details). These different classification types are rendered slightly differently.
Family/domain
It is possible for a sequence to match either the full length of a Pfam HMM (a full length match), or to match a portion of an HMM (a fragment match). The two types of match are rendered differently.
Both family and domain entries are rendered as rectangles with curved ends when the sequence is a full length match. The curves at the ends become less pronounced when the domains are short, as shown in the second domain below. Different types of domain are displayed with different colours. When the domain image is long enough, the domain name is shown within the domain itself. In most cases, you can click on the domains to visit the "family page" for that domain. Moving the mouse over the domain image should also display a tooltip showing the domain name, as well as the start and end positions of the domain.
Show XML
When the sequence does not match the full length of the HMM that models a Pfam entry, matching domain fragments are shown. When a sequence match does not pass through the first position in the HMM, the N-terminal side of the domain graphic is drawn with a jagged edge instead of a curved edge. Similarly, when a sequence match does not pass through the last position of the HMM, the C-terminal side of the domain graphic is drawn with a jagged edge. In some rarer cases, the sequence match may not pass through either of the first or last positions of the HMM, in which case both sides are drawn with jagged edges. Examples of all three cases are shown here:
Show XML
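The rule for choosing between curved and jagged edges can be written as a small function. This is a sketch of the rule as described above, not the code used by the Pfam graphics library. The names hmm_from, hmm_to and model_length are hypothetical, and HMM positions are assumed to be numbered from 1.

```python
def end_styles(hmm_from, hmm_to, model_length):
    """Pick the edge style for each side of a family/domain graphic.

    hmm_from and hmm_to are the first and last HMM positions (1-based)
    covered by the sequence match; model_length is the length of the HMM.
    All three names are hypothetical.
    """
    # A match that starts after the first HMM position is an N-terminal fragment.
    n_term = "curved" if hmm_from == 1 else "jagged"
    # A match that stops before the last HMM position is a C-terminal fragment.
    c_term = "curved" if hmm_to == model_length else "jagged"
    return n_term, c_term


print(end_styles(1, 100, 100))   # full length match
print(end_styles(10, 100, 100))  # N-terminal side is a fragment
print(end_styles(1, 80, 100))    # C-terminal side is a fragment
print(end_styles(10, 80, 100))   # both sides are fragments
```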
Repeat/motif
Repeats and motifs are types of Pfam domain which do not form independently folded units. In order to distinguish them from domains of type family and domain, repeats and motifs are represented by rectangles with straight edges. As for families and domains, partial matches are represented with jagged edges.
Show XML
Discontinuous nested domains
Some domains in Pfam are disrupted by the insertion of another domain (or domains) within them. A number of names have been given to this arrangement: discontinuous (referring to the outer domain), inserted or nested (both referring to the inner domain). For example, in many sequences containing an IMPDH domain, the IMPDH domain is continuous along the primary sequence. However, in some cases the linear sequence of the IMPDH domain is broken by the insertion of a CBS domain, as shown below.
Where three-dimensional structures are available for representatives of a Pfam domain, it is generally clear that the three-dimensional arrangement of the domain containing the nested domain is maintained. Typically the nested domain is found inserted within a surface exposed loop, having little or no effect on the structure of the other domain. Such an arrangement explains why and how these nested domains can be functionally tolerated.
To represent this arrangement of domains graphically, the discontinuous domain is drawn in two parts (as shown below). The two parts are joined by a line bridging them. The vertical parts of the line are dashed, while the horizontal part is solid (to distinguish it from a disulphide bridge).
Show XML
Context domains
Context domains in Pfam are those that, despite not scoring above the family gathering threshold, are expected to be real, based on the presence of the surrounding domains found in the protein. The method is described in:
In some cases it is possible for a protein without any matches to gain context domains. This happens when two or more weak matches support each other. This is most often seen with multiple tandem repeats such as WD40 and leucine rich repeats such as LRR_1.
Within the Pfam domain graphics, the context domains are represented by rectangles that are coloured from white to pink as shown below. These images are interactive in the same manner as the Pfam-A graphics.
Show XML
Please note that context domains are generated automatically and have not been subjected to the same high level of quality control as Pfam-A domains. Therefore, context domains, although likely to be correct, should always be verified by other means.
Pfam-B
Pfam-B regions are automatically generated clusters that supplement the high quality Pfam-A regions. The mechanism for generating Pfam-B regions is detailed here. These regions are represented by a small rectangle, coloured with three stripes. As for Pfam-A regions, clicking on a Pfam-B domain takes the user to the Pfam-B summary page for that entry. Moving the mouse over the striped image will show a tooltip listing the Pfam-B identifier and its start and end points. If the Pfam-B region is long enough, its identifier will also be displayed on the image.
Show XML
Other sequence motifs
In addition to domains, smaller sequence motifs are represented by the domain graphics. Currently the following motifs are represented: signal peptides, low complexity regions, coiled-coils and transmembrane regions. These usually take lower priority than other regions that are drawn and they are therefore often obscured by, for example, a Pfam-A graphic being drawn over the top of them. An example of each motif is shown here.
Show XML
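One way to picture the lower priority of these motifs is as a painting order, where lower-priority regions are painted first and higher-priority regions are painted over them. The sketch below illustrates that idea only. The priority values, the dictionary layout and the region names are all assumptions made for the example and do not describe the actual graphics library.

```python
# Hypothetical priorities: a higher number is painted later and so ends up on top.
DRAW_PRIORITY = {"motif": 0, "pfam-a": 1}


def paint_order(regions):
    """Sort regions so that lower-priority ones are painted first."""
    return sorted(regions, key=lambda region: DRAW_PRIORITY[region["kind"]])


# Made-up regions: a Pfam-A domain that overlaps a low complexity region.
regions = [
    {"kind": "pfam-a", "name": "example_domain", "start": 30, "end": 120},
    {"kind": "motif", "name": "low_complexity", "start": 40, "end": 60},
]

for region in paint_order(regions):
    print(region["name"])
```

Here the low complexity region is painted first, so the overlapping Pfam-A graphic covers it.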
Signal peptides
Signal peptides are short regions (<60 residues long) found at the N-terminus of proteins, which direct the post-translational transport of a protein and are subsequently removed by peptidases. More specifically, a signal peptide is characterised by a short hydrophobic helix (approximately 7-15 residues). This helix is preceded by a slightly positively charged region of highly variable length (approximately 1-12 residues). Between the hydrophobic helix and the cleavage site is a somewhat polar and uncharged region of between 3 and 8 amino-acids. In Pfam, we use Phobius for the prediction of signal peptides and represent them graphically by a small orange box.
Low complexity regions
Low complexity regions are regions of biased sequence composition, usually comprised of different types of repeats. These regions have been shown to be functionally important in some proteins, but they are generally not well understood and are masked out to focus on globular domains within the protein.
In Pfam, we use SEG to calculate low complexity regions. The presence of a low complexity region is indicated by a cyan rectangle.
Coiled-coils
Coiled-coils are structural motifs in which alpha-helices wrap or wind around each other. Normally, two to three helices are involved, but cases of up to seven alpha-helices have been reported. Coiled-coils are found in a wide variety of proteins, many of them functionally very important. In Pfam, we use ncoils to identify these motifs. Coiled-coils are represented by a small lime-green rectangle.
Transmembrane regions
Integral membrane proteins contain one or more transmembrane regions, each comprised of an alpha-helix that passes through or "spans" a membrane. Transmembrane helices are quite variable in length, with an average of about 20 amino-acids. Again, Phobius is used for the prediction of transmembrane regions, which are represented by a red rectangle.
Other sequence features
Below is a demonstration of how disulphide bridges and active site residues are represented in Pfam. Each of these features can appear above or below the sequence, but in this case the disulphide bridges are shown above the sequence and the active site residues below it.
Show XML
Disulphide bridges
Disulphide bridges play a fundamental role in the folding and stability of some proteins. They are formed by covalent bonding between the thiol groups from two cysteine residues. The disulphide bridge annotations used in Pfam come from UniProt and are represented by a solid bridge-shaped line. When multiple disulphide bonds occur, the heights of the bridges are adjusted to avoid overlaps between them. Inter-protein disulphides are represented by single vertical lines. As always, moving the mouse over the "bridge graphic" shows the details

