Help Summary
Pfam 24.0 (Oct 2009, 11912 families)
Proteins are generally composed of one or more functional regions, commonly termed domains. The presence of different domains in varying combinations in different proteins gives rise to the diverse repertoire of proteins found in nature. Identifying the domains present in a protein can provide insights into the function of that protein.
The Pfam database is a large collection of protein domain families. Each family is represented by multiple sequence alignments and hidden Markov models (HMMs).
There are two levels of quality to Pfam families: Pfam-A and Pfam-B. Pfam-A entries are derived from the underlying sequence database, known as Pfamseq, which is built from the most recent release of UniProtKB at a given time-point. Each Pfam-A family consists of a curated seed alignment containing a small set of representative members of the family, profile hidden Markov models (profile HMMs) built from the seed alignment, and an automatically generated full alignment, which contains all detectable protein sequences belonging to the family, as defined by profile HMM searches of primary sequence databases.
Pfam-B families are un-annotated and of lower quality as they are generated automatically from the non-redundant clusters of the latest ADDA release. Although of lower quality, Pfam-B families can be useful for identifying functionally conserved regions when no Pfam-A entries are found.
Pfam entries are classified in one of four ways:
- Family:
- A collection of related proteins
- Domain:
- A structural unit which can be found in multiple protein contexts
- Repeat:
- A short unit which is unstable in isolation but forms a stable structure when multiple copies are present
- Motif:
- A short unit found outside globular domains
Related Pfam entries are grouped together into clans; the relationship may be defined by similarity of sequence, structure or profile-HMM.
Pfam Changes
This section details the changes that we plan to make or have made to Pfam. This includes changes to the flatfiles, MySQL database and the public website.
Latest changes to Pfam data
Changes between Pfam releases 23 and 24
Release 24.0 contains a total of 11912 families, with 1808 new families and 236 families killed since the last release. 75.15% of all proteins in Pfamseq contain a match to at least one Pfam domain. 53.18% of all residues in the sequence database fall within Pfam domains. Pfam 24.0 is based on UniProt release 15.6, a composite of Swiss-Prot release 57.6 and TrEMBL release 40.6.
Latest changes to website
Release 2.0.1 (29th October 2009)
This was a minor release to fix bugs introduced in the last major update.
- Updated documentation: around half of the documentation has been brought up to date with the changes due to HMMER3
- Reinstated RESTful services: most of the "RESTful" services have been updated. There are some schema changes and some differences in how sequence searches must be run. Please check the help pages for up to date documentation
- Domain architecture search is back: the domain architecture tab in the search page has been re-enabled
- Sequence search restrictions: sequence validation code now refuses to accept '-' as a valid sequence character
- NCBI GI numbers: the 'jump' tool now understands GI numbers again, and the NCBI sequence page has been fixed
- Other small bug fixes: there have been numerous other bug fixes throughout the site
Getting Started using Pfam
Using the "Jump to" search
Many pages in the site include a small search box, entitled "Jump to...". The "Jump to..." box allows you to go immediately to the page for any entry in the Pfam site, including Pfam families, clans and UniProt sequence entries.
The "Jump to..." search understands accessions and IDs for
most types of entry. For example, you can enter either a Pfam family
accession, e.g. PF02171, or, if you find it easier to
remember, a family ID, such as piwi. Note that the search
is case insensitive.
Because some identifiers can be ambiguous, the "Jump to..."
search may need to test several types of identifier to find
the entry that you're looking for. For example, Pfam A family IDs (e.g.
Kazal_1) and Pfam clan IDs (e.g. Kazal) aren't easily distinguished, so
if you enter kazal, the search will first look for a
family called kazal and, if it doesn't find one, will then
look for a clan called kazal. If all of the guesses fail, you'll
see an error message saying "Entry not found".
The order in which the search tries the various types of ID and accession is given below:
- Pfam A accession, e.g. PF02171
- Pfam A identifier, e.g. piwi
- Pfam B accession, e.g. PB000001
- Pfam B identifier, e.g. Pfam-B_1
- UniProt sequence accession, e.g. P00789
- UniProt sequence ID, e.g. CANX_CHICK
- NCBI "GI" number, e.g. 113594566
- NCBI secondary accession, e.g. BAF18440.1
- Pfam clan accession, e.g. CL0005
- metaseq ID, e.g. JCVI_ORF_1096665732460
- metaseq accession, e.g. JCVI_PEP_1096665732461
- Pfam clan accession, e.g. CL0005
- Pfam clan ID, e.g. Kazal
- PDB entry, e.g. 2abl
- Proteome species name, e.g. Homo sapiens
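The ordered fallback described above can be illustrated with a minimal Python sketch. It is not the site's implementation: the in-memory sets are hypothetical stand-ins for database lookups, the regular expressions are assumptions inferred from the example accessions in the list, and only a few of the identifier types are covered.

```python
import re

# Hypothetical stand-ins for database lookups (lower-cased, since the
# search is case insensitive).
FAMILY_IDS = {"piwi", "kazal_1"}
CLAN_IDS = {"kazal"}

# Checks are tried in order; the first one that matches wins.
# The accession patterns are assumptions based on the examples above.
CHECKS = [
    ("Pfam A accession", lambda q: re.fullmatch(r"pf\d{5}", q) is not None),
    ("Pfam A identifier", lambda q: q in FAMILY_IDS),
    ("Pfam B accession", lambda q: re.fullmatch(r"pb\d{6}", q) is not None),
    ("Pfam clan accession", lambda q: re.fullmatch(r"cl\d{4}", q) is not None),
    ("Pfam clan ID", lambda q: q in CLAN_IDS),
]


def guess_entry_type(query):
    """Return the first identifier type that matches, or an error message."""
    q = query.strip().lower()
    for name, matches in CHECKS:
        if matches(q):
            return name
    return "Entry not found"


print(guess_entry_type("PF02171"))
print(guess_entry_type("kazal"))
```

With these tables, "kazal" is not a family ID, so the lookup falls through to the clan check, which mirrors the behaviour described for the real search.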
Keyword search
Every page in the Pfam site includes a search box in the page header. You can use this to find Pfam A families which match a particular keyword. The search includes several different areas of the Pfam database:
- text fields in Pfam entries, e.g. family descriptions
- UniProt sequence entry description and species fields
- HEADER and TITLE fields from PDB entries
- Gene Ontology IDs and terms
- InterPro entry abstracts
Each Pfam A entry is listed only once in the results table, although it might have been found in more than one area of the database.
Searching a protein sequence against Pfam
Searching a protein sequence against the Pfam library of HMMs will enable you to find out the domain architecture of the protein. If your protein is present in the version of UniProt, NCBI Genpept or the metagenomic sequence set that we used to make the current release of Pfam, we have already calculated its domain architecture. You can access this by entering the sequence accession or ID in the 'view a sequence' box on the Pfam homepage.
If your sequence is not in the Pfam database, you could perform a single-sequence or a batch search by clicking on the 'Search' link at the top of the Pfam page.
Single protein search
If your protein is not recognised by Pfam, you will need to paste the protein sequence into the search page. We will search your sequence against our HMMs and instantly display the matches for you.
Batch search
If you have a large number of sequences to search (up to several thousand), you can use our batch upload facility. This allows you to upload a file of your sequences in FASTA format, and we will run them against our HMMs and email the results back to you, usually within 48 hours. We request that you put a maximum of 1000 sequences in each file.
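If your FASTA file holds more than 1000 sequences, you could split it before uploading. The following is a minimal Python sketch of one way to do that; the function name and the chunking approach are illustrative and not part of any Pfam tool.

```python
def split_fasta(text, max_per_chunk=1000):
    """Split FASTA text into chunks holding at most max_per_chunk records."""
    records, current = [], []
    for line in text.splitlines():
        if line.startswith(">"):
            # A header line starts a new record.
            if current:
                records.append("\n".join(current))
            current = [line]
        elif line.strip() and current:
            # Sequence line belonging to the current record.
            current.append(line)
    if current:
        records.append("\n".join(current))
    return [
        "\n".join(records[i:i + max_per_chunk]) + "\n"
        for i in range(0, len(records), max_per_chunk)
    ]


# Tiny demonstration with five made-up records and a chunk size of two.
demo = "".join(f">seq{i}\nMKV\nLLA\n" for i in range(5))
chunks = split_fasta(demo, max_per_chunk=2)
print(len(chunks))
```

Each returned chunk is itself valid FASTA text and can be written to its own file for upload.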
Local protein searches
If you have a very large number of protein searches to perform, or you do not wish to post your sequence across the web, it may be more convenient to run the Pfam searches locally using the 'pfam_scan.pl' script. To do this you will need the HMMER3 software, the Pfam HMM libraries and a couple of additional data files from the Pfam website. You will also need to download a few modules from CPAN, most notably Moose.
Full details on how to get 'pfam_scan.pl' up and running can be found on our FTP site.
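As a rough sketch, a local run could be driven from Python as shown below. The '-fasta' and '-dir' option names are assumptions here and should be checked against the documentation shipped with the script on the FTP site; the file and directory names are hypothetical.

```python
import shlex


def pfam_scan_command(fasta_path, pfam_dir, script="pfam_scan.pl"):
    """Build the argument list for a local pfam_scan.pl run.

    The option names are assumed; confirm them in the script's own help text.
    """
    return [script, "-fasta", fasta_path, "-dir", pfam_dir]


cmd = pfam_scan_command("my_proteins.fasta", "/data/pfam")
print(shlex.join(cmd))
# To actually run the search (requires HMMER3 and the Pfam data files):
#   import subprocess
#   subprocess.run(cmd, check=True)
```

Building the command as a list, rather than a single string, avoids shell quoting problems if a path contains spaces.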
Proteome analysis
Pfam pre-calculates the domain compositions and architectures for all the proteomes present in Integr8. To see the list of proteomes, click on the 'browse' link at the top of the Pfam website, and click on a letter of the alphabet in the 'proteomes' section. By clicking on a particular organism, you will be able to view the proteome page for that organism. From here you can view the domain organisation and the domain composition for that proteome.
Finding proteins with a specific set of domain combinations ('architectures')
Pfam allows you to retrieve all of the proteins with a particular domain combination (e.g. proteins containing both a CBS domain and an IMPDH domain) using the domain query tool. For a more detailed study of domain architectures you should use PfamAlyzer, a tool that is hosted by the Swedish Pfam site. PfamAlyzer allows the user to find proteins which contain a specific combination of domains, and it allows the user to specify particular species and the evolutionary distances allowed between domains.
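The idea behind a domain combination query can be sketched in a few lines of Python. This is only an illustration of the concept, not how the domain query tool or PfamAlyzer is implemented; the protein names and architectures are made up.

```python
# Hypothetical mapping of protein name -> ordered list of Pfam family IDs.
ARCHITECTURES = {
    "protein_A": ["IMPDH", "CBS", "CBS"],
    "protein_B": ["CBS", "CBS"],
    "protein_C": ["IMPDH"],
}


def proteins_with_domains(architectures, required):
    """Return names of proteins whose architecture contains every required domain."""
    required = set(required)
    return sorted(
        name
        for name, domains in architectures.items()
        if required.issubset(domains)
    )


print(proteins_with_domains(ARCHITECTURES, ["CBS", "IMPDH"]))
```

The sketch ignores domain order and copy number; a tool such as PfamAlyzer additionally lets you constrain species and the distances between domains.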
- What is Pfam ?
- What is the difference between Pfam-A and Pfam-B families?
- What is on a Pfam-A family page ?
- What is a clan ?
- What happened to the Pfam_ls and Pfam_fs files ?
- Can I search DNA against Pfam ?
- How can I search Pfam locally ?
- Why doesn't Pfam include my sequence ?
- How many accurate alignments do you have ?
- How can I submit a new domain ?
- What is iPfam ?
- Can I search my protein against Pfam ?
- What is the difference between the '-' and '.' characters in your full alignments ?
- What do the SS lines in the alignment mean ?
- You don't have domain YYYY in Pfam !
- Are there other databases which do this ?
- So which database is better ?
What is Pfam ?
Pfam is a collection of multiple sequence alignments and profile hidden Markov models (HMMs). Each Pfam HMM represents a protein family or domain. By searching a protein sequence against the Pfam library of HMMs, you can determine which domains it carries, i.e. its domain architecture. Pfam can also be used to analyse proteomes and questions of more complex domain architectures.
For each Pfam accession we have a family page, which can be accessed in several ways: from the 'View a Pfam Family' search box on the HOME page, by clicking on any graphical image of a domain, by searching for a particular family using the 'Keyword Search' box on the top right hand corner of most website pages, or by pasting the family identifier or accession into the 'JUMP TO' box that is present on most pages in the site.
What is the difference between Pfam-A and Pfam-B families ?
There are two levels of quality to Pfam families: Pfam-A and Pfam-B.
Pfam-A entries are derived from the underlying sequence database, known as Pfamseq, which is built from the most recent release of UniProtKB at a given time-point. For each Pfam-A family we build a single curated profile hidden Markov model (profile HMM) from the seed alignment (a small set of representative members of the family) using the HMMER3 software, and search this against Pfamseq to provide an automatically generated full alignment. All sequences that score above the cut-off threshold value determined for that family are included in the full alignment, which should then contain all detectable protein sequences belonging to that family.
We also search our Pfam-A HMMs against NCBI Genpept and a set of metagenomic sequences, and these alignments are available from the 'Alignments' tab of the Pfam-A family page. As the seed alignments have been manually checked for quality by a Pfam curator, Pfam-A matches are very unlikely to be false matches. Pfam-A families also carry a summary annotation and links to other databases.
To complement the Pfam-A families, we automatically generate Pfam-B families using the ADDA database. Pfam-B families have no associated annotation or literature reference and are of much lower quality than Pfam-A families, as their alignments have not been manually checked by a Pfam curator. Pfam-B families are formed by taking alignments of sequence segments from ADDA and removing any Pfam-A residues from them. Some Pfam-B families are composed of low complexity regions and may not reflect true relationships, and we therefore recommend that you verify that sequences in a Pfam-B family are related by using other methods, such as BLAST.
In Pfam 24.0, we have built HMMs for the first (and largest) 20,000 Pfam-B families. Using the Pfam website, users are able to perform a single-sequence or batch search against both the Pfam-A and Pfam-B HMMs.
All families in Pfam are non-overlapping, such that no amino acid belongs to more than one family/domain. At each Pfam release we search all our models against an updated version of UniProt and NCBI Genpept, and regenerate our Pfam-B families using the most recent version of ADDA.
What is on a Pfam-A family page ?
From the family page you can view the Pfam annotation for a family. We also provide access to many other sources of information, including annotation from the InterPro database, where available, cross-links to other databases and other tools for protein analysis.
Via the tabs on the left-hand side of the page, you can view:
- the domain architectures in which this family is found
- the alignments for the family in various formats, including alignments of matches to the NCBI and metagenomic sets, as well as in 'heat-map' format. All alignments can be downloaded
- the phylogenetic and species distribution trees, the latter being interactive
- the HMM logo
- the structural information for each family where available
What is a clan ?
Some of the Pfam families are grouped into clans. Pfam defines a clan as a collection of families that have arisen from a single evolutionary origin. Evidence of their evolutionary relationship can be in the form of similarity in tertiary structures, or, when structures are not available, from common sequence motifs. The seed alignments for all families within a clan are aligned and the resulting alignment (called the clan alignment) can be accessed from a link on the clan page. Each clan page includes a clan alignment, a description of the clan and database links, where appropriate. The clan pages can be accessed by following a link from the family page, or alternatively they can be accessed by clicking on 'clans' in the 'browse' menu at the top of any Pfam page.
What happened to the Pfam_ls and Pfam_fs files?
In the past, each Pfam family was represented by two profile hidden Markov models (HMMs). One of these could match partially to a family and was called local or fs mode; the other required a sequence to match the whole length of the HMM and was called glocal or ls mode. With HMMER2, we found that the combination of the two models gave us the most sensitive searches. However, HMMER3 models are only available for searching in local (fs) mode. Because of the improvements in HMMER3, this single model is as sensitive as the two combined HMMER2 models. This means that we no longer provide two HMM libraries called 'HMM_ls' and 'HMM_fs'. Instead, a single library is available called 'Pfam-A.hmm'.
Can I search DNA against Pfam ?
The Wise2 software package allows the comparison of protein HMMs to genomic DNA. We use this package to allow users to search single DNA sequences against the library of Pfam HMMs. Paste your DNA sequence into the DNA search box on the search page. The results take approximately 2 minutes for a 1 kb sequence, and approximately 1 hour for an 80 kb sequence.
How can I search Pfam locally ?
If you have a large number of sequences or you don't want to post your sequence across the web, you can search your sequence locally using the 'pfam_scan.pl' script.
In terms of HMMs and formats, Pfam is based around the HMMER3 package. This will need to be installed on your local machine. You will also need to download the Pfam HMM libraries from the FTP site, as well as a few modules from CPAN, most notably Moose.
Full details on how to get 'pfam_scan.pl' up and running can be found on our FTP site.
Why doesn't Pfam include my sequence ?
Pfam is built from a fixed release of UniProt. At each Pfam release we incorporate sequences from the latest release of UniProt. This means that, at any time, the sequences used by Pfam might be several months behind those in the most up-to-date versions of the sequence databases. If your sequence isn't in Pfam, you can still find out what domains it contains by pasting it into the sequence search box on the search page.
How many accurate alignments do you have ?
Release 24.0 has 11912 families. Over 73.7% of the proteins in SWISSPROT 57.6 and TrEMBL 40.6 have at least one match to a Pfam-A family.
How can I submit a new domain ?
If you know of a domain that is not present in Pfam, you can
submit it to us by email
(pfam-help@sanger.ac.uk) and we will endeavour to build a Pfam entry
for it. We ask that you supply us with a multiple sequence
alignment of the domain (please send the alignment file as a
text file (e.g. .txt) and not in the format of
a specific application such as Microsoft Word (e.g. a .doc)
file), and associated literature evidence if available.
What is iPfam ?
iPfam is a resource that describes domain-domain interactions that are observed in PDB entries. Where two or more Pfam domains occur in a single structure, it analyses them to see if they are close enough to form an interaction. If they are close enough, it calculates the bonds forming the interaction. Further information can be found on the iPfam help pages.
Can I search my protein against Pfam ?
Of course! Please use this search form.
What is the difference between the - and . characters in your full alignments ?
The '-' and '.' characters both represent gap characters. However, they do give you some extra information about how the HMM has generated the alignment. The '-' symbols are where the alignment of the sequence has used a delete state in the HMM to jump past a match state. This means that the sequence is missing a column that the HMM was expecting to be there. The '.' character is used to pad gaps where one sequence in the alignment has sequence from the HMM's insert state. See the alignment below where both characters are used. The HMM states emitting each column are shown. Note that residues emitted from the Insert (I) state are in lower case.
FLPA_METMA/1-193   ---MPEIRQLSEGIFEVTKD.KKQLSTLNLDPGKVVYGEKLISVEGDE
FBRL_XENLA/86-317  RKVIVEPHR-HEGIFICRGK.EDALVTKNLVPGESVYGEKRISVEDGE
FBRL_MOUSE/90-321  KNVMVEPHR-HEGVFICRGK.EDALFTKNLVPGESVYGEKRVSISEGD
O75259/81-312      KNVMVEPHR-HEGVFICRGK.EDALVTKNLVPGESVYGEKRVSISEGD
FBRL_SCHPO/71-303  AKVIIEPHR-HAGVFIARGK.EDLLVTRNLVPGESVYNEKRISVDSPD
O15647/71-301      GKVIVVPHR-FPGVYLLKGK.SDILVTKNLVPGESVYGEKRYEVMTED
FBRL_TETTH/64-294  KTIIVK-HR-LEGVFICKGQ.LEALVTKNFFPGESVYNEKRMSVEENG
FBRL_LEIMA/57-291  AKVIVEPHMLHPGVFISKAK.TDSLCTLNMVPGISVYGEKRIELGATQ
Q9ZSE3/38-276      SAVVVEPHKVHAGIFVSRGKsEDSLATLNLVPGVSVYGEKRVQTETTD
HMM STATES         MMMMMMMMMMMMMMMMMMMMIMMMMMMMMMMMMMMMMMMMMMMMMMMM
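The gap conventions can be checked programmatically. The Python sketch below is an illustration only, not code used by Pfam. It infers the state of each column from two of the aligned rows above, treating a column as an insert (I) column if any row has a lower-case residue or a '.' there, and as a match (M) column otherwise.

```python
def column_states(rows):
    """Infer an M/I label for every column of a set of aligned rows."""
    width = len(rows[0])
    if any(len(row) != width for row in rows):
        raise ValueError("all aligned rows must have the same length")
    states = []
    for col in range(width):
        chars = [row[col] for row in rows]
        # Lower-case residues and '.' padding only occur in insert columns.
        is_insert = any(c == "." or c.islower() for c in chars)
        states.append("I" if is_insert else "M")
    return "".join(states)


rows = [
    "RKVIVEPHR-HEGIFICRGK.EDALVTKNLVPGESVYGEKRISVEDGE",  # FBRL_XENLA/86-317
    "SAVVVEPHKVHAGIFVSRGKsEDSLATLNLVPGVSVYGEKRVQTETTD",  # Q9ZSE3/38-276
]
states = column_states(rows)
print(states)
print(rows[0].count("-"), "delete state(s) used by the first row")
```

Here the first row contains one '-', meaning it skipped one match state, while the '.' in the same row only pads the column where Q9ZSE3 has an inserted 's'.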
What do the SS lines in the alignment mean ?
These lines are structural information. The SS stands for secondary structure, and this is taken from DSSP. The following list gives the definitions for each code letter:
- C
- Random Coil
- H
- Alpha-helix
- G
- 3(10) helix
- I
- Pi-helix
- E
- Hydrogen bonded beta-strand (extended strand)
- B
- Residue in isolated beta-bridge
- T
- H-bonded turn (3-turn, 4-turn, or 5-turn)
- S
- Bend (five-residue bend centered at residue i)
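As an illustration, the codes above can be turned into a small lookup table for summarising an SS line. This is a minimal Python sketch; the example SS string is made up, and any character not in the table (such as a gap) is simply ignored.

```python
# DSSP code letters as listed above.
DSSP_CODES = {
    "C": "Random coil",
    "H": "Alpha-helix",
    "G": "3(10) helix",
    "I": "Pi-helix",
    "E": "Hydrogen bonded beta-strand (extended strand)",
    "B": "Residue in isolated beta-bridge",
    "T": "H-bonded turn",
    "S": "Bend",
}


def summarise_ss(ss_line):
    """Count residues per secondary structure class in an SS string."""
    counts = {}
    for code in ss_line:
        if code in DSSP_CODES:
            name = DSSP_CODES[code]
            counts[name] = counts.get(name, 0) + 1
    return counts


# Hypothetical SS string, for demonstration only.
print(summarise_ss("HHHH..EEC"))
```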
You don't have domain YYYY in Pfam !
We are very keen to be alerted to new domains. If you can provide us with a multiple alignment then we will try hard to incorporate it into the database. If you know of a domain but don't have a multiple alignment, we still want to know; for simple families just one sequence is enough. Again, e-mail pfam-help@sanger.ac.uk.
Are there other databases which do this ?
To a certain extent yes, there are a number of "second generation" databases which are trying to organise protein space into evolutionarily conserved regions. Examples include:
- PROSITE
- This originally was based around regular expression patterns but now also includes profiles.
- PRINTS
- This is based around protein "finger-prints" of a series of small conserved motifs making up a domain.
- BLOCKS
- This is based around automatic ungapped alignments.
- SMART
- This is a database concentrating on extracellular modules and signaling domains.
- ADDA
- This is an automatic algorithm for domain decomposition and clustering of protein domain families.
- InterPro
- Combines information from Pfam, Prints, SMART, Prosite and PRODOM.
- CDD
- The Conserved Domain Database is derived from Pfam and SMART databases.
So which database is better ?
As with everything, it depends on your problem: we would certainly suggest using more than one method. Pfam is likely to provide more interpretable results, with crisp definitions of domains in a protein.
Glossary of terms used in Pfam
These are some of the commonly used terms in the Pfam website.
Alignment coordinates
HMMER3 reports two sets of domain coordinates for each profile HMM match. The envelope coordinates delineate the region on the sequence where the match has been probabilistically determined to lie, whereas the alignment coordinates delineate the region over which HMMER is confident that the alignment of the sequence to the profile HMM is correct. Our full alignments contain the alignment coordinates from HMMER3.
Architecture
The collection of domains that are present on a protein.
Clan
A collection of related Pfam entries. The relationship may be defined by similarity of sequence, structure or profile-HMM.
Domain
A structural unit which can be found in multiple protein contexts.
Domain score
The score of a single domain aligned to an HMM. Note that, for HMMER2, if there was more than one domain, the sequence score was the sum of all the domain scores for that Pfam entry. This is not quite true for HMMER3.
DUF
Domain of unknown function.
Envelope coordinates
See Alignment coordinates.
Family
A collection of related proteins.
Full alignment
An alignment of the set of related sequences which score higher than the manually set threshold values for the HMMs of a particular Pfam entry.
Gathering threshold (GA)
Also called the gathering cut-off, this value is the search threshold used to build the full alignment. The GA is the minimum score a sequence must attain in order to belong to the full alignment of a Pfam entry. For each Pfam HMM we have two GA cutoff values, a sequence cutoff and a domain cutoff.
HMMER
The suite of programs that Pfam uses to build and search HMMs. For Pfam release 24.0 we have used HMMER version 3 to make Pfam. See the HMMER site.
Hidden Markov model (HMM)
An HMM is a probabilistic model. In Pfam we use HMMs to transform the information contained within a multiple sequence alignment into a position-specific scoring system. We search our HMMs against the UniProt protein database to find homologous sequences.
HMMER2
The suite of programs that Pfam uses to build and search HMMs. See the HMMER site.
iPfam
A resource that describes domain-domain interactions that are observed in PDB entries. Where two or more Pfam domains occur in a single structure, it analyses them to see if they are close enough to form an interaction. If they are close enough, it calculates the bonds forming the interaction.
Metaseq
A collection of sequences derived from various metagenomics datasets.
Motif
A short unit found outside globular domains.
Noise cutoff (NC)
The bit scores of the highest scoring match not in the full alignment.
Pfam-A
An HMM-based, hand-curated Pfam entry which is built using a small number of representative sequences. We manually set a threshold value for each HMM and search our models against the UniProt database. All of the sequences which score above the threshold for a Pfam entry are included in the entry's full alignment.
Pfam-B
An automatically generated alignment which is formed by taking a cluster of sequences from the ADDA database and removing Pfam-A residues from them. Since Pfam-B families are automatically generated, we recommend that you verify that the sequences in a Pfam-B family are related, using other methods such as BLAST. For Pfam 24.0, we have made HMMs for the first (and therefore largest) 20,000 Pfam-B families. Users can search their sequences against the Pfam-B HMMs in addition to the Pfam-A HMMs when performing both single-sequence searches and batch searches on the website.
Posterior probability
HMMER3 reports a posterior probability for each residue that matches a 'match' or 'insert' state in the profile HMM. A high posterior probability shows that the alignment of the amino acid to the match/insert state is likely to be correct, whereas a low posterior probability indicates that there is alignment uncertainty. This is indicated on a scale with '*' being 10, the highest certainty, down to 1 being complete uncertainty. Within Pfam we display this information as a heat map view, where green residues indicate high posterior probability, and red ones indicate a lower posterior probability.
Repeat
A short unit which is unstable in isolation but forms a stable structure when multiple copies are present.
Seed alignment
An alignment of a set of representative sequences for a Pfam entry. We use this alignment to construct the HMMs for the Pfam entry.
Sequence score
The total score of a sequence aligned to an HMM. If there is more than one domain, the sequence score is the sum of all the domain scores for that Pfam entry. If there is only a single domain, the sequence score and the domain score for the protein will be identical. We use the sequence score to determine whether a sequence belongs to the full alignment of a particular Pfam entry.
Trusted cutoff (TC)
The bit scores of the lowest scoring match in the full alignment.
Help With Pfam HMM scores
Documentation update
October 2009
The documentation in this tab is currently out of date. Although the general information is still largely accurate, the details of the site and underlying database may be inaccurate. We hope to update the documentation within the coming weeks.
What Pfam HMM scores mean
Pfam-A is based around hidden Markov model (HMM) searches, as provided by the HMMER2 package. In HMMER2, like BLAST, E-values (expectation values) are calculated. The E-value is the number of hits that would be expected to have a score equal to or better than this by chance alone. A good E-value is much less than 1. Around 1 is what we expect just by chance. In principle, all you need to decide on the significance of a match is the E-value.
However, there are a few complications.
The most serious complication is that there are no analytical results available for accurately determining E-values for gapped alignments, especially profile HMM alignments. HMMER uses empirical methods to estimate E-values. These methods are generally rather accurate. However, when in doubt, HMMER tends to err on the conservative side.
We use a second, and even more empirical, system in maintaining Pfam models. This system is implemented in the Pfam database rather than in the HMMER software. For each Pfam family, we record a "trusted cutoff" and a "noise cutoff", TC1 and NC1. TC1 is the lowest score for sequences we included in the family (e.g. in the Full alignment). NC1 is the highest score for sequences we did not include in the Full alignment. (Since Full alignments are produced automatically, the trusted sequence cutoff is always greater than the noise sequence cutoff.)
Therefore, we can consider a hit very significant if it scores better than the trusted cutoff, better than the noise cutoff, and has a significant E-value. Sometimes sequences score better than the cutoffs though they don't have significant E-values; these are marginal hits that we've chosen to include in the family.
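The decision rule in the previous paragraph can be written down as a short Python sketch. It is an illustration only: the function name, the category labels and the default E-value limit of 0.01 are assumptions (the text above says only that a good E-value is much less than 1), and treating a score equal to the trusted cutoff as passing is also an assumption.

```python
def classify_hit(score, evalue, trusted_cutoff, noise_cutoff, max_evalue=0.01):
    """Classify a hit using the family's cutoffs and its E-value.

    max_evalue is a hypothetical limit for a "significant" E-value.
    """
    above_cutoffs = score >= trusted_cutoff and score > noise_cutoff
    if not above_cutoffs:
        return "below cutoffs"
    if evalue <= max_evalue:
        return "very significant"
    # Scores above the cutoffs but with a weak E-value are the marginal
    # hits that have been chosen for inclusion in the family.
    return "marginal"


print(classify_hit(score=30.0, evalue=1e-5, trusted_cutoff=25.0, noise_cutoff=24.0))
print(classify_hit(score=26.0, evalue=0.5, trusted_cutoff=25.0, noise_cutoff=24.0))
```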
Sequence versus domain scores
There's one additional wrinkle in the scoring scheme. HMMER2 calculates two kinds of scores. The "sequence classification score" is the total score of a sequence aligned to a model; if there is more than one domain, the sequence score is the sum of all the domain scores (finding multiple domains increases our confidence that the sequence belongs to that protein family, even if each domain individually is a weak match). The "domain score" is a score for a single domain (these two scores are identical for single-domain proteins).
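A small worked example may help. The Python sketch below follows the HMMER2 behaviour described above, where the sequence score is the sum of the domain scores; as the glossary notes, this is not quite true for HMMER3. All scores and the threshold are made-up numbers.

```python
def sequence_score(domain_scores):
    """HMMER2-style sequence score: the sum of the per-domain bit scores."""
    return sum(domain_scores)


# Two weak matches to the same model (hypothetical bit scores).
domains = [12.5, 14.0]
threshold = 25.0  # hypothetical sequence cutoff

total = sequence_score(domains)
print(total)
print(all(score < threshold for score in domains))  # each domain alone is weak
print(total >= threshold)                           # together they pass
```

Neither domain reaches the threshold on its own, but the summed sequence score does, which is why finding multiple domains increases confidence in the match.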
How to link to Pfam?
Pfam is maintained by a consortium of researchers based at the Wellcome Trust Sanger Institute, Cambridge, UK (WTSI), Stockholm Bioinformatics Center, Stockholm, Sweden (SBC), and Janelia Farm, Virginia, USA. All three sites run the same Pfam website and linking to different sites only requires that you change the site name, not the parameters in the URL.
Although we have no plans to change the locations of resources within this site dramatically, webmasters are advised to link only to the following types of page within the site.
Home pages
- WTSI:
- http://pfam.sanger.ac.uk/
- SBC:
- http://pfam.sbc.su.se/
- Janelia:
- http://pfam.janelia.org/
Searching a protein sequence against Pfam
- WTSI:
- http://pfam.sanger.ac.uk/search?tab=sequenceSearchBlock
- SBC:
- http://pfam.sbc.su.se/search?tab=sequenceSearchBlock
- Janelia:
- http://pfam.janelia.org/search?tab=sequenceSearchBlock
Searching a DNA sequence against Pfam
- WTSI:
- http://pfam.sanger.ac.uk/search?tab=sequenceDnaBlock
- SBC:
- http://pfam.sbc.su.se/search?tab=sequenceDnaBlock
- Janelia:
- http://pfam.janelia.org/search?tab=sequenceDnaBlock
Linking to Pfam family pages
You can refer to Pfam families either by accession or ID. You can also refer to a family by "entry", although this is a convenience that should be used only if you're not sure if what you have is an accession or an ID.
Pfam accession numbers are more stable between releases than IDs and we strongly recommend that you link by accession number.
Here are some examples of linking to Pfam at WTSI:
- By accession:
- http://pfam.sanger.ac.uk/family?acc=PF00002
- By ID:
- http://pfam.sanger.ac.uk/family?id=7tm_2
- Using "entry":
- http://pfam.sanger.ac.uk/family?entry=PF00002 or http://pfam.sanger.ac.uk/family?entry=7tm_2
- Directly:
- http://pfam.sanger.ac.uk/family/PF00002 or http://pfam.sanger.ac.uk/family/7tm_2
You can link to Pfam family data at the other sites by changing "pfam.sanger.ac.uk" to "pfam.sbc.su.se" or "pfam.janelia.org".
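If you generate links programmatically, a helper along the following lines keeps the site name separate from the rest of the URL. This is a minimal Python sketch; the function and dictionary names are illustrative, while the host names and the 'acc' parameter are those listed above.

```python
from urllib.parse import urlencode

# Host names of the three Pfam sites, as listed above.
SITES = {
    "WTSI": "pfam.sanger.ac.uk",
    "SBC": "pfam.sbc.su.se",
    "Janelia": "pfam.janelia.org",
}


def family_url(accession, site="WTSI"):
    """Build a link to a Pfam family page by accession (the recommended form)."""
    query = urlencode({"acc": accession})
    return f"http://{SITES[site]}/family?{query}"


print(family_url("PF00002"))
print(family_url("PF00002", site="SBC"))
```

Only the host changes between sites; the path and parameters stay the same.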
Linking to protein sequence pages
As for Pfam family pages, you can refer to protein sequence pages by accession, ID or entry. Protein IDs are unstable and do change between releases, so, again, we strongly recommend that you use protein accessions where possible.
Here are some examples of linking to protein sequence pages at WTSI:
- By accession:
- http://pfam.sanger.ac.uk/protein?acc=P15498
- By ID:
- http://pfam.sanger.ac.uk/protein?id=VAV_HUMAN
- Using "entry":
- http://pfam.sanger.ac.uk/protein?entry=P15498 or http://pfam.sanger.ac.uk/protein?entry=VAV_HUMAN
- Directly:
- http://pfam.sanger.ac.uk/protein/P15498 or http://pfam.sanger.ac.uk/protein/VAV_HUMAN
Again, to generate links to the other Pfam sites, change "pfam.sanger.ac.uk" to "pfam.sbc.su.se" or "pfam.janelia.org".
Linking to the "jump to" search
The Pfam website features a search tool that tries to guess the type of any accession or ID that it is given. For example, if given "VAV_HUMAN", the search returns the URL for the protein sequence page for the VAV_HUMAN entry. If given "1w9h", the search returns the URL for the PDB entry (structure) 1w9h.
You can use the "jump to" search if you need to link to Pfam but
can't be sure what type of accession or ID you will be using in your link.
By default, the search returns the URL that it has found, as a simple,
plain text HTTP response. Adding the parameter redirect=1
will make the "jump to" tool redirect to the URL that it finds
or, if it couldn't find an appropriate URL, to the Pfam homepage.
- Return URL:
- http://pfam.sanger.ac.uk/search/jump?entry=P15498
- Redirect:
- http://pfam.sanger.ac.uk/search/jump?entry=P15498&redirect=1
Note that, although it may be convenient to link to Pfam using this search tool, there is no error reporting for your users if the search fails to find an appropriate URL in the Pfam site. It is much safer to link directly to the correct section of the site. Please contact us if you need help with building specific links.
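For completeness, here is a minimal Python sketch that builds "jump to" links with and without the redirect parameter. The function name is illustrative; the path and parameter names are taken from the examples above, and the sketch only constructs the URL without contacting the site.

```python
from urllib.parse import urlencode


def jump_url(entry, redirect=False, host="pfam.sanger.ac.uk"):
    """Build a "jump to" search URL for any accession or ID."""
    params = {"entry": entry}
    if redirect:
        # Ask the tool to redirect to the page it finds instead of
        # returning the URL as plain text.
        params["redirect"] = 1
    return f"http://{host}/search/jump?{urlencode(params)}"


print(jump_url("P15498"))
print(jump_url("P15498", redirect=True))
```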
Documentation update
October 2009
The documentation in this tab is currently out of date. Although the general information is still largely accurate, the details of the site and underlying database may be inaccurate. We hope to update the documentation within the coming weeks.
One of the features provided by the Pfam website is a graphical representation of the features found within a sequence, termed domain graphics. There are a variety of different shapes and styles and each one has a particular meaning. This page gives an in-depth description of the elements of Pfam domain graphics.
The library which generates the images in this page and throughout the Pfam site uses an XML language to describe the domain graphic that is required. Each of the example graphics in this page is followed by a link that can be used to show the XML that produced it.
We provide a set of tools, described in the Tools & Web Services section of the help pages, that allow you to generate custom domain graphics by uploading your own XML file, or to generate graphics for a specific UniProt sequence, given the UniProt accession or ID.
The sequence
The base sequence, undecorated by any domains or features, is represented by a plain grey bar:
The length of the domain graphic that is drawn is proportional to the length of the sequence itself. The graphics in this page are drawn with an X-scale of 0.5 pixels per amino-acid, so that a 200 residue sequence will result in a 100 pixel-wide image. Any domains or features which are drawn on the sequence are also scaled by the same factor.
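The scaling rule can be written out as a short Python sketch. The function names are hypothetical, and rounding to the nearest whole pixel is an assumption; the text only states the scale factor, not how the graphics library rounds.

```python
X_SCALE = 0.5  # pixels per amino acid, the value used for the graphics on this page


def to_pixels(residues, x_scale=X_SCALE):
    """Convert a length or position in residues to pixels.

    Rounding to the nearest pixel is an assumption of this sketch.
    """
    return int(round(residues * x_scale))


def scale_region(start, end, x_scale=X_SCALE):
    """Scale the start and end residues of a domain or feature by the same factor."""
    return to_pixels(start, x_scale), to_pixels(end, x_scale)


if __name__ == "__main__":
    print(to_pixels(200))         # image width for a 200 residue sequence
    print(scale_region(50, 150))  # pixel positions of a domain at residues 50 to 150
```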
Pfam-A
The high quality, curated Pfam-A domains are classified into one of four different types: family, domain, repeat and motif (more details). These different classification types are rendered slightly differently.
Family/domain
It is possible for a sequence to match either the full length of a Pfam HMM (a full length match), or to match a portion of an HMM (a fragment match). The two types of match are rendered differently.
Both family and domain entries are rendered as rectangles with curved ends when the sequence is a full length match. The curves at the ends become less pronounced when the domains are short, as shown in the second domain below. Different types of domain are displayed with different colours. When the domain image is long enough, the domain name is shown within the domain itself. In most cases, you can click on the domains to visit the "family page" for that domain. Moving the mouse over the domain image should also display a tooltip showing the domain name, as well as the start and end positions of the domain.
When the sequence does not match the full length of the HMM that models a Pfam entry, matching domain fragments are shown. When a sequence match does not pass through the first position in the HMM, the N-terminal side of the domain graphic is drawn with a jagged edge instead of a curved edge. Similarly, when a sequence match does not pass through the last position of the HMM, the C-terminal side of the domain graphic is drawn with a jagged edge. In some rarer cases, the sequence match may not pass through either of the first or last positions of the HMM, in which case both sides are drawn with jagged edges. Examples of all three cases are shown here:
Repeat/motif
Repeats and motifs are types of Pfam domain which do not form independently folded units. In order to distinguish them from domains of type family and domain, repeats and motifs are represented by rectangles with straight edges. As for families and domains, partial matches are represented with jagged edges.
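The rules for the end shapes of Pfam-A graphics can be summarised in a few lines of Python. This is a sketch of the rules as described in the text, not the code used by the Pfam graphics library; the function and argument names are hypothetical, and HMM positions are assumed to be numbered from 1.

```python
def edge_styles(entry_type, model_start, model_end, model_length):
    """Return (n_terminal_edge, c_terminal_edge) for a Pfam-A match.

    model_start and model_end are the first and last HMM positions covered
    by the match (assumed 1-based); model_length is the length of the HMM.
    """
    if entry_type in ("family", "domain"):
        full_edge = "curved"
    elif entry_type in ("repeat", "motif"):
        full_edge = "straight"
    else:
        raise ValueError("unknown Pfam-A type: %s" % entry_type)
    # An end is jagged when the match does not reach that end of the HMM.
    n_terminal = full_edge if model_start == 1 else "jagged"
    c_terminal = full_edge if model_end == model_length else "jagged"
    return n_terminal, c_terminal


if __name__ == "__main__":
    print(edge_styles("domain", 1, 120, 120))   # full length match
    print(edge_styles("family", 10, 120, 120))  # N-terminal fragment
    print(edge_styles("repeat", 5, 30, 40))     # fragment at both ends
```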
Discontinuous nested domains
Some domains in Pfam are disrupted by the insertion of another domain (or domains) within them. A number of names have been given to this arrangement: discontinuous (referring to the outer domain), inserted or nested (both referring to the inner domain). For example, in many sequences containing an IMPDH domain, the IMPDH domain is continuous along the primary sequence. However, in some cases the linear sequence of the IMPDH domain is broken by the insertion of a CBS domain, as shown below.
Where three-dimensional structures are available for representatives of a Pfam domain, it is generally clear that the three-dimensional arrangement of the domain containing the nested domain is maintained. Typically the nested domain is found inserted within a surface exposed loop, having little or no effect on the structure of the other domain. Such an arrangement explains why and how these nested domains can be functionally tolerated.
To represent this arrangement of domains graphically, the discontinuous domain is represented in two parts (as shown below). These two parts are joined by a line bridging them. The vertical parts of the line are dashed, while the horizontal line is solid (to distinguish it from a disulphide bridge).
Context domains
Context domains in Pfam are those that, despite not scoring above the family gathering threshold, are expected to be real, based on the presence of the surrounding domains found in the protein. The method is described in:
In some cases it is possible for a protein without any matches to gain context domains. This happens when two or more weak matches support each other. This is most often seen with multiple tandem repeats such as WD40 and leucine rich repeats such as LRR_1.
Within the Pfam domain graphics, the context domains are represented by rectangles that are coloured from white to pink as shown below. These images are interactive in the same manner as the Pfam-A graphics.
Please note that context domains are generated automatically and have not been subjected to the same high level of quality control as Pfam-A domains. Therefore, context domains, although likely to be correct, should always be verified by other means.
Pfam-B
Pfam-B regions are automatically generated clusters that supplement the high quality Pfam-A regions. The mechanism for generating Pfam-B regions is detailed here. These regions are represented by a small rectangle, coloured with three stripes. As for Pfam-A regions, clicking on a Pfam-B domain takes the user to the Pfam-B summary page for that entry. Moving the mouse over the striped image will show a tooltip listing the Pfam-B identifier and its start and end points. If the Pfam-B region is long enough, its identifier will also be displayed on the image.
Other sequence motifs
In addition to domains, smaller sequence motifs are represented by the domain graphics. Currently the following motifs are represented: signal peptides, low complexity regions, coiled-coils and transmembrane regions. These usually take lower priority than other regions that are drawn and they are therefore often obscured by, for example, a Pfam-A graphic being drawn over the top of them. An example of each motif is shown here.
Signal peptides
Signal peptides are short regions (<60 residues long) found at the N-terminus of proteins, which direct the post-translational transport of a protein and are subsequently removed by peptidases. More specifically, a signal peptide is characterised by a short hydrophobic helix (approximately 7-15 residues). This helix is preceded by a slightly positively charged region of highly variable length (approximately 1-12 residues). Between the hydrophobic helix and the cleavage site is a somewhat polar and uncharged region, of between 3 and 8 amino-acids. In Pfam, we use Phobius for the prediction of signal peptides and represent them graphically by a small orange box.
Low complexity regions
Low complexity regions are regions of biased sequence composition, usually comprised of different types of repeats. These regions have been shown to be functionally important in some proteins, but they are generally not well understood and are masked out to focus on globular domains within the protein.
Within Pfam, we use SEG to calculate low complexity regions. The presence of a low complexity region is indicated by a cyan rectangle.
Coiled-coils
Coiled coils are motifs found in proteins that structurally form alpha-helices that wrap or wind around each other. Normally, two to three helices are involved, but cases of up to seven alpha-helices have been reported. Coiled-coils are found in a wide variety of proteins, many functionally very important. In Pfam we use ncoils to identify these motifs. Coiled-coils are represented by a small lime-green rectangle.
Transmembrane regions
Integral membrane proteins contain one or more transmembrane regions that are comprised of an alpha-helix that passes through or "spans" a membrane. Transmembrane helices are quite variable in length, with the average being about 20 amino-acids in length. Again, Phobius is used for the prediction of transmembrane regions, which are represented by a red rectangle.
Other Sequence features
Below is a demonstration of how disulphide bridges and active site residues are represented in Pfam. Each of these features can appear above or below the sequence, but in this case the disulphide bridges are shown above the sequence and the active site residues below the line.
Disulphide bridges
Disulphide bridges play a fundamental role in the folding and stability of some proteins. They are formed by covalent bonding between the thiol groups from two cysteine residues. The disulphide bridge annotations used in Pfam come from UniProt and are represented by a solid bridge-shaped line. When multiple disulphide bonds occur, the heights of the bridges are adjusted to avoid overlaps between them. Inter-protein disulphides are represented by single vertical lines. As always, moving the mouse over the "bridge graphic" shows the details of the bond in a tooltip.
Active site residues
Within an enzyme, a small number of residues are directly involved in catalysis of a reaction. These are termed active site residues. Within Pfam there are three categories of active site: those that are experimentally determined, those that are predicted by UniProt and those predicted by Pfam. All three types are represented by a "lollipop" with a diamond head. The head is coloured red, pink and purple for each of the three types respectively.
Pfam-predicted active sites are determined by using the experimental data and transferring these annotations through a Pfam alignment.
Other features
In addition to the drawing features outlined above, the Pfam domain graphics library includes some additional, general purpose representation styles.
Arrows
Arrows can be drawn perpendicular to the sequence, and can point either towards or away from the sequence line. They can be drawn with different vertical line styles (solid, dashed or bold) and can be placed above or below the sequence. The example below shows the different arrow styles that are available:
Additional "lollipop" styles
A wide range of different lollipop styles can be created by combining different line and head colours with different drawing styles. For example, a lollipop can be drawn with either bold (solid) or dashed lines. The lollipop head can be drawn as a square, circle or diamond.
Guide to Pfam tools and services
Documentation update
October 2009
The documentation in this tab is currently out of date. Although the general information is still largely accurate, the details of the site and underlying database may be inaccurate. We hope to update the documentation within the coming weeks.
Tools
Producing your own graphics
Because we are regularly asked to produce domain graphics for use in publications, we provide a tool that lets users upload a "domain graphics" XML file. The file is validated against the schema and then rendered. The images that the tool produces can then be saved for your own use.
If there is an existing sequence in Pfam that you wish to alter/elaborate then the XML used by Pfam for this sequence can also be obtained using this tool.
You can see a detailed description of the XML language that describes Pfam domain images in the Guide to Graphics section of the help pages.
There is a similar tool which allows you to see the domain graphic for a given UniProt entry.
Web services
In the past, Pfam has provided a set of SOAP-based web services, designed to allow programmatic access to Pfam data. These services were built as a stand-alone service, entirely separate from the old Pfam website. As such they are somewhat difficult to maintain and poorly integrated with the new Pfam website.
With the latest website release, we have added a new type of programmatic interface to Pfam services, in the form of a set of "RESTful" services. You can see documentation for these new services here.
Because of the problems of maintaining the SOAP-based web-services, we are now phasing them out, in favour of the RESTful interface. We would strongly encourage developers to switch to the new services. If you have questions or comments about this switch, please contact us at the email address at the bottom of the page.
This is an introduction to the "RESTful" interface to the Pfam website. REST (or Representational State Transfer) refers to a style of building websites which makes it easy to interact programmatically with the services provided by the site. A programmatic interface, commonly called an Application Programming Interface (API), allows users to write scripts or programs to access data, rather than having to rely on a browser to view a site.
Basic concepts
URLs
A RESTful service typically sends and receives data over HTTP, the same protocol that's used by websites and browsers. As such, the services provided through a RESTful interface are identified using URLs.
In the Pfam website we use the same basic URL to provide both the standard HTML representation of Pfam data and the alternative XML representation. To see the data for a particular Pfam-A family, you would visit the following URL in your browser:
http://pfam.janelia.org/family/Piwi
To retrieve the data in XML format, just add an extra parameter,
output=xml, to the URL:
http://pfam.janelia.org/family/Piwi?output=xml
The response from the server will now be an XML document, rather than an HTML page.
Sending requests
Although you can use a browser to retrieve family data in XML format,
it's most useful to send requests and retrieve XML programmatically.
The simplest way to do this is using a Unix command line tool such as
curl:
Most programming languages have the ability to send HTTP requests and receive HTTP responses. A perl script to retrieve data about a Pfam family might be as trivial as this:
Retrieving data
Although XML is just plain text and therefore human-readable, it's intended to be parsed into a data structure. Extending the perl script above, we can add the ability to parse the XML using an external perl module, XML::LibXML:
This script now prints out the accession for the family "Piwi" (PF02171).
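For readers working in another language, the same fetch-and-parse steps can be sketched in Python using only the standard library. The sample document below is hypothetical: the element and attribute names (`pfam`, `entry`, `accession`, `id`) are assumptions made for illustration and should be checked against the published schema. The function names are likewise illustrative.

```python
import xml.etree.ElementTree as ET
from urllib.request import urlopen

# Hypothetical response body. The element and attribute names are assumed
# for illustration and should be checked against the published XML schema.
SAMPLE_XML = '<pfam><entry entry_type="Pfam-A" accession="PF02171" id="Piwi"/></pfam>'


def family_url(entry, host="pfam.janelia.org"):
    """URL for the XML representation of a Pfam-A family page."""
    return "http://%s/family/%s?output=xml" % (host, entry)


def fetch_family_xml(entry):
    """Retrieve the XML for a family. Needs network access; not called below."""
    with urlopen(family_url(entry)) as response:
        return response.read()


def local_name(tag):
    """Strip any "{namespace}" prefix that ElementTree adds to a tag name."""
    return tag.rsplit("}", 1)[-1]


def family_accession(xml_text):
    """Return the accession attribute of the first entry element, or None."""
    root = ET.fromstring(xml_text)
    for element in root.iter():
        if local_name(element.tag) == "entry":
            return element.get("accession")
    return None


if __name__ == "__main__":
    print(family_accession(SAMPLE_XML))
```

Matching on the local tag name means the parser works whether or not the real document declares an XML namespace, which this sketch does not assume either way.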
Available services
The following is a list of the sections of the website which are currently available as RESTful services.
Pfam ID/accession conversion
This is a simple service to return the accession and ID for a Pfam family, given either the ID or accession as input. Any of the following URLs will return the same simple XML document:
http://pfam.janelia.org/family/acc?id=Piwi&output=xml
http://pfam.janelia.org/family/acc/Piwi?output=xml
http://pfam.janelia.org/family/id?output=xml&acc=PF02171
http://pfam.janelia.org/family/id/Piwi?output=xml
http://pfam.janelia.org/family?entry=Piwi&output=xml
You can see the XML schema for this XML document here.
Note that, as a convenience, you can also omit the output=xml
parameter and the response will contain only the ID or accession, as a
plain text string:
Pfam-A annotations
You can retrieve a sub-set of the data in a Pfam-A family page as an XML document using any of the following styles of URL:
http://pfam.janelia.org/family?id=Piwi&output=xml
http://pfam.janelia.org/family?output=xml&acc=PF02171
http://pfam.janelia.org/family?entry=Piwi&output=xml
http://pfam.janelia.org/family/Piwi?output=xml
The last two styles, using the entry parameter or
an extended URL, accept either accessions or identifiers. The
accession/ID is case-insensitive in all cases.
You can see the XML schema for this XML document here.
Some Pfam families are removed or merged into others, in which case they become "dead" families. If you try to retrieve annotation information about a dead family, you'll get a simple XML document that only includes information on the replacement (if any) for the family:
You can see the XML schema for this XML document here.
Pfam-A family list
You can retrieve a list of all Pfam-A families in the latest Pfam release, either as an XML document or as a tab-delimited text file. Both formats contain the Pfam-A accession, Pfam-A identifier and description:
http://pfam.janelia.org/families?output=xml
http://pfam.janelia.org/families?output=text
You can also view the list in a web browser by removing the
output=xml parameter from the URL.
You can see the XML schema for this XML document here.
Protein sequence data
You can retrieve a sub-set of the data in a protein page as an XML document using any of the following styles of URL:
http://pfam.janelia.org/protein?id=CANX_CHICK&output=xml
http://pfam.janelia.org/protein?output=xml&acc=P00789
http://pfam.janelia.org/protein?entry=P00789&output=xml
http://pfam.janelia.org/protein/P00789?output=xml
As for Pfam-A families, arguments are all case-insensitive and the
entry parameter accepts either ID or accession.
You can see the XML schema for this XML document here.
Sequence searches
The Pfam website includes a form that allows users to upload a protein sequence and see a list of the Pfam domains that are found on their search sequence. We've now implemented a RESTful interface to this search tool, making it possible to run single-sequence Pfam searches programmatically.
Running a search is a two step process:
- submit the search sequence and specify search parameters
- retrieve search results in XML format
The reason for separating the operation into two steps rather than performing a search in a single operation is that the time taken to perform a sequence search will vary according to the length of the sequence searched. Most web clients, browsers or scripts, will simply time-out if a response is not received within a short time period, usually less than a minute. By submitting a search, waiting and then retrieving results as a separate operation, we avoid the risk of a client reaching a time-out before the results are returned.
The following example uses simple command-line tools to submit the search and retrieve results, but the whole process is easily transferred to a single script or program.
Save your sequence to file
It is usually most convenient to save your sequence into a plain text file, something like this:
The sequence should contain only valid sequence characters, i.e. letters, excluding "J" and "O". You can break the sequence across multiple lines to make it easier to handle.
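Before submitting, a script can check the sequence against this rule. The following Python sketch joins a sequence that has been broken across lines and rejects anything that is not a letter other than "J" or "O". The function name is hypothetical, and converting the sequence to upper case is an assumption of the sketch rather than a documented requirement.

```python
import string

# Letters accepted by the search: the alphabet without "J" and "O".
VALID_RESIDUES = frozenset(string.ascii_uppercase) - {"J", "O"}


def clean_sequence(text):
    """Join a sequence broken across lines and check its characters.

    Returns the sequence as one upper-case string (upper-casing is an
    assumption of this sketch) or raises ValueError naming the bad characters.
    """
    sequence = "".join(text.split()).upper()
    if not sequence:
        raise ValueError("empty sequence")
    invalid = sorted(set(sequence) - VALID_RESIDUES)
    if invalid:
        raise ValueError("invalid sequence characters: %s" % "".join(invalid))
    return sequence


if __name__ == "__main__":
    print(clean_sequence("MAGAAS\nPCANG\n"))
```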
Submit the search
You can see the XML schema for this XML document here.
When using curl the value of the parameter "seq"
needs to be quoted so that its value is taken correctly from the file
"test.seq". The second parameter can also be added directly to
the URL, as a regular CGI-style parameter, if you prefer.
The search service accepts the following parameters (you can see a more complete description of these settings here):
| Parameter | Description | Accepted values | Default | Notes |
|---|---|---|---|---|
| evalue | use this E-value cut-off | valid float | 1.0 | to use the gathering threshold for the family, set "ga=1" and don't specify an E-value. If an E-value is given, it will be used, regardless of the value of "ga" |
| ga | use gathering threshold | 0 or 1 | 0 | |
| searchBs | do search for Pfam-B hits | 0 or 1 | 0 | setting "skipAs=1" implies "searchBs=1"; you must search for at least one type of family |
| skipAs | don't search for Pfam-A hits | 0 or 1 | 0 | |
| seq | protein sequence | valid sequence characters | none | required |
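The interplay between these parameters can be sketched as a small Python function that assembles the form fields for a submission. The function name and keyword arguments are hypothetical. The sketch applies two rules from the table: an explicit E-value takes precedence over "ga", and at least one type of family must be searched, so it switches on the Pfam-B search whenever Pfam-A is skipped.

```python
def search_params(seq, evalue=None, ga=False, search_bs=False, skip_as=False):
    """Assemble the form parameters for a sequence search submission."""
    if not seq:
        raise ValueError("seq is required")
    if skip_as and not search_bs:
        # At least one type of family must be searched, so skipping
        # Pfam-A is treated here as a request to search Pfam-B.
        search_bs = True
    params = {"seq": seq, "searchBs": int(search_bs), "skipAs": int(skip_as)}
    if evalue is not None:
        # An explicit E-value is used regardless of "ga", so "ga" is omitted.
        params["evalue"] = float(evalue)
    elif ga:
        params["ga"] = 1
    return params


if __name__ == "__main__":
    print(search_params("MAGAASPCANG", ga=True))
```

Leaving "evalue" out when it is not given lets the server apply its own default rather than hard-coding one in the client.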
Wait for the search to complete
Although you can check for results immediately, if you poll before your job has completed, you won't receive an XML document. Instead, the HTTP response to your request will have its status set appropriately and the body of the response will contain only a string giving the status. You should ideally check the HTTP status of the response, rather than relying on the body of the response.
These are the possible status codes for the response:
| HTTP status code | Status description | Response body | Notes |
|---|---|---|---|
| 202 | Accepted | PEND / RUN | The job has been accepted by the search system and is either pending (waiting to be started) or running. After a short delay, your script should check for results again |
| 502 | Bad gateway | FAIL | There was a problem scheduling or running the job. The job has failed and will not produce results. There is no need to check the status again |
| 503 | Service unavailable | HOLD | Your job was accepted but is on hold. This status will not be assigned by the search system, but by an administrator. There is probably a problem with the job and you should contact the help desk for assistance with it |
| 410 | Gone | DEL | Your job was deleted from the search system. This status will not be assigned by the search system, but by an administrator. There was probably a problem with the job and you should contact the help desk for assistance with it |
| 500 | Internal server error | Error message | There was some problem with running your job, but it does not fall into any of the other categories. The body of the response will contain an error message from the server. Contact the help desk for assistance with the problem |
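A polling loop built on these status codes might look like the following Python sketch. It is deliberately independent of any HTTP library: `fetch_status` is a hypothetical caller-supplied function that requests the result URL once and returns the HTTP status and response body. Treating status 200 as "results ready" is an assumption, since the table lists only the codes returned while results are not available. The delay and attempt limit are arbitrary illustrative values.

```python
import time


def poll_results(fetch_status, delay=5.0, max_attempts=12, sleep=time.sleep):
    """Poll a search job until it finishes, fails or the attempts run out.

    fetch_status is a caller-supplied function returning (http_status, body).
    Status 200 is assumed to mean that the body holds the XML results.
    """
    for _ in range(max_attempts):
        # Wait before every check, including the first one after submission.
        sleep(delay)
        status, body = fetch_status()
        if status == 200:
            return body
        if status != 202:
            # 502, 503, 410 and 500 all mean the job will not produce results.
            raise RuntimeError("search failed with HTTP %d: %s" % (status, body.strip()))
    raise TimeoutError("search did not finish after %d attempts" % max_attempts)


if __name__ == "__main__":
    # Simulated server: pending, then running, then finished.
    responses = iter([(202, "PEND"), (202, "RUN"), (200, "<results/>")])
    print(poll_results(lambda: next(responses), delay=0.0))
```

Passing the HTTP call in as a function keeps the status handling easy to test without a network connection.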
When writing a script to submit searches and retrieve results, please add a short delay between the submission and the first attempt to retrieve results. Most search jobs are returned within four to five seconds of submission, depending greatly on the length of the sequence to be searched.
Retrieve results
The XML that was returned from the first query includes one or more URLs
from which you can now retrieve results, given in the
<result_url> element. You can now poll these URLs to retrieve
XML documents with the search hits.
You can see the XML schema for this XML document here.
Since the search is performed by the same server as searches in the Pfam website, you can view your results in a web page by modifying the URL slightly:
http://pfam.janelia.org/search/sequence/results/F69126C4-C24E-11DE-825F-800A2878356D
Note that old search results are generally cleared out after some time, so if you wait too long before trying to view your hits in the website, you may find that they are already gone.
Database documentation
Documentation update
October 2009
The documentation in this tab is currently out of date. Although the general information is still largely accurate, the details of the site and underlying database may be inaccurate. We hope to update the documentation within the coming weeks.
This section describes the tables in the Pfam MySQL database and shows example queries. Installation packages and documentation on the MySQL database itself can be found on the MySQL website.
VERSION table
The VERSION table contains information that relates to a particular Pfam release. It contains the version number of the Pfam database, the version numbers of the Swiss-Prot and TrEMBL databases that were used to build Pfam, and some statistics about the number of families and coverage. This table is stand-alone and does not link to any of the other tables.
Domain information
Two of the central tables in the database are pfamseq, which contains the UniProtKB sequence database, and pfamA, which contains information about the Pfam-A families. The table pfamA_reg_seed contains the Pfam domains that are present in a seed alignment, and the pfamA_reg_full contains all of the sequence regions that match the HMM for each family. Note that the pfamA_reg_full table contains both the significant and insignificant data.
The pfamA_reg_full_significant and pfamA_reg_full_insignificant tables contain, as the names suggest, the significant and insignificant data respectively. Significant hits are those with a bits score above the curated threshold for the family, whilst insignificant matches are those that score below the curated threshold. With respect to the tables that contain significant data (pfamA_reg_full_significant and pfamA_reg_full), there is an extra column called 'in_full'. The matches that are present in the full alignment for a Pfam family have this column set to 1, while those that are not present in the full alignment have the 'in_full' column set to 0. Where there is an overlapping fragment match and a full length match to the same Pfam-A family, only one of the matches will be present in the full alignment for that Pfam-A family.
The Pfam database has historically been built on the UniProtKB database; however, as of release 22.0 we also provide Pfam domain data for the NCBI sequence database (genpept) and a set of metagenomics sequences. Further information about querying the NCBI and metagenomics data sets can be found below.
UniProtKB sequences have secondary accessions if they have been merged or split. Secondary accession numbers are stored in the table called secondary_pfamseq_acc.
Other regions, active site and disulphide bond information for a sequence
These tables contain sequence-specific information about the sequences in the UniProtKB database. The other_regions table contains coiled coil, low complexity, signal peptide and transmembrane regions. The context_pfam_regions table contains context domains; context domains are those that do not score above the family gathering threshold, but are expected to be real based on the presence of the surrounding domains found in the protein. The pfamseq_markup table contains active site information which is taken from the UniProtKB feature table. Additional active site residues are predicted by Pfam based on conserved residues in a Pfam alignment. The pfamseq_disulphide table contains disulphide bond information from the UniProtKB feature table.
Architecture information for a family
In Pfam, an architecture is the collection of domains that are present on a protein.
Annotation information for a family
In addition to the Pfam annotation for each family, we also store InterPro annotation and their associated GO terms for each family. Links to other databases (e.g. SCOP) are also stored where appropriate.
Note: The other_params column contains 'fa;' where the Pfam family corresponds to a SCOP family, and 'sf;' where the Pfam family corresponds to a SCOP superfamily.
Clan data
A clan contains a set of related Pfam-A families. The information we use to determine which families belong to the same clan includes related structure, related function, matching of the same sequence to HMMs from different families and profile-profile comparisons. Note that not all Pfam-A families belong to a clan.
Dead families and clans
Sometimes we find that two or more families within Pfam can be merged into a single family, which leads to the deletion of Pfam-A families. The dead_families and dead_clans tables contain information about families and clans that have been deleted. These tables may be of use if you need to track what happened to the members of a particular family that is no longer in Pfam.
Hidden Markov model (HMM) tables
The tables pfamA_HMM_ls and pfamA_HMM_fs contain the HMMs for the global and fragment models respectively. It is unlikely that you will need to query these tables. The table pfamA_web contains information about the percentage identity, average length and average coverage for a Pfam-A family.
Nested domains
Some domains in Pfam are disrupted by the insertion of another domain (or domains) within them. The domain that is inserted into another is known as a nested domain.
Structural data
In order for the Protein DataBank (PDB) information to be useful to Pfam, we need to map between PDB residues and UniProtKB sequence residues. This is not a trivial task and this mapping information is provided by the Macromolecular Structure Database (MSD) group. The msd_data table contains this residue-by-residue mapping.
Genomes
The tables in this section allow you to retrieve domain information about a particular species, or to retrieve all of the species which contain a particular Pfam domain.
Note: The ncbi_code for the species 'Arabidopsis thaliana' is 3702. This information can be found in the ncbi_taxonomy table.
Related families
PRC and SCOOP are two pieces of software that we use to determine which Pfam families are related. The scores from these programs have been very useful in deciding which Pfam-A families should belong to the same clan. As a rough guide, a PRC E-value score of less than 0.001, or a SCOOP score greater than 50 shows that two families are closely related.
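The rough guide above translates directly into a simple check. This Python sketch is illustrative only: the function name is hypothetical, and the two cut-offs are the approximate values quoted in the text rather than fixed rules.

```python
PRC_EVALUE_CUTOFF = 0.001  # PRC E-values below this suggest a close relationship
SCOOP_SCORE_CUTOFF = 50    # SCOOP scores above this suggest a close relationship


def closely_related(prc_evalue=None, scoop_score=None):
    """Apply the rough guide: PRC E-value < 0.001 or SCOOP score > 50.

    Either score may be omitted (None) if it is not available for the pair.
    """
    by_prc = prc_evalue is not None and prc_evalue < PRC_EVALUE_CUTOFF
    by_scoop = scoop_score is not None and scoop_score > SCOOP_SCORE_CUTOFF
    return by_prc or by_scoop


if __name__ == "__main__":
    print(closely_related(prc_evalue=1e-5))
    print(closely_related(prc_evalue=0.01, scoop_score=20))
```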
Note: The model_start and model_end values let you know which regions of the models are similar.
Note: This query currently returns no results. PRC comparisons for Pfam-B families were not added to Pfam release 23.0, due to computational constraints. We hope to re-instate this data in a later release.
NCBI data
In addition to searching all of the sequences in UniProtKB, we also search the protein sequences from NCBI against Pfam. The ncbi_pfamA_reg table contains all of the sequence regions (both significant and insignificant) that match each HMM. The ncbi_map table links the GI number to its corresponding UniProtKB entry or entries. Note that not all GI numbers have a corresponding UniProtKB entry.
Note: The query must include 'in_full=1' in order to retrieve only significant hits.
Metagenomics data
We have searched a set of metagenomics sequences against Pfam. The metagenomics sequences that we searched are found in the metaseq table. Note that the meta_pfamA_reg table is different to the ncbi_pfamA_reg and pfamA_reg_full tables in that it contains only significant data.
Pfam FTP site
The following list describes a few of the important files in the Pfam FTP site. Some of these files may be very large (of the order of several hundred megabytes). Please check the sizes on the FTP site before trying to download them over a slow connection.
- relnotes.txt
- Release notes
- pfamseq.gz
- A fasta version of Pfam's underlying sequence database
- Pfam-A.hmm.gz
- The Pfam HMM library for Pfam-A families
- Pfam-B.hmm.gz
- The Pfam HMM library for Pfam-B families
- Pfam-A.full.gz
- The full alignments of the curated families
- Pfam-A.seed.gz
- The seed alignments of the curated families
- Pfam-B.gz
- Automatically generated alignments of sequence clusters in SWISSPROT and TrEMBL that are not modelled in the curated part of Pfam
- Pfam-C.gz
- Contains the information about clans and the Pfam-A membership
- swisspfam.gz
- The domain structure of SWISSPROT and TrEMBL proteins according to Pfam
- COPYRIGHT
- Copyright notice for Pfam
- GNULICENSE
- The full text of the GNU Library General Public License under which Pfam is licensed
Installing the Pfam website
Documentation update
October 2009
The documentation in this tab is currently out of date. Although the general information is still largely accurate, the details of the site and underlying database may be inaccurate. We hope to update the documentation within the coming weeks.
The source code for this website and the ancillary systems that it uses are all freely available for download. The website is designed to be portable, so that it can be installed and run at your local site if required. This section gives an overview of the requirements for running the site, a brief description of the steps involved in installing it, and links to detailed installation instructions.
Requirements
Software
The site is written entirely in perl, using the Catalyst web framework. It runs under mod_perl in the Apache web-server. All data are retrieved from MySQL databases, running locally. Sequence searches are performed by a separate job queuing system, which uses various third-party software to perform searches, generate alignments, etc., including HMMER, genewise and wublast.
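Before starting an installation it can help to confirm that the third-party programs are on the PATH. The following is a rough sketch only: the executable names in REQUIRED_TOOLS are assumptions (for example, HMMER is represented by hmmsearch and wublast by blastp) and may differ on your system, and missing_tools is a hypothetical helper rather than part of the Pfam code.

```python
import shutil

# Assumed executable names for the software listed above; adjust to your system.
REQUIRED_TOOLS = {
    "perl": "perl",
    "MySQL client": "mysql",
    "HMMER": "hmmsearch",
    "genewise": "genewise",
    "wublast": "blastp",
}


def missing_tools(tools, which=shutil.which):
    """Return the sorted labels of tools whose executable cannot be found.

    `which` is injectable so that the check can be exercised without the tools installed.
    """
    return sorted(label for label, exe in tools.items() if which(exe) is None)


if __name__ == "__main__":
    for label in missing_tools(REQUIRED_TOOLS):
        print(f"not found on PATH: {label}")
```

This only checks that an executable exists; it says nothing about versions or about the perl modules that the site requires.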
Hardware resources
The hardware requirements for the whole system are significant. Although it is possible to install all components on a single machine, we would not recommend it. Ideally you should have one or more web-server machines, a separate database server, and one or more machines to serve the back-end job queuing system. That said, although we give an idea of the hardware that we use at WTSI below, a local installation could be run on a significantly lower specification system.
Web server
The Pfam website includes mainly dynamically generated pages, with a large number of statically served items. The best performance can be gained by separating the two kinds of data onto two (or more) separate machines, so that dynamic data are served by one server, static data by another. If this is not possible, a single large machine should still give reasonable performance. We serve a development site from a 3GHz Intel Xeon with 4Gb of physical memory.
Database server
You will first need a reasonable amount of disk space for the database files. The database is distributed as a set of gzip-compressed table dumps, which total about 12Gb. Once uncompressed these table files take up around 35Gb and once the tables are installed into MySQL, the database will require around 150Gb of disk space.
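The figures above can be combined into a rough estimate of peak disk usage during installation. This sketch assumes that the compressed dumps, the uncompressed files and the installed database may all sit on the same disk at once; if you delete intermediate files as you go, the requirement falls accordingly. The function name is illustrative.

```python
# Approximate sizes quoted above, in Gb.
COMPRESSED_DUMPS = 12
UNCOMPRESSED_DUMPS = 35
INSTALLED_DATABASE = 150


def peak_disk_gb(keep_compressed=True, keep_uncompressed=True):
    """Estimate the disk space needed, depending on which intermediate files are kept."""
    total = INSTALLED_DATABASE
    if keep_compressed:
        total += COMPRESSED_DUMPS
    if keep_uncompressed:
        total += UNCOMPRESSED_DUMPS
    return total


print(peak_disk_gb())              # everything kept on one disk
print(peak_disk_gb(False, False))  # only the installed database remains
```

Under these assumptions the worst case is 12 + 35 + 150 = 197Gb, falling back to around 150Gb once the dump files have been removed.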
The MySQL database daemon will run happily on most machines, but in order to get the performance required to serve the website, you will need a machine with a fast processor (preferably multiple processors) and a large amount of memory. We run our database on a four processor AMD Opteron 280 server with 8Gb of physical memory.
Queuing system
Our job queuing system can be run on the same machine as the website or database server, but we would recommend running it on a separate machine or, ideally, on a farm of machines. This will ensure that the site can handle multiple requests for sequence alignments, sequence searches, etc. We run our queuing system on a farm of 14 dual-core 2.8GHz Xeons, each with 4Gb of physical memory.
Installation
You will need to install three sub-systems:
- the MySQL databases
- the back-end job queuing system
- the website itself
Database
- If you don't have it already, install MySQL
- Download the database files from the WTSI FTP site
- Install the database tables in your MySQL server
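The last database step can be sketched as follows. This is an illustration under stated assumptions, not the documented procedure: it assumes that each table is shipped as a schema file named <table>.sql.gz and a tab-delimited data file named <table>.txt.gz, and that the target database already exists. The function load_commands is hypothetical and only builds command strings for review; it does not run them. The detailed installation notes remain the authoritative source.

```python
import shlex


def load_commands(filenames, database):
    """Return shell commands that would load the given table dumps into `database`.

    Assumes <table>.sql.gz holds the schema and <table>.txt.gz the tab-delimited rows.
    """
    db = shlex.quote(database)
    suffix = ".sql.gz"
    tables = sorted(name[: -len(suffix)] for name in filenames if name.endswith(suffix))
    commands = []
    for table in tables:
        sql_gz = shlex.quote(f"{table}.sql.gz")
        commands.append(f"gunzip -c {sql_gz} | mysql {db}")
        if f"{table}.txt.gz" in filenames:
            txt = shlex.quote(f"{table}.txt")
            txt_gz = shlex.quote(f"{table}.txt.gz")
            commands.append(f"gunzip -c {txt_gz} > {txt}")
            # mysqlimport takes the table name from the file name (pfamA.txt -> pfamA).
            commands.append(f"mysqlimport --local {db} {txt}")
    return commands


for command in load_commands(["pfamA.sql.gz", "pfamA.txt.gz"], "pfam_24_0"):
    print(command)
```

The table name pfamA and the database name pfam_24_0 are examples only. Printing the commands first makes it easy to inspect them before running anything against a real server.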
Back-end
- Install the required third-party software such as HMMER
- Install the perl prerequisites
- Download the data files for running the offline searches
- Retrieve the queuing system code from CVS
- Configure and start the queues
Website
- Configure cpan
- Install catalyst
- Install perl prerequisites for the website
- Retrieve the website code from CVS
- Configure the website
- Configure apache
- Restart apache
Detailed installation instructions
The process of installing the three sub-systems is described in detail in three Portable Document Format (PDF) files. You will need a PDF-reader in order to view these instructions.
- Database installation notes
- installing the Pfam databases
- Offline script installation notes
- installing the "backend" scripts that run the job queuing system
- PfamWeb installation notes
- installing the website itself
Privacy issues
This section outlines the ways in which the Pfam website handles information about users. This should not be read as a legal document, but as a description of how we handle information that could be considered sensitive. It should be read in conjunction with the privacy policy documents of the individual Pfam consortium member sites. If you have any concerns about the way that information is used in the website, please contact us at the address given at the bottom of the page and we will be more than happy to discuss your concerns.
Although we make every possible effort to keep this site and the data that it manipulates safe and secure, we make no claim to be able to protect sensitive or privileged information. If you are at all concerned about sensitive information being released, please do not use the site and consider installing the Pfam database and/or this website locally.
Urchin
We use Urchin, a software package closely related to Google Analytics (GA), to track the usage of this website. Urchin uses a single-pixel "web bug" image, which is served from every page, a javascript script that collects information about each request, and cookies that maintain information about your usage of the site between visits. You can read more about how GA works on the Google Analytics website, which includes a detailed description of how traffic is tracked and analysed.
We use the information generated by Urchin purely for audit and accounting purposes, and to help us assess the usefulness and popularity of different features of the site. It does not provide the ability to track individual users' usage of the site. However, Urchin does provide a high-level overview of the traffic that passes through the site, including such information as the approximate geographical location of users, how often and for how long they visited the site, etc.
We understand that this level of tracking may be worrying to some of our users. If you have any concerns about our use of Urchin, please feel free to contact us.
Browsing
All web servers maintain fairly detailed logs of their activity. This includes keeping a record of every request that they serve, usually along with the IP address of the client that made the request. This is true of the web servers that host the Pfam websites.
Although our servers do collect information about your IP address during the normal process of serving the Pfam website, we do not use this information explicitly. The Pfam group uses server logs only to help with development and debugging of the site.
Searches
The sequence search feature of the site allows you to upload a protein or DNA sequence to be searched against our library of HMMs. The sequence that you upload is stored in a database and is retrieved by a set of scripts that actually perform the search. Although we do not have any information that could be used to link that sequence to you personally, you should be aware that the sequence itself is accessible to systems administrators and other users who maintain the Pfam site.
The batch search function allows you to submit larger searches, the results of which are emailed to you. Obviously, this requires you to provide identifiable information, namely an email address. However, beyond the routine backups of our databases, we do not store any information about email addresses and sequences in the longer term and we make no attempt to keep track of the searches that a particular user may be performing.
Information from other types of search, such as a keyword search, is held only in the web server logs but, as described above, no attempt is made to interpret these logs except as part of development or debugging of the site.
Cookies
We use cookies to maintain some information about you between your visits to the site. The information that is stored cannot be used to identify you personally and cannot be used to track your usage of the site.
If you are at all concerned about the use of cookies in the Pfam site, you are free to block all cookies from this site and you should not experience any problems. You may see some unintended behaviour, such as being notified of all new features every time you visit the index page, but the core functionality of the site should be unaffected.
Third-party javascript libraries
This site makes heavy use of javascript and relies on javascript libraries that are developed by various groups and companies. In order to improve the performance of the Pfam website, we no longer serve these files ourselves, but rely on files that are hosted on third-party web-servers. In particular, we use various files that are provided by the AJAX Libraries API, hosted by Google Code, and components of the Yahoo! User Interface Library (YUI), hosted by Yahoo!.
As these services are provided by commercial sites, it's likely that their usage will be carefully monitored by the companies that provide them. Although the Pfam site does not pass any information about you to these third-party sites, the sites themselves may use cookies to track your usage of the files that they serve. If you are concerned about the privacy implications of this monitoring, you may want to block cookies from the third-party hosting sites.
The Pfam Consortium
Pfam is maintained by an international consortium of researchers that has grown out of its original development by Erik Sonnhammer, Sean Eddy and Richard Durbin. The current consortium members, their institutes and primary roles are listed below.
Wellcome Trust Sanger Institute (UK)
- Alex Bateman - Co-ordinator of the Pfam, Merops and Rfam databases
- Penny Coggill - Pfam database annotator
- Rob Finn - Project leader
- Jaina Mistry - Pfam research and development
- Prasad Gunasekaran - Pfam development
- John Tate - Web development
Janelia Farm Research Center (USA)
- Sean Eddy - Co-ordinator of Pfam-USA, founding developer and author of HMMER software
Stockholm Bioinformatics Center (Sweden)
- Erik Sonnhammer - Co-ordinator of Pfam-Sweden and founding developer
Mirror Sites
Previous contributors
- Shimelis Assefa
- Matthew Bashton
- Ewan Birney
- Lorenzo Cerrutti
- Lachlan Coin
- Richard Durbin
- Matthew Fenech
- O. Luke Gavin
- Sam Griffiths-Jones
- Kevin Howe
- Nicola Kerrison
- Mhairi Marshall
- Nina Mian
- William Mifsud
- Simon Moxon
- Joanne Pollington
- Stephen John Sammut
- David Studholme
- Corin Yeats
Pfam is a collaborative venture and we hope to be able to interact with as many people as possible, in order to provide a quality database. Please get in touch with any one of us for more information about Pfam. You can email Pfam using the address found at the bottom of the page.
How to contact Pfam
Contact Pfam
You can contact us in various ways. Each of the Pfam consortium sites provides a contact email address, which you can find at the bottom of every page. You can use this address to contact the specific Pfam group.
We also run a central helpdesk, which handles annotation comments, data enquiries and general problems with the Pfam websites. We use a request tracking system to monitor emails to the helpdesk, so you should receive an automated response to your email, letting you know that the system has logged your mail and notified us of its arrival.
Mailing list
The Pfam mailing list is a low traffic list that has important announcements, such as releases or major changes.
To join the mailing list send a mail to pfamlist-subscribe@sanger.ac.uk.
If you should want to unsubscribe from the list send a mail to pfamlist-unsubscribe@sanger.ac.uk.
Xfam blog
The Pfam group contributes to the Xfam blog. The blog is used to announce releases, new features and important changes to Pfam, as well as for posts discussing general issues surrounding the Pfam resource. You can see blog posts that are specific to Pfam here.
RSS feeds
You can keep in touch with the latest developments by subscribing to the RSS feed from the Xfam blog.

