Pfam Help

Help Summary

Pfam 24.0 (Oct 2009, 11912 families)

Proteins are generally composed of one or more functional regions, commonly termed domains. The presence of different domains in varying combinations in different proteins gives rise to the diverse repertoire of proteins found in nature. Identifying the domains present in a protein can provide insights into the function of that protein.

The Pfam database is a large collection of protein domain families. Each family is represented by multiple sequence alignments and hidden Markov models (HMMs).

There are two levels of quality to Pfam families: Pfam-A and Pfam-B. Pfam-A entries are derived from the underlying sequence database, known as Pfamseq, which is built from the most recent release of UniProtKB at a given time-point. Each Pfam-A family consists of a curated seed alignment containing a small set of representative members of the family, profile hidden Markov models (profile HMMs) built from the seed alignment, and an automatically generated full alignment, which contains all detectable protein sequences belonging to the family, as defined by profile HMM searches of primary sequence databases.

Pfam-B families are un-annotated and of lower quality as they are generated automatically from the non-redundant clusters of the latest ADDA release. Although of lower quality, Pfam-B families can be useful for identifying functionally conserved regions when no Pfam-A entries are found.

Pfam entries are classified in one of four ways:

Family:
A collection of related proteins
Domain:
A structural unit which can be found in multiple protein contexts
Repeat:
A short unit which is unstable in isolation but forms a stable structure when multiple copies are present
Motif:
A short unit found outside globular domains

Related Pfam entries are grouped together into clans; the relationship may be defined by similarity of sequence, structure or profile-HMM.

Pfam Changes

This section details the changes that we plan to make or have made to Pfam. This includes changes to the flatfiles, MySQL database and the public website.


Latest changes to Pfam data

Changes between Pfam releases 23 and 24

Release 24.0 contains a total of 11912 families, with 1808 new families and 236 families killed since the previous release. 75.15% of all proteins in Pfamseq contain a match to at least one Pfam domain. 53.18% of all residues in the sequence database fall within Pfam domains. Pfam 24.0 is based on UniProt release 15.6, a composite of Swiss-Prot release 57.6 and TrEMBL release 40.6.


Latest changes to website

Release 2.0.1 (29th October 2009)

This was a minor release to fix bugs introduced in the last major update.

  • Updated documentation: around half of the documentation has been brought up to date with the changes due to HMMER3
  • Reinstated RESTful services: most of the "RESTful" services have been updated. There are some schema changes and some differences in how sequence searches must be run. Please check the help pages for up to date documentation
  • Domain architecture search is back: the domain architecture tab in the search page has been re-enabled
  • Sequence search restrictions: sequence validation code now refuses to accept '-' as a valid sequence character
  • NCBI GI numbers: the 'jump' tool now understands GI numbers again, and the NCBI sequence page has been fixed
  • Other small bug fixes: there have been numerous other bug fixes throughout the site

Getting Started using Pfam

Using the "Jump to" search

Many pages in the site include a small search box, entitled "Jump to...". The "Jump to..." box allows you to go immediately to the page for any entry in the Pfam site, including Pfam families, clans and UniProt sequence entries.

The "Jump to..." search understands accessions and IDs for most types of entry. For example, you can enter either a Pfam family accession, e.g. PF02171, or, if you find it easier to remember, a family ID, such as piwi. Note that the search is case insensitive.

Because some identifiers can be ambiguous, the "Jump to..." search may need to test several types of identifier to find the entry that you're looking for. For example, Pfam A family IDs (e.g. Kazal_1) and Pfam clan IDs (e.g. Kazal) aren't easily distinguished, so if you enter kazal, the search will first look for a family called kazal and, if it doesn't find one, will then look for a clan called kazal. If all of the guesses fail, you'll see an error message saying "Entry not found".

The order in which the search tries the various types of ID and accession is given below:

  • Pfam A accession, e.g. PF02171
  • Pfam A identifier, e.g. piwi
  • Pfam B accession, e.g. PB000001
  • Pfam B identifier, e.g. Pfam-B_1
  • UniProt sequence accession, e.g. P00789
  • UniProt sequence ID, e.g. CANX_CHICK
  • NCBI "GI" number, e.g. 113594566
  • NCBI secondary accession, e.g. BAF18440.1
  • Pfam clan accession, e.g. CL0005
  • metaseq ID, e.g. JCVI_ORF_1096665732460
  • metaseq accession, e.g. JCVI_PEP_1096665732461
  • Pfam clan ID, e.g. Kazal
  • PDB entry, e.g. 2abl
  • Proteome species name, e.g. Homo sapiens
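
The "try each identifier type in turn" behaviour described above can be sketched as an ordered list of lookups in which the first hit wins. This is only an illustration, not the site's actual code: the dictionaries, the clan URL path and the function names are hypothetical stand-ins for database queries, and only three of the identifier types are shown.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical in-memory tables standing in for the real database lookups.
# Keys are stored in lower case because the search is case insensitive.
PFAM_A_ACC: Dict[str, str] = {"pf02171": "/family/PF02171"}
PFAM_A_ID: Dict[str, str] = {"piwi": "/family/PF02171", "kazal_1": "/family/Kazal_1"}
CLAN_ID: Dict[str, str] = {"kazal": "/clan/Kazal"}  # URL path is an assumption

Lookup = Callable[[str], Optional[str]]


def table_lookup(table: Dict[str, str]) -> Lookup:
    """Build a lookup that matches keys case-insensitively."""
    return lambda entry: table.get(entry.lower())


# A subset of the documented order: families are tried before clans.
LOOKUPS: List[Tuple[str, Lookup]] = [
    ("Pfam A accession", table_lookup(PFAM_A_ACC)),
    ("Pfam A identifier", table_lookup(PFAM_A_ID)),
    ("Pfam clan ID", table_lookup(CLAN_ID)),
]


def jump(entry: str) -> str:
    """Return the first URL found for entry, trying each type in order."""
    cleaned = entry.strip()
    for _kind, lookup in LOOKUPS:
        url = lookup(cleaned)
        if url is not None:
            return url
    raise LookupError("Entry not found")
```

With these tables, jump("kazal") finds no family called kazal and so falls through to the clan lookup, as in the example above.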

Keyword search

Every page in the Pfam site includes a search box in the page header. You can use this to find Pfam A families which match a particular keyword. The search includes several different areas of the Pfam database:

  • text fields in Pfam entries, e.g. family descriptions
  • UniProt sequence entry description and species fields
  • HEADER and TITLE fields from PDB entries
  • Gene Ontology IDs and terms
  • InterPro entry abstracts

Each Pfam A entry is listed only once in the results table, although it might have been found in more than one area of the database.


Searching a protein sequence against Pfam

Searching a protein sequence against the Pfam library of HMMs will enable you to find out the domain architecture of the protein. If your protein is present in the version of UniProt, NCBI Genpept or the metagenomic sequence set that we used to make the current release of Pfam, we have already calculated its domain architecture. You can access this by entering the sequence accession or ID in the 'view a sequence' box on the Pfam homepage.

If your sequence is not in the Pfam database, you could perform a single-sequence or a batch search by clicking on the 'Search' link at the top of the Pfam page.

Single protein search

If your protein is not recognised by Pfam, you will need to paste the protein sequence into the search page. We will search your sequence against our HMMs and instantly display the matches for you.

Batch search

If you have a large number of sequences to search (up to several thousand), you can use our batch upload facility. This allows you to upload a file of your sequences in FASTA format, and we will run them against our HMMs and email the results back to you, usually within 48 hours. We request that you put a maximum of 1000 sequences in each file.
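
If your sequences are in one large FASTA file, a short script can split them into files that respect the 1000-sequence limit before you upload them. The sketch below is one possible way to do this; the function name and the default limit constant are our own choices for illustration.

```python
from typing import Iterable, Iterator, List

MAX_PER_FILE = 1000  # maximum number of sequences requested per batch file


def split_fasta(lines: Iterable[str],
                max_records: int = MAX_PER_FILE) -> Iterator[List[str]]:
    """Yield lists of FASTA lines, each holding at most max_records sequences."""
    chunk: List[str] = []
    count = 0
    for line in lines:
        if line.startswith(">"):
            # A new record starts here; flush the chunk first if it is full.
            if count == max_records:
                yield chunk
                chunk, count = [], 0
            count += 1
        chunk.append(line)
    if chunk:
        yield chunk
```

Each yielded chunk can be written to its own file, for example with open(name, "w").writelines(chunk), and uploaded separately.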

Local protein searches

If you have a very large number of protein searches to perform, or you do not wish to post your sequence across the web, it may be more convenient to run the Pfam searches locally using the 'pfam_scan.pl' script. To do this you will need the HMMER3 software, the Pfam HMM libraries and a couple of additional data files from the Pfam website. You will also need to download a few modules from CPAN, most notably Moose.

Full details on how to get 'pfam_scan.pl' up and running can be found on our FTP site.

Proteome analysis

Pfam pre-calculates the domain compositions and architectures for all the proteomes present in Integr8. To see the list of proteomes, click on the 'browse' link at the top of the Pfam website, and click on a letter of the alphabet in the 'proteomes' section. By clicking on a particular organism, you will be able to view the proteome page for that organism. From here you can view the domain organisation and the domain composition for that proteome.


Finding proteins with a specific set of domain combinations ('architectures')

Pfam allows you to retrieve all of the proteins with a particular domain combination (e.g. proteins containing both a CBS domain and an IMPDH domain) using the domain query tool. For a more detailed study of domain architectures you should use PfamAlyzer, a tool that is hosted by the Swedish Pfam site. PfamAlyzer allows the user to find proteins which contain a specific combination of domains, and it allows the user to specify particular species and the evolutionary distances allowed between domains.
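
Conceptually, a domain combination query is a subset test: a protein is returned if its set of domains contains every domain you asked for. The sketch below illustrates the idea on a hypothetical dictionary of architectures; the protein names and the third domain are made up, and the real domain query tool runs against the Pfam database rather than an in-memory table.

```python
from typing import Dict, Iterable, List


def proteins_with_domains(architectures: Dict[str, List[str]],
                          required: Iterable[str]) -> List[str]:
    """Return proteins whose architecture contains every required domain."""
    wanted = set(required)
    return sorted(
        protein
        for protein, domains in architectures.items()
        if wanted <= set(domains)  # subset test: all wanted domains present
    )


# Hypothetical architectures, listed in N- to C-terminal order.
EXAMPLE = {
    "protein_1": ["IMPDH", "CBS", "CBS"],
    "protein_2": ["CBS", "CBS"],
    "protein_3": ["IMPDH"],
    "protein_4": ["CBS", "IMPDH", "other_domain"],
}
```

Here proteins_with_domains(EXAMPLE, ["CBS", "IMPDH"]) returns protein_1 and protein_4 only.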

What is Pfam ?

Pfam is a collection of multiple sequence alignments and profile hidden Markov models (HMMs). Each Pfam HMM represents a protein family or domain. By searching a protein sequence against the Pfam library of HMMs, you can determine which domains it carries i.e. its domain architecture. Pfam can also be used to analyse proteomes and questions of more complex domain architectures.

For each Pfam accession we have a family page, which can be accessed in several ways: from the 'View a Pfam Family' search box on the HOME page, by clicking on any graphical image of a domain, by searching for a particular family using the 'Keyword Search' box on the top right hand corner of most website pages, or by pasting the family identifier or accession into the 'JUMP TO' box that is present on most pages in the site.

What is the difference between Pfam-A and Pfam-B families ?

There are two levels of quality to Pfam families: Pfam-A and Pfam-B.

Pfam-A entries are derived from the underlying sequence database, known as Pfamseq, which is built from the most recent release of UniProtKB at a given time-point. For each Pfam-A family we build a single curated profile hidden Markov model (profile HMM) from the seed alignment (a small set of representative members of the family) using the HMMER3 software, and search this against Pfamseq to provide an automatically generated full alignment. All sequences that score above the cut-off threshold value determined for that family are included in the full alignment, which should then contain all detectable protein sequences belonging to that family.

We also search our Pfam-A HMMs against NCBI Genpept and a set of metagenomic sequences, and these alignments are available from the 'Alignments' tab of the Pfam-A family page. As the seed alignments have been manually checked for quality by a Pfam curator, Pfam-A matches are very unlikely to be false matches. Pfam-A families also carry a summary annotation and links to other databases.

To complement the Pfam-A families, we automatically generate Pfam-B families using the ADDA database. Pfam-B families have no associated annotation or literature reference and are of much lower quality than Pfam-A families, as their alignments have not been manually checked by a Pfam curator. Pfam-B families are formed by taking alignments of sequence segments from ADDA and removing any Pfam-A residues from them. Some Pfam-B families are composed of low complexity regions and may not reflect true relationships; we therefore recommend that you verify that sequences in a Pfam-B family are related by using other methods, such as BLAST.

In Pfam 24.0, we have built HMMs for the first (and largest) 20,000 Pfam-B families. Using the Pfam website, users are able to perform a single-sequence or batch search against both the Pfam-A and Pfam-B HMMs.

All families in Pfam are non-overlapping, such that no amino acid belongs to more than one family/domain. At each Pfam release we search all our models against an updated version of UniProt and NCBI Genpept, and regenerate our Pfam-B families using the most recent version of ADDA.
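
The non-overlap rule can be checked on any set of domain matches for one sequence: sort the matches by start position and report any pair from different families that shares a residue. The following is a minimal sketch of such a check, not the procedure used to build Pfam; the Match type and the coordinates in the usage note are illustrative assumptions.

```python
from typing import List, NamedTuple, Tuple


class Match(NamedTuple):
    family: str
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive


def find_overlaps(matches: List[Match]) -> List[Tuple[Match, Match]]:
    """Return pairs of matches from different families that share a residue."""
    ordered = sorted(matches, key=lambda m: (m.start, m.end))
    overlaps = []
    for i, first in enumerate(ordered):
        for second in ordered[i + 1:]:
            if second.start > first.end:
                break  # sorted by start, so no later match can overlap 'first'
            if first.family != second.family:
                overlaps.append((first, second))
    return overlaps
```

For example, matches covering residues 1-100 and 101-200 produce no overlaps, whereas matches covering 1-100 and 90-150 from two different families are reported as a pair.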

What is on a Pfam-A family page ?

From the family page you can view the Pfam annotation for a family. We also provide access to many other sources of information, including annotation from the InterPro database, where available, cross-links to other databases and other tools for protein analysis.

Via the tabs on the left-hand side of the page, you can view:

  • the domain architectures in which this family is found
  • the alignments for the family in various formats, including alignments of matches to the NCBI and metagenomic sets, as well as in 'heat-map' format. All alignments can be downloaded
  • the phylogenetic and species distribution trees, the latter being interactive
  • the HMM logo
  • the structural information for each family where available

What is a clan ?

Some of the Pfam families are grouped into clans. Pfam defines a clan as a collection of families that have arisen from a single evolutionary origin. Evidence of their evolutionary relationship can be in the form of similarity in tertiary structures, or, when structures are not available, from common sequence motifs. The seed alignments for all families within a clan are aligned and the resulting alignment (called the clan alignment) can be accessed from a link on the clan page. Each clan page includes a clan alignment, a description of the clan and database links, where appropriate. The clan pages can be accessed by following a link from the family page, or by clicking on 'clans' under the 'browse' menu at the top of any Pfam page.

What happened to the Pfam_ls and Pfam_fs files?

In the past, each Pfam family was represented by two profile hidden Markov models (HMMs). One of these could match partially to a family and was called local or fs mode; the other required a sequence to match to the whole length of the HMM, and was called glocal or ls mode. With HMMER2, we found that the combination of the two models gave us the most sensitive searches. However, HMMER3 models are only available for searching in local (fs) mode. Because of the improvements in HMMER3, this single model is as sensitive as the two combined HMMER2 models. This means that we no longer provide two HMM libraries called 'HMM_ls' and 'HMM_fs'. Instead, a single library is available called 'Pfam-A.hmm'.

Can I search DNA against Pfam ?

The Wise2 software package allows the comparison of protein HMMs to genomic DNA. We use this package to allow users to search single DNA sequences against the library of Pfam HMMs. Paste your DNA sequence into the DNA search box on the search page. The results take approximately 2 minutes for a 1kb sequence, and approximately 1 hour for an 80kb sequence.

How can I search Pfam locally ?

If you have a large number of sequences or you don't want to post your sequence across the web, you can search your sequence locally using the 'pfam_scan.pl' script.

In terms of HMMs and formats, Pfam is based around the HMMER3 package. This will need to be installed on your local machine. You will also need to download the Pfam HMM libraries from the FTP site, as well as a few modules from CPAN, most notably Moose.

Full details on how to get 'pfam_scan.pl' up and running can be found on our FTP site.

Why doesn't Pfam include my sequence ?

Pfam is built from a fixed release of UniProt. At each Pfam release we incorporate sequences from the latest release of UniProt. This means that, at any time, the sequences used by Pfam might be several months behind those in the most up-to-date versions of the sequence databases. If your sequence isn't in Pfam, you can still find out what domains it contains by pasting it into the sequence search box on the search page.

How many accurate alignments do you have ?

Release 24.0 has 11912 families. Over 73.7% of the proteins in Swiss-Prot 57.6 and TrEMBL 40.6 have at least one match to a Pfam-A family.

How can I submit a new domain ?

If you know of a domain that is not present in Pfam, you can submit it to us by email (pfam-help@sanger.ac.uk) and we will endeavour to build a Pfam entry for it. We ask that you supply us with a multiple sequence alignment of the domain and any associated literature evidence, if available. Please send the alignment as a plain text file (e.g. .txt), not in the format of a specific application such as Microsoft Word (e.g. a .doc file).

What is iPfam ?

iPfam is a resource that describes domain-domain interactions that are observed in PDB entries. Where two or more Pfam domains occur in a single structure, it analyses them to see if they are close enough to form an interaction. If they are close enough it calculates the bonds forming the interaction. Further information can be found on the iPfam help pages.

Can I search my protein against Pfam ?

Of course! Please use this search form.

What is the difference between the - and . characters in your full alignments ?

The '-' and '.' characters both represent gap characters. However, they also tell you something about how the HMM has generated the alignment. The '-' symbols mark positions where the alignment of the sequence has used a delete state in the HMM to jump past a match state. This means that the sequence is missing a column that the HMM was expecting to be there. The '.' character is used to pad gaps where another sequence in the alignment has residues emitted from the HMM's insert state. See the alignment below, where both characters are used. The HMM states emitting each column are shown. Note that residues emitted from the Insert (I) state are in lower case.

FLPA_METMA/1-193     ---MPEIRQLSEGIFEVTKD.KKQLSTLNLDPGKVVYGEKLISVEGDE
FBRL_XENLA/86-317    RKVIVEPHR-HEGIFICRGK.EDALVTKNLVPGESVYGEKRISVEDGE
FBRL_MOUSE/90-321    KNVMVEPHR-HEGVFICRGK.EDALFTKNLVPGESVYGEKRVSISEGD
O75259/81-312        KNVMVEPHR-HEGVFICRGK.EDALVTKNLVPGESVYGEKRVSISEGD
FBRL_SCHPO/71-303    AKVIIEPHR-HAGVFIARGK.EDLLVTRNLVPGESVYNEKRISVDSPD
O15647/71-301        GKVIVVPHR-FPGVYLLKGK.SDILVTKNLVPGESVYGEKRYEVMTED
FBRL_TETTH/64-294    KTIIVK-HR-LEGVFICKGQ.LEALVTKNFFPGESVYNEKRMSVEENG
FBRL_LEIMA/57-291    AKVIVEPHMLHPGVFISKAK.TDSLCTLNMVPGISVYGEKRIELGATQ
Q9ZSE3/38-276        SAVVVEPHKVHAGIFVSRGKsEDSLATLNLVPGVSVYGEKRVQTETTD
HMM STATES           MMMMMMMMMMMMMMMMMMMMIMMMMMMMMMMMMMMMMMMMMMMMMMMM
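
The state line shown above can be recovered from the alignment itself by applying the two rules just described. As a rough sketch (the function names are ours, and only three of the rows are used), a column is labelled I if any row has a '.' or a lower-case residue in it, and M otherwise; the '-' characters in a row count the match states that the sequence skipped.

```python
from typing import Dict

# Three rows copied from the alignment above.
ROWS: Dict[str, str] = {
    "FLPA_METMA/1-193": "---MPEIRQLSEGIFEVTKD.KKQLSTLNLDPGKVVYGEKLISVEGDE",
    "FBRL_TETTH/64-294": "KTIIVK-HR-LEGVFICKGQ.LEALVTKNFFPGESVYNEKRMSVEENG",
    "Q9ZSE3/38-276": "SAVVVEPHKVHAGIFVSRGKsEDSLATLNLVPGVSVYGEKRVQTETTD",
}


def hmm_states(rows: Dict[str, str]) -> str:
    """Label each column 'I' (insert) or 'M' (match) from the gap conventions."""
    columns = zip(*rows.values())  # iterate over alignment columns
    return "".join(
        "I" if any(c == "." or c.islower() for c in column) else "M"
        for column in columns
    )


def count_deletes(row: str) -> int:
    """Number of match columns that this sequence skipped via a delete state."""
    return row.count("-")
```

Calling hmm_states(ROWS) gives 20 match columns, one insert column and 27 further match columns, which agrees with the HMM STATES line.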
    

What do the SS lines in the alignment mean ?

These lines are structural information. The SS stands for secondary structure, and this is taken from DSSP. The following list gives the definitions for each code letter:

C
Random Coil
H
Alpha-helix
G
3(10) helix
I
Pi-helix
E
Hydrogen bonded beta-strand (extended strand)
B
Residue in isolated beta-bridge
T
H-bonded turn (3-turn, 4-turn, or 5-turn)
S
Bend (five-residue bend centered at residue i)
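
If you process alignments with a script, the code letters can be kept in a simple lookup table. The sketch below counts how often each code occurs in an SS line; the function name is ours and the SS string in the usage note is invented for illustration.

```python
from collections import Counter
from typing import Dict

# Code letters and definitions as listed above.
DSSP_CODES: Dict[str, str] = {
    "C": "Random coil",
    "H": "Alpha-helix",
    "G": "3(10) helix",
    "I": "Pi-helix",
    "E": "Hydrogen bonded beta-strand (extended strand)",
    "B": "Residue in isolated beta-bridge",
    "T": "H-bonded turn (3-turn, 4-turn, or 5-turn)",
    "S": "Bend (five-residue bend centered at residue i)",
}


def summarise_ss(ss_line: str) -> Dict[str, int]:
    """Count residues per secondary structure class, ignoring other characters."""
    counts = Counter(code for code in ss_line if code in DSSP_CODES)
    return {DSSP_CODES[code]: n for code, n in counts.items()}
```

For a made-up line such as "CCHHHHEE--T", this reports two coil residues, four helix residues, two strand residues and one turn; the gap characters are skipped.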

You don't have domain YYYY in Pfam !

We are very keen to be alerted to new domains. If you can provide us with a multiple alignment then we will try hard to incorporate it into the database. If you know of a domain but don't have a multiple alignment, we still want to know; for simple families, just one sequence is enough. Again, email pfam-help@sanger.ac.uk.

Are there other databases which do this ?

To a certain extent, yes. There are a number of "second generation" databases which are trying to organise protein space into evolutionarily conserved regions. Examples include:

PROSITE
This originally was based around regular expression patterns but now also includes profiles.
PRINTS
This is based around protein "finger-prints" of a series of small conserved motifs making up a domain.
BLOCKS
This is based around automatic ungapped alignments.
SMART
This is a database concentrating on extracellular modules and signaling domains.
ADDA
This is an automatic algorithm for domain decomposition and clustering of protein domain families.
InterPro
Combines information from Pfam, Prints, SMART, Prosite and PRODOM.
CDD
The Conserved Domain Database is derived from Pfam and SMART databases.

So which database is better ?

As with everything, it depends on your problem: we would certainly suggest using more than one method. Pfam is likely to provide more interpretable results, with crisp definitions of domains in a protein.

Glossary of terms used in Pfam

These are some of the commonly used terms in the Pfam website.

Alignment coordinates

HMMER3 reports two sets of domain coordinates for each profile HMM match. The envelope coordinates delineate the region on the sequence where the match has been probabilistically determined to lie, whereas the alignment coordinates delineate the region over which HMMER is confident that the alignment of the sequence to the profile HMM is correct. Our full alignments contain the alignment coordinates from HMMER3.

Architecture

The collection of domains that are present on a protein.

Clan

A collection of related Pfam entries. The relationship may be defined by similarity of sequence, structure or profile-HMM.

Domain

A structural unit which can be found in multiple protein contexts.

Domain score

The score of a single domain aligned to an HMM. Note that, for HMMER2, if there was more than one domain, the sequence score was the sum of all the domain scores for that Pfam entry. This is not quite true for HMMER3.

DUF

Domain of unknown function.

Envelope coordinates

See Alignment coordinates.

Family

A collection of related proteins.

Full alignment

An alignment of the set of related sequences which score higher than the manually set threshold values for the HMMs of a particular Pfam entry.

Gathering threshold (GA)

Also called the gathering cut-off, this value is the search threshold used to build the full alignment. The GA is the minimum score a sequence must attain in order to belong to the full alignment of a Pfam entry. For each Pfam HMM we have two GA cutoff values, a sequence cutoff and a domain cutoff.
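
As an illustration of how two cutoffs might be applied together, the sketch below keeps a domain match only when the sequence score reaches the sequence cutoff and the domain's own score reaches the domain cutoff. Treating the two tests as a joint requirement is our assumption for this example, and the type name, function name and scores are hypothetical.

```python
from typing import List, NamedTuple


class Gathering(NamedTuple):
    sequence: float  # sequence cutoff, in bits
    domain: float    # domain cutoff, in bits


def domains_passing_ga(sequence_score: float,
                       domain_scores: List[float],
                       ga: Gathering) -> List[float]:
    """Return the domain scores that pass, or [] if the sequence cutoff is missed."""
    if sequence_score < ga.sequence:
        return []  # the sequence as a whole does not reach the threshold
    return [score for score in domain_scores if score >= ga.domain]
```

With hypothetical cutoffs of 25.0 bits for both values, a sequence scoring 60.0 with domains scoring 40.0 and 20.0 would keep only the first domain.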

HMMER

The suite of programs that Pfam uses to build and search HMMs. For Pfam release 24.0 we have used HMMER version 3 to make Pfam. See the HMMER site.

Hidden Markov model (HMM)

An HMM is a probabilistic model. In Pfam we use HMMs to transform the information contained within a multiple sequence alignment into a position-specific scoring system. We search our HMMs against the UniProt protein database to find homologous sequences.

HMMER2

The previous major version of the HMMER suite, which Pfam used to build and search HMMs before moving to HMMER3. See the HMMER site.

iPfam

A resource that describes domain-domain interactions that are observed in PDB entries. Where two or more Pfam domains occur in a single structure, it analyses them to see if they are close enough to form an interaction. If they are close enough it calculates the bonds forming the interaction.

Metaseq

A collection of sequences derived from various metagenomics datasets.

Motif

A short unit found outside globular domains.

Noise cutoff (NC)

The bit scores of the highest scoring match not in the full alignment.

Pfam-A

An HMM-based, hand-curated Pfam entry which is built using a small number of representative sequences. We manually set a threshold value for each HMM and search our models against the UniProt database. All of the sequences which score above the threshold for a Pfam entry are included in the entry's full alignment.

Pfam-B

An automatically generated alignment which is formed by taking a cluster of sequences from the ADDA database and removing Pfam-A residues from them. Since Pfam-B families are automatically generated, we recommend that you verify that the sequences in a Pfam-B family are related, using other methods such as BLAST. For Pfam 24.0, we have made HMMs for the first (and therefore largest) 20,000 Pfam-B families. Users can search their sequences against the Pfam-B HMMs in addition to the Pfam-A HMMs when performing both single-sequence searches and batch searches on the website.

Posterior probability

HMMER3 reports a posterior probability for each residue that matches a 'match' or 'insert' state in the profile HMM. A high posterior probability shows that the alignment of the amino acid to the match/insert state is likely to be correct, whereas a low posterior probability indicates that there is alignment uncertainty. This is indicated on a scale with '*' being 10, the highest certainty, down to 1 being complete uncertainty. Within Pfam we display this information as a heat map view, where green residues indicate high posterior probability, and red ones indicate a lower posterior probability.

Repeat

A short unit which is unstable in isolation but forms a stable structure when multiple copies are present.

Seed alignment

An alignment of a set of representative sequences for a Pfam entry. We use this alignment to construct the HMMs for the Pfam entry.

Sequence score

The total score of a sequence aligned to an HMM. If there is more than one domain, the sequence score is the sum of all the domain scores for that Pfam entry. If there is only a single domain, the sequence and the domain scores for the protein will be identical. We use the sequence score to determine whether a sequence belongs to the full alignment of a particular Pfam entry.

Trusted cutoff (TC)

The bit scores of the lowest scoring match in the full alignment.

Help With Pfam HMM scores

Documentation update

October 2009

The documentation in this tab is currently out of date. Although the general information is still largely accurate, the details of the site and underlying database may be inaccurate. We hope to update the documentation within the coming weeks.

What Pfam HMM scores mean

Pfam-A is based around hidden Markov model (HMM) searches, as provided by the HMMER2 package. In HMMER2, like BLAST, E-values (expectation values) are calculated. The E-value is the number of hits that would be expected to have a score equal to or better than this by chance alone. A good E-value is much less than 1. Around 1 is what we expect just by chance. In principle, all you need to decide on the significance of a match is the E-value.

However, there are a few complications.

The most serious complication is that there are no analytical results available for accurately determining E-values for gapped alignments, especially profile HMM alignments. HMMER uses empirical methods to estimate E-values. These methods are generally rather accurate. However, when in doubt, HMMER tends to err on the conservative side.

We use a second, and even more empirical, system in maintaining Pfam models. This system is implemented in the Pfam database rather than in the HMMER software. For each Pfam family, we record a "trusted cutoff" and a "noise cutoff", TC1 and NC1. TC1 is the lowest score for sequences we included in the family (e.g. in the Full alignment). NC1 is the highest score for sequences we did not include in the Full alignment. (Since Full alignments are produced automatically, the trusted sequence cutoff is always greater than the noise sequence cutoff.)

Therefore, we can consider a hit very significant if it scores better than the trusted cutoff, better than the noise cutoff, and has a significant E-value. Sometimes sequences score better than the cutoffs though they don't have significant E-values; these are marginal hits that we've chosen to include in the family.
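
The relationship between the two cutoffs and the E-value can be sketched in a few lines. In this illustration TC1 and NC1 are derived from lists of included and excluded scores, and a hit is then classified using the rules above. The function names, the example scores and the E-value limit of 0.01 are assumptions made for the example; the text only says that a good E-value is much less than 1.

```python
from typing import Sequence, Tuple


def trusted_and_noise(included: Sequence[float],
                      excluded: Sequence[float]) -> Tuple[float, float]:
    """Return (TC1, NC1): lowest included score and highest excluded score."""
    return min(included), max(excluded)


def classify_hit(score: float, evalue: float, tc: float, nc: float,
                 max_evalue: float = 0.01) -> str:
    """Classify a hit; max_evalue is an illustrative limit, not a Pfam setting."""
    above_cutoffs = score >= tc and score > nc
    if above_cutoffs and evalue <= max_evalue:
        return "very significant"
    if above_cutoffs:
        return "marginal"  # included in the family despite a weak E-value
    return "not included"
```

For example, with included scores of 25.3, 40.1 and 88.0 and excluded scores of 24.9 and 10.2, TC1 is 25.3 and NC1 is 24.9; a hit scoring 25.3 with an E-value of 0.5 would be classed as marginal.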

Sequence versus domain scores

There's one additional wrinkle in the scoring scheme. HMMER2 calculates two kinds of scores. The "sequence classification score" is the total score of a sequence aligned to a model; if there is more than one domain, the sequence score is the sum of all the domain scores (finding multiple domains increases our confidence that the sequence belongs to that protein family, even if each domain individually is a weak match). The "domain score" is a score for a single domain (these two scores are identical for single-domain proteins).

References & Bibliography

Pfam References

The Pfam protein families database: R.D. Finn, J. Mistry, J. Tate, P. Coggill, A. Heger, J.E. Pollington, O.L. Gavin, P. Gunesekaran, G. Ceric, K. Forslund, L. Holm, E.L. Sonnhammer, S.R. Eddy, A. Bateman Nucleic Acids Research (2010)  Database Issue 38:D211-222
The Pfam protein families database: R.D. Finn, J. Tate, J. Mistry, P.C. Coggill, J.S. Sammut, H.R. Hotz, G. Ceric, K. Forslund, S.R. Eddy, E.L. Sonnhammer and A. Bateman Nucleic Acids Research (2008)  Database Issue 36:D281-D288
Pfam: clans, web tools and services: R.D. Finn, J. Mistry, B. Schuster-Böckler, S. Griffiths-Jones, V. Hollich, T. Lassmann, S. Moxon, M. Marshall, A. Khanna, R. Durbin, S.R. Eddy, E.L.L. Sonnhammer and A. Bateman Nucleic Acids Research (2006)  Database Issue 34:D247-D251
Enhanced protein domain discovery by using language modeling techniques from speech recognition: L. Coin, A. Bateman and R. Durbin Proc. Natl. Acad. Sci. USA. (2003) 100(8):4516-20
The Pfam Protein Families Database: A. Bateman, L. Coin, R. Durbin, R.D. Finn, V. Hollich, S. Griffiths-Jones, A. Khanna, M. Marshall, S. Moxon, E.L.L. Sonnhammer, D.J. Studholme, C. Yeats and S.R. Eddy Nucleic Acids Research (2004) 32:D138-D141
The Pfam Protein Families Database: A. Bateman, E. Birney, L. Cerruti, R. Durbin, L. Etwiller, S.R. Eddy, S. Griffiths-Jones, K.L. Howe, M. Marshall and E.L. Sonnhammer Nucleic Acids Research (2002) 30(1):276-280
The Pfam Protein Families Database: A. Bateman, E. Birney, R. Durbin, S.R. Eddy, K.L. Howe and E.L. Sonnhammer Nucleic Acids Research  (2000) 28:263-266
Pfam 3.1: 1313 multiple alignments match the majority of proteins: A. Bateman, E. Birney, R. Durbin, S.R. Eddy, R.D. Finn and E.L.L. Sonnhammer Nucleic Acids Research (1999) 27:260-262
Pfam: multiple sequence alignments and HMM-profiles of protein domains: E.L.L. Sonnhammer, S.R. Eddy, E. Birney, A. Bateman and R. Durbin Nucleic Acids Research (1998) 26:320-322
Pfam: a comprehensive database of protein families based on seed alignments: E.L.L. Sonnhammer, S.R. Eddy and R. Durbin Proteins (1997) 28:405-420

Book Chapters on Pfam

Pfam: the protein families database R.D. Finn (eds M.J. Dunn, L.B. Jorde, P.F.R. Little, S. Subramaniam) Genetics, Genomics, Proteomics and Bioinformatics, Section 6: Protein Families  (2005) ISBN 978-0-470-84974-3
Identifying protein domains with the Pfam database R.D. Finn, A. Bateman and S. Griffiths-Jones Current protocols in bioinformatics  ISBN 978-0-471-25093-7
Pfam: a domain-centric method for analysing proteins and proteomes J. Mistry and R.D. Finn Methods in Molecular Biology - Comparative Genomics

How to link to Pfam?

Pfam is maintained by a consortium of researchers based at the Wellcome Trust Sanger Institute, Cambridge, UK (WTSI), Stockholm Bioinformatics Center, Stockholm, Sweden (SBC), and Janelia Farm, Virginia, USA. All three sites run the same Pfam website and linking to different sites only requires that you change the site name, not the parameters in the URL.

Although we have no plans to change the locations of resources within this site dramatically, webmasters are advised to link only to the following types of page within the site.

Home pages

WTSI:
http://pfam.sanger.ac.uk/
SBC:
http://pfam.sbc.su.se/
Janelia:
http://pfam.janelia.org/

Searching a protein sequence against Pfam

WTSI:
http://pfam.sanger.ac.uk/search?tab=sequenceSearchBlock
SBC:
http://pfam.sbc.su.se/search?tab=sequenceSearchBlock
Janelia:
http://pfam.janelia.org/search?tab=sequenceSearchBlock

Searching a DNA sequence against Pfam

WTSI:
http://pfam.sanger.ac.uk/search?tab=sequenceDnaBlock
SBC:
http://pfam.sbc.su.se/search?tab=sequenceDnaBlock
Janelia:
http://pfam.janelia.org/search?tab=sequenceDnaBlock

Linking to Pfam family pages

You can refer to Pfam families either by accession or ID. You can also refer to a family by "entry", although this is a convenience that should be used only if you're not sure if what you have is an accession or an ID.

Pfam accession numbers are more stable between releases than IDs and we strongly recommend that you link by accession number.

Here are some examples of linking to Pfam at WTSI:

By accession:
http://pfam.sanger.ac.uk/family?acc=PF00002
By ID:
http://pfam.sanger.ac.uk/family?id=7tm_2
Using "entry":
http://pfam.sanger.ac.uk/family?entry=PF00002 or
http://pfam.sanger.ac.uk/family?entry=7tm_2
Directly:
http://pfam.sanger.ac.uk/family/PF00002 or
http://pfam.sanger.ac.uk/family/7tm_2

You can link to Pfam family data at the other sites by changing "pfam.sanger.ac.uk" to "pfam.sbc.su.se" or "pfam.janelia.org".

Linking to protein sequence pages

As for Pfam family pages, you can refer to protein sequence pages by accession, ID or entry. Protein IDs are unstable and do change between releases, so, again, we strongly recommend that you use protein accessions where possible.

Here are some examples of linking to protein sequence pages at WTSI:

By accession:
http://pfam.sanger.ac.uk/protein?acc=P15498
By ID:
http://pfam.sanger.ac.uk/protein?id=VAV_HUMAN
Using "entry":
http://pfam.sanger.ac.uk/protein?entry=P15498 or
http://pfam.sanger.ac.uk/protein?entry=VAV_HUMAN
Directly:
http://pfam.sanger.ac.uk/protein/P15498 or
http://pfam.sanger.ac.uk/protein/VAV_HUMAN

Again, to generate links to the other Pfam sites, change "pfam.sanger.ac.uk" to "pfam.sbc.su.se" or "pfam.janelia.org".
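
Because only the host name differs between sites, links of this kind are easy to generate in a script. The following is a minimal Python sketch rather than anything provided by Pfam: the names pfam_link and PFAM_SITES are hypothetical, and only the accession-based URL style recommended above is produced.

```python
from urllib.parse import urlencode

# Host names of the three Pfam sites listed in this section.
PFAM_SITES = {
    "WTSI": "pfam.sanger.ac.uk",
    "SBC": "pfam.sbc.su.se",
    "Janelia": "pfam.janelia.org",
}


def pfam_link(section, accession, site="WTSI"):
    """Build a link to a family or protein page, addressed by accession.

    pfam_link is a hypothetical helper; only the URL layout is taken
    from the examples in this section.
    """
    if section not in ("family", "protein"):
        raise ValueError("section must be 'family' or 'protein'")
    query = urlencode({"acc": accession})
    return "http://{}/{}?{}".format(PFAM_SITES[site], section, query)


if __name__ == "__main__":
    for site in PFAM_SITES:
        print(pfam_link("family", "PF00002", site))
    print(pfam_link("protein", "P15498"))
```

Keeping the host names in one table means a link can be switched to another site without touching the rest of the URL.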

Linking to the "jump to" search

The Pfam website features a search tool that tries to guess the type of any accession or ID that it is given. For example, if given "VAV_HUMAN", the search returns the URL for the protein sequence page for the VAV_HUMAN entry. If given "1w9h", the search returns the URL for the PDB entry (structure) 1w9h.

You can use the "jump to" search if you need to link to Pfam but can't be sure what type of accession or ID you will be using in your link. By default, the search returns the URL that it has found, as a simple, plain text HTTP response. Adding the parameter redirect=1 will make the "jump to" tool redirect to the URL that it finds or, if it couldn't find an appropriate URL, to the Pfam homepage.

Return URL:
http://pfam.sanger.ac.uk/search/jump?entry=P15498
Redirect:
http://pfam.sanger.ac.uk/search/jump?entry=P15498&redirect=1

Note that, although it may be convenient to link to Pfam using this search tool, there is no error reporting for your users if the search fails to find an appropriate URL in the Pfam site. It is much safer to link directly to the correct section of the site. Please contact us if you need help with building specific links.
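
Since the "jump to" search gives no error report, a script that uses it has to check the response itself. The Python sketch below shows one way to do that. The helper names jump_url and resolve are hypothetical, and the assumption that a failed lookup yields either an HTTP error or a body that is not a URL is ours, not something stated in this documentation.

```python
from urllib.error import URLError
from urllib.parse import urlencode
from urllib.request import urlopen


def jump_url(entry, redirect=False, host="pfam.sanger.ac.uk"):
    """Build a "jump to" search URL, optionally asking for a redirect."""
    params = {"entry": entry}
    if redirect:
        params["redirect"] = 1
    return "http://{}/search/jump?{}".format(host, urlencode(params))


def resolve(entry, timeout=30):
    """Return the URL reported by the search as plain text, or None.

    Assumption: anything that is not a successful response containing
    a URL is treated as "not found".
    """
    try:
        with urlopen(jump_url(entry), timeout=timeout) as response:
            body = response.read().decode("utf-8").strip()
    except (URLError, OSError):
        return None
    return body if body.startswith("http") else None


if __name__ == "__main__":
    print(jump_url("P15498"))
    print(jump_url("P15498", redirect=True))
```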

Documentation update

October 2009

The documentation in this tab is currently out of date. Although the general information is still largely accurate, the details of the site and underlying database may be inaccurate. We hope to update the documentation within the coming weeks.

One of the features provided by the Pfam website is a graphical representation of the features found within a sequence, termed domain graphics. There are a variety of different shapes and styles and each one has a particular meaning. This page gives an in-depth description of the elements of Pfam domain graphics.

The library which generates the images in this page and throughout the Pfam site uses an XML language to describe the domain graphic that is required. Each of the example graphics in this page is followed by a link that can be used to show the XML that produced it.

We provide a set of tools, described in the Tools & Web Services section of the help pages, that allow you to generate custom domain graphics by uploading your own XML file, or to generate graphics for a specific UniProt sequence, given the UniProt accession or ID.


The sequence

The base sequence, undecorated by any domains or features, is represented by a plain grey bar:

The length of the domain graphic that is drawn is proportional to the length of the sequence itself. The graphics in this page are drawn with an X-scale of 0.5 pixels per amino-acid, so that a 200 residue sequence will result in a 100 pixel-wide image. Any domains or features which are drawn on the sequence are also scaled by the same factor.
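
The scaling is simple arithmetic, illustrated by the short Python sketch below. The function names are hypothetical, and treating region coordinates as 1-based and inclusive is an assumption; the graphics library may round or position regions differently.

```python
# Pixels per amino acid, the value used for the graphics on this page.
X_SCALE = 0.5


def scaled(residues, x_scale=X_SCALE):
    """Convert a length in residues to a length in pixels."""
    return int(round(residues * x_scale))


def region_width(start, end, x_scale=X_SCALE):
    """Pixel width of a region on the sequence.

    Assumes 1-based, inclusive residue coordinates.
    """
    return scaled(end - start + 1, x_scale)


if __name__ == "__main__":
    print(scaled(200))            # width of a 200 residue sequence
    print(region_width(48, 347))  # width of a 300 residue region
```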


Pfam-A

The high quality, curated Pfam-A domains are classified into one of four different types: family, domain, repeat and motif (more details). These different classification types are rendered slightly differently.

Family/domain

It is possible for a sequence to match either the full length of a Pfam HMM (a full length match), or to match a portion of an HMM (a fragment match). The two types of match are rendered differently.

Both family and domain entries are rendered as rectangles with curved ends when the sequence is a full length match. The curves at the ends become less pronounced when the domains are short, as shown in the second domain below. Different types of domain are displayed with different colours. When the domain image is long enough, the domain name is shown within the domain itself. In most cases, you can click on the domains to visit the "family page" for that domain. Moving the mouse over the domain image should also display a tooltip showing the domain name, as well as the start and end positions of the domain.

When the sequence does not match the full length of the HMM that models a Pfam entry, matching domain fragments are shown. When a sequence match does not pass through the first position in the HMM, the N-terminal side of the domain graphic is drawn with a jagged edge instead of a curved edge. Similarly, when a sequence match does not pass through the last position of the HMM, the C-terminal side of the domain graphic is drawn with a jagged edge. In some rarer cases, the sequence match may not pass through either of the first or last positions of the HMM, in which case both sides are drawn with jagged edges. Examples of all three cases are shown here:
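
The rule for choosing between curved and jagged edges can be written as a small function. This is a Python sketch of the rule as described above, not the code used by the graphics library; the function name is hypothetical, and the argument names follow the hmm_start, hmm_end and model_length attributes that appear in the Pfam XML.

```python
def edge_styles(hmm_start, hmm_end, model_length):
    """Return (N-terminal edge, C-terminal edge) for a family/domain match.

    An end is "curved" when the match reaches that end of the HMM and
    "jagged" when it stops short of it.
    """
    if not 1 <= hmm_start <= hmm_end <= model_length:
        raise ValueError("HMM coordinates are out of range")
    n_term = "curved" if hmm_start == 1 else "jagged"
    c_term = "curved" if hmm_end == model_length else "jagged"
    return n_term, c_term


if __name__ == "__main__":
    print(edge_styles(1, 304, 304))    # full length match
    print(edge_styles(10, 304, 304))   # N-terminal fragment
    print(edge_styles(1, 250, 304))    # C-terminal fragment
    print(edge_styles(10, 250, 304))   # fragment at both ends
```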

Repeat/motif

Repeats and motifs are types of Pfam domain which do not form independently folded units. In order to distinguish them from domains of type family and domain, repeats and motifs are represented by rectangles with straight edges. As for families and domains, partial matches are represented with jagged edges.

Discontinuous nested domains

Some domains in Pfam are disrupted by the insertion of another domain (or domains) within them. A number of names have been given to this arrangement: discontinuous (referring to the outer domain), inserted or nested (both referring to the inner domain). For example, in many sequences containing an IMPDH domain, the IMPDH domain is continuous along the primary sequence. However, in some cases the linear sequence of the IMPDH domain is broken by the insertion of a CBS domain, as shown below.

Where three-dimensional structures are available for representatives of a Pfam domain, it is generally clear that the three-dimensional arrangement of the domain containing the nested domain is maintained. Typically the nested domain is found inserted within a surface exposed loop, having little or no effect on the structure of the other domain. Such an arrangement explains why and how these nested domains can be functionally tolerated.

To represent this arrangement of domains graphically, the discontinuous domain is represented in two parts (as shown below). These two parts are joined by a line bridging them. The vertical parts of the line are dashed, while the horizontal line is solid (to distinguish it from a disulphide bridge).

Context domains

Context domains in Pfam are those that, despite not scoring above the family gathering threshold, are expected to be real, based on the presence of the surrounding domains found in the protein. The method is described in:

Enhanced protein domain discovery by using language modeling techniques from speech recognition: L. Coin, A. Bateman and R. Durbin Proc. Natl. Acad. Sci. USA. (2003) 100(8):4516-20

In some cases it is possible for a protein without any matches to gain context domains. This happens when two or more weak matches support each other. This is most often seen with multiple tandem repeats such as WD40 and leucine rich repeats such as LRR_1.

Within the Pfam domain graphics, the context domains are represented by rectangles that are coloured from white to pink as shown below. These images are interactive in the same manner as the Pfam-A graphics.

Please note that context domains are generated automatically and have not been subjected to the same high level of quality control as Pfam-A domains. Therefore, context domains, although likely to be correct, should always be verified by other means.


Pfam-B

Pfam-B regions are automatically generated clusters that supplement the high quality Pfam-A regions. The mechanism for generating Pfam-B regions is detailed here. These regions are represented by a small rectangle, coloured with three stripes. As for Pfam-A regions, clicking on a Pfam-B domain takes the user to the Pfam-B summary page for that entry. Moving the mouse over the striped image will show a tooltip listing the Pfam-B identifier and its start and end points. If the Pfam-B region is long enough, its identifier will also be displayed on the image.


Other sequence motifs

In addition to domains, smaller sequence motifs are represented by the domain graphics. Currently the following motifs are represented: signal peptides, low complexity regions, coiled-coils and transmembrane regions. These usually take lower priority than other regions that are drawn and they are therefore often obscured by, for example, a Pfam-A graphic being drawn over the top of them. An example of each motif is shown here.

Signal peptides

Signal peptides are short regions (<60 residues long) found at the N-terminus of proteins, which direct the post-translational transport of a protein and are subsequently removed by peptidases. More specifically, a signal peptide is characterised by a short hydrophobic helix (approximately 7-15 residues). This helix is preceded by a slightly positively charged region of highly variable length (approximately 1-12 residues). Between the hydrophobic helix and the cleavage site is a somewhat polar and uncharged region, of between 3 and 8 amino-acids. In Pfam, we use Phobius for the prediction of signal peptides and represent them graphically by a small orange box.

A combined transmembrane topology and signal peptide prediction method: L. Kall, A. Krogh and E.L.L. Sonnhammer J. Mol. Biol. (2004) 338(5):1027-36

Low complexity regions

Low complexity regions are regions of biased sequence composition, usually comprised of different types of repeats. These regions have been shown to be functionally important in some proteins, but they are generally not well understood and are masked out to focus on globular domains within the protein.

In Pfam, we use SEG to identify low complexity regions. The presence of a low complexity region is indicated by a cyan rectangle.

Coiled-coils

Coiled-coils are motifs found in proteins that structurally form alpha-helices that wrap or wind around each other. Normally, two to three helices are involved, but cases of up to seven alpha-helices have been reported. Coiled-coils are found in a wide variety of proteins, many functionally very important. In Pfam we use ncoils to identify these motifs. Coiled-coils are represented by a small lime-green rectangle.

Transmembrane regions

Integral membrane proteins contain one or more transmembrane regions, each comprised of an alpha-helix that passes through or "spans" a membrane. Transmembrane helices are quite variable in length, with the average being about 20 amino-acids. Again, Phobius is used for the prediction of transmembrane regions, which are represented by a red rectangle.


Other Sequence features

Below is a demonstration of how disulphide bridges and active site residues are represented in Pfam. Each of these features can appear above or below the sequence, but in this case the disulphide bridges are shown above the sequence and the active site residues below it.

Disulphide bridges

Disulphide bridges play a fundamental role in the folding and stability of some proteins. They are formed by covalent bonding between the thiol groups from two cysteine residues. The disulphide bridge annotations used in Pfam come from UniProt and are represented by a solid bridge-shaped line. When multiple disulphide bonds occur, the heights of the bridges are adjusted to avoid overlaps between them. Inter-protein disulphides are represented by single vertical lines. As always, moving the mouse over the "bridge graphic" shows the details of the bond in a tooltip.

Active site residues

Within an enzyme, a small number of residues are directly involved in catalysis of a reaction. These are termed active site residues. Within Pfam there are three categories of active site: those that are experimentally determined, those that are predicted by UniProt and those predicted by Pfam. All three types are represented by a "lollipop" with a diamond head. The head is coloured red, pink and purple for each of the three types respectively.

Pfam-predicted active sites are determined by using the experimental data and transferring these annotations through a Pfam alignment.


Other features

In addition to the drawing features outlined above, the Pfam domain graphics library includes some additional, general purpose representation styles.

Arrows

Arrows can be drawn perpendicular to the sequence, and can point either towards or away from the sequence line. They can be drawn with different vertical line styles (solid, dashed or bold) and can be placed above or below the sequence. The example below shows the different arrow styles that are available:

Additional "lollipop" styles

A wide range of different lollipop styles can be created by combining different line and head colours with different drawing styles. For example, a lollipop can be drawn with either bold (solid) or dashed lines. The lollipop head can be drawn as a square, circle or diamond.

Guide to Pfam tools and services

Tools

Producing your own graphics

As we are regularly asked to produce domain graphics for use in publications, we provide a tool that lets users upload a "domain graphics" XML file. The file is validated against the schema and then rendered. The images that the tool produces can then be saved for your own use.

If there is an existing sequence in Pfam that you wish to alter/elaborate then the XML used by Pfam for this sequence can also be obtained using this tool.

You can see a detailed description of the XML language that describes Pfam domain images in the Guide to Graphics section of the help pages.

There is a similar tool which allows you to see the domain graphic for a given UniProt entry.


Web services

In the past, Pfam has provided a set of SOAP-based web services, designed to allow programmatic access to Pfam data. These services were built as a stand-alone service, entirely separate from the old Pfam website. As such they are somewhat difficult to maintain and poorly integrated with the new Pfam website.

With the latest website release, we have added a new type of programmatic interface to Pfam services, in the form of a set of "RESTful" services. You can see documentation for these new services here.

Because of the problems of maintaining the SOAP-based web-services, we are now phasing them out, in favour of the RESTful interface. We would strongly encourage developers to switch to the new services. If you have questions or comments about this switch, please contact us at the email address at the bottom of the page.

This is an introduction to the "RESTful" interface to the Pfam website. REST (or Representational State Transfer) refers to a style of building websites which makes it easy to interact programmatically with the services provided by the site. A programmatic interface, commonly called an Application Programming Interface (API), allows users to write scripts or programs to access data, rather than having to rely on a browser to view a site.


Basic concepts

URLs

A RESTful service typically sends and receives data over HTTP, the same protocol that's used by websites and browsers. As such, the services provided through a RESTful interface are identified using URLs.

In the Pfam website we use the same basic URL to provide both the standard HTML representation of Pfam data and the alternative XML representation. To see the data for a particular Pfam-A family, you would visit the following URL in your browser:

http://pfam.janelia.org/family/Piwi

To retrieve the data in XML format, just add an extra parameter, output=xml, to the URL:

http://pfam.janelia.org/family/Piwi?output=xml 

The response from the server will now be an XML document, rather than an HTML page.

Sending requests

Although you can use a browser to retrieve family data in XML format, it's most useful to send requests and retrieve XML programmatically. The simplest way to do this is using a Unix command line tool such as curl:

Example
shell% curl -F output=xml 'http://pfam.janelia.org/family/Piwi'
<?xml version="1.0" encoding="UTF-8"?>
<!-- information on Pfam-A family PF02171 (Piwi), generated: 16:35:52 26-Oct-2009 -->
<pfam xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xmlns="http://pfam.sanger.ac.uk/"
      xsi:schemaLocation="http://pfam.sanger.ac.uk/
                          http://pfam.sanger.ac.uk/static/documents/schemas/pfam_family.xsd"
      release="24.0"
      release_date="2009-10-07">
  <entry entry_type="Pfam-A" accession="PF02171" id="Piwi">
    ...

Most programming languages have the ability to send HTTP requests and receive HTTP responses. A Perl script to retrieve data about a Pfam family might be as simple as this:

Example
#!/usr/bin/perl

use strict;
use warnings;

use LWP::UserAgent;

my $ua = LWP::UserAgent->new;
$ua->env_proxy;

my $res = $ua->get( 'http://pfam.janelia.org/family/Piwi?output=xml' );

if ( $res->is_success ) {
  print $res->content;
}
else {
  print STDERR $res->status_line, "\n";
}

Retrieving data

Although XML is just plain text and therefore human-readable, it's intended to be parsed into a data structure. Extending the Perl script above, we can add the ability to parse the XML using an external Perl module, XML::LibXML:

Example
#!/usr/bin/perl

use strict;
use warnings;

use LWP::UserAgent;
use XML::LibXML;

my $ua = LWP::UserAgent->new;
$ua->env_proxy;

my $res = $ua->get( 'http://pfam.janelia.org/family/Piwi?output=xml' );

die "error: failed to retrieve XML: " . $res->status_line . "\n"
  unless $res->is_success;

my $xml = $res->content;

my $xml_parser = XML::LibXML->new();
my $dom = $xml_parser->parse_string( $xml );

my $root = $dom->documentElement();
my ( $entry ) = $root->getChildrenByTagName( 'entry' );

print 'accession: ' . $entry->getAttribute( 'accession' ) . "\n";

This script now prints out the accession for the family "Piwi" (PF02171).


Available services

The following is a list of the sections of the website which are currently available as RESTful services.

Pfam ID/accession conversion

This is a simple service to return the accession and ID for a Pfam family, given either the ID or accession as input. Any of the following URLs will return the same simple XML document:

http://pfam.janelia.org/family/acc?id=Piwi&output=xml
http://pfam.janelia.org/family/acc/Piwi?output=xml
http://pfam.janelia.org/family/id?output=xml&acc=PF02171
http://pfam.janelia.org/family/id/Piwi?output=xml
http://pfam.janelia.org/family?entry=Piwi&output=xml
Example
shell% curl -F output=xml 'http://pfam.janelia.org/family/acc/Piwi'
<?xml version="1.0" encoding="UTF-8"?>
<!-- information on Pfam-A family PF02171 (Piwi), generated: 16:37:09 26-Oct-2009 -->
<pfam xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xmlns="http://pfam.sanger.ac.uk/"
      xsi:schemaLocation="http://pfam.sanger.ac.uk/
                          http://pfam.sanger.ac.uk/static/documents/schemas/pfam_family.xsd"
      release="24.0"
      release_date="2009-10-07">
  <entry entry_type="Pfam-A" accession="PF02171" id="Piwi" />
</pfam>

You can see the XML schema for this XML document here.

Note that, as a convenience, you can also omit the output=xml parameter and the response will contain only the ID or accession, as a plain text string:

Example
shell% curl 'http://pfam.janelia.org/family/acc/Piwi'
PF02171
shell% curl 'http://pfam.janelia.org/family/id/PF02171'
Piwi
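
When a script does not know in advance whether it holds an ID or an accession, it can pick the conversion URL from the shape of the input. The Python sketch below assumes that anything matching "PF" followed by five digits is an accession; the function name conversion_url is hypothetical.

```python
import re

# Pfam-A accessions look like "PF" followed by five digits, e.g. PF02171.
ACCESSION_RE = re.compile(r"PF\d{5}", re.IGNORECASE)


def conversion_url(entry, host="pfam.janelia.org"):
    """Return the plain-text conversion URL for an ID or an accession.

    An accession is sent to /family/id/ (to obtain the ID); anything
    else is treated as an ID and sent to /family/acc/.
    """
    if ACCESSION_RE.fullmatch(entry):
        return "http://{}/family/id/{}".format(host, entry)
    return "http://{}/family/acc/{}".format(host, entry)


if __name__ == "__main__":
    print(conversion_url("Piwi"))
    print(conversion_url("PF02171"))
```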

Pfam-A annotations

You can retrieve a sub-set of the data in a Pfam-A family page as an XML document using any of the following styles of URL:

http://pfam.janelia.org/family?id=Piwi&output=xml
http://pfam.janelia.org/family?output=xml&acc=PF02171
http://pfam.janelia.org/family?entry=Piwi&output=xml
http://pfam.janelia.org/family/Piwi?output=xml

The last two styles, using the entry parameter or an extended URL, accept either accessions or identifiers. The accession/ID is case-insensitive in all cases.

Example
shell% curl -F output=xml 'http://pfam.janelia.org/family/Piwi'
<?xml version="1.0" encoding="UTF-8"?>
<!-- information on Pfam-A family PF02171 (Piwi), generated: 16:35:52 26-Oct-2009 -->
<pfam xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xmlns="http://pfam.sanger.ac.uk/"
      xsi:schemaLocation="http://pfam.sanger.ac.uk/
                          http://pfam.sanger.ac.uk/static/documents/schemas/pfam_family.xsd"
      release="24.0"
      release_date="2009-10-07">
  <entry entry_type="Pfam-A" accession="PF02171" id="Piwi">
    <description>
<![CDATA[
Piwi domain
]]>
    </description>
    <comment>
<![CDATA[
This domain is found in the protein Piwi and its relatives.  The function of this 
domain is the dsRNA guided hydrolysis of ssRNA. Determination of the crystal 
structure of Argonaute reveals that PIWI is an RNase H domain, and identifies 
Argonaute as Slicer, the enzyme that cleaves mRNA in the RNAi RISC complex [2].  
In addition, Mg+2 dependence and production of 3'-OH and 5' phosphate products 
are shared characteristics of RNaseH and RISC. The PIWI domain core has a tertiary 
structure belonging to the RNase H family of enzymes.  RNase H fold proteins all 
have a five-stranded mixed beta-sheet surrounded by helices. By analogy to 
RNase H enzymes which cleave single-stranded RNA guided by the DNA strand in an 
RNA/DNA hybrid, the PIWI domain can be inferred to cleave single-stranded RNA, 
for example mRNA, guided by double stranded siRNA.
]]>
    </comment>
    <curation_details>
      <status>CHANGED</status>
      <seed_source>Bateman A</seed_source>
      <num_archs>16</num_archs>
      <num_seqs>
        <seed>21</seed>
        <full>756</full>
      </num_seqs>
      <num_species>140</num_species>
      <num_structures>22</num_structures>
      <percentage_identity>30</percentage_identity>
      <av_length>277.50</av_length>
      <av_coverage>33.67</av_coverage>
      <type>Family</type>
    </curation_details>
    <hmm_details hmmer_version="3.0b2" model_version="10" model_length="304">
      <build_commands>hmmbuild  -o /dev/null HMM SEED</build_commands>
      <search_commands>hmmsearch -Z 9421015 -E 1000 HMM pfamseq</search_commands>
      <cutoffs>
        <gathering>
          <sequence>19.9</sequence>
          <domain>19.9</domain>
        </gathering>
        <trusted>
          <sequence>20.0</sequence>
          <domain>21.0</domain>
        </trusted>
        <noise>
          <sequence>18.6</sequence>
          <domain>19.5</domain>
        </noise>
      </cutoffs>
    </hmm_details>
  </entry>
</pfam>

You can see the XML schema for this XML document here.
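
As an illustration of working with this document, the Python sketch below reads the three score cut-offs from an abbreviated copy of the response shown above. It uses only the standard library; the function name read_cutoffs is hypothetical. Note that every element lives in the http://pfam.sanger.ac.uk/ namespace, so the namespace has to be supplied when searching.

```python
import xml.etree.ElementTree as ET

NS = {"pfam": "http://pfam.sanger.ac.uk/"}

# Abbreviated copy of the family XML shown above.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<pfam xmlns="http://pfam.sanger.ac.uk/" release="24.0" release_date="2009-10-07">
  <entry entry_type="Pfam-A" accession="PF02171" id="Piwi">
    <hmm_details hmmer_version="3.0b2" model_version="10" model_length="304">
      <cutoffs>
        <gathering><sequence>19.9</sequence><domain>19.9</domain></gathering>
        <trusted><sequence>20.0</sequence><domain>21.0</domain></trusted>
        <noise><sequence>18.6</sequence><domain>19.5</domain></noise>
      </cutoffs>
    </hmm_details>
  </entry>
</pfam>"""


def read_cutoffs(xml_data):
    """Return (accession, {cutoff name: (sequence score, domain score)})."""
    root = ET.fromstring(xml_data)
    entry = root.find("pfam:entry", NS)
    cutoffs = {}
    for kind in ("gathering", "trusted", "noise"):
        node = entry.find("pfam:hmm_details/pfam:cutoffs/pfam:" + kind, NS)
        cutoffs[kind] = (
            float(node.findtext("pfam:sequence", namespaces=NS)),
            float(node.findtext("pfam:domain", namespaces=NS)),
        )
    return entry.get("accession"), cutoffs


if __name__ == "__main__":
    accession, cutoffs = read_cutoffs(SAMPLE)
    print(accession, cutoffs)
```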

Some Pfam families are removed or merged into others, in which case they become "dead" families. If you try to retrieve annotation information about a dead family, you'll get a simple XML document that only includes information on the replacement (if any) for the family:

Example
shell% curl -F output=xml 'http://pfam.janelia.org/family/PF06700'
<?xml version="1.0" encoding="UTF-8"?>
<!-- information on dead Pfam-A family PF06700 (2oxo_fer_oxidoB), generated: 16:34:44 26-Oct-2009 -->
<dead_pfam xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
           xmlns="http://pfam.sanger.ac.uk/"
           xsi:schemaLocation="http://pfam.sanger.ac.uk/
                               http://pfam.sanger.ac.uk/static/documents/schemas/pfam_family.xsd"
           release="24.0"
           release_date="2009-10-07">
  <entry accession="PF06700"
         id="2oxo_fer_oxidoB">
    <forward_to>PF02775</forward_to>
    <comment>Merged into TPP binding domain</comment>
  </entry>
</dead_pfam>

You can see the XML schema for this XML document here.
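
A script that follows links between releases will want to detect a dead family and move on to its replacement. The Python sketch below does this by checking the root element name and reading the forward_to element; the sample documents are shortened copies of the responses shown in this section, and the function name replacement_for is hypothetical.

```python
import xml.etree.ElementTree as ET

PFAM_NS = "{http://pfam.sanger.ac.uk/}"

# Shortened copies of the responses shown in this section.
DEAD = b"""<?xml version="1.0" encoding="UTF-8"?>
<dead_pfam xmlns="http://pfam.sanger.ac.uk/" release="24.0" release_date="2009-10-07">
  <entry accession="PF06700" id="2oxo_fer_oxidoB">
    <forward_to>PF02775</forward_to>
    <comment>Merged into TPP binding domain</comment>
  </entry>
</dead_pfam>"""

LIVE = b"""<pfam xmlns="http://pfam.sanger.ac.uk/" release="24.0">
  <entry entry_type="Pfam-A" accession="PF02171" id="Piwi" />
</pfam>"""


def replacement_for(xml_data):
    """Return the accession that a dead family forwards to.

    Returns None when the family is not dead, or when it is dead but
    no replacement is given.
    """
    root = ET.fromstring(xml_data)
    if root.tag != PFAM_NS + "dead_pfam":
        return None
    target = root.findtext("{0}entry/{0}forward_to".format(PFAM_NS))
    if target is None or not target.strip():
        return None
    return target.strip()


if __name__ == "__main__":
    print(replacement_for(DEAD))
    print(replacement_for(LIVE))
```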

Pfam-A family list

You can retrieve a list of all Pfam-A families in the latest Pfam release, either as an XML document or as a tab-delimited text file. Both formats contain the Pfam-A accession, Pfam-A identifier and description:

http://pfam.janelia.org/families?output=xml
http://pfam.janelia.org/families?output=text

You can also view the list in a web browser by removing the output=xml parameter from the URL.

Example
shell% curl -F output=xml 'http://pfam.janelia.org/families'
<?xml version="1.0" encoding="UTF-8"?>
<!-- all Pfam-A families, generated: 16:12:54 26-Oct-2009 -->
<pfam xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xmlns="http://pfam.sanger.ac.uk/"
      xsi:schemaLocation="http://pfam.sanger.ac.uk/
                          http://pfam.sanger.ac.uk/static/documents/schemas/pfam_families.xsd"
      release="24.0" 
      release_date="2009-10-07">
  <entry entry_type="Pfam-A" accession="PF00001" id="7tm_1">
    <description>
<![CDATA[
7 transmembrane receptor (rhodopsin family)
]]>
    </description>
  </entry>
  ...

You can see the XML schema for this XML document here.
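
The tab-delimited version of the list is often the easier one to process. The Python sketch below turns it into a dictionary keyed by accession. The exact layout of the text file is not shown in this documentation, so the column order (accession, identifier, description) and the skipping of lines that start with "#" are assumptions; the function name parse_family_list is hypothetical.

```python
def parse_family_list(text):
    """Map each Pfam-A accession to an (identifier, description) pair.

    Assumes three tab-separated columns per line: accession,
    identifier and description. Blank lines and lines starting
    with "#" are skipped.
    """
    families = {}
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        accession, identifier, description = line.split("\t", 2)
        families[accession] = (identifier, description.strip())
    return families


if __name__ == "__main__":
    sample = (
        "PF00001\t7tm_1\t7 transmembrane receptor (rhodopsin family)\n"
        "PF02171\tPiwi\tPiwi domain\n"
    )
    for accession, (identifier, description) in parse_family_list(sample).items():
        print(accession, identifier, description)
```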

Protein sequence data

You can retrieve a sub-set of the data in a protein page as an XML document using any of the following styles of URL:

http://pfam.janelia.org/protein?id=CANX_CHICK&output=xml
http://pfam.janelia.org/protein?output=xml&acc=P00789
http://pfam.janelia.org/protein?entry=P00789&output=xml
http://pfam.janelia.org/protein/P00789?output=xml

As for Pfam-A families, arguments are all case-insensitive and the entry parameter accepts either ID or accession.

Example
shell% curl -F output=xml 'http://pfam.janelia.org/protein/P00789'
<?xml version="1.0" encoding="UTF-8"?>
<!-- information on UniProt entry P00789 (CANX_CHICK), generated: 16:28:26 26-Oct-2009 -->
<pfam xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xmlns="http://pfam.sanger.ac.uk/"
      xsi:schemaLocation="http://pfam.sanger.ac.uk/
                          http://pfam.sanger.ac.uk/static/documents/schemas/protein.xsd"
      release="24.0"
      release_date="2009-10-07">
  <entry entry_type="sequence" db="uniprot" db_release="57.6" accession="P00789" id="CANX_CHICK">
    <description>
<![CDATA[
Calpain-1 catalytic subunit EC=3.4.22.52
]]>
    </description>
    <taxonomy tax_id="9031" species_name="Gallus gallus (Chicken)">
      Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Archosauria; 
      Dinosauria; Saurischia; Theropoda; Coelurosauria; Aves; Neognathae; Galliformes; 
      Phasianidae; Phasianinae; Gallus.
    </taxonomy>
    <sequence length="705" md5="934014b14ecb71623fa5898c7f81862a" crc64="ABCDDC56298E48AA" version="2">
      MMPFGGIAARLQRDRLRAEGVGEHNNAVKYLNQDYEALKQECIESGTLFRDPQFPAGPTALGFKELGPYSSKTR
      GVEWKRPSELVDDPQFIVGGATRTDICQGALGDCWLLAAIGSLTLNEELLHRVVPHGQSFQEDYAGIFHFQIWQ
      FGEWVDVVVDDLLPTKDGELLFVHSAECTEFWSALLEKAYAKLNGCYESLSGGSTTEGFEDFTGGVAEMYDLKR
      APRNMGHIIRKALERGSLLGCSIDITSAFDMEAVTFKKLVKGHAYSVTAFKDVNYRGQQEQLIRIRNPWGQVEW
      TGAWSDGSSEWDNIDPSDREELQLKMEDGEFWMSFRDFMREFSRLEICNLTPDALTKDELSRWHTQVFEGTWRR
      GSTAGGCRNNPATFWINPQFKIKLLEEDDDPGDDEVACSFLVALMQKHRRRERRVGGDMHTIGFAVYEVPEEAQ
      GSQNVHLKKDFFLRNQSRARSETFINLREVSNQIRLPPGEYIVVPSTFEPHKEADFILRVFTEKQSDTAELDEE
      ISADLADEEEITEDDIEDGFKNMFQQLAGEDMEISVFELKTILNRVIARHKDLKTDGFSLDSCRNMVNLMDKDG
      SARLGLVEFQILWNKIRSWLTIFRQYDLDKSGTMSSYEMRMALESAGFKLNNKLHQVVVARYADAETGVDFDNF
      VCCLVKLETMFRFFHSMDRDGTGTAVMNLAEWLLLTMCG
    </sequence>
    <matches>
      <match accession="PF00648" id="Peptidase_C2" type="Pfam-A">
        <location start="48" end="347" ali_start="48" ali_end="347" hmm_start="1" hmm_end="298" evalue="2.6e-148" bitscore="502.00" />
      </match>
      <match accession="PF01067" id="Calpain_III" type="Pfam-A">
        <location start="358" end="513" ali_start="358" ali_end="512" hmm_start="1" hmm_end="144" evalue="3.5e-57" bitscore="201.20" />
      </match>
    </matches>
  </entry>
</pfam>

You can see the XML schema for this XML document here.
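
The part of this document that most scripts need is the list of matches. The Python sketch below extracts it from an abbreviated copy of the response shown above (the description, taxonomy and sequence elements are omitted). The function name list_matches is hypothetical.

```python
import xml.etree.ElementTree as ET

NS = {"pfam": "http://pfam.sanger.ac.uk/"}

# Abbreviated copy of the protein XML shown above.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<pfam xmlns="http://pfam.sanger.ac.uk/" release="24.0" release_date="2009-10-07">
  <entry entry_type="sequence" db="uniprot" accession="P00789" id="CANX_CHICK">
    <matches>
      <match accession="PF00648" id="Peptidase_C2" type="Pfam-A">
        <location start="48" end="347" ali_start="48" ali_end="347" hmm_start="1" hmm_end="298" evalue="2.6e-148" bitscore="502.00" />
      </match>
      <match accession="PF01067" id="Calpain_III" type="Pfam-A">
        <location start="358" end="513" ali_start="358" ali_end="512" hmm_start="1" hmm_end="144" evalue="3.5e-57" bitscore="201.20" />
      </match>
    </matches>
  </entry>
</pfam>"""


def list_matches(xml_data):
    """Return one (accession, id, start, end, evalue) tuple per location."""
    root = ET.fromstring(xml_data)
    hits = []
    for match in root.iterfind("pfam:entry/pfam:matches/pfam:match", NS):
        for location in match.iterfind("pfam:location", NS):
            hits.append((
                match.get("accession"),
                match.get("id"),
                int(location.get("start")),
                int(location.get("end")),
                float(location.get("evalue")),
            ))
    return hits


if __name__ == "__main__":
    for hit in list_matches(SAMPLE):
        print(*hit, sep="\t")
```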

Sequence searches

The Pfam website includes a form that allows users to upload a protein sequence and see a list of the Pfam domains that are found on their search sequence. We've now implemented a RESTful interface to this search tool, making it possible to run single-sequence Pfam searches programmatically.

Running a search is a two step process:

  1. submit the search sequence and specify search parameters
  2. retrieve search results in XML format

The operation is separated into two steps, rather than performed as a single request, because the time taken to perform a sequence search varies according to the length of the sequence searched. Most web clients, whether browsers or scripts, will simply time out if a response is not received within a short period, usually less than a minute. By submitting a search, waiting, and then retrieving results as a separate operation, we avoid the risk of a client timing out before the results are returned.

The following example uses simple command-line tools to submit the search and retrieve results, but the whole process is easily transferred to a single script or program.

Save your sequence to file

It is usually most convenient to save your sequence into a plain text file, something like this:

Example
shell% cat test.seq 
MMASTENNEKDNFMRDTASRSKKSRRRSLWIAAGAVPTAIALSLSLASPA
AVAQSSFGSSDIIDSGVLDSITRGLTDYLTPRDEALPAGEVTYPAIEGLP
AGVRVNSAEYVTSHHVVLSIQSAAMPERPIKVQLLLPRDWYSSPDRDFPE
IWALDGLRAIEKQSGWTIETNIEQFFADKNAIVVLPVGGESSFYTDWNEP
NNGKNYQWETFLTEELAPILDKGFRSNGERAITGISMGGTAAVNIATHNP
EMFNFVGSFSGYLDTTSNGMPAAIGAALADAGGYNVNAMWGPAGSERWLE
NDPKRNVDQLRGKQVYVSAGSGADDYGQDGSVATGPANAAGVGLELISRM
TSQTFVDAANGAGVNVIANFRPSGVHAWPYWQFEMTQAWPYMADSLGMSR
EDRGADCVALGAIADATADGSLGSCLNNEYLVANGVGRAQDFTNGRAYWS
PNTGAFGLFGRINARYSELGGPDSWLGFPKTRELSTPDGRGRYVHFENGS
IYWSAATGPWEIPGDMFTAWGTQGYEAGGLGYPVGPAKDFNGGLAQEFQG
GYVLRTPQNRAYWVRGAISAKYMEPGVATTLGFPTGNERLIPGGAFQEFT
NGNIYWSASTGAHYILRGGIFDAWGAKGYEQGEYGWPTTDQTSIAAGGET
ITFQNGTIRQVNGRIEESR

The sequence should contain only valid sequence characters, i.e. letters, excluding "J" and "O". You can break the sequence across multiple lines to make it easier to handle.
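
It can be worth checking a sequence locally before submitting it. The Python sketch below joins a multi-line sequence and rejects any character that is not a letter other than "J" or "O". Converting to upper case is our assumption, and the function name clean_sequence is hypothetical; the server may apply further checks of its own.

```python
import string

# Letters accepted in a search sequence: A-Z without "J" and "O".
VALID_RESIDUES = set(string.ascii_uppercase) - {"J", "O"}


def clean_sequence(raw):
    """Join a multi-line sequence and check that its characters are valid.

    Whitespace (including line breaks) is removed and the result is
    upper-cased before validation.
    """
    sequence = "".join(raw.split()).upper()
    if not sequence:
        raise ValueError("sequence is empty")
    invalid = sorted(set(sequence) - VALID_RESIDUES)
    if invalid:
        raise ValueError("invalid sequence characters: " + "".join(invalid))
    return sequence


if __name__ == "__main__":
    print(clean_sequence("MMASTENNEKDNFMRDTASRSKKSRRRSLW\nIAAGAVPTAIALSLSLASPA\n"))
```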

Submit the search

Example
shell% curl -F seq='<test.seq' -F output=xml 'http://pfam.sanger.ac.uk/search/sequence'
<?xml version="1.0" encoding="UTF-8"?>
<jobs xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xmlns="http://pfam.sanger.ac.uk/"
      xsi:schemaLocation="http://pfam.sanger.ac.uk/
                          http://pfam.sanger.ac.uk/static/documents/schemas/submission.xsd">
  <job job_id="F69126C4-C24E-11DE-825F-800A2878356D">
    <opened>2009-10-26 16:45:27</opened>
    <result_url>http://pfam.sanger.ac.uk/search/sequence/resultset/F69126C4-C24E-11DE-825F-800A2878356D?output=xml</result_url>
  </job>
</jobs>

The XML schema for this document is the one named in its xsi:schemaLocation attribute: http://pfam.sanger.ac.uk/static/documents/schemas/submission.xsd

When using curl the value of the parameter "seq" needs to be quoted so that its value is taken correctly from the file "test.seq". The second parameter can also be added directly to the URL, as a regular CGI-style parameter, if you prefer.
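In a script, the job identifier and result URL can be pulled out of the submission response shown above. This is a minimal sketch using Python's standard library; the function name parse_submission is hypothetical, and the namespace is taken from the xmlns attribute in the example response.

```python
import xml.etree.ElementTree as ET

# Default namespace declared in the example response.
PFAM_NS = {"pfam": "http://pfam.sanger.ac.uk/"}


def parse_submission(xml_text: str) -> list[tuple[str, str]]:
    """Return (job_id, result_url) pairs from a submission response."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    jobs = []
    for job in root.findall("pfam:job", PFAM_NS):
        url = job.findtext("pfam:result_url", default="", namespaces=PFAM_NS)
        jobs.append((job.get("job_id"), url.strip()))
    return jobs
```

Because every element in the response lives in the default namespace, a plain root.findall("job") would find nothing; the prefix mapping is what makes the lookups match.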

The search service accepts the following parameters (you can see a more complete description of these settings here):

Parameter | Description                  | Accepted values           | Default | Notes
evalue    | use this E-value cut-off     | valid float               | 1.0     | to use the gathering threshold for the family, set "ga=1" and don't specify an E-value. If an E-value is given, it will be used, regardless of the value of "ga"
ga        | use gathering threshold      | 0 or 1                    | 0       |
searchBs  | do search for Pfam-B hits    | 0 or 1                    | 0       | setting "skipAs=1" implies "searchBs=1"; you must search for at least one type of family
skipAs    | don't search for Pfam-A hits | 0 or 1                    | 0       |
seq       | protein sequence             | valid sequence characters | none    | required
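When the parameters are assembled by a script, the constraints in the table can be enforced on the client side before anything is submitted. The sketch below is illustrative only: build_search_params is a hypothetical helper, and the decisions to reject "ga" combined with an E-value and to switch on Pfam-B searching whenever Pfam-A is skipped are choices made here so that the request is unambiguous and at least one type of family is searched.

```python
def build_search_params(seq, evalue=None, ga=False,
                        search_bs=False, skip_as=False):
    """Build the form fields for a sequence search submission."""
    if not seq:
        raise ValueError('"seq" is required')
    if ga and evalue is not None:
        # An E-value always wins over "ga", so refuse the ambiguous request.
        raise ValueError('give either "evalue" or "ga", not both')
    if skip_as:
        # At least one type of family must be searched (choice of this sketch).
        search_bs = True
    params = {
        "seq": seq,
        "output": "xml",
        "searchBs": str(int(search_bs)),
        "skipAs": str(int(skip_as)),
    }
    if ga:
        params["ga"] = "1"
    elif evalue is not None:
        params["evalue"] = repr(float(evalue))
    return params
```

The returned dictionary holds only strings, so it can be passed to whichever HTTP client you use to send the form fields.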

Wait for the search to complete

Although you can check for results immediately, if you poll before your job has completed you won't receive an XML document. Instead, the HTTP response will have its status code set appropriately and the body of the response will contain only a string giving the status. Ideally, you should check the HTTP status of the response rather than relying on its body.

These are the possible status codes for the response:

HTTP status code | Status description    | Response body | Notes
202              | Accepted              | PEND / RUN    | The job has been accepted by the search system and is either pending (waiting to be started) or running. After a short delay, your script should check for results again
502              | Bad gateway           | FAIL          | There was a problem scheduling or running the job. The job has failed and will not produce results. There is no need to check the status again
503              | Service unavailable   | HOLD          | Your job was accepted but is on hold. This status will not be assigned by the search system, but by an administrator. There is probably a problem with the job and you should contact the help desk for assistance with it
410              | Gone                  | DEL           | Your job was deleted from the search system. This status will not be assigned by the search system, but by an administrator. There was probably a problem with the job and you should contact the help desk for assistance with it
500              | Internal server error | Error message | There was some problem with running your job, but it does not fall into any of the other categories. The body of the response will contain an error message from the server. Contact the help desk for assistance with the problem

When writing a script to submit searches and retrieve results, please add a short delay between the submission and the first attempt to retrieve results. Most search jobs complete within four to five seconds of submission, although the time depends heavily on the length of the sequence being searched.
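The waiting step might be scripted along the following lines. This is a sketch under stated assumptions rather than a reference client: poll_for_results and its fetch argument are hypothetical (fetch is any callable that takes a URL and returns the HTTP status code and response body), the delay values are arbitrary, and treating status 200 as "results ready" is an assumption, since only the non-success codes are listed above.

```python
import time


def poll_for_results(fetch, result_url, initial_delay=5.0,
                     interval=2.0, max_attempts=30, sleep=time.sleep):
    """Poll result_url until the job finishes, then return the body."""
    sleep(initial_delay)              # short delay before the first check
    for _ in range(max_attempts):
        status, body = fetch(result_url)
        if status == 200:             # assumed: results are ready
            return body
        if status != 202:             # 502, 503, 410, 500: stop polling
            raise RuntimeError(f"job ended with HTTP {status}: {body.strip()}")
        sleep(interval)               # 202: still pending or running
    raise TimeoutError(f"no results after {max_attempts} attempts")
```

Keeping the HTTP call behind fetch means the loop can be exercised without touching the server. A real fetch could wrap urllib.request.urlopen, taking care to catch urllib.error.HTTPError so that the 4xx and 5xx codes are returned rather than raised.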

Retrieve results

The XML returned by the submission includes one or more URLs from which you can now retrieve results, each given in a <result_url> element. You can poll these URLs to retrieve XML documents containing the search hits.

Example
shell% curl 'http://pfam.sanger.ac.uk/search/sequence/resultset/F69126C4-C24E-11DE-825F-800A2878356D?output=xml'
<?xml version="1.0" encoding="UTF-8"?>
<pfam xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xmlns="http://pfam.sanger.ac.uk/"
      xsi:schemaLocation="http://pfam.sanger.ac.uk/
                          http://pfam.sanger.ac.uk/static/documents/schemas/results.xsd"
      release="24.0"
      release_date="2009-10-07">
  <results job_id="F69126C4-C24E-11DE-825F-800A2878356D">
    <matches>
      <protein length="669">
        <database id="pfam" release="24.0" release_date="2009-10-07">
          <match accession="PF02532.7" id="PsbI" type="Pfam-A" class="Family">
            <location start="297" end="308" ali_start="298" ali_end="306" 
              hmm_start="23" hmm_end="31" evalue="2.4e+02" bitscore="1.1" 
              evidence="hmmer v3.0b2" significant="0" />
          </match>
          <match accession="PF10055.2" id="DUF2292" type="Pfam-A" class="Family">
            <location start="647" end="664" ali_start="650" ali_end="662" 
              hmm_start="19" hmm_end="31" evalue="49" bitscore="3.0" 
              evidence="hmmer v3.0b2" significant="0" />
          </match>
          <match accession="PF09772.2" id="Tmem26" type="Pfam-A" class="Family">
            <location start="164" end="187" ali_start="167" ali_end="185" 
              hmm_start="260" hmm_end="278" evalue="3.3e+02" bitscore="-0.1" 
              evidence="hmmer v3.0b2" significant="0" />
          </match>
          <match accession="PF04335.6" id="VirB8" type="Pfam-A" class="Family">
            <location start="9" end="57" ali_start="12" ali_end="49" 
              hmm_start="4" hmm_end="41" evalue="17" bitscore="4.6" 
              evidence="hmmer v3.0b2" significant="0" />
          </match>
          ...

The XML schema for this document is the one named in its xsi:schemaLocation attribute: http://pfam.sanger.ac.uk/static/documents/schemas/results.xsd
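The hits in a results document can be flattened into a list for further processing. The sketch below is one possible approach with Python's standard library; parse_matches is a hypothetical name, and SAMPLE is a cut-down document built from one match in the example above, with the closing tags (which the example elides) added on the assumption that they simply mirror the opening ones.

```python
import xml.etree.ElementTree as ET

PFAM_NS = {"pfam": "http://pfam.sanger.ac.uk/"}

# One match from the example, with assumed closing tags.
SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<pfam xmlns="http://pfam.sanger.ac.uk/" release="24.0">
  <results job_id="F69126C4-C24E-11DE-825F-800A2878356D">
    <matches>
      <protein length="669">
        <database id="pfam" release="24.0" release_date="2009-10-07">
          <match accession="PF04335.6" id="VirB8" type="Pfam-A" class="Family">
            <location start="9" end="57" ali_start="12" ali_end="49"
              hmm_start="4" hmm_end="41" evalue="17" bitscore="4.6"
              evidence="hmmer v3.0b2" significant="0" />
          </match>
        </database>
      </protein>
    </matches>
  </results>
</pfam>"""


def parse_matches(xml_text: str) -> list[dict]:
    """Return one dictionary per <location> element in a results document."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    hits = []
    for match in root.iterfind(".//pfam:match", PFAM_NS):
        for loc in match.iterfind("pfam:location", PFAM_NS):
            hits.append({
                "accession": match.get("accession"),
                "id": match.get("id"),
                "start": int(loc.get("start")),
                "end": int(loc.get("end")),
                "evalue": float(loc.get("evalue")),
                "bitscore": float(loc.get("bitscore")),
                "significant": loc.get("significant") == "1",
            })
    return hits
```

Filtering on the "significant" key, for example [h for h in parse_matches(SAMPLE) if h["significant"]], keeps only the hits that the server flagged as significant; for the sample above that list is empty.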

Since the search is performed by the same server as searches in the Pfam website, you can view your results in a web page by modifying the URL slightly:

http://pfam.janelia.org/search/sequence/results/F69126C4-C24E-11DE-825F-800A2878356D

Note that old search results are generally cleared out after some time, so if you wait too long before trying to view your hits in the website, you may find that they are already gone.

Database documentation

Documentation update

October 2009

The documentation in this tab is currently out of date. Although the general information is still largely accurate, the details of the site and underlying database may be inaccurate. We hope to update the documentation within the coming weeks.

This section describes the tables in the Pfam MySQL database and shows example queries. Installation packages and documentation on the MySQL database itself can be found on the MySQL website.


VERSION table

The VERSION table contains information that relates to a particular Pfam release. It contains the version number of the Pfam database, the version numbers of the Swiss-Prot and TrEMBL databases that were used to build Pfam, and some statistics about the number of families and coverage. This table is stand-alone and does not link to any of the other tables.

Example query Give me all of the version information for the Pfam database
SQL
mysql> SELECT * FROM version \G
                     pfam_release: 22.0
                pfam_release_date: 2007-07-10
               swiss_prot_version: 51.7
                   trembl_version: 34.7
                    hmmer_version: 2.3.2
                   pfamA_coverage: 73.2
        pfamB_additional_coverage: 11.6
           pfamA_residue_coverage: 50.8
pfamB_additional_residue_coverage: 6.5
                  number_families: 9318

Domain information

Two of the central tables in the database are pfamseq, which contains the UniProtKB sequence database, and pfamA, which contains information about the Pfam-A families. The pfamA_reg_seed table contains the Pfam domains that are present in a seed alignment, and the pfamA_reg_full table contains all of the sequence regions that match the HMM for each family. Note that the pfamA_reg_full table contains both the significant and the insignificant matches.

The pfamA_reg_full_significant and pfamA_reg_full_insignificant tables contain, as the names suggest, the significant and insignificant matches respectively. Significant matches are those with a bit score above the curated threshold for the family, whilst insignificant matches are those that score below it. The tables that contain significant data (pfamA_reg_full_significant and pfamA_reg_full) have an extra column called 'in_full'. Matches that are present in the full alignment for a Pfam family have this column set to 1, while those that are not present in the full alignment have it set to 0. Where a fragment match and a full-length match to the same Pfam-A family overlap, only one of the matches will be present in the full alignment for that family.

The Pfam database has historically been built on the UniProtKB database; however, as of release 22.0 we also provide Pfam domain data for the NCBI sequence database (genpept) and a set of metagenomics sequences. Further information about querying the NCBI and metagenomics data sets can be found below.

Example query Give me all of the domains for UniProtKB protein sequence 'VAV_HUMAN'
SQL
mysql> SELECT pfamA_acc, \
              pfamA_id, \
              seq_start, \
              seq_end \
       FROM   pfamseq, \
              pfamA, \
              pfamA_reg_full_significant \
       WHERE  pfamseq_id = 'VAV_HUMAN' \
       AND    in_full = 1 \
       AND    pfamseq.auto_pfamseq = pfamA_reg_full_significant.auto_pfamseq \
       AND    pfamA_reg_full_significant.auto_pfamA = pfamA.auto_pfamA;
+-----------+----------+-----------+---------+
| pfamA_acc | pfamA_id | seq_start | seq_end |
+-----------+----------+-----------+---------+
| PF00307   | CH       |         2 |     120 |
| PF00621   | RhoGEF   |       198 |     372 |
| PF00017   | SH2      |       671 |     745 |
| PF00169   | PH       |       403 |     504 |
| PF00130   | C1_1     |       516 |     568 |
| PF00018   | SH3_1    |       785 |     840 |
| PF00018   | SH3_1    |       617 |     658 |
+-----------+----------+-----------+---------+
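The join pattern used in the query above (pfamseq to pfamA_reg_full_significant to pfamA, linked through the auto_ keys) can be tried out without a MySQL server. The following sketch uses an in-memory SQLite database from Python's standard library. It is a toy: the tables are cut down to the columns the query touches, the auto_ key values are invented, and the third region row is an invented fragment added only to show the effect of in_full.

```python
import sqlite3

QUERY = """
    SELECT a.pfamA_acc, a.pfamA_id, r.seq_start, r.seq_end
    FROM   pfamseq AS s
    JOIN   pfamA_reg_full_significant AS r ON r.auto_pfamseq = s.auto_pfamseq
    JOIN   pfamA AS a ON a.auto_pfamA = r.auto_pfamA
    WHERE  s.pfamseq_id = ? AND r.in_full = 1
    ORDER  BY r.seq_start
"""


def build_toy_db() -> sqlite3.Connection:
    """Create a tiny stand-in for three Pfam tables (invented key values)."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE pfamseq (auto_pfamseq INTEGER PRIMARY KEY, pfamseq_id TEXT);
        CREATE TABLE pfamA (auto_pfamA INTEGER PRIMARY KEY,
                            pfamA_acc TEXT, pfamA_id TEXT);
        CREATE TABLE pfamA_reg_full_significant (
            auto_pfamseq INTEGER, auto_pfamA INTEGER,
            seq_start INTEGER, seq_end INTEGER, in_full INTEGER);
        INSERT INTO pfamseq VALUES (1, 'VAV_HUMAN');
        INSERT INTO pfamA VALUES (10, 'PF00307', 'CH'), (11, 'PF00017', 'SH2');
        INSERT INTO pfamA_reg_full_significant VALUES
            (1, 10, 2, 120, 1),
            (1, 11, 671, 745, 1),
            (1, 11, 700, 730, 0);  -- invented fragment, not in the full alignment
    """)
    return db


def domains_for(db: sqlite3.Connection, pfamseq_id: str) -> list[tuple]:
    """Return the in_full = 1 regions for one sequence."""
    return db.execute(QUERY, (pfamseq_id,)).fetchall()
```

The explicit JOIN ... ON form is equivalent to the comma-separated FROM list with join conditions in the WHERE clause used elsewhere in this section, and passing the sequence identifier as a bound parameter avoids building SQL strings by hand.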

UniProtKB sequences have secondary accessions if they have been merged or split. Secondary accession numbers are stored in the table called secondary_pfamseq_acc.

Example query Give me the secondary accession(s) for the UniProtKB sequence 'P15455'
SQL
mysql> SELECT pfamseq_acc, \
              secondary_acc \
       FROM   pfamseq, \
              secondary_pfamseq_acc \
       WHERE  pfamseq.auto_pfamseq = secondary_pfamseq_acc.auto_pfamseq \
       AND    pfamseq_acc= 'P15455';
+-------------+---------------+
| pfamseq_acc | secondary_acc |
+-------------+---------------+
| P15455      | Q3E711        |
| P15455      | Q9FFH7        |
+-------------+---------------+
Example query Give me all the UniProtKB sequences in the full alignment for the family 'B12D'
SQL
mysql> SELECT pfamseq_id, \
              pfamseq_acc, \
              seq_start, \
              seq_end, \
              pfamA_id \
       FROM   pfamA, \
              pfamseq, \
              pfamA_reg_full_significant \
       WHERE  pfamA_id = 'B12D' \
       AND    in_full = 1 \
       AND    pfamA.auto_pfamA = pfamA_reg_full_significant.auto_pfamA \
       AND    pfamA_reg_full_significant.auto_pfamseq = pfamseq.auto_pfamseq;
+--------------+-------------+-----------+---------+----------+
| pfamseq_id   | pfamseq_acc | seq_start | seq_end | pfamA_id |
+--------------+-------------+-----------+---------+----------+
| O22414_ORYSA | O22414      |         3 |      89 | B12D     |
| Q01BS9_OSTTA | Q01BS9      |         5 |     119 | B12D     |
| A3BLZ4_ORYSJ | A3BLZ4      |        76 |     155 | B12D     |
| Q69F92_PHAVU | Q69F92      |        30 |     128 | B12D     |
| A6MZE4_ORYSI | A6MZE4      |         3 |      89 | B12D     |
| Q84MX3_ORYSJ | Q84MX3      |         1 |      75 | B12D     |
| Q6YU35_ORYSJ | Q6YU35      |        11 |      97 | B12D     |
| A7PAD2_VITVI | A7PAD2      |         1 |      99 | B12D     |
| A7PD80_VITVI | A7PD80      |         6 |      92 | B12D     |
...
Example query Give me all the UniProtKB sequences in the seed alignment for the family 'B12D'
SQL
mysql> SELECT pfamseq_id, \
              seq_start, \
              seq_end, \
              pfamA_id \
       FROM   pfamA, \
              pfamseq, \
              pfamA_reg_seed \
       WHERE  pfamA_id = 'B12D' \
       AND    pfamA.auto_pfamA = pfamA_reg_seed.auto_pfamA \
       AND    pfamA_reg_seed.auto_pfamseq = pfamseq.auto_pfamseq;
+--------------+-----------+---------+----------+
| pfamseq_id   | seq_start | seq_end | pfamA_id |
+--------------+-----------+---------+----------+
| Q9XHD5_IPOBA |         3 |      89 | B12D     |
| Q940E1_CASSA |        29 |     116 | B12D     |
| Q42338_ARATH |         2 |      88 | B12D     |
| Q9LJ47_ARATH |         1 |      87 | B12D     |
| Q6YU38_ORYSJ |         2 |      84 | B12D     |
| O22414_ORYSA |         3 |      89 | B12D     |
| Q1H8M8_BETVU |         4 |      90 | B12D     |
| Q6YU35_ORYSJ |        11 |      97 | B12D     |
| Q6Z4G5_ORYSJ |         1 |      87 | B12D     |
+--------------+-----------+---------+----------+

Other regions, active site and disulphide bond information for a sequence

Other regions, sites, disulphides

These tables contain sequence-specific information about the sequences in the UniProtKB database. The other_regions table contains coiled-coil, low-complexity, signal-peptide and transmembrane regions. The context_pfam_regions table contains context domains; context domains are those that do not score above the family gathering threshold, but are expected to be real based on the presence of the surrounding domains found in the protein. The pfamseq_markup table contains active site information which is taken from the UniProtKB feature table. Additional active site residues are predicted by Pfam based on conserved residues in a Pfam alignment. The pfamseq_disulphide table contains disulphide bond information from the UniProtKB feature table.

Example query Give me all of the Pfam-B regions for the UniProtKB sequence 'VAV_HUMAN'
SQL
mysql> SELECT DISTINCT pfamB.pfamB_acc, \
              pfamB_id, \
              seq_start, \
              seq_end \
       FROM   pfamB_reg, \
              pfamB, \
              pfamseq \
       WHERE  pfamseq_id = 'VAV_HUMAN' \
       AND    pfamB_reg.auto_pfamseq = pfamseq.auto_pfamseq \
       AND    pfamB_reg.auto_pfamB = pfamB.auto_pfamB;
+-----------+--------------+-----------+---------+
| pfamB_acc | pfamB_id     | seq_start | seq_end |
+-----------+--------------+-----------+---------+
| PB017628  | Pfam-B_17628 |       155 |     181 |
| PB045706  | Pfam-B_45706 |       585 |     616 |
+-----------+--------------+-----------+---------+
Example query Give me all of the transmembrane, signal-peptide, coiled-coil and low-complexity information for the UniProtKB sequence 'VAV_HUMAN'
SQL
mysql> SELECT type_id, \
              source_id, \
              seq_start, \
              seq_end, \
              score \
       FROM   other_reg, \
              pfamseq \
       WHERE  pfamseq.pfamseq_id = 'VAV_HUMAN' \
       AND    other_reg.auto_pfamseq = pfamseq.auto_pfamseq;
+----------------+-----------+-----------+---------+--------+
| type_id        | source_id | seq_start | seq_end | score  |
+----------------+-----------+-----------+---------+--------+
| low_complexity | seg       |        42 |      51 | 1.5700 |
| low_complexity | seg       |       356 |     367 | 2.1900 |
+----------------+-----------+-----------+---------+--------+
Example query Give me all of the context regions for the UniProtKB sequence 'Q8I6U6_PLAF7'
SQL
mysql> SELECT seq_start, \
              seq_end, \
              domain_score, \
              pfamA.pfamA_acc, \
              pfamA_id, \
              pfamA.description \
       FROM   context_pfam_regions, \
              pfamseq, \
              pfamA \
       WHERE  pfamseq.pfamseq_id = 'Q8I6U6_PLAF7' \
       AND    context_pfam_regions.auto_pfamseq = pfamseq.auto_pfamseq \
       AND    pfamA.auto_pfamA = context_pfam_regions.auto_pfamA;
+-----------+---------+--------------+-----------+----------+--------------------------------------+
| seq_start | seq_end | domain_score | pfamA_acc | pfamA_id | description                          |
+-----------+---------+--------------+-----------+----------+--------------------------------------+
|     10250 |   10279 |     12800.00 | PF06513   | DUF1103  | Repeat of unknown function (DUF1103) |
+-----------+---------+--------------+-----------+----------+--------------------------------------+
      
Example query Give me all of the active site information for UniProtKB sequence 'Q22CX9'
SQL
mysql> SELECT pfamseq_acc, \
              pfamseq_id, \
              residue, \
              label \
       FROM   pfamseq, \
              pfamseq_markup, \
              markup_key \
       WHERE  pfamseq.auto_pfamseq = pfamseq_markup.auto_pfamseq \
       AND    pfamseq_markup.auto_markup = markup_key.auto_markup \
       AND    pfamseq_acc = 'Q22CX9';
+-------------+--------------+---------+----------------------------+
| pfamseq_acc | pfamseq_id   | residue | label                      |
+-------------+--------------+---------+----------------------------+
| Q22CX9      | Q22CX9_TETTH |     276 | Pfam predicted active site |
| Q22CX9      | Q22CX9_TETTH |     305 | Pfam predicted active site |
| Q22CX9      | Q22CX9_TETTH |     337 | Pfam predicted active site |
+-------------+--------------+---------+----------------------------+
Example query Give me all the residues involved in disulphide bonds in the UniProtKB sequence 'Q43495'
SQL
mysql> SELECT pfamseq_acc, \
              pfamseq_id, \
              bond_start, \
              bond_end \
       FROM   pfamseq, \
              pfamseq_disulphide \
       WHERE  pfamseq_disulphide.auto_pfamseq = pfamseq.auto_pfamseq \
       AND    pfamseq_acc = 'Q43495';
+-------------+------------+------------+----------+ 
| pfamseq_acc | pfamseq_id | bond_start | bond_end | 
+-------------+------------+------------+----------+ 
| Q43495      | 108_SOLLC  |         41 |       77 | 
| Q43495      | 108_SOLLC  |         51 |       66 |
| Q43495      | 108_SOLLC  |         67 |       92 |
| Q43495      | 108_SOLLC  |         79 |       99 |
+-------------+------------+------------+----------+

Architecture information for a family

Architecture table

In Pfam, an architecture is the collection of domains that are present on a protein.

Example query Give me all of the architectures and UniProtKB protein sequences for the family 'Dehyd-heme_bind'
SQL
mysql> SELECT architecture, \
              pfamseq_id, \
              pfamseq_acc \
       FROM   architecture, \
              pfamseq \
       WHERE  architecture like '%Dehyd-heme_bind%' \
       AND    pfamseq.auto_architecture = architecture.auto_architecture;
+---------------------------------+--------------+-------------+
| architecture                    | pfamseq_id   | pfamseq_acc |
+---------------------------------+--------------+-------------+
| Dehyd-heme_bind~DUF1928         | A0NPA8_9RHOB | A0NPA8      |
| Dehyd-heme_bind~DUF1927~DUF1928 | Q8VUT0_PARDE | Q8VUT0      |
| Dehyd-heme_bind~DUF1927~DUF1928 | Q8VW85_PSEPU | Q8VW85      |
| Dehyd-heme_bind~DUF1927~DUF1928 | Q5P0U9_AZOSE | Q5P0U9      |
| Dehyd-heme_bind~DUF1927~DUF1928 | Q5P5Q6_AZOSE | Q5P5Q6      |
| Dehyd-heme_bind~DUF1927~DUF1928 | Q4K966_PSEF5 | Q4K966      |
| Dehyd-heme_bind~DUF1927~DUF1928 | Q3KBY9_PSEPF | Q3KBY9      |
| Dehyd-heme_bind~DUF1927~DUF1928 | Q2BKZ1_9GAMM | Q2BKZ1      |
| Dehyd-heme_bind~DUF1927~DUF1928 | Q2BHT3_9GAMM | Q2BHT3      |
| Dehyd-heme_bind~DUF1927~DUF1928 | Q1I9C7_PSEE4 | Q1I9C7      |
| Dehyd-heme_bind~DUF1927~DUF1928 | A1FCF3_PSEPU | A1FCF3      |
| Dehyd-heme_bind~DUF1927~DUF1928 | A1B2Q6_PARDP | A1B2Q6      |
| Dehyd-heme_bind~DUF1927~DUF1928 | A1K4V1_AZOSB | A1K4V1      |
+---------------------------------+--------------+-------------+
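Once the architecture strings have been retrieved, they can be split into their component families on the client side. The sketch below assumes, from the output above, that "~" separates the family identifiers in order along the sequence; the function names are hypothetical.

```python
from collections import Counter


def parse_architecture(architecture: str) -> list[str]:
    """Split an architecture string into its ordered list of family IDs."""
    return architecture.split("~")


def contains_family(architecture: str, family_id: str) -> bool:
    """Exact membership test on the family IDs of an architecture."""
    return family_id in parse_architecture(architecture)


def domain_counts(architecture: str) -> Counter:
    """Count how many times each family occurs in an architecture."""
    return Counter(parse_architecture(architecture))
```

An exact test of this kind can be a useful second filter after a LIKE '%...%' query, because the SQL pattern also matches any family whose identifier merely contains the search text: contains_family("EGF_CA~GPS~7tm_2", "EGF") is False, whereas LIKE '%EGF%' would match that row.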

Annotation information for a family

Literature references

In addition to the Pfam annotation for each family, we also store the InterPro annotation and its associated GO terms. Links to other databases (e.g. SCOP) are also stored where appropriate.

Example query Give me the Pfam annotation and family information for the family 'CBS'
SQL
mysql> SELECT * FROM pfamA WHERE pfamA_id = 'CBS' \G
      auto_pfamA: 124
       pfamA_acc: PF00571
        pfamA_id: CBS
     description: CBS domain pair
    model_length: 114
          author: Bateman A
     seed_source: [1]
alignment_method: Manual
            type: Domain
  ls_sequence_GA: 19.5000
    ls_domain_GA: 19.5000
  fs_sequence_GA: 20.7000
    fs_domain_GA: 20.7000
  ls_sequence_TC: 19.5000
    ls_domain_TC: 19.5000
  fs_sequence_TC: 20.7000
    fs_domain_TC: 20.7000
  ls_sequence_NC: 19.4000
    ls_domain_NC: 19.4000
  fs_sequence_NC: 20.6000
    fs_domain_NC: 20.6000
           ls_mu: -57.9224
        ls_kappa: 0.2126
           fs_mu: -9.4208
        fs_kappa: 0.6305
         comment:  CBS domains are small intracellular modules that pair ...
     previous_id: NULL
     hmmbuild_ls: hmmbuild -F --hand HMM_ls SEED
 hmmcalibrate_ls: hmmcalibrate --cpu 1 --seed 0 HMM_ls
     hmmbuild_fs: hmmbuild -f -F --hand HMM_fs SEED
 hmmcalibrate_fs: hmmcalibrate --cpu 1 --seed 0 HMM_fs
        num_seed: 0
        num_full: 0
         updated: 2008-01-29 19:23:10
         created: 2003-04-07 12:59:11
         version: NULL
Example query Give me the InterPro annotation for the family 'CBS'
SQL
mysql> SELECT interpro_id, \
              abstract \
       FROM   interpro, \
              pfamA \
       WHERE  pfamA.auto_pfamA = interpro.auto_pfamA \
       AND    pfamA_id = 'CBS'\G
interpro_id: IPR000644
   abstract: CBS (cystathionine-beta-synthase) domains are small ...
         
Example query Give me the gene ontology (GO) annotation and family information for the family 'p450'
SQL
mysql> SELECT go_id, \
              term, \
              category \
       FROM   gene_ontology AS go, \
              pfamA AS p \
       WHERE  go.auto_pfamA = p.auto_pfamA \
       AND    pfamA_id = 'p450';
+------------+------------------------+----------+
| go_id      | term                   | category |
+------------+------------------------+----------+
| GO:0020037 | heme binding           | function |
| GO:0005506 | iron ion binding       | function |
| GO:0006118 | electron transport     | process  |
| GO:0004497 | monooxygenase activity | function |
+------------+------------------------+----------+
      
Example query Give me all of the literature references for the family 'CBS'
SQL
mysql> SELECT pfamA_literature_references.comment, \
              order_added, \
              medline, \
              title, \
              literature_references.author, \
              journal \
       FROM   pfamA, \
              pfamA_literature_references, \
              literature_references \
       WHERE  pfamA_id = 'CBS' \
       AND    pfamA.auto_pfamA = pfamA_literature_references.auto_pfamA \
       AND    pfamA_literature_references.auto_lit = literature_references.auto_lit \G
*************************** 1. row ***************************
    comment: NULL
order_added: 4
    medline: 11524006
      title: Regulation of human cystathionine beta-synthase by S-adenosyl-L-methionine: evidence for two catalytically active conformations involving an autoinhibitory domain in the C-terminal region.
     author: Janosik M, Kery V, Gaustadnes M, Maclean KN, Kraus JP
    journal: Biochemistry 2001;40:10625-10633.
*************************** 2. row ***************************
    comment: Discovery of CBS domain.
order_added: 3
    medline: 9106071
      title: CBS domains in ClC chloride channels implicated in myotonia and nephrolithiasis (kidney stones).
     author: Ponting CP;
    journal: J Mol Med 1997;75:160-163.
*************************** 3. row ***************************
    comment: 3D Structure found as a sub-domain in TIM barrel ofinosine-monophosphate dehydrogenase.
order_added: 2
    medline: 10200156
      title: Characteristics and crystal structure of bacterial inosine-5'-monophosphate dehydrogenase.
     author: Zhang R, Evans G, Rotella FJ, Westbrook EM, Beno D, Huberman E, Joachimiak A, Collart FR;
    journal: Biochemistry 1999;38:4691-4700.
*************************** 4. row ***************************
    comment: Discovery and naming of the CBS domain.
order_added: 1
    medline: 9020585
      title: The structure of a domain common to archaebacteria and the homocystinuria disease protein.
     author: Bateman A;
    journal: Trends Biochem Sci 1997;22:12-13.
*************************** 5. row ***************************
    comment: NULL
order_added: 5
    medline: 14722619
      title: CBS domains form energy-sensing modules whose binding of adenosine ligands is disrupted by disease mutations.
     author: Scott JW, Hawley SA, Green KA, Anis M, Stewart G, Scullion GA, Norman DG, Hardie DG;
    journal: J Clin Invest 2004;113:274-284.
*************************** 6. row ***************************
    comment: NULL
order_added: 6
    medline: 14722609
      title: Bateman domains and adenosine derivatives form a binding contract.
     author: Kemp BE;
    journal: J Clin Invest 2004;113:182-184.
6 rows in set (0.18 sec)
Example query Give me all of the database references for the family 'A2M'
SQL
mysql> SELECT db_id, \
              pfamA_database_links.comment, \
              db_link, \
              other_params \
       FROM   pfamA, \
              pfamA_database_links \
       WHERE  pfamA_id = 'A2M' \
       AND    pfamA.auto_pfamA = pfamA_database_links.auto_pfamA;
+----------+---------+-----------+--------------+
| db_id    | comment | db_link   | other_params |
+----------+---------+-----------+--------------+
| HOMSTRAD | NULL    | A2M_B     |              |
| HOMSTRAD | NULL    | A2M_A     |              |
| SCOP     | NULL    | 1c3d      | fa;          |
| PROSITE  | NULL    | PDOC00440 |              |
+----------+---------+-----------+--------------+

Note: The other_params column contains 'fa;' where the Pfam family corresponds to a SCOP family, and 'sf;' where the Pfam family corresponds to a SCOP superfamily.
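A script that reads the SCOP rows can translate the other_params code into the level it stands for. This is a small sketch of that lookup; scop_level is a hypothetical name, and only the two codes described in the note are handled.

```python
# Codes described in the note above.
SCOP_LEVELS = {"fa": "family", "sf": "superfamily"}


def scop_level(other_params: str):
    """Map an other_params value such as 'fa;' to a SCOP level, or None."""
    code = other_params.strip().rstrip(";")
    return SCOP_LEVELS.get(code)
```

Rows from other databases, where other_params is empty, yield None rather than raising an error.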


Clan data

Clan table

A clan contains a set of related Pfam-A families. The information we use to determine which families belong to the same clan includes related structure, related function, matching of the same sequence to HMMs from different families, and profile-profile comparisons. Note that not all Pfam-A families belong to a clan.

Example query Give me the ID of the clan to which Pfam family 'EGF' belongs
SQL
mysql> SELECT clan_id, \
              clan_acc \
       FROM   clans, \
              clan_membership, \
              pfamA \
       WHERE  clans.auto_clan = clan_membership.auto_clan \
       AND    clan_membership.auto_pfamA = pfamA.auto_pfamA \
       AND    pfamA.pfamA_id = 'EGF';
+---------+----------+
| clan_id | clan_acc |
+---------+----------+
| EGF     | CL0001   |
+---------+----------+
Example query Give me all of the Pfam-A families that belong to clan 'CL0001'
SQL
mysql> SELECT pfamA_acc, \
              pfamA_id \
       FROM   clans, \
              clan_membership, \
              pfamA \
       WHERE  clans.auto_clan = clan_membership.auto_clan \
       AND    clan_membership.auto_pfamA = pfamA.auto_pfamA \
       AND    clan_acc = 'CL0001';
+-----------+-----------------+
| pfamA_acc | pfamA_id        |
+-----------+-----------------+
| PF07645   | EGF_CA          |
| PF04863   | EGF_alliinase   |
| PF07974   | EGF_2           |
| PF09120   | EGF-like_subdom |
| PF00008   | EGF             |
| PF09289   | FOLN            |
| PF00053   | Laminin_EGF     |
| PF09064   | Tme5_EGF_like   |
+-----------+-----------------+
Example query Give me the clan description and comment for clan 'CL0001'
SQL
mysql> SELECT clan_acc, \
              clan_id, \
              clan_description, \
              clan_comment \
       FROM   clans \
       WHERE  clan_acc = 'CL0001' \G
        clan_acc: CL0001
         clan_id: EGF
clan_description: EGF superfamily
    clan_comment: Members of this clan all belong to the EGF superfamily  ...
      
Example query Give me the literature references for clan 'CL0001'
SQL
mysql> SELECT comment, \
              order_added, \
              medline, \
              title, \
              author, \
              journal \
       FROM   clans, \
              literature_references, \
              clan_lit_refs \
       WHERE  clans.auto_clan = clan_lit_refs.auto_clan \
       AND    clan_lit_refs.auto_lit = literature_references.auto_lit \
       AND    clan_acc = 'CL0001' \G
*************************** 1. row ***************************
    comment: NULL
order_added: 2
    medline: 11852228
      title: Domain structure and organisation in extracellular matrix proteins.
     author: Hohenester E, Engel J;
    journal: Matrix Biol 2002;21:115-128.
*************************** 2. row ***************************
    comment: NULL
order_added: 1
    medline: 3282918
      title: Structure and function of epidermal growth factor-like regions in proteins.
     author: Appella E, Weber IT, Blasi F;
    journal: FEBS Lett 1988;231:1-4.
Example query Give me the first 5 architectures for the clan 'CL0001'
SQL
mysql> SELECT architecture, \
              type_example, \
              no_seqs \
       FROM   architecture, \
              clan_architecture, \
              clans \
       WHERE  architecture.auto_architecture = clan_architecture.auto_architecture \
       AND    clan_architecture.auto_clan = clans.auto_clan \
       AND    clan_acc = 'CL0001' \
       LIMIT  5 \G
*************************** 1. row ***************************
architecture: Ldl_recept_a~EGF_CA~Ldl_recept_b~Ldl_recept_a~EGF~Ldl_recept_b~EGF~Ldl_recept_b
type_example: 2403803
     no_seqs: 1
*************************** 2. row ***************************
architecture: EGF_CA~Pkinase_Tyr~Pkinase_Tyr~Pkinase_Tyr
type_example: 1309029
     no_seqs: 1
*************************** 3. row ***************************
architecture: EGF_CA~GPS~7tm_2
type_example: 168833
     no_seqs: 11
*************************** 4. row ***************************
architecture: EGF_CA~EGF_CA~EGF_CA~Sushi~Sushi~Sushi~Sushi
type_example: 1486049
     no_seqs: 1
*************************** 5. row ***************************
architecture: EGF~EGF~EGF~EGF~EGF_CA~EGF~EGF~EGF_CA~EGF~EGF~EGF_CA~EGF~EGF~EGF~EGF~EGF~EGF~EGF~EGF~EGF~EGF~EGF~EGF~EGF~EGF~EGF~EGF~EGF~EGF~EGF~EGF~EGF~EGF~EGF~EGF~EGF~Notch~Notch~Notch~NOD~NODP~Ank
type_example: 57059
     no_seqs: 1
Example query Give me the database links for clan 'CL0001'
SQL
mysql> SELECT db_id, \
              comment, \
              db_link, \
              other_params \
       FROM   clan_database_links, \
              clans \
       WHERE  clan_database_links.auto_clan = clans.auto_clan \
       AND    clan_acc = 'CL0001';
+-------+---------+------------+--------------+
| db_id | comment | db_link    | other_params |
+-------+---------+------------+--------------+
| SCOP  | NULL    | 57196      |              |
| CATH  | NULL    | 2.10.25.10 |              |
+-------+---------+------------+--------------+

Dead families and clans

Dead families

Sometimes we find that two or more families within Pfam can be merged into a single family, which leads to the deletion of Pfam-A families. The dead_families and dead_clans tables contain information about families and clans that have been deleted. These tables may be of use if you need to track what happened to the members of a particular family that is no longer in Pfam.

Example query Give me all of the information about 'dead' Pfam-A family 'PF06700'
SQL
mysql> SELECT * FROM dead_families WHERE pfamA_acc = 'PF06700';
+-----------+-----------------+---------------------------------+------------+
| pfamA_acc | pfamA_id        | comment                         | forward_to |
+-----------+-----------------+---------------------------------+------------+
| PF06700   | 2oxo_fer_oxidoB |  Merged into TPP binding domain | PF02775    |
+-----------+-----------------+---------------------------------+------------+
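If you hold a list of old accessions, the forward_to column can be used to map each one to its current family. The sketch below works on a plain dictionary built from (pfamA_acc, forward_to) rows; resolve_accession is a hypothetical helper. Following more than one hop is an assumption made here in case a family that absorbed another was itself merged later; the table above shows only a single hop.

```python
def resolve_accession(acc: str, forward_to: dict) -> str:
    """Follow forward_to links from a dead accession to a live one."""
    seen = set()
    while forward_to.get(acc):
        if acc in seen:
            raise ValueError(f"forwarding loop at {acc}")
        seen.add(acc)
        acc = forward_to[acc]
    return acc
```

With the row shown above, resolve_accession("PF06700", {"PF06700": "PF02775"}) returns "PF02775", and an accession that is not in the dictionary is returned unchanged.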

Hidden Markov model (HMM) tables

HMM tables

The tables pfamA_HMM_ls and pfamA_HMM_fs contain the HMMs for the global and fragment models respectively. It is unlikely that you will need to query these tables. The table pfamA_web contains information about the percentage identity, average length and average coverage for a Pfam-A family.


Nested domains

Nested domains

Some domains in Pfam are disrupted by the insertion of another domain (or domains) within them. The domain that is inserted into another is known as a nested domain.

Example query Give me all of the nested domains and the domains in which they are nested
SQL
mysql> SELECT A.pfamA_id, \
              B.pfamA_id AS nested_domain \
       FROM   pfamA AS A, \
              pfamA AS B, \
              nested_domains \
       WHERE  A.auto_pfamA = nested_domains.auto_pfamA \
       AND    B.auto_pfamA = nested_domains.nests_auto_pfamA;
+-----------------+-----------------+
| pfamA_id        | nested_domain   |
+-----------------+-----------------+
| IMPDH           | CBS             |
| PAP_central     | NTP_transf_2    |
| Peptidase_M10   | fn2             |
| UCH             | zf-MYND         |
| Radical_SAM     | Fer4            |
| Asp             | SapB_1          |
| Asp             | SapB_2          |
| HhH-GPD         | HHH             |
| Peptidase_S8    | PA              |
| CAF1            | R3H             |
| CAF1            | zf-CCCH         |
| RNA_pol_Rpb1_5  | RNA_pol_Rpb1_7  |
...
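
The query returns one row per pair, so a family that hosts several nested domains (such as Asp or CAF1 above) appears more than once. As a small sketch, assuming the rows have already been fetched into Python as tuples, they can be grouped by the enclosing family. The variable names are illustrative and the rows are a subset copied from the output above.

```python
from collections import defaultdict

# (enclosing family, nested domain) pairs copied from the output above.
rows = [
    ("IMPDH", "CBS"),
    ("PAP_central", "NTP_transf_2"),
    ("Asp", "SapB_1"),
    ("Asp", "SapB_2"),
    ("CAF1", "R3H"),
    ("CAF1", "zf-CCCH"),
]

# Collect every nested domain under the family that contains it.
nested_in = defaultdict(list)
for outer, inner in rows:
    nested_in[outer].append(inner)

for outer, inners in nested_in.items():
    print(f"{outer}: {', '.join(inners)}")
```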

Structural data

PDB table

In order for the Protein Data Bank (PDB) information to be useful to Pfam, we need to map between PDB residues and UniProtKB sequence residues. This is not a trivial task, and the mapping information is provided by the Macromolecular Structure Database (MSD) group. The msd_data table contains this residue-by-residue mapping.

Example query Give me the PDB information for structure '2abl'
SQL
mysql> SELECT pdb_id, \
              header, \
              title \
       FROM   pdb \
       WHERE  pdb_id = '2abl';
+--------+-------------+-------------------------------------------------------------+
| pdb_id | header      | title                                                       |
+--------+-------------+-------------------------------------------------------------+
| 2abl   | TRANSFERASE |    SH3-SH2 DOMAIN FRAGMENT OF HUMAN BCR-ABL TYROSINE KINASE |
+--------+-------------+-------------------------------------------------------------+
Example query Give me the first 10 residue mappings for the structure '2abl'
SQL
mysql> SELECT pdb_id, \
              pdb_res, \
              pdb_seq_number, \
              pfamseq_acc, \
              pfamseq_res, \
              pfamseq_seq_number \
       FROM   msd_data, \
              pdb, \
              pfamseq \
       WHERE  pdb.auto_pdb = msd_data.auto_pdb \
       AND    pfamseq.auto_pfamseq = msd_data.auto_pfamseq \
       AND    pdb_id = '2abl' \
       LIMIT  10;
+--------+---------+----------------+-------------+-------------+--------------------+
| pdb_id | pdb_res | pdb_seq_number | pfamseq_acc | pfamseq_res | pfamseq_seq_number |
+--------+---------+----------------+-------------+-------------+--------------------+
| 2abl   | GLY     |             76 | P00519      | G           |                 57 |
| 2abl   | PRO     |             77 | P00519      | P           |                 58 |
| 2abl   | SER     |             78 | P00519      | S           |                 59 |
| 2abl   | GLU     |             79 | P00519      | E           |                 60 |
| 2abl   | ASN     |             80 | P00519      | N           |                 61 |
| 2abl   | ASP     |             81 | P00519      | D           |                 62 |
| 2abl   | PRO     |             82 | P00519      | P           |                 63 |
| 2abl   | ASN     |             83 | P00519      | N           |                 64 |
| 2abl   | LEU     |             84 | P00519      | L           |                 65 |
| 2abl   | PHE     |             85 | P00519      | F           |                 66 |
+--------+---------+----------------+-------------+-------------+--------------------+
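
To illustrate how the residue mapping might be used, here is a short Python sketch that turns rows like those above into a lookup from PDB residue numbers to UniProtKB sequence positions. The rows are the first five from the output; the three-letter code table covers only the residues used here, and all names in the code are illustrative.

```python
# Rows copied from the output above:
# (pdb_res, pdb_seq_number, pfamseq_res, pfamseq_seq_number)
rows = [
    ("GLY", 76, "G", 57),
    ("PRO", 77, "P", 58),
    ("SER", 78, "S", 59),
    ("GLU", 79, "E", 60),
    ("ASN", 80, "N", 61),
]

# Three-letter to one-letter codes for the residues used here.
THREE_TO_ONE = {"GLY": "G", "PRO": "P", "SER": "S", "GLU": "E", "ASN": "N"}

pdb_to_uniprot = {}
for pdb_res, pdb_num, seq_res, seq_num in rows:
    # Sanity check: both sides should describe the same amino acid.
    assert THREE_TO_ONE[pdb_res] == seq_res
    pdb_to_uniprot[pdb_num] = seq_num

print(pdb_to_uniprot[78])  # 59
```

A per-residue lookup is used rather than a fixed offset, because the sample rows alone cannot show whether the difference between the two numbering schemes stays constant along a whole chain.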
Example query Give me the structural data for the family 'CBS'
SQL
mysql> SELECT pdb_id, \
             chain, \
             pdb_res_start, \
             pdb_res_end \
      FROM   pdb, \
             pdb_pfamA_reg, \
             pfamA \
      WHERE  pfamA_id = 'CBS' \
      AND    pfamA.auto_pfamA = pdb_pfamA_reg.auto_pfamA \
      AND    pdb_pfamA_reg.auto_pdb = pdb.auto_pdb;
+--------+-------+---------------+-------------+
| pdb_id | chain | pdb_res_start | pdb_res_end |
+--------+-------+---------------+-------------+
| 1vrd   | A     |            87 |         202 |
| 1vrd   | B     |            87 |         202 |
| 1nfb   | B     |           112 |         232 |
| 1nfb   | A     |           112 |         232 |
| 1b3o   | B     |           112 |         232 |
| 1jr1   | A     |           112 |         232 |
| 1pvm   | B     |             7 |         126 |
| 1pvm   | A     |             7 |         126 |
| 1yav   | A     |            20 |         141 |
| 1yav   | B     |            20 |         141 |
...

Genomes

Genome tables

The tables in this section allow you to retrieve domain information about a particular species, or to retrieve all of the species that contain a particular Pfam domain.

Example query Return all the species and basic Pfam information for 'bacteria'
SQL
mysql> SELECT   ncbi_code, \
                species, \
                num_distinct_regions, \
                num_total_regions, \
                num_proteins, \
                sequence_coverage, \
                residue_coverage, \
                total_genome_proteins \
       FROM     genome_species \
       WHERE    grouping LIKE '%Bacteria%' \
       ORDER BY species \G
*************************** 1. row ***************************
            ncbi_code: 62977
              species: Acinetobacter sp. (strain ADP1)
 num_distinct_regions: 1394
    num_total_regions: 3105
         num_proteins: 2388
    sequence_coverage: 72
     residue_coverage: 56
total_genome_proteins: 3302
*************************** 2. row ***************************
            ncbi_code: 180835
              species: Agrobacterium tumefaciens (strain C58 / ATCC 33970 (Washington University))
 num_distinct_regions: 1644
    num_total_regions: 5184
         num_proteins: 4077
    sequence_coverage: 76
     residue_coverage: 59
total_genome_proteins: 5397
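
The two sample rows are consistent with reading sequence_coverage as the percentage of a genome's proteins that carry at least one Pfam match, that is, num_proteins divided by total_genome_proteins. That interpretation is an assumption drawn only from these two rows, not a documented definition. The sketch below simply recomputes the figure so that the reading can be checked; the function name is illustrative.

```python
# Values copied from the two rows above:
# (species, num_proteins, total_genome_proteins, sequence_coverage)
rows = [
    ("Acinetobacter sp. (strain ADP1)", 2388, 3302, 72),
    ("Agrobacterium tumefaciens (strain C58 / ATCC 33970)", 4077, 5397, 76),
]

def percent_with_domain(num_proteins, total_genome_proteins):
    """Assumed reading: percentage of genome proteins with a Pfam match."""
    return round(100 * num_proteins / total_genome_proteins)

for species, num_proteins, total, reported in rows:
    computed = percent_with_domain(num_proteins, total)
    print(f"{species}: computed {computed}, reported {reported}")
```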
Example query Give me all the Pfam-A domains for the species 'Arabidopsis thaliana'
SQL
mysql> SELECT   genome_seqs.auto_pfamA, \
                pfamA_acc, \
                pfamA_id, \
                description, \
                sum(count) \
       FROM     genome_seqs, \
                pfamA \
       WHERE    genome_seqs.ncbi_code = '3702' \
       AND      genome_seqs.auto_pfamA = pfamA.auto_pfamA \
       GROUP BY genome_seqs.auto_pfamA;
+------------+-----------+--------------+-----------------------------------------------------------------+------------+
| auto_pfamA | pfamA_acc | pfamA_id     | description                                                     | sum(count) |
+------------+-----------+--------------+-----------------------------------------------------------------+------------+
|          1 | PF00389   | 2-Hacid_dh   | D-isomer specific 2-hydroxyacid dehydrogenase, catalytic domain |          9 |
|          2 | PF00198   | 2-oxoacid_dh | 2-oxoacid dehydrogenases acyltransferase (catalytic domain)     |         10 |
|          4 | PF03171   | 2OG-FeII_Oxy | 2OG-Fe(II) oxygenase superfamily                                |        118 |
|          5 | PF01073   | 3Beta_HSD    | 3-beta hydroxysteroid dehydrogenase/isomerase family            |          4 |
|          6 | PF04419   | 4F5          | 4F5 protein family                                              |          1 |
|          7 | PF03061   | 4HBT         | Thioesterase superfamily                                        |         12 |
|         11 | PF01661   | Macro        | Macro domain                                                    |          3 |
|         13 | PF00962   | A_deaminase  | Adenosine/AMP deaminase                                         |          2 |
|         14 | PF01490   | Aa_trans     | Transmembrane amino acid transporter protein                    |         49 |
|         15 | PF00004   | AAA          | ATPase family associated with various cellular activities (AAA) |        121 |
...

Note: The ncbi_code for the species 'Arabidopsis thaliana' is 3702. This information can be found in the ncbi_taxonomy table.

Example query Give me all of the UniProtKB protein sequences for the species 'Arabidopsis thaliana'
SQL
mysql> SELECT pfamseq.pfamseq_id \
       FROM   pfamseq, \
              genome_seqs \
       WHERE  pfamseq.ncbi_code = '3702' \
       AND    genome_seqs.auto_pfamseq = pfamseq.auto_pfamseq;
+-------------+
| pfamseq_id  |
+-------------+
| 12S1_ARATH  |
| 12S2_ARATH  |
| 14331_ARATH |
| 14332_ARATH |
| 14333_ARATH |
| 14334_ARATH |
| 14335_ARATH |
| 14336_ARATH |
| 14337_ARATH |
...
Example query Give me all of the UniProtKB protein sequences from the species 'Arabidopsis thaliana' that belong to Pfam-A domain 'PF00106'
SQL
mysql> SELECT pfamseq.pfamseq_id \
       FROM   pfamseq, \
              genome_seqs, \
              pfamA \
       WHERE  genome_seqs.ncbi_code = '3702' \
       AND    genome_seqs.auto_pfamseq = pfamseq.auto_pfamseq \
       AND    genome_seqs.auto_pfamA = pfamA.auto_pfamA \
       AND    pfamA_acc = 'PF00106';
+--------------+
| pfamseq_id   |
+--------------+
| FABG_ARATH   |
| PORA_ARATH   |
| PORB_ARATH   |
| PORC_ARATH   |
| O22985_ARATH |
| O49332_ARATH |
| O80711_ARATH |
| O80713_ARATH |
| O80714_ARATH |
| O80924_ARATH |
...

Related families

Related families

PRC and SCOOP are two pieces of software that we use to determine which Pfam families are related. The scores from these programs have been very useful in deciding which Pfam-A families should belong to the same clan. As a rough guide, a PRC E-value of less than 0.001 or a SCOOP score greater than 50 indicates that two families are closely related.

Example query Give me all of the Pfam-A families that have a PRC E-value score of less than 0.001 for the pfamA family 'ABC1'
SQL
mysql> SELECT A.pfamA_id, \
              B.pfamA_id, \
              model_start1, \
              model_end1, \
              length1, \
              model_start2, \
              model_end2, \
              length2, \
              evalue \
       FROM   pfamA AS A, \
              pfamA AS B, \
              pfamA2pfamA_PRC_results \
       WHERE  A.auto_pfamA = pfamA2pfamA_PRC_results.auto_pfamA1 \
       AND    B.auto_pfamA = pfamA2pfamA_PRC_results.auto_pfamA2 \
       AND    evalue < 1e-03 \
       AND    A.pfamA_id = 'ABC1';
+----------+-------------+--------------+------------+---------+--------------+------------+---------+--------------+
| pfamA_id | pfamA_id    | model_start1 | model_end1 | length1 | model_start2 | model_end2 | length2 | evalue       |
+----------+-------------+--------------+------------+---------+--------------+------------+---------+--------------+
| ABC1     | Pkinase_Tyr |           17 |         57 |      41 |            5 |         52 |      48 | 1.284523e-04 |
| ABC1     | Pkinase     |           17 |         43 |      27 |            5 |         31 |      27 | 2.537716e-04 |
| ABC1     | ABC1        |            1 |        126 |     126 |            1 |        126 |     126 | 3.600000e-68 |
+----------+-------------+--------------+------------+---------+--------------+------------+---------+--------------+

Note: The model_start and model_end values show which regions of the two models are similar.

Example query Give me all of the Pfam-A families that have a PRC E-value score of less than 0.001 when compared to the Pfam-B family 'PB007609'
SQL THIS QUERY GIVES NO RESULTS
mysql> SELECT pfamB_acc, \
              pfamA_id, \
              model_start1, \
              model_end1, \
              length1, \
              model_start2, \
              model_end2, \
              length2, \
              evalue \
       FROM   pfamB AS b, \
              pfamB2pfamA_PRC_results AS ab, \
              pfamA AS a \
       WHERE  b.auto_pfamB = ab.auto_pfamB \
       AND    ab.auto_pfamA = a.auto_pfamA \
       AND    evalue < 1e-03 \
       AND    pfamB_acc = 'PB007609';
+-----------+----------------+--------------+------------+---------+--------------+------------+---------+--------------+
| pfamB_acc | pfamA_id       | model_start1 | model_end1 | length1 | model_start2 | model_end2 | length2 | evalue       |
+-----------+----------------+--------------+------------+---------+--------------+------------+---------+--------------+
| PB007609  | Ala_racemase_N |           13 |         72 |      60 |            2 |         65 |      64 | 3.600000e-09 |
+-----------+----------------+--------------+------------+---------+--------------+------------+---------+--------------+

Note: This query currently returns no results. PRC comparisons for Pfam-B families were not added to Pfam release 23.0, due to computational constraints. We hope to re-instate this data in a later release.

Example query Give me all of the Pfam-B families that have a PRC E-value score of less than 0.001 when compared to the Pfam-B family 'PB006860'
SQL THIS QUERY GIVES NO RESULTS
mysql> SELECT a.pfamB_acc, \
              b.pfamB_acc, \
              model_start1, \
              model_end1, \
              length1, \
              model_start2, \
              model_end2, \
              length2, \
              evalue \
       FROM   pfamB AS a, \
              pfamB AS b, \
              pfamB2pfamB_PRC_results AS ab \
       WHERE  a.auto_pfamB = ab.auto_pfamB1 \
       AND    b.auto_pfamB = ab.auto_pfamB2 \
       AND    evalue < 1e-03 \
       AND    a.pfamB_acc = 'PB006860';
+-----------+-----------+--------------+------------+---------+--------------+------------+---------+--------------+
| pfamB_acc | pfamB_acc | model_start1 | model_end1 | length1 | model_start2 | model_end2 | length2 | evalue       |
+-----------+-----------+--------------+------------+---------+--------------+------------+---------+--------------+
| PB006860  | PB006860  |            1 |         62 |      62 |            1 |         62 |      62 | 1.300000e-44 |
+-----------+-----------+--------------+------------+---------+--------------+------------+---------+--------------+

Note: This query currently returns no results. PRC comparisons for Pfam-B families were not added to Pfam release 23.0, due to computational constraints. We hope to re-instate this data in a later release.

Example query Give me all of the Pfam-A families that have a SCOOP score greater than 50 when compared to the family 'ABC1'
SQL
mysql> SELECT a.pfamA_id, \
              b.pfamA_id, \
              score \
       FROM   pfamA AS a, \
              pfamA AS b, \
              pfamA2pfamA_scoop_results AS ab \
       WHERE  a.auto_pfamA = ab.auto_pfamA1 \
       AND    b.auto_pfamA = ab.auto_pfamA2 \
       AND    a.pfamA_id = 'ABC1' \
       AND    score > 50;
+----------+----------+---------+
| pfamA_id | pfamA_id | score   |
+----------+----------+---------+
| ABC1     | APH      | 83.0761 |
+----------+----------+---------+
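
The rough-guide thresholds described at the start of this section can be written as a small helper. This is only a sketch of that rule of thumb in Python: the function and constant names are made up for illustration, and treating either criterion as sufficient follows the "or" in the guidance rather than any documented procedure.

```python
PRC_EVALUE_MAX = 1e-3   # "less than 0.001" in the text
SCOOP_SCORE_MIN = 50    # "greater than 50" in the text

def closely_related(prc_evalue=None, scoop_score=None):
    """Apply the rough-guide thresholds; either criterion is sufficient."""
    by_prc = prc_evalue is not None and prc_evalue < PRC_EVALUE_MAX
    by_scoop = scoop_score is not None and scoop_score > SCOOP_SCORE_MIN
    return by_prc or by_scoop

# Values taken from the ABC1 example outputs above.
print(closely_related(prc_evalue=1.284523e-04))  # ABC1 vs Pkinase_Tyr
print(closely_related(scoop_score=83.0761))      # ABC1 vs APH

# Made-up values that fail both thresholds.
print(closely_related(prc_evalue=0.5, scoop_score=12.0))
```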

NCBI data

NCBI tables

In addition to searching all of the sequences in UniProtKB, we also search the protein sequences from NCBI against Pfam. The ncbi_pfamA_reg table contains all of the sequence regions (both significant and insignificant) that match each HMM. The ncbi_map table links each GI number to its corresponding UniProtKB entry or entries. Note that not all GI numbers have a corresponding UniProtKB entry.

Example query Give me all of the Pfam-A domains for NCBI protein 'GI:1000125'
SQL
mysql> SELECT pfamA_acc, \
              pfamA_id, \
              seq_start, \
              seq_end \
       FROM   ncbi_pfamA_reg, \
              pfamA \
       WHERE  ncbi_pfamA_reg.gi = '1000125' \
       AND    ncbi_pfamA_reg.auto_pfamA = pfamA.auto_pfamA \
       AND    in_full = 1;
+-----------+-----------+-----------+---------+
| pfamA_acc | pfamA_id  | seq_start | seq_end |
+-----------+-----------+-----------+---------+
| PF00069   | Pkinase   |       657 |     916 |
| PF00433   | Pkinase_C |       936 |     983 |
| PF02185   | HR1       |        47 |     119 |
| PF02185   | HR1       |       136 |     213 |
| PF02185   | HR1       |       217 |     294 |
+-----------+-----------+-----------+---------+

Note: The query must include 'in_full=1' in order to retrieve only significant hits.
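
The effect of the in_full filter can be seen with a toy example. The Python sketch below uses an in-memory SQLite table as a simplified stand-in for ncbi_pfamA_reg: it stores pfamA_id directly instead of joining through auto_pfamA, and the row marked as invented is a hypothetical insignificant hit added purely to show the difference. The other three rows are taken from the output above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ncbi_pfamA_reg "
    "(gi INTEGER, pfamA_id TEXT, seq_start INTEGER, seq_end INTEGER, "
    "in_full INTEGER)"
)
conn.executemany(
    "INSERT INTO ncbi_pfamA_reg VALUES (?, ?, ?, ?, ?)",
    [
        (1000125, "Pkinase", 657, 916, 1),    # from the output above
        (1000125, "Pkinase_C", 936, 983, 1),  # from the output above
        (1000125, "HR1", 47, 119, 1),         # from the output above
        (1000125, "Made_up", 300, 340, 0),    # invented insignificant hit
    ],
)

# Without the filter, insignificant matches are counted too.
all_hits = conn.execute(
    "SELECT COUNT(*) FROM ncbi_pfamA_reg WHERE gi = 1000125"
).fetchone()[0]

# With in_full = 1, only significant matches remain.
significant = conn.execute(
    "SELECT COUNT(*) FROM ncbi_pfamA_reg WHERE gi = 1000125 AND in_full = 1"
).fetchone()[0]

print(all_hits, significant)  # 4 3
```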

Example query Give me all of the NCBI protein domains for the Pfam-A family 'AalphaY_MDB'
SQL
mysql> SELECT gi, \
              seq_start, \
              seq_end, \
              pfamA_id \
       FROM   pfamA, \
              ncbi_pfamA_reg \
       WHERE  pfamA_id = 'AalphaY_MDB' \
       AND    pfamA.auto_pfamA = ncbi_pfamA_reg.auto_pfamA \
       AND    in_full = 1;
+---------+-----------+---------+-------------+
| gi      | seq_start | seq_end | pfamA_id    |
+---------+-----------+---------+-------------+
| 8650517 |         1 |     147 | AalphaY_MDB |
| 2314885 |         1 |     149 | AalphaY_MDB |
|  169861 |         1 |     146 | AalphaY_MDB |
|  169855 |         1 |     146 | AalphaY_MDB |
|  169857 |         1 |     147 | AalphaY_MDB |
+---------+-----------+---------+-------------+

Note: The query must include 'in_full=1' in order to retrieve only significant hits.


Metagenomics data

Metaseq tables

We have searched a set of metagenomics sequences against Pfam. The metagenomics sequences that we searched are found in the metaseq table. Note that the meta_pfamA_reg table differs from the ncbi_pfamA_reg and pfamA_reg_full tables in that it contains only significant data.

Example query Give me all of the Pfam-A domains for metagenomics protein 'JCVI_ORF_1096665732460'
SQL
mysql> SELECT pfamA_acc, \
              pfamA_id, \
              seq_start, \
              seq_end \
       FROM   metaseq, \
              pfamA, \
              meta_pfamA_reg \
       WHERE  metaseq_id = 'JCVI_ORF_1096665732460' \
       AND    metaseq.auto_metaseq = meta_pfamA_reg.auto_metaseq \
       AND    meta_pfamA_reg.auto_pfamA = pfamA.auto_pfamA;
+-----------+-----------+-----------+---------+
| pfamA_acc | pfamA_id  | seq_start | seq_end |
+-----------+-----------+-----------+---------+
| PF02934   | GatB_N    |         1 |      65 |
| PF01162   | GatB      |        82 |     150 |
| PF02637   | GatB_Yqey |       151 |     284 |
+-----------+-----------+-----------+---------+
Example query Give me all of the metagenomics domains for the family '3-alpha'
SQL
mysql> SELECT metaseq_id, \
              seq_start, \
              seq_end, \
              pfamA_id \
       FROM   pfamA, \
              metaseq, \
              meta_pfamA_reg \
       WHERE  pfamA_id = '3-alpha' \
       AND    pfamA.auto_pfamA = meta_pfamA_reg.auto_pfamA \
       AND    meta_pfamA_reg.auto_metaseq = metaseq.auto_metaseq;
+------------------------+-----------+---------+----------+
| metaseq_id             | seq_start | seq_end | pfamA_id |
+------------------------+-----------+---------+----------+
| JCVI_ORF_1096672077456 |       120 |     166 | 3-alpha  |
| JCVI_ORF_1096672196106 |       133 |     178 | 3-alpha  |
| JCVI_ORF_1096687352712 |       172 |     218 | 3-alpha  |
| JCVI_ORF_1096685142424 |       172 |     218 | 3-alpha  |
| JCVI_ORF_1096690616764 |        16 |      62 | 3-alpha  |
| JCVI_ORF_1096700525628 |       183 |     229 | 3-alpha  |
| JCVI_ORF_1096670986256 |        46 |      92 | 3-alpha  |
+------------------------+-----------+---------+----------+

Pfam FTP site

The following list describes a few of the important files in the Pfam FTP site. Some of these files may be very large (of the order of several hundred megabytes). Please check the sizes on the FTP site before trying to download them over a slow connection.

relnotes.txt
Release notes
pfamseq.gz
A fasta version of Pfam's underlying sequence database
Pfam-A.hmm.gz
The Pfam HMM library for Pfam-A families
Pfam-B.hmm.gz
The Pfam HMM library for Pfam-B families
Pfam-A.full.gz
The full alignments of the curated families
Pfam-A.seed.gz
The seed alignments of the curated families
Pfam-B.gz
Automatically generated alignments of sequence clusters in SWISSPROT and TrEMBL that are not modelled in the curated part of Pfam
Pfam-C.gz
This file contains the information about clans and their Pfam-A membership
swisspfam.gz
The domain structure of SWISSPROT and TrEMBL proteins according to Pfam
COPYRIGHT
Copyright notice for Pfam
GNULICENSE
The full text of the GNU Library General Public License under which Pfam is licensed

Installing the Pfam website

Documentation update

October 2009

The documentation in this tab is currently out of date. Although the general information is still largely accurate, the details of the site and underlying database may be inaccurate. We hope to update the documentation within the coming weeks.

The source code for this website and the ancillary systems that it uses are all freely available for download. The website is designed to be portable, so that it can be installed and run at your local site if required. This section gives an overview of the requirements for running the site, a brief description of the steps involved in installing it, and links to detailed installation instructions.

Requirements

Software

The site is written entirely in Perl, using the Catalyst web framework. It runs under mod_perl in the Apache web-server. All data are retrieved from MySQL databases, running locally. Sequence searches are performed by a separate job queuing system, which uses various third-party software to perform searches, generate alignments, etc., including HMMER, genewise and wublast.

Hardware resources

The hardware requirements for the whole system are significant. Although it is possible to install all components on a single machine, we would not recommend it. Ideally you should have one or more web-server machines, a separate database server, and one or more machines to serve the back-end job queuing system. That said, although we give an idea of the hardware that we use at WTSI below, a local installation could be run on a significantly lower specification system.

Web server

The Pfam website consists mainly of dynamically generated pages, with a large number of statically served items. The best performance can be gained by separating the two kinds of data onto two (or more) separate machines, so that dynamic data are served by one server, static data by another. If this is not possible, a single large machine should still give reasonable performance. We serve a development site from a 3GHz Intel Xeon with 4Gb of physical memory.

Database server

You will first need a reasonable amount of disk space for the database files. The database is distributed as a set of gzip-compressed table dumps, which total about 12Gb. Once uncompressed these table files take up around 35Gb and once the tables are installed into MySQL, the database will require around 150Gb of disk space.

The MySQL database daemon will run happily on most machines, but in order to get the performance required to serve the website, you will need a machine with a fast processor (preferably multiple processors) and a large amount of memory. We run our database on a four processor AMD Opteron 280 server with 8Gb of physical memory.

Queuing system

Our job queuing system can be run on the same machine as the website or database server, but we would recommend running it on a separate machine or, ideally, on a farm of machines. This will ensure that the site can handle multiple requests for sequence alignments, sequence searches, etc. We run our queuing system on a farm of 14 dual-core 2.8GHz Xeons, each with 4Gb of physical memory.

Installation

You will need to install three sub-systems:

  • the MySQL databases
  • the back-end job queuing system
  • the website itself

Database

  1. If you don't have it already, install MySQL
  2. Download the database files from the WTSI FTP site
  3. Install the database tables in your MySQL server

Back-end

  1. Install the required third-party software such as HMMER
  2. Install the perl prerequisites
  3. Download the data files for running the offline searches
  4. Retrieve the queuing system code from CVS
  5. Configure and start the queues

Website

  1. Configure cpan
  2. Install catalyst
  3. Install perl prerequisites for the website
  4. Retrieve the website code from CVS
  5. Configure the website
  6. Configure apache
  7. Restart apache

Detailed installation instructions

The process of installing the three sub-systems is described in detail in three Portable Document Format (PDF) files. You will need a PDF-reader in order to view these instructions.

Database installation notes
installing the Pfam databases
Offline script installation notes
installing the "backend" scripts that run the job queuing system
PfamWeb installation notes
installing the website itself

Privacy issues

This section outlines the ways in which the Pfam website handles information about users. This should not be read as a legal document, but as a description of how we handle information that could be considered sensitive. It should be read in conjunction with the privacy policy documents of the individual Pfam consortium member sites. If you have any concerns about the way that information is used in the website, please contact us at the address given at the bottom of the page and we will be more than happy to discuss your concerns.

Although we make every possible effort to keep this site and the data that it manipulates safe and secure, we make no claim to be able to protect sensitive or privileged information. If you are at all concerned about sensitive information being released, please do not use the site and consider installing the Pfam database and/or this website locally.

Urchin

We use Urchin, a software package closely related to Google Analytics (GA), to track the usage of this website. Urchin uses a single-pixel "web bug" image, which is served from every page, a javascript script that collects information about each request, and cookies that maintain information about your usage of the site between visits. You can read more about how GA works on the Google Analytics website, which includes a detailed description of how traffic is tracked and analysed.

We use the information generated by Urchin purely for audit and accounting purposes, and to help us assess the usefulness and popularity of different features of the site. It does not provide the ability to track individual users' usage of the site. However, Urchin does provide a high-level overview of the traffic that passes through the site, including such information as the approximate geographical location of users, how often and for how long they visited the site, etc.

We understand that this level of tracking may be worrying to some of our users. If you have any concerns about our use of Urchin, please feel free to contact us.

Browsing

All web servers maintain fairly detailed logs of their activity. This includes keeping a record of every request that they serve, usually along with the IP address of the client that made the request. This is true of the web servers that host the Pfam websites.

Although our servers do collect information about your IP address during the normal process of serving the Pfam website, we do not use this information explicitly. The Pfam group uses server logs only to help with development and debugging of the site.

Searches

The sequence search feature of the site allows you to upload a protein or DNA sequence to be searched against our library of HMMs. The sequence that you upload is stored in a database and is retrieved by a set of scripts that actually perform the search. Although we do not have any information that could be used to link that sequence to you personally, you should be aware that the sequence itself is accessible to systems administrators and other users who maintain the Pfam site.

The batch search function allows you to submit larger searches, the results of which are emailed to you. Obviously, this requires you to provide identifiable information, namely an email address. However, beyond the routine backups of our databases, we do not store any information about email addresses and sequences in the longer term and we make no attempt to keep track of the searches that a particular user may be performing.

Information from other types of search, such as a keyword search, is held only in the web server logs but, as described above, no attempt is made to interpret these logs except as part of development or debugging of the site.

Cookies

We use cookies to maintain some information about you between your visits to the site. The information that is stored cannot be used to identify you personally and cannot be used to track your usage of the site.

If you are at all concerned about the use of cookies in the Pfam site, you are free to block all cookies from this site and you should not experience any problems. You may see some unintended behaviour, such as being notified of all new features every time you visit the index page, but the core functionality of the site should be unaffected.

Third-party javascript libraries

This site makes heavy use of javascript and relies on javascript libraries that are developed by various groups and companies. In order to improve the performance of the Pfam website, we no longer serve these files ourselves, but rely on files that are hosted on third-party web-servers. In particular, we use various files that are provided by the AJAX libraries APIs, hosted by Google Code, and components of the Yahoo! User Interface Library (YUI), hosted by Yahoo!.

As these services are provided by commercial sites, it's likely that their usage will be carefully monitored by the companies that provide them. Although the Pfam site does not pass any information about you to these third-party sites, the sites themselves may use cookies to track your usage of the files that they serve. If you are concerned about the privacy implications of this monitoring, you may want to block cookies from the third-party hosting sites.

The Pfam Consortium

Pfam is maintained by an international consortium of researchers that grew out of its original development by Erik Sonnhammer, Sean Eddy and Richard Durbin. The current consortium members, their institutes and primary roles are listed below.

Wellcome Trust Sanger Institute (UK)

  • Alex Bateman - Co-ordinator of the Pfam, Merops and Rfam databases
  • Penny Coggill - Pfam database annotator
  • Rob Finn - Project leader
  • Jaina Mistry - Pfam research and development
  • Prasad Gunasekaran - Pfam development
  • John Tate - Web development

Janelia Farm Research Center (USA)

  • Sean Eddy - Co-ordinator of Pfam-USA, founding developer and author of HMMER software

Stockholm Bioinformatics Center (Sweden)

  • Erik Sonnhammer - Co-ordinator of Pfam-Sweden and founding developer

Mirror Sites

Previous contributors

  • Shimelis Assefa
  • Matthew Bashton
  • Ewan Birney
  • Lorenzo Cerrutti
  • Lachlan Coin
  • Richard Durbin
  • Matthew Fenech
  • O. Luke Gavin
  • Sam Griffiths-Jones
  • Kevin Howe
  • Nicola Kerrison
  • Mhairi Marshall
  • Nina Mian
  • William Mifsud
  • Simon Moxon
  • Joanne Pollington
  • Stephen John Sammut
  • David Studholme
  • Corin Yeats

Pfam is a collaborative venture and we hope to be able to interact with as many people as possible, in order to provide a quality database. Please get in touch with any one of us for more information about Pfam. You can email Pfam using the address found at the bottom of the page.

How to contact Pfam

Contact Pfam

You can contact us in various ways. Each of the Pfam consortium sites provides a contact email address, which you can find at the bottom of every page. You can use this address to contact the specific Pfam group.

We also run a central helpdesk, which handles annotation comments, data enquiries and general problems with the Pfam websites. We use a request tracking system to monitor emails to the helpdesk, so you should receive an automated response to your email, letting you know that the system has logged your mail and notified us of its arrival.

Mailing list

The Pfam mailing list is a low-traffic list that carries important announcements, such as releases or major changes.

To join the mailing list send a mail to pfamlist-subscribe@sanger.ac.uk.

If you should want to unsubscribe from the list send a mail to pfamlist-unsubscribe@sanger.ac.uk.

Xfam blog

The Pfam group contributes to the Xfam blog. The blog is used to announce releases, new features and important changes to Pfam, as well as for posts discussing general issues surrounding the Pfam resource. You can see blog posts that are specific to Pfam here.

RSS feeds

You can keep in touch with the latest goings-on by subscribing to the RSS feed from the Xfam blog.