Pfam Help

Help Summary

Pfam 27.0 (Mar 2013, 14831 families)

Proteins are generally composed of one or more functional regions, commonly termed domains. The presence of different domains in varying combinations in different proteins gives rise to the diverse repertoire of proteins found in nature. Identifying the domains present in a protein can provide insights into the function of that protein.

The Pfam database is a large collection of protein domain families. Each family is represented by multiple sequence alignments and hidden Markov models (HMMs).

There are two levels of quality to Pfam families: Pfam-A and Pfam-B. Pfam-A entries are derived from the underlying sequence database, known as Pfamseq, which is built from the most recent release of UniProtKB at a given time-point. Each Pfam-A family consists of a curated seed alignment containing a small set of representative members of the family, profile hidden Markov models (profile HMMs) built from the seed alignment, and an automatically generated full alignment, which contains all detectable protein sequences belonging to the family, as defined by profile HMM searches of primary sequence databases.

Pfam-B families are un-annotated and of lower quality as they are generated automatically from the non-redundant clusters of the latest ADDA release. Although of lower quality, Pfam-B families can be useful for identifying functionally conserved regions when no Pfam-A entries are found.

Pfam entries are classified in one of four ways:

Family:
A collection of related protein regions
Domain:
A structural unit
Repeat:
A short unit which is unstable in isolation but forms a stable structure when multiple copies are present
Motif:
A short unit found outside globular domains

Related Pfam entries are grouped together into clans; the relationship may be defined by similarity of sequence, structure or profile-HMM.

Pfam Changes

This section details the changes that we plan to make or have made to Pfam. This includes changes to the flatfiles, MySQL database and the public website.


Latest changes to Pfam data

Changes between Pfam release 25 and 26

Release 26.0 contains a total of 13672 families, with 1445 new families and 46 families killed since the last release. 79.42% of all proteins in Pfamseq contain a match to at least one Pfam domain. 57.68% of all residues in the sequence database fall within Pfam domains. Pfam 26.0 is based on UniProt release 2011_06.



Latest changes to website

Release 3.2 (27th February 2013)

  • Selections in sunburst species trees: sunbursts now support selections and the generation of alignments and FASTA files from a selection.
  • DNA searches: the DNA search system has been entirely re-written and now uses a six-frame translation to generate protein sequences, which are then searched using our standard pfam_scan.pl script. Results are shown in an interactive page, as for protein sequences.
  • Family page alignment tab: the alignment tab has been redesigned to make it easier to navigate the wider range of available alignment types (now including those generated from representative proteomes) and display options.
  • Scores in protein summary: the protein page now includes E-values and bit scores in the summary table beneath the domain graphic. These are hidden by default but can be shown/hidden on demand.
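As an illustration of the six-frame translation mentioned for DNA searches, here is a minimal Python sketch. It assumes the standard genetic code and uses made-up function names (`six_frame_translate`, `translate`); it is not the code used by the website or by pfam_scan.pl, and it simply marks any codon containing a non-ACGT character as 'X'.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, ...).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of a DNA string."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons from position 0; unknown codons become 'X'."""
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translate(dna):
    """Return a dict of the six translations, keyed '+1'..'+3' and '-1'..'-3'."""
    frames = {}
    for strand, seq in (("+", dna.upper()), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


if __name__ == "__main__":
    for name, protein in six_frame_translate("ATGGCC").items():
        print(name, protein)
```

Each of the six protein strings could then be searched like any other protein sequence; stop codons appear as '*' and would normally be handled before searching.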


Site organisation

The family page is the major page for accessing information contained within Pfam as it describes the Pfam family entries. Most referring sites link to this page. Alternatively, users can navigate to family pages by entering the Pfam identifier or accession number, either via the home page, the "Jump-to" boxes or the keyword search box, or by clicking on a domain name or graphic from anywhere on the website. As with all Pfam pages, there is the context-sensitive icon bar in the top right hand corner that provides a quick overview about the contents of the tabs. The tabs on the family page cover the following topics: functional annotation; domain organisation or architectures; alignments; HMM logo; trees; curation and models; species distribution; interactions; and structures.



Using the "Jump to" search

Many pages in the site include a small search box, entitled "Jump to...". The "Jump to..." box allows you to go immediately to the page for any entry in the Pfam site, including Pfam families, clans and UniProt sequence entries.

The "Jump to..." search understands accessions and IDs for most types of entry. For example, you can enter either a Pfam family accession, e.g. PF02171, or, if you find it easier to remember, a family ID, such as piwi. Note that the search is case insensitive.

Because some identifiers can be ambiguous, the "Jump to..." search may need to test several types of identifier to find the entry that you're looking for. For example, Pfam A family IDs (e.g. Kazal_1) and Pfam clan IDs (e.g. Kazal) aren't easily distinguished, so if you enter kazal, the search will first look for a family called kazal and, if it doesn't find one, will then look for a clan called kazal. If all of the guesses fail, you'll see an error message saying "Entry not found".

The order in which the search tries the various types of ID and accession is given below:

  • Pfam A accession, e.g. PF02171
  • Pfam A identifier, e.g. piwi
  • Pfam B accession, e.g. PB000001
  • Pfam B identifier, e.g. Pfam-B_1
  • UniProt sequence accession, e.g. P00789
  • UniProt sequence ID, e.g. CANX_CHICK
  • NCBI "GI" number, e.g. 113594566
  • NCBI secondary accession, e.g. BAF18440.1
  • metaseq ID, e.g. JCVI_ORF_1096665732460
  • metaseq accession, e.g. JCVI_PEP_1096665732461
  • Pfam clan accession, e.g. CL0005
  • Pfam clan ID, e.g. Kazal
  • PDB entry, e.g. 2abl
  • Proteome species name, e.g. Homo sapiens
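The fall-through behaviour described above can be sketched in Python as an ordered list of tests. This is only an illustration: the accession patterns are inferred from the examples in the list, the `FAMILY_IDS` and `CLAN_IDS` sets are hypothetical stand-ins for database lookups, and only a subset of the entry types is covered.

```python
import re

# Hypothetical stand-ins for database lookups (assumption, not real Pfam data access).
FAMILY_IDS = {"piwi", "kazal_1"}
CLAN_IDS = {"kazal"}

# (entry type, test) pairs, tried in order. Patterns are inferred from the
# example accessions in the list above and cover only a subset of entry types.
CHECKS = [
    ("Pfam A accession", lambda t: re.fullmatch(r"pf\d{5}", t) is not None),
    ("Pfam A identifier", lambda t: t in FAMILY_IDS),
    ("Pfam B accession", lambda t: re.fullmatch(r"pb\d{6}", t) is not None),
    ("Pfam clan accession", lambda t: re.fullmatch(r"cl\d{4}", t) is not None),
    ("Pfam clan ID", lambda t: t in CLAN_IDS),
]


def jump_to(term):
    """Return the first entry type that matches, ignoring case."""
    key = term.strip().lower()
    for entry_type, matches in CHECKS:
        if matches(key):
            return entry_type
    return "Entry not found"


if __name__ == "__main__":
    for term in ("PF02171", "Piwi", "kazal", "CL0005", "nonsense"):
        print(term, "->", jump_to(term))
```

Because the family test runs before the clan test, entering kazal only reaches the clan lookup after the family lookup has failed, which mirrors the Kazal_1 / Kazal example above.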



Keyword search

Every page in the Pfam site includes a search box in the page header. You can use this to find Pfam A families which match a particular keyword. The search includes several different areas of the Pfam database:

  • text fields in Pfam entries, e.g. family descriptions
  • UniProt sequence entry description and species fields
  • HEADER and TITLE fields from PDB entries
  • Gene Ontology IDs and terms
  • InterPro entry abstracts

Each Pfam A entry is listed only once in the results table, although it might have been found in more than one area of the database.



Searching a protein sequence against Pfam

Searching a protein sequence against the Pfam library of HMMs will enable you to find out the domain architecture of the protein. If your protein is present in the version of UniProt, NCBI Genpept or the metagenomic sequence set that we used to make the current release of Pfam, we have already calculated its domain architecture. You can access this by entering the sequence accession or ID in the 'view a sequence' box on the Pfam homepage.

If your sequence is not in the Pfam database, you could perform a single-sequence or a batch search by clicking on the 'Search' link at the top of the Pfam page.

Single protein search

If your protein is not recognised by Pfam, you will need to paste the protein sequence into the search page. We will search your sequence against our HMMs and instantly display the matches for you.

Batch search

If you have a large number of sequences to search (up to several thousand), you can use our batch upload facility. This allows you to upload a file of your sequences in FASTA format, and we will run them against our HMMs and email the results back to you, usually within 48 hours. We request that you put a maximum of 5000 sequences in each file.

Local protein searches

If you have a very large number of protein searches to perform, or you do not wish to post your sequence across the web, it may be more convenient to run the Pfam searches locally using the 'pfam_scan.pl' script. To do this you will need the HMMER3 software, the Pfam HMM libraries and a couple of additional data files from the Pfam website. You will also need to download a few modules from CPAN, most notably Moose.

Full details on how to get 'pfam_scan.pl' up and running can be found on our FTP site.

Proteome analysis

Pfam pre-calculates the domain compositions and architectures for all the proteomes present in Integr8. To see the list of proteomes, click on the 'browse' link at the top of the Pfam website, and click on a letter of the alphabet in the 'proteomes' section. By clicking on a particular organism, you will be able to view the proteome page for that organism. From here you can view the domain organisation and the domain composition for that proteome.

The taxonomy query allows quick identification of families/domains which are present in one species but are absent from another. It can also be used to find families/domains that are unique to a particular species (note this can be very slow).
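Conceptually, the taxonomy query is a set comparison. The following Python sketch shows the idea on made-up data; the species names, accessions and function names are hypothetical, and the website's actual implementation is not described here.

```python
def present_in_but_absent_from(families_by_species, present, absent):
    """Families found in `present` that have no match in `absent`."""
    return families_by_species[present] - families_by_species[absent]


def unique_to(families_by_species, species):
    """Families found in `species` and in no other species in the mapping."""
    others = set().union(
        *(fams for name, fams in families_by_species.items() if name != species)
    )
    return families_by_species[species] - others


if __name__ == "__main__":
    # Made-up example data: species name -> set of family accessions.
    data = {
        "species_a": {"PF00001", "PF00002", "PF00003"},
        "species_b": {"PF00002", "PF00004"},
        "species_c": {"PF00003", "PF00004"},
    }
    print(sorted(present_in_but_absent_from(data, "species_a", "species_b")))
    print(sorted(unique_to(data, "species_a")))
```

Finding families unique to one species requires comparing against every other species, which is one plausible reason such a query can be slow.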



Finding proteins with a specific set of domain combinations ('architectures')

Pfam allows you to retrieve all of the proteins with a particular domain combination (e.g. proteins containing both a CBS domain and an IMPDH domain) using the domain query tool. For a more detailed study of domain architectures you can use PfamAlyzer. PfamAlyzer allows you to find proteins which contain a specific combination of domains and to specify particular species and the evolutionary distances allowed between domains.
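The kind of question the domain query answers can be sketched in a few lines of Python. The protein names and architectures below are invented for illustration, and `proteins_with_domains` is a hypothetical helper, not part of any Pfam tool.

```python
def proteins_with_domains(architectures, required):
    """Return proteins whose architecture contains every required domain."""
    required = set(required)
    return sorted(
        protein
        for protein, domains in architectures.items()
        if required <= set(domains)
    )


if __name__ == "__main__":
    # Made-up architectures: protein name -> ordered list of domain IDs.
    architectures = {
        "protein_1": ["IMPDH", "CBS", "CBS"],
        "protein_2": ["CBS", "CBS"],
        "protein_3": ["IMPDH"],
    }
    print(proteins_with_domains(architectures, ["CBS", "IMPDH"]))
```

This sketch ignores domain order, copy number and the distances between domains; those are the kinds of constraints that PfamAlyzer is described as supporting.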



Wikipedia annotation

The Pfam consortium is now coordinating the annotation of Pfam families via Wikipedia. On the summary tab of some family pages, you'll find the text from a Wikipedia article that we feel provides a good description of the Pfam family. If a family has a Wikipedia article assigned to it, we now show the text of that article on the summary tab, in preference to the traditional Pfam annotation text.

If a family does not yet have a Wikipedia article assigned to it, there are several ways for you to help us add one. You can find much more information about the process in the Pfam and Wikipedia tab.


What is Pfam ?

Pfam is a collection of multiple sequence alignments and profile hidden Markov models (HMMs). Each Pfam HMM represents a protein family or domain. By searching a protein sequence against the Pfam library of HMMs, you can determine which domains it carries, i.e. its domain architecture. Pfam can also be used to analyse proteomes and questions of more complex domain architectures.

For each Pfam accession we have a family page, which can be accessed in several ways: from the 'View a Pfam Family' search box on the HOME page, by clicking on any graphical image of a domain, by searching for a particular family using the 'Keyword Search' box on the top right hand corner of most website pages, or by pasting the family identifier or accession into the 'JUMP TO' box that is present on most pages in the site.


What is the difference between Pfam-A and Pfam-B families ?

There are two levels of quality to Pfam families: Pfam-A and Pfam-B.

Pfam-A entries are derived from the underlying sequence database, known as Pfamseq, which is built from the most recent release of UniProtKB at a given time-point. For each Pfam-A family we build a single curated profile hidden Markov model (profile HMM) from the seed alignment (a small set of representative members of the family) using the HMMER3 software, and search this against Pfamseq to provide an automatically generated full alignment. All sequences that score above the cut-off threshold value determined for that family are included in the full alignment, which should then contain all detectable protein sequences belonging to that family.

We also search our Pfam-A HMMs against NCBI Genpept and a set of metagenomic sequences, and these alignments are available from the 'Alignments' tab of the Pfam-A family page. As the seed alignments have been manually checked for quality by a Pfam curator, Pfam-A matches are very unlikely to be false matches. Pfam-A families also carry a summary annotation and links to other databases.

To complement the Pfam-A families, we automatically generate Pfam-B families using the ADDA database. Pfam-B families have no associated annotation or literature reference and are of much lower quality than Pfam-A families, as their alignments have not been manually checked by a Pfam curator. Pfam-B families are formed by taking alignments of sequence segments from ADDA and removing any Pfam-A residues from them. Some Pfam-B families are composed of low complexity regions and may not reflect true relationships. We therefore recommend that you verify that sequences in a Pfam-B family are related by using other methods, such as BLAST.

Since Pfam 24.0, we have built HMMs for the first (and largest) 20,000 Pfam-B families. Using the Pfam website, users are able to perform a single-sequence or batch search against both the Pfam-A and Pfam-B HMMs.

All families in Pfam are non-overlapping, such that no amino acid belongs to more than one family/domain. At each Pfam release we search all our models against an updated version of UniProt and NCBI Genpept, and regenerate our Pfam-B families using the most recent version of ADDA.
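The non-overlap rule can be checked for a single protein with a short Python sketch. The match tuples and the function name `find_overlaps` are assumptions made for illustration; coordinates are treated as 1-based and inclusive.

```python
def find_overlaps(matches):
    """Return pairs of different families whose regions share a residue.

    `matches` is a list of (family, start, end) tuples for one protein,
    with 1-based inclusive coordinates.
    """
    ordered = sorted(matches, key=lambda m: (m[1], m[2]))
    overlaps = []
    for i, (fam_a, _start_a, end_a) in enumerate(ordered):
        for fam_b, start_b, _end_b in ordered[i + 1:]:
            if start_b > end_a:
                break  # sorted by start, so no later match can overlap fam_a
            if fam_a != fam_b:
                overlaps.append((fam_a, fam_b))
    return overlaps


if __name__ == "__main__":
    # Made-up matches on one protein.
    matches = [("fam_A", 1, 100), ("fam_B", 90, 150), ("fam_C", 151, 200)]
    print(find_overlaps(matches))
```

An empty result means that every residue of the protein falls in at most one family, which is the property stated above.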


What is on a Pfam-A family page ?

From the family page you can view the Pfam annotation for a family. We also provide access to many other sources of information, including annotation from the InterPro database, where available, cross-links to other databases and other tools for protein analysis. Since release 25.0 we have also started displaying relevant articles from Wikipedia where available.

Via the tabs on the left-hand side of the page, you can view:

  • the domain architectures in which this family is found
  • the alignments for the family in various formats, including alignments of matches to the NCBI and metagenomic sets, as well as in 'heat-map' format. All alignments can be downloaded
  • the phylogenetic and species distribution trees, either as a traditional, interactive tree or as a "sunburst" plot
  • the HMM logo
  • the structural information for each family where available


What is a clan ?

Some of the Pfam families are grouped into clans. Pfam defines a clan as a collection of families that have arisen from a single evolutionary origin. Evidence of their evolutionary relationship can be in the form of similarity in tertiary structures, or, when structures are not available, from common sequence motifs. The seed alignments for all families within a clan are aligned and the resulting alignment (called the clan alignment) can be accessed from a link on the clan page. Each clan page includes a clan alignment, a description of the clan and database links, where appropriate. The clan pages can be accessed by following a link from the family page, or by clicking on 'clans' under the 'browse' menu at the top of any Pfam page.


What criteria do you use for putting families into clans ?

We use a variety of measures. Where possible we do use structures to guide us and that is always the gold standard. In the absence of a structure we use:

  1. profile comparisons such as HHsearch
  2. the fact that a sequence significantly matches two HMMs in the same region of the sequence
  3. a method called SCOOP, that looks for common matches in search results that may indicate a relationship

All of this information is then used by one of our curators to decide whether families are related, and we strive to find information in the literature that supports the relationship, e.g. common function.


I was wondering if it is possible to build Wise2 with HMMER3 support ?

The way we get round the difference in HMMER versions is to convert the HMMs that are in HMMER3 format to HMMER2 format using the HMMER3 program "hmmconvert" (with the -2 flag). To make the searches feasible, we screen the DNA for potential domains using ncbi-blast with Pfam-A.fasta as the target library. GeneWise is then used to compare the resulting subset of HMMs against the DNA. There is some down-weighting of the bits-per-position between H2 and H3 HMMs that the conversion does not account for, leading inevitably to some false negatives for some families/sequences. However, until GeneWise is patched to deal with HMMER3 models, this is the best course of action.


What happened to the Pfam_ls and Pfam_fs files?

In the past, each Pfam family was represented by two profile hidden Markov models (HMMs). One of these could match partially to a family and was called local or fs mode; the other required a sequence to match the whole length of the HMM, and was called glocal or ls mode. With HMMER2, we found that the combination of the two models gave us the most sensitive searches. However, HMMER3 models are only available for searching in local (fs) mode. Because of the improvements in HMMER3, this single model is as sensitive as the two combined HMMER2 models. This means that we no longer provide two HMM libraries called 'HMM_ls' and 'HMM_fs'. Instead, a single library is available called 'Pfam-A.hmm'.


Can I search DNA against Pfam ?

The Wise2 software package allows the comparison of protein HMMs to genomic DNA. We use this package to allow users to search single DNA sequences against the library of Pfam HMMs. Paste your DNA sequence into the DNA search box on the search page. The results take approximately 2 minutes for a 1 kb sequence, and approximately 1 hour for an 80 kb sequence.


How can I search Pfam locally ?

If you have a large number of sequences or you don't want to post your sequence across the web, you can search your sequence locally using the 'pfam_scan.pl' script.

In terms of HMMs and formats, Pfam is based around the HMMER3 package. This will need to be installed on your local machine. You will need also to download the Pfam HMM libraries from the FTP site, as well as a few modules from CPAN, most notably Moose.

Full details on how to get 'pfam_scan.pl' up and running can be found on our FTP site.


Why doesn't Pfam include my sequence ?

Pfam is built from a fixed release of UniProt. At each Pfam release we incorporate sequences from the latest release of UniProt. This means that, at any time, the sequences used by Pfam might be several months behind those in the most up-to-date versions of the sequence databases. If your sequence isn't in Pfam, you can still find out what domains it contains by pasting it into the sequence search box on the search page.


Why is there apparent redundancy of UniProt IDs in the full-length FASTA sequence file ?

A given Pfam family may match a single protein sequence multiple times, if the domain/family is a repeating unit, for example, or when the HMM matches only to short stretches of the sequence but matches several times. In such cases the FASTA file with the full length sequences will contain multiple copies of the same sequence.


How many accurate alignments do you have ?

Release 27.0 has 14831 families. Over 79.9% of the proteins in SWISSPROT 2012_06 and TrEMBL 2012_06 have at least one match to a Pfam-A family.


How can I submit a new domain ?

If you know of a domain that is not present in Pfam, you can submit it to us by email (pfam-help@sanger.ac.uk) and we will endeavour to build a Pfam entry for it. We ask that you supply us with a multiple sequence alignment of the domain (please send the alignment file as a text file (e.g. .txt) and not in the format of a specific application such as Microsoft Word (e.g. a .doc) file), and associated literature evidence if available.


What is iPfam ?

iPfam is a resource that describes domain-domain interactions that are observed in PDB entries. Where two or more Pfam domains occur in a single structure, it analyses them to see if they are close enough to form an interaction. If they are close enough, it calculates the bonds forming the interaction. Further information can be found on the iPfam help pages.


Can I search my protein against Pfam ?

Of course! Please use this search form.


What is the difference between the - and . characters in your full alignments ?

The '-' and '.' characters both represent gap characters. However, they do tell you some extra information about how the HMM has generated the alignment. The '-' symbols are where the alignment of the sequence has used a delete state in the HMM to jump past a match state. This means that the sequence is missing a column that the HMM was expecting to be there. The '.' character is used to pad gaps where another sequence in the alignment has residues emitted from the HMM's insert state. See the alignment below, where both characters are used. The HMM states emitting each column are shown. Note that residues emitted from the Insert (I) state are in lower case.

FLPA_METMA/1-193     ---MPEIRQLSEGIFEVTKD.KKQLSTLNLDPGKVVYGEKLISVEGDE
FBRL_XENLA/86-317    RKVIVEPHR-HEGIFICRGK.EDALVTKNLVPGESVYGEKRISVEDGE
FBRL_MOUSE/90-321    KNVMVEPHR-HEGVFICRGK.EDALFTKNLVPGESVYGEKRVSISEGD
O75259/81-312        KNVMVEPHR-HEGVFICRGK.EDALVTKNLVPGESVYGEKRVSISEGD
FBRL_SCHPO/71-303    AKVIIEPHR-HAGVFIARGK.EDLLVTRNLVPGESVYNEKRISVDSPD
O15647/71-301        GKVIVVPHR-FPGVYLLKGK.SDILVTKNLVPGESVYGEKRYEVMTED
FBRL_TETTH/64-294    KTIIVK-HR-LEGVFICKGQ.LEALVTKNFFPGESVYNEKRMSVEENG
FBRL_LEIMA/57-291    AKVIVEPHMLHPGVFISKAK.TDSLCTLNMVPGISVYGEKRIELGATQ
Q9ZSE3/38-276        SAVVVEPHKVHAGIFVSRGKsEDSLATLNLVPGVSVYGEKRVQTETTD
HMM STATES           MMMMMMMMMMMMMMMMMMMMIMMMMMMMMMMMMMMMMMMMMMMMMMMM
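
The conventions described above can be applied mechanically to one row of an alignment. The Python sketch below labels each character of an aligned sequence; the function name and the label strings are invented for illustration, and the example row is taken from the alignment above.

```python
from collections import Counter


def classify_columns(aligned_row):
    """Label each character of an aligned sequence row.

    '-'        -> 'delete' (delete state used to skip a match state)
    '.'        -> 'pad'    (padding opposite another sequence's insertion)
    lower case -> 'insert' (residue emitted from an insert state)
    otherwise  -> 'match'  (residue emitted from a match state)
    """
    kinds = []
    for ch in aligned_row:
        if ch == "-":
            kinds.append("delete")
        elif ch == ".":
            kinds.append("pad")
        elif ch.islower():
            kinds.append("insert")
        else:
            kinds.append("match")
    return kinds


if __name__ == "__main__":
    row = "KTIIVK-HR-LEGVFICKGQ.LEALVTKNFFPGESVYNEKRMSVEENG"  # FBRL_TETTH/64-294
    print(Counter(classify_columns(row)))
```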
    


What do the SS lines in the alignment mean ?

These lines are structural information. The SS stands for secondary structure, and this is taken from DSSP. The following list gives the definitions for each code letter:

  • C: random Coil
  • H: alpha-helix
  • G: 3(10) helix
  • I: pi-helix
  • E: hydrogen bonded beta-strand (extended strand)
  • B: residue in isolated beta-bridge
  • T: h-bonded turn (3-turn, 4-turn, or 5-turn)
  • S: bend (five-residue bend centered at residue i)


You don't have domain YYYY in Pfam !

We are very keen to be alerted to new domains. If you can provide us with a multiple alignment then we will try hard to incorporate it into the database. If you know of a domain but don't have a multiple alignment, we still want to know; for simple families just one sequence is enough. Again, please e-mail pfam-help@sanger.ac.uk.


Are there other databases which do this ?

To a certain extent yes, there are a number of "second generation" databases which are trying to organise protein space into evolutionarily conserved regions. Examples include:

PROSITE
This originally was based around regular expression patterns but now also includes profiles.
PRINTS
This is based around protein "finger-prints" of a series of small conserved motifs making up a domain.
SMART
This is a database concentrating on extracellular modules and signaling domains.
ADDA
This is an automatic algorithm for domain decomposition and clustering of protein domain families.
InterPro
Combines information from Pfam, PRINTS, SMART, PROSITE and ProDom.
CDD
The Conserved Domain Database is derived from Pfam and SMART databases.


So which database is better ?

As with everything, it depends on your problem: we would certainly suggest using more than one method. Pfam is likely to provide more interpretable results, with crisp definitions of domains in a protein.


Glossary of terms used in Pfam

These are some of the commonly used terms in the Pfam website.

Alignment coordinates

HMMER3 reports two sets of domain coordinates for each profile HMM match. The envelope coordinates delineate the region on the sequence where the match has been probabilistically determined to lie, whereas the alignment coordinates delineate the region over which HMMER is confident that the alignment of the sequence to the profile HMM is correct. Our full alignments contain the envelope coordinates from HMMER3.

Architecture

The collection of domains that are present on a protein.

Clan

A collection of related Pfam entries. The relationship may be defined by similarity of sequence, structure or profile-HMM.

Domain

A structural unit.

Domain score

The score of a single domain aligned to an HMM. Note that, for HMMER2, if there was more than one domain, the sequence score was the sum of all the domain scores for that Pfam entry. This is not quite true for HMMER3.

DUF

Domain of unknown function.

Envelope coordinates

See Alignment coordinates.

Family

A collection of related protein regions.

Full alignment

An alignment of the set of related sequences which score higher than the manually set threshold values for the HMMs of a particular Pfam entry.

Gathering threshold (GA)

Also called the gathering cut-off, this value is the search threshold used to build the full alignment. The gathering threshold is assigned by a curator when the family is built. The GA is the minimum score a sequence must attain in order to belong to the full alignment of a Pfam entry. For each Pfam HMM we have two GA cutoff values, a sequence cutoff and a domain cutoff.
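As a rough illustration of how two cut-offs might be applied, the Python sketch below accepts a match only when both the sequence score and the domain score reach their respective thresholds. Requiring both is an assumption of this sketch rather than something stated above, and the class name, function name and numbers are invented.

```python
from dataclasses import dataclass


@dataclass
class GatheringThreshold:
    """The two GA cut-offs for one Pfam HMM, in bits."""
    sequence_cutoff: float
    domain_cutoff: float


def passes_gathering(sequence_score, domain_score, ga):
    """Assumed rule: both scores must be at or above their cut-offs."""
    return (
        sequence_score >= ga.sequence_cutoff
        and domain_score >= ga.domain_cutoff
    )


if __name__ == "__main__":
    ga = GatheringThreshold(sequence_cutoff=25.0, domain_cutoff=22.0)  # made-up values
    print(passes_gathering(30.1, 24.5, ga))
    print(passes_gathering(30.1, 15.0, ga))
```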

HMMER

The suite of programs that Pfam uses to build and search HMMs. Since Pfam release 24.0 we have used HMMER version 3 to make Pfam. See the HMMER site for more information.

Hidden Markov model (HMM)

An HMM is a probabilistic model. In Pfam we use HMMs to transform the information contained within a multiple sequence alignment into a position-specific scoring system. We search our HMMs against the UniProt protein database to find homologous sequences.

HMMER3

The suite of programs that Pfam uses to build and search HMMs. See the HMMER site for more information.

iPfam

A resource that describes domain-domain interactions that are observed in PDB entries. Where two or more Pfam domains occur in a single structure, it analyses them to see if they are close enough to form an interaction. If they are close enough, it calculates the bonds forming the interaction.

Metaseq

A collection of sequences derived from various metagenomics datasets.

Motif

A short unit found outside globular domains.

Noise cutoff (NC)

The bit scores of the highest scoring match not in the full alignment.

Pfam-A

An HMM-based, hand-curated Pfam entry which is built using a small number of representative sequences. We manually set a threshold value for each HMM and search our models against the UniProt database. All of the sequences which score above the threshold for a Pfam entry are included in the entry's full alignment.

Pfam-B

An automatically generated alignment which is formed by taking a cluster of sequences from the ADDA database and removing Pfam-A residues from them. Since Pfam-B families are automatically generated, we recommend that you verify that the sequences in a Pfam-B family are related, using other methods such as BLAST. Since Pfam 24.0, we have made HMMs for the first (and therefore largest) 20,000 Pfam-B families. Users can search their sequences against the Pfam-B HMMs in addition to the Pfam-A HMMs when performing both single-sequence searches and batch searches on the website.

Posterior probability

HMMER3 reports a posterior probability for each residue that matches a 'match' or 'insert' state in the profile HMM. A high posterior probability shows that the alignment of the amino acid to the match/insert state is likely to be correct, whereas a low posterior probability indicates that there is alignment uncertainty. This is indicated on a scale with '*' being 10, the highest certainty, down to 1 being complete uncertainty. Within Pfam we display this information as a heat map view, where green residues indicate high posterior probability, and red ones indicate a lower posterior probability.

Repeat

A short unit which is unstable in isolation but forms a stable structure when multiple copies are present.

Seed alignment

An alignment of a set of representative sequences for a Pfam entry. We use this alignment to construct the HMMs for the Pfam entry.

Sequence score

The total score of a sequence aligned to an HMM. If there is more than one domain, the sequence score is the sum of all the domain scores for that Pfam entry. If there is only a single domain, the sequence score and the domain score for the protein will be identical. We use the sequence score to determine whether a sequence belongs to the full alignment of a particular Pfam entry.

Trusted cutoff (TC)

The bit scores of the lowest scoring match in the full alignment.

Help With Pfam HMM scores

E-values and Bit-scores

Pfam-A is based around hidden Markov model (HMM) searches, as provided by the HMMER3 package. In HMMER3, like BLAST, E-values (expectation values) are calculated. The E-value is the number of hits that would be expected to have a score equal to or better than this value by chance alone. A good E-value is much less than 1. A value of 1 is what would be expected just by chance. In principle, all you need to decide on the significance of a match is the E-value.

E-values are dependent on the size of the database searched, so we use a second system in-house for maintaining Pfam models, based on a bit score (see below), which is independent of the size of the database searched. For each Pfam family, we set a bit score gathering (GA) threshold by hand, such that all sequences scoring at or above this threshold appear in the full alignment. It works out that a bit score of 20 equates to an E-value of approximately 0.1, and a score of 25 to approximately 0.01. From the gathering threshold both a "trusted cutoff" (TC) and a "noise cutoff" (NC) are recorded automatically. The TC is the score of the lowest scoring match at or above the GA, i.e. the lowest scoring sequence included in the full alignment, and the NC is the score of the sequence next below the GA, i.e. the highest scoring sequence not included in the full alignment.
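The relationship between the GA, TC and NC values can be made concrete with a small Python sketch. The scores below are invented, and `trusted_and_noise_cutoffs` is a hypothetical helper that simply follows the definitions given in this section and in the glossary.

```python
def trusted_and_noise_cutoffs(bit_scores, ga):
    """Return (TC, NC) for a list of match bit scores and a GA threshold.

    TC: lowest score at or above the GA (lowest score in the full alignment).
    NC: highest score below the GA (best match left out of the full alignment).
    Either value is None when no score falls on that side of the threshold.
    """
    included = [s for s in bit_scores if s >= ga]
    excluded = [s for s in bit_scores if s < ga]
    tc = min(included) if included else None
    nc = max(excluded) if excluded else None
    return tc, nc


if __name__ == "__main__":
    scores = [55.2, 31.0, 25.4, 19.8, 12.1]  # made-up bit scores
    print(trusted_and_noise_cutoffs(scores, ga=25.0))
```

By construction the GA always lies between the NC and the TC, so the gap between those two values gives a feel for how cleanly the threshold separates members from non-members.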

Sequence versus domain scores

There's an additional wrinkle in the scoring system. HMMER3 calculates two kinds of scores, the first for the sequence as a whole and the second for the domain(s) on that sequence. The "sequence score" is the total score of a sequence aligned to the model (the HMM); the "domain score" is the score for a single domain — these two scores are virtually identical where only one domain is present on a sequence. Where there are multiple occurrences of the domain on a sequence any individual match may be quite weak, but the sequence score is the sum of all the individual domain scores, since finding multiple instances of a domain increases our confidence that that sequence belongs to that protein family, i.e. truly matches the model.

Meaning of bit-score for non-mathematicians

A bit score of 0 means that the likelihood of the match having been emitted by the model is equal to that of it having been emitted by the Null model (by chance). A bit score of 1 means that the match is twice as likely to have been emitted by the model as by the Null. A bit score of 2 means that the match is 4 times as likely to have been emitted by the model as by the Null. So, a bit score of 20 means that the match is 2 to the power 20 times as likely to have been emitted by the model as by the Null.
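The arithmetic can be written out in a couple of lines of Python. The function names are made up for this sketch; the only thing it relies on is the definition above, namely that a bit score is the base-2 logarithm of the ratio between the two likelihoods.

```python
import math


def likelihood_ratio(bit_score):
    """How many times more likely the match is under the model than the Null."""
    return 2.0 ** bit_score


def bit_score(p_model, p_null):
    """Bit score from the two likelihoods: log2 of their ratio."""
    return math.log2(p_model / p_null)


if __name__ == "__main__":
    for bits in (0, 1, 2, 20):
        print(bits, "bits ->", likelihood_ratio(bits), "times as likely")
```

Negative bit scores follow the same rule: a score of -1 means the match is half as likely under the model as under the Null.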

References & Bibliography

Pfam References

The Pfam protein families database: M. Punta, P.C. Coggill, R.Y. Eberhardt, J. Mistry, J. Tate, C. Boursnell, N. Pang, K. Forslund, G. Ceric, J. Clements, A. Heger, L. Holm, E.L.L. Sonnhammer, S.R. Eddy, A. Bateman, R.D. Finn Nucleic Acids Research (2012)  Database Issue 40:D290-D301
The Pfam protein families database: R.D. Finn, J. Mistry, J. Tate, P. Coggill, A. Heger, J.E. Pollington, O.L. Gavin, P. Gunesekaran, G. Ceric, K. Forslund, L. Holm, E.L. Sonnhammer, S.R. Eddy, A. Bateman Nucleic Acids Research (2010)  Database Issue 38:D211-D222
The Pfam protein families database: R.D. Finn, J. Tate, J. Mistry, P.C. Coggill, J.S. Sammut, H.R. Hotz, G. Ceric, K. Forslund, S.R. Eddy, E.L. Sonnhammer and A. Bateman Nucleic Acids Research (2008)  Database Issue 36:D281-D288
Pfam: clans, web tools and services: R.D. Finn, J. Mistry, B. Schuster-Böckler, S. Griffiths-Jones, V. Hollich, T. Lassmann, S. Moxon, M. Marshall, A. Khanna, R. Durbin, S.R. Eddy, E.L.L. Sonnhammer and A. Bateman Nucleic Acids Research (2006)  Database Issue 34:D247-D251
Enhanced protein domain discovery by using language modeling techniques from speech recognition: L. Coin, A. Bateman and R. Durbin Proc. Natl. Acad. Sci. USA. (2003) 100(8):4516-20
The Pfam Protein Families Database: A. Bateman, L. Coin, R. Durbin, R.D. Finn, V. Hollich, S. Griffiths-Jones, A. Khanna, M. Marshall, S. Moxon, E.L.L. Sonnhammer, D.J. Studholme, C. Yeats and S.R. Eddy Nucleic Acids Research (2004) 32:D138-D141
The Pfam Protein Families Database: A. Bateman, E. Birney, L. Cerruti, R. Durbin, L. Etwiller, S.R. Eddy, S. Griffiths-Jones, K.L. Howe, M. Marshall and E.L. Sonnhammer Nucleic Acids Research (2002) 30(1):276-280
The Pfam Protein Families Database: A. Bateman, E. Birney, R. Durbin, S.R. Eddy, K.L. Howe and E.L. Sonnhammer Nucleic Acids Research  (2000) 28:263-266
Pfam 3.1: 1313 multiple alignments match the majority of proteins: A. Bateman, E. Birney, R. Durbin, S.R. Eddy, R.D. Finn and E.L.L. Sonnhammer Nucleic Acids Research (1999) 27:260-262
Pfam: multiple sequence alignments and HMM-profiles of protein domains: E.L.L. Sonnhammer, S.R. Eddy, E. Birney, A. Bateman and R. Durbin Nucleic Acids Research (1998) 26:320-322
Pfam: a comprehensive database of protein families based on seed alignments: E.L.L. Sonnhammer, S.R. Eddy and R. Durbin Proteins (1997) 28:405-420

Book Chapters on Pfam

Pfam: the protein families database R.D. Finn (eds M.J. Dunn, L.B. Jorde, P.F.R. Little, S. Subramaniam) Genetics, Genomics, Proteomics and Bioinformatics, Section 6: Protein Families  (2005) ISBN 978-0-470-84974-3
Identifying protein domains with the Pfam database R.D. Finn, A. Bateman and S. Griffiths-Jones Current protocols in bioinformatics  ISBN 978-0-471-25093-7
Pfam: a domain-centric method for analysing proteins and proteomes J. Mistry and R.D. Finn Methods in Molecular Biology - Comparative Genomics

The Pfam consortium is now coordinating the annotation of Pfam families via Wikipedia. This is some background on the process.

A new approach to annotation

Pfam families have traditionally been annotated by our curators as they were added into the Pfam database, but the annotation step has become by far the most time-consuming part of building a Pfam family. As we adapt the Pfam model to cope with the dramatic increases in sequence data that are on the horizon, we have had to consider ways to make this step quicker and more efficient. We are also striving constantly to improve the quality and depth of our annotations, and to this end we have now adopted the Wikipedia model of annotation that was pioneered by the Rfam resource.

In this approach we will gradually reduce the prominence of our traditional, curator-produced family annotations, replacing them with Wikipedia articles. Starting from Pfam release 25.0, you will see that some family pages in the Pfam website show Wikipedia content rather than our own annotations. Ultimately we hope to be able to assign a detailed, high-quality Wikipedia article to every family in Pfam, but we need the help of the Pfam and Wikipedia communities to make this happen.

Wikipedia content in the Pfam website

When we build a new Pfam family, we try to find a Wikipedia article that describes the family and provides what we feel to be a valuable annotation for it. If we can't find a suitable article we will, in many cases, generate a new Wikipedia article ourselves. Hence, most new families will be assigned a Wikipedia article as soon as they are created.

Where a Wikipedia article has been assigned to a family, the main summary tab of the family page will show the content of the article, rather than the Pfam annotation. You can still see the old Pfam annotation, along with the Interpro annotation text, in adjacent tabs. Note that we will no longer be updating Pfam annotation text for any family that has a Wikipedia article. Instead we will try to make improvements or corrections to the article itself and will encourage our users to make improvements and corrections themselves.

Unfortunately, while new Pfam families will have Wikipedia articles assigned when they are created, we simply do not have the resources to be able to revisit older, pre-existing Pfam families. The family pages for these families still continue to show the Pfam annotation, but we hope to replace this with Wikipedia content wherever possible.

Contributing annotations

You can now contribute to the improvement of Pfam annotations in several ways. First and foremost, if you come across a family that does not yet have a Wikipedia article assigned to it, we would really like to add one. If you know of an article that would provide a useful description of a family, please let us know via our annotation submission form (click the "Add annotation" button on the family page) or by email. You can find our email address at the bottom of every page.

One of the advantages of using Wikipedia to provide our annotations is that any user can now contribute to that annotation text. In many cases, families that do not yet have a Wikipedia article can be assigned an article that already exists. In some cases, however, no suitable article exists, and in that case we would encourage you to consider adding one to Wikipedia yourself.

Editing Wikipedia articles

You can see these notes on every family page by clicking "More" on the Wikipedia content tab.

Before you edit for the first time

Wikipedia is a free, online encyclopedia. Although anyone can edit or contribute to an article, Wikipedia has some strong editing guidelines and policies, which promote the Wikipedia standard of style and etiquette. Your edits and contributions are more likely to be accepted (and remain) if they are in accordance with this policy.

You should take a few minutes to view the following pages:

How your contribution will be recorded

Anyone can edit a Wikipedia entry. You can do this either as a new user or you can register with Wikipedia and log on. When you click on the "Edit Wikipedia article" button, your browser will direct you to the edit page for this entry in Wikipedia. If you are a registered user and currently logged in, your changes will be recorded under your Wikipedia user name. However, if you are not a registered user or are not logged on, your changes will be logged under your computer's IP address. This has two main implications. Firstly, as a registered Wikipedia user your edits are more likely to be seen as a valuable contribution (although all edits are open to community scrutiny regardless). Secondly, if you edit under an IP address you may be sharing this IP address with other users. If your IP address has previously been blocked (due to being flagged as a source of 'vandalism') your edits will also be blocked. You can find more information on this and creating a user account at Wikipedia.

If you have problems editing a particular page, contact us at pfam-help@sanger.ac.uk and we will try to help.

Does Pfam agree with the content of the Wikipedia entry?

Pfam has chosen to link families to Wikipedia articles. In some cases we have created or edited these articles but in many other cases we have not made any direct contribution to the content of the article. The Wikipedia community does monitor edits to try to ensure that (a) the quality of article annotation increases, and (b) vandalism is very quickly dealt with. However, we would like to emphasise that Pfam does not curate the Wikipedia entries and we cannot guarantee the accuracy of the information on the Wikipedia page.

Contact us

Community annotation is a new facility of the Pfam web site. If you have problems editing or experience problems with these pages please contact us.

How to link to Pfam?

Pfam is maintained by a consortium of researchers based at the Wellcome Trust Sanger Institute, Cambridge, UK (WTSI), Stockholm Bioinformatics Center, Stockholm, Sweden (SBC), and Janelia Farm, Virginia, USA. All three sites run the same Pfam website and linking to different sites only requires that you change the site name, not the parameters in the URL.

Although we have no plans to change the locations of resources within this site dramatically, webmasters are advised to link only to the following types of page within the site.

Home pages

WTSI:
http://pfam.sanger.ac.uk/
SBC:
http://pfam.sbc.su.se/
Janelia:
http://pfam.janelia.org/

Searching a protein sequence against Pfam

WTSI:
http://pfam.sanger.ac.uk/search?tab=sequenceSearchBlock
SBC:
http://pfam.sbc.su.se/search?tab=sequenceSearchBlock
Janelia:
http://pfam.janelia.org/search?tab=sequenceSearchBlock

Searching a DNA sequence against Pfam

WTSI:
http://pfam.sanger.ac.uk/search?tab=sequenceDnaBlock
SBC:
http://pfam.sbc.su.se/search?tab=sequenceDnaBlock
Janelia:
http://pfam.janelia.org/search?tab=sequenceDnaBlock

Linking to Pfam family pages

You can refer to Pfam families either by accession or ID. You can also refer to a family by "entry", although this is a convenience that should be used only if you're not sure if what you have is an accession or an ID.

Pfam accession numbers are more stable between releases than IDs and we strongly recommend that you link by accession number.

Here are some examples of linking to Pfam at WTSI:

By accession:
http://pfam.sanger.ac.uk/family?acc=PF00002
By ID:
http://pfam.sanger.ac.uk/family?id=7tm_2
Using "entry":
http://pfam.sanger.ac.uk/family?entry=PF00002 or
http://pfam.sanger.ac.uk/family?entry=7tm_2
Directly:
http://pfam.sanger.ac.uk/family/PF00002 or
http://pfam.sanger.ac.uk/family/7tm_2

You can link to Pfam family data at the other sites by changing "pfam.sanger.ac.uk" to "pfam.sbc.su.se" or "pfam.janelia.org".

Linking to protein sequence pages

As for Pfam family pages, you can refer to protein sequence pages by accession, ID or entry. Protein IDs are unstable and do change between releases, so, again, we strongly recommend that you use protein accessions where possible.

Here are some examples of linking to protein sequence pages at WTSI:

By accession:
http://pfam.sanger.ac.uk/protein?acc=P15498
By ID:
http://pfam.sanger.ac.uk/protein?id=VAV_HUMAN
Using "entry":
http://pfam.sanger.ac.uk/protein?entry=P15498 or
http://pfam.sanger.ac.uk/protein?entry=VAV_HUMAN
Directly:
http://pfam.sanger.ac.uk/protein/P15498 or
http://pfam.sanger.ac.uk/protein/VAV_HUMAN

Again, to generate links to the other Pfam sites, change "pfam.sanger.ac.uk" to "pfam.sbc.su.se" or "pfam.janelia.org".
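If you generate links programmatically, the patterns above can be wrapped in a small helper. The following Python sketch is only an illustration; the function name, the site keys and the "direct" style label are hypothetical, while the host names and URL patterns are the ones listed in this page.

```python
from urllib.parse import quote, urlencode

# Host names of the three Pfam sites, as listed above
PFAM_HOSTS = {
    "wtsi": "pfam.sanger.ac.uk",
    "sbc": "pfam.sbc.su.se",
    "janelia": "pfam.janelia.org",
}


def pfam_url(page, value, by="acc", site="wtsi"):
    """Build a link to a family or protein page.

    page: "family" or "protein"
    by:   "acc", "id" or "entry" for the parameter style,
          or "direct" for the /page/value style
    """
    if page not in ("family", "protein"):
        raise ValueError("page must be 'family' or 'protein'")
    host = PFAM_HOSTS[site]
    if by == "direct":
        return f"http://{host}/{page}/{quote(value)}"
    if by not in ("acc", "id", "entry"):
        raise ValueError("by must be 'acc', 'id', 'entry' or 'direct'")
    return f"http://{host}/{page}?{urlencode({by: value})}"


print(pfam_url("family", "PF00002"))
print(pfam_url("protein", "VAV_HUMAN", by="id", site="janelia"))
```

Defaulting to by="acc" follows the recommendation above to link by accession wherever possible.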

Linking to the "jump to" search

The Pfam website features a search tool that tries to guess the type of any accession or ID that it is given. For example, if given "VAV_HUMAN", the search returns the URL for the protein sequence page for the VAV_HUMAN entry. If given "1w9h", the search returns the URL for the PDB entry (structure) 1w9h.

You can use the "jump to" search if you need to link to Pfam but can't be sure what type of accession or ID you will be using in your link. By default, the search returns the URL that it has found, as a simple, plain text HTTP response. Adding the parameter redirect=1 will make the "jump to" tool redirect to the URL that it finds or, if it couldn't find an appropriate URL, to the Pfam homepage.

Return URL:
http://pfam.sanger.ac.uk/search/jump?entry=P15498
Redirect:
http://pfam.sanger.ac.uk/search/jump?entry=P15498&redirect=1

Note that, although it may be convenient to link to Pfam using this search tool, there is no error reporting for your users if the search fails to find an appropriate URL in the Pfam site. It is much safer to link directly to the correct section of the site. Please contact us if you need help with building specific links.
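Because the default response is plain text, a script can check the result of a "jump to" lookup itself before sending a user anywhere. The Python sketch below is a hedged illustration: the function names are hypothetical, and it assumes that a failed lookup shows up either as an HTTP error or as an empty response body, which is not specified in this page.

```python
import urllib.error
import urllib.request
from urllib.parse import urlencode


def jump_url(entry, redirect=False, host="pfam.sanger.ac.uk"):
    """Build the URL for the "jump to" search."""
    params = {"entry": entry}
    if redirect:
        params["redirect"] = 1
    return f"http://{host}/search/jump?{urlencode(params)}"


def resolve_entry(entry, host="pfam.sanger.ac.uk", timeout=10):
    """Ask the "jump to" search for the page URL of an accession or ID.

    Returns the URL as a string, or None if nothing usable came back
    (assumption: failures appear as HTTP errors or an empty body).
    """
    try:
        with urllib.request.urlopen(jump_url(entry, host=host), timeout=timeout) as response:
            target = response.read().decode("utf-8").strip()
    except urllib.error.URLError:
        return None
    return target or None


print(jump_url("P15498"))
print(jump_url("P15498", redirect=True))
```

A caller can then fall back to its own error message when resolve_entry returns None, rather than relying on the redirect to the Pfam homepage.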

One of the visualisations provided by the Pfam website is a graphical representation of the features found within a sequence, termed domain graphics. There are a variety of different shapes and styles and each one has a particular meaning. This page gives an in-depth description of the elements of Pfam domain graphics.

The library that generates the images in this page and throughout the Pfam site uses a JSON string to describe the domain graphic. Each of the example graphics in this page is followed by a link that can be used to show the JSON snippet that produced it.

Generating graphics

You can try generating your own graphics using the domain graphics generator. The JSON descriptions in this page can be pasted directly into the generator to produce the graphics that you see here.

You can also generate the domain graphics for specific sequences, using the UniProt graphics generator.

Using the domain graphics code

Finally, if you would like to use the javascript library in your own site, we have put together an example page, showing how to set up the library and its dependencies. Look at the source code of the page for an explanation.


The sequence

The base sequence, undecorated by any domains or features, is represented by a plain grey bar:

Show JSON

The length of the domain graphic that is drawn is proportional to the length of the sequence itself. The graphics in this page are drawn with a X-scale of 0.5 pixels per amino-acid, so that a 400 residue sequence will result in a 200 pixel-wide image. Any domains or features which are drawn on the sequence are also scaled by the same factor.
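The scaling described above amounts to a single multiplication. As a minimal Python sketch (the function names are hypothetical and are not part of the domain graphics library):

```python
def pixel_width(n_residues, x_scale=0.5):
    """Width in pixels of a sequence drawn at x_scale pixels per residue."""
    return round(n_residues * x_scale)


def scale_region(start, end, x_scale=0.5):
    """Pixel positions of a feature's start and end, using the same factor."""
    return (start * x_scale, end * x_scale)


print(pixel_width(400))        # a 400 residue sequence
print(scale_region(100, 300))  # a feature spanning residues 100 to 300
```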


Pfam-A

The high quality, curated Pfam-A domains are classified into one of four different types: family, domain, repeat and motif (more details). These different classification types are rendered slightly differently.

Family/domain

It is possible for a sequence to match either the full length of a Pfam HMM (a full length match), or to match a portion of an HMM (a fragment match). The two types of match are rendered differently.

Both family and domain entries are rendered as rectangles with curved ends when the sequence is a full length match. Different types of domain are displayed with different colours. When the domain image is long enough, the domain name is shown within the domain itself. In most cases, you can click on the domains to visit the "family page" for that domain. Moving the mouse over the domain image should also display a tooltip showing the domain name, as well as the start and end positions of the domain.

Show JSON

From Pfam 24.0 onwards, Pfam has been generated using HMMER3, which introduces the concept of "envelope coordinates" for a match. Envelope regions are represented in domain graphics as lighter coloured regions. The graphic above shows short envelope regions at the ends of both domains.

When the sequence does not match the full length of the HMM that models a Pfam entry, matching domain fragments are shown. When a sequence match does not pass through the first position in the HMM, the N-terminal side of the domain graphic is drawn with a jagged edge instead of a curved edge. Similarly, when a sequence match does not pass through the last position of the HMM, the C-terminal side of the domain graphic is drawn with a jagged edge. In some rarer cases, the sequence match may not pass through either of the first or last positions of the HMM, in which case both sides are drawn with jagged edges. Examples of all three cases are shown here:

Show JSON
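The rule for choosing between curved and jagged ends can be summarised in a few lines of Python. This is a sketch of the logic described above for entries of type family and domain, not the code used by the website; the function name, argument names and the "curved"/"jagged" labels are assumptions made for illustration.

```python
def edge_styles(hmm_start, hmm_end, model_length):
    """Return (n_terminal_style, c_terminal_style) for a family/domain match.

    hmm_start and hmm_end are the first and last HMM positions covered by
    the match (1-based); model_length is the number of positions in the HMM.
    """
    start_style = "curved" if hmm_start == 1 else "jagged"
    end_style = "curved" if hmm_end == model_length else "jagged"
    return start_style, end_style


# A hypothetical HMM with 304 positions
print(edge_styles(1, 304, 304))   # full length match
print(edge_styles(10, 304, 304))  # match misses the start of the HMM
print(edge_styles(1, 250, 304))   # match misses the end of the HMM
print(edge_styles(10, 250, 304))  # match misses both ends
```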

Repeat/motif

Repeats and motifs are types of Pfam domain which do not form independently folded units. In order to distinguish them from domains of type family and domain, repeats and motifs are represented by rectangles with straight edges. As for families and domains, partial matches are represented with jagged edges.

Show JSON

Discontinuous nested domains

Some domains in Pfam are disrupted by the insertion of another domain (or domains) within them. A number of names have been given to this arrangement: discontinuous (referring to the outer domain), inserted or nested (both referring to the inner domain). For example, in many sequences containing an IMPDH domain, the IMPDH domain is continuous along the primary sequence. However, in some cases the linear sequence of the IMPDH domain is broken by the insertion of a CBS domain, as shown below.

Where three-dimensional structures are available for representatives of a Pfam domain, it is generally clear that the three-dimensional arrangement of the domain containing the nested domain is maintained. Typically the nested domain is found inserted within a surface exposed loop, having little or no effect on the structure of the other domain. Such an arrangement explains why and how these nested domains can be functionally tolerated.

To represent this arrangement of domains graphically, the discontinuous domain is represented in two parts (as shown below). These two parts are joined by a line bridging them.

Show JSON

Context domains

Context domains in Pfam are those that, despite not scoring above the family gathering threshold, are expected to be real, based on the presence of the surrounding domains found in the protein. The method is described in:

Enhanced protein domain discovery by using language modeling techniques from speech recognition: L. Coin, A. Bateman and R. Durbin Proc. Natl. Acad. Sci. USA. (2003) 100(8):4516-20

In some cases it is possible for a protein without any matches to gain context domains. This happens when two or more weak matches support each other. This is most often seen with multiple tandem repeats such as WD40 and leucine rich repeats such as LRR_1.

Within the Pfam domain graphics, the context domains are represented by rectangles that are coloured from white to pink as shown below. These images are interactive in the same manner as the Pfam-A graphics.

Show JSON

Please note that context domains are generated automatically and have not been subjected to the same high level of quality control as Pfam-A domains. Therefore, context domains, although likely to be correct, should always be verified by other means.


Pfam-B

Pfam-B regions are automatically generated clusters that supplement the high quality Pfam-A regions. The mechanism for generating Pfam-B regions is detailed here. These regions are represented by a small rectangle, coloured with three stripes. As for Pfam-A regions, clicking on a Pfam-B domain takes the user to the Pfam-B summary page for that entry. Moving the mouse over the striped image will show a tooltip listing the Pfam-B identifier and its start and end points.

Show JSON


Other sequence motifs

In addition to domains, smaller sequence motifs are represented by the domain graphics. Currently the following motifs are represented: signal peptides, low complexity regions, coiled-coils and transmembrane regions. These usually take lower priority than other regions that are drawn and they are therefore often obscured by, for example, a Pfam-A graphic being drawn over the top of them. An example of each motif is shown here.

Show JSON

Signal peptides

Signal peptides are short regions (<60 residues long) found at the N-terminus of proteins, which direct the post-translational transport of a protein and are subsequently removed by peptidases. More specifically, a signal peptide is characterised by a short hydrophobic helix (approximately 7-15 residues). This helix is preceded by a slightly positively charged region of highly variable length (approximately 1-12 residues). Between the hydrophobic helix and the cleavage site is a somewhat polar and uncharged region, of between 3 and 8 amino-acids. In Pfam, we use Phobius for the prediction of signal peptides and represent them graphically by a small orange box.

A combined transmembrane topology and signal peptide prediction method: L. Kall, A. Krogh and E.L.L. Sonnhammer J. Mol. Biol. (2004) 338(5):1027-36

Low complexity regions

Low complexity regions are regions of biased sequence composition, usually comprised of different types of repeats. These regions have been shown to be functionally important in some proteins, but they are generally not well understood and are masked out to focus on globular domains within the protein.

Within Pfam, we use SEG to calculate low complexity regions. The presence of a low complexity region is indicated by a cyan rectangle.

Disordered regions

We use the IUPred method for the prediction of disordered regions in the query sequence. The IUPred server provides more detailed disorder prediction results than currently offered here.

Bioinformatical approaches to characterize intrinsically disordered/unstructured proteins. Z. Dosztányi, B. Mészáros, I. Simon Brief Bioinform (2010) 11:225-43
IUPred: web server for the prediction of intrinsically unstructured regions of proteins based on estimated energy content. Z. Dosztányi, V. Csizmok, P. Tompa, I. Simon Bioinformatics (2005) 21:3433-3434

Coiled-coils

Coiled-coils are motifs found in proteins that structurally form alpha-helices that wrap or wind around each other. Normally, two to three helices are involved, but cases of up to seven alpha-helices have been reported. Coiled-coils are found in a wide variety of proteins, many of them functionally very important. In Pfam we use ncoils to identify these motifs. Coiled-coils are represented by a small lime-green rectangle.

Transmembrane regions

Integral membrane proteins contain one or more transmembrane regions that are comprised of an alpha-helix that passes through or "spans" a membrane. Transmembrane helices are quite variable in length, with the average being about 20 amino-acids. Again, Phobius is used for the prediction of transmembrane regions, which are represented by a red rectangle.


Other Sequence features

Below is a demonstration of how disulphide bridges and active site residues are represented in Pfam. Each of these features can appear above or below the sequence, but in this case the disulphide bridges are shown above the sequence and the active site residues below it.

Show JSON

Disulphide bridges

Disulphide bridges play a fundamental role in the folding and stability of some proteins. They are formed by covalent bonding between the thiol groups from two cysteine residues. The disulphide bridge annotations used in Pfam come from UniProt and are represented by a solid bridge-shaped line. When multiple disulphide bonds occur, the heights of the bridges are adjusted to avoid overlaps between them. Inter-protein disulphides are represented by single vertical lines. As always, moving the mouse over the "bridge graphic" shows the details of the bond in a tooltip.

Active site residues

Within an enzyme, a small number of residues are directly involved in catalysis of a reaction. These are termed active site residues. Within Pfam there are three categories of active site: those that are experimentally determined, those that are predicted by UniProt and those predicted by Pfam. All three types are represented by a "lollipop" with a diamond head. The head is coloured red, pink and purple for each of the three types respectively.

Pfam-predicted active sites are determined by using the experimental data and transferring these annotations through a Pfam alignment.

"Lollipops"

A wide range of different lollipop styles can be created by combining different line and head colours with different drawing styles. The lollipop head can be drawn as a square, circle or diamond, as a simple coloured bar, or as an arrow (pointing away from the sequence) or a "pointer" (an arrow pointing towards the sequence).

Show JSON


Tooltips

If appropriate metadata are present in the sequence description, the domain graphics library can also add tooltips to the image. The example below is a "live" domain graphic and its description includes the necessary metadata for generating tooltips; move your mouse over the various domains and sequence features to see them.

Show JSON

This is an introduction to the "RESTful" interface to the Pfam website. REST (or Representational State Transfer) refers to a style of building websites which makes it easy to interact programmatically with the services provided by the site. A programmatic interface, commonly called an Application Programming Interface (API), allows users to write scripts or programs to access data, rather than having to rely on a browser to view a site.


Basic concepts

URLs

A RESTful service typically sends and receives data over HTTP, the same protocol that's used by websites and browsers. As such, the services provided through a RESTful interface are identified using URLs.

In the Pfam website we use the same basic URL to provide both the standard HTML representation of Pfam data and the alternative XML representation. To see the data for a particular Pfam-A family, you would visit the following URL in your browser:

http://pfam.janelia.org/family/Piwi

To retrieve the data in XML format, just add an extra parameter, output=xml, to the URL:

http://pfam.janelia.org/family/Piwi?output=xml 

The response from the server will now be an XML document, rather than an HTML page.

Sending requests

Using curl

Although you can use a browser to retrieve family data in XML format, it's most useful to send requests and retrieve XML programmatically. The simplest way to do this is using a Unix command line tool such as curl:

Example
shell% curl -LH 'Expect:' -F output=xml 'http://pfam.janelia.org/family/Piwi'
<?xml version="1.0" encoding="UTF-8"?>
<!-- information on Pfam-A family PF02171 (Piwi), generated: 16:35:52 26-Oct-2009 -->
<pfam xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xmlns="http://pfam.sanger.ac.uk/"
      xsi:schemaLocation="http://pfam.sanger.ac.uk/
                          http://pfam.sanger.ac.uk/static/documents/schemas/pfam_family.xsd"
      release="24.0"
      release_date="2009-10-07">
  <entry entry_type="Pfam-A" accession="PF02171" id="Piwi">
    ...

Note: we have recently changed the web server that we use for serving the Pfam site. Due to a bug in the server itself, requests that come from curl are normally rejected. The current work-around is to add an extra parameter to the curl command line: -H 'Expect:'. This should avoid problems with requests being rejected.

Using a script

Most programming languages have the ability to send HTTP requests and receive HTTP responses. A Perl script to retrieve data about a Pfam family might be as trivial as this:

Example
#!/usr/bin/perl

use strict;
use warnings;

use LWP::UserAgent;

my $ua = LWP::UserAgent->new;
$ua->env_proxy;

my $res = $ua->get( 'http://pfam.janelia.org/family/Piwi?output=xml' );

if ( $res->is_success ) {
  print $res->content;
}
else {
  print STDERR $res->status_line, "\n";
}

Retrieving data

Although XML is just plain text and therefore human-readable, it's intended to be parsed into a data structure. Extending the Perl script above, we can add the ability to parse the XML using an external Perl module, XML::LibXML:

Example
#!/usr/bin/perl

use strict;
use warnings;

use LWP::UserAgent;
use XML::LibXML;

my $ua = LWP::UserAgent->new;
$ua->env_proxy;

my $res = $ua->get( 'http://pfam.janelia.org/family/Piwi?output=xml' );

die "error: failed to retrieve XML: " . $res->status_line . "\n"
  unless $res->is_success;

my $xml = $res->content;

my $xml_parser = XML::LibXML->new();
my $dom = $xml_parser->parse_string( $xml );

my $root = $dom->documentElement();
my ( $entry ) = $root->getChildrenByTagName( 'entry' );

print 'accession: ' . $entry->getAttribute( 'accession' ) . "\n";

This script now prints out the accession for the family "Piwi" (PF02171).
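The same retrieve-and-parse pattern works in other languages. As a rough Python equivalent, using only the standard library (the function names are hypothetical, and the embedded sample is a cut-down copy of the response shown earlier, so that the parsing step can be tried without contacting the server):

```python
import urllib.request
import xml.etree.ElementTree as ET

# The XML uses a default namespace, so ElementTree needs it spelled out
PFAM_NS = {"pfam": "http://pfam.sanger.ac.uk/"}


def fetch_family_xml(entry, host="pfam.janelia.org"):
    """Retrieve the XML description of a Pfam-A family (needs network access)."""
    url = f"http://{host}/family/{entry}?output=xml"
    with urllib.request.urlopen(url) as response:
        return response.read()


def family_accession(xml_bytes):
    """Extract the accession attribute of the <entry> element."""
    root = ET.fromstring(xml_bytes)
    entry = root.find("pfam:entry", PFAM_NS)
    if entry is None:
        raise ValueError("no <entry> element found")
    return entry.get("accession")


# Cut-down sample response, for trying the parser offline
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<pfam xmlns="http://pfam.sanger.ac.uk/" release="24.0">
  <entry entry_type="Pfam-A" accession="PF02171" id="Piwi" />
</pfam>"""

print("accession:", family_accession(SAMPLE))
```

To query the live site instead, pass the result of fetch_family_xml("Piwi") to family_accession.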


Available services

The following is a list of the sections of the website which are currently available as RESTful services.

Pfam ID/accession conversion

This is a simple service to return the accession and ID for a Pfam family, given either the ID or accession as input. Any of the following URLs will return the same simple XML document:

http://pfam.janelia.org/family/acc?id=Piwi&output=xml
http://pfam.janelia.org/family/Piwi/acc?output=xml
http://pfam.janelia.org/family/id?output=xml&acc=PF02171
http://pfam.janelia.org/family/Piwi/id?output=xml
http://pfam.janelia.org/family?entry=Piwi&output=xml
Example
shell% curl -LH 'Expect:' -F output=xml 'http://pfam.janelia.org/family/Piwi/acc'
<?xml version="1.0" encoding="UTF-8"?>
<!-- information on Pfam-A family PF02171 (Piwi), generated: 16:37:09 26-Oct-2009 -->
<pfam xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xmlns="http://pfam.sanger.ac.uk/"
      xsi:schemaLocation="http://pfam.sanger.ac.uk/
                          http://pfam.sanger.ac.uk/static/documents/schemas/pfam_family.xsd"
      release="24.0"
      release_date="2009-10-07">
  <entry entry_type="Pfam-A" accession="PF02171" id="Piwi" />
</pfam>%

You can see the XML schema for this XML document here.

Note that, as a convenience, you can also omit the output=xml parameter and the response will contain only the ID or accession, as a plain text string:

Example
shell% curl -LH 'Expect:' 'http://pfam.janelia.org/family/Piwi/acc'
PF02171
shell% curl -LH 'Expect:' 'http://pfam.janelia.org/family/PF02171/id'
Piwi

Pfam-A annotations

You can retrieve a sub-set of the data in a Pfam-A family page as an XML document using any of the following styles of URL:

http://pfam.janelia.org/family?id=Piwi&output=xml
http://pfam.janelia.org/family?output=xml&acc=PF02171
http://pfam.janelia.org/family?entry=Piwi&output=xml
http://pfam.janelia.org/family/Piwi?output=xml

The last two styles, using the entry parameter or an extended URL, accept either accessions or identifiers. The accession/ID is case-insensitive in all cases.

Example
shell% curl -LH 'Expect:' -F output=xml 'http://pfam.janelia.org/family/Piwi'
<?xml version="1.0" encoding="UTF-8"?>
<!-- information on Pfam-A family PF02171 (Piwi), generated: 16:35:52 26-Oct-2009 -->
<pfam xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xmlns="http://pfam.sanger.ac.uk/"
      xsi:schemaLocation="http://pfam.sanger.ac.uk/
                          http://pfam.sanger.ac.uk/static/documents/schemas/pfam_family.xsd"
      release="24.0"
      release_date="2009-10-07">
  <entry entry_type="Pfam-A" accession="PF02171" id="Piwi">
    <description>
<![CDATA[
Piwi domain
]]>
    </description>
    <comment>
<![CDATA[
This domain is found in the protein Piwi and its relatives.  The function of this 
domain is the dsRNA guided hydrolysis of ssRNA. Determination of the crystal 
structure of Argonaute reveals that PIWI is an RNase H domain, and identifies 
Argonaute as Slicer, the enzyme that cleaves mRNA in the RNAi RISC complex [2].  
In addition, Mg+2 dependence and production of 3'-OH and 5' phosphate products 
are shared characteristics of RNaseH and RISC. The PIWI domain core has a tertiary 
structure belonging to the RNase H family of enzymes.  RNase H fold proteins all 
have a five-stranded mixed beta-sheet surrounded by helices. By analogy to 
RNase H enzymes which cleave single-stranded RNA guided by the DNA strand in an 
RNA/DNA hybrid, the PIWI domain can be inferred to cleave single-stranded RNA, 
for example mRNA, guided by double stranded siRNA.
]]>
    </comment>
    <curation_details>
      <status>CHANGED</status>
      <seed_source>Bateman A</seed_source>
      <num_archs>16</num_archs>
      <num_seqs>
        <seed>21</seed>
        <full>756</full>
      </num_seqs>
      <num_species>140</num_species>
      <num_structures>22</num_structures>
      <percentage_identity>30</percentage_identity>
      <av_length>277.50</av_length>
      <av_coverage>33.67</av_coverage>
      <type>Family</type>
    </curation_details>
    <hmm_details hmmer_version="3.0b2" model_version="10" model_length="304">
      <build_commands>hmmbuild  -o /dev/null HMM SEED</build_commands>
      <search_commands>hmmsearch -Z 9421015 -E 1000 HMM pfamseq</search_commands>
      <cutoffs>
        <gathering>
          <sequence>19.9</sequence>
          <domain>19.9</domain>
        </gathering>
        <trusted>
          <sequence>20.0</sequence>
          <domain>21.0</domain>
        </trusted>
        <noise>
          <sequence>18.6</sequence>
          <domain>19.5</domain>
        </noise>
      </cutoffs>
    </hmm_details>
  </entry>
</pfam>

You can see the XML schema for this XML document here.
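
If you want to pick individual values out of the family XML in a script, something like the following Python sketch may help. It uses only the standard library and assumes the element names and the namespace shown in the example above; the function name parse_family and the trimmed SAMPLE document are ours, for illustration only.

```python
import xml.etree.ElementTree as ET

# Namespace taken from the xmlns attribute in the example response.
NS = {"p": "http://pfam.sanger.ac.uk/"}

# Trimmed copy of the example document above (illustration only).
SAMPLE = """<pfam xmlns="http://pfam.sanger.ac.uk/" release="24.0">
  <entry entry_type="Pfam-A" accession="PF02171" id="Piwi">
    <description><![CDATA[
Piwi domain
]]></description>
    <curation_details><type>Family</type></curation_details>
    <hmm_details hmmer_version="3.0b2" model_version="10" model_length="304">
      <cutoffs>
        <gathering><sequence>19.9</sequence><domain>19.9</domain></gathering>
      </cutoffs>
    </hmm_details>
  </entry>
</pfam>"""


def parse_family(xml_text):
    """Return a few selected fields from a Pfam-A family XML document."""
    root = ET.fromstring(xml_text.strip())
    entry = root.find("p:entry", NS)
    cutoffs = entry.find("p:hmm_details/p:cutoffs", NS)
    return {
        "accession": entry.get("accession"),
        "id": entry.get("id"),
        # The description is wrapped in CDATA with surrounding newlines.
        "description": entry.findtext("p:description", default="", namespaces=NS).strip(),
        "type": entry.findtext("p:curation_details/p:type", namespaces=NS),
        # (sequence, domain) gathering thresholds, in bits.
        "gathering": (
            float(cutoffs.findtext("p:gathering/p:sequence", namespaces=NS)),
            float(cutoffs.findtext("p:gathering/p:domain", namespaces=NS)),
        ),
    }


if __name__ == "__main__":
    print(parse_family(SAMPLE))
```

The same approach should extend to the trusted and noise cut-offs, which sit alongside the gathering element.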

Some Pfam families are removed or merged into others, in which case they become "dead" families. If you try to retrieve annotation information about a dead family, you'll get a simple XML document that only includes information on the replacement (if any) for the family:

Example
shell% curl -LH 'Expect:' -F output=xml 'http://pfam.janelia.org/family/PF06700'
<?xml version="1.0" encoding="UTF-8"?>
<!-- information on dead Pfam-A family PF06700 (2oxo_fer_oxidoB), generated: 16:34:44 26-Oct-2009 -->
<dead_pfam xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
           xmlns="http://pfam.sanger.ac.uk/"
           xsi:schemaLocation="http://pfam.sanger.ac.uk/
                               http://pfam.sanger.ac.uk/static/documents/schemas/pfam_family.xsd"
           release="24.0"
           release_date="2009-10-07">
  <entry accession="PF06700"
         id="2oxo_fer_oxidoB">
    <forward_to>PF02775</forward_to>
    <comment>Merged into TPP binding domain</comment>
  </entry>
</dead_pfam>

You can see the XML schema for this XML document here.
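
A script can tell a dead family from a live one by looking at the root element of the response. The Python sketch below assumes the root element is dead_pfam for dead families, as in the example above; the function name replacement_for and the two sample strings are hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace prefix in ElementTree's {uri}tag notation.
NS = "{http://pfam.sanger.ac.uk/}"

# Trimmed copies of the example documents (illustration only).
DEAD = """<dead_pfam xmlns="http://pfam.sanger.ac.uk/" release="24.0">
  <entry accession="PF06700" id="2oxo_fer_oxidoB">
    <forward_to>PF02775</forward_to>
    <comment>Merged into TPP binding domain</comment>
  </entry>
</dead_pfam>"""

LIVE = """<pfam xmlns="http://pfam.sanger.ac.uk/" release="24.0">
  <entry entry_type="Pfam-A" accession="PF02171" id="Piwi"/>
</pfam>"""


def replacement_for(xml_text):
    """Return (is_dead, forward_to); forward_to is None if there is no replacement."""
    root = ET.fromstring(xml_text.strip())
    if root.tag != NS + "dead_pfam":
        return (False, None)
    forward = root.findtext(f"{NS}entry/{NS}forward_to")
    forward = forward.strip() if forward else ""
    return (True, forward or None)


if __name__ == "__main__":
    print(replacement_for(DEAD))
    print(replacement_for(LIVE))
```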

Pfam-A family list

You can retrieve a list of all Pfam-A families in the latest Pfam release, either as an XML document or as a tab-delimited text file. Both formats contain the Pfam-A accession, Pfam-A identifier and description:

http://pfam.janelia.org/families?output=xml
http://pfam.janelia.org/families?output=text

You can also view the list in a web browser by removing the output=xml parameter from the URL.

Example
shell% curl -LH 'Expect:' -F output=xml 'http://pfam.janelia.org/families'
<?xml version="1.0" encoding="UTF-8"?>
<!-- all Pfam-A families, generated: 16:12:54 26-Oct-2009 -->
<pfam xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xmlns="http://pfam.sanger.ac.uk/"
      xsi:schemaLocation="http://pfam.sanger.ac.uk/
                          http://pfam.sanger.ac.uk/static/documents/schemas/pfam_families.xsd"
      release="24.0" 
      release_date="2009-10-07">
  <entry entry_type="Pfam-A" accession="PF00001" id="7tm_1">
    <description>
<![CDATA[
7 transmembrane receptor (rhodopsin family)
]]>
    </description>
  </entry>
  ...

You can see the XML schema for this XML document here.
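
The tab-delimited form of the list is often easier to handle in a script than the XML. The Python sketch below assumes that each line holds the accession, identifier and description in that order, separated by tabs, and that any line starting with "#" is a comment; neither assumption is confirmed by the example above, so check them against a real download. The function name parse_family_list is hypothetical.

```python
def parse_family_list(text):
    """Parse the tab-delimited family list into (accession, id, description) tuples."""
    families = []
    for line in text.splitlines():
        line = line.rstrip("\r\n")
        # Skip blank lines and (assumed) comment lines.
        if not line.strip() or line.startswith("#"):
            continue
        # Split on the first two tabs only, so the description is kept whole.
        fields = line.split("\t", 2)
        if len(fields) != 3:
            raise ValueError(f"unexpected line: {line!r}")
        families.append(tuple(field.strip() for field in fields))
    return families


if __name__ == "__main__":
    sample = (
        "# assumed comment line\n"
        "PF00001\t7tm_1\t7 transmembrane receptor (rhodopsin family)\n"
        "PF02171\tPiwi\tPiwi domain\n"
    )
    for accession, identifier, description in parse_family_list(sample):
        print(accession, identifier, description)
```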

Protein sequence data

You can retrieve a sub-set of the data in a protein page as an XML document using any of the following styles of URL:

http://pfam.janelia.org/protein?id=CANX_CHICK&output=xml
http://pfam.janelia.org/protein?output=xml&acc=P00789
http://pfam.janelia.org/protein?entry=P00789&output=xml
http://pfam.janelia.org/protein/P00789?output=xml

As for Pfam-A families, arguments are all case-insensitive and the entry parameter accepts either ID or accession.

Example
shell% curl -LH 'Expect:' -F output=xml 'http://pfam.janelia.org/protein/P00789'
<?xml version="1.0" encoding="UTF-8"?>
<!-- information on UniProt entry P00789 (CANX_CHICK), generated: 16:28:26 26-Oct-2009 -->
<pfam xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xmlns="http://pfam.sanger.ac.uk/"
      xsi:schemaLocation="http://pfam.sanger.ac.uk/
                          http://pfam.sanger.ac.uk/static/documents/schemas/protein.xsd"
      release="24.0"
      release_date="2009-10-07">
  <entry entry_type="sequence" db="uniprot" db_release="57.6" accession="P00789" id="CANX_CHICK">
    <description>
<![CDATA[
Calpain-1 catalytic subunit EC=3.4.22.52
]]>
    </description>
    <taxonomy tax_id="9031" species_name="Gallus gallus (Chicken)">
      Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Archosauria; 
      Dinosauria; Saurischia; Theropoda; Coelurosauria; Aves; Neognathae; Galliformes; 
      Phasianidae; Phasianinae; Gallus.
    </taxonomy>
    <sequence length="705" md5="934014b14ecb71623fa5898c7f81862a" crc64="ABCDDC56298E48AA" version="2">
      MMPFGGIAARLQRDRLRAEGVGEHNNAVKYLNQDYEALKQECIESGTLFRDPQFPAGPTALGFKELGPYSSKTR
      GVEWKRPSELVDDPQFIVGGATRTDICQGALGDCWLLAAIGSLTLNEELLHRVVPHGQSFQEDYAGIFHFQIWQ
      FGEWVDVVVDDLLPTKDGELLFVHSAECTEFWSALLEKAYAKLNGCYESLSGGSTTEGFEDFTGGVAEMYDLKR
      APRNMGHIIRKALERGSLLGCSIDITSAFDMEAVTFKKLVKGHAYSVTAFKDVNYRGQQEQLIRIRNPWGQVEW
      TGAWSDGSSEWDNIDPSDREELQLKMEDGEFWMSFRDFMREFSRLEICNLTPDALTKDELSRWHTQVFEGTWRR
      GSTAGGCRNNPATFWINPQFKIKLLEEDDDPGDDEVACSFLVALMQKHRRRERRVGGDMHTIGFAVYEVPEEAQ
      GSQNVHLKKDFFLRNQSRARSETFINLREVSNQIRLPPGEYIVVPSTFEPHKEADFILRVFTEKQSDTAELDEE
      ISADLADEEEITEDDIEDGFKNMFQQLAGEDMEISVFELKTILNRVIARHKDLKTDGFSLDSCRNMVNLMDKDG
      SARLGLVEFQILWNKIRSWLTIFRQYDLDKSGTMSSYEMRMALESAGFKLNNKLHQVVVARYADAETGVDFDNF
      VCCLVKLETMFRFFHSMDRDGTGTAVMNLAEWLLLTMCG
    </sequence>
    <matches>
      <match accession="PF00648" id="Peptidase_C2" type="Pfam-A">
        <location start="48" end="347" ali_start="48" ali_end="347" hmm_start="1" hmm_end="298" evalue="2.6e-148" bitscore="502.00" />
      </match>
      <match accession="PF01067" id="Calpain_III" type="Pfam-A">
        <location start="358" end="513" ali_start="358" ali_end="512" hmm_start="1" hmm_end="144" evalue="3.5e-57" bitscore="201.20" />
      </match>
    </matches>
  </entry>
</pfam>

You can see the XML schema for this XML document here.

Sequence searches

The Pfam website includes a form that allows users to upload a protein sequence and see a list of the Pfam domains that are found on their search sequence. We've now implemented a RESTful interface to this search tool, making it possible to run single-sequence Pfam searches programmatically.

Running a search is a two-step process:

  1. submit the search sequence and specify search parameters
  2. retrieve search results in XML format

The reason for separating the operation into two steps rather than performing a search in a single operation is that the time taken to perform a sequence search will vary according to the length of the sequence searched. Most web clients, browsers or scripts, will simply time-out if a response is not received within a short time period, usually less than a minute. By submitting a search, waiting and then retrieving results as a separate operation, we avoid the risk of a client reaching a time-out before the results are returned.

The following example uses simple command-line tools to submit the search and retrieve results, but the whole process is easily transferred to a single script or program.

Save your sequence to file

It is usually most convenient to save your sequence into a plain text file, something like this:

Example
shell% cat test.seq 
MMASTENNEKDNFMRDTASRSKKSRRRSLWIAAGAVPTAIALSLSLASPA
AVAQSSFGSSDIIDSGVLDSITRGLTDYLTPRDEALPAGEVTYPAIEGLP
AGVRVNSAEYVTSHHVVLSIQSAAMPERPIKVQLLLPRDWYSSPDRDFPE
IWALDGLRAIEKQSGWTIETNIEQFFADKNAIVVLPVGGESSFYTDWNEP
NNGKNYQWETFLTEELAPILDKGFRSNGERAITGISMGGTAAVNIATHNP
EMFNFVGSFSGYLDTTSNGMPAAIGAALADAGGYNVNAMWGPAGSERWLE
NDPKRNVDQLRGKQVYVSAGSGADDYGQDGSVATGPANAAGVGLELISRM
TSQTFVDAANGAGVNVIANFRPSGVHAWPYWQFEMTQAWPYMADSLGMSR
EDRGADCVALGAIADATADGSLGSCLNNEYLVANGVGRAQDFTNGRAYWS
PNTGAFGLFGRINARYSELGGPDSWLGFPKTRELSTPDGRGRYVHFENGS
IYWSAATGPWEIPGDMFTAWGTQGYEAGGLGYPVGPAKDFNGGLAQEFQG
GYVLRTPQNRAYWVRGAISAKYMEPGVATTLGFPTGNERLIPGGAFQEFT
NGNIYWSASTGAHYILRGGIFDAWGAKGYEQGEYGWPTTDQTSIAAGGET
ITFQNGTIRQVNGRIEESR

The sequence should contain only valid sequence characters, i.e. letters, excluding "J" and "O". You can break the sequence across multiple lines to make it easier to handle.
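
If you are generating sequences in a script, it may be worth checking them before submission. The Python sketch below joins a multi-line sequence and rejects anything other than the letters A to Z without "J" and "O". Upper-casing the input is our assumption, and the function name clean_sequence is hypothetical.

```python
def clean_sequence(raw):
    """Join a multi-line sequence and check that it contains only valid characters."""
    # Remove line breaks and other whitespace; upper-casing is an assumption.
    seq = "".join(raw.split()).upper()
    if not seq:
        raise ValueError("empty sequence")
    # Valid characters are letters, excluding "J" and "O".
    bad = sorted({c for c in seq if not ("A" <= c <= "Z") or c in "JO"})
    if bad:
        raise ValueError("invalid sequence characters: " + "".join(bad))
    return seq


if __name__ == "__main__":
    print(clean_sequence("MMASTENNEKDNFMRDTASR\nSKKSRRRSLWIAAGAVPTAI"))
```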

Submit the search

Example
shell% curl -LH 'Expect:' -F seq='<test.seq' -F output=xml 'http://pfam.sanger.ac.uk/search/sequence'
<?xml version="1.0" encoding="UTF-8"?>
<jobs xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xmlns="http://pfam.sanger.ac.uk/"
      xsi:schemaLocation="http://pfam.sanger.ac.uk/
                          http://pfam.sanger.ac.uk/static/documents/schemas/submission.xsd">
  <job job_id="adabec68-703f-48c4-bec7-07f1ab965fbb">
    <opened>2010-03-26 11:25:57</opened>
    <result_url>http://pfam.sanger.ac.uk/search/sequence/resultset/adabec68-703f-48c4-bec7-07f1ab965fbb?output=xml</result_url>
  </job>
</jobs>

You can see the XML schema for this XML document here.

When using curl the value of the parameter "seq" needs to be quoted so that its value is taken correctly from the file "test.seq". The second parameter can also be added directly to the URL, as a regular CGI-style parameter, if you prefer.
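
In a script, the job identifier and result URL can be read from the submission response as follows. This is a minimal Python sketch that assumes the jobs/job/result_url layout shown in the example above; the function name parse_submission and the SAMPLE string are ours.

```python
import xml.etree.ElementTree as ET

# Namespace taken from the xmlns attribute in the example response.
NS = {"p": "http://pfam.sanger.ac.uk/"}

# Trimmed copy of the example submission response (illustration only).
SAMPLE = """<jobs xmlns="http://pfam.sanger.ac.uk/">
  <job job_id="adabec68-703f-48c4-bec7-07f1ab965fbb">
    <opened>2010-03-26 11:25:57</opened>
    <result_url>http://pfam.sanger.ac.uk/search/sequence/resultset/adabec68-703f-48c4-bec7-07f1ab965fbb?output=xml</result_url>
  </job>
</jobs>"""


def parse_submission(xml_text):
    """Return a list of (job_id, result_url) pairs, one per job element."""
    root = ET.fromstring(xml_text.strip())
    jobs = []
    for job in root.iterfind("p:job", NS):
        url = job.findtext("p:result_url", default="", namespaces=NS).strip()
        jobs.append((job.get("job_id"), url))
    return jobs


if __name__ == "__main__":
    for job_id, url in parse_submission(SAMPLE):
        print(job_id, url)
```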

The search service accepts the following parameters (you can see a more complete description of these settings here):

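+-----------+------------------------------+---------------------------+---------+
| Parameter | Description                  | Accepted values           | Default |
+-----------+------------------------------+---------------------------+---------+
| evalue    | use this E-value cut-off     | valid float < 10.0        | none    |
| ga        | use gathering threshold      | 0 or 1                    | 1       |
| searchBs  | do search for Pfam-B hits    | 0 or 1                    | 0       |
| skipAs    | don't search for Pfam-A hits | 0 or 1                    | 0       |
| seq       | protein sequence             | valid sequence characters | none    |
+-----------+------------------------------+---------------------------+---------+

Notes:

evalue: the default is to have no E-value and to use the gathering threshold. See note below. If an E-value is given, it will be used, regardless of the value of "ga".
searchBs: setting "skipAs=1" implies "searchBs=1"; you must search for at least one type of family.
seq: required.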

Note: this documentation previously suggested that searches submitted through the RESTful interface used an E-value cut-off of 1.0 by default. This is incorrect. RESTful searches use the gathering threshold and not an E-value of 1.0. This is the opposite of the behaviour of the searches run through the web interface. We apologise for the inconsistency.
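
The rules above can be collected into a small helper that builds the form fields for a submission. This is only a sketch: the function name build_search_params and its keyword arguments are hypothetical, and it does no more than check the E-value limit and make sure that at least one type of family is searched.

```python
def build_search_params(seq, evalue=None, ga=1, search_bs=0, skip_as=0):
    """Build the form fields for a sequence search submission."""
    if not seq:
        raise ValueError("the seq parameter is required")
    # At least one type of family must be searched, so skipping
    # Pfam-A families turns the Pfam-B search on.
    if skip_as:
        search_bs = 1
    params = {
        "seq": seq,
        "output": "xml",
        "ga": 1 if ga else 0,
        "searchBs": 1 if search_bs else 0,
        "skipAs": 1 if skip_as else 0,
    }
    # No E-value by default: the gathering threshold is used instead.
    if evalue is not None:
        evalue = float(evalue)
        if not evalue < 10.0:
            raise ValueError("evalue must be less than 10.0")
        params["evalue"] = evalue
    return params


if __name__ == "__main__":
    print(build_search_params("MMASTENNEK", evalue=0.01))
```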

Wait for the search to complete

Although you can check for results immediately, if you poll before your job has completed, you won't receive an XML document. Instead, the HTTP response to your request will have its status set appropriately and the body of the response will contain only a string giving the status. You should ideally check the HTTP status of the response, rather than relying on the body of the response.

These are the possible status codes for the response:

+------------------+-----------------------+---------------+
| HTTP status code | Status description    | Response body |
+------------------+-----------------------+---------------+
| 202              | Accepted              | PEND / RUN    |
| 502              | Bad gateway           | FAIL          |
| 503              | Service unavailable   | HOLD          |
| 410              | Gone                  | DEL           |
| 500              | Internal server error | Error message |
+------------------+-----------------------+---------------+

Notes:

202: The job has been accepted by the search system and is either pending (waiting to be started) or running. After a short delay, your script should check for results again.
502: There was a problem scheduling or running the job. The job has failed and will not produce results. There is no need to check the status again.
503: Your job was accepted but is on hold. This status will not be assigned by the search system, but by an administrator. There is probably a problem with the job and you should contact the help desk for assistance with it.
410: Your job was deleted from the search system. This status will not be assigned by the search system, but by an administrator. There was probably a problem with the job and you should contact the help desk for assistance with it.
500: There was some problem with running your job, but it does not fall into any of the other categories. The body of the response will contain an error message from the server. Contact the help desk for assistance with the problem.

When writing a script to submit searches and retrieve results, please add a short delay between the submission and the first attempt to retrieve results. Most search jobs are returned within four to five seconds of submission, depending greatly on the length of the sequence to be searched.
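
A polling loop along these lines might look like the Python sketch below. The HTTP request itself is passed in as a function, fetch, which must return a (status, body) pair, so the loop can be used with whichever HTTP client you prefer. We assume that a completed job answers with HTTP status 200, which is not stated above; the function name poll_for_results and the default delay and retry count are also ours.

```python
import time


def poll_for_results(fetch, url, delay=5.0, max_tries=20, sleep=time.sleep):
    """Poll a result URL until the job completes, fails or we give up.

    fetch(url) must return a (status, body) tuple.
    """
    for _ in range(max_tries):
        # Wait before every attempt, including the first one.
        sleep(delay)
        status, body = fetch(url)
        if status == 200:   # assumed status for a completed job
            return body
        if status == 202:   # PEND / RUN: check again after a delay
            continue
        # 502, 503, 410, 500 or anything else: polling again will not help.
        raise RuntimeError(f"search job failed: HTTP {status}, body {body!r}")
    raise TimeoutError(f"no results after {max_tries} attempts")


if __name__ == "__main__":
    # Stand-in for a real HTTP client: two "still running" answers, then results.
    answers = iter([(202, "PEND"), (202, "RUN"), (200, "<pfam/>")])
    print(poll_for_results(lambda u: next(answers), "http://example.invalid/job", delay=0))
```

Keeping the HTTP call separate from the loop also makes the status handling easy to test without contacting the server.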

Retrieve results

The XML that was returned from the first query includes one or more URLs from which you can now retrieve results, each given in a <result_url> element. You can now poll these URLs to retrieve XML documents with the search hits.

Example
shell% curl -LH 'Expect:' 'http://pfam.sanger.ac.uk/search/sequence/resultset/adabec68-703f-48c4-bec7-07f1ab965fbb?output=xml'
<?xml version="1.0" encoding="UTF-8"?>
<pfam xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xmlns="http://pfam.sanger.ac.uk/"
      xsi:schemaLocation="http://pfam.sanger.ac.uk/
                          http://pfam.sanger.ac.uk/static/documents/schemas/results.xsd"
      release="24.0"
      release_date="2009-10-07">
  <results job_id="adabec68-703f-48c4-bec7-07f1ab965fbb">
    <matches>
      <protein length="669">
        <database id="pfam" release="24.0" release_date="2009-10-07">
          <match accession="PF08310.4" id="LGFP" type="Pfam-A" class="Repeat">
            <location start="422" end="475" ali_start="422" ali_end="473" 
              hmm_start="1" hmm_end="52" evalue="5.9e-10" bitscore="38.3" evidence="hmmer v3.0b2" significant="1" />
            <location start="476" end="529" ali_start="476" ali_end="528" 
              hmm_start="1" hmm_end="53" evalue="4.8e-23" bitscore="80.2" evidence="hmmer v3.0b2" significant="1" />
            <location start="530" end="580" ali_start="530" ali_end="578" 
              hmm_start="1" hmm_end="51" evalue="3e-06" bitscore="26.4" evidence="hmmer v3.0b2" significant="1" />
            <location start="581" end="633" ali_start="581" ali_end="633" 
              hmm_start="1" hmm_end="54" evalue="1e-19" bitscore="69.5" evidence="hmmer v3.0b2" significant="1" />
          </match>
          <match accession="PF00756.13" id="Esterase" type="Pfam-A" class="Family">
            <location start="122" end="392" ali_start="122" ali_end="390" 
              hmm_start="1" hmm_end="250" evalue="3.1e-62" bitscore="209.9" evidence="hmmer v3.0b2" significant="1" />
          </match>
        </database>
      </protein>
    </matches>
  </results>
</pfam>

You can see the XML schema for this XML document here.
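
To collect the hits from the results document in a script, you could use something like the following Python sketch. It assumes the match and location elements shown in the example above and keeps only locations flagged with significant="1"; the function name significant_hits and the trimmed SAMPLE are hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace taken from the xmlns attribute in the example response.
NS = {"p": "http://pfam.sanger.ac.uk/"}

# Trimmed copy of the example results (illustration only).
SAMPLE = """<pfam xmlns="http://pfam.sanger.ac.uk/" release="24.0">
  <results job_id="adabec68-703f-48c4-bec7-07f1ab965fbb">
    <matches><protein length="669"><database id="pfam" release="24.0">
      <match accession="PF08310.4" id="LGFP" type="Pfam-A" class="Repeat">
        <location start="422" end="475" evalue="5.9e-10" bitscore="38.3" significant="1" />
        <location start="476" end="529" evalue="4.8e-23" bitscore="80.2" significant="1" />
      </match>
      <match accession="PF00756.13" id="Esterase" type="Pfam-A" class="Family">
        <location start="122" end="392" evalue="3.1e-62" bitscore="209.9" significant="1" />
      </match>
    </database></protein></matches>
  </results>
</pfam>"""


def significant_hits(xml_text):
    """Map each family ID to a list of (start, end, evalue) for significant hits."""
    root = ET.fromstring(xml_text.strip())
    hits = {}
    for match in root.iterfind(".//p:match", NS):
        for loc in match.iterfind("p:location", NS):
            if loc.get("significant") != "1":
                continue
            region = (int(loc.get("start")), int(loc.get("end")), float(loc.get("evalue")))
            hits.setdefault(match.get("id"), []).append(region)
    return hits


if __name__ == "__main__":
    for family, regions in significant_hits(SAMPLE).items():
        print(family, regions)
```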

Since the search is performed by the same server as searches in the Pfam website, you can view your results in a web page by modifying the URL slightly:

http://pfam.janelia.org/search/sequence/resultset/adabec68-703f-48c4-bec7-07f1ab965fbb

Note that old search results are generally cleared out after some time, so if you wait too long before trying to view your hits in the website, you may find that they are already gone.

Retrieve domain graphics description

When you run a sequence search via the browser, the results page includes a Pfam domain graphic, showing the locations of any matching Pfam families on your search sequence. When running a search via the RESTful interface, you can't retrieve the domain graphic directly, since it's generated using a javascript class in the browser. However, you can retrieve the JSON string that describes the graphic:

http://pfam.janelia.org/search/sequence/graphic/adabec68-703f-48c4-bec7-07f1ab965fbb

Check the domain graphics documentation for details on how you can use the JSON string locally.

The Pfam MySQL database contains all of the data accessible via the website. The database currently consists of 75 tables. Below is some basic documentation on the schema layout and how smaller numbers of tables can be put together to enable access to a subset of the data. At the time of writing, between releases 24.0 and 25.0, the fields within the tables and the results of queries are correct. The data within the tables will change with each release. Although we do not anticipate any major changes to the database, we reserve the right to make changes with or without warning; we will endeavour to update this document if such changes are made.


VERSION table

The VERSION table contains information that relates to a particular Pfam release. It contains the version number of the Pfam database, the version numbers of the Swiss-Prot and TrEMBL databases that were used to build Pfam, and some statistics about the number of families and coverage. This table is stand-alone and does not link to any of the other tables.

Example query Give me all of the version information for the Pfam database
SQL
mysql> SELECT * FROM VERSION \G

                          pfam_release: 24.0
                     pfam_release_date: 2009-09-24
                    swiss_prot_version: 57.6
                        trembl_version: 40.6
                         hmmer_version: 3.0b2
                        pfamA_coverage: 73.7
             pfamB_additional_coverage: 11.8
                pfamA_residue_coverage: 51.2
     pfamB_additional_residue_coverage: 12.4
                       number_families: 11912


Domain information

Two of the central tables in the Pfam database are pfamseq, which contains the UniProtKB sequence database, and pfamA, which contains information about the Pfam-A families. Most of the other tables in the database link to one or both of these tables, either directly or indirectly.

The table pfamA_reg_seed contains the Pfam regions that are present in a seed alignment, and the pfamA_reg_full_significant table contains all of the sequence regions that match the HMM and score above the curated threshold, i.e. are significant matches, for each family. There is also a table named pfamA_reg_full_insignificant which contains, as the name suggests, all the insignificant matches for each family. Insignificant matches are those which match the HMM with an E-value less than 1000, but score below the curated bit score threshold for each family.

The table pfamA_reg_full_significant contains a column called 'in_full'. The matches that are present in the full alignment for a Pfam family have this column set to 1, while those that are not present in the full alignment have the 'in_full' column set to 0. A significant match will only be excluded from the full alignment (in_full = 0) if it matches a family that belongs to a clan, and the match overlaps with another more significant (lower E-value) match to a family within the clan.
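
The rule for the 'in_full' column can be illustrated with a few lines of Python. This is a sketch of the rule as described in the paragraph above, not the code that is used to build Pfam; the Match type, the function names and the toy matches (FamA, FamB, FamC and the clan label CL_X) are all invented for the example.

```python
from collections import namedtuple

# One significant match; clan is None if the family is not in a clan.
Match = namedtuple("Match", "family clan start end evalue")


def overlaps(a, b):
    """True if the two regions share at least one residue."""
    return a.start <= b.end and b.start <= a.end


def mark_in_full(matches):
    """Return the in_full flag (1 or 0) for each match, in input order."""
    flags = []
    for m in matches:
        outcompeted = any(
            other is not m
            and m.clan is not None
            and other.clan == m.clan
            and overlaps(m, other)
            and other.evalue < m.evalue   # lower E-value is more significant
            for other in matches
        )
        flags.append(0 if outcompeted else 1)
    return flags


if __name__ == "__main__":
    toy = [
        Match("FamA", "CL_X", 10, 100, 1e-20),
        Match("FamB", "CL_X", 50, 120, 1e-5),   # overlaps FamA and is weaker
        Match("FamC", None, 60, 90, 1e-3),      # not in a clan, so never excluded
    ]
    print(mark_in_full(toy))
```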

For each sequence match we store two sets of coordinates, the envelope coordinates and the alignment coordinates. The envelope coordinates delineate the region on the sequence where the match has been probabilistically determined to lie, whereas the alignment coordinates delineate where HMMER is confident that the alignment of the sequence to the profile HMM is correct. Our full alignments contain the envelope coordinates. In the database, the envelope start and end positions are stored in the seq_start and seq_end columns, and the alignment coordinates are stored in the ali_start and ali_end columns.

The Pfam database has historically been built on the UniProtKB database. However, as of release 22.0 we also provide Pfam domain data for the NCBI sequence database (GenPept) and a set of metagenomics sequences. Further information about querying the NCBI and metagenomics data sets can be found below.

Example query Give me all of the domains for UniProtKB protein sequence 'VAV_HUMAN'
SQL
mysql> SELECT pfamA_acc, \
              pfamA_id, \
              seq_start, \
              seq_end \
       FROM   pfamseq, \
              pfamA, \
              pfamA_reg_full_significant \
       WHERE  pfamseq_id = 'VAV_HUMAN' \
       AND    in_full = 1 \
       AND    pfamseq.auto_pfamseq = pfamA_reg_full_significant.auto_pfamseq \
       AND    pfamA_reg_full_significant.auto_pfamA = pfamA.auto_pfamA;

+-----------+----------+-----------+---------+
| pfamA_acc | pfamA_id | seq_start | seq_end |
+-----------+----------+-----------+---------+
| PF00130   | C1_1     |       516 |     568 |
| PF00307   | CH       |         2 |     119 |
| PF00621   | RhoGEF   |       198 |     372 |
| PF00169   | PH       |       403 |     504 |
| PF00017   | SH2      |       671 |     745 |
| PF00018   | SH3_1    |       788 |     834 |
| PF00018   | SH3_1    |       615 |     652 |
+-----------+----------+-----------+---------+

To also report the overlapping domains within clans, leave out the 'in_full = 1' clause. More information on clans can be found later in this document.

Example query Give me all the UniProtKB sequences in the full alignment for the family 'B12D'
SQL
mysql> SELECT pfamseq_id, \
              pfamseq_acc, \
              seq_start, \
              seq_end, \
              pfamA_id \
       FROM   pfamA, \
              pfamseq, \
              pfamA_reg_full_significant \
       WHERE  pfamA_id = 'B12D' \
       AND    in_full = 1 \
       AND    pfamA.auto_pfamA = pfamA_reg_full_significant.auto_pfamA \
       AND    pfamA_reg_full_significant.auto_pfamseq = pfamseq.auto_pfamseq;

+--------------+-------------+-----------+---------+----------+
| pfamseq_id   | pfamseq_acc | seq_start | seq_end | pfamA_id |
+--------------+-------------+-----------+---------+----------+
| C1MQA1_9CHLO | C1MQA1      |        46 |     135 | B12D     |
| A9TC09_PHYPA | A9TC09      |         8 |      93 | B12D     |
| A9NTZ5_PICSI | A9NTZ5      |         7 |      92 | B12D     |
| Q84MX3_ORYSJ | Q84MX3      |         1 |      75 | B12D     |
| B9HTB9_POPTR | B9HTB9      |         3 |      89 | B12D     |
| B6T0R2_MAIZE | B6T0R2      |         1 |      87 | B12D     |
| A6MZE4_ORYSI | A6MZE4      |         3 |      89 | B12D     |
...

Example query Give me all the UniProtKB sequences in the seed alignment for the family 'B12D'
SQL
mysql> SELECT pfamseq_id, \
              pfamseq_acc, \
              seq_start, \
              seq_end, \
              pfamA_id \
       FROM   pfamA, \
              pfamseq, \
              pfamA_reg_seed \
       WHERE  pfamA_id = 'B12D' \
       AND    pfamA.auto_pfamA = pfamA_reg_seed.auto_pfamA \
       AND    pfamA_reg_seed.auto_pfamseq = pfamseq.auto_pfamseq;

+--------------+-------------+-----------+---------+----------+
| pfamseq_id   | pfamseq_acc | seq_start | seq_end | pfamA_id |
+--------------+-------------+-----------+---------+----------+
| Q9XHD5_IPOBA | Q9XHD5      |         3 |      89 | B12D     |
| Q42338_ARATH | Q42338      |         2 |      88 | B12D     |
| Q940E1_CASSA | Q940E1      |        29 |     116 | B12D     |
| Q9LJ47_ARATH | Q9LJ47      |         1 |      87 | B12D     |
| O22414_ORYSA | O22414      |         3 |      89 | B12D     |
| Q6YU35_ORYSJ | Q6YU35      |        11 |      97 | B12D     |
| Q6YU38_ORYSJ | Q6YU38      |         2 |      84 | B12D     |
| Q6Z4G5_ORYSJ | Q6Z4G5      |         1 |      87 | B12D     |
| Q1H8M8_BETVU | Q1H8M8      |         4 |      90 | B12D     |
+--------------+-------------+-----------+---------+----------+


Pfamseq - other tables

Other regions, sites, disulphides

This section contains a few tables that link to the pfamseq table, but don't fit nicely into any of the sections described above.

The pfam_annseq table contains binary Perl data structures which are used internally to generate the Pfam domain graphics. This table is not intended for use by Pfam users, as it is very dependent on Perl module versions.

The evidence table contains the UniProtKB evidence code key that is used in the evidence field in the pfamseq table.

UniProtKB sequences have secondary accessions if they have been merged or split. Secondary accession numbers are stored in the table called secondary_pfamseq_acc.

Example query Give me the secondary accession(s) for the UniProtKB sequence 'P15455'
mysql> SELECT secondary_acc \
       FROM   pfamseq, \
              secondary_pfamseq_acc \
       WHERE  pfamseq.auto_pfamseq = secondary_pfamseq_acc.auto_pfamseq \
       AND    pfamseq_acc= 'P15455';

+---------------+
| secondary_acc |
+---------------+
| Q3E711        |
| Q9FFH7        |
+---------------+


Pfam-B, other sequence regions, active site and disulphide bond information for a sequence

Other regions, sites, disulphides

These tables contain sequence-specific information about the sequences in the UniProtKB database. The other_reg table contains coiled coil, low complexity, signal peptide and transmembrane regions. The context_pfam_regions table contains context domains; context domains are those that do not score above the family gathering threshold, but are expected to be real based on the presence of the surrounding domains found in the protein. The pfamseq_markup table contains active site information which is taken from the UniProtKB feature table. Additional active site residues are predicted by Pfam based on conserved residues in a Pfam alignment. The pfamseq_disulphide table contains disulphide bond information from the UniProtKB feature table.

The pfamseq table also links to the Pfam-B family tables. Pfam-B families are automatically generated protein sequence clusters produced by the ADDA database, and they give an indication of additional conserved regions that are not covered by Pfam-A families. Pfam-B families have an associated alignment, but do not have any annotation or literature references. Unlike Pfam-A alignments, Pfam-B alignments have not been manually checked for quality by a Pfam curator.

Example query Give me all of the transmembrane, signal-peptide, coiled-coil and low-complexity information for the UniProtKB sequence 'VAV_HUMAN'
mysql> SELECT type_id, \
              source_id, \
              seq_start, \
              seq_end, \
              score \
       FROM   other_reg, \
              pfamseq \
       WHERE  pfamseq.pfamseq_id = 'VAV_HUMAN' \
       AND    other_reg.auto_pfamseq = pfamseq.auto_pfamseq;

+----------------+-----------+-----------+---------+--------+
| type_id        | source_id | seq_start | seq_end | score  |
+----------------+-----------+-----------+---------+--------+
| low_complexity | seg       |        42 |      51 | 1.5700 |
| low_complexity | seg       |       356 |     367 | 2.1900 |
+----------------+-----------+-----------+---------+--------+

Example query Give me all of the context regions for the UniProtKB sequence 'O74056'
mysql> SELECT seq_start, \
              seq_end, \
              domain_score, \
              pfamA.pfamA_acc, \
              pfamA_id, \
              pfamA.description \
       FROM   context_pfam_regions, \
              pfamseq, \
              pfamA \
       WHERE  pfamseq.pfamseq_acc = 'O74056' \
       AND    context_pfam_regions.auto_pfamseq = pfamseq.auto_pfamseq \
       AND    pfamA.auto_pfamA = context_pfam_regions.auto_pfamA;

+-----------+---------+--------------+-----------+----------+--------------------------+
| seq_start | seq_end | domain_score | pfamA_acc | pfamA_id | description              |
+-----------+---------+--------------+-----------+----------+--------------------------+
|       602 |     641 |         4.96 | PF00400   | WD40     | WD domain, G-beta repeat |
|       711 |     750 |         7.05 | PF00400   | WD40     | WD domain, G-beta repeat |
|      1420 |    1457 |         9.29 | PF00400   | WD40     | WD domain, G-beta repeat |
|      1824 |    1862 |         8.96 | PF00400   | WD40     | WD domain, G-beta repeat |
+-----------+---------+--------------+-----------+----------+--------------------------+

Example query Give me all of the active site information for UniProtKB sequence 'Q10L43'
mysql> SELECT pfamseq_acc, \
              pfamseq_id, \
              residue, \
              label \
       FROM   pfamseq, \
              pfamseq_markup, \
              markup_key \
       WHERE  pfamseq.auto_pfamseq = pfamseq_markup.auto_pfamseq \
       AND    pfamseq_markup.auto_markup = markup_key.auto_markup \
       AND    pfamseq_acc = 'Q10L43';

+-------------+--------------+---------+----------------------------+
| pfamseq_acc | pfamseq_id   | residue | label                      |
+-------------+--------------+---------+----------------------------+
| Q10L43      | Q10L43_ORYSJ |    1889 | Pfam predicted active site |
| Q10L43      | Q10L43_ORYSJ |    1836 | Pfam predicted active site |
| Q10L43      | Q10L43_ORYSJ |    1819 | Pfam predicted active site |
+-------------+--------------+---------+----------------------------+

Example query Give me all the residues involved in disulphide bonds in the UniProtKB sequence 'Q43495'
mysql> SELECT pfamseq_acc, \
              pfamseq_id, \
              bond_start, \
              bond_end \
       FROM   pfamseq, \
              pfamseq_disulphide \
       WHERE  pfamseq_disulphide.auto_pfamseq = pfamseq.auto_pfamseq \
       AND    pfamseq_acc = 'Q43495';

+-------------+------------+------------+----------+
| pfamseq_acc | pfamseq_id | bond_start | bond_end |
+-------------+------------+------------+----------+
| Q43495      | 108_SOLLC  |         67 |       92 |
| Q43495      | 108_SOLLC  |         51 |       66 |
| Q43495      | 108_SOLLC  |         41 |       77 |
| Q43495      | 108_SOLLC  |         79 |       99 |
+-------------+------------+------------+----------+

Example query Give me all of the Pfam-B regions for the UniProtKB sequence 'SPT13_HUMAN'
mysql> SELECT DISTINCT pfamB.pfamB_acc, \
              pfamB_id, \
              seq_start, \
              seq_end \
       FROM   pfamB_reg, \
              pfamB, \
              pfamseq \
       WHERE  pfamseq_id = 'SPT13_HUMAN' \
       AND    pfamB_reg.auto_pfamseq = pfamseq.auto_pfamseq \
       AND    pfamB_reg.auto_pfamB = pfamB.auto_pfamB;

+-----------+--------------+-----------+---------+
| pfamB_acc | pfamB_id     | seq_start | seq_end |
+-----------+--------------+-----------+---------+
| PB066871  | Pfam-B_66871 |         3 |      72 |
+-----------+--------------+-----------+---------+


Architecture information for a family

Architecture table

In Pfam, an architecture is the combination of domains that are present on a protein. The architecture table can be used to find out which combinations of domains are found on particular sets of proteins, or to find out which proteins share the same domain architecture.
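
As a rough illustration of the idea, the Python sketch below builds an architecture string for a protein from its list of domain matches. It assumes that the domains are listed in order of their start position and that repeated domains are kept, which matches the space-separated form of the architecture_acc column in the examples below but is not a description of how the table is actually populated. The function name architecture_of is hypothetical; the domain coordinates are those returned for 'VAV_HUMAN' by the example query in the domain information section.

```python
def architecture_of(domains):
    """Return a space-separated architecture string for one protein.

    domains is a list of (pfamA_acc, seq_start, seq_end) tuples.
    """
    # Assumption: domains appear in N- to C-terminal order, repeats included.
    ordered = sorted(domains, key=lambda d: d[1])
    return " ".join(acc for acc, _start, _end in ordered)


# Domains of VAV_HUMAN, as listed in the example query output.
VAV_HUMAN = [
    ("PF00130", 516, 568),   # C1_1
    ("PF00307", 2, 119),     # CH
    ("PF00621", 198, 372),   # RhoGEF
    ("PF00169", 403, 504),   # PH
    ("PF00017", 671, 745),   # SH2
    ("PF00018", 788, 834),   # SH3_1
    ("PF00018", 615, 652),   # SH3_1
]

if __name__ == "__main__":
    print(architecture_of(VAV_HUMAN))
```

Two proteins share an architecture, in this simplified sense, when the strings built for them are identical.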

Example query Give me all of the architectures and UniProtKB protein sequences for the family 'Dehyd-heme_bind'
SQL
mysql> SELECT architecture_acc, \
              pfamseq_id, \
              pfamseq_acc \
       FROM   architecture, \
              pfamseq \
       WHERE  architecture like '%Dehyd-heme_bind%' \
       AND    pfamseq.auto_architecture = architecture.auto_architecture;

+-------------------------+--------------+-------------+
| architecture_acc        | pfamseq_id   | pfamseq_acc |
+-------------------------+--------------+-------------+
| PF09098 PF09099 PF09100 | Q8VUT0_PARDE | Q8VUT0      |
| PF09098 PF09099 PF09100 | Q8VW85_PSEPU | Q8VW85      |
| PF09098 PF09099 PF09100 | Q5P0U9_AZOSE | Q5P0U9      |
| PF09098 PF09099 PF09100 | Q5P5Q6_AZOSE | Q5P5Q6      |
| PF09098 PF09099 PF09100 | Q4K966_PSEF5 | Q4K966      |
| PF09098 PF09099 PF09100 | Q3KBY9_PSEPF | Q3KBY9      |
| PF09098 PF09099 PF09100 | Q2BKZ1_9GAMM | Q2BKZ1      |
| PF09098 PF09099 PF09100 | Q2BHT3_9GAMM | Q2BHT3      |
| PF09098 PF09099 PF09100 | A0NPA8_9RHOB | A0NPA8      |
| PF09098 PF09099 PF09100 | Q1I9C7_PSEE4 | Q1I9C7      |
| PF09098 PF09099 PF09100 | A1B2Q6_PARDP | A1B2Q6      |
| PF09098 PF09099 PF09100 | A1K4V1_AZOSB | A1K4V1      |
| PF09098 PF09099 PF09100 | A5W2T6_PSEP1 | A5W2T6      |
| PF09098 PF09099 PF09100 | A8ES16_ARCB4 | A8ES16      |
| PF09098 PF09099 PF09100 | B0KQV9_PSEPG | B0KQV9      |
| PF09098 PF09099 PF09100 | B1J9N8_PSEPW | B1J9N8      |
| PF09098 PF09099 PF09100 | B1N7H2_PSEPU | B1N7H2      |
| PF09098 PF09100         | B6BG33_9RHOB | B6BG33      |
| PF09098 PF09099 PF09100 | B9Z0G3_9NEIS | B9Z0G3      |
| PF09098 PF09099 PF09100 | C4KB05_9RHOO | C4KB05      |
| PF09098 PF09099 PF09100 | C4ZJ19_9RHOO | C4ZJ19      |
+-------------------------+--------------+-------------+
Example query Give me all the sequences which have the architecture 'PF09098 PF09099 PF09100'
SQL
mysql> SELECT pfamseq_acc, \
              pfamseq_id  \
       FROM   architecture,  \
              pfamseq \
       WHERE  architecture.auto_architecture = pfamseq.auto_architecture \
       AND    architecture_acc = "PF09098 PF09099 PF09100";

+-------------+--------------+
| pfamseq_acc | pfamseq_id   |
+-------------+--------------+
| Q8VUT0      | Q8VUT0_PARDE |
| Q8VW85      | Q8VW85_PSEPU |
| Q5P0U9      | Q5P0U9_AZOSE |
| Q5P5Q6      | Q5P5Q6_AZOSE |
| Q4K966      | Q4K966_PSEF5 |
| Q3KBY9      | Q3KBY9_PSEPF |
| Q2BKZ1      | Q2BKZ1_9GAMM |
| Q2BHT3      | Q2BHT3_9GAMM |
| A0NPA8      | A0NPA8_9RHOB |
| Q1I9C7      | Q1I9C7_PSEE4 |
| A1B2Q6      | A1B2Q6_PARDP |
| A1K4V1      | A1K4V1_AZOSB |
| A5W2T6      | A5W2T6_PSEP1 |
| A8ES16      | A8ES16_ARCB4 |
| B0KQV9      | B0KQV9_PSEPG |
| B1J9N8      | B1J9N8_PSEPW |
| B1N7H2      | B1N7H2_PSEPU |
| B9Z0G3      | B9Z0G3_9NEIS |
| C4KB05      | C4KB05_9RHOO |
| C4ZJ19      | C4ZJ19_9RHOO |
+-------------+--------------+
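Judging from the output above, the architecture_acc column holds a space-separated list of Pfam-A accessions in sequence order. The following is a minimal Python sketch, assuming that format, of how exported rows could be grouped by architecture on the client side. The function names are illustrative and not part of Pfam, and the rows are a small subset copied from the first query.

```python
from collections import defaultdict

# A few (architecture_acc, pfamseq_acc) rows copied from the query output above.
rows = [
    ("PF09098 PF09099 PF09100", "Q8VUT0"),
    ("PF09098 PF09099 PF09100", "Q8VW85"),
    ("PF09098 PF09100", "B6BG33"),
    ("PF09098 PF09099 PF09100", "B9Z0G3"),
]


def parse_architecture(architecture_acc):
    """Split an architecture_acc string into an ordered tuple of accessions.

    Assumes accessions are separated by single spaces, as in the output above.
    """
    return tuple(architecture_acc.split())


def group_by_architecture(rows):
    """Map each architecture (tuple of accessions) to the proteins carrying it."""
    groups = defaultdict(list)
    for architecture_acc, pfamseq_acc in rows:
        groups[parse_architecture(architecture_acc)].append(pfamseq_acc)
    return dict(groups)


groups = group_by_architecture(rows)
for architecture, proteins in groups.items():
    print(" ".join(architecture), "->", ", ".join(proteins))
```

Grouping in this way makes it easy to spot sequences such as B6BG33, which matches the family but has a different architecture from the other sequences.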

Annotation information for a family

Literature references

In addition to the Pfam annotation, we also store InterPro annotation and the associated GO terms for each family. Links to other databases (e.g. SCOP) are also stored where appropriate. The pfamA table contains the GA, TC and NC cut-offs for each family, and additional information surrounding the Pfam-A family, including the number of sequences in the seed and full alignment. The pfamA_interactions table contains, where data are available, pairs of interacting Pfam domains. The data in this table are taken from the iPfam resource, which describes physical interactions between Pfam domains that have a representative structure in the PDB.

Example query Give me the Pfam annotation for the family 'CBS'
SQL
mysql> SELECT comment FROM pfamA WHERE pfamA_id = 'CBS' \G
      
*************************** 1. row ***************************
comment: CBS domains are small intracellular modules that pair together
to form a stable globular domain [2]. This family represents a single CBS
domain. Pairs of these domains have been termed a Bateman domain [6]. CBS
domains have been shown to bind ligands with an adenosyl group such as
AMP, ATP and S-AdoMet [5].  CBS domains are found attached to a wide
range of other protein domains suggesting that CBS domains may play a
regulatory role making proteins sensitive to adenosyl carrying ligands.
The region containing the CBS domains in Cystathionine-beta synthase is
involved in regulation by S-AdoMet [4]. CBS domain pairs from AMPK bind
AMP or ATP [5]. The CBS domains from IMPDH and the chloride channel CLC2
bind ATP [5].
Example query Give me the InterPro annotation for the family 'CBS'
SQL
mysql> SELECT interpro_id, \
              abstract \
       FROM   interpro, \
              pfamA \
       WHERE  pfamA.auto_pfamA = interpro.auto_pfamA \
       AND    pfamA_id = 'CBS'\G
interpro_id: IPR000644
   abstract: CBS (cystathionine-beta-synthase) domains are small ...
         
Example query Give me the gene ontology (GO) annotation and family information for the family 'p450'
SQL
mysql> SELECT go_id, \
              term, \
              category \
       FROM   gene_ontology AS go, \
              pfamA AS p \
       WHERE  go.auto_pfamA = p.auto_pfamA \
       AND    pfamA_id = 'p450';

+------------+---------------------------+----------+
| go_id      | term                      | category |
+------------+---------------------------+----------+
| GO:0009055 | electron carrier activity | function |
| GO:0020037 | heme binding              | function |
| GO:0005506 | iron ion binding          | function |
| GO:0004497 | monooxygenase activity    | function |
+------------+---------------------------+----------+
Example query Give me all of the literature references for the family 'CBS'
SQL
mysql> SELECT pfamA_literature_references.comment, \
              order_added, \
              pmid, \
              title, \
              literature_references.author, \
              journal \
       FROM   pfamA, \
              pfamA_literature_references, \
              literature_references \
       WHERE  pfamA_id = 'CBS' \
       AND    pfamA.auto_pfamA = pfamA_literature_references.auto_pfamA \
       AND    pfamA_literature_references.auto_lit = literature_references.auto_lit \G

*************************** 1. row ***************************
    comment: Discovery and naming of the CBS domain.
order_added: 1
       pmid: 9020585
      title: The structure of a domain common to archaebacteria and the homocystinuria disease protein.
     author: Bateman A;
    journal: Trends Biochem Sci 1997;22:12-13.
...
Example query Give me all of the database references for the family 'A2M'
SQL
mysql> SELECT db_id, \
              pfamA_database_links.comment, \
              db_link, \
              other_params \
       FROM   pfamA, \
              pfamA_database_links \
       WHERE  pfamA_id = 'A2M' \
       AND    pfamA.auto_pfamA = pfamA_database_links.auto_pfamA;

+----------+---------+-----------+--------------+
| db_id    | comment | db_link   | other_params |
+----------+---------+-----------+--------------+
| PROSITE  |         | PDOC00440 |              |
| SCOP     |         | 1c3d      | fa           |
| HOMSTRAD |         | A2M_A     |              |
| HOMSTRAD |         | A2M_B     |              |
+----------+---------+-----------+--------------+

Note: The other_params column contains 'fa;' where the Pfam family corresponds to a SCOP family, and 'sf;' where the Pfam family corresponds to a SCOP superfamily.
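The note above writes the codes as 'fa;' and 'sf;', while the sample output shows 'fa' without the semicolon. A small Python helper that tolerates both spellings might look like the sketch below. The function name and the returned labels are illustrative assumptions, not part of the Pfam schema.

```python
# Mapping of SCOP level codes, as described in the note above.
SCOP_LEVELS = {"fa": "family", "sf": "superfamily"}


def scop_level(other_params):
    """Return 'family' or 'superfamily' for an other_params value, else None.

    Accepts the code with or without a trailing semicolon.
    """
    code = other_params.strip().rstrip(";")
    return SCOP_LEVELS.get(code)


print(scop_level("fa"))
print(scop_level("sf;"))
```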

Example query Give me the interacting domains for the domain 'EGF'
SQL
mysql> SELECT a.pfamA_id, \
              b.pfamA_id \
       FROM   pfamA as a, \
              pfamA as b, \
              pfamA_interactions \
       WHERE  a.auto_pfamA = pfamA_interactions.auto_pfamA_A \
       AND    b.auto_pfamA = pfamA_interactions.auto_pfamA_B \
       AND    a.pfamA_id = "EGF";

+----------+----------------+
| pfamA_id | pfamA_id       |
+----------+----------------+
| EGF      | EGF            |
| EGF      | Ldl_recept_a   |
| EGF      | Lectin_C       |
| EGF      | Trypsin        |
| EGF      | CUB            |
| EGF      | Gla            |
| EGF      | Recep_L_domain |
| EGF      | TSP_3          |
| EGF      | An_peroxidase  |
| EGF      | TSP_C          |
| EGF      | EGF_CA         |
+----------+----------------+


Clan data

Clan table

A Pfam clan is a set of related Pfam-A families. The information we use to determine which families belong to the same clan includes related structure, related function, matching of the same sequence to HMMs from different families, and profile-profile comparisons. Note that not all Pfam-A families belong to a clan and that a Pfam-A family cannot belong to more than one clan.

Example query Give me the id and accession of the clan to which Pfam family 'EGF' belongs
SQL
mysql> SELECT clan_id, \
              clan_acc \
       FROM   clans, \
              clan_membership, \
              pfamA \
       WHERE  clans.auto_clan = clan_membership.auto_clan \
       AND    clan_membership.auto_pfamA = pfamA.auto_pfamA \
       AND    pfamA.pfamA_id = 'EGF';

+---------+----------+
| clan_id | clan_acc |
+---------+----------+
| EGF     | CL0001   |
+---------+----------+
Example query Give me all of the Pfam-A families that belong to clan 'CL0001'
SQL
mysql> SELECT pfamA_acc, \
              pfamA_id \
       FROM   clans, \
              clan_membership, \
              pfamA \
       WHERE  clans.auto_clan = clan_membership.auto_clan \
       AND    clan_membership.auto_pfamA = pfamA.auto_pfamA \
       AND    clan_acc = 'CL0001';

+-----------+---------------+
| pfamA_acc | pfamA_id      |
+-----------+---------------+
| PF01414   | DSL           |
| PF04863   | EGF_alliinase |
| PF00052   | Laminin_B     |
| PF00053   | Laminin_EGF   |
| PF07645   | EGF_CA        |
| PF00008   | EGF           |
| PF07974   | EGF_2         |
| PF09064   | Tme5_EGF_like |
| PF09289   | FOLN          |
| PF12661   | hEGF          |
| PF12662   | cEGF          |
+-----------+---------------+
Example query Give me the clan description and comment for clan 'CL0001'
SQL
mysql> SELECT clan_acc, \
              clan_id, \
              clan_description, \
              clan_comment \
       FROM   clans \
       WHERE  clan_acc = 'CL0001' \G

clan_acc: CL0001
clan_id: EGF
clan_description: EGF superfamily
clan_comment: Members of this clan all belong to the EGF superfamily ...
Example query Give me the literature references for clan 'CL0001'
SQL
mysql> SELECT comment, \
              order_added, \
              pmid, \
              title, \
              author, \
              journal \
       FROM   clans, \
              literature_references, \
              clan_lit_refs \
       WHERE  clans.auto_clan = clan_lit_refs.auto_clan \
       AND    clan_lit_refs.auto_lit = literature_references.auto_lit \
       AND    clan_acc = 'CL0001' \G

*************************** 1. row ***************************
    comment: NULL
order_added: 2
       pmid: 11852228
      title: Domain structure and organisation in extracellular matrix proteins.
     author: Hohenester E, Engel J;
    journal: Matrix Biol 2002;21:115-128.
*************************** 2. row ***************************
    comment: NULL
order_added: 1
       pmid: 3282918
      title: Structure and function of epidermal growth factor-like regions in proteins.
     author: Appella E, Weber IT, Blasi F;
    journal: FEBS Lett 1988;231:1-4.
Example query Give me the first 5 architectures for the clan 'CL0001'
SQL
mysql> SELECT architecture, \
              architecture_acc, \
              type_example, \
              no_seqs \
       FROM   architecture, \
              clan_architecture, \
              clans \
       WHERE  architecture.auto_architecture = clan_architecture.auto_architecture \
       AND    clan_architecture.auto_clan = clans.auto_clan \
       AND    clan_acc = 'CL0001' \
       LIMIT  5 \G

*************************** 1. row ***************************
    architecture: DSL~EGF_2~EGF_2~EGF
architecture_acc: PF01414 PF07974 PF07974 PF00008
    type_example: 4790
         no_seqs: 2
*************************** 2. row ***************************
    architecture: MNNL~DSL~EGF~EGF~EGF~EGF~EGF~EGF
architecture_acc: PF07657 PF01414 PF00008 PF00008 PF00008 PF00008 PF00008 PF00008
    type_example: 22062
         no_seqs: 20
*************************** 3. row ***************************
    architecture: MNNL~DSL~EGF~EGF~EGF~EGF~EGF
architecture_acc: PF07657 PF01414 PF00008 PF00008 PF00008 PF00008 PF00008
    type_example: 22068
         no_seqs: 9
*************************** 4. row ***************************
    architecture: MNNL~DSL~EGF~EGF~EGF~EGF~EGF~EGF~EGF_CA
architecture_acc: PF07657 PF01414 PF00008 PF00008 PF00008 PF00008 PF00008 PF00008 PF07645
    type_example: 22131
         no_seqs: 8
*************************** 5. row ***************************
    architecture: MNNL~DSL~EGF~EGF~EGF~EGF~EGF~EGF~EGF~EGF~EGF~EGF~EGF
architecture_acc: PF07657 PF01414 PF00008 PF00008 PF00008 PF00008 PF00008 PF00008 PF00008 PF00008 PF00008 PF00008 PF00008
    type_example: 43821
         no_seqs: 4
Example query Give me the database links for clan 'CL0001'
SQL
mysql> SELECT db_id, \
              comment, \
              db_link, \
              other_params \
       FROM   clan_database_links, \
              clans \
       WHERE  clan_database_links.auto_clan = clans.auto_clan \
       AND    clan_acc = 'CL0001';

+-------+---------+------------+--------------+
| db_id | comment | db_link    | other_params |
+-------+---------+------------+--------------+
| SCOP  | NULL    | 57196      |              |
| CATH  | NULL    | 2.10.25.10 |              |
+-------+---------+------------+--------------+


Dead families and clans

Dead families

Sometimes we find that two or more Pfam-A families can be merged into a single family, which leads to the deletion of Pfam-A families. Likewise, we might merge two clans together, which results in the deletion of a clan. The dead_families and dead_clans tables contain information about Pfam-A families and clans that have been deleted. These tables may be of use if you need to track what happened to the members of a particular family/clan that is no longer in Pfam.

Example query Give me all of the information about 'dead' Pfam-A family 'PF09410'
SQL
mysql> SELECT * FROM dead_families WHERE pfamA_acc = 'PF09410' \G

 pfamA_acc: PF09410
  pfamA_id: DUF2006
   comment: Merged into PF07143
forward_to: PF07143
      user: jm14
    killed: 2009-08-25 10:33:41
Example query Give me all of the information about 'dead' clan 'CL0152'
SQL
mysql> SELECT * FROM dead_clans WHERE clan_acc = "CL0152" \G

        clan_acc: CL0152
         clan_id: XI_TIM
clan_description: Xylose isomerase-like TIM barrel superfamily
 clan_membership:
         comment: Merged clan in to TIM_barrel clan
      forward_to: CL0036
            user: rdf
          killed: 2009-06-22 17:47:17
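The forward_to column can be used to translate an old accession into its current one. The Python sketch below assumes the accession-to-forward_to pairs have already been loaded into a dictionary; the two entries are taken from the outputs above. Whether a forward_to target can itself be dead is not stated here, so the loop that follows chains is a precaution rather than documented behaviour, and the function name is hypothetical.

```python
def resolve_accession(acc, forward_to, max_hops=10):
    """Follow forward_to links until an accession with no forward is reached.

    forward_to maps a dead accession to the accession it was merged into.
    Raises ValueError on a cycle or an implausibly long chain.
    """
    seen = set()
    while acc in forward_to and forward_to[acc]:
        if acc in seen or len(seen) >= max_hops:
            raise ValueError("forward_to chain is cyclic or too long")
        seen.add(acc)
        acc = forward_to[acc]
    return acc


# Entries copied from the dead_families and dead_clans outputs above.
dead = {"PF09410": "PF07143", "CL0152": "CL0036"}

print(resolve_accession("PF09410", dead))
print(resolve_accession("CL0152", dead))
```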


Nested domains

Nested domains

Some Pfam-A domains are disrupted by the insertion of another domain (or domains) within them. The domain that is inserted into another is known as a nested domain. The nested_locations table stores all the nested Pfam-A domains. It also stores the coordinates of the nested domain with respect to a sequence that is present in the seed alignment of the domain in which it nests.

Example query Give me all of the nested domains and the domains in which they are nested
SQL
mysql> SELECT A.pfamA_id, \
              B.pfamA_id AS nested_domain \
       FROM   pfamA AS A, \
              pfamA AS B, \
              nested_locations \
       WHERE  A.auto_pfamA = nested_locations.auto_pfamA \
       AND    B.auto_pfamA = nested_locations.nests_auto_pfamA;

+-----------------+-----------------+
| pfamA_id        | nested_domain   |
+-----------------+-----------------+
| CBS             | DRTGG           |
| RecO_C          | TF_Zn_Ribbon    |
| IMPDH           | CBS             |
| End_beta_propel | End_beta_barrel |
| DUF2330         | TonB            |
| Peptidase_M10   | fn2             |
| SBP_bac_3       | Ion_trans_2     |
| SBP_bac_3       | Ion_trans       |
| RWD             | zf-CCCH         |
| NIR_SIR         | Fer4            |
| Asp             | SapB_2          |
| Asp             | SapB_1          |
| PAP_central     | NTP_transf_2    |
...
Example query Give me the nested data for the family IMPDH
SQL
mysql> SELECT pfamA_id,  \
              nested_pfamA_acc, \
              pfamseq_acc, \
              seq_version, \
              seq_start, \
              seq_end \
       FROM   pfamA, \
              nested_locations \
       WHERE  pfamA.auto_pfamA = nested_locations.auto_pfamA \
       AND    pfamA_id = "IMPDH";

+----------+------------------+-------------+-------------+-----------+---------+
| pfamA_id | nested_pfamA_acc | pfamseq_acc | seq_version | seq_start | seq_end |
+----------+------------------+-------------+-------------+-----------+---------+
| IMPDH    | PF00571          | P24547      |           2 |       111 |     232 |
+----------+------------------+-------------+-------------+-----------+---------+
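Given the coordinates of a nested domain, the parts of the outer domain that flank the insertion can be worked out with simple interval arithmetic. The Python sketch below illustrates this. The nested coordinates (111 to 232) come from the output above, but the outer-domain coordinates are hypothetical values chosen only for the example, and the function name is not part of Pfam.

```python
def outer_segments(outer_start, outer_end, nested_start, nested_end):
    """Return the outer-domain segments on either side of a nested domain.

    All coordinates are 1-based and inclusive, on the same sequence.
    """
    if not (outer_start <= nested_start <= nested_end <= outer_end):
        raise ValueError("nested region must lie inside the outer domain")
    segments = []
    if nested_start > outer_start:
        segments.append((outer_start, nested_start - 1))
    if nested_end < outer_end:
        segments.append((nested_end + 1, outer_end))
    return segments


# Nested CBS region on P24547, from the nested_locations output above.
NESTED = (111, 232)
# Hypothetical IMPDH coordinates on the same sequence (not taken from Pfam).
OUTER = (20, 480)

print(outer_segments(OUTER[0], OUTER[1], NESTED[0], NESTED[1]))
```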


NCBI and metagenomics data

Version table

In addition to searching all of the sequences in UniProtKB, we also search the protein sequences from NCBI (GenPept) and a set of metagenomic sequences against the Pfam-A HMMs. The ncbi_pfamA_reg table contains all of the significant GenPept matches for each Pfam-A HMM, and the meta_pfamA_reg table contains all of the significant metagenomics matches for each Pfam-A HMM. Unlike the situation for UniProtKB data, we do not exclude overlapping matches between families in the same clan from the NCBI and metagenomics alignments, and therefore the in_full columns in the ncbi_pfamA_reg and meta_pfamA_reg tables are obsolete (and may be removed in future versions of the database). The ncbi_map table links each GI number to its corresponding UniProtKB entry or entries. Note that not all GI numbers have a corresponding UniProtKB entry. The metagenomics sequences that we searched can be found in the metaseq table.

Example query Give me all of the Pfam-A domains for NCBI protein 'GI:1000125'
SQL
mysql> SELECT pfamA_acc, \
              pfamA_id, \
              seq_start, \
              seq_end \
       FROM   ncbi_pfamA_reg, \
              pfamA \
       WHERE  ncbi_pfamA_reg.gi = '1000125' \
       AND    ncbi_pfamA_reg.auto_pfamA = pfamA.auto_pfamA;

+-----------+-------------+-----------+---------+
| pfamA_acc | pfamA_id    | seq_start | seq_end |
+-----------+-------------+-----------+---------+
| PF02185   | HR1         |        47 |     114 |
| PF02185   | HR1         |       136 |     207 |
| PF02185   | HR1         |       217 |     289 |
| PF00433   | Pkinase_C   |       936 |     983 |
| PF07714   | Pkinase_Tyr |       657 |     905 |
| PF00069   | Pkinase     |       657 |     916 |
+-----------+-------------+-----------+---------+
Example query Give me all of the NCBI protein domains for the Pfam-A family 'AalphaY_MDB'
SQL
mysql> SELECT gi, \
              seq_start, \
              seq_end, \
              pfamA_id \
       FROM   pfamA, \
              ncbi_pfamA_reg \
       WHERE  pfamA_id = 'AalphaY_MDB' \
       AND    pfamA.auto_pfamA = ncbi_pfamA_reg.auto_pfamA \
       AND    in_full = 1;

+---------+-----------+---------+-------------+
| gi      | seq_start | seq_end | pfamA_id    |
+---------+-----------+---------+-------------+
| 8650517 |         1 |     147 | AalphaY_MDB |
| 2314885 |         1 |     149 | AalphaY_MDB |
|  169855 |         1 |     146 | AalphaY_MDB |
|  169861 |         1 |     146 | AalphaY_MDB |
|  169857 |         1 |     147 | AalphaY_MDB |
+---------+-----------+---------+-------------+
Example query Give me all of the metagenomics domains for the family '3-alpha'
SQL
mysql> SELECT metaseq_acc, \
              seq_start, \
              seq_end, \
              pfamA_id \
       FROM   pfamA, \
              metaseq, \
              meta_pfamA_reg \
       WHERE  pfamA_id = '3-alpha' \
       AND    pfamA.auto_pfamA = meta_pfamA_reg.auto_pfamA \
       AND    meta_pfamA_reg.auto_metaseq = metaseq.auto_metaseq;

+-------------+-----------+---------+----------+
| metaseq_acc | seq_start | seq_end | pfamA_id |
+-------------+-----------+---------+----------+
| EBC58818.1  |       172 |     218 | 3-alpha  |
| EBB53961.1  |       183 |     229 | 3-alpha  |
| 2001510128  |        48 |      93 | 3-alpha  |
| ECV16193.1  |       172 |     218 | 3-alpha  |
| EBT67177.1  |       169 |     215 | 3-alpha  |
| ECV05831.1  |       172 |     218 | 3-alpha  |
| 2001448363  |        23 |      62 | 3-alpha  |
| ECV29667.1  |       419 |     465 | 3-alpha  |
| EBT80309.1  |       133 |     178 | 3-alpha  |
| ECB16942.1  |        46 |      92 | 3-alpha  |
| EBL10473.1  |       173 |     219 | 3-alpha  |
| ECN56225.1  |        16 |      62 | 3-alpha  |
| 2000097130  |       178 |     223 | 3-alpha  |
| EDF57633.1  |       169 |     215 | 3-alpha  |
| EDF88754.1  |       185 |     230 | 3-alpha  |
| ECD80201.1  |       120 |     166 | 3-alpha  |
| EBV29651.1  |        74 |     119 | 3-alpha  |
| EBG15763.1  |       173 |     215 | 3-alpha  |
| EDC05583.1  |       227 |     273 | 3-alpha  |
| EDD80781.1  |       172 |     218 | 3-alpha  |
| EDB76302.1  |       169 |     215 | 3-alpha  |
+-------------+-----------+---------+----------+

Structural data

PDB table

In order for the Protein DataBank (PDB) information to be useful to Pfam, we need to map between PDB residues and UniProtKB sequence residues, which is not a trivial task. We store the residue-by-residue mapping that is provided by the PDBe group in the pdb_residue_data table.

Example query Give me the first 10 residue mappings for the structure '2abl'
SQL
mysql> SELECT pdb.pdb_id, \
              pdb_res, \
              pdb_seq_number, \
              pfamseq_acc, \
              pfamseq_res, \
              pfamseq_seq_number \
       FROM   pdb_residue_data, \
              pdb  \
       WHERE  pdb.pdb_id = pdb_residue_data.pdb_id \
       AND    pdb.pdb_id = '2abl' \
       LIMIT  10;

+--------+---------+----------------+-------------+-------------+--------------------+
| pdb_id | pdb_res | pdb_seq_number | pfamseq_acc | pfamseq_res | pfamseq_seq_number |
+--------+---------+----------------+-------------+-------------+--------------------+
| 2abl   | GLY     |             76 | P00519      | G           |                 57 |
| 2abl   | PRO     |             77 | P00519      | P           |                 58 |
| 2abl   | SER     |             78 | P00519      | S           |                 59 |
| 2abl   | GLU     |             79 | P00519      | E           |                 60 |
| 2abl   | ASN     |             80 | P00519      | N           |                 61 |
| 2abl   | ASP     |             81 | P00519      | D           |                 62 |
| 2abl   | PRO     |             82 | P00519      | P           |                 63 |
| 2abl   | ASN     |             83 | P00519      | N           |                 64 |
| 2abl   | LEU     |             84 | P00519      | L           |                 65 |
| 2abl   | PHE     |             85 | P00519      | F           |                 66 |
+--------+---------+----------------+-------------+-------------+--------------------+
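Once exported, the residue-by-residue rows can be turned into a lookup from PDB numbering to UniProtKB numbering. The Python sketch below uses the first five rows of the output above. The three-letter to one-letter table covers only the residues in this sample, and the variable names are illustrative.

```python
# (pdb_res, pdb_seq_number, pfamseq_res, pfamseq_seq_number) for 2abl / P00519,
# copied from the first five rows of the output above.
rows = [
    ("GLY", 76, "G", 57),
    ("PRO", 77, "P", 58),
    ("SER", 78, "S", 59),
    ("GLU", 79, "E", 60),
    ("ASN", 80, "N", 61),
]

# Partial residue-code table: only the residues that occur in the sample.
THREE_TO_ONE = {"GLY": "G", "PRO": "P", "SER": "S", "GLU": "E", "ASN": "N"}

# Lookup from PDB residue number to UniProtKB residue number.
pdb_to_uniprot = {pdb_num: seq_num for _, pdb_num, _, seq_num in rows}

# Rows where the PDB residue and the UniProtKB residue disagree.
mismatches = [row for row in rows if THREE_TO_ONE[row[0]] != row[2]]

# Distinct numbering offsets seen in the sample.
offsets = {pdb_num - seq_num for _, pdb_num, _, seq_num in rows}

print(pdb_to_uniprot[78], mismatches, offsets)
```

In this sample the two numbering schemes differ by a constant offset, but that should not be relied on in general; gaps, insertions and unobserved residues are the reason a per-residue mapping is stored.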
Example query Give me all the structures that map to the family 'Globin'
SQL
mysql> SELECT pdb_pfamA_reg.pdb_id, \
              chain, \
              pdb_res_start, \
              pdb_res_end, \
              seq_start, \
              seq_end \
       FROM   pdb, \
              pdb_pfamA_reg, \
              pfamA \
       WHERE  pfamA_id = 'Globin' \
       AND    pfamA.auto_pfamA = pdb_pfamA_reg.auto_pfamA \
       AND    pdb_pfamA_reg.pdb_id = pdb.pdb_id;

+--------+-------+---------------+-------------+-----------+---------+
| pdb_id | chain | pdb_res_start | pdb_res_end | seq_start | seq_end |
+--------+-------+---------------+-------------+-----------+---------+
| 2vyy   | A     |             1 |          87 |         2 |      88 |
| 1v07   | A     |             1 |          87 |         2 |      88 |
| 2vyz   | A     |             1 |          87 |         2 |      88 |
| 1kr7   | A     |             1 |          87 |         2 |      88 |
| 1wmu   | A     |             6 |         106 |         6 |     106 |
| 2z6n   | A     |             6 |         106 |         6 |     106 |
| 1v75   | A     |             6 |         106 |         6 |     106 |
| 3fs4   | C     |             6 |         106 |         6 |     106 |
| 3fs4   | A     |             6 |         106 |         6 |     106 |
| 2mhb   | B     |             7 |         111 |         7 |     111 |
... 


Proteomes

Proteome tables

At each Pfam release, we download the proteome set from the Integr8 resource. The species and NCBI taxonomy IDs for all of the proteomes in our Integr8 proteome set can be found in the complete_proteomes table, along with some statistics about the number of families and coverage. The tables in this section allow you to retrieve domain information about a particular species, or to retrieve all of the species which contain a particular Pfam domain.

Example query Give me the Pfam summary for the human (Homo sapiens) proteome
SQL
mysql> SELECT * FROM complete_proteomes WHERE species = "Homo sapiens" \G

        auto_proteome: 511
           ncbi_taxid: 9606
              species: Homo sapiens
             grouping: Eukaryota
 num_distinct_regions: 4315
    num_total_regions: 68410
         num_proteins: 32474
    sequence_coverage: 72
     residue_coverage: 40
total_genome_proteins: 44817
      total_aa_length: 19347573
     total_aa_covered: 7758075
   total_seqs_covered: 32474
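The column definitions are not spelled out here, but the two coverage values appear to be whole-number percentages derived from the totals in the same row. The Python sketch below recomputes them under that assumption, using the human proteome values from the output above.

```python
# Values copied from the complete_proteomes row for Homo sapiens above.
human = {
    "total_genome_proteins": 44817,
    "total_seqs_covered": 32474,
    "total_aa_length": 19347573,
    "total_aa_covered": 7758075,
}


def percent(part, whole):
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


# Assumed definitions: fraction of proteins, and of residues, with a Pfam match.
sequence_coverage = percent(human["total_seqs_covered"], human["total_genome_proteins"])
residue_coverage = percent(human["total_aa_covered"], human["total_aa_length"])

print(round(sequence_coverage, 1), round(residue_coverage, 1))
```

Both results agree with the stored sequence_coverage (72) and residue_coverage (40) once the decimal part is dropped.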
Example query Give me all the Pfam-A domains for the species 'Arabidopsis thaliana'
SQL
mysql> SELECT   pfamA_acc, \
                pfamA_id, \
                description, \
                SUM(count) \
       FROM     complete_proteomes, \
                proteome_regions, \
                pfamA \
       WHERE    complete_proteomes.ncbi_taxid = 3702 \
       AND      proteome_regions.auto_pfamA = pfamA.auto_pfamA \
       AND      proteome_regions.auto_proteome = complete_proteomes.auto_proteome \
       GROUP BY proteome_regions.auto_pfamA;

+-----------+----------------+-----------------------------------------------------------------+------------+
| pfamA_acc | pfamA_id       | description                                                     | sum(count) |
+-----------+----------------+-----------------------------------------------------------------+------------+
| PF00389   | 2-Hacid_dh     | D-isomer specific 2-hydroxyacid dehydrogenase, catalytic domain |         10 |
| PF00198   | 2-oxoacid_dh   | 2-oxoacid dehydrogenases acyltransferase (catalytic domain)     |         11 |
| PF03171   | 2OG-FeII_Oxy   | 2OG-Fe(II) oxygenase superfamily                                |        144 |
| PF01073   | 3Beta_HSD      | 3-beta hydroxysteroid dehydrogenase/isomerase family            |          9 |
| PF04419   | 4F5            | 4F5 protein family                                              |          3 |
| PF03061   | 4HBT           | Thioesterase superfamily                                        |         13 |
...

Note: The ncbi_code for the species 'Arabidopsis thaliana' is 3702. This information can be found in the ncbi_taxonomy table.

Example query Give me all of the UniProtKB protein sequences for the species 'Arabidopsis thaliana'
SQL
mysql> SELECT pfamseq.pfamseq_id \
       FROM   pfamseq, \
              genome_seqs \
       WHERE  pfamseq.ncbi_code = '3702' \
       AND    genome_seqs.auto_pfamseq = pfamseq.auto_pfamseq;
+-------------+
| pfamseq_id  |
+-------------+
| 12S1_ARATH  |
| 12S2_ARATH  |
| 14331_ARATH |
| 14332_ARATH |
| 14333_ARATH |
| 14334_ARATH |
| 14335_ARATH |
| 14336_ARATH |
| 14337_ARATH |
...
Example query Give me all of the UniProtKB protein sequences from the species 'Arabidopsis thaliana' that belong to Pfam-A domain 'PF00106'
SQL
mysql> SELECT pfamseq.pfamseq_id \
       FROM   pfamseq, \
              genome_seqs, \
              pfamA \
       WHERE  genome_seqs.ncbi_code = '3702' \
       AND    genome_seqs.auto_pfamseq = pfamseq.auto_pfamseq \
       AND    genome_seqs.auto_pfamA = pfamA.auto_pfamA \
       AND    pfamA_acc = 'PF00106';
+--------------+
| pfamseq_id   |
+--------------+
| FABG_ARATH   |
| PORA_ARATH   |
| PORB_ARATH   |
| PORC_ARATH   |
| O22985_ARATH |
| O49332_ARATH |
| O80711_ARATH |
| O80713_ARATH |
| O80714_ARATH |
| O80924_ARATH |
...


Related families

Related families

SCOOP and HHsearch are two pieces of software that we use to help determine which Pfam-A families are related. The scores from these programs have been a very useful aid in deciding which Pfam-A families should belong to the same clan. As a rough guide, a SCOOP score greater than 50 or an HHsearch E-value of less than 0.01 is an indication that two families are closely related.

Example query Give me all of the Pfam-A families that have a SCOOP score greater than 50 when compared to the family 'ABC1'
SQL
mysql> SELECT a.pfamA_id, \
              b.pfamA_id, \
              score \
       FROM   pfamA AS a, \
              pfamA AS b, \
              pfamA2pfamA_scoop_results AS p \
       WHERE  a.auto_pfamA = p.auto_pfamA1 \
       AND    b.auto_pfamA = p.auto_pfamA2 \
       AND    score > 50 \
       AND    a.pfamA_id = "ABC1";

+----------+-----------------+---------+
| pfamA_id | pfamA_id        | score   |
+----------+-----------------+---------+
| ABC1     | Pkinase         | 82.5059 |
| ABC1     | Pox_ser-thr_kin | 77.4765 |
| ABC1     | Kdo             | 62.4555 |
| ABC1     | Seadorna_VP7    | 53.7196 |
| ABC1     | Pkinase_Tyr     | 85.9087 |
| ABC1     | RIO1            | 60.2442 |
+----------+-----------------+---------+
Example query Give me all of the Pfam-A families that have an HHsearch E-value of less than 0.01 when compared to the family 'AAA'
SQL
mysql> SELECT a.pfamA_id, \
              b.pfamA_id, \
              evalue \
       FROM   pfamA AS a, \
              pfamA AS b, \
              pfamA2pfamA_hhsearch_results AS p \
       WHERE  a.auto_pfamA = p.auto_pfamA1 \
       AND    b.auto_pfamA = p.auto_pfamA2 \
       AND    evalue < 0.01 \
       AND    a.pfamA_id = "AAA";

+----------+-----------------+---------+
| pfamA_id | pfamA_id        | evalue  |
+----------+-----------------+---------+
| AAA      | Sigma54_activat | 0.0069  |
| AAA      | RuvB_N          | 5.6E-08 |
| AAA      | AAA_2           | 0.00011 |
| AAA      | AAA_5           | 0.00011 |
| AAA      | Viral_helicase1 | 0.0045  |
| AAA      | AAA             | 8.8E-23 |
| AAA      | IstB            | 0.00063 |
+----------+-----------------+---------+
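The rough guide given at the start of this section (a SCOOP score above 50, or an HHsearch E-value below 0.01) can be applied to exported results with a small predicate. The Python sketch below encodes those two thresholds. The function and constant names are illustrative, and the rule is a guideline rather than a definition of clan membership.

```python
SCOOP_MIN_SCORE = 50.0      # rough guide: a score above this suggests a relationship
HHSEARCH_MAX_EVALUE = 0.01  # rough guide: an E-value below this suggests a relationship


def closely_related(scoop_score=None, hhsearch_evalue=None):
    """Apply the rough-guide thresholds to whichever scores are available."""
    by_scoop = scoop_score is not None and scoop_score > SCOOP_MIN_SCORE
    by_hhsearch = (
        hhsearch_evalue is not None and hhsearch_evalue < HHSEARCH_MAX_EVALUE
    )
    return by_scoop or by_hhsearch


# ABC1 vs Pkinase (SCOOP) and AAA vs Sigma54_activat (HHsearch), from the outputs above.
print(closely_related(scoop_score=82.5059))
print(closely_related(hhsearch_evalue=0.0069))
```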
Example query Give me all of the UniProtKB protein sequences for the species 'Gallus gallus'
SQL
mysql> SELECT pfamseq_id, \
              pfamseq_acc \
       FROM   pfamseq, \
              proteome_pfamseq, \
              complete_proteomes \
       WHERE  complete_proteomes.species = "Gallus gallus" \
       AND    pfamseq.auto_pfamseq = proteome_pfamseq.auto_pfamseq \
       AND    proteome_pfamseq.auto_proteome = complete_proteomes.auto_proteome;

+--------------+-------------+
| pfamseq_id   | pfamseq_acc |
+--------------+-------------+
| Q788U5_CHICK | Q788U5      |
| Q5ZHK8_CHICK | Q5ZHK8      |
| Q7T1N6_CHICK | Q7T1N6      |
| Q9DEG1_CHICK | Q9DEG1      |
| PRTG_CHICK   | Q589G5      |
| B5BSF2_CHICK | B5BSF2      |
| Q8QGR9_CHICK | Q8QGR9      |
| CLLD6_CHICK  | Q5ZHV7      |
| Q5ZK24_CHICK | Q5ZK24      |
| HXC8_CHICK   | Q9YH13      |
...


Data Files - Alignments, trees and HMMs

HMM tables

The seed, full, NCBI and metaseq alignments are all stored as gzipped files in the database, as is the HMM for each family. Note that the NCBI and metaseq alignments may contain overlapping matches to Pfam-A families that belong to the same clan; the UniProtKB alignments (seed and full) will not. This is because we have performed a clan filtering step on the UniProtKB data such that, where there are overlapping Pfam-A matches within a clan, only the match with the lowest E-value is included in the full alignment.
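Users who want to apply a similar filter to the NCBI or metagenomics matches could start from the simplified Python sketch below. It is one plausible reading of the rule described above (keep the lowest E-value match among overlapping matches from the same clan), not the code Pfam itself runs. The coordinates are those reported for GI:1000125 earlier on this page, whereas the clan labels and E-values are illustrative assumptions.

```python
def overlaps(a, b):
    """True if two matches share at least one residue (inclusive coordinates)."""
    return a["start"] <= b["end"] and b["start"] <= a["end"]


def clan_filter(matches):
    """Greedy filter: visit matches from best to worst E-value and drop any
    match that overlaps an already-kept match from the same clan.
    Matches with clan None are never filtered."""
    kept = []
    for match in sorted(matches, key=lambda m: m["evalue"]):
        clash = any(
            match["clan"] is not None
            and match["clan"] == other["clan"]
            and overlaps(match, other)
            for other in kept
        )
        if not clash:
            kept.append(match)
    return sorted(kept, key=lambda m: m["start"])


# Coordinates from the GI:1000125 example; clan and evalue values are assumed.
matches = [
    {"family": "HR1", "clan": None, "start": 47, "end": 114, "evalue": 1e-20},
    {"family": "Pkinase_Tyr", "clan": "CL0016", "start": 657, "end": 905, "evalue": 1e-30},
    {"family": "Pkinase", "clan": "CL0016", "start": 657, "end": 916, "evalue": 1e-60},
]

for match in clan_filter(matches):
    print(match["family"], match["start"], match["end"])
```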

Pfam FTP site

The Pfam FTP site is organised into the following structure:

|
+- Tools/
|
+- papers/
|
+- current_release/
    |
    +- database_files/
|
+- releases/
    |
    +- Pfam23.0/
    |   |
    |   +- database_files/
    |
    +- Pfam22.0/
    |   |
    |   +- database_files/
    |
    +- ...
    |
    +- Pfam1.0/

The most important directory is probably the current_release directory. This contains the flat-files for the current release. Some of these files may be very large (of the order of several hundred megabytes). Please check the sizes on the FTP site before trying to download them over a slow connection. The files, most of which are compressed using gzip, are:

Pfam-A.dead.gz
Listing of families that have been deleted from the database
Pfam-A.fasta.gz
A 90% non-redundant set of FASTA-formatted sequences for each Pfam-A family. The sequences are only the regions hit by the model and not full-length protein sequences.
Pfam-A.full.gz
The full alignments of the curated families, searched against pfamseq/UniProt.
Pfam-A.full.metagenomics.gz
The full alignments of the curated families, searched against Metagenomic proteins.
Pfam-A.full.ncbi.gz
The full alignments of the curated families, searched against NCBI GenPept proteins.
Pfam-A.hmm.dat.gz
A data file that contains information about each Pfam-A family
Pfam-A.hmm.gz
The Pfam HMM library for Pfam-A families
Pfam-A.seed.gz
The seed alignments of the curated families
Pfam-B.fasta.gz
A 90% non-redundant set of FASTA-formatted sequences for each Pfam-B family. The sequences are only the regions hit by the model, not full-length protein sequences.
Pfam-B.gz
Automatically generated alignments of sequence clusters in SWISSPROT and TrEMBL that are not modelled in the curated part of Pfam
Pfam-B.hmm.gz
The Pfam HMM library for Pfam-B families. Note that we build HMMs for only the largest 20,000 Pfam-Bs.
Pfam-B.hmm.dat.gz
A data file containing information about each of the 20,000 Pfam-B HMMs.
Pfam-C.gz
Contains information about clans and their Pfam-A membership
active_site.dat.gz
Tar-ball of data required for the predictions of active sites by Pfam scan.
database.tar
A tar-ball of the database_files directory.
database_files
This directory contains two files per table from the MySQL database. The .sql.gz file contains the table structure; the .txt.gz file contains the content of the table as a tab-delimited file with fields enclosed by single quotes (').
diff.gz
Stores the change status of entries between this release and the last.
metapfam.gz
ASCII representation of the domain structure of Metagenomic proteins according to Pfam
metaseq.gz
Metagenomic sequence database used in this release
ncbi.gz
NCBI GenPept sequence database used in this release.
ncbipfam.gz
ASCII representation of the domain structure of GenPept proteins according to Pfam
pdbmap.gz
Mapping between PDB structures and Pfam domains.
pfamseq.gz
A fasta version of Pfam's underlying sequence database
relnotes.txt
Release notes
swisspfam.gz
ASCII representation of the domain structure of UniProt proteins according to Pfam
uniprot_sprot.dat.gz
Data files from UniProt containing SwissProt annotations.
uniprot_trembl.dat.gz
Data files from UniProt containing TrEMBL annotations.
userman.txt
File containing information about the flatfile format
Pfam-A.regions.tsv.gz
A tab separated file containing UniProtKB sequences and Pfam-A family information
Pfam-A.clans.tsv.gz
A tab separated file containing Pfam-A family and clan information for all Pfam-A families
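
The .txt.gz files in database_files, described in the list above as tab-delimited with fields enclosed by single quotes, can be read with a standard CSV parser. The following is a minimal sketch, not an official loader: the function name read_table_dump is hypothetical, the text encoding is an assumption that the caller can override, and treating \N as a NULL value assumes the dumps follow the usual MySQL export convention.

```python
import csv
import gzip
from typing import Iterator, List, Optional


def read_table_dump(path: str, encoding: str = "utf-8") -> Iterator[List[Optional[str]]]:
    """Yield the rows of a gzipped, tab-delimited, single-quoted table dump."""
    # newline="" lets the csv module handle line endings inside quoted fields.
    with gzip.open(path, "rt", encoding=encoding, newline="") as handle:
        reader = csv.reader(handle, delimiter="\t", quotechar="'")
        for row in reader:
            # Assumption: an unquoted \N marks a NULL value, as in MySQL exports.
            yield [None if field == r"\N" else field for field in row]
```

The column order is not encoded in the .txt.gz file itself; it has to be taken from the table structure in the matching .sql.gz file.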

The papers directory contains each NAR database issue article describing Pfam. For a detailed description of the latest changes to Pfam, please consult (and cite) these papers.

The releases directory contains all the flat files and database dumps (where appropriate) for all versions of Pfam to date. The files in more recent releases are the same as described for the current release, but in older releases the contents differ.
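
Because some of the flat files are very large, it can be useful to ask the FTP server for a file's size before downloading it. The sketch below uses Python's standard ftplib module. The helper names release_dir and remote_size are hypothetical, and the server host name and base path are not given in this section, so they are left to the caller.

```python
from ftplib import FTP
from typing import Optional


def release_dir(version: Optional[str] = None) -> str:
    """Directory of a release relative to the Pfam FTP root.

    None means the current release; "23.0" gives "releases/Pfam23.0".
    """
    return "current_release" if version is None else f"releases/Pfam{version}"


def remote_size(ftp: FTP, path: str) -> Optional[int]:
    """Ask the server for the size of `path` in bytes (None if it does not report one)."""
    ftp.voidcmd("TYPE I")  # many servers only answer SIZE in binary mode
    return ftp.size(path)


# Example use (not run here; `host` and `base` must be supplied by the reader):
#
#     with FTP(host) as ftp:
#         ftp.login()                      # anonymous login
#         ftp.cwd(base)                    # top-level Pfam directory
#         print(remote_size(ftp, release_dir() + "/Pfam-A.full.gz"))
```

Older releases may not contain the same file names as the current release, so a missing file should be expected and handled when walking the releases directory.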

The Tools directory contains code for running pfam_scan.pl. The README file in this directory contains detailed information on how to install and run the script. Note that we have gone for a modular design for the script, enabling its functionality to be easily incorporated into other Perl scripts. The ChangeLog file lists the versions and changes to the current version of pfam_scan.pl (and modules). There is also an archived version of pfam_scan.pl that works with HMMER2, which is no longer supported. The ActSitePred directory contains Perl code for predicting active sites, the functionality of which has been rolled into the latest version of pfam_scan.pl.

The top level directory also contains the following two files:

COPYRIGHT
Copyright notice for Pfam
GNULICENSE
The full text of the GNU Library General Public License under which Pfam is licensed

It also contains a further directory, sitesearch, that contains a subset of information from Pfam in an XML file. This XML file, primarily for use by the Sanger Web Team, is indexed using Lucene and used in the WTSI site search. It is updated at each release.

Pfam website virtual machine

Installing the website has always been somewhat complex and we no longer support direct installation of the site. Instead, we have built a virtual machine that includes all of the software needed to run the Pfam website within your local environment.

You can find instructions on downloading and installing the VM on a set of dedicated pages.

You can still find the source code for the website and the ancillary systems in our SVN repository, at http://pfamsvn.sanger.ac.uk/.

Privacy issues

This section outlines the ways in which the Pfam website handles information about users. This should not be read as a legal document, but as a description of how we handle information that could be considered sensitive. It should be read in conjunction with the privacy policy documents of the individual Pfam consortium member sites. If you have any concerns about the way that information is used in the website, please contact us at the address given at the bottom of the page and we will be more than happy to discuss your concerns.

Although we make every possible effort to keep this site and the data that it manipulates safe and secure, we make no claim to be able to protect sensitive or privileged information. If you are at all concerned about sensitive information being released, please do not use the site and consider installing the Pfam database and/or this website locally.

Urchin

We use Urchin, a software package closely related to Google Analytics (GA), to track the usage of this website. Urchin uses a single-pixel "web bug" image, which is served from every page, a javascript script that collects information about each request, and cookies that maintain information about your usage of the site between visits. You can read more about how GA works on the Google Analytics website, which includes a detailed description of how traffic is tracked and analysed.

We use the information generated by Urchin purely for audit and accounting purposes, and to help us assess the usefulness and popularity of different features of the site. It does not provide the ability to track individual users' usage of the site. However, Urchin does provide a high-level overview of the traffic that passes through the site, including such information as the approximate geographical location of users, and how often and for how long they visited the site.

We understand that this level of tracking may be worrying to some of our users. If you have any concerns about our use of Urchin, please feel free to contact us.

Browsing

All web servers maintain fairly detailed logs of their activity. This includes keeping a record of every request that they serve, usually along with the IP address of the client that made the request. This is true of the web servers that host the Pfam websites.

Although our servers do collect information about your IP address during the normal process of serving the Pfam website, we do not use this information explicitly. The Pfam group uses server logs only to help with development and debugging of the site.

Searches

The sequence search feature of the site allows you to upload a protein or DNA sequence to be searched against our library of HMMs. The sequence that you upload is stored in a database and is retrieved by a set of scripts that actually perform the search. Although we do not have any information that could be used to link that sequence to you personally, you should be aware that the sequence itself is accessible to systems administrators and other users who maintain the Pfam site.

The batch search function allows you to submit larger searches, the results of which are emailed to you. Obviously, this requires you to provide identifiable information, namely an email address. However, beyond the routine backups of our databases, we do not store any information about email addresses and sequences in the longer term and we make no attempt to keep track of the searches that a particular user may be performing.

Information from other types of search, such as a keyword search, is held only in the web server logs but, as described above, no attempt is made to interpret these logs except as part of development or debugging of the site.

Cookies

We use the following cookies to maintain some information about you between your visits to the site. The information that is stored cannot be used to identify you personally and cannot be used to track your usage of the site.

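| Cookie name | Purpose                                                         | Criteria |
+-------------+-----------------------------------------------------------------+----------+
| ts          | Timestamp when annotation submission form was loaded in browser | Required |
| hide_posts  | Keep track of whether blog posts have been hidden in home page  | Optional |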

In addition to these Pfam-specific cookies, Urchin uses a series of cookies. You can read more about these in the Urchin documentation.

If you are at all concerned about the use of cookies in the Pfam site, you are free to block all cookies from this site and you should not experience any problems. You may see some unintended behaviour, such as being notified of all new features every time you visit the index page, but the core functionality of the site should be unaffected.

Third-party javascript libraries

This site makes heavy use of javascript and relies on javascript libraries that are developed by various groups and companies. In order to improve the performance of the Pfam website, we no longer serve these files ourselves, but rely on files that are hosted on third-party web servers. In particular, we use various files that are provided by the AJAX Libraries API, hosted by Google Code, and components of the Yahoo! User Interface Library (YUI), hosted by Yahoo!.

As these services are provided by commercial sites, it's likely that their usage will be carefully monitored by the companies that provide them. Although the Pfam site does not pass any information about you to these third-party sites, the sites themselves may use cookies to track your usage of the files that they serve. If you are concerned about the privacy implications of this monitoring, you may want to block cookies from the third-party hosting sites.

The Pfam Consortium

Pfam is maintained by an international consortium of researchers that was born out of its original development by Erik Sonnhammer, Sean Eddy and Richard Durbin. The current consortium members, their institutes and primary roles are listed below.

Wellcome Trust Sanger Institute (UK)

  • Marco Punta - Pfam project leader
  • Penny Coggill - Pfam database annotator
  • Ruth Eberhardt - Pfam database annotator
  • Jaina Mistry - Pfam developer
  • John Tate - Web developer
  • Alex Bateman - Faculty lead

Janelia Farm Research Campus (USA)

  • Rob Finn - Pfam project leader
  • Jody Clements - Web developer
  • Sean Eddy - Founding developer and author of HMMER software

Stockholm Bioinformatics Center (Sweden)

  • Kristoffer Forslund - Pfam developer
  • Erik Sonnhammer - Coordinator of Pfam-Sweden and founding developer

External contributors

Pfam includes families that have been built by external contributors:

NCBI

  • Lakshminarayan Iyer
  • L. Aravind
  • Zhang Dapeng
  • Robson De Souza
  • Vivek Anantharaman

Sanford-Burnham Medical Research Institute

  • Adam Godzik
  • Lukasz Jaroszewski
  • Kyle Ellrott

Previous contributors

  • Gabriel Aldam
  • Shimelis Assefa
  • Matthew Bashton
  • Ewan Birney
  • Lorenzo Cerrutti
  • Lachlan Coin
  • Richard Durbin
  • Matthew Fenech
  • O. Luke Gavin
  • Prasad Gunasekaran
  • Sam Griffiths-Jones
  • Kevin Howe
  • Nicola Kerrison
  • Mhairi Marshall
  • Nina Mian
  • William Mifsud
  • Simon Moxon
  • Joanne Pollington
  • Stephen-John Sammut
  • Benjamin Schuster-Böckler
  • David Studholme
  • Benjamin Vella-Briffa
  • Corin Yeats
  • Arthur Wuster

Pfam is a collaborative venture and we hope to be able to interact with as many people as possible, in order to provide a quality database. Please get in touch with any one of us for more information about Pfam. You can email Pfam using the address found at the bottom of the page.

Mirror Sites

The Pfam website is available from several mirrors around the world:

How to contact Pfam

Contact Pfam

You can contact us in various ways. Each of the Pfam consortium sites provides a contact email address, which you can find at the bottom of every page. You can use this address to contact the specific Pfam group.

We also run a central helpdesk, which handles annotation comments, data enquiries and general problems with the Pfam websites. We use a request tracking system to monitor emails to the helpdesk, so you should receive an automated response to your email, letting you know that the system has logged your mail and notified us of its arrival.

Mailing list

The Pfam mailing list is a low traffic list that has important announcements, such as releases or major changes.

To join the mailing list, send a mail to pfamlist-subscribe@sanger.ac.uk.

To unsubscribe from the list, send a mail to pfamlist-unsubscribe@sanger.ac.uk.

Xfam blog

The Pfam group contributes to the Xfam blog. The blog is used to announce releases, new features and important changes to Pfam, as well as for posts discussing general issues surrounding the Pfam resource. You can see blog posts that are specific to Pfam here.

RSS feed

You can keep in touch with the latest goings on by subscribing to the RSS feed from the Xfam blog.