
Pfam Help

 

Help Summary

About Pfam

Pfam 22.0 (Jul 2007, 9318 families)

Proteins are generally composed of one or more functional regions, commonly termed domains. Different combinations of domains give rise to the diverse range of proteins found in nature. Identifying the domains present in a protein can provide insights into the function of that protein.

The Pfam database is a large collection of protein domain families. Each family is represented by multiple sequence alignments and hidden Markov models (HMMs).

There are two levels of quality to Pfam families: Pfam-A and Pfam-B. Pfam-A entries are derived from the underlying sequence database, known as Pfamseq, which is built from the most recent release of UniProtKB at a given time-point. Each Pfam-A family consists of a curated seed alignment containing a small set of representative members of the family, profile hidden Markov models (profile HMMs) built from the seed alignment, and an automatically generated full alignment, which contains all detectable protein sequences belonging to the family, as defined by profile HMM searches of primary sequence databases.

Pfam-B families are un-annotated and of lower quality as they are generated automatically from the non-redundant clusters of the latest ProDom release. Although of lower quality, Pfam-B families can be useful for identifying functionally conserved regions when no Pfam-A entries are found.

Pfam entries are classified in one of four ways:

Family:
A collection of related proteins
Domain:
A structural unit which can be found in multiple protein contexts
Repeat:
A short unit which is unstable in isolation but forms a stable structure when multiple copies are present
Motif:
A short unit found outside globular domains

Related Pfam entries are grouped together into clans; the relationship may be defined by similarity of sequence, structure or profile-HMM.

Pfam Changes

This section details the changes that we plan to make or have made to Pfam. This includes changes to the flatfiles, MySQL database and the public website.


Latest changes to Pfam data

Changes between Pfam releases 21 and 22

This is the new Pfam website, running on Pfam release 22.0. Release 22.0 contains a total of 9318 families, with 380 new families and 19 families killed since the latest release. 73.23% of all proteins in Pfamseq contain a match to at least one Pfam domain. 50.79% of all residues in the sequence database fall within Pfam domains. Pfam 22.0 is based on UniProt release 9.7, a composite of Swiss-Prot release 51.7 and TrEMBL release 34.7.

The only change in this release is that Phobius has been used to calculate both signal peptides and transmembrane regions, instead of signalp and tmhmm.



Latest changes to website

Release 1.4 (7th January 2008)

  • Improved sequence validation: the sequence search tool now accepts the commonly seen FASTA header styles
  • New help section on privacy: we now have a brief overview of the privacy issues surrounding use of the Pfam website.
  • New features notification: the index page includes a section listing those features that have been added to the site since your last visit.
  • Tooltips on domain graphics: as you move your mouse over domain graphics you should now see small tooltips which give details of the domain under the cursor, as in the old Pfam website.
  • Links underlined: all links should now be underlined, either permanently, or when you move your mouse over them.
  • Change status: the page for a Pfam-A family now shows whether the family has changed since the previous Pfam release.
  • Sequence search defaults changed: the defaults for the Pfam sequence search in the homepage have been changed, so that the search now produces results from both "ls" and "fs" models, merged according to the rules for each family.
  • Molecular surfaces in AstexViewer: when you view a protein structure using AstexViewer, you will now see the Pfam-A regions outlined with a semi-transparent molecular surface.
  • New species tree tools: you can now download sequences (in FASTA format) or sequence accessions for selected nodes in a species tree.
  • Fixed sequence display: the page for a protein sequence now shows the correct sequence for the entry. Previously, the first residue in the sequence was duplicated.
  • Fixed URL for XML schema in help pages: uploaded XML files describing domain graphics are validated against an XML schema. In the example XML snippets in the domain graphics help page, the URL for that document was wrong.
  • Fixed a bug in sequence search output: there was a bug in the output of batch sequence searches, causing occasional missing residues in the output. This has been corrected.


Getting Started using Pfam


Using the "Jump to" search

Many pages in the site include a small search box, entitled "Jump to...". The "Jump to..." box allows you to go immediately to the page for any entry in the Pfam site, including Pfam families, clans and UniProt sequence entries.

The "Jump to..." search understands accessions and IDs for most types of entry. For example, you can enter either a Pfam family accession, e.g. PF02171, or, if you find it easier to remember, a family ID, such as piwi. Note that the search is case insensitive.

Because some identifiers can be ambiguous, the "Jump to..." search may need to test several types of identifier to find the entry that you're looking for. For example, Pfam A family IDs (e.g. Kazal_1) and Pfam clan IDs (e.g. Kazal) aren't easily distinguished, so if you enter kazal, the search will first look for a family called kazal and, if it doesn't find one, will then look for a clan called kazal. If all of the guesses fail, you'll be returned to the page where you started the search.

The order in which the search tries the various types of ID and accession is given below:

  • Pfam A accession, e.g. PF02171
  • Pfam B accession, e.g. PB000001
  • Pfam clan accession, e.g. CL0005
  • UniProt sequence accession, e.g. P00789
  • UniProt sequence ID, e.g. CANX_CHICK
  • PDB entry, e.g. 2abl
  • Pfam A ID, e.g. Piwi
  • Pfam clan ID, e.g. Kazal
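The lookup order above can be sketched as a small classifier. This is only an illustration: the regular expressions are assumptions inferred from the example identifiers in the list, and the real site resolves identifiers by querying its database rather than by pattern alone.

```python
import re

# Patterns are assumptions inferred from the examples listed above.
ACCESSION_PATTERNS = [
    ("Pfam A accession", re.compile(r"PF\d{5}")),
    ("Pfam B accession", re.compile(r"PB\d{6}")),
    ("Pfam clan accession", re.compile(r"CL\d{4}")),
    ("UniProt sequence accession", re.compile(r"[OPQ]\d[A-Z0-9]{3}\d")),
    ("UniProt sequence ID", re.compile(r"[A-Z0-9]+_[A-Z0-9]+")),
    ("PDB entry", re.compile(r"\d[A-Z0-9]{3}")),
]
# IDs have no fixed shape, so they are always tried last, in this order.
FALLBACK_TYPES = ["Pfam A ID", "Pfam clan ID"]


def candidate_types(query):
    """Return the entry types to try for a query, in the documented order."""
    text = query.strip().upper()  # the search is case insensitive
    matches = [name for name, pattern in ACCESSION_PATTERNS
               if pattern.fullmatch(text)]
    return matches + FALLBACK_TYPES


print(candidate_types("pf02171"))
print(candidate_types("kazal"))
```

For an ambiguous query such as kazal, no accession pattern matches, so only the family ID and clan ID lookups remain, in that order.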


Keyword search

Every page in the Pfam site includes a search box in the page header. You can use this to find Pfam A families which match a particular keyword. The search includes several different areas of the Pfam database:

  • text fields in Pfam entries, e.g. family descriptions
  • UniProt sequence entry description and species fields
  • HEADER and TITLE fields from PDB entries
  • Gene Ontology IDs and terms
  • InterPro entry abstracts

Each Pfam A entry is listed only once in the results table, although it might have been found in more than one area of the database.


Searching a protein sequence against Pfam

Searching a protein sequence against the Pfam library of HMMs will allow you to find out the domain architecture of the protein.

Single protein search

If your protein is present in the version of UniProt we used to make the current release of Pfam, we have already calculated its domain architecture, which you can access by entering the UniProt accession or ID on the Pfam homepage. If your protein is new, you will need to paste the protein sequence into the search page; we will then search your sequence against our HMMs and display the matches for you.

Medium scale protein searches (fewer than 5,000 sequences)

If you have a few thousand protein searches to do, you can use our batch upload facility to upload a file of your sequences in FASTA format. We will run them against our HMMs and email the results back to you, usually within 48 hours. We request that you put a maximum of 1000 sequences in each file.
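To stay within the 1000 sequence limit per file, a larger FASTA file can be split before upload. The following is a minimal sketch; the function name and the in-memory approach are illustrative choices, not part of any Pfam tool.

```python
def split_fasta(fasta_text, max_records=1000):
    """Split FASTA text into chunks holding at most max_records sequences."""
    records, current = [], []
    for line in fasta_text.splitlines():
        # A header line starts a new record, so close the previous one.
        if line.startswith(">") and current:
            records.append("\n".join(current))
            current = []
        if line.strip():
            current.append(line)
    if current:
        records.append("\n".join(current))
    return ["\n".join(records[i:i + max_records]) + "\n"
            for i in range(0, len(records), max_records)]


# Hypothetical input: 2500 short sequences.
example = "".join(f">seq{i}\nMKV\n" for i in range(2500))
chunks = split_fasta(example)
print(len(chunks), [chunk.count(">") for chunk in chunks])
```

Each returned chunk can then be written to its own file and uploaded separately.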

Large scale protein searches (more than 5,000 sequences)

If you have a large number of protein searches, it may be more convenient to run Pfam locally. To do this you will need the HMMER2 software, the Pfam HMM libraries and a couple of additional files from the Pfam website.

  1. Download the HMMER2 software.
  2. Download the Pfam files Pfam_ls, Pfam_fs, Pfam-A.seed and Pfam-C from the ftp site.

    These files contain the HMMs and additional information that is required to carry out the searches.

  3. Download a copy of pfam_scan.pl from the ftp site.

    This is a wrapper script around hmmpfam, the HMMER program that searches query sequences against a library of profile HMMs. If perldoc is installed, more detailed instructions on how to use pfam_scan.pl can be found by typing on the command line 'perldoc pfam_scan.pl'.

  4. On the command line enter:

    pfam_scan.pl -d <directory_location_of_Pfam_files> <fasta_file_of_proteome>

    For example if the files were downloaded into a folder called pfam_files, and the FASTA sequences were in a file called sequences.fasta, type:

    pfam_scan.pl -d pfam_files sequences.fasta

    pfam_scan.pl will search the FASTA sequences against the profile HMMs and report all matches to families that score higher than the manually set thresholds for each of the Pfam families. The output is in the following format:

    <seq_id> <seq_start> <seq_end> <hmm_acc> <hmm_start> <hmm_end> <bit_score> <evalue> <hmm_name>

    e.g.

    O00519     95       562     PF01425.9       1      513      526.5     2.6e-155  Amidase
    O01636     22       139     PF02408.8       1      137      157.3     3.6e-44   DUF141
    O03046     12       137     PF02788.5       1      132      295.2     1.1e-85   RuBisCO_large_N
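The nine whitespace-separated columns of this output are straightforward to parse. The sketch below assumes each result line has exactly the columns listed above; the PfamHit class and parse_hit function are illustrative names, not part of pfam_scan.pl.

```python
from typing import NamedTuple


class PfamHit(NamedTuple):
    seq_id: str
    seq_start: int
    seq_end: int
    hmm_acc: str
    hmm_start: int
    hmm_end: int
    bit_score: float
    evalue: float
    hmm_name: str


def parse_hit(line):
    """Parse one whitespace-separated result line into a PfamHit."""
    f = line.split()
    if len(f) != 9:
        raise ValueError(f"expected 9 columns, found {len(f)}: {line!r}")
    return PfamHit(f[0], int(f[1]), int(f[2]), f[3], int(f[4]), int(f[5]),
                   float(f[6]), float(f[7]), f[8])


hit = parse_hit("O00519     95       562     PF01425.9       1      513"
                "      526.5     2.6e-155  Amidase")
# Family name and length of the matched region in residues.
print(hit.hmm_name, hit.seq_end - hit.seq_start + 1)
```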
        		

Proteome analysis

Pfam pre-calculates the domain architecture for the proteomes present in Integr8. To see the list of proteomes, visit the genomes page within the Pfam website. From here you will be able to access a table showing the Pfam families that are found in a particular proteome, a short description of each of those families and the number of proteins that match each family. Links from these pages will allow you to investigate particular proteins further. You can also compare two proteomes and find out which protein families are shared between the proteomes, and which are unique to a particular proteome.

The taxonomy query allows quick identification of families/domains which are present in one species but are absent from another. It can also be used to find families/domains that are unique to a particular species (note that this can be very slow).
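Conceptually, comparing two proteomes or species comes down to set operations on their family lists. A minimal sketch, using hypothetical family lists rather than real proteome data:

```python
def compare_proteomes(families_a, families_b):
    """Return families shared by both proteomes and those unique to each."""
    a, b = set(families_a), set(families_b)
    return {"shared": a & b, "only_a": a - b, "only_b": b - a}


# Hypothetical family lists, for illustration only.
result = compare_proteomes(["Amidase", "Piwi", "CBS"], ["CBS", "IMPDH", "Piwi"])
print(sorted(result["shared"]), sorted(result["only_a"]), sorted(result["only_b"]))
```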


Finding proteins with a specific set of domain combinations ('architectures')

Pfam allows you to retrieve all the proteins with a particular domain combination (e.g. proteins containing both a CBS domain and an IMPDH domain) by using the domain query tool. For a more detailed study of domain architectures you should use PfamAlyzer, a tool that is hosted on the Swedish Pfam site. PfamAlyzer allows the user to find proteins which contain a specific combination of domains, and to specify particular species and the distances allowed between domains.
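The idea behind a domain combination query can be sketched as a filter over architectures. The protein IDs and architectures below are hypothetical, and the function is not the implementation of the domain query tool.

```python
def proteins_with_domains(architectures, required):
    """Return IDs of proteins whose architecture contains every required domain."""
    wanted = set(required)
    return sorted(pid for pid, domains in architectures.items()
                  if wanted <= set(domains))


# Hypothetical protein IDs and architectures (domains in N- to C-terminal order).
architectures = {
    "protA": ["IMPDH", "CBS", "CBS", "IMPDH"],
    "protB": ["CBS", "CBS"],
    "protC": ["IMPDH"],
}
print(proteins_with_domains(architectures, ["CBS", "IMPDH"]))
```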

What is Pfam ?

Pfam is a collection of multiple sequence alignments and profile hidden Markov models (HMMs). Each Pfam HMM represents a protein family or domain. By searching a protein sequence against the Pfam library of HMMs you can find out its domain architecture. You can also use Pfam to analyse proteomes and domain architectures.

For each Pfam entry we have a family page which can be accessed in several ways: by following links for a particular family/domain on the website, by clicking on a graphical image of a domain, or by searching for a particular family using the search box on the Pfam website. From the family page you can view the alignments and annotation (including annotation from the InterPro database where available), view the species distribution of a domain, access cross-links to other databases and use other tools for protein analysis. We display structural information for each family where available.


What is the difference between Pfam-A and Pfam-B families ?

Pfam contains two types of families, Pfam-A and Pfam-B. Pfam-A families are manually curated HMM based families which we build using an alignment of a small number of representative sequences (we call this alignment the 'seed' alignment). We manually set a threshold value for each HMM, and this determines the minimum score a sequence must attain to belong to the family. We search all of our HMMs against the UniProt database, and include all sequences that score above the cut-off value for a particular family in the family's full alignment. For each family we build two HMMs, one to represent fragment matches and one to represent full length matches. We use the HMMER2 software to build and search our profile HMMs. Pfam-A matches are very unlikely to be false matches.

To complement the Pfam-A families, we automatically generate Pfam-B families using the PRODOM database. Pfam-B families are formed by taking alignments of sequence segments from PRODOM, and removing any Pfam-A residues from them. Some Pfam-B families are composed of low complexity regions and may not reflect true relationships. We therefore recommend that you verify that sequences in a Pfam-B family are related by using other methods, such as BLAST.

All families in Pfam are non-overlapping, such that no amino acid belongs to more than one family/domain. At each Pfam release we search all of our models against an updated version of UniProt, and regenerate our Pfam-B families using the most recent version of PRODOM.


What is a clan ?

Some of the Pfam families are grouped into 'clans'. Pfam defines a clan as a collection of families that have arisen from a single evolutionary origin. Evidence of their evolutionary relationship can be in the form of similarity in tertiary structures or, when structures are not available, common sequence motifs. The seed alignments for all families within a clan are aligned and the resulting alignment (called the clan alignment) can be accessed from a link on the clan page. Each clan page includes a clan alignment, a description of the clan and database links where appropriate. The clan pages can be accessed by following a link from the family page, or alternatively by clicking on 'clans' under the 'browse by' menu on any Pfam page.


Why are there two HMMs for each Pfam entry ?

For each Pfam entry we build two HMMs, one to represent full length matches (ls model), and one to represent fragment matches (fs model). When you search a protein on our site we search it against both the ls and fs models.


Can I search DNA against Pfam ?

The Wise2 software package allows the comparison of protein HMMs to genomic DNA. We use this package to allow users to search single DNA sequences against the library of Pfam HMMs. To do this, you will need to go to the Pfam website hosted at the Sanger Institute and paste your DNA sequence in the DNA search box on the search page. The results take approximately 2 minutes for a 1kb sequence, and approximately 1 hour for an 80kb sequence.


How can I submit a new domain ?

If you know of a domain that is not present in Pfam, you can submit it to us by email (pfam-help@sanger.ac.uk) and we will endeavour to build a Pfam entry for it. We ask that you supply us with a multiple sequence alignment of the domain, and associated literature evidence if available. Please send the alignment as a plain text file (e.g. .txt) and not in the format of a specific application such as Microsoft Word (e.g. a .doc file).


What is iPfam ?

iPfam is a resource that describes domain-domain interactions that are observed in PDB entries. Where two or more Pfam domains occur in a single structure, it analyses them to see if they are close enough to form an interaction. If they are close enough, it calculates the bonds forming the interaction. Further information can be found on the iPfam help pages.


How can I search Pfam locally ?

If you have a large number of sequences or you don't want to post your sequence across the web, you can search your sequence locally. Please see the help page on getting started with Pfam.


Why doesn't Pfam include my sequence ?

Pfam is built from a fixed release of UniProt. At each Pfam release we incorporate sequences from the latest release of UniProt. This means that at any time the sequences used by Pfam could be several months behind those in the most up-to-date versions of the sequence databases. If your sequence isn't in Pfam you can still find out what domains it contains by pasting it into the search box in Pfam.


What's on the family page ?

Each family has the following data:

  • A seed alignment which is a hand edited multiple alignment representing the family.
  • Hidden Markov Models (HMMs) derived from the seed alignment, which can be used to find new members of the domain and also to realign a set of sequences to the model. One HMM is an ls mode (global) model, the other is an fs mode (local) model.
  • A full alignment which is an automatic alignment of all the examples of the domain using the two HMMs to find and then align the sequences.
  • Annotation that contains a brief description of the domain, links to other databases and some Pfam specific data recording how the family was constructed.


How many accurate alignments do you have ?

Release 22.0 has 9318 families. Over 73.2% of the proteins in Swiss-Prot 51.7 and TrEMBL 34.7 have at least one match to a Pfam-A family.


Can I search my protein against Pfam ?

Of course! Please use this search form.


Can I search my DNA sequence in translation against Pfam ?

Yes you can, on the DNA search page.


What is the difference between the - and . characters in your full alignments ?

The '-' and '.' characters both represent gap characters. However, they do give you some extra information about how the HMM has generated the alignment. The '-' symbols are where the alignment of the sequence has used a delete state in the HMM to jump past a match state. This means that the sequence is missing a column that the HMM was expecting to be there. The '.' character is used to pad gaps where another sequence in the alignment has residues emitted from the HMM's insert state. See the alignment below, where both characters are used. The HMM states emitting each column are shown. Note that residues emitted from the Insert (I) state are in lower case.

FLPA_METMA/1-193     ---MPEIRQLSEGIFEVTKD.KKQLSTLNLDPGKVVYGEKLISVEGDE
FBRL_XENLA/86-317    RKVIVEPHR-HEGIFICRGK.EDALVTKNLVPGESVYGEKRISVEDGE
FBRL_MOUSE/90-321    KNVMVEPHR-HEGVFICRGK.EDALFTKNLVPGESVYGEKRVSISEGD
O75259/81-312        KNVMVEPHR-HEGVFICRGK.EDALVTKNLVPGESVYGEKRVSISEGD
FBRL_SCHPO/71-303    AKVIIEPHR-HAGVFIARGK.EDLLVTRNLVPGESVYNEKRISVDSPD
O15647/71-301        GKVIVVPHR-FPGVYLLKGK.SDILVTKNLVPGESVYGEKRYEVMTED
FBRL_TETTH/64-294    KTIIVK-HR-LEGVFICKGQ.LEALVTKNFFPGESVYNEKRMSVEENG
FBRL_LEIMA/57-291    AKVIVEPHMLHPGVFISKAK.TDSLCTLNMVPGISVYGEKRIELGATQ
Q9ZSE3/38-276        SAVVVEPHKVHAGIFVSRGKsEDSLATLNLVPGVSVYGEKRVQTETTD
HMM STATES           MMMMMMMMMMMMMMMMMMMMIMMMMMMMMMMMMMMMMMMMMMMMMMMM
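The conventions above can be expressed as a small sketch that labels each character of an aligned row and rebuilds the HMM STATES line from the rows. The function names are illustrative; the two rows are copied from the alignment above.

```python
def classify_columns(aligned_row):
    """Label each character of an aligned row by the HMM state it implies."""
    labels = []
    for ch in aligned_row:
        if ch == "-":
            labels.append("delete")      # match column skipped via a delete state
        elif ch == ".":
            labels.append("insert_pad")  # padding opposite another sequence's insert
        elif ch.islower():
            labels.append("insert")      # residue emitted from an insert state
        else:
            labels.append("match")       # residue emitted from a match state
    return labels


def hmm_states(rows):
    """Rebuild the HMM STATES line: 'I' for insert columns, 'M' otherwise."""
    return "".join(
        "I" if any(r[i] == "." or r[i].islower() for r in rows) else "M"
        for i in range(len(rows[0]))
    )


rows = [
    "---MPEIRQLSEGIFEVTKD.KKQLSTLNLDPGKVVYGEKLISVEGDE",  # FLPA_METMA/1-193
    "SAVVVEPHKVHAGIFVSRGKsEDSLATLNLVPGVSVYGEKRVQTETTD",  # Q9ZSE3/38-276
]
print(hmm_states(rows))
```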
	


What do the SS and SA lines in the alignment mean ?

These lines are structural information. The SS stands for secondary structure, and this is taken from DSSP. The following list gives the definitions for each code letter:

C
Random Coil
H
Alpha-helix
G
3(10) helix
I
Pi-helix
E
Hydrogen bonded beta-strand (extended strand)
B
Residue in isolated beta-bridge
T
H-bonded turn (3-turn, 4-turn, or 5-turn)
S
Bend (five-residue bend centered at residue i)

The SA stands for surface accessibility. It is expressed on a scale between 0 and 9 where 0 is a buried residue and 9 is a solvent exposed residue.
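For scripts that read these annotation lines, the codes above can be kept in a lookup table. This is a minimal sketch; describe_position is a hypothetical helper that takes one SS letter and one SA digit from the same alignment column.

```python
# Code letters and meanings as listed above.
SS_CODES = {
    "C": "random coil",
    "H": "alpha-helix",
    "G": "3(10) helix",
    "I": "pi-helix",
    "E": "hydrogen bonded beta-strand",
    "B": "isolated beta-bridge",
    "T": "H-bonded turn",
    "S": "bend",
}


def describe_position(ss_code, sa_digit):
    """Translate one SS letter and one SA digit into readable values."""
    accessibility = int(sa_digit)
    if not 0 <= accessibility <= 9:
        raise ValueError("SA must be between 0 (buried) and 9 (exposed)")
    return SS_CODES[ss_code], accessibility


print(describe_position("H", "0"))
```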

You can find more information here.


You don't have domain YYYY in Pfam !

We are very keen to be alerted to new domains. If you can provide us with a multiple alignment then we will try hard to incorporate it into the database. If you know of a domain but don't have a multiple alignment, we still want to know; for simple families just one sequence is enough. Again, please e-mail pfam-help@sanger.ac.uk.


Can we have this running sensibly locally. Do you have software to support Pfam locally?

In terms of HMMs and formats, Pfam is based around the HMMER2 package. This will need to be installed on your local machine, and you will also need to download the Pfam HMM libraries from the FTP site. These files are called Pfam_ls (global models) and Pfam_fs (fragment models).


Are there other databases which do this ?

To a certain extent, yes: there are a number of "second generation" databases which are trying to organise the protein database into evolutionarily conserved regions. Examples include:

PROSITE
This originally was based around regular expression patterns but now also includes profiles.
PRINTS
This is based around protein "finger-prints" of a series of small conserved motifs making up a domain.
BLOCKS
This is based around automatic ungapped alignments.
SMART
This is a database concentrating on extracellular modules and signaling domains.
PRODOM
This is an automatically generated domain database based on PSI-BLAST searching.
InterPro
Combines information from Pfam, Prints, SMART, Prosite and PRODOM.
CDD
The Conserved Domain Database is derived from Pfam and SMART databases.


So which database is better ?

As with everything, it depends on your problem: I would certainly suggest using more than one method. Pfam is likely to provide more interpretable results with crisp definitions of domains in a protein.


Help With Pfam HMM scores

What Pfam HMM scores mean

Pfam-A is based around hidden Markov model (HMM) searches, as provided by the HMMER2 package. In HMMER2, like BLAST, E-values (expectation values) are calculated. The E-value is the number of hits that would be expected to have a score equal or better than this by chance alone. A good E-value is much less than 1. Around 1 is what we expect just by chance. In principle, all you need to decide on the significance of a match is the E-value.

However, there are a few complications.

The most serious complication is that there are no analytical results available for accurately determining E-values for gapped alignments, especially profile HMM alignments. HMMER uses empirical methods to estimate E-values. These methods are generally rather accurate. However, when in doubt, HMMER tends to err on the conservative side.

We use a second, and even more empirical, system in maintaining Pfam models. This system is implemented in the Pfam database rather than in the HMMER software. For each Pfam family, we record a "trusted cutoff" and a "noise cutoff", TC1 and NC1. TC1 is the lowest score for sequences we included in the family (e.g. in the Full alignment). NC1 is the highest score for sequences we did not include in the Full alignment. (Since Full alignments are produced automatically, the trusted sequence cutoff is always greater than the noise sequence cutoff.)

Therefore, we can consider a hit very significant if it scores better than the trusted cutoff, better than the noise cutoff, and has a significant E-value. Sometimes sequences score better than the cutoffs though they don't have significant E-values; these are marginal hits that we've chosen to include in the family.
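The combination of cutoffs and E-values can be sketched as a simple decision function. The cutoff values in the example are hypothetical, and the max_evalue threshold is an assumption made for illustration; the text above only says that a good E-value is much less than 1.

```python
def classify_hit(score, evalue, trusted_cutoff, noise_cutoff, max_evalue=0.01):
    """Classify a hit using per-family cutoffs plus an E-value test.

    max_evalue is an illustrative assumption, not a Pfam parameter.
    """
    if score >= trusted_cutoff:
        # Above the cutoffs: significance then depends on the E-value.
        return "very significant" if evalue <= max_evalue else "marginal"
    if score <= noise_cutoff:
        return "below noise cutoff"
    return "between cutoffs"


# Hypothetical family with TC1 = 20.0 and NC1 = 19.0 bits.
print(classify_hit(526.5, 2.6e-155, 20.0, 19.0))
print(classify_hit(21.0, 0.5, 20.0, 19.0))
```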

Sequence versus domain scores

There's one additional wrinkle in the scoring scheme. HMMER2 calculates two kinds of scores. The "sequence classification score" is the total score of a sequence aligned to a model; if there is more than one domain, the sequence score is the sum of all the domain scores. (Finding multiple domains increases our confidence that the sequence belongs to that protein family, even if each domain individually is a weak match.) The "domain score" is a score for a single domain; the two scores are identical for single domain proteins.
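As a worked illustration of the paragraph above, the sketch below treats the sequence score as a plain sum of domain scores, following that description. The scores and the threshold are hypothetical, and this is not a reimplementation of HMMER2's scoring.

```python
def sequence_score(domain_scores):
    """Sequence score taken as the sum of the individual domain scores."""
    return sum(domain_scores)


# Hypothetical example: three weak repeats of 8.0 bits each, with an
# assumed sequence-level threshold of 20.0 bits.
domain_scores = [8.0, 8.0, 8.0]
threshold = 20.0
total = sequence_score(domain_scores)
print(total, total >= threshold, all(s < threshold for s in domain_scores))
```

Here no single domain reaches the threshold, but the sequence as a whole does.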

One of the features provided by the Pfam website is a graphical representation of the features found within a sequence, termed domain graphics. There are a variety of different shapes and styles and each one has a particular meaning. This page gives an in-depth description of the elements of Pfam domain graphics.

The library which generates the images in this page and throughout the Pfam site uses an XML language to describe the domain graphic that is required. Each of the example graphics in this page is followed by a link that can be used to show the XML that produced it.

We provide a set of tools, described in the Tools & Web Services section of the help pages, that allow you to generate custom domain graphics by uploading your own XML file, or to generate graphics for a specific UniProt sequence, given the UniProt accession or ID.


The sequence

The base sequence, undecorated by any domains or features, is represented by a plain grey bar:


The length of the domain graphic that is drawn is proportional to the length of the sequence itself. The graphics in this page are drawn with an X-scale of 0.5 pixels per amino-acid, so that a 200 residue sequence will result in a 100 pixel-wide image. Any domains or features which are drawn on the sequence are also scaled by the same factor.
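The scaling arithmetic can be sketched as follows. Rounding to the nearest whole pixel is an assumption made for illustration; the function names are hypothetical and not part of the graphics library.

```python
def to_pixels(residues, x_scale=0.5):
    """Convert a length or position in residues to pixels (rounding is assumed)."""
    return round(residues * x_scale)


def scale_region(start, end, x_scale=0.5):
    """Scale a domain's start and end positions by the same factor as the sequence."""
    return to_pixels(start, x_scale), to_pixels(end, x_scale)


print(to_pixels(200))          # width of a 200 residue sequence
print(scale_region(100, 300))  # pixel extent of a domain spanning residues 100-300
```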



Pfam-A

The high quality, curated Pfam-A domains are classified into one of four different types: family, domain, repeat and motif. These different classification types are rendered slightly differently.

Family/domain

It is possible for a sequence to match either the full length of a Pfam HMM (a full length match), or to match a portion of an HMM (a fragment match). The two types of match are rendered differently.

Both family and domain entries are rendered as rectangles with curved ends when the sequence is a full length match. The curves at the ends become less pronounced when the domains are short, as shown in the second domain below. Different types of domain are displayed with different colours. When the domain image is long enough, the domain name is shown within the domain itself. In most cases, you can click on the domains to visit the "family page" for that domain. Moving the mouse over the domain image should also display a tooltip showing the domain name, as well as the start and end positions of the domain.


When the sequence does not match the full length of the HMM that models a Pfam entry, matching domain fragments are shown. When a sequence match does not pass through the first position in the HMM, the N-terminal side of the domain graphic is drawn with a jagged edge instead of a curved edge. Similarly, when a sequence match does not pass through the last position of the HMM, the C-terminal side of the domain graphic is drawn with a jagged edge. In some rarer cases, the sequence match may not pass through either of the first or last positions of the HMM, in which case both sides are drawn with jagged edges. Examples of all three cases are shown here:


Repeat/motif

Repeats and motifs are types of Pfam domain which do not form independently folded units. In order to distinguish them from domains of type family and domain, repeats and motifs are represented by rectangles with straight edges. As for families and domains, partial matches are represented with jagged edges.
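The edge rules described for families, domains, repeats and motifs can be summarised in a short sketch. The function and its arguments are hypothetical; it only encodes the rules stated in the text and is not the site's drawing code.

```python
PFAM_A_TYPES = ("family", "domain", "repeat", "motif")


def edge_styles(entry_type, hmm_start, hmm_end, model_length):
    """Return (N-terminal, C-terminal) edge styles for one Pfam-A match."""
    if entry_type not in PFAM_A_TYPES:
        raise ValueError(f"unknown Pfam-A type: {entry_type!r}")
    # Families and domains get curved ends; repeats and motifs get straight ones.
    full_edge = "curved" if entry_type in ("family", "domain") else "straight"
    # A match that misses the first or last HMM position is drawn jagged there.
    left = full_edge if hmm_start == 1 else "jagged"
    right = full_edge if hmm_end == model_length else "jagged"
    return left, right


print(edge_styles("domain", 1, 513, 513))
print(edge_styles("repeat", 1, 30, 40))
```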


Discontinuous nested domains

Some domains in Pfam are disrupted by the insertion of another domain (or domains) within them. A number of names have been given to this arrangement: discontinuous (referring to the outer domain), inserted or nested (both referring to the inner domain). For example, in many sequences containing an IMPDH domain, the IMPDH domain is continuous along the primary sequence. However, in some cases the linear sequence of the IMPDH domain is broken by the insertion of a CBS domain, as shown below.

Where three-dimensional structures are available for representatives of a Pfam domain, it is generally clear that the three-dimensional arrangement of the domain containing the nested domain is maintained. Typically the nested domain is found inserted within a surface exposed loop, having little or no effect on the structure of the other domain. Such an arrangement explains why and how these nested domains can be functionally tolerated.

To represent this arrangement of domain graphically, the discontinuous domain is represented in two parts (as shown below). These two parts are joined by a line bridging them. The vertical parts of the line are dashed, while the horizontal line is solid (to distinguish it from a disulphide bridge).


Context domains

Context domains in Pfam are those that, despite not scoring above the family gathering threshold, are expected to be real, based on the presence of the surrounding domains found in the protein. The method is described in:

Enhanced protein domain discovery by using language modeling techniques from speech recognition: L. Coin, A. Bateman and R. Durbin Proc. Natl. Acad. Sci. USA. (2003) 100(8):4516-20

In some cases it is possible for a protein without any matches to gain context domains. This happens when two or more weak matches support each other. This is most often seen with multiple tandem repeats such as WD40 and leucine rich repeats such as LRR_1.

Within the Pfam domain graphics, the context domains are represented by rectangles that are coloured from white to pink as shown below. These images are interactive in the same manner as the Pfam-A graphics.


Please note that context domains are generated automatically and have not been subjected to the same high level of quality control as Pfam-A domains. Therefore, context domains, although likely to be correct, should always be verified by other means.



Pfam-B

Pfam-B regions are automatically generated clusters that supplement the high quality Pfam-A regions. The mechanism for generating Pfam-B regions is detailed here. These regions are represented by a small rectangle, coloured with three stripes. As for Pfam-A regions, clicking on a Pfam-B domain takes the user to the Pfam-B summary page for that entry. Moving the mouse over the striped image will show a tooltip listing the Pfam-B identifier and its start and end points. If the Pfam-B region is long enough, its identifier will also be displayed on the image.



Other sequence motifs

In addition to domains, smaller sequence motifs are represented by the domain graphics. Currently the following motifs are represented: signal peptides, low complexity regions, coiled-coils and transmembrane regions. These usually take lower priority than other regions that are drawn and they are therefore often obscured by, for example, a Pfam-A graphic being drawn over the top of them. An example of each motif is shown here.


Signal peptides

Signal peptides are short regions (<60 residues long) found at the N-terminus of proteins, which direct the post-translational transport of a protein and are subsequently removed by peptidases. More specifically, a signal peptide is characterised by a short hydrophobic helix (approximately 7-15 residues). This helix is preceded by a slightly positively charged region of highly variable length (approximately 1-12 residues). Between the hydrophobic helix and the cleavage site is a somewhat polar and uncharged region of between 3 and 8 amino-acids. In Pfam, we use Phobius for the prediction of signal peptides and represent them graphically by a small orange box.

A combined transmembrane topology and signal peptide prediction method: L. Kall, A. Krogh and E.L.L. Sonnhammer J. Mol. Biol. (2004) 338(5):1027-36


Low complexity regions

Low complexity regions are regions of biased sequence composition, usually comprised of different types of repeats. These regions have been shown to be functionally important in some proteins, but they are generally not well understood and are masked out to focus on globular domains within the protein.

Within Pfam, we use SEG to calculate low complexity regions. The presence of a low complexity region is indicated by a cyan rectangle.
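
To give a feel for what "biased sequence composition" means in practice, here is a minimal sketch that flags windows whose residue composition has low Shannon entropy. It is not the SEG algorithm and will not reproduce SEG output; the function names, the window length of 12 and the threshold of 2.2 bits are illustrative assumptions.

```python
import math
from collections import Counter


def window_entropy(window):
    """Shannon entropy (in bits) of the residue composition of a window."""
    counts = Counter(window)
    n = len(window)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def low_complexity_regions(seq, window=12, threshold=2.2):
    """Return merged, 1-based (start, end) spans covered by low-entropy windows.

    Simplified illustration only: `window` and `threshold` are assumed
    values, and real SEG uses a more elaborate two-stage procedure.
    """
    regions = []
    for i in range(len(seq) - window + 1):
        if window_entropy(seq[i:i + window]) <= threshold:
            start, end = i + 1, i + window
            if regions and start <= regions[-1][1] + 1:
                # Extend the previous span when windows overlap or touch.
                regions[-1] = (regions[-1][0], max(regions[-1][1], end))
            else:
                regions.append((start, end))
    return regions
```

A homopolymer run has an entropy of zero and is always flagged, whereas a window of twelve different residues has an entropy of about 3.6 bits and is not.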


Coiled-coils

Coiled-coils are motifs found in proteins that structurally form alpha-helices that wrap or wind around each other. Normally, two to three helices are involved, but cases of up to seven alpha-helices have been reported. Coiled-coils are found in a wide variety of proteins, many of them functionally very important. In Pfam we use ncoils to identify these motifs. Coiled-coils are represented by a small lime-green rectangle.


Transmembrane regions

Integral membrane proteins contain one or more transmembrane regions, each comprised of an alpha-helix that passes through or "spans" a membrane. Transmembrane helices are quite variable in length, with an average of about 20 amino-acids. Again, Phobius is used for the prediction of transmembrane regions, which are represented by a red rectangle.



Other Sequence features

Below is a demonstration of how disulphide bridges and active site residues are represented in Pfam. Each of these features can appear above or below the sequence, but in this case the disulphide bridges are shown above the sequence and the active site residues below it.


Disulphide bridges

Disulphide bridges play a fundamental role in the folding and stability of some proteins. They are formed by covalent bonding between the thiol groups of two cysteine residues. The disulphide bridge annotations used in Pfam come from UniProt and are represented by a solid bridge-shaped line. When multiple disulphide bonds occur, the heights of the bridges are adjusted to avoid overlaps between them. Inter-protein disulphides are represented by single vertical lines. As always, moving the mouse over the "bridge graphic" shows the details of the bond in a tooltip.
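
The height adjustment can be pictured as an interval layout problem. The sketch below shows one simple scheme, assuming bridges are given as (start, end) residue pairs: shorter bridges are placed first and each bridge is drawn one level above any bridge it overlaps. The function name and the strategy are assumptions for illustration; the method the Pfam graphics library actually uses is not described here.

```python
def assign_bridge_levels(bridges):
    """Map each (start, end) bridge to a height level (1 = lowest).

    Hypothetical layout scheme: bridges are processed from shortest to
    longest, and each one is placed one level above the highest
    already-placed bridge that it overlaps.
    """
    placed = []  # list of ((start, end), level) for bridges already laid out
    levels = {}
    for start, end in sorted(bridges, key=lambda b: (b[1] - b[0], b[0])):
        overlapping = [lvl for (s, e), lvl in placed if s <= end and start <= e]
        level = max(overlapping, default=0) + 1
        placed.append(((start, end), level))
        levels[(start, end)] = level
    return levels
```

With this scheme a bridge that encloses a shorter one is drawn higher, while bridges that do not overlap can share the same height.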


Active site residues

Within an enzyme, a small number of residues are directly involved in catalysis of a reaction. These are termed active site residues. Within Pfam there are three categories of active site: those that are experimentally determined, those that are predicted by UniProt and those predicted by Pfam. All three types are represented by a "lollipop" with a diamond head. The head is coloured red, pink and purple respectively for the three types.

Pfam-predicted active sites are determined by using the experimental data and transferring these annotations through a Pfam alignment.
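
As a rough illustration of transferring an annotation through an alignment, the sketch below maps a known active-site position in one aligned sequence to its alignment column and then to the residue of a second sequence in that column. The function names, the (aligned string, start position) input format and the rule that the aligned residue must be identical are all assumptions made for this example, not a description of Pfam's actual prediction criteria.

```python
GAP_CHARS = "-."


def column_of_residue(aligned, start, position):
    """Return the 0-based alignment column holding sequence residue `position`."""
    current = start - 1
    for col, ch in enumerate(aligned):
        if ch not in GAP_CHARS:
            current += 1
            if current == position:
                return col
    return None


def residue_at_column(aligned, start, col):
    """Return (position, residue) found at column `col`, or None for a gap."""
    if aligned[col] in GAP_CHARS:
        return None
    ungapped_before = sum(1 for ch in aligned[:col] if ch not in GAP_CHARS)
    return start + ungapped_before, aligned[col]


def transfer_active_sites(source, target, known_sites):
    """Predict active-site positions in `target` from those known in `source`.

    `source` and `target` are (aligned_string, start_position) pairs taken
    from the same alignment, so their columns correspond.
    """
    src_aln, src_start = source
    tgt_aln, tgt_start = target
    predicted = []
    for pos in known_sites:
        col = column_of_residue(src_aln, src_start, pos)
        if col is None:
            continue
        hit = residue_at_column(tgt_aln, tgt_start, col)
        # Assumption: only transfer when the aligned residue is identical.
        if hit is not None and hit[1].upper() == src_aln[col].upper():
            predicted.append(hit[0])
    return predicted
```

Positions that fall opposite a gap, or opposite a different residue, are simply not transferred in this sketch.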



Other features

In addition to the drawing features outlined above, the Pfam domain graphics library includes some additional, general purpose representation styles.

Arrows

Arrows can be drawn perpendicular to the sequence, and can point either towards or away from the sequence line. They can be drawn with different vertical line styles (solid, dashed or bold) and can be placed above or below the sequence. The example below shows the different arrow styles that are available:


Additional "lollipop" styles

A wide range of different lollipop styles can be created by combining different line and head colours with different drawing styles. For example, a lollipop can be drawn with either bold (solid) or dashed lines. The lollipop head can be drawn as a square, circle or diamond.


Guide to Pfam tools and services

Tools

Producing your own graphics

As we are regularly asked to produce domain graphics for use in publications, we have produced a tool that lets users upload a "domain graphics" XML file. This file will be validated against the schema and subsequently rendered. The images that the tool produces can then be saved for your own use.

If there is an existing sequence in Pfam whose graphic you wish to alter or elaborate, the XML used by Pfam for this sequence can also be obtained using this tool.

You can see a detailed description of the XML language that describes Pfam domain images in the Guide to Graphics section of the help pages.

There is a similar tool which allows you to see the domain graphic for a given UniProt entry.


Web services

Service description

The set of web services offered by Pfam provides access to various parts of the Pfam database, with the main focus on the family information. The currently available methods are:

isAlive
test to see if the services are up and running
getIdByAcc
retrieve a Pfam ID, given a Pfam accession (PFXXXXX)
getPfamAnnotation
retrieve a Pfam annotation
getPfamGO
get the Gene Ontology (GO) terms for a given Pfam family
getPfamInterPro
get the InterPro information for a Pfam family
getPfamMembership
retrieve the sequences in a family, in the format " acc / start - end"
annotateSequenceById
retrieve all of the Pfam domains for a given sequence, where the sequence is identified by its UniProt ID
annotateSequenceByAcc
retrieve all of the Pfam domains for a given sequence, where the sequence is identified by its UniProt accession
annotateSequenceByMd5
retrieve all of the Pfam domains for a given sequence, where the sequence is identified by its MD5 checksum
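
Client code will usually need to split the "acc / start - end" strings returned by getPfamMembership into their parts. The following is a minimal sketch of such a parser; the names Region and parse_region are hypothetical, and tolerating optional whitespace around the separators is an assumption about the exact formatting of the returned strings.

```python
import re
from typing import NamedTuple


class Region(NamedTuple):
    """One family member: sequence accession plus start and end coordinates."""
    acc: str
    start: int
    end: int


# Accession, "/", start, "-", end, with optional whitespace around each part.
_REGION = re.compile(r"^\s*(\S+?)\s*/\s*(\d+)\s*-\s*(\d+)\s*$")


def parse_region(text):
    """Parse a string such as 'P15498/671-765' into a Region."""
    match = _REGION.match(text)
    if match is None:
        raise ValueError("not in acc/start-end format: %r" % text)
    return Region(match.group(1), int(match.group(2)), int(match.group(3)))
```

For example, parse_region("P15498/671-765") yields a Region with acc "P15498", start 671 and end 765; the coordinates in this example are made up for illustration.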

Proposed methods

The following methods may be added to the service in the future:

  • getDomainGraphicByAcc
  • getDomainGraphicById
  • getSeedAlign
  • getFullAlign
  • buildAndSearch
  • compareAlignToPfam
  • submitPfamScan

WSDL

The WSDL for the services can be found at: http://services.sanger.ac.uk/Pfam/PfamWebServices.wsdl

Prerequisites

To run the Perl client you will first need to install the SOAP::Lite package from CPAN. SOAP::Lite is a collection of Perl modules which provides a simple and lightweight interface to the Simple Object Access Protocol (SOAP).

Example client

You can download an example Perl client. To run it, simply type "perl pfamClient.pl". The output from the script should be similar to this:

shell% perl pfamClient.pl 
Checking isAlive
Service is alive: Pfam Web Services are alive at Wed May 30 13:32:12 2007
Checking getId2Acc
The id for PF00069 is: Pkinase
Checking getAcc2Id
The acc for Pkinase is: PF00069
Checking getPfamAnnotation
The description for PF00069 is:
ID   Pkinase
AC   PF00069
DE   Protein kinase domain
PI   pkinase;
AU   Sonnhammer ELL
SE   Unknown
GA   -70.30 -70.30; 20.00 20.00;
TC   -70.30 -70.30; 20.00 20.00;
NC   -70.40 -70.40; 19.90 19.90;
TC   Protein kinase domain
NC   Protein kinase domain
TP   Domain
BM   hmmbuild -F HMM_ls SEED
BM   hmmcalibrate --cpu 1 --seed 0 HMM_ls
BM   hmmbuild -f -F HMM_fs SEED
BM   hmmcalibrate --cpu 1 --seed 0 HMM_fs
AM   globalfirst
...
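
The annotation shown above is a flat file in which each line begins with a two-letter tag followed by its value, and some tags (such as BM) occur more than once. A minimal sketch of reading such text into a dictionary is given below. The function name parse_annotation is hypothetical, and the rule used to recognise a tag line (two upper-case letters followed by whitespace) is an assumption based only on the sample output.

```python
from collections import defaultdict


def parse_annotation(text):
    """Collect two-letter-tag lines into a dict mapping tag -> list of values.

    Lines that do not start with two upper-case letters followed by
    whitespace (for example the "..." line) are ignored.
    """
    fields = defaultdict(list)
    for line in text.splitlines():
        tag, rest = line[:2], line[2:]
        if len(tag) == 2 and tag.isalpha() and tag.isupper() and rest[:1].isspace():
            fields[tag].append(rest.strip())
    return dict(fields)
```

Keeping a list per tag preserves repeated lines in their original order, so the build commands under BM can be read back in sequence.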

Contact

If you have questions or comments about the Pfam web services, please contact us via the email address below. If you plan to use Pfam web services for a course or if you intend to make a large number of requests, please let us know, so that we can make sure that sufficient resources are available to you.

How to link to Pfam?

Pfam is maintained by a consortium of researchers based at the Wellcome Trust Sanger Institute, Cambridge, UK (WTSI), Stockholm Bioinformatics Center, Stockholm, Sweden (SBC), and Janelia Farm, Virginia, USA. All three sites run the same Pfam website and linking to different sites only requires that you change the site name, not the parameters in the URL.

Although we have no plans to change the locations of resources within this site dramatically, webmasters are advised to link only to the following types of page within the site.

Home pages

WTSI:
http://pfam.sanger.ac.uk/
SBC:
http://pfam.sbc.su.se/
Janelia:
http://pfam.janelia.org/

Searching a protein sequence against Pfam

WTSI:
http://pfam.sanger.ac.uk/search?tab=sequenceSearchBlock
SBC:
http://pfam.sbc.su.se/search?tab=sequenceSearchBlock
Janelia:
http://pfam.janelia.org/search?tab=sequenceSearchBlock

Searching a DNA sequence against Pfam

WTSI:
http://pfam.sanger.ac.uk/search?tab=sequenceDnaBlock
SBC:
http://pfam.sbc.su.se/search?tab=sequenceDnaBlock
Janelia:
http://pfam.janelia.org/search?tab=sequenceDnaBlock

Linking to Pfam family pages

You can refer to Pfam families either by accession or ID. You can also refer to a family by "entry", although this is a convenience that should be used only if you're not sure if what you have is an accession or an ID.

Pfam accession numbers are more stable between releases than IDs and we strongly recommend that you link by accession number.

Here are some examples of linking to Pfam at WTSI:

By accession:
http://pfam.sanger.ac.uk/family?acc=PF00002
By ID:
http://pfam.sanger.ac.uk/family?id=7tm_2
Using "entry":
http://pfam.sanger.ac.uk/family?entry=PF00002 or
http://pfam.sanger.ac.uk/family?entry=7tm_2

You can link to Pfam family data at the other sites by changing "pfam.sanger.ac.uk" to "pfam.sbc.su.se" or "pfam.janelia.org".

Linking to protein sequence pages

As for Pfam family pages, you can refer to protein sequence pages by accession, ID or entry. Protein IDs are unstable and do change between releases, so, again, we strongly recommend that you use protein accessions where possible.

Here are some examples of linking to protein sequence pages at WTSI:

By accession:
http://pfam.sanger.ac.uk/protein?acc=P15498
By ID:
http://pfam.sanger.ac.uk/protein?id=VAV_HUMAN
Using "entry":
http://pfam.sanger.ac.uk/protein?entry=P15498 or
http://pfam.sanger.ac.uk/protein?entry=VAV_HUMAN

Again, to generate links to the other Pfam sites, change "pfam.sanger.ac.uk" to "pfam.sbc.su.se" or "pfam.janelia.org".
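
Because the three sites share the same URL parameters, links of the kinds listed above can be generated with a small helper. The sketch below is one possible way to do this; the function name pfam_link and the SITES table are hypothetical names introduced for this example, while the host names, page names and parameter names are those given in this section.

```python
from urllib.parse import urlencode

# Host names of the three Pfam sites, as listed in this section.
SITES = {
    "WTSI": "pfam.sanger.ac.uk",
    "SBC": "pfam.sbc.su.se",
    "Janelia": "pfam.janelia.org",
}


def pfam_link(page, value, key="acc", site="WTSI"):
    """Build a link to a Pfam family or protein page.

    `page` is "family" or "protein"; `key` is "acc", "id" or "entry".
    Accessions are the default because they are the most stable.
    """
    if page not in ("family", "protein"):
        raise ValueError("page must be 'family' or 'protein'")
    if key not in ("acc", "id", "entry"):
        raise ValueError("key must be 'acc', 'id' or 'entry'")
    return "http://%s/%s?%s" % (SITES[site], page, urlencode({key: value}))
```

For instance, pfam_link("family", "PF00002") gives the WTSI family link by accession, and passing site="SBC" or site="Janelia" switches only the host name.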

Pfam FTP site

The following list describes a few of the important files in the Pfam FTP site. Some of these files may be very large (of the order of several hundred megabytes). Please check the sizes on the FTP site before trying to download them over a slow connection.

relnotes.txt
Release notes
pfamseq.gz
A fasta version of Pfam's underlying sequence database
Pfam_ls.gz
The Pfam HMM library
Pfam_fs.gz
The Pfam fragment HMM library
Pfam-A.full.gz
The full alignments of the curated families
Pfam-A.seed.gz
The seed alignments of the curated families
Pfam-B.gz
Automatically generated alignments of sequence clusters in SWISSPROT and TrEMBL that are not modelled in the curated part of Pfam
Pfam-C.gz
Contains the information about clans and their Pfam-A membership
swisspfam.gz
The domain structure of SWISSPROT and TrEMBL proteins according to Pfam
prior.tar
A set of prior files for specific Pfam families
diff
Summary of changes since the last release
Icarus
Icarus files for SRS indexing of the files in the Pfam distribution
COPYRIGHT
Copyright notice for Pfam
GNULICENSE
The full text of the GNU Library General Public License under which Pfam is licensed

References & Bibliography

Pfam References

Pfam: clans, web tools and services: R.D. Finn, J. Mistry, B. Schuster-Böckler, S. Griffiths-Jones, V. Hollich, T. Lassmann, S. Moxon, M. Marshall, A. Khanna, R. Durbin, S.R. Eddy, E.L.L. Sonnhammer and A. Bateman Nucleic Acids Research (2006) Database Issue 34:D247-D251
Enhanced protein domain discovery by using language modeling techniques from speech recognition: L. Coin, A. Bateman and R. Durbin Proc. Natl. Acad. Sci. USA. (2003) 100(8):4516-20
The Pfam Protein Families Database: A. Bateman, L. Coin, R. Durbin, R.D. Finn, V. Hollich, S. Griffiths-Jones, A. Khanna, M. Marshall, S. Moxon, E.L.L. Sonnhammer, D.J. Studholme, C. Yeats and S.R. Eddy Nucleic Acids Research (2004) 32:D138-D141
The Pfam Protein Families Database: A. Bateman, E. Birney, L. Cerruti, R. Durbin, L. Etwiller,