Find Pfam families within your sequence of interest. Paste your protein or DNA sequence into the box below to have it searched for matching Pfam families. More...
The form will accept both protein and DNA sequences. Protein sequences
are searched directly with our standard search methodology, which uses our
pfam_scan.pl wrapper around the HMMER package. DNA
sequences are also searched using HMMER, but we use a six-frame
translation to generate six separate protein sequences from your
uploaded DNA and then search each frame separately. We make no attempt
to generate meaningful open reading frames. The matches from all
six searches are combined in the result tables.
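The six-frame translation described above can be sketched as follows. This is a minimal illustration, not the server's code: it assumes the standard genetic code, writes stop codons as "*", translates any codon containing an ambiguity symbol to "X", and uses hypothetical function names. Like the search described above, it makes no attempt to find open reading frames.

```python
from itertools import product

# Standard genetic code in TCAG order; "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Complements for the nucleotide symbols accepted by the form.
COMPLEMENT = str.maketrans("ACGTRYKMSWBDHVN", "TGCAYRMKSWVHDBN")


def reverse_complement(dna: str) -> str:
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna: str) -> str:
    """Translate codon by codon; trailing partial codons are dropped."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    # Assumption: codons that are not plain ACGT triplets become "X".
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame_translation(dna: str) -> dict:
    """Return the three forward and three reverse frames, keyed +1..-3."""
    dna = dna.upper()
    reverse = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(reverse[offset:])
    return frames
```

For example, `six_frame_translation("ATGGCCTAA")["+1"]` gives `"MA*"`. Each of the six strings would then be searched as a separate protein sequence.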
We check all sequences before running a search. In order to avoid problems with the validation of your sequence, you should use only plain, unformatted text. Below are some of the validation checks that we apply to sequences.
For protein sequences:
- sequence length must be less than 10,000 residues
- only residue symbols are allowed in the sequence (letters or "*");
sequences containing other characters will not be accepted.
Note that "-" was previously accepted as a valid sequence character, but is not allowed in the latest version of HMMER.
- FASTA-header lines are accepted but will be removed
For DNA sequences:
- sequence length must be less than 80,000 nucleotides
- only valid nucleotide symbols are allowed in the sequence
(ACGTRYKMSWBDHVN); sequences containing other characters will not be
accepted.
- FASTA-header lines are accepted but will be removed
If you have problems getting your sequence to upload, please check that it passes all of these tests. Note that although we do allow FASTA-style header lines on a sequence, some characters in header lines can still cause the sequence to be rejected. If in doubt, please remove header lines before pasting in your sequence.
You can see examples of sequences that will successfully pass all of the validation tests by clicking the Example buttons below the search form.
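If you want to pre-check a sequence before pasting it in, the checks listed above can be approximated locally. The sketch below is not the server's validation code. The function names are hypothetical, and it assumes that whitespace and line breaks are ignored, that lower-case symbols are acceptable, and that the caller states whether the sequence is protein or DNA.

```python
import re

MAX_PROTEIN_LENGTH = 10_000  # protein: fewer than 10,000 residues
MAX_DNA_LENGTH = 80_000      # DNA: fewer than 80,000 nucleotides
PROTEIN_PATTERN = re.compile(r"[A-Za-z*]+")  # letters or "*"; "-" is not allowed
# Assumption: lower-case nucleotide symbols are treated like upper-case ones.
DNA_PATTERN = re.compile(r"[ACGTRYKMSWBDHVN]+", re.IGNORECASE)


def clean_sequence(text: str) -> str:
    """Drop FASTA header lines, then join the rest into one string.

    Stripping whitespace is an assumption of this sketch.
    """
    lines = [line for line in text.splitlines() if not line.startswith(">")]
    return "".join("".join(lines).split())


def validate_sequence(text: str, kind: str) -> list:
    """Return a list of problems; an empty list means the checks passed."""
    if kind == "protein":
        limit, pattern = MAX_PROTEIN_LENGTH, PROTEIN_PATTERN
    elif kind == "dna":
        limit, pattern = MAX_DNA_LENGTH, DNA_PATTERN
    else:
        raise ValueError("kind must be 'protein' or 'dna'")

    sequence = clean_sequence(text)
    problems = []
    if not sequence:
        problems.append("sequence is empty")
    elif not pattern.fullmatch(sequence):
        problems.append("sequence contains characters that are not allowed")
    if len(sequence) >= limit:
        problems.append(f"sequence must be shorter than {limit} symbols")
    return problems
```

A call such as `validate_sequence("MK-V", "protein")` reports one problem, because "-" is no longer an accepted character. Passing this sketch does not guarantee that the server will accept a sequence.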
Protein search options
The form supports several search options for protein sequences. Note that these controls will be disabled if you paste in a DNA sequence and will be ignored by the server when it searches your DNA.
The default threshold for the HMM search is an E-value of 1.0, but you can also use the gathering threshold for each HMM, or you can specify your own E-value setting. Note that the E-value that you give must be positive and less than 10.0.
By default the search will only look for Pfam-A families on your sequence but, by checking the box below, you can also search for Pfam-B hits. Note that the Pfam-B search is now performed using HMMER, using automatically generated HMMs. We generate HMMs for only the 20,000 largest Pfam-B families.
DNA search parameters
The threshold for HMM searches of DNA-derived protein sequences is the gathering threshold. Note that this is different to the default for protein searches.
Batch sequence search
Upload a FASTA-format file containing multiple protein sequences to be searched for matching Pfam families. Results of the search will be returned to you at the email address that you specify. Please check the More... link below for the restrictions on uploaded sequence files.
We accept only protein sequences and your uploaded file must conform to a fairly strict interpretation of the FASTA file format. We apply the following checks to the format of uploaded sequence files. Files that do not conform to the following rules will be rejected by the server.
Files must contain only header lines and sequence lines. Header lines, which begin with ">", can be used to describe the sequence that follows. There is no fixed format for header lines but we restrict the characters that are allowed. If your header lines contain any of the following characters, your file will be rejected:
; \ ! *
Note that we explicitly include the semi-colon (;) in the list of forbidden characters, although this may be used to denote comments in some versions of the FASTA-format. Please do not use comments in the FASTA files that you upload here.
Batch searches are run using our pfam_scan.pl script, which uses programs from the HMMER suite. These treat the first "word" on the header line as the ID for a sequence.
The ID is taken to be the characters after the initial ">" and before the first whitespace character. The sequence IDs must be unique and your uploaded sequence file will be rejected if we find the same ID for multiple sequences.
Header rows must not have any whitespace between the ">" and the remainder of the header row. Header rows must also have content; files with blank header lines will not be accepted.
Your sequence should be a valid protein sequence. As such, the sequence line should contain only amino-acid symbols, i.e. capital letters excluding "J". In the context of a Pfam search, gaps and translation stops have little meaning and should not normally be used, but we do accept "-" or "*" to denote gaps and translation stops respectively. Nucleotide sequences are not considered valid and will be rejected.
Searches run on a "compute farm" with a limited number of "slots". Each search takes one slot and once all slots are in use, new jobs wait in a queue for the next slot to become free. In order to prevent large jobs occupying slots for very long periods, which can impact the availability of the system for other users, we place a number of restrictions on the size of job that we will accept.
Files must have fewer than 500,000 lines and fewer than 5,000 sequences.
Each sequence must be between 6 and 20,000 amino acids in length.
We use heuristics to check that a sequence has a reasonable level of variation, in order to prevent large strings of identical sequence or a large number of duplicate residues being searched. If you find that you cannot submit a valid sequence because of this restriction, please let us know.
If you specify an E-value cut-off for your search, that E-value must be a positive number.
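The format and size rules above can be checked locally before you upload a file. The following is a rough sketch under stated assumptions, not the server's implementation. The function name is hypothetical, blank lines are treated as invalid sequence lines, "-" and "*" are counted towards sequence length, and the sequence-variation heuristics are not reproduced because they are not described here.

```python
import re

FORBIDDEN_HEADER_CHARS = set(";\\!*")
# Capital letters excluding "J", plus "-" (gap) and "*" (translation stop).
SEQUENCE_LINE = re.compile(r"[A-IK-Z*-]+")
MAX_LINES = 500_000
MAX_SEQUENCES = 5_000
MIN_LENGTH, MAX_LENGTH = 6, 20_000


def check_batch_fasta(text: str) -> list:
    """Return a list of error messages; an empty list means no problem found."""
    errors = []
    lines = text.splitlines()
    if len(lines) >= MAX_LINES:
        errors.append("file must have fewer than 500,000 lines")

    records = []  # [sequence ID, length] pairs in file order
    for number, line in enumerate(lines, start=1):
        if line.startswith(">"):
            header = line[1:]
            if not header.strip():
                errors.append(f"line {number}: header has no content")
            elif header[0].isspace():
                errors.append(f"line {number}: whitespace after '>'")
            if FORBIDDEN_HEADER_CHARS & set(header):
                errors.append(f"line {number}: forbidden character in header")
            words = header.split()
            # The ID is the first "word" after ">".
            records.append([words[0] if words else "", 0])
        elif not records:
            errors.append(f"line {number}: sequence line before any header")
        elif not SEQUENCE_LINE.fullmatch(line):
            errors.append(f"line {number}: invalid sequence line")
        else:
            records[-1][1] += len(line)

    if len(records) >= MAX_SEQUENCES:
        errors.append("file must have fewer than 5,000 sequences")

    seen = set()
    for seq_id, length in records:
        if seq_id in seen:
            errors.append(f"duplicate sequence ID {seq_id!r}")
        seen.add(seq_id)
        if not MIN_LENGTH <= length <= MAX_LENGTH:
            errors.append(f"sequence {seq_id!r} has length {length}")
    return errors
```

A file that produces an empty list here may still be rejected by the server, for example by the sequence-variation check.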
Search for keywords within the textual data in the Pfam database. More...
The search currently covers the following sections of the database:
- text fields within Pfam entries, such as description and comments
- sequence description and species fields
- the HEADER and TITLE records from PDB files
- Gene ontology (GO) IDs and terms
- InterPro abstracts
You can perform the same search from anywhere within the Pfam site, using the keyword search box at the top right-hand side of every page.
Domain architecture search
Search for sequence architectures using the PfamAlyzer applet. More...
PfamAlyzer is designed to provide insight into the Pfam protein domain database. It integrates and extends many popular Pfam tools and provides means for the study of domain architecture evolution.
PfamAlyzer is implemented as a Java applet. It requires a Java runtime environment (JRE) of at least Java2 1.4.2. Most browsers are shipped with previous Java versions. However, there are plug-ins for various browsers and platforms available which enable you to run PfamAlyzer. You may find out which Java version you currently have installed at this page.
Why do I get a security warning when I start PfamAlyzer?
In general, Java does not grant applets access to the local computer's resources. PfamAlyzer enables the user to enter sequences by means of cut-and-paste. For this purpose, PfamAlyzer requires access to the system clipboard, which is beyond the scope of the applet's sandbox. Therefore, PfamAlyzer comes as a signed applet, which is allowed to leave the sandbox in which unsigned applets are always confined. However, the user is alerted to this fact and asked for permission. Currently, PfamAlyzer does not use a certificate issued by a trusted authority, which is what the security warning is about. We kindly ask you to grant PfamAlyzer access to the clipboard.
Search for sequence architectures with the specified composition. More...
You can specify one or more domains that must appear in an architecture and one or more domains that must not appear in the result architectures:
- choose a group of families or the first letter of the family ID from the list below
- in the left-most list, select the ID of a domain
- choose Includes or Does NOT include to add the selected domain to the appropriate list
- continue until the Includes and Does NOT include lists show all of the domains that you need
- press Submit query to search for architectures matching your specification
Note that the query itself can take some time to run, depending on how many domains you add to each constraint. If the search returns a lot of results, you may also find that your browser hangs momentarily, while the domain graphics are rendered in the page.
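Conceptually, the two lists act as a filter over architectures: every domain in the Includes list must be present and no domain in the Does NOT include list may be present. A minimal sketch of that logic follows. The domain IDs, the example architectures and the function names are all hypothetical, and the real search runs against the Pfam database rather than an in-memory list.

```python
def matches(architecture, includes, excludes=()):
    """True if the architecture has every included domain and no excluded one."""
    domains = set(architecture)
    return set(includes) <= domains and domains.isdisjoint(excludes)


def search_architectures(architectures, includes, excludes=()):
    """Return the architectures that satisfy both constraints, in input order."""
    return [a for a in architectures if matches(a, includes, excludes)]


# Hypothetical architectures, each an ordered tuple of domain IDs.
ARCHITECTURES = [
    ("DomA", "DomB"),
    ("DomA", "DomB", "DomC"),
    ("DomB", "DomB"),
    ("DomC",),
]

hits = search_architectures(ARCHITECTURES, includes=["DomA"], excludes=["DomC"])
print(hits)  # [('DomA', 'DomB')]
```

Each domain added to either list is one more condition to test against every architecture, which is consistent with queries taking longer as the constraints grow.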
This form supports two types of query: you can enter a complex boolean expression that precisely defines the species distribution of families that you want, such as "Caenorhabditis elegans AND NOT Homo sapiens", or you can check the "Find domains unique to query term" box and enter a single taxonomic descriptor, e.g. "apicomplexa" to find all Pfam-A families that are unique to that particular level. More...
In order to find species names in our database, you must give the full, unabbreviated name in your query. For example, if you want to find families in Caenorhabditis elegans, you must spell out "Caenorhabditis"; "C. elegans" does not exist in the species database and will not return any results. The capitalisation of your query does not affect the results: "Caenorhabditis" is the same as "caenorhabditis".
Because species names can be long, complex and easily misspelt, the form will offer suggestions for possible completions of terms that you enter. For example, as you start typing c-a-e-n, a drop-down panel will appear, showing the possible completions.
As you enter more letters, the suggestions will be refined accordingly.
Note that only the last term in the field is used to offer completions, so if you have already entered "caenorhabditis elegans", going back and shortening the first word to give "caen elegans" will not give any suggestions.
Boolean queries can contain logical operators, such as AND, NOT and OR, combined with parentheses ("(" and ")") to form a description of a set of families with a particular species distribution.
For example, the following query:
Caenorhabditis elegans AND NOT Homo sapiens
will retrieve all families which are found in C. elegans but not in human, whilst
Caenorhabditis elegans AND Caenorhabditis briggsae AND NOT Homo sapiens
will retrieve all families which are found in both C. elegans and C. briggsae but not in human, and
( Caenorhabditis elegans OR Caenorhabditis briggsae ) AND NOT Homo sapiens
will retrieve families that are found in either C. elegans or C. briggsae but not in human.
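One way to read these queries is as set operations on the families found in each species: AND is an intersection, OR is a union and AND NOT is a set difference. The sketch below illustrates the three example queries with made-up data. The family names and the species-to-family mapping are hypothetical and do not reflect real Pfam content.

```python
# Hypothetical mapping from species name to the families found in it.
FAMILIES_BY_SPECIES = {
    "caenorhabditis elegans": {"FamA", "FamB", "FamC"},
    "caenorhabditis briggsae": {"FamB", "FamC", "FamD"},
    "homo sapiens": {"FamC", "FamE"},
}


def families(species: str) -> set:
    """Look up a species by its full name, ignoring capitalisation."""
    return FAMILIES_BY_SPECIES[species.lower()]


elegans = families("Caenorhabditis elegans")
briggsae = families("Caenorhabditis briggsae")
human = families("Homo sapiens")

# Caenorhabditis elegans AND NOT Homo sapiens
query1 = elegans - human

# Caenorhabditis elegans AND Caenorhabditis briggsae AND NOT Homo sapiens
query2 = (elegans & briggsae) - human

# ( Caenorhabditis elegans OR Caenorhabditis briggsae ) AND NOT Homo sapiens
query3 = (elegans | briggsae) - human

print(sorted(query1))  # ['FamA', 'FamB']
print(sorted(query2))  # ['FamB']
print(sorted(query3))  # ['FamA', 'FamB', 'FamD']
```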
Families unique to a single species
By checking the box marked "Find domains unique to query term" you can limit the query to finding those domains which are found only in a single species or taxonomic level. For example, checking the box and entering "Caenorhabditis elegans" will return a list of Pfam-A domains that are present only in C. elegans, whilst "metazoa" will find families that exist only in metazoans.
Note that you can only enter a single species term when looking for unique sequences. You will see an error message if the search field contains more than one species.
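The "unique to query term" search can be pictured as follows: a family qualifies if every species in which it occurs falls under the given term. This is only an illustrative sketch. The family names and their species assignments are hypothetical, the lineages are heavily abbreviated, and the function name is not part of any Pfam interface.

```python
# Abbreviated lineages, from the broadest level down to the species name.
LINEAGE = {
    "Caenorhabditis elegans": ("Eukaryota", "Metazoa", "Nematoda",
                               "Caenorhabditis elegans"),
    "Homo sapiens": ("Eukaryota", "Metazoa", "Chordata", "Homo sapiens"),
    "Plasmodium falciparum": ("Eukaryota", "Apicomplexa",
                              "Plasmodium falciparum"),
}

# Hypothetical families and the species in which each one is found.
FAMILY_SPECIES = {
    "FamA": {"Caenorhabditis elegans"},
    "FamB": {"Caenorhabditis elegans", "Homo sapiens"},
    "FamC": {"Plasmodium falciparum"},
    "FamD": {"Homo sapiens", "Plasmodium falciparum"},
}


def unique_to(term: str) -> set:
    """Return families whose every occurrence lies within the given term."""
    term = term.lower()
    result = set()
    for family, species_set in FAMILY_SPECIES.items():
        in_term = (
            term in (level.lower() for level in LINEAGE[species])
            for species in species_set
        )
        if species_set and all(in_term):
            result.add(family)
    return result


print(sorted(unique_to("Caenorhabditis elegans")))  # ['FamA']
print(sorted(unique_to("metazoa")))                 # ['FamA', 'FamB']
print(sorted(unique_to("apicomplexa")))             # ['FamC']
```

In this toy data, FamD is never returned for "metazoa" or "apicomplexa" because it occurs in species from both groups.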