Sequence Similarity Searching

A guide to sequence similarity searching using BLAST and other tools.

Starting a Protein BLAST

You can access Protein BLAST (blastp) from NCBI's BLAST home page.

You can also launch a Protein BLAST search from a protein record in NCBI's Protein database: every record has a "BLAST" link in the menu on the right.

Here's a protein record you can use to practice on:
http://www.ncbi.nlm.nih.gov/sites/myncbi/collections/public/1BixP7hJvx64q1iYgMhDgWs/

To BLAST from this record, just click the Run BLAST link on the right side of the page.

As with nucleotide BLAST, you can also enter a FASTA sequence or an accession number from the protein database in the sequence window. When you click the BLAST link from a protein record, the accession number is loaded into the sequence window for you.

You can also query just a subrange of the protein. To do this, you need the amino acid positions of the first and last residues of the range you wish to query. The Details section of the NCBI protein record lists the ranges covered by each conserved domain type.
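To make the subrange idea concrete, here is a minimal Python sketch that reads a FASTA record and cuts out a range of residues. It assumes positions are numbered from 1 and that both ends of the range are included, which is how residue positions are normally reported in protein records. The function names (parse_fasta, subrange) and the example sequence are made up for illustration and are not part of any NCBI tool.

```python
def parse_fasta(text):
    """Return (header, sequence) for the first record in a FASTA string."""
    header = None
    chunks = []
    for line in text.strip().splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                break  # stop when a second record begins
            header = line[1:]
        elif header is not None:
            chunks.append(line)
    if header is None:
        raise ValueError("no FASTA header line found")
    return header, "".join(chunks)


def subrange(sequence, start, stop):
    """Extract residues start..stop (1-based, inclusive at both ends)."""
    if not (1 <= start <= stop <= len(sequence)):
        raise ValueError("range must satisfy 1 <= start <= stop <= length")
    return sequence[start - 1:stop]


# Made-up record, for illustration only.
example = """>example_protein hypothetical practice record
MKTAYIAKQRQISFVKSHFSRQ
LEERLGLIEVQ
"""

header, seq = parse_fasta(example)
region = subrange(seq, 5, 12)
print(header)
print(len(seq), region)
```

The extracted region could then be pasted into the sequence window as its own query, which has the same effect as entering the two positions in the subrange boxes.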

Choosing a Database to Search (protein BLAST)

For protein BLAST searches, you have fewer databases to choose from, but more algorithm and matrix choices. You can search:

  • All "non-redundant" protein sequences
    • Be warned: "non-redundant" is not literally true. The same protein is often represented many times in this database, because a protein may have many sequence submissions that differ slightly in length, carry amino acid mutations, or represent different isoforms
  • Reference proteins (RefSeq), the subset of proteins in the databases that NCBI has curated as reference sequences
  • Model organism proteins
  • UniProtKB/Swiss-Prot protein sequences - searches only Swiss-Prot sequences. This search returns a much smaller, manually reviewed set of proteins
  • Patented protein sequences
  • Protein Data Bank sequences - this search is especially useful if you want to find whether your protein sequence has high similarity with known structures
  • Metagenomic proteins
  • Transcriptome Shotgun Assembly proteins

You can choose to exclude environmental samples, models (predicted proteins whose accession numbers begin with XP), or non-redundant RefSeq proteins (accession numbers beginning with WP).

As with nucleotide BLAST, you can also limit the search to, or exclude, specific organisms, or restrict it using an Entrez (NCBI) formatted search string.

(Hint: for the sample protein linked above, try limiting to viruses in the Organism section)
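As a rough illustration of what an Entrez-formatted organism restriction can look like, the sketch below assembles a query string from NCBI Taxonomy IDs. It assumes the txid<ID>[Organism:exp] form of Entrez term and uses taxonomy ID 10239 for Viruses. The ID used for the excluded group (11676, believed to be HIV-1) is only an example and should be checked in NCBI Taxonomy before use. The helper names are hypothetical.

```python
def organism_term(taxid):
    """Entrez term matching a taxon and everything below it (assumed syntax)."""
    return f"txid{int(taxid)}[Organism:exp]"


def build_entrez_query(include, exclude=()):
    """Combine included and excluded taxonomy IDs into one Entrez string."""
    if not include:
        raise ValueError("at least one taxonomy ID to include is required")
    included = " OR ".join(organism_term(t) for t in include)
    if not exclude:
        return included
    excluded = " OR ".join(organism_term(t) for t in exclude)
    return f"({included}) NOT ({excluded})"


# 10239 = Viruses; 11676 is an illustrative ID to exclude (verify before use).
viruses_only = build_entrez_query([10239])
viruses_minus_one = build_entrez_query([10239], exclude=[11676])
print(viruses_only)
print(viruses_minus_one)
```

For the hint above, typing "viruses" into the Organism box is simpler; a string like this is mainly useful when a restriction cannot be expressed with the Organism box alone.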

Algorithm Parameters

Remember to open the Algorithm Parameters section using the link below the BLAST button. Changing these settings is optional, but it can substantially improve your results.

The general guidelines for algorithm selection described for nucleotide BLAST also apply to protein BLAST, except that the gap costs are slightly different and you can now choose a scoring matrix.

PAM or BLOSUM?

  • The lower the number following the word PAM, the more stringent the criteria for the search (low-numbered PAM matrices suit closely related sequences)
  • The lower the number following the word BLOSUM, the less stringent the criteria for the search (low-numbered BLOSUM matrices suit more divergent sequences)
  • BLOSUM62 is the default and offers a good general-purpose balance of sensitivity and specificity
  • Overall, BLOSUM matrices generally perform better at finding matches, but a high-numbered PAM matrix (such as PAM250) can be useful when searching for more weakly related sequences
  • Try the same search with different matrices and compare results
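To show how a scoring matrix and gap costs combine into an alignment score, here is a small Python sketch. It scores an already-aligned pair of sequences using a handful of entries quoted from BLOSUM62 and an affine gap penalty, where a gap of length k costs existence + k × extension. The gap costs of 11 and 1 are assumed here as the usual blastp defaults for BLOSUM62. The sequences are invented, and real BLAST does far more than this (seeding, extension, statistics).

```python
# A few BLOSUM62 entries (the matrix is symmetric); just enough for the toy
# alignments below. A real search uses the full 20 x 20 matrix.
BLOSUM62_SUBSET = {
    ("A", "A"): 4, ("G", "G"): 6, ("K", "K"): 5, ("L", "L"): 4,
    ("W", "W"): 11, ("K", "R"): 2, ("I", "L"): 2,
}


def pair_score(a, b, matrix):
    """Look up a residue pair in either order."""
    key = (a, b) if (a, b) in matrix else (b, a)
    return matrix[key]


def alignment_score(query, subject, matrix, gap_open=11, gap_extend=1):
    """Score two aligned strings of equal length; '-' marks a gap position."""
    if len(query) != len(subject):
        raise ValueError("aligned sequences must have the same length")
    score = 0
    in_gap = False
    for q, s in zip(query, subject):
        if q == "-" or s == "-":
            if not in_gap:
                score -= gap_open  # charged once per gap
            score -= gap_extend    # charged for every gap position
            in_gap = True
        else:
            score += pair_score(q, s, matrix)
            in_gap = False
    return score


ungapped = alignment_score("WKLAG", "WRIAG", BLOSUM62_SUBSET)
gapped = alignment_score("WKLA--G", "WRIAKKG", BLOSUM62_SUBSET)
print(ungapped, gapped)
```

Swapping in a different matrix changes how strongly conservative substitutions such as K/R or I/L are rewarded, which is why running the same search with different matrices can reorder or change the hits.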

PSI-BLAST and PHI-BLAST

PSI-BLAST stands for Position-Specific Iterated BLAST. It is especially useful if you suspect weak sequence similarity due to evolutionary change, but still suspect overall conservation (such as a specific protein fold or function). PSI-BLAST builds a position-specific scoring matrix (PSSM) from the matches found in your first BLAST iteration. It identifies the regions of highest correlation between your query and the database matches, then weights these regions more heavily in successive iterations of the BLAST.

  • You can apply the matrix created by PSI-BLAST to searches of different databases, or simply tick the checkboxes for the results you want to include and run the next iteration from the bottom of the search results summary.
  • Results that fail the E-value cutoff (that is, whose E-values are above the threshold) should generally be excluded from subsequent iterations, since they may just be short regions of similarity, unless they are interesting or expected for your search set
  • Including these results may be useful when attempting to find all members of a protein family, or finding the most diverse members of the family
  • PSI-BLAST is good for finding distantly related protein sequences for protein phylogenetic discovery
  • Use good judgment: think about how realistic distant matches may be
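The idea of a position-specific scoring matrix can be sketched in a few lines of Python. The example below counts residues in each column of a small, made-up, gap-free alignment, adds a pseudocount, and converts the frequencies to log-odds scores against a uniform background. This is a deliberately simplified illustration: PSI-BLAST itself uses sequence weighting, real background frequencies, and more careful pseudocounts, and the function names here are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(aligned, pseudocount=1.0):
    """Build a list of {residue: log-odds score} dicts, one per column."""
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("all aligned sequences must have the same length")
    background = 1.0 / len(AMINO_ACIDS)  # simplifying assumption: uniform
    pssm = []
    for col in range(length):
        counts = Counter(seq[col] for seq in aligned if seq[col] in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        column = {}
        for aa in AMINO_ACIDS:
            freq = (counts[aa] + pseudocount) / total
            column[aa] = math.log2(freq / background)
        pssm.append(column)
    return pssm


def score(sequence, pssm):
    """Sum the column scores for a sequence of the same length as the PSSM."""
    return sum(column[aa] for aa, column in zip(sequence, pssm))


# Made-up alignment: column 1 is fully conserved, columns 2 and 3 vary.
hits = ["GKS", "GKT", "GRS", "GKS"]
pssm = build_pssm(hits)
print(round(pssm[0]["G"], 2), round(pssm[1]["K"], 2))
print(score("GKS", pssm) > score("WWW", pssm))
```

Because the conserved column scores its residue more highly than the variable columns do, a sequence matching the conserved positions outranks an unrelated one. This is also why letting poor hits into the matrix can steer later iterations toward unrelated proteins.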

PHI-BLAST (Pattern Hit Initiated BLAST) is a variant of protein BLAST in which you supply a specific pattern that every protein match must contain. It is useful when you are looking for a region of functional importance in proteins (such as specific binding sites). The pattern syntax is from PROSITE.

An example:  [LIVMF]-G-E-x-[GAS]-[LIVM]-x(5,11)-R-[STAQ]-A-x-[LIVMA]-x-[STACV] (taken from NCBI's PHI-BLAST training materials)

This translates to: any one of the amino acids LIVMF, followed by G, followed by E, followed by any single residue, followed by any one of GAS, followed by any one of LIVM, followed by any 5 to 11 residues, followed by R, followed by one of STAQ, followed by A, followed by any single residue, followed by one of LIVMA, followed by any single residue, followed by one of STACV.
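One way to check your reading of a pattern like this is to convert it to a regular expression and try it on a sequence. The Python sketch below handles only the elements used in the example above, plus {..} exclusions; it is not a full PROSITE parser. The function name and the test sequence are made up for illustration.

```python
import re

# One pattern element: a residue, x, [allowed set] or {excluded set},
# optionally followed by a repeat count such as (3) or (5,11).
ELEMENT = re.compile(r"(\[[A-Z]+\]|\{[A-Z]+\}|[A-Zx])(?:\((\d+)(?:,(\d+))?\))?")


def prosite_to_regex(pattern):
    """Translate a simple PROSITE-style pattern into a Python regex string."""
    parts = []
    for element in pattern.strip().split("-"):
        match = ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        core, low, high = match.groups()
        if core == "x":
            core = "."                        # any single residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"    # any residue except these
        if low is not None:
            core += "{" + low + ("," + high if high else "") + "}"
        parts.append(core)
    return "".join(parts)


pattern = "[LIVMF]-G-E-x-[GAS]-[LIVM]-x(5,11)-R-[STAQ]-A-x-[LIVMA]-x-[STACV]"
regex = prosite_to_regex(pattern)
print(regex)

# Made-up sequence containing one occurrence of the pattern.
sequence = "MKLGEAGLAAAAARSAALASPP"
hit = re.search(regex, sequence)
print(hit.start() + 1, hit.group())  # 1-based start position and matched text
```

A regular expression only tells you where the pattern occurs; PHI-BLAST goes further by using each occurrence as the starting point for an alignment against the database.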

For more information on PHI-BLAST syntax (along with the above example), see the NCBI PHI-BLAST rules page.

You can also set the Filters and Masking options; these are described in this guide's Nucleotide BLAST section.

After selecting your parameters (remember: it's helpful to tick the "Show results in a new window" checkbox), hit the BLAST button!