Sequence Similarity Searching

A guide to sequence similarity searching using BLAST and other tools.

Basic Properties of all BLAST Searches

BLAST begins by searching for short words of length "W" that score at least "T" when compared to a word from the query using a substitution matrix. These word hits are the seeds from which longer alignments are extended.

The scoring matrix rewards matches between the query sequence and the database sequence(s) with positive scores and penalizes mismatches between the query and target. Gaps are penalized as well, through separate gap costs.
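
The word search described above can be sketched in a few lines of Python. This is a minimal illustration, not the BLAST implementation: the three-letter alphabet, the scores in TOY_SCORES, and the function names are all assumptions chosen for the example. It enumerates every word of length W and keeps those that score at least T against a query word.

```python
from itertools import product

# Illustrative scores over a three-letter alphabet (an assumption for this
# sketch); a real search would use a full PAM or BLOSUM matrix.
ALPHABET = "ACW"
TOY_SCORES = {
    ("A", "A"): 4, ("C", "C"): 9, ("W", "W"): 11,
    ("A", "C"): 0, ("A", "W"): -3, ("C", "W"): -2,
}

def pair_score(a, b):
    """Look up the symmetric substitution score for two residues."""
    if (a, b) in TOY_SCORES:
        return TOY_SCORES[(a, b)]
    return TOY_SCORES[(b, a)]

def word_score(word_a, word_b):
    """Sum the per-position substitution scores of two equal-length words."""
    return sum(pair_score(a, b) for a, b in zip(word_a, word_b))

def neighborhood(query_word, threshold):
    """Return all words of the same length scoring >= threshold against query_word."""
    w = len(query_word)
    candidates = ("".join(p) for p in product(ALPHABET, repeat=w))
    return sorted(c for c in candidates if word_score(query_word, c) >= threshold)

print(neighborhood("ACW", 20))
```

Raising the threshold T shrinks the list of accepted words, which speeds up the search at the cost of sensitivity; lowering it does the opposite.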

Nucleotide (DNA) BLAST searches do not use specialized substitution matrices. Instead, they compare the sequences directly, applying one score to each match and another to each mismatch. The NCBI nucleotide BLAST suite does, however, offer specialized algorithms for highly similar (megablast) and somewhat similar (blastn) sequences.
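
As a rough sketch of match/mismatch scoring for nucleotides, the function below scores two aligned, equal-length sequences position by position. The function name and the default values of match and mismatch are assumptions made for illustration; they are not presented as the settings of any particular BLAST program.

```python
def ungapped_score(query, subject, match=1, mismatch=-2):
    """Score two aligned nucleotide strings of equal length, without gaps.

    The match and mismatch defaults are illustrative values only.
    """
    if len(query) != len(subject):
        raise ValueError("sequences must have the same length")
    pairs = zip(query.upper(), subject.upper())
    return sum(match if q == s else mismatch for q, s in pairs)

# Three matches and one mismatch: 3 * 1 + 1 * (-2)
print(ungapped_score("ACGT", "ACGA"))
```

A larger mismatch penalty relative to the match reward suits sequences expected to be nearly identical, while a smaller one tolerates more divergence.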

Protein BLAST searches let the user apply specialized amino acid substitution matrices (PAM and BLOSUM), which can be chosen to tolerate greater dissimilarity with less penalty. This is useful when searching for conserved domains whose sequences share a low percentage of exact matches.

Amino Acid Substitution Matrices

PAM: Percent Accepted Mutation.
A unit introduced by Dayhoff et al. to quantify the amount of evolutionary change in a protein sequence. 1.0 PAM unit is the amount of evolution that will change, on average, 1% of amino acids in a protein sequence. A PAM(x) substitution matrix is a look-up table in which scores for each amino acid substitution have been calculated based on the frequency of that substitution in closely related proteins that have a certain amount (x) of evolutionary divergence. [Taken from the NCBI Glossary - PAM definition]. 

How PAM matrices were derived:

  • Comparison of 71 groups of closely related proteins (>85% identity), yielding 1,572 changes.
  • Different PAM matrices are derived from the PAM 1 matrix by matrix multiplication.
  • The matrices are converted to log odds matrices. (The frequency of a change is divided by the probability of that alignment occurring by chance, and the ratio is converted to a base 2 logarithm.)
  • A PAM 250 matrix corresponds to 250 point changes per 100 amino acids. Because the same position can change more than once, this does not mean that every residue differs. It is similar in stringency to a BLOSUM45 matrix.
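
The matrix multiplication and log odds steps in the list above can be sketched as follows. This is a toy illustration under stated assumptions: PAM1_TOY is a hypothetical two-letter mutation probability matrix (not Dayhoff's data), BACKGROUND is an assumed set of residue frequencies, and the function names are invented for the example.

```python
import math

def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_power(m, n):
    """Raise a square matrix to the n-th power by repeated multiplication (n >= 1)."""
    result = m
    for _ in range(n - 1):
        result = mat_mul(result, m)
    return result

def log_odds(m, background):
    """Base 2 log of (probability of i changing to j) / (chance frequency of j)."""
    size = len(m)
    return [[math.log2(m[i][j] / background[j]) for j in range(size)]
            for i in range(size)]

# Hypothetical two-letter alphabet in which 1% of residues change per PAM unit.
PAM1_TOY = [[0.99, 0.01],
            [0.01, 0.99]]
BACKGROUND = [0.5, 0.5]  # assumed residue frequencies

pam250_toy = mat_power(PAM1_TOY, 250)
scores = log_odds(pam250_toy, BACKGROUND)
print(pam250_toy)
print(scores)
```

In this toy case the diagonal of the 250-step matrix stays slightly above the background frequency, so identities keep a small positive log odds score and substitutions a small negative one.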

Dayhoff, Schwartz, and Orcutt (1978) Atlas Protein Seq. Struc. 5:345-352

BLOSUM: Blocks Substitution Matrix.
A substitution matrix in which scores for each position are derived from observations of the frequencies of substitutions in blocks of local alignments in related proteins. Each matrix is tailored to a particular evolutionary distance. In the BLOSUM62 matrix, for example, the alignment from which scores were derived was created using sequences sharing no more than 62% identity. Sequences more identical than 62% are represented by a single sequence in the alignment so as to avoid over-weighting closely related family members. [Taken from the NCBI Glossary - BLOSUM definition]

How BLOSUM matrices were derived:

  • Based on comparisons of blocks of sequences derived from the Blocks database.
  • The Blocks database contains multiple alignments of ungapped segments corresponding to the most highly conserved regions of proteins. The matrices therefore emphasize local alignment rather than global alignment.
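
To make the idea of deriving scores from blocks more concrete, here is a simplified sketch of the log odds calculation described by Henikoff and Henikoff: count residue pairs in each column of an ungapped block, then compare observed pair frequencies with those expected by chance. It deliberately omits the clustering of similar sequences at an identity threshold, returns unrounded values in half-bit units, and scores only pairs that actually occur in the block. The function names and the one-column example block are assumptions for illustration.

```python
import math
from collections import Counter
from itertools import combinations

def pair_counts(block):
    """Count unordered residue pairs in every column of an ungapped block."""
    counts = Counter()
    for column in zip(*block):
        for a, b in combinations(column, 2):
            counts[tuple(sorted((a, b)))] += 1
    return counts

def blosum_like_scores(block):
    """Return unrounded log odds scores (half-bit units) for observed pairs."""
    counts = pair_counts(block)
    total = sum(counts.values())
    observed = {pair: n / total for pair, n in counts.items()}

    # Frequency of each residue within the observed pairs.
    freq = Counter()
    for (a, b), q in observed.items():
        if a == b:
            freq[a] += q
        else:
            freq[a] += q / 2
            freq[b] += q / 2

    scores = {}
    for (a, b), q in observed.items():
        # Expected frequency of the pair if residues paired at random.
        expected = freq[a] * freq[b] if a == b else 2 * freq[a] * freq[b]
        scores[(a, b)] = 2 * math.log2(q / expected)
    return scores

# One-column toy block: nine sequences with A and one with S at this position.
toy_block = ["A"] * 9 + ["S"]
print(pair_counts(toy_block))
print(blosum_like_scores(toy_block))
```

Published matrices round these values to integers and are built from many blocks at once, so a single column like this one only shows the mechanics.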

Henikoff, S.; Henikoff, J.G. (1992). Amino Acid Substitution Matrices from Protein Blocks. PNAS. 89 (22): 10915–10919. doi:10.1073/pnas.89.22.10915. PMC50453; PMID 1438297.

In the comparisons published in the reference above, BLOSUM62 performed better than the PAM matrices at detecting distant relationships between protein sequences. This is why it is usually the default matrix for most BLAST programs.