Skip to Main Content

Sequence Similarity Searching

A guide to sequence similarity searching using BLAST and other tools.

Pattern Matching Algorithms

Most sequence similarity search tools are based on foundations created by seminal pattern matching algorithms. Some of these are described here. Even though this area of bioinformatics is older, in comparison to data analysis for sequencing experiments, these tools still hold value in determining evolutionary change and laid the foundations for genome alignment tools.

Needleman-Wunsch

This algorithm finds the best global alignment between any 2 sequences.

  • CPU and time-intensive
  • Often misses domain or motif alignments in sequences, since it disfavors local alignment of highly similar regions
  • Provides the basis of NCBI's Global Align tool
  • Originally published in Needleman, S.B. and Wunsch, C.D. A general method applicable to the search for similarities in the amino acid sequence of two proteins. J Mol Biol. 48(3):443-53 (1970).

Needleman-Wunsch pairwise example

(Image from Wikipedia.org: https://en.wikipedia.org/wiki/Needleman%E2%80%93Wunsch_algorithm)

In this example of a Needleman-Wunsch matrix:

Match = +1
Mismatch = -1
Gap = -1

Scores are filled in beginning in the upper left corner at the beginning of both sequences.  When matrix is complete, the best value is found in the right-most column, and a path is tracked backwards to the beginning of the two sequences to find the best alignment.

Smith-Waterman

Smith-Waterman is an extension of Needleman-Wunsch that compares segments of all possible lengths (thus creating local alignments) between two sequences to maximize alignment

  • Very sensitive search
  • CPU and time-intensive
  • Originally published in Smith, T.F. and Waterman, M.S. Identification of common molecular subsequences. J Mol Biol. 147(1):195-7 (1981).

A typical Smith-Waterman matrix may have:

Match = + 1
MisMatch = 0.3
Gap = (1 + 0.3k*)
(* where k = number of residues included in gap)

All cell values start at zero and are not allowed to fall below zero (so a new alignment path can begin at any point).  Values in cells, like in Needleman-Wunsch, are based upon value of cell plus the highest value in row, column or direct diagonal using gap penalties.  FASTA and BLAST are based on this type of comparison matrix.

FASTA

FASTA generates local alignments. The algorithm uses lookup tables (hash tables) to increase speed. Sensitivity and speed are determined by the size of the "word" used for the initial lookup table.  FASTA builds diagonals in conjunction with the results of the lookup tables.

  • Sensitive search
  • Fast (the name reflects this: "fast alignment")
  • Published in Pearson, W.R. and Lipman, D.J. Improved tools for biological sequence comparison. Proc Natl Acad Sci U S A.85(8):2444-8 (1988).

BLAST

BLAST stands for Basic Local Alignment Search Tool. It generates local alignments.  It begins a search by indexing all character strings of a certain wordsize within the query sequence by their starting position in the query.  BLAST then scans the target database looking for matches between the "words" indexed in the query and strings found within the database sequences.  When a word match is found (two nearby words in the case of protein searches), BLAST attempts to extend both forward and backward from the match to produce an alignment. BLAST will continue this extension as long as the alignment score continues to increase or until it drops by a critical amount owing to the negative scores given by mismatches.

  • Fairly sensitive search
  • Very fast
  • Published in Altschul, S.F. et al. Basic local alignment search tool. J Mol Biol. 215(3):403-10 (1990), and Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997;25(17):3389–402. [PMC free article]