
Sequence Similarity Searching

A guide to sequence similarity searching using BLAST and other tools.

Entering Query Sequences

On the BLAST query page, you may notice a number of small question marks (?). Clicking any of these opens a line of text that explains the field and links to more detail about that section in the BLAST tutorial pages on the NCBI website.

Here is a "mystery sequence"  you can use for practice.


Copy the whole sequence, along with the line that begins with the ">", and paste it into the box on the blastn page. Try it once without changing ANY of the settings (use megablast), then try it again with blastn instead of megablast.

When entering sequences in the query area:

  • You can enter a copied sequence in FASTA format: a descriptive line of text that begins with a greater-than sign (">"), followed by the sequence itself on the lines below it.
  • You can also use just the gi or accession number for a sequence you find in the NCBI Protein or NCBI Nucleotide databases. The gi number is often consistent across all international databases, but not always, so if you find a sequence outside of NCBI's databases, enter the FASTA format in the search box on the BLAST page.
  • You can also use the Browse button to search for a FASTA formatted file on your computer to upload.
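Before pasting a sequence into the query box, it can help to confirm that it really is in FASTA format. The short Python sketch below is only an illustration, not part of BLAST: the function names and the example record are made up for this guide, and it handles a single record only.

```python
def parse_fasta_record(text):
    """Split one FASTA record into (description, sequence)."""
    # Drop blank lines and surrounding whitespace.
    lines = [line.strip() for line in text.strip().splitlines() if line.strip()]
    if not lines or not lines[0].startswith(">"):
        raise ValueError("FASTA input must start with a '>' description line")
    description = lines[0][1:].strip()
    # The sequence may be wrapped over several lines; join them.
    sequence = "".join(lines[1:]).upper()
    if not sequence:
        raise ValueError("no sequence found after the description line")
    return description, sequence


def looks_like_nucleotide(sequence):
    """Return True if every character is an IUPAC nucleotide code."""
    return set(sequence) <= set("ACGTUNRYSWKMBDHV")


# Hypothetical record, used only to show the layout of the format.
example = """>example_1 hypothetical practice sequence
ACGTACGTTAGC
GGCATTA
"""

description, sequence = parse_fasta_record(example)
print(description)
print(len(sequence), looks_like_nucleotide(sequence))
```

If the first line does not begin with ">", the function raises an error instead of guessing, which mirrors the advice above to include the ">" line when you copy a sequence.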

Choosing a Database to Search (Nucleotide BLAST)

For nucleotide BLAST, you have many genomic and nucleotide databases to choose from.

  • Choose genomic and transcript databases for mouse or human
  • Choose the non-redundant nucleotide databases (nr/nt) for a search of ALL available nucleotides in ALL species (this is the default)
  • Choose a subset of nucleotide databases such as reference sequences, expressed sequence tags, high-throughput sequence data and many more
  • You can use the Organism region (optional) to select a specific species to search.  Just start typing a species in the box and the "smart index" will suggest species to choose from.  You can also choose to exclude species using this field by checking the Exclude checkbox.

  • Use the Exclude region (optional) to exclude predicted model sequences or unfiltered environmental samples, and the Limit to field (optional) to limit to sequences from specific material

  • Use the Entrez Query region (optional) to define fields or molecule types to search using square brackets and boolean operators (click the ? and select more in the pop-up to get more information on constructing queries).
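As a rough illustration of the square-bracket and boolean syntax used in the Entrez Query box, the Python sketch below assembles a query string. The helper names are invented for this guide, and the field names shown ([Organism] and [PROP] with the value biomol_mrna) are examples that should be checked against the NCBI help pages before you rely on them.

```python
def entrez_term(value, field):
    """Format one search term as value[Field], quoting values with spaces."""
    if " " in value and not value.startswith('"'):
        value = f'"{value}"'
    return f"{value}[{field}]"


def combine(operator, *terms):
    """Join terms with a boolean operator (AND, OR or NOT) inside parentheses."""
    operator = operator.upper()
    if operator not in {"AND", "OR", "NOT"}:
        raise ValueError("operator must be AND, OR or NOT")
    return "(" + f" {operator} ".join(terms) + ")"


# Example: restrict a search to human mRNA sequences (field names are assumptions).
query = combine(
    "AND",
    entrez_term("Homo sapiens", "Organism"),
    entrez_term("biomol_mrna", "PROP"),
)
print(query)
```

The printed string is what you would paste into the Entrez Query box.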

For our example query, just choose the nucleotide collection (nr/nt).

Choosing the Type of BLAST Program (Nucleotide BLAST)

  • Megablast is good for searching for genes that have high similarity between species, and is the default for blastn
  • Discontiguous megablast is good for VERY dissimilar sequences
  • Blastn is the original nucleotide BLAST and a good place to start when you are not sure if you are going to find high or low similarity

Algorithm Settings (Nucleotide BLAST)

Click on the Algorithm parameters text link (below the BLAST button).   This will expand a menu of algorithm settings that are specific to the type of BLAST you've chosen in the section above.  Generally, the default settings have been shown to be optimal for each type of BLAST, but you may want to try adjusting these settings to see how your results will be affected.

  • Max target sequences - set at 100.  You can re-set this number for more or fewer results.
  • Short queries - checked by default to automatically adjust the parameters for short query sequences. If your sequence isn't short, you can uncheck this box, but unchecking it generally offers no benefit.
  • Expect threshold - this is the number of sequences you can expect to be matched purely by chance. The default is 10. Set this number lower for a more stringent search, or higher for a more relaxed search.
  • Word size - the number of nucleotides that are used to start a match (called "seeding").  Default is 28 for megablast, and 11 for the less stringent discontiguous megablast and blastn.
  • Match, mismatch scores - set at 2 and -3 for blastn and discontiguous megablast, and at 1 and -2 for megablast. What matters is the ratio of the match score to the mismatch penalty: a ratio of 0.5 (1,-2) is best for sequences that are about 95% conserved, so it is used for megablast, while a ratio of 0.66 (2,-3) is used for the less conserved searches in blastn and discontiguous megablast.
  • Gap costs - linear for megablast and based on the match and mismatch scores, so the cost grows as gaps are found and extended. For all other nucleotide BLASTs, the cost to open a gap (existence) is 5, and the cost to extend a gap is 2. If you are searching across a number of species, you may want to reduce the existence and extension costs so that gaps are penalized less. If you want your matches to be as similar as possible, select higher numbers for both the existence and extension costs.
  • Filters and masking - When to filter or mask? 
    • Filter genomic sequences that you suspect contain low-complexity or repetitive regions, such as SINEs, LINEs or virus-inserted repeats
    • You can also choose to filter species-specific repeats
    • Masking features can be useful for regions that may code for structural features that don't have functional significance and you want to de-emphasize these regions when searching for matches (this is especially pertinent for protein BLASTs). 
      • Masking for lookup table only will mask features when the initial lookup search is done, and thus speed up the search, but will unmask them for the final similarity score assignment
      • To mask features for lookup and scoring, choose the mask lower case letters option.  Then, in a Word document or text program, highlight the region and change its case to lower case.
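If you would rather not change the case by hand, the lower-casing step can be sketched in a few lines of Python. This is only an illustration: the function name and the example sequence are made up for this guide, and positions are assumed to be counted from 1, inclusive at both ends.

```python
def mask_region(sequence, start, end):
    """Lower-case positions start..end (1-based, inclusive); upper-case the rest."""
    if not 1 <= start <= end <= len(sequence):
        raise ValueError("start and end must lie within the sequence")
    upper = sequence.upper()
    return upper[:start - 1] + upper[start - 1:end].lower() + upper[end:]


# Hypothetical 12-base sequence with positions 5 to 8 masked.
masked = mask_region("ACGTACGTACGT", 5, 8)
print(masked)
```

Paste the result into the query box and select the mask lower case letters option so that BLAST treats the lower-case region as masked.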

Hit the BLAST Button!

It is generally useful to check the box Show results in a new window.  This way you can always come back to the BLAST setup page, change a few parameters and try the search again.
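For readers who later move from the web page to the command-line BLAST+ tools, the sketch below shows how the settings described in this guide might map onto blastn options. It only builds the command as a list of words and does not run anything. The function names and the file name mystery.fasta are hypothetical, and the numbers are the web-page values quoted in this guide; the command-line program has its own defaults, so check its help output before relying on them.

```python
# Values quoted in this guide for each type of nucleotide BLAST.
# Megablast uses linear gap costs, so no gap options are listed for it.
GUIDE_SETTINGS = {
    "megablast": {"word_size": 28, "reward": 1, "penalty": -2},
    "dc-megablast": {"word_size": 11, "reward": 2, "penalty": -3,
                     "gapopen": 5, "gapextend": 2},
    "blastn": {"word_size": 11, "reward": 2, "penalty": -3,
               "gapopen": 5, "gapextend": 2},
}


def match_mismatch_ratio(reward, penalty):
    """Ratio of match score to mismatch penalty, e.g. 1 and -2 gives 0.5."""
    return reward / abs(penalty)


def build_blastn_command(query_file, task="megablast", evalue=10,
                         max_target_seqs=100, **overrides):
    """Return a blastn command (as a list of words) for a remote search of nt."""
    if task not in GUIDE_SETTINGS:
        raise ValueError(f"unknown task: {task}")
    settings = dict(GUIDE_SETTINGS[task])
    settings.update(overrides)  # e.g. word_size=16 to change the seed length
    command = ["blastn", "-task", task, "-db", "nt", "-remote",
               "-query", query_file,
               "-evalue", str(evalue),
               "-max_target_seqs", str(max_target_seqs)]
    for name, value in settings.items():
        command += [f"-{name}", str(value)]
    return command


print(" ".join(build_blastn_command("mystery.fasta", task="blastn")))
print(match_mismatch_ratio(1, -2), round(match_mismatch_ratio(2, -3), 2))
```

Keeping the settings in one table makes it easy to repeat a search with a single parameter changed, much like returning to the BLAST setup page and adjusting one field.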