blast

 

Function

BLAST search of query sequence(s) against sequence search set

Description

blast is an EMBOSS "wrapper" program for the program blastall from the NCBI's BLAST (Basic Local Alignment Search Tool) suite. blast has options to perform a similarity search of a sequence against a sequence databank regardless of whether the sequence and the databank are protein or nucleic acid:
blastp    compares an amino acid query sequence against a
          protein sequence databank.
blastn    compares a nucleic acid query sequence against a
          nucleic acid sequence databank.
blastx    compares the six-frame conceptual translation
          products of a nucleic acid query sequence (both
          strands) against a protein sequence databank.
tblastn   compares a protein query sequence against a
          nucleic acid sequence databank dynamically
          translated in all six reading frames (both strands).
tblastx   compares the six-frame translations of a nucleic acid
          query sequence against the six-frame translations
          of a nucleic acid sequence databank.
BLAST has been designed for speed, with a minimal sacrifice of sensitivity to distant sequence relationships. The scores assigned in a BLAST search have a well-defined statistical interpretation, making real matches easier to distinguish from random background hits. BLAST uses a heuristic algorithm which seeks local as opposed to global alignments and is therefore able to detect relationships among sequences which share only isolated regions of similarity.

To describe the algorithm briefly, BLAST compares a query sequence with a database sequence by first locating two non-overlapping sequence segments in common within a certain distance of each other, and then attempting to extend these so-called "hits" into locally optimal alignments between the sequences being compared. We provide a more detailed description below.

Sequence filtering

If the query sequence contains regions of low compositional complexity, they can give a huge number of uninteresting matches. Therefore, by default, BLAST masks such regions: for blastn it uses the DUST filter, and for the other programs it uses the SEG filter. You can switch filtering off with the parameter -noseqfilter. For proteins you can also use a coiled-coil filter (-seqcoilfilter), based on the work of Lupas et al. and written by John Kuzio, either instead of or together with SEG.

It is also possible to request "soft filtering" (-seqsoftfilter): masking is then applied only when searching for the initial "hits", while growing alignments are allowed to extend into regions of low compositional complexity.

Algorithm

In this discussion, we will use blastp, which searches for similarities between protein queries and protein databases, as a prototype for BLAST. However, the ideas are immediately applicable to comparisons involving conceptual translations of query sequences and databases, and extend to similarity searches between nucleic acid sequences as well.

Preliminaries

BLAST uses a substitution matrix (such as the BLOSUM or PAM matrices) to assign a score to the alignment of any pair of amino acids. An aggregate score for an alignment segment can be computed by summing the scores of each amino acid pair in that segment. When given two sequences to compare, the older (ungapped) version of the BLAST algorithm searched for arbitrary but equal length segments within each sequence that have a maximal aggregate score which meets or exceeds some threshold or cutoff score. BLAST looks for locally optimal alignments between the two sequences whose scores cannot be improved either by extending or trimming. Such locally optimal alignments are called "high-scoring segment pairs," or HSPs.
If we assume a simple protein model in which amino acids occur randomly at all positions and in proportion to the frequencies at which they are found within the database and query sequences, then we can compute a normalized score (expressed in units called bits) from the nominal score of an HSP. Such normalized scores allow direct statistical comparison of results regardless of the scoring system used (see "Generating Gapped Extensions" for a caveat to this). Furthermore, the normalized score can be used to compute an expect value, or E-value, which is the number of distinct HSPs having at least that normalized score expected to occur by chance. This theory has not been proved for gapped local alignments and their associated scores, but there are indications that it remains valid.
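
The aggregate score of an ungapped segment pair can be sketched in a few lines of Python. The score table below is a toy example invented for illustration (it is not taken from any BLOSUM or PAM matrix), and the function names are our own.

```python
# Toy substitution scores for illustration only; NOT real BLOSUM/PAM values.
TOY_SCORES = {
    ("A", "A"): 4, ("L", "L"): 4, ("V", "V"): 4,
    ("L", "V"): 1, ("A", "V"): 0, ("A", "L"): -1,
}


def pair_score(a, b, scores=TOY_SCORES):
    """Look up the score for aligning residues a and b (symmetric table)."""
    if (a, b) in scores:
        return scores[(a, b)]
    return scores[(b, a)]


def segment_score(seg1, seg2, scores=TOY_SCORES):
    """Sum the pair scores of two equal-length, ungapped segments."""
    if len(seg1) != len(seg2):
        raise ValueError("ungapped segments must have equal length")
    return sum(pair_score(a, b, scores) for a, b in zip(seg1, seg2))


if __name__ == "__main__":
    # A/A + L/V + V/V = 4 + 1 + 4
    print(segment_score("ALV", "AVV"))
```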

Turning Hits Into HSPs

The central idea of the BLAST algorithm is that any statistically significant alignment between two sequences is likely to contain a high-scoring pair of aligned words. A word is simply a sequence segment of specified length (usually 3 for protein sequences). BLAST begins its comparison of a query sequence to a database by scanning the database for words that score at least the threshold score T when aligned with some word within the query sequence. Any word pair satisfying this condition is called a hit. The diagonal of a hit involving words starting at positions (x, y) of the database and query sequences is defined as x-y. The distance between two hits on the same diagonal is defined as the difference between their first coordinates.
Once a hit is found, BLAST determines whether the hit lies within an alignment having an aggregate score high enough to be reported. It does this by extending the hit in both directions until the running alignment's score has dropped more than some quantity X below the maximum score yet attained. This extension step is quite costly, taking upwards of 90% of BLAST's execution time under most circumstances.
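
As an illustration of the extension step, here is a minimal Python sketch of an ungapped extension with a drop-off X. It assumes a hypothetical toy scoring scheme (+2 for identical residues, -1 otherwise) in place of a real substitution matrix, and it is not the code used by blastall.

```python
def score(a, b):
    """Hypothetical toy scoring: +2 for identical residues, -1 otherwise."""
    return 2 if a == b else -1


def extend_one_way(query, subject, qi, si, step, x_drop):
    """Extend along the diagonal from (qi, si) in direction step (+1 or -1).

    Returns (best_gain, best_length). Extension stops once the running
    score has dropped more than x_drop below the best score seen so far.
    """
    running = best = 0
    length = best_len = 0
    while 0 <= qi < len(query) and 0 <= si < len(subject):
        running += score(query[qi], subject[si])
        length += 1
        if running > best:
            best, best_len = running, length
        elif best - running > x_drop:
            break
        qi += step
        si += step
    return best, best_len


def ungapped_extension(query, subject, qi, si, word_len, x_drop):
    """Extend a word hit starting at (qi, si) in both directions.

    Returns (total_score, query_start, query_end) with query_end exclusive.
    """
    word = sum(score(query[qi + k], subject[si + k]) for k in range(word_len))
    right, right_len = extend_one_way(
        query, subject, qi + word_len, si + word_len, +1, x_drop)
    left, left_len = extend_one_way(query, subject, qi - 1, si - 1, -1, x_drop)
    return word + left + right, qi - left_len, qi + word_len + right_len


if __name__ == "__main__":
    q, s = "GGACDEFGHKK", "TTACDEFGHPP"
    total, start, end = ungapped_extension(q, s, 4, 4, word_len=3, x_drop=2)
    print(q[start:end], total)
```

The trailing mismatches are examined but not included in the reported segment, because only the extent that achieved the best score is kept.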
In order to reduce the number of extensions it has to perform, BLAST takes advantage of the fact that an interesting HSP is typically much longer than a single hit. In fact, it is likely to contain multiple hits on the same diagonal within a relatively short distance of one another. Therefore, BLAST chooses a length A and invokes an ungapped extension if and only if two non-overlapping hits are found on the same diagonal within distance A of one another (any hit that overlaps the most recent one is ignored).
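
The two-hit rule can be sketched as follows. This is a simplified illustration, not BLAST's implementation: the function name is ours, hits are assumed to arrive sorted by database position, and treating "within distance A" as "distance <= A" is an assumption about the boundary case.

```python
def two_hit_triggers(hits, window, word_len):
    """Yield the hits that trigger an ungapped extension.

    hits: iterable of (x, y) word start positions in the database and
    query sequences, sorted by x. A hit triggers an extension when the
    most recent hit on the same diagonal (x - y) does not overlap it
    and lies within `window` positions of it.
    """
    last_on_diagonal = {}
    for x, y in hits:
        diagonal = x - y
        previous = last_on_diagonal.get(diagonal)
        if previous is not None:
            distance = x - previous
            if distance < word_len:
                continue  # overlaps the most recent hit: ignored
            if distance <= window:
                yield (x, y)
        last_on_diagonal[diagonal] = x


if __name__ == "__main__":
    hits = [(10, 5), (11, 6), (14, 9), (20, 3), (60, 55)]
    print(list(two_hit_triggers(hits, window=40, word_len=3)))
```

In this example (11, 6) overlaps (10, 5) and is ignored, (14, 9) is the second non-overlapping hit on diagonal 5 and triggers an extension, and (60, 55) is too far from the previous hit on its diagonal.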

Generating Gapped Extensions

Gapped extensions allow BLAST to maintain its sensitivity while tolerating a much higher chance of missing any single moderately scoring HSP. However, gapped extensions take about 500 times longer to execute than ungapped extensions. Therefore, BLAST triggers a gapped extension for an HSP only when its score exceeds a moderate score Sg specifically chosen so that no more than about one gapped extension is invoked per 50 database sequences.
To generate the gapped local alignment, BLAST uses a standard dynamic programming algorithm for pairwise sequence alignment which traverses the cells of a path graph, the dimensions of which are the lengths of the two sequences being compared, performing a fixed amount of computation per cell. Starting from a single aligned pair of residues, called the seed, the dynamic programming proceeds both forward and backward through the path graph, considering only those cells for which the optimal local alignment score falls no more than X below the best alignment score yet found (this is a simple generalization of BLAST's method for constructing HSPs). The region of the path graph explored adapts to the alignment being produced.
The seed for the dynamic programming is the central residue pair of the length-11 segment of the HSP having the highest alignment score. If the HSP itself is shorter than 11 residues in length, its central pair of residues is chosen.
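
The seed selection rule can be sketched in Python as follows. The input is assumed to be the list of per-position substitution scores along an HSP. How ties between equally good windows are broken, and which residue pair counts as "central" in an HSP of even length, are not specified above, so the choices made here (first best window, upper middle position) are assumptions.

```python
def choose_seed(pair_scores, window=11):
    """Return the index, within the HSP, of the seed residue pair.

    pair_scores: per-position substitution scores along an ungapped HSP.
    """
    n = len(pair_scores)
    if n < window:
        return n // 2  # HSP shorter than the window: take its central pair
    current = best = sum(pair_scores[:window])
    best_start = 0
    # Slide the window along the HSP, updating the window sum incrementally.
    for start in range(1, n - window + 1):
        current += pair_scores[start + window - 1] - pair_scores[start - 1]
        if current > best:
            best, best_start = current, start
    return best_start + window // 2


if __name__ == "__main__":
    hsp_scores = [1] * 5 + [5] * 11 + [1] * 5
    print(choose_seed(hsp_scores))
```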
The resulting gapped alignment is reported only if it has an E-value low enough to be of interest. For any alignment actually reported, BLAST performs a gapped extension that records "traceback" information using a substantially larger X parameter than that employed during the search stage to increase the accuracy of the alignment.
Because BLAST produces gapped alignments only for those few database sequences likely to be related to the query, it cannot estimate the parameters necessary to compute normalized scores on the fly. Instead, BLAST must rely on estimates of these parameters generated beforehand by random simulation. For this reason, BLAST cannot use a scoring system for which no simulation has been performed and still produce accurate estimates of statistical significance.

Statistics

   To assess whether a given alignment constitutes evidence for homology, it helps to know how strong an alignment can be expected from chance alone. In this context, "chance" can mean the comparison of (i) real but non-homologous sequences ; (ii) real sequences that are shuffled to preserve compositional properties ; or (iii) sequences that are generated randomly based upon a DNA or protein sequence model. Analytic statistical results invariably use the last of these definitions of chance, while empirical results based on simulation and curve-fitting may use any of the definitions.

The statistics of global sequence comparison

   Unfortunately, under even the simplest random models and scoring systems, very little is known about the random distribution of optimal global alignment scores. Monte Carlo experiments can provide rough distributional results for some specific scoring systems and sequence compositions, but these cannot be generalized easily. Therefore, one of the few methods available for assessing the statistical significance of a particular global alignment is to generate many random sequence pairs of the appropriate length and composition, and calculate the optimal alignment score for each. While it is then possible to express the score of interest in terms of standard deviations from the mean, it is a mistake to assume that the relevant distribution is normal and convert this Z-value into a P-value; the tail behavior of global alignment scores is unknown. The most one can say reliably is that if 100 random alignments have a score inferior to that of the alignment of interest, the P-value in question is likely less than 0.01. One further pitfall to avoid is exaggerating the significance of a result found among multiple tests. When many alignments have been generated, e.g. in a database search, the significance of the best must be discounted accordingly. An alignment with P-value 0.0001 in the context of a single trial may be assigned a P-value of only 0.1 if it was selected as the best among 1000 independent trials.
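
The discounting for multiple tests in the last example can be checked with a short Python sketch. It assumes the trials are independent; the function names are ours.

```python
def best_of_n_p_value(p_single, trials):
    """P-value of the best result among `trials` independent trials."""
    return 1.0 - (1.0 - p_single) ** trials


def linear_bound(p_single, trials):
    """Simple upper bound p * n, capped at 1."""
    return min(1.0, p_single * trials)


if __name__ == "__main__":
    print(best_of_n_p_value(1e-4, 1000))  # close to 0.095
    print(linear_bound(1e-4, 1000))       # the rounded figure of 0.1
```

For small p the two values are close, which is why the simpler product p * n is usually quoted.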

The statistics of local sequence comparison

   Fortunately statistics for the scores of local alignments, unlike those of global alignments, are well understood. This is particularly true for local alignments lacking gaps, which we will consider first. Such alignments were precisely those sought by the original BLAST database search programs.
   A local alignment without gaps consists simply of a pair of equal length segments, one from each of the two sequences being compared. A modification of the Smith-Waterman or Sellers algorithms will find all segment pairs whose scores can not be improved by extension or trimming. These are called high-scoring segment pairs or HSPs.
   To analyze how high a score is likely to arise by chance, a model of random sequences is needed. For proteins, the simplest model chooses the amino acid residues in a sequence independently, with specific background probabilities for the various residues. Additionally, the expected score for aligning a random pair of amino acids is required to be negative. Were this not the case, long alignments would tend to have a high score independently of whether the segments aligned were related, and the statistical theory would break down.
   Just as the sum of a large number of independent identically distributed (i.i.d.) random variables tends to a normal distribution, the maximum of a large number of i.i.d. random variables tends to an extreme value distribution. (We will elide the many technical points required to make this statement rigorous.) In studying optimal local sequence alignments, we are essentially dealing with the latter case. In the limit of sufficiently large sequence lengths m and n, the statistics of HSP scores are characterized by two parameters, K and lambda. Most simply, the expected number of HSPs with score at least S is given by the formula

      E = K m n e^{-\lambda S}          (1)

We call this the E-value for the score S.
   This formula makes eminently intuitive sense. Doubling the length of either sequence should double the number of HSPs attaining a given score. Also, for an HSP to attain the score 2x it must attain the score x twice in a row, so one expects E to decrease exponentially with score. The parameters K and lambda can be thought of simply as natural scales for the search space size and the scoring system respectively.
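
The behaviour described above can be verified numerically with the E-value formula E = K m n e^{-lambda S}. In this Python sketch the values of K and lambda are illustrative placeholders, not parameters taken from this document.

```python
import math


def expected_hsps(score, m, n, K, lam):
    """Expected number of HSPs with score >= `score`: K*m*n*exp(-lam*score)."""
    return K * m * n * math.exp(-lam * score)


if __name__ == "__main__":
    K, lam = 0.041, 0.267  # illustrative values only
    e1 = expected_hsps(50, m=250, n=2000, K=K, lam=lam)
    e2 = expected_hsps(50, m=250, n=4000, K=K, lam=lam)
    e3 = expected_hsps(100, m=250, n=2000, K=K, lam=lam)
    print(e1, e2, e3)  # e2 is twice e1; e3 is exponentially smaller
```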

Bit scores

   Raw scores have little meaning without detailed knowledge of the scoring system used, or more simply its statistical parameters K and lambda. Unless the scoring system is understood, citing a raw score alone is like citing a distance without specifying feet, meters or light years. By normalizing a raw score using the formula

      S' = \frac{\lambda S - \ln K}{\ln 2}          (2)

one obtains a "bit score" S', which has a standard set of units. The E-value corresponding to a given bit score is simply

      E = m n 2^{-S'}          (3)

Bit scores subsume the statistical essence of the scoring system employed, so that to calculate significance one needs to know in addition only the size of the search space.
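
As a check on the normalization, the following Python sketch converts a raw score to a bit score, S' = (lambda S - ln K) / ln 2, and confirms that the E-value computed from the bit score, m n 2^{-S'}, equals K m n e^{-lambda S}. The values of K and lambda are again illustrative placeholders.

```python
import math


def bit_score(raw_score, K, lam):
    """Normalize a raw score: S' = (lam * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(K)) / math.log(2)


def e_value_from_bits(bits, m, n):
    """E-value from a bit score and the search space size: m * n * 2**(-S')."""
    return m * n * 2.0 ** (-bits)


if __name__ == "__main__":
    K, lam = 0.041, 0.267  # illustrative values only
    bits = bit_score(80, K, lam)
    print(bits, e_value_from_bits(bits, m=250, n=1_000_000))
```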

P-values

   The number of random HSPs with score >= S is described by a Poisson distribution. This means that the probability of finding exactly a HSPs with score >= S is given by

      e^{-E} \frac{E^a}{a!}          (4)

where E is the E-value of S given by equation (1) above. Specifically, the chance of finding zero HSPs with score >= S is e^{-E}, so the probability of finding at least one such HSP is

      P = 1 - e^{-E}          (5)

This is the P-value associated with the score S. For example, if one expects to find three HSPs with score >= S, the probability of finding at least one is 0.95. The BLAST programs report E-values rather than P-values because it is easier to understand the difference between, for example, E-values of 5 and 10 than P-values of 0.993 and 0.99995. However, when E < 0.01, P-values and E-values are nearly identical.
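
The figures quoted above follow directly from the Poisson model, as this short Python sketch shows (function names are ours).

```python
import math


def prob_exactly(a, e_value):
    """Poisson probability of exactly `a` HSPs when `e_value` are expected."""
    return math.exp(-e_value) * e_value ** a / math.factorial(a)


def p_value(e_value):
    """Probability of at least one HSP: 1 - exp(-E)."""
    return -math.expm1(-e_value)  # numerically accurate for small E


if __name__ == "__main__":
    for e in (0.001, 0.01, 3, 5, 10):
        print(e, p_value(e))
```

For E = 0.01 the P-value is about 0.00995, which illustrates why the two measures are nearly interchangeable for small E.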

Database searches

   The E-value of equation (1) applies to the comparison of two proteins of lengths m and n. How does one assess the significance of an alignment that arises from the comparison of a protein of length m to a database containing many different proteins, of varying lengths? One view is that all proteins in the database are a priori equally likely to be related to the query. This implies that a low E-value for an alignment involving a short database sequence should carry the same weight as a low E-value for an alignment involving a long database sequence. To calculate a "database search" E-value, one simply multiplies the pairwise-comparison E-value by the number of sequences in the database. Recent versions of the FASTA protein comparison programs take this approach.
   An alternative view is that a query is a priori more likely to be related to a long than to a short sequence, because long sequences are often composed of multiple distinct domains. If we assume the a priori chance of relatedness is proportional to sequence length, then the pairwise E-value involving a database sequence of length n should be multiplied by N/n, where N is the total length of the database in residues. Examining equation (1), this can be accomplished simply by treating the database as a single long sequence of length N. The BLAST programs take this approach to calculating database E-value. Notice that for DNA sequence comparisons, the length of database records is largely arbitrary, and therefore this is the only really tenable method for estimating statistical significance.
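
The two conventions can be compared with a small Python sketch. The pairwise E-value is computed as K m n e^{-lambda S}; the values of K and lambda and the database sizes are made-up illustrative numbers.

```python
import math


def pairwise_e(score, m, n, K, lam):
    """Pairwise-comparison E-value: K * m * n * exp(-lam * score)."""
    return K * m * n * math.exp(-lam * score)


def db_e_by_count(score, m, n, num_sequences, K, lam):
    """First view: multiply by the number of sequences in the database."""
    return pairwise_e(score, m, n, K, lam) * num_sequences


def db_e_by_length(score, m, total_length, K, lam):
    """Second view (used by BLAST): treat the database as one long sequence."""
    return pairwise_e(score, m, total_length, K, lam)


if __name__ == "__main__":
    K, lam = 0.041, 0.267            # illustrative values only
    m, n = 250, 400                  # query and database sequence lengths
    num_seqs, total = 500_000, 180_000_000
    print(db_e_by_count(60, m, n, num_seqs, K, lam))
    print(db_e_by_length(60, m, total, K, lam))
```

The second function gives the same result as multiplying the pairwise E-value by N/n, so a match to a sequence shorter than the database average receives a larger E-value under the length-based view than under the count-based view.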

The statistics of gapped alignments

   The statistics developed above have a solid theoretical foundation only for local alignments that are not permitted to have gaps. However, many computational experiments and some analytic results strongly suggest that the same theory applies as well to gapped alignments. For ungapped alignments, the statistical parameters can be calculated, using analytic formulas, from the substitution scores and the background residue frequencies of the sequences being compared. For gapped alignments, these parameters must be estimated from a large-scale comparison of "random" sequences.
   Some database search programs, such as FASTA or various implementations of the Smith-Waterman algorithm, produce optimal local alignment scores for the comparison of the query sequence to every sequence in the database. Most of these scores involve unrelated sequences, and therefore can be used to estimate lambda and K. This approach avoids the artificiality of a random sequence model by employing real sequences, with their attendant internal structure and correlations, but it must face the problem of excluding from the estimation scores from pairs of related sequences. The BLAST programs achieve much of their speed by avoiding the calculation of optimal alignment scores for all but a handful of unrelated sequences. They must therefore rely upon a pre-estimation of the parameters lambda and K for a selected set of substitution matrices and gap costs. This estimation could be done using real sequences, but has instead relied upon a random sequence model, which appears to yield fairly accurate results.

Edge effects

   The statistics described above tend to be somewhat conservative for short sequences. The theory supporting these statistics is an asymptotic one, which assumes an optimal local alignment can begin with any aligned pair of residues. However, a high-scoring alignment must have some length, and therefore can not begin near to the end of either of two sequences being compared. This "edge effect" may be corrected for by calculating an "effective length" for sequences ; the BLAST programs implement such a correction. For sequences longer than about 200 residues the edge effect correction is usually negligible.

The choice of substitution scores

   The results a local alignment program produces depend strongly upon the scores it uses. No single scoring scheme is best for all purposes, and an understanding of the basic theory of local alignment scores can improve the sensitivity of one's sequence analyses. As before, the theory is fully developed only for scores used to find ungapped local alignments, so we start with that case.
   A large number of different amino acid substitution scores, based upon a variety of rationales, have been described. However, the scores of any substitution matrix with negative expected score can be written uniquely in the form

      S_{ij} = \frac{\ln(q_{ij} / (p_i p_j))}{\lambda}          (6)

where the q_{ij}, called target frequencies, are positive numbers that sum to 1, the p_i are background frequencies for the various residues, and lambda is a positive constant. The lambda here is identical to the lambda of equation (1).
   Multiplying all the scores in a substitution matrix by a positive constant does not change their essence: an alignment that was optimal using the original scores remains optimal. Such multiplication alters the parameter lambda but not the target frequencies q_{ij}. Thus, up to a constant scaling factor, every substitution matrix is uniquely determined by its target frequencies. These frequencies have a special significance:

A given class of alignments is best distinguished from chance by the substitution matrix whose target frequencies characterize the class.

To elaborate, one may characterize a set of alignments representing homologous protein regions by the frequency with which each possible pair of residues is aligned. If valine in the first sequence and leucine in the second appear in 1% of all alignment positions, the target frequency for (valine, leucine) is 0.01. The most direct way to construct appropriate substitution matrices for local sequence comparison is to estimate target and background frequencies, and calculate the corresponding log-odds scores of formula (6). These frequencies in general can not be derived from first principles, and their estimation requires empirical input.
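
The construction of log-odds scores from target and background frequencies can be sketched in Python. The two-letter alphabet and all frequencies below are invented for illustration; lambda is set to ln 2 so that the scores come out in bits.

```python
import math

# Hypothetical two-letter alphabet with made-up frequencies.
BACKGROUND = {"A": 0.5, "B": 0.5}
TARGET = {("A", "A"): 0.4, ("B", "B"): 0.4, ("A", "B"): 0.1, ("B", "A"): 0.1}


def log_odds_matrix(target, background, lam=math.log(2)):
    """Scores s_ij = ln(q_ij / (p_i * p_j)) / lam for every residue pair."""
    return {
        (i, j): math.log(q / (background[i] * background[j])) / lam
        for (i, j), q in target.items()
    }


def expected_score(scores, background):
    """Expected score of a random residue pair; must be negative."""
    return sum(background[i] * background[j] * s for (i, j), s in scores.items())


if __name__ == "__main__":
    scores = log_odds_matrix(TARGET, BACKGROUND)
    print(scores)
    print(expected_score(scores, BACKGROUND))
```

Pairs that occur more often in the target alignments than expected by chance receive positive scores, and the others receive negative scores.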

The PAM and BLOSUM amino acid substitution matrices

   While all substitution matrices are implicitly of log-odds form, the first explicit construction using formula (6) was by Dayhoff and coworkers. From a study of observed residue replacements in closely related proteins, they constructed the PAM (for "point accepted mutation") model of molecular evolution. One "PAM" corresponds to an average change in 1% of all amino acid positions. After 100 PAMs of evolution, not every residue will have changed : some will have mutated several times, perhaps returning to their original state, and others not at all. Thus it is possible to recognize as homologous proteins separated by much more than 100 PAMs. Note that there is no general correspondence between PAM distance and evolutionary time, as different protein families evolve at different rates.
   Using the PAM model, the target frequencies and the corresponding substitution matrix may be calculated for any given evolutionary distance. When two sequences are compared, it is not generally known a priori what evolutionary distance will best characterize any similarity they may share. Closely related sequences, however, are relatively easy to find even with non-optimal matrices, so the tendency has been to use matrices tailored for fairly distant similarities. For many years, the most widely used matrix was PAM-250, because it was the only one originally published by Dayhoff.
   Dayhoff's formalism for calculating target frequencies has been criticized, and there have been several efforts to update her numbers using the vast quantities of derived protein sequence data generated since her work. These newer PAM matrices do not differ greatly from the original ones.
   An alternative approach to estimating target frequencies, and the corresponding log-odds matrices, has been advanced by Henikoff & Henikoff. They examine multiple alignments of distantly related protein regions directly, rather than extrapolate from closely related sequences. An advantage of this approach is that it cleaves closer to observation ; a disadvantage is that it yields no evolutionary model. A number of tests suggest that the "BLOSUM" matrices produced by this method generally are superior to the PAM matrices for detecting biological relationships.

DNA substitution matrices

   While we have discussed substitution matrices only in the context of protein sequence comparison, all the main issues carry over to DNA sequence comparison. One warning is that when the sequences of interest code for protein, it is almost always better to compare the protein translations than to compare the DNA sequences directly. The reason is that after only a small amount of evolutionary change, the DNA sequences, when compared using simple nucleotide substitution scores, contain less information with which to deduce homology than do the encoded protein sequences.
   Sometimes, however, one may wish to compare non-coding DNA sequences, at which point the same log-odds approach as before applies. An evolutionary model in which all nucleotides are equally common and all substitution mutations are equally likely yields different scores only for matches and mismatches. A more complex model, in which transitions are more likely than transversions, yields different "mismatch" scores for transitions and transversions. The best scores to use will depend upon whether one is seeking relatively diverged or closely related sequences.
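
For the simplest model, the match and mismatch scores can be derived with the log-odds approach in a few lines of Python. Describing the target alignments by a single number, the fraction of identical positions, is our own simplifying assumption for this sketch.

```python
import math


def dna_log_odds(identity, lam=1.0):
    """Match and mismatch scores for a uniform nucleotide model.

    identity: assumed fraction of identical positions in the alignments
    one is looking for (must lie strictly between 0.25 and 1).
    """
    if not 0.25 < identity < 1.0:
        raise ValueError("identity must lie strictly between 0.25 and 1")
    p = 0.25                               # all nucleotides equally common
    q_match = identity / 4.0               # each of the 4 identical pairs
    q_mismatch = (1.0 - identity) / 12.0   # each of the 12 mismatched pairs
    match = math.log(q_match / (p * p)) / lam
    mismatch = math.log(q_mismatch / (p * p)) / lam
    return match, mismatch


if __name__ == "__main__":
    for identity in (0.99, 0.75, 0.50):
        match, mismatch = dna_log_odds(identity)
        print(identity, round(mismatch / match, 2))
```

Under this model, the mismatch penalty grows relative to the match reward as the sequences sought become more similar, which is one way to see why the best scores depend on the degree of divergence.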

Gap scores

   Our theoretical development concerning the optimality of matrices constructed using equation (6) unfortunately is invalid as soon as gaps and associated gap scores are introduced, and no more general theory is available to take its place. However, if the gap scores employed are sufficiently large, one can expect that the optimal substitution scores for a given application will not change substantially. In practice, the same substitution scores have been applied fruitfully to local alignments both with and without gaps. Appropriate gap scores have been selected over the years by trial and error, and most alignment programs will have a default set of gap scores to go with a default set of substitution scores. If the user wishes to employ a different set of substitution scores, there is no guarantee that the same gap scores will remain appropriate. No clear theoretical guidance can be given, but "affine gap scores", with a large penalty for opening a gap and a much smaller one for extending it, have generally proved among the most effective.
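
An affine gap score is simple to compute. The Python sketch below follows the convention "gap penalty + gap length penalty * n" that this program's -gappenalty and -gaplength qualifiers describe; the function name is ours.

```python
def affine_gap_cost(length, gap_open, gap_extend):
    """Penalty for one gap spanning `length` residues."""
    if length <= 0:
        return 0
    return gap_open + gap_extend * length


if __name__ == "__main__":
    # With penalties of 11 and 1, one gap of length 3 costs less than
    # three separate gaps of length 1.
    print(affine_gap_cost(3, 11, 1), 3 * affine_gap_cost(1, 11, 1))
```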

Low complexity sequence regions

   There is one frequent case where the random models and therefore the statistics discussed here break down. As many as one fourth of all residues in protein sequences occur within regions with highly biased amino acid composition. Alignments of two regions with similarly biased composition may achieve very high scores that owe virtually nothing to residue order but are due instead to segment composition. Alignments of such "low complexity" regions have little meaning in any case: since these regions most likely arise by gene slippage, the one-to-one residue correspondence imposed by alignment is not valid. While it is worth noting that two proteins contain similar low complexity regions, they are best excluded when constructing alignments. The BLAST programs employ the SEG algorithm to filter low complexity regions from proteins before executing a database search.
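
To give a feel for what "low complexity" means, here is a deliberately simplified Python sketch that masks windows with low Shannon entropy. It is NOT the SEG algorithm; the window length and entropy threshold are hypothetical values chosen only for this example.

```python
import math
from collections import Counter


def window_entropy(segment):
    """Shannon entropy (in bits) of the residue composition of a segment."""
    n = len(segment)
    return -sum(c / n * math.log2(c / n) for c in Counter(segment).values())


def mask_low_complexity(seq, window=12, threshold=2.2):
    """Replace residues in every low-entropy window by 'X'."""
    masked = list(seq)
    for start in range(len(seq) - window + 1):
        if window_entropy(seq[start:start + window]) <= threshold:
            for k in range(start, start + window):
                masked[k] = "X"
    return "".join(masked)


if __name__ == "__main__":
    print(mask_low_complexity("QQQQQQQQQQQQ"))
    print(mask_low_complexity("ACDEFGHIKLMN"))
```

A window made of a single repeated residue has zero entropy and is masked, whereas a window of twelve different residues has the maximum entropy for its length and is left unchanged.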

Usage

Here is a sample session with blast

> blast
BLAST search of query sequence(s) against sequence search set
         1 : blastn (nuc against nuc)
         2 : blastp (prot against prot)
         3 : blastx (nuc translated against prot)
         4 : tblastn (prot against nuc translated)
         5 : tblastx (nuc translated against nuc translated)
Select type of search you want to run [2]: 
Query sequence(s): sw:papa1_carpa
         1 : standard set
         2 : user defined set
         3 : user provided BLAST databank
Select search set type [1]: 
        sw : SwissProt (highly annotated protein databank)
        up : UniProt (SwissProt + TrEMBL, EMBL ORF translations)
  uniref100 : UniRef100 (UniProt nonredundant subset)
  uniref90 : UniRef90 (UniRef100 subset with no more than 90% identity)
  uniref50 : UniRef50 (UniRef100 subset with no more than 50% identity)
      remt : REM-TrEMBL (old EMBL ORF translations not incl. in UniProt)
       pir : PIR (old general protein databank)
        gp : GenPept (GenBank ORF translations)
   refseqp : RefSeq (NCBI reference protein sequences)
       pdb : PDB (proteins with known 3D structure)
    gpcrdb : G protein coupled receptors
Standard protein search set [up]: sw
Word size [3]: 
E() value cutoff [10.0]: 
Output file [papa1_carpa.blastp]:


Command line arguments

   Standard (Mandatory) qualifiers (* if not always prompted):
   -program            menu       [2] Search type : nuc. or prot. (Values: 1
                                  (blastn (nuc against nuc)); 2 (blastp (prot
                                  against prot)); 3 (blastx (nuc translated
                                  against prot)); 4 (tblastn (prot against nuc
                                  translated)); 5 (tblastx (nuc translated
                                  against nuc translated)))
  [-seqs]              seqall     Query sequence(s)
   -dbtype             menu       [1] Search set type : public databank or
                                  databank provided by user (Values: 1
                                  (standard set); 2 (user defined set); 3
                                  (user provided BLAST databank))
*  -nucdb              menu       [emblnontags] Standard nucleic acid search
                                  set (Values: em (EMBL (general nucleic acid
                                  databank)); emblnontags (EMBL without EST
                                  and GSS); hum (EMBL humans); mus (EMBL
                                  mice); rod (EMBL other rodents); mam (EMBL
                                  other mammals); vrt (EMBL other
                                  vertebrates); inv (EMBL invertebrates); pln
                                  (EMBL plants); fun (EMBL fungi); pro (EMBL
                                  bacteria); phg (EMBL bacteriophages); vrl
                                  (EMBL other viruses); est (EMBL Expressed
                                  Sequence Tags); gss (EMBL Genome Survey
                                  Sequences); sts (EMBL Sequence Tagged
                                  Sites); htg (EMBL High Throughput Genomic);
                                  htc (EMBL High Throughput cDNA); env (EMBL
                                  environmental samples); pat (EMBL patents);
                                  tgn (EMBL transgenic); syn (EMBL synthetic);
                                  unc (EMBL unclassified); new (EMBL updates
                                  since last release); wgs (EMBL Whole Genome
                                  Shotgun); refseq (RefSeq (NCBI reference
                                  sequences)); refseqwgs (RefSeq Whole Genome
                                  Shotgun); refseqgen (RefSeq other genomic);
                                  refseqrna (RefSeq transcripts); vec
                                  (Intelligenetics vector databank); emvec
                                  (EMBL vector subset); epd (Eukaryotic
                                  Promoter Database); ligm (ImMunoGeneTics
                                  databank Igg. + TcR genes); hla
                                  (ImMunoGeneTics databank human MHC genes);
                                  pdbn (PDB (nucleic acids with known 3D
                                  structure)))
*  -protdb             menu       [up] Standard protein search set (Values: sw
                                  (SwissProt (highly annotated protein
                                  databank)); up (UniProt (SwissProt + TrEMBL,
                                  EMBL ORF translations)); uniref100
                                  (UniRef100 (UniProt nonredundant subset));
                                  uniref90 (UniRef90 (UniRef100 subset with no
                                  more than 90% identity)); uniref50
                                  (UniRef50 (UniRef100 subset with no more
                                  than 50% identity)); remt (REM-TrEMBL (old
                                  EMBL ORF translations not incl. in
                                  UniProt)); pir (PIR (old general protein
                                  databank)); gp (GenPept (GenBank ORF
                                  translations)); refseqp (RefSeq (NCBI
                                  reference protein sequences)); pdb (PDB
                                  (proteins with known 3D structure)); gpcrdb
                                  (G protein coupled receptors))
*  -userdb             seqall     User defined search set
*  -userblastdb        infile     User provided BLAST format databank (you can
                                  make one using makeblastdb)
   -wordsize           integer    [11 for blastn, 3 for other search types]
                                  Word size (7 or more for blastn, 2 or 3 for
                                  other search types)
   -expect             float      [10.0] E() value = number of databank
                                  sequences with same or higher bit score that
                                  you expect to find by chance. BLAST lists
                                  sequences with an E() value lower than the
                                  cutoff. (Number 0.000 or more)
  [-outfile]           outfile    [*.blast] Output file name

   Additional (Optional) qualifiers (* if not always prompted):
*  -strand             selection  [both] Strand to search. By default BLAST
                                  searches both strands, but for blastn and
                                  (t)blastx you can choose to search only the
                                  top or bottom strand of the databank
                                  respectively query sequence.
*  -match              integer    [1] Nucleotide match reward (Integer 0 or
                                  more)
*  -mismatch           integer    [-3] Nucleotide mismatch penalty (Integer up
                                  to 0)
*  -matrix             selection  [3] Amino acid comparison matrix
*  -gappenalty         integer    [5 for blastn, 11 for other search types]
                                  Gap penalty (Integer 0 or more)
*  -gaplength          integer    [2 for blastn, 1 for other search types] Gap
                                  length penalty. BLAST subtracts from the
                                  similarity score for each gap a penalty of
                                  type <Gap penalty> + <Gap length penalty> *
                                  n. Only certain combinations of matrix and
                                  gap penalty are allowed, see on-line manual.
                                  (Integer 0 or more)
*  -seqgencode         menu       [1] Genetic code for translating query
                                  sequence(s) (Values: 1 (Standard); 2
                                  (Vertebrate Mitochondrial); 3 (Yeast
                                  Mitochondrial); 4 (Mold, Protozoan,
                                  Coelenterate Mitochondrial and
                                  Mycoplasma/Spiroplasma); 5 (Invertebrate
                                  Mitochondrial); 6 (Ciliate, Dasycladacean
                                  and Hexamita); 9 (Echinoderm Mitochondrial);
                                  10 (Euplotid); 11 (Bacterial); 12
                                  (Alternative Yeast); 13 (Ascidian
                                  Mitochondrial); 14 (Flatworm Mitochondrial);
                                  15 (Blepharisma); 16 (Chlorophycean
                                  Mitochondrial); 21 (Trematode
                                  Mitochondrial); 22 (Scenedesmus obliquus
                                  mitochondrial); 23 (Thraustochytrium
                                  mitochondrial))
*  -dbgencode          menu       [1] Genetic code for translating databank
                                  sequences (Values: 1 (Standard); 2
                                  (Vertebrate Mitochondrial); 3 (Yeast
                                  Mitochondrial); 4 (Mold, Protozoan,
                                  Coelenterate Mitochondrial and
                                  Mycoplasma/Spiroplasma); 5 (Invertebrate
                                  Mitochondrial); 6 (Ciliate, Dasycladacean
                                  and Hexamita); 9 (Echinoderm Mitochondrial);
                                  10 (Euplotid); 11 (Bacterial); 12
                                  (Alternative Yeast); 13 (Ascidian
                                  Mitochondrial); 14 (Flatworm Mitochondrial);
                                  15 (Blepharisma); 16 (Chlorophycean
                                  Mitochondrial); 21 (Trematode
                                  Mitochondrial); 22 (Scenedesmus obliquus
                                  mitochondrial); 23 (Thraustochytrium
                                  mitochondrial))
*  -compstats          menu       [0] The E() value can be computed more
                                  accurately if the composition of the
                                  sequences being compared is taken into
                                  account. For blastp and tblastn you can
                                  choose to adjust or rescale the scoring
                                  scheme, as is done for PSI-BLAST (Values: 0
                                  (none); 1 (scale); 2 (adjust conditionally,
                                  otherwise scale); 3 (adjust))
   -format             menu       [0] Alignment format (Values: 0 (pairwise);
                                  1 (query-anchored, showing identities); 2
                                  (query-anchored, no identities); 3 (flat
                                  query-anchored, show identities); 4 (flat
                                  query-anchored, no identities); 5
                                  (query-anchored, no identities and blunt
                                  ends); 6 (flat query-anchored, no identities
                                  and blunt ends); 7 (XML Blast output); 8
                                  (tabular); 9 (tabular, with comment lines);
                                  10 (ASN.1))

   Advanced (Unprompted) qualifiers:
   -[no]gaps           toggle     [Y] Make gapped alignments (is default)
   -[no]seqfilter      boolean    [Y] Filter low complexity segments out of
                                  query sequence(s) (is default)
   -seqcoilfilter      boolean    Filter coiled coils out of query sequence(s)
   -seqsoftfilter      boolean    Use soft filtering, that is, filter only at
                                  initial hit searching, not at hit extension
   -[no]doublehit      boolean    [Y] Try to extend hit only if there is a
                                  second hit (not for blastn, is default for
                                  other search types)
   -window             integer    [not for blastn, 40 for other search types]
                                  Multiple hits window size (Integer 0 or
                                  more)
   -effdbsize          float      [0.000] Effective databank size for
                                  statistical calculations (Number 0.000 or
                                  more)
   -keep               integer    [0] Keep only n best hits from same region.
                                  Default is to show them all. If you use this
                                  option, a value of 100 is recommended.
                                  (Integer 0 or more)
   -listsize           integer    [50] Show only the n best scoring sequences
                                  that satisfy E() cutoff (Integer 0 or more)
   -align              integer    [25] Show only alignments for the n first
                                  sequences (Integer 0 or more, but not >
                                  listsize)

   Associated qualifiers:

   "-seqs" associated qualifiers
   -sbegin1            integer    Start of each sequence to be used
   -send1              integer    End of each sequence to be used
   -sreverse1          boolean    Reverse (if DNA)
   -sask1              boolean    Ask for begin/end/reverse
   -snucleotide1       boolean    Sequence is nucleotide
   -sprotein1          boolean    Sequence is protein
   -slower1            boolean    Make lower case
   -supper1            boolean    Make upper case
   -sformat1           string     Input sequence format
   -sdbname1           string     Database name
   -sid1               string     Entryname
   -ufo1               string     UFO features
   -fformat1           string     Features format
   -fopenfile1         string     Features file name

   "-userdb" associated qualifiers
   -sbegin             integer    Start of each sequence to be used
   -send               integer    End of each sequence to be used
   -sreverse           boolean    Reverse (if DNA)
   -sask               boolean    Ask for begin/end/reverse
   -snucleotide        boolean    Sequence is nucleotide
   -sprotein           boolean    Sequence is protein
   -slower             boolean    Make lower case
   -supper             boolean    Make upper case
   -sformat            string     Input sequence format
   -sdbname            string     Database name
   -sid                string     Entryname
   -ufo                string     UFO features
   -fformat            string     Features format
   -fopenfile          string     Features file name

   "-outfile" associated qualifiers
   -odirectory2        string     Output directory

   General qualifiers:
   -auto               boolean    Turn off prompts
   -stdout             boolean    Write standard output
   -filter             boolean    Read standard input, write standard output
   -options            boolean    Prompt for standard and additional values
   -debug              boolean    Write debug output to program.dbg
   -verbose            boolean    Report some/full command line options
   -help               boolean    Report command line options. More
                                  information on associated and general
                                  qualifiers can be found with -help -verbose
   -warning            boolean    Report warnings
   -error              boolean    Report errors
   -fatal              boolean    Report fatal errors
   -die                boolean    Report dying program messages

Standard (Mandatory) qualifiers
-program: Search type : nuc. or prot. Allowed values:
1 (blastn (nuc against nuc))
2 (blastp (prot against prot))
3 (blastx (nuc translated against prot))
4 (tblastn (prot against nuc translated))
5 (tblastx (nuc translated against nuc translated))
Default: 2
[-seqs] (Parameter 1): Query sequence(s). Allowed values: Readable sequence(s). Default: Required
-dbtype: Search set type : public databank or databank provided by user. Allowed values:
1 (standard set)
2 (user defined set)
3 (user provided BLAST databank)
Default: 1
-nucdb: Standard nucleic acid search set. Allowed values:
em (EMBL (general nucleic acid databank))
emblnontags (EMBL without EST and GSS)
hum (EMBL humans)
mus (EMBL mice)
rod (EMBL other rodents)
mam (EMBL other mammals)
vrt (EMBL other vertebrates)
inv (EMBL invertebrates)
pln (EMBL plants)
fun (EMBL fungi)
pro (EMBL bacteria)
phg (EMBL bacteriophages)
vrl (EMBL other viruses)
est (EMBL Expressed Sequence Tags)
gss (EMBL Genome Survey Sequences)
sts (EMBL Sequence Tagged Sites)
htg (EMBL High Throughput Genomic)
htc (EMBL High Throughput cDNA)
env (EMBL environmental samples)
pat (EMBL patents)
tgn (EMBL transgenic)
syn (EMBL synthetic)
unc (EMBL unclassified)
new (EMBL updates since last release)
wgs (EMBL Whole Genome Shotgun)
refseq (RefSeq (NCBI reference sequences))
refseqwgs (RefSeq Whole Genome Shotgun)
refseqgen (RefSeq other genomic)
refseqrna (RefSeq transcripts)
vec (Intelligenetics vector databank)
emvec (EMBL vector subset)
epd (Eukaryotic Promoter Database)
ligm (ImMunoGeneTics databank Igg. + TcR genes)
hla (ImMunoGeneTics databank human MHC genes)
pdbn (PDB (nucleic acids with known 3D structure))
Default: emblnontags
-protdb: Standard protein search set. Allowed values:
sw (SwissProt (highly annotated protein databank))
up (UniProt (SwissProt + TrEMBL, EMBL ORF translations))
uniref100 (UniRef100 (UniProt nonredundant subset))
uniref90 (UniRef90 (UniRef100 subset with no more than 90% identity))
uniref50 (UniRef50 (UniRef100 subset with no more than 50% identity))
remt (REM-TrEMBL (old EMBL ORF translations not incl. in UniProt))
pir (PIR (old general protein databank))
gp (GenPept (GenBank ORF translations))
refseqp (RefSeq (NCBI reference protein sequences))
pdb (PDB (proteins with known 3D structure))
gpcrdb (G protein coupled receptors)
Default: up
-userdb: User defined search set. Allowed values: Readable sequence(s). Default: Required
-userblastdb: User provided BLAST format databank (you can make one using makeblastdb). Allowed values: Input file. Default: Required
-wordsize: Word size. Allowed values: 7 or more for blastn, 2 or 3 for other search types. Default: 11 for blastn, 3 for other search types
-expect: E() value = number of databank sequences with same or higher bit score that you expect to find by chance. BLAST lists sequences with an E() value lower than the cutoff. Allowed values: Number 0.000 or more. Default: 10.0
[-outfile] (Parameter 2): Output file name. Allowed values: Output file. Default: <sequence>.<program>

Additional (Optional) qualifiers
-strand: Strand to search. By default BLAST searches both strands, but for blastn and (t)blastx you can choose to search only the top or bottom strand of the databank respectively query sequence. Allowed values: Choose from selection list of values. Default: both
-match: Nucleotide match reward. Allowed values: Integer 0 or more. Default: 1
-mismatch: Nucleotide mismatch penalty. Allowed values: Integer up to 0. Default: -3
-matrix: Amino acid comparison matrix. Allowed values: Choose from selection list of values. Default: BLOSUM62
-gappenalty: Gap penalty. Allowed values: Integer 0 or more. Default: 5 for blastn, 11 for other search types
-gaplength: Gap length penalty. BLAST subtracts from the similarity score for each gap a penalty of type <Gap penalty> + <Gap length penalty> * n. Only certain combinations of matrix and gap penalty are allowed, see on-line manual. Allowed values: Integer 0 or more. Default: 2 for blastn, 1 for other search types
-seqgencode: Genetic code for translating query sequence(s). Allowed values:
1 (Standard)
2 (Vertebrate Mitochondrial)
3 (Yeast Mitochondrial)
4 (Mold, Protozoan, Coelenterate Mitochondrial and Mycoplasma/Spiroplasma)
5 (Invertebrate Mitochondrial)
6 (Ciliate, Dasycladacean and Hexamita)
9 (Echinoderm Mitochondrial)
10 (Euplotid)
11 (Bacterial)
12 (Alternative Yeast)
13 (Ascidian Mitochondrial)
14 (Flatworm Mitochondrial)
15 (Blepharisma)
16 (Chlorophycean Mitochondrial)
21 (Trematode Mitochondrial)
22 (Scenedesmus obliquus mitochondrial)
23 (Thraustochytrium mitochondrial)
Default: 1
-dbgencode: Genetic code for translating databank sequences. Allowed values:
1 (Standard)
2 (Vertebrate Mitochondrial)
3 (Yeast Mitochondrial)
4 (Mold, Protozoan, Coelenterate Mitochondrial and Mycoplasma/Spiroplasma)
5 (Invertebrate Mitochondrial)
6 (Ciliate, Dasycladacean and Hexamita)
9 (Echinoderm Mitochondrial)
10 (Euplotid)
11 (Bacterial)
12 (Alternative Yeast)
13 (Ascidian Mitochondrial)
14 (Flatworm Mitochondrial)
15 (Blepharisma)
16 (Chlorophycean Mitochondrial)
21 (Trematode Mitochondrial)
22 (Scenedesmus obliquus mitochondrial)
23 (Thraustochytrium mitochondrial)
Default: 1
-compstats: The E() value can be computed more accurately if the composition of the sequences being compared is taken into account. For blastp and tblastn you can choose to adjust or rescale the scoring scheme, as is done for PSI-BLAST. Allowed values:
0 (none)
1 (scale)
2 (adjust conditionally, otherwise scale)
3 (adjust)
Default: 0
-format: Alignment format. Allowed values:
0 (pairwise)
1 (query-anchored, showing identities)
2 (query-anchored, no identities)
3 (flat query-anchored, show identities)
4 (flat query-anchored, no identities)
5 (query-anchored, no identities and blunt ends)
6 (flat query-anchored, no identities and blunt ends)
7 (XML Blast output)
8 (tabular)
9 (tabular, with comment lines)
10 (ASN.1)
Default: 0

Advanced (Unprompted) qualifiers
-[no]gaps: Make gapped alignments (is default). Allowed values: Toggle value Yes/No. Default: Yes
-[no]seqfilter: Filter low complexity segments out of query sequence(s) (is default). Allowed values: Boolean value Yes/No. Default: Yes
-seqcoilfilter: Filter coiled coils out of query sequence(s). Allowed values: Boolean value Yes/No. Default: No
-seqsoftfilter: Use soft filtering, that is, filter only at initial hit searching, not at hit extension. Allowed values: Boolean value Yes/No. Default: No
-[no]doublehit: Try to extend hit only if there is a second hit (not for blastn, is default for other search types). Allowed values: Boolean value Yes/No. Default: Yes
-window: Multiple hits window size. Allowed values: Integer 0 or more. Default: not for blastn, 40 for other search types
-effdbsize: Effective databank size for statistical calculations. Allowed values: Number 0.000 or more. Default: 0.000
-keep: Keep only n best hits from same region. Default is to show them all. If you use this option, a value of 100 is recommended. Allowed values: Integer 0 or more. Default: 0
-listsize: Show only the n best scoring sequences that satisfy E() cutoff. Allowed values: Integer 0 or more. Default: 50
-align: Show only alignments for the n first sequences. Allowed values: Integer 0 or more, but not > listsize. Default: 25
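
As a small worked example of the gap cost described for -gaplength, the sketch below evaluates <Gap penalty> + <Gap length penalty> * n. It takes n to be the number of positions in the gap, which is an assumption, since the qualifier description does not define n. The function name is illustrative and not part of blast.

```python
def gap_cost(gap_penalty: int, gap_length_penalty: int, n: int) -> int:
    """Penalty subtracted from the similarity score for one gap of n positions.

    Assumes n is the gap length; the qualifier text only gives the formula.
    """
    return gap_penalty + gap_length_penalty * n


# Protein defaults (gap penalty 11, gap length penalty 1).
for n in (1, 2, 4):
    print(f"gap of {n} position(s): penalty {gap_cost(11, 1, n)}")

# blastn defaults (gap penalty 5, gap length penalty 2), one gap of 3 positions.
print(f"blastn, gap of 3 positions: penalty {gap_cost(5, 2, 3)}")
```

With the protein defaults, longer gaps cost only one extra point per position, so the gap penalty itself dominates the cost of short gaps.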

Input file format

blast searches a query sequence against a search sequence set. For the query sequence you can provide any normal sequence USA. You can submit several query sequences at the same time. In this case the "wrapper" blast will launch blastall for each individual sequence and make sure the output files have different names. Do note that the results of the individual searches are in no way merged.

You can select your search set in three different ways :

  1. a standard set is a copy of a public databank or a local databank installed by the managers of the server computer and "visible" to all users. You must choose from a selector.
  2. a user defined set is a set of sequences (public and/or private) that you can specify using a normal sequence USA. It is convenient to use a List File. The set will be transformed on-the-fly into a temporary BLAST format databank.
  3. a user provided BLAST databank is a databank in BLAST format (usually private). A BLAST format databank consists of three binary files (for proteins <basename>.phr <basename>.pin <basename>.psq, for nucleic acids <basename>.nhr <basename>.nin <basename>.nsq ; huge databanks can consist of several compound databanks enumerated in a "Library File", <basename>.pal for proteins or <basename>.nal for nucleic acids). You can specify the databank by the name of any of its files (or possibly by the basename, but only if a file with that name exists).
Note that if you want to search a selected set of sequences taken from the public databanks and/or a set of your own private sequences, you can choose between options 2 and 3. If your set is small or if you want to search it just once, "a user defined set" is a good choice; otherwise, to save time, it is recommended to make a databank in BLAST format using the program makeblastdb and to choose "a user provided BLAST databank".
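
The three ways of selecting a search set can be sketched as command lines. The helper below only assembles an argument list from the qualifiers documented on this page (-seqs, -program, -dbtype, -nucdb, -protdb, -userdb, -userblastdb, -outfile, -auto); it does not run anything. The function name, the example file names and the choice to pass menu values as numbers are assumptions made for illustration.

```python
# Menu numbers for -program as listed in the qualifier table.
PROGRAM_CODES = {"blastn": 1, "blastp": 2, "blastx": 3, "tblastn": 4, "tblastx": 5}
# Search types whose search set is a nucleic acid databank.
NUCLEIC_SET_PROGRAMS = {"blastn", "tblastn", "tblastx"}


def build_blast_args(query, program, dbtype, search_set, outfile=None):
    """Assemble an argument list for the blast wrapper (illustrative only)."""
    if program not in PROGRAM_CODES:
        raise ValueError(f"unknown search type: {program}")
    if dbtype == 1:
        # Standard set: pick the selector that matches the search set type.
        set_qualifier = "-nucdb" if program in NUCLEIC_SET_PROGRAMS else "-protdb"
    elif dbtype == 2:
        # User defined set: any sequence USA, converted on the fly.
        set_qualifier = "-userdb"
    elif dbtype == 3:
        # User provided databank already in BLAST format.
        set_qualifier = "-userblastdb"
    else:
        raise ValueError("dbtype must be 1, 2 or 3")
    args = ["blast", "-seqs", query,
            "-program", str(PROGRAM_CODES[program]),
            "-dbtype", str(dbtype),
            set_qualifier, search_set]
    if outfile is not None:
        args += ["-outfile", outfile]
    args.append("-auto")  # general qualifier: turn off prompts
    return args


# Hypothetical inputs: the query USA and file names are placeholders.
print(build_blast_args("sw:papa1_carpa", "blastp", 1, "sw"))
print(build_blast_args("sw:papa1_carpa", "blastp", 2, "myproteins.fasta"))
print(build_blast_args("sw:papa1_carpa", "blastp", 3, "mydb.pin", "papain.blastp"))
```

The third call points at one of the databank files rather than the bare basename, which is always accepted according to the description above.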

Output file format

Output files for usage example

File: papa1_carpa.blastp

BLASTP 2.2.18 [Mar-02-2008]


Reference: Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schaffer, 
Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997), 
"Gapped BLAST and PSI-BLAST: a new generation of protein database search
programs",  Nucleic Acids Res. 25:3389-3402.

Query= PAPA1_CARPA P00784 Papain precursor (EC 3.4.22.2) (Papaya
proteinase I) (PPI) (Allergen Car p 1).
         (345 letters)

Database: SwissProt (manually annotated part of UniProt, including
splice variants) 
           387,470 sequences; 145,667,171 total letters



                                                                 Score    E
Sequences producing significant alignments:                      (bits) Value

sw:PAPA1_CARPA Papain precursor (EC 3.4.22.2) (Papaya proteinase...   721   0.0  
sw:PAPA3_CARPA Caricain precursor (EC 3.4.22.30) (Papaya protein...   514   e-145
sw:PAPA4_CARPA Papaya proteinase 4 precursor (EC 3.4.22.25) (Pap...   487   e-137
sw:PAPA2_CARPA Chymopapain precursor (EC 3.4.22.6) (Papaya prote...   451   e-126
sw:XCP1_ARATH Xylem cysteine proteinase 1 precursor (EC 3.4.22.-...   333   1e-90
sw:XCP2_ARATH Xylem cysteine proteinase 2 precursor (EC 3.4.22.-...   322   2e-87
sw:RD21A_ARATH Cysteine proteinase RD21a precursor (EC 3.4.22.-)...   280   1e-74

  [Part of this file has been deleted for brevity]

sw:CATL1_RAT Cathepsin L1 precursor (EC 3.4.22.15) (Major excret...   192   2e-48
sw:BROM2_ANACO Stem bromelain (EC 3.4.22.32).                         192   3e-48
sw:CATL_SARPE Cathepsin L precursor (EC 3.4.22.15) [Contains: Ca...   191   4e-48
sw:CATK_HUMAN Cathepsin K precursor (EC 3.4.22.38) (Cathepsin O)...   191   5e-48
sw:CATH_RAT Cathepsin H precursor (EC 3.4.22.16) (Cathepsin B3) ...   191   5e-48
sw:CATK_RABIT Cathepsin K precursor (EC 3.4.22.38) (OC-2 protein).    189   2e-47
sw:CYSP2_HOMAM Digestive cysteine proteinase 2 precursor (EC 3.4...   189   2e-47

>sw:PAPA1_CARPA Papain precursor (EC 3.4.22.2) (Papaya proteinase I)
           (PPI) (Allergen Car p 1).
          Length = 345

 Score =  721 bits (1861), Expect = 0.0
 Identities = 345/345 (100%), Positives = 345/345 (100%)

Query: 1   MAMIPSISKLLFVAICLFVYMGLSFGDFSIVGYSQNDLTSTERLIQLFESWMLKHNKIYK 60
           MAMIPSISKLLFVAICLFVYMGLSFGDFSIVGYSQNDLTSTERLIQLFESWMLKHNKIYK
Sbjct: 1   MAMIPSISKLLFVAICLFVYMGLSFGDFSIVGYSQNDLTSTERLIQLFESWMLKHNKIYK 60

Query: 61  NIDEKIYRFEIFKDNLKYIDETNKKNNSYWLGLNVFADMSNDEFKEKYTGSIAGNYTTTE 120
           NIDEKIYRFEIFKDNLKYIDETNKKNNSYWLGLNVFADMSNDEFKEKYTGSIAGNYTTTE
Sbjct: 61  NIDEKIYRFEIFKDNLKYIDETNKKNNSYWLGLNVFADMSNDEFKEKYTGSIAGNYTTTE 120

Query: 121 LSYEEVLNDGDVNIPEYVDWRQKGAVTPVKNQGSCGSCWAFSAVVTIEGIIKIRTGNLNE 180
           LSYEEVLNDGDVNIPEYVDWRQKGAVTPVKNQGSCGSCWAFSAVVTIEGIIKIRTGNLNE
Sbjct: 121 LSYEEVLNDGDVNIPEYVDWRQKGAVTPVKNQGSCGSCWAFSAVVTIEGIIKIRTGNLNE 180

Query: 181 YSEQELLDCDRRSYGCNGGYPWSALQLVAQYGIHYRNTYPYEGVQRYCRSREKGPYAAKT 240
           YSEQELLDCDRRSYGCNGGYPWSALQLVAQYGIHYRNTYPYEGVQRYCRSREKGPYAAKT
Sbjct: 181 YSEQELLDCDRRSYGCNGGYPWSALQLVAQYGIHYRNTYPYEGVQRYCRSREKGPYAAKT 240

Query: 241 DGVRQVQPYNEGALLYSIANQPVSVVLEAAGKDFQLYRGGIFVGPCGNKVDHAVAAVGYG 300
           DGVRQVQPYNEGALLYSIANQPVSVVLEAAGKDFQLYRGGIFVGPCGNKVDHAVAAVGYG
Sbjct: 241 DGVRQVQPYNEGALLYSIANQPVSVVLEAAGKDFQLYRGGIFVGPCGNKVDHAVAAVGYG 300

Query: 301 PNYILIKNSWGTGWGENGYIRIKRGTGNSYGVCGLYTSSFYPVKN 345
           PNYILIKNSWGTGWGENGYIRIKRGTGNSYGVCGLYTSSFYPVKN
Sbjct: 301 PNYILIKNSWGTGWGENGYIRIKRGTGNSYGVCGLYTSSFYPVKN 345


>sw:PAPA3_CARPA Caricain precursor (EC 3.4.22.30) (Papaya proteinase
           omega) (Papaya proteinase III) (PPIII) (Papaya peptidase
           A).
          Length = 348

 Score =  514 bits (1325), Expect = e-145
 Identities = 253/350 (72%), Positives = 286/350 (81%), Gaps = 7/350 (2%)

Query: 1   MAMIPSISKLLFVAICLFVYMGLSFGDFSIVGYSQNDLTSTERLIQLFESWMLKHNKIYK 60
           MAMIPSISKLLFVAICLFV+M +SFGDFSIVGYSQ+DLTSTERLIQLF SWML HNK Y+
Sbjct: 1   MAMIPSISKLLFVAICLFVHMSVSFGDFSIVGYSQDDLTSTERLIQLFNSWMLNHNKFYE 60

Query: 61  NIDEKIYRFEIFKDNLKYIDETNKKNNSYWLGLNVFADMSNDEFKEKYTGSIAGNYTTTE 120
           N+DEK+YRFEIFKDNL YIDETNKKNNSYWLGLN FAD+SNDEF EKY GS+     T E
Sbjct: 61  NVDEKLYRFEIFKDNLNYIDETNKKNNSYWLGLNEFADLSNDEFNEKYVGSLID--ATIE 118

Query: 121 LSY-EEVLNDGDVNIPEYVDWRQKGAVTPVKNQGSCGSCWAFSAVVTIEGIIKIRTGNLN 179
            SY EE +N+  VN+PE VDWR+KGAVTPV++QGSCGSCWAFSAV T+EGI KIRTG L 
Sbjct: 119 QSYDEEFINEDTVNLPENVDWRKKGAVTPVRHQGSCGSCWAFSAVATVEGINKIRTGKLV 178

Query: 180 EYSEQELLDCDRRSYGCNGGYPWSALQLVAQYGIHYRNTYPYEGVQRYCRSREKGPYAAK 239
           E SEQEL+DC+RRS+GC GGYP  AL+ VA+ GIH R+ YPY+  Q  CR+++ G    K
Sbjct: 179 ELSEQELVDCERRSHGCKGGYPPYALEYVAKNGIHLRSKYPYKAKQGTCRAKQVGGPIVK 238

Query: 240 TDGVRQVQPYNEGALLYSIANQPVSVVLEAAGKDFQLYRGGIFVGPCGNKVDHAVAAVGY 299
           T GV +VQP NEG LL +IA QPVSVV+E+ G+ FQLY+GGIF GPCG KVDHAV AVGY
Sbjct: 239 TSGVGRVQPNNEGNLLNAIAKQPVSVVVESKGRPFQLYKGGIFEGPCGTKVDHAVTAVGY 298

Query: 300 GPN----YILIKNSWGTGWGENGYIRIKRGTGNSYGVCGLYTSSFYPVKN 345
           G +    YILIKNSWGT WGE GYIRIKR  GNS GVCGLY SS+YP KN
Sbjct: 299 GKSGGKGYILIKNSWGTAWGEKGYIRIKRAPGNSPGVCGLYKSSYYPTKN 348


  [Part of this file has been deleted for brevity]


>sw:CYSP1_ORYSJ Cysteine protease 1 precursor (EC 3.4.22.-) (OsCP1).
          Length = 490

 Score =  236 bits (603), Expect = 1e-61
 Identities = 131/317 (41%), Positives = 180/317 (56%), Gaps = 23/317 (7%)

Query: 48  FESWMLKHNKIYKN------IDEKIYRFEIFKDNLKYIDETNKK---NNSYWLGLNVFAD 98
           ++ W+ +H +          I E   RF +F DNLK++D  N +      + LG+N FAD
Sbjct: 62  YDLWLARHRRGGGGGSRNGFIGEHERRFRVFWDNLKFVDAHNARADERGGFRLGMNRFAD 121

Query: 99  MSNDEFKEKYTGSI-AGNYTTTELSYEEVLNDGDVNIPEYVDWRQKGAVT-PVKNQGSCG 156
           ++N EF+  Y G+  AG       +Y    +DG   +P+ VDWR KGAV  PVKNQG CG
Sbjct: 122 LTNGEFRATYLGTTPAGRGRRVGEAYR---HDGVEALPDSVDWRDKGAVVAPVKNQGQCG 178

Query: 157 SCWAFSAVVTIEGIIKIRTGNLNEYSEQELLDCDR--RSYGCNGGYPWSALQLVAQYG-I 213
           SCWAFSAV  +EGI KI TG L   SEQEL++C R  ++ GCNGG    A   +A+ G +
Sbjct: 179 SCWAFSAVAAVEGINKIVTGELVSLSEQELVECARNGQNSGCNGGIMDDAFAFIARNGGL 238

Query: 214 HYRNTYPYEGVQRYCRSREKGPYAAKTDGVRQVQPYNEGALLYSIANQPVSVVLEAAGKD 273
                YPY  +   C   ++       DG   V   +E +L  ++A+QPVSV ++A G++
Sbjct: 239 DTEEDYPYTAMDGKCNLAKRSRKVVSIDGFEDVPENDELSLQKAVAHQPVSVAIDAGGRE 298

Query: 274 FQLYRGGIFVGPCGNKVDHAVAAVGYGPN------YILIKNSWGTGWGENGYIRIKRGTG 327
           FQLY  G+F G CG  +DH V AVGYG +      Y  ++NSWG  WGENGYIR++R   
Sbjct: 299 FQLYDSGVFTGRCGTNLDHGVVAVGYGTDAATGAAYWTVRNSWGPDWGENGYIRMERNVT 358

Query: 328 NSYGVCGLYTSSFYPVK 344
              G CG+   + YP+K
Sbjct: 359 ARTGKCGIAMMASYPIK 375


  Database: SwissProt (manually annotated part of UniProt, including
  splice variants)
    Posted date:  Apr 23, 2008  4:36 PM
  Number of letters in database: 145,667,171
  Number of sequences in database:  387,470
  
Lambda     K      H
   0.318    0.138    0.428 

Gapped
Lambda     K      H
   0.267   0.0410    0.140 


Matrix: BLOSUM62
Gap Penalties: Existence: 11, Extension: 1
Number of Sequences: 387470
Number of Hits to DB: 107,330,994
Number of extensions: 4951578
Number of successful extensions: 11493
Number of sequences better than 10.0: 226
Number of HSP's gapped: 11018
Number of HSP's successfully gapped: 243
Length of query: 345
Length of database: 145,667,171
Length adjustment: 117
Effective length of query: 228
Effective length of database: 100,333,181
Effective search space: 22875965268
Effective search space used: 22875965268
Neighboring words threshold: 11
Window for multiple hits: 40
X1: 16 ( 7.3 bits)
X2: 38 (14.6 bits)
X3: 64 (24.7 bits)
S1: 41 (21.7 bits)
S2: 69 (31.2 bits)
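
The figures in the statistics block at the end of the report can be checked by hand. The sketch below recomputes the effective lengths and the effective search space from the reported values, then applies the standard Karlin-Altschul relations (bit score = (lambda * raw score - ln K) / ln 2, E = search space * 2^(-bit score)). These relations are not spelled out on this page, so treat the code as a consistency check under that assumption; the reported lambda, K and bit scores are rounded, so small deviations are expected.

```python
import math

# Values copied from the statistics block of the example report.
query_length = 345
db_letters = 145_667_171
db_sequences = 387_470
length_adjustment = 117

# The length adjustment is subtracted from the query and from every databank sequence.
effective_query = query_length - length_adjustment
effective_db = db_letters - db_sequences * length_adjustment
search_space = effective_query * effective_db


def bit_score(raw_score, lam, k):
    """Convert a raw score to bits with the given lambda and K."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def expect(bits, space):
    """Expected number of chance hits with at least this bit score."""
    return space * 2.0 ** (-bits)


print(effective_query, effective_db, search_space)
print(round(bit_score(41, 0.318, 0.138), 1))    # S1, ungapped lambda and K
print(round(bit_score(69, 0.267, 0.0410), 1))   # S2, gapped lambda and K
print(f"{expect(333, search_space):.0e}")       # compare with the XCP1_ARATH hit
```

The recomputed effective lengths and search space agree exactly with the report, and the S1 and S2 conversions reproduce the bit values printed next to them.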

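For scripted use, the one-line summaries under "Sequences producing significant alignments" are easy to pick apart. The sketch below is a minimal parser for lines of that shape, assuming the layout shown in the example report (identifier, description, bit score, E value) and that abbreviated values such as e-145 mean 1e-145. The function name is hypothetical, and the parser should only be applied to the hit list section of a pairwise report.

```python
import re

# identifier, free-text description, integer bit score, E value
HIT_LINE = re.compile(r"^(\S+)\s+(.*?)\s+(\d+)\s+(\S+)\s*$")


def parse_hit_line(line):
    """Return (identifier, description, bits, evalue) or None if the line is not a hit."""
    match = HIT_LINE.match(line)
    if match is None:
        return None
    identifier, description, bits, evalue = match.groups()
    if evalue.startswith("e"):
        # The report abbreviates 1e-145 as e-145.
        evalue = "1" + evalue
    try:
        return identifier, description, int(bits), float(evalue)
    except ValueError:
        return None


lines = [
    "Sequences producing significant alignments:                      (bits) Value",
    "sw:PAPA1_CARPA Papain precursor (EC 3.4.22.2) (Papaya proteinase...   721   0.0  ",
    "sw:PAPA3_CARPA Caricain precursor (EC 3.4.22.30) (Papaya protein...   514   e-145",
    "sw:BROM2_ANACO Stem bromelain (EC 3.4.22.32).                         192   3e-48",
]
for line in lines:
    hit = parse_hit_line(line)
    if hit is not None:
        print(hit[0], hit[2], hit[3])
```

If the whole report has to be processed, the tabular or XML settings of -format are likely to be easier to parse than the pairwise layout.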
Data files

The amino acid comparison matrices used to compare proteins are hard coded in the program and cannot be changed.

Notes

You can adapt blast to special needs by adjusting the parameters :

If you want to search a short peptide against a protein databank for nearly exact matches, you can :
set "Word size" to 2
set E() value to 1000
choose as scoring scheme PAM30, gap penalty 9, gap length penalty 1.

If you want to search a short oligonucleotide against a nucleic acid databank for nearly exact matches, you can :
set "Word size" to 7
set E() value to 1000

If you want to search a nucleic acid sequence against a nucleic acid databank just to check whether it is already in the databank, you can (for increased speed) :
set "Word size" to a higher value (e.g. 28)
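
The three recipes above can be kept as named presets in a script. The sketch below maps each recipe to the corresponding qualifiers (-wordsize, -expect, -matrix, -gappenalty, -gaplength) and turns a preset into command-line arguments. The preset names are invented for this example, and writing the matrix as "PAM30" is an assumption: the -matrix selector may expect a different token, so check the program's own prompt.

```python
# Preset names are hypothetical; the values come from the recipes in the Notes.
PRESETS = {
    "short_peptide": {        # nearly exact matches for a short peptide
        "-wordsize": 2,
        "-expect": 1000,
        "-matrix": "PAM30",   # assumed selector value
        "-gappenalty": 9,
        "-gaplength": 1,
    },
    "short_oligo": {          # nearly exact matches for a short oligonucleotide
        "-wordsize": 7,
        "-expect": 1000,
    },
    "already_in_databank": {  # fast check for presence in a nucleic acid databank
        "-wordsize": 28,
    },
}


def preset_args(name):
    """Flatten a preset into a list of qualifier/value arguments."""
    args = []
    for qualifier, value in PRESETS[name].items():
        args += [qualifier, str(value)]
    return args


for name in PRESETS:
    print(name, " ".join(preset_args(name)))
```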

If the query sequence contains a region that gives a "hit" with a great number of databank sequences and a second region that gives a weaker "hit", this second region can be overlooked, either because it appears a long way down the list of "hits" or because it is not reported at all due to the built-in limit on the number of reported "hits" (500 by default). You can remedy this with the -keep=n option, which limits the number of reported "hits" against the same region of the query sequence.

Allowed scoring schemes

Note that only some combinations of matrix and gap penalty are allowed, because these are the ones for which the K and lambda parameters have been derived "experimentally" by searching a random sequence against a random databank and have been hard coded in the program :

scoring matrix   gap penalty   gap length penalty   recommended
BLOSUM90          6            2
                  7            2
                  8            2
                  9            1
                               2
                 10            1                    *
                 11            1
BLOSUM80          6            2
                  7            2
                  8            2
                  9            1
                               2
                 10            1                    *
                 11            1
                 13            2
                 25            2
BLOSUM62          6            2
                  7            2
                  8            2
                  9            1
                               2
                 10            1
                               2
                 11            1                    * (the default)
                               2
                 12            1
                 13            1
BLOSUM50          9            3
                 10            3
                 11            3
                 12            2
                               3
                 13            2                    *
                               3
                 14            2
                 15            1
                               2
                 16            1
                               2
                 17            1
                 18            1
BLOSUM45         10            3
                 11            3
                 12            2
                               3
                 13            2
                               3
                 14            2                    *
                 15            2
                 16            1
                               2
                 17            1
                 18            1
                 19            1
PAM30             5            2
                  6            2
                  7            2
                  8            1
                  9            1                    *
                 10            1
PAM70             6            2
                  7            2
                  8            2
                  9            1
                 10            1                    *
                 11            1
PAM250           11            3
                 12            3
                 13            2
                               3
                 14            2
                               3
                 15            2                    *
                               3
                 16            2
                 17            1
                               2
                 18            1
                 19            1
                 20            1
                 21            1
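
Before launching a protein search with non-default scoring, a script can check the chosen combination against the table above. The sketch below transcribes only part of that table (BLOSUM90, BLOSUM80, BLOSUM62, PAM30 and PAM70) into a lookup; the remaining matrices would be added the same way. The names ALLOWED and is_allowed are illustrative and not part of blast.

```python
# matrix -> {gap penalty: allowed gap length penalties}; partial transcription
# of the table of allowed scoring schemes.
ALLOWED = {
    "BLOSUM90": {6: (2,), 7: (2,), 8: (2,), 9: (1, 2), 10: (1,), 11: (1,)},
    "BLOSUM80": {6: (2,), 7: (2,), 8: (2,), 9: (1, 2), 10: (1,), 11: (1,),
                 13: (2,), 25: (2,)},
    "BLOSUM62": {6: (2,), 7: (2,), 8: (2,), 9: (1, 2), 10: (1, 2), 11: (1, 2),
                 12: (1,), 13: (1,)},
    "PAM30": {5: (2,), 6: (2,), 7: (2,), 8: (1,), 9: (1,), 10: (1,)},
    "PAM70": {6: (2,), 7: (2,), 8: (2,), 9: (1,), 10: (1,), 11: (1,)},
}


def is_allowed(matrix, gap_penalty, gap_length_penalty):
    """True if the combination appears in the transcribed part of the table."""
    lengths = ALLOWED.get(matrix, {}).get(gap_penalty, ())
    return gap_length_penalty in lengths


print(is_allowed("BLOSUM62", 11, 1))  # the default scheme
print(is_allowed("PAM30", 9, 1))      # the short-peptide recipe from the Notes
print(is_allowed("BLOSUM62", 5, 1))   # not listed in the table
```

A matrix missing from the lookup is reported as not allowed, so the function errs on the side of rejecting combinations it does not know.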

Similarly, blastn supports only certain combinations of match reward, mismatch penalty and gap penalty. If both the gap penalty and the gap length penalty are above the maximum in the following table, blastn will switch to statistics for gapless alignments.

match reward/mismatch penalty   gap penalty   gap length penalty
2 / -7                           0            4
                                 2            2
                                              4
                                 4            2
                                              4
1 / -3                           0            2
                                 1            1
                                              2
                                 2            1
                                              2
2 / -5                           0            4
                                 2            2
                                              4
                                 4            2
                                              4
1 / -2                           0            2
                                 1            1
                                              2
                                 2            1
                                              2
                                 3            1
2 / -3                           0            4
                                 2            2
                                              4
                                 3            3
                                 4            2
                                              4
                                 5            2
                                 6            2
                                              4
4 / -5                           3            5
                                 4            5
                                 5            5
                                 6            5
                                12            8
1 / -1                           0            2
                                 1            2
                                 2            1
                                              2
                                 3            1
                                              2
                                 4            1
                                              2
5 / -4                           8            6
                                10            6
                                25           10

References

  1. Altschul, S.F., Gish, W., Miller, W., Myers, E.W. & Lipman, D.J. (1990) "Basic local alignment search tool." J. Mol. Biol. 215:403-410.
  2. Karlin, S. & Altschul, S.F. (1990) "Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes." Proc. Natl. Acad. Sci. USA 87:2264-2268.
  3. Karlin, S. & Altschul, S.F. (1993) "Applications and statistics for multiple high-scoring segments in molecular sequences." Proc. Natl. Acad. Sci. USA 90:5873-5877.
  4. Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W. & Lipman, D.J. (1997) "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs." Nucleic Acids Res. 25:3389-3402.
References about "filters" :
  1. Wootton, J.C. & Federhen, S. (1993) "Statistics of local complexity in amino acid sequences and sequence databases." Comput. Chem. 17:149-163.
  2. Hancock, J.M. & Armstrong, J.S. (1994) "SIMPLE34: an improved and enhanced implementation for VAX and Sun computers of the SIMPLE algorithm for analysis of clustered repetitive motifs in nucleotide sequences." Comput. Appl. Biosci. 10:67-70.
  3. Lupas, A., Van Dyke, M. & Stock, J. (1991) "Predicting coiled coils from protein sequences." Science 252:1162-1164.
  4. Wilson, J.A., Hill, J.E., Kuzio, J. & Faulkner, P. (1995) "Characterization of the baculovirus Choristoneura fumiferana multicapsid nuclear polyhedrosis virus p10 gene indicates that the polypeptide contains a coiled-coil domain." J. Gen. Virol. 76:2923-2932.

Warnings

For protein searches only some combinations of matrix and gap penalty are allowed (see Notes).

If you want to use a protein "user provided BLAST databank" and the same directory contains a nucleic acid databank with the same basename, do not point to the databank by its basename but by one of its file names. Otherwise the program assumes that you mean the nucleic acid databank and gives an error message.
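
To avoid the basename ambiguity described above, a script can inspect the directory and always pass a concrete file name. The sketch below looks for the three-file sets named in the "Input file format" section and returns, for example, <basename>.pin for a protein databank. Both helper names are hypothetical, and "Library File" databanks (.pal/.nal) are deliberately left out to keep the example short.

```python
from pathlib import Path

# File extensions of a BLAST format databank, as listed under "Input file format".
PROTEIN_EXTS = (".phr", ".pin", ".psq")
NUCLEIC_EXTS = (".nhr", ".nin", ".nsq")


def databank_kinds(basename):
    """Return which complete databanks exist for this basename."""
    base = Path(basename)

    def complete(exts):
        # with_name keeps any dots that are already part of the basename.
        return all(base.with_name(base.name + ext).exists() for ext in exts)

    kinds = []
    if complete(PROTEIN_EXTS):
        kinds.append("protein")
    if complete(NUCLEIC_EXTS):
        kinds.append("nucleic acid")
    return kinds


def unambiguous_reference(basename, kind):
    """Return a file name that identifies the databank of the requested kind."""
    ext = ".pin" if kind == "protein" else ".nin"
    base = Path(basename)
    return str(base.with_name(base.name + ext))


print(unambiguous_reference("mydb", "protein"))
print(databank_kinds("mydb"))
```

When databank_kinds reports both kinds for one basename, passing the result of unambiguous_reference to -userblastdb sidesteps the error described in the warning.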

Diagnostic Error Messages

The following error messages are specific to this program. They are issued when the user provides their own databank in BLAST format and there is a problem with it :

  <databank> cannot be accessed or is not BLAST format databank

  <databank> is not a nucleic acid databank !

  <databank> is not a protein databank !

Exit status

It exits prematurely with status 255 and an error message if there is a problem with a "User provided BLAST format databank", otherwise it exits with status 0.
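
A calling script can act on the two documented exit statuses. The sketch below classifies a finished run from its return code, assuming the caller has captured the program's error message in some way (where the wrapper writes it is not stated here). The function name and the returned labels are invented for this example.

```python
def describe_exit_status(returncode, message=""):
    """Classify a finished blast run by its exit status."""
    if returncode == 0:
        return "ok"
    if returncode == 255:
        # Documented case: problem with a user provided BLAST format databank.
        detail = message.strip() or "no message captured"
        return f"databank problem: {detail}"
    # Any other status is not documented for this program.
    return f"unexpected exit status {returncode}"


print(describe_exit_status(0))
print(describe_exit_status(255, "mydb is not a protein databank !\n"))
print(describe_exit_status(1))
```

In practice the return code would come from something like subprocess.run(args).returncode, with args built for the blast command line.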

Known bugs

None.

See also

Program name   Description
ebi_blast      WU-BLAST search of query sequence against sequence databank using EBI Web Services
ebi_fasta      fastA search of query sequence against sequence databank using EBI Web Services
fasta          fastA search of query sequence(s) against sequence search set
fasts          Protein identification from peptides using fastA algorithm
phiblast       Search protein sequence set combining matching of pattern with local alignment of a query sequence surrounding the match
psiblast       Iterative BLAST search with generation of profile of protein sequence against protein sequence set
blast2seq      Finds local alignments between two sequences, using BLAST
makeblastdb    Make BLAST format sequence database

Author(s)

The wrapper application blast was written by Guy Bottu (gbottu@vub.ac.be)
BEN, ULB, Brussels, Belgium

The program blastall itself was written by a team of developers working at the National Center for Biotechnology Information, Bethesda MD, U.S.A., comprising among others Stephen Altschul, David Lipman, Tom Madden, Alex Schaffer, Sergei Shavirin and Jinghui Zhang.

You can contact the BLAST development team at blast-help@ncbi.nlm.nih.gov

History

Completed 28 August 2002
Modified 20 March 2003 - adapted to BLAST version 2.2.5
Modified 23 February 2004 - adapted to BLAST version 2.2.8
Modified 3 September 2004 - added option to search only top or bottom strand
Modified 15 October 2004 - updated list of NCBI genetic codes
Modified 4 June 2006 - improved handling sequence filtering
Modified 22 April 2008 - added selector for composition dependent statistics

Target users

This program is intended to be used by everyone and everything, from naive users to embedded scripts.