fasta

 

Function

fastA search of query sequence(s) against sequence search set

Description

fasta is an EMBOSS "wrapper" program for a number of programs from Pearson's fastA package version 3.
fasta
Compare a protein sequence to a protein sequence database or a DNA sequence to a DNA sequence database using the FASTA algorithm (Pearson and Lipman, 1988, Pearson, 1996). Search speed and selectivity are controlled with the ktup (wordsize) parameter. For protein comparisons, ktup=2 by default ; ktup=1 is more sensitive but slower. For DNA comparisons, ktup=6 by default ; ktup=3 or ktup=4 provides higher sensitivity ; ktup=1 should be used for oligonucleotides (DNA query lengths < 20).
fastx   fasty
Compare a DNA sequence to a protein sequence database, by comparing the translated DNA sequence in three frames and allowing gaps and frameshifts. fastx uses a simpler, faster algorithm for alignments that allows frameshifts only between codons ; fasty is slower but produces better alignments with poor quality sequences because frameshifts are allowed within codons.
tfastx tfasty
Compare a protein sequence to a DNA sequence database, calculating similarities with frameshifts to the forward and reverse orientations.
ssearch
Compare a protein sequence to a protein sequence database or a DNA sequence to a DNA sequence database using the Smith-Waterman algorithm (Smith and Waterman, 1981). ssearch is about 10-times slower than FASTA, but is more sensitive for full-length protein sequence comparison.
ggsearch glsearch
Compare a protein sequence to a protein sequence database or a DNA sequence to a DNA sequence database using an algorithm based on the Needleman and Wunsch algorithm (Needleman and Wunsch, 1970). ggsearch calculates an alignment score that is global in the query and global in the library ; glsearch calculates an alignment that is global in the query and local in the library and is designed for global alignments to domains. ggsearch only compares the query to library sequences that are beween 80% and 125% of the length of the query ; glsearch limits comparisons to library sequences that are longer than 80% of the query.

Algorithm

FastA uses the method of  Pearson and Lipman (1988) to search for similarities between one sequence (the query) and any group of sequences.

Hashing Step

The first step in the search is to locate regions of the query sequence and the search set sequence that have high densities of exact word matches. The algorithm for this step of the search is a modification of the algorithm of Wilbur and Lipman and may be referred to as a hash-table look-up search or hashing. Wilbur and Lipman searches (including FastA) belong to a class of comparisons that use direct addressing or k-tuple preprocessing to increase the speed of the search at the expense of some sensitivity.
The hashing process works as follows. After you give FastA a word size, it makes up a dictionary of all of the possible words of that size in the query sequence. A second dictionary is created for the opposite strand if the query is a nucleic acid sequence. Each word, such as GGATGG, is converted to a unique base-4 number that serves as an index to the corresponding dictionary entry. Each entry contains a list of numbers telling the location (coordinates) of the word in the query sequence. If the word does not occur, the entry contains only the number zero. So for each word in the searched sequences, FastA only has to look up the word in the dictionary to find out if it occurs in the query sequence.
It is important to realize that the hashing process cannot deal with ambiguity ! To partially compensate for this limitation, FastA converts an ambiguous base in a sequence to its most common nonambiguous constituent before calculating the index number of the word that contains the ambiguity. For example, A is the most common nucleotide in the sequence databases, so N is converted to A during the hashing step. Similarly, the ambiguous amino acids B, Z, and X are converted to their most common unambiguous constituent, so B (D or N) gets the same hash code as N, and X (any amino acid) gets the same hash code as alanine, the most common amino acid in the protein databases.
If a word from a search set sequence occurs in the query sequence, FastA computes a score for the word equal to the sum of the scoring factors (see next paragraph) for each symbol in the word. It then adds this score to the score of the diagonal on which the word occurs. If a word match overlaps another word on the same diagonal,only the scoring factor(s) for the non-overlapping symbol(s) is added to the score of the diagonal. If there are intervening mismatches between matching words on a diagonal, a constant penalty for each mismatching residue is subtracted from the score.

Scoring Step

At the end of the hashing step, the ten highest-scoring regions for the sequence pair (the regions with the highest density of exact word matches) are rescored using a scoring matrix that allows conservative replacements and runs of identities shorter than the size of a word. The ends of each region are trimmed to include only those residues that contribute to the highest score for the region, resulting in ten partial alignments without gaps. These are referred to as the initial regions. The score of the highest scoring initial region is saved as the init1 score.
Next, FastA determines if any of the initial regions from different diagonals may be joined together to form an approximate alignment with gaps. Only non-overlapping regions may be joined. The score for the joined regions is the sum of the scores of the initial regions minus a joining penalty for each gap. The score of the highest scoring region at the end of this step is saved as the initn score.

Aligning Step

After computing the initial scores, FastA determines the best segment of similarity between the query sequence and the search set sequence, using a variation of the Smith-Waterman algorithm. The score for this alignment is reported as the opt score.
FastA determines the opt score immediately if the initn score is greater than a given threshold. The opt scores are then used as the basis for keeping a list of the best matches found. The program calculates the default threshold from the length of the query sequence and the ktup setting.
Lastly, FastA uses a simple linear regression against the natural log of the search set sequence length to calculate a normalized z-score for the sequence pair. (See Pearson (1995) for an explanation of how this z-score is calculated.)
The distribution of the z-scores tends to closely approximate an extreme-value distribution ; using this distribution, the program can estimate the number of sequences that would be expected to have, purely by chance, a z-score greater than or equal to the z-score obtained in the search. This is reported as the E() score.

When all of the search set sequences have been compared to the query, the list of best scores is printed. If alignments were requested, the alignments are also printed. For searches with a protein query sequence against a protein search set, a full Smith-Waterman local alignment (not restricted to a band, and therefore allowing unlimited gap lengths) is performed, and a Smith-Waterman score is reported along with the other scores and the alignment itself. (This alignment may not be the same alignment that the "local alignment in a band" algorithm used to calculate the opt score during the search.). By default, the alignment for nucleic acid searches and TFastA is the same "local alignment in a band" that was performed to calculate the opt score.
 In evaluating the E() scores, the following rules of thumb can be used : for searches of a protein database of 10,000 sequences, sequences with E() less than 0.01 are almost always found to be homologous. Sequences with E() between  1 and 10 frequently turn out to be related as well.

Statistical significance

All the programs in the FASTA package version 3 attempt to calculate accurate estimates of the statistical significance of a match. For fasta, ssearch, and fastx/y, these estimates are very accurate (Pearson, 1998, Zhang et al., 1997). Altschul et al. (Altschul et al., 1994) provides an excellent review of the statistics of local similarity scores. Local sequence similarity scores follow the extreme value distribution, so that P(s > x) = 1 - exp(-exp(-lambda(x-u)) where u = ln(Kmn)/lambda and m and n are the lengths of the query and library sequence. This formula can be rewritten as : 1 - exp(-Kmn exp(-lambda x), which shows that the average score for an unrelated library sequence increases with the logarithm of the length of the library sequence.  The fasta programs use simple linear regression against the log of the library sequence length to calculate a normalized "z- score" with mean 50, regardless of library sequence length, and variance 10. These z-scores can then be used with the extreme value distribution and the poisson distribution (to account for the fact that each library sequence comparison is an independent test) to calculate the number of library sequences to obtain a score greater than or equal to the score obtained in the search. The original idea and routines to do the linear regression on library sequence length were provided by Phil Green, U. Washington. This version uses a slightly different strategy for fitting the data than those originally provided by Dr. Green.
The expected number of sequences is plotted in the histogram using an "*". Since the parameters for the extreme value distribution are not calculated directly from the distribution of similarity scores, the pattern of "*"'s in the histogram gives a qualitative view of how well the statistical theory fits the similarity scores calculated by the programs. For fasta, if optimized scores are calculated for each sequence in the database (the default), the agreement between the actual distribution of "z-scores" and the expected distribution based on the length dependence of the score and the extreme value distribution is usually very good. Likewise, the distribution of ssearch Smith-Waterman scores typically agrees closely with the actual distribution of "z-scores." The agreement with unoptimized scores, ktup=2, is often not very good, with too many high scoring sequences and too few low scoring sequences compared with the predicted relationship between sequence length and similarity score. In those cases, the expectation values may be overestimates.
The ggsearch and glsearch programs, which were added with version 35.3, perform global rather than local alignments and assume the scores of unrelated library sequences follow a normal distribution instead.
With version 33t01, all the FASTA programs also report a "bit" score, which is equivalent to the bit score reported by BLAST2. The FASTA33/BLAST2 bit score is calculated as : (lambda*S - ln K)/ln 2, where S is the raw similarity score, lambda and K are statistical parameters estimated from the distribution of unrelated sequence similarity scores. The statistical signficance of a given bit score depends on the lengths of the query and library sequences and the size of the library, but a 1 bit increase in score corresponds to a 2-fold reduction in expectation ; a 10-bit increase implies 1000-fold lower expectation, etc.
The statistical routines assume that the library contains a large sample of unrelated sequences. If this is not true, then statistical parameters can be estimated by using the -shuffle option, which makes the program estimate the statistical parameters from the scores of shuffled sequences. If there are fewer than 20 sequences in the library, the statistical calculations are not done.
For protein searches, library sequences with E() values < 0.01 for searches of a 10,000 entry protein database are almost always homologous. Frequently sequences with E()-values from 1 - 10 are related as well, but unrelated sequences ( 1 - 10 per search) will have scores in this range as well. Remember, however, that these E() values also reflect differences between the amino acid composition of the query sequence and that of the "average" library sequence. Thus, when searches are done with query sequences with "biased" amino-acid composition, unrelated sequences may have "significant" scores because of sequence bias.

Usage

Here is a sample session with fasta

> fasta
fastA search of query sequence(s) against sequence search set
         1 : fasta nuc against nuc
         2 : fasta prot against prot
         3 : fastx (nuc against prot with codon/aa alignment)
         4 : fasty (idem + allowing intracodon gaps)
         5 : tfastx (prot against nuc with codon/aa alignment)
         6 : tfasty (idem + allowing intracodon gaps)
         7 : ssearch (using SW instead) nuc against nuc
         8 : ssearch (using SW instead) prot against prot
         9 : ggsearch (NW global/global alignment) nuc against nuc
        10 : ggsearch (NW global/global alignment) prot against prot
        11 : glsearch (global/local alignment) nuc against nuc
        12 : glsearch (global/local alignment) prot against prot
Select type of search you want to run [1]:
Query sequence(s): embl:m25165
         1 : standard set
         2 : user defined set
         3 : user provided fastA databank
Select search set type [1]:
        em : EMBL (general nucleic acid databank)
  emblnontags : EMBL without EST and GSS
       hum : EMBL humans
       mus : EMBL mice
       rod : EMBL other rodents
       mam : EMBL other mammals
       vrt : EMBL other vertebrates
       inv : EMBL invertebrates
       pln : EMBL plants
       fun : EMBL fungi
       pro : EMBL bacteria
       phg : EMBL bacteriophages
       vrl : EMBL other viruses
       est : EMBL Expressed Sequence Tags
       gss : EMBL Genome Survey Sequences
       sts : EMBL Sequence Tagged Sites
       htg : EMBL High Throughput Genomic
       htc : EMBL High Throughput cDNA
       env : EMBL environmental samples
       pat : EMBL patents
       tgn : EMBL transgenic
       syn : EMBL synthetic
       unc : EMBL unclassified
       new : EMBL updates since last release
       wgs : EMBL Whole Genome Shotgun
    refseq : RefSeq (NCBI reference sequences)
  refseqwgs : RefSeq Whole Genome Shotgun
  refseqgen : RefSeq other genomic
  refseqrna : RefSeq transcripts
       vec : Intelligenetics vector databank
     emvec : EMBL vector subset
       epd : Eukaryotic Promoter Database
      ligm : ImMunoGeneTics databank Igg. + TcR genes
       hla : ImMunoGeneTics databank human MHC genes
      pdbn : PDB (nucleic acids with known 3D structure)
Standard nucleic acid search set [emblnontags]: phg
Word (ktup) size [6]:
E() value cutoff [2.0]:
Output file [m25165.fasta]:

Go to the input files for this example
Go to the output files for this example

Command line arguments

   Standard (Mandatory) qualifiers (* if not always prompted):
   -program            menu       [1] Search type : fastA or optimal
                                  alignment, nuc. or prot. (Values: 1 (fasta
                                  nuc against nuc); 2 (fasta prot against
                                  prot); 3 (fastx (nuc against prot with
                                  codon/aa alignment)); 4 (fasty (idem +
                                  allowing intracodon gaps)); 5 (tfastx (prot
                                  against nuc with codon/aa alignment)); 6
                                  (tfasty (idem + allowing intracodon gaps));
                                  7 (ssearch (using SW instead) nuc against
                                  nuc); 8 (ssearch (using SW instead) prot
                                  against prot); 9 (ggsearch (NW global/global
                                  alignment) nuc against nuc); 10 (ggsearch
                                  (NW global/global alignment) prot against
                                  prot); 11 (glsearch (global/local alignment)
                                  nuc against nuc); 12 (glsearch
                                  (global/local alignment) prot against prot))
  [-seqs]              seqall     Query sequence(s)
   -dbtype             menu       [1] Search set type : public databank or
                                  databank provided by user (Values: 1
                                  (standard set); 2 (user defined set); 3
                                  (user provided fastA databank))
*  -nucdb              menu       [emblnontags] Standard nucleic acid search
                                  set (Values: em (EMBL (general nucleic acid
                                  databank)); emblnontags (EMBL without EST
                                  and GSS); hum (EMBL humans); mus (EMBL
                                  mice); rod (EMBL other rodents); mam (EMBL
                                  other mammals); vrt (EMBL other
                                  vertebrates); inv (EMBL invertebrates); pln
                                  (EMBL plants); fun (EMBL fungi); pro (EMBL
                                  bacteria); phg (EMBL bacteriophages); vrl
                                  (EMBL other viruses); est (EMBL Expressed
                                  Sequence Tags); gss (EMBL Genome Survey
                                  Sequences); sts (EMBL Sequence Tagged
                                  Sites); htg (EMBL High Throughput Genomic);
                                  htc (EMBL High Throughput cDNA); env (EMBL
                                  environmental samples); pat (EMBL patents);
                                  tgn (EMBL transgenic); syn (EMBL synthetic);
                                  unc (EMBL unclassified); new (EMBL updates
                                  since last release); wgs (EMBL Whole Genome
                                  Shotgun); refseq (RefSeq (NCBI reference
                                  sequences)); refseqwgs (RefSeq Whole Genome
                                  Shotgun); refseqgen (RefSeq other genomic);
                                  refseqrna (RefSeq transcripts); vec
                                  (Intelligenetics vector databank); emvec
                                  (EMBL vector subset); epd (Eukaryotic
                                  Promoter Database); ligm (ImMunoGeneTics
                                  databank Igg. + TcR genes); hla
                                  (ImMunoGeneTics databank human MHC genes);
                                  pdbn (PDB (nucleic acids with known 3D
                                  structure)))
*  -protdb             menu       [up] Standard protein search set (Values: sw
                                  (SwissProt (highly annotated protein
                                  databank)); up (UniProt (SwissProt + TrEMBL,
                                  EMBL ORF translations)); uniref100
                                  (UniRef100 (UniProt nonredundant subset));
                                  uniref90 (UniRef90 (UniRef100 subset with no
                                  more than 90% identity)); uniref50
                                  (UniRef50 (UniRef100 subset with no more
                                  than 50% identity)); remt (REM-TrEMBL (old
                                  EMBL ORF translations not incl. in
                                  UniProt)); pir (PIR (old general protein
                                  databank)); gp (GenPept (GenBank ORF
                                  translations)); refseqp (RefSeq (NCBI
                                  reference protein sequences)); pdb (PDB
                                  (proteins with known 3D structure)); gpcrdb
                                  (G protein coupled receptors))
*  -userdb             seqall     User defined search set
*  -userfastadb        infile     User provided fastA format databank (you can
                                  make one using seqret)
*  -wordsize           integer    [6 for fasta nucleic, 2 for other search
                                  types] Word (ktup) size (1 to 6 for fasta
                                  nucleic, 1 or 2 for other search types)
   -expect             float      [2.0 for nucleic, 10.0 for protein, 5.0 for
                                  mixed] E() value = number of databank
                                  sequences with same or higher Z-score that
                                  you expect to find by chance. fastA lists
                                  sequences with an E() value lower than the
                                  cutoff. (Number 0.000 or more)
  [-outfile]           outfile    [*.fasta] Output file name

   Additional (Optional) qualifiers (* if not always prompted):
*  -[no]reverse        boolean    [Y] Search also complementary strand (is
                                  default). If you switch this off fasta will
                                  search only the forward strand of the query
                                  sequence or of the search set sequences.
*  -match              integer    [5] Nucleotide match reward (Integer 0 or
                                  more)
*  -mismatch           integer    [-4] Nucleotide mismatch penalty (Integer up
                                  to 0)
*  -matrix             menu       [BL50] Amino acid comparison matrix (Values:
                                  BL50 (BLOSUM50); BL62 (BLOSUM62); BL80
                                  (BLOSUM80); P120 (PAM120); P250 (PAM250);
                                  M10 (Jones, Taylor, Thornton PAM10); M20
                                  (Jones, Taylor, Thornton PAM20); M40 (Jones,
                                  Taylor, Thornton PAM40); VT160 (Vingron
                                  resolvent PAM160); OPT5 (OPTIMA 5))
*  -intercodon         integer    [20] Frame shift penalty between codons
                                  (Integer 0 or more)
*  -intracodon         integer    [24] Frame shift penalty inside codons
                                  (Integer 0 or more)
   -gappenalty         integer    [14 for nucleic, 10 for protein, 12 for
                                  fastx/y, 14 for tfastx/y] Gap penalty
                                  (Integer 0 or more)
   -gaplength          integer    [4 for nucleic, 2 for protein, 2 for mixed]
                                  Gap length penalty. fastA subtracts from the
                                  similarity score for each gap a penalty of
                                  type <Gap penalty> + <Gap length penalty> *
                                  n (Integer 0 or more)
*  -gencode            menu       [1] Genetic code for translating sequences
                                  (Values: 1 (Standard); 2 (Vertebrate
                                  Mitochondrial); 3 (Yeast Mitochondrial); 4
                                  (Mold, Protozoan, Coelenterate Mitochondrial
                                  and Mycoplasma/Spiroplasma); 5
                                  (Invertebrate Mitochondrial); 6 (Ciliate,
                                  Dasycladacean and Hexamita); 9
                                  (Echinodermate Mitochondrial); 10
                                  (Euplotid); 11 (Eubacterial); 12
                                  (Alternative Yeast); 13 (Ascidian
                                  Mitochondrial); 14 (Flatworm Mitochondrial);
                                  15 (Blepharisma); 16 (Chlorophycean
                                  Mitochondrial); 21 (Trematode
                                  Mitochondrial); 22 (Scenedesmus obliquus
                                  Mitochondrial); 23 (Thraustochytrium
                                  Mitochondrial))
   -format             menu       [0] Alignment format (Values: 0 (standard);
                                  1 (x = conservative replacements, X =
                                  non-conservative substitutions); 2 (show
                                  only residues in sequence 2 that differ from
                                  sequence 1); 4 (alignment map); 5 (standard
                                  with added alignment map); 9 (long format
                                  best scores report); 10 (write alignments in
                                  parsable format))

   Advanced (Unprompted) qualifiers:
   -[no]stat           boolean    [Y] Compute statistics (is default). If you
                                  switch this off, fastA will sort on opt
                                  instead off bit score and report no more
                                  than 20 best scoring sequences.
   -shuffle            boolean    Use shuffled databank sequences for
                                  statistics. Useful if search set contains no
                                  sequences unrelated to query sequence.
   -hide               float      [0.000] Do not show sequences with E() value
                                  lower than f. (Number 0.000 or more)
   -zscore             boolean    write Z-score instead of bit score in list
   -[no]histogram      boolean    [Y] Show histogram (is default)
   -listsize           integer    [0] Show only the n best scoring sequences
                                  that satisfy E() cutoff (Integer 0 or more)
   -align              integer    [0] Show only alignments for the n first
                                  sequences (Integer 0 or more)
   -linesize           integer    [60] Number of residues per line of the
                                  alignment (Integer from 10 to 200)

   Associated qualifiers:

   "-seqs" associated qualifiers
   -sbegin1            integer    Start of each sequence to be used
   -send1              integer    End of each sequence to be used
   -sreverse1          boolean    Reverse (if DNA)
   -sask1              boolean    Ask for begin/end/reverse
   -snucleotide1       boolean    Sequence is nucleotide
   -sprotein1          boolean    Sequence is protein
   -slower1            boolean    Make lower case
   -supper1            boolean    Make upper case
   -sformat1           string     Input sequence format
   -sdbname1           string     Database name
   -sid1               string     Entryname
   -ufo1               string     UFO features
   -fformat1           string     Features format
   -fopenfile1         string     Features file name

   "-userdb" associated qualifiers
   -sbegin             integer    Start of each sequence to be used
   -send               integer    End of each sequence to be used
   -sreverse           boolean    Reverse (if DNA)
   -sask               boolean    Ask for begin/end/reverse
   -snucleotide        boolean    Sequence is nucleotide
   -sprotein           boolean    Sequence is protein
   -slower             boolean    Make lower case
   -supper             boolean    Make upper case
   -sformat            string     Input sequence format
   -sdbname            string     Database name
   -sid                string     Entryname
   -ufo                string     UFO features
   -fformat            string     Features format
   -fopenfile          string     Features file name

   "-outfile" associated qualifiers
   -odirectory2        string     Output directory

   General qualifiers:
   -auto               boolean    Turn off prompts
   -stdout             boolean    Write standard output
   -filter             boolean    Read standard input, write standard output
   -options            boolean    Prompt for standard and additional values
   -debug              boolean    Write debug output to program.dbg
   -verbose            boolean    Report some/full command line options
   -help               boolean    Report command line options. More
                                  information on associated and general
                                  qualifiers can be found with -help -verbose
   -warning            boolean    Report warnings
   -error              boolean    Report errors
   -fatal              boolean    Report fatal errors
   -die                boolean    Report dying program messages

Standard (Mandatory) qualifiers Allowed values Default
-program Search type : fastA or optimal alignment, nuc. or prot.
1 (fasta nuc against nuc)
2 (fasta prot against prot)
3 (fastx (nuc against prot with codon/aa alignment))
4 (fasty (idem + allowing intracodon gaps))
5 (tfastx (prot against nuc with codon/aa alignment))
6 (tfasty (idem + allowing intracodon gaps))
7 (ssearch (using SW instead) nuc against nuc)
8 (ssearch (using SW instead) prot against prot)
9 (ggsearch (NW global/global alignment) nuc against nuc)
10 (ggsearch (NW global/global alignment) prot against prot)
11 (glsearch (global/local alignment) nuc against nuc)
12 (glsearch (global/local alignment) prot against prot)
1
[-seqs]
(Parameter 1)
Query sequence(s) Readable sequence(s) Required
-dbtype Search set type : public databank or databank provided by user
1 (standard set)
2 (user defined set)
3 (user provided fastA databank)
1
-nucdb Standard nucleic acid search set
em (EMBL (general nucleic acid databank))
emblnontags (EMBL without EST and GSS)
hum (EMBL humans)
mus (EMBL mice)
rod (EMBL other rodents)
mam (EMBL other mammals)
vrt (EMBL other vertebrates)
inv (EMBL invertebrates)
pln (EMBL plants)
fun (EMBL fungi)
pro (EMBL bacteria)
phg (EMBL bacteriophages)
vrl (EMBL other viruses)
est (EMBL Expressed Sequence Tags)
gss (EMBL Genome Survey Sequences)
sts (EMBL Sequence Tagged Sites)
htg (EMBL High Throughput Genomic)
htc (EMBL High Throughput cDNA)
env (EMBL environmental samples)
pat (EMBL patents)
tgn (EMBL transgenic)
syn (EMBL synthetic)
unc (EMBL unclassified)
new (EMBL updates since last release)
wgs (EMBL Whole Genome Shotgun)
refseq (RefSeq (NCBI reference sequences))
refseqwgs (RefSeq Whole Genome Shotgun)
refseqgen (RefSeq other genomic)
refseqrna (RefSeq transcripts)
vec (Intelligenetics vector databank)
emvec (EMBL vector subset)
epd (Eukaryotic Promoter Database)
ligm (ImMunoGeneTics databank Igg. + TcR genes)
hla (ImMunoGeneTics databank human MHC genes)
pdbn (PDB (nucleic acids with known 3D structure))
emblnontags
-protdb Standard protein search set
sw (SwissProt (highly annotated protein databank))
up (UniProt (SwissProt + TrEMBL, EMBL ORF translations))
uniref100 (UniRef100 (UniProt nonredundant subset))
uniref90 (UniRef90 (UniRef100 subset with no more than 90% identity))
uniref50 (UniRef50 (UniRef100 subset with no more than 50% identity))
remt (REM-TrEMBL (old EMBL ORF translations not incl. in UniProt))
pir (PIR (old general protein databank))
gp (GenPept (GenBank ORF translations))
refseqp (RefSeq (NCBI reference protein sequences))
pdb (PDB (proteins with known 3D structure))
gpcrdb (G protein coupled receptors)
up
-userdb User defined search set Readable sequence(s) Required
-userfastadb User provided fastA format databank (you can make one using seqret) Input file Required
-wordsize Word (ktup) size 1 to 6 for fasta nucleic, 1 or 2 for other search types 6 for fasta nucleic, 2 for other search types
-expect E() value = number of databank sequences with same or higher Z-score that you expect to find by chance. fastA lists sequences with an E() value lower than the cutoff. Number 0.000 or more 2.0 for nucleic, 10.0 for protein, 5.0 for mixed
[-outfile]
(Parameter 2)
Output file name Output file <sequence>.<program>
Additional (Optional) qualifiers Allowed values Default
-[no]reverse Search also complementary strand (is default). If you switch this off fasta will search only the forward strand of the query sequence or of the search set sequences. Boolean value Yes/No Yes
-match Nucleotide match reward Integer 0 or more 5
-mismatch Nucleotide mismatch penalty Integer up to 0 -4
-matrix Amino acid comparison matrix
BL50 (BLOSUM50)
BL62 (BLOSUM62)
BL80 (BLOSUM80)
P120 (PAM120)
P250 (PAM250)
M10 (Jones, Taylor, Thornton PAM10)
M20 (Jones, Taylor, Thornton PAM20)
M40 (Jones, Taylor, Thornton PAM40)
VT160 (Vingron resolvent PAM160)
OPT5 (OPTIMA 5)
BL50
-intercodon Frame shift penalty between codons Integer 0 or more 20
-intracodon Frame shift penalty inside codons Integer 0 or more 24
-gappenalty Gap penalty Integer 0 or more 14 for nucleic, 10 for protein, 12 for fastx/y, 14 for tfastx/y
-gaplength Gap length penalty. fastA subtracts from the similarity score for each gap a penalty of type <Gap penalty> + <Gap length penalty> * n Integer 0 or more 4 for nucleic, 2 for protein, 2 for mixed
-gencode Genetic code for translating sequences
1 (Standard)
2 (Vertebrate Mitochondrial)
3 (Yeast Mitochondrial)
4 (Mold, Protozoan, Coelenterate Mitochondrial and Mycoplasma/Spiroplasma)
5 (Invertebrate Mitochondrial)
6 (Ciliate, Dasycladacean and Hexamita)
9 (Echinodermate Mitochondrial)
10 (Euplotid)
11 (Eubacterial)
12 (Alternative Yeast)
13 (Ascidian Mitochondrial)
14 (Flatworm Mitochondrial)
15 (Blepharisma)
16 (Chlorophycean Mitochondrial)
21 (Trematode Mitochondrial)
22 (Scenedesmus obliquus Mitochondrial)
23 (Thraustochytrium Mitochondrial)
1
-format Alignment format
0 (standard)
1 (x = conservative replacements, X = non-conservative substitutions)
2 (show only residues in sequence 2 that differ from sequence 1)
4 (alignment map)
5 (standard with added alignment map)
9 (long format best scores report)
10 (write alignments in parsable format)
0
Advanced (Unprompted) qualifiers Allowed values Default
-[no]stat Compute statistics (is default). If you switch this off, fastA will sort on opt instead off bit score and report no more than 20 best scoring sequences. Boolean value Yes/No Yes
-shuffle Use shuffled databank sequences for statistics. Useful if search set contains no sequences unrelated to query sequence. Boolean value Yes/No No
-hide Do not show sequences with E() value lower than f. Number 0.000 or more 0.000
-zscore write Z-score instead of bit score in list Boolean value Yes/No No
-[no]histogram Show histogram (is default) Boolean value Yes/No Yes
-listsize Show only the n best scoring sequences that satisfy E() cutoff Integer 0 or more 0
-align Show only alignments for the n first sequences Integer 0 or more 0
-linesize Number of residues per line of the alignment Integer from 10 to 200 60

Input file format

fasta searches a query sequence against a search sequence set. For the query sequence you can provide any normal sequence USA. There is however a built-in limit of 20.000 bases or amino acids. If the query sequence is longer than 20.000, the program will cut it in slices of 20.000 long, handle them as separate sequences and write the results the one after the other in the output file.
You can submit several query sequences at the same time. In this case the "wrapper" fasta will launch the search program for each individual sequence and make sure the output files have different names. Do note that the results of the individual searches are in no way merged.

You can select your search set in three different ways :

  1. a standard set is a copy of a public databank or a local databank installed by the managers of the server computer and "visible" to all users. You must choose from a selector.
  2. a user defined set is a set of sequences (public and/or private) that you can specify using a normal sequence USA. It is convenient to use a List File. The set will be transformed on-the-fly into a temporary fastA format databank.
  3. a user provided fastA databank is a databank in fastA format (usually private), that you can specify by the name of the file.
Note that if you want search a selected set of sequences taken from the public databanks and/or want to search a set of your own private sequences you can choose between options 2 and 3. If your set is small or if you want to search it just once, the "a user defined set" is good, otherwise, to save time, it is recommended to make a databank in fastA format using the program seqret and choose "a user provided fastA databank".

Output file format

The first part of the output file contains a histogram showing the distribution of the opt scores between the query and search set sequences. (See the algorithm topic for an explanation of opt score.) The histogram is composed of bins of size 2 that are labeled according to the higher score for that bin (the leftmost column of the histogram). For example, the bin labeled 24 stores the number of sequence pairs that had scores of 23 or 24.
The next two columns of the histogram list the number of opt scores that fell within each bin. The second column lists the number of opt scores observed in the search and the third column lists the number of opt scores that were expected.
The body of the histogram displays a graphical representation of the score distributions. Equal signs (=) indicate the number of scores of that magnitude that were observed during the search, while asterisks (*) plot the number of scores of that magnitude that were expected.
At the bottom of the histogram is a list of some of the parameters pertaining to the search.
Below the histogram, FastA displays a listing of the best scores. [r] after the sequence name in this list indicates that the match was found between the search set sequence and the reverse complement of the query sequence.
Following the list of best scores, FastA displays the alignments of the regions of best overlap between the query and search sequences. Note that for the purpose of clarity the "wrapper" makes sure the "naked" program outputs complete description lines rather than truncated ones and, for the purpose of avoiding automated output parsers to break, the "wrapper" makes sure the description line is concatenated on a single line rather than being wrapped on several lines.

Output files for usage example

File: m25165.fasta

# /opt/sw/fasta/bin/fasta -q -L -T 2 -l /opt/sw/fasta/fastlibs -n -E 2 -r 5/-4 -f 14 -g 4 -m 0 -w 60 embl-id:M25165 %+phg 6
FASTA searches a protein or DNA sequence data bank
 version 35.04 Oct. 7, 2008
Please cite:
 W.R. Pearson & D.J. Lipman PNAS (1988) 85:2444-2448

Query: embl-id:M25165
  1>>>M25165 M25165.1 Bacteriophage lambda P(R) promoter, RNA polymerase bin - 94 nt
Library: EMBL bacteriophages 31460588 residues in  4076 sequences

       opt      E()
< 20    86     0:==========
  22     0     0:           one = represents 9 library sequences
  24     0     0:
  26     0     0:
  28     0     1:*
  30     0     6:*
  32    10    22:==*
  34    22    59:===   *
  36    49   121:======       *
  38   154   200:==================    *
  40   206   279:=======================       *
  42   291   341:=================================    *
  44   524   376:=========================================*=================
  46   384   383:==========================================*
  48   376   367:========================================*=
  50   440   335:=====================================*===========
  52   309   294:================================*==
  54   245   251:===========================*
  56   209   210:=======================*
  58   170   172:===================*
  60   137   140:===============*
  62    75   112:=========   *
  64    69    89:======== *
  66    86    70:=======*==
  68    61    55:======*
  70    33    43:====*
  72    22    34:===*
  74    37    26:==*==
  76    24    21:==*
  78    19    16:=*=
  80     6    12:=*
  82    10     9:*=
  84    10     8:*=
  86     2     6:*
  88     2     5:*          inset = represents 1 library sequences
  90     1     3:*
  92     4     3:*         :==*=
  94     0     2:*         : *
  96     1     2:*         :=*
  98     0     1:*         :*
 100     0     1:*         :*
 102     0     1:*         :*
 104     0     1:*         :*
 106     1     0:=         *=
 108     0     0:          *
 110     2     0:=         *==
 112     1     0:=         *=
 114     0     0:          *
 116     0     0:          *
 118     0     0:          *
>120    22     0:===       *======================
31462844 residues in  4100 sequences
Statistics:  Expectation_n fit: rho(ln(x))= 4.7414+/- 0.001; mu= 14.3669+/- 0.070
 mean_var=89.9386+/-13.435, 0's: 0 Z-trim: 108  B-trim: 80 in 2/84
 Lambda= 0.135239
 Kolmogorov-Smirnov  statistic: 0.0738 (N=29) at  42
Algorithm: FASTA (3.5 Sept 2006) [optimized]
Parameters: 5/-4 matrix (5:-4) ktup: 6
 join: 46, opt: 31, open/ext: -14/-4, width:  16
 Scan time:  2.770

The best scores are:                                      opt bits E(4100)
em:J02459 [Enterobacteria phage lambda] Entero (48502) [f]  470 101.9 6.4e-22
embl:DJ344380 Method for preparation of artifi (48502) [f]  470 101.9 6.4e-22
embl:DJ347897 Method for preparation of artifi (48502) [f]  470 101.9 6.4e-22
embl:J02459 Enterobacteria phage lambda, compl (48502) [f]  470 101.9 6.4e-22
embl:EF120455 Enterobacteria phage HK544 restr (3564) [f]  470 100.5 1.6e-21
embl:AF069529 Bacteriophage HK97, complete gen (39732) [f]  461 100.0 2.3e-21
embl:EF120461 Enterobacteria phage HK106 conse (4202) [f]  461 98.8 5.2e-21
embl:EF120460 Enterobacteria phage mEp234 cons (4202) [f]  461 98.8 5.2e-21
embl:M25165 Bacteriophage lambda P(R) promoter (  94) [f]  470 98.6   6e-21
embl:EF120459 Enterobacteria phage CL707 IS10  (5055) [f]  452 97.2 1.6e-20
embl:EF120458 Enterobacteria phage HK542 hypot (3972) [f]  452 97.1 1.8e-20
embl:EF120457 Enterobacteria phage HK244 hypot (3921) [f]  452 97.1 1.8e-20
embl:DQ372059 Enterobacteria phage lambda clon (1206) [f]  445 95.1 7.1e-20
embl:DQ372056 Enterobacteria phage lambda clon (1206) [f]  445 95.1 7.1e-20
embl:DQ372058 Enterobacteria phage lambda clon (1206) [f]  445 95.1 7.1e-20
embl:DQ372057 Enterobacteria phage lambda clon (1206) [f]  445 95.1 7.1e-20
embl:DQ372060 Enterobacteria phage lambda clon (1206) [f]  436 93.3 2.4e-19
embl:EF120456 Enterobacteria phage mEp332 rest (4251) [f]  429 92.6 3.9e-19
embl:M29179 Bacteriophage lambda O-R region pr (  88) [f]  400 84.9 7.9e-17
embl:X70116 Bacteriophage lambda promoter regi (  86) [f]  395 84.0 1.6e-16
embl:M25081 Bacteriophage lambda O-R operator, (  76) [f]  380 81.0 1.3e-15
em:J02459 [Enterobacteria phage lambda] Entero (48502) [r]  126 34.8     0.1
embl:DJ347897 Method for preparation of artifi (48502) [r]  126 34.8     0.1
embl:DJ344380 Method for preparation of artifi (48502) [r]  126 34.8     0.1
embl:J02459 Enterobacteria phage lambda, compl (48502) [r]  126 34.8     0.1
embl:AF069529 Bacteriophage HK97, complete gen (39732) [r]  126 34.6    0.11
embl:EF120459 Enterobacteria phage CL707 IS10  (5055) [r]  126 33.6    0.23
embl:EF120460 Enterobacteria phage mEp234 cons (4202) [r]  126 33.5    0.25
embl:EF120461 Enterobacteria phage HK106 conse (4202) [r]  126 33.5    0.25
embl:EF120458 Enterobacteria phage HK542 hypot (3972) [r]  126 33.5    0.25
embl:EF120457 Enterobacteria phage HK244 hypot (3921) [r]  126 33.5    0.25
embl:EF120455 Enterobacteria phage HK544 restr (3564) [r]  126 33.4    0.26
embl:EF120456 Enterobacteria phage mEp332 rest (4251) [r]  123 32.9    0.37
embl:DQ372056 Enterobacteria phage lambda clon (1206) [r]  126 32.8    0.38
embl:DQ372060 Enterobacteria phage lambda clon (1206) [r]  126 32.8    0.38
embl:DQ372057 Enterobacteria phage lambda clon (1206) [r]  126 32.8    0.38
embl:DQ372058 Enterobacteria phage lambda clon (1206) [r]  126 32.8    0.38
embl:DQ372059 Enterobacteria phage lambda clon (1206) [r]  126 32.8    0.38
embl:M25165 Bacteriophage lambda P(R) promoter (  94) [r]  126 31.5    0.96
embl:M29179 Bacteriophage lambda O-R region pr (  88) [r]  126 31.5    0.98
embl:X70116 Bacteriophage lambda promoter regi (  86) [r]  126 31.5    0.99
embl:M25081 Bacteriophage lambda O-R operator, (  76) [r]  126 31.4       1
embl:V00638 Lambda genome from map unit 74 bac (3400) [r]  113 30.8     1.5

>>em:J02459 [Enterobacteria phage lambda] Enterobacteria phage lambda, complete genome. (48502 nt)
 initn: 470 init1: 470 opt: 470  Z-score: 476.5  bits: 101.9 E(): 6.4e-22
banded Smith-Waterman score: 470; 100.0% identity (100.0% similar) in 94 nt overlap (1-94:37946-38039)

                                             10        20        30
M25165                               AAATCTATCACCGCAAGGGATAAATATCTA
                                     ::::::::::::::::::::::::::::::
em:J02 TTAATGGTTTCTTTTTTGTGCTCATACGTTAAATCTATCACCGCAAGGGATAAATATCTA
       37920     37930     37940     37950     37960     37970     

               40        50        60        70        80        90
M25165 ACACCGTGCGTGTTGACTATTTTACCTCTGGCGGTGATAATGGTTGCATGTACTAAGGAG
       ::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
em:J02 ACACCGTGCGTGTTGACTATTTTACCTCTGGCGGTGATAATGGTTGCATGTACTAAGGAG
       37980     37990     38000     38010     38020     38030     

                                                                   
M25165 GTTG                                                        
       ::::                                                        
em:J02 GTTGTATGGAACAACGCATAACCCTGAAAGATTATGCAATGCGCTTTGGGCAAACCAAGA
       38040     38050     38060     38070     38080     38090     

>>embl:DJ344380 Method for preparation of artificial positive control DNAs used for multiplex PCR. (48502 nt)
 initn: 470 init1: 470 opt: 470  Z-score: 476.5  bits: 101.9 E(): 6.4e-22
banded Smith-Waterman score: 470; 100.0% identity (100.0% similar) in 94 nt overlap (1-94:37946-38039)

                                             10        20        30
M25165                               AAATCTATCACCGCAAGGGATAAATATCTA
                                     ::::::::::::::::::::::::::::::
embl:D TTAATGGTTTCTTTTTTGTGCTCATACGTTAAATCTATCACCGCAAGGGATAAATATCTA
       37920     37930     37940     37950     37960     37970     

               40        50        60        70        80        90
M25165 ACACCGTGCGTGTTGACTATTTTACCTCTGGCGGTGATAATGGTTGCATGTACTAAGGAG
       ::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
embl:D ACACCGTGCGTGTTGACTATTTTACCTCTGGCGGTGATAATGGTTGCATGTACTAAGGAG
       37980     37990     38000     38010     38020     38030     

                                                                   
M25165 GTTG                                                        
       ::::                                                        
embl:D GTTGTATGGAACAACGCATAACCCTGAAAGATTATGCAATGCGCTTTGGGCAAACCAAGA
       38040     38050     38060     38070     38080     38090     


  [Part of this file has been deleted for brevity]


>>embl:V00638 Lambda genome from map unit 74 backward to map unit 67. (3400 nt)
rev-comp initn:  96 init1:  96 opt: 113  Z-score: 113.3  bits: 30.8 E():  1.5
banded Smith-Waterman score: 113; 69.6% identity (69.6% similar) in 56 nt overlap (72-17:3086-3138)

                  90        80        70        60        50       
M2516-         CAACCTCCTTAGTACATGCAACCATTATCACCGCCAGAGGTAAAATAGTCAA
                                     :: :::::::::::: ::::     :::::
embl:V ATGGTGGTCAGTGCGTCCTGCTGATGTGCTCAGTATCACCGCCAGTGGTATTTATGTCAA
        3060      3070      3080      3090      3100      3110     

        40        30        20        10                           
M2516- CACGCACGGTGTTAGATATTTATCCCTTGCGGTGATAGATTT                  
       :::   : : : ::   ::::::: :                                  
embl:V CACCGCCAGAGATA---ATTTATCACCGCAGATGGTTATCTGTATGTTTTTTATATGAAT
        3120         3130      3140      3150      3160      3170  



94 residues in 1 query   sequences
31460588 residues in 4076 library sequences
 Tcomplib [35.04] (2 proc)
 start: Wed Oct 15 16:18:06 2008 done: Wed Oct 15 16:18:08 2008
 Total Scan time:  2.770 Total Display time:  0.030

Function used was FASTA [version 35.04 Oct. 7, 2008]

Data files

The amino acid comparison matrices used to compare proteins are hard coded in the program and cannot be changed.

Notes

When the query sequence is very short there may be no "hits" with a low E() value. Therefore, to find something you may have to set E() to a higher value (e.g. 1000) or alternatively use the option -nostat.

If the search set contains no sequences that are unrelated to the query sequence the statistics will be flawed, because fasta uses low scoring databank sequences as model for false positives. You can mend this by requesting to use instead shuffled versions of databank sequences (option -shuffle).

The sensitivity of the search depends on the chosen scoring scheme and it is important to choose an appropriate gap penalty. Prof. Pearson recommends the following gap penalties for different protein scoring matrices :

scoring matrixrecommended gap penaltyrecommended gap length penalty
BLOSUM50122
BLOSUM62121
BLOSUM80122
PAM120203
PAM250122
Jones, Taylor, Thornton PAM10274
Jones, Taylor, Thornton PAM20264
Jones, Taylor, Thornton PAM40254
Vingron resolvent PAM160122
OPTIMA 5202

References

  1. Altschul, S. F., Boguski, M. S., Gish, W., and Wootton, J. C. (1994). Issues in searching molecular sequence databases. Nature Genet. 6,119-129.
  2. Altschul, S. F. and Gish, W. (1996). Local alignment statistics. Methods Enzymol. 266,460-480.
  3. Bairoch, A. and Apweiler, R. (1996). The Swiss-Prot protein sequence data bank and its new supplement TrEMBL. Nucleic Acids. Res. 24,21-25.
  4. Barker, W. C., Garavelli, J. S., Haft, D. H., Hunt, L. T., Marzec, C. R., Orcutt, B. C., Srinivasarao, G. Y., Yeh, L. S. L., Ledley, R. S., Mewes, H. W., Pfeiffer, F., and Tsugita, A. (1998). The PIR-International Protein Sequence Database. Nucleic Acids Res 26,27-32.
  5. Dayhoff, M., Schwartz, R. M., and Orcutt, B. C. (1978). A model of evolutionary change in proteins. In Atlas of Protein Sequence and Structure, vol. 5, supplement 3. M. Dayhoff, ed. (Silver Spring, MD: National Biomedical Research Foundation), pp. 345-352.
  6. Jones, D. T., Taylor, W. R., and Thornton, J. M. (1992). The rapid generation of mutation data matrices from protein sequences. Comp. Appl. Biosci. 8,275-282.
  7. Pearson, W. R. and Lipman, D. J. (1988). Improved tools for biological sequence comparison. Proc. Natl. Acad. Sci. USA 85,2444-2448.
  8. Pearson, W. R. (1995). Comparison of methods for searching protein sequence databases. Prot. Sci. 4,1145-1160.
  9. Pearson, W. R. (1996). Effective protein sequence comparison. Methods Enzymol. 266,227-258.
  10. Pearson, W. R. (1998). Empirical statistical estimates for sequence similarity searches. J. Mol. Biol. 276,71-84.
  11. Pearson, W. R. (1999). Flexible similarity searching with the FASTA3 program package. In Bioinformatics Methods and Protocols, S. Misener and S. A. Krawetz, ed. (Totowa, NJ: Humana Press), pp. 185-219.
  12. Smith, T. F. and Waterman, M. S. (1981). Identification of common molecular subsequences. J. Mol. Biol. 147,195-197.
  13. Wootton, J. C. and Federhen, S. (1993). Statistics of local complexity in amino acid sequences and sequence databases. Comput. Chem. 17,149-163.
  14. Zhang, Z., Pearson, W. R., and Miller, W. (1997). Aligning a DNA sequence with a protein sequence. J. Computational Biology 4,339-349.
  15. Needleman, S. B. and C. D. Wunsch (1970). A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48, 443-453.
  16. Muller, T., and M. Vingron. 2000. Modeling amino acid replacement. Journal of Computational Biology 7:761-776.
  17. Kann, M. G. and Goldstein, R. A (2002). Performance evaluation of a new algorithm for the detection of remote homologs with sequence comparison. Proteins 48,367-377.

Warnings

If the search set contains no sequences that are unrelated to the query sequence the statistics will be flawed, because fasta uses low scoring databank sequences as model for false positives. You can mend this by requesting to use instead shuffled versions of databank sequences (option -shuffle).

Note that it is important to choose a gap penalty that is adapted to the choses scoring matrix. For protein searches, see the Notes for recommended gap penalties.

Diagnostic Error Messages

If a "User provided fastA format databank" that is supposed to be a nucleic acid databank contains characters only allowed in protein sequences, the program exits with error message :
  <databank> contains non-nucleic acid characters ! The first one occurs in sequence <sequence name>

Exit status

It exits prematurely with status 255 and an error message if a "User provided fastA format databank" that is supposed to be a nucleic acid databank contains characters only allowed in protein sequences, otherwise it exits with status 0.

Known bugs

None.

See also

Program nameDescription
blast BLAST search of query sequence(s) against sequence search set
ebi_blast WU-BLAST search of query sequence against sequence databank using EBI Web Services
ebi_fasta fastA search of query sequence against sequence databank using EBI Web Services
fasts Protein identification from peptides using fastA algorithm
phiblast Search protein sequence set combining matching of pattern with local alignment of a query sequence surrounding the match
psiblast Iterative BLAST search with generation of profile of protein sequence against protein sequence set
lfasta Finds local alignments between two sequences, using fastA

Author(s)

The wrapper application fasta was written by Guy Bottu (gbottu@vub.ac.be)
BEN, ULB, Brussels, Belgium

The programs fasta,... themselves were written by
William R. Pearson
Department of Biochemistry
Box 440, Jordan Hall
U. of Virginia
Charlottesville, VA

wrp@virginia.EDU

History

Completed 28 August 2002
Modified 19 March 2003 - adapted to fastA version 3.4t21
Modified 20 February 2004 - adapted to fastA version 3.4t23 and added option to search only forward strand
Modified 19 September 2007 - adapted to fastA version 34.26.4
Modified 24 April 2008 - adapted to fastA version 35.1.6
Modified 15 October 2008 - adapted to fastA version 35.4.1 and added complete description lines in output

Target users

This program is intended to be used by everyone and everything, from naive users to embedded scripts.