blast2seq

 

Function

Finds local alignments between two sequences, using BLAST

Description

blast2seq is an EMBOSS "wrapper" program for the program bl2seq from the NCBI's BLAST (Basic Local Alignment Search Tool) suite. It compares two sequences using any of the following BLAST programs:
blastp    compares two protein sequences
blastn    compares two nucleic acid sequences
blastx    compares the six-frame conceptual translation products
          of a nucleic acid sequence with a protein sequence
tblastn   compares a protein sequence with the six-frame
          conceptual translation products of a nucleic acid sequence
tblastx   compares the six-frame translations of a nucleic acid
          sequence with the six-frame translations of a nucleic acid
          sequence
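
The choice of alignment type follows directly from the types of the two input sequences. The sketch below is only an illustration of that mapping: the program names and menu numbers come from the blast2seq prompt, while the function name choose_program and the "nuc"/"prot" labels are assumptions made for this example.

```python
# Program names and menu numbers are those shown in the blast2seq menu.
# The function name and the "nuc"/"prot" labels are hypothetical.
PROGRAMS = {
    ("nuc", "nuc"): ("blastn", 1),
    ("prot", "prot"): ("blastp", 2),
    ("nuc", "prot"): ("blastx", 3),
    ("prot", "nuc"): ("tblastn", 4),
}


def choose_program(first, second, translate_both=False):
    """Return (program name, menu number) for the two sequence types."""
    if translate_both:
        # tblastx translates both sequences, so both must be nucleic acid.
        if (first, second) != ("nuc", "nuc"):
            raise ValueError("tblastx needs two nucleic acid sequences")
        return ("tblastx", 5)
    try:
        return PROGRAMS[(first, second)]
    except KeyError:
        raise ValueError(f"unknown sequence types: {first!r}, {second!r}")


print(choose_program("prot", "prot"))
print(choose_program("nuc", "nuc", translate_both=True))
```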
         

Algorithm

See the explanation of the BLAST algorithm.

Usage

Here is a sample session with blast2seq

> blast2seq
Finds local alignments between two sequences, using BLAST
         1 : blastn (nuc with nuc)
         2 : blastp (prot with prot)
         3 : blastx (nuc translated with prot)
         4 : tblastn (prot with nuc translated)
         5 : tblastx (nuc translated with nuc translated)
Select type of alignment [2]:
Input sequence: sw:tpa_human
Second sequence: sw:urok_human
Word size [3]:
E() value cutoff [10.0]:
Output file [tpa_human.blastp2seq]: tpa_urok.blastp2seq
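
The same session can in principle be run without prompts by giving the answers as qualifiers. The sketch below only builds the argument list and does not run anything. It assumes that blast2seq is on the PATH, that the "sw" database is configured, and that -program accepts the menu number as its value. The helper name blast2seq_argv is hypothetical; every qualifier it uses is listed under "Command line arguments".

```python
def blast2seq_argv(program, asequence, bsequence, outfile,
                   wordsize=None, expect=None):
    """Build an argument list for a non-interactive blast2seq run."""
    argv = ["blast2seq", "-program", str(program),
            "-asequence", asequence, "-bsequence", bsequence,
            "-outfile", outfile]
    # Optional values are only added when given, so the program
    # defaults (which depend on the alignment type) stay in effect.
    if wordsize is not None:
        argv += ["-wordsize", str(wordsize)]
    if expect is not None:
        argv += ["-expect", str(expect)]
    argv.append("-auto")  # general qualifier: turn off prompts
    return argv


argv = blast2seq_argv(2, "sw:tpa_human", "sw:urok_human",
                      "tpa_urok.blastp2seq")
print(" ".join(argv))
```

The resulting list can be handed to subprocess.run(argv, check=True) on a system where the wrapper is installed.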


Command line arguments

   Standard (Mandatory) qualifiers:
   -program            menu       [2] Alignment type : nuc. or prot. (Values:
                                  1 (blastn (nuc with nuc)); 2 (blastp (prot
                                  with prot)); 3 (blastx (nuc translated with
                                  prot)); 4 (tblastn (prot with nuc
                                  translated)); 5 (tblastx (nuc translated
                                  with nuc translated)))
  [-asequence]         sequence   Sequence filename and optional format, or
                                  reference (input USA)
  [-bsequence]         sequence   Sequence filename and optional format, or
                                  reference (input USA)
   -wordsize           integer    [11 for blastn, 3 for other alignment types]
                                  Word size (Any integer value)
   -expect             float      [10.0] E() value = number of sequences with
                                  same or higher bit score that you expect to
                                  find by chance. BLAST lists alignments with
                                  an E() value lower than the cutoff. (Number
                                  0.000 or more)
  [-outfile]           outfile    [*.blast2seq] Output file name

   Additional (Optional) qualifiers (* if not always prompted):
*  -strand             selection  [both] Strand to search. By default BLAST
                                  searches both strands, but for blastn and
                                  (t)blastx you can choose to search only the
                                  top or bottom strand of the second
                                  respectively the first sequence.
*  -match              integer    [1] Nucleotide match reward (Integer 0 or
                                  more)
*  -mismatch           integer    [-3] Nucleotide mismatch penalty (Integer up
                                  to 0)
*  -matrix             selection  [3] Amino acid comparison matrix
*  -gappenalty         integer    [5 for blastn, 11 for other alignment types]
                                  Gap penalty (Integer 0 or more)
*  -gaplength          integer    [2 for blastn, 1 for other alignment types]
                                  Gap length penalty. BLAST subtracts from the
                                  similarity score for each gap a penalty of
                                  type <Gap penalty> + <Gap length penalty> *
                                  n. Only certain combinations of matrix and
                                  gap penalty are allowed, see on-line manual.
                                  (Integer 0 or more)

   Advanced (Unprompted) qualifiers:
   -[no]gaps           toggle     [Y] Make gapped alignments (is default)
   -[no]seqfilter      boolean    [Y] Filter low complexity segments out of
                                  first sequence (is default)
   -seqcoilfilter      boolean    Filter coiled coils out of first sequence
   -seqsoftfilter      boolean    Use soft filtering, that is, filter only at
                                  initial hit searching, not at hit extension
   -effdbsize          float      [0.000] Effective databank size for
                                  statistical calculations (Number 0.000 or
                                  more)

   Associated qualifiers:

   "-asequence" associated qualifiers
   -sbegin1            integer    Start of the sequence to be used
   -send1              integer    End of the sequence to be used
   -sreverse1          boolean    Reverse (if DNA)
   -sask1              boolean    Ask for begin/end/reverse
   -snucleotide1       boolean    Sequence is nucleotide
   -sprotein1          boolean    Sequence is protein
   -slower1            boolean    Make lower case
   -supper1            boolean    Make upper case
   -sformat1           string     Input sequence format
   -sdbname1           string     Database name
   -sid1               string     Entryname
   -ufo1               string     UFO features
   -fformat1           string     Features format
   -fopenfile1         string     Features file name

   "-bsequence" associated qualifiers
   -sbegin2            integer    Start of the sequence to be used
   -send2              integer    End of the sequence to be used
   -sreverse2          boolean    Reverse (if DNA)
   -sask2              boolean    Ask for begin/end/reverse
   -snucleotide2       boolean    Sequence is nucleotide
   -sprotein2          boolean    Sequence is protein
   -slower2            boolean    Make lower case
   -supper2            boolean    Make upper case
   -sformat2           string     Input sequence format
   -sdbname2           string     Database name
   -sid2               string     Entryname
   -ufo2               string     UFO features
   -fformat2           string     Features format
   -fopenfile2         string     Features file name

   "-outfile" associated qualifiers
   -odirectory3        string     Output directory

   General qualifiers:
   -auto               boolean    Turn off prompts
   -stdout             boolean    Write standard output
   -filter             boolean    Read standard input, write standard output
   -options            boolean    Prompt for standard and additional values
   -debug              boolean    Write debug output to program.dbg
   -verbose            boolean    Report some/full command line options
   -help               boolean    Report command line options. More
                                  information on associated and general
                                  qualifiers can be found with -help -verbose
   -warning            boolean    Report warnings
   -error              boolean    Report errors
   -fatal              boolean    Report fatal errors
   -die                boolean    Report dying program messages
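
Several defaults above depend on the alignment type, and the gap cost is the sum of the two gap qualifiers as described under -gaplength. The following sketch restates that arithmetic. The function names are hypothetical, and the numbers are the defaults quoted in the qualifier list.

```python
def default_settings(program):
    """Defaults quoted in the qualifier list, keyed by alignment type."""
    if program == "blastn":
        return {"wordsize": 11, "gappenalty": 5, "gaplength": 2}
    # blastp, blastx, tblastn and tblastx share the protein defaults.
    return {"wordsize": 3, "gappenalty": 11, "gaplength": 1}


def gap_cost(gappenalty, gaplength, n):
    """Penalty for one gap of n positions: gappenalty + gaplength * n."""
    return gappenalty + gaplength * n


for program in ("blastn", "blastp"):
    d = default_settings(program)
    print(program, d, "cost of a 3-residue gap:",
          gap_cost(d["gappenalty"], d["gaplength"], 3))
```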

Standard (Mandatory) qualifiers

-program
    Description:    Alignment type : nuc. or prot.
    Allowed values: 1 (blastn (nuc with nuc))
                    2 (blastp (prot with prot))
                    3 (blastx (nuc translated with prot))
                    4 (tblastn (prot with nuc translated))
                    5 (tblastx (nuc translated with nuc translated))
    Default:        2

[-asequence] (Parameter 1)
    Description:    Sequence filename and optional format, or reference (input USA)
    Allowed values: Readable sequence
    Default:        Required

[-bsequence] (Parameter 2)
    Description:    Sequence filename and optional format, or reference (input USA)
    Allowed values: Readable sequence
    Default:        Required

-wordsize
    Description:    Word size
    Allowed values: Any integer value
    Default:        11 for blastn, 3 for other alignment types

-expect
    Description:    E() value = number of sequences with same or higher bit score that you expect to find by chance. BLAST lists alignments with an E() value lower than the cutoff.
    Allowed values: Number 0.000 or more
    Default:        10.0

[-outfile] (Parameter 3)
    Description:    Output file name
    Allowed values: Output file
    Default:        <sequence>.<program>2seq

Additional (Optional) qualifiers

-strand
    Description:    Strand to search. By default BLAST searches both strands, but for blastn and (t)blastx you can choose to search only the top or bottom strand of the second respectively the first sequence.
    Allowed values: Choose from selection list of values
    Default:        both

-match
    Description:    Nucleotide match reward
    Allowed values: Integer 0 or more
    Default:        1

-mismatch
    Description:    Nucleotide mismatch penalty
    Allowed values: Integer up to 0
    Default:        -3

-matrix
    Description:    Amino acid comparison matrix
    Allowed values: Choose from selection list of values
    Default:        BLOSUM62

-gappenalty
    Description:    Gap penalty
    Allowed values: Integer 0 or more
    Default:        5 for blastn, 11 for other alignment types

-gaplength
    Description:    Gap length penalty. BLAST subtracts from the similarity score for each gap a penalty of type <Gap penalty> + <Gap length penalty> * n. Only certain combinations of matrix and gap penalty are allowed, see on-line manual.
    Allowed values: Integer 0 or more
    Default:        2 for blastn, 1 for other alignment types

Advanced (Unprompted) qualifiers

-[no]gaps
    Description:    Make gapped alignments (is default)
    Allowed values: Toggle value Yes/No
    Default:        Yes

-[no]seqfilter
    Description:    Filter low complexity segments out of first sequence (is default)
    Allowed values: Boolean value Yes/No
    Default:        Yes

-seqcoilfilter
    Description:    Filter coiled coils out of first sequence
    Allowed values: Boolean value Yes/No
    Default:        No

-seqsoftfilter
    Description:    Use soft filtering, that is, filter only at initial hit searching, not at hit extension
    Allowed values: Boolean value Yes/No
    Default:        No

-effdbsize
    Description:    Effective databank size for statistical calculations
    Allowed values: Number 0.000 or more
    Default:        0.000

Input file format

blast2seq needs 2 input sequences of appropriate type, which you can provide with any normal USA.

Output file format

Output files for usage example

File: tpa_urok.blastp2seq

Query= TPA_HUMAN P00750 Tissue-type plasminogen activator precursor
(EC 3.4.21.68) (tPA) (t- PA) (t-plasminogen activator) (Alteplase)
(Reteplase) [Contains: Tissue-type plasminogen activator chain A;
Tissue-type plasminogen activator chain B].
         (562 letters)



>UROK_HUMAN P00749 Urokinase-type plasminogen activator precursor
           (EC 3.4.21.73) (uPA) (U-plasminogen activator)
           [Contains: Urokinase-type plasminogen activator long
           chain A; Urokinase-type plasminogen activator short
           chain A; Urokinase-type plasminogen activator chain B].
          Length = 431

 Score =  299 bits (766), Expect = 1e-85
 Identities = 162/389 (41%), Positives = 214/389 (55%), Gaps = 30/389 (7%)

Query: 189 WCYVFKAGKYSSEFCSTPACSEGNSDCYFGNGSAYRGTHSLTESGASCLPWNSMILIGKV 248
           WC   K  K+  + C      + +  CY GNG  YRG  S    G  CLPWNS  ++ + 
Sbjct: 50  WCNCPK--KFGGQHCEI----DKSKTCYEGNGHFYRGKASTDTMGRPCLPWNSATVLQQT 103

Query: 249 YTAQNPSAQALGLGKHNYCRNPDGDAKPWCHVLKNRRLTWEYCDVPSCS----------- 297
           Y A    A  LGLGKHNYCRNPD   +PWC+V    +   + C V  C+           
Sbjct: 104 YHAHRSDALQLGLGKHNYCRNPDNRRRPWCYVQVGLKPLVQECMVHDCADGKKPSSPPEE 163

Query: 298 ---TCGLRQYSQPQFRIKGGLFADIASHPWQAAIFAKHRRSPGERFLCGGILISSCWILS 354
               CG ++  +P+F+I GG F  I + PW AAI+ +H R     ++CGG L+S CW++S
Sbjct: 164 LKFQCG-QKTLRPRFKIIGGEFTTIENQPWFAAIYRRH-RGGSVTYVCGGSLMSPCWVIS 221

Query: 355 AAHCFQERFPPHHLTVILGRTYRVVPGEEEQKFEVEKYIVHKEFDDDT--YDNDIALLQL 412
           A HCF +        V LGR+      + E KFEVE  I+HK++  DT  + NDIALL++
Sbjct: 222 ATHCFIDYPKKEDYIVYLGRSRLNSNTQGEMKFEVENLILHKDYSADTLAHHNDIALLKI 281

Query: 413 KSDSSRCAQESSVVRTVCLPPADLQLPDWTECELSGYGKHEALSPFYSERLKEAHVRLYP 472
           +S   RCAQ S  ++T+CLP         T CE++G+GK  +    Y E+LK   V+L  
Sbjct: 282 RSKEGRCAQPSRTIQTICLPSMYNDPQFGTSCEITGFGKENSTDYLYPEQLKMTVVKLIS 341

Query: 473 SSRCTSQHLLNRTVTDNMLCAGDTRSGGPQANLHDACQGDSGGPLVCLNDGRMTLVGIIS 532
              C   H     VT  MLCA D     PQ    D+CQGDSGGPLVC   GRMTL GI+S
Sbjct: 342 HRECQQPHYYGSEVTTKMLCAAD-----PQWKT-DSCQGDSGGPLVCSLQGRMTLTGIVS 395

Query: 533 WGLGCGQKDVPGVYTKVTNYLDWIRDNMR 561
           WG GC  KD PGVYT+V+++L WIR + +
Sbjct: 396 WGRGCALKDKPGVYTRVSHFLPWIRSHTK 424



 Score =  130 bits (327), Expect = 1e-34
 Identities = 64/141 (45%), Positives = 78/141 (55%), Gaps = 5/141 (3%)

Query: 72  NSGRAQCHSVPVKSCSEPRCFNGGTCQQALYFSDFV-CQCPEGFAGKCCEIDTRATCYED 130
           + G  + H VP    S   C NGGTC    YFS+   C CP+ F G+ CEID   TCYE 
Sbjct: 18  SKGSNELHQVP----SNCDCLNGGTCVSNKYFSNIHWCNCPKKFGGQHCEIDKSKTCYEG 73

Query: 131 QGISYRGTWSTAESGAECTNWNSSALAQKPYSGRRPDAIRLGLGNHNYCRNPDRDSKPWC 190
            G  YRG  ST   G  C  WNS+ + Q+ Y   R DA++LGLG HNYCRNPD   +PWC
Sbjct: 74  NGHFYRGKASTDTMGRPCLPWNSATVLQQTYHAHRSDALQLGLGKHNYCRNPDNRRRPWC 133

Query: 191 YVFKAGKYSSEFCSTPACSEG 211
           YV    K   + C    C++G
Sbjct: 134 YVQVGLKPLVQECMVHDCADG 154



 Score = 14.6 bits (26), Expect = 8.3
 Identities = 5/14 (35%), Positives = 7/14 (50%)

Query: 291 CDVPSCSTCGLRQY 304
           CD  +  TC   +Y
Sbjct: 31  CDCLNGGTCVSNKY 44


Lambda     K      H
   0.321    0.136    0.453 

Gapped
Lambda     K      H
   0.267   0.0410    0.140 


Matrix: BLOSUM62
Gap Penalties: Existence: 11, Extension: 1
Number of Sequences: 1
Number of Hits to DB: 899
Number of extensions: 47
Number of successful extensions: 16
Number of sequences better than 10.0: 1
Number of HSP's gapped: 3
Number of HSP's successfully gapped: 3
Length of query: 562
Length of database: 431
Length adjustment: 34
Effective length of query: 528
Effective length of database: 397
Effective search space:   209616
Effective search space used:   209616
Neighboring words threshold: 11
Window for multiple hits: 40
X1: 16 ( 7.4 bits)
X2: 38 (14.6 bits)
X3: 64 (24.7 bits)
S1: 26 (14.9 bits)
S2: 26 (14.6 bits)
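
The statistics footer can be checked by hand against the alignment headers. The sketch below uses only numbers printed in the example output. The two formulas (bit score = (Lambda * S - ln K) / ln 2 and E = search space * 2^(-bit score)) are the standard Karlin-Altschul relations and are not stated in this document; treating the gapped Lambda and K as the applicable pair is likewise an assumption. Because Lambda and K are printed rounded, the results only approximate the reported values.

```python
import math

# Values copied from the footer of the example output.
query_len, subject_len, length_adjustment = 562, 431, 34
gapped_lambda, gapped_k = 0.267, 0.0410

eff_query = query_len - length_adjustment      # 528
eff_subject = subject_len - length_adjustment  # 397
search_space = eff_query * eff_subject         # 209616


def bit_score(raw):
    """Convert a raw score S to bits: (Lambda * S - ln K) / ln 2."""
    return (gapped_lambda * raw - math.log(gapped_k)) / math.log(2)


def expect(raw):
    """Expected number of chance hits: search space * 2^(-bits)."""
    return search_space * 2.0 ** (-bit_score(raw))


# Raw scores of the three HSPs in the example (the values in parentheses).
for raw in (766, 327, 26):
    print(raw, round(bit_score(raw), 1), f"{expect(raw):.1e}")
```

The computed search space equals the "Effective search space" line exactly, and the bit scores and E() values land close to the 299, 130 and 14.6 bits and the 1e-85, 1e-34 and 8.3 reported for the three HSPs.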

Data files

The amino acid comparison matrices used to compare proteins are hard coded in the program and cannot be changed.

Notes

Allowed scoring schemes

Only some combinations of matrix and gap penalties are allowed: those for which the K and lambda parameters have been derived "experimentally", by searching a random sequence against a random databank, and have been hard coded in the program. An asterisk marks the recommended combination for each matrix:

scoring matrix   gap penalty   gap length penalty   recommended
BLOSUM90         6             2
                 7             2
                 8             2
                 9             1
                 9             2
                 10            1                    *
                 11            1
BLOSUM80         6             2
                 7             2
                 8             2
                 9             1
                 9             2
                 10            1                    *
                 11            1
                 13            2
                 25            2
BLOSUM62         6             2
                 7             2
                 8             2
                 9             1
                 9             2
                 10            1
                 10            2
                 11            1                    * (the default)
                 11            2
                 12            1
                 13            1
BLOSUM50         9             3
                 10            3
                 11            3
                 12            2
                 12            3
                 13            2                    *
                 13            3
                 14            2
                 15            1
                 15            2
                 16            1
                 16            2
                 17            1
                 18            1
BLOSUM45         10            3
                 11            3
                 12            2
                 12            3
                 13            2
                 13            3
                 14            2                    *
                 15            2
                 16            1
                 16            2
                 17            1
                 18            1
                 19            1
PAM30            5             2
                 6             2
                 7             2
                 8             1
                 9             1                    *
                 10            1
PAM70            6             2
                 7             2
                 8             2
                 9             1
                 10            1                    *
                 11            1
PAM250           11            3
                 12            3
                 13            2
                 13            3
                 14            2
                 14            3
                 15            2                    *
                 15            3
                 16            2
                 17            1
                 17            2
                 18            1
                 19            1
                 20            1
                 21            1
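
A script that drives blast2seq may want to check a matrix and gap penalty combination before starting a run. The sketch below transcribes the table above into a lookup. The names ALLOWED, RECOMMENDED and is_allowed are hypothetical, and the data are only as reliable as this transcription, so the program itself remains the authority.

```python
# (gap penalty, gap length penalty) pairs transcribed from the table above.
ALLOWED = {
    "BLOSUM90": {(6, 2), (7, 2), (8, 2), (9, 1), (9, 2), (10, 1), (11, 1)},
    "BLOSUM80": {(6, 2), (7, 2), (8, 2), (9, 1), (9, 2), (10, 1), (11, 1),
                 (13, 2), (25, 2)},
    "BLOSUM62": {(6, 2), (7, 2), (8, 2), (9, 1), (9, 2), (10, 1), (10, 2),
                 (11, 1), (11, 2), (12, 1), (13, 1)},
    "BLOSUM50": {(9, 3), (10, 3), (11, 3), (12, 2), (12, 3), (13, 2),
                 (13, 3), (14, 2), (15, 1), (15, 2), (16, 1), (16, 2),
                 (17, 1), (18, 1)},
    "BLOSUM45": {(10, 3), (11, 3), (12, 2), (12, 3), (13, 2), (13, 3),
                 (14, 2), (15, 2), (16, 1), (16, 2), (17, 1), (18, 1),
                 (19, 1)},
    "PAM30": {(5, 2), (6, 2), (7, 2), (8, 1), (9, 1), (10, 1)},
    "PAM70": {(6, 2), (7, 2), (8, 2), (9, 1), (10, 1), (11, 1)},
    "PAM250": {(11, 3), (12, 3), (13, 2), (13, 3), (14, 2), (14, 3),
               (15, 2), (15, 3), (16, 2), (17, 1), (17, 2), (18, 1),
               (19, 1), (20, 1), (21, 1)},
}

# Combinations marked with an asterisk in the table.
RECOMMENDED = {
    "BLOSUM90": (10, 1), "BLOSUM80": (10, 1), "BLOSUM62": (11, 1),
    "BLOSUM50": (13, 2), "BLOSUM45": (14, 2), "PAM30": (9, 1),
    "PAM70": (10, 1), "PAM250": (15, 2),
}


def is_allowed(matrix, gappenalty, gaplength):
    """True if the table lists this matrix with these gap penalties."""
    return (gappenalty, gaplength) in ALLOWED.get(matrix.upper(), set())


print(is_allowed("BLOSUM62", 11, 1))
print(is_allowed("BLOSUM62", 5, 1))
```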

Similarly, blastn supports only certain combinations of match reward, mismatch penalty and gap penalties. If both the gap penalty and the gap length penalty are above the maximum in the following table, blastn switches to the statistics for gapless alignments.

match reward / mismatch penalty   gap penalty   gap length penalty
2 / -7                            0             4
                                  2             2
                                  2             4
                                  4             2
                                  4             4
1 / -3                            0             2
                                  1             1
                                  1             2
                                  2             1
                                  2             2
2 / -5                            0             4
                                  2             2
                                  2             4
                                  4             2
                                  4             4
1 / -2                            0             2
                                  1             1
                                  1             2
                                  2             1
                                  2             2
                                  3             1
2 / -3                            0             4
                                  2             2
                                  2             4
                                  3             3
                                  4             2
                                  4             4
                                  5             2
                                  6             2
                                  6             4
4 / -5                            3             5
                                  4             5
                                  5             5
                                  6             5
                                  12            8
1 / -1                            0             2
                                  1             2
                                  2             1
                                  2             2
                                  3             1
                                  3             2
                                  4             1
                                  4             2
5 / -4                            8             6
                                  10            6
                                  25            10
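
The blastn rule can be sketched in the same way. The lookup below transcribes the table above, and classify is a hypothetical helper. Note one loudly labelled assumption: the code reads "above the maximum" as "at or above the maximum", chosen so that the documented blastn defaults (gap penalty 5, gap length penalty 2 with 1 / -3) fall into the gapless case instead of being rejected. The program's actual behaviour may differ.

```python
# (gap penalty, gap length penalty) pairs per (match, mismatch),
# transcribed from the table above.
BLASTN_ALLOWED = {
    (2, -7): {(0, 4), (2, 2), (2, 4), (4, 2), (4, 4)},
    (1, -3): {(0, 2), (1, 1), (1, 2), (2, 1), (2, 2)},
    (2, -5): {(0, 4), (2, 2), (2, 4), (4, 2), (4, 4)},
    (1, -2): {(0, 2), (1, 1), (1, 2), (2, 1), (2, 2), (3, 1)},
    (2, -3): {(0, 4), (2, 2), (2, 4), (3, 3), (4, 2), (4, 4),
              (5, 2), (6, 2), (6, 4)},
    (4, -5): {(3, 5), (4, 5), (5, 5), (6, 5), (12, 8)},
    (1, -1): {(0, 2), (1, 2), (2, 1), (2, 2), (3, 1), (3, 2),
              (4, 1), (4, 2)},
    (5, -4): {(8, 6), (10, 6), (25, 10)},
}


def classify(match, mismatch, gappenalty, gaplength):
    """Classify a blastn scoring scheme against the table."""
    combos = BLASTN_ALLOWED.get((match, mismatch))
    if combos is None:
        return "unsupported reward/penalty"
    if (gappenalty, gaplength) in combos:
        return "supported"
    max_gap = max(g for g, _ in combos)
    max_len = max(n for _, n in combos)
    # Assumption: "above the maximum" is read as "at or above".
    if gappenalty >= max_gap and gaplength >= max_len:
        return "gapless statistics"
    return "unsupported"


print(classify(1, -3, 2, 2))
print(classify(1, -3, 5, 2))
```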

References

Tatiana A. Tatusova, Thomas L. Madden (1999), "Blast 2 sequences - a new tool for comparing protein and nucleotide sequences", FEMS Microbiol Lett. 174:247-250

Warnings

For protein searches only some combinations of matrix and gap penalty are allowed (see Notes).

Diagnostic Error Messages

None.

Exit status

It always exits with status 0.

Known bugs

None.

See also

Program name   Description
blastz         Nonintersecting best local alignments, makes LAJ file
lfasta         Finds local alignments between two sequences, using fastA
matcher        Finds the best local alignments between two sequences
seqmatchall    All-against-all comparison of a set of sequences
sim_lav        Nonintersecting best local alignments, makes LALNVIEW file
supermatcher   Match large sequences against one or more other sequences
water          Smith-Waterman local alignment
wordfinder     Match large sequences against one or more other sequences
wordmatch      Finds all exact matches of a given size between 2 sequences
blast          BLAST search of query sequence(s) against sequence search set
phiblast       Search protein sequence set combining matching of pattern with local alignment of a query sequence surrounding the match
psiblast       Iterative BLAST search with generation of profile of protein sequence against protein sequence set
makeblastdb    Make BLAST format sequence database

Author(s)

The wrapper application blast2seq was written by Guy Bottu (gbottu@vub.ac.be)
BEN, ULB, Brussels, Belgium

The program bl2seq itself was written by a team of developers working at the National Center for Biotechnology Information, Bethesda MD, U.S.A., comprising among others Stephen Altschul, David Lipman, Tom Madden, Alex Schaffer, Sergei Shavirin and Jinghui Zhang.

You can contact the BLAST development team at blast-help@ncbi.nlm.nih.gov

History

Completed 28 August 2002
Modified 17 March 2003 - adapted to BLAST version 2.2.5
Modified 6 September 2004 - added option to search only top or bottom strand
Modified 4 June 2006 - improved handling of sequence filtering

Target users

This program is intended to be used by everyone and everything, from naive users to embedded scripts.
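
For the embedded-script case, a typical need is to pull the HSP summary lines out of a blast2seq output file. The sketch below is one possible way to do that and assumes the line layout shown in the usage example output. The names parse_hsps, HSP_SCORE and HSP_IDENT are hypothetical, and the parser has not been checked against other BLAST versions.

```python
import re

HSP_SCORE = re.compile(
    r"Score\s*=\s*([\d.]+)\s*bits\s*\((\d+)\),\s*Expect\s*=\s*([\d.eE+-]+)")
HSP_IDENT = re.compile(r"Identities\s*=\s*(\d+)/(\d+)")


def parse_hsps(text):
    """Return one dict per HSP with bit score, raw score, E() and identities."""
    hsps = []
    for line in text.splitlines():
        m = HSP_SCORE.search(line)
        if m:
            e_value = m.group(3)
            # BLAST may print very small values as "e-100"; make them parseable.
            if e_value.startswith("e"):
                e_value = "1" + e_value
            hsps.append({"bits": float(m.group(1)),
                         "raw": int(m.group(2)),
                         "expect": float(e_value)})
            continue
        m = HSP_IDENT.search(line)
        if m and hsps:
            # The Identities line belongs to the most recent Score line.
            hsps[-1]["identities"] = (int(m.group(1)), int(m.group(2)))
    return hsps


# Lines copied from the usage example output.
sample = """\
 Score =  299 bits (766), Expect = 1e-85
 Identities = 162/389 (41%), Positives = 214/389 (55%), Gaps = 30/389 (7%)

 Score = 14.6 bits (26), Expect = 8.3
 Identities = 5/14 (35%), Positives = 7/14 (50%)
"""
hsps = parse_hsps(sample)
for hsp in hsps:
    print(hsp)
```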