bscan

 

Function

Scans proteins or nucleic acids for conserved motifs using Blocks

Description

bscan is an EMBOSS "wrapper" program for the BlockSearcher tool from the BLIMPS software suite of the Fred Hutchinson Cancer Research Center (FHCRC). It scans the entries from the Blocks databank against one or more protein sequences or (in translation) one or more nucleic acid sequences. Blocks are short, multiply aligned, ungapped segments corresponding to the most highly conserved regions of proteins.
Note that the Blocks database has not been updated since 7 April 2007.

The rationale behind searching a database of blocks is that information from multiply aligned sequences is present in a concentrated form, which reduces background and increases sensitivity to distant relationships. This information is represented in a position-specific scoring table or "profile", in which each column of the alignment is converted to a column of a table representing the frequency of occurrence of each of the 20 amino acids.

To search a database of blocks, the first position of the sequence is aligned with the first position of the first block, and a score for that amino acid is obtained from the profile column corresponding to that position. Scores are summed over the width of the alignment, and then the block is aligned with the next position of the sequence. This procedure is carried out exhaustively for all positions of the sequence and all blocks in the database, and the best alignments between the sequence and entries in the Blocks database are noted.

If a particular block scores highly, it is possible that the sequence is related to the group of sequences the block represents. Typically, a group of proteins has more than one region in common, and their relationship is represented as a series of blocks separated by unaligned regions. If a second block for a group also scores highly in the search, the evidence that the sequence is related to the group is strengthened, and it is strengthened further if a third block also scores highly, and so on.
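The sliding-window scoring described above can be sketched in a few lines of Python. This is a simplified illustration only, not the BLIMPS implementation: the function name scan_block, the dictionary-based profile format and the toy scores are all assumptions made for this example, and a real block profile holds weighted scores for all 20 amino acids in every column.

```python
def scan_block(sequence, profile):
    """Slide one block profile along a sequence and score every offset.

    `profile` is a list with one dict per block column, mapping an amino
    acid letter to its score in that column (hypothetical toy format).
    Returns (offset, score) pairs sorted with the best score first.
    """
    width = len(profile)
    hits = []
    for offset in range(len(sequence) - width + 1):
        window = sequence[offset:offset + width]
        # Sum the per-column scores over the width of the block.
        score = sum(column.get(aa, 0) for column, aa in zip(profile, window))
        hits.append((offset, score))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Toy profile of width 3; the scores are illustrative, not real Blocks data.
toy_profile = [
    {"G": 5, "A": 2},
    {"D": 6, "E": 3},
    {"S": 4, "T": 3},
]

best_offset, best_score = scan_block("CQGDSGGPLVC", toy_profile)[0]
print(best_offset, best_score)
```

A full search would repeat this for every block in the databank and then combine the evidence from several high-scoring blocks of the same family, which this sketch does not attempt.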

Algorithm

Usage

Here is a sample session with bscan:

> bscan
Scans proteins or nucleic acids for conserved motifs using Blocks
Input sequence(s): sw:tpa_human
Output file [tpa_human.bscan]:


Command line arguments

   Standard (Mandatory) qualifiers:
  [-seqs]              seqall     Sequence(s) filename and optional format, or
                                  reference (input USA)
  [-outfile]           outfile    [*.bscan] Output file name

   Additional (Optional) qualifiers (* if not always prompted):
   -blocks             infile     [/opt/sw/blocks/blocks.dat] Blocks file.
                                  Default is the Blocks databank, but you can
                                  choose a personal file with protein motifs
                                  in blocks format instead.
   -expect             float      [1.0] Combined E() value = number of
                                  multiple block hits that you expect (Number
                                  from 0.000 to 100.000)
*  -gencode            menu       [0] Genetic code for translating sequences
                                  (Values: 0 (Standard); 1 (Vertebrate
                                  Mitochondrial); 2 (Yeast Mitochondrial); 3
                                  (Mold Mitochondrial and Mycoplasma); 4
                                  (Invertebrate Mitochondrial); 5 (Ciliate
                                  Nuclear); 6 (Echinoderm Mitochondrial); 7
                                  (Euplotid Nuclear); 8 (Bacterial and Plant
                                  Plastid); 9 (Alternative Yeast Nuclear); 10
                                  (Ascidian Mitochondrial); 11 (Flatworm
                                  Mitochondrial); 12 (Blepharisma
                                  Macronuclear); 13 (Chlorophycean
                                  Mitochondrial); 14 (Trematode
                                  Mitochondrial); 15 (Scenedesmus obliquus
                                  mitochondrial); 16 (Thraustochytrium
                                  mitochondrial code))
   -format             menu       [1] Output format (Values: 1 (Standard
                                  (summary and alignments)); 2 (Summary only);
                                  3 (GFF))

   Advanced (Unprompted) qualifiers:
   -raw                boolean    Store raw data of single block hits
   -histogram          boolean    Show histogram in raw data

   Associated qualifiers:

   "-seqs" associated qualifiers
   -sbegin1            integer    Start of each sequence to be used
   -send1              integer    End of each sequence to be used
   -sreverse1          boolean    Reverse (if DNA)
   -sask1              boolean    Ask for begin/end/reverse
   -snucleotide1       boolean    Sequence is nucleotide
   -sprotein1          boolean    Sequence is protein
   -slower1            boolean    Make lower case
   -supper1            boolean    Make upper case
   -sformat1           string     Input sequence format
   -sdbname1           string     Database name
   -sid1               string     Entryname
   -ufo1               string     UFO features
   -fformat1           string     Features format
   -fopenfile1         string     Features file name

   "-outfile" associated qualifiers
   -odirectory2        string     Output directory

   General qualifiers:
   -auto               boolean    Turn off prompts
   -stdout             boolean    Write standard output
   -filter             boolean    Read standard input, write standard output
   -options            boolean    Prompt for standard and additional values
   -debug              boolean    Write debug output to program.dbg
   -verbose            boolean    Report some/full command line options
   -help               boolean    Report command line options. More
                                  information on associated and general
                                  qualifiers can be found with -help -verbose
   -warning            boolean    Report warnings
   -error              boolean    Report errors
   -fatal              boolean    Report fatal errors
   -die                boolean    Report dying program messages

   Standard (Mandatory) qualifiers:
   [-seqs] (Parameter 1)
       Description:     Sequence(s) filename and optional format, or reference (input USA)
       Allowed values:  Readable sequence(s)
       Default:         Required
   [-outfile] (Parameter 2)
       Description:     Output file name
       Allowed values:  Output file
       Default:         <sequence>.bscan

   Additional (Optional) qualifiers:
   -blocks
       Description:     Blocks file. Default is the Blocks databank, but you can choose a personal file with protein motifs in blocks format instead.
       Allowed values:  Input file
       Default:         /opt/sw/blocks/blocks.dat
   -expect
       Description:     Combined E() value = number of multiple block hits that you expect
       Allowed values:  Number from 0.000 to 100.000
       Default:         1.0
   -gencode
       Description:     Genetic code for translating sequences
       Allowed values:  0 (Standard)
                        1 (Vertebrate Mitochondrial)
                        2 (Yeast Mitochondrial)
                        3 (Mold Mitochondrial and Mycoplasma)
                        4 (Invertebrate Mitochondrial)
                        5 (Ciliate Nuclear)
                        6 (Echinoderm Mitochondrial)
                        7 (Euplotid Nuclear)
                        8 (Bacterial and Plant Plastid)
                        9 (Alternative Yeast Nuclear)
                        10 (Ascidian Mitochondrial)
                        11 (Flatworm Mitochondrial)
                        12 (Blepharisma Macronuclear)
                        13 (Chlorophycean Mitochondrial)
                        14 (Trematode Mitochondrial)
                        15 (Scenedesmus obliquus mitochondrial)
                        16 (Thraustochytrium mitochondrial code)
       Default:         0
   -format
       Description:     Output format
       Allowed values:  1 (Standard (summary and alignments))
                        2 (Summary only)
                        3 (GFF)
       Default:         1

   Advanced (Unprompted) qualifiers:
   -raw
       Description:     Store raw data of single block hits
       Allowed values:  Boolean value Yes/No
       Default:         No
   -histogram
       Description:     Show histogram in raw data
       Allowed values:  Boolean value Yes/No
       Default:         No

Input file format

bscan reads any normal sequence USA for one or more sequence(s).
You can submit several sequences at the same time. In this case the "wrapper" bscan launches the script once for each individual sequence and makes sure that the output files have different names. Note that the results of the individual scans are not merged in any way.

Output file format

The output looks like this:

Output files for usage example

File: tpa_human.bscan


BLKPROB Version 12/23/06.1
Database=/opt/sw/blocks/blocks.dat
Here are your search results. The database searched was BLOCKS 14.3 (Apr 2007)
consisting of 29,068 blocks representing 5900 nonredundant entries documented 
in InterPro 14.0 keyed to Swiss-Prot 51.3 and TrEMBL 34.3.
If you found the Blocks Searcher useful, please cite:
   S Henikoff & JG Henikoff, "Protein family classification
   based on searching a database of blocks", Genomics 19:97-107 (1994).
==============================================================================
Each numbered result consists of one or more blocks from a InterPro entry found
in the query sequence. One set of the highest-scoring blocks that are in the 
correct order and separated by distances comparable to the Blocks database is 
selected for analysis. If this set includes multiple blocks the probability 
that the lower scoring blocks support the highest scoring block is reported. 
Maps of the database blocks and query sequence are shown:
AAA represents the first block roughly in proportion to its width.
  : represents the minimum distance between blocks in the database.
  . represents the maximum distance between blocks in the database.
< > indicate the sequence has been truncated to fit the page.
The query map is aligned on the highest scoring block. Multiple block hits 
that are consistent with the highest scoring block are separated by colons.
Block hits that are not consistent are mapped below. The alignment of the
query sequence with the sequence closest to it in the BLOCKS database is
shown. The distance between detected blocks is listed as (min, max): for the
database entry followed by the distance in the query. Upper case in the query
indicates at least one occurrence of the residue in that column of the block.

For interpretation of block hits, you might find it worthwhile to obtain the
full set of blocks and documentation for an entry. For this you can use the
MRS server of BEN and "Search" in the Blocks database "for" e.g.
"sac:IPB000104".
=============================================================================
Note: For searches using DNA queries, "Location" refers to the position
in the query in base pairs from 5' to 3' on the + strand, whereas the map and 
alignment show the translated position in amino acid residues as before.
============================================================================= 

Query=TPA_HUMAN  P00750 Tissue-type plasminogen activator precursor (EC 3.4.21.68) (tP
Size=562 Amino Acids
Blocks Searched=29068
Alignments Done=        17081713
Cutoff combined expected value for hits=  1
Cutoff block expected value for repeats/other=  1
==============================================================================
                                                             Combined
Family                                       Strand  Blocks   E-value
IPB000177  Apple domain                          1   4 of 15  4.7e-32
IPB003014  N/apple PAN                           1   3 of 8   1.6e-31
IPB000083  Fibronectin, type I                   1   3 of 3   3.4e-26
IPB001314  Chymotrypsin serine protease family   1   3 of 3   4.1e-24
IPB000001  Kringle                               1   2 of 2   9.7e-22
IPB001254  Serine protease, trypsin family       1   2 of 2   1.4e-18
IPB002049  Laminin-type EGF-like domain          1   1 of 2     0.019
IPB003966  Prothrombin signature                 1   1 of 8     0.099
IPB001438  Type II EGF-like signature            1   1 of 4      0.14
IPB001169  Integrin beta, C-terminal             1   1 of 8      0.73

==============================================================================
>IPB000177 4/15 blocks Combined E-value= 4.7e-32: Apple domain
Block    Frame    Location (aa)      Block E-value
IPB000177K  0        344-376             0.0008
IPB000177L  0        377-415               0.62
IPB000177N  0        499-533            3.4e-12
IPB000177O  0        534-562            4.5e-09
Other reported alignments:

                         |---  252 amino acids---|
           IPB000177 AAAABBBCCCCDDDDDEEEEFFFGGGHHHHIIIIJJJJ:.KKKLLLLMMM:::NNNOOO
           TPA_HUMAN     ::::::::::::::::::::::::::::::::::KKKLLLL::::::::NNNOOO

IPB000177K          <->K   (368,432):343                    
Q5NTB3|FA11_BOVIN  418     GAIIGNQWILTAAHCFNEVKSPNVLRVYSGILN
                           |  |   ||| ||||| |   |  | |  |   
TPA_HUMAN          344     GiLISscWILsAAHCFqErfpPhhLtVilGrty

IPB000177L         K<->L   (-1,0):0                               
Q6AZS7|Q6AZS7_XENLA456     ILNITKSTPFSELEKIIIHPHYTGAGNGSDIALLKLKTP
                                      | || | |          ||||| ||  
TPA_HUMAN          377     rvvpgEEeqkFEVEKyIVHkEfdddtydnDIALLqLKsd

IPB000177N         L<->N   (69,71):83                         
KLKB1_MOUSE|P26262 564     AGYKEGGTDACKGDSGGPLVCKHSGRWQLVGITSW
                            |      ||| |||||||||   ||  |||| ||
TPA_HUMAN          499     gGpqanlhDACqGDSGGPLVClnDGRmtLVGIiSW

IPB000177O         N<->O   (-1,0):0                     
KLKB1_MOUSE|P26262 599     GEGCGRKDQPGVYTKVSEYMDWILEKTQS
                           | ||| || |||||||  | |||      
TPA_HUMAN          534     GlGCGQKDvPGVYTKVtnYLDWIrdnmrp

------------------------------------------------------------------------------
>IPB003014 3/8 blocks Combined E-value= 1.6e-31: N/apple PAN
Block    Frame    Location (aa)      Block E-value
IPB003014D  0        340-358            2.1e-07
IPB003014G  0        509-519            6.3e-07
IPB003014H  0        529-556            2.6e-13
Other reported alignments:

                         |---  359 amino acids---|
           IPB003014 AA::::...............B:::::::::..............CD:::::EF:G:HH
           TPA_HUMAN                     ::::::::::::::::::::::::D::::::::::G:HH

IPB003014D          <->D   (247,721):339      
Q5NTB3|FA11_BOVIN  414     HLCGGAIIGNQWILTAAHC
                            ||||  |   ||| ||||
TPA_HUMAN          340     fLCGGiLISScWILSAAHC

IPB003014G         D<->G   (125,139):150
P06868|PLMN_BOVIN  758     CQGDSGGPLVC  
                           |||||||||||  
TPA_HUMAN          509     CQGDSGGPLVC  

IPB003014H         G<->H   (8,9):9                     
P26262|KLKB1_MOUSE 594     GITSWGEGCGRKDQPGVYTKVSEYMDWI
                           || ||| ||| || |||||||  | |||
TPA_HUMAN          529     GIISWGLGCGQKDvPGVYTKVtnYLDWI

------------------------------------------------------------------------------
>IPB000083 3/3 blocks Combined E-value= 3.4e-26: Fibronectin, type I
Block    Frame    Location (aa)      Block E-value
IPB000083A  0         41-58             1.4e-07
IPB000083B  0        342-361            1.8e-11
IPB000083C  0        510-519              0.004
Other reported alignments:

                         |---  265 amino acids---|
           IPB000083 AA:..........................BB:............................C
           TPA_HUMAN AA:::::::::::::::::::::::::::BB::::::::::::::C

IPB000083A          <->A   (3,2300):40       
TPA_HUMAN|P00750   41      CRDEKTQMIYQQHQSWLR
                           ||||||||||||||||||
TPA_HUMAN          41      CRDEKTQMIYQQHQSWLR

IPB000083B         A<->B   (10,286):283        
TPA_HUMAN|P00750   342     CGGILISSCWILSAAHCFQE
                           ||||||||||||||||||||
TPA_HUMAN          342     CGGILISSCWILSAAHCFQE

IPB000083C         B<->C   (8,302):148
FA12_BOVIN|P98140  538     QGDSGGPLVC 
                           |||||||||| 
TPA_HUMAN          510     QGDSGGPLVC 

------------------------------------------------------------------------------

  [Part of this file has been deleted for brevity]

------------------------------------------------------------------------------
>IPB001169 1/8 blocks Combined E-value=    0.73: Integrin beta, C-terminal
Block    Frame    Location (aa)      Block E-value
IPB001169F  0         99-120               0.74
Other reported alignments:

                         |---  382 amino acids---|
           IPB001169 A::::...BBB..CC..DD:...EEEE:::::::......F:......G........H
           TPA_HUMAN                                   ::::::F

IPB001169F          <->F   (381,705):98          
ITB1B_XENLA|P12607 480     GNGTFECGACRCNEGRIGKECE
                               |    | | ||  || ||
TPA_HUMAN          99      qAlyFSdfVCQCpEGFAGKcCE

------------------------------------------------------------------------------

10 possible hits reported

Interpreting results of a search

Heading
Query = Description line from query sequence
Size = Number of amino acids for protein query or base pairs for DNA query. Be sure this number is correct before interpreting your results.
Blocks searched = Number of blocks searched with query.
Alignments done = Number of alignments done between query and blocks searched. This number is used to determine the expected value for each hit.
Cutoff expected value = Maximum combined E-value reported. This is the number of matches expected to be found merely by chance.

Summary
One line is printed per hit, where a hit consists of blocks that belong to one protein family represented in the database of blocks searched and that have a combined E-value less than or equal to the cutoff.

Details
Detailed information is printed for each hit, including alignments with the most similar sequence in each block.
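
For scripted post-processing, the summary lines can be picked out of the standard output with a regular expression. The sketch below is an assumption-laden illustration rather than an official parser: the SummaryHit type, the parse_summary function and the pattern are hypothetical names, and the pattern is inferred only from the sample output shown above, so other BLKPROB versions or DNA queries may need adjustments.

```python
import re
from typing import NamedTuple


class SummaryHit(NamedTuple):
    family: str        # e.g. IPB000177
    description: str   # e.g. Apple domain
    strand: int
    blocks_hit: int    # the "4" in "4 of 15"
    blocks_total: int  # the "15" in "4 of 15"
    evalue: float      # combined E-value


# Pattern inferred from the sample summary table (an assumption).
SUMMARY_LINE = re.compile(
    r"^(\S+)\s+(.+?)\s+(-?\d+)\s+(\d+) of (\d+)\s+([0-9.eE+-]+)\s*$"
)


def parse_summary(text):
    """Return one SummaryHit per summary line found in bscan output text."""
    hits = []
    for line in text.splitlines():
        match = SUMMARY_LINE.match(line)
        if match:
            family, desc, strand, hit, total, evalue = match.groups()
            hits.append(SummaryHit(family, desc, int(strand),
                                   int(hit), int(total), float(evalue)))
    return hits


# Lines copied from the example output above.
sample = """\
Family                                       Strand  Blocks   E-value
IPB000177  Apple domain                          1   4 of 15  4.7e-32
IPB000083  Fibronectin, type I                   1   3 of 3   3.4e-26
IPB002049  Laminin-type EGF-like domain          1   1 of 2     0.019
"""

for hit in parse_summary(sample):
    print(hit.family, hit.blocks_hit, hit.blocks_total, hit.evalue)
```

If machine-readable output is the main goal, the GFF output format (-format 3) may be a better starting point than parsing the standard report.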

Data files

By default, bscan scans the Blocks databank. To search a personal databank of protein motifs in blocks format instead, set -blocks=mydatabank.
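
When bscan is called from a script, the command line can be assembled programmatically. The following is a minimal sketch that uses only qualifiers listed under "Command line arguments"; the helper name bscan_command is hypothetical, and the file name mydatabank is a placeholder.

```python
def bscan_command(seqs, outfile, blocks=None, expect=None, fmt=None):
    """Build an argument list for a non-interactive bscan run.

    Only qualifiers documented for bscan are used. The range checks
    mirror the documented limits for -expect and -format.
    """
    # -auto turns off prompts so the call can run unattended.
    cmd = ["bscan", "-seqs", seqs, "-outfile", outfile, "-auto"]
    if blocks is not None:
        cmd.append(f"-blocks={blocks}")
    if expect is not None:
        if not 0.0 <= expect <= 100.0:
            raise ValueError("-expect must be between 0.000 and 100.000")
        cmd.append(f"-expect={expect}")
    if fmt is not None:
        if fmt not in (1, 2, 3):
            raise ValueError("-format must be 1, 2 or 3")
        cmd.append(f"-format={fmt}")
    return cmd


# Placeholder databank name; replace with a real file in blocks format.
cmd = bscan_command("sw:tpa_human", "tpa_human.bscan", blocks="mydatabank")
print(" ".join(cmd))
```

On a system where bscan is installed, the resulting list could be passed to subprocess.run(cmd, check=True) from the Python standard library.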

Notes

None.

References

  1. Henikoff S, Henikoff JG : Protein family classification based on searching a database of blocks, Genomics 1994, 19:97-107.
  2. Henikoff S, Henikoff JG : Automated assembly of protein blocks for database searching, Nucleic Acids Res. 1991, 19:6565-6572.
  3. Henikoff JG, Henikoff S : Using substitution probabilities to improve position-specific scoring matrices, CABIOS 1996, 12:135-143.
  4. Wallace JC, Henikoff S : PATMAT : a searching and extraction program for sequence, pattern, and block queries and databases, CABIOS 1992, 8:249-254.
  5. Henikoff S, Henikoff JG : A protein family classification method for analysis of large DNA sequences, Proc. 27th HICSS 1994, p. 265-274.
  6. Henikoff S, Henikoff JG : Position-based sequence weights, J. Mol. Biol. 1994, 243:574-578.
  7. Tatusov RL, Altschul SF, Koonin EV : Detection of conserved segments in proteins : Iterative scanning of sequence databases with alignment blocks, PNAS 1994, 91:12091-12095.
  8. Henikoff S, Henikoff JG : Embedding strategies for effective use of multiple sequence alignment information, Protein Sci. 1997, 6:698-705.
  9. Bailey TL, Gribskov M : Combining evidence using p-values : application to sequence homology searches, Bioinformatics 1998, 14:48-54.

Warnings

None.

Diagnostic Error Messages

None.

Exit status

It always exits with status 0.

Known bugs

None.

See also

Program name     Description
antigenic        Finds antigenic sites in proteins
digest           Protein proteolytic enzyme or reagent cleavage digest
emast            Motif detection
ememe            Motif detection
epestfind        Finds PEST motifs as potential proteolytic cleavage sites
fuzzpro          Protein pattern search
fuzztran         Protein pattern search after translation
genemark         Finds potential genes using a species specific HMM
getorf           Finds and extracts open reading frames (ORFs)
helixturnhelix   Report nucleic acid binding motifs
iprscan          Scans proteins or nucleic acids for conserved motifs using Interpro tools
marscan          Finds MAR/SAR sites in nucleic sequences
oddcomp          Find protein sequence regions with a biased composition
patmatdb         Search a protein sequence with a motif
patmatmotifs     Search a PROSITE motif database with a protein sequence
pepcoil          Predicts coiled coil regions
phiblast         Search protein sequence set combining matching of pattern with local alignment of a query sequence surrounding the match
plotorf          Plot potential open reading frames
preg             Regular expression search of a protein sequence
pscan            Scans proteins for conserved motifs using PRINTS
ps_scan          Scans proteins for conserved motifs using PROSITE (patterns and profiles)
showorf          Pretty output of DNA translations
sigcleave        Reports protein signal cleavage sites
sixpack          Display a DNA sequence with 6-frame translation and ORFs
syco             Synonymous codon usage Gribskov statistic plot
tcode            Fickett TESTCODE statistic to identify protein-coding DNA
wobble           Wobble base plot
ehmmpfam         Scans sequences using Pfam-A or other HMM database

Author(s)

The wrapper application bscan was written by Guy Bottu (gbottu@vub.ac.be)
BEN, ULB, Brussels, Belgium

The BLIMPS software suite itself was written by Bill Alford (billa@willapabay.org)

For BLIMPS questions, please contact:
Jorja Henikoff                                  jorja@fhcrc.org
Fred Hutchinson Cancer Research Center          FAX: 206-667-5889
1100 Fairview AV N, A1-162, PO Box 19024        Seattle, WA 98109-1024

History

Completed 30 August 2002
Modified 22 September 2004 - adapted to BLIMPS version 3.6

Target users

This program is intended to be used by everyone and everything, from naive users to embedded scripts.