psiblast

 

Function

Iterative BLAST search with generation of profile of protein sequence against protein sequence set

Description

psiblast is an EMBOSS "wrapper" program for the programs blastall and blastpgp of the BLAST suite that allows to access the PSI-BLAST and PSI-TBLASTN algorithms.

The blastpgp program can do an iterative search in which sequences found in one round of searching are used to build a score model for the next round of searching. In this usage, the program is called Position-Specific Iterated BLAST, or PSI-BLAST. The BLAST algorithm is not tied to a specific score matrix. Traditionally, it has been implemented using an AxA substitution matrix where A is the alphabet size. PSI-BLAST instead uses a QxA matrix, where Q is the length of the query sequence; at each position the cost of a letter depends on the position w.r.t. the query and the letter in the subject sequence.

The position-specific matrix for round i+1 is built from a constrained multiple alignment among the query and the sequences found with sufficiently low e-value in round i. The top part of the output for each round distinguishes the sequences into sequences found previously and used in the score model and sequences not used in the score model. PSI-BLAST "converges" and stops if all sequences found at round i+1 below the e-value threshold were already in the model at the beginning of the round. To prevent the program from running indefinitively, the psiblast "wrapper" has a built-in limit of by default 10 rounds, but you can change this.

The -makecheckout -checkout=xxx and -checkin=xxx options provide a "checkpointing" facility whereby a score model can be stored and later reused. When using -checkin, it is required that the query sequence specified on the command line match exactly the query sequence in the restart file. You can however search a different databank.

PSI-TBLASTN searches a protein query sequence against a nucleotide sequence database using a position specific matrix created by PSI-BLAST. The nucleotide sequence database is dynamically translated in all reading frames during PSI-TBLASTN search. Using a position specific matrix may enable finding more distantly related sequences. You can access this facility by running the "wrapper" program psiblast with options -psitblastn and -checkin=xxx.

Algorithm

   Many functionally and evolutionarily important protein similarities are recognizable only through comparison of three-dimensional structures. When such structures are not available, patterns of conservation identified from the alignment of related sequences can aid the recognition of distant similarities. There is a large literature on the definition and construction of these patterns, which have been variously called motifs, profiles, position-specific score matrices, and Hidden Markov Models. In essence, for each position in the derived pattern, every amino acid is assigned a score. If a residue is highly conserved at a particular position, that residue is assigned a high positive score, and others are assigned high negative scores. At weakly conserved positions, all residues receive scores near zero. Position-specific scores can also be assigned to potential insertions and deletions.
   The power of profile methods can be further enhanced through iteration of the search procedure. After a profile is run against a database, new similar sequences can be detected. A new multiple alignment, which includes these sequences, can be constructed, a new profile abstracted, and a new database search performed. The procedure can be iterated as often as desired or until convergence, when no new statistically significant sequences are detected.

The design of PSI-BLAST

   Iterated profile search methods have led to biologically important observations but, for many years, were quite slow and generally did not provide precise means for evaluating the significance of their results. This limited their utility for systematic mining of the protein databases. The principal design goals in developing the Position-Specific Iterated BLAST (PSI-BLAST) program were speed, simplicity and automatic operation. The procedure PSI-BLAST uses can be summarized in five steps:
(1) PSI-BLAST takes as an input a single protein sequence and compares it to a protein database, using the gapped BLAST program.
(2) The program constructs a multiple alignment, and then a profile, from any significant local alignments found. The original query sequence serves as a template for the multiple alignment and profile, whose lengths are identical to that of the query. Different numbers of sequences can be aligned in different template positions.
(3) The profile is compared to the protein database, again seeking local alignments. After a few minor modifications, the BLAST algorithm can be used for this directly.
(4) PSI-BLAST estimates the statistical significance of the local alignments found. Because profile substitution scores are constructed to a fixed scale, and gap scores remain independent of position, the statistical theory and parameters for gapped BLAST alignments remain applicable to profile alignments.
(5) Finally, PSI-BLAST iterates, by returning to step (2), an arbitrary number of times or until convergence.

Profile-alignment statistics allow PSI-BLAST to proceed as a natural extension of BLAST ; the results produced in iterative search steps are comparable to those produced from the first pass. Unlike most profile-based search methods, PSI-BLAST runs as one program, starting with a single protein sequence, and the intermediate steps of multiple alignment and profile construction are invisible to the user.

Estimation of statistical parameters for local alignment scores

   Computation experiments strongly suggest that the optimal gapped local alignment scores produced by the Smith-Waterman algorithm and approximated by FASTA or Gapped BLAST follow an extreme value distribution. Specifically, the probability that the optimal score S from the comparison of unrelated proteins is at least x is given by the equation




where K and lambda are statistical parameters dependent upon the scoring system and the background amino acid frequences of the sequences being compared. While FASTA estimates these parameters from the scores generated by actual database searches, BLAST estimates them beforehand for specific scoring schemes by comparing many random sequences generated using a standard protein amino acid composition.

Generalization to PSI-BLAST alignment scores

   In order for PSI-BLAST to iterate automatically, it needs to be able to generate accurate estimates of the statistical significance of the alignments it produces. Unfortunately, there is no analytic theory with which to estimate the statistical significance of a gapped local alignment of a profile and a simple sequence. We may conjecture, however, that the empirical statistics for optimal local alignments of two sequences may also apply to alignments involving a profile and a sequence.
   Even were this the case, we are faced with the problem of estimating the statistical parameters lambda and K for profiles we have seen for the first time. As before, this may be done by random simulation, but sufficiently precise estimates require on the order of half an hour on a typical current workstation. This is at least two orders of magnitude too long for use with PSI-BLAST.
   One hope is that if we construct the amino acid scores within each column of a PSI-BLAST profile to the same "scale" (i.e. with the same ungapped lambda) as those for a standard amino acid substitution matrix, and then use the same position-independent gap costs, the same gapped lambda may result.
   To review, for ungapped local alignments, any substitution matrix takes the form




where the qij are the target frequencies for aligned pairs of amino acids, the pi are background frequencies, and the subscript for lambda indicates it is the statistical parameter for ungapped local alignments. For a PSI-BLAST profile, each column has its own unique set of amino acid target frequencies qi. Following (2), the amino acid scores for this column may be constructed to the same "scale" by using the formula




The hope is that, given a specific set of gap costs, the gapped lambda for the PSI-BLAST profile will be the same as the gapped lambda for the standard substitution matrix, which may be calculated in advance.

Compositional adjustments

Amino acid substitution matrices may be adjusted in various ways to compensate for the amino acid compositions of the sequences being compared. The simplest adjustment is to scale all substitution scores by an analytically determined constant, while leaving the gap scores fixed; this procedure is called "composition-based statistics" (Schaffer et al., 2001). The resulting scaled scores yield more accurate E-values than standard, unscaled scores. A more sophisticated approach adjusts each score in a standard substitution matrix separately to compensate for the compositions of the two sequences being compared (Yu et al., 2003; Yu and Altschul, 2005; Altschul et al., 2005). Such "compositional score matrix adjustment" may be invoked only under certain specific conditions for which it has been empirically determined to be beneficial (Altschul et al., 2005); under all other conditions, composition-based statistics are used. Alternatively, compositional adjustment may be invoked universally.

Usage

Here is a sample session with psiblast

> psiblast
Iterative BLAST search with generation of profile of protein sequence
against protein sequence set
Query sequence: sw:papa1_carpa
         1 : standard set
         2 : user defined set
         3 : user provided BLAST databank
Select search set type [1]: 
        sw : SwissProt (highly annotated protein databank)
        up : UniProt (SwissProt + TrEMBL, EMBL ORF translations)
  uniref100 : UniRef100 (UniProt nonredundant subset)
  uniref90 : UniRef90 (UniRef100 subset with no more than 90% identity)
  uniref50 : UniRef50 (UniRef100 subset with no more than 50% identity)
      remt : REM-TrEMBL (old EMBL ORF translations not incl. in UniProt)
       pir : PIR (old general protein databank)
        gp : GenPept (GenBank ORF translations)
   refseqp : RefSeq (NCBI reference protein sequences)
       pdb : PDB (proteins with known 3D structure)
    gpcrdb : G protein coupled receptors
Standard protein search set [up]: sw
Word size [3]: 
E() value cutoff for display [10.0]: 
Maximum number of rounds [10]: 
E() value threshold for inclusion in model [0.002]: 
Output file [papa1_carpa.psiblast]:

        PSI-BLAST converged after 4 rounds

Go to the input files for this example
Go to the output files for this example

Command line arguments

   Standard (Mandatory) qualifiers (* if not always prompted):
  [-seq]               sequence   Query sequence
   -dbtype             menu       [1] Search set type : public databank or
                                  databank provided by user (Values: 1
                                  (standard set); 2 (user defined set); 3
                                  (user provided BLAST databank))
*  -nucdb              menu       [emblnontags] Standard nucleic acid search
                                  set (Values: em (EMBL (general nucleic acid
                                  databank)); emblnontags (EMBL without EST
                                  and GSS); hum (EMBL humans); mus (EMBL
                                  mice); rod (EMBL other rodents); mam (EMBL
                                  other mammals); vrt (EMBL other
                                  vertebrates); inv (EMBL invertebrates); pln
                                  (EMBL plants); fun (EMBL fungi); pro (EMBL
                                  bacteria); phg (EMBL bacteriophages); vrl
                                  (EMBL other viruses); est (EMBL Expressed
                                  Sequence Tags); gss (EMBL Genome Survey
                                  Sequences); sts (EMBL Sequence Tagged
                                  Sites); htg (EMBL High Throughput Genomic);
                                  htc (EMBL High Throughput cDNA); env (EMBL
                                  environmental samples); pat (EMBL patents);
                                  tgn (EMBL transgenic); syn (EMBL synthetic);
                                  unc (EMBL unclassified); new (EMBL updates
                                  since last release); wgs (EMBL Whole Genome
                                  Shotgun); refseq (RefSeq (NCBI reference
                                  sequences)); refseqwgs (RefSeq Whole Genome
                                  Shotgun); refseqgen (RefSeq other genomic);
                                  refseqrna (RefSeq transcripts); vec
                                  (Intelligenetics vector databank); emvec
                                  (EMBL vector subset); epd (Eukaryotic
                                  Promoter Database); ligm (ImMunoGeneTics
                                  databank Igg. + TcR genes); hla
                                  (ImMunoGeneTics databank human MHC genes);
                                  pdbn (PDB (nucleic acids with known 3D
                                  structure)))
*  -protdb             menu       [up] Standard protein search set (Values: sw
                                  (SwissProt (highly annotated protein
                                  databank)); up (UniProt (SwissProt + TrEMBL,
                                  EMBL ORF translations)); uniref100
                                  (UniRef100 (UniProt nonredundant subset));
                                  uniref90 (UniRef90 (UniRef100 subset with no
                                  more than 90% identity)); uniref50
                                  (UniRef50 (UniRef100 subset with no more
                                  than 50% identity)); remt (REM-TrEMBL (old
                                  EMBL ORF translations not incl. in
                                  UniProt)); pir (PIR (old general protein
                                  databank)); gp (GenPept (GenBank ORF
                                  translations)); refseqp (RefSeq (NCBI
                                  reference protein sequences)); pdb (PDB
                                  (proteins with known 3D structure)); gpcrdb
                                  (G protein coupled receptors))
*  -userdb             seqall     User defined search set
*  -userblastdb        infile     User provided BLAST format databank (you can
                                  make one using makeblastdb)
   -wordsize           integer    [3] Word size (Integer from 2 to 3)
   -expect             float      [10.0] E() value = number of databank
                                  sequences with same or higher bit score that
                                  you expect to find by chance. BLAST lists
                                  sequences with an E() value lower than the
                                  cutoff. (Number 0.000 or more)
*  -rounds             integer    [10] Maximum number of rounds (Integer 1 or
                                  more)
*  -include            float      [0.002] PSI-BLAST uses sequences with an E()
                                  value lower than this threshold for
                                  generating a profile. (Number 0.000 or more)
  [-outfile]           outfile    [*.psiblast] Output file name
*  -asciiout           outfile    [*.psiblast] Name output file with profile
                                  in ASCII format
*  -checkout           outfile    [*.psiblast] Name output file with profile
                                  in BLAST-readable format

   Additional (Optional) qualifiers (* if not always prompted):
   -matrix             selection  [3] Amino acid comparison matrix
   -gappenalty         integer    [11] Gap penalty (Integer 0 or more)
   -gaplength          integer    [1] Gap length penalty. BLAST subtracts from
                                  the similarity score for each gap a penalty
                                  of type <Gap penalty> + <Gap length
                                  penalty> * n. Only certain combinations of
                                  matrix and gap penalty are allowed, see
                                  on-line manual. (Integer 0 or more)
*  -dbgencode          menu       [1] Genetic code for translating databank
                                  sequences (Values: 1 (Standard); 2
                                  (Vertebrate Mitochondrial); 3 (Yeast
                                  Mitochondrial); 4 (Mold, Protozoan,
                                  Coelenterate Mitochondrial and
                                  Mycoplasma/Spiroplasma); 5 (Invertebrate
                                  Mitochondrial); 6 (Ciliate, Dasycladacean
                                  and Hexamita); 9 (Echinoderm Mitochondrial);
                                  10 (Euplotid); 11 (Bacterial); 12
                                  (Alternative Yeast); 13 (Ascidian
                                  Mitochondrial); 14 (Flatworm Mitochondrial);
                                  15 (Blepharisma); 16 (Chlorophycean
                                  Mitochondrial); 21 (Trematode
                                  Mitochondrial); 22 (Scenedesmus obliquus
                                  mitochondrial); 23 (Thraustochytrium
                                  mitochondrial))
*  -compstats          menu       [2] The E() value can be computed more
                                  accurately if the composition of the
                                  sequences being compared is taken into
                                  account. By default the scoring scheme is
                                  adjusted as explained in Bioinformatics
                                  21:902 when an empiric test suggests this to
                                  be beneficial, otherwise the scoring scheme
                                  is rescaled as explained in NAR 29:2994.
                                  You can force psiblast to always use any of
                                  both methods or none. (Values: 0 (none); 1
                                  (scale); 2 (adjust conditionally, otherwise
                                  scale); 3 (adjust))
   -format             menu       [0] Alignment format (Values: 0 (pairwise);
                                  1 (query-anchored, showing identities); 2
                                  (query-anchored, no identities); 3 (flat
                                  query-anchored, show identities); 4 (flat
                                  query-anchored, no identities); 5
                                  (query-anchored, no identities and blunt
                                  ends); 6 (flat query-anchored, no identities
                                  and blunt ends); 7 (XML Blast output); 8
                                  (tabular); 9 (tabular, with comment lines);
                                  10 (ASN.1))

   Advanced (Unprompted) qualifiers:
   -psitblastn         toggle     Run instead search of protein profile
                                  against translated nuc. db
   -checkin            infile     Restart search with profile from previous
                                  search. If you do this your query sequence
                                  must be the same.
   -seqfilter          boolean    Filter low complexity segments out of query
                                  sequence
   -seqcoilfilter      boolean    Filter coiled coils out of query sequence
   -seqsoftfilter      boolean    Use soft filtering, that is, filter only at
                                  initial hit searching, not at hit extension
   -[no]doublehit      boolean    [Y] Try to extend hit only if there is a
                                  second hit (is default)
   -window             integer    [40] Multiple hits window size (Integer 0 or
                                  more)
   -effdbsize          float      [0.000] Effective databank size for
                                  statistical calculations (Number 0.000 or
                                  more)
   -keep               integer    [0] Keep only n best hits from same region.
                                  Default is to show them all. If you use this
                                  option, a value of 100 is recommended.
                                  (Integer 0 or more)
   -listsize           integer    [500] Show only the n best scoring sequences
                                  that satisfy E() cutoff (Integer 0 or more)
   -align              integer    [250] Show only alignments for the n first
                                  sequences (Integer 0 or more, but not >
                                  listsize)
   -fulloutput         boolean    Show output of all rounds
   -makeasciiout       toggle     Make output file with profile in ASCII
                                  format (can only be done if number of rounds
                                  is 2 or more)
   -makecheckout       toggle     Make output file with profile in
                                  BLAST-readable format for restarting search
                                  (can only be done if number of rounds is 2
                                  or more)

   Associated qualifiers:

   "-seq" associated qualifiers
   -sbegin1            integer    Start of the sequence to be used
   -send1              integer    End of the sequence to be used
   -sreverse1          boolean    Reverse (if DNA)
   -sask1              boolean    Ask for begin/end/reverse
   -snucleotide1       boolean    Sequence is nucleotide
   -sprotein1          boolean    Sequence is protein
   -slower1            boolean    Make lower case
   -supper1            boolean    Make upper case
   -sformat1           string     Input sequence format
   -sdbname1           string     Database name
   -sid1               string     Entryname
   -ufo1               string     UFO features
   -fformat1           string     Features format
   -fopenfile1         string     Features file name

   "-userdb" associated qualifiers
   -sbegin             integer    Start of each sequence to be used
   -send               integer    End of each sequence to be used
   -sreverse           boolean    Reverse (if DNA)
   -sask               boolean    Ask for begin/end/reverse
   -snucleotide        boolean    Sequence is nucleotide
   -sprotein           boolean    Sequence is protein
   -slower             boolean    Make lower case
   -supper             boolean    Make upper case
   -sformat            string     Input sequence format
   -sdbname            string     Database name
   -sid                string     Entryname
   -ufo                string     UFO features
   -fformat            string     Features format
   -fopenfile          string     Features file name

   "-outfile" associated qualifiers
   -odirectory2        string     Output directory

   "-asciiout" associated qualifiers
   -odirectory         string     Output directory

   "-checkout" associated qualifiers
   -odirectory         string     Output directory

   General qualifiers:
   -auto               boolean    Turn off prompts
   -stdout             boolean    Write standard output
   -filter             boolean    Read standard input, write standard output
   -options            boolean    Prompt for standard and additional values
   -debug              boolean    Write debug output to program.dbg
   -verbose            boolean    Report some/full command line options
   -help               boolean    Report command line options. More
                                  information on associated and general
                                  qualifiers can be found with -help -verbose
   -warning            boolean    Report warnings
   -error              boolean    Report errors
   -fatal              boolean    Report fatal errors
   -die                boolean    Report dying program messages

Standard (Mandatory) qualifiers Allowed values Default
[-seq]
(Parameter 1)
Query sequence Readable sequence Required
-dbtype Search set type : public databank or databank provided by user
1 (standard set)
2 (user defined set)
3 (user provided BLAST databank)
1
-nucdb Standard nucleic acid search set
em (EMBL (general nucleic acid databank))
emblnontags (EMBL without EST and GSS)
hum (EMBL humans)
mus (EMBL mice)
rod (EMBL other rodents)
mam (EMBL other mammals)
vrt (EMBL other vertebrates)
inv (EMBL invertebrates)
pln (EMBL plants)
fun (EMBL fungi)
pro (EMBL bacteria)
phg (EMBL bacteriophages)
vrl (EMBL other viruses)
est (EMBL Expressed Sequence Tags)
gss (EMBL Genome Survey Sequences)
sts (EMBL Sequence Tagged Sites)
htg (EMBL High Throughput Genomic)
htc (EMBL High Throughput cDNA)
env (EMBL environmental samples)
pat (EMBL patents)
tgn (EMBL transgenic)
syn (EMBL synthetic)
unc (EMBL unclassified)
new (EMBL updates since last release)
wgs (EMBL Whole Genome Shotgun)
refseq (RefSeq (NCBI reference sequences))
refseqwgs (RefSeq Whole Genome Shotgun)
refseqgen (RefSeq other genomic)
refseqrna (RefSeq transcripts)
vec (Intelligenetics vector databank)
emvec (EMBL vector subset)
epd (Eukaryotic Promoter Database)
ligm (ImMunoGeneTics databank Igg. + TcR genes)
hla (ImMunoGeneTics databank human MHC genes)
pdbn (PDB (nucleic acids with known 3D structure))
emblnontags
-protdb Standard protein search set
sw (SwissProt (highly annotated protein databank))
up (UniProt (SwissProt + TrEMBL, EMBL ORF translations))
uniref100 (UniRef100 (UniProt nonredundant subset))
uniref90 (UniRef90 (UniRef100 subset with no more than 90% identity))
uniref50 (UniRef50 (UniRef100 subset with no more than 50% identity))
remt (REM-TrEMBL (old EMBL ORF translations not incl. in UniProt))
pir (PIR (old general protein databank))
gp (GenPept (GenBank ORF translations))
refseqp (RefSeq (NCBI reference protein sequences))
pdb (PDB (proteins with known 3D structure))
gpcrdb (G protein coupled receptors)
up
-userdb User defined search set Readable sequence(s) Required
-userblastdb User provided BLAST format databank (you can make one using makeblastdb) Input file Required
-wordsize Word size Integer from 2 to 3 3
-expect E() value = number of databank sequences with same or higher bit score that you expect to find by chance. BLAST lists sequences with an E() value lower than the cutoff. Number 0.000 or more 10.0
-rounds Maximum number of rounds Integer 1 or more 10
-include PSI-BLAST uses sequences with an E() value lower than this threshold for generating a profile. Number 0.000 or more 0.002
[-outfile]
(Parameter 2)
Output file name Output file <sequence>.<psiblast or tpsiblastn>
-asciiout Name output file with profile in ASCII format Output file  
-checkout Name output file with profile in BLAST-readable format Output file  
Additional (Optional) qualifiers Allowed values Default
-matrix Amino acid comparison matrix Choose from selection list of values BLOSUM62
-gappenalty Gap penalty Integer 0 or more 11
-gaplength Gap length penalty. BLAST subtracts from the similarity score for each gap a penalty of type <Gap penalty> + <Gap length penalty> * n. Only certain combinations of matrix and gap penalty are allowed, see on-line manual. Integer 0 or more 1
-dbgencode Genetic code for translating databank sequences
1 (Standard)
2 (Vertebrate Mitochondrial)
3 (Yeast Mitochondrial)
4 (Mold, Protozoan, Coelenterate Mitochondrial and Mycoplasma/Spiroplasma)
5 (Invertebrate Mitochondrial)
6 (Ciliate, Dasycladacean and Hexamita)
9 (Echinoderm Mitochondrial)
10 (Euplotid)
11 (Bacterial)
12 (Alternative Yeast)
13 (Ascidian Mitochondrial)
14 (Flatworm Mitochondrial)
15 (Blepharisma)
16 (Chlorophycean Mitochondrial)
21 (Trematode Mitochondrial)
22 (Scenedesmus obliquus mitochondrial)
23 (Thraustochytrium mitochondrial)
1
-compstats The E() value can be computed more accurately if the composition of the sequences being compared is taken into account. By default the scoring scheme is adjusted as explained in Bioinformatics 21:902 when an empiric test suggests this to be beneficial, otherwise the scoring scheme is rescaled as explained in NAR 29:2994. You can force psiblast to always use any of both methods or none.
0 (none)
1 (scale)
2 (adjust conditionally, otherwise scale)
3 (adjust)
2
-format Alignment format
0 (pairwise)
1 (query-anchored, showing identities)
2 (query-anchored, no identities)
3 (flat query-anchored, show identities)
4 (flat query-anchored, no identities)
5 (query-anchored, no identities and blunt ends)
6 (flat query-anchored, no identities and blunt ends)
7 (XML Blast output)
8 (tabular)
9 (tabular, with comment lines)
10 (ASN.1)
0
Advanced (Unprompted) qualifiers Allowed values Default
-psitblastn Run instead search of protein profile against translated nuc. db Toggle value Yes/No No
-checkin Restart search with profile from previous search. If you do this your query sequence must be the same. Input file Required
-seqfilter Filter low complexity segments out of query sequence Boolean value Yes/No No
-seqcoilfilter Filter coiled coils out of query sequence Boolean value Yes/No No
-seqsoftfilter Use soft filtering, that is, filter only at initial hit searching, not at hit extension Boolean value Yes/No No
-[no]doublehit Try to extend hit only if there is a second hit (is default) Boolean value Yes/No Yes
-window Multiple hits window size Integer 0 or more 40
-effdbsize Effective databank size for statistical calculations Number 0.000 or more 0.000
-keep Keep only n best hits from same region. Default is to show them all. If you use this option, a value of 100 is recommended. Integer 0 or more 0
-listsize Show only the n best scoring sequences that satisfy E() cutoff Integer 0 or more 500
-align Show only alignments for the n first sequences Integer 0 or more, but not > listsize 250
-fulloutput Show output of all rounds Boolean value Yes/No No
-makeasciiout Make output file with profile in ASCII format (can only be done if number of rounds is 2 or more) Toggle value Yes/No No
-makecheckout Make output file with profile in BLAST-readable format for restarting search (can only be done if number of rounds is 2 or more) Toggle value Yes/No No

Input file format

psiblast searches a query sequence against a search sequence set. For the query sequence you can provide any normal sequence USA for a single protein sequence. You can select your search set in three different ways :
  1. a standard set is a copy of a public databank or a local databank installed by the managers of the server computer and "visible" to all users. You must choose from a selector.
  2. a user defined set is a set of sequences (public and/or private) that you can specify using a normal sequence USA. It is convenient to use a List File. The set will be transformed on-the-fly into a temporary BLAST format databank.
  3. a user provided BLAST databank is a databank in BLAST format (usually private). A BLAST format databank consists of three binary files (for proteins <basename>.phr <basename>.pin <basename>.psq, for nucleic acids <basename>.nhr <basename>.nin <basename>.nsq ; huge databanks can consist of several compound databanks enumerated in a "Library File" <basename>.pal respectively <basename>.nal). You can precise the databank by the name of any of its files (or eventually by the basename, but only if there exists a file with that name).
Note that if you want search a selected set of sequences taken from the public databanks and/or want to search a set of your own private sequences you can choose between options 2 and 3. If your set is small or if you want to search it just once, the "a user defined set" is good, otherwise, to save time, it is recommended to make a databank in BLAST format using the program makeblastdb and choose "a user provided BLAST databank".

Output file format

Output files for usage example

File: papa1_carpa.psiblast

BLASTP 2.2.18 [Mar-02-2008]


Reference: Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schaffer, 
Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997), 
"Gapped BLAST and PSI-BLAST: a new generation of protein database search
programs",  Nucleic Acids Res. 25:3389-3402.


Reference for compositional score matrix adjustment: Altschul, Stephen F., 
John C. Wootton, E. Michael Gertz, Richa Agarwala, Aleksandr Morgulis,
Alejandro A. Schaffer, and Yi-Kuo Yu (2005) "Protein database searches
using compositionally adjusted substitution matrices", FEBS J. 272:5101-5109.


Reference for composition-based statistics starting in round 2:
Schaffer, Alejandro A., L. Aravind, Thomas L. Madden,
Sergei Shavirin, John L. Spouge, Yuri I. Wolf,  
Eugene V. Koonin, and Stephen F. Altschul (2001), 
"Improving the accuracy of PSI-BLAST protein database searches with 
composition-based statistics and other refinements",  Nucleic Acids Res. 29:2994-3005.

Query= PAPA1_CARPA P00784 Papain precursor (EC 3.4.22.2) (Papaya
proteinase I) (PPI) (Allergen Car p 1).
         (345 letters)

Database: SwissProt (manually annotated part of UniProt, including
splice variants) 
           387,470 sequences; 145,667,171 total letters

Results from round 4


                                                                 Score    E
Sequences producing significant alignments:                      (bits) Value
Sequences used in model and found again:

sw:PAPA1_CARPA Papain precursor (EC 3.4.22.2) (Papaya proteinase...   476   e-133
sw:XCP1_ARATH Xylem cysteine proteinase 1 precursor (EC 3.4.22.-...   455   e-127
sw:RD21A_ARATH Cysteine proteinase RD21a precursor (EC 3.4.22.-)...   454   e-127
sw:PAPA3_CARPA Caricain precursor (EC 3.4.22.30) (Papaya protein...   450   e-126
sw:PAPA2_CARPA Chymopapain precursor (EC 3.4.22.6) (Papaya prote...   445   e-124

  [Part of this file has been deleted for brevity]

sw:CYSP1_CARCN Cysteine proteinase 1 (EC 3.4.22.-) (Cysteine pro...    78   7e-14
sw:THPA_THADA Thaumatopain (EC 3.4.22.-) (Fragment).                   53   3e-06
sw:MDO2_PSEMR Macrodontain-2 (EC 3.4.22.-) (Macrodontain II) (Fr...    50   2e-05
sw:PHIG1_SARGI Philibertain g 1 (EC 3.4.22.-) (Philibertain g I)...    47   1e-04
sw:CATB_COTJA Cathepsin B (EC 3.4.22.1) (Cathepsin B1) (Fragments).    44   0.002

Sequences not found previously or not previously below threshold:

sw:PEPG_LACDL Aminopeptidase G (EC 3.4.22.-).                          43   0.003
sw:PEPW_LACDL Aminopeptidase W (EC 3.4.22.-).                          41   0.011
sw:PEPC_LISMO Aminopeptidase C (EC 3.4.22.40) (Bleomycin hydrola...    40   0.019
sw:PEPC_LISIN Aminopeptidase C (EC 3.4.22.40) (Bleomycin hydrola...    40   0.019
sw:PEPE_LACHE Aminopeptidase E (EC 3.4.22.-).                          39   0.055

  [Part of this file has been deleted for brevity]

sw:YOW4_SCHPO Uncharacterized RNA-binding protein C1861.04c.           32   6.8  
sw:BLMH_RAT Bleomycin hydrolase (EC 3.4.22.40) (BLM hydrolase) (...    32   7.0  
sw:K2C8_POTTR Keratin, type II cytoskeletal 8 (Cytokeratin-8) (C...    32   7.1  
sw:BLMH_MOUSE Bleomycin hydrolase (EC 3.4.22.40) (BLM hydrolase)...    32   7.2  
sw:MYBD_DICDI Myb-like protein D.                                      31   8.4  


CONVERGED!
>sw:PAPA1_CARPA Papain precursor (EC 3.4.22.2) (Papaya proteinase I)
           (PPI) (Allergen Car p 1).
          Length = 345

 Score =  476 bits (1225), Expect = e-133,   Method: Composition-based stats.
 Identities = 345/345 (100%), Positives = 345/345 (100%)

Query: 1   MAMIPSISKLLFVAICLFVYMGLSFGDFSIVGYSQNDLTSTERLIQLFESWMLKHNKIYK 60
           MAMIPSISKLLFVAICLFVYMGLSFGDFSIVGYSQNDLTSTERLIQLFESWMLKHNKIYK
Sbjct: 1   MAMIPSISKLLFVAICLFVYMGLSFGDFSIVGYSQNDLTSTERLIQLFESWMLKHNKIYK 60

Query: 61  NIDEKIYRFEIFKDNLKYIDETNKKNNSYWLGLNVFADMSNDEFKEKYTGSIAGNYTTTE 120
           NIDEKIYRFEIFKDNLKYIDETNKKNNSYWLGLNVFADMSNDEFKEKYTGSIAGNYTTTE
Sbjct: 61  NIDEKIYRFEIFKDNLKYIDETNKKNNSYWLGLNVFADMSNDEFKEKYTGSIAGNYTTTE 120

Query: 121 LSYEEVLNDGDVNIPEYVDWRQKGAVTPVKNQGSCGSCWAFSAVVTIEGIIKIRTGNLNE 180
           LSYEEVLNDGDVNIPEYVDWRQKGAVTPVKNQGSCGSCWAFSAVVTIEGIIKIRTGNLNE
Sbjct: 121 LSYEEVLNDGDVNIPEYVDWRQKGAVTPVKNQGSCGSCWAFSAVVTIEGIIKIRTGNLNE 180

Query: 181 YSEQELLDCDRRSYGCNGGYPWSALQLVAQYGIHYRNTYPYEGVQRYCRSREKGPYAAKT 240
           YSEQELLDCDRRSYGCNGGYPWSALQLVAQYGIHYRNTYPYEGVQRYCRSREKGPYAAKT
Sbjct: 181 YSEQELLDCDRRSYGCNGGYPWSALQLVAQYGIHYRNTYPYEGVQRYCRSREKGPYAAKT 240

Query: 241 DGVRQVQPYNEGALLYSIANQPVSVVLEAAGKDFQLYRGGIFVGPCGNKVDHAVAAVGYG 300
           DGVRQVQPYNEGALLYSIANQPVSVVLEAAGKDFQLYRGGIFVGPCGNKVDHAVAAVGYG
Sbjct: 241 DGVRQVQPYNEGALLYSIANQPVSVVLEAAGKDFQLYRGGIFVGPCGNKVDHAVAAVGYG 300

Query: 301 PNYILIKNSWGTGWGENGYIRIKRGTGNSYGVCGLYTSSFYPVKN 345
           PNYILIKNSWGTGWGENGYIRIKRGTGNSYGVCGLYTSSFYPVKN
Sbjct: 301 PNYILIKNSWGTGWGENGYIRIKRGTGNSYGVCGLYTSSFYPVKN 345


>sw:XCP1_ARATH Xylem cysteine proteinase 1 precursor (EC 3.4.22.-)
           (AtXCP1).
          Length = 355

 Score =  455 bits (1171), Expect = e-127,   Method: Composition-based stats.
 Identities = 164/354 (46%), Positives = 233/354 (65%), Gaps = 9/354 (2%)

Query: 1   MAMI-PSISKL-LFVAICLFVYMGLSFG-DFSIVGYSQNDLTSTERLIQLFESWMLKHNK 57
           MA   PS+SK  L VAI     +  +F  DFSIVGY+   LT+T++L++LFESWM +H+K
Sbjct: 1   MAFSAPSLSKFSLLVAISASALLCCAFARDFSIVGYTPEHLTNTDKLLELFESWMSEHSK 60

Query: 58  IYKNIDEKIYRFEIFKDNLKYIDETNKKNNSYWLGLNVFADMSNDEFKEKYTGSIAGNYT 117
            YK+++EK++RFE+F++NL +ID+ N + NSYWLGLN FAD++++EFK +Y G     ++
Sbjct: 61  AYKSVEEKVHRFEVFRENLMHIDQRNNEINSYWLGLNEFADLTHEEFKGRYLGLAKPQFS 120

Query: 118 TTELSYEEVLNDGDVNIPEYVDWRQKGAVTPVKNQGSCGSCWAFSAVVTIEGIIKIRTGN 177
                          ++P+ VDWR+KGAV PVK+QG CGSCWAFS V  +EGI +I TGN
Sbjct: 121 RKRQPSANFRYRDITDLPKSVDWRKKGAVAPVKDQGQCGSCWAFSTVAAVEGINQITTGN 180

Query: 178 LNEYSEQELLDCDRR-SYGCNGGYPWSALQLVAQ-YGIHYRNTYPYEGVQRYCRSREKGP 235
           L+  SEQEL+DCD   + GCNGG    A Q +    G+H  + YPY   +  C+ +++  
Sbjct: 181 LSSLSEQELIDCDTTFNSGCNGGLMDYAFQYIISTGGLHKEDDYPYLMEEGICQEQKEDV 240

Query: 236 YAAKTDGVRQVQPYNEGALLYSIANQPVSVVLEAAGKDFQLYRGGIFVGPCGNKVDHAVA 295
                 G   V   ++ +L+ ++A+QPVSV +EA+G+DFQ Y+GG+F G CG  +DH VA
Sbjct: 241 ERVTISGYEDVPENDDESLVKALAHQPVSVAIEASGRDFQFYKGGVFNGKCGTDLDHGVA 300

Query: 296 AVGY----GPNYILIKNSWGTGWGENGYIRIKRGTGNSYGVCGLYTSSFYPVKN 345
           AVGY    G +Y+++KNSWG  WGE G+IR+KR TG   G+CG+   + YP K 
Sbjct: 301 AVGYGSSKGSDYVIVKNSWGPRWGEKGFIRMKRNTGKPEGLCGINKMASYPTKT 354


  [Part of this file has been deleted for brevity]


>sw:CATB_COTJA Cathepsin B (EC 3.4.22.1) (Cathepsin B1) (Fragments).
          Length = 48

 Score = 43.6 bits (102), Expect = 0.002,   Method: Composition-based stats.
 Identities = 14/70 (20%), Positives = 23/70 (32%), Gaps = 28/70 (40%)

Query: 134 IPEYVD----WRQKGAVTPVKNQGSCGSCWAFSAVVTIEGIIKIRTGNLNEYSEQELLDC 189
           +P+  D    W     ++ +++QGS                         E S ++LL C
Sbjct: 1   LPDTFDSRKQWPNCPTISEIRDQGS----------------------VSVEVSAEDLLSC 38

Query: 190 D--RRSYGCN 197
                  GCN
Sbjct: 39  CGFECGMGCN 48


>sw:PEPG_LACDL Aminopeptidase G (EC 3.4.22.-).
          Length = 437

 Score = 42.8 bits (100), Expect = 0.003,   Method: Composition-based stats.
 Identities = 12/42 (28%), Positives = 19/42 (45%), Gaps = 5/42 (11%)

Query: 289 KVDHAVAAVGYGPN-----YILIKNSWGTGWGENGYIRIKRG 325
           +V HA+  VG   +        ++NSWG   GE G+  +   
Sbjct: 358 EVSHAMTLVGVDEDKGDIRQWKVENSWGDKSGEKGFFVMSHN 399


  [Part of this file has been deleted for brevity]


>sw:MYBD_DICDI Myb-like protein D.
          Length = 595

 Score = 31.3 bits (70), Expect = 8.4,   Method: Composition-based stats.
 Identities = 8/44 (18%), Positives = 19/44 (43%), Gaps = 2/44 (4%)

Query: 65  KIYRFEIF--KDNLKYIDETNKKNNSYWLGLNVFADMSNDEFKE 106
           K  +  I+  +++ K     NK   S+    + F D + ++ + 
Sbjct: 436 KHRKNAIWTQEEDEKMAQLYNKYGKSWKAIHSHFDDKTREQVQS 479


  Database: SwissProt (manually annotated part of UniProt, including
  splice variants)
    Posted date:  Apr 23, 2008  4:36 PM
  Number of letters in database: 130,497,792
  Number of sequences in database:  362,782
  
  Database: /dbfb/uniprot/uniprot_sprot_varsplic
    Posted date:  Apr 23, 2008  4:36 PM
  Number of letters in database: 15,169,379
  Number of sequences in database:  24,688
  
Lambda     K      H
   0.313    0.165    0.533 

Lambda     K      H
   0.267   0.0505    0.140 


Matrix: BLOSUM62
Gap Penalties: Existence: 11, Extension: 1
Number of Hits to DB: 502,717,500
Number of Sequences: 387470
Number of extensions: 26812597
Number of successful extensions: 64988
Number of sequences better than 10.0: 246
Number of HSP's better than 10.0 without gapping: 889
Number of HSP's successfully gapped in prelim test: 72
Number of HSP's that attempted gapping in prelim test: 59305
Number of HSP's gapped (non-prelim): 1151
length of query: 345
length of database: 145,667,171
effective HSP length: 117
effective length of query: 228
effective length of database: 100,333,181
effective search space: 22875965268
effective search space used: 22875965268
T: 11
A: 40
X1: 16 ( 7.2 bits)
X2: 38 (14.6 bits)
X3: 64 (24.7 bits)
S1: 41 (21.1 bits)
S2: 70 (31.3 bits)

Data files

The amino acid comparison matrices are hard coded in the program and cannot be changed.

Notes

Note that only some combinations of matrix and gap penalty are allowed, because these are the ones for which the K and lambda parameters have been derived "experimentally" by searching a random sequence against a random databank and have been hard coded in the program :

scoring matrixgap penaltygap length penaltyrecommended
BLOSUM9062
72
82
91
2
101*
111
BLOSUM8062
72
82
91
2
101*
111
132
252
BLOSUM6262
72
82
91
2
101
2
111* (the default)
2
121
131
BLOSUM5093
103
113
122
3
132*
3
142
151
2
161
2
171
181
1
BLOSUM45103
113
122
3
132
3
142*
152
161
2
171
181
191
PAM3052
62
72
81
91*
101
PAM7062
72
82
91
101*
111
PAM250113
123
132
3
142
3
152*
3
162
171
2
181
191
201
211

References

  1. Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W. & Lipman, D.J. (1997) "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs." Nucleic Acids Res. 25:3389-3402.
  2. Schaffer, Alejandro A., L. Aravind, Thomas L. Madden, Sergei Shavirin John L. Spouge, Yuri I. Wolf, Eugene V. Koonin, and Stephen F. Altschul (2001) "Improving PSI-BLAST Protein Database Search Sensitivity with Composition-Based Statistics and Other Refinements." Nucleic Acids Res. 29:2994-3005.
  3. Yu, Y.-K., Wootton, J.C. and Altschul, S.F. (2003) "The compositional adjustment of amino acid substitution matrices," Proc. Natl. Acad. Sci. USA 100:15688-15693.
  4. Altschul, S.F. & Yu, Y.K. (2005). "The construction of amino acid substitution matrices for the comparison of proteins with non-standard compositions." Bioinformatics 21:902-911.
  5. Altschul, S.F., Wootton, J.C., Gertz, E.M., Agarwala, R., Morgulis, A., Schaffer, A.A. and Yu, Y.-K. (2005) "Protein database searches using compositionally adjusted substitution matrices," FEBS J 272(20):5101-9.

Warnings

Only some combinations of matrix and gap penalty are allowed (see Notes).

If you want to use a protein "user provided BLAST databank" and you have in the same directory a nucleic acid databank with same basename you should not use the basename but one of the file names to point to the databank, because otherwise the program would assume you point to the nucleic acid databank and give an error message.

If you want to use the "checkpointing" facility by giving a "check" file produced by phiblast.html or by a previous run of psiblast as input to psiblast, make sure that you have the "filter" settings the same as when you made the "check" file, otherwise the sequence that is actually used might not match.

Diagnostic Error Messages

The program always writes as an error message whether it "converged" or not :

PSI-BLAST converged after <number> rounds
or
PSI-BLAST did not converge after <number> rounds

There are error messages specific to this program, which are issued when the user chooses to provide a databank in BLAST format of his own, but there are problems :

  <databank> cannot be accessed or is not BLAST format databank

  <databank> is not a nucleic acid databank !

  <databank> is not a protein databank !

Exit status

It exits prematurely with status 255 and an error message if there is a problem with a "User provided BLAST format databank", otherwise it exits with status 0.

Known bugs

None.

See also

Program nameDescription
blast BLAST search of query sequence(s) against sequence search set
ebi_blast WU-BLAST search of query sequence against sequence databank using EBI Web Services
ebi_fasta fastA search of query sequence against sequence databank using EBI Web Services
fasta fastA search of query sequence(s) against sequence search set
fasts Protein identification from peptides using fastA algorithm
iprscan Scans proteins or nucleic acids for conserved motifs using Interpro tools
pfmake Creates PROSITE style profile from protein multiple alignment
phiblast Search protein sequence set combining matching of pattern with local alignment of a query sequence surrounding the match
profit Scan a sequence or database with a matrix or profile
prophecy Creates matrices/profiles from multiple alignments
prophet Gapped alignment for profiles
ps_scan Scans proteins for conserved motifs using PROSITE (patterns and profiles)
blast2seq Finds local alignments between two sequences, using BLAST
makeblastdb Make BLAST format sequence database

Author(s)

The wrapper application psiblast was written by Guy Bottu (gbottu@vub.ac.be)
BEN, ULB, Brussels, Belgium

The program blastpgp itself was written by by a team of developers working at the National Center for Biotechnology Information, Bethesda MD, U.S.A., comprising among others Stephen Altschul, David Lipman, Tom Madden, Alex Schaffer, Sergei Shavirin and Jinghui Zhang.

You can contact the BLAST development team at blast-help@ncbi.nlm.nih.gov

History

Completed 28 August 2002
Modified 20 March 2003 - adapted to BLAST version 2.2.5
Modified 23 February 2004 - adapted to BLAST version 2.2.8
Modified 15 October 2004 - updated list of NCBI genetic codes
Modified 4 June 2006 - improved handling sequence filtering and adapted to BLAST version 2.2.12 composition dependent statistics

Target users

This program is intended to be used by everyone and everything, from naive users to embedded scripts.