ps_scan

 

Function

Scans proteins for conserved motifs using PROSITE (patterns and profiles)

Description

ps_scan is an EMBOSS "wrapper" program for the Perl script developed at the SIB to scan one or several patterns and/or profiles from PROSITE against one or several protein sequences.

By default simple motifs are ignored. These are motifs that are marked in the database with "SKIP-FLAG=TRUE", because they are likely to give a huge number of false positives. Examples are glycosylation sites, phosphorylation sites, leucine zippers, etc. You can force ps_scan to search them with the option -noskip. You can also force ps_scan to search only for a selected set of motifs by typing in a comma-separated list of Accession Numbers, e.g. "PS00001,PS00007". In order to find the AC's of the protein motifs of your interest, you can search the PROSITE databank using MRS.

Algorithm

The PROSITE database contains 2 kinds of entries :
  1. patterns, defined in a pattern definition language
  2. profiles (matrices), defined in a generalized profile syntax designed by P. Bucher (see the original document)
The patterns are searched by regular expression matching. Several parameters fine-tune the way this is done :

A pattern-matching engine is said to be greedy if it tries to extend variable-length pattern elements as far as possible. With the sequence "ABCDC" and the pattern "A-x(1,3)-C", a greedy engine produces the match "ABCDC" and a non-greedy one the match "ABC". By default, PROSITE is scanned in greedy mode, unless the option -nogreedy is set.
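
The greedy and non-greedy behaviour described above can be reproduced with an ordinary regular expression engine. The following is a minimal sketch, not the code used by the ps_scan Perl script: the function name prosite_to_regex is hypothetical, and it only handles a simplified subset of the PROSITE pattern syntax (single residues, x, [..], {..}, repetition counts and the < and > anchors at the start or end of an element).

```python
import re

# One pattern element: optional '<', a core, optional (n) or (n,m), optional '>'.
ELEMENT = re.compile(
    r"(<?)(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?(>?)"
)

def prosite_to_regex(pattern, greedy=True):
    """Translate a simplified PROSITE pattern into a Python regular expression."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        m = ELEMENT.fullmatch(element)
        if m is None:
            raise ValueError("unsupported pattern element: %r" % element)
        start, core, low, high, end = m.groups()
        if core == "x":
            core = "."                       # x matches any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"   # {..} lists forbidden residues
        repeat = ""
        if low is not None and high is not None:
            repeat = "{%s,%s}" % (low, high)
            if not greedy:
                repeat += "?"                # shortest possible extension
        elif low is not None:
            repeat = "{%s}" % low
        parts.append(("^" if start else "") + core + repeat + ("$" if end else ""))
    return "".join(parts)

greedy = re.search(prosite_to_regex("A-x(1,3)-C"), "ABCDC").group()
lazy = re.search(prosite_to_regex("A-x(1,3)-C", greedy=False), "ABCDC").group()
print(greedy, lazy)
```

With the example from the text, the greedy translation matches "ABCDC" and the non-greedy one matches "ABC".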

Some patterns may produce distinct but overlapping matches on a given sequence. Additionally, if the pattern contains variable-length elements, some of these matches may be completely included in another one. By default, all overlapping matches are reported rather than just one, and included matches are not reported, but you can change this with the options -nooverlap and -include, respectively. Different combinations of -greedy, -overlap and -include may produce differences in match count and match positions with patterns that contain variable-length elements. An engine that allows overlaps should be greedy in order to reduce the number of multiple hits that are almost entirely overlapping except at the extremities.
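
The difference between overlapping and included matches can be illustrated with a brute-force enumeration. This is only a sketch under simple assumptions (the function names are hypothetical and the filtering rules are a plausible reading of the options, not the exact logic of the ps_scan script): every span of the sequence that the expression matches in full is collected, then included or overlapping spans are filtered out.

```python
import re

def all_matches(regex, seq):
    """Return every (start, end) span of seq that the regex matches in full."""
    compiled = re.compile(regex)
    spans = []
    for start in range(len(seq)):
        for end in range(start + 1, len(seq) + 1):
            if compiled.fullmatch(seq, start, end):
                spans.append((start, end))
    return spans

def drop_included(spans):
    """Discard spans that lie completely inside another span."""
    return [s for s in spans
            if not any(o != s and o[0] <= s[0] and s[1] <= o[1] for o in spans)]

def drop_overlapping(spans):
    """Keep the leftmost span, then only spans that start after the last kept one ends."""
    kept = []
    for span in sorted(spans):
        if not kept or span[0] >= kept[-1][1]:
            kept.append(span)
    return kept

# A-x(1,3)-C on "ABCDC": "ABC" is included in "ABCDC".
included_example = all_matches("A.{1,3}C", "ABCDC")
# C-x-C on "CACAC": two distinct matches that overlap by one residue.
overlap_example = all_matches("C.C", "CACAC")
print(included_example, drop_included(included_example))
print(overlap_example, drop_overlapping(overlap_example))
```

Spans are 0-based and end-exclusive here, unlike the 1-based positions in the program output.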

The PROSITE syntax describes how to treat ambiguities in the pattern, but not how to handle ambiguities in the sequence. In rare sequences in Swiss-Prot and other databases, the characters B and Z are used according to IUBMB nomenclature when a residue may be either Asp or Asn, or Glu or Gln, respectively. The ps_scan program will produce a match if the sequence has a "B" and the pattern allows either a "D" or a "N", or both (and similarly for Z).
Whether the character X should be allowed to match any position of a pattern is more controversial. It is generally useful to accept a single pattern position to match X (unless that pattern position is an x itself, in which case we can accept more). The maximum number of X characters which are allowed to match a non-x position in a pattern can be specified with the -xmatches=n option. The default value is 1.
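
The rules for B, Z and X can be sketched as follows for a fixed-length pattern. This is an illustration only, with hypothetical names: a pattern is represented as a list in which each position is either a set of allowed residues or None for an "x" position, and max_x plays the role of the maximum number of X characters described above.

```python
# Ambiguity codes in the sequence and the residues they may stand for.
AMBIGUOUS = {"B": {"D", "N"}, "Z": {"E", "Q"}}

def position_matches(residue, allowed):
    """True if a sequence residue is compatible with one pattern position."""
    if allowed is None:            # pattern position is "x": anything matches
        return True
    if residue in allowed:
        return True
    # B matches if D or N is allowed, Z matches if E or Q is allowed.
    return bool(AMBIGUOUS.get(residue, set()) & allowed)

def segment_matches(segment, pattern, max_x=1):
    """True if the segment matches, using at most max_x X's on non-x positions."""
    if len(segment) != len(pattern):
        return False
    x_used = 0
    for residue, allowed in zip(segment, pattern):
        if allowed is not None and residue == "X":
            x_used += 1            # X consumed on a non-x pattern position
            if x_used > max_x:
                return False
            continue
        if not position_matches(residue, allowed):
            return False
    return True

pattern = [{"D"}, None, {"S", "T"}]   # corresponds to D-x-[ST]
print(segment_matches("BAS", pattern))            # B may be D
print(segment_matches("XAX", pattern))            # two X on non-x positions
print(segment_matches("XAX", pattern, max_x=2))
```

An X facing an "x" position of the pattern is not counted against the limit in this sketch, in line with the remark that more can be accepted in that case.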

The profiles are searched by a call to the pfscan program of the PFTOOLS suite. The generalized profile format includes one or more cut-off levels (L) for the comparison score. The level 0 cut-off is the trusted cut-off for positive matches, but usually a level -1 cut-off is also defined, above which a match is considered potential, especially if there are other matches in the sequence with the profile. By default, only trusted matches (level 0 or higher) are reported. To obtain information about potential (weak) matches as well, you can run the "wrapper" ps_scan with parameter -level=-1.
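
The relation between scores, cut-off levels and the -level parameter can be sketched like this. The function names and the numeric thresholds are hypothetical, and whether a score equal to a threshold counts as reaching the level is an assumption of the sketch.

```python
def cutoff_level(score, cutoffs):
    """Highest cut-off level whose threshold is reached by score, or None."""
    reached = [level for level, threshold in cutoffs.items() if score >= threshold]
    return max(reached) if reached else None

def reportable(score, cutoffs, min_level=0):
    """True if the match reaches at least min_level (the value given to -level)."""
    level = cutoff_level(score, cutoffs)
    return level is not None and level >= min_level

# Hypothetical profile with a trusted (0) and a weak (-1) cut-off.
cutoffs = {0: 500, -1: 350}
print(cutoff_level(600, cutoffs), cutoff_level(400, cutoffs), cutoff_level(100, cutoffs))
print(reportable(400, cutoffs), reportable(400, cutoffs, min_level=-1))
```

A match scoring 400 in this made-up profile is only a potential match: it is hidden with the default level 0 and shown with level -1.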

The validity of a match with a profile can depend on the context of matches with other motifs. For some profiles the PROSITE databank contains post-processing rules (the PP lines) :

Repeated profiles generally possess high amino acid substitution rates and their identification is highly problematic. Even if the presence of a certain repeat family is known, the exact locations and number of repetitive units often cannot be determined using current profile search. We have implemented a context dependent threshold that allows the detection of strongly divergent repeats when well characterized ones have already been identified.
Two complementary approaches were designed to increase the sensitivity of profiles for the detection of repeats. One approach, Repeats Detection Method 1 (RDM1) consists in defining (computing) a low acceptance threshold placed at level -1 in the profile. For simplicity we will call level 0 cutoff protein-threshold and level -1 cutoff minimal-threshold. When the profile is compared with a given sequence a list of matches with scores greater than the minimal-threshold is collected. The matches are considered as significant only if at least a hit with a score greater than the protein-threshold has been detected in the target protein. In a target sequence where the occurrence of a particular domain has been reported the minimal-threshold represents the score above which the probability of detecting additional copies of the same domain by chance is close to zero.
However, the detection of repeats in proteins where no single domain scores above the protein-threshold remains critical. This is typically the case for more distantly related members of a protein family. To obviate this problem a second approach was devised, Repeats Detection Method 2 (RDM2). The sum of the scores of alignments with scores greater than the minimal-threshold is computed. If the sum of the individual domain scores is larger than a threshold (the sum-of-scores-threshold), these domains are considered to be true homologues. Based on the inspection of the list of positive hits found upon databases searches, we found that a good estimate for the sum-of-scores-threshold is the value of the sum of the protein-threshold with the minimal-threshold. This value was chosen since it represents in theory the minimal match score that would be detected when aligning a profile to a member of a given protein family containing only two copies of a repeat.
Profiles for repetitive domains are tagged with 'RR' or 'R?' in the LEVEL=-1 CUT_OFF line of the profile. When the profile is tagged with 'RR' the two methods RDM1 and RDM2 are applied, whereas when it is tagged with 'R?' only RDM1 is applied. In the output of the program the reported matches are tagged with 'R' or with 'r' when the hits have been detected with RDM1 or RDM2 respectively.
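
The two repeat detection methods can be summarised in a few lines. This is a sketch written from the description above, not the code of the ps_scan script: the function name is hypothetical, the thresholds in the example are made up, and returning a single tag for all accepted matches is a simplification.

```python
def classify_repeats(scores, protein_threshold, minimal_threshold, profile_tag="RR"):
    """Apply RDM1 and, for 'RR' profiles, RDM2 to the match scores of one profile.

    Returns (accepted_scores, tag): tag is 'R' when RDM1 accepts the matches,
    'r' when RDM2 does, and None when the matches are rejected.
    """
    # Matches above the minimal-threshold (level -1 cut-off) are candidates.
    candidates = [s for s in scores if s > minimal_threshold]
    # RDM1: at least one candidate must exceed the protein-threshold (level 0).
    if any(s > protein_threshold for s in candidates):
        return candidates, "R"
    # RDM2: the sum of candidate scores must exceed the sum-of-scores-threshold,
    # estimated as protein-threshold + minimal-threshold.
    if profile_tag == "RR" and sum(candidates) > protein_threshold + minimal_threshold:
        return candidates, "r"
    return [], None

print(classify_repeats([12.0, 6.0, 3.0], 10.0, 5.0))
print(classify_repeats([8.0, 8.0, 3.0], 10.0, 5.0))
print(classify_repeats([8.0, 8.0, 3.0], 10.0, 5.0, profile_tag="R?"))
```

In the second call no single match reaches the protein-threshold, but the two weak matches together exceed 10 + 5, so RDM2 accepts them; with an 'R?' profile the same matches are rejected.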

By default context-sensitive search and RDM are performed (and pfscan is run with cut-off level -1 although only matches with the set cut-off level are reported), but you can switch this off by setting -nopp.

By default a match with a pattern is validated by searching the corresponding sequence range against a mini-profile. If this range matches the mini-profile at cut-off level 0, a level L=(0) is reported, otherwise a level L=(-1) is reported. Validation is not performed if you switch it off by setting -novalidate or if you switch all postprocessing off with -nopp.

Usage

Here is a sample session with ps_scan

> ps_scan
Scans proteins for conserved motifs using PROSITE (patterns and profiles)
Input sequence(s): sw:tpa_human
Output file [tpa_human.ps_scan]:
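
The same kind of run can be prepared from a script. The helper below is a hypothetical sketch that only assembles the argument list from qualifiers documented on this page (-seqs, -outfile, -level, -noskip, -select, -auto); it does not start the program itself.

```python
def build_ps_scan_command(seqs, outfile, level=0, skip=True, select=None):
    """Assemble a non-interactive ps_scan command line as a list of arguments."""
    command = ["ps_scan", "-seqs", seqs, "-outfile", outfile, "-auto"]
    if level != 0:
        command.append("-level=%d" % level)         # e.g. -level=-1 for weak matches
    if not skip:
        command.append("-noskip")                   # also search SKIP-FLAG=TRUE motifs
    if select:
        command.extend(["-select", ",".join(select)])  # comma-separated list of ACs
    return command

command = build_ps_scan_command(
    "sw:tpa_human", "tpa_human.ps_scan", level=-1, select=["PS00001", "PS00007"]
)
print(" ".join(command))
```

The resulting list could be passed to subprocess.run from the Python standard library. Since the program is documented as always exiting with status 0, the return code is not a useful success indicator; checking the output file is more reliable.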


Command line arguments

   Standard (Mandatory) qualifiers:
  [-seqs]              seqall     Protein sequence(s) filename and optional
                                  format, or reference (input USA)
  [-outfile]           outfile    [*.ps_scan] Output file name

   Additional (Optional) qualifiers (* if not always prompted):
   -motifs             infile     [/opt/sw/ps_scan/prosite.dat] Database of
                                  motifs. Default is PROSITE, but you can
                                  choose a personal file with protein patterns
                                  and profiles in PROSITE format instead.
   -[no]validate       toggle     [Y] Validate pattern hits using
                                  mini-profiles
*  -miniprofiles       infile     [/opt/sw/ps_scan/evaluator.dat] Database of
                                  profiles for pattern validation. Default is
                                  PROSITE evaluator file, but you can choose a
                                  personal file instead (see on-line manual).
   -search             menu       [3] Search which motifs (Values: 1 (only
                                  patterns); 2 (only profiles); 3 (patterns
                                  and profiles))
   -[no]skip           boolean    [Y] Ignore (is default) motifs in database
                                  that are labeled with SKIP-FLAG=TRUE, which
                                  means that they are likely to give many
                                  false positives.
   -select             string     [all] By default all motifs in the database
                                  are searched for. You can restrict the
                                  search by providing a comma-separated list
                                  of AC's, e.g. PS00001,PS00007 . If you do
                                  this the skip-flag is ignored. (Any string
                                  is accepted)
*  -[no]greedy         boolean    [Y] Greedy matching (is default) means that
                                  in case of doubt a variable-length pattern
                                  should match an as long as possible sequence
                                  segment
*  -[no]overlap        boolean    [Y] Allow overlapping matches for the same
                                  pattern (is default).
*  -include            boolean    Allow for a variable-length pattern multiple
                                  matches included the one into the other
*  -xmatches           integer    [1] Maximum number of X in sequence allowed
                                  to match pattern (Integer 0 or more)
*  -level              integer    [0] Minimum cut-off level for profiles. You
                                  can choose -1 for increased sensitivity.
                                  (Integer from -1 to 1)
*  -[no]pp             boolean    [Y] Perform post-processing (estimate
                                  validity of match in the context of other
                                  matches)

   Advanced (Unprompted) qualifiers:
   -ranges             toggle     Write as output matching ranges in fastA
                                  format

   Associated qualifiers:

   "-seqs" associated qualifiers
   -sbegin1            integer    Start of each sequence to be used
   -send1              integer    End of each sequence to be used
   -sreverse1          boolean    Reverse (if DNA)
   -sask1              boolean    Ask for begin/end/reverse
   -snucleotide1       boolean    Sequence is nucleotide
   -sprotein1          boolean    Sequence is protein
   -slower1            boolean    Make lower case
   -supper1            boolean    Make upper case
   -sformat1           string     Input sequence format
   -sdbname1           string     Database name
   -sid1               string     Entryname
   -ufo1               string     UFO features
   -fformat1           string     Features format
   -fopenfile1         string     Features file name

   "-outfile" associated qualifiers
   -odirectory2        string     Output directory

   General qualifiers:
   -auto               boolean    Turn off prompts
   -stdout             boolean    Write standard output
   -filter             boolean    Read standard input, write standard output
   -options            boolean    Prompt for standard and additional values
   -debug              boolean    Write debug output to program.dbg
   -verbose            boolean    Report some/full command line options
   -help               boolean    Report command line options. More
                                  information on associated and general
                                  qualifiers can be found with -help -verbose
   -warning            boolean    Report warnings
   -error              boolean    Report errors
   -fatal              boolean    Report fatal errors
   -die                boolean    Report dying program messages

Standard (Mandatory) qualifiers

  [-seqs] (Parameter 1)
      Description:    Protein sequence(s) filename and optional format, or reference (input USA)
      Allowed values: Readable sequence(s)
      Default:        Required
  [-outfile] (Parameter 2)
      Description:    Output file name
      Allowed values: Output file
      Default:        <sequence>.ps_scan

Additional (Optional) qualifiers

  -motifs
      Description:    Database of motifs. Default is PROSITE, but you can choose a personal file with protein patterns and profiles in PROSITE format instead.
      Allowed values: Input file
      Default:        /opt/sw/ps_scan/prosite.dat
  -[no]validate
      Description:    Validate pattern hits using mini-profiles
      Allowed values: Toggle value Yes/No
      Default:        Yes
  -miniprofiles
      Description:    Database of profiles for pattern validation. Default is PROSITE evaluator file, but you can choose a personal file instead (see on-line manual).
      Allowed values: Input file
      Default:        /opt/sw/ps_scan/evaluator.dat
  -search
      Description:    Search which motifs
      Allowed values: 1 (only patterns); 2 (only profiles); 3 (patterns and profiles)
      Default:        3
  -[no]skip
      Description:    Ignore (is default) motifs in database that are labeled with SKIP-FLAG=TRUE, which means that they are likely to give many false positives.
      Allowed values: Boolean value Yes/No
      Default:        Yes
  -select
      Description:    By default all motifs in the database are searched for. You can restrict the search by providing a comma-separated list of AC's, e.g. PS00001,PS00007 . If you do this the skip-flag is ignored.
      Allowed values: Any string is accepted
      Default:        all
  -[no]greedy
      Description:    Greedy matching (is default) means that in case of doubt a variable-length pattern should match an as long as possible sequence segment
      Allowed values: Boolean value Yes/No
      Default:        Yes
  -[no]overlap
      Description:    Allow overlapping matches for the same pattern (is default).
      Allowed values: Boolean value Yes/No
      Default:        Yes
  -include
      Description:    Allow for a variable-length pattern multiple matches included the one into the other
      Allowed values: Boolean value Yes/No
      Default:        No
  -xmatches
      Description:    Maximum number of X in sequence allowed to match pattern
      Allowed values: Integer 0 or more
      Default:        1
  -level
      Description:    Minimum cut-off level for profiles. You can choose -1 for increased sensitivity.
      Allowed values: Integer from -1 to 1
      Default:        0
  -[no]pp
      Description:    Perform post-processing (estimate validity of match in the context of other matches)
      Allowed values: Boolean value Yes/No
      Default:        Yes

Advanced (Unprompted) qualifiers

  -ranges
      Description:    Write as output matching ranges in fastA format
      Allowed values: Toggle value Yes/No
      Default:        No

Input file format

ps_scan reads any normal sequence USA for one or more protein sequence(s).
You can submit several sequences at the same time. In this case the "wrapper" ps_scan will launch the script for each individual sequence and make sure the output files have different names. Do note that the results of the individual scans are in no way merged.

Output file format

The default output looks like :

Output files for usage example

File: tpa_human.ps_scan

PS00021 KRINGLE_1 Kringle domain signature.
    178 - 190  YCRNpdrdskp.WC                                               L=(0)
    266 - 278  YCRNpdgdakp.WC                                               L=(0)
PS00022 EGF_1 EGF-like domain signature 1.
    108 - 119  CqCpeGfaGKcC                                                 L=(0)
PS00134 TRYPSIN_HIS Serine proteases, trypsin family, histidine active site.
    353 - 358  LSAAHC                                                       L=(0)
PS00135 TRYPSIN_SER Serine proteases, trypsin family, serine active site.
    507 - 518  DAcqGDSGGPLV                                                 L=(0)
PS01186 EGF_2 EGF-like domain signature 2.
    108 - 119  CqCpeGFagkc....C                                             L=(0)
PS01253 FN1_1 Fibronectin type-I domain signature.
      41 - 78  CrdektqmiYqqhqsWlRpvlrsnrveyCwCnsgraq...C                    L=(0)
PS50026 EGF_3 EGF-like domain profile.
     82 - 120  PVKSCSePRCFNGGTCQQALYFSDFVCQCPEGFAGKCCE                      L=0
PS50070 KRINGLE_2 Kringle domain profile.
    126 - 208  TCYEDQGISYRGTWSTAESGAECTNWNSSALAQKPYSGRRPDAIrlgLGNHNYCRNPDRD L=0
 SKPWCYVFkAGKYSSEFCSTPAC
    214 - 296  DCYFGNGSAYRGTHSLTESGASCLPWNSMILIGKVYTAQNPSAQalgLGKHNYCRNPDGD L=0
 AKPWCHVLkNRRLTWEYCDVPSC
PS50240 TRYPSIN_DOM Serine proteases, trypsin domain profile.
    311 - 561  IKGGLFADIASHPWQAAIFAKHrrspgERFLCGGILISSCWILSAAHCFQERFPPHHLTV L=0
 ILGRTYRVVPGEEEQKFEVEKYIVHKEFDDDTYDNDIALLQLKSDssrcAQESSVVRTVC
 LPPADLQLPDWTECELSGYGKHEALsPFYSERLKEAHVRLYPSSRCtSQHLLNRTVTDNM
 LCAGDTrSGGPQAnlhdaCQGDSGGPLVCLNDGRMTLVGIISWGLGCGQKDVPGVYTKVT
 NYLDWIRDNMR
PS51091 FN1_2 Fibronectin type-I domain profile.
      39 - 81  VICRDEKTQMIYQQHQSWLRPVLRsNRVEYCWC---NSGRAQCHSV               L=0

First the pattern matches are reported and then the profile matches. For pattern matches amino acids matching an "x" in the pattern are reported in lowercase. For profile matches (and also for pattern matches if validation by mini-profile is on) the score cut-off level L is reported.
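
The default output shown above is easy to read back into a script. The parser below is a minimal sketch based only on the layout of the example file (motif header lines starting in the first column, indented "start - end  sequence  L=..." lines, and continuation lines for long profile matches); the function and field names are hypothetical, and other output variants are not covered.

```python
import re

HEADER = re.compile(r"^(\S+)\s+(\S+)\s*(.*)$")
MATCH = re.compile(r"^\s+(\d+)\s+-\s+(\d+)\s+(\S+)(?:\s+L=\(?(-?\d+)\)?)?\s*$")

def parse_ps_scan(text):
    """Return one dictionary per reported match."""
    hits = []
    motif = None
    for line in text.splitlines():
        if not line.strip():
            continue
        if not line[0].isspace():                 # motif header: AC, ID, description
            m = HEADER.match(line)
            motif = {"ac": m.group(1), "id": m.group(2), "description": m.group(3)}
            continue
        m = MATCH.match(line)
        if m and motif is not None:
            level = m.group(4)
            hits.append(dict(motif,
                             start=int(m.group(1)),
                             end=int(m.group(2)),
                             sequence=m.group(3),
                             level=int(level) if level is not None else None))
        elif hits:                                # continuation of a long match
            hits[-1]["sequence"] += line.strip()
    return hits

SAMPLE = "\n".join([
    "PS00021 KRINGLE_1 Kringle domain signature.",
    "    178 - 190  YCRNpdrdskp.WC                                  L=(0)",
    "PS50070 KRINGLE_2 Kringle domain profile.",
    "    126 - 208  TCYEDQGISYRGTWSTAESGAECTNWNSSALAQKPYSGRRPDAIrlgLGNHNYCRNPDRD L=0",
    " SKPWCYVFkAGKYSSEFCSTPAC",
])
hits = parse_ps_scan(SAMPLE)
for hit in hits:
    print(hit["ac"], hit["start"], hit["end"], hit["level"], len(hit["sequence"]))
```

The continuation line of the profile match is appended to the preceding match, so the stored sequence covers the whole reported range.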

With the option -ranges you obtain instead a file with the matching sequence ranges in FASTA format.

Data files

ps_scan scans by default the PROSITE databank. You can choose to search instead a personal databank with patterns and/or profiles by setting -motifs=mydatabank.

You can write your own patterns :

ID   Gio1; PATTERN.
AC   P1;
DE   potential GPCR coupling site, sequence Gio
PA   [ILV]-x(3)-S-G-x(0,10)-R.
//
ID   Gio2; PATTERN.
AC   P2;
DE   potential GPCR coupling site, sequence Gio
PA   N-x(2)-R-x(1,4)-R.
//
ID   Gio3; PATTERN.
AC   P3;
DE   potential GPCR coupling site, sequence Gio
PA   Y-x-A-x(1,8)-A-[ILV].
//

Note :

  1. Each line starts with a two-letter tag, followed by 3 spaces.
  2. The ID, AC and PA lines are mandatory, the DE line is optional. The entry name (ID) and the accession number (AC) should preferably be unique. The content of the ID, AC and DE lines will be written in the output.
  3. The ID line must end with "; PATTERN." to indicate that it is a pattern.
  4. The AC line must end with ";".
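
Entries that follow these rules can also be generated by a script. The helper below is a hypothetical sketch that writes one pattern entry in the layout of the examples above; it does not check that the pattern itself is valid PROSITE syntax.

```python
def format_pattern_entry(entry_id, accession, pattern, description=None):
    """Format one pattern entry: two-letter tag, 3 spaces, then the content."""
    lines = [
        "ID   %s; PATTERN." % entry_id,   # ID line ends with "; PATTERN."
        "AC   %s;" % accession,           # AC line ends with ";"
    ]
    if description:
        lines.append("DE   %s" % description)       # optional description
    lines.append("PA   %s." % pattern.rstrip("."))  # pattern ends with a period
    lines.append("//")                              # entry terminator
    return "\n".join(lines) + "\n"

entry = format_pattern_entry(
    "Gio2", "P2", "N-x(2)-R-x(1,4)-R",
    description="potential GPCR coupling site, sequence Gio",
)
print(entry, end="")
```

Several entries written one after the other into a file give a personal databank of the kind that can be passed to -motifs.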

You can make your own profiles using pfmake.

You can also make your own file with profiles for the purpose of validating patterns (see above, Algorithm). You must then make sure that the mini-profile corresponding to a pattern with AC PSxxxxx has the AC MPxxxxx.

Notes

None.

References

  1. Alexandre Gattiker, Elisabeth Gasteiger, Amos Bairoch. ScanProsite: a reference implementation of a PROSITE scanning tool. Applied Bioinformatics 2002:1(2) 107-108.
  2. Falquet L., Pagni M., Bucher P., Hulo N., Sigrist C.J., Hofmann K., Bairoch A. The PROSITE database, its status in 2002 Nucleic Acids Res. 30:235-238(2002)

Warnings

None.

Diagnostic Error Messages

None.

Exit status

It always exits with status 0.

Known bugs

None.

See also

Program name  Description
antigenic Finds antigenic sites in proteins
bscan Scans proteins or nucleic acids for conserved motifs using Blocks
digest Protein proteolytic enzyme or reagent cleavage digest
emast Motif detection
ememe Motif detection
epestfind Finds PEST motifs as potential proteolytic cleavage sites
fuzzpro Protein pattern search
fuzztran Protein pattern search after translation
helixturnhelix Report nucleic acid binding motifs
iprscan Scans proteins or nucleic acids for conserved motifs using Interpro tools
oddcomp Find protein sequence regions with a biased composition
patmatdb Search a protein sequence with a motif
patmatmotifs Search a PROSITE motif database with a protein sequence
pepcoil Predicts coiled coil regions
pfmake Creates PROSITE style profile from protein multiple alignment
phiblast Search protein sequence set combining matching of pattern with local alignment of a query sequence surrounding the match
preg Regular expression search of a protein sequence
profit Scan a sequence or database with a matrix or profile
prophecy Creates matrices/profiles from multiple alignments
prophet Gapped alignment for profiles
pscan Scans proteins for conserved motifs using PRINTS
psiblast Iterative BLAST search with generation of profile of protein sequence against protein sequence set
sigcleave Reports protein signal cleavage sites
ehmmpfam Scans proteins using Pfam-A or other HMM database

Author(s)

The wrapper application ps_scan was written by Guy Bottu (gbottu@vub.ac.be)
BEN, ULB, Brussels, Belgium

The ps_scan Perl script itself was written by Alexandre Gattiker, Edouard de Castro (ecastro@isb-sib.ch) and Béatrice Cuche, with contributions from Lorenza Bordoli (repeat method).
Swiss Institute of Bioinformatics

The PFTOOLS suite (of which pfscan is part) was originally developed by Philipp Bucher and is currently maintained by Volker Flegel
Swiss Institute of Bioinformatics (pftools@isb-sib.ch)

History

Completed 11 September 2002
Modified 7 October 2002 - added possibility to search only selected entries from PROSITE or to search personal databank instead
Modified 3 April 2007 - adapted to ps_scan.pl version of 2006
Modified 4 December 2008 - adapted to ps_scan.pl version of 2008

Target users

This program is intended to be used by everyone and everything, from naive users to embedded scripts.