ps_scan
By default simple motifs are ignored. These are motifs that are marked in the database with "SKIP-FLAG=TRUE", because they are likely to give a huge number of false positives. Examples are glycosylation sites, phosphorylation sites, leucine zippers, etc. You can force ps_scan to search them with the option -noskip. You can also force ps_scan to search only for a selected set of motifs by typing in a comma-separated list of Accession Numbers, e.g. "PS00001,PS00007". In order to find the AC's of the protein motifs of your interest, you can search the PROSITE databank using MRS.
A pattern-matching engine is said to be greedy if it extends variable-length pattern elements as far as possible. With the sequence "ABCDC" and the pattern "A-x(1,3)-C", a greedy engine produces the match "ABCDC" and a non-greedy one produces the match "ABC". By default, PROSITE is scanned in greedy mode; set the option -nogreedy to change this.
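The difference between the two modes can be reproduced with ordinary regular expressions. The sketch below is only an illustration and not the code used by ps_scan: the helper name prosite_to_regex is hypothetical, and it handles only a small subset of the PROSITE syntax (single residues, x, [..], {..} and repeat counts).

```python
import re

# One PROSITE element: x, a residue, [allowed] or {forbidden}, with an optional
# repeat count (n) or range (n,m). This subset is an assumption of the sketch.
ELEMENT = re.compile(r"(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?")

def prosite_to_regex(pattern, greedy=True):
    """Translate a simple PROSITE pattern into a Python regular expression."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        m = ELEMENT.fullmatch(element)
        if m is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        core, low, high = m.groups()
        if core == "x":
            core = "."                          # x matches any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"      # {..} lists forbidden residues
        if low is not None:
            if high is None:
                core += "{" + low + "}"         # fixed length, nothing to extend
            else:
                # Only ranges are variable-length, so only they get the lazy '?'.
                core += "{" + low + "," + high + "}" + ("" if greedy else "?")
        parts.append(core)
    return "".join(parts)

seq = "ABCDC"
for greedy in (True, False):
    regex = prosite_to_regex("A-x(1,3)-C", greedy=greedy)
    print(greedy, regex, re.search(regex, seq).group())
```

With greedy=True the range x(1,3) takes three residues and the match is "ABCDC"; with greedy=False it takes one and the match is "ABC".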
Some patterns may produce distinct but overlapping matches on a given sequence. Additionally, if the pattern contains variable-length elements, some of these matches may be completely included in another one. By default, all overlapping matches are reported rather than just one, and included matches are not reported; you can change this with the options -nooverlap and -include, respectively. Different combinations of -greedy, -overlap and -include may change the number and positions of matches for patterns that contain variable-length elements. An engine that allows overlaps should be greedy, so as to reduce the number of hits that overlap almost entirely and differ only at their extremities.
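As a rough illustration of what allowing overlaps means, the following sketch collects matches with and without overlap using Python regular expressions. It is not taken from ps_scan; the function name find_matches is hypothetical, and included matches (-include) are not modelled here.

```python
import re

def find_matches(regex, seq, overlap=True):
    """Return (start, end, text) tuples with 1-based inclusive positions."""
    if overlap:
        # A zero-width lookahead lets the scan restart one residue after each
        # match start, so matches that share residues are all found.
        hits = re.finditer(f"(?=({regex}))", seq)
        return [(m.start(1) + 1, m.end(1), m.group(1)) for m in hits]
    # A plain scan resumes after the end of the previous match.
    hits = re.finditer(regex, seq)
    return [(m.start() + 1, m.end(), m.group()) for m in hits]

seq = "CACAC"                                     # toy sequence
print(find_matches("C.C", seq, overlap=True))     # two hits sharing the middle C
print(find_matches("C.C", seq, overlap=False))    # only the first hit
```

The pattern C-x-C (written here directly as the regular expression "C.C") matches residues 1-3 and 3-5 of the toy sequence; without overlaps only the first of the two is kept.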
The PROSITE syntax describes how to treat ambiguities in the pattern,
but not how to handle ambiguities in the sequence. In rare sequences in
Swiss-Prot and other databases, the characters B and Z are used
according to IUBMB nomenclature when a residue may be either Asp or Asn,
or Glu or Gln, respectively. The ps_scan program will produce a match if
the sequence has a "B" and the pattern allows either a "D" or an "N", or
both (and similarly for Z).
Whether the character X should be allowed to match any position of a
pattern is more controversial. It is generally useful to accept a single
pattern position to match X (unless that pattern position is an X
itself, in which case we can accept more). The maximum number of X
characters which are allowed to match a non-X position in a pattern can
be specified with the -xmatches option. The default value is 1.
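The rules for B, Z and X described above can be sketched for a fixed-length pattern as follows. This is a simplified illustration under stated assumptions, not the ps_scan implementation: the pattern is given as a list of allowed-residue sets (None stands for a pattern x), and the names matches_at and max_x are hypothetical.

```python
# Ambiguity codes in the sequence and the residues they may stand for.
AMBIGUOUS = {"B": {"D", "N"}, "Z": {"E", "Q"}}

def matches_at(seq, start, positions, max_x=1):
    """Check a fixed-length pattern against seq at 0-based offset start.

    positions holds one entry per pattern position: a set of allowed
    residues, or None for a pattern x, which accepts anything.
    """
    if start + len(positions) > len(seq):
        return False
    x_used = 0
    for residue, allowed in zip(seq[start:], positions):
        if allowed is None:
            continue                      # pattern x: no restriction, X is free here
        if residue in allowed:
            continue
        if residue in AMBIGUOUS and AMBIGUOUS[residue] & allowed:
            continue                      # B matches D or N, Z matches E or Q
        if residue == "X":
            x_used += 1                   # X on a non-x position uses the budget
            if x_used > max_x:
                return False
            continue
        return False
    return True

pattern = [{"D", "E"}, None, {"N"}]       # corresponds to [DE]-x-N
print(matches_at("BAN", 0, pattern))      # B stands in for D
print(matches_at("XAX", 0, pattern))      # two X on non-x positions exceed max_x=1
```

Raising max_x plays the role of a larger -xmatches value: with max_x=2 the second call would succeed.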
The profiles are searched by a call to the pfscan program of the PFTOOLS suite. The generalized profile format includes one or more cut-off levels (L) for the comparison score. The level 0 cut-off is the trusted cut-off for positive matches, but a level -1 is usually defined above which a match is potential, especially if there are other matches in the sequence with the profile. By default, only trusted matches (level 0 or higher) are reported. To obtain information about potential (weak) matches as well, you can run the "wrapper" ps_scan with parameter -level=-1.
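The relation between scores, cut-off levels and the -level setting can be sketched as below. The cut-off scores are invented for illustration (real values are defined per profile in the database), and the function names are hypothetical; pfscan itself is not called.

```python
def cutoff_level(score, cutoffs):
    """Return the highest cut-off level reached by score, or None if none is."""
    reached = [level for level, minimum in cutoffs.items() if score >= minimum]
    return max(reached) if reached else None

def is_reported(score, cutoffs, min_level=0):
    """Mimic the -level setting: keep matches at min_level or higher."""
    level = cutoff_level(score, cutoffs)
    return level is not None and level >= min_level

# Hypothetical cut-offs: level 0 (trusted) at 8.5, level -1 (potential) at 6.5.
EXAMPLE_CUTOFFS = {0: 8.5, -1: 6.5}

for score in (9.0, 7.0, 5.0):
    print(score,
          cutoff_level(score, EXAMPLE_CUTOFFS),
          is_reported(score, EXAMPLE_CUTOFFS),                 # default, level 0
          is_reported(score, EXAMPLE_CUTOFFS, min_level=-1))   # as with -level=-1
```

A score of 7.0 falls between the two hypothetical cut-offs, so it is dropped by default and kept only when the minimum level is lowered to -1.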
The validity of a match with a profile can depend on the context of matches with other motifs. The PROSITE databank contains for some profiles post-processing rules (the PP lines) :
By default context-sensitive search and RDM are performed (and pfscan is run with cut-off level -1, although only matches that reach the selected cut-off level are reported), but you can switch this off by setting -nopp.
By default a match with a pattern is validated by searching the corresponding sequence range against a mini-profile. If this range matches the mini-profile at cut-off level 0, a level L=(0) is reported, otherwise a level L=(-1) is reported. Validation is not performed if you switch it off by setting -novalidate or if you switch all postprocessing off with -nopp.
> ps_scan
Scans proteins for conserved motifs using PROSITE (patterns and profiles)
Input sequence(s): sw:tpa_human
Output file [tpa_human.ps_scan]:
Standard (Mandatory) qualifiers:
[-seqs] seqall Protein sequence(s) filename and optional
format, or reference (input USA)
[-outfile] outfile [*.ps_scan] Output file name
Additional (Optional) qualifiers (* if not always prompted):
-motifs infile [/opt/sw/ps_scan/prosite.dat] Database of
motifs. Default is PROSITE, but you can
choose a personal file with protein patterns
and profiles in PROSITE format instead.
-[no]validate toggle [Y] Validate pattern hits using
mini-profiles
* -miniprofiles infile [/opt/sw/ps_scan/evaluator.dat] Database of
profiles for pattern validation. Default is
PROSITE evaluator file, but you can choose a
personal file instead (see on-line manual).
-search menu [3] Search which motifs (Values: 1 (only
patterns); 2 (only profiles); 3 (patterns
and profiles))
-[no]skip boolean [Y] Ignore (is default) motifs in database
that are labeled with SKIP-FLAG=TRUE, which
means that they are likely to give many
false positives.
-select string [all] By default all motifs in the database
are searched for. You can restrict the
search by providing a comma-separated list
of AC's, e.g. PS00001,PS00007 . If you do
this the skip-flag is ignored. (Any string
is accepted)
* -[no]greedy boolean [Y] Greedy matching (is default) means that
in case of doubt a variable-length pattern
should match an as long as possible sequence
segment
* -[no]overlap boolean [Y] Allow overlapping matches for the same
pattern (is default).
* -include boolean Allow for a variable-length pattern multiple
matches included the one into the other
* -xmatches integer [1] Maximum number of X in sequence allowed
to match pattern (Integer 0 or more)
* -level integer [0] Minimum cut-off level for profiles. You
can choose -1 for increased sensitivity.
(Integer from -1 to 1)
* -[no]pp boolean [Y] Perform post-processing (estimate
validity of match in the context of other
matches)
Advanced (Unprompted) qualifiers:
-ranges toggle Write as output matching ranges in fastA
format
Associated qualifiers:
"-seqs" associated qualifiers
-sbegin1 integer Start of each sequence to be used
-send1 integer End of each sequence to be used
-sreverse1 boolean Reverse (if DNA)
-sask1 boolean Ask for begin/end/reverse
-snucleotide1 boolean Sequence is nucleotide
-sprotein1 boolean Sequence is protein
-slower1 boolean Make lower case
-supper1 boolean Make upper case
-sformat1 string Input sequence format
-sdbname1 string Database name
-sid1 string Entryname
-ufo1 string UFO features
-fformat1 string Features format
-fopenfile1 string Features file name
"-outfile" associated qualifiers
-odirectory2 string Output directory
General qualifiers:
-auto boolean Turn off prompts
-stdout boolean Write standard output
-filter boolean Read standard input, write standard output
-options boolean Prompt for standard and additional values
-debug boolean Write debug output to program.dbg
-verbose boolean Report some/full command line options
-help boolean Report command line options. More
information on associated and general
qualifiers can be found with -help -verbose
-warning boolean Report warnings
-error boolean Report errors
-fatal boolean Report fatal errors
-die boolean Report dying program messages
| Qualifier | Description | Allowed values | Default |
|---|---|---|---|
| Standard (Mandatory) qualifiers | | | |
| [-seqs] (Parameter 1) | Protein sequence(s) filename and optional format, or reference (input USA) | Readable sequence(s) | Required |
| [-outfile] (Parameter 2) | Output file name | Output file | <sequence>.ps_scan |
| Additional (Optional) qualifiers | | | |
| -motifs | Database of motifs. Default is PROSITE, but you can choose a personal file with protein patterns and profiles in PROSITE format instead. | Input file | /opt/sw/ps_scan/prosite.dat |
| -[no]validate | Validate pattern hits using mini-profiles | Toggle value Yes/No | Yes |
| -miniprofiles | Database of profiles for pattern validation. Default is PROSITE evaluator file, but you can choose a personal file instead (see on-line manual). | Input file | /opt/sw/ps_scan/evaluator.dat |
| -search | Search which motifs | 1 (only patterns); 2 (only profiles); 3 (patterns and profiles) | 3 |
| -[no]skip | Ignore (is default) motifs in database that are labeled with SKIP-FLAG=TRUE, which means that they are likely to give many false positives. | Boolean value Yes/No | Yes |
| -select | By default all motifs in the database are searched for. You can restrict the search by providing a comma-separated list of AC's, e.g. PS00001,PS00007 . If you do this the skip-flag is ignored. | Any string is accepted | all |
| -[no]greedy | Greedy matching (is default) means that in case of doubt a variable-length pattern should match an as long as possible sequence segment | Boolean value Yes/No | Yes |
| -[no]overlap | Allow overlapping matches for the same pattern (is default). | Boolean value Yes/No | Yes |
| -include | Allow for a variable-length pattern multiple matches included the one into the other | Boolean value Yes/No | No |
| -xmatches | Maximum number of X in sequence allowed to match pattern | Integer 0 or more | 1 |
| -level | Minimum cut-off level for profiles. You can choose -1 for increased sensitivity. | Integer from -1 to 1 | 0 |
| -[no]pp | Perform post-processing (estimate validity of match in the context of other matches) | Boolean value Yes/No | Yes |
| Advanced (Unprompted) qualifiers | | | |
| -ranges | Write as output matching ranges in fastA format | Toggle value Yes/No | No |
PS00021 KRINGLE_1 Kringle domain signature.
178 - 190 YCRNpdrdskp.WC L=(0)
266 - 278 YCRNpdgdakp.WC L=(0)
PS00022 EGF_1 EGF-like domain signature 1.
108 - 119 CqCpeGfaGKcC L=(0)
PS00134 TRYPSIN_HIS Serine proteases, trypsin family, histidine active site.
353 - 358 LSAAHC L=(0)
PS00135 TRYPSIN_SER Serine proteases, trypsin family, serine active site.
507 - 518 DAcqGDSGGPLV L=(0)
PS01186 EGF_2 EGF-like domain signature 2.
108 - 119 CqCpeGFagkc....C L=(0)
PS01253 FN1_1 Fibronectin type-I domain signature.
41 - 78 CrdektqmiYqqhqsWlRpvlrsnrveyCwCnsgraq...C L=(0)
PS50026 EGF_3 EGF-like domain profile.
82 - 120 PVKSCSePRCFNGGTCQQALYFSDFVCQCPEGFAGKCCE L=0
PS50070 KRINGLE_2 Kringle domain profile.
126 - 208 TCYEDQGISYRGTWSTAESGAECTNWNSSALAQKPYSGRRPDAIrlgLGNHNYCRNPDRD L=0
SKPWCYVFkAGKYSSEFCSTPAC
214 - 296 DCYFGNGSAYRGTHSLTESGASCLPWNSMILIGKVYTAQNPSAQalgLGKHNYCRNPDGD L=0
AKPWCHVLkNRRLTWEYCDVPSC
PS50240 TRYPSIN_DOM Serine proteases, trypsin domain profile.
311 - 561 IKGGLFADIASHPWQAAIFAKHrrspgERFLCGGILISSCWILSAAHCFQERFPPHHLTV L=0
ILGRTYRVVPGEEEQKFEVEKYIVHKEFDDDTYDNDIALLQLKSDssrcAQESSVVRTVC
LPPADLQLPDWTECELSGYGKHEALsPFYSERLKEAHVRLYPSSRCtSQHLLNRTVTDNM
LCAGDTrSGGPQAnlhdaCQGDSGGPLVCLNDGRMTLVGIISWGLGCGQKDVPGVYTKVT
NYLDWIRDNMR
PS51091 FN1_2 Fibronectin type-I domain profile.
39 - 81 VICRDEKTQMIYQQHQSWLRPVLRsNRVEYCWC---NSGRAQCHSV L=0
First the pattern matches are reported and then the profile matches. For pattern matches amino acids matching an "x" in the pattern are reported in lowercase. For profile matches (and also for pattern matches if validation by mini-profile is on) the score cut-off level L is reported.
You can obtain instead (option -ranges) a file with the matching sequences in FASTA format.
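If the default report needs further processing, it can be read back with a few lines of code. The parser below is a sketch based only on the layout of the example output shown above (a motif header line, hit lines of the form "start - end sequence L=level", and continuation lines for long profile matches); the function name parse_ps_scan and the returned dictionary keys are hypothetical.

```python
import re

# Hit line, e.g. "178 - 190 YCRNpdrdskp.WC L=(0)" or "82 - 120 PVKS... L=0".
HIT = re.compile(r"^\s*(\d+)\s+-\s+(\d+)\s+(\S+)\s+L=\(?(-?\d+)\)?\s*$")
# Header line: accession, identifier, free-text description.
HEADER = re.compile(r"^\s*(\S+)\s+(\S+)\s+(.+)$")

def parse_ps_scan(text):
    """Parse a ps_scan report into a list of hit dictionaries."""
    hits, accession = [], None
    for line in text.splitlines():
        hit = HIT.match(line)
        if hit:
            start, end, segment, level = hit.groups()
            hits.append({"ac": accession, "start": int(start), "end": int(end),
                         "match": segment, "level": int(level)})
            continue
        header = HEADER.match(line)
        if header:
            accession = header.group(1)
        elif hits and line.strip():
            # A single token without level: continuation of a long profile match.
            hits[-1]["match"] += line.strip()
    return hits

report = """PS00021 KRINGLE_1 Kringle domain signature.
178 - 190 YCRNpdrdskp.WC L=(0)
266 - 278 YCRNpdgdakp.WC L=(0)
PS50070 KRINGLE_2 Kringle domain profile.
126 - 208 TCYEDQGISYRGTWSTAESGAECTNWNSSALAQKPYSGRRPDAIrlgLGNHNYCRNPDRD L=0
SKPWCYVFkAGKYSSEFCSTPAC
"""
for hit in parse_ps_scan(report):
    print(hit["ac"], hit["start"], hit["end"], hit["level"], len(hit["match"]))
```

Hit lines are tested before header lines because a hit line also consists of several whitespace-separated tokens.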
You can write your own patterns :
ID   Gio1; PATTERN.
AC   P1;
DE   potential GPCR coupling site, sequence Gio
PA   [ILV]-x(3)-S-G-x(0,10)-R.
//
ID   Gio2; PATTERN.
AC   P2;
DE   potential GPCR coupling site, sequence Gio
PA   N-x(2)-R-x(1,4)-R.
//
ID   Gio3; PATTERN.
AC   P3;
DE   potential GPCR coupling site, sequence Gio
PA   Y-x-A-x(1,8)-A-[ILV].
//
Note: the ID line ends with "; PATTERN." to indicate it is a pattern.
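Before handing a personal pattern file to -motifs, it can help to check that each entry has the expected lines. The sketch below reads entries made of two-letter line codes (ID, AC, DE, PA) terminated by "//" and applies a few checks that simply mirror the example entries above; the function names are hypothetical and the checks are not an official validation of the PROSITE format.

```python
def parse_prosite_entries(text):
    """Split a PROSITE-style file into dictionaries keyed by line code."""
    entries, current = [], {}
    for line in text.splitlines():
        line = line.rstrip()
        if line.startswith("//"):            # entry terminator
            if current:
                entries.append(current)
            current = {}
            continue
        code, _, value = line.partition(" ")
        value = value.strip()
        if not code or not value:
            continue
        if code in current:
            # Repeated line codes are joined; pattern lines without a space.
            sep = "" if code == "PA" else " "
            current[code] = current[code] + sep + value
        else:
            current[code] = value
    if current:                              # tolerate a missing final "//"
        entries.append(current)
    return entries

def check_pattern_entry(entry):
    """Return a list of problems found in one pattern entry."""
    problems = []
    if not entry.get("ID", "").endswith("; PATTERN."):
        problems.append('ID line should end with "; PATTERN."')
    if not entry.get("AC", "").endswith(";"):
        problems.append('AC line should end with ";"')
    if not entry.get("PA", "").endswith("."):
        problems.append('PA line should end with "."')
    return problems

example = """ID   Gio1; PATTERN.
AC   P1;
DE   potential GPCR coupling site, sequence Gio
PA   [ILV]-x(3)-S-G-x(0,10)-R.
//
ID   Gio2; PATTERN.
AC   P2;
DE   potential GPCR coupling site, sequence Gio
PA   N-x(2)-R-x(1,4)-R.
//
"""
for entry in parse_prosite_entries(example):
    print(entry["AC"], entry["PA"], check_pattern_entry(entry))
```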
You can make your own profiles using pfmake.
You can also make your own file with profiles for the purpose of validating patterns (see above, Algorithm). You must then make sure that a mini-profile corresponding to pattern with AC PSxxxxx has an AC MPxxxxx.
| Program name | Description |
|---|---|
| antigenic | Finds antigenic sites in proteins |
| bscan | Scans proteins or nucleic acids for conserved motifs using Blocks |
| digest | Protein proteolytic enzyme or reagent cleavage digest |
| emast | Motif detection |
| ememe | Motif detection |
| epestfind | Finds PEST motifs as potential proteolytic cleavage sites |
| fuzzpro | Protein pattern search |
| fuzztran | Protein pattern search after translation |
| helixturnhelix | Report nucleic acid binding motifs |
| iprscan | Scans proteins or nucleic acids for conserved motifs using Interpro tools |
| oddcomp | Find protein sequence regions with a biased composition |
| patmatdb | Search a protein sequence with a motif |
| patmatmotifs | Search a PROSITE motif database with a protein sequence |
| pepcoil | Predicts coiled coil regions |
| pfmake | Creates PROSITE style profile from protein multiple alignment |
| phiblast | Search protein sequence set combining matching of pattern with local alignment of a query sequence surrounding the match |
| preg | Regular expression search of a protein sequence |
| profit | Scan a sequence or database with a matrix or profile |
| prophecy | Creates matrices/profiles from multiple alignments |
| prophet | Gapped alignment for profiles |
| pscan | Scans proteins for conserved motifs using PRINTS |
| psiblast | Iterative BLAST search with generation of profile of protein sequence against protein sequence set |
| sigcleave | Reports protein signal cleavage sites |
| ehmmpfam | Scans proteins using Pfam-A or other HMM database |
The ps_scan Perl script itself was written by Alexandre Gattiker,
Edouard de Castro (ecastro@isb-sib.ch)
and Béatrice Cuche, with contributions from Lorenza Bordoli
(repeat method).
Swiss Institute of Bioinformatics
The PFTOOLS suite (of which pfscan is part) was originally developed by
Philipp Bucher and is currently maintained by Volker Flegel
Swiss Institute of Bioinformatics
(pftools@isb-sib.ch)