cmsearchrfam

 

Function

Scans nucleic acids for RNA genes and conserved motifs using Rfam

Description

cmsearchrfam is an EMBOSS "wrapper" program for the program cmsearch from Soan Eddy's INFERNAL suite. It scans one or several entries from the Rfam databank against one or several nucleic acid sequences. Rfam contains Covariance Models, which describe the most highly conserved regions of RNA sequences, taking account of as well the conservation of bases as the conservation of secondary structure (base pairs).

By default cmsearchrfam searches both strands of the nucleic acid sequence(s) and searches the complete Rfam databank. You can force cmsearchrfam to search only the top strand of the query sequence(s) by setting -noreverse. You can also force cmsearchrfam to search only for a selected set of Covariance Models by typing in a comma-separated list of Accession Numbers, e.g. "RF00001,RF00007". In order to find the AC's of the RNA models of your interest, you can search the Rfam databank using MRS.

By default cmsearchrfam can search for local alignments, in which the alignment is allowed to span two or more subsequences if necessary (e.g. if the structures of the query model and the target sequence are only partly shared). You can request cmsearchrfam to operate instead in glocal mode, that is local with respect to the sequence but global with respect to the model. "glocal" searches are less sensitive but might be desirable in some cases.

Algorithm

INFERNAL builds a model of consensus RNA secondary structure using a formalism called a Covariance Model (CM), which is a type of profile stochastic context-free grammar (profile SCFG) (Eddy and Durbin, 1994; Durbin et al., 1998; Eddy, 2002).

cmsearch implements a scanning window dynamic programming algorithm (Durbin et al., 1998). It looks for high-scoring alignments of the structural model to any subsequence of the target sequence. It works even if the target sequence is very large (e.g. a whole chromosome) - though it is slow, like all SCFG algorithms.

By default cmsearchrfam uses the inside algorithm, which computes a "bit score" summed over all possible alignments. You can optionally request to use instead the CYK algorithm, which computes the "bit score" of the single most likely alignment of the sequence with the CM. This option increases speed about two times, but decreases the sensitivity.

cmsearchrfam computes a "bit score", which is a log-odds score in log base two :
S = log2(P(seq|CM)/P(seq|null))
where P(seq|CM) is the probability of the target sequence given the CM and P(seq|null) is the probability of the target sequence given a "null hypothesis" model, which models the statistics of random sequence. cmsearchrfam also computes an E() value. It tells you how many false positives you would expect to see at or above a certain "bit score". The CM files in the Rfam have been calibrated, using the cmcalibrate program from the INFERNAL suite. With this calibration information from the CM file cmsearch can compute an E() from the "bit score" and the sequence length. Be however alert, currently INFERNAL E() values are not perfect.

The current version of cmsearchrfam by default uses a cutoff value for E() of 1.0, that is it reports only hits with an E() value of 1.0 or less. You can instead set yourself an E() value cutoff or set directly a "bit score" cutoff or use any of the three scores set in the Rfam CM file :

In case an E() value cutoff is used it is necessary to set a sequence search set length. It is possible to use the real length of the search set, that is the length of the query sequence in case you search only the top strand or twice the length of the query sequence in case you search both strands. This gives sometimes a lot of false positives. Therefore cmsearchrfam proposes by default to use instead a fictional search set length.

Accelerating cmsearch by QDB and filtering

RNA homology search with CMs is slow. One way to speed it up is query-dependent banding (QDB). QDB precalculates regions of the dynamic programming matrix that yield a negligible contribution to the final probability (see Nawrocki and Eddy, 2007). QDB sacrifies the guarantee that the optimal alignment for any subsequence will be found, so the acceleration potentially comes at the cost of sensitivity. The beta parameter is the amount of probability mass considered negligible during band calculation, higher values of beta yield greater speedups but also a greater chance of missing the optimal alignment.

A further way to considerably accelerate the search at a minimal cost of losing sensitivity is filtering. By default cmsearchrfam uses two rounds of filters with faster algorithms to prune the database prior to searching with the slow algorithm. The first round of filtering is faster but less strict that the second. First, the full database is searched with the first round filter, then any hits that survive the first round are searched with the second round filter. Finally any hits that survive the first and second round of filtering are searched with the final round search strategy. During the filter rounds, hits are padded with a short strectch of residues on either side prior to searching in the subsequent round. The exact number of residues is dependant on the size of the model being searched.

The first round of filtering is performed with an HMM, which is searched using a scanning "forward" algorithm. The HMM implemented in INFERNAL is called the Plan 9 HMM (to be distinguished from the Plan 7 HMM of the HMMER package). The HMM is derived from the Rfam CM. By default the appropriate threshold for accepting hits is automatically chosen dependant on the "bit score" cutoff of the final search and calibration information taken from the Rfam CM file, in such a way that there is 99.5% chance that a hit that would be accepted in the final search will pass the filter. In some cases HMM filtering is turned off because the expected survival fraction of sequence is close to 1, so that filtering would save little if any time. It is possible to set instead manually an E() value or a "bit score" cutoff.

The second round of filtering is performed with the CM using the QDB algorithm. By default the beta parameter is set to 1E-7, determined empirically as a good tradeoff between sensitivity and speed. By default the appropriate threshold for accepting hits is automatically chosen using an ad hoc procedure that makes that the filter will always allow hits that have an E() value 100 times greater than the cutoff for the final search and that the final round of the search is expected to require at least 3% of the total predicted number of calculations. It is possible to set instead manually an E() value or a "bit score" cutoff. The final round of searching is by default performed with QDB and beta set to 1E-15, determined empirically as giving not much difference in speed and sensitivity with an unbanded search (except for short sequences).

Usage

Here is a sample session with cmsearchrfam

> cmsearchrfam
Scans nucleic acids for RNA genes and conserved motifs using Rfam
Input nucleotide sequence(s): embl:x00066
         E : user defined E() value
         T : user defined bit score cutoff
        GA : gathering bit score cutoff from Rfam CM file
        TC : trusted bit score cutoff from Rfam CM file
        NC : noise bit score cutoff from Rfam CM file
Cutoff type for final search [E]:
E() value cutoff [1.0]:
Set user defined query sequence length for E() value calculation [Y]:
Effective sequence length used for E() value calculation (in Mbase) [1.5]:
Output file [x00066.cmsearchrfam]:

Go to the input files for this example
Go to the output files for this example

Command line arguments

   Standard (Mandatory) qualifiers (* if not always prompted):
  [-seqs]              seqall     Nucleotide sequence(s) filename and optional
                                  format, or reference (input USA)
   -cutoff             menu       [E] Cutoff type for final search (Values: E
                                  (user defined E() value); T (user defined
                                  bit score cutoff); GA (gathering bit score
                                  cutoff from Rfam CM file); TC (trusted bit
                                  score cutoff from Rfam CM file); NC (noise
                                  bit score cutoff from Rfam CM file))
*  -expect             float      [1.0] E() value = number of Rfam hits with
                                  same or higher bit score that you expect to
                                  find by chance. Note that this depends on
                                  the length of the query sequence. (Number
                                  0.000 or more)
*  -[no]setlength      toggle     [Y] Set user defined query sequence length
                                  for E() calculation instead of using actual
                                  sequence length. This is done by default
                                  because for short sequences using the actual
                                  sequence length often gives false
                                  positives.
*  -seqlength          float      [1.5] Effective sequence length used for E()
                                  value calculation (in Mbase) (Number 0.000
                                  or more)
*  -threshold          float      [0.0] Threshold (bit score cutoff) (Any
                                  numeric value)
  [-outfile]           outfile    [*.cmsearchrfam] Output file name

   Additional (Optional) qualifiers (* if not always prompted):
   -select             string     [all] By default all models in the database
                                  are searched for. You can restrict the
                                  search by providing a comma-separated list
                                  of AC's, e.g. RF00001,RF00007. (Any string
                                  is accepted)
   -[no]reverse        boolean    [Y] Search also complementary strand of
                                  query sequence (is default).
   -[no]local          boolean    [Y] Make local alignment. If you switch this
                                  off cmsearchrfam can only find alignments
                                  that are global in the Rfam Covariance
                                  Model.
   -cyk                boolean    Use CYK (Cocke-Younger-Kasami) algorithm
                                  instead of inside algorithm. This is faster
                                  but less sensitive.
   -[no]banded         toggle     [Y] Use QDB (query-dependent banding)
                                  algorithm, that is ignore regions of dynamic
                                  programming matrix that have negligible
                                  contribution (set with -beta) to
                                  probability. Turning this off is more
                                  sensitive but slower.
*  -beta               float      [1E-15] Beta parameter for banding.
                                  Probabilities lower than beta are considered
                                  negligible. (Number from 0.000 to 1.000)
   -hmmfilter          menu       [CM] By default HMM filtering is used as
                                  first filter, that is the sequence is
                                  searched against a Hidden Markov Model
                                  derived from the Rfam Covariance Model and
                                  only the hits are searched further. (Values:
                                  CM (with bit score cutoff derived from Rfam
                                  CM file); E (with user defined E() value
                                  cutoff); T (with user defined bit score
                                  cutoff); 0 (none))
*  -hmmexpect          float      [1.0] E() value cutoff for HMM filtering
                                  (Number 1.000 or more)
*  -hmmthreshold       float      [3.0] Threshold (bit score cutoff) for HMM
                                  filtering (Any numeric value)
   -qdbfilter          menu       [CM] By default QDB filtering is used as
                                  second filter, that is the sequence or the
                                  sequence ranges that passed the first filter
                                  is searched against the Rfam Covariance
                                  Model using the QDB algorithm and only the
                                  hits are searched further. (Values: CM (with
                                  E() value cutoff derived from Rfam CM
                                  file); E (with user defined E() value
                                  cutoff); T (with user defined bit score
                                  cutoff); 0 (none))
*  -qdbexpect          float      [1.0] E() value cutoff for QDB filtering
                                  (Number 1.000 or more)
*  -qdbthreshold       float      [0.0] Threshold (bit score cutoff) for QDB
                                  filtering (Any numeric value)
*  -qdbbeta            float      [1E-7] Beta parameter for banding while QDB
                                  filtering. Probabilities lower than beta are
                                  considered negligible. (Number from 0.000
                                  to 1.000)
   -posterior          boolean    Append posterior probabilities to hit
                                  alignments
   -annotate           boolean    Annotate bases in sequence that break base
                                  pairs in Rfam model
   -dna                boolean    Write hit alignments as DNA

   Advanced (Unprompted) qualifiers: (none)
   Associated qualifiers:

   "-seqs" associated qualifiers
   -sbegin1            integer    Start of each sequence to be used
   -send1              integer    End of each sequence to be used
   -sreverse1          boolean    Reverse (if DNA)
   -sask1              boolean    Ask for begin/end/reverse
   -snucleotide1       boolean    Sequence is nucleotide
   -sprotein1          boolean    Sequence is protein
   -slower1            boolean    Make lower case
   -supper1            boolean    Make upper case
   -sformat1           string     Input sequence format
   -sdbname1           string     Database name
   -sid1               string     Entryname
   -ufo1               string     UFO features
   -fformat1           string     Features format
   -fopenfile1         string     Features file name

   "-outfile" associated qualifiers
   -odirectory2        string     Output directory

   General qualifiers:
   -auto               boolean    Turn off prompts
   -stdout             boolean    Write first file to standard output
   -filter             boolean    Read first file from standard input, write
                                  first file to standard output
   -options            boolean    Prompt for standard and additional values
   -debug              boolean    Write debug output to program.dbg
   -verbose            boolean    Report some/full command line options
   -help               boolean    Report command line options. More
                                  information on associated and general
                                  qualifiers can be found with -help -verbose
   -warning            boolean    Report warnings
   -error              boolean    Report errors
   -fatal              boolean    Report fatal errors
   -die                boolean    Report dying program messages

Standard (Mandatory) qualifiers Allowed values Default
[-seqs]
(Parameter 1)
Nucleotide sequence(s) filename and optional format, or reference (input USA) Readable sequence(s) Required
-cutoff Cutoff type for final search
E (user defined E() value)
T (user defined bit score cutoff)
GA (gathering bit score cutoff from Rfam CM file)
TC (trusted bit score cutoff from Rfam CM file)
NC (noise bit score cutoff from Rfam CM file)
E
-expect E() value = number of Rfam hits with same or higher bit score that you expect to find by chance. Note that this depends on the length of the query sequence. Number 0.000 or more 1.0
-[no]setlength Set user defined query sequence length for E() calculation instead of using actual sequence length. This is done by default because for short sequences using the actual sequence length often gives false positives. Toggle value Yes/No Yes
-seqlength Effective sequence length used for E() value calculation (in Mbase) Number 0.000 or more 1.5
-threshold Threshold (bit score cutoff) Any numeric value 0.0
[-outfile]
(Parameter 2)
Output file name Output file <sequence>.cmsearchrfam
Additional (Optional) qualifiers Allowed values Default
-select By default all models in the database are searched for. You can restrict the search by providing a comma-separated list of AC's, e.g. RF00001,RF00007. Any string is accepted all
-[no]reverse Search also complementary strand of query sequence (is default). Boolean value Yes/No Yes
-[no]local Make local alignment. If you switch this off cmsearchrfam can only find alignments that are global in the Rfam Covariance Model. Boolean value Yes/No Yes
-cyk Use CYK (Cocke-Younger-Kasami) algorithm instead of inside algorithm. This is faster but less sensitive. Boolean value Yes/No No
-[no]banded Use QDB (query-dependent banding) algorithm, that is ignore regions of dynamic programming matrix that have negligible contribution (set with -beta) to probability. Turning this off is more sensitive but slower. Toggle value Yes/No Yes
-beta Beta parameter for banding. Probabilities lower than beta are considered negligible. Number from 0.000 to 1.000 1E-15
-hmmfilter By default HMM filtering is used as first filter, that is the sequence is searched against a Hidden Markov Model derived from the Rfam Covariance Model and only the hits are searched further.
CM (with bit score cutoff derived from Rfam CM file)
E (with user defined E() value cutoff)
T (with user defined bit score cutoff)
0 (none)
CM
-hmmexpect E() value cutoff for HMM filtering Number 1.000 or more 1.0
-hmmthreshold Threshold (bit score cutoff) for HMM filtering Any numeric value 3.0
-qdbfilter By default QDB filtering is used as second filter, that is the sequence or the sequence ranges that passed the first filter is searched against the Rfam Covariance Model using the QDB algorithm and only the hits are searched further.
CM (with E() value cutoff derived from Rfam CM file)
E (with user defined E() value cutoff)
T (with user defined bit score cutoff)
0 (none)
CM
-qdbexpect E() value cutoff for QDB filtering Number 1.000 or more 1.0
-qdbthreshold Threshold (bit score cutoff) for QDB filtering Any numeric value 0.0
-qdbbeta Beta parameter for banding while QDB filtering. Probabilities lower than beta are considered negligible. Number from 0.000 to 1.000 1E-7
-posterior Append posterior probabilities to hit alignments Boolean value Yes/No No
-annotate Annotate bases in sequence that break base pairs in Rfam model Boolean value Yes/No No
-dna Write hit alignments as DNA Boolean value Yes/No No
Advanced (Unprompted) qualifiers Allowed values Default
(none)

Input file format

cmsearchrfam reads any normal sequence USA for one or more nucleic acid sequence(s).
You can submit several sequences at the same time. In this case the "wrapper" cmsearchrfam will launch the script for each individual sequence and make sure the output files have different names. Do note that the results of the individual scans are in no way merged.

Output file format

The default output looks like :

Output files for usage example

File: x00066.cmsearchrfam

Searching the Rfam databank of RNA covariance models
The databank currently contains 1372 models.

CM: tRNA

  Plus strand results:

 Query = 1 - 71, Target = 438 - 510
 Score = 67.62, E = 1.821e-15, P = 6.641e-21, GC =  52

           (((((((,,<<<<___.____>>>>,<<<<<_______>>>>>,,,.,<<<<<_______
         1 gccccugUAGcucAaU.GGUAgagCauuggaCUuuuAAuccaaagg.ugugGGUUCgAaU 58      
           G:::CU:UAGCUCA+U GGUAGAGC :UGGA+U+U A UCCA:+ G +:UGGGUUCGAAU
       438 GUGGCUAUAGCUCAGUuGGUAGAGCCCUGGAUUGUGAUUCCAGUUGuCGUGGGUUCGAAU 497     

           >>>>>))))))):
        59 CCcaccaggggcA 71      
           CCCA::AG:::C 
       498 CCCAUUAGCCACC 510     


 Query = 1 - 71, Target = 663 - 736
 Score = 59.92, E = 2.486e-13, P = 9.066e-19, GC =  62

           (((((((,,<<<<___..____>>>>,<<<<<_______>>>>>,,,.,<<<<<______
         1 gccccugUAGcucAaU..GGUAgagCauuggaCUuuuAAuccaaagg.ugugGGUUCgAa 57      
           :::C::GUAGC:CA+   GGUAG:GCA:::G:+UU   A:C::: GG +:::GGUUCGAA
       663 CGGCGAGUAGCGCAGCuuGGUAGCGCAACUGGUUUGGGACCAGUGGGuCGGAGGUUCGAA 722     

           _>>>>>))))))):
        58 UCCcaccaggggcA 71      
           UCC:::C::G:::A
       723 UCCUCUCUCGCCGA 736     


 Query = 1 - 71, Target = 307 - 380
 Score = 59.68, E = 2.895e-13, P = 1.056e-18, GC =  65

           (((((((,,<<<<___.___._>>>>,<<<<<_______>>>>>,,,.,<<<<<______
         1 gccccugUAGcucAaU.GGU.AgagCauuggaCUuuuAAuccaaagg.ugugGGUUCgAa 57      
           :C:CC:GUAGCUCA+  GG  AGAGC+:UG::CU    A::CA:AGG + ::GGUUCGAA
       307 GCGCCCGUAGCUCAGCuGGAuAGAGCGCUGCCCUCCGGAGGCAGAGGuCUCAGGUUCGAA 366     

           _>>>>>))))))):
        58 UCCcaccaggggcA 71      
           UCC:: C:GG:G:A
       367 UCCUGUCGGGCGUA 380     


 Query = 1 - 71, Target = 534 - 617
 Score = 43.25, E = 1.048e-08, P = 3.823e-14, GC =  61

           (((((((,,<<<<___.___._>>>>,<<<<<_______>>>>>,,............,,
         1 gccccugUAGcucAaU.GGU.AgagCauuggaCUuuuAAuccaaag............gu 46      
           GC:: :GU GC::AAU GGU  ::GC+:U:::+U +   :::A:+             GU
       534 GCGAAGGUGGCGGAAUuGGUaGACGCGCUAGCUUCAGGUGUUAGU-guccuuacggacGU 592     

           <<<<<_______>>>>>))))))):
        47 gugGGUUCgAaUCCcaccaggggcA 71      
           G:GGGUUC+A UCCC:CC: ::GCA
       593 GGGGGUUCAAGUCCCCCCCCUCGCA 617     


CM: GcvB

  Plus strand results:

 Query = 1 - 201, Target = 623 - 772
 Score = 11.64, E = 0.368, P = 9.521e-07, GC =  49

           ::::::::~~~~~~,<<<<<<__................_>>>>>>,,,,,,~~~~~~,<
         1 aCuuaauu*[97]*uaGuAacaa................AguUaCuuAUUUu*[43]*ua 172     
           ACU AAU+       :GU ::A                  :: AC: AUUUU       A
       623 ACUCAAUG*[76]*GGGUCGGAGguucgaauccucucucGCCGACCAAUUUU*[ 0]*GA 746     

           <<<<<<<_______>>>>>>>>:::::::
       173 gCACCGCcuauuauGCGGUGcuUUUUUuU 201     
           :C:CCGC+U    +GCGG:G:UUUUU UU
       747 ACCCCGCUU---CGGCGGGGUUUUUUGUU 772     


CM: tmRNA

  Plus strand results:

 Query = 333 - 349, Target = 593 - 609
 Score = 12.75, E = 0.02046, P = 5.306e-08, GC =  71

           <<<<<_______>>>>>
       333 GcGGGUUCGAuUCCCgC 349     
           G:GGGUUC+A UCCC:C
       593 GGGGGUUCAAGUCCCCC 609     


 Query = 332 - 349, Target = 485 - 502
 Score = 11.84, E = 0.04013, P = 1.041e-07, GC =  56

           ,<<<<<_______>>>>>
       332 CGcGGGUUCGAuUCCCgC 349     
           C :GGGUUCGA UCCC: 
       485 CGUGGGUUCGAAUCCCAU 502     


CM: RsmY

  Plus strand results:

 Query = 99 - 120, Target = 745 - 766
 Score = 16.91, E = 0.01211, P = 2.555e-08, GC =  68

           <<<<<<<<<____>>>>>>>>>
        99 aaaCCCCGCuucGGCGGGGuuu 120     
           :AACCCCGCUUCGGCGGGGUU:
       745 GAACCCCGCUUCGGCGGGGUUU 766     


  Minus strand results:

 Query = 100 - 119, Target = 765 - 746
 Score = 12.07, E = 0.2719, P = 5.738e-07, GC =  70

           <<<<<<<<____>>>>>>>>
       100 aaCCCCGCuucGGCGGGGuu 119     
           AACCCCGC  + GCGGGGUU
       765 AACCCCGCCGAAGCGGGGUU 746     


CM: PrrF

  Plus strand results:

 Query = 1 - 150, Target = 633 - 773
 Score = 24.38, E = 0.0002918, P = 7.713e-10, GC =  50

           ::::::::::::::::::::::::::::::::::<<<<<<<<<~~~~~>>>>>>>>>,,,
         1 UUGACuuggAAAUGAgAAUgaUUAUuAUuuaacaCagCuGguC*[4]*GacCaGcuGgUA 59      
           UUGA  U++AAAU   AA +A  A UAU U+    AG: G::C     G::C :CU GU 
       633 UUGAACUAAAAAUUCAAAAAAGCAGUAUAUCGGCGAGUAGCGC*[9]*GCGCAACUGGUU 696     

           ,,<<<<<<<-<_________>-->>>>>>>,,,,,,,<<<____>>>,,,,,,,,,,,,,
        60 AaCuGagaGaCcuaagaGUCGGACucuCaGAUUAUCUCCuCAUCaGGCUAAUCACGGUUu 119     
             : G :::A     G GUC GA::: C :AU  UCUC    UC  GC  A CA   UUU
       697 --UGGGACCAG---UGGGUCGGAGGUUCGAAUCCUCUC----UC--GCCGACCAA--UUU 743     

           ,,<<<<<<<<_____>>>>>>>>::::::::
       120 UUGACCCGGCuuUUUGCCGGGUCUUUUUUUg 150     
           U :ACCC:GCUU   GC:GGGU:UUUU UU 
       744 UGAACCCCGCUUCG-GCGGGGUUUUUUGUUU 773     


CM: Leu_leader

  Plus strand results:

 Query = 1 - 148, Target = 697 - 769
 Score = 14.12, E = 0.5537, P = 1.664e-06, GC =  58

           :::<-<<<<~~~~~~>>>>->----------<<<<<<<<<<<____>>>>>>>>>>>:::
         1 caggAgcug*[94]*cagcccacAaaAaAaaaAAaCCCGCGCuAaUGCGCGGGuUUuUUU 148     
             GGA: :G      C: :CC +  AA+    AA:CCCGC    +U  GCGGG:UU UUU
       697 UGGGACCAG*[23]*CUCGCCGACCAAUUUUGAACCCCGC----UUCGGCGGGGUUUUUU 769     


CM: Trp_leader

  Plus strand results:

 Query = 1 - 100, Target = 691 - 771
 Score = 15.53, E = 0.2358, P = 9.476e-07, GC =  57

           ::::::::<<<<-~~~~~~-->>>>-----------------------------<<<<<<
         1 CauGGUUGGuGgC*[24]*aauCaCGuAUaucUguaAacAGaAAaaAgAuACCaaGCCCG 78      
           C+ G UUGG ::C      A :: CG   AUC     +C  + A +A+ U   AA CCCG
       691 CUGGUUUGGGACC*[10]*AGGUUCGA--AUCCUCUCUCGCCGACCAAUUUUGAACCCCG 752     

           <<____>->>>>>>>:::::::
        79 CuAauuaAGCGGGCuUUUUuaU 100     
           C       GCGGG UUUUUU U
       753 CUUC--G-GCGGGGUUUUUUGU 771     


CM: PK-repBA

  Minus strand results:

 Query = 1 - 124, Target = 766 - 678
 Score = 13.28, E = 0.3755, P = 1.163e-06, GC =  58

           :<<<<<<<~~~~~~>>>>>>>:::::::::::::::::::::::::::::::::::::::
         1 AACCCCCu*[39]*aGGGGGUccAAaCaGgCauauGcCaGuAAGGAUUAUAaCaaAUGca 93      
           AA:CCCC:      :GGGG:UC AAA +GG  U+ GC AG+ AGGAUU  AAC    G  
       766 AAACCCCG*[ 6]*CGGGGUUCAAAAUUGG--UCGGCGAGAGAGGAUUCGAACCUCCGAC 709     

           :::::::::::::::::::::::::::::::
        94 UAauCuAAaaCAAuaCaacCCGGCCaauaau 124     
             A+      CAA +CA++   GC A +AA 
       708 CCACUGGUCCCAAACCAGUUGCGCUACCAAG 678     


CM: CRISPR-DR32

  Minus strand results:

 Query = 1 - 37, Target = 774 - 740
 Score = 18.49, E = 0.3654, P = 1.142e-06, GC =  46

           :::::::::::::<<<<________>>>>::::::::
         1 aUaagAAUCAaCGCCCCGAgGAAGAGGGGaUUAAAAC 37      
           A AA+AA  AA  CCCCG +GAAG GGGG+U AAAA 
       774 AAAACAAAAAA--CCCCGCCGAAGCGGGGUUCAAAAU 740     


CM: isrB

  Plus strand results:

 Query = 1 - 87, Target = 688 - 768
 Score = 19.85, E = 0.04772, P = 1.448e-07, GC =  57

           :::<<<<~~~~~~>>>>,,<<<<<<~~~~~~>>>>>>,,,,..,<<<<<<<<____>>>>
         1 aauacca*[16]*ugguaGgcCCGG*[18]*CCGGgcuaaa..aaagCCCaCUUaGGuGG 80      
            A+::::      :::: GG::CGG      CCG::C+A     AA:CCC:CUU GG:GG
       688 CAACUGG*[ 7]*CCAGUGGGUCGG*[19]*CCGACCAAUUuuGAACCCCGCUUCGGCGG 761     

           >>>>:::
        81 GcuuUUU 87      
           G:UUUUU
       762 GGUUUUU 768     


CM: isrJ

  Plus strand results:

 Query = 1 - 75, Target = 717 - 767
 Score = 16.91, E = 0.2157, P = 7.733e-07, GC =  55

           :::::::::::::::<~~~~~~>-----------<<<<<<<<<<____>>>>>>>>>>:
         1 aguAaauCuGauAuCG*[22]*CCAAACAAAUucaAgCCcCGCgAucGcGCGgGGcUuU 75      
           +   AAUC+  U UCG      CC A CAA UU+:A:CCCCGC  UCG GCGGGG:U:U
       717 UUCGAAUCCUCUCUCG*[ 0]*CCGACCAAUUUUGAACCCCGC-UUCG-GCGGGGUUUU 767     


CM: isrK

  Plus strand results:

 Query = 1 - 77, Target = 699 - 768
 Score = 13.99, E = 0.5931, P = 1.546e-06, GC =  59

           <<<<<<~~~~~>>>>>><<<<<______.>>>>>-------------------<<<<<<_
         1 GCacCc*[8]*gGguGCCGGGaUUgGcA.uCCCGgAaaacaaaaggggcaaaaAgcCGCg 62      
           G  CC:     :GG  C G:G:UU G+A :C:C      C A +++     AA :CCGC 
       699 GGACCA*[1]*UGGGUCGGAGGUUCGAAuCCUCUCUCGCCGACCAAUUUUGAACCCCGC- 753     

           ____>>>>>>::::: 
        63 gcauGCGgcUUUUUU  77      
            +  GCGG: UUUUU 
       754 UUCGGCGGGGUUUUU  768     


CM: istR

  Plus strand results:

 Query = 1 - 130, Target = 644 - 769
 Score = 16.69, E = 0.06753, P = 1.968e-07, GC =  53

           :<<<<<----<<--~~~~~~>>>>>>>,,,,,,<<<<-<<<<~~~~~~>>>>>>>>,<<<
         1 cGuCaaAAuuCgug*[40]*cGuuGaCaUAAuACaGuGugcu*[18]*agcaaCuGaaaa 106     
           +  CAAAA+ ::+G      ::UUG  A    A: ::G:: :      : :::: :+  :
       644 AUUCAAAAAAGCAG*[19]*GCUUGGUAGCGCAACUGGUUUG*[35]*CGACCAAUUUUG 745     

           <<<<<<<<____>>>>>>>>>>>:
       107 aaCCcCGCUuCGGCGgGGuuuuuU 130     
           AACCCCGCUUCGGCGGGGUU:  U
       746 AACCCCGCUUCGGCGGGGUUUUUU 769     


The current version of cmsearch searches only one model against a sequence ; by consequence cmsearchrfam reports sequentially models that give a hit against the query sequence, without trying to rank them. If the complementary strand is searched too (parameter -noreverse not set) the hits against the complementary strand are for each model reported separately after the hits against the forward strand. Within each series however, hits are listed according to decreasing "bit score".

At the end of the example output file you can see a hit of the Rfam model with ID istR against the forward strand of sequence embl:x00066. The positions 1-130 of the model match the positions 644-769 of the query sequence, with a "bit score" of 16.69. The E() value (expected number of false positives) is 0.06753 and the P (probability to find at least one false positive) is 1.968e-07. P is always computed using the actual sequence length, E is computed based on the user set effective sequence length if an E() cutoff is chosen. The hit region in the query sequence has 53% GC.

The alignment is shown in a BLAST-like format, augmented by secondary structure annotation.

Alternative output file: x00066.cmsearchrfam


  [Part of this file has been deleted for brevity]

CM: istR

  Plus strand results:

 Query = 1 - 130, Target = 644 - 769
 Score = 16.69, E = 0.06753, P = 1.968e-07, GC =  53

            xx                      xx       x     x        x    x  xx 
           :<<<<<----<<--~~~~~~>>>>>>>,,,,,,<<<<-<<<<~~~~~~>>>>>>>>,<<<
         1 cGuCaaAAuuCgug*[40]*cGuuGaCaUAAuACaGuGugcu*[18]*agcaaCuGaaaa 106     
           +  CAAAA+ ::+G      ::UUG  A    A: ::G:: :      : :::: :+  :
       644 AUUCAAAAAAGCAG*[19]*GCUUGGUAGCGCAACUGGUUUG*[35]*CGACCAAUUUUG 745     
    POSTX. 99999999987765      2222222222222222211111      345556678999
    POST.X 99999875047427      3588877776666532098642      081392947126

                                xx 
           <<<<<<<<____>>>>>>>>>>>:
       107 aaCCcCGCUuCGGCGgGGuuuuuU 130     
           AACCCCGCUUCGGCGGGGUU:  U
       746 AACCCCGCUUCGGCGGGGUUUUUU 769     
    POSTX. 999999999999999999999999
    POST.X 899999999999999999999999

With the -annotate option you can request to put on top of the aligment x characters to better visualize the location of noncompensatory base pair substitutions.

If you add the option -posterior cmsearchrfam will run the posterior algorithm to determine the probability of the individual bases. E.g. in the first column of the alignment a A of the sequence has been aligned with a c of the model and the POSTX. (first decimal) 9 POST.X (second decimal) 9 mean that in 99% of all alignments sampled by the algorithm that A is aligned with that position of the model. If the probability is very near 100% a * is written instead of two digits.

WUSS notation

INFERNAL annotates RNA secondary structures using a linear string representation called "WUSS notation" (Washington University Secondary Structure notation). In detail, symbols used by WUSS notation in output structure annotation strings are as follows :
Base pairs Base pairs are annotated by nested matching pairs of symbols <>, (), [], or {}. The different symbols indicate the "depth" of the helix in the RNA structure as follows : <> are used for simple terminal stems ; () are used for "internal" helices enclosing a multifurcation of all terminal stems ; [] are used for internal helices enclosing a multifurcation that includes at least one annotated () stem already ; and {} are used for all internal helices enclosing deeper multifurcations.
Hairpin loops Hairpin loop residues are indicated by underscores _. Simple stem loops stand out as e.g. <<<<____>>>>.
Bulge, interior loops Bulge and interior loop residues are indicated by dashes -.
Multifurcation loops Multifurcation loop residues are indicated by commas ,. The mnemonic is "stem 1, stem2", e.g. <<<___>>>,,<<<___>>>.
External residues Unstructured single stranded residues completely outside the structure (unenclosed by any base pairs) are annotated by colons :.
Insertion Insertions relative to a known structure are indicated by periods .. Regions where local structural alignment was invoked, leaving regions of both target and query sequence unaligned, are indicated by tildes ~. These symbols only appear in alignments of a known (query) structure annotation to a target sequence of unknown structure.
Pseudoknot WUSS notation allows pseudoknots to be annotated as pairs of upper case/lower case letters : for example, <<<<_AAAA____>>>>aaaa annotates a simple pseudoknot ; additional pseudoknotted stems could be annotated by Bb, Cc, etc. INFERNAL cannot handle pseudoknots, however ; pseudoknot notation never appears in INFERNAL output ; it is accepted in input files, but ignored.

Data files

cmsearchrfam searches the Rfam databank. In the current version of the program it is not possible to choose another data set instead.

Notes

None.

References

  1. Durbin, R., Eddy, S. R., Krogh, A., and Mitchison, G. J. (1998). Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press, Cambridge UK.
  2. Eddy, S. R. (2002). A memory-efficient dynamic programming algorithm for optimal alignment of a sequence to an RNA secondary structure. BMC Bioinformatics, 3:18.
  3. Eddy, S. R. and Durbin, R. (1994). RNA sequence analysis using covariance models. Nucl. Acids Res., 22:2079-2088.
  4. Griffiths-Jones, S., Bateman, A., Marshall, M., Khanna, A., and Eddy, S. R. (2003). Rfam: an RNA family database. Nucl. Acids Res., 31:439-441.
  5. Nawrocky, E. P. and Eddy, S. R. (2007). Query-dependent banding (QDB) for faster RNA similarity searches. PLoS Comput Biol., 3:e56.
  6. Weinberg, Z. and Ruzzo, W. L. (2004). Exploiting conserved structure for faster annotation of non-coding RNAs without loss of accuracy. Bioinformatics, 20 Suppl 1:i334-i341.
  7. Weinberg, Z. and Ruzzo, W. L. (2006). Sequence-based heuristics for faster annotation of non-coding RNA families. Bioinformatics, 22(1):35-39.

Warnings

cmsearchrfam does not prevent you from setting unreasonable extreme values for some parameters, e.g. E() value of 0, beta parameter of 0 or 1, etc. In such cases you will obtain an empty output file without warning.

Diagnostic Error Messages

There are error messages specific to this program, which are issued when the user has chosen to search only selected models and he has provided items that do not correspond to existing Rfam accession numbers :

 <number> is not acceptable, Rfam Accession number must be of type RFxxxxxx with x digit !

 Rfam entry <number> does not exist !

Exit status

It always exits with status 0.

Known bugs

None.

See also

Program name Description
einverted Finds inverted repeats in nucleotide sequences
hybridize Prediction of hybridization between nucleic acid sequences, with consideration of secondary structure
profit Scan one or more sequences with a simple frequency matrix
prophecy Create frequency matrix or profile from a multiple alignment
prophet Scan one or more sequences with a Gribskov or Henikoff profile
unafold Prediction of optimal and suboptimal RNA or DNA secondary structure

Author(s)

The wrapper application cmsearchrfam was written by Guy Bottu (gbottu@vub.ac.be)
BEN, ULB, Brussels, Belgium

The INFERNAL suite (of which cmsearch is part) was written by Sean Eddy
HHMI Janelia Farm
19700 Helix Drive
Ashburn VA 20147, USA

History

Completed 19 April 2005
Modified 29 January 2008 - adapted to INFERNAL version 0.80
Modified 21 January 2009 - adapted to INFERNAL version 1.0

Target users

This program is intended to be used by everyone and everything, from naive users to embedded scripts.