|
|
cmsearchrfam |
By default cmsearchrfam searches both strands of the nucleic acid sequence(s) and searches the complete Rfam databank. You can force cmsearchrfam to search only the top strand of the query sequence(s) by setting -noreverse. You can also force cmsearchrfam to search only for a selected set of Covariance Models by typing in a comma-separated list of Accession Numbers, e.g. "RF00001,RF00007". In order to find the AC's of the RNA models of your interest, you can search the Rfam databank using MRS.
By default cmsearchrfam can search for local alignments, in which the alignment is allowed to span two or more subsequences if necessary (e.g. if the structures of the query model and the target sequence are only partly shared). You can request cmsearchrfam to operate instead in glocal mode, that is local with respect to the sequence but global with respect to the model. "glocal" searches are less sensitive but might be desirable in some cases.
cmsearch implements a scanning window dynamic programming algorithm (Durbin et al., 1998). It looks for high-scoring alignments of the structural model to any subsequence of the target sequence. It works even if the target sequence is very large (e.g. a whole chromosome) - though it is slow, like all SCFG algorithms.
By default cmsearchrfam uses the inside algorithm, which computes a "bit score" summed over all possible alignments. You can optionally request to use instead the CYK algorithm, which computes the "bit score" of the single most likely alignment of the sequence with the CM. This option increases speed about two times, but decreases the sensitivity.
cmsearchrfam computes a "bit score", which is a log-odds score in log
base two :
S = log2(P(seq|CM)/P(seq|null))
where P(seq|CM) is the probability of the target sequence given the
CM and P(seq|null) is the probability of the target sequence given a
"null hypothesis" model, which models the statistics of random sequence.
cmsearchrfam also computes an E() value. It tells you how many false
positives you would expect to see at or above a certain "bit score". The
CM files in the Rfam have been calibrated, using the cmcalibrate program
from the INFERNAL suite. With this calibration information from the CM
file cmsearch can compute an E() from the "bit score" and the sequence
length. Be however alert, currently INFERNAL E() values are not
perfect.
The current version of cmsearchrfam by default uses a cutoff value for E() of 1.0, that is it reports only hits with an E() value of 1.0 or less. You can instead set yourself an E() value cutoff or set directly a "bit score" cutoff or use any of the three scores set in the Rfam CM file :
A further way to considerably accelerate the search at a minimal cost of losing sensitivity is filtering. By default cmsearchrfam uses two rounds of filters with faster algorithms to prune the database prior to searching with the slow algorithm. The first round of filtering is faster but less strict that the second. First, the full database is searched with the first round filter, then any hits that survive the first round are searched with the second round filter. Finally any hits that survive the first and second round of filtering are searched with the final round search strategy. During the filter rounds, hits are padded with a short strectch of residues on either side prior to searching in the subsequent round. The exact number of residues is dependant on the size of the model being searched.
The first round of filtering is performed with an HMM, which is searched using a scanning "forward" algorithm. The HMM implemented in INFERNAL is called the Plan 9 HMM (to be distinguished from the Plan 7 HMM of the HMMER package). The HMM is derived from the Rfam CM. By default the appropriate threshold for accepting hits is automatically chosen dependant on the "bit score" cutoff of the final search and calibration information taken from the Rfam CM file, in such a way that there is 99.5% chance that a hit that would be accepted in the final search will pass the filter. In some cases HMM filtering is turned off because the expected survival fraction of sequence is close to 1, so that filtering would save little if any time. It is possible to set instead manually an E() value or a "bit score" cutoff.
The second round of filtering is performed with the CM using the QDB algorithm. By default the beta parameter is set to 1E-7, determined empirically as a good tradeoff between sensitivity and speed. By default the appropriate threshold for accepting hits is automatically chosen using an ad hoc procedure that makes that the filter will always allow hits that have an E() value 100 times greater than the cutoff for the final search and that the final round of the search is expected to require at least 3% of the total predicted number of calculations. It is possible to set instead manually an E() value or a "bit score" cutoff. The final round of searching is by default performed with QDB and beta set to 1E-15, determined empirically as giving not much difference in speed and sensitivity with an unbanded search (except for short sequences).
> cmsearchrfam
Scans nucleic acids for RNA genes and conserved motifs using Rfam
Input nucleotide sequence(s): embl:x00066
E : user defined E() value
T : user defined bit score cutoff
GA : gathering bit score cutoff from Rfam CM file
TC : trusted bit score cutoff from Rfam CM file
NC : noise bit score cutoff from Rfam CM file
Cutoff type for final search [E]:
E() value cutoff [1.0]:
Set user defined query sequence length for E() value calculation [Y]:
Effective sequence length used for E() value calculation (in Mbase) [1.5]:
Output file [x00066.cmsearchrfam]:
|
Go to the input files for this example
Go to the output files for this example
Standard (Mandatory) qualifiers (* if not always prompted):
[-seqs] seqall Nucleotide sequence(s) filename and optional
format, or reference (input USA)
-cutoff menu [E] Cutoff type for final search (Values: E
(user defined E() value); T (user defined
bit score cutoff); GA (gathering bit score
cutoff from Rfam CM file); TC (trusted bit
score cutoff from Rfam CM file); NC (noise
bit score cutoff from Rfam CM file))
* -expect float [1.0] E() value = number of Rfam hits with
same or higher bit score that you expect to
find by chance. Note that this depends on
the length of the query sequence. (Number
0.000 or more)
* -[no]setlength toggle [Y] Set user defined query sequence length
for E() calculation instead of using actual
sequence length. This is done by default
because for short sequences using the actual
sequence length often gives false
positives.
* -seqlength float [1.5] Effective sequence length used for E()
value calculation (in Mbase) (Number 0.000
or more)
* -threshold float [0.0] Threshold (bit score cutoff) (Any
numeric value)
[-outfile] outfile [*.cmsearchrfam] Output file name
Additional (Optional) qualifiers (* if not always prompted):
-select string [all] By default all models in the database
are searched for. You can restrict the
search by providing a comma-separated list
of AC's, e.g. RF00001,RF00007. (Any string
is accepted)
-[no]reverse boolean [Y] Search also complementary strand of
query sequence (is default).
-[no]local boolean [Y] Make local alignment. If you switch this
off cmsearchrfam can only find alignments
that are global in the Rfam Covariance
Model.
-cyk boolean Use CYK (Cocke-Younger-Kasami) algorithm
instead of inside algorithm. This is faster
but less sensitive.
-[no]banded toggle [Y] Use QDB (query-dependent banding)
algorithm, that is ignore regions of dynamic
programming matrix that have negligible
contribution (set with -beta) to
probability. Turning this off is more
sensitive but slower.
* -beta float [1E-15] Beta parameter for banding.
Probabilities lower than beta are considered
negligible. (Number from 0.000 to 1.000)
-hmmfilter menu [CM] By default HMM filtering is used as
first filter, that is the sequence is
searched against a Hidden Markov Model
derived from the Rfam Covariance Model and
only the hits are searched further. (Values:
CM (with bit score cutoff derived from Rfam
CM file); E (with user defined E() value
cutoff); T (with user defined bit score
cutoff); 0 (none))
* -hmmexpect float [1.0] E() value cutoff for HMM filtering
(Number 1.000 or more)
* -hmmthreshold float [3.0] Threshold (bit score cutoff) for HMM
filtering (Any numeric value)
-qdbfilter menu [CM] By default QDB filtering is used as
second filter, that is the sequence or the
sequence ranges that passed the first filter
is searched against the Rfam Covariance
Model using the QDB algorithm and only the
hits are searched further. (Values: CM (with
E() value cutoff derived from Rfam CM
file); E (with user defined E() value
cutoff); T (with user defined bit score
cutoff); 0 (none))
* -qdbexpect float [1.0] E() value cutoff for QDB filtering
(Number 1.000 or more)
* -qdbthreshold float [0.0] Threshold (bit score cutoff) for QDB
filtering (Any numeric value)
* -qdbbeta float [1E-7] Beta parameter for banding while QDB
filtering. Probabilities lower than beta are
considered negligible. (Number from 0.000
to 1.000)
-posterior boolean Append posterior probabilities to hit
alignments
-annotate boolean Annotate bases in sequence that break base
pairs in Rfam model
-dna boolean Write hit alignments as DNA
Advanced (Unprompted) qualifiers: (none)
Associated qualifiers:
"-seqs" associated qualifiers
-sbegin1 integer Start of each sequence to be used
-send1 integer End of each sequence to be used
-sreverse1 boolean Reverse (if DNA)
-sask1 boolean Ask for begin/end/reverse
-snucleotide1 boolean Sequence is nucleotide
-sprotein1 boolean Sequence is protein
-slower1 boolean Make lower case
-supper1 boolean Make upper case
-sformat1 string Input sequence format
-sdbname1 string Database name
-sid1 string Entryname
-ufo1 string UFO features
-fformat1 string Features format
-fopenfile1 string Features file name
"-outfile" associated qualifiers
-odirectory2 string Output directory
General qualifiers:
-auto boolean Turn off prompts
-stdout boolean Write first file to standard output
-filter boolean Read first file from standard input, write
first file to standard output
-options boolean Prompt for standard and additional values
-debug boolean Write debug output to program.dbg
-verbose boolean Report some/full command line options
-help boolean Report command line options. More
information on associated and general
qualifiers can be found with -help -verbose
-warning boolean Report warnings
-error boolean Report errors
-fatal boolean Report fatal errors
-die boolean Report dying program messages
|
| Standard (Mandatory) qualifiers | Allowed values | Default | |||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| [-seqs] (Parameter 1) |
Nucleotide sequence(s) filename and optional format, or reference (input USA) | Readable sequence(s) | Required | ||||||||||
| -cutoff | Cutoff type for final search |
|
E | ||||||||||
| -expect | E() value = number of Rfam hits with same or higher bit score that you expect to find by chance. Note that this depends on the length of the query sequence. | Number 0.000 or more | 1.0 | ||||||||||
| -[no]setlength | Set user defined query sequence length for E() calculation instead of using actual sequence length. This is done by default because for short sequences using the actual sequence length often gives false positives. | Toggle value Yes/No | Yes | ||||||||||
| -seqlength | Effective sequence length used for E() value calculation (in Mbase) | Number 0.000 or more | 1.5 | ||||||||||
| -threshold | Threshold (bit score cutoff) | Any numeric value | 0.0 | ||||||||||
| [-outfile] (Parameter 2) |
Output file name | Output file | <sequence>.cmsearchrfam | ||||||||||
| Additional (Optional) qualifiers | Allowed values | Default | |||||||||||
| -select | By default all models in the database are searched for. You can restrict the search by providing a comma-separated list of AC's, e.g. RF00001,RF00007. | Any string is accepted | all | ||||||||||
| -[no]reverse | Search also complementary strand of query sequence (is default). | Boolean value Yes/No | Yes | ||||||||||
| -[no]local | Make local alignment. If you switch this off cmsearchrfam can only find alignments that are global in the Rfam Covariance Model. | Boolean value Yes/No | Yes | ||||||||||
| -cyk | Use CYK (Cocke-Younger-Kasami) algorithm instead of inside algorithm. This is faster but less sensitive. | Boolean value Yes/No | No | ||||||||||
| -[no]banded | Use QDB (query-dependent banding) algorithm, that is ignore regions of dynamic programming matrix that have negligible contribution (set with -beta) to probability. Turning this off is more sensitive but slower. | Toggle value Yes/No | Yes | ||||||||||
| -beta | Beta parameter for banding. Probabilities lower than beta are considered negligible. | Number from 0.000 to 1.000 | 1E-15 | ||||||||||
| -hmmfilter | By default HMM filtering is used as first filter, that is the sequence is searched against a Hidden Markov Model derived from the Rfam Covariance Model and only the hits are searched further. |
|
CM | ||||||||||
| -hmmexpect | E() value cutoff for HMM filtering | Number 1.000 or more | 1.0 | ||||||||||
| -hmmthreshold | Threshold (bit score cutoff) for HMM filtering | Any numeric value | 3.0 | ||||||||||
| -qdbfilter | By default QDB filtering is used as second filter, that is the sequence or the sequence ranges that passed the first filter is searched against the Rfam Covariance Model using the QDB algorithm and only the hits are searched further. |
|
CM | ||||||||||
| -qdbexpect | E() value cutoff for QDB filtering | Number 1.000 or more | 1.0 | ||||||||||
| -qdbthreshold | Threshold (bit score cutoff) for QDB filtering | Any numeric value | 0.0 | ||||||||||
| -qdbbeta | Beta parameter for banding while QDB filtering. Probabilities lower than beta are considered negligible. | Number from 0.000 to 1.000 | 1E-7 | ||||||||||
| -posterior | Append posterior probabilities to hit alignments | Boolean value Yes/No | No | ||||||||||
| -annotate | Annotate bases in sequence that break base pairs in Rfam model | Boolean value Yes/No | No | ||||||||||
| -dna | Write hit alignments as DNA | Boolean value Yes/No | No | ||||||||||
| Advanced (Unprompted) qualifiers | Allowed values | Default | |||||||||||
| (none) | |||||||||||||
Searching the Rfam databank of RNA covariance models
The databank currently contains 1372 models.
CM: tRNA
Plus strand results:
Query = 1 - 71, Target = 438 - 510
Score = 67.62, E = 1.821e-15, P = 6.641e-21, GC = 52
(((((((,,<<<<___.____>>>>,<<<<<_______>>>>>,,,.,<<<<<_______
1 gccccugUAGcucAaU.GGUAgagCauuggaCUuuuAAuccaaagg.ugugGGUUCgAaU 58
G:::CU:UAGCUCA+U GGUAGAGC :UGGA+U+U A UCCA:+ G +:UGGGUUCGAAU
438 GUGGCUAUAGCUCAGUuGGUAGAGCCCUGGAUUGUGAUUCCAGUUGuCGUGGGUUCGAAU 497
>>>>>))))))):
59 CCcaccaggggcA 71
CCCA::AG:::C
498 CCCAUUAGCCACC 510
Query = 1 - 71, Target = 663 - 736
Score = 59.92, E = 2.486e-13, P = 9.066e-19, GC = 62
(((((((,,<<<<___..____>>>>,<<<<<_______>>>>>,,,.,<<<<<______
1 gccccugUAGcucAaU..GGUAgagCauuggaCUuuuAAuccaaagg.ugugGGUUCgAa 57
:::C::GUAGC:CA+ GGUAG:GCA:::G:+UU A:C::: GG +:::GGUUCGAA
663 CGGCGAGUAGCGCAGCuuGGUAGCGCAACUGGUUUGGGACCAGUGGGuCGGAGGUUCGAA 722
_>>>>>))))))):
58 UCCcaccaggggcA 71
UCC:::C::G:::A
723 UCCUCUCUCGCCGA 736
Query = 1 - 71, Target = 307 - 380
Score = 59.68, E = 2.895e-13, P = 1.056e-18, GC = 65
(((((((,,<<<<___.___._>>>>,<<<<<_______>>>>>,,,.,<<<<<______
1 gccccugUAGcucAaU.GGU.AgagCauuggaCUuuuAAuccaaagg.ugugGGUUCgAa 57
:C:CC:GUAGCUCA+ GG AGAGC+:UG::CU A::CA:AGG + ::GGUUCGAA
307 GCGCCCGUAGCUCAGCuGGAuAGAGCGCUGCCCUCCGGAGGCAGAGGuCUCAGGUUCGAA 366
_>>>>>))))))):
58 UCCcaccaggggcA 71
UCC:: C:GG:G:A
367 UCCUGUCGGGCGUA 380
Query = 1 - 71, Target = 534 - 617
Score = 43.25, E = 1.048e-08, P = 3.823e-14, GC = 61
(((((((,,<<<<___.___._>>>>,<<<<<_______>>>>>,,............,,
1 gccccugUAGcucAaU.GGU.AgagCauuggaCUuuuAAuccaaag............gu 46
GC:: :GU GC::AAU GGU ::GC+:U:::+U + :::A:+ GU
534 GCGAAGGUGGCGGAAUuGGUaGACGCGCUAGCUUCAGGUGUUAGU-guccuuacggacGU 592
<<<<<_______>>>>>))))))):
47 gugGGUUCgAaUCCcaccaggggcA 71
G:GGGUUC+A UCCC:CC: ::GCA
593 GGGGGUUCAAGUCCCCCCCCUCGCA 617
CM: GcvB
Plus strand results:
Query = 1 - 201, Target = 623 - 772
Score = 11.64, E = 0.368, P = 9.521e-07, GC = 49
::::::::~~~~~~,<<<<<<__................_>>>>>>,,,,,,~~~~~~,<
1 aCuuaauu*[97]*uaGuAacaa................AguUaCuuAUUUu*[43]*ua 172
ACU AAU+ :GU ::A :: AC: AUUUU A
623 ACUCAAUG*[76]*GGGUCGGAGguucgaauccucucucGCCGACCAAUUUU*[ 0]*GA 746
<<<<<<<_______>>>>>>>>:::::::
173 gCACCGCcuauuauGCGGUGcuUUUUUuU 201
:C:CCGC+U +GCGG:G:UUUUU UU
747 ACCCCGCUU---CGGCGGGGUUUUUUGUU 772
CM: tmRNA
Plus strand results:
Query = 333 - 349, Target = 593 - 609
Score = 12.75, E = 0.02046, P = 5.306e-08, GC = 71
<<<<<_______>>>>>
333 GcGGGUUCGAuUCCCgC 349
G:GGGUUC+A UCCC:C
593 GGGGGUUCAAGUCCCCC 609
Query = 332 - 349, Target = 485 - 502
Score = 11.84, E = 0.04013, P = 1.041e-07, GC = 56
,<<<<<_______>>>>>
332 CGcGGGUUCGAuUCCCgC 349
C :GGGUUCGA UCCC:
485 CGUGGGUUCGAAUCCCAU 502
CM: RsmY
Plus strand results:
Query = 99 - 120, Target = 745 - 766
Score = 16.91, E = 0.01211, P = 2.555e-08, GC = 68
<<<<<<<<<____>>>>>>>>>
99 aaaCCCCGCuucGGCGGGGuuu 120
:AACCCCGCUUCGGCGGGGUU:
745 GAACCCCGCUUCGGCGGGGUUU 766
Minus strand results:
Query = 100 - 119, Target = 765 - 746
Score = 12.07, E = 0.2719, P = 5.738e-07, GC = 70
<<<<<<<<____>>>>>>>>
100 aaCCCCGCuucGGCGGGGuu 119
AACCCCGC + GCGGGGUU
765 AACCCCGCCGAAGCGGGGUU 746
CM: PrrF
Plus strand results:
Query = 1 - 150, Target = 633 - 773
Score = 24.38, E = 0.0002918, P = 7.713e-10, GC = 50
::::::::::::::::::::::::::::::::::<<<<<<<<<~~~~~>>>>>>>>>,,,
1 UUGACuuggAAAUGAgAAUgaUUAUuAUuuaacaCagCuGguC*[4]*GacCaGcuGgUA 59
UUGA U++AAAU AA +A A UAU U+ AG: G::C G::C :CU GU
633 UUGAACUAAAAAUUCAAAAAAGCAGUAUAUCGGCGAGUAGCGC*[9]*GCGCAACUGGUU 696
,,<<<<<<<-<_________>-->>>>>>>,,,,,,,<<<____>>>,,,,,,,,,,,,,
60 AaCuGagaGaCcuaagaGUCGGACucuCaGAUUAUCUCCuCAUCaGGCUAAUCACGGUUu 119
: G :::A G GUC GA::: C :AU UCUC UC GC A CA UUU
697 --UGGGACCAG---UGGGUCGGAGGUUCGAAUCCUCUC----UC--GCCGACCAA--UUU 743
,,<<<<<<<<_____>>>>>>>>::::::::
120 UUGACCCGGCuuUUUGCCGGGUCUUUUUUUg 150
U :ACCC:GCUU GC:GGGU:UUUU UU
744 UGAACCCCGCUUCG-GCGGGGUUUUUUGUUU 773
CM: Leu_leader
Plus strand results:
Query = 1 - 148, Target = 697 - 769
Score = 14.12, E = 0.5537, P = 1.664e-06, GC = 58
:::<-<<<<~~~~~~>>>>->----------<<<<<<<<<<<____>>>>>>>>>>>:::
1 caggAgcug*[94]*cagcccacAaaAaAaaaAAaCCCGCGCuAaUGCGCGGGuUUuUUU 148
GGA: :G C: :CC + AA+ AA:CCCGC +U GCGGG:UU UUU
697 UGGGACCAG*[23]*CUCGCCGACCAAUUUUGAACCCCGC----UUCGGCGGGGUUUUUU 769
CM: Trp_leader
Plus strand results:
Query = 1 - 100, Target = 691 - 771
Score = 15.53, E = 0.2358, P = 9.476e-07, GC = 57
::::::::<<<<-~~~~~~-->>>>-----------------------------<<<<<<
1 CauGGUUGGuGgC*[24]*aauCaCGuAUaucUguaAacAGaAAaaAgAuACCaaGCCCG 78
C+ G UUGG ::C A :: CG AUC +C + A +A+ U AA CCCG
691 CUGGUUUGGGACC*[10]*AGGUUCGA--AUCCUCUCUCGCCGACCAAUUUUGAACCCCG 752
<<____>->>>>>>>:::::::
79 CuAauuaAGCGGGCuUUUUuaU 100
C GCGGG UUUUUU U
753 CUUC--G-GCGGGGUUUUUUGU 771
CM: PK-repBA
Minus strand results:
Query = 1 - 124, Target = 766 - 678
Score = 13.28, E = 0.3755, P = 1.163e-06, GC = 58
:<<<<<<<~~~~~~>>>>>>>:::::::::::::::::::::::::::::::::::::::
1 AACCCCCu*[39]*aGGGGGUccAAaCaGgCauauGcCaGuAAGGAUUAUAaCaaAUGca 93
AA:CCCC: :GGGG:UC AAA +GG U+ GC AG+ AGGAUU AAC G
766 AAACCCCG*[ 6]*CGGGGUUCAAAAUUGG--UCGGCGAGAGAGGAUUCGAACCUCCGAC 709
:::::::::::::::::::::::::::::::
94 UAauCuAAaaCAAuaCaacCCGGCCaauaau 124
A+ CAA +CA++ GC A +AA
708 CCACUGGUCCCAAACCAGUUGCGCUACCAAG 678
CM: CRISPR-DR32
Minus strand results:
Query = 1 - 37, Target = 774 - 740
Score = 18.49, E = 0.3654, P = 1.142e-06, GC = 46
:::::::::::::<<<<________>>>>::::::::
1 aUaagAAUCAaCGCCCCGAgGAAGAGGGGaUUAAAAC 37
A AA+AA AA CCCCG +GAAG GGGG+U AAAA
774 AAAACAAAAAA--CCCCGCCGAAGCGGGGUUCAAAAU 740
CM: isrB
Plus strand results:
Query = 1 - 87, Target = 688 - 768
Score = 19.85, E = 0.04772, P = 1.448e-07, GC = 57
:::<<<<~~~~~~>>>>,,<<<<<<~~~~~~>>>>>>,,,,..,<<<<<<<<____>>>>
1 aauacca*[16]*ugguaGgcCCGG*[18]*CCGGgcuaaa..aaagCCCaCUUaGGuGG 80
A+:::: :::: GG::CGG CCG::C+A AA:CCC:CUU GG:GG
688 CAACUGG*[ 7]*CCAGUGGGUCGG*[19]*CCGACCAAUUuuGAACCCCGCUUCGGCGG 761
>>>>:::
81 GcuuUUU 87
G:UUUUU
762 GGUUUUU 768
CM: isrJ
Plus strand results:
Query = 1 - 75, Target = 717 - 767
Score = 16.91, E = 0.2157, P = 7.733e-07, GC = 55
:::::::::::::::<~~~~~~>-----------<<<<<<<<<<____>>>>>>>>>>:
1 aguAaauCuGauAuCG*[22]*CCAAACAAAUucaAgCCcCGCgAucGcGCGgGGcUuU 75
+ AAUC+ U UCG CC A CAA UU+:A:CCCCGC UCG GCGGGG:U:U
717 UUCGAAUCCUCUCUCG*[ 0]*CCGACCAAUUUUGAACCCCGC-UUCG-GCGGGGUUUU 767
CM: isrK
Plus strand results:
Query = 1 - 77, Target = 699 - 768
Score = 13.99, E = 0.5931, P = 1.546e-06, GC = 59
<<<<<<~~~~~>>>>>><<<<<______.>>>>>-------------------<<<<<<_
1 GCacCc*[8]*gGguGCCGGGaUUgGcA.uCCCGgAaaacaaaaggggcaaaaAgcCGCg 62
G CC: :GG C G:G:UU G+A :C:C C A +++ AA :CCGC
699 GGACCA*[1]*UGGGUCGGAGGUUCGAAuCCUCUCUCGCCGACCAAUUUUGAACCCCGC- 753
____>>>>>>:::::
63 gcauGCGgcUUUUUU 77
+ GCGG: UUUUU
754 UUCGGCGGGGUUUUU 768
CM: istR
Plus strand results:
Query = 1 - 130, Target = 644 - 769
Score = 16.69, E = 0.06753, P = 1.968e-07, GC = 53
:<<<<<----<<--~~~~~~>>>>>>>,,,,,,<<<<-<<<<~~~~~~>>>>>>>>,<<<
1 cGuCaaAAuuCgug*[40]*cGuuGaCaUAAuACaGuGugcu*[18]*agcaaCuGaaaa 106
+ CAAAA+ ::+G ::UUG A A: ::G:: : : :::: :+ :
644 AUUCAAAAAAGCAG*[19]*GCUUGGUAGCGCAACUGGUUUG*[35]*CGACCAAUUUUG 745
<<<<<<<<____>>>>>>>>>>>:
107 aaCCcCGCUuCGGCGgGGuuuuuU 130
AACCCCGCUUCGGCGGGGUU: U
746 AACCCCGCUUCGGCGGGGUUUUUU 769
|
The current version of cmsearch searches only one model against a sequence ; by consequence cmsearchrfam reports sequentially models that give a hit against the query sequence, without trying to rank them. If the complementary strand is searched too (parameter -noreverse not set) the hits against the complementary strand are for each model reported separately after the hits against the forward strand. Within each series however, hits are listed according to decreasing "bit score".
At the end of the example output file you can see a hit of the Rfam model with ID istR against the forward strand of sequence embl:x00066. The positions 1-130 of the model match the positions 644-769 of the query sequence, with a "bit score" of 16.69. The E() value (expected number of false positives) is 0.06753 and the P (probability to find at least one false positive) is 1.968e-07. P is always computed using the actual sequence length, E is computed based on the user set effective sequence length if an E() cutoff is chosen. The hit region in the query sequence has 53% GC.
The alignment is shown in a BLAST-like format, augmented by secondary structure annotation.
[Part of this file has been deleted for brevity]
CM: istR
Plus strand results:
Query = 1 - 130, Target = 644 - 769
Score = 16.69, E = 0.06753, P = 1.968e-07, GC = 53
xx xx x x x x xx
:<<<<<----<<--~~~~~~>>>>>>>,,,,,,<<<<-<<<<~~~~~~>>>>>>>>,<<<
1 cGuCaaAAuuCgug*[40]*cGuuGaCaUAAuACaGuGugcu*[18]*agcaaCuGaaaa 106
+ CAAAA+ ::+G ::UUG A A: ::G:: : : :::: :+ :
644 AUUCAAAAAAGCAG*[19]*GCUUGGUAGCGCAACUGGUUUG*[35]*CGACCAAUUUUG 745
POSTX. 99999999987765 2222222222222222211111 345556678999
POST.X 99999875047427 3588877776666532098642 081392947126
xx
<<<<<<<<____>>>>>>>>>>>:
107 aaCCcCGCUuCGGCGgGGuuuuuU 130
AACCCCGCUUCGGCGGGGUU: U
746 AACCCCGCUUCGGCGGGGUUUUUU 769
POSTX. 999999999999999999999999
POST.X 899999999999999999999999
|
With the -annotate option you can request to put on top of the aligment x characters to better visualize the location of noncompensatory base pair substitutions.
If you add the option -posterior cmsearchrfam will run the posterior algorithm to determine the probability of the individual bases. E.g. in the first column of the alignment a A of the sequence has been aligned with a c of the model and the POSTX. (first decimal) 9 POST.X (second decimal) 9 mean that in 99% of all alignments sampled by the algorithm that A is aligned with that position of the model. If the probability is very near 100% a * is written instead of two digits.
| Base pairs | Base pairs are annotated by nested matching pairs of symbols <>, (), [], or {}. The different symbols indicate the "depth" of the helix in the RNA structure as follows : <> are used for simple terminal stems ; () are used for "internal" helices enclosing a multifurcation of all terminal stems ; [] are used for internal helices enclosing a multifurcation that includes at least one annotated () stem already ; and {} are used for all internal helices enclosing deeper multifurcations. |
| Hairpin loops | Hairpin loop residues are indicated by underscores _. Simple stem loops stand out as e.g. <<<<____>>>>. |
| Bulge, interior loops | Bulge and interior loop residues are indicated by dashes -. |
| Multifurcation loops | Multifurcation loop residues are indicated by commas ,. The mnemonic is "stem 1, stem2", e.g. <<<___>>>,,<<<___>>>. |
| External residues | Unstructured single stranded residues completely outside the structure (unenclosed by any base pairs) are annotated by colons :. |
| Insertion | Insertions relative to a known structure are indicated by periods .. Regions where local structural alignment was invoked, leaving regions of both target and query sequence unaligned, are indicated by tildes ~. These symbols only appear in alignments of a known (query) structure annotation to a target sequence of unknown structure. |
| Pseudoknot | WUSS notation allows pseudoknots to be annotated as pairs of upper case/lower case letters : for example, <<<<_AAAA____>>>>aaaa annotates a simple pseudoknot ; additional pseudoknotted stems could be annotated by Bb, Cc, etc. INFERNAL cannot handle pseudoknots, however ; pseudoknot notation never appears in INFERNAL output ; it is accepted in input files, but ignored. |
<number> is not acceptable, Rfam Accession number must be of type RFxxxxxx with x digit ! Rfam entry <number> does not exist !
| Program name | Description |
|---|---|
| einverted | Finds inverted repeats in nucleotide sequences |
| hybridize | Prediction of hybridization between nucleic acid sequences, with consideration of secondary structure |
| profit | Scan one or more sequences with a simple frequency matrix |
| prophecy | Create frequency matrix or profile from a multiple alignment |
| prophet | Scan one or more sequences with a Gribskov or Henikoff profile |
| unafold | Prediction of optimal and suboptimal RNA or DNA secondary structure |
The INFERNAL suite (of which cmsearch is part) was written by Sean Eddy
HHMI Janelia Farm
19700 Helix Drive
Ashburn VA 20147, USA