genemark
GeneMark.HMM exploits the differences in oligonucleotide composition between coding and noncoding DNA. For some prokaryotes it distinguishes between "typical" and "atypical" genes; for many prokaryotes it can also search for ribosomal binding sites (RBS) in order to locate the start of translation more accurately. For eukaryotes it detects the intron-exon splice sites. Note, however, that GeneMark.HMM searches only for coding DNA; it has no provision for finding noncoding exons.
If you submit to EMBL/GenBank/DDBJ a sequence with a CDS you found using genemark, you should include in the Comment field a statement "The protein coding regions have been predicted from computer analysis using the program GeneMark.HMM from Gene Probe, Inc.".
Nucleic Acids Research, 1999, Vol. 27, No.19, 3911-3920
"Heuristic approach to deriving models for gene finding"
John Besemer and Mark Borodovsky
This method produces fairly accurate models from minimal information:
the G+C composition of a sequence. These models can be used to find
genes in anonymous prokaryotic genomes, in the genomes of organelles,
viruses, phages and plasmids, and in highly inhomogeneous genomes
where adjustment of models to local DNA composition is needed.
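Because the heuristic models depend only on G+C composition, it can help to measure the %GC of a sequence before choosing a value for the -gc qualifier, which this page documents as an integer from 30 to 70. The following is a minimal sketch; the function names are hypothetical, and clamping out-of-range values to the nearest limit is an assumption made for illustration, not documented behaviour of the program.

```python
def gc_percent(sequence: str) -> float:
    """Return the G+C percentage, counting only A, C, G and T residues."""
    seq = sequence.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    if gc + at == 0:
        raise ValueError("no A, C, G or T residues found")
    return 100.0 * gc / (gc + at)


def heuristic_gc_setting(sequence: str, low: int = 30, high: int = 70) -> int:
    """Round the %GC and clamp it into the 30-70 range (clamping is an assumption)."""
    return max(low, min(high, round(gc_percent(sequence))))
```

The returned integer could then be supplied as the -gc value when the heuristic gene model class (H) is selected.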
A collection of models covering G+C compositions from 30% to 70% for genetic codes 11, 4 and 1 was generated with this algorithm. The following table shows the START and STOP codons allowed in the models for each genetic code:
| Genetic code | 11 | 4 | 1 |
|---|---|---|---|
| Start codon | ATG GTG TTG | ATG GTG TTG | ATG |
| Stop codon | TAA TAG TGA | TAA TAG | TAA TAG TGA |

Heuristic models can be applied to analyses of sequences as small as 400 nt.
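The start and stop codons listed for each genetic code can be expressed as a small lookup table. The sketch below is only an illustration of how those codon sets constrain a candidate open reading frame; the names `START_CODONS`, `STOP_CODONS` and `is_complete_orf` are hypothetical, and the check is a plain string test, not the HMM used by GeneMark.HMM.

```python
# Codon sets copied from the table of heuristic models (genetic codes 11, 4, 1).
START_CODONS = {
    11: {"ATG", "GTG", "TTG"},
    4: {"ATG", "GTG", "TTG"},
    1: {"ATG"},
}
STOP_CODONS = {
    11: {"TAA", "TAG", "TGA"},
    4: {"TAA", "TAG"},
    1: {"TAA", "TAG", "TGA"},
}


def is_complete_orf(dna: str, gencode: int) -> bool:
    """True if dna starts with an allowed start codon, ends with an allowed
    stop codon, and has no in-frame stop codon in between."""
    seq = dna.upper()
    if len(seq) < 6 or len(seq) % 3 != 0:
        return False
    codons = [seq[i:i + 3] for i in range(0, len(seq), 3)]
    stops = STOP_CODONS[gencode]
    return (
        codons[0] in START_CODONS[gencode]
        and codons[-1] in stops
        and not any(codon in stops for codon in codons[1:-1])
    )
```

For example, a frame ending in TGA is complete under genetic codes 11 and 1 but not under code 4, where TGA is not a stop codon.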
> genemark
Finds potential genes using a species specific HMM
Input nucleotide sequence(s): embl:x02419
B : eubacterial
A : archaebacterial
H : heuristic (%GC)
E : eukaryotic
3 : eukaryotic (using Genemark.HMM version 3)
Gene Model Class [E]:
barley : Hordeum vulgare (barley)
chicken : Gallus gallus (chicken)
corn : Zea mays (maize)
human_00_43 : Homo sapiens (human) 0-43 %GC
human_43_49 : Homo sapiens (human) 43-49 %GC
human_49_99 : Homo sapiens (human) 49-99 %GC
mouse : Mus musculus (mouse)
wheat : Triticum aestivum (wheat)
Eukaryotic Gene Model [human_43_49]:
Output file [x02419.gmhmm]:
Second example, with a prokaryotic sequence:
> genemark
Finds potential genes using a species specific HMM
Input nucleotide sequence(s): embl:v00307
B : eubacterial
A : archaebacterial
H : heuristic (%GC)
E : eukaryotic
3 : eukaryotic (using Genemark.HMM version 3)
Gene Model Class [E]: B
1 : Acinetobacter_sp_ADP1
2 : Agrobacterium_tumefaciens_C58_Cereon_chromosome_circular
3 : Agrobacterium_tumefaciens_C58_Cereon_chromosome_linear
4 : Agrobacterium_tumefaciens_C58_Cereon_plasmid_AT
5 : Agrobacterium_tumefaciens_C58_Cereon_plasmid_Ti
6 : Agrobacterium_tumefaciens_C58_UWash_chromosome_circular
7 : Agrobacterium_tumefaciens_C58_UWash_chromosome_linear
[some lines have been deleted for brevity]
122 : Enterococcus_faecalis_V583_plasmid_pTEF3
123 : Erwinia_carotovora_atroseptica_SCRI1043
124 : Escherichia_coli_CFT073
125 : Escherichia_coli_K12
126 : Escherichia_coli_O157H7
127 : Escherichia_coli_O157H7_EDL933
128 : Escherichia_coli_O157H7_plasmid_pO157
[some lines have been deleted for brevity]
351 : Yersinia_pestis_KIM_plasmid_pMT-1
352 : Yersinia_pestis_biovar_Mediaevails
353 : Yersinia_pestis_biovar_Mediaevails_plasmid_pCD1
354 : Yersinia_pseudotuberculosis_IP32953
355 : Yersinia_pseudotuberculosis_IP32953_plasmid_pYV
356 : Yersinia_pseudotuberculosis_IP32953_plasmid_pYptb32953
357 : Zymomonas_mobilis_ZM4
Eubacterial Gene Model [125]:
Output file [v00307.gmhmm]:
Standard (Mandatory) qualifiers (* if not always prompted):
[-seqs] seqall Nucleotide sequence(s) filename and optional
format, or reference (input USA)
-class menu [E] Gene Model Class (Values: B
(eubacterial); A (archaebacterial); H
(heuristic (%GC)); E (eukaryotic); 3
(eukaryotic (using Genemark.HMM version 3)))
* -bmodel selection [125] Eubacterial Gene Model
* -amodel selection [28] Archaebacterial Gene Model
* -emodel menu [human_43_49] Eukaryotic Gene Model (Values:
barley (Hordeum vulgare (barley)); chicken
(Gallus gallus (chicken)); corn (Zea mays
(maize)); human_00_43 (Homo sapiens (human)
0-43 %GC); human_43_49 (Homo sapiens (human)
43-49 %GC); human_49_99 (Homo sapiens
(human) 49-99 %GC); mouse (Mus musculus
(mouse)); wheat (Triticum aestivum (wheat)))
* -model menu [a_thaliana] Eukaryotic Gene Model for
version 3 (Values: a_gambiae (Anopheles
gambiae (malaria mosquito)); a_thaliana
(Arabidopsis thaliana); c_elegans
(Caenorhabditis elegans); c_intestinalis
(Ciona intestinalis); c_reinhardtii
(Chlamydomonas reinhardtii); c_remanei
(Caenorhabditis remanei); d_melanogaster
(Drosophila melanogaster (fruitfly));
m_truncatula (Medicago truncatula (barrel
medic)); o_sativa (Oryza sativa (rice)))
* -gencode menu [1] Genetic code for Heuristic Model
(Values: 1 (Standard); 11 (Bacterial); 4
(Mitochondrial))
* -gc integer [50] %GC for Heuristic Model (Integer from
30 to 70)
[-outfile] outfile [*.genemark] Output file name
Additional (Optional) qualifiers (* if not always prompted):
* -[no]rbs boolean [Y] Use Ribosome Binding Site model
Advanced (Unprompted) qualifiers: (none)
Associated qualifiers:
"-seqs" associated qualifiers
-sbegin1 integer Start of each sequence to be used
-send1 integer End of each sequence to be used
-sreverse1 boolean Reverse (if DNA)
-sask1 boolean Ask for begin/end/reverse
-snucleotide1 boolean Sequence is nucleotide
-sprotein1 boolean Sequence is protein
-slower1 boolean Make lower case
-supper1 boolean Make upper case
-sformat1 string Input sequence format
-sdbname1 string Database name
-sid1 string Entryname
-ufo1 string UFO features
-fformat1 string Features format
-fopenfile1 string Features file name
"-outfile" associated qualifiers
-odirectory2 string Output directory
General qualifiers:
-auto boolean Turn off prompts
-stdout boolean Write standard output
-filter boolean Read standard input, write standard output
-options boolean Prompt for standard and additional values
-debug boolean Write debug output to program.dbg
-verbose boolean Report some/full command line options
-help boolean Report command line options. More
information on associated and general
qualifiers can be found with -help -verbose
-warning boolean Report warnings
-error boolean Report errors
-fatal boolean Report fatal errors
-die boolean Report dying program messages
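The qualifiers listed above also allow the program to be run without prompts. The sketch below only assembles an argument list from those documented qualifiers; the helper name `build_genemark_command` is hypothetical, and it assumes that the executable is called `genemark`, is on the PATH, and accepts a model name (such as the default shown for -bmodel) as the value of the model qualifiers.

```python
from typing import List, Optional

# Model qualifier used by each gene model class, as listed in the help text.
MODEL_QUALIFIER = {"B": "-bmodel", "A": "-amodel", "E": "-emodel", "3": "-model"}


def build_genemark_command(
    seqs: str,
    gene_class: str = "E",
    model: Optional[str] = None,
    gencode: Optional[int] = None,
    gc: Optional[int] = None,
    outfile: Optional[str] = None,
    use_rbs: bool = True,
) -> List[str]:
    """Assemble a non-interactive command line from the documented qualifiers."""
    if gene_class not in ("B", "A", "H", "E", "3"):
        raise ValueError("class must be one of B, A, H, E, 3")
    cmd = ["genemark", "-seqs", seqs, "-class", gene_class]
    if gene_class == "H":
        # The heuristic class takes a genetic code and a %GC, not a named model.
        if gencode is not None:
            cmd += ["-gencode", str(gencode)]
        if gc is not None:
            if not 30 <= gc <= 70:
                raise ValueError("-gc must be an integer from 30 to 70")
            cmd += ["-gc", str(gc)]
    elif model is not None:
        cmd += [MODEL_QUALIFIER[gene_class], model]
    if outfile is not None:
        cmd += ["-outfile", outfile]
    if not use_rbs:
        cmd.append("-norbs")
    cmd.append("-auto")  # general qualifier: turn off prompts
    return cmd
```

The resulting list could be passed to `subprocess.run(cmd, check=True)` on a system where the program is installed. Building the list separately keeps the qualifier logic testable without the commercial binary.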
| Qualifier | Description | Allowed values | Default |
|---|---|---|---|
| Standard (Mandatory) qualifiers | | | |
| [-seqs] (Parameter 1) | Nucleotide sequence(s) filename and optional format, or reference (input USA) | Readable sequence(s) | Required |
| -class | Gene Model Class | B (eubacterial); A (archaebacterial); H (heuristic (%GC)); E (eukaryotic); 3 (eukaryotic (using Genemark.HMM version 3)) | E |
| -bmodel | Eubacterial Gene Model | Acinetobacter_sp_ADP1 Agrobacterium_tumefaciens_C58_Cereon_chromosome_circular Agrobacterium_tumefaciens_C58_Cereon_chromosome_linear ... Yersinia_pseudotuberculosis_IP32953_plasmid_pYptb32953 Zymomonas_mobilis_ZM4 (choose from list of 357 models) | Escherichia_coli_K12 |
| -amodel | Archaebacterial Gene Model | Aeropyrum_pernix Archaeoglobus_fulgidus Haloarcula_marismortui_ATCC_43049_chromosome_I ... Thermoplasma_acidophilum Thermoplasma_volcanium (choose from list of 32 models) | Sulfolobus_solfataricus |
| -emodel | Eukaryotic Gene Model | barley (Hordeum vulgare (barley)); chicken (Gallus gallus (chicken)); corn (Zea mays (maize)); human_00_43 (Homo sapiens (human) 0-43 %GC); human_43_49 (Homo sapiens (human) 43-49 %GC); human_49_99 (Homo sapiens (human) 49-99 %GC); mouse (Mus musculus (mouse)); wheat (Triticum aestivum (wheat)) | human_43_49 |
| -model | Eukaryotic Gene Model for version 3 | a_gambiae (Anopheles gambiae (malaria mosquito)); a_thaliana (Arabidopsis thaliana); c_elegans (Caenorhabditis elegans); c_intestinalis (Ciona intestinalis); c_reinhardtii (Chlamydomonas reinhardtii); c_remanei (Caenorhabditis remanei); d_melanogaster (Drosophila melanogaster (fruitfly)); m_truncatula (Medicago truncatula (barrel medic)); o_sativa (Oryza sativa (rice)) | a_thaliana |
| -gencode | Genetic code for Heuristic Model | 1 (Standard); 11 (Bacterial); 4 (Mitochondrial) | 1 |
| -gc | %GC for Heuristic Model | Integer from 30 to 70 | 50 |
| [-outfile] (Parameter 2) | Output file name | Output file | <sequence>.gmhmm |
| Additional (Optional) qualifiers | | | |
| -[no]rbs | Use Ribosome Binding Site model | Boolean value Yes/No | Yes |
| Advanced (Unprompted) qualifiers | | | |
| (none) | | | |
GeneMark.hmm (Version 2.2a)
Sequence name: embl-id:X02419
Sequence length: 7258 bp
G+C content: 54.38%
Matrices file: /opt/sw/genemark/eukaryotic/human_43_49.mtx (Homo sapiens)
Wed Nov 15 16:07:00 2006

Predicted genes/exons

Gene Exon Strand Exon           Exon Range     Exon    Start/End
  #    #         Type                          Length    Frame
  1    1    -    Initial       846     1075     230       2 1
  2    1    +    Initial      1227     1283      57       1 3
  2    2    +    Internal     1701     1728      28       1 1
  2    3    +    Internal     1875     1982     108       2 1
  2    4    +    Internal     2586     2760     175       2 2
  2    5    +    Internal     2954     3045      92       3 1
  2    6    +    Internal     3203     3422     220       2 2
  2    7    +    Internal     3644     3792     149       3 1
  2    8    +    Internal     4458     4598     141       2 1
  2    9    +    Internal     4945     5093     149       2 3
  2   10    +    Terminal     6083     6259     177       1 3
>gene_1 length: 76 aa
MGSGLRAAWIGTRGGRPGRGALTPCSRRGAQGARAAQGGAGRRSGAGAAAHPARGSQDRG
THRWLRQEGASRRCGD
>gene_2 length: 431 aa
MRALLARLLLCVLVVSDSKGSNELHQVPSNCDCLNGGTCVSNKYFSNIHWCNCPKKFGGQ
HCEIDKSKTCYEGNGHFYRGKASTDTMGRPCLPWNSATVLQQTYHAHRSDALQLGLGKHN
YCRNPDNRRRPWCYVQVGLKPLVQECMVHDCADGKKPSSPPEELKFQCGQKTLRPRFKII
GGEFTTIENQPWFAAIYRRHRGGSVTYVCGGSLMSPCWVISATHCFIDYPKKEDYIVYLG
RSRLNSNTQGEMKFEVENLILHKDYSADTLAHHNDIALLKIRSKEGRCAQPSRTIQTICL
PSMYNDPQFGTSCEITGFGKENSTDYLYPEQLKMTVVKLISHRECQQPHYYGSEVTTKML
CAADPQWKTDSCQGDSGGPLVCSLQGRMTLTGIVSWGRGCALKDKPGVYTRVSHFLPWIR
SHTKEENGLAL
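The exon table in the eukaryotic report can be read back into structured records for downstream checks. The sketch below assumes the nine whitespace-separated columns seen in this example (gene, exon, strand, type, start, end, length, start frame, end frame); the names `Exon`, `parse_exon_rows` and `coding_length_by_gene` are hypothetical, and other GeneMark.hmm versions may lay the report out differently.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple


class Exon(NamedTuple):
    gene: int
    exon: int
    strand: str
    exon_type: str
    start: int
    end: int
    length: int
    start_frame: int
    end_frame: int


def parse_exon_rows(report: str) -> List[Exon]:
    """Extract exon rows from a report, skipping header and free-text lines."""
    exons = []
    for line in report.splitlines():
        f = line.split()
        # Data rows have nine fields, a numeric gene number and a strand sign.
        if len(f) != 9 or not f[0].isdigit() or f[2] not in ("+", "-"):
            continue
        exon = Exon(int(f[0]), int(f[1]), f[2], f[3],
                    int(f[4]), int(f[5]), int(f[6]), int(f[7]), int(f[8]))
        # Sanity check: the reported length must match the coordinate range.
        if exon.end - exon.start + 1 != exon.length:
            raise ValueError(f"inconsistent exon length in line: {line!r}")
        exons.append(exon)
    return exons


def coding_length_by_gene(exons: List[Exon]) -> Dict[int, int]:
    """Sum the exon lengths (in nt) for each predicted gene."""
    totals: Dict[int, int] = defaultdict(int)
    for exon in exons:
        totals[exon.gene] += exon.length
    return dict(totals)
```

Applied to the report above, the ten exons of gene 2 sum to 1296 nt, which matches the 431-residue protein plus a stop codon.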
GeneMark.hmm PROKARYOTIC (Version 2.6q)
Sequence file name: embl-id:V00307
Model file name: /opt/sw/genemark/prokaryotic_modeldir/Escherichia_coli_K12.mod
Model organism: Escherichia_coli_K12
Fri Mar 7 14:44:31 2008
Predicted genes
Gene Strand LeftEnd RightEnd Gene Class Spacer RBS_score
# Length
1 + 172 669 498 1 4 0.9979
2 + 1037 2077 1041 1 5 1.5053
3 - 2150 >2269 120 1 0 -0.1109
>gene_1 length: 165 aa
MYTSGYAHRSSSFSSAASKIARVSTENTTAGLISEVVYREDQPMMTQLLLLPLLQQLGQQ
SRWQLWLTPQQKLSREWVQASGLPLTKVMQISQLSPCHTVESMVRALRTGNYSVVIGWLA
DDLTEEEHAEVVDAANEGNAMGFIIHSGKRILSRHETTFRAKNSL
>gene_2 length: 346 aa
MKKTAIAIAVALAGFATVAQAAPKDNTWYTGAKLGWSQYHDTGFINNNGPTHENQLGAGA
FGGYQVNPYVGFEMGYDWLGRMPYKGSVENGAYKAQGVQLTAKLGYPITDDLDIYTRLGG
MVWRADTKSNVYGKNHDTGVSPVFAGGVEYAITPEIATRLEYQWTNNIGDAHTIGTRPDN
GMLSLGVSYRFGQGEAAPVVAPAPAPAPEVQTKHFTLKSDVLFNFNKATLKPEGQAALDQ
LYSQLSNLDPKDGSVVVLGYTDRIGSDAYNQGLSERRAQSVVDYLISKGIPADKISARGM
GESNPVTGNTCDNVKQRAALIDCLAPDRRVEIEVKGIKDVVTQPQA
>gene_3 length: 39 aa
GKTLSETIVQLIEDAENKEKYANKMSSLKQDLQALLGKE
In the "Class" column, 1 and 2 stand for "typical" and "atypical" genes, respectively.
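The prokaryotic gene table can be parsed in a similar way, translating the class number into its "typical" or "atypical" label. The sketch below assumes the eight columns of the example report; `Gene` and `parse_prokaryotic_genes` are hypothetical names. Treating a ">" prefix on a coordinate as a marker of a gene that runs past the end of the sequence is an interpretation of the ">2269" entry, and handling "<" symmetrically is a further assumption.

```python
from typing import List, NamedTuple

CLASS_LABELS = {1: "typical", 2: "atypical"}


class Gene(NamedTuple):
    number: int
    strand: str
    left: int
    right: int
    length: int
    gene_class: str
    spacer: int
    rbs_score: float
    partial: bool


def parse_prokaryotic_genes(report: str) -> List[Gene]:
    """Extract gene rows from a prokaryotic report, skipping other lines."""
    genes = []
    for line in report.splitlines():
        f = line.split()
        # Data rows have eight fields, a numeric gene number and a strand sign.
        if len(f) != 8 or not f[0].isdigit() or f[1] not in ("+", "-"):
            continue
        # A '<' or '>' prefix is taken to mark an incomplete gene (assumption).
        partial = f[2].startswith(("<", ">")) or f[3].startswith(("<", ">"))
        genes.append(Gene(
            number=int(f[0]),
            strand=f[1],
            left=int(f[2].lstrip("<>")),
            right=int(f[3].lstrip("<>")),
            length=int(f[4]),
            gene_class=CLASS_LABELS.get(int(f[5]), "unknown"),
            spacer=int(f[6]),
            rbs_score=float(f[7]),
            partial=partial,
        ))
    return genes
```

With the example report, all three genes are labelled "typical" and only gene 3 is flagged as partial.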
| Program name | Description |
|---|---|
| bscan | Scans proteins or nucleic acids for conserved motifs using Blocks |
| getorf | Finds and extracts open reading frames (ORFs) |
| iprscan | Scans proteins or nucleic acids for conserved motifs using Interpro tools |
| marscan | Finds MAR/SAR sites in nucleic sequences |
| plotorf | Plot potential open reading frames |
| showorf | Pretty output of DNA translations |
| sixpack | Display a DNA sequence with 6-frame translation and ORFs |
| syco | Synonymous codon usage Gribskov statistic plot |
| tcode | Fickett TESTCODE statistic to identify protein-coding DNA |
| wobble | Wobble base plot |
The program GeneMark.HMM itself is commercial software. It was developed
by Dr. Mark Borodovsky, Dr. Alex Lukashin, Dr. George Tarasenko and Mr.
Alex Lomsadze. The copyright belongs to:
Gene Probe, Inc.
883 Heritage Place
Atlanta, GA 30033