genemark

 

Function

Finds potential genes using a species specific HMM

Description

genemark is an EMBOSS "wrapper" program for the programs GeneMark.HMM prokaryotic and GeneMark.HMM eukaryotic from Gene Probe, Inc. It searches for potential protein-coding sequences in DNA using a hidden Markov algorithm with a species-specific inhomogeneous Markov model of gene-encoding regions of DNA.

GeneMark.HMM exploits the differences in oligonucleotide composition between coding and noncoding DNA. For some prokaryotes it distinguishes between "typical" and "atypical" genes; for many prokaryotes it can search for an RBS (ribosome binding site) to detect the start of translation more accurately. For eukaryotes it detects the intron-exon splice sites. Note, however, that GeneMark.HMM only searches for coding DNA; it has no provision for finding noncoding exons.

If you submit to EMBL/GenBank/DDBJ a sequence with a CDS that you found using genemark, you should include the following statement in the Comment field: "The protein coding regions have been predicted from computer analysis using the program GeneMark.HMM from Gene Probe, Inc."

Heuristic models

The heuristic models are a special category of gene models, used with GeneMark.HMM prokaryotic. They are derived from parameters measured from the input sequences and from knowledge gained through the study of various bacterial genomes. The method is described in:

        John Besemer and Mark Borodovsky
        "Heuristic approach to deriving models for gene finding"
        Nucleic Acids Research, 1999, Vol. 27, No. 19, 3911-3920

This method produces fairly accurate models from minimal information: the G+C composition of a sequence. These models can be used to find genes in anonymous prokaryotic genomes, in genomes of organelles, viruses, phages and plasmids, and in highly inhomogeneous genomes where the models must be adjusted to the local DNA composition.

A collection of models covering a GC composition range from 30% to 70% was generated with this algorithm for genetic codes 11, 4 and 1. The following table shows the START and STOP codons allowed in the models for each genetic code:

--------------------------------------------------------
Genetic Code |      11     |      4      |     1
--------------------------------------------------------
Start codon  | ATG GTG TTG | ATG GTG TTG |     ATG
--------------------------------------------------------
Stop codon   | TAA TAG TGA | TAA TAG     | TAA TAG TGA
--------------------------------------------------------

Heuristic models can be applied to the analysis of sequences as small as 400 nt.
Note: these models do not include parameters for the RBS (ribosome binding site).
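
Because a heuristic model is selected by %GC, it can help to measure the G+C content of a sequence before running genemark. The following is a minimal sketch, not part of genemark itself: the function names are hypothetical, the codon table simply restates the table above, and clamping out-of-range values to 30-70 is an assumption made for illustration.

```python
# Start/stop codons per genetic code, restated from the table above.
HEURISTIC_CODONS = {
    11: {"start": ("ATG", "GTG", "TTG"), "stop": ("TAA", "TAG", "TGA")},
    4: {"start": ("ATG", "GTG", "TTG"), "stop": ("TAA", "TAG")},
    1: {"start": ("ATG",), "stop": ("TAA", "TAG", "TGA")},
}

MIN_LENGTH_NT = 400  # smallest sequence size quoted for heuristic models
GC_RANGE = (30, 70)  # %GC range covered by the heuristic models


def gc_percent(sequence: str) -> float:
    """Return the G+C percentage, counting only A, C, G and T symbols."""
    seq = sequence.upper()
    counted = sum(seq.count(base) for base in "ACGT")
    if counted == 0:
        raise ValueError("sequence contains no A, C, G or T")
    return 100.0 * (seq.count("G") + seq.count("C")) / counted


def heuristic_gc_setting(sequence: str) -> int:
    """Round %GC to an integer and clamp it to the 30-70 range (assumption)."""
    low, high = GC_RANGE
    return max(low, min(high, round(gc_percent(sequence))))


def long_enough(sequence: str) -> bool:
    """True if the sequence reaches the 400 nt size quoted above."""
    return len(sequence) >= MIN_LENGTH_NT
```

The integer returned by `heuristic_gc_setting` is the kind of value the `-gc` qualifier expects (an integer from 30 to 70).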

Usage

Here is a sample session with genemark

> genemark
Finds potential genes using a species specific HMM
Input nucleotide sequence(s): embl:x02419
         B : eubacterial
         A : archaebacterial
         H : heuristic (%GC)
         E : eukaryotic
         3 : eukaryotic (using Genemark.HMM version 3)
Gene Model Class [E]: 
    barley : Hordeum vulgare (barley)
   chicken : Gallus gallus (chicken)
      corn : Zea mays (maize)
  human_00_43 : Homo sapiens (human) 0-43 %GC
  human_43_49 : Homo sapiens (human) 43-49 %GC
  human_49_99 : Homo sapiens (human) 49-99 %GC
     mouse : Mus musculus (mouse)
     wheat : Triticum aestivum (wheat)
Eukaryotic Gene Model [human_43_49]:
Output file [x02419.gmhmm]:


Second example, with a prokaryotic sequence

> genemark
Finds potential genes using a species specific HMM
Input nucleotide sequence(s): embl:v00307
         B : eubacterial
         A : archaebacterial
         H : heuristic (%GC)
         E : eukaryotic
         3 : eukaryotic (using Genemark.HMM version 3)
Gene Model Class [E]: B
embl:v00307      1 : Acinetobacter_sp_ADP1
      2 : Agrobacterium_tumefaciens_C58_Cereon_chromosome_circular
      3 : Agrobacterium_tumefaciens_C58_Cereon_chromosome_linear
      4 : Agrobacterium_tumefaciens_C58_Cereon_plasmid_AT
      5 : Agrobacterium_tumefaciens_C58_Cereon_plasmid_Ti
      6 : Agrobacterium_tumefaciens_C58_UWash_chromosome_circular
      7 : Agrobacterium_tumefaciens_C58_UWash_chromosome_linear

  [some lines have been deleted for brevity]

    122 : Enterococcus_faecalis_V583_plasmid_pTEF3
    123 : Erwinia_carotovora_atroseptica_SCRI1043
    124 : Escherichia_coli_CFT073
    125 : Escherichia_coli_K12
    126 : Escherichia_coli_O157H7
    127 : Escherichia_coli_O157H7_EDL933
    128 : Escherichia_coli_O157H7_plasmid_pO157

  [some lines have been deleted for brevity]

    351 : Yersinia_pestis_KIM_plasmid_pMT-1
    352 : Yersinia_pestis_biovar_Mediaevails
    353 : Yersinia_pestis_biovar_Mediaevails_plasmid_pCD1
    354 : Yersinia_pseudotuberculosis_IP32953
    355 : Yersinia_pseudotuberculosis_IP32953_plasmid_pYV
    356 : Yersinia_pseudotuberculosis_IP32953_plasmid_pYptb32953
    357 : Zymomonas_mobilis_ZM4
Eubacterial Gene Model [125]:
Output file [v00307.gmhmm]:


Command line arguments

   Standard (Mandatory) qualifiers (* if not always prompted):
  [-seqs]              seqall     Nucleotide sequence(s) filename and optional
                                  format, or reference (input USA)
   -class              menu       [E] Gene Model Class (Values: B
                                  (eubacterial); A (archaebacterial); H
                                  (heuristic (%GC)); E (eukaryotic); 3
                                  (eukaryotic (using Genemark.HMM version 3)))
*  -bmodel             selection  [125] Eubacterial Gene Model
*  -amodel             selection  [28] Archaebacterial Gene Model
*  -emodel             menu       [human_43_49] Eukaryotic Gene Model (Values:
                                  barley (Hordeum vulgare (barley)); chicken
                                  (Gallus gallus (chicken)); corn (Zea mays
                                  (maize)); human_00_43 (Homo sapiens (human)
                                  0-43 %GC); human_43_49 (Homo sapiens (human)
                                  43-49 %GC); human_49_99 (Homo sapiens
                                  (human) 49-99 %GC); mouse (Mus musculus
                                  (mouse)); wheat (Triticum aestivum (wheat)))
*  -model              menu       [a_thaliana] Eukaryotic Gene Model for
                                  version 3 (Values: a_gambiae (Anopheles
                                  gambiae (malaria mosquito)); a_thaliana
                                  (Arabidopsis thaliana); c_elegans
                                  (Caenorhabditis elegans); c_intestinalis
                                  (Ciona intestinalis); c_reinhardtii
                                  (Chlamydomonas reinhardtii); c_remanei
                                  (Caenorhabditis remanei); d_melanogaster
                                  (Drosophila melanogaster (fruitfly));
                                  m_truncatula (Medicago truncatula (barrel
                                  medic)); o_sativa (Oryza sativa (rice)))
*  -gencode            menu       [1] Genetic code for Heuristic Model
                                  (Values: 1 (Standard); 11 (Bacterial); 4
                                  (Mitochondrial))
*  -gc                 integer    [50] %GC for Heuristic Model (Integer from
                                  30 to 70)
  [-outfile]           outfile    [*.genemark] Output file name

   Additional (Optional) qualifiers (* if not always prompted):
*  -[no]rbs            boolean    [Y] Use Ribosome Binding Site model

   Advanced (Unprompted) qualifiers: (none)
   Associated qualifiers:

   "-seqs" associated qualifiers
   -sbegin1            integer    Start of each sequence to be used
   -send1              integer    End of each sequence to be used
   -sreverse1          boolean    Reverse (if DNA)
   -sask1              boolean    Ask for begin/end/reverse
   -snucleotide1       boolean    Sequence is nucleotide
   -sprotein1          boolean    Sequence is protein
   -slower1            boolean    Make lower case
   -supper1            boolean    Make upper case
   -sformat1           string     Input sequence format
   -sdbname1           string     Database name
   -sid1               string     Entryname
   -ufo1               string     UFO features
   -fformat1           string     Features format
   -fopenfile1         string     Features file name

   "-outfile" associated qualifiers
   -odirectory2        string     Output directory

   General qualifiers:
   -auto               boolean    Turn off prompts
   -stdout             boolean    Write standard output
   -filter             boolean    Read standard input, write standard output
   -options            boolean    Prompt for standard and additional values
   -debug              boolean    Write debug output to program.dbg
   -verbose            boolean    Report some/full command line options
   -help               boolean    Report command line options. More
                                  information on associated and general
                                  qualifiers can be found with -help -verbose
   -warning            boolean    Report warnings
   -error              boolean    Report errors
   -fatal              boolean    Report fatal errors
   -die                boolean    Report dying program messages

Standard (Mandatory) qualifiers

[-seqs] (Parameter 1)
    Description    : Nucleotide sequence(s) filename and optional format, or reference (input USA)
    Allowed values : Readable sequence(s)
    Default        : Required

-class
    Description    : Gene Model Class
    Allowed values : B (eubacterial)
                     A (archaebacterial)
                     H (heuristic (%GC))
                     E (eukaryotic)
                     3 (eukaryotic (using Genemark.HMM version 3))
    Default        : E

-bmodel
    Description    : Eubacterial Gene Model
    Allowed values : Acinetobacter_sp_ADP1
                     Agrobacterium_tumefaciens_C58_Cereon_chromosome_circular
                     Agrobacterium_tumefaciens_C58_Cereon_chromosome_linear
                     ...
                     Yersinia_pseudotuberculosis_IP32953_plasmid_pYptb32953
                     Zymomonas_mobilis_ZM4
                     (choose from list of 357 models)
    Default        : Escherichia_coli_K12

-amodel
    Description    : Archaebacterial Gene Model
    Allowed values : Aeropyrum_pernix
                     Archaeoglobus_fulgidus
                     Haloarcula_marismortui_ATCC_43049_chromosome_I
                     ...
                     Thermoplasma_acidophilum
                     Thermoplasma_volcanium
                     (choose from list of 32 models)
    Default        : Sulfolobus_solfataricus

-emodel
    Description    : Eukaryotic Gene Model
    Allowed values : barley (Hordeum vulgare (barley))
                     chicken (Gallus gallus (chicken))
                     corn (Zea mays (maize))
                     human_00_43 (Homo sapiens (human) 0-43 %GC)
                     human_43_49 (Homo sapiens (human) 43-49 %GC)
                     human_49_99 (Homo sapiens (human) 49-99 %GC)
                     mouse (Mus musculus (mouse))
                     wheat (Triticum aestivum (wheat))
    Default        : human_43_49

-model
    Description    : Eukaryotic Gene Model for version 3
    Allowed values : a_gambiae (Anopheles gambiae (malaria mosquito))
                     a_thaliana (Arabidopsis thaliana)
                     c_elegans (Caenorhabditis elegans)
                     c_intestinalis (Ciona intestinalis)
                     c_reinhardtii (Chlamydomonas reinhardtii)
                     c_remanei (Caenorhabditis remanei)
                     d_melanogaster (Drosophila melanogaster (fruitfly))
                     m_truncatula (Medicago truncatula (barrel medic))
                     o_sativa (Oryza sativa (rice))
    Default        : a_thaliana

-gencode
    Description    : Genetic code for Heuristic Model
    Allowed values : 1 (Standard)
                     11 (Bacterial)
                     4 (Mitochondrial)
    Default        : 1

-gc
    Description    : %GC for Heuristic Model
    Allowed values : Integer from 30 to 70
    Default        : 50

[-outfile] (Parameter 2)
    Description    : Output file name
    Allowed values : Output file
    Default        : <sequence>.gmhmm

Additional (Optional) qualifiers

-[no]rbs
    Description    : Use Ribosome Binding Site model
    Allowed values : Boolean value Yes/No
    Default        : Yes

Advanced (Unprompted) qualifiers

(none)
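
The qualifiers listed above can be assembled into a non-interactive command line for use in scripts. The sketch below only builds the argument list; it does not run genemark. The function name and its parameters are hypothetical, and it is an assumption that the model qualifiers accept the value exactly as typed at the prompt (for example the default `125` shown for `-bmodel`).

```python
# Which qualifier carries the model for each gene model class,
# taken from the qualifier descriptions above.
MODEL_QUALIFIER = {"B": "-bmodel", "A": "-amodel", "E": "-emodel", "3": "-model"}


def build_genemark_command(seqs, gene_class="E", model=None,
                           gencode=None, gc=None, outfile=None):
    """Return an argument list for a non-interactive genemark run (sketch)."""
    cmd = ["genemark", "-seqs", seqs, "-class", gene_class]
    if gene_class == "H":
        # Heuristic models are selected by genetic code and %GC.
        if gencode is not None:
            cmd += ["-gencode", str(gencode)]
        if gc is not None:
            if not 30 <= gc <= 70:
                raise ValueError("-gc must be an integer from 30 to 70")
            cmd += ["-gc", str(gc)]
    elif gene_class in MODEL_QUALIFIER:
        if model is not None:
            cmd += [MODEL_QUALIFIER[gene_class], str(model)]
    else:
        raise ValueError(f"unknown gene model class: {gene_class!r}")
    if outfile is not None:
        cmd += ["-outfile", outfile]
    cmd.append("-auto")  # general qualifier: turn off prompts
    return cmd


if __name__ == "__main__":
    print(build_genemark_command("embl:v00307", "B", model="125",
                                 outfile="v00307.gmhmm"))
```

The resulting list could be handed to `subprocess.run`. Since genemark is documented to always exit with status 0, a script would probably need to check that the output file exists rather than rely on the return code.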

Input file format

genemark reads any normal sequence USA for one or more nucleic acid sequence(s).
You can submit several sequences at the same time. In this case the "wrapper" genemark will launch the program for each individual sequence and make sure the output files have different names.

Output file format

genemark produces an output file with information about the predicted coding sequences and a second output file with the predicted protein(s) in FASTA format. In the examples below, the second file has the name of the first with ".fasta" appended.

Output files for usage example

File: x02419.gmhmm

GeneMark.hmm (Version 2.2a)
Sequence name: embl-id:X02419
Sequence length: 7258 bp
G+C content: 54.38%
Matrices file: /opt/sw/genemark/eukaryotic/human_43_49.mtx (Homo sapiens)
Wed Nov 15 16:07:00 2006

Predicted genes/exons

Gene Exon Strand Exon           Exon Range     Exon      Start/End
  #    #         Type                         Length       Frame
  1     1   -  Initial        846      1075     230          2 1

  2     1   +  Initial       1227      1283      57          1 3
  2     2   +  Internal      1701      1728      28          1 1
  2     3   +  Internal      1875      1982     108          2 1
  2     4   +  Internal      2586      2760     175          2 2
  2     5   +  Internal      2954      3045      92          3 1
  2     6   +  Internal      3203      3422     220          2 2
  2     7   +  Internal      3644      3792     149          3 1
  2     8   +  Internal      4458      4598     141          2 1
  2     9   +  Internal      4945      5093     149          2 3
  2    10   +  Terminal      6083      6259     177          1 3
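
The exon table above is whitespace-delimited, so it can be read with a few lines of code. The following is a minimal sketch that assumes the nine-column layout shown in this example; the names `Exon`, `parse_exons` and `coding_length_by_gene` are hypothetical and other GeneMark.hmm versions may format the table differently.

```python
from collections import defaultdict
from typing import NamedTuple

SAMPLE = """\
Gene Exon Strand Exon           Exon Range     Exon      Start/End
  #    #         Type                         Length       Frame
  1     1   -  Initial        846      1075     230          2 1

  2     1   +  Initial       1227      1283      57          1 3
  2     2   +  Internal      1701      1728      28          1 1
  2     3   +  Internal      1875      1982     108          2 1
  2     4   +  Internal      2586      2760     175          2 2
  2     5   +  Internal      2954      3045      92          3 1
  2     6   +  Internal      3203      3422     220          2 2
  2     7   +  Internal      3644      3792     149          3 1
  2     8   +  Internal      4458      4598     141          2 1
  2     9   +  Internal      4945      5093     149          2 3
  2    10   +  Terminal      6083      6259     177          1 3
"""


class Exon(NamedTuple):
    gene: int
    exon: int
    strand: str
    exon_type: str
    start: int
    end: int
    length: int
    start_frame: int
    end_frame: int


def parse_exons(text):
    """Return one Exon per data row; header and blank lines are skipped."""
    exons = []
    for line in text.splitlines():
        fields = line.split()
        # Data rows have nine fields, a numeric gene number and a strand sign.
        if len(fields) != 9 or not fields[0].isdigit() or fields[2] not in ("+", "-"):
            continue
        start, end, length, start_frame, end_frame = map(int, fields[4:])
        exons.append(Exon(int(fields[0]), int(fields[1]), fields[2], fields[3],
                          start, end, length, start_frame, end_frame))
    return exons


def coding_length_by_gene(exons):
    """Sum the exon lengths of each predicted gene."""
    totals = defaultdict(int)
    for exon in exons:
        totals[exon.gene] += exon.length
    return dict(totals)


if __name__ == "__main__":
    print(coding_length_by_gene(parse_exons(SAMPLE)))
```

For this example the exons of gene 2 add up to 1296 nt, that is 432 codons, which agrees with the 431 aa reported for gene_2 in the FASTA file if one stop codon is counted. This is an observation about this example rather than a documented rule.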

File: x02419.gmhmm.fasta

>gene_1 length: 76 aa
MGSGLRAAWIGTRGGRPGRGALTPCSRRGAQGARAAQGGAGRRSGAGAAAHPARGSQDRG
THRWLRQEGASRRCGD
>gene_2 length: 431 aa
MRALLARLLLCVLVVSDSKGSNELHQVPSNCDCLNGGTCVSNKYFSNIHWCNCPKKFGGQ
HCEIDKSKTCYEGNGHFYRGKASTDTMGRPCLPWNSATVLQQTYHAHRSDALQLGLGKHN
YCRNPDNRRRPWCYVQVGLKPLVQECMVHDCADGKKPSSPPEELKFQCGQKTLRPRFKII
GGEFTTIENQPWFAAIYRRHRGGSVTYVCGGSLMSPCWVISATHCFIDYPKKEDYIVYLG
RSRLNSNTQGEMKFEVENLILHKDYSADTLAHHNDIALLKIRSKEGRCAQPSRTIQTICL
PSMYNDPQFGTSCEITGFGKENSTDYLYPEQLKMTVVKLISHRECQQPHYYGSEVTTKML
CAADPQWKTDSCQGDSGGPLVCSLQGRMTLTGIVSWGRGCALKDKPGVYTRVSHFLPWIR
SHTKEENGLAL
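
Each header in the protein file carries the predicted length, so a script can verify that a record was read completely. The following is a minimal sketch that assumes the header form `>gene_N length: M aa` seen in this example; the function name is hypothetical.

```python
import re

# Header form as seen in the example output, e.g. ">gene_1 length: 76 aa".
HEADER = re.compile(r"^>(\S+) length: (\d+) aa$")

SAMPLE = """\
>gene_1 length: 76 aa
MGSGLRAAWIGTRGGRPGRGALTPCSRRGAQGARAAQGGAGRRSGAGAAAHPARGSQDRG
THRWLRQEGASRRCGD
"""


def read_predicted_proteins(text):
    """Return {gene_id: (declared_length, sequence)} from FASTA text."""
    records = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            match = HEADER.match(line)
            if match is None:
                raise ValueError(f"unexpected header: {line!r}")
            current = match.group(1)
            records[current] = [int(match.group(2)), []]
        elif current is not None:
            # Sequence lines are concatenated until the next header.
            records[current][1].append(line)
    return {name: (declared, "".join(parts))
            for name, (declared, parts) in records.items()}


if __name__ == "__main__":
    for name, (declared, seq) in read_predicted_proteins(SAMPLE).items():
        print(name, declared, len(seq), declared == len(seq))
```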

Output files for usage example 2

File: v00307.gmhmm

GeneMark.hmm PROKARYOTIC (Version 2.6q)
Sequence file name: embl-id:V00307
Model file name: /opt/sw/genemark/prokaryotic_modeldir/Escherichia_coli_K12.mod
Model organism: Escherichia_coli_K12
Fri Mar  7 14:44:31 2008

Predicted genes
   Gene    Strand    LeftEnd    RightEnd       Gene     Class   Spacer   RBS_score
    #                                         Length
    1        +         172         669          498        1        4   0.9979
    2        +        1037        2077         1041        1        5   1.5053
    3        -        2150       >2269          120        1        0  -0.1109

File: v00307.gmhmm.fasta

>gene_1 length: 165 aa
MYTSGYAHRSSSFSSAASKIARVSTENTTAGLISEVVYREDQPMMTQLLLLPLLQQLGQQ
SRWQLWLTPQQKLSREWVQASGLPLTKVMQISQLSPCHTVESMVRALRTGNYSVVIGWLA
DDLTEEEHAEVVDAANEGNAMGFIIHSGKRILSRHETTFRAKNSL
>gene_2 length: 346 aa
MKKTAIAIAVALAGFATVAQAAPKDNTWYTGAKLGWSQYHDTGFINNNGPTHENQLGAGA
FGGYQVNPYVGFEMGYDWLGRMPYKGSVENGAYKAQGVQLTAKLGYPITDDLDIYTRLGG
MVWRADTKSNVYGKNHDTGVSPVFAGGVEYAITPEIATRLEYQWTNNIGDAHTIGTRPDN
GMLSLGVSYRFGQGEAAPVVAPAPAPAPEVQTKHFTLKSDVLFNFNKATLKPEGQAALDQ
LYSQLSNLDPKDGSVVVLGYTDRIGSDAYNQGLSERRAQSVVDYLISKGIPADKISARGM
GESNPVTGNTCDNVKQRAALIDCLAPDRRVEIEVKGIKDVVTQPQA
>gene_3 length: 39 aa
GKTLSETIVQLIEDAENKEKYANKMSSLKQDLQALLGKE

In the "Class" column, 1 and 2 stand for "typical" and "atypical" genes, respectively.
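
The prokaryotic gene table can be read in the same way as any whitespace-delimited listing. The sketch below assumes the eight-column layout shown in this example and uses hypothetical names. It treats a `>` prefix on a coordinate, as in `>2269`, as marking a gene that extends beyond the sequence; handling a `<` prefix in the same way is an assumption, since none appears in this example.

```python
from typing import NamedTuple

CLASS_LABEL = {1: "typical", 2: "atypical"}

SAMPLE = """\
   Gene    Strand    LeftEnd    RightEnd       Gene     Class   Spacer   RBS_score
    #                                         Length
    1        +         172         669          498        1        4   0.9979
    2        +        1037        2077         1041        1        5   1.5053
    3        -        2150       >2269          120        1        0  -0.1109
"""


class Gene(NamedTuple):
    gene: int
    strand: str
    left: int
    right: int
    partial: bool
    length: int
    gene_class: str
    spacer: int
    rbs_score: float


def parse_genes(text):
    """Return one Gene per data row; header lines are skipped."""
    genes = []
    for line in text.splitlines():
        fields = line.split()
        # Data rows have eight fields, a numeric gene number and a strand sign.
        if len(fields) != 8 or not fields[0].isdigit() or fields[1] not in ("+", "-"):
            continue
        partial = (fields[2].startswith(("<", ">"))
                   or fields[3].startswith(("<", ">")))
        genes.append(Gene(
            gene=int(fields[0]),
            strand=fields[1],
            left=int(fields[2].lstrip("<>")),
            right=int(fields[3].lstrip("<>")),
            partial=partial,
            length=int(fields[4]),
            gene_class=CLASS_LABEL.get(int(fields[5]), "unknown"),
            spacer=int(fields[6]),
            rbs_score=float(fields[7]),
        ))
    return genes


if __name__ == "__main__":
    for gene in parse_genes(SAMPLE):
        print(gene)
```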

Data files

The gene models used by genemark cannot be inspected by the end user, nor can they be replaced by user-provided files.

Notes

None.

Warnings

To get a valid result, it is important to choose an appropriate gene model. If no gene model is available for the species from which your sequence derives, you can try the model of the most closely related species. As long as you do not suspect the presence of introns, you could also use a "heuristic" model (but first determine the %GC of the sequence to be analyzed).

Diagnostic Error Messages

None

Exit status

It always exits with status 0.

Known bugs

None.

See also

Program name   Description
bscan          Scans proteins or nucleic acids for conserved motifs using Blocks
getorf         Finds and extracts open reading frames (ORFs)
iprscan        Scans proteins or nucleic acids for conserved motifs using Interpro tools
marscan        Finds MAR/SAR sites in nucleic sequences
plotorf        Plot potential open reading frames
showorf        Pretty output of DNA translations
sixpack        Display a DNA sequence with 6-frame translation and ORFs
syco           Synonymous codon usage Gribskov statistic plot
tcode          Fickett TESTCODE statistic to identify protein-coding DNA
wobble         Wobble base plot

Author(s)

The wrapper application genemark was written by Guy Bottu (gbottu@vub.ac.be)
BEN, ULB, Brussels, Belgium

The program GeneMark.HMM itself is commercial software. It was developed by Dr. Mark Borodovsky, Dr. Alex Lukashin, Dr. George Tarasenko and Mr. Alex Lomsadze. The copyright belongs to:
Gene Probe, Inc.
883 Heritage Place
Atlanta, GA 30033

History

Completed 5 September 2002
Modified 2 February 2005 - greatly increased number of gene models

Target users

This program is intended to be used by everyone and everything, from naive users to embedded scripts.