unafold

 

Function

Prediction of optimal and suboptimal RNA or DNA secondary structure

Description

unafold is an EMBOSS "wrapper" program for a number of scripts and programs of the UNAFOLD (Unified Nucleic Acid Folding) suite of Markham and Zuker. It helps in predicting the secondary structure of a nucleic acid molecule.

unafold can find structures containing GC, AU (or AT) and GU (or GT) base pairs. It relies on Gibbs free enthalpy of formation calculations and has tables with energy values for base stacking and the effect of the formation of stems, hairpin loops, bulge loops and multibranch loops. It does however not take into account tertiary interactions, modified bases and the possible formation of "pseudo-knots".

By default unafold considers the molecule to be linear RNA at 37°C. You can however precise that it must consider the molecule to be circular, (single-stranded) DNA and/or at another temperature. The current version (3.6) of UNAFOLD has three different built-in sets of energy tables :

unafold can operate in 3 different modes :

  1. energy minimization : in this mode unafold searches the folding with the lowest Gibbs free enthalpy.
  2. energy minimization with suboptimal foldings : this corresponds to the mode of action of the earlier version of the suite called MFOLD. unafold reports besides the optimal folding also a series of suboptimal foldings with free enthalpies of formation within a certain percentage above the optimal. It also creates an "energy dotplot" with dots indicating base pairs in these foldings.
  3. partition function calculation : instead of mimimizing energy it computes the complete partition function and from it base pair probabilities. It samples stochastically a number of foldings and draws a "basepair probability plot".

The chance of finding the correct folding can be greatly improved by adding a number of "constaints", that is provividing directives (based on experimental evidence) about which bases cannot or must pair (see input).

Algorithm

A complete explanation of the algorithm is beyond the scope of this on-line manual. You should look at the references. Lets just note the following restrictions :

If UNAFOLD is working in partition function calculation mode it stochastically samples a number of likely foldings, as set by the -maxfold=n parameter ; note that you can get several times the same folding. If UNAFOLD is working in MFOLD mode it will compute a number of foldings as set by the -suboptimality=P, -window=W and -maxfold=MAX parameters. W may be thought of as a distance parameter. The distance between 2 base pairs i.j and i'.j' may be defined as max{|i-i'|,|j-j'|}. Then if k-1 foldings have already been predicted, the kth will have at leas W base pairs that are at least a distance W from any of the base pairs in the first k-1 foldings. UNAFOLD continues sampling foldings until k = MAX or until there are no more foldings with a Gibbs free enthalpy of formation above %P of the minimum.

If W is not specified, UNAFOLD will choose its value from this table based on sequence length. The user is encouraged to experiment with this parameter.
Sequence length Default window size
1-29 0
30-49 1
50-119 2
120-199 3
200-299 5
300-399 7
400-499 8
500-599 10
600-699 11
700-799 12
800-1199 15
1200-1999 20
>= 2000 25

Usage

Here is a sample session with unafold

> unafold -storeall
Prediction of optimal and suboptimal RNA or DNA secondary structure
Input nucleotide sequence: embl:k00152
         1 : energy minimization
         2 : energy minimization with suboptimal foldings
         3 : partition function calculation
Mode [2]:
Molecule is single-stranded DNA rather than RNA [N]:
Molecule is circular [N]:
Base name for output files [K00152]:
       png : PNG
       jpg : JPEG
Graphic output format [png]:

Go to the input files for this example
Go to the output files for this example

Command line arguments

   Standard (Mandatory) qualifiers (* if not always prompted):
  [-sequence]          sequence   Nucleotide sequence filename and optional
                                  format, or reference (input USA)
   -mode               menu       [2] Mode. The default mode 2 corresponds to
                                  the old MFOLD software. (Values: 1 (energy
                                  minimization); 2 (energy minimization with
                                  suboptimal foldings); 3 (partition function
                                  calculation))
   -dna                toggle     Molecule is single-stranded DNA rather than
                                  RNA
*  -circular           boolean    Molecule is circular
   -basename           string     [$(sequence.name)] Base name for output
                                  files (Any string is accepted)
   -graph              menu       [png] Graphic output format (Values: png
                                  (PNG); jpg (JPEG))

   Additional (Optional) qualifiers (* if not always prompted):
   -temperature        integer    [37] Temperature (Integer from -5 to 105)
*  -na                 float      [1.0] Na+ molar concentration (for DNA only)
                                  (Number 0.000 or more)
*  -mg                 float      [0.0] Mg++ molar concentration (for DNA
                                  only) (Number 0.000 or more)
   -minstem            integer    [2] Minimum stem size. The default implies
                                  that a helix of size 1, that is a base pair
                                  that does not stack on another base pair on
                                  either side, is not considered (Integer 1 or
                                  more)
   -maxloop            integer    [30] Maximum interior or bulge loop size
                                  (Integer 0 or more)
   -maxbp              integer    [No limit] Maximum distance between pairing
                                  bases (Integer 0 or more)
*  -suboptimality      integer    [5] Maximum percent free energy
                                  suboptimality for computing suboptimal
                                  structures when UNAFOLD is working in MFOLD
                                  mode (Integer from 0 to 100)
*  -window             integer    [depends on sequence length] Base pair
                                  distance window for suboptimal foldings when
                                  UNAFOLD is working in MFOLD mode. A
                                  negative value sets sequence length
                                  dependent value (see on-line manual). (Any
                                  integer value)
*  -maxfold            integer    [maximum 100 suboptimal foldings or 10
                                  stochastically sampled foldings] Number of
                                  predicted foldings. If UNAFOLD is working in
                                  MFOLD mode this corresponds to the maximum
                                  number of foldings within the limits set by
                                  the -suboptimality and -window parameters,
                                  if UNAFOLD is working in partition function
                                  calculation mode this corresponds to the
                                  number of stochastically sampled foldings.
                                  (Integer 1 or more)
*  -annotation         menu       [none] Structure annotation mode. If UNAFOLD
                                  is working in MFOLD mode 'ann' and
                                  'ss-count' stand for colouring the bases
                                  according to the number of alternative base
                                  pairings respectively the number of
                                  single-stranded foldings. If UNAFOLD is
                                  working in partition function calculation
                                  mode 'ann' and 'ss-count' stand for
                                  colouring the bases according to base
                                  pairing probability respectively
                                  single-stranded probability. (Values: none
                                  (No annotation); ann (Base pairing
                                  annotation); ss-count (Single stranded
                                  annotation))

   Advanced (Unprompted) qualifiers:
   -constraints        infile     File with constraints that force or prohibit
                                  base pairings. See on-line manual for
                                  syntax
   -cutoff             float      [1e-6] Cutoff for display in basepair
                                  probability plot (Number from 0.000 to
                                  1.000)
   -display            menu       [auto] Structure display mode (Values: auto
                                  (write base symbols if sequence is shorter
                                  than 800); bases (always write base
                                  symbols); lines (do not write base symbols))
   -numbering          integer    [depends on sequence length] Base numbering
                                  frequency. A value of 0 triggers the default
                                  (10 for seq. shorter than 50, 50 for seq.
                                  longer than 300, otherwise 20). (Integer 0
                                  or more)
   -storeall           boolean    Store all UNAFOLD output files

   Associated qualifiers:

   "-sequence" associated qualifiers
   -sbegin1            integer    Start of the sequence to be used
   -send1              integer    End of the sequence to be used
   -sreverse1          boolean    Reverse (if DNA)
   -sask1              boolean    Ask for begin/end/reverse
   -snucleotide1       boolean    Sequence is nucleotide
   -sprotein1          boolean    Sequence is protein
   -slower1            boolean    Make lower case
   -supper1            boolean    Make upper case
   -sformat1           string     Input sequence format
   -sdbname1           string     Database name
   -sid1               string     Entryname
   -ufo1               string     UFO features
   -fformat1           string     Features format
   -fopenfile1         string     Features file name

   General qualifiers:
   -auto               boolean    Turn off prompts
   -stdout             boolean    Write standard output
   -filter             boolean    Read standard input, write standard output
   -options            boolean    Prompt for standard and additional values
   -debug              boolean    Write debug output to program.dbg
   -verbose            boolean    Report some/full command line options
   -help               boolean    Report command line options. More
                                  information on associated and general
                                  qualifiers can be found with -help -verbose
   -warning            boolean    Report warnings
   -error              boolean    Report errors
   -fatal              boolean    Report fatal errors
   -die                boolean    Report dying program messages

Standard (Mandatory) qualifiers Allowed values Default
[-sequence]
(Parameter 1)
Nucleotide sequence filename and optional format, or reference (input USA) Readable sequence Required
-mode Mode. The default mode 2 corresponds to the old MFOLD software.
1 (energy minimization)
2 (energy minimization with suboptimal foldings)
3 (partition function calculation)
2
-dna Molecule is single-stranded DNA rather than RNA Toggle value Yes/No No
-circular Molecule is circular Boolean value Yes/No No
-basename Base name for output files Any string is accepted <sequence name>
-graph Graphic output format
png (PNG)
jpg (JPEG)
png
Additional (Optional) qualifiers Allowed values Default
-temperature Temperature Integer from -5 to 105 37
-na Na+ molar concentration (for DNA only) Number 0.000 or more 1.0
-mg Mg++ molar concentration (for DNA only) Number 0.000 or more 0.0
-minstem Minimum stem size. The default implies that a helix of size 1, that is a base pair that does not stack on another base pair on either side, is not considered Integer 1 or more 2
-maxloop Maximum interior or bulge loop size Integer 0 or more 30
-maxbp Maximum distance between pairing bases Integer 0 or more No limit
-suboptimality Maximum percent free energy suboptimality for computing suboptimal structures when UNAFOLD is working in MFOLD mode Integer from 0 to 100 5
-window Base pair distance window for suboptimal foldings when UNAFOLD is working in MFOLD mode. A negative value sets sequence length dependent value (see on-line manual). Any integer value depends on sequence length
-maxfold Number of predicted foldings. If UNAFOLD is working in MFOLD mode this corresponds to the maximum number of foldings within the limits set by the -suboptimality and -window parameters, if UNAFOLD is working in partition function calculation mode this corresponds to the number of stochastically sampled foldings. Integer 1 or more maximum 100 suboptimal foldings or 10 stochastically sampled foldings
-annotation Structure annotation mode. If UNAFOLD is working in MFOLD mode 'ann' and 'ss-count' stand for colouring the bases according to the number of alternative base pairings respectively the number of single-stranded foldings. If UNAFOLD is working in partition function calculation mode 'ann' and 'ss-count' stand for colouring the bases according to base pairing probability respectively single-stranded probability.
none (No annotation)
ann (Base pairing annotation)
ss-count (Single stranded annotation)
none
Advanced (Unprompted) qualifiers Allowed values Default
-constraints File with constraints that force or prohibit base pairings. See on-line manual for syntax Input file Required
-cutoff Cutoff for display in basepair probability plot Number from 0.000 to 1.000 1e-6
-display Structure display mode
auto (write base symbols if sequence is shorter than 800)
bases (always write base symbols)
lines (do not write base symbols)
auto
-numbering Base numbering frequency. A value of 0 triggers the default (10 for seq. shorter than 50, 50 for seq. longer than 300, otherwise 20). Integer 0 or more depends on sequence length
-storeall Store all UNAFOLD output files Boolean value Yes/No No

Input file format

unafold reads any normal sequence USA for a single nucleic acid sequence.

You can optionally (-constraints=infile) give to mfold a "constraints" file with information (based on laboratory evidence) about which bases cannot or must pair. This can greatly improve the chance of finding the correct structure. An example of a file, that works for the usage example (it makes mfold find the correct structure intstead of what you get by default), is :

constraint file for usage example

P 16 0 1
P 35 0 3
P 48 0 1
P 73 0 1
P 75 0 2

The complete set of syntax rules for the "constraints" file is as follows :

  1. Forcing a string of consecutive bases to pair :
    Syntax : F   i   0   k
    Ribonucleotides ri,ri+1,...,ri+k-1 are forced to be double stranded. The partners for these bases are chosen by the program. As an example, the command :
    F   23   0   5
    would cause bases 23, 24, 25, 26 and 27 to pair.

  2. Forcing a string of consecutive base pairs :
    Syntax : F   i   j   k
    Base pairs ri-rj,ri+1-rj-1,ri+2-rj-2,...,ri+k-1-rj-k+1 are forced to occur. This is the same thing as forcing a helix to form. The helix is designated by its (external) closing base pair, i.j. As an example, the command :
    F   2   110   3
    would force base pairs 2.110, 3.109 and 4.108. Note that these ase pairs must be able to form ! Be aware also that unafold by default filters out isolated base pairs.

  3. Prohibiting a string of consecutive bases from pairing :
    Syntax : P   i   0   k
    Ribonucleotides ri,ri+1,...,ri+k-1 are prevented from pairing.

  4. Prohibiting a string of consecutive base pairs :
    Syntax : P   i   j   k
    ri-rj,ri+1-rj-1,ri+2-rj-2,...,ri+k-1-rj-k+1 are not allowed to form. This is equivalent to prohibiting a helix.

  5. Prohibiting 1 segment of a sequence from pairing with another :
    Syntax : P   i-j   k-l
    where i <= j and k <= l. In this case, no base pairs are allowed between ri,ri+1,...,rj and rk,rk+1,...,rl. Note that the 2 segments need not be distinct. For example, the command :
    P   i-j   i-j
    will not allow ri,ri+1,...,rj to pair with itself.

Output file format

unafold calls several programs and scripts from the UNAFOLD suite, which create quite a lot of files in a temporary storage area. By default the "wrapper" only stores the graphical output files in the personal directories of the user. Currently the user can choose between PNG and JPEG formats. With the option -storeall the user can request to store also the most important text files (see below).

Output files for usage example

File: K00152.png

[graphic with energy dotplot]
The energy dotplot. Level 1 contains helices belonging to the optimal folding and is drawn in black, levels 2 to 4 contain helices only found in suboptimal structures and are drawn in colour.

File: K00152.plot

level	length	i	j	energy
1	5	49	65	-297
1	6	35	57	-287
1	3	38	47	-292
1	1	29	49	-293
1	6	30	48	-296
1	7	1	72	-297
1	4	10	44	-297
1	1	13	34	-294
1	5	14	33	-297
1	1	19	28	-291
1	3	19	27	-297
1	2	10	35	-284
1	4	10	25	-296

This file contains instructions for drawing the energy dotplot. The first record is a header, and each subsequent record describes a single helix, with the position of the first base, the position of the last base and the length. The energy is the smallest free energy change (dG), in kcal/mol, of the folding that still contains the helix.

File: K00152_1.png

[graphic with optimal structure]
A graphic with the structure of the optimal folding.

File: K00152_1.ct

76	dG = -29.7	K00152 K00152.1 E.coli Arg-tRNA-1.
1	G	0	2	72	1	0	2
2	C	1	3	71	2	1	3
3	A	2	4	70	3	2	4
4	T	3	5	69	4	3	5
5	C	4	6	68	5	4	6
6	C	5	7	67	6	5	7
7	G	6	8	66	7	6	8

  [Part of this file has been deleted for brevity]

70	T	69	71	3	70	69	71
71	G	70	72	2	71	70	72
72	C	71	73	1	72	71	73
73	A	72	74	0	73	72	0
74	C	73	75	0	74	0	0
75	C	74	76	0	75	0	0
76	A	75	0	0	76	0	0

This file contains instructions for drawing the secondary structure above (see below for exact syntax).

[The files K00152.ann, K00152.dG, and the structure 2, 3 and 4 have been omitted for brevity]

Prediction reliability annotation

If unafold is running in energy minimization mode only one structure is predicted and you of course have no information about the reliability of this prediction. Otherwise, several alternative structures are predicted and it is possible to add to the graphical representation of the foldings some annotation about the reliablity of the prediction. The parameter -annotation= currently take three values :
  1. none : secondary structures are drawn without any special annotation. Letters or outline are in black, while base pairs are coloured lines (red for GC, blue for AU and green for GU).
  2. ann : Base characters are replaced or covered by coloured dots that show how well-determined the folding is. If unafold in running in MFOLD mode a "p-num" file with the number of alternative pairings is used, bases that are never paired or are not paired with more than one partner are in red, the other bases are in a scale from red to black according to the number of alternative partners. If unafold in running in partition function calculation mode a "prob" file with the probabilities is used and the probabilities are expressed in a scale from black to red. Also see below for the colour scale and the .ann file syntax.
  3. ss-count : Base characters are replaced or covered by coloured dots that show how likely they are to be single-stranded. An "ss-count" file is used. If unafold in running in MFOLD mode bases that are never paired are in red, bases that are paired in all alternative foldings are in black and the others in an intermediate colour. If unafold in running in partition function calculation mode a scale from black to red indicates the prabability of the base being single-stranded.
Color% P-numProbability
 0.0-2.50.999
 2.5-5.00.998
 5.0-7.50.997
 7.5-10.00.997
 10.0-12.50.995
 12.5-15.00.994
 15.0-17.50.991
 17.5-20.00.988
 20.0-22.50.984
 22.5-25.00.978
Color% P-numProbability
 25.0-27.50.969
 27.5-30.00.958
 30.0-32.50.943
 32.5-35.00.923
 35.0-37.50.894
 37.5-40.00.856
 40.0-42.50.803
 42.5-45.00.731
 45.0-47.50.634
 47.5-50.00.500
Color% P-numProbability
 50.0-52.50.500
 52.5-55.00.366
 55.0-57.50.269
 57.5-60.00.197
 60.0-62.50.144
 62.5-65.00.106
 65.0-67.50.077
 67.5-70.00.057
 70.0-72.50.042
 72.5-75.00.031
Color% P-numProbability
 75.0-77.50.022
 77.5-80.00.016
 80.0-82.50.012
 82.5-85.00.009
 85.0-87.50.006
 87.5-90.00.005
 90.0-92.50.003
 92.5-95.00.003
 95.0-97.50.002
 97.5-100.00.001

The full list of the possible text output files

Data files

The "wrapper" unafold does not allow to change the energy tables.

Notes

None.

References

  1. The "classic" article on the suboptimal algorithm :
    M. Zuker
    On Finding All Suboptimal Foldings of an RNA Molecule. Science 244, 48-52 (1989)
  2. The reference for the MFOLD software :
    M. Zuker, D.H. Mathews & D.H. Turner
    Algorithms and Thermodynamics for RNA Secondary Structure Prediction: A Practical Guide in RNA Biochemistry and Biotechnology, J. Barciszewski & B.F.C. Clark, eds., NATO ASI Series, Kluwer Academic Publishers, (1999)
  3. The reference for the UNAFOLD software :
    N.R. Markham & M. Zuker
    UNAFold: Software for Nucleic Acid Folding and Hybridization. Methods Mol Biol. 453:3-31, 2008.
  4. The version 2.3 free energy rules for folding RNA :
    A.E. Walter, D.H. Turner, J. Kim, M.H. Lyttle, P. Müller, D.H. Mathews & M. Zuker
    Coaxial stacking of helixes enhances binding of oligoribonucleotides and improves predictions of RNA folding. Proc. Natl. Acad. Sci. USA 91, 9218-9222 (1994)
  5. The version 3.1 free energy rules for folding RNA :
    D.H. Mathews, J. Sabina, M. Zuker & D.H. Turner
    Expanded Sequence Dependence of Thermodynamic Parameters Provides Robust Prediction of RNA Secondary Structure J. Mol. Biol. 288:911-940 (1999)
  6. The free energy rules for folding DNA :
    J.Jr SantaLucia
    "A unified view of polymer, dumbbell, and oligonucleotide DNA nearest-neighbor thermodynamics." Proc. Natl. Acad. Sci. USA 95, 1460-1465 (1998)
  7. Annotation of base pair prediction reliability for MFOLD :
    M. Zker & A.B. Jacobson
    Using reliability information to annotate RNA secondary structures. RNA, 4:669-679, 1998.
  8. Annotation of base pair probabilities for partition function calculation mode :
    D.H. Matthews
    Using an RNA secondary structure partition function to determine confidence in base pairs predicted by free energy minimizatin. RNA, 10:1174-1177, 2004.

Warnings

UNAFOLD cannot find secondary structures that contain "pseudo-knots" (see Algorithm section for definition of pseudo-knots).

Diagnostic Error Messages

If no stable foldings are found the program issues the following message :

      No foldings were found  

If you add -annotation=ann or -annotation=ss-count to the command line, while the program is running in energy minimization mode, it issues the following warning :

      annotation reset to 'none' ! If only one folding is computed,
            counting alternative base pairings is pointless.

Exit status

It always exits with status 0.

Known bugs

The title in the "energy dotplot"/"basepair probability plot" contains a rogue character before the "°C".

See also

Program nameDescription
cmsearchrfam Scans nucleic acids for RNA genes and conserved motifs using Rfam
einverted Finds DNA inverted repeats
hybridize Prediction of hybridization between nucleic acid sequences, with consideration of seconcary structure

Author(s)

The wrapper application unafold was written by Guy Bottu (gbottu@vub.ac.be)
BEN, ULB, Brussels, Belgium

The UNAFOLD suite itself was written by Michael Zuker (zukerm@rpi.edu) and Nicholas Markham (markhn@rpi.edu) at Rensselaer Polytechnic Institute (Troy, New York, USA).

History

unafold is a rewrite and a successor of mfold, the "wrapper" for MFOLD.

Completed 23 October 2008.

Target users

This program is intended to be used by everyone and everything, from naive users to embedded scripts.