hybridize

 

Function

Prediction of hybridization between nucleic acid sequences, with consideration of secondary structure

Description

hybridize is an EMBOSS "wrapper" program for the hybrid2.pl script from the UNAFOLD (Unified Nucleic Acid Folding) suite of Markham and Zuker. It predicts whether two nucleic acid sequences will hybridize with each other rather than fold on themselves, and estimates the "melting temperature" of the complex. It plots the concentrations of the different species, the estimated heat capacity and the estimated molar extinction coefficient at 260 nm as functions of temperature.

Note that hybridization between sequences is concentration dependent, so the user must provide the concentration of each sequence involved (no default is provided). hybridize can operate in 3 different modes:

  1. single sequence : hybridize considers as possible species present the unfolded sequence, the sequence folded on itself and the sequence hybridized to another copy of itself.
  2. double-stranded sequence : the computations are the same as for two sequences (see below), but the user provides only one sequence, which is taken to be one strand of a double-stranded molecule. The "wrapper" takes care of presenting this to the "naked" software as two complementary sequences at the same concentration.
  3. two sequences : hybridize considers as possible species present Au (the first sequence unfolded), Af (the first sequence folded on itself), AA (the first sequence hybridized to another copy of itself), Bu (the second sequence unfolded), Bf (the second sequence folded on itself), BB (the second sequence hybridized to another copy of itself) and AB (a hybrid between the two sequences).

By default hybridize considers the molecule(s) to be RNA, but you can specify that they are DNA instead. The current version of hybridize does not consider foldings where a sequence makes base pairs with both another sequence and itself. For this reason it works best with short oligonucleotides and does not handle longer sequences and/or sequences of very uneven length well. The "wrapper" accepts only sequences up to 100 bases long.
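
As an illustration of the three modes, here is a minimal Python sketch of how a second strand could be derived in "double-stranded sequence" mode and which species each mode involves. The function names are hypothetical and are not part of hybridize or UNAFOLD, and using the labels Au, Af and AA for mode 1 is an assumption carried over from the description of mode 3.

```python
# Hypothetical helpers for illustration only; not the wrapper's actual code.
PAIRS = {"C": "G", "G": "C", "T": "A", "U": "A"}

def complementary_strand(seq, dna=False):
    """Return the reverse complement of seq, written 5'->3'."""
    table = dict(PAIRS)
    table["A"] = "T" if dna else "U"  # RNA is the default, as in hybridize
    return "".join(table[base] for base in reversed(seq.upper()))

def species(mode):
    """List the species considered in each operation mode (labels assumed)."""
    if mode == 1:
        return ["Au", "Af", "AA"]
    if mode in (2, 3):
        # Mode 2 is computed like mode 3, with B the complement of A.
        return ["Au", "Af", "AA", "Bu", "Bf", "BB", "AB"]
    raise ValueError("mode must be 1, 2 or 3")
```

A sequence that equals its own reverse complement (for example GGATCC as DNA) would give two identical strands in mode 2.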

Algorithm

A complete explanation of the algorithm is beyond the scope of this on-line manual. You should look at the references. In brief :

hybridize uses built-in sets of energy tables to compute the complete partition function and, from it, base pair probabilities (see also unafold). It does this for the different folded species (Af, Bf, AA, BB, AB) and at different temperatures (set by tmin, tmax and tinc). The temperature range actually used extends from 5° below tmin to 5° above tmax, because of the need to compute the derivative.
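
The extended temperature range can be pictured with the following small Python sketch. It assumes that the grid simply starts 5° below tmin and steps by tinc up to 5° above tmax; the function name and the `margin` parameter are illustrative and not taken from the software.

```python
def computed_temperatures(tmin=0, tmax=100, tinc=1, margin=5):
    """Temperatures (in degrees C) assumed to be evaluated for a given request."""
    if tinc < 1:
        raise ValueError("tinc must be 1 or more")
    if tmin > tmax:
        raise ValueError("tmin must not exceed tmax")
    # range() excludes its stop value, hence the + 1
    return list(range(tmin - margin, tmax + margin + 1, tinc))
```

With the default qualifiers this gives temperatures from -5 to 105, which matches the limits quoted for tmin and tmax.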

The contribution of unfolded molecules to the ensemble is estimated assuming that a single-stranded molecule "melts" (cooperatively goes from "stacked" to "unstacked") at 50°C and that the "stacked" state has an enthalpy that corresponds to 10% of the enthalpy of a fully double-stranded molecule.
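
One way to read this two-state assumption is sketched below in Python. It assumes that the stacking enthalpy is a fixed fraction of the duplex enthalpy and that the stacking free energy is zero at the melting temperature, which fixes the entropy. This is a textbook two-state model written for illustration; the function names are hypothetical and the exact formula used by UNAFOLD may differ.

```python
import math

R = 1.9872e-3  # gas constant in kcal/(mol K)

def stacking_free_energy(h_duplex, t_celsius, fraction=0.1, tmelt=50.0):
    """Free energy (kcal/mol) of the 'stacked' single strand at t_celsius.

    h_duplex is the enthalpy of the fully double-stranded molecule
    (negative, kcal/mol). fraction and tmelt mirror the -fraction and
    -tmelt qualifiers.
    """
    h_stack = fraction * h_duplex
    s_stack = h_stack / (tmelt + 273.15)   # chosen so that G = 0 at tmelt
    return h_stack - (t_celsius + 273.15) * s_stack

def stacked_fraction(h_duplex, t_celsius, fraction=0.1, tmelt=50.0):
    """Two-state fraction of single strands in the stacked state."""
    g = stacking_free_energy(h_duplex, t_celsius, fraction, tmelt)
    return 1.0 / (1.0 + math.exp(g / (R * (t_celsius + 273.15))))
```

Under these assumptions half of the single strands are stacked at 50°C, more below that temperature and fewer above it.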

hybridize uses the above to compute the concentration of the various species present and the free energies (G) for the ensemble. It computes :

  enthalpy      : H  = G - T * dG/dT
  entropy       : S  = - dG/dT
  heat capacity : Cp = - T * d²G/dT²

It computes the molar extinction coefficient using values for dinucleotides.
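
The thermodynamic relations for H, S and Cp can be evaluated numerically from a table of G against temperature. The Python sketch below uses central finite differences on an evenly spaced grid in kelvin; it illustrates the formulas only and is not the code used by UNAFOLD. The function name is hypothetical.

```python
def thermo_from_g(temps_k, g):
    """Return (T, H, S, Cp) tuples for the interior points of the grid.

    temps_k : evenly spaced absolute temperatures (K)
    g       : free energy of the ensemble at each temperature
    """
    step = temps_k[1] - temps_k[0]
    for i in range(1, len(temps_k)):
        if abs((temps_k[i] - temps_k[i - 1]) - step) > 1e-9:
            raise ValueError("temperatures must be evenly spaced")
    rows = []
    for i in range(1, len(temps_k) - 1):
        t = temps_k[i]
        dg = (g[i + 1] - g[i - 1]) / (2.0 * step)            # dG/dT
        d2g = (g[i + 1] - 2.0 * g[i] + g[i - 1]) / step**2   # d²G/dT²
        rows.append((t, g[i] - t * dg, -dg, -t * d2g))
    return rows
```

Central differences need one neighbouring point on each side, so the first and last temperatures of the grid yield no values. This is one reason why a margin around the requested range is useful.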

Usage

Here is a sample session with hybridize

> hybridize
Prediction of hybridization between nucleic acid sequences, with
consideration of secondary structure
         1 : single sequence
         2 : double-stranded sequence
         3 : two sequences
Operation mode and input type [3]:
Input nucleotide sequence: asis::CAACCTCGATCGGGAGATTG
Second sequence: asis::GCTTCTCCAGATCCAGGTTG
Molecule is DNA rather than RNA [N]:
Concentration of first sequence [0.0]: 0.0000001
Concentration of second sequence [0.0]: 0.0000001
Base name for output files [asis]: oligo1
Base name for output files, second part [asis]: oligo2
        ps : PostScript
       pdf : PDF
Graphic output format [ps]:
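
The same run can in principle be started without prompts, for example from a script. The Python sketch below only assembles the argument list from the qualifiers documented under "Command line arguments"; the helper name is hypothetical, and passing the concentrations as strings is a choice made here to avoid any number formatting issue.

```python
def build_hybridize_command(aseq, bseq, aconc, bconc,
                            abasename, bbasename, dna=False, graph="ps"):
    """Assemble a non-interactive 'two sequences' (mode 3) command line."""
    cmd = [
        "hybridize",
        "-mode", "3",
        "-asequence", aseq,
        "-bsequence", bseq,
        "-aconc", aconc,          # concentrations given as strings
        "-bconc", bconc,
        "-abasename", abasename,
        "-bbasename", bbasename,
        "-graph", graph,
        "-auto",                  # general qualifier: turn off prompts
    ]
    if dna:
        cmd.append("-dna")        # toggle: molecule is DNA rather than RNA
    return cmd

# To run it, pass the list to subprocess.run(cmd, check=True)
# on a system where hybridize is installed.
cmd = build_hybridize_command(
    "asis::CAACCTCGATCGGGAGATTG", "asis::GCTTCTCCAGATCCAGGTTG",
    "0.0000001", "0.0000001", "oligo1", "oligo2")
```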


Command line arguments

   Standard (Mandatory) qualifiers (* if not always prompted):
   -mode               menu       [3] Operation mode and input type (Values: 1
                                  (single sequence); 2 (double-stranded
                                  sequence); 3 (two sequences))
  [-asequence]         sequence   Nucleotide sequence filename and optional
                                  format, or reference (input USA)
*  -bsequence          sequence   Nucleotide sequence filename and optional
                                  format, or reference (input USA)
   -dna                toggle     Molecule is DNA rather than RNA
   -aconc              float      [no default !] Concentration of first
                                  sequence. Entering value is required !
                                  (Number 0.000 or more)
*  -bconc              float      [no default !] Concentration of second
                                  sequence. Entering value is required when
                                  hybridize is running in mode 3 ! (Number
                                  0.000 or more)
   -abasename          string     [$(asequence.name)] Base name for output
                                  files (Any string is accepted)
*  -bbasename          string     [$(bsequence.name)] Base name for output
                                  files, second part (Any string is accepted)
   -graph              menu       [ps] Graphic output format (Values: ps
                                  (PostScript); pdf (PDF))

   Additional (Optional) qualifiers (* if not always prompted):
   -tmin               integer    [0] Minimum temperature (Integer from -5 to
                                  105)
   -tmax               integer    [100] Maximum temperature (Integer from -5
                                  to 105)
   -tinc               integer    [1] Temperature increment (Integer 1 or
                                  more)
*  -na                 float      [1.0] Na+ molar concentration (for DNA only)
                                  (Number 0.000 or more)
*  -mg                 float      [0.0] Mg++ molar concentration (for DNA
                                  only) (Number 0.000 or more)
   -minstem            integer    [2] Minimum stem size. The default implies
                                  that a helix of size 1, that is a base pair
                                  that does not stack on another base pair on
                                  either side, is not considered (Integer 1 or
                                  more)
   -maxloop            integer    [30] Maximum interior or bulge loop size
                                  (Integer 0 or more)
   -maxbp              integer    [No limit] Maximum distance between pairing
                                  bases (Integer 0 or more)
   -[no]unfolded       toggle     [Y] Estimate enthalpy and entropy unfolded
                                  sequence. If you unset this the unfolded
                                  sequences are not taken into account for
                                  calculating the ensemble.
*  -fraction           float      [0.1] Fraction of double strand stacking
                                  enthalpy used for estimating enthalpy
                                  unfolded sequence (Number from 0.000 to
                                  1.000)
*  -tmelt              integer    [50] Melting temperature used for estimation
                                  of entropy unfolded sequence (Integer from
                                  -5 to 105)

   Advanced (Unprompted) qualifiers:
   -storeall           boolean    Store all UNAFOLD output files

   Associated qualifiers:

   "-asequence" associated qualifiers
   -sbegin1            integer    Start of the sequence to be used
   -send1              integer    End of the sequence to be used
   -sreverse1          boolean    Reverse (if DNA)
   -sask1              boolean    Ask for begin/end/reverse
   -snucleotide1       boolean    Sequence is nucleotide
   -sprotein1          boolean    Sequence is protein
   -slower1            boolean    Make lower case
   -supper1            boolean    Make upper case
   -sformat1           string     Input sequence format
   -sdbname1           string     Database name
   -sid1               string     Entryname
   -ufo1               string     UFO features
   -fformat1           string     Features format
   -fopenfile1         string     Features file name

   "-bsequence" associated qualifiers
   -sbegin             integer    Start of the sequence to be used
   -send               integer    End of the sequence to be used
   -sreverse           boolean    Reverse (if DNA)
   -sask               boolean    Ask for begin/end/reverse
   -snucleotide        boolean    Sequence is nucleotide
   -sprotein           boolean    Sequence is protein
   -slower             boolean    Make lower case
   -supper             boolean    Make upper case
   -sformat            string     Input sequence format
   -sdbname            string     Database name
   -sid                string     Entryname
   -ufo                string     UFO features
   -fformat            string     Features format
   -fopenfile          string     Features file name

   General qualifiers:
   -auto               boolean    Turn off prompts
   -stdout             boolean    Write standard output
   -filter             boolean    Read standard input, write standard output
   -options            boolean    Prompt for standard and additional values
   -debug              boolean    Write debug output to program.dbg
   -verbose            boolean    Report some/full command line options
   -help               boolean    Report command line options. More
                                  information on associated and general
                                  qualifiers can be found with -help -verbose
   -warning            boolean    Report warnings
   -error              boolean    Report errors
   -fatal              boolean    Report fatal errors
   -die                boolean    Report dying program messages

Standard (Mandatory) qualifiers

Qualifier       Description                                   Allowed values                Default
-mode           Operation mode and input type                 1 (single sequence)           3
                                                              2 (double-stranded sequence)
                                                              3 (two sequences)
[-asequence]    Nucleotide sequence filename and optional     Readable sequence             Required
(Parameter 1)   format, or reference (input USA)
-bsequence      Nucleotide sequence filename and optional     Readable sequence             Required
                format, or reference (input USA)
-dna            Molecule is DNA rather than RNA               Toggle value Yes/No           No
-aconc          Concentration of first sequence. Entering     Number 0.000 or more          no default !
                value is required !
-bconc          Concentration of second sequence. Entering    Number 0.000 or more          no default !
                value is required when hybridize is running
                in mode 3 !
-abasename      Base name for output files                    Any string is accepted        <first sequence name>
-bbasename      Base name for output files, second part       Any string is accepted        <second sequence name>
-graph          Graphic output format                         ps (PostScript)               ps
                                                              pdf (PDF)

Additional (Optional) qualifiers

Qualifier       Description                                   Allowed values                Default
-tmin           Minimum temperature                           Integer from -5 to 105        0
-tmax           Maximum temperature                           Integer from -5 to 105        100
-tinc           Temperature increment                         Integer 1 or more             1
-na             Na+ molar concentration (for DNA only)        Number 0.000 or more          1.0
-mg             Mg++ molar concentration (for DNA only)       Number 0.000 or more          0.0
-minstem        Minimum stem size. The default implies that   Integer 1 or more             2
                a helix of size 1, that is a base pair that
                does not stack on another base pair on
                either side, is not considered
-maxloop        Maximum interior or bulge loop size           Integer 0 or more             30
-maxbp          Maximum distance between pairing bases        Integer 0 or more             No limit
-[no]unfolded   Estimate enthalpy and entropy unfolded        Toggle value Yes/No           Yes
                sequence. If you unset this the unfolded
                sequences are not taken into account for
                calculating the ensemble.
-fraction       Fraction of double strand stacking enthalpy   Number from 0.000 to 1.000    0.1
                used for estimating enthalpy unfolded
                sequence
-tmelt          Melting temperature used for estimation of    Integer from -5 to 105        50
                entropy unfolded sequence

Advanced (Unprompted) qualifiers

Qualifier       Description                                   Allowed values                Default
-storeall       Store all UNAFOLD output files                Boolean value Yes/No          No

Input file format

hybridize reads any normal sequence USA for nucleic acids. The current version, however, has an upper limit of 100 bases per sequence.

Output file format

hybridize calls several programs and scripts from the UNAFOLD suite, which create quite a lot of files in a temporary storage area. By default the "wrapper" stores only 3 graphical files in the user's personal directory. Currently the user can choose between PostScript and PDF format. With the option -storeall the user can request that all other output files be stored as well, except the temporary scripts.

Output files for usage example

File: oligo1-oligo2.conc.ps

[graphic: concentrations of the molecular species as a function of temperature]
The concentrations of all molecular species plotted as a function of temperature.

File: oligo1-oligo2.Cp.ps

[graphic: heat capacity as a function of temperature]
The heat capacity of the ensemble plotted as a function of temperature.

File: oligo1-oligo2.ext.ps

[graphic: extinction coefficient as a function of temperature]
The molar extinction coefficient for UV radiation at 260 nm plotted as a function of temperature.
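
The names of the three graphical files in this example appear to follow a simple pattern built from the two base names. The Python sketch below reproduces that pattern for the "two sequences" mode. It is inferred from the example only: the function name is hypothetical, and the use of a ".pdf" extension for PDF output is an assumption.

```python
def graphic_output_files(abasename, bbasename, graph="ps"):
    """Guess the graphical output file names for a mode 3 run."""
    if graph not in ("ps", "pdf"):
        raise ValueError("graph must be 'ps' or 'pdf'")
    stem = f"{abasename}-{bbasename}"
    # conc: concentrations, Cp: heat capacity, ext: extinction coefficient
    return [f"{stem}.{kind}.{graph}" for kind in ("conc", "Cp", "ext")]
```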

Other output files you can obtain are :

Data files

The "wrapper" hybridize does not allow the energy tables to be changed.

Notes

None.

References

  1. N.R. Markham & M. Zuker
    UNAFold: Software for Nucleic Acid Folding and Hybridization. Methods Mol. Biol. 453:3-31 (2008)
  2. A.E. Walter, D.H. Turner, J. Kim, M.H. Lyttle, P. Müller, D.H. Mathews & M. Zuker
    Coaxial stacking of helixes enhances binding of oligoribonucleotides and improves predictions of RNA folding. Proc. Natl. Acad. Sci. USA 91:9218-9222 (1994)
  3. J. SantaLucia Jr
    A unified view of polymer, dumbbell, and oligonucleotide DNA nearest-neighbor thermodynamics. Proc. Natl. Acad. Sci. USA 95:1460-1465 (1998)
  4. D.H. Mathews
    Using an RNA secondary structure partition function to determine confidence in base pairs predicted by free energy minimization. RNA 10:1174-1177 (2004)
  5. A. Dimitrov & M. Zuker
    Prediction of Hybridization and Melting for Double-Stranded Nucleic Acids. Biophys. J. 87:215-226 (2004)

Warnings

The current version of the UNAFOLD suite does not consider foldings where a sequence makes base-pairs with both another sequence and itself. Also, the computation of the free enthalpy for the unfolded species is done with an ad hoc formula. Therefore results should be taken with a "grain of salt".

Diagnostic Error Messages

If you do not provide a value for the concentration of one or both of the input sequences, the program issues one of these messages:

 You must provide a value for the concentration of the first sequence !

 You must provide a value for the concentration of the second sequence !

If one or both of the sequences are longer than 100 bases, the program issues one of these messages :

 The sequence should not be longer than 100 bases.

 The sequences should not be longer than 100 bases.

If in "two sequences" mode the two input sequences are identical, the program issues the message :

 The two sequences should be different!

If in "double-stranded sequence" mode it turns out that the two strands of the sequence are identical, the program reverts to functioning in "single sequence" mode and issues the message :

 The sequence is a palindrome, hence only one unfolded species !

Error messages from the "naked" programs are displayed. For example, the UNAFOLD suite programs can fail to compute some values, so that gnuplot cannot make the plot. You will then see messages like the following (note: "nan" means "not a number"):

Warning: at 105 degrees the relative error of [B]+2[BB]+[AB] is nan

line 16: undefined variable: nan

If such error messages are issued, it can be useful to re-run the program with tmin and tmax set so that the temperatures at which the errors occur lie at least 5° outside the computation range.

If a sequence is shorter than 12 bases, you can see :

Note: for sequences of 11 bases or less, hybrid-ss-noml is functionally
equivalent to hybrid-ss, and significantly faster

Exit status

It exits prematurely with status 255 and an error message if an input sequence has a length of more than 100, if in "two sequences" mode the two sequences are identical or if the concentration of an input sequence is missing.
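
The checks that lead to a premature exit can be summarized in a short Python sketch. The message texts are the ones listed under "Diagnostic Error Messages". The function name, the order of the checks and the choice between the singular and plural length message are assumptions made for illustration.

```python
MAX_LENGTH = 100   # upper limit on sequence length accepted by the wrapper
EXIT_STATUS = 255  # status reported when one of these checks fails

def check_inputs(mode, aseq, aconc=None, bseq=None, bconc=None):
    """Return an error message, or None if the inputs pass the checks."""
    if mode == 3 and bseq is None:
        raise ValueError("mode 3 needs a second sequence")
    if aconc is None:
        return ("You must provide a value for the concentration "
                "of the first sequence !")
    if mode == 3:
        if bconc is None:
            return ("You must provide a value for the concentration "
                    "of the second sequence !")
        if len(aseq) > MAX_LENGTH or len(bseq) > MAX_LENGTH:
            return "The sequences should not be longer than 100 bases."
        if aseq.upper() == bseq.upper():
            return "The two sequences should be different!"
    elif len(aseq) > MAX_LENGTH:
        return "The sequence should not be longer than 100 bases."
    return None
```

A calling script could print the returned message and exit with `EXIT_STATUS` when the result is not None.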

Known bugs

The name of the sequence in the plot is sometimes messed up.

See also

Program name    Description
cmsearchrfam    Scans nucleic acids for RNA genes and conserved motifs using Rfam
einverted       Finds DNA inverted repeats
unafold         Prediction of optimal and suboptimal RNA or DNA secondary structure

Author(s)

The wrapper application hybridize was written by Guy Bottu (gbottu@vub.ac.be)
BEN, ULB, Brussels, Belgium

The UNAFOLD suite itself was written by Michael Zuker (zukerm@rpi.edu) and Nicholas Markham (markhn@rpi.edu) at Rensselaer Polytechnic Institute (Troy, New York, USA).

History

Completed 23 October 2008

Target users

This program is intended to be used by everyone and everything, from naive users to embedded scripts.