|
|
clustalnj |
The algorithm used by clustalnj is the NJ (Neighbour Joining) method of Saitou and Nei. First you calculate distances (percent divergence) between all pairs of sequences from a multiple alignment ; second you apply the NJ method to the distance matrix.
With the -ignoregaps option any alignment positions where ANY of the sequences have a gap will be ignored. This means that 'like' will be compared to 'like' in all distances, which is highly desirable. It also automatically throws away the most ambiguous parts of the alignment, which are concentrated around gaps (usually). The disadvantage is that you may throw away much of the data if there are many gaps (which is why it is difficult for us to make it the default).
With the -kimura option the correction formula of Kimura is
applied. For small divergence (say <10%) this option makes no
difference. For greater divergence, it corrects for the fact that
observed distances underestimate actual evolutionary distances. This is
because, as sequences diverge, more than one substitution will happen at
many sites. However, you only see one difference when you look at the
present day sequences. Therefore, this option has the effect of
stretching branch lengths.
Where possible, this option should be used. However, for VERY divergent
sequences, the distances cannot be reliably corrected. You will be warned if
this happens.
With the -bootstrap option confidence values for the groupings in the tree are computed. The bootstrap method was first adapted for trees by Joe Felsenstein. It involves making N random samples of sites from the alignment (N should be LARGE, 1000 is proposed by default), drawing N trees (1 from each sample) and counting how many times each grouping from the original tree occurs in the sample trees. You can set N using the -trials=n option. Note that using a big N can make the program run for a long time. You can also supply a seed number for the random number generator (-seed=n). Different runs with the same seed will give the same answer.
The method we provide is the Neighbor Joining method (NJ) of Saitou and Nei, which is a distance method. We use this for three reasons : it is conceptually and computationally simple ; it is fast ; it gives "good" trees in simple cases. It is difficult to prove that one tree is "better" than another if you do not know the true phylogeny ; the few systematic surveys of methods show it to work more or less as well as any other method ON AVERAGE. Another reason for using the NJ method is that it is very commonly used ; THIS IS A BAD REASON SCIENTIFICALLY but at least you will not feel lonely if you use it.
The NJ method works on a matrix of distances (the distance matrix) between all pairs of sequence to be analysed. These distances are related to the degree of divergence between the sequences. It is normal to calculate the distances from the sequences after they are multiply aligned. If you calculate them from seperate alignments (as is done for computing the clustal "guide tree") you may increase the error considerably.
In a simple world, you would like a distance to be proportional to the time since the sequences diverged. In practice this OBVIOUSLY depends on the existence and quality of the "molecular clock", a subject of ongoing debate. However, even if there is a good clock, there is a further problem with estimating divergences. As sequences diverge, they become "saturated" with mutations. Sites can have substitutions more than once. Calculated distances will underestimate actual divergence times ; the greater the divergence, the greater the discrepancy. There are various methods for dealing with this and we provide two commonly used ones, both due to Motoo Kimura, one for proteins and one for DNA.
One paradoxical effect of these corrections, is that distances can be corrected to have more than 100% divergence. That is because, for very highly diverged sequences of length N, you can estimate that more than N substitutions have occured by correcting the observed distance in the above ways. Don't panic!
Correction for nucleotide distances : Kimura's 2-parameter method (Kimura, M. (1980) A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences. J. Mol. Evol. 16, 111-120)
Corrected K = 0.5*ln(a) + 0.25*ln(b)
where a = 1/(1 - 2*P - Q)
and b = 1/(1 - 2*Q)
P and Q are the proportions of transitions (A<-->G, C<-->T)
and transversions occuring between the sequences.
Correction for protein distances :
The formula we use to correct for multiple hits is from Motoo Kimura
(Kimura, M. The neutral Theory of Molecular Evolution, Camb.Univ.Press, 1983,
page 75) and is :
K = -Ln(1 - D - (D.D)/5) where D is the observed distance and K is
corrected distance.
This formula gives the mean number of estimated substitutions per site and, in
contrast to D (the observed number), can be greater than 1 i.e. more than
one substitution per site, on average. For example, if you observe 0.8
differences per site (80% difference ; 20% identity), then the above formula
predicts that there have been 2.5 substitutions per site over the course
of evolution since the 2 sequences diverged. This can also be expressed in
PAM units by multiplying by 100 (mean number of substitutions per 100 residues).
The PAM scale of evolution and its derivation/calculation comes from the
work of Margaret Dayhoff and co workers (the famous Dayhoff PAM series
of weight matrices also came from this work). Dayhoff et al constructed
an elaborate model of protein evolution based on observed frequencies
of substitution between very closely related proteins. Using this model,
they derived a table relating observed distances to predicted PAM distances.
Kimura's formula, above, is just a "curve fitting" approximation to this table.
It is very accurate in the range 0.75 > D > 0.0 but becomes increasingly
unaccurate at high D (>0.75) and fails completely at around D = 0.85.
To circumvent this problem, we calculate all the values for K corresponding to D above 0.75 directly using the Dayhoff model and store these in an internal table. This table gives values of K for all D between 0.75 and 0.93 in intervals of 0.001 i.e. for D = 0.750, 0.751, 0.752 ...... 0.929, 0.930. For any observed D higher than 0.930, we arbitrarily set K to 10.0. This sounds drastic but with real sequences, distances of 0.93 (less than 7% identity) are rare. If your data set includes sequences with this degree of divergence, you will have great difficulty getting accurate trees by ANY method ; the alignment itself will be very difficult (to construct and to evaluate).
In practice, you should use a very large number of bootstrap replicates (1000 is recommended, even if if this takes some time) For each grouping on the tree, you record the number of times this grouping occurs in the sample trees. For a group to be considered "significant" at the 95% level (or P <= 0.05 in statistical terms) you expect the grouping to show up in >= 95% of the sample trees. If this happens, then you can say that the grouping is significant, given the data set and the method used to draw the tree.
So, when you use the bootstrap option, a NJ tree is drawn as before and then you are asked to say how many bootstrap samples you want (1000 is the default) and you are asked to give a seed number for the random number generator. If you give the same seed number in future, you will get the same results (we hope). Remember to give different seed numbers if you wish to carry out genuinely different bootstrap sampling experiments.
For an unrooted tree with N sequences, there are actually only N-3 genuinely different groupings that we can test (this is the number of "internal branches" ; each internal branch splits the sequences into 2 groups). In this example, we have 6 sequences with 3 internal branches in the reference tree. In the bootstrap resampling, we count how often each of these internal branches occur. Here, we find that the branch which splits 1,2,3 and 4 versus 5 and 6 occurs in all 1000 samples ; the branch which splits 2,3 and 4 versus 1,5 and 6 occurs in 1000 ; the branch which splits 2 and 3 versus 1,4,5 and 6 occurs in 812/1000 samples. We can put these figures on to the diagrammatic representation we made earlier of our unrooted NJ tree as follows :
SEQ 1 SEQ 4
I I
I I
SEQ 6 ----------I I I I--------- SEQ 2
I 1000 I 1000 I 812 I
I--------I-----------I----------I
I I
SEQ 5 ----------I I--------- SEQ 3
You can equally put these confidence figures on the rooted tree (in
fact the interpretation is simpler with rooted trees). With the
unrooted tree, the grouping of sequence 5 with 6 is significant (as
is the grouping of sequences 1,2,3 and 4). Equally the grouping of
sequences 1,5 and 6 is significant (the same as saying that 2,3 and
4 group significantly). However, the grouping of 2 and 3 is not
significant, although it is relatively strongly supported.
Unfortunately, there is a small complication in the interpretation
of these results. In statistical hypothesis testing, it is not
valid to make multiple simultaneous tests and to treat the result of
each test completely independantly. In the above case, if you have
one particular test (grouping) that you wish to make in advance, it
is valid to test IT ALONE and to simply show the other bootstrap
figures for reference. If you do not have any particular test in
mind before you do the bootstrapping, you can just show all of the
figures and use the 95% level as an ARBITRARY cut off to show those
groups that are very strongly supported ; but do not mention anything
about SIGNIFICANCE testing. In the literature, it is common
practice to simply show the figures with a tree ; they frequently
speak for themselves.
> clustalnj Neighbor-Joining phylogenetic tree from multiple alignment Input sequence(s): ADH.msf Phylogenetic tree in nested parentheses format [yahk_ecoli.ph]: ADH.ph |
Go to the input files for this example
Go to the output files for this example
Standard (Mandatory) qualifiers (* if not always prompted):
[-seqs] seqall (Gapped) sequence(s) filename and optional
format, or reference (input USA)
[-treefile] outfile [*.clustalnj] Phylogenetic tree in nested
parentheses format
* -outfile outfile [*.clustalnj] Second output file
Additional (Optional) qualifiers (* if not always prompted):
-ignoregaps boolean Ignore columns with gaps
-kimura boolean Correct for multiple substitutions according
to Kimura
-bootstrap toggle Bootstrap the data
* -trials integer [1000] Number of trials for bootstap
(Integer from 1 to 10000)
* -seed integer [111] Random number generator seed for
bootstrap (Integer from 1 to 1000)
-secout menu [0] Write second output file (Values: 0
(none); C (CLUSTAL report); N (phylogenetic
tree in NEXUS format))
Advanced (Unprompted) qualifiers:
-labelnodes boolean Write bootstrap labels on nodes instead of
branches
-[no]njplot boolean [Y] Open treefile using NJplot
Associated qualifiers:
"-seqs" associated qualifiers
-sbegin1 integer Start of each sequence to be used
-send1 integer End of each sequence to be used
-sreverse1 boolean Reverse (if DNA)
-sask1 boolean Ask for begin/end/reverse
-snucleotide1 boolean Sequence is nucleotide
-sprotein1 boolean Sequence is protein
-slower1 boolean Make lower case
-supper1 boolean Make upper case
-sformat1 string Input sequence format
-sdbname1 string Database name
-sid1 string Entryname
-ufo1 string UFO features
-fformat1 string Features format
-fopenfile1 string Features file name
"-treefile" associated qualifiers
-odirectory2 string Output directory
"-outfile" associated qualifiers
-odirectory string Output directory
General qualifiers:
-auto boolean Turn off prompts
-stdout boolean Write standard output
-filter boolean Read standard input, write standard output
-options boolean Prompt for standard and additional values
-debug boolean Write debug output to program.dbg
-verbose boolean Report some/full command line options
-help boolean Report command line options. More
information on associated and general
qualifiers can be found with -help -verbose
-warning boolean Report warnings
-error boolean Report errors
-fatal boolean Report fatal errors
-die boolean Report dying program messages
|
| Standard (Mandatory) qualifiers | Allowed values | Default | |||||||
|---|---|---|---|---|---|---|---|---|---|
| [-seqs] (Parameter 1) |
(Gapped) sequence(s) filename and optional format, or reference (input USA) | Readable sequence(s) | Required | ||||||
| [-treefile] (Parameter 2) |
Phylogenetic tree in nested parentheses format | Output file | <sequence>.<extension> | ||||||
| -outfile | Second output file | Output file | <sequence>.<extension> | ||||||
| Additional (Optional) qualifiers | Allowed values | Default | |||||||
| -ignoregaps | Ignore columns with gaps | Boolean value Yes/No | No | ||||||
| -kimura | Correct for multiple substitutions according to Kimura | Boolean value Yes/No | No | ||||||
| -bootstrap | Bootstrap the data | Toggle value Yes/No | No | ||||||
| -trials | Number of trials for bootstap | Integer from 1 to 10000 | 1000 | ||||||
| -seed | Random number generator seed for bootstrap | Integer from 1 to 1000 | 111 | ||||||
| -secout | Write second output file |
|
0 | ||||||
| Advanced (Unprompted) qualifiers | Allowed values | Default | |||||||
| -labelnodes | Write bootstrap labels on nodes instead of branches | Boolean value Yes/No | No | ||||||
| -[no]njplot | Open treefile using NJplot | Boolean value Yes/No | Yes | ||||||
You can request a second output file, either with a CLUSTAL report (with information about the paiwise distances and the building of the tree, in human-readable format) or with a phylogenetic tree in NEXUS format. This format is used by several popular phylogeny programs, including PAUP and MacClade. It is described fully in : Maddison, D. R., D. L. Swofford and W. P. Maddison. 1997. NEXUS : an extensible file format for systematic information. Systematic Biology 46:590-621.
By default, the bootstrap values are correctly placed on the tree branches of the phylip format output tree. With -labelnodes you can request them to be placed instead on the nodes, which is incorrect, but some display packages (e.g. TreeTool, TreeView and Phylowin) only support node labelling but not branch labelling. Care should be taken to note which branches and labels go together.
( ( YAHK_ECOLI:0.24338, ADHC_MYCTU:0.23046) :0.03188, ( CADH2_EUCGU:0.26331, MTDH2_ARATH:0.23246) :0.02635, ( ADH7_YEAST:0.34516, ADH2_BACST:0.34920) :0.06217); |
#NEXUS BEGIN TREES; TRANSLATE 1 YAHK_ECOLI, 2 ADHC_MYCTU, 3 CADH2_EUCGU, 4 MTDH2_ARATH, 5 ADH7_YEAST, 6 ADH2_BACST ; UTREE PAUP_1= ((1:0.24338,2:0.23046):0.03188,(3:0.26331,4:0.23246):0.02635,(5:0.34516,6:0.34920):0.06217); ENDBLOCK; |
DIST = percentage divergence (/100) Length = number of sites used in comparison 1 vs. 2 DIST = 0.4738; length = 344 1 vs. 3 DIST = 0.5954; length = 346 1 vs. 4 DIST = 0.5202; length = 346 1 vs. 5 DIST = 0.6763; length = 346 1 vs. 6 DIST = 0.6597; length = 335 2 vs. 3 DIST = 0.5449; length = 345 2 vs. 4 DIST = 0.5116; length = 344 2 vs. 5 DIST = 0.6850; length = 346 2 vs. 6 DIST = 0.6916; length = 334 3 vs. 4 DIST = 0.4958; length = 355 3 vs. 5 DIST = 0.6771; length = 353 3 vs. 6 DIST = 0.6976; length = 334 4 vs. 5 DIST = 0.6771; length = 353 4 vs. 6 DIST = 0.6826; length = 334 5 vs. 6 DIST = 0.6944; length = 337 Neighbor-joining Method Saitou, N. and Nei, M. (1987) The Neighbor-joining Method: A New Method for Reconstructing Phylogenetic Trees. Mol. Biol. Evol., 4(4), 406-425 This is an UNROOTED tree Numbers in parentheses are branch lengths Cycle 1 = SEQ: 5 ( 0.34516) joins SEQ: 6 ( 0.34920) Cycle 2 = SEQ: 1 ( 0.24338) joins SEQ: 2 ( 0.23046) Cycle 3 = SEQ: 3 ( 0.26331) joins SEQ: 4 ( 0.23246) Cycle 4 (Last cycle, trichotomy): Node: 1 ( 0.03188) joins Node: 3 ( 0.02635) joins Node: 5 ( 0.06217) |
To visualize the tree you can find at the BEN site the NJplot program of Manolo Gouy, which you can use under X-Window. There are two versions : njplot_unrooted displays the tree unrooted and allows to write the picture as a PostScript file. njplot displays the tree rooted and allows to manipulate the tree (swap branches, change outgroup), as well as save the altered tree and save the picture as a PostScript file. You can also find the programs drawgram and drawtree of the original PHYLIP package. They allow to preview the tree on the screen (Tektronix or X-Window) and to save the picture in a variety of formats.
Note that the NJ method is not sensitive to changes in the rate of evolution but, precisely because of this, does not pretend to find the root of the tree. The tree must be considered UNROOTED, even if you can "open" it with a program that presents the tree as rooted.
WARNING: N of the distances out of a total of M were out of range for the distance correction.
| Program name | Description |
|---|---|
| distmat | Creates a distance matrix from multiple alignments |
| fdnacomp | DNA compatibility algorithm |
| fdnadist | Nucleic acid sequence Distance Matrix program |
| fdnainvar | Nucleic acid sequence Invariants method |
| fdnaml | Estimates nucleotide phylogeny by maximum likelihood |
| fdnamlk | Estimates nucleotide phylogeny by maximum likelihood |
| fdnamove | Interactive DNA parsimony |
| fdnapars | DNA parsimony algorithm |
| fdnapenny | Penny algorithm for DNA |
| fdolmove | Interactive Dollo or Polymorphism Parsimony |
| fproml | Protein phylogeny by maximum likelihood |
| fpromlk | Protein phylogeny by maximum likelihood |
| fprotdist | Protein distance algorithm |
| fprotpars | Protein parsimony algorithm |
| frestboot | Bootstrapped restriction sites algorithm |
| frestdist | Distance matrix from restriction sites or fragments |
| frestml | Restriction site maximum Likelihood method |
| fseqboot | Bootstrapped sequences algorithm |
| modelgenerator | Selects optimal model for Maximum Likelihood phylogenetic tree from multiple alignment |
| mrbayesseqtree | Bayes phylogenetic tree from multiple alignment |
| phyml | Heuristic Maximum Likelihood phylogenetic tree from multiple alignment |
| clustal | Global multiple alignment of sequences |
| fneighbor | Phylogenies from distance matrix by N-J or UPGMA method |
The program clustalw itself was written by
Julie Thompson (Thompson@EMBL-Heidelberg.DE)
Toby Gibson (Gibson@EMBL-Heidelberg.DE)
European Molecular Biology Laboratory, Meyerhofstrasse 1, D 69117 Heidelberg, Germany
Des Higgins (Higgins@ucc.ie)
University of County Cork, Cork, Ireland