clustalnj

 

Function

Neighbor-Joining phylogenetic tree from multiple alignment

Description

clustalnj is an EMBOSS "wrapper" program for the phylogeny function of the program CLUSTAL (clustalw) of Des Higgins. It takes as input a nucleic acid or protein multiple sequence alignment and produces as output a phylogenetic tree in "nested parentheses" format. Note that the output is not a graphical display of the tree, but there exists a lot of software that takes the clustalnj output file as input and allows to visualize and/or manipulate the tree. If you work under X-Window the tree is automatically "opened" with the program njplot_unrooted.

The algorithm used by clustalnj is the NJ (Neighbour Joining) method of Saitou and Nei. First you calculate distances (percent divergence) between all pairs of sequences from a multiple alignment ; second you apply the NJ method to the distance matrix.

With the -ignoregaps option any alignment positions where ANY of the sequences have a gap will be ignored. This means that 'like' will be compared to 'like' in all distances, which is highly desirable. It also automatically throws away the most ambiguous parts of the alignment, which are concentrated around gaps (usually). The disadvantage is that you may throw away much of the data if there are many gaps (which is why it is difficult for us to make it the default).

With the -kimura option the correction formula of Kimura is applied. For small divergence (say <10%) this option makes no difference. For greater divergence, it corrects for the fact that observed distances underestimate actual evolutionary distances. This is because, as sequences diverge, more than one substitution will happen at many sites. However, you only see one difference when you look at the present day sequences. Therefore, this option has the effect of stretching branch lengths.
Where possible, this option should be used. However, for VERY divergent sequences, the distances cannot be reliably corrected. You will be warned if this happens.

With the -bootstrap option confidence values for the groupings in the tree are computed. The bootstrap method was first adapted for trees by Joe Felsenstein. It involves making N random samples of sites from the alignment (N should be LARGE, 1000 is proposed by default), drawing N trees (1 from each sample) and counting how many times each grouping from the original tree occurs in the sample trees. You can set N using the -trials=n option. Note that using a big N can make the program run for a long time. You can also supply a seed number for the random number generator (-seed=n). Different runs with the same seed will give the same answer.

Algorithm

There is a constant debate in the literature as to the merits of different methods but, unfortunately, a lot of what is said is incomprehensible or inaccurate. It is also a field that is prone to having highly opinionated schools of thought. This is a pity because it prevents rational discussion of the pro's and con's of the different methods. The approach adopted in CLUSTAL is to supply just one method and to produce alignments in a format that can be used by PHYLIP. In simple cases, the trees produced will be as "good" (reliable, robust) as those from ANY other method. In more complicated cases, there is no single magic recipe that we can supply that will work well in even most situations.

The method we provide is the Neighbor Joining method (NJ) of Saitou and Nei, which is a distance method. We use this for three reasons : it is conceptually and computationally simple ; it is fast ; it gives "good" trees in simple cases. It is difficult to prove that one tree is "better" than another if you do not know the true phylogeny ; the few systematic surveys of methods show it to work more or less as well as any other method ON AVERAGE. Another reason for using the NJ method is that it is very commonly used ; THIS IS A BAD REASON SCIENTIFICALLY but at least you will not feel lonely if you use it.

The NJ method works on a matrix of distances (the distance matrix) between all pairs of sequence to be analysed. These distances are related to the degree of divergence between the sequences. It is normal to calculate the distances from the sequences after they are multiply aligned. If you calculate them from seperate alignments (as is done for computing the clustal "guide tree") you may increase the error considerably.

Distances

The simplest measure of distance between sequences is percent divergence (100% minus percent identity). For two sequences, you count how many positions differ between them (ignoring all positions with a gap or an unknown residue) and divide by the number of positions considered. It is common practice to also ignore all positions in the alignment where there is a GAP in ANY of the sequences. Usually, you express the percent distance divided by 100 (this gives distances between 0.0 and 1.0).

In a simple world, you would like a distance to be proportional to the time since the sequences diverged. In practice this OBVIOUSLY depends on the existence and quality of the "molecular clock", a subject of ongoing debate. However, even if there is a good clock, there is a further problem with estimating divergences. As sequences diverge, they become "saturated" with mutations. Sites can have substitutions more than once. Calculated distances will underestimate actual divergence times ; the greater the divergence, the greater the discrepancy. There are various methods for dealing with this and we provide two commonly used ones, both due to Motoo Kimura, one for proteins and one for DNA.

One paradoxical effect of these corrections, is that distances can be corrected to have more than 100% divergence. That is because, for very highly diverged sequences of length N, you can estimate that more than N substitutions have occured by correcting the observed distance in the above ways. Don't panic!

Correction for nucleotide distances : Kimura's 2-parameter method (Kimura, M. (1980) A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences. J. Mol. Evol. 16, 111-120)

       Corrected K = 0.5*ln(a) + 0.25*ln(b)

       where     a = 1/(1 - 2*P - Q)
       and       b = 1/(1 - 2*Q)

       P and Q are the proportions of transitions (A<-->G, C<-->T)
       and transversions occuring between the sequences.
Correction for protein distances : The formula we use to correct for multiple hits is from Motoo Kimura (Kimura, M. The neutral Theory of Molecular Evolution, Camb.Univ.Press, 1983, page 75) and is :
K = -Ln(1 - D - (D.D)/5)  where D is the observed distance and K is
                              corrected distance.
This formula gives the mean number of estimated substitutions per site and, in contrast to D (the observed number), can be greater than 1 i.e. more than one substitution per site, on average. For example, if you observe 0.8 differences per site (80% difference ; 20% identity), then the above formula predicts that there have been 2.5 substitutions per site over the course of evolution since the 2 sequences diverged. This can also be expressed in PAM units by multiplying by 100 (mean number of substitutions per 100 residues). The PAM scale of evolution and its derivation/calculation comes from the work of Margaret Dayhoff and co workers (the famous Dayhoff PAM series of weight matrices also came from this work). Dayhoff et al constructed an elaborate model of protein evolution based on observed frequencies of substitution between very closely related proteins. Using this model, they derived a table relating observed distances to predicted PAM distances. Kimura's formula, above, is just a "curve fitting" approximation to this table. It is very accurate in the range 0.75 > D > 0.0 but becomes increasingly unaccurate at high D (>0.75) and fails completely at around D = 0.85.

To circumvent this problem, we calculate all the values for K corresponding to D above 0.75 directly using the Dayhoff model and store these in an internal table. This table gives values of K for all D between 0.75 and 0.93 in intervals of 0.001 i.e. for D = 0.750, 0.751, 0.752 ...... 0.929, 0.930. For any observed D higher than 0.930, we arbitrarily set K to 10.0. This sounds drastic but with real sequences, distances of 0.93 (less than 7% identity) are rare. If your data set includes sequences with this degree of divergence, you will have great difficulty getting accurate trees by ANY method ; the alignment itself will be very difficult (to construct and to evaluate).

Neighbor Joining trees

VERY briefly, the NJ method works as follows. You start by placing the sequences in a star topology (no internal branches). You then find that internal branch (take 2 sequences ; join them ; connect them to the rest by the internal branch) which when added to the tree will minimise the total branch length. The two joined sequences (neighbours) are merged into a single sequence and the process is repeated. For an unrooted tree with N sequences, there are N-3 internal branches. The above process is repeated N-3 times to give the final tree. The full details are given in : Saitou, N. and Nei, M. (1987) The neighbor-joining method : a new method for reconstructing phylogenetic trees. Mol.Biol.Evol. 4, 406-425.

Bootstrapping

In the case of sequence alignments, bootstrapping involves taking random samples of positions from the alignment. If the alignment has N positions, each bootstrap sample consists of a random sample of N positions, taken WITH REPLACEMENT i.e. in any given sample, some sites may be sampled several times, others not at all. Then, with each sample of sites, you calculate a distance matrix as usual and draw a tree. If the data very strongly support just one tree then the sample trees will be very similar to each other and to the original tree, drawn without bootstrapping. However, if parts of the tree are not well supported, then the sample trees will vary considerably in how they represent these parts.

In practice, you should use a very large number of bootstrap replicates (1000 is recommended, even if if this takes some time) For each grouping on the tree, you record the number of times this grouping occurs in the sample trees. For a group to be considered "significant" at the 95% level (or P <= 0.05 in statistical terms) you expect the grouping to show up in >= 95% of the sample trees. If this happens, then you can say that the grouping is significant, given the data set and the method used to draw the tree.

So, when you use the bootstrap option, a NJ tree is drawn as before and then you are asked to say how many bootstrap samples you want (1000 is the default) and you are asked to give a seed number for the random number generator. If you give the same seed number in future, you will get the same results (we hope). Remember to give different seed numbers if you wish to carry out genuinely different bootstrap sampling experiments.

For an unrooted tree with N sequences, there are actually only N-3 genuinely different groupings that we can test (this is the number of "internal branches" ; each internal branch splits the sequences into 2 groups). In this example, we have 6 sequences with 3 internal branches in the reference tree. In the bootstrap resampling, we count how often each of these internal branches occur. Here, we find that the branch which splits 1,2,3 and 4 versus 5 and 6 occurs in all 1000 samples ; the branch which splits 2,3 and 4 versus 1,5 and 6 occurs in 1000 ; the branch which splits 2 and 3 versus 1,4,5 and 6 occurs in 812/1000 samples. We can put these figures on to the diagrammatic representation we made earlier of our unrooted NJ tree as follows :


                          SEQ 1       SEQ 4
                           I           I
                           I           I
  SEQ 6 ----------I        I           I          I--------- SEQ 2
                  I  1000  I   1000    I   812    I
                  I--------I-----------I----------I
                  I                               I
  SEQ 5 ----------I                               I--------- SEQ 3
You can equally put these confidence figures on the rooted tree (in fact the interpretation is simpler with rooted trees). With the unrooted tree, the grouping of sequence 5 with 6 is significant (as is the grouping of sequences 1,2,3 and 4). Equally the grouping of sequences 1,5 and 6 is significant (the same as saying that 2,3 and 4 group significantly). However, the grouping of 2 and 3 is not significant, although it is relatively strongly supported. Unfortunately, there is a small complication in the interpretation of these results. In statistical hypothesis testing, it is not valid to make multiple simultaneous tests and to treat the result of each test completely independantly. In the above case, if you have one particular test (grouping) that you wish to make in advance, it is valid to test IT ALONE and to simply show the other bootstrap figures for reference. If you do not have any particular test in mind before you do the bootstrapping, you can just show all of the figures and use the 95% level as an ARBITRARY cut off to show those groups that are very strongly supported ; but do not mention anything about SIGNIFICANCE testing. In the literature, it is common practice to simply show the figures with a tree ; they frequently speak for themselves.

Usage

Here is a sample session with clustalnj

> clustalnj
Neighbor-Joining phylogenetic tree from multiple alignment
Input sequence(s): ADH.msf
Phylogenetic tree in nested parentheses format [yahk_ecoli.ph]: ADH.ph

Go to the input files for this example
Go to the output files for this example

Command line arguments

   Standard (Mandatory) qualifiers (* if not always prompted):
  [-seqs]              seqall     (Gapped) sequence(s) filename and optional
                                  format, or reference (input USA)
  [-treefile]          outfile    [*.clustalnj] Phylogenetic tree in nested
                                  parentheses format
*  -outfile            outfile    [*.clustalnj] Second output file

   Additional (Optional) qualifiers (* if not always prompted):
   -ignoregaps         boolean    Ignore columns with gaps
   -kimura             boolean    Correct for multiple substitutions according
                                  to Kimura
   -bootstrap          toggle     Bootstrap the data
*  -trials             integer    [1000] Number of trials for bootstap
                                  (Integer from 1 to 10000)
*  -seed               integer    [111] Random number generator seed for
                                  bootstrap (Integer from 1 to 1000)
   -secout             menu       [0] Write second output file (Values: 0
                                  (none); C (CLUSTAL report); N (phylogenetic
                                  tree in NEXUS format))

   Advanced (Unprompted) qualifiers:
   -labelnodes         boolean    Write bootstrap labels on nodes instead of
                                  branches
   -[no]njplot         boolean    [Y] Open treefile using NJplot

   Associated qualifiers:

   "-seqs" associated qualifiers
   -sbegin1            integer    Start of each sequence to be used
   -send1              integer    End of each sequence to be used
   -sreverse1          boolean    Reverse (if DNA)
   -sask1              boolean    Ask for begin/end/reverse
   -snucleotide1       boolean    Sequence is nucleotide
   -sprotein1          boolean    Sequence is protein
   -slower1            boolean    Make lower case
   -supper1            boolean    Make upper case
   -sformat1           string     Input sequence format
   -sdbname1           string     Database name
   -sid1               string     Entryname
   -ufo1               string     UFO features
   -fformat1           string     Features format
   -fopenfile1         string     Features file name

   "-treefile" associated qualifiers
   -odirectory2        string     Output directory

   "-outfile" associated qualifiers
   -odirectory         string     Output directory

   General qualifiers:
   -auto               boolean    Turn off prompts
   -stdout             boolean    Write standard output
   -filter             boolean    Read standard input, write standard output
   -options            boolean    Prompt for standard and additional values
   -debug              boolean    Write debug output to program.dbg
   -verbose            boolean    Report some/full command line options
   -help               boolean    Report command line options. More
                                  information on associated and general
                                  qualifiers can be found with -help -verbose
   -warning            boolean    Report warnings
   -error              boolean    Report errors
   -fatal              boolean    Report fatal errors
   -die                boolean    Report dying program messages

Standard (Mandatory) qualifiers Allowed values Default
[-seqs]
(Parameter 1)
(Gapped) sequence(s) filename and optional format, or reference (input USA) Readable sequence(s) Required
[-treefile]
(Parameter 2)
Phylogenetic tree in nested parentheses format Output file <sequence>.<extension>
-outfile Second output file Output file <sequence>.<extension>
Additional (Optional) qualifiers Allowed values Default
-ignoregaps Ignore columns with gaps Boolean value Yes/No No
-kimura Correct for multiple substitutions according to Kimura Boolean value Yes/No No
-bootstrap Bootstrap the data Toggle value Yes/No No
-trials Number of trials for bootstap Integer from 1 to 10000 1000
-seed Random number generator seed for bootstrap Integer from 1 to 1000 111
-secout Write second output file
0 (none)
C (CLUSTAL report)
N (phylogenetic tree in NEXUS format)
0
Advanced (Unprompted) qualifiers Allowed values Default
-labelnodes Write bootstrap labels on nodes instead of branches Boolean value Yes/No No
-[no]njplot Open treefile using NJplot Boolean value Yes/No Yes

Input file format

clustalnj reads any normal sequence USAs. There should be at least 3 sequences and they should be aligned already by the inclusion of gap symbols at appropriate positions.

Output file format

clustalnj by default produces one output file, with a phylogenetic tree in "nested parentheses" format. This is taken as input by a lot of software, like NJplot, TreeView or PHYLIP.

You can request a second output file, either with a CLUSTAL report (with information about the paiwise distances and the building of the tree, in human-readable format) or with a phylogenetic tree in NEXUS format. This format is used by several popular phylogeny programs, including PAUP and MacClade. It is described fully in : Maddison, D. R., D. L. Swofford and W. P. Maddison. 1997. NEXUS : an extensible file format for systematic information. Systematic Biology 46:590-621.

By default, the bootstrap values are correctly placed on the tree branches of the phylip format output tree. With -labelnodes you can request them to be placed instead on the nodes, which is incorrect, but some display packages (e.g. TreeTool, TreeView and Phylowin) only support node labelling but not branch labelling. Care should be taken to note which branches and labels go together.

Output files for usage example

File: ADH.ph

(
(
YAHK_ECOLI:0.24338,
ADHC_MYCTU:0.23046)
:0.03188,
(
CADH2_EUCGU:0.26331,
MTDH2_ARATH:0.23246)
:0.02635,
(
ADH7_YEAST:0.34516,
ADH2_BACST:0.34920)
:0.06217);

Supplementry output file: ADH.tre

#NEXUS

BEGIN TREES;

	TRANSLATE
		1	YAHK_ECOLI,
		2	ADHC_MYCTU,
		3	CADH2_EUCGU,
		4	MTDH2_ARATH,
		5	ADH7_YEAST,
		6	ADH2_BACST
		;
	UTREE PAUP_1= ((1:0.24338,2:0.23046):0.03188,(3:0.26331,4:0.23246):0.02635,(5:0.34516,6:0.34920):0.06217);
ENDBLOCK;

Supplementry output file: ADH.nj



 DIST   = percentage divergence (/100)
 Length = number of sites used in comparison

   1 vs.   2  DIST = 0.4738;  length =    344
   1 vs.   3  DIST = 0.5954;  length =    346
   1 vs.   4  DIST = 0.5202;  length =    346
   1 vs.   5  DIST = 0.6763;  length =    346
   1 vs.   6  DIST = 0.6597;  length =    335
   2 vs.   3  DIST = 0.5449;  length =    345
   2 vs.   4  DIST = 0.5116;  length =    344
   2 vs.   5  DIST = 0.6850;  length =    346
   2 vs.   6  DIST = 0.6916;  length =    334
   3 vs.   4  DIST = 0.4958;  length =    355
   3 vs.   5  DIST = 0.6771;  length =    353
   3 vs.   6  DIST = 0.6976;  length =    334
   4 vs.   5  DIST = 0.6771;  length =    353
   4 vs.   6  DIST = 0.6826;  length =    334
   5 vs.   6  DIST = 0.6944;  length =    337


			Neighbor-joining Method

 Saitou, N. and Nei, M. (1987) The Neighbor-joining Method:
 A New Method for Reconstructing Phylogenetic Trees.
 Mol. Biol. Evol., 4(4), 406-425


 This is an UNROOTED tree

 Numbers in parentheses are branch lengths


 Cycle   1     =  SEQ:   5 (  0.34516) joins  SEQ:   6 (  0.34920)

 Cycle   2     =  SEQ:   1 (  0.24338) joins  SEQ:   2 (  0.23046)

 Cycle   3     =  SEQ:   3 (  0.26331) joins  SEQ:   4 (  0.23246)

 Cycle   4 (Last cycle, trichotomy):

		 Node:   1 (  0.03188) joins
		 Node:   3 (  0.02635) joins
		 Node:   5 (  0.06217) 

Data files

None.

Notes

At the BEN site you can find besides clustalnj also an EMBOSS "wrapper" program clustal, which interfaces the alignment function of CLUSTAL. You can also access the original software in a terminal session with the commands clustalw (pure command line) or clustalx (X-Window).

To visualize the tree you can find at the BEN site the NJplot program of Manolo Gouy, which you can use under X-Window. There are two versions : njplot_unrooted displays the tree unrooted and allows to write the picture as a PostScript file. njplot displays the tree rooted and allows to manipulate the tree (swap branches, change outgroup), as well as save the altered tree and save the picture as a PostScript file. You can also find the programs drawgram and drawtree of the original PHYLIP package. They allow to preview the tree on the screen (Tektronix or X-Window) and to save the picture in a variety of formats.

References

Thompson, J.D., Higgins, D.G. and Gibson, T.J. (1994)
CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, positions-specific gap penalties and weight matrix choice. Nucleic Acids Research, 22:4673-4680.

Warnings

Remember that clustalnj needs as input already aligned sequences. If you give as input sequences of uneven length the program will run without warning message, but will yield a meaningless tree.

Note that the NJ method is not sensitive to changes in the rate of evolution but, precisely because of this, does not pretend to find the root of the tree. The tree must be considered UNROOTED, even if you can "open" it with a program that presents the tree as rooted.

Diagnostic Error Messages

Error messages from clustalw itself are displayed. An important error message, which can appear when you use the -kimura option is :

WARNING: N of the distances out of a total of M
 were out of range for the distance correction.

Exit status

It always exits with status 0.

Known bugs

None.

See also

Program nameDescription
distmat Creates a distance matrix from multiple alignments
fdnacomp DNA compatibility algorithm
fdnadist Nucleic acid sequence Distance Matrix program
fdnainvar Nucleic acid sequence Invariants method
fdnaml Estimates nucleotide phylogeny by maximum likelihood
fdnamlk Estimates nucleotide phylogeny by maximum likelihood
fdnamove Interactive DNA parsimony
fdnapars DNA parsimony algorithm
fdnapenny Penny algorithm for DNA
fdolmove Interactive Dollo or Polymorphism Parsimony
fproml Protein phylogeny by maximum likelihood
fpromlk Protein phylogeny by maximum likelihood
fprotdist Protein distance algorithm
fprotpars Protein parsimony algorithm
frestboot Bootstrapped restriction sites algorithm
frestdist Distance matrix from restriction sites or fragments
frestml Restriction site maximum Likelihood method
fseqboot Bootstrapped sequences algorithm
modelgenerator Selects optimal model for Maximum Likelihood phylogenetic tree from multiple alignment
mrbayesseqtree Bayes phylogenetic tree from multiple alignment
phyml Heuristic Maximum Likelihood phylogenetic tree from multiple alignment
clustal Global multiple alignment of sequences
fneighbor Phylogenies from distance matrix by N-J or UPGMA method

Author(s)

The wrapper application clustalnj was written by Guy Bottu (gbottu@vub.ac.be)
BEN, ULB, Brussels, Belgium

The program clustalw itself was written by
Julie Thompson (Thompson@EMBL-Heidelberg.DE)
Toby Gibson (Gibson@EMBL-Heidelberg.DE)
European Molecular Biology Laboratory, Meyerhofstrasse 1, D 69117 Heidelberg, Germany

Des Higgins (Higgins@ucc.ie)
University of County Cork, Cork, Ireland

History

Completed 9 May 2003

Target users

This program is intended to be used by everyone and everything, from naive users to embedded scripts.