clustal

 

Function

Global multiple alignment of sequences

Description

clustal is an EMBOSS "wrapper" program for the program CLUSTAL (clustalw) of Des Higgins. It takes as input nucleic acid or protein sequences that have a reasonable degree of similarity over their whole length and produces as output a multiple sequence alignment.

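In brief, the multiple alignment is carried out in 3 stages: first, all pairs of sequences are aligned separately in order to calculate a distance matrix; second, a "guide tree" is calculated from the distance matrix; third, the sequences are progressively aligned following the branching order of the "guide tree".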

Instead of letting the program make pairwise alignments, you can provide an existing tree and have the program use it as the "guide tree" (options -usertree -usertreefile=<file with "guide tree">; see Output file format for the format). You could, for example, use a tree file generated by a previous run of clustal and modified with a tree editor such as njplot.

To increase the chance of finding the correct alignment, clustal uses a number of heuristics, especially for proteins. For more details, see below.

Algorithm

Slow-Accurate Pairwise sequence alignment

By default, pairwise alignments are made by the Needleman-Wunsch global alignment algorithm, or rather, by the memory-efficient variant of Myers and Miller. The user can choose a symbol comparison matrix (see Data files), a gap penalty and a gap length penalty. The penalty for a gap of n positions is of type <Gap penalty> + <Gap length penalty> * n.
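As a small illustration of the gap penalty formula above, the following Python sketch computes the penalty for a single gap of n positions. The function name is hypothetical, and the default values are simply the protein defaults for pairwise alignment listed under Command line arguments; this is not code taken from CLUSTAL itself.

```python
def gap_cost(n, gap_penalty=10.0, gap_length_penalty=0.1):
    """Penalty for one gap of n positions: <Gap penalty> + <Gap length penalty> * n."""
    if n < 1:
        raise ValueError("a gap must span at least one position")
    return gap_penalty + gap_length_penalty * n


# Compare one gap of 5 positions with five separate gaps of 1 position each.
one_long_gap = gap_cost(5)
five_short_gaps = 5 * gap_cost(1)
print(one_long_gap, five_short_gaps)
```

With penalties of this type, one long gap costs much less than several short gaps covering the same number of positions.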

Fast-approximate Pairwise sequence alignment

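Alternatively, the user can request that the pairwise alignments be made by the faster, approximate Wilbur-Lipman algorithm (-pwa=fast). Its behaviour is controlled by the ktup size (-ktuple), the number of best diagonals to consider (-topdiags), the window size around the best diagonals (-window) and the penalty for joining different diagonals (-joinw); see Command line arguments.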

Multiple sequence alignment

This is done by a modification of the Needleman-Wunsch algorithm. The user can choose a symbol comparison matrix (see Data files), a gap penalty and a gap length penalty, which may differ from those used for making the pairwise alignments.

For nucleic acids, the scores for transitions (A<-->G or C<-->T, i.e. purine-purine or pyrimidine-pyrimidine substitutions) in the nucleotide comparison matrix are multiplied by a Transition Weight between 0 and 1. A weight of zero means that transitions are scored as mismatches, while a weight of 1 gives transitions the match score. For distantly related nucleic acid sequences the weight should be close to zero; for closely related sequences it can be useful to assign a higher weight. The default is 0.5.
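The effect of the Transition Weight can be sketched in Python as follows. This is only an illustration under stated assumptions: it handles the four unambiguous bases, it assumes the 1.9/0.0 match/mismatch values of the IUB matrix, and it interpolates linearly between the mismatch and the match score, which reproduces the two end points described above (weight 0 and weight 1). The function names are hypothetical.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(a, b):
    """True for A<->G or C<->T substitutions."""
    return a != b and ({a, b} <= PURINES or {a, b} <= PYRIMIDINES)


def base_score(a, b, transition_weight=0.5, match=1.9, mismatch=0.0):
    """Score a pair of unambiguous bases (assumed match/mismatch values)."""
    if a == b:
        return match
    if is_transition(a, b):
        # Weight 0 gives the mismatch score, weight 1 gives the match score.
        return mismatch + transition_weight * (match - mismatch)
    return mismatch


for pair in ("AA", "AG", "CT", "AC"):
    print(pair, base_score(pair[0], pair[1]))
```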

For proteins different amino acid comparison matrices are used depending on the mean percent identity of the sequences to be aligned. Although the input matrix can contain positive as well as negative values, the values are rescaled to all positive unless you switch this off with -norescale (this sometimes gives better results if the sequences are of uneven length).

The default or user-selected gap penalties are not used as such, but are adapted. It has been shown that varying the gap penalties used with different weight matrices can improve the accuracy of sequence alignments. The average score for two mismatched residues is used as a scaling factor. Furthermore, the percent identity of the two (groups of) sequences to be aligned is used to increase the gap penalty for closely related sequences and decrease it for more divergent sequences. Also, the scores for both true and false sequence alignments grow with the length of the sequences, so the logarithm of the length of the shorter sequence is used to increase the gap penalty:
<corrected gap penalty> = (<gap penalty> + log(min(N,M))) * <average residue mismatch score> * <percent identity scaling factor>
where N and M are the lengths of the two sequences.
The gap length penalty is also modified depending on the difference between the lengths of the two sequences to be aligned. If one sequence is much shorter than the other, it is increased to inhibit too many long gaps in the shorter sequence:
<corrected gap length penalty> = <gap length penalty> * (1.0 + |log(N/M)|)
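The two corrections above can be written out as a short Python sketch. It rests on assumptions that the text does not settle: the logarithm is taken to be the natural logarithm, and the average residue mismatch score and the percent identity scaling factor are passed in by the caller because their calculation is not described here. The function names are hypothetical.

```python
import math


def corrected_gap_penalty(gap_penalty, n, m, avg_mismatch_score, identity_scaling):
    """(<gap penalty> + log(min(N,M))) * <avg mismatch score> * <identity scaling>."""
    return (gap_penalty + math.log(min(n, m))) * avg_mismatch_score * identity_scaling


def corrected_gap_length_penalty(gap_length_penalty, n, m):
    """<gap length penalty> * (1.0 + |log(N/M)|)."""
    return gap_length_penalty * (1.0 + abs(math.log(n / m)))


# Two sequences of equal length: the gap length penalty is unchanged.
print(corrected_gap_length_penalty(0.2, 300, 300))
# One sequence four times longer than the other: the penalty increases.
print(corrected_gap_length_penalty(0.2, 75, 300))
```

Because of the absolute value, the correction to the gap length penalty does not depend on which of the two sequences is the shorter one.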

Terminal gaps

In an earlier version of the program, terminal gaps were penalised the same as all other gaps. This caused some ugly side effects e.g.
acgtacgtacgtacgt                              acgtacgtacgtacgt
a----cgtacgtacgt  gets the same score as      ----acgtacgtacgt
Now, terminal gaps are free. This is better on average and stops silly effects like single residues jumping to the edge of the alignment. However, it is not perfect. It does mean that if there should be a gap near the end of the alignment, the program may be reluctant to insert it, e.g.
cccccgggccccc                                              cccccgggccccc
ccccc---ccccc  may be considered worse (lower score) than  cccccccccc---
In the right-hand case above, the terminal gap is free and may score higher than the left-hand alignment. This can be prevented by lowering the gap and gap length penalties. It is difficult to get this right all the time. Please watch the ends of your alignments.
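The two examples in this section can be reproduced with a toy scoring function. The Python sketch below is an illustration only: it scores a gapped sequence against an ungapped reference with an assumed match score of 1, mismatch score of 0 and the default protein gap penalties, and it is not the scoring code used by CLUSTAL. The function name and parameters are hypothetical.

```python
def alignment_score(ref, seq, match=1.0, mismatch=0.0,
                    gap_open=10.0, gap_extend=0.1, free_end_gaps=True):
    """Score `seq` (with '-' gaps, at least one residue) against ungapped `ref`."""
    if len(ref) != len(seq):
        raise ValueError("aligned rows must have the same length")
    residue_columns = [i for i, c in enumerate(seq) if c != "-"]
    first, last = residue_columns[0], residue_columns[-1]
    score = 0.0
    i = 0
    while i < len(seq):
        if seq[i] != "-":
            score += match if seq[i] == ref[i] else mismatch
            i += 1
            continue
        # Find the end of this run of gap characters.
        j = i
        while j < len(seq) and seq[j] == "-":
            j += 1
        terminal = i < first or i > last
        if not (terminal and free_end_gaps):
            score -= gap_open + gap_extend * (j - i)
        i = j
    return score


ref = "acgtacgtacgtacgt"
for free in (False, True):
    print(free,
          alignment_score(ref, "a----cgtacgtacgt", free_end_gaps=free),
          alignment_score(ref, "----acgtacgtacgt", free_end_gaps=free))

ref2 = "cccccgggccccc"
print(alignment_score(ref2, "ccccc---ccccc"), alignment_score(ref2, "cccccccccc---"))
```

When end gaps are penalised, the first two alignments score the same; with free end gaps the second one wins. In the last pair, under these toy parameters the alignment with the free terminal gap scores higher than the one with the internal gap, which is the side effect described above.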

Sequence weighting and divergent sequence handling

Sequence weights are calculated directly from the guide tree. The weights are normalised such that the biggest one is set to 1.0 and the rest are all less than one. Groups of closely related sequences receive lowered weights because they contain much duplicated information. Highly divergent sequences without any close relatives receive high weights. These weights are used as simple multiplication factors for scoring positions from different sequences or prealigned groups of sequences.
Note that the sequence weights are written to the output in some sequence file formats (such as GCG MSF). Some programs (EMBOSS or others) make use of these weights; see the documentation of those programs.

The alignment of the most distantly related sequences is delayed until after the most closely related sequences have been aligned. The Max. % identity required to delay the addition of a sequence can be set (-delay); sequences that are less identical than this level to any other sequence will be aligned later.
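The delay rule can be sketched in Python as follows. The sketch reads the rule as "a sequence is delayed if its best percent identity to any other sequence is below the threshold"; that reading, the function name and the example identities are assumptions made for illustration, not details taken from CLUSTAL.

```python
def delayed_sequences(identity, delay=30):
    """Return the names whose best % identity to any other sequence is below `delay`.

    `identity` maps frozenset({name_a, name_b}) to a percent identity.
    """
    best = {}
    for pair, percent in identity.items():
        for name in pair:
            best[name] = max(best.get(name, 0.0), percent)
    return sorted(name for name, percent in best.items() if percent < delay)


# Hypothetical pairwise identities for three sequences.
identity = {
    frozenset({"seqA", "seqB"}): 80.0,
    frozenset({"seqA", "seqC"}): 20.0,
    frozenset({"seqB", "seqC"}): 25.0,
}
print(delayed_sequences(identity, delay=30))
```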

Protein specific gap parameters

Amino acid specific gap penalties reduce or increase the gap opening penalties at each position in the alignment. As an example, positions that are rich in glycine are more likely to have an adjacent gap than positions that are rich in valine.

Hydrophilic gap penalties are used to increase the chances of a gap within a run (5 or more residues) of hydrophilic amino acids; these are likely to be loop or random coil regions where gaps are more common. The residues that are "considered" to be hydrophilic can be entered as -hgapresidues=GPSNDQEKR.
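To see which stretches of a protein would count as hydrophilic runs under this rule, one can use a sketch like the following. It only locates runs of 5 or more residues from the -hgapresidues set; how the gap penalties are then lowered is not shown. The function name and the test sequence are hypothetical.

```python
import re


def hydrophilic_runs(sequence, residues="GPSNDQEKR", min_length=5):
    """Return (start, end) spans, 0-based and half-open, of hydrophilic runs."""
    pattern = re.compile("[%s]{%d,}" % (re.escape(residues), min_length))
    return [match.span() for match in pattern.finditer(sequence.upper())]


# "GPSND" is a run of 5 hydrophilic residues; "KR" at the end is too short.
print(hydrophilic_runs("AAGPSNDAAKR"))
```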

The Gap Separation Distance tries to decrease the chances of gaps being too close to each other. Gaps that are less than this distance apart are penalised more than other gaps. This does not prevent close gaps; it makes them less frequent, promoting a block-like appearance of the alignment. By default, end gaps are ignored for this purpose, but you can request that end gaps be treated just like internal gaps (-endgaps).

Usage

Here is a sample session with clustal:

> clustal
Global multiple alignment of sequences
Input sequence(s): list::ADH.list
Multiple sequence alignment USA [yahk_ecoli.fasta]: msf::ADH.msf
Guide tree output filename [yahk_ecoli.dnd]: ADH.dnd
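For use in scripts, the same run can be started without prompts. The Python sketch below only builds the argument list, using qualifiers listed under Command line arguments (-seqs, -outseqs, -treefile, -auto and optional ones such as -gappenalty); the helper name is hypothetical, and actually running the command assumes that the clustal wrapper is installed and on the PATH.

```python
def clustal_command(seqs, outseqs, treefile, **qualifiers):
    """Build an argument list for a non-interactive clustal run."""
    command = ["clustal", "-seqs", seqs, "-outseqs", outseqs,
               "-treefile", treefile, "-auto"]
    for name, value in qualifiers.items():
        command += ["-" + name, str(value)]
    return command


cmd = clustal_command("list::ADH.list", "msf::ADH.msf", "ADH.dnd",
                      gappenalty=10.0, outorder="input")
print(" ".join(cmd))

# To run it (only where the wrapper is installed):
# import subprocess
# subprocess.run(cmd, check=True)
```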

Command line arguments

   Standard (Mandatory) qualifiers (* if not always prompted):
  [-seqs]              seqall     Sequence(s) filename and optional format, or
                                  reference (input USA)
*  -usertreefile       infile     User provided guide tree. Only required if
                                  you put -usertree
  [-outseqs]           seqoutall  [<sequence>.<format>] Multiple sequence
                                  alignment USA
*  -treefile           outfile    [*.clustal] Guide tree output filename

   Additional (Optional) qualifiers (* if not always prompted):
*  -pwa                menu       [slow] Algorithm for pairwise alignments
                                  (Values: fast (fast - approximate
                                  (Wilbur-Lipman)); slow (slow - accurate
                                  (Needleman-Wunsch)))
*  -pwdnamatrix        menu       [IUB] Nucleotide comparison matrix for
                                  pairwise alignment (Values: IUB (1.9/0.0
                                  matrix with handling of ambiguities);
                                  CLUSTALW (1.0/0.0 matrix); own (user
                                  provided matrix))
*  -pwmatrix           menu       [Gonnet] Amino acid comparison matrix for
                                  pairwise alignment (Values: BLOSUM (BLOSUM
                                  series 80, 62, 45, 30); PAM (PAM series 20,
                                  60, 120, 350); Gonnet (Gonnet series 80,
                                  120, 160, 250 and 350); id (identity matrix
                                  10.0/0.0); own (user provided matrix))
*  -pwusermatrix       infile     User provided symbol comparison matrix (in
                                  BLAST format) for pairwise alignment
*  -pwgappenalty       float      [15.0 for nucleic, 10.0 for protein] Gap
                                  penalty for pairwise alignment (Number from
                                  0.000 to 100.000)
*  -pwgaplength        float      [6.66 for nucleic, 0.1 for protein] Gap
                                  length penalty for pairwise alignment.
                                  CLUSTAL subtracts from the similarity score
                                  for each gap a penalty of type <Gap penalty>
                                  + <Gap length penalty> * n (Number from
                                  0.000 to 10.000)
*  -ktuple             integer    [2 for nucleic, 1 for protein] Wilbur-Lipman
                                  ktup size. Decrease for sensitivity,
                                  increase for speed (Integer from 1 to 4 for
                                  nucleic, 2 for protein)
*  -topdiags           integer    [4 for nucleic, 5 for protein] Wilbur-Lipman
                                  number of best diagonals to consider
                                  (Integer from 1 to 50)
*  -window             integer    [4 for nucleic, 5 for protein] Wilbur-Lipman
                                  window size for looking at diagonals around
                                  best diagonals (Integer from 1 to 50)
*  -joinw              integer    [5 for nucleic, 3 for protein] Wilbur-Lipman
                                  penalty for joining different diagonals
                                  (Integer from 1 to 500)
*  -dnamatrix          menu       [IUB] Nucleotide comparison matrix for
                                  multiple alignment (Values: IUB (1.9/0.0
                                  matrix with handling of ambiguities);
                                  CLUSTALW (1.0/0.0 matrix); own (user
                                  provided matrix))
*  -matrix             menu       [Gonnet] Amino acid comparison matrix for
                                  multiple alignment (Values: BLOSUM (BLOSUM
                                  series 80, 62, 45, 30); PAM (PAM series 20,
                                  60, 120, 350); Gonnet (Gonnet series 80,
                                  120, 160, 250 and 350); id (identity matrix
                                  10.0/0.0); own (user provided matrix))
*  -usermatrix         infile     User provided symbol comparison matrix (in
                                  BLAST format) for multiple alignment
*  -[no]rescale        boolean    [Y] Rescale amino acid comparison matrix to
                                  all positive values or use negative values
                                  (for proteins only). Option -norescale could
                                  be useful if proteins are of very uneven
                                  length.
*  -transitionw        float      [0.5] Transition weight : proportion between
                                  score of AG or CT pair and pair of
                                  identical bases (Number from 0.000 to 1.000)
   -gappenalty         float      [15.0 for nucleic, 10.0 for protein] Gap
                                  penalty for multiple alignment (Number from
                                  0.000 to 100.000)
   -gaplength          float      [6.66 for nucleic, 0.2 for protein] Gap
                                  length penalty for multiple alignment.
                                  CLUSTAL subtracts from the similarity score
                                  for each gap a penalty of type <Gap penalty>
                                  + <Gap length penalty> * n (Number from
                                  0.000 to 10.000)
   -delay              integer    [30] Max. % identity for delay of divergent
                                  sequences (Integer from 0 to 100)
*  -[no]pgap           boolean    [Y] Use gap penalties dependant on amino
                                  acids at edge (for proteins only)
*  -[no]hgap           toggle     [Y] Use lower gap penalties in strings of at
                                  least 5 hydrophylic amino acids (for
                                  proteins only)
*  -hgapresidues       string     [GPSNDQEKR] Hydrophylic amino acids (Any
                                  string is accepted)
*  -gapdist            integer    [4] Gap Separation Distance. Use higher gap
                                  penalty for gaps separated by less than n
                                  amino acids (for proteins only) (Integer
                                  from 0 to 100)
*  -endgaps            boolean    [N] Use higher gap penalty also for gaps at
                                  ends
   -outorder           menu       [aligned] Order of sequences in output
                                  (Values: input (same as input); aligned
                                  (according to order of progressive
                                  alignment))

   Advanced (Unprompted) qualifiers:
   -usertree           toggle     [N] Use user provided guide tree

   Associated qualifiers:

   "-seqs" associated qualifiers
   -sbegin1            integer    Start of each sequence to be used
   -send1              integer    End of each sequence to be used
   -sreverse1          boolean    Reverse (if DNA)
   -sask1              boolean    Ask for begin/end/reverse
   -snucleotide1       boolean    Sequence is nucleotide
   -sprotein1          boolean    Sequence is protein
   -slower1            boolean    Make lower case
   -supper1            boolean    Make upper case
   -sformat1           string     Input sequence format
   -sdbname1           string     Database name
   -sid1               string     Entryname
   -ufo1               string     UFO features
   -fformat1           string     Features format
   -fopenfile1         string     Features file name

   "-outseqs" associated qualifiers
   -osformat2          string     Output seq format
   -osextension2       string     File name extension
   -osname2            string     Base file name
   -osdirectory2       string     Output directory
   -osdbname2          string     Database name to add
   -ossingle2          boolean    Separate file for each entry
   -oufo2              string     UFO features
   -offormat2          string     Features format
   -ofname2            string     Features file name
   -ofdirectory2       string     Output directory

   "-treefile" associated qualifiers
   -odirectory         string     Output directory

   General qualifiers:
   -auto               boolean    Turn off prompts
   -stdout             boolean    Write standard output
   -filter             boolean    Read standard input, write standard output
   -options            boolean    Prompt for standard and additional values
   -debug              boolean    Write debug output to program.dbg
   -verbose            boolean    Report some/full command line options
   -help               boolean    Report command line options. More
                                  information on associated and general
                                  qualifiers can be found with -help -verbose
   -warning            boolean    Report warnings
   -error              boolean    Report errors
   -fatal              boolean    Report fatal errors
   -die                boolean    Report dying program messages

Standard (Mandatory) qualifiers | Allowed values | Default
[-seqs] (Parameter 1) Sequence(s) filename and optional format, or reference (input USA) | Readable sequence(s) | Required
-usertreefile User provided guide tree. Only required if you put -usertree | Input file | Required
[-outseqs] (Parameter 2) Multiple sequence alignment USA | Writeable sequence(s) | <sequence>.format
-treefile Guide tree output filename | Output file | <sequence>.dnd

Additional (Optional) qualifiers | Allowed values | Default
-pwa Algorithm for pairwise alignments | fast (fast - approximate (Wilbur-Lipman)); slow (slow - accurate (Needleman-Wunsch)) | slow
-pwdnamatrix Nucleotide comparison matrix for pairwise alignment | IUB (1.9/0.0 matrix with handling of ambiguities); CLUSTALW (1.0/0.0 matrix); own (user provided matrix) | IUB
-pwmatrix Amino acid comparison matrix for pairwise alignment | BLOSUM (BLOSUM series 80, 62, 45, 30); PAM (PAM series 20, 60, 120, 350); Gonnet (Gonnet series 80, 120, 160, 250 and 350); id (identity matrix 10.0/0.0); own (user provided matrix) | Gonnet
-pwusermatrix User provided symbol comparison matrix (in BLAST format) for pairwise alignment | Input file | Required
-pwgappenalty Gap penalty for pairwise alignment | Number from 0.000 to 100.000 | 15.0 for nucleic, 10.0 for protein
-pwgaplength Gap length penalty for pairwise alignment. CLUSTAL subtracts from the similarity score for each gap a penalty of type <Gap penalty> + <Gap length penalty> * n | Number from 0.000 to 10.000 | 6.66 for nucleic, 0.1 for protein
-ktuple Wilbur-Lipman ktup size. Decrease for sensitivity, increase for speed | Integer from 1 to 4 for nucleic, 2 for protein | 2 for nucleic, 1 for protein
-topdiags Wilbur-Lipman number of best diagonals to consider | Integer from 1 to 50 | 4 for nucleic, 5 for protein
-window Wilbur-Lipman window size for looking at diagonals around best diagonals | Integer from 1 to 50 | 4 for nucleic, 5 for protein
-joinw Wilbur-Lipman penalty for joining different diagonals | Integer from 1 to 500 | 5 for nucleic, 3 for protein
-dnamatrix Nucleotide comparison matrix for multiple alignment | IUB (1.9/0.0 matrix with handling of ambiguities); CLUSTALW (1.0/0.0 matrix); own (user provided matrix) | IUB
-matrix Amino acid comparison matrix for multiple alignment | BLOSUM (BLOSUM series 80, 62, 45, 30); PAM (PAM series 20, 60, 120, 350); Gonnet (Gonnet series 80, 120, 160, 250 and 350); id (identity matrix 10.0/0.0); own (user provided matrix) | Gonnet
-usermatrix User provided symbol comparison matrix (in BLAST format) for multiple alignment | Input file | Required
-[no]rescale Rescale amino acid comparison matrix to all positive values or use negative values (for proteins only). Option -norescale could be useful if proteins are of very uneven length. | Boolean value Yes/No | Yes
-transitionw Transition weight : proportion between score of AG or CT pair and pair of identical bases | Number from 0.000 to 1.000 | 0.5
-gappenalty Gap penalty for multiple alignment | Number from 0.000 to 100.000 | 15.0 for nucleic, 10.0 for protein
-gaplength Gap length penalty for multiple alignment. CLUSTAL subtracts from the similarity score for each gap a penalty of type <Gap penalty> + <Gap length penalty> * n | Number from 0.000 to 10.000 | 6.66 for nucleic, 0.2 for protein
-delay Max. % identity for delay of divergent sequences | Integer from 0 to 100 | 30
-[no]pgap Use gap penalties dependant on amino acids at edge (for proteins only) | Boolean value Yes/No | Yes
-[no]hgap Use lower gap penalties in strings of at least 5 hydrophylic amino acids (for proteins only) | Toggle value Yes/No | Yes
-hgapresidues Hydrophylic amino acids | Any string is accepted | GPSNDQEKR
-gapdist Gap Separation Distance. Use higher gap penalty for gaps separated by less than n amino acids (for proteins only) | Integer from 0 to 100 | 4
-endgaps Use higher gap penalty also for gaps at ends | Boolean value Yes/No | No
-outorder Order of sequences in output | input (same as input); aligned (according to order of progressive alignment) | aligned

Advanced (Unprompted) qualifiers | Allowed values | Default
-usertree Use user provided guide tree | Toggle value Yes/No | No

Input file format

clustal reads any normal sequence USA. You must give at least two sequences as input. You can use proteins as well as nucleic acids, but you cannot mix them. A convenient way to provide the input sequences is to use a List File.

Input files for usage example

List File ADH.list

SW:YAHK_ECOLI
SW:ADH2_BACST
SW:ADHC_MYCTU
SW:ADH7_YEAST
SW:CADH2_EUCGU
SW:MTDH2_ARATH

Output file format

clustal produces two output files, one with a multiple sequence alignment and one with the "guide tree" used to make the alignment.

The multiple sequence alignment is written as a standard EMBOSS sequence file. By default the output format is FASTA (with '-' gap characters introduced). If you wish to visualize the alignment, it can be useful to request a format better suited for this purpose, such as MSF.
By default the sequences are written in the order determined by the "guide tree", that is, the most similar sequences are written adjacent to each other. You can request that the sequences are instead written in the same order as in the input file with -outorder=input.

The "guide tree" is written in "nested parentheses" format. This format is accepted as input by many programs, such as NJplot, TreeView or the PHYLIP programs drawgram and drawtree. Note however that you should not use this "guide tree" as a phylogenetic tree.

Output files for usage example

File: ADH.msf

!!AA_MULTIPLE_ALIGNMENT 1.0

  ADH.list MSF:  371 Type: P 04/03/05 CompCheck:  357 ..

  Name: YAHK_ECOLI  Len: 371  Check: 5277 Weight: 15.70
  Name: ADHC_MYCTU  Len: 371  Check:  819 Weight: 14.60
  Name: CADH2_EUCGU Len: 371  Check: 5012 Weight: 16.20
  Name: MTDH2_ARATH Len: 371  Check: 2155 Weight: 14.60
  Name: ADH7_YEAST  Len: 371  Check: 8445 Weight: 19.60
  Name: ADH2_BACST  Len: 371  Check: 8649 Weight: 19.10

//

           1                                               50
YAHK_ECOLI  ~~~~~~~MKIKAVGAYSAKQPLEPMDITRREPGPNDVKIEIAYCGVCHSD
ADHC_MYCTU  ~~~~~~MSTVAAYAAMSATEPLTKTTITRRDPGPHDVAIDIKFAGICHSD
CADH2_EUCGU MGSLEKERTTTGWAARDPSGVLSPYTYSLRNTGPEDLYIKVLSCGVCHSD
MTDH2_ARATH MGKVLQ.KEAFGLAAKDNSGVLSPFSFTRRETGEKDVRFKVLFCGICHSD
ADH7_YEAST  ~MLYPEKFQGIGISNAKDWKHPKLVSFDPKPFGDHDVDVEIEACGICGSD
ADH2_BACST  ~~~~~~~~~MKAAVVNEFKKALEIKEVERPKLEEGEVLVKIEACGVCHTD

           51                                             100
YAHK_ECOLI  LHQVRSEWAGT.VYPCVPGHEIVGRVVAVGDQVEK.YAPGDLVGVGCIVD
ADHC_MYCTU  IHTVKAEWGQP.NYPVVPGHEIAGVVTAVGSEVTK.YRQGDRVGVGCFVD
CADH2_EUCGU IHQIKNDLGMS.HYPMVPGHEVVGEVLEVGSEVTK.YRVGDRVGTGIVVG
MTDH2_ARATH LHMVKNEWGMS.TYPLVPGHEIVGVVTEVGAKVTK.FKTGEKVGVGCLVS
ADH7_YEAST  FHIAVGNWGPV.PENQILGHEIIGRVVKVGSKCHTGVKIGDRVGVGAQAL
ADH2_BACST  LHAAHGDWPIKPKLPLIPGHEGVGIVVEVAKGVKS.IKVGDRVGIPWLYS

           101                                            150
YAHK_ECOLI  SCKHCEECEDGLENYCDHMTG.TYNSPTPDEPGHTLGGYSQQIVVHERYV
ADHC_MYCTU  SCRECNSCTRGIEQYCKPGANFTYNSIGKDGQ.PTQGGYSEAIVVDENYV
CADH2_EUCGU CCRSCSPCNSDQEQYCNKKIW.NYNDVYTDGK.PTQGGFAGEIVVGERFV
MTDH2_ARATH SCGSCDSCTEGMENYCPKSIQ.TYGFPYYDNT.ITYGGYSDHMVCEEGFV
ADH7_YEAST  ACFECERCKSDNEQYCTNDHVLTMWTPYKDGY.ISQGGFASHVRLHEHFA
ADH2_BACST  ACGECEYCLTGQETLCPHQLN.........GGYSVDGGYAEYCKAPADYV

           151                                            200
YAHK_ECOLI  LRIRHPQEQLAAVAPLLCAGITTYSPLRHWQAG.PGKKVGVVGIGGLGHM
ADHC_MYCTU  LRIPDVLP.LDVAAPLLCAGITLYSPLRHWNAG.ANTRVAIIGLGGLGHM
CADH2_EUCGU VKIPDGLE.SEQAAPLMCAGVTVYSPLVRFGLKQSGLRGGILGLGGVGHM
MTDH2_ARATH IRIPDNLP.LDAAAPLLCAGITVYSPMKYHGLDKPGMHIGVVGLGGLGHV
ADH7_YEAST  IQIPENIP.SPLAAPLLCGGITVFSPLLRNGCG.PGKRVGIVGIGGIGHM
ADH2_BACST  AKIPDNLD.PVEVAPILCAGVTTYKALKVSGAR.PGEWVAIYGIGGLGHI

           201                                            250
YAHK_ECOLI  GIKLAHAMGAHVVAFTTSEAKR.EAAKALGADEVVNSRNADEMAAHLK..
ADHC_MYCTU  GVKLGAAMGADVTVLSQSLKKM.EDGLRLGAKSYYATADPDTFRKLRG..
CADH2_EUCGU GVKIAKAMGHHVTVISSSDKKRTEALEHLGADAYLVSSDENGMKEATD..
MTDH2_ARATH GVKFAKAMGTKVTVISTSEKKRDEAINRLGADAFLVSRDPKQIKDAMG..
ADH7_YEAST  GILLAKAMGAEVYAFSRGHSKR.EDSMKLGADHYIAMLEDKGWTEQYSNA
ADH2_BACST  ALQYAKAMGLNVVAVDISDEKS.KLAKDLGADIAINGLKEDPVKAIHDQV

           251                                            300
YAHK_ECOLI  .SFDFILNTVAAPHNLDDFTTLLKRDGTMTLVGAPATPHKSPEVFNLIMK
ADHC_MYCTU  .GFDLILNTVSANLDLGQYLNLLDVDGTLVELGIPEHPMAVP.AFALALM
CADH2_EUCGU .SLDYIFDTIPVVHPLEPYLALLKLDGKLILTGVINAPLQFI.SPMVMLG
MTDH2_ARATH .TMDGIIDTVSATHSLLPLLGLLKHKGKLVMVGAPEKPLELP.VMPLIFE
ADH7_YEAST  LDLLVVCSSSLSKVNFDSIVKIMKIGGSIVSIAAPEVNEKLV.LKPLGLM
ADH2_BACST  GGVHAAISVAVNKKAFEQAYQSVKRGGTLVVVGLPNADLPIP.IFDTVLN

           301                                            350
YAHK_ECOLI  RRAIAGSMIGGIPETQEMLDFCAEHGIVADIEMIRADQ..INEAYERMLR
ADHC_MYCTU  RRSLAGSNIGGIAETQEMLNFCAEHGVTPEIELIEPDY..INDAYERVLA
CADH2_EUCGU RKSITGSFIGSMKETEEMLEFCKEKGLTSQIEVIKMDY..VNTALERLEK
MTDH2_ARATH RKMVMGSMIGGIKETQEMIDMAGKHNITADIELISADY..VNTAMERLEK
ADH7_YEAST  GVSISSSAIGSRKEIEQLLKLVSEKNVKIWVEKLPISEEGVSHAFTRMES
ADH2_BACST  GVSVKGSIVGTRKDMQEALDFAARGKVRPIVETAELEE..INEVFERMEK

            351               371
YAHK_ECOLI  GDVKYRFVIDNRTLTD~~~~~
ADHC_MYCTU  SDVRYRFVIDISAL~~~~~~~
CADH2_EUCGU NDVRYRFVVDVVGSKLD~~~~
MTDH2_ARATH ADVRYRFVIDVANTLKPNPNL
ADH7_YEAST  GDVKYRFTLVDYDKKFHK~~~
ADH2_BACST  GKINGRIVLKLKED~~~~~~~

File: ADH.dnd

(
(
(
YAHK_ECOLI:0.25360,
ADHC_MYCTU:0.23773)
:0.02527,
(
ADH2_BACST:0.35494,
ADH7_YEAST:0.35302)
:0.05239)
:0.02435,
CADH2_EUCGU:0.26941,
MTDH2_ARATH:0.23340);
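The "nested parentheses" tree file is easy to inspect with a few lines of code. The Python sketch below extracts the leaf names and the lengths of the branches leading to them from the ADH.dnd example; it is a minimal reader written for this illustration, not a complete parser of the format (for instance, it ignores internal branches and does not handle quoted names or exponent notation). The function name is hypothetical.

```python
import re

GUIDE_TREE = """(
(
(
YAHK_ECOLI:0.25360,
ADHC_MYCTU:0.23773)
:0.02527,
(
ADH2_BACST:0.35494,
ADH7_YEAST:0.35302)
:0.05239)
:0.02435,
CADH2_EUCGU:0.26941,
MTDH2_ARATH:0.23340);"""


def leaf_branch_lengths(tree_text):
    """Map each leaf name to the length of the branch leading to it."""
    compact = "".join(tree_text.split())
    # A leaf is a name directly preceded by '(' or ',' and followed by ':length'.
    pairs = re.findall(r"[(,]([^(),:;]+):([0-9.]+)", compact)
    return {name: float(length) for name, length in pairs}


for name, length in leaf_branch_lengths(GUIDE_TREE).items():
    print(name, length)
```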

Data files

clustal uses symbol comparison matrices for scoring bases or amino acids. CLUSTAL has built-in symbol comparison matrices, but allows you to provide your own matrix. For proteins, but not for nucleic acids, you can give a series of matrices as input. You can choose different matrices for pairwise alignment and for multiple alignment.

Single matrix input file

The format used for a single matrix is the same as that used by the BLAST program. The scores in the new weight matrix should be similarities. You can use negative as well as positive values if you wish, although for proteins the matrix will be automatically adjusted to all positive scores, unless the -norescale option is selected. Any lines beginning with a # character are assumed to be comments. The first non-comment line should contain a list of bases or amino acids in any order, using the 1 letter code, followed by a * character. This should be followed by a square matrix of scores, with one row and one column for each base or amino acid. The last row and column of the matrix (corresponding to the * character) contain the minimum score over the whole matrix.

#  Matrix made by matblas from blosum62.iij
#  * column uses minimum score
#  BLOSUM Clustered Scoring Matrix in 1/2 Bit Units
#  Blocks Database = /data/blocks_5.0/blocks.dat
#  Cluster Percentage: >= 62
#  Entropy =   0.6979, Expected =  -0.5209
   A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V  B  Z  X  *
A  4 -1 -2 -2  0 -1 -1  0 -2 -1 -1 -1 -1 -2 -1  1  0 -3 -2  0 -2 -1  0 -4 
R -1  5  0 -2 -3  1  0 -2  0 -3 -2  2 -1 -3 -2 -1 -1 -3 -2 -3 -1  0 -1 -4 
N -2  0  6  1 -3  0  0  0  1 -3 -3  0 -2 -3 -2  1  0 -4 -2 -3  3  0 -1 -4 
D -2 -2  1  6 -3  0  2 -1 -1 -3 -4 -1 -3 -3 -1  0 -1 -4 -3 -3  4  1 -1 -4 
C  0 -3 -3 -3  9 -3 -4 -3 -3 -1 -1 -3 -1 -2 -3 -1 -1 -2 -2 -1 -3 -3 -2 -4 
Q -1  1  0  0 -3  5  2 -2  0 -3 -2  1  0 -3 -1  0 -1 -2 -1 -2  0  3 -1 -4 
E -1  0  0  2 -4  2  5 -2  0 -3 -3  1 -2 -3 -1  0 -1 -3 -2 -2  1  4 -1 -4 
G  0 -2  0 -1 -3 -2 -2  6 -2 -4 -4 -2 -3 -3 -2  0 -2 -2 -3 -3 -1 -2 -1 -4 
H -2  0  1 -1 -3  0  0 -2  8 -3 -3 -1 -2 -1 -2 -1 -2 -2  2 -3  0  0 -1 -4 
I -1 -3 -3 -3 -1 -3 -3 -4 -3  4  2 -3  1  0 -3 -2 -1 -3 -1  3 -3 -3 -1 -4 
L -1 -2 -3 -4 -1 -2 -3 -4 -3  2  4 -2  2  0 -3 -2 -1 -2 -1  1 -4 -3 -1 -4 
K -1  2  0 -1 -3  1  1 -2 -1 -3 -2  5 -1 -3 -1  0 -1 -3 -2 -2  0  1 -1 -4 
M -1 -1 -2 -3 -1  0 -2 -3 -2  1  2 -1  5  0 -2 -1 -1 -1 -1  1 -3 -1 -1 -4 
F -2 -3 -3 -3 -2 -3 -3 -3 -1  0  0 -3  0  6 -4 -2 -2  1  3 -1 -3 -3 -1 -4 
P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4  7 -1 -1 -4 -3 -2 -2 -1 -2 -4 
S  1 -1  1  0 -1  0  0  0 -1 -2 -2  0 -1 -2 -1  4  1 -3 -2 -2  0  0  0 -4 
T  0 -1  0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1  1  5 -2 -2  0 -1 -1  0 -4 
W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1  1 -4 -3 -2 11  2 -3 -4 -3 -2 -4 
Y -2 -2 -2 -3 -2 -1 -2 -3  2 -1 -1 -2 -1  3 -3 -2 -2  2  7 -1 -3 -2 -1 -4 
V  0 -3 -3 -3 -1 -2 -2 -3 -3  3  1 -2  1 -1 -2 -2  0 -3 -1  4 -3 -2 -1 -4 
B -2 -1  3  4 -3  0  1 -1  0 -3 -4  0 -3 -3 -2  0 -1 -4 -3 -3  4  1 -1 -4 
Z -1  0  0  1 -3  3  4 -2  0 -3 -3  1 -1 -3 -1  0 -1 -3 -2 -2  1  4 -1 -4 
X  0 -1 -1 -1 -2 -1 -1 -1 -1 -1 -1 -1 -1 -1 -2  0  0 -2 -1 -1 -1 -1 -1 -4 
* -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4  1 
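Before passing your own matrix to -usermatrix or -pwusermatrix, it can help to check that the file follows the layout described above. The Python sketch below reads such a file into a dictionary; it is an illustration written for this document, not the reader used by CLUSTAL, and the small nucleotide matrix is a made-up example. Unlike the description above, it also accepts comment lines with leading blanks.

```python
def parse_blast_matrix(text):
    """Parse a BLAST-format matrix into {(row_symbol, column_symbol): score}."""
    rows = [line.split() for line in text.splitlines()
            if line.strip() and not line.lstrip().startswith("#")]
    symbols = rows[0]  # first non-comment line: the column symbols
    scores = {}
    for row in rows[1:]:
        row_symbol, values = row[0], row[1:]
        if len(values) != len(symbols):
            raise ValueError("row %r does not match the header" % row_symbol)
        for column_symbol, value in zip(symbols, values):
            scores[(row_symbol, column_symbol)] = float(value)
    return scores


SAMPLE = """# tiny hypothetical nucleotide matrix, for illustration only
   A  C  G  T  *
A  5 -4 -4 -4 -4
C -4  5 -4 -4 -4
G -4 -4  5 -4 -4
T -4 -4 -4  5 -4
* -4 -4 -4 -4  1
"""

matrix = parse_blast_matrix(SAMPLE)
print(matrix[("A", "A")], matrix[("A", "C")], matrix[("G", "*")])
```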

Matrix series input format

For proteins, CLUSTAL by default uses different matrices depending on the mean percent identity of the sequences to be aligned. For proteins, but not for nucleic acids, you can specify your own series of matrices and the range of percent identity for each matrix in a matrix series file. The file is automatically recognised by the word CLUSTAL_SERIES at the beginning of the file. Each matrix in the series is then specified on one line, which should start with the word MATRIX. This is followed by the lower and upper limits of the sequence percent identities for which you want to apply the matrix. The final entry on the matrix line is the filename of a BLAST format matrix file (see above for details of the single matrix file format).

CLUSTAL_SERIES

MATRIX 81 100 blosum80
MATRIX 61  80 blosum62
MATRIX 31  60 blosum45
MATRIX  0  30 blosum30
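The way a matrix series file maps a percent identity to a matrix file can be sketched in Python as follows. This is an illustrative reader for the format described above, assuming integer limits that are inclusive at both ends; the function names are hypothetical and the code is not taken from CLUSTAL.

```python
def parse_matrix_series(text):
    """Return a list of (low, high, filename) entries from a matrix series file."""
    lines = [line.split() for line in text.splitlines() if line.strip()]
    if not lines or lines[0][0] != "CLUSTAL_SERIES":
        raise ValueError("not a matrix series file")
    series = []
    for fields in lines[1:]:
        if fields[0] != "MATRIX":
            continue  # ignore anything that is not a MATRIX line
        series.append((int(fields[1]), int(fields[2]), fields[3]))
    return series


def matrix_for_identity(series, percent_identity):
    """Return the matrix filename whose range contains percent_identity, or None."""
    for low, high, filename in series:
        if low <= percent_identity <= high:
            return filename
    return None


SERIES_FILE = """CLUSTAL_SERIES

MATRIX 81 100 blosum80
MATRIX 61  80 blosum62
MATRIX 31  60 blosum45
MATRIX  0  30 blosum30
"""

series = parse_matrix_series(SERIES_FILE)
print(matrix_for_identity(series, 70))
```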

Notes

Besides clustal, the BEN site also provides the EMBOSS "wrapper" program clustalnj, which interfaces the phylogeny function of CLUSTAL. You can also access the original software in a terminal session with the commands clustalw (pure command line) or clustalx (X-Window).

References

Thompson, J.D., Higgins, D.G. and Gibson, T.J. (1994)
CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Research, 22:4673-4680.

Warnings

The alignment program clustal only works well if the sequences show a reasonable degree of similarity over their whole length. If they do not, you can try a local multiple alignment program like dialign, gibbs or mkdom.

You should not use the "guide tree" as a phylogenetic tree. Give the alignment produced by clustal to a phylogeny program, e.g. clustalnj or the programs of the PHYLIP package.

No two sequences should have the same name. Only the first 30 characters of the sequence name are used. Therefore no name should be longer than 30 characters, or at least the first 30 characters should be different.
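A quick check for this restriction can be run on the sequence names before starting clustal. The Python sketch below groups names by their first 30 characters and reports the groups that would clash; the function name and the example names are hypothetical.

```python
from collections import defaultdict


def name_clashes(names, significant=30):
    """Return groups of names that share the same first `significant` characters."""
    groups = defaultdict(list)
    for name in names:
        groups[name[:significant]].append(name)
    return [group for group in groups.values() if len(group) > 1]


prefix = "a_very_long_sequence_name_0123"  # exactly 30 characters
names = [prefix + "_copy1", prefix + "_copy2", "YAHK_ECOLI"]
print(name_clashes(names))
```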

Diagnostic Error Messages

Error messages from clustalw itself are displayed.

Exit status

It always exits with status 0.

Known bugs

None.

See also

Program name  Description
edialign      Local multiple alignment of sequences
infoalign     Information on a multiple sequence alignment
mkdom         Local multiple alignment of proteins, makes file for XDOM
mse           Multiple Sequence Editor
muscle        Multiple alignment of sequences by global optimization
plotcon       Plot quality of conservation of a sequence alignment
prettyplot    Displays aligned sequences, with colouring and boxing
showalign     Displays a multiple sequence alignment
tranalign     Align nucleic coding regions given the aligned proteins
clustalnj     Neighbor-Joining phylogenetic tree from multiple alignment

Author(s)

The wrapper application clustal was written by Guy Bottu (gbottu@vub.ac.be)
BEN, ULB, Brussels, Belgium

The program clustalw itself was written by :
Julie Thompson (Thompson@EMBL-Heidelberg.DE)
Toby Gibson (Gibson@EMBL-Heidelberg.DE)
European Molecular Biology Laboratory, Meyerhofstrasse 1, D 69117 Heidelberg, Germany

Des Higgins (Higgins@ucc.ie)
University College Cork, Cork, Ireland

History

Completed 9 May 2003
Modified 18 June 2003 - sequence weights added to output
Modified 4 March 2005 - gaps automatically removed from input sequences

Target users

This program is intended to be used by everyone and everything, from naive users to embedded scripts.