MultAlin documentation
======================
(Version 5.0, 5.1, 5.2, 5.3, 5.4)

To jump to a specific section, search for "SECTION -#-", replacing the #
with the appropriate section number.

CONTENTS
========

SECTION -0- Introduction
SECTION -1- New in the last releases
        NEW in version 5.0
        NEW in version 5.1
        NEW in version 5.2
        NEW in version 5.3, 5.3.1, 5.3.2
        NEW in version 5.4
SECTION -2- Installing MultAlin
SECTION -3- Running MultAlin
        A. Cautions
        B. Command line mode
        C. Interactive mode
SECTION -4- Algorithm
        A. Similarity scores for a pair of sequences.
        B. The FAST alignments (step 0).
        C. The hierarchical clustering (step 1).
        D. The Multiple alignment (step 2).
        E. Consensus sequences and scores (step 3).
        F. Iteration.
SECTION -5- File formats
        A. Input Sequence File
        B. Output Sequence File
        C. Clustering Sequence File
        D. Score File
SECTION -6- List of the package files


SECTION -0- Introduction
========================

Welcome to MultAlin!

This software lets you align several biological sequences simultaneously
on computers running UNIX.

What is a multiple sequence alignment?
It is the arrangement of several protein or nucleic acid sequences with
postulated gaps so that similar residues are juxtaposed. A score is
attached to identities and to conservative or non-conservative
substitutions (the score measures the similarity), and a penalty is
attached to gaps. An ideal program would maximise the total score, taking
account of all possible alignments and allowing for a gap of any length
at any position. Unfortunately the computing requirements, both of time
and memory, grow as the nth power of the sequence length, where n is the
number of sequences, so this ideal alignment can be found only for two
sequences or three short sequences. In the general case, practical
programs must restrict the conditions of the optimisation. Nevertheless
it is undeniably useful to have an automatic system available for
multiple sequence alignment, to provide a starting point for further
human analysis.
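
The scoring idea described above (a score for identities and substitutions,
a penalty for gaps) can be sketched in a few lines of Python. This is only
an illustration and not MultAlin's code: the numeric values, the
CONSERVATIVE_GROUPS list and the function name aligned_pair_score are all
assumptions made for the example, and the gap penalty is charged per gap
character for simplicity.

```python
# Hypothetical values, chosen only for illustration. A real program reads
# them from a symbol comparison table and from its gap penalty settings.
IDENTITY = 2          # score for two identical residues
CONSERVATIVE = 1      # score for a conservative substitution
OTHER = 0             # score for any other substitution
GAP_PENALTY = 3       # charged for every residue facing a gap

# Hypothetical groups of residues treated as "similar" to each other.
CONSERVATIVE_GROUPS = ["DE", "KR", "ILV", "ST", "NQ"]


def aligned_pair_score(a, b, gap="."):
    """Score two sequences that are already aligned (same length)."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    score = 0
    for x, y in zip(a, b):
        if x == gap and y == gap:
            continue                      # column empty in both sequences
        if x == gap or y == gap:
            score -= GAP_PENALTY          # a residue facing a gap
        elif x == y:
            score += IDENTITY
        elif any(x in g and y in g for g in CONSERVATIVE_GROUPS):
            score += CONSERVATIVE
        else:
            score += OTHER
    return score


print(aligned_pair_score("QAPDGTDII.", "IAPDGTEIV."))  # 14
print(aligned_pair_score("QAPDGTDII.", "VDDSGTTIAG"))  # 3
```

An optimal aligner searches over all ways of placing the gaps to maximise
such a score; the sketch only evaluates one given arrangement.
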
MultAlin creates a multiple sequence alignment from a group of related
sequences using progressive pairwise alignments. The method is described
in "Multiple sequence alignment with hierarchical clustering", F. Corpet,
1988, Nucl. Acids Res. 16, 10881-10890.


SECTION -1- New in the last releases
====================================

NEW in version 5.0
------------------
Comparison tables can include negative entries.
GCG tables can be used.
Gap penalty can be length dependent.
Gaps at sequence extremities can be scored or not.

NEW in version 5.1
------------------
The number of iterations is now limited to 10 (see F. Iteration).
A bug that prevented the comparison of only two sequences has been fixed.
SCO and sco, CLU and clu are now valid extensions for score files and
clustering files.
Portability has been tested on more platforms.

NEW in version 5.2
------------------
The similarity coefficient at a position is still the mean of all pairwise
coefficients at this position, BUT only the sequences for which the
position is internal are counted.

Example:

CCPC50              QDG DAAKGEKEFN .KCKACHMI QAPDGTDII. KGGKTGPNLY
CCRF2C              ..G DAAKGEKEFN .KCKTCHSI IAPDGTEIV. KGAKTGPNLY
CCRF2S              QEG DPEAGAKAFN .QCQTCHVI VDDSGTTIAG RNAKTGPNLY
CCQF2R              .EG DAAAGEKVSK .KCLACHTF DQGGAN.... ...KVGPNLF
CCQF2P              .AG DAAVGEKIAK AKCTACHDL NKGGPI.... ...KVGPPLF
                                   |
MultAlin sequence # 245 5555555555 555555555 5555555555 5555555555
Clustalv sequence # 245 5555555555 155555555 5555553331 3335555555

I think that it is important to take the mean over all sequences, so that
new gaps are preferentially inserted at the same positions as old gaps.
But this is a problem when sequence lengths are inhomogeneous, so I have
made this modification.
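
The two "sequence #" rows of the example can be reproduced with a short
Python sketch. It assumes that a position is "internal" to a sequence when
it lies between that sequence's first and last residue, and that the
Clustalv row simply counts the residues present in each column; both
readings are inferred from the example, and the function names are
hypothetical.

```python
# The five aligned sequences of the example, with block spaces removed.
ROWS = [r.replace(" ", "") for r in [
    "QDG DAAKGEKEFN .KCKACHMI QAPDGTDII. KGGKTGPNLY",   # CCPC50
    "..G DAAKGEKEFN .KCKTCHSI IAPDGTEIV. KGAKTGPNLY",   # CCRF2C
    "QEG DPEAGAKAFN .QCQTCHVI VDDSGTTIAG RNAKTGPNLY",   # CCRF2S
    ".EG DAAAGEKVSK .KCLACHTF DQGGAN.... ...KVGPNLF",   # CCQF2R
    ".AG DAAVGEKIAK AKCTACHDL NKGGPI.... ...KVGPPLF",   # CCQF2P
]]


def internal_counts(rows, gap="."):
    """Per column, count the sequences for which the column is internal,
    i.e. located between the first and last residue of the sequence."""
    counts = [0] * len(rows[0])
    for row in rows:
        residues = [i for i, c in enumerate(row) if c != gap]
        if not residues:
            continue                      # sequence made only of gaps
        for i in range(residues[0], residues[-1] + 1):
            counts[i] += 1
    return counts


def residue_counts(rows, gap="."):
    """Per column, count the sequences that have a residue (no gap)."""
    return [sum(row[i] != gap for row in rows) for i in range(len(rows[0]))]


print("".join(map(str, internal_counts(ROWS))))
print("".join(map(str, residue_counts(ROWS))))
```

Under these assumptions the first line printed matches the MultAlin row
and the second matches the Clustalv row: internal gaps still count as
members of the column, while gaps before the first or after the last
residue of a sequence do not.
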
NEW in version 5.3, 5.3.1, 5.3.2
--------------------------------
The pairwise scores that are used to build the clustering can now be
evaluated by three different methods:
  absolute   = the score is the pairwise alignment score, using the
               current similarity table and gap penalties. This was the
               old method.
  percentage = the score is the pairwise alignment score, divided by the
               length of the shorter sequence.
  identity   = the score is the number of identical pairs, divided by the
               length of the shorter sequence.

Individual weights can be assigned to each sequence in order to
down-weight near-duplicate sequences and up-weight the most divergent
ones. They are computed using the clustering tree and normalised so that
their mean is 1.0. They are written to the output file.

It is now possible to choose the order of the sequences in the output
file: as input or as aligned.

MultAlin can be used with already aligned sequences (.mul file), only to
change the output file. When the input file has a mul extension, MultAlin
does not realign the sequences, but reads the last ma.cfg file and,
optionally, new options to write a new alignment file.

The entries of blosum62.tab have been made non-negative by adding 4 to
each entry. It is now the default table.

In release 5.3.1, bugs have been fixed and a new output format added (see
"doc Format").
In release 5.3.2, the mul format has been modified to become a standard
Fasta/Pearson format.
In all input formats, sequences can be written with lowercase letters.

NEW in version 5.4
------------------
The pairwise and alignment processes have been modified to align large
families more quickly. In the alignment process, the changes are limited
to the implementation (no theoretical change). In the pairwise process,
very similar sequences (more than 80% identity) are clustered together
without a hierarchical classification, and only one sequence of the
cluster is compared to the other sequences for the global classification.
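
The grouping of very similar sequences can be pictured with the following
Python sketch. It is not MultAlin's implementation, whose details are not
given here: the greedy strategy, the function names and the
position-by-position identity (a stand-in for a real pairwise alignment)
are all assumptions. The test data are the last ten columns of the
version 5.2 example, with gap characters removed.

```python
def naive_identity(a, b):
    """Identical pairs divided by the length of the shorter sequence.
    Assumption: residues are compared position by position, without gaps,
    instead of being taken from a true pairwise alignment."""
    shorter = min(len(a), len(b))
    same = sum(x == y for x, y in zip(a, b))
    return same / shorter


def precluster(seqs, identity=naive_identity, threshold=0.80):
    """Greedy grouping: each sequence joins the first cluster whose
    representative (its first member) exceeds the identity threshold,
    otherwise it starts a new cluster. Only the representatives would
    then go through the full pairwise comparison."""
    clusters = []
    for name, seq in seqs.items():
        for cluster in clusters:
            if identity(seqs[cluster[0]], seq) > threshold:
                cluster.append(name)
                break
        else:
            clusters.append([name])
    return clusters


SEQS = {
    "CCPC50": "KGGKTGPNLY",
    "CCRF2C": "KGAKTGPNLY",
    "CCRF2S": "RNAKTGPNLY",
    "CCQF2R": "KVGPNLF",
    "CCQF2P": "KVGPPLF",
}

print(precluster(SEQS))
# [['CCPC50', 'CCRF2C'], ['CCRF2S'], ['CCQF2R', 'CCQF2P']]
```

With five sequences, the full pairwise step needs 10 comparisons; with the
three representatives found here, only 3 remain for the global
classification.
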
This pre-clustering drastically reduces the pairwise step, which can be
time limiting in automatic alignments of large families of sequences.

Since version 5.4.1, it is possible to align two groups of already
aligned sequences (profiles), or a sequence with a profile (see option
-2).

MultAlin can read symbol comparison tables from the GCG package, version 9
and later.

The user can parametrise the doc format (see SECTION -5- File formats /
B. Output Sequence File / doc Format).


SECTION -2- Installing MultAlin
===============================

See ma_c.txt


SECTION -3- Running MultAlin
============================

A. Cautions
===========
Before aligning large sequences, you may want to test MultAlin with
shorter sequences and watch the system load (see the 'ps' UNIX command)
during the alignment. When the swap partition of the hard disk is full,
MultAlin can use an internal swap mode on the user partition.

You can run MultAlin in two modes:
  * command line mode.
  * interactive mode, which helps you to select program parameters and
    options.

Help: type 'ma -h' or 'ma -?' to obtain the help screen.

B. Command line mode
====================
Syntax:(1) ma [[