spacer.txt 05 April 2000 SPACER: A package for performing Principal Coordinates Analysis on sets of aligned sequences. Des Higgins Higgins@EMBL-Heidelberg.DE European Molecular Biology Laboratory Postfach 10.2209 Meyerhofstrasse 1 W-6900 Heidelberg, Germany CONTENTS: 1. Compiling/installation/simple usage 2. DISTANCES 3. SPACER 4. READAL3 1. COMPILING/INSTALLATION/SIMPLE USAGE Installation This package comes with three small FORTRAN programs. They have only been tested on a VAX running VMS but do not use any unusual grammar so there is a reasonable chance that they will compile using any standard F77 compiler. The READAL3 program does use a check for digits in an alignment file and DISTANCES converts from lower to upper case, using ASCII codes so these will definitely need to be modified if you use an EBCDIC machine. All input and output is done via text files or simple questions and answers on the screen; no graphics! The supplied programs are dimensioned for up to 650 sequences of maximum length 5000 residues each; these dimensions can be changed easily by editing the programs. Find PARAMETER statements in the programs and edit the numbers after MAXN= (number of sequences) and MAXL= (maximum length of sequences). Change ALL occurences of these statements in each program (sorry about that). Compile and link the 3 programs as follows, using SPACER as an example: $ FORT distances (takes distances.for and produces distances.obj) $ LINK distances (takes distances.obj and produces distances.exe) You can then delete distances.obj; run the program with: $ RUN distances (this runs distances.exe) Repeat this for SPACER and READAL3. That's it! Simple usage The sequences must be aligned already in a multiple alignment. There are literally dozens of different multiple alignment formats. My programs have to read these, which presents a real headache. I have adopted one format (PIR format with gaps denoted by hyphens '-') as my standard and supply a small utility program, READAL3, which will attempt to read other formats and convert. It is difficult to allow for all variations, so be careful. Some software will write multiple alignments out in PIR format anyway (e.g. CLUSTAL V (Higgins, Bleasby and Fuchs, 1992)). I will start by assuming that you have a multiple alignment in PIR format; if you do not, then read the notes below on READAL3 and try running it on a multiple alignment. Here is an example file with 12 globins in PIR format, already aligned. The sequence names are from SWISSPROT (Bairoch and Boeckmann, 1991). You could try cutting this out with an editor and using it. >P1;HBA$ELEMA V--LSDKDKTNVKATWSKVGDHASDYVAEALERMFFSFPTTKTYFPHF-DLSH-----GS GQVKGHGKKVGEALTQAVGHLDDLPSALSALSDLHAHKLRVDPVNFKLLSHCLLVTLSSH QPT-EFTPEVHASLDKFLSNVSTVLTSKYR------ * >P1;HBA$PHOVI V--LSPADKTNVKATWDKIGGHAGEYGGEALERTFTAFPTTKTYFPHF-DLSH-----GS AQVKAHGKKVADALTTAVAHMDDLPGALSALSDLHAHKLRVDPVNFKLLSHCLLVTLACH HPA-DFTPAVHASLDKFFSAVSTVLTSKYR------ * >P1;HBA$PHYCA V--LSPADKTNVKAAWAKVGNHAADFGAEALERMFMSFPSTKTYFSHF-DLGH-----NS TQVKGHGKKVADALTKAVGHLDTLPDALSDLSDLHAHKLRVDPVNFKLLSHCLLVTLAAH LPG-DFTPSVHASLDKFLASVSTVLTSKYR------ * >P1;HBA$PONPY V--LSPADKTNVKTAWGKVGAHAGDYGAEALERMFLSFPTTKTYFPHF-DLSH-----GS AQVKDHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAH LPA-EFTPAVHASLDKFLASVSTVLTSKYR------ * >P1;HBBA$CAPHI M--LTAEEKAAVTGFWGKV--KVDEVGAEALGRLLVVYPWTQRFFEHFGDLSSADAVMNN AKVKAHGKKVLDSFSNGMKHLDDLKGTFAQLSELHCDKLHVDPENFKLLGNVLVVVLARH HGS-EFTPLLQAEFQKVVAGVANALAHRYH------ * >P1;HBBL$XENLA V-HLSADEKSAINAVWSKV--NIENDGHDALTRLLVVFPWTQRYFSSFGNLSNVAAISGN AKVRAHGKKVLSAVDESIHHLDDIKNFLSVLSTKHAEELHVDPENFKRLADVLVIVLAGK LGA-AFTPQVQAAWEKFSAGLVAALSHGYF------ * >P1;HBBZ$MOUSE MVHFTAEEKAAITSIWDKV--DLEKVGGETLGRLLIVYPWTQRFFDKFGNLSSALAIMGN PRIRAHGKKVLTSLGLGVKNMDNLKETFAHLSELHCDKLHVDPENFKLLGNMLVIVLSTH FAK-EFTPEVQAAWQKLVIGVANALSHKYH------ * >P1;HBD$AOTTR V-HLTGDEKSAVAALWGKV--NVEEVGGEALGRLLVVYPWTQRFFESFGALSSPDAVMGN PKVKAHGKKVLGAFSDGLAHLDNLKGTFAQLSELHCDKLHVDPENFRLLGNVLVCVLARN FGK-EFTPLLQAAFQKVVAGVATALAHKYH------ * >P1;MYG$CALJA G--LSDGEWQLVLNVWGKVEADIPSHGQEVLISLFKGHPETLEKFDKFKHLKSEDEMKAS EELKKHGVTVLTALGGILKKKGHHEAELKPLAQSHATKHKIPVKYLEFISDAIVHVLQKK HPG-DFGADAQGAMKKALELFRNDMAAKYKELGFQG * >P1;MYG$CANFA G--LSDGEWQIVLNIWGKVETDLAGHGQEVLIRLFKNHPETLDKFDKFKHLKTEDEMKGS EDLKKHGNTVLTALGGILKKKGHHEAELKPLAQSHATKHKIPVKYLEFISDAIIQVLQSK HSG-DFHADTEAAMKKALELFRNDIAAKYKELGFQG * >P1;MYG$CASFI G--LSDGEWQLVLHVWGKVEADLAGHGQEVLIRLFKGHPETLEKFNKFKHIKSEDEMKAS EDLKKHGVTVLTALGGVLKKKGHHEAEIKPLAQSHATKHKIPIKYLEFISEAIIHVLQSK HPG-XFGADAXGAMNKALELFRKDIAAKYKELGFQG * >P1;MYG$CEBAP G--LSDGEWQLVLNVWGKVEADIPSHGQEVLISLFKGHPETLEKFDKFKHLKSEDEMKAS EELKKHGATVLTALGGILKKKGQHEAELKPLAQSHATKHKIPVKYLEFISDAIVHVLQKK HPG-DFGADAQGAMKKALELFRNDMAAKYKELGFQG * The beginning of each sequence is marked with a right angle bracket (>); the name of the sequence follows a few characters later, after the semi-colon; the next line is ignored and the end of the sequence is marked by an asterisk. Each sequence here is exactly the same length after including the gap characters (-). The READAL3 program will attempt to generate this format from a normal-looking multiple alignment. Take the above file (or your own) and run DISTANCES to extract a distance matrix. This is a file with a distance between each pair of sequences. You finally compute the principal coordinate analysis by running SPACER and giving the name of the file containing the distance matrix. 2. DISTANCES The DISTANCES program reads a multiple alignment data file in PIR format and outputs a distance matrix file. You are asked to specify the names of the input and output files, and to choose options. The options specify whether you wish to delete all positions with a gap; whether you are using proteins or nucleic acids in your sequences, and whether to use an amino-acid weight matrix. INPUT FORMAT: As described above, the sequences must be aligned already! Further, they must be in PIR or FASTA (Lipman and Pearson, 1985) format with a modification: gaps are specified by using the hyphen "-" character. Do not use dots "." or spaces " " for gaps; they will simply be ignored! The program checks the characters immediately after each angle bracket ">" to determine whether PIR or FASTA format is being used. If the 4th character in the line is a semi-colon ";", then PIR format is assumed: the name after the semi-colon is read as the name of the sequence; the next line is normally used as a descriptive title line and is ignored; the sequence is then read in free format until an asterisk "*" is found. This marks the end of the sequence; the next line beginning with a ">" is treated as the start of the next sequence. FASTA format is simpler. There is no second title line and there is no asterisk to mark the end of the sequence. The only way the program can distinguish between PIR and FASTA formats is to check for the semi-colon as the fourth character on the main title line; if there is none, FASTA format is assumed. OUTPUT FORMAT: The output file from DISTANCES is a text file consisting of the following. First, the number of sequences being examined (one integer); secondly the lower left triangle of a matrix of all pairwise distances between the sequences, including the main diagonal. This data is printed as real numbers. The values will later be read in free format, so if you are creating such a file in a text editor, you have some freedom in how you lay out the data (FLW). A small example distance matrix file is shown below for 8 sequences 8 0.00 1.01 0.00 0.95 2.14 0.00 1.91 1.81 4.56 0.00 3.60 4.08 1.02 1.99 0.00 1.03 1.88 2.01 3.01 0.66 0.00 44.04 0.01 493.1 0.001 14.1 10.0 0.00 0.01 0.01 0.99 9.91 9.9 1.02 2.02 0.00 You can spread each row of the matrix over as many lines as you like. The diagonal values of zero should be present. Normally the distance between an object and itself is zero (unless you use funny distances or have funny objects). OPTIONS (Toss gaps): You will be asked whether or not you wish to "toss" all gaps. This means that any sites (positions/columns in the multiple alignment) with a gap in ANY sequence will be removed. If you do this, you throw away some of the data, but you guarantee that the distances will be euclidean. This is good because it means that "like" is compared with "like" in all sequences. It is easy to invent situations (although they are rare in practice) where strange combinations of gaps opposite regions that are very similar in sequences that are otherwise very distant, will generate very weird distance matrices. On the other hand, with most sequence data sets (any I have seen), the effect of not deleting gaps is small. You get a slight distortion of the distances but it is very small. If you are worried about distorting the distances, you can see the effect when you get the output from SPACER; see the documentation there for details. So, in summary, deleting gaps is the safer, more conservative option but it is usually not needed. OPTIONS (Proteins or DNA): The program must know whether the sequences are amino acid or nucleic acid. Please take the trouble to tell it which here. OPTIONS (Identity distance or Smith AA distance): The default option is to calculate a distance between a pair of sequences, based on the number of mismatches only (Identity Distance). The actual distance is the square root of the number of differences, ignoring positions where either sequence (or any sequence if gaps are tossed) has a gap. If all positions with a gap are removed, this distance is guaranteed to be euclidean. For DNA sequences, this distance is fine. For proteins, it is often nice to use a weight matrix which gives greater weight to amino acids which are judged to be similar. Unfortunately, the most commonly used weight matrices (PAM matrices, Dayhoff et al, 1978) are not very useful for calculating Euclidean distances. A simplistic distance matrix that does give Euclidean distances is that of Smith and Smith (1990). It is shown below: D 0 E 1 0 K 2 2 0 R 2 2 1 0 H 2 2 1 1 0 N 2 2 2 2 2 0 Q 2 2 2 2 2 1 0 S 2 2 2 2 2 2 2 0 T 2 2 2 2 2 2 2 1 0 I 3 3 3 3 3 3 3 3 3 0 L 3 3 3 3 3 3 3 3 3 1 0 V 3 3 3 3 3 3 3 3 3 1 1 0 F 3 3 3 3 3 3 3 3 3 2 2 2 0 W 3 3 3 3 3 3 3 3 3 2 2 2 1 0 Y 3 3 3 3 3 3 3 3 3 2 2 2 1 1 0 C 3 3 3 3 3 3 3 3 3 2 2 2 2 2 2 0 M 3 3 3 3 3 3 3 3 3 2 2 2 2 2 2 2 0 A 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 0 G 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 1 0 P 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 0 D E K R H N Q S T I L V F W Y C M A G P These distances can be used because, in effect, they are arranged to give an exact hierarchy of the amino acids. 2. SPACER Principal Coordinates Analysis (PCOORD) was invented by John Gower and described in Gower (1966). It is a method for carrying out ordination, also known as multidimensional scaling. The most familiar ordination method is Principal Components Analysis (PCA). PCA is related to PCOORD but there are important differences. Ordination methods work by attempting to embed the objects (sequences in this case) in a space of N dimensions, where N is less than the number of objects. The distribution of the objects in this space (or almost always, just a few dimensions of it) is used to examine patterns of relatedness between the objects. PCOORD takes the N by N distance matrix containing the distances between all N sequences and finds a space of N-1 dimensions where these distances are preserved. If the distances are euclidean, then it gives an exact solution. This is also known as classical multidimensional scaling. The analysis ranks the dimensions, in terms of the amount of variation accounted for by each. A plot of the positions of the objects in the first 2 or 3 dimensions is then used to visualise a summary of the ordination. Run SPACER and give the name of a distance matrix file, generated by DISTANCES. The program will then prompt for the name of the output file and will ask if you want to use a file of symbols. Symbols are just characters (digits, letters, etc.) that can be used to label the sequences or groups of sequences in the output. If you just press , the program will just use an asterisk "*" for each sequence in the plots if it cannot find a symbol file. The format of this file is very simple: put one symbol per sequence into it in free format i.e. spaces are ignored. An example symbols file for the globin alignment example used earlier in this documentation is shown below: AAAA BBBB MMMM In this case, I use A's for the first 4 sequences, since they are all alpha globins, B's for the next 4 sequences, which are beta globins, and then M's for the last 4 sequences of myoglobins. I could have used a different symbol for each sequence. Then the analysis is carried out and sent to the output file. The output includes all of the eigenvalues. If the input distances are exactly euclidean, then you expect all eigenvalues to be non-negative. Some rounding errors may produce a small number of very small negative eigenvalues. If the distances are not euclidean, then the absolute size of the negative eigenvalues is a measure of the distortion. If this is small relative to the sum of all eigenvalues (including the absolute values of the negative ones) then the distortion is not serious. The coordinates of the sequences along the first 10 axes (dimensions) of the ordination are listed and the positions of the sequences in the first 3 dimensions are plotted on simple scatter plots. The symbols in the symbol file are used here if one was specified, or asterisks are used. If two or more points coincide on the plots, an asterisk is printed. Finally, you get the percentage of the total variation in the data set, accounted for by each of the first 10 dimensions. 3. READAL3 The READAL3 program attempts to read a variety of multiple alignment formats and write out the alignment in PIR format, with hyphens for gaps. There are so many possible formats that it is very difficult for me to write a completely general purpose program. All I have tried to do is write a program that will read a few different formats, unaided, and make it easy for you to edit other formats into a state where the program can read them too. BE CAREFUL!!!! Please check the output; it is very easy for this program to quietly produce a completely scrambled output file that will look reasonable at first glance, because the input format got the better of my format detection (or because my program is badly written - it has more "go to"'s per line than any program I have ever written and it is only 150 lines long). The output files produced by CLUSTAL (Higgins and Sharp, 1988, 1989), CLUSTALV (Higgins, Bleasby and Fuchs, 1992) can be automatically read by READAL3. An example for the file of 12 globins, used above is shown here: -----Example file that READAL3 can read----- CLUSTAL V multiple sequence alignment HBA$ELEMA V--LSDKDKTNVKATWSKVGDHASDYVAEALERMFFSFPTTKTYFPHF-DLSH-----GS HBA$PHOVI V--LSPADKTNVKATWDKIGGHAGEYGGEALERTFTAFPTTKTYFPHF-DLSH-----GS HBA$PHYCA V--LSPADKTNVKAAWAKVGNHAADFGAEALERMFMSFPSTKTYFSHF-DLGH-----NS HBA$PONPY V--LSPADKTNVKTAWGKVGAHAGDYGAEALERMFLSFPTTKTYFPHF-DLSH-----GS HBBA$CAPHI M--LTAEEKAAVTGFWGKV--KVDEVGAEALGRLLVVYPWTQRFFEHFGDLSSADAVMNN HBBL$XENLA V-HLSADEKSAINAVWSKV--NIENDGHDALTRLLVVFPWTQRYFSSFGNLSNVAAISGN HBBZ$MOUSE MVHFTAEEKAAITSIWDKV--DLEKVGGETLGRLLIVYPWTQRFFDKFGNLSSALAIMGN HBD$AOTTR V-HLTGDEKSAVAALWGKV--NVEEVGGEALGRLLVVYPWTQRFFESFGALSSPDAVMGN MYG$CALJA G--LSDGEWQLVLNVWGKVEADIPSHGQEVLISLFKGHPETLEKFDKFKHLKSEDEMKAS MYG$CANFA G--LSDGEWQIVLNIWGKVETDLAGHGQEVLIRLFKNHPETLDKFDKFKHLKTEDEMKGS MYG$CASFI G--LSDGEWQLVLHVWGKVEADLAGHGQEVLIRLFKGHPETLEKFNKFKHIKSEDEMKAS MYG$CEBAP G--LSDGEWQLVLNVWGKVEADIPSHGQEVLISLFKGHPETLEKFDKFKHLKSEDEMKAS . ............*.*. .........*.......*.*...*..* .... .. HBA$ELEMA GQVKGHGKKVGEALTQAVGHLDDLPSALSALSDLHAHKLRVDPVNFKLLSHCLLVTLSSH HBA$PHOVI AQVKAHGKKVADALTTAVAHMDDLPGALSALSDLHAHKLRVDPVNFKLLSHCLLVTLACH HBA$PHYCA TQVKGHGKKVADALTKAVGHLDTLPDALSDLSDLHAHKLRVDPVNFKLLSHCLLVTLAAH HBA$PONPY AQVKDHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAH HBBA$CAPHI AKVKAHGKKVLDSFSNGMKHLDDLKGTFAQLSELHCDKLHVDPENFKLLGNVLVVVLARH HBBL$XENLA AKVRAHGKKVLSAVDESIHHLDDIKNFLSVLSTKHAEELHVDPENFKRLADVLVIVLAGK HBBZ$MOUSE PRIRAHGKKVLTSLGLGVKNMDNLKETFAHLSELHCDKLHVDPENFKLLGNMLVIVLSTH HBD$AOTTR PKVKAHGKKVLGAFSDGLAHLDNLKGTFAQLSELHCDKLHVDPENFRLLGNVLVCVLARN MYG$CALJA EELKKHGVTVLTALGGILKKKGHHEAELKPLAQSHATKHKIPVKYLEFISDAIVHVLQKK MYG$CANFA EDLKKHGNTVLTALGGILKKKGHHEAELKPLAQSHATKHKIPVKYLEFISDAIIQVLQSK MYG$CASFI EDLKKHGVTVLTALGGVLKKKGHHEAEIKPLAQSHATKHKIPIKYLEFISEAIIHVLQSK MYG$CEBAP EELKKHGATVLTALGGILKKKGQHEAELKPLAQSHATKHKIPVKYLEFISDAIVHVLQKK .....**..*....................*...*.....................*... HBA$ELEMA QPTEFTPEVHASLDKFLSNVSTVLTSKYR------ HBA$PHOVI HPADFTPAVHASLDKFFSAVSTVLTSKYR------ HBA$PHYCA LPGDFTPSVHASLDKFLASVSTVLTSKYR------ HBA$PONPY LPAEFTPAVHASLDKFLASVSTVLTSKYR------ HBBA$CAPHI HGSEFTPLLQAEFQKVVAGVANALAHRYH------ HBBL$XENLA LGAAFTPQVQAAWEKFSAGLVAALSHGYF------ HBBZ$MOUSE FAKEFTPEVQAAWQKLVIGVANALSHKYH------ HBD$AOTTR FGKEFTPLLQAAFQKVVAGVATALAHKYH------ MYG$CALJA HPGDFGADAQGAMKKALELFRNDMAAKYKELGFQG MYG$CANFA HSGDFHADTEAAMKKALELFRNDIAAKYKELGFQG MYG$CASFI HPGXFGADAXGAMNKALELFRKDIAAKYKELGFQG MYG$CEBAP HPGDFGADAQGAMKKALELFRNDMAAKYKELGFQG ... *.... ....*............*. -----End of example file----- RULES FOR MULTIPLE ALIGNMENT FORMATS: In general, READAL3 only reads multiple alignments of the "blocked" variety. All residues must be present in all sequences. The blocks of sequence must not be interspersed with extra lines showing identical or similar residues between consecutive sequences. A consensus line of the type produced by CLUSTAL is allowed after each block provided it only contains blanks " ", stars "*" or dots ".". Each block of sequence must have all of the sequence names along the left hand side, with at least one blank between each name and the sequence. All sequences must be exactly the same length, including gaps i.e. the sides of the blocks of alignment must be flat. 1) Heading: Either leave the first line of the file BLANK and start the alignment anywhere you like or use the first few lines for comments/arbitrary text. If the first line is not blank, the program will skip the first lines of text until it finds a completely blank line. 2) Gap characters: Only use hyphens for gaps, not space or dot characters. 3) Sequence numbering: Numbers are allowed in most positions in the alignment and will be ignored e.g. left or right of an alignment block or on extra lines. 4) Sequence names: The name of every sequence must appear along the LEFT hand side of every block of sequence alignment. THESE NAMES MUST NOT CONTAIN ANY BLANKS. The program will start looking for the sequences themselves after the first blank after each name. 5) Format of the alignment blocks: The blocks of aligned sequences may be broken up by columns of blanks, to improve readability. The blank columns will be ignored. 6) End of the alignment: There must be at least one blank line at the end of the file. You will be asked if you want to use a file specifying fragments. This is to allow you to only use parts of the alignment. The format of the file is just a series of pairs of integers in free format, specifying the beginning and end of each fragment. To specify these positions, use positions along the alignment e.g. 10, 25 to specify a fragment from position 10 to position 25 inclusive. REFERENCES Dayhoff, M. O. 1978. Atlas of protein structure and function. Volume 5. Suppl 3. National Biomedical Research Foundation, Silver Spring, Md. Gower, J. C. 1966. Some distance properties of latent root and vector methods used in multivariate analysis. Biometrika 53:325-328. Higgins, D.G., Bleasby, A. J. and Fuchs, R. (1992, in press) CLUSTAL V: improved software for multiple sequence alignment. CABIOS, 8(2): in press. Higgins, D. G., and P. M. Sharp. 1988. CLUSTAL: a package for performing multiple sequence alignment on a microcomputer. Gene 73:237-244. Higgins, D. G., and P. M. Sharp. 1989. Fast and sensitive multiple sequence alignments on a microcomputer. CABIOS, 5:151-153. Higgins, D.G. (1992) Sequence ordinations: a multivariate analysis approach to analysing large sequence data sets. CABIOS, 8(1): 15-22. Smith, R. F., and T. F. Smith. 1990. Automatic generation of primary sequence patterns from sets of related protein sequences. Proc. Natl. Acad. Sci. USA 87:118-122. P.S. FLW means "famous last words".