pcca
Phylogenetic Canonical Correlation Analysis.
Beta version.

by Liam J. Revell

Contents


1.  Getting started

2.  Running "pcca"

3.  The output

4.  References and further reading

5.  Contact information

6.  Appendix - updates - citation - note on Mac OS installation

Getting Started

Introduction

pcca is a program for canonical correlation analysis of data consisting of observations that are non-independent due (usually) to phylogenetic history.

Canonical correlation analysis is a procedure by which two sets of orthogonal derived variables are calculated from two sets of original variables such that the correlations between corresponding derived variables are maximized. Derived variables are calculated as linear combinations of the original variables. Good descriptions of canonical correlation analysis are provided by Miles and Ricklefs (1984) and Losos (1990).

Canonical correlation analysis can be very useful in circumstances in which our variable of interest is abstract or cannot be measured directly. For example, one might collect several measurements of anti-predator escape behaviour (such as maximum speed, acceleration, and jump distance) and several measurements of morphology (hind limb measurements, muscle measurements), and then use canonical correlation analysis to determine whether variation in the latter is associated with variation in the former.

When data are collected from species whose relationship can be described using a phylogenetic tree, observations collected from the tips of the tree are non-independent due to common history. Shared history creates expected covariance among the observations at the tips that is a function of the amount of history shared between taxa. In the simplest model for the evolutionary process, Brownian motion, this covariance is directly proportional to the amount of shared history.

In matrix form, under Brownian motion the expected covariance among tips is proportional to an n × n (for n species) symmetric matrix C, with diagonal elements equal to the distance from the root of the tree to each tip (the total height of the tree, if it is ultrametric) and off-diagonal elements equal to the distance from the root of the tree to the most recent common ancestor of each pair of species.

For example, using the following tree and Brownian motion as our model of the evolutionary process:

we would calculate an expected covariance matrix amongst the observations at tips proportional to:
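The tree figure and the matrix are not reproduced in this copy of the manual. As a sketch only, assuming the example is the three-taxon tree ((A:vA,B:vB):v(A,B),C:vC) referred to in the discussion of polytomies under "Input file format", and with rows and columns ordered A, B, C, the matrix would be:

```latex
\mathbf{C} =
\begin{pmatrix}
v_A + v_{(A,B)} & v_{(A,B)}       & 0   \\
v_{(A,B)}       & v_B + v_{(A,B)} & 0   \\
0               & 0               & v_C
\end{pmatrix}
```

Each diagonal element is the root-to-tip distance for one species, and the only non-zero off-diagonal element is the branch shared by A and B.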

In order to remove statistical dependence among our observations, then, we essentially transform our data array using the inverse square root of C: Z = C^(-1/2) X, where X is our original data and Z is an array of transformed variates. (This is not the precise transformation used: for specific details see Rohlf 2001, Revell and Harrison 2008.) This is a linear transformation philosophically analogous to, say, transforming all our observations to have a common variance, except that in this case we also transform the data to eliminate covariances among observations.
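The idea can be illustrated with a minimal Python sketch. It is not the transformation pcca performs (see the references just cited for that); it simply computes a symmetric inverse square root of C by eigendecomposition, which is one of several possible choices, and applies it to made-up data. The tree, the trait values, and the function name `inverse_sqrt` are all assumptions for illustration.

```python
import numpy as np

def inverse_sqrt(C):
    """Return the symmetric matrix C^(-1/2) for a symmetric positive definite C."""
    eigvals, eigvecs = np.linalg.eigh(C)
    return eigvecs @ np.diag(eigvals ** -0.5) @ eigvecs.T

# Hypothetical 3-taxon tree ((A:1,B:1):1,C:2): A and B share one unit of history.
C = np.array([[2.0, 1.0, 0.0],
              [1.0, 2.0, 0.0],
              [0.0, 0.0, 2.0]])

# Made-up trait values: rows are species A, B, C; columns are two traits.
X = np.array([[1.0, 4.0],
              [2.0, 3.5],
              [5.0, 1.0]])

W = inverse_sqrt(C)
Z = W @ X                 # transformed variates, one row per observation

# W C W' is the identity matrix: the expected covariance among rows is removed.
whitened = W @ C @ W.T
```

After the transformation the rows of Z have equal expected variance and zero expected covariance under the Brownian motion model, which is what standard statistical procedures assume.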

Computing C as above is only one way to obtain the matrix describing the variances and covariances among the observations at the tips of the tree for a given trait. Pagel (1999) proposed a parameter, λ (to be estimated using likelihood), by which the off-diagonal elements of C could be transformed in order to better fit the multivariate distribution predicted by the phylogeny to the observations collected from the tips. For our previous example, C_λ would then be computed as:
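The matrix itself is not reproduced in this copy of the manual. As a sketch, assuming the three-taxon example tree ((A:vA,B:vB):v(A,B),C:vC) referred to in the discussion of polytomies under "Input file format", multiplying each off-diagonal element by λ and leaving the diagonal unchanged would give:

```latex
\mathbf{C}_{\lambda} =
\begin{pmatrix}
v_A + v_{(A,B)}     & \lambda\, v_{(A,B)} & 0   \\
\lambda\, v_{(A,B)} & v_B + v_{(A,B)}     & 0   \\
0                   & 0                   & v_C
\end{pmatrix}
```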

λ is then estimated by maximizing the following equation for the likelihood, which is based on the multivariate normal distribution and in which y is a column vector of the values at the tips of the tree for a given trait, â is the "phylogenetic mean" or MLE of the ancestral state at the root, 1 is a column vector of 1s, and σ^2 is the MLE of the evolutionary rate for the character, which can be computed analytically:
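The equation is not reproduced in this copy of the manual. The standard multivariate normal likelihood, written with the symbols defined above and with n the number of species, is given here as a reconstruction rather than as a quotation of the original:

```latex
L(\lambda) =
\frac{\exp\!\left[-\tfrac{1}{2}\,(\mathbf{y} - \hat{a}\mathbf{1})'
      \left(\sigma^{2}\mathbf{C}_{\lambda}\right)^{-1}
      (\mathbf{y} - \hat{a}\mathbf{1})\right]}
     {\sqrt{(2\pi)^{n}\,\left|\sigma^{2}\mathbf{C}_{\lambda}\right|}}

\hat{a} = \left(\mathbf{1}'\mathbf{C}_{\lambda}^{-1}\mathbf{1}\right)^{-1}
          \mathbf{1}'\mathbf{C}_{\lambda}^{-1}\mathbf{y},
\qquad
\sigma^{2} = \frac{(\mathbf{y} - \hat{a}\mathbf{1})'\,
             \mathbf{C}_{\lambda}^{-1}\,(\mathbf{y} - \hat{a}\mathbf{1})}{n}
```

The second line gives the usual analytic maximum likelihood estimates of the phylogenetic mean and the evolutionary rate for a fixed value of λ.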

Obviously both Brownian motion (λ = 1.0) and phylogenetic independence (λ = 0.0) are special cases of this transformation.
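A minimal Python sketch of the single-trait case follows. It assumes the standard multivariate normal log-likelihood with the analytic estimates of â and σ^2, and it finds λ by a plain grid search over [0, 1]. The grid search, the tree, the trait values, and the function names are assumptions made for illustration; this is not necessarily the procedure pcca uses, and it does not cover the multivariate version of λ.

```python
import numpy as np

def lambda_transform(C, lam):
    """Multiply the off-diagonal elements of C by lam; keep the diagonal unchanged."""
    C_lam = C * lam
    np.fill_diagonal(C_lam, np.diag(C))
    return C_lam

def log_likelihood(lam, C, y):
    """Log-likelihood of lam for one trait, with a-hat and sigma^2 set to their MLEs."""
    n = len(y)
    C_lam = lambda_transform(C, lam)
    C_inv = np.linalg.inv(C_lam)
    ones = np.ones(n)
    a_hat = (ones @ C_inv @ y) / (ones @ C_inv @ ones)   # phylogenetic mean
    resid = y - a_hat
    sigma2 = (resid @ C_inv @ resid) / n                  # MLE of the evolutionary rate
    _, logdet = np.linalg.slogdet(sigma2 * C_lam)
    # At the MLE of sigma^2 the quadratic form in the exponent equals n.
    return -0.5 * (n * np.log(2.0 * np.pi) + logdet + n)

# Hypothetical 4-taxon tree ((A:1,B:1):1,(C:1,D:1):1) and made-up trait values.
C = np.array([[2.0, 1.0, 0.0, 0.0],
              [1.0, 2.0, 0.0, 0.0],
              [0.0, 0.0, 2.0, 1.0],
              [0.0, 0.0, 1.0, 2.0]])
y = np.array([1.0, 1.2, 3.0, 3.3])

grid = np.linspace(0.0, 1.0, 101)
logL = np.array([log_likelihood(lam, C, y) for lam in grid])
lam_hat = grid[np.argmax(logL)]       # grid-search estimate of lambda
```

Setting `lam` to 1.0 returns C unchanged (Brownian motion), and setting it to 0.0 returns a diagonal matrix (phylogenetic independence), matching the two special cases noted above.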

In the multivariate case, we calculate and transform using a multivariate version of λ as described in Freckleton et al. (2002).

Once the transformed data have been obtained (using λ = 1.0 or λ ≠ 1.0), they are appropriate for standard statistical analyses.

Canonical correlation analysis (CCA) is one such analysis. In CCA, for data matrices X and Y composed of observations for multiple variables, vectors of coefficients a and b are computed to maximize the correlation between the derived variables a'X and b'Y. The first canonical variables are the derived variables u_1 = a'X and v_1 = b'Y. The next pair of derived variables is then computed with the constraint that they are orthogonal to the first derived variables. This procedure is repeated, with each new canonical axis orthogonal to all prior axes, a number of times equal to the smaller of the numbers of columns in X and Y. Further detail on CCA can be found in many statistical texts, in Miles and Ricklefs (1984), and on wikipedia.org.
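The procedure can be sketched in Python using a QR decomposition followed by a singular value decomposition, which is one common way to compute CCA and is not claimed to be the algorithm used inside pcca. The sketch uses ordinary simulated data centred on the column means; in a phylogenetic analysis the inputs would instead be the transformed data described above. The function name `cca` and the data are assumptions for illustration.

```python
import numpy as np

def cca(X, Y):
    """Canonical correlations and coefficients for column-centred X (n x p) and Y (n x q)."""
    Qx, Rx = np.linalg.qr(X)              # X = Qx Rx, Qx has orthonormal columns
    Qy, Ry = np.linalg.qr(Y)
    U, s, Vt = np.linalg.svd(Qx.T @ Qy)   # singular values are the canonical correlations
    k = min(X.shape[1], Y.shape[1])       # number of canonical axes
    A = np.linalg.solve(Rx, U[:, :k])     # coefficients for the X variables
    B = np.linalg.solve(Ry, Vt.T[:, :k])  # coefficients for the Y variables
    return s[:k], A, B

# Simulated data: 30 observations, 2 variables in X and 3 in Y.
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))
Y = rng.normal(size=(30, 3))
Y[:, 0] += X[:, 0]                        # build in some association
X = X - X.mean(axis=0)
Y = Y - Y.mean(axis=0)

r, A, B = cca(X, Y)
U_scores = X @ A                          # canonical scores for X
V_scores = Y @ B                          # canonical scores for Y
```

The columns of `U_scores` are mutually orthogonal, as are those of `V_scores`, and the correlation between the i-th columns of the two score matrices equals `r[i]`.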

Installation - Linux/UNIX, Windows, Mac OS

This manual, all available executables, and an example data file are available as a zipped tarball pcca.tar.gz.

In Linux/UNIX, unzip the tarball by typing (from the directory in which the tarball is located):

tar -zxvf pcca.tar.gz
It should be unnecessary to create a directory structure (all of that is in the zipped tarball).

Now navigate to the appropriate directory:

cd pcca/linux
and change the mode of the executable (if necessary - if you're not sure, there is no harm in doing it):
chmod a+x pcca.bin

In Windows and Mac OS, installation is just as simple: unzip the zipped tarball using any widely available archive utility, such as WinZip or WinAce. Analogous utilities are available for Mac OS.


Running pcca

Input file format

Any run of "pcca" requires two input files. Both should be plain text files, which can be created in Linux using a text editor such as gedit, or in Windows using WordPad and saving with Save as type set to Text Document.

The first input file is your quantitative trait data file, formatted as follows:

In this input file, the two integers in the header are the number of traits (4 in this example) and the number of taxa (5), respectively. The taxa are then listed by number (note that the period after the number is essential for the taxon numbers to be read properly), and the traits are in columns adjacent to the appropriate taxon number. Extra tabs at the end of the first or subsequent lines may cause the input file to be read incorrectly. Hopefully this will be fixed in a future version. Below the data array is a set of 0s and 1s corresponding to the classes into which the variables (columns) are to be grouped for the canonical analysis. In this example, traits 1 & 2 belong to variable class 0, and traits 3 & 4 belong to variable class 1. Only 0s and 1s should be used here.
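The example file is not reproduced in this copy of the manual. The layout described above can be checked before a run with a short Python sketch such as the following. It assumes whitespace-separated fields, a header holding both integers on one line, and a single final line of class labels; these details, the function name `parse_data_file`, and the trait values are assumptions, not a specification of what pcca accepts.

```python
def parse_data_file(text):
    """Parse a trait file laid out as described above (field separators are assumed)."""
    lines = [ln.split() for ln in text.strip().splitlines() if ln.strip()]
    n_traits, n_taxa = int(lines[0][0]), int(lines[0][1])
    data = {}
    for fields in lines[1:1 + n_taxa]:
        if not fields[0].endswith("."):
            raise ValueError("taxon numbers must end in a period")
        taxon = int(fields[0].rstrip("."))
        values = [float(v) for v in fields[1:]]
        if len(values) != n_traits:
            raise ValueError(f"taxon {taxon}: expected {n_traits} traits")
        data[taxon] = values
    classes = [int(c) for c in lines[1 + n_taxa]]
    if len(classes) != n_traits or set(classes) - {0, 1}:
        raise ValueError("the class row must hold one 0 or 1 per trait")
    return data, classes

# Made-up values for 4 traits and 5 taxa; this is not the example file shipped with pcca.
example = """4 5
1. 1.2 3.4 0.5 2.2
2. 1.0 3.1 0.7 2.0
3. 0.8 2.9 0.9 1.7
4. 1.5 3.8 0.4 2.6
5. 1.1 3.3 0.6 2.1
0 0 1 1
"""

data, classes = parse_data_file(example)
```

Here traits 1 and 2 fall in variable class 0 and traits 3 and 4 in variable class 1.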

The second input file is your tree file, formatted as follows:

Polytomies are not allowed and should first be resolved arbitrarily using branches of zero length. The user can rest assured that it is irrelevant how polytomies are resolved (see Rohlf 2001). This can also be seen by considering the example tree provided above. If the branch connecting the ancestor of A and B to the root had zero length (i.e., if v(A,B) = 0.0), then it is irrelevant in the calculation of C whether the tree is written:

((A:vA,B:vB):0.0,C:vC) or (A:vA,(B:vB,C:vC):0.0).
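This equivalence can be verified with a short Python sketch that builds C from a tree written as nested tuples: a tip is (name, branch length) and a clade is (list of children, branch length). The representation, the function name `vcv`, and the branch lengths are assumptions for illustration; the sketch does not read pcca tree files.

```python
def vcv(tree, tips):
    """Build C for a nested-tuple tree, with rows and columns in the order given by tips."""
    index = {name: i for i, name in enumerate(tips)}
    C = [[0.0] * len(tips) for _ in tips]

    def walk(node, depth):
        label, length = node
        depth += length                      # distance from the root to the end of this branch
        if isinstance(label, str):           # a tip: the diagonal is the root-to-tip distance
            C[index[label]][index[label]] = depth
            return [label]
        groups = [walk(child, depth) for child in label]
        # Tips in different child clades share history only down to this node.
        for i, g in enumerate(groups):
            for h in groups[i + 1:]:
                for a in g:
                    for b in h:
                        C[index[a]][index[b]] = C[index[b]][index[a]] = depth
        return [t for g in groups for t in g]

    walk(tree, 0.0)
    return C

tips = ["A", "B", "C"]
# The two arbitrary resolutions of the polytomy, with hypothetical branch lengths.
tree_1 = ([([("A", 1.0), ("B", 2.0)], 0.0), ("C", 3.0)], 0.0)   # ((A:1,B:2):0.0,C:3)
tree_2 = ([("A", 1.0), ([("B", 2.0), ("C", 3.0)], 0.0)], 0.0)   # (A:1,(B:2,C:3):0.0)

same = vcv(tree_1, tips) == vcv(tree_2, tips)
```

Because the internal branch has zero length, every off-diagonal element is zero under either resolution, so the two matrices are identical.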

The formats of both the data file and the tree file will be familiar to people who have used my idc program. In the manual for that program I recommend using a taxon conversion table. This is still allowed (so long as the data and tree files do not contain information for multiple trees).

For ease of use the input files should be in the same directory as the executable!

It is also possible to run multiple data sets and multiple trees in sequence. This is accomplished by creating a tree file in which each tree is on a separate line, and a data file in which the data sets are stacked in sequence, each in the format described above.

Running pcca

Running "pcca" is easy. In Windows XP, you should be able to run the executable pcca.exe simply by double-clicking on it. If you prefer, you can also run it from the command prompt. The easiest way to get a command prompt in XP is to open a Run window and enter:

cmd
At the command prompt, navigate to the appropriate directory and type:
pcca.exe

To run in Linux/UNIX navigate to the appropriate directory and type:

./pcca.bin
This can also be accomplished in Mac OS X using the "terminal" program.

From here on, execution in Linux/UNIX/Mac OS X and Windows is identical, so I will follow execution in Linux/UNIX.

The program will give you several prompts:



When prompted for input files, enter the names of the tree and data files to be analyzed. Then enter the name of a file for output. If a file with this name already exists, the new results will be appended to it. The results are written only to file, so if you ran the program by double-clicking, be aware that the command prompt window will close as soon as the program has finished executing.

The user should also be aware that due to the very simple procedure used to find the MLE of λ, the program runs quite slowly for large numbers of taxa. This will be fixed in a newer version of the program.

Notes on λ

The parameter λ is estimated for statistical reasons, but it also provides a measure of phylogenetic dependence. λ ≠ 1.0 does not correspond to any particular model of evolution, however, and might result from any of a number of causes (functional constraint, rate heterogeneity, or natural selection).

It is also possible that when λ is estimated individually for each character, every estimate is much greater than 0.0, but when λ is estimated simultaneously for all characters (as it would need to be for CCA), MLE(λ) = 0.0. The cause of this phenomenon, which I have observed empirically, is not clear.

Finally, λ has no specific natural range, and could even hypothetically be < 0.0 or > 1.0. In this program, I restrict estimation of λ to the interval [0,1]. This is because λ > 1.0 and λ < 0.0 will often cause the matrix C_λ to be non-positive definite. This is a problem for computation because a matrix that is not positive definite is not a valid covariance matrix, so the multivariate normal likelihood cannot be evaluated with it (and the matrix may not be invertible).
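A small Python sketch shows how values of λ outside this range can break positive definiteness. The two-species matrix and the function names are hypothetical, chosen only so that the effect is easy to see.

```python
import numpy as np

def lambda_transform(C, lam):
    """Multiply the off-diagonal elements of C by lam; keep the diagonal unchanged."""
    C_lam = C * lam
    np.fill_diagonal(C_lam, np.diag(C))
    return C_lam

def is_positive_definite(M):
    """A symmetric matrix is positive definite if all of its eigenvalues are positive."""
    return bool(np.all(np.linalg.eigvalsh(M) > 0))

# Two hypothetical sister species that share 90% of their history.
C = np.array([[1.0, 0.9],
              [0.9, 1.0]])

ok_brownian = is_positive_definite(lambda_transform(C, 1.0))    # True
ok_large = is_positive_definite(lambda_transform(C, 1.2))       # False
ok_negative = is_positive_definite(lambda_transform(C, -1.2))   # False
```

With λ = 1.2 the off-diagonal element becomes 1.08, which exceeds the diagonal value of 1.0, so one eigenvalue is negative (1.0 - 1.08 = -0.08). The same happens with λ = -1.2.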

I recommend that the user perform their analysis at least twice: once with λ = 1.0 (option 1) and once using the MLE of λ, and compare the results.


The pcca output

The general output from "pcca" is printed to an output file and looks as follows:


The output is fairly straightforward to interpret.

The first part contains the results of the ML estimation of λ, if this is performed. The ML estimate for λ is provided, as are the results from likelihood-ratio tests of the null hypotheses H0: λ = 0.0 and H0: λ = 1.0.

The second part consists of the phylogenetically transformed data. The phylogenetic transform is performed in such a way that the data are already centered on the phylogenetic mean, which makes computation of the canonical coefficients and scores easier. Immediately below the phylogenetically transformed data are the phylogenetic means for each trait.

The third part is composed of the canonical scores for each canonical axis.

The fourth part consists of the canonical coefficients for each canonical axis.

The remainder of the output file consists of the canonical correlations, and the results for statistical tests about the canonical correlations.
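The likelihood-ratio tests for λ reported in the first part of the output can be sketched in Python as follows. The sketch assumes the conventional test, in which twice the difference in log-likelihood is compared against a chi-square distribution with one degree of freedom; whether pcca computes its P-values in exactly this way is an assumption. The log-likelihood values and the function name `lrt_p_value` are hypothetical.

```python
import math

def lrt_p_value(logL_alt, logL_null):
    """Likelihood-ratio statistic and its P-value from a chi-square with 1 d.f."""
    stat = max(2.0 * (logL_alt - logL_null), 0.0)
    # For one degree of freedom, P(chi-square > x) = erfc(sqrt(x / 2)).
    return stat, math.erfc(math.sqrt(stat / 2.0))

# Hypothetical log-likelihoods, not taken from any pcca run.
logL_mle = -12.3       # at the ML estimate of lambda
logL_null = -15.1      # with lambda fixed at the null value (0.0 or 1.0)

stat, p = lrt_p_value(logL_mle, logL_null)
```

A small P-value indicates that the null value of λ fits significantly worse than the ML estimate.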

The user should refer to standard statistical textbooks for interpretation of the results from CCA. A good one for biologists dealing with multivariate analyses is Experimental Design and Data Analysis for Biologists by Quinn and Keough, although they do not provide much information about canonical correlation analysis.

If many trees and data sets are batch run, then the outputs will be printed in sequence to a single output file.


References

1. Freckleton, R.P., P.H. Harvey, and M. Pagel. 2002. Phylogenetic analysis and comparative data: A test and review of evidence. American Naturalist 160:712-726.

2. Losos, J.B. 1990. Ecomorphology, performance capability, and scaling of West Indian Anolis lizards: An evolutionary analysis. Ecological Monographs 60:369-388.

3. Miles, D.B., and R. Ricklefs. 1984. The correlation between ecology and morphology in deciduous forest passerine birds. Ecology 65:1629-1640.

4. Pagel, M. 1999. Inferring the historical patterns of biological evolution. Nature 401:877-884.

5. Quinn, G.P., and M.J. Keough. 2002. Experimental Design and Data Analysis for Biologists. Cambridge University Press. Cambridge, UK.

6. Rencher, A.C. 2002. Methods of Multivariate Analysis. Wiley-Interscience. Hoboken, NJ.

7. Revell, L.J., and A.S. Harrison. 2008. PCCA: A program for phylogenetic canonical correlation analysis. Bioinformatics 24:1018-1020.

8. Rohlf, F.J. 2001. Comparative methods for the analysis of continuous variables: Geometric interpretations. Evolution 55:2143-2160.


Contact information

Please contact me by email with any questions, or if you find the program useful. My email is lrevell@fas.harvard.edu, and my other contact information is listed below.

Although I have thoroughly tested the program, I encourage users to do the same and I would be happy to hear about any bugs you might find.

Liam J. Revell
Department of Organismic and Evolutionary Biology
Harvard University
Cambridge, MA 02138
(617) 384-8437


Appendix

Updates

Date - Jan. 5, 2008.

An updated version of the program now computes standardized canonical coefficients and structure coefficients (canonical loadings).

Standardized canonical coefficients are analogous to standardized beta weights in a multiple regression. They are computed from the canonical coefficients a and b, for the X and Y variables respectively, as c = D_x a and d = D_y b. D_x and D_y are diagonal matrices in which the diagonal consists of the square roots of the mean squares of each character.
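A minimal Python sketch of this calculation follows. It assumes that the mean square of each character is taken over the transformed, centred data with a divisor of n; the divisor, the function name, and the numbers are assumptions for illustration and may differ from what pcca does.

```python
import numpy as np

def standardized_coefficients(Z, coef):
    """Return D @ coef, where D is diagonal with the root mean square of each column of Z."""
    rms = np.sqrt(np.mean(Z ** 2, axis=0))
    return np.diag(rms) @ coef

# Made-up centred data for two characters, and one made-up column of coefficients.
Z = np.array([[ 1.0,  2.0],
              [-1.0, -2.0]])
a = np.array([[0.50],
              [0.25]])

c = standardized_coefficients(Z, a)
```

The second character varies twice as much as the first, so its raw coefficient is doubled and the two standardized coefficients come out equal.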

Structure coefficients (canonical loadings) are simply the correlations between each canonical variable and each original variable. In phylogenetic CCA, the original variables have actually been transformed - using the PGLS transform described above. Structure coefficients are analogous to principal component loadings. Rencher (2002) cautions against the use and interpretation of structure coefficients. I have nonetheless provided the computation of structure coefficients as an option in pcca.
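Structure coefficients can be sketched in Python as the correlations between each transformed variable and each column of canonical scores. The sketch uses ordinary Pearson correlations from numpy, which centre each column on its own mean; whether pcca computes the correlations in exactly this way is an assumption, and the data, coefficients, and function name are made up for illustration.

```python
import numpy as np

def structure_coefficients(Z, coef):
    """Correlations between the columns of Z (variables) and of Z @ coef (canonical scores)."""
    scores = Z @ coef
    p = Z.shape[1]
    R = np.corrcoef(Z, scores, rowvar=False)   # joint correlation matrix, variables then scores
    return R[:p, p:]                           # rows: variables, columns: canonical axes

# Simulated transformed data for three characters and one made-up canonical axis.
rng = np.random.default_rng(1)
Z = rng.normal(size=(20, 3))
coef = np.array([[1.0],
                 [0.0],
                 [0.0]])

loadings = structure_coefficients(Z, coef)
```

Because this made-up axis is just the first character, its loading on that character is 1.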

The modified output from pcca looks as follows:


The interpretation of the output is as above.

Note also that the heading Sp. has been modified to read Obs. for PGLS transformed variates and canonical scores (compare to previous output file). This has been done to reflect the fact that variates and scores, of course, are in terms of the evolutionary differences among species (thus observations for the statistical analysis), rather than in terms of the original species.


Citation

Please cite: Revell, L.J., and A.S. Harrison. 2008. PCCA: A program for phylogenetic canonical correlation analysis. Bioinformatics 24:1018-1020.


Note on Mac OS Installation

A user reports that the Mac OS executable, pcca.bin, does not have appropriate permissions when installed.

To change the permissions on pcca.bin, the user should navigate to the directory containing pcca.bin and execute the following command at the prompt:

chmod a+x pcca.bin

This should rectify any problem with permissions.



Content copyright. Last updated 18 Apr. 2007.