Gene Expression Data


Professor Ashlock briefly described the process by which genes are analyzed. A single organism can have thousands of distinct genes, and we are pretty sure of the purpose of only a few of them. Of the vast remainder, we know that many are "junk" genes, but many others perform useful functions. Our effort to classify these genes will involve matching them, by some structural or physical property, against our set of known genes, and then assuming that genes that "look alike" act alike as well.

For the plant Arabidopsis, we have a number of sets of data. The data itself is obtained by placing pairs of specimens on slides and measuring their fluorescence. Testing each gene results in 14 numbers, so we can think of our data as points in 14-dimensional space.

The original data is available on Professor Ashlock's Gene Expression Toolkit page. The data there is in PCL format; a program called PCL_READ was used to rewrite the data as a simpler text file, in which each result is a single record of 14 real numbers.

The sets of data include several files; the smallest of them, i1650.dat, is the one we will work with below.

Get Some New Data: For your first task, download the smallest data set, i1650.dat. Then you should be able to use the MATLAB load command to read the data into a variable. For our MATLAB work, we have generally assumed that a table of N data values in NDIM dimensions had the form X(NDIM,N) so that each data item was a column of X. Check the dimensions of the X variable that MATLAB reads in. If it's 180 by 14, then you'll need to transpose your data to get it in the right form. You can do this with the simple command:

    x = x';
  
You should now have a table of data of size 14 by 180.
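
Putting those steps together, here is a minimal sketch, assuming the functional form of the load command and that i1650.dat sits in your current directory:

    % Read the raw data; for i1650.dat this gives a 180 by 14 array.
    x = load ( 'i1650.dat' );

    % If each data item came in as a row, transpose so each item is a column.
    [ m, n ] = size ( x );

    if ( m ~= 14 )
      x = x';
    end

    % Confirm the dimensions are now 14 by 180.
    size ( x )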

Use Our Old Code on New Data: Our goal will be to sort the data into a small number of clusters. Let us assume for now that the number of clusters is probably between 10 and 15. Using the i1650.dat data and your current clustering code, try to determine the clusters for the data. It might be a good idea to compute the clusterings for K = 1 to 20, just to make sure your code is working, and to get a feeling for the size of the energy.
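
Exactly what this looks like depends on your own code, but the driver might resemble the following sketch, in which kmeans_cluster is only a hypothetical name standing in for whatever clustering routine you wrote earlier, assumed here to return the cluster centers and the final energy:

    % Sketch only: "kmeans_cluster" is a placeholder name for your own
    % clustering routine, assumed to return centers Z and the final energy.
    e = zeros ( 20, 1 );
    for k = 1 : 20
      [ z, e(k) ] = kmeans_cluster ( x, k );
    end

    plot ( 1:20, e )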

Implement a New Distance Function: Biologists don't use the Euclidean distance to measure the distance between two of these data "points". Instead, they measure similarity, which they define as

    sim ( x, y ) = x dot y / ( ||x|| ||y|| )
  

Now the symbol ||X|| means the Euclidean norm of X, which can be computed as

    ||X|| = Sqrt ( Sum ( 1 <= I <= NDIM ) X(I)^2 )
  
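In MATLAB, for two column vectors, the similarity could be computed as in the sketch below; the small vectors u and v are made up for illustration, not taken from the data:

    u = [ 1; 2; 3 ];
    v = [ 2; 0; 1 ];

    % Dot product divided by the product of the Euclidean norms.
    sim = ( u' * v ) / ( norm ( u, 2 ) * norm ( v, 2 ) )
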
In all our subsequent work, we will always be using normalized data, that is, we'll always be dividing X by its norm. To make life simpler, we should simply normalize our data immediately.
    % Divide each column of X by its Euclidean norm.
    ndim = size ( x, 1 );
    n = size ( x, 2 );

    for i = 1 : n
      x(1:ndim,i) = x(1:ndim,i) / norm ( x(1:ndim,i), 2 );
    end
  
While this will make life easier for us, we will need to be careful that any new data points we read in or create by averaging will also be properly normalized.

Now, with this normalization, the similarity of two vectors is

    sim ( x, y ) = x dot y
                 = x' * y
  
Here we are assuming that each vector X is stored as a column, so that its dimensions have the form 14 by 1. Thus, the transpose has dimensions 1 by 14, and the result of x' * y is a (1x14) times a (14x1), which is a (1x1), that is, a scalar, or number. It's easy to store your data in an incompatible format, or to mistype this formula, so that the results are incorrect.

Because the vectors have unit norm, the similarity score will vary between -1 and 1, with 1 meaning most similar, and -1 most dissimilar. We can use this similarity score, but we need it to be more like a distance, so we write:

    diss ( x, y ) = 1 - sim(x,y)
                  = 1 - x' * y
  
Now diss(x,y) is going to play the role of distance. This means that in the find_closest routine, we take each point X(I) and look for the cluster center Z(J) that minimizes the value of diss(X(I),Z(J)).
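
A minimal sketch of that loop, assuming the data x (14 by n) and the centers z (14 by k) are both already normalized, and recording each choice in a vector cluster, might look like:

    % For each point, find the center that minimizes diss = 1 - x'*z.
    cluster = zeros ( n, 1 );
    for i = 1 : n
      nearest = 1;
      dmin = 1.0 - x(:,i)' * z(:,1);
      for j = 2 : k
        d = 1.0 - x(:,i)' * z(:,j);
        if ( d < dmin )
          dmin = d;
          nearest = j;
        end
      end
      cluster(i) = nearest;
    end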

Once we have gathered all the points X(I) that belong to cluster J, it is time to update the cluster center Z(J). How we do this depends on which clustering algorithm we are using.
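
The exact update comes from your own algorithm, so the following is only a sketch of one common choice (an assumption here, not a prescription): average the points assigned to cluster J and then renormalize that average, since, as noted above, anything we create by averaging must be normalized again.

    % One possible update (a sketch): the normalized average of the members.
    for j = 1 : k
      members = find ( cluster == j );
      if ( ~isempty ( members ) )
        z(:,j) = mean ( x(:,members), 2 );
        z(:,j) = z(:,j) / norm ( z(:,j), 2 );
      end
    end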

Finally, we will also want to revise our definition of the energy of a set of clusters, to be the following:

    E = sum ( each cluster J )
          sum ( points X(I) in cluster J )
            diss ( X(I), Z(J) ) 
  
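With the assignments stored in cluster and the centers in z, this energy could be accumulated as in the following sketch (assuming x and z are both normalized, so that diss reduces to 1 - x'*z):

    % Sum the dissimilarity of each point from its own cluster center.
    e = 0.0;
    for i = 1 : n
      j = cluster(i);
      e = e + ( 1.0 - x(:,i)' * z(:,j) );
    end
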

If you can make these three sets of changes to your code, you should have an algorithm that is considerably better for clustering the gene expression data. But before we start looking closely at our clusters, let's just recompute the energy diagram.

Use Dissimilarity: For your third task, repeat the previous computation, in which you tried to cluster the 180 data points. However, use the biologists' dissimilarity score diss(x,y) to compute distances. Again, compute energies for K = 1 to 20. The computed energies will be different. Does the shape of your energy curve change?

Visualize a Cluster: Life was simple when our data was two dimensional. We could draw a picture of our points that made the clustering obvious. That's not possible in 14 dimensions, but surely there's something we can do to make a picture of our data. To begin with, think of drawing a "picture" of a single point,

    x = [ 0, 2, 0, 4, 0, 6, 0, 8, 0, 10, 0, 12, 0, 14 ] 
  
We can certainly use the command plot(x), which plots the coordinate values against their indices. Now suppose we want to plot x and y and indicate that they are close. If you look at the similarity measure, two vectors don't have to have the same values, just the same relative values; in fact, two vectors are considered identical if, for each index I,
    X(I)/||X|| = Y(I)/||Y||
  
so perhaps the thing to plot is X/||X||. With this in mind, consider making a "similarity plot" of the following 3 vectors:
    -0.04 0.38  0.36 -0.41  0.86 0.61  3.73  0.58 -0.26 0.13 2.12 1.35 -0.47 0.16
     0.14 0.75 -0.38  0.80  0.70 0.59  0.49 -2.06  0.70 0.28 1.08 0.97  0.11 1.07
     0.20 0.25  0.37 -0.22 -0.09 0.17 -0.62  0.20 -2.19 1.28 1.30 1.02 -0.70 0.58
  
(These are the first three entries of i1650.dat.) In your similarity plot, your X scale goes from 1 to 14, and your Y scale from -1 to 1. Conveniently, if you have a two-dimensional array of data X, the MATLAB command plot(x) will plot each column of data as a separate line.
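
As a sketch, those three vectors could be typed in as columns (the numbers are simply copied from the table above), normalized, and plotted:

    a = [ -0.04  0.38  0.36 -0.41  0.86 0.61  3.73  0.58 -0.26 0.13 2.12 1.35 -0.47 0.16;
           0.14  0.75 -0.38  0.80  0.70 0.59  0.49 -2.06  0.70 0.28 1.08 0.97  0.11 1.07;
           0.20  0.25  0.37 -0.22 -0.09 0.17 -0.62  0.20 -2.19 1.28 1.30 1.02 -0.70 0.58 ]';

    % Normalize each column, then plot each column as a separate line.
    for i = 1 : 3
      a(:,i) = a(:,i) / norm ( a(:,i), 2 );
    end

    plot ( a )
    axis ( [ 1, 14, -1, 1 ] )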

Now this is going to make it possible for us to draw plots that suggest the shape of the objects in each cluster. Keep in mind that we don't want to plot X, but rather the normalized X. (That is, the vector X divided by its norm). Secondly, we don't want to plot all the data points, just the ones in a particular cluster.

But that's not too hard to do. Make a vector CLUSTER, and keep it updated so that CLUSTER(137)=6 means that data point X(137) belongs to cluster 6. (This is the index NEAREST that you compute in your nearest point computation.) Then, to make a plot of the things in cluster #6, use the FIND command to make a list C6 of the indices of CLUSTER that are equal to 6, and try the command

    plot(x(1:14,c6))
  
Oops! Forgot to normalize X. But you can take care of that.
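
If you normalized x once, right after loading it, then plot(x(1:14,c6)) is already correct; if not, a sketch that normalizes just the selected columns before plotting might look like:

    c6 = find ( cluster == 6 );

    % Copy the selected columns and make sure each one has unit norm.
    x6 = x(1:14,c6);
    for i = 1 : length ( c6 )
      x6(:,i) = x6(:,i) / norm ( x6(:,i), 2 );
    end

    plot ( x6 )
    axis ( [ 1, 14, -1, 1 ] )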

So, starting with the data in i1650.dat, use 12 clusters. Make similarity plots of the items in each of the 12 clusters. When you get this done, we'll compare our results with things that Professor Ashlock has done, and see if we're way off base.
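
One way to organize the twelve pictures is with subplot; this sketch assumes the K = 12 assignments are already in the vector cluster and that x has been normalized:

    % One picture per cluster, arranged in a 3 by 4 grid of subplots.
    for j = 1 : 12
      cj = find ( cluster == j );
      subplot ( 3, 4, j )
      plot ( x(1:14,cj) )
      axis ( [ 1, 14, -1, 1 ] )
      title ( sprintf ( 'Cluster %d', j ) )
    end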


Last modified on 03 July 2001.