Lecture 4
Applications of Microarrays
October 16, 2001
Lecturer: Larry Ruzzo
Notes: Brian Tjaden
Here, we wrap up our discussion from Lecture 3 on analyzing microarray data to gain insight into the potential functions of large numbers of previously uncharacterized genes (Science, Vol 282, October 23, 1998, by S. Chu et al.). We also look at using microarray expression profiles to discover and predict classes of cancer (Science, Vol 286, October 15, 1999, by T.R. Golub et al.).
The Transcriptional Program of Sporulation in Budding Yeast
In this paper, Chu et al. use microarrays to assay RNA expression levels for ~6200 yeast genes at several time points during sporulation. They found more than 1000 genes whose expression levels changed significantly during sporulation, compared to about 50 genes that had previously been studied and assigned to various stages of the process. Approximately half of these 1000 genes were up-regulated (induced) and half down-regulated (repressed), and the authors grouped them into 7 temporal sporulation-related classes. The analysis thus identifies an order of magnitude more genes potentially associated with the phases of sporulation than were previously known. The authors validate the role of a small number of these genes in sporulation via knockout experiments. They also use their classification to search for binding motifs shared by sets of genes with similar expression patterns, recovering known transcription factor binding motifs as well as potential new ones. One of the main contributions of this work is its demonstration that relatively low-cost microarray experiments can have a high yield: they may take only a few days to perform, and they give a roughly global picture of transcription that can guide further biological experiments.
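To make the induced/repressed distinction concrete, here is a minimal sketch of how genes might be labeled from a time course of log2 expression ratios. The function name, the 2-fold cutoff, and the rule of using the largest change in either direction are illustrative assumptions; the notes do not state the criterion Chu et al. actually used.

```python
import numpy as np

def classify_expression_changes(log2_ratios, threshold=1.0):
    """Label each gene as 'induced', 'repressed', or 'unchanged'.

    log2_ratios: array of shape (n_genes, n_timepoints); each entry is the
        log2 of expression at a time point relative to a reference sample.
    threshold: hypothetical cutoff on |log2 ratio| (1.0 means a 2-fold change).
    """
    log2_ratios = np.asarray(log2_ratios, dtype=float)
    peak_up = log2_ratios.max(axis=1)      # largest increase over the time course
    peak_down = log2_ratios.min(axis=1)    # largest decrease over the time course
    labels = []
    for up, down in zip(peak_up, peak_down):
        if up >= threshold and up >= -down:
            labels.append("induced")
        elif -down >= threshold:
            labels.append("repressed")
        else:
            labels.append("unchanged")
    return labels
```

For example, a gene whose ratios over three time points are `[0, 2, 1]` would be labeled induced, while one with `[0, -3, 0.2]` would be labeled repressed.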
Where do computational and algorithmic challenges come into array analysis?
- Automated sample handling
- Image analysis
- Data storage, data retrieval, and data integration
  - Lots of data (a single Affymetrix chip experiment generates up to 100MB of raw data)
  - Integrating expression results with information from public databases, sequence analysis, etc.
- Visualization tools
- Clustering
- Sequence analysis
  - Similarity search (determine if translation of an unknown/hypothetical gene resembles a known protein)
  - Motif discovery (identify regulatory elements of orthologous genes from different species or similarly expressed genes of the same species using algorithms such as Gibbs sampling and expectation maximization)
- Structure prediction
  - Ascertaining which mutations matter
  - Used in combination with sequence analysis to determine protein similarity
Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring
This paper by Golub et al. presents an approach, using DNA microarrays, to class prediction and class discovery for acute leukemias, namely acute lymphoblastic leukemia (ALL) and acute myeloid leukemia (AML). Clinically, the distinction between these two forms of cancer can be crucial for choosing the appropriate treatment. Golub et al. attempt to answer several key questions about distinguishing cancer classes from microarray data: Are the different forms of acute leukemia molecularly distinct? Are they distinct at the RNA level? Can we predict the distinction from array data? Can we recognize further subdivisions within the 2 classes?
The initial data set consists of 38 bone marrow samples, 27 obtained from patients diagnosed with ALL and 11 from patients diagnosed with AML. RNA was extracted from the samples and hybridized to Affymetrix chips assaying ~6800 human genes. First, Golub et al. identify ~1100 genes whose expression pattern is highly correlated with the class distinction between AML and ALL. Then, they attempt to build a class predictor based on the expression of these genes. Using all ~1100 genes would admit substantial noise, however, so they restrict themselves to the 50 highest-scoring genes. Each of these genes casts a weighted vote, based on its expression in a new sample, for the class that sample belongs to. They found that no single gene predicts perfectly but many genes are very good predictors, so a weighted vote over the most informative genes is very effective.
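The weighted voting idea can be sketched as follows. The notes do not give the formulas, so everything specific here is an assumption made for illustration: the signal-to-noise gene score (difference of class means over the sum of class standard deviations), the per-gene vote (score times the distance of the new sample from the midpoint of the class means), the prediction-strength threshold of 0.3, and the choice of the top genes by absolute score. The function names are hypothetical.

```python
import numpy as np

def train_weighted_voting(X, y, n_genes=50):
    """X: (n_samples, n_genes_total) expression matrix; y: array of 0/1 labels."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    mu0, mu1 = X[y == 0].mean(axis=0), X[y == 1].mean(axis=0)
    sd0, sd1 = X[y == 0].std(axis=0), X[y == 1].std(axis=0)
    # Assumed signal-to-noise score; the epsilon avoids division by zero.
    score = (mu1 - mu0) / (sd0 + sd1 + 1e-9)
    # Keep the genes with the largest absolute score (a simplifying assumption).
    top = np.argsort(-np.abs(score))[:n_genes]
    return {"genes": top,
            "weights": score[top],
            "midpoints": (mu0[top] + mu1[top]) / 2}

def predict_weighted_voting(model, x, threshold=0.3):
    """Return (label, strength); label is None when the call is 'uncertain'."""
    x = np.asarray(x, dtype=float)
    # Positive votes favor class 1, negative votes favor class 0.
    votes = model["weights"] * (x[model["genes"]] - model["midpoints"])
    v1 = votes[votes > 0].sum()
    v0 = -votes[votes < 0].sum()
    total = v1 + v0
    if total == 0:
        return None, 0.0
    strength = abs(v1 - v0) / total
    if strength < threshold:
        return None, strength
    return (1 if v1 > v0 else 0), strength
```

Returning `None` below the threshold mirrors the "uncertain" calls described in the notes: the classifier declines to predict when the margin between the two vote totals is small.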
They use leave-one-out cross validation (LOOCV) to test the validity of their predictions: each sample in turn is withheld, a predictor is built from the remaining 37, and the withheld sample is classified. They find that 36 of the 38 samples are correctly classified; the remaining two are labeled uncertain because the weighted vote did not exceed the classification threshold. Further, they used all 38 samples to build a classifier and tested it on 34 independent samples obtained from a broad range of acute leukemia patients under a range of conditions. Again the classifier made no errors: it correctly predicted 29 samples and labeled the remaining 5 as uncertain, so its accuracy was 100% on the samples for which it made a call. While their approach meets with considerable success, it should be noted that further experiments indicate that the data set is remarkably clean, and alternative methods also tend to perform quite well at distinguishing the samples.
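A leave-one-out loop can be sketched generically as below. The `fit` and `predict` callables are placeholders for any classifier, and the nearest-centroid classifier is only a simple stand-in so that the example runs on its own; it is not the weighted voting scheme from the paper. All names are hypothetical.

```python
import numpy as np

def leave_one_out(X, y, fit, predict):
    """Hold out each sample once, train on the rest, and tally the calls."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    counts = {"correct": 0, "uncertain": 0, "wrong": 0}
    for i in range(len(y)):
        keep = np.arange(len(y)) != i
        # Any gene selection must happen inside fit, on the training fold only.
        model = fit(X[keep], y[keep])
        call = predict(model, X[i])
        if call is None:
            counts["uncertain"] += 1
        elif call == y[i]:
            counts["correct"] += 1
        else:
            counts["wrong"] += 1
    return counts

def fit_centroids(X, y):
    """Stand-in classifier: one mean expression profile per class."""
    return {label: X[y == label].mean(axis=0) for label in np.unique(y)}

def predict_centroid(model, x, margin=0.0):
    """Nearest centroid; returns None (uncertain) when the two best are too close.

    Assumes the training fold contains at least two classes.
    """
    dists = sorted((np.linalg.norm(x - c), label) for label, c in model.items())
    (best_d, best), (second_d, _) = dists[0], dists[1]
    return None if second_d - best_d < margin else best
```

A call such as `leave_one_out(X, y, fit_centroids, predict_centroid)` returns the number of correct, uncertain, and wrong calls, which is the form in which the results above are reported (36 correct and 2 uncertain out of 38).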
Finally, Golub et al. address the issue of class discovery. If the two classes AML and ALL were not known in advance, could they be learned from the data? Could we determine that there are in fact two classes as opposed to three or more? Using the entire expression profile (6800 genes) of the 38 samples, they cluster the samples into two classes via self-organizing maps (SOMs). The resulting two clusters correspond well to ALL (24 of the 25 samples in the first cluster) and AML (10 of the 13 samples in the second cluster). In addition, they ran their SOM clustering algorithm with four classes in an attempt to identify finer subclasses of the leukemias. Of the 4 clusters, 2 largely corresponded to ALL derived from B-cells, 1 to ALL derived from T-cells, and 1 to AML.
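For readers unfamiliar with SOMs, the following is a minimal sketch of a one-dimensional self-organizing map with a handful of nodes, where each node ends up representing one cluster of samples. The linear decay of the learning rate and neighborhood radius, the Gaussian neighborhood function, and all names and default parameters are assumptions chosen for illustration; the notes do not describe the SOM configuration used in the paper.

```python
import numpy as np

def train_som(X, n_nodes=2, n_iter=500, lr0=0.5, seed=0):
    """Fit a 1-D SOM; returns an (n_nodes, n_features) array of node profiles."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # Initialize the nodes at randomly chosen, distinct samples.
    nodes = X[rng.choice(len(X), size=n_nodes, replace=False)].copy()
    grid = np.arange(n_nodes)              # node positions along the 1-D map
    for t in range(n_iter):
        frac = t / n_iter
        lr = lr0 * (1 - frac)              # learning rate decays linearly
        radius = max(n_nodes / 2 * (1 - frac), 1e-3)   # neighborhood shrinks
        x = X[rng.integers(len(X))]        # pick one sample at random
        winner = np.argmin(np.linalg.norm(nodes - x, axis=1))
        # The winner moves most; its map neighbors move less.
        influence = np.exp(-((grid - winner) ** 2) / (2 * radius ** 2))
        nodes += lr * influence[:, None] * (x - nodes)
    return nodes

def assign_clusters(X, nodes):
    """Assign each sample to its nearest SOM node."""
    X = np.asarray(X, dtype=float)
    d = np.linalg.norm(X[:, None, :] - nodes[None, :, :], axis=2)
    return d.argmin(axis=1)
```

Calling `train_som` with `n_nodes=2` and then `assign_clusters` corresponds to the two-class experiment described above, and `n_nodes=4` to the search for finer subclasses. Comparing the cluster assignments with the known ALL/AML labels afterwards is how one would arrive at counts like "24 of 25".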