University of Washington Computer Science & Engineering
 CSE 527, Au '04: Reading #3: What Students Found

I asked for brief reports on good microarray papers. Here's what you found:
Trevor Hastie, Robert Tibshirani, Michael B. Eisen, Ash Alizadeh, Ronald Levy, Louis Staudt, Wing C. Chan, David Botstein and Patrick Brown, 'Gene shaving' as a method for identifying distinct sets of genes with similar expression patterns, Genome Biology 2000, 1(2):research0003.1-0003.21

The technique of gene shaving is useful when trying to locate similar genes. Unlike other techniques for gene clustering, gene shaving not only attempts to group together genes with similar expression patterns, but also attempts to maximize the variability of expression patterns across conditions. It accomplishes this by applying principal component analysis to nested subsets of the gene expression data, computing the correlation of each gene expression to the first principal component, discarding the 10% of genes least correlated, and repeating the process on the remaining genes, until only one gene remains. The entire process is repeated after an orthogonalization step. Gene shaving can be unsupervised or supervised, making it a flexible algorithm. In this article, gene shaving is used to predict patient survival times for B-cell lymphoma.
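The shaving loop itself is simple to express. Below is a minimal Python/NumPy sketch of one pass, assuming rows are genes and columns are conditions; the paper's full procedure adds orthogonalization between passes and chooses the cluster size with a gap statistic, neither of which is shown here, and the function name and toy data are mine.

```python
import numpy as np

def shave_one_cluster(X, drop=0.10):
    """One gene-shaving pass: repeatedly discard the 10% of genes least
    correlated with the first principal component of the current subset.
    (A sketch; the paper adds orthogonalization and a gap-statistic
    choice among the nested subsets.)"""
    genes = np.arange(X.shape[0])
    nested = [genes.copy()]
    while len(genes) > 1:
        sub = X[genes] - X[genes].mean(axis=1, keepdims=True)
        _, _, vt = np.linalg.svd(sub, full_matrices=False)
        pc1 = vt[0]                    # profile of the leading principal component
        corr = np.abs([np.corrcoef(row, pc1)[0, 1] for row in sub])
        keep = max(1, int(len(genes) * (1 - drop)))
        genes = genes[np.argsort(corr)[::-1][:keep]]   # shave the least correlated
        nested.append(genes.copy())
    return nested

X = np.random.randn(200, 12)           # toy data: 200 genes x 12 conditions
subsets = shave_one_cluster(X)         # nested gene subsets, largest to smallest
```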
I enjoyed reading this article. The article clearly explains the algorithm, its variants and its application. Although it is more mathematical than other biology/microarray articles, it is accessible to people in the life sciences who have had a basic introduction to clustering concepts. I especially liked that it clearly states that to show a technique is promising, one must test the null hypothesis. It is not enough to show that a technique identifies patterns in the data. One must also show that the identified patterns indicate correlation beyond what a simple random grouping would produce.

Mahlet G. Tadesse, Joseph G. Ibrahim, A Bayesian Hierarchical Model for the Analysis of Affymetrix Arrays, http://www.annalsnyas.org/cgi/content/full/1020/1/41s

Microarray analysis is often complicated by the high dimensionality of the data and the sensitivity limits of the technology. For instance, the Hu6800 GeneChip arrays do not reliably quantify low and very high levels of expression. The standard methods of analysis omit genes whose transcripts are beyond the detection limits. This leads to the removal of many genes on the threshold of detection, which is problematic because some important mRNAs that cause significant changes inside cells often have low total abundance. In this paper, the authors proposed accounting for expression readings that are beyond the limits of reliable detection by modeling them as censored data. In addition, they suggested using a Box-Cox transformation to express the variation in gene expression levels, which is further decomposed as a linear combination of tissue type effect, gene effect and gene-tissue interactions. These parameters are then used in the likelihood function. In their Bayesian model, hierarchical priors were specified to induce correlation between gene effects as well as between tissue types for a given gene.
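The censoring idea is easy to illustrate outside the full Bayesian model. Here is a minimal sketch, assuming Gaussian errors and known detection limits, of how readings at the limits contribute tail probabilities rather than densities to the likelihood; the function name and the grid search are mine, not the authors'.

```python
import numpy as np
from scipy.stats import norm

def censored_loglik(y, mu, sigma, lo, hi):
    """Log-likelihood of Gaussian readings censored at detection limits [lo, hi]."""
    ll = 0.0
    for v in y:
        if v <= lo:
            ll += norm.logcdf(lo, mu, sigma)   # left-censored: only know y <= lo
        elif v >= hi:
            ll += norm.logsf(hi, mu, sigma)    # right-censored: only know y >= hi
        else:
            ll += norm.logpdf(v, mu, sigma)    # reliably measured reading
    return ll

# readings clipped to a detector range [1, 10]; scan candidate means
readings = [1.0, 2.3, 4.1, 10.0, 10.0]
best_mu = max(np.linspace(0, 15, 151),
              key=lambda m: censored_loglik(readings, m, 2.0, 1.0, 10.0))
```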
They applied the proposed model to a set of publicly available leukemia data. As a result they identified 42 differentially expressed genes that match the ones selected by another study, as well as a number of interesting genes that the other study did NOT identify, such as azurocidin and neutrophil elastase, which are regulated during hematopoietic differentiation, and amphiregulin, which inhibits the growth of certain aggressive carcinoma cell lines.
I enjoyed reading this paper because I was just studying censored data analysis in BioStat 517, and I thought this group's "statistical" approach to quantifying expression limits is quite novel. Unfortunately the authors only used one set of data for testing their model, which seems a bit insufficient for demonstrating its validity.

Yoonkyung Lee and Cheol-Koo Lee, Classification of multiple cancer types by multicategory support vector machines using gene expression data, Bioinformatics, Vol. 19, no. 9, 2003, pp. 1132-1139

This paper looks at a generalization of support vector machines (SVMs) to deal with multicategory classification. That is, SVMs are excellent classifiers, but the traditional use is as a binary classifier. This paper uses a generalization to a k-way classifier and compares the results. The multicategory SVM (MSVM) does have a higher computational complexity than running a set of k binary SVMs, but is still an appealing approach as it considers all classes simultaneously, which may help with more overlapping classes. The results shown are for classifying types of leukemia and for small round blue cell tumors, and are quite promising in both cases.
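For context, the conventional baseline the MSVM is compared against can be sketched in a few lines; the following is a toy one-vs-rest setup in Python with scikit-learn on synthetic data, not the authors' formulation or their datasets.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

# toy stand-in for an expression matrix: samples x genes, k = 4 tumor types
X, y = make_classification(n_samples=80, n_features=50, n_informative=10,
                           n_classes=4, n_clusters_per_class=1, random_state=0)

# k binary one-vs-rest SVMs; the MSVM instead solves one joint optimization
ovr = OneVsRestClassifier(SVC(kernel="linear", C=1.0))
print(cross_val_score(ovr, X, y, cv=5).mean())
```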
Suggested extensions: One idea I had was that they should have used a more sophisticated gene selection method, only to discover later that the authors themselves suggest this in the conclusions. I would also have preferred to see more information on confidence in the leukemia section; they mentioned it only for the small round blue cell tumors. Finally, their gene selection method for the small round blue cell tumors picked 16 of the top 20 genes that matched a list of 96 genes in an earlier work on the same data. But little was said about the 4 that were not in agreement with the list of 96, other than that their function is not yet well understood. That seems to be an area for further exploration: why were these genes selected, and how do they impact the results, given that the earlier paper didn't identify them despite accepting many more genes as significant?

Arvind Rao, A clustering algorithm for gene expression data using wavelet packet decomposition, Conference Record of the Thirty-Sixth Asilomar Conference on Signals, Systems and Computers, 2002.

The idea of the paper is neat. Conventionally in microarray analysis, different genes have different time behaviors, so clustering is based on the time-domain profile. People have tried Fourier analysis, but with no big improvement, because only very few time points can be achieved in the experiments. This paper tries to combine the two methods using the idea of wavelet packet decomposition. As in other approaches, K-means is eventually used for clustering.
One question motivated by this paper is the notion of feature selection. Normally we can use the time series as they are, or use the FFT, or use wavelet packet decomposition. We can also try 1) normalizing these conventional features; 2) other new features, such as short-time FFT combined with long-time FFT. (A small sketch of the wavelet-packet variant follows.)
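Here is a minimal sketch of wavelet-packet feature extraction followed by K-means, assuming the PyWavelets package; the wavelet, decomposition level, cluster count, and toy data are illustrative choices of mine, not the paper's.

```python
import numpy as np
import pywt
from sklearn.cluster import KMeans

def wp_features(profile, wavelet="db1", level=2):
    """Flatten the level-2 wavelet packet coefficients into a feature vector."""
    wp = pywt.WaveletPacket(data=profile, wavelet=wavelet, maxlevel=level)
    return np.concatenate([node.data for node in wp.get_level(level, order="natural")])

profiles = np.random.randn(100, 16)        # toy: 100 genes x 16 time points
feats = np.array([wp_features(p) for p in profiles])
labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(feats)
```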

Pierre Baldi, On the convergence of a clustering algorithm for protein-coding regions in microbial genomes, Bioinformatics Vol. 16 no. 4 2000 pp. 367-371

This paper discusses a previously presented method for predicting protein-coding regions in microbial DNA sequences. (This method was first described by Audic and Claverie in 1998 in "Self-identification of protein-coding regions in microbial genomes" in Proc. Natl Acad. Sci USA, 95, 10026-10031.)
Unlike other methods, this method does not require a training set or any prior knowledge of the statistical properties of the genome under study. It is essentially a clustering, or self-organizing, approach that uses all the available unannotated genomic data for its calibration.
The simplified description of how this algorithm works goes as follows:
* The genomic sequences are considered to result from n Markov models of order k (typically k=5), each one responsible for a different "kind" of non-overlapping subsequences.
* In order to detect protein-coding regions, a natural value for n is 3, corresponding to 3 different regions: (1) coding on the direct strand, (2) coding on the complementary strand, and (3) non-coding.
* The available genomic sequences are then cut into non-overlapping fragments of length w (typically w=100). The resulting sequences are randomly partitioned amongst the three models, and the three Markov models are initialized accordingly, in a semi-random fashion.
* The algorithm then proceeds iteratively by cycling through all the available fragments. At each cycle, a fragment W is assigned to one of the 3 classes depending on the highest posterior probability p(Mi|W), where Mi is the ith model (i=1,2,3). The parameters of each Markov model are then updated using all the sequences assigned to the corresponding sub-model. (A small sketch of this loop appears below.)
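The assign-then-refit loop can be sketched compactly. The following Python sketch uses order-2 models and equal class priors (so the maximum-posterior assignment reduces to maximum likelihood); the paper uses k=5, and everything else here — names, pseudocounts, iteration count — is my simplification.

```python
import numpy as np
from collections import defaultdict

ALPHABET = "ACGT"

def fit_markov(seqs, k=2):
    """Order-k Markov model (context -> next-base probabilities), add-one smoothed."""
    counts = defaultdict(lambda: np.ones(4))
    for s in seqs:
        for i in range(k, len(s)):
            counts[s[i - k:i]][ALPHABET.index(s[i])] += 1
    return {ctx: c / c.sum() for ctx, c in counts.items()}

def loglik(s, model, k=2):
    ll = 0.0
    for i in range(k, len(s)):
        p = model.get(s[i - k:i], np.full(4, 0.25))  # uniform for unseen contexts
        ll += np.log(p[ALPHABET.index(s[i])])
    return ll

def cluster(fragments, n_models=3, k=2, iters=10, seed=0):
    """Hard-assignment ('k-means-like') loop: assign each fragment to the most
    likely model, then refit each model on its assigned fragments."""
    rng = np.random.default_rng(seed)
    assign = rng.integers(n_models, size=len(fragments))   # semi-random initialization
    for _ in range(iters):
        models = [fit_markov([f for f, a in zip(fragments, assign) if a == m], k)
                  for m in range(n_models)]
        assign = np.array([np.argmax([loglik(f, mo, k) for mo in models])
                           for f in fragments])
    return assign, models
```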
The author then notes that this algorithm clearly corresponds to a mixture of three Markov models of order k, where the mixing coefficients represent the proportion of sequences in each class. (Apparently similar mixtures of HMMs have been used to model protein sub-families -- "Hidden Markov models in computational biology: applications to protein modeling" by Krogh et al., 1994)
After further details about the EM algorithm he notes that in fact the algorithm described above is an approximation to EM, in which probabilities are thresholded to 1 or 0, and that in the clustering literature this is referred to as k-means.
Another interesting comment relates to why Markov models of order 5 make sense in this context. Apparently k=5 seems to be optimal because of the "well-recognized and important differences between DNA hexamer statistics in coding and non-coding regions". ("Assessment of protein coding measures" by Fickett and Tung, 1992.)
In conclusion, the author notes that the clustering method analysed here seems to work well with bacterial genomes where coding regions often represent more than 90% of the total DNA. The extension of these methods to eukaryotic genomes, where the fraction of coding sequences is often less than 10%, remains a challenge.

Sung-Bae Cho and Hong-Hee Won, Machine Learning in DNA Microarray Analysis, http://portal.acm.org.offcampus.lib.washington.edu/citation.cfm?id=820213&coll=ACM&dl=ACM

This was an interesting paper on the use of an ensemble-based approach to machine-learning applications in microarray analysis, specifically in identifying cancer-related genes. After a brief overview of microarrays and how they operate, the authors jump into techniques for classifying gene expression data. They cover the use of neural networks (multi-layer perceptrons), k-nearest neighbor and self-organizing maps, among others, and how they can be applied to classification problems.
The key part of the paper is how the authors combined several classification methods and, using a majority-voting consensus system, generated scores to classify cancer-related genes. It was the authors' assumption that by using multiple classification algorithms, they avoid weaknesses that may be inherent in any one algorithm alone. The authors use cosine coefficient, Euclidean distance, information gain, mutual information, signal-to-noise ratio, and Spearman's and Pearson's correlation coefficients as feature selection methods in conjunction with the classifiers. The authors found that the best-performing classifiers among the group were MLP and KNN.
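The majority-voting idea is straightforward to sketch with off-the-shelf classifiers; the following Python/scikit-learn snippet is a generic illustration on synthetic data, with stand-ins for the paper's MLP, KNN and SVM components and none of its feature selection.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=100, n_features=30, random_state=0)

# 'hard' voting = majority vote over the member classifiers' predictions
vote = VotingClassifier([("mlp", MLPClassifier(max_iter=2000, random_state=0)),
                         ("knn", KNeighborsClassifier(n_neighbors=5)),
                         ("svm", SVC())], voting="hard")
print(cross_val_score(vote, X, y, cv=5).mean())
```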
I found this paper engaging for several reasons. For one, at a top level, it's similar to research I've worked on related to consensus-based gene-finding algorithms. One piece of information gleaned through that research is that consensus-based systems generally perform slightly better than the best single algorithm (in this case, MLP and KNN). Also, as the authors here noted, consensus-based systems are relatively easy to implement and generally perform better than any single algorithm alone. Another reason I liked this paper was that it gave me a general overview of many techniques that can be used to classify genes in microarrays. While it did not go into great detail about any one technique, it did pique my interest in investigating further the techniques it mentioned.

Barash et al., Comparative analysis of algorithms for signal quantitation from oligonucleotide microarrays, Bioinformatics, 2004, 20:839-846

The authors of this paper compared three different algorithms for the quantitation of microarray expression signals: dChip, RMA and MAS5. I found the statistical analysis interesting. While the signal-processing capabilities were tested on an actual array from Affymetrix, it would perhaps be more informative to investigate multiple data sets from multiple sources, from real genetic material to artificial sets, to verify that the results hold.

Qin J, Lewis DP, Noble WS, Kernel hierarchical gene clustering from microarray expression data, Bioinformatics. 2003 Nov 1;19(16):2097-104.

This computational paper emphasizes an unsupervised analysis of microarray expression data to find groups of similarly expressed genes and/or gene expression experiments. It uses a hierarchical clustering algorithm that works with a kernel function to map the data into a high-dimensional feature space, motivated by earlier work showing that gene expression data contains informative, higher-order features which may not be apparent in the raw data (Brown et al. 2000). The algorithm's utility was evaluated by both internal (measuring the learnability of a given set of clusters) and external (comparing the given tree to an external collection of possibly overlapping clusters) validation, revealing that it produces results different from standard hierarchical clustering. Unfortunately, the difference did not amount to an improvement, so the method did not provide any new biological insight.
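One standard way to run hierarchical clustering in a kernel-induced feature space — a sketch of the general technique, not necessarily the authors' exact procedure — is to convert kernel values into feature-space distances via d(x,y)^2 = k(x,x) + k(y,y) - 2k(x,y) and hand those to an ordinary linkage routine:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

def kernel_hierarchical(X, gamma=0.1, n_clusters=4):
    """Average-linkage clustering with distances induced by an RBF kernel."""
    sq = np.sum(X**2, axis=1)
    K = np.exp(-gamma * (sq[:, None] + sq[None, :] - 2 * X @ X.T))  # Gram matrix
    d2 = np.clip(K.diagonal()[:, None] + K.diagonal()[None, :] - 2 * K, 0, None)
    Z = linkage(squareform(np.sqrt(d2), checks=False), method="average")
    return fcluster(Z, n_clusters, criterion="maxclust")

labels = kernel_hierarchical(np.random.randn(50, 20))   # toy: 50 genes x 20 arrays
```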

Bakewell DJ and E Wit, Weighted analysis of microarray gene expression using maximum likelihood, Bioinformatics, 2004, pp 1-12

This article demonstrates how a maximum likelihood estimation (MLE) hierarchical model allows for improved detection of differential gene expression from microarray data. While biologists who gather microarray data typically consider only one of the spot statistics available (mean, median, or mode), the MLE model takes into account spot mean, variance, and pixel number to more accurately quantify gene expression levels. The article is rich in technical details, which explain the underlying mathematics behind the model. One advantage of this technique is that it can more accurately quantify gene expression from non-uniform spots (which often look like donuts). If "user-friendly" software that implements this algorithm is made available to biologists, they will have a more powerful tool with which to interpret their microarray data.
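The core intuition — that a spot with many pixels and low variance should count for more than a noisy one — can be illustrated with a simple precision-weighted combination of replicate spots. This is my toy sketch of the weighting idea, not the paper's full hierarchical likelihood.

```python
import numpy as np

def weighted_spot_estimate(means, variances, n_pixels):
    """Combine replicate spot means, weighting each by its precision n/variance
    (the standard error of a spot mean shrinks with pixel count)."""
    w = np.asarray(n_pixels) / np.asarray(variances)
    return np.sum(w * np.asarray(means)) / np.sum(w)

# a noisy 'donut' spot (high variance, few usable pixels) is down-weighted
print(weighted_spot_estimate(means=[5.1, 4.9, 7.0],
                             variances=[0.2, 0.3, 4.0],
                             n_pixels=[80, 75, 40]))
```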

Reverter, A., et al., A mixture model-based cluster analysis of DNA microarray gene expression data on Brahman and Brahman composite steers fed high-, medium-, and low-quality diets, Journal of Animal Science, 2003, pp. 1900-1910

This paper investigated methods for applying a power transformation to the intensity levels so that they would have a normal distribution. It turned out that the base-2 logarithm is the best one. Although a GMM can approximate any kind of distribution given infinite data, the transformation is needed to use the data more efficiently. The GMMs were trained using EM. However, I think it could be better if they did discriminative training, such as minimum classification error (MCE), right after the EM training.
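The transform-then-fit pipeline is a few lines with standard tools; here is a minimal Python/scikit-learn sketch on synthetic intensities (the component count and data are illustrative, and the MCE refinement suggested above is not shown).

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
intensities = rng.lognormal(mean=6.0, sigma=1.0, size=(500, 1))  # toy raw intensities

logged = np.log2(intensities)              # the transformation the paper favors
gmm = GaussianMixture(n_components=3, random_state=0).fit(logged)  # EM training
clusters = gmm.predict(logged)
```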

Jörg Rahnenführer and Daniel Bozinov, Hybrid clustering for microarray image analysis combining intensity and shape features, BMC Bioinformatics 2004, 5:47

The hybrid intensity/shape image analysis algorithm presented in this paper seems to do well over a wide range of image qualities. In particular, in comparison to the widely used Spot software, it seems to be less sensitive to high-intensity outlier pixels. Its ability to factor in low-intensity spots also seems to help distinguish more uniformly sized spots. I liked how their shape analysis via masking filtered out a number of artifacts that would otherwise have remained. This extra step of analysis seems to provide a better separation between foreground and background pixels. My one complaint is that the paper didn't address how the improved spot analysis might impact gene expression results. It would be nice to take a well-known paper and repeat the gene analysis with this new image analysis as a first step, to see how the gene results are affected. How important is this first image-analysis step to the entire microarray process?

Ihmels J, Bergmann S, Barkai N, Defining transcription modules using large-scale gene expression data, Bioinformatics. 2004 Sep 1;20(13):1993-2003

Application of standard clustering methods to large-scale microarray data has several limitations: first, each gene will be assigned to one cluster (an exception is using fuzzy membership; however, the total of the membership function for each gene is still one), even though in fact genes may participate in more than one cellular process and therefore should be included in multiple clusters. Second, genes are classified on the basis of co-regulation under all the experimental conditions, and because genes are typically co-regulated only in specific experimental contexts, data under irrelevant conditions may just act as noise. Third, some algorithms are slow and thus unsuitable to deal with large-scale data.
The authors devised an iterative signature algorithm for the analysis of large-scale expression data. A transcription module is defined as a self-consistent unit consisting of a set of co-regulated genes and an associated set of regulating conditions, so different modules can have overlapping genes and conditions. A signature algorithm that was proposed previously by the same authors (Nat Genet. 2002 Aug;31(4):370-7) is then applied iteratively to identify these modules. The signature algorithm has two steps:
1. Select the conditions under which the input genes are co-regulated most tightly;
2. Select from the whole genome the genes that show a significant and consistent change in expression over all the conditions selected in step 1.
These two steps are repeated, somewhat like the EM algorithm. A range of thresholds is used for the selection to reveal the tightness of co-regulation, or the hierarchical modular decomposition of the expression data. The results are displayed in both layered and branched representations. (A small sketch of the iteration follows.)
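A much-simplified sketch of the iteration, in Python/NumPy: thresholds are applied to z-scores of averaged profiles, and a fixed point is taken as a module. The scaling of thresholds by 1/sqrt(n), the convergence test, and all parameter values are my simplifications, not the authors' exact scheme.

```python
import numpy as np

def isa(E, seed_genes, t_gene=2.0, t_cond=2.0, iters=50):
    """Alternate between scoring conditions over the current gene set and
    scoring genes over the current condition set, thresholding each score."""
    Z = (E - E.mean(axis=0)) / E.std(axis=0)       # per-condition z-scores
    genes = np.asarray(seed_genes)
    conds = np.arange(E.shape[1])
    for _ in range(iters):
        c_score = Z[genes].mean(axis=0)            # step 1: condition scores
        conds = np.where(np.abs(c_score) > t_cond / np.sqrt(len(genes)))[0]
        if len(conds) == 0:
            break
        g_score = Z[:, conds].mean(axis=1)         # step 2: gene scores
        new_genes = np.where(np.abs(g_score) > t_gene / np.sqrt(len(conds)))[0]
        if len(new_genes) == 0 or set(new_genes) == set(genes):
            break                                  # converged to a module
        genes = new_genes
    return genes, conds

E = np.random.randn(1000, 60)                      # toy genes x conditions matrix
module_genes, module_conds = isa(E, seed_genes=np.arange(20))
```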
The authors applied the method to a large dataset of over 1000 genome-wide expression profiles of the yeast S.cerevisiae to identify the modules at different resolutions. They also compared their results with the clusters generated by other common methods (Pair-wise average linkage, K-means, SOM, SVD, Bi-clustering, and Coupled two-way clustering), and concluded that their method is superior in terms of self-consistency of clusters and biological figure of merit.

Advantages:
1. The definition of transcription module associates genes with specific contexts
2. Allows for overlapping genes and conditions
3. Avoids full partitioning of the data
4. Computationally efficient (linear in the size of the data), good for large-scale data

Critiques:
1. By definition, the results generated by their method are self-consistent, and thus it is unfair to compare it with other methods using this quantity
2. Random initialization of modules may ignore possible solutions
3. Needs more experiments for validation

K. Y. Yeung and W. L. Ruzzo, Principal components analysis for clustering gene expression data, Bioinformatics, Vol. 17(9), pp. 763-774, 2001

This paper analyzes the effectiveness of PCA for clustering gene expression data by comparing the clusters obtained using principal components with the clusters obtained using the original data. Their results show that (surprisingly?) clustering with principal components is not (necessarily) a good idea.
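The comparison is easy to reproduce in spirit: cluster the raw data and a low-dimensional PCA projection, then measure each clustering's agreement with external labels. A toy Python/scikit-learn sketch on synthetic data, not the paper's data sets or its exact methodology:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.metrics import adjusted_rand_score

X, truth = make_blobs(n_samples=300, n_features=50, centers=5, random_state=0)

raw = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X)
pcs = PCA(n_components=2).fit_transform(X)       # keep only the first 2 PCs
red = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(pcs)

print("raw:", adjusted_rand_score(truth, raw),
      "PCs:", adjusted_rand_score(truth, red))   # agreement with known classes
```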

T.R. Golub, D.K. Slonim, P. Tamayo et al., Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring, Science: October 15, 1999; 286: 531-537

One thing I found especially interesting about this paper is that it approached the same problem of cancer classification from two different directions. First, using neighborhood analysis, it tried to find genes with expression patterns highly correlated with known classifications to make a prediction model. It also did the reverse, using self-organizing maps to find classes using all the expression data. The paper seemed to get similar results approaching the problem from both directions, which helps validate their results. I also liked their suggestion of trying to find tissue-independent classes of cancers by removing tissue-dependent genes and then trying to find classes in data from many different types of cancer.
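The gene-ranking half of this is compact enough to sketch. Golub et al.'s class correlation is P(g,c) = (mu1 - mu2)/(sigma1 + sigma2) per gene; the toy data and top-50 cutoff below are mine (the 27/11 split mirrors their ALL/AML training set).

```python
import numpy as np

def signal_to_noise(X, labels):
    """P(g,c) = (mu1 - mu2) / (sigma1 + sigma2) for each gene (row of X),
    given a binary class labeling of the samples (columns)."""
    a, b = X[:, labels == 0], X[:, labels == 1]
    return (a.mean(axis=1) - b.mean(axis=1)) / (a.std(axis=1) + b.std(axis=1))

X = np.random.randn(1000, 38)                    # toy: 1000 genes x 38 samples
labels = np.array([0] * 27 + [1] * 11)           # e.g. ALL vs AML
top = np.argsort(np.abs(signal_to_noise(X, labels)))[::-1][:50]  # most class-correlated
```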

Ideker T et al., Testing for Differentially-Expressed Genes by Maximum-Likelihood Analysis of Microarray Data, J Comput Biol. 2000; 7(6):805-17

In this article, the authors use maximum likelihood analysis to identify the genes that are differentially expressed when yeast S. cerevisiae were placed in two different types of media. Gene expression was examined using a cDNA microarray containing spots for 6200 genes and then analyzed using the maximum likelihood method to determine if the true means of the spot intensities for each gene were the same for the two conditions.
Briefly, the authors used maximum likelihood estimation (MLE) to estimate the model parameters, including spot intensity means and the probability density function parameters. Then, the model parameters and data were used in the generalized likelihood ratio test to define a statistical parameter, λi. Next, a threshold value λc was chosen based on an extensive set of control experiments. Finally, genes with λi greater than λc were considered to be differentially expressed. Among the differentially expressed genes with the highest λi were those involved in galactose metabolism; the gene set also partially overlapped with those identified using fold-expression analysis.
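As a toy stand-in for the idea — not the paper's error model, which has more parameters — here is a generalized likelihood ratio statistic for "same mean" versus "different means" under a shared-variance Gaussian model:

```python
import numpy as np
from scipy.stats import chi2

def glrt_lambda(x1, x2):
    """-2 log likelihood ratio for common-mean vs separate-means Gaussian models
    with a shared (MLE) variance; large values suggest differential expression."""
    x1, x2 = np.asarray(x1, float), np.asarray(x2, float)
    x = np.concatenate([x1, x2])
    n = len(x)
    var0 = np.var(x)                                       # null: one common mean
    var1 = (np.sum((x1 - x1.mean())**2) + np.sum((x2 - x2.mean())**2)) / n
    lam = n * (np.log(var0) - np.log(var1))
    return lam, chi2.sf(lam, df=1)                         # asymptotic p-value

lam, p = glrt_lambda([4.1, 4.3, 3.9], [5.2, 5.6, 5.1])     # toy replicate intensities
```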
Overall, this use of maximum likelihood analysis was different than the uses we had discussed in class and provided a reasonable statistical method for analyzing differential expression. The method does require a reasonably high number of spots, corresponding to a large number of arrays or samples, which may be a problem for some labs or situations, despite the authors’ assurances. There are also some issues with choosing initial values for the MLE and perhaps with some other assumptions about the data.

Michael P. S. Brown, William Noble Grundy, David Lin, Nello Cristianini, Charles Walsh Sugnet, Terrence S. Furey, Manuel Ares Jr., and David Haussler, Knowledge-based analysis of microarray gene expression data by using support vector machines, Proceedings of the National Academy of Sciences of the United States of America, January 2000, 97: 262-267

This paper reviews the application of Support Vector Machine (SVM) algorithms as applied to microarray data, and compares the usefulness of the method compared to other techniques such as clustering or self-organizing maps. SVMs can take advantage of a pre-existing dataset to prime the algorithm in terms of which data should tend to cluster together; such datasets are available from multiple sources. By applying training data, the SVM can denote new (or pre-existing) genes as member/non-members of a class. SVMs also have an ability lacking in other algorithms: by using prior knowledge combined with very high-dimensional space analysis, correlations between gene measurements can be considered.
Experimental runs using the SVM technique were applied to 79-element gene expression vectors for 2467 yeast genes. Data was generated from microarray samples collected at pre-determined stages of yeast growth. SVMs map the vectors (X) into a high-dimensional region, the feature space, and construct a hyperplane separating class members from non-members. The kernel function defines this mapping, along with a margin that allows mislabeled or inappropriate data to fall away from the plane. Application of the kernel function along with a variable constant solves the problem of noise projecting data into the wrong region.
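In modern terms the setup looks roughly like the following Python/scikit-learn sketch; the data here are random stand-ins with the paper's dimensions, and the kernel and cost weighting are generic choices rather than the authors' specific ones.

```python
import numpy as np
from sklearn.model_selection import cross_val_predict
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.standard_normal((2467, 79))         # toy stand-in: genes x 79 experiments
y = (rng.random(2467) < 0.05).astype(int)   # toy membership in a functional class

# kernel SVM; class_weight tilts the margin against the rarer class
clf = SVC(kernel="rbf", C=1.0, class_weight="balanced")
pred = cross_val_predict(clf, X, y, cv=3)
fp = np.sum((pred == 1) & (y == 0))         # false positives
fn = np.sum((pred == 0) & (y == 1))         # false negatives
```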
SVMs were evaluated on six functional classes, such as respiration and histones; helix-turn-helix proteins were included as a control group. The methodology for comparison to other algorithms defined a cost savings for each method, built from its false positives and negatives. In all runs except the control, SVMs were more accurate than the other approaches, with Parzen windows being the closest. The paper concludes with a discussion of false positives and misclassified genes.
I found this paper quite interesting from a modeling perspective, but it would have benefited from some additional mathematical examples showing the algorithm in action. The sample size of the run seems limiting, and the work would have benefited from either a larger scope or a different choice of sample genes. From a computer scientist's view, the usefulness of SVMs over other approaches seems considerable. Unfortunately the (over)emphasis on misclassified genes near the end of the paper took space which could have shown the benefits of the algorithm.

Michael B. Eisen, Paul T. Spellman, Patrick O. Brown, and David Botstein, Cluster analysis and display of genome-wide expression patterns, Proc. Natl. Acad. Sci. USA, Vol. 95, pp. 14863-14868, Dec. 1998

This is an early paper in cluster analysis of microarray DNA data. The authors used the Pearson coefficient as the metric of similarity of DNA expression levels, and then applied hierarchical clustering to construct a tree of gene clusters. Although they mention some other clustering techniques, they implemented only this relatively simple approach in the paper. The biggest contribution of this paper, in my mind, is to demonstrate that coexpressed genes can be clustered into one group, which matches the intuition of biological research.
As to the technique, beyond the clustering itself, the authors used a tree as the data structure for the clustering result and ordered the data by expression level, time of maximal induction, or chromosomal position.
I expected to see more about the computational approach, because it is no doubt computing intensive (pairwise correlation calculation, sorting, and so on). However, it seems that I have to go to their website to find out. (A small sketch of the basic pipeline follows.)
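The basic pipeline is a few lines with standard tools — a minimal Python/SciPy sketch, assuming genes as rows and using 1 minus the Pearson correlation as the distance (the linkage rule and toy data are illustrative, not a reimplementation of the paper's software):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

X = np.random.randn(60, 10)                 # toy: genes x conditions

# Pearson correlation as similarity; 1 - r as the clustering distance
r = np.corrcoef(X)
dist = 1 - r[np.triu_indices_from(r, k=1)]  # condensed pairwise distances
Z = linkage(dist, method="average")         # hierarchical tree of gene clusters
```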

Tusher, V. G., Tibshirani, R., and Chu, G., Significance analysis of microarrays applied to the ionizing radiation response, PNAS, 2001, 98:5116-5121

This paper presents a method known as Significance Analysis of Microarrays (SAM) that results in a much reduced false discovery rate (FDR, defined as the percent of genes called significant that are not actually differentially expressed) as compared to conventional methods. This method assigns a score to genes based upon the difference in mean expression of each gene between treatment groups and the standard deviations of the gene in each group. The actual FDR as determined by RT-PCR was very close to the predicted value. The experimental design was robust in that the choice of replicates minimized the effects of confounding sources of variability (i.e. hybridization-dependent effects and treatment-independent biological variability).
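The score the paragraph describes can be sketched as a "relative difference" per gene; the pooled standard error and the small fudge constant s0 below follow the general SAM recipe, but the details (and the permutation step SAM uses to estimate the FDR) are simplified in this sketch.

```python
import numpy as np

def sam_scores(X, labels, s0=0.1):
    """SAM-style relative difference d = (mean1 - mean2) / (s + s0), where s is
    the pooled standard error and s0 keeps low-variance genes from dominating."""
    a, b = X[:, labels == 0], X[:, labels == 1]
    n1, n2 = a.shape[1], b.shape[1]
    pooled = ((a.var(axis=1, ddof=1) * (n1 - 1) + b.var(axis=1, ddof=1) * (n2 - 1))
              / (n1 + n2 - 2))
    s = np.sqrt(pooled * (1 / n1 + 1 / n2))
    return (a.mean(axis=1) - b.mean(axis=1)) / (s + s0)

X = np.random.randn(1000, 8)                       # toy: genes x samples
d = sam_scores(X, np.array([0, 0, 0, 0, 1, 1, 1, 1]))
```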

Alter O, Brown PO, Botstein D., Generalized singular value decomposition for comparative analysis of genome-scale expression data sets of two different organisms, PNAS, March 18, 2003, 100(6):3351-3356

In this paper the authors give an interpretation to the GSVD of the pair of matrices formed by listing expression data from two organisms in two NxM matrices, whose rows correspond to (a transformed function of) the expression level of a single gene across all M microarray experiments, and whose columns correspond to the expression levels of all N genes in a single microarray experiment. The GSVD of two matrices E1 and E2 gives

E1(N1xM) = u1(N1xM)e1(MxM)x^-1(MxM)
E2(N2xM) = u2(N2xM)e2(MxM)x^-1(MxM)
where e1 and e2 are nonnegative, diagonal matrices, and x^-1 is "normal" but not necessarily orthonormal. Software to compute the GSVD of two matrices is plentiful, so the authors use it and assign interpretations to the output. They also suggest some biological relevance for their interpretations. They describe the reduced basis produced by the GSVD from the N genes x M arrays data as M genelets x M arraylets. Genelet m is the mth row of x^-1, and represents expression across both arrays of some aggregate set of genes (specified by u1 and u2). Arraylet m is the mth column of u1 or u2 and (basically) maps a genelet to the set of genes which it represents in E1 or E2.
This interpretation is useful in the context the authors consider (i.e., comparing expression data between two organisms which may have different numbers of genes) because the GSVD essentially finds a basis which can be used to represent both E1 and E2 at the same time. Thus, genelet m can be assigned an interpretation in both E1 and E2. The importance of genelet m in E1 and E2 is given by e1(m) and e2(m) respectively. This importance can also be interpreted: if atan(e1(m)/e2(m)) - pi/4 =~ 0 (i.e., if e1(m) =~ e2(m)), then genelet m is equally significant in both data sets; if this angular distance is skewed towards ±pi/4, then genelet m is more important in one or the other of the data sets E1 or E2.
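That angular measure is a one-liner; here is a tiny NumPy sketch of it on made-up diagonal values (the function name and data are mine):

```python
import numpy as np

def genelet_angles(e1, e2):
    """atan(e1/e2) - pi/4 per genelet: ~0 means equally significant in both
    data sets; near +pi/4 or -pi/4 means specific to one data set."""
    return np.arctan(np.asarray(e1) / np.asarray(e2)) - np.pi / 4

print(genelet_angles([1.0, 3.0, 0.2], [1.0, 0.3, 2.0]))  # shared, E1-specific, E2-specific
```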
The authors apply this interpretation to microarray data from yeast and human cells that were synchronized by different means to be at a certain point in the cell cycle at a given time. They assign an interpretation to a number of genelets. For instance (these are details but...), one genelet that was important in both e1 and e2 turned out to be "the cell cycle genelet", and represented the set of genes responsible for causing the cell to cycle. Other genelets that were important in only one or the other of e1 or e2 turned out to be "synchronization response genelets", and represented the sets of genes which were invoked when the yeast and human cells were synchronized. They are different because the cells were synchronized in different ways.
This paper succeeds in applying an interpretation to the matrix-algebra concept of the GSVD in the context of microarray data. The technique is, of course, beyond question, since it is just linear algebra. The interpretation of "genelets" is an interesting concept; while you can get at it with other clustering techniques, I'm not aware of many ways to get at it across species/chips. Then again, there are many things I'm not aware of. The GSVD appears to be a useful technique to know about for microarray analysis, and maybe other things.

Finny G Kuruvilla, Peter J Park, Stuart L. Schreiber , Vector Algebra in the analysis of genome-wide expression data, Genome Biology 2002, 3:research0011.1-0011.11 doi:10.1186/gb-2002-3-3-research0011

This paper is a fascinating overview of how to use several linear algebra techniques to approach many of the same problems that we've been exploring in class with statistical methods.
The paper starts off with a summary explaining that while other people have already explored some of these topics, the authors' goal was to bring these topics together into a single place, in a coherent way - to provide a "framework". It then goes on to explain how one might obtain a matrix of data from a microarray. If one thinks of each run of a microarray as producing a column of measurements, with one measurement per gene, and one runs the experiment multiple times (under different conditions, or with different tissue samples, etc.), then one obtains multiple columns, which can be joined together and used as a matrix. Each entry in the matrix is the log of the unsigned fold change (two-fold, three-fold, one-half-fold, etc.). One advantage of setting things up this way is that one can think about things not just in terms of a collection of column-vectors (each entry reflecting the measurement of a single gene) in a space whose dimensionality is defined by the number of experiments done, but also as a collection of row-vectors (one per experiment) in a space whose dimensionality is defined by the number of genes measured. On the one hand, given that the authors' goal is to deal with large amounts of data (500 experiments and counting for one particular organism), this is a neat trick. On the other hand, given that microarrays can now measure 20,000 genes at once, this doesn't seem all that useful - the dimensionality will be incredibly high, and at some level, you'll still need to deal with the volume of data, no matter how you slice it.
From there, the authors go on to examine a variety of linear algebra techniques that can be used to measure similarity between genes' expression levels, and/or reveal underlying patterns in the data. These techniques include using the angle between two vectors as a measure of how similar two experiments are, using subspaces to find interesting subsets of experiments (or genes) that show significant correlation, using the ratio of magnitudes of two vectors to compare the results of different experiments, using a singular value decomposition to reduce the dimensionality of the matrix (and thus reveal underlying behavior), and looking for a basis as a means of finding which vectors span the space, and thus might represent fundamental states of a cell. (Two of these are sketched below.)
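Two of these techniques are one-liners in NumPy — the angle between two experiment vectors, and an SVD-based low-rank view of the matrix. A toy sketch (random data; the rank-2 cutoff is an arbitrary choice of mine):

```python
import numpy as np

def angle_deg(u, v):
    """Angle between two expression vectors; small angles mean similar
    responses, with overall magnitude factored out."""
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

E = np.random.randn(5000, 12)           # toy: genes x experiments (log fold changes)
print(angle_deg(E[:, 0], E[:, 1]))      # how similar are experiments 0 and 1?

U, s, Vt = np.linalg.svd(E, full_matrices=False)
E_rank2 = U[:, :2] * s[:2] @ Vt[:2]     # rank-2 approximation: dominant modes only
```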
There were a number of things that I thought the paper did well. It provided a good overview about how to use linear algebra techniques in this problem domain, and often related these techniques to statistical methods (e.g., it compared the SVD to PCA). On a personal level, I found the geometric interpretations intuitive.
At the same time, one of the major weaknesses of the paper was a consistent lack of detail. While it made sure to point out similarities between linear algebra and statistical methods, it often didn't show the equivalence in a particularly rigorous fashion. Also, it would have been good to include a comparison of the algorithms for computing these different techniques, in terms of the time/space required to run them. While it's possible that some of the paper's references do this, it would be useful (especially since the paper is trying to establish these techniques as a 'framework') to include such comparisons here.
The most glaring problem with the paper was the lack of any experimental section. Early on, they mention that they're going to be considering a "sample analysis of cells treated with rapamycin", but I think they do this mainly because there have already been a lot of experiments run on this, and so they can easily build on the huge library of previous experiments. While they do relate some of their results back to previously known biological data, it seems to be a fairly cursory examination of any such connections. On the one hand, this strikes me as being really bad. On the other hand, the authors' goal is to establish a certain mathematical method as being useful, so maybe this sort of approach is acceptable for this type of paper. Another shortcoming is that since the authors are looking to use results from numerous different experiments, I would have expected them to spend more time explaining how they compensated for differences between microarray platforms, and how they were going to deal with such differences.
Overall, I enjoyed reading the paper, but feel that the paper could have been improved by including more details.

