University of Washington Department of Computer Science & Engineering
 CSE 527Au '03: Reading #3: What Students Found

I asked for brief reports on good microarray papers emphasizing data analysis. Here's what you found:


S. C. Geller, J. P. Gregg, P. Hagerman and D. M. Rocke, Transformation and normalization of oligonucleotide microarray data, Bioinformatics 2003 Sep;19(14):1817-23

The authors (successfully) applied a transformation and normalization technique developed for spotted microarrays (cDNA microarrays) to data obtained from Affymetrix microarrays (oligo microarrays). The transformation is employed to stabilize the variance of the data over all expression levels, which is a prerequisite for power calculations and most microarray data analysis methods (the commonly used log-transformation provides a stabilized variance for high expression levels but not for low expression levels). Additionally, the error in the data is symmetrized and the data is normalized across several chips. The procedure uses a two-component model of error to describe the measured intensity. The two error components are

  1. a proportional error, which is always present but noticeable only at high expression levels, and
  2. an additive error, which is noticeable only at expression levels close to zero.
The transformation for data that fits this two-component model was developed and shown to work with cDNA microarray data in previous publications. The authors of this paper showed that the transformation and the accompanying normalization work at least approximately for data from Affymetrix chips. An advantage of the method is that it yields only positive expression levels, unlike other commonly used methods, including those originally recommended by Affymetrix (Microarray Analysis Suite 4.0 used an average difference score between perfect-match and mismatch pairs to determine expression levels; version 5.0 uses statistical algorithms to determine expression levels and artificially constrains them to be positive).

An iterative process is used to determine the parameters of the transformation formula, performing the normalization at the same time. This is necessary because the two error components are coupled to one of the transformation parameters.
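To make the transform concrete, here is a minimal sketch of a generalized-log-style transform for this two-component model (my illustration, not the authors' exact procedure; alpha and c stand for the estimated background and a calibration constant derived from the two error variances):

    import numpy as np

    def glog(y, alpha, c):
        # Generalized-log-style transform for the two-component model
        # y = alpha + mu * exp(eta) + eps. For large y it behaves like
        # log(y - alpha); near the background it stays finite and roughly
        # linear, which stabilizes the variance at low expression levels.
        return np.log((y - alpha) + np.sqrt((y - alpha) ** 2 + c))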

The authors used data from 4 replicated experiments, all performed with biologically equivalent cells (this is a precondition), to estimate the transformation parameters. The estimated parameters are independent of the particular experiment (though not of the equipment used), so the transformation parameters can be applied to analyze future experiments.


Jane Fridlyand, Sandrine Dudoit, Applications of resampling methods to estimate the number of clusters and to improve the accuracy of a clustering method, Technical report

The paper discusses two independent new applications of resampling methods to clustering. To estimate the number of clusters, the Clest algorithm is proposed: it repeatedly divides the dataset into two sets, one used for clustering and building a predictor, the other for testing, and the number of clusters is then assessed using the observed similarity statistic. The second part of the paper deals with two bagging algorithms for improving the accuracy of a clustering method. Bag1 repeatedly applies a clustering algorithm to bootstrap samples and combines the results by voting, while Bag2 builds a new dissimilarity matrix by measuring the proportion of bootstrap samples in which two observations are assigned to the same cluster. The new matrix is used as input for another clustering algorithm.
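A minimal sketch of a Bag2-style dissimilarity (my reading of the idea; k-means stands in for PAM, which the paper actually uses):

    import numpy as np
    from sklearn.cluster import KMeans

    def coassignment_dissimilarity(X, k, n_boot=50, seed=0):
        # Fraction of bootstrap samples in which two observations are NOT
        # assigned to the same cluster; the resulting matrix can be fed
        # to any dissimilarity-based clustering algorithm.
        rng = np.random.default_rng(seed)
        n = X.shape[0]
        together = np.zeros((n, n))
        counted = np.zeros((n, n))
        for _ in range(n_boot):
            idx = np.unique(rng.integers(0, n, size=n))  # bootstrap sample
            labels = KMeans(n_clusters=k, n_init=10).fit_predict(X[idx])
            same = (labels[:, None] == labels[None, :]).astype(float)
            together[np.ix_(idx, idx)] += same
            counted[np.ix_(idx, idx)] += 1.0
        return 1.0 - together / np.maximum(counted, 1.0)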

I am attracted by the idea of having an algorithm that assesses the optimal (or "correct") number of clusters, since many clustering algorithms require this number as an input parameter. The number of clusters is usually unknown, and thus there are only two possible solutions: guessing the correct number with the help of data visualization or having an algorithm compute that number (or a combination of the two).

The proposed algorithm, Clest, was compared to six other methods using both simulated and real gene expression data. Unsurprisingly, Clest was claimed to be superior. However, I think the competing algorithms should also be discussed and the differing results interpreted (why does this algorithm do better in this situation?).

Furthermore, the paper doesn't address performance issues at all. I am aware that in gene expression analysis even computing times of multiple hours might be acceptable. However, the Clest algorithm appears to be highly inefficient: in every loop a complete clustering is performed, a predictor is created, and so on. I believe the algorithm is not usable for problem sizes significantly larger than those in the given examples. The Clest algorithm also requires setting a large number of parameters, many of which probably have a direct influence on the estimated number of clusters. The algorithm seems of low value if it can only estimate one parameter by requiring several others as input.

In the paper, the Clest algorithm uses PAM for clustering and DLDA for classification in the predictor. The authors claim that PAM is more robust and more efficient than k-means, and that DLDA performs well in complex situations. Ironically, in another paper ("Comparison of Discrimination Methods for the Classification of Tumors Using Gene Expression Data") the same authors warn that DLDA does not consider correlation between genes, which is a biological fact. Therefore, I suggest that other methods be tested and a more careful decision made.

The proposed methods for increasing the accuracy are rather simple; I am actually surprised that they can really increase the accuracy. The second algorithm, Bag2, in particular appears rather useless to me: it requires the computation of three n×n matrices, and according to the text the results were not satisfactory.

TC Kroll, S Wölfl, Ranking: a closer look on globalisation methods for normalisation of gene expression arrays, Nucleic Acids Research, 30(11):e50.

This paper analyzes different normalization methods for microarray data. The rank intensity plot used to compare methods was actually very helpful to me for visualizing the comparisons. I was also pleased with the chart they provided of the pros and cons of the different normalization methods.

The conclusion was that for their large data sets, a trimmed mean (trimmed by about 5-10%) was the best method of normalization. I would be interested to see how this method holds up in other data sets, as well as in analyzing other non-microarray data. I would also be interested to see what would happen if they compared their methods to the methods described in class, where the data is fit to some known statistical model (Gaussian, etc.).
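For reference, a minimal sketch of trimmed-mean scaling in the spirit of their conclusion (the particular scaling scheme, matching each array's 5%-trimmed mean to the common average, is my assumption):

    import numpy as np
    from scipy.stats import trim_mean

    def trimmed_mean_normalize(X, proportion=0.05):
        # X: genes x arrays. Compute each array's trimmed mean (cutting
        # `proportion` from each tail) and rescale the arrays so those
        # trimmed means agree.
        factors = np.array([trim_mean(X[:, j], proportion)
                            for j in range(X.shape[1])])
        return X / factors * factors.mean()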

Susmita Datta, Somnath Datta, Comparisons and validation of statistical clustering techniques for microarray gene expression data, Bioinformatics 2003 19: 459-466

I selected this paper from the Notes on Readings list for homework 2. It evaluates six different clustering algorithms (representing broadly different approaches) against the well-known Chu et al. sporulation data set and a pair of synthetic data sets.

The paper suggests that a good clustering algorithm will produce stable clusters when given successive slices of a full data set, each with one measurement dimension removed. This is similar to the figure-of-merit approach discussed in class, and the paper presents three such metrics to evaluate the quality of a clustering algorithm against a given data set.

The authors conclude that the Diana clustering algorithm performs well and use it to cluster the sporulation data, finding clusters which are remarkably consistent with the 7 model temporal profiles used by Chu et al. in the original sporulation paper. The authors propose another metric for evaluating an algorithm's performance on a data set when such a set of model profiles is available.
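A minimal sketch of a leave-one-condition-out stability measure in this spirit (k-means and this particular overlap score are illustrative assumptions, not the paper's exact metrics):

    import numpy as np
    from sklearn.cluster import KMeans

    def avg_proportion_non_overlap(X, k, seed=0):
        # Cluster the full genes-x-conditions matrix, then re-cluster with
        # each condition (column) deleted and measure how often each
        # gene's cluster-mates change; smaller is more stable.
        full = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X)
        n, m = X.shape
        scores = []
        for j in range(m):
            part = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(
                np.delete(X, j, axis=1))
            for g in range(n):
                mates_full = set(np.where(full == full[g])[0]) - {g}
                mates_part = set(np.where(part == part[g])[0]) - {g}
                if mates_full:
                    scores.append(1 - len(mates_full & mates_part) / len(mates_full))
        return float(np.mean(scores))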

David R. Bickel, Robust cluster analysis of microarray gene expression data with the number of clusters determined biologically, Bioinformatics, vol. 19, no. 7, 2003, pages 818-824.

This paper discusses a new clustering method which is more robust to outliers. The method combines a new distance (dissimilarity) metric, a new method of determining the number of clusters, and a new cluster visualization technique with an existing clustering algorithm ("the myopic algorithm for optimal facility location with the exchange heuristic"). The new distance metric uses rank correlation (the highest expression level for each gene is assigned a value of 1, the second highest 2, the third 3, et cetera). The use of ranks rather than raw values should keep extreme values from rendering the data useless. The rank correlations are then mapped to Euclidean distances; the resulting measure is called Rank-Correlation Dissimilarity (RCD). The method for determining the number of clusters assumes that some of the genes are known to be biologically related. It then finds the maximum number of clusters such that some minimum fraction of every biologically related group is placed in the same cluster by the clustering algorithm. The visualization technique plots the expression levels of the median gene in each cluster, with error bars representing the expression levels of the 50% of the genes in the cluster closest to the median gene. Looking at only half of the genes ensures that outliers do not ruin the visualization generated for the cluster.
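A rough sketch of a rank-correlation dissimilarity of this kind (sqrt(2*(1-rho)) is the standard embedding of a correlation as a distance; the paper's exact mapping may differ):

    import numpy as np
    from scipy.stats import spearmanr

    def rank_correlation_dissimilarity(x, y):
        # Spearman's rho is the Pearson correlation of the ranks, so a
        # few extreme expression values cannot dominate the measure.
        rho, _ = spearmanr(x, y)
        return np.sqrt(2.0 * (1.0 - rho))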

Overall, this paper presented an interesting new combination of techniques for more robust clustering of microarray data. However, the results section only considered one data set and, as we have seen in class, the performance of clustering algorithms tends to vary significantly between data sets. The results also compared the algorithm only against k-means.

S. Ramaswamy, P. Tamayo, R. Rifkin, S. Mukherjee, C. Yeang, M. Angelo, C. Ladd, M. Reich, E. Latulippe, J. Mesirov, T. Poggio, W. Gerald, M. Loda, E. Lander, and T. Golub, Multiclass cancer diagnosis using tumor gene expression signatures, Proc. Natl. Acad. Sci. USA. 2001 December 18; 98 (26): 15149-15154

This paper evaluated two techniques for classifying (and diagnosing) tumor cells: unsupervised learning (clustering) and a trained Support Vector Machine (SVM) algorithm. Using the clustering approach, the study found that while several classes of tumors were well defined and formed easily identifiable clusters, many other classes did not correspond to clear clusters. The SVM technique was found to offer better results. Given an unknown tumor sample, the algorithm produced a "confidence" index for each tumor class indicating the likelihood that the tumor was a member of that class. Simply classifying the tumor in the class which yielded the highest confidence level produced an accurate diagnosis 78% of the time. The authors are careful to note that this classification algorithm does not represent a direct diagnosis and that it will be most helpful as an adjunct to traditional diagnosis techniques. It has success identifying molecular characteristics of a tumor which might not result in recognizable clinical or histological features.
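A minimal sketch of one-vs-all SVM prediction with a margin-based confidence (the paper's exact confidence calibration is not reproduced here):

    import numpy as np
    from sklearn.svm import LinearSVC

    def ova_svm_predict(X_train, y_train, X_test):
        # LinearSVC trains one-vs-rest classifiers by default; the class
        # with the largest decision value wins, and that margin can serve
        # as a crude confidence index.
        clf = LinearSVC().fit(X_train, y_train)
        scores = clf.decision_function(X_test)  # (n_samples, n_classes)
        pred = clf.classes_[np.argmax(scores, axis=1)]
        conf = np.max(scores, axis=1)
        return pred, conf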

Park T, Yi SG, Kang SH, Lee S, Lee YS, Simon R, Evaluation of normalization methods for microarray data, BMC Bioinformatics. 2003 Sep 2;4(1):33. Epub 2003 Sep 02.

Microarray data can be acquired using a variety of methods, so this paper compares methods for normalizing microarray data. Overall this is actually a pretty good paper. According to the paper, the methods compared are those commonly used in microarray analysis today. The paper concludes that intensity-dependent normalization methods (a class of normalizations that conforms to the trend of the data, e.g. least squares) are in general much better than global methods (like the mean of the log ratios). The group uses both real and simulated data and arrives at the same conclusion with both sets. The quality of the methods is evaluated by how small an estimated variance is obtained (i.e. the smaller the estimated variance, the better the method). The analysis is rather comprehensive in terms of the normalization methods covered. Though the results are not mind-blowing, in that intensity-dependent methods are found to be better, this analysis is certainly necessary, and the group performs it in a very systematic and understandable way.
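To illustrate the distinction, a sketch of an intensity-dependent normalization on an MA plot next to a global one (the lowess smoother and its span are my choices for the example, not necessarily the paper's):

    import numpy as np
    from statsmodels.nonparametric.smoothers_lowess import lowess

    def intensity_normalize(M, A, frac=0.3):
        # Fit the trend of M = log2(R/G) against A = mean log intensity
        # and subtract it, so the correction adapts to intensity level.
        trend = lowess(M, A, frac=frac, return_sorted=False)
        return M - trend

    def global_normalize(M):
        # A global method instead subtracts a single constant per array.
        return M - np.median(M)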

Zbynek Bozdech, Manuel Llinas, Brian Lee Pulliam, Edith D. Wong, Jingchun Zhu, Joseph L. DeRisi, The Transcriptome of the Intraerythrocytic Developmental Cycle of Plasmodium falciparum, PLoS Biology, October 2003, Vol 1, Issue 1

Instead of using traditional clustering algorithms, the authors performed Fourier analysis (FFT and power calculations) on time-series expression profiles. The result was not a set of distinct clusters, but rather a temporal "cascade" of 2714 ORFs up-regulated then down-regulated, slightly out-of-phase with each other. Validation for the method appears to come from the way genes with known common characteristics (12 groups, 382 total genes) share a similar temporal profile. The authors also point out that Fourier analysis had been previously utilized to identify cell-cycle-regulated genes in yeast and human cells.

P. falciparum causes malaria; the authors speculate that good drug targets would be genes expressed slightly before, or coincident with, the genome of a known, important organelle (the organelle is similar to mitochondria). And, of course, future studies can analyze upstream regions of genes with similar phases to discover any common transcriptional regulator.

A couple of things that I found interesting:
The authors did not start out with 2714 ORFs; they designed a microarray representing 4488 ORFs (out of 5409 manually annotated ORFs). They discarded some data due to poor-quality spots (dye intensities vs. background). Then they excluded profiles (a) whose main frequency component ranked (in terms of magnitude) in the bottom quartile among all profiles, and/or (b) whose power spectrum was only lightly concentrated (<70%) in the main frequency component plus the two adjacent frequency peaks. I imagine that they eyeballed the knee in some curve somewhere, but this was not explicitly stated.
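A sketch of criterion (b), the power-concentration filter, as I understand it (the quartile cutoff of criterion (a) is applied across profiles and omitted here):

    import numpy as np

    def passes_power_filter(profile, min_concentration=0.70):
        # Keep a profile only if at least 70% of its (non-DC) spectral
        # power lies in the main frequency component plus the two
        # adjacent components.
        spec = np.abs(np.fft.rfft(profile - np.mean(profile))) ** 2
        spec = spec[1:]                   # drop the DC component
        k = int(np.argmax(spec))          # main frequency component
        window = spec[max(k - 1, 0):k + 2]
        return window.sum() / spec.sum() >= min_concentration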

The authors had to pick one frequency to serve as the major frequency before calculating the phase of each profile. They used the mode (as opposed to the mean or the median or something else) of the main frequency components across all profiles. I wonder if there was a better choice, such as the frequency component with the most power across all profiles. Maybe the difference is too small to matter.

One more thing: a technique called microarray-based comparative genomic hybridization (CGH) was used to compare & contrast different strains of P. falciparum, but the authors did not describe this technique in detail.

Ramaswamy Anbazhagan, Microarray data assembler, Bioinformatics, Vol. 19, no. 1, 2003, Pages 157-158

The program appears very useful for assembling microarray data and works with Excel (most chemists/biologists are already familiar with this software package). The program does no actual analysis; it simply assembles data for later analysis.

Doulaye Dembele and Philippe Kastner, Fuzzy C-means method for clustering microarray data, Bioinformatics, vol. 19, no. 8, 2003, pages 973-980

This is a very interesting paper. It shows a practical way to make progress when theoretical support is hard to get. But I feel that the data they gave is not enough to support THE most important equation in their work (eq. 7). I think they should discuss it a little more, to give it more credibility.

Ciprian Doru Giurcaneunu, Ioan Tabus, Ilya Shmulevich, and Wei Zhang, Stability-Based Cluster Analysis Applied to Microarray Data (to be submitted)

This paper presented a short introduction to what it called "stability-based methods" for finding structure in microarray data. In general, these methods amount to a general approach for finding the k for which k-clusterings produced by an arbitrary clustering technique are maximally stable, where stability is measured by how much the clusters change as the subset of the data used to produce them changes. For instance, one general approach presented is to pick two random subsets of the data, generate a k-clustering of each, and use the similarity of the points in their intersection as the stability measure. Another is to divide the data into two halves, k-cluster each, use the first to classify the points in the second, and then use the similarity between the second group's classified and clustered groupings as the stability measure. The paper then went on to evaluate this method on both artificial and experimental data and found that it performed fairly well and was usually able to pick the "right" k.
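A minimal sketch of the split-half variant (k-means, nearest-centroid classification, and the adjusted Rand index are my stand-ins for components the paper leaves open):

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.neighbors import NearestCentroid
    from sklearn.metrics import adjusted_rand_score

    def split_half_stability(X, k, seed=0):
        # Cluster each half; train a classifier on the first half's
        # cluster labels; compare its predictions on the second half with
        # the second half's own clustering. Higher agreement = more stable.
        rng = np.random.default_rng(seed)
        idx = rng.permutation(len(X))
        a, b = np.array_split(idx, 2)
        la = KMeans(n_clusters=k, n_init=10).fit_predict(X[a])
        lb = KMeans(n_clusters=k, n_init=10).fit_predict(X[b])
        pred_b = NearestCentroid().fit(X[a], la).predict(X[b])
        return adjusted_rand_score(lb, pred_b)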

In general I found this paper to be a good introduction to this type of analysis. It assumed you knew some basic information about clustering, which was fine for me since most of its assumptions were things we have gone over in class. I'm not sure its validation process was that rigorous, but for me that didn't detract too much, because they referenced several papers that did similar things that also seemed to work. I also don't think it should get published, since there's not really anything new or novel here and, as mentioned, their validation was anything but thorough; but that's another issue, I guess.

Smyth, G. K., Linear models and empirical Bayes methods for assessing differential expression in microarray experiments, 2003

The paper presents the idea and the math behind the linear model for microarray data and the hierarchical model, including marginal distributions, posterior odds, and inference using moderated t and F statistics. (For those who don't enjoy math, half of this paper is equations.) The simulation done for the paper compared the moderated t-statistic against several other statistics for ranking genes in terms of differential expression, all using the LIMMA software package. Fold change was the statistic that produced the largest number of false discoveries, because it ignores differences in gene-wise variability. The moderated t-statistic is shown to do the best, and the results are said to be consistent across individual data sets.
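A minimal sketch of the moderated t-statistic's shrinkage step (the empirical Bayes estimation of the prior variance s0_sq and prior degrees of freedom d0, which LIMMA does for you, is omitted):

    import numpy as np

    def moderated_t(beta_hat, s2, d_g, s0_sq, d0, v_g=1.0):
        # Shrink each gene-wise sample variance s2 (with d_g degrees of
        # freedom) toward the prior value s0_sq, then form a t-statistic
        # with the shrunken standard error.
        s_tilde_sq = (d0 * s0_sq + d_g * s2) / (d0 + d_g)
        return beta_hat / np.sqrt(s_tilde_sq * v_g)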

Therese Biedl, Brona Brejova, Erik D. Demaine, Angele M. Hamel, Thomas Vinar, Optimal Arrangement of Leaves in the Tree Representing Hierarchical Clustering of Gene Expression Data, Technical Report 2001-14, Dept. of Computer Science, University of Waterloo

This paper was alluded to in one of the lectures on visualizing microarray data. The problem it tackles is ordering genes so that similar genes appear together, subject to the constraint that clusters of genes remain contiguous. Essentially this is an attempt to frame a subjective visual question (how well arranged the green-black-red pictures of gene expression over time are) in more precise terms.

Their formalization goes: given N genes that have been hierarchically clustered, order them such that each cluster is a single contiguous segment and the sum of adjacent-pair distances is minimized. First there is an interesting result: without the requirement of contiguous clusters, the problem is computationally intractable (NP-complete). It's surprising how simple the proof is, given the problem statement above; usually NP-completeness proofs involve elaborate and hairy constructions. In this case there is an easy parallel to the well-known optimization problem called the traveling salesman problem (TSP): gene ordering is really the same problem in a different guise. It's also surprising that what is on the surface a more difficult problem, because of the additional constraint of respecting the clustering order, is actually easier. One explanation is that the constraint reduces the search space to a manageable size.

The paper proposes a dynamic programming approach. This makes sense; the tree structure in hierarchical clustering suggests that there may be relationships between optimal solutions for sub-clusters and the optimal solution for the whole gene group. (Incidentally, this marks a difference from TSP: an optimal tour for all the cities does not necessarily contain the optimal tour for half of them.) Written out as a recurrence, the algorithm appears to be O(n^5). Polynomial time, all right, but not very fast. The next section develops a tighter bound based on more careful analysis, including an analogy to matrix multiplication, that gets this down to O(n^3). Better, but still too slow for interactive use: their own implementation clocks in at 6 minutes on ~2K genes.
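A compact sketch of the basic recurrence (the naive polynomial formulation, not the paper's faster O(n^3) refinement); the clustering tree is assumed to be given as nested tuples of leaf indices:

    from itertools import product

    def optimal_leaf_order_cost(tree, D):
        # For each subtree, best[(l, r)] = minimum sum of adjacent-pair
        # distances over orderings of its leaves that keep every cluster
        # contiguous, with leaf l leftmost and leaf r rightmost.
        def solve(node):
            if isinstance(node, int):       # a single leaf
                return {(node, node): 0.0}
            left, right = map(solve, node)  # node = (subtree, subtree)
            best = {}
            for (a, b), (c, d) in product(left, right):
                # left block then right block, or right block then left
                for ends, cost in (((a, d), left[(a, b)] + D[b][c] + right[(c, d)]),
                                   ((c, b), right[(c, d)] + D[d][a] + left[(a, b)])):
                    if cost < best.get(ends, float("inf")):
                        best[ends] = cost
            return best
        return min(solve(tree).values())

    # e.g. optimal_leaf_order_cost(((0, 1), (2, 3)), D) for a 4-leaf tree,
    # where D is the pairwise distance matrix (indexable as D[i][j])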

Perhaps the biggest disappointment is that the algorithm seems to make very little difference visually. For a problem motivated by visualization of gene-expression data, the real benchmark is the subjective quality of the pictures. But comparing their algorithm against two others, including the heuristic from an earlier paper and an approximation algorithm for TSP, the pictures are not obviously better, even though mathematically the objective function is 25% lower with the current algorithm than with the heuristic approach. (TSP achieves even better results, but it doesn't respect the clustering.) What this suggests is that either visualization is surprisingly indifferent to ordering, or the authors failed to capture what makes for a good picture in their objective function.

Fimeisz et al., Identification and quantification of disease-related gene clusters, Bioinformatics, Vol. 19, 2003, 1781-1786

This paper details an algorithm for determining the spatial distribution of related genes. The novel idea of this paper is the claim to be able to detect clusters that are distributed over large stretches of the genome.

The researchers began by arraying a pool of cDNAs from arthritic mice (Cy5) cohybridized with a pool of cDNAs from normal mice (Cy3) on a pre-fabricated chip containing 9500 mouse genes. Genes with greater than 1.5-fold over-expression in arthritic mice were selected as targets and then spatially mapped to the mouse genome.

Based on this mapping, the researchers compared the density of related genes to the expected density of genes. The expected density was computed from the overall density found in the mouse genome. From there, they used an auto-correlation method to assign scores to disease-related genes based on their deviation from the expected densities as compared to random genes with the same separation (in bp). They then ran a friends-of-friends clustering algorithm to detect clusters within the genome.
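A minimal sketch of friends-of-friends grouping on one-dimensional genomic coordinates (the linking length is the defining parameter; its value would come from the density analysis):

    import numpy as np

    def friends_of_friends(positions, linking_length):
        # Two genes closer than the linking length are "friends", and
        # clusters grow transitively; in one dimension a sorted sweep
        # suffices, since a gap larger than the linking length must
        # start a new cluster.
        pos = np.sort(np.asarray(positions))
        clusters, current = [], [pos[0]]
        for p in pos[1:]:
            if p - current[-1] <= linking_length:
                current.append(p)
            else:
                clusters.append(current)
                current = [p]
        clusters.append(current)
        return clusters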

As I see it, the clustering algorithm shows that there are indeed relatively small "islands" in which disease-related genes cluster. Of course this is not surprising; the authors note that this has been done before. The whole point was to determine whether there are patterns in the across-genome distribution of related genes; that was what the auto-correlation density analysis was supposed to accomplish. The density analysis did show a non-random distribution of the related genes across the genome. However, that same analysis also showed a non-random genome distribution of the genes included on the array chip.

This is an interesting idea, but the fact that the genes on the array chip were non-randomly distributed seems to make moot any analysis of the distribution of any subset of genes pulled from that chip.

