CSE599V  Machine Learning in Biology

 

Announcements

Welcome to CSE599V! The first meeting will be held on Monday, March 29.

Please subscribe to the course mailing list.

 

Administration

Instructor: Su-In Lee (CSE 536; office hours: M 11:30am-1pm or by appointment)

Meetings: MW 10:10am @ CSE 303

 

Course description

The biological sciences are becoming data-rich and information-intensive. It is now possible to obtain very detailed information about living organisms. For instance, from humans we can retrieve DNA sequence information (a string about 3 billion characters long), expression (activity) levels of more than 20,000 genes, and various clinical measurements. The growing availability of such information promises a better understanding of important questions, such as the causes of diseases. However, the complexity of biological systems and the high dimensionality and noise of the data make it difficult to infer the underlying mechanisms from data.

Machine learning (ML) techniques have become very useful tools for resolving important questions in biology because they provide mathematical frameworks for analyzing vast amounts of biological information. Biology is also a fascinating application area for ML because it presents new computational challenges that can ultimately advance ML itself. In this course, we will discuss recent papers describing successful applications of ML techniques to exciting problems in biology.

 

Format

We will discuss one paper in each meeting. One student will present the paper and lead the discussion. The discussion should cover the reading questions, which will be given out 1 week before the class. The instructor will then give a mini-lecture to provide the background knowledge needed for the topic of the next meeting.

 

Grading

The course grade will be based on participation in discussions.

Students who take the course for S/NS: reading the assigned papers; writing evaluations of 3 papers (due 1 week after the class); leading the discussion on 1-2 papers; participating in discussions.

Letter grade: working on a mini-project, in addition to reading papers and leading/participating in discussions.

 

Topics to be covered (tentative)

Topic ID

Date

Topic and readings

Discussion Leader

Keywords (goals, ML techniques used)

Handouts, discussion questions

1

3/29

Introduction & overview of topics

Su-In Lee

Syllabus [pdf], Lecture note 1 [ppt]

2

3/31

Introduction & overview of topics

Su-In Lee

Lecture note 2 [ppt]

3

4/5

A feature-based approach to modeling protein-DNA interactions, Sharon E, Lubliner S, Segal E. PLoS Comput Biol. 2008.

(shorter version: Sharon E, Segal E, RECOMB 2007)

 

Optional readings:

Efficient structure learning of Markov networks using L1-regularization, Lee SI, Ganapathi V, Koller D. NIPS 2007.

Aniruddh Nath [pdf]

Learning probabilistic models for transcription factor binding sites; structure/parameter learning of Markov networks

Reading questions, Lecture note 3 [ppt]

4

4/7

Module networks: identifying regulatory modules and their condition-specific regulators from gene expression data, Segal E, Shapira M, Regev A, Pe'er D, Botstein D, Koller D, Friedman N. Nat Genet. 2003.

 

Optional readings:

Learning module networks, Segal E, Pe'er D, Regev A, Koller D, Friedman N. J Machine Learning Research (JMLR) 2005.

Su-In Lee

Inferring transcriptional regulatory networks; structure/parameter learning of Bayesian networks

Reading questions, Lecture note 4 [ppt]

5

4/12

Probabilistic discovery of overlapping cellular processes and their regulation, Battle A, Segal E, Koller D. J Comput Biol. 2005.

(shorter version: Battle A, Segal E, Koller D. RECOMB 2004)

Casey L. Overby [ppt]

Clustering genes based on expression data; probabilistic relational models

Reading questions, Lecture note 5 [ppt]

6

4/14

Learning a Prior on Regulatory Potential from eQTL Data, Lee SI, Dudley AM, Drubin D, Silver PA, Krogan N, Koller D. PLoS Genet. 2009.

 

Optional readings:

Learning a meta-level prior for feature relevance from multiple related tasks, Lee SI, Chatalbashev V, Vickrey D, Koller D. ICML 2007.

Su-In Lee

Inferring the effect of sequence variation on regulatory networks; Bayesian network, LASSO

Reading questions, Lecture note 6 [ppt]

7

4/19

An integrative genomics approach to infer causal associations between gene expression and disease, Schadt EE et al. Nat Genet. 2005.

Xu Miao [ppt]

Inferring transcriptional regulatory networks, Bayesian network

Reading questions, Lecture note 7 [ppt]

8

4/21

Characterizing dynamic changes in the human blood transcriptional network, Zhu J, Chen Y, Leonardson AS, Wang K, Lamb JR, Emilsson V, Schadt EE. PLoS Comput Biol. 2010.

Rolfe Schmidt [ppt]

Inferring temporal changes of regulatory networks, Dynamic Bayesian network

Reading questions, Lecture note 8 [ppt]

9

4/26

Statistical estimation of correlated genome associations to a quantitative trait network, Kim SY, Xing E. PLoS Genet. 2010.

(shorter version: Kim SY, Xing E. ISMB 2009)

Eric Garcia [ppt]

Identifying genetic factors for human diseases, LASSO

Reading questions, Lecture note 9 [ppt]

10

4/28

Population structure and eigenanalysis, Patterson N, Price AL, Reich D. PLoS Genet 2006.

  

Principal components analysis corrects for stratification in genome-wide association studies, Price AL, Patterson N, Plenge RM, Weinblatt M, Shadick NA, Reich D. Nat Genet. 2006.

James Chen [ppt]

Population stratification, PCA

Reading questions, Lecture note 10 [ppt]

11

5/3

SNP imputation in association studies, Halperin E, Stephan DA. Nat Biotechnology. 2009.

 

One of the methods listed in Table 1 or

Bayesian multi-population haplotype inference via a hierarchical Dirichlet process mixture, Xing E, Sohn KA, Jordan MI, Teh YW. ICML 2006.

Cindy Desmarais

Haplotype reconstruction & imputation, HMM or Dirichlet process

Reading questions 

12

5/5

Reconstructing genetic ancestry blocks in admixed individuals, Tang H, Coram M, Wang P, Zhu X, Risch N. American Journal of Human Genetics (AJHG) 2006.

Elizabeth Tseng

Inferring the local ancestry of DNA segments, Markov-hidden Markov model (MHMM)

Reading questions 

13

5/10

Tag SNP selection in genotype data for maximizing SNP prediction accuracy. Halperin E, Kimmel G, Shamir R. Bioinformatics 2005.

 

BNTagger: improved tagging SNP selection using Bayesian networks. Lee PH, Shatkay H. Bioinformatics 2006.

Will Mortensen

Tag SNP selection, Bayesian network, information theory

14

5/12

Causal protein-signaling networks derived from multiparameter single-cell data, Sachs K, Perez O, Pe'er D, Lauffenburger DA, Nolan GP. Science 2005.

Kristi Tsukida

Inferring signaling networks, Bayesian network

15

5/17

CONTRAfold: RNA secondary structure prediction without physics-based models, Do CB, Woods DA, Batzoglou S. Bioinformatics 2006.

 

A max-margin model for efficient simultaneous alignment and folding of RNA sequences, Do CB, Foo CS, Batzoglou S. Bioinformatics 2008.

Daniel Jones

RNA secondary structure prediction

16

5/19

Automatic parameter learning for multiple network alignment, Flannick J, Novak A, Do CB, Srinivasan BS, Batzoglou S. J Comput Biol. 2009.

 

Optional readings:

Modeling cellular machinery through biological network comparison, Sharan R, Ideker T. Nat Biotech. 2006.

Adrienne Wang

Comparison of biological networks

17

5/24

Cancer classification based on genotype (TBD)

Nathan Parrish

18

5/26

Evolution of regulatory networks (TBD)

Austin Webb

19

6/2

Something on Tandem Mass Spec (TBD)

Adam Gustafson