|04/01||---- Organizational Meeting ----|
|04/08||Madan Musuvathi & Todd Mytkowicz, Microsoft Research||Accurate DNA Sequence Alignment on Data Parallel Hardware||Details|
|04/15||Hamid Bolouri, FHCRC||Identifying dysregulated pathways in cancer||Details|
|04/22||Ben Logsdon, UW CSE||Multi cancer analysis of a leukemia stem cell signature|| |
|04/29||Max Libbrecht, UW CSE||Entropic Graph-based Posterior Regularization for Learning Probabilistic Models||Details|
|05/06||Daniel Jones, UW CSE||Compression and Assembly of Next Gen Sequence Data||Details|
|05/13||Erick Matsen, FHCRC||From the Ramayana to Reverend Bayes: host defenses and zoonotic transmission of simian foamy virus||Details|
|05/20||Larsson Omberg, Sage Bionetworks||Transparent and Collaborative Research within The Cancer Genome Atlas|| |
|06/03||Tony Chiang, UW Oceanography||Exploring seven oceanic strains of the cosmopolitan diatom T. pseudonana|| |
Papers, etc.
Note on Electronic Access to Journals
Links to full papers below are often to journals that require a
paid subscription. The UW Library is generally a paid
subscriber, and you can freely access these articles if you do
so from an on-campus computer. For off-campus access,
follow the "[offcampus]" links below or look at the library "proxy server" instructions.
You will be prompted for your UW net ID and password once per
04/01: ---- Organizational Meeting ----
04/08: Accurate DNA Sequence Alignment on Data Parallel Hardware -- Madan Musuvathi & Todd Mytkowicz, Microsoft Research
Smith-Waterman (SW) is a dynamic programming algorithm that produces the best alignment of two DNA
sequences. Current implementations of SW have data dependencies which make them difficult to parallelize on data
parallel hardware like multicore, GPUs, and clusters. As a consequence, algorithms like BLAST use heuristics to
perform fast but approximate alignments and can therefore miss the best alignment.
In this talk, we will describe a data-parallel SW algorithm that can take advantage of almost any form of data parallel
hardware available today (e.g., multicore, GPUs, FPGAs or clusters of each). Our algorithm splits a target DNA
sequence into multiple subsequences and aligns a query against each of these subsequences in parallel, while producing
the same result as the sequential SW. Using this approach, we have obtained near-linear speedup on a 12-core
machine. By appropriately utilizing data parallel hardware, we think our approach can be as fast as approximate
algorithms like BLAST while at the same time not sacrificing alignment accuracy.
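For readers unfamiliar with the recurrence, here is a minimal sketch of the sequential SW scoring pass that the abstract takes as its baseline. It is not the speakers' data-parallel algorithm. The function name, the scoring values (match, mismatch, gap) and the linear gap penalty are assumptions chosen for illustration. Each cell depends on its left, upper, and upper-left neighbors, which is the data dependency that makes direct parallelization hard.

```python
def smith_waterman_score(query, target, match=2, mismatch=-1, gap=-1):
    """Best local alignment score of query vs. target.

    Illustrative scoring only: linear gap penalty, assumed parameter values.
    """
    prev = [0] * (len(target) + 1)  # previous row of the DP matrix
    best = 0
    for i in range(1, len(query) + 1):
        curr = [0] * (len(target) + 1)
        for j in range(1, len(target) + 1):
            same = query[i - 1] == target[j - 1]
            diag = prev[j - 1] + (match if same else mismatch)
            # Each cell needs the cells above, to the left, and diagonal;
            # a score never drops below 0, which makes the alignment local.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


if __name__ == "__main__":
    print(smith_waterman_score("TTACGTTT", "ACG"))
```

Only two rows are kept in memory because this sketch returns the score alone; recovering the alignment itself would require keeping the full matrix or traceback pointers.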
04/15: Identifying dysregulated pathways in cancer -- Hamid Bolouri, FHCRC
In the first part of this talk, I will briefly present results from a collaboration with the laboratories of Soheil
Meshinchi (FHCRC) and Bob Arceci (JHU) in which we are combining whole genome sequencing with clinical records,
genome-wide promoter-methylation, and mRNA expression to identify the pathways and mechanisms underlying pediatric
Acute Myeloid Leukemia.
The second part of the talk will be a discussion of present obstacles to network/pathway analysis of cancers: (1) lack
of sufficient pathway knowledge; (2) high degree of overlap among pathways; (3) disparities between pathway DBs; (4)
ambiguities in public datasets such as ENCODE.
04/22: Multi cancer analysis of a leukemia stem cell signature -- Ben Logsdon, UW CSE
04/29: Entropic Graph-based Posterior Regularization for Learning Probabilistic Models -- Max Libbrecht, UW CSE
Large graphical models often use factorization assumptions to enable tractable exact or approximate
inference. We define a new class of entropic graph-based regularizers that combine probabilistic inference with
iterative graph-based methods. These regularizers can represent arbitrary patterns of interaction between variables in
a probabilistic model while maintaining tractable inference. We present a method for performing inference on this
joint model and for learning its parameters using an algorithm akin to a generalized version of the EM algorithm. We
are motivated by applications in computational biology in which generative time-series models such as hidden Markov
models are used to learn a human-interpretable representation of genomic data. We use our approach to enable
interaction over great distances in the genome. In doing so, we integrate evidence across cell types for
semi-automated genome annotation, an important problem which has previously been addressed only crudely.
05/06: Compression and Assembly of Next Gen Sequence Data -- Daniel Jones, UW CSE
We present Quip, a lossless compression algorithm for next-generation sequencing data in the FASTQ and
SAM/BAM formats. In addition to implementing reference-based compression, we have developed, to our knowledge, the
first assembly-based compressor, using a novel de novo assembly algorithm. A probabilistic data structure is used to
dramatically reduce the memory required by traditional de Bruijn graph assemblers, allowing millions of reads to be
assembled very efficiently. Read sequences are then stored as positions within the assembled contigs. This is combined
with statistical compression of read identifiers, quality scores, alignment information and sequences, effectively
collapsing very large data sets to <15% of their original size with no loss of information. Availability: Quip is
freely available under the 3-clause BSD license from http://cs.washington.edu/homes/dcjones/quip.
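The abstract does not say which probabilistic data structure is used. As an illustration of the general idea only, the sketch below assumes a plain Bloom filter, one common choice for storing k-mer sets compactly. The class name, the sizes, and the SHA-256-based hashing are all hypothetical and are not taken from Quip's source.

```python
import hashlib


def kmers(read, k):
    """Yield every length-k substring of a read, left to right."""
    for start in range(len(read) - k + 1):
        yield read[start:start + k]


class KmerBloomFilter:
    """Approximate k-mer set: no false negatives, occasional false positives."""

    def __init__(self, num_bits=1 << 16, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, kmer):
        # Derive several bit positions by hashing the k-mer with different seeds.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{kmer}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, kmer):
        for pos in self._positions(kmer):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, kmer):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(kmer))


if __name__ == "__main__":
    seen = KmerBloomFilter()
    for kmer in kmers("ACGTACGGA", 4):
        seen.add(kmer)
    print("ACGT" in seen)
```

The memory cost is fixed by `num_bits` regardless of k-mer length, which is the kind of trade-off (bounded memory in exchange for a small false-positive rate) that makes such structures attractive for de Bruijn graph construction.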
05/13: From the Ramayana to Reverend Bayes: host defenses and zoonotic transmission of simian foamy virus -- Erick Matsen, FHCRC
Simian Foamy Virus (SFV) is a DNA retrovirus that is enzootic among nonhuman primates (NHP). It can be
transmitted to humans through bites; however, it does not appear to replicate in humans in the same way it does in
NHP. In this talk I will report the results of trying to understand that difference between hosts using sequence data
generated from a five-year project sampling both macaques and humans in Bangladesh. Along the way I will describe a
new Bayesian method we developed to detect the activity of the APOBEC hypermutation host defense; this plays a key
part in our interpretation of the data. This work is a collaboration with the labs of the virologist Maxine Linial
(FHCRC) and the primatologist Lisa Jones-Engel (UW).
05/20: Transparent and Collaborative Research within The Cancer Genome Atlas -- Larsson Omberg, Sage Bionetworks
05/27: ---- Holiday ----
06/03: Exploring seven oceanic strains of the cosmopolitan diatom T. pseudonana -- Tony Chiang, UW Oceanography