Papers, etc.
Note on Electronic Access to Journals
The UW Library is generally a paid subscriber to non-open-access journals we cite. You can freely access these
articles from on-campus computers. For off-campus access, follow the "[offcampus]" links below or look at the
library "proxy server" instructions. You will be
prompted for your UW net ID and password.
03/28: -- ---- Organizational Meeting ----
04/04: -- No Meeting
04/11: AI framework uncovers relationships between gene expression and Alzheimer's disease -- Nicasia Beebe-Wang
- N Beebe-Wang, S Celik, E Weinberger, P Sturmfels, PL De Jager, S Mostafavi, SI Lee, "Unified AI framework to uncover deep interrelationships between gene expression and Alzheimer's disease neuropathologies." Nat Commun, 12, #1 (2021) 5369.
04/18: Long-read sequencing to identify missing disease-causing variation -- Danny E. Miller
Two relevant papers:
- DE Miller, A Sulovari, T Wang, H Loucks, K Hoekzema, KM Munson, AP Lewis, EPA Fuerte, CR Paschal, T Walsh, J Thies, JT Bennett, I Glass, KM Dipple, K Patterson, and 36 others, "Targeted long-read sequencing identifies missing disease-causing variation." Am J Hum Genet, 108, #8 (2021) 1436-1449.
04/25: ML for de novo mass spec peptide sequencing -- Melih Yilmaz
- Melih Yilmaz, William E. Fondrie, Wout Bittremieux, Sewoong Oh, William Stafford Noble, "De novo mass spectrometry peptide sequencing with a transformer model."
Tandem mass spectrometry is the only high-throughput method for analyzing the protein content of complex biological
samples and is thus the primary technology driving the growth of the field of proteomics. A key outstanding
challenge in this field involves identifying the sequence of amino acids--the peptide--responsible for generating each
observed spectrum, without making use of prior knowledge in the form of a peptide sequence database. Although
various machine learning methods have been developed to address this de novo sequencing problem, challenges that
arise when modeling tandem mass spectra have led to complex models that combine multiple neural networks and
post-processing steps. We propose a simple yet powerful method for de novo peptide sequencing, Casanovo, that uses a
transformer framework to map directly from a sequence of observed peaks (a mass spectrum) to a sequence of amino
acids (a peptide). Our experiments show that Casanovo achieves state-of-the-art performance on a benchmark dataset
using a standard cross-species evaluation framework which involves testing with out-of-distribution samples, i.e.,
spectra with never-before-seen peptide labels. Casanovo not only achieves superior performance but does so at a
fraction of the model complexity and inference time required by other methods.
05/02: Learning inverse folding -- Pascal Sturmfels
05/09: -- No Meeting
05/16: Interpreting neural networks for biological sequences by learning stochastic masks -- Alyssa La Fleur
- Johannes Linder, Alyssa La Fleur, Zibo Chen, Ajasja Ljubetic, David Baker, Sreeram Kannan, and Georg Seelig, "Interpreting neural networks for biological sequences by learning stochastic masks." Nat Mach Intell 4, 41-54 (2022).
Sequence-based neural networks can learn to make accurate predictions from large biological datasets, but model interpretation remains challenging. Many existing feature attribution methods are optimized for continuous rather than discrete input patterns and assess individual feature importance in isolation, making them ill-suited for interpreting nonlinear interactions in molecular sequences. Here, building on work in computer vision and natural language processing, we developed an approach based on deep learning--scrambler networks--wherein the most important sequence positions are identified with learned input masks. Scramblers learn to predict position-specific scoring matrices where unimportant nucleotides or residues are scrambled by raising their entropy. We apply scramblers to interpret the effects of genetic variants, uncover nonlinear interactions between cis-regulatory elements, explain binding specificity for protein-protein interactions, and identify structural determinants of de novo-designed proteins. We show that scramblers enable efficient attribution across large datasets and result in high-quality explanations, often outperforming state-of-the-art methods. https://doi.org/10.1038/s42256-021-00428-6 [offcampus]. Also: https://www.biorxiv.org/content/10.1101/2021.04.29.441979v1
05/23: Large-scale genomic data integration and visualization: towards Augmented Genomics -- Wouter Meuleman, Altius Institute
Although technological developments have made it possible to construct rich genome-wide datasets measuring a variety of biological phenomena across hundreds of human cellular conditions, the scale and complexity preclude routine utility of such data. We develop computational and machine learning approaches to reduce their complexity, while maximally retaining relevant information. Our long-term research goal is to make Augmented Genomics a reality: a new field in which the work of genome scientists is supplemented -- not replaced! -- by large-scale visualization and data-driven machine intelligence. I'll present our current vision for this field, along with a number of directions we are working in.
05/30: -- Holiday