Although we are not directly following any particular textbook, the background readings for many of the topics will come from: Murphy, Kevin P. Machine Learning: A Probabilistic Perspective. Cambridge, MA: MIT Press, 2012. Below, we will denote this book using "KM".
Background: Introduction to Probability and Statistical Learning
Case Study I: Estimating Click Probabilities
Online learning: KM Sec. 8.5

Gradient Descent: KM Sec. 8.3.2
Boyd, Stephen and Lieven Vandenberghe. Convex Optimization. Cambridge: Cambridge University Press, 2004. Sec. 9.1, 9.2, 9.3. Print.
Accelerated descent methods
Tseng, Paul. "On accelerated proximal gradient methods for convex-concave optimization." Submitted to SIAM Journal on Optimization (2008).
Perceptron algorithm: KM Sec. 8.5.4
Freund, Yoav, and Robert E. Schapire. "Large margin classification using the perceptron algorithm." Machine Learning 37.3 (1999): 277-296.
Stochastic Gradient Descent: KM Sec. 8.5.2
Bottou, Léon, and Yann Le Cun. "Large Scale Online Learning." Advances in Neural Information Processing Systems 16: Proceedings of the 2003 Conference. Vol. 16. MIT Press, 2004.
Robust stochastic approximation approach to stochastic programming
Nemirovski, Arkadi, et al. "Robust stochastic approximation approach to stochastic programming." SIAM Journal on Optimization 19.4 (2009): 1574-1609.
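To make the stochastic gradient descent readings above concrete, here is a minimal sketch of SGD for logistic regression, the standard model for click-probability estimation. The sparse-feature representation, learning rate, and epoch count are illustrative assumptions, not anything prescribed by the readings:

```python
import math
import random

def sgd_logistic(data, dim, epochs=5, lr=0.1):
    """Plain SGD for logistic regression on sparse data.

    data: list of (x, y) pairs, x a dict {feature_index: value}, y in {0, 1}.
    """
    w = [0.0] * dim
    for _ in range(epochs):
        random.shuffle(data)
        for x, y in data:
            # Predicted probability: sigmoid of the linear score.
            score = sum(w[i] * v for i, v in x.items())
            p = 1.0 / (1.0 + math.exp(-score))
            # The log-loss gradient for one example is (p - y) * x.
            for i, v in x.items():
                w[i] -= lr * (p - y) * v
    return w
```

Each example is touched once per epoch and the weights are updated immediately, which is what makes the method "online" rather than batch.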
Sketching and Hashing

Bloom Filter
Wikipedia 
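As a companion to the reference above, a minimal Bloom filter sketch. Deriving the k probe positions from slices of a single SHA-256 digest is a simplifying assumption for brevity, not the textbook construction with k independent hash functions:

```python
import hashlib

class BloomFilter:
    def __init__(self, m=1024, k=3):
        self.bits = [False] * m
        self.m, self.k = m, k

    def _positions(self, item):
        # Derive k bit positions from one digest (an implementation shortcut).
        digest = hashlib.sha256(item.encode()).digest()
        for i in range(self.k):
            chunk = digest[4 * i: 4 * i + 4]
            yield int.from_bytes(chunk, "big") % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def __contains__(self, item):
        # No false negatives; false positives possible if all k bits collide.
        return all(self.bits[pos] for pos in self._positions(item))
```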
Count-Min Sketch
Cormode, Graham, and S. Muthukrishnan. "An improved data stream summary: the count-min sketch and its applications." Journal of Algorithms 55.1 (2005): 58-75.
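The count-min sketch estimates item frequencies in a stream using a small two-dimensional array of counters; because collisions can only add to a counter, the minimum over rows overestimates, never underestimates, the true count. A sketch under assumed sizes (width and depth are accuracy/space tuning parameters):

```python
import hashlib

class CountMinSketch:
    def __init__(self, width=512, depth=4):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        # One salted hash per row stands in for a pairwise-independent family.
        h = hashlib.md5(f"{row}:{item}".encode()).digest()
        return int.from_bytes(h[:4], "big") % self.width

    def add(self, item, count=1):
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += count

    def estimate(self, item):
        # Collisions only inflate cells, so the row-wise minimum is tightest.
        return min(self.table[row][self._index(item, row)]
                   for row in range(self.depth))
```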

Hash Kernels
Shi, Qinfeng, et al. "Hash kernels for structured data." The Journal of Machine Learning Research 10 (2009): 2615-2637.
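The core idea behind hash kernels is feature hashing: project an unbounded feature space into a fixed-dimensional vector by hashing each feature to a bucket, with a second hash bit supplying a sign so that inner products between hashed vectors stay approximately unbiased. A minimal sketch; the dimension and choice of MD5 are arbitrary assumptions:

```python
import hashlib

def hash_features(tokens, dim=16):
    """Hash a variable-length token list into a fixed-size vector."""
    vec = [0.0] * dim
    for tok in tokens:
        h = int(hashlib.md5(tok.encode()).hexdigest(), 16)
        idx = h % dim            # which bucket this feature lands in
        sign = 1.0 if (h // dim) % 2 == 0 else -1.0  # sign bit for unbiasedness
        vec[idx] += sign
    return vec
```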
Permutation Hashing
Li, Ping, Art Owen, and Cun-Hui Zhang. "One Permutation Hashing." Advances in Neural Information Processing Systems 25. 2012.
Personalization via Multitask Learning
Case Study II: Document Retrieval
Basic kNN, TF-IDF
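The retrieval baseline above combines TF-IDF weighting with nearest-neighbor search under cosine similarity. A brute-force sketch; the unsmoothed IDF formula used here is one of several common variants:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: list of token lists. Returns one sparse {term: weight} dict per doc."""
    n = len(docs)
    df = Counter(t for doc in docs for t in set(doc))  # document frequency
    vecs = []
    for doc in docs:
        tf = Counter(doc)
        vecs.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vecs

def cosine(u, v):
    dot = sum(u[t] * v[t] for t in u if t in v)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def nearest(query_vec, vecs):
    # 1-NN retrieval by cosine similarity, scanning every candidate.
    return max(range(len(vecs)), key=lambda i: cosine(query_vec, vecs[i]))
```

The linear scan in `nearest` is exactly what the fast-NN-search readings below (KD-trees, LSH) aim to avoid.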
Fast NN Search

KD-trees tutorial
Moore, Andrew W. "Efficient Memory-based Learning for Robot Control." Technical Report No. 209, Computer Laboratory, University of Cambridge, 1991. Print.
Approximate nearest neighbors by locality-sensitive hashing (LSH):
Andoni, Alexandr and Piotr Indyk. "Near-Optimal Hashing Algorithms for Approximate Nearest Neighbor in High Dimensions." Communications of the ACM, vol. 51, no. 1 (2008): 117-122.
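An LSH family for cosine similarity can be built from random hyperplanes: each hyperplane contributes one signature bit recording which side a vector falls on, and nearby vectors agree on most bits. A minimal sketch (Gaussian projections are the standard construction; the dimensions and seed here are arbitrary assumptions):

```python
import random

def make_hyperplanes(dim, n_bits, seed=0):
    rng = random.Random(seed)
    # Each hyperplane is a random Gaussian normal vector.
    return [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_bits)]

def lsh_signature(x, planes):
    # One bit per hyperplane: the sign of the projection of x onto its normal.
    return tuple(1 if sum(p_i * x_i for p_i, x_i in zip(p, x)) >= 0 else 0
                 for p in planes)
```

Vectors with the same direction hash identically (the signature ignores positive scaling), so bucketing by signature groups candidates with high cosine similarity.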
Practical insights for approximate nearest neighbors:
Gray, Alexander, Ting Liu and Andrew W. Moore. "New Algorithms for Efficient High-Dimensional Nonparametric Classification." Journal of Machine Learning Research 7 (2006): 1135-1158.
All pairs NN:
Ram, Parikshit, et al. "Linear-time Algorithms for Pairwise Statistical Problems." Advances in Neural Information Processing Systems 22 (2009): 1527-1535.
Clustering: KM Sec. 25.1

K-means: KM Sec. 11.4.2.5
Moore, Andrew W. Tutorials of K-means and Hierarchical Clustering. The Auton Lab at Carnegie Mellon University. Web. 4 Jan. 2013.
Mixture modeling (generative): KM Sec. 11.1-11.4

Spectral clustering: KM Sec. 25.4
Von Luxburg, Ulrike. "A tutorial on spectral clustering." Statistics and Computing 17.4 (2007): 395-416.
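Lloyd's algorithm for k-means, the workhorse behind the clustering readings above, alternates an assignment step and a mean-update step. A minimal sketch; initializing from a random sample of the data, using a fixed iteration budget, and leaving an empty cluster's center in place are simplifying assumptions:

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """points: list of equal-length tuples. Returns the k final centers."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        # Assignment step: attach each point to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda c: sum((a - b) ** 2
                                      for a, b in zip(p, centers[c])))
            clusters[j].append(p)
        # Update step: move each center to the mean of its cluster.
        for j, cl in enumerate(clusters):
            if cl:
                centers[j] = tuple(sum(xs) / len(cl) for xs in zip(*cl))
    return centers
```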
Mixed Membership Models: KM Sec. 27.3
Basic LDA:
Blei, David M., Andrew Y. Ng, and Michael I. Jordan. "Latent Dirichlet Allocation." Journal of Machine Learning Research 3 (2003): 993-1022.
Introduction:
Blei, David M. "Probabilistic topic models." Communications of the ACM, vol. 55, no. 4 (2012): 77-84.
Sampling:
Griffiths, Thomas L. and Mark Steyvers. "Finding scientific topics." Proceedings of the National Academy of Sciences of the United States of America 101, Suppl. 1 (2004): 5228-5235.
Advanced reading: KM Sec. 21.1-21.3

Online LDA:
Hoffman, Matt, et al. "Stochastic Variational Inference." arXiv:1206.7051 (2012). 
Large-scale LDA:
Mimno, David, Matthew D. Hoffman and David M. Blei. "Sparse stochastic inference for latent Dirichlet allocation." International Conference on Machine Learning, 2012. 
Distributed LDA:
Ahmed, Amr, et al. "Scalable inference in latent variable models." Proceedings of the Fifth ACM International Conference on Web Search and Data Mining (2012): 123-132.
Case Study III: fMRI Prediction
Linear and logistic regression: KM Sec. 7.1-7.3, 7.5, 8.1-8.3, 8.5
LASSO: KM Sec. 13.1, 13.3, 13.4
Original:
Tibshirani, Robert. "Regression Shrinkage and Selection via the Lasso." Journal of the Royal Statistical Society, Series B (Methodological) 58.1 (1996): 267-288.
Bayesian interpretation (optional):
Park, Trevor and George Casella. "The Bayesian Lasso." Journal of the American Statistical Association 103.482 (2008): 681-686.
Stochastic l1 regularized loss minimization:
Shalev-Shwartz, Shai, and Ambuj Tewari. "Stochastic Methods for l1 Regularized Loss Minimization." (2009).
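The l1 penalty in the lasso objective is handled cleanly by proximal gradient descent (ISTA): take a gradient step on the smooth squared-error term, then apply the soft-thresholding operator. A pure-Python sketch on dense lists; the step size and iteration count are chosen for a tiny example rather than by any line-search rule:

```python
def soft_threshold(z, t):
    # Proximal operator of the l1 norm: shrink toward zero by t.
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

def lasso_ista(X, y, lam, lr=0.01, iters=2000):
    """Minimize (1/2)||Xw - y||^2 + lam * ||w||_1 by proximal gradient."""
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for _ in range(iters):
        # Gradient of the smooth least-squares term: X^T (Xw - y).
        resid = [sum(X[i][j] * w[j] for j in range(d)) - y[i]
                 for i in range(n)]
        grad = [sum(X[i][j] * resid[i] for i in range(n)) for j in range(d)]
        # Gradient step followed by the l1 proximal (soft-threshold) step.
        w = [soft_threshold(w[j] - lr * grad[j], lr * lam) for j in range(d)]
    return w
```

Soft-thresholding is what produces exact zeros in the solution, i.e. the feature selection the fMRI case study relies on.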
Zero-shot learning

Features of words:
Mitchell, Tom M., et al. "Predicting human brain activity associated with the meanings of nouns." Science Vol. 320, no. 5880 (2008): 1191-1195.
Features of words and learning from people:
Palatucci, Mark, et al. "Zero-shot learning with semantic output codes." Advances in Neural Information Processing Systems 22 (2009): 1410-1418.
Slides on papers above:
Tom Mitchell's slides
Graphical LASSO: KM Sec. 26.7
Original:
Friedman, Jerome, Trevor Hastie and Robert Tibshirani. "Sparse inverse covariance estimation with the graphical lasso." Biostatistics 9.3 (2008): 432-441.
Slides
New insights (optional):
Mazumder, Rahul and Trevor Hastie. "The Graphical Lasso: New Insights and Alternatives." arXiv:1111.5479v2 (2012)
Parallel learning
(Shotgun) Stochastic coordinate descent:
Bradley, Joseph, et al. "Parallel Coordinate Descent for L1-Regularized Loss Minimization." International Conference on Machine Learning (2011).
Stochastic gradient descent:
Niu, Feng, et al. "HOGWILD!: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent." arXiv:1106.5730v2 (2011).
Averaging methods:
Zhang, Yuchen, et al. "Communication-Efficient Algorithms for Statistical Optimization." arXiv:1209.4129v1 (2012).
Alternating Direction Method of Multipliers (ADMM):
Boyd, Stephen, et al. "Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers." Foundations and Trends in Machine Learning, Vol. 3, No. 1 (2010): 1-122.
Case Study IV: Collaborative Filtering
Collaborative Filtering:
Matrix Factorization:
Probabilistic matrix factorization:
Salakhutdinov, Ruslan, and Andriy Mnih. "Probabilistic matrix factorization." Advances in Neural Information Processing Systems 20 (2008): 1257-1264.
Exact Matrix Completion via Convex Optimization:
Candès, Emmanuel J., and Benjamin Recht. "Exact matrix completion via convex optimization." Foundations of Computational Mathematics 9.6 (2009): 717-772.
Document clustering via non-negative matrix factorization:
Xu, Wei, Xin Liu and Yihong Gong. "Document clustering based on non-negative matrix factorization." Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (2003): 267-273.
Fast Max-Margin Factorization (optional):
Rennie, Jason D. M. and Nathan Srebro. "Fast Maximum Margin Matrix Factorization for Collaborative Prediction." Proceedings of the 22nd International Conference on Machine Learning (2005).
Large-scale by divide and conquer (optional):
Mackey, Lester, Ameet Talwalkar, Michael I. Jordan. "Divide-and-Conquer Matrix Factorization." arXiv:1107.0789v6 (2012).
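Matrix factorization models like those above are commonly fit by SGD over the observed ratings: for each (user, item, rating) triple, nudge the two factor rows along the gradient of the regularized squared error. A minimal sketch; the rank, learning rate, regularization, and epoch count are arbitrary choices for a toy example:

```python
import random

def factorize(ratings, n_users, n_items, k=2, lr=0.02, reg=0.02,
              epochs=200, seed=0):
    """ratings: list of (user, item, value). Returns factors (U, V), R ~ U V^T."""
    rng = random.Random(seed)
    U = [[rng.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(n_users)]
    V = [[rng.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(n_items)]
    for _ in range(epochs):
        for u, i, r in ratings:
            # Error on this observed entry under the current factors.
            err = r - sum(U[u][f] * V[i][f] for f in range(k))
            for f in range(k):
                # Gradient step with l2 regularization on both factor rows.
                uf, vf = U[u][f], V[i][f]
                U[u][f] += lr * (err * vf - reg * uf)
                V[i][f] += lr * (err * uf - reg * vf)
    return U, V
```

Because each update touches only one user row and one item row, this loop is also the natural unit of work for the parallel and distributed systems (GraphLab, HOGWILD!-style schemes) discussed elsewhere in this list.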
Cold-start Problem (zero-shot learning), Incorporating Features:
Basic concept:
Schein, Andrew I., et al. "Methods and Metrics for Cold-Start Recommendations." Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (2002): 253-260.
Unified approach:
Menon, Aditya Krishna and Charles Elkan. "Link prediction via matrix factorization." Proceedings of the 2011 European Conference on Machine Learning and Knowledge Discovery in Databases, Part II (2011): 437-452.
Parallel Learning with GraphLab:
Original Paper:
Low, Yucheng, et al. "GraphLab: A New Parallel Framework for Machine Learning." Proceedings of the Conference on Uncertainty in Artificial Intelligence (2010).
Cloud-based:
Low, Yucheng, et al. "Distributed GraphLab: A Framework for Machine Learning and Data Mining in the Cloud." Proceedings of the VLDB Endowment (PVLDB), Vol. 5, No. 8 (2012): 716-727.
GraphLab 2:
Gonzalez, Joseph E., et al. "PowerGraph: Distributed graph-parallel computation on natural graphs." Proceedings of the 10th USENIX Conference on Operating Systems Design and Implementation (2012): 17-30.
GraphChi (GraphLab on disk):
Kyrola, Aapo, Guy Blelloch and Carlos Guestrin. "GraphChi: Large-scale graph computation on just a PC." Proceedings of the 10th USENIX Conference on Operating Systems Design and Implementation (2012): 31-46.
Advanced reading (optional):
Stochastic block models:
Airoldi, Edoardo M., et al. "Mixed Membership Stochastic Blockmodels." Journal of Machine Learning Research 9 (2008): 1981-2014.
Mixed-membership matrix factorization:
Mackey, Lester, David Weiss and Michael I. Jordan. "Mixed Membership Matrix Factorization." Proceedings of the 27th International Conference on Machine Learning, 2010.
Scalable stochastic block models:
Gopalan, Prem, et al. "Scalable Inference of Overlapping Communities." Neural Information Processing Systems, 2012.
Gopalan, Prem. "Scalable Inference of Overlapping Communities." Neural Information Processing Systems, 2012.
Copyright © 2013 University of Washington