Today, data analysis methods in machine learning and statistics play a central role in industry and science. The growth of the Web and improvements in data collection technology in science have led to a rapid increase in the magnitude and complexity of these analysis tasks. This growth is driving the need for scalable, parallel and online algorithms and models that can handle this "Big Data". This course will provide a broad foundation for this timely challenge.
In particular, we focus on the challenges associated with datasets of massive size and dimensionality, including settings where the dimensionality of the data is growing faster than the number of data points. Framed by canonical examples of big data applications in science and industry, we will present a core set of techniques, both in terms of algorithms and models, to tackle these challenges. We will also explore the computational foundations associated with performing these analyses in the context of parallel and cloud architectures.
Large-scale modeling techniques covered will include linear models, graphical models, matrix and tensor factorizations, clustering, and latent factor models. Algorithmic topics include sketching, fast n-body problems, random projections and hashing, large-scale online learning, and parallel learning. The computational techniques covered in this course will provide a basic foundation in large-scale programming, ranging from the basic "parfor" to parallel abstractions, such as MapReduce (Hadoop).
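As a rough illustration of the step from a "parfor" loop to a MapReduce-style abstraction, here is a minimal single-machine sketch in Python. It is not Hadoop code and not course material: the function names (count_words, merge_counts, parallel_word_count) are hypothetical, and a thread pool stands in for a cluster. In standard CPython builds, threads will not speed up this CPU-bound work; the point is only the structure of an independent map step followed by an associative reduce step.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def count_words(document: str) -> Counter:
    """Map step: count the words in one document, independently of all others."""
    return Counter(document.lower().split())


def merge_counts(left: Counter, right: Counter) -> Counter:
    """Reduce step: combine two partial counts (associative and commutative)."""
    return left + right


def parallel_word_count(documents, max_workers: int = 4) -> Counter:
    """Run the map step as a parallel-for, then fold the partial results."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # The "parfor": each document is processed with no shared state.
        partial_counts = list(pool.map(count_words, documents))
    # The reduce: merge the per-document counts into one total.
    return reduce(merge_counts, partial_counts, Counter())


if __name__ == "__main__":
    docs = ["big data needs scalable models", "scalable models for big data"]
    print(parallel_word_count(docs))
```

Because the reduce function is associative, a real framework is free to merge partial counts in any grouping across machines, which is what makes the abstraction scale beyond a single loop.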
To be successful in this course, students should have prior exposure to basic statistical and machine learning concepts, such as those covered in CSE 546 or STAT 535. As needed, we will also provide background reading on certain topics throughout the quarter.
IMPORTANT: All class announcements will be broadcast using the Canvas discussion board. The same applies to questions about homeworks, projects and lectures. If you have a question about a personal matter, please email the instructors' list: email@example.com. Otherwise, please send all questions to this board, since other students may have the same questions, and we need to be fair in terms of how we interact with everyone. Also, please feel free to participate, answer each other's questions, etc.
Each homework assignment contains both theoretical questions and programming components. Homeworks must be submitted by the posted due date.
COLLABORATION POLICY: Homework must be done individually: each student must hand in their own answers. In addition, each student must submit their own code for the programming part of the assignment (we may run your code). It is acceptable, however, for students to collaborate in figuring out answers and to help each other solve the problems (for HWs 1, 2, and 4). You must also indicate on each homework with whom you collaborated.
RE-GRADING POLICY: All grading-related requests must be submitted to the TA via email only. Office hours and in-person discussions are limited solely to knowledge-related questions, not grade-related questions. If you feel that we have made an error in grading your homework, please let us know with a written explanation, and we will consider the request. Please note that regrading of a homework may cause your grade to go up or down on the entire homework set.
LATE POLICY: Homeworks must be submitted by the posted due date. Any assignment turned in late will incur a reduction of 33% in its final score for each day (or part thereof) that it is late. For example, an assignment that is up to 24 hours late incurs a penalty of 33%, and one that is up to 48 hours late incurs a penalty of 66%. Anything later receives no credit. You must turn in all 4 homeworks, even if for zero credit, in order to pass the course. (Empty homeworks do not count.)
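The arithmetic of the late policy above can be sketched as a small Python function. This is an illustrative reading of the policy, not an official grading script: the names late_penalty and final_score are hypothetical, lateness is assumed to be measured in hours, and exactly 24 hours late is treated as one day. Note that the third day is handled as no credit, as the policy states, rather than as 3 x 33% = 99%.

```python
import math


def late_penalty(hours_late: float) -> float:
    """Return the fraction of the score deducted for a late submission."""
    if hours_late <= 0:
        return 0.0  # submitted on time
    # Each day "or part thereof" counts as a full day.
    days_late = math.ceil(hours_late / 24)
    if days_late >= 3:
        return 1.0  # more than 48 hours late: no credit
    return 0.33 * days_late


def final_score(raw_score: float, hours_late: float) -> float:
    """Apply the late penalty to a raw homework score."""
    return raw_score * (1.0 - late_penalty(hours_late))


if __name__ == "__main__":
    for hours in (0, 5, 24, 30, 48, 49):
        print(f"{hours:>2} hours late: score {final_score(100, hours):.0f} / 100")
```

Under these assumptions, a homework worth 100 points that is submitted 30 hours late falls in the second day and would keep 34 points.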
No exceptions to the grading policies will be made (unless based on university policies). If you are not able to comply with the late homework policy due to travel, conferences, other deadlines, or any other reason, do not enroll in the course.
HONOR CODE: As we sometimes reuse problem set questions from previous years, which are covered by papers and webpages, we expect students not to copy, refer to, or look at the solutions in preparing their answers (referring to unauthorized material is considered a violation of the honor code). Similarly, we expect students not to google directly for answers. The homework is meant to help you think about the material, and we expect you to make an honest effort to solve the problems. If you do happen to use other material, it must be acknowledged clearly with a citation on the submitted solution.
Project Page Link
You are expected to complete a final project for the class. This will provide you with an opportunity to apply the machine learning concepts you have learned. We will update the project requirements and due dates during the quarter.
Topic I: Estimating Click Probabilities
Topic II: (Document) Retrieval and Clustering
Topic III: Optimization and Parallelization in the Big Data Regime
Topic IV: Exploration and Information Gathering