Logistics

Instructor

Teaching Assistants


Content

What is this course about?

Data analysis is a central activity for scientific research and is increasingly a critical part of decision making in government and business. However, producing reliable data analysis outcomes is challenging, since the decisions made throughout the analysis process can dramatically affect the eventual outcome. This Data Science Capstone focuses on the complete end-to-end process of data analysis performed with code: the iterative, and often exploratory, steps that analysts go through to turn data into results. Our focus is not limited to statistical modeling or machine learning; it covers the complete process, including transformation, exploration, modeling, and evaluation choices.

Students will work in groups of four on a single project that ties together and applies previous experience from CSE 312, 332, 446, 442, 344, and other classes. Students are expected to already know the appropriate machine learning, visualization, and database methods, and will focus on applying those methods independently in the context of their project. There will therefore be limited lecture material in this course; course staff will instead work closely with students to critique and advise on their group projects. Students will experience the end-to-end data analysis process, from transformation and exploration of data to modeling and evaluation. Each group will brainstorm project ideas during the first week, then collaboratively explore the data and implement a complete data analysis workflow. This capstone course gives hands-on experience with selecting a data science question and with crafting and evaluating a data science process to answer that question.

Prerequisites

Students should have completed CSE 332 and CSE 312, and at least one of CSE 446, CSE 442, or CSE 344. There are no other requirements for participating in this capstone class.


Schedule

Lecture slides will be posted here shortly before each lecture. If you wish to view slides further in advance, refer to last year's slides, which are mostly similar.

This schedule is subject to change.

| Date | Class Topic | Assignment due at midnight before class (see Deliverables for details) |
|---|---|---|
| Tue Oct 1 | Introduction, Project Pitches, and Group Assignment [slides] | Mandatory Project Pitches |
| Tue Oct 8 | Data Science Process and Objectives [slides] | Project Plan Presentation Video & Project Selection Reflection |
| Tue Oct 15 | Data Science by Example [slides] | Reflection on example data science paper |
| Tue Oct 22 | Data Science at Scale [slides] [colab] | Validity Reflection Presentation Video |
| Tue Oct 29 | Communicating Data Science through Visualization [slides] [seaborn tutorial] [worksheet] | Spark Word Count Assignment |
| Tue Nov 5 | Midpoint Project Presentations and Feedback | Midpoint Presentation Video |
| Tue Nov 12 | Data Science through Causal Inference I [slides] | Midpoint Feedback Reflection and Action Plan |
| Tue Nov 19 | Data Science through Causal Inference II [slides] [causal inference lab] | - |
| Tue Nov 26 | Technical Writing for Data Science [slides] | - |
| Tue Dec 3 | Final Project Presentations and Feedback | Final Presentation Video |
| Sun Dec 8 | Final Project Report Submission Deadline (due by midnight) | Final Project Report & Summary of Individual Contribution to Project & Final Reflection; Optional Project Deliverable |