Overview¶
Propose a data analysis project to the course staff. This can be almost anything that you choose. You might select a project from your field of study, your extracurricular interests, government or public policy, or elsewhere. Another good source of ideas is repeating an analysis that you read about in the popular press or in a scientific paper — usually you will do a simplified version of the analysis. We are just looking for you to show that you have a better understanding of how data works and the kinds of questions you might want to answer!
The goal of the project proposal is for you to describe your idea in enough detail that the course staff can evaluate whether it is an appropriate project. You may work in a group of up to 3 people total. The final project is composed of three main parts: a proposal phase, a deliverable phase, and a review phase. These parts cannot be submitted late and cannot be resubmitted through the resubmission process. Once the deadline for a project part passes, you will no longer be able to make further submissions for that part of the project.
Requirements¶
The following sections describe some requirements for your proposal and project more broadly. Your project proposal will probably be about 1-2 pages long, but it is acceptable for the proposal to be longer or shorter as long as it sufficiently covers all of the required sections. Do not worry too much about the length; just focus on conveying the required information in as much detail as you think is relevant. Submit your proposal as a PDF file. (In Microsoft Word, you can choose “Save As” to save your document as a PDF file. Do not turn in a Word document or plain text file.)
Proposal Format¶
Your project proposal should include the following sections:
- Title and author(s). The title should reflect your specific research questions (not just “CSE 163 Project”).
- Summary of research questions. Give a numbered list of 3 or more research questions and a brief description of what you will investigate. Each research question should have 1–3 sentences and propose something that can be definitively answered, not just a general topic or area of investigation.
- Motivation. Expand on your research questions by providing context about why you care about the problem. How does knowing the answers affect the world or our understanding of it?
- Dataset. Describe the real, existing dataset that you will use, including exact URLs. You may not use a dataset that has been used in a previous CSE 163 assignment or demo. You may not repeat an analysis that you have performed in another class (though you might do something inspired by another class). The data must be real — neither you nor someone else may make up the data.
- Challenge goals. Select at least 2 challenge goals that you are planning to meet. Justify why you think your project will meet each goal. Try to bold the name of your challenge goal to make it obvious which one you are talking about. If you would like to meet more than 2 goals, discuss the 2 goals that you’re most passionate about. It is okay if you’re not sure how to complete the challenge goal at this stage; we will learn some techniques to address fundamental aspects of all of them during the quarter.
- Method. For each research question, outline the steps you will take to implement your project deliverables. You should provide enough detail that someone else could independently reproduce your experiment. For each computation, indicate how the result would lead you to a conclusion about one or more research questions. Highlight connections to challenge goals. Refer to specific rows, columns, or elements in your dataset. You will not need to provide any code, but your methodology should contain enough details that a TA will be able to understand how you will approach the computation.
- Work Plan. Describe your plan to divide the project into at least 3 (but no more than 7) discrete tasks, each with an estimate of the time in hours it will require. Then, describe your workflow for developing code, testing code, and coordinating work, particularly how team members can support each other in case one task is unexpectedly challenging. You should also discuss with your group the specifics of how you will develop Python and show that you’ve set up your development environment (see Development Setup below for more information on development environments). All team members must contribute equally to the deliverables.
  - While you don’t need to submit any proof you’ve done so, you should do the local development setup for Python on your computer if possible. A very common challenge students face on the project is realizing too late that they can’t use an online solution because their dataset is too large or the environment lacks a library they want. Install Python on your computer early to avoid running into this later.
 
Dataset¶
You are not allowed to use a dataset that we have used in class as the central dataset of your project. You can use one of our datasets to add on to a part of your project, but it should not be the central focus.
Your report must include simple, clear, unambiguous instructions that anyone can follow to download the data themselves. You should not rely on the staff to be able to “figure out” how to access the data; you must provide instructions for us. If the data is not publicly available or requires an account to log in, then you should make it available to course staff or include it with your assignment submission. One way to do this is to host your dataset on Dropbox or Google Drive and provide a link to this in your report. If the full dataset is too large to download, then provide a subset of it for the course staff to experiment with.
Do not use a dataset that cannot be shared with the course staff, such as one that contains confidential medical information or intellectual property. Do not manually perform any data cleaning steps — all data cleaning should be done programmatically, by Python code that you write.
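To make “programmatic cleaning” concrete, here is a minimal sketch of what such a step might look like in pandas. The column names and the inlined sample rows are hypothetical stand-ins for a real dataset, and the specific cleaning choices (dropping duplicates, coercing unparseable salaries to missing, dropping incomplete rows) are assumptions that you would need to justify for your own data.

```python
import io

import pandas as pd

# Hypothetical raw data, inlined so the sketch is self-contained.
# A real project would read the downloaded file instead.
RAW_CSV = """id,age,salary,location
1,42,"110,000",Seattle
2,23,56000, kenmore
3,18,unknown,tacoma
3,18,unknown,tacoma
4,,20000,Seattle
"""


def clean(raw):
    """Return a cleaned copy of raw; every step is reproducible code."""
    df = raw.drop_duplicates().copy()
    # Strip thousands separators, then turn unparseable entries into NaN.
    df["salary"] = pd.to_numeric(
        df["salary"].str.replace(",", "", regex=False), errors="coerce"
    )
    # Normalize inconsistent whitespace and capitalization.
    df["location"] = df["location"].str.strip().str.title()
    # Drop rows missing a value the analysis depends on.
    return df.dropna(subset=["age", "salary"]).reset_index(drop=True)


raw = pd.read_csv(io.StringIO(RAW_CSV))
cleaned = clean(raw)
print(cleaned)
```

Because every change lives in `clean`, anyone with the original file can rerun the script and arrive at exactly the same cleaned table, which is not true of edits made by hand in a spreadsheet.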
Challenge Goals¶
Challenge goals help us to define expectations while still offering flexibility for you to design your own project. Meeting 2 or more challenge goals will likely require writing at least 120 lines of Python code. In order to qualify for a challenge goal, you must experiment with something not explicitly discussed in the course. Here are the challenge goals you can choose from. Some of the bullets include examples; these are not meant to be exhaustive lists.
- Multiple Datasets: Use multiple datasets together to come up with a richer analysis. This requirement is not just about using more than one dataset across your research questions, but is more about combining the datasets to make a more in-depth analysis. This will likely require you to join or combine tables together to solve a research question. Our take-home assessment 5 that combines Census data with food access data is a good example of this.
- Messy Data: Use a dataset that is not cleanly presented to you in a CSV already (or any other format that could be easily converted into a CSV). Some possible examples: writing code to scrape data from the web, using an API to gather data, or using some dataset that requires a lot of pre-processing to be usable.
- Result Validity: Do some extra work on top of your results to verify the validity of the results. For example, using some test of statistical significance to verify your results aren’t likely to happen by chance.
- Machine Learning: Dive deep into applying machine learning to your dataset to gain insights about the data or use it to make predictions about the future. Be explicit with what your goal is and how you will assess if you meet that goal. One example could be looking at various model types (and different settings of their hyperparameters) to identify which model is “best” (by how you define best). Another could be looking at how to use an “interpretable model” to understand which features are the most informative for how a decision is made. This challenge goal requires going above and beyond the fundamental steps of a Machine Learning pipeline we introduce in class. To achieve this challenge goal, you need to demonstrate exploration of more ideas in machine learning.
- New Library: Learn a new Python library and use it in your project in a significant way to help with your analysis. Part of this class is being able to learn libraries in Python. Show that you are able to apply what you’ve learned to a library we have not discussed in depth in this course. In the Libraries section below, we list some recommended libraries (and a complete list of the libraries we will cover in this class, which do not count as new).
- Other: If you are thinking of doing a project that you think is challenging enough but does not fit into any of these challenge goals, you can propose a new one to explain why you think your project will be challenging. Do remember that this should be a last resort since it can be challenging to assess the difficulty of your new goal. We reserve the right to deny your proposed challenge goal and you will need to go back and figure out how to make your project challenging enough before submitting.
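As an illustration of what “combining” datasets can mean for the Multiple Datasets goal, the sketch below joins two small tables on a shared key with `DataFrame.merge`. The tables, column names, and values are all hypothetical and only loosely inspired by the Census and food access example; a real project would load two real datasets.

```python
import pandas as pd

# Hypothetical tables standing in for two real datasets that share a key.
population = pd.DataFrame({
    "tract_id": ["A1", "A2", "A3"],
    "population": [1200, 800, 1500],
})
food_access = pd.DataFrame({
    "tract_id": ["A1", "A3", "A4"],
    "low_access_count": [300, 150, 90],
})

# A left join keeps every population row. indicator=True adds a "_merge"
# column recording whether each row found a match in food_access.
merged = population.merge(
    food_access, on="tract_id", how="left", indicator=True
)

# Look at unmatched rows before analyzing, so missing data is not silent.
unmatched = merged[merged["_merge"] == "left_only"]
print(f"{len(unmatched)} of {len(merged)} tracts have no food access record")

# A question that needs both tables: what share of each tract has low access?
merged["low_access_share"] = merged["low_access_count"] / merged["population"]
print(merged[["tract_id", "low_access_share"]])
```

The choice of `how="left"` is a design decision worth stating in your Method section: an inner join would silently drop the unmatched tract, which could bias any averages computed afterwards.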
Challenge Goal Requirements¶
Additional requirements for each of the challenge goals listed above are as follows. Your challenge goals must meet these minimum requirements in order to be counted.
- Multiple Datasets
  - Must have at least 3 datasets that are used together in some way
  - At least two of the research questions should involve at least two datasets
  - Must have at least one join/merge operation between the datasets
- Messy Data
  - Data cannot be in a file format that we have worked with in class (.csv, .shp, .json, .txt) or be easily converted to one of these forms (i.e., cannot be converted in 5 or fewer lines of code)
  - Data from web-scraping, from an API, or which requires lots of preprocessing all count as messy data
    - Example: using imputation for large amounts of missing data
- Result Validity
  - Validity must be verified through statistical testing or some other proven testing method using domain expertise
  - Any test that is used must be justified and explained in the context of the data and research questions
  - Results of the validity tests must be clearly interpreted alongside preliminary results from the analysis
  - Validity tests do not count towards required testing modules in the final report and code as they should be validating results, not testing the robustness of your code
- Machine Learning
  - One or both of the following:
    - Using a new model that was not previously discussed in class (no DecisionTrees)
    - Training several models with different hyperparameters
- New Library
  - Must use at least one library that was not introduced in class (see below) to answer at least two research questions
  - Multiple libraries do not count as separate challenge goals
    - e.g., using SciPy and plotly counts as one challenge goal
  - The new library may be used in tandem with one of the other challenge goals
    - e.g., using PyTorch to create advanced machine learning models
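For the Result Validity goal, one possible form of statistical testing is a permutation test, sketched below using only the standard library. This is just one option among many and not a required method; the function name, the two groups, and their values are hypothetical and exist only to make the sketch runnable.

```python
import random


def permutation_test(group_a, group_b, num_shuffles=10_000, seed=0):
    """Estimate a two-sided p-value for the difference in group means.

    Repeatedly shuffles the pooled values and counts how often a random
    split produces a difference at least as large as the observed one.
    """
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    size_a = len(group_a)

    def mean_gap(values):
        first, second = values[:size_a], values[size_a:]
        return abs(sum(first) / len(first) - sum(second) / len(second))

    pooled = list(group_a) + list(group_b)
    observed = mean_gap(pooled)
    extreme = 0
    for _ in range(num_shuffles):
        rng.shuffle(pooled)
        if mean_gap(pooled) >= observed:
            extreme += 1
    return extreme / num_shuffles


# Hypothetical measurements, for illustration only.
group_a = [52, 55, 61, 58, 60, 57]
group_b = [48, 50, 47, 53, 49, 51]
p_value = permutation_test(group_a, group_b)
print(f"Estimated p-value: {p_value:.4f}")
```

A small p-value here suggests the observed gap between the groups would rarely arise from random labeling alone. In a proposal you would still need to justify why this test suits your data and interpret the result alongside your research question, as the requirements above describe.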
 
Advice¶
The following sections have some advice on dataset sources and libraries that you might want to look into. We also provide some more context for what we will be covering for the rest of the quarter.
Sources of Data¶
The best approach is to start with a problem that interests you, and then look for data. Google will be your friend for finding a dataset. Alternatively, an equally valid approach is to start with a dataset that looks interesting and design questions that interest you around that data. You may use one of the datasets that you (or your teammates) used during the Data Exploration assignment, or you can find a new one. Here are some possible data sources, but many more exist:
- A variety of data sets are available from UW Libraries
- Awesome Public Datasets - large variety of maintained data sets
- Baron Schwartz’s list of datasets. Some of these are themselves rich lists of datasets, such as the Amazon AWS public data sets.
- Data.gov for U.S. open government data, data.wa.gov for Washington state open government data, and data.seattle.gov for Seattle open government data
- SQLShare: public scientific datasets. Some require considerable knowledge to interpret, others are easier to understand. You can select “All datasets” and then filter by keyword, or you can select a tag from among those in the left column.
- Reddit Data Sets
- Civic Data Sets for the Pacific Northwest
- An archive of datasets distributed with the R statistical language
- 30 Places to Find Open Data on the Web Visual.ly
- Office for National Statistics (UK) - a repository of detailed statistics about Great Britain and Northern Ireland
- World Bank Data Catalog
- CDC NCHS Data - CDC’s National Center for Health Statistics Data Access
- Machine Learning Repository - large variety of maintained data sets
- For datasets used in CSE 163 Lessons (remember, these can’t be the central part of your project)
Later in the course, we will explore geospatial data to create maps. If mapping sounds interesting to you, feel free to use any of the class datasets below! (Again, it’s OK if you’re not completely sure how to use the data yet; none of these datasets will be the focus of your project anyway!)
Libraries¶
You are free to use almost any Python library you find that will be useful to you, especially the ones we have learned in class this quarter. If you already know of a library you will use (not required for the proposal), please mention it in your proposal. Below, we list some libraries you might want to look into for your project since students have found them useful in the past:
- Download from Web: requests
- Scientific Computing: SciPy, statsmodels
- Web Scraping: Beautiful Soup, selenium
- Natural Language Processing: spaCy, nltk
- Advanced/Interactive Visualizations: altair or plotly
- Domain-Specific Applications: astropy, chempy, Bioconductor, cantera
You are also encouraged to take advantage of the libraries we learned this quarter to help you on this project (but they won’t count as a new library for that challenge goal). The list below includes all libraries we have discussed (or will discuss) in depth in CSE 163.
- pandas
- seaborn
- matplotlib
- scikit-learn
- geopandas (geo-spatial data, covered in Module 5)
- numpy (image data, covered in Module 6)
Development Setup¶
One of the challenges with developing your own project is that you need to make sure you have the computing resources you need. There are some excellent online options that let you experiment with your code in a notebook, but they usually have limits on what dataset sizes or libraries you can use.
For the project, Jupyter Notebooks are an excellent way to prototype your project, but your final deliverable should be a runnable script. You do not need to write code for the Project Proposal, but you and your group should explore your development setup sooner rather than later to make sure it is appropriate for your project. Below we list some options that you can use for online prototyping of your project, but you should still install Python on your computer so you can write your finalized scripts for the final deliverable.
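To show the difference between notebook cells and a runnable script, here is a minimal sketch of a script organized around a `main` function. The function names are hypothetical, and the inlined sample rows (taken from the example salary table later in this spec) stand in for a real file so that the sketch runs as-is.

```python
import csv
import io

# Tiny inline sample so the sketch runs without any external file.
SAMPLE_CSV = "id,age,salary\n1,42,110000\n2,23,56000\n3,18,20000\n"


def load_rows(source):
    """Read CSV rows from an open file-like object into a list of dicts."""
    return list(csv.DictReader(source))


def average(rows, column):
    """Return the mean of a numeric column, skipping blank entries."""
    values = [float(row[column]) for row in rows if row[column] != ""]
    return sum(values) / len(values) if values else None


def main():
    # In a real project, open your dataset here instead, for example:
    #     with open("your_data.csv", newline="") as f:
    #         rows = load_rows(f)
    rows = load_rows(io.StringIO(SAMPLE_CSV))
    print("Average age:", average(rows, "age"))
    print("Average salary:", average(rows, "salary"))


if __name__ == "__main__":
    main()
```

Keeping the computations in small functions and calling them from `main` means the whole analysis can be rerun from the command line in one step, and each function can be tested on its own.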
Local Python Installation¶
Follow the instructions below to set up Python and an editor called Visual Studio Code (VSCode) on your computer. Note that installing software for a new programming language is often pretty frustrating, so it’s totally expected that you might run into issues. Please reach out to the course staff in Office Hours or on the Message Board sooner rather than later to get this set up!
Online Prototyping Tools¶
Note that the following tools are great for prototyping parts of your project, but you should be using your local Python installation to write runnable Python scripts that can handle data of any size.
- Ed Workspaces is the easiest development environment to use since it provides real-time collaboration in the same environment that we’ve been using throughout the course. However, each workspace is limited to 20MB of file storage and 1GB of memory, and you are not allowed to install additional libraries not used in this class, so large datasets and certain libraries may not work.
- Google Colaboratory is an online Jupyter Notebook provided by Google. Think Google Docs but for Jupyter Notebooks. It is a great resource and easy to set up.
- JetBrains Datalore is a commercial service offering a powerful online collaboration environment for Jupyter Notebooks. The free tier should be enough for most data analysis and machine learning projects. Note that it cannot be used to write Python scripts, so it may not be suitable for your final submission, but it can be useful for developing prototypes.
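One quick way to find out early whether a dataset will fit in an online environment is to check its file size against the limit. The sketch below does this for the 20MB Ed Workspaces storage limit mentioned above. The function name is hypothetical, and treating 1MB as 10**6 bytes is an assumption, since the platform may count differently.

```python
import os
import tempfile

# Storage limit quoted above for Ed Workspaces. Treating 1 MB as 10**6
# bytes is an assumption; the platform may measure it differently.
ED_STORAGE_LIMIT_BYTES = 20 * 10**6


def fits_in_workspace(path, limit_bytes=ED_STORAGE_LIMIT_BYTES):
    """Return True if the file at path is no larger than limit_bytes."""
    return os.path.getsize(path) <= limit_bytes


# Demonstration on a small temporary file standing in for a real dataset.
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as tmp:
    tmp.write("id,age\n1,42\n")
    demo_path = tmp.name

print("Fits in workspace:", fits_in_workspace(demo_path))
os.remove(demo_path)
```

File size on disk is only a rough guide, since a dataset usually takes more memory once loaded than it does as a file, so a dataset close to the limit is a good reason to move to a local installation.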
Example: Too Easy Project¶
To give some context for why we have these challenge goals, we provide an example project proposal that is way too simple.
Consider a project proposal that wants to use the following CSV dataset about salaries.
| id | age | salary | gender | location | 
|---|---|---|---|---|
| 1 | 42 | 110000 | F | Seattle | 
| 2 | 23 | 56000 | M | Kenmore | 
| 3 | 18 | 20000 | M | SF | 
| … | … | … | … | … | 
Suppose the project asked the following research questions:
- What is the average age in the data?
- How does salary relate to age?
- What is the average salary by gender?
Is this a good project? In general, it can be a bit hard to say since you can’t base it just on the number of research questions. Instead, you need to think about their depth and how much work would be necessary to answer them. This project ends up being very straightforward to do in about 4 lines of code! One way to tell this project is too easy is that it doesn’t meet any of the challenge goals! As we mentioned before, to adequately meet the challenge goals you will probably be writing at least 120 lines of Python code! That might sound like a lot, but that’s already fewer lines than most of your homework assignments!
import pandas as pd
import seaborn as sns

df = pd.read_csv('data.csv')
# Question 1: What is the average age?
df['age'].mean()
# Question 2: How does salary relate to age?
sns.relplot(x='age', y='salary', data=df, kind='scatter')
# Question 3: What is the average salary by gender?
df.groupby('gender')['salary'].mean()
It’s definitely okay to have some easier questions to build up a narrative in your report, but we are looking for you to challenge yourself on this project!
Past Project Gallery¶
For some example reports and slides, please refer to our past project gallery. Note: Some of these projects predate the addition of challenge goals and other changes to the spec, so there might be some inconsistencies between their reports and what we are asking you to do.
Grading¶
There is no formal grade for this part of the project. All components of the project are incorporated into the 6 points awarded for the final project, some on completion and some on quality. (Up to 2 additional bonus points may be awarded for projects demonstrating above and beyond quality.) Satisfactory completion of this part of the project goes towards the completion points of the final point calculation. More details about project grading will be posted for the Code/Report part of the project; at this point you don’t need to worry about grades, since completing all the required components on time will be counted as completed.
Submission¶
Your project proposal is due on Monday 7/21 at 11:59pm, Seattle time. Submit your proposal as a PDF file on Gradescope. Remember that you cannot submit any part of the project late and you cannot use the resubmission process for take-home assessments to make submissions after the due date.
Submit your Part 1 as a PDF file. Do not turn in a Word document or plain text. Only one group member needs to submit your report on Gradescope, and that person should use the Group Members functionality to add any partners. If you want to learn about how to add Group Members on Gradescope, please see instructions here. Group members that are not listed in Gradescope by the due date will be marked as not submitted.