Overview

Propose a data analysis project to the course staff. This can be almost anything that you choose. You might select a project from your field of study, your extracurricular interests, government or public policy, or elsewhere. Another good source of ideas is repeating an analysis that you read about in the popular press or in a scientific paper — usually you will do a simplified version of the analysis. We are just looking for you to show that you have mastered the lessons of CSE 163 and can apply that knowledge to new challenges. Impress us!

The goal of the project proposal is for you to describe your idea in enough detail that the course staff can evaluate whether it is an appropriate project. You may work in a group of up to 3 people total. The final project is composed of three main parts: a proposal phase, a deliverable phase, and a review phase. These cannot be submitted late and cannot be resubmitted through the resubmission process. Once the deadline for a project part passes, you will no longer be able to make further submissions for that part of the project.

Requirements

The following sections describe some requirements for your proposal and project more broadly. Your project proposal will probably be about 1-2 pages long, but it is acceptable for the proposal to be longer or shorter. Do not worry too much about the length; just focus on conveying the required information in as much detail as you think is relevant. Submit your proposal as a PDF file. (In Microsoft Word, you can choose “Save As” to save your document as a PDF file. Do not turn in a Word document or plain text.)

Proposal Format

Your project proposal should include the following sections:

  • Title and author(s). The title should reflect your specific research questions (not just “CSE 163 Project”).
  • Summary of research questions. Give a numbered list of 3 or more research questions and a brief description of what you will investigate. Each research question should have 1–3 sentences and propose something that can be definitively answered, not just a general topic or area of investigation.
  • Motivation. Expand on your research questions by providing context about why you care about the problem. How does knowing the answers affect the world or our understanding of it?
  • Dataset. Describe the real, existing dataset that you will use, including exact URLs. You may not use a dataset that has been used in a previous CSE 163 assignment or demo. You may not repeat an analysis that you have performed in another class (though you might do something inspired by another class). The data must be real — neither you nor someone else may make up the data.
  • Challenge goals. Select at least 2 challenge goals that you are planning to meet. Justify why you think your project will meet each goal. Try to bold the name of your challenge goal to make it obvious which one you are talking about. If you would like to meet more than 2 goals, discuss the 2 goals that you’re most passionate about.
  • Method. For each research question, outline the steps you will take to implement your project deliverables. You should provide enough detail that someone else could independently reproduce your experiment. For each computation, indicate how the result would lead you to a conclusion about one or more research questions. Highlight connections to challenge goals. Refer to specific rows, columns, or elements in your dataset. You will not need to provide any code, but your methodology should contain enough details that a TA will be able to understand how you will approach the computation.
  • Work Plan. Describe your plan to divide the project into at least 3 (but no more than 7) discrete tasks, each with an estimate of the time in hours it will require. Then, describe your workflow for developing code, testing code, and coordinating work, particularly how team members can support each other in case one task is unexpectedly challenging. All team members must contribute equally to the deliverables.

    You should also describe your primary development environment. Here are some options to consider:

    • Ed Workspaces is the easiest development environment to use since it provides real-time collaboration in the same environment that we’ve been using throughout the course. However, each workspace is limited to 20MB of file storage and 1GB of memory—large datasets and certain libraries may not work.
    • JetBrains Datalore is a commercial service offering a powerful online collaboration environment for Jupyter Notebooks. The free tier should be enough for most data analysis and machine learning projects. Note that it does not support writing Python scripts, so it may not be suitable for your final submission, but it can be useful for developing prototypes.
    • Local development (i.e. installing Python on your personal machine) is also an option. This proposal phase is a great time to get a Python development environment set up on your computer so you can prepare yourself for success after the course! If you choose this option, you should set up Python on your computer during the proposal phase of the project, since many students encounter installation errors, which can cause last-minute stress if you are trying to get set up just before the project deliverables are due. Feel free to post on the message board or come to office hours if you get stuck with the setup.

Dataset

You are not allowed to use a dataset that we have used in class as the central dataset of your project. You can use one of our datasets to add on to a part of your project, but it should not be the central focus.

Your report must include simple, clear, unambiguous instructions that anyone can follow to download the data themselves. Do not rely on the staff to “figure out” how to access the data; you must provide instructions for us. If the data is not publicly available, then you should make it available to course staff or include it with your assignment submission. One way to do this is to host your dataset on Dropbox or Google Drive and provide a link to it in your report. If the full dataset is too large to download, then provide a subset of it for the course staff to experiment with.

Do not use a dataset that cannot be shared with the course staff, such as one that contains confidential medical information or intellectual property. Do not manually perform any data cleaning steps — all data cleaning should be done programmatically, by Python code that you write.
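As a rough illustration of what programmatic cleaning can look like, here is a minimal sketch using pandas. The raw rows, the column names, and the clean function are all made up for this example; your own dataset will need different steps.

```python
import io

import pandas as pd

# Hypothetical raw file contents, invented purely for illustration.
RAW_CSV = """id,age,salary,location
1,42,"110,000", Seattle
2,23,56000,kenmore
2,23,56000,kenmore
3,unknown,20000,Tacoma
"""


def clean(raw: pd.DataFrame) -> pd.DataFrame:
    """Return a cleaned copy of raw; every step is code, none is manual."""
    df = raw.drop_duplicates().copy()
    # Strip thousands separators, then convert salaries to numbers.
    df['salary'] = pd.to_numeric(df['salary'].str.replace(',', ''))
    # Ages that cannot be parsed (e.g. 'unknown') become NaN instead of raising.
    df['age'] = pd.to_numeric(df['age'], errors='coerce')
    # Normalize whitespace and capitalization in place names.
    df['location'] = df['location'].str.strip().str.title()
    # Drop rows that are still missing an age.
    return df.dropna(subset=['age']).reset_index(drop=True)


# Read every column as text so the cleaning steps decide the final types.
raw = pd.read_csv(io.StringIO(RAW_CSV), dtype=str)
cleaned = clean(raw)
print(cleaned)
```

Keeping the steps in one function means the course staff can rerun the cleaning on a fresh download of the data and get the same table.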

Challenge Goals

Challenge goals help us to define expectations while still offering flexibility for you to design your own project. Meeting 2 or more challenge goals will likely require writing at least 120 lines of Python code. In order to qualify for a challenge goal, you must experiment with something not explicitly discussed in the course. Here are the challenge goals you can choose from. For some of the bullets we provide examples; these are not meant to be exhaustive lists.

  • Multiple Datasets: Use multiple datasets together to come up with a richer analysis. This requirement is not just about using more than one dataset across your research questions, but is more about combining the datasets to make a more in-depth analysis. This will likely require you to join or combine tables together to solve a research question. Our take-home assessment 5 that combines Census data with food access data is a good example of this.
  • Messy Data: Use a dataset that is not cleanly presented to you in a CSV already. Some possible examples: writing code to scrape data from the web, using an API to gather data, or using some dataset that requires a lot of pre-processing to be usable.
  • Result Validity: Do some extra work on top of your results to verify their validity. For example, use a test of statistical significance to check that your results are unlikely to have happened by chance.
  • Machine Learning: Dive deep into applying machine learning to your dataset to gain insights about the data or use it to make predictions about the future. Be explicit with what your goal is and how you will assess if you meet that goal. One example could be looking at various model types (and different settings of their hyperparameters) to identify which model is “best” (by how you define best). Another could be looking at how to use an “interpretable model” to understand which features are the most informative for how a decision is made.
  • New Library: Learn a new Python library and use it in your project in a significant way to help with your analysis. Part of this class is being able to learn libraries in Python. Show that you can apply what you’ve learned to picking up a library we have not discussed in depth in this course. In the Libraries section below, we list some recommended libraries (and a complete list of the libraries we will cover in this class, which do not count as new).
  • Other: If you are thinking of doing a project that you think is challenging enough but does not fit into any of these challenge goals, you can propose a new one to explain why you think your project will be challenging. Do remember that this should be a last resort since it can be challenging to assess the difficulty of your new goal. We reserve the right to deny your proposed challenge goal and you will need to go back and figure out how to make your project challenging enough before submitting.
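To make the Result Validity goal above more concrete, here is a minimal sketch of one possible significance check, a permutation test on the difference between two group means, written with only the standard library. The function name permutation_p_value, its parameters, and the sample numbers are assumptions made up for illustration; this is one option among many, not a required method.

```python
import random
from statistics import mean


def permutation_p_value(group_a, group_b, trials=5000, seed=0):
    """Estimate how often shuffled labels give a mean gap at least as large as observed."""
    rng = random.Random(seed)  # fixed seed so the estimate is reproducible
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(trials):
        # Shuffling breaks any real link between group label and value.
        rng.shuffle(pooled)
        gap = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if gap >= observed:
            extreme += 1
    return extreme / trials


# Made-up numbers for illustration only; a real project would use its dataset.
group_a = [12.1, 11.8, 12.6, 12.9, 12.3, 12.7]
group_b = [10.2, 10.9, 10.4, 11.1, 10.6, 10.8]
p = permutation_p_value(group_a, group_b)
print(f'estimated p-value: {p:.4f}')
```

A small estimated p-value suggests the observed gap would rarely appear if the group labels were unrelated to the values. In a proposal, you would state which comparison you plan to test and what threshold you will use.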

Advice

The following sections have some advice on dataset sources and libraries that you might want to look into. We also provide some more context for what we will be covering for the rest of the quarter.

For some example reports and slides, please refer to our past project gallery. Note: Some of these projects were completed before we added challenge goals and changed the spec, so there might be some inconsistencies. It is up to you to understand the spec for this particular quarter.

Sources of Data

The best approach is to start with a problem that interests you, and then look for data. Google will be your friend for finding a dataset. Alternatively, an equally valid approach is to start with a dataset that looks interesting and design questions that interest you around that data. Here are some possible data sources, but many more exist:

Libraries

You are free to use almost any Python library that you find useful, especially the ones we have learned in class this quarter. If you already know of a library you will use (not required for the proposal), please mention it in your proposal. Below, we list some libraries you might want to look into for your project since students have found them useful in the past:

You are also encouraged to take advantage of the libraries we learned this quarter to help you on this project (but they won’t count as a new library for that challenge goal). The list below includes all libraries we have discussed (or will discuss) in depth in CSE 163.

  • pandas
  • seaborn
  • matplotlib
  • scikit-learn
  • geopandas (geo-spatial data, covered in Module 6)
  • numpy (image data, covered in Module 8)

Example: Too Easy Project

To give some context for why we have these challenge goals, we provide an example project proposal that is way too simple.

Consider a project proposal that wants to use the following CSV dataset about salaries.

id  age  salary  gender  location
1   42   110000  F       Seattle
2   23   56000   M       Kenmore
3   18   20000   M       SF

Suppose the project asked the following research questions:

  1. What is the average age in the data?
  2. How does salary relate to age?
  3. What is the average salary by gender?

Is this a good project? In general, it can be hard to say, since you can’t judge it just by the number of research questions. Instead, you need to think about their depth and how much work would be necessary to answer them. This project ends up being very straightforward to do in about 4 lines of code! One way to tell this project is too easy is that it doesn’t meet any of the challenge goals! As we mentioned before, to adequately meet the challenge goals you will probably be writing at least 120 lines of Python code! That might sound like a lot, but that’s already fewer lines than most of your homework assignments!

import pandas as pd
import seaborn as sns

# Load the salary dataset described above
df = pd.read_csv('data.csv')

# Question 1: What is the average age?
print(df['age'].mean())

# Question 2: How does salary relate to age?
sns.relplot(x='age', y='salary', data=df, kind='scatter')

# Question 3: What is the average salary by gender?
print(df.groupby('gender')['salary'].mean())

It’s definitely okay to have some easier questions to build up a narrative in your report, but we are looking for you to challenge yourself on this project!

Submission

Your project proposal is due on Thursday, February 17 at 23:59. Submit your proposal as a PDF file on Gradescope. Remember that you cannot submit any part of the project late and you cannot use the resubmission process for take-home assessments to make submissions after the due date.

Submit your Part 0 as a PDF file. Do not turn in a Word document or plain text. One group member should submit your report on Gradescope and use the Group Members functionality to add any partners. If you want to learn how to add Group Members on Gradescope, please see instructions here. Group members who are not listed in Gradescope by the due date will be marked as not submitted.

Grading

There is no formal grade for this part of the project. All components of the project are incorporated into the 6 points awarded for the final project. Satisfactory completion of this part counts toward the completion points included in those 6 points. More details about project grading will be posted for the deliverables part of the project. At this point you don’t need to worry about grades: submitting all the required components on time will count as completing this part.