CSE 163, Spring 2020: Final Project: Part 0

Overview

The goal of Part 0 is for you to describe your project idea in enough detail that the course staff can evaluate whether it is an appropriate project. The specification for this part has a lot of detail, not because Part 0 is long, but because we want you to have as much information about the scope of your project as early as possible. We will try to give you feedback on your project proposal very quickly so you have time to adjust your idea before Part 1 (elaborating on your proposal) is due.

You are able to work in a group of up to 3 people total. You are more than welcome to work alone, but students in the past reported learning a lot from working with their peers. Working with a group allows you the opportunity to dive deeper into your project since you can have multiple people tackle the problems. Additionally, if you are ever looking for a job, companies love hearing when you share experiences about working on a team!

Even though you do have access to a late day for this part of the project, we strongly encourage you to turn Part 0 in on time so the staff can get feedback to you as quickly as possible. Please see the syllabus for the late day policy for projects.

Details

Propose a data analysis project to the course staff. This can be almost anything that you choose. You might select a project from your field of study, your extracurricular interests, government or public policy, or elsewhere. Another good source of ideas is repeating an analysis that you read about in the popular press or in a scientific paper — usually you will do a simplified version of the analysis. We are just looking for you to show that you have mastered the lessons of CSE 163 and can apply that knowledge to new challenges. Impress us!

Your project will use Python to help you answer your research questions. You might use Python to obtain the data or to clean/format it. You will almost certainly be using it to analyze your dataset. In particular, it's OK if the bulk of your code does the work of obtaining and cleaning the data, then the actual data processing is simple. (But some part of the project must require non-trivial Python processing.)

As a way to help you assess if your project is of the right difficulty, we have set up some "Challenge Goals" for your project. There are many possible goals you could achieve, and we ask that you select two from the list and meet those. You are more than welcome to go above and beyond and meet more, but at a minimum your project needs to demonstrate strength in two of the goals we've outlined.

Your data analysis proposal must clearly state the questions you will seek to answer, the existing dataset you will analyze, and the algorithm you will run on the dataset to answer your questions.

Your Part 0 proposal will probably be about a page long, but it is acceptable for the proposal to be longer or shorter as long as it conveys the required information without irrelevant details. Submit your proposal as a PDF file. (In Microsoft Word, you can choose “save as” to save your document as a PDF file. Do not turn in a Word document or plain text.)

Your proposal will be graded based on the quality and clarity of the writing and on whether your proposal conforms to the requirements below. The staff may direct you to make changes to your proposal. For example, the staff may direct you to revise the scope of your proposal (we don't want you to proceed with a project that is too hard or too easy). For Part 0, we are not grading you on your ability to come up with a project of the right difficulty right off the bat. Just focus on finding something that interests you and submit a proposal that has all the required parts answered well!

You will not turn in any code for Part 0. However, you might want to consider starting to write code early. Some reasons are:

  • You might write code to help you download your data.
  • You might write code to perform data cleaning.
  • You might perform an experiment to validate your plans. For instance, maybe you plan to use a particular library. It would be wise to experiment with that library to make sure it has the functionality you need and that you understand how to use it. Such an experiment is called “prototyping”.
  • You might want to get a head start on Parts 1 and 2.
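As one illustration of the first bullet, a download helper might look something like the sketch below. The function name download_if_missing, the URL, and the file path are placeholders invented for this example and are not part of the assignment. The sketch uses only the standard library and skips the download when the file is already on disk.

```python
from pathlib import Path
from urllib.request import urlretrieve


def download_if_missing(url, dest):
    """
    Downloads the file at url to the path dest unless dest already exists.
    Returns dest as a Path.
    """
    dest = Path(dest)
    if not dest.exists():
        # Create the parent directory (e.g. data/) before downloading
        dest.parent.mkdir(parents=True, exist_ok=True)
        urlretrieve(url, dest)
    return dest


# Hypothetical usage (placeholder URL, replace with your dataset's location):
# download_if_missing('https://example.com/my_dataset.csv',
#                     'data/my_dataset.csv')
```

Keeping the download in code also doubles as instructions for how to obtain the data.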

Requirements

The following sections describe some requirements for your Part 0 and project more broadly.

Proposal Format

Your Part 0 project proposal should include the following sections:

  1. Title and author(s). The title should describe your research questions. It should not be something like “CSE 163 Project” (though you could use that as a subtitle if you wish).
  2. Summary of research questions. Give a numbered list of one or more research questions. Most projects should have at least 3 research questions. Each one should be a specific question with a specific answer, not merely a general topic or area of investigation. For each research question, state what you are trying to compute and why, in about 1 to 3 sentences.
  3. Dataset. Describe the real, existing dataset that you will use, including exact URLs. You may not use a dataset that has been used in a previous CSE 163 assignment or demo. You may not repeat an analysis that you have performed in another class (though you might do something inspired by another class). The data must be real — neither you nor someone else may make up the data.
  4. Challenge Goals. Select two of the challenge goals (described below) that you are planning on trying to meet with your project. Be specific and state exactly which two goals you are planning on meeting and justify why you think your project will meet that goal. You should include the bold term listed at the front of the goal to make it obvious which one you are talking about.

    As we mentioned before, you are more than welcome to tackle more than two goals if you want. However, in your proposal you should only propose meeting two of them so that we can give feedback on how well we think your project will be able to meet those specific goals. Do not list more than two challenge goals even if you plan on meeting more.

Dataset

You are not allowed to use a dataset that we have used in class as the central dataset of your project. You can use one of our datasets to add on to a part of your project, but it should not be the central focus.

You do not have to turn in the dataset itself (it may be quite large, or it might be available only via the web rather than as a single download). If you do not turn it in, your report must include simple, clear, unambiguous instructions that anyone can follow to download the data themselves. You should not rely on the staff to be able to "figure out" how to access the data; you must provide instructions for us.

If the data is not currently publicly available (for example, it is data from some activity you are involved in), then you should make it available to course staff or include it with your assignment submission. One way to make it available is to host it on some public website (Dropbox or Google drive are possibilities); then, your program and your report should reference that location. Do not use a dataset that cannot be shared with the course staff, such as one that contains confidential medical information or intellectual property.

If the full dataset is too large to download, then provide a subset of it for the course staff to experiment with.

Do not manually perform any data cleaning steps; all data cleaning should be done programmatically, by Python code that you write. That is, you should not be searching through CSV files in Excel and filling in empty boxes, changing strings to numbers, etc. It is possible that cleaning your data will be a big part of the code you write for your project! (Sometimes this can be as much as 75% of a data analysis project.)
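To make "programmatic cleaning" concrete, here is a minimal sketch using pandas. The rows, the column names, and the choice to fill missing ages with the median are all assumptions made up for illustration; your own dataset will need its own cleaning rules.

```python
import io

import pandas as pd

# Made-up rows standing in for a messy file; a real project would call
# pd.read_csv on the actual dataset instead.
raw = io.StringIO(
    'id,age,salary\n'
    '1,42,"110,000"\n'
    '2,,56000\n'
    '3,18,unknown\n'
)


def clean(df):
    """
    Returns a cleaned copy of df: salary is converted from text to numbers
    (unparseable values become NaN) and missing ages are filled with the
    median age.
    """
    df = df.copy()
    # Strip thousands separators, then convert strings to numbers
    salary = df['salary'].astype(str).str.replace(',', '', regex=False)
    df['salary'] = pd.to_numeric(salary, errors='coerce')
    # Fill empty age cells in code rather than by hand
    df['age'] = df['age'].fillna(df['age'].median())
    return df


cleaned = clean(pd.read_csv(raw))
print(cleaned)
```

Because every step lives in a function, anyone can rerun the cleaning on a fresh copy of the raw data and get the same result.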

Challenge Goals

We want your project to be meaningful to you while also pushing you a bit so that you learn along the way! We ask you to make your project challenging enough by meeting two (or more) "Challenge Goals". In your Part 0 proposal, name exactly two of these challenge goals that you intend to meet and explain why you think your project will meet them. You are more than welcome to change these later, but in your final project you will need to meet two of them sufficiently (and you will explain which ones you met and why you think you met them). By meeting two or more challenge goals, you will probably be aiming to write at least 120 lines of Python code; you will likely write more, but it's a good benchmark to keep in mind.

There are many ways to make a project interesting and challenging! We want to outline a diverse set of ways you can challenge yourself on the project so you can really focus on working on something you're passionate about rather than focusing on just meeting some requirements. It's totally okay if you don't fully understand what some of the goals mean, since we are aiming for breadth in providing lots of options for project groups.

We think this list is comprehensive enough to capture almost every project, but we do leave the option for you to propose a way your project can be challenging outside of our dimensions. If you want to propose another option, you are allowed to do so but you should use this as a last resort. It can be difficult to assess the difficulty of a new goal on your own so it might require more work if we end up denying your new goal and you have to go back to the drawing board.

Here are the challenge goals you can choose from. We provide some examples for some of the bullets that are not meant to be exhaustive lists.

  • Multiple Datasets: Use multiple datasets together to come up with a richer analysis.
  • Messy Data: Use a dataset that is not cleanly presented to you in a CSV already. Some possible examples: writing code to scrape data from the web, using an API to gather data, or using some dataset that requires a lot of pre-processing to be usable.
  • Result Validity: Do some extra work on top of your results to verify the validity of the results. For example, using some test of statistical significance to verify your results aren't likely to happen by chance.
  • Many Perspectives: Craft a diverse set of research questions to tackle multiple perspectives to address the same topic. This usually means you will be using varied approaches in making sense of what you can conclude from your dataset. Usually requires demonstrating a deep understanding of the domain being studied.
  • Machine Learning: Dive deep into applying machine learning to your dataset to gain insights about the data or use it to make predictions about the future. Be explicit with what your goal is and how you will assess if you meet that goal. One example could be looking at various model types (and different settings of their hyperparameters) to identify which model is "best" (by how you define best). Another could be looking at how to use an "interpretable model" to understand which features are the most informative for how a decision is made.
  • New Library: Learn a new Python library and use it in your project in a significant way to help with your analysis. Part of this class is being able to learn libraries in Python. Show that you are able to apply what you've learned to a library we have not discussed in depth in this course. In the Libraries section below, we list some recommended libraries (and a complete list of the libraries we will cover in this class, which do not count as new).
  • Other: If you are thinking of doing a project that you think is challenging enough but does not fit into any of these challenge goals, you can propose a new one to explain why you think your project will be challenging. Do remember that this should be a last resort since it can be challenging to assess the difficulty of your new goal. We reserve the right to deny your proposed challenge goal and you will need to go back and figure out how to make your project challenging enough before submitting Part 1 the next week.
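As one possible illustration of the Result Validity goal, the sketch below runs a simple permutation test on the difference between two group means. This is only one of many valid approaches. The function name permutation_p_value, its parameters, and the numbers fed to it are made up for this example and are not a required method.

```python
import random
from statistics import mean


def permutation_p_value(group_a, group_b, num_shuffles=2000, seed=0):
    """
    Estimates how often a difference in means at least as large as the
    observed one shows up when the group labels are shuffled at random.
    Returns a value between 0 and 1; small values suggest the observed
    difference is unlikely to be due to chance alone.
    """
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    size_a = len(group_a)
    at_least_as_large = 0
    for _ in range(num_shuffles):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:size_a]) - mean(pooled[size_a:]))
        if diff >= observed:
            at_least_as_large += 1
    # Add one to both counts so the estimate is never exactly zero
    return (at_least_as_large + 1) / (num_shuffles + 1)


# Made-up numbers purely for illustration
p_value = permutation_p_value([10, 11, 12, 13, 14], [20, 21, 22, 23, 24])
print(p_value)
```

A fixed seed keeps the estimate reproducible from run to run, which makes the result easier to report and to test.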

Advice

The following sections have some advice on dataset sources and libraries that you might want to look into. We also provide some more context for what we will be covering for the rest of the quarter.

For some example reports and slides, please refer to our past project gallery.

Datasets

The best approach is to start with a problem that interests you, and then look for data. Google will be your friend for finding a dataset. Alternatively, an equally valid approach is to start with a dataset that looks interesting and design questions that interest you around that data. Here are some possible data sources, but many more exist:

Libraries

You are free to use almost any Python library that will be useful to you, especially the ones we have learned in class this quarter. If you already know of a library you will use (not required for the proposal), please mention it in your proposal. Below, we list some libraries you might want to look into for your project since students have found them useful in the past:

You are also encouraged to take advantage of the libraries we learned this quarter to help you on this project (but they won't count as a new library for that challenge goal). The list below includes all libraries we have discussed (or will discuss) in depth in CSE 163.

  • pandas
  • seaborn
  • matplotlib
  • scikit-learn
  • geopandas (geo-spatial data)
  • numpy (image data)

Local Setup

Ed is a great tool for learning Python in a controlled environment, but it is not something you will use for real-world data processing outside of this course. Additionally, it is very possible you might run into space restrictions on Ed if you are using a larger dataset. This is a great time to get a Python development environment set up on your computer so you can set yourself up for success after the course! We have instructions for how to set up Python on your computer here. Please try to set this up sooner rather than later so you don't run into trouble when you need to be working on your project! Feel free to post on the message board or come to office hours if you get stuck with the setup.

An additional resource you might want to consider using is Google Colaboratory (or Colab for short). This is a sort-of alternative to Jupyter Notebooks that lets you run computations in the cloud. You can set up Jupyter Notebooks locally, but Colab is becoming very popular among data scientists.

Example: Too Easy Project

To give some context for why we have these challenge goals, we provide an example project proposal that is way too simple.

Consider a project proposal that wants to use the following CSV dataset about salaries.

id   age  salary  gender  location
1    42   110000  F       Seattle
2    23   56000   M       Kenmore
3    18   20000   M       SF
...  ...  ...     ...     ...

Suppose the project asked the following research questions:

  1. What is the average age in the data?
  2. How does salary relate to age?
  3. What is the average salary by gender?

Is this a good project? In general, it can be a bit hard to say since you can't base it just on the number of research questions. Instead, you need to think about their depth and how much work would be necessary to answer them. This project ends up being very straightforward to do in about 4 lines of code! One way to tell this project is too easy is that it doesn't meet any of the challenge goals! As we mentioned before, to adequately meet the challenge goals you will probably be writing at least 120 lines of Python code! That might sound like a lot, but that's already fewer lines than most of your homework assignments!

import pandas as pd
import seaborn as sns

df = pd.read_csv('data.csv')

# Question 1: What is the average age?
print(df['age'].mean())

# Question 2: How does salary relate to age?
sns.relplot(x='age', y='salary', data=df, kind='scatter')

# Question 3: What is the average salary by gender?
print(df.groupby('gender')['salary'].mean())

It's definitely okay to have some easier questions to build up a narrative in your report, but we are looking for you to challenge yourself on this project!

Preview: Part 2 Requirements

This project will involve writing code (but none for this part!). You don't need to write any code until Part 2 of the project, but we do recommend starting early. To give you an idea of the requirements we will ask of the code you write for Part 2, we have listed a draft of the requirements below. You do not really need to worry about these now; we just want to give you some context for what your code will look like.

Notice: These requirements are a draft and are subject to change. These will be finalized when we release the Part 2 specification.

  • Your project must be a Python script (a .py file). You are more than welcome to experiment and/or develop in a Jupyter Notebook, but your end result must be a runnable Python script to output all your results.
  • Just for reference, most projects that adequately meet two challenge goals will be at least 120 lines of Python code long. This is not a hard requirement, and we will not count lines, but this is a very good heuristic to tell if your project has enough depth.
  • Your code must pass flake8.
  • Your code should meet some basic style requirements, namely good naming conventions, breaking your code up into functions, and making sure your functions, classes, and modules have doc-strings describing their use.
  • Your code will need to be broken up into at least two Python modules (files). You can decide how to break up your code, but it helps to separate them by the part of the project they concern. A very common split is to have one module that does all the data processing (loading in data, cleaning it, etc.) while the other module is the actual runnable program that does the analysis. You can break up your code however you see fit, but you must have at least two Python modules that you submit. Your testing programs (which you should write) do not count as one of the two required modules.
  • We will enforce a structure on your project directory. You will need to have a directory named data in your project to store all your data and a directory called results to store any results you want to save (namely plots).
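To give a rough feel for how these draft requirements could fit together, here is a minimal sketch of what a data-processing module might contain. The function names, the file name summary.txt, and the exact contents are hypothetical. Only the data and results directory names come from the draft list above, and the final Part 2 specification may differ.

```python
"""
A hypothetical data-processing module: loads lines from a file in the
data directory and writes a small summary into the results directory.
"""
from pathlib import Path


def load_lines(data_dir, file_name):
    """
    Returns the non-empty lines of the file named file_name inside
    data_dir, with surrounding whitespace removed.
    """
    text = (Path(data_dir) / file_name).read_text()
    return [line.strip() for line in text.splitlines() if line.strip()]


def save_summary(lines, results_dir, file_name='summary.txt'):
    """
    Writes the number of lines to file_name inside results_dir, creating
    the directory if needed. Returns the path that was written.
    """
    out_dir = Path(results_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / file_name
    out_path.write_text(f'{len(lines)} rows\n')
    return out_path
```

A second, runnable module would then import these functions and call them from its main function, for example with 'data' and 'results' as the directory arguments.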

Submission

Project Part 0 is due on Thursday, May 14 at 23:59 (PDT). Remember, you can only submit Part 0 up to one day late regardless of the number of late days your group may have remaining. Please see the late day policy on the syllabus for more information.

Submit your Part 0 as a PDF file. Do not turn in a Word document or plain text. One group member should submit your report on Gradescope and should use the Group Members functionality to add the appropriate partner(s) if you have any. If you want to learn about how to add Group Members on Gradescope, please see instructions here. Group members who are not listed in Gradescope by the late cutoff will be marked as not submitted.