The goal of Part 0 is for you to describe your project idea in enough detail that the course staff can evaluate whether it is an appropriate project. The specification for this part has a lot of detail, not because Part 0 is long, but because we want you to have as much information about the scope of your project sooner rather than later! We will try to give you feedback on your project proposal very quickly so you have time to adjust your idea before Part 1 (elaborating on your proposal) is due.
You are able to work in a group of up to 3 people total. You are more than welcome to work alone, but students in the past reported learning a lot from working with their peers. Working with a group allows you the opportunity to dive deeper into your project since you can have multiple people tackle the problems. Additionally, if you are ever looking for a job, companies love hearing when you share experiences about working on a team!
Even though you do have access to a late day for this part of the project, we strongly encourage you to turn Part 0 in on time so the staff can get feedback to you as quickly as possible. Please see the syllabus for the late day policy for projects.
Propose a data analysis project to the course staff. This can be almost anything that you choose. You might select a project from your field of study, your extracurricular interests, government or public policy, or elsewhere. Another good source of ideas is repeating an analysis that you read about in the popular press or in a scientific paper — usually you will do a simplified version of the analysis. We are just looking for you to show that you have mastered the lessons of CSE 163 and can apply that knowledge to new challenges. Impress us!
Your project will use Python to help you answer your research questions. You might use Python to obtain the data or to clean/format it. You will almost certainly be using it to analyze your dataset. In particular, it's OK if the bulk of your code does the work of obtaining and cleaning the data, then the actual data processing is simple. (But some part of the project must require non-trivial Python processing.)
As a way to help you assess if your project is of the right difficulty, we have set up some "Challenge Goals" for your project. There are many possible goals you could achieve, and we ask that you select two from the list and meet those. You are more than welcome to go above and beyond and meet more, but at a minimum your project needs to demonstrate strength in two of the goals we've outlined.
Your data analysis proposal must clearly state the questions you will seek to answer, the existing dataset you will analyze, and the algorithm you will run on the dataset to answer your questions.
Your Part 0 proposal will probably be about a page long, but it is acceptable for the proposal to be longer or shorter as long as it conveys the required information without irrelevant details. Submit your proposal as a PDF file. (In Microsoft Word, you can choose “save as” to save your document as a PDF file. Do not turn in a Word document or plain text.)
Your proposal will be graded based on the quality and clarity of the writing and on whether your proposal conforms to the requirements below. The staff may direct you to make changes to your proposal. For example, the staff may direct you to revise the scope of your proposal (we don't want you to proceed with a project that is too hard or too easy). For Part 0, we are not grading you on your ability to come up with a project of the right difficulty right off the bat. Just focus on finding something that interests you and submit a proposal that has all the required parts answered well!
You will not turn in any code for Part 0. However, you might want to consider starting to write code early. Some reasons are:
The following sections describe some requirements for your Part 0 and project more broadly.
Your Part 0 project proposal should include the following sections:
Challenge Goals. Select two of the challenge goals (described below) that you are planning on trying to meet with your project. Be specific and state exactly which two goals you are planning on meeting and justify why you think your project will meet that goal. You should include the bold term listed at the front of the goal to make it obvious which one you are talking about.
As we mentioned before, you are more than welcome to tackle more than two goals if you want. However, in your proposal you should only propose meeting two of them so that we can give feedback on how well we think your project will be able to meet those specific goals. Do not list more than two challenge goals even if you plan on meeting more.
You are not allowed to use a dataset that we have used in class as the central dataset of your project. You can use one of our datasets to add on to a part of your project, but it should not be the central focus.
You do not have to turn in the dataset itself (it may be quite large, or it might be available only via the web rather than as a single download). If you don't, your report must include simple, clear, unambiguous instructions that anyone can follow to download the data themselves. You should not rely on the staff to be able to "figure out" how to access the data; you must provide instructions for us.
If the data is not currently publicly available (for example, it is data from some activity you are involved in), then you should make it available to course staff or include it with your assignment submission. One way to make it available is to host it on some public website (Dropbox or Google drive are possibilities); then, your program and your report should reference that location. Do not use a dataset that cannot be shared with the course staff, such as one that contains confidential medical information or intellectual property.
If the full dataset is too large to download, then provide a subset of it for the course staff to experiment with.
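One possible way to produce such a subset is sketched below. This is only an illustration: the `full` DataFrame is a made-up stand-in for your real data, and the sample size and seed are arbitrary choices, not course requirements.

```python
import pandas as pd

# Hypothetical stand-in for a dataset that is too large to share in full.
full = pd.DataFrame({'id': range(1, 1001), 'value': range(1000)})

# Draw a reproducible random sample of rows. Fixing random_state means
# anyone who reruns the script gets the same subset.
subset = full.sample(n=100, random_state=163)

# With no path argument, to_csv returns the CSV text as a string; pass a
# file name instead to write the subset to disk for sharing.
subset_csv = subset.to_csv(index=False)
```

A random sample usually represents the full data better than taking the first rows, which may be sorted by date or by some other column.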
Do not manually perform any data cleaning steps; all data cleaning should be done programmatically, by Python code that you write. That is, you should not be searching through CSV files in Excel and filling in empty cells, changing strings to numbers, etc. It is possible that cleaning your data will be a big part of the code you write for your project! (Sometimes this can be as much as 75% of a data analysis project.)
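As a rough sketch of what programmatic cleaning can look like, the example below fixes the two problems mentioned above (empty cells and numbers stored as strings) with pandas. The data, the `clean_salaries` function name, and the choice to fill a missing age with the median are all assumptions made for illustration; your dataset will need its own cleaning decisions.

```python
import io

import pandas as pd

# Hypothetical raw data: one missing age, and salaries stored as strings
# with thousands separators. A real project would read its own dataset.
RAW_CSV = """id,age,salary
1,42,"110,000"
2,,"56,000"
3,18,"20,000"
"""


def clean_salaries(raw):
    """Return a cleaned copy of raw without modifying the input."""
    cleaned = raw.copy()
    # Change strings such as "110,000" into integers.
    cleaned['salary'] = (
        cleaned['salary'].str.replace(',', '', regex=False).astype(int)
    )
    # Fill the empty age with the median of the known ages. This is one
    # possible choice; dropping the row would be another.
    cleaned['age'] = cleaned['age'].fillna(cleaned['age'].median())
    return cleaned


raw = pd.read_csv(io.StringIO(RAW_CSV))
cleaned = clean_salaries(raw)
print(cleaned)
```

Because every step lives in code, anyone can rerun the script on the original file and get exactly the same cleaned data, which is not true of edits made by hand in a spreadsheet.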
We want your project to be meaningful to you while also pushing you a bit so that you learn along the way! We are going to ask you to make your project challenging enough by meeting two (or more) "Challenge Goals". In your Part 0 proposal, you will propose exactly two of these challenge goals that you intend to meet in your project and explain why you think your project will meet them. You are more than welcome to change these later, but in your final project you will need to meet two of them sufficiently (and you will explain which ones you met and why you think you met them). By meeting two or more challenge goals, you will probably be aiming to write at least 120 lines of Python code; you will likely write more, but it's a good benchmark to be thinking of.
There are many ways to make a project interesting and challenging! We want to outline a diverse set of ways you can challenge yourself on the project so you can really focus on working on something you're passionate about rather than focusing on just meeting some requirements. It's totally okay if you don't fully understand what some of the goals mean, since we are aiming for breadth in providing lots of options for project groups.
We think this list is comprehensive enough to capture almost every project, but we do leave the option for you to propose a way your project can be challenging outside of our dimensions. If you want to propose another option, you are allowed to do so but you should use this as a last resort. It can be difficult to assess the difficulty of a new goal on your own so it might require more work if we end up denying your new goal and you have to go back to the drawing board.
Here are the challenge goals you can choose from. For some of the bullets we provide examples; these are not meant to be exhaustive lists.
The following sections have some advice on dataset sources and libraries that you might want to look into. We also provide some more context for what we will be covering for the rest of the quarter.
For some example reports and slides, please refer to our past project gallery.
The best approach is to start with a problem that interests you, and then look for data. Google will be your friend for finding a dataset. Alternatively, an equally valid approach is to start with a dataset that looks interesting and design questions that interest you around that data. Here are some possible data sources, but many more exist:
You are free to use almost any Python library you find that will be useful to you, especially the ones we have learned in class this quarter. If you already know of a library you will use, please mention it in your proposal (this is not required). Below, we list some libraries you might want to look into for your project since students have found them useful in the past:
You are also encouraged to take advantage of the libraries we learned this quarter to help you on this project (but they won't count as a new library for that challenge goal). Below is a list of all the libraries we have discussed (or will discuss) in depth in CSE 163.
pandas
seaborn
matplotlib
scikit-learn
geopandas (geo-spatial data)
numpy (image data)
Ed is a great tool for learning Python in a controlled environment, but is not something you will use in your real life doing data processing outside of this course. Additionally, it is very possible you might run into space restrictions on Ed if you are using a larger dataset. This is a great time to get a Python development environment set up on your computer so you can set yourself up for success after the course! We have instructions for how to set up Python on your computer here. Please try to set this up sooner rather than later so you don't run into trouble when you need to be working on your project! Feel free to post on the message board or come to office hours if you get stuck with the setup.
An additional resource you might want to consider using is Google Colaboratory (or Colab for short). This is a sort-of alternative to Jupyter Notebooks that lets you run computations in the cloud. You can set up Jupyter Notebooks locally, but Colab is becoming very popular among data scientists.
To give some context for why we have these challenge goals, we provide an example project proposal that is way too simple.
Consider a project proposal that wants to use the following CSV dataset about salaries.
id | age | salary | gender | location |
---|---|---|---|---|
1 | 42 | 110000 | F | Seattle |
2 | 23 | 56000 | M | Kenmore |
3 | 18 | 20000 | M | SF |
... | ... | ... | ... | ... |
Suppose the project asked the following research questions:
Is this a good project? In general, it can be a bit hard to say since you can't base it just on the number of research questions. Instead, you need to think about their depth and how much work would be necessary to answer them. This project ends up being very straightforward to do in about 4 lines of code! One way to tell this project is too easy is that it doesn't meet any of the challenge goals! As we mentioned before, to adequately meet the challenge goals you will probably be writing at least 120 lines of Python code! That might sound like a lot, but that's already fewer lines than most of your homework assignments!
import pandas as pd
import seaborn as sns

df = pd.read_csv('data.csv')
# Question 1: What is the average age?
df['age'].mean()
# Question 2: How does salary relate to age?
sns.relplot(x='age', y='salary', data=df, kind='scatter')
# Question 3: What is the average salary by gender?
df.groupby('gender')['salary'].mean()
It's definitely okay to have some easier questions to build up a narrative in your report, but we are looking for you to challenge yourself on this project!
This project will involve writing code (but none for this part!). You don't need to write any code until Part 2 of the project, but we do recommend starting early. To give you an idea of the requirements we will ask of the code you write for Part 2, we have listed a draft of the requirements in the expandable card below. You do not need to worry much about these now, but we wanted to give you some context for what your code will look like.
Notice: These requirements are a draft and are subject to change. These will be finalized when we release the Part 2 specification.
.py file). You are more than welcome to experiment and/or develop in a Jupyter Notebook, but your end result must be a runnable Python script to output all your results.
flake8.
data in your project to store all your data and a directory called results to store any results you want to save (namely plots).
Project Part 0 is due on Thursday, May 14 at 23:59 (PDT). Remember, you can only submit Part 0 up to one day late regardless of the number of late days your group may have remaining. Please see the late day policy on the syllabus for more information.
Submit your Part 0 as a PDF file. Do not turn in a Word document or plain text. One group member should submit your report on Gradescope and should use the Group Members functionality to add the appropriate partners if you have any. If you want to learn about how to add Group Members on Gradescope, please see instructions here. Group members who are not listed in Gradescope by the late cutoff will be marked as not submitted.