Propose a data analysis project to the course staff. This can be almost anything that you choose. You might select a project from your field of study, your extracurricular interests, government or public policy, or elsewhere. Another good source of ideas is repeating an analysis that you read about in the popular press or in a scientific paper — usually you will do a simplified version of the analysis. We are just looking for you to show that you have mastered the lessons of CSE 163 and can apply that knowledge to new challenges. Impress us!
The goal of the project proposal is for you to describe your idea in enough detail that the course staff can evaluate whether it is an appropriate project. You are able to work in a group of up to 3 people total. The final project is composed of two parts, a proposal phase and a deliverable phase. These cannot be submitted late and cannot be resubmitted through the resubmission process. Once the deadline for a project part passes, you will no longer be able to make further submissions for that part of the project.
The following sections describe some requirements for your proposal and project more broadly. Your project proposal will probably be about 1-2 pages long, but it is acceptable for the proposal to be longer or shorter. Do not worry too much about the length; just focus on conveying the required information in as much detail as you think is relevant. Submit your proposal as a PDF file. (In Microsoft Word, you can choose “Save As” to save your document as a PDF file. Do not turn in a Word document or plain text.)
Your project proposal should include the following sections:
Work Plan. Describe your plan to divide the project into at least 3 (but no more than 7) discrete tasks, each with an estimate of the hours it will require. Then, describe your workflow for developing code, testing code, and coordinating work, particularly how team members can support each other in case one task is unexpectedly challenging. All team members must contribute equally to the deliverables.
You should also describe your primary development environment. Here are some options to consider:
You are not allowed to use a dataset that we have used in class as the central dataset of your project. You can use one of our datasets to add on to a part of your project, but it should not be the central focus.
Your report must include simple, clear, unambiguous instructions that anyone can follow to download the data themselves. You should not rely on the staff to be able to "figure out" how to access the data; you must provide instructions for us. If the data is not publicly available, then you should make it available to course staff or include it with your assignment submission. One way to do this is to host your dataset on Dropbox or Google Drive and provide a link to it in your report. If the full dataset is too large to download, then provide a subset of it for the course staff to experiment with.
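If you need to provide a subset, one possible way to produce it is to sample rows with pandas. The sketch below is only an illustration: the helper name `make_subset`, the stand-in data, and the sample size are all hypothetical, and it assumes the full dataset fits in memory.

```python
import pandas as pd


def make_subset(df, n=1000, seed=0):
    """
    Returns a random sample of at most n rows from df.
    (Hypothetical helper, not part of any course library.)
    """
    # Never ask for more rows than the dataset has
    n = min(n, len(df))
    # A fixed random_state makes the sample reproducible
    return df.sample(n=n, random_state=seed)


# Stand-in for a large dataset; in practice this would be loaded
# from your own file, for example with pd.read_csv
full = pd.DataFrame({'id': range(10000), 'value': range(10000)})
subset = make_subset(full, n=500)
```

Calling `subset.to_csv('subset.csv', index=False)` would then write the sample to a file you can share (the file name is a placeholder). Fixing the seed means the staff and your teammates all see the same rows.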
Do not use a dataset that cannot be shared with the course staff, such as one that contains confidential medical information or intellectual property. Do not manually perform any data cleaning steps — all data cleaning should be done programmatically, by Python code that you write.
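As a rough illustration of cleaning done in code rather than by hand, the sketch below uses pandas on a small made-up CSV. The column names, the sample values, and the particular cleaning rules (trimming whitespace, coercing bad numbers, dropping incomplete rows) are assumptions for illustration only; your dataset will need its own rules.

```python
import io

import pandas as pd

# Made-up raw data standing in for a downloaded file
RAW = """name,age,city
 Ana ,34,Seattle
Ben,unknown,Tacoma
Cy,51,
"""


def clean(df):
    """
    Returns a cleaned copy of df: trims whitespace in name, converts
    age to a number, and drops rows that are missing age or city.
    """
    df = df.copy()
    df['name'] = df['name'].str.strip()
    # Non-numeric entries such as 'unknown' become NaN
    df['age'] = pd.to_numeric(df['age'], errors='coerce')
    # Drop incomplete rows, then store age as whole numbers
    df = df.dropna(subset=['age', 'city']).astype({'age': int})
    return df.reset_index(drop=True)


raw = pd.read_csv(io.StringIO(RAW))
cleaned = clean(raw)
```

Because every step lives in `clean`, anyone can rerun it on the raw file and get the same result, which is not true of edits made by hand in a spreadsheet.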
Challenge goals help us to define expectations while still offering flexibility for you to design your own project. Meeting 2 or more challenge goals will likely require writing at least 120 lines of Python code. In order to qualify for a challenge goal, you must experiment with something not explicitly discussed in the course. Here are the challenge goals you can choose from. For some of the bullets we provide examples, which are not meant to be exhaustive.
The following sections have some advice on dataset sources and libraries that you might want to look into. We also provide some more context for what we will be covering for the rest of the quarter.
For some example reports and slides, please refer to our past project gallery.
NOTE: Some of these projects predate the challenge goals and other changes to the spec, so there might be some inconsistencies. It is up to you to understand the spec for this particular quarter.
The best approach is to start with a problem that interests you, and then look for data. Google will be your friend for finding a dataset. Alternatively, an equally valid approach is to start with a dataset that looks interesting and design questions that interest you around that data. Here are some possible data sources, but many more exist:
You are free to use almost any Python library that will be useful to you, especially the ones we have learned in class this quarter. If you already know of a library you will use (not required for the proposal), please mention it in your proposal. Below, we list some libraries you might want to look into for your project, since students have found them useful in the past:
You are also encouraged to take advantage of the libraries we learned this quarter to help you on this project (but they won't count as a new library for that challenge goal). The list below includes all libraries we have discussed (or will discuss) in depth in CSE 163.
pandas
seaborn
matplotlib
scikit-learn
geopandas (geo-spatial data)
numpy (image data)
To give some context for why we have these challenge goals, we provide an example project proposal that is way too simple.
Consider a project proposal that wants to use the following CSV dataset about salaries.
id | age | salary | gender | location |
---|---|---|---|---|
1 | 42 | 110000 | F | Seattle |
2 | 23 | 56000 | M | Kenmore |
3 | 18 | 20000 | M | SF |
... | ... | ... | ... | ... |
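Suppose the project asked the following research questions:
Question 1: What is the average age?
Question 2: How does salary relate to age?
Question 3: What is the average salary by gender?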
Is this a good project? In general, it can be a bit hard to say, since you can't judge a project just by the number of research questions. Instead, you need to think about their depth and how much work would be necessary to answer them. This project ends up being very straightforward to do in about 4 lines of code! One way to tell this project is too easy is that it doesn't meet any of the challenge goals! As we mentioned before, to adequately meet the challenge goals you will probably be writing at least 120 lines of Python code! That might sound like a lot, but that's already fewer lines than most of your homework assignments!
import pandas as pd
import seaborn as sns

# Load the salary dataset shown in the table above
df = pd.read_csv('data.csv')
# Question 1: What is the average age?
df['age'].mean()
# Question 2: How does salary relate to age?
sns.relplot(x='age', y='salary', data=df, kind='scatter')
# Question 3: What is the average salary by gender?
df.groupby('gender')['salary'].mean()
It's definitely okay to have some easier questions to build up a narrative in your report, but we are looking for you to challenge yourself on this project!
Your project proposal is due on Friday, August 6 at 23:59 (AoE). Submit your proposal as a PDF file. Remember that you cannot submit any part of the project late and you cannot use the resubmission process for take-home assessments to make submissions after the due date.
More details about the submission process will be announced on Ed as a pinned post.