So far, you have analyzed data from a variety of sources to solve realistic problems from science, engineering, and business. Now it's your turn to choose and analyze a problem. This is good practice for how you will use Python for the remainder of your career.
There are three parts to this assignment, due separately.
Part I is due on Thursday, March 7 at 11pm.
Part II is due on Friday, March 15 at 9pm.
Part III is due on Monday, March 18 at noon.
For examples, see the reports from Winter 2013.
For this assignment, you are permitted and encouraged to work with a partner; the two of you will submit one solution. You are not required to work with a partner, and groups may not be larger than two people. Only one of you will submit the assignment — do not submit duplicates. If you work with a partner, we will expect your project to be more substantial than a project done individually. If you wish to submit your assignment late, then all project partners must use a late day. If one of the partners does not have any late days left, then the whole group is required to submit the work on time (or take a grade of zero).
Propose a data analysis project to the course staff. This can be almost anything that you choose. You might select a project from your field of study, your extracurricular interests, government or public policy, or elsewhere. Another good source of ideas is repeating an analysis that you read about in the popular press or in a scientific paper; usually you will do a simplified version of the analysis. We are just looking for you to show that you have absorbed the lessons of CSE 140. Impress us!
In general, your data analysis proposal must clearly state the questions you will seek to answer, the existing dataset you will analyze, and the algorithm you will run on the dataset to answer your questions. More specifically, you should turn in a draft report, in the format required in Part II of this assignment. Your report should include all of the sections required in the final report, except for: the results in section 1 (Summary), all of section 5 (Results), and all of section 6 (Reproducing Your Results). (Later, in Part II, you will add these missing sections, and update other sections in response to feedback from the course staff.) Don't forget to include section 7 (Collaboration and Reflection) in your report, in both Part I and Part II.
You can think of your proposal as a pitch to a venture capitalist, a foundation, or a scientific review panel. Another way to think of it is as creating a new CSE 140 assignment.
Your proposal will probably be about 2-3 pages long, but it is acceptable for the proposal to be longer or shorter as long as it conveys the required information without irrelevant details. Submit your proposal as a PDF file. (In Microsoft Word, you can choose “save as” to save your document as a PDF file. Do not turn in a Word document or plain text.)
Your proposal will be graded based on the quality and clarity of the writing and on whether your proposal conforms to the requirements above. The staff may direct you to make changes to your proposal. For example, the staff may direct you to revise the scope of your proposal (we don't want you to proceed with a project that is too hard or too easy).
Your project will use Python to help you answer your research questions. You might use Python to obtain the data, to clean/format it, and/or to process it (perform computations on it). Any of these is acceptable; in particular, it's OK if the bulk of your code does the work of obtaining and cleaning the data, and the actual data processing is simple. (But some part of the project must require non-trivial Python processing.)
You will not turn in any code for Part I. However, you might write code. Some reasons are:
Do not neglect Part I of the assignment. If you do not work hard on Part I, you will not be prepared for Part II (not to mention that each part is graded).
One group member should submit your report via Catalyst CollectIt (a.k.a. Dropbox).
Finally, answer a survey about how much time you spent on Part I of this assignment. (Each group member should do this part individually.)
Implement your analysis, process your data, and interpret the results. Then, write a report that describes the results and conclusions of your analysis. It might include graphs, tables of numbers, or just a few key computations. Remember that plots and other visual representations of data are very useful in conveying your conclusions.
Submit your report as a PDF file (the same as in Part I). Your report will probably contain about 4-6 pages of text, but there are no fixed upper or lower bounds on its size. You should write at an appropriate length: neither so briefly that you omit information, nor so verbosely that you pad your report or bury the important information under irrelevant details.
Your report should contain at least the following parts. You are permitted to write additional sections as well.
You do not have to turn in the dataset itself (it may be quite large, or it might be available only via the web rather than as a single download). Your program might access the data directly via the web. If your program needs to access the data through the file system, then ideally, when your program is run, it should automatically check whether the required data files are present and, if not, download them before doing any additional work. Alternatively, your report must include simple, clear, unambiguous instructions that anyone can follow to download the data themselves.
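As an illustration of the "check, then download" behavior, here is a minimal sketch that uses only the Python 3 standard library. The function name `ensure_data_file`, its `fetch` parameter, and the URL and file path in the usage comment are hypothetical and are not part of the assignment; adapt them to your own dataset.

```python
import os
import urllib.request


def ensure_data_file(url, path, fetch=urllib.request.urlretrieve):
    """Download url to path unless a file already exists at path.

    Returns True if a download was performed, and False if the file
    was already present (in which case nothing is downloaded).
    """
    if os.path.exists(path):
        return False
    # Create the containing directory, if the path names one.
    directory = os.path.dirname(path)
    if directory:
        os.makedirs(directory, exist_ok=True)
    fetch(url, path)
    return True


# Example use at the start of your program (hypothetical URL and path):
#   ensure_data_file("https://example.com/my_dataset.csv",
#                    os.path.join("data", "my_dataset.csv"))
```

Passing the download function as a parameter is one possible design choice: it lets you try out the "file already present" logic without touching the network.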
If the data is not currently publicly available (for example, it is data from some activity you are involved in), then you should make it publicly available. One way to do this is to host it on some public website (Dropbox is one possibility); then, your program and your report should reference that location. Do not use a dataset that cannot be shared with the course staff, such as one that contains confidential medical information or intellectual property.
If the full dataset is too large to download, then provide a subset of it for the course staff to experiment with.
Do not manually perform any data cleaning steps — all data cleaning should be done programmatically, by Python code that you write.
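For instance, rather than deleting bad rows by hand in a spreadsheet, you might clean them in code. The sketch below assumes a CSV file with a header row; the function name `clean_rows` and the column names in the usage comment are made up for illustration, and your own data will likely need different rules.

```python
import csv


def clean_rows(lines, numeric_field):
    """Yield cleaned rows (as dicts) from an iterable of CSV text lines.

    Strips surrounding whitespace from every header and value, converts
    numeric_field to a float, and skips any row in which that field is
    missing or cannot be parsed as a number.
    """
    reader = csv.DictReader(lines, skipinitialspace=True)
    for row in reader:
        cleaned = {
            key.strip(): (value or "").strip()
            for key, value in row.items()
            if key is not None  # ignore extra, unnamed columns
        }
        try:
            cleaned[numeric_field] = float(cleaned[numeric_field])
        except (KeyError, ValueError):
            continue  # missing or non-numeric value: drop the row
        yield cleaned


# Example use (hypothetical file and column name):
#   with open("temperatures.csv", newline="") as f:
#       rows = list(clean_rows(f, "temp"))
```

Note that `float` also accepts strings such as "nan" and "inf", so a real project may need an extra check for those. Because every cleaning step lives in code, anyone can rerun your program on the original data and obtain the same cleaned result.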
Hints about datasets: The best approach is to start with a problem that interests you, and then look for data. Alternatively, you can start with a dataset. Here are some possible data sources, but many more exist:
Your source code should be well-written and well-commented. It should be clear enough for another programmer, such as the course staff, to understand and modify if needed. Your source code documentation should assume that the programmer has already read your report — you do not need to repeat any of those details, but may wish to use concepts that it introduced. A typical final project might contain around 200 lines of well-structured code without duplicated functionality, though longer or shorter is possible.
It is acceptable for you to scale back, or to expand, the scope of your project if necessary. It's better to do a great job on a subset of your original proposal, than to do a bad job on a larger project. If you have to scale back, then explain why the task was more difficult than you estimated when you wrote your proposal. This will help you to make a better estimate for your next project. It will also convince the course staff that you have done an acceptable amount of work for CSE 140.
One group member should submit the following via Catalyst CollectIt (a.k.a. Dropbox):
Finally, answer a survey about how much time you spent on Part II of this assignment. (Each group member should do this part individually.)
Prepare a slide deck of no more than 5 slides (including the title slide) that can be presented in no more than 3 minutes. The slide deck should summarize the main points of your report, including motivation, research questions, and results. You will only have time to present the most important aspects of your project. This may include your algorithm, but it will probably not include implementation details of your Python code. The slide deck must be a single self-contained file, in PDF (preferred), PowerPoint, or Impress format. We will load all the documents on a single computer, and all groups will present using it.
You will present your work to the rest of the class on Monday, March 18. It is strongly recommended that you practice your presentation ahead of time.
One group member should submit your slide deck via Catalyst CollectIt (a.k.a. Dropbox).
It is not permitted to submit your slide deck late: you may not use late days, and the deadline is firm.
Finally, answer a survey about how much time you spent on Part III of this assignment. (Each group member should do this part individually.)