Homework 8: Final project: Design your own data analysis

So far, you have performed data analysis on a variety of data sources, to solve realistic problems from science, engineering, and business. Now it's your turn to choose and analyze a problem. This is good practice for how you will use Python in the remainder of your career.

There are two parts to this assignment, due separately. Part I is due on Thursday, August 9. Part II is due on Thursday, August 16.

For this assignment, you are permitted to work with a partner; the two of you will submit just one solution. You are not required to work with a partner, and groups may not be larger than two people. Only one of you will submit the assignment — do not submit duplicates. If you work with a partner, we will expect your project to be twice as substantial as a project done individually.

Part I: Propose an analysis and locate data

Propose a data analysis project to the course staff. This can be almost anything that you choose. You might select a project from your field of study, from your extracurricular interests, from open government or public policy, or from elsewhere. We are just looking for you to show that you have absorbed the lessons of CSE 190p. Impress us!

Think of your proposal as a pitch to a venture capitalist, a foundation, or a scientific review panel. Your data analysis proposal must clearly state the problem, in the form of one or more questions that you will seek to answer. It must explain your algorithm or other analysis, and you must have already located a pre-existing dataset that your code will analyze.

More specifically, you should turn in a draft report, in the format required in Part II of this assignment. Your report should include all of the parts required in the final report, except for the results in part 2, part 6 (Results), and part 7 (Reproducing your results).

Your proposal will probably be about 2 pages long. Submit your proposal in PDF (not in plain text or in proprietary binary formats like Microsoft Word or Rich Text Format).

The course staff will review your proposal and will either approve it or will require you to make changes. They will base their assessment on:

Hints about datasets: A good approach is to start with a problem of interest and then look for data. Alternatively, you can start with a dataset. Here are some possible data sources, but many more exist:

Part II: Perform the analysis and report the results

Implement your analysis, process your data, and interpret the results. Then, submit a report that describes the results and conclusions of your analysis. It might include graphs, tables of numbers, or just a few key computations. Remember that plots and other visual representations of data are very useful in conveying your conclusions.

Submit your report in PDF (not in plain text or in proprietary binary formats like Microsoft Word or Rich Text Format). Your report will probably be about 4-6 pages of text long, but there are no fixed upper or lower bounds on its size. You should write at an appropriate length: neither so briefly that you omit information, nor so verbosely that you pad your report or bury the important information under irrelevant details.

Your report should contain at least the following parts. You are permitted to write additional sections as well.

  1. Title and author(s). The title should be descriptive of your research questions. It should not be something like "CSE 190p homework 8" (though you could use that as a subtitle if you wish).
  2. Overview of research questions and results. In 1-3 sentences, state each research question — that is, each problem you investigated. What are you trying to compute, and why?
    After each research question, clearly state the answer you determined. You don't have to give details or justifications yet — just the answer.
  3. Motivation and background. State, explain, and motivate the problem. In other words, explain the context and why the problem matters. This expands on the research questions that you already stated. Why are they worth computing? What difference would knowing the answers make? We require a problem with some kind of real-world motivation.
  4. Dataset. Describe the real, pre-existing dataset that you used, including exact URLs. You may not use a dataset that has been used in a previous CSE 190p assignment or demo. You do not have to turn in the dataset itself (it may be quite large, or it might be available only via the web rather than as a single download).
    If the dataset needs to be downloaded, then ideally, when your program is run, it should automatically check whether the dataset has already been downloaded; if not, your program should download it before doing any additional work. If your program does not do this, then your report must include clear, unambiguous instructions that anyone can follow to download it themselves.
  5. Methodology (algorithm or analysis). A complete, clear, unambiguous English description of the analysis you performed. This should be sufficient for someone else to reproduce your results, even without access to your source code, and without having to guess or to make significant design choices. This description is also likely to be helpful to people who read your code.
  6. Results. Present and discuss your research results. Focus in particular on the parts that are most interesting, surprising, or important. Discuss the consequences or implications.
  7. Reproducing your results. Give clear and explicit instructions for obtaining the data and for running the analysis, and for interpreting the results or finding the interesting parts in the output. It must be possible for the course staff to run the code, without any additional data entry or interaction, to re-create every number or figure that appears in your report (and also any other results that support your argument but that you did not include in your report).
  8. Collaboration and reflection. What did you learn from this assignment? What do you wish you had known before you started? What would you do differently? What advice would you offer to future students embarking on this project?
    Also, state which students or other people (besides the course staff) helped you with the assignment, or that no one did. State how many hours you spent on Part I and Part II of the assignment. Also state what you or the staff could have done to help you with the assignment.
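
The Dataset part above asks that your program check whether the data has already been downloaded and fetch it only if it is missing. One way this might look is sketched below. The function name `ensure_dataset`, the `fetch` parameter, and the URL and filename in the usage comment are all hypothetical and are not part of the assignment; substitute your own.

```python
import os
import urllib.request


def ensure_dataset(url, filename, fetch=urllib.request.urlretrieve):
    """Download url to filename unless filename already exists.

    Returns True if a download happened, False if the file was already there.
    The fetch parameter is a hypothetical hook (not required by the
    assignment) that makes it possible to try the function without a network.
    """
    if os.path.exists(filename):
        return False
    # Download under a temporary name first, so that an interrupted
    # download does not leave a partial file that looks complete.
    partial = filename + ".part"
    fetch(url, partial)
    os.replace(partial, filename)
    return True


# Example use at the top of an analysis script (placeholder URL and name):
#     ensure_dataset("https://example.com/data.csv", "data.csv")
```

Calling the function at the start of every run is harmless, because it does nothing when the file is already present.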

Also, submit your commented source code. Your source code should be clear enough for another programmer, such as the course staff, to understand and modify if needed. Your source code documentation may assume that the programmer has already read your report.
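
The report's "Reproducing your results" part requires that the staff can re-create every number without data entry or interaction. A minimal sketch of one way to organize code for that is shown here, assuming a CSV input with a column named `value`. The column name, the function names, and the statistics computed are illustrative assumptions only, not a required design.

```python
import csv
import random
import statistics

SEED = 0  # fixed seed, so any randomized step gives the same numbers each run


def load_values(csv_file, column):
    """Return the numeric values in the named column of an open CSV file."""
    return [float(row[column]) for row in csv.DictReader(csv_file)]


def analyze(values):
    """Compute the numbers quoted in the report (illustrative statistics)."""
    rng = random.Random(SEED)
    sample = rng.sample(values, k=min(3, len(values)))
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "sample_mean": statistics.mean(sample),
    }


def main(csv_file):
    """Run the whole analysis with no prompts and print each result."""
    results = analyze(load_values(csv_file, "value"))
    for name in sorted(results):
        print(name, "=", results[name])
    return results
```

Because `main` never calls `input` and the random generator is seeded, two runs on the same file print the same numbers, which is the property the staff will check.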

Submit your work

Submit your files via Catalyst CollectIt (a.k.a. Dropbox).

In-class presentation

You will present your work to the rest of the class on Friday, August 17.

It is recommended that you use slides (which you can prepare using PowerPoint, Keynote, Impress, etc.). Your presentation will be strictly limited to 5 minutes. It is strongly recommended that you practice your presentation ahead of time. You will only have time to present the most important results from your project. Be sure to clearly state the research questions and your answers to them.