Homework 7: Final project: Design your own data analysis

So far, you have analyzed data from a variety of sources to solve realistic problems from science, engineering, and business. Now it's your turn to choose and analyze a problem. This is good practice for how you will use Python for the remainder of your career.

There are four parts to this assignment, due separately.
Part 0 is due on Wednesday, Feb 22, 2017 at 11pm. (10% of overall project grade) (Catalyst Dropbox)
Part I is due on Wednesday, March 1, 2017 at 11pm. (10% of overall project grade) (Catalyst Dropbox) (Survey)
Part II is due on Friday, March 10, 2017 at 11pm. (70% of overall project grade) (Catalyst Dropbox) (Survey)
Part III is due on Monday, March 13, 2017 at 12pm (Noon). NO LATES ACCEPTED. (10% of overall project grade) (Catalyst Dropbox) (Survey)
For examples, see the reports from Winter 2013, which had a slightly different format than our reports will have.

Parts 0, I and II may only be submitted ONE DAY LATE regardless of how many late days you have remaining. Part III may NOT be submitted late. Submitting any of parts 0, I, or II late will consume one late day. For example, if you submit Part 0 one day late and Part II one day late that will consume two late days. If you have no late days remaining you will incur a penalty equivalent to 10% off that portion of the project.

This homework will count as approximately TWO regular homeworks.

For this assignment, you are permitted and very STRONGLY encouraged to work with a partner; the two of you will submit one solution. Students in previous quarters reported that they learned a tremendous amount from working with another person. You are not required to work with a partner, and groups may not be larger than two people. Only one of you will submit the assignment — do not submit duplicates. If you work with a partner, we will expect your project to be more substantial than a project done individually. You may wish to use this area on the GoPost to look for a partner. Both partners must contribute equally to BOTH the code and the report writing. If you wish to submit part of your assignment late, that will consume a late day from both partners. If one partner does not have any more late days remaining then that person will receive 10% off for that portion of the project.

Advice from previous students about this assignment: 14wi 15sp

Part 0: Propose an analysis and locate data

Propose a data analysis project to the course staff. This can be almost anything that you choose. You might select a project from your field of study, your extracurricular interests, government or public policy, or elsewhere. Another good source of ideas is repeating an analysis that you read about in the popular press or in a scientific paper — usually you will do a simplified version of the analysis. We are just looking for you to show that you have absorbed the lessons of CSE 160. Impress us!

Your data analysis proposal must clearly state the questions you will seek to answer, the existing dataset you will analyze, and the algorithm you will run on the dataset to answer your questions.

You can think of your proposal as a pitch to a venture capitalist, a foundation, or a scientific review panel. Another way to think of it is as creating a new CSE 160 assignment.

Your Part 0 proposal will probably be about a page long, but it is acceptable for the proposal to be longer or shorter as long as it conveys the required information without irrelevant details. Submit your proposal as a PDF file. (In Microsoft Word, you can choose “save as” to save your document as a PDF file. Do not turn in a Word document or plain text.)

Your proposal will be graded based on the quality and clarity of the writing and on whether your proposal conforms to the requirements above. The staff may direct you to make changes to your proposal. For example, the staff may direct you to revise the scope of your proposal (we don't want you to proceed with a project that is too hard or too easy).

Your project will use Python to help you answer your research questions. You might use Python to obtain the data, to clean/format it, and/or to process it (perform computations on it). Any of these is acceptable; in particular, it's OK if the bulk of your code does the work of obtaining and cleaning the data, then the actual data processing is simple. (But some part of the project must require non-trivial Python processing.)

Your Part 0 project proposal should include the following sections:

  1. Title and author(s). The title should describe your research questions. It should not be something like “CSE 160 homework 7” (though you could use that as a subtitle if you wish).
  2. Summary of research questions. Give a numbered list of one or more research questions. Each one should be a specific question with a specific answer, not merely a general topic or area of investigation. In 1-3 sentences per research question, state what you are trying to compute, and why.
  3. Dataset. Describe the real, existing dataset that you will use, including exact URLs. You may not use a dataset that has been used in a previous CSE 160 assignment or demo. You may not repeat an analysis that you have performed in another class (though you might do something inspired by another class). The data must be real — neither you nor someone else may make up the data.

You do not have to turn in the dataset itself (it may be quite large, or it might be available only via the web rather than as a single download). Your program might access the data directly via the web. If your program needs to access the data through the file system, then ideally, when your program is run, it should automatically check whether the required data files are present, and if not download them before doing any additional work. Alternately, your report must include simple, clear, unambiguous instructions that anyone can follow to download the data themselves.
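The "check whether the data files are present, and if not download them" step described above can be sketched as follows. This is only a minimal illustration, assuming Python 3 and a dataset that is a single file reachable at a plain URL. The function name ensure_data_file, its fetch parameter, and the URL and filename in the commented usage are hypothetical, not part of the assignment.

```python
import os
import urllib.request


def ensure_data_file(url, local_path, fetch=urllib.request.urlretrieve):
    """Download url to local_path unless the file is already present.

    Returns True if a download happened, False if the file already existed.
    The fetch parameter exists only so the download step can be swapped out
    (for example, in a test that should not touch the network).
    """
    if os.path.exists(local_path):
        return False
    # Create the enclosing directory if the path names one.
    directory = os.path.dirname(local_path)
    if directory:
        os.makedirs(directory, exist_ok=True)
    fetch(url, local_path)
    return True


# Hypothetical usage; replace the URL and filename with your dataset's own:
#
#   ensure_data_file("https://example.com/my_dataset.csv",
#                    os.path.join("data", "my_dataset.csv"))
```

Calling a function like this at the start of your program means the course staff can run your analysis without a separate manual download step.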

If the data is not currently publicly available (for example, it is data from some activity you are involved in), then you should make it available to course staff or include it with your assignment submission. One way to make it available is to host it on some public website (Dropbox is one possibility); then, your program and your report should reference that location. Do not use a dataset that cannot be shared with the course staff, such as one that contains confidential medical information or intellectual property.

If the full dataset is too large to download, then provide a subset of it for the course staff to experiment with.

Do not manually perform any data cleaning steps; all data cleaning should be done programmatically, by Python code that you write. That is, you should not be searching in CSV files in Excel and filling in empty boxes, changing strings to numbers, etc. It is possible that cleaning your data will be a big part of the code you write for your project! (Sometimes this can be as much as 75% of a data analysis project.)
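As a rough illustration of programmatic cleaning, the sketch below reads rows with csv.DictReader and converts chosen columns from strings to numbers, marking blank or unparseable cells as missing instead of editing them by hand. It assumes Python 3. The function name clean_rows, the column names, and the tiny sample are made up purely to show the behavior; your real project must use real data.

```python
import csv
import io


def clean_rows(csv_file, numeric_columns, missing_value=None):
    """Read rows with csv.DictReader and convert the named columns to floats.

    Blank or non-numeric cells become missing_value rather than being
    edited by hand in the source file.
    """
    cleaned = []
    for row in csv.DictReader(csv_file):
        for column in numeric_columns:
            # A short row can leave the cell as None, so fall back to "".
            text = (row[column] or "").strip()
            try:
                row[column] = float(text)
            except ValueError:
                row[column] = missing_value
        cleaned.append(row)
    return cleaned


# Made-up placeholder sample, used only to demonstrate the function.
sample = io.StringIO("name,value\nA,1.5\nB,\nC,n/a\n")
rows = clean_rows(sample, ["value"])
print(rows)
```

How you treat missing values (skip the row, substitute a default, and so on) is a design decision that belongs in your methodology description.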

Hints about datasets: The best approach is to start with a problem that interests you, and then look for data. Alternately, you can start with a dataset. Here are some possible data sources, but many more exist:

You will not turn in any code for Part 0. However, you might write code. Some reasons are:

Speaking of libraries, you are free to use most any Python library you find that will be useful to you. If you know of a library you will use at this time (not required for the proposal) please mention it in your proposal. The earlier you can mention this to us the better, as some libraries may do so much of the work for you that you need to beef up other parts of the project to make it substantial enough to get a good grade. A few libraries students have found useful in the past include:

You may also find several of the course handouts to be useful resources on csv.DictReader and interacting with files.

The goal of Part 0 is for you to describe your project idea in enough detail that the course staff can evaluate whether it is an appropriate project. We will try to give you feedback on your project proposal very quickly so you have time to adjust your idea before Part I is due.

Submit Part 0

Submit your part as a PDF file. Do not turn in a Word document or plain text. One group member should submit your report via Catalyst CollectIt (a.k.a. Dropbox).
Reminder: You can only submit Part 0 one day late regardless of the number of late days you may have remaining.

Part I: Describe Your Motivation, Methodology and Work Plan

Now that your Part 0 project proposal has been approved, it is time to get started! You need to provide more background and decide on the details of how you will answer your research questions. For Part I, you will submit a revised version of your Part 0 project proposal. Include any feedback you received from course staff and any further refinements you have made. In addition you must include three new sections: one on your motivation, one on your methodology, and one on your work plan.

Your Part I project proposal should include the following sections:

  1. Title and author(s).
  2. Summary of research questions.
  3. Motivation and background. Explain the context and why the problem matters. This expands on the research questions that you already stated. Why are they worth computing? What difference would knowing the answers make? We require a problem with some kind of real-world motivation.
  4. Dataset.
  5. Methodology (algorithm or analysis). Write a complete, clear, unambiguous English description of the analysis you will perform. This should be sufficient for someone else to write a Python program (or perform manual computations) that reproduces your results, without access to your source code, and without having to guess or make significant design choices. This description is also likely to be helpful to people who read your code later.
    This section explains how your analysis works on an abstract level, focusing on the problem domain and your algorithm. It should be written to be read by a scientist or engineer, not by a programmer. It should say nothing about specific implementation choices, such as how your code is organized or implemented (such details belong in code comments), what data structures are iterated over, and the like.
    This section should tie your computations to your research questions, indicating exactly what results would lead you to what conclusions.
    You have learned some elementary statistics in CSE 160. One other concept you might find useful is correlation, a measure of whether two variables' values are related, such as when one variable depends on another. Correlation is a quantitative measure of how good a "best fit line" you can draw on a scatterplot. A concrete metric for correlation is the Pearson coefficient r. It is better to import and use SciPy's pearsonr function rather than implementing it yourself. After you have computed r, you will need to interpret it.
  6. Work Plan. Include a breakdown of the remainder of your project (that is, the work you will do for Part II) into at least 3 and no more than 6 parts, and an estimate of the time you will spend on each of the parts. This does not need to be down to the level of the functions you will implement, but it is fine if you include that level of detail (more detail helps you). If you are working with a partner you must also describe how you will develop and test your code and coordinate other aspects of working together in a team (e.g. sharing access to source code, dividing responsibilities or working together). Both partners must contribute equally to BOTH the code and the report writing. If you are working with a partner you may be interested in trying pair programming. Pair programming is a technique that is a part of Extreme and Agile software development approaches used in some software companies. Here are a few references on pair programming: [as part of Agile Development] [guidelines for pair programming] [pair programming in Computer Science courses] [Even a middle school student can do it!]. Note: You are not required to work in pairs; working individually is fine.
  7. Questions. If you have any specific questions for us feel free to add those here. (Not required)
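For the correlation computation mentioned under Methodology above, a minimal sketch using SciPy's pearsonr might look like the following. It assumes SciPy is installed. The helper name describe_correlation and the cutoffs it uses to label r are illustrative assumptions only, not a standard; you should justify whatever interpretation you adopt in your own report.

```python
from scipy.stats import pearsonr


def describe_correlation(xs, ys):
    """Return (r, label), where label is a rough verbal reading of |r|."""
    # pearsonr returns the coefficient r and a p-value; only r is used here.
    r, _p_value = pearsonr(xs, ys)
    strength = abs(r)
    # These cutoffs are an illustrative rule of thumb, not a standard.
    if strength >= 0.7:
        label = "strong"
    elif strength >= 0.3:
        label = "moderate"
    else:
        label = "weak"
    return r, label


# The sign of r gives the direction: positive when the variables rise
# together, negative when one falls as the other rises.
```

Remember that r measures only how well a straight line fits; a low r does not rule out a non-linear relationship, and a high r does not by itself show that one variable causes the other.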

Submit Part I

Submit your part as a PDF file. Do not turn in a Word document or plain text. One group member should submit your Part I report via Catalyst CollectIt (a.k.a. Dropbox).
Reminder: You can only submit Part I one day late regardless of the number of late days you may have remaining.

Finally, answer a survey about how much time you spent on Part I of this assignment. (Each group member should do this survey individually.)

Part II: Perform the analysis and report the results

Implement your analysis, process your data, and interpret the results. Then, complete your report to include the results and conclusions of your analysis. Plots and other visual representations of data are very useful in conveying your conclusions. Please annotate any visualizations with the method used to produce them (e.g. did your Python program produce them or did you create them using some other method). Please include plots in your report, not as separate files! At this time you may also go back and improve any of the previous sections you have written.

Submit your report in PDF format. Your report will probably be about 4-6 pages of text long, but there are no fixed upper or lower bounds on its size. You should write at an appropriate length: neither so briefly that you omit information, nor so verbosely that you pad your report or bury the important information under irrelevant details. Visualizations can definitely make your report longer - which is fine!

Your report should contain at least the following parts. You are permitted to write additional sections as well.

  1. Title and author(s).
  2. Summary of research questions AND RESULTS. Repeat your research questions in a numbered list.
    After each research question, clearly state the answer you determined. Don't give details or justifications yet — just the answer.
  3. Motivation and background.
  4. Dataset.
  5. Methodology (algorithm or analysis).
  6. Results. Present and discuss your research results. Treat each of your research questions separately. Focus in particular on the results that are most interesting, surprising, or important. Discuss the consequences or implications. Interpret the results: if the answers are unexpected, then see whether you can find an explanation for them, such as an external factor that your analysis did not account for. Include some visualization of your results (a graph, plot, bar chart, etc.).
  7. Reproducing your results. Give clear and explicit instructions for obtaining the data and for running the analysis; this is usually a set of commands that can be typed at the command line (not in Canopy). We should be able to follow these directions to run your code ourselves. Also explain how to interpret the results or to find the interesting parts in the output.
    These instructions must make it possible for the course staff to re-create, without any additional data entry or interaction, both (a) every number or figure that appears in your report and (b) any other results that support your argument but that you did not include in your report.
  8. Work Plan Evaluation. Include your work plan and evaluate it. How accurate were your work plan estimates from Part I? Why were your estimates good or bad?
  9. Testing. Describe how you tested your code. Did you use asserts? Smaller data files? Be sure to include these with your code. Why should we trust your results?
  10. Live Presentation or Video? Tell us whether you are planning to give A) a ~2 minute presentation about your project on Monday March 13, 2017 2:30-4:20pm OR B) submit a 4 minute video of your presentation by Monday March 13, 2017 at 12pm (Noon).
  11. Collaboration. State which students or other people (besides the course staff and your project partner, if any) helped you with the assignment, or that no one did.
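As one way to approach the Testing item above, the sketch below checks a small function with asserts on tiny inputs whose answers can be verified by hand. The function count_by_category and its test data are hypothetical stand-ins for whatever your own project computes.

```python
def count_by_category(rows, column):
    """Count how many rows have each distinct value in the given column."""
    counts = {}
    for row in rows:
        category = row[column]
        counts[category] = counts.get(category, 0) + 1
    return counts


def test_count_by_category():
    # Tiny hand-checkable input, standing in for a small test data file.
    tiny = [{"kind": "a"}, {"kind": "b"}, {"kind": "a"}]
    assert count_by_category(tiny, "kind") == {"a": 2, "b": 1}
    # Edge case: no rows at all.
    assert count_by_category([], "kind") == {}


test_count_by_category()
print("All tests passed.")
```

Tests like this are most convincing when they cover edge cases (empty input, missing values) as well as the typical case.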

Your source code should be well-written and well-commented. It should be clear enough for another programmer, such as the course staff, to understand and modify if needed. Your source code documentation should assume that the programmer has already read your report — you do not need to repeat any of those details, but may wish to use concepts that it introduced. A typical final project might contain around 200 lines of well-structured code without duplicated functionality, though longer or shorter is possible.

It is acceptable for you to scale back, or to expand, the scope of your project if necessary. It's better to do a great job on a subset of your original proposal, than to do a bad job on a larger project. If you have to scale back, then explain why the task was more difficult than you estimated when you wrote your proposal. This will help you to make a better estimate for your next project. It will also convince the course staff that you have done an acceptable amount of work for CSE 160.

Submit Part II

Submit your part as a PDF file. Do not turn in a Word document or plain text. One group member should submit the following via Catalyst CollectIt (a.k.a. Dropbox):

Reminder: You can only submit Part II one day late regardless of the number of late days you may have remaining.

Finally, answer a survey about how much time you spent on Part II of this assignment. (Each group member should do this survey individually.)

Part III: Present your results

We provide two options for presentation of your results. Each project must choose one of these options at the time that Part II is submitted. If you choose option A) your slides are due 12pm (Noon) Monday, March 13, 2017. If you choose option B) your slides & video are due at 12pm (Noon) Monday, March 13, 2017. NO LATES ACCEPTED for either option.

A) Live presentation during our final exam slot on Monday March 13, 2017 2:30-4:20pm.

Prepare a slide deck of no more than 5 slides (including title slide), that can be presented in no more than ~2 minutes (exact length announced Sat March 11, 2017). The slide deck should summarize the main points of your report, including motivation, research questions, and results. You will only have time to present the most important aspects of your project. This may include your algorithm, but it will probably not include implementation details of your Python code. If you ran into major challenges with your project and are unsure what would be best to present, email us ASAP and we can give you suggestions.

If you worked with a partner, BOTH partners should speak/present during your presentation.

The slide deck must be a single self-contained file, in PDF or PowerPoint format. We will load all the documents on a single computer and all groups will present using that computer.

Your slides are due at 12pm (Noon) on Monday, March 13, 2017. NO LATES ACCEPTED. You will present your work to the rest of the class on Monday, March 13, 2017 during our exam slot: 2:30-4:20pm. It is strongly recommended that you practice your presentation ahead of time.

B) Video presentation due Monday March 13, 2017 12pm (Noon). NO LATES ACCEPTED.

Prepare a 4-minute video describing your project as described above for option A). The only difference is you may use up to 10 slides and you are allowed 4 minutes.

You may choose to use CamStudio (Windows only), QuickTime, or PowerPoint's built-in slide-show recording tools. If you choose the video option you will also need to submit your presentation slides. You do not need to stand in front of a projector and record yourself and the slides with a video camera. You can just record what your computer displays (the animated slides) and what you say (the oral presentation). If you do use a video camera to record your presentation, then make sure that the slides are within the view of the video camera. If you work with a partner, BOTH partners need to speak/present and both need to clearly identify themselves when speaking in the video.

Your slides and video are due at 12pm (Noon) on Monday, March 13, 2017. NO LATES ACCEPTED.

Submit Part III

One group member should submit your slide deck and/or video via Catalyst CollectIt (a.k.a. Dropbox).

It is not permitted to submit your slide deck or video late: you may not use late days, and the deadline is firm.

Finally, answer a survey about how much time you spent on Part III of this assignment. (Note: This survey also asks a few questions about the course overall so is slightly longer than normal.) (Each group member should do this survey individually.)

Now you're done!