CSE 163, Spring 2019: Final Project: Part 0

Overview

The goal of Part 0 is for you to describe your project idea in enough detail that the course staff can evaluate whether it is an appropriate project. We will try to give you feedback on your project proposal very quickly so you have time to adjust your idea before Part 1 is due.

Although you may use a late day for this part of the project, we strongly encourage you to turn Part 0 in on time so that the staff can get feedback to you as quickly as possible.

Details

Propose a data analysis project to the course staff. This can be almost anything that you choose. You might select a project from your field of study, your extracurricular interests, government or public policy, or elsewhere. Another good source of ideas is repeating an analysis that you read about in the popular press or in a scientific paper — usually you will do a simplified version of the analysis. We are just looking for you to show that you have absorbed the lessons of CSE 163. Impress us!

Your data analysis proposal must clearly state the questions you will seek to answer, the existing dataset you will analyze, and the algorithm you will run on the dataset to answer your questions.

You can think of your proposal as a pitch to a venture capitalist, a foundation, or a scientific review panel. Another way to think of it is as creating a new CSE 163 assignment.

Your Part 0 proposal will probably be about a page long, but it is acceptable for the proposal to be longer or shorter as long as it conveys the required information without irrelevant details. Submit your proposal as a PDF file. (In Microsoft Word, you can choose “save as” to save your document as a PDF file. Do not turn in a Word document or plain text.)

Your proposal will be graded based on the quality and clarity of the writing and on whether your proposal conforms to the requirements above. The staff may direct you to make changes to your proposal. For example, the staff may direct you to revise the scope of your proposal (we don't want you to proceed with a project that is too hard or too easy).

Your project will use Python to help you answer your research questions. You might use Python to obtain the data, to clean/format it, and/or to process it (perform computations on it). Any of these is acceptable; in particular, it's OK if the bulk of your code does the work of obtaining and cleaning the data, then the actual data processing is simple. (But some part of the project must require non-trivial Python processing.)

Your Part 0 project proposal should include the following sections:

  1. Title and author(s). The title should describe your research questions. It should not be something like “CSE 163 homework 7” (though you could use that as a subtitle if you wish).
  2. Summary of research questions. Give a numbered list of one or more research questions. Each one should be a specific question with a specific answer, not merely a general topic or area of investigation. In 1-3 sentences per research question, state what you are trying to compute, and why.
  3. Dataset. Describe the real, existing dataset that you will use, including exact URLs. You may not use a dataset that has been used in a previous CSE 163 assignment or demo. You may not repeat an analysis that you have performed in another class (though you might do something inspired by another class). The data must be real — neither you nor someone else may make up the data.

You do not have to turn in the dataset itself (it may be quite large, or it might be available only via the web rather than as a single download). Your program might access the data directly via the web. If your program needs to access the data through the file system, then ideally, when your program is run, it should automatically check whether the required data files are present, and if not download them before doing any additional work. Alternately, your report must include simple, clear, unambiguous instructions that anyone can follow to download the data themselves.
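The "check whether the data files are present, and if not download them" step could be sketched as follows. This is only a minimal illustration using the standard library: the URL, the file path, and the function name `ensure_data` are hypothetical placeholders, not part of the assignment, and you would replace them with your dataset's real location.

```python
import os
import urllib.request

# Hypothetical placeholders: substitute your dataset's real URL and file name.
DATA_URL = "https://example.com/data/my_dataset.csv"
DATA_PATH = os.path.join("data", "my_dataset.csv")


def ensure_data(url=DATA_URL, path=DATA_PATH):
    """Download url to path unless the file already exists; return path."""
    if os.path.exists(path):
        # The file is already present, so skip the download.
        return path
    directory = os.path.dirname(path)
    if directory:
        os.makedirs(directory, exist_ok=True)
    # Download under a temporary name first so that an interrupted
    # download does not leave a partial file that looks complete.
    tmp_path = path + ".part"
    urllib.request.urlretrieve(url, tmp_path)
    os.replace(tmp_path, path)
    return path
```

A program would call `ensure_data()` once at startup, before doing any additional work, and then open the returned path as usual.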

If the data is not currently publicly available (for example, it is data from some activity you are involved in), then you should make it available to course staff or include it with your assignment submission. One way to make it available is to host it on some public website (Dropbox or Google Drive are possibilities); then, your program and your report should reference that location. Do not use a dataset that cannot be shared with the course staff, such as one that contains confidential medical information or intellectual property.

If the full dataset is too large to download, then provide a subset of it for the course staff to experiment with.

Do not manually perform any data cleaning steps; all data cleaning should be done programmatically, by Python code that you write. That is, you should not be searching through CSV files in Excel and filling in empty cells, changing strings to numbers, and so on. It is possible that cleaning your data will be a big part of the code you write for your project! (Sometimes this can be as much as 75% of a data analysis project.)
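As a rough sketch of what programmatic cleaning can look like, the function below handles the two chores mentioned above (empty cells and strings that should be numbers) with the standard `csv` module. The function name `clean_rows`, the column names, and the choice to represent a missing value as `None` are all assumptions made for illustration; the right treatment of missing values depends on your dataset and research questions. The tiny in-memory sample is a toy input used only to show the call and is not real data.

```python
import csv
import io


def clean_rows(lines, numeric_columns, missing=None):
    """Yield cleaned row dicts from an iterable of CSV lines.

    Cells in numeric_columns are converted from strings to floats;
    blank cells in those columns become `missing`.
    """
    for row in csv.DictReader(lines):
        for column in numeric_columns:
            # DictReader may give None for a short row, so guard with "or".
            value = (row[column] or "").strip()
            row[column] = float(value) if value else missing
        yield row


# Toy input for demonstration only (not a real dataset).
sample = io.StringIO("name,value\na,10\nb,\n")
cleaned = list(clean_rows(sample, ["value"]))
```

With a real file, the same function can be given an open file object, for example `with open(path, newline="") as f: rows = list(clean_rows(f, ["value"]))`, where `path` and the column list are whatever your dataset requires.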

You will not turn in any code for Part 0. However, you might write code. Some reasons are:

  • You might write code to help you download your data.
  • You might write code to perform data cleaning.
  • You might perform an experiment to validate your plans. For instance, maybe you plan to use a particular library. It would be wise to experiment with that library to make sure it has the functionality you need and that you understand how to use it. Such an experiment is called “prototyping”.
  • You might want to get a head start on Parts 1 and 2.

Dataset Advice

The best approach is to start with a problem that interests you, and then look for data. Google will be your friend for finding a dataset. Alternately, you can start with a dataset to see if there is a problem that interests you. Here are some possible data sources, but many more exist:

Libraries

You are free to use almost any Python library that will be useful to you, especially the ones we have learned in class this quarter. If you already know of a library you will use (this is not required for the proposal), please mention it in your proposal. The earlier you mention this to us the better, as some libraries may do so much of the work for you that you will need to beef up other parts of the project to make it substantial enough to get a good grade. A few libraries students have found useful in the past include:

You are also encouraged to take advantage of the libraries we learned this quarter to help you on this project.

Submission

Submit your Part 0 as a PDF file. Do not turn in a Word document or plain text. One group member should submit your report on Gradescope and should use the Group Members functionality to add the appropriate partner if you have one. If you want to learn about how to add Group Members on Gradescope, please see instructions here. Group members who are not listed in Gradescope by the late cutoff will be marked as not submitted.

Reminder: You can only submit Part 0 one day late regardless of the number of late days you may have remaining.