Proposal
Develop research questions, datasets, and methodology.
Propose a data analysis project to the course staff. Your project might be about your academic or extracurricular interests, issues in government or public policy, or an analysis that you read about in the popular press or academic research literature. Create something that interests you and will outlive this course!
Your project proposal must clearly state the questions you will seek to answer, the existing dataset you will analyze, and the algorithm you will run on the dataset to answer your questions. It should be about 1 page long. Submit your proposal to Canvas. It will be evaluated on whether it meets the requirements below, and the staff may direct you to make changes until it does.
Although programming is not required as part of the proposal, you might find it helpful to prototype some ideas to test their feasibility. For example, it could help to write code to obtain or clean your data and test libraries that you need to solve the problem.
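A feasibility prototype can be very small. The sketch below is one possible starting point, not a required approach: it reads a CSV with Python's standard `csv` module and counts empty values per column. The `SAMPLE` text, its column names, and the `count_missing` helper are all made-up placeholders for illustration; a real prototype would open your actual dataset instead.

```python
import csv
import io

# Made-up placeholder rows standing in for the top of a real dataset.
# In a real prototype, replace io.StringIO(SAMPLE) with open("your_file.csv").
SAMPLE = """name,year,value
A,2020,10
B,2020,
C,2021,30
"""


def count_missing(rows, columns):
    """Return a dict mapping each column name to its number of empty values."""
    missing = {column: 0 for column in columns}
    for row in rows:
        for column in columns:
            value = row.get(column)
            # DictReader yields None for columns absent from a short row.
            if value is None or value.strip() == "":
                missing[column] += 1
    return missing


reader = csv.DictReader(io.StringIO(SAMPLE))
rows = list(reader)
missing = count_missing(rows, reader.fieldnames)
print(len(rows), "rows; missing values per column:", missing)
```

If a quick check like this shows that a key column is mostly empty, that is worth knowing before you commit to a research question.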
Get started by identifying a problem that interests you and then searching the internet for data sources. Or, start with an interesting dataset and design questions around that data. Here are some sample data sources.
- A variety of datasets are available from UW Libraries.
- Awesome Public Datasets, a list that collects a large variety of datasets.
- Baron Schwartz’s list of datasets. Some of these are themselves rich lists of datasets, such as the Registry of Open Data on AWS.
- Data.gov, data.wa.gov, and data.seattle.gov.
- Reddit Data Sets.
- Civic Data Sets for the Pacific Northwest.
- Google Dataset Search.
- An archive of datasets distributed with the R statistical language.
- 40 Places to Find Open Data on the Web.
- UK Office for National Statistics.
- World Bank Data Catalog.
- CDC National Center for Health Statistics Data.
- UCI Machine Learning Repository.
Requirements
The format of the proposal should include the following sections.
- Title and authors
- The title should reflect your specific research questions (not “CSE 163 Project”).
- List of research questions
- Give a numbered list of 3 or more research questions. Each research question should be 1–3 sentences long and definitively answerable. Research questions should be much more specific than general topics or areas of interest.
- Motivation
- Expand on your research questions by providing context about why you care about the problem. How does knowing the answers affect the world or our understanding of it?
- Dataset
- Describe the real dataset that you will use. Include a URL to your dataset. Do not use a dataset that has been used previously in this course. Do not self-plagiarize by repeating your own data analysis from another course or context. The data must be real: neither you nor someone else may make up the data. Do not use a paid or otherwise inaccessible dataset: make it easy for any student or staff member to access the dataset.
- Challenge goals
- Select 2 challenge goals that you are planning to meet. Justify why you think your project will meet each goal. Include the bold goal identifier to make it obvious which one you are talking about. If you would like to meet more than 2 goals, discuss the 2 goals that you’re most passionate about.
- Method
- Outline your analysis process in sufficient detail for someone else to independently reproduce your analysis without needing to make significant design decisions. For each computation, indicate how the result would lead you to a certain conclusion about one or more research questions. Highlight connections to challenge goals. While this should include design details, do not include specific implementation decisions that you would typically include as code comments. Think of this as documentation or a specification for the data analysis program.
- Plan
- Describe your plan to divide the project into at least 3 (but no more than 7) discrete tasks, each with an estimate of the time in hours it will require. Then, describe your workflow for developing code, testing code, and coordinating work, particularly how team members can support each other in case one task is unexpectedly challenging. All team members must contribute equally to the deliverables. Describe your primary development environment. Common options are listed below.
- Ed Workspaces
- Ed Workspaces is the easiest development environment to use since it provides real-time collaboration in the same environment that we’ve been using throughout the course. However, each workspace is limited to 20 MB of file storage and 1 GB of memory, so large datasets and certain libraries may not work.
- JetBrains Datalore
- Datalore is a commercial service offering a powerful online collaboration environment for Jupyter Notebooks. The free tier should be enough for most data analysis and machine learning projects.
- Develop on your computer
- Install Anaconda to develop on your computer. This provides more flexibility compared to an online service but will require collaborating in a separate tool and more self-management of project files and library versions.
Challenge goals
Challenge goals help us to define expectations while still offering flexibility for you to design your own project. Meeting 2 or more challenge goals will likely require writing at least 120 lines of Python code.
- Multiple datasets
- Join or otherwise combine multiple datasets to develop a richer analysis that answers a research question. For example, in class we combined census data with food access data: two datasets that, when considered together, reveal new insights.
- Messy data
- Use a dataset that is not already cleanly presented to you as a CSV. For example, write code to scrape data from the web, use an API to gather data, or use some dataset that requires a lot of preprocessing and cleaning.
- Result validity
- Externally verify the validity of your answers. For example, use statistical tests to check that your results are unlikely to have occurred by chance.
- Machine learning
- Apply machine learning methods to gain insights about the data or make predictions about the future. Be explicit about what your goal is and how you will assess whether you meet it. For example, consider various types of models (and hyperparameters) to identify which model is best, however you define best. Or, consider evaluating an interpretable model (e.g., linear regression) to understand which features are the most informative for how a decision is made.
- External library
- Apply an external library to help with your analysis. Learn a new library that we have not discussed in depth in this course, such as requests, SciPy, Beautiful Soup, spaCy, Altair, or Plotly.
- Other
- If your proposal will be challenging enough on its own but does not meet any of these goals, explain why. This is really a last resort. You should run your idea by us in office hours or the discussion board before submitting the proposal.
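As one illustration of the Result validity goal above, a permutation test is a simple way to ask whether a difference between two groups could plausibly arise by chance. The sketch below uses only the standard library and is an assumption about how such a check might look, not a prescribed method. The `permutation_test` helper, its parameters, and the two groups of numbers are hypothetical placeholders.

```python
import random


def mean(values):
    """Return the arithmetic mean of a non-empty sequence of numbers."""
    return sum(values) / len(values)


def permutation_test(group_a, group_b, trials=10_000, seed=0):
    """Estimate a two-sided p-value for the difference in group means.

    Repeatedly shuffles the pooled values into two groups of the original
    sizes and counts how often the shuffled difference in means is at least
    as large as the observed one.
    """
    rng = random.Random(seed)  # fixed seed so the estimate is reproducible
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(trials):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            extreme += 1
    return extreme / trials


# Made-up placeholder measurements, for illustration only.
group_a = [12.1, 11.8, 12.4, 12.0, 12.3]
group_b = [10.2, 10.5, 9.9, 10.1, 10.4]
p_value = permutation_test(group_a, group_b)
print(f"estimated p-value: {p_value:.4f}")
```

A small estimated p-value suggests the observed difference would be unusual if group labels did not matter. In a proposal, the Method section would state which comparison is tested and what threshold leads to which conclusion.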