Task: Find 2 datasets that you would be interested in exploring, and submit a brief writeup about each.
Overview
Let’s start thinking about the final project! As part of your preparation, we would like you to start searching for data and thinking about the kinds of research questions you’d like to answer.
Sources of Data
One good approach is to start with a problem that interests you and then look for data; Google will be your friend for finding a dataset. An equally valid approach is to start with a dataset that looks interesting and design questions around that data. Here are some possible data sources, but many more exist:
- Kaggle is a great place to get started, with datasets organized by categories and keywords.
- Responsible Datasets in Context hosts some datasets with historical and social contexts
- A variety of data sets are available from UW Libraries
- Awesome Public Datasets - large variety of maintained data sets
- Baron Schwartz’s list of datasets. Some of these are themselves rich lists of datasets, such as the Amazon AWS public data sets.
- Data.gov for U.S. open government data, data.wa.gov for Washington state open government data, and data.seattle.gov for Seattle open government data
- SQLShare: public scientific datasets. Some require considerable knowledge to interpret, others are easier to understand. You can select “All datasets” and then filter by keyword, or you can select a tag from among those in the left column.
- Reddit Data Sets
- Civic Data Sets for the Pacific Northwest
- An archive of datasets distributed with the R statistical language
- 30 Places to Find Open Data on the Web (Visual.ly)
- Office for National Statistics (UK) - a repository of detailed statistics about Great Britain and Northern Ireland
- World Bank Data Catalog
- CDC NCHS Data - CDC’s National Center for Health Statistics Data Access
- Machine Learning Repository - large variety of maintained data sets
Requirements
For each dataset that you find, submit a short answer (at least 3-4 sentences) to each of the following prompts:
- Briefly describe your dataset. What is it about? Who collected it? Where did you find it? Provide a link or attachment.
- What makes this dataset interesting to you?
- Write and describe at least two potential research questions you might want to explore using this dataset. It’s OK if you don’t know how to answer these questions yet!
- Describe at least two potential limitations or biases in this dataset.
Make sure to answer all four questions for each dataset that you find!
Submissions and Grading
You will be graded on the quality and thoughtfulness of your responses, so make sure you are giving adequate time to each question.
No resubmissions or late work will be accepted, since this assignment is a project component. Make sure that you are managing your time wisely!
Submit your work on Gradescope by 7 July 2025, 11:59pm PST. Make sure that you submit a PDF that contains your answers to all four questions for each dataset. We will not be able to grade files that are not PDFs.