Assignment 2: Exploratory Data Analysis
In this assignment, you will identify a dataset of interest and perform an initial analysis to better understand the shape & structure of the data, investigate initial questions, and develop preliminary insights & hypotheses. Your final submission will take the form of a report consisting of annotated and/or captioned visualizations that convey key insights gained during your analysis.
Step 1: Data Selection
First, pick a topic area of interest to you and find a dataset that can provide insights into that topic. To streamline the assignment, we've pre-selected a number of datasets included below for you to choose from.
However, if you would like to investigate a different topic and dataset, you are free to do so. If you are working with a self-selected dataset and have doubts about its appropriateness for the course, please check with the course staff. Be advised that data collection and preparation (also known as data wrangling) can be a very tedious and time-consuming process. Be sure you leave sufficient time to conduct exploratory analysis after preparing the data.
After selecting a topic and dataset – but prior to analysis – you should write down an initial set of at least three questions you'd like to investigate.
Step 2: Exploratory Visual Analysis
Next, you will perform an exploratory analysis of your dataset using a visualization tool such as Vega-Lite/Altair or Tableau. You should consider two different phases of exploration.
In the first phase, you should seek to gain an overview of the shape & structure of your dataset. What variables does the dataset contain? How are they distributed? Are there any notable data quality issues? Are there any surprising relationships among the variables? Be sure to perform "sanity checks" for any patterns you expect the data to contain.
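As one possible starting point for this first phase, here is a minimal Python sketch that uses only the standard library. It counts missing values, counts distinct values, and reports the numeric range of each column, then runs one sanity check. The sample data, the column names (station, date, tmax, tmin), and the helper names try_float and profile_columns are all hypothetical; substitute your own dataset and checks.

```python
import csv
import io

# Hypothetical sample data; replace with rows read from your own file.
SAMPLE_CSV = """station,date,tmax,tmin
A,2017-01-01,10.5,2.0
A,2017-01-02,,1.5
B,2017-01-01,30.0,35.0
"""


def try_float(value):
    """Return value as a float, or None if it is not numeric."""
    try:
        return float(value)
    except ValueError:
        return None


def profile_columns(rows):
    """Summarize each column: missing count, distinct count, numeric range."""
    summary = {}
    for column in rows[0].keys():
        values = [row[column] for row in rows]
        present = [v for v in values if v.strip() != ""]
        numbers = [try_float(v) for v in present]
        # Treat a column as numeric only if every non-missing value parses.
        is_numeric = bool(present) and all(n is not None for n in numbers)
        summary[column] = {
            "missing": len(values) - len(present),
            "distinct": len(set(present)),
            "min": min(numbers) if is_numeric else None,
            "max": max(numbers) if is_numeric else None,
        }
    return summary


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
profile = profile_columns(rows)
for column, stats in profile.items():
    print(column, stats)

# Sanity check: a daily minimum should never exceed the daily maximum.
suspicious = [
    r for r in rows
    if r["tmax"] and r["tmin"] and float(r["tmin"]) > float(r["tmax"])
]
print("rows failing the tmin <= tmax check:", len(suspicious))
```

A summary like this is no substitute for plotting the distributions, but it can quickly point you to columns that deserve a closer look.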
In the second phase, you should investigate your initial questions, as well as any new questions that arise during your exploration. For each question, start by creating a visualization that might provide a useful answer. Then refine the visualization (by adding additional variables, changing sorting or axis scales, filtering or subsetting data, etc.) to develop better perspectives, explore unexpected observations, or sanity check your assumptions. You should repeat this process for each of your questions, but feel free to revise your questions or branch off to explore new questions if the data warrants.
Final Deliverable
Your final submission should take the form of a sequence of images – similar to a comic book – that consists of 8 or more visualizations detailing your most important observations.
Your observations can include important surprises or issues (such as data quality problems affecting your analysis) as well as responses to your analysis questions. Where appropriate, we encourage you to include annotated visualizations to guide viewers' attention and provide interpretive context. (If you aren't sure what we mean by "annotated visualization," see this page for some examples.)
Provide sufficient detail such that anyone can read your report and understand what you've learned without already being familiar with the dataset. To help gauge the scope of this assignment, see this example report analyzing motion picture data.
Each image should be a visualization, including any titles or descriptive annotations highlighting the insight(s) shown in that view. For example, annotations could take the form of guidelines and text labels, differential coloring, and/or fading of non-focal elements. You are also free to include a caption for each image, though no more than 2 sentences: be concise! You may create annotations using the visualization tools of your choice, or by adding them using image editing or vector graphics tools.
You must write up your report in a computational notebook format, published online. Examples include Observable notebooks or hosted Jupyter notebooks. Submit the URL of your notebook on the Canvas A2 submission page. For example, to publish using Observable from a private notebook, click the "..." menu button in the upper right and select "Enable link sharing", then copy and submit your notebook URL.
Be sure to enable link sharing if needed (e.g., on Observable), otherwise the course staff will not be able to view your submission!
A few tips:
- To export a Vega-Lite visualization, be sure you are using the "canvas" renderer, right click the image, and select "Save Image As...".
- To export images from Tableau, use the Worksheet > Export > Image... menu item.
- To add an image to an Observable notebook, first add your image as a notebook file attachment: click the "..." menu button and select "File attachments". Then load the image in a new notebook cell:
FileAttachment("your-file-name.png").image()
Potential Data Sources
To get up and running quickly with this assignment, here are some existing data sources.
The World Bank Data, 1960-2017
The World Bank has tracked global human development by indicators such as climate change, economy, education, environment, gender equality, health, and science and technology since 1960. We have 20 indicators from the World Bank for you to explore. Alternatively, you can browse the original data by indicators or by countries. Click on an indicator category or country to download the CSV file.
Data: https://github.com/ZeningQu/World-Bank-Data-by-Indicators
Daily Weather in the U.S., 2017
This dataset contains daily U.S. weather measurements in 2017, provided by the NOAA Daily Global Historical Climatology Network. This data has been transformed: some weather stations with only sparse measurements have been filtered out. See the accompanying weather.txt for descriptions of each column.
Data: weather.csv.gz (gzipped CSV)
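If you would like to peek at a gzipped CSV before loading it into a visualization tool, the following Python sketch may help. It uses only the standard library, and the helper name read_gzipped_csv is hypothetical. The demo writes a tiny stand-in file with made-up column names; the actual columns are described in weather.txt.

```python
import csv
import gzip
import os
import tempfile


def read_gzipped_csv(path, limit=None):
    """Read rows from a gzipped CSV as dictionaries, optionally stopping early."""
    rows = []
    # Mode "rt" decompresses and decodes to text in one step.
    with gzip.open(path, "rt", encoding="utf-8", newline="") as handle:
        for index, row in enumerate(csv.DictReader(handle)):
            if limit is not None and index >= limit:
                break
            rows.append(row)
    return rows


# Demo with a tiny stand-in file; these column names are hypothetical.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.csv.gz")
with gzip.open(demo_path, "wt", encoding="utf-8", newline="") as handle:
    writer = csv.writer(handle)
    writer.writerow(["station", "date", "tmax"])
    writer.writerow(["A", "2017-01-01", "10.5"])
    writer.writerow(["B", "2017-01-01", "30.0"])

demo_rows = read_gzipped_csv(demo_path, limit=1)
print(demo_rows)
```

Passing a small limit lets you inspect the header and the first few rows without reading the whole file into memory.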
Yelp Open Dataset
This dataset provides information about businesses, user reviews, and more from Yelp's database. The data is split into separate files (business, checkin, photos, review, tip, and user), and is available in either JSON or SQL format. You might use this to investigate the distributions of scores on Yelp, look at how many reviews users typically leave, or look for regional trends among restaurants. Note that this is a large, structured dataset and you don't need to look at all of the data to answer interesting questions.
In order to download the data you will need to enter your email and agree to Yelp's Dataset License.
Data: Yelp Access Page (data available in JSON & SQL formats)
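Because you don't need all of the data, one option is to work with a sample. The Python sketch below reads only the first few records of a file, assuming the file stores one JSON object per line; check the files you download to confirm this. The helper name sample_json_lines, the demo file, and the field names (business_id, stars) are hypothetical stand-ins.

```python
import json
import os
import tempfile
from collections import Counter
from itertools import islice


def sample_json_lines(path, limit):
    """Parse records from the first `limit` lines of a one-object-per-line file."""
    with open(path, encoding="utf-8") as handle:
        return [json.loads(line) for line in islice(handle, limit) if line.strip()]


# Demo with a tiny stand-in file; these field names are hypothetical.
demo_path = os.path.join(tempfile.mkdtemp(), "demo_review.json")
with open(demo_path, "w", encoding="utf-8") as handle:
    for stars in [5, 4, 5, 1]:
        handle.write(json.dumps({"business_id": "x", "stars": stars}) + "\n")

records = sample_json_lines(demo_path, limit=3)
star_counts = Counter(record["stars"] for record in records)
print(star_counts)
```

Keep in mind that the first lines of a file are not a random sample, so treat any pattern you find this way as a hypothesis to check against more of the data.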
Additional Data Sources
Here are some other possible sources to consider. You are also free to use data from a source different from those included here. If you have any questions on whether a dataset is appropriate, please ask the course staff ASAP!
- data.seattle.gov - City of Seattle Open Data
- data.wa.gov - State of Washington Open Data
- nwdata.org - Open Data & Civic Tech Resources for the Pacific Northwest
- data.gov - U.S. Government Open Datasets
- U.S. Census Bureau - Census Datasets
- IPUMS.org - Integrated Census & Survey Data from around the World
- Federal Election Commission - Campaign Finance & Expenditures
- Federal Aviation Administration - FAA Data & Research
- fivethirtyeight.com - Data and Code behind the Stories and Interactives
- BuzzFeed News - Open-source data from BuzzFeed's newsroom
- Kaggle Datasets - Datasets for Kaggle contests
- List of datasets useful for course projects - curated by Mike Freeman
Visualization Tools
You are free to use one or more visualization tools in this assignment. However, in the interest of time and for a friendlier learning curve, we strongly encourage you to use Tableau and/or Vega-Lite.
- Tableau - Desktop visual analysis software. Available for both Windows and macOS; register for a free student license.
- Vega-Lite - A high-level grammar of interactive graphics. It provides a concise, declarative JSON syntax to create an expressive range of visualizations for data analysis and presentation.
- R, using the ggplot2 library or with R's built-in plotting functions.
- Jupyter Notebooks (Python), using libraries such as Altair or Matplotlib.
- Voyager - Research prototype from the UW Interactive Data Lab. Voyager combines a Tableau-style interface with visualization recommendations. Use at your own risk!
Data Wrangling Tools
The data you choose may require reformatting, transformation or cleaning prior to visualization. Here are tools you can use for data preparation. We recommend first trying to import and process your data in the same tool you intend to use for visualization. If that fails, pick the most appropriate option among the tools below. Contact the course staff if you are unsure what might be the best option for your data!
Graphical Tools
- Tableau - Tableau provides basic facilities for data import, transformation & blending.
- Trifacta Wrangler - Interactive tool for data transformation & visual profiling.
- OpenRefine - A free, open source tool for working with messy data.
Programming Tools
- Arquero - JavaScript library for wrangling and transforming data tables.
- JavaScript basics for manipulating data in the browser.
- Pandas - Data table and manipulation utilities for Python.
- dplyr - A library for data manipulation in R.
- Or, the programming language and tools of your choice...
Grading Criteria
Each submission will be graded based on both the analysis process and included visualizations. Here are our grading criteria:
- Poses clear questions applicable to the chosen dataset.
- Appropriate data quality assessment and transformation.
- Sufficient breadth of analysis, exploring multiple questions.
- Sufficient depth of analysis, with appropriate follow-up questions.
- Expressive & effective visualizations crafted to investigate analysis questions.
- Clearly written, understandable annotations that communicate primary insights.
Submission Details
This is an individual assignment. You may not work in groups.
Your completed exploratory analysis report is due Monday 4/26, 11:59pm. As described above, your report should take the form of an online notebook. Submit the URL of your notebook (ensure any link sharing is enabled!) on the Canvas A2 page.