Assignment 2: Exploratory Data Analysis
In this assignment, you will identify a dataset of interest and perform an exploratory analysis to better understand the shape & structure of the data, investigate initial questions, and develop preliminary insights & hypotheses. Your final submission will take the form of a report consisting of captioned visualizations that convey key insights gained during your analysis.
Step 1: Data Selection
First, you will pick a topic area of interest to you and find a dataset that can provide insights into that topic. To streamline the assignment, we've included some datasets below for you to choose from.
However, if you would like to investigate a different topic and dataset, you are encouraged to do so. If you are working with a self-selected dataset and have any concerns about its appropriateness for the course, please check with the course staff.
Be advised that data collection and preparation (also known as data wrangling) can be a very tedious and time-consuming process. Be sure to leave yourself sufficient time to conduct exploratory analysis after preparing the data.
After selecting a topic and dataset – but prior to analysis – you should write down an initial set of at least three questions you'd like to investigate.
Step 2: Exploratory Visual Analysis
Next, you will perform an exploratory analysis of your dataset using a visualization tool (or tools) of your choice. Options include Altair (Python), Vega-Lite (JS), ggplot2 (R), or Tableau (GUI). You should consider two different phases of exploration.
In the first phase, you should seek to gain an overview of the shape & structure of your dataset. What variables does the dataset contain? How are they distributed? Are there any notable data quality issues? Are there any surprising relationships among the variables? Be sure to also perform "sanity checks" for patterns you expect to see!
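As a rough illustration of a first-phase overview, the sketch below profiles each column of a small CSV: how many values are missing, how many distinct values appear, and (for numeric columns) the range and mean. The sample data and column names are invented for illustration, and the `profile` helper is hypothetical; the tool you choose for visualization may offer equivalent summaries.

```python
import csv
import io
import statistics

# Hypothetical sample standing in for a real dataset file.
SAMPLE_CSV = """station,date,tmax,precip
A,2017-01-01,5.0,0.0
A,2017-01-02,,1.2
B,2017-01-01,7.5,0.0
B,2017-01-02,6.0,
"""

def profile(rows):
    """Summarize each column: missing count, distinct count, numeric range."""
    summary = {}
    for col in rows[0].keys():
        values = [r[col] for r in rows]
        present = [v for v in values if v not in ("", None)]
        info = {
            "missing": len(values) - len(present),
            "distinct": len(set(present)),
        }
        # Treat the column as numeric only if every present value parses.
        try:
            nums = [float(v) for v in present]
        except ValueError:
            nums = []
        if nums:
            info.update(min=min(nums), max=max(nums), mean=statistics.mean(nums))
        summary[col] = info
    return summary

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
report = profile(rows)
for col, info in report.items():
    print(col, info)
```

A column with many missing values, or a numeric range that looks implausible, is a natural candidate for one of the data quality visualizations in your report.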
In the second phase, you should investigate your initial questions, as well as any new questions that arise during your exploration. For each question, start by creating a visualization that might provide a useful answer. Then refine the visualization (by adding additional variables, changing sorting or axis scales, filtering or subsetting data, etc.) to develop better perspectives, explore unexpected observations, or sanity check your assumptions. You should repeat this process for each of your questions, and also feel free to revise your questions or branch off to explore new questions if the data warrants.
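The refinement loop described above can be sketched with Vega-Lite specifications written as Python dictionaries: start from a simple view, then derive a refined copy that filters the data, adds a variable, and changes the axis scales. The records and field names (`budget`, `gross`, `genre`, `year`) are invented for illustration, and `refine` is a hypothetical helper rather than part of any library.

```python
import copy
import json

# Hypothetical records; field names and values are illustrative only.
movies = [
    {"genre": "Drama", "year": 1995, "budget": 10.0, "gross": 12.0},
    {"genre": "Action", "year": 2004, "budget": 80.0, "gross": 210.0},
    {"genre": "Comedy", "year": 2010, "budget": 25.0, "gross": 60.0},
]

# First attempt: a plain scatterplot of budget against gross.
base_spec = {
    "$schema": "https://vega.github.io/schema/vega-lite/v5.json",
    "data": {"values": movies},
    "mark": "point",
    "encoding": {
        "x": {"field": "budget", "type": "quantitative"},
        "y": {"field": "gross", "type": "quantitative"},
    },
}

def refine(spec, min_year):
    """Return a refined copy: filter by year, color by genre, use log scales."""
    refined = copy.deepcopy(spec)
    # Subset the data to recent records.
    refined["transform"] = [{"filter": f"datum.year >= {min_year}"}]
    # Add a variable by mapping genre to color.
    refined["encoding"]["color"] = {"field": "genre", "type": "nominal"}
    # Change both axes to log scales to spread out skewed values.
    for channel in ("x", "y"):
        refined["encoding"][channel]["scale"] = {"type": "log"}
    return refined

refined_spec = refine(base_spec, 2000)
print(json.dumps(refined_spec["encoding"], indent=2))
```

Keeping the original specification untouched makes it easy to compare the before and after views, which often end up as separate milestones in the final report.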
Final Deliverable
Your final submission should take the form of a report – similar to a slide show or comic book – that consists of 10 or more captioned visualizations detailing your most important insights. Your "insights" can include important surprises or issues (such as data quality problems affecting your analysis) as well as responses to your analysis questions. To help you gauge the scope of this assignment, see this example report analyzing data about motion pictures.
Each visualization image should be a screenshot exported from a visualization tool, accompanied by a title and descriptive caption (1-4 sentences long) describing the insight(s) learned from that view. Provide sufficient detail for each caption such that anyone could read through your report and understand what you've learned. You are free, but not required, to annotate your images to draw attention to specific features of the data. You may perform highlighting within the visualization tool itself, or draw annotations on an exported image.
If you are using Tableau or other graphical construction tools, you can assemble your report document by hand. To easily export images from Tableau, use the Worksheet > Export > Image... menu item. We've provided a basic HTML template for you to fill in. You will need to edit the HTML to add your captions and links to image files. Exported image files should reside in the same local directory as your HTML file. You can then put all the files in a zip archive for submission.
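If you prefer to script the final packaging step, a minimal sketch along these lines collects the HTML file and exported images from one directory into a zip archive. The `bundle_report` helper and the list of accepted file extensions are assumptions made for illustration; any zip utility works equally well.

```python
import zipfile
from pathlib import Path

# Assumed set of file types that belong in the report; adjust as needed.
REPORT_SUFFIXES = {".html", ".png", ".jpg", ".jpeg", ".gif", ".svg"}

def bundle_report(report_dir, archive_path):
    """Zip the HTML and image files found directly inside report_dir.

    Returns the names of the files that were added, in sorted order.
    """
    report_dir = Path(report_dir)
    included = []
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(report_dir.iterdir()):
            if path.is_file() and path.suffix.lower() in REPORT_SUFFIXES:
                # Store files at the archive root so relative image links still work.
                zf.write(path, arcname=path.name)
                included.append(path.name)
    return included
```

Calling `bundle_report("my_report", "my_report.zip")` (both names hypothetical) would return the list of bundled files, which is a quick way to confirm that every image referenced in the HTML made it into the archive.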
Another option is to write your report in a computational notebook format. Options here include submitting the output of a Jupyter notebook (e.g., if using Altair, matplotlib, etc. in Python), the output of an R Markdown document (e.g., if using ggplot2 in R), or a link to an Observable notebook (e.g., if using Vega-Lite in JavaScript). If sharing a link to an Observable notebook, note that you do not need to publish your notebook to the world: you can enable link sharing by clicking the "..." menu in the upper-right and then clicking "Share link".
Do not submit a report cluttered with every little thing you tried. Submit a clean report that highlights the most important "milestones" in your exploration, which can include initial overviews, identification of data quality problems, confirmations of key assumptions, and potential "discoveries".
Finally, the end of your report should include a brief summary of main lessons learned. When complete, upload your report file(s) or submit a link to Canvas.
Recommended Data Sources
To get up and running quickly with this assignment, you might try exploring one of the following datasets:
Chicago Crimes, 2001-Present
This dataset reflects reported incidents of crime (with the exception of murders where data exists for each victim) that occurred in the City of Chicago from 2001 to present, minus the most recent seven days. Data is extracted from the Chicago Police Department's CLEAR (Citizen Law Enforcement Analysis and Reporting) system.
Data: Chicago Access Page (click Export to download a CSV file)
Daily Weather in the U.S., 2017
This dataset contains daily U.S. weather measurements in 2017, provided by the NOAA Daily Global Historical Climatology Network. This data has been transformed: some weather stations with only sparse measurements have been filtered out. See the accompanying weather.txt for descriptions of each column.
Data: weather.csv.gz (gzipped CSV)
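A gzipped CSV can usually be read without decompressing it to disk first. The sketch below uses Python's standard `gzip` and `csv` modules; the `read_gzipped_csv` helper is hypothetical, and it assumes the file is UTF-8 encoded with a header row. Consult weather.txt for the actual column names.

```python
import csv
import gzip

def read_gzipped_csv(path, limit=None):
    """Read rows from a gzipped CSV as dicts, optionally stopping after `limit` rows."""
    rows = []
    # Mode "rt" decompresses and decodes to text on the fly.
    with gzip.open(path, mode="rt", newline="", encoding="utf-8") as f:
        for i, row in enumerate(csv.DictReader(f)):
            if limit is not None and i >= limit:
                break
            rows.append(row)
    return rows
```

Passing a small `limit` is a cheap way to inspect the columns before loading the full file. Note that every value arrives as a string, so numeric fields still need converting.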
Yelp Open Dataset
This dataset provides information about businesses, user reviews, and more from Yelp's database. The data is split into separate files (business, checkin, photos, review, tip, and user), and is available in either JSON or SQL format. You might use this to investigate the distributions of scores on Yelp, look at how many reviews users typically leave, or look for regional trends about restaurants. Note that this is a large, structured dataset and you don't need to look at all of the data to answer interesting questions.
In order to download the data you will need to enter your email and agree to Yelp's Dataset License.
Data: Yelp Access Page (data available in JSON & SQL formats)
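Because the dataset is large, it can help to start with a subset. The sketch below assumes each JSON file stores one JSON object per line and that review records carry a `stars` field; both are assumptions you should verify against the files you download. The helper names are hypothetical.

```python
import itertools
import json
from collections import Counter

def read_json_lines(path, limit):
    """Parse records from the first `limit` lines of a one-object-per-line JSON file."""
    with open(path, encoding="utf-8") as f:
        # islice stops reading after `limit` lines, so the rest of the file is never loaded.
        return [json.loads(line) for line in itertools.islice(f, limit) if line.strip()]

def star_distribution(records):
    """Count how many records carry each star rating (assumed field name: 'stars')."""
    return Counter(r["stars"] for r in records if "stars" in r)
```

Keep in mind that the first lines of a file are not a random sample; if the records are ordered by date or location, an overview built from them may be skewed.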
Additional Data Sources
If you want to investigate datasets other than those recommended above, here are some possible sources to consider. You are also free to use data from a source different from those included here. If you have any questions on whether your dataset is appropriate, please ask the course staff ASAP!
- data.seattle.gov - City of Seattle Open Data
- data.wa.gov - State of Washington Open Data
- nwdata.org - Open Data & Civic Tech Resources for the Pacific Northwest
- data.gov - U.S. Government Open Datasets
- U.S. Census Bureau - Census Datasets
- IPUMS.org - Integrated Census & Survey Data from around the World
- Federal Elections Commission - Campaign Finance & Expenditures
- Federal Aviation Administration - FAA Data & Research
- fivethirtyeight.com - Data and Code behind the Stories and Interactives
- Buzzfeed News
- Socrata Open Data
- 17 places to find datasets for data science projects
Visualization Tools
You are free to use one or more visualization tools in this assignment. If you are unsure, in the interest of time and for a friendlier learning curve, we encourage you to use Tableau. Tableau provides a graphical interface focused on the task of visual data exploration. You will (in most cases) be able to complete an initial data exploration more quickly and comprehensively than with a programming-based tool.
- Tableau - Desktop visual analysis software. Available for both Windows and macOS; register for a free student license.
- Jupyter Notebooks (Python), using libraries such as Altair or Matplotlib.
- R, using the ggplot2 library or with R's built-in plotting functions.
- Observable Notebooks (JavaScript), using libraries such as Vega-Lite.
Data Wrangling Tools
The data you choose may require reformatting, transformation or cleaning prior to visualization. Here are tools you can use for data preparation. We recommend first trying to import and process your data in the same tool you intend to use for visualization. If that fails, pick the most appropriate option among the tools below. Contact the course staff if you are unsure what might be the best option for your data!
Graphical Tools
- Tableau - Tableau provides basic facilities for data import, transformation & blending.
- Trifacta Wrangler - Interactive tool for data transformation & visual profiling.
- OpenRefine - A free, open source tool for working with messy data.
Programming Tools
- JavaScript data utilities.
- Pandas - Data frame and manipulation utilities for Python.
- dplyr - A library for data manipulation in R.
- Or, the programming language and tools of your choice...
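To give a sense of what cleaning often involves, here is a minimal sketch in plain Python that trims whitespace, maps common missing-value tokens to `None`, and parses a date and a number. The field names `date` and `value`, the date format, and the list of missing-value tokens are all assumptions for illustration; libraries such as Pandas and dplyr provide their own facilities for the same tasks.

```python
from datetime import datetime

# Assumed tokens that mean "no value"; adjust to match your data.
MISSING_TOKENS = {"", "NA", "N/A", "null", "-"}

def clean_record(raw):
    """Trim whitespace, map missing tokens to None, and parse date/number fields."""
    # Strip stray whitespace from both field names and string values.
    rec = {k.strip(): (v.strip() if isinstance(v, str) else v) for k, v in raw.items()}
    # Normalize the various spellings of "missing" to None.
    rec = {k: (None if v in MISSING_TOKENS else v) for k, v in rec.items()}
    # Hypothetical 'date' field in YYYY-MM-DD format.
    if rec.get("date") is not None:
        rec["date"] = datetime.strptime(rec["date"], "%Y-%m-%d").date()
    # Hypothetical 'value' field that may contain thousands separators.
    if rec.get("value") is not None:
        rec["value"] = float(rec["value"].replace(",", ""))
    return rec

cleaned = clean_record({" date ": "2017-03-01", "value": " 1,234.5 ", "note": "N/A"})
print(cleaned)
```

Applying a function like this to every row before plotting keeps the cleaning rules in one place, which makes it easier to describe them in your report.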
Grading Criteria
Each submission will be graded based on both the analysis process and included visualizations. Here are our grading criteria:
- Poses clear questions applicable to the chosen dataset.
- Appropriate data quality assessment and transformation.
- Sufficient breadth of analysis, exploring multiple questions.
- Sufficient depth of analysis, with appropriate follow-up questions.
- Expressive & effective visualizations crafted to investigate analysis questions.
- Clearly written, understandable captions that communicate primary insights.
Submission Details
This is an individual assignment. You may not work in groups.
Your completed exploratory analysis report must be submitted to Canvas by Monday 4/22, 11:59pm. If using a notebook format, please submit either an exported PDF or a URL link (for Observable or Colab). If you are not using a notebook format, you should fill out the provided HTML template and upload a zip file containing your completed HTML file and visualization screenshots.