CSE442 Data Visualization (Winter 2020)

Assignment 2: Exploratory Data Analysis

In this assignment, you will identify a dataset of interest and perform an exploratory analysis to better understand the shape & structure of the data, investigate initial questions, and develop preliminary insights & hypotheses. Your final submission will take the form of a report consisting of captioned visualizations that convey key insights gained during your analysis.

Step 1: Data Selection

First, you will pick a topic area of interest to you and find a dataset that can provide insights into that topic. To streamline the assignment, we've pre-selected a number of datasets included below for you to choose from.

However, if you would like to investigate a different topic and dataset, you are free to do so. If working with a self-selected dataset, please check with the course staff to ensure it is appropriate for the course. Be advised that data collection and preparation (also known as data wrangling) can be a very tedious and time-consuming process. Be sure to leave yourself sufficient time for exploratory analysis after preparing the data.

After selecting a topic and dataset – but prior to analysis – you should write down an initial set of at least three questions you'd like to investigate.

Step 2: Exploratory Visual Analysis

Next, you will perform an exploratory analysis of your dataset using a visualization tool such as Altair or Tableau. You should consider two different phases of exploration.

In the first phase, you should seek to gain an overview of the shape & structure of your dataset. What variables does the dataset contain? How are they distributed? Are there any notable data quality issues? Are there any surprising relationships among the variables? Be sure to also perform "sanity checks" for patterns you expect to see!
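As a rough illustration of this first pass, the sketch below profiles each column of a small table: how many values are missing, how many distinct values appear, and the numeric range where one exists. The sample data, the column names, and the set of missing-value markers are all assumptions made for the example; only the Python standard library is used.

```python
import csv
import io

# Hypothetical sample standing in for a real dataset file.
SAMPLE_CSV = """station,date,temp_max,precip
SEA,2017-01-01,45,0.10
SEA,2017-01-02,,0.00
PDX,2017-01-01,47,NA
PDX,2017-01-02,140,0.25
"""

MISSING = {"", "NA", "NaN", "null"}  # assumed markers for missing values


def profile(rows):
    """Summarize each column: missing count, distinct count, numeric range."""
    summary = {}
    for column in rows[0].keys():
        values = [row[column] for row in rows]
        present = [v for v in values if v not in MISSING]
        numbers = []
        for v in present:
            try:
                numbers.append(float(v))
            except ValueError:
                pass  # not a number; the column is treated as categorical
        is_numeric = bool(present) and len(numbers) == len(present)
        summary[column] = {
            "missing": len(values) - len(present),
            "distinct": len(set(present)),
            "range": (min(numbers), max(numbers)) if is_numeric else None,
        }
    return summary


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
report = profile(rows)
for column, stats in report.items():
    print(column, stats)
```

In this made-up sample, the maximum of 140 in `temp_max` is the kind of value a sanity check should flag for a closer look.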

In the second phase, you should investigate your initial questions, as well as any new questions that arise during your exploration. For each question, start by creating a visualization that might provide a useful answer. Then refine the visualization (by adding additional variables, changing sorting or axis scales, filtering or subsetting data, etc.) to develop better perspectives, explore unexpected observations, or sanity check your assumptions. You should repeat this process for each of your questions, but feel free to revise your questions or branch off to explore new questions if the data warrants.
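The refine step can be sketched in code. Altair charts are expressed in the Vega-Lite grammar, so the example below writes a Vega-Lite specification directly as a Python dictionary, which keeps it runnable without extra packages. The records, the field names (`gdp`, `life_exp`, `region`), and the filter threshold are hypothetical; treat this as one possible shape for the workflow rather than a required approach.

```python
import copy
import json

# Hypothetical records; in practice these would come from your dataset.
records = [
    {"country": "A", "region": "North", "gdp": 1200, "life_exp": 61},
    {"country": "B", "region": "North", "gdp": 45000, "life_exp": 80},
    {"country": "C", "region": "South", "gdp": 800, "life_exp": 58},
    {"country": "D", "region": "South", "gdp": 9000, "life_exp": 72},
]


def base_chart(data):
    """First pass: a plain scatter plot of two quantitative fields."""
    return {
        "$schema": "https://vega.github.io/schema/vega-lite/v5.json",
        "data": {"values": data},
        "mark": "point",
        "encoding": {
            "x": {"field": "gdp", "type": "quantitative"},
            "y": {"field": "life_exp", "type": "quantitative"},
        },
    }


def refine(spec, min_gdp):
    """Second pass: subset the data, use a log x-axis, and color by region."""
    refined = copy.deepcopy(spec)  # leave the first-pass chart untouched
    refined["transform"] = [{"filter": f"datum.gdp >= {min_gdp}"}]
    refined["encoding"]["x"]["scale"] = {"type": "log"}
    refined["encoding"]["color"] = {"field": "region", "type": "nominal"}
    return refined


first = base_chart(records)
second = refine(first, min_gdp=1000)
print(json.dumps(second["encoding"], indent=2))
```

Keeping the first-pass specification unchanged makes it easy to compare the two views side by side and to branch off in a different direction later.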

Final Deliverable

Your final submission should take the form of a sequence of images – similar to a slide show or comic book – that consists of 8 or more annotated visualizations detailing your most important insights. Your "insights" can include important surprises or issues (such as data quality problems affecting your analysis) as well as responses to your analysis questions. You do not necessarily need to create 8 unique visualizations: you may repeat the same visualization with various annotations highlighting different parts of the data, but your submission must include a minimum of 4 unique views of the data. If you aren't sure what we mean by "annotated visualization," see this page for some examples.

Each image should be a visualization exported from a visualization tool, accompanied by a title and a descriptive annotation highlighting the insight(s) shown in that view. Provide sufficient detail such that anyone could read through your report and understand what you've learned without necessarily being familiar with that dataset. You should annotate your images to draw attention to specific features of the data; you may perform highlighting within the visualization tool itself, or draw annotations on the exported image.

If using Altair, there should be a button in the upper right corner of each visualization that allows you to export it to an image or SVG file. To easily export images from Tableau, use the Worksheet > Export > Image... menu item.

When complete, submit a zip file to Canvas with your visualization images. The image filenames should be numbered in the order that they are meant to be viewed, e.g. 1.png, 2.png, 3.png, .... You should also include a brief write-up (4 paragraphs max.) stating the questions you came up with, explaining your analysis process, and outlining the data transformations that you performed in the course of creating the visualizations. The end of the write-up should include a brief summary of the main lessons learned.

Recommended Data Sources

To get up and running quickly with this assignment, we recommend exploring one of the following provided datasets.

The World Bank Data, 1960-2017

The World Bank has tracked global human development by indicators such as climate change, economy, education, environment, gender equality, health, and science and technology since 1960. We have 20 indicators from the World Bank for you to explore. Alternatively, you can browse the original data by indicators or by countries. Click on an indicator category or country to download the CSV file.

Data: https://github.com/ZeningQu/World-Bank-Data-by-Indicators

Campaign Finance Data for the 2017-18 Congressional Cycle

Congressional elections will be held in November 2018, with political fundraising and spending well underway. The Federal Election Commission (FEC) maintains a database of the financial activity of Political Action Committees (PACs), including funds raised, money disbursed to committees & candidates, and operating expenditures. See the FEC site for a full listing of available data sets.

Note that these data sets are large, with many rows and many columns, and are rich enough to support a variety of explorations. If you choose to explore this data, we strongly encourage you to pick a focus area. Examples include: (1) How do presidential campaigns continue to spend money after the general election? (2) How are the primary Democratic and Republican National Committees spending funds? (3) What are Democratic Party presidential hopefuls doing? (4) What expenditures are occurring across Washington State, and by whom? (5) To what degree is money spent back in politicians' home states, and to what extent does it go to Beltway consultants? Focusing on any one of these topics should be sufficient for this assignment.

Chicago Crimes, 2001-Present

This dataset reflects reported incidents of crime (with the exception of murders where data exists for each victim) that occurred in the City of Chicago from 2001 to present, minus the most recent seven days. Data is extracted from the Chicago Police Department's CLEAR (Citizen Law Enforcement Analysis and Reporting) system.

Data: Chicago Access Page (click Export to download a CSV file)

Daily Weather in the U.S., 2017

This dataset contains daily U.S. weather measurements in 2017, provided by the NOAA Daily Global Historical Climatology Network. This data has been transformed: some weather stations with only sparse measurements have been filtered out. See the accompanying weather.txt for descriptions of each column.

Data: weather.csv.gz (gzipped CSV)
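A gzipped CSV can be read without unpacking it first. The sketch below is a minimal standard-library example: it writes a tiny stand-in file and then streams rows from it. The file name `weather_sample.csv.gz` and the columns `station`, `date`, and `tmax` are hypothetical; consult weather.txt for the actual columns.

```python
import csv
import gzip
import os
import tempfile

# Build a tiny stand-in file so the sketch is runnable; the real
# weather.csv.gz has its own columns, described in weather.txt.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "weather_sample.csv.gz")
with gzip.open(path, "wt", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["station", "date", "tmax"])  # hypothetical column names
    writer.writerow(["S1", "2017-01-01", "7.2"])
    writer.writerow(["S1", "2017-01-02", "8.9"])
    writer.writerow(["S2", "2017-01-01", "-1.1"])


def read_gzipped_csv(file_path, limit=None):
    """Stream rows from a gzipped CSV, optionally stopping after `limit` rows."""
    rows = []
    with gzip.open(file_path, "rt", newline="", encoding="utf-8") as f:
        for i, row in enumerate(csv.DictReader(f)):
            if limit is not None and i >= limit:
                break
            rows.append(row)
    return rows


sample = read_gzipped_csv(path, limit=2)
print(sample)
```

Reading only the first rows with `limit` is a quick way to inspect the columns before loading the full file.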

Yelp Open Dataset

This dataset provides information about businesses, user reviews, and more from Yelp's database. The data is split into separate files (business, checkin, photos, review, tip, and user), and is available in either JSON or SQL format. You might use this to investigate the distributions of scores on Yelp, look at how many reviews users typically leave, or look for regional trends about restaurants. Note that this is a large, structured dataset and you don't need to look at all of the data to answer interesting questions.

In order to download the data you will need to enter your email and agree to Yelp's Dataset License.

Data: Yelp Access Page (data available in JSON & SQL formats)
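Because you don't need all of the data, it can help to read only a sample of a file. The sketch below assumes the JSON files store one record per line and that review records carry a `stars` field; both are assumptions to verify against the dataset's own documentation. The in-memory `SAMPLE` text is a hypothetical stand-in for an opened file.

```python
import io
import json
from collections import Counter

# Hypothetical stand-in for the first few lines of a review file.
SAMPLE = (
    '{"user_id": "u1", "stars": 5}\n'
    '{"user_id": "u2", "stars": 4}\n'
    '{"user_id": "u1", "stars": 5}\n'
    '{"user_id": "u3", "stars": 1}\n'
)


def star_distribution(lines, max_records=None):
    """Count star ratings, reading at most `max_records` records."""
    counts = Counter()
    seen = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if max_records is not None and seen >= max_records:
            break
        record = json.loads(line)
        counts[record["stars"]] += 1
        seen += 1
    return counts


dist = star_distribution(io.StringIO(SAMPLE))
partial = star_distribution(io.StringIO(SAMPLE), max_records=2)
print(sorted(dist.items()))
```

With a real file, you would pass the open file object in place of `io.StringIO(SAMPLE)`, so that records are processed one line at a time rather than loaded all at once.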

Additional Data Sources

If you want to investigate datasets other than those recommended above, here are some possible sources to consider. You are also free to use data from a source different from those included here. If you have any questions on whether your dataset is appropriate, please ask the course staff ASAP!

Visualization Tools

You are free to use one or more visualization tools in this assignment. However, in the interest of time and for a friendlier learning curve, we strongly encourage you to use Altair or Tableau. Tableau provides a graphical interface focused on the task of visual data exploration, whereas Altair provides a Python interface to the high-level visualization grammar Vega-Lite.

Data Wrangling Tools

The data you choose may require reformatting, transformation or cleaning prior to visualization. Here are tools you can use for data preparation. We recommend first trying to import and process your data in the same tool you intend to use for visualization. If that fails, pick the most appropriate option among the tools below. Contact the course staff if you are unsure what might be the best option for your data!

Graphical Tools

  • Tableau - Tableau provides basic facilities for data import, transformation & blending.
  • Trifacta Wrangler - Interactive tool for data transformation & visual profiling.
  • OpenRefine - A free, open source tool for working with messy data.

Programming Tools
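As an illustration of scripted wrangling, the sketch below reshapes a "wide" table (one column per year) into a "long" table (one row per observation), a layout that visualization tools generally handle more easily. The sample table, its column names, and the choice to drop empty cells are assumptions made for the example; only the Python standard library is used.

```python
import csv
import io

# Hypothetical wide-format table: one column per year.
WIDE_CSV = """country,indicator,2015,2016,2017
Country A,Population,100,110,
Country B,Population,200,,230
"""


def wide_to_long(rows, id_columns):
    """Turn each non-empty year cell into its own (ids, year, value) record."""
    long_rows = []
    for row in rows:
        for column, value in row.items():
            if column in id_columns or value == "":
                continue  # skip identifier columns and missing cells
            record = {key: row[key] for key in id_columns}
            record["year"] = int(column)
            record["value"] = float(value)
            long_rows.append(record)
    return long_rows


wide_rows = list(csv.DictReader(io.StringIO(WIDE_CSV)))
long_rows = wide_to_long(wide_rows, id_columns=["country", "indicator"])
for record in long_rows:
    print(record)
```

Dropping empty cells is only one option; depending on your questions, you may prefer to keep them as explicit missing values so that gaps remain visible in your charts.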

Grading Criteria

Each submission will be graded based on both the analysis process and included visualizations. Here are our grading criteria:

  • Poses clear questions applicable to the chosen dataset.
  • Appropriate data quality assessment and transformation.
  • Sufficient breadth of analysis, exploring multiple questions.
  • Sufficient depth of analysis, with appropriate follow-up questions.
  • Expressive & effective visualizations crafted to investigate analysis questions.
  • Clearly written, understandable annotations that communicate primary insights.

Submission Details

This is an individual assignment. You may not work in groups.

When complete, submit a zip file to Canvas with your visualization images. The image filenames should be numbered in the order that they are meant to be viewed, e.g. 1.png, 2.png, 3.png, .... You should also include a brief write-up (4 paragraphs max.) stating the questions you came up with, explaining your analysis process, and outlining the data transformations that you performed in the course of creating the visualizations. The end of the write-up should include a brief summary of the main lessons learned.
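If you prefer to script the packaging, the sketch below shows one way to build such a zip file: it stores the images under numbered names in viewing order and adds the write-up. The helper `package_submission`, the placeholder file names, and the plain-text write-up format are all hypothetical; the assignment does not prescribe a write-up file format.

```python
import os
import tempfile
import zipfile


def package_submission(image_paths, writeup_path, zip_path):
    """Zip images as 1.png, 2.png, ... in viewing order, plus the write-up."""
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for number, image_path in enumerate(image_paths, start=1):
            extension = os.path.splitext(image_path)[1]
            archive.write(image_path, arcname=f"{number}{extension}")
        archive.write(writeup_path, arcname=os.path.basename(writeup_path))
    with zipfile.ZipFile(zip_path) as archive:
        return archive.namelist()


# Demo with placeholder files standing in for exported charts.
work_dir = tempfile.mkdtemp()
ordered = []
for name in ["overview.png", "missing_values.png", "trend.png"]:
    path = os.path.join(work_dir, name)
    with open(path, "wb") as f:
        f.write(b"placeholder")
    ordered.append(path)

writeup = os.path.join(work_dir, "writeup.txt")
with open(writeup, "w", encoding="utf-8") as f:
    f.write("Questions, process, transformations, lessons learned.\n")

names = package_submission(ordered, writeup, os.path.join(work_dir, "submission.zip"))
print(names)
```

The order of `image_paths` determines the numbering, so list the images in the sequence you want them read.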