CSE512 Data Visualization (Spring 2016)

Assignment 2: Exploratory Data Analysis

A variety of digital tools have been designed to help users visually explore data sets and confirm or disconfirm hypotheses about the data. The task in this assignment is to use an existing visualization tool to formulate and answer a series of specific questions about a data set of your choice. After answering the questions, you should create a final visualization designed to present your answer to others. You should maintain a digital notebook that documents all the questions you asked and the steps you performed from start to finish. The goal of this assignment is not to develop a new visualization tool, but to better understand the process of using visualizations to perform exploratory data analysis.

Here is one way to start.

Step 1. Pick a domain and data set that you are interested in.

  • Peruse the provided data sets below. Choose the one of greatest interest to you. We encourage you to use one of the provided data sets. However, if you would like to explore a different data set, you are free to do so. If you are unsure about your choice, contact the teaching staff.

Step 2. Pose an initial question that you would like to answer.

  • For example: Is there a relationship between melting point and atomic number? Are the brightness and color of stars correlated? Are there different patterns of nucleotides in different regions in human DNA?

Step 3. Assess the fitness of the data for answering your question.

  • Inspect the data; it is invariably helpful to first look at the raw values. Does the data seem appropriate for answering your question? If not, you may need to start the process over. If so, does the data need to be reformatted or cleaned prior to analysis? Perform any steps necessary to get the data into shape prior to visual analysis.
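As a rough illustration of this kind of first-pass inspection, the sketch below uses Python's standard csv module to count rows and flag empty or non-numeric cells. The sample text, the column names, and the summarize helper are all hypothetical stand-ins, not part of any provided data set; a spreadsheet or the data view of your visualization tool would serve the same purpose.

```python
import csv
import io
from collections import Counter

# Hypothetical sample standing in for a downloaded file. The column names
# and values are illustrative only, not taken from any provided data set.
SAMPLE = """title,budget,rating
Movie A,1000000,7.1
Movie B,,6.4
Movie C,250000,N/A
"""


def summarize(text, numeric_columns):
    """Return (row count, empty cells per column, non-numeric cells per column)."""
    rows = list(csv.DictReader(io.StringIO(text)))
    missing = Counter()
    non_numeric = Counter()
    for row in rows:
        for column, value in row.items():
            # Treat absent or whitespace-only cells as missing.
            if value is None or value.strip() == "":
                missing[column] += 1
            elif column in numeric_columns:
                # Flag values that cannot be parsed as numbers (e.g. "N/A").
                try:
                    float(value)
                except ValueError:
                    non_numeric[column] += 1
    return len(rows), dict(missing), dict(non_numeric)


print(summarize(SAMPLE, ["budget", "rating"]))
```

Counts like these are only a starting point. They suggest which columns may need cleaning before visual analysis, and the decisions you make about them belong in your notebook.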

Exploratory Analysis Process

After you have an initial question and a dataset, construct a visualization that provides an answer to your question. As you construct the visualization you will find that your question evolves; often it will become more specific. Keep track of this evolution and the other questions that occur to you along the way. Once you have answered all the questions to your satisfaction, think of a way to present the data and the answers as clearly as possible. In this assignment, you should use existing visualization software tools. You may find it beneficial to use more than one tool.

Before starting, write down the initial question clearly. As you go, maintain a digital notebook of what you had to do to construct the visualizations and how the questions evolved. Record in the notebook which data set you chose, and describe any transformations or rearrangements of the dataset that you needed to perform. In particular, describe how you got the data into the format needed by the visualization system. Keep copies of any intermediate visualizations that helped you refine your question. In the end, you should produce a PDF document for sharing your notebook.

After you have constructed the final visualization for presenting your answer, write a caption and a paragraph describing the visualization, and how it answers the question you posed. Think of the figure, the caption and the text as material you might include in a research paper.

Data Sets

We have provided the following data sets and encourage you to use one of them in order to get started quickly and therefore have more time to explore the data and develop your analysis questions. That said, you are welcome to use a different data set if you prefer; just be sure to first confirm with the course staff.

Campaign Finance Data

This data set consists of contribution data from the Federal Election Commission (FEC) during the 2015-16 election campaign. The data consists of multiple tables, including records for political action committees, candidates, individual donations, and campaign expenditures. You should focus on one aspect of this data (e.g., expenditures by presidential candidates, or patterns of donations by region and political party). In any case, you will likely need to join multiple tables together and filter them down to the relevant records. Tableau provides built-in facilities for combining multiple data sources, as does the free Trifacta Wrangler tool.

The data sets can be downloaded directly from the FEC in a pipe-delimited format. For access to the data, plus descriptions of the data schema, see the FEC disclosure website.
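If you prefer to do the join in code rather than in Tableau or Trifacta Wrangler, the following is a minimal sketch of reading pipe-delimited text and joining two tables on a shared key using only the Python standard library. The miniature tables, the field names (CAND_ID, PARTY, AMOUNT), and both helper functions are hypothetical placeholders, not the actual FEC schema. Consult the schema descriptions for the real column names, and check whether your downloaded files include a header row.

```python
import csv
import io
from collections import defaultdict

# Hypothetical miniature tables. Field names and values are illustrative
# placeholders, not the actual FEC schema.
CANDIDATES = """CAND_ID|NAME|PARTY
C1|Candidate One|AAA
C2|Candidate Two|BBB
"""

CONTRIBUTIONS = """CAND_ID|AMOUNT
C1|100
C2|250
C1|50
C9|75
"""


def read_pipe_delimited(text):
    """Parse pipe-delimited text whose first line is a header row."""
    return list(csv.DictReader(io.StringIO(text), delimiter="|"))


def total_by_party(candidates, contributions):
    """Join contributions to candidates on CAND_ID and sum amounts per party."""
    party_of = {row["CAND_ID"]: row["PARTY"] for row in candidates}
    totals = defaultdict(float)
    unmatched = 0
    for row in contributions:
        party = party_of.get(row["CAND_ID"])
        if party is None:
            # Count records that have no matching candidate instead of
            # dropping them silently.
            unmatched += 1
            continue
        totals[party] += float(row["AMOUNT"])
    return dict(totals), unmatched


candidates = read_pipe_delimited(CANDIDATES)
contributions = read_pipe_delimited(CONTRIBUTIONS)
print(total_by_party(candidates, contributions))
```

Tracking the number of unmatched records is a deliberate choice here: a join that silently discards rows can distort the totals you later visualize, so the count is worth noting in your notebook.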

Movie Data

This dataset contains some important statistics from a large sample of movies. The data includes the movie budget and revenue from different sources as well as ratings from RottenTomatoes, The Numbers and IMDB.

Download: csv file

Sources: RottenTomatoes, The Numbers and IMDB.

Flight Data

This data set contains FAA data describing every commercial flight during the month of December 2009. For detailed descriptions of each data column in the attached file, please see www.transtats.bts.gov. You are encouraged to download your own, more recent version of the data (which might include columns or time spans that were left out of this dataset) directly from www.transtats.bts.gov.

Download: zipped csv file

Source: www.transtats.bts.gov

Other Sources

See Resources page for more datasets.

Visualization Software

To create the visualizations, you can use a tool of your choice. However, it must be a tool that supports rapid construction of visualizations so that you get an authentic experience of interactive visual analysis. One goal of this assignment is for you to learn to use and evaluate the effectiveness of rapid visualization tools.

The most common option is to use Tableau, a commercial database visualization tool that supports many different ways to interact with the data. Tableau supports both Windows and Mac OS, and is freely available to students through an academic license. To download the software, please see http://www.tableau.com/academic/students.

You are free to use other visualization tools as you see fit. For example, the R language has a number of facilities for manipulating data and creating exploratory graphics (e.g., using the popular ggplot2 package). Another option is to use the graphing facilities available in the IPython Notebook, including but not limited to matplotlib.

Grading Criteria

Each submission will be graded based on the analysis process and final visualization. Here are our grading criteria:

Analysis Process

  • Clear questions applicable to the chosen data set
  • Appropriate data diagnostics and transformation
  • Sufficient breadth of analysis, exploring multiple questions
  • Sufficient depth of analysis, with appropriate follow-up questions
  • Clear explanation of data exploration process

Final Visualization

  • Image answers the chosen question in a compelling manner
  • Visualization can function as a "stand alone" figure
  • Expressive and effective visualization, good choice of visual encodings
  • Appropriate caption, labels and description

Submission Details

This is an individual assignment. You may not work in groups. Your completed assignment is due on Friday 4/15, by 5pm.

Please submit the following files to Canvas in a zip archive:

  • A text file (.txt) with this template. Write your name, uwnetid, email, dataset, and caption in the YAML front matter. A paragraph describing your visualization should be in the content section. Please name the file a2-uwnetid.txt
  • A copy of your visualization in a standard image file format (JPG/PNG) – please name the file a2-uwnetid.jpg/png
  • An exploration notebook in PDF format. You can use any software to generate the PDF notebook file. Please name the file a2-uwnetid.pdf