Assignment 2: Exploratory Data Analysis
A variety of digital tools have been designed to help users visually explore data sets and confirm or disconfirm hypotheses about the data. The task in this assignment is to use an existing visualization tool to formulate and answer a series of specific questions about a data set of your choice. After answering the questions you should create a final visualization that is designed to present the answer to your question to others. You should maintain a digital notebook that documents all the questions you asked and the steps you performed from start to finish. The goal of this assignment is not to develop a new visualization tool, but to understand better the process of using visualizations to perform exploratory data analysis.
Here is one way to start.
Step 1. Pick a domain and data set that you are interested in.
- Peruse the provided data sets below. Choose the one of greatest interest to you. We encourage you to use one of the provided data sets. However, if you would like to explore a different data set, you are free to do so. If you are unsure about your choice, contact the teaching staff.
Step 2. Pose an initial question that you would like to answer.
- For example: Is there a relationship between melting point and atomic number? Are the brightness and color of stars correlated? Are there different patterns of nucleotides in different regions in human DNA?
Step 3. Assess the fitness of the data for answering your question.
- Inspect the data; it is invariably helpful to look at the raw values first. Does the data seem appropriate for answering your question? If not, you may need to start the process over. If so, does the data need to be reformatted or cleaned prior to analysis? Perform any steps necessary to get the data into shape prior to visual analysis.
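As a minimal sketch of the inspection in Step 3, the following Python uses only the standard library to show the first raw rows of a CSV and count empty values per column. The sample text and its column names (title, budget, rating) are hypothetical stand-ins, not taken from any of the provided files; in practice you would open your downloaded file instead.

```python
import csv
import io
from collections import Counter

# Hypothetical sample data; replace with the contents of your own CSV file.
SAMPLE = """title,budget,rating
Movie A,1000000,7.1
Movie B,,6.4
Movie C,2500000,
"""

def summarize(csv_text, preview=2):
    """Return (first `preview` rows, number of empty values per column)."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    missing = Counter()
    for row in rows:
        for column, value in row.items():
            # Treat absent or whitespace-only cells as missing.
            if value is None or value.strip() == "":
                missing[column] += 1
    return rows[:preview], dict(missing)

head, missing = summarize(SAMPLE)
for row in head:
    print(row)
print("missing values per column:", missing)
```

A count like this can help decide whether a column is complete enough to support your question before you load the data into a visualization tool.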
Exploratory Analysis Process
After you have an initial question and a dataset, construct a visualization that provides an answer to your question. As you construct the visualization you will find that your question evolves - often it will become more specific. Keep track of this evolution and the other questions that occur to you along the way. Once you have answered all the questions to your satisfaction, think of a way to present the data and the answers as clearly as possible. In this assignment, you should use existing visualization software tools. You may find it beneficial to use more than one tool.
Before starting, write down the initial question clearly. As you go, maintain a digital notebook of what you had to do to construct the visualizations and how the questions evolved. In the notebook, state which data set you chose and describe any transformations or rearrangements of the data set that you needed to perform. In particular, describe how you got the data into the format needed by the visualization system. Keep copies of any intermediate visualizations that helped you refine your question. You can use any tool (e.g., MS Word, iWork Pages) to create the final notebook in PDF format.
After you have constructed the final visualization for presenting your answer, write a caption and a paragraph describing the visualization, and how it answers the question you posed. Think of the figure, the caption and the text as material you might include in a research paper.
Data Sets
We have provided the following data sets and encourage you to use one of them in order to get started quickly and therefore have more time to explore the data and develop your analysis questions. That said, you are welcome to use a different data set if you prefer; just be sure to first confirm with the course staff.
Campaign Finance Data
This data set consists of contributions data from the Federal Election Commission (FEC) during the 2012 election campaign. We have compiled two versions of this data set. The small set covers 2011 - Oct 2012, which is ~200K individual contributions. The large set covers 2005 - Oct 2012, which is ~1M contributions and spans several congressional election finance cycles and ~2 presidential cycles. Both small and large versions are available as either CSV (comma separated values) or TDE (Tableau data engine) files. Each file has been compressed using zip.
Download: (Small Version) csv, tde; (Large Version) csv, tde
Info about how to translate various codes throughout the data can be found in these schema files for: candidates, committees, and contributions. The overall shape of the data starts with the contributions schema, with data joined onto each row from the candidates and committees schemas.
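To illustrate the shape described above, here is a small sketch of contribution rows with candidate and committee fields joined onto each one, followed by a per-candidate total. The provided files are already joined, so this is only a picture of the structure. Every field name and value here (cand_id, cmte_id, amount, cand_name, cmte_name) is a hypothetical placeholder; consult the schema files for the real column names and codes.

```python
# All field names and values are hypothetical placeholders for illustration.
contributions = [
    {"cmte_id": "C1", "cand_id": "P1", "amount": 250},
    {"cmte_id": "C2", "cand_id": "P2", "amount": 1000},
    {"cmte_id": "C1", "cand_id": "P1", "amount": 500},
]
candidates = {
    "P1": {"cand_name": "Candidate One"},
    "P2": {"cand_name": "Candidate Two"},
}
committees = {
    "C1": {"cmte_name": "Committee One"},
    "C2": {"cmte_name": "Committee Two"},
}

def join_rows(contributions, candidates, committees):
    """Copy each contribution and attach its candidate and committee fields."""
    joined = []
    for contribution in contributions:
        row = dict(contribution)
        row.update(candidates.get(contribution["cand_id"], {}))
        row.update(committees.get(contribution["cmte_id"], {}))
        joined.append(row)
    return joined

def total_by(rows, key):
    """Sum the contribution amounts grouped by the given field."""
    totals = {}
    for row in rows:
        totals[row[key]] = totals.get(row[key], 0) + row["amount"]
    return totals

rows = join_rows(contributions, candidates, committees)
totals = total_by(rows, "cand_name")
print(totals)
```

Because each joined row carries its own candidate and committee attributes, grouping by any of those attributes needs no further lookups.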
Movie Data
This dataset contains some important statistics from a large sample of movies. The data includes the movie budget and revenue from different sources as well as ratings from RottenTomatoes, The Numbers and IMDB.
Download: csv file
Sources: RottenTomatoes, The Numbers and IMDB.
Flight Data
FAA data describing every commercial flight during the month of December 2009. For detailed descriptions of each data column in the attached file please see www.transtats.bts.gov. You are also welcome to download your own version of the file (which might include columns or time spans that were left out from this dataset) directly from www.transtats.bts.gov.
Download: zipped csv file
Source: www.transtats.bts.gov
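Since several of the provided files are distributed as zipped CSVs, it can be convenient to peek at the first few rows without unpacking the archive. The sketch below does this with Python's standard zipfile and csv modules. It assumes the archive holds at least one .csv member encoded as UTF-8. The in-memory archive and its column names (carrier, delay) are hypothetical and stand in for a downloaded file.

```python
import csv
import io
import itertools
import zipfile

def read_zipped_csv(zip_source, limit=5):
    """Return up to `limit` rows (as dicts) from the first CSV in a zip archive.

    `zip_source` may be a file path or a binary file-like object.
    """
    with zipfile.ZipFile(zip_source) as archive:
        csv_names = [n for n in archive.namelist() if n.lower().endswith(".csv")]
        if not csv_names:
            raise ValueError("no CSV file found in archive")
        with archive.open(csv_names[0]) as raw:
            # Assumption: the file is UTF-8 encoded; adjust if it is not.
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            return list(itertools.islice(csv.DictReader(text), limit))

# Build a tiny in-memory archive as a stand-in for a downloaded file.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("flights.csv", "carrier,delay\nAA,12\nUA,-3\n")
buffer.seek(0)

rows = read_zipped_csv(buffer, limit=1)
print(rows)
```

Reading only a handful of rows keeps the check fast even for the larger files.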
Other Sources
See Resources page for more datasets.
Visualization Software
To create the visualizations, we recommend using Tableau, a commercial database visualization tool that supports many different ways to interact with the data. Tableau has given us classroom licenses so that you can install the software on your own computer. (See Download Instructions.) You can also get your own free student license for longer term usage.
One goal of this assignment is for you to learn to use and evaluate the effectiveness of tools such as Tableau. In addition to (or in lieu of) Tableau, you are also free to use other visualization tools as you see fit. For example, the R language has a number of facilities for manipulating data and creating exploratory graphics (e.g., using the ggplot2 package).
Note that Tableau requires Microsoft Windows. Fortunately, CSE provides free licenses for both Windows and VMware Fusion. See CSE Software Page for instructions. For non-CSE students, UW UWare also provides free Windows licenses. VMware Player (for Linux and Windows) is free at VMware's website. Mac users can try VMware Fusion for 30 days. VirtualBox is a free alternative. We will post an update if CSE can provide VMware licenses for non-CSE students.
Grading Criteria
Each submission will be graded based on the analysis process and final visualization. Here are our grading criteria:
Analysis Process
- Clear questions applicable to the chosen data set
- Appropriate data diagnostics and transformation
- Sufficient breadth of analysis, exploring multiple questions
- Sufficient depth of analysis, with appropriate follow up questions
- Clear explanation of data exploration process
Final Visualization
- Image answers the chosen question in a compelling manner
- Visualization can function as a "stand alone" figure
- Expressive and effective visualization, good choice of visual encodings
- Appropriate caption, labels and description
Submission Details
This is an individual assignment. You may not work in groups. Your completed assignment is due on Monday 1/27, by 5pm.
Please submit the following files via Catalyst Dropbox:
- A text file (.txt) with this template. Write your name, uwnetid, email, dataset, and caption in the YAML front matter. A paragraph describing your visualization should go in the content section. Please name the file a2-uwnetid.txt.
- A copy of your visualization in a standard image file format (JPG/PNG) – please name the file a2-uwnetid.jpg/png
- An exploration notebook in PDF format. You can use any software to generate the PDF notebook file. Please name the file a2-uwnetid.pdf.
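Before uploading, you may want to double-check the file names listed above. The following is an optional sketch, not part of the official submission process, that reports which of the three required parts are missing from a list of file names. The uwnetid "jdoe" is a hypothetical placeholder.

```python
def expected_files(uwnetid):
    """Map each required part to the file names that would satisfy it."""
    return {
        "description": [f"a2-{uwnetid}.txt"],
        "image": [f"a2-{uwnetid}.jpg", f"a2-{uwnetid}.png"],
        "notebook": [f"a2-{uwnetid}.pdf"],
    }

def missing_parts(uwnetid, filenames):
    """Return the required parts for which no acceptable file is present."""
    present = set(filenames)
    return [
        part
        for part, options in expected_files(uwnetid).items()
        if not any(name in present for name in options)
    ]

# "jdoe" is a hypothetical uwnetid; here the PDF notebook has been left out.
missing = missing_parts("jdoe", ["a2-jdoe.txt", "a2-jdoe.png"])
print("missing parts:", missing)
```

To check a real folder, pass the result of os.listdir on your submission directory as the list of file names.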