Exercise: A 1D Distribution

In this exercise, you will explore a one-dimensional distribution of numbers, along with category labels. One column of numbers, a couple categories — simple, right? Au contraire! There are many ways to transform and visualize distributions, which can lead to different insights as well as misleading omissions. You may work in groups of 1-3 people.


Dataset

The dataset we’ll examine is an anonymized list of restaurants along the Ave in the U-District. For each restaurant we have two columns: lat (the restaurant's latitude) and key (the restaurant type).

The dataset is imported below. While we’ve included some initial Vega-Lite scaffolding, you are free to use whatever visualization tool you prefer.

const restaurants = vega_datasets['udistrict.json']()

Task 1: Plot Individual Data Points

To start, create a simple 1D plot of the “raw” lat values. What design decisions might help you combat overplotting?

// here are the very early beginnings of a plot
// revise/expand the code or replace it with an image generated elsewhere
render({
  mark: 'bar', // TODO
  data: { values: restaurants },
  encoding: {
    // TODO
  }
})
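As one possible starting point (a sketch, not the intended solution), the spec below draws each lat value as a semi-transparent tick, so that overlapping points show up as darker regions. The rows in `stripRows` are made-up placeholders and `stripPlot` is a hypothetical helper name; in the notebook you would pass the `restaurants` array and hand the result to `render`.

```javascript
// Made-up placeholder rows; substitute the real `restaurants` array.
const stripRows = [
  { key: 'Coffee', lat: 47.6581 },
  { key: 'Coffee', lat: 47.6582 },
  { key: 'Pizza', lat: 47.6612 }
];

// Build a Vega-Lite spec that draws one tick per row along the x axis.
function stripPlot(values, opacity = 0.3) {
  return {
    data: { values },
    // Low opacity lets stacked ticks accumulate into darker bands.
    mark: { type: 'tick', opacity },
    encoding: {
      x: {
        field: 'lat',
        type: 'quantitative',
        // Latitudes sit far from zero, so do not force zero into the axis.
        scale: { zero: false }
      }
    }
  };
}

const stripSpec = stripPlot(stripRows);
// In the notebook: render(stripSpec)
```

Opacity is only one option; jittering the points vertically or changing the tick size are other design decisions worth trying.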

What potential features of interest can you see within your plot?

What additional real-world context might help you further interpret the data?


Task 2: Plot Summary Statistics

One approach to characterizing a distribution is to plot summary statistics, such as the average (mean), min, max, and standard deviation.

Create a new chart that uses tick marks to indicate the mean, min, and max lat values. Then add a bar mark in the background that conveys the interquartile range (IQR), where the middle 50% of the data reside. (Tip: you may find it helpful to review the layer operator.)
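A minimal sketch of how such a layered chart could be structured, assuming Vega-Lite's built-in `min`, `mean`, `max`, `q1`, and `q3` aggregates. The helper name `summaryLayers` and the rows in `summaryRows` are hypothetical; treat this as one way to organize the layers rather than the expected answer.

```javascript
// Made-up placeholder rows; substitute the real `restaurants` array.
const summaryRows = [
  { key: 'Coffee', lat: 47.6570 },
  { key: 'Pizza', lat: 47.6590 },
  { key: 'Thai', lat: 47.6610 },
  { key: 'Coffee', lat: 47.6640 }
];

function summaryLayers(values, field = 'lat') {
  // Shared scale settings so all layers agree on a non-zero-based axis.
  const scale = { zero: false };

  // One tick layer per summary statistic.
  const tick = (aggregate) => ({
    mark: { type: 'tick', thickness: 2 },
    encoding: {
      x: { aggregate, field, type: 'quantitative', scale, title: field }
    }
  });

  return {
    data: { values },
    layer: [
      // Background bar spanning the first to third quartile (the IQR).
      {
        mark: { type: 'bar', opacity: 0.3 },
        encoding: {
          x: { aggregate: 'q1', field, type: 'quantitative', scale, title: field },
          x2: { aggregate: 'q3', field }
        }
      },
      tick('min'),
      tick('mean'),
      tick('max')
    ]
  };
}

const summarySpec = summaryLayers(summaryRows);
// In the notebook: render(summarySpec)
```

Listing the bar first keeps it behind the ticks, since later layers are drawn on top of earlier ones.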

// put code here

Task 3: Plot The Distribution Shape

Compare your original plot in Task 1 to the summary statistics in Task 2. What is gained and what is lost when using summary statistics only? Plotting individual points can be valuable, but might lead to overplotting and does not scale to large datasets – drawing millions of points often creates both computational and perceptual concerns! Meanwhile, summary statistics can fail to show us the “shape” of a distribution, such as multiple modes or skew.

Create a plot that performs data transformation to produce a chart that conveys the shape of the distribution of lat values. Possible choices include a histogram, density plot, or another distribution visualization method.
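For instance, a histogram can be sketched by binning lat on the x axis and counting rows per bin. The helper name `histogramSpec`, the placeholder rows, and the step of 0.001 degrees are all assumptions made for illustration; a sensible step for the real data may differ.

```javascript
// Made-up placeholder rows; substitute the real `restaurants` array.
const histRows = [
  { key: 'Coffee', lat: 47.6571 },
  { key: 'Pizza', lat: 47.6574 },
  { key: 'Thai', lat: 47.6603 }
];

// Build a histogram spec with an explicit bin width (in degrees of latitude).
function histogramSpec(values, step) {
  return {
    data: { values },
    mark: 'bar',
    encoding: {
      // The bin transform groups lat values into intervals of width `step`.
      x: { field: 'lat', type: 'quantitative', bin: { step }, title: 'lat' },
      // Bar height is the number of rows that fall in each bin.
      y: { aggregate: 'count', type: 'quantitative', title: 'restaurants' }
    }
  };
}

const histSpec = histogramSpec(histRows, 0.001);
// In the notebook: render(histSpec)
```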

// put code here

What features of interest can you see in this plot? How do they match or differ from your earlier plots above?

If your distribution visualization involves parameters such as bin widths, offsets, or bandwidths, adjust the parameters over a range of values and see how the summary changes. Note that these parameters may be implicitly set using smart defaults, in which case you will need to expand your visualization code to provide explicit parameter values. What visual features appear to be robust and which appear to be artifacts of transformation parameters?
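One way to run such a parameter sweep is to stack several versions of the same chart, each with a different explicit setting. The sketch below does this for a density plot using Vega-Lite's `density` transform, which by default writes its output to fields named `value` and `density`. The helper name `densitySweep`, the placeholder rows, and the bandwidth values are hypothetical and chosen only for illustration.

```javascript
// Made-up placeholder rows; substitute the real `restaurants` array.
const densityRows = [
  { key: 'Coffee', lat: 47.6571 },
  { key: 'Pizza', lat: 47.6574 },
  { key: 'Thai', lat: 47.6603 },
  { key: 'Pho', lat: 47.6631 }
];

// Build one density panel per bandwidth and stack the panels vertically.
function densitySweep(values, bandwidths) {
  return {
    data: { values },
    vconcat: bandwidths.map((bandwidth) => ({
      title: `bandwidth = ${bandwidth}`,
      // Kernel density estimate over lat with an explicit bandwidth.
      transform: [{ density: 'lat', bandwidth }],
      mark: 'area',
      encoding: {
        x: { field: 'value', type: 'quantitative', scale: { zero: false }, title: 'lat' },
        y: { field: 'density', type: 'quantitative' }
      }
    })),
    // Share the x scale so panels line up and can be compared directly.
    resolve: { scale: { x: 'shared' } }
  };
}

const sweepSpec = densitySweep(densityRows, [0.0002, 0.0005, 0.001]);
// In the notebook: render(sweepSpec)
```

Peaks that persist across all panels are more likely to reflect the data; peaks that appear only at the smallest bandwidth deserve more suspicion.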


Task 4: Incorporate Restaurant Types

Finally, create one or more plots that now incorporate the key (restaurant type) column to provide more context. First, pose a question about the distribution of restaurant types along the Ave. Then, create your plot(s) to try to answer the question. A basic question might be “where are all the different restaurant types located?” While that is a fine first question and first plot, we strongly encourage you to pose at least one additional, more nuanced question.
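For the basic "where is each type located?" question, a sketch along these lines places each restaurant type on its own row, with lat along the x axis. The helper name `byTypeSpec` and the placeholder rows are hypothetical; more nuanced questions will likely call for a different encoding or transformation.

```javascript
// Made-up placeholder rows; substitute the real `restaurants` array.
const typedRows = [
  { key: 'Coffee', lat: 47.6571 },
  { key: 'Coffee', lat: 47.6629 },
  { key: 'Pizza', lat: 47.6574 },
  { key: 'Thai', lat: 47.6603 }
];

// Build a strip plot with one row per restaurant type.
function byTypeSpec(values) {
  return {
    data: { values },
    mark: { type: 'tick', opacity: 0.5 },
    encoding: {
      x: { field: 'lat', type: 'quantitative', scale: { zero: false } },
      // A nominal y channel gives each restaurant type its own band.
      y: { field: 'key', type: 'nominal', title: 'restaurant type' }
    }
  };
}

const typedSpec = byTypeSpec(typedRows);
// In the notebook: render(typedSpec)
```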

List the questions you posed and what you learned (or failed to learn) from your plots:


Don’t forget to add, commit, and push your exercises to your GitLab repo!