In this lesson, we’ll consider what it means for a dataset to have a data setting: the technical and the human processes that affect what information is captured in the data collection process and how the data are then structured. In our upcoming assessment, we’ll examine an administrative dataset on educational attainment for people ages 25 to 29 in the United States. The assessment serves not only as an opportunity to reflect on the challenges of data visualization, but also on the challenges inherent in working with real-world data.
By the end of this lesson, students will be able to:
Create visualizations involving time series data.
Identify questions about the data setting for a given dataset.
Identify competing conclusions for a given dataset.
import pandas as pd
import seaborn as sns
sns.set_theme()
Time series data¶
Seattleites often look forward to summer months for fantastic weather and outdoor activities. However, recent summers have been marred by intense climate catastrophes and wildfires in the western United States. In this activity, we’ll investigate air quality data captured by the Puget Sound Clean Air Agency’s Seattle-Duwamish sensor between April 2017 and April 2022. Current sensor readings can be found on Washington’s Air Monitoring Network Map.
This data is a time series, a time-indexed dataset often with a consistent interval between observations. For example, the air quality sensor data is recorded at hourly intervals.
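As a minimal sketch (with made-up values, not the actual sensor data), a time series like this can be constructed with pandas’ `date_range`, which generates a `DatetimeIndex` at a fixed frequency:

```python
import pandas as pd

# Hypothetical hourly readings: 5 observations starting April 1, 2017
index = pd.date_range(start="2017-04-01", periods=5, freq="h")
toy = pd.Series([10.0, 12.5, 11.0, 9.5, 8.0], index=index, name="PM2.5")
print(toy)
```

The resulting series is indexed by timestamps spaced exactly one hour apart, mirroring the structure of the air quality dataset.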
seattle_air = pd.read_csv("seattle_air.csv", index_col="Time", parse_dates=True)
seattle_air
Time series data use a special type of index called a DatetimeIndex that stores datetime values. Each datetime value in the index below consists of a YEAR-MONTH-DAY and HOUR:MINUTE:SECOND displayed in ISO format.
seattle_air.index
Pandas provides convenient string-based syntax for slicing a datetime index.
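As a sketch on a toy series (hypothetical dates and values), partial-string indexing lets us select rows by year, or by year and month:

```python
import pandas as pd

# Toy series with a DatetimeIndex (hypothetical dates and values)
s = pd.Series(
    [0, 1, 2, 3],
    index=pd.to_datetime(["2021-12-31", "2022-01-01", "2022-06-15", "2023-01-01"]),
)
print(s.loc["2022"])     # every row from the year 2022
print(s.loc["2022-06"])  # every row from June 2022
```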
seattle_air.loc["2022", :]
The dataset includes some NaN missing values. Let’s replace missing values using linear interpolation, which examines neighboring values to replace NaN values with best estimates.
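To see what linear interpolation does numerically, here is a toy example with made-up values:

```python
import numpy as np
import pandas as pd

# Two missing values between 1.0 and 4.0 (hypothetical data)
s = pd.Series([1.0, np.nan, np.nan, 4.0])
print(s.interpolate())  # the gap is filled evenly: 2.0 and 3.0
```

Linear interpolation spaces the estimates evenly between the nearest known neighbors.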
missing_values = seattle_air["PM2.5"].isna()
seattle_air = seattle_air.interpolate()
# Show only the previously-missing values
seattle_air[missing_values]
Visualizations with DatetimeIndex¶
Let’s write some code to compare each year’s data. groupby not only accepts a column name or a list of column names, but also series that indicate groups. We can group by the index.year to form groups for each year in the time series. Here, groupby uses the given series directly rather than selecting a column from the original dataframe.
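As a minimal sketch of this idea (toy dates and values, not the sensor data), grouping by `index.year` looks like this:

```python
import pandas as pd

# Toy series spanning two years (hypothetical values)
idx = pd.to_datetime(["2021-01-01", "2021-07-01", "2022-01-01", "2022-07-01"])
s = pd.Series([1.0, 3.0, 5.0, 7.0], index=idx)

# groupby accepts the year series directly instead of a column name
print(s.groupby(s.index.year).mean())
```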
pandas has a built-in plot() function that uses matplotlib: it’s not quite as clever as seaborn in preparing data visualizations for communication purposes, but it is handy for quickly visualizing your dataframes without having to import seaborn. Since these are separate plots, they do not share common axes.
seattle_air.groupby(seattle_air.index.year).plot()
Ideally, we would like to see all 6 line plots together on the same axes. However, notice that the plots all maintain their original datetime information: each plot is labeled a different year because the datetime information records year data. Without a common or shared index, it will be difficult to combine the 6 plots into one.
To define a common or shared index, we need to define a new index that is common between all 6 years of data. This is where DatetimeIndex is more of a problem than a solution: each datetime value must have all three fields year, month, and day. We are simply not allowed to remove the year from a DatetimeIndex!
DatetimeIndex provides helpful accessors for defining a common index, one of which returns the day_of_year for each value in the sequence.
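For instance, on a toy index (hypothetical dates), `day_of_year` ignores the year entirely, while `year` recovers it:

```python
import pandas as pd

# day_of_year maps each date to its position within its own year
idx = pd.to_datetime(["2021-01-01", "2021-02-01", "2022-02-01"])
print(list(idx.day_of_year))  # [1, 32, 32]
print(list(idx.year))         # [2021, 2021, 2022]
```

February 1 is day 32 regardless of year, which is exactly what makes `day_of_year` usable as a shared index across years.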
seattle_air.index.day_of_year
By combining these accessors, we can use seaborn to generate a line plot that combines each year of air quality data. Just like how groupby can accept a series to determine groups, seaborn plotting functions also accept a series as input whose values are used directly.
What else can we improve about this line plot?
grid = sns.relplot(
seattle_air,
x=seattle_air.index.day_of_year,
y="PM2.5",
hue=seattle_air.index.year,
kind="line",
errorbar=None, # Much faster when we don't generate error bars
)
# When column name is not specified, the index name "Time" is used
grid.set(xlabel="Day of Year")
grid.legend.set(title="Year")
What’s in a NaN?¶
Earlier, we replaced the NaN (not a number) missing air quality data using interpolation to guess its value based on surrounding data points. But why were these values NaN in the first place?
I posed this question to a data analyst at the Puget Sound Clean Air Agency via their public contact phone number. They provided several potential reasons why a row might be NaN.
Regular, biweekly maintenance
Break-in and vandalism issues
Internet connectivity issues
Regulatory calibration requirements
Equipment relocation, changes, or upgrades
Furthermore, they pointed out that the air quality sensors are calibrated for lower concentrations, so sensors may underreport values during times when there are higher concentrations of particulate matter.
These stories and context that situate our data inform its data setting: the technical and the human processes that affect what information is captured in the data collection process and how the data are then structured. Let’s listen to Yanni Loukissas explain more.
%%html
<iframe width="640" height="360" src="https://www.youtube-nocookie.com/embed/bUws5GCF3GI?start=477&end=624" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" allowfullscreen></iframe>
Sometimes, the creators of the dataset might share some of the data setting with you in the form of a datasheet. In Datasheets for Datasets, Timnit Gebru et al. (2018) propose many questions that should be answered when describing a dataset, which they categorize into questions about:
Motivation: why the dataset was created
Composition: what the data represents and how values relate to each other
Collection process: how the data was collected
Preprocessing/cleaning/labeling: how the data was converted into its current form
Uses: what the data should and should not be used for
Distribution: how the data will be shared with other parties
Maintenance: how the data will be maintained, hosted, and updated over time
Even when datasets are documented, there may yet be stories behind each and every value in the dataset that might only be surfaced through discussion with the dataset creators or subject matter experts. Data are local, even when they don’t seem like it, because they are shaped by the practices of the people who created them.
Close reading with distant reading¶
When we produce data visualizations, we conduct a kind of distant reading of data by taking a bird’s-eye view. Yanni argues that data visualization “only really works when it’s combined with a closer reading.”
Well, to be clear, I’m not against visualization or distant reading in and of itself. But I think it only really works when it’s combined with a closer reading, a closer investigation. From a distance, data are just patterns, symbols. In order to understand their meaning, we have to unpack them within a context.
Let’s consider how data journalist and professor Alvin Chang’s video essay “this is a teenager” (interactive version) combines close reading with distant reading. What rhetorical moves do you notice? Then, let’s listen to Yanni’s final call to action.
%%html
<iframe width="560" height="315" src="https://www.youtube-nocookie.com/embed/fKv1Mixv0Hk" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>
%%html
<iframe width="640" height="360" src="https://www.youtube-nocookie.com/embed/bUws5GCF3GI?start=1689&end=1735" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" allowfullscreen></iframe>
Data as a language¶
Our data practices, from even the most seemingly benign choices like filling in missing data, reflect our perspective as data scientists. Let’s consider how data can tell multiple stories, a section from Andy Cotgreave’s opinion article. How do you think Yanni would respond to this comparison?