Data Settings¶
In this lesson, we'll consider what it means for a dataset to have a data setting: the technical and the human processes that affect what information is captured in the data collection process and how the data are then structured. In our upcoming assessment, we'll examine an administrative dataset on educational attainment for people age 25 to 29 in the United States. The assessment serves not only as an opportunity to reflect on the challenges of data visualization, but also on the challenges inherent in working with real-world data.
By the end of this lesson, students will be able to:
- Create visualizations involving time series data.
- Compare and contrast statistical, coded, and structural bias.
- Identify questions about the data setting for a given dataset.
import pandas as pd
import seaborn as sns
sns.set_theme()
Time series data¶
Seattleites often look forward to summer months for fantastic weather and outdoor activities. However, recent summers have been marred by intense climate catastrophes and wildfires in the western United States. In this activity, we'll investigate air quality data captured by the Puget Sound Clean Air Agency's Seattle-Duwamish sensor between April 2017 and April 2022. Current sensor readings can be found on Washington's Air Monitoring Network Map.
This data is a time series, a time-indexed dataset often with a consistent interval between observations. For example, the air quality sensor data is recorded at hourly intervals.
seattle_air = pd.read_csv("seattle_air.csv", index_col="Time", parse_dates=True)
seattle_air
Time | PM2.5 |
---|---|
2017-04-06 00:00:00 | 6.8 |
2017-04-06 01:00:00 | 5.3 |
2017-04-06 02:00:00 | 5.3 |
2017-04-06 03:00:00 | 5.6 |
2017-04-06 04:00:00 | 5.9 |
... | ... |
2022-04-06 19:00:00 | 5.1 |
2022-04-06 20:00:00 | 5.0 |
2022-04-06 21:00:00 | 5.3 |
2022-04-06 22:00:00 | 5.2 |
2022-04-06 23:00:00 | 5.2 |
43848 rows × 1 columns
Time series data use a special type of index called a DatetimeIndex that stores datetime values. Each datetime value in the index below consists of a YEAR-MONTH-DAY and HOUR:MINUTE:SECOND displayed in ISO format.
seattle_air.index
DatetimeIndex(['2017-04-06 00:00:00', '2017-04-06 01:00:00', '2017-04-06 02:00:00', '2017-04-06 03:00:00', '2017-04-06 04:00:00', '2017-04-06 05:00:00', '2017-04-06 06:00:00', '2017-04-06 07:00:00', '2017-04-06 08:00:00', '2017-04-06 09:00:00', ... '2022-04-06 14:00:00', '2022-04-06 15:00:00', '2022-04-06 16:00:00', '2022-04-06 17:00:00', '2022-04-06 18:00:00', '2022-04-06 19:00:00', '2022-04-06 20:00:00', '2022-04-06 21:00:00', '2022-04-06 22:00:00', '2022-04-06 23:00:00'], dtype='datetime64[ns]', name='Time', length=43848, freq=None)
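As a quick check on the earlier claim that readings are recorded at hourly intervals, we can difference consecutive timestamps and tally the gaps. (This is a small sketch added for illustration, not part of the original analysis.)

# Illustrative check (not part of the original lesson): difference consecutive
# timestamps and count how often each gap occurs. If the sensor reports hourly,
# almost every gap should be exactly one hour.
seattle_air.index.to_series().diff().value_counts().head()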
Pandas provides convenient string-based syntax for slicing a datetime index.
seattle_air.loc["2022-03-01", :]
Time | PM2.5 |
---|---|
2022-03-01 00:00:00 | 5.1 |
2022-03-01 01:00:00 | 5.9 |
2022-03-01 02:00:00 | 6.0 |
2022-03-01 03:00:00 | 3.9 |
2022-03-01 04:00:00 | 3.4 |
2022-03-01 05:00:00 | 4.0 |
2022-03-01 06:00:00 | 3.6 |
2022-03-01 07:00:00 | 4.2 |
2022-03-01 08:00:00 | 4.4 |
2022-03-01 09:00:00 | 4.4 |
2022-03-01 10:00:00 | 4.3 |
2022-03-01 11:00:00 | 4.0 |
2022-03-01 12:00:00 | 4.2 |
2022-03-01 13:00:00 | 4.2 |
2022-03-01 14:00:00 | 4.5 |
2022-03-01 15:00:00 | 4.7 |
2022-03-01 16:00:00 | 4.8 |
2022-03-01 17:00:00 | 5.5 |
2022-03-01 18:00:00 | 5.7 |
2022-03-01 19:00:00 | 5.8 |
2022-03-01 20:00:00 | 6.9 |
2022-03-01 21:00:00 | 7.4 |
2022-03-01 22:00:00 | 10.3 |
2022-03-01 23:00:00 | 11.1 |
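The same string-based syntax also handles coarser periods and explicit ranges. As a brief sketch (not part of the original lesson), we can select a whole month with a partial date string, or a range of days with a slice; for label-based slicing, both endpoints are included.

# Sketch: partial string indexing for a whole month and for a range of days.
march_2022 = seattle_air.loc["2022-03"]                   # every hourly reading in March 2022
first_week = seattle_air.loc["2022-03-01":"2022-03-07"]   # March 1-7, endpoints included
len(march_2022), len(first_week)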
The dataset includes some NaN (missing) values. Let's replace the missing values using linear interpolation, which examines neighboring values to fill each NaN with a best estimate.
seattle_air[seattle_air["PM2.5"].isna()]
Time | PM2.5 |
---|---|
2017-04-07 07:00:00 | NaN |
2017-04-17 06:00:00 | NaN |
2017-04-17 07:00:00 | NaN |
2017-04-17 09:00:00 | NaN |
2017-04-28 09:00:00 | NaN |
... | ... |
2022-02-28 05:00:00 | NaN |
2022-03-14 05:00:00 | NaN |
2022-03-15 12:00:00 | NaN |
2022-03-15 13:00:00 | NaN |
2022-03-28 05:00:00 | NaN |
789 rows × 1 columns
missing_values = seattle_air["PM2.5"].isna()
seattle_air = seattle_air.interpolate()
# Show only the previously-missing values
seattle_air[missing_values]
Time | PM2.5 |
---|---|
2017-04-07 07:00:00 | 10.950000 |
2017-04-17 06:00:00 | 9.466667 |
2017-04-17 07:00:00 | 8.633333 |
2017-04-17 09:00:00 | 6.800000 |
2017-04-28 09:00:00 | 6.000000 |
... | ... |
2022-02-28 05:00:00 | 4.750000 |
2022-03-14 05:00:00 | 5.300000 |
2022-03-15 12:00:00 | 5.100000 |
2022-03-15 13:00:00 | 4.400000 |
2022-03-28 05:00:00 | 7.600000 |
789 rows × 1 columns
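To make the mechanics concrete, here is a tiny made-up series (an illustration, not the sensor data) showing how linear interpolation fills gaps with evenly spaced values between the nearest non-missing neighbors: the single gap becomes 5.0, and the two consecutive gaps become 7.0 and 8.0.

# Toy illustration (not the sensor data): linear interpolation fills each gap
# with evenly spaced values between the nearest non-missing neighbors.
toy = pd.Series([4.0, None, 6.0, None, None, 9.0])
toy.interpolate()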
Visualizations with DatetimeIndex¶
Let's write some code to compare each year's data. groupby not only accepts a column name or a list of column names, but also a series whose values indicate the groups. We can group by index.year to form a group for each year in the time series. Here, groupby uses the given series directly rather than selecting a column from the original dataframe.

pandas has a built-in plot() function that uses matplotlib: it's not quite as clever as seaborn in preparing data visualizations for communication purposes, but it is handy for quickly visualizing your dataframes without having to import seaborn. Since these are separate plots, they do not share common axes.
seattle_air.index.year
Index([2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, ... 2022, 2022, 2022, 2022, 2022, 2022, 2022, 2022, 2022, 2022], dtype='int32', name='Time', length=43848)
seattle_air.groupby(seattle_air.index.year).plot()
Time 2017 Axes(0.125,0.11;0.775x0.77) 2018 Axes(0.125,0.11;0.775x0.77) 2019 Axes(0.125,0.11;0.775x0.77) 2020 Axes(0.125,0.11;0.775x0.77) 2021 Axes(0.125,0.11;0.775x0.77) 2022 Axes(0.125,0.11;0.775x0.77) dtype: object
Ideally, we would like to see all 6 line plots together on the same axes. However, notice that the plots all maintain their original datetime information: each plot is labeled a different year because the datetime information records year data. Without a common or shared index, it will be difficult to combine the 6 plots into one.
To define a common or shared index, we need to define a new index that is common across all 6 years of data. This is where DatetimeIndex is more of a problem than a solution: each datetime value must have all three fields year, month, and day. We are simply not allowed to remove the year from a DatetimeIndex!

DatetimeIndex provides helpful accessors for defining a common index, one of which returns the day_of_year for each value in the sequence.
seattle_air.index.day_of_year
Index([96, 96, 96, 96, 96, 96, 96, 96, 96, 96, ... 96, 96, 96, 96, 96, 96, 96, 96, 96, 96], dtype='int32', name='Time', length=43848)
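One caveat worth noting (an aside, not from the original lesson): day_of_year is computed within each calendar year, so dates after February 28 land on a day number one higher in leap years such as 2020. For a multi-year comparison the one-day shift is usually negligible, but it explains small misalignments later in the year.

# Aside (not from the original lesson): the same calendar date maps to a
# different day-of-year after February 28 in a leap year.
pd.Timestamp("2019-07-01").day_of_year, pd.Timestamp("2020-07-01").day_of_year  # (182, 183)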
By combining these accessors, we can use seaborn to generate a line plot that combines each year of air quality data. Just like how groupby
can accept a series to determine groups, seaborn plotting functions also accept a series as input whose values are used directly.
What else can we improve about this line plot?
grid = sns.relplot(
seattle_air,
x=seattle_air.index.day_of_year,
y="PM2.5",
hue=seattle_air.index.year,
kind="line",
errorbar=None, # Much faster when we don't generate error bars
)
# When column name is not specified, the index name "Time" is used
grid.set(xlabel="Day of Year")
grid.legend.set(title="Year")
[None]
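One possible improvement, sketched below under the assumption that we mainly care about seasonal patterns rather than hour-to-hour noise: smooth the readings with a 24-hour rolling mean before plotting. (This is an illustrative refinement, not the lesson's own answer to the question above.)

# Sketch of one refinement (not the lesson's own answer): smooth the hourly
# readings with a centered 24-hour rolling mean so seasonal patterns stand out.
smoothed = seattle_air.assign(smoothed=seattle_air["PM2.5"].rolling(24, center=True).mean())
grid = sns.relplot(
    smoothed,
    x=smoothed.index.day_of_year,
    y="smoothed",
    hue=smoothed.index.year,
    kind="line",
    errorbar=None,
)
grid.set(xlabel="Day of Year", ylabel="PM2.5 (24-hour rolling mean)")
grid.legend.set(title="Year")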
What's in a NaN?¶
Earlier, we replaced the NaN (not a number) missing air quality values using interpolation, which guesses each value based on surrounding data points. But why were these values NaN in the first place?

Kevin posed this question to a data analyst at the Puget Sound Clean Air Agency via their public contact phone number. They provided several potential reasons why a row might be NaN:
- Regular, biweekly maintenance
- Break-in and vandalism issues
- Internet connectivity issues
- Regulatory calibration requirements
- Equipment relocation, changes, or upgrades
Furthermore, they pointed out that the air quality sensors are calibrated for lower concentrations, so sensors may underreport values during times when there are higher concentrations of particulate matter.
These stories and context that situate our data inform its data setting: the technical and the human processes that affect what information is captured in the data collection process and how the data are then structured. Let's listen to Yanni Loukissas explain more.
%%html
<iframe width="640" height="360" src="https://www.youtube-nocookie.com/embed/bUws5GCF3GI?start=477&end=624" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" allowfullscreen></iframe>
Sometimes, the creators of the dataset might share some of the data setting with you in the form of a datasheet. In Datasheets for Datasets, Timnit Gebru et al. (2018) propose many questions that should be answered when describing a dataset, which they categorize into questions about:
- Motivation: why the dataset was created
- Composition: what the data represents and how values relate to each other
- Collection process: how the data was collected
- Preprocessing/cleaning/labeling: how the data was converted into its current form
- Uses: what the data should and should not be used for
- Distribution: how the data will be shared with other parties
- Maintenance: how the data will be maintained, hosted, and updated over time
Even when datasets are documented, there may yet be stories behind each and every value in the dataset that might only be surfaced through discussion with the dataset creators or subject matter experts. Data is local, even when it doesn't seem like it, because it is shaped by the practices of the people who created it.
%%html
<iframe width="640" height="360" src="https://www.youtube-nocookie.com/embed/bUws5GCF3GI?start=835&end=1005" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" allowfullscreen></iframe>
Principle: Consider context¶
How do we put data locality and data settings into practice? Chapter 6 of Data Feminism by Catherine D'Ignazio and Lauren Klein, titled "The Numbers Don't Speak for Themselves," provides some examples of how to consider context in our data work.
Instead of taking data at face value and looking toward future insights, data scientists can first interrogate the context, limitations, and validity of the data under use. In other words: consider the cooking process that produces "raw" data. As one example, computational social scientists Derek Ruths and Jürgen Pfeffer write about the limitations of using social media data for behavioral insights: Instagram data skews young because Instagram does; Reddit data contains far more comments by men than by women because Reddit's overall membership is majority men. They further show how research data acquired from those sources are shaped by sampling because companies like Reddit and Instagram employ proprietary methods to deliver their data to researchers, and those methods are never disclosed. Related research by Devin Gaffney and J. Nathan Matias took on a popular corpus that claimed to contain "every publicly available Reddit comment." Their work showed that the supposedly complete corpus is missing at least thirty-six million comments and twenty-eight million submissions.
Is this what we need to do to "remove the bias" from a dataset? What do you think Yanni Loukissas would say in response to this question?
%%html
<iframe width="640" height="360" src="https://www.youtube-nocookie.com/embed/bUws5GCF3GI?start=1481&end=1800" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" allowfullscreen></iframe>
As data programmers, how does this intersect with what we've just learned about data visualization? There's a question of how we communicate the data setting to our audiences.
To explain the difference between the two visualizations, which differ only in title and subtitle, Catherine and Lauren write:
Which one of these graphics would you create? Which one should you create? The first—Mental Health in Jail—represents the typical way that the results of a data analysis are communicated. The title appears to be neutral and free of bias. This is a graphic about rates of mental illness diagnosis of incarcerated people broken down by race and ethnicity. The people are referred to as inmates, the language that the study used. The title does not mention race or ethnicity, or racism or health inequities, nor does the title point to what the data mean. But this is where additional questions about context come in. Are you representing only the four numbers that we see in the chart? Or are you representing the context from which they emerged?
The study that produced these numbers contains convincing evidence that we should distrust diagnosis numbers due to racial and ethnic discrimination. The first chart does not simply fail to communicate that but also actively undermines that main finding of the research. Moreover, the language used to refer to people in jail as inmates is dehumanizing, particularly in the context of the epidemic of mass incarceration in the United States. So, consider the second chart: Racism in Jail: People of Color Less Likely to Get Mental Health Diagnosis. This title offers a frame for how to interpret the numbers along the lines of the study from which they emerged. The research study was about racial disparities, so the title and content of this chart are about racial disparities. The people behind the numbers are people, not inmates. In addition, and crucially, the second chart names the forces of oppression that are at work: racism in prison.
Data work provides a rhetorical medium for data programmers to make and communicate meaning to readers, and that requires careful attention to every part of our work: not only the code, but also the data, because the data doesn't speak for itself.
Close Reading vs. Distant Reading¶
"Distant reading" (source) has a specific meaning (coined by Franco Moretti), but can also generally refer to the use of computational methods to analyze literary texts. This is in contrast to "close reading", which is defined as:
Close reading is an activity that keeps you focused on and within a text—appraising individual words, shapes of thought, rhetorical devices, patterns of description and characterization, and so forth, in order to understand the text's artistic achievement.
In the context of reading data, or data analysis, the question of how closely or distantly you look at a dataset is also relevant. Depending on that closeness, you might come up with a particular narrative around the data that lets you tell the story you want to tell. As an example, here is a video called "This is a teenager" from data journalist and professor Alvin Chang. It traces the trajectories of teenagers starting in 1997 and continuing to today, and talks about how their childhood experiences affected their life outcomes (interactive version here). Let's watch it and critique the visualization, the story, and the distance at which it looks at the data.
%%html
<iframe width="560" height="315" src="https://www.youtube-nocookie.com/embed/fKv1Mixv0Hk" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>
Any thoughts?