Data Settings¶
In this lesson, we'll consider what it means for a dataset to have a data setting: the technical and the human processes that affect what information is captured in the data collection process and how the data are then structured. In our upcoming assessment, we'll examine an administrative dataset on educational attainment for people age 25 to 29 in the United States. The assessment serves as an opportunity to reflect not only on the challenges of data visualization, but also on the challenges inherent in working with real-world data.
By the end of this lesson, students will be able to:
- Create visualizations involving time series data.
- Compare and contrast statistical, coded, and structural bias.
- Identify questions about the data setting for a given dataset.
import pandas as pd
import seaborn as sns
sns.set_theme()
Time series data¶
Seattleites often look forward to summer months for fantastic weather and outdoor activities. However, recent summers have been marred by intense climate catastrophes and wildfires in the western United States. In this activity, we'll investigate air quality data captured by the Puget Sound Clean Air Agency's Seattle-Duwamish sensor between April 2017 and April 2022. Current sensor readings can be found on Washington's Air Monitoring Network Map.
This data is a time series, a time-indexed dataset often with a consistent interval between observations. For example, the air quality sensor data is recorded at hourly intervals.
seattle_air = pd.read_csv("seattle_air.csv", index_col="Time", parse_dates=True)
seattle_air
Time | PM2.5 |
---|---|
2017-04-06 00:00:00 | 6.8 |
2017-04-06 01:00:00 | 5.3 |
2017-04-06 02:00:00 | 5.3 |
2017-04-06 03:00:00 | 5.6 |
2017-04-06 04:00:00 | 5.9 |
... | ... |
2022-04-06 19:00:00 | 5.1 |
2022-04-06 20:00:00 | 5.0 |
2022-04-06 21:00:00 | 5.3 |
2022-04-06 22:00:00 | 5.2 |
2022-04-06 23:00:00 | 5.2 |
43848 rows × 1 columns
seattle_air.plot()
<Axes: xlabel='Time'>
Time series data use a special type of index called a DatetimeIndex that stores datetime values. Each datetime value in the index below consists of a YEAR-MONTH-DAY and HOUR:MINUTE:SECOND displayed in ISO format.
seattle_air.index
DatetimeIndex(['2017-04-06 00:00:00', '2017-04-06 01:00:00', '2017-04-06 02:00:00', '2017-04-06 03:00:00', '2017-04-06 04:00:00', '2017-04-06 05:00:00', '2017-04-06 06:00:00', '2017-04-06 07:00:00', '2017-04-06 08:00:00', '2017-04-06 09:00:00', ... '2022-04-06 14:00:00', '2022-04-06 15:00:00', '2022-04-06 16:00:00', '2022-04-06 17:00:00', '2022-04-06 18:00:00', '2022-04-06 19:00:00', '2022-04-06 20:00:00', '2022-04-06 21:00:00', '2022-04-06 22:00:00', '2022-04-06 23:00:00'], dtype='datetime64[ns]', name='Time', length=43848, freq=None)
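To make the idea of a consistent interval concrete, here is a minimal sketch that builds a tiny hourly DatetimeIndex by hand and checks the spacing between observations. The names toy_index and toy_air are made up for illustration, and the five values are copied from the first rows shown above.

```python
import pandas as pd

# Hypothetical toy data: five hourly timestamps starting at the first reading above
toy_index = pd.date_range("2017-04-06 00:00", periods=5, freq=pd.Timedelta(hours=1))
toy_air = pd.DataFrame({"PM2.5": [6.8, 5.3, 5.3, 5.6, 5.9]}, index=toy_index)
toy_air.index.name = "Time"

# Differences between consecutive timestamps reveal the sampling interval
intervals = toy_air.index.to_series().diff().dropna()
is_hourly = (intervals == pd.Timedelta(hours=1)).all()
print(is_hourly)
```

The same check could be run on seattle_air.index to confirm that no hours are skipped entirely, as opposed to being present but recorded as missing.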
Pandas provides convenient string-based syntax for slicing a datetime index.
seattle_air.loc["2022-03-01", :]
Time | PM2.5 |
---|---|
2022-03-01 00:00:00 | 5.1 |
2022-03-01 01:00:00 | 5.9 |
2022-03-01 02:00:00 | 6.0 |
2022-03-01 03:00:00 | 3.9 |
2022-03-01 04:00:00 | 3.4 |
2022-03-01 05:00:00 | 4.0 |
2022-03-01 06:00:00 | 3.6 |
2022-03-01 07:00:00 | 4.2 |
2022-03-01 08:00:00 | 4.4 |
2022-03-01 09:00:00 | 4.4 |
2022-03-01 10:00:00 | 4.3 |
2022-03-01 11:00:00 | 4.0 |
2022-03-01 12:00:00 | 4.2 |
2022-03-01 13:00:00 | 4.2 |
2022-03-01 14:00:00 | 4.5 |
2022-03-01 15:00:00 | 4.7 |
2022-03-01 16:00:00 | 4.8 |
2022-03-01 17:00:00 | 5.5 |
2022-03-01 18:00:00 | 5.7 |
2022-03-01 19:00:00 | 5.8 |
2022-03-01 20:00:00 | 6.9 |
2022-03-01 21:00:00 | 7.4 |
2022-03-01 22:00:00 | 10.3 |
2022-03-01 23:00:00 | 11.1 |
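String-based slicing is not limited to a single day. The sketch below uses a made-up three-day hourly dataframe (toy is a hypothetical name, and its values are just a counter) to show selecting a day, a whole month, and a range between two timestamps.

```python
import pandas as pd

# Hypothetical toy data: 72 hourly rows from 2022-02-28 00:00 to 2022-03-02 23:00
idx = pd.date_range("2022-02-28", periods=72, freq=pd.Timedelta(hours=1))
toy = pd.DataFrame({"PM2.5": range(72)}, index=idx)

# A date string selects every row that falls on that day
one_day = toy.loc["2022-03-01", :]

# A year-month string selects every row in that month
one_month = toy.loc["2022-03", :]

# A slice between two timestamps includes both endpoints
span = toy.loc["2022-02-28 22:00":"2022-03-01 01:00", :]

print(len(one_day), len(one_month), len(span))
```

Unlike positional slicing in Python, label-based slices with loc include the end label, which is why the last selection has four rows rather than three.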
The dataset includes some NaN missing values. Let's replace missing values using linear interpolation, which examines neighboring values to replace NaN values with best estimates.
missing_values = seattle_air["PM2.5"].isna()
seattle_air = seattle_air.interpolate()
# Show only the previously-missing values
seattle_air[missing_values]
Time | PM2.5 |
---|---|
2017-04-07 07:00:00 | 10.950000 |
2017-04-17 06:00:00 | 9.466667 |
2017-04-17 07:00:00 | 8.633333 |
2017-04-17 09:00:00 | 6.800000 |
2017-04-28 09:00:00 | 6.000000 |
... | ... |
2022-02-28 05:00:00 | 4.750000 |
2022-03-14 05:00:00 | 5.300000 |
2022-03-15 12:00:00 | 5.100000 |
2022-03-15 13:00:00 | 4.400000 |
2022-03-28 05:00:00 | 7.600000 |
789 rows × 1 columns
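To see what linear interpolation does, it can help to try it on a handful of values small enough to check by hand. This is a sketch on a made-up series (toy is a hypothetical name), not the sensor data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with a two-value gap and a one-value gap
toy = pd.Series([4.0, np.nan, np.nan, 10.0, np.nan, 6.0])

# The default method is "linear": fill each gap along a straight line
# between the nearest known values on either side
filled = toy.interpolate()
print(filled.tolist())
```

The two-value gap between 4.0 and 10.0 is filled with evenly spaced steps (6.0 and 8.0), and the single gap between 10.0 and 6.0 is filled with their midpoint (8.0). The default method treats rows as equally spaced, which is a reasonable fit for hourly readings.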
Visualizations with DatetimeIndex¶
Let's write some code to compare each year's data. groupby not only accepts a column name or a list of column names, but also series that indicate groups. We can group by index.year to form groups for each year in the time series. Here, groupby uses the given series directly rather than selecting a column from the original dataframe.
pandas has a built-in plot() method that uses matplotlib: it's not quite as clever as seaborn in preparing data visualizations for communication purposes, but it is handy for quickly visualizing your dataframes without having to import seaborn. Because groupby produces a separate plot for each year, the plots do not share common axes.
seattle_air.index.year
Index([2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, ... 2022, 2022, 2022, 2022, 2022, 2022, 2022, 2022, 2022, 2022], dtype='int32', name='Time', length=43848)
seattle_air.groupby(seattle_air.index.year).plot()
Time 2017 Axes(0.125,0.11;0.775x0.77) 2018 Axes(0.125,0.11;0.775x0.77) 2019 Axes(0.125,0.11;0.775x0.77) 2020 Axes(0.125,0.11;0.775x0.77) 2021 Axes(0.125,0.11;0.775x0.77) 2022 Axes(0.125,0.11;0.775x0.77) dtype: object
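Grouping by values derived from the index works with aggregation, not just plotting. Here is a minimal sketch on made-up data (toy and yearly_mean are hypothetical names) that computes one mean per year.

```python
import pandas as pd

# Hypothetical toy data: two readings in each of two years
idx = pd.to_datetime(["2017-06-01", "2017-12-01", "2018-06-01", "2018-12-01"])
toy = pd.DataFrame({"PM2.5": [4.0, 6.0, 10.0, 20.0]}, index=idx)

# The grouping key is a sequence of years aligned with the rows,
# not the name of a column in the dataframe
yearly_mean = toy.groupby(toy.index.year)["PM2.5"].mean()
print(yearly_mean)
```

The result is indexed by year, with one value for 2017 and one for 2018.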
Ideally, we would like to see all 6 line plots together on the same axes. However, notice that the plots all maintain their original datetime information: each plot is labeled a different year because the datetime information records year data. Without a common or shared index, it will be difficult to combine the 6 plots into one.
To define a common or shared index, we need to define a new index that is common between all 6 years of data. This is where DatetimeIndex is more of a problem than a solution: each datetime value must have all three fields: year, month, and day. We are simply not allowed to remove the year from a DatetimeIndex!
DatetimeIndex provides helpful accessors for defining a common index, one of which returns the day_of_year for each value in the sequence.
seattle_air.index.day_of_year
Index([96, 96, 96, 96, 96, 96, 96, 96, 96, 96, ... 96, 96, 96, 96, 96, 96, 96, 96, 96, 96], dtype='int32', name='Time', length=43848)
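One way to see how day_of_year acts as a shared index is to reshape a small dataset so that each year becomes its own column. This is only a sketch on made-up values (toy and wide are hypothetical names), and it uses groupby with unstack, which is one of several ways to do this reshaping.

```python
import pandas as pd

# Hypothetical toy data: the first and last day of 2020 (a leap year) and of 2021
idx = pd.to_datetime(["2020-01-01", "2020-12-31", "2021-01-01", "2021-12-31"])
toy = pd.DataFrame({"PM2.5": [1.0, 2.0, 3.0, 4.0]}, index=idx)

# Group by (day of year, year), then move the year level into the columns
wide = (
    toy.groupby([toy.index.day_of_year, toy.index.year])["PM2.5"]
    .mean()
    .unstack()
)
print(wide)
```

Both January 1 readings land on row 1, so the two years now share an index. Notice that December 31 is day 366 in 2020 but day 365 in 2021: in leap years, every date after February is shifted by one day, which is worth keeping in mind when comparing years this way.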
By combining these accessors, we can use seaborn to generate a line plot that combines each year of air quality data. Just like how groupby can accept a series to determine groups, seaborn plotting functions also accept a series as input whose values are used directly.
What else can we improve about this line plot?
grid = sns.relplot(
    seattle_air,
    x=seattle_air.index.day_of_year,
    y="PM2.5",
    hue=seattle_air.index.year,
    palette="tab10",
    kind="line",
    errorbar=None, # Much faster when we don't generate error bars
)
# When column name is not specified, the index name "Time" is used
grid.set(xlabel="Day of Year")
grid.legend.set(title="Year")
[None]
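One possible improvement, offered as a sketch rather than the intended answer: hourly readings can make the lines look noisy, so we might aggregate to daily averages before plotting. The example below uses made-up data (toy and daily are hypothetical names) to show how resample groups a DatetimeIndex into calendar days.

```python
import pandas as pd

# Hypothetical toy data: 24 hours at 1.0 followed by 24 hours at 3.0
idx = pd.date_range("2022-03-01", periods=48, freq=pd.Timedelta(hours=1))
toy = pd.DataFrame({"PM2.5": [1.0] * 24 + [3.0] * 24}, index=idx)

# Group the hourly rows into calendar days and average each day
daily = toy.resample("D").mean()
print(daily)
```

The result has one row per day, and it keeps a DatetimeIndex, so accessors such as day_of_year and year remain available for plotting. Averaging is itself a choice about the data: a daily mean can hide a short but severe spike in particulate matter.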
What's in a NaN?¶
Earlier, we replaced the NaN (not a number) missing air quality data using interpolation to guess its value based on surrounding data points. But why were these values NaN in the first place?
Last year, I asked this question to a data analyst at the Puget Sound Clean Air Agency via their public contact phone number. They provided several potential reasons why a row might be NaN.
- Regular, biweekly maintenance
- Break-in and vandalism issues
- Internet connectivity issues
- Regulatory calibration requirements
- Equipment relocation, changes, or upgrades
Furthermore, they pointed out that the air quality sensors are calibrated for lower concentrations, so sensors may underreport values during times when there are higher concentrations of particulate matter.
These stories and context that situate our data inform its data setting: the technical and the human processes that affect what information is captured in the data collection process and how the data are then structured. Let's listen to Yanni Loukissas explain more.
%%html
<iframe width="640" height="360" src="https://www.youtube-nocookie.com/embed/bUws5GCF3GI?start=477&end=624" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" allowfullscreen></iframe>
Sometimes, the creators of the dataset might share some of the data settings with you in the form of a datasheet. In Datasheets for Datasets, Timnit Gebru et al. (2018) propose many questions that should be answered when describing a dataset, which they categorize into questions about:
- Motivation: why the dataset was created
- Composition: what the data represents and how values relate to each other
- Collection process: how the data was collected
- Preprocessing/cleaning/labeling: how the data was converted into its current form
- Uses: what the data should and should not be used for
- Distribution: how the data will be shared with other parties
- Maintenance: how the data will be maintained, hosted, and updated over time
Even when datasets are documented, there may yet be stories behind each and every value in the dataset that might only be surfaced through discussion with the dataset creators or subject matter experts. Data are local, even when they don't seem like it, because they are shaped by the practices of the people who created them.
%%html
<iframe width="640" height="360" src="https://www.youtube-nocookie.com/embed/bUws5GCF3GI?start=835&end=1005" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" allowfullscreen></iframe>
Principle: Consider context¶
How do we put data locality and data settings into practice? Chapter 6 of Data Feminism by Catherine D'Ignazio and Lauren Klein, titled "The Numbers Don't Speak for Themselves," provides some examples of how to consider context in our data work.
Instead of taking data at face value and looking toward future insights, data scientists can first interrogate the context, limitations, and validity of the data under use. In other words: consider the cooking process that produces "raw" data. As one example, computational social scientists Derek Ruths and Jürgen Pfeffer write about the limitations of using social media data for behavioral insights: Instagram data skews young because Instagram does; Reddit data contains far more comments by men than by women because Reddit's overall membership is majority men. They further show how research data acquired from those sources are shaped by sampling because companies like Reddit and Instagram employ proprietary methods to deliver their data to researchers, and those methods are never disclosed. Related research by Devin Gaffney and J. Nathan Matias took on a popular corpus that claimed to contain "every publicly available Reddit comment." Their work showed that the supposedly complete corpus is missing at least thirty-six million comments and twenty-eight million submissions.
Is the solution to "remove the bias" from a dataset? What do you think Yanni Loukissas would say in response to this question?
%%html
<iframe width="640" height="360" src="https://www.youtube-nocookie.com/embed/bUws5GCF3GI?start=1481&end=1800" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" allowfullscreen></iframe>
As data programmers, how does this intersect with what we've just learned about data visualization? There's a question of how we communicate the data setting to our audiences.
To explain the difference between the two visualizations, which differ only in title and subtitle, Catherine and Lauren write:
Which one of these graphics would you create? Which one should you create? The first—Mental Health in Jail—represents the typical way that the results of a data analysis are communicated. The title appears to be neutral and free of bias. This is a graphic about rates of mental illness diagnosis of incarcerated people broken down by race and ethnicity. The people are referred to as inmates, the language that the study used. The title does not mention race or ethnicity, or racism or health inequities, nor does the title point to what the data mean. But this is where additional questions about context come in. Are you representing only the four numbers that we see in the chart? Or are you representing the context from which they emerged?
The study that produced these numbers contains convincing evidence that we should distrust diagnosis numbers due to racial and ethnic discrimination. The first chart does not simply fail to communicate that but also actively undermines that main finding of the research. Moreover, the language used to refer to people in jail as inmates is dehumanizing, particularly in the context of the epidemic of mass incarceration in the United States. So, consider the second chart: Racism in Jail: People of Color Less Likely to Get Mental Health Diagnosis. This title offers a frame for how to interpret the numbers along the lines of the study from which they emerged. The research study was about racial disparities, so the title and content of this chart are about racial disparities. The people behind the numbers are people, not inmates. In addition, and crucially, the second chart names the forces of oppression that are at work: racism in prison.
Data work provides a rhetorical medium for data programmers to make and communicate meaning to readers. This requires careful attention to every part of our work: not only the code, but also the data, because the data don't speak for themselves.