Data Settings¶

In this lesson, we'll consider what it means for a dataset to have a data setting: the technical and the human processes that affect what information is captured in the data collection process and how the data are then structured. In our upcoming assessment, we'll examine an administrative dataset on educational attainment for people ages 25 to 29 in the United States. The assessment serves as an opportunity to reflect not only on the challenges of data visualization, but also on the challenges inherent in working with real-world data.

By the end of this lesson, students will be able to:

  • Create visualizations involving time series data.
  • Compare and contrast statistical, coded, and structural bias.
  • Identify questions about the data setting for a given dataset.
In [1]:
import pandas as pd
import seaborn as sns

sns.set_theme()

Time series data¶

Seattleites often look forward to summer months for beautiful weather and outdoor activities, but in recent years summer wildfires have had significant impacts on air quality. Are there particular times during the summer months when air quality is most concerning? Let's investigate air quality data captured by the Puget Sound Clean Air Agency's Seattle-Duwamish sensor between April 2017 and April 2022. Current sensor readings can be found on Washington's Air Monitoring Network Map.

The air quality sensor data is recorded at hourly intervals, making it a time series: time-indexed data with a consistent interval between observations.

In [2]:
seattle_air = pd.read_csv("seattle_air.csv", index_col="Time", parse_dates=True)
seattle_air
Out[2]:
PM2.5
Time
2017-04-06 00:00:00 6.8
2017-04-06 01:00:00 5.3
2017-04-06 02:00:00 5.3
2017-04-06 03:00:00 5.6
2017-04-06 04:00:00 5.9
... ...
2022-04-06 19:00:00 5.1
2022-04-06 20:00:00 5.0
2022-04-06 21:00:00 5.3
2022-04-06 22:00:00 5.2
2022-04-06 23:00:00 5.2

43848 rows × 1 columns

Time series data use a special type of index called a DatetimeIndex that stores datetime values. Each datetime value in the index below consists of a YEAR-MONTH-DAY and HOUR:MINUTE:SECOND.

In [3]:
seattle_air.index
Out[3]:
DatetimeIndex(['2017-04-06 00:00:00', '2017-04-06 01:00:00',
               '2017-04-06 02:00:00', '2017-04-06 03:00:00',
               '2017-04-06 04:00:00', '2017-04-06 05:00:00',
               '2017-04-06 06:00:00', '2017-04-06 07:00:00',
               '2017-04-06 08:00:00', '2017-04-06 09:00:00',
               ...
               '2022-04-06 14:00:00', '2022-04-06 15:00:00',
               '2022-04-06 16:00:00', '2022-04-06 17:00:00',
               '2022-04-06 18:00:00', '2022-04-06 19:00:00',
               '2022-04-06 20:00:00', '2022-04-06 21:00:00',
               '2022-04-06 22:00:00', '2022-04-06 23:00:00'],
              dtype='datetime64[ns]', name='Time', length=43848, freq=None)
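
Notice freq=None at the end of the output: pandas did not attach a fixed frequency to the index when reading the CSV, even though the readings are hourly. As a quick sanity check (a sketch on our part, not a step from the original lesson), we can ask pandas to infer the frequency from the spacing of the index values:

# infer_freq inspects the spacing of the index values and returns a
# frequency alias; an hourly index should yield "h" ("H" in older
# pandas versions)
pd.infer_freq(seattle_air.index)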

Pandas provides convenient string-based syntax for slicing a datetime index.

In [4]:
# All the data in 2022 and all the columns
seattle_air.loc["2022", :]
Out[4]:
PM2.5
Time
2022-01-01 00:00:00 27.2
2022-01-01 01:00:00 25.1
2022-01-01 02:00:00 23.9
2022-01-01 03:00:00 21.0
2022-01-01 04:00:00 16.7
... ...
2022-04-06 19:00:00 5.1
2022-04-06 20:00:00 5.0
2022-04-06 21:00:00 5.3
2022-04-06 22:00:00 5.2
2022-04-06 23:00:00 5.2

2304 rows × 1 columns

In [5]:
seattle_air.loc["2022-04", :]
Out[5]:
PM2.5
Time
2022-04-01 00:00:00 5.2
2022-04-01 01:00:00 5.1
2022-04-01 02:00:00 5.4
2022-04-01 03:00:00 5.4
2022-04-01 04:00:00 6.3
... ...
2022-04-06 19:00:00 5.1
2022-04-06 20:00:00 5.0
2022-04-06 21:00:00 5.3
2022-04-06 22:00:00 5.2
2022-04-06 23:00:00 5.2

144 rows × 1 columns

How do we slice the air quality data for the summer months June 1, 2021 through August 31, 2021? Can we get only the summer months across all the years in the dataset? How does this compare against the MultiIndex slicing that we've learned in the past?

In [6]:
# Why is the colon between the two dates okay? Slice syntax is allowed
# directly inside square brackets, so no tuple is written out. Remember that
# we only need slice(...) when the slice appears inside an explicit tuple,
# as with MultiIndex slicing.
seattle_air.loc["2021-06-01":"2021-08-31", :]
Out[6]:
PM2.5
Time
2021-06-01 00:00:00 6.0
2021-06-01 01:00:00 6.1
2021-06-01 02:00:00 6.0
2021-06-01 03:00:00 6.6
2021-06-01 04:00:00 7.7
... ...
2021-08-31 19:00:00 5.9
2021-08-31 20:00:00 6.4
2021-08-31 21:00:00 6.7
2021-08-31 22:00:00 7.2
2021-08-31 23:00:00 6.5

2208 rows × 1 columns

In [7]:
# Pandas will not infer the year for us, so this slice fails with an error.
# There are ways around this: see the workaround sketched after the
# traceback below.
seattle_air.loc["06-01":"08-31", :]
---------------------------------------------------------------------------
OverflowError                             Traceback (most recent call last)
File period.pyx:1169, in pandas._libs.tslibs.period.period_ordinal_to_dt64()

OverflowError: Overflow occurred in npy_datetimestruct_to_datetime

The above exception was the direct cause of the following exception:

OutOfBoundsDatetime                       Traceback (most recent call last)
Cell In[7], line 1
----> 1 seattle_air.loc["06-01":"08-31", :]

File /opt/conda/lib/python3.11/site-packages/pandas/core/indexing.py:1184, in _LocationIndexer.__getitem__(self, key)
   1182     if self._is_scalar_access(key):
   1183         return self.obj._get_value(*key, takeable=self._takeable)
-> 1184     return self._getitem_tuple(key)
   1185 else:
   1186     # we by definition only have the 0th axis
   1187     axis = self.axis or 0

File /opt/conda/lib/python3.11/site-packages/pandas/core/indexing.py:1377, in _LocIndexer._getitem_tuple(self, tup)
   1374 if self._multi_take_opportunity(tup):
   1375     return self._multi_take(tup)
-> 1377 return self._getitem_tuple_same_dim(tup)

File /opt/conda/lib/python3.11/site-packages/pandas/core/indexing.py:1020, in _LocationIndexer._getitem_tuple_same_dim(self, tup)
   1017 if com.is_null_slice(key):
   1018     continue
-> 1020 retval = getattr(retval, self.name)._getitem_axis(key, axis=i)
   1021 # We should never have retval.ndim < self.ndim, as that should
   1022 #  be handled by the _getitem_lowerdim call above.
   1023 assert retval.ndim == self.ndim

File /opt/conda/lib/python3.11/site-packages/pandas/core/indexing.py:1411, in _LocIndexer._getitem_axis(self, key, axis)
   1409 if isinstance(key, slice):
   1410     self._validate_key(key, axis)
-> 1411     return self._get_slice_axis(key, axis=axis)
   1412 elif com.is_bool_indexer(key):
   1413     return self._getbool_axis(key, axis=axis)

File /opt/conda/lib/python3.11/site-packages/pandas/core/indexing.py:1443, in _LocIndexer._get_slice_axis(self, slice_obj, axis)
   1440     return obj.copy(deep=False)
   1442 labels = obj._get_axis(axis)
-> 1443 indexer = labels.slice_indexer(slice_obj.start, slice_obj.stop, slice_obj.step)
   1445 if isinstance(indexer, slice):
   1446     return self.obj._slice(indexer, axis=axis)

File /opt/conda/lib/python3.11/site-packages/pandas/core/indexes/datetimes.py:682, in DatetimeIndex.slice_indexer(self, start, end, step)
    674 # GH#33146 if start and end are combinations of str and None and Index is not
    675 # monotonic, we can not use Index.slice_indexer because it does not honor the
    676 # actual elements, is only searching for start and end
    677 if (
    678     check_str_or_none(start)
    679     or check_str_or_none(end)
    680     or self.is_monotonic_increasing
    681 ):
--> 682     return Index.slice_indexer(self, start, end, step)
    684 mask = np.array(True)
    685 in_index = True

File /opt/conda/lib/python3.11/site-packages/pandas/core/indexes/base.py:6662, in Index.slice_indexer(self, start, end, step)
   6618 def slice_indexer(
   6619     self,
   6620     start: Hashable | None = None,
   6621     end: Hashable | None = None,
   6622     step: int | None = None,
   6623 ) -> slice:
   6624     """
   6625     Compute the slice indexer for input labels and step.
   6626 
   (...)
   6660     slice(1, 3, None)
   6661     """
-> 6662     start_slice, end_slice = self.slice_locs(start, end, step=step)
   6664     # return a slice
   6665     if not is_scalar(start_slice):

File /opt/conda/lib/python3.11/site-packages/pandas/core/indexes/base.py:6879, in Index.slice_locs(self, start, end, step)
   6877 start_slice = None
   6878 if start is not None:
-> 6879     start_slice = self.get_slice_bound(start, "left")
   6880 if start_slice is None:
   6881     start_slice = 0

File /opt/conda/lib/python3.11/site-packages/pandas/core/indexes/base.py:6794, in Index.get_slice_bound(self, label, side)
   6790 original_label = label
   6792 # For datetime indices label may be a string that has to be converted
   6793 # to datetime boundary according to its resolution.
-> 6794 label = self._maybe_cast_slice_bound(label, side)
   6796 # we need to look up the label
   6797 try:

File /opt/conda/lib/python3.11/site-packages/pandas/core/indexes/datetimes.py:642, in DatetimeIndex._maybe_cast_slice_bound(self, label, side)
    637 if isinstance(label, dt.date) and not isinstance(label, dt.datetime):
    638     # Pandas supports slicing with dates, treated as datetimes at midnight.
    639     # https://github.com/pandas-dev/pandas/issues/31501
    640     label = Timestamp(label).to_pydatetime()
--> 642 label = super()._maybe_cast_slice_bound(label, side)
    643 self._data._assert_tzawareness_compat(label)
    644 return Timestamp(label)

File /opt/conda/lib/python3.11/site-packages/pandas/core/indexes/datetimelike.py:375, in DatetimeIndexOpsMixin._maybe_cast_slice_bound(self, label, side)
    369     except ValueError as err:
    370         # DTI -> parsing.DateParseError
    371         # TDI -> 'unit abbreviation w/o a number'
    372         # PI -> string cannot be parsed as datetime-like
    373         self._raise_invalid_indexer("slice", label, err)
--> 375     lower, upper = self._parsed_string_to_bounds(reso, parsed)
    376     return lower if side == "left" else upper
    377 elif not isinstance(label, self._data._recognized_scalars):

File /opt/conda/lib/python3.11/site-packages/pandas/core/indexes/datetimes.py:538, in DatetimeIndex._parsed_string_to_bounds(self, reso, parsed)
    536 freq = OFFSET_TO_PERIOD_FREQSTR.get(reso.attr_abbrev, reso.attr_abbrev)
    537 per = Period(parsed, freq=freq)
--> 538 start, end = per.start_time, per.end_time
    540 # GH 24076
    541 # If an incoming date string contained a UTC offset, need to localize
    542 # the parsed date to this offset first before aligning with the index's
    543 # timezone
    544 start = start.tz_localize(parsed.tzinfo)

File period.pyx:1666, in pandas._libs.tslibs.period.PeriodMixin.start_time.__get__()

File period.pyx:1992, in pandas._libs.tslibs.period._Period.to_timestamp()

File period.pyx:1172, in pandas._libs.tslibs.period.period_ordinal_to_dt64()

OutOfBoundsDatetime: Out of bounds nanosecond timestamp: 1-06-01 00:00:00
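
The error message shows that pandas parsed "06-01" as a date in year 1 (note the 1-06-01 00:00:00), which falls outside the range of nanosecond timestamps. One workaround, sketched here as a suggestion rather than the lesson's official solution, is to boolean mask on the month accessor, which selects the summer months across every year at once. The variable name summer_months is our own choice.

# Keep only the rows whose index month is June, July, or August,
# across all years in the dataset
summer_months = seattle_air[seattle_air.index.month.isin([6, 7, 8])]
summer_months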

Visualizations with DatetimeIndex¶

What would this data look like if we plotted the values?

In [8]:
sns.relplot(seattle_air, x="Time", y="PM2.5", kind="line")
Out[8]:
<seaborn.axisgrid.FacetGrid at 0x7a0f661be0d0>
[Line plot of PM2.5 over Time from April 2017 to April 2022]

This is a good start, but not so helpful for answering our research question about summer air quality. We can try a groupby on each year to produce a plot for each unique year, but it would be really nice if we could see all the years in a single plot.

In [12]:
seattle_air.index.year
Out[12]:
Index([2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017,
       ...
       2022, 2022, 2022, 2022, 2022, 2022, 2022, 2022, 2022, 2022],
      dtype='int32', name='Time', length=43848)
In [13]:
# Instead of supplying a column name as a string, groupby can accept
# an equal-length series whose values indicate each row's group
seattle_air.groupby(seattle_air.index.year).plot()
Out[13]:
Time
2017    Axes(0.125,0.11;0.775x0.77)
2018    Axes(0.125,0.11;0.775x0.77)
2019    Axes(0.125,0.11;0.775x0.77)
2020    Axes(0.125,0.11;0.775x0.77)
2021    Axes(0.125,0.11;0.775x0.77)
2022    Axes(0.125,0.11;0.775x0.77)
dtype: object
[Six separate line plots of PM2.5 over Time, one for each year from 2017 to 2022]

Ideally, we would like to see all 6 line plots together on the same axes. However, notice that the plots all maintain their original datetime information: each plot is labeled with a different year because the datetime values record YEAR-MONTH-DAY. Without a common or shared x-axis, it is difficult to combine the 6 plots into one.

DatetimeIndex provides helpful accessors, including day_of_year. Because the day of the year is just a number, it offers a way to align the x-axis across different years.

In [14]:
seattle_air.index.day_of_year
Out[14]:
Index([96, 96, 96, 96, 96, 96, 96, 96, 96, 96,
       ...
       96, 96, 96, 96, 96, 96, 96, 96, 96, 96],
      dtype='int32', name='Time', length=43848)

By combining these accessors, we can use seaborn to generate a line plot that combines each year of air quality data. Just as groupby can accept a series to determine the groups, seaborn plotting functions also accept a series as input whose values are used directly.

Based on the principles of visualization that we learned in the last lesson, what else can we improve about this line plot?

In [15]:
grid = sns.relplot(
    seattle_air,
    x=seattle_air.index.day_of_year,
    y="PM2.5",
    hue=seattle_air.index.year,
    kind="line",
    errorbar=None, # Much faster when we don't generate error bars
)
# When a column name is not given, seaborn uses the series name "Time"
# as the axis label, so we override it
grid.set(xlabel="Day of Year")
grid.legend.set(title="Year");
[Line plot of PM2.5 by Day of Year, with one colored line per year]

What's in a NaN?¶

If you look closely at the 6 plots of each year's data, you'll notice some gaps: the dataset has missing values, marked NaN. Let's replace the missing values using linear interpolation, which examines neighboring values to replace each NaN with a best estimate.
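
Before applying interpolation to the real data, here is a minimal illustration with made-up values showing what linear interpolation does:

# A tiny example series: linear interpolation fills each NaN with a value
# evenly spaced between its known neighbors
example = pd.Series([1.0, None, None, 4.0])
example.interpolate()  # 1.0, 2.0, 3.0, 4.0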

In [16]:
missing_values = seattle_air["PM2.5"].isna()
# Show the missing values
seattle_air[missing_values]
Out[16]:
PM2.5
Time
2017-04-07 07:00:00 NaN
2017-04-17 06:00:00 NaN
2017-04-17 07:00:00 NaN
2017-04-17 09:00:00 NaN
2017-04-28 09:00:00 NaN
... ...
2022-02-28 05:00:00 NaN
2022-03-14 05:00:00 NaN
2022-03-15 12:00:00 NaN
2022-03-15 13:00:00 NaN
2022-03-28 05:00:00 NaN

789 rows × 1 columns

In [17]:
seattle_air = seattle_air.interpolate()
# Show only the previously-missing values
seattle_air[missing_values]
Out[17]:
PM2.5
Time
2017-04-07 07:00:00 10.950000
2017-04-17 06:00:00 9.466667
2017-04-17 07:00:00 8.633333
2017-04-17 09:00:00 6.800000
2017-04-28 09:00:00 6.000000
... ...
2022-02-28 05:00:00 4.750000
2022-03-14 05:00:00 5.300000
2022-03-15 12:00:00 5.100000
2022-03-15 13:00:00 4.400000
2022-03-28 05:00:00 7.600000

789 rows × 1 columns

But why were these values NaN in the first place? A few years ago, I called the Puget Sound Clean Air Agency and waited on the line to speak to a data analyst. They provided several potential reasons why a row might be NaN.

  • Regular, biweekly maintenance
  • Break-in and vandalism issues
  • Internet connectivity issues
  • Regulatory calibration requirements
  • Equipment relocation, changes, or upgrades

Furthermore, they pointed out that the air quality sensors are calibrated for lower concentrations, so sensors may underreport values during times when there are higher concentrations of particulate matter.

These stories, and the context that situates our data, make up its data setting: the technical and the human processes that affect what information is captured in the data collection process and how the data are then structured. Sometimes, the creators of a dataset share parts of its data setting in the form of a datasheet. In Datasheets for Datasets, Timnit Gebru et al. (2018) propose many questions that should be answered when describing a dataset, which they categorized into questions about:

  • Motivation: why the dataset was created
  • Composition: what the data represents and how values relate to each other
  • Collection process: how the data was collected
  • Preprocessing/cleaning/labeling: how the data was converted into its current form
  • Uses: what the data should and should not be used for
  • Distribution: how the data will be shared with other parties
  • Maintenance: how the data will be maintained, hosted, and updated over time

Even when datasets are documented, there may yet be stories behind each and every value in the dataset that might only be surfaced through discussion with the dataset creators or subject matter experts. Data are local, even when they don't seem to be, because they are shaped by the practices of the people who created them.

Consider context¶

How do we put data locality and data setting into practice? Chapter 6 of Data Feminism by Catherine D'Ignazio and Lauren Klein, titled "The Numbers Don't Speak for Themselves," offers a call to action to consider context in our work.

Instead of taking data at face value and looking toward future insights, data scientists can first interrogate the context, limitations, and validity of the data under use. In other words: consider the cooking process that produces "raw" data.

How do we communicate this context—the underlying data setting—to readers? Consider these two plots, which only differ in their titles and subtitles.

[Bar plot titled "Mental Health in Jail" and subtitled "Rate of mental health diagnosis of inmates"]
[The same bar plot titled "Racism in Jail" and subtitled "People of color less likely to get mental health diagnosis"]

To explain the difference between the two visualizations, Catherine and Lauren write:

Which one of these graphics would you create? Which one should you create? The first—Mental Health in Jail—represents the typical way that the results of a data analysis are communicated. The title appears to be neutral and free of bias. This is a graphic about rates of mental illness diagnosis of incarcerated people broken down by race and ethnicity. The people are referred to as inmates, the language that the study used. The title does not mention race or ethnicity, or racism or health inequities, nor does the title point to what the data mean. But this is where additional questions about context come in. Are you representing only the four numbers that we see in the chart? Or are you representing the context from which they emerged?

The study that produced these numbers contains convincing evidence that we should distrust diagnosis numbers due to racial and ethnic discrimination. The first chart does not simply fail to communicate that but also actively undermines that main finding of the research. Moreover, the language used to refer to people in jail as inmates is dehumanizing, particularly in the context of the epidemic of mass incarceration in the United States. So, consider the second chart: Racism in Jail: People of Color Less Likely to Get Mental Health Diagnosis. This title offers a frame for how to interpret the numbers along the lines of the study from which they emerged. The research study was about racial disparities, so the title and content of this chart are about racial disparities. The people behind the numbers are people, not inmates. In addition, and crucially, the second chart names the forces of oppression that are at work: racism in prison.

Close reading¶

Often, we think of statistical data analysis and visualization as a type of distant reading of the data: one that aims to abstract and zoom out to the furthest (and, some might argue, most complete) view of the data. Yet, Yanni Loukissas argues for a close reading of the data in his final take-home message at the end of his book All Data Are Local:

Treat data as a point of contact, a landing, an opportunity to get closer, to learn to care about a subject, or the people and places beyond data. Do not mistake the availability of data as permission to remain at a distance.

In "this is an teenager", an interactive visualization by data journalist and professor Alvin Chang, we follow the lives of teenagers starting from 1997 through 2021 to see how adverse childhood experiences affect their life outcomes. How does Alvin take a closer reading of the National Longitudinal Survey of Youth dataset?

In [18]:
%%html
<iframe width="640" height="360" src="https://www.youtube-nocookie.com/embed/fKv1Mixv0Hk" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" allowfullscreen></iframe>