Data Settings¶
In this lesson, we'll consider what it means for a dataset to have a data setting: the technical and the human processes that affect what information is captured in the data collection process and how the data are then structured. In our upcoming assessment, we'll examine an administrative dataset on educational attainment for people ages 25 to 29 in the United States. The assessment serves as an opportunity to reflect not only on the challenges of data visualization, but also on the challenges inherent in working with real-world data.
By the end of this lesson, students will be able to:
- Create visualizations involving time series data.
- Compare and contrast statistical, coded, and structural bias.
- Identify questions about the data setting for a given dataset.
import pandas as pd
import seaborn as sns
sns.set_theme()
Time series data¶
Seattleites often look forward to the summer months for beautiful weather and outdoor activities, but in recent years summer wildfires have had significant impacts on air quality. Are there particular times during the summer months when air quality is most concerning? Let's investigate air quality data captured by the Puget Sound Clean Air Agency's Seattle-Duwamish sensor between April 2017 and April 2022. Current sensor readings can be found on Washington's Air Monitoring Network Map.
The air quality sensor data is recorded at hourly intervals, making it a time series: time-indexed data with a consistent interval between observations.
seattle_air = pd.read_csv("seattle_air.csv", index_col="Time", parse_dates=True)
seattle_air
| Time | PM2.5 |
|---|---|
| 2017-04-06 00:00:00 | 6.8 |
| 2017-04-06 01:00:00 | 5.3 |
| 2017-04-06 02:00:00 | 5.3 |
| 2017-04-06 03:00:00 | 5.6 |
| 2017-04-06 04:00:00 | 5.9 |
| ... | ... |
| 2022-04-06 19:00:00 | 5.1 |
| 2022-04-06 20:00:00 | 5.0 |
| 2022-04-06 21:00:00 | 5.3 |
| 2022-04-06 22:00:00 | 5.2 |
| 2022-04-06 23:00:00 | 5.2 |
43848 rows × 1 columns
Time series data use a special type of index called a `DatetimeIndex` that stores datetime values. Each datetime value in the index below consists of a `YEAR-MONTH-DAY` and `HOUR:MINUTE:SECOND`.
seattle_air.index
DatetimeIndex(['2017-04-06 00:00:00', '2017-04-06 01:00:00', '2017-04-06 02:00:00', '2017-04-06 03:00:00', '2017-04-06 04:00:00', '2017-04-06 05:00:00', '2017-04-06 06:00:00', '2017-04-06 07:00:00', '2017-04-06 08:00:00', '2017-04-06 09:00:00', ... '2022-04-06 14:00:00', '2022-04-06 15:00:00', '2022-04-06 16:00:00', '2022-04-06 17:00:00', '2022-04-06 18:00:00', '2022-04-06 19:00:00', '2022-04-06 20:00:00', '2022-04-06 21:00:00', '2022-04-06 22:00:00', '2022-04-06 23:00:00'], dtype='datetime64[ns]', name='Time', length=43848, freq=None)
Pandas provides convenient string-based syntax for slicing a datetime index.
# All the data in 2022 and all the columns
seattle_air.loc["2022", :]
| Time | PM2.5 |
|---|---|
| 2022-01-01 00:00:00 | 27.2 |
| 2022-01-01 01:00:00 | 25.1 |
| 2022-01-01 02:00:00 | 23.9 |
| 2022-01-01 03:00:00 | 21.0 |
| 2022-01-01 04:00:00 | 16.7 |
| ... | ... |
| 2022-04-06 19:00:00 | 5.1 |
| 2022-04-06 20:00:00 | 5.0 |
| 2022-04-06 21:00:00 | 5.3 |
| 2022-04-06 22:00:00 | 5.2 |
| 2022-04-06 23:00:00 | 5.2 |
2304 rows × 1 columns
seattle_air.loc["2022-04", :]
| Time | PM2.5 |
|---|---|
| 2022-04-01 00:00:00 | 5.2 |
| 2022-04-01 01:00:00 | 5.1 |
| 2022-04-01 02:00:00 | 5.4 |
| 2022-04-01 03:00:00 | 5.4 |
| 2022-04-01 04:00:00 | 6.3 |
| ... | ... |
| 2022-04-06 19:00:00 | 5.1 |
| 2022-04-06 20:00:00 | 5.0 |
| 2022-04-06 21:00:00 | 5.3 |
| 2022-04-06 22:00:00 | 5.2 |
| 2022-04-06 23:00:00 | 5.2 |
144 rows × 1 columns
How do we slice the air quality data for the summer months June 1, 2021 through August 31, 2021? Can we get only the summer months across all the years in the dataset? How does this compare against the `MultiIndex` slicing that we've learned in the past?
# Why is the colon between the two dates okay?
# It's okay because it's not directly in a tuple context. Remember that we use slice(...)
# when we're directly in the context of a tuple.
seattle_air.loc["2021-06-01":"2021-08-31", :]
| Time | PM2.5 |
|---|---|
| 2021-06-01 00:00:00 | 6.0 |
| 2021-06-01 01:00:00 | 6.1 |
| 2021-06-01 02:00:00 | 6.0 |
| 2021-06-01 03:00:00 | 6.6 |
| 2021-06-01 04:00:00 | 7.7 |
| ... | ... |
| 2021-08-31 19:00:00 | 5.9 |
| 2021-08-31 20:00:00 | 6.4 |
| 2021-08-31 21:00:00 | 6.7 |
| 2021-08-31 22:00:00 | 7.2 |
| 2021-08-31 23:00:00 | 6.5 |
2208 rows × 1 columns
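How does this compare to `MultiIndex` slicing? The colon syntax here is just shorthand for building a `slice` object, which we could also write out explicitly, as we would have to inside a tuple. A sketch of the equivalent call:
# Equivalent to the colon syntax above: construct the slice object explicitly,
# just like we do when slicing a MultiIndex inside a tuple.
seattle_air.loc[slice("2021-06-01", "2021-08-31"), :]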
# Pandas will not infer the year for you automatically, so there's not an immediately
# clear solution to what to do about this.
# There are ways around this! We might see some later.
seattle_air.loc["06-01":"08-31", :]
---------------------------------------------------------------------------
OverflowError                             Traceback (most recent call last)
File period.pyx:1169, in pandas._libs.tslibs.period.period_ordinal_to_dt64()
OverflowError: Overflow occurred in npy_datetimestruct_to_datetime

The above exception was the direct cause of the following exception:

OutOfBoundsDatetime                       Traceback (most recent call last)
Cell In[7], line 1
----> 1 seattle_air.loc["06-01":"08-31", :]
...
OutOfBoundsDatetime: Out of bounds nanosecond timestamp: 1-06-01 00:00:00
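As the comment above suggests, there are ways around this. One possible workaround (a sketch, not the only approach): build a boolean mask from the index's `month` accessor to select June through August across all years.
# A possible workaround (a sketch): select the summer months across all years
# with a boolean mask built from the DatetimeIndex month accessor.
summer_months = seattle_air.index.month.isin([6, 7, 8])
seattle_air[summer_months]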
Visualizations with `DatetimeIndex`¶
What would this data look like if we plotted the values?
sns.relplot(seattle_air, x="Time", y="PM2.5", kind="line")
<seaborn.axisgrid.FacetGrid at 0x7a0f661be0d0>
This is a good start, but not so helpful for answering our research question about summer air quality. We can try to `groupby` each year and produce a plot for each unique year, but it'd be really nice if we could see all the years in a single plot.
seattle_air.index.year
Index([2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, ... 2022, 2022, 2022, 2022, 2022, 2022, 2022, 2022, 2022, 2022], dtype='int32', name='Time', length=43848)
# Instead of supplying "Year" as a string,
# we can instead supply an equal-length series that indicates the group value
seattle_air.groupby(seattle_air.index.year).plot()
Time
2017    Axes(0.125,0.11;0.775x0.77)
2018    Axes(0.125,0.11;0.775x0.77)
2019    Axes(0.125,0.11;0.775x0.77)
2020    Axes(0.125,0.11;0.775x0.77)
2021    Axes(0.125,0.11;0.775x0.77)
2022    Axes(0.125,0.11;0.775x0.77)
dtype: object
Ideally, we would like to see all 6 line plots together on the same axes. However, notice that the plots all maintain their original datetime information: each plot is labeled with a different year because the datetime information records `YEAR-MONTH-DAY`. Without a common or shared x-axis, it will be difficult to combine the 6 plots into one.
`DatetimeIndex` provides helpful accessors, including `day_of_year`. The day of the year is just a number, so it offers a way to align the x-axis across different years.
seattle_air.index.day_of_year
Index([96, 96, 96, 96, 96, 96, 96, 96, 96, 96, ... 96, 96, 96, 96, 96, 96, 96, 96, 96, 96], dtype='int32', name='Time', length=43848)
By combining these accessors, we can use seaborn to generate a line plot that combines each year of air quality data. Just like how `groupby` can accept a series to determine groups, seaborn plotting functions also accept a series as input whose values are used directly.
Based on the principles of visualization that we learned in the last lesson, what else can we improve about this line plot?
grid = sns.relplot(
seattle_air,
x=seattle_air.index.day_of_year,
y="PM2.5",
hue=seattle_air.index.year,
kind="line",
errorbar=None, # Much faster when we don't generate error bars
)
# When column name is not specified, the index name "Time" is used
grid.set(xlabel="Day of Year")
grid.legend.set(title="Year");
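One further refinement we might consider (a sketch, not part of the original plot): raw day-of-year ticks are hard to read, so we could relabel the x-axis with the day-of-year of each month's first day. The 2021 dates here are an assumption for illustration; leap years shift later months by one day.
import calendar
# A sketch: label the x-axis by month instead of raw day-of-year numbers,
# using a non-leap year (2021) to compute each month's starting day of year.
month_starts = pd.date_range("2021-01-01", periods=12, freq="MS").day_of_year
grid.set(xticks=list(month_starts), xticklabels=calendar.month_abbr[1:])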
What's in a `NaN`?¶
If you look closely at the 6 plots for each year's data, you'll notice that there are some gaps in the dataset. The dataset has missing values that are marked `NaN`. Let's replace missing values using linear interpolation, which examines neighboring values to replace `NaN` values with best estimates.
missing_values = seattle_air["PM2.5"].isna()
# Show the missing values
seattle_air[missing_values]
| Time | PM2.5 |
|---|---|
| 2017-04-07 07:00:00 | NaN |
| 2017-04-17 06:00:00 | NaN |
| 2017-04-17 07:00:00 | NaN |
| 2017-04-17 09:00:00 | NaN |
| 2017-04-28 09:00:00 | NaN |
| ... | ... |
| 2022-02-28 05:00:00 | NaN |
| 2022-03-14 05:00:00 | NaN |
| 2022-03-15 12:00:00 | NaN |
| 2022-03-15 13:00:00 | NaN |
| 2022-03-28 05:00:00 | NaN |
789 rows × 1 columns
seattle_air = seattle_air.interpolate()
# Show only the previously-missing values
seattle_air[missing_values]
| Time | PM2.5 |
|---|---|
| 2017-04-07 07:00:00 | 10.950000 |
| 2017-04-17 06:00:00 | 9.466667 |
| 2017-04-17 07:00:00 | 8.633333 |
| 2017-04-17 09:00:00 | 6.800000 |
| 2017-04-28 09:00:00 | 6.000000 |
| ... | ... |
| 2022-02-28 05:00:00 | 4.750000 |
| 2022-03-14 05:00:00 | 5.300000 |
| 2022-03-15 12:00:00 | 5.100000 |
| 2022-03-15 13:00:00 | 4.400000 |
| 2022-03-28 05:00:00 | 7.600000 |
789 rows × 1 columns
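To see exactly what `interpolate` does, here's a minimal sketch with hypothetical values: a single missing hour is filled with the midpoint of its neighbors, and a longer gap is spread evenly between the surrounding observations.
import numpy as np
# A minimal sketch (hypothetical values): one missing hour becomes the
# midpoint of its neighbors (11.0); a two-hour gap is filled evenly (10.0, 8.0).
toy = pd.Series(
    [10.0, np.nan, 12.0, np.nan, np.nan, 6.0],
    index=pd.date_range("2021-06-01", periods=6, freq="h"),
)
toy.interpolate()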
But why were these values `NaN` in the first place? A few years ago, I called the Puget Sound Clean Air Agency and waited on the line to speak to a data analyst. They provided several potential reasons why a row might be `NaN`:
- Regular, biweekly maintenance
- Break-in and vandalism issues
- Internet connectivity issues
- Regulatory calibration requirements
- Equipment relocation, changes, or upgrades
Furthermore, they pointed out that the air quality sensors are calibrated for lower concentrations, so sensors may underreport values during times when there are higher concentrations of particulate matter.
These stories and the context that situate our data inform its data setting: the technical and the human processes that affect what information is captured in the data collection process and how the data are then structured. Sometimes, the creators of a dataset share some of this data setting with you in the form of a datasheet. In "Datasheets for Datasets," Timnit Gebru et al. (2018) propose many questions that should be answered when describing a dataset, which they categorize into questions about:
- Motivation: why the dataset was created
- Composition: what the data represents and how values relate to each other
- Collection process: how the data was collected
- Preprocessing/cleaning/labeling: how the data was converted into its current form
- Uses: what the data should and should not be used for
- Distribution: how the data will be shared with other parties
- Maintenance: how the data will be maintained, hosted, and updated over time
Even when datasets are documented, there may yet be stories behind each and every value in the dataset that might only be surfaced through discussion with the dataset creators or subject matter experts. Data is local, even when it doesn't seem like it, because it is shaped by the practices of the people who created it.
Consider context¶
How do we put data locality and data setting into practice? Chapter 6 of Data Feminism by Catherine D'Ignazio and Lauren Klein, titled "The Numbers Don't Speak for Themselves," offers a call to action to consider context in our work.
Instead of taking data at face value and looking toward future insights, data scientists can first interrogate the context, limitations, and validity of the data under use. In other words: consider the cooking process that produces "raw" data.
How do we communicate this context—the underlying data setting—to readers? Consider these two plots, which only differ in their titles and subtitles.
To explain the difference between the two visualizations, Catherine and Lauren write:
Which one of these graphics would you create? Which one should you create? The first—Mental Health in Jail—represents the typical way that the results of a data analysis are communicated. The title appears to be neutral and free of bias. This is a graphic about rates of mental illness diagnosis of incarcerated people broken down by race and ethnicity. The people are referred to as inmates, the language that the study used. The title does not mention race or ethnicity, or racism or health inequities, nor does the title point to what the data mean. But this is where additional questions about context come in. Are you representing only the four numbers that we see in the chart? Or are you representing the context from which they emerged?
The study that produced these numbers contains convincing evidence that we should distrust diagnosis numbers due to racial and ethnic discrimination. The first chart does not simply fail to communicate that but also actively undermines that main finding of the research. Moreover, the language used to refer to people in jail as inmates is dehumanizing, particularly in the context of the epidemic of mass incarceration in the United States. So, consider the second chart: Racism in Jail: People of Color Less Likely to Get Mental Health Diagnosis. This title offers a frame for how to interpret the numbers along the lines of the study from which they emerged. The research study was about racial disparities, so the title and content of this chart are about racial disparities. The people behind the numbers are people, not inmates. In addition, and crucially, the second chart names the forces of oppression that are at work: racism in prison.
Close reading¶
Often, we think of statistical data analysis and visualization as a type of distant reading of the data—one that aims to abstract and zoom out to the furthest (and, some might argue, most complete) view of the data. Yet, Yanni Loukissas argues for a close reading of the data in his final take-home message at the end of his book:
Treat data as a point of contact, a landing, an opportunity to get closer, to learn to care about a subject, or the people and places beyond data. Do not mistake the availability of data as permission to remain at a distance.
In "this is an teenager", an interactive visualization by data journalist and professor Alvin Chang, we follow the lives of teenagers starting from 1997 through 2021 to see how adverse childhood experiences affect their life outcomes. How does Alvin take a closer reading of the National Longitudinal Survey of Youth dataset?
%%html
<iframe width="640" height="360" src="https://www.youtube-nocookie.com/embed/fKv1Mixv0Hk" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" allowfullscreen></iframe>