Data Frames¶

Earlier, we learned how to process CSV files using the list of dictionaries representation. This week, we will introduce pandas, the most commonly used Python data programming tool and one that we'll be using for the remainder of the course. By the end of this lesson, students will be able to:

  • Import values and functions from another module using import and from statements.
  • Select individual columns from a pandas DataFrame and apply element-wise computations.
  • Filter a pandas DataFrame or Series with a boolean series.

The last two learning objectives are particularly ambitious: they will take much more deliberate practice before you feel comfortable.

In [1]:
import doctest
import io
import pandas as pd

Import statements¶

We've been writing some curious lines of code called import statements, which let us use code written by other people in separate modules.

The simplest syntax uses the import statement to import a module like doctest. We can then call the definitions within that module like doctest.testmod() to run all our doctests.

import doctest
doctest.testmod()

We can also import a module and rename it to a more convenient shorthand, like pd instead of pandas. We can then call the definitions within the module like pd.read_csv(path).to_dict("records") to read a CSV file and then convert it into our list of dictionaries ("records") representation.

import pandas as pd
earthquakes = pd.read_csv(path).to_dict("records")

Finally, Python can also import just a single definition from a module. Here, we ask Python to import only the Counter dictionary type from the collections module.

from collections import Counter

def count_words(path):
    # Count how many times each whitespace-separated word appears in the file
    with open(path) as f:
        return Counter(f.read().split())

A common practice in notebooks is to add your imports to the first code cell at the top of your notebook so that someone who's running your notebook will know what modules they will need to install to run the code.

In [2]:
from doctest import testmod
testmod()
Out[2]:
TestResults(failed=0, attempted=0)

Creating a Data Frame¶

To create a DataFrame, call pd.read_csv(path). In addition to reading CSV data from a file, pd.read_csv also accepts an io.StringIO object, which lets us read CSV data directly from a Python string: handy for specifying small datasets in a code cell.

In [4]:
csv = """
Name,Hours
Anna,20
Iris,15
Abiy,10
Gege,12
"""

staff = pd.read_csv(io.StringIO(csv))
staff
Out[4]:
   Name  Hours
0  Anna     20
1  Iris     15
2  Abiy     10
3  Gege     12
In [5]:
staff.to_dict("records")
Out[5]:
[{'Name': 'Anna', 'Hours': 20},
 {'Name': 'Iris', 'Hours': 15},
 {'Name': 'Abiy', 'Hours': 10},
 {'Name': 'Gege', 'Hours': 12}]

The index of a DataFrame appears in bold along the left side (one label per row) and defines the keys for accessing values in the data frame. Like keys in a dictionary, the keys in an index should be unique.
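pandas won't actually stop you from building an index with duplicate labels, though, so it can be worth checking. A minimal sketch using the staff DataFrame above:

staff.index.is_unique  # True: the default labels 0, 1, 2, 3 are all distinct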

By default, an integer index is provided, but you'll often want to set a more meaningful index. We can use df.set_index(colname) to return a new DataFrame whose index is the given column, which will be handy later. In the example below, we assume that each TA has a unique name, though this assumption has severe limits in practice: people can change their names, or we might eventually run a course where two people share the same name.

In [6]:
staff = staff.set_index("Name")
staff
Out[6]:
      Hours
Name
Anna     20
Iris     15
Abiy     10
Gege     12
In [10]:
type(staff)
Out[10]:
pandas.core.frame.DataFrame
In [7]:
staff.loc["Iris"]
Out[7]:
Hours    15
Name: Iris, dtype: int64

Column indexers¶

In pandas, tabular data is represented by a DataFrame as shown above. Unlike the list of dictionaries format that required us to write a loop to access the name of every TA, pandas provides special syntax to help us achieve this result.

In [8]:
staff.index
Out[8]:
Index(['Anna', 'Iris', 'Abiy', 'Gege'], dtype='object', name='Name')
In [11]:
type(staff.index)
Out[11]:
pandas.core.indexes.base.Index
In [9]:
staff["Hours"]
Out[9]:
Name
Anna    20
Iris    15
Abiy    10
Gege    12
Name: Hours, dtype: int64
In [12]:
type(staff["Hours"])
Out[12]:
pandas.core.series.Series

df["Hours"] returns a pandas object called a Series that represents a single column or row of a DataFrame. A Series is very similar to a list from Python, but has several convenient functions for data analysis.

  • s.mean() returns the average value in s.
  • s.min() returns the minimum value in s.
    • s.idxmin() returns the label of the minimum value in s.
  • s.max() returns the maximum value in s.
    • s.idxmax() returns the label of the maximum value in s.
  • s.unique() returns a NumPy array of all the unique values in s.
  • s.describe() returns a new Series containing descriptive statistics for the data in s.
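For instance, here's a quick sketch applying a few of these functions to the staff hours from above, with outputs shown as comments:

hours = staff["Hours"]
hours.mean()    # 14.25
hours.idxmax()  # 'Anna', the label for the maximum value 20
hours.unique()  # array([20, 15, 10, 12])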
In [13]:
staff["Hours"].describe()
Out[13]:
count     4.000000
mean     14.250000
std       4.349329
min      10.000000
25%      11.500000
50%      13.500000
75%      16.250000
max      20.000000
Name: Hours, dtype: float64
In [15]:
staff["Hours"].min()
Out[15]:
10

Defining a more meaningful index allows us to select specific values from a series just by referring to the desired key.

In [17]:
staff["Hours"]["Iris"]
Out[17]:
15
In [23]:
staff.loc["Iris"]
Out[23]:
Hours    15
Name: Iris, dtype: int64
In [19]:
staff.loc["Iris"]["Hours"]
Out[19]:
15

How can we compute the range of TA hours by calling the min() and max() functions? For this example dataset, the range should be 10: Anna has the most hours (20) and Abiy has the fewest (10), for a difference of 10.

In [26]:
staff["Hours"].max() - staff["Hours"].min()
Out[26]:
10
In [27]:
staff.max() - staff.min()
Out[27]:
Hours    10
dtype: int64
In [28]:
type(staff.max() - staff.min())
Out[28]:
pandas.core.series.Series
In [29]:
staff.max()
Out[29]:
Hours    20
dtype: int64

Element-wise operations¶

Let's consider a slightly more complex dataset that has more columns, like this made-up emissions dataset. The pd.read_csv function also includes an index_col parameter that you can use to set the index while reading the dataset.

In [30]:
csv = """
City,Country,Emissions,Population
New York,USA,200,1500
Paris,France,48,42
Beijing,China,300,2000
Nice,France,40,60
Seattle,USA,100,1000
"""

emissions = pd.read_csv(io.StringIO(csv), index_col="City")
# This is like writing set_index("City")
emissions
Out[30]:
          Country  Emissions  Population
City
New York      USA        200        1500
Paris      France         48          42
Beijing     China        300        2000
Nice       France         40          60
Seattle       USA        100        1000

pandas can help us answer questions like computing the emissions per capita: emissions divided by population for each city.

In [31]:
emissions["Emissions"] / emissions["Population"]
Out[31]:
City
New York    0.133333
Paris       1.142857
Beijing     0.150000
Nice        0.666667
Seattle     0.100000
dtype: float64

Element-wise operations also work if one of the operands is a single value rather than a Series. For example, the following cell adds 4 to each city population.

In [32]:
emissions["Population"] + 4
Out[32]:
City
New York    1504
Paris         46
Beijing     2004
Nice          64
Seattle     1004
Name: Population, dtype: int64
In [33]:
emissions
Out[33]:
          Country  Emissions  Population
City
New York      USA        200        1500
Paris      France         48          42
Beijing     China        300        2000
Nice       France         40          60
Seattle       USA        100        1000
In [34]:
emissions["Population"] = emissions["Population"] + 4
emissions
Out[34]:
          Country  Emissions  Population
City
New York      USA        200        1504
Paris      France         48          46
Beijing     China        300        2004
Nice       France         40          64
Seattle       USA        100        1004

Row indexers¶

All the above operations apply to every row in the original data frame. What if our questions involve returning just a few rows, like filtering the data to identify only the cities that have at least 200 emissions?

In [36]:
high_emissions = emissions["Emissions"] >= 200
emissions[high_emissions]
Out[36]:
          Country  Emissions  Population
City
New York      USA        200        1504
Beijing     China        300        2004

This new syntax shows how we can filter a dataframe by indexing it with a boolean series. PandasTutor shows you how the above output is determined by selecting only the rows that are True in the following boolean series.

In [37]:
high_emissions
Out[37]:
City
New York     True
Paris       False
Beijing      True
Nice        False
Seattle     False
Name: Emissions, dtype: bool

Multiple conditions can be combined using the following element-wise operators.

  • & performs an element-wise and operation.
  • | performs an element-wise or operation.
  • ~ performs an element-wise not operation.

Because of Python's operator precedence, the element-wise operators bind more tightly than comparisons like == and >=, so parentheses are required around each comparison when combining boolean series in a single expression.
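For example, here's a sketch that combines two conditions with the | operator, with parentheses around each comparison:

# Cities that are in the USA or in France
emissions[(emissions["Country"] == "USA") | (emissions["Country"] == "France")]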

In [40]:
emissions[~high_emissions & (emissions["Country"] == "USA")]
Out[40]:
         Country  Emissions  Population
City
Seattle      USA        100        1004

Write a one-line pandas expression that returns all the cities in France that have a population greater than 50 from the emissions dataset.

In [41]:
emissions[(emissions["Country"] == "France") & (emissions["Population"] > 50)]
Out[41]:
      Country  Emissions  Population
City
Nice   France         40          64
In [43]:
emissions[emissions["Country"] == "France"] and emissions[emissions["Population"] > 50]
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/tmp/ipykernel_347/3398251716.py in ?()
----> 1 emissions[emissions["Country"] == "France"] and emissions[emissions["Population"] > 50]

/opt/conda/lib/python3.11/site-packages/pandas/core/generic.py in ?(self)
   1575     @final
   1576     def __nonzero__(self) -> NoReturn:
-> 1577         raise ValueError(
   1578             f"The truth value of a {type(self).__name__} is ambiguous. "
   1579             "Use a.empty, a.bool(), a.item(), a.any() or a.all()."
   1580         )

ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
In [46]:
emissions["Country"] == "France" & emissions["Population"] > 50
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
File /opt/conda/lib/python3.11/site-packages/pandas/core/ops/array_ops.py:362, in na_logical_op(x, y, op)
    353 try:
    354     # For exposition, write:
    355     #  yarr = isinstance(y, np.ndarray)
   (...)
    360     # Then Cases where this goes through without raising include:
    361     #  (xint or xbool) and (yint or bool)
--> 362     result = op(x, y)
    363 except TypeError:

File /opt/conda/lib/python3.11/site-packages/pandas/core/roperator.py:54, in rand_(left, right)
     53 def rand_(left, right):
---> 54     return operator.and_(right, left)

TypeError: ufunc 'bitwise_and' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
File /opt/conda/lib/python3.11/site-packages/pandas/core/ops/array_ops.py:376, in na_logical_op(x, y, op)
    375 try:
--> 376     result = libops.scalar_binop(x, y, op)
    377 except (
    378     TypeError,
    379     ValueError,
   (...)
    382     NotImplementedError,
    383 ) as err:

File ops.pyx:180, in pandas._libs.ops.scalar_binop()

ValueError: Buffer dtype mismatch, expected 'Python object' but got 'long'

The above exception was the direct cause of the following exception:

TypeError                                 Traceback (most recent call last)
Cell In[46], line 1
----> 1 emissions["Country"] == "France" & emissions["Population"] > 50

File /opt/conda/lib/python3.11/site-packages/pandas/core/ops/common.py:76, in _unpack_zerodim_and_defer.<locals>.new_method(self, other)
     72             return NotImplemented
     74 other = item_from_zerodim(other)
---> 76 return method(self, other)

File /opt/conda/lib/python3.11/site-packages/pandas/core/arraylike.py:74, in OpsMixin.__rand__(self, other)
     72 @unpack_zerodim_and_defer("__rand__")
     73 def __rand__(self, other):
---> 74     return self._logical_method(other, roperator.rand_)

File /opt/conda/lib/python3.11/site-packages/pandas/core/series.py:6130, in Series._logical_method(self, other, op)
   6127 lvalues = self._values
   6128 rvalues = extract_array(other, extract_numpy=True, extract_range=True)
-> 6130 res_values = ops.logical_op(lvalues, rvalues, op)
   6131 return self._construct_result(res_values, name=res_name)

File /opt/conda/lib/python3.11/site-packages/pandas/core/ops/array_ops.py:454, in logical_op(left, right, op)
    450 else:
    451     # i.e. scalar
    452     is_other_int_dtype = lib.is_integer(rvalues)
--> 454 res_values = na_logical_op(lvalues, rvalues, op)
    456 # For int vs int `^`, `|`, `&` are bitwise operators and return
    457 #   integer dtypes.  Otherwise these are boolean ops
    458 if not (left.dtype.kind in "iu" and is_other_int_dtype):

File /opt/conda/lib/python3.11/site-packages/pandas/core/ops/array_ops.py:385, in na_logical_op(x, y, op)
    377         except (
    378             TypeError,
    379             ValueError,
   (...)
    382             NotImplementedError,
    383         ) as err:
    384             typ = type(y).__name__
--> 385             raise TypeError(
    386                 f"Cannot perform '{op.__name__}' with a dtyped [{x.dtype}] array "
    387                 f"and scalar of type [{typ}]"
    388             ) from err
    390 return result.reshape(x.shape)

TypeError: Cannot perform 'rand_' with a dtyped [int64] array and scalar of type [bool]
In [47]:
(emissions["Country"] == "France") & (emissions["Population"] > 50)
Out[47]:
City
New York    False
Paris       False
Beijing     False
Nice         True
Seattle     False
dtype: bool
In [48]:
emissions["Country"] == ("France" & emissions["Population"]) > 50
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
File /opt/conda/lib/python3.11/site-packages/pandas/core/ops/array_ops.py:362, in na_logical_op(x, y, op)
    353 try:
    354     # For exposition, write:
    355     #  yarr = isinstance(y, np.ndarray)
   (...)
    360     # Then Cases where this goes through without raising include:
    361     #  (xint or xbool) and (yint or bool)
--> 362     result = op(x, y)
    363 except TypeError:

File /opt/conda/lib/python3.11/site-packages/pandas/core/roperator.py:54, in rand_(left, right)
     53 def rand_(left, right):
---> 54     return operator.and_(right, left)

TypeError: ufunc 'bitwise_and' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
File /opt/conda/lib/python3.11/site-packages/pandas/core/ops/array_ops.py:376, in na_logical_op(x, y, op)
    375 try:
--> 376     result = libops.scalar_binop(x, y, op)
    377 except (
    378     TypeError,
    379     ValueError,
   (...)
    382     NotImplementedError,
    383 ) as err:

File ops.pyx:180, in pandas._libs.ops.scalar_binop()

ValueError: Buffer dtype mismatch, expected 'Python object' but got 'long'

The above exception was the direct cause of the following exception:

TypeError                                 Traceback (most recent call last)
Cell In[48], line 1
----> 1 emissions["Country"] == ("France" & emissions["Population"]) > 50

File /opt/conda/lib/python3.11/site-packages/pandas/core/ops/common.py:76, in _unpack_zerodim_and_defer.<locals>.new_method(self, other)
     72             return NotImplemented
     74 other = item_from_zerodim(other)
---> 76 return method(self, other)

File /opt/conda/lib/python3.11/site-packages/pandas/core/arraylike.py:74, in OpsMixin.__rand__(self, other)
     72 @unpack_zerodim_and_defer("__rand__")
     73 def __rand__(self, other):
---> 74     return self._logical_method(other, roperator.rand_)

File /opt/conda/lib/python3.11/site-packages/pandas/core/series.py:6130, in Series._logical_method(self, other, op)
   6127 lvalues = self._values
   6128 rvalues = extract_array(other, extract_numpy=True, extract_range=True)
-> 6130 res_values = ops.logical_op(lvalues, rvalues, op)
   6131 return self._construct_result(res_values, name=res_name)

File /opt/conda/lib/python3.11/site-packages/pandas/core/ops/array_ops.py:454, in logical_op(left, right, op)
    450 else:
    451     # i.e. scalar
    452     is_other_int_dtype = lib.is_integer(rvalues)
--> 454 res_values = na_logical_op(lvalues, rvalues, op)
    456 # For int vs int `^`, `|`, `&` are bitwise operators and return
    457 #   integer dtypes.  Otherwise these are boolean ops
    458 if not (left.dtype.kind in "iu" and is_other_int_dtype):

File /opt/conda/lib/python3.11/site-packages/pandas/core/ops/array_ops.py:385, in na_logical_op(x, y, op)
    377         except (
    378             TypeError,
    379             ValueError,
   (...)
    382             NotImplementedError,
    383         ) as err:
    384             typ = type(y).__name__
--> 385             raise TypeError(
    386                 f"Cannot perform '{op.__name__}' with a dtyped [{x.dtype}] array "
    387                 f"and scalar of type [{typ}]"
    388             ) from err
    390 return result.reshape(x.shape)

TypeError: Cannot perform 'rand_' with a dtyped [int64] array and scalar of type [bool]

Selection by label¶

To summarize what we've learned so far, pandas provides both column indexers and row indexers accessible through the square brackets notation.

  • df[colname] returns the corresponding Series from the df.
  • df[boolean_series] returns a new DataFrame containing just the rows marked True in the boolean_series.

These two access methods are special cases of the more general df.loc[rows, columns] indexer, which provides more functionality. For example, we can select just the city populations for cities with at least 200 emissions and visualize the procedure in PandasTutor.

In [49]:
staff.loc["Iris"]["Hours"]
Out[49]:
15
In [50]:
staff.loc["Iris", "Hours"]
Out[50]:
15
In [51]:
emissions.loc[high_emissions, "Population"]
Out[51]:
City
New York    1504
Beijing     2004
Name: Population, dtype: int64
In [52]:
emissions.loc[~high_emissions & (emissions["Country"] == "USA"), "Population"]
Out[52]:
City
Seattle    1004
Name: Population, dtype: int64

Whether a single value, a 1-dimensional Series, or a 2-dimensional DataFrame is returned depends on the selection.

Notice that label-based slicing includes the endpoint, unlike slicing a Python list.

In [53]:
emissions.loc[high_emissions, "Country":"Population"]
Out[53]:
          Country  Emissions  Population
City
New York      USA        200        1504
Beijing     China        300        2004
In [54]:
emissions.loc[:, ["Country", "Emissions"]]
Out[54]:
          Country  Emissions
City
New York      USA        200
Paris      France         48
Beijing     China        300
Nice       France         40
Seattle       USA        100
In [56]:
emissions.loc["Paris", ["Country", "Population"]]
Out[56]:
Country       France
Population        46
Name: Paris, dtype: object

Returning to our prior staff hours example, we can get Iris's hours by using a single df.loc[index, columns] access rather than two separate accesses. This convenient syntax only works when we've specified a meaningful index.

In [57]:
staff.loc["Iris", "Hours"]
Out[57]:
15

Practice: Largest earthquake place (Pandas)¶

Previously, we learned about two ways to write Python code to read earthquakes as a list of dictionaries and return the name of the place with the largest-magnitude earthquake.

In [59]:
def largest_earthquake_place(path):
    """
    Returns the name of the place with the largest-magnitude earthquake in the specified CSV file.

    >>> largest_earthquake_place("earthquakes.csv")
    'Northern Mariana Islands'
    """
    earthquakes = pd.read_csv(path).to_dict("records")
    # List of dictionaries format

    max_name = None
    max_magn = None
    for earthquake in earthquakes:
        if max_magn is None or earthquake["magnitude"] > max_magn:
            max_name = earthquake["name"]
            max_magn = earthquake["magnitude"]
    return max_name


doctest.run_docstring_examples(largest_earthquake_place, globals())

How might we convert this program to solve the problem directly with a DataFrame instead?

In [68]:
def largest_earthquake_place(path):
    """
    Returns the name of the place with the largest-magnitude earthquake in the specified CSV file.

    >>> largest_earthquake_place("earthquakes.csv")
    'Northern Mariana Islands'
    """
    earthquakes = pd.read_csv(path, index_col="id")
    return earthquakes.loc[earthquakes["magnitude"].idxmax(), "name"]


doctest.run_docstring_examples(largest_earthquake_place, globals())
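The same question can also be answered with the boolean filtering we learned earlier. One subtle difference: idxmax returns only the first matching label when there is a tie, whereas filtering keeps every tied row. A sketch, reusing the earthquakes DataFrame from the function body:

largest = earthquakes["magnitude"].max()
earthquakes[earthquakes["magnitude"] == largest]["name"]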

Optional: Selection by position¶

Everything we've learned so far is an example of label-based indexing. But it turns out there's another system of position-based indexing that is also available. Let's compare the 4 approaches.

  • df[colname] returns the corresponding Series from the df.
    • df[[col1, col2, ...]] returns a new DataFrame containing the corresponding columns from the df.
  • df[boolean_series] returns a new DataFrame containing just the rows marked True in the boolean_series.
  • df.loc[index, columns] returns a single value, a Series, or a DataFrame for the label-based selection from the df.
  • df.iloc[rows, columns] returns a single value, a Series, or a DataFrame for the position-based selection from the df.

Label-based indexing uses the bolded row and column labels, while position-based indexing uses purely integer positions. Slicing by position excludes the endpoint, just like slicing a Python list. Position-based indexing is most useful when you have a position-based query that can't be easily specified using only label-based indexing. For example, we might know that we want to select just the rightmost two columns from a dataframe without knowing the column names.

In [ ]:
emissions.iloc[:, -2:]
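As a quick sketch of the endpoint difference, using the emissions data from above (both expressions select the same two rows):

emissions.iloc[0:2]                # position-based: rows 0 and 1, endpoint excluded
emissions.loc["New York":"Paris"]  # label-based: both endpoints included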

We generally won't use position-based selections in this course, but you may run into code that uses them elsewhere.