The content for this lesson is adapted from material by Hunter Schafer.

Objectives

Last week we started to see how to process CSV files, using the list of dictionaries representation to answer questions about the data. While the list of dictionaries format was helpful, it was still a bit tedious to write all that code. This week, we will introduce pandas, a popular library that supports data scientists. A library is code someone else wrote and shared with you to help solve problems. By the end of this lesson, students will be able to:

  1. Import values and functions from another module using import and from statements.
  2. Select individual columns from a pandas DataFrame and apply element-wise computations.
  3. Filter a pandas DataFrame or Series with a mask.

The last two learning objectives are particularly ambitious. It will probably take more time, practice, and opportunities to engage with them before you feel fully comfortable. But we’re just at the start of learning about data frames.

Setting up

To follow along with the code examples in this lesson, please download the files here:

Imports

When we write code in a Python file like main.py, we’re actually defining a Python module. You can treat the words “module” and “file” interchangeably.

  • A module can be executed as a standalone program from the terminal, such as python main.py.
  • A module can be imported so that another module can access its values and function definitions.

Main-method pattern

Remember the main-method pattern? The main-method pattern ensures that certain code only runs when the file is executed as a standalone program, as opposed to when it’s imported from another module. More details in the “Why the main-method pattern?” section below.

Importing lets one module use values or functions defined in another module. You have already been using this on your homework! We defined a module cse163_utils, and in order to use the function parse defined in that module, we imported it.

There are three primary ways to import in Python that we will explore in this slide. For the following examples, assume we have defined the module module_a as the file module_a.py.

# Contents of: module_a.py
def fun1() -> None:
    print("Calling a's fun1")
    print("Ending  a's fun1")


def fun2() -> None:
    print("Calling a's fun2")
    fun1()
    print("Ending  a's fun2")

Our goal is to call fun2 inside another module, module_b. To do this, we need to import the module to use its functions.

import module_a

The simplest syntax uses the import statement to import a module. The following snippet shows the contents of module_b, a short program that uses fun2 defined in module_a.

# Contents of: module_b.py
import module_a 


def fun1() -> None:
    print("Calling b's fun1")
    print("Ending  b's fun1")


def fun2() -> None:
    print("Calling b's fun2")
    print("Ending  b's fun2")


def main():
    fun2()
    module_a.fun2()


if __name__ == '__main__':
    main()

When you import a module, you are importing the whole thing, including all of the code and values defined within it. To keep things organized, Python puts all these values defined in module_a in a different namespace. All of the code in module_a resides in its own namespace while all of the code in module_b resides in another namespace. Notice how we call module_a.fun2() in order to access fun2() in module_a rather than the fun2() that we defined right above the main method.

Food for thought: What is the output of running python module_b.py?

import module_a as m

A minor alternative to the above is to import a module and also define a shorthand name for it. In this case, we assigned module_a the shorthand name m, so we can just say m.fun2(). The output and behavior of the program are exactly the same.

# Contents of: module_b.py
import module_a as m


def fun1() -> None:
    print("Calling b's fun1")
    print("Ending b's fun1")


def fun2() -> None:
    print("Calling b's fun2")
    print("Ending b's fun2")


def main():
    fun2()
    m.fun2()  # Notice m instead of module_a


if __name__ == '__main__':
    main()

Food for thought: When might it be preferable to use this syntax instead of importing the module under its given name?

from module_a import fun2

Sometimes, we only want to use a few functions from another module. Python provides another syntax that lets you import specific functions: from module_a import fun2 only imports the function fun2 from module_a. When this syntax is used, Python adds fun2 directly to your current namespace. This means you don’t need to call it with module_a.fun2(); you can just say fun2().

# Contents of: module_b.py
from module_a import fun2


def main():
    fun2() # Calling module_a's fun2


if __name__ == '__main__':
    main()

What would happen if we defined a fun2 in module_b.py? How would Python know which fun2 to call if there was one that came from module_a and one from module_b when both share the same namespace?

Python treats these import and function definitions exactly like assignment statements, so the value of a name always reflects its most recent assignment.
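
For example, here is a minimal sketch (the module name module_c is made up for illustration, reusing the module_a defined above) showing that a later definition shadows an earlier import of the same name:

# Contents of a hypothetical module_c.py
from module_a import fun2  # binds the name fun2 to module_a's function


def fun2() -> None:  # re-binds the name fun2, shadowing the imported one
    print("Calling c's fun2")


def main():
    fun2()  # prints "Calling c's fun2": the most recent assignment wins


if __name__ == '__main__':
    main()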

Why the main-method pattern?

We use the main-method pattern so that others can import from our code without having to worry about running all of our analysis.

Imagine you’re writing a program to do some sort of analysis, and it takes about 2 hours to run from start to end. You realize that some helper functions in this program would be super useful for another project you’re working on, so you decide to import those functions (using one of the ways shown above) to use them from another module. But if you didn’t use the main-method pattern, part of the import process would be actually running your 2-hour analysis! That means you would have to wait 2 hours every time you run your new project, just to import a function! 😱

By using the main-method pattern, our modules run their analysis when we want them to (by using the python command or pressing the Run button) but not when we just want to import a function from them!
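
As a minimal sketch (the module and function names here are hypothetical), this is what that structure looks like:

# Contents of a hypothetical analysis.py
def helpful_function() -> None:
    print('Doing something reusable')


def main():
    print('Running the analysis...')  # imagine this takes 2 hours


if __name__ == '__main__':
    # This block only runs for `python analysis.py`, NOT for `import analysis`,
    # so another module can import helpful_function without the 2-hour wait.
    main()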

Pandas Context

pandas, not 🐼s! pandas is a library that we will use to simplify much of our CSV processing. When it comes to data science in Python, pandas is the most popular way of interacting with your data. (The name derives from “panel data,” which is a type of longitudinal data where you might have multiple observations for the same individuals.)

A library is some code that someone else wrote that you can use in your code so you don’t have to write everything from scratch. pandas is one such library that we will learn this quarter, but we will learn many more before the quarter is done!

pandas uses several advanced Python features. There will be a lot of new concepts, but also a lot of new syntax. It may feel very different from all (or almost all) of your programming experience here at UW!

The pandas library could be described as a declarative-style library. Rather than writing if statements, while loops, and for loops to process your data element-by-element, pandas wraps all of those operations inside its own functions and syntax. Like we saw with the count-unique problem that we could solve as len(set(...)), learning pandas also means learning a new way of composing programs in this manner.
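
For instance, here is a small sketch of that contrast, using the count-unique problem:

# Imperative style: spell out each step with a loop
values = ['a', 'b', 'a', 'c']
seen = set()
for v in values:
    seen.add(v)
print(len(seen))         # 3

# Declarative style: describe the result you want
print(len(set(values)))  # 3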

Things will probably be overwhelming at first. We’re introducing new ideas and new syntax at the same time. You will get better with practice; that’s why our in-class guided practice is so important!

Like our approach to learning Python, we aren’t going to show you everything you can possibly do with the library. Instead, we’ll show you foundational patterns and examples for adapting to different problems. And for problems that don’t fit the taught examples, we hope that this will still bootstrap your understanding so that you know what and how to search for relevant information online.

So after reading this, we recommend trying to make your own reference sheet of the ideas and syntax you learned, and actually use that reference sheet when trying to solve practice problems. Reorganizing and reconfiguring the ideas by writing them down can really help. If you find your reference sheet was missing something important that made it difficult to solve a particular problem, go back and add that to the reference sheet. It’s your external brain to help you make sense of how everything fits together!

Pandas Tutorial

Jupyter Notebook

We highly recommend following along with the code examples in lesson6.ipynb rather than just reading the content here!

The first thing we’ll have to do is import the pandas library. The convention is to abbreviate it to pd:

import pandas as pd

Next, we will load the data from the CSV file tas.csv that has the example data we were working with before. We will save it in a variable called df (short for data frame, a common pandas term). We do this with a provided function from pandas called read_csv.

df = pd.read_csv('tas.csv')
df # to show the df in Jupyter

Notice that this shows the CSV in a tabular format! What is df? It’s a pandas object called a DataFrame which stores a table of values, much like an Excel table.

Notice on the top row, it shows the name of the columns (Name and Salary) and on the left-most side, it shows an index for each row (0, 1, and 2).

DataFrames are powerful because they provide lots of ways to access and perform computations on your data without you having to write much code!

Accessing a Column

For example, you can get all of the TAs’ names with the following call.

df['Name']

df['Name'] returns another pandas object called a Series that represents a single column or row of a DataFrame. A Series is very similar to a list from Python, but has many extra features that we will explore later.

Students sometimes get a little confused because this looks like df is a dict and it is trying to access a key named Name. This is not the case! One of the reasons Python is so powerful is it lets people who program libraries “hook into” the syntax of the language to make their own custom meaning of the [] syntax! df in this cell is really this special object defined by pandas called a DataFrame.

Pandas Functions

Now, pandas isn’t useful just because it lets you access this data conveniently; it also lets you perform computations on the data.

A Series object has many methods you can call on it to perform computations. Here is a list of some of the most useful ones:

  • sum: Calculates the sum of the Series
  • mean: Calculates the average value of the Series
  • min: Calculates the minimum value of the Series
  • max: Calculates the maximum value of the Series
  • idxmin: Calculates the index of the minimum value of the Series
  • idxmax: Calculates the index of the maximum value of the Series
  • count: Calculates the number of values in the Series
  • unique: Returns a new Series with all the unique values from the Series.
  • And many more!

For example, if I wanted to compute the average Salary of the TAs, I would write:

average_salary = df['Salary'].mean()
average_salary

Reminder: Types matter

When first learning pandas, it’s easy to mix up DataFrame and Series.

  • A DataFrame is a 2-dimensional structure (it has rows and columns like a grid).
  • A Series is 1-dimensional (it only has “one direction,” like a single row or a single column).

When you access a single column (or as we will see later, a single row) of a DataFrame, it returns a Series.
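
You can check this yourself with the built-in type function (using the df we loaded earlier):

type(df)          # pandas.core.frame.DataFrame
type(df['Name'])  # pandas.core.series.Series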

Element-wise Operations

For the rest of this tutorial, let’s consider a slightly more complex dataset that has a few more columns. This dataset tracks the emissions for cities around the world (but only has a few rows).

df2 = pd.read_csv('emissions.csv')
df2

If we wanted to access the emissions column, we could write:

df2['emissions']

Notice that the dtype of this Series is int64, meaning that every element in the Series is an integer.

If we wanted to access the population column, we could write:

df2['population']

One useful feature of pandas is it lets you combine values from different Series. For example, if we wanted to, we could add the values of the emissions column and the population column.

df2['emissions'] + df2['population']

Notice, this returns a new Series that represents the sum of those two columns. The first value in the Series is the sum of the first values in the two that were added, the second is the sum of the second two, etc. It does not modify any of the columns of the dataset (you will need to do an assignment to change a value).
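
As a small sketch of such an assignment (the column name total here is made up for illustration), you can save the result into a new column:

# Save the element-wise sum into a new column named 'total'
df2['total'] = df2['emissions'] + df2['population']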

Since we are performing an element-wise operation, we still need to respect concatenation rules. Taking a step back from pandas, recall that "cat" + 3 will raise a TypeError and we must instead write "cat" + str(3). The analog of casting in pandas is to use the astype method to change the dtype of a Series. For example,

df2['city'] + df2['population'].astype('str')

Here, our use of astype is telling pandas to perform an element-wise operation that casts each item in df2['population'] into a str. Notice that if you remove the call to astype in the previous code snippet, pandas will be unable to concatenate the two Series of different types.

Note: You can save a Series in a variable! It’s just like any other Python value!

These element-wise computations also work if one of the operands is a single value rather than a Series. For example, the following cell adds 4 to each of the populations. Notice this doesn’t modify the original DataFrame; it just returns a new Series with the old values plus 4.

df2['population'] + 4

Note here that the output of the Series actually tells you a bit about the values to help you out! The dtype property tells you the type of the data. In this case it uses a specialized integer type called int64, but for all intents and purposes that’s really just like an int. As a minor detail, it also stores the Name of the column the Series came from for reference.

Another useful case for something like this is to compare the values of a column to a value. For example, the following cell computes which cities have an emissions value of 200 or more. Notice that the dtype here is bool since each value is a True/False.

df2['emissions'] >= 200

Filtering Data

You might have wondered why being able to compare a Series to some value is something we deemed “useful” since it doesn’t seem like it does anything helpful. The power comes from using this bool Series to filter the DataFrame to the rows you want.

For example, what if I wanted to print the names of the cities that have an emissions of 200 or more? I can use this bool Series to filter which rows I want! The syntax looks like the following cell.

df3 = df2[df2['emissions'] >= 200]
df3['city']

That’s pretty cool how we can get this result without having to write any loops!

Notice the return value has type DataFrame, so we can then use the syntax we learned at the beginning to grab a single column from that DataFrame (thus returning a Series).

The way this works is that the indexing notation for DataFrames has special cases for which type of value you pass it.

  • If you pass it a str (e.g., df2['emissions']), it returns that column as a Series.
  • If you pass it a Series with dtype=bool (e.g., df2[df2['emissions'] >= 200]), it returns a DataFrame of all the rows that Series had a True value for!

There is no magic with this; the pandas authors just wrote an if-statement in their code to do different things based on the type provided!
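
As a rough sketch of that idea (this is not how pandas is actually written; the class below is made up for illustration), any Python class can define __getitem__ and branch on the type of whatever appears inside the brackets:

# A toy class showing how [] can mean different things based on the key's type.
class ToyFrame:
    def __getitem__(self, key):
        if isinstance(key, str):
            return 'a column (like a Series)'
        else:
            return 'filtered rows (like a DataFrame)'


tf = ToyFrame()
print(tf['emissions'])    # str key, so: a column (like a Series)
print(tf[[True, False]])  # non-str key, so: filtered rows (like a DataFrame)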

We commonly call a Series with dtype=bool used for this context a mask. It usually makes your program more readable to save those masks in a variable. The following cell shows the exact same example, but adding a variable for readability for the mask.

high_emissions = df2['emissions'] >= 200
df3 = df2[high_emissions]
df3['city']

Filtering on Multiple Conditions

You can combine masks using logical operators to make complex queries. There are three logical operators for masks (like and, or, and not but with different symbols).

  • & does an element-wise and to combine two masks
  • | does an element-wise or to combine two masks
  • ~ does an element-wise not of a single mask

For example, if you want to find all cities that have high emissions or are in the US, you would probably try writing the following (but you’ll run into a bug).

df2[df2['emissions'] >= 200 | df2['country'] == 'USA']

The problem comes from precedence (order of operations). Just like how * gets evaluated before +, | gets evaluated before the comparisons because it has higher precedence (and so does &). This makes Python interpret the first sub-expression as (200 | df2['country']), which causes an error since this operator is not defined for these types.

Whenever you run into ambiguities from precedence, one way you can always fix it is to wrap the sub-expressions in parentheses, like in the following code chunk.

df2[(df2['emissions'] >= 200) | (df2['country'] == 'USA')]

A much more readable solution involves saving each mask in a variable so you don’t have to worry about this precedence. This has an added benefit of giving each condition a human-readable name if you use good variable names!

high_emissions = df2['emissions'] >= 200
is_usa = df2['country'] == 'USA'
df2[high_emissions | is_usa]

Location

We’ve shown you how to select specific columns or select specific rows based on a mask. In some sense, it’s a little confusing that df[val] can be used to grab columns or rows depending on what is passed. This is because the syntax we have shown so far is really just a special case of a more generic syntax that lets you specify some location in the DataFrame. pandas provides the shorthand for convenience in some cases, but the more general syntax below works in many more!

In its most general form, the loc property lets you specify a row indexer and a column indexer to specify which rows/columns you want. The syntax looks like the following (where things in <...> are placeholders):

df.loc[<row indexer>, <column indexer>]

The row indexer refers to the index of the DataFrame. Recall, when we display a DataFrame, it shows values to the left of each row to identify each row in the DataFrame.

It turns out that the column indexer is optional, so you can leave it out. For example, if I want to get the first row (row with index 0), I could write:

df2.loc[0]

Interestingly, this actually returns a Series! It looks different from the Series returned by something like df['Name'] since its index is now the column names themselves! This means I could index into a specific column by doing something like:

s = df2.loc[0]
s['city']

Now, it was a bit tedious to have to use a second [] to access the column, which is exactly why loc lets you specify a column as a “column indexer”. Instead, it’s more common to write:

df2.loc[0, 'city']

You might be wondering: I’ve used the word “indexer” a few times but haven’t defined what that means! By indexer, I mean some value to indicate which rows/columns you want. So far, I have shown how to specify a single value as an indexer, but there are actually many options to choose from! You can always mix and match these and use different ones for the rows/cols.

List of indices and slices

For example, you can use a list of values as an indexer to select many rows or many columns:

df2.loc[[1, 2, 3], ['city', 'country', 'emissions']]

Notice now it returns a DataFrame instead of a single value.

You can also use slice syntax like you could for list/str to access a range of values. There are a couple oddities about this:

  • The start/stop points are both inclusive, which is different from list/str slicing, where the stop point is exclusive.
  • They do some fancy “magic” that lets you use ranges of strings to get a range of column names.

For example:

df2.loc[1:3, 'city':'emissions']

The way to read this loc access is “all the rows starting at index 1 and going to index 3 (both inclusive), and all the columns starting at city and going to emissions (both inclusive)”.

How does it define the “range of strings”? It uses the order of the columns in the DataFrame.

Mask

You can also use a bool Series as an indexer to grab all the rows or columns that are marked True. This is similar to the masking we saw before, but now the mask can be used as an indexer.

high_emissions = df2['emissions'] >= 200
is_usa = df2['country'] == 'USA'
df2.loc[high_emissions | is_usa]

Notice in the last cell, I left out the column indexer and it gave me all the columns (that is the default for the column indexer).

: for everything

Instead of relying on defaults, you can explicitly ask for “all of the columns” using the special range :. This is a common syntax for many numerical processing libraries so pandas adopts it too. It looks like the following:

df2.loc[[0, 4, 2], :]

You can also do this for the rows as well!

df2.loc[:, 'city']

A tip to help you read these in your head is to read : by itself as “all”.

Recap Indexers

So we saw that the .loc property is kind of like a universal way of asking for your data. You can specify a row indexer and a column indexer to select your data. We saw the following things used as indexers:

  • A single value (row index for rows, column name for columns)
  • A list of values or a slice (row indices for rows, column names for columns)
  • A mask
  • : to select all values

Return Values

One thing that is also complex about .loc is that the type of the value it returns depends on the types of the indexers. Recall that a pandas DataFrame is a 2-dimensional structure (rows and columns) while a Series is a single row or single column.

To tell what the return type of a .loc call is, you need to look for the “single value” type of indexer.

  • If both the row and column indexers are a single value, it returns a single value. This will be whatever value is at that location, so its type will be the same as the dtype of the column it comes from.
  • If only one of the row and column indexers is a single value (meaning the other is multiple values), it returns a Series.
  • If neither of the row and column indexers is a single value (meaning both are multiple values), it returns a DataFrame.
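
To make these rules concrete, here is a quick sketch using the emissions DataFrame (df2) from earlier:

df2.loc[0, 'city']         # both indexers are single values: returns a single value (a str)
df2.loc[0, :]              # single row, all columns: returns a Series
df2.loc[:, 'city']         # all rows, single column: returns a Series
df2.loc[[0, 1], ['city']]  # both indexers are multiple values: returns a DataFrame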

Pandas Reference

That was a ton of new concepts and syntax! Practice is key to getting the hang of things. Maybe not today, but hopefully in the coming weeks, you’ll start to develop a new fluency with composing and layering together data analysis as a sequence of element-wise operations and masking in pandas.

import pandas as pd

# Read a file
df = pd.read_csv('some_file.csv')

# Access a column
df['col']

# Cast the values in the column
df['col'].astype('type')

# Summary statistic of column
df['col'].mean()

# Lots of summary functions to use
#   mean: Calculates the average value of the Series
#   min: Calculates the minimum value of the Series
#   max: Calculates the maximum value of the Series
#   idxmin: Calculates the index of the minimum value of the Series
#   idxmax: Calculates the index of the maximum value of the Series
#   count: Calculates the number of values in the Series
#   unique: Returns a new Series with all the unique values from the Series

# Element-wise operations
df['col1'] + df['col2']

# Also works with single values
df['col'] // 2
df['col'] > 2

# Filter a DataFrame (& for and, | for or, ~ for not)
mask1 = df['col'] > 2
df[mask1]

mask2 = df['col2'] == 2
df[mask1 & mask2]
df[mask2 | ~mask1]

# Location: df.loc[row_indexer, column_indexer] (column_indexer is optional, default all)
# Indexers can be many types (can mix and match for row/col!): 
#   * List of values or a slice
#   * Mask
#   * : (for everything)
#   * Single value

# Single value
df.loc[0, 'col'] # Returns value

# List or slice of values
df.loc[[0, 2, 1], ['col1', 'col2']]  # Returns DataFrame
df.loc[0:4, 'col1':'col5']  # Returns DataFrame

# Everything
df.loc[:, :]  # DataFrame

# Other examples
df.loc[0]  # Series, default for column indexer is :
df.loc[0:5, 'col']  # Series
df.loc[1, 'col':'col']  # Series
df.loc[0:5, ['col1', 'col2']] # DataFrame

⏸️ Pause and 🧠 Think

Take a moment to review the following concepts and reflect on your own understanding. A good temperature check for your understanding is asking yourself whether you might be able to explain these concepts to a friend outside of this class.

Here’s what we covered in this lesson:

  • import statements (3 types)
  • pandas library
    • Reading files
    • Accessing columns
    • Casting values in a column
    • Aggregating values in a column
    • Element-wise and scalar operations
    • Filtering DataFrames with boolean masking
    • .loc

Here are some other guiding exercises and questions to help you reflect on what you’ve seen so far:

  1. In your own words, write a few sentences summarizing what you learned in this lesson.
  2. What did you find challenging in this lesson? Come up with some questions you might ask your peers or the course staff to help you better understand that concept.
  3. What was familiar about what you saw in this lesson? How might you relate it to things you have learned before?
  4. Throughout the lesson, there were a few Food for thought questions. Try exploring one or more of them and see what you find.

In-Class

When you come to class, we will work together on the problems in earthquakes.ipynb. We will also need earthquakes.csv for these tasks. Make sure that you have a way of editing and running these files!

Canvas Quiz

All done with the lesson? Complete the Canvas Quiz linked here!