The content for this lesson is adapted from material by Hunter Schafer and by Kevin Lin.
Objectives¶
By the end of this lesson, students will be able to:
- Apply the
dissolvefunction to aggregateGeoDataFramecolumns. - Explain the difference between inner-join, left-join, right-join, and outer-join.
- Apply
DataFramemergeto join tabular data andgeopandassjointo join geospatial data.
Setting up¶
To follow along with the code examples in this lesson, please download the files in the zip folder here:
Make sure to unzip the files after downloading! The following are the main files we will work with:
lesson18.ipynbpopulation_maps.ipynb
Note: Geospatial data requires several supplementary files in addition to the main .shp file that we will use in our code. You will need to unzip the data folder in lesson18.zip in order to use these files. Make sure to update any file paths in lesson18.ipynb to match your local paths!
Dissolve¶
We learned that geopandas can process and plot geospatial data such as the world’s countries by GDP.
import geopandas as gpd
import matplotlib.pyplot as plt
# Load data
df = gpd.read_file('data/ne_110m_admin_0_countries.shp')
# Preview data
print(df.columns)
print(df.head())
# Plot data
df.plot(column='GDP_MD_EST', legend=True)
plt.savefig('world_gdp.png')
GDP by Continent¶
How about computing the total population for each continent? Since our dataset is tabular, you might suspect that since we are trying to compute a value “for each” group, that we would want to use a groupby!
This is exactly the right idea in regular pandas: a groupby will apply an aggregation function like sum to the POP_EST column for each group. But for geopandas, the issue is that we won’t be able to plot the country geometries since groupby won’t combine the geometries of each country in the same continent. Remember that groupby only does the computation on the included columns. The result doesn’t even look like a plot of the world since the groupby call throws away the geometry, making it a non-geospatial dataset!
populations = df.groupby('CONTINENT')['POP_EST'].sum()
populations.plot()
plt.savefig('wrong_figure.png')
Instead, geopandas provides a function dissolve that behaves exactly like groupby but with added logic to combine all the geometries for the group into one. This means you can still compute aggregates like sum, min, max, mean for the columns of interest. The dissolve function will then combine the geometry column in a special way to make a single geometry for the group. The default (and most common) geometry combination is to trace around the edges of all the geometries in the group.
When you run the snippet, you should see an output that looks like we would expect: where each continent is its own color that looks like to have a value that is the sum of all the countries in that continent.
Notice this dissolve call has a lot of the same components as a groupby but the syntax looks quite different. Instead of df.groupby('col1')['col2'].sum(), call df.dissolve(by='col1', aggfunc='sum').
The dissolve operation is applied to all columns of the GeoDataFrame so you’ll want to keep only the columns of interest before calling dissolve. In our example with col1 and col2, you would need to filter df down to ['col1', 'col2', 'geometry'] since the the dissolve only happens on those columns.
# Filter down to just the columns of interest
populations = df[['POP_EST', 'CONTINENT', 'geometry']]
# Run the dissolve (groupby) operation
populations = populations.dissolve(by='CONTINENT', aggfunc='sum')
# Then plot the result
populations.plot(column='POP_EST', legend=True)
plt.savefig('plot.png')
Merge and Join¶
In the next few slides, we’ll be introducing a new DataFrame function for more advanced data processing. Let’s go back to the world of CSV data that we would process with pandas. Suppose I had two DataFrame objects: the left DataFrame tas stores information about the TAs and the right DataFrame grading stores information about students and which TA will be grading them (ta_id and grader_id will match if that TA is grading that student).
How do we find the number of students each TA is responsible for grading?
tas
| ta_name | ta_id |
|---|---|
| Ryan | 1 |
| James | 2 |
| Nicole | 3 |
grading
| grader_id | student_name |
|---|---|
| 2 | Flora |
| 3 | Paul |
| 1 | Wen |
| 3 | Andrew |
It’s very common for data to come from different sources but share an identifier column that relates information in one table to the other. The process of combining two datasets in such a way as to “align” them based on some column values is called a join. If we were to join these columns on ta_id and grader_id, you would imagine that the process looks something like this pseudocode.
for t in tas:
for g in grading:
if t['ta_id'] == g['grader_id']:
output(t, g)
To explain in English, the process of joining them on ta_id and grader_id finds all pairs of rows from each table and keeps them if they match on ta_id and grader_id.
merge¶
To join two DataFrame objects, call the DataFrame merge function.
import pandas as pd
# Make the tas DataFrame
tas = pd.DataFrame([
{'ta_name': 'Ryan', 'ta_id': 1},
{'ta_name': 'James', 'ta_id': 2},
{'ta_name': 'Nicole', 'ta_id': 3},
])
print('tas')
print(tas)
print()
# Make the grading DataFrame
grading = pd.DataFrame([
{'grader_id': 2, 'student_name': 'Flora'},
{'grader_id': 3, 'student_name': 'Paul'},
{'grader_id': 1, 'student_name': 'Wen'},
{'grader_id': 3, 'student_name': 'Andrew'},
])
print('grading')
print(grading)
print()
# Join the datasets by ta_id and grader_id
merged = tas.merge(grading, left_on='ta_id', right_on='grader_id')
print('merged')
print(merged)
Notice that the output contains all pairs of rows that match up on ta_id and grader_id. There are now two rows for Nicole because there were two rows in the original grading table that had her ID. At this point merged is a new DataFrame with the four columns shown in the output above. To compute how many students each TA was grading, we could groupby on this new DataFrame.
You might be wondering, what would happen to rows that don’t “line up”. What if there was a TA who didn’t have anyone they were grading or a student had a grader_id for a grader that doesn’t exist in the other table? Suppose we had the following dataset instead. We removed James from the first table and changed Wen to be graded by Nicole in the second.
tas
| ta_name | ta_id |
|---|---|
| Ryan | 1 |
| Nicole | 3 |
grading
| grader_id | student_name |
|---|---|
| 2 | Flora |
| 3 | Paul |
| 3 | Wen |
| 3 | Andrew |
Let’s see what happens when we run merge on these datasets:
# Make the tas DataFrame
tas = pd.DataFrame([
{'ta_name': 'Ryan', 'ta_id': 1},
{'ta_name': 'Nicole', 'ta_id': 3},
])
print('tas')
print(tas)
print()
# Make the grading DataFrame
grading = pd.DataFrame([
{'grader_id': 2, 'student_name': 'Flora'},
{'grader_id': 3, 'student_name': 'Paul'},
{'grader_id': 3, 'student_name': 'Wen'},
{'grader_id': 3, 'student_name': 'Andrew'},
])
print('grading')
print(grading)
print()
# Join the datasets by ta_id and grader_id
merged = tas.merge(grading, left_on='ta_id', right_on='grader_id')
print('merged')
print(merged)
Nicole is the only TA represented in this result! The rows that have IDs that did not “line up” with any other row got tossed out. This makes sense in the context of the pseudo-code we showed above since they didn’t have any matches. The topic of what to do with rows with missing values is very interesting and is the topic of the next section.
Join Types¶
There are multiple types of joins that handle “unaligned” rows differently. The join described earlier is an inner-join. An inner-join only outputs rows that have values that appear in both tables.
All joins consist of a left table and a right table. In DataFrame merge, this is determined by
left.merge(right, left_on=col, right_on=col, how=type)
The how parameter determines which type of join to use in the list below. There are four common types of joins:
- An inner-join (default). Keeps values from each table as long as there is at least one match in the other table.
- A left join
how='left'. Same as an inner-join but also includes all rows in the left table even if they have no matches in the right table by associating them with NaN values. - A right join
how='right'. Same as an inner-join but also includes all rows in the right table even if they have no matches in the left table by associating them with NaN values. - An outer join
how='outer'. Combines the logic of the left join and right join to ensure that all rows in both tables are included!
Food for thought: How does changing the how parameter in the following code block change the result of merged?
import pandas as pd
# Make the tas DataFrame
tas = pd.DataFrame([
{'ta_name': 'Ryan', 'ta_id': 1},
{'ta_name': 'Nicole', 'ta_id': 3},
])
print('tas')
print(tas)
print()
# Make the grading DataFrame
grading = pd.DataFrame([
{'grader_id': 2, 'student_name': 'Flora'},
{'grader_id': 3, 'student_name': 'Paul'},
{'grader_id': 3, 'student_name': 'Wen'},
{'grader_id': 3, 'student_name': 'Andrew'},
])
print('grading')
print(grading)
print()
# Join the datasets by ta_id and grader_id
merged = tas.merge(grading, left_on='ta_id', right_on='grader_id',
how='left')
print('merged')
print(merged)
Hurricane Florence, Revisited¶
Let’s apply a join to create a map of the US with the path of Hurricane Florence as well as highlighting for affected states.
Let’s start by repeating the setup and visualization from Lesson 17.
# Load in US Data
country = gpd.read_file("data/gz_2010_us_040_00_5m.json")
country = country[(country['NAME'] != 'Alaska') & (country['NAME'] != 'Hawaii')]
country.plot()
from shapely.geometry import Point
# Load in Florence data
florence = pd.read_csv('stormhistory.csv')
florence['coordinates'] = [Point(-long, lat) for long, lat in
zip(florence['Long'], florence['Lat'])]
florence = gpd.GeoDataFrame(florence, geometry='coordinates')
# Advanced: Need to specify map projection points so we can join them later
florence.crs = country.crs
florence.plot()
# Plot both on top of each other
fig, ax = plt.subplots(1, figsize=(20,10))
country.plot(ax=ax)
florence.plot(ax=ax, color='#000000', markersize=10)
Joining Geospatial Data¶
So if our goal is to highlight the states that were hit by hurricane Florence, we will need some way to “line up” the two datasets so we know which rows in the US dataset correspond to rows in the Florence dataset if they overlap.
Given what we learned in this lesson, this might give you an idea that we want to use a join somehow to do this, and that intuition is exactly right! Whenever you have this task of trying to “line up” two datasets, the join is usually the tool for the job. What type of join will we want to use? Well let’s just start by finding the states that do intersect, so we will want to use some type of inner-join so that we can find rows from both tables that overlap.
Using that intution, you would think that we need to call the merge function like we saw prevously. You might try to write something like:
country.merge(florence, left_on='geometry', right_on='coordinates', how='inner')
However, this won’t work! Remember, merge will join two rows together if the value in left_on from the left dataset equals the value of right_on in the right dataset. But the US states geometry are all Polygon or MultiPolygon objects whereas the hurricane path geometry are all Point objects.
In this context, we don’t want to join two rows if they have the exact same geometry, but rather if they intersect (or overlap) somehow. This idea of having more complex semantics for how to join rows based on geospatial information is exactly why geopandas as a special type of join called a “spatial join”, or sjoin for short.
The semantics for doing a spatial join are very similar to merge except for two differences:
- You don’t need to specify which columns to join on.
GeoDataFramesautomatically detects the geometry columns. - You have to specify what it means for two geometries to “match” (i.e., if they should be joined together) using the
opparameter. The most common (and default) is'intersects'.
Note that we call gpd.sjoin because it’s not a GeoDataFrame method.
affected_states = gpd.sjoin(country, florence, how='inner', predicate='intersects')
print('Num Rows', len(affected_states))
display(affected_states.head())
The number of rows is 25 because the same state will appear every time it intersects with a row in the hurricane path! sjoin preserves the geometry of the left dataset. So in our example above, since we joined with country on the left, the resulting GeoDataFrame will keep the country geometry column.
Now that we know which states were affected by the hurricane, we can highlight them on the map.
- Plot the entire US in grey (
color='#EEEEEE') with a white border (edgecolor='#FFFFFF'). - Plot the affected states in blue (default) with a white border.
- Plot the hurricane points in black (
color='#000000') with a smaller size of the dot (markersize=10).
All of this code is below. Note that the order in which we plot is important since new calls to plotting “draw over” the old calls.
fig, ax = plt.subplots(1, figsize=(20, 10))
country.plot(ax=ax, color='#EEEEEE', edgecolor='#FFFFFF')
affected_states.plot(ax=ax, edgecolor='#FFFFFF')
florence.plot(ax=ax, color='#000000', markersize=10)
Food for thought: What would happen if we draw the entire US in grey last?
⏸️ Pause and 🧠 Think¶
Take a moment to review the following concepts and reflect on your own understanding. A good temperature check for your understanding is asking yourself whether you might be able to explain these concepts to a friend outside of this class.
Here’s what we covered in this lesson:
dissolve- Joining
mergesjoin- Types of joins
Here are some other guiding exercises and questions to help you reflect on what you’ve seen so far:
- In your own words, write a few sentences summarizing what you learned in this lesson.
- What did you find challenging in this lesson? Come up with some questions you might ask your peers or the course staff to help you better understand that concept.
- What was familiar about what you saw in this lesson? How might you relate it to things you have learned before?
- Throughout the lesson, there were a few Food for thought questions. Try exploring one or more of them and see what you find.
In-Class¶
When you come to class, we will work together on completing population_maps.ipynb! Make sure you have a way of opening and running this file.
Canvas Quiz¶
All done with the lesson? Complete the Canvas Quiz linked here!