Dissolve, Intersect, and Join¶

In this lesson, we'll learn about geospatial operations such as dissolves, intersections, and spatial joins. We'll also relate tabular data to geospatial data and learn new ways of combining different types of datasets. By the end of this lesson, students will be able to:

  • Apply the dissolve method to aggregate GeoDataFrame columns.
  • Apply merge to join tabular data and sjoin to join geospatial data.
  • Apply the clip methods to perform bounding box queries.
In [1]:
import geopandas as gpd
import matplotlib.pyplot as plt
import pandas as pd

We'll continue using the same world countries dataset today.

In [2]:
columns = ["POP_EST", "GDP_MD", "CONTINENT", "SUBREGION", "geometry"]
countries = gpd.read_file("ne_110m_admin_0_countries.shp").set_index("NAME")[columns]
countries
Out[2]:
POP_EST GDP_MD CONTINENT SUBREGION geometry
NAME
Fiji 889953.0 5496 Oceania Melanesia MULTIPOLYGON (((180 -16.06713, 180 -16.55522, ...
Tanzania 58005463.0 63177 Africa Eastern Africa POLYGON ((33.90371 -0.95, 34.07262 -1.05982, 3...
W. Sahara 603253.0 907 Africa Northern Africa POLYGON ((-8.66559 27.65643, -8.66512 27.58948...
Canada 37589262.0 1736425 North America Northern America MULTIPOLYGON (((-122.84 49, -122.97421 49.0025...
United States of America 328239523.0 21433226 North America Northern America MULTIPOLYGON (((-122.84 49, -120 49, -117.0312...
... ... ... ... ... ...
Serbia 6944975.0 51475 Europe Southern Europe POLYGON ((18.82982 45.90887, 18.82984 45.90888...
Montenegro 622137.0 5542 Europe Southern Europe POLYGON ((20.0707 42.58863, 19.80161 42.50009,...
Kosovo 1794248.0 7926 Europe Southern Europe POLYGON ((20.59025 41.85541, 20.52295 42.21787...
Trinidad and Tobago 1394973.0 24269 North America Caribbean POLYGON ((-61.68 10.76, -61.105 10.89, -60.895...
S. Sudan 11062113.0 11998 Africa Eastern Africa POLYGON ((30.83385 3.50917, 29.9535 4.1737, 29...

177 rows × 5 columns

Dissolve: groupby for geospatial data¶

How can we plot the total population for each continent (rather than per country)? In pandas, to perform a computation for each group, we would typically use a groupby!

In [3]:
populations = countries.groupby("CONTINENT")["POP_EST"].sum()
populations.plot()
Out[3]:
<Axes: xlabel='CONTINENT'>
No description has been provided for this image
In [4]:
populations
Out[4]:
CONTINENT
Africa                     1.306370e+09
Antarctica                 4.490000e+03
Asia                       4.550277e+09
Europe                     7.454125e+08
North America              5.837560e+08
Oceania                    4.120487e+07
Seven seas (open ocean)    1.400000e+02
South America              4.270667e+08
Name: POP_EST, dtype: float64

Clearly, this is not a map. For geospatial data, we need to use a GeoDataFrame method called dissolve. Dissolve not only takes care of aggregation, but it also performs a geospatial union to combine all the countries within a single continent into a larger geometric shape.

In [6]:
# populations = countries.groupby("CONTINENT")["POP_EST"].sum()
populations = countries.dissolve("CONTINENT", aggfunc="sum")
populations.plot(column="POP_EST")
Out[6]:
<Axes: >
No description has been provided for this image

Combining tabular and geospatial data¶

When we first learned about data frames, we introduced a dataset about earthquakes around the world. How might we plot these earthquakes?

In [7]:
# Not appropriate to use gpd.read_file since this isn't a geospatial dataset
earthquakes = gpd.read_file("earthquakes.csv").set_index("id")
earthquakes
Out[7]:
year month day latitude longitude name magnitude
id
nc72666881 2016 7 27 37.6723333 -121.619 California 1.43
us20006i0y 2016 7 27 21.5146 94.5721 Burma 4.9
nc72666891 2016 7 27 37.5765 -118.85916670000002 California 0.06
nc72666896 2016 7 27 37.5958333 -118.99483329999998 California 0.4
nn00553447 2016 7 27 39.3775 -119.845 Nevada 0.3
... ... ... ... ... ... ... ...
nc72685246 2016 8 25 36.51549910000001 -121.0998306 California 2.42
ak13879193 2016 8 25 61.4984 -149.8627 Alaska 1.4
nc72685251 2016 8 25 38.8050003 -122.8215027 California 1.06
ci37672328 2016 8 25 34.308 -118.6353333 California 1.55
ci37672360 2016 8 25 34.1191667 -116.93366670000002 California 0.89

8394 rows × 7 columns

In [10]:
# Kevin thinks the reason why we cannot plot this is because
#   the dtype: object (which means that it's not interpreted as numbers)
earthquakes["longitude"]
Out[10]:
id
nc72666881               -121.619
us20006i0y                94.5721
nc72666891    -118.85916670000002
nc72666896    -118.99483329999998
nn00553447               -119.845
                     ...         
nc72685246           -121.0998306
ak13879193              -149.8627
nc72685251           -122.8215027
ci37672328           -118.6353333
ci37672360    -116.93366670000002
Name: longitude, Length: 8394, dtype: object
In [12]:
type(earthquakes)
Out[12]:
pandas.core.frame.DataFrame

Unfortunately, even though the earthquakes dataset has latitude and longitude columns, it is not strictly in a geospatial data format like a shapefile. It's better to read this dataset using plain pandas and then work on adding a geometry column from the latitude-longitude pairs.

In [14]:
# Since this is a CSV file, we should use read_csv to read it and properly
#   load-in the correct data types (e.g. detecting as float64 numbers)
earthquakes = pd.read_csv("earthquakes.csv").set_index("id")
earthquakes["longitude"]
Out[14]:
id
nc72666881   -121.619000
us20006i0y     94.572100
nc72666891   -118.859167
nc72666896   -118.994833
nn00553447   -119.845000
                 ...    
nc72685246   -121.099831
ak13879193   -149.862700
nc72685251   -122.821503
ci37672328   -118.635333
ci37672360   -116.933667
Name: longitude, Length: 8394, dtype: float64
In [15]:
earthquakes = gpd.GeoDataFrame(
    earthquakes,
    # crs="EPSG:4326" specifies WGS84 or GPS coordinate system, see https://epsg.io/4326
    geometry=gpd.points_from_xy(earthquakes["longitude"], earthquakes["latitude"], crs="EPSG:4326")
)
earthquakes
Out[15]:
year month day latitude longitude name magnitude geometry
id
nc72666881 2016 7 27 37.672333 -121.619000 California 1.43 POINT (-121.619 37.67233)
us20006i0y 2016 7 27 21.514600 94.572100 Burma 4.90 POINT (94.5721 21.5146)
nc72666891 2016 7 27 37.576500 -118.859167 California 0.06 POINT (-118.85917 37.5765)
nc72666896 2016 7 27 37.595833 -118.994833 California 0.40 POINT (-118.99483 37.59583)
nn00553447 2016 7 27 39.377500 -119.845000 Nevada 0.30 POINT (-119.845 39.3775)
... ... ... ... ... ... ... ... ...
nc72685246 2016 8 25 36.515499 -121.099831 California 2.42 POINT (-121.09983 36.5155)
ak13879193 2016 8 25 61.498400 -149.862700 Alaska 1.40 POINT (-149.8627 61.4984)
nc72685251 2016 8 25 38.805000 -122.821503 California 1.06 POINT (-122.8215 38.805)
ci37672328 2016 8 25 34.308000 -118.635333 California 1.55 POINT (-118.63533 34.308)
ci37672360 2016 8 25 34.119167 -116.933667 California 0.89 POINT (-116.93367 34.11917)

8394 rows × 8 columns

Now, we can plot the earthquakes on a map!

In [16]:
fig, ax = plt.subplots(figsize=(13, 5))
countries.plot(ax=ax, color="#EEE")
earthquakes.plot(ax=ax, column="magnitude", markersize=0.1, legend=True)
ax.set(title="Earthquakes between July 27, 2016 and August 25, 2016")
ax.set_axis_off()
No description has been provided for this image

But what if we want to only plot the earthquakes that occurred over land in North America? The following map shows the earthquakes in Washington atop a background of North America, but it's pretty tedious trying to specify every single place name one-by-one.

In [17]:
fig, ax = plt.subplots()
na_countries = countries[countries["CONTINENT"] == "North America"]
na_countries.plot(ax=ax, color="#EEE")
earthquakes[earthquakes["name"] == "Washington"].plot(ax=ax, column="magnitude", markersize=0.1)
ax.set(title="Earthquakes in Washington")
ax.set_axis_off()
No description has been provided for this image

Bounding boxes¶

A bounding box is one way to specify an intersection query. Say we wanted to show all the earthquakes over the map area covered by North America. We can see the bounding boxes for every country in North America through the bounds field.

In [18]:
na_countries.bounds
Out[18]:
minx miny maxx maxy
NAME
Canada -140.997780 41.675105 -52.648099 83.233240
United States of America -171.791111 18.916190 -66.964660 71.357764
Haiti -74.458034 18.030993 -71.624873 19.915684
Dominican Rep. -71.945112 17.598564 -68.317943 19.884911
Bahamas -78.980000 23.710000 -77.000000 27.040000
Greenland -73.297000 60.036760 -12.208550 83.645130
Mexico -117.127760 14.538829 -86.811982 32.720830
Panama -82.965783 7.220541 -77.242566 9.611610
Costa Rica -85.941725 8.225028 -82.546196 11.217119
Nicaragua -87.668493 10.726839 -83.147219 15.016267
Honduras -89.353326 12.984686 -83.147219 16.005406
El Salvador -90.095555 13.149017 -87.723503 14.424133
Guatemala -92.229249 13.735338 -88.225023 17.819326
Belize -89.229122 15.886938 -88.106813 18.499982
Puerto Rico -67.242428 17.946553 -65.591004 18.520601
Jamaica -78.337719 17.701116 -76.199659 18.524218
Cuba -84.974911 19.855481 -74.178025 23.188611
Trinidad and Tobago -61.950000 10.000000 -60.895000 10.890000

The total_bounds field provides the bounding box for the entire dataset.

In [19]:
na_countries.total_bounds
Out[19]:
array([-171.7911106 ,    7.22054149,  -12.20855   ,   83.64513   ])

We can use the total bounds as input to the clip method to keep only the rows where the geometry falls within the total bounds.

In [20]:
earthquakes_na_bounds = earthquakes.clip(na_countries.total_bounds)
earthquakes_na_bounds
Out[20]:
year month day latitude longitude name magnitude geometry
id
pr16228001 2016 8 15 17.0398 -67.6287 Puerto Rico 3.3 POINT (-67.6287 17.0398)
pr16236015 2016 8 23 17.5515 -66.4619 Puerto Rico 2.4 POINT (-66.4619 17.5515)
pr16227004 2016 8 14 17.7280 -67.0207 Puerto Rico 2.5 POINT (-67.0207 17.728)
pr16226007 2016 8 13 17.7597 -66.8927 Puerto Rico 2.6 POINT (-66.8927 17.7597)
pr16226009 2016 8 13 17.7644 -66.9978 Puerto Rico 2.3 POINT (-66.9978 17.7644)
... ... ... ... ... ... ... ... ...
ak13835344 2016 8 7 69.3298 -144.0389 Alaska 1.4 POINT (-144.0389 69.3298)
ak13875730 2016 8 23 69.5927 -143.5277 Alaska 2.3 POINT (-143.5277 69.5927)
ak13859105 2016 8 20 69.6423 -144.8114 Alaska 1.9 POINT (-144.8114 69.6423)
ak13839206 2016 8 9 69.7401 -146.3768 Alaska 1.7 POINT (-146.3768 69.7401)
ak13841660 2016 8 12 70.7787 -145.8338 Alaska 2.6 POINT (-145.8338 70.7787)

7621 rows × 8 columns

In [21]:
fig, ax = plt.subplots()
na_countries = countries[countries["CONTINENT"] == "North America"]
na_countries.plot(ax=ax, color="#EEE")
earthquakes_na_bounds.plot(ax=ax, column="magnitude", markersize=0.1)
ax.set(title="Earthquakes in North America (over land or sea)")
ax.set_axis_off()
No description has been provided for this image

Spatial join¶

But what if we only want to plot the earthquakes that appeared over land—not over sea? A spatial join allows us to specify geospatial operations to link two datasets so that we can find all the earthquakes that were located in a North American country. This allows us to provide more precise intersection queries.

In [22]:
earthquakes_na_countries = earthquakes.sjoin(na_countries)
earthquakes_na_countries
Out[22]:
year month day latitude longitude name magnitude geometry NAME POP_EST GDP_MD CONTINENT SUBREGION
id
nc72666881 2016 7 27 37.672333 -121.619000 California 1.43 POINT (-121.619 37.67233) United States of America 328239523.0 21433226 North America Northern America
nc72666891 2016 7 27 37.576500 -118.859167 California 0.06 POINT (-118.85917 37.5765) United States of America 328239523.0 21433226 North America Northern America
nc72666896 2016 7 27 37.595833 -118.994833 California 0.40 POINT (-118.99483 37.59583) United States of America 328239523.0 21433226 North America Northern America
nn00553447 2016 7 27 39.377500 -119.845000 Nevada 0.30 POINT (-119.845 39.3775) United States of America 328239523.0 21433226 North America Northern America
ak13805337 2016 7 27 61.296300 -152.460000 Alaska 1.80 POINT (-152.46 61.2963) United States of America 328239523.0 21433226 North America Northern America
... ... ... ... ... ... ... ... ... ... ... ... ... ...
nc72685246 2016 8 25 36.515499 -121.099831 California 2.42 POINT (-121.09983 36.5155) United States of America 328239523.0 21433226 North America Northern America
ak13879193 2016 8 25 61.498400 -149.862700 Alaska 1.40 POINT (-149.8627 61.4984) United States of America 328239523.0 21433226 North America Northern America
nc72685251 2016 8 25 38.805000 -122.821503 California 1.06 POINT (-122.8215 38.805) United States of America 328239523.0 21433226 North America Northern America
ci37672328 2016 8 25 34.308000 -118.635333 California 1.55 POINT (-118.63533 34.308) United States of America 328239523.0 21433226 North America Northern America
ci37672360 2016 8 25 34.119167 -116.933667 California 0.89 POINT (-116.93367 34.11917) United States of America 328239523.0 21433226 North America Northern America

6785 rows × 13 columns

The columns to the left of geometry are from the earthquakes dataset, while the columns to the right of geometry are linked-up from the na_countries dataset. Complete this sentence to describe how sjoin computed the above result.

For every row in the earthquakes dataset, sjoin _________________________

By default, sjoin uses the keyword argument how="inner". We can also customize the join result with how="left" or how="right" too. Can you explain how the results differ?

In [24]:
# how="left" includes all the entries from earthquakes and, if there's
#   a match, it will pull in the corresponding row from na_countries
earthquakes.sjoin(na_countries, how="left")

# Why is it called how="left"... prefers all the entries from the left dataset
# And it's also a historical consequence of how database systems from the past
Out[24]:
year month day latitude longitude name magnitude geometry NAME POP_EST GDP_MD CONTINENT SUBREGION
id
nc72666881 2016 7 27 37.672333 -121.619000 California 1.43 POINT (-121.619 37.67233) United States of America 328239523.0 21433226.0 North America Northern America
us20006i0y 2016 7 27 21.514600 94.572100 Burma 4.90 POINT (94.5721 21.5146) NaN NaN NaN NaN NaN
nc72666891 2016 7 27 37.576500 -118.859167 California 0.06 POINT (-118.85917 37.5765) United States of America 328239523.0 21433226.0 North America Northern America
nc72666896 2016 7 27 37.595833 -118.994833 California 0.40 POINT (-118.99483 37.59583) United States of America 328239523.0 21433226.0 North America Northern America
nn00553447 2016 7 27 39.377500 -119.845000 Nevada 0.30 POINT (-119.845 39.3775) United States of America 328239523.0 21433226.0 North America Northern America
... ... ... ... ... ... ... ... ... ... ... ... ... ...
nc72685246 2016 8 25 36.515499 -121.099831 California 2.42 POINT (-121.09983 36.5155) United States of America 328239523.0 21433226.0 North America Northern America
ak13879193 2016 8 25 61.498400 -149.862700 Alaska 1.40 POINT (-149.8627 61.4984) United States of America 328239523.0 21433226.0 North America Northern America
nc72685251 2016 8 25 38.805000 -122.821503 California 1.06 POINT (-122.8215 38.805) United States of America 328239523.0 21433226.0 North America Northern America
ci37672328 2016 8 25 34.308000 -118.635333 California 1.55 POINT (-118.63533 34.308) United States of America 328239523.0 21433226.0 North America Northern America
ci37672360 2016 8 25 34.119167 -116.933667 California 0.89 POINT (-116.93367 34.11917) United States of America 328239523.0 21433226.0 North America Northern America

8394 rows × 13 columns

In [25]:
# how="right" is like how="left" (but of course for the right-side dataset
#   but also prioritizes the index from the right-side dataset
earthquakes.sjoin(na_countries, how="right")
Out[25]:
id year month day latitude longitude name magnitude POP_EST GDP_MD CONTINENT SUBREGION geometry
NAME
Canada ak13806840 2016.0 7.0 27.0 61.4966 -140.6142 Canada 1.10 37589262.0 1736425 North America Northern America MULTIPOLYGON (((-122.84 49, -122.97421 49.0025...
Canada ak13831382 2016.0 7.0 27.0 61.4500 -140.6041 Alaska 0.90 37589262.0 1736425 North America Northern America MULTIPOLYGON (((-122.84 49, -122.97421 49.0025...
Canada ak13831385 2016.0 7.0 27.0 59.7753 -136.5700 Alaska 1.20 37589262.0 1736425 North America Northern America MULTIPOLYGON (((-122.84 49, -122.97421 49.0025...
Canada uw61172492 2016.0 7.0 27.0 49.4250 -120.5230 Canada 1.94 37589262.0 1736425 North America Northern America MULTIPOLYGON (((-122.84 49, -122.97421 49.0025...
Canada ak13811717 2016.0 7.0 27.0 61.2486 -139.3301 Canada 1.30 37589262.0 1736425 North America Northern America MULTIPOLYGON (((-122.84 49, -122.97421 49.0025...
... ... ... ... ... ... ... ... ... ... ... ... ... ...
Puerto Rico pr16236006 2016.0 8.0 23.0 18.0020 -67.0385 Puerto Rico 2.10 3193694.0 104988 North America Caribbean POLYGON ((-66.28243 18.51476, -65.7713 18.4266...
Puerto Rico pr16237004 2016.0 8.0 24.0 18.4722 -66.4264 Puerto Rico 2.50 3193694.0 104988 North America Caribbean POLYGON ((-66.28243 18.51476, -65.7713 18.4266...
Jamaica NaN NaN NaN NaN NaN NaN NaN NaN 2948279.0 16458 North America Caribbean POLYGON ((-77.5696 18.49053, -76.89662 18.4008...
Cuba NaN NaN NaN NaN NaN NaN NaN NaN 11333483.0 100023 North America Caribbean POLYGON ((-82.26815 23.18861, -81.40446 23.117...
Trinidad and Tobago NaN NaN NaN NaN NaN NaN NaN NaN 1394973.0 24269 North America Caribbean POLYGON ((-61.68 10.76, -61.105 10.89, -60.895...

6794 rows × 13 columns

Finally, let's plot all the earthquakes that occurred over land in North America.

In [26]:
fig, ax = plt.subplots()
na_countries = countries[countries["CONTINENT"] == "North America"]
na_countries.plot(ax=ax, color="#EEE")
earthquakes_na_countries.plot(ax=ax, column="magnitude", markersize=0.1)
ax.set(title="Earthquakes in North America (over land)")
ax.set_axis_off()
No description has been provided for this image

Attribute joins¶

The idea of joining two datasets isn't exclusive to geospatial data. In fact, in many data-centric contexts we may need to combine multiple datasets. Just as sjoin combines two geospatial datasets on their geometry columns using geometric intersection, merge combines two tabular datasets on specific columns using exact == matches.

Consider the following two datasets of movies and directors. Not all directors have a movie listed, and not all movies have a corresponding director in the dataset.

In [27]:
movies = pd.DataFrame([
    {"movie_id": 51, "movie_name": "Lady Bird", "year": 2017, "director_id": 23},
    {"movie_id": 47, "movie_name": "Grand Budapest Hotel", "year": 2014, "director_id": 16},
    {"movie_id": 103, "movie_name": "Parasite", "year": 2019, "director_id": 14},
    {"movie_id": 34, "movie_name": "Frozen", "year": 2013, "director_id": 18},
    {"movie_id": 37, "movie_name": "Moonrise Kingdom", "year": 2012, "director_id": 16},
])
movies
Out[27]:
movie_id movie_name year director_id
0 51 Lady Bird 2017 23
1 47 Grand Budapest Hotel 2014 16
2 103 Parasite 2019 14
3 34 Frozen 2013 18
4 37 Moonrise Kingdom 2012 16
In [28]:
directors = pd.DataFrame([
    {"director_id": 14, "director_name": "Bong Joon Ho"},
    {"director_id": 23, "director_name": "Greta Gerwig"},
    {"director_id": 16, "director_name": "Wes Anderson"},
    {"director_id": 21, "director_name": "Quentin Tarantino"},
    {"director_id": 27, "director_name": "Kathryn Bigelow"},
])
directors
Out[28]:
director_id director_name
0 14 Bong Joon Ho
1 23 Greta Gerwig
2 16 Wes Anderson
3 21 Quentin Tarantino
4 27 Kathryn Bigelow

Before running the following cell, predict the output if the default attribute join type is how="inner". Visualize the procedure using PandasTutor. What would happen if we changed the join type?

In [29]:
movies.merge(directors, on="director_id")
Out[29]:
movie_id movie_name year director_id director_name
0 51 Lady Bird 2017 23 Greta Gerwig
1 47 Grand Budapest Hotel 2014 16 Wes Anderson
2 103 Parasite 2019 14 Bong Joon Ho
3 37 Moonrise Kingdom 2012 16 Wes Anderson

For these two datasets, the column names are the same so we can use the on keyword argument. If the column names are not the same, specify the column name in the left with left_on and the column name in the right with right_on.

Interactive maps¶

Geopandas supports interactive maps with the folium library.

In [30]:
!pip install -q folium mapclassify

Generating an interactive map is as simple as generating a static map: instead of calling plot, call explore.

In [31]:
earthquakes.explore(column="magnitude")
Out[31]:
Make this Notebook Trusted to load map: File -> Trust Notebook