Dissolve, Intersect, and Join¶
In this lesson, we'll learn about geospatial operations such as dissolves, intersections, and spatial joins. We'll also relate tabular data to geospatial data and learn new ways of combining different types of datasets. By the end of this lesson, students will be able to:
- Apply the
dissolvemethod to aggregateGeoDataFramecolumns. - Apply
mergeto join tabular data andsjointo join geospatial data. - Apply the
clipmethods to perform bounding box queries.
import geopandas as gpd
import matplotlib.pyplot as plt
import pandas as pd
We'll continue using the same world countries dataset today.
columns = ["POP_EST", "GDP_MD", "CONTINENT", "SUBREGION", "geometry"]
countries = gpd.read_file("ne_110m_admin_0_countries.shp").set_index("NAME")[columns]
countries
| POP_EST | GDP_MD | CONTINENT | SUBREGION | geometry | |
|---|---|---|---|---|---|
| NAME | |||||
| Fiji | 889953.0 | 5496 | Oceania | Melanesia | MULTIPOLYGON (((180.00000 -16.06713, 180.00000... |
| Tanzania | 58005463.0 | 63177 | Africa | Eastern Africa | POLYGON ((33.90371 -0.95000, 34.07262 -1.05982... |
| W. Sahara | 603253.0 | 907 | Africa | Northern Africa | POLYGON ((-8.66559 27.65643, -8.66512 27.58948... |
| Canada | 37589262.0 | 1736425 | North America | Northern America | MULTIPOLYGON (((-122.84000 49.00000, -122.9742... |
| United States of America | 328239523.0 | 21433226 | North America | Northern America | MULTIPOLYGON (((-122.84000 49.00000, -120.0000... |
| ... | ... | ... | ... | ... | ... |
| Serbia | 6944975.0 | 51475 | Europe | Southern Europe | POLYGON ((18.82982 45.90887, 18.82984 45.90888... |
| Montenegro | 622137.0 | 5542 | Europe | Southern Europe | POLYGON ((20.07070 42.58863, 19.80161 42.50009... |
| Kosovo | 1794248.0 | 7926 | Europe | Southern Europe | POLYGON ((20.59025 41.85541, 20.52295 42.21787... |
| Trinidad and Tobago | 1394973.0 | 24269 | North America | Caribbean | POLYGON ((-61.68000 10.76000, -61.10500 10.890... |
| S. Sudan | 11062113.0 | 11998 | Africa | Eastern Africa | POLYGON ((30.83385 3.50917, 29.95350 4.17370, ... |
177 rows × 5 columns
Dissolve: groupby for geospatial data¶
How can we plot the total population for each continent (rather than per country)? In pandas, to perform a computation for each group, we would typically use a groupby!
countries.plot()
<Axes: >
populations
CONTINENT Africa 1.306370e+09 Antarctica 4.490000e+03 Asia 4.550277e+09 Europe 7.454125e+08 North America 5.837560e+08 Oceania 4.120487e+07 Seven seas (open ocean) 1.400000e+02 South America 4.270667e+08 Name: POP_EST, dtype: float64
populations = countries.groupby("CONTINENT")["POP_EST"].sum()
populations.plot()
<Axes: xlabel='CONTINENT'>
Clearly, this is not a map. For geospatial data, we need to use a GeoDataFrame method called dissolve. Dissolve not only takes care of aggregation, but it also performs a geospatial union to combine all the countries within a single continent into a larger geometric shape.
countries.dissolve(by="CONTINENT", aggfunc="sum")[["POP_EST", "geometry"]].plot(column="POP_EST")
<Axes: >
countries.dissolve(by="CONTINENT", aggfunc="sum").plot(column="POP_EST")
<Axes: >
Combining tabular and geospatial data¶
When we first learned about data frames, we introduced a dataset about earthquakes around the world. How might we plot these earthquakes?
earthquakes = gpd.read_file("earthquakes.csv").set_index("id")
earthquakes
| year | month | day | latitude | longitude | name | magnitude | geometry | |
|---|---|---|---|---|---|---|---|---|
| id | ||||||||
| nc72666881 | 2016 | 7 | 27 | 37.6723333 | -121.619 | California | 1.43 | None |
| us20006i0y | 2016 | 7 | 27 | 21.5146 | 94.5721 | Burma | 4.9 | None |
| nc72666891 | 2016 | 7 | 27 | 37.5765 | -118.85916670000002 | California | 0.06 | None |
| nc72666896 | 2016 | 7 | 27 | 37.5958333 | -118.99483329999998 | California | 0.4 | None |
| nn00553447 | 2016 | 7 | 27 | 39.3775 | -119.845 | Nevada | 0.3 | None |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| nc72685246 | 2016 | 8 | 25 | 36.51549910000001 | -121.0998306 | California | 2.42 | None |
| ak13879193 | 2016 | 8 | 25 | 61.4984 | -149.8627 | Alaska | 1.4 | None |
| nc72685251 | 2016 | 8 | 25 | 38.8050003 | -122.8215027 | California | 1.06 | None |
| ci37672328 | 2016 | 8 | 25 | 34.308 | -118.6353333 | California | 1.55 | None |
| ci37672360 | 2016 | 8 | 25 | 34.1191667 | -116.93366670000002 | California | 0.89 | None |
8394 rows × 8 columns
Unfortunately, even though the earthquakes dataset has latitude and longitude columns, it is not strictly in a geospatial data format like a shapefile. It's better to read this dataset using plain pandas and then work on adding a geometry column from the latitude-longitude pairs.
earthquakes = pd.read_csv("earthquakes.csv").set_index("id")
earthquakes = gpd.GeoDataFrame(
earthquakes,
# crs="EPSG:4326" specifies WGS84 or GPS coordinate system, see https://epsg.io/4326
geometry=gpd.points_from_xy(earthquakes["longitude"], earthquakes["latitude"], crs="EPSG:4326")
)
earthquakes
| year | month | day | latitude | longitude | name | magnitude | geometry | |
|---|---|---|---|---|---|---|---|---|
| id | ||||||||
| nc72666881 | 2016 | 7 | 27 | 37.672333 | -121.619000 | California | 1.43 | POINT (-121.61900 37.67233) |
| us20006i0y | 2016 | 7 | 27 | 21.514600 | 94.572100 | Burma | 4.90 | POINT (94.57210 21.51460) |
| nc72666891 | 2016 | 7 | 27 | 37.576500 | -118.859167 | California | 0.06 | POINT (-118.85917 37.57650) |
| nc72666896 | 2016 | 7 | 27 | 37.595833 | -118.994833 | California | 0.40 | POINT (-118.99483 37.59583) |
| nn00553447 | 2016 | 7 | 27 | 39.377500 | -119.845000 | Nevada | 0.30 | POINT (-119.84500 39.37750) |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| nc72685246 | 2016 | 8 | 25 | 36.515499 | -121.099831 | California | 2.42 | POINT (-121.09983 36.51550) |
| ak13879193 | 2016 | 8 | 25 | 61.498400 | -149.862700 | Alaska | 1.40 | POINT (-149.86270 61.49840) |
| nc72685251 | 2016 | 8 | 25 | 38.805000 | -122.821503 | California | 1.06 | POINT (-122.82150 38.80500) |
| ci37672328 | 2016 | 8 | 25 | 34.308000 | -118.635333 | California | 1.55 | POINT (-118.63533 34.30800) |
| ci37672360 | 2016 | 8 | 25 | 34.119167 | -116.933667 | California | 0.89 | POINT (-116.93367 34.11917) |
8394 rows × 8 columns
Now, we can plot the earthquakes on a map!
fig, ax = plt.subplots(figsize=(13, 5))
countries.plot(ax=ax, color="#EEE")
earthquakes.plot(ax=ax, column="magnitude", markersize=0.1, legend=True)
ax.set(title="Earthquakes between July 27, 2016 and August 25, 2016")
ax.set_axis_off()
But what if we want to only plot the earthquakes that occurred over land in North America? The following map shows the earthquakes in Washington atop a background of North America, but it's pretty tedious trying to specify every single place name one-by-one.
fig, ax = plt.subplots()
na_countries = countries[countries["CONTINENT"] == "North America"]
na_countries.plot(ax=ax, color="#EEE")
earthquakes[earthquakes["name"] == "Washington"].plot(ax=ax, column="magnitude", markersize=0.1)
ax.set(title="Earthquakes in Washington")
ax.set_axis_off()
Bounding boxes¶
A bounding box is one way to specify an intersection query. Say we wanted to show all the earthquakes over the map area covered by North America. We can see the bounding boxes for every country in North America through the bounds field.
na_countries.bounds
| minx | miny | maxx | maxy | |
|---|---|---|---|---|
| NAME | ||||
| Canada | -140.997780 | 41.675105 | -52.648099 | 83.233240 |
| United States of America | -171.791111 | 18.916190 | -66.964660 | 71.357764 |
| Haiti | -74.458034 | 18.030993 | -71.624873 | 19.915684 |
| Dominican Rep. | -71.945112 | 17.598564 | -68.317943 | 19.884911 |
| Bahamas | -78.980000 | 23.710000 | -77.000000 | 27.040000 |
| Greenland | -73.297000 | 60.036760 | -12.208550 | 83.645130 |
| Mexico | -117.127760 | 14.538829 | -86.811982 | 32.720830 |
| Panama | -82.965783 | 7.220541 | -77.242566 | 9.611610 |
| Costa Rica | -85.941725 | 8.225028 | -82.546196 | 11.217119 |
| Nicaragua | -87.668493 | 10.726839 | -83.147219 | 15.016267 |
| Honduras | -89.353326 | 12.984686 | -83.147219 | 16.005406 |
| El Salvador | -90.095555 | 13.149017 | -87.723503 | 14.424133 |
| Guatemala | -92.229249 | 13.735338 | -88.225023 | 17.819326 |
| Belize | -89.229122 | 15.886938 | -88.106813 | 18.499982 |
| Puerto Rico | -67.242428 | 17.946553 | -65.591004 | 18.520601 |
| Jamaica | -78.337719 | 17.701116 | -76.199659 | 18.524218 |
| Cuba | -84.974911 | 19.855481 | -74.178025 | 23.188611 |
| Trinidad and Tobago | -61.950000 | 10.000000 | -60.895000 | 10.890000 |
The total_bounds field provides the bounding box for the entire dataset.
na_countries.total_bounds
array([-171.7911106 , 7.22054149, -12.20855 , 83.64513 ])
We can use the total bounds as input to the clip method to keep only the rows where the geometry falls within the total bounds.
earthquakes_na_bounds = earthquakes.clip(na_countries.total_bounds)
earthquakes_na_bounds
| year | month | day | latitude | longitude | name | magnitude | geometry | |
|---|---|---|---|---|---|---|---|---|
| id | ||||||||
| pr16228001 | 2016 | 8 | 15 | 17.0398 | -67.6287 | Puerto Rico | 3.3 | POINT (-67.62870 17.03980) |
| pr16236015 | 2016 | 8 | 23 | 17.5515 | -66.4619 | Puerto Rico | 2.4 | POINT (-66.46190 17.55150) |
| pr16227004 | 2016 | 8 | 14 | 17.7280 | -67.0207 | Puerto Rico | 2.5 | POINT (-67.02070 17.72800) |
| pr16226007 | 2016 | 8 | 13 | 17.7597 | -66.8927 | Puerto Rico | 2.6 | POINT (-66.89270 17.75970) |
| pr16226009 | 2016 | 8 | 13 | 17.7644 | -66.9978 | Puerto Rico | 2.3 | POINT (-66.99780 17.76440) |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| ak13835344 | 2016 | 8 | 7 | 69.3298 | -144.0389 | Alaska | 1.4 | POINT (-144.03890 69.32980) |
| ak13875730 | 2016 | 8 | 23 | 69.5927 | -143.5277 | Alaska | 2.3 | POINT (-143.52770 69.59270) |
| ak13859105 | 2016 | 8 | 20 | 69.6423 | -144.8114 | Alaska | 1.9 | POINT (-144.81140 69.64230) |
| ak13839206 | 2016 | 8 | 9 | 69.7401 | -146.3768 | Alaska | 1.7 | POINT (-146.37680 69.74010) |
| ak13841660 | 2016 | 8 | 12 | 70.7787 | -145.8338 | Alaska | 2.6 | POINT (-145.83380 70.77870) |
7621 rows × 8 columns
fig, ax = plt.subplots()
na_countries = countries[countries["CONTINENT"] == "North America"]
na_countries.plot(ax=ax, color="#EEE")
earthquakes_na_bounds.plot(ax=ax, column="magnitude", markersize=0.1)
ax.set(title="Earthquakes in North America (over land or sea)")
ax.set_axis_off()
Spatial join¶
But what if we only want to plot the earthquakes that appeared over land—not over sea? A spatial join allows us to specify geospatial operations to link two datasets so that we can find all the earthquakes that were located in a North American country. This allows us to provide more precise intersection queries.
earthquakes
| year | month | day | latitude | longitude | name | magnitude | geometry | |
|---|---|---|---|---|---|---|---|---|
| id | ||||||||
| nc72666881 | 2016 | 7 | 27 | 37.672333 | -121.619000 | California | 1.43 | POINT (-121.61900 37.67233) |
| us20006i0y | 2016 | 7 | 27 | 21.514600 | 94.572100 | Burma | 4.90 | POINT (94.57210 21.51460) |
| nc72666891 | 2016 | 7 | 27 | 37.576500 | -118.859167 | California | 0.06 | POINT (-118.85917 37.57650) |
| nc72666896 | 2016 | 7 | 27 | 37.595833 | -118.994833 | California | 0.40 | POINT (-118.99483 37.59583) |
| nn00553447 | 2016 | 7 | 27 | 39.377500 | -119.845000 | Nevada | 0.30 | POINT (-119.84500 39.37750) |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| nc72685246 | 2016 | 8 | 25 | 36.515499 | -121.099831 | California | 2.42 | POINT (-121.09983 36.51550) |
| ak13879193 | 2016 | 8 | 25 | 61.498400 | -149.862700 | Alaska | 1.40 | POINT (-149.86270 61.49840) |
| nc72685251 | 2016 | 8 | 25 | 38.805000 | -122.821503 | California | 1.06 | POINT (-122.82150 38.80500) |
| ci37672328 | 2016 | 8 | 25 | 34.308000 | -118.635333 | California | 1.55 | POINT (-118.63533 34.30800) |
| ci37672360 | 2016 | 8 | 25 | 34.119167 | -116.933667 | California | 0.89 | POINT (-116.93367 34.11917) |
8394 rows × 8 columns
earthquakes_na_countries = earthquakes.sjoin(na_countries)
earthquakes_na_countries
| year | month | day | latitude | longitude | name | magnitude | geometry | index_right | POP_EST | GDP_MD | CONTINENT | SUBREGION | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| id | |||||||||||||
| nc72666881 | 2016 | 7 | 27 | 37.672333 | -121.619000 | California | 1.43 | POINT (-121.61900 37.67233) | United States of America | 328239523.0 | 21433226 | North America | Northern America |
| nc72666891 | 2016 | 7 | 27 | 37.576500 | -118.859167 | California | 0.06 | POINT (-118.85917 37.57650) | United States of America | 328239523.0 | 21433226 | North America | Northern America |
| nc72666896 | 2016 | 7 | 27 | 37.595833 | -118.994833 | California | 0.40 | POINT (-118.99483 37.59583) | United States of America | 328239523.0 | 21433226 | North America | Northern America |
| nn00553447 | 2016 | 7 | 27 | 39.377500 | -119.845000 | Nevada | 0.30 | POINT (-119.84500 39.37750) | United States of America | 328239523.0 | 21433226 | North America | Northern America |
| ak13805337 | 2016 | 7 | 27 | 61.296300 | -152.460000 | Alaska | 1.80 | POINT (-152.46000 61.29630) | United States of America | 328239523.0 | 21433226 | North America | Northern America |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| us100069px | 2016 | 8 | 3 | 11.231700 | -85.861100 | Nicaragua | 4.70 | POINT (-85.86110 11.23170) | Nicaragua | 6545502.0 | 12520 | North America | Central America |
| us10006bx6 | 2016 | 8 | 7 | 10.607400 | -83.831600 | Costa Rica | 4.50 | POINT (-83.83160 10.60740) | Costa Rica | 5047561.0 | 61801 | North America | Central America |
| us10006c9f | 2016 | 8 | 9 | 8.494000 | -82.980300 | Costa Rica | 4.50 | POINT (-82.98030 8.49400) | Costa Rica | 5047561.0 | 61801 | North America | Central America |
| us10006ek2 | 2016 | 8 | 17 | 9.632100 | -84.133500 | Costa Rica | 4.60 | POINT (-84.13350 9.63210) | Costa Rica | 5047561.0 | 61801 | North America | Central America |
| us10006dq1 | 2016 | 8 | 14 | 13.987600 | -89.917600 | El Salvador | 4.80 | POINT (-89.91760 13.98760) | El Salvador | 6453553.0 | 27022 | North America | Central America |
6785 rows × 13 columns
The columns to the left of geometry are from the earthquakes dataset, while the columns to the right of geometry are linked-up from the na_countries dataset. Complete this sentence to describe how sjoin computed the above result.
For every row in the
earthquakesdataset,sjoin_________________________
By default, sjoin uses the keyword argument how="inner". We can also customize the join result with how="left" or how="right" too. Can you explain how the results differ?
earthquakes.sjoin(na_countries, how="inner")
| year | month | day | latitude | longitude | name | magnitude | geometry | index_right | POP_EST | GDP_MD | CONTINENT | SUBREGION | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| id | |||||||||||||
| nc72666881 | 2016 | 7 | 27 | 37.672333 | -121.619000 | California | 1.43 | POINT (-121.61900 37.67233) | United States of America | 328239523.0 | 21433226 | North America | Northern America |
| nc72666891 | 2016 | 7 | 27 | 37.576500 | -118.859167 | California | 0.06 | POINT (-118.85917 37.57650) | United States of America | 328239523.0 | 21433226 | North America | Northern America |
| nc72666896 | 2016 | 7 | 27 | 37.595833 | -118.994833 | California | 0.40 | POINT (-118.99483 37.59583) | United States of America | 328239523.0 | 21433226 | North America | Northern America |
| nn00553447 | 2016 | 7 | 27 | 39.377500 | -119.845000 | Nevada | 0.30 | POINT (-119.84500 39.37750) | United States of America | 328239523.0 | 21433226 | North America | Northern America |
| ak13805337 | 2016 | 7 | 27 | 61.296300 | -152.460000 | Alaska | 1.80 | POINT (-152.46000 61.29630) | United States of America | 328239523.0 | 21433226 | North America | Northern America |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| us100069px | 2016 | 8 | 3 | 11.231700 | -85.861100 | Nicaragua | 4.70 | POINT (-85.86110 11.23170) | Nicaragua | 6545502.0 | 12520 | North America | Central America |
| us10006bx6 | 2016 | 8 | 7 | 10.607400 | -83.831600 | Costa Rica | 4.50 | POINT (-83.83160 10.60740) | Costa Rica | 5047561.0 | 61801 | North America | Central America |
| us10006c9f | 2016 | 8 | 9 | 8.494000 | -82.980300 | Costa Rica | 4.50 | POINT (-82.98030 8.49400) | Costa Rica | 5047561.0 | 61801 | North America | Central America |
| us10006ek2 | 2016 | 8 | 17 | 9.632100 | -84.133500 | Costa Rica | 4.60 | POINT (-84.13350 9.63210) | Costa Rica | 5047561.0 | 61801 | North America | Central America |
| us10006dq1 | 2016 | 8 | 14 | 13.987600 | -89.917600 | El Salvador | 4.80 | POINT (-89.91760 13.98760) | El Salvador | 6453553.0 | 27022 | North America | Central America |
6785 rows × 13 columns
# how="inner" only includes geometry matches,
# but how="left" and how="right" also include nearly-matching rows
# [What do you mean by "nearly-matching"? -- Geopandas does exact intersections!]
# how="inner" only includes geometry matches (and only include matching rows),
# but how="left" and how="right" also include non-matching rows
# how="inner" includes all columns from both datasets,
# but how="left" and how="right" don't include all columns
# how="inner" requires overlapping indexes [index != geometry],
# but how="left" and how="right" don't need exact matching indexes
Finally, let's plot all the earthquakes that occurred over land in North America.
fig, ax = plt.subplots()
na_countries = countries[countries["CONTINENT"] == "North America"]
na_countries.plot(ax=ax, color="#EEE")
earthquakes_na_countries.plot(ax=ax, column="magnitude", markersize=0.1)
ax.set(title="Earthquakes in North America (over land)")
ax.set_axis_off()
Attribute joins¶
The idea of joining two datasets isn't exclusive to geospatial data. In fact, in many data-centric contexts we may need to combine multiple datasets. Just as sjoin combines two geospatial datasets on their geometry columns using geometric intersection, merge combines two tabular datasets on specific columns using exact == matches.
Consider the following two datasets of movies and directors. Not all directors have a movie listed, and not all movies have a corresponding director in the dataset.
movies = pd.DataFrame([
{"movie_id": 51, "movie_name": "Lady Bird", "year": 2017, "director_id": 23},
{"movie_id": 47, "movie_name": "Grand Budapest Hotel", "year": 2014, "director_id": 16},
{"movie_id": 103, "movie_name": "Parasite", "year": 2019, "director_id": 14},
{"movie_id": 34, "movie_name": "Frozen", "year": 2013, "director_id": 18},
{"movie_id": 37, "movie_name": "Moonrise Kingdom", "year": 2012, "director_id": 16},
])
movies
| movie_id | movie_name | year | director_id | |
|---|---|---|---|---|
| 0 | 51 | Lady Bird | 2017 | 23 |
| 1 | 47 | Grand Budapest Hotel | 2014 | 16 |
| 2 | 103 | Parasite | 2019 | 14 |
| 3 | 34 | Frozen | 2013 | 18 |
| 4 | 37 | Moonrise Kingdom | 2012 | 16 |
directors = pd.DataFrame([
{"director_id": 14, "director_name": "Bong Joon Ho"},
{"director_id": 23, "director_name": "Greta Gerwig"},
{"director_id": 16, "director_name": "Wes Anderson"},
{"director_id": 21, "director_name": "Quentin Tarantino"},
{"director_id": 27, "director_name": "Kathryn Bigelow"},
])
directors
| director_id | director_name | |
|---|---|---|
| 0 | 14 | Bong Joon Ho |
| 1 | 23 | Greta Gerwig |
| 2 | 16 | Wes Anderson |
| 3 | 21 | Quentin Tarantino |
| 4 | 27 | Kathryn Bigelow |
Before running the following cell, predict the output if the default attribute join type is how="inner". Visualize the procedure using PandasTutor. What would happen if we changed the join type?
movies.merge(directors, on="director_id")
| movie_id | movie_name | year | director_id | director_name | |
|---|---|---|---|---|---|
| 0 | 51 | Lady Bird | 2017 | 23 | Greta Gerwig |
| 1 | 47 | Grand Budapest Hotel | 2014 | 16 | Wes Anderson |
| 2 | 37 | Moonrise Kingdom | 2012 | 16 | Wes Anderson |
| 3 | 103 | Parasite | 2019 | 14 | Bong Joon Ho |
For these two datasets, the column names are the same so we can use the on keyword argument. If the column names are not the same, specify the column name in the left with left_on and the column name in the right with right_on.
Interactive maps¶
Geopandas supports interactive maps with the folium library.
!pip install -q folium mapclassify
Generating an interactive map is as simple as generating a static map: instead of calling plot, call explore.
earthquakes.explore(column="magnitude")