Groupby and Indexing¶
In this lesson, we'll learn about an important DataFrame operation called groupby. Along the way, we'll also discuss how the groupby operation adds an extra level of complexity to indexing and slicing values. By the end of this lesson, students will be able to:
- Apply the groupby operation to a list of dictionaries and to a pandas DataFrame.
- Select values from a hierarchical index using tuples and the slice object as keys.
- Apply the apply operation to a list of dictionaries and to a pandas DataFrame.
Previously, we learned how to find the largest earthquake in a dataset using both a list of dictionaries and a pandas DataFrame. How about finding the largest earthquake for each place in the dataset?
import doctest
import io
import pandas as pd
To help visualize our work, the following dataset contains the first 12 rows from earthquakes.csv.
csv = """
id,year,month,day,latitude,longitude,name,magnitude
nc72666881,2016,7,27,37.6723333,-121.619,California,1.43
us20006i0y,2016,7,27,21.5146,94.5721,Burma,4.9
nc72666891,2016,7,27,37.5765,-118.85916670000002,California,0.06
nc72666896,2016,7,27,37.5958333,-118.99483329999998,California,0.4
nn00553447,2016,7,27,39.3775,-119.845,Nevada,0.3
ak13805337,2016,7,27,61.2963,-152.46,Alaska,1.8
hv61354276,2016,7,27,19.4235,-155.60983330000005,Hawaii,1.0
ak13805339,2016,7,27,61.3019,-152.4507,Alaska,2.0
ci37640584,2016,7,27,35.503,-118.40583329999998,California,1.2
nc72666901,2016,7,27,37.673,-121.6133333,California,1.67
ci37640592,2016,7,27,33.5888333,-116.8165,California,0.48
nn00553416,2016,7,27,38.2638,-118.7351,Nevada,0.9
"""
earthquakes = pd.read_csv(io.StringIO(csv), index_col="id")
earthquakes
| id | year | month | day | latitude | longitude | name | magnitude |
|---|---|---|---|---|---|---|---|
| nc72666881 | 2016 | 7 | 27 | 37.672333 | -121.619000 | California | 1.43 |
| us20006i0y | 2016 | 7 | 27 | 21.514600 | 94.572100 | Burma | 4.90 |
| nc72666891 | 2016 | 7 | 27 | 37.576500 | -118.859167 | California | 0.06 |
| nc72666896 | 2016 | 7 | 27 | 37.595833 | -118.994833 | California | 0.40 |
| nn00553447 | 2016 | 7 | 27 | 39.377500 | -119.845000 | Nevada | 0.30 |
| ak13805337 | 2016 | 7 | 27 | 61.296300 | -152.460000 | Alaska | 1.80 |
| hv61354276 | 2016 | 7 | 27 | 19.423500 | -155.609833 | Hawaii | 1.00 |
| ak13805339 | 2016 | 7 | 27 | 61.301900 | -152.450700 | Alaska | 2.00 |
| ci37640584 | 2016 | 7 | 27 | 35.503000 | -118.405833 | California | 1.20 |
| nc72666901 | 2016 | 7 | 27 | 37.673000 | -121.613333 | California | 1.67 |
| ci37640592 | 2016 | 7 | 27 | 33.588833 | -116.816500 | California | 0.48 |
| nn00553416 | 2016 | 7 | 27 | 38.263800 | -118.735100 | Nevada | 0.90 |
Groupby in plain Python¶
Let's first see how we can solve this problem using the list of dictionaries approach.
max_mag_by_place = {}
for earthquake in earthquakes.to_dict("records"): # Convert to list of dictionaries
if earthquake["name"] not in max_mag_by_place or earthquake["magnitude"] > max_mag_by_place[earthquake["name"]]:
max_mag_by_place[earthquake["name"]] = earthquake["magnitude"]
max_mag_by_place
{'California': 1.67, 'Burma': 4.9, 'Nevada': 0.9, 'Alaska': 2.0, 'Hawaii': 1.0}
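This pattern generalizes: we can first split the records into groups, then summarize each group. As a sketch (our own illustration, not part of the original lesson), here is the same computation written in two explicit phases, collecting every magnitude per place before aggregating.
# Split: collect all magnitudes for each place.
mags_by_place = {}
for earthquake in earthquakes.to_dict("records"):
    mags_by_place.setdefault(earthquake["name"], []).append(earthquake["magnitude"])

# Combine: summarize each group with max().
{place: max(mags) for place, mags in mags_by_place.items()}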
Groupby in Pandas¶
The inventors of pandas defined a DataFrame function called groupby to streamline this operation into a single expression.
earthquakes.groupby("name")["magnitude"].max()
name
Alaska        2.00
Burma         4.90
California    1.67
Hawaii        1.00
Nevada        0.90
Name: magnitude, dtype: float64
What's going on here? We can take a closer look at each step of the process in PandasTutor. In summary, this expression:
- Calls earthquakes.groupby("name") to split the earthquakes into groups by "name".
- For each group, selects the column "magnitude" indicated in square brackets.
- Combines (summarizes) each group on the selected column using the max() function.
groupby helps us quickly answer questions that involve grouping by one or more columns and then summarizing the data in another column.
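To make the "split" step concrete, here is a short sketch (our own illustration, not from the original lesson): iterating over a GroupBy object yields each group key together with a sub-DataFrame of that group's rows.
# Iterating over a GroupBy object yields (key, sub-DataFrame) pairs,
# making the "split" step visible before any summarizing happens.
for place, group in earthquakes.groupby("name"):
    print(place, "->", len(group), "rows, max magnitude:", group["magnitude"].max())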
The best part about pandas groupby is that it allows us to quickly answer many different kinds of questions following the same format. For example, suppose we want to compute descriptive statistics for all the earthquake magnitudes that occurred on each day. Let's read the full dataset and try it out.
earthquakes = pd.read_csv("earthquakes.csv", index_col="id")
earthquakes
| id | year | month | day | latitude | longitude | name | magnitude |
|---|---|---|---|---|---|---|---|
| nc72666881 | 2016 | 7 | 27 | 37.672333 | -121.619000 | California | 1.43 |
| us20006i0y | 2016 | 7 | 27 | 21.514600 | 94.572100 | Burma | 4.90 |
| nc72666891 | 2016 | 7 | 27 | 37.576500 | -118.859167 | California | 0.06 |
| nc72666896 | 2016 | 7 | 27 | 37.595833 | -118.994833 | California | 0.40 |
| nn00553447 | 2016 | 7 | 27 | 39.377500 | -119.845000 | Nevada | 0.30 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| nc72685246 | 2016 | 8 | 25 | 36.515499 | -121.099831 | California | 2.42 |
| ak13879193 | 2016 | 8 | 25 | 61.498400 | -149.862700 | Alaska | 1.40 |
| nc72685251 | 2016 | 8 | 25 | 38.805000 | -122.821503 | California | 1.06 |
| ci37672328 | 2016 | 8 | 25 | 34.308000 | -118.635333 | California | 1.55 |
| ci37672360 | 2016 | 8 | 25 | 34.119167 | -116.933667 | California | 0.89 |
8394 rows × 7 columns
magnitudes_per_day = earthquakes.groupby(["year", "month", "day"])["magnitude"].describe()
magnitudes_per_day
| year | month | day | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|---|---|
| 2016 | 7 | 27 | 272.0 | 1.617574 | 1.095349 | 0.06 | 0.9000 | 1.400 | 2.0000 | 5.60 |
| | | 28 | 308.0 | 1.448149 | 0.896851 | 0.10 | 0.8775 | 1.240 | 1.8000 | 5.10 |
| | | 29 | 309.0 | 1.640129 | 1.165952 | 0.01 | 0.8800 | 1.450 | 1.9000 | 7.70 |
| | | 30 | 329.0 | 1.615076 | 1.262618 | 0.03 | 0.7000 | 1.240 | 2.0000 | 5.70 |
| | | 31 | 278.0 | 1.750827 | 1.261577 | 0.10 | 0.9000 | 1.500 | 2.1475 | 5.90 |
| | 8 | 1 | 356.0 | 1.520056 | 1.157326 | 0.04 | 0.8000 | 1.245 | 1.8025 | 6.10 |
| | | 2 | 292.0 | 1.539418 | 1.089946 | 0.05 | 0.8000 | 1.300 | 1.9000 | 5.50 |
| | | 3 | 294.0 | 1.556327 | 1.147365 | 0.01 | 0.8300 | 1.200 | 1.8150 | 5.10 |
| | | 4 | 420.0 | 1.249190 | 1.034738 | 0.05 | 0.6000 | 1.000 | 1.5825 | 6.30 |
| | | 5 | 256.0 | 1.428789 | 1.144244 | 0.10 | 0.6200 | 1.185 | 1.7150 | 5.70 |
| | | 6 | 316.0 | 1.313228 | 1.065587 | 0.09 | 0.5600 | 1.100 | 1.6200 | 5.40 |
| | | 7 | 316.0 | 1.356994 | 1.078556 | 0.10 | 0.6000 | 1.120 | 1.7425 | 5.10 |
| | | 8 | 335.0 | 1.484925 | 1.131495 | 0.02 | 0.6300 | 1.200 | 1.9900 | 5.10 |
| | | 9 | 272.0 | 1.614779 | 1.164186 | 0.10 | 0.8075 | 1.300 | 1.9000 | 5.09 |
| | | 10 | 329.0 | 1.404742 | 1.038701 | 0.02 | 0.7700 | 1.170 | 1.7400 | 5.40 |
| | | 11 | 356.0 | 1.390534 | 1.159147 | 0.04 | 0.6775 | 1.100 | 1.7000 | 7.20 |
| | | 12 | 326.0 | 1.533282 | 1.158696 | 0.04 | 0.7400 | 1.200 | 1.9000 | 5.40 |
| | | 13 | 284.0 | 1.421901 | 1.080338 | 0.10 | 0.7000 | 1.105 | 1.7775 | 5.20 |
| | | 14 | 231.0 | 1.692684 | 1.372191 | 0.05 | 0.8200 | 1.200 | 1.9000 | 5.90 |
| | | 15 | 222.0 | 1.583964 | 1.157553 | 0.07 | 0.8550 | 1.300 | 1.8400 | 5.40 |
| | | 16 | 223.0 | 1.629910 | 1.223131 | 0.10 | 0.7300 | 1.300 | 2.0000 | 5.10 |
| | | 17 | 220.0 | 1.583682 | 1.203617 | 0.10 | 0.8000 | 1.210 | 1.9675 | 5.20 |
| | | 18 | 219.0 | 1.499772 | 1.159497 | 0.04 | 0.7600 | 1.200 | 1.8050 | 5.90 |
| | | 19 | 226.0 | 1.819469 | 1.416291 | 0.11 | 0.8925 | 1.340 | 2.1900 | 7.40 |
| | | 20 | 237.0 | 1.553207 | 1.296262 | 0.02 | 0.7000 | 1.200 | 2.0000 | 6.40 |
| | | 21 | 266.0 | 1.332368 | 1.032210 | 0.03 | 0.6050 | 1.140 | 1.6925 | 5.10 |
| | | 22 | 215.0 | 1.451488 | 1.185657 | 0.01 | 0.7000 | 1.180 | 1.8000 | 5.60 |
| | | 23 | 233.0 | 1.643391 | 1.245661 | 0.04 | 0.8200 | 1.300 | 2.1000 | 6.20 |
| | | 24 | 216.0 | 1.553194 | 1.144054 | 0.03 | 0.8375 | 1.290 | 1.9000 | 6.80 |
| | | 25 | 238.0 | 1.519328 | 0.926028 | 0.10 | 0.8900 | 1.400 | 1.9950 | 5.90 |
Explain in your own words the result of the following code snippet.
# for each place ("name"), the maximum latitude among its earthquakes
earthquakes.groupby("name")["latitude"].max()
name
Afghanistan 36.634500
Alaska 70.778700
Anguilla 18.435800
Argentina -22.394200
Arizona 36.811667
...
Washington 48.965667
West Virginia 37.863000
Western Indian-Antarctic Ridge -49.281000
Western Xizang 34.444600
Wyoming 44.749000
Name: latitude, Length: 118, dtype: float64
Hierarchical indexing¶
If you look closely at the magnitudes_per_day DataFrame, you'll notice something interesting: there are three index columns in bold on the left to denote each year, month, and day group. In pandas, a DataFrame can have a hierarchical (aka multi-level) index called a MultiIndex.
magnitudes_per_day.index
MultiIndex([(2016, 7, 27),
(2016, 7, 28),
(2016, 7, 29),
(2016, 7, 30),
(2016, 7, 31),
(2016, 8, 1),
(2016, 8, 2),
(2016, 8, 3),
(2016, 8, 4),
(2016, 8, 5),
(2016, 8, 6),
(2016, 8, 7),
(2016, 8, 8),
(2016, 8, 9),
(2016, 8, 10),
(2016, 8, 11),
(2016, 8, 12),
(2016, 8, 13),
(2016, 8, 14),
(2016, 8, 15),
(2016, 8, 16),
(2016, 8, 17),
(2016, 8, 18),
(2016, 8, 19),
(2016, 8, 20),
(2016, 8, 21),
(2016, 8, 22),
(2016, 8, 23),
(2016, 8, 24),
(2016, 8, 25)],
names=['year', 'month', 'day'])
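As an aside, a MultiIndex can also be built by hand. Here is a minimal sketch (our own illustration) constructing a small three-level index from tuples with pd.MultiIndex.from_tuples:
# Build a small MultiIndex directly from (year, month, day) tuples.
pd.MultiIndex.from_tuples(
    [(2016, 7, 27), (2016, 7, 28), (2016, 8, 1)],
    names=["year", "month", "day"],
)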
We can select values from a MultiIndex by passing a tuple to .loc. However, the syntax is somewhat unusual, particularly when combined with slicing, due to limitations in the Python programming language. For each example below, predict the output type (single value, 1-d Series, or 2-d DataFrame) as well as the contents of the output before running it.
magnitudes_per_day
| year | month | day | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|---|---|
| 2016 | 7 | 27 | 272.0 | 1.617574 | 1.095349 | 0.06 | 0.9000 | 1.400 | 2.0000 | 5.60 |
| | | 28 | 308.0 | 1.448149 | 0.896851 | 0.10 | 0.8775 | 1.240 | 1.8000 | 5.10 |
| | | 29 | 309.0 | 1.640129 | 1.165952 | 0.01 | 0.8800 | 1.450 | 1.9000 | 7.70 |
| | | 30 | 329.0 | 1.615076 | 1.262618 | 0.03 | 0.7000 | 1.240 | 2.0000 | 5.70 |
| | | 31 | 278.0 | 1.750827 | 1.261577 | 0.10 | 0.9000 | 1.500 | 2.1475 | 5.90 |
| | 8 | 1 | 356.0 | 1.520056 | 1.157326 | 0.04 | 0.8000 | 1.245 | 1.8025 | 6.10 |
| | | 2 | 292.0 | 1.539418 | 1.089946 | 0.05 | 0.8000 | 1.300 | 1.9000 | 5.50 |
| | | 3 | 294.0 | 1.556327 | 1.147365 | 0.01 | 0.8300 | 1.200 | 1.8150 | 5.10 |
| | | 4 | 420.0 | 1.249190 | 1.034738 | 0.05 | 0.6000 | 1.000 | 1.5825 | 6.30 |
| | | 5 | 256.0 | 1.428789 | 1.144244 | 0.10 | 0.6200 | 1.185 | 1.7150 | 5.70 |
| | | 6 | 316.0 | 1.313228 | 1.065587 | 0.09 | 0.5600 | 1.100 | 1.6200 | 5.40 |
| | | 7 | 316.0 | 1.356994 | 1.078556 | 0.10 | 0.6000 | 1.120 | 1.7425 | 5.10 |
| | | 8 | 335.0 | 1.484925 | 1.131495 | 0.02 | 0.6300 | 1.200 | 1.9900 | 5.10 |
| | | 9 | 272.0 | 1.614779 | 1.164186 | 0.10 | 0.8075 | 1.300 | 1.9000 | 5.09 |
| | | 10 | 329.0 | 1.404742 | 1.038701 | 0.02 | 0.7700 | 1.170 | 1.7400 | 5.40 |
| | | 11 | 356.0 | 1.390534 | 1.159147 | 0.04 | 0.6775 | 1.100 | 1.7000 | 7.20 |
| | | 12 | 326.0 | 1.533282 | 1.158696 | 0.04 | 0.7400 | 1.200 | 1.9000 | 5.40 |
| | | 13 | 284.0 | 1.421901 | 1.080338 | 0.10 | 0.7000 | 1.105 | 1.7775 | 5.20 |
| | | 14 | 231.0 | 1.692684 | 1.372191 | 0.05 | 0.8200 | 1.200 | 1.9000 | 5.90 |
| | | 15 | 222.0 | 1.583964 | 1.157553 | 0.07 | 0.8550 | 1.300 | 1.8400 | 5.40 |
| | | 16 | 223.0 | 1.629910 | 1.223131 | 0.10 | 0.7300 | 1.300 | 2.0000 | 5.10 |
| | | 17 | 220.0 | 1.583682 | 1.203617 | 0.10 | 0.8000 | 1.210 | 1.9675 | 5.20 |
| | | 18 | 219.0 | 1.499772 | 1.159497 | 0.04 | 0.7600 | 1.200 | 1.8050 | 5.90 |
| | | 19 | 226.0 | 1.819469 | 1.416291 | 0.11 | 0.8925 | 1.340 | 2.1900 | 7.40 |
| | | 20 | 237.0 | 1.553207 | 1.296262 | 0.02 | 0.7000 | 1.200 | 2.0000 | 6.40 |
| | | 21 | 266.0 | 1.332368 | 1.032210 | 0.03 | 0.6050 | 1.140 | 1.6925 | 5.10 |
| | | 22 | 215.0 | 1.451488 | 1.185657 | 0.01 | 0.7000 | 1.180 | 1.8000 | 5.60 |
| | | 23 | 233.0 | 1.643391 | 1.245661 | 0.04 | 0.8200 | 1.300 | 2.1000 | 6.20 |
| | | 24 | 216.0 | 1.553194 | 1.144054 | 0.03 | 0.8375 | 1.290 | 1.9000 | 6.80 |
| | | 25 | 238.0 | 1.519328 | 0.926028 | 0.10 | 0.8900 | 1.400 | 1.9950 | 5.90 |
magnitudes_per_day.loc[(2016, 7, 27), "count"]
272.0
magnitudes_per_day.loc[:, "count"]
year month day
2016 7 27 272.0
28 308.0
29 309.0
30 329.0
31 278.0
8 1 356.0
2 292.0
3 294.0
4 420.0
5 256.0
6 316.0
7 316.0
8 335.0
9 272.0
10 329.0
11 356.0
12 326.0
13 284.0
14 231.0
15 222.0
16 223.0
17 220.0
18 219.0
19 226.0
20 237.0
21 266.0
22 215.0
23 233.0
24 216.0
25 238.0
Name: count, dtype: float64
magnitudes_per_day.loc[(2016, 8, 10:15), "count"]
Cell In[12], line 1 magnitudes_per_day.loc[(2016, 8, 10:15), "count"] ^ SyntaxError: invalid syntax
The colon shorthand for slices is only valid directly inside square brackets, so within a tuple we must construct the slice object ourselves. Calling slice(10, 15) produces the same object as the shorthand 10:15.
slice(10, 15)
slice(10, 15, None)
magnitudes_per_day.loc[(2016, 8, slice(10, 15)), "count"]  # .loc slices include both endpoints
year month day
2016 8 10 329.0
11 356.0
12 326.0
13 284.0
14 231.0
15 222.0
Name: count, dtype: float64
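As an alternative to writing slice objects by hand, pandas provides the pd.IndexSlice helper, which lets us use the colon shorthand inside .loc. A brief sketch of the equivalent query:
# pd.IndexSlice converts the colon shorthand into slice objects for us.
idx = pd.IndexSlice
magnitudes_per_day.loc[idx[2016, 8, 10:15], "count"]  # same rows as above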
magnitudes_per_day.loc[[(2016, 8, 10), (2016, 8, 15)], "count"]  # a list of tuples selects specific rows
year month day
2016 8 10 329.0
15 222.0
Name: count, dtype: float64
magnitudes_per_day.loc[magnitudes_per_day["count"] < 220, "count"]
year month day
2016 8 18 219.0
22 215.0
24 216.0
Name: count, dtype: float64
Poll question: what if we want to get all dates after 2016-08-10?
"magnitude"[2:]
'gnitude'
magnitudes_per_day.loc[(2016, 8, slice(10, None)), "count"]  # slice(10, None) selects day 10 onward
year month day
2016 8 10 329.0
11 356.0
12 326.0
13 284.0
14 231.0
15 222.0
16 223.0
17 220.0
18 219.0
19 226.0
20 237.0
21 266.0
22 215.0
23 233.0
24 216.0
25 238.0
Name: count, dtype: float64
magnitudes_per_day.loc[(2016, 8, slice(10)), "count"]  # slice(10) means slice(None, 10): up to and including day 10
year month day
2016 8 1 356.0
2 292.0
3 294.0
4 420.0
5 256.0
6 316.0
7 316.0
8 335.0
9 272.0
10 329.0
Name: count, dtype: float64
What if we try to write the colon shorthand for the slice directly?
magnitudes_per_day.loc[2016, 8, 10:, "count"]
magnitudes_per_day.loc[(2016, 8, 10:), "count"]
Cell In[23], line 1 magnitudes_per_day.loc[(2016, 8, 10:), "count"] ^ SyntaxError: invalid syntax
Practice: UFO sightings¶
UFO (unidentified flying object) sightings have received attention from the US Congress in the past couple of years. We've collected a public dataset consisting of 1001 reported UFO sightings from around the world to help us practice groupby operations.
ufos = pd.read_csv("ufos.csv", index_col="datetime")
ufos
| datetime | city | state | country | shape | duration (seconds) | duration (hours/min) | comments | date posted | latitude | longitude |
|---|---|---|---|---|---|---|---|---|---|---|
| 10/10/1949 20:30 | san marcos | tx | us | cylinder | 2700.0 | 45 minutes | This event took place in early fall around 194... | 4/27/2004 | 29.883056 | -97.941111 |
| 10/10/1949 21:00 | lackland afb | tx | NaN | light | 7200.0 | 1-2 hrs | 1949 Lackland AFB, TX. Lights racing acros... | 12/16/2005 | 29.384210 | -98.581082 |
| 10/10/1955 17:00 | chester (uk/england) | NaN | gb | circle | 20.0 | 20 seconds | Green/Orange circular disc over Chester, En... | 1/21/2008 | 53.200000 | -2.916667 |
| 10/10/1956 21:00 | edna | tx | us | circle | 20.0 | 1/2 hour | My older brother and twin sister were leaving ... | 1/17/2004 | 28.978333 | -96.645833 |
| 10/10/1960 20:00 | kaneohe | hi | us | light | 900.0 | 15 minutes | AS a Marine 1st Lt. flying an FJ4B fighter/att... | 1/22/2004 | 21.418056 | -157.803611 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 10/12/1982 05:30 | pearlington | ms | us | light | 29.0 | 29 seconds | Round light which was observed going into spac... | 12/12/2013 | 30.246389 | -89.611111 |
| 10/12/1985 23:45 | swansboro | nc | us | disk | 300.0 | 23:45 - 23:50 | My sister and I observed a disk for about 5 mi... | 12/2/2000 | 34.687500 | -77.119444 |
| 10/12/1988 16:30 | melbourne (vic, australia) | NaN | au | cigar | 900.0 | 15 min's | Large cigar shaped craft,flying sideways... | 9/29/2004 | -37.813938 | 144.963425 |
| 10/12/1994 11:55 | schenectady | ny | us | other | 120.0 | a few minutes | I had just picked up my Daugther (2yrs old) fr... | 4/1/2001 | 42.814167 | -73.940000 |
| 10/12/1994 15:00 | monticello | ky | us | chevron | 120.0 | 1-2 minutes | Triangular/chevron small object with fixed lig... | 10/30/2006 | 36.829722 | -84.849167 |
1001 rows × 10 columns
Compute the average (mean) "duration (seconds)" for each UFO "shape".
ufos.groupby("shape")["duration (seconds)"].mean()
shape
changing     9.265600e+02
chevron      3.111250e+02
cigar        8.217407e+02
circle       6.804353e+02
cone         3.000000e+02
cross        6.000000e+01
cylinder     1.499556e+03
delta        1.440000e+04
diamond      1.140300e+03
disk         1.143329e+03
egg          3.088000e+03
fireball     3.467656e+02
flash        4.639091e+02
formation    1.913088e+03
light        1.122005e+03
other        9.259301e+04
oval         1.425136e+03
rectangle    2.285882e+02
sphere       1.506268e+06
teardrop     1.397143e+02
triangle     7.352900e+02
unknown      1.207963e+03
Name: duration (seconds), dtype: float64
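To rank the shapes by mean duration, one option (a sketch using the standard sort_values method) is to chain a sort onto the result:
# Sort the mean durations from longest to shortest.
ufos.groupby("shape")["duration (seconds)"].mean().sort_values(ascending=False)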
Since we're focusing on the US Congress, identify the UFO sighting with the longest "duration (seconds)" for each "city" in the US ("us"). Do not include any cities outside the US.
ufos[ufos["country"] == "us"].groupby("city")["duration (seconds)"].max()
city
acton 180.0
addison (i-355 and us 20 (lake st.) 600.0
albany 120.0
albuquerque 3600.0
algona 3600.0
...
wolfforth 300.0
worcester 4.0
yakima 240.0
york 15.0
yuma 900.0
Name: duration (seconds), Length: 627, dtype: float64
ufos.loc[ufos["country"] == "us"]
| datetime | city | state | country | shape | duration (seconds) | duration (hours/min) | comments | date posted | latitude | longitude | word count |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 10/10/1949 20:30 | san marcos | tx | us | cylinder | 2700.0 | 45 minutes | This event took place in early fall around 194... | 4/27/2004 | 29.883056 | -97.941111 | 24 |
| 10/10/1956 21:00 | edna | tx | us | circle | 20.0 | 1/2 hour | My older brother and twin sister were leaving ... | 1/17/2004 | 28.978333 | -96.645833 | 26 |
| 10/10/1960 20:00 | kaneohe | hi | us | light | 900.0 | 15 minutes | AS a Marine 1st Lt. flying an FJ4B fighter/att... | 1/22/2004 | 21.418056 | -157.803611 | 25 |
| 10/10/1961 19:00 | bristol | tn | us | sphere | 300.0 | 5 minutes | My father is now 89 my brother 52 the girl wit... | 4/27/2007 | 36.595000 | -82.188889 | 29 |
| 10/10/1965 23:45 | norwalk | ct | us | disk | 1200.0 | 20 minutes | A bright orange color changing to reddish colo... | 10/2/1999 | 41.117500 | -73.408333 | 16 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 10/12/1982 05:30 | pearlington | ms | us | light | 20.0 | 20 seconds | While fishing on the east pearl river which se... | 12/12/2013 | 30.246389 | -89.611111 | 25 |
| 10/12/1982 05:30 | pearlington | ms | us | light | 29.0 | 29 seconds | Round light which was observed going into spac... | 12/12/2013 | 30.246389 | -89.611111 | 18 |
| 10/12/1985 23:45 | swansboro | nc | us | disk | 300.0 | 23:45 - 23:50 | My sister and I observed a disk for about 5 mi... | 12/2/2000 | 34.687500 | -77.119444 | 16 |
| 10/12/1994 11:55 | schenectady | ny | us | other | 120.0 | a few minutes | I had just picked up my Daugther (2yrs old) fr... | 4/1/2001 | 42.814167 | -73.940000 | 28 |
| 10/12/1994 15:00 | monticello | ky | us | chevron | 120.0 | 1-2 minutes | Triangular/chevron small object with fixed lig... | 10/30/2006 | 36.829722 | -84.849167 | 10 |
823 rows × 11 columns
What is the name of the "city" that has the largest count of UFO sightings?
ufos.groupby("city")["city"].count().idxmax()
'seattle'
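As a sketch of an equivalent approach (ties aside), value_counts() tallies each distinct value in a Series, so idxmax() on its result should return the same city:
# value_counts() counts occurrences of each city; idxmax() returns the
# label with the largest count.
ufos["city"].value_counts().idxmax()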
String accessor functions¶
In data science, many tasks involve string data. In plain Python, we know that we can call string functions like split() to split a string on whitespace or find(target) to find the index at which a target appears in a string.
To help improve the readability of code, the inventors of pandas provide these functions as element-wise operations, but hide them behind a special .str string accessor, as in s.str.split().
ufos["comments"].str.split()
datetime
10/10/1949 20:30 [This, event, took, place, in, early, fall, ar...
10/10/1949 21:00 [1949, Lackland, AFB,, TX., Lights, racing,...
10/10/1955 17:00 [Green/Orange, circular, disc, over, Chester&#...
10/10/1956 21:00 [My, older, brother, and, twin, sister, were, ...
10/10/1960 20:00 [AS, a, Marine, 1st, Lt., flying, an, FJ4B, fi...
...
10/12/1982 05:30 [Round, light, which, was, observed, going, in...
10/12/1985 23:45 [My, sister, and, I, observed, a, disk, for, a...
10/12/1988 16:30 [Large, cigar, shaped, craft,flying, sidewa...
10/12/1994 11:55 [I, had, just, picked, up, my, Daugther, (2yrs...
10/12/1994 15:00 [Triangular/chevron, small, object, with, fixe...
Name: comments, Length: 1001, dtype: object
The above expression splits each comment by whitespace. This isn't too useful on its own, but we can then compute the length of each list to find the number of words in each comment.
ufos["comments"].str.split().str.len()
datetime
10/10/1949 20:30 24
10/10/1949 21:00 17
10/10/1955 17:00 6
10/10/1956 21:00 26
10/10/1960 20:00 25
..
10/12/1982 05:30 18
10/12/1985 23:45 16
10/12/1988 16:30 8
10/12/1994 11:55 28
10/12/1994 15:00 10
Name: comments, Length: 1001, dtype: int64
These functions don't modify the original DataFrame. To add the result as a new column in the original DataFrame, use an assignment statement.
ufos["word count"] = ufos["comments"].str.split().str.len()
ufos
| datetime | city | state | country | shape | duration (seconds) | duration (hours/min) | comments | date posted | latitude | longitude | word count |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 10/10/1949 20:30 | san marcos | tx | us | cylinder | 2700.0 | 45 minutes | This event took place in early fall around 194... | 4/27/2004 | 29.883056 | -97.941111 | 24 |
| 10/10/1949 21:00 | lackland afb | tx | NaN | light | 7200.0 | 1-2 hrs | 1949 Lackland AFB, TX. Lights racing acros... | 12/16/2005 | 29.384210 | -98.581082 | 17 |
| 10/10/1955 17:00 | chester (uk/england) | NaN | gb | circle | 20.0 | 20 seconds | Green/Orange circular disc over Chester, En... | 1/21/2008 | 53.200000 | -2.916667 | 6 |
| 10/10/1956 21:00 | edna | tx | us | circle | 20.0 | 1/2 hour | My older brother and twin sister were leaving ... | 1/17/2004 | 28.978333 | -96.645833 | 26 |
| 10/10/1960 20:00 | kaneohe | hi | us | light | 900.0 | 15 minutes | AS a Marine 1st Lt. flying an FJ4B fighter/att... | 1/22/2004 | 21.418056 | -157.803611 | 25 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 10/12/1982 05:30 | pearlington | ms | us | light | 29.0 | 29 seconds | Round light which was observed going into spac... | 12/12/2013 | 30.246389 | -89.611111 | 18 |
| 10/12/1985 23:45 | swansboro | nc | us | disk | 300.0 | 23:45 - 23:50 | My sister and I observed a disk for about 5 mi... | 12/2/2000 | 34.687500 | -77.119444 | 16 |
| 10/12/1988 16:30 | melbourne (vic, australia) | NaN | au | cigar | 900.0 | 15 min's | Large cigar shaped craft,flying sideways... | 9/29/2004 | -37.813938 | 144.963425 | 8 |
| 10/12/1994 11:55 | schenectady | ny | us | other | 120.0 | a few minutes | I had just picked up my Daugther (2yrs old) fr... | 4/1/2001 | 42.814167 | -73.940000 | 28 |
| 10/12/1994 15:00 | monticello | ky | us | chevron | 120.0 | 1-2 minutes | Triangular/chevron small object with fixed lig... | 10/30/2006 | 36.829722 | -84.849167 | 10 |
1001 rows × 11 columns
Apply your own functions¶
So what if you want to call your own functions on each element? Call the apply(...) function on a Series or DataFrame and pass in another function as an argument. Let's try writing a program that removes the trailing parenthetical from each city name in the UFO dataset.
def clean_city_name(s):
"""
Returns the given string with any trailing parenthetical removed.
>>> clean_city_name("seattle (ballard area)")
'seattle'
>>> clean_city_name("seattle (west)")
'seattle'
>>> clean_city_name("melbourne (vic, australia)")
'melbourne'
>>> clean_city_name("chester (uk/england)")
'chester'
>>> clean_city_name("carrieres sous poissy (france)")
'carrieres sous poissy'
>>> clean_city_name("seattle")
'seattle'
"""
index = s.find("(")
if index == -1:
return s
return s[:index].rstrip()
doctest.run_docstring_examples(clean_city_name, globals())
ufos["city"].apply(clean_city_name).value_counts() # like groupby("city").count()
city
new york city 12
seattle 12
tinley park 8
oak forest 8
las vegas 7
..
holiday 1
eagan 1
siloam sprngs 1
canyonlands np 1
monticello 1
Name: count, Length: 774, dtype: int64
In practice, this can be useful for carrying out data cleaning tasks such as removing punctuation or converting special characters. apply lets us write and test a function that achieves our task on a single string, and then apply that function to every string in a dataset.
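For example, here is a minimal sketch (our own illustration, with a hypothetical remove_punctuation helper) that strips punctuation from each comment using apply:
import string

def remove_punctuation(s):
    # str.translate with a deletion table removes every punctuation character.
    return s.translate(str.maketrans("", "", string.punctuation))

ufos["comments"].apply(remove_punctuation)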
Short review¶
- .groupby(["column1", "column2"]) groups rows by one or more columns.
- .loc[row_indexers, column_indexers] selects rows and columns by label.
- .loc[(v1, v2, slice(1, 5))] selects from a MultiIndex using a tuple, with slice objects in place of the colon shorthand.
- ["city"].apply(your_method_for_cleaning_data) applies your own function to each element of a column.

Pandas Documentation: https://pandas.pydata.org/docs/reference/frame.html