Collections and Data Classes

Python provides many built-in data structures like lists, dictionaries, and tuples. However, sometimes we need more specialized or structured ways to represent and process data. The collections and dataclasses modules in Python’s standard library provide powerful tools for these scenarios.

`collections.Counter`

When we want to count the occurrences of elements in a list, we typically use a dictionary. For example, let’s figure out how many Olympic disciplines are hosted in each location from olympics.txt.

with open("data/olympics.txt") as f:
    locations = []
    for line in f:
        # Example line: "Ice Hockey: Milan"
        location = line.strip().split(": ")[1]
        locations.append(location)

counts = {}
for loc in locations:
    if loc not in counts:
        counts[loc] = 0
    counts[loc] += 1
print(counts)

The Counter class from the collections module simplifies this common pattern. When we pass a list to Counter, it counts the frequency of each element for us, and its most_common method returns the most frequent elements along with their counts.

from collections import Counter

counts = Counter(locations)
print(counts)
print(counts.most_common(2))
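
To see what Counter produces without needing the data file, here is a minimal sketch that uses a small, made-up list of locations (the values below are hypothetical sample data, not the contents of olympics.txt).

```python
from collections import Counter

# Hypothetical sample data standing in for the locations list built above
locations = ["Milan", "Cortina", "Milan", "Livigno", "Milan", "Cortina"]

counts = Counter(locations)

# A Counter behaves like a dictionary mapping each element to its count
milan_count = counts["Milan"]
print(milan_count)  # 3

# Unlike a plain dictionary, a missing key gives 0 instead of raising KeyError
missing_count = counts["Bormio"]
print(missing_count)  # 0

# most_common(n) returns a list of (element, count) pairs, highest count first
top_two = counts.most_common(2)
print(top_two)  # [('Milan', 3), ('Cortina', 2)]
```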

`collections.defaultdict`

Earlier, we created a dictionary mapping each location to a list of its Olympic disciplines.

disciplines_by_location = {}
with open("data/olympics.txt") as f:
    for line in f:
        discipline, location = line.strip().split(": ")
        if location not in disciplines_by_location:
            disciplines_by_location[location] = []
        disciplines_by_location[location].append(discipline)

We can simplify the process of initializing empty lists for new keys using defaultdict from the collections module. When you access a key that doesn’t exist, defaultdict automatically creates it by calling the function you provided when constructing the defaultdict (like list for an empty list or int for 0).
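
As a small illustration of the default-value behavior, the sketch below counts disciplines per location with int as the default function. The sample lines are hypothetical and only mimic the "Discipline: Location" format of the file.

```python
from collections import defaultdict

# Hypothetical sample lines in the "Discipline: Location" format
lines = ["Ice Hockey: Milan", "Curling: Cortina", "Figure Skating: Milan"]

# Calling int() returns 0, so every new key starts at 0
count_by_location = defaultdict(int)
for line in lines:
    discipline, location = line.strip().split(": ")
    count_by_location[location] += 1  # no "if location not in ..." check needed

result = dict(count_by_location)
print(result)  # {'Milan': 2, 'Cortina': 1}

# Merely reading a missing key also creates it with the default value
value = count_by_location["Livigno"]
print(value, "Livigno" in count_by_location)  # 0 True
```

Note that looking up a missing key inserts it, which differs from a regular dictionary, where the same lookup would raise a KeyError.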

Practice: Grouping with `defaultdict`

Which line of code correctly initializes disciplines_by_location using a defaultdict so that we no longer need the if statement inside the for loop?

disciplines_by_location = defaultdict(list)
disciplines_by_location = defaultdict([])
disciplines_by_location = defaultdict(int)
disciplines_by_location = defaultdict(list())

Data classes

When processing structured files like CSVs, dictionaries are useful, but we’ve learned that classes can be a better alternative when we want to define methods that act on the structured data. However, ordinary Python classes require a substantial amount of boilerplate code. The @dataclass decorator automatically generates dunder methods such as __init__, __repr__, and __eq__ from the annotated fields.

from dataclasses import dataclass

@dataclass
class Game:
    season: int
    date: str
    team_home: str
    team_away: str
    score_home: int
    score_away: int

my_game = Game(2024, "2024-10-20", "Washington Commanders", "Carolina Panthers", 40, 7)
my_game
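
The generated methods can be observed directly. The sketch below repeats the Game fields and adds a hypothetical margin method (not part of the original class) to show that a dataclass can still define ordinary methods.

```python
from dataclasses import dataclass

@dataclass
class Game:
    season: int
    date: str
    team_home: str
    team_away: str
    score_home: int
    score_away: int

    def margin(self):
        """Hypothetical helper: home score minus away score."""
        return self.score_home - self.score_away

game_a = Game(2024, "2024-10-20", "Washington Commanders", "Carolina Panthers", 40, 7)
game_b = Game(2024, "2024-10-20", "Washington Commanders", "Carolina Panthers", 40, 7)

# The generated __repr__ lists every field by name
print(game_a)

# The generated __eq__ compares field values, not object identity
same_values = game_a == game_b
same_object = game_a is game_b
print(same_values, same_object)  # True False

print(game_a.margin())  # 33
```

The type annotations describe the fields but are not checked at runtime, so passing a string where an int is annotated would not raise an error on its own.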

We can combine dataclasses with our file processing skills.

import csv

games = []
with open("data/games.csv") as f:
    reader = csv.DictReader(f)
    for row in reader:
        game = Game(
            season=int(row["schedule_season"]),
            date=row["schedule_date"],
            team_home=row["team_home"],
            team_away=row["team_away"],
            score_home=int(row["score_home"]),
            score_away=int(row["score_away"])
        )
        games.append(game)

games

Practice: NFL Dataclasses

After defining the Game dataclass and creating the my_game instance above, which expression will raise an error?

my_game["score_home"]
my_game.score_home
my_game.season == 2024
my_game.team_home = "Seattle Seahawks"

Sorting by attribute

One of the major benefits of using dataclasses (and objects in general) is how nicely they work with Python’s sorted function. Just as we used operator.itemgetter to sort a list of dictionaries by a specific key, we can use operator.attrgetter to sort a list of objects by a specific attribute.

from operator import attrgetter

# Sort the games from lowest home score to highest home score
sorted_games = sorted(games, key=attrgetter("score_home"))
sorted_games[-1].score_home
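
Two common variations are sorting in descending order and breaking ties with a second attribute. The sketch below uses a simplified, hypothetical Game class with only three fields and made-up scores so that it runs without games.csv.

```python
from dataclasses import dataclass
from operator import attrgetter

# Simplified stand-in for the Game dataclass (hypothetical, fewer fields)
@dataclass
class Game:
    team_home: str
    score_home: int
    score_away: int

games = [Game("A", 21, 14), Game("B", 35, 10), Game("C", 21, 3)]

# reverse=True sorts from highest to lowest; ties keep their original order
by_home_desc = sorted(games, key=attrgetter("score_home"), reverse=True)
desc_names = [g.team_home for g in by_home_desc]
print(desc_names)  # ['B', 'A', 'C']

# With several names, attrgetter returns a tuple, so ties on score_home
# are broken by score_away
by_home_then_away = sorted(games, key=attrgetter("score_home", "score_away"))
tie_names = [g.team_home for g in by_home_then_away]
print(tie_names)  # ['C', 'A', 'B']
```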
