Objects¶

Over the past few weeks, we've used the word "object" frequently without defining exactly what it means. In this lesson, we'll introduce objects and see how we can use them in real data programming work. By the end of this lesson, students will be able to:

  • Define a Python class to represent objects with specific states and behaviors.
  • Explain how the Python memory model allows multiple references to the same objects.
  • Add type annotations to variables, function definitions, and class fields.
In [1]:
import pandas as pd

An object (aka instance) in Python is a way of combining into a distinct unit (aka encapsulating) two software concepts:

  • State, or data like the elements of a list.
  • Behavior, or methods like a function that can take a list and return the size of the list.

A DataFrame stores data (state) and has many methods (behaviors), such as groupby.
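
The same pairing of state and behavior shows up in Python's built-in types. As a minimal sketch (the list `readings` and its values are made up for illustration), a list object stores elements and offers methods that work with them:

```python
# A list object bundles state (its elements) with behavior (its methods).
readings = [6.8, 5.3, 5.3]   # state: the elements stored in the list

readings.append(5.6)         # behavior: a method that updates the state
count = readings.count(5.3)  # behavior: a method that computes from the state

print(readings)        # [6.8, 5.3, 5.3, 5.6]
print(count)           # 2
print(type(readings))  # <class 'list'>
```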

In [24]:
seattle_air = pd.read_csv("seattle_air.csv", index_col="Time", parse_dates=True)
seattle_air.groupby(seattle_air.index.year).count()
Out[24]:
PM2.5
Time
2017 6283
2018 8540
2019 8597
2020 8683
2021 8664
2022 2292
In [25]:
original_seattle_air = seattle_air
original_seattle_air
Out[25]:
PM2.5
Time
2017-04-06 00:00:00 6.8
2017-04-06 01:00:00 5.3
2017-04-06 02:00:00 5.3
2017-04-06 03:00:00 5.6
2017-04-06 04:00:00 5.9
... ...
2022-04-06 19:00:00 5.1
2022-04-06 20:00:00 5.0
2022-04-06 21:00:00 5.3
2022-04-06 22:00:00 5.2
2022-04-06 23:00:00 5.2

43848 rows × 1 columns

In [26]:
seattle_air["PM2.5"].dropna()
Out[26]:
Time
2017-04-06 00:00:00    6.8
2017-04-06 01:00:00    5.3
2017-04-06 02:00:00    5.3
2017-04-06 03:00:00    5.6
2017-04-06 04:00:00    5.9
                      ... 
2022-04-06 19:00:00    5.1
2022-04-06 20:00:00    5.0
2022-04-06 21:00:00    5.3
2022-04-06 22:00:00    5.2
2022-04-06 23:00:00    5.2
Name: PM2.5, Length: 43059, dtype: float64
In [10]:
seattle_air
Out[10]:
PM2.5
Time
2017-04-06 00:00:00 6.8
2017-04-06 01:00:00 5.3
2017-04-06 02:00:00 5.3
2017-04-06 03:00:00 5.6
2017-04-06 04:00:00 5.9
... ...
2022-04-06 19:00:00 5.1
2022-04-06 20:00:00 5.0
2022-04-06 21:00:00 5.3
2022-04-06 22:00:00 5.2
2022-04-06 23:00:00 5.2

43848 rows × 1 columns

Reference semantics¶

When we call a method like groupby and then count each group, the result is a new object that is distinct from the original. If we now ask for the value of seattle_air, we'll see that the original DataFrame is still there with all its data intact and untouched by the groupby or count operations.

In [3]:
seattle_air
Out[3]:
PM2.5
Time
2017-04-06 00:00:00 6.8
2017-04-06 01:00:00 5.3
2017-04-06 02:00:00 5.3
2017-04-06 03:00:00 5.6
2017-04-06 04:00:00 5.9
... ...
2022-04-06 19:00:00 5.1
2022-04-06 20:00:00 5.0
2022-04-06 21:00:00 5.3
2022-04-06 22:00:00 5.2
2022-04-06 23:00:00 5.2

43848 rows × 1 columns

However, unlike groupby, there are some DataFrame methods that can modify the underlying DataFrame. The dropna method for removing NaN values can modify the original when we include the keyword argument inplace=True (default False). Furthermore, if inplace=True, dropna will return None to more clearly communicate that changes were made to the original DataFrame.
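
Python's built-in lists follow a similar convention, which may make the pattern easier to remember. A minimal sketch with made-up values: sorted returns a new list, while list.sort modifies the list in place and returns None.

```python
values = [3, 1, 2]

# sorted() builds and returns a new list; the original is untouched.
new_values = sorted(values)
print(new_values)  # [1, 2, 3]
print(values)      # [3, 1, 2]

# list.sort() modifies the list in place and returns None.
result = values.sort()
print(result)      # None
print(values)      # [1, 2, 3]
```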

In [4]:
seattle_air.dropna() # By default, it does not modify the underlying data frame
Out[4]:
PM2.5
Time
2017-04-06 00:00:00 6.8
2017-04-06 01:00:00 5.3
2017-04-06 02:00:00 5.3
2017-04-06 03:00:00 5.6
2017-04-06 04:00:00 5.9
... ...
2022-04-06 19:00:00 5.1
2022-04-06 20:00:00 5.0
2022-04-06 21:00:00 5.3
2022-04-06 22:00:00 5.2
2022-04-06 23:00:00 5.2

43059 rows × 1 columns

In [5]:
seattle_air
Out[5]:
PM2.5
Time
2017-04-06 00:00:00 6.8
2017-04-06 01:00:00 5.3
2017-04-06 02:00:00 5.3
2017-04-06 03:00:00 5.6
2017-04-06 04:00:00 5.9
... ...
2022-04-06 19:00:00 5.1
2022-04-06 20:00:00 5.0
2022-04-06 21:00:00 5.3
2022-04-06 22:00:00 5.2
2022-04-06 23:00:00 5.2

43848 rows × 1 columns

In [11]:
seattle_air = seattle_air.dropna()
seattle_air
Out[11]:
PM2.5
Time
2017-04-06 00:00:00 6.8
2017-04-06 01:00:00 5.3
2017-04-06 02:00:00 5.3
2017-04-06 03:00:00 5.6
2017-04-06 04:00:00 5.9
... ...
2022-04-06 19:00:00 5.1
2022-04-06 20:00:00 5.0
2022-04-06 21:00:00 5.3
2022-04-06 22:00:00 5.2
2022-04-06 23:00:00 5.2

43059 rows × 1 columns

In [12]:
original_seattle_air
Out[12]:
PM2.5
Time
2017-04-06 00:00:00 6.8
2017-04-06 01:00:00 5.3
2017-04-06 02:00:00 5.3
2017-04-06 03:00:00 5.6
2017-04-06 04:00:00 5.9
... ...
2022-04-06 19:00:00 5.1
2022-04-06 20:00:00 5.0
2022-04-06 21:00:00 5.3
2022-04-06 22:00:00 5.2
2022-04-06 23:00:00 5.2

43848 rows × 1 columns
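
The cells above rely on the fact that assignment binds a name to an object rather than copying it. A minimal sketch of the same idea using lists instead of DataFrames (the variable names and values are illustrative only):

```python
air = [6.8, None, 5.3]
original_air = air            # a second reference to the same list object
print(original_air is air)    # True

# Building a new list and rebinding the name `air` leaves the original alone,
# much like seattle_air = seattle_air.dropna().
air = [x for x in air if x is not None]
print(air)                    # [6.8, 5.3]
print(original_air)           # [6.8, None, 5.3]
print(original_air is air)    # False

# Mutating through one reference is visible through every other reference.
alias = original_air
alias.append(5.6)
print(original_air)           # [6.8, None, 5.3, 5.6]
```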

In [13]:
seattle_air = seattle_air["PM2.5"].dropna()
In [14]:
seattle_air
Out[14]:
Time
2017-04-06 00:00:00    6.8
2017-04-06 01:00:00    5.3
2017-04-06 02:00:00    5.3
2017-04-06 03:00:00    5.6
2017-04-06 04:00:00    5.9
                      ... 
2022-04-06 19:00:00    5.1
2022-04-06 20:00:00    5.0
2022-04-06 21:00:00    5.3
2022-04-06 22:00:00    5.2
2022-04-06 23:00:00    5.2
Name: PM2.5, Length: 43059, dtype: float64
In [15]:
original_seattle_air
Out[15]:
PM2.5
Time
2017-04-06 00:00:00 6.8
2017-04-06 01:00:00 5.3
2017-04-06 02:00:00 5.3
2017-04-06 03:00:00 5.6
2017-04-06 04:00:00 5.9
... ...
2022-04-06 19:00:00 5.1
2022-04-06 20:00:00 5.0
2022-04-06 21:00:00 5.3
2022-04-06 22:00:00 5.2
2022-04-06 23:00:00 5.2

43848 rows × 1 columns

In [20]:
# Doesn't actually appear to modify the data frame
# (It does actually change the result, but just doesn't look like it.)
seattle_air["PM2.5"] = seattle_air["PM2.5"].dropna()
seattle_air
Out[20]:
PM2.5
Time
2017-04-06 00:00:00 6.8
2017-04-06 01:00:00 5.3
2017-04-06 02:00:00 5.3
2017-04-06 03:00:00 5.6
2017-04-06 04:00:00 5.9
... ...
2022-04-06 19:00:00 5.1
2022-04-06 20:00:00 5.0
2022-04-06 21:00:00 5.3
2022-04-06 22:00:00 5.2
2022-04-06 23:00:00 5.2

43848 rows × 1 columns

In [22]:
seattle_air[seattle_air.isna()]
Out[22]:
PM2.5
Time
2017-04-06 00:00:00 NaN
2017-04-06 01:00:00 NaN
2017-04-06 02:00:00 NaN
2017-04-06 03:00:00 NaN
2017-04-06 04:00:00 NaN
... ...
2022-04-06 19:00:00 NaN
2022-04-06 20:00:00 NaN
2022-04-06 21:00:00 NaN
2022-04-06 22:00:00 NaN
2022-04-06 23:00:00 NaN

43848 rows × 1 columns

In [23]:
original_seattle_air
Out[23]:
PM2.5
Time
2017-04-06 00:00:00 6.8
2017-04-06 01:00:00 5.3
2017-04-06 02:00:00 5.3
2017-04-06 03:00:00 5.6
2017-04-06 04:00:00 5.9
... ...
2022-04-06 19:00:00 5.1
2022-04-06 20:00:00 5.0
2022-04-06 21:00:00 5.3
2022-04-06 22:00:00 5.2
2022-04-06 23:00:00 5.2

43848 rows × 1 columns

Defining classes¶

Python allows us to create our own custom objects by defining a class: a blueprint or template for objects. The pandas developers defined a DataFrame class so that you can construct DataFrame objects to use. Here's a highly simplified outline of the code that they could have written to define the DataFrame class. Let's break down the new syntax for the class definition.

  • class DataFrame: begins the class definition. We always name classes by capitalizing each word and removing spaces between words.
  • def __init__(self, index, columns, data): defines a special function called an initializer. The initializer is called whenever constructing a new object. Each DataFrame stores its own data in fields (variables associated with an object), in this case called index, columns, and data.
  • def dropna(self, inplace=False): defines a function that can be called on DataFrame objects. Like the initializer, it also takes a self parameter as well as a default parameter inplace=False. Depending on the value of inplace, it can either return a new DataFrame or None.
  • def __getitem__(self, column_or_indexer): defines a special function that is called when you use square bracket notation for indexing.

Notice how every method always takes self as the first parameter. The two special functions that we defined above are only "special" in the sense that they have a specific naming format preceded by two underscores and followed by two underscores. These dunder methods are used internally by Python to enable the convenient syntax that we're all used to using.
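
To see that square brackets are only a shorthand for a dunder method call, here is a minimal sketch with a hypothetical Playlist class (not part of pandas). The class name, field, and song titles are assumptions made for illustration.

```python
class Playlist:
    """A hypothetical container used only to illustrate dunder methods."""

    def __init__(self, songs):
        self.songs = songs

    def __getitem__(self, position):
        """Called by Python when we write playlist[position]."""
        return self.songs[position]

    def __len__(self):
        """Called by Python when we write len(playlist)."""
        return len(self.songs)


playlist = Playlist(["Intro", "Verse", "Outro"])

# These three expressions all perform the same method call.
print(playlist[1])                        # Verse
print(playlist.__getitem__(1))            # Verse
print(Playlist.__getitem__(playlist, 1))  # Verse

print(len(playlist))                      # 3
```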

In [1]:
class DataFrame:
    """Represents two-dimensional tabular data structured around an index and column names."""

    def __init__(self, index, columns, data):
        """Initializes a new DataFrame object from the given index, columns, and tabular data."""
        print("Initializing DataFrame")
        # seattle_air["PM2.5"] = ...
        self.index = index
        self.columns = columns
        self.data = data

    # seattle_air.dropna() -> dropna(seattle_air, ...)
    def dropna(self, inplace=False):
        """
        Drops all rows containing NaN from this DataFrame. If inplace, returns None and modifies
        self. If not inplace, returns a new DataFrame without modifying self.
        """
        print("Calling dropna")
        if not inplace:
            return DataFrame([...], [...], [...])
        else:
            self.columns = [...]
            self.index = [...]
            self.data = [...]
            return None

    # seattle_air["PM2.5"] -> seattle_air.__getitem__("PM2.5") -> DataFrame.__getitem__(seattle_air, "PM2.5")
    def __getitem__(self, column_or_indexer):
        """Given a column or indexer, returns the selection as a new Series or DataFrame object."""
        print("Calling __getitem__")
        if column_or_indexer in self.columns:
            return "Series" # placeholder for a Series
        else: # If I have a boolean Series for filtering...
            return DataFrame([...], [...], [...])

    def __repr__(self):
        """Return a string representation of this object that can be evaluated in Python to reproduce the object."""
        return f"DataFrame({self.index}, {self.columns}, {self.data})"


example = DataFrame([0, 1, 2], ["PM2.5"], [10, 20, 30])
example["PM2.5"]
Initializing DataFrame
Calling __getitem__
Out[1]:
'Series'
In [ ]:
example["PM2.5"]["2024-04-06"]
# example.__getitem__("PM2.5").__getitem__("2024-04-06")
# Series.__getitem__(DataFrame.__getitem__(example, "PM2.5"), "2024-04-06")
In [ ]:
# Remember that Python passes the instance as self, the first argument, automatically!
# Passing staff explicitly as well, as in the commented-out line below, would supply it twice.
# staff.__getitem__(staff, "Hours").__getitem__(staff, "Iris")
staff.__getitem__("Hours").__getitem__("Iris")
In [ ]:
# df.__getitem__("foo").__setitem__(df["bar"] > 5, 100)
# Actually has a potentially subtle bug in it! Could be potentially confusing what is intended.
df["foo"][df["bar"] > 5] = 100

# df.__getitem__("foo") -> Series but is this column associated with the original dataframe?
#                          or is it a copy?
#                      .__setitem__(df["bar"] > 5, 100)
#                       -> is this changing the original df? Or is it just changing some temporary Series?
# Pandas' solution is copy on write: in future versions of pandas this code simply will not work.
In [ ]:
# We are doing one df.loc.__setitem__( (df["bar"] > 5, "foo"), 100 )
df.loc[df["bar"] > 5, "foo"] = 100

# Why does using .loc avoid the problem above with chained assignments?
# Because it does not materialize a temporary Series (which was the cause of all the potential confusion)
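
To make the difference concrete without relying on pandas internals, here is a toy sketch. The Table class below is hypothetical and is not how pandas is implemented: its __getitem__ always returns a copy of a column, which is one of the two behaviors that a chained assignment might run into.

```python
class Table:
    """A toy table whose column lookups always return copies (an assumption)."""

    def __init__(self, columns):
        self.columns = columns  # dict mapping column name -> list of values

    def __getitem__(self, name):
        # Returns a temporary copy, not the stored list.
        return list(self.columns[name])

    def __setitem__(self, key, value):
        # A single call that receives both the row position and the column name.
        row, name = key
        self.columns[name][row] = value


table = Table({"foo": [1, 2, 3]})

# Chained: __getitem__ makes a temporary copy, then list.__setitem__ changes the copy.
table["foo"][0] = 100
print(table.columns["foo"])  # [1, 2, 3]

# Single call: Table.__setitem__((0, "foo"), 100) changes the stored data.
table[0, "foo"] = 100
print(table.columns["foo"])  # [100, 2, 3]
```

In the chained form, two separate method calls are made and the second one has no way to know where its receiver came from. In the single-call form, one method sees both the row and the column, so it can update the stored data directly.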

Another useful dunder method is the __repr__ method, which should return a string representing the object. By default, __repr__ just tells you the fully-qualified name of the object's class and the location it is stored in your computer memory. But we can make it much more useful by defining our own __repr__ method.
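
As a minimal sketch of the difference, compare two hypothetical classes that are identical except that one defines __repr__. The class names and fields are assumptions made for illustration.

```python
class PointDefault:
    """A hypothetical class that relies on the default __repr__."""

    def __init__(self, x, y):
        self.x = x
        self.y = y


class Point:
    """The same hypothetical class with a custom __repr__."""

    def __init__(self, x, y):
        self.x = x
        self.y = y

    def __repr__(self):
        return f"Point({self.x!r}, {self.y!r})"


print(repr(PointDefault(1, 2)))  # something like <__main__.PointDefault object at 0x...>
print(repr(Point(1, 2)))         # Point(1, 2)

# Because the string is valid Python, evaluating it reproduces an equivalent object.
copy = eval(repr(Point(1, 2)))
print(copy.x, copy.y)            # 1 2
```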

In [33]:
example
Out[33]:
DataFrame([0, 1, 2], ['PM2.5'], [10, 20, 30])
In [34]:
DataFrame([0, 1, 2], ['PM2.5'], [10, 20, 30])
Initializing DataFrame
Out[34]:
DataFrame([0, 1, 2], ['PM2.5'], [10, 20, 30])

Practice: Student class¶

Write a Student class that represents a UW student, where each student has a name, a student number, and a courses dictionary that associates the name of each course to a number of credits. The Student class should include the following methods:

  • An initializer that takes the student number and the name of a file containing information about their schedule.
  • A method __getitem__ that takes a str course name and returns the int number of credits for the course. If the student is not taking the given course, return None.
  • A method get_courses that returns a list of the courses the student is taking.

Consider the following file nicole.txt.

CSE163 4
PHIL100 4
CSE390HA 1

The student's name is just the name of the file without the file extension. The file indicates they are taking CSE163 for 4 credits, PHIL100 for 4 credits, and CSE390HA for 1 credit.

In [12]:
# 1. Write the template out and convert it to Python code for each method.
# 2. Figure out what data/state you need to keep track of in your class.
# 3. Sometimes, it can be helpful to start with the initializer.
#    Other times, it can be helpful to start with the other methods.
#    And it's probably useful to write them somewhat in tandem (at the same time).

class Student:
    """..."""
    # Is there a notion of a static method? Yes, in Python using a special decorator.

    def __init__(self, number: int, filename: str) -> None:
        """..."""
        self.name = filename[:-4]
        self.number = number
        self.filename = filename
        self.courses = {}
        with open(filename) as f:
            for line in f.readlines():
                course, credits = line.split()
                self.courses[course] = int(credits)

    def __getitem__(self, course: str) -> int | None:
        """..."""
        if course not in self.courses:
            return None
        return self.courses[course]

    def get_courses(self) -> list[str]:
        """..."""
        return list(self.courses)

    def __repr__(self):
        """Return a string representation of this object that can be evaluated in Python to reproduce the object."""
        return f"Student({self.number}, '{self.filename}')"


nicole = Student(1234567, "nicole.txt")
for course in nicole.get_courses():
    print(course, nicole[course])
<cell>31: error: Function is missing a type annotation  [no-untyped-def]
CSE163 4
PHIL100 4
CSE390HA 1
In [13]:
nicole["CSE163"]
Out[13]:
4
In [14]:
# To access information that was passed to the initializer, we need to assign it as a field
nicole.number
Out[14]:
1234567
In [15]:
nicole
Out[15]:
Student(1234567, 'nicole.txt')
In [17]:
Student(1234567, 'nicole.txt')
Out[17]:
Student(1234567, 'nicole.txt')

Type annotations¶

We've talked a lot about the types of each variable in the Python programs that we write, but we can also optionally write in the type of each variable, parameter, or return value as a type hint. In certain assignments, we'll use mypy to check your type annotations. Let's read the Type hints cheat sheet and practice adding type annotations to our previous class definitions.
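
As a minimal sketch of what annotations look like on variables, function definitions, and class fields, consider the hypothetical Course class and total_credits function below (neither is part of the lesson's solution code). Python itself does not enforce these hints at runtime; a checker such as mypy reads them.

```python
class Course:
    """A hypothetical class showing annotated fields."""

    def __init__(self, name: str, credits: int) -> None:
        self.name: str = name
        self.credits: int = credits


def total_credits(courses: list[Course]) -> int:
    """Returns the sum of credits across the given courses."""
    total: int = 0
    for course in courses:
        total += course.credits
    return total


schedule: list[Course] = [
    Course("CSE163", 4),
    Course("PHIL100", 4),
    Course("CSE390HA", 1),
]
print(total_credits(schedule))  # 9
```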

In [1]:
!pip install -q nb_mypy
%reload_ext nb_mypy
%nb_mypy mypy-options --strict
Version 1.0.5
In [23]:
# How to sort students?
students = [nicole, Student(10, "nicole.txt")]
students[0].name
Out[23]:
'nicole'
In [24]:
students
Out[24]:
[Student(1234567, 'nicole.txt'), Student(10, 'nicole.txt')]
In [25]:
# How do we define a way to sort these Student objects? It's ambiguous!
sorted(students)
<cell>2: error: Value of type variable "SupportsRichComparisonT" of "sorted" cannot be "Student"  [type-var]
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[25], line 2
      1 # How do we define a way to sort these Student objects? It's ambiguous!
----> 2 sorted(students)

TypeError: '<' not supported between instances of 'Student' and 'Student'
In [26]:
# key= takes a function as an argument, which when given an object from your list,
#      it should return something that can be sorted in Python (str, int, float)

def get_name(student: Student) -> str:
    return student.name

def get_number(student: Student) -> int:
    return student.number

sorted(students, key=get_number) # Do not call the function with parentheses!
Out[26]:
[Student(10, 'nicole.txt'), Student(1234567, 'nicole.txt')]
In [27]:
# Shorter way to do the same thing using lambdas (inline function definition)!
sorted(students, key=lambda student: student.number)
Out[27]:
[Student(10, 'nicole.txt'), Student(1234567, 'nicole.txt')]
In [28]:
# Even more "Pythonic" (definitely some problems with limited framings of what is Pythonic)

from operator import attrgetter

sorted(students, key=attrgetter("number"))
Out[28]:
[Student(10, 'nicole.txt'), Student(1234567, 'nicole.txt')]
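
Another option, not used in this lesson, is to give a class its own ordering by defining the __lt__ dunder method, which is what the < in the error message above refers to. A minimal sketch with a hypothetical Member class, assuming we want to order by number:

```python
class Member:
    """A hypothetical class that defines its own ordering by number."""

    def __init__(self, name: str, number: int) -> None:
        self.name = name
        self.number = number

    def __lt__(self, other: "Member") -> bool:
        """Called by Python when we write self < other."""
        return self.number < other.number

    def __repr__(self) -> str:
        return f"Member({self.name!r}, {self.number!r})"


members = [Member("nicole", 1234567), Member("iris", 10)]

# sorted() can now compare Member objects directly.
print(sorted(members))  # [Member('iris', 10), Member('nicole', 1234567)]

# A key function still works and takes priority over __lt__ on the objects.
by_name = sorted(members, key=lambda member: member.name)
print(by_name)
```

Defining __lt__ fixes a single ordering for the class, whereas key= lets each call choose its own, which is often the more flexible choice when the ordering is ambiguous.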

Practice: University class¶

Write a University class that represents one or more students enrolled in courses at a university. The University class should include the following methods:

  • An initializer that takes the university name and, optionally, a list of Student objects to enroll in this university.
  • A method enrollments that returns all the enrolled Student objects sorted in alphabetical order by student name.
  • A method enroll that takes a Student object and enrolls them in the university.

Later, we'll add more methods to this class. How well does your approach stand up to changing requirements?

In [32]:
# Pretty handy to write type annotations as part of your templating.
# How does Python evaluate a class?
#   Look inside the class for definitions, and bind them to the class!
#   University.__init__ will get the __init__ function definition

class University:
    """..."""

    def __init__(self, name: str, students: list[Student] | None = None) -> None:
        """Initializes a new University object with the given name and list of students."""
        if students is None:
            students = []
        self.name = name
        self.students = students # Could be a dictionary

    def enrollments(self) -> list[Student]:
        """Returns a list of all students enrolled in this University sorted alphabetically by name."""
        return sorted(self.students, key=attrgetter("name"))

    def enroll(self, student: Student) -> None:
        """Enrolls the given student in this University"""
        self.students.append(student)

    def roster(self, course: str) -> list[Student]:
        """Given a course name, return the list of students who are enrolled in that course."""
        result = []
        # Loop over the list of students,
        # checking each one to see if they are enrolled in the course
        for student in self.students:
            if student[course] is not None: # If the student is taking the course?
                result.append(student)
        # In the Search homework, you will need to sort the documents by their relevance!
        return result

    # What other methods might you like?


uw = University("Udub", [nicole])
uw.enrollments()
Out[32]:
[Student(1234567, 'nicole.txt')]

Mutable default parameters¶

Default parameter values are evaluated and bound to the parameter when the function is defined. This can lead to some unanticipated results when using mutable values like lists or dictionaries as default parameter values.
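
A minimal sketch of that timing, using a hypothetical helper make_default that prints a message whenever it runs. The function names are assumptions made for illustration.

```python
def make_default() -> list[str]:
    """A hypothetical helper that announces when it runs."""
    print("Evaluating the default value")
    return []


# The message prints once, here, when the def statement runs.
def record(item: str, log: list[str] = make_default()) -> list[str]:
    log.append(item)
    return log


first = record("a")     # no message: the default already exists
second = record("b")    # no message: the same default list is reused
print(first is second)  # True
print(second)           # ['a', 'b']
```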

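Say we make two new University objects without specifying a list of students to enroll. If the initializer's default value is an empty list (students=[]) rather than None, the initializer assigns that one shared list to a field of both objects. The outputs below appear to come from such a version of University, not the None-default version defined above.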

In [33]:
# Constructing a new University with no students means that it doesn't have any students!
wsu = University("Wazzu")
wsu.enrollments()
Out[33]:
[]
In [34]:
# Construct another new University with no students: it should also have no students.
seattle_u = University("SeattleU")
seattle_u.enrollments()
Out[34]:
[]

When we enroll a student in seattle_u, the change will also affect wsu. There are several ways to work around this. The most common approach is to change the default parameter value to None and add an if statement to the initializer that creates a new list.

In [35]:
seattle_u.enroll(nicole)
seattle_u.enrollments()
Out[35]:
[Student(1234567, 'nicole.txt')]
In [36]:
wsu.enrollments()
Out[36]:
[Student(1234567, 'nicole.txt')]

When Python reads a function definition (def), it evaluates the function signature, including any default parameter values, but not the function body!

In [40]:
# UW has a separate enrollment list because we passed in a specific argument for students!
uw.enroll(nicole)
uw.enrollments()
Out[40]:
[Student(1234567, 'nicole.txt'), Student(1234567, 'nicole.txt')]
In [41]:
wsu.enrollments()
Out[41]:
[Student(1234567, 'nicole.txt')]
In [42]:
seattle_u.enrollments()
Out[42]:
[Student(1234567, 'nicole.txt')]