Objects¶

Over the past few weeks, we've used the word "object" frequently without defining exactly what it means. In this lesson, we'll introduce objects and see how we can use them in real data programming work. By the end of this lesson, students will be able to:

  • Define a Python class to represent objects with specific states and behaviors.
  • Explain how the Python memory model allows multiple references to the same objects.
  • Add type annotations to variables, function definitions, and class fields.
In [1]:
import pandas as pd

An object (also called an instance) in Python combines, or encapsulates, two software concepts into a single unit:

  • State, or data like the elements of a list.
  • Behavior, or methods like a function that can take a list and return the size of the list.

Recently, we've been using DataFrame objects frequently. A DataFrame stores data (state) and has many methods (behaviors), such as groupby.

In [2]:
seattle_air = pd.read_csv("seattle_air.csv", index_col="Time", parse_dates=True)
seattle_air.groupby(seattle_air.index.year).count()
Out[2]:
PM2.5
Time
2017 6283
2018 8540
2019 8597
2020 8683
2021 8664
2022 8625
2023 8409
2024 8558
2025 643
In [4]:
seattle_air.groupby(seattle_air.index.year)
Out[4]:
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7bcd743ca3d0>

Reference semantics¶

When we call a method like groupby and then count each group, the result is a new object that is distinct from the original. If we now ask for the value of seattle_air, we'll see that the original DataFrame is still there with all its data intact and untouched by the groupby or count operations.

In [3]:
seattle_air
Out[3]:
PM2.5
Time
2017-04-06 00:00:00 6.8
2017-04-06 01:00:00 5.3
2017-04-06 02:00:00 5.3
2017-04-06 03:00:00 5.6
2017-04-06 04:00:00 5.9
... ...
2025-01-27 19:00:00 9.0
2025-01-27 20:00:00 9.0
2025-01-27 21:00:00 9.0
2025-01-27 22:00:00 11.0
2025-01-27 23:00:00 13.0

68496 rows × 1 columns

However, unlike groupby, there are some DataFrame methods that can modify the underlying DataFrame. The dropna method for removing missing values can modify the original when we include the keyword argument inplace=True (default False). Furthermore, if inplace=True, dropna will return None to more clearly communicate that instead of returning a new DataFrame, changes were made to the original DataFrame.

In [9]:
data = seattle_air.dropna()
print(data)
                     PM2.5
Time                      
2017-04-06 00:00:00    6.8
2017-04-06 01:00:00    5.3
2017-04-06 02:00:00    5.3
2017-04-06 03:00:00    5.6
2017-04-06 04:00:00    5.9
...                    ...
2025-01-27 19:00:00    9.0
2025-01-27 20:00:00    9.0
2025-01-27 21:00:00    9.0
2025-01-27 22:00:00   11.0
2025-01-27 23:00:00   13.0

[67002 rows x 1 columns]
In [6]:
seattle_air
Out[6]:
PM2.5
Time
2017-04-06 00:00:00 6.8
2017-04-06 01:00:00 5.3
2017-04-06 02:00:00 5.3
2017-04-06 03:00:00 5.6
2017-04-06 04:00:00 5.9
... ...
2025-01-27 19:00:00 9.0
2025-01-27 20:00:00 9.0
2025-01-27 21:00:00 9.0
2025-01-27 22:00:00 11.0
2025-01-27 23:00:00 13.0

68496 rows × 1 columns

In [11]:
data = seattle_air.dropna(inplace=True)
data  # None, so the notebook displays nothing
In [12]:
seattle_air
Out[12]:
PM2.5
Time
2017-04-06 00:00:00 6.8
2017-04-06 01:00:00 5.3
2017-04-06 02:00:00 5.3
2017-04-06 03:00:00 5.6
2017-04-06 04:00:00 5.9
... ...
2025-01-27 19:00:00 9.0
2025-01-27 20:00:00 9.0
2025-01-27 21:00:00 9.0
2025-01-27 22:00:00 11.0
2025-01-27 23:00:00 13.0

67002 rows × 1 columns
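The same convention appears in Python's built-in lists, which may make the dropna behavior easier to remember. The sketch below uses a made-up list of readings (not the seattle_air data): sorted returns a new list, while list.sort changes the list in place and returns None.

```python
readings = [9.0, 6.8, 5.3]  # illustrative values only

# sorted returns a new list and leaves the original untouched.
ordered = sorted(readings)
snapshot = readings.copy()  # record what readings looks like at this point

# list.sort modifies the list in place and returns None.
result = readings.sort()

print(ordered)   # [5.3, 6.8, 9.0]
print(snapshot)  # [9.0, 6.8, 5.3]
print(result)    # None
print(readings)  # [5.3, 6.8, 9.0]
```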

If we load the same data into a separate DataFrame object, that object remains unmodified regardless of any changes we make to seattle_air.

In [16]:
another_seattle_air = pd.read_csv("seattle_air.csv", index_col="Time", parse_dates=True)
In [15]:
seattle_air = pd.read_csv("seattle_air.csv", index_col="Time", parse_dates=True)
In [86]:
l1 = [1, 2, 3]
l2 = [1, 2, 3]
In [84]:
l1 == l2
Out[84]:
True
In [87]:
l1 is l2
Out[87]:
False
In [31]:
l1[0] = 6
In [32]:
l1
Out[32]:
[6, 2, 3]
In [33]:
l2
Out[33]:
[1, 2, 3]
In [22]:
l1 == l2
Out[22]:
False
In [34]:
l1 = l2
In [26]:
l2
Out[26]:
[1, 2, 3]
In [28]:
l1[0] = 6
l1
Out[28]:
[6, 2, 3]
In [29]:
l2
Out[29]:
[6, 2, 3]
In [35]:
my_list = [1, 2, 3]
In [36]:
my_list = "hello world"
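The list cells above can be condensed into one sketch. The names original, alias, and clone are chosen here for illustration only. The sketch separates three ideas: two names referring to the same object, two distinct objects with equal contents, and rebinding a name to a different object.

```python
original = [1, 2, 3]
alias = original         # a second reference to the same list object
clone = original.copy()  # a new list object with equal contents

alias[0] = 6             # mutate the shared list through one of its names

print(original)           # [6, 2, 3]  (changed, because alias is original)
print(clone)              # [1, 2, 3]  (a separate object, so unchanged)
print(alias is original)  # True
print(clone is original)  # False

# Assignment rebinds a name; it does not modify the object the name used to refer to.
alias = "hello world"
print(original)           # [6, 2, 3]
```

Here == compares contents, while is asks whether two names refer to the very same object in memory.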

Defining classes¶

Python allows us to create our own custom data types by defining a class: a blueprint or template for objects. The pandas developers defined a DataFrame class so that you can construct DataFrame objects to use. Here's a highly simplified outline of the code that they could have written to define the DataFrame class.

In [89]:
class DataFrame:
    def __init__(self, x=0):
        self.x = x
        
    def dropna(self):
        self.columns = ["a", "b", "c"]
        return ...

    def add_value(self, x):
        print(type(self))
        self.x = x
In [90]:
df = DataFrame()
In [91]:
df.add_value(42)
<class '__main__.DataFrame'>
In [39]:
type(df)
Out[39]:
__main__.DataFrame
In [40]:
type(seattle_air)
Out[40]:
pandas.core.frame.DataFrame
In [45]:
df.dropna()
# Sort of but not really:
#DataFrame.dropna(df)
Out[45]:
Ellipsis
In [51]:
df.add_value(42)
df.x
Out[51]:
42
In [56]:
df2 = DataFrame()
df2.add_value(84)
df2.x
Out[56]:
84
In [57]:
df.x
Out[57]:
42
In [81]:
df3 = DataFrame()
In [82]:
df3.dropna()
Out[82]:
Ellipsis
In [83]:
df3.columns
Out[83]:
['a', 'b', 'c']

DataFrame "Solution"¶

In [17]:
class DataFrame:
    """Represents two-dimensional tabular data structured around an index and column names."""
    
    def __init__(self, index, columns, data):
        """Initializes a new DataFrame object from the given index, columns, and tabular data."""
        print("Initializing DataFrame")
        self.index = index
        self.columns = columns
        self.data = data

    def dropna(self, inplace=False):
        """
        Drops all rows containing NaN from this DataFrame. If inplace, returns None and modifies
        self. If not inplace, returns a new DataFrame without modifying self.
        """
        print("Calling dropna")
        if inplace:
            # Assign to the fields on self so that this object is actually modified
            self.columns = [...]
            self.index = [...]
            self.data = [...]
            return None
        else:
            return DataFrame([...], [...], [...])

    def __getitem__(self, column_or_indexer):
        """Given a column or indexer, returns the selection as a new Series or DataFrame object."""
        print("Calling __getitem__")
        if column_or_indexer in self.columns:
            return "Series" # placeholder for a Series
        else:
            return DataFrame([...], [...], [...])

    def __repr__(self):
        """Return a Python-interpretable string representation of this DataFrame."""
        # return "DataFrame([0, 1, 2], ["PM2.5"], [10, 20, 30])"
        # return "DataFrame(" + repr(self.index) + ", " + repr(self.columns) + ", " + repr(self.data) + ")"
        return f"DataFrame({self.index}, {self.columns}, {self.data})"

    def __str__(self):
        """Return a human-readable string representation of this DataFrame."""
        return "My favorite DataFrame"

example = DataFrame([0, 1, 2], ["PM2.5"], [10, 20, 30])
example # display(example.__repr__())
Initializing DataFrame
Out[17]:
DataFrame([0, 1, 2], ['PM2.5'], [10, 20, 30])
In [18]:
print(example) # print calls the .__str__ dunder method
My favorite DataFrame
In [19]:
print(repr(example))

# 1. repr(example) -> str
# 2. print(... some str... ) by calling that string's .__str__()
DataFrame([0, 1, 2], ['PM2.5'], [10, 20, 30])

Let's break down each line of code.

  • class DataFrame: begins the class definition. By convention, we name classes by capitalizing each word and removing the spaces between words.
  • def __init__(self, index, columns, data): defines a special function called an initializer. The initializer is called whenever constructing a new object. Each DataFrame stores its own data in fields (variables associated with an object), in this case called index, columns, and data.
  • def dropna(self, inplace=False): defines a function that can be called on DataFrame objects. Like the initializer, it also takes a self parameter as well as a default parameter inplace=False. Depending on the value of inplace, it can either return a new DataFrame or None.
  • def __getitem__(self, column_or_indexer): defines a special function that is called when you use the square brackets for indexing.

Notice how every method (function associated with an object) always takes self as the first parameter. The two special functions that we defined above are only "special" in the sense that they have a specific naming format preceded by two underscores and followed by two underscores. These "dunder" methods (functions whose names are surrounded by double underscores) are used internally by Python to enable the convenient syntax that we're all used to using.
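As a small sketch of how dunder methods enable familiar syntax, consider a hypothetical Playlist class (not part of pandas or of this lesson's DataFrame). Defining __len__ and __getitem__ is enough for len(...) and square brackets to work on its objects.

```python
class Playlist:
    """A hypothetical class holding an ordered collection of song titles."""

    def __init__(self, songs: list[str]) -> None:
        """Initializes a new Playlist with a copy of the given song titles."""
        self.songs = list(songs)

    def __len__(self) -> int:
        """Called by len(playlist)."""
        return len(self.songs)

    def __getitem__(self, position: int) -> str:
        """Called by playlist[position]."""
        return self.songs[position]


playlist = Playlist(["Intro", "Verse", "Outro"])
print(len(playlist))  # 3, same as playlist.__len__()
print(playlist[1])    # Verse, same as playlist.__getitem__(1)
```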

Using the Custom DataFrame¶

Just like how we need to call a function to use it, we also need to create an object (instance) to use a class.

In [2]:
example["PM2.5"]
Calling __getitem__
Out[2]:
'Series'
In [3]:
example.__getitem__("PM2.5")
Calling __getitem__
Out[3]:
'Series'
In [20]:
example["PM2.5"][0]
Calling __getitem__
Out[20]:
'S'
In [21]:
example.__getitem__("PM2.5").__getitem__(0)
Calling __getitem__
Out[21]:
'S'
In [ ]:
staff["Hours", "Iris"]
# roughly becomes
staff.__getitem__("Hours", "Iris")

# Slightly incorrect (but important to keep in mind).
# The comma-separated values are packed into one tuple, so the correct translation is actually:
staff[("Hours", "Iris")]
# becomes
staff.__getitem__(("Hours", "Iris"))
In [ ]:
seattle_air.loc[:, ["PM2.5"]]
# becomes
seattle_air.loc.__getitem__( (slice(None), ["PM2.5"]) )
In [ ]:
# Why might the Pandas developers be concerned here with this assignment statement?
# Ambiguity here: What are we actually reassigning?
  # Are we changing the original DataFrame? Or the intermediate Series?

dfmi['one']['second'] = value
# becomes
dfmi.__getitem__('one').__setitem__('second', value)

Another useful dunder method is the __repr__ method, which should return a string representing the object. By default, __repr__ just tells you the fully-qualified name of the object's class and the location it is stored in your computer memory. But we can make it much more useful by defining our own __repr__ method.

In [4]:
example
Out[4]:
<__main__.DataFrame at 0x7c05a32f5a50>
In [5]:
"Hello world"
Out[5]:
'Hello world'
In [7]:
repr('Hello world')
Out[7]:
"'Hello world'"
In [8]:
'Hello world'.__repr__()
Out[8]:
"'Hello world'"
In [9]:
len('Hello world')
Out[9]:
11
In [10]:
'Hello world'.__len__()
Out[10]:
11
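To contrast the default representation with a custom one, here is a minimal sketch using two hypothetical classes, PlainPoint and Point, that differ only in whether they define __repr__.

```python
class PlainPoint:
    """A hypothetical class that relies on the default __repr__."""

    def __init__(self, x: int, y: int) -> None:
        self.x = x
        self.y = y


class Point:
    """A hypothetical class that defines its own __repr__."""

    def __init__(self, x: int, y: int) -> None:
        self.x = x
        self.y = y

    def __repr__(self) -> str:
        """Returns a string that looks like the code used to construct this Point."""
        return f"Point({self.x!r}, {self.y!r})"


print(repr(PlainPoint(1, 2)))      # class name plus a memory address that varies per run
print(repr(Point(1, 2)))           # Point(1, 2)
print([Point(1, 2), Point(3, 4)])  # [Point(1, 2), Point(3, 4)]
```

The last line shows one reason a good __repr__ is worth writing: containers such as lists display their elements using repr.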

Practice: Student class¶

Write a Student class that represents a UW student, where each student has a name, a student number, and a courses dictionary that associates the name of each course to a number of credits. The Student class should include the following methods:

  • An initializer that takes the student number and the name of a file containing information about their schedule.
  • A method __getitem__ that takes a str course name and returns the int number of credits for the course. If the student is not taking the given course, return None.
  • A method get_courses that returns a list of the courses the student is taking.

Consider the following file nicole.txt.

CSE163 4
PHIL100 4
CSE390HA 1

The student's name is just the name of the file without the file extension. The file indicates they are taking CSE163 for 4 credits, PHIL100 for 4 credits, and CSE390HA for 1 credit.

In [41]:
class Student:
    """..."""

    # In Java, there are things called access control modifiers like `private`.
    # Python has no access control modifiers at all.
    # Instead, Python has a convention: prefix a variable name with an underscore
    # to indicate that it should not be modified or accessed by any code outside
    # the current class.
    # In your assessments, be sure to use private fields whenever possible.

    def __init__(self, student_number: int, filename: str) -> None:
        """..."""
        self._name: str = filename[:-4] # trimming-out the .txt part
        self._number: int = student_number
        self._courses: dict[str, int] = {}
        with open(filename) as f:
            for line in f.readlines():
                course, credits = line.split()  # line.split() returns a list[str]
                self._courses[course] = int(credits)  # store credits as an int

    def __getitem__(self, course_name: str) -> int | None:
        """..."""
        # if course_name in self.courses:
        #     return self.courses[course_name]
        # by default, Python functions return None
        return self._courses.get(course_name) # number of credits

    def get_courses(self) -> list[str]:
        """..."""
        return list(self._courses) # by default, dictionaries loop over only keys

    def get_name(self) -> str:
        return self._name

    def get_number(self) -> int:
        return self._number


nicole = Student(1234567, "nicole.txt")
for course in nicole.get_courses():
    print(course, nicole[course])
CSE163 4
PHIL100 4
CSE390HA 1

Type annotations¶

We've talked a lot about the types of each variable in the Python programs that we write, but we can also optionally write in the type of each variable, parameter, or return value as a type hint. In certain assessments, we'll use mypy to check your type annotations. Let's read the Type hints cheat sheet and practice adding type annotations to our previous class definitions.

In [30]:
!pip install -q nb_mypy
%reload_ext nb_mypy
%nb_mypy mypy-options --strict
Version 1.0.5
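As a small sketch of what annotations look like on variables, function definitions, and class fields, consider the following. The total_credits function and the Course class are hypothetical names introduced only for this illustration.

```python
def total_credits(courses: dict[str, int]) -> int:
    """Returns the sum of credits across all the given courses."""
    return sum(courses.values())


class Course:
    """A hypothetical class used only to illustrate annotated fields."""

    def __init__(self, name: str, credits: int) -> None:
        """Initializes a new Course with the given name and number of credits."""
        self.name: str = name
        self.credits: int = credits


# A variable annotation: course names mapped to credits.
schedule: dict[str, int] = {"CSE163": 4, "PHIL100": 4, "CSE390HA": 1}
print(total_credits(schedule))  # 9
```

Annotations are hints for tools such as mypy. Python itself does not check them when the program runs.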

Practice: University class¶

Write a University class that represents one or more students enrolled in courses at a university. The University class should include the following methods:

  • An initializer that takes the university name and, optionally, a list of Student objects to enroll in this university.
  • A method enrollments that returns all the enrolled Student objects sorted in alphabetical order by student name.
  • A method enroll that takes a Student object and enrolls them in the university.

Later, we'll add more methods to this class. How well does your approach stand up to changing requirements?

In [42]:
class University:
    """..."""

    # What does it mean to "enroll students"? Your answer is a class invariant.
    #   In this implementation, it means to add them to the self.students list.

    # Problem: A default empty list for students would be created once, when the
    #   function is defined, so it would be shared among all University instances
    #   that rely on the default.
    # Solution: Default to None and create a new list inside the initializer.
    def __init__(self, name: str, students: list[Student] | None = None) -> None:
        """..."""
        if students is None:
            students = []
        self.name = name
        self.students = students

    def enrollments(self) -> list[Student]:
        """..."""
        # return sorted(self.students, key=lambda student: student.name)
        return sorted(self.students, key=Student.get_name)

    def enroll(self, student: Student) -> None:
        """..."""
        self.students.append(student)


uw = University("Udub", [nicole])
uw.enrollments()
Out[42]:
[<__main__.Student at 0x7da680af1fd0>]
In [31]:
sorted([5, 4, 3, 2, 1])
Out[31]:
[1, 2, 3, 4, 5]
In [32]:
# Need to define some way of sorting (comparing pairwise) Student objects
sorted([nicole, Student(123, "nicole.txt")])
<cell>1: error: Name "nicole" is not defined  [name-defined]
<cell>1: error: Name "Student" is not defined  [name-defined]
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[32], line 1
----> 1 sorted([nicole, Student(123, "nicole.txt")])

TypeError: '<' not supported between instances of 'Student' and 'Student'
In [39]:
# key parameter allows us to define how we want to sort the list
#   Specifically, give it a 1-argument function that returns something sortable
#     The 1-argument function takes an element from your list.

def sorting_key(student):
    return student.get_number()

sorted([nicole, Student(123, "nicole.txt")], key=sorting_key)
<cell>5: error: Function is missing a type annotation  [no-untyped-def]
Out[39]:
[<__main__.Student at 0x7da69a3b5810>, <__main__.Student at 0x7da69536cdd0>]
In [40]:
sorted([nicole, Student(123, "nicole.txt")], key=lambda student: student.get_number())
Out[40]:
[<__main__.Student at 0x7da65d119ed0>, <__main__.Student at 0x7da69536cdd0>]
In [44]:
sorted([nicole, Student(123, "nicole.txt")], key=Student.get_number)
Out[44]:
[<__main__.Student at 0x7da69b394b50>, <__main__.Student at 0x7da680af1fd0>]
In [45]:
sorted([nicole, Student(123, "nicole.txt")], key=student.number)
<cell>1: error: Name "student" is not defined  [name-defined]
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[45], line 1
----> 1 sorted([nicole, Student(123, "nicole.txt")], key=student.number)

NameError: name 'student' is not defined
In [49]:
sorted([nicole, Student(123, "nicole.txt")], key=Student.number)
<cell>1: error: No overload variant of "sorted" matches argument types "list[Student]", "int"  [call-overload]
<cell>1: note: Possible overload variants:
<cell>1: note:     def [SupportsRichComparisonT: SupportsDunderLT[Any] | SupportsDunderGT[Any]] sorted(Iterable[SupportsRichComparisonT], /, *, key: None = ..., reverse: bool = ...) -> list[SupportsRichComparisonT]
<cell>1: note:     def [_T] sorted(Iterable[_T], /, *, key: Callable[[_T], SupportsDunderLT[Any] | SupportsDunderGT[Any]], reverse: bool = ...) -> list[_T]
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In[49], line 1
----> 1 sorted([nicole, Student(123, "nicole.txt")], key=Student.number)

AttributeError: type object 'Student' has no attribute 'number'
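The sorting experiments above can be summarized with a hypothetical Pet class (used here so the sketch does not depend on nicole.txt). It shows the two options: pass a one-argument key function, or define __lt__ so that objects can be compared pairwise without a key.

```python
class Pet:
    """A hypothetical class used only to illustrate sorting custom objects."""

    def __init__(self, name: str, age: int) -> None:
        self.name = name
        self.age = age

    def get_age(self) -> int:
        """Returns the age of this pet."""
        return self.age

    def __lt__(self, other: "Pet") -> bool:
        """Orders pets alphabetically by name when sorted is given no key."""
        return self.name < other.name


pets = [Pet("Rex", 3), Pet("Ada", 7), Pet("Moe", 1)]

by_name = sorted(pets)                                 # compares pets with __lt__
by_age = sorted(pets, key=Pet.get_age)                 # calls Pet.get_age(pet) on each element
by_age_lambda = sorted(pets, key=lambda pet: pet.age)  # equivalent one-argument function

print([pet.name for pet in by_name])  # ['Ada', 'Moe', 'Rex']
print([pet.name for pet in by_age])   # ['Moe', 'Rex', 'Ada']
```

Passing key=Pet.get_age works because Pet.get_age is itself a one-argument function whose argument is the object. Passing an attribute value, as in the failing cells above, does not work because key must be callable.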

Mutable default parameters¶

Default parameter values are evaluated and bound to the parameter when the function is defined. This can lead to some unanticipated results when using mutable values like lists or dictionaries as default parameter values.

Suppose the University initializer had instead used an empty list as the default, students: list[Student] = [], and assigned that list directly to a field. Say we then make two new University objects without specifying a list of students to enroll.

In [50]:
wsu = University("Wazzu")
wsu.enrollments()
Out[50]:
[]
In [51]:
seattle_u = University("SeattleU")
seattle_u.enrollments()
Out[51]:
[]

When we enroll a student in seattle_u, the change also affects wsu because both objects share the same default list. There are several ways to work around this; the most common approach is to change the default parameter value to None and add an if statement that creates a new list in the program logic.

In [52]:
seattle_u.enroll(nicole)
seattle_u.enrollments()
Out[52]:
[<__main__.Student at 0x7da680af1fd0>]
In [53]:
wsu.enrollments()
Out[53]:
[<__main__.Student at 0x7da680af1fd0>]
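The pitfall and its usual fix can be reproduced without any classes. In this sketch, enroll_shared and enroll_fresh are hypothetical functions written only to isolate the behavior; students are represented by plain strings.

```python
from typing import Optional


def enroll_shared(student: str, roster: list[str] = []) -> list[str]:
    """Buggy: the default list is created once, when the def statement runs."""
    roster.append(student)
    return roster


def enroll_fresh(student: str, roster: Optional[list[str]] = None) -> list[str]:
    """Fixed: a new list is created on each call that relies on the default."""
    # Optional[list[str]] means the same as list[str] | None.
    if roster is None:
        roster = []
    roster.append(student)
    return roster


first = enroll_shared("nicole")
second = enroll_shared("iris")
print(first is second)  # True: both calls returned the same default list
print(second)           # ['nicole', 'iris']

print(enroll_fresh("nicole"))  # ['nicole']
print(enroll_fresh("iris"))    # ['iris']
```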