Objects
Over the past few weeks, we've used the word "object" frequently without defining exactly what it means. In this lesson, we'll introduce objects and see how we can use them in real data programming work. By the end of this lesson, students will be able to:
- Define a Python class to represent objects with specific states and behaviors.
- Explain how the Python memory model allows multiple references to the same objects.
- Add type annotations to variables, function definitions, and class fields.
import pandas as pd
An object (also called an instance) in Python combines two software concepts into one distinct unit, which is known as encapsulation:
- State, or data like the elements of a list.
- Behavior, or methods like a function that can take a list and return the size of the list.
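As a small illustration of these two concepts, here is a sketch using a built-in list. The variable name numbers is only for illustration.

```python
# A list object bundles state (its elements) with behavior (its methods).
numbers = [3, 1, 2]

# State: the elements currently stored in the list.
print(numbers)           # [3, 1, 2]

# Behavior: functions and methods that read or change that state.
numbers.append(4)        # changes the state of this particular list
print(len(numbers))      # 4
print(numbers.count(3))  # 1
```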
Recently, we've been using DataFrame objects frequently. A DataFrame stores data (state) and has many methods (behaviors), such as groupby.
seattle_air = pd.read_csv("seattle_air.csv", index_col="Time", parse_dates=True)
seattle_air.groupby(seattle_air.index.year).count()
| Time | PM2.5 |
|---|---|
| 2017 | 6283 |
| 2018 | 8540 |
| 2019 | 8597 |
| 2020 | 8683 |
| 2021 | 8664 |
| 2022 | 2292 |
type(seattle_air)
pandas.core.frame.DataFrame
Reference semantics
When we call a method like groupby and then count each group, the result is a new object that is distinct from the original. If we now ask for the value of seattle_air, we'll see that the original DataFrame is still there with all its data intact and untouched by the groupby or count operations.
seattle_air
| Time | PM2.5 |
|---|---|
| 2017-04-06 00:00:00 | 6.8 |
| 2017-04-06 01:00:00 | 5.3 |
| 2017-04-06 02:00:00 | 5.3 |
| 2017-04-06 03:00:00 | 5.6 |
| 2017-04-06 04:00:00 | 5.9 |
| ... | ... |
| 2022-04-06 19:00:00 | 5.1 |
| 2022-04-06 20:00:00 | 5.0 |
| 2022-04-06 21:00:00 | 5.3 |
| 2022-04-06 22:00:00 | 5.2 |
| 2022-04-06 23:00:00 | 5.2 |
43848 rows × 1 columns
seattle_air.groupby(seattle_air.index.year)
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x78bb42ff3c70>
However, unlike groupby, there are some DataFrame methods that can modify the underlying DataFrame. The dropna method for removing NaN values can modify the original when we include the keyword argument inplace=True (default False). Furthermore, if inplace=True, dropna will return None to more clearly communicate that instead of returning a new DataFrame, changes were made to the original DataFrame.
seattle_air_nanremoved = seattle_air.dropna()
type(seattle_air_nanremoved)
pandas.core.frame.DataFrame
seattle_air.dropna(inplace=True)
seattle_air
| Time | PM2.5 |
|---|---|
| 2017-04-06 00:00:00 | 6.8 |
| 2017-04-06 01:00:00 | 5.3 |
| 2017-04-06 02:00:00 | 5.3 |
| 2017-04-06 03:00:00 | 5.6 |
| 2017-04-06 04:00:00 | 5.9 |
| ... | ... |
| 2022-04-06 19:00:00 | 5.1 |
| 2022-04-06 20:00:00 | 5.0 |
| 2022-04-06 21:00:00 | 5.3 |
| 2022-04-06 22:00:00 | 5.2 |
| 2022-04-06 23:00:00 | 5.2 |
43059 rows × 1 columns
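The same distinction between returning a new object and modifying an existing one shows up with built-in lists, which makes it easy to try without a dataset. The sketch below also shows that two variables can refer to the same object. The variable names are illustrative only.

```python
original = [3, 1, 2]
alias = original              # a second reference to the same list object
print(alias is original)      # True

# sorted returns a new list and leaves the original untouched.
new_list = sorted(original)
print(original)               # [3, 1, 2]
print(new_list)               # [1, 2, 3]

# list.sort modifies the list in place and returns None.
result = original.sort()
print(result)                 # None
print(alias)                  # [1, 2, 3], since alias refers to the same object
```

Because alias and original refer to one object, the in-place sort is visible through both names, while new_list is a separate object.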
Defining classes
Python allows us to create our own custom objects by defining a class: a blueprint or template for objects. The pandas developers defined a DataFrame class so that you can construct DataFrame objects to use. Here's a highly simplified outline of the code that they could have written to define the DataFrame class.
seattle_air.groupby([...]) # seattle_air is the self
class DataFrame:
    """Represents two-dimensional tabular data structured around an index and column names."""

    def __init__(self, index, columns, data):
        """Initializes a new DataFrame object from the given index, columns, and tabular data."""
        print("Initializing DataFrame")
        self.index = index
        self.columns = columns
        self.data = data

    def dropna(self, inplace=False):
        """
        Drops all rows containing NaN from this DataFrame. If inplace, returns None and modifies
        self. If not inplace, returns a new DataFrame without modifying self.
        """
        print("Calling dropna")
        if not inplace:
            return DataFrame([...], [...], [...])
        else:
            self.columns = [...]
            self.index = [...]
            self.data = [...]
            return None

    def __getitem__(self, column_or_indexer):
        """Given a column or indexer, returns the selection as a new Series or DataFrame object."""
        print("Calling __getitem__")
        if column_or_indexer in self.columns:
            return "Series" # placeholder for a Series
        else:
            return DataFrame([...], [...], [...])
# my_object = DataFrame(...)
# my_list[0]
# def __getitem__(self, index):
# return self.data.at(0)
Let's break down each part of the code.
- class DataFrame: begins the class definition. We always name classes by capitalizing each word and removing the spaces between words.
- def __init__(self, index, columns, data): defines a special function called an initializer. The initializer is called whenever a new object is constructed. Each DataFrame stores its own data in fields (variables associated with an object), in this case called index, columns, and data.
- def dropna(self, inplace=False): defines a function that can be called on DataFrame objects. Like the initializer, it takes a self parameter, along with a parameter inplace whose default value is False. Depending on the value of inplace, it returns either a new DataFrame or None.
- def __getitem__(self, column_or_indexer): defines a special function that is called when you use square brackets for indexing.
Notice how every method (function associated with an object) always takes self as the first parameter. The two special functions that we defined above are only "special" in the sense that they have a specific naming format preceded by two underscores and followed by two underscores. These dunder methods are used internally by Python to enable the convenient syntax that we're all used to using.
Just like how we need to call a function to use it, we also need to create an object (instance) to use a class.
example = DataFrame([0, 1, 2], ["PM2.5"], [10, 20, 30])
example["PM2.5"]
Initializing DataFrame
Calling __getitem__
'Series'
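To see dunder methods at work outside of pandas, here is a minimal sketch with a hypothetical Playlist class (not part of any library). It shows that square brackets and len are shorthand for calls to __getitem__ and __len__.

```python
class Playlist:
    """Hypothetical class: an ordered collection of song titles."""

    def __init__(self, songs: list[str]) -> None:
        self.songs = songs

    def __getitem__(self, position: int) -> str:
        """Called when we write playlist[position]."""
        return self.songs[position]

    def __len__(self) -> int:
        """Called when we write len(playlist)."""
        return len(self.songs)


playlist = Playlist(["Intro", "Verse", "Outro"])
print(playlist[1])               # Verse
print(playlist.__getitem__(1))   # Verse, the same call written out explicitly
print(len(playlist))             # 3
```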
Another useful dunder method is the __repr__ method, which should return a string representing the object. By default, __repr__ just tells you the fully-qualified name of the object's class and the location it is stored in your computer memory. But we can make it much more useful by defining our own __repr__ method.
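Here is a short sketch comparing the default representation with a custom one. Both classes are hypothetical and exist only for this comparison.

```python
class PlainPoint:
    """Hypothetical class that relies on the default __repr__."""

    def __init__(self, x: int, y: int) -> None:
        self.x = x
        self.y = y


class Point:
    """Hypothetical class that defines its own __repr__."""

    def __init__(self, x: int, y: int) -> None:
        self.x = x
        self.y = y

    def __repr__(self) -> str:
        return f"Point({self.x}, {self.y})"


# The default shows the class name and a memory address that varies between runs.
print(repr(PlainPoint(1, 2)))
# The custom version shows the state that matters to us.
print(repr(Point(1, 2)))   # Point(1, 2)
```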
"stri\'ng"
"stri'ng"
my_dict = {}
my_list = [0, 1, 2]
my_dict[my_list] = 3
# __hash__()
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[16], line 3
      1 my_dict = {}
      2 my_list = [0, 1, 2]
----> 3 my_dict[my_list] = 3

TypeError: unhashable type: 'list'
Poll questions: staff["Hours"]["Thrisha"]
csv = """
Name,Hours
Diana,10
Thrisha,15
Yuxiang,20
Sheamin,12
"""
import io
staff = pd.read_csv(io.StringIO(csv), index_col=["Name"])
staff["Hours"]["Thrisha"]
15
staff.__getitem__(self, "Hours", "Thrisha")
column_or_indexer = (self, "Hours", "Thrisha")
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[20], line 1
----> 1 staff.__getitem__(self, "Hours", "Thrisha")

NameError: name 'self' is not defined
staff.__getitem__("Hours")
Name
Diana      10
Thrisha    15
Yuxiang    20
Sheamin    12
Name: Hours, dtype: int64
staff.__getitem__("Hours").__getitem__("Thrisha")
15
Practice: Student class
Write a Student class that represents a UW student, where each student has a name, a student number, and a courses dictionary that associates the name of each course to a number of credits. The Student class should include the following methods:
- An initializer that takes the student number and the name of a file containing information about their schedule.
- A method __getitem__ that takes a str course name and returns the int number of credits for the course. If the student is not taking the given course, return None.
- A method get_courses that returns a list of the courses the student is taking.
Consider the following file nicole.txt.
CSE163 4
PHIL100 4
CSE390HA 1
The student's name is just the name of the file without the file extension. The file indicates they are taking CSE163 for 4 credits, PHIL100 for 4 credits, and CSE390HA for 1 credit.
class Student:
    """
    Represents a UW student, which contains the student's name, student number, and their course schedule.
    """

    def __init__(self, student_number: int, file_to_schedule: str) -> None:
        """
        Initializes a student with the student number and their schedule.
        The file_to_schedule is expected to be of format "<name>.txt".
        """
        self._number: int = student_number
        self._name: str = file_to_schedule.split(".")[0]
        self._schedule: dict[str, int] = {}
        self._load_file(file_to_schedule)

    def _load_file(self, file_to_schedule: str) -> None:
        """
        Loads the student's schedule from the input file.
        """
        with open(file_to_schedule) as f:
            for line in f.readlines():
                course_name, course_credit = line.rstrip().split(' ')
                # course_credit is read as a str, so convert it to an int
                self._schedule[course_name] = int(course_credit)

    def __getitem__(self, course_name: str) -> int | None:
        """
        Returns the number of credits for the course; if the student is not taking
        the course, returns None.
        """
        if course_name in self._schedule:
            return self._schedule[course_name]
        return None

    def get_courses(self) -> list[str]:
        """
        Returns the list of courses the student is taking.
        """
        return list(self._schedule)

    def total_number_of_credits(self) -> int:
        """
        Returns the total number of credits the student is taking.
        """
        return sum(self._schedule.values())

    # Getter methods for private variables
    def get_name(self) -> str:
        return self._name

    def get_number(self) -> int:
        return self._number

    def __repr__(self) -> str:
        return f'Student({self._number}, "{self._name}.txt")'

    # def __lt__(self, other: Student) -> int | bool
nicole = Student(1234567, "nicole.txt")
for course in nicole.get_courses():
print(course, nicole[course])
CSE163 4
PHIL100 4
CSE390HA 1
nicole.get_name()
Type annotations
We've talked a lot about the types of each variable in the Python programs that we write, but we can also optionally write in the type of each variable, parameter, or return value as a type hint. In certain assessments, we'll use mypy to check your type annotations. Let's read the Type hints cheat sheet and practice adding type annotations to our previous class definitions.
!pip install -q nb_mypy
%reload_ext nb_mypy
%nb_mypy mypy-options --strict
Version 1.0.5
Practice: University class
Write a University class that represents one or more students enrolled in courses at a university. The University class should include the following methods:
- An initializer that takes the university name and, optionally, a list of Student objects to enroll in this university.
- A method enrollments that returns all the enrolled Student objects sorted in alphabetical order by student name.
- A method enroll that takes a Student object and enrolls them in the university.
Later, we'll add more methods to this class. How well does your approach stand up to changing requirements?
lambda student:student.get_name()
students = [Student(1234568, "student.txt"), Student(1234567, "nicole.txt")]
sorted(students, key=lambda student:student.get_name())
[Student(1234567, "nicole.txt"), Student(1234568, "student.txt")]
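def get_student_key(student):
    return student.get_name()

# Conceptually, sorted calls the key function once per element to compute its sort key.
# Each assignment below computes the same key in a different but equivalent way.
key = get_student_key
for student in students:
    student_key = key(student)
    student_key = student.get_name()
    student_key = (lambda student: student.get_name())(student)
    student_key = get_student_key(student)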
sorted(students, key=get_student_key)
sorted(students, key=Student.get_name)
Additional methods:
- roster() that takes in a course name and returns a list of students enrolled in that course.
- average_number_of_credits() that returns the average number of credits the students enrolled are taking.
class University:
    """
    Represents one or more students enrolled in courses at a university.
    """

    def __init__(self, univ_name: str, students: list[Student] | None = None) -> None:
        """Takes the name of the university and optionally a list of students enrolled."""
        self._name: str = univ_name
        if students is None:
            self._students: list[Student] = []
        else:
            self._students = students
        # Map each course name to the list of students taking it
        self._courses: dict[str, list[Student]] = {}
        for student in self._students:
            for course in student.get_courses():
                if course in self._courses:
                    self._courses[course].append(student)
                else:
                    self._courses[course] = [student]

    def enrollments(self) -> list[Student]:
        """Returns all the enrolled students sorted by their name in alphabetical order."""
        sorted_students = sorted(self._students, key=lambda student: student.get_name())
        return sorted_students

    def enroll(self, student: Student) -> None:
        """Enrolls the student in the university."""
        self._students.append(student)
        for course in student.get_courses():
            if course in self._courses:
                self._courses[course].append(student)
            else:
                self._courses[course] = [student]

    def roster(self, course_name: str) -> list[Student] | None:
        """
        Returns the list of students enrolled in the given course. If there are no students
        enrolled, returns an empty list. If the course does not exist, returns None.
        """
        if course_name in self._courses:
            return self._courses[course_name]
        return None

    def average_number_of_credits(self) -> float:
        """
        Returns the average number of credits a student is taking.
        """
        total = 0
        for student in self._students:
            total += student.total_number_of_credits()
        # Avoid dividing by zero when no students are enrolled
        if len(self._students) == 0:
            return 0
        return total / len(self._students)
uw = University("Udub", [nicole])
uw.enrollments()
[Student(1234567, "nicole.txt")]
uw.roster("CSE163")
[Student(1234567, "nicole.txt")]
uw.enroll(Student(1234568, "student.txt"))
uw.roster("CSE163")
[Student(1234567, "nicole.txt"), Student(1234568, "student.txt")]
uw.roster("CSE400")
Mutable default parameters
Default parameter values are evaluated and bound to the parameter when the function is defined. This can lead to some unanticipated results when using mutable values like lists or dictionaries as default parameter values.
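Say the initializer had declared a mutable default such as students=[], and we make two new University objects without specifying a list of students to enroll. The initializer would then assign the same default list object to a field in both universities.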
wsu = University("Wazzu")
wsu.enrollments()
[]
sea_u = University("SeaU")
sea_u.enrollments()
[]
With a shared default list, enrolling a student in sea_u would also affect wsu. There are several ways to work around this; the most common is to change the default parameter value to None and add an if statement in the program logic. The University class defined earlier takes this approach, so enrolling a student in sea_u leaves wsu unchanged.
sea_u.enroll(nicole)
sea_u.enrollments()
[Student(1234567, "nicole.txt")]
wsu.enrollments()
[]