Lesson 5. CSVs - CSE 163

The content for this lesson is adapted from material by Hunter Schafer and by Kevin Lin.

Objectives¶

In this lesson, we’ll see more advanced dictionary features and learn about the CSV data file format. By the end of this lesson, students will be able to:

Loop over the keys, values, and items of a dictionary.
Identify the list of dictionaries corresponding to some CSV data.
Loop over a list of dictionaries (CSV rows) and access dictionary values (CSV columns).

Setting up¶

To follow along with the code examples in this lesson, please download the files here:

Dictionary Methods¶

In Lesson 4, we learned about the dict (dictionary) type in Python, which represents relationships between keys and values.

d = {'a': 1, 'b': 2}
d['c'] = 3
d['a'] = 4
print(d['b'])
print(d)

Like lists, dictionaries are also objects, so they have methods that you can use to store, retrieve, and modify data. Here are some common dict methods.

dict() or {} creates a new, empty dictionary.
d.pop(key) removes key from d and returns the value that was stored for it.
d.keys() returns a collection of all the keys in d.
d.values() returns a collection of all the values in d.
d.items() returns a collection of all (key, value) tuples in d.
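
As a quick sketch of these methods in action (the dictionary inventory and its contents are made up for illustration):

```python
inventory = {'apples': 3, 'pears': 5}

removed = inventory.pop('apples')  # pop removes the key and returns its value
print(removed)                     # 3
print(inventory)                   # {'pears': 5}

inventory['plums'] = 2
# Wrapping each collection in list() makes it easy to print and inspect
print(list(inventory.keys()))      # ['pears', 'plums']
print(list(inventory.values()))    # [5, 2]
print(list(inventory.items()))     # [('pears', 5), ('plums', 2)]
```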

Looping Over a Dictionary¶

You might have been wondering from the last lesson how you would loop over a dict. With the methods we have shown, you might see how you could loop over the keys of a dictionary.

To iterate over the keys of a dict, use the keys method. The keys method returns a collection (similar to a set) of all the keys in the dict.

d = {'a': 1, 'b': 2, 'c': 3}
for k in d.keys():
    print(k, '-', d[k])

Likewise, use the values method to get a collection of all the values in a dict.

d = {'a': 1, 'b': 2, 'c': 3}
for v in d.values():
    print(v)

A common approach is to use the items method to get a collection of tuples representing each key-value pair. In this example, d.items() returns a collection of (key, value) tuples and the for loop processes each tuple in turn.

d = {'a': 1, 'b': 2, 'c': 3}
for pair in d.items():  # pair will be a tuple: (key, value)
    print(pair[0], '-', pair[1])

Recall from Lesson 4 that we learned about the tuple and that you can unpack them to store their values inside separate variables:

p = (1, 2)
print(p)

a, b = p
print(a)  # same as p[0]
print(b)  # same as p[1]

You can use this same technique in the loop over the items to make a variable for both the key and the value! This unpacks the tuple and gives a variable name to each component.

d = {'a': 1, 'b': 2, 'c': 3}
for k, v in d.items():  # unpacks the tuple into k and v
    print(k, '-', v)

This is a very common pattern when looping over all the key-value pairs in a dictionary! It uses the same unpacking syntax we saw in the previous code block, applied directly to the for loop variables.

`enumerate` and `zip`¶

This new loop unpacking syntax is not only useful for dictionaries, but also for looping over other data structures. enumerate and zip are two built-in Python functions that can improve your experience looping over sequences.

enumerate helps you loop over both the indices and elements of a sequence at the same time. enumerate takes a sequence, such as a list, and produces 2-element tuples, each containing an index and the element at that index.

squares = [i ** 2 for i in range(1, 11)]

for i in range(len(squares)):
    n = squares[i]
    print(i, n)

for i, n in enumerate(squares):
    print(i, n)

zip helps you loop over multiple sequences at the same time. zip takes one or more sequences and returns the “zipped” sequence, which presents the first element of each sequence as a tuple, the second element of each sequence as a tuple, the third element of each sequence as a tuple, etc.

numbers = [i for i in range(1, 11)]
squares = [i ** 2 for i in numbers]
cubes = [i ** 3 for i in numbers]

for i in range(len(numbers)):
    print(numbers[i], squares[i], cubes[i])

for n, s, c in zip(numbers, squares, cubes):
    print(n, s, c)

If the sequences are not all the same length, then zip stops after yielding all elements from the shortest sequence.
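
A small sketch of that stopping behavior, using two made-up lists of different lengths:

```python
letters = ['a', 'b', 'c']
numbers = [1, 2]

# zip stops once the shorter list (numbers) runs out, so 'c' is never paired
pairs = list(zip(letters, numbers))
print(pairs)  # [('a', 1), ('b', 2)]
```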

CSVs¶

In Lesson 3, we started looking at file processing with .txt files. Now, let’s look at a well-structured type of text data called a CSV (stands for comma-separated values). If you are familiar with an Excel Spreadsheet, a Calc Spreadsheet, or a Google Sheet, you already understand the basic idea behind what a CSV is!

Consider a table of scientists and the number of books about them:

Name	Books
Marie Curie	7
Mary Anning	3
George Washington Carver	5
Guido van Rossum	2

A table has two main components to it:

Rows: In this example, each row corresponds to one scientist.
Columns: Each column defines a different aspect or component of your data. In this case we have one column for the scientist’s name and the other for the books about them.

A CSV file is just a well-formatted file that preserves this tabular structure of rows/columns in a format that is more easily read by programs written in Python. Here’s the corresponding CSV file:

Name,Books
Marie Curie,7
Mary Anning,3
George Washington Carver,5
Guido van Rossum,2

Notice that each row appears on its own line and each column value is separated by a comma (hence the name Comma Separated Values). You can have whitespace in an entry, such as "Marie Curie". It’s usually conventional to have the first line of the CSV store the names of the columns so that you can refer to them by name later.

Processing CSVs¶

Now that we understand what a CSV looks like, how do we process CSV data? What if I want to find the total number of books about all our scientists?

You might imagine solving this with the file-processing skills we have learned so far: read the file line by line in a loop, split each line on commas, and then do our computation on the extracted data. Unfortunately, this approach has some drawbacks:

The code is not very flexible if I want to compute some other value. What if I want to compute the scientist with the greatest number of books? I would have to duplicate all this complex file-parsing code to access a different column. This also comes at a cost of efficiency (e.g. speed of program) since, for each task, you will need to re-read the file.
Our example CSV is relatively simple. In reality, the CSV format can get much more complicated; maybe there could be string data that includes commas in the text itself! It would be nice to separate the logic of parsing the data from our computations so that our code is more readable and maintainable.
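
To make these drawbacks concrete, here is a minimal sketch of the line-splitting approach. To keep it self-contained it reads from a string (csv_text, a name made up here) instead of a file, and the tricky row at the end is an invented example of a value that contains a comma.

```python
csv_text = """Name,Books
Marie Curie,7
Mary Anning,3
George Washington Carver,5
Guido van Rossum,2"""

lines = csv_text.split('\n')
total = 0
for line in lines[1:]:        # skip the header row
    tokens = line.split(',')
    total += int(tokens[1])   # column 1 is Books; convert from str to int
print(total)  # 17

# The same splitting logic misreads a value that itself contains a comma
tricky = '"Carver, George Washington",5'
print(tricky.split(','))  # ['"Carver', ' George Washington"', '5']
```

Notice that the parsing and the computation are tangled together in one loop, and that the last split produces three pieces for a row that only has two columns.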

List of Dictionaries¶

To accomplish this, we will start by storing our data in some data structure (list, set, dictionary, etc.) that will help us process it later. A very common thing to do when processing this type of data is to store it in a list of dictionaries.

data = [
    {'Name': 'Marie Curie',              'Books': 7},
    {'Name': 'Mary Anning',              'Books': 3},
    {'Name': 'George Washington Carver', 'Books': 5},
    {'Name': 'Guido van Rossum',         'Books': 2}
]

This data structure is a list that stores dicts as its entries; therefore we call it a list of dictionaries. Each dictionary represents a single row of the dataset: this is why there are 4 dictionaries inside this list. Inside each dictionary, there is a key/value pair for every column of the data, giving that row’s value for the column.

This is a bit complex when you see it at first because the data structures are nested: each element of the list is a dictionary! This means if you stored the above data in a variable called data, you could access a dictionary by indexing into the list. For example, to get the name of the scientist at index 1, you might write:

data = [
    {'Name': 'Marie Curie',              'Books': 7},
    {'Name': 'Mary Anning',              'Books': 3},
    {'Name': 'George Washington Carver', 'Books': 5},
    {'Name': 'Guido van Rossum',         'Books': 2}
]

print('Data:', data)
print('Number of rows:', len(data))  # Since data is just a list
print('Row 2:', data[1])

sci = data[1]  # This is a dictionary: {'Name': 'Mary Anning', 'Books': 3}
print('Name of Scientist in Row 2:', sci['Name'])

# It helps to print out the types of things
print()
print('Types')
print('type(data)', type(data))
print('type(data[1])', type(data[1]))
print("type(sci['Name'])", type(sci['Name']))

We can also access the name stored in data[1] directly, without first assigning the dictionary to a variable. This is convenient for picking out a particular value in the list of dictionaries.

data = [
    {'Name': 'Marie Curie',              'Books': 7},
    {'Name': 'Mary Anning',              'Books': 3},
    {'Name': 'George Washington Carver', 'Books': 5},
    {'Name': 'Guido van Rossum',         'Books': 2}
]

print('Name of Scientist in Row 2:', data[1]['Name'])

Let’s look at another example. Before running this code cell, see if you can predict the output of each line of code!

data = [
    {'Name': 'Marie Curie',              'Books': 7},
    {'Name': 'Mary Anning',              'Books': 3},
    {'Name': 'George Washington Carver', 'Books': 5},
    {'Name': 'Guido van Rossum',         'Books': 2}
]

print('First example')
print(data[2]['Name'])
print()

print('Second example')
print(data['Books'][0])

Food for thought: Why does the "First example" run correctly but the "Second example" does not?

Looping Over the List of Dictionaries¶

Let’s look back to our example from earlier where we want to compute the total number of scientist books. We start by writing a loop to go over each scientist in the list (each scientist is a dictionary). We then access the 'Books' entry in each dictionary and add that to a variable for a cumulative sum.

data = [
    {'Name': 'Marie Curie',              'Books': 7},
    {'Name': 'Mary Anning',              'Books': 3},
    {'Name': 'George Washington Carver', 'Books': 5},
    {'Name': 'Guido van Rossum',         'Books': 2}
]

total_books = 0
for sci in data:  # sci is a dictionary
    total_books = total_books + sci['Books']
print(total_books)

It sometimes helps to pause and think about the types again. Recall that the type of data in this example is list. The type of each value inside that list is a dict (e.g. {'Name': 'Marie Curie', 'Books': 7}). Each of these dicts will have the same keys (e.g. 'Name' and 'Books').
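
Because every row has the same keys, one loop can use more than one column at a time. The following is a small sketch under assumed requirements: the threshold of 5 books and the names popular and average are made up for illustration.

```python
data = [
    {'Name': 'Marie Curie',              'Books': 7},
    {'Name': 'Mary Anning',              'Books': 3},
    {'Name': 'George Washington Carver', 'Books': 5},
    {'Name': 'Guido van Rossum',         'Books': 2}
]

total_books = 0
popular = []  # names of scientists with at least 5 books
for sci in data:
    total_books = total_books + sci['Books']
    if sci['Books'] >= 5:
        popular.append(sci['Name'])

average = total_books / len(data)
print(average)  # 4.25
print(popular)  # ['Marie Curie', 'George Washington Carver']
```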

Parsing a CSV File¶

You might be asking: How do you write the code to parse this CSV into the list of dictionaries?

That actually turns out to be a very hard problem that is well outside of what we have learned so far! There are lots of cases to handle around getting the types of the values correct, such as knowing when to turn the values of a column into ints rather than strs. In fact, this task is so tricky to implement from scratch that the way we will read a CSV file is by using the industry-standard pandas library. We’ll learn more about this library next week!

For now, it suffices that we have created a parse function and will provide it to you in cse163_utils.py everywhere it’s needed!
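
To get a feel for what such a function has to do, here is a rough sketch using csv.DictReader from Python’s standard library. This is not the parse function from cse163_utils.py; it is only an illustration, and it reads from a string (csv_text, a name made up here) so that it runs without any files.

```python
import csv
import io

csv_text = "Name,Books\nMarie Curie,7\nMary Anning,3\n"

rows = []
for row in csv.DictReader(io.StringIO(csv_text)):
    # DictReader reads every value as a str, so we convert Books ourselves
    rows.append({'Name': row['Name'], 'Books': int(row['Books'])})

print(rows)
# [{'Name': 'Marie Curie', 'Books': 7}, {'Name': 'Mary Anning', 'Books': 3}]
```

Note that the sketch has to be told which column holds ints. Deciding that automatically for any CSV is one of the tricky cases mentioned above.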

Typing List of Dictionaries¶

Since the list of dictionaries is such a common format, Python provides a special syntax to annotate the types of its data. You can say which columns to expect in your data and what types the values in those columns have. The syntax is a little more complicated, but it essentially lets us define a new type for what we expect each row to have in the data. We will provide the types for a list of dictionaries for the problems and assignments in our course, so you don’t need to write one of these on your own.

For example, with the scientist information example above, we would write a total books function like the following. This defines a brand new type called ScientistInfo, which outlines the expected columns and their types.

from typing import TypedDict

# Defines a new type for each row of our scientist dataset
# This new type is called ScientistInfo
ScientistInfo = TypedDict('ScientistInfo', {'Name': str, 'Books': int})


def total_books(scientists: list[ScientistInfo]) -> int:
    """
    This function takes in a list of dictionaries representing
    information about scientists, and returns the total number
    of books about all the scientists.
    """
    books = 0
    for sci in scientists:  # sci is a dictionary
        books = books + sci['Books']
    return books

In your doc-string comment, be sure to specify that the input is a list of dictionaries and what each dictionary represents!
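
TypedDict also supports a class-based syntax that some people find easier to read. The sketch below defines the same row type that way and calls the function on a small made-up list. Keep in mind that Python does not enforce these annotations when the program runs; they are there for readers and for type-checking tools.

```python
from typing import TypedDict


class ScientistInfo(TypedDict):
    # One entry per CSV column: the key name and the type of its values
    Name: str
    Books: int


def total_books(scientists: list[ScientistInfo]) -> int:
    """
    Takes a list of dictionaries, each representing one scientist,
    and returns the total number of books about all the scientists.
    """
    books = 0
    for sci in scientists:
        books = books + sci['Books']
    return books


scientists: list[ScientistInfo] = [
    {'Name': 'Marie Curie', 'Books': 7},
    {'Name': 'Mary Anning', 'Books': 3},
]
print(total_books(scientists))  # 10
```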

⏸️ Pause and 🧠 Think¶

Take a moment to review the following concepts and reflect on your own understanding. A good temperature check for your understanding is asking yourself whether you might be able to explain these concepts to a friend outside of this class.

Here’s what we covered in this lesson:

More dictionary methods
Looping over a dictionary
enumerate
zip
CSV format
List of dictionaries for CSV processing
parse
Type annotations for list of dictionaries

Here are some other guiding exercises and questions to help you reflect on what you’ve seen so far:

In your own words, write a few sentences summarizing what you learned in this lesson.
What did you find challenging in this lesson? Come up with some questions you might ask your peers or the course staff to help you better understand that concept.
What was familiar about what you saw in this lesson? How might you relate it to things you have learned before?
Throughout the lesson, there were a few Food for thought questions. Try exploring one or more of them and see what you find.

In-Class¶

When you come to class, we will work together on largest_earthquake.py and shakiness_by_location.py. We will also need cse163_utils.py and earthquakes.csv for these tasks. Make sure that you have a way of editing and running these files!

List of dictionaries practice¶

Consider the following CSV data:

Name,Major,Section
Rit,PHIL,AA
Alex,EE,AB
Paul,ECO,AC

What does the Python list of dictionaries look like for this data?
How would you access Rit’s Major?
How would you access Paul’s Section?

`largest_earthquake`¶

For this practice problem, refer to largest_earthquake.py.

For this problem, we will be using a dataset containing information about earthquakes around the world, which is stored in Jupyter Hub with the path /materials/data/earthquakes.csv. When developing locally, you will use the relative path earthquakes.csv (assuming that earthquakes.csv and largest_earthquake.py are located in the same directory on your computer).

Here are the first few rows of the dataset for reference:

id	year	month	day	latitude	longitude	name	magnitude
nc726666881	2016	7	27	37.6723333	-121.619	California	1.43
us200006i0y	2016	7	27	21.5146	94.5721	Burma	4.9
nc726666891	2016	7	27	37.5765	-118.8561667	California	0.06

Write a function largest_magnitude that takes the above earthquake data represented as a list of dictionaries and returns the name of the location that experienced the largest earthquake by magnitude. If there are no rows in the dataset (no data at all), return None.

If you only look at the rows of the dataset shown above, the result would be 'Burma' because it had the earthquake with the largest magnitude (4.9).

Do not assume the dataset passed in has the same values or the same number of rows as the one shown above. You can, however, assume that every row in the dataset has all of the columns shown. We sometimes call the particular columns of a CSV its schema.

`shakiness_by_location`¶

For this practice problem, refer to shakiness_by_location.py.

We will use the same earthquakes dataset as the last problem.

Write a function shakiness_by_location that takes the earthquakes data in the list of dictionaries format and returns a dict that stores the “shakiness” of each location. The shakiness of a location is defined as the sum of all earthquake magnitudes at that location. If there are no earthquakes in the dataset, it should return an empty dict.

Consider the following earthquake data, stored in a variable called data (only showing the name and magnitude columns):

[
    {'name': 'Seattle', 'magnitude': 4},
    {'name': 'Genovia', 'magnitude': 6},
    {'name': 'Seattle', 'magnitude': 3.5}
]

shakiness_by_location(data) should return {'Seattle': 7.5, 'Genovia': 6}.

Like the last problem, do not assume anything about the number of rows in the dataset or their values, but each row will have all the expected columns.

Canvas Quiz¶

All done with the lesson? Complete the Canvas Quiz linked here!