CSV Data¶

In this lesson, we'll learn more advanced dictionary features and the CSV data file format. By the end of this lesson, students will be able to:

  • Loop over the keys, values, and items of a dictionary.
  • Identify the list of dictionaries corresponding to some CSV data.
  • Loop over a list of dictionaries (CSV rows) and access dictionary values (CSV columns).
In [1]:
import doctest

Dictionary functions¶

Dictionaries, like lists, are also mutable data structures so they have functions to help store and retrieve elements.

  • d.pop(key) removes key from d.
  • d.keys() returns a collection of all the keys in d.
  • d.values() returns a collection of all the values in d.
  • d.items() returns a collection of all (key, value) tuples in d.

There are different ways to loop over a dictionary.

In [9]:
example = "I really like dogs".split()
example[1] # Lists are sort of like dictionaries... in that you have an index
Out[9]:
'really'
In [8]:
dictionary["a"] # This gets the value associated with "a"
Out[8]:
1
In [2]:
dictionary = {"a": 1, "b": 2, "c": 3}
for key in dictionary:
    print(key, dictionary[key]) # dictionary[key] gets value associated with key
a 1
b 2
c 3
In [3]:
dictionary = {"a": 1, "b": 2, "c": 3}
for key in dictionary.keys():
    print(key, dictionary[key])
a 1
b 2
c 3
In [4]:
dictionary = {"a": 1, "b": 2, "c": 3}
for value in dictionary.values():
    # Dictionaries do not have an easy way to go from values back to keys!
    print(value)
1
2
3
In [6]:
dictionary = {"a": 1, "b": 2, "c": 3}
for key, value in dictionary.items():
    # value = dictionary[key]
    print(key, value)
a 1
b 2
c 3

None in Python¶

In an earlier lesson, we wrote a function to count the occurrences of each token in a file as a dict where the keys are words and the values are counts.

{"green": 2, "eggs": 6, "and": 3, "yam": 2}

Suppose we want to debug the following function most_frequent that takes this dictionary as input and returns the word with the highest count. If the input were a list, we could index the zero-th element from the list and loop over the remaining values by slicing the list. But it's harder to do this with a dictionary.

Python has a special None keyword, like null in Java, that represents a placeholder value.

In [11]:
def most_frequent(counts):
    """
    Returns the token in the given dictionary with the highest count, or None if empty.

    >>> most_frequent({"green": 2, "eggs": 6, "and": 3, "yam": 2})
    'eggs'
    >>> most_frequent({}) # None is not displayed as output

    """
    max_word = None
    for word in counts:
        # KeyError means that we've tried to look up a dictionary value
        # for a key that is not in the dictionary
        # Often want to check if a key is None before accessing its value from a dictionary
        if max_word is None or counts[word] > counts[max_word]:
            max_word = word
    return max_word


doctest.run_docstring_examples(most_frequent, globals())
In [13]:
def most_frequent(counts):
    """
    Returns the token in the given dictionary with the highest count, or None if empty.

    >>> most_frequent({"green": 2, "eggs": 6, "and": 3, "yam": 2})
    'eggs'
    >>> most_frequent({}) # None is not displayed as output

    """
    max_word = None
    for word in counts:
        # Hmm, what if we change the order of the "or" operator
        # First, Python has to check the left condition, then the right one!
        # Can look up "short circuiting"
        if counts[word] > counts[max_word] or max_word is None:
            max_word = word
    return max_word


doctest.run_docstring_examples(most_frequent, globals())
**********************************************************************
File "__main__", line 5, in NoName
Failed example:
    most_frequent({"green": 2, "eggs": 6, "and": 3, "yam": 2})
Exception raised:
    Traceback (most recent call last):
      File "/opt/conda/lib/python3.10/doctest.py", line 1350, in __run
        exec(compile(example.source, filename, "single",
      File "<doctest NoName[0]>", line 1, in <module>
        most_frequent({"green": 2, "eggs": 6, "and": 3, "yam": 2})
      File "/tmp/ipykernel_190/3519406350.py", line 13, in most_frequent
        if counts[word] > counts[max_word] or max_word is None:
    KeyError: None

There's a special keyword called is that we just saw here! It's not the same (exactly) as ==. For now, the only time we need to use is is for comparing against None. Technically, you can also do max_word == None, but it's not as precise.

In [12]:
most_frequent({})

Loop unpacking¶

When we need keys and values, we can loop over and unpack each key-value pair by looping over the dictionary.items().

In [ ]:
dictionary = {"a": 1, "b": 2, "c": 3}
for key, value in dictionary.items():
    print(key, value)

Loop unpacking is not only useful for dictionaries, but also for looping over other sequences such as enumerate and zip. enumerate is a built-in function that takes a sequence and returns another sequence of pairs representing the element index and the element value.

In [15]:
with open("poem.txt") as f:
    # This handles the line_num variable that we used to have to create manually
    for i, line in enumerate(f.readlines(), start=1):
        print(i, line[:-1])
1 she sells
2 sea
3 shells by
4 the sea shore

zip is another built-in function that takes one or more sequences and returns a sequence of tuples consisting of the first element from each given sequence, the second element from each given sequence, etc. If the sequences are not all the same length, zip stops after yielding all elements from the shortest sequence.

In [16]:
arabic_nums = [  1,    2,     3,    4,   5]
alpha_nums  = ["a",  "b",   "c",  "d", "e"]
roman_nums  = ["i", "ii", "iii", "iv", "v"]

# Zip iterates through each list at the same time
for arabic, alpha, roman in zip(arabic_nums, alpha_nums, roman_nums):
    print(arabic, alpha, roman)
1 a i
2 b ii
3 c iii
4 d iv
5 e v

Comma-separated values¶

In data science, we often work with tabular data such as the following table representing the names and hours of some of the TAs.

Name Hours
Anna 20
Iris 15
Abiy 10
Gege 12

A table has two main components to it:

  • Rows corresponding to each entry, such as each individual TA.
  • Columns corresponding to (required or optional) fields for each entry, such as TA name and TA hours.

A comma-separated values (CSV) file is a particular way of representing a table using only plain text. Here is the corresponding CSV file for the above table. Each row is separated with a newline. Each column is separated with a single comma ,.

Name,Hours
Anna,20
Iris,15
Abiy,10
Gege,12

We'll learn a couple ways of processing CSV data in this course, first of which is representing the data as a list of dictionaries.

In [17]:
staff = [
    {"Name": "Anna", "Hours": 20},
    {"Name": "Iris", "Hours": 15},
    {"Name": "Abiy", "Hours": 10},
    {"Name": "Gege", "Hours": 12},
]
staff
Out[17]:
[{'Name': 'Anna', 'Hours': 20},
 {'Name': 'Iris', 'Hours': 15},
 {'Name': 'Abiy', 'Hours': 10},
 {'Name': 'Gege', 'Hours': 12}]

To see the total number of TA hours available, we can loop over the list of dictionaries and sum the "Hours" value.

In [18]:
total_hours = 0
for ta in staff: # Loop over the list
    # For each ta (dictionary)...
    total_hours += ta["Hours"]
total_hours
Out[18]:
57

What are some different ways to get the value of Iris's hours?

In [19]:
for ta in staff:
    if ta["Name"] == "Iris":
        print(ta["Hours"])
15
In [20]:
staff["Iris"]["Hours"]
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[20], line 1
----> 1 staff["Iris"]["Hours"]

TypeError: list indices must be integers or slices, not str
In [24]:
staff[1]["Hours"]
Out[24]:
15

Practice: Largest earthquake place¶

Suppose we have a dataset of earthquakes around the world stored in the CSV file earthquakes.csv.

id year month day latitude longitude name magnitude
nc72666881 2016 7 27 37.672 -121.619 California 1.43
us20006i0y 2016 7 27 21.515 94.572 Burma 4.9
nc72666891 2016 7 27 37.577 -118.859 California 0.06
nc72666896 2016 7 27 37.596 -118.995 California 0.4
nn00553447 2016 7 27 39.378 -119.845 Nevada 0.3

Write a function largest_earthquake_place that takes the earthquake data represented as a list of dictionaries and returns the name of the location that experienced the largest earthquake. If there are no rows in the dataset (no data at all), return None.

For example, considering only the data shown above, the result would be "Burma" because it had the earthquake with the largest magnitude (4.9).

In [26]:
def largest_earthquake_place(path):
    """
    Returns the name of the place with the largest-magnitude earthquake in the specified CSV file.

    >>> largest_earthquake_place("earthquakes.csv")
    'Northern Mariana Islands'
    """
    import pandas as pd
    earthquakes = pd.read_csv(path).to_dict("records") # Reads file as a list of dictionaries
    # print(earthquakes[0])

    max_name = None
    max_magn = None
    for earthquake in earthquakes:
        if max_magn is None or earthquake["magnitude"] > max_magn:
            max_name = earthquake["name"]
            max_magn = earthquake["magnitude"]
    return max_name


doctest.run_docstring_examples(largest_earthquake_place, globals())