CSV Data (& Dictionaries)¶
In this lesson, we'll learn more advanced dictionary features and the CSV data file format. By the end of this lesson, students will be able to:
- Loop over the keys, values, and items of a dictionary.
- Identify the list of dictionaries corresponding to some CSV data.
- Loop over a list of dictionaries (CSV rows) and access dictionary values (CSV columns).
import doctest
Dictionaries¶
A dictionary is a mutable collection of key-value pairs, where the keys are immutable and unique. Dictionaries remember insertion order (since Python 3.7), but two dictionaries with the same pairs are equal regardless of order. In this sense, dictionaries are more flexible than lists: a list could be considered a dictionary where the "keys" are the non-negative integers counting from 0 to the length minus 1.
# what, if any, is the difference between these two?
d = {0: 'a', 1: 'b', 2: 'c'}
l = ['a', 'b', 'c']
len(d)
3
len(l)
3
d[0]
'a'
l[0]
'a'
l[4] = 'aardvark'
l[0]
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
Cell In[13], line 1
----> 1 l[4] = 'aardvark'
      2 l[0]

IndexError: list assignment index out of range
d1 = {0: 'a', 1: 'b', 2: 'c'}
d2 = {1: 'b', 0: 'a', 2: 'c'}
d1 == d2
True
l[0] += 'a'
l[0]
'aa'
d1[4] = 'hello'
d1[4]
'hello'
d.keys()
dict_keys([0, 1, 2])
d[120398] = "Hello"
d["Hello"] = "World"
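The cells above add integer and string keys to the same dictionary. As a small sketch of what "immutable keys" means in practice, the following uses a hypothetical dictionary named grid (not part of the lesson) to show that a tuple works as a key while a list does not.

```python
# Keys must be hashable: strings, numbers, and tuples of hashables all work.
grid = {}
grid[(0, 1)] = "tuple key"
grid["name"] = "string key"
grid[42] = "integer key"

# A list is mutable and therefore unhashable, so it cannot be a key.
try:
    grid[[0, 1]] = "list key"
    list_key_allowed = True
except TypeError:
    list_key_allowed = False

print(len(grid), list_key_allowed)
```

The failed assignment leaves the dictionary unchanged, so only the three valid keys remain.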
Dictionaries are often helpful for counting occurrences. Whereas an earlier example counted the total number of unique words in a text file, a dictionary can help us count the number of occurrences of each unique word in that file.
def count_tokens(path):
    counts = dict()
    with open(path) as f:
        for token in f.read().split():
            counts[token] = counts.get(token, 0) + 1
            # if token not in counts:
            #     counts[token] = 1
            # else:
            #     counts[token] += 1
    return counts
%time count_tokens("moby-dick.txt")['Moby']
CPU times: user 44.9 ms, sys: 4.64 ms, total: 49.5 ms
Wall time: 48.2 ms
76
l[123]
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
Cell In[31], line 1
----> 1 l[123]

IndexError: list index out of range
d['doesnt exist'] = "exists"
As an aside, there's also a more Pythonic way to write this program using collections.Counter, which is a specialized dictionary. When displayed, a Counter also lists its entries in order from greatest to least count.
def count_tokens(path):
    from collections import Counter
    with open(path) as f:
        return Counter(f.read().split())
%time count_tokens("moby-dick.txt")
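To see the Counter behavior without needing the text file, here is a minimal sketch on a short made-up string (the sample text is an assumption for illustration only). It shows the most_common method and how a Counter responds to a missing key.

```python
from collections import Counter

# Hypothetical sample text standing in for a file's contents.
text = "eggs and eggs and yam eggs"
counts = Counter(text.split())

# most_common(n) returns the n highest-count (token, count) pairs.
top_two = counts.most_common(2)
print(top_two)

# Unlike a plain dict, a missing key yields 0 instead of raising KeyError.
missing = counts["ham"]
print(missing)
```

Because a Counter is still a dictionary, the usual indexing and looping techniques continue to work on it.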
Dictionary functions¶
Dictionaries, like lists, are mutable data structures, so they have methods to help store and retrieve elements.
- d.pop(key) removes key from d.
- d.keys() returns a collection of all the keys in d.
- d.values() returns a collection of all the values in d.
- d.items() returns a collection of all (key, value) tuples in d.
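As a quick illustration of the four methods listed above, here is a minimal sketch on a small dictionary. The variable names are chosen for this example only.

```python
d = {"a": 1, "b": 2, "c": 3}

# pop removes the key and hands back the value that was stored under it.
removed = d.pop("b")

# keys, values, and items return views; wrapping them in list() makes a snapshot.
remaining_keys = list(d.keys())
remaining_values = list(d.values())
remaining_items = list(d.items())

print(removed, remaining_keys, remaining_values, remaining_items)
```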
There are different ways to loop over a dictionary.
dictionary = {"a": 1, "b": 2, "c": 3}
for key in dictionary:
    print(key, dictionary[key])
a 1
b 2
c 3
def multi():
    return 5, 7
a, b = multi()
print(a, b)
5 7
for i in dictionary.items():
    print(i)
('a', 1)
('b', 2)
('c', 3)
dictionary = {"a": 1, "b": 2, "c": 3}
for key, value in dictionary.items():
    print(key, value)
a 1
b 2
c 3
None in Python¶
In an earlier lesson, we wrote a function to count the occurrences of each token in a file as a dict where the keys are words and the values are counts.
{"green": 2, "eggs": 6, "and": 3, "yam": 2}
Suppose we want to debug the following function most_frequent that takes this dictionary as input and returns the word with the highest count. If the input were a list, we could index the zero-th element from the list and loop over the remaining values by slicing the list. But it's harder to do this with a dictionary.
Python has a special None keyword, like null in Java, that represents a placeholder value.
def max_item(nums):
    '''
    >>> max_item([12, 23, 1287675476834])
    1287675476834
    '''
    max_n = nums[0]
    for n in nums:
        if n > max_n:
            max_n = n
    return max_n
doctest.run_docstring_examples(max_item, globals())
words = {"green": 2, "eggs": 6, "and": 3, "yam": 2}
words[None]
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In[57], line 2
      1 words = {"green": 2, "eggs": 6, "and": 3, "yam": 2}
----> 2 words[None]

KeyError: None
words['yam']
2
def most_frequent(counts):
    """
    Returns the token in the given dictionary with the highest count, or None if empty.
    >>> most_frequent({"green": 2, "eggs": 6, "and": 3, "yam": 2})
    'eggs'
    >>> most_frequent({}) # None is not displayed as output
    """
    max_word = None
    for word in counts:
        if counts[word] > counts.get(max_word, 0):
            max_word = word
    return max_word
doctest.run_docstring_examples(most_frequent, globals())
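The function above relies on counts.get(max_word, 0), which treats the None placeholder as a count of 0. That works for word counts, which are always positive. A possible alternative, sketched here under the assumption that values might be zero or negative, tests for None explicitly. The name most_frequent_explicit is hypothetical and not part of the lesson.

```python
def most_frequent_explicit(counts):
    """
    Returns the key with the highest value, or None if counts is empty.
    (Hypothetical variant of most_frequent for illustration.)
    """
    max_word = None
    for word in counts:
        # "is None" tests for the placeholder itself rather than a missing key.
        if max_word is None or counts[word] > counts[max_word]:
            max_word = word
    return max_word


print(most_frequent_explicit({"green": 2, "eggs": 6, "and": 3, "yam": 2}))
print(most_frequent_explicit({"a": 0, "b": -1}))
print(most_frequent_explicit({}))
```

Comparing with "is None" rather than "== None" is the conventional way to check for the placeholder in Python.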
Loop unpacking¶
When we need keys and values, we can loop over and unpack each key-value pair by looping over dictionary.items().
dictionary = {"a": 1, "b": 2, "c": 3}
for key, value in dictionary.items():
    print(key, value)
Loop unpacking is not only useful for dictionaries, but also for looping over the results of built-in functions such as enumerate and zip. enumerate is a built-in function that takes a sequence and returns another sequence of pairs representing the element index and the element value.
with open("poem.txt") as f:
    for i, line in enumerate(f.readlines()):
        # Strip only the trailing newline so a final line without one stays intact.
        print(i, line.rstrip("\n"))
0 she sells
1 sea
2 shells by
3 the sea shore
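enumerate works on any sequence, not only on lines read from a file. The sketch below assumes the poem's lines are already in a list and uses the optional start parameter so that numbering begins at 1.

```python
lines = ["she sells", "sea", "shells by", "the sea shore"]

# start=1 shifts the counter so numbering begins at 1 instead of 0.
numbered = [f"{number}: {line}" for number, line in enumerate(lines, start=1)]
print(numbered)
```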
zip is another built-in function that takes one or more sequences and returns a sequence of tuples consisting of the first element from each given sequence, the second element from each given sequence, etc. If the sequences are not all the same length, zip stops after yielding all elements from the shortest sequence.
arabic_nums = [1, 2, 3, 4, 5]
alpha_nums = ["a", "b", "c", "d", "e"]
roman_nums = ["i", "ii"]
for arabic, alpha, roman in zip(arabic_nums, alpha_nums, roman_nums):
    print(arabic, alpha, roman)
1 a i
2 b ii
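One common use of zip, sketched here as an illustration rather than as part of the lesson, is building a dictionary from two parallel lists. The name alpha_to_arabic is hypothetical.

```python
arabic_nums = [1, 2, 3, 4, 5]
alpha_nums = ["a", "b", "c", "d", "e"]

# zip pairs each letter with the number at the same position;
# dict() turns those (letter, number) pairs into key-value entries.
alpha_to_arabic = dict(zip(alpha_nums, arabic_nums))
print(alpha_to_arabic)
```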
Comma-separated values¶
In data science, we often work with tabular data such as the following table representing the names and hours of some of the TAs.
Name | Hours |
---|---|
Anna | 20 |
Iris | 15 |
Abiy | 10 |
Gege | 12 |
A table has two main components to it:
- Rows corresponding to each entry, such as each individual TA.
- Columns corresponding to (required or optional) fields for each entry, such as TA name and TA hours.
A comma-separated values (CSV) file is a particular way of representing a table using only plain text. Here is the corresponding CSV file for the above table. Each row is separated with a newline. Each column is separated with a single comma (,).
Name,Hours
Anna,20
Iris,15
Abiy,10
Gege,12
We'll learn a couple of ways of processing CSV data in this course, the first of which is representing the data as a list of dictionaries.
import csv
with open('staff.csv') as f:
    reader = csv.DictReader(f)
    for r in reader:
        print(r)
{'Name': 'Anna', 'Hours': '20'}
{'Name': 'Iris', 'Hours': '15'}
{'Name': 'Abiy', 'Hours': '10'}
{'Name': 'Gege', 'Hours': '12'}
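Note that csv.DictReader reads every field as a string, so 'Hours' comes back as '20' rather than 20. The following sketch, which assumes the CSV text is held in a string instead of a staff.csv file on disk, converts the hours to integers while building the list of dictionaries. The name staff_rows is chosen for this example only.

```python
import csv
import io

# The CSV text from this lesson, held in a string instead of a file on disk.
csv_text = "Name,Hours\nAnna,20\nIris,15\nAbiy,10\nGege,12\n"

staff_rows = []
for row in csv.DictReader(io.StringIO(csv_text)):
    # DictReader yields every field as a string, so convert Hours to an int.
    staff_rows.append({"Name": row["Name"], "Hours": int(row["Hours"])})

print(staff_rows[1])
```

Converting early means later arithmetic on the hours works without further casts.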
staff = [
    {"Name": "Anna", "Hours": 20},
    {"Name": "Iris", "Hours": 15},
    {"Name": "Abiy", "Hours": 10},
    {"Name": "Gege", "Hours": 12},
]
type(staff)
list
type(staff[0])
dict
staff[1]['Hours']
15
To see the total number of TA hours available, we can loop over the list of dictionaries and sum the "Hours" value.
total_hours = 0
for ta in staff:
    total_hours += ta["Hours"]
total_hours
57
What are some different ways to get the value of Iris's hours?
for ta in staff:
    if ta["Name"] == "Iris":
        print(ta["Hours"])
15
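Beyond the loop, here are a few other possible approaches, offered as a sketch rather than a complete answer to the question. The variable names are hypothetical, and the staff list is repeated so the example runs on its own.

```python
staff = [
    {"Name": "Anna", "Hours": 20},
    {"Name": "Iris", "Hours": 15},
    {"Name": "Abiy", "Hours": 10},
    {"Name": "Gege", "Hours": 12},
]

# Option 1: next() returns the first match, or the default (None) if no row matches.
iris_hours = next((ta["Hours"] for ta in staff if ta["Name"] == "Iris"), None)

# Option 2: build a name-to-hours dictionary once, then look up by key.
hours_by_name = {ta["Name"]: ta["Hours"] for ta in staff}

# Option 3: index the row directly, which only works if we know its position.
by_position = staff[1]["Hours"]

print(iris_hours, hours_by_name["Iris"], by_position)
```

The dictionary lookup is convenient when many names will be queried, though it assumes names are unique.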
Practice: Largest earthquake place¶
Suppose we have a dataset of earthquakes around the world stored in the CSV file earthquakes.csv.
id | year | month | day | latitude | longitude | name | magnitude |
---|---|---|---|---|---|---|---|
nc72666881 | 2016 | 7 | 27 | 37.672 | -121.619 | California | 1.43 |
us20006i0y | 2016 | 7 | 27 | 21.515 | 94.572 | Burma | 4.9 |
nc72666891 | 2016 | 7 | 27 | 37.577 | -118.859 | California | 0.06 |
nc72666896 | 2016 | 7 | 27 | 37.596 | -118.995 | California | 0.4 |
nn00553447 | 2016 | 7 | 27 | 39.378 | -119.845 | Nevada | 0.3 |
Write a function largest_earthquake_place that takes the path to the earthquake CSV file, whose rows are loaded as a list of dictionaries, and returns the name of the location that experienced the largest earthquake. If there are no rows in the dataset (no data at all), return None.
For example, considering only the data shown above, the result would be "Burma" because it had the earthquake with the largest magnitude (4.9).
def largest_earthquake_place(path):
    """
    Returns the name of the place with the largest-magnitude earthquake in the specified CSV file.
    >>> largest_earthquake_place("earthquakes.csv")
    'Northern Mariana Islands'
    """
    import pandas as pd
    earthquakes = pd.read_csv(path).to_dict("records")
    ...
doctest.run_docstring_examples(largest_earthquake_place, globals())