CSV Data¶
In this lesson, we'll learn more advanced dictionary features and the CSV data file format. By the end of this lesson, students will be able to:
- Loop over the keys, values, and items of a dictionary.
- Identify the list of dictionaries corresponding to some CSV data.
- Loop over a list of dictionaries (CSV rows) and access dictionary values (CSV columns).
import doctest
Dictionary functions¶
Dictionaries, like lists, are also mutable data structures so they have functions to help store and retrieve elements.
- d.pop(key) removes key from d.
- d.keys() returns a collection of all the keys in d.
- d.values() returns a collection of all the values in d.
- d.items() returns a collection of all (key, value) tuples in d.
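As a minimal sketch of the four functions listed above (the dictionary contents here are made up for illustration), note that pop also hands back the removed value, and that wrapping the other three results in list() makes them easy to inspect:

```python
d = {"a": 1, "b": 2, "c": 3}

# pop removes the key and also returns the value that was stored under it.
removed = d.pop("b")
print(removed)  # 2
print(d)  # {'a': 1, 'c': 3}

# keys, values, and items return view objects; list() copies them into lists.
print(list(d.keys()))  # ['a', 'c']
print(list(d.values()))  # [1, 3]
print(list(d.items()))  # [('a', 1), ('c', 3)]
```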
There are different ways to loop over a dictionary.
dictionary = {"a": 1, "b": 2, "c": 3}
# By default, dictionaries loop in the order in which keys were first added to the dictionary
dictionary["d"] = 4
for key in dictionary:
    print(key, dictionary[key])
a 1
b 2
c 3
d 4
dictionary[-1]
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In[8], line 1
----> 1 dictionary[-1]

KeyError: -1
dictionary = {"a": 1, "b": 2, "c": 3}
# By default, dictionaries loop over their keys, so it's not necessary to write .keys()
for key in dictionary.keys():
    print(key, dictionary[key])
a 1
b 2
c 3
dictionary = {"a": 1, "b": 2, "c": 3}
# .values() loops over only the values, without their keys
for value in dictionary.values():
    print(value)
# Can I get the key associated with a value easily in a dictionary?
# Ans: Not so easy since dictionaries are only indexable by their keys
# If you really wanted to do this, you would have to loop over the dictionary again (nested)
# and check for the key that matches the given value.
# In general, we tend to use .values() less frequently in the real world
# and most commonly use it for one-line solutions like len(set(dictionary.values()))
1
2
3
dictionary = {"a": 1, "b": 2, "c": 3}
# .items() loops over (key, value) tuples, which we can unpack into two variables
for item in dictionary.items():
    key, value = item
    print(key, value)
a 1
b 2
c 3
None in Python¶
In an earlier lesson, we wrote a function to count the occurrences of each token in a file as a dict where the keys are words and the values are counts.
{"green": 2, "eggs": 6, "and": 3, "yam": 2}
Suppose we want to debug the following function most_frequent that takes this dictionary as input and returns the word with the highest count. If the input were a list, we could index the zeroth element from the list and loop over the remaining values by slicing the list. But it's harder to set up an initial value for a program that involves looping over the elements in a dictionary.
Python has a special None keyword, like null in Java, that represents a placeholder value.
def most_frequent(counts):
    """
    Returns the token in the given dictionary with the highest count, or None if empty.
    >>> most_frequent({"green": 2, "eggs": 6, "and": 3, "yam": 2})
    'eggs'
    >>> most_frequent({}) # None is not displayed as output
    """
    max_word = None
    for word in counts:
        # What if I don't want to combine these two cases?
        # I think there is a legitimate argument to be made for not combining.
        # I think experts would probably prefer to combine them.
        # How can we change the logic for this program to remove the redundancy?
        # Python evaluates "or" left to right and short-circuits: it only checks
        # the second condition if the first condition is false, so we never
        # look up counts[None].
        if max_word is None or counts[word] > counts[max_word]:
            max_word = word
    return max_word
doctest.run_docstring_examples(most_frequent, globals())
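For comparison, here is a sketch of what the uncombined version mentioned in the comments might look like. The name most_frequent_uncombined is hypothetical and not part of the lesson; both branches assign max_word = word, which is the redundancy that the single "or" condition removes.

```python
def most_frequent_uncombined(counts):
    """Same result as most_frequent, with the None check kept separate."""
    max_word = None
    for word in counts:
        if max_word is None:
            # First word seen: there is nothing to compare against yet.
            max_word = word
        elif counts[word] > counts[max_word]:
            # A later word with a strictly higher count replaces the current best.
            max_word = word
    return max_word


print(most_frequent_uncombined({"green": 2, "eggs": 6, "and": 3, "yam": 2}))  # eggs
print(most_frequent_uncombined({}))  # None
```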
True or False
True
True and False
False
False and False # Python will not check the second False
False
# Truth-y and false-y values
1 and 2
2
0 and 1 / 0 # Python will indeed not run the second part if the first part already concludes the logic!
0
# Is a negative number truth-y or false-y?
# So, if -1 is truth-y we will get a ZeroDivisionError
-1 and 1 / 0
---------------------------------------------------------------------------
ZeroDivisionError                         Traceback (most recent call last)
Cell In[24], line 3
      1 # Is a negative number truth-y or false-y?
      2 # So, if -1 is truth-y we will get a ZeroDivisionError
----> 3 -1 and 1 / 0

ZeroDivisionError: division by zero
# So we've been talking about "and", but what about "&" symbol?
0 & 1 / 0
# "&" behaves differently: it is the bitwise "and" operator (element-wise on
# types such as NumPy arrays), and it always evaluates both sides, so 1 / 0 still runs.
---------------------------------------------------------------------------
ZeroDivisionError                         Traceback (most recent call last)
Cell In[25], line 2
      1 # So we've been talking about "and", but what about "&" symbol?
----> 2 0 & 1 / 0

ZeroDivisionError: division by zero
[] or [1, 2, 3]
[1, 2, 3]
[] and [1, 2, 3] # empty list in Python is false-y, all other lists are truth-y
[]
1 / 0
---------------------------------------------------------------------------
ZeroDivisionError                         Traceback (most recent call last)
Cell In[21], line 1
----> 1 1 / 0

ZeroDivisionError: division by zero
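The experiments above can be summarized in a small sketch that records which operands actually get evaluated. The helper check and the list calls are hypothetical names introduced only for this illustration.

```python
calls = []


def check(label, result):
    """Record that this operand was evaluated, then return result unchanged."""
    calls.append(label)
    return result


# "and" stops at the first false-y operand, so "right" is never evaluated.
calls.clear()
value = check("left", 0) and check("right", 1)
print(value, calls)  # 0 ['left']

# "or" stops at the first truth-y operand.
calls.clear()
value = check("left", [1, 2, 3]) or check("right", [])
print(value, calls)  # [1, 2, 3] ['left']

# "&" is an ordinary operator: both operands are always evaluated.
calls.clear()
value = check("left", 0) & check("right", 1)
print(value, calls)  # 0 ['left', 'right']
```

This is why 0 and 1 / 0 returns 0 while 0 & 1 / 0 raises ZeroDivisionError.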
Loop unpacking¶
When we need both keys and values, we can unpack each key-value pair directly in the loop header by looping over dictionary.items().
dictionary = {"a": 1, "b": 2, "c": 3}
for key, value in dictionary.items():
    print(key, value)
a 1
b 2
c 3
Loop unpacking is not only useful for dictionaries, but also for looping over the sequences produced by built-in functions such as enumerate and zip. enumerate is a built-in function that takes a sequence and returns another sequence of pairs representing the element index and the element value.
with open("poem.txt") as f:
    # [(0, "..."), (1, "..."), (2, "...")]
    for i, line in enumerate(f.readlines()):
        print(i, line[:-1])
0 she sells
1 sea
2 shells by
3 the sea shore
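To try enumerate without a poem.txt file on disk, here is a small sketch that assumes the same four lines are already in a list. It also shows the optional start argument, which is not used in the lesson but is handy for 1-based line numbers.

```python
lines = ["she sells", "sea", "shells by", "the sea shore"]

# enumerate yields (index, element) pairs; list() makes them visible.
pairs = list(enumerate(lines))
print(pairs[:2])  # [(0, 'she sells'), (1, 'sea')]

# The optional start argument changes the first index.
numbered = [f"{number}: {line}" for number, line in enumerate(lines, start=1)]
print(numbered[0])  # 1: she sells
```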
zip is another built-in function that takes one or more sequences and returns a sequence of tuples consisting of the first element from each given sequence, the second element from each given sequence, etc. If the sequences are not all the same length, zip stops after yielding all elements from the shortest sequence.
arabic_nums = [ 1, 2, 3, 4, 5]
alpha_nums = ["a", "b", "c", "d", "e"]
roman_nums = ["i", "ii", "iii", "iv"]
# If the sequences are different in length, we go to the shortest one.
# There's another function in itertools called zip_longest.
for arabic, alpha, roman in zip(arabic_nums, alpha_nums, roman_nums):
    print(arabic, alpha, roman)
1 a i
2 b ii
3 c iii
4 d iv
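The zip_longest function mentioned in the comment above lives in the standard library's itertools module. A brief sketch of the difference, where the fill value "?" is an arbitrary choice for illustration:

```python
from itertools import zip_longest

arabic_nums = [1, 2, 3, 4, 5]
roman_nums = ["i", "ii", "iii", "iv"]

# zip stops at the shortest input, so the 5 is dropped.
short = list(zip(arabic_nums, roman_nums))
print(len(short))  # 4

# zip_longest keeps going and pads the shorter input with fillvalue.
padded = list(zip_longest(arabic_nums, roman_nums, fillvalue="?"))
print(padded[-1])  # (5, '?')
```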
Comma-separated values¶
In data science, we often work with tabular data such as the following table representing the names and hours of some of the TAs.
Name | Hours |
---|---|
Anna | 20 |
Iris | 15 |
Abiy | 10 |
Gege | 12 |
A table has two main components to it:
- Rows corresponding to each entry, such as each individual TA.
- Columns corresponding to (required or optional) fields for each entry, such as TA name and TA hours.
A comma-separated values (CSV) file is a particular way of representing a table using only plain text. Here is the corresponding CSV file for the above table. Each row is separated with a newline. Each column is separated with a single comma (,).
Name,Hours
Anna,20
Iris,15
Abiy,10
Gege,12
We'll learn a couple of ways of processing CSV data in this course, the first of which is representing the data as a list of dictionaries.
staff = [
{"Name": "Anna", "Hours": 20},
{"Name": "Iris", "Hours": 15},
{"Name": "Abiy", "Hours": 10},
{"Name": "Gege", "Hours": 12},
]
staff
[{'Name': 'Anna', 'Hours': 20}, {'Name': 'Iris', 'Hours': 15}, {'Name': 'Abiy', 'Hours': 10}, {'Name': 'Gege', 'Hours': 12}]
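One possible way to build this list of dictionaries from the CSV text itself is the standard library's csv.DictReader. This is only a sketch, not necessarily the approach used later in the course; the CSV text is held in a string (csv_text, a name made up here) so that no file is needed.

```python
import csv
import io

# The CSV text from above, held in a string so the sketch needs no file on disk.
csv_text = "Name,Hours\nAnna,20\nIris,15\nAbiy,10\nGege,12\n"

rows = []
for row in csv.DictReader(io.StringIO(csv_text)):
    # DictReader yields every value as a string, so convert Hours explicitly.
    rows.append({"Name": row["Name"], "Hours": int(row["Hours"])})

print(rows[1])  # {'Name': 'Iris', 'Hours': 15}
```

The header line supplies the dictionary keys, and each later line becomes one dictionary.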
To see the total number of TA hours available, we can loop over the list of dictionaries and sum the "Hours" value.
total_hours = 0
for ta_dictionary in staff:
    total_hours += ta_dictionary["Hours"]
total_hours
57
What are some different ways to get the value of Iris's hours?
for ta in staff:
    if ta["Name"] == "Iris":
        print(ta["Hours"])
15
staff
[{'Name': 'Anna', 'Hours': 20}, {'Name': 'Iris', 'Hours': 15}, {'Name': 'Abiy', 'Hours': 10}, {'Name': 'Gege', 'Hours': 12}]
staff["Iris"]["Hours"]
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[35], line 1
----> 1 staff["Iris"]["Hours"]

TypeError: list indices must be integers or slices, not str
staff[1]
{'Name': 'Iris', 'Hours': 15}
staff[1]["Hours"]
15
staff["Hours"]["Iris"]
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[43], line 1
----> 1 staff["Hours"]["Iris"]

TypeError: list indices must be integers or slices, not str
staff["Hours"][1]
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[44], line 1
----> 1 staff["Hours"][1]

TypeError: list indices must be integers or slices, not str
"Iris" in staff # Checking for the string "Iris" in the list
# False because the list of dictionaries does not contain (at the top level) any
# strings!
False
"Iris" in staff["Name"]
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[46], line 1
----> 1 "Iris" in staff["Name"]

TypeError: list indices must be integers or slices, not str
staff["Name"]["Iris"]
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[47], line 1
----> 1 staff["Name"]["Iris"]

TypeError: list indices must be integers or slices, not str
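The TypeErrors above all come from indexing the outer list with a string: staff is a list, so only its elements (the row dictionaries) accept string keys. As a sketch of two alternatives, the code below checks membership by looking inside each row, and builds a dictionary keyed by name so that string lookups work. The name hours_by_name is hypothetical and not part of the lesson.

```python
staff = [
    {"Name": "Anna", "Hours": 20},
    {"Name": "Iris", "Hours": 15},
    {"Name": "Abiy", "Hours": 10},
    {"Name": "Gege", "Hours": 12},
]

# Membership by name needs a look inside each row dictionary.
has_iris = any(ta["Name"] == "Iris" for ta in staff)
print(has_iris)  # True

# hours_by_name is a hypothetical helper structure mapping each name to hours.
hours_by_name = {}
for ta in staff:
    hours_by_name[ta["Name"]] = ta["Hours"]
print(hours_by_name["Iris"])  # 15
```

The dictionary keyed by name assumes names are unique; with duplicate names, later rows would overwrite earlier ones.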
Practice: Largest earthquake place¶
Suppose we have a dataset of earthquakes around the world stored in the CSV file earthquakes.csv.
id | year | month | day | latitude | longitude | name | magnitude |
---|---|---|---|---|---|---|---|
nc72666881 | 2016 | 7 | 27 | 37.672 | -121.619 | California | 1.43 |
us20006i0y | 2016 | 7 | 27 | 21.515 | 94.572 | Burma | 4.9 |
nc72666891 | 2016 | 7 | 27 | 37.577 | -118.859 | California | 0.06 |
nc72666896 | 2016 | 7 | 27 | 37.596 | -118.995 | California | 0.4 |
nn00553447 | 2016 | 7 | 27 | 39.378 | -119.845 | Nevada | 0.3 |
Write a function largest_earthquake_place that takes the earthquake data represented as a list of dictionaries and returns the name of the location that experienced the largest earthquake. If there are no rows in the dataset (no data at all), return None.
For example, considering only the data shown above, the result would be "Burma" because it had the earthquake with the largest magnitude (4.9).
def largest_earthquake_place(path):
    """
    Returns the name of the place with the largest-magnitude earthquake in the specified CSV file.
    >>> largest_earthquake_place("earthquakes.csv")
    'Northern Mariana Islands'
    """
    import pandas as pd
    earthquakes = pd.read_csv(path).to_dict("records")
    max_name = None
    max_magn = 0.0
    for earthquake in earthquakes:
        # No KeyError, no TypeError, no issue with None
        # Because we are not comparing or computing with None: it's just a
        # placeholder that gets replaced the first time we enter this if.
        # Note: starting max_magn at 0.0 assumes some magnitude is above 0.
        if earthquake["magnitude"] > max_magn:
            max_magn = earthquake["magnitude"]
            max_name = earthquake["name"]
    return max_name
doctest.run_docstring_examples(largest_earthquake_place, globals())
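As a sketch of an alternative, the version below takes the rows directly as a list of dictionaries (as the problem statement describes) and reuses the "is None or ..." pattern from most_frequent, so it does not rely on a starting magnitude of 0.0. The function name largest_earthquake_place_from_rows and the sample list are hypothetical; the sample holds only the name and magnitude columns of the five rows shown in the table.

```python
def largest_earthquake_place_from_rows(earthquakes):
    """Hypothetical variant: takes the rows as a list of dictionaries."""
    max_row = None
    for earthquake in earthquakes:
        # Short-circuit "or": the comparison only runs once max_row is a real row.
        if max_row is None or earthquake["magnitude"] > max_row["magnitude"]:
            max_row = earthquake
    if max_row is None:
        return None
    return max_row["name"]


sample = [
    {"name": "California", "magnitude": 1.43},
    {"name": "Burma", "magnitude": 4.9},
    {"name": "California", "magnitude": 0.06},
    {"name": "California", "magnitude": 0.4},
    {"name": "Nevada", "magnitude": 0.3},
]
print(largest_earthquake_place_from_rows(sample))  # Burma
print(largest_earthquake_place_from_rows([]))  # None
```

Because the first row always replaces None, this variant would also return a place when every magnitude is zero or negative.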