File Processing
In this lesson, we'll introduce two ways to process files and synthesize what we've learned about debugging. By the end of this lesson, students will be able to:
- Read text files line-by-line (line processing).
- Read text files token-by-token (token processing).
- Write doctests and debug programs using the debugger.
import doctest
Opening files in Python
In computers, data is stored in files that can represent text documents, pictures, structured spreadsheet-like data, etc. For now, we'll focus on files that represent text data, which we indicate with the .txt file extension.
We can open and read files in Python using the built-in open function and specifying the path to the file. We will talk about file paths in a bit, but think of a path as the full name of a file on a computer. The following code snippet opens the file path poem.txt and reads the text into the Python variable content.
with open("poem.txt") as f:
    content = f.read()
    print(content)
she sells
sea
shells by
the sea shore
The with open(...) as f syntax negotiates access to the file with the computer's operating system by maintaining a file handle, which is assigned to the variable f. (You can use any variable name instead of f.) All the code contained in the with block has access to the file handle f. f.read() returns all the contents of the file as a string.
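As a minimal sketch of how the file handle behaves, the snippet below checks whether the handle is open inside and after the with block. It writes its own throwaway copy of the poem to a temporary directory so that it runs anywhere; the poem text is an assumption inferred from this lesson, and the names poem, inside, and outside are illustrative only.

```python
import os
import tempfile

# Assumption: a throwaway copy of the poem, so the sketch does not depend on poem.txt existing.
poem = "she sells\nsea\nshells by\nthe sea shore\n"
path = os.path.join(tempfile.mkdtemp(), "poem.txt")
with open(path, "w") as f:
    f.write(poem)

with open(path) as f:
    content = f.read()  # the whole file as one string
    inside = f.closed   # False: the handle is still open inside the block

outside = f.closed      # True: leaving the with block closed the file
print(type(content).__name__, inside, outside)
```

Closing the file automatically is the main reason to prefer the with block over calling open on its own.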
Line processing
It's often useful to read a text file line-by-line so that you can process each line separately. We could accomplish this by calling the split function on the content of the file, but Python conveniently provides an f.readlines() method that returns the file's text as a list of lines.
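To make the difference between the two approaches concrete, here is a small sketch comparing content.split("\n") with f.readlines(). It creates its own temporary copy of the poem (the contents are an assumption inferred from this lesson), and the variable split_lines is a name introduced only for illustration.

```python
import os
import tempfile

# Assumption: a throwaway copy of the poem, so the sketch does not depend on poem.txt existing.
path = os.path.join(tempfile.mkdtemp(), "poem.txt")
with open(path, "w") as f:
    f.write("she sells\nsea\nshells by\nthe sea shore\n")

with open(path) as f:
    content = f.read()
with open(path) as f:
    lines = f.readlines()

split_lines = content.split("\n")
print(lines)        # every element keeps its trailing "\n"
print(split_lines)  # no "\n", but an extra "" after the final newline
```

Either list works for line processing, as long as the code that follows accounts for the trailing newline characters or the trailing empty string.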
The following code snippet prints out the file with a line number in front of each line. In this example, lines stores a list of the lines in the file. The loop over that list keeps track of a counter and prints it before the line itself.
with open("poem.txt") as f:
    lines = f.readlines()
    # lines = ["she sells\n", "sea\n", "shells by\n", "the sea shore\n"]
    line_num = 1
    for line in lines:
        # Each line already ends with "\n", so end="" avoids printing a second newline.
        # Alternative: print(line_num, line[:-1]) slices off the newline character instead.
        print(line_num, line, end="")
        line_num += 1
1 she sells
2 sea
3 shells by
4 the sea shore
print("Hello, my name is Kevin", end="")
print("My favorite course is CSE 163")
Hello, my name is KevinMy favorite course is CSE 163
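As an alternative to managing a counter and the trailing newline by hand, the line-numbering loop can be sketched with the built-in enumerate function and the string method rstrip. This is one possible variation rather than the lesson's own approach; the temporary file and the list name numbered are assumptions made so the example is self-contained.

```python
import os
import tempfile

# Assumption: a throwaway copy of the poem, so the sketch does not depend on poem.txt existing.
path = os.path.join(tempfile.mkdtemp(), "poem.txt")
with open(path, "w") as f:
    f.write("she sells\nsea\nshells by\nthe sea shore\n")

numbered = []
with open(path) as f:
    # enumerate yields (counter, line) pairs; start=1 makes the counter begin at 1.
    for line_num, line in enumerate(f, start=1):
        # rstrip("\n") removes the trailing newline if one is present.
        numbered.append(f"{line_num} {line.rstrip(chr(10))}")

print("\n".join(numbered))
```

Unlike slicing with line[:-1], rstrip leaves the last line intact when the file does not end with a newline.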
Token processing
It's also often useful to process each line of text token-by-token. A token is a generalization of the idea of a "word" that allows for any sequence of characters separated by whitespace. For example, the string 'I really <3 dogs' has 4 tokens in it.
Token processing extends the idea of line processing by splitting each line on whitespace using the split function. In this course, we will use "word" and "token" interchangeably.
"I really like dogs".split()
['I', 'really', 'like', 'dogs']
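Calling split with no arguments is what makes it split on whitespace in general. The sketch below contrasts it with split(" ") on a made-up line containing repeated spaces, a tab, and a trailing newline; the sample strings and variable names are illustrative assumptions.

```python
# Hypothetical line with three spaces, a tab, and a trailing newline.
line = "shells   by\tthe sea shore\n"

whitespace_tokens = line.split()   # splits on any run of whitespace
space_tokens = line.split(" ")     # splits on every single space character only
blank_tokens = "\n".split()        # a blank line has no tokens at all

print(whitespace_tokens)  # ['shells', 'by', 'the', 'sea', 'shore']
print(space_tokens)       # ['shells', '', '', 'by\tthe', 'sea', 'shore\n']
print(blank_tokens)       # []
```

The empty list for a blank line is worth remembering: indexing its first element would raise an IndexError.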
with open("poem.txt") as f:
    lines = f.readlines()
    # lines = ["she sells\n", "sea\n", "shells by\n", "the sea shore\n"]
    line_num = 1
    for line in lines:
        tokens = line.split()
        # line1: ["she", "sells"]
        print(line_num, tokens)
        for token in tokens:
            print(token, "has", len(token), "characters")
        line_num += 1
1 ['she', 'sells']
she has 3 characters
sells has 5 characters
2 ['sea']
sea has 3 characters
3 ['shells', 'by']
shells has 6 characters
by has 2 characters
4 ['the', 'sea', 'shore']
the has 3 characters
sea has 3 characters
shore has 5 characters
Practice: Count odd-length tokens
How might we write a Python code snippet that takes the poem.txt file and prints the number of odd-length tokens per line?
def count_odd(path):
    """
    For the file path, prints out each line number followed by the number of odd-length tokens.
    >>> count_odd("poem.txt")
    1 2
    2 1
    3 0
    4 3
    """
    with open(path) as f:
        lines = f.readlines()
        # lines = ["she sells\n", "sea\n", "shells by\n", "the sea shore\n"]
        line_num = 1
        for line in lines:
            tokens = line.split()
            # line1: ["she", "sells"]
            # Note: must assign num_odd = 0 inside the `for line in lines` loop!
            num_odd = 0
            for token in tokens:
                # line1 token1: "she"
                if len(token) % 2 == 1: # if odd length...
                    num_odd += 1
            # Print once per line, after the token loop, rather than once per token
            print(line_num, num_odd)
            line_num += 1
doctest.run_docstring_examples(count_odd, globals())
Debugging tips
- Compare "expected" to "got" carefully with the goal of finding patterns or explaining why "got" looks so different from "expected".
- Our goal is to come up with an explanation of the bug... which sometimes requires making things worse. Try exploratory debugging: make edits that don't necessarily address the bug but give you more information to help confirm or deny your explanation.
- Create new code cells to simulate the behavior of specific lines of code. The debugger can also help here.
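As an illustration of the last tip, a scratch cell might replay a single loop iteration by hand, with a hard-coded line standing in for the file. The sketch below does this for the odd-length check from the previous practice problem; the sample line and the names lengths and is_odd are assumptions chosen for the example.

```python
# Hypothetical scratch cell: replay one iteration of the loop without opening the file.
line = "the sea shore\n"  # assumed to be the last line of poem.txt
tokens = line.split()
lengths = [len(token) for token in tokens]
is_odd = [length % 2 == 1 for length in lengths]

# Printing the intermediate values makes it easy to compare against what we expected.
print(tokens, lengths, is_odd)
```

If the values printed here match expectations, the bug is probably elsewhere, such as in where a counter is reset or where print is called.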
Practice: Debugging first tokens
Let's help your coworker debug a function first_tokens, which should return a list containing the first token from each line in the specified file path. They sent you this message via team chat.
Hey, do you have a minute to help me fix this function? There's an error when I run it.
Unfortunately, your teammate only provided the code but did not provide any information about the error message, sample inputs to reproduce the problem, or a description of what they already tried.
Let's practice debugging this together and compose a helpful chat response to them.
def first_tokens(path):
    """
    Returns the first token in each line in the specified text file as a list of strings.
    >>> first_tokens("poem.txt")
    ['she', 'sea', 'shells', 'the']
    """
    result = []
    with open(path) as f:
        for line in f.readlines():
            # We need to assign the result of splitting the line into tokens
            tokens = line.split()
            result += tokens[0]  # Bug: += adds each character of the string as a separate element
            # result.append(tokens[0])  # Fix: append adds the whole token as one element
    return result
doctest.run_docstring_examples(first_tokens, globals())
**********************************************************************
File "__main__", line 5, in NoName
Failed example:
    first_tokens("poem.txt")
Expected:
    ['she', 'sea', 'shells', 'the']
Got:
    ['s', 'h', 'e', 's', 'e', 'a', 's', 'h', 'e', 'l', 'l', 's', 't', 'h', 'e']
result = []
# result += "she"
result.extend("she")
result
['s', 'h', 'e']
result = []
# result.append("she")
result += ["she"]
result
['she']
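Putting these experiments together, one possible corrected version to send back to the teammate is sketched below. The name first_tokens_fixed and the temporary copy of the poem are assumptions made so the example is self-contained. The check for blank lines is an extra precaution that the original exercise does not ask for.

```python
import os
import tempfile


def first_tokens_fixed(path):
    """Returns the first token of each non-blank line in the file as a list of strings."""
    result = []
    with open(path) as f:
        for line in f.readlines():
            tokens = line.split()
            if tokens:  # assumption: skip blank lines, where tokens == []
                result.append(tokens[0])  # append adds the whole string as one element
    return result


# Assumption: a throwaway copy of the poem, so the sketch does not depend on poem.txt existing.
path = os.path.join(tempfile.mkdtemp(), "poem.txt")
with open(path, "w") as f:
    f.write("she sells\nsea\nshells by\nthe sea shore\n")

print(first_tokens_fixed(path))  # ['she', 'sea', 'shells', 'the']
```

A helpful chat response would include the failing doctest output, the one-line explanation that += on a list extends it character by character when given a string, and the suggested change to append.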