File Processing¶

In this lesson, we'll introduce two ways to process files and synthesize what we've learned about debugging. By the end of this lesson, students will be able to:

  • Read text files line-by-line (line processing).
  • Read text files token-by-token (token processing).
  • Write doctests and debug programs.
In [1]:
import doctest

Opening files in Python¶

In computers, data is stored in files that can represent text documents, pictures, structured spreadsheet-like data, etc. For now, we'll focus on files that represent text data, which we indicate with the .txt file extension.

We can open and read files in Python using the built-in open function and specifying the path to the file. We will talk about file paths in a bit, but think of it as the full name of a file on a computer. The following code snippet opens the file path poem.txt and reads the text into the Python variable content.

In [2]:
with open("poem.txt") as f:
    content = f.read()
    if content.split()[0] == "she":
        # Note that indentation is how Python tells apart different blocks of code
        ...
    print(content)
she sells
sea
shells by
the sea shore

The with open(...) as f syntax negotiates access to the file with the computer's operating system by maintaining a file handle, which is assigned to the variable f. (You can use any variable name instead of f.) All the code contained in the with block has access to the file handle f. f.read() returns all the contents of the file as a single string.

Line processing¶

It's often useful to read a text file line-by-line so that you can process each line separately. We could accomplish this by calling split on the content of the file, but Python conveniently provides an f.readlines() method that returns the text of the file as a list of lines.
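The two approaches produce nearly the same result; a quick sketch using the string that f.read() would return for poem.txt:

```python
# The string that f.read() returns for poem.txt
content = "she sells\nsea\nshells by\nthe sea shore\n"

# Splitting on the newline character works, but the trailing newline
# produces an extra empty string at the end
print(content.split("\n"))    # ['she sells', 'sea', 'shells by', 'the sea shore', '']

# splitlines() handles the trailing newline for us
print(content.splitlines())   # ['she sells', 'sea', 'shells by', 'the sea shore']
```

Note that both of these drop the '\n' characters, whereas f.readlines() keeps them at the end of each line.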

The following code snippet prints out the file with a line number in front of each line. In this example, lines stores a list of the lines in the file, and the loop keeps a counter that it prints before each line.

In [7]:
# Q: What is the purpose of with?
with open("poem.txt") as f:
    lines = f.readlines()

# lines is a list of each line in the poem.txt file
line_num = 1
for line in lines:
    print(line_num, line[:-1]) # Slice off the newline character at the end
    line_num += 1
1 she sells
2 sea
3 shells by
4 the sea shore
In [12]:
first_line = lines[0]
first_line
Out[12]:
'she sells\n'
In [13]:
first_line[-1]
Out[13]:
'\n'
In [8]:
lines
Out[8]:
['she sells\n', 'sea\n', 'shells by\n', 'the sea shore\n']
In [14]:
# Do we have access to f after closing the file?
f
Out[14]:
<_io.TextIOWrapper name='poem.txt' mode='r' encoding='UTF-8'>
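The variable f still refers to the file handle, but the with block closed the underlying file on exit. A minimal, self-contained sketch (it writes its own copy of poem.txt so it can be run anywhere):

```python
# Write a small copy of poem.txt so this snippet stands on its own
with open("poem.txt", "w") as f:
    f.write("she sells\nsea\nshells by\nthe sea shore\n")

with open("poem.txt") as f:
    content = f.read()

print(f.closed)  # True: exiting the with block closed the file

try:
    f.read()  # reading from a closed handle fails
except ValueError as error:
    print(type(error).__name__)  # ValueError
```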

Token processing¶

It's also often useful to process each line of text token-by-token. A token is a generalization of the idea of a "word": a sequence of characters separated by whitespace. For example, the string 'I really <3 dogs' has 4 tokens in it: 'I', 'really', '<3', and 'dogs'.

Token processing extends the idea of line processing by splitting each line on whitespace using the split function. In this course, we will use "word" and "token" interchangeably.
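The example above can be checked directly with split, which splits on any run of whitespace:

```python
line = "I really <3 dogs"
tokens = line.split()  # split on whitespace
print(tokens)          # ['I', 'really', '<3', 'dogs']
print(len(tokens))     # 4

# Runs of spaces collapse: split() never produces empty tokens
print("the   sea    shore".split())  # ['the', 'sea', 'shore']
```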

In [17]:
with open("poem.txt") as f:
    lines = f.readlines()
    line_num = 1
    for line in lines:
        # Token processing: Splits each string (line) into separate tokens
        tokens = line.split()
        print(line_num, tokens)
        line_num += 1
1 ['she', 'sells']
2 ['sea']
3 ['shells', 'by']
4 ['the', 'sea', 'shore']
In [22]:
with open("poem.txt") as f:
    # f.readlines() -> list
    # Does a list have a split method? Python says no.
    # Instead, strings have a split method!
    result = f.readlines().split()

result
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In[22], line 2
      1 with open("poem.txt") as f:
----> 2     result = f.readlines().split()
      4 result

AttributeError: 'list' object has no attribute 'split'
In [20]:
with open("poem.txt") as f:
    result = f.read()

result
Out[20]:
'she sells\nsea\nshells by\nthe sea shore\n'
In [21]:
# Whitespace characters, including newlines, are candidates for split points
result.split()
Out[21]:
['she', 'sells', 'sea', 'shells', 'by', 'the', 'sea', 'shore']

Practice: Count odd-length tokens¶

How might we write a Python code snippet that takes the poem.txt file and prints the number of odd-length tokens per line?

In [23]:
def count_odd(path):
    """
    For the file path, prints out each line number followed by the number of odd-length tokens.

    >>> count_odd("poem.txt")
    1 2
    2 1
    3 0
    4 3
    """
    # How many odd-length tokens are on each line?
    # Need to loop over all the lines, and split each line so we can inspect each token
    with open(path) as f:
        lines = f.readlines()
        line_num = 1
        for line in lines:
            num_odd = 0
            tokens = line.split()
            # A line can contain several tokens, so inspect each one
            for token in tokens:
                if len(token) % 2 == 1: # Token is odd-length
                    num_odd += 1
            print(line_num, num_odd)
            line_num += 1


doctest.run_docstring_examples(count_odd, globals())
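As a sketch of an alternative, the manual line_num counter can also be replaced with Python's built-in enumerate, which pairs each line with its index (start=1 makes the numbering begin at 1):

```python
def count_odd_enumerate(path):
    """
    Same behavior as count_odd, but enumerate supplies the line numbers
    so we don't maintain a counter by hand.
    """
    with open(path) as f:
        for line_num, line in enumerate(f.readlines(), start=1):
            num_odd = 0
            for token in line.split():
                if len(token) % 2 == 1:  # token has an odd number of characters
                    num_odd += 1
            print(line_num, num_odd)
```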
In [24]:
len(first_line) # "she sells\n" is 10 characters if the newline is one character
Out[24]:
10
In [25]:
len(lines)
Out[25]:
4
In [26]:
len(1)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[26], line 1
----> 1 len(1)

TypeError: object of type 'int' has no len()
In [27]:
len("1")
Out[27]:
1

Practice: Debugging first tokens¶

Let's help your coworker debug a function first_tokens, which should return a list containing the first token from each line in the specified file path. Unfortunately, your coworker only provided their current code without any information about the error message, sample inputs to reproduce the problem, or a description of what they already tried.

Let's practice debugging this together and compose a list of recommendations for how to ask questions and get help.

In [37]:
def first_tokens(path):
    """
    >>> first_tokens("poem.txt")
    ['she', 'sea', 'shells', 'the']
    """
    result = []
    with open(path) as f:
        for line in f.readlines():
            # Bug fix: line.split() returns a new list, so assign it to a
            # variable before indexing into it
            tokens = line.split()
            result.append(tokens[0]) # the first token of the line, not the first character
    return result


doctest.run_docstring_examples(first_tokens, globals())
In [33]:
result = []
result += tokens[0]
result
Out[33]:
['t', 'h', 'e']
In [35]:
result = []
# result += tokens[0] behaves like an .extend call
result.extend(tokens[0])
# .extend loops over all the elements of the thing you want to add, and 
#  adds each element one at a time to the current list
result
Out[35]:
['t', 'h', 'e']
In [36]:
result = []
result.append(tokens[0])
# .append adds exactly the one item you give it to the end of the list
result
Out[36]:
['the']
In [34]:
[] + "the"
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[34], line 1
----> 1 [] + "the"

TypeError: can only concatenate list (not "str") to list
In [38]:
result = []
result.extend(tokens)
# .extend loops over all the elements of the thing you want to add, and 
#  adds each element one at a time to the current list
result
Out[38]:
['the', 'sea', 'shore']
In [40]:
tokens
Out[40]:
['the', 'sea', 'shore']
In [39]:
result = []
result += [tokens[0]]
result
Out[39]:
['the']
In [41]:
result = []
result += tokens[0:1] # A slice returns a new list!
result
Out[41]:
['the']
In [42]:
result = []
result += [ [ tokens[0] ] ]
result
Out[42]:
[['the']]
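To summarize the experiments above in one place (a sketch reusing the tokens list from the last cells):

```python
tokens = ["the", "sea", "shore"]

appended = []
appended.append(tokens[0])   # adds the one item as-is
# appended == ['the']

extended = []
extended.extend(tokens[0])   # iterates over the string, one character at a time
# extended == ['t', 'h', 'e']

concatenated = []
concatenated += [tokens[0]]  # += on a list behaves like extend; wrap the item in a list
# concatenated == ['the']
```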