File Processing¶
In this lesson, we'll introduce two ways to process files and synthesize what we've learned about debugging. By the end of this lesson, students will be able to:
- Read text files line-by-line (line processing).
- Read text files token-by-token (token processing).
- Write doctests and debug programs.
import doctest
Opening files in Python¶
In computers, data is stored in files that can represent text documents, pictures, structured spreadsheet-like data, etc. For now, we'll focus on files that represent text data, which we indicate with the .txt file extension.
We can open and read files in Python using the built-in open function and specifying the path to the file. We will talk about file paths in a bit; for now, think of a path as the full name of a file on a computer. The following code snippet opens the file poem.txt and reads the text into the Python variable content.
with open("poem.txt") as f:
content = f.read()
print(content)
she sells
sea
shells by
the sea shore
The with open(...) as f syntax negotiates access to the file with the computer's operating system by maintaining a file handle, which is assigned to the variable f. (You can use any variable name instead of f.) All the code contained in the with block has access to the file handle f, and f.read() returns the entire contents of the file as a string.
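As a side note, the with syntax also guarantees that Python closes the file once the block ends. Here's a minimal sketch that confirms this using the file handle's closed attribute:
with open("poem.txt") as f:
    content = f.read()

# Outside the with block, the variable f still exists, but the file is closed
print(f.closed)  # True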
Line processing¶
It's often useful to read a text file line-by-line so that you can process each line separately. We could accomplish this by calling the split function on the content of the file, but Python conveniently provides an f.readlines() function that returns the text of the file as a list of strings, one per line.
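For instance, here's a sketch comparing the two approaches. Note that readlines keeps the trailing newline character on each line, while splitting the raw content on "\n" does not:
with open("poem.txt") as f:
    content = f.read()
# Splitting the raw text drops the newline characters
# (and can leave a trailing '' if the file ends with a newline)
print(content.split("\n"))  # e.g. ['she sells', 'sea', 'shells by', 'the sea shore']

with open("poem.txt") as f:
    # readlines keeps the trailing '\n' on each line
    print(f.readlines())    # e.g. ['she sells\n', 'sea\n', 'shells by\n', 'the sea shore']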
The following code snippet prints out the file with a line number in front of each line. In this example, lines stores a list of every line in the file, and the loop keeps a counter that is printed before each line.
with open("poem.txt") as f:
lines = f.readlines()
# print(lines)
line_num = 1
for line in lines:
print(line_num, line[:-1]) # Slice-out the newline character at the end
line_num += 1
1 she sells
2 sea
3 shells by
4 the sea shore
Token processing¶
It's also often useful to process each line of text token-by-token. A token is a generalization of the idea of a "word" that allows for any sequence of characters separated by spaces. For example, the string 'I really <3 dogs' has 4 tokens in it.
Token processing extends the idea of line processing by splitting each line on whitespace using the split function. In this course, we will use "word" and "token" interchangeably.
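To see tokenization in action, here's a quick sketch that applies split to the example string above:
line = 'I really <3 dogs'
tokens = line.split()  # Splits on any run of whitespace
print(tokens)          # ['I', 'really', '<3', 'dogs']
print(len(tokens))     # 4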
with open("poem.txt") as f:
lines = f.readlines()
line_num = 1
for line in lines:
tokens = line.split()
print(line_num, tokens)
line_num += 1
1 ['she', 'sells']
2 ['sea']
3 ['shells', 'by']
4 ['the', 'sea', 'shore']
Practice: Count odd-length tokens¶
How might we write a Python code snippet that takes the poem.txt file and prints the number of odd-length tokens per line?
def count_odd(path):
    """
    For the file path, prints out each line number followed by the number of odd-length tokens.

    >>> count_odd("poem.txt")
    1 2
    2 1
    3 0
    4 3
    """
    with open(path) as f:
        lines = f.readlines()
    line_num = 1
    for line in lines:
        tokens = line.split()
        odd_count = 0
        for token in tokens:
            if len(token) % 2 == 1:
                odd_count += 1
        print(line_num, odd_count)
        line_num += 1
doctest.run_docstring_examples(count_odd, globals())
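If this call produces no output, all of the doctest examples passed. If we want confirmation that the tests actually ran, doctest.run_docstring_examples accepts a verbose=True argument that prints each example as it executes.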
Practice: Debugging first_tokens¶
Let's help your coworker debug a function first_tokens, which should return a list containing the first token from each line in the specified file path. They sent you this message via team chat.
Hey, do you have a minute to help me debug this function? There's an error when I run it but I can't figure out how to fix it.
Unfortunately, your teammate only provided the code but did not provide any information about the error message, sample inputs to reproduce the problem, or a description of what they already tried.
Let's practice debugging this together and compose a helpful chat response to them.
def first_tokens(path):
    """
    Returns a list of first tokens from each line in the file at the specified path.

    >>> first_tokens("poem.txt")
    ['she', 'sea', 'shells', 'the']
    """
    result = []
    with open(path) as f:
        for line in f.readlines():
            tokens = line.split()
            result += [tokens[0]]
    return result
doctest.run_docstring_examples(first_tokens, globals())
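One way to start debugging is to reproduce the problem ourselves and experiment in a separate cell. For example, we can check how list concatenation behaves when the new element is wrapped in a one-element list: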
[] + ['she']
['she']
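Wrapping the token in a list is what makes the concatenation work: an expression like result + tokens[0], with a bare string on the right-hand side, raises TypeError: can only concatenate list (not "str") to list, which is one plausible cause of the error our teammate saw.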