Lesson 3. File Processing

The content for this lesson is adapted from material by Hunter Schafer and by Kevin Lin.

Objectives¶

In this lesson, we’ll introduce file processing, more ways of working with lists, and different ways of interacting with the Python programming environment. By the end of this lesson, students will be able to:

Read text files line-by-line (line processing) or token-by-token (token processing).
Identify a relative and absolute file path from the current directory to any file.

Setting up¶

To follow along with the code examples in this lesson, please download and unzip the files here:

Lesson 3 files

Note: since this lesson deals with files and folders, it is important that you unzip all the files as they appear in the zip folder and save them to a location called lesson3! Otherwise, some of the examples will not work. After unzipping the contents of lesson3, your directory should look like this:

files (a folder containing the following)
- empty.txt
- poem.txt
- store.txt
count_unique_words.py
filter_long_lines.py
lesson3.ipynb
print_tokens.py

List Methods¶

Just like strings, lists have methods that you can call on a list object to observe or modify its values. Here are some of the most common ones, where l is a list:

l.append(x) adds x to the end of l.
l.extend(xs) adds all elements in xs to the end of l. (Note that xs must itself be a collection of values that can be looped over, like a list or a string.)
l.insert(i, x) inserts x at index i in l.
l.remove(x) removes the first x found in l.
l.pop(i) removes and returns the element at index i in l. If i is left out, it removes and returns the last element.
l.clear() removes all values from l.
l.index(x) returns the first index whose associated value is x. Raises an error if x is not in l.
l.reverse() reverses the order of all elements in l.
l.sort() rearranges all elements of l into sorted order.

Note that, with the exception of l.index(x), all these methods will modify the existing list data structure! Unlike strings, lists are mutable, meaning that you can change them.

Here are some examples of usage:

l = []  # Empty list
l.append(1)
l.append(2)
l.append(3)
l.append(2)

print('Before  remove', l)
l.remove(2)  # Removes first instance of the value 2
print('After   remove', l)

# Can call pop one of two ways because the index is optional
l.pop(1)  # Removes the value at index 1
print('After   pop(1)', l)
l.pop()  # Removes the last value in the list
print('After    pop()', l)

l.extend([1, 2, 3])
l.extend("hello")
print('After extend()', l)

Food for thought: What happens when you use the extend method with a string, like in the example of l.extend("hello")? Why do you think this occurs?
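The examples above do not exercise insert, index, reverse, or sort. Here is a minimal sketch of those four methods; the list nums and its values are made up for illustration. It also shows one thing worth knowing: methods that modify the list in place, like sort, return None rather than a new list.

```python
# Hypothetical example list; the values are made up for illustration.
nums = [30, 10, 20]

nums.insert(1, 99)         # insert 99 at index 1 -> [30, 99, 10, 20]
position = nums.index(10)  # index of the first 10 -> 2

nums.reverse()             # reverse in place -> [20, 10, 99, 30]
nums.sort()                # sort in place -> [10, 20, 30, 99]

# sort() changes nums itself and returns None, so don't assign its result
result = nums.sort()

print(nums)      # [10, 20, 30, 99]
print(position)  # 2
print(result)    # None
```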

The `in` Keyword¶

The idea of checking if a list contains a value is incredibly important for applications including finding all the distinct values in a collection, or only looking for values in a subset of all possible values (like looking at all students from WA, OR, or CA rather than all 50 states).

Python has a special keyword made precisely for these contains queries (we also call them membership queries). The syntax is value in collection, and it is an expression that evaluates to True or False. This means you can use it in an if statement or a while loop! The following snippet shows the keyword in action. Try editing the code block to see what happens if you search for the word 'cats' instead.

words = ['I', 'love', 'dogs']
if 'dogs' in words:
    print('Found it!')
else:
    print('No luck :(')

Notice that we didn’t say you could only use this on lists. It turns out that you can use it on almost all structures we learn in this class that store values. For example, you can use it on strings too: 'og' in 'dogs'.

To see if something is not in a list, you can use not in as shown in the next example. It’s exactly the opposite of the in keyword!

words = ['I', 'love', 'dogs']
if 'cats' not in words:
    print('Not there!')
else:
    print("It's there")
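To connect this back to the earlier example of looking only at students from WA, OR, or CA, here is a small sketch that uses in inside a loop to keep a subset of values. The lists student_states and west_coast are hypothetical data made up for illustration. The last two lines show in and not in on strings.

```python
# Hypothetical data: each student's home state (made up for illustration)
student_states = ['WA', 'TX', 'OR', 'NY', 'CA', 'WA']
west_coast = ['WA', 'OR', 'CA']

west_coast_students = []
for state in student_states:
    # Keep the state only if it is one of the three we care about
    if state in west_coast:
        west_coast_students.append(state)

has_og = 'og' in 'dogs'        # substring check on a string
no_cat = 'cat' not in 'dogs'   # the opposite check

print(west_coast_students)  # ['WA', 'OR', 'CA', 'WA']
print(has_og)               # True
print(no_cat)               # True
```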

File Processing¶

Files on computers store some type of data. This data could be pictures, a word document, a video game, etc. For the first part of this class, we will only work with files that store text data. One such file type that holds text data is the .txt file type.

Most of the time, we open files in our computer’s file explorer. The Terminal is an alternative, programmatic way of interacting with files. The cat program (short for “concatenate”) prints out the contents of a file. If we have a file called poem.txt, we can run the following command in the terminal to display its contents:

cat poem.txt

The output would be:

she sells
sea
shells by
the sea shore

Opening Files in Python¶

With Python, you can open and read files using the built-in open function. The syntax is shown in the following snippet. Note that the value you pass into open is a path to the file. We will talk about file paths in a later section, but you can think of it like the full name of a file on a computer! The following code snippet opens poem.txt and reads the text into the variable named content before printing out content.

with open('files/poem.txt') as f:
    content = f.read()  # returns the file contents as a str
    print(content)

The with open(...) as f syntax negotiates access to the file with the computer’s operating system by maintaining a file handle, which in this case is stored in f. (You can change the target name from f to any other name.) All of the code contained within the with block has access to the file handle f.

Syntax

You should always use this with open(…) as f syntax when working with files in Python.
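One reason to prefer this syntax is that Python closes the file for you as soon as the with block ends. The sketch below checks this using the file handle's closed attribute. As an assumption so that it runs anywhere, it first writes a small throwaway file using the 'w' (write) mode of open, which this lesson does not otherwise cover.

```python
import os
import tempfile

# Assumption: write a small throwaway file so this sketch runs anywhere.
path = os.path.join(tempfile.mkdtemp(), 'poem.txt')
with open(path, 'w') as f:
    f.write('she sells\nsea\n')

with open(path) as f:
    content = f.read()
    inside = f.closed   # False: the file is still open inside the block

outside = f.closed      # True: leaving the block closed the file

print(inside, outside)  # False True
```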

Line Processing¶

A very common pattern is to read the file line by line so that you can process each line on its own. We could accomplish this by calling the split method on the content of the file, but Python conveniently provides a readlines method on the file object that returns the lines as a list of strings.

For example, the following code snippet will print out the file with a line number in front of each line. In this example, lines will store a list of each line in the file and our loop keeps track of a counter and prints that before the line itself.

New lines

As a minor detail, each line will still contain a special new-line character (\n) at the end. To make sure our output doesn’t have extra new-lines in it, we strip each line to remove this trailing whitespace.

def number_lines(file_name):
    """
    Takes a file name as a parameter and prints out the file 
    line by line (prefixed with that line's line number)
    """
    with open(file_name) as f:
        lines = f.readlines() # Read lines from the file handle
        line_num = 1
        for line in lines:
            line = line.strip()
            # Remember we have to cast line_num to a str!
            print(str(line_num) + ': ' + line)
            line_num += 1

def main():
    number_lines('files/poem.txt')

if __name__ == '__main__':
    main()

So while the code is getting more complex, every piece of it solves one of 3 sub-tasks of the problem:

The standard main-method pattern code and the definition of the number_lines function
The standard code for opening a file (with open...) and reading the lines of a file (f.readlines())
The rest is just a problem we could have solved from Lesson 2 that involves looping over a list!
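To see the trailing new-line characters for yourself, you can print the list that readlines returns. The sketch below does that, and also shows Python's built-in enumerate as one possible alternative to keeping the line_num counter by hand. As an assumption, it recreates the poem in a temporary file (ending each line with \n) so that it is self-contained.

```python
import os
import tempfile

# Assumption: recreate the poem in a temporary file so the sketch is self-contained.
path = os.path.join(tempfile.mkdtemp(), 'poem.txt')
with open(path, 'w') as f:
    f.write('she sells\nsea\nshells by\nthe sea shore\n')

with open(path) as f:
    lines = f.readlines()

print(lines)  # ['she sells\n', 'sea\n', 'shells by\n', 'the sea shore\n']

numbered = []
# enumerate(lines, start=1) pairs each line with a counter starting at 1
for line_num, line in enumerate(lines, start=1):
    numbered.append(str(line_num) + ': ' + line.strip())

print(numbered)  # ['1: she sells', '2: sea', '3: shells by', '4: the sea shore']
```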

Token Processing¶

Another very common task when processing files is to break up each line into the tokens on that line. A token is similar to the notion of a “word” but is generalized to any series of characters separated by spaces. In CSE 163, we commonly use the words “word” and “token” interchangeably to mean a sequence of characters separated by spaces. For example, the string 'I really <3 dogs' has 4 tokens in it (we would also count it as having 4 words since we are not interested in differentiating between valid English words).

For example, what if we wanted to print out the number of odd-length words on each line? For poem.txt, the program we want to write should output:

1: 2
2: 1
3: 0
4: 3

This might sound complicated at first, but we can actually use what we know about strings in Python to solve this in our loop over the lines of the file. Why? Because each line is just a string! Recall that there is a really useful string method called split (from Lesson 2) that lets us break apart a string into parts based on some delimiter (in this case, spaces).
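As a quick sketch of how split behaves on a line read from a file, compare calling it with no argument against calling it with an explicit space. The string line below is a hypothetical example with extra spaces and a trailing new-line character.

```python
line = 'shells   by\n'  # hypothetical line with extra spaces and a newline

# With no argument, split treats any run of whitespace (spaces, \n) as one separator
tokens = line.split()

# With ' ' as the delimiter, every single space separates, and \n is left in place
pieces = line.split(' ')

# A string with no tokens splits into an empty list
nothing = ''.split()

print(tokens)   # ['shells', 'by']
print(pieces)   # ['shells', '', '', 'by\n']
print(nothing)  # []
```

This is why calling split() with no argument is usually the more convenient choice for token processing.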

It will help to start by solving a sub-part of this problem before trying to solve the entire thing. What if I was given a string, and wanted to count the number of odd-length words in that string? You could write code that splits the string up by spaces and then loops over that list of words to count up all the ones with odd lengths.

s = 'I am a really cool sentence.'
words = s.split()
count = 0
for word in words:
    # If the length has a remainder when divided by 2, it's odd
    if len(word) % 2 == 1:
        count += 1

print('Number of odd-length words:', count)

Now that we have this sub-problem solved, we can tackle the larger problem of doing this task above multiple times, once for each line in the file.

We start with the general pattern for looping over the lines of a file:

def count_odd(file_name: str) -> None:
    with open(file_name) as f:
        lines = f.readlines()
        for line in lines:
            # Do something with line
            pass  # Placeholder so the empty loop body is valid Python

Now that we have that starter code, we can go ahead and use the ideas we saw to count the number of odd length words in a single line inside this loop over the lines! The only other thing that needs to be added is some bookkeeping to keep track of the line number for printing.

def count_odd(file_name):
    """
    Takes a file name as a parameter and for each line
    prints out the line number and the number of odd length words on that line.
    """
    with open(file_name) as f:
        lines = f.readlines()
        line_num = 1
        for line in lines:
            # Break the line into words (this also removes trailing whitespace)
            words = line.split()

            # Count the number of odd-length words in this line
            odd_count = 0
            for word in words:
                if len(word) % 2 == 1:
                    odd_count += 1

            # Print it out!
            print(str(line_num) + ': ' + str(odd_count))
            line_num += 1


def main():
    count_odd('files/poem.txt')


if __name__ == '__main__':
    main()
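If you want to check the result in code rather than by reading printed output, one option is a variant that returns the counts in a list. The function odd_counts below is a hypothetical variation on count_odd, not part of the lesson files. As an assumption, it is run on a temporary copy of the poem so that the sketch is self-contained.

```python
import os
import tempfile


def odd_counts(file_name):
    """
    Hypothetical variant of count_odd: returns a list holding the number
    of odd-length words on each line instead of printing them.
    """
    counts = []
    with open(file_name) as f:
        for line in f.readlines():
            odd_count = 0
            for word in line.split():
                if len(word) % 2 == 1:
                    odd_count += 1
            counts.append(odd_count)
    return counts


# Assumption: recreate the poem in a temporary file so the sketch runs anywhere.
path = os.path.join(tempfile.mkdtemp(), 'poem.txt')
with open(path, 'w') as f:
    f.write('she sells\nsea\nshells by\nthe sea shore\n')

counts = odd_counts(path)
print(counts)  # [2, 1, 0, 3]
```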

File Paths¶

Files in your computer are stored in folders. Folders can contain other folders or other files and this makes a hierarchy in your computer. If you have a Windows PC, you can use the File Explorer. If you have a Mac, you can use the Finder application to navigate these folders. When you open a Python project in JupyterHub or locally using something like VSCode, it opens the project to some folder that it will use as your workspace.

Why is it important to know where your workspace is or what files are in it? You usually need to know that in order to specify where files are relative to that location. Files are specified by their path rather than their name. For example, the command to run any Python file from the terminal is the command python, like python main.py. When we say main.py, we are referring to a file in the current directory named main.py. This path is relative to wherever our workspace is. Let’s use another example to clarify.

In the JupyterHub folder for this lesson, you will note that there is a subfolder called files. What if you wanted to print out the lyrics to Carly Rae Jepsen’s hit classic “Store” that is store-d (ha!) in this folder? Let’s see what happens if you try running with open('store.txt'):

with open('store.txt') as f:
    print(f.readlines())

You should see that it crashes with a FileNotFoundError because it can’t find a file named store.txt in this workspace. To make this work, you have to specify a path to get to that file from this workspace. In this case, the path lists the folder names from here to there, separated by / characters. So to properly open this file, you would use open('files/store.txt') to tell Python to look in the files folder for the file.

with open('files/store.txt') as f:
    print(f.readlines())
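If you are not sure whether a path is right, you can check it before opening the file, or catch the error when opening fails. The sketch below shows both, using os.path.exists and a try/except block, neither of which is otherwise covered in this lesson. As an assumption, it points at a freshly created empty temporary folder so that store.txt is guaranteed to be missing.

```python
import os
import tempfile

# Assumption: a brand-new empty folder, so 'store.txt' cannot exist inside it.
missing_path = os.path.join(tempfile.mkdtemp(), 'store.txt')

found = os.path.exists(missing_path)
print(found)  # False

try:
    with open(missing_path) as f:
        print(f.readlines())
    outcome = 'opened'
except FileNotFoundError:
    # This is the error Python raises when no file exists at the given path
    outcome = 'not found'

print(outcome)  # not found
```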

Absolute Paths¶

Every path we have used so far, such as files/poem.txt, has been a relative path: it describes how to get to the file starting from the workspace. We could also refer to poem.txt with a path like /lesson3/files/poem.txt. We call this an absolute path because it starts with a /. It’s absolute because we think of the path '/' as the top-level folder of the computer, so we are specifying the path from the top (hence absolute because it’s not relative). The absolute path on your local computer will most likely be different from someone else’s!

The key takeaway is to think about where your Python program is running from, and what path you need to specify to get to the desired file from that workspace. If you right-click a file in VS Code, you will see options to copy the relative or the absolute path!
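You can also ask Python these questions directly. The sketch below uses the standard os module: os.getcwd() reports the folder your program is running from, os.path.abspath turns a relative path into an absolute one, and os.path.isabs checks which kind of path you have. The relative path here is just an example and does not need to exist for these calls to work.

```python
import os

# Example relative path; the file does not need to exist for these calls.
relative_path = os.path.join('files', 'poem.txt')

workspace = os.getcwd()                         # folder the program runs from
absolute_path = os.path.abspath(relative_path)  # workspace + relative path

relative_is_abs = os.path.isabs(relative_path)
absolute_is_abs = os.path.isabs(absolute_path)

print(workspace)        # depends on your computer
print(absolute_path)    # depends on your computer
print(relative_is_abs)  # False
print(absolute_is_abs)  # True
```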

⏸️ Pause and 🧠 Think¶

Take a moment to review the following concepts and reflect on your own understanding. A good temperature check for your understanding is asking yourself whether you might be able to explain these concepts to a friend outside of this class.

Here’s what we covered in this lesson:

More list methods
Membership queries using in
File processing
- Line processing
- Token processing
Relative vs. absolute paths

Here are some other guiding exercises and questions to help you reflect on what you’ve seen so far:

In your own words, write a few sentences summarizing what you learned in this lesson.
What did you find challenging in this lesson? Come up with some questions you might ask your peers or the course staff to help you better understand that concept.
What was familiar about what you saw in this lesson? How might you relate it to things you have learned before?
Throughout the lesson, there were a few Food for thought questions. Try exploring one or more of them and see what you find.

In-Class¶

When you come to class, we will work together on print_tokens.py, filter_long_lines.py, and count_unique_words.py. Make sure that you have a way of editing and running these files!

`print_tokens`¶

For this practice problem, refer to print_tokens.py.

Write a function print_tokens that takes a file name and prints out each token (word) in the file on its own line.

For example, suppose we had a file called store.txt in the files subfolder which had the following contents:

I'm just goin' to the store, to the store
I'm just goin' to the store
You might not see me anymore, anymore
I'm just goin' to the store

The first 9 lines of output from calling print_tokens('files/store.txt') are shown below. (All of the tokens shown below are from the first line, but make sure to print the tokens from all of the other lines too.)

I'm
just
goin'
to
the
store,
to
the
store
...

`filter_long_lines`¶

For this practice problem, refer to filter_long_lines.py.

Write a function filter_long_lines that takes a file name and a minimum number of words and prints out only the lines in the file containing at least the minimum number of words (tokens separated by spaces).

If the file is empty, the function should print 'Empty file'.

For example, suppose we had a file called store.txt in the files subfolder which had the following contents.

I'm just goin' to the store, to the store
I'm just goin' to the store
You might not see me anymore, anymore
I'm just goin' to the store

filter_long_lines('files/store.txt', 7) would print the output below because these are all the lines with 7 or more words:

I'm just goin' to the store, to the store
You might not see me anymore, anymore

If we had a file called empty.txt in the files subfolder which had no content, then filter_long_lines('files/empty.txt', 5) would print the output below:

Empty file

Empty file

Do not modify the provided empty.txt in the files subfolder. If you did modify it, you will need to delete it and recreate a new one for this code to work!

`count_unique_words`¶

For this practice problem, refer to count_unique_words.py.

Write a function count_unique_words that takes a file name and returns the number of unique tokens that appear in that file. Remember a token is a sequence of characters separated by spaces.

Consider a file store.txt with the following contents.

I'm just goin' to the store, to the store
I'm just goin' to the store
You might not see me anymore, anymore
I'm just goin' to the store

count_unique_words('files/store.txt') should return 14. This is because it contains the unique words ["I'm", "just", "goin'", "to", "the", "store,", "store", "You", "might", "not", "see", "me", "anymore,", "anymore"]. Notice that the tokens 'store,' and 'store' are different in punctuation. (So you don’t have to worry about removing punctuation.)

We recommend following these steps:

Start by writing the function header.
Then, write the code to go through the file word-by-word. (What kind of processing do we need?)
Finally, think about how to store the data in a list so that you can solve the problem.

Canvas Quiz¶

All done with the lesson? Complete the Canvas Quiz linked here!
