Chapter 6
File Processing

Copyright © 2004 by Stuart Reges

6.1 Introduction

In Chapter 4 we saw how to construct a Scanner object to read input from the console. Now we will see how to construct Scanner objects to read input from files. The idea is fairly straightforward, but Java does not make it easy to read from input files. This is unfortunate because many interesting problems can be formulated as file processing tasks. Many introductory computer science classes have abandoned file processing altogether, or the topic has moved into the second course because it is considered too advanced for novices.

There is nothing intrinsically complex about file processing. The languages C++ and C# provide mechanisms for easily reading and writing files. But Java was not designed for file processing and Sun has not been particularly eager to provide a simple solution. They did, however, introduce the Scanner class as a way to simplify some of the details associated with reading files. The result is that file reading is still awkward in Java, but at least the level of detail is manageable.

Before we can write a file processing program, we have to explore some issues related to Java exceptions. Remember that exceptions are errors that halt the execution of a program. In the case of file processing, we might try to open a file that doesn't exist, which would generate an exception.

6.2 Try/Catch Statements

Java has a special construct for catching exceptions that is known as the try/catch statement. We will not be exploring all of the details of try/catch, but we will explore how to write some basic try/catch statements that we will need for file processing. Let's first see why we need this. We have been constructing our Scanner objects by passing System.in to the Scanner constructor:

        Scanner console = new Scanner(System.in);
You are also allowed to construct Scanner objects by passing an object of type File. File objects in turn are constructed by passing a String that represents the file's name. For example, suppose that we have a file called "numbers.dat" that contains a sequence of real numbers. So using this file name we can construct a File object:

        new File("numbers.dat")
And using this File object we can construct a Scanner object:

        new Scanner(new File("numbers.dat"))
Putting this all together, we'd say something like the following:
        Scanner input = new Scanner(new File("numbers.dat"));
But what if Java can't find a file named "numbers.dat"? Then what happens? The answer is that this version of the Scanner constructor throws an exception known as a FileNotFoundException. This particular exception is known as a checked exception.

Checked Exception

An exception that must be caught or specifically declared in the header of the method that might generate it.

Because it is a checked exception, we can't just ignore it. One alternative is to include what are known as "throws" clauses in the header of any method that might generate such an exception. This approach works, but it can be rather tedious. Instead, we will see how to use a try/catch statement to handle the error.
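For comparison, here is a minimal sketch of the throws-clause alternative. The class name ThrowsClause, the methods openFile and countTokens, and the file name "throws-demo.dat" are made up for this illustration. Notice how the clause has to be repeated in the header of every method along the chain of calls, which is why the approach can become tedious.

```java
import java.io.File;
import java.io.FileNotFoundException;
import java.io.PrintStream;
import java.util.Scanner;

public class ThrowsClause {
    // Declares the checked exception in the header instead of catching it.
    public static Scanner openFile(String name) throws FileNotFoundException {
        return new Scanner(new File(name));
    }

    // This method calls openFile, so it needs its own throws clause.
    public static int countTokens(String name) throws FileNotFoundException {
        Scanner input = openFile(name);
        int count = 0;
        while (input.hasNext()) {
            input.next();
            count++;
        }
        input.close();
        return count;
    }

    // main calls countTokens, so it needs a throws clause as well.
    public static void main(String[] args) throws FileNotFoundException {
        // Write a small sample file first so the sketch runs on its own.
        PrintStream out = new PrintStream(new File("throws-demo.dat"));
        out.println("308.2 14.9 7.4");
        out.close();

        System.out.println("tokens = " + countTokens("throws-demo.dat"));
    }
}
```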

The try/catch statement has the following general syntax.

        try {
            <statement>;
            <statement>;
            ...
            <statement>;
        } catch (<type> <name>) {
            <statement>;
            <statement>;
            ...
            <statement>;
        }

Notice that it is divided into two blocks using the keywords "try" and "catch". The first block has the code you want to execute. The second block has error recovery code that should be executed if an exception is thrown. So think of this as saying, "Try to execute these statements, but if something goes wrong, I'm going to give you some other code in the catch part that you should execute if an error occurs."

Notice that the catch part of this statement has a set of parentheses in which you include a type and name. The type should be the type of exception you are trying to catch. The name can be any legal identifier. For example, in the case of our Scanner code, we know that a FileNotFoundException might be thrown. What do we do if the exception occurs? That's a tough question, but for now let's just write an error message.

        try {
            Scanner input = new Scanner(new File("numbers.dat"));
        } catch (FileNotFoundException e) {
            System.out.println("File not found");
        }
This code says to try constructing the Scanner from the file "numbers.dat" but if the file is not found, then print an error message instead. This is the basic idea we want to follow, but there are several issues we must address to make this code work for us. First of all, there is a scope issue. The variable input isn't going to be much use to us if it's trapped inside the try block. So we have to declare the Scanner variable outside the try/catch statement:

        Scanner input;
        try {
            input = new Scanner(new File("numbers.dat"));
        } catch (FileNotFoundException e) {
            System.out.println("File not found");
        }
We have a bigger problem in that simply printing an error message isn't a good way to recover from this problem. How is the program supposed to proceed with execution if it can't read from the file? It probably can't. So what would be a more appropriate way to recover from the error? That depends a lot on the particular program you are writing, so the answer is likely to vary from one program to the next. Later in the chapter we'll explore putting this into a loop where we keep prompting for a legal file name until the user gives us something that works.
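As a preview of that idea, a reprompting loop might be sketched as follows. This is only one possible arrangement, not necessarily the version developed later. The class name PromptForFile and the simulated console input in main are assumptions made so that the sketch can run without a real user; in an actual program the console Scanner would be constructed from System.in.

```java
import java.io.File;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.util.Scanner;

public class PromptForFile {
    // Keeps asking for a file name until one can be opened.
    // Assumes the console always has another line to read.
    public static Scanner getInput(Scanner console) {
        Scanner result = null;
        while (result == null) {
            System.out.print("What is the name of the input file? ");
            String name = console.nextLine();
            try {
                result = new Scanner(new File(name));
            } catch (FileNotFoundException e) {
                System.out.println("File not found. Please try again.");
            }
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        // Simulate a user who first types a bad name, then a good one.
        File good = File.createTempFile("numbers", ".dat");
        good.deleteOnExit();
        Scanner console = new Scanner("no-such-file.dat\n" + good.getPath() + "\n");

        Scanner input = getInput(console);
        System.out.println();
        System.out.println("opened " + good.getName());
    }
}
```

The catch block here does not stop the program. It leaves result set to null, so the loop test fails to end the loop and the user is asked again.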

For now we'll look at an alternative that just stops the program from executing. One way to do this is to call a special method System.exit. Some people would write the following code:

        Scanner input;
        try {
            input = new Scanner(new File("numbers.dat"));
        } catch (FileNotFoundException e) {
            System.exit(1);
        }
The call on System.exit stops the program from executing. The value you pass to System.exit (a 1 in the example above) indicates the conditions under which you exited. It is a convention to return the value 0 as a way to say, "We exited normally without errors." By passing a value like 1 you are saying, "We exited abnormally, with error code 1." This solution stops the program, but the compiler does not know that System.exit never returns, so it will still complain that input might not have been initialized when we use it after the try/catch. There is a better solution.

Java has a family of exceptions that are unchecked. In particular, we can throw something called a RuntimeException.

        Scanner input;
        try {
            input = new Scanner(new File("numbers.dat"));
        } catch (FileNotFoundException e) {
            throw new RuntimeException("File not found");
        }
In effect, we are turning the FileNotFoundException, which is a checked exception, into a RuntimeException, which is not checked. There are several advantages to this. If someone who calls our code wants to write their own try/catch statement for our RuntimeException, they can handle this error. If not, then this will halt the program in the same way that the call on System.exit halts the program, but in this case Java will display a stack trace showing where the exception was thrown and how Java ended up there (a list in backwards order of each method called).

We will use this code snippet as a model to follow in the programs that we write.
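To illustrate the first advantage mentioned above, here is a small sketch of a caller that chooses to catch the RuntimeException rather than let it halt the program. The class name CatchRuntime, the methods open and canOpen, and the file name are hypothetical.

```java
import java.io.File;
import java.io.FileNotFoundException;
import java.util.Scanner;

public class CatchRuntime {
    // Converts the checked FileNotFoundException into an unchecked RuntimeException.
    public static Scanner open(String name) {
        try {
            return new Scanner(new File(name));
        } catch (FileNotFoundException e) {
            throw new RuntimeException("File not found: " + name);
        }
    }

    // A caller that decides to recover: it reports failure instead of halting.
    public static boolean canOpen(String name) {
        try {
            open(name).close();
            return true;
        } catch (RuntimeException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Prints false as long as no file with this name exists.
        System.out.println(canOpen("no-such-file.dat"));
    }
}
```

Because RuntimeException is unchecked, neither open nor canOpen needs a throws clause, and a caller that does not care about the error can simply call open and ignore the possibility.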

6.3 File Processing Basics

We are now ready to look at a complete program that reads an input file. Suppose that we have used a text editor to create a file called "numbers.dat" with the following content.

        308.2 14.9 7.4
        2.8


        3.9 4.7 -15.4
        2.8

We can read the file and echo the numbers using the following program.

        // This program reads a file of numbers, echoing their values one per line.

        import java.io.*;
        import java.util.*;

        public class Echo1 {
            public static void main(String[] args) {
                Scanner input;
                try {
                    input = new Scanner(new File("numbers.dat"));
                } catch (FileNotFoundException e) {
                    throw new RuntimeException("file not found");
                }

                while (input.hasNextDouble()) {
                    double next = input.nextDouble();
                    System.out.println("next number = " + next);
                }
            }
        }

When we run it, we get the following output.

        next number = 308.2
        next number = 14.9
        next number = 7.4
        next number = 2.8
        next number = 3.9
        next number = 4.7
        next number = -15.4
        next number = 2.8

The try/catch statement in the program constructs a Scanner object that is tied to this file and stores a reference to it in the variable called input. Then in the while loop that follows we repeatedly read and echo double values from the file as long as there are more doubles to find.

To process the file the Scanner object keeps track of a current position in the file. You can think of this as a cursor or pointer into the file.

Input cursor

A pointer to the current position in an input file.

When the Scanner object is first constructed, this cursor points to the beginning of the file. But as we perform various "next" operations, this cursor moves forward. After the first call on nextDouble, the cursor will be positioned in the middle of the first line after the token "308.2". After another call on nextDouble the cursor is positioned between the tokens "14.9" and "7.4". And so on.

We refer to this process as consuming input.

Consuming input

Moving the input cursor forward past some input.

Scanner objects are very flexible about when and how you consume input. You can consume part of the input in one section of code, then do some other work, then come back to consuming more of the input in another section of code. You decide exactly how much input to consume at a time through the calls you make on the Scanner object.
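The following sketch shows input being consumed in two stages. To keep it self-contained it constructs the Scanner from a String holding the same numbers, which stands in for the file. The class name ConsumeInStages and the helper skipTokens are invented for this example.

```java
import java.util.Scanner;

public class ConsumeInStages {
    // Consumes up to n tokens and returns how many were actually read.
    public static int skipTokens(Scanner input, int n) {
        int count = 0;
        while (count < n && input.hasNext()) {
            input.next();
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // A String stands in for the contents of the file here.
        Scanner input = new Scanner("308.2 14.9 7.4 2.8 3.9 4.7 -15.4 2.8");

        skipTokens(input, 3);                        // first section of code
        System.out.println("doing some other work");
        String next = input.next();                  // picks up where we left off
        System.out.println("next token = " + next);  // next token = 2.8
    }
}
```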

The various "has" methods of the Scanner class look ahead in the input, but they do not consume it. Consider our sample program. The fourth call on nextDouble will read in the value 2.8. This leaves the input cursor positioned at the end of the line with 2.8. The program then performs the while loop test again, which has a call on hasNextDouble. But the input file has two blank lines after the line with 2.8. The Scanner object has to look past these blank lines to find the value 3.9 at the beginning of the fifth line of input, and the cursor does not actually move past them until the next call on nextDouble. Keeping track of exactly where the input cursor is positioned can be tricky. If the data is line-oriented, it is best to read it in a line-oriented manner. We will see how to do that in a later section.

Notice that the Echo1 program does not necessarily consume the entire input file. It has a while loop that continues as long as it sees a double. If it encounters anything other than a double, it will stop reading without processing that part of the input file.
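Here is a small sketch of that behavior. It uses the same loop as Echo1 on made-up data that has a non-numeric token in the middle. The class name StopAtNonDouble, the method countDoubles and the sample text are assumptions for illustration, and a String again stands in for the file.

```java
import java.util.Scanner;

public class StopAtNonDouble {
    // Reads doubles the way Echo1 does and returns how many were read.
    public static int countDoubles(Scanner input) {
        int count = 0;
        while (input.hasNextDouble()) {
            input.nextDouble();
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        Scanner input = new Scanner("1.5 2.5 oops 3.5");
        System.out.println("doubles read = " + countDoubles(input)); // doubles read = 2
        System.out.println("stopped at   = " + input.next());        // stopped at   = oops
    }
}
```

The value 3.5 is never read by the loop, because the loop ends as soon as the next token is not a double.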

6.3.1 File Names

In the previous section we used the file name "numbers.dat". When Java finds you using a simple name like that, it looks in the current directory to find the file. The definition of "current directory" varies depending upon what Java environment you are using. If you are using the TextPad editor, then the current directory is the directory in which your program appears.

You can also use a fully-qualified file name. For example, if you are on a Windows machine and you have stored the file in a directory known as c:\data, you could use a file name like this:

        Scanner input;
        try {
            input = new Scanner(new File("c:\\data\\numbers.dat"));
        } catch (FileNotFoundException e) {
            throw new RuntimeException("File not found");
        }
Notice that we have to use the escape sequence "\\" to represent a single backslash character. This approach works well when you know exactly where your file is going to be stored on your system.
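A quick way to convince yourself of this is to print the String and its length. The sketch below uses a hypothetical class called BackslashDemo.

```java
public class BackslashDemo {
    // Returns the Windows-style path from the text, written with escape sequences.
    public static String samplePath() {
        return "c:\\data\\numbers.dat";
    }

    public static void main(String[] args) {
        String path = samplePath();
        System.out.println(path);           // c:\data\numbers.dat
        System.out.println(path.length());  // 19
    }
}
```

Each "\\" in the source code counts as one character in the String, which is why the length is 19 rather than 21.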

Another alternative is to ask the user for a file name. In the last chapter we saw a program called FindSum that prompted the user for a series of numbers to add together. Below is a variation that prompts the user for the name of a file of numbers to be added together.

        // This program adds together a series of numbers from a file.  It prompts
        // the user for the file name, then reads the file and reports the sum.

        import java.io.*;
        import java.util.*;

        public class FindSum2 {
            public static void main(String[] args) {
                System.out.println("This program will add together a series of real");
                System.out.println("numbers from a file.");
                System.out.println();

                Scanner console = new Scanner(System.in);
                System.out.print("What is the file name? ");
                String name = console.nextLine();
                Scanner input;
                try {
                    input = new Scanner(new File(name));
                } catch (FileNotFoundException e) {
                    throw new RuntimeException("file not found");
                }
                System.out.println();

                double sum = 0;
                while (input.hasNextDouble()) {
                    double next = input.nextDouble();
                    sum += next;
                }
                System.out.println("Sum = " + sum);
            }
        }

If we have it read from the file "numbers.dat" that we saw in the last section, then the program would execute like this:

        This program will add together a series of real
        numbers from a file.

        What is the file name? numbers.dat

        Sum = 329.29999999999995

The user also has the option of specifying a full file name, as in:

        This program will add together a series of real
        numbers from a file.

        What is the file name? c:\data\numbers.dat

        Sum = 329.29999999999995

Notice that the user doesn't have to type two backslashes to get a single backslash. That's because the Scanner object that reads the user's input is able to read it without escape sequences.

6.3.2 A more complex input file

Suppose that you have an input file that has information about how many hours have been worked by each employee of a company. For example, it might look like the following:

        Erica 7.5 8.5 10.25 8 8.5
        Greenlee 10.5 11.5 12 11 10.75
        Simone 8 8 8
        Ryan 6.5 8 9.25 8
        Kendall 2.5 3

The idea is that we have a list of hours worked by each employee and we want to find out the total hours worked by each individual. We can construct a Scanner object linked to this file to solve this task. As you start writing more complex file processing programs, you will want to divide them into methods that break up the code into logical subtasks. In this case, we can separate the details of opening the file from the details of processing the file.

We have already looked in detail at how to open a file and the code for doing so tends to be fairly standard. We refer to this as "boilerplate" code.

Boilerplate Code

Code that tends to be the same from one program to another.

The more interesting code involves processing the file. Most file processing will involve while loops because we won't know in advance how much data the file has in it. We'll choose different tests depending upon the particular file we are processing, but they will almost all be calls on the various "has" methods of the Scanner class. We basically want to say, "while you have more data for me to process, let's keep reading."

In this case we have a series of input lines that each begin with a name. For this program we are assuming that names are simple, with no spaces in the middle. That means we'll be reading them with a call on the next() method. As a result, our overall test involves seeing if there is another name in the input file:

        while (input.hasNext()) {
            <process next person>
        }

So how do we process one person? We have to read their name and then read their list of hours. If you look at the sample input file, you will see that the list of hours is not always the same length. This is a common occurrence in input files. For example, some employees might have worked on 5 different days while others worked only 2 days or 3 days. So we will use a loop for this as well. This is a nested loop. The outer loop is handling one person at a time and the inner loop will handle one number at a time. The task is a fairly straightforward cumulative sum:

	double sum = 0.0;
	while (input.hasNextDouble())
	    sum += input.nextDouble();
Putting this all together, we end up with the following complete program.

        // This program reads an input file of hours worked by various employees.  Each
        // line of the input file should have an employee's name (without any spaces)
        // followed by a list of hours worked, as in:
        //
        //     Erica 7.5 8.5 10.25 8
        //     Greenlee 10.5 11.5 12 11
        //     Ryan 6.5 8 9.25 8
        //
        // The program reports the total hours worked by each employee.

        import java.io.*;
        import java.util.*;

        public class HoursWorked {
            public static void main(String[] args) {
                Scanner console = new Scanner(System.in);
                Scanner input = getInput(console);
                process(input);
            }

            public static Scanner getInput(Scanner console) {
                System.out.print("What is the name of the input file? ");
                String name = console.nextLine();
                Scanner result;
                try {
                    result = new Scanner(new File(name));
                } catch (FileNotFoundException e) {
                    throw new RuntimeException("file not found");
                }
                System.out.println();
                return result;
            }

            public static void process(Scanner input) {
                while (input.hasNext()) {
                    String name = input.next();
                    double sum = 0.0;
                    while (input.hasNextDouble())
                        sum += input.nextDouble();
                    System.out.println("Total hours worked by " + name + " = " + sum);
                }
            }
        }

If we put the input above into a file called "hours.dat" and execute the program, we get the following result.

        What is the name of the input file? hours.dat

        Total hours worked by Erica = 42.75
        Total hours worked by Greenlee = 55.75
        Total hours worked by Simone = 24.0
        Total hours worked by Ryan = 31.75
        Total hours worked by Kendall = 5.5

As mentioned above, the getInput method is an example of boilerplate code. You will find that you can copy it verbatim from this program and use it in many others.

6.4 Line-based input and String-based Scanners

The program in the last section required that names have no spaces in them. This isn't a very practical restriction. It would be more convenient to be able to type anything for a name, including numbers. One way to do that is to put the name on a separate line from the rest of the data. For example, suppose that you want to compute weighted GPAs for a series of students. Suppose, for example, that a student has a 3-unit 3.0, a 4-unit 2.9, a 3-unit 3.2 and a 2-unit 2.5. We can compute an overall GPA that is weighted by the individual units for each course.

So we might have an input file that has its data on pairs of lines. For each pair the name will appear on the first line and the grade data will appear on the second line. For example, we might have an input file that looks like this:

        Erica Kane
        3 2.8 4 3.9 3 3.1
        Greenlee Smythe
        3 3.9 3 4.0 4 3.9
        Ryan Laveree
        2 4.0 3 3.6 4 3.8 1 2.8
        Adam Chandler
        3 3.0 4 2.9 3 3.2 2 2.5
        Adam Chandler, Jr
        4 1.5 5 1.9

When you have data that appears on multiple lines, it is best to read entire lines of the input file using calls on the nextLine method. That means that we can control our overall file processing loop with a test on hasNextLine. For the input file above, our basic structure will be:

        while (input.hasNextLine()) {
            String name = input.nextLine();
            String grades = input.nextLine();
            <process this student's data>
        }

This works well for reading the name because it's all one piece of data. But the input line with grades has internal structure to it. Wouldn't it be nice to use a Scanner to process the individual parts of the line? Java makes this possible. We can construct a Scanner object from an individual String. So instead of reading each second line of input into a String, let's instead put it into a Scanner object:

        while (input.hasNextLine()) {
            String name = input.nextLine();
            Scanner grades = new Scanner(input.nextLine());
            <process this student's data>
        }

Notice that for each input line of grades we construct a Scanner object. Because it is inside the loop, we construct a different Scanner object for each such input line. We can process the input line the same way we process an input file. The Scanner object will have an input cursor to keep track of a position within the String, and we can consume input through calls on the various "next" methods, using the "has" methods to test what comes next.

This approach to file processing will work well for any input file that is line oriented. Some lines might represent a single value like the name in the example above. For those lines, we can use a call on nextLine to read the entire line as a String that we can keep track of. Other lines will have multiple data values on the line, in which case we can construct a Scanner object from the String that will allow us to extract the individual data values.
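As a small illustration of the second case, the sketch below constructs a Scanner from each line and counts the tokens on it. The class name LineTokens and the method countTokens are invented for the example, and a String with two lines stands in for the input file.

```java
import java.util.Scanner;

public class LineTokens {
    // Builds a Scanner on one line of text and counts its tokens.
    public static int countTokens(String line) {
        Scanner data = new Scanner(line);
        int count = 0;
        while (data.hasNext()) {
            data.next();
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // A String stands in for the file; each nextLine call reads one line.
        Scanner input = new Scanner("Erica Kane\n3 2.8 4 3.9 3 3.1\n");
        while (input.hasNextLine()) {
            String line = input.nextLine();
            System.out.println(countTokens(line) + " tokens: " + line);
        }
    }
}
```

Running it should report 2 tokens for the name line and 6 tokens for the grade line. The Scanner built inside countTokens never sees anything beyond the one line it was given.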

Let's explore how we would process the grades using a Scanner. This is a good place to introduce a static method. The code above involves processing the overall file. The task of processing one list of grades is a lower level task that can be split off into its own method. Let's call it processGrades. Obviously it can't do its work without the Scanner object that has the grades, so we'll pass that as a parameter. What exactly needs to be done? The plan was to compute a weighted GPA for each student. So this method needs to read the individual grades and turn that into a single GPA score.

Weighted GPAs involve computing a value known as the "quality points" for each grade. The quality points are defined as the units times the grade. The weighted GPA is calculated by dividing the total quality points by the total units. So we just need to add up the total quality points and add up the total units, then divide. This involves a pair of cumulative sum tasks that we can express in pseudocode as follows:

        set total units to 0.
        set total quality points to 0.
        while (more grades to process) {
            read next units and next grade.
            add next units to total units.
            add (next units) * (next grade) to total quality points.
        }
        set gpa to (total quality points)/(total units).
This is fairly simple to translate into Java code by incorporating our Scanner object called "data":

	double totalQualityPoints = 0.0;
	double totalUnits = 0;
	while (data.hasNextInt()) {
	    int units = data.nextInt();
	    double grade = data.nextDouble();
	    totalUnits += units;
	    totalQualityPoints += units * grade;
	}
	double gpa = totalQualityPoints/totalUnits;
Because our Scanner object data was constructed from a single line of input, we can process just one person's grades with this loop. There is still a potential problem. What if there are no grades? Some students might have dropped all of their classes, for example. There are several ways we might handle that situation, but let's assume that it is appropriate to use a GPA of 0.0 in that case.
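To see what goes wrong, consider this sketch, which wraps the loop above in a hypothetical method called uncheckedGpa (the class name EmptyGrades is also made up). With no grades, both totals stay at zero. Because the variables are of type double, the division does not throw an exception. It instead produces the special value NaN ("not a number").

```java
import java.util.Scanner;

public class EmptyGrades {
    // Same cumulative sums as above, with no check for an empty line.
    public static double uncheckedGpa(Scanner data) {
        double totalQualityPoints = 0.0;
        double totalUnits = 0;
        while (data.hasNextInt()) {
            int units = data.nextInt();
            double grade = data.nextDouble();
            totalUnits += units;
            totalQualityPoints += units * grade;
        }
        return totalQualityPoints / totalUnits; // 0.0 / 0.0 when there are no grades
    }

    public static void main(String[] args) {
        double gpa = uncheckedGpa(new Scanner(""));
        System.out.println(gpa);               // NaN
        System.out.println(Double.isNaN(gpa)); // true
    }
}
```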

Making that correction and putting this into a method, we end up with the following code.


    public static double processGrades(Scanner data) {
	double totalQualityPoints = 0.0;
	double totalUnits = 0;
	while (data.hasNextInt()) {
	    int units = data.nextInt();
	    double grade = data.nextDouble();
	    totalUnits += units;
	    totalQualityPoints += units * grade;
	}
	if (totalUnits == 0)
	    return 0.0;
	else
	    return totalQualityPoints/totalUnits;
    }
Recall that our high-level code looked like this:

        while (input.hasNextLine()) {
            String name = input.nextLine();
            Scanner grades = new Scanner(input.nextLine());
            <process this student's data>
        }

We can now start to fill in the details of what it means to "process this student's data." We will call the method we just wrote to process the grades for this student and to turn it into a weighted GPA and then print the results:

	    double gpa = processGrades(grades);
	    System.out.println("GPA for " + name + " = " + gpa);
This would complete the program, but let's add one more calculation. Let's compute the max and min GPA that we see among these students. We can accomplish this fairly easily with some simple if statements after the println:

	    if (gpa > max)
		max = gpa;
	    if (gpa < min)
		min = gpa;
We simply compare the current gpa against what we currently consider the max and min, resetting if the new gpa represents a new max or a new min. But how do we initialize these variables? We have two approaches to choose from. One approach involves initializing the max and the min to the first value in the sequence. We could do that, but it would make our loop much more complicated than it is currently. The second approach involves setting the max to the lowest possible value and setting the min to the highest possible value. This approach isn't always possible because we don't always know how high or low our values might go. But in the case of GPAs, we know that they will always be between 0.0 and 4.0.

Thus, we can initialize the variables as follows:

	double max = 0.0;
	double min = 4.0;
It may seem odd to set the max to 0 and the min to 4, but that's because we are intending to have them reset inside the loop. If the first student has a GPA of 3.2, for example, then this will constitute a new max (higher than 0.0) and a new min (lower than 4.0). Of course, it's possible that all students end up with a 4.0, but then our choice of 4.0 for the min is appropriate. Or all students could end up with a 0.0, in which case our choice of a max of 0.0 is appropriate.
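The sketch below isolates this initialization strategy so it can be tried on its own. The class name MaxMinGpa and the method maxMin are hypothetical, the GPAs are made-up values read from a String, and the two results are returned in an array purely as a convenience for the example.

```java
import java.util.Scanner;

public class MaxMinGpa {
    // Returns {max, min} for a sequence of GPAs between 0.0 and 4.0.
    public static double[] maxMin(Scanner gpas) {
        double max = 0.0; // lowest possible GPA
        double min = 4.0; // highest possible GPA
        while (gpas.hasNextDouble()) {
            double gpa = gpas.nextDouble();
            if (gpa > max)
                max = gpa;
            if (gpa < min)
                min = gpa;
        }
        return new double[] {max, min};
    }

    public static void main(String[] args) {
        double[] result = maxMin(new Scanner("3.2 4.0 2.5"));
        System.out.println("max = " + result[0] + ", min = " + result[1]); // max = 4.0, min = 2.5

        // With no values at all, the initial settings are reported unchanged.
        double[] empty = maxMin(new Scanner(""));
        System.out.println("max = " + empty[0] + ", min = " + empty[1]);   // max = 0.0, min = 4.0
    }
}
```

The second call shows the one case where this approach gives a misleading answer: if there are no students at all, the program reports a max of 0.0 and a min of 4.0.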

Putting this all together we get the following complete program.

        // This program reads an input file with GPA data for a series of students
        // and reports a weighted GPA for each.  The input file should consist of
        // a series of line pairs where the first line has a student's name and the
        // second line has a series of grade entries.  The grade entries should be
        // a number of units (an integer) followed by a grade (a number between 0.0
        // and 4.0).  For example, the input might look like this:
        //
        //     Erica Kane
        //     3 2.8 4 3.9 3 3.1
        //     Greenlee Smythe
        //     3 3.9 3 4.0 4 3.9
        //
        // The program also reports the max and min GPA.

        import java.io.*;
        import java.util.*;

        public class Gpa {
            public static void main(String[] args) {
                Scanner console = new Scanner(System.in);
                Scanner input = getInput(console);
                process(input);
            }

            public static Scanner getInput(Scanner console) {
                System.out.print("What is the name of the input file? ");
                String name = console.nextLine();
                Scanner result;
                try {
                    result = new Scanner(new File(name));
                } catch (FileNotFoundException e) {
                    throw new RuntimeException("file not found");
                }
                System.out.println();
                return result;
            }

            public static void process(Scanner input) {
                double max = 0.0;
                double min = 4.0;
                while (input.hasNextLine()) {
                    String name = input.nextLine();
                    Scanner grades = new Scanner(input.nextLine());
                    double gpa = processGrades(grades);
                    System.out.println("GPA for " + name + " = " + gpa);
                    if (gpa > max)
                        max = gpa;
                    if (gpa < min)
                        min = gpa;
                }
                System.out.println();
                System.out.println("max GPA = " + max);
                System.out.println("min GPA = " + min);
            }

            public static double processGrades(Scanner data) {
                double totalQualityPoints = 0.0;
                double totalUnits = 0;
                while (data.hasNextInt()) {
                    int units = data.nextInt();
                    double grade = data.nextDouble();
                    totalUnits += units;
                    totalQualityPoints += units * grade;
                }
                if (totalUnits == 0)
                    return 0.0;
                else
                    return totalQualityPoints/totalUnits;
            }
        }

The program executes as follows, assuming the data above is placed in a file called "gpa.dat".

        What is the name of the input file? gpa.dat

        GPA for Erica Kane = 3.3299999999999996
        GPA for Greenlee Smythe = 3.9299999999999997
        GPA for Ryan Laveree = 3.6799999999999997
        GPA for Adam Chandler = 2.9333333333333336
        GPA for Adam Chandler, Jr = 1.7222222222222223

        max GPA = 3.9299999999999997
        min GPA = 1.7222222222222223

6.5 Programming Problems

  1. Write a program that takes as input a single-spaced text file and produces as output a double-spaced text file.

  2. Students are often told that their term papers should have a certain number of words in them. Counting words in a long paper is a tedious task, but the computer can help. Write a program that counts the number of words in a paper assuming that consecutive words are separated either by spaces or end-of-line characters. You could then extend the program to count not just the number of words, but the number of lines and the total number of characters in the file.

  3. Write a program that takes as input lines of text like:

    	This is some text here.

    and produces as output the same text inside a box, as in:

    	+--------------+
    	| This is some |
    	| text here.   |
    	+--------------+
    
    Your program will have to assume some maximum line length (e.g., 12 above).