CSE 413 -- Assignment 6 -- D Lexical Analyzer

Due:  Monday, May 24 at the beginning of class.  Turn in a printed listing of your code and a listing of some sample input and the output produced. You should be able to finish this assignment well ahead of time, and the next assignment will be handed out before this one is due.  You are encouraged to work with a partner.  If you do work with someone, you should plan on continuing to work with the same person for the rest of the compiler project.

Overview

The purpose of this assignment is to construct a scanner for the D language, which is described in a separate writeup.  The scanner, as well as the rest of your compiler, should be written in Java.  The code should not just work properly, but it should also be readable and well-organized.

Include an appropriate test program with your scanner.  This program should open a D source file and a text output file, then use the scanner to read the D program one token at a time, and print the tokens to the output file.  The test program should use a file dialog to let the user select the input file, and it can either create and name the output file automatically (e.g., create test.txt in the same directory as the input file test.d), or it can present the user with another dialog box to select the output file.  Lines from the source program should appear in the token stream output to make it easier to see the correlation between the source code and the tokens.

Scanner Organization

A scanner (or lexical analyzer) reads the character (text) representation of a program and transforms it into a stream of tokens representing the basic lexical items in the language.   These tokens include  punctuation (lparen, rparen, semicolon, ...), keywords (if, int, return, ...),  integers, and identifiers.  The scanner skips over whitespace and comments in the source code; these do not appear in the token stream.

The scanner should be packaged in a class, and should provide a nextToken() method that the client program can use to obtain tokens sequentially.   A first approximation to the test program might look something like this.

   // scanner test program
   public static void main (String [] args) {
      InputStream sourceFile;   // D source program
      PrintWriter output;       // token stream output file

      // use a file dialog to select and open sourceFile,
      // and create the corresponding output file
      ...
      // create scanner object to read tokens from sourceFile
      Scanner scan = new Scanner(sourceFile);

      // read source program and print token stream
      Token t = scan.nextToken();
      while (t.kind != Token.EOF) {
         output.println(t);     // uses Token's toString
         t = scan.nextToken();
      }
   }

You will also need to define a class Token to represent the lexical tokens.   Each Token object should include a field to store the lexical class of the Token (id, int, lparen, not, ...).  Tokens for identifiers and integers need to contain additional information: the String representation of the identifier or the numeric (int) value of the integer.  Since there's little space overhead involved, it's reasonable to have an int and a String field in each Token and ignore these fields if the Token is something other than an integer or identifier.  Class Token should include appropriate symbolic constant names (static final ints) for the various lexical classes.
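As a starting point, a Token class along these lines would satisfy the description above.  This is only a sketch: the particular constant names, values, and toString format are illustrative assumptions, and you will need one constant per lexical class in D.

```java
// Minimal sketch of a Token class.  The lexical-class names and the
// toString format here are illustrative, not prescribed by the assignment.
class Token {
   // symbolic constants for the lexical classes
   public static final int EOF    = 0;
   public static final int ID     = 1;
   public static final int INT    = 2;
   public static final int LPAREN = 3;
   public static final int RPAREN = 4;
   // ... one constant for each remaining lexical class in D

   public int kind;      // lexical class of this token
   public String id;     // identifier text, used only if kind == ID
   public int val;       // numeric value, used only if kind == INT

   public Token(int kind)            { this.kind = kind; }
   public Token(int kind, String id) { this.kind = kind; this.id = id; }
   public Token(int kind, int val)   { this.kind = kind; this.val = val; }

   // String representation, so the test program can print tokens directly
   public String toString() {
      switch (kind) {
         case ID:     return "ID(" + id + ")";
         case INT:    return "INT(" + val + ")";
         case LPAREN: return "LPAREN";
         case RPAREN: return "RPAREN";
         case EOF:    return "EOF";
         default:     return "token " + kind;
      }
   }
}
```

Note that the unused field (id or val) is simply ignored for tokens of other kinds, as discussed above.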

Implementation Notes

Take advantage of the features of Java when you write your code.  For example, Java classes should normally contain a toString() method that yields a String representation of instances of that class.  If your Token class contains an appropriate toString, then the test program can write the tokens directly to a text stream without having to decode them.

Java provides at least two ways to break input lines (Strings) into tokens.   Look at the definitions of classes StreamTokenizer and StringTokenizer for ideas.  Class Integer provides a constructor that takes a String of digits and produces the corresponding Integer object; the static method Integer.parseInt performs the same conversion, yielding a primitive int.
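For converting the text of an integer literal into its numeric value, either Integer.parseInt or the Integer(String) constructor will do; a quick illustration (the variable names are arbitrary):

```java
// Converting the text of an integer literal to its numeric value.
// Both conversions throw NumberFormatException on malformed input.
class IntegerConversionDemo {
   public static void main(String[] args) {
      String digits = "413";                 // text of an integer literal
      int value = Integer.parseInt(digits);  // primitive int
      System.out.println(value);             // prints 413

      // the constructor produces an Integer object rather than an int
      Integer boxed = new Integer("24");
      System.out.println(boxed.intValue());  // prints 24
   }
}
```

In a scanner, Integer.parseInt is usually the more convenient of the two, since the Token's value field is a primitive int.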

Although one would normally use the tokenizer classes to parse input, for this assignment you might want to implement the scanner without them, examining each input character and breaking the input into tokens by hand.  That takes a bit more time, but it gives you more insight into how the tokenizer classes are implemented.  If you go this route, you'll find the methods in class Character helpful for classifying input characters.
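To make the character-by-character approach concrete, here is one way a piece of it might look.  The class and method names below are illustrative assumptions; only the use of class Character and one-character pushback reflects the technique described above.

```java
import java.io.IOException;
import java.io.PushbackReader;
import java.io.Reader;

// Sketch of hand-written token recognition using class Character to
// classify input characters.  PushbackReader provides the one character
// of lookahead needed to find the end of an identifier or number.
class ScannerSketch {
   private PushbackReader in;

   ScannerSketch(Reader source) {
      in = new PushbackReader(source);
   }

   // Collect an identifier or keyword, given that its first character
   // has already been read and is known to be a letter.
   String readWord(int first) throws IOException {
      StringBuilder sb = new StringBuilder();
      sb.append((char) first);
      int ch = in.read();
      while (ch != -1 && Character.isLetterOrDigit((char) ch)) {
         sb.append((char) ch);
         ch = in.read();
      }
      if (ch != -1) {
         in.unread(ch);   // push back the character that ended the word
      }
      return sb.toString();
   }
}
```

A full nextToken() would dispatch on the first non-whitespace character: letters lead to readWord (and a keyword lookup), digits to an analogous readNumber, and everything else to the punctuation and operator cases.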

Be sure your program is well written, formatted neatly, contains appropriate comments, etc.  Use public and private to control access to information, particularly to hide implementation details that should not be visible outside a class definition.

Keep things simple.  One could imagine a Token class hierarchy with an abstract class Token, a separate subclass for each kind of Token, each containing a separate toString, and hierarchies of classes for different groups of operators (addition operators, multiplication operators, etc.).  One could imagine it, but it's probably not a great idea.  Tokens are simple data objects; even with the symbolic constants for each kind of Token, the entire class definition should fit on a page or two.  A hierarchy containing 15 or 20 public Token classes and subclasses would be 15 or 20 files and fill many pages when printed.   Without some truly compelling reason to do this, it's hard to imagine why it would be worth the added complexity.