CSE 413 Spring 2000

Assignment 6 -- D Lexical Analyzer

Due:  Wednesday, May 17 at the beginning of class. You should be able to finish this assignment well ahead of time, and the next assignment will be handed out before this one is due.  You may work with a partner on this assignment.  If so, you should continue working with the same partner you had for the previous compiler assignment.

Written problems

Here are a few warmup problems involving context free grammars for the next assignment, the parser.  You should do these problems individually, even if you have a partner for the compiler project.

(Sources: Aho, Sethi, & Ullman Compilers...; Appel Modern Compiler Implementation...)

  1. Consider the following grammar:
                   S ::= ( L ) | a
                   L ::= L, S | S
    1. What are the terminals, nonterminals, and start symbol for this grammar?
    2. Draw parse trees for the following sentences:
      1. (a, a)
      2. (a, (a, a))
    3. Give a leftmost derivation of the sentences in (b).
  2. Give a context free grammar that generates the language consisting of balanced parentheses and square brackets.  Example: ([[](()[()][])])[]
  3. (optional)  Construct a context-free grammar for roman numerals.
  4. (optional)  Write a grammar for English sentences using the words
                        time, arrow, banana, flies, like, a, an, the, fruit
    and the semicolon.  Be sure to include all the senses (noun, verb, etc.) of each word.  Then show that this grammar is ambiguous by exhibiting more than one parse tree for "time flies like an arrow; fruit flies like a banana."

Programming project - Scanner

The purpose of this part of the assignment is to construct a scanner for the D language, which is defined in a separate handout.  The scanner, as well as the rest of your compiler, should be written in Java.  The code should not just work properly, but it should also be readable and well-organized.

Scanner Organization

A scanner (or lexical analyzer) reads the character (text) representation of a program and transforms it into a stream of tokens representing the basic lexical items in the language.   These tokens include  punctuation (lparen, rparen, semicolon, ...), keywords (if, int, return, ...),  integers, and identifiers.  The scanner should skip over whitespace and comments in the source code; these do not appear in the token stream.

The scanner should be defined as a Java class, and should provide a nextToken() method that the client program can call to obtain tokens sequentially.  The scanner class should include an appropriate test program.  This test program should use the EchoIO class from the previous assignment to provide line-oriented input and output operations for the scanner.  The test program should read tokens from a D source program one at a time, and print the tokens to the output file.  Source program lines should be echoed to the output file as they are read, to make it easier to see the correlation between the source code and the tokens.  This echoing of input lines should be handled automatically by EchoIO.   A first approximation to the test program might look something like this:

   // scanner test program
   public static void main (String [] args) {
      // create I/O object and open files
      EchoIO eio = new EchoIO(...);
      // set eio to turn input echoing on, etc.
      ...
      // create scanner object to read tokens
      Scanner scan = new Scanner(eio);
      
      // read source program and print token stream
      do {
         Token t = scan.nextToken();
         write string version of t to output file;
      } while (t.kind != Token.EOF);
   }

(Note that this is one of the few situations where a do-while loop is appropriate, as opposed to a regular while or for.)  Class Scanner should hide all details of I/O from its clients, including handling any possible I/O exceptions.

You will also need to define a class Token to represent the lexical tokens.   Each Token object should include a field to store the lexical class of the Token (id, int, lparen, not, ...).  Class Token should include appropriate symbolic constant names (static final ints) for the various lexical classes.  Tokens for identifiers and integers need to contain additional information: the String representation of the identifier or the numeric (int) value of the integer.  Since there's little space overhead involved, it's reasonable to have an int and a String field in every Token and ignore these fields if the Token is something other than an integer or identifier.  See the course notes on Lexical Analysis for additional information and suggestions.

Implementation

Take advantage of features of Java when you write your code.  For example, Java classes should normally contain a toString() method that yields a String representation of instances of that class.  If your Token class contains an appropriate toString, then the test program can write the tokens directly to a text stream without having to decode them.

While Java provides classes StreamTokenizer and StringTokenizer to break input streams or Strings into tokens, for this assignment you should implement the scanner without them, examining input characters, and breaking the input into individual tokens manually.  Normally, a Java programmer would use a tokenizer class, if it was suitable for the problem, but the purpose of this assignment is to understand how such things are done, and implementing the details will provide more insight than calling library functions written by someone else.

The Java library provides several functions that you may find useful.  Class Character contains methods for classifying characters.  Class Integer contains a constructor that takes a string argument containing digits and returns the corresponding Integer value.

Be sure your program is well written, formatted neatly, contains appropriate comments, etc.  Be careful to precisely specify the state of the scanner in comments describing variables, particularly exactly how far the scan has progressed on the current line, where the next unprocessed character is, and so forth.   Use public and private to control access to information, particularly to hide implementation details that should not be visible outside the scanner.

Keep things simple.  One could imagine a Token class hierarchy with an abstract class Token, a separate subclass for each kind of Token, each containing a separate toString, and hierarchies of classes for different groups of operators (addition operators, multiplication operators, etc.).  One could imagine it, but it's probably not a great idea.  Tokens are simple data objects; even with the symbolic constants for each kind of Token, the entire class definition should fit on a page or two.  A hierarchy containing 15 or 20 Token classes and subclasses would be 15 or 20 files and fill many pages when printed.   Without some truly compelling reason to do this, it's hard to imagine why it would be worth the added complexity.

Project Turnin

Turn in your program electronically using the links below.

Hand in the online receipt along with test output and your written problems in class on May 17.