CSE 413 Winter 2002

Assignment 6 -- Parsing & D Lexical Analyzer

Electronic turnin of programming part due Tuesday, Feb. 26, by 10:00 pm.
Test output, and written problems due in class Wednesday, Feb. 27.

Written problems

Here are a few warmup problems involving context free grammars for the next assignment, the parser. You should do these problems individually, even if you have a partner for the compiler project.

Consider the following grammar:
A ::= SbA | SS | ba
S ::= aAS | a

What are the terminals, nonterminals, and start symbol for this grammar?
Draw parse trees for the following sentences:

aabbaa
aabaaabaa

Give a leftmost derivation of the sentences in (b).

Give a context free grammar that generates
1. The language containing strings with an equal number of a's and b's.
2. The language containing strings of the form ww^Rcaⁿb²ⁿ, where w is a string of 1 or more a's and b's ( (a|b)+ ); w^R denotes the reversal of w, and the superscript n stands for any positive integer, meaning that the end of the string should contain 1 or more a's followed by twice as many b's as a's.
(optional) Construct a context-free grammar for roman numerals.
(optional) Write a grammar for English sentences using the words
time, arrow, banana, flies, like, a, an, the, fruit
and the semicolon. Be sure to include all the senses (noun, verb, etc.) of each word. Then show that this grammar is ambiguous by exhibiting more than one parse tree for "time flies like an arrow; fruit flies like a banana."

Programming project - Scanner

The purpose of this part of the assignment is to construct a scanner for the D language, which is defined in a separate handout. The scanner, as well as the rest of your compiler, should be written in Java. The code should not just work properly, but it should also be readable and well-organized.

Scanner Organization

A scanner (or lexical analyzer) reads the character (text) representation of a program and transforms it into a stream of tokens representing the basic lexical items in the language. These tokens include punctuation (lparen, rparen, semicolon, ...), keywords (if, int, return, ...), integers, and identifiers. The scanner should skip over whitespace and comments in the source code; these are ignored by the compiler and should not appear in the token stream that will eventually be read by the parser.

The scanner should be defined as a Java class Scanner, and should provide a nextToken() method that the client program can call to obtain tokens sequentially. The scanner class should include an appropriate test program. This test program should use the CompilerIO class from the previous assignment for line-oriented input and output operations that read and write the input and output files. The test program should read tokens from a D source program one at a time, and print the tokens to the output file. Source program lines should be echoed to the output file as they are read, to make it easier to see the correlation between the source code and the tokens. Echoing of the input lines should be handled automatically by CompilerIO. A first approximation to the test program might look something like this:

   // scanner test program
   public static void main (String [] args) {
      // create I/O object, which opens files when it is constructed
      CompilerIO cio = new CompilerIO(...);
      // set cio to turn input echoing on, etc.
      ...
      // create scanner object to read tokens
      Scanner scan = new Scanner(cio);
      
      // read source program and print token stream
      do {
         Token t = scan.nextToken();
         write string version of t to output file;
      } while (t.kind != Token.EOF);
   }

(Note that this is one of the situations where a do-while loop is appropriate, as opposed to a regular while or for. We know there will be at least one token in the stream.)

You will also need to define a class Token to represent the lexical tokens. Each Token object should include a field to store the lexical class of the Token (id, int, lparen, not, ...). Class Token should include appropriate symbolic constant names (static final ints) for the various lexical classes. Tokens for identifiers and integers need to contain additional information: the String representation of the identifier or the numeric (int) value of the integer. Since there's little space overhead involved, it's reasonable to have an int and a String field in every Token and ignore these fields if the Token is something other than an integer or identifier. Class Token should contain an appropriate toString() method that returns a descriptive string representation of a Token. See the course notes on Lexical Analysis for additional information and suggestions.

Implementation

Take advantage of features of Java when you write your code. Assuming your Token class contains an appropriate toString method, the test program can write the tokens directly to the output text file without having to decode them. Similarly, you'll need a table of language keywords and their corresponding lexical tokens. Use a Java HashMap for this; don't re-implement a symbol table from scratch.

While Java provides classes StreamTokenizer and StringTokenizer to break input streams or Strings into tokens, for this assignment you must implement the scanner without them. Your scanner should examine the source program one character at a time and decode the input into individual tokens manually. Normally, a Java programmer would use a tokenizer class, if it was suitable for the problem, but the purpose of this assignment is to understand how such things are done.

The Java library provides several functions that you may find useful. Class Character contains methods for classifying characters. Class Integer contains a constructor that takes a string argument containing digits and returns the corresponding Integer value.

Be sure your program is well written, formatted neatly, contains appropriate comments, etc. Be careful to precisely specify the state of the scanner in comments describing variables, particularly exactly how far the scan has progressed on the current line, where the next unprocessed character is, and so forth. Use public and private to control access to information; be sure to hide implementation details that should not be visible outside the scanner.

Keep things simple. One could imagine a Token class hierarchy with an abstract class Token, a separate subclass for each kind of Token, each containing a separate toString, and hierarchies of classes for different groups of operators (addition operators, multiplication operators, etc.). One could imagine it, but it's probably not a great idea. Tokens are simple data objects; even with the symbolic constants for each kind of Token, the entire class definition should fit on a page or two. A hierarchy containing 15 or 20 Token classes and subclasses would be 15 or 20 files and fill many pages when printed. Without some truly compelling reason to do this, it's hard to imagine why it would be worth the added complexity.

Project Turnin

Turn in your program electronically using this turnin form.

You should hand in some examples of test input and output that demonstrate that your class works, and your answers to the written problems in lecture on February 27.