Due: Wednesday, May 17 at the beginning of class. You should be able to finish this assignment well ahead of time, and the next assignment will be handed out before this one is due. You may work with a partner on this assignment. If so, you should continue working with the same partner you had for the previous compiler assignment.
Here are a few warmup problems involving context free grammars for the next assignment, the parser. You should do these problems individually, even if you have a partner for the compiler project.
(Sources: Aho, Sethi, & Ullman Compilers...; Appel Modern Compiler Implementation...)
([[](()[()][])])[]
time, arrow, banana, flies, like, a, an, the, fruit
The purpose of this part of the assignment is to construct a scanner for the D language, which is defined in a separate handout. The scanner, as well as the rest of your compiler, should be written in Java. The code should not just work properly, but it should also be readable and well-organized.
A scanner (or lexical analyzer) reads the character (text) representation of a program and transforms it into a stream of tokens representing the basic lexical items in the language. These tokens include punctuation (lparen, rparen, semicolon, ...), keywords (if, int, return, ...), integers, and identifiers. The scanner should skip over whitespace and comments in the source code; these do not appear in the token stream.
The scanner should be defined as a Java class, and should provide a nextToken()
method that the client program can call to obtain tokens sequentially. The
scanner class should include an appropriate test program.
This test program should use the EchoIO
class from the previous
assignment to provide line-oriented input and output operations for the
scanner. The test program should read tokens from a D source program one
at a time, and print the tokens to the output file. Source program lines
should be echoed to the output file as they are read, to make it easier to see the correlation between the source code and the tokens.
This echoing of input lines should be handled automatically by EchoIO
. A first
approximation to the test program might look something like this:
// scanner test program public static void main (String [] args) { // create I/O object and open files EchoIO eio = new EchoIO(...); // set eio to turn input echoing on, etc. ... // create scanner object to read tokens Scanner scan = new Scanner(eio); // read source program and print token stream do { Token t = scan.nextToken(); write string version of t to output file; } while (t.kind != Token.EOF); }
(Note that this is one of the few situations where a do-while
loop is appropriate, as opposed to a regular while
or for
.)
Class Scanner
should hide all details of I/O from its clients,
including handling any possible I/O exceptions.
You will also need to define a class Token to represent the lexical tokens. Each Token object should include a field to store the lexical class of the Token (id, int, lparen, not, ...). Class Token should include appropriate symbolic constant names (static final ints) for the various lexical classes. Tokens for identifiers and integers need to contain additional information: the String representation of the identifier or the numeric (int) value of the integer. Since there's little space overhead involved, it's reasonable to have an int and a String field in every Token and ignore these fields if the Token is something other than an integer or identifier. See the course notes on Lexical Analysis for additional information and suggestions.
Take advantage of features of Java when you write your code. For example, Java classes should normally contain a toString() method that yields a String representation of instances of that class. If your Token class contains an appropriate toString, then the test program can write the tokens directly to a text stream without having to decode them.
While Java provides classes StreamTokenizer and StringTokenizer
to break input streams or Strings
into tokens, for this
assignment you should implement the scanner without them, examining input
characters, and breaking the input into individual tokens manually.
Normally, a Java programmer would use a tokenizer class, if it was suitable for
the problem, but the purpose of this
assignment is to understand how such things are done, and implementing the
details will provide more insight than calling library functions written by
someone else.
The Java library provides several functions that you may find useful.
Class Character
contains methods for classifying characters. Class Integer contains a constructor that takes a string
argument containing digits and returns the corresponding Integer value.
Be sure your program is well written, formatted neatly, contains appropriate comments, etc. Be careful to precisely specify the state of the scanner in comments describing variables, particularly exactly how far the scan has progressed on the current line, where the next unprocessed character is, and so forth. Use public and private to control access to information, particularly to hide implementation details that should not be visible outside the scanner.
Keep things simple. One could imagine a Token class hierarchy with an abstract class Token, a separate subclass for each kind of Token, each containing a separate toString, and hierarchies of classes for different groups of operators (addition operators, multiplication operators, etc.). One could imagine it, but it's probably not a great idea. Tokens are simple data objects; even with the symbolic constants for each kind of Token, the entire class definition should fit on a page or two. A hierarchy containing 15 or 20 Token classes and subclasses would be 15 or 20 files and fill many pages when printed. Without some truly compelling reason to do this, it's hard to imagine why it would be worth the added complexity.
Turn in your program electronically using the links below.
Hand in the online receipt along with test output and your written problems in class on May 17.