Electronic turnin of programming part due Tuesday, Feb. 20, by 10:00 pm.
Turnin receipt, test output, and written problems due in class Wednesday, Feb.
21.
Here are a few warmup problems involving context free grammars for the next assignment, the parser. You should do these problems individually, even if you have a partner for the compiler project.
(Source: Aho & Ullman, Theory of Parsing..., Appel, Modern Compiler Implementation...)
time, arrow, banana, flies, like, a, an, the, fruit
The purpose of this part of the assignment is to construct a scanner for the D language, which is defined in a separate handout. The scanner, as well as the rest of your compiler, should be written in Java. The code should not just work properly, but it should also be readable and well-organized.
A scanner (or lexical analyzer) reads the character (text) representation of a program and transforms it into a stream of tokens representing the basic lexical items in the language. These tokens include punctuation (lparen, rparen, semicolon, ...), keywords (if, int, return, ...), integers, and identifiers. The scanner should skip over whitespace and comments in the source code; these are ignored by the compiler and do not appear in the token stream.
The scanner should be defined as a Java class, and should provide a nextToken()
method that the client program can call to obtain tokens sequentially. The
scanner class should include an appropriate test program.
This test program should use the EchoIO
class from the previous
assignment to obtain line-oriented input and output operations for the
scanner. The test program should read tokens from a D source program one
at a time, and print the tokens to the output file. Source program lines
should be echoed to the output file as they are read, to make it easier to see the correlation between the source code and the tokens.
Echoing of the input lines should be handled automatically by EchoIO
. A first
approximation to the test program might look something like this:
// scanner test program public static void main (String [] args) { // create I/O object and open files EchoIO eio = new EchoIO(...); // set eio to turn input echoing on, etc. ... // create scanner object to read tokens Scanner scan = new Scanner(eio); // read source program and print token stream do { Token t = scan.nextToken(); write string version of t to output file; } while (t.kind != Token.EOF); }
(Note that this is one of the situations where a do-while
loop is appropriate, as opposed to a regular while
or for
.
We know there will be at least one token in the stream.)
You will also need to define a class Token to represent the lexical tokens. Each Token object should include a field to store the lexical class of the Token (id, int, lparen, not, ...). Class Token should include appropriate symbolic constant names (static final ints) for the various lexical classes. Tokens for identifiers and integers need to contain additional information: the String representation of the identifier or the numeric (int) value of the integer. Since there's little space overhead involved, it's reasonable to have an int and a String field in every Token and ignore these fields if the Token is something other than an integer or identifier. See the course notes on Lexical Analysis for additional information and suggestions.
Take advantage of features of Java when you write your code. For example, Java classes should normally contain a toString() method that yields a String representation of instances of that class. If your Token class contains an appropriate toString, then the test program can write the tokens directly to a text stream without having to decode them.
While Java provides classes StreamTokenizer and StringTokenizer
to break input streams or Strings
into tokens, for this
assignment you must implement the scanner without them. Your scanner
should examine the source program one character at a time and decode the input into individual tokens manually.
Normally, a Java programmer would use a tokenizer class, if it was suitable for
the problem, but the purpose of this
assignment is to understand how such things are done.
The Java library provides several functions that you may find useful.
Class Character
contains methods for classifying characters. Class Integer contains a constructor that takes a string
argument containing digits and returns the corresponding Integer value.
Be sure your program is well written, formatted neatly, contains appropriate comments, etc. Be careful to precisely specify the state of the scanner in comments describing variables, particularly exactly how far the scan has progressed on the current line, where the next unprocessed character is, and so forth. Use public and private to control access to information; be sure to hide implementation details that should not be visible outside the scanner.
Keep things simple. One could imagine a Token class hierarchy with an abstract class Token, a separate subclass for each kind of Token, each containing a separate toString, and hierarchies of classes for different groups of operators (addition operators, multiplication operators, etc.). One could imagine it, but it's probably not a great idea. Tokens are simple data objects; even with the symbolic constants for each kind of Token, the entire class definition should fit on a page or two. A hierarchy containing 15 or 20 Token classes and subclasses would be 15 or 20 files and fill many pages when printed. Without some truly compelling reason to do this, it's hard to imagine why it would be worth the added complexity.
Turn in your program electronically using this turnin form.
Hand in the online receipt along with test output and your written problems in class on February 21.