Electronic turnin of programming part due Tuesday, Feb. 26, by 10:00 pm.
Test output and written problems are due in class Wednesday, Feb. 27.
Here are a few warmup problems involving context-free grammars for the next assignment, the parser. You should do these problems individually, even if you have a partner for the compiler project.
time, arrow, banana, flies, like, a, an, the, fruit
The purpose of this part of the assignment is to construct a scanner for the D language, which is defined in a separate handout. The scanner, as well as the rest of your compiler, should be written in Java. The code should not just work properly, but it should also be readable and well-organized.
A scanner (or lexical analyzer) reads the character (text) representation of a program and transforms it into a stream of tokens representing the basic lexical items in the language. These tokens include punctuation (lparen, rparen, semicolon, ...), keywords (if, int, return, ...), integers, and identifiers. The scanner should skip over whitespace and comments in the source code; these are ignored by the compiler and should not appear in the token stream that will eventually be read by the parser.
The scanner should be defined as a Java class Scanner, and should provide a nextToken() method that the client program can call to obtain tokens sequentially. The scanner class should include an appropriate test program.

This test program should use the CompilerIO class from the previous assignment for line-oriented input and output operations that read and write the input and output files. The test program should read tokens from a D source program one at a time and print the tokens to the output file. Source program lines should be echoed to the output file as they are read, to make it easier to see the correlation between the source code and the tokens. Echoing of the input lines should be handled automatically by CompilerIO. A first approximation to the test program might look something like this:
// scanner test program
public static void main(String[] args) {
    // create I/O object, which opens files when it is constructed
    CompilerIO cio = new CompilerIO(...);
    // set cio to turn input echoing on, etc.
    ...
    // create scanner object to read tokens
    Scanner scan = new Scanner(cio);
    // read source program and print token stream
    Token t;                           // declared before the loop so the
    do {                               // while condition can refer to it
        t = scan.nextToken();
        // write string version of t (its toString()) to the output file
    } while (t.kind != Token.EOF);
}
(Note that this is one of the situations where a do-while loop is appropriate, as opposed to a regular while or for. We know there will be at least one token in the stream.)
You will also need to define a class Token to represent the lexical tokens. Each Token object should include a field to store the lexical class of the Token (id, int, lparen, not, ...). Class Token should include appropriate symbolic constant names (static final ints) for the various lexical classes. Tokens for identifiers and integers need to contain additional information: the String representation of the identifier or the numeric (int) value of the integer. Since there's little space overhead involved, it's reasonable to have an int and a String field in every Token and ignore these fields if the Token is something other than an integer or identifier. Class Token should contain an appropriate toString() method that returns a descriptive string representation of a Token. See the course notes on Lexical Analysis for additional information and suggestions.
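As a rough sketch only (the particular constant names, fields, and output format below are illustrative choices, not requirements), a Token class along these lines would be enough to get started:

// Sketch of a simple Token class.  The constant names, fields, and
// toString format shown here are illustrative, not required.
public class Token {
    // lexical classes (add one constant per kind of token in D)
    public static final int EOF    = 0;
    public static final int ID     = 1;
    public static final int INT    = 2;
    public static final int LPAREN = 3;
    public static final int RPAREN = 4;
    // ... and so on for the other punctuation, operators, and keywords

    public int kind;       // which lexical class this token belongs to
    public String id;      // identifier text, used only when kind == ID
    public int value;      // numeric value, used only when kind == INT

    public Token(int kind) {
        this.kind = kind;
    }

    public String toString() {
        switch (kind) {
            case ID:  return "id(" + id + ")";
            case INT: return "int(" + value + ")";
            case EOF: return "eof";
            default:  return "token " + kind;
        }
    }
}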
Take advantage of features of Java when you write your code. Assuming your Token class contains an appropriate toString method, the test program can write the tokens directly to the output text file without having to decode them. Similarly, you'll need a table of language keywords and their corresponding lexical tokens. Use a Java HashMap for this; don't re-implement a symbol table from scratch.
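For example, the keyword table might be filled in once when the Scanner class is loaded. Something like the following fragment would do; it assumes import java.util.HashMap; and import java.util.Map; at the top of the file, and that your Token class defines a constant for each keyword (the names Token.IF, Token.INT_KW, Token.RETURN below are just illustrative):

// Sketch of a keyword table inside the Scanner class.
private static final Map<String, Integer> keywords = new HashMap<String, Integer>();
static {
    keywords.put("if",     Token.IF);
    keywords.put("int",    Token.INT_KW);
    keywords.put("return", Token.RETURN);
    // ... one entry for each keyword in D
}

// Once a word has been scanned, look it up: if it is in the table it is a
// keyword, otherwise it is an ordinary identifier.
private Token makeWordToken(String word) {
    Integer kw = keywords.get(word);
    if (kw != null) {
        return new Token(kw.intValue());
    }
    Token t = new Token(Token.ID);
    t.id = word;
    return t;
}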
While Java provides classes StreamTokenizer and StringTokenizer to break input streams or Strings into tokens, for this assignment you must implement the scanner without them. Your scanner should examine the source program one character at a time and decode the input into individual tokens manually. Normally, a Java programmer would use a tokenizer class if it were suitable for the problem, but the purpose of this assignment is to understand how such things are done.
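In outline, then, nextToken() is a dispatch on the next unread character. The sketch below assumes the scanner keeps the current source line in a String plus an index of the next unprocessed character; the helper methods named here (skipWhitespaceAndComments, scanInteger, scanWord, scanOperatorOrPunct) are hypothetical routines you would write yourself, and how the next line is obtained from CompilerIO is left to your earlier assignment:

// Sketch of character-at-a-time dispatch.  The line/pos fields and the
// helper methods are illustrative, not required names.
private String line;   // current source line, or null after end of input
private int pos;       // index of the next unprocessed character in line

public Token nextToken() {
    skipWhitespaceAndComments();          // may advance pos and read new lines;
                                          // assumed to set line to null at end of input
    if (line == null) {
        return new Token(Token.EOF);      // no more input
    }
    char c = line.charAt(pos);
    if (Character.isDigit(c)) {
        return scanInteger();             // digits -> integer token
    } else if (Character.isLetter(c)) {
        return scanWord();                // letters -> keyword or identifier
    } else {
        return scanOperatorOrPunct();     // punctuation and operators
    }
}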
The Java library provides several methods that you may find useful. Class Character contains methods for classifying characters. Class Integer provides a constructor and the static method Integer.parseInt, both of which take a string of digits and produce the corresponding integer value.
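For example, scanning an integer literal comes down to a few of these library calls. This sketch reuses the hypothetical line and pos fields from the earlier outline:

// Sketch of scanning an integer literal.  Character.isDigit classifies each
// character and Integer.parseInt converts the scanned digits to an int.
private Token scanInteger() {
    int start = pos;
    while (pos < line.length() && Character.isDigit(line.charAt(pos))) {
        pos++;
    }
    Token t = new Token(Token.INT);
    t.value = Integer.parseInt(line.substring(start, pos));
    return t;
}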
Be sure your program is well written, formatted neatly, contains appropriate comments, etc. In the comments describing your variables, precisely specify the state of the scanner: in particular, exactly how far the scan has progressed on the current line, where the next unprocessed character is, and so forth. Use public and private to control access to information; be sure to hide implementation details that should not be visible outside the scanner.
Keep things simple. One could imagine a Token class hierarchy with an abstract class Token, a separate subclass for each kind of Token, each containing a separate toString, and hierarchies of classes for different groups of operators (addition operators, multiplication operators, etc.). One could imagine it, but it's probably not a great idea. Tokens are simple data objects; even with the symbolic constants for each kind of Token, the entire class definition should fit on a page or two. A hierarchy containing 15 or 20 Token classes and subclasses would be 15 or 20 files and fill many pages when printed. Without some truly compelling reason to do this, it's hard to imagine why it would be worth the added complexity.
Turn in your program electronically using this turnin form.
You should hand in some examples of test input and output that demonstrate that your class works, and your answers to the written problems in lecture on February 27.