Due: Partner selection due by Friday Nov. 16th at 11:59pm. (See below for details.) Electronic turnin due no later than 8am, Wednesday, Nov. 21. Printouts of your code and tests, and answers to written questions due at the beginning of class on Wednesday, Nov. 21.
This is the first of a sequence of assignments that will result in a small, but complete compiler. You may work with a partner on this project. If you do so, you should plan on working with the same person for the entire project.
If working with a partner, you must find your partner and send a single email (i.e. only one partner needs to do this) to the instructor with the name and UW email id of both partners by 11:59pm Friday November 16th. (If you are planning on working alone, you do not need to send email.) Please use this link.
These problems cover some of the basics of regular expressions. You should do these problems individually, even if you have a partner for the compiler project.
Note: It may be useful to use the following notation to exclude a character. Use [^c] to mean: {sigma - c} where sigma is the alphabet. For example, a character string in the programming language C could be described as "[^"]*". In otherwords, zero or more characters (other than the double quote symbol) surrounded by double quotes.
Write a set of regular expressions that generate C integer constants as described above.An integer constant consisting of a sequence of digits is taken to be octal if it begins with 0 (digit zero), decimal otherwise. Octal constants do not contain the digits 8 or 9. A sequence of digits preceded by 0x or 0X (digit zero) is taken to be a hexadecimal integer. The hexadecimal digits include a or A through f or F with values 10 through 15.
An integer constant may be suffixed with the letter u or U, to specify that it is unsigned. It may also be suffixed by the letter l or L to specify that it is long.
The purpose of this part of the assignment is to construct a scanner for the D language, which is defined in a separate handout. The scanner, as well as the rest of your compiler, should be written in C. The code should not only work properly, but it should also be readable and well-organized.
A scanner (or lexical analyzer) reads the character (text) representation of a program and transforms it into a sequence of tokens representing the basic lexical items in the language. These tokens include punctuation (lparen, rparen, semicolon, ...), keywords (if, int, return, ...), integers, and identifiers. The scanner should skip over whitespace and comments in the source code; these are ignored by the compiler and should not appear in the token stream that will eventually be read by the parser.
The main interface to the scanner is a function next_token()
that
returns the next lexical token from the source program each time it is called.
After the last token has been returned, further calls to next_token()
should
return an "end-of-file" token to indicate that there is no more input.
To test the scanner, you should write a main program that reads successive tokens from an input file one at a time and prints them to the output file. We are supplying a fair amount of starter code for this project, including header files for the scanner and tokens, and a set of I/O routines that handle the details of opening the correct files, reading and writing text lines, and that also automatically copy input lines to the output file as assembly-language comments. This later copying will make it easier to see the correlation between the input source program and the tokens returned by the scanner and printed by the test program.
You should download the following files and use them as a starting point for your scanner project.
io.h
and io.c
: These specify and implement a set of I/O routines that open
files and provide operations to read and write text lines for the compiler.
The
header
file contains complete documentation, but briefly, your program should call
io_init
when it starts and pass it the arguments supplied to
your main program. By default, input will be read from stdin
and
output written to stdout
. But if two arguments are supplied
on the command line, the first is taken to be the input file name and the
second is the output file
name; stdin
and
stdout
will be redirected to these files. If a single argument
is supplied, it is taken to be the input file name, and a file with the same
name but a ".s" extension will be opened for output.io_init
has opened the files,
use the supplied getline
and putline
functions to read
and write data as needed. In particular, the scanner should use getline
to read
each line from the source program.token.h
and token.c
:
These give a partial specification and implementation of a token data structure.
You will need to complete these by adding definitions and implementations
for missing lexical classes that appear in D programs but are not already
included in the code.scan.h
: This specifies the single next_token
method
that is to be implemented by the scanner. You should include this file in
your scanner implementation and
in any files that use it.For the most part the scanner is a reasonably straightforward character and text manipulation program. A couple of suggestions:
<string.h>
, header <ctype.h>
specifies
functions for classifying characters, and <stdlib.h>
has
functions that convert strings of digits to ints and back.'\0'
characters
appear in all the right places, etc. Obviously you should check for things
that are important to the compiler logic, but don't spend lots of time worrying
about buffer overruns and other problems for this project. Save that for
when you are writing production code that needs to be bullet-proof.Electronic: Turn in your program electronically using the link on the assignments page. If you worked with a partner only one person should submit.
Paper: You should hand in some examples of test input and output that demonstrate that your scanner works, and your answers to the written problems in class on Nov. 21. Note that if you worked with a partner on the compiler each partner must still turn in solutions to the written problems individually.
Added 11/14/07:
Added 11/16/07:
Makefile
to use for the
project. Notes: If in doubt, make clean, then make things over again
from scratch. Feel free to modify this as you would like and
certainly as the project goes on. Obviously you can use whatever
name you would like for your test file. I put all of the .h files as
dependencies for each of the .o files but this is actually much more
than you need. You can make depend to see what the actual
dependencies are (for example if token.h changes then scan.o should be
rebuilt because you may have changed what a token is or some of the
#define values for tokens).
hello.s
token output for
the hello.d
sample
file. You do not need to put each token on its own line or have
format like this, or even the exact same token names but you should
have the same number of tokens and you should have ignored comments
and whitespace etc.
Added 11/19/07: