Due: Electronic turnin due no later than 11:00 pm, Tuesday, Nov. 21. Printouts of your code and tests, and answers to written questions due at the beginning of class on Wednesday, Nov. 22.
This is the first of a sequence of assignments that will result in a small, but complete compiler. You may work with a partner on this project. If you do so, you should plan on working with the same person for the entire project.
These problems cover some of the basics of regular expressions. You should do these problems individually, even if you have a partner for the compiler project.
For each of the following regular expressions, (i) give an example of two strings that can be generated by the expression and two that cannot be generated but use the same alphabet (if possible), and (ii) give an English description of the set of strings generated (for example, "all strings consisting of 1 or more w's followed by xyz").
Give regular expressions that will generate the following strings.
A floating constant consists of an integer part, a decimal point, a fraction part, an e or E, an optionally signed integer exponent and an optional type suffix, one of f, F, l, or L. The integer and fraction parts both consist of a sequence of digits. Either the integer part or the fractional part (not both) may be missing; either the decimal point or the e and the exponent (not both) may be missing. The type is determined by the suffix; F or f makes it float, L or l makes it long double; otherwise it is double.
(Optional) Give the diagram of a finite automata (state diagram) that recognizes the regular expressions for C floating constants given in your answer to question 3.
The purpose of this part of the assignment is to construct a scanner for the D language, which is defined in a separate handout. The scanner, as well as the rest of your compiler, should be written in C. The code should not only work properly, but it should also be readable and well-organized.
A scanner (or lexical analyzer) reads the character (text) representation of a program and transforms it into a sequence of tokens representing the basic lexical items in the language. These tokens include punctuation (lparen, rparen, semicolon, ...), keywords (if, int, return, ...), integers, and identifiers. The scanner should skip over whitespace and comments in the source code; these are ignored by the compiler and should not appear in the token stream that will eventually be read by the parser.
The main interface to the scanner is a function next_token()
that
returns the next lexical token from the source program each time it is called.
After the last token has been returned, further calls to next_token()
should
return an "end-of-file" token to indicate that there is no more input.
To test the scanner, you should write a main program that reads successive tokens from an input file one at a time and prints them to the output file. We are supplying a fair amount of starter code for this project, including header files for the scanner and tokens, and a set of I/O routines that handle the details of opening the correct files, reading and writing text lines, and that also automatically copy input lines to the output file as assembly-language comments. This later copying will make it easier to see the correlation between the input source program and the tokens returned by the scanner and printed by the test program.
You should download the following files and use them as a starting point for your scanner project.
io.h
and io.c
: These specify and implement a set of I/O routines that open
files and provide operations to read and write text lines for the compiler.
The
header
file contains complete documentation, but briefly, your program should call
io_init
when it starts and pass it the arguments supplied to
your main program. By default, input will be read from stdin
and
output written to stdout
. But if two arguments are supplied
on the command line, the first is taken to be the input file name and the
second is the output file
name; stdin
and
stdout
will be redirected to these files. If a single argument
is supplied, it is taken to be the input file name, and a file with the same
name but a ".s" extension will be opened for output.io_init
has opened the files,
use the supplied getline
and putline
functions to read
and write data as needed. In particular, the scanner should use getline
to read
each line from the source program.token.h
and token.c
:
These give a partial specification and implementation of a token data structure.
You will need to complete these by adding definitions and implementations
for missing lexical classes that appear in D programs but are not already
included in the code.scan.h
: This specifies the single next_token
method
that is to be implemented by the scanner. You should include this file in
your scanner implementation and
in any files that use it.For the most part the scanner is a reasonably straightforward character and text manipulation program. A couple of suggestions:
<string.h>
, header <ctype.h>
specifies
functions for classifying characters, and <stdlib.h>
has
functions that convert strings of digits to ints and back.'\0'
characters
appear in all the right places, etc. Obviously you should check for things
that are important to the compiler logic, but don't spend lots of time worrying
about buffer overruns and other problems for this project. Save that for
when you are writing production code that needs to be bullet-proof.Turn in your program electronically using this online turnin form.
You should hand in some examples of test input and output that demonstrate that your scanner works, and your answers to the written problems in class on Nov. 22.