CSE 413 Autumn 2006 -- Assignment 6

Regular Expressions & Compiler Scanner

Due: Electronic turnin due no later than 11:00 pm, Tuesday, Nov. 21. Printouts of your code and tests, and answers to written questions due at the beginning of class on Wednesday, Nov. 22.

Overview

This is the first of a sequence of assignments that will result in a small, but complete compiler. You may work with a partner on this project. If you do so, you should plan on working with the same person for the entire project.

Written problems

These problems cover some of the basics of regular expressions. You should do these problems individually, even if you have a partner for the compiler project.

For each of the following regular expressions, (i) give an example of two strings that can be generated by the expression and two that cannot be generated but use the same alphabet (if possible), and (ii) give an English description of the set of strings generated (for example, "all strings consisting of 1 or more w's followed by xyz").
1. (0*1*)*
2. (1|10)*
3. (0|1)*0(0|1)(0|1)
Give regular expressions that will generate the following strings.
1. All strings of lower case letters in the range a-f where the letters that do appear are in ascending lexicographic order.
2. All strings of 0's and 1's that do not contain 011 as a substring. (Hint: sometimes it's easier to visualize problems like this by using finite automata to get an idea of what characters should or should not be allowed to follow others.)
Here is the description of floating constants from the C reference manual (by Kernighan and Ritche). Write a set of regular expressions that generate floating constants as described here.
A floating constant consists of an integer part, a decimal point, a fraction part, an e or E, an optionally signed integer exponent and an optional type suffix, one of f, F, l, or L. The integer and fraction parts both consist of a sequence of digits. Either the integer part or the fractional part (not both) may be missing; either the decimal point or the e and the exponent (not both) may be missing. The type is determined by the suffix; F or f makes it float, L or l makes it long double; otherwise it is double.
(Optional) Give the diagram of a finite automata (state diagram) that recognizes the regular expressions for C floating constants given in your answer to question 3.

Programming project - Scanner

The purpose of this part of the assignment is to construct a scanner for the D language, which is defined in a separate handout. The scanner, as well as the rest of your compiler, should be written in C. The code should not only work properly, but it should also be readable and well-organized.

Scanner Organization

A scanner (or lexical analyzer) reads the character (text) representation of a program and transforms it into a sequence of tokens representing the basic lexical items in the language. These tokens include punctuation (lparen, rparen, semicolon, ...), keywords (if, int, return, ...), integers, and identifiers. The scanner should skip over whitespace and comments in the source code; these are ignored by the compiler and should not appear in the token stream that will eventually be read by the parser.

The main interface to the scanner is a function next_token() that returns the next lexical token from the source program each time it is called. After the last token has been returned, further calls to next_token() should return an "end-of-file" token to indicate that there is no more input.

To test the scanner, you should write a main program that reads successive tokens from an input file one at a time and prints them to the output file. We are supplying a fair amount of starter code for this project, including header files for the scanner and tokens, and a set of I/O routines that handle the details of opening the correct files, reading and writing text lines, and that also automatically copy input lines to the output file as assembly-language comments. This later copying will make it easier to see the correlation between the input source program and the tokens returned by the scanner and printed by the test program.

Supplied files

You should download the following files and use them as a starting point for your scanner project.

io.h and io.c: These specify and implement a set of I/O routines that open files and provide operations to read and write text lines for the compiler. The header file contains complete documentation, but briefly, your program should call io_init when it starts and pass it the arguments supplied to your main program. By default, input will be read from stdin and output written to stdout. But if two arguments are supplied on the command line, the first is taken to be the input file name and the second is the output file name; stdin and stdout will be redirected to these files. If a single argument is supplied, it is taken to be the input file name, and a file with the same name but a ".s" extension will be opened for output.
Once io_init has opened the files, use the supplied getline and putline functions to read and write data as needed. In particular, the scanner should use getline to read each line from the source program.
token.h and token.c: These give a partial specification and implementation of a token data structure. You will need to complete these by adding definitions and implementations for missing lexical classes that appear in D programs but are not already included in the code.
scan.h: This specifies the single next_token method that is to be implemented by the scanner. You should include this file in your scanner implementation and in any files that use it.

Implementation

For the most part the scanner is a reasonably straightforward character and text manipulation program. A couple of suggestions:

Take advantage of functions in the C library for processing strings. In addition to the main string library <string.h>, header <ctype.h> specifies functions for classifying characters, and <stdlib.h> has functions that convert strings of digits to ints and back.
A key piece of data that is needed in your scanner is an array of characters to store the current input line from the source program, along with pointers or subscripts to keep track of the scanner's current position in the input line. Define these variables carefully and check your definitions against all of the code that manipulates them. Time spent on careful checking and thinking here will more than repay itself in reduced time spent on debugging later.
You will need to have a list of the keywords (reserved words) in the D language stored somewhere so you can check for them after reading a token that could be an identifier or a keyword. In a real compiler we would probably use a hash table or some other fancy data structure to allow us to look up keywords very quickly. But for this project, simple and correct should be your priority: sequentially searching a list of strings or writing a nest of string comparisons is fine. (Although searching a data structure is probably more compact and elegant.)
C strings are inherently unsafe unless a lot of error checking is included to be sure that buffers are long enough, trailing '\0' characters appear in all the right places, etc. Obviously you should check for things that are important to the compiler logic, but don't spend lots of time worrying about buffer overruns and other problems for this project. Save that for when you are writing production code that needs to be bullet-proof.

Project Turnin

Turn in your program electronically using this online turnin form.

You should hand in some examples of test input and output that demonstrate that your scanner works, and your answers to the written problems in class on Nov. 22.