CSE 413 Autumn 2006 -- Assignment 6

Regular Expressions & Compiler Scanner

Due: Electronic turnin due no later than 11:00 pm, Tuesday, Nov. 21. Printouts of your code and tests, and answers to written questions due at the beginning of class on Wednesday, Nov. 22.

Overview

This is the first of a sequence of assignments that will result in a small, but complete compiler. You may work with a partner on this project. If you do so, you should plan on working with the same person for the entire project.

Written problems

These problems cover some of the basics of regular expressions.  You should do these problems individually, even if you have a partner for the compiler project.

  1. For each of the following regular expressions, (i) give an example of two strings that can be generated by the expression and two that cannot be generated but use the same alphabet (if possible), and (ii) give an English description of the set of strings generated (for example, "all strings consisting of 1 or more w's followed by xyz").

    1. (0*1*)*
    2. (1|10)*
    3. (0|1)*0(0|1)(0|1)
  2. Give regular expressions that will generate the following strings.

    1. All strings of lower case letters in the range a-f where the letters that do appear are in ascending lexicographic order.
    2. All strings of 0's and 1's that do not contain 011 as a substring. (Hint: sometimes it's easier to visualize problems like this by using finite automata to get an idea of what characters should or should not be allowed to follow others.)

  3. Here is the description of floating constants from the C reference manual (by Kernighan and Ritche). Write a set of regular expressions that generate floating constants as described here.

    A floating constant consists of an integer part, a decimal point, a fraction part, an e or E, an optionally signed integer exponent and an optional type suffix, one of f, F, l, or L. The integer and fraction parts both consist of a sequence of digits. Either the integer part or the fractional part (not both) may be missing; either the decimal point or the e and the exponent (not both) may be missing. The type is determined by the suffix; F or f makes it float, L or l makes it long double; otherwise it is double.

  4. (Optional) Give the diagram of a finite automata (state diagram) that recognizes the regular expressions for C floating constants given in your answer to question 3.

Programming project - Scanner

The purpose of this part of the assignment is to construct a scanner for the D language, which is defined in a separate handout.  The scanner, as well as the rest of your compiler, should be written in C.  The code should not only work properly, but it should also be readable and well-organized.

Scanner Organization

A scanner (or lexical analyzer) reads the character (text) representation of a program and transforms it into a sequence of tokens representing the basic lexical items in the language.   These tokens include  punctuation (lparen, rparen, semicolon, ...), keywords (if, int, return, ...),  integers, and identifiers.  The scanner should skip over whitespace and comments in the source code; these are ignored by the compiler and should not appear in the token stream that will eventually be read by the parser.

The main interface to the scanner is a function next_token() that returns the next lexical token from the source program each time it is called. After the last token has been returned, further calls to next_token() should return an "end-of-file" token to indicate that there is no more input.

To test the scanner, you should write a main program that reads successive tokens from an input file one at a time and prints them to the output file. We are supplying a fair amount of starter code for this project, including header files for the scanner and tokens, and a set of I/O routines that handle the details of opening the correct files, reading and writing text lines, and that also automatically copy input lines to the output file as assembly-language comments. This later copying will make it easier to see the correlation between the input source program and the tokens returned by the scanner and printed by the test program.

Supplied files

You should download the following files and use them as a starting point for your scanner project.

Implementation

For the most part the scanner is a reasonably straightforward character and text manipulation program. A couple of suggestions:

Project Turnin

Turn in your program electronically using this online turnin form.

You should hand in some examples of test input and output that demonstrate that your scanner works, and your answers to the written problems in class on Nov. 22.