CSE 413 Autumn 2007 -- Assignment 6

Regular Expressions & Compiler Scanner

Due: Partner selection due by Friday Nov. 16th at 11:59pm. (See below for details.) Electronic turnin due no later than 8am, Wednesday, Nov. 21. Printouts of your code and tests, and answers to written questions due at the beginning of class on Wednesday, Nov. 21.

Overview

This is the first of a sequence of assignments that will result in a small, but complete compiler. You may work with a partner on this project. If you do so, you should plan on working with the same person for the entire project.

If working with a partner, you must find your partner and send a single email (i.e. only one partner needs to do this) to the instructor with the name and UW email id of both partners by 11:59pm Friday November 16th. (If you are planning on working alone, you do not need to send email.) Please use this link.

Written problems

These problems cover some of the basics of regular expressions.  You should do these problems individually, even if you have a partner for the compiler project.

Note: It may be useful to use the following notation to exclude a character. Use [^c] to mean: {sigma - c} where sigma is the alphabet. For example, a character string in the programming language C could be described as "[^"]*". In otherwords, zero or more characters (other than the double quote symbol) surrounded by double quotes.

  1. For each of the following regular expressions, (i) give an example of two strings that can be generated by the regular expression and two that use the same alphabet but cannot be generated, and (ii) give an English description of the set of strings generated (for example, "all strings consisting of the word 'cow' followed by 1 or more 'x's and 'o's").
    1. (a|xy)*
    2. b(oz)+o
    3. ((ε|0)1)*
  2. Give regular expressions that will generate the following sets of strings.
    1. All strings of a's and b's with at least 3 a's.
    2. All strings of a's and b's where b's only appear in sequences whose length is a multiple of 2 (i.e., abba, bbbbabbaaa, and a are in this set; aba, b, and abbab are not).
    3. All strings of lower-case letters that contain the 5 vowels (aeiou) exactly once and in that order, with all other possible sequences of letters before, after, or inbetween the individual vowels.
       
  3. In The C Programming Language (Kernighan and Ritchie), an integer constant is defined as follows.

    An integer constant consisting of a sequence of digits is taken to be octal if it begins with 0 (digit zero), decimal otherwise. Octal constants do not contain the digits 8 or 9. A sequence of digits preceded by 0x or 0X (digit zero) is taken to be a hexadecimal integer. The hexadecimal digits include a or A through f or F with values 10 through 15.

    An integer constant may be suffixed with the letter u or U, to specify that it is unsigned. It may also be suffixed by the letter l or L to specify that it is long.

    Write a set of regular expressions that generate C integer constants as described above.

Programming project - Scanner

The purpose of this part of the assignment is to construct a scanner for the D language, which is defined in a separate handout.  The scanner, as well as the rest of your compiler, should be written in C.  The code should not only work properly, but it should also be readable and well-organized.

Scanner Organization

A scanner (or lexical analyzer) reads the character (text) representation of a program and transforms it into a sequence of tokens representing the basic lexical items in the language.   These tokens include  punctuation (lparen, rparen, semicolon, ...), keywords (if, int, return, ...),  integers, and identifiers.  The scanner should skip over whitespace and comments in the source code; these are ignored by the compiler and should not appear in the token stream that will eventually be read by the parser.

The main interface to the scanner is a function next_token() that returns the next lexical token from the source program each time it is called. After the last token has been returned, further calls to next_token() should return an "end-of-file" token to indicate that there is no more input.

To test the scanner, you should write a main program that reads successive tokens from an input file one at a time and prints them to the output file. We are supplying a fair amount of starter code for this project, including header files for the scanner and tokens, and a set of I/O routines that handle the details of opening the correct files, reading and writing text lines, and that also automatically copy input lines to the output file as assembly-language comments. This later copying will make it easier to see the correlation between the input source program and the tokens returned by the scanner and printed by the test program.

Supplied files

You should download the following files and use them as a starting point for your scanner project.

Implementation

For the most part the scanner is a reasonably straightforward character and text manipulation program. A couple of suggestions:

Project Turnin

Electronic: Turn in your program electronically using the link on the assignments page. If you worked with a partner only one person should submit.

Paper: You should hand in some examples of test input and output that demonstrate that your scanner works, and your answers to the written problems in class on Nov. 21. Note that if you worked with a partner on the compiler each partner must still turn in solutions to the written problems individually.

Additional Notes (added after 11/13/07)

Added 11/14/07:

  1. You may assume that lines in programs you will be given will not exceed 1200 characters (including the null character). You will probably want to add a #define to your program for this.
  2. You may assume that the user will not give you file names that exceed the maximum filenamelength (1024).
  3. You may assume that you will not be given programs that contain identifiers that exceed the maximum length as given in token.h (30 characters).
  4. You may assume you will not be given integer constants that are larger than what can be stored in a C int. (Note that you will never be given negative integer constants. Integer constants will always start with a digit. To get a negative value, -n can be computed by evaluating 0-n.)

    Added 11/16/07:

  5. Here is a sample Makefile to use for the project. Notes: If in doubt, make clean, then make things over again from scratch. Feel free to modify this as you would like and certainly as the project goes on. Obviously you can use whatever name you would like for your test file. I put all of the .h files as dependencies for each of the .o files but this is actually much more than you need. You can make depend to see what the actual dependencies are (for example if token.h changes then scan.o should be rebuilt because you may have changed what a token is or some of the #define values for tokens).
  6. Here is sample hello.s token output for the hello.d sample file. You do not need to put each token on its own line or have format like this, or even the exact same token names but you should have the same number of tokens and you should have ignored comments and whitespace etc.

    Added 11/19/07:

  7. At this point main should just be considered the same as any other function name (not as a keyword). Later on we treat main differently than other functions (for example we will check that main exists and that only one main exists).