CSE 413 Autumn 2008 -- Assignment 6

Regular Expressions & Scanner

Due: Electronically: 11 pm, Thursday, Nov. 20, 2008. Written problems and printed copy of the code and test output at the beginning of class on Friday, November 21, 2008.

Written problems

You should answer these questions using regular expressions as described in class, not the variants found in programming language like Ruby, Perl, Python, Bash, grep, or whatever.
  1. For each of the following regular expressions, (i) give an example of two strings that can be generated by the regular expression and two that use the same alphabet but cannot be generated, and (ii) give an English description of the set of strings generated (for example, "all strings consisting of the word 'cow' followed by 1 or more 'x's and 'o's"). For (ii), you should not just paraphrase the regular expression operators in English; describe the sets of strings generated.
    1. (a|xy)*
    2. b(oz)+o
    3. ((ε|0)1)*

  2. Give regular expressions or sets of regular expressions that will generate the following sets of strings.
    1. All strings of a's and b's with at least 3 a's.
    2. All strings of a's and b's where b's only appear in sequences whose length is a multiple of 2 (i.e., abba, bbbbabbaaa, and a are in this set; aba, b, and abbabab are not).
    3. All strings of lower-case letters that contain the 5 vowels (aeiou) exactly once and in that order, with all other possible sequences of letters before, after, or in between the individual vowels.
       
  3. In The C Programming Language (Kernighan and Ritchie), an integer constant is defined as follows.

    An integer constant consisting of a sequence of digits is taken to be octal if it begins with 0 (digit zero), decimal otherwise. Octal constants do not contain the digits 8 or 9. A sequence of digits preceded by 0x or 0X (digit zero) is taken to be a hexadecimal integer. The hexadecimal digits include a or A through f or F with values 10 through 15.

    An integer constant may be suffixed with the letter u or U, to specify that it is unsigned. It may also be suffixed by the letter l or L to specify that it is long.

    Write a regular expression or set of regular expressions that generate C integer constants as described above.

Programming - Calculator Scanner

This is the first of two programming assignments to build an interpreter for the language specified in the Calculator Language handout. We will build the interpreter in two logical parts - a scanner that reads the calculator program from the input stream and breaks the input into tokens, and a parser/evaluator that parses the input token stream according to the specifications in the grammar and executes the program. The program should be implemented in Ruby. For the most part it will just be a collection of top-level functions, but you should create classes when these are helpful in organizing the code.

In this part of the program you should implement a scanner that provides a single function next_token. Each time next_token is called it should return a new Token object that describes the next terminal symbol read from the input. Objects of class Token should respond to the following messages:

To test the scanner, you shoud write a small program that calls next_token repeatedly to get the next token from the input and prints the result. After reading and printing a quit or exit token, the test program should stop.

Feel free to take advantage of Ruby's string and regular expression classes and methods to chop the input into tokens.

Be sure to include your name and other identifying information as comments in your code. There should also be descriptive comments as needed; in particular for your Token class to describe the possible values returned by the kind method.

Project Turnin

Electronic: Turn in your program electronically using the link on the assignments page.

Paper: Turn in answers to the written problems, a copy of your code, and some examples of test input and output that demonstrate that your scanner works.