CSE 413 21sp Assignment 7 - Regular Expressions & Scanner

Due: Tuesday, May 25 by 11 pm. You should use Gradescope to submit your assignment in two parts, corresponding to the written and programming parts below. Details are given in the "What to hand in" section at the bottom of the assignment.

Added Monday, May 24: we want to allow a bit of extra time for this assignment because of the end of the quarter crunch. So, for this assignment, everyone has two additional free late days, meaning this assignment can be turned in as late as 11 pm, Thursday, May 27, even if you have already used all or most of your late days previously. No further extensions allowed after that deadline and no credit will be awarded for assignments submitted after that time.

Part I. Written problems

You should answer these questions using standard regular expressions as described in class, not the variants and extensions found in "regular expression" packages in programming language like Ruby, Perl, Python, Bash, grep, sed, or others. The operations you may use include |, *, +, ?, concatenation, and character classes [...]. You may also specify character classes containing all characters except for specific characters, e.g., [^abc]. You can also use abbreviations and be somewhat informal when the meaning is clear, e.g., allexceptx = a | b | c | ... | w | y | z. Note that you cannot use something like "allexceptx" in a solution without first defining it as a name for a particular regular expression, as in the above example. You may not use other operations that might be found in some software packages; in particular, you cannot use not(re) as a regular expression that matches all other regular expressions except for re.

For each of the following regular expressions, (i) draw a DFA that accepts the set of strings generated by that regular expression, (ii) give an example of two strings that can be generated by the regular expression and two that use the same alphabet but cannot be generated, and (iii) give an English description of the set of strings generated (for example, "all strings consisting of the word 'cow' followed by 1 or more occurrences of 'milk' "). For (iii), you should not just paraphrase the regular expression operators in English; describe the sets of strings generated.
1. (a|xy)*
2. b(oz)+o
3. ((ε|0)1)*
Give regular expressions that will generate the following sets of strings. You do not need to draw DFAs, but sometimes sketching a DFA is a useful tool to help solve problems like these.
1. All strings of a's and b's with at least 3 a's.
2. All strings of a's and b's where b's only appear in sequences of b's whose length is a multiple of 2 (i.e., abbaa, bbbbabbaaa, and a are in this set; aba, b, abab, and abbabab are not).
3. All strings of lower-case letters that contain the 5 vowels (aeiou) exactly once and in that order, with all other possible sequences of lower-case letters before, after, or in between the individual vowels.
In The C Programming Language (Kernighan and Ritchie), an integer constant is defined as follows.

"An integer constant consisting of a sequence of digits is taken to be octal if it begins with 0 (digit zero), decimal otherwise. Octal constants do not contain the digits 8 or 9. A sequence of digits preceded by 0x or 0X (digit zero) is taken to be a hexadecimal integer. The hexadecimal digits include a or A through f or F with values 10 through 15.

"An integer constant may be suffixed with the letter u or U, to specify that it is unsigned. It may also be suffixed by the letter l or L to specify that it is long."

(a) Write a regular expression that generates C integer constants as described above. (Of course you can break the solution down into a regular expression with several named parts if it makes things easier to write and read.)

(b) Draw a DFA that recognizes integer constants as defined by your solution to part (a).

Hint: You might find it helpful to alternate between designing the DFA and writing the regular expressions as you work on your solution.

Part II. Programming - Calculator Scanner

This is the first of two programming assignments to build an interpreter for the language given in the Calculator Language description. We will build the interpreter in two parts - a scanner that reads the calculator program from the input stream and breaks the input into tokens, and a parser/evaluator that parses the token stream according to the specifications in the grammar and executes the program. The calculator program should be implemented in Ruby. For the most part it will just be a collection of top-level functions, but you should create classes when these are helpful in organizing the code.

For this assignment you should implement a scanner that provides a function next_token. Each time next_token is called it should return a new Token object that describes the next terminal symbol read from the input. Objects of class Token should respond to the following messages:

kind - return the lexical class of the token as a string. This should be a distinct string for each lexical class in the program, possibly just the operator or keyword itself. However, all identifiers should be treated as instances of a single lexical class and the kind method should return the same value for every identifier. Similarly, all numbers should be treated as a single lexical class. You will also want to have a lexical class to represent the end of an input line, since end-of-line is semantically meaningful - it indicates the end of a statement.
value - if the token kind is either an identifier or number, then this message should return the actual identifier or floating-point value. Its value is not defined for other lexical classes.
to_s - the standard Ruby "to string" method. This should produce a descriptive string representation of the token, including the associated value if the token is an identifier or a number.

To test the scanner, you should write a small program that calls next_token repeatedly to get the next token from the input and prints the result (the result of sending to_s to a Token object). After reading and printing a quit or exit token, the test program should stop. Adjacent tokens should be separated by a single space or newline character in the output.

Feel free to take advantage of Ruby's string and regular expression classes and methods to chop input lines into tokens. You also might want to use the readline module (require 'readline') to allow convenient input editing while you are typing, but this is not required.

Your scanner and Token code should be contained in a file scan.rb and the small program that calls the scanner and prints tokens should be in a file hw7.rb. Be sure to include your name and other identifying information as comments at the beginning of your file. There should also be descriptive comments as needed; in particular, your Token class should include documentation of the possible values returned by the kind method.

What to Hand In

Use Gradescope to submit your solutions to this assignment in two parts.

For the problems in Part I please turn in a pdf file named hw7.pdf with your answers. This can be a scanned handwritten document if that is convenient, as long as it is clear and legible. Gradescope's web site has several instructional videos and help pages to outline the process, and this guide has specific information about scanning and uploading pdf files containing assignments.

Unreadable solutions cannot be graded--no blurry photos, poor contrast, or illegible handwriting, please.
Type-written solutions are encouraged but not required.
If possible, don't split the solution to a problem across a page break.

For Part II, turn in the scan.rb and hw7.rb files containing your Ruby code. Also submit a text file named hw7.txt containing some examples of test input and output that demonstrate that your scanner works on a variety of test input containing both legal tokens and other input characters. The hw7.txt file should be a transcript of a terminal session showing operation of your scanner and test program.

On Linux and OS X systems you can create a transcript file using the command script hw7.txt to start a new shell and capture all of the input and output while it is running. Run your ruby command there, then, when you're done, terminate ruby and type exit in the shell window. That will leave you back in the original window where you ran script, and there will be a hw7.txt file containing your work in the current directory. Turn in that file.

Historically, Windows has not had a script command. If no better tool is available, you can create a transcript file as follows. Run ruby or irb in a command window. When you're done, right-click the window title bar and pick edit>select all from the popup menu, then pick edit>copy. Open a plain text editor like notepad++, paste the copied terminal session into a new window, and save that file as hw7.txt.

Regardless of how you create the transcript file, you can do some light editing to delete major typos or false starts, but don't worry about minor problems or a few extra characters like backspaces entered to correct typing errors. We're interested in your overall demonstration, not in every small detail (as long as it's clear which things are false starts that should be ignored).