CSE 401/M501 24au Homework 1 - Regular expressions / DFAs

Due: Thursday, October 3 by 11:59 pm. Please use Gradescope (gradescope.com, also linked from the CSE 401/M501 resources web page) to submit your homework online. Gradescope's web site has several instructional videos and help pages to outline the process, and this guide has specific information about scanning and uploading pdf files containing assignments (although this guide is now somewhat old, so you may have better methods of creating pdf files on your phone using more recent software).

Unreadable solutions cannot be graded--no blurry photos, poor contrast, or illegible handwriting, please.
Type-written solutions are encouraged but not required.
If possible, don't split the solution to a problem across a page break.

We suggest you show your work to help us award partial credit if appropriate, and for TA sanity.

For questions that ask for regular expressions, you must restrict yourself to the basic operations and abbreviations outlined in the lecture slides: concatenation, |, *, +, ?, and character classes like [abc], [ax-z], or [a-zA-Z]. You may also specify character classes containing all characters except for specific characters, e.g., [^abc] means all characters in the entire alphabet except for a, b, and c; and [^ax-z] means all characters in the entire alphabet except for a, x, y, and z. You can also use abbreviations and be somewhat informal when the meaning of the full regular expression defined by an abbreviation is clear and unambiguous, e.g., allexceptx = a|b|c|...|w|y|z. Note that you cannot use something like allexceptx in a solution without first defining it as a name for a particular regular expression, as in this example. You may not use additional "regular expression" operations found in libraries or languages like perl, python, java, javascript, ruby, etc., or in unix tools like grep and sed. In particular, you cannot use not(re) as a regular expression that matches all other regular expressions except for re.

Sometimes a regular expression might need to include a literal character like * or +, which is also a regular expression operator. Use some simple convention to distinguish between the two. We suggest underlining literal characters, like * to distinguish them from operators like *. Please do not use backslash characters as escape characters, they are \v\e\r\y\ \h\a\r\d\ \t\o\ \r\e\a\d.

When drawing DFAs, you should not include an explicit "error state" or show transitions to this state for situations where the DFA rejects its input. We will assume that if there is no transition defined from a state on a particular input character then it corresponds to a transition to an error state, and the input is rejected. Diagrams are much simpler if we do not include this extra state and its many, many incoming transitions.

To help make DFA diagrams more readable, it's fine to label a transition arrow from one state to another with more than one symbol from the alphabet, instead of drawing multiple transition arrows each labeled with a single symbol. The easiest way to do this is to list the symbols in brackets, like [0-9] to label a transition on a single decimal digit , or [abcxyz] for a transition on one of those six characters. However, remember that transitions correspond to matching a single character in the input (or no character if it is an ε-transition), and cannot be labeled with a more complex regular expression.

Feel free to submit clear, hand-written diagrams for DFAs. However, if you want to use computer diagraming tools and don't already have a favorite, here are a couple of suggestions that students have found useful in the past:

Websites: https://draw.io, https://madebyevan.com/fsm/. There's a version of the "madebyevan" tool at https://www.cs.unc.edu/~otternes/comp455/fsm_designer/ that can also save and reload diagrams as JSON files if you want to store a version to possibly reload and edit later.
LaTeX: check out TikZ, which is a LaTeX package included in most standard distributions. Manual: https://tikz.dev/. This has a steep learning curve, but the two versions of the "madebyevan" web tool above can export diagrams to TikZ code that can then be copied into a LaTeX source file and typeset.

You should do this homework individually.

For each of the following regular expressions, (i) give an example of two strings that can be generated by the regular expression and two that use the same alphabet but cannot be generated, and (ii) give an English description of the set of strings generated (for example, "all strings consisting of the word 'cow' followed by 1 or more 'x's and 'o's in any order", not just a transliteration of the regular expression operations into English like "a 'c' followed by a 'o' followed by a 'w' followed by ..."). In other words, aim to describe the sets of strings and not the details of how the regular expressions produce them.
1. (a|xy)*
2. b(oz)+o
3. ((ε|0)1)*
Give regular expressions that generate the following sets of strings.
1. All strings of a's and b's with at least 3 a's.
2. All strings of a's and b's where any b's only appear in sequences of b's whose length is a multiple of 2 (a few examples: abba, bbbbabbaaa, a and ε are in this set; aba, b, abab, and abbab are not).
3. All strings of lower-case letters that contain the 5 vowels (aeiou) exactly once and in that order, with all other possible sequences of lower-case letters before, after, or in between the individual vowels.
(Cooper & Torczon exercise 2 for section 2.2, parts (a) and (b) only. p. 80 2nd ed, p. 82 3rd ed). Construct a DFA accepting each of the following languages (sets of strings):
1. {w in {a,b}* | w starts with 'a' and contains 'baba' as a substring}
2. {w in {0,1}* | w contains '111' as a substring and does not contain '00' as a substring }
You do not need to go through the full subset construction to produce this DFA from a NFA, although you can use some of those ideas to help you produce your answer.
In The C Programming Language (Kernighan and Ritchie), an integer constant is defined as follows:

An integer constant consisting of a sequence of digits is taken to be octal if it begins with 0 (digit zero), decimal otherwise. Octal constants do not contain the digits 8 or 9. A sequence of digits preceded by 0x or 0X (digit zero) is taken to be a hexadecimal integer. The hexadecimal digits include a or A through f or F with values 10 through 15.

An integer constant may be suffixed with the letter u or U, to specify that it is unsigned. It may also be suffixed by the letter l or L to specify that it is long.

(Note: this description is not quite precise about the rules for possible suffixes: is more than one of u or U allowed, or is there a restriction on the order of U and L, etc.? The C language standard specifies that at most one of u or U can appear, at most one of l or L, and the suffix letters can appear in any order.)

(a) Write a regular expression that generates all C integer constants as described above. (Of course, you can break the solution down into a regular expression with several named parts if it makes things easier to write and read.)

(b) Draw a DFA that recognizes integer constants as defined by your solution to part (a). You may draw this directly; you don't need to formally trace through an algorithm for converting a regular expression to a NFA and then constructing a DFA from that. However, you might find it useful to do so at least partially to help your design.

Hint: You might find it helpful to alternate between designing the DFA and writing the regular expressions as you work on your solution. Also, you may discover that the English description given above is not as precise as it might seem, so you may need to explore some edge cases more carefully.
Comments can be included in html documents. Any text that starts with the 4-character sequence  is treated as a comment, and a comment can span multiple lines. Examples:
```
       

          this is not a comment because it isn't surrounded
             by comment brackets

          
```
(a) Write a regular expression that generates html-style comments. You can restrict the alphabet to lower-case letters, digits, spaces, newlines (\n), carriage returns (\r) and the characters -, !, <, and >. Remember that these comments do not nest (i.e., the sequence --> marks the end of a comment no matter how many times  that terminates a comment. That three-character sequence cannot appear inside a comment since it would terminate it.

(b) Draw a DFA that recognizes html comments as defined by your solution to part (a). As with the previous problems, you may draw this directly without tracing through the steps of the regexp->NFA->DFA algorithms.

Hints: be careful about what is included between . As with the previous problem, you may find it helpful to alternate between the regular expressions and DFA as you work on the problem.

Note: the actual html standard does not allow the sequence <!-- to appear in the middle of a comment. We omitted that restriction here to simplify the problem.