CSE 401/M501 22sp Homework 1 - Regular expressions
Due: Thursday, April 7, by 11 pm. Submit
your homework online
via Gradescope (gradescope.com
,
also linked from the CSE 401/M501 resources web page). Gradescope's web site has several
instructional videos and help pages to outline the process,
and this guide has
specific information about scanning and uploading pdf files
containing assignments.
- Unreadable solutions cannot be graded--no blurry photos,
poor contrast, or illegible handwriting, please.
- Type-written solutions are encouraged but not required.
Typed prose with handwritten formulas/diagrams is better
than completely handwritten.
- If possible, don't split the solution to a problem across a page break.
You should do this and all other homework assignments (as
opposed to project pieces) individually.
We suggest you show your work to help us award partial credit
if appropriate, and for TA sanity.
For questions that ask for regular expressions, restrict yourself
to the basic operations and abbreviations outlined in the lecture slides: concatenation,
|
, *
, +
, ?
,
and character classes [...]
.
You may also specify character classes containing
all characters except for specific characters, e.g., [^abc]
.
You can also use abbreviations and be somewhat informal when the meaning of the full regular expression
it abbreviates is clear and unambiguous,
e.g., allexceptx = a|b|c|...|w|y|z
.
Note that you cannot use something like "allexceptx" in a solution without first defining
it as a name for a particular regular expression, as in the above example.
You may not use additional "regular expression" operations found in libraries
or languages like perl, python, java, or ruby, or in unix tools like grep,
sed or jflex. In particular, you cannot use not(
re)
as
a regular expression that matches all strings not natched by re.
When drawing DFAs, you do not need to include an explicit "error state"
or show transitions to this state for situations where the DFA rejects its input.
We will informally assume that if there is no transition defined from a state on
a particular input character then it corresponds to a transition to an error state,
and diagrams will be simpler if we do not include this extra state and its many,
many incoming transitions.
Problems:
- For each of the following regular expressions, (i) give an example of two
strings that can be generated by the regular expression and two that use the
same alphabet but cannot be generated, and (ii) give an English description
of the set of strings generated (for example, "all strings consisting
of the word 'cow' followed by 1 or more 'x's and 'o's in any order",
not just a transliteration of the regular expression operations into English
like "a 'c' followed by an 'o' followed by a 'w' followed by ...").
In other words, aim to describe the set of strings and not the details
of how the regular expressions produce them.
- (
a
|xy
)*
b
(oz
)+o
- ((ε|
0
)1
)*
- Give regular expressions that generate the following sets of strings.
- All strings of a's and b's with at least 3 a's.
- All strings of a's and b's where any b's only appear in sequences
of b's whose
length is a multiple of 2 (a few examples: abba, bbbbabbaaa, a and ε are in this
set; aba, b, abab, and abbab are not).
- All strings of lower-case letters that contain the 5 vowels (aeiou)
exactly once and in that order, with all other possible sequences
of lower-case letters before, after, or in between the individual vowels.
- (Cooper & Torczon exercise 2 for section 2.2, parts
(a) and (b) only. p. 80). Construct a DFA accepting each of
the following languages:
- {w in {a,b}* | w starts with 'a' and contains 'baba' as a substring}
- {w in {0,1}* | w contains '111' as a substring and does not contain '00' as a substring }
You do not need to go through the full subset construction to produce this DFA
from a NFA, although you can use some of those ideas to help you
produce your answer.
- In The C Programming Language (Kernighan and Ritchie), an integer
constant is defined as follows:
An integer constant consisting of a sequence of digits is taken to be
octal if it begins with 0
(digit zero), decimal otherwise. Octal constants do not contain the digits
8
or 9
.
A sequence of digits preceded by 0x
or 0X
(digit zero) is taken
to be a hexadecimal integer. The hexadecimal digits include a
or A
through f
or F
with values 10 through 15.
An integer constant may be suffixed with the letter u
or U
, to specify that it
is unsigned. It may also be suffixed by the letter l
or L
to specify that it
is long. If both suffixes are present, they may appear in
either order.
(a) Write a regular expression that generates C integer
constants as described above. (Of course you can break the
solution down into a regular expression with several named
parts if it makes things easier to write and read.)
(b) Draw a DFA that recognizes integer constants as
defined by your solution to part (a). You may draw this
directly; you don't need to formally trace through an
algorithm for converting a regular expression to a NFA and
then constructing a DFA from that. However, you might find
it useful to do so at least partially.
Hint: You might find it helpful to alternate between designing the DFA and
writing the regular expressions as you work on your solution.
-
A commment in C is a sequence of characters
/* ... */
.
(a) Write a regular expression that generates C-style
comments. You can restrict the alphabet to lower-case letters, digits,
spaces, newlines
(\n), carriage returns (\r) and the characters *
and /
. Also, remember that
in C comments do not nest (i.e., a */
marks the end of a comment
no matter how many times /*
appears before it.)
(b) Draw a DFA that recognizes C comments as defined by your solution to part (a).
As with the previous problems, you may draw this directly without
tracing through the steps of the regexp->NFA->DFA algorithms.
Hints: be careful about what is included
in the ...
between /*
and */
.
Also, as with the previous problem, you may find it helpful to alternate between the
regular expressions and DFA as you work on the problem.