CSE333 15au HW4: egrep--

out: Tuesday December 1st, 2015.
due: Friday, December 11th, 2015 11:59pm.

Overview

Build a very basic subset of egrep (man egrep), a pattern matching tool based on regular expressions. Here's a truncated example use:

$ ./regexp 'c+i*e' dickens.txt
re-use it under the terms of the Project Gutenberg License included
[Last updated: December 30, 2011]
Produced by An Anonymous Volunteer
photographs), my first fancies regarding what they were like were
...
  
Like egrep, your code should perform pattern matching on a line-by-line basis, and should print (only) those lines that match. Execution of your program should return EXIT_SUCCESS if there was at least one matching line, and EXIT_FAILURE otherwise. If the regular expression matches any contiguous substring of a line, the line matches; otherwise, it doesn't.

Whether or not you're already familiar with regular expressions and NFAs, I suggest you read this paper. You should read the rest of this writeup first, though, as we're not trying to build everything the paper talks about (nor even support for all actual regular expressions).

The recommended implementation approach is to convert the regular expression into a non-deterministic finite automaton (NFA), and then to simulate the NFA on each line of input. Simulating the NFA can be a bit slow. There is a classical technique to convert the NFA into a deterministic finite state automaton (DFA). The cited paper suggests a dynamic version of that transformation. You don't need to do either (but we'll notice if you do).

Regular Expression Language

Note: To reduce the amount of code required, you can assume that there are no syntax errors in the regular expression argument to your code.

We'll implement only a simple subset of regular expressions (REs), and not all the features supported by popular regular expression matching tools you may know. The RE language we implement is a subset of what egrep implements. That means you can use egrep as a debugging aid - the results you get should be identical to what egrep produces for the same RE pattern string and file.

We implement the following features (where in each case c is any single character, and r and s are regular expressions):

We also use backslash escape to indicate a literal character, rather than an operator, so c+ matches one or more c's while c\+ matches a 'c' immediately followed by a '+'. (Note that we are not implementing everything mentioned in that paper nor required for actual regular expressions - just the operators shown above.)

Examples
We show a valid pattern and then one or more example matching strings:

Application Architecture
There are two primary steps, building the finite state automaton (FSA) equivalent to the regular expression and then simulating non-deterministic execution of that FSA on the lines of the input file.

RE to FSA

See section "Converting Regular Expressions to NFAs" in the article.

The restrictions we've placed on regular expressions result in particularly simple FSAs. The FSA is basically a simply chain through states, some of which have self loops (a transition to themselves). We use epsilon transitions -- unlabeled edges that the FSA can traverse without consuming an input character -- to further simplify FSA construction.

Start with an FSA containing just the start state and an epsilon edge -- an edge that can be taken without consuming a character from the input file. Now scan the RE from beginning to end. Each time you move forward one symbol you check if the next symbol is an operator (+ or *) or not. If it is, you consume the symbol and the operator, otherwise just the symbol. Depending on what you found, you generate one of three small machines that encode the character(s) just read from the RE. (The three machines are shown in the article and the example below.) As you generate these sub-machines, you connect the output edge(s) of the FSA you're building to the start state of the sub-machine. When you reach the end of the RE, you add a sub-machine consisting only of the accept state.

Here's the final machine built this way from RE ab*c+d. (The start state is green and the accept state is red.)

Non-deterministic execution

See the second method described in section "Regular Expression Search Algorithms" in the article

FSA execution is involves moving from one state to another. A simple execution step reads a single character from the input source (the file), then examines the labels on the outgoing edges of the current state(s) and transitions to the states they lead to. If more than one edge is labeled with the input symbol, the FSA ends up in more than one state. If no edges from any of the current states are labeled with the input symbol, the FSA "rejects" the input (and stops executing).

For instance, on the FSA shown above, if the FSA had so far consumed the symbols a, it would be in states 2 and 3 (where the start state is state 0). A while later, having read acc it would be in states 3 and 4. If it now read z it would reject the input. If instead it now read d it would accept the input, meaning there is a match.

Because our application wants to find all lines that match the RE anywhere in the line, you must also add the start state to the set of states the FSA is in each time you read a symbol.

Implementation Requirements
Optional Extra Credit
The extra credit is small, so do this only if you're interested.

Implement full regular expressions - add alternation (the | operator) and subexpressions (the parentheses operator).

This seems straightforward, but it makes things quite a bit harder. Even simply parsing the regular expression becomes harder. I'd implement the basic assignment first, even though substantial re-design might be required to then move on to extra credit extension.

What to turn in

Delete executables, data files, and anything else that is extraneous, and then tar.gz your entire working directory.

If your code doesn't meet the full specification, or otherwise contains bugs, include a README.txt file that indicates what is working properly and what isn't. If you've done more than the basics, also indicate that in README.txt.

Submit to the course dropbox.