Summary |
Build, in C++, a very basic subset of egrep (man egrep), a pattern matching tool based on regular expressions. Here's a truncated example use:
$ ./regexp 'c+i*e' dickens.txt re-use it under the terms of the Project Gutenberg License included [Last updated: December 30, 2011] Produced by An Anonymous Volunteer photographs), my first fancies regarding what they were like were ...Like egrep, your code should perform pattern matching on a line-by-line basis, and should print (only) those lines that match. Execution of your program should return EXIT_SUCCESS if there was at least one matching line, and EXIT_FAILURE otherwise. If the regular expression matches any contiguous substring of a line, the line matches; otherwise, it doesn't.
Whether or not you're already familiar with regular expressions and NFAs, I suggest you read this paper. You should read the rest of this writeup first, though, as we're not trying to build everything the paper talks about.
The recommended implementation approach is to convert the regular expression into a non-deterministic finite automaton (NFA), and then to simulate the NFA on each line of input. Simulating the NFA can be a bit slow. There is a classical technique to convert the NFA into a deterministic finite state automaton (DFA). The cited paper suggests a dynamic version of that transformation. You don't need to do either (but we'll notice if you do).
What to do.
There is no starter code for this assignment, so no code to fetch. (This is the only one of the three HW4's that allows you to decide everything (except that even for this one you have to implement in C++).)
You can't design or implement until you're sure you understand what it is you're trying to do. Read this assignment, and the paper linked above. You might want to consult additional references if things still aren't clear. Maybe try egrep if you haven't used it before. Then at least sketch out your code architecture at the level of what classes exist, what their major interfaces are, and how they interact to solve the problem. Then start grappling with C++...
Regular Expression Language |
Note: To reduce the amount of code required, you can assume that there are no syntax errors in the regular expression argument to your code.
We'll implement only simple regular expressions (REs), not all the features supported by popular regular expression matching tools you may know. The RE language we implement is a subset of what egrep implements. That means you can use egrep as a debugging aid - the results you get should be identical to what egrep produces for the same RE pattern string and file.
We implement the following features (where in each case a is any single character, and r and s are regular expressions):
Examples
We show a valid pattern and then one or more example matching strings:
Alternative, Advanced Option (Assignment #4B.B) |
Instead of writing code that produces and then simulates an automaton, write code that "compiles the regular expression into C." That is, the output of your program is a C program whose code represents the automaton. Your program then compiles the customized automaton implementation, and runs it against the input file.
Do either the regular assignment or this option, not both.
A completely correct and beautiful implementation of this option garners you a lot of admiration.
What to turn in |
Delete executables, data files, and anything else that is extraneous, and then tar.gz your entire working directory.
If your code doesn't meet the full specification, or otherwise contains bugs, include a README.txt file that indicates what is working properly and what isn't. If you've done more than the basics, also indicate that in README.txt.
Submit to the course dropbox linked from the course home page.
Points and Grading |
We will grade your submission using a combination of automated testing and manual examination of the code. To get a substantial fraction of full points, your submitted files must:
$ g++ -Wall -std=gnu++0x *.cc