Homework 5
Due: Thursday, May 11, at midnight.
Assignment goal
In this assignment, you will develop a more complex program using dynamic data structures. In doing so you will:
- Gain experience developing a larger system one part at a time, testing as you go.
- Learn about the trie data structure, a version of a search tree.
- Gain experience working with trees, structs, and dynamically allocated data.
- Gain more experience reading and processing text files in C.
- Practice writing simple Makefiles.
Synopsis
In this assignment, you will build a program to predict all possible completions of a word. In order to perform this prediction, we will use a data structure called a trie.
For clarity, your program should perform a post-order traversal of all possible words, such that all children of a node are printed before the word at that node.
Technical Requirements
Implement in C a program autocomplete
. The command
./autocomplete FILE
should read in a dictionary file (FILE
) that contains
a list of words, one per line. Store each word in the trie. If that
word already exists in the trie, make sure your program gracefully
leaves the trie unchanged.
Once your program has successfully read the dictionary and built the trie containing the words, it should start an interactive session. The user should be able to type in any string, and the program will respond by printing out all of the valid completions of that string. The program should not prompt the user, simply accept input and only then produce any output. When the user enters the EOF (end of file) character, the program should exit. EOF is usually entered with Ctrl-D.
One benefit of this interaction scheme is the ease at which you can use a file as your input and a file as your output. You should be able to make a file with one or two words in it, and redirect it to stdin. You can also redirect stdout to a file, to capture your autocompletions in a file.
You should separate your implementation into 2
files, trie.c
and main.c
, with a header
file trie.h
to be included by both, defining the
interface between the two. It is up to you to design a good API for
your trie data structure, but you should only declare your trie node
structure in trie.h
, and define it
in trie.c
. This will force every function which needs to
know the implementation details of a trie to live
in trie.c
, and enforce a good encapsulation.
You need to make a Makefile
which will correctly build
the autocomplete
binary. It should also contain the
standard all
target, and an
appropriate clean
target.
- Use
malloc
to dynamically allocate the nodes, strings, and any other data that make up your trie. - While reading input from the input file or stdin, you may assume any single word will not be longer than 500 characters. However, when storing strings in your trie, you should allocate only as much space as each word needs, not 501 bytes for each.
- If you need to create a copy of a string or other variable-size data, you
should dynamically allocate an appropriate amount of storage using
malloc
and return the storage withfree
when you are done with it. The amount allocated should be based on the actual size needed, not some arbitrary size that is assumed to be "large enough". - Use standard C library functions where possible; do not reimplement operations available in the standard libraries.
- You must check the return status (result code) of every library function
you call to be sure that no errors occurred. In particular,
malloc
will returnNULL
if it is unable to allocate storage. Although this is extremely unlikely to happen, a robust program must check for the possibility and react appropriately if it does (most likely by printing an error to stderr and exiting). - If an error occurs when opening or reading a file, the program should write
an appropriate error message to
stderr
and terminate. - Before the program terminates, all dynamically allocated data must be properly freed (i.e.,
free
everything acquired withmalloc
). This should be done explicitly without relying on the operating system to clean up after the program finishes. - Your code must compile and run without errors or warnings when compiled with
gcc -Wall -std=c11 -g
on klaatu or the CSE Linux VM. Your program should build without errors when
make
is used to run yourMakefile
. You are, of course, free to use other systems for development, and you should be fine as long as you have a relatively recent version of gcc. But we will test the code on the CSE Linux machines. - Your program should terminate cleanly with no memory leaks or other memory errors reported when it is run using
valgrind
. (Warning:valgrind
slows down execution considerably. It will take several minutes to load the full dictionary file and then free the resulting tree undervalgrind
. We suggest you use smaller input files during development to test for memory problems withvalgrind
.) If memory leaks are detected,valgrind
's--leak-check=full
option will be useful to generate more extensive messages with information about the memory leaks.
Code Quality Requirements
As with any program you write, your code should be readable and understandable to anyone who knows C. In particular, for full credit your code must observe the following requirements.
- Divide your program into suitable source files (at least three) and functions, each of which does a single well- defined aspect of the assignment. For example, there should almost certainly be a header and source file for the trie data structure and the operations needed on it (create a new empty trie, insert a word, search, delete the trie, etc.), and an additional file that contains the the main part of the program that uses the trie data structure. Your program most definitely may not consist of one huge main function that does everything.
- The header (
.h
) file for the trie (and any other header files) should only declare items that are shared between client programs that use the header and the file(s) that implement it. Don't include in the header file implementation details that should be hidden. Be sure to use the standard#ifndef
preprocessor trick so your header files work properly if included more than once in a source file, either directly or indirectly. - Be sure to include appropriate function prototypes near the beginning of each source file for functions whose declarations are not included in a header file.
- Comment sensibly, but not excessively. You should not use comments to repeat the obvious or
explain how the C language works - assume that the reader knows C at least as well as you do.
Your code should, however, include the following minimum comments:
- Every function must include a heading comment that explains what the function does (not how it does it), including the significance of all parameters. It must not be necessary to read the function code to determine how to call it or what happens when it is called. (But these comments do not need to be nearly as verbose as, for example JavaDoc comments.)
- Every significant variable must include a comment that is sufficient to understand the information in the variable and how it is stored. It must not be necessary to read code that initializes or uses a variable to understand this.
- Every source file should begin with a comment identifying the file, author, and purpose (i.e., the assignment or project).
- Use appropriate names for variables and functions: nouns or noun phrases
suggesting the contents of variables and the results of value-returning functions;
verbs or verb phrases for
void
functions that perform an action without returning a value. Variables of local significance like loop counters or indices should be given simple names likei,
n
, orp
and do not require further comments. Avoid names likecount
orflag
that provide no useful information - use names that suggests the values being counted or the condition that is represented. - No global variables. Use parameters (particularly pointers) appropriately. Exception: if you wish, you may have global variables that record the settings of any command-line options added for the extra credit part (if you create any of these). It is may be appropriate to use global variables for constant data like translation tables if the program is better structured this way.
- No unnecessary computation or excessive use of
malloc
orfree
- these are expensive. Don't make unnecessary copies of large data structures; use pointers. (Copies of ints, pointers, and similar things are cheap; copies of large arrays and structs are expensive.) - While we typically think of English as having only 26 letters, you're expected to be able to handle words which consist of any string of characters. For simplicity in this assignment, we only require you to support character values 0 through 127.
- As with the previous assignment, we strongly suggest that you use
the clint.py style checker (right-click to
download, and
chmod +x
to make it executable) to review your code. While this checker may flag a few things that you wish to leave as-is, most of the things it catches, including whitespace errors in the code, should be fixed. We will run this style checker on your code to check for any issues that should have been fixed.
Implementation Hints
We suggest you write a testing dictionary file with just a few
words in it in order to develop your program, before running it on the
whole dictionary. Recall that all the words in English live
in /usr/share/dict/words
on Klaatu.
- There are a lot of things to get right here; the job may seem overwhelming if you try to do it all at once. But if you break it into small tasks, each one of which can be done individually by itself, it should be quite manageable. For instance, figure out how to add a single word to the trie before you implement the logic to process all the words in the dictionary. Figure out how to add a few words that are completely different before you handle words that have the same prefix. Implement the code to traverse the trie to translate an input key sequence into the corresponding word once you've built the trie, not before.
- Before you start typing code into the computer, spend some time sketching out data structures
and code (particularly trie node
struct
s) on paper or on a whiteboard. Be sure you understand what you are trying to do before you start typing. - Every time you add something new to your code (see hint #1), test it. Right
Now! It is much
easier to find and fix problems if you can isolate the potential bug to
a small section of code you just added or changed.
gdb
andprintf
are your friends here to examine values while debugging. - You will probably find it very useful to include code that can print the contents of the trie in some understandable format. This is not required, but how can you be sure your code is correct if you can't look at the trie that is built for a small set of input data?
- Start with a small data file and figure out in advance what the resulting trie should look like. Then verify that the program does, in fact, create that tire.
gdb
is your friend.valgrind
is probably an even better friend.- Be sure to check for errors like trying to open a nonexistent file to see if your error handling is working properly.
- Once you're done, read the instructions again to see if you overlooked anything.
- Reread the previous hint and obey.
Sample Input
If you create a test dictionary file with the words:- he
- he's
- halp
- hapless
- hello
- world
ha
should give the
output: halp
followed by hapless
, the
input world
should give the output world
,
the input xylophone
should give no output, and the
input h
should give the output:
- halp
- hapless
- he's
- hello
- he
in that order.
Turn-in Instructions
You should submit 5 files:
- trie.h
- trie.c
- main.c
- Makefile
- README
Extra Credit
Think of a cool extension to this assignment and implement it. Maybe you use the trie to build a spellchecker, or you are able to build a ranked autocomplete which suggests only the best words (not all of them).
In order to turn in the extra credit, you should first turn in your
normal program without any extra credit implemented. This is what will
be graded for the main part of the assignment. To then get extra
credit,
submit trie_extra.h
, trie_extra.c
, main_extra.c
,
and Makefile_extra
. Be sure to describe in the README
what you've built.
Before you attempt the extra credit, make sure you finish the rest of the assignment. The extra credit will be worth no more than 10% of the total value of the main assignment, so focus your efforts on perfecting your main assignment.
- General
- Communication
- Documentation
Computer Science & Engineering University of Washington Box 352350 Seattle, WA 98195-2350 (206) 543-1695 voice, (206) 543-2969 FAX
Comments to adminanchor