75 points
Due: Monday, May 13, 2019, at 11:59pm
In this assignment, you will develop a more complex program using dynamic data structures. In doing so you will:
This is a significantly larger task than the previous programming assignment! Start early.
In this assignment, you will build programs to implement T9 predictive text,
a text input mode developed originally for cell phones and
still used for numeric keypads. Each number
from 2-9 on the keypad represents three or four letters, the number 0 represents
a space, and 1 represents a set of symbols such as { , . ! ? } etc. The numbers
from 2-9 represent letters as follows:
2 => ABC
3 => DEF
4 => GHI
5 => JKL
6 => MNO
7 => PQRS
8 => TUV
9 => WXYZ
Since multiple letters map to a single number, many key sequences represent multiple words. For example, the input 2665 represents "book" and "cool", among other possibilities.
2 6 6 5 2 6 6 5 a B c m n O m n O j K l OR a b C m n O m n O j k L | | | | | | | | ----------------------- ----------------------- | | book cool
To translate from number sequences to words, we will use a data structure known as a trie. A trie is a multiway branching structure (tree) that stores the prefixes of sequences. As we travel down a path in the trie, we reach word sequences spelled out by the numbers along that path. Classic trie data structures have edges labeled with letters to store prefixes of strings. But for this application, we use a compressed trie that has only 10 possible branches at each node instead of 26, since the digits 0-9 represent the 26 letters, space and symbols. Because of this, an extra layer of complexity is needed to figure out the string represented by a path.
(Actually, for our application, each node only needs 8 possible children numbered 2-9, since digits 0 and 1 don't encode letters. But writing the code might be easier if nodes have 10 children numbered 0-9, since then subtree number n corresponds to digit n. Feel free to use either representation for the trie depending on which seems simpler to implement.)
For more information on the trie data structure, here is a link to the Wikipedia article.
Implement in C a program t9
. The command
$ ./t9 FILE
should read in a dictionary file (FILE
) that contains a list
of words. Translate each word in the dictionary into its numeric key sequence,
then add the
key sequence to your trie, with the word at the end of the path corresponding
to the digits. If a word with the same numeric sequence already exists in
the trie, add the new word to the trie as a link to a new node with an edge
labeled '#' instead of one of the digits 2-9. (The words linked from a
node by the '#' edges essentially form a "linked list" of words that have the
same numeric code, but we use additional tree nodes to link the words together
instead of defining a separate kind of linked-list node just for this situation.)
For example, if the program reads the set of words "jello","rocks",
and "socks" from
the dictionary and adds it to an empty trie, the resulting trie should look
like this:
Once your program has read the dictionary and built the trie containing the words in it, it should start an interactive session. The user should be able to type numbers and the program should print out the word corresponding to the sequence of numbers entered. Your program should use the numbers typed by the user to traverse the trie that has already been created, retrieve the word, and print it to the screen. If the user then enters '#', the program should print the next word in the trie that has the same numeric value, and so forth. The user can also type a number followed by one or more '#' characters - this should print the same word that would be found by typing the number and individual '#' characters on separate input lines.
As an example, if we run the program using the above trie, an interactive session might look like this:
Enter "exit" to quit. Enter Key Sequence (or "#" for next word): > 76257 'rocks' Enter Key Sequence (or "#" for next word): > # 'socks' Enter Key Sequence (or "#" for next word): > 53556 'jello' Enter Key Sequence (or "#" for next word): > # There are no more T9onyms Enter Key Sequence (or "#" for next word): > 76257# 'socks' Enter Key Sequence (or "#" for next word): > 76257## There are no more T9onyms Enter Key Sequence (or "#" for next word): > 4423 Not found in current dictionary. Enter Key Sequence (or "#" for next word): > exit
The interactive session should terminate either when the user enters the word "exit" or when the end-of-file is reached on the interactive input (indicated by typing control-D on the keyboard on a separate line by itself).
Note: Make sure your program properly handles the case if the user types more "#"s than there are T9onyms for a particular number.
We provide you with two text files,
smallDictionary.txt
and
dictionary.txt
(right-click on the links to download the files). Each of these text
files contains a list of words to be used in constructing a trie - the
small one primarily for testing, and the large one for the final
program. Translate each word in the file into its associated T9 key
sequence, and add the word to the trie. In the case of multiple words
having the same key sequence k, let the first word encountered
in the text file be represented by the key sequence k, the next
encountered represented by k#, the next k##, etc. For
example, 2273 can represent acre, bard, bare, base, cape, card, care, or
case. To disambiguate, acre would be represented by 2273, bard by
2273#, bare by 2273##, and so forth. When a user inputs a key sequence,
print the appropriate word.
Your trie data structure should contain nodes to represent the tree, and strings (char arrays) containing copies of the words read from the dictionary file, linked to appropriate nodes in the trie.
Besides the general specification given above, your program should meet the following requirements to receive full credit.
Makefile
and use make
to compile your program. Your Makefile
should only
recompile the necessary part(s) of the program after changes are
made.malloc
to dynamically allocate the nodes, strings, and
any other data that make up your trie.malloc
and release the storage with free
when you are done with it. The amount allocated should be based on the
actual size needed, not some arbitrary size that is assumed to be "large
enough".malloc
will return NULL
if it is unable to
allocate storage. Although this is extremely unlikely to happen, a
robust program must check for the possibility and react appropriately if
it does.stderr
and terminate if there
is no further work to be done.free
everything acquired with
malloc
). This should be done explicitly without relying on
the operating system to clean up after the program finishes.gcc -Wall
on klaatu or the CSE Linux
VM
. Your program should build without errors when
make
is used to run your Makefile
. You are,
of course, free to use other systems for development, and you should be
fine as long as you have a relatively recent version of gcc. But we will
test the code on the CSE Linux machines.valgrind
.
(Warning: valgrind
slows down execution considerably. It
will take several minutes to load the full dictionary.txt file and then
free the resulting tree under valgrind
. We suggest you use
smaller input files during development to test for memory problems with
valgrind
.) If memory leaks are detected,
valgrind
's --leak-check=full
option will be
useful to generate more extensive messages with information about the
memory leaks. As with any program you write, your code should be readable and understandable to anyone who knows C. In particular, for full credit your code must observe the following requirements.
.h
) file for the trie (and any other header
files) should only declare items that are shared between client programs
that use the header and the file(s) that implement it. Don't include in
the header file implementation details that should be hidden. Be sure to
use the standard #ifndef
preprocessor trick so your header
files work properly if included more than once in a source file, either
directly or indirectly.void
functions that perform an action without returning a value. Variables
of local significance like loop counters or indices should be given
simple names like i,
n
, or p
and
do not require further comments. Avoid names like count
or
flag
that provide no useful information - use names that
suggests the values being counted or the condition that is
represented.malloc
or free
- these are expensive. Don't make unnecessary
copies of large data structures; use pointers. (Copies of ints,
pointers, and similar things are cheap; copies of large arrays and
structs are expensive.)As with the previous assignment, we strongly suggest that you use the
clint.py style checker (right-click to download,
and chmod +x
to make it executable) to review your code.
While this checker may flag a few things that you wish to leave as-is,
most of the things it catches, including whitespace errors in the code,
should be fixed. We will run this style checker on your code to check
for any issues that should have been fixed. Please use the class
discussion board if you have questions about any of clint's warnings and
whether they can be ignored.
struct
s) on paper or on a whiteboard. Be sure you
understand what you are trying to do before you start typing.gdb
and printf
are
your friends here to examine values while debugging.gdb
is your friend.if-elseif-else
statements, but another way to do it is to
have an array with one entry for each possible character. In the entry
for each character, store the corresponding digit code. Then you can
look up the code for a character without a complicated nest of
if
statements. (Recall that in C, as in Java and other
languages, a character can be used as a small integer. That is
particularly helpful for exactly this kind of application, where we
might want to use a character value as an index into a table of
data.)A small amount of extra credit will be awarded for adding the following extensions to an already complete, working assignment. No extra credit will be awarded if the basic program is not fully implemented and substantially bug-free. You should also turn in your working assignment before attempting any extra credit, and turn in a second archive containing the enhanced program later (see instructions below).
freq_Dictionary.txt
(right-click to download) contains a list of words with, in
addition, the frequency of each word in ordinary text. Use the
information in this file to construct your trie so the most likely
words are reached first. Warning: the data in this file has some
problems, like duplicate entries for some words. You will need to
figure out how best to use this data, or find different data that is
more helpful.If you include extensions in your program, you should also include a README
file
that describes what you added and how to demonstrate your addition(s) when
your program is executed.
The sequences below can be used to validate your trie against the supplied
dictionary.txt
file.
Create an uncompressed tar file named hw5.tar
containing
your source code and Makefile
, and submit that using the
normal class dropbox. You can do this with the command
$ tar cvf hw5.tar Makefile ...
where ...
is replaced by the
list of additional files you want to include in your submission.
If you do any extra credit extensions later, create a second tar file
named hw5-extra.tar
containing the extra-credit version of
your source code, Makefile
, and the README
file describing your extensions. Turn in that second archive in addition
to the first archive containing containing the solution to the required
part of the assignment.