In this assignment, you will develop a more complex C program using dynamic data structures. In doing so you will:
This portion of the page is divided into subsections for your convenience:
In this assignment, you will build programs to implement T9 predictive text,
a text input mode available on many cell phones and keypads. Each number
from 2-9 on the keypad represent three or four letters, the number 0 represents
a space, and 1 represents a set of symbols such as { , . ! ? } etc. The numbers
from 2-9 represent letters as follows:
2 ABC
3 DEF
4 GHI
5 JKL
6 MNO
7 PQRS
8 TUV
9 WXYZ
Since multiple letters map to a single number, many key sequences represent multiple words. For example, the input 2665 represents "book" and "cool", among other possibilities.
To translate from number sequences to words, we will use a data structure known as a trie. A trie is a multiway branching structure (tree) that stores the prefixes of sequences. As we travel down a path in the trie, we reach word sequences spelled out by the numbers along that path. Classic trie data structures have edges labeled with letters to store prefixes of strings. But for this application, we used a compressed trie that has only 10 possible branches at each node instead of 26, since the digits 0-9 represent the 26 letters, space and symbols. Because of this, an extra layer of complexity is needed to figure out the string represented by a path.
For more information on the trie data structure, here is a link to the Wikipedia article. Additionally, here is additional information on implementing the trie for this assignment.
Stanford CS Library, Pointers and Memory
Stanford CS Library, Linked Lists
Stanford CS Library, Binary Trees
Implement in C a program called t9
. The command
t9 FILE
should read in a dictionary file (FILE
) that contains a list
of words. Translate each word in the dictionary into its numeric key sequence,
then add the
key sequence to your trie, with the word at the end of the path corresponding
to the digits. If a word with the same numeric sequence already exists in
the trie, add the new word to the trie as a link to a new node with an edge
labeled '#' instead of one of the digits 2-9. (The words linked from a
node by the '#' edges essentially form a linked list of words that have the
same numeric code.)
For example, if the program reads the set of words "jello","rocks",
and "socks" from
the dictionary and adds it to an empty trie, the resulting trie should look
like this:
Once your program has read the dictionary and built the trie containing the words in it, it should start an interactive session. The user should be able to type numbers and the program should print out the word corresponding to the sequence of numbers entered. Your program should use the numbers typed by the user to traverse the trie that has already been created, retrieve the word, and print it to the screen. If the user then enters '#', the program should print the next word in the trie that has the same numeric value, and so forth. The user can also type a number followed by one or more '#' characters - this should print the same word that would be found by typing the number and individual '#' characters on separate input lines.
As an example, if we run the program using the above trie, an interactive session might look like this:
Enter "exit" to quit. Enter Key Sequence (or "#" for next word): > 76257 'rocks' Enter Key Sequence (or "#" for next word): > # 'socks' Enter Key Sequence (or "#" for next word): > 53556 'jello' Enter Key Sequence (or "#" for next word): > # There are no more T9onyms Enter Key Sequence (or "#" for next word): > 76257# 'socks' Enter Key Sequence (or "#" for next word): > 76257## There are no more T9onyms >4423 Not found in current dictionary. >exit
The interactive session should terminate either when the user enters the word "exit" or when the end-of-file is reached on the interactive input (indicated by typing control-D on the keyboard).
Note: Make sure your program properly handles the case if the user types more "#"s than there are T9onyms for a particular number.
We provide you with two text files, smallDictionary.txt
and dictionary.txt
(right-click on the links to download the files). Each of these text
files contains a list of words to be used in constructing a trie
- the small one primarily for testing, and the large one for the
final program. Translate each
word in the file into its associated T9 key
sequence, and add the word to the trie. In the case of multiple words having
the same key sequence k,
let the first word encountered in the text file be represented by the key sequence
k, the next encountered represented by k#, the next k##,
etc. For example, 2273 can represent acre, bard, bare, base, cape, card,
care, or case. To disambiguate, acre would be represented by 2273, bard by 2273#, bare by 2273##, and so forth. When a user inputs a key sequence,
print the appropriate word.
Your trie data structure should contain nodes to represent the tree, and strings (char arrays) containing copies of the words read from the dictionary file, linked to appropriate nodes in the trie.
Besides the general specification given above, your program should meet the following requirements to receive full credit.
Makefile
and use make
to compile your program. Your Makefile
should only recompile the necessary part(s) of the program after changes are made.malloc
to dynamically allocate the nodes, strings, and
any other data that make up your trie.malloc
and
return the storage with free
when you are done with it. The amount allocated should be based on the
actual size needed, not some arbitrary size that is assumed to be "large enough".stderr
and terminate if there
is no further work to be done.free
everything acquired with malloc
).
This should be done explicitly without relying on the operating system to clean
up after the program finishes.gcc -Wall
on klaatu
or the CSE Linux VM. Your program should
build without errors when
make
is used to run your Makefile
. You are, of course,
free to use other systems for development, and you should be fine as long as
you have a relatively recent version of gcc
, but we will test the code on the
CSE Linux machines.valgrind
. (Warning: valgrind
slows down execution considerably. It will take several minutes to load the full dictionary.txt
file and then free the resulting tree under valgrind
. We suggest you use smaller input files during development to test for memory problems with valgrind
.)As with any program you write, your code should be readable and understandable to anyone who knows C. In particular, for full credit your code must observe the following requirements.
.h
) file for the trie (and any other header files) should only
declare items that are shared between client programs that use the header and
the file(s)
that implement it. Don't include in the header file implementation details
that should be hidden. Be sure to use the standard #ifndef
preprocessor
trick so your header files work properly if included more than once in a source
file,
either directly or indirectly.void
functions that perform an action
without returning a value. Variables of local significance like loop counters
or indices should be given simple names like i,
n
,
or p
and do not require further comments. Avoid names like count
or flag
that
provide no useful information - use names that suggests the values being counted
or
the condition that is represented.malloc
or free
-
these are expensive. Don't make unnecessary copies of large data structures;
use pointers. (Copies of
ints, pointers, and similar things are cheap; copies of large arrays
and structs are expensive.)struct
s) on paper or on a whiteboard.
Be sure you understand what you are trying to do before you start typing.printf
and
gdb
are your friends here to examine values while debugging.gdb
is your friend.if-elseif-else
statements, but another way to do
it is to have an
array with one entry for each possible character. In the entry for each character,
store the corresponding digit code. Then you can look up the code for a
character without a complicated nest of if
statements. (Recall
that in C, as in Java and other languages, a character can be
used as a small integer. That is particularly helpful for exactly this kind
of application, where we might like to use a character value as an index
into a table of data.)The sequences below can be used to validate your trie against the given dictionary.
A small amount of extra credit will be awarded for adding the following extensions to an already complete, working assignment. No extra credit will be awarded if the basic program is not fully implemented and substantially bug-free (see the Extra Credit Policy in the Syllabus). You should also turn in your working assignment before attempting any extra credit, and turn in a second archive containing the enhanced program later (see instructions below).
freq_Dictionary.txt
(right-click
to download) contains a list of words with, in addition, the frequency
of each word in ordinary
text. Use the information
in this file to construct your trie so the most likely words are reached
first. Warning: the data in this file has some problems, like duplicate
entries for some words. You will need to figure out how best to use this
data, or find different data that is more helpful.
If you include extensions in your program, you should also include a README
file
that describes what you added and how to demonstrate your addition(s) when
your program is executed.
klaatu
and the CSE virtual machine) using gcc
with the -Wall
option.make
is used to run your Makefile
.
Identifying information including
Create an uncompressed tar
file named hw5.tar
containing your
source code and Makefile
,
and submit that using the normal class dropbox.
If you do any extra credit
extensions later, create a second tar
file named hw5-extra.tar
containing the
extra-credit version of your source code, Makefile
,
and the README
file
describing your extensions. Turn in that second archive in addition to the
first archive containing containing the solution to the main part of the assignment.