Due: Wednesday, November 22, 2023, at 11:59pm
You may not drop this project assignment.
In this assignment, you will develop a more complex program using dynamic data structures. In doing so you will:
This is a significantly larger task than the previous programming assignment! Start early.
In this assignment, you will build programs to implement T9 predictive text,
a text input mode developed originally for cell phones and
still used for numeric keypads. Each number
from 2-9 on the keypad represents three or four letters, the number 0 represents
a space, and 1 represents a set of symbols such as { , . ! ? } etc. The numbers
from 2-9 represent letters as follows:
2 => ABC
3 => DEF
4 => GHI
5 => JKL
6 => MNO
7 => PQRS
8 => TUV
9 => WXYZ
Since multiple letters map to a single number, many key sequences represent multiple words. For example, the input 2665 represents "book" and "cool", among other possibilities.
2 6 6 5 2 6 6 5 a B c m n O m n O j K l OR a b C m n O m n O j k L | | | | | | | | ----------------------- ----------------------- | | book cool
To translate from number sequences to words, we will use a data structure known as a trie. A trie is a multiway branching structure (tree) that stores the prefixes of sequences. As we travel down a path in the trie, we reach word sequences spelled out by the numbers along that path. Classic trie data structures have edges labeled with letters to store prefixes of strings. But for this application, we use a compressed trie that has only 10 possible branches at each node instead of 26, since the digits 0-9 represent the 26 letters, space and symbols. Because of this, an extra layer of complexity is needed to figure out the string represented by a path.
(Actually, for our application, each node only needs 8 possible children numbered 2-9, since digits 0 and 1 don't encode letters. But writing the code might be easier if nodes have 10 children numbered 0-9, since then subtree number n corresponds to digit n. Feel free to use either representation for the trie depending on which seems simpler to implement.)
For more information on the trie data structure, here is a link to the Wikipedia article.
Implement a program t9
. You must use C for this assignment.
The command
$ ./t9 FILE
should read in a dictionary file (FILE
) that contains a list
of words. Translate each word in the dictionary into its numeric key sequence,
then add the
key sequence to your trie, with the word at the end of the path corresponding
to the digits. If a word with the same numeric sequence already exists in
the trie, add the new word to the trie as a link to a new node with an edge
labeled '#' instead of one of the digits 2-9. (The words linked from a
node by the '#' edges essentially form a "linked list" of words that have the
same numeric code, but we use additional tree nodes to link the words together
instead of defining a separate kind of linked-list node just for this situation.)
For example, if the program reads the set of words "jello","rocks",
and "socks" from
the dictionary and adds it to an empty trie, the resulting trie should look
like this:
Once your program has read the dictionary and built the trie containing the words in it, it should start an interactive session. The user should be able to type numbers and the program should print out the word corresponding to the sequence of numbers entered. Your program should use the numbers typed by the user to traverse the trie that has already been created, retrieve the word, and print it to the screen. If the user then enters '#', the program should print the next word in the trie that has the same numeric value, and so forth. The user can also type a number followed by one or more '#' characters - this should print the same word that would be found by typing the number and individual '#' characters on separate input lines.
As an example, if we run the program using the above trie, an interactive session might look like this:
Enter "exit" to quit. Enter Key Sequence (or "#" for next word): > 76257 'rocks' Enter Key Sequence (or "#" for next word): > # 'socks' Enter Key Sequence (or "#" for next word): > 53556 'jello' Enter Key Sequence (or "#" for next word): > # There are no more T9onyms Enter Key Sequence (or "#" for next word): > 76257# 'socks' Enter Key Sequence (or "#" for next word): > 2273 'acre' Enter Key Sequence (or "#" for next word): > # 'bard' Enter Key Sequence (or "#" for next word): > 76257## There are no more T9onyms Enter Key Sequence (or "#" for next word): > 2273#### 'base' Enter Key Sequence (or "#" for next word): > 4423 Not found in current dictionary. Enter Key Sequence (or "#" for next word): > exit
The interactive session should terminate either when the user enters the word "exit" or when the end-of-file is reached on the interactive input (indicated by typing control-D on the keyboard on a separate line by itself).
Note: The user of your program may not behave the way you expect - make sure your program can handle the case where "#" is entered but there are no more T9onyms, as well as cases where the user enters characters that aren't [2-9,#], or patterns with no word associated with them.
We provide you with two text files,
smallDictionary.txt
and
dictionary.txt
.
Each of these text files contains a list of words to be used in constructing
a trie - the small one primarily for testing, and the large one for the final
program. You may wish to create a third test file with only one or two words.
Translate each word in the file into its associated T9 key sequence, and add the word to the trie. In the case of multiple words having the same key sequence k, let the first word encountered in the text file be represented by the key sequence k, the next encountered represented by k#, the next k##, etc. For example, 2273 can represent acre, bard, bare, base, cape, card, care, or case. To disambiguate, acre would be represented by 2273, bard by 2273#, bare by 2273##, and so forth.
For the interactive session, use the user-entered pattern to traverse the trie and return the correct word.
Your trie data structure should contain nodes to represent the tree, and strings (char arrays) containing copies of the words read from the dictionary file, linked to appropriate nodes in the trie.
Besides the general specification given above, your program should meet the following requirements to receive full credit.
Makefile
and use make
to compile your program. Your Makefile
should only
recompile the necessary part(s) of the program after changes are
made. Your Makefile
should include a target clean
to remove previously generated files. You may also want to include a target
test
for your own use during development.malloc
to dynamically allocate the nodes, strings, and
any other data that make up your trie. Make sure you free all the memory
at the end of the run. We will use valgrind
to evaluate how well
your code manages memory.malloc
and release the storage with free
when you are done with it. The amount allocated should be based on the
actual size needed, not some arbitrary size that is assumed to be "large
enough".strncpy
instead
of strcpy
malloc
will return NULL
if it is unable to
allocate storage. Although this is extremely unlikely to happen, a
robust program must check for the possibility and react appropriately if
it does.stderr
and terminate if there
is no further work to be done.free
everything acquired with
malloc
). This should be done explicitly without relying on
the operating system to clean up after the program finishes. Remember
to close any files you have opened.gcc -Wall -std=c11
on Seaside.
Your program should build without errors when
make
is used to run your Makefile
. You are,
of course, free to use other systems for development, and you should be
fine as long as you have a relatively recent version of gcc. But we will
test the code on the CSE Linux machines.valgrind
. You must
load a dictionary and have a short interactive session while testing with
valgrind
.
(Warning: valgrind
slows down execution considerably. It
will take several minutes to load the full dictionary.txt file and then
free the resulting tree under valgrind
. We suggest you use
smaller input files during development to test for memory problems with
valgrind
.) If memory leaks are detected, compiling with the
-g
option and using
valgrind
's --leak-check=full
option will
generate more extensive messages with information about the
memory leaks. As with any program you write, your code should be readable and understandable to anyone who knows C. In particular, for full credit your code must observe the following requirements.
trienode.h
, trie.c
,
and tnine.c
, which give some guidance to segmenting the code.
Within a module/file your code should be divided into
functions, each of which has a single well-defined purpose..h
) file for the trie (and any other header
files) should only declare items that are shared between client programs
that use the header and the file(s) that implement it. Don't include in
the header file implementation details that should be hidden. The header
provided for you may not need to be edited, and already uses
the standard #ifndef
preprocessor trick so your header
files work properly if included more than once in a source file, either
directly or indirectly. You will want to look closely at what is included
in the header file.malloc
or free
- these are expensive. Don't make unnecessary
copies of large data structures; use pointers. (Copies of ints,
pointers, and similar things are cheap; copies of large arrays and
structs are expensive.)As with the previous assignment, you should use the cpplint.py style checker to review your code. While this checker may flag a few things that you wish to leave as-is, most of the things it catches, including whitespace errors in the code, should be fixed. We will run this style checker on your code to check for any issues that should have been fixed. Please use the class discussion board if you have questions about any of clint's warnings and whether they can be ignored.
cse374-materials
repository from gitlab. If you have not yet cloned the repository you may use the
command git@gitlab.cs.washington.edu:mh75/cse374-materials.git
.
If you already completed that step go to the directory and use
git pull
to get the most up-to-date files.trienode.h
,
trie.c
, and tnine.c
. These files are designed to get you
started. You should edit them to complete your program, and submit your completed
files for the assignment. The header file may not need to be edited, but the two .c
files will definitely need additions. You may use the enclosed comments for some
additional guidance.struct
s) on paper or on a whiteboard. Be sure you
understand what you are trying to do before you start typing.gdb
and printf
are
your friends here to examine values while debugging.gdb
is your friend.if-elseif-else
statements, but another way to do it is to
have an array with one entry for each possible character. In the entry
for each character, store the corresponding digit code. Then you can
look up the code for a character without a complicated nest of
if
statements. (Recall that in C, as in Java and other
languages, a character can be used as a small integer. That is
particularly helpful for exactly this kind of application, where we
might want to use a character value as an index into a table of
data.)The sequences below can be used to validate your trie against the supplied
dictionary.txt
file.
You will need to submit your Makefile and all source code required to make this project. We will be expecting new copies of the three files provided as starter code in addition to your Makefile.
Submit your files to the Gradescope assignment, which will be linked through Canvas. For this assignment we ask that you submit through Gitlab. To do this you will follow these steps:
There is an autograder provided which will perform basic tests on your code. You should not use this autograder as your primary test mechanism, but it will provide some immediate feedback. There will be additional manual grading of your assignment that focuses on the code quality issues described above.
This assignment will be worth 50 points total. You will be evaluated based on the overall performance of your program, as well as your attention to the code quality points above.