CSE 374 16wi - Homework 4
Due: Thursday, Feb. 11, at 23.00
Assignment goals
In this assignment, you will develop a more complex program using dynamic
data structures. In doing so you will:
- Gain experience developing a larger system one part at a
time, testing as you go.
- Learn about the trie data structure, a version of
a search tree.
- Gain experience working with trees, structs, and dynamically allocated data.
- Gain more experience reading and processing text files in C.
- Practice writing simple Makefiles.
Please use the
discussion board to ask questions.
cplusplus.com is a great resource for both C and C++.
T9
Overview
In this assignment, you will build programs to implement T9
predictive text, a text input mode available on many cell
phones and keypads. Each number from 2-9 on the keypad
represent three or four letters, the number 0 represents a
space, and 1 represents a set of symbols such as { , . ! ? }
etc. The numbers from 2-9 represent letters as follows:
2 ABC
3 DEF
4 GHI
5 JKL
6 MNO
7 PQRS
8 TUV
9 WXYZ
Since multiple letters map to a single number, many key
sequences represent multiple words. For example, the input
2665 represents "book" and "cool", among other
possibilities.
To translate from number sequences to words, you will use a
data structure known as a trie. A trie is a multiway branching
structure (tree) that stores the prefixes of sequences. As you
travel down a path in the trie, you reach word sequences
spelled out by the numbers along that path. Classic trie data
structures have edges labeled with letters to store prefixes
of strings. But for this application, you will use a
compressed trie that has only 9 possible branches at each node
instead of 26, since the digits 2-9 represent the 26 letters
plus an additional branch for words with the same numeric
sequence. Because of this, an extra layer of complexity is
needed to figure out the string represented by a path.
For more information on the trie data
structure, here
is a link to the Wikipedia article.
Specification
Implement in C a program t9
. The command
t9 FILE
should read in a dictionary file (FILE
) that
contains a list of words. Translate each word in the
dictionary into its numeric key sequence, then add the key
sequence to your trie, with the word at the end of the path
corresponding to the digits. If a word with the same numeric
sequence already exists in the trie, add the new word to the
trie as a link to a new node with an edge labeled '#' instead
of one of the digits 2-9.
For example, if the program reads the set of words
"bat","cat", and "hawk" from the dictionary and adds it to an
empty trie, the resulting trie should look like this:
Once your program has read the dictionary and built the trie
containing the words in it, it should start an interactive
session. The user should be able to type numbers and the
program should print out the word corresponding to the
sequence of numbers entered. Your program should use the
numbers typed by the user to traverse the trie that has
already been created, retrieve the word, and print it to the
screen. If the user then enters '#', the program should print
the next word in the trie that has the same numeric value, and
so forth. The user can also type a number followed by one or
more '#' characters - this should print the same word that
would be found by typing the number and individual '#'
characters on separate input lines.
As an example, if we run the program using the above trie,
an interactive session might look like this:
Enter "exit" to quit.
Enter Key Sequence (or "#" for next word):
> 228
'bat'
Enter Key Sequence (or "#" for next word):
> #
'cat'
Enter Key Sequence (or "#" for next word):
> 4295
'hawk'
Enter Key Sequence (or "#" for next word):
> #
There are no more T9onyms
Enter Key Sequence (or "#" for next word):
> 228#
'cat'
Enter Key Sequence (or "#" for next word):
> 228##
There are no more T9onyms
> 4423
Not found in current dictionary.
> exit
The interactive session should terminate either when the user
enters the word "exit" or when the end-of-file is reached on
the interactive input (indicated by typing control-D on the
keyboard).
Note: Make sure your program properly handles the case if the
user types more "#"s than there are T9onyms for a particular
number.
We provide you with two text
files, smallDictionary.txt
and dictionary.txt
. Each
of these text files contains a list of words to be used in
constructing a trie - the small one primarily for testing, and
the large one for the final program. Translate each word in
the file into its associated T9 key sequence, and add the word
to the trie. In the case of multiple words having the same key
sequence k, let the first word encountered in the
text file be represented by the key sequence
k, the next encountered represented by k#,
the next k##, and so on. For example, 2273 can
represent acre, bard, bare, base, cape, card, care, or
case. To disambiguate, acre would be represented by 2273, bard
by 2273#, bare by 2273##, and so forth. When a user inputs a
key sequence, print the appropriate word.
Your trie data structure should contain nodes to represent the
tree, and strings (char arrays) containing copies of the words read
from the dictionary file, linked to appropriate nodes in the trie.
Technical Requirements
Besides the general specification given above, your program
should meet the following requirements to receive full
credit.
- You should create a
Makefile
and
use make
to compile your
program. Your Makefile
should only recompile
the necessary part(s) of the program after changes are
made.
- Use
malloc
to dynamically allocate the nodes,
strings, and any other data that make up your trie.
- If you need to create a copy of a string or other
variable-size data, you should dynamically allocate an
appropriate amount of storage using
malloc
and
return the storage with free
when you are done
with it. The amount allocated should be based on the actual
size needed, not some arbitrary size that is assumed to be
"large enough".
- Use standard C library functions where possible; do not
reimplement operations available in the standard
libraries.
- You must check the return status (result code) of every
library function you call to be sure that no errors
occurred. In particular,
malloc
will
return NULL
if it is unable to allocate
storage. Although this is extremely unlikely to happen, a
robust program must check for the possibility and react
appropriately if it does.
- If an error occurs when opening or reading a file, the
program should write an appropriate error message
to
stderr
and terminate if there is no further
work to be done.
- Before the program terminates, all dynamically allocated
data must be properly freed (i.e.,
free
everything acquired with malloc
). This should
be done explicitly without relying on the operating system
to clean up after the program finishes.
- Your code must compile and run without errors or warnings
when compiled with
gcc -Wall
on klaatu or the
CSE Linux VM. Your program should build without errors
when make
is used to run
your Makefile
. You are, of course, free to use
other systems for development, and you should be fine as
long as you have a relatively recent version of gcc. But we
will test the code on the CSE Linux machines.
- Your program should terminate cleanly with no memory leaks
or other memory errors reported when it is run
using
valgrind
. (Warning: valgrind
slows down execution considerably. It may take several
minutes to load the full dictionary.txt file and then free
the resulting tree under valgrind
. We suggest
you use smaller input files during development to test for
memory problems with valgrind
.) If memory
leaks are
detected, valgrind
's --leak-check=full
option will be useful to generate more extensive messages
with information about the memory leaks.
Implementation Advice
- There are a lot of things to get right here; the job may
seem overwhelming if you try to do it all at once. But if
you break it into small tasks, each one of which can be done
individually by itself, it should be quite manageable. For
instance, figure out how to add a single word to the trie
before you implement the logic to process all the words in
the dictionary. Figure out how to add a few words that have
different numeric codes before you handle words that have
the same codes. Implement the code to traverse the trie to
translate an input key sequence into the corresponding word
once you've built the trie, not before.
- Before you start typing code into the computer, spend some
time sketching out data structures and code (particularly
trie
struct
) on paper or on a whiteboard. Be
sure you understand what you are trying to do before you
start typing.
- Every time you add something new to your code (see hint
#1), test it. Right Now! It is much easier
to find and fix problems if you can isolate the potential
bug to a small section of code you just added or
changed.
gdb
and printf
are your
friends here to examine values while debugging.
- You will probably find it very useful to include code that
can print the contents of the trie in some understandable
format. This is not required, but how can you be sure your
code is correct if you can't look at the trie that is built
for a small set of input data?
- Start with a small data file and figure out in advance
what the resulting trie should look like. Then verify that
the program does, in fact, create that trie.
gdb
is your friend.
- To build the trie, you need some way to translate
characters (primarily letters) from words in the dictionary
file to the corresponding keypad digits. It is probably a
great idea to include in your code a function that takes a
character as an argument and returns the corresponding
digit. This can be implemented with a series
of
if-elseif-else
statements, but another way
to do it is to have an array with one entry for each
possible character. In the entry for each character, store
the corresponding digit code. Then you can look up the code
for a character without a complicated nest
of if
statements. (Recall that in C, as in Java
and other languages, a character can be used as a small
integer. That is particularly helpful for exactly this kind
of application, where we might want to use a character value
as an index into a table of data.)
- Be sure to check for errors like trying to open a
nonexistent file to see if your error handling is working
properly.
Test Sequences:
The sequences below can be used to validate your trie against the given dictionary.
- 22737: acres, bards, barer,bares,barfs,baser,bases,caper,capes,cards,carer,cares,cases
- 46637: goner,goods,goofs,homer,homes,honer,hones,hoods,hoofs,inner
- 2273: acre, bard,bare,barf,base,cape,card,care,case
- 729: paw,pax,pay,raw,rax,ray,saw,sax,say
- 76737: popes,pores,poser,poses,roper,ropes,roses,sords,sorer,sores
Code Quality Requirements
As with any program you write, your code should be readable
and understandable to anyone who knows C. In particular, for
full credit your code must observe the following
requirements.
- Divide your program into suitable source files (at least
two) and functions, each of which does a single well-
defined aspect of the assignment. For example, there should
almost certainly be a header and source file for the trie
data structure and the operations needed on it (create a new
empty trie, insert a word, search, delete the trie,
etc.). Your program most definitely may not consist of one
huge main function that does everything.
- The header (
.h
) file for the trie (and any
other header files) should only declare items that are
shared between client programs that use the header and the
file(s) that implement it. Don't include in the header file
implementation details that should be hidden. Be sure to use
#ifndef
so your header files work properly if
included more than once in a source file, either directly or
indirectly.
- Be sure to include appropriate function prototypes near
the beginning of each source file for functions whose
declarations are not included in a header file.
- Comment sensibly, but not excessively. You should not use
comments to repeat the obvious or explain how the C language
works - assume that the reader knows C at least as well as
you do. Your code should, however, include the following
minimum comments:
- Every function must include a heading comment that
explains what the function does (not how it does it),
including the significance of all parameters. It must
not be necessary to read the function code to determine
how to call it or what happens when it is called. (But
these comments do not need to be nearly as verbose as,
for example JavaDoc comments.)
- Every significant variable must include a comment that
is sufficient to understand the information in the
variable and how it is stored. It must not be necessary
to read code that initializes or uses a variable to
understand this.
- Every source file should begin with a comment
identifying the file, author, and purpose (i.e., the
assignment or project).
- Use appropriate names for variables and functions: nouns
or noun phrases suggesting the contents of variables and the
results of value-returning functions; verbs or verb phrases
for
void
functions that perform an action
without returning a value. Variables of local significance
like loop counters or indices should be given simple names
like i
, n
, or p
and
do not require further comments. Avoid names
like count
or flag
that provide no
useful information - use names that suggests the values
being counted or the condition that is represented.
- No global variables. Use parameters (particularly
pointers) appropriately. Exception: it is may be appropriate
to use global variables for constant data like translation
tables if the program is better structured this way.
- No unnecessary computation or excessive use
of
malloc
or free
- these are
expensive. Don't make unnecessary copies of large data
structures; use pointers. (Copies of ints, pointers, and
similar things are cheap; copies of large arrays and structs
are expensive.)
As with the previous assignment, we strongly suggest that you
use the clint.py style checker
(right-click to download, and chmod +x
to make it
executable) to review your code. While this checker may flag
a few things that you wish to leave as-is, most of the things
it catches, including whitespace errors in the code, should be
fixed. We will run this style checker on your code to check
for any issues that should have been fixed.
Extra Credit
Extra credit will be awarded for adding the following
extensions to an already complete, working assignment. You
should also turn in your working assignment before attempting
any extra credit, and turn in the enhanced program later (see
instructions below).
- Add functionality to allow users to give a prefix of a
word as input instead of requiring entry of complete
words. When a '#' is entered, if the numbers entered so far
are only the prefix of a word, the program should print a
word that begins with this prefix. (e.g., if '22' is the
input and the user enters '#', the output might be cab; if
the user types "#" again, the output might be cap; "##" :
car, etc., depending on the input dictionary)
- Store the words in the trie so that if a numeric sequence
matches several possible words, the most likely word is
presented first, based on how frequently different words
with the same numeric sequence actually appear in English
text. The data
file
freq_Dictionary.txt
contains a list of words with, in addition, the frequency of
each word in ordinary text. Use the information in this file
to construct your trie so the most likely words are reached
first. Warning: the data in this file has some problems,
like duplicate entries for some words. You will need to
figure out how best to use this data, or find different data
that is more helpful.
- Dynamically update the frequencies of words in the
trie. The idea here is that if someone uses a particular
word often, it should be presented first before other words
that have the same numeric code. To do this, you need to
rearrange the trie as the program runs so that frequently
used words move to locations higher up in the trie. Or you
might want to change the trie so that words with the same
numeric spelling are stored as a linked list anchored at a
single trie node that represents that sequence of numbers,
and move words to the front of their linked list when they
are used.
- Feel free to experiment with additional extensions.
If you include extensions in your program, you should also
include a extra_credit.txt
file that describes
what you added and how to demonstrate your addition(s) when
your program is executed.
What to turn in
Use
the turn-in
drop box to submit the source code to your program as well
as your Makefile
. If you do any extra credit
extensions later, create a tar file
named hw5-extra.tar
containing the extra-credit
version of your source code, Makefile
, and
the extra-credit.txt
file describing your
extensions. Turn in that archive in addition to the solution to
the main part of the assignment.