This homework implements a program that has the basic functionality of checking spelling of English text and outputting relevant statistics on the number of words, paragraphs, typos, etc.. This program is case insensitive, so it will recognize “Hello” in the input text as “hello” in the dictionary.
The flow of the program, in general, goes like this:
- Build a dictionary from a certain file that contains all the words we want in the dictionary.
- The input text is read and examined word by word, updating variables holding statistics and generating output about specific mistyped words along the way.
- Write the human-readable statistics to the output.
The provided code already has most of the functionality, but is full of nasty memory errors. And your job is to understand how memory is managed and used in the provided code, identify those errors, and come up with a solution to those errors.
Starter Code¶
File structure¶
Your hw4 folder contains the following files:
SpellChecker.h
, contains the declaration of functions in the Dictionary module.SpellChecker.c
, contains the implementation of functions in SpellChecker.h.count_typos.c
, the driver program.Utils.h
, contains the declaration of utility functions.Utils.c
, the implementation of utility functions.
The only files that you’ll need to modify and submit are:
SpellChecker.c
count_typos.c
Although there are only two files that contain the buggy code, you’ll still want to look into the header files to see what each function is supposed to do, and to decide how you may modify the code while satisfying the spec.
Tip
In the SpellChecker.h
, there is a line typedef char** Dictionary;
. We will talk more about typedef
on Wednesday/Friday (07/17 and 07/19), but all you need to know is that the Dictionary
is of type char**
.
We also provide the following files:
cpplint.py
: code style checker (see Style Errors for usage)data_files/
: this directory contains some test cases for you to runcount_typos
on, including test filespap.txt
,scramble.txt
, andsimple.txt
, and example correct outputpap.txt.typos
dict/dict0
anddict/dict1
: two dictionaries to testcount_typos
withsolution_binary
: a fixed, non-buggy version ofcount_typos
that you can use to check the program (see Homework Goals for more information)
Types in C¶
This homework involves usage of some types that we’d like to address before you look at the actual code:
On size_t
:¶
The type size_t
is conventionally used in C standard library functions to store a number big enough to fit the size of any object, as computed by calling sizeof
on a variable or type name. You can think of the name as “size type.” On the cancun system (64-bit) this will be a 64-bit unsigned integer.
On char*
, char**
, and binary search:¶
As we’ve seen in lectures 7-10, the type char*
is a pointer to a null-terminated array of char’s, which we also call a “string”. The naive implementation of our dictionary is simply a char**
: an array of char*
‘s. Each pointer in the array points to a string (char*
) containing a single word. And our check_spelling
operation is then a binary search on the array to see if a word exists. As such, all operations on our dictionary requires the user to provide both the dictionary itself and the size of the dictionary because we wouldn’t know how many char*
‘s the dictionary has or what the bounds of our binary search will be otherwise.
Extra Info on Types¶
If you’re interested in how the types work in the provided file Utils.c
, you can check out the Extras section, but this isn’t required.
Technical Requirements¶
Compile-time errors¶
To start this homework, simply cd
to the starter code folder and run this command:
make
And you’ll notice many compiler errors that come up.
To address this, you’ll need to add certain #include
statements to the appropriate places in Spellchecker.c
and count_typos.c
. You’ll want to include some mix of the following, which means you’ll need to figure out which ones to include. Here’s what we suggest doing for the various headers files you can include:
- C standard library header files: refer to man pages, online documentation, or the error message from
gcc
- Provided HW4 header files: read their doc comments and/or their code to see how each of the code files are connected.
It should be a relatively simple process to get the code to compile.
Note
The make
command is a build automation tool commonly used in software development. It reads a file called Makefile
that specifies a set of rules and dependencies, allowing developers to easily compile, test, and organize their code projects. We will learn more about this on Wednesday (07/17), but you shouldn’t need to know anything about it to complete HW4.
Runtime errors¶
Most of the work for this homework will happen here.
The provided code is designed to be infested with all kinds of memory errors that programmers can easily make due to human error. Your task will be to identify and get rid of them; the major, obvious places where you have to make an addition are marked with // TODO
but there are many more hidden ones that you’ll have to identify with gdb
or valgrind
.
To clarify, it is suggested that you remove the TODO comments once you’re done with them.
To approach this part of the homework, we have a few tips for getting started:
valgrind
is your absolute best friend for this assignment; however, the starter code as it is, makes many and aggressive memory errors that it can causevalgrind
to freeze/crash. This means that sometimes you’ll have toctrl-c
to terminate it and make use of whatever information that is printed on the screen.valgrind
options, such as--leak-check=full
,--show-leak-kinds=all
,--track-origins=yes
, are all very helpful but they do slow down the execution by a certain amount. Consider using them progressively, i.e., try runningvalgrind
without them, and only add each option incrementally ifvalgrind
doesn’t provide enough information to debug.- If you have a hard time locating the bug with the information given by
valgrind
, consider stepping through the program to the segment that causes the error usinggdb
, and pay extra attention whenever you see amalloc
operation or a write to heap memory.
Style errors¶
It is recommended that you look at the style errors after you’re done with all the other ones; because fixing style won’t involve any changes to the structure or the behaviors of the program. We didn’t deliberately include any style errors for you to fix, so this shouldn’t be a concern at all if you practice good coding habits.
Please use ./cpplint.py --clint *.c
to review your code. If this fails, you must call python3 explicitly: python3 ./cpplint.py --clint *.c
.
Homework Goals¶
In the end, the program needs to satisfy the following requirements:
- The code needs to compile without any warning by running
make
. - There should be no memory leaks.
- Your program’s behavior needs to match the behavior of the
solution_binary
given to you in the starter code.
Code Quality Requirements¶
This assignment will not be graded on code style, because 95% of the code is written by the staff and there are only a few lines of code that you need to change/add to make it work. However, we do still recommend putting short comments where you do make changes, to
- Help identify the part you changed, and
- Reinforce good coding practice.
Extras¶
All of the content in this section is not strictly required, but may be helpful.
More on Types in C¶
You only need to read the following if you want to understand Utils.c
.
On int
, char
(read if you’re interested in understanding Utils.c):¶
Recall that int
is a 32-bit/4-byte signed integer. char
may be considered a specialized type for character, but in reality it is the same as int
except that it’s shorter with only 8 bits. In fact, they’re completely interchangeable if the value doesn’t go past the bounds of an 8-bit space.
We can explore this through the following series of examples. Suppose that we have:
int i = 48;
char c = '0'; // the ascii value of '0' is decimal 48
The following statements will both print out “48”:
printf("%d\n", i); // %d is the placeholder for integer values
to be interpreted as decimal numbers
printf("%d\n", c); //
And the following will both print out “0”:
printf("%c\n", i); // %c is the placeholder for ascii characters
printf("%c\n", c);
Also, this statement, using int
s:
int i = '0';
is identical to this one:
int i = 48;
And so are these two, using char
s:
char c = '0';
char c = 48;
Ultimately, you can even do this:
char c = 48, d = 5;
c = c + d; // c is now decimal 53, which is the ascii value of the character '5'
With all of that said, it should be clear that int
and char
are identical in terms of binary representation and it’s up to the programmer how a specific instance of each is interpreted.
In Utils.c
, we chose to use int
to store the current character as we step down along the FILE*
stream; this is because we would like to be able to look at the current character and decide if it’s EOF
; it just so happens that EOF
is of type int.
(Alternatively, we can use feof()
to detect EOF
as well).