CSE 374, Lecture 14: Data Structures

Malloc and variables

I quickly summarized what we've learned so far about different kinds of variables:

Global variables are declared outside of functions and they are allocated in the "globals" section of the address space.
Local variables are declared within functions, only live within their scope, and are allocated in the "stack" section of the address space.
Memory blocks can be "reserved" using malloc, and these memory blocks live within the "heap" section of the address space.

I then posed a question: in the following example, what kind of variable is "x"?

    int* x = (int*) malloc(sizeof(int));

"x" is in fact a LOCAL variable! This is an important distinction when we use malloc. While malloc reserves a block of memory for us on the heap, the only way we have of accessing that block (i.e. getting its address) is through the LOCAL pointer "x". x stores the address of the memory block, and x itself lives on the STACK. That means x can go out of scope when the function returns! If we don't free the block, save its address somewhere, or return it before then, the address is lost and we have created a "memory leak": the block stays reserved on the heap, but nothing can reach it or free it anymore.

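To make the leak concrete, here is a minimal sketch (not code from lecture; the function names leaky and make_int are made up for illustration). The first function loses the only copy of the address when it returns, while the second hands the address back to the caller.

```c
#include <stdlib.h>

/* Leaks: the only copy of the heap address is the local pointer x,
   and x disappears when the function returns. */
void leaky(void) {
  int* x = (int*) malloc(sizeof(int));
  if (x != NULL) {
    *x = 374;
  }
}  /* x is gone here, so the block can never be freed */

/* No leak (yet): the address is returned, so the caller can still
   reach the block and becomes responsible for freeing it. */
int* make_int(int value) {
  int* x = (int*) malloc(sizeof(int));
  if (x != NULL) {  /* malloc returns NULL if it cannot reserve memory */
    *x = value;
  }
  return x;
}
```

A caller would write something like "int* p = make_int(42);" and later "free(p);". Every call to leaky, on the other hand, strands one int-sized block on the heap.
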
Using free

I proposed that we try removing a node from the linked list that we built last lecture. See the updated code on the course website. Note that WE MUST FREE ANY NODES THAT WE REMOVE! Unlinking a node from the list only makes it unreachable; the memory it occupies stays reserved on the heap until we call free on it.
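
The code on the course website is the reference. What follows is only a rough sketch of the unlink-then-free idea, and it assumes a hypothetical node type (struct Node) and hypothetical helper names (push_front, remove_value, free_list) that may not match the lecture code.

```c
#include <stdlib.h>

/* Hypothetical node type; the lecture version may differ. */
struct Node {
  int data;
  struct Node* next;
};

/* Adds a node at the front; returns the new head
   (or the old head unchanged if malloc fails). */
struct Node* push_front(struct Node* head, int data) {
  struct Node* n = (struct Node*) malloc(sizeof(struct Node));
  if (n == NULL) {
    return head;
  }
  n->data = data;
  n->next = head;
  return n;
}

/* Removes the first node whose data equals target, frees it,
   and returns the (possibly new) head of the list. */
struct Node* remove_value(struct Node* head, int target) {
  struct Node** link = &head;  /* the pointer we may need to rewrite */
  while (*link != NULL) {
    if ((*link)->data == target) {
      struct Node* doomed = *link;
      *link = doomed->next;    /* unlink the node from the list... */
      free(doomed);            /* ...then give its memory back */
      break;
    }
    link = &(*link)->next;
  }
  return head;
}

/* Frees every node that is still in the list. */
void free_list(struct Node* head) {
  while (head != NULL) {
    struct Node* next = head->next;  /* save before freeing */
    free(head);
    head = next;
  }
}
```

Note the order inside remove_value: we copy the address into "doomed" before rewriting the link, because once the link is rewritten that local copy is the only way left to reach the node.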

Binary trees

Another type of data structure that we learned about in the intro computer science courses is the binary tree. A binary tree looks something like this:

              ---
             | 5 |
              ---
               |
          -----------
         |           |
        ---         ---
       | 2 |       | 7 |
        ---         ---
         |           |
       -----       --
      |     |     |
     ---   ---   ---
    | 1 | | 3 | | 6 |
     ---   ---   ---

We refer to the "5" as the "root" of the tree. We used trees for a variety of things including sorting and compression.

How might we represent the node of a binary tree in C? It looks very similar to a linked list node, except that each node can point to TWO next nodes:

    struct BinaryTreeNode {
      int data;
      struct BinaryTreeNode* left;
      struct BinaryTreeNode* right;
    };
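
As a rough sketch of how this struct might be used, the code below builds nodes on the heap, counts them, and frees them. The helper names make_node, tree_size, and free_tree are invented for this example and are not from lecture.

```c
#include <stdlib.h>

struct BinaryTreeNode {
  int data;
  struct BinaryTreeNode* left;
  struct BinaryTreeNode* right;
};

/* Allocates a node with no children; returns NULL if malloc fails. */
struct BinaryTreeNode* make_node(int data) {
  struct BinaryTreeNode* node =
      (struct BinaryTreeNode*) malloc(sizeof(struct BinaryTreeNode));
  if (node != NULL) {
    node->data = data;
    node->left = NULL;
    node->right = NULL;
  }
  return node;
}

/* Counts the nodes in the tree rooted at node. */
int tree_size(const struct BinaryTreeNode* node) {
  if (node == NULL) {
    return 0;
  }
  return 1 + tree_size(node->left) + tree_size(node->right);
}

/* Frees both subtrees before the node itself, so we never read
   a child pointer out of memory that has already been freed. */
void free_tree(struct BinaryTreeNode* node) {
  if (node == NULL) {
    return;
  }
  free_tree(node->left);
  free_tree(node->right);
  free(node);
}
```

Building the tree from the diagram would start with "root = make_node(5);" followed by "root->left = make_node(2);" and so on. For that six-node tree, tree_size(root) returns 6.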

N-ary trees

Binary trees are just one type of tree! What could we do if we wanted a ternary tree (3 children per node, sometimes called "trinary") or a quad tree (4 children per node)? We could write our ternary tree node like this:

    struct TrinaryTreeNode {
      int data;
      struct TrinaryTreeNode* left;
      struct TrinaryTreeNode* middle;
      struct TrinaryTreeNode* right;
    };

But then how would we write a quad tree? Do we use names "left", "middle-left", "middle-right", and "right"? That is quickly getting unwieldy, and the names are going to be far too long for an 8-tree or a 10-tree! Instead, we can use an array of children to avoid having to name them!

    struct OctoTreeNode {
      int data;
      struct OctoTreeNode* children[8];
    };

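We would then refer to the children using children[n], where n runs from 0 to 7 because C arrays are indexed from zero. A natural convention is to set a slot to NULL when that child is absent.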

We call these types of higher-order trees "N-ary" trees.
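
The payoff of the array is that one loop handles every child, no matter how many there are. Here is a minimal sketch under that assumption; NUM_CHILDREN, make_octo_node, count_nodes, and free_octo_tree are names made up for this example.

```c
#include <stdlib.h>

#define NUM_CHILDREN 8

struct OctoTreeNode {
  int data;
  struct OctoTreeNode* children[NUM_CHILDREN];
};

/* Allocates a node whose child slots are all NULL (no children yet). */
struct OctoTreeNode* make_octo_node(int data) {
  struct OctoTreeNode* node =
      (struct OctoTreeNode*) malloc(sizeof(struct OctoTreeNode));
  if (node == NULL) {
    return NULL;
  }
  node->data = data;
  for (int i = 0; i < NUM_CHILDREN; i++) {
    node->children[i] = NULL;
  }
  return node;
}

/* Counts the nodes in the tree rooted at node. */
int count_nodes(const struct OctoTreeNode* node) {
  if (node == NULL) {
    return 0;
  }
  int total = 1;
  for (int i = 0; i < NUM_CHILDREN; i++) {
    total += count_nodes(node->children[i]);
  }
  return total;
}

/* Frees every child subtree, then the node itself. */
void free_octo_tree(struct OctoTreeNode* node) {
  if (node == NULL) {
    return;
  }
  for (int i = 0; i < NUM_CHILDREN; i++) {
    free_octo_tree(node->children[i]);
  }
  free(node);
}
```

Changing NUM_CHILDREN to 3, 4, or 26 would give a different N-ary tree without touching any of the loops.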

Representing a dictionary

How might you use data structures from the intro computer science courses to represent a dictionary, as in a collection of valid words in a language? We brainstormed a few ways:

a list or array
a set
a map, where the key is the word and the value is the definition

For any of these data structures, we calculated that if there are 100,000 valid words and the average word length is 5 characters, we would need at least 500,000 bytes (100,000 words times 5 one-byte characters) just to store the letters of all the words. That's a lot of bytes, and we can do better using an n-ary tree.

We could store a dictionary as an n-ary tree with the root being the beginning of the word and each level of the tree being an additional character in the word. In this example, we store "ant", "cat", and "try".

                            root
                            ---
                           |   |
                            ---
                             |
            ---------------------------------------------------------------
           |             |        | | | | |              |         | | | | |
          ---           ---          ...                ---           ...
         | a |         | c |                           | t |
          ---           ---                             ---
           |             |                               |
     ------------      ----------                    ---------------------
    | |   |  | | |    |    | | | |                  | | | |     |   | | | |
    ...  ---  ...    ---     ...                      ...      ---    ...
        | n |       | a |                                     | r |
         ---         ---                                       ---
          |           |                                         |
     -------------   ---------------------      -----------------------------
    | | |   |   | | | | | | | |   |   | | |    | | | | | | | | | | | | | |   |
     ...   ---  ...    ...       ---   ...                 ...              ---
          | t |                 | t |                                      | y |
           ---                   ---                                        ---

We call this type of data structure a "trie" or a "prefix tree", because each path from the root down to some node in the tree represents a "prefix" or start of a word. It is a 26-ary tree, because there are potentially 26 characters that can follow at each position within a word.

One note: what is the actual data stored in each node of the trie? We'll need to store whether the path that ends at that node spells a complete word, because words can be prefixes of each other! For instance, "an" is a word, but "ant" is also a word. Therefore the "n" node should store a boolean (an int in C) that records whether or not "an" is a valid word.
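
Here is one way the letter trie described above could be sketched in C. This is an illustration, not a prescribed design: the names TrieNode, is_word, trie_make_node, trie_insert, trie_contains, and trie_free are all invented here, and the code assumes lowercase words in an ASCII-style character set where 'a' through 'z' are consecutive.

```c
#include <stdlib.h>

#define ALPHABET_SIZE 26

struct TrieNode {
  int is_word;  /* 1 if the path from the root to here spells a word */
  struct TrieNode* children[ALPHABET_SIZE];  /* slot 0 is 'a', slot 25 is 'z' */
};

/* Allocates an empty node: not a word, no children. */
struct TrieNode* trie_make_node(void) {
  struct TrieNode* node = (struct TrieNode*) malloc(sizeof(struct TrieNode));
  if (node == NULL) {
    return NULL;
  }
  node->is_word = 0;
  for (int i = 0; i < ALPHABET_SIZE; i++) {
    node->children[i] = NULL;
  }
  return node;
}

/* Adds word to the trie. Returns 1 on success, or 0 if the word has a
   character outside 'a'..'z' or if malloc fails. */
int trie_insert(struct TrieNode* root, const char* word) {
  struct TrieNode* node = root;
  for (const char* p = word; *p != '\0'; p++) {
    if (*p < 'a' || *p > 'z') {
      return 0;
    }
    int i = *p - 'a';
    if (node->children[i] == NULL) {
      node->children[i] = trie_make_node();
      if (node->children[i] == NULL) {
        return 0;
      }
    }
    node = node->children[i];
  }
  node->is_word = 1;  /* mark the end of a complete word */
  return 1;
}

/* Returns 1 if word was inserted as a complete word, 0 otherwise. */
int trie_contains(const struct TrieNode* root, const char* word) {
  const struct TrieNode* node = root;
  for (const char* p = word; *p != '\0'; p++) {
    if (*p < 'a' || *p > 'z') {
      return 0;
    }
    node = node->children[*p - 'a'];
    if (node == NULL) {
      return 0;  /* no path for this prefix */
    }
  }
  return node->is_word;
}

/* Frees every node in the trie rooted at node. */
void trie_free(struct TrieNode* node) {
  if (node == NULL) {
    return;
  }
  for (int i = 0; i < ALPHABET_SIZE; i++) {
    trie_free(node->children[i]);
  }
  free(node);
}
```

After inserting "an" and "ant", trie_contains reports both as words, but it reports "a" as not a word: the "a" node exists as a prefix, yet its is_word flag was never set. That flag is exactly the boolean discussed above.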

Tries have many uses:

autocomplete (e.g. what Google does)
spell checking
data compression
searching for strings

Introducing HW5

In HW5, you will implement a different type of trie: a T9 trie.

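T9 is that old predictive-text way of typing words on a telephone keypad: each letter is entered by pressing, once, the number key that it is printed on, so every word becomes a sequence of numbers: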

     --------------------
    |  1   |  2   |  3   |
    |      | ABC  | DEF  |
    |--------------------|
    |  4   |  5   |  6   |
    | GHI  | JKL  | MNO  |
    |--------------------|
    |  7   |  8   |  9   |
    | PQRS | TUV  | WXYZ |
    |--------------------|
    |  *   |  0   |  #   |
    |      |      |      |
     --------------------

The word "exam" would be represented as "3926".

The same number sequence can represent multiple words. "227" can represent "bar", "car", and "cap".
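
The letter-to-number mapping can be sketched as a small lookup table. This is only an illustration of the keypad above, not the HW5 design; the names t9_digit and t9_encode are invented, and the code assumes lowercase input in an ASCII-style character set.

```c
#include <stddef.h>

/* Returns the keypad digit (as a character) for a lowercase letter,
   or '\0' if the character is not in 'a'..'z'. */
char t9_digit(char letter) {
  /* One entry per letter: index 0 is 'a', index 25 is 'z'. */
  static const char digits[] = "22233344455566677778889999";
  if (letter < 'a' || letter > 'z') {
    return '\0';
  }
  return digits[letter - 'a'];
}

/* Writes the digit sequence for word into out, which has room for
   out_size chars including the terminator. Returns 1 on success, or 0
   if the word has a non-lowercase character or does not fit. */
int t9_encode(const char* word, char* out, size_t out_size) {
  size_t i = 0;
  for (; word[i] != '\0'; i++) {
    if (i + 1 >= out_size) {
      return 0;  /* no room left for this digit plus the terminator */
    }
    char d = t9_digit(word[i]);
    if (d == '\0') {
      return 0;
    }
    out[i] = d;
  }
  if (i >= out_size) {
    return 0;  /* only reachable when out_size is 0 */
  }
  out[i] = '\0';
  return 1;
}
```

Encoding "exam" with this sketch gives "3926", and "bar", "car", and "cap" all give "227", which matches the examples above and shows why one number sequence can stand for several words.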

In HW5, you will represent words in T9 as a trie. This will be similar to the dictionary trie we discussed above, except that each node will correspond to a keypad number instead of a letter! Take a look at the HW5 spec for a more detailed diagram.

I also gave a demo of the T9 program that you will be building.