CSE 373, Spring 2004: Assignment 4

Title:Working With Web Words (V1.02)

Due Dates:

Part I -- Monday, May 10, in class; Part II -- Friday, May 14 at 5:00 PM

Work Modes: Part I -- Individual; Part II -- Choice of Individual or Partnership

Part I (Individual Work)

  • Show the result of inserting the key 49 into the AVL tree shown in Fig. 9.11b of the text (same in both 2nd and 3rd edition of Goodrich and Tamassia). Show any rotations involved.
  • Show the result of removing the key 62 from the (same) tree of Fig. 9.11b. Show any rotations involved.
  • Do exercise R-10.4 in the text (same in both editions). This is to prove that Mergesort is O(n log n) even if n is not a power of 2. Turn in your solutions in class on Monday, May 10.

    Part II (Choice of Individual or Partnership Work)

    (Partnerships are encouraged. You have no obligation to have the same partner as in Assignment 3.)

    Implement the following applet for the analysis of web pages using AVL trees. The applet class should be named AVLTreeApplet.

    A web page is normally written using HTML. As such, it is a mixture of normal text and HTML formatting commands.

    Feature 1: Using either your solution or the given solution to the Huffman Trees applet assignment as a starting point create an Applet that builds AVL trees, showing the tree after each insertion. Your applet should support commands RESET, FIND (String), and INSERT (String). Each node of the AVL tree will hold a key (in the form of a string) and a count (in the form of an integer). Here is a sample input sequence for Feature 1:

    RESET
    INSERT ant 5
    INSERT bear 2
    INSERT chinchilla 1
    INSERT dromedary 2
    FIND chinchilla ; returns chinchilla 1
    FIND egret ; returns NOT FOUND
    

    The AVL tree should be displayed on the screen using essentially the same display algorithm used for the Huffman trees in Assignment 3; Show, in each node, the word for that node and its number of occurrences.

    Feature 2a: Implement commands BEGIN-DOCUMENT and END-DOCUMENT so that your applet will accept a document as text pasted in the textarea of your applet. For example the following command sequence should build the same tree as does the sequence given above. Each word should be mapped to lower-case before it is inserted. Use a Java StringTokenizer object (as was used in Assignment 3) to find the words and eliminate the punctuation.

    RESET
    BEGIN-DOCUMENT
    Ant bear ant chinchilla dromedary,
    bear ant ant ant dromedary.
    END-DOCUMENT
    FIND chinchilla ; returns chinchilla 1
    

    Feature 2b: (optional for individuals but required for partnerships) strip off and discard any HTML tags in the text, leaving only the normal text. To eliminate a tag, first scan the string for a left angle bracket, and if found, scan for the next right angle bracket. Use the substring method of class String to extract the parts of the string you need and concatenate them into a string that leaves out the tag. Repeat appropriately to get rid of all the tags.

    Feature 3: Scan through the text (using a Java StringTokenizer object), and for each word encountered, process it using the AVL tree. Each word will be stored in the AVL tree together with a count of the number of its occurrences in the document.

    Here is how to process each word: Each time a word is encountered in the scan of the document, check to see if the word is already in the AVL tree by performing a FIND operation. If it is, increment the count of the number of occurrences of that word. If it is not in the tree, INSERT it into the tree with a count of 1.

    More Features for Partnerships

    Feature 4: (required for partnerships) Allow two such documents to be processed in your program and permit two trees to exist simultaneously. Do this by implementing two new commands BEGIN-DOCUMENT-A and BEGIN-DOCUMENT-B. They should work just like BEGIN-DOCUMENT but lead to two separate AVL trees. Display both of these trees on the screen.

    Feature 5: (required for partnerships) Provide a simple method that compares the two documents. The two documents A and B should be compared in the following way. Determines four sets of words:

    (i) the set A-B of words in document A but not in document B,
    (ii) the set B-A of words in document B but not in document A,
    (iii) the set A^B of words that  are common to A and B.
    (iv) the set AuB of words that are either in A or in B.
    

    First let's define card(S) to be the number of elements in the set S. Next, let us define the set distance between A and B as
    Ds(A,B) = [card(A-B) + card(B-A)] / (1 + card(A^B))
    
    Your applet should compute this information in response to the command SIMPLE-COMPARE. It should show, in the history window, each of the sets (i), (ii), (iii), and (iv), as well as the values of Ds(A,B), card(A-B), card(B-A), card(A^B), and card(AuB).

    Feature 6 (Optional for all): Compute a more sophisticated comparison of two documents A and B as follows. Suppose that the words of AuB are in alphabetical order as . Then we define the vector VA to be , where ai is the number of occurrences of wi in document A. Similarly VB is the vector giving numbers of occurrences of each word in B.

    The normalized word vector NA for A is obtained from VA by dividing each element ai by the length of VA. The length of VA is

                      2     2           2
    || VA || = sqrt(a1  + a2  + ... + an )
    
    The inner product of two vectors X = and Y = is given by
    X o Y = Sum for i=1 to n of xi * yi.
    Support the command COSINE-DISTANCE. To do this, compute the cosine distance between A and B as
    Db(A,B) = NA o NB.
    This should cause the cosine distance to be displayed in the status line and in the history window.

    What to turn in

    For Part I, turn in your answers as hardcopy.

    For Part II, turn in (a) the URL to your applet on the web, using the electronic turn-in form, and (b) a ZIP file named A4.zip containing your source code (.java files only).

    A demo version of an applet for this assignment is available for operation here.