Implement the following applet for the analysis of web pages using AVL trees. The applet class should be named AVLTreeApplet.
A web page is normally written using HTML. As such, it is a mixture of normal text and HTML formatting commands.
Feature 1: Using either your solution or the given solution to the Huffman Trees applet assignment as a starting point create an Applet that builds AVL trees, showing the tree after each insertion. Your applet should support commands RESET, FIND (String), and INSERT (String). Each node of the AVL tree will hold a key (in the form of a string) and a count (in the form of an integer). Here is a sample input sequence for Feature 1:
RESET INSERT ant 5 INSERT bear 2 INSERT chinchilla 1 INSERT dromedary 2 FIND chinchilla ; returns chinchilla 1 FIND egret ; returns NOT FOUND
The AVL tree should be displayed on the screen using essentially the same display algorithm used for the Huffman trees in Assignment 3; Show, in each node, the word for that node and its number of occurrences.
Feature 2a: Implement commands BEGIN-DOCUMENT and END-DOCUMENT so that your applet will accept a document as text pasted in the textarea of your applet. For example the following command sequence should build the same tree as does the sequence given above. Each word should be mapped to lower-case before it is inserted. Use a Java StringTokenizer object (as was used in Assignment 3) to find the words and eliminate the punctuation.
RESET BEGIN-DOCUMENT Ant bear ant chinchilla dromedary, bear ant ant ant dromedary. END-DOCUMENT FIND chinchilla ; returns chinchilla 1
Feature 2b: (optional for individuals but required for partnerships) strip off and discard any HTML tags in the text, leaving only the normal text. To eliminate a tag, first scan the string for a left angle bracket, and if found, scan for the next right angle bracket. Use the substring method of class String to extract the parts of the string you need and concatenate them into a string that leaves out the tag. Repeat appropriately to get rid of all the tags.
Feature 3: Scan through the text (using a Java StringTokenizer object), and for each word encountered, process it using the AVL tree. Each word will be stored in the AVL tree together with a count of the number of its occurrences in the document.
Here is how to process each word: Each time a word is encountered in the scan of the document, check to see if the word is already in the AVL tree by performing a FIND operation. If it is, increment the count of the number of occurrences of that word. If it is not in the tree, INSERT it into the tree with a count of 1.
Feature 5: (required for partnerships) Provide a simple method that compares the two documents. The two documents A and B should be compared in the following way. Determines four sets of words:
(i) the set A-B of words in document A but not in document B, (ii) the set B-A of words in document B but not in document A, (iii) the set A^B of words that are common to A and B. (iv) the set AuB of words that are either in A or in B.
Ds(A,B) = [card(A-B) + card(B-A)] / (1 + card(A^B))Your applet should compute this information in response to the command SIMPLE-COMPARE. It should show, in the history window, each of the sets (i), (ii), (iii), and (iv), as well as the values of Ds(A,B), card(A-B), card(B-A), card(A^B), and card(AuB).
Feature 6 (Optional for all):
Compute a more sophisticated comparison of two documents A and B
as follows.
Suppose that the words of AuB are in alphabetical order as
The normalized word vector NA for A is obtained from VA
by dividing each element ai by the length of VA.
The length of VA is
For Part II, turn in (a) the URL to your applet on the web,
using the electronic turn-in form, and (b) a ZIP file named A4.zip
containing your source code (.java files only).
A demo version of an applet for this assignment
is available for operation
here.
2 2 2
|| VA || = sqrt(a1 + a2 + ... + an )
The inner product of two vectors X =
X o Y = Sum for i=1 to n of xi * yi.
Support the command COSINE-DISTANCE. To do this,
compute the cosine distance between A and B as
Db(A,B) = NA o NB.
This should cause the cosine distance to be displayed in
the status line and in the history window.
What to turn in
For Part I, turn in your answers as hardcopy.