CSE326 - Project 3 clarifications

Project 3 Clarifications

You should implement your own hash code for the hash table. Do not use String.hashCode().
AVLTree and SplayTree must extend BinarySearchTree regardless of whether you choose to modify the existing code for BinarySearchTree.
Your hash table should be able to hold an arbitrary number of elements. In particular, if you are using separate chaining, you should still rehash the table when your "lambda" value gets above some constant. The issue is that separate chaining starts to have very bad performance if the table is saturated.
For the document correlator, just output the final difference metric you compute. e.g.:

bash$ java Correlator -b foo.txt bar.txt
.42

[3/3 5pm] This number should be the difference with removing words with normalized frequencies above 0.01 (1%) and below 0.0001 (0.01%). [Note: If you have already implemented this outputing two numbers - with and without the cutoffs, this will also be o.k..]

Posted Feb 21

You should use regular arrays for your hash tables, not ArrayLists.
Posted Mon, March 1, 4:30pm
Running count.pl and count.sh: If you get "Permission Denied" when trying to run either of these, you may need to change the file permissions to be user (thats you!) executable (so you can run it). You do this with the chmod command:

bash$ chmod u+x count.sh
bash$ chmod u+x count.pl
Then you should be able to run these as described in the assignment:
bash$ ./count.sh hamlet.txt
These are Perl (count.pl) and Unix shell (count.sh) scripts. You do not need to understand how they work at all for this assignment! However they are just text files that you can open in any editor.
Math libraries: You are welcome to use functions from the java math libraries, just not data structures from the java collections.
Posted Tues, March 2, 1:30pm
For the sorting algorithm in sortByDecreasingCount, you are told to replace this with a better sort. Be sure you pay attention to what the comments say that sortByDecreasingCount should do. It currently assumes that the array it is given has been sorted alphabetically. It then does a sort on counts. The final output is sorted decreasingly by counts, but for elements with the same count, they are sorted alphabetically (see the examples in the writeup). You need to modify sortByDecreasingCount so that you get this same output even if you pass it an array that has NOT been sorted alphabetically. Thus you may need to do alphabetical sorting inside of sortByDecreasingCount as well as sorting by count. Feel free to write extra routines to do this alpha sorting. However you do things, be sure your sortByDecreasingCounts routine will work when given an array that has NOT already been sorted alphabetically.
Correlation Results: I have: 0.0029051893816795836 without word removal and 5.79912995261018E-4 with word removal via cutoffs .0001 and .01 That's with hamlet & new atlantis exactly as we gave them to you (including all that header/copyright information). When grading with regard to these #'s we'll look at just a few significant digits, so, if the numbers above are accurate anything around 5.79E-4 would be fine as well.
Posted Wed, March 3, 5:30pm
In question 8 of the readme, "data structure" is referring to your DataCounter implementations. The cases you enumerate may or may not be ones that you actually encountered in your testing - you can describe what these cases would look like (if any).
WordCount output The assignment specifies what should happen if -frequency is used. It specifies what should happen if -num_unique is used. It does not clearly specify what should happen if (a) neither of those two options is given, (b) both options are given. So what you do for (a) or (b) is up to you. This could include - doing nothing if no flags are given, or only implementing the first flag it sees if more than one is given. The only thing the assignment DOES say to do is not to crash if given invalid input.