- You should implement your own hash code for the hash table. Do
not use String.hashCode().
- AVLTree and SplayTree must extend
BinarySearchTree regardless of
whether you choose to modify the existing code for
BinarySearchTree.
- Your hash table should be able to hold an arbitrary number of
elements. In particular, if you are using separate chaining, you
should still rehash the table when your "lambda" value gets above some
constant. The issue is that separate chaining starts to have very bad
performance if the table is saturated.
- For the document correlator, just output the final difference
metric you compute. e.g.:
bash$ java Correlator -b foo.txt
bar.txt
.42
[3/3 5pm] This number should be the difference with removing words
with normalized frequencies above 0.01 (1%) and below 0.0001 (0.01%).
[Note: If you have already implemented this outputing two numbers -
with and without the cutoffs, this will also be o.k..]
Posted Feb 21
- You should use regular arrays for your hash tables, not ArrayLists.
Posted Mon, March 1, 4:30pm
- Running count.pl and count.sh: If you get "Permission Denied"
when trying to run either of these, you may need to change the file
permissions to be user (thats you!) executable (so you can run it).
You do this with the chmod command:
bash$ chmod u+x count.sh
bash$ chmod u+x count.pl
Then you should be able to run these as described in the assignment:
bash$ ./count.sh hamlet.txt
These are Perl (count.pl) and Unix shell (count.sh) scripts.
You do not need to understand how they work at all for this
assignment! However they are just text files that you can open in any
editor.
- Math libraries: You are welcome to use functions from
the java math libraries, just not data structures from the java collections.
Posted Tues, March 2, 1:30pm
- For the sorting algorithm in sortByDecreasingCount, you are told
to replace this with a better sort. Be sure you pay attention to what
the comments say that sortByDecreasingCount should do. It currently
assumes that the array it is given has been sorted alphabetically. It
then does a sort on counts. The final output is sorted decreasingly
by counts, but for elements with the same count, they are sorted
alphabetically (see the examples in the writeup). You need to modify
sortByDecreasingCount so that you get this same output even if you
pass it an array that has NOT been sorted alphabetically. Thus you
may need to do alphabetical sorting inside of sortByDecreasingCount as
well as sorting by count. Feel free to write extra routines to do
this alpha sorting. However you do things, be sure your
sortByDecreasingCounts routine will work when given an array that has
NOT already been sorted alphabetically.
- Correlation Results:
I have: 0.0029051893816795836 without word removal and 5.79912995261018E-4
with word removal via cutoffs .0001 and .01
That's with hamlet & new atlantis exactly as we gave them to you
(including all that header/copyright information).
When grading with regard to these #'s we'll look at just a few
significant digits, so, if the numbers above are accurate anything
around 5.79E-4 would be fine as well.
Posted Wed, March 3, 5:30pm
- In question 8 of the readme, "data structure" is referring to
your DataCounter implementations. The cases you enumerate may or may
not be ones that you actually encountered in your testing - you can
describe what these cases would look like (if any).
- WordCount output The assignment specifies what should
happen if -frequency is used. It specifies what should happen if
-num_unique is used. It does not clearly specify what should happen if
(a) neither of those two options is given, (b) both options are given.
So what you do for (a) or (b) is up to you. This could include -
doing nothing if no flags are given, or only implementing the first
flag it sees if more than one is given. The only thing the assignment
DOES say to do is not to crash if given invalid input.