I try to run cse454 (CSE454_CLASSPATH points to the folder where I have StudentIndexer.class):

cse454 index -indexclass StudentIndexer

And it gives me:

Setting student classpath ...
Creating StudentIndexer ...
Exception in thread "main" java.lang.NoClassDefFoundError: StudentIndexer (wrong name: edu/washington/cse454/StudentIndexer)
        at java.lang.ClassLoader...

What could be wrong?
A: You want to pass the fully qualified class name:
cse454 index -indexclass edu.washington.cse454.StudentIndexer
A: Yes, you have to take stop words into account for term positions. So in the document "hair of the cat," position(hair) = position(cat) - 3.
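For instance, here's a minimal sketch of position counting (the stop-word list and the 0-based numbering are illustrative assumptions, not the assignment's spec):

    import java.util.Arrays;
    import java.util.List;

    // Term positions must count stop words: every token advances the
    // position counter, even if stop words themselves are not indexed.
    public class PositionDemo {
        public static void main(String[] args) {
            List<String> stopWords = Arrays.asList("of", "the");
            String[] tokens = "hair of the cat".split(" ");
            for (int pos = 0; pos < tokens.length; pos++) {
                if (stopWords.contains(tokens[pos])) continue; // skip indexing, keep counting
                System.out.println(tokens[pos] + " -> position " + pos);
            }
            // Prints: hair -> position 0, cat -> position 3,
            // so position(hair) = position(cat) - 3, as above.
        }
    }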
A: You have to put in the full path to the program each time. So "/projects/instr/cse454-05au/assignment1/bin/cse454 index ..." instead of "cse454 index ..."
saveIndex() / loadIndex()?
A: saveIndex() is called by the cse454 command after indexPages() finishes. Its purpose is to return the path to your on-disk index; you will submit that path along with your write-up and your entire index. I will instantiate your indexer object and call loadIndex() on the path returned by your saveIndex() in order to compare your index with mine.
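To make that contract concrete, here is a hypothetical skeleton; check the IIndexer interface for the actual method signatures, since the types and the index path below are assumptions for illustration only:

    // Hypothetical sketch of the call contract described above; the real
    // IIndexer interface may declare different signatures.
    public class StudentIndexer /* implements IIndexer */ {

        // Called by the cse454 driver after indexPages() finishes;
        // returns the on-disk location of the finished index.
        public String saveIndex() {
            String indexPath = "/local1/myindex"; // assumed location
            // ... write lexicon and occurrence files under indexPath ...
            return indexPath;
        }

        // Called later with the path saveIndex() returned; must
        // reconstruct the same index from disk.
        public void loadIndex(String path) {
            // ... read lexicon and occurrence files back from path ...
        }
    }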
A: Apparently some page contents are missing in the crawl but are returned as documents anyway. I'm not sure why that's the case, but it's just a small percentage of them. You can treat them as zero-length pages.
A: Some classes, such as IIndexer.FileList.DocOccurrence, can be a pain to instantiate (sorry). The easiest thing to do is to have a FileList object around so you can do something like filelist.new DocOccurrence() (assuming filelist is a FileList object that you've instantiated earlier).
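This is just Java's qualified "new" syntax for non-static inner classes. A generic, self-contained illustration (Outer and Inner are stand-ins, not the real IIndexer classes):

    // A non-static inner class needs an enclosing instance, so it is
    // created with the qualified form "outer.new Inner()".
    public class Outer {
        public class Inner {
        }

        public static void main(String[] args) {
            Outer outer = new Outer();              // enclosing instance
            Outer.Inner inner = outer.new Inner();  // qualified new syntax
        }
    }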
A: See this email message:
To use the medium-sized crawl, add "-size med" to the cse454 command invocation. I'd recommend against using it while you're developing, as it will be much slower and take up significantly more disk space. Once your indexer is farther along, try indexing the medium-sized crawl to stress-test your program. You definitely want to do this to see if you run out of memory.
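For example, combining it with the invocation from earlier:

cse454 index -indexclass edu.washington.cse454.StudentIndexer -size med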
Find a Linux machine that doesn't have a lot of users on it, if possible. Try finding a machine in one of the labs and remember its hostname so you can ssh to it in order to run your indexer.
Use the 'df' command to see how much free space each filesystem has. Directories with names like /local1 or /tmp or /var/tmp are the best places to save the files your indexer generates. If you try saving to your own home directory, you'll quickly hit your quota.
As the assignment says, the lexicon should be kept in main memory. But the occurrence index gets too big to keep entirely in memory, so we have to write out parts of it to disk at times. We read those parts back in when we have to make changes to them. So how should you divide up your occurrence index? This is one of the design decisions you must make, and it will have a big impact on how long it takes your indexer to index a crawl. If you choose a bad way of keeping track of term occurrences, you'll end up reading and writing too often.
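One way to picture this (a sketch of one possible design, not the required one): buffer postings in memory and flush a sorted "run" to disk whenever a size budget is exceeded. The budget, scratch directory, and record format below are all illustrative assumptions:

    import java.io.*;
    import java.util.*;

    // Accumulate postings in memory; flush a sorted run file whenever
    // the in-memory budget is exceeded. Runs get merged later.
    public class FlushingIndexer {
        private static final int MAX_POSTINGS_IN_MEMORY = 1_000_000; // tuning knob
        private final Map<String, List<Integer>> postings = new TreeMap<>();
        private int postingCount = 0;
        private int runNumber = 0;

        public void addOccurrence(String term, int docId) throws IOException {
            postings.computeIfAbsent(term, t -> new ArrayList<>()).add(docId);
            if (++postingCount >= MAX_POSTINGS_IN_MEMORY) {
                flushRun();
            }
        }

        // TreeMap keeps terms sorted, so each run file is already
        // alphabetical, which makes the later merge step easy.
        private void flushRun() throws IOException {
            File run = new File("/local1/run" + (runNumber++) + ".txt"); // assumed scratch dir
            try (PrintWriter out = new PrintWriter(new FileWriter(run))) {
                for (Map.Entry<String, List<Integer>> e : postings.entrySet()) {
                    out.println(e.getKey() + "\t" + e.getValue());
                }
            }
            postings.clear();
            postingCount = 0;
        }
    }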
One possible intermediate structure you can use is a trie (see the indexing lecture). A trie makes it easy to enumerate terms in alphabetical order, which can be a huge benefit if you want to create a sorted index. With a sorted index, you never have to worry about reading old inverted file lists back in, because you'll never need to add occurrences of terms you've already passed alphabetically. The problem is, you can't fit the entire trie in memory, because it represents your entire index too. So you'll have to consider ways of splitting it into chunks.
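Here's a toy trie sketch showing why sorted output comes for free: if children are stored in alphabetical order, a depth-first walk emits the stored terms already sorted (the structure is generic, not tied to the assignment's classes):

    import java.util.Map;
    import java.util.TreeMap;

    // Toy trie: alphabetically ordered children mean a depth-first
    // walk visits the stored terms in sorted order.
    public class Trie {
        private static class Node {
            final Map<Character, Node> children = new TreeMap<>();
            boolean isTerm = false;
        }

        private final Node root = new Node();

        public void insert(String term) {
            Node n = root;
            for (char c : term.toCharArray()) {
                n = n.children.computeIfAbsent(c, k -> new Node());
            }
            n.isTerm = true;
        }

        public void printSorted() {
            walk(root, new StringBuilder());
        }

        private void walk(Node n, StringBuilder prefix) {
            if (n.isTerm) System.out.println(prefix);
            for (Map.Entry<Character, Node> e : n.children.entrySet()) {
                prefix.append(e.getKey());
                walk(e.getValue(), prefix);
                prefix.deleteCharAt(prefix.length() - 1);
            }
        }

        public static void main(String[] args) {
            Trie t = new Trie();
            t.insert("hair");
            t.insert("cat");
            t.insert("car");
            t.printSorted(); // car, cat, hair
        }
    }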
Here's a naive approach that you could (but shouldn't) use: put every unique term's inverted file list in a separate file. Why is this bad? You'll end up with a huge number of files, which the operating system will not handle very well. You could put your files in a trie structure, so that all files placed in the same directory start with the same prefix, but this won't decrease the number of files and will increase the number of directories, so you'll be even more inefficient! Clearly, you have to find some middle ground between one file per term and one file for all terms...
One last idea is to expand the number of steps and intermediate files you use. You can't sort all the terms and their document occurrences all at once, but you can sort a small portion at a time...
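That's essentially an external merge sort: write out several small sorted runs, then merge them. A hedged sketch of the merge step, assuming run files with one sorted "term<TAB>postings" line per term (as in the flush sketch above); combining the postings of a term that appears in several runs is omitted for brevity:

    import java.io.*;
    import java.util.*;

    // K-way merge of alphabetically sorted run files into one sorted file.
    public class RunMerger {
        private static class Run {
            final BufferedReader reader;
            String line; // current line; null once this run is exhausted

            Run(File f) throws IOException {
                reader = new BufferedReader(new FileReader(f));
                advance();
            }

            void advance() throws IOException {
                line = reader.readLine();
                if (line == null) reader.close();
            }
        }

        public static void merge(List<File> runFiles, File output) throws IOException {
            // Min-heap keyed on each run's current line, so the
            // alphabetically smallest remaining term is emitted next.
            PriorityQueue<Run> heap = new PriorityQueue<>(Comparator.comparing((Run r) -> r.line));
            for (File f : runFiles) {
                Run r = new Run(f);
                if (r.line != null) heap.add(r);
            }
            try (PrintWriter out = new PrintWriter(new FileWriter(output))) {
                while (!heap.isEmpty()) {
                    Run r = heap.poll();
                    out.println(r.line);
                    r.advance();
                    if (r.line != null) heap.add(r);
                }
            }
        }
    }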