Installation of Lucene crawler
******************************

NOTE: These instructions are exclusively for Windows users. If you are a
non-Windows user, you will be on your own, so we encourage you to switch
fast. :-)

1. Go to http://www.lucene.com/ and download the latest version (currently
   1.0 -> lucene-r1_00.tar.gz, about 350KB). Unzip it.

2. Unzip the source files (which have just appeared out of the previous
   archive) to the current directory. This adds a new subdirectory
   (lucene-r1_00) to the current folder.

3. The README.txt file inside this new subdirectory explains how to
   proceed. Follow its steps 1-3 closely:

   - Download and install CYGWIN (if you already have it, you may need to
     get the latest version, 1.1.7, of the DLL, about 160KB).
   - Obtain the JDK (the latest version, 1.3, is installed on the
     instructional machines; I see it in D:\app\jdk1.3\) and set the
     proper environment variable as explained.
   - Go to http://www.metamata.com/download_top.html and find JavaCC
     further down. You will need to register, but not pay for it.
     Remember to download both JavaCC (about 750KB archived) and its
     documentation (there will be a link for it, about 200KB). If you got
     the zip, extract the JavaCC class file; otherwise skip this step.
   - Set the JAVAHOME environment variable as explained. (The others
     follow shortly.) It is best to put all environment and path settings
     in your AUTOEXEC.bat (a hidden system file, located in your C:\
     folder), so that you do not have to retype the same settings for
     every Command Prompt (aka DOS Prompt) session.
   - Run "java JavaCC2_0" to install JavaCC. You will be prompted to pick
     one of three kinds of scripts. I picked JView Scripts, since I was
     intending to use MS J++. (I don't know if selecting Windows Standard
     Scripts would work as well.) As a result, a folder named javacc2.0
     will be added.
   - Set the JAVACC environment variable as described in the README.txt
     file.
   - Complete the instructions on the PATH variable settings.
   - Here is a dump of my settings on one of the instructional machines
     (yours should look pretty close):

       ...
       JAVACC=c:\temp\Lucene\javacc2_0\bin\lib\JavaCC.zip
       JAVAHOME=d:\app\jdk1.3
       ...
       Path=d:\app\jdk1.3\bin;c:\temp\cygwin\bin;...
       ...

   - You should have a lucene.jar file, which came out of one of the
     previous archives. Run "jar xvf lucene.jar" to extract its files -
     they are critical to the spider!
   - This completes your installation! :-) It should have taken you a
     little over 7MB of disk space.

Experimenting with the Lucene crawler
*************************************

The online documentation of Lucene (http://www.lucene.com/doc/API.html)
contains examples of how to use the spider. Here is what I did (you may
choose to experiment differently):

- Go to one directory above the folder named "demo" and follow the
  examples described at the bottom of the documentation (linked above).
  It is *very important* that you do those in the same order: trying to
  run SearchFiles before actually creating the indexes (by running
  IndexFiles) does not work! On another note, it may be worthwhile to
  collect a set of text files that you want to experiment searching on
  and to put those in one directory (e.g. foo). Then, you run:

    java demo.IndexFiles foo

  This creates an index of keywords, which you can then search for a term
  of your choice:

    java demo.SearchFiles

  In addition to searching for a term in the text of a file, you can
  search for a term in the path (file or folder name). Furthermore,
  logical operations (i.e. more complex searches) are built in.

- Do more experiments and familiarize yourself with the way it works.
  Next, we will proceed to the Java code and explain how you can build it
  on your own (and modify its functionality to include other desired
  features).

- Skip the last experiment, since it involves a web server and we do not
  have that set up yet.
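
If you do not have a set of text files at hand, the following is a minimal
sketch of how one could generate a small test directory. It is written in
current Java rather than the JDK 1.3 mentioned above, and it is not part of
Lucene: the class name SampleCorpus, the file names, and the file contents
are all made up for illustration. Only the folder name foo and the two demo
commands it prints come from the steps above.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Lucene: writes a few small text files
// into one directory so there is something to index and search.
final class SampleCorpus {

    // Creates dir (if needed) and writes the sample files into it.
    // Returns the number of files written.
    static int create(Path dir) throws IOException {
        // File names and contents are invented sample data.
        Map<String, String> docs = new LinkedHashMap<>();
        docs.put("apples.txt", "Apples are red or green.");
        docs.put("bananas.txt", "Bananas are yellow.");
        docs.put("both.txt", "Apples and bananas are fruit.");

        Files.createDirectories(dir);
        for (Map.Entry<String, String> doc : docs.entrySet()) {
            Files.write(dir.resolve(doc.getKey()),
                        doc.getValue().getBytes(StandardCharsets.UTF_8));
        }
        return docs.size();
    }

    public static void main(String[] args) throws IOException {
        // Default to the folder name used in the instructions above.
        Path dir = Paths.get(args.length > 0 ? args[0] : "foo");
        int count = create(dir);
        System.out.println("Wrote " + count + " files to " + dir.toAbsolutePath());
        // The next two steps are the demo commands from the instructions.
        System.out.println("Now run: java demo.IndexFiles " + dir);
        System.out.println("Then:    java demo.SearchFiles");
    }
}
```

Run it from the directory above "demo", then index the generated folder. A
search term that appears in more than one file (here "apples") is a quick
way to see that the index covers all of them.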

Hints
*****

- Here is an open-source, general-purpose client-side Web API in C, for
  your inspiration: http://www.w3.org/Library/

- If you find out (mostly while making extensions to the existing code)
  that a class is missing from your library (I was missing QueryParser,
  for instance), you may have to generate it using JavaCC. Start
  bin\javaccw.exe (from the JavaCC folder) and compile the corresponding
  *.jj file (which contains the description of a grammar) into the
  necessary set of Java files.
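
Before regenerating anything with JavaCC, it can help to confirm that the
class really is missing from your classpath. The sketch below is one way to
check, under some assumptions: it is not part of Lucene, the class name
ClassCheck is made up, and you must pass the fully qualified class name
yourself, since the package that holds QueryParser depends on your Lucene
version.

```java
// Hypothetical helper, not part of Lucene: reports whether the named
// classes can be found on the current classpath.
final class ClassCheck {

    // Returns true if a class with this fully qualified name can be located.
    // The class is looked up without running its static initializers.
    static boolean isAvailable(String className) {
        try {
            Class.forName(className, false, ClassCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            // Not on the classpath, or present but not loadable.
            return false;
        }
    }

    public static void main(String[] args) {
        if (args.length == 0) {
            System.err.println("usage: java ClassCheck <fully.qualified.ClassName>...");
            return;
        }
        for (String name : args) {
            System.out.println(name + ": " + (isAvailable(name) ? "found" : "MISSING"));
        }
    }
}
```

Run it with the same classpath you use for the demo, passing the class you
suspect is missing (for example "<package>.QueryParser", with the package
taken from your sources). If it prints MISSING, generating the class from
its *.jj grammar as described above is the likely next step.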