Search Engine
Building a search engine with maps and sets.1
Web search engines are software systems that organize information on the internet so that users can discover web pages related to a search query. We can implement a basic web search engine by building an inverted index (also known as an index) that maps each term (word) to the set of web pages in which it appears. A search query, which is a sequence of one or more terms, can be answered by returning the web pages that contain all of the terms.
Search Engine
- Oct 12
- Sets and Maps
- BJP 11.2, 11.3
- Describe an example where a set would be preferred over a list, and vice versa.
- Loop over each item in a collection (or an array) with a for-each loop.
- Apply the if-missing-then-put pattern to provide default map values.
- Oct 13
- SectionSets and Maps
- Oct 14
- Nested Collections
- Trace the execution of programs with reference data types.
- Define methods that use collections containing other collection types.
- Describe the relationship between object reference equality vs. value equality.
- Oct 15
- SectionNested Collections
- Oct 16
- Linked Nodes
- BJP 16.1
- Evaluate and reassign variable references to deeply linked nodes.
- Trace the execution of programs with references to deeply linked nodes.
- Identify and apply additive changes before destructive linked node changes.
Specification
Implement a SearchEngine
class that represents an inverted index.
SearchEngine()
- Constructs an empty
SearchEngine
instance. void index(String document)
- Indexes the
document
by associating each term in thedocument
with thedocument
itself. This method will be called for each web page that the client wants to add to the search engine. Set<String> search(String query)
- Returns the set of documents where each document contains all of the terms in the given
query
ignoring any never-indexed terms. If none of the terms have been indexed or if thequery
is empty, then an empty set is returned. One way to implement this is to start with all of the documents associated with any one of the indexed terms in thequery
and then remove all of the documents not associated with each of the other indexed terms in thequery
.
Split strings using the provided split
method that returns a Set<String>
terms.
Web app
To launch the web app, Open Terminal
javac Server.java && java Server bbcnews/*.txt; rm *.class
Then, open the Network
When you’re done, return to the terminal and enter the key combination Ctrl
C
to close the server.
Grading
Mark the program to submit it for grading. In addition to the automated tests, the final submission will also undergo code review for internal correctness on comments, variable names, indentation, data fields, and code quality.
- Method errors
- Not decomposing complex methods into logical subtasks that each have their own method. Create
private
“helper” methods to capture repeated code. A good rule of thumb for this assignment is that no method should have more than 20 lines of code in its body. - Data structure errors
- Excessive inefficiency when using collections. For example, using a for-each loop to find a value of a map when a call to
get
would return the value.- Using a collection when not necessary, or using the wrong collection type. For example, using a list instead of a set when we only want to maintain unique elements.
Chris Piech and Mehran Sahami. 2020. Bajillion Search Engine. In CS 106A Spring 2020. https://web.stanford.edu/class/archive/cs/cs106a/cs106a.1206/assn/15-assignment7.pdf ↩