Set and Maps
Table of contents
Search Engine
Earlier, we saw that the worst-case order of growth of the runtime for countUnique
with an ArrayList
is in Θ(N2), where N is the length of the list. The ArrayList.contains
algorithm may need to scan over the entire list on each iteration of the while
loop.
public static int countUnique(Scanner input) {
List<String> words = new ArrayList<String>();
int count = 0;
while (input.hasNext()) {
String word = input.next();
if (!words.contains(word)) {
count++;
}
words.add(word);
}
return count;
}
On a real dataset containing all of Shakespeare’s words, it takes about 3.5 minutes to determine that Shakespeare used 33,505 unique words.1 Designing programs for data on the internet might require dealing with even more data every second!
Web search engines are software systems that organize information on the internet so that users can discover web pages related to a search query. We can implement a basic web search engine by building an index that associates each term (word) to all web pages in which it appears. A search query can be answered by returning the documents that contain all of the terms in a given query.
Search Engine
- Oct 12
- Sets and Maps
- BJP 11.2, 11.3
- Describe an example where a set would be preferred over a list, and vice versa.
- Loop over each item in a collection (or an array) with a for-each loop.
- Apply the if-missing-then-put pattern to provide default map values.
- Oct 13
- SectionSets and Maps
- Oct 14
- Nested Collections
- Trace the execution of programs with reference data types.
- Define methods that use collections containing other collection types.
- Describe the relationship between object reference equality vs. value equality.
- Oct 15
- SectionNested Collections
- Oct 16
- Linked Nodes
- BJP 16.1
- Evaluate and reassign variable references to deeply linked nodes.
- Trace the execution of programs with references to deeply linked nodes.
- Identify and apply additive changes before destructive linked node changes.
This is an overcount since
Scanner
splits tokens on whitespace, so words can include punctuation. ↩
Sets
One way to speed up the countUnique
method is to use a set abstract data type instead of a list. By restricting the public methods and changing the way clients interact with the collection, implementers can design more efficient data structures to speed up the running time to meet the demands of big data.
In contrast to lists, sets only contain unique elements and do not keep track of element positions (indices). The set abstract data type represents an unordered collection of unique elements.
add
an element to the set.contains
to return whether or not a value is in the set.remove
and return an element from the set.size
to return the number of elements in the set.
Without the need to maintain element positions, set implementations organize data in more creative ways than storing elements according to index. The Java Collections framework provides two Set
implementations called HashSet
and TreeSet
. We can rewrite the countUnique
method to use a HashSet
.
public static int countUnique(Scanner input) {
Set<String> words = new HashSet<String>();
while (input.hasNext()) {
String word = input.next();
words.add(word);
}
return words.size();
}
Since sets only contain unique elements, words.size()
represents the number of unique words added to the set.
HashSet
- Elements are stored in a consistent but unpredictable order. For our purposes, the worst case runtime of all operations are constant with respect to the size of the set.
TreeSet
- Elements are stored in sorted order. The worst-case runtime of all operations are in Θ(log2 N) with respect to N, the size of the set. Later, we’ll learn about binary search trees, the data structure idea behind
TreeSet
.
For-each loop
One key limitation of the set ADT is that there’s no get
or peek
methods for accessing elements. Since lists maintain indices, List
clients can print out each word with a for loop.
List<String> words = ...
for (int i = 0; i < words.size(); i++) {
String word = words.get(i);
System.out.println(word);
}
However, sets don’t maintain element indices! Instead, iterating over a set of words requires a new Java syntax called the for-each loop.
Set<String> words = ...
for (String word : words) {
System.out.println(word);
}
Elements are considered in an order determined by the set implementer. Since TreeSet
stores elements in sorted order, a for-each loop over a TreeSet
of words will consider each word in dictionary order. The ordering of elements in a HashSet
, in contrast, is not easy to interpret.
For-each loops can also be used with other collection types as well. For-each loops help simplify code by removing the need to maintain additional loop counter variables.
The main drawback of the for-each loop is that it does not allow modification to the collection. For example, we can’t add or remove elements from the collection within a for-each loop.
for (String word : words) {
System.out.println(word);
words.add("hello"); // will not work
}
Maps
The map abstract data type associates a set of unique keys with a collection of values, where each key is associated with a value.
containsKey
returns true if the givenkey
is in the map.get
returns the value associated with the givenkey
, or null if thekey
is not in the map.put
associates the givenkey
with the givenvalue
.remove
to remove the entry for the given key and return its associated value.
Maps relate two types of data: keys and values. Suppose we want to build a letter inventory for the string “Washington” using a HashMap
where the keys are characters and the values are integers.
Map<Character, Integer> inventory = new HashMap<Character, Integer>();
// Add character count mappings to the inventory
inventory.put('a', 1);
inventory.put('g', 1);
inventory.put('h', 1);
inventory.put('i', 1);
inventory.put('n', 2);
inventory.put('o', 1);
inventory.put('s', 1);
inventory.put('t', 1);
inventory.put('w', 1);
To later add the inventory for the string “State”, we can call put
again with the same key and the updated value. The new entry will overwrite the old entry for the given key.
// Add [aetts] to the [aghinnostw] inventory
inventory.put('a', inventory.get('a') + 1);
inventory.put('e', 1);
inventory.put('s', inventory.get('s') + 1);
inventory.put('t', inventory.get('t') + 2);
For-each loops are more complicated since we need to specify whether we want to loop over the keySet()
or the collection of values()
.
for (char letter : inventory.keySet()) {
int count = inventory.get(letter);
System.out.println(name + ": " + count);
}
Why isn't there a method to return the key associated with a given value?
Keys in a map are unique, so there is only one value for each key. But values are not necessarily unique. There are many letters in the inventory
with a count of 1, so it’s impossible to say which key the client wants in particular.
Like sets, there also exists a TreeMap
class that also implements the Map
interface and maintains its keys in sorted order.
Map<Character, Integer> inventory = new TreeMap<Character, Integer>();