Set and Maps

Search Engine
Sets
For-each loop
Maps

Search Engine

Earlier, we saw that the worst-case order of growth of the runtime for countUnique with an ArrayList is in Θ(N²), where N is the length of the list. The ArrayList.contains algorithm may need to scan over the entire list on each iteration of the while loop.

public static int countUnique(Scanner input) {
    List<String> words = new ArrayList<String>();
    int count = 0;
    while (input.hasNext()) {
        String word = input.next();
        if (!words.contains(word)) {
            count++;
        }
        words.add(word);
    }
    return count;
}

On a real dataset containing all of Shakespeare’s words, it takes about 3.5 minutes to determine that Shakespeare used 33,505 unique words.¹ Designing programs for data on the internet might require dealing with even more data every second!

Web search engines are software systems that organize information on the internet so that users can discover web pages related to a search query. We can implement a basic web search engine by building an index that associates each term (word) to all web pages in which it appears. A search query can be answered by returning the documents that contain all of the terms in a given query.

Search Engine

Oct 12

Sets and Maps: BJP 11.2, 11.3

Describe an example where a set would be preferred over a list, and vice versa.
Loop over each item in a collection (or an array) with a for-each loop.
Apply the if-missing-then-put pattern to provide default map values.

Oct 13

SectionSets and Maps

Oct 14

Nested Collections

Trace the execution of programs with reference data types.
Define methods that use collections containing other collection types.
Describe the relationship between object reference equality vs. value equality.

Oct 15

SectionNested Collections

Oct 16

Linked Nodes: BJP 16.1

Evaluate and reassign variable references to deeply linked nodes.
Trace the execution of programs with references to deeply linked nodes.
Identify and apply additive changes before destructive linked node changes.

This is an overcount since Scanner splits tokens on whitespace, so words can include punctuation. ↩

Sets

One way to speed up the countUnique method is to use a set abstract data type instead of a list. By restricting the public methods and changing the way clients interact with the collection, implementers can design more efficient data structures to speed up the running time to meet the demands of big data.

In contrast to lists, sets only contain unique elements and do not keep track of element positions (indices). The set abstract data type represents an unordered collection of unique elements.

add an element to the set.
contains to return whether or not a value is in the set.
remove and return an element from the set.
size to return the number of elements in the set.

Without the need to maintain element positions, set implementations organize data in more creative ways than storing elements according to index. The Java Collections framework provides two Set implementations called HashSet and TreeSet. We can rewrite the countUnique method to use a HashSet.

public static int countUnique(Scanner input) {
    Set<String> words = new HashSet<String>();
    while (input.hasNext()) {
        String word = input.next();
        words.add(word);
    }
    return words.size();
}

Since sets only contain unique elements, words.size() represents the number of unique words added to the set.

HashSet: Elements are stored in a consistent but unpredictable order. For our purposes, the worst case runtime of all operations are constant with respect to the size of the set.
TreeSet: Elements are stored in sorted order. The worst-case runtime of all operations are in Θ(log₂ N) with respect to N, the size of the set. Later, we’ll learn about binary search trees, the data structure idea behind TreeSet.

For-each loop

One key limitation of the set ADT is that there’s no get or peek methods for accessing elements. Since lists maintain indices, List clients can print out each word with a for loop.

List<String> words = ...
for (int i = 0; i < words.size(); i++) {
    String word = words.get(i);
    System.out.println(word);
}

However, sets don’t maintain element indices! Instead, iterating over a set of words requires a new Java syntax called the for-each loop.

Set<String> words = ...
for (String word : words) {
    System.out.println(word);
}

Elements are considered in an order determined by the set implementer. Since TreeSet stores elements in sorted order, a for-each loop over a TreeSet of words will consider each word in dictionary order. The ordering of elements in a HashSet, in contrast, is not easy to interpret.

For-each loops can also be used with other collection types as well. For-each loops help simplify code by removing the need to maintain additional loop counter variables.

The main drawback of the for-each loop is that it does not allow modification to the collection. For example, we can’t add or remove elements from the collection within a for-each loop.

for (String word : words) {
    System.out.println(word);
    words.add("hello"); // will not work
}

Maps

The map abstract data type associates a set of unique keys with a collection of values, where each key is associated with a value.

containsKey returns true if the given key is in the map.
get returns the value associated with the given key, or null if the key is not in the map.
put associates the given key with the given value.
remove to remove the entry for the given key and return its associated value.

Maps relate two types of data: keys and values. Suppose we want to build a letter inventory for the string “Washington” using a HashMap where the keys are characters and the values are integers.

Map<Character, Integer> inventory = new HashMap<Character, Integer>();
// Add character count mappings to the inventory
inventory.put('a', 1);
inventory.put('g', 1);
inventory.put('h', 1);
inventory.put('i', 1);
inventory.put('n', 2);
inventory.put('o', 1);
inventory.put('s', 1);
inventory.put('t', 1);
inventory.put('w', 1);

To later add the inventory for the string “State”, we can call put again with the same key and the updated value. The new entry will overwrite the old entry for the given key.

// Add [aetts] to the [aghinnostw] inventory
inventory.put('a', inventory.get('a') + 1);
inventory.put('e', 1);
inventory.put('s', inventory.get('s') + 1);
inventory.put('t', inventory.get('t') + 2);

For-each loops are more complicated since we need to specify whether we want to loop over the keySet() or the collection of values().

for (char letter : inventory.keySet()) {
    int count = inventory.get(letter);
    System.out.println(name + ": " + count);
}

Why isn't there a method to return the key associated with a given value?

Keys in a map are unique, so there is only one value for each key. But values are not necessarily unique. There are many letters in the inventory with a count of 1, so it’s impossible to say which key the client wants in particular.

Like sets, there also exists a TreeMap class that also implements the Map interface and maintains its keys in sorted order.

Map<Character, Integer> inventory = new TreeMap<Character, Integer>();

Set and Maps

Table of contents

Search Engine

Search Engine

Sets

For-each loop

Maps