Link Menu Search Expand Document

Lists; sets; for-each loop

ed Lesson

Table of contents


Lists

This week, we’ll switch from being the implementer of ArrayIntList to being the client of linear collections. The first linear collection that we’ll study is the list.

Our ArrayIntList is one particular implementation of the more general idea of a list. A list has the following methods (and possibly more).

  • add an element to the end of the list.
  • add an element to a particular place in the list.
  • get the value at a particular place in the list.
  • set the value at a particular place in the list.
  • size of the list.
  • remove the element at a particular place in the list.

These methods are the essential properties of a list. Computer scientists call this idea of a list and all of the above methods within it an abstract data type (ADT). ADTs specify what methods can be called, but not how they’re implemented. ArrayList is an implementation of the list ADT. Java has several other list implementations, such as LinkedList that we’ll study more in depth next week.

Printing a list

Consider the printList method that prints the elements of an ArrayList on separate lines.

Client

// post: prints the strings in the given list, one per line
public static void printList(ArrayList<String> list) {
    for (int i = 0; i < list.size(); i++) {
        System.out.println(list.get(i));
    }
}

This works fine when given an ArrayList.

Client

ArrayList<String> list = new ArrayList<String>();
list.add("According");
list.add("to");
list.add("all");
printList(list);

But printList won’t work with a LinkedList.

Client

LinkedList<String> list = new LinkedList<String>();
list.add("all");
list.add("known");
list.add("laws");
printList(list);
Why doesn't this code work with the printList method above?

printList takes an ArrayList<String>, but we’re giving it a LinkedList<String> instead. Java is very strict about checking the types, even if both ArrayList and LinkedList provide the size() and get(int index) methods necessary for printList to work.

One way to resolve this issue is to make a copy of the method with exactly the same body but accepting a LinkedList instead of an ArrayList.

Client

// post: prints the strings in the given list, one per line
public static void printList(LinkedList<String> list) {
    for (int i = 0; i < list.size(); i++) {
        System.out.println(list.get(i));
    }
}

However, there are some stylistic and engineering concerns about this solution. Now, whenever we want to change the behavior of printList, we need to change it in two different places. Here, the two methods are right next to each other so it’s not too hard to figure out that both of them need to be changed at the same time. But in larger programs, these two methods might be in two different classes. Someone making changes might not know to update the code in both places.

Interfaces

Instead of making two copies of the method, the better way to implement printList is to change the parameter to use the more general List interface, which is how Java models the idea of a list.

Client

// post: prints the strings in the given list, one per line
public static void printList(List<String> list) {
    for (int i = 0; i < list.size(); i++) {
        System.out.println(list.get(i));
    }
}

Notice that the method is declared to take a List<String>. This List<String> could be an ArrayList<String> or a LinkedList<String> because both of these implementations provide all of the methods required by the List interface. This one printList method can now work with any implementation of the List interface.

This idea of using the most general type as possible is recurring principle in programming.

Interface types for variables

Beyond using the List interface for the parameter to printList, we can also bring the benefits of generalizing code to our variables as well. Instead of declaring the type of a variable as ArrayList<String> or LinkedList<String>, we can use the interface type instead.

Client

List<String> list = new ArrayList<String>();

This line of code creates a new ArrayList<String> (right-hand side) and assigns it to a variable of type List<String> instead of a variable of type ArrayList<String>. By restricting the type to List<String>, this allows us to use the list variable in potentially more situations like we saw with the printList method.

Wrapper classes

In order to store a list of integers using Java’s ArrayList class, we might want to use the following line of code.

Client

List<int> list = new ArrayList<int>();

This doesn’t work due to a design decision in the Java language disallowing primitive types such as int and double from being used as the specialized type. Instead, specialized types need to be classes such as String.

In order to support numbers and other primitive types, Java has wrapper classes whose sole purpose is to represent primitive types as objects. For example, the wrapper class for int is called Integer. We can declare a list of integers as follows.

Client

List<Integer> list = new ArrayList<Integer>();

Java will conveniently convert between the primitive type and the wrapper class automatically in most places for us, so we can use the primitive type almost everywhere else.

Client

List<Integer> list = new ArrayList<Integer>();
list.add(1);
list.add(2);
list.add(3);

int first = list.get(0);

IntList interface

Writing an interface is very similar to writing a class, but we use the keyword interface instead of class.

For example, we are going to make an IntList interface that acts like Java’s List but to go along with our ArrayIntList. We’ll see next week another way to implement an IntList called a LinkedIntList.

Implementer

public interface IntList {
    // TODO
}

To require all classes that are IntLists to have an add method, we write the method header for the method we want without a body since interfaces don’t provide implementations, only requirements.

Implementer

public interface IntList {
    public void add(int value);  // semicolon ; instead of { } with method body
}

Now any class that claims to be an IntList must have a method that matches the signature add(int). There are many other methods we probably want to make sure anyone who is an IntList has, so we can add more method requirements.

Implementer

public interface IntList {
    public void add(int value);
    public void add(int index, int value);
    public boolean contains(int value);
    public int get(int index);
    public void remove(int index);
    public int size();
}

Now that we have the IntList interface, we can write methods that work with it! Remember that if we have a variable of type IntList, we are allowed to call any of the methods required by the IntList interface on that variable since Java will enforce that any class that implements the interface will have all the required methods.

Client

public static void printList(IntList list) {
    for (int i = 0; i < list.size(); i++) {
        System.out.println(list.get(i));
    }
}

Now, let’s try calling the printList method.

Client

public static void main(String[] args) {
    ArrayIntList list = new ArrayIntList();
    list.add(1);
    list.add(2);
    list.add(3);

    printList(list);
}

Unfortunately we will end up getting a compiler error because Java doesn’t know that ArrayIntList is actually an IntList. How does Java know that ArrayList is a List?

Relating ArrayIntList to IntList

Since we have all the methods that need to be implemented in ArrayIntList, we just need to tell Java that ArrayIntList is actually an IntList.

Implementer

public class ArrayIntList implements IntList {
    ...
}

This is telling Java that we promise to implement all the methods defined in IntList. If we were to forget a method (such as contains(int)), then our class would not compile because it did not implement all the methods IntList asked for!

After adding these two tokens, we are now able to call printList passing in an ArrayIntList. In fact, now that we have an interface for it, it would be better style to define the variable in the client code using the interface type.

Client

IntList list = new ArrayIntList();

Shakespeare

Next, we’ll look at a case study of why we might want to use different collections by analyzing all of Shakespeare’s published words. Our task will be to write a method countUnique that takes a Scanner as input and returns the number of unique words in the input.

One way to achieve this is by putting words into a list.

Client

public static int countUnique(Scanner input) {
    List<String> words = new ArrayList<String>();
    int count = 0;
    while (input.hasNext()) {
        String word = input.next();
        count++;
        words.add(word);
    }
    return count;
}
This program is buggy! What's missing?

Currently, the program counts and adds all of the words regardless of whether or not they’re actually unique.

We’ll need to add some condition to check whether or not we’ve seen a particular word before.

Client

public static int countUnique(Scanner input) {
    List<String> words = new ArrayList<String>();
    int count = 0;
    while (input.hasNext()) {
        String word = input.next();
        if (!words.contains(word)) {
            count++;
        }
        words.add(word);
    }
    return count;
}

Now, the program works correctly. It turns out that this dataset of Shakespeare’s works includes 33,505 unique words.

However, it takes about 3.5 minutes to compute this result. And this is not big data at all. Popular social networks might deal with similar amounts or even more data every second. Let’s start by asking ourselves why this program might be slow.

Sets

What’s really slow about countUnique is the contains call. The ArrayList is not optimized for this method: each call needs to iterate through the elements of the list to find the target. Later, we’ll introduce a language that computer scientists use to analyze how long it takes a program or method to execute.

For now, we’ll introduce the idea of a set, a collection of unique elements. In contrast to a list, a set does not store duplicates. It also does not keep track of the index of an element. The interface for the set ADT is often simpler than the list ADT.

  • add an element to the set.
  • contains to check if a value is in the set.
  • remove an element from the set.
  • size of the set.

At first glance, sets seem less useful than lists. To some extent, that’s true: we no longer have element indices and we can’t store duplicate elements. But by restricting the methods of the set ADT, this allows for the implementer to design faster set implementations. Because lists organize their elements by their indices, given any value, the list won’t know where to find it. In contrast, set implementations can organize their elements in more creative and efficient ways.

Performance can be significantly improved by using a more restricted interface.

Just like there are multiple List implementations (such as ArrayList and LinkedList), there are also multiple Set implementations. The two that we’ll use most often in this course are HashSet and TreeSet. We can reimplement the countUnique method with a HashSet and simplify the logic since sets only store unique elements.

Client

public static int countUnique(Scanner input) {
    Set<String> words = new HashSet<String>();
    while (input.hasNext()) {
        String word = input.next();
        words.add(word);
    }
    return words.size();
}

Note that the int count is no longer necessary. Since the set stores only unique elements, the size of a set is the number of unique elements in the set.

For now, we won’t worry too much about the difference between HashSet and TreeSet. HashSet is marginally faster than TreeSet, but we’ll see that there’s another trade-off that might make TreeSet the better choice than HashSet in certain situations.

For-each loop

Suppose we want to print out all of Shakespeare’s unique words. We’ve done this before in lists using for loops.

Client

for (int i = 0; i < words.size(); i++) {
    String word = words.get(i);
    System.out.println(word);
}

However, sets don’t keep track of element indices, so there’s no get(int index) method!

Instead, we’ll use a new bit of Java syntax called the for-each loop.

Client

for (String word : words) {
    System.out.println(word);
}

When iterating over a set, elements are considered in an order corresponding to how the data is organized in the set. For HashSet, there’s no easily-understandable interpretation to the order. But for TreeSet, the elements will be considered in sorted order. For example, each word in a TreeSet of words will be returned in dictionary order.

It turns out that sets are not the only data type that can use this syntax. For-each loops can be used with lists as well. In many cases, the for-each loop is a better way of iterating over the elements in a collection since we don’t need to introduce additional variables for the loop counter.

The main drawback with for-each loops is that they do not allow modifications to the underlying data structure. For example, we can’t add or remove elements from the collection within the for-each loop.

Client

for (String word : words) {
    System.out.println(word);
    // won't work
    words.add("hello");
}