Removing Duplicate Tuples
By sorting: use all (projected) attributes as the sort key
- After sorting, duplicates are adjacent, so a final scan can drop any tuple equal to its predecessor
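
A minimal in-memory sketch of the sort-based approach, assuming each tuple has already been projected down to the attributes of interest and fits in memory (a real system would use an external sort). The function name `dedup_by_sorting` and the sample rows are illustrative, not from the slides.

```python
def dedup_by_sorting(tuples):
    """Return the distinct tuples in sorted order."""
    result = []
    # Sorting whole tuples uses every attribute as the sort key.
    for t in sorted(tuples):
        # Equal tuples are now adjacent, so compare only with the last one kept.
        if not result or result[-1] != t:
            result.append(t)
    return result


rows = [(2, "b"), (1, "a"), (2, "b"), (1, "c"), (1, "a")]
distinct = dedup_by_sorting(rows)
```

The scan keeps only one tuple of state (the last one emitted), which is why the duplicate check adds little cost on top of the sort itself.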
By hashing: hash on all (projected) attributes
- The first hash pass effectively partitions the file into buckets
- Duplicates always hash to the same partition (bucket)
- Next, rehash each bucket with a different hash function, so tuples that collided in the first pass spread out
- Output the unique tuples from the second-level buckets
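
The two-level hashing steps above can be sketched as follows. This is an in-memory illustration under stated assumptions: the helper names `h1` and `h2`, the salting trick used to get a second hash function, and the bucket counts are all hypothetical choices, and a real system would write first-level partitions to disk.

```python
def h1(t, n):
    # First-level hash over all attributes of the tuple.
    return hash(t) % n


def h2(t, n):
    # A different hash function (assumption: salting the tuple is enough here).
    return hash(("salt", t)) % n


def dedup_by_hashing(tuples, num_partitions=4, num_buckets=8):
    """Return the distinct tuples; output order is not specified."""
    # Pass 1: partition the input. Duplicates land in the same partition.
    partitions = [[] for _ in range(num_partitions)]
    for t in tuples:
        partitions[h1(t, num_partitions)].append(t)

    # Pass 2: rehash each partition and keep one copy per distinct tuple.
    output = []
    for part in partitions:
        buckets = [[] for _ in range(num_buckets)]
        for t in part:
            bucket = buckets[h2(t, num_buckets)]
            if t not in bucket:  # only tuples in the same bucket are compared
                bucket.append(t)
        for bucket in buckets:
            output.extend(bucket)
    return output


rows = [(2, "b"), (1, "a"), (2, "b"), (1, "c"), (1, "a")]
distinct = dedup_by_hashing(rows)
```

Because each tuple is compared only against the tuples in its own second-level bucket, the equality checks stay local. Unlike the sort-based approach, the result comes out in no particular order.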