Projection via Hashing
Partitioning phase: Read R using one input buffer. For each tuple, discard unwanted fields, then apply hash function h1 to the remaining fields to choose one of B-1 output buffers.
- Result is B-1 partitions (of tuples with no unwanted fields).
- Two tuples from different partitions are guaranteed to be distinct, since equal tuples always hash to the same partition.
Duplicate elimination phase: For each partition, read it and build an in-memory hash table, using hash function h2 (different from h1) on all fields, while discarding duplicates.
- If partition does not fit in memory, can apply hash-based projection algorithm recursively to this partition.
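The two phases above can be sketched in Python as a small in-memory simulation. This is only an illustration under stated assumptions, not a buffer-manager implementation: the names `hash_project`, `max_in_memory`, and `max_depth` are hypothetical, lists stand in for on-disk partitions, a Python `set` stands in for the h2 hash table, and salting the hash with the recursion depth stands in for choosing a different h1 at each level.

```python
def hash_project(rows, keep, num_buffers, max_in_memory=1000, depth=0, max_depth=3):
    """Project rows onto the field positions in `keep` and drop duplicates.

    Hypothetical sketch: lists simulate partitions that would live on disk.
    """
    keep = tuple(keep)
    fanout = num_buffers - 1  # B-1 output buffers, one buffer is used for input
    partitions = [[] for _ in range(fanout)]

    # Partitioning phase: discard unwanted fields, then route by h1.
    for row in rows:
        projected = tuple(row[i] for i in keep)
        # Salting with depth gives a different h1 at each recursion level.
        h1 = hash((depth, projected)) % fanout
        partitions[h1].append(projected)

    # Duplicate elimination phase: handle each partition independently.
    result = []
    for part in partitions:
        if len(part) > max_in_memory and depth < max_depth:
            # Partition "does not fit": apply the algorithm recursively.
            # Tuples are already projected, so keep every field.
            result.extend(
                hash_project(part, range(len(keep)), num_buffers,
                             max_in_memory, depth + 1, max_depth)
            )
        else:
            seen = set()  # stands in for the in-memory hash table built with h2
            for t in part:
                if t not in seen:
                    seen.add(t)
                    result.append(t)
    return result


rows = [(1, "a", 10), (2, "b", 20), (1, "a", 30), (3, "c", 10), (2, "b", 99)]
distinct = hash_project(rows, keep=(0, 1), num_buffers=4)
```

The depth cap is a design choice of this sketch: a partition made up of many copies of one tuple can never be split by rehashing, so the recursion falls back to in-memory elimination after `max_depth` levels.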
Cost: Partitioning reads R and writes out every tuple, but with fewer fields. The duplicate elimination phase then reads these partitions back in.
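As a rough back-of-the-envelope check, the I/O count can be written down directly. The sketch below assumes every partition fits in memory (no recursive partitioning) and ignores the cost of writing the final result; the function and parameter names are hypothetical, with `pages_in_r` the number of pages in R and `pages_after_projection` the number of pages the projected tuples occupy.

```python
def hash_projection_io_cost(pages_in_r, pages_after_projection):
    """Estimated page I/Os for hash-based projection (no recursion assumed)."""
    # Partitioning: read all of R, write the narrower projected tuples.
    partition_cost = pages_in_r + pages_after_projection
    # Duplicate elimination: read every partition back once.
    dedup_cost = pages_after_projection
    return partition_cost + dedup_cost


# Illustrative numbers only: R has 1000 pages, projection keeps a quarter of each tuple.
cost = hash_projection_io_cost(1000, 250)
```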