================================================================
CSE 344 -- Spring 2011
Lecture 22:   Parallel Databases

================================================================

Two MAJOR trends are pushing Computer Science toward parallel
computation:

1. Moore's law (exponential growth in density of transistors per chip)
   is no longer reflected in increased clock speeds.  Increased
   hardware performance will be available only through parallelism.
   Think multicore: 4 cores today, perhaps 64 in a few years.

2. Cloud computing commoditizes access to large compute clusters.  Ten
   years ago, only Google could afford 1000 servers; today you can
   rent them from Amazon Web Services (AWS).

================================================================
Traditional parallel databases

Terminology:

    P = number of processors (or servers)

    Speedup: TPS = f(P).    (TPS = "transactions per second")
    (Note: TPS is just one metric; another is the speed of a single query)

         -- Ideal: linear speedup, TPS = TPS0 * P
         -- In practice: *** show graph

    Scaleup: TPS = f(P,D)   (D = size of the database)

         -- double both P and D, how does TPS vary ?
         -- Ideal: constant scaleup
         -- In practice: *** show graph
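
    A minimal sketch of how the two metrics above could be computed
    from measured throughput.  The function names and the numbers are
    made up for illustration; they are not measurements.

```python
def speedup(tps, tps0):
    """Observed speedup over the single-processor throughput TPS0 (D fixed)."""
    return tps / tps0


def speedup_efficiency(tps, tps0, p):
    """Fraction of the ideal linear speedup TPS0 * P that was achieved.

    1.0 means perfectly linear speedup.
    """
    return tps / (tps0 * p)


def scaleup(tps_scaled, tps_base):
    """Throughput after growing P and D by the same factor, relative to
    the baseline.  1.0 means ideal (constant) scaleup.
    """
    return tps_scaled / tps_base


# Made-up illustrative numbers, not measurements.
tps0 = 100.0                                # TPS with P = 1
print(speedup(350.0, tps0))                 # TPS with P = 4, same D
print(speedup_efficiency(350.0, tps0, 4))
print(scaleup(90.0, tps0))                  # TPS after doubling both P and D
```

    With these inputs the speedup is 3.5 where the ideal is 4, and the
    scaleup is 0.9 where the ideal is 1.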


   *** In class: what prevents us from achieving linear speedup or
       constant scaleup ?
         

Types of parallel architectures:

1. Shared Memory
    -- processors share both RAM and disk
    -- dozens to hundreds of processors
    -- premium cost -- last remaining cash cows in the hardware
       industry

  Characteristics:
    -- scalability
    -- usability/programmability
    -- failure mode: one node's failure brings down entire cluster

2. Shared Disk
    -- all processors access the same disks
    -- found in the largest "single-box" (non-cluster) multiprocessors
    -- Oracle dominates

  Characteristics:
    -- scalability
    -- usability/programmability
    -- failure mode: one node's failure brings down entire cluster

3. Shared nothing
    -- Cluster of single-processor machines on a high-speed network
    -- Called "clusters" or "blade servers"
    -- Data partitioning (discussed next)
    -- Each machine does locking and logging on local disks

  Characteristics:
    -- scalability
    -- usability/programmability
    -- failure modes: (1) when one node fails, bring down all nodes,
       or (2) "data skip": simply ignore the failed node and skip the
       unavailable data

================================================================

Basic query processing on one node:

*** Discuss in class:
   -- selection
   -- group-by
   -- join

================================================================

Data partitioning on a shared nothing architecture.

We have a large table R(K,A,B,C) and need to partition it across the
servers of a shared-nothing architecture; what are the options ?

  -- Block-partition
  -- Range partition on an attribute A
  -- Hash partition on attribute A
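
  The three options above can be sketched as follows.  This is only an
  illustration: rows are Python tuples (K, A, B, C), each "server" is a
  list, and the function names, the toy data and the range boundaries
  are assumptions, not part of the lecture.

```python
import bisect


def block_partition(rows, p):
    """Split rows into p contiguous chunks of (nearly) equal size."""
    size = -(-len(rows) // p)  # ceiling division
    return [rows[i * size:(i + 1) * size] for i in range(p)]


def range_partition(rows, boundaries, key):
    """Server i receives the rows with boundaries[i-1] <= key < boundaries[i].

    `boundaries` is a sorted list of p - 1 split points.
    """
    parts = [[] for _ in range(len(boundaries) + 1)]
    for row in rows:
        parts[bisect.bisect_right(boundaries, key(row))].append(row)
    return parts


def hash_partition(rows, p, key):
    """Server hash(key) % p receives the row.

    Note: Python's hash() is stable for small integers but randomized
    per process for strings, so a real system would use a fixed hash.
    """
    parts = [[] for _ in range(p)]
    for row in rows:
        parts[hash(key(row)) % p].append(row)
    return parts


# Toy instance of R(K, A, B, C); the values are made up.
R = [(1, 10, 1, 5), (2, 20, 2, 7), (3, 10, 1, 1),
     (4, 30, 2, 2), (5, 20, 1, 4), (6, 40, 2, 3)]


def attr_a(row):
    return row[1]


for name, parts in [
    ("block", block_partition(R, 3)),
    ("range on A", range_partition(R, [20, 40], attr_a)),
    ("hash on A", hash_partition(R, 3, attr_a)),
]:
    print(name, [[row[0] for row in part] for part in parts])
```

  Block partitioning balances the load but says nothing about where a
  given A value lives; range and hash partitioning put all rows with
  the same A on the same server, at the risk of uneven server loads.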

1. Discuss how to compute GroupBy(R, A, sum(C)) under each of the
   partitioning schemes.

2. Discuss how to compute GroupBy(R, B, sum(B)) under each of the
   partitioning schemes.
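
One possible answer for the case where R is hash-partitioned on A,
sketched under assumptions: each server is a Python list of tuples
(K, A, B, C), and the helper names and toy data are hypothetical.
Grouping on the partitioning attribute needs no communication;
grouping on any other attribute needs a reshuffle of partial sums,
which is also what block partitioning requires.

```python
from collections import defaultdict

K, A, B, C = 0, 1, 2, 3  # column positions in R(K, A, B, C)


def local_sum(rows, group_col, sum_col):
    """Group-by with sum, computed by one server on its own rows."""
    acc = defaultdict(int)
    for row in rows:
        acc[row[group_col]] += row[sum_col]
    return dict(acc)


def merge_disjoint(partials):
    """Union of per-server results; valid only if no group spans servers."""
    out = {}
    for partial in partials:
        out.update(partial)
    return out


def reshuffle_and_sum(partials, p):
    """Send each (group, partial sum) to server hash(group) % p and add up.

    This works because sum is distributive: a sum of partial sums is
    the total sum.
    """
    merged = [defaultdict(int) for _ in range(p)]
    for partial in partials:
        for group, partial_sum in partial.items():
            merged[hash(group) % p][group] += partial_sum
    return [dict(m) for m in merged]


# Toy data, hash-partitioned on A across P = 3 servers (A % 3).
servers = [
    [(4, 30, 2, 2)],
    [(1, 10, 1, 5), (3, 10, 1, 1), (6, 40, 2, 3)],
    [(2, 20, 2, 7), (5, 20, 1, 4)],
]

# 1. GroupBy(R, A, sum(C)): every A value lives on one server, so the
#    local results are already final.
by_a = merge_disjoint([local_sum(s, A, C) for s in servers])

# 2. GroupBy(R, B, sum(B)): a B value can appear on several servers,
#    so aggregate locally, reshuffle on B, then combine.
partials = [local_sum(s, B, B) for s in servers]
by_b = merge_disjoint(reshuffle_and_sum(partials, 3))

print(by_a)
print(by_b)
```

Aggregating locally before the reshuffle keeps network traffic
proportional to the number of groups per server instead of the
number of rows.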

================================================================

**** In class: Discuss Parallel Hash-Joins

R(A,B) Join S(B,C)
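
A minimal sketch of one common form of the parallel hash join,
assuming both relations start out partitioned arbitrarily across the
servers.  Step 1 reshuffles R and S on the join attribute B, so that
matching tuples meet on the same server; step 2 runs an ordinary hash
join on each server.  The function names and the toy data are
hypothetical.

```python
from collections import defaultdict


def reshuffle(fragments, p, key):
    """Route every tuple to server hash(key(t)) % p."""
    out = [[] for _ in range(p)]
    for fragment in fragments:
        for t in fragment:
            out[hash(key(t)) % p].append(t)
    return out


def local_hash_join(r_frag, s_frag):
    """Single-node hash join: build a table on R.B, probe with S.B."""
    table = defaultdict(list)
    for a, b in r_frag:
        table[b].append(a)
    return [(a, b, c) for b, c in s_frag for a in table.get(b, [])]


def parallel_hash_join(r_frags, s_frags, p):
    """Join R(A,B) with S(B,C); returns one result list per server."""
    r_new = reshuffle(r_frags, p, key=lambda t: t[1])  # B is 2nd in R
    s_new = reshuffle(s_frags, p, key=lambda t: t[0])  # B is 1st in S
    return [local_hash_join(r, s) for r, s in zip(r_new, s_new)]


# Toy fragments on 3 servers; the values are made up.
r_frags = [[(1, 10)], [(2, 20), (3, 10)], [(4, 30)]]
s_frags = [[(10, "x")], [(30, "y"), (20, "z")], [(50, "w")]]

per_server = parallel_hash_join(r_frags, s_frags, 3)
for i, result in enumerate(per_server):
    print("server", i, result)
```

The S tuple with B = 50 finds no partner and drops out.  If one
relation is already hash-partitioned on B with the same hash function,
only the other relation needs to be reshuffled.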