================================================================
CSE 344 -- Spring 2011
Lecture 22: Parallel Databases
================================================================

Two MAJOR trends are pushing computer science toward parallel
computation:

1. Moore's law (exponential growth in the density of transistors per
   chip) is no longer reflected in increased clock speeds. Increased
   hardware performance will be available only through parallelism.
   Think multicore: 4 cores today, perhaps 64 in a few years.

2. Cloud computing commoditizes access to large compute clusters. Ten
   years ago, only Google could afford 1000 servers; today you can
   rent that many from Amazon Web Services (AWS).

================================================================
Traditional parallel databases

Terminology:

P = number of processors (or servers)

Speedup: TPS = f(P)   (TPS = "transactions per second")
  (Note: TPS is just one metric; another could be the speed of a
  single query.)
  -- Ideal: linear speedup, TPS = TPS0 * P
     (TPS0 = TPS on a single processor)
  -- In practice: *** show graph

Scaleup: TPS = f(P,D)   (D = size of the database)
  -- Double both P and D: how does TPS vary?
  -- Ideal: constant scaleup
  -- In practice: *** show graph

*** In class: what prevents us from achieving linear speedup or
    constant scaleup?

Types of parallel architectures:

1. Shared memory
   -- processors share both RAM and disk
   -- dozens to hundreds of processors
   -- premium cost
   -- last remaining cash cows in the hardware industry

   Characteristics:
   -- scalability
   -- usability/programmability
   -- failure mode: one node's failure brings down the entire cluster

2. Shared disk
   -- all processors access the same disks
   -- found in the largest "single-box" (non-cluster) multiprocessors
   -- Oracle dominates

   Characteristics:
   -- scalability
   -- usability/programmability
   -- failure mode: one node's failure brings down the entire cluster

3. Shared nothing
   -- cluster of single-processor machines on a high-speed network
   -- called "clusters" or "blade servers"
   -- data partitioning (discussed next)
   -- each machine does locking and logging on its local disks

   Characteristics:
   -- scalability
   -- usability/programmability
   -- failure modes:
      (1) when one node fails, bring down all nodes, or
      (2) "data skip": simply ignore the failed node and skip the
          unavailable data

================================================================
Basic query processing on one node:

*** Discuss in class:
   -- selection
   -- group-by
   -- join

================================================================
Data partitioning on a shared-nothing architecture.

We have a large table R(K,A,B,C) and need to partition it across the
servers of a shared-nothing architecture. What are the options?

   -- Block partition
   -- Range partition on attribute A
   -- Hash partition on attribute A

1. Discuss how to compute GroupBy(R, A, sum(C)) under each of the
   partitioning schemes.

2. Discuss how to compute GroupBy(R, B, sum(B)) under each of the
   partitioning schemes.

================================================================
*** In class: Discuss parallel hash-joins

   R(A,B) Join S(B,C)
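
================================================================
Sketch: partitioning and parallel group-by

The three partitioning options and the two group-by questions can be
tried out on one machine. The Python sketch below is only an
illustration, not the method of any particular system: each "server"
is a plain list, the table contents are made up, and every name in it
(stable_hash, block_partition, range_partition, hash_partition,
parallel_group_sum, the "colocated" flag) is hypothetical. The idea it
shows: if R is range- or hash-partitioned on the grouping attribute,
every group sits on one server and the local answers can simply be
concatenated; otherwise the partial sums must be reshuffled on the
grouping attribute and combined.

```python
import bisect
import zlib
from collections import defaultdict

K, A, B, C = 0, 1, 2, 3  # column positions in R(K,A,B,C)

def stable_hash(value):
    """Deterministic hash (Python's built-in hash() of a str varies per run)."""
    return zlib.crc32(repr(value).encode("utf-8"))

def block_partition(rows, p):
    """Cut R into p contiguous chunks of nearly equal size, ignoring values."""
    size = -(-len(rows) // p)  # ceiling division
    return [rows[i * size:(i + 1) * size] for i in range(p)]

def range_partition(rows, col, boundaries):
    """Server i gets the rows whose value in `col` falls in the i-th range."""
    parts = [[] for _ in range(len(boundaries) + 1)]
    for row in rows:
        parts[bisect.bisect_left(boundaries, row[col])].append(row)
    return parts

def hash_partition(rows, col, p):
    """Server (hash(value) mod p) gets the row."""
    parts = [[] for _ in range(p)]
    for row in rows:
        parts[stable_hash(row[col]) % p].append(row)
    return parts

def local_group_sum(rows, key_col, val_col):
    """Ordinary single-node group-by with sum."""
    sums = defaultdict(int)
    for row in rows:
        sums[row[key_col]] += row[val_col]
    return dict(sums)

def parallel_group_sum(parts, key_col, val_col, colocated):
    """Group-by over partitioned data.

    colocated=True  : data is range/hash partitioned on key_col, so each
                      group is complete on one server; concatenate.
    colocated=False : reshuffle partial sums by hash of the group key,
                      then add them up on the receiving server.
    """
    local = [local_group_sum(part, key_col, val_col) for part in parts]
    if not colocated:
        inbox = [defaultdict(int) for _ in parts]
        for sums in local:
            for key, partial in sums.items():
                inbox[stable_hash(key) % len(parts)][key] += partial
        local = inbox
    result = {}
    for sums in local:
        result.update(sums)
    return result

# Made-up instance of R(K,A,B,C) on P = 3 hypothetical servers.
R = [(1, "x", 10, 5), (2, "y", 20, 7), (3, "x", 10, 1),
     (4, "z", 30, 2), (5, "y", 10, 4), (6, "x", 20, 3)]
P = 3

# Question 1: GroupBy(R, A, sum(C)) under each scheme.
by_block = parallel_group_sum(block_partition(R, P), A, C, colocated=False)
by_range = parallel_group_sum(range_partition(R, A, ["x", "y"]), A, C, colocated=True)
by_hash = parallel_group_sum(hash_partition(R, A, P), A, C, colocated=True)

# Question 2: GroupBy(R, B, sum(B)). The data is partitioned on A, not B,
# so the partial sums always need the reshuffle step.
on_b = parallel_group_sum(hash_partition(R, A, P), B, B, colocated=False)

print(by_block == by_range == by_hash == local_group_sum(R, A, C))
print(on_b)
```

Note that the sketch reshuffles partial sums rather than raw tuples:
sum can be pre-aggregated locally, which reduces the data sent over
the network. That shortcut is an assumption that holds for sum, count,
min and max, but not for every aggregate.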
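
================================================================
Sketch: parallel hash-join

One common way to run R(A,B) Join S(B,C) on a shared-nothing cluster
is to reshuffle both relations on the join attribute B, so that
matching tuples meet on the same server, and then run an ordinary
hash-join locally on each server. The Python sketch below simulates
this under stated assumptions: servers are plain lists, the data is
made up, and the names (stable_hash, shuffle, local_hash_join,
parallel_hash_join) are hypothetical. It is one possible answer to
the in-class discussion, not necessarily the one given in lecture.

```python
import zlib
from collections import defaultdict

def stable_hash(value):
    """Deterministic hash so that every server routes a value the same way."""
    return zlib.crc32(repr(value).encode("utf-8"))

def shuffle(parts, col, p):
    """Repartition: send each tuple to server hash(tuple[col]) mod p."""
    out = [[] for _ in range(p)]
    for part in parts:
        for row in part:
            out[stable_hash(row[col]) % p].append(row)
    return out

def local_hash_join(r_rows, s_rows):
    """Single-node hash-join of R(A,B) and S(B,C) on B.

    Build a hash table on R keyed by B, then probe it with each S tuple.
    """
    table = defaultdict(list)
    for a, b in r_rows:
        table[b].append(a)
    return [(a, b, c) for b, c in s_rows for a in table.get(b, [])]

def parallel_hash_join(r_parts, s_parts, p):
    """Shuffle both inputs on B, join locally, concatenate the outputs."""
    r_shuffled = shuffle(r_parts, 1, p)  # B is column 1 of R(A,B)
    s_shuffled = shuffle(s_parts, 0, p)  # B is column 0 of S(B,C)
    result = []
    for i in range(p):  # on a real cluster these p joins run concurrently
        result.extend(local_hash_join(r_shuffled[i], s_shuffled[i]))
    return result

# Made-up data, initially spread over two servers in arbitrary blocks.
R = [(1, 10), (2, 20), (3, 10), (4, 30)]
S = [(10, "u"), (20, "v"), (20, "w"), (40, "x")]
r_parts = [R[:2], R[2:]]
s_parts = [S[:1], S[1:]]

joined = parallel_hash_join(r_parts, s_parts, p=3)
print(sorted(joined) == sorted(local_hash_join(R, S)))
```

If R or S is already hash partitioned on B with the same hash function
and the same number of servers, its shuffle step moves no data, which
is one reason the choice of partitioning attribute matters.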