About the Database Group Meeting
This quarter, the Database group meeting will be used both for presenting current research and for hosting speakers from outside UW CSE. All external talks are part of the Yahoo! Database Talk Series.
Meetings will be held in CSE 605 Database Lab unless specified otherwise. (** = will be held in Gates Commons.)

Schedule

Date Time Presenter(s) Title
Wed Sep 24 2.30 - 3.30pm Jingren Zhou (MSR) SCOPE: Easy and Efficient Parallel Processing of Massive Data Sets
Wed Oct 1 2.30 - 3.30pm Chris Re (UW) Advances in Processing SQL Queries on Probabilistic Data
Wed Oct 8 2.30 - 3.30pm Chris Re (UW) An Overview of the Mystiq Probabilistic Database
Wed Oct 15 2.30 - 3.30pm cancelled cancelled
Wed Oct 22 2.30 - 3.30pm Justin Cappos (UW) Seattle: Building a Million Node Testbed
Wed Oct 29 2.30 - 3.30pm Jonathan Hsieh (UW) An Extensible Web Crawler Infrastructure
Mon Nov 3 2.30 - 3.30pm Chris Jermaine (U. of Florida) MCDB: The Monte Carlo Database System
Fri Nov 7 2.30 - 3.30pm Kristen LeFevre (U. of Michigan) Privacy Protection in Data Publishing
Fri Nov 14 2.30 - 3.30pm Laura Chiticariu (IBM Almaden) Systems for Tracing the Provenance of Data
Wed Nov 19 1.30 - 2.30pm David DeWitt (U. of Wisconsin) Clustera: A Data-Centric Approach to Scalable Cluster Management **
Talk Details
Week 1: SCOPE: Easy and Efficient Parallel Processing of Massive Data Sets

Presenter: Jingren Zhou (Microsoft Research)

Abstract:
   Companies providing cloud-scale services have an increasing need to store and analyze massive data sets such as search logs and click streams. For cost and performance reasons, processing is typically done on large clusters of shared-nothing commodity machines. It is imperative to develop a programming model that hides the complexity of the underlying system but provides flexibility by allowing users to extend functionality to meet a variety of requirements. In this talk, we present a new declarative and extensible scripting language, SCOPE (Structured Computations Optimized for Parallel Execution), targeted for this type of massive data analysis. The language is designed for ease of use with no explicit parallelism, while being amenable to efficient parallel execution on large clusters. SCOPE borrows several features from SQL. Data is modeled as sets of rows composed of typed columns. The select statement is retained with inner joins, outer joins, and aggregation allowed. Users can easily define their own functions and implement their own versions of operators: extractors (parsing and constructing rows from a file), processors (row-wise processing), reducers (group-wise processing), and combiners (combining rows from two inputs). SCOPE supports nesting of expressions but also allows a computation to be specified as a series of steps, in a manner often preferred by programmers. We also describe how scripts are compiled into efficient, parallel execution plans and executed on large clusters.
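
To make the step-by-step, operator-based style described above concrete, here is a minimal Python sketch of an extractor / processor / reducer pipeline over rows. It is an illustration only: the log format, function names, and aggregation are assumptions, not SCOPE syntax or code from the talk.

    # Illustrative extractor -> processor -> reducer pipeline in the spirit of
    # the SCOPE abstract (hypothetical; not actual SCOPE code).
    from collections import defaultdict

    def extract(lines):
        """Extractor: parse raw log lines into typed rows (query, clicks)."""
        for line in lines:
            query, clicks = line.rsplit(",", 1)
            yield (query.strip(), int(clicks))

    def process(rows):
        """Processor: row-wise transformation, e.g. normalize the query text."""
        for query, clicks in rows:
            yield (query.lower(), clicks)

    def reduce_by_query(rows):
        """Reducer: group-wise aggregation, summing clicks per query."""
        totals = defaultdict(int)
        for query, clicks in rows:
            totals[query] += clicks
        return totals

    log = ["SQL tutorial, 12", "scope paper, 7", "sql TUTORIAL, 3"]
    print(dict(reduce_by_query(process(extract(log)))))
    # {'sql tutorial': 15, 'scope paper': 7}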

Bio:
   Jingren is a researcher in the Database Group at Microsoft Research. His research is in the area of databases, in particular query processing, query optimization, large-scale distributed computing, and architecture-conscious database systems. Before joining Microsoft, Jingren obtained a Ph.D. in Computer Science from Columbia University and a B.S. from the University of Science and Technology of China.

Week 2: Advances in Processing SQL Queries on Probabilistic Data

Presenter:
Christopher Re (UW)

Abstract:
   A SQL query over a relational database returns a set of answers (tuples), which in practice can often be quite large: hundreds or thousands of tuples. When the relational data is probabilistic, the SQL system needs to compute a probability for each answer tuple. This computation is expensive, because it requires running a Monte Carlo simulation and needs to be repeated for each tuple in the answer. This work presents two methods for speeding up the SQL processor.
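
For intuition about where the cost comes from, here is a minimal Python sketch of the naive approach the abstract refers to: each answer tuple's probability is estimated by its own Monte Carlo simulation over independent uncertain input tuples. The event structure, probabilities, and trial count are illustrative assumptions.

    # Naive per-tuple Monte Carlo estimation (illustrative sketch).
    import random

    def estimate_probability(input_probs, holds, trials=10_000):
        """Estimate P(answer tuple), where `holds` decides, for one sampled
        world (dict of tuple_id -> present?), whether the answer appears."""
        hits = 0
        for _ in range(trials):
            world = {t: random.random() < p for t, p in input_probs.items()}
            if holds(world):
                hits += 1
        return hits / trials

    # Example: the answer exists if tuple 'a' is present AND 'b' or 'c' is.
    probs = {"a": 0.9, "b": 0.5, "c": 0.4}
    p = estimate_probability(probs, lambda w: w["a"] and (w["b"] or w["c"]))
    print(round(p, 2))   # close to 0.9 * (1 - 0.5 * 0.6) = 0.63

Repeating a simulation like this for every answer tuple is the per-tuple cost that makes the naive approach expensive.
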
Week 3: An Overview of the Mystiq Probabilistic Database

Presenter:
Christopher Re (UW)

Abstract:
    In this talk, we describe our ongoing work at the University of Washington on the Mystiq system, a probabilistic database (pDB) that is motivated by applications in deduplication, information extraction, and RFID. In particular, we discuss three techniques that form the technical core of query evaluation in Mystiq: (1) safe plans, a technique that allows us to evaluate SQL queries on probabilistic databases as fast as native SQL, (2) multisimulation, a technique that allows us to evaluate any SELECT-FROM-WHERE query efficiently, and (3) materialized views for pDBs, which allow GB-scale SQL processing.
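
As a rough illustration of the "safe plan" idea, the Python sketch below computes an answer probability extensionally, i.e. algebraically from independent tuple probabilities rather than by simulation. The relation, query, and independence assumption are made up for illustration and are not Mystiq code.

    # Extensional probability computation in the spirit of safe plans:
    # probabilities are combined algebraically (no Monte Carlo simulation),
    # assuming the input tuples are independent.
    from math import prod

    # R(city, product) with tuple probabilities.
    R = {
        ("Seattle", "widget"): 0.8,
        ("Seattle", "gadget"): 0.6,
        ("Portland", "widget"): 0.3,
    }

    def prob_city_in_answer(R, city):
        """Projection onto city: P(at least one tuple for this city exists)
        = 1 - prod(1 - p_i)."""
        ps = [p for (c, _), p in R.items() if c == city]
        return 1 - prod(1 - p for p in ps)

    print(round(prob_city_in_answer(R, "Seattle"), 2))   # 1 - 0.2*0.4 = 0.92
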
Week 5: Seattle: Building a Million Node Testbed

Presenter:
Justin Cappos (UW)
Abstract:
    Testbeds are a popular way for researchers to evaluate new algorithms and ideas, as they provide a way to test a prototype on the live Internet. However, existing testbeds tend to be limited in their platform heterogeneity, network diversity, and scale. For example, there are no testbeds with the scale of the Azureus DHT (1 million nodes). We are striving to build a testbed (Seattle) that has the same heterogeneity characteristics as the Internet and does so at a scale of a million nodes. Our goal is to provide a testbed that supports diverse research, including cloud computing, P2P, mobile/ubiquitous devices, and networking. The first half of the talk will focus on the characteristics and composition of the Seattle testbed. The second half will be on interesting database and cloud computing research enabled by the testbed. The talk will feature a live demonstration with audience participation, so please bring your Internet-connected device!

Bio:
   Justin Cappos is a postdoc at the University of Washington whose research focuses on solving practical problems on networks of computer systems. His Ph.D. work at the University of Arizona focused on Stork, a package manager that managed over half a million VMs.
Week 6: An Extensible Web Crawler Infrastructure

Presenter:
Jonathan Hsieh (UW)
Abstract:
    Many services extend web crawlers to provide customized web crawler applications; examples include alert services, copyright-violation detection, and web malware scanners. In this talk, I describe a flexible and scalable system and some of the challenges in the design of its infrastructure. The system needs to be flexible enough to support these applications while scaling to billions of filters at high throughput. It includes a simple language for specifying declarative filters, which offers ample opportunities for optimization.
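
A hypothetical Python sketch of what declarative filters over crawled pages might look like is given below; the filter fields, keywords, and page representation are assumptions for illustration, not the system's actual filter language.

    # Hypothetical declarative filters applied to crawled pages.
    # Each filter is data (field, keyword, subscriber), not code, which is
    # what makes very large numbers of filters amenable to indexing.
    from dataclasses import dataclass

    @dataclass
    class Filter:
        field: str       # e.g. "url" or "body"
        keyword: str     # substring to match
        subscriber: str  # who to notify on a match

    filters = [
        Filter("body", "database", "alice@example.com"),
        Filter("url", ".edu", "bob@example.com"),
    ]

    def match(page: dict, filters):
        """Return the subscribers whose filters match this crawled page."""
        return [f.subscriber for f in filters if f.keyword in page.get(f.field, "")]

    page = {"url": "http://db.cs.washington.edu", "body": "database group meeting"}
    print(match(page, filters))   # ['alice@example.com', 'bob@example.com']
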
Week 7: MCDB: The Monte Carlo Database System

Presenter:
Chris Jermaine (Univ. of Florida)
Abstract:
    Analysts working with large data sets often use statistical models to "guess" at unknown, inaccurate, or missing information associated with the data stored in a database. For example, an analyst for a manufacturer may wish to know, "What would my profits have been if I'd increased my margins by 5% last year?" The answer to this question naturally depends upon the extent to which the higher prices would have affected each customer's demand, which is undoubtedly guessed via the application of some statistical model.

In this talk, I'll describe MCDB, which is a prototype database system that is designed for just such a scenario. MCDB allows an analyst to attach arbitrary stochastic models to the database data in order to "guess" the values for unknown or inaccurate data, such as each customer's unseen demand function. These stochastic models are used to produce multiple possible database instances in Monte Carlo fashion (a.k.a. "possible worlds"), and the underlying database query is run over each instance. In this way, fine-grained stochastic models become first-class citizens within the database. This is in contrast to the "classical" paradigm, where high-level summary data are first extracted from the database, then taken as input into a separate statistical model which is then used for subsequent analysis.
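
The sketch below illustrates the possible-worlds idea in Python under a made-up demand model and made-up numbers: a stochastic model fills in each customer's unseen demand response, the query (total revenue) is evaluated on every generated instance, and the results form an empirical distribution.

    # Illustrative possible-worlds sketch in the spirit of the MCDB abstract.
    # All models and numbers are invented for illustration.
    import random

    customers = [{"price": 10.0, "base_demand": 100},
                 {"price": 20.0, "base_demand": 50}]

    def sample_world(customers, price_increase=0.05):
        """One possible world: sample each customer's demand under higher prices."""
        world = []
        for c in customers:
            elasticity = random.gauss(-1.5, 0.3)   # guessed demand sensitivity
            demand = c["base_demand"] * (1 + elasticity * price_increase)
            world.append({"price": c["price"] * (1 + price_increase),
                          "demand": max(demand, 0.0)})
        return world

    def total_revenue(world):
        """The 'query' that is run over each generated database instance."""
        return sum(r["price"] * r["demand"] for r in world)

    samples = [total_revenue(sample_world(customers)) for _ in range(1000)]
    print(sum(samples) / len(samples))   # expected revenue across possible worlds
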
Week 7: Privacy Protection in Data Publishing

Presenter:
Kristen LeFevre
Abstract:
    Numerous organizations collect, distribute, and publish personal data for purposes that include demographic and public health research. Protection of individual privacy is an important problem in this setting, and a variety of anonymization techniques have been developed that typically aim to satisfy certain privacy constraints (e.g., k-anonymity and l-diversity) with minimal impact on the quality of the resulting data. This talk will describe several contributions to this field. In particular, I will describe a scalable workload-aware anonymization tool, which is able to incorporate a class of target workloads, consisting of data mining tasks and queries, when anonymizing an input dataset. I will also briefly describe some extended privacy definitions that allow for the more flexible incorporation of instance-level adversarial background knowledge. Finally, looking forward, I will describe several emerging data-intensive applications to which conventional definitions of privacy do not easily apply.
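
As a reminder of what the k-anonymity constraint mentioned above requires, here is a minimal Python check over illustrative data: every combination of quasi-identifier values in the published table must occur at least k times. The table and quasi-identifiers are assumptions for illustration.

    # Minimal k-anonymity check (illustrative sketch).
    from collections import Counter

    def is_k_anonymous(rows, quasi_identifiers, k):
        """True if every quasi-identifier combination occurs at least k times."""
        groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in rows)
        return all(count >= k for count in groups.values())

    published = [
        {"age": "30-39", "zip": "981**", "diagnosis": "flu"},
        {"age": "30-39", "zip": "981**", "diagnosis": "cold"},
        {"age": "40-49", "zip": "980**", "diagnosis": "flu"},
        {"age": "40-49", "zip": "980**", "diagnosis": "asthma"},
    ]

    print(is_k_anonymous(published, ["age", "zip"], k=2))   # True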

Bio:
    Kristen LeFevre is an Assistant Professor in EECS at the University of Michigan, where she is a member of the database group and the software research lab. She received her Ph.D. from the University of Wisconsin - Madison in 2007.
Week 8: Systems for Tracing the Provenance of Data

Presenter:
Laura Chiticariu (IBM Almaden)
Abstract:
    Provenance describes the origins of data, as well as its journey throughout its life cycle. The ability to trace data provenance is crucial in today's information systems, where data is constantly created, copied, transformed, and integrated. Provenance allows one to assess the quality and trustworthiness of data, as well as understand and debug the transformations that data undergoes in such systems.

In this talk I will discuss two principled methods (and corresponding system implementations) for tracing the provenance of data in the context of two commonly used formalisms for specifying data transformations: SQL queries and schema mappings. Specifically, I will describe the DBNotes annotation-management system, which traces data provenance over SQL queries, and the SPIDER schema-mapping debugging system, which traces data provenance over schema mappings.

The type of provenance computed by DBNotes is known as where-provenance, whereas the type of provenance computed by SPIDER is an instance of how-provenance. Towards the end of the talk I will give a brief overview of three main notions of database provenance: why-, where- and how-provenance, which have been proposed and studied in recent years.
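
To illustrate where-provenance in the simplest possible setting, the Python sketch below carries a source annotation (table, row, column) alongside each value copied by a selection/projection query. The tables and query are made up and do not reflect DBNotes internals.

    # Illustrative where-provenance: each output value is annotated with the
    # source table, row, and column it was copied from.

    source = [  # table "Emp": (name, dept)
        {"name": "Ann", "dept": "Sales"},
        {"name": "Bo",  "dept": "HR"},
    ]

    def select_project(rows, table, pred, column):
        """SELECT column FROM table WHERE pred, carrying provenance annotations."""
        out = []
        for i, row in enumerate(rows):
            if pred(row):
                out.append({"value": row[column],
                            "provenance": f"{table}[{i}].{column}"})
        return out

    print(select_project(source, "Emp", lambda r: r["dept"] == "Sales", "name"))
    # [{'value': 'Ann', 'provenance': 'Emp[0].name'}]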

Bio:
    Laura Chiticariu is a Research Staff Member in the Search and Analytics group at IBM Almaden Research Center. She received her PhD from UC Santa Cruz in September 2008.
Week 9: Clustera: A Data-Centric Approach to Scalable Cluster Management

Presenter:
David J. DeWitt (Microsoft Jim Gray Systems Lab & U. of Wisconsin)
Abstract:
    Twenty-five years ago, when we built our first cluster management system using a collection of twenty VAX 11/750 computers, the idea of a compute cluster was an exotic concept. Today, clusters of 1,000 nodes are common and some of the biggest have in excess of 10,000 nodes. Such clusters are simply awash in data about machines, users, jobs, and files. Many of the tasks that such systems are asked to perform are very similar to database transactions. For example, the system must accept jobs from users and send them off to be executed. The system should not "drop" jobs or lose files due to hardware or software failures. The software must also allow users to stop failed computations or "change their mind" and retract thousands of submitted but not yet completed jobs. Amazingly, no cluster management system that we are aware of uses a database system for managing its data.

In this talk I will describe Clustera, a new cluster management system we have been working on for the last three years. As one would expect from some database types, Clustera uses a relational DBMS to store all its operational data, including information about jobs, users, machines, and files (executable, input, and output). One unique aspect of the Clustera design is its use of an application server (JBoss currently) in front of the relational DBMS. Application servers have a number of appealing capabilities. First, they can handle tens of thousands of clients. Second, they provide fault tolerance and scalability by running on multiple server nodes. Third, they multiplex connections to the database system to a level that the database system can comfortably support. Compute nodes in a Clustera cluster appear as web clients to the application server and make SOAP calls to submit requests for jobs to execute and to update status information that is stored in the relational DBMS.

Extensibility is a second key goal of the Clustera project. Traditional cluster management systems such as Condor were targeted toward long-running, computationally intensive jobs. Newer systems such as Map-Reduce are targeted toward a specific type of data-intensive parallel computation. Parallel SQL database systems represent a third type of cluster management system. The Clustera framework was designed to handle each of these classes of computation within a single execution and data framework.
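
As a small illustration of keeping a cluster's operational data in a relational DBMS, the Python sketch below stores jobs in a table and lets a compute node claim the next idle job inside a transaction. The schema, the use of SQLite, and the job contents are assumptions for illustration, not Clustera's actual design.

    # Illustrative sketch: job state lives in a relational table, and a
    # compute node claims the next idle job transactionally.
    import sqlite3

    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE jobs (id INTEGER PRIMARY KEY, cmd TEXT, "
               "state TEXT DEFAULT 'idle', node TEXT)")
    db.executemany("INSERT INTO jobs (cmd) VALUES (?)",
                   [("sort part-0",), ("sort part-1",)])

    def claim_next_job(db, node):
        """Assign one idle job to `node` and return it, committing atomically."""
        with db:  # the claim is committed as a unit, or rolled back on error
            row = db.execute("SELECT id, cmd FROM jobs WHERE state = 'idle' "
                             "ORDER BY id LIMIT 1").fetchone()
            if row is None:
                return None
            db.execute("UPDATE jobs SET state = 'running', node = ? WHERE id = ?",
                       (node, row[0]))
            return row

    print(claim_next_job(db, "node-17"))   # (1, 'sort part-0')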

   
Please sign up for the mailing list here. Send mail to that list at Uw-db at cs