Database Group Meeting

Organised by: Magda Balazinska

The Database Group meets every Friday, 3:30-4:20pm, in CSE 405, Allen Center. We meet to present ongoing work in the group and to attend talks by invited speakers.

Upcoming talks are announced on uw-db@cs. Please sign up for the mailing list.

Schedule

Date Presenter Talk Title
Sep 27 Gabriel
Oct 4 Paris
Oct 11 Sudip
Oct 18 Bailu
Oct 25 Tim Kraska MLbase
Nov 1 Andy Pavlo What's Really New with NewSQL?
Nov 8 Dan Olteanu Factorized Relational Databases
Nov 15 Mike Cafarella Semiautomatic Spreadsheet Extraction
Nov 22 Barzan Mozafari Query Petabytes of Data in a Blink Time!
Nov 29 Thanksgiving
Dec 6

Details

Tim Kraska

Title: MLbase

Abstract: Machine learning (ML) and statistical techniques are crucial to transforming Big Data into actionable knowledge. However, the complexity of existing ML algorithms is often overwhelming. End-users often do not understand the trade-offs and challenges of parameterizing and choosing between different learning techniques. Furthermore, existing scalable systems that support ML are typically not accessible to ML developers without a strong background in distributed systems and low-level primitives. In this talk, I will present MLbase, a new system designed to tackle both of these issues simultaneously. MLbase provides (1) a simple declarative way for end-users to specify ML tasks, (2) a novel optimizer to select and dynamically adapt the choice of learning algorithm, and (3) a set of high-level operators to enable ML developers to scalably implement a wide range of ML methods without deep systems knowledge.
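The declarative split the abstract describes, where the user states the task and the system chooses the algorithm, can be sketched in miniature. All names and the toy "optimizer" below are invented for illustration; this is not MLbase's actual API:

```python
# Toy sketch of a declarative ML interface: the user says WHAT
# ("classify this data"); the optimizer decides HOW, by validating
# candidate learners and keeping the most accurate one.

def majority_learner(train):
    """Baseline: always predict the most common training label."""
    labels = [y for _, y in train]
    top = max(set(labels), key=labels.count)
    return lambda x: top

def nearest_neighbor_learner(train):
    """1-NN: predict the label of the closest training point."""
    def predict(x):
        return min(train, key=lambda p: sum((a - b) ** 2 for a, b in zip(p[0], x)))[1]
    return predict

def do_classify(data, candidates):
    """Toy 'optimizer': holdout-validate each candidate learner,
    then retrain the winner on all data and return the fitted model."""
    train, valid = data[: len(data) // 2], data[len(data) // 2 :]
    def score(learner):
        model = learner(train)
        return sum(model(x) == y for x, y in valid) / len(valid)
    best = max(candidates, key=score)
    return best(data)

# Toy data: label is 1 for positive x, 0 for negative x, interleaved
# so both classes appear in the training half.
data = []
for x in range(1, 11):
    data.append(((float(x),), 1))
    data.append(((float(-x),), 0))

model = do_classify(data, [majority_learner, nearest_neighbor_learner])
print(model((7.5,)))  # 1-NN wins validation and classifies a positive point as 1
```

The point of the sketch is the separation of concerns: adding a new learner means adding a candidate, not changing the user-facing call.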

Bio: Tim Kraska is an Assistant Professor in the Computer Science department at Brown University. Currently, his research focuses on Big Data management in the cloud and hybrid human/machine database systems. Before joining Brown, Tim Kraska spent two years as a postdoc in the AMPLab at UC Berkeley after receiving his PhD from ETH Zurich, where he worked on transaction management and stream processing. He was awarded a Swiss National Science Foundation Prospective Researcher Fellowship (2010), a DAAD Scholarship (2006), a University of Sydney Master of Information Technology Scholarship for outstanding achievement (2005), the University of Sydney Siemens Prize (2005), a VLDB best demo award (2011), and an ICDE best paper award (2013).

Andy Pavlo

Title: What's Really New with NewSQL?

Abstract: The previous decade was marked by the demands of Web-based applications clashing with the limitations of traditional database management systems (DBMSs). This brought about two scaling solutions to support high-velocity applications: custom sharding middleware and NoSQL systems. These approaches focused on providing high availability and scalability by forgoing strong transactional guarantees. Although such trade-offs are appropriate for certain situations, they are insufficient for many OLTP applications that deal with high-profile data. A contemporary class of relational DBMSs, known as NewSQL, has emerged to provide the same scalable performance of middleware and NoSQL systems for OLTP workloads while maintaining the ACID properties of traditional DBMSs. It is often not clear, however, how these systems actually achieve this goal and which of their features are actually novel. In this talk, I will present an overview of the state-of-the-art in NewSQL systems and discuss recent advancements in scalable transaction processing. I will then discuss the key research problems that need to be overcome in order to enable NewSQL DBMSs to support larger and more complex workloads in the future. I will conclude with a description of my own work in building the elusive high-performance, “one-size-almost-fits-all” distributed DBMS.

Bio: Andy Pavlo is an Assistant Professor in the Computer Science department at Carnegie Mellon University. His research interests center on database management systems, specifically main memory systems, distributed transaction processing systems, and large-scale data analytics. He received his Ph.D. in 2013 from Brown University, where he was the lead developer of the H-Store system (since commercialized as VoltDB).

Dan Olteanu

Title: Factorized Relational Databases

Abstract: In this talk I will present a representation system for relational data that uses factored forms to succinctly encode large relations.

I will then address two main questions:

1. How succinct are factorizations of query results?

2. Can such factorizations speed up query evaluation?

I will also comment on how factorizations are used in Google's recent distributed database management system F1, for scalable machine learning, for managing large sets of possibilities and choices in incomplete information and configuration problems, and for tractable query evaluation in probabilistic databases.

This is joint work with Jakub Zavodny.
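A toy example (invented here, not taken from the talk) shows why factored forms can be dramatically smaller than flat join results:

```python
# Join R(A, B) with S(B, C) on B: the flat result materializes every
# (a, b, c) combination, while a factorized form keeps, per B-value,
# the product {A-values} x {b} x {C-values} unexpanded.

from itertools import product

R = [(a, "b0") for a in range(100)]   # 100 tuples, all with B = b0
S = [("b0", c) for c in range(100)]   # 100 tuples, all with B = b0

# Flat join result: 100 * 100 = 10,000 tuples.
flat = [(a, b, c) for (a, b) in R for (b2, c) in S if b == b2]

# Factorized form: for each B-value, store the A-list and C-list once.
factorized = {}
for (a, b) in R:
    factorized.setdefault(b, ([], []))[0].append(a)
for (b, c) in S:
    factorized.setdefault(b, ([], []))[1].append(c)

# Size measured in stored singleton values.
flat_size = 3 * len(flat)
fact_size = sum(1 + len(As) + len(Cs) for b, (As, Cs) in factorized.items())
print(flat_size, fact_size)  # 30000 vs 201

# The factorization loses nothing: expanding it recovers the flat result.
expanded = [(a, b, c) for b, (As, Cs) in factorized.items()
            for a, c in product(As, Cs)]
assert sorted(expanded) == sorted(flat)
```

The gap grows with the number of matching tuples per join key, which is one intuition behind the succinctness question the talk raises.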

Bio: Dan Olteanu is a tenured University Lecturer (equivalent of tenured Associate Professor in North America) in the Department of Computer Science at the University of Oxford and Fellow of St Cross College.

He is currently spending his sabbatical as Computer Scientist at LogicBlox and as Visiting Associate Professor at UC Berkeley.

Mike Cafarella

Title: Semiautomatic Spreadsheet Extraction

Abstract: Spreadsheets have evolved into a “Swiss Army Knife” for data management that allows non-experts to perform many database-style tasks. As a result, spreadsheet files are generally popular, easy for humans to understand, and contain interesting data on a wide range of topics. Spreadsheets’ data should make them a prime target for integration with other sources, but their lack of explicit schema information makes doing so a painful and error-prone task.

We propose a two-phase semiautomatic approach for extracting relational data from spreadsheets. Unlike past approaches, it is domain-independent and requires no up-front user guidance in the form of topic-specific schemas, extraction rules, or training examples. In the first phase, an automatic extractor uses hints from spreadsheets’ graphical style and recovered metadata to extract data as accurately as possible. In the second phase, the system asks a human to manually repair any extraction errors; by identifying regions of the dataset that are very similar, the system can amortize human effort over many possible extraction errors. The result is a system that can obtain correct extractions with substantially less human effort than with a standard technique. In addition to the extraction system, we will present a large-scale portrait of how spreadsheets are used for data management by examining 400,000 spreadsheets crawled from the Web.
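The "amortize human effort over similar regions" idea can be sketched with a crude grouping heuristic (invented here, not the system's actual method): cells that share a style signature can all be repaired by a single human correction.

```python
# Toy sketch: group extracted cells by a character-class signature,
# so one human fix can be propagated to every cell in the same group.

def signature(cell):
    """Crude 'style' signature: digits become 9, letters become a,
    punctuation is kept as-is."""
    return "".join("9" if ch.isdigit() else "a" if ch.isalpha() else ch
                   for ch in cell)

cells = ["1,234", "5,678", "12", "N/A", "9,999", "n/a"]
groups = {}
for c in cells:
    groups.setdefault(signature(c), []).append(c)

print(groups)
# Cells like "1,234" and "5,678" land in one group; "N/A" and "n/a"
# land in another, so a single repair decision covers each group.
```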

Bio: Michael Cafarella is an assistant professor in the division of Computer Science and Engineering at the University of Michigan. His research interests include databases, information extraction, data integration, and data mining. He has published extensively in venues such as SIGMOD and VLDB. Mike received his PhD from the University of Washington, Seattle, in 2009 with advisors Oren Etzioni and Dan Suciu. He received the NSF CAREER award in 2011. In addition to his academic work, he co-founded (with Doug Cutting) the Hadoop open-source project, which is widely used at Facebook, Yahoo!, and elsewhere.

Barzan Mozafari

Title: Query Petabytes of Data in a Blink Time!

Abstract: For the past few decades, databases have been a successful abstraction for accessing and managing data. However, the rapid growth of data and the demand for more complex analytics have significantly hindered the scalability and applicability of these systems beyond traditional business data processing scenarios. In this talk, I will present the theme of my research in addressing these challenges, which consists of adapting tools from applied statistics to build robust and scalable data-intensive systems. In particular, I will focus on our parallel query engine, called BlinkDB, that enables interactive, ad-hoc queries over massive volumes of data. First, I will briefly overview BlinkDB’s architecture, as well as our design choices driven by real-world workloads from several companies. Then, I will demonstrate how BlinkDB employs sophisticated optimization and sampling strategies to achieve sub-second latency on tens of terabytes to petabytes of data. Finally, I will turn to the problem of quality assessment of approximate answers in BlinkDB, where I present our new algorithm that is several orders of magnitude faster than the state-of-the-art variants of bootstrap.

Relevant URL: http://blinkdb.org
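The sample-then-assess pattern the abstract describes can be sketched in a few lines (a toy illustration, not BlinkDB's actual engine or query syntax): answer an aggregate from a small uniform sample, then use the bootstrap to attach an error estimate to the approximation.

```python
# Toy approximate query: avg over a sample, with a bootstrap 95% CI.

import random
random.seed(42)

# Stand-in for a huge table: 100,000 numeric values.
table = [random.gauss(100.0, 15.0) for _ in range(100_000)]

def approx_avg(data, sample_size=1_000, bootstraps=200):
    """Estimate avg(data) from a small uniform sample, plus a
    bootstrap 95% confidence interval on that estimate."""
    sample = random.sample(data, sample_size)
    estimate = sum(sample) / sample_size
    # Bootstrap: resample the sample with replacement to gauge variability.
    reps = sorted(
        sum(random.choices(sample, k=sample_size)) / sample_size
        for _ in range(bootstraps)
    )
    lo, hi = reps[int(0.025 * bootstraps)], reps[int(0.975 * bootstraps)]
    return estimate, lo, hi

estimate, lo, hi = approx_avg(table)
true_avg = sum(table) / len(table)
print(f"approx {estimate:.2f}  95% CI [{lo:.2f}, {hi:.2f}]  true {true_avg:.2f}")
```

Note that the classical bootstrap above reruns the aggregate hundreds of times; the talk's contribution is precisely a much faster way to do this quality assessment.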

Bio: Barzan Mozafari is an Assistant Professor of Computer Science and Engineering at the University of Michigan (Ann Arbor), where he is a member of the Michigan Database Group and the Software Systems Lab. Prior to that, he was a Postdoctoral Associate at the Massachusetts Institute of Technology. He earned his Ph.D. in Computer Science from the University of California at Los Angeles. He is passionate about building large-scale data-intensive systems, with a particular interest in database-as-a-service clouds, distributed systems, and crowdsourcing. In his research, he draws on advanced mathematical models to deliver practical database solutions. He has won several awards and fellowships, including the best paper awards at SIGMOD 2012 and EuroSys 2013.

Feel free to send comments to Prasang.