Titles and Abstracts of Guest Lectures

April 14: Rebecca Taft (Cockroach Labs)

Title: CockroachDB's Query Optimizer.

Abstract: We live in an increasingly interconnected world, with many organizations operating across countries or even continents. To serve their global user base, organizations are replacing their legacy DBMSs with cloud-based systems capable of scaling OLTP workloads to millions of users. CockroachDB is a scalable SQL DBMS that was built from the ground up to support these global OLTP workloads while maintaining high availability and strong consistency. Just like its namesake, CockroachDB is resilient to disasters through replication and automatic recovery mechanisms.

In this talk, I'll give a brief introduction to the architecture of CockroachDB followed by a deep dive into the design and implementation of CockroachDB's query optimizer. CockroachDB has a Cascades-style query optimizer that uses over 200 transformation rules to explore the space of possible query execution plans. In this talk, I'll describe the domain-specific language, Optgen, that we use to define these transformation rules, and demonstrate how the rules work in action. I'll explain how we use statistics to choose the best plan from the search space, and how we automatically collect stats without disrupting production workloads or requiring coordination between nodes. I'll also describe some of the unique challenges we face when optimizing queries for a geo-distributed environment, and how CockroachDB handles them.

Bio: Becca is the Engineering Manager of the SQL Queries team at Cockroach Labs. Prior to joining Cockroach Labs, she was a graduate student at MIT, where she worked with Professor Michael Stonebraker researching distributed database elasticity and multi-tenancy. Becca holds a B.S. in Physics from Yale University and an M.S. and Ph.D. in Computer Science from MIT. In her free time, she enjoys rowing on the Chicago River and enjoying the great outdoors.

Slides here

April 21: Nico Bruno and César A. Galindo-Legaria (Microsoft)

Title: The Cascades framework for query optimization at Microsoft.

Abstract: The Cascades framework was an academic project introduced 25 years ago as a foundation for modern query optimizers. It provides extensibility, memoization-based dynamic programming, an algebraic representation of logical and physical operator trees, and manipulation of such trees using transformation rules to enable cost-based query optimization. Cascades provides a clean framework/skeleton for optimizer development, but it needs to be instantiated with domain-knowledge and augmented in several directions to cope with real-world workloads in an industrial setting. We will describe some design choices and extensions to Cascades that power multiple Microsoft products, including MS SQL Server and Azure Synapse Analytics.
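As a rough sketch of the memoization the abstract mentions (an illustration of the general idea, not Microsoft's implementation): the memo groups logically equivalent expressions together, so each alternative is costed once and the group remembers its cheapest physical plan.

```python
# Minimal sketch of a Cascades-style memo group: logically equivalent
# expressions share a group, and the group tracks the cheapest plan.
# Illustrative only; real memos recurse over groups of child expressions.
class Group:
    def __init__(self):
        self.exprs = []   # logically equivalent alternatives
        self.best = None  # (cost, expr) of the cheapest plan seen so far

    def add(self, expr, cost):
        self.exprs.append(expr)
        if self.best is None or cost < self.best[0]:
            self.best = (cost, expr)

memo = Group()
memo.add("HashJoin(A, B)", cost=120.0)
memo.add("MergeJoin(A, B)", cost=95.0)
memo.add("NestedLoopJoin(A, B)", cost=400.0)
print(memo.best)  # (95.0, 'MergeJoin(A, B)')
```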

Slides here

April 28: Ippokratis Pandis (Amazon)

Title: Amazon Redshift Re-invented

Abstract: In 2013, eight years ago, Amazon Web Services revolutionized the data warehousing industry by launching Amazon Redshift, the first fully managed, petabyte-scale cloud data warehouse solution. Amazon Redshift made it simple and cost-effective to efficiently analyze large volumes of data using existing business intelligence tools. This launch was a significant leap from the traditional on-premises data warehousing solutions, which were expensive, rigid (not elastic), and required a lot of tribal knowledge to operate. Unsurprisingly, customers embraced Amazon Redshift and it went on to become the fastest growing service in AWS. Today, tens of thousands of customers use Amazon Redshift in AWS's global infrastructure of 25 launched Regions and 81 Availability Zones (AZs) to process exabytes of data daily.

The success of Amazon Redshift inspired a lot of innovation in the analytics industry which in turn has benefited consumers. In the last few years, the use cases for Amazon Redshift have evolved and in response, Amazon Redshift has delivered a series of innovations that continue to delight customers. In this talk, we take a peek under the hood of Amazon Redshift, and give an overview of its architecture. We focus on the core of the system and explain how Amazon Redshift maintains its differentiating industry-leading performance and scalability. We discuss how Amazon Redshift extends beyond traditional data warehousing workloads, by integrating with the broad AWS ecosystem making Amazon Redshift a one-stop solution for analytics. We then talk about Amazon Redshift’s autonomics and Amazon Redshift Serverless. In particular, we present how Redshift continuously monitors the system and uses machine learning to improve its performance and operational health without the need of dedicated administration resources, in an easy to use offering.

Bio: Ippokratis Pandis is a senior principal engineer at Amazon Web Services, currently working on Amazon Redshift. Redshift is Amazon's fully managed, petabyte-scale data warehouse service. Previously, Ippokratis held positions as a software engineer at Cloudera, where he worked on the Impala SQL-on-Hadoop query engine, and as a member of the research staff at the IBM Almaden Research Center, where he worked on IBM DB2 BLU. Ippokratis received his PhD from the Electrical and Computer Engineering department at Carnegie Mellon University. He is the recipient of Best Demonstration awards at ICDE 2006 and SIGMOD 2011, and the Test-of-Time award at EDBT 2019. He has served or is serving as PC chair of DaMoN 2014, DaMoN 2015, CloudDM 2016, HPTS 2019, and ICDE Industrial 2022, as well as General Chair of SIGMOD 2023.

Paper: here

Slides: here (CSE NetID required)

May 5: Justin Levandoski (Google)

Title: An Overview of Google BigQuery

Abstract: Google BigQuery is a serverless, scalable, and cost effective cloud data warehouse. Having evolved from internal Google infrastructure (Dremel), BigQuery is unique in a number of dimensions. In this talk, we provide a look at some of the key architectural aspects of BigQuery and how it provides a true serverless and multi-tenant warehousing solution to customers. We then provide an overview of recent features such as BQML and the embedded BI engine that build on these architectural foundations that allow BigQuery to provide a unique data warehousing solution to customers.

Bio: Justin Levandoski works on BigQuery, Google Cloud's native data warehouse. Prior to Google, he was a principal engineer at Amazon Web Services (AWS), where he worked on Amazon Aurora, a cloud-native operational database system. Before that, he was a member of the database group at Microsoft Research, where he worked on main-memory databases, database support for new hardware platforms, transaction processing, and cloud computing. His research was commercialized in a number of Microsoft products, including the SQL Server Hekaton main-memory database engine, Azure CosmosDB, and Bing.

Slides here

May 12: Doug Brown (Teradata)

Agenda: Key Components/Concepts (see also here)

Slides here

May 19: Jiaqi Yan (Snowflake)

Title: The Snowflake Data Cloud

Abstract: The Snowflake Data Cloud is a new type of cloud dedicated to data analytics, built on top of the infrastructure provided by the three major cloud providers, namely Amazon, Microsoft, and Google. In this talk, I will give an overview of the Snowflake cloud data platform, highlighting how we leverage the cloud to make Snowflake infinitely scalable, elastic, global, collaborative, and extremely simple to use. In particular, we will show that this is made possible by our multi-cluster shared-data architecture. This novel architecture, specifically designed for the cloud, allows unlimited and independent scaling of both compute and storage, with instant elasticity along both dimensions. The Snowflake cloud data platform is globally distributed, making it one single multi-tenant platform for the world. This enables collaboration between tenants, for example allowing one tenant, a provider, to share a secure view of a database with one or more tenants, referred to as consumers. Our multi-cluster shared-data architecture enables live data sharing without the need to share any compute resources, resulting in full isolation between providers and consumers.
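The multi-cluster shared-data idea can be sketched very roughly as follows (a toy illustration under my own assumptions, not Snowflake's implementation): many independent compute clusters read the same shared storage, so compute scales separately from data, and "sharing" a database with a consumer grants access to the same underlying data rather than copying it.

```python
# Toy sketch of a multi-cluster shared-data architecture. Illustrative
# only; class names and structure are invented for this example.
shared_storage = {"sales_db": ["part-0001", "part-0002"]}  # shared, immutable files

class ComputeCluster:
    """An independently scalable compute cluster over shared storage."""
    def __init__(self, name):
        self.name = name

    def query(self, db):
        # Reads the shared data directly; no per-cluster copy is made.
        return shared_storage[db]

provider = ComputeCluster("provider-wh")
consumer = ComputeCluster("consumer-wh")  # separate compute: full isolation
assert provider.query("sales_db") == consumer.query("sales_db")
```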

Bio: Jiaqi Yan is a Principal Software Engineer at Snowflake Computing. He primarily leads development efforts for the Compiler and Optimizer of the Snowflake Database, along with Workload Optimization features. Prior to joining Snowflake, Jiaqi worked as a Senior Member of Technical Staff for Oracle Database's Query Engine.

Slides: here

May 26: Martin Bravenboer (RelationalAI)

Title: Design and Implementation of the RelationalAI Knowledge Graph Management System

Abstract: RelationalAI is the next-generation database system for new intelligent data applications based on relational knowledge graphs. RelationalAI complements the modern data stack by allowing data applications to be implemented relationally and declaratively, leveraging knowledge/semantics for reasoning, graph analytics, relational machine learning, and mathematical optimization workloads. RelationalAI, as a relational and cloud-native system, fits naturally in the modern data stack, providing virtually infinite compute and storage capacity, versioning, and a fully managed system. RelationalAI supports the workload of data applications with an expressive relational language (called Rel), novel join algorithms and JIT compilation suitable for complex computational workloads, semantic optimization that leverages knowledge to optimize application logic, and incrementality of the entire system for both data (IVM) and code (live programming). The system utilizes immutable data structures, versioning, parallelism, distribution, and out-of-core memory management to support state-of-the-art workload isolation and scalability for simple as well as complex business logic. In our experience, RelationalAI’s expressive, relational, and declarative language leads to a 10-100x reduction in code for complex business domains. Applications are developed faster, with superior quality, by bringing non-technical domain experts into the process and by automating away complex programming tasks. We discuss the core innovations that underpin the RelationalAI system: an expressive relational language, worst-case optimal join algorithms, semantic optimization, just-in-time compilation, schema discovery and evolution, incrementality and immutability.
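To give a flavor of worst-case optimal joins (a generic sketch of the technique, not RelationalAI's implementation): on the triangle query Q(a,b,c) = R(a,b), S(b,c), T(c,a), instead of joining two relations at a time, the algorithm binds one attribute at a time by intersecting the values each relation permits, which avoids materializing large binary intermediate results.

```python
# Illustrative attribute-at-a-time ("generic") join for the triangle
# query R(a,b) JOIN S(b,c) JOIN T(c,a). Not production code.
R = {(1, 2), (1, 3), (2, 3)}   # edges a -> b
S = {(2, 3), (3, 1), (3, 4)}   # edges b -> c
T = {(3, 1), (1, 2), (4, 1)}   # edges c -> a

def triangles(R, S, T):
    out = []
    # Bind a: values that appear as R's first and T's second attribute.
    for a in {x for x, _ in R} & {x for _, x in T}:
        # Bind b: consistent with R(a, b) and appearing in S's first attribute.
        for b in {y for x, y in R if x == a} & {y for y, _ in S}:
            # Bind c: consistent with both S(b, c) and T(c, a).
            for c in {z for y, z in S if y == b} & {z for z, x in T if x == a}:
                out.append((a, b, c))
    return sorted(out)

print(triangles(R, S, T))  # [(1, 2, 3), (1, 3, 4), (2, 3, 1)]
```

Each attribute is fixed only to values every participating relation can extend, which is the key property behind the worst-case optimality guarantees of algorithms in this family.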

Slides: here