Titles and Abstracts of Guest Lectures
April 14: Rebecca Taft (Cockroach Labs)
Title: CockroachDB's Query Optimizer
Abstract: We live in an increasingly interconnected world, with many
organizations operating across countries or even continents. To serve
their global user base, organizations are replacing their legacy DBMSs
with cloud-based systems capable of scaling OLTP workloads to millions
of users. CockroachDB is a scalable SQL DBMS that was built from the
ground up to support these global OLTP workloads while maintaining
high availability and strong consistency. Just like its namesake,
CockroachDB is resilient to disasters through replication and
automatic recovery mechanisms.
In this talk, I'll give a brief introduction to the architecture of
CockroachDB followed by a deep dive into the design and implementation
of CockroachDB's query optimizer. CockroachDB has a Cascades-style
query optimizer that uses over 200 transformation rules to explore the
space of possible query execution plans. In this talk, I'll describe
the domain-specific language, Optgen, that we use to define these
transformation rules, and demonstrate how the rules work in
action. I'll explain how we use statistics to choose the best plan
from the search space, and how we automatically collect stats without
disrupting production workloads or requiring coordination between
nodes. I'll also describe some of the unique challenges we face when
optimizing queries for a geo-distributed environment, and how
CockroachDB handles them.
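For a feel of what a transformation rule does, here is a minimal Python sketch of join commutativity, one of the simplest rules in this family. All names are hypothetical; the sketch only illustrates the match-pattern-then-replace shape of such rules, not CockroachDB's actual Optgen code.

    # Minimal sketch of a Cascades-style transformation rule (join
    # commutativity). Hypothetical names; not CockroachDB's actual API.
    from dataclasses import dataclass

    @dataclass(frozen=True)
    class Scan:
        table: str

    @dataclass(frozen=True)
    class InnerJoin:
        left: object
        right: object
        on: str  # join predicate, kept as text for brevity

    def commute_join(expr):
        # Rule: (InnerJoin L R on) => (InnerJoin R L on). A real optimizer
        # would add the commuted expression to the same memo group as the
        # original and cost both; here we simply return the new form.
        if isinstance(expr, InnerJoin):
            return InnerJoin(expr.right, expr.left, expr.on)
        return expr

    plan = InnerJoin(Scan("orders"), Scan("customers"),
                     "orders.cid = customers.id")
    print(commute_join(plan))  # the same join with its inputs swapped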
Bio: Becca is the Engineering Manager of the SQL Queries team at
Cockroach Labs. Prior to joining Cockroach Labs, she was a graduate
student at MIT, where she worked with Professor Michael Stonebraker
researching distributed database elasticity and multi-tenancy. Becca
holds a B.S. in Physics from Yale University and an M.S. and Ph.D. in
Computer Science from MIT. In her free time, she enjoys rowing on the
Chicago River and exploring the great outdoors.
Slides: here
April 21: Nico Bruno and César A. Galindo-Legaria (Microsoft)
Title: The Cascades framework for query optimization at
Microsoft.
Abstract:
The Cascades framework was an academic project introduced 25 years ago
as a foundation for modern query optimizers. It provides
extensibility, memoization-based dynamic programming, an algebraic
representation of logical and physical operator trees, and
manipulation of such trees using transformation rules to enable
cost-based query optimization. Cascades provides a clean
framework/skeleton for optimizer development, but it needs to be
instantiated with domain knowledge and augmented in several directions
to cope with real-world workloads in an industrial setting. We will
describe some design choices and extensions to Cascades that power
multiple Microsoft products, including MS SQL Server and Azure Synapse
Analytics.
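As a rough illustration of memoization-based dynamic programming, here is a toy Python sketch of a memo: each group holds logically equivalent expressions, and the cheapest plan per group is computed once and cached, however many parent expressions reference the group. This is a sketch under simplifying assumptions, not Microsoft's implementation.

    # Toy memo: group -> equivalent expressions; the best plan per group
    # is memoized so each group is costed at most once. Not a real
    # optimizer; all names and costs here are made up.
    class Memo:
        def __init__(self):
            self.groups = {}  # group id -> list of equivalent expressions
            self.best = {}    # group id -> (cost, expression) cache

        def add(self, gid, expr):
            self.groups.setdefault(gid, []).append(expr)

        def optimize(self, gid, cost_fn):
            if gid not in self.best:  # memoization step
                self.best[gid] = min((cost_fn(e), e)
                                     for e in self.groups[gid])
            return self.best[gid]

    memo = Memo()
    memo.add("join(A,B)", ("HashJoin", "A", "B"))
    memo.add("join(A,B)", ("MergeJoin", "A", "B"))
    toy_costs = {"HashJoin": 3.0, "MergeJoin": 2.0}  # made-up numbers
    print(memo.optimize("join(A,B)", lambda e: toy_costs[e[0]]))
    # (2.0, ('MergeJoin', 'A', 'B'))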
Slides: here
April 28: Ippokratis Pandis (Amazon)
Title: Amazon Redshift Re-invented
Abstract:
In 2013, eight years ago, Amazon Web Services revolutionized the data
warehousing industry by launching Amazon Redshift, the first fully
managed, petabyte-scale cloud data warehouse solution. Amazon Redshift
made it simple and cost-effective to efficiently analyze large volumes
of data using existing business intelligence tools. This launch was a
significant leap from the traditional on-premises data warehousing
solutions, which were expensive, rigid (not elastic), and required a
great deal of tribal knowledge to operate. Unsurprisingly, customers embraced
Amazon Redshift and it went on to become the fastest growing service
in AWS. Today, tens of thousands of customers use Amazon Redshift in
AWS's global infrastructure of 25 launched Regions and 81 Availability
Zones (AZs) to process exabytes of data daily.
The success of Amazon Redshift inspired a lot of innovation in the
analytics industry which in turn has benefited consumers. In the last
few years, the use cases for Amazon Redshift have evolved and in
response, Amazon Redshift has delivered a series of innovations that
continue to delight customers. In this talk, we take a peek under the
hood of Amazon Redshift, and give an overview of its architecture. We
focus on the core of the system and explain how Amazon Redshift
maintains its differentiating industry-leading performance and
scalability. We discuss how Amazon Redshift extends beyond traditional
data warehousing workloads by integrating with the broad AWS
ecosystem, making it a one-stop solution for analytics. We
then talk about Amazon Redshift’s autonomics and Amazon Redshift
Serverless. In particular, we present how Redshift continuously
monitors the system and uses machine learning to improve its
performance and operational health without the need for dedicated
administration resources, all in an easy-to-use offering.
Bio:
Ippokratis Pandis is a senior principal engineer at Amazon Web
Services, currently working on Amazon Redshift. Redshift is Amazon's
fully managed, petabyte-scale data warehouse service. Previously,
Ippokratis has held positions as a software engineer at Cloudera,
where he worked on the Impala SQL-on-Hadoop query engine, and as a
member of the research staff at the IBM Almaden Research Center, where he worked
on IBM DB2 BLU. Ippokratis received his PhD from the Electrical and
Computer Engineering department at Carnegie Mellon University. He is
the recipient of Best Demonstration awards at ICDE 2006 and SIGMOD
2011, and a Test-of-Time award at EDBT 2019. He has served, or is
serving, as PC chair of DaMoN 2014, DaMoN 2015, CloudDM 2016, HPTS
2019, and ICDE
Industrial 2022, as well as General Chair of SIGMOD 2023.
Paper: here
Slides: here
(CSE NetID required)
May 5: Justin Levandoski (Google)
Title: An Overview of Google BigQuery
Abstract:
Google BigQuery is a serverless, scalable, and cost-effective cloud
data warehouse. Having evolved from internal Google infrastructure
(Dremel), BigQuery is unique in a number of dimensions. In this talk,
we provide a look at some of the key architectural aspects of BigQuery
and how it provides a true serverless and multi-tenant warehousing
solution to customers. We then provide an overview of recent features,
such as BQML and the embedded BI Engine, that build on these
architectural foundations and allow BigQuery to provide a unique data
warehousing solution to customers.
Bio:
Justin Levandoski works on BigQuery, Google Cloud's native data
warehouse. Prior to Google, he was a principal engineer at Amazon Web
Services (AWS), where he worked on Amazon Aurora, a cloud-native
operational database system. Before that, he was a member of the
database group at Microsoft Research, where he worked on main-memory
databases, database support for new hardware platforms, transaction
processing, and cloud computing. His research was commercialized in a
number of Microsoft products, including the SQL Server Hekaton
main-memory database engine, Azure Cosmos DB, and Bing.
Slides: here
May 12: Doug Brown (Teradata)
Agenda: Key Components/Concepts (see also
here)
- MPP Shared-nothing Software-based Architecture
- Hash-based data distribution, with row and column partitioning (see the sketch after this list)
- Data Management
- Relational, semi-structured, unstructured
- Cost-based Optimizer
- Query Execution
- SQL Engine, Machine Learning Engines
- Extensibility
- UDTs, UDFs, procedures, foreign servers
- High-concurrency mixed workload management
- Reliability and availability
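As a toy illustration of the hash-based distribution named above (a sketch only; the names and scheme below are hypothetical, not Teradata's actual hashing design or terminology), hashing a row's primary-index value picks the parallel unit that owns the row:

    # Toy sketch of hash-based row distribution in a shared-nothing MPP
    # system. NUM_UNITS and owning_unit are invented for illustration.
    import hashlib

    NUM_UNITS = 4  # pretend number of parallel units

    def owning_unit(primary_index_value):
        digest = hashlib.sha256(repr(primary_index_value).encode()).hexdigest()
        return int(digest, 16) % NUM_UNITS

    for cust_id in (101, 102, 103, 104):
        print(cust_id, "->", owning_unit(cust_id))  # unit storing the row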
Slides: here
May 19: Jiaqi Yan (Snowflake)
Title: The Snowflake Data Cloud
Abstract:
The Snowflake Data Cloud is a new type of cloud dedicated to data
analytics, built on top of the infrastructure provided by the three
major cloud providers, namely Amazon, Microsoft, and Google. In this
talk, I will give an overview of the Snowflake cloud data platform,
highlighting how we leverage the cloud to make Snowflake infinitely
scalable, elastic, global, collaborative, and extremely simple to
use. In particular, we will show that this is made possible by our
multi-cluster shared-data architecture. This novel architecture,
specifically designed for the cloud, allows unlimited and independent
scaling of both compute and storage, with instant elasticity along
both dimensions. The Snowflake cloud data platform is globally
distributed, making it one single multi-tenant platform for the
world. This enables collaboration between tenants, for example
allowing one tenant, a provider, to share a secure view of a database
with one or more tenants, referred to as consumers. Our multi-cluster
shared-data architecture enables live data sharing without the need to
share any compute resources, resulting in full isolation between
providers and consumers.
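To make the isolation argument concrete, here is a toy Python model (purely illustrative; none of these names are Snowflake's API) in which storage is one shared read-only layer, every tenant brings its own compute, and the provider exposes only a filtered secure view:

    # Toy model of multi-cluster shared-data sharing. Storage is shared
    # and read-only; compute is per tenant; a secure view applies the
    # provider's predicate on read. Illustrative names only.
    SHARED_STORAGE = {
        "sales": [
            {"region": "EU", "amount": 120},
            {"region": "US", "amount": 300},
        ]
    }

    def secure_view(table, predicate):
        # The consumer receives the view, never the base table.
        return lambda: [r for r in SHARED_STORAGE[table] if predicate(r)]

    class ComputeCluster:
        # Per-tenant compute; scaling one cluster never touches another's.
        def __init__(self, tenant):
            self.tenant = tenant

        def query(self, view):
            return view()

    eu_sales = secure_view("sales", lambda r: r["region"] == "EU")
    print(ComputeCluster("consumer").query(eu_sales))
    # [{'region': 'EU', 'amount': 120}]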
Bio: Jiaqi Yan is a Principal Software Engineer at Snowflake
Computing. He primarily leads development efforts for the Compiler and
Optimizer of the Snowflake Database, along with Workload Optimization
features. Prior to joining Snowflake, Jiaqi worked as a Senior Member
of Technical Staff for Oracle Database's Query Engine.
Slides: here
May 26: Martin Bravenboer (RelationalAI)
Title: Design and Implementation of the RelationalAI Knowledge Graph
Management System
Abstract:
RelationalAI is the next-generation database system for new
intelligent data applications based on relational knowledge
graphs. RelationalAI complements the modern data stack by allowing
data applications to be implemented relationally and declaratively,
leveraging knowledge/semantics for reasoning, graph analytics,
relational machine learning, and mathematical optimization
workloads. RelationalAI as a relational and cloud native system fits
naturally in the modern data stack, providing virtually infinite
compute and storage capacity, versioning, and a fully managed
system. RelationalAI supports the workload of data applications with
an expressive relational language (called Rel), novel join algorithms
and JIT compilation suitable for complex computational workloads,
semantic optimization that leverages knowledge to optimize application
logic, and incrementality of the entire system for both data (IVM) and
code (live programming). The system utilizes immutable data
structures, versioning, parallelism, distribution, and out-of-core
memory management to support state-of-the-art workload isolation and
scalability for simple as well as complex business logic. In our
experience, RelationalAI’s expressive, relational, and declarative
language leads to a 10-100x reduction in code for complex business
domains. Applications are developed faster, with superior quality by
bringing non-technical domain experts into the process and by
automating away complex programming tasks. We discuss the core
innovations that underpin the RelationalAI system: an expressive
relational language, worst-case optimal join algorithms, semantic
optimization, just-in-time compilation, schema discovery and
evolution, incrementality, and immutability.
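The worst-case optimal joins mentioned above bind one variable at a time rather than joining two relations at a time. A textbook Python sketch for the triangle query Q(a,b,c) = R(a,b), S(b,c), T(a,c) follows; it illustrates the generic-join idea only and is not RelationalAI's implementation.

    # Textbook sketch of a worst-case optimal (generic) join on the
    # triangle query: bind a, then b, then c, intersecting the candidate
    # sets from every relation that mentions the variable.
    R = {(1, 2), (1, 3), (2, 3)}
    S = {(2, 3), (3, 1)}
    T = {(1, 3), (2, 1)}

    def triangles(R, S, T):
        out = []
        for a in {r[0] for r in R} & {t[0] for t in T}:
            for b in {r[1] for r in R if r[0] == a} & {s[0] for s in S}:
                for c in ({s[1] for s in S if s[0] == b}
                          & {t[1] for t in T if t[0] == a}):
                    out.append((a, b, c))
        return out

    print(triangles(R, S, T))  # [(1, 2, 3), (2, 3, 1)] (order may vary)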
Slides: here