Overview

This class focuses on optimizing serving for large language models (LLMs). It delves into the challenges of deploying these models at scale and examines advanced strategies for improving serving efficiency. Topics include an in-depth analysis of the transformer architecture and performance analysis methodologies such as the roofline model. We will study both the attention and feed-forward layers in detail, alongside key optimizations that have been proposed, including memory management, sparsity-related optimizations, hardware-aware optimizations, and speculative decoding. In addition, we will explore parallelization strategies used in serving systems and learn about collective communications. Finally, we will study scheduling systems that execute serving requests efficiently. The class includes a number of hands-on programming assignments designed to solidify the introduced concepts.

Schedule