Overview

This class focuses on optimizing serving for large language models (LLMs). It delves into the challenges of deploying these models at scale and examines advanced strategies for improving serving efficiency. Topics include an in-depth analysis of the transformer architecture and performance analysis methodologies such as the roofline model. We will study both the attention and feed-forward layers in detail, alongside key optimizations that have been proposed, including memory management, sparsity-related optimizations, hardware-aware optimizations, and speculative decoding. In addition, we will explore parallelization strategies used in serving systems and learn about collective communications. Finally, we will study scheduling systems that execute serving requests efficiently. The class includes a number of hands-on programming assignments designed to solidify the introduced concepts.

Schedule