Welcome to ML for ML Systems, taught by Prof. Luis Ceze with Zihao Ye as TA.
ML models are quickly becoming an integral component of how applications are built. Yet they are different from most software: performance hungry, bandwidth hungry, and very fast-evolving. This leads to the need to build systems to support them: abstractions and frameworks to tame complexity and adapt quickly; compilers, programming languages, and runtime systems to make efficient use of hardware resources; better communication approaches for distributed systems; and more. One important twist in this fast systems development is that the optimization spaces for ML systems themselves (codegen for ML models, system parameter tuning, resource allocation, etc.) are very large, so these systems use machine learning itself to provide effective solutions. So yes, you read the name of the class right: it is “ML for ML Systems” ;).
In this special topics class we will explore the state of the art and research in ML systems, including: ML model compilers, ML training systems, ML serving systems, support for serving large language models, ML systems that span cloud and edge, resource management for ML, among others. The format is participatory, focused on paper reading, presenting, and discussion, plus a class project scoped and chosen by the participants.
This website will be updated throughout the quarter, so check back for the latest.
Course Overview
- Lectures: Monday and Wednesday 3:00pm-4:20pm (Location: CSE2 271)
- Luis’ Office Hours: By appointment.
- TA Office Hours: Friday 9:30am - 10:30am (Gates 374).
- Course canvas: Link
- Course materials: Google Drive Link
Assignments
- Read all papers (optional readings are not required) and submit one idea for extending or applying the core papers' contributions.
- Present and lead the discussion of 2 papers, in pairs.
- A research project on ML systems; possible ideas include:
  - Cost predictor for training and serving over a model's lifetime.
  - Optimizing a new workload (e.g., AI for science).
  - Resource provisioning for serving.
  - On-device training/inference.
  - Deploying models in a new backend (e.g., the browser).
Schedule
  
    
| Date | Topic & Readings | HW/Notes/Slides |
| --- | --- | --- |
| March 27 | No class (ASPLOS 2023) | |
| March 29 | Introduction | |
| April 3 | Model Compilation/Optimization: ML Compilers<br>Required Readings:<br>Presenter: Vishal Canumalla<br>Optional Readings: | |
| April 5 | Model Compilation/Optimization: Neural Architecture Search<br>Required Readings:<br>Presenter: Chloe Yang, Yifang Chen<br>Optional Readings: | |
| April 10 | Model Compilation/Optimization: LLM Quantization<br>Required Readings:<br>Guest: Tim Dettmers on 4-bit fine-tuning<br>Presenter: Sam Kaufman, Rosario Scalise<br>Optional Readings: | |
| April 12 | Model Compilation/Optimization: Transformers & Beyond<br>Required Readings:<br>Presenter: Huong Ngo, Nicholas Boren, Jaehong Min<br>Optional Readings: | |
| April 17 | Model Compilation/Optimization: Sparsification<br>Required Readings:<br>Presenter: Alan Fan, Rohith Leeladharan<br>Optional Readings: | |
| April 19 | Project Proposal Presentation | |
| April 24 | Training Optimization: Parallelism (1)<br>Required Readings:<br>Presenter: Bohan Liu, Mike Merrill, Aditya K Kamath<br>Optional Readings: | |
| April 26 | Training Optimization: On-Device Training<br>Required Readings:<br>Presenter: Jason Zhang, Anoop Mysore<br>Optional Readings: | |
| May 1 | Training Optimization: Memory Optimizations<br>Required Readings:<br>Presenter: Sam Kaufman, Bohan Liu<br>Optional Readings: | |
| May 3 | Training Optimization: Parallelism (2)<br>Required Readings:<br>Presenter: Alan Fan, Nicholas Boren<br>Optional Readings: | |
| May 8 | Model Inference & Serving: Model Serving<br>Required Readings:<br>Guest: Lequn Chen on Symphony, a new model serving system<br>Presenter: Tapan Chugh, Vaibhav Mehrotra<br>Optional Readings: | |
| May 10 | Model Inference & Serving: Large-Scale Inference/Serving<br>Required Readings:<br>Presenter: Khurshid Alam, Rashmika Reddy<br>Optional Readings: | |
| May 15 | Model Inference & Serving: LLM Inference/Serving<br>Required Readings:<br>Guest: Lequn Chen on batching effects in GPT models<br>Presenter: Daksh Sinha, Huong Ngo<br>Optional Readings: | |
| May 17 | AI Hardware: TPU<br>Required Readings:<br>Presenter: Tapan Chugh, Jaehong Min<br>Optional Readings: | |
| May 22 | AI Hardware: GPU & Reconfigurable Architectures<br>Required Readings:<br>Guest: Ying Sheng on FlexGen, an LLM inference system<br>Presenter: Fengqing Jiang, Yun-Chang Teng, Aditya K Kamath<br>Optional Readings: | |
| May 24 | Project Presentation | |
| May 29 | No class (Memorial Day) | |
| May 31 | Project Presentation | |