Reading List

Weeks 1-2: Quick Start + Cold-Start Fundamentals

Papers

Denoising Diffusion Probabilistic Models (DDPM) ✅ 2025-12-20
- Authors: Ho et al., NeurIPS 2020
- Link: https://arxiv.org/abs/2006.11239
- Focus: the mathematics of diffusion models, the forward diffusion and reverse denoising processes
High-Resolution Image Synthesis with Latent Diffusion Models ✅ 2025-12-20
- Authors: Rombach et al., CVPR 2022
- Link: https://arxiv.org/abs/2112.10752
- Focus: the Stable Diffusion architecture, the VAE + UNet + CLIP combination, operating in latent space
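
As a companion to the DDPM entry above, the forward process can be sampled in closed form: x_t is drawn from N(sqrt(abar_t) * x_0, (1 - abar_t) * I), where abar_t is the cumulative product of (1 - beta_s). The sketch below is a minimal stdlib illustration, assuming the linear schedule reported in the paper (beta from 1e-4 to 0.02 over T = 1000 steps). The function names are made up for this note and are not from any library.

```python
import math
import random

def linear_betas(T=1000, beta_start=1e-4, beta_end=0.02):
    """Linear noise schedule (defaults follow the values in the DDPM paper)."""
    step = (beta_end - beta_start) / (T - 1)
    return [beta_start + i * step for i in range(T)]

def alpha_bars(betas):
    """Cumulative products abar_t = prod_{s<=t} (1 - beta_s)."""
    out, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        out.append(prod)
    return out

def q_sample(x0, t, abar, rng):
    """Draw x_t ~ N(sqrt(abar_t) * x0, (1 - abar_t) * I) in a single step."""
    scale = math.sqrt(abar[t])
    sigma = math.sqrt(1.0 - abar[t])
    return [scale * x + sigma * rng.gauss(0.0, 1.0) for x in x0]

abar = alpha_bars(linear_betas())
rng = random.Random(0)
x0 = [1.0, -1.0, 0.5]           # toy "image" with three pixels
x_early = q_sample(x0, 0, abar, rng)    # almost identical to x0
x_late = q_sample(x0, 999, abar, rng)   # almost pure Gaussian noise
```

Because abar shrinks toward zero as t grows, the signal term fades and the noise term dominates, which is why the reverse (denoising) network has to be trained across all timesteps.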

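Blogs and Documentation

Hugging Face Diffusers library documentation
- Link: https://huggingface.co/docs/diffusers/
- Focus: quick start, using StableDiffusionPipeline, ways to load models
PyTorch official documentation on inference mode
- Link: https://pytorch.org/docs/stable/generated/torch.inference_mode.html
- Focus: the differences between inference_mode and no_grad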

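Courses

MIT 6.5940: TinyML and Efficient Deep Learning
- Link: https://hanlab.mit.edu/courses/2024-fall-65940
- Lectures 1-3: model compression basics, introduction to inference optimization
- Focus: how inference differs from training, basic concepts of model efficiency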

Weeks 3-4: Cold-Start Optimization + Inference Engine Deep Dive

Papers

Reducing Cold Start Latency for LLM Inference with NVIDIA Run:ai Model Streamer (Blog)
- Link: https://developer.nvidia.com/blog/reducing-cold-start-latency-for-llm-inference-with-nvidia-runai-model-streamer/
- Focus: streaming model loading, concurrent reads and GPU transfers, avoiding intermediate storage
25x Faster Cold Starts for LLMs on Kubernetes
- Link: https://www.bentoml.com/blog/25x-faster-cold-starts-for-llms-on-kubernetes
- Focus: true on-demand loading, optimization in containerized environments

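Blogs and Documentation

GPU Memory Snapshots: Supercharging Sub-second Startup
- Link: https://modal.com/blog/gpu-mem-snapshots
- Focus: snapshotting for a 10x startup speedup, the vLLM case study (45s → 5s)
TensorRT official documentation: Getting Started
- Link: https://docs.nvidia.com/deeplearning/tensorrt/
- Focus: the TensorRT workflow, model optimization, engine building
ONNX Runtime performance tuning
- Link: https://onnxruntime.ai/docs/performance/
- Focus: choosing an Execution Provider, graph optimization, memory optimization
The safetensors format in detail
- Link: https://huggingface.co/docs/safetensors/
- Focus: safe and efficient model serialization, memory-mapping support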
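
To make the safetensors entry above concrete: the file layout is an 8-byte little-endian header length, a JSON header describing each tensor (dtype, shape, byte offsets), and then the raw tensor bytes. The sketch below writes and reads such a file with the standard library only. It is a simplified illustration and does not use the safetensors package; the helper names are hypothetical.

```python
import json
import mmap
import os
import struct
import tempfile

def write_minimal_safetensors(path, name, values):
    """Write one 1-D float32 tensor using the safetensors byte layout."""
    data = struct.pack(f"<{len(values)}f", *values)
    header = {name: {"dtype": "F32", "shape": [len(values)],
                     "data_offsets": [0, len(data)]}}
    header_bytes = json.dumps(header).encode("utf-8")
    with open(path, "wb") as f:
        f.write(struct.pack("<Q", len(header_bytes)))  # 8-byte header length
        f.write(header_bytes)                          # JSON header
        f.write(data)                                  # raw tensor bytes

def read_tensor_mmap(path, name):
    """Memory-map the file and decode only the requested tensor's bytes."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            (header_len,) = struct.unpack("<Q", mm[:8])
            header = json.loads(mm[8:8 + header_len].decode("utf-8"))
            begin, end = header[name]["data_offsets"]  # relative to data section
            base = 8 + header_len
            raw = mm[base + begin:base + end]          # touches only these pages
            return list(struct.unpack(f"<{(end - begin) // 4}f", raw))

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.safetensors")
    write_minimal_safetensors(path, "weight", [1.0, 2.0, 3.5])
    loaded = read_tensor_mmap(path, "weight")
```

The point for cold starts is that the header alone tells a loader where every tensor lives, so individual tensors can be mapped or streamed without deserializing the whole file.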

Courses

CMU 15-418/Stanford CS149: Parallel Computing
- Link: https://cs149.stanford.edu/
- Lectures 1-4: parallel computing basics, GPU architecture, the CUDA programming model
- Focus: the GPU's compute model and memory hierarchy

Weeks 5-6: Hot-Swap Design + Containerization Fundamentals

Papers

Cut Model Deployment Costs While Keeping Performance With GPU Memory Swap
- Link: https://developer.nvidia.com/blog/cut-model-deployment-costs-while-keeping-performance-with-gpu-memory-swap/
- Focus: NVIDIA GPU Memory Swap, model hot-swapping, dynamic offload/onload
NEO: Saving GPU Memory Crisis with CPU Offloading for Online LLM Inference
- Link: https://arxiv.org/abs/2411.01142
- Focus: CPU offloading strategies, memory management, performance trade-offs
Clipper: A Low-Latency Online Prediction Serving System
- Authors: Crankshaw et al., NSDI 2017
- Link: https://www.usenix.org/conference/nsdi17/technical-sessions/presentation/crankshaw
- Focus: model containerization, batch scheduling, adaptive batch sizing
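
The adaptive batch sizing listed for Clipper above is an additive-increase, multiplicative-decrease (AIMD) search: grow the batch while latency stays under the SLO, and back off when it is exceeded. The sketch below is a toy version under stated assumptions. The latency model, the step size, and the 0.9 backoff factor are illustrative parameters chosen for this note, not values taken from Clipper's code.

```python
def aimd_batch_size(measure_latency, slo_ms, steps=50, start=1,
                    add_step=1, backoff=0.9):
    """Grow the batch additively under the SLO; shrink multiplicatively on a miss."""
    batch = start
    history = []
    for _ in range(steps):
        latency = measure_latency(batch)
        if latency <= slo_ms:
            batch += add_step                      # additive increase
        else:
            batch = max(1, int(batch * backoff))   # multiplicative decrease
        history.append(batch)                      # batch size chosen for next round
    return history

def toy_latency_ms(batch):
    """Hypothetical linear latency model: 5 ms fixed cost plus 1.5 ms per item."""
    return 5.0 + 1.5 * batch

history = aimd_batch_size(toy_latency_ms, slo_ms=20.0)
```

With this toy model the largest batch that meets the 20 ms SLO is 10, and the controller settles into a small oscillation around that value rather than needing the latency curve in advance.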

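Blogs and Documentation

Docker official documentation: container basics
- Link: https://docs.docker.com/get-started/
- Focus: namespaces, cgroups, image layering
NVIDIA Container Toolkit documentation
- Link: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/
- Focus: GPU containerization, device assignment, runtime configuration
PyTorch GPU memory management in detail
- Link: https://pytorch.org/docs/stable/notes/cuda.html#memory-management
- Focus: how the caching allocator works, torch.cuda.empty_cache(), memory fragmentation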

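Weeks 7-8: Hot-Swap Optimization + Scheduling System Fundamentals

Papers

Clipper: A Low-Latency Online Prediction Serving System (in-depth reading)
- Focus: batch scheduling details, request queue management, SLO guarantees
Nexus: A GPU Cluster Engine for Accelerating DNN-Based Video Analysis
- Authors: Shen et al., SOSP 2019
- Link: https://dl.acm.org/doi/10.1145/3341301.3359658
- Focus: GPU cluster scheduling, multi-model management, resource allocation strategies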

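Blogs and Documentation

vLLM V1 architecture in detail
- Link: https://blog.vllm.ai/2025/01/27/v1-alpha-release.html
- Focus: Scheduler design, the KV-Cache Manager, mixed prefill/decode scheduling
Inside vLLM: Anatomy of a High-Throughput LLM Inference System
- Link: https://blog.vllm.ai/2025/09/05/anatomy-of-vllm.html
- Focus: scheduler implementation, PagedAttention, batching strategies (written for LLMs, but the scheduling ideas generalize)
Kubernetes scheduler documentation
- Link: https://kubernetes.io/docs/concepts/scheduling-eviction/
- Focus: the scheduling flow, resource constraints, scheduling policies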

Courses

MIT 6.824: Distributed Systems
- Link: https://pdos.csail.mit.edu/6.824/
- Lectures 1-2: distributed systems basics, RPC, fault tolerance
- Focus: the basic concepts of distributed systems

Weeks 9-10: Virtualization and Co-located Deployment

Papers

Salus: Fine-Grained GPU Sharing Primitives for Deep Learning Applications
- Authors: Yu et al., MLSys 2020
- Link: https://proceedings.mlsys.org/paper/2020/hash/3def184ad8f4755ff269862ea77393dd
- Focus: GPU time-slice allocation, fast context switching, multi-tenant isolation
Gandiva: Introspective Cluster Scheduling for Deep Learning
- Authors: Xiao et al., OSDI 2018
- Link: https://www.usenix.org/conference/osdi18/presentation/xiao
- Focus: GPU cluster scheduling, time-slice sharing, preemptive scheduling
Tiresias: A GPU Cluster Manager for Distributed Deep Learning
- Authors: Gu et al., NSDI 2019
- Link: https://www.usenix.org/conference/nsdi19/presentation/gu
- Focus: cluster scheduling policies, job priorities, resource fragmentation

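Blogs and Documentation

NVIDIA MIG User Guide
- Link: https://docs.nvidia.com/datacenter/tesla/mig-user-guide/
- Focus: MIG partitioning, GPU instances, compute instances, isolation guarantees
vCUDA open-source project
- Link: https://github.com/tkestack/vcuda-controller
- Focus: GPU virtualization implementation, API interception, resource limits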

Courses

CMU 15-779: Advanced Topics in Machine Learning Systems
- Link: https://www.csd.cmu.edu/course/15779/f25
- Focus: material on GPU sharing, scheduling, and resource management

Weeks 11-12: Scheduling Policy Implementation + System Integration

Papers

Orca: A Distributed Serving System for Transformer-Based Generative Models
- Authors: Yu et al., OSDI 2022
- Link: https://www.usenix.org/conference/osdi22/presentation/yu
- Focus: iteration-level scheduling, batching optimization, selective batching
AlpaServe: Statistical Multiplexing with Model Parallelism for Deep Learning Serving
- Authors: Li et al., OSDI 2023
- Link: https://www.usenix.org/conference/osdi23/presentation/li-zhouhan
- Focus: serving systems with model parallelism, statistical multiplexing, latency optimization
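
The iteration-level scheduling listed for Orca above can be illustrated with a toy count of model iterations. In the sketch below, each request needs a fixed number of iterations, a static (request-level) batch holds its slots until its longest member finishes, and the iteration-level scheduler refills free slots before every iteration. This is a deliberately simplified model written for this note: it ignores selective batching, prefill versus decode cost, and memory limits.

```python
from collections import deque

def request_level_iterations(lengths, max_batch):
    """Static batching: a batch keeps its slots until its longest request ends."""
    total = 0
    for i in range(0, len(lengths), max_batch):
        total += max(lengths[i:i + max_batch])
    return total

def iteration_level_iterations(lengths, max_batch):
    """Iteration-level scheduling: refill free slots before every iteration."""
    waiting = deque(lengths)
    running = []          # remaining iterations for each in-flight request
    total = 0
    while waiting or running:
        while waiting and len(running) < max_batch:
            running.append(waiting.popleft())
        # Run one iteration for the whole batch; finished requests leave at once.
        running = [r - 1 for r in running if r - 1 > 0]
        total += 1
    return total

lengths = [2, 8, 3, 1, 7, 2]   # iterations needed per request (each >= 1)
static_total = request_level_iterations(lengths, max_batch=2)
continuous_total = iteration_level_iterations(lengths, max_batch=2)
```

For this workload the static scheme needs 18 iterations and the iteration-level scheme needs 13, because short requests no longer wait for the longest request in their batch.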

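Blogs and Documentation

SGLang Diffusion: Accelerating Video and Image Generation
- Link: https://lmsys.org/blog/2025-11-07-sglang-diffusion/
- Core must-read: directly targets multimodal generation models!
- Focus:
  - Supports video/image generation models such as Wan, Hunyuan, Qwen-Image, and Flux
  - 1.2x-5.9x speedup
  - Unified Sequence Parallelism (USP), CFG-parallelism
  - OpenAI-compatible API, CLI, Python interface
  - batching, accurate memory prediction, quantization, LoRA support
SGLang GitHub Roadmap
- Link: https://github.com/sgl-project/sglang/issues/12799
- Focus: the 2025 Q4 Diffusion roadmap, following the latest development progress
Prometheus monitoring best practices
- Link: https://prometheus.io/docs/practices/naming/
- Focus: metric naming, label design, query optimization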

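Weeks 13-14: Distributed Inference Fundamentals

Papers

Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism
- Authors: Shoeybi et al., 2019
- Link: https://arxiv.org/abs/1909.08053
- Focus: how Tensor Parallelism works, communication patterns, applicable scenarios
Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning
- Authors: Zheng et al., OSDI 2022
- Link: https://www.usenix.org/conference/osdi22/presentation/zheng-lianmin
- Focus: automatic search over parallelization strategies, inter-operator and intra-operator parallelism
Stable Video Diffusion (SVD)
- Authors: Blattmann et al., 2023
- Link: https://arxiv.org/abs/2311.15127
- Focus: video generation model architecture, spatio-temporal attention, compute characteristics
Align Your Latents (Gen-2-style techniques)
- Link: search for papers on video generation
- Focus: text-to-video generation, latent space alignment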

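Blogs and Documentation

NCCL official documentation
- Link: https://docs.nvidia.com/deeplearning/nccl/
- Focus: collective communication primitives, topology awareness, cross-node communication
DeepSpeed Inference documentation
- Link: https://www.deepspeed.ai/inference/
- Focus: inference optimization techniques, Tensor Parallelism support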

Courses

CMU 15-418: Parallel Computing
- Lectures 5-8: distributed-memory programming, MPI, communication optimization
- Focus: the cost of distributed communication and how to reduce it

Weeks 15-16: Fine-Grained Scheduling in Depth

Papers

Cocktail: A Multidimensional Optimization for Model Serving in Cloud
- Authors: Gunasekaran et al., NSDI 2022
- Link: https://www.usenix.org/conference/nsdi22/presentation/gunasekaran
- Focus: multidimensional optimization, resource allocation, latency-throughput trade-offs
Shepherd: Serving DNNs in the Wild
- Authors: Zhang et al., NSDI 2023
- Link: https://www.usenix.org/conference/nsdi23/presentation/zhang-hong
- Focus: dynamic batch sizing, adaptive scheduling, awareness of request characteristics
Bark: Text-to-Audio Generation
- Link: https://github.com/suno-ai/bark
- Focus: audio generation model architecture, inference flow, compute characteristics
AudioLDM: Text-to-Audio Generation with Latent Diffusion Models
- Authors: Liu et al., 2023
- Link: https://arxiv.org/abs/2301.12503
- Focus: audio latent diffusion models, feature extraction, generation flow

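Blogs and Documentation

Queueing theory primer
- Recommended: Queueing Theory and Network Models (relevant chapters)
- Focus: M/M/1 and M/M/c queues, applying Little's Law to system design
SGLang scheduler source code walkthrough
- Link: https://github.com/sgl-project/sglang/tree/main/python/sglang
- Focus: a real scheduler implementation, batching logic, request management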
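
For the queueing theory entry above, the standard steady-state formulas are short enough to check by hand. The sketch below computes M/M/1 metrics, the Erlang C waiting probability for M/M/c, and confirms Little's Law (L = lambda * W). The arrival and service rates are made-up example numbers, and the function names are hypothetical.

```python
import math

def mm1_metrics(lam, mu):
    """Steady-state M/M/1 metrics for arrival rate lam and service rate mu."""
    if lam >= mu:
        raise ValueError("unstable queue: need lam < mu")
    rho = lam / mu                  # server utilization
    mean_time = 1.0 / (mu - lam)    # W: mean time in system
    mean_jobs = rho / (1.0 - rho)   # L: mean number in system
    return rho, mean_time, mean_jobs

def erlang_c(lam, mu, servers):
    """Probability that an arrival has to wait in an M/M/c queue."""
    offered = lam / mu              # offered load a, in Erlangs
    rho = offered / servers
    if rho >= 1.0:
        raise ValueError("unstable queue: need lam < servers * mu")
    tail = offered ** servers / math.factorial(servers) / (1.0 - rho)
    head = sum(offered ** k / math.factorial(k) for k in range(servers))
    return tail / (head + tail)

def mmc_mean_time(lam, mu, servers):
    """Mean time in system for M/M/c: queueing delay plus one service time."""
    wait = erlang_c(lam, mu, servers) / (servers * mu - lam)
    return wait + 1.0 / mu

# One fast server (10 req/s) versus two half-speed servers (5 req/s each).
rho, w_single, l_single = mm1_metrics(lam=8.0, mu=10.0)
w_two_slow = mmc_mean_time(lam=8.0, mu=5.0, servers=2)
```

At 8 requests per second, the single fast server gives a mean time in system of 0.5 s, while two half-speed servers with the same total capacity give about 0.56 s. That kind of comparison is useful when deciding between one large GPU instance and several smaller partitions.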

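Weeks 17-18: System Optimization and Stability

Papers

Selected papers from MLSys 2024/2025/2026
- Link: https://mlsys.org/Conferences/2024/Schedule
- Focus: pick papers on inference systems, scheduling, and optimization
Selected papers from OSDI 2024/2025/2026
- Link: https://www.usenix.org/conference/osdi24
- Focus: system optimization, reliability, performance analysis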

Books

Site Reliability Engineering (SRE Book)
- Link: https://sre.google/sre-book/table-of-contents/
- Chapters:
  - Chapter 7: The Evolution of Automation at Google
  - Chapter 22: Addressing Cascading Failures
  - Chapter 26: Data Integrity
- Focus: designing for reliability, failure handling, degradation strategies

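Blogs and Documentation

PyTorch Profiler in depth
- Link: https://pytorch.org/tutorials/recipes/recipes/profiler_recipe.html
- Focus: performance profiling, flame graphs, locating bottlenecks
NVIDIA Nsight Systems tutorial
- Link: https://docs.nvidia.com/nsight-systems/
- Focus: GPU performance analysis, trace analysis, optimization advice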

Courses

CMU 15-779: Advanced Topics in Machine Learning Systems
- Focus: latest research progress, frontier techniques

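Weeks 19-20: End-to-End Optimization and Load Testing

Papers

Production system case studies
- Search keywords: "production ML serving", "inference system in production"
- Industry papers from OSDI/SOSP/NSDI
- Focus: challenges in real systems, solutions, lessons learned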

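Blogs and Documentation

Large-scale inference system case studies
- Engineering blogs from OpenAI, Anthropic, Stability AI, and similar companies
- Focus: real production challenges and their solutions
Load testing tools
- Documentation for tools such as Locust, wrk, and vegeta
- Focus: load test design, metric collection, bottleneck analysis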
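
For the metric collection step of a load test, tail percentiles usually matter more than the mean. The sketch below shows a nearest-rank percentile over a list of latency samples. The sample values are invented for illustration, and real tools such as those listed above report these statistics in their own ways.

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: smallest sample with at least p% at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

# Hypothetical request latencies in milliseconds, with two slow outliers.
latencies_ms = [12, 15, 11, 240, 14, 13, 16, 18, 17, 900]

p50 = percentile(latencies_ms, 50)
p90 = percentile(latencies_ms, 90)
p99 = percentile(latencies_ms, 99)
mean = sum(latencies_ms) / len(latencies_ms)
```

Here the median is 15 ms while the mean is 125.6 ms and p99 is 900 ms, so a single average would hide both the typical case and the tail that an SLO is normally written against.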

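Weeks 21-22: Cloud-Native Integration and Autoscaling

Papers

Papers on Kubernetes scheduler optimization
- Search keywords: "Kubernetes GPU scheduling", "cluster autoscaling"
- Focus: GPU scheduling, autoscaling, resource fragmentation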

Blogs and Documentation

Kubernetes autoscaling best practices
- Link: https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/
- Focus: HPA configuration, custom metrics, scaling policies
NVIDIA GPU Operator for Kubernetes
- Link: https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/
- Focus: GPU resource management, automatic configuration, monitoring
KEDA (Kubernetes Event-driven Autoscaling)
- Link: https://keda.sh/
- Focus: event-driven scaling, custom scalers
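
The core of the HPA behavior covered in the Kubernetes autoscaling documentation above is one formula: desiredReplicas = ceil(currentReplicas * currentMetricValue / desiredMetricValue), skipped when the ratio is within a tolerance of 1.0 (0.1 by default). The sketch below reproduces only that calculation. It leaves out missing metrics, unready pods, and stabilization windows, and the queue-depth metric in the examples is a hypothetical custom metric.

```python
import math

def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=10, tolerance=0.1):
    """Simplified HPA replica calculation for a single metric."""
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas               # close enough, do not scale
    desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))

# Hypothetical metric: average pending requests per pod, target 60.
scale_up = desired_replicas(4, current_metric=90, target_metric=60)
hold = desired_replicas(4, current_metric=63, target_metric=60)
scale_down = desired_replicas(4, current_metric=20, target_metric=60)
capped = desired_replicas(4, current_metric=300, target_metric=60)
```

With four replicas, a metric of 90 against a target of 60 scales to 6, a metric of 63 stays at 4 because it is inside the tolerance band, a metric of 20 scales down to 2, and a metric of 300 is capped at the configured maximum of 10.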

Courses

CNCF cloud-native courses
- Link: https://www.cncf.io/certification/training/
- Focus: cloud-native architecture, Kubernetes best practices

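Frontier Technologies to Track (continuously updated)

SGLang Ecosystem

SGLang Diffusion blog series
- Home page: https://lmsys.org/blog/
- Check regularly for updates and watch for new features and optimization techniques
SGLang GitHub Issues and Roadmap
- Link: https://github.com/sgl-project/sglang/issues
- Follow Diffusion-related issues and feature requests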

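vLLM Architecture (borrowing its scheduling ideas)

vLLM V1 in-depth analysis
- Link: https://blog.vllm.ai/
- Focus: scheduler design, memory management, batching strategies (the general ideas carry over)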

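Industry Practice

RunAI engineering blog
- Link: https://www.run.ai/blog/
- Focus: GPU resource management, model deployment, optimization techniques
Modal engineering blog
- Link: https://modal.com/blog/
- Focus: serverless inference, cold-start optimization
BentoML engineering blog
- Link: https://www.bentoml.com/blog/
- Focus: model serving, deployment optimization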

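Key Conferences and Resources

Conference Papers

MLSys Conference
- Link: https://mlsys.org/
- When: May-June each year
- Focus: frontier research in machine learning systems
OSDI/SOSP
- Link: https://www.usenix.org/conferences
- Focus: top systems venues that often include ML systems papers
NSDI
- Link: https://www.usenix.org/conferences
- Focus: networked and distributed systems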

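Open-Source Projects

SGLang
- Link: https://github.com/sgl-project/sglang
- Core project: directly relevant, and the code needs in-depth study
vLLM
- Link: https://github.com/vllm-project/vllm
- Focus: scheduler implementation, PagedAttention (ideas worth borrowing)
TensorRT
- Link: https://github.com/NVIDIA/TensorRT
- Focus: inference engine optimization
Diffusers (Hugging Face)
- Link: https://github.com/huggingface/diffusers
- Focus: Diffusion model implementations, pipeline design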

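Reading Priorities

Must-read (P0)

All SGLang Diffusion material (directly work-related)
Cold-start optimization (NVIDIA Model Streamer, BentoML, Modal, etc.)
GPU Memory Swap and hot-swapping (NVIDIA, RunAI, etc.)
vLLM V1 architecture (scheduling ideas)
Diffusion model foundations (DDPM, Stable Diffusion)

Important (P1)

Scheduling system papers (Nexus, Clipper, Orca, etc.)
GPU virtualization papers (Salus, Gandiva, etc.)
Distributed inference papers (Megatron, Alpa, etc.)
Video and audio generation model papers

Optional (P2)

Operating systems and distributed systems foundation courses (skip if you already have the background)
MLOps and production system best practices
Cloud-native technologies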

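Suggested Reading Methods

Reading papers:
- Read the Abstract and Conclusion first to get the core idea
- Concentrate on system design and optimization techniques
- Skip heavy mathematical derivations (unless the paper is about model fundamentals)
- Pay close attention to the performance metrics and optimization results in the evaluation section
Blogs and documentation:
- Experiment as you read and verify things hands-on
- Record key configurations and parameters
- Summarize best practices
Reading code:
- Concentrate on architecture and interface definitions
- Understand the key data structures and algorithms
- There is no need to read every implementation detail line by line
Courses:
- Concentrate on lecture slides and videos
- Complete assignments selectively (when time is limited)
- Understanding the core concepts is enough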

Document status: first edition

LazyBearLee's Blog
