A curated list of awesome projects and papers for distributed training and inference, especially for large models.
- Megatron-LM: Ongoing Research Training Transformer Models at Scale
- DeepSpeed: A Deep Learning Optimization Library that Makes Distributed Training and Inference Easy, Efficient, and Effective.
- ColossalAI: A Unified Deep Learning System for Large-Scale Parallel Training
- OneFlow: A Performance-Centered and Open-Source Deep Learning Framework
- Mesh TensorFlow: Model Parallelism Made Easier
- FlexFlow: A Distributed Deep Learning Framework that Supports Flexible Parallelization Strategies.
- Alpa: Auto Parallelization for Large-Scale Neural Networks
- Easy Parallel Library: A General and Efficient Deep Learning Framework for Distributed Model Training
- FairScale: PyTorch Extensions for High Performance and Large Scale Training
- TePDist: an HLO-level automatic distributed system for DL models
- EasyDist: Automated Parallelization System and Infrastructure
- exo: Run your own AI cluster at home with everyday devices 📱💻 🖥️⌚
- Nerlnet: A framework for research and deployment of distributed machine learning algorithms on IoT devices
- veScale: A PyTorch Native LLM Training Framework
- Demystifying Parallel and Distributed Deep Learning: An In-Depth Concurrency Analysis by Tal Ben-Nun et al., ACM Computing Surveys 2020
- A Survey on Auto-Parallelism of Neural Networks Training by Peng Liang et al., TechRxiv 2022
- GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism by Yanping Huang et al., NeurIPS 2019 (see the toy micro-batch schedule sketch after this list)
- PipeDream: generalized pipeline parallelism for DNN training by Deepak Narayanan et al., SOSP 2019
- Memory-Efficient Pipeline-Parallel DNN Training by Deepak Narayanan et al., ICML 2021
- DAPPLE: a pipelined data parallel approach for training large models by Shiqing Fan et al., PPoPP 2021
- Chimera: efficiently training large-scale neural networks with bidirectional pipelines by Shigang Li et al., SC 2021
- Elastic Averaging for Efficient Pipelined DNN Training by Zihao Chen et al., PPoPP 2023
- Mobius: Fine Tuning Large-Scale Models on Commodity GPU Servers by Yangyang Feng et al., ASPLOS 2023
- Hanayo: Harnessing Wave-like Pipeline Parallelism for Enhanced Large Model Training Efficiency by Ziming Liu et al., SC 2023
- Zero Bubble Pipeline Parallelism by Penghui Qi et al., ICLR 2024
- Sequence Parallelism: Long Sequence Training from System Perspective by Shenggui Li et al., ACL 2023
- DeepSpeed Ulysses: System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models by Sam Ade Jacobs et al., arXiv 2023
- Ring Attention with Blockwise Transformers for Near-Infinite Context by Hao Liu et al., NeurIPS 2023 Workshop
- USP: A Unified Sequence Parallelism Approach for Long Context Generative AI by Jiarui Fang et al., arXiv 2024
- LoongTrain: Efficient Training of Long-Sequence LLMs with Head-Context Parallelism by Diandian Gu et al., arXiv 2024
- GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding by Dmitry Lepikhin et al., ICLR 2021
- FasterMoE: modeling and optimizing training of large-scale dynamic pre-trained models by Jiaao He et al., PPoPP 2022
- BaGuaLu: targeting brain scale pretrained models with over 37 million cores by Zixuan Ma et al., PPoPP 2022
- DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale by Samyam Rajbhandari et al., ICML 2022
- Tutel: Adaptive Mixture-of-Experts at Scale by Changho Hwang et al., MLSys 2023
- Accelerating Distributed MoE Training and Inference with Lina by Jiamin Li et al., ATC 2023
- SmartMoE: Efficiently Training Sparsely-Activated Models through Combining Static and Dynamic Parallelization by Mingshu Zhai et al., ATC 2023
- MegaBlocks: Efficient Sparse Training with Mixture-of-Experts by Trevor Gale et al., MLSys 2023
- PiPAD: Pipelined and Parallel Dynamic GNN Training on GPUs by Chunyang Wang et al., PPoPP 2023
- DSP: Efficient GNN Training with Multiple GPUs by Zhenkun Cai et al., PPoPP 2023
- MGG: Accelerating Graph Neural Networks with Fine-grained Intra-kernel Communication-Computation Pipelining on Multi-GPU Platforms by Yuke Wang et al., OSDI 2023
- Efficient large-scale language model training on GPU clusters using Megatron-LM by Deepak Narayanan et al., SC 2021
- GEMS: GPU-Enabled Memory-Aware Model-Parallelism System for Distributed DNN Training by Arpan Jain et al., SC 2020
- Amazon SageMaker Model Parallelism: A General and Flexible Framework for Large Model Training by Can Karakus et al., arXiv 2021
- OneFlow: Redesign the Distributed Deep Learning Framework from Scratch by Jinhui Yuan et al., arXiv 2021
- Colossal-AI: A Unified Deep Learning System For Large-Scale Parallel Training by Zhengda Bian et al., arXiv 2021
- Training deep nets with sublinear memory cost by Tianqi Chen et al., arxiv 2016
- ZeRO: memory optimizations toward training trillion parameter models by Samyam Rajbhandari et al., SC 2020 (see the toy optimizer-state sharding sketch after this list)
- Checkmate: Breaking the Memory Wall with Optimal Tensor Rematerialization by Paras Jain et al., MLSys 2020
- Dynamic Tensor Rematerialization by Marisa Kirisame et al., ICLR 2021
- ActNN: Reducing Training Memory Footprint via 2-Bit Activation Compressed Training by Jianfei Chen et al., ICML 2021
- GACT: Activation Compressed Training for Generic Network Architectures by Xiaoxuan Liu et al., ICML 2022
- Superneurons: dynamic GPU memory management for training deep neural networks by Linnan Wang et al., PPoPP 2018
- Capuchin: Tensor-based GPU Memory Management for Deep Learning by Xuan Peng et al., ASPLOS 2020
- SwapAdvisor: Pushing Deep Learning Beyond the GPU Memory Limit via Smart Swapping by Chien-Chin Huang et al., ASPLOS 2020
- ZeRO-Offload: Democratizing Billion-Scale Model Training by Jie Ren et al., ATC 2021
- ZeRO-infinity: breaking the GPU memory wall for extreme scale deep learning by Samyam Rajbhandari et al., SC 2021
- PatrickStar: Parallel Training of Pre-trained Models via Chunk-based Memory Management by Jiarui Fang et al., TPDS 2023
- MegTaiChi: dynamic tensor-based memory management optimization for DNN training by Zhongzhe Hu et al., ICS 2022
- Tensor Movement Orchestration In Multi-GPU Training Systems by Shao-Fu Lin et al., HPCA 2023
- Mesh-tensorflow: Deep learning for supercomputers by Noam Shazeer et al., NeurIPS 2018
- Exploring Hidden Dimensions in Parallelizing Convolutional Neural Networks by Zhihao Jia et al., ICML 2018
- Beyond Data and Model Parallelism for Deep Neural Networks by Zhihao Jia et al., MLSys 2019
- Supporting Very Large Models using Automatic Dataflow Graph Partitioning by Minjie Wang et al., EuroSys 2019
- GSPMD: General and Scalable Parallelization for ML Computation Graphs by Yuanzhong Xu et al., arXiv 2021
- Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning by Lianmin Zheng et al., OSDI 2022
- Unity: Accelerating DNN Training Through Joint Optimization of Algebraic Transformations and Parallelization by Colin Unger, Zhihao Jia, et al., OSDI 2022
- Galvatron: Efficient Transformer Training over Multiple GPUs Using Automatic Parallelism by Xupeng Miao, et al., VLDB 2023
- Auto-Parallelizing Large Models with Rhino: A Systematic Approach on Production AI Platform by Shiwei Zhang, Lansong Diao, et al., arXiv 2023
- nnScaler: Constraint-Guided Parallelization Plan Generation for Deep Learning Training by Zhiqi Lin, et al., OSDI 2024
- Blink: Fast and Generic Collectives for Distributed ML by Guanhua Wang et al., MLSys 2020 (see the toy ring all-reduce sketch after this list)
- Synthesizing optimal collective algorithms by Zixian Cai et al., PPoPP 2021
- Breaking the computation and communication abstraction barrier in distributed machine learning workloads by Abhinav Jangda et al., ASPLOS 2022
- MSCCLang: Microsoft Collective Communication Language by Meghan Cowan et al., ASPLOS 2023
- Overlap Communication with Dependent Computation via Decomposition in Large Deep Learning Models by Shibo Wang et al., ASPLOS 2023
- Logical/Physical Topology-Aware Collective Communication in Deep Learning Training by Sanghoon Jo et al., HPCA 2023
- Centauri: Enabling Efficient Scheduling for Communication-Computation Overlap in Large Model Training via Communication Partitioning by Chang Chen et al., ASPLOS 2024
- Oobleck: Resilient Distributed Training of Large Models Using Pipeline Templates by Insu Jang et al., SOSP 2023
- Bamboo: Making Preemptible Instances Resilient for Affordable Training of Large DNNs by John Thorpe et al., NSDI 2023
- Varuna: scalable, low-cost training of massive deep learning models by Sanjith Athlur et al., EuroSys 2022
- DeepSpeed Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale by Reza Yazdani Aminabadi et al., SC 2022
- EnergonAI: An Inference System for 10-100 Billion Parameter Transformer Models by Jiangsu Du et al., arXiv 2022
- Efficiently Scaling Transformer Inference by Reiner Pope et al., MLSys 2023
- AlpaServe: Statistical Multiplexing with Model Parallelism for Deep Learning Serving by Zhuohan Li et al., OSDI 2023
- Fast inference from transformers via speculative decoding by Yaniv Leviathan et al., ICML 2023
- FlexGen: High-throughput Generative Inference of Large Language Models with a Single GPU by Ying Sheng et al., ICML 2023
- Liger: Interleaving Intra- and Inter-Operator Parallelism for Distributed Large Model Inference by Jiangsu Du et al., PPoPP 2024
- Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving by Ruoyu Qin et al., arXiv 2024
- Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve by Amey Agrawal et al., OSDI 2024
- NASPipe: High Performance and Reproducible Pipeline Parallel Supernet Training via Causal Synchronous Parallelism by Shixiong Zhao et al., ASPLOS 2022
- MSRL: Distributed Reinforcement Learning with Dataflow Fragments by Huanzhou Zhu et al., ATC 2023
- Hydro: Surrogate-Based Hyperparameter Tuning Service in the Datacenter by Qinghao Hu et al., OSDI 2023
- FastFold: Optimizing AlphaFold Training and Inference on GPU Clusters by Shenggan Cheng et al., PPoPP 2024
- HybridFlow: A Flexible and Efficient RLHF Framework by Guangming Sheng et al., EuroSys 2025
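The pipeline-parallelism entries above (GPipe, PipeDream, DAPPLE, Chimera, Hanayo, Zero Bubble) differ mainly in how micro-batches are scheduled across stages. The sketch below is a minimal, framework-free simulation of a GPipe-style forward schedule; the stage and micro-batch counts are arbitrary assumptions, and none of this code comes from the projects listed here.

```python
# Toy GPipe-style forward schedule: split a batch into micro-batches and
# print which (stage, micro-batch) pairs are active at each clock tick.
# Illustration only; real systems also schedule backward passes (1F1B,
# zero-bubble variants) and overlap communication with compute.

def gpipe_forward_schedule(num_stages: int, num_microbatches: int):
    """Yield, per clock tick, the active (stage, micro-batch) forward steps."""
    total_ticks = num_stages + num_microbatches - 1
    for tick in range(total_ticks):
        active = [(stage, tick - stage)
                  for stage in range(num_stages)
                  if 0 <= tick - stage < num_microbatches]
        yield tick, active

if __name__ == "__main__":
    stages, microbatches = 4, 8
    for tick, active in gpipe_forward_schedule(stages, microbatches):
        print(f"tick {tick:2d}: " + ", ".join(f"S{s}<-mb{m}" for s, m in active))
    # The idle "bubble" fraction of this toy schedule shrinks as micro-batches grow.
    print("bubble fraction:", (stages - 1) / (stages - 1 + microbatches))
```

More micro-batches shrink the pipeline bubble, which is the trade-off the later schedules (1F1B, Chimera's bidirectional pipelines, zero-bubble scheduling) refine.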
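Several of the memory-optimization entries (ZeRO, ZeRO-Offload, ZeRO-Infinity, PatrickStar) partition optimizer and parameter state across data-parallel ranks instead of replicating it. The snippet below is a single-process toy of the ZeRO-1 idea in plain Python: each simulated rank updates only its parameter shard, and concatenation stands in for the all-gather. It is not DeepSpeed code; the plain-SGD update rule and helper names are illustrative assumptions.

```python
# Toy ZeRO-style optimizer-state sharding: each data-parallel rank owns a
# 1/N shard of the flat parameter vector, updates only that shard, and an
# "all-gather" (here, concatenation) rebuilds the full parameters.
from typing import List

def shard(length: int, rank: int, world_size: int) -> slice:
    """Contiguous slice of a flat vector of `length` owned by `rank`."""
    per_rank = (length + world_size - 1) // world_size
    return slice(rank * per_rank, min((rank + 1) * per_rank, length))

def zero_sgd_step(params: List[float], grads: List[float],
                  world_size: int, lr: float = 0.1) -> List[float]:
    """Each simulated rank applies SGD to its shard only."""
    pieces = []
    for rank in range(world_size):
        s = shard(len(params), rank, world_size)
        # Only this rank's slice of parameters/optimizer state is touched.
        pieces.append([p - lr * g for p, g in zip(params[s], grads[s])])
    # "All-gather": every rank ends up with the full updated parameters.
    return [p for piece in pieces for p in piece]

if __name__ == "__main__":
    print(zero_sgd_step(params=[1.0] * 8, grads=[0.5] * 8, world_size=4))
```

The payoff is that per-rank optimizer-state memory drops roughly by the data-parallel degree, at the cost of the extra gather communication.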
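Many of the communication-optimization entries (Blink, MSCCLang, the synthesized-collectives work) build on or compete with the classic ring all-reduce. Below is a single-process toy of that algorithm over plain Python lists, reduce-scatter followed by all-gather; it sketches the idea only and is not NCCL or MSCCL code.

```python
# Toy ring all-reduce: N ranks arranged in a ring sum their vectors in
# 2*(N-1) steps, each step moving one vector chunk to the next rank.
import copy
from typing import List

def ring_allreduce(inputs: List[List[float]]) -> List[List[float]]:
    n, size = len(inputs), len(inputs[0])
    assert size % n == 0, "toy version assumes the vector splits evenly"
    c = size // n                              # chunk length
    bufs = [list(v) for v in inputs]           # per-rank working buffers

    def sl(idx: int) -> slice:                 # slice covering chunk `idx`
        return slice(idx * c, (idx + 1) * c)

    # Phase 1, reduce-scatter: after n-1 steps rank r holds the fully
    # summed chunk (r + 1) % n.
    for step in range(n - 1):
        prev = copy.deepcopy(bufs)             # all ranks send pre-step values
        for rank in range(n):
            dst, idx = (rank + 1) % n, (rank - step) % n
            s = sl(idx)
            bufs[dst][s] = [a + b for a, b in zip(bufs[dst][s], prev[rank][s])]
    # Phase 2, all-gather: circulate each completed chunk around the ring.
    for step in range(n - 1):
        prev = copy.deepcopy(bufs)
        for rank in range(n):
            dst, idx = (rank + 1) % n, (rank + 1 - step) % n
            bufs[dst][sl(idx)] = prev[rank][sl(idx)]
    return bufs

if __name__ == "__main__":
    ranks = [[float(r)] * 8 for r in range(4)]  # 4 ranks, 8 values each
    print(ring_allreduce(ranks)[0])             # every rank ends with [6.0] * 8
```

Ring all-reduce is the bandwidth-friendly baseline that the topology-aware and synthesized collectives above try to beat on real hierarchical interconnects.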
All contributions to this repository are welcome. Open an issue or send a pull request.