A curated list of awesome projects and papers for distributed training and inference, especially for large models.
- Megatron-LM: Ongoing Research Training Transformer Models at Scale
- DeepSpeed: A Deep Learning Optimization Library that Makes Distributed Training and Inference Easy, Efficient, and Effective.
- ColossalAI: A Unified Deep Learning System for Large-Scale Parallel Training
- OneFlow: A Performance-Centered and Open-Source Deep Learning Framework
- Mesh TensorFlow: Model Parallelism Made Easier
- FlexFlow: A Distributed Deep Learning Framework that Supports Flexible Parallelization Strategies.
- Alpa: Auto Parallelization for Large-Scale Neural Networks
- Easy Parallel Library: A General and Efficient Deep Learning Framework for Distributed Model Training
- FairScale: PyTorch Extensions for High Performance and Large Scale Training
- TePDist: an HLO-level automatic distributed system for DL models
- EasyDist: Automated Parallelization System and Infrastructure
- exo: Run your own AI cluster at home with everyday devices 📱💻 🖥️⌚
- Nerlnet: A framework for research and deployment of distributed machine learning algorithms on IoT devices
- veScale: A PyTorch Native LLM Training Framework
- Demystifying Parallel and Distributed Deep Learning: An In-Depth Concurrency Analysis by Tal Ben-Nun et al., ACM Computing Surveys 2020
- A Survey on Auto-Parallelism of Neural Networks Training by Peng Liang et al., TechRxiv 2022
- GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism by Yanping Huang et al., NeurIPS 2019 (see the toy micro-batch schedule sketch after this list)
- PipeDream: generalized pipeline parallelism for DNN training by Deepak Narayanan et al., SOSP 2019
- Memory-Efficient Pipeline-Parallel DNN Training by Deepak Narayanan et al., ICML 2021
- DAPPLE: a pipelined data parallel approach for training large models by Shiqing Fan et al., PPoPP 2021
- Chimera: efficiently training large-scale neural networks with bidirectional pipelines by Shigang Li et al., SC 2021
- Elastic Averaging for Efficient Pipelined DNN Training by Zihao Chen et al., PPoPP 2023
- Mobius: Fine Tuning Large-Scale Models on Commodity GPU Servers by Yangyang Feng et al., ASPLOS 2023
- Hanayo: Harnessing Wave-like Pipeline Parallelism for Enhanced Large Model Training Efficiency by Ziming Liu et al., SC 2023
- Zero Bubble Pipeline Parallelism by Penghui Qi et al., ICLR 2024
- Sequence Parallelism: Long Sequence Training from System Perspective by Shenggui Li et al., ACL 2023
- DeepSpeed Ulysses: System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models by Sam Ade Jacobs et al., arXiv 2023
- Ring Attention with Blockwise Transformers for Near-Infinite Context by Hao Liu et al., NeurIPS 2023 Workshop
- USP: A Unified Sequence Parallelism Approach for Long Context Generative AI by Jiarui Fang et al., arXiv 2024
- LoongTrain: Efficient Training of Long-Sequence LLMs with Head-Context Parallelism by Diandian Gu et al., arXiv 2024
- GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding by Dmitry Lepikhin et al., ICLR 2021
- FasterMoE: modeling and optimizing training of large-scale dynamic pre-trained models by Jiaao He et al., PPoPP 2022
- BaGuaLu: targeting brain scale pretrained models with over 37 million cores by Zixuan Ma et al., PPoPP 2022
- DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale by Samyam Rajbhandari et al., ICML 2022
- Tutel: Adaptive Mixture-of-Experts at Scale by Changho Hwang et al., MLSys 2023
- Accelerating Distributed MoE Training and Inference with Lina by Jiamin Li et al., ATC 2023
- SmartMoE: Efficiently Training Sparsely-Activated Models through Combining Static and Dynamic Parallelization by Mingshu Zhai et al., ATC 2023
- MegaBlocks: Efficient Sparse Training with Mixture-of-Experts by Trevor Gale et al., MLSys 2023
- PiPAD: Pipelined and Parallel Dynamic GNN Training on GPUs by Chunyang Wang et al., PPoPP 2023
- DSP: Efficient GNN Training with Multiple GPUs by Zhenkun Cai et al., PPoPP 2023
- MGG: Accelerating Graph Neural Networks with Fine-grained Intra-kernel Communication-Computation Pipelining on Multi-GPU Platforms by Yuke Wang et al., OSDI 2023
- Efficient large-scale language model training on GPU clusters using Megatron-LM by Deepak Narayanan et al., SC 2021
- GEMS: GPU-Enabled Memory-Aware Model-Parallelism System for Distributed DNN Training by Arpan Jain et al., SC 2020
- Amazon SageMaker Model Parallelism: A General and Flexible Framework for Large Model Training by Can Karakus et al., arXiv 2021
- OneFlow: Redesign the Distributed Deep Learning Framework from Scratch by Jinhui Yuan et al., arXiv 2021
- Colossal-AI: A Unified Deep Learning System For Large-Scale Parallel Training by Zhengda Bian et al., arXiv 2021
- Training deep nets with sublinear memory cost by Tianqi Chen et al., arxiv 2016
- ZeRO: memory optimizations toward training trillion parameter models by Samyam Rajbhandari et al., SC 2020 (see the toy optimizer-state sharding sketch after this list)
- Checkmate: Breaking the Memory Wall with Optimal Tensor Rematerialization by Paras Jain et al., MLSys 2020
- Dynamic Tensor Rematerialization by Marisa Kirisame et al., ICLR 2021
- ActNN: Reducing Training Memory Footprint via 2-Bit Activation Compressed Training by Jianfei Chen et al., ICML 2021
- GACT: Activation Compressed Training for Generic Network Architectures by Xiaoxuan Liu et al., ICML 2022
- Superneurons: dynamic GPU memory management for training deep neural networks by Linnan Wang et al., PPoPP 2018
- Capuchin: Tensor-based GPU Memory Management for Deep Learning by Xuan Peng et al., ASPLOS 2020
- SwapAdvisor: Pushing Deep Learning Beyond the GPU Memory Limit via Smart Swapping by Chien-Chin Huang et al., ASPLOS 2020
- ZeRO-Offload: Democratizing Billion-Scale Model Training by Jie Ren et al., ATC 2021
- ZeRO-infinity: breaking the GPU memory wall for extreme scale deep learning by Samyam Rajbhandari et al., SC 2021
- PatrickStar: Parallel Training of Pre-trained Models via Chunk-based Memory Management by Jiarui Fang et al., TPDS 2023
- MegTaiChi: dynamic tensor-based memory management optimization for DNN training by Zhongzhe Hu et al., ICS 2022
- Tensor Movement Orchestration In Multi-GPU Training Systems by Shao-Fu Lin et al., HPCA 2023
- Mesh-tensorflow: Deep learning for supercomputers by Noam Shazeer et al., NeurIPS 2018
- Exploring Hidden Dimensions in Parallelizing Convolutional Neural Networks by Zhihao Jia et al., ICML 2018
- Beyond Data and Model Parallelism for Deep Neural Networks by Zhihao Jia et al., MLSys 2019
- Supporting Very Large Models using Automatic Dataflow Graph Partitioning by Minjie Wang et al., EuroSys 2019
- GSPMD: General and Scalable Parallelization for ML Computation Graphs by Yuanzhong Xu et al., arXiv 2021
- Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning by Lianmin Zheng et al., OSDI 2022
- Unity: Accelerating DNN Training Through Joint Optimization of Algebraic Transformations and Parallelization by Colin Unger, Zhihao Jia, et al., OSDI 2022
- Galvatron: Efficient Transformer Training over Multiple GPUs Using Automatic Parallelism by Xupeng Miao, et al., VLDB 2023
- Auto-Parallelizing Large Models with Rhino: A Systematic Approach on Production AI Platform by Shiwei Zhang, Lansong Diao, et al., arXiv 2023
- nnScaler: Constraint-Guided Parallelization Plan Generation for Deep Learning Training by Zhiqi Lin, et al., OSDI 2024
- Blink: Fast and Generic Collectives for Distributed ML by Guanhua Wang et al., MLSys 2020 (see the toy ring all-reduce sketch after this list)
- Synthesizing optimal collective algorithms by Zixian Cai et al., PPoPP 2021
- Breaking the computation and communication abstraction barrier in distributed machine learning workloads by Abhinav Jangda et al., ASPLOS 2022
- MSCCLang: Microsoft Collective Communication Language by Meghan Cowan et al., ASPLOS 2023
- Overlap Communication with Dependent Computation via Decomposition in Large Deep Learning Models by Shibo Wang et al., ASPLOS 2023
- Logical/Physical Topology-Aware Collective Communication in Deep Learning Training by Sanghoon Jo et al., HPCA 2023
- Centauri: Enabling Efficient Scheduling for Communication-Computation Overlap in Large Model Training via Communication Partitioning by Chang Chen et al., ASPLOS 2024
- Oobleck: Resilient Distributed Training of Large Models Using Pipeline Templates by Insu Jang et al., SOSP 2023
- Bamboo: Making Preemptible Instances Resilient for Affordable Training of Large DNNs by John Thorpe et al., NSDI 2023
- Varuna: scalable, low-cost training of massive deep learning models by Sanjith Athlur et al., EuroSys 2022
- DeepSpeed Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale by Reza Yazdani Aminabadi et al., SC 2022
- EnergonAI: An Inference System for 10-100 Billion Parameter Transformer Models by Jiangsu Du et al., arXiv 2022
- Efficiently Scaling Transformer Inference by Reiner Pope et al., MLSys 2023
- AlpaServe: Statistical Multiplexing with Model Parallelism for Deep Learning Serving by Zhuohan Li et al., OSDI 2023
- Fast inference from transformers via speculative decoding by Yaniv Leviathan et al., ICML 2023
- FlexGen: High-throughput Generative Inference of Large Language Models with a Single GPU by Ying Sheng et al., ICML 2023
- Liger: Interleaving Intra- and Inter-Operator Parallelism for Distributed Large Model Inference by Jiangsu Du et al., PPoPP 2024
- Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving by Ruoyu Qin et al., arXiv 2024
- Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve by Amey Agrawal et al., OSDI 2024
- NASPipe: High Performance and Reproducible Pipeline Parallel Supernet Training via Causal Synchronous Parallelism by Shixiong Zhao et al., ASPLOS 2022
- MSRL: Distributed Reinforcement Learning with Dataflow Fragments by Huanzhou Zhu et al., ATC 2023
- Hydro: Surrogate-Based Hyperparameter Tuning Service in the Datacenter by Qinghao Hu et al., OSDI 2023
- FastFold: Optimizing AlphaFold Training and Inference on GPU Clusters by Shenggan Cheng et al., PPoPP 2024
- HybridFlow: A Flexible and Efficient RLHF Framework by Guangming Sheng et al., EuroSys 2025
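The pipeline-parallelism entries above (GPipe, PipeDream, DAPPLE, Chimera, Hanayo, Zero Bubble) differ mainly in how micro-batches are scheduled across stages. The sketch below is a minimal, framework-free simulation of a GPipe-style forward schedule; the stage and micro-batch counts are arbitrary assumptions, and none of this code comes from the projects listed here.

```python
# Toy GPipe-style forward schedule: split a batch into micro-batches and
# print which (stage, micro-batch) pairs are active at each clock tick.
# Illustration only; real systems also schedule backward passes (1F1B,
# zero-bubble variants) and overlap communication with compute.

def gpipe_forward_schedule(num_stages: int, num_microbatches: int):
    """Yield, per clock tick, the active (stage, micro-batch) forward steps."""
    total_ticks = num_stages + num_microbatches - 1
    for tick in range(total_ticks):
        active = [(stage, tick - stage)
                  for stage in range(num_stages)
                  if 0 <= tick - stage < num_microbatches]
        yield tick, active

if __name__ == "__main__":
    stages, microbatches = 4, 8
    for tick, active in gpipe_forward_schedule(stages, microbatches):
        print(f"tick {tick:2d}: " + ", ".join(f"S{s}<-mb{m}" for s, m in active))
    # The idle "bubble" fraction of this toy schedule shrinks as micro-batches grow.
    print("bubble fraction:", (stages - 1) / (stages - 1 + microbatches))
```

More micro-batches shrink the pipeline bubble, which is the trade-off the later schedules (1F1B, Chimera's bidirectional pipelines, zero-bubble scheduling) refine.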
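Several of the memory-optimization entries (ZeRO, ZeRO-Offload, ZeRO-Infinity, PatrickStar) partition optimizer and parameter state across data-parallel ranks instead of replicating it. The snippet below is a single-process toy of the ZeRO-1 idea in plain Python: each simulated rank updates only its parameter shard, and concatenation stands in for the all-gather. It is not DeepSpeed code; the plain-SGD update rule and helper names are illustrative assumptions.

```python
# Toy ZeRO-style optimizer-state sharding: each data-parallel rank owns a
# 1/N shard of the flat parameter vector, updates only that shard, and an
# "all-gather" (here, concatenation) rebuilds the full parameters.
from typing import List

def shard(length: int, rank: int, world_size: int) -> slice:
    """Contiguous slice of a flat vector of `length` owned by `rank`."""
    per_rank = (length + world_size - 1) // world_size
    return slice(rank * per_rank, min((rank + 1) * per_rank, length))

def zero_sgd_step(params: List[float], grads: List[float],
                  world_size: int, lr: float = 0.1) -> List[float]:
    """Each simulated rank applies SGD to its shard only."""
    pieces = []
    for rank in range(world_size):
        s = shard(len(params), rank, world_size)
        # Only this rank's slice of parameters/optimizer state is touched.
        pieces.append([p - lr * g for p, g in zip(params[s], grads[s])])
    # "All-gather": every rank ends up with the full updated parameters.
    return [p for piece in pieces for p in piece]

if __name__ == "__main__":
    print(zero_sgd_step(params=[1.0] * 8, grads=[0.5] * 8, world_size=4))
```

The payoff is that per-rank optimizer-state memory drops roughly by the data-parallel degree, at the cost of the extra gather communication.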
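Many of the communication-optimization entries (Blink, MSCCLang, the synthesized-collectives work) build on or compete with the classic ring all-reduce. Below is a single-process toy of that algorithm over plain Python lists, reduce-scatter followed by all-gather; it sketches the idea only and is not NCCL or MSCCL code.

```python
# Toy ring all-reduce: N ranks arranged in a ring sum their vectors in
# 2*(N-1) steps, each step moving one vector chunk to the next rank.
import copy
from typing import List

def ring_allreduce(inputs: List[List[float]]) -> List[List[float]]:
    n, size = len(inputs), len(inputs[0])
    assert size % n == 0, "toy version assumes the vector splits evenly"
    c = size // n                              # chunk length
    bufs = [list(v) for v in inputs]           # per-rank working buffers

    def sl(idx: int) -> slice:                 # slice covering chunk `idx`
        return slice(idx * c, (idx + 1) * c)

    # Phase 1, reduce-scatter: after n-1 steps rank r holds the fully
    # summed chunk (r + 1) % n.
    for step in range(n - 1):
        prev = copy.deepcopy(bufs)             # all ranks send pre-step values
        for rank in range(n):
            dst, idx = (rank + 1) % n, (rank - step) % n
            s = sl(idx)
            bufs[dst][s] = [a + b for a, b in zip(bufs[dst][s], prev[rank][s])]
    # Phase 2, all-gather: circulate each completed chunk around the ring.
    for step in range(n - 1):
        prev = copy.deepcopy(bufs)
        for rank in range(n):
            dst, idx = (rank + 1) % n, (rank + 1 - step) % n
            bufs[dst][sl(idx)] = prev[rank][sl(idx)]
    return bufs

if __name__ == "__main__":
    ranks = [[float(r)] * 8 for r in range(4)]  # 4 ranks, 8 values each
    print(ring_allreduce(ranks)[0])             # every rank ends with [6.0] * 8
```

Ring all-reduce is the bandwidth-friendly baseline that the topology-aware and synthesized collectives above try to beat on real hierarchical interconnects.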
All contributions to this repository are welcome. Open an issue or send a pull request.