Releases · DefTruth/Awesome-LLM-Inference
v2.6.4
v2.6.3
v2.6.2
What's Changed
- Early exit of LLM inference by @boyi-liu in #85
- Add paper AdaKV by @FFY0 in #86
- Efficient Hybrid Inference for LLMs: Reward-Based Token Modelling with Selective Cloud Assistance by @aharshms in #87
- 🔥[FastAttention] FastAttention: Extend FlashAttention2 to NPUs and Low-resource GPUs for Efficient Inference by @DefTruth in #88
New Contributors
- @boyi-liu made their first contribution in #85
- @FFY0 made their first contribution in #86
- @aharshms made their first contribution in #87
Full Changelog: v2.6.1...v2.6.2
v2.6.1
What's Changed
- [From Author] Link CacheGen and CacheBlend to LMCache by @KuntaiDu in #80
- 🔥[LoRC] Low-Rank Compression for LLMs KV Cache with a Progressive Compression Strategy by @DefTruth in #81
- Large Language Model Performance Benchmarking on Mobile Platforms: A Thorough Evaluation by @DefTruth in #82
- [LLM Inference] Large Language Model Inference Acceleration: A Comprehensive Hardware Perspective by @DefTruth in #83
- 🔥[ParallelSpec] ParallelSpec: Parallel Drafter for Efficient Speculative Decoding by @DefTruth in #84
Full Changelog: v2.6...v2.6.1
v2.6
What's Changed
- 🔥[VPTQ] VPTQ: Extreme Low-Bit Vector Post-Training Quantization for Large Language Models by @DefTruth in #70
- Fix typo by @DefTruth in #71
- 🔥🔥[INT-FlashAttention] INT-FlashAttention: Enabling Flash Attention for INT8 Quantization by @DefTruth in #72
- [Low-bit] A Survey of Low-bit Large Language Models: Basics, Systems, and Algorithms by @DefTruth in #73
- 🔥🔥[HiFloat8] Ascend HiFloat8 Format for Deep Learning by @DefTruth in #74
- 🔥[AlignedKV] AlignedKV: Reducing Memory Access of KV-Cache with Precision-Aligned Quantization by @DefTruth in #75
- 🔥🔥[Tensor Cores] Efficient Arbitrary Precision Acceleration for Large Language Models on GPU Tensor Cores by @DefTruth in #76
- 🔥[KV-Compress] Paged KV-Cache Compression with Variable Compression Rates per Attention Head by @DefTruth in #77
- 🔥[LayerKV] Optimizing Large Language Model Serving with Layer-wise KV Cache Management by @DefTruth in #78
- Bump up to v2.6 by @DefTruth in #79
Full Changelog: v2.5...v2.6
v2.5
What's Changed
- 🔥[InstInfer] InstInfer: In-Storage Attention Offloading for Cost-Effective Long-Context LLM Inference by @DefTruth in #65
- Update codebase of paper "Parallel Speculative Decoding with Adaptive Draft Length" by @smart-lty in #66
- Move RetrievalAttention -> long context by @DefTruth in #67
- 🔥🔥[CritiPrefill] CritiPrefill: A Segment-wise Criticality-based Approach for Prefilling Acceleration in LLMs by @DefTruth in #68
- Bump up to v2.5 by @DefTruth in #69
New Contributors
- @smart-lty made their first contribution in #66
Full Changelog: v2.4...v2.5
v2.4
What's Changed
- 🔥[RetrievalAttention] Accelerating Long-Context LLM Inference via Vector Retrieval by @DefTruth in #62
- 🔥[Inf-MLLM] Inf-MLLM: Efficient Streaming Inference of Multimodal Large Language Models on a Single GPU by @DefTruth in #63
- Bump up to v2.4 by @DefTruth in #64
Full Changelog: v2.3...v2.4
v2.3
v2.2
What's Changed
- Add NanoFlow code link by @DefTruth in #51
- 🔥[Activation Sparsity] Training-Free Activation Sparsity in Large Language Models by @DefTruth in #52
- 🔥[Decentralized LLM] Decentralized LLM Inference over Edge Networks with Energy Harvesting by @DefTruth in #53
- 🔥[SJF Scheduling] Efficient LLM Scheduling by Learning to Rank by @DefTruth in #54
- 🔥[Speculative Decoding] Boosting Lossless Speculative Decoding via Feature Sampling and Partial Alignment Distillation by @DefTruth in #55
- 🔥🔥[Prompt Compression] Prompt Compression with Context-Aware Sentence Encoding for Fast and Improved LLM Inference by @DefTruth in #56
- 🔥🔥[Context Distillation] Efficient LLM Context Distillation by @DefTruth in #57
- Bump up to v2.2 by @DefTruth in #58
Full Changelog: v2.1...v2.2
v2.1
What's Changed
- Update README.md by @DefTruth in #40
- 🔥[Speculative Decoding] Parallel Speculative Decoding with Adaptive Draft Length by @DefTruth in #41
- 🔥[FocusLLM] FocusLLM: Scaling LLM’s Context by Parallel Decoding by @DefTruth in #42
- 🔥[NanoFlow] NanoFlow: Towards Optimal Large Language Model Serving Throughput by @DefTruth in #43
- 🔥[MagicDec] MagicDec: Breaking the Latency-Throughput Tradeoff for Long Context Generation with Speculative Decoding by @DefTruth in #44
- Add ABQ-LLM code link by @DefTruth in #46
- 🔥🔥[MARLIN] MARLIN: Mixed-Precision Auto-Regressive Parallel Inference on Large Language Models by @DefTruth in #47
- 🔥[1-bit LLMs] Matmul or No Matmul in the Era of 1-bit LLMs by @DefTruth in #48
- 🔥🔥[FLA] FLA: A Triton-Based Library for Hardware-Efficient Implementa… by @DefTruth in #49
- Bump up to v2.1 by @DefTruth in #50
Full Changelog: v2.0...v2.1