Change figure for continuous batching

hao-ai-lab · Mar 18, 2024 · 0a9a410
1 parent 509bb58

Showing 2 changed files with 4 additions and 1 deletion.
diff --git a/content/blogs/distserve/img/continuous_batching.png b/content/blogs/distserve/img/continuous_batching.png
diff --git a/content/blogs/distserve/index.md b/content/blogs/distserve/index.md
@@ -87,7 +87,10 @@ We explain them next.
 
 **Figure 3** shows a simplified view of the interference between prefill and decode. On the very left, we route the 2 incoming requests to two GPUs so that each request runs on its own. In the middle, we batch these 2 requests together on 1 GPU. We can see that continuous batching significantly elongates the latency for R1 (decode) and, at the same time, slightly increases the latency for R2 (prefill). On the right, we have a steady stream of incoming requests. Now the requests in the decode phase get “bugged” every single time a prefill request comes into the system, causing an unexpectedly long delay on decode. 
 
-{{< image src="img/lvHuoscAJhmWUmO2hN9ENRxYpW83WJRNLpeDfX52JqjATOpwdCD72PwbcH6LvA_bCMrnqxHdhi7snoUEt8DvvrJKEUuaHdCayqNLPfied_43of9cedDSvAqrpLqRQz2m3v6BZUkwdlDadMlelK-PVfU.png" alt="continuous_batching_interference" width="100%" title="Figure 3. Continuous batching causes interference.">}}
+<!-- {{< image src="img/lvHuoscAJhmWUmO2hN9ENRxYpW83WJRNLpeDfX52JqjATOpwdCD72PwbcH6LvA_bCMrnqxHdhi7snoUEt8DvvrJKEUuaHdCayqNLPfied_43of9cedDSvAqrpLqRQz2m3v6BZUkwdlDadMlelK-PVfU.png" alt="continuous_batching_interference" width="100%" title="Figure 3. Continuous batching causes interference.">}} -->
+
+
+{{< image src="img/continuous_batching.png" alt="continuous_batching_interference" width="100%" title="Figure 3. Continuous batching causes interference.">}}
 
 
 As a result of this interference, as shown in Figure 4, when services must satisfy both TTFT and TPOT SLOs, systems have to over-provision resources to meet the latency goal, especially when either SLO is strict.
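The Figure 3 interference can be made concrete with a toy latency model. This is a sketch under stated assumptions, not DistServe's cost model. It assumes a decode-only iteration takes a flat `DECODE_STEP_MS`, a prefill costs `PREFILL_MS_PER_TOKEN` per prompt token, and each decode co-batched with a prefill adds a small `DECODE_EXTRA_MS`. All constants and the helper names `iteration_ms` and `mean_tpot_ms` are hypothetical.

```python
# Toy model of prefill/decode interference under continuous batching.
# All constants are HYPOTHETICAL illustration values, not measurements.
DECODE_STEP_MS = 20.0       # decode-only iteration, assumed flat in batch size
PREFILL_MS_PER_TOKEN = 0.1  # assumed compute cost per prompt token
DECODE_EXTRA_MS = 1.0       # assumed marginal cost of one decode co-batched with a prefill


def iteration_ms(prefill_tokens: int, num_decodes: int) -> float:
    """Latency of one engine iteration under the additive assumptions above."""
    if prefill_tokens == 0:
        # Pure decode iteration (or an idle one if nothing is running).
        return DECODE_STEP_MS if num_decodes else 0.0
    # A prefill is in the batch: every co-batched decode waits for it to finish.
    return prefill_tokens * PREFILL_MS_PER_TOKEN + num_decodes * DECODE_EXTRA_MS


def mean_tpot_ms(steps: int, prefill_every: int, prompt_tokens: int) -> float:
    """Mean time per output token for one decoding request (R1).

    A new request with `prompt_tokens` tokens is prefilled every `prefill_every`
    steps (0 disables arrivals). For simplicity, newly arrived requests are not
    added to the decode batch afterwards.
    """
    total = 0.0
    for step in range(steps):
        arriving = prefill_every and step % prefill_every == 0
        total += iteration_ms(prompt_tokens if arriving else 0, num_decodes=1)
    return total / steps


if __name__ == "__main__":
    print("R1 decode step alone:       ", iteration_ms(0, 1))
    print("R1 decode step with prefill:", iteration_ms(512, 1))
    print("R2 prefill alone:           ", iteration_ms(512, 0))
    print("mean TPOT, no arrivals:     ", mean_tpot_ms(10, 0, 512))
    print("mean TPOT, prefill every 5: ", mean_tpot_ms(10, 5, 512))
```

With these made-up numbers, co-batching a 512-token prefill stretches R1's decode step from 20 ms to about 52 ms, while R2's prefill grows only from 51.2 ms to 52.2 ms. This mirrors the asymmetry in Figure 3. A prefill arriving every fifth step raises R1's mean TPOT from 20 ms to about 26.4 ms.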