Commit

Change figure for continuous batching
GindaChen committed Mar 18, 2024
1 parent 509bb58 commit 0a9a410
Showing 2 changed files with 4 additions and 1 deletion.
5 changes: 4 additions & 1 deletion content/blogs/distserve/index.md
@@ -87,7 +87,10 @@ We explain them next.

**Figure 3** shows a simplified view of the interference between prefill and decode. On the far left, we route the 2 incoming requests to two separate GPUs so that each request runs on its own. In the middle, we batch these 2 requests together on 1 GPU. We can see that continuous batching significantly elongates the latency for R1 (decode), and at the same time slightly increases the latency for R2 (prefill). On the right, we have a steady stream of incoming requests. Now the requests in the decode phase get “bugged” every single time a prefill request comes into the system, causing an unexpectedly long delay on decode.

{{< image src="img/lvHuoscAJhmWUmO2hN9ENRxYpW83WJRNLpeDfX52JqjATOpwdCD72PwbcH6LvA_bCMrnqxHdhi7snoUEt8DvvrJKEUuaHdCayqNLPfied_43of9cedDSvAqrpLqRQz2m3v6BZUkwdlDadMlelK-PVfU.png" alt="continuous_batching_interference" width="100%" title="Figure 3. Continuous batching causes interference.">}}
<!-- {{< image src="img/lvHuoscAJhmWUmO2hN9ENRxYpW83WJRNLpeDfX52JqjATOpwdCD72PwbcH6LvA_bCMrnqxHdhi7snoUEt8DvvrJKEUuaHdCayqNLPfied_43of9cedDSvAqrpLqRQz2m3v6BZUkwdlDadMlelK-PVfU.png" alt="continuous_batching_interference" width="100%" title="Figure 3. Continuous batching causes interference.">}} -->


{{< image src="img/continuous_batching.png" alt="continuous_batching_interference" width="100%" title="Figure 3. Continuous batching causes interference.">}}


As a result of this interference, as shown in Figure 4, when services must satisfy both TTFT and TPOT SLOs, systems have to over-provision resources to meet the latency goal, especially when either SLO is strict.
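
The interference that Figure 3 depicts can be sketched with a toy timing model. This is only an illustration, not the measurement setup behind the figure: the cost model (one batched iteration costs its prefill work plus one decode step), the function names `iteration_ms` and `decode_latency_ms`, and the 100 ms / 10 ms costs are all assumptions made up for this example.

```python
def iteration_ms(n_prefill, has_decode, prefill_ms=100.0, decode_ms=10.0):
    """Toy cost of one batched iteration (hypothetical model):
    all prefill work in the batch plus one decode step, if any decode is present."""
    return n_prefill * prefill_ms + (decode_ms if has_decode else 0.0)


def decode_latency_ms(num_tokens, prefill_arrivals, prefill_ms=100.0, decode_ms=10.0):
    """Total time for one request to decode `num_tokens` tokens.

    `prefill_arrivals` is the set of decode-step indices at which one
    prefill request is batched into the same iteration.
    """
    total = 0.0
    for step in range(num_tokens):
        n_prefill = 1 if step in prefill_arrivals else 0
        total += iteration_ms(n_prefill, True, prefill_ms, decode_ms)
    return total


# Left panel: each request runs on its own GPU.
decode_alone = decode_latency_ms(20, set())
prefill_alone = iteration_ms(1, False)

# Middle panel: one prefill is batched with the decoding request.
decode_with_one_prefill = decode_latency_ms(20, {0})
prefill_batched = iteration_ms(1, True)

# Right panel: a steady stream, one prefill every 5 decode steps.
decode_with_stream = decode_latency_ms(20, set(range(0, 20, 5)))

print(decode_alone, decode_with_one_prefill, decode_with_stream)
print(prefill_alone, prefill_batched)
```

Under these made-up costs, the prefill request slows down only slightly when batched, while the decoding request pays a full prefill's worth of delay for every arrival, which is the asymmetry the figure illustrates.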
