Skip to content

Commit

Permalink
Update index.md
Browse files Browse the repository at this point in the history
  • Loading branch information
GindaChen committed Mar 21, 2024
1 parent b5f5364 commit e2e73c5
Showing 1 changed file with 1 addition and 1 deletion.
2 changes: 1 addition & 1 deletion content/blogs/distserve/index.md
Original file line number Diff line number Diff line change
Expand Up @@ -114,7 +114,7 @@ Figure 5 illustrates how a request is processed in such a disaggregated system.
Let’s go through a simple experiment to see why disaggregation is beneficial. We serve a 13B LLM on a single A100-80GB GPU with a synthetic workload of inputs of length 512 and output length 64 following [Poisson arrival](https://en.wikipedia.org/wiki/Poisson_point_process). We gradually increase the request rates (x-axis) and measure how the two latencies (P90 TTFT and P90 TPOT, y-axis) change in Figure 6.

Suppose we set the SLO of P90 TTFT as 0.4 seconds and P90 TPOT as 0.04 seconds (the horizontal line in **Figure 6**). We observe the existing systems can support roughly 3 rps that stay within the TTFT latency constraint using 1 GPU, whereas for TPOT, it sustains 1.6 rps (**Figure 6a)**. Since we need to satisfy both constraints, the goodput of existing colocated system becomes:
Goodput (colocate) = min(2.3, 1.6) = 1.6 rps (per GPU).
Goodput (colocate) = min(3, 1.6) = 1.6 rps (per GPU).

The performance is significantly boosted after disaggregation. Prefill worker and decode worker can both achieve better rps than the previous if only handling one phase – as shown in **Figure 6**, one prefill worker achieves roughly 5.6 rps and one decode worker achieves roughly 10 rps. More importantly, now we can flexibly allocate 2 prefill workers to pair with 1 decode worker (notate as 2P1D), 3 GPUs in total. The goodput becomes:

Expand Down

0 comments on commit e2e73c5

Please sign in to comment.