Update index.md

hao-ai-lab · Mar 21, 2024 · e2e73c5
1 parent b5f5364
Showing 1 changed file with 1 addition and 1 deletion.
diff --git a/content/blogs/distserve/index.md b/content/blogs/distserve/index.md
@@ -114,7 +114,7 @@ Figure 5 illustrates how a request is processed in such a disaggregated system.
 Let’s go through a simple experiment to see why disaggregation is beneficial. We serve a 13B LLM on a single A100-80GB GPU with a synthetic workload of inputs of length 512 and output length 64 following [Poisson arrival](https://en.wikipedia.org/wiki/Poisson_point_process). We gradually increase the request rates (x-axis) and measure how the two latencies (P90 TTFT and P90 TPOT, y-axis) change in Figure 6.
 
 Suppose we set the SLO of P90 TTFT as 0.4 seconds and P90 TPOT as 0.04 seconds (the horizontal line in **Figure 6**). We observe the existing systems can support roughly 3 rps that stay within the TTFT latency constraint using 1 GPU, whereas for TPOT, it sustains 1.6 rps (**Figure 6a)**. Since we need to satisfy both constraints, the goodput of existing colocated system becomes:
-Goodput (colocate) = min(2.3, 1.6) = 1.6 rps (per GPU).
+Goodput (colocate) = min(3, 1.6) = 1.6 rps (per GPU).
 
 The performance is significantly boosted after disaggregation. Prefill worker and decode worker can both achieve better rps than the previous if only handling one phase – as shown in **Figure 6**, one prefill worker achieves roughly 5.6 rps and one decode worker achieves roughly 10 rps. More importantly, now we can flexibly allocate 2 prefill workers to pair with 1 decode worker (notate as 2P1D), 3 GPUs in total. The goodput becomes:
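
For reference, the arithmetic in this hunk can be checked with a short sketch. It assumes the disaggregated case follows the same bottleneck reasoning as the colocated formula above: total throughput is capped by the slower of the two stages, then divided by the number of GPUs. The function names are hypothetical and the rates are the approximate values quoted from Figure 6, so treat the result as illustrative rather than as the post's exact figure.

```python
def colocated_goodput(ttft_limited_rps: float, tpot_limited_rps: float) -> float:
    """Per-GPU goodput when one GPU must meet both SLOs at once."""
    # Both constraints apply to the same GPU, so the tighter one wins.
    return min(ttft_limited_rps, tpot_limited_rps)


def disaggregated_goodput_per_gpu(
    prefill_rps: float, decode_rps: float, n_prefill: int, n_decode: int
) -> float:
    """Per-GPU goodput for an nP/mD setup (assumed bottleneck model)."""
    # Assumption: each stage scales linearly with its worker count, and
    # the end-to-end rate is limited by the slower stage.
    total_rps = min(prefill_rps * n_prefill, decode_rps * n_decode)
    return total_rps / (n_prefill + n_decode)


# Values quoted in the text: 3 rps (TTFT) and 1.6 rps (TPOT) colocated;
# roughly 5.6 rps per prefill worker and 10 rps per decode worker.
colocated = colocated_goodput(3.0, 1.6)
two_p_one_d = disaggregated_goodput_per_gpu(5.6, 10.0, n_prefill=2, n_decode=1)

print(f"colocated: {colocated:.1f} rps per GPU")
print(f"2P1D:      {two_p_one_d:.1f} rps per GPU")
```

Note that the corrected TTFT value (3 instead of 2.3) does not change the colocated result, because TPOT remains the binding constraint at 1.6 rps.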