[Bug] How to improve the first frame response speed TTFT #2728

zhouyuustc · 2024-11-08T05:48:56Z

Checklist

1. I have searched related issues but cannot get the expected help.
2. The bug has not been fixed in the latest version.
3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.

Describe the bug

Is there still room to improve the first frame response speed TTFT
I use 5 threads, each with 80000 tokens. The first frame response time after hitting the cache is 2-5 seconds. Do you have any optimization strategies
Currently, there is only one A800 graphics card available

Reproduction

Model: Qwen2__v5-14B-Structure-AWQ
Command: CUDA_VISIBLE_DEVICES=5 lmdeploy serve api_server /mnt/qwen2.5/qwen14bInt/Qwen/Qwen2___5-14B-Instruct-AWQ --backend turbomind --server-port 35551 --model-name qwenInt4 --model-format awq --session-len 100000 --cache-block-seq-len 128 --enable-prefix-caching --cache-max-entry-count 0.8 --log-level INFO --quant-policy=4 --rope-scaling-factor 4.0 >> /mnt/qwen2.5/qwenInt4/qwen14btmp2.txt 2>&1

Environment

sys.platform: linux
Python: 3.10.14 (main, May 29 2024, 23:47:02) [GCC 11.4.0]
CUDA available: True
MUSA available: False
numpy_random_seed: 2147483648
GPU 0,1,2,3,4,5,6,7: NVIDIA Graphics Device
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.1, V12.1.105
GCC: gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
PyTorch: 2.3.1+cu121
PyTorch compiling details: PyTorch built with:
  - GCC 9.3
  - C++ Version: 201703
  - Intel(R) oneAPI Math Kernel Library Version 2022.2-Product Build 20220804 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v3.3.6 (Git Hash 86e6af5974177e513fd3fee58425e1063e7f1361)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - LAPACK is enabled (usually provided by MKL)
  - NNPACK is enabled
  - CPU capability usage: AVX512
  - CUDA Runtime 12.1
  - NVCC architecture flags: -gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_90,code=sm_90
  - CuDNN 8.9.2
  - Magma 2.6.1
  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=12.1, CUDNN_VERSION=8.9.2, CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=0 -fabi-version=11 -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOROCTRACER -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wsuggest-override -Wno-psabi -Wno-error=pedantic -Wno-error=old-style-cast -Wno-missing-braces -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=2.3.1, USE_CUDA=ON, USE_CUDNN=ON, USE_CUSPARSELT=1, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_GLOO=ON, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=1, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF, USE_ROCM_KERNEL_ASSERT=OFF,

TorchVision: 0.18.1+cu121
LMDeploy: 0.5.3+
transformers: 4.44.0
gradio: 4.42.0
fastapi: 0.111.0
pydantic: 2.7.2
triton: 2.3.1
NVIDIA Topology:
        GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    mlx5_0  mlx5_1  mlx5_2  mlx5_3  mlx5_4  mlx5_5  mlx5_6  mlx5_7       CPU Affinity    NUMA Affinity
GPU0     X      NV8     NV8     NV8     NV8     NV8     NV8     NV8     PXB     PXB     NODE    NODE    SYS     SYS     SYS     SYS 0-31,64-95       0
GPU1    NV8      X      NV8     NV8     NV8     NV8     NV8     NV8     PXB     PXB     NODE    NODE    SYS     SYS     SYS     SYS 0-31,64-95       0
GPU2    NV8     NV8      X      NV8     NV8     NV8     NV8     NV8     NODE    NODE    PXB     PXB     SYS     SYS     SYS     SYS 0-31,64-95       0
GPU3    NV8     NV8     NV8      X      NV8     NV8     NV8     NV8     NODE    NODE    PXB     PXB     SYS     SYS     SYS     SYS 0-31,64-95       0
GPU4    NV8     NV8     NV8     NV8      X      NV8     NV8     NV8     SYS     SYS     SYS     SYS     PXB     PXB     NODE    NODE32-63,96-127     1
GPU5    NV8     NV8     NV8     NV8     NV8      X      NV8     NV8     SYS     SYS     SYS     SYS     PXB     PXB     NODE    NODE32-63,96-127     1
GPU6    NV8     NV8     NV8     NV8     NV8     NV8      X      NV8     SYS     SYS     SYS     SYS     NODE    NODE    PXB     PXB 32-63,96-127     1
GPU7    NV8     NV8     NV8     NV8     NV8     NV8     NV8      X      SYS     SYS     SYS     SYS     NODE    NODE    PXB     PXB 32-63,96-127     1
mlx5_0  PXB     PXB     NODE    NODE    SYS     SYS     SYS     SYS      X      PIX     NODE    NODE    SYS     SYS     SYS     SYS
mlx5_1  PXB     PXB     NODE    NODE    SYS     SYS     SYS     SYS     PIX      X      NODE    NODE    SYS     SYS     SYS     SYS
mlx5_2  NODE    NODE    PXB     PXB     SYS     SYS     SYS     SYS     NODE    NODE     X      PIX     SYS     SYS     SYS     SYS
mlx5_3  NODE    NODE    PXB     PXB     SYS     SYS     SYS     SYS     NODE    NODE    PIX      X      SYS     SYS     SYS     SYS
mlx5_4  SYS     SYS     SYS     SYS     PXB     PXB     NODE    NODE    SYS     SYS     SYS     SYS      X      PIX     NODE    NODE
mlx5_5  SYS     SYS     SYS     SYS     PXB     PXB     NODE    NODE    SYS     SYS     SYS     SYS     PIX      X      NODE    NODE
mlx5_6  SYS     SYS     SYS     SYS     NODE    NODE    PXB     PXB     SYS     SYS     SYS     SYS     NODE    NODE     X      PIX
mlx5_7  SYS     SYS     SYS     SYS     NODE    NODE    PXB     PXB     SYS     SYS     SYS     SYS     NODE    NODE    PIX      X

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

Error traceback

No response

The text was updated successfully, but these errors were encountered:

lvhan028 · 2024-11-08T11:36:11Z

prefilling is compute bound operation.
As far as I know, it takes one second around to process 15,000 tokens. That's the upper bound right now

github-actions · 2024-11-16T02:42:42Z

This issue is marked as stale because it has been marked as invalid or awaiting response for 7 days without any further response. It will be closed in 5 days if the stale label is not removed or if there is no further response.

lvhan028 added the awaiting response label Nov 8, 2024

github-actions bot added the Stale label Nov 16, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Bug] How to improve the first frame response speed TTFT #2728

[Bug] How to improve the first frame response speed TTFT #2728

zhouyuustc commented Nov 8, 2024

lvhan028 commented Nov 8, 2024 •

edited

Loading

github-actions bot commented Nov 16, 2024

[Bug] How to improve the first frame response speed TTFT #2728

[Bug] How to improve the first frame response speed TTFT #2728

Comments

zhouyuustc commented Nov 8, 2024

Checklist

Describe the bug

Reproduction

Environment

Error traceback

lvhan028 commented Nov 8, 2024 • edited Loading

github-actions bot commented Nov 16, 2024

lvhan028 commented Nov 8, 2024 •

edited

Loading