You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
1. I have searched related issues but cannot get the expected help.
2. The bug has not been fixed in the latest version.
3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
Describe the bug
Is there still room to improve the first frame response speed TTFT
I use 5 threads, each with 80000 tokens. The first frame response time after hitting the cache is 2-5 seconds. Do you have any optimization strategies
Currently, there is only one A800 graphics card available
This issue is marked as stale because it has been marked as invalid or awaiting response for 7 days without any further response. It will be closed in 5 days if the stale label is not removed or if there is no further response.
Checklist
Describe the bug
Is there still room to improve the first frame response speed TTFT
I use 5 threads, each with 80000 tokens. The first frame response time after hitting the cache is 2-5 seconds. Do you have any optimization strategies
Currently, there is only one A800 graphics card available
Reproduction
Model: Qwen2__v5-14B-Structure-AWQ
Command:
CUDA_VISIBLE_DEVICES=5 lmdeploy serve api_server /mnt/qwen2.5/qwen14bInt/Qwen/Qwen2___5-14B-Instruct-AWQ --backend turbomind --server-port 35551 --model-name qwenInt4 --model-format awq --session-len 100000 --cache-block-seq-len 128 --enable-prefix-caching --cache-max-entry-count 0.8 --log-level INFO --quant-policy=4 --rope-scaling-factor 4.0 >> /mnt/qwen2.5/qwenInt4/qwen14btmp2.txt 2>&1
Environment
Error traceback
No response
The text was updated successfully, but these errors were encountered: