
[Benchmark] benchmarks on different cuda architecture with models of various size #815

Open
lvhan028 opened this issue Dec 11, 2023 · 9 comments

@lvhan028 (Collaborator) commented Dec 11, 2023

Background

We found that most LLM inference engines report inference performance with sampling turned off. In real applications, however, sampling is almost always required. To provide benchmarks that are as close to real-world usage as possible, we opened this issue to report LMDeploy's performance with sampling enabled.

Models under test

  1. llama2-7b
  2. llama2-13b
  3. internlm-20b
  4. llama2-70b

Devices under test

  1. A100
    Compute precision: BF16 (FP16), W4A16, KV8
  2. V100
    Compute precision: FP16
  3. 4090
    Compute precision: W4A16
  4. 3090
    Compute precision: W4A16
  5. 2080
    Compute precision: W4A16

Metrics

  1. Static inference performance (out token/s): the number of tokens generated per second with a fixed batch size and fixed numbers of input and output tokens
  2. Requests per second (request/s): measured on the ShareGPT conversation dataset, with variable-length prompts and responses. We test two interfaces: the RESTful API of api_server and the Python API on localhost. (A sketch of how these throughput figures are derived is given right after this list.)
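For clarity, the throughput columns in the tables below follow the usual definitions and can be reproduced from per-request results with simple arithmetic. The sketch below only illustrates those definitions; the record structure is hypothetical, not the actual profiler code:

```python
from dataclasses import dataclass

@dataclass
class RequestRecord:
    prompt_tokens: int      # tokens fed in as the prompt
    completion_tokens: int  # tokens generated for this request

def summarize(records: list[RequestRecord], elapsed_s: float) -> dict:
    """Derive RPS/RPM and token throughput from completed requests."""
    out_tokens = sum(r.completion_tokens for r in records)
    total_tokens = out_tokens + sum(r.prompt_tokens for r in records)
    rps = len(records) / elapsed_s
    return {
        "RPS": rps,
        "RPM": rps * 60,
        "throughput(out tok/s)": out_tokens / elapsed_s,
        "throughput(total tok/s)": total_tokens / elapsed_s,
    }
```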
lvhan028 pinned this issue Dec 11, 2023
@frankxyy commented

Doesn't it feel like sampling (num_beam=1) has little impact on performance?

@lvhan028 (Collaborator, Author) commented

> Doesn't it feel like sampling (num_beam=1) has little impact on performance?

My understanding is that it means settings like temperature, top_p, and top_k.

@zhulinJulia24 (Collaborator) commented

> Doesn't it feel like sampling (num_beam=1) has little impact on performance?

> My understanding is that it means settings like temperature, top_p, and top_k.

I benchmarked llama-2-chat-7b with tp=1 using profile_throughput.py under different top_p, top_k, and temperature settings; there was almost no difference in tokens/s.
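For reference, a sweep like that can be sketched with LMDeploy's pipeline API roughly as follows. The model path, prompt set, and sampling values are placeholders, and the Response attribute names follow the current LMDeploy API rather than the profiler script itself:

```python
import time

from lmdeploy import pipeline, GenerationConfig

# Placeholder model path; any llama-2-chat-7b checkpoint works for tp=1.
pipe = pipeline("meta-llama/Llama-2-7b-chat-hf")
prompts = ["Summarize the benefits of continuous batching."] * 64

# Sweep a few sampling settings and compare generated tokens per second.
for top_p, top_k, temperature in [(0.8, 40, 0.7), (0.9, 50, 0.8), (1.0, 50, 1.0)]:
    gen_config = GenerationConfig(
        max_new_tokens=256, top_p=top_p, top_k=top_k, temperature=temperature
    )
    start = time.time()
    responses = pipe(prompts, gen_config=gen_config)
    elapsed = time.time() - start
    out_tokens = sum(r.generate_token_len for r in responses)
    print(f"top_p={top_p} top_k={top_k} temperature={temperature}: "
          f"{out_tokens / elapsed:.1f} out tok/s")
```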

@lvhan028 (Collaborator, Author) commented Dec 21, 2023

A100 (w4a16)

Request Throughput (RPM)

| model | batch | tp | num_prompts | RPS | RPM | FTL(ave)(s) | FTL(min)(s) | FTL(max)(s) | 50%(s) | 75%(s) | 95%(s) | 99%(s) | throughput(out tok/s) | throughput(total tok/s) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| llama-7b | 64 | 1 | 3000 | 12.083 | 725.005 | 0.199 | 0.027 | 2.393 | 0.008 | 0.022 | 0.052 | 0.339 | 2811.948 | 5795.166 |
| llama-7b | 128 | 1 | 3000 | 13.375 | 802.511 | 0.341 | 0.052 | 4.029 | 0.022 | 0.046 | 0.098 | 0.380 | 3112.555 | 6414.690 |
| llama2-13b | 64 | 1 | 3000 | 7.980 | 478.805 | 0.130 | 0.036 | 2.077 | 0.026 | 0.031 | 0.086 | 0.138 | 1857.054 | 3827.217 |
| llama2-13b | 128 | 1 | 3000 | 8.370 | 502.200 | 0.385 | 0.069 | 4.405 | 0.051 | 0.071 | 0.146 | 0.212 | 1947.793 | 4014.223 |
| internlm-20b | 64 | 1 | 3000 | 6.333 | 379.977 | 0.241 | 0.055 | 10.015 | 0.038 | 0.046 | 0.128 | 0.188 | 1263.609 | 2674.010 |
| internlm-20b | 128 | 1 | 3000 | 6.310 | 378.589 | 2.236 | 0.083 | 9.626 | 0.067 | 0.094 | 0.204 | 0.289 | 1258.992 | 2664.239 |
| llama2-70b | 64 | 4 | 3000 | 5.355 | 321.290 | 0.245 | 0.063 | 3.595 | 0.036 | 0.041 | 0.129 | 0.213 | 1246.131 | 2568.162 |
| llama2-70b | 128 | 4 | 3000 | 6.484 | 389.064 | 0.455 | 0.078 | 6.471 | 0.058 | 0.075 | 0.196 | 0.280 | 1508.993 | 3109.897 |

Static Inference Performance

llama2-7b

| batch | tp | prompt_tokens | completion_tokens | throughput(out tok/s) | mem(GB) | FTL(ave)(s) | FTL(min)(s) | FTL(max)(s) | 50%(s) | 75%(s) | 95%(s) | 99%(s) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 1 | 128 | 260.80 | 67.77 | 0.004 | 0.004 | 0.005 | 0.004 | 0.004 | 0.004 | 0.004 |
| 1 | 1 | 128 | 128 | 245.91 | 67.77 | 0.013 | 0.012 | 0.014 | 0.004 | 0.004 | 0.004 | 0.005 |
| 1 | 1 | 128 | 2048 | 226.59 | 67.77 | 0.013 | 0.013 | 0.013 | 0.005 | 0.005 | 0.005 | 0.005 |
| 1 | 1 | 2048 | 128 | 159.96 | 67.99 | 0.196 | 0.13 | 0.516 | 0.005 | 0.005 | 0.005 | 0.005 |
| 1 | 1 | 2048 | 2048 | 197.86 | 67.99 | 0.131 | 0.13 | 0.132 | 0.005 | 0.005 | 0.005 | 0.005 |
| 16 | 1 | 1 | 128 | 3326.22 | 67.80 | 0.01 | 0.007 | 0.014 | 0.005 | 0.005 | 0.006 | 0.006 |
| 16 | 1 | 128 | 128 | 2491.98 | 67.99 | 0.108 | 0.012 | 0.145 | 0.005 | 0.006 | 0.006 | 0.008 |
| 16 | 1 | 128 | 2048 | 1583.80 | 67.99 | 0.1 | 0.015 | 0.144 | 0.01 | 0.013 | 0.015 | 0.016 |
| 16 | 1 | 2048 | 128 | 518.54 | 69.46 | 1.43 | 0.133 | 2.032 | 0.015 | 0.015 | 0.016 | 0.017 |
| 16 | 1 | 2048 | 2048 | 784.66 | 69.36 | 1.437 | 0.134 | 2.044 | 0.019 | 0.022 | 0.024 | 0.025 |
| 32 | 1 | 1 | 128 | 4841.70 | 67.83 | 0.014 | 0.008 | 0.025 | 0.006 | 0.007 | 0.008 | 0.011 |
| 32 | 1 | 128 | 128 | 3288.00 | 68.18 | 0.193 | 0.018 | 0.263 | 0.008 | 0.008 | 0.01 | 0.011 |
| 32 | 1 | 128 | 2048 | 1867.68 | 68.15 | 0.194 | 0.019 | 0.277 | 0.017 | 0.022 | 0.026 | 0.028 |
| 32 | 1 | 2048 | 128 | 548.20 | 69.49 | 1.878 | 0.134 | 4.079 | 0.027 | 0.028 | 0.029 | 0.912 |
| 32 | 1 | 2048 | 2048 | 837.42 | 69.49 | 1.807 | 0.132 | 4.083 | 0.036 | 0.041 | 0.045 | 0.047 |
| 64 | 1 | 1 | 128 | 6576.58 | 67.90 | 0.031 | 0.009 | 0.056 | 0.01 | 0.016 | 0.024 | 0.03 |
| 64 | 1 | 128 | 128 | 4098.99 | 68.52 | 0.377 | 0.015 | 0.531 | 0.013 | 0.018 | 0.027 | 0.037 |
| 64 | 1 | 128 | 2048 | 2093.60 | 69.11 | 0.417 | 0.02 | 0.737 | 0.029 | 0.038 | 0.046 | 0.049 |
| 64 | 1 | 2048 | 128 | 568.93 | 69.49 | 2.811 | 0.133 | 13.776 | 0.044 | 0.046 | 0.177 | 1.046 |
| 64 | 1 | 2048 | 2048 | 828.56 | 69.49 | 34.994 | 0.133 | 104.059 | 0.044 | 0.045 | 0.047 | 0.051 |

llama2-13b

| batch | tp | prompt_tokens | completion_tokens | throughput(out tok/s) | mem(GB) | FTL(ave)(s) | FTL(min)(s) | FTL(max)(s) | 50%(s) | 75%(s) | 95%(s) | 99%(s) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 1 | 128 | 157.79 | 57.66 | 0.007 | 0.007 | 0.008 | 0.006 | 0.006 | 0.006 | 0.007 |
| 1 | 1 | 128 | 128 | 151.50 | 61.63 | 0.021 | 0.021 | 0.023 | 0.006 | 0.006 | 0.007 | 0.007 |
| 1 | 1 | 128 | 2048 | 140.05 | 59.16 | 0.022 | 0.021 | 0.022 | 0.007 | 0.007 | 0.008 | 0.008 |
| 1 | 1 | 2048 | 128 | 105.74 | 57.91 | 0.238 | 0.237 | 0.24 | 0.008 | 0.008 | 0.008 | 0.008 |
| 1 | 1 | 2048 | 2048 | 122.68 | 57.91 | 0.238 | 0.237 | 0.239 | 0.008 | 0.008 | 0.008 | 0.008 |
| 16 | 1 | 1 | 128 | 2051.60 | 57.66 | 0.015 | 0.01 | 0.025 | 0.008 | 0.008 | 0.009 | 0.009 |
| 16 | 1 | 128 | 128 | 1493.19 | 57.91 | 0.224 | 0.022 | 0.264 | 0.009 | 0.009 | 0.01 | 0.011 |
| 16 | 1 | 128 | 2048 | 999.76 | 57.91 | 0.198 | 0.022 | 0.281 | 0.016 | 0.02 | 0.023 | 0.024 |
| 16 | 1 | 2048 | 128 | 301.19 | 59.72 | 2.704 | 0.239 | 3.829 | 0.023 | 0.023 | 0.024 | 0.025 |
| 16 | 1 | 2048 | 2048 | 489.79 | 59.72 | 2.478 | 0.241 | 3.849 | 0.03 | 0.034 | 0.036 | 0.037 |
| 32 | 1 | 1 | 128 | 2993.08 | 57.69 | 0.02 | 0.013 | 0.031 | 0.01 | 0.011 | 0.013 | 0.014 |
| 32 | 1 | 128 | 128 | 1996.37 | 58.16 | 0.42 | 0.022 | 0.505 | 0.012 | 0.013 | 0.015 | 0.017 |
| 32 | 1 | 128 | 2048 | 1165.21 | 58.56 | 0.729 | 0.022 | 1.176 | 0.026 | 0.033 | 0.038 | 0.04 |
| 32 | 1 | 2048 | 128 | 310.99 | 59.78 | 3.512 | 0.24 | 12.731 | 0.038 | 0.039 | 0.041 | 1.004 |
| 32 | 1 | 2048 | 2048 | 478.93 | 60.82 | 32.547 | 0.235 | 90.296 | 0.037 | 0.038 | 0.04 | 0.041 |
| 64 | 1 | 1 | 128 | 4229.19 | 57.78 | 0.038 | 0.01 | 0.065 | 0.015 | 0.018 | 0.026 | 0.032 |
| 64 | 1 | 128 | 128 | 2500.53 | 58.53 | 0.684 | 0.029 | 0.967 | 0.018 | 0.02 | 0.024 | 0.038 |
| 64 | 1 | 128 | 2048 | 1182.01 | 59.59 | 6.725 | 0.028 | 52.618 | 0.038 | 0.041 | 0.044 | 0.054 |
| 64 | 1 | 2048 | 128 | 312.75 | 59.72 | 15.559 | 0.241 | 25.265 | 0.038 | 0.039 | 0.041 | 1.701 |
| 64 | 1 | 2048 | 2048 | 471.09 | 97.87 | 158.007 | 0.239 | 255.386 | 0.038 | 0.038 | 0.04 | 0.042 |

internlm-20b

| batch | tp | prompt_tokens | completion_tokens | throughput(out tok/s) | mem(GB) | FTL(ave)(s) | FTL(min)(s) | FTL(max)(s) | 50%(s) | 75%(s) | 95%(s) | 99%(s) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 1 | 128 | 102.44 | 70.05 | 0.011 | 0.01 | 0.011 | 0.01 | 0.01 | 0.01 | 0.011 |
| 1 | 1 | 128 | 128 | 98.88 | 92.22 | 0.032 | 0.032 | 0.033 | 0.01 | 0.01 | 0.01 | 0.011 |
| 1 | 1 | 128 | 2048 | 91.28 | 342.14 | 0.032 | 0.032 | 0.033 | 0.011 | 0.011 | 0.012 | 0.012 |
| 1 | 1 | 2048 | 128 | 69.28 | 69.81 | 0.361 | 0.36 | 0.361 | 0.012 | 0.012 | 0.012 | 0.012 |
| 1 | 1 | 2048 | 2048 | 80.07 | 69.81 | 0.362 | 0.361 | 0.363 | 0.012 | 0.013 | 0.013 | 0.013 |
| 16 | 1 | 1 | 128 | 1330.03 | 69.63 | 0.021 | 0.011 | 0.03 | 0.012 | 0.012 | 0.013 | 0.014 |
| 16 | 1 | 128 | 128 | 979.30 | 69.84 | 0.33 | 0.032 | 0.399 | 0.013 | 0.014 | 0.015 | 0.016 |
| 16 | 1 | 128 | 2048 | 659.21 | 69.97 | 0.344 | 0.032 | 0.409 | 0.024 | 0.03 | 0.034 | 0.036 |
| 16 | 1 | 2048 | 128 | 199.12 | 73.31 | 4.307 | 0.364 | 5.812 | 0.035 | 0.035 | 0.036 | 0.037 |
| 16 | 1 | 2048 | 2048 | 308.87 | 73.47 | 5.686 | 0.363 | 42.356 | 0.042 | 0.044 | 0.045 | 0.046 |
| 32 | 1 | 1 | 128 | 1974.15 | 69.69 | 0.028 | 0.016 | 0.041 | 0.016 | 0.017 | 0.019 | 0.021 |
| 32 | 1 | 128 | 128 | 1309.96 | 70.13 | 0.559 | 0.035 | 0.771 | 0.018 | 0.02 | 0.022 | 0.026 |
| 32 | 1 | 128 | 2048 | 738.76 | 368.22 | 2.114 | 0.033 | 26.537 | 0.037 | 0.045 | 0.048 | 0.049 |
| 32 | 1 | 2048 | 128 | 200.29 | 73.59 | 10.016 | 0.363 | 17.883 | 0.046 | 0.047 | 0.049 | 0.429 |
| 32 | 1 | 2048 | 2048 | 306.08 | 73.56 | 88.279 | 0.362 | 173.383 | 0.044 | 0.045 | 0.047 | 0.05 |
| 64 | 1 | 1 | 128 | 2808.92 | 69.84 | 0.041 | 0.014 | 0.06 | 0.022 | 0.024 | 0.028 | 0.03 |
| 64 | 1 | 128 | 128 | 1651.45 | 70.38 | 1.082 | 0.04 | 1.479 | 0.027 | 0.029 | 0.033 | 0.037 |
| 64 | 1 | 128 | 2048 | 736.56 | 205.43 | 22.127 | 0.035 | 83.859 | 0.048 | 0.05 | 0.053 | 0.273 |
| 64 | 1 | 2048 | 128 | 199.68 | 73.88 | 29.365 | 0.359 | 36.276 | 0.047 | 0.047 | 0.049 | 0.427 |
| 64 | 1 | 2048 | 2048 | 305.56 | 73.81 | 283.211 | 0.362 | 391.207 | 0.044 | 0.045 | 0.047 | 0.048 |

llama2-70b

| batch | tp | prompt_tokens | completion_tokens | throughput(out tok/s) | mem(GB) | FTL(ave)(s) | FTL(min)(s) | FTL(max)(s) | 50%(s) | 75%(s) | 95%(s) | 99%(s) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 4 | 1 | 128 | 72.79 | 74.98 | 0.016 | 0.014 | 0.017 | 0.014 | 0.014 | 0.014 | 0.015 |
| 1 | 4 | 128 | 128 | 70.26 | 74.98 | 0.047 | 0.047 | 0.048 | 0.014 | 0.014 | 0.014 | 0.014 |
| 1 | 4 | 128 | 2048 | 63.91 | 74.98 | 0.05 | 0.048 | 0.051 | 0.016 | 0.016 | 0.016 | 0.016 |
| 1 | 4 | 2048 | 128 | 52.13 | 75.07 | 0.367 | 0.366 | 0.368 | 0.016 | 0.016 | 0.016 | 0.017 |
| 1 | 4 | 2048 | 2048 | 60.90 | 75.07 | 0.369 | 0.368 | 0.372 | 0.016 | 0.016 | 0.016 | 0.016 |
| 16 | 4 | 1 | 128 | 959.05 | 75.01 | 0.034 | 0.021 | 0.048 | 0.016 | 0.017 | 0.018 | 0.018 |
| 16 | 4 | 128 | 128 | 796.94 | 75.07 | 0.312 | 0.05 | 0.435 | 0.017 | 0.017 | 0.018 | 0.019 |
| 16 | 4 | 128 | 2048 | 832.31 | 75.07 | 0.245 | 0.051 | 0.441 | 0.019 | 0.02 | 0.022 | 0.023 |
| 16 | 4 | 2048 | 128 | 240.39 | 75.70 | 3.965 | 0.372 | 5.618 | 0.022 | 0.023 | 0.023 | 0.025 |
| 16 | 4 | 2048 | 2048 | 617.35 | 75.71 | 3.428 | 0.372 | 5.703 | 0.023 | 0.024 | 0.025 | 0.026 |
| 32 | 4 | 1 | 128 | 1502.71 | 75.04 | 0.042 | 0.028 | 0.065 | 0.021 | 0.022 | 0.023 | 0.025 |
| 32 | 4 | 128 | 128 | 1162.02 | 75.20 | 0.493 | 0.065 | 0.775 | 0.021 | 0.022 | 0.024 | 0.052 |
| 32 | 4 | 128 | 2048 | 1249.91 | 75.20 | 0.486 | 0.062 | 0.771 | 0.025 | 0.027 | 0.03 | 0.031 |
| 32 | 4 | 2048 | 128 | 270.66 | 75.78 | 5.204 | 0.373 | 11.228 | 0.029 | 0.03 | 0.032 | 2.545 |
| 32 | 4 | 2048 | 2048 | 831.20 | 75.78 | 5.216 | 0.374 | 11.302 | 0.033 | 0.035 | 0.037 | 0.039 |
| 64 | 4 | 1 | 128 | 2063.85 | 75.10 | 0.072 | 0.032 | 0.238 | 0.03 | 0.032 | 0.035 | 0.038 |
| 64 | 4 | 128 | 128 | 1489.83 | 75.39 | 0.692 | 0.084 | 1.47 | 0.031 | 0.033 | 0.038 | 0.217 |
| 64 | 4 | 128 | 2048 | 1678.58 | 75.39 | 0.835 | 0.115 | 1.362 | 0.037 | 0.041 | 0.046 | 0.049 |
| 64 | 4 | 2048 | 128 | 287.97 | 75.79 | 6.458 | 0.444 | 22.085 | 0.044 | 0.047 | 0.405 | 2.864 |
| 64 | 4 | 2048 | 2048 | 1047.97 | 75.80 | 6.475 | 0.438 | 22.369 | 0.05 | 0.054 | 0.058 | 0.062 |

@Ajay-Wong commented

A question: how is this static batch tested? Now that continuous batching is supported, isn't the inference batch size determined by the available GPU memory?

@lvhan028 (Collaborator, Author) commented

"Static batch" here is a relative concept. During inference it is still continuous batching; it's just that for most of the inference time, the inference batch equals the input batch (the --concurrency argument).
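In other words, the profiler keeps --concurrency requests in flight for the whole run, so the continuous-batching engine's effective batch size stays pinned near that value. A minimal sketch of the idea (send_request is a hypothetical helper, not the actual profiler code):

```python
import concurrent.futures

def send_request(prompt: str) -> str:
    """Hypothetical helper: send one request to the inference engine and
    block until the full response has been generated."""
    raise NotImplementedError

def run_static_batch(prompts: list[str], concurrency: int = 64) -> list[str]:
    # Keep `concurrency` requests in flight at once; as soon as one finishes,
    # the next prompt is submitted, so the engine's effective batch size
    # stays close to `concurrency` for most of the run.
    with concurrent.futures.ThreadPoolExecutor(max_workers=concurrency) as pool:
        return list(pool.map(send_request, prompts))
```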

@zhyncs (Collaborator) commented Jul 13, 2024

@zhyncs (Collaborator) commented Jul 13, 2024

latest benchmark results: https://buildkite.com/vllm/performance-benchmark/builds/3924

@zhyncs (Collaborator) commented Jul 13, 2024
