I am testing inference performance of the roberta-base model, original PyTorch vs. ONNX, on CPU (Apple MacBook M1, 16 GB RAM). The tests show that as the sequence length increases, ONNX inference becomes slower compared to the original PyTorch model. The example code I could find by googling mostly works with short sequences, where it demonstrates that ONNX is faster than the original PyTorch model, but I have seen no tests that compare different sequence lengths all the way up to the maximum. Is this expected, or is there something I need to do to get ONNX running faster? If I did anything wrong, please let me know as well. Thanks for any comments. I have attached my code and package versions below.
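The attached code is not reproduced here. For reference, below is a minimal sketch of this kind of benchmark, assuming a plain `torch.onnx.export` path and onnxruntime's CPU execution provider; the sequence lengths, run counts, and helper names are illustrative choices, not the poster's setup.

```python
# Minimal benchmark sketch: PyTorch vs. ONNX Runtime CPU latency for
# roberta-base across sequence lengths. Assumes torch, transformers, numpy,
# and onnxruntime are installed; this is NOT the original attached code.
import time

import numpy as np
import torch
import onnxruntime as ort
from transformers import AutoModel, AutoTokenizer

model_name = "roberta-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# return_dict=False makes the model return plain tuples, which ONNX export handles.
model = AutoModel.from_pretrained(model_name, return_dict=False)
model.eval()

# Export once with dynamic axes so a single graph serves every sequence length.
dummy = tokenizer("hello world", return_tensors="pt")
torch.onnx.export(
    model,
    (dummy["input_ids"], dummy["attention_mask"]),
    "roberta-base.onnx",
    input_names=["input_ids", "attention_mask"],
    output_names=["last_hidden_state", "pooler_output"],
    dynamic_axes={
        "input_ids": {0: "batch", 1: "seq"},
        "attention_mask": {0: "batch", 1: "seq"},
        "last_hidden_state": {0: "batch", 1: "seq"},
    },
    opset_version=14,
)
session = ort.InferenceSession("roberta-base.onnx", providers=["CPUExecutionProvider"])

def bench(fn, runs=20):
    fn()  # warmup run, excluded from timing
    start = time.perf_counter()
    for _ in range(runs):
        fn()
    return (time.perf_counter() - start) / runs * 1000  # mean ms per run

for seq_len in (16, 64, 128, 256, 512):
    # Random token ids are fine for latency measurement; outputs are ignored.
    ids = np.random.randint(5, 1000, size=(1, seq_len), dtype=np.int64)
    mask = np.ones((1, seq_len), dtype=np.int64)
    ids_t, mask_t = torch.from_numpy(ids), torch.from_numpy(mask)

    with torch.inference_mode():
        pt_ms = bench(lambda: model(ids_t, mask_t))
    onnx_ms = bench(lambda: session.run(None, {"input_ids": ids, "attention_mask": mask}))
    print(f"seq_len={seq_len:4d}  pytorch={pt_ms:7.1f} ms  onnx={onnx_ms:7.1f} ms")
```

The dynamic axes keep one exported graph valid for every sequence length, so the loop times the same model end to end at each length rather than re-exporting per shape.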
Replies: 1 comment 1 reply

- Experiencing a similar issue when running PyTorch vs. ONNX on a Mac M2... curious what's going on here.