Fine tuning embeddings with onnxruntime training: error creating artifacts #22427
riccardopinosio asked this question in Training Q&A · Unanswered
Replies: 1 comment
-
Consider not using the given ONNX loss function; instead, use a loss function implemented in PyTorch or the default one from the Transformers library. I have had problems in the past when passing a …
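For example, one way to follow that suggestion with the Siamese similarity model described in the question below is to compute the loss inside the PyTorch module itself, so that `generate_artifacts` does not need an ONNX-side loss block. This is only a sketch: the input names, the MSE loss, the weight/bias filter, and the file names are assumptions.

```python
import torch
import onnx
from onnxruntime.training import artifacts


class SiameseWithLoss(torch.nn.Module):
    """Wraps the similarity model so the exported graph's only output is the scalar loss."""

    def __init__(self, similarity_model):
        super().__init__()
        self.similarity_model = similarity_model
        self.loss_fn = torch.nn.MSELoss()  # plain PyTorch loss instead of an ONNX loss block

    def forward(self, input_ids_1, attention_mask_1, input_ids_2, attention_mask_2, label):
        similarity = self.similarity_model(
            input_ids_1, attention_mask_1, input_ids_2, attention_mask_2
        )
        return self.loss_fn(similarity, label)


# After exporting SiameseWithLoss to ONNX, generate the artifacts with loss=None,
# so that no extra loss node is attached and the graph output itself is the loss.
onnx_model = onnx.load("siamese_with_loss.onnx")
artifacts.generate_artifacts(
    onnx_model,
    requires_grad=[init.name for init in onnx_model.graph.initializer
                   if init.name.endswith(("weight", "bias"))],
    loss=None,
    optimizer=artifacts.OptimType.AdamW,
    artifact_directory="training_artifacts",
)
```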
-
Hello,
I would like to fine-tune an embedding model like all-MiniLM-L6-v2 or DistilBERT for semantic search, using onnxruntime on-device training (not large model training, as there will be no Python in the runtime) and, say, a cosine similarity metric (see picture at the bottom). The training examples will be tuples like (sentenceA, sentenceB, similarity_label).
I came up with the following simple pytorch model:
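Roughly along these lines (a sketch: it assumes mean pooling over the last hidden state and a single cosine-similarity output; the exact module may differ):

```python
import torch
from transformers import AutoModel


class SiameseSimilarity(torch.nn.Module):
    """Shared-weight encoder that scores two tokenized sentences with cosine similarity."""

    def __init__(self, model_name="sentence-transformers/all-MiniLM-L6-v2"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)

    def _embed(self, input_ids, attention_mask):
        # Mean pooling over token embeddings, ignoring padding positions.
        hidden = self.encoder(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
        mask = attention_mask.unsqueeze(-1).to(hidden.dtype)
        return (hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)

    def forward(self, input_ids_1, attention_mask_1, input_ids_2, attention_mask_2):
        emb_1 = self._embed(input_ids_1, attention_mask_1)
        emb_2 = self._embed(input_ids_2, attention_mask_2)
        return torch.nn.functional.cosine_similarity(emb_1, emb_2, dim=-1)
```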
The idea here is that the `_1` and `_2` input names correspond to the tokenizations of sentences A and B respectively. I would then fine-tune this model with the on-device training API, and afterwards the learned parameters can be copied back into the original embedding model for inference, which should be possible since the parameters of the model above ought to be the same as those of the original all-MiniLM model.
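For reference, the intended flow with the on-device training API would be roughly the following (a sketch that assumes the training artifacts have already been generated with the default file names, and that the inference output is named "similarity"):

```python
from onnxruntime.training.api import CheckpointState, Module, Optimizer

# Load the artifacts produced by artifacts.generate_artifacts (default file names assumed).
state = CheckpointState.load_checkpoint("training_artifacts/checkpoint")
module = Module(
    "training_artifacts/training_model.onnx",
    state,
    "training_artifacts/eval_model.onnx",
    device="cpu",
)
optimizer = Optimizer("training_artifacts/optimizer_model.onnx", module)

# Placeholder: an iterable of numpy tuples (input_ids_1, attention_mask_1, input_ids_2,
# attention_mask_2, label) produced by the tokenizer.
batches = []

module.train()
for input_ids_1, attention_mask_1, input_ids_2, attention_mask_2, label in batches:
    # The first output of the training model is the loss value.
    loss = module(input_ids_1, attention_mask_1, input_ids_2, attention_mask_2, label)
    optimizer.step()
    module.lazy_reset_grad()

# Bake the learned weights back into an inference-only graph.
module.export_model_for_inferencing("finetuned_similarity.onnx", ["similarity"])
```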
I export the above model with the TorchScript-based torch.onnx.export:
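Roughly like this (a sketch: the input names, the "similarity" output name, the dummy shapes, and the opset are illustrative):

```python
import torch

model = SiameseSimilarity()
model.train()

# Dummy inputs just to trace the graph (batch of 1, sequence length 16).
dummy = {
    "input_ids_1": torch.ones(1, 16, dtype=torch.int64),
    "attention_mask_1": torch.ones(1, 16, dtype=torch.int64),
    "input_ids_2": torch.ones(1, 16, dtype=torch.int64),
    "attention_mask_2": torch.ones(1, 16, dtype=torch.int64),
}

torch.onnx.export(
    model,
    tuple(dummy.values()),
    "siamese_similarity.onnx",
    input_names=list(dummy.keys()),
    output_names=["similarity"],
    dynamic_axes={name: {0: "batch", 1: "sequence"} for name in dummy},
    training=torch.onnx.TrainingMode.TRAINING,  # keep Dropout etc. in training form
    do_constant_folding=False,                  # constant folding is typically disabled in training mode
    export_params=True,
    opset_version=17,
)
```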
The generated ONNX model seems fine, although exporting with `torch.onnx.TrainingMode.TRAINING` gives odd results: in the chunk below, the similarity does not come out as 1.0. Perhaps someone knows why the result is not 1.0 when the model is exported in training mode.
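A sketch of that kind of check (assuming the same tokenized sentence is fed to both branches, so a similarity of 1.0 would be expected):

```python
import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
session = ort.InferenceSession("siamese_similarity.onnx")

enc = tokenizer("the cat sat on the mat", return_tensors="np")
feeds = {
    "input_ids_1": enc["input_ids"].astype(np.int64),
    "attention_mask_1": enc["attention_mask"].astype(np.int64),
    "input_ids_2": enc["input_ids"].astype(np.int64),
    "attention_mask_2": enc["attention_mask"].astype(np.int64),
}

similarity = session.run(None, feeds)[0]
print(similarity)  # comes out noticeably below 1.0 with the TRAINING-mode export
```

One possible explanation (unverified here) is that the TRAINING-mode export keeps the Dropout nodes active, so the two encoder passes are no longer deterministic and identical inputs no longer produce identical embeddings.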
Anyway, when I try to export the model for training (i.e., generate the training artifacts), it complains.
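The artifact-generation step is roughly the following (a sketch: the built-in MSE loss, the weight/bias parameter filter, and the directory name are illustrative):

```python
import onnx
from onnxruntime.training import artifacts

onnx_model = onnx.load("siamese_similarity.onnx")

# Train the encoder's weights and biases; in practice a subset could be frozen instead.
trainable = [init.name for init in onnx_model.graph.initializer
             if init.name.endswith(("weight", "bias"))]
frozen = [init.name for init in onnx_model.graph.initializer if init.name not in trainable]

artifacts.generate_artifacts(
    onnx_model,
    requires_grad=trainable,
    frozen_params=frozen,
    loss=artifacts.LossType.MSELoss,       # built-in ONNX loss attached to the "similarity" output
    optimizer=artifacts.OptimType.AdamW,
    artifact_directory="training_artifacts",
)
```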
It seems to be unhappy about the output node, but it's unclear to me what the issue is. I checked the graph in Netron and the output node has the dimensions I would expect.
With some more experimentation, it seems the problem comes from relying on the pretrained transformer model: it works for inference, but raises this error at artifact generation time for training, apparently because onnxruntime has trouble constructing the gradient graph.
Swapping all-MiniLM for DistilBERT (https://huggingface.co/docs/transformers/en/model_doc/distilbert) gives the same error.