Add LLava ONNX export #1790
base: main
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
```python
if config._behavior == "encoder":
    # Embed the text tokens.
    inputs_embeds = model.get_input_embeddings()(input_ids)

    # Encode the image and take the hidden states of the requested vision layer.
    image_outputs = model.vision_tower(pixel_values, output_hidden_states=True)
    selected_image_feature = image_outputs.hidden_states[vision_feature_layer]

    if vision_feature_select_strategy == "default":
        selected_image_feature = selected_image_feature[:, 1:]
    elif vision_feature_select_strategy == "full":
        selected_image_feature = selected_image_feature
    else:
        raise ValueError(f"Unexpected select feature strategy: {vision_feature_select_strategy}")

    # Project the image features into the language-model embedding space and
    # splice them into the token embeddings at the <image> placeholder positions.
    image_features = model.multi_modal_projector(selected_image_feature)
    inputs_embeds, attention_mask, labels, position_ids = model._merge_input_ids_with_image_features(
        image_features, inputs_embeds, input_ids, attention_mask, None
    )

    result = {
        "inputs_embeds": inputs_embeds,
        "decoder_attention_mask": attention_mask,
        "position_ids": position_ids,
    }
```
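For context, a minimal sketch of how this exported "encoder" part could be driven with onnxruntime, assuming the ONNX input/output names match the variables above (the file name, tensor names, and dummy values are assumptions, not part of this PR):

```python
# Hypothetical usage sketch -- file and tensor names are assumptions.
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("encoder_model.onnx")

# input_ids contains the prompt with an <image> placeholder token,
# pixel_values is the preprocessed image batch.
inputs_embeds, decoder_attention_mask, position_ids = session.run(
    ["inputs_embeds", "decoder_attention_mask", "position_ids"],
    {
        "input_ids": np.array([[1, 32000, 306, 4966]], dtype=np.int64),
        "pixel_values": np.zeros((1, 3, 336, 336), dtype=np.float32),
        "attention_mask": np.ones((1, 4), dtype=np.int64),
    },
)
```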
I might not be understanding this 100%, but won't this be problematic for generation? We would need to re-pass the image features on every forward pass, which will merge the ids every time. This also means that we cannot embed a single text token (e.g., the one just generated).
Here's an example of a hand-crafted version of a tiny random LlavaForConditionalGeneration: https://huggingface.co/Xenova/tiny-random-LlavaForConditionalGeneration. There are 3 models exported:
- embed_tokens.onnx - just the token embedding layer
- decoder_model_merged.onnx - the causal LM
- vision_encoder.onnx - the vision encoder
I've got this working with Transformers.js (v3), where the concatenation of the token/vision-patch embeddings is done in JavaScript.
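To make that split concrete, here is a rough Python sketch of the prefill step using the three models above (the actual implementation runs in JavaScript via Transformers.js; the tensor names, the assumption that vision_encoder.onnx already includes the multimodal projector, and the omission of past_key_values/use_cache inputs are all simplifications):

```python
# Rough sketch of the three-model approach described above -- names are assumptions.
import numpy as np
import onnxruntime as ort

embed = ort.InferenceSession("embed_tokens.onnx")
vision = ort.InferenceSession("vision_encoder.onnx")
decoder = ort.InferenceSession("decoder_model_merged.onnx")

def prefill(input_ids, pixel_values, image_token_id):
    # 1) Embed all prompt tokens, 2) encode the image, 3) splice the patch
    #    embeddings in at the <image> placeholder position -- done outside ONNX.
    token_embeds = embed.run(None, {"input_ids": input_ids})[0]
    image_embeds = vision.run(None, {"pixel_values": pixel_values})[0]

    pos = int(np.where(input_ids[0] == image_token_id)[0][0])
    inputs_embeds = np.concatenate(
        [token_embeds[:, :pos], image_embeds, token_embeds[:, pos + 1 :]], axis=1
    )
    attention_mask = np.ones(inputs_embeds.shape[:2], dtype=np.int64)
    # past_key_values / cache inputs omitted for brevity.
    return decoder.run(None, {"inputs_embeds": inputs_embeds, "attention_mask": attention_mask})

# On later steps only the newly generated token goes through embed_tokens.onnx,
# and the decoder reuses past_key_values -- the image is never re-encoded.
```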
Hi @xenova, it should not be a problem for generation.
I generate the following three models:
- `encoder_model.onnx` - token embed + vision tower + projection + merging
- `decoder_model.onnx` - language model only (the export is the same as the current decoder export in Optimum)
- `decoder_input_processor.onnx` - token embed + decoder input generation when `past_key_values` is available (the `elif` part in the modeling code)
The naming of the models could possibly be updated.
This is how I use the models for inference: https://gist.github.com/mht-sharma/290f7bf9052e92023b4136c6fefd6717
ONNX Model: https://huggingface.co/mohitsha/llava-1.5-7b-hf/tree/main
In this version:
- I do all calculations as part of ONNX.
- The embedding model is duplicated, but it is comparatively small. If we want, we could have 2 additional options for this part:
  a. Create a separate `embed_model.onnx` and keep the rest the same. We would then have 4 ONNX models.
  b. Create a separate `embed_model.onnx`, do the `past_key_values`-stage attention_mask and position_ids processing in Python code, and remove `decoder_input_processor.onnx`.
Let me know WDYT and if you have any suggestions.
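A hedged sketch of the control flow this three-model split implies (see the linked gist for the real implementation; the tensor names and the omission of past_key_values plumbing are assumptions here):

```python
# Simplified sketch of the described flow -- see the linked gist for the actual code.
import onnxruntime as ort

encoder = ort.InferenceSession("encoder_model.onnx")        # embed + vision tower + projector + merge
decoder = ort.InferenceSession("decoder_model.onnx")        # language model only
input_proc = ort.InferenceSession("decoder_input_processor.onnx")  # embed + inputs once a cache exists

def prepare_decoder_inputs(input_ids, pixel_values, attention_mask, has_cache):
    """Return inputs_embeds / attention_mask / position_ids for the decoder."""
    if not has_cache:
        # Prefill: merge text-token and image-patch embeddings inside ONNX.
        return encoder.run(None, {
            "input_ids": input_ids,
            "pixel_values": pixel_values,
            "attention_mask": attention_mask,
        })
    # Decode: only the last generated token is embedded; the attention_mask and
    # position_ids are extended to cover the cached sequence.
    return input_proc.run(None, {
        "input_ids": input_ids[:, -1:],
        "attention_mask": attention_mask,
    })
```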
It might also be a good idea to generalize for other
Could you please give me the code for converting LLaVA into ONNX?
I'm asking because I get an error: RuntimeError: The size of tensor a (4112) must match the size of tensor b (32) at non-singleton dimension 3
Traceback (most recent call last):
I'm running this
Hi @Pengjie-W, I will have a look later today or Monday!
@Pengjie-W onnxruntime-1.17.3 seems to work. Could you please give it a try? EDIT: The latest commit fixes the export for ORT 1.18 too.
How can I export an ONNX model from llava-1.5-7b-hf? Environment:
This PR has been marked as stale because it has been open for 90 days with no activity. This thread will be automatically closed in 30 days if no further activity occurs.
What does this PR do?
As per title!
Issue: (#1751)
Before submitting