
[FR] Support of more HuggingFace embedders for multimodality #28090

Open
eostis opened this issue Aug 20, 2023 · 12 comments
eostis commented Aug 20, 2023

My goal is to build a unique multimodal WooCommerce search experience, with Vespa multivectors and a hybrid ranking on text BM25, text vectors, and image vectors.

For instance, e-commerce can use:

  • text-to-image (CLIP): search images
  • text-to-text (sentence transformers): search texts
  • image-to-image (ResNet): find similar images.

Of course, sounds and videos are also a possibility.
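
A rough sketch of the kind of hybrid query this implies, using pyvespa. The field names (`text_embedding`, `image_embedding`), the `hybrid` rank profile, and the 512-dimensional placeholder vectors are all assumptions, not an existing API:

```python
# Sketch of a hybrid multimodal query with pyvespa. Field names, the
# "hybrid" rank profile, and the 512-dim placeholder vectors are assumptions;
# the query vectors would be computed client-side (e.g. with CLIP).
from vespa.application import Vespa

app = Vespa(url="http://localhost", port=8080)
response = app.query(body={
    "yql": (
        "select * from sources * where userQuery() "
        "or ({targetHits:100}nearestNeighbor(text_embedding, q_text)) "
        "or ({targetHits:100}nearestNeighbor(image_embedding, q_image))"
    ),
    "query": "red leather handbag",      # drives BM25 via userQuery()
    "ranking": "hybrid",                 # combines BM25 + closeness scores
    "input.query(q_text)": [0.1] * 512,  # placeholder text vector
    "input.query(q_image)": [0.2] * 512, # placeholder image vector
})
print(response.hits[0])
```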

So far, I have implemented a text-to-text demo: https://demo-woocommerce-cloudways-2k-vespa-transformers.wpsolr.com/shop/

But HF image embedders are not available yet, as far as I can tell from the documentation and blog.

The blog examples require external Python code to produce the image vectors.
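
For example, a minimal sketch of that external step; the `product` schema, the `image_embedding` tensor field, and the endpoint are hypothetical names for illustration:

```python
# Sketch: produce a CLIP image vector outside Vespa and feed it in.
# The "product" schema, "image_embedding" field, and endpoint are assumptions.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor
from vespa.application import Vespa

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("product.jpg")
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    vector = model.get_image_features(**inputs)[0].tolist()  # 512 floats

app = Vespa(url="http://localhost", port=8080)
app.feed_data_point(
    schema="product",
    data_id="sku-123",
    fields={"image_embedding": {"values": vector}},
)
```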

@jobergum

Makes sense. CLIP has two parts, image encoding and text encoding, which are handled by two different neural networks.

We could fit the text transformer model into the existing embed framework, as is already done in multiple Vespa sample applications, but image encoding would not fit the existing embed functionality, which takes a string or an array of strings as input.
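
A small sketch of that split, using the Hugging Face transformers CLIP classes (model name illustrative): the text tower consumes strings, which a string-based embed framework could pass through, while the image tower consumes pixel tensors:

```python
# CLIP is two networks: a text encoder fed strings and an image encoder
# fed pixel tensors; only the former matches a string-based embed API.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

text_inputs = processor(text=["red leather handbag"],
                        return_tensors="pt", padding=True)
text_vec = model.get_text_features(**text_inputs)     # input: strings

image_inputs = processor(images=Image.open("bag.jpg"), return_tensors="pt")
image_vec = model.get_image_features(**image_inputs)  # input: pixels
```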

@jobergum

So if you are fine with just having the text side of the text-to-image model in Vespa, we can create that type of example using the HF-embedder functionality.


eostis commented Aug 21, 2023

With the same process?

  • Export the HF CLIP .onnx (see the sketch after this list)
  • Set up the container's HF component in services.xml
  • Define the embedding field in the .sd schema, with the participating input text fields and input images
  • Add a closeness rank profile
  • Define the YQL query with nearestNeighbor() and ranking
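
For the export step, a minimal sketch that exports only CLIP's text tower (the model name, dummy input, and output path are assumptions; the Hugging Face Optimum exporter is an alternative route):

```python
# Sketch: export only CLIP's text tower to ONNX for use in a Vespa
# embedder component. Model name and output path are illustrative.
import torch
from transformers import CLIPTextModelWithProjection, CLIPTokenizer

model_id = "openai/clip-vit-base-patch32"
model = CLIPTextModelWithProjection.from_pretrained(model_id)
model.config.return_dict = False  # plain tuple outputs simplify ONNX export
model.eval()
tokenizer = CLIPTokenizer.from_pretrained(model_id)

dummy = tokenizer(["a photo of a handbag"], return_tensors="pt")
torch.onnx.export(
    model,
    (dummy["input_ids"], dummy["attention_mask"]),
    "clip_text_encoder.onnx",
    input_names=["input_ids", "attention_mask"],
    output_names=["text_embeds"],  # the projected text embedding
    dynamic_axes={
        "input_ids": {0: "batch", 1: "sequence"},
        "attention_mask": {0: "batch", 1: "sequence"},
    },
)
```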

@jobergum

To handle image data, we would have to create a new type of embedder functionality.


eostis commented Aug 21, 2023

Exactly! It would also prepare Vespa for further modalities: audio, video, ...


eostis commented Aug 22, 2023

Apparently I was a bit ahead of my time: 7-modality (ImageBind) is here.

@frodelu added this to the later milestone Aug 23, 2023
@jobergum

ImageBind is interesting, but I do recommend looking at the licensing :)


eostis commented Aug 23, 2023

@AriMKatz

Does Vespa support multimodality currently?

@jobergum

Hey @AriMKatz,

We currently do not expose any provided embedders that are multimodal; the provided embedder models are text-only.

This doesn't mean that you cannot use multimodal representations with Vespa; for example, see this recent example of a multimodal model: PDF Retrieval with Vision Language Models (ColPali).

@alpha-javed

Hey @jobergum, does Vespa support multimodality currently?

@jobergum

See my comment above and #32389; the native built-in embedders are currently text-only.
