-
Notifications
You must be signed in to change notification settings - Fork 10.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
server : output embeddings for all tokens when pooling = none #10861
Conversation
examples/server/server.cpp
Outdated
|
||
virtual int get_index() override { | ||
return index; | ||
} | ||
|
||
virtual json to_json() override { | ||
if (embedding.size() == 1){ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if (embedding.size() == 1){ | |
if (embedding.size() == 1) { |
Not sure if I understand correctly, but the case where the embd vector has one element is for reranking, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
embedding
is now a vector of vectors. Here we check if there is only one vector in the results, not that the embedding size is 1 element. So far, all paths returned a single embedding vector. With this change, if --pooling none
and there are more than 1 tokens in the input batch, the result will be a vector of embedding vectors.
For reranking, we output a single float which is the score. It is enabled only for --pooling rank
. It should not be affected by this change.
@@ -45,6 +45,18 @@ def test_embedding_multiple(): | |||
assert len(d['embedding']) > 1 | |||
|
|||
|
|||
def test_embedding_pooling_none(): | |||
server = ServerPreset.bert_bge_small(pooling = 'none') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can also do:
global server
server.pooling = 'none'
server.start()
Just pointing the obvious here, but using the server to obtain the embeddings of every token for something like TTS is extremely inefficient and should not be done if there is a better way. |
Yes, the overhead from the JSON protocol is quite substantial for this use case. Adding support for |
Btw if you plan to implement this functionality for TTS, I would suggest another method: we can make an API endpoint that accepts input text and output a stream of wav of ogg audio.
The reason why it's not there because OAI itself (the official closed-source API) don't have this capability, mostly due to too much overhead for network bandwidth. Most TTS API out there just take text as input and output audio stream anyway, so I think it will be pretty much the norm now. |
We should also consider that this will not make the API OpenAI-compatible, it will just allow applications that are (unwittingly) using pooling type
|
Will change the
We can do that as well. As long as the endpoint can send raw data that is not JSON. Otherwise there is not a significant benefit compared to streaming embeddings since the raw spectrograms (i.e. the embeddings) are just 4 times larger than the raw audio (x2 from F32 -> I16 and x2 from the IRFFT). A potential problem with this approach might be that the logic for converting the spectrogram to audio would have to be on the server-side and I don't know yet how many implementations of that will be in the future. I'm currently playing with OuteTTS, but other models do other post-processing, so these would have to be implemented, maybe in |
Hmm yeah I think we can make a simple POC API that returns the whole audio file (either wav or mp3 or ogg). Given the fact that in the other PR, you already be able to export the audio file, I think this should be simple to bring it into server (?) On sending the response from server, we just need to set mimetype to
Yes I think it make sense to put it into |
The |
server.start() | ||
client = OpenAI(api_key="dummy", base_url=f"http://{server.server_host}:{server.server_port}") | ||
client = OpenAI(api_key="dummy", base_url=f"http://{server.server_host}:{server.server_port}/v1") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had to add the /v1
suffix for these calls, otherwise the client will hit the /embeddings
endpoint which is not OAI compatible. Is this the correct fix?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes that's correct, I haven't notice that before. The original OAI code has /v1
in its default value: https://github.com/openai/openai-python/blob/e94d98e9bf97a5d2d02d79d58f2abdbab26ff2bd/src/openai/_client.py#L117
We should change all other recurrence too (in another PR)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
btw for embeddings endpoint we also have the switch between non-OAI and OAI using the presence of content
and input
fields in the response. Is it still relevant now?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've modified this to be able to use both endpoints with both inputs. And I'm thinking to also make the change that content
is alias to input
- maybe we can do this directly in #10866
2230786
to
2a5510e
Compare
To make the llama.cpp/examples/server/server.cpp Lines 3701 to 3711 in 2a5510e
For the @ngxson agree? |
This should be ready to merge. We can do the strong OAI-compat of |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. We should tag breaking change too, in case some users using /embeddings
in OAI-compat clients.
…nov#10861) * server : add "tokens" output ggml-ci * server : output embeddings for all tokens when pooling = none ggml-ci * server : update readme [no ci] * server : fix spacing [no ci] Co-authored-by: Xuan Son Nguyen <[email protected]> * server : be explicit about the pooling type in the tests ggml-ci * server : update /embeddings and /v1/embeddings endpoints ggml-ci * server : do not normalize embeddings when there is no pooling ggml-ci * server : update readme ggml-ci * server : fixes * tests : update server tests ggml-ci * server : update readme [no ci] * server : remove rebase artifact --------- Co-authored-by: Xuan Son Nguyen <[email protected]>
When using an embedding model with pooling type
none
, the/embeddings
endpoint will now output a vector of unnormalized embedding vectors - one for each input token. The/v1/embeddings
OAI-compatible endpoint does not support this and will return an error when trying to use a model with pooling typenone
.API Changes
/embedding
endpoint now returns a vector of embedding vectors. See the server readme for more information.TODO: implement
"encoding_format": "base64"
to reduce the output size of this endpoint.