improve configurability of embedding and LLM model sources #169
Comments
What is the difference between a project and a table?
I was using "project" and "job_name" (the parameter in vectorize.table()) interchangeably. Maybe we should rename it to "project". A table can have multiple of the "jobs". Tables have column(s) that get transformed into embeddings using the
This is currently possible and we'd want to preserve it going forward. Currently the model provider is determined by the name of the model passed into the function call, and it works the same for embeddings and LLMs. For example, the prefix on the model name selects the provider, as in the hypothetical calls sketched below.
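A rough sketch of that prefix-based routing; the job names and the second model name are placeholders, not values from this thread:

```sql
-- hypothetical calls: the prefix before the '/' selects the provider
vectorize.table(
    table => 'mytable',
    job_name => 'project_hosted',
    transformer => 'openai/text-embedding-3-small'  -- routed to the OpenAI provider
);

vectorize.table(
    table => 'mytable',
    job_name => 'project_selfhosted',
    transformer => 'ollama/some-local-model'        -- routed to the Ollama provider
);
```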
I think we could do something like the below for multiple language models:

```sql
vectorize.table(
    table => 'mytable',
    job_name => 'project_persian',
    transformer => 'ollama/persian-model'
);

vectorize.table(
    table => 'mytable',
    job_name => 'project_arabic',
    transformer => 'ollama/arabic-model'
);
```
Images are not yet supported, but there are plans to implement them soon.
I suggest building a simple secret manager in Postgres with tools like pgcrypto or pgsodium. Here's the plan:
The table structure would store user-level secrets, so it doesn't have to be a super-user table. This way, each user can securely store and manage their own API keys. Also, since transformer and chat_model are similar (they're the same kind of resource but respond to different requests), we could set up a single table, model_resource, for both. A rough sketch of the secrets side is below. What do you think, @ChuckHend?
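As a minimal sketch of the secrets piece using pgcrypto (the schema, table, and column names are assumptions for illustration, not anything that exists in the extension today):

```sql
CREATE EXTENSION IF NOT EXISTS pgcrypto;

-- user-level secret storage; each user manages their own API keys
CREATE TABLE vectorize.user_secrets (
    id          BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    owner       TEXT NOT NULL DEFAULT current_user,
    secret_name TEXT NOT NULL,
    secret_val  BYTEA NOT NULL,  -- output of pgp_sym_encrypt
    created_at  TIMESTAMPTZ NOT NULL DEFAULT now(),
    UNIQUE (owner, secret_name)
);

-- store an API key encrypted with a symmetric passphrase
INSERT INTO vectorize.user_secrets (secret_name, secret_val)
VALUES ('openai_api_key', pgp_sym_encrypt('sk-...', 'a-passphrase'));

-- decrypt at read time
SELECT pgp_sym_decrypt(secret_val, 'a-passphrase') AS api_key
FROM vectorize.user_secrets
WHERE owner = current_user AND secret_name = 'openai_api_key';
```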
I like this. What would a model_resource row look like for OpenAI, since the base url and api key would apply to both an LLM type and an embeddings type, whereas some other providers might offer just embeddings, or just an LLM?
We have some difficulty here: some LLM providers restrict which embedding models can be used alongside them, so I need to think about it.
I think there are two ways to do this.

First way: using compatible_models as a column. We can add a compatible_models column to model_resource that lists the models it can be paired with, as sketched below.
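One possible shape for that first approach, assuming model_resource lives in the vectorize schema and is keyed by UUID (both assumptions):

```sql
-- hypothetical: store compatible model ids directly on model_resource
ALTER TABLE vectorize.model_resource
    ADD COLUMN compatible_models UUID[] NOT NULL DEFAULT '{}';
```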
The second way is making another table for model compatibility, for example (a DDL sketch follows the table):
| Column Name | Data Type | Description |
|---|---|---|
| model_id | UUID | Foreign key to model_resource (model) |
| compatible_model_id | UUID | Foreign key to model_resource (compatible) |
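A hypothetical DDL sketch of that second approach, assuming model_resource sits in the vectorize schema with a UUID primary key named id:

```sql
-- join table relating a model to the models it can be paired with
CREATE TABLE vectorize.model_compatibility (
    model_id            UUID NOT NULL REFERENCES vectorize.model_resource (id),
    compatible_model_id UUID NOT NULL REFERENCES vectorize.model_resource (id),
    PRIMARY KEY (model_id, compatible_model_id)
);
```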
@tavallaie, what are some examples of values that would go in that table?
In my design, there is no difference between embedding models and LLMs or even images and audio.
Ok, I think I might see where you are going with that. Can you provide an example of what that table might look like in your use case?
I am thinking of something like this:
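From the surrounding comments, the idea appears to be a single model_resource table that treats embedding, LLM, image, and audio models uniformly. A hypothetical sketch, with every column name assumed:

```sql
-- one registry for all model kinds; no separate tables per model type
CREATE TABLE vectorize.model_resource (
    id             UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    name           TEXT NOT NULL UNIQUE,  -- e.g. 'openai/text-embedding-3-small'
    base_url       TEXT NOT NULL,         -- e.g. 'https://api.openai.com/v1'
    api_key_secret TEXT                   -- reference into the user-level secrets table
);
```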
Thank you. Do the APIs need to change how they reference a model then, or how does this impact vectorize.table(), vectorize.init_rag(), and others?
I don't think we need to change them; because names are unique, we can look models up by name.
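For instance, assuming the hypothetical model_resource sketch above with a unique name column:

```sql
-- look a model up by its unique name rather than by id
SELECT * FROM vectorize.model_resource
WHERE name = 'ollama/persian-model';
```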
Ok cool, I like this. It'll be a fairly large code change I think. For the "OpenAI compatible" providers, it will probably be more performant to keep the existing hardcoded request/response handling. I think I'm on board with this overall design, btw. Some of it will end up being a fairly large code change; do you think we can break it up into a few smaller PRs?
Let's start with compatible providers, like adding a provider column to our model table. That way we cover the OpenAI and Ollama providers that most people use, and we can put our effort into supporting vLLM, self-hosted Ollama, LM Studio, etc.
Like the sketch below: when creating a job, we automatically decide which provider should be used.
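A hypothetical sketch of that incremental step, reusing the assumed model_resource table from above:

```sql
-- add a provider column so job creation can route without hardcoded URLs
ALTER TABLE vectorize.model_resource
    ADD COLUMN provider TEXT NOT NULL DEFAULT 'openai';  -- e.g. 'openai', 'ollama'

-- when a job is created, resolve the provider from the model chosen for the job
SELECT provider, base_url
FROM vectorize.model_resource
WHERE name = 'openai/text-embedding-3-small';
```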
So maybe providers and models are separate tables then? In the above, won't we end up with almost identical records since there is also 'openai/text-embedding-3-small'?
We have providers in Rust, as mentioned in #152, so maybe we can change those to be compatible with this model instead of hardcoding them.
Do you have any sense of the performance difference between using request/response mapping vs. hardcoding?
Not really; we should run a few tests.
Issue is WIP and will be further refined.
LLM and embedding model sources are currently defined in GUCs; e.g. `vectorize.openai_service_url = https://api.openai.com/v1` contains the base url for OpenAI. This implementation introduces at least two limitations: for example, a GUC can hold only one value, but project_a may want `vectorize.openai_service_url = https://api.openai.com/v1` while project_b wants `vectorize.openai_service_url = https://myapi.mydomain.com/v1`. The pg vectorize background worker reads the model source from the job's values in the vectorize.job table in order to know which GUC to use.

Proposal: move the GUCs into a table such as `vectorize.model_sources` that contains information such as the base url, schema, etc.

Considerations:
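One consideration is which columns such a table would need. A rough, hypothetical layout (none of these names are from the issue):

```sql
CREATE TABLE vectorize.model_sources (
    id         BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    provider   TEXT NOT NULL,                   -- e.g. 'openai', 'ollama'
    base_url   TEXT NOT NULL,                   -- e.g. 'https://api.openai.com/v1'
    api_schema TEXT NOT NULL DEFAULT 'openai',  -- request/response schema the endpoint speaks
    UNIQUE (provider, base_url)
);
```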