bring your own embeddings #149
Comments
💎 $150 bounty • Tembo
/attempt #149
Can I get assigned? @ChuckHend
@onyedikachi-david, we've generally been working with whichever PR is opened first. Once you have your first contribution merged I'd be willing to start assigning to you if it helps.
So does this mean that when the model is the same, vectorize.table() shouldn't generate new embeddings, but should instead use the ones that we already generated in an earlier project?
@ChuckHend Also, of course, while preventing the embedding generation jobs from being created.
@Neptune650, that's right. Use the embeddings that were already generated (but not generated by pg_vectorize), and since embeddings are already generated, we do not create the embedding generation jobs. But we should create the triggers (insert trigger and update trigger) so that when we insert new records or update records, new embeddings ARE generated using the same model.
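To make the rule in the comment above concrete, here is a minimal sketch of the decision it describes: triggers are always created, and the initial embedding jobs are skipped only when embeddings already exist. `SetupPlan` and `plan_setup` are hypothetical names used for illustration; they are not part of pg_vectorize.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SetupPlan:
    """Hypothetical summary of what project setup should do."""
    create_insert_trigger: bool
    create_update_trigger: bool
    enqueue_initial_jobs: bool


def plan_setup(has_existing_embeddings: bool) -> SetupPlan:
    """Decide the setup actions for a new project.

    Triggers are always created so that inserted or updated rows get
    fresh embeddings from the same model. The initial bulk jobs are
    skipped when the user brings embeddings that already exist.
    """
    return SetupPlan(
        create_insert_trigger=True,
        create_update_trigger=True,
        enqueue_initial_jobs=not has_existing_embeddings,
    )
```

With existing embeddings, `plan_setup(True)` keeps both triggers and disables only the initial jobs; `plan_setup(False)` matches the current behavior of generating everything.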
@ChuckHend In that case, do you think adding an "embeddings" parameter to vectorize.table() would be appropriate?
That could work. Are you thinking the embeddings parameter would accept a column name where the embeddings already exist?
Correct, that would be the way I'd implement it.
That sounds good to me, but it could get complicated since we'd want to support embeddings in a column on the source table or embeddings in another table with a foreign key. If you can figure it out, I think it would be a good solution. Alternatively, that could be a flag like
@ChuckHend Well if we're implementing the latter, would we only not initialize embeddings or nothing at all? Also would copying/moving embeddings just be a SQL "INSERT INTO"? |
If it's the I'm not certain how we'd handle the cron job if the schedule parameter is set to a cron expression. Embeddings would need to be inserted before the first scheduled job is executed.
But since it's up to the user to do that, how could we? |
Yes, it is up to the user, but if it's almost impossible to get the embeddings inserted before the cron job runs, then that isn't a very good experience.
Provide a feature or tooling that lets a user take embeddings from one table and have pg_vectorize manage them. For example, assume a user already has a table with a `content` column and an `embeddings` column generated from the `sentence-transformers/all-MiniLM-L6-v2` model. Rather than recomputing embeddings for the entire `content` column, we should be able to just insert the existing ones into the new embeddings table or column. I think it would be safe and fairly straightforward to manually insert embeddings into `vectorize.<project_name>_embeddings` after the project is created. If the project is using `schedule => 'realtime'`, then creating a new project on a table will immediately create jobs to generate embeddings for all the text, so we might want to delete those jobs if we don't want to execute them. In summary, I think the steps to do this could be:
1. Create the project with `vectorize.table()`
2. Insert the existing embeddings into `vectorize.<project_name>_embeddings`
3. Delete the queued jobs: `delete from pgmq where message ->> 'name' = '<project_name>'`
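As a rough sketch of the manual workflow summarized above, the helper below builds the SQL for the copy and the queue cleanup that would run after the project has been created with `vectorize.table()`. Several things here are assumptions rather than confirmed pg_vectorize behavior: the helper name `byo_embeddings_sql` is hypothetical, the target column is assumed to be called `embeddings`, and the table names `vectorize.<project_name>_embeddings` and `pgmq` are written exactly as in the issue text, so the real object names may differ.

```python
import re

# Accept only plain, unquoted lowercase identifiers so that the names
# interpolated into the SQL cannot carry an injection payload.
_IDENT = re.compile(r"^[a-z_][a-z0-9_]*$")


def byo_embeddings_sql(project, source_table, pk_col, emb_col):
    """Return [copy_statement, purge_statement] for an existing project.

    Assumes the project was already created with vectorize.table().
    """
    for name in (project, source_table, pk_col, emb_col):
        if not _IDENT.match(name):
            raise ValueError(f"unsafe identifier: {name!r}")

    # Target table name follows the issue text; the `embeddings`
    # column name in the target is an assumption.
    target = f"vectorize.{project}_embeddings"
    copy_stmt = (
        f"INSERT INTO {target} ({pk_col}, embeddings) "
        f"SELECT {pk_col}, {emb_col} FROM {source_table};"
    )

    # Queue table written as `pgmq`, as in the issue; the actual
    # queue table name may be different.
    purge_stmt = f"DELETE FROM pgmq WHERE message ->> 'name' = '{project}';"
    return [copy_stmt, purge_stmt]
```

Running the purge after the copy drops the jobs that a `realtime` project enqueues at creation time, so the existing embeddings are not recomputed. With a cron schedule, both statements would need to run before the first scheduled job fires, which is the timing concern raised in the comments.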