bring your own embeddings #149

Open
ChuckHend opened this issue Oct 11, 2024 · 15 comments
Labels
💎 Bounty, documentation (Improvements or additions to documentation), hacktoberfest

Comments

@ChuckHend
Member

Provide a feature or tooling that lets a user take embeddings from an existing table and hand them over to pg_vectorize to manage. For example, assume a user already has a table with a content column and an embeddings column generated from the sentence-transformers/all-MiniLM-L6-v2 model. Rather than recomputing embeddings for the entire content column, we should be able to just insert the existing ones into the new embeddings table or column. I think it would be safe and fairly straightforward to manually insert embeddings into vectorize.<project_name>_embeddings after the project is created. If the project is using schedule => 'realtime', then creating a new project on a table will immediately create jobs to generate embeddings for all the text, so we might want to delete those jobs if we don't want them to run. In summary, I think the steps to do this could be:

  1. create the vectorize project by calling vectorize.table()
  2. insert embeddings into the embedding column on vectorize.<project_name>_embeddings
  3. optionally delete from pgmq where message ->> 'name' = '<project_name>'
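
The three steps above can be sketched in SQL roughly as follows. This is a minimal sketch, not a tested recipe. The source table `documents`, its columns `id`, `content`, and `embedding`, and the project name `my_project` are hypothetical. The `vectorize.table()` argument names follow the project README from around the time of this issue and may differ by version. The target column names and the queue table `pgmq.q_vectorize_jobs` are assumptions, since the issue only gives the `vectorize.<project_name>_embeddings` pattern and a pseudo-query against pgmq.

```sql
-- 1. Create the project (argument names may differ by pg_vectorize version).
SELECT vectorize.table(
    job_name    => 'my_project',
    "table"     => 'documents',
    primary_key => 'id',
    columns     => ARRAY['content'],
    transformer => 'sentence-transformers/all-MiniLM-L6-v2',
    schedule    => 'realtime'
);

-- 2. Copy the pre-computed vectors into the project's embeddings table.
--    Table name follows the pattern given in this issue; the column
--    names (id, embeddings) are assumptions.
INSERT INTO vectorize.my_project_embeddings (id, embeddings)
SELECT id, embedding
FROM documents
WHERE embedding IS NOT NULL;

-- 3. Optionally remove the queued embedding jobs for this project.
--    The queue table name is an assumption; the filter is the one
--    suggested in this issue.
DELETE FROM pgmq.q_vectorize_jobs
WHERE message ->> 'name' = 'my_project';
```

With schedule => 'realtime', some jobs may already have been picked up between step 1 and step 3, so the delete in step 3 is best effort.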
@ChuckHend added the documentation label Oct 11, 2024

algora-pbc bot commented Oct 17, 2024

💎 $150 bounty • Tembo

Steps to solve:

  1. Start working: Comment /attempt #149 with your implementation plan
  2. Submit work: Create a pull request including /claim #149 in the PR body to claim the bounty
  3. Receive payment: 100% of the bounty is received 2-5 days post-reward. Make sure you are eligible for payouts

Thank you for contributing to tembo-io/pg_vectorize!


| Attempt | Started (GMT+0) | Solution |
| --- | --- | --- |
| 🟢 @onyedikachi-david | Oct 17, 2024, 2:12:37 PM | WIP |

@onyedikachi-david

onyedikachi-david commented Oct 17, 2024

/attempt #149

Algora profile: @onyedikachi-david · Completed bounties: 10 bounties from 5 projects · Tech: TypeScript, Python, JavaScript & more

@onyedikachi-david

Can I get assigned? @ChuckHend

@ChuckHend
Member Author

@onyedikachi-david, we've generally been working with whichever PR is opened first. Once you have your first contribution merged I'd be willing to start assigning to you if it helps.

@Neptune650

So does this mean that when the model is the same, vectorize.table() shouldn't generate new embeddings, but instead use the ones that we already generated in an earlier project?

@Neptune650

@ChuckHend And, of course, while also preventing the embedding generation jobs from being created.

@ChuckHend
Member Author

@Neptune650, that's right -- use the embeddings that were already generated (but not generated by pg_vectorize), and since the embeddings already exist, we do not create the embedding generation jobs. But we should still create the triggers (insert trigger and update trigger) so that when records are inserted or updated, new embeddings ARE generated using the same model.

@Neptune650

@ChuckHend In that case, do you think adding an "embeddings" parameter to vectorize.table() would be appropriate?

@ChuckHend
Member Author

That could work. Are you thinking the embeddings parameter would accept a column name where the embeddings already exist?

@Neptune650

That could work. Are you thinking the embeddings parameter would accept a column name where the embeddings already exist?

Correct, that would be the way I'd implement it
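
For illustration, a call with the proposed parameter might look like the sketch below. The `embeddings` argument is hypothetical and does not exist in pg_vectorize; it is only the idea discussed here. The table `documents` and its columns are also made up, and the other argument names follow the README of that period and may differ by version.

```sql
-- Hypothetical API: "embeddings" names the column that already holds
-- vectors, so no initial embedding jobs would be created for it.
SELECT vectorize.table(
    job_name    => 'my_project',
    "table"     => 'documents',
    primary_key => 'id',
    columns     => ARRAY['content'],
    transformer => 'sentence-transformers/all-MiniLM-L6-v2',
    schedule    => 'realtime',
    embeddings  => 'embedding'  -- proposed parameter, not implemented
);
```

Under this proposal the insert and update triggers would still be created, so new or changed rows get embeddings from the same model.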

@ChuckHend
Member Author

That sounds good to me, but it could get complicated, since we'd want to support embeddings in a column on the source table as well as embeddings on another table linked by a foreign key. If you can figure it out, I think it would be a good solution.

Alternatively, that could be a flag like init=false, combined with documentation on how to copy or move embeddings.
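
To make the foreign-key case concrete, a copy step for embeddings that live on a separate table could look roughly like this. All names are hypothetical: `documents` is the source table, `document_embeddings` holds the vectors and references it through `document_id`, and the target table and column names are assumed from the `vectorize.<project_name>_embeddings` pattern mentioned in this issue.

```sql
-- Embeddings stored on another table that references the source
-- table by foreign key: join back to the source primary key.
INSERT INTO vectorize.my_project_embeddings (id, embeddings)
SELECT d.id, e.embedding
FROM documents AS d
JOIN document_embeddings AS e
  ON e.document_id = d.id
WHERE e.embedding IS NOT NULL;
```

A single column-name parameter could not describe this layout on its own, which is part of why the documented manual copy may be the simpler route.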

@Neptune650

@ChuckHend Well, if we're implementing the latter, would we skip only the embedding initialization, or skip initialization entirely? Also, would copying/moving embeddings just be a SQL "INSERT INTO"?

@ChuckHend
Member Author

If we go with init=false, there are still other operations we'd want to happen, like creating the triggers and creating tables or columns (depending on the relevant parameter values). Then yes, I think it would be just an insert.

I'm not certain how we'd handle the cron job if the schedule parameter is set to a cron expression. Embeddings would need to be inserted before the first scheduled job is executed.
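
One possible way to approach the ordering concern, sketched under assumptions: create the project and copy the embeddings inside a single transaction, so that other sessions (including a scheduled job) should not see the new project until the vectors are already in place. The `init` flag is the hypothetical one discussed here and is not an existing argument. Table and column names are made up, and whether `vectorize.table()` behaves as intended inside an explicit transaction has not been verified.

```sql
BEGIN;

-- Hypothetical "init => false": create triggers, tables, and the
-- schedule, but skip the initial embedding jobs.
SELECT vectorize.table(
    job_name    => 'my_project',
    "table"     => 'documents',
    primary_key => 'id',
    columns     => ARRAY['content'],
    transformer => 'sentence-transformers/all-MiniLM-L6-v2',
    schedule    => '*/5 * * * *',  -- cron expression instead of 'realtime'
    init        => false           -- proposed flag, not implemented
);

-- Copy the existing vectors before anything else can observe the project.
INSERT INTO vectorize.my_project_embeddings (id, embeddings)
SELECT id, embedding
FROM documents
WHERE embedding IS NOT NULL;

COMMIT;
```

If this works as assumed, the first scheduled run would only find rows that genuinely lack embeddings.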

@Neptune650

I'm not certain how we'd handle the cron job if the schedule parameter is set to a cron expression. Embeddings would need to be inserted before the first scheduled job is executed.

But since it's up to the user to do that, how could we?

@ChuckHend
Member Author

But since it's up to the user to do that, how could we?

Yes, it is up to the user, but if it's almost impossible to get the embeddings inserted before the cron job runs, then that isn't a very good experience.


4 participants