bring your own embeddings #149

ChuckHend · 2024-10-11T05:05:07Z

Provide a feature or tooling that lets a user take embeddings from an existing table and hand them over to pg_vectorize to manage. For example, assume a user already has a table with a content column and an embeddings column generated from the sentence-transformers/all-MiniLM-L6-v2 model. Rather than recomputing embeddings for the entire content column, we should be able to insert the existing ones into the new embeddings table or column.

I think it would be safe and fairly straightforward to manually insert embeddings into `vectorize.<project_name>_embeddings` after the project is created. If the project uses `schedule => 'realtime'`, creating a new project on a table immediately creates jobs to generate embeddings for all of the text, so we might want to delete those jobs if we don't want them to run. In summary, I think the steps could be:

1. Create the vectorize project by calling `vectorize.table()`.
2. Insert embeddings into the embedding column on `vectorize.<project_name>_embeddings`.
3. Optionally, `delete from pgmq where message ->> 'name' = '<project_name>'`.
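
A minimal client-side sketch of preparing rows for the insert step, assuming the existing embeddings are available as lists of floats. The 384 dimension is the output size of all-MiniLM-L6-v2. Everything else here is an assumption for illustration: the helper names, the `my_project` project name, the `record_id` column, and the `%s` placeholder style (as used by psycopg) are hypothetical, and the table name only follows the `vectorize.<project_name>_embeddings` pattern from this issue, not a verified pg_vectorize name.

```python
import math
from typing import Iterable, List, Sequence, Tuple

# all-MiniLM-L6-v2 produces 384-dimensional vectors.
EXPECTED_DIM = 384

# Hypothetical statement: table and column names are placeholders for illustration.
INSERT_SQL = (
    "INSERT INTO vectorize.my_project_embeddings (record_id, embeddings) "
    "VALUES (%s, %s::vector)"
)


def to_pgvector_literal(vec: Sequence[float], dim: int = EXPECTED_DIM) -> str:
    """Render a vector in pgvector's text input format, e.g. '[0.1,0.2]'."""
    if len(vec) != dim:
        raise ValueError(f"expected {dim} dimensions, got {len(vec)}")
    values = [float(x) for x in vec]
    # Catch NaN and infinity here instead of failing partway through a bulk insert.
    if not all(math.isfinite(x) for x in values):
        raise ValueError("embedding contains NaN or infinity")
    return "[" + ",".join(repr(x) for x in values) + "]"


def build_insert_params(
    rows: Iterable[Tuple[int, Sequence[float]]], dim: int = EXPECTED_DIM
) -> List[Tuple[int, str]]:
    """Turn (record_id, embedding) pairs into parameter tuples for INSERT_SQL."""
    return [(record_id, to_pgvector_literal(vec, dim)) for record_id, vec in rows]
```

Checking the dimension up front means embeddings from a different model are rejected before they are mixed with the ones the project will generate later.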

algora-pbc · 2024-10-17T12:15:04Z

💎 $150 bounty • Tembo

Steps to solve:

1. Start working: Comment `/attempt #149` with your implementation plan.
2. Submit work: Create a pull request including `/claim #149` in the PR body to claim the bounty.
3. Receive payment: 100% of the bounty is received 2-5 days post-reward. Make sure you are eligible for payouts.

Thank you for contributing to tembo-io/pg_vectorize!

| Attempt | Started (GMT+0) | Solution |
| --- | --- | --- |
| 🟢 @onyedikachi-david | Oct 17, 2024, 2:12:37 PM | WIP |

onyedikachi-david · 2024-10-17T14:12:34Z

/attempt #149

| Algora profile | Completed bounties | Tech | Active attempts |
| --- | --- | --- | --- |
| @onyedikachi-david | 10 bounties from 5 projects | TypeScript, Python, JavaScript & more | |

onyedikachi-david · 2024-10-17T14:13:07Z

Can I get assigned? @ChuckHend

ChuckHend · 2024-10-21T17:42:45Z

@onyedikachi-david, we've generally been working with whichever PR is opened first. Once you have your first contribution merged I'd be willing to start assigning to you if it helps.

Neptune650 · 2024-10-25T01:45:34Z

> Provide a feature or tooling that lets a user take embeddings from an existing table and hand them over to pg_vectorize to manage. For example, assume a user already has a table with a content column and an embeddings column generated from the sentence-transformers/all-MiniLM-L6-v2 model. Rather than recomputing embeddings for the entire content column, we should be able to insert the existing ones into the new embeddings table or column. I think it would be safe and fairly straightforward to manually insert embeddings into `vectorize.<project_name>_embeddings` after the project is created. If the project uses `schedule => 'realtime'`, creating a new project on a table immediately creates jobs to generate embeddings for all of the text, so we might want to delete those jobs if we don't want them to run. In summary, I think the steps could be:
>
> 1. create vectorize by calling `vectorize.table()`
>
> 2. insert embeddings into the embedding column on `vectorize.<project_name>_embeddings`
>
> 3. optionally `delete from pgmq where message ->> 'name' = '<project_name>'`

So does this mean that when the model is the same, vectorize.table() shouldn't generate new embeddings, but instead use the ones that we already generated in an earlier project?

Neptune650 · 2024-10-25T01:45:59Z

@ChuckHend And, of course, without creating the embedding generation jobs?

ChuckHend · 2024-10-25T09:14:01Z

@Neptune650, that's right -- use the embeddings that were already generated (but not generated by pg_vectorize), and since those embeddings already exist, we do not create the embedding generation jobs. But we should still create the triggers (insert trigger and update trigger) so that when records are inserted or updated, new embeddings ARE generated using the same model.
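
To illustrate the behavior described here, a small simulation using SQLite from Python as a stand-in for Postgres. This is only a sketch of the idea, not pg_vectorize's implementation: the `docs` and `embedding_jobs` tables and the trigger names are hypothetical. Existing rows arrive with embeddings and get no jobs, while later inserts and content updates are queued.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE docs (id INTEGER PRIMARY KEY, content TEXT, embedding TEXT);
    CREATE TABLE embedding_jobs (doc_id INTEGER);

    -- Rows that already carry embeddings are loaded before any trigger exists,
    -- so no jobs are queued for them.
    INSERT INTO docs VALUES (1, 'hello', '[0.1,0.2]'), (2, 'world', '[0.3,0.4]');

    -- Insert trigger: every new row needs an embedding.
    CREATE TRIGGER docs_insert AFTER INSERT ON docs
    BEGIN
        INSERT INTO embedding_jobs (doc_id) VALUES (NEW.id);
    END;

    -- Update trigger: changed content needs a fresh embedding.
    CREATE TRIGGER docs_update AFTER UPDATE OF content ON docs
    BEGIN
        INSERT INTO embedding_jobs (doc_id) VALUES (NEW.id);
    END;
    """
)

conn.execute("INSERT INTO docs (id, content) VALUES (3, 'new row')")
conn.execute("UPDATE docs SET content = 'hello again' WHERE id = 1")

# Row 2 was never touched after the triggers were created, so it has no job.
queued = [row[0] for row in conn.execute(
    "SELECT doc_id FROM embedding_jobs ORDER BY rowid"
)]
print(queued)
```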

Neptune650 · 2024-10-25T15:47:21Z

> @Neptune650, that's right -- use the embeddings that were already generated (but not generated by pg_vectorize), and since embeddings are already generated we do not create the embedding generation jobs. But we should create the triggers (insert trigger and update trigger) so that when we insert new records or update records, new embeddings ARE generated using the same model.

@ChuckHend In that case, do you think adding an "embeddings" parameter to vectorize.table() would be appropriate?

ChuckHend · 2024-10-25T17:13:04Z

That could work. Are you thinking the embeddings parameter would accept a column name where the embeddings already exist?

Neptune650 · 2024-10-25T17:40:51Z

> That could work. Are you thinking the embeddings parameter would accept a column name where the embeddings already exist?

Correct, that would be the way I'd implement it.

ChuckHend · 2024-10-25T18:06:03Z

That sounds good to me, but it could get complicated, since we'd want to support embeddings in a column on the source table as well as embeddings in another table linked by a foreign key. If you can figure it out, I think it would be a good solution.

Alternatively, it could be a flag like init=false, plus documentation on how to copy or move embeddings.

Neptune650 · 2024-10-25T18:20:51Z

> That sounds good to me, but it could get complicated, since we'd want to support embeddings in a column on the source table as well as embeddings in another table linked by a foreign key. If you can figure it out, I think it would be a good solution.
>
> Alternatively, it could be a flag like init=false, plus documentation on how to copy or move embeddings.

@ChuckHend Well, if we're implementing the latter, would we skip only the embedding initialization, or skip everything? Also, would copying/moving embeddings just be a SQL "INSERT INTO"?

ChuckHend · 2024-10-25T18:42:42Z

With init=false, there are still other operations we'd want to happen, like creating the triggers and creating tables or columns (depending on the relevant parameter values). Then yes, I think it would be just an insert.
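
A rough sketch of what "just an insert" could look like, run against SQLite from Python so it is self-contained. SQLite only stands in for Postgres here, vectors are stored as text, and the `products` and `project_embeddings` tables are hypothetical names rather than what pg_vectorize creates. The copy is a single `INSERT INTO ... SELECT`, followed by a check for rows that still lack an embedding.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Source table that already holds user-generated embeddings.
    CREATE TABLE products (
        product_id INTEGER PRIMARY KEY,
        description TEXT,
        embedding TEXT
    );
    INSERT INTO products VALUES
        (1, 'red shoe', '[0.1,0.2]'),
        (2, 'blue hat', '[0.3,0.4]'),
        (3, 'green sock', NULL);

    -- Stand-in for the project's embeddings table.
    CREATE TABLE project_embeddings (
        product_id INTEGER PRIMARY KEY,
        embeddings TEXT
    );
    """
)

# Copy every existing embedding in one statement.
conn.execute(
    """
    INSERT INTO project_embeddings (product_id, embeddings)
    SELECT product_id, embedding FROM products WHERE embedding IS NOT NULL
    """
)
copied = conn.execute("SELECT COUNT(*) FROM project_embeddings").fetchone()[0]

# Rows with no embedding yet would still need a generation job.
missing = [row[0] for row in conn.execute(
    """
    SELECT p.product_id
    FROM products AS p
    LEFT JOIN project_embeddings AS e ON e.product_id = p.product_id
    WHERE e.product_id IS NULL
    ORDER BY p.product_id
    """
)]
print(copied, missing)
```

The second query matters for partially embedded tables: anything it returns would still have to go through normal embedding generation.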

I'm not certain how we'd handle the cron job if the schedule parameter is set to a cron expression instead of 'realtime'. The embeddings would need to be inserted before the first scheduled job is executed.

Neptune650 · 2024-10-25T19:22:35Z

> I'm not certain how we'd handle the cron job if the schedule parameter is set to a cron expression instead of 'realtime'. The embeddings would need to be inserted before the first scheduled job is executed.

But since it's up to the user to do that, how could we?

ChuckHend · 2024-11-01T16:55:21Z

> But since it's up to the user to do that, how could we?

Yes, it is up to the user, but if it's almost impossible to get the embeddings inserted before the cron job runs, then that isn't a very good experience.

ChuckHend added the documentation label Oct 11, 2024

FloorD added the hacktoberfest label Oct 14, 2024

algora-pbc bot added the 💎 Bounty label Oct 17, 2024
