Welcome Embeddings 🚀 #4322

Merged
merged 11 commits into learningequality:search-recommendations
Mar 12, 2024

Conversation

vkWeb
Member

@vkWeb vkWeb commented Nov 1, 2023

Summary

Description of the change(s) you made

Added an Embeddings model that caches generated embeddings and enables finding the closest embeddings in order to recommend contentnodes.
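
For readers unfamiliar with the idea, here is a minimal plain-Python sketch of what "closest embeddings" means. It is not the code in this PR: the PR relies on pgvector inside PostgreSQL, while this sketch assumes an in-memory dictionary and cosine distance. The names `cosine_distance`, `closest`, and `cache` are hypothetical.

```python
import math


def cosine_distance(a, b):
    """Cosine distance between two equal-length, non-zero vectors.

    0.0 means the vectors point the same way; 1.0 means they are orthogonal.
    """
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def closest(query, cached, k=2):
    """Return the ids of the k cached embeddings nearest to `query`."""
    ranked = sorted(cached, key=lambda node_id: cosine_distance(query, cached[node_id]))
    return ranked[:k]


# Hypothetical cache: contentnode id -> previously generated embedding.
cache = {
    "node_a": [1.0, 0.0, 0.0],
    "node_b": [0.9, 0.1, 0.0],
    "node_c": [0.0, 1.0, 0.0],
}

# Rank cached nodes by similarity to a query embedding.
print(closest([1.0, 0.0, 0.0], cache, k=2))
```

Caching matters because generating an embedding is the expensive step; once stored, ranking by distance is a cheap lookup.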

Manual verification steps performed

  1. First, run make run-services in one terminal.
  2. Then, in a second terminal, run pytest -s -k EmbeddingsTestCase.
  3. Passing tests indicate that the embedding-related tables were created successfully and that pgvector is installed correctly.

Reviewer guidance

How can a reviewer test these changes?

Performing manual verification steps should be enough.

References

Closes #4290.


Contributor's Checklist

PR process:

  • If this is an important user-facing change, the CHANGELOG label has been added to this PR or the related issue. Note: items with this label will be added to the CHANGELOG at a later time
  • If this includes an internal dependency change, a link to the diff is provided
  • The docs label has been added if this introduces a change that needs to be updated in the user docs
  • If any Python requirements have changed, the updated requirements.txt files are also included in this PR
  • Opportunities for using Google Analytics here are noted
  • Migrations are safe for a large db

Studio-specific:

  • All user-facing strings are translated properly
  • The notranslate class has been added to elements that shouldn't be translated by Google Chrome's automatic translation feature (e.g. icons, user-generated text)
  • All UI components are LTR and RTL compliant
  • Views are organized into pages, components, and layouts directories as described in the docs
  • Users' storage used is recalculated properly on any changes to main tree files
  • If there are new ways this uses user data that need to be factored into our Privacy Policy, they have been noted.

Testing:

  • Code is clean and well-commented
  • Contributor has fully tested the PR manually
  • If there are any front-end changes, before/after screenshots are included
  • Critical user journeys are covered by Gherkin stories
  • Any new interactions have been added to the QA Sheet
  • Critical and brittle code paths are covered by unit tests

Reviewer's Checklist

This section is for reviewers to fill out.

  • Automated test coverage is satisfactory
  • PR is fully functional
  • PR has been tested for accessibility regressions
  • External dependency files were updated if necessary (yarn and pip)
  • Documentation is updated
  • Contributor is in AUTHORS.md

@vkWeb vkWeb requested review from bjester and akolson November 1, 2023 14:07
@vkWeb vkWeb requested review from jamalex and bjester November 8, 2023 12:28
@vkWeb
Member Author

vkWeb commented Nov 8, 2023

@bjester GitHub Actions is failing because it uses the original postgres:12 image; we'll need to refactor our actions to support our docker compose setup. Meanwhile, I've verified locally that the tests related to this PR pass.

@bjester
Member

bjester commented Nov 14, 2023

Just noting that this is blocked until we can get the GH Actions sorted and the extension enabled on the unstable server database

@akolson
Member

akolson commented Nov 15, 2023

Just noting that this is blocked until we can get the GH Actions sorted and the extension enabled on the unstable server database

#4341 is tracking the GH Actions update

@bjester
Member

bjester commented Dec 18, 2023

@vkWeb the GH Action is now using pgvector. Do the tests pass on your machine, though? I ran them on mine and they failed with a missing storage bucket, and the failure is directly connected to your tests:

E           botocore.errorfactory.NoSuchBucket: An error occurred (NoSuchBucket) when calling the PutObject operation: The specified bucket does not exist

../pyenv/versions/3.9.13/envs/studio-3.9.13/lib/python3.9/site-packages/botocore/client.py:705: NoSuchBucket

During handling of the above exception, another exception occurred:

cls = <class 'contentcuration.tests.test_models.EmbeddingsTestCase'>

    @classmethod
    def setUpClass(cls):
        super(EmbeddingsTestCase, cls).setUpClass()
>       node_1 = testdata.node({
            "kind_id": "video",
            "title": "first"
        })

@vkWeb
Member Author

vkWeb commented Dec 21, 2023

@vkWeb the GH Action is now using pg-vector. Although, do the tests pass on your machine? I ran them on mine and it complained about a missing storage bucket, but its directly connected to your tests

@bjester the tests pass on my system. Let's wait for the CI test results. If CI fails, I will look deeper.

@vkWeb
Member Author

vkWeb commented Dec 23, 2023

@bjester yes, you were right: the tests fail on CI and on my system too. Running only the embedding-related tests works, but running all the tests together fails. I'm looking into this.

@vkWeb
Member Author

vkWeb commented Dec 23, 2023

@bjester so this was an issue with the bucket getting deleted (fixed in the most recent commit 🎉). Let me explain in detail:

When only EmbeddingsTestCase is run, the bucket is auto-created at test start because AWS_AUTO_CREATE_BUCKET is True, so setUpClass works.

But when all the tests are run together, any test that runs before EmbeddingsTestCase calls its teardown, which deletes the bucket. setUpClass itself does not create a bucket, so it failed to save the video file to S3.
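
The ordering problem can be reproduced in isolation. The sketch below is a simplified model, not Studio's test code: `FakeStorage`, `EarlierTestCase`, and `EmbeddingsLikeTestCase` are hypothetical stand-ins, and recreating the bucket inside setUpClass is only one possible fix, which may differ from what the commit actually does.

```python
import unittest


class FakeStorage:
    """In-memory stand-in for an S3-like store (hypothetical, not the real backend)."""

    def __init__(self):
        self.buckets = {}

    def create_bucket(self, name):
        # Idempotent: keeps existing contents if the bucket is already there.
        self.buckets.setdefault(name, {})

    def delete_bucket(self, name):
        self.buckets.pop(name, None)

    def put_object(self, bucket, key, data):
        if bucket not in self.buckets:
            raise KeyError("NoSuchBucket: " + bucket)
        self.buckets[bucket][key] = data


STORAGE = FakeStorage()
BUCKET = "content"
STORAGE.create_bucket(BUCKET)  # models the one-time auto-creation at test start


class EarlierTestCase(unittest.TestCase):
    """Any test class that happens to run before the embeddings tests."""

    def test_something(self):
        STORAGE.put_object(BUCKET, "a.txt", b"1")

    def tearDown(self):
        # Cleans up by deleting the bucket, leaving none for later classes.
        STORAGE.delete_bucket(BUCKET)


class EmbeddingsLikeTestCase(unittest.TestCase):
    @classmethod
    def setUpClass(cls):
        super().setUpClass()
        # Assumed fix: do not rely on the bucket surviving earlier teardowns.
        STORAGE.create_bucket(BUCKET)
        STORAGE.put_object(BUCKET, "video.mp4", b"data")

    def test_file_saved(self):
        self.assertIn("video.mp4", STORAGE.buckets[BUCKET])
```

Without the `create_bucket` call in setUpClass, the second class passes when run alone and fails only when it runs after `EarlierTestCase`, which matches the behaviour described above.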


This tells me we could improve the developer experience around how our test classes are laid out. This failure was really non-intuitive to track down.

@vkWeb
Member Author

vkWeb commented Dec 23, 2023

@bjester this is ready to merge 🎉

@bjester
Member

bjester commented Jan 3, 2024

Code looks good. The last blocker is getting the extension enabled in the unstable server environment. I'll follow up with infra.

@bjester bjester changed the base branch from unstable to search-recommendations March 12, 2024 16:29
@bjester bjester merged commit 3861ca1 into learningequality:search-recommendations Mar 12, 2024
6 checks passed
Development

Successfully merging this pull request may close these issues.

Create a model that caches generated embeddings
4 participants