PGVector Duplicates Entries #739

Open
MichaelMMeskhi opened this issue Dec 18, 2024 · 3 comments
Labels
bug Something isn't working

Comments

@MichaelMMeskhi

Describe the bug
When training the RAG layer with PGVector, it duplicates the entries. In ChromaDB, by contrast, duplicate entries are skipped.

To Reproduce
Steps to reproduce the behavior:

  1. Run a script to embed 10 documents into PGVector (a minimal sketch follows the list).
  2. Check the Vanna app to confirm the training data has 10 entries.
  3. Rerun the training script.
  4. The training data now has 20 entries (10 duplicates).
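
A minimal sketch of such a training script, assuming vn is a Vanna instance backed by the stock PG_VectorStore (document contents are placeholders):

# Hypothetical repro sketch: `vn` is assumed to be a Vanna instance using
# the stock PG_VectorStore; the document texts below are placeholders.
documents = [f"Documentation chunk {i}" for i in range(10)]

for text in documents:
    # Each call stores the text under a fresh random (uuid4) ID, so
    # re-running this loop re-inserts every document instead of skipping it.
    vn.add_documentation(text)

print(len(vn.get_training_data()))  # 10 after the first run, 20 after the second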

Expected behavior
Should skip duplicate embeddings.


Desktop (please complete the following information):

  • OS: Ubuntu
  • Version: 24.04
  • Python: 3.11
  • Vanna: 0.7.5
MichaelMMeskhi added the bug label on Dec 18, 2024
@MichaelMMeskhi
Author

For instance, when using ChromaDB, it warns the user that an embedding already exists and skips it. PGVector, on the other hand, gives no such warning and inserts the duplicates.
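
For comparison, here is that ChromaDB behavior in isolation (a sketch against the chromadb client directly, not through Vanna):

import chromadb

client = chromadb.Client()
collection = client.create_collection("demo")

# The second add with the same ID logs a warning ("Insert of existing
# embedding ID") and the entry is not duplicated.
collection.add(ids=["doc-1"], documents=["hello"])
collection.add(ids=["doc-1"], documents=["hello"])

print(collection.count())  # 1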

@Oscaner

Oscaner commented Dec 24, 2024

Workaround: use uuid5 (deterministic, derived from the content) instead of uuid4 (random), so the same content always maps to the same ID and a pre-insert check can skip it.

import uuid

import json5 as json
from langchain_core.documents import Document
from vanna.pgvector import PG_VectorStore as VannaBase


class PG_VectorStore(VannaBase):
    """Derives document IDs from content via uuid5 so that re-inserting
    the same content is a no-op instead of a duplicate."""

    def add_question_sql(self, question: str, sql: str, **kwargs) -> str:
        question_sql_json = json.dumps(
            {
                "question": question,
                "sql": sql,
            },
            ensure_ascii=False,
        )

        # uuid5 is deterministic: the same question/SQL pair always maps to
        # the same ID, unlike the random uuid4 the stock class generates.
        _id = str(uuid.uuid5(uuid.NAMESPACE_DNS, question_sql_json)) + "-sql"

        # If a document with this ID already exists, skip the insert.
        docs = self.sql_collection.get_by_ids([_id])
        if len(docs) > 0:
            return _id

        createdat = kwargs.get("createdat")

        doc = Document(
            page_content=question_sql_json,
            metadata={"id": _id, "createdat": createdat},
        )

        self.sql_collection.add_documents([doc], ids=[doc.metadata["id"]])

        return _id

    def add_ddl(self, ddl: str, **kwargs) -> str:
        _id = str(uuid.uuid5(uuid.NAMESPACE_DNS, ddl)) + "-ddl"

        docs = self.ddl_collection.get_by_ids([_id])
        if len(docs) > 0:
            return _id

        doc = Document(
            page_content=ddl,
            metadata={"id": _id},
        )

        self.ddl_collection.add_documents([doc], ids=[doc.metadata["id"]])

        return _id

    def add_documentation(self, documentation: str, **kwargs) -> str:
        _id = str(uuid.uuid5(uuid.NAMESPACE_DNS, documentation)) + "-doc"

        docs = self.documentation_collection.get_by_ids([_id])
        if len(docs) > 0:
            return _id

        doc = Document(
            page_content=documentation,
            metadata={"id": _id},
        )

        self.documentation_collection.add_documents([doc], ids=[doc.metadata["id"]])

        return _id
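
Usage is unchanged from the stock class; a hypothetical wiring sketch (connection string, model name, and table DDL are placeholders, not from this issue):

from vanna.openai import OpenAI_Chat


class MyVanna(PG_VectorStore, OpenAI_Chat):
    def __init__(self, config=None):
        PG_VectorStore.__init__(self, config=config)
        OpenAI_Chat.__init__(self, config=config)


vn = MyVanna(config={
    "connection_string": "postgresql+psycopg://user:pass@localhost:5432/vanna",
    "model": "gpt-4o-mini",
})

# Re-running the same training call is now idempotent: the second call
# computes the same uuid5 ID, finds the existing document, and returns early.
vn.train(ddl="CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
vn.train(ddl="CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")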

@MichaelMMeskhi
Author

@zainhoda shouldn't pgvector be using deterministic_uuid from utils, just like chromadb does? Wondering if this is the reason for the duplicates.
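
The difference in miniature (a sketch; the exact deterministic_uuid implementation in vanna.utils may differ, but the key property is that IDs are derived from content):

import uuid

content = "CREATE TABLE customers (id INTEGER PRIMARY KEY)"

# Random ID: a new value on every run, so re-inserted content looks new.
print(uuid.uuid4())

# Content-derived ID: identical across runs, so a pre-insert existence
# check can detect and skip duplicates.
print(uuid.uuid5(uuid.NAMESPACE_DNS, content))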
