PGVector Duplicates Entries #739

Open
MichaelMMeskhi opened this issue Dec 18, 2024 · 3 comments
Labels
bug Something isn't working

Comments

@MichaelMMeskhi

Describe the bug
When training the RAG layer with PGVector, it duplicates the entries. In ChromaDB, by contrast, duplicate entries are skipped.

To Reproduce
Steps to reproduce the behavior:

  1. Run a script to embed 10 documents into PGVector (a minimal sketch follows the list).
  2. Check the Vanna app to confirm the training data has 10 entries.
  3. Rerun the training script.
  4. The training data now has 20 entries (10 duplicates).
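
A minimal sketch of such a training script, assuming vn is a Vanna instance backed by the stock PG_VectorStore (document contents are placeholders):

# Hypothetical repro sketch: `vn` is assumed to be a Vanna instance using
# the stock PG_VectorStore; the document texts below are placeholders.
documents = [f"Documentation chunk {i}" for i in range(10)]

for text in documents:
    # Each call stores the text under a fresh random (uuid4) ID, so
    # re-running this loop re-inserts every document instead of skipping it.
    vn.add_documentation(text)

print(len(vn.get_training_data()))  # 10 after the first run, 20 after the second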

Expected behavior
Should skip duplicate embeddings.


Desktop (please complete the following information):

  • OS: Ubuntu
  • Version: 24.04
  • Python: 3.11
  • Vanna: 0.7.5
MichaelMMeskhi added the bug label on Dec 18, 2024
@MichaelMMeskhi
Author

For instance, when using ChromaDB, it warns the user that an embedding already exists and skips it. PGVector, on the other hand, gives no such warning and inserts the duplicates.
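
For comparison, here is that ChromaDB behavior in isolation (a sketch against the chromadb client directly, not through Vanna):

import chromadb

client = chromadb.Client()
collection = client.create_collection("demo")

# The second add with the same ID logs a warning ("Insert of existing
# embedding ID") and the entry is not duplicated.
collection.add(ids=["doc-1"], documents=["hello"])
collection.add(ids=["doc-1"], documents=["hello"])

print(collection.count())  # 1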

@Oscaner

Oscaner commented Dec 24, 2024

Workaround: use uuid5 (deterministic, derived from the content) instead of uuid4 (random), so the same content always maps to the same ID and a pre-insert check can skip it.

import uuid

import json5 as json
from langchain_core.documents import Document
from vanna.pgvector import PG_VectorStore as VannaBase


class PG_VectorStore(VannaBase):
    """Derives document IDs from content via uuid5 so that re-inserting
    the same content is a no-op instead of a duplicate."""

    def add_question_sql(self, question: str, sql: str, **kwargs) -> str:
        question_sql_json = json.dumps(
            {
                "question": question,
                "sql": sql,
            },
            ensure_ascii=False,
        )

        # uuid5 is deterministic: the same question/SQL pair always maps to
        # the same ID, unlike the random uuid4 the stock class generates.
        _id = str(uuid.uuid5(uuid.NAMESPACE_DNS, question_sql_json)) + "-sql"

        # If a document with this ID already exists, skip the insert.
        docs = self.sql_collection.get_by_ids([_id])
        if len(docs) > 0:
            return _id

        createdat = kwargs.get("createdat")

        doc = Document(
            page_content=question_sql_json,
            metadata={"id": _id, "createdat": createdat},
        )

        self.sql_collection.add_documents([doc], ids=[doc.metadata["id"]])

        return _id

    def add_ddl(self, ddl: str, **kwargs) -> str:
        _id = str(uuid.uuid5(uuid.NAMESPACE_DNS, ddl)) + "-ddl"

        docs = self.ddl_collection.get_by_ids([_id])
        if len(docs) > 0:
            return _id

        doc = Document(
            page_content=ddl,
            metadata={"id": _id},
        )

        self.ddl_collection.add_documents([doc], ids=[doc.metadata["id"]])

        return _id

    def add_documentation(self, documentation: str, **kwargs) -> str:
        _id = str(uuid.uuid5(uuid.NAMESPACE_DNS, documentation)) + "-doc"

        docs = self.documentation_collection.get_by_ids([_id])
        if len(docs) > 0:
            return _id

        doc = Document(
            page_content=documentation,
            metadata={"id": _id},
        )

        self.documentation_collection.add_documents([doc], ids=[doc.metadata["id"]])

        return _id
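
Usage is unchanged from the stock class; a hypothetical wiring sketch (connection string, model name, and table DDL are placeholders, not from this issue):

from vanna.openai import OpenAI_Chat


class MyVanna(PG_VectorStore, OpenAI_Chat):
    def __init__(self, config=None):
        PG_VectorStore.__init__(self, config=config)
        OpenAI_Chat.__init__(self, config=config)


vn = MyVanna(config={
    "connection_string": "postgresql+psycopg://user:pass@localhost:5432/vanna",
    "model": "gpt-4o-mini",
})

# Re-running the same training call is now idempotent: the second call
# computes the same uuid5 ID, finds the existing document, and returns early.
vn.train(ddl="CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
vn.train(ddl="CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")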

@MichaelMMeskhi
Author

@zainhoda shouldn't pgvector be using deterministic_uuid from utils, just like chromadb does? Wondering if this is the reason for the duplicates.
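
The difference in miniature (a sketch; the exact deterministic_uuid implementation in vanna.utils may differ, but the key property is that IDs are derived from content):

import uuid

content = "CREATE TABLE customers (id INTEGER PRIMARY KEY)"

# Random ID: a new value on every run, so re-inserted content looks new.
print(uuid.uuid4())

# Content-derived ID: identical across runs, so a pre-insert existence
# check can detect and skip duplicates.
print(uuid.uuid5(uuid.NAMESPACE_DNS, content))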
