
Add an example application about how to properly deal with stale documents on the vector database #612

Open
eolivelli opened this issue Oct 18, 2023 · 1 comment

Comments

@eolivelli
Member

None of our current example applications shows how to deal with these two common issues:

Shorter pages

When you re-index a website, the new version of a page may be shorter and so produce fewer chunks.
You can overwrite the chunks with lower ids, but the old chunks with higher ids remain in the database.
We need to show how to remove these stale chunks.
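
To make the problem concrete, here is a minimal sketch in plain Python. It is not LangStream code: the `Store` dict is a hypothetical in-memory stand-in for a vector database table, and `reindex_page` is an illustrative name. It assumes chunks are keyed by document id plus a chunk index that starts at 0.

```python
from typing import Dict, List, Tuple

# Hypothetical in-memory stand-in for a vector database table,
# keyed by (document id, chunk index). Not a LangStream API.
Store = Dict[Tuple[str, int], str]


def reindex_page(store: Store, doc_id: str, chunks: List[str]) -> int:
    """Upsert the new chunks of doc_id, then delete leftover chunks.

    Returns the number of stale chunks removed.
    """
    # Overwrite (or create) chunks 0 .. len(chunks) - 1.
    for index, text in enumerate(chunks):
        store[(doc_id, index)] = text

    # Any chunk of this document at or beyond the new length is stale.
    stale = [key for key in store if key[0] == doc_id and key[1] >= len(chunks)]
    for key in stale:
        del store[key]
    return len(stale)
```

With a real database, the second step would likely be a single delete filtered on the document id and a chunk index greater than or equal to the new chunk count, assuming both are stored as queryable fields.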

Pages that disappeared

This is trickier. When you know that you are re-indexing the whole corpus of documents (for instance, a whole website), you should drop the documents that are no longer available. Otherwise you risk keeping outdated documents or duplicate content (for example, when a page has been renamed).
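
One possible approach, sketched below in plain Python, is to record every document id seen during the full re-index and then delete everything else. This is only an illustration: `Store` is a hypothetical in-memory stand-in for the vector database, `drop_missing_documents` is an invented name, and the sketch assumes the crawl really covered the whole corpus.

```python
from typing import Dict, Iterable, Set, Tuple

# Hypothetical in-memory stand-in for a vector database table,
# keyed by (document id, chunk index). Not a LangStream API.
Store = Dict[Tuple[str, int], str]


def drop_missing_documents(store: Store, seen_doc_ids: Iterable[str]) -> Set[str]:
    """Delete all chunks of documents that were not seen in the latest full crawl.

    Returns the set of document ids that were removed.
    """
    seen = set(seen_doc_ids)
    # Documents present in the store but absent from the crawl.
    removed = {doc_id for doc_id, _ in store if doc_id not in seen}
    # Collect the keys first so the dict is not mutated while iterating.
    for key in [k for k in store if k[0] in removed]:
        del store[key]
    return removed
```

A renamed page shows up here as one new id that was seen and one old id that was not, so the old copy is dropped and the duplicate disappears. This cleanup is only safe after a complete crawl; running it after a partial one would delete documents that still exist.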

@eolivelli eolivelli moved this to In Progress in LangStream Oct 18, 2023
@eolivelli eolivelli moved this from In Progress to Done in LangStream Oct 20, 2023
@eolivelli eolivelli added this to the 0.3.0 milestone Oct 20, 2023
@eolivelli eolivelli reopened this Oct 20, 2023
@eolivelli eolivelli moved this from Done to In Progress in LangStream Oct 20, 2023
@eolivelli eolivelli removed this from the 0.3.0 milestone Oct 20, 2023
@eolivelli
Member Author

The first part has been delivered in the 0.3.0 release
