Increase Size of Test Dataset During SEARCH_CONCURRENT Stage #385

Open
wahajali opened this issue Oct 25, 2024 · 2 comments
wahajali (Contributor) commented Oct 25, 2024

I would like to propose that, instead of the 1,000-query test set that VectorDBBench currently uses during the SEARCH_SERIAL stage to calculate recall, we use a larger pool of queries during the SEARCH_CONCURRENT phase, where QPS is calculated. Since we don't need to compute recall during this stage, we also don't require the ground truth (GT) for these queries, so this can be an entirely different set of queries.

To provide context for this proposal, I'd like to share results from tests I ran on pgvector with HNSW, comparing the original 1,000 test queries against 10,000 queries randomly selected from the original training dataset. The tests were configured so that the index size exceeded the available memory cache (shared_buffers in PostgreSQL), meaning the index could not fit in memory; for HNSW, the performance impact of this is expected to be significant. For reference, I've also included results for the OpenAI 500K dataset, which does fit in memory.
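For anyone reproducing this setup, one way to confirm the index really exceeds the cache is to compare its on-disk size against shared_buffers before the run. A minimal sketch, assuming psycopg2, a local PostgreSQL connection, and a hypothetical index name:

```python
# Sketch only: verify the HNSW index is larger than shared_buffers.
# The connection string and the index name "items_embedding_idx" are
# placeholders, not VectorDBBench internals.
import psycopg2

conn = psycopg2.connect("dbname=vectordb user=postgres")
with conn.cursor() as cur:
    cur.execute("SHOW shared_buffers;")
    print("shared_buffers:", cur.fetchone()[0])
    cur.execute("SELECT pg_size_pretty(pg_relation_size('items_embedding_idx'));")
    print("index size:", cur.fetchone()[0])
conn.close()
```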

| Dataset | QPS | Test Query Count |
| --- | --- | --- |
| 5M OpenAI | 1030.69815 | 1,000 (original) |
| 5M OpenAI | 4.4392 | 10,000 (generated) |
| 500K OpenAI | 1276.48165 | 1,000 (original) |
| 500K OpenAI | 1143.8795 | 10,000 (generated) |

As you can see, the decrease in QPS is dramatic. The low QPS is expected, because the index is significantly larger than the available buffers, which forces disk IO. With the 1,000 test queries this effect isn't apparent at all, perhaps because such a limited number of queries doesn't force the entire index to be loaded into memory. While a recent improvement in randomly selecting the query index has made the QPS more realistic, I believe this change will make the numbers even more reflective of actual performance.

In my opinion, there are two options:

  1. Generate a larger test dataset using the same methodology as before.
  2. Randomly select vectors from the training dataset and use them as the test queries (a sketch of this option follows below).
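
A minimal sketch of option 2, assuming the train vectors ship as a parquet file with an `emb` column (both names are placeholders for whatever the actual dataset layout is):

```python
# Sketch only: sample 10,000 test queries for SEARCH_CONCURRENT from the
# train set. "train.parquet" and the "emb" column are placeholder names.
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=42)  # fixed seed keeps runs comparable
train = pd.read_parquet("train.parquet")
query_idx = rng.choice(len(train), size=10_000, replace=False)
concurrent_queries = np.stack(train["emb"].iloc[query_idx].to_numpy())
# No ground truth is needed: these queries only measure QPS, never recall.
```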
alwayslove2013 (Collaborator) commented

@wahajali Thank you for your incredibly detailed observation! You're absolutely right: during the conc_tests, recall isn't calculated, so there's no need for a ground truth file.

> Since we don't need to compute recall during this stage, we also don't require the ground truth (GT) for these queries, so this can be an entirely different set of queries.

I love the idea of extending the functionality of conc_tests to allow for an increased number of test queries. We could sample from train_vectors or even generate them randomly, which would provide a more comprehensive evaluation of different vector DB memory strategies. Your insights are invaluable!

> Randomly select vectors from the training dataset and use them as the test queries.
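
A minimal sketch of the "generate them randomly" variant: draw queries that match the per-dimension statistics of the train vectors, so they stay on the same scale as real data (the `train_vectors` array and the function name are illustrative, not existing VectorDBBench symbols):

```python
# Sketch only: generate synthetic queries matching the train data's scale.
# `train_vectors` is assumed to be an (n, d) float32 array already loaded.
import numpy as np

rng = np.random.default_rng(seed=7)  # fixed seed keeps runs comparable

def generate_queries(train_vectors: np.ndarray, n_queries: int) -> np.ndarray:
    """Draw queries from a per-dimension normal fit of the train vectors."""
    mean = train_vectors.mean(axis=0)
    std = train_vectors.std(axis=0)
    dim = train_vectors.shape[1]
    return rng.normal(mean, std, size=(n_queries, dim)).astype(np.float32)
```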

frg01 commented Dec 5, 2024

> @wahajali Thank you for your incredibly detailed observation! You're absolutely right: during the conc_tests, recall isn't calculated, so there's no need for a ground truth file.
>
> > Since we don't need to compute recall during this stage, we also don't require the ground truth (GT) for these queries, so this can be an entirely different set of queries.
>
> I love the idea of extending the functionality of conc_tests to allow for an increased number of test queries. We could sample from train_vectors or even generate them randomly, which would provide a more comprehensive evaluation of different vector DB memory strategies. Your insights are invaluable!
>
> > Randomly select vectors from the training dataset and use them as the test queries.

@alwayslove2013 Hi, your ideas are great. I have a suggestion about how to conduct a comprehensive index performance evaluation: when using a random test dataset, how do we ensure the results are stable and comprehensive, and eliminate the impact of randomness on their accuracy?

My own opinion is that the test queries should completely cover the train dataset in terms of nearest neighbors; that is, a larger train dataset requires a larger test dataset, so that all data can be retrieved when calculating the recall rate. ann-benchmarks does this. The datasets must be standardized, otherwise neither the recall rate nor the QPS can be truly evaluated.
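
A minimal sketch of that coverage criterion, assuming a precomputed ground-truth matrix of top-k neighbor ids per query (all names here are illustrative):

```python
# Sketch only: measure how much of the train set the query set "covers",
# i.e. what fraction of train vectors appear among any query's top-k
# ground-truth neighbors. `neighbors` is an assumed (n_queries, k) id array.
import numpy as np

def coverage(neighbors: np.ndarray, train_size: int) -> float:
    """Fraction of train vectors reachable as a top-k neighbor of some query."""
    return np.unique(neighbors).size / train_size

# Coverage well below 1.0 suggests the query set is too small (or too
# clustered) to evaluate recall fairly over the whole train dataset.
```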
