Increase Size of Test Dataset During SEARCH_CONCURRENT Stage #385

wahajali · 2024-10-25T20:43:59Z

I would like to propose that instead of using the 1,000 test query set that VectorDBBench currently uses during the SEARCH_SERIAL stage to calculate recall, we should use a larger pool of queries during the SEARCH_CONCURRENT phase where QPS is calculated. Since we don't need to compute recall during this stage, we also don't require the ground truth (GT) for these queries, so this can be an entirely different set of queries.

To provide context for this proposal, I’d like to share some results from tests I ran on pgvector with HNSW, comparing the original 1,000 test queries with 10,000 queries randomly selected from the original training dataset. The tests were run with an index larger than the available memory cache (shared_buffers in PostgreSQL), meaning that the index could not fit in memory. For HNSW, the performance impact of this is expected to be significant. For reference, I’ve also included results for the OpenAI 500K dataset, which fits into memory.

| Dataset Size | QPS | Test Data Size (queries) |
| --- | --- | --- |
| 5M OpenAI | 1030.69815 | 1000 (original) |
| 5M OpenAI | 4.4392 | 10000 (generated) |
| 500K OpenAI | 1276.48165 | 1000 (original) |
| 500K OpenAI | 1143.8795 | 10000 (generated) |

As you can see, the decrease in QPS on the 5M dataset is dramatic. The low QPS is expected because the index is significantly larger than the available buffers, so queries require disk IO. With only 1,000 test queries, this effect isn’t apparent at all, perhaps because the limited number of queries doesn't force the entire index to be loaded into memory. While a recent improvement in randomly selecting the query index has made the QPS more realistic, I believe this change will make the numbers even more reflective of actual performance.

In my opinion, there are two options:

1. Generate a larger test dataset using the same methodology as previously used.
2. Randomly select vectors from the training dataset and use them as the test queries.
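
As a rough illustration of the second option, here is a minimal sketch of drawing a larger query pool from the training vectors. It is not VectorDBBench code: the function name `sample_query_vectors`, the fixed seed, and the synthetic stand-in array are all assumptions made for the example. In practice the training vectors would come from the benchmark's dataset files.

```python
import numpy as np


def sample_query_vectors(train_vectors, num_queries, seed=0):
    """Pick num_queries distinct rows from train_vectors (hypothetical helper).

    A fixed seed keeps the selection reproducible across benchmark runs.
    """
    if num_queries > len(train_vectors):
        raise ValueError("num_queries cannot exceed the number of train vectors")
    rng = np.random.default_rng(seed)
    # Sample row indices without replacement so no query is repeated.
    idx = rng.choice(len(train_vectors), size=num_queries, replace=False)
    return train_vectors[idx]


# Synthetic stand-in for the real training set (assumption for this sketch).
train = np.random.default_rng(42).standard_normal((50_000, 16)).astype(np.float32)
queries = sample_query_vectors(train, 10_000, seed=0)
print(queries.shape)
```

Because no ground truth is needed for the QPS stage, the sampled vectors can be used directly as queries. Sampling without replacement avoids repeating a query within one pass over the pool.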

alwayslove2013 · 2024-10-28T02:52:04Z

@wahajali Thank you for your incredibly detailed observation! You’re absolutely right—during the conc_tests, recall isn’t calculated, so there’s no need for a ground truth file.

> Since we don't need to compute recall during this stage, we also don't require the ground truth (GT) for these queries, so this can be an entirely different set of queries.

I love the idea of extending the functionality of conc_tests to allow for an increased number of test queries. We could sample from train_vectors or even generate them randomly, which would provide a more comprehensive evaluation of different vector DB memory strategies. Your insights are invaluable!

> Randomly select vectors from the training dataset and use them as the test queries.

frg01 · 2024-12-05T04:14:36Z

@alwayslove2013
Hi, these are great ideas. I have a question about how to run a comprehensive index performance evaluation: when using a randomly chosen test dataset, how do we make sure the results are stable and comprehensive, and keep randomness from affecting their accuracy?
My own view is that, for the chosen number of nearest neighbors, the test queries should fully cover the train dataset. In other words, a larger train dataset requires a larger test dataset, so that all of the data can be retrieved when calculating recall.
ann-benchmarks does this. The datasets must be standard; otherwise neither recall nor QPS can be evaluated reliably.
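
One possible way to quantify the coverage idea above is to measure what fraction of train vectors show up in the top-k results of at least one query. The sketch below is only an illustration under stated assumptions: `neighbor_coverage` is a hypothetical helper, the data is synthetic, and it uses brute-force squared Euclidean distance, which may not match the metric of a given benchmark dataset.

```python
import numpy as np


def neighbor_coverage(train, queries, k):
    """Fraction of train rows that appear in the top-k of at least one query.

    Brute-force squared Euclidean distance (an assumption for this sketch).
    """
    if k > len(train):
        raise ValueError("k cannot exceed the number of train vectors")
    # Pairwise squared distances: |q|^2 - 2 q.t + |t|^2
    d2 = (
        (queries ** 2).sum(axis=1)[:, None]
        - 2.0 * queries @ train.T
        + (train ** 2).sum(axis=1)[None, :]
    )
    # Indices of the k nearest train vectors per query (order within k is arbitrary).
    topk = np.argpartition(d2, k - 1, axis=1)[:, :k]
    return np.unique(topk).size / len(train)


rng = np.random.default_rng(0)
train = rng.standard_normal((2_000, 8))
queries = train[rng.choice(len(train), size=500, replace=False)]

coverage_small = neighbor_coverage(train, queries[:50], k=10)
coverage_large = neighbor_coverage(train, queries, k=10)
print(f"50 queries:  {coverage_small:.1%} of train vectors covered")
print(f"500 queries: {coverage_large:.1%} of train vectors covered")
```

With 50 queries and k=10, at most 500 of the 2,000 train vectors can be covered, so a small query set cannot touch most of the data regardless of how it is chosen. Brute force is only practical at this toy scale; it is meant to show the measurement, not how to run it on millions of vectors.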

alwayslove2013 self-assigned this Oct 28, 2024

wahajali mentioned this issue Oct 31, 2024

HNSW QPS Degradation as Index Size Grows Beyond Memory pgvector/pgvector#700

Open
