Randomly pick start idx of test dataset in concurrency search. #377

Sheharyar570 · 2024-10-14T18:55:21Z

In the current implementation of concurrent search, all threads begin querying the test dataset at index 0 and proceed sequentially through it (looping multiple times). As a consequence, all threads send the same query to the database at almost the same time. To make the workload more realistic, this PR starts each thread's queries from a randomly picked index, so that different threads run different queries.
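For illustration, a minimal sketch of the idea, not the actual diff in this PR. The helper names `pick_start_idx` and `query_order` are hypothetical and do not exist in VectorDBBench; the sketch only assumes that each worker walks the test set sequentially and wraps around at the end.

```python
import random


def pick_start_idx(num_queries: int, rng: random.Random) -> int:
    """Pick a random starting position in the test dataset (hypothetical helper)."""
    return rng.randrange(num_queries)


def query_order(num_queries: int, start_idx: int, total: int) -> list[int]:
    """Indices a worker visits: begin at start_idx, advance by one, wrap around."""
    return [(start_idx + i) % num_queries for i in range(total)]


if __name__ == "__main__":
    # Each worker gets its own generator, so their starting points differ
    # with high probability instead of all being 0.
    for worker_id in range(4):
        rng = random.Random()
        start = pick_start_idx(1000, rng)
        print(worker_id, query_order(1000, start, 5))
```

Every worker still covers the whole test set when it loops; only the point at which it enters the sequence changes.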
Databases like Postgres rely on storing the data/index on disk and loading it into memory on demand. When memory is constrained and the index doesn’t fit in memory, the first query typically hits the table, but subsequent queries are served from the index cached in shared_buffers. This skews performance metrics by making it difficult to observe the true cost of querying uncached data.
Note that there is no ramp-up or pre-warming step in VectorDBBench. However, there is a serial search task during which recall is calculated, where the set of 1000 queries is executed (this is the same set that is queried during the concurrent search). Therefore, the relevant index should already be present in memory, provided there is enough memory. When memory is not large enough to hold the index, however, randomizing the idx will result in more realistic numbers.

In summary, if the index fits in memory, this approach and the previous one give the same QPS. However, if the index is larger than memory, QPS with these changes will decrease, because different threads execute different queries at the same time and, to accommodate the needed parts of the index in memory, other data has to be evicted.
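The effect can be illustrated with a toy model. This is only a sketch under strong assumptions: a small LRU cache keyed by query index stands in for the buffer cache, and workers run in lockstep. It does not model how Postgres actually manages shared_buffers, and the function `hit_rate` is hypothetical.

```python
from collections import OrderedDict


def hit_rate(start_indices, num_queries, capacity, steps):
    """Fraction of lookups served from a toy LRU cache.

    Each worker starts at its own index and advances by one per step;
    workers are interleaved in lockstep (a simplifying assumption).
    """
    cache = OrderedDict()
    hits = total = 0
    for step in range(steps):
        for start in start_indices:
            key = (start + step) % num_queries
            if key in cache:
                hits += 1
                cache.move_to_end(key)  # mark as most recently used
            else:
                cache[key] = True
                if len(cache) > capacity:
                    cache.popitem(last=False)  # evict least recently used
            total += 1
    return hits / total


if __name__ == "__main__":
    # Cache holds 10 of 100 entries, i.e. the "index" does not fit in memory.
    same = hit_rate([0, 0, 0, 0], num_queries=100, capacity=10, steps=100)
    spread = hit_rate([0, 25, 50, 75], num_queries=100, capacity=10, steps=100)
    print(same, spread)
```

In this model, four workers starting at the same index get three cache hits for every miss, while workers spread across the dataset get none, which is the kind of gap that hides the cost of uncached queries.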

Tested on:
Database: Postgres
Algo: pgvectorhnsw

Sheharyar570 · 2024-10-14T19:15:50Z

/assign @XuanYang-cn

sre-ci-robot · 2024-10-15T01:49:12Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: alwayslove2013, Sheharyar570
To complete the pull request process, please assign xuanyang-cn after the PR has been reviewed.
You can assign the PR to them by writing /assign @xuanyang-cn in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

alwayslove2013 · 2024-10-15T01:57:09Z

@Sheharyar570 Well done! Randomizing the start_idx can reduce cache hits, which will make the VectorDBBench tests more representative of real-world scenarios. You did an amazing job!

Randomly pick start idx of test dataset in concurrency search.

e395c90

alwayslove2013 approved these changes Oct 15, 2024

alwayslove2013 merged commit 5e7e438 into zilliztech:main Oct 15, 2024
4 checks passed

wahajali deleted the randomly-pick-start-idx-test-dataset branch October 23, 2024 14:41

wahajali mentioned this pull request Oct 25, 2024

Increase Size of Test Dataset During SEARCH_CONCURRENT Stage #385

Open