Randomly pick start idx of test dataset in concurrency search. #377

Merged

Sheharyar570
Contributor

In the current implementation of concurrent search, all threads begin querying the test dataset at index 0 and proceed sequentially through it, looping multiple times. As a consequence, all threads send the same query to the database at almost the same time. To make the workload more realistic, this PR starts each thread from a different, randomly picked index, so that different threads run different queries.
Databases like Postgres store data and indexes on disk and load them into memory on demand. When memory is constrained and the index doesn't fit in memory, the first execution of a query typically has to go to disk, but repeats of the same query are served from the index pages cached in shared_buffers. This skews performance metrics by making it difficult to observe the true cost of querying uncached data.
Note that there is no ramp-up or pre-warming step in VectorDBBench. However, there is a serial search task, during which recall is calculated, that executes the same set of 1000 queries used during the concurrent search. Therefore, the relevant index should already be in memory, provided there is enough memory. When memory is not large enough to hold the index, however, randomizing the idx will result in more realistic numbers.
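
For readers who want to see the idea in isolation, here is a minimal sketch of a per-thread query order with a random start index. It is not the code from this PR: `pick_start_idx` and `query_order` are hypothetical names used only for illustration, and the wrap-around via modulo is an assumption about how the looping is done.

```python
import random
from itertools import islice
from typing import Iterator


def pick_start_idx(num_queries: int, rng: random.Random) -> int:
    """Pick a random position in the test dataset for one worker (hypothetical helper)."""
    return rng.randrange(num_queries)


def query_order(num_queries: int, start_idx: int) -> Iterator[int]:
    """Yield query indices forever, starting at start_idx and wrapping around."""
    idx = start_idx
    while True:
        yield idx
        idx = (idx + 1) % num_queries


if __name__ == "__main__":
    rng = random.Random()
    # Each worker gets its own starting point instead of all starting at 0.
    for worker in range(4):
        start = pick_start_idx(1000, rng)
        first_three = list(islice(query_order(1000, start), 3))
        print(f"worker {worker}: start={start}, first queries={first_three}")
```

Each worker still walks the dataset sequentially and loops, so the per-thread query sequence is unchanged apart from its offset.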

In summary, if the index fits in memory, this approach and the previous one should give the same QPS. However, if the index is larger than memory, QPS with these changes will decrease, because different threads execute different queries at the same time and data must be evicted to accommodate the needed parts of the index in memory.
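
The cache argument can be illustrated with a toy model. The sketch below is not a Postgres or VectorDBBench measurement: it assumes a single shared LRU cache, that query `i` touches exactly one "page" `i`, and that workers advance in lock-step. `hit_rate` and all of its parameters are hypothetical.

```python
from collections import OrderedDict
from typing import Sequence


def hit_rate(start_indices: Sequence[int], num_queries: int,
             cache_size: int, steps: int) -> float:
    """Fraction of lookups served from a shared LRU cache (toy model)."""
    cache: "OrderedDict[int, bool]" = OrderedDict()
    hits = 0
    total = 0
    for step in range(steps):
        for start in start_indices:
            page = (start + step) % num_queries  # assumption: one page per query
            total += 1
            if page in cache:
                hits += 1
                cache.move_to_end(page)  # mark as most recently used
            else:
                cache[page] = True
                if len(cache) > cache_size:
                    cache.popitem(last=False)  # evict least recently used
    return hits / total


if __name__ == "__main__":
    # The cache holds 100 pages while the workload touches 1000 (index > memory).
    same = hit_rate([0, 0, 0, 0], num_queries=1000, cache_size=100, steps=1000)
    spread = hit_rate([0, 250, 500, 750], num_queries=1000, cache_size=100, steps=1000)
    print(f"same start: {same:.2f}, different starts: {spread:.2f}")
```

In this toy setup, three of every four lookups hit the cache when all workers start at index 0, because they repeat the query the first worker just issued. With spread-out starts, every page has been evicted by the time another worker reaches it. Real buffer managers and HNSW access patterns are far less regular, so only the direction of the effect carries over, not the numbers.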

Tested on:
Database: Postgres
Algo: pgvectorhnsw

@Sheharyar570
Contributor Author

/assign @XuanYang-cn

@sre-ci-robot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: alwayslove2013, Sheharyar570
To complete the pull request process, please assign xuanyang-cn after the PR has been reviewed.
You can assign the PR to them by writing /assign @xuanyang-cn in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@alwayslove2013
Collaborator

@Sheharyar570 Well done! Randomizing the start_idx can reduce cache hits, which will make the VectorDBBench tests more representative of real-world scenarios. You did an amazing job!

@alwayslove2013 alwayslove2013 merged commit 5e7e438 into zilliztech:main Oct 15, 2024
4 checks passed
@wahajali wahajali deleted the randomly-pick-start-idx-test-dataset branch October 23, 2024 14:41