Randomly pick start idx of test dataset in concurrency search. #377
In the current implementation of concurrent search, every thread starts querying the test dataset at index 0 and proceeds sequentially through it, looping multiple times. As a consequence, all threads send the same query to the database at almost the same time. To make the workload more realistic, this PR starts each thread at a randomly chosen index, so that different threads run different queries.
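The idea can be sketched as follows. This is a minimal illustration, not the code in this PR: `query_order` and its parameters are hypothetical names, and the actual VectorDBBench runner may structure its loop differently.

```python
import random


def query_order(num_queries, rng):
    """Return the order in which one worker visits the test queries.

    Hypothetical helper: the worker starts at a random index and then
    proceeds sequentially, wrapping around at the end of the dataset.
    """
    start = rng.randrange(num_queries)
    return [(start + i) % num_queries for i in range(num_queries)]


# Each worker gets its own random generator, so the workers are very
# likely to start at different indices and issue different queries.
worker_a = query_order(1000, random.Random(1))
worker_b = query_order(1000, random.Random(2))
```

Each worker still covers every test query exactly once per pass; only the starting point changes.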
Databases like Postgres store data and indexes on disk and load them into memory on demand. When memory is constrained and the index doesn’t fit in memory, the first query typically hits the table, but subsequent queries are served from the index pages cached in shared_buffers. This skews performance metrics by making it difficult to observe the true cost of querying uncached data.
Note that VectorDBBench has no ramp-up or pre-warming step. However, a serial search task, during which recall is calculated, executes the same set of 1000 queries that the concurrent search uses. The relevant index should therefore already be in memory, provided there is enough memory. When memory is not large enough to hold the index, randomizing the start index should give more realistic numbers.
In summary, if the index fits in memory, this approach and the previous one should give the same QPS. If the index is larger than memory, QPS with these changes will decrease, because different threads execute different queries at the same time and parts of the index have to be evicted to make room for others.
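The summary above can be illustrated with a toy cache model. This is a sketch under strong assumptions, not a model of Postgres: each query is assumed to touch exactly one "page", the cache is a plain LRU, threads run in lockstep, and the start offsets are evenly spaced rather than random so the result is reproducible. It reports cache hit rate, not QPS, and `hit_rate` is a hypothetical helper.

```python
from collections import OrderedDict


def hit_rate(start_offsets, num_queries, cache_capacity, rounds=3):
    """Cache hit rate of a toy LRU cache shared by all threads.

    Assumptions: query i touches only page i, and the threads advance
    in lockstep, one query per thread per step.
    """
    cache = OrderedDict()
    hits = total = 0
    for step in range(num_queries * rounds):
        for start in start_offsets:
            page = (start + step) % num_queries
            if page in cache:
                cache.move_to_end(page)  # mark as most recently used
                hits += 1
            else:
                cache[page] = True
                if len(cache) > cache_capacity:
                    cache.popitem(last=False)  # evict least recently used
            total += 1
    return hits / total


same_start = [0, 0, 0, 0]            # previous behaviour: all threads at index 0
spread_start = [0, 250, 500, 750]    # stand-in for random start indices

# Cache smaller than the working set: identical starts look much better.
small_same = hit_rate(same_start, 1000, cache_capacity=100)
small_spread = hit_rate(spread_start, 1000, cache_capacity=100)

# Cache large enough for everything: both patterns behave the same.
large_same = hit_rate(same_start, 1000, cache_capacity=1000)
large_spread = hit_rate(spread_start, 1000, cache_capacity=1000)
```

In this model, identical start indices let three of the four threads reuse the page the first thread just loaded, which is the effect that hides the cost of uncached reads.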
Tested on:
Database: Postgres
Algo: pgvectorhnsw