add Clickhouse Bench #356
base: main
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: gb198871 The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
Could you also provide some numbers so we can verify it?
@xiaofan-luan Do you need this data?
@@ -22,3 +22,4 @@ environs
 pydantic<v2
 scikit-learn
 pymilvus
+clickhouse_connect
This dependency also needs to be added to pyproject.toml:
all = [
...,
"clickhouse_connect"
]
clickhouse = [ "clickhouse_connect" ]
so that users can run "pip install vectordb-bench[all]" or "pip install vectordb-bench[clickhouse]" to install the dependencies from PyPI.
if filters:
    gt = filters.get("id")
    filterSql = f'SELECT id,cosineDistance(embedding,{query}) AS score FROM {self.db_config["dbname"]}.{self.table_name} \
        WHERE id > {gt} ORDER BY score LIMIT {k};'
    result = self.conn.query(filterSql).result_rows
    return [int(row[0]) for row in result]
else:
    selectSql = f'SELECT id,cosineDistance(embedding,{query}) AS score FROM {self.db_config["dbname"]}.{self.table_name} \
        ORDER BY score LIMIT {k};'
    result = self.conn.query(selectSql).result_rows
It is not recommended to hard-code the metric to cosine here. Although all the datasets used by vectordbbench are cosine at the moment, we may support more datasets in the future, possibly using L2 or IP. You can get the metric used for the current test case from self.case_config.
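One possible way to act on this suggestion is to map the metric to a ClickHouse distance expression before building the query. The sketch below is only illustrative: the MetricType enum is a hypothetical stand-in for the benchmark's own type, build_search_sql is an invented helper name, and it assumes the target ClickHouse version provides cosineDistance, L2Distance and dotProduct for array columns.

```python
from enum import Enum


class MetricType(str, Enum):
    # Hypothetical stand-in for the benchmark's own MetricType enum.
    L2 = "L2"
    COSINE = "COSINE"
    IP = "IP"


# Score expressions chosen so that a smaller score always means "closer".
_DISTANCE_EXPR = {
    MetricType.COSINE: "cosineDistance(embedding, {q})",
    MetricType.L2: "L2Distance(embedding, {q})",
    # Inner product is a similarity, so negate it to keep ascending order.
    MetricType.IP: "-dotProduct(embedding, {q})",
}


def build_search_sql(table, query, k, metric, id_gt=None):
    """Build a top-k search statement for the given metric (illustrative helper)."""
    # Render the query vector as a ClickHouse array literal.
    vec = "[" + ",".join(repr(float(x)) for x in query) + "]"
    score = _DISTANCE_EXPR[metric].format(q=vec)
    # Optional "id > N" filter, cast to int so only a number reaches the SQL.
    where = f" WHERE id > {int(id_gt)}" if id_gt is not None else ""
    return f"SELECT id, {score} AS score FROM {table}{where} ORDER BY score LIMIT {int(k)}"
```

In the PR's search method, the metric argument would come from self.case_config rather than being fixed, so the filtered and unfiltered branches could share this one helper.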
from typing import TypedDict
from pydantic import BaseModel, SecretStr
from ..api import DBConfig, DBCaseConfig, MetricType, IndexType


class ClickhouseConfig(DBConfig):
    user_name: SecretStr = "default"
    password: SecretStr
    host: str = "127.0.0.1"
    port: int = 30193
    db_name: str = "default"

    def to_dict(self) -> dict:
        user_str = self.user_name.get_secret_value()
        pwd_str = self.password.get_secret_value()
        return {
            "host": self.host,
            "port": self.port,
            "dbname": self.db_name,
            "user": user_str,
            "password": pwd_str
        }
I did not find any code related to ANN Index in config.py. Since your test results show that both recall and ndcg are equal to 1.0, I'm curious if clickhouse only supports brute-force for vector search.
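To illustrate why a recall of exactly 1.0 points toward an exhaustive scan, here is a minimal, self-contained sketch. The function names and the toy data are made up for illustration and are not the benchmark's own metric code: a brute-force search ranks every stored vector, so it reproduces the ground truth, whereas an approximate index would typically miss some neighbors.

```python
import math


def cosine_distance(a, b):
    # 1 - cosine similarity; assumes neither vector is all zeros.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def brute_force_top_k(vectors, query, k):
    # Exhaustive scan: score every stored vector, then keep the k closest ids.
    ranked = sorted(vectors, key=lambda vid: cosine_distance(vectors[vid], query))
    return ranked[:k]


def recall_at_k(retrieved, ground_truth, k):
    # Fraction of the true top-k neighbors that the search returned.
    return len(set(retrieved[:k]) & set(ground_truth[:k])) / k


# Toy data: id -> embedding (illustrative values only).
vectors = {0: [1.0, 0.0], 1: [0.9, 0.1], 2: [0.0, 1.0], 3: [-1.0, 0.0]}
query = [1.0, 0.0]
ground_truth = [0, 1]

exact = brute_force_top_k(vectors, query, k=2)
exact_recall = recall_at_k(exact, ground_truth, k=2)
# A hypothetical approximate result that misses one true neighbor.
approx_recall = recall_at_k([0, 2], ground_truth, k=2)
```

Here the exact search scores a recall of 1.0 while the hypothetical approximate result scores 0.5, which is why perfect recall and ndcg across a whole run suggest that no ANN index was involved.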
@gb198871 Thank you so much for your first PR contribution! I really appreciate you taking the time to work on this. I've left some comments on the PR with a few suggestions. We look forward to collaborating with you and continuing to improve the
No description provided.