Support cluster framework #448
42be3b1 to ee1562f
include/knowhere/kmeans.h
Outdated
namespace knowhere::kmeans {
namespace {

static constexpr int64_t MAX_TRAIN_SIZE = 10000000L * 700 * 4;

Where does this magic number come from? Also, please use `ULL`, not `L`, for uint64_t.

Updated.
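For reference, a minimal sketch of the suffix change requested above. It assumes the constant keeps the same three factors; their meaning is not stated in the thread, so they are left unnamed here.

```cpp
#include <cstdint>

// With ULL the multiplication is carried out in at least 64 bits on every platform;
// with L it is done in `long`, which is only 32 bits wide on some platforms.
static constexpr uint64_t MAX_TRAIN_SIZE = 10000000ULL * 700 * 4;

// The product is 28,000,000,000, which does not fit in 32 bits.
static_assert(MAX_TRAIN_SIZE > UINT32_MAX, "value needs a 64-bit type");
```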
include/knowhere/kmeans.h
Outdated
uint64_t npts, dim;
uint32_t npts_32, dim_32;
reader.read((char*)&npts_32, sizeof(uint32_t));
reader.read((char*)&dim_32, sizeof(uint32_t));

Why are the number of points and dim limited to 32 bits?

You are right, at least the number of points could be very large. Changing them to uint64_t.
include/knowhere/kmeans.h
Outdated
template <typename T>
inline bool
load_bin_file(const std::string& fname, std::unique_ptr<T[]>& data, uint64_t& offset) {

Do I understand correctly that this function loads the whole file into a given preallocated buffer at a given offset? If so, could you please add a comment on what exactly this function does? Otherwise it is somewhat confusing. Thanks.

Updated.
src/common/kmeans.cc
Outdated
template <typename VecT>
void
KMeans<VecT>::elkan_L2(const VecT* x, const VecT* y, size_t d, size_t nx, size_t ny, uint32_t* ids, float* val) {

My concern with Elkan is that it was MUCH slower than a regular implementation in my past experiments, up to 10x slower if I am not mistaken. Faiss uses the following BLAS call: https://github.com/facebookresearch/faiss/blob/dafdff110489db7587b169a0afee8470f220d295/faiss/utils/distances.cpp#L263
Would you consider providing a plain BLAS-based implementation as well?
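To illustrate what a BLAS-based variant computes, here is a minimal sketch built on the identity ||x - y||^2 = ||x||^2 + ||y||^2 - 2<x, y>. The inner-product loop is a plain stand-in for a single sgemm call so that the example stays self-contained; `nearest_by_norms` is a hypothetical name, and this is not the Faiss or knowhere implementation.

```cpp
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// For each of the nx rows of x, finds the nearest of the ny rows of y under squared L2.
// All inner products <x_i, y_j> would come from one sgemm call in a real BLAS version.
inline void
nearest_by_norms(const float* x, const float* y, size_t d, size_t nx, size_t ny, uint32_t* ids, float* val) {
    // Precompute ||y_j||^2 once; it is reused for every row of x.
    std::vector<float> y_norms(ny, 0.0f);
    for (size_t j = 0; j < ny; ++j) {
        for (size_t k = 0; k < d; ++k) {
            y_norms[j] += y[j * d + k] * y[j * d + k];
        }
    }
    for (size_t i = 0; i < nx; ++i) {
        float x_norm = 0.0f;
        for (size_t k = 0; k < d; ++k) {
            x_norm += x[i * d + k] * x[i * d + k];
        }
        float best = std::numeric_limits<float>::max();
        uint32_t best_id = 0;
        for (size_t j = 0; j < ny; ++j) {
            float ip = 0.0f;
            for (size_t k = 0; k < d; ++k) {
                ip += x[i * d + k] * y[j * d + k];
            }
            // Rounding can push the expression slightly below zero, so clamp it.
            float dist = x_norm + y_norms[j] - 2.0f * ip;
            if (dist < 0.0f) {
                dist = 0.0f;
            }
            if (dist < best) {
                best = dist;
                best_id = static_cast<uint32_t>(j);
            }
        }
        ids[i] = best_id;
        val[i] = best;
    }
}
```

The speedup in a real implementation comes from handing the inner-product part to an optimized matrix multiply; the norm terms add only O(nx + ny) extra work.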
src/common/kmeans.cc
Outdated
void
KMeans<VecT>::initRandom(const VecT* train_data, size_t n_train, uint32_t random_state) {
    std::unordered_set<uint32_t> picked;
    std::mt19937 rng(random_state);

Random state for std random number generators is uint64_t.
src/common/kmeans.cc
Outdated
size_t start_id = block_id * block_size;
size_t end_id = (std::min)((block_id + 1) * block_size, n_train);
for (size_t id = start_id; id < end_id; id++) {
    dist[id] = faiss::fvec_L2sqr(train_data + id * dim_, train_data + init_id * dim_, dim_);

Use `faiss::fvec_L2sqr_ny()`.
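A sketch of how the per-point loop collapses into one batched call per block. `l2sqr_ny` below is a local stand-in that follows, to my understanding, the argument order of `faiss::fvec_L2sqr_ny` (output array, query, contiguous candidates, dimension, count); `block_distances` is a hypothetical helper, and float data is assumed.

```cpp
#include <cstddef>

// Stand-in for the batched routine: writes the squared L2 distance from x
// to each of the ny contiguous rows of y into dis.
inline void
l2sqr_ny(float* dis, const float* x, const float* y, size_t d, size_t ny) {
    for (size_t j = 0; j < ny; ++j) {
        float acc = 0.0f;
        for (size_t k = 0; k < d; ++k) {
            const float diff = x[k] - y[j * d + k];
            acc += diff * diff;
        }
        dis[j] = acc;
    }
}

// Distances from one chosen point (init_id) to the block [start_id, end_id) of training points.
// This relies on the training points being stored contiguously, row by row.
inline void
block_distances(const float* train_data, size_t dim, size_t init_id, size_t start_id, size_t end_id, float* dist) {
    l2sqr_ny(dist + start_id, train_data + init_id * dim, train_data + start_id * dim, dim, end_id - start_id);
}
```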
src/common/kmeans.cc
Outdated
size_t start_id = block_id * block_size;
size_t end_id = (std::min)((block_id + 1) * block_size, n_train);
for (size_t id = start_id; id < end_id; id++) {
    dist[id] = faiss::fvec_L2sqr(train_data + id * dim_, train_data + init_id * dim_, dim_);

Use `faiss::fvec_L2sqr_ny()`.
src/common/kmeans.cc
Outdated
namespace knowhere::kmeans {

template <typename VecT>

Faiss has a `faiss::Clustering` class, just in case :)
src/common/kmeans.cc
Outdated
offset = 0;
for (int i = sample_files; i < file_paths.size(); i++) {
    uint64_t dumb = 0;

What a variable name :))) I'd use `dummy` in this case.
ee1562f to c45448a
cmake/libs/libfaiss.cmake
Outdated
@@ -110,7 +110,7 @@ if(__X86_64)
  -Wno-unused-function
  -Wno-strict-aliasing>)
  target_link_libraries(
- faiss PUBLIC OpenMP::OpenMP_CXX ${BLAS_LIBRARIES} ${LAPACK_LIBRARIES}
+ faiss PUBLIC OpenMP::OpenMP_CXX openblas ${BLAS_LIBRARIES} ${LAPACK_LIBRARIES}

I'm not sure why we need `openblas` here, because there is already `${BLAS_LIBRARIES}` O_o
src/common/kmeans.cc
Outdated
void
KMeans<VecT>::exhaustive_L2sqr_blas(const VecT* x, const VecT* y, size_t d, size_t nx, size_t ny, uint32_t* ids,
                                    float* val) {
    static_assert(std::is_same_v<VecT, float>, "sgemm only support float now");

What is the point of not just calling `faiss::exhaustive_L2sqr_blas()` here?
c45448a to 105d4fe
include/knowhere/kmeans.h
Outdated
fit(const VecT* vecs, size_t n, size_t max_iter = 10, uint32_t random_state = 0, std::string_view init = "random",
    std::string_view algorithm = "lloyd");

Can we turn all `string_view` into `enum class`?
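A minimal sketch of what the enum-based interface could look like. All names here (`InitMethod`, `Algorithm`, `FitOptions`, `parse_algorithm`) are hypothetical; the enumerators mirror the "random" and "lloyd" defaults visible in the hunk plus the Elkan variant discussed earlier in this review, and `KMeansPlusPlus` is an assumed extra option.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string_view>

enum class InitMethod { Random, KMeansPlusPlus };
enum class Algorithm { Lloyd, Elkan };

// Hypothetical option bundle replacing the two string_view parameters.
struct FitOptions {
    size_t max_iter = 10;
    uint32_t random_state = 0;
    InitMethod init = InitMethod::Random;
    Algorithm algorithm = Algorithm::Lloyd;
};

// Strings are still useful at the config boundary; parsing them once up front
// keeps the training loop free of string comparisons.
inline std::optional<Algorithm>
parse_algorithm(std::string_view name) {
    if (name == "lloyd") {
        return Algorithm::Lloyd;
    }
    if (name == "elkan") {
        return Algorithm::Elkan;
    }
    return std::nullopt;
}
```

With enums, an unsupported value is rejected where the config is parsed rather than in the middle of training, and a `switch` over the enum lets the compiler warn about unhandled cases.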
include/knowhere/dataset.h
Outdated
inline DataSetPtr
GenResultDataSet(const int64_t dim, const void* tensor, const int64_t rows, const void* centroid_id_mapping) {

Why do we need to stick with returning a `DataSetPtr` for this Kmeans API?
src/common/kmeans.cc
Outdated
for (size_t iter = 1; iter <= max_iter; ++iter) {
    if (algorithm == "lloyd") {
        auto loss = lloyds_iter(vecs, closest_docs, centroid_id_mapping_.get(), closest_centroid_distance.get(), n,
                                random_state, verbose_);

According to the API, the last parameter of `lloyds_iter` is `compute_residual`, so why do we pass `verbose_`?
src/common/kmeans.cc
Outdated
if (compute_residual) {
    for (size_t i = 0; i < n_train; ++i) {
        losses += closest_centroid_distance[i];
    }
}

Why do we do the loss computation when `compute_residual` is set?
src/common/kmeans.cc
Outdated
template <typename VecT>
float
KMeans<VecT>::lloyds_iter(const VecT* train_data, std::vector<std::vector<uint32_t>>& closest_docs,
                          uint32_t* closest_centroid, float* closest_centroid_distance, size_t n_train,

Why do we need to pass `closest_centroid_distance` in instead of creating it in this function?
src/common/kmeans.cc
Outdated
    }
    old_loss = loss;
} else {
    throw std::runtime_error(std::string("Algorithm: ") + std::string(algorithm) + " not supported yet.");

We don't throw exceptions. Use an error code instead.
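A minimal sketch of the error-code style asked for above. The `Status` enum and `run_algorithm` are hypothetical stand-ins; knowhere's actual status type and its enumerators may differ.

```cpp
#include <string_view>

// Hypothetical status codes used only for this illustration.
enum class Status { success, invalid_args };

inline Status
run_algorithm(std::string_view algorithm) {
    if (algorithm == "lloyd") {
        // ... the training iterations would run here ...
        return Status::success;
    }
    // Unsupported algorithm: report it through the return value instead of throwing,
    // and leave logging or message formatting to the caller.
    return Status::invalid_args;
}
```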
087146f to 477f58a
Make clustering (currently only one kmeans implementation) the same level as index: refactor some index-related .h and .cc files into the index folder, and do the same for clustering.
477f58a to 0b4e799
Codecov Report
All modified and coverable lines are covered by tests ✅

Additional details and impacted files
@@            Coverage Diff            @@
##           main    #448      +/-   ##
=========================================
+ Coverage      0   71.21%   +71.21%
=========================================
  Files         0       67      +67
  Lines         0     4387    +4387
=========================================
+ Hits          0     3124    +3124
- Misses        0     1263    +1263
namespace knowhere {

class KmeansConfig : public BaseConfig {

Too many KNN-related params in `BaseConfig`.
include/knowhere/dataset.h
Outdated
inline DataSetPtr
GenResultDataSet(const int64_t rows, const void* centroid_id_mapping) {
    auto ret_ds = std::make_shared<DataSet>();
    ret_ds->SetRows(rows);
    ret_ds->SetCentroidIdMapping(centroid_id_mapping);
    ret_ds->SetIsOwner(true);
    return ret_ds;
}

Let's move this Genxxx function to the user side.

We could actually reuse the current dataset.
include/knowhere/dataset.h
Outdated
@@ -162,6 +168,17 @@ class DataSet : public std::enable_shared_from_this<const DataSet> {
          return nullptr;
      }
+
+     const void*
+     GetCentroidIdMapping() const {

And these getters too. We can have an index_dataset_util and another cluster_dataset_util to gather them.
#define CLUSTERING_H

#include "knowhere/binaryset.h"
#include "knowhere/clustering/clustering_node.h"

It is a little bit weird to make `clustering` = `index`. A better name is needed, something like `cluster` and `cluster_operator`. These are just examples for reference.

Let's go with `cluster`.
Assign(const DataSet& dataset, const Config& cfg) = 0;

// return centroids, must be called after trained
virtual expected<DataSetPtr>

If we use a general object like `DataSet` as input/output, I would suggest adding more comments to declare what we need inside this object. Or we can avoid using it altogether.
src/clustering/kmeans/kmeans.cc
Outdated
template <typename DataType>
expected<DataSetPtr>
KmeansClusteringNode<DataType>::Assign(const DataSet& dataset, const Config& cfg) {

`cfg` is not used. Let's either mark it as `/* unused */` in the signature or get rid of this param.
src/clustering/kmeans/kmeans.cc
Outdated
}
auto rows = dataset.GetRows();
auto vecs = dataset.GetTensor();
knowhere::TimeRecorder build_time("Kmeans assign cost", 2);

Let's do a dim check here.
src/clustering/kmeans/kmeans.cc
Outdated
    elkan_L2(vecs + start * dim_, centroids, dim_, end - start, num_clusters_, closest_centroid + start,
             closest_centroid_distance + start);
}

Looks like this is for float. If we don't support other data types, let's simply return an error.

Remove Elkan.

The cluster factory already limits the data type to float, and there is already a static assert.
src/clustering/kmeans/kmeans.cc
Outdated
for (auto& future : futures) {
    future.wait();
}

Async calls need to handle errors.
src/clustering/kmeans/kmeans.cc
Outdated
for (auto& future : futures) {
    future.wait();
}

Same here, let's handle the error.
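One way to surface task failures, sketched with `std::future` as a stand-in for whatever future type the thread pool actually returns. `TaskStatus` and `wait_all` are hypothetical names. The point is that `get()` rethrows an exception raised inside the task, while `wait()` alone leaves it unobserved.

```cpp
#include <future>
#include <vector>

// Hypothetical status codes used only for this illustration.
enum class TaskStatus { success, internal_error };

// Waits for every task and reports a failure if any of them threw.
// All futures are drained even after the first failure, so no task is left running unobserved.
inline TaskStatus
wait_all(std::vector<std::future<void>>& futures) {
    TaskStatus result = TaskStatus::success;
    for (auto& future : futures) {
        try {
            future.get();
        } catch (...) {
            result = TaskStatus::internal_error;
        }
    }
    return result;
}
```

The caller can then convert the returned status into the project's error type instead of letting an exception escape from the worker threads.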
0b4e799 to b5af672
b122b8b to 47b89c9
47b89c9 to 5e54053
5e54053 to 44fdd4e
44fdd4e to e312f51
/unhold
include/knowhere/comp/index_param.h
Outdated
@@ -61,6 +65,7 @@ constexpr const char* INDEX_ENGINE_VERSION = "index_engine_version";
  constexpr const char* RETRIEVE_FRIENDLY = "retrieve_friendly";
  constexpr const char* DIM = "dim";
  constexpr const char* TENSOR = "tensor";
+ constexpr const char* CENTROID_ID_MAPPING = "centroid_id_mapping";

Remove redundant
include/knowhere/comp/index_param.h
Outdated
@@ -82,6 +87,7 @@ constexpr const char* TRACE_FLAGS = "trace_flags";
  constexpr const char* MATERIALIZED_VIEW_SEARCH_INFO = "materialized_view_search_info";
  constexpr const char* MATERIALIZED_VIEW_OPT_FIELDS_PATH = "opt_fields_path";
  constexpr const char* MAX_EMPTY_RESULT_BUCKETS = "max_empty_result_buckets";
+ constexpr const char* NUM_CLUSTERS = "num_clusters";

Only used by UT
Signed-off-by: chasingegg <[email protected]>
e312f51 to c39d17a
/lgtm
/approve
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: chasingegg, liliu-z
issue #444
/kind feature
/hold