Support cluster framework #448

chasingegg · 2024-03-07T12:24:48Z

issue #444
/kind feature
/hold

alexanderguzhva · 2024-03-07T15:06:01Z

include/knowhere/kmeans.h

+namespace knowhere::kmeans {
+namespace {
+
+static constexpr int64_t MAX_TRAIN_SIZE = 10000000L * 700 * 4;


where does this magic number come from?
also, please use ULL, not L for uint64_t
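One way to address both points is to spell out the factors as named constants. This is only a sketch: the thread does not say what the three factors mean, so the names below (points, dimension, bytes per value) are a guessed reading and purely hypothetical.

```cpp
#include <cstdint>

// Hypothetical names: a guessed reading of 10000000 * 700 * 4,
// which the original diff does not document.
constexpr uint64_t kMaxTrainPoints = 10'000'000ULL;  // assumed: number of vectors
constexpr uint64_t kMaxTrainDim = 700ULL;            // assumed: vector dimension
constexpr uint64_t kBytesPerValue = 4ULL;            // assumed: 4-byte float

// ULL suffixes keep the whole product in 64-bit unsigned arithmetic.
constexpr uint64_t MAX_TRAIN_SIZE = kMaxTrainPoints * kMaxTrainDim * kBytesPerValue;

static_assert(MAX_TRAIN_SIZE == 28'000'000'000ULL, "same value as the literal in the diff");
```

Whatever the real meaning of the factors is, naming them makes the limit self-documenting.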

alexanderguzhva · 2024-03-07T15:10:25Z

include/knowhere/kmeans.h

+    uint64_t npts, dim;
+    uint32_t npts_32, dim_32;
+    reader.read((char*)&npts_32, sizeof(uint32_t));
+    reader.read((char*)&dim_32, sizeof(uint32_t));


why are the number of points and dim limited to 32 bits?

You are right, at least the number of points could be very large; changing them to uint64_t.

alexanderguzhva · 2024-03-07T15:13:08Z

include/knowhere/kmeans.h

+
+template <typename T>
+inline bool
+load_bin_file(const std::string& fname, std::unique_ptr<T[]>& data, uint64_t& offset) {


Do I understand correctly that this function loads the whole file into a given preallocated buffer at a given offset? If so, could you please add a comment on what exactly this function does? Otherwise it is somewhat confusing. Thanks.
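A doc comment along the following lines might answer the question. The sketch is not the PR's `load_bin_file`: `load_bin_stream` and its `capacity` parameter are hypothetical, it reads from a `std::istream` so it can be tried without a file, and the header layout (two 32-bit fields) is taken from the diff quoted earlier in this thread.

```cpp
#include <cstdint>
#include <istream>
#include <sstream>  // only so the sketch can be tried on an in-memory stream

// Hypothetical helper, not the PR's load_bin_file.
//
// Reads one "bin" stream and appends its payload to a preallocated buffer.
//   layout   : uint32 npts, uint32 dim, then npts * dim values of type T
//   data     : caller-owned buffer with room for `capacity` elements
//   offset   : in/out element index; payload is written at data[offset...],
//              and offset is advanced by npts * dim on success
// Returns false (leaving offset unchanged) on a short read or if the payload
// would not fit into the buffer.
template <typename T>
bool
load_bin_stream(std::istream& in, T* data, uint64_t capacity, uint64_t& offset) {
    uint32_t npts_32 = 0;
    uint32_t dim_32 = 0;
    if (!in.read(reinterpret_cast<char*>(&npts_32), sizeof(npts_32)) ||
        !in.read(reinterpret_cast<char*>(&dim_32), sizeof(dim_32))) {
        return false;
    }
    // Widen before multiplying so the element count cannot overflow 32 bits.
    const uint64_t count = static_cast<uint64_t>(npts_32) * static_cast<uint64_t>(dim_32);
    if (offset > capacity || count > capacity - offset) {
        return false;
    }
    if (!in.read(reinterpret_cast<char*>(data + offset), static_cast<std::streamsize>(count * sizeof(T)))) {
        return false;
    }
    offset += count;
    return true;
}
```

Calling it once per file with the same `offset` variable concatenates the payloads back to back in the buffer.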

alexanderguzhva · 2024-03-07T15:18:35Z

src/common/kmeans.cc

+
+template <typename VecT>
+void
+KMeans<VecT>::elkan_L2(const VecT* x, const VecT* y, size_t d, size_t nx, size_t ny, uint32_t* ids, float* val) {


My concern with Elkan is that it was MUCH slower than a regular implementation in my past experiments, up to 10x slower if I am not mistaken. Faiss uses the following BLAS call: https://github.com/facebookresearch/faiss/blob/dafdff110489db7587b169a0afee8470f220d295/faiss/utils/distances.cpp#L263

Would you consider providing a plain BLAS-based implementation as well?
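For context, BLAS-based nearest-centroid search typically relies on the identity ||x - y||^2 = ||x||^2 + ||y||^2 - 2<x, y>, so that all inner products can come from one matrix multiplication. The sketch below only illustrates that decomposition with plain loops. The function name is hypothetical, and the inner-product loop is the part an `sgemm` call would replace in a real implementation.

```cpp
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// Hypothetical illustration: for each of the nx rows of x, find the nearest
// of the ny rows of y using ||x||^2 + ||y||^2 - 2<x, y>.
inline void
nearest_l2sqr_decomposed(const float* x, const float* y, std::size_t d, std::size_t nx, std::size_t ny,
                         uint32_t* ids, float* val) {
    // Squared norms of y are computed once and reused for every x.
    std::vector<float> y_norms(ny, 0.0f);
    for (std::size_t j = 0; j < ny; ++j) {
        for (std::size_t k = 0; k < d; ++k) {
            y_norms[j] += y[j * d + k] * y[j * d + k];
        }
    }
    for (std::size_t i = 0; i < nx; ++i) {
        float x_norm = 0.0f;
        for (std::size_t k = 0; k < d; ++k) {
            x_norm += x[i * d + k] * x[i * d + k];
        }
        float best = std::numeric_limits<float>::max();
        uint32_t best_id = 0;
        for (std::size_t j = 0; j < ny; ++j) {
            // This inner product is what a single sgemm would produce for all (i, j).
            float ip = 0.0f;
            for (std::size_t k = 0; k < d; ++k) {
                ip += x[i * d + k] * y[j * d + k];
            }
            float dist = x_norm + y_norms[j] - 2.0f * ip;
            if (dist < 0.0f) {
                dist = 0.0f;  // rounding can push the result slightly below zero
            }
            if (dist < best) {
                best = dist;
                best_id = static_cast<uint32_t>(j);
            }
        }
        ids[i] = best_id;
        val[i] = best;
    }
}
```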

alexanderguzhva · 2024-03-07T15:19:48Z

src/common/kmeans.cc

+void
+KMeans<VecT>::initRandom(const VecT* train_data, size_t n_train, uint32_t random_state) {
+    std::unordered_set<uint32_t> picked;
+    std::mt19937 rng(random_state);


random state for std random number generators is uint64_t

alexanderguzhva · 2024-03-07T15:22:04Z

src/common/kmeans.cc

+            size_t start_id = block_id * block_size;
+            size_t end_id = (std::min)((block_id + 1) * block_size, n_train);
+            for (size_t id = start_id; id < end_id; id++) {
+                dist[id] = faiss::fvec_L2sqr(train_data + id * dim_, train_data + init_id * dim_, dim_);


use faiss::fvec_L2sqr_ny()
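To illustrate the suggestion: a one-to-many routine computes the squared distances from a single vector to a contiguous batch in one call, so the per-id loop in the diff collapses to one call per block. The sketch below is a plain reference version under a hypothetical name, not the faiss function itself. To my understanding faiss takes the arguments in the order (dis, x, y, d, ny); please check the faiss header in use before relying on that.

```cpp
#include <cstddef>

// Hypothetical reference version of a one-to-many squared L2 routine.
// dis[j] receives the squared distance between x (d floats) and the j-th
// of ny vectors stored contiguously in y.
inline void
l2sqr_ny_reference(float* dis, const float* x, const float* y, std::size_t d, std::size_t ny) {
    for (std::size_t j = 0; j < ny; ++j) {
        float acc = 0.0f;
        for (std::size_t k = 0; k < d; ++k) {
            const float diff = x[k] - y[j * d + k];
            acc += diff * diff;
        }
        dis[j] = acc;
    }
}
```

In the loop quoted above, the single vector would be the one at `init_id`, the batch would start at `train_data + start_id * dim_`, and the output would start at `dist + start_id`, with `end_id - start_id` vectors per block.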

alexanderguzhva · 2024-03-07T15:22:30Z

src/common/kmeans.cc

+                size_t start_id = block_id * block_size;
+                size_t end_id = (std::min)((block_id + 1) * block_size, n_train);
+                for (size_t id = start_id; id < end_id; id++) {
+                    dist[id] = faiss::fvec_L2sqr(train_data + id * dim_, train_data + init_id * dim_, dim_);


use faiss::fvec_L2sqr_ny()

alexanderguzhva · 2024-03-07T15:23:40Z

src/common/kmeans.cc

+
+namespace knowhere::kmeans {
+
+template <typename VecT>


faiss has faiss::Clustering class, just in case :)

alexanderguzhva · 2024-03-07T15:24:52Z

src/common/kmeans.cc

+
+        offset = 0;
+        for (int i = sample_files; i < file_paths.size(); i++) {
+            uint64_t dumb = 0;


what a variable name :))) I'd use dummy in this case

alexanderguzhva · 2024-03-12T00:39:39Z

cmake/libs/libfaiss.cmake

@@ -110,7 +110,7 @@ if(__X86_64)
            -Wno-unused-function
            -Wno-strict-aliasing>)
  target_link_libraries(
-    faiss PUBLIC OpenMP::OpenMP_CXX ${BLAS_LIBRARIES} ${LAPACK_LIBRARIES}
+    faiss PUBLIC OpenMP::OpenMP_CXX openblas ${BLAS_LIBRARIES} ${LAPACK_LIBRARIES}


I'm not sure why we need openblas here, because there is already ${BLAS_LIBRARIES} O_o

alexanderguzhva · 2024-03-12T00:44:12Z

src/common/kmeans.cc

+void
+KMeans<VecT>::exhaustive_L2sqr_blas(const VecT* x, const VecT* y, size_t d, size_t nx, size_t ny, uint32_t* ids,
+                                    float* val) {
+    static_assert(std::is_same_v<VecT, float>, "sgemm only support float now");


What is the point of not just calling faiss::exhaustive_L2sqr_blas() here?

liliu-z · 2024-04-07T02:38:38Z

include/knowhere/kmeans.h

+    fit(const VecT* vecs, size_t n, size_t max_iter = 10, uint32_t random_state = 0, std::string_view init = "random",
+        std::string_view algorithm = "lloyd");


Can we turn all string_view into enum class?
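A minimal sketch of what that could look like, assuming the only modes are the ones visible in this thread ("random" init, "lloyd" and Elkan). The enum and function names are hypothetical. String parsing is kept at the config boundary so that `fit` itself would only ever see typed values.

```cpp
#include <optional>
#include <string_view>

// Hypothetical enums; further modes would be added as enumerators.
enum class InitMethod { Random };
enum class Algorithm { Lloyd, Elkan };

// Converts a user-supplied name into an Algorithm, or std::nullopt when the
// name is unknown, so a typo is caught once at the boundary.
inline std::optional<Algorithm>
parse_algorithm(std::string_view name) {
    if (name == "lloyd") {
        return Algorithm::Lloyd;
    }
    if (name == "elkan") {
        return Algorithm::Elkan;
    }
    return std::nullopt;
}
```

With typed parameters, a `switch` over `Algorithm` lets the compiler warn about unhandled modes, which string comparison cannot do.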

liliu-z · 2024-04-07T03:04:50Z

include/knowhere/dataset.h

+inline DataSetPtr
+GenResultDataSet(const int64_t dim, const void* tensor, const int64_t rows, const void* centroid_id_mapping) {


Why do we need to stick to returning a DataSetPtr for this KMeans API?

liliu-z · 2024-04-07T04:19:02Z

src/common/kmeans.cc

+    for (size_t iter = 1; iter <= max_iter; ++iter) {
+        if (algorithm == "lloyd") {
+            auto loss = lloyds_iter(vecs, closest_docs, centroid_id_mapping_.get(), closest_centroid_distance.get(), n,
+                                    random_state, verbose_);


According to the API, the last parameter of lloyds_iter is compute_residual, so why do we pass verbose_?

liliu-z · 2024-04-07T04:19:58Z

src/common/kmeans.cc

+    if (compute_residual) {
+        for (size_t i = 0; i < n_train; ++i) {
+            losses += closest_centroid_distance[i];
+        }
+    }


Why do we compute the loss only when compute_residual is set?

liliu-z · 2024-04-07T04:25:36Z

src/common/kmeans.cc

+template <typename VecT>
+float
+KMeans<VecT>::lloyds_iter(const VecT* train_data, std::vector<std::vector<uint32_t>>& closest_docs,
+                          uint32_t* closest_centroid, float* closest_centroid_distance, size_t n_train,


Why do we need to pass closest_centroid_distance in instead of creating it inside this function?

liliu-z · 2024-04-07T08:05:18Z

src/common/kmeans.cc

+            }
+            old_loss = loss;
+        } else {
+            throw std::runtime_error(std::string("Algorithm: ") + std::string(algorithm) + " not supported yet.");


We don't throw exceptions. Use an error code instead.
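A rough sketch of the error-code style, assuming a status enum is returned to the caller. `Status` here is a hypothetical stand-in rather than the project's actual type, and the loop body is elided.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical stand-in for the project's status type.
enum class Status { success, not_implemented };

// Runs up to max_iter iterations and reports unsupported algorithms through
// the return value instead of throwing. iters_run counts completed iterations.
inline Status
run_iterations(std::string_view algorithm, std::size_t max_iter, std::size_t& iters_run) {
    iters_run = 0;
    for (std::size_t iter = 1; iter <= max_iter; ++iter) {
        if (algorithm == "lloyd") {
            // ... one Lloyd iteration would go here ...
            ++iters_run;
        } else {
            return Status::not_implemented;
        }
    }
    return Status::success;
}
```

The caller then checks the returned status and propagates it upward, which keeps the failure path explicit at every call site.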

chasingegg · 2024-04-17T08:02:41Z

Make clustering (currently only one kmeans implementation) the same level as index: refactor some index-related .h and .cc files into the index folder, and do the same for clustering.

codecov · 2024-04-17T08:42:38Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 71.21%. Comparing base (3c46f4c) to head (c39d17a).
Report is 12 commits behind head on main.

Additional details and impacted files

@@            Coverage Diff            @@
##           main     #448       +/-   ##
=========================================
+ Coverage      0   71.21%   +71.21%     
=========================================
  Files         0       67       +67     
  Lines         0     4387     +4387     
=========================================
+ Hits          0     3124     +3124     
- Misses        0     1263     +1263

see 67 files with indirect coverage changes

liliu-z · 2024-04-17T09:12:00Z

src/clustering/kmeans/kmeans_config.h

+
+namespace knowhere {
+
+class KmeansConfig : public BaseConfig {


Too many KNN-related params in BaseConfig.

liliu-z · 2024-04-17T09:20:46Z

include/knowhere/dataset.h

+inline DataSetPtr
+GenResultDataSet(const int64_t rows, const void* centroid_id_mapping) {
+    auto ret_ds = std::make_shared<DataSet>();
+    ret_ds->SetRows(rows);
+    ret_ds->SetCentroidIdMapping(centroid_id_mapping);
+    ret_ds->SetIsOwner(true);
+    return ret_ds;
+}
+


Let's move this Genxxx function to the user side.

we could reuse current dataset actually

liliu-z · 2024-04-17T09:22:14Z

include/knowhere/dataset.h

@@ -162,6 +168,17 @@ class DataSet : public std::enable_shared_from_this<const DataSet> {
        return nullptr;
    }

+    const void*
+    GetCentroidIdMapping() const {


Same for these getters. We can have an index_dataset_util and a separate cluster_dataset_util to gather them.

liliu-z · 2024-04-17T09:27:51Z

include/knowhere/clustering/clustering.h

+#define CLUSTERING_H
+
+#include "knowhere/binaryset.h"
+#include "knowhere/clustering/clustering_node.h"


It is a little bit weird to make clustering = index. A better name is needed, something like cluster and cluster_operator. These are just examples for reference.

Let's go with cluster

liliu-z · 2024-04-17T09:30:01Z

include/knowhere/clustering/clustering_node.h

+    Assign(const DataSet& dataset, const Config& cfg) = 0;
+
+    // return centroids, must be called after trained
+    virtual expected<DataSetPtr>


If we use a general object like DataSet as input/output, I suggest adding more comments that state what we need inside this object. Or we can avoid using it altogether.

liliu-z · 2024-04-17T09:34:13Z