Merge pull request #34 from Jaybro/kd_forest
kd_forest rework
Jaybro authored Sep 7, 2023
2 parents 9ac48f3 + c05db63 commit 46699e6
Showing 30 changed files with 731 additions and 414 deletions.
39 changes: 19 additions & 20 deletions CMakeLists.txt
@@ -6,7 +6,7 @@ include(${CMAKE_CURRENT_SOURCE_DIR}/cmake/utils.cmake)

project(pico_tree
LANGUAGES CXX
VERSION 0.8.1
VERSION 0.8.2
DESCRIPTION "PicoTree is a C++ header only library for fast nearest neighbor searches and range searches using a KdTree."
HOMEPAGE_URL "https://github.com/Jaybro/pico_tree")

@@ -23,16 +23,19 @@ add_subdirectory(src)
# Ignored when running cmake from setup.py using scikit-build.
if(NOT SKBUILD)
option(BUILD_EXAMPLES "Enable the creation of PicoTree examples." ON)
message(STATUS "BUILD_EXAMPLES: ${BUILD_EXAMPLES}")

if(BUILD_EXAMPLES)
add_subdirectory(examples)
endif()

include(CTest)
find_package(GTest QUIET)

if(BUILD_TESTING)
if(GTEST_FOUND)
if(GTEST_FOUND)
include(CTest)
message(STATUS "BUILD_TESTING: ${BUILD_TESTING}")

if(BUILD_TESTING)
# Tests are dependent on some common code.
# For now, the understory is considered important enough to be tested.
if(NOT TARGET pico_toolshed)
@@ -42,28 +45,24 @@ if(NOT SKBUILD)

enable_testing()
add_subdirectory(test)
message(STATUS "GTest found. Building unit tests.")
else()
message(STATUS "GTest not found. Unit tests will not be build.")
endif()
else()
message(STATUS "GTest not found. Unit tests cannot be build.")
endif()

find_package(Doxygen QUIET)
option(BUILD_DOCS "Build documentation with Doxygen." ON)

if(BUILD_DOCS)
if(DOXYGEN_FOUND)
set(DOC_TARGET_NAME ${PROJECT_NAME}_doc)
if(DOXYGEN_FOUND)
set(DOC_TARGET_NAME ${PROJECT_NAME}_doc)

# Hide the internal namespace from the documentation.
# set(DOXYGEN_EXCLUDE_SYMBOLS "internal")
doxygen_add_docs(
${DOC_TARGET_NAME}
src/pico_tree)
# Hide the internal namespace from the documentation.
# set(DOXYGEN_EXCLUDE_SYMBOLS "internal")
doxygen_add_docs(
${DOC_TARGET_NAME}
src/pico_tree)

message(STATUS "Doxygen found. To build the documentation: cmake --build . --target ${DOC_TARGET_NAME}")
else()
message(STATUS "Doxygen not found. Documentation cannot be build.")
endif()
message(STATUS "Doxygen found. To build the documentation: cmake --build . --target ${DOC_TARGET_NAME}")
else()
message(STATUS "Doxygen not found. Documentation cannot be build.")
endif()
endif()
18 changes: 10 additions & 8 deletions README.md
@@ -11,7 +11,7 @@ PicoTree is a C++ header only library with [Python bindings](https://github.com/
| [Scikit-learn KDTree][skkd] 1.2.2 | ... | 6.2s | ... | 42.2s |
| [pykdtree][pykd] 1.3.7 | ... | 1.0s | ... | 6.6s |
| [OpenCV FLANN][cvfn] 4.6.0 | 1.9s | ... | 4.7s | ... |
| PicoTree KdTree v0.8.1 | 0.9s | 1.0s | 2.8s | 3.1s |
| PicoTree KdTree v0.8.2 | 0.9s | 1.0s | 2.8s | 3.1s |

Two [LiDAR](./docs/benchmark.md) based point clouds of sizes 7733372 and 7200863 were used to generate these numbers. The first point cloud was the input to the build algorithm and the second to the query algorithm. All benchmarks were run on a single thread with the following parameters: `max_leaf_size=10` and `knn=1`. A more detailed [C++ comparison](./docs/benchmark.md) of PicoTree is available with respect to [nanoflann][nano].

@@ -61,7 +61,7 @@ PicoTree can interface with different types of points and point sets through tra
* Creating a [custom search visitor](./examples/kd_tree/kd_tree_custom_search_visitor.cpp).
* [Saving and loading](./examples/kd_tree/kd_tree_save_and_load.cpp) a KdTree to and from a file.
* Support for [Eigen](./examples/eigen/eigen.cpp) and [OpenCV](./examples/opencv/opencv.cpp) data types.
* Running the KdTree on the [MNIST](./examples/mnist/mnist.cpp) [database](http://yann.lecun.com/exdb/mnist/).
* [Running the KdTree and KdForest](./examples/kd_forest/kd_forest.cpp) on the [MNIST](http://yann.lecun.com/exdb/mnist/) and [SIFT](http://corpus-texmex.irisa.fr/) datasets.
* How to use the [KdTree with Python](./examples/python/kd_tree.py).

# Requirements
@@ -113,9 +113,11 @@ $ pip install ./pico_tree

# References

* [Computational Geometry - Algorithms and Applications.](https://www.springer.com/gp/book/9783540779735) Mark de Berg, Otfried Cheong, Marc van Kreveld, and Mark Overmars, Springer-Verlag, third edition, 2008.
* S. Maneewongvatana and D. M. Mount. [It's okay to be skinny, if your friends are fat.](http://www.cs.umd.edu/~mount/Papers/cgc99-smpack.pdf) 4th Annual CGC Workshop on Computational Geometry, 1999.
* S. Arya and H. Y. Fu. [Expected-case complexity of approximate nearest neighbor searching.](https://www.cse.ust.hk/faculty/arya/pub/exp.pdf) InProceedings of the 11th ACM-SIAM Symposium on Discrete Algorithms, 2000.
* S. Arya and D. M. Mount. [Algorithms for fast vector quantization.](https://www.cs.umd.edu/~mount/Papers/DCC.pdf) In IEEE Data Compression Conference, pages 381–390, March 1993.
* N. Sample, M. Haines, M. Arnold and T. Purcell. [Optimizing Search Strategies in k-d Trees.](http://infolab.stanford.edu/~nsample/pubs/samplehaines.pdf) In: 5th WSES/IEEE World Multiconference on Circuits, Systems, Communications & Computers (CSCC 2001), July 2001.
* A. Yershova and S. M. LaValle, [Improving Motion-Planning Algorithms by Efficient Nearest-Neighbor Searching.](http://msl.cs.uiuc.edu/~lavalle/papers/YerLav06.pdf) In IEEE Transactions on Robotics, vol. 23, no. 1, pp. 151-157, Feb. 2007.
* J. L. Bentley, [Multidimensional binary search trees used for associative searching](https://dl.acm.org/doi/pdf/10.1145/361002.361007), Communications of the ACM, vol. 18, no. 9, pp. 509–517, 1975.
* S. Arya and D. M. Mount, [Algorithms for fast vector quantization](https://www.cs.umd.edu/~mount/Papers/DCC.pdf), In IEEE Data Compression Conference, pp. 381–390, March 1993.
* S. Maneewongvatana and D. M. Mount, [It's okay to be skinny, if your friends are fat](http://www.cs.umd.edu/~mount/Papers/cgc99-smpack.pdf), 4th Annual CGC Workshop on Computational Geometry, 1999.
* S. Arya and H. Y. Fu, [Expected-case complexity of approximate nearest neighbor searching](https://www.cse.ust.hk/faculty/arya/pub/exp.pdf), InProceedings of the 11th ACM-SIAM Symposium on Discrete Algorithms, 2000.
* N. Sample, M. Haines, M. Arnold and T. Purcell, [Optimizing Search Strategies in k-d Trees](http://infolab.stanford.edu/~nsample/pubs/samplehaines.pdf), In: 5th WSES/IEEE World Multiconference on Circuits, Systems, Communications & Computers (CSCC 2001), July 2001.
* A. Yershova and S. M. LaValle, [Improving Motion-Planning Algorithms by Efficient Nearest-Neighbor Searching](http://msl.cs.uiuc.edu/~lavalle/papers/YerLav06.pdf), In IEEE Transactions on Robotics, vol. 23, no. 1, pp. 151-157, Feb. 2007.
* M. de Berg, O. Cheong, M. van Kreveld, and M. Overmars, [Computational Geometry - Algorithms and Applications](https://www.springer.com/gp/book/9783540779735), Springer-Verlag, third edition, 2008.
* C. Silpa-Anan and R. Hartley, [Optimised KD-trees for fast image descriptor matching](http://vigir.missouri.edu/~gdesouza/Research/Conference_CDs/IEEE_CVPR_2008/data/papers/298.pdf), In CVPR, 2008.
2 changes: 1 addition & 1 deletion docs/benchmark.md
@@ -4,7 +4,7 @@ One of the PicoTree examples contains [benchmarks](../examples/benchmark/) of di

The results described in this document were generated on 29-08-2021 using MinGW GCC 10.3, PicoTree v0.7.4 and Nanoflann v1.3.2.

Note: The performance of PicoTree v0.8.0 released on 30-6-2023 is identical to that of v0.7.4. However, the build algorithm of nanoflann v1.5.0 regressed and has become 90% slower.
Note: The performance of PicoTree v0.8.2 released on 07-09-2023 is identical to that of v0.7.4. However, the build algorithm of nanoflann v1.5.0 regressed and has become 90% slower.

# Data sets

6 changes: 2 additions & 4 deletions examples/CMakeLists.txt
@@ -8,6 +8,8 @@ add_subdirectory(pico_understory)

add_subdirectory(kd_tree)

add_subdirectory(kd_forest)

find_package(Eigen3 QUIET)

if(Eigen3_FOUND)
@@ -35,10 +37,6 @@ else()
message(STATUS "benchmark not found. PicoTree benchmarks skipped.")
endif()

if(Eigen3_FOUND)
add_subdirectory(mnist)
endif()

# The Python examples only get copied when the bindings module will be build.
if(TARGET _pyco_tree)
add_subdirectory(python)
3 changes: 3 additions & 0 deletions examples/kd_forest/CMakeLists.txt
@@ -0,0 +1,3 @@
add_executable(kd_forest kd_forest.cpp)
set_default_target_properties(kd_forest)
target_link_libraries(kd_forest PUBLIC pico_toolshed pico_understory)
92 changes: 92 additions & 0 deletions examples/kd_forest/kd_forest.cpp
@@ -0,0 +1,92 @@
#include <iostream>
#include <pico_toolshed/format/format_bin.hpp>
#include <pico_toolshed/scoped_timer.hpp>
#include <pico_tree/array_traits.hpp>
#include <pico_tree/kd_tree.hpp>
#include <pico_tree/vector_traits.hpp>
#include <pico_understory/kd_forest.hpp>

#include "mnist.hpp"
#include "sift.hpp"

// A KdForest takes roughly forest_size times longer to build compared to
// building a KdTree. However, the KdForest is usually a lot faster with queries
// in high dimensions with the added trade-off that the exact nearest neighbor
// may not be found.
template <typename Dataset>
void RunDataset(
std::size_t tree_max_leaf_size,
std::size_t forest_size,
std::size_t forest_max_leaf_size,
std::size_t forest_max_leaves_visited) {
using Point = typename Dataset::PointType;
using Space = std::reference_wrapper<std::vector<Point>>;
using Scalar = typename Point::value_type;

auto train = Dataset::ReadTrain();
auto test = Dataset::ReadTest();
std::size_t count = test.size();
std::vector<pico_tree::Neighbor<int, Scalar>> nns(count);
std::string fn_nns_gt = Dataset::kDatasetName + "_nns_gt.bin";

if (!std::filesystem::exists(fn_nns_gt)) {
std::cout << "Creating " << fn_nns_gt
<< " using the KdTree. Be *very* patient." << std::endl;

auto kd_tree = [&train, &tree_max_leaf_size]() {
ScopedTimer t0("kd_tree build");
return pico_tree::KdTree<Space>(train, tree_max_leaf_size);
}();

{
ScopedTimer t1("kd_tree query");
for (std::size_t i = 0; i < nns.size(); ++i) {
kd_tree.SearchNn(test[i], nns[i]);
}
}

pico_tree::WriteBin(fn_nns_gt, nns);
} else {
pico_tree::ReadBin(fn_nns_gt, nns);
std::cout << "KdTree not created. Read " << fn_nns_gt << " instead."
<< std::endl;
}

std::size_t equal = 0;
{
auto rkd_tree = [&train, &forest_max_leaf_size, &forest_size]() {
ScopedTimer t0("kd_forest build");
return pico_tree::KdForest<Space>(
train, forest_max_leaf_size, forest_size);
}();

ScopedTimer t1("kd_forest query");
pico_tree::Neighbor<int, Scalar> nn;
for (std::size_t i = 0; i < nns.size(); ++i) {
rkd_tree.SearchNn(test[i], forest_max_leaves_visited, nn);

if (nns[i].index == nn.index) {
++equal;
}
}
}

std::cout << "Precision: "
<< (static_cast<float>(equal) / static_cast<float>(count))
<< std::endl;
}

int main() {
// forest_max_leaf_size = 16
// forest_max_leaves_visited = 16
// forest_size 8: a precision of around 0.915.
// forest_size 16: a precision of around 0.976.
RunDataset<Mnist>(16, 8, 16, 16);
// forest_max_leaf_size = 32
// forest_max_leaves_visited = 64
// forest_size 8: a precision of around 0.884.
// forest_size 16: a precision of around 0.940.
// forest_size 128: out of memory :'(
RunDataset<Sift>(16, 8, 32, 64);
return 0;
}
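
The precision reported by `RunDataset` is the fraction of queries for which the KdForest returns the same index as the exact KdTree search. A minimal standalone sketch of that calculation follows; the `Precision` helper and its plain `int` index vectors are hypothetical and are not part of PicoTree.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper, not part of PicoTree: the fraction of queries for which
// the approximate search returned the same index as the exact search.
inline float Precision(
    std::vector<int> const& exact, std::vector<int> const& approximate) {
  // Both vectors must hold one result per query.
  assert(exact.size() == approximate.size());

  // Avoid a division by zero when there are no queries.
  if (exact.empty()) {
    return 0.0f;
  }

  std::size_t equal = 0;
  for (std::size_t i = 0; i < exact.size(); ++i) {
    if (exact[i] == approximate[i]) {
      ++equal;
    }
  }

  return static_cast<float>(equal) / static_cast<float>(exact.size());
}
```

Comparing indices only, as the example does, counts a query as a miss when the forest returns a different point at the same distance, so the reported value may slightly understate the quality of the result.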
59 changes: 59 additions & 0 deletions examples/kd_forest/mnist.hpp
@@ -0,0 +1,59 @@
#pragma once

#include <algorithm>
#include <filesystem>
#include <pico_toolshed/format/format_mnist.hpp>

template <typename U, typename T, std::size_t N>
std::array<U, N> Cast(std::array<T, N> const& i) {
std::array<U, N> c;
std::transform(i.begin(), i.end(), c.begin(), [](T a) -> U {
return static_cast<U>(a);
});
return c;
}

template <typename U, typename T, std::size_t N>
std::vector<std::array<U, N>> Cast(std::vector<std::array<T, N>> const& i) {
std::vector<std::array<U, N>> c;
std::transform(
i.begin(),
i.end(),
std::back_inserter(c),
[](std::array<T, N> const& a) -> std::array<U, N> { return Cast<U>(a); });
return c;
}

class Mnist {
private:
using Scalar = float;
using ImageByte = std::array<std::byte, 28 * 28>;
using ImageFloat = std::array<Scalar, 28 * 28>;

static std::vector<ImageFloat> ReadImages(std::string const& filename) {
if (!std::filesystem::exists(filename)) {
throw std::runtime_error(filename + " doesn't exist.");
}

std::vector<ImageByte> images_u8;
pico_tree::ReadMnistImages(filename, images_u8);
return Cast<Scalar>(images_u8);
}

public:
using PointType = ImageFloat;

static std::string const kDatasetName;

static std::vector<PointType> ReadTrain() {
std::string fn_images_train = "train-images.idx3-ubyte";
return ReadImages(fn_images_train);
}

static std::vector<PointType> ReadTest() {
std::string fn_images_test = "t10k-images.idx3-ubyte";
return ReadImages(fn_images_test);
}
};

std::string const Mnist::kDatasetName = "mnist";
36 changes: 36 additions & 0 deletions examples/kd_forest/sift.hpp
@@ -0,0 +1,36 @@
#pragma once

#include <filesystem>
#include <pico_toolshed/format/format_xvecs.hpp>

class Sift {
private:
using VectorFloat = std::array<float, 128>;

static std::vector<VectorFloat> ReadVectors(std::string const& filename) {
if (!std::filesystem::exists(filename)) {
throw std::runtime_error(filename + " doesn't exist.");
}

std::vector<VectorFloat> vectors;
pico_tree::ReadXvecs(filename, vectors);
return vectors;
}

public:
using PointType = VectorFloat;

static std::string const kDatasetName;

static std::vector<PointType> ReadTrain() {
std::string fn_images_train = "sift_base.fvecs";
return ReadVectors(fn_images_train);
}

static std::vector<PointType> ReadTest() {
std::string fn_images_test = "sift_query.fvecs";
return ReadVectors(fn_images_test);
}
};

std::string const Sift::kDatasetName = "sift";
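
For readers unfamiliar with the `.fvecs` files read above, the following is a rough sketch of the record layout, assuming the format as described for the SIFT dataset: each vector is stored as a 4-byte integer dimension followed by that many 4-byte floats. `ParseFvecs` is a hypothetical in-memory parser written for illustration. It is not the implementation of `pico_tree::ReadXvecs`, and it assumes the byte order of the data matches that of the host.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <utility>
#include <vector>

// Hypothetical in-memory parser, not PicoTree's ReadXvecs. Assumes each record
// is a 4-byte int dimension followed by that many 4-byte floats, stored in the
// byte order of the machine running the code.
inline std::vector<std::vector<float>> ParseFvecs(
    std::vector<char> const& bytes) {
  static_assert(sizeof(float) == 4, "The sketch assumes 4-byte floats.");

  std::vector<std::vector<float>> vectors;
  std::size_t offset = 0;

  while (offset + sizeof(std::int32_t) <= bytes.size()) {
    // Read the dimension that prefixes every record.
    std::int32_t dim = 0;
    std::memcpy(&dim, bytes.data() + offset, sizeof(dim));
    offset += sizeof(dim);

    // Stop on a malformed or truncated record instead of reading out of
    // bounds.
    if (dim <= 0) {
      break;
    }
    std::size_t size = static_cast<std::size_t>(dim) * sizeof(float);
    if (size > bytes.size() - offset) {
      break;
    }

    // Copy the components of the vector.
    std::vector<float> v(static_cast<std::size_t>(dim));
    std::memcpy(v.data(), bytes.data() + offset, size);
    offset += size;
    vectors.push_back(std::move(v));
  }

  assert(offset <= bytes.size());
  return vectors;
}
```

The `Sift` class fixes the dimension at compile time through `std::array<float, 128>`, whereas this sketch reads the dimension from every record and therefore returns `std::vector<float>` points.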
3 changes: 0 additions & 3 deletions examples/mnist/CMakeLists.txt

This file was deleted.
