Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement several methods to efficiently compute various descriptive statistics given PixelSelector object #129

Draft
wants to merge 6 commits into
base: main
Choose a base branch
from
Draft
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
36 changes: 34 additions & 2 deletions docs/api/generic.rst
Original file line number Diff line number Diff line change
Expand Up @@ -45,15 +45,47 @@ Generic API

.. automethod:: coord1
.. automethod:: coord2
.. automethod:: nnz
.. automethod:: sum
.. automethod:: to_arrow
.. automethod:: to_coo
.. automethod:: to_csr
.. automethod:: to_df
.. automethod:: to_numpy
.. automethod:: to_pandas

**Statistics**

:py:class:`hictkpy.PixelSelector` exposes several methods to compute or estimate several statistics efficiently.

The main features of these methods are:

* All statistics are computed by traversing the data only once and without caching interactions.
* All methods can be tweaked to include or exclude non-finite values.
* All functions implemented using short-circuiting to detect scenarios where the required statistics can be computed without traversing all pixels.

The following statistics are guaranteed to be exact:

* nnz
* sum
* min
* max
* mean

The rest of the supported statistics (currently variance, skewness, and kurtosis) are estimated and are thus not guaranteed to be exact.
However, in practice, the estimation is usually very accurate (based on my tests, the rel. error is always < 1.0e-4, and typically < 1.0e-6).
The only case where the estimates are likely to be inaccurate is when the sample size (i.e. the number of non-zero pixels overlapping a query) is very small.

.. automethod:: describe
.. automethod:: kurtosis
.. automethod:: max
.. automethod:: mean
.. automethod:: min
.. automethod:: nnz
.. automethod:: skewness
.. automethod:: sum
.. automethod:: variance

**Iteration**

.. automethod:: __iter__

.. code-block:: ipythonconsole
Expand Down
4 changes: 4 additions & 0 deletions src/CMakeLists.txt
Original file line number Diff line number Diff line change
Expand Up @@ -13,8 +13,10 @@ find_package(
)

find_package(Arrow REQUIRED)
find_package(Boost REQUIRED)
find_package(FMT REQUIRED)
find_package(nanobind REQUIRED)
find_package(phmap REQUIRED)
find_package(spdlog REQUIRED)
find_package(Filesystem REQUIRED)

Expand Down Expand Up @@ -55,7 +57,9 @@ target_link_libraries(
hictk::hic
hictk::transformers
Arrow::arrow_$<IF:$<BOOL:${BUILD_SHARED_LIBS}>,shared,static>
Boost::headers
fmt::fmt-header-only
phmap
spdlog::spdlog_header_only
std::filesystem
)
Expand Down
5 changes: 2 additions & 3 deletions src/cooler_file_writer.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -14,7 +14,6 @@
#include <hictk/cooler/cooler.hpp>
#include <hictk/reference.hpp>
#include <hictk/tmpdir.hpp>
#include <hictk/type_traits.hpp>
#include <stdexcept>
#include <string>
#include <string_view>
Expand Down Expand Up @@ -89,7 +88,7 @@ void CoolerFileWriter::add_pixels(const nb::object &df) {

std::visit(
[&](const auto &n) {
using N = hictk::remove_cvref_t<decltype(n)>;
using N = remove_cvref_t<decltype(n)>;
const auto pixels = coo_format ? coo_df_to_thin_pixels<N>(df, true)
: bg2_df_to_thin_pixels<N>(_w->bins(), df, true);

Expand Down Expand Up @@ -126,7 +125,7 @@ void CoolerFileWriter::finalize([[maybe_unused]] std::string_view log_lvl_str,
try {
std::visit(
[&](const auto &num) {
using N = hictk::remove_cvref_t<decltype(num)>;
using N = remove_cvref_t<decltype(num)>;
_w->aggregate<N>(_path.string(), false, _compression_lvl, chunk_size, update_freq);
},
_w->open("0").pixel_variant());
Expand Down
10 changes: 10 additions & 0 deletions src/include/hictkpy/common.hpp
Original file line number Diff line number Diff line change
Expand Up @@ -6,15 +6,25 @@

#include <hictk/common.hpp>
#include <hictk/numeric_variant.hpp>
#include <hictk/type_traits.hpp>
#include <stdexcept>
#include <string>
#include <string_view>
#include <type_traits>

namespace hictkpy {

#define HICTKPY_LIKELY HICTK_LIKELY
#define HICTKPY_UNLIKELY HICTK_UNLIKELY

[[noreturn]] inline void unreachable_code() { hictk::unreachable_code(); }

template <typename T>
using remove_cvref = hictk::remove_cvref<T>;

template <typename T>
using remove_cvref_t = typename remove_cvref<T>::type;

template <typename T>
[[nodiscard]] constexpr std::string_view map_type_to_dtype() {
if constexpr (std::is_unsigned_v<T>) {
Expand Down
131 changes: 131 additions & 0 deletions src/include/hictkpy/pixel_aggregator.hpp
Original file line number Diff line number Diff line change
@@ -0,0 +1,131 @@
// Copyright (C) 2024 Roberto Rossini <[email protected]>
//
// SPDX-License-Identifier: MIT

#pragma once

#include <parallel_hashmap/phmap.h>

#include <array>
#include <boost/accumulators/accumulators.hpp>
#include <boost/accumulators/statistics.hpp>
#include <cstdint>
#include <optional>
#include <string>
#include <string_view>
#include <variant>

namespace hictkpy {

struct Stats {
std::optional<std::int64_t> nnz{};
std::optional<std::variant<std::int64_t, double>> sum{};
std::optional<std::variant<std::int64_t, double>> min{};
std::optional<std::variant<std::int64_t, double>> max{};
std::optional<double> mean{};
std::optional<double> variance{};
std::optional<double> skewness{};
std::optional<double> kurtosis{};
};

class PixelAggregator {
template <typename N>
using Accumulator = boost::accumulators::accumulator_set<
N, boost::accumulators::stats<
boost::accumulators::droppable<boost::accumulators::tag::count>,
boost::accumulators::droppable<boost::accumulators::tag::sum>,
boost::accumulators::droppable<boost::accumulators::tag::min>,
boost::accumulators::droppable<boost::accumulators::tag::max>,
boost::accumulators::droppable<boost::accumulators::tag::mean>,
boost::accumulators::droppable<boost::accumulators::tag::variance>,
boost::accumulators::droppable<boost::accumulators::tag::skewness>,
boost::accumulators::droppable<boost::accumulators::tag::kurtosis>>>;

// When using ints there is a high chance that computing the 3rd or 4th moments causes an int
// overflow, leading to completely incorrect results.
// At the same time we cannot use doubles everywhere, because for large numbers the value of
// e.g. sum, mean etc. will no longer be exact
using KurtosisAccumulator = boost::accumulators::accumulator_set<
double, boost::accumulators::stats<boost::accumulators::tag::kurtosis>>;

std::variant<Accumulator<std::int64_t>, Accumulator<double>> _accumulator{
Accumulator<std::int64_t>{}};
std::optional<KurtosisAccumulator> _kurtosis_accumulator{};

bool _compute_count{true};
bool _compute_sum{true};
bool _compute_min{true};
bool _compute_max{true};
bool _compute_mean{true};
bool _compute_variance{true};
bool _compute_skewness{true};
bool _compute_kurtosis{true};

bool _nan_found{false};
bool _neg_inf_found{false};
bool _pos_inf_found{false};

public:
static constexpr std::array<std::string_view, 8> valid_metrics{
"nnz", "sum", "min", "max", "mean", "variance", "skewness", "kurtosis"};

PixelAggregator() = default;

template <typename N, bool keep_nans, bool keep_infs, typename PixelSelector>
[[nodiscard]] Stats compute(const PixelSelector& sel,
const phmap::flat_hash_set<std::string>& metrics);

private:
static void validate_metrics(const phmap::flat_hash_set<std::string>& metrics);

template <bool keep_nans, bool keep_infs, typename N>
void process_non_finite(N n) noexcept;
template <bool keep_nans, bool keep_infs, bool skip_kurtosis, typename N, typename It,
typename StopCondition>
[[nodiscard]] It process_pixels_until_true(Accumulator<N>& accumulator, It first, It last,
StopCondition break_loop);

template <typename N>
[[nodiscard]] bool sum_can_be_skipped(Accumulator<N>& accumulator) const;
template <typename N>
[[nodiscard]] bool min_can_be_skipped(Accumulator<N>& accumulator) const;
template <typename N>
[[nodiscard]] bool max_can_be_skipped(Accumulator<N>& accumulator) const;
template <typename N>
[[nodiscard]] bool mean_can_be_skipped(Accumulator<N>& accumulator) const;
template <typename N>
void disable_redundant_accumulators(Accumulator<N>& accumulator);

template <bool keep_nans, bool keep_infs, typename N, typename It>
void process_all_remaining_pixels(Accumulator<N>& accumulator, It&& first, It&& last);

template <bool keep_nans, bool keep_infs, typename N>
static bool drop_value(N n) noexcept;

template <typename N>
void reset(const phmap::flat_hash_set<std::string>& metrics);

template <typename N>
[[nodiscard]] Stats extract(const phmap::flat_hash_set<std::string>& metrics);

template <typename N>
[[nodiscard]] static std::int64_t extract_nnz(const Accumulator<N>& accumulator);
template <typename N>
[[nodiscard]] N extract_sum(const Accumulator<N>& accumulator) const;
template <typename N>
[[nodiscard]] N extract_min(const Accumulator<N>& accumulator) const;
template <typename N>
[[nodiscard]] N extract_max(const Accumulator<N>& accumulator) const;
template <typename N>
[[nodiscard]] double extract_mean(const Accumulator<N>& accumulator) const;
template <typename N>
[[nodiscard]] double extract_variance(const Accumulator<N>& accumulator) const;
template <typename N>
[[nodiscard]] double extract_skewness(const Accumulator<N>& accumulator) const;
template <typename N>
[[nodiscard]] double extract_kurtosis(const Accumulator<N>& accumulator) const;
};

} // namespace hictkpy

#include "../../pixel_aggregator_impl.hpp"
26 changes: 18 additions & 8 deletions src/include/hictkpy/pixel_selector.hpp
Original file line number Diff line number Diff line change
Expand Up @@ -17,6 +17,7 @@
#include <string_view>
#include <tuple>
#include <variant>
#include <vector>

#include "hictkpy/nanobind.hpp"

Expand Down Expand Up @@ -55,14 +56,23 @@ struct PixelSelector {
[[nodiscard]] auto get_coord2() const -> PixelCoordTuple;

[[nodiscard]] nanobind::iterator make_iterable() const;
[[nodiscard]] nanobind::object to_arrow(std::string_view span = "upper_triangle") const;
[[nodiscard]] nanobind::object to_df(std::string_view span = "upper_triangle") const;
[[nodiscard]] nanobind::object to_pandas(std::string_view span = "upper_triangle") const;
[[nodiscard]] nanobind::object to_coo(std::string_view span = "upper_triangle") const;
[[nodiscard]] nanobind::object to_csr(std::string_view span = "upper_triangle") const;
[[nodiscard]] nanobind::object to_numpy(std::string_view span = "full") const;
[[nodiscard]] nanobind::object sum() const;
[[nodiscard]] std::int64_t nnz() const;
[[nodiscard]] nanobind::object to_arrow(std::string_view span) const;
[[nodiscard]] nanobind::object to_df(std::string_view span) const;
[[nodiscard]] nanobind::object to_pandas(std::string_view span) const;
[[nodiscard]] nanobind::object to_coo(std::string_view span) const;
[[nodiscard]] nanobind::object to_csr(std::string_view span) const;
[[nodiscard]] nanobind::object to_numpy(std::string_view span) const;

[[nodiscard]] nanobind::dict describe(const std::vector<std::string>& metrics, bool keep_nans,
bool keep_infs) const;
[[nodiscard]] std::int64_t nnz(bool keep_nans, bool keep_infs) const;
[[nodiscard]] nanobind::object sum(bool keep_nans, bool keep_infs) const;
[[nodiscard]] nanobind::object min(bool keep_nans = false, bool keep_infs = false) const;
[[nodiscard]] nanobind::object max(bool keep_nans = false, bool keep_infs = false) const;
[[nodiscard]] double mean(bool keep_nans = false, bool keep_infs = false) const;
[[nodiscard]] double variance(bool keep_nans = false, bool keep_infs = false) const;
[[nodiscard]] double skewness(bool keep_nans = false, bool keep_infs = false) const;
[[nodiscard]] double kurtosis(bool keep_nans = false, bool keep_infs = false) const;

[[nodiscard]] static auto parse_span(std::string_view span) -> QuerySpan;
[[nodiscard]] static auto parse_count_type(std::string_view type) -> PixelVar;
Expand Down
Loading
Loading