Implement several methods to efficiently compute various descriptive statistics given PixelSelector object #129
This is supposed to be similar to the describe method from scipy.stats.
The following statistics are supported: nnz, sum, min, max, mean, variance, skewness, and kurtosis.
All statistics are computed using non-zero pixels.
If you need statistics including zeros, you are better off fetching
interactions as a sparse matrix with
sel.to_csr()
and computing the required statistics using the sparse matrix.
The main advantage of using hictkpy's describe() over other methods
is that all statistics are computed (or estimated) by traversing the data
only once (and without caching pixels).
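To illustrate what a single traversal without caching looks like, here is a minimal pure-Python sketch. It is not hictkpy's implementation: the function name describe_once is hypothetical, the variance uses Welford's online update rather than Boost's accumulators, and the sample variance (n - 1 denominator) is an assumption.

```python
import math


def describe_once(counts):
    """Compute several statistics over non-zero counts in one pass.

    `counts` can be any iterable (including a generator), so values are
    consumed once and never stored.
    """
    nnz = 0
    total = 0.0
    minimum = math.inf
    maximum = -math.inf
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the mean (Welford)

    for x in counts:
        if x == 0:
            continue  # only non-zero pixels contribute
        nnz += 1
        total += x
        minimum = min(minimum, x)
        maximum = max(maximum, x)
        delta = x - mean
        mean += delta / nnz
        m2 += delta * (x - mean)

    if nnz == 0:
        return {"nnz": 0, "sum": 0.0, "min": None, "max": None,
                "mean": None, "variance": None}

    # Assumption: sample variance; undefined for a single observation.
    variance = m2 / (nnz - 1) if nnz > 1 else None
    return {"nnz": nnz, "sum": total, "min": minimum, "max": maximum,
            "mean": mean, "variance": variance}
```

Passing a generator, for example describe_once(iter(values)), makes it clear that no pixel is visited twice.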
All statistics except variance, skewness, and kurtosis are guaranteed to
be exact.
Variance, skewness, and kurtosis are not exact because they are estimated
using the accumulator library from Boost.
However, in practice, the estimates are usually very accurate (based on my
tests, the relative error is always < 1.0e-4, and typically < 1.0e-6).
The estimates can be inaccurate when the sample size is very small.
For the time being, working around this issue is the user's responsibility.
Example:
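One possible workaround, sketched in pure Python under stated assumptions: when a query yields only a few non-zero pixels, recompute the statistic exactly from the counts and otherwise keep the estimate. The helper name variance_with_fallback, the min_samples threshold of 1000, and the use of the sample variance are all hypothetical choices, not part of hictkpy.

```python
import statistics


def variance_with_fallback(counts, estimated_variance, min_samples=1000):
    """Prefer an exact variance when the sample is small.

    `counts` is a list of non-zero pixel counts already held in memory,
    and `estimated_variance` is the value obtained from the single-pass
    estimator. Both the threshold and the n - 1 denominator are
    assumptions made for this sketch.
    """
    if len(counts) >= min_samples:
        return estimated_variance  # large sample: trust the estimate
    if len(counts) < 2:
        return None  # the sample variance is undefined
    return statistics.variance(counts)  # exact, two-pass computation
```

Fetching the counts into memory is cheap precisely in the case where the fallback applies, since the sample is small by construction.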
Another important feature of hictkpy's describe() is the ability to
recognize scenarios where the required statistics can be computed
without traversing all pixels overlapping with the given query.
For example, as soon as the first pixel with a NaN count is encountered,
we can stop updating the estimates for all statistics except nnz.
This means that if we do not have to compute nnz, then describe() can return
as soon as the first NaN pixel is found.
If we have to compute nnz, then we must traverse all pixels.
However, we can still reduce the amount of work performed by describe()
by taking advantage of the fact that we only need to count pixels.
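The early-exit logic described above can be sketched as follows. This is a simplified pure-Python illustration, not hictkpy's code: the function name, the restriction to the nnz, sum, and mean metrics, and the returned visited counter are assumptions made to keep the example small. The input is assumed to contain only non-zero pixel counts.

```python
import math


def describe_early_exit(counts, metrics):
    """Return (stats, visited) for the requested metrics.

    Once a NaN count is seen, every statistic except nnz is known to be
    NaN, so value updates stop. If nnz was not requested, the traversal
    stops as well.
    """
    need_nnz = "nnz" in metrics
    nnz = 0
    total = 0.0
    found_nan = False
    visited = 0

    for x in counts:
        visited += 1
        nnz += 1
        if found_nan:
            continue  # from here on, pixels only need to be counted
        if math.isnan(x):
            found_nan = True
            if not need_nnz:
                break  # nothing left to learn from the remaining pixels
            continue
        total += x

    stats = {}
    if need_nnz:
        stats["nnz"] = nnz
    if "sum" in metrics:
        stats["sum"] = math.nan if found_nan else total
    if "mean" in metrics:
        stats["mean"] = math.nan if found_nan or nnz == 0 else total / nnz
    return stats, visited
```

The visited counter is only there to make the saving observable: without nnz in metrics, the loop ends at the first NaN pixel; with nnz, all pixels are visited but the remaining iterations do nothing except increment a counter.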
We recommend using describe() when computing multiple statistics at the
same time.
When computing a single statistic we recommend using one of nnz(), sum(),
min(), max(), mean(), variance(), skewness(), or kurtosis().
All methods computing stats from a PixelSelector accept the keep_nans and
keep_infs parameters, which can be used to customize how non-finite values
are handled.
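As a rough illustration of what these two parameters control, the following pure-Python sketch filters a stream of counts before any statistic is computed. The helper filter_counts is hypothetical, and the default of False for both flags is an assumption made for this example, not a statement about hictkpy's defaults.

```python
import math


def filter_counts(counts, keep_nans=False, keep_infs=False):
    """Yield counts, optionally dropping NaN and infinite values."""
    for x in counts:
        if math.isnan(x) and not keep_nans:
            continue  # drop NaN unless explicitly kept
        if math.isinf(x) and not keep_infs:
            continue  # drop +inf and -inf unless explicitly kept
        yield x
```

With both flags off, a statistic such as the sum is computed over finite values only; turning keep_nans on lets a single NaN propagate to every statistic except nnz.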