updates sparse scale #2942
Conversation
Codecov Report

Additional details and impacted files:

@@            Coverage Diff             @@
##              main    #2942      +/-   ##
==========================================
+ Coverage   75.49%   75.51%   +0.02%
==========================================
  Files         116      117       +1
  Lines       12911    12955      +44
==========================================
+ Hits         9747     9783      +36
- Misses       3164     3172       +8
It seems like I run into segfaults with the CSC implementation. I could not reproduce these on my PC, even after running the tests many times. Since the CSR kernel seems to be working and CSR is anyway better for row-based subsets, I would propose that we transform the CSC matrix to CSR.
Have you looked at the implementation of `inplace_column_scale`? WDYT about just scaling by one when the value is masked out?
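A minimal sketch of that "scale by one" idea, assuming a CSR matrix, a per-gene `std` vector, and a boolean per-cell mask; the variable names are illustrative, not the PR's:

```python
import numpy as np
from scipy import sparse

X = sparse.random(5, 4, density=0.5, format="csr")
std = np.ones(X.shape[1])                      # hypothetical per-gene standard deviation
mask_obs = np.array([True, False, True, True, False])

# Per-nonzero scale factor: 1/std for values in masked-in rows, 1 elsewhere,
# so masked-out values pass through unchanged.
row_of_nnz = np.repeat(np.arange(X.shape[0]), np.diff(X.indptr))
factor = np.where(mask_obs[row_of_nnz], 1.0 / std[X.indices], 1.0)
X.data *= factor
```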
That could also work. But this would require a bit of a rewrite. I think the current solution is simpler and also really fast.
My thinking on this right now is that the performance benefit is quite good, and I also think we could get even faster, plus a bit cleaner, if we instead modified scale array to use something like what I suggest here to accept a mask argument:

```python
from scipy import sparse
import numpy as np
from operator import mul, truediv


def broadcast_csr_by_vec(X, vec, op, axis):
    if axis == 0:
        new_data = op(X.data, np.repeat(vec, np.diff(X.indptr)))
    elif axis == 1:
        new_data = op(X.data, vec.take(X.indices, mode="clip"))
    return X._with_data(new_data)
```

Which I think would be something like:

```python
def broadcast_csr_by_vec(X, vec, op, axis, row_mask: None | np.ndarray):
    if row_mask is not None:
        vec = np.where(row_mask, vec, 1)
    if axis == 0:
        new_data = op(X.data, np.repeat(vec, np.diff(X.indptr)))
    elif axis == 1:
        new_data = op(X.data, vec.take(X.indices, mode="clip"))
    return X._with_data(new_data)
```

Or, since we're doing numba already, we could just write out the operation with a check to see if we're on a masked row (which should be even faster since we're not allocating anything extra). I think either of these solutions would be simpler since we do the masking all in one place and don't have to have a second update step.
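For concreteness, a rough sketch of that "write it out in numba" alternative; the kernel name, signature, and usage below are invented here rather than taken from the PR:

```python
import numba
import numpy as np
from scipy import sparse


@numba.njit
def scale_csr_rows_masked(indptr, indices, data, inv_std, row_mask):
    # inv_std: per-column 1/std; row_mask: per-row bool.
    # Rows outside the mask are skipped, i.e. implicitly scaled by one,
    # and no temporary arrays are allocated.
    for r in range(indptr.shape[0] - 1):
        if not row_mask[r]:
            continue
        for k in range(indptr[r], indptr[r + 1]):
            data[k] *= inv_std[indices[k]]


X = sparse.random(200, 50, density=0.1, format="csr")
scale_csr_rows_masked(
    X.indptr, X.indices, X.data,
    np.ones(X.shape[1]), np.ones(X.shape[0], dtype=bool),
)
```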
I really like your idea. But I feel like this will complicate the clipping function as well, since we would need to subset there as well then.
I think it's not so bad. I think you can use similar logic. The NumPy version is something like:

```python
if axis == 0:
    data_mask = np.repeat(row_mask, np.diff(X.indptr))
elif axis == 1:
    data_mask = row_mask.take(X.indices, mode="clip")
X.data[(X.data > max_value) & data_mask] = max_value
```

right? For numba, I'd just include the clipping in the inner loop so it's still a single pass.
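Extending the earlier sketch, folding the clipping into the same inner loop might look like this; again a hypothetical kernel, not the one merged in the PR:

```python
import numba


@numba.njit
def scale_and_clip_csr_rows(indptr, indices, data, inv_std, row_mask, max_value):
    # Single pass: scale each stored value, then clip it, only for masked-in rows.
    for r in range(indptr.shape[0] - 1):
        if not row_mask[r]:
            continue
        for k in range(indptr[r], indptr[r + 1]):
            v = data[k] * inv_std[indices[k]]
            data[k] = v if v <= max_value else max_value
```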
I would be open to doing this if you move the whole sparse part out of `scale_array` and then keep it in `scale_sparse`.
Absolutely. The sparse definition being in `scale_array` is a big part of what makes the current code a mess.
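The refactor being discussed would roughly amount to the following dispatch; the signatures here are simplified guesses, not scanpy's actual ones:

```python
from scipy import sparse


def scale_array(X, **kwargs):   # placeholder for the dense implementation
    ...


def scale_sparse(X, **kwargs):  # placeholder for the sparse implementation
    ...


def scale(X, *, zero_center=True, max_value=None, mask_obs=None):
    # Route sparse input to its own helper instead of handling it inside scale_array.
    kwargs = dict(zero_center=zero_center, max_value=max_value, mask_obs=mask_obs)
    if sparse.issparse(X):
        return scale_sparse(X, **kwargs)
    return scale_array(X, **kwargs)
```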
@ivirshup I don't know where the crash in the build is coming from; I changed nothing in those parts. However, I rewrote the sparse logic and to me it's now a lot better.
I think it was a bugged
Ok, looks good now.
scanpy/preprocessing/_simple.py (Outdated)

```python
elif isspmatrix_csc(X) and mask_obs is None:
    return scale_array(
        X,
        zero_center=zero_center,
        copy=copy,
        max_value=max_value,
        return_mean_std=return_mean_std,
        mask_obs=mask_obs,
    )
```
I'm a little confused by what this branch ends up doing. Does this not hit the numba kernel?

Btw, I think you can have `if mask is None:` blocks in numba code where the entire block is just excluded from the compiled code, since it's a compile-time constant.
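A small demonstration of that point, assuming the mask is passed as an optional argument: when `mask` is `None`, numba specializes the function for `NoneType`, the `mask is None` check becomes a compile-time constant, and the other branch is pruned from the compiled code. Names here are illustrative only.

```python
import numba
import numpy as np


@numba.njit
def scale_dense(data, factor, mask):
    out = np.empty_like(data)
    for i in range(data.shape[0]):
        if mask is None:
            # Only this branch is compiled when mask is passed as None.
            out[i] = data[i] * factor[i]
        else:
            out[i] = data[i] * factor[i] if mask[i] else data[i]
    return out


x = np.arange(4.0)
f = np.full(4, 2.0)
scale_dense(x, f, None)                                   # unmasked specialization
scale_dense(x, f, np.array([True, False, True, False]))   # masked specialization
```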
No, that doesn't hit the numba kernel. For a CSC matrix it doesn't make sense to hit the numba kernel: if we use a mask_obs, we need to transform into CSR for the mask_obs subset used in the mean/var calculation. I think what you suggest is fair, that we also use the base implementation for non-masked CSR.

> blocks in numba code where the entire block is just excluded from the compiled code since it's a compile time constant.

I'll look into this and adjust the code based on this.
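For context on why the CSR conversion comes up here: row subsetting for the mean/variance computation is cheap on CSR but not on CSC, so a CSC input would be converted first. A hedged illustration, not the PR's code:

```python
import numpy as np
from scipy import sparse

X = sparse.random(1000, 200, density=0.1, format="csc")
mask_obs = np.random.rand(X.shape[0]) > 0.5

X_csr = X.tocsr()                 # row slicing is efficient on CSR
masked = X_csr[mask_obs]
mean = np.asarray(masked.mean(axis=0)).ravel()
var = np.asarray(masked.power(2).mean(axis=0)).ravel() - mean**2
```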
So I updated the kernel to use two compile-time constants.
scanpy/preprocessing/_simple.py (Outdated)

```python
@numba.njit()
def _scale_sparse_numba(indptr, indices, data, *, std, mask_obs, has_mask, clip):
    def _loop_scale(cell_ix):
        for j in numba.prange(indptr[cell_ix], indptr[cell_ix + 1]):
```
Is this multithreaded if you don't pass `parallel=True` to `njit`?
It's slower, because of the way the memory access happens. I tested it. So no, it's not multi-threaded.
Ok, after running some more tests: it's not the memory access but the compile time. The speedup only happens for very large matrices on the second run, so I don't think it's worth it.
The compiled versions should get cached, so it's a one-time cost per install. No?
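For reference, numba's on-disk caching is opt-in: passing `cache=True` to `njit` stores the compiled machine code next to the module, so the compilation cost is paid once per install rather than once per process. A toy example, not the PR's kernel:

```python
import numba
import numpy as np


@numba.njit(cache=True)
def square_sum(x):
    total = 0.0
    for i in range(x.shape[0]):
        total += x[i] * x[i]
    return total


square_sum(np.arange(1_000_000.0))  # first call compiles and writes the cache;
                                    # later processes load it from disk instead
```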
@Intron7, could you open an issue with the follow-up things you wanted to investigate here? I think this is good to merge as it gets roughly a 100x speed-up, and we can do comparisons on top of this in follow-ups.
Co-authored-by: Severin Dicks <[email protected]>
This would fix #2941
I created some numba.njit() kernels that perform in-place substitutions based on the assumption that we only change existing values and don't add new ones (where all the scipy overhead comes from).
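To illustrate the "only change existing values" point (a hedged sketch, not the PR's kernels): mutating `X.data` in place leaves the sparsity structure (`indptr`/`indices`) untouched, whereas going through scipy operations allocates whole new matrices.

```python
import numpy as np
from scipy import sparse

X = sparse.random(1000, 500, density=0.1, format="csr")
inv_std = np.ones(X.shape[1])

X.data *= inv_std[X.indices]        # in place: only the stored values change
# X = X.multiply(inv_std).tocsr()   # allocating route: builds a brand-new matrix
```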
Benchmarks for 90k cells and 25k genes:

CSR: old 23 s | new 1 s | 23x
CSC: old 61 s | new 1.6 s | 36x