updates sparse scale #2942

Merged · 30 commits merged into main from inplace_sparse_scale · Apr 8, 2024
Conversation

@Intron7 (Member) commented Mar 22, 2024

This would fix #2941

I created some numba.njit() kernels that perform in-place substitution, based on the assumption that we only change existing values and never add new ones (which is where all the scipy overhead comes from).

Benchmarks for 90k cells and 25k genes:
CSR: old 23 s → new 1 s (23×)
CSC: old 61 s → new 1.6 s (36×)
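A minimal sketch of the in-place idea, assuming CSR input and no zero-centering (hypothetical kernel and names; the PR's actual kernels also handle masking):

import numba

@numba.njit(cache=True)
def _scale_csr_inplace(indptr, indices, data, std, max_value):
    # One pass over the CSR nonzeros: divide each value by the std of
    # its gene (column) and clip, mutating data in place so the
    # sparsity pattern never changes and scipy never reallocates.
    for i in range(indptr.size - 1):               # rows (cells)
        for j in range(indptr[i], indptr[i + 1]):  # nonzeros in row i
            data[j] /= std[indices[j]]
            if data[j] > max_value:
                data[j] = max_value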

@Intron7 added this to the 1.10.1 milestone Mar 22, 2024

codecov bot commented Mar 22, 2024

Codecov Report

Attention: Patch coverage is 87.70492%, with 15 lines in your changes missing coverage. Please review.

Project coverage is 75.51%. Comparing base (c68557c) to head (35dd438).
Report is 1 commit behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2942      +/-   ##
==========================================
+ Coverage   75.49%   75.51%   +0.02%     
==========================================
  Files         116      117       +1     
  Lines       12911    12955      +44     
==========================================
+ Hits         9747     9783      +36     
- Misses       3164     3172       +8     
Files                                 Coverage Δ
scanpy/preprocessing/__init__.py      100.00% <100.00%> (ø)
scanpy/preprocessing/_simple.py        83.97% <100.00%> (-1.19%) ⬇️
scanpy/tools/_rank_genes_groups.py     94.33% <100.00%> (ø)
scanpy/preprocessing/_scale.py         87.39% <87.39%> (ø)

@Intron7 (Member, Author) commented Mar 22, 2024

It seems like I run into segfaults with the CSC implementation. I could not reproduce these on my PC, even after running the tests many times. Since the CSR kernel seems to be working, and CSR is better for row-based subsets anyway, I would propose that we transform the CSC into a CSR and then return that for mask_obs.
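A minimal sketch of that fallback, with names (X, mask_obs) assumed from the surrounding discussion:

from scipy import sparse

# hypothetical: convert CSC to CSR before masked scaling, since CSR is
# the better layout for row-based subsetting
if sparse.issparse(X) and X.format == "csc" and mask_obs is not None:
    X = X.tocsr()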

@ivirshup (Member) left a comment

Have you looked at the implementation of inplace_column_scale? WDYT about just scaling by one when the value is masked out?
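For illustration, the "scale by one" idea with sklearn's inplace_column_scale: entries whose factor is 1 pass through unchanged, so the mask can be folded into the scale vector (a hypothetical toy example, not the PR's code):

import numpy as np
from scipy import sparse
from sklearn.utils.sparsefuncs import inplace_column_scale

X = sparse.random(5, 4, density=0.5, format="csr")
scale = np.array([2.0, 0.5, 3.0, 1.5])          # per-column factors
col_mask = np.array([True, False, True, True])  # column 1 masked out
# masked-out columns get a factor of 1, i.e. they are left as-is
inplace_column_scale(X, np.where(col_mask, scale, 1.0))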

@Intron7 (Member, Author) commented Mar 27, 2024

That could also work. But this would require a bit of a rewrite. I think the current solution is simpler and also really fast.

@Intron7 requested a review from ivirshup March 28, 2024 09:05
@ivirshup (Member) commented Mar 28, 2024

My thinking on this right now is that:

  • The code for masking logic (pre this PR) is kind of a mess
  • This PR doesn't make the code nicer

But the performance benefit is quite good, and for sure the operation X[mask_obs, :] = scale_rv is something we don't want to do with sparse matrices.

I also think we could get even faster, plus a bit cleaner, if we instead modified scale_array to accept a row_mask argument, using something like what I suggest here:

from scipy import sparse
import numpy as np
from operator import mul, truediv

def broadcast_csr_by_vec(X, vec, op, axis):
    if axis == 0:
        # vec has one entry per row: repeat each across its row's nonzeros
        new_data = op(X.data, np.repeat(vec, np.diff(X.indptr)))
    elif axis == 1:
        # vec has one entry per column: look up by each nonzero's column index
        new_data = op(X.data, vec.take(X.indices, mode="clip"))
    return X._with_data(new_data)

Which I think would be something like:

def broadcast_csr_by_vec(X, vec, op, axis, row_mask: None | np.ndarray):
    if row_mask is not None:
        # masked-out rows get a factor of 1, i.e. they are left unchanged
        vec = np.where(row_mask, vec, 1)
    if axis == 0:
        new_data = op(X.data, np.repeat(vec, np.diff(X.indptr)))
    elif axis == 1:
        new_data = op(X.data, vec.take(X.indices, mode="clip"))
    return X._with_data(new_data)
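A quick usage sketch with made-up values (mul comes from the operator import above):

X = sparse.random(4, 3, density=0.5, format="csr")
vec = np.array([1.0, 2.0, 0.5, 4.0])        # one factor per row (axis=0)
mask = np.array([True, False, True, True])  # row 1 is left unchanged
X_scaled = broadcast_csr_by_vec(X, vec, mul, axis=0, row_mask=mask)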

Or, since we're doing numba already, we could just write out the operation with a check to see if we're on a masked row (which should be even faster, since we're not allocating anything extra).

I think either of these solutions would be simpler, since we do the masking all in one place and don't need a second update step.

@Intron7 (Member, Author) commented Mar 28, 2024

I really like your idea. But I feel like this will complicate the clipping function as well, since we'd then need to subset there too.

@ivirshup (Member) commented Mar 28, 2024

clipping function

I think it's not so bad; you can use similar logic. The numpy version is something like:

    if axis == 0:
        data_mask = np.repeat(row_mask, np.diff(X.indptr))
    elif axis == 1:
        data_mask = row_mask.take(X.indices, mode="clip")
    X.data[(X.data > max_value) & data_mask] = max_value

right?

For numba, I'd just include the clipping in the inner loop so it's still a single pass.
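A sketch of what that single-pass kernel might look like (hypothetical names, not the PR's actual implementation):

import numba

@numba.njit(cache=True)
def _scale_clip_csr(indptr, indices, data, scale, row_mask, max_value):
    # scale the nonzeros of masked-in rows by a per-column factor and
    # clip in the same inner loop, so the data is touched only once
    for i in range(indptr.size - 1):
        if not row_mask[i]:
            continue
        for j in range(indptr[i], indptr[i + 1]):
            data[j] *= scale[indices[j]]
            if data[j] > max_value:
                data[j] = max_value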

@Intron7 (Member, Author) commented Mar 28, 2024

I would be open to doing this if you move the whole sparse part out of scale_array and keep it in scale_sparse.

@ivirshup (Member)

I would be open to doing this if you move the whole sparse part out of scale_array and keep it in scale_sparse.

Absolutely. The sparse definition being in scale_array is a big part of what makes the current code a mess.

@Intron7 (Member, Author) commented Mar 28, 2024

@ivirshup I don't know where the crash in the build is coming from; I changed nothing in those parts. However, I rewrote the sparse logic, and to me it's now a lot better.

@ivirshup (Member)
I think it was a bugged uv release, but it seems to be fixed now.

@Intron7 (Member, Author) commented Mar 28, 2024

OK, looks good now.

Comment on lines 942 to 950
elif isspmatrix_csc(X) and mask_obs is None:
    return scale_array(
        X,
        zero_center=zero_center,
        copy=copy,
        max_value=max_value,
        return_mean_std=return_mean_std,
        mask_obs=mask_obs,
    )
@ivirshup (Member)

I'm a little confused by what this branch ends up doing. Does this not hit the numba kernel?

Btw, I think you can have

    if mask is None:

blocks in numba code where the entire block is just excluded from the compiled code, since it's a compile-time constant.
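A small self-contained example of that specialization behavior (my own illustration; numba compiles one version per argument-type signature, so the None case prunes the other branch entirely):

import numba
import numpy as np

@numba.njit
def masked_sum(values, mask):
    if mask is None:
        # compiled only into the mask=None specialization; the else
        # branch below is dropped from that version at compile time
        total = 0.0
        for i in range(values.size):
            total += values[i]
    else:
        total = 0.0
        for i in range(values.size):
            if mask[i]:
                total += values[i]
    return total

masked_sum(np.ones(5), None)                  # no-mask specialization
masked_sum(np.ones(5), np.ones(5, np.bool_))  # masked specialization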

@Intron7 (Member, Author)

No, that doesn't hit the numba kernel. For a CSC matrix it doesn't make sense to hit the numba kernel: if we use a mask_obs, we need to transform into CSR anyway for the mask_obs subset in the mean-var calculation. I think what you suggest is fair, that we also use the base implementation for the no-mask CSR case.

blocks in numba code where the entire block is just excluded from the compiled code, since it's a compile-time constant.

I'll look into this and adjust the code accordingly.

@Intron7 (Member, Author)

So I updated the kernel to use two compile-time constants.

@Intron7 requested a review from ivirshup April 2, 2024 17:35
@numba.njit()
def _scale_sparse_numba(indptr, indices, data, *, std, mask_obs, has_mask, clip):
    def _loop_scale(cell_ix):
        for j in numba.prange(indptr[cell_ix], indptr[cell_ix + 1]):
@ivirshup (Member)

Is this multithreaded if you don't pass parallel=True to njit?

@Intron7 (Member, Author) commented Apr 3, 2024

It's slower because of the way the memory access happens. I tested it. So no, it's not multi-threaded.

@Intron7 (Member, Author)

OK, after running some more tests: it's not the memory access but the compile time. The speedup only happens for very large matrices on the second run, so I don't think it's worth it.

@Intron7 requested a review from ivirshup April 3, 2024 13:02
@Intron7 (Member, Author) commented Apr 4, 2024

[Screenshot: benchmark timings, 2024-04-04]

I have some concerns about the performance of the no-numba version for larger datasets. So it might be better to either switch to the numba kernel for larger datasets or take the compile hit for small datasets.

@ivirshup (Member) commented Apr 8, 2024

So it might be better to either switch to the numba kernel for larger datasets or take the compile hit for small datasets

The compiled versions should get cached, so it's a one-time cost per install. No?
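(For reference, disk caching in numba is opt-in via the decorator, e.g.:)

import numba

@numba.njit(cache=True)
def double(x):
    # with cache=True the compiled machine code is written to disk and
    # reused across sessions, making compilation a one-time cost
    return x * 2.0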

@ivirshup (Member) commented Apr 8, 2024

@Intron7, could you open an issue with the follow-up things you wanted to investigate here?

I think this is good to merge, as it gets ~100x speedup, and we can do comparisons on top of this in follow-ups.

@ivirshup merged commit 3255fda into main Apr 8, 2024
13 checks passed
@ivirshup deleted the inplace_sparse_scale branch April 8, 2024 11:54
meeseeksmachine pushed a commit to meeseeksmachine/scanpy that referenced this pull request Apr 8, 2024
ivirshup pushed a commit that referenced this pull request Apr 8, 2024