Patch AnnData.__sizeof__() for backed datasets #1230
Neah-Ko
commented
Nov 7, 2023
edited by flying-sheep
- Closes scipy.sparse.issparse check is always false in AnnData.__sizeof__() method + csr_matrix() realizes data #1222
- Tests added
- Release note added (or unnecessary)
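As a rough illustration of the linked issue (#1222): `scipy.sparse.issparse` only recognizes SciPy's own sparse classes, so an object that merely wraps sparse data is not detected. The sketch below uses `BackedSparseStandIn`, a hypothetical stand-in written for this example, not anndata's actual backed class.

```python
import numpy as np
from scipy import sparse


class BackedSparseStandIn:
    """Hypothetical wrapper mimicking an on-disk sparse dataset (not anndata's class)."""

    def __init__(self, matrix):
        self._matrix = matrix
        self.shape = matrix.shape
        self.dtype = matrix.dtype


in_memory = sparse.csr_matrix(np.eye(3))
backed_like = BackedSparseStandIn(in_memory)

# issparse() is an isinstance check against SciPy's sparse base class,
# so it accepts the real matrix and rejects the wrapper.
in_memory_detected = sparse.issparse(in_memory)
wrapper_detected = sparse.issparse(backed_like)
print(in_memory_detected, wrapper_detected)
```

A size function that branches on `issparse` would therefore never take the sparse path for such a wrapper.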
Codecov Report
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1230      +/-   ##
==========================================
- Coverage   84.97%   83.12%   -1.86%
==========================================
  Files          34       34
  Lines        5399     5405       +6
==========================================
- Hits         4588     4493      -95
- Misses        811      912     +101
Thanks! This helps a lot, but I think there are still a few assumptions that could break.
We can of course help out with testing or so, just tell us if you need support!
PS: Please also add a release note here:
anndata/docs/release-notes/0.10.4.md
Lines 3 to 5 in af7a5b7
```{rubric} Bugfix
```
* Only try to use `Categorical.map(na_action=…)` in actually supported Pandas ≥2.1 {pr}`1226` {user}`flying-sheep`
Hello @flying-sheep,

Maybe I need some help with testing and with refining the specs we are aiming for here. I designed this naive test:

def test_backed_sizeof(ondisk_equivalent_adata):
    csr_mem, csr_disk, csc_disk, dense_disk = ondisk_equivalent_adata
    assert_equal(dense_disk.__sizeof__(), csr_mem.__sizeof__())
    assert_equal(dense_disk.__sizeof__(), csr_disk.__sizeof__())
    assert_equal(dense_disk.__sizeof__(), csc_disk.__sizeof__())

It does two passes, one with h5py and one with zarr. However, it highlighted a discrepancy in the reported sizes. E.g. if you place a debug breakpoint on the first assert and execute some commands:

nelem_x_size = lambda X: np.array(X.shape).prod() * X.dtype.itemsize
cstb = lambda X: X.data.nbytes + X.indptr.nbytes + X.indices.nbytes
# h5py pass
cstb(csr_mem.X)
3204
cstb(csc_disk.X._to_backed())
3204
cstb(csr_disk.X._to_backed())
3204
nelem_x_size(dense_disk.X)
20000
# zarr pass
cstb(csr_mem.X)
3204
cstb(csc_disk.X)
3204
cstb(csr_disk.X)
3204
nelem_x_size(dense_disk.X)
20000

Lead

Then I decided to try re-implementing the get_size function like this:

def get_size(X):
    if isinstance(X, (h5py.Dataset,
                      sparse.csr_matrix,
                      sparse.csc_matrix,
                      BaseCompressedSparseDataset)):
        return np.array(X.shape).prod() * X.dtype.itemsize
    else:
        return X.__sizeof__()

Effect on the test:

# h5py pass
get_size(csr_mem.X)
20000
get_size(csr_disk.X)
20000
get_size(csc_disk.X)
20000
get_size(dense_disk.X)
20000
# zarr pass
get_size(csr_mem.X)
20000
get_size(csr_disk.X)
20000
get_size(csc_disk.X)
20000
get_size(dense_disk.X)
20128

The test fails because of this last size: 20128 instead of 20000.

Reflexions

I am starting to question implementing this directly. Maybe this deserves another function that has a more explicit name? Or that function could simply compute the size of the data, making it less precise but good enough in terms of order of magnitude.

I see that #981 and #947 are about adding lazy support for other coordinates than X. I think this is something that we need to think about while designing that feature as well.

Let me know what you think.

Best,
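The two measurements used in the debugging session above can be reproduced on in-memory data. This is only a sketch: the 50×50 float64 matrix with 250 non-zeros is an assumption chosen because it is consistent with the numbers in the thread (50 × 50 × 8 = 20000, and with 32-bit indices 250 × 8 + 250 × 4 + 51 × 4 = 3204); the actual fixture shape is not stated here. The helper names `dense_nbytes` and `csr_nbytes` are made up for the example.

```python
import numpy as np
from scipy import sparse


def dense_nbytes(X):
    """Shape-based size: number of elements times bytes per element."""
    return int(np.prod(X.shape)) * X.dtype.itemsize


def csr_nbytes(X):
    """Size of the three arrays that actually back a CSR/CSC matrix."""
    return X.data.nbytes + X.indptr.nbytes + X.indices.nbytes


# Assumed fixture: 50x50 float64 with exactly 250 non-zero entries.
rng = np.random.default_rng(0)
dense = np.zeros((50, 50), dtype=np.float64)
flat_idx = rng.choice(dense.size, size=250, replace=False)
dense.flat[flat_idx] = 1.0
csr = sparse.csr_matrix(dense)

dense_size = dense_nbytes(dense)   # what the dense array occupies
sparse_size = csr_nbytes(csr)      # what the sparse buffers occupy
shape_based = dense_nbytes(csr)    # shape-based formula applied to the sparse matrix
print(dense_size, sparse_size, shape_based)
```

The shape-based formula makes all representations agree, but only by reporting the dense-equivalent size for sparse matrices rather than the memory they actually use.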
What do you mean? Do you mean things like DataFrames, which have several parts, or do you mean that there can be complex dtypes that aren’t easy to calculate size for? I’d say: Only test simple cases. Unless I misunderstood and even simple arrays can have varying sizes. In that case maybe just
Hello @flying-sheep, I meant that it would be hard to return a consistent size value for the various classes that can be returned. Since you don't have a problem with an imprecise test, I've updated my solution with the lower/upper bounds asserts.
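A lower/upper bound assertion of the kind mentioned above could look roughly like this. The helper name `assert_size_within` and the 5 % tolerance are assumptions made for illustration, not what the PR actually uses.

```python
def assert_size_within(actual, expected, rel_tol=0.05):
    """Assert that `actual` lies within +/- rel_tol of `expected` (hypothetical helper)."""
    lower = expected * (1 - rel_tol)
    upper = expected * (1 + rel_tol)
    assert lower <= actual <= upper, (
        f"size {actual} outside [{lower}, {upper}]"
    )


expected = 20000  # dense-equivalent size reported in the thread

# Both the exact value and the slightly larger zarr value pass.
assert_size_within(20000, expected)
assert_size_within(20128, expected)
```

A tolerance like this absorbs a few bytes of per-class housekeeping while still catching a size that is off by an order of magnitude.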
Hm, I think I wasn’t clear enough. What I meant is
With a focus on “entirely”. I thought you were referring to a few bytes of housekeeping data that some class has. Also, I think you’re now making it so sparse matrix size isn’t reported correctly anymore.
@flying-sheep
OK, great! Now the only remaining point is
Actually it should only return the size of the object itself. I think it makes sense to customize it. I think for now, we could change it to

def __sizeof__(self, *, with_disk: bool = False) -> int:
    ...

If we want to follow the specs, we would have to change it to

def __sizeof__(self, *, with_fields: bool = False, with_disk: bool = False) -> int:
    ...

which means we’d have to change behavior, so maybe let’s not do this right now. What do you think?
@flying-sheep
Pretty much!
I think the behavior should be:
- with_disk=True → everything
- with_disk=False → all in-memory structures, sparse or dense.
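The behavior described in the list above can be sketched on a toy class. `ToyContainer` and its attributes are hypothetical and only mimic the idea; this is not anndata's implementation.

```python
import numpy as np


class ToyContainer:
    """Hypothetical container holding in-memory arrays plus sizes of on-disk data."""

    def __init__(self, in_memory, on_disk_nbytes):
        self.in_memory = in_memory            # list of NumPy arrays held in RAM
        self.on_disk_nbytes = on_disk_nbytes  # list of byte counts for backed data

    def __sizeof__(self, *, with_disk: bool = False) -> int:
        # Always count the in-memory structures.
        size = sum(arr.nbytes for arr in self.in_memory)
        # Optionally add what lives on disk.
        if with_disk:
            size += sum(self.on_disk_nbytes)
        return size


toy = ToyContainer([np.zeros(10, dtype=np.float64)], [20000])
memory_only = toy.__sizeof__()                # 10 * 8 bytes
everything = toy.__sizeof__(with_disk=True)   # in-memory plus on-disk
print(memory_only, everything)
```

Note that `sys.getsizeof` calls `__sizeof__()` without arguments, so in a design like this the on-disk total is only available by calling the method directly with `with_disk=True`.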
Thank you for the PR and for being patient with my many requests 😄
…cked datasets) (#1234) Co-authored-by: Etienne JODRY <[email protected]>
Sure, with pleasure. It was fun to dig into it :) Happy that it passed.

Best,