Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Zarr Tile Sink #1446

Merged
merged 42 commits into from
Mar 21, 2024
Merged
Show file tree
Hide file tree
Changes from 9 commits
Commits
Show all changes
42 commits
Select commit Hold shift + click to select a range
0f537bb
Initial method setup
annehaley Dec 20, 2023
5c6990b
Populate addTile method; tests successful up to 3-D
annehaley Jan 4, 2024
29d8b19
Use test source to copy 5D images to Zarr sink
annehaley Jan 11, 2024
599503f
Remove old TODOs
annehaley Jan 29, 2024
10a1f7e
Refactor __init__ to simplify (split into multiple methods)
annehaley Jan 29, 2024
12946a0
Update Exception class in __del__ method
annehaley Jan 29, 2024
f8fb42e
Merge branch 'master' into zarr-sink
annehaley Jan 31, 2024
d697920
Enable cropping during `write`; does not modify internal data
annehaley Feb 1, 2024
b50ecd9
Merge branch 'master' into zarr-sink
annehaley Feb 9, 2024
f36f1a8
Protect `__del__` method and ensure both `close` operations are called
annehaley Mar 5, 2024
6b6a6a4
Set `_axes`, `_dtype`, `_bandCount`, `sizeX`, `sizeY`, `levels`, and …
annehaley Mar 5, 2024
d6dd5ad
Attempt to improve performance of `addTile`: remove `np.pad` and unne…
annehaley Mar 6, 2024
b7ed234
Merge branch 'master' into zarr-sink
annehaley Mar 6, 2024
df8819f
Fix lint failure: remove trailing whitespace
annehaley Mar 6, 2024
0a832fd
Set `self._levels` to None in `addTile`
annehaley Mar 7, 2024
cd0c715
Avoid creating full mask
annehaley Mar 7, 2024
41ea6b5
Use `np.where` instead of `np.place`
annehaley Mar 8, 2024
eaaf29e
Merge branch 'master' into zarr-sink
annehaley Mar 8, 2024
0414f71
Rename `mask_placement_slices` -> `placement_slices` and avoid mask c…
annehaley Mar 8, 2024
3da02f6
Include statement to add samples axis to mask if necessary
annehaley Mar 11, 2024
d0d6129
Refactor checking of `axes` arg; ensure "s" axis exists
annehaley Mar 13, 2024
07e744f
Use samples axis length in chunking (instead of 32) and rechunk if le…
annehaley Mar 14, 2024
5118fbc
Switch to using DirectoryStore internally (instead of SQLiteStore)
annehaley Mar 18, 2024
e1e9831
Refactor addTile: ensure placement from kwargs is used (including fir…
annehaley Mar 18, 2024
6706f69
Refactor addTile: move attribute changes within `addLock` context
annehaley Mar 18, 2024
b58fcda
style: whitespace changes to fix lint failure
annehaley Mar 18, 2024
756502b
Refactor addTile: expand mask dimensions with tile dimensions
annehaley Mar 18, 2024
58897b3
Merge branch 'master' into zarr-sink
annehaley Mar 18, 2024
a0b3fbc
fix: add NoneType protection to mask axis padding
annehaley Mar 18, 2024
f0e7482
Merge branch 'master' into zarr-sink
annehaley Mar 19, 2024
5317daa
fix: Revert to using SQLiteStore when opening existing Zarr file
annehaley Mar 19, 2024
32c9c58
fix: move start of `self._addLock` context
annehaley Mar 19, 2024
728faee
fix: set dtype of `np.empty` when initializing root
annehaley Mar 19, 2024
14bee28
fix: call `validateZarr` in beginning of `getMetadata` if necessary
annehaley Mar 19, 2024
ce1b1b7
fix: Set `_bandCount` from length of `s` axis if exists
annehaley Mar 19, 2024
5fac3a8
test: enable new algorithm progression tests
annehaley Mar 19, 2024
23b70f7
test: write basic use case tests for zarr sink in `test_sink.py`
annehaley Mar 19, 2024
414ba67
style: fix line too long; use string append
annehaley Mar 19, 2024
f35d06e
style: avoid explicit string concatenation
annehaley Mar 20, 2024
157c731
test: add `testNew` and initialize `_levels`
annehaley Mar 20, 2024
62e6572
fix: unindent mask reshape clause
annehaley Mar 20, 2024
71d1068
test: add `testAddTileWithMask` and fix caching to pass
annehaley Mar 20, 2024
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
262 changes: 259 additions & 3 deletions sources/zarr/large_image_source_zarr/__init__.py
Original file line number Diff line number Diff line change
@@ -1,19 +1,23 @@
import math
import os
import shutil
import tempfile
import threading
import uuid
from importlib.metadata import PackageNotFoundError
from importlib.metadata import version as _importlib_version
from pathlib import Path

import numpy as np
import packaging.version
import zarr

import large_image
from large_image.cache_util import LruCacheMetaclass, methodcache
from large_image.constants import TILE_FORMAT_NUMPY, SourcePriority
from large_image.constants import NEW_IMAGE_PATH_FLAG, TILE_FORMAT_NUMPY, SourcePriority
from large_image.exceptions import TileSourceError, TileSourceFileNotFoundError
from large_image.tilesource import FileTileSource
from large_image.tilesource.utilities import nearPowerOfTwo
from large_image.tilesource.utilities import _imageToNumpy, nearPowerOfTwo

try:
__version__ = _importlib_version(__name__)
Expand Down Expand Up @@ -52,6 +56,13 @@
"""
super().__init__(path, **kwargs)

if str(path).startswith(NEW_IMAGE_PATH_FLAG):
self._initNew(path, **kwargs)
else:
self._initOpen(**kwargs)
self._tileLock = threading.RLock()

def _initOpen(self, **kwargs):
self._largeImagePath = str(self._getLargeImagePath())
self._zarr = None
if not os.path.isfile(self._largeImagePath) and '//:' not in self._largeImagePath:
Expand Down Expand Up @@ -79,7 +90,43 @@
except Exception:
msg = 'File cannot be opened -- not an OME NGFF file or understandable zarr file.'
raise TileSourceError(msg)
self._tileLock = threading.RLock()

def _initNew(self, path, **kwargs):
"""
Initialize the tile class for creating a new image.
"""
self._tempfile = tempfile.NamedTemporaryFile(suffix=path)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The name passed in path is guaranteed to be unique. I can see refactoring this (see other comments) to enable multi-process support, at which point we'd want to be able to consistently find the tempfile from the path. In that instance, if TMPDIR / path exists, we could open an existing zarr store and add to it rather than creating a new one. Obviously, doing so would require some logic changes (such as knowing the set of axes and maybe the final dimensions ahead of time and not recreating the dataset).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added this to the "future work" section of this PR's description so we can come back to this idea when the core functionality is complete.

self._zarr_store = zarr.SQLiteStore(self._tempfile.name)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It turns out that the SQLiteStore is vastly less performant than I had hoped. Specifically, there are some sort of locking semantics on this store that prevent efficient thread sharing. The DirectoryStore is the most efficient (but, of course, means that we have a temp directory to clean up rather than a file; I think this is okay). The ZipStore is less efficient than the DirectoryStore, but still much more efficient than the SQLiteStore. In my sweep algortihm test, the ZipStore throws some warnings; I don't know if they are a concern or not. We should probably switch to the DirectoryStore for new files.

As a side effect of switching to the DirectoryStore, if we knew how big things were going to grow, we might by inter-process pickleabilty and ability (assuming each process writes different parts of the array). This would be in a future PR to ensure we have proved it works correctly.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I switched to using DirectoryStore in 5118fbc. This will need further testing and refinement, though.

self._zarr = zarr.open(self._zarr_store, mode='w')
# Make unpickleable
self._unpickleable = True
self._largeImagePath = None
self.sizeX = self.sizeY = self.levels = 0
self.tileWidth = self.tileHeight = self._tileSize
self._frames = [0]
self._cacheValue = str(uuid.uuid4())
self._output = None
self._editable = True
self._bandRanges = None
self._addLock = threading.RLock()
self._framecount = 0
self._mm_x = 0
self._mm_y = 0
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should probably add self._levels = [] here and a test that just does

sink = large_image_source_zarr.new()
assert sink.metadata['levels'] == 0
assert sink.getRegion(format=TILE_FORMAT_NUMPY)[0].shape[:2] == (0, 0)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, added in 157c731


def __del__(self):
try:
annehaley marked this conversation as resolved.
Show resolved Hide resolved
self._zarr.close()
self._tempfile.close()

Check warning on line 119 in sources/zarr/large_image_source_zarr/__init__.py

View check run for this annotation

Codecov / codecov/patch

sources/zarr/large_image_source_zarr/__init__.py#L119

Added line #L119 was not covered by tests
except Exception:
pass

def _checkEditable(self):
"""
Raise an exception if this is not an editable image.
"""
if not self._editable:
msg = 'Not an editable image'
raise TileSourceError(msg)

Check warning on line 129 in sources/zarr/large_image_source_zarr/__init__.py

View check run for this annotation

Codecov / codecov/patch

sources/zarr/large_image_source_zarr/__init__.py#L128-L129

Added lines #L128 - L129 were not covered by tests

def _getGeneralAxes(self, arr):
"""
Expand Down Expand Up @@ -396,6 +443,9 @@

@methodcache()
def getTile(self, x, y, z, pilImageAllowed=False, numpyAllowed=False, **kwargs):
if self._levels is None:
annehaley marked this conversation as resolved.
Show resolved Hide resolved
self._validateZarr()

Check warning on line 447 in sources/zarr/large_image_source_zarr/__init__.py

View check run for this annotation

Codecov / codecov/patch

sources/zarr/large_image_source_zarr/__init__.py#L447

Added line #L447 was not covered by tests

frame = self._getFrame(**kwargs)
self._xyzInRange(x, y, z, frame, self._framecount)
x0, y0, x1, y1, step = self._xyzToCorners(x, y, z)
Expand Down Expand Up @@ -438,6 +488,204 @@
return self._outputTile(tile, TILE_FORMAT_NUMPY, x, y, z,
pilImageAllowed, numpyAllowed, **kwargs)

def addTile(self, tile, x=0, y=0, mask=None, axes=None, **kwargs):
"""
Add a numpy or image tile to the image, expanding the image as needed
to accommodate it. Note that x and y can be negative. If so, the
output image (and internal memory access of the image) will act as if
the 0, 0 point is the most negative position. Cropping is applied
after this offset.

:param tile: a numpy array, PIL Image, or a binary string
with an image. The numpy array can have 2 or 3 dimensions.
:param x: location in destination for upper-left corner.
:param y: location in destination for upper-left corner.
:param mask: a 2-d numpy array (or 3-d if the last dimension is 1).
If specified, areas where the mask is false will not be altered.
:param axes: a string or list of strings specifying the names of axes
in the same order as the tile dimensions
:param kwargs: start locations for any additional axes
"""
# TODO: improve band bookkeeping

placement = {
'x': x,
'y': y,
**kwargs,
}
if axes is None:
if len(tile.shape) == 2:
axes = 'yx'

Check warning on line 518 in sources/zarr/large_image_source_zarr/__init__.py

View check run for this annotation

Codecov / codecov/patch

sources/zarr/large_image_source_zarr/__init__.py#L518

Added line #L518 was not covered by tests
elif len(tile.shape) == 3:
axes = 'yxs'

Check warning on line 520 in sources/zarr/large_image_source_zarr/__init__.py

View check run for this annotation

Codecov / codecov/patch

sources/zarr/large_image_source_zarr/__init__.py#L520

Added line #L520 was not covered by tests
else:
axes = ''

Check warning on line 522 in sources/zarr/large_image_source_zarr/__init__.py

View check run for this annotation

Codecov / codecov/patch

sources/zarr/large_image_source_zarr/__init__.py#L522

Added line #L522 was not covered by tests
if isinstance(axes, str):
axes = axes.lower()
elif isinstance(axes, list):
axes = [x.lower() for x in axes]
else:
err = 'Invalid type for axes. Must be str or list[str].'
raise ValueError(err)

Check warning on line 529 in sources/zarr/large_image_source_zarr/__init__.py

View check run for this annotation

Codecov / codecov/patch

sources/zarr/large_image_source_zarr/__init__.py#L528-L529

Added lines #L528 - L529 were not covered by tests

self._checkEditable()
tile, mode = _imageToNumpy(tile)
self._levels = None # reset zarr validation

while len(tile.shape) < len(axes):
tile = np.expand_dims(tile, axis=0)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It occurred to me this could be done without an explicit while loop. I think the equivalent would be:

if len(tile.shape) < len(axes):
    tile = np.expand_dims(tile, axis=tuple(range(len(axes) - len(tile.shape))))
if mask is not None and len(mask.shape) < len(axes):
    mask = np.expand_dims(mask, axis=tuple(range(len(axes) - len(mask.shape))))

I'm not sure that is clearer, though.


current_arrays = dict(self._zarr.arrays())
if 'root' in current_arrays:
root = current_arrays['root']
else:
root = self._zarr.create_dataset('root', data=tile)

new_dims = {
a: max(
root.shape[i],
placement.get(
a,
0) +
tile.shape[i]) for i,
a in enumerate(axes)}
root_data = np.pad(
root,
[(0, d - root.shape[i]) for i, d in enumerate(new_dims.values())],
mode='empty',
)

tile_data = np.pad(
tile,
[(placement.get(a, 0), d - placement.get(a, 0) - tile.shape[i])
for i, (a, d) in enumerate(new_dims.items())],
mode='empty',
)

if mask is None:
mask = np.ones(tile.shape[:-1])
while len(mask.shape) < len(tile_data.shape):
mask = np.expand_dims(mask, axis=-1)
mask = np.repeat(mask, tile_data.shape[len(mask.shape) - 1], axis=-1)
mask_data = np.pad(
mask,
[(placement.get(a, 0), d - placement.get(a, 0) - mask.shape[i])
for i, (a, d) in enumerate(new_dims.items())],
mode='constant',
constant_values=0,
)
mask_data = mask_data.astype(bool)
np.copyto(root_data, tile_data, where=mask_data)

with self._addLock:
annehaley marked this conversation as resolved.
Show resolved Hide resolved
# This will rechunk data when necessary, according to new shape
root = self._zarr.create_dataset('root', data=root_data, overwrite=True)

# Edit OME metadata
self._zarr.attrs.update({
'multiscales': [{
'version': '0.5-dev',
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't checked ome zarr docs for version information. Are we more conformant to a dev version than the latest official version?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was just following what I found here: https://ngff.openmicroscopy.org/latest/#multiscale-md, but I'm not entirely confident that this is all correct.

'axes': [{
'name': a,
'type': 'space' if a in ['x', 'y'] else 'other',
} for a in axes],
'datasets': [{'path': 0}],
}],
'omero': {'version': '0.5-dev'},
})

@property
def crop(self):
"""
Crop only applies to the output file, not the internal data access.

It consists of x, y, w, h in pixels.
"""
return getattr(self, '_crop', None)

@crop.setter
def crop(self, value):
self._checkEditable()

Check warning on line 608 in sources/zarr/large_image_source_zarr/__init__.py

View check run for this annotation

Codecov / codecov/patch

sources/zarr/large_image_source_zarr/__init__.py#L608

Added line #L608 was not covered by tests
if value is None:
self._crop = None
return
x, y, w, h = value
x = int(x)
y = int(y)
w = int(w)
h = int(h)

Check warning on line 616 in sources/zarr/large_image_source_zarr/__init__.py

View check run for this annotation

Codecov / codecov/patch

sources/zarr/large_image_source_zarr/__init__.py#L610-L616

Added lines #L610 - L616 were not covered by tests
if x < 0 or y < 0 or w <= 0 or h <= 0:
msg = 'Crop must have non-negative x, y and positive w, h'
raise TileSourceError(msg)
self._crop = (x, y, w, h)

Check warning on line 620 in sources/zarr/large_image_source_zarr/__init__.py

View check run for this annotation

Codecov / codecov/patch

sources/zarr/large_image_source_zarr/__init__.py#L618-L620

Added lines #L618 - L620 were not covered by tests

def write(
self,
path,
lossy=True,
alpha=True,
overwriteAllowed=True,
):
"""
Output the current image to a file.

:param path: output path.
:param lossy: if false, emit a lossless file.
:param alpha: True if an alpha channel is allowed.
:param overwriteAllowed: if False, raise an exception if the output
path exists.
"""
# TODO: compute half, quarter, etc. resolutions
if not overwriteAllowed and os.path.exists(path):
raise TileSourceError('Output path exists (%s).' % str(path))

Check warning on line 640 in sources/zarr/large_image_source_zarr/__init__.py

View check run for this annotation

Codecov / codecov/patch

sources/zarr/large_image_source_zarr/__init__.py#L640

Added line #L640 was not covered by tests
annehaley marked this conversation as resolved.
Show resolved Hide resolved

self._validateZarr()
suffix = Path(path).suffix
data_file = self._tempfile
data_store = self._zarr_store

if self.crop:
x, y, w, h = self.crop
current_arrays = dict(self._zarr.arrays())

Check warning on line 649 in sources/zarr/large_image_source_zarr/__init__.py

View check run for this annotation

Codecov / codecov/patch

sources/zarr/large_image_source_zarr/__init__.py#L648-L649

Added lines #L648 - L649 were not covered by tests
# create new temp storage for cropped data
data_file = tempfile.NamedTemporaryFile()
data_store = zarr.SQLiteStore(data_file.name)
cropped_zarr = zarr.open(data_store, mode='w')

Check warning on line 653 in sources/zarr/large_image_source_zarr/__init__.py

View check run for this annotation

Codecov / codecov/patch

sources/zarr/large_image_source_zarr/__init__.py#L651-L653

Added lines #L651 - L653 were not covered by tests
for arr_name in current_arrays:
arr = np.array(current_arrays[arr_name])
cropped_arr = arr.take(

Check warning on line 656 in sources/zarr/large_image_source_zarr/__init__.py

View check run for this annotation

Codecov / codecov/patch

sources/zarr/large_image_source_zarr/__init__.py#L655-L656

Added lines #L655 - L656 were not covered by tests
indices=range(x, x + w),
axis=self._axes.get('x'),
).take(
indices=range(y, y + h),
axis=self._axes.get('y'),
)
cropped_zarr.create_dataset(arr_name, data=cropped_arr, overwrite=True)
cropped_zarr.attrs.update(self._zarr.attrs)

Check warning on line 664 in sources/zarr/large_image_source_zarr/__init__.py

View check run for this annotation

Codecov / codecov/patch

sources/zarr/large_image_source_zarr/__init__.py#L663-L664

Added lines #L663 - L664 were not covered by tests

data_file.flush()

if suffix in ['.db', '.sqlite']:
shutil.copy2(data_file.name, path)

Check warning on line 669 in sources/zarr/large_image_source_zarr/__init__.py

View check run for this annotation

Codecov / codecov/patch

sources/zarr/large_image_source_zarr/__init__.py#L669

Added line #L669 was not covered by tests
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was trying out using a ZipStore or DirectoryStore for the temp file. In this case, I replaced the shutil.copy2 with zarr_store = zarr.SQLiteStore(path) zarr.copy_store(data_store, zarr_store), zarr_store.close(). When writing to a network-mounted drive, occasionally zarr.SQLiteStore(path) would throw an error that appeared to be cause by a time out. When we do zarr.SQLiteStore(path), should we be specifying any other parameters?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure. What does the error look like? I can search for if anyone has experienced it and found a solution.


elif suffix == '.zip':
zip_store = zarr.storage.ZipStore(path)
annehaley marked this conversation as resolved.
Show resolved Hide resolved
zarr.copy_store(data_store, zip_store)
annehaley marked this conversation as resolved.
Show resolved Hide resolved
zip_store.close()

elif suffix == '.zarr':
dir_store = zarr.storage.DirectoryStore(path)
zarr.copy_store(data_store, dir_store)
dir_store.close()

Check warning on line 679 in sources/zarr/large_image_source_zarr/__init__.py

View check run for this annotation

Codecov / codecov/patch

sources/zarr/large_image_source_zarr/__init__.py#L677-L679

Added lines #L677 - L679 were not covered by tests

else:
from large_image_converter import convert

Check warning on line 682 in sources/zarr/large_image_source_zarr/__init__.py

View check run for this annotation

Codecov / codecov/patch

sources/zarr/large_image_source_zarr/__init__.py#L682

Added line #L682 was not covered by tests

convert(data_file.name, path, overwrite=overwriteAllowed)

Check warning on line 684 in sources/zarr/large_image_source_zarr/__init__.py

View check run for this annotation

Codecov / codecov/patch

sources/zarr/large_image_source_zarr/__init__.py#L684

Added line #L684 was not covered by tests
annehaley marked this conversation as resolved.
Show resolved Hide resolved

if self.crop:
data_file.close()

Check warning on line 687 in sources/zarr/large_image_source_zarr/__init__.py

View check run for this annotation

Codecov / codecov/patch

sources/zarr/large_image_source_zarr/__init__.py#L687

Added line #L687 was not covered by tests


def open(*args, **kwargs):
"""
Expand All @@ -451,3 +699,11 @@
Check if an input can be read by the module class.
"""
return ZarrFileTileSource.canRead(*args, **kwargs)


def new(*args, **kwargs):
"""
Create a new image, collecting the results from patches of numpy arrays or
smaller images.
"""
return ZarrFileTileSource(NEW_IMAGE_PATH_FLAG + str(uuid.uuid4()), *args, **kwargs)
Loading