pyvips.Image.new_from_array from np.mmap vs hdf5 #492

rgluskin · 2024-08-08T11:18:52Z

Hi.
I'm performing manipulations on WSIs and saving intermediate results in HDF5.
The end results are being saved to a new WSI using pyvips.Image.new_from_array.

However when running on very large slides I get OOM exceptions.

I did a quick experiment and saw that saving the intermediate results in np.mmap doesn't cause this issue.
From pyvips code it seems that the difference stems from __array_interface__ vs __array__ attributes (mmap has both but hdf5 only has the latter). However the fields that are actually accessed are fields such as dtype which are present in hdf5 anyway.

What would be your recommendation ?
Should I rewrite my whole code to use mmap ?
Or is there a more elegant approach ?

Thanks in advance.

The text was updated successfully, but these errors were encountered:

jcupitt · 2024-08-08T11:56:07Z

Hi @rgluskin,

Could you share some sample code that shows the problem? How are you saving in HDF5? Is this with write_to_file('xxx.mat'), or some numpy feature?

As far as I know, HDF5 is not a good format for large images, it needs huge amounts of memory. I would use (maybe) uncompressed pyramidal tiled TIFF, or perhaps jpeg-compressed, depending on the image size.

Are you sure you need to save intermediates? If you stick to pyvips, you can do most processing with no need for intermediates. What operation is forcing you to save?

rgluskin · 2024-08-08T12:12:37Z

I'm running various processing, including classical vision and deep learning pipelines.
HDF5 is used to store the results on disk, not in memory.
This is the gist of the code, I think it's enough to reproduce the issue.

with h5py.File(hdf5_fname, "w") as heatmap_file:
    heatmap_file.create_dataset(
        "value_map",
        (self.height, self.width, n_channels),
        dtype=dtype,
        chunks=self.HDF5_CHUNK_SIZE + (n_channels,),
        compression="gzip",
        fillvalue=0.0,
    )
self.heatmap_file = h5py.File(hdf5_fname, "r+")
self.value_map = self.heatmap_file["value_map"]
...
...

full_scale_image = self.value_map
...
...
# Tif resolution is pixels/cm, openslide resolution is um/pixel. VIPS interprets the parameter as pixels/mm
vips_res = 1e3 / resolution_umpp
pyvips.Image.new_from_array(full_scale_image, interpretation="rgb").write_to_file(file_path, pyramid=True, tile=True, bigtiff=True, xres=vips_res, yres=vips_res, compression="lzw")

Disk size is not an issue, only having to reallocate the whole image to RAM is (when return cls.new_from_array(obj.__array__()) is executed in vimage.py )

jcupitt · 2024-08-08T12:30:22Z

HDF5 is used to store the results on disk, not in memory.

I think loading HDF5 from disc will need a lot of memory, won't it? And it'll use amazing amounts of disc space -- your WSI scans will probably have been through jpeg already, so there's little chance of quality loss, I'd think.

You need to make a complete test program I can run that shows the problem. Otherwise I'll waste 30 minutes making a test myself, do something slightly different from your code, and not hit the same issue. Or that's what usually happens to me :(

rgluskin · 2024-08-08T14:44:07Z

Sorry here's the full code to reproduce:

import h5py
import pyvips

hdf5_fname = "temp.hdf5"
file_path = "temp.tiff"
resolution_umpp = 0.25
height = 100000
width = 192000
n_channels = 4
dtype = 'uint8'
HDF5_CHUNK_SIZE = (128, 128)

with h5py.File(hdf5_fname, "w") as heatmap_file:
    heatmap_file.create_dataset(
        "value_map",
        (height, width, n_channels),
        dtype=dtype,
        chunks=HDF5_CHUNK_SIZE + (n_channels,),
        compression="gzip",
        fillvalue=0.0,
    )
heatmap_file = h5py.File(hdf5_fname, "r+")
value_map = heatmap_file["value_map"]

full_scale_image = value_map

# Tif resolution is pixels/cm, openslide resolution is um/pixel. VIPS interprets the parameter as pixels/mm
vips_res = 1e3 / resolution_umpp
pyvips.Image.new_from_array(full_scale_image, interpretation="rgb").write_to_file(
    file_path, pyramid=True, tile=True, bigtiff=True, xres=vips_res, yres=vips_res, compression="lzw"
)

rgluskin · 2024-08-08T15:12:18Z

and this snippet works

import h5py
import numpy as np
import pyvips

hdf5_fname = "temp.hdf5"
file_path = "temp.tiff"
resolution_umpp = 0.25
height = 100000
width = 192000
n_channels = 4
dtype = 'uint8'

value_map = np.memmap(hdf5_fname, dtype=dtype, mode='w+', shape=(height, width, n_channels))


full_scale_image = value_map

# Tif resolution is pixels/cm, openslide resolution is um/pixel. VIPS interprets the parameter as pixels/mm
vips_res = 1e3 / resolution_umpp
pyvips.Image.new_from_array(full_scale_image, interpretation="rgb").write_to_file(
    file_path, pyramid=True, tile=True, bigtiff=True, xres=vips_res, yres=vips_res, compression="lzw"
)

My default course of action would be to refactor all such hdf5 usages to use np.memmap instead.
Unless you see a more elegant solution (i.e. somehow recognize that shape and dtype are present in hdf5 and read it from there)

jcupitt · 2024-08-08T16:34:17Z

Great! Thanks for that. I tried:

$ VIPS_PROGRESS=1 ./rgluskin.py 
mapping ...
making vips image ...
writing ...
rgluskin.py temp-2: 192000 x 100000 pixels, 32 threads, 192000 x 1 tiles, 640 lines in buffer
rgluskin.py temp-2: 24% complete
...

And watched it run in top. It allocated 78gb of VIRT at the start, then as the save executed, RES slowly creept up until, when the save ticker reached 100%, it was equal to VIRT.

I think this is probably unavoidable with HDF5 files opened via numpy. Does it have to be HDF5? TIFF (for example) should work well, eg. with your test file in TIFF format I see:

$ VIPS_PROGRESS=1 /usr/bin/time -f %M:%e  vips copy temp.tiff x.tif[tile,pyramid,compression=jpeg]
vips temp-6: 192000 x 100000 pixels, 32 threads, 128 x 128 tiles, 640 lines in buffer
vips temp-6: done in 151s          
memory: high-water mark 1.02 GB
1368380:151.39

1.3gb of peak memory use. Q85 jpeg should be no worse than LZW (assuming your WSI scanner outputs something like SVS), and much faster.

rgluskin · 2024-08-08T17:36:33Z

No, direct numpy works fine without hdf5. It has some other flexibility limitations, such as having to store a single tensor per file but I can refactor my code around that.

…

On Thu, 8 Aug 2024, 19:34 John Cupitt, ***@***.***> wrote: Great! Thanks for that. I tried: $ VIPS_PROGRESS=1 ./rgluskin.py mapping ... making vips image ... writing ... rgluskin.py temp-2: 192000 x 100000 pixels, 32 threads, 192000 x 1 tiles, 640 lines in buffer rgluskin.py temp-2: 24% complete ... And watched it run in top. It allocated 78gb of VIRT at the start, then as the save executes, RES slowly creeps up until, when the save ticker reaches 100%, it's equal to VIRT. I think this is probably unavoidable with HDF5 files opened via numpy. Does it have to be HDF5? TIFF (for example) should work well, eg. with your test file in TIFF format I see: $ VIPS_PROGRESS=1 /usr/bin/time -f %M:%e vips copy temp.tiff x.tif[tile,pyramid,compression=jpeg] vips temp-6: 192000 x 100000 pixels, 32 threads, 128 x 128 tiles, 640 lines in buffer vips temp-6: done in 151s memory: high-water mark 1.02 GB 1368380:151.39 1.3gb of peak memory use. Q85 jpeg should be no worse than LZW (assuming your WSI scanner outputs something like SVS), and much faster. — Reply to this email directly, view it on GitHub <#492 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AH4Y3IYGTMLBUZA2XSWN6ZTZQOMZ7AVCNFSM6AAAAABMGINPM2VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDENZWGIZTCNRXGU> . You are receiving this because you were mentioned.Message ID: ***@***.***>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

pyvips.Image.new_from_array from np.mmap vs hdf5 #492

pyvips.Image.new_from_array from np.mmap vs hdf5 #492

rgluskin commented Aug 8, 2024

jcupitt commented Aug 8, 2024

rgluskin commented Aug 8, 2024 •

edited

Loading

jcupitt commented Aug 8, 2024

rgluskin commented Aug 8, 2024

rgluskin commented Aug 8, 2024

jcupitt commented Aug 8, 2024 •

edited

Loading

rgluskin commented Aug 8, 2024 via email

pyvips.Image.new_from_array from np.mmap vs hdf5 #492

pyvips.Image.new_from_array from np.mmap vs hdf5 #492

Comments

rgluskin commented Aug 8, 2024

jcupitt commented Aug 8, 2024

rgluskin commented Aug 8, 2024 • edited Loading

jcupitt commented Aug 8, 2024

rgluskin commented Aug 8, 2024

rgluskin commented Aug 8, 2024

jcupitt commented Aug 8, 2024 • edited Loading

rgluskin commented Aug 8, 2024 via email

rgluskin commented Aug 8, 2024 •

edited

Loading

jcupitt commented Aug 8, 2024 •

edited

Loading