feat: s3 data validation: annotations, frames, gains #207
Conversation
ingestion_tools/pyproject.toml
For reading frames files (`*.tiff` / `*.eer`).
For better assertion errors: https://stackoverflow.com/questions/41522767/pytest-assert-introspection-in-helper-function
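For reference, a minimal sketch of the pattern from that thread: registering the helper module for pytest's assertion rewriting, so plain `assert`s inside helper functions get full introspection (the module name below is hypothetical):

```python
# conftest.py
import pytest

# Must run before the helper module is first imported anywhere;
# "helper_mrc_zarr" is a hypothetical module name for illustration.
pytest.register_assert_rewrite("helper_mrc_zarr")
```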
```python
zarrays = {}
for i in range(3):
    zarrays[i] = json.loads(fsstore[str(i) + "/.zarray"].decode())
return {"zattrs": loc.root_attrs, "zarrays": zarrays}
```
^ Gets the zarr header data, while ensuring all the data is properly retrieved.
```python
from common.fs import FileSystemApi

# ==================================================================================================
# Helper functions
# ==================================================================================================

# block sizes are experimentally tested to be the fastest
MRC_HEADER_BLOCK_SIZE = 500 * 2**10
```
This feels like a very large size for just the headers.
Sorry, will correct (the 500 KB block size is needed for bz2-compressed headers).
```python
def get_mrc_header(mrcfile: str, fs: FileSystemApi) -> MrcInterpreter:
    """Get the header for an mrc file."""
    try:
        with fs.open(mrcfile, "rb", block_size=MRC_HEADER_BLOCK_SIZE) as f:
```
We could use `fs.local_readable` here and in similar cases below, as that prevents us from refetching the same data if we request the same mrc header block outside this specific case.
I don't think there should be another way of getting the mrc header data? Functions that retrieve it are also "cached" with pytest fixtures, so we should never be pulling down one mrc file's header more than once. Additionally, `local_readable` seems to pull down the entire file even if a `block_size` is passed in, which is not good when the mrc files can be quite large and we only need a small part of them.
Not a blocker, but having a consistent file access pattern would be better. If we want to read just a block of data, `read_block` would be the better choice.
Noted, I'll add this as a future improvement in a separate PR (after all of these current ones 😭)
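For reference, a minimal sketch of the `read_block` pattern mentioned above, assuming `fs.s3fs` is an fsspec-compatible filesystem (as it is used elsewhere in this PR):

```python
# Fetch only the first MRC_HEADER_BLOCK_SIZE bytes of the file,
# instead of opening a file object over the whole (possibly large) mrc file.
header_bytes = fs.s3fs.read_block(mrcfile, offset=0, length=MRC_HEADER_BLOCK_SIZE)
```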
```python
def get_zarr_headers(zarrfile: str, fs: FileSystemApi) -> Dict[str, Dict]:
    """Get the zattrs and zarray data for a zarr volume file."""
    expected_children = {f"{child}" for child in [0, 1, 2, ".zattrs", ".zgroup"]}
```
I believe this is not being used outside a pytest fail message, where we could instead use the duplicate of this variable (`expected_fsstore_children`).
```python
fsstore_children = set(fsstore.listdir())
expected_fsstore_children = {"0", "1", "2", ".zattrs", ".zgroup"}
if expected_fsstore_children != fsstore_children:
    pytest.fail(f"Expected zarr children: {expected_children}, Actual zarr children: {fsstore_children}")
```
We can replace `expected_children` with `expected_fsstore_children` here?
```python
fsstore = zarr.storage.FSStore(url=zarrfile, mode="r", fs=fs.s3fs, dimension_separator="/")
fsstore_children = set(fsstore.listdir())
```
We should keep it consistent and either use the `zarr` library functions or `common.fs` throughout. We shouldn't add `fsstore` as yet another addition to the mix of libraries being used to interact with the s3 files.
Suggested change:
```diff
-fsstore = zarr.storage.FSStore(url=zarrfile, mode="r", fs=fs.s3fs, dimension_separator="/")
-fsstore_children = set(fsstore.listdir())
+file_paths = fs.glob(os.path.join(zarrfile, "*"))
+fsstore_children = {os.path.basename(file) for file in file_paths}
```
```python
loc = ZarrLocation(fsstore)
zarrays = {}
for binning_factor in [0, 1, 2]:  # 1x, 2x, 4x
    zarrays[binning_factor] = json.loads(fsstore[str(binning_factor) + "/.zarray"].decode())
```
Suggested change:
```diff
-loc = ZarrLocation(fsstore)
-zarrays = {}
-for binning_factor in [0, 1, 2]:  # 1x, 2x, 4x
-    zarrays[binning_factor] = json.loads(fsstore[str(binning_factor) + "/.zarray"].decode())
+zarrays = {
+    binning: json.load(fs.local_readable(os.path.join(zarrfile, binning, ".zarray")))
+    for binning in BINNING_SCALES
+}
```
```python
return mdocs

def tiltseries_mdoc(tiltseries_mdoc_files: List[str], filesystem: FileSystemApi) -> pd.DataFrame:
    """Load the tiltseries mdoc files and return a concatenated DataFrame."""
    tiltseries_dataframes = [mdocfile.read(filesystem.localreadable(mdoc_file)) for mdoc_file in tiltseries_mdoc_files]
```
Correct me if I am wrong, but didn't we want to retain the individual file data without merging?
I thought merging was fine? The read command just returns a pandas dataframe, and concatenating the dataframes doesn't result in any information loss; now we have all the data in one dataframe for easier validation checking.
I am not sure if there is a valid case for multiple mdoc files to exist.
cc: @uermel
But if that is not the case, having multiple entries for a z-value can make it inconsistent. We should handle this case at the source and fail if there are multiple mdoc files.
Note: after an in-person discussion, we are no longer merging.
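A minimal sketch of the fail-fast check discussed above, reusing the fixture names from this diff:

```python
# Fail fast when a run unexpectedly contains more than one mdoc file,
# since multiple entries for the same z-value would be inconsistent.
if len(tiltseries_mdoc_files) > 1:
    pytest.fail(f"Expected exactly one mdoc file, found {len(tiltseries_mdoc_files)}")
```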
```diff
@@ -69,16 +70,20 @@ def run_meta_file(run_dir: str, filesystem: FileSystemApi) -> str:


 @pytest.fixture(scope="session")
-def frames_dir(run_dir: str, tiltseries_metadata: Dict[str, Any], filesystem: FileSystemApi) -> str:
+def frames_dir(run_dir: str, filesystem: FileSystemApi) -> str:
     """[Dataset]/[ExperimentRun]/Frames"""
     dst = f"{run_dir}/Frames"
     if filesystem.s3fs.exists(dst):
```
Looks like we are using `filesystem.s3fs.exists` in multiple places, so we should support that as a method on `filesystem`.
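A minimal sketch of the suggested wrapper (hypothetical; the real `FileSystemApi` internals may differ):

```python
class FileSystemApi:
    # ... existing attributes, including self.s3fs ...

    def exists(self, path: str) -> bool:
        """Check whether a path exists, so callers don't reach into s3fs directly."""
        return self.s3fs.exists(path)
```

Callers would then write `filesystem.exists(dst)` instead of `filesystem.s3fs.exists(dst)`.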
```python
dst = f"{run_dir}/Frames"
if filesystem.s3fs.exists(dst):
    return dst
```
This only validates that a frames directory exists and doesn't provide any more information than `frames_dir`. Unfortunately, it has no relevance to the gain file in itself; the `gain_file` test below covers whether the file exists or not.
My understanding is that the gain_dir is the frames_dir (at least as of now?). So this fixture provides the gain_dir if it exists (otherwise it skips the test). The gain_dir is then used for the gain_file fixture.
```python
def check_zattrs_path(header_data, _zarr_filename):
    del _zarr_filename
    for binning_factor in [0, 1, 2]:  # 1x, 2x, 4x
```
The `0, 1, 2` is replaced by a constant in the tiltseries PR.
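Presumably something along these lines (the constant name below is hypothetical; the tiltseries PR defines the real one):

```python
# Zarr multiscale levels and the binnings they correspond to (1x, 2x, 4x).
BINNING_FACTORS = (0, 1, 2)

for binning_factor in BINNING_FACTORS:
    ...
```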
```python
def check_zattrs_voxel_spacings(header_data, _zarr_filename, voxel_spacing):
    del _zarr_filename
    for binning_factor in [0, 1, 2]:  # 1x, 2x, 4x
```
Binning factors are replaced by the constant here as well.
Force-pushed from 84da6a5 to 251214b:
- break up #207 pr
- move more code over
- fixes from PR review
- fixes
```python
for annotation_filename, points in annotations.items():
    print(f"\tFile: {annotation_filename}")
    for point in points:
        assert 0 <= point["location"]["x"] <= canonical_tomogram_metadata["size"]["x"] - 1
```
Could we create `max_tomo_x_limit` outside the loops and reuse it instead of `canonical_tomogram_metadata["size"]["x"] - 1`? Similarly for y and z. 😅
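That is, something like this sketch (using the `max_tomo_*_limit` names proposed above):

```python
# Compute the inclusive upper bounds once, outside the loops.
max_tomo_x_limit = canonical_tomogram_metadata["size"]["x"] - 1
max_tomo_y_limit = canonical_tomogram_metadata["size"]["y"] - 1
max_tomo_z_limit = canonical_tomogram_metadata["size"]["z"] - 1

for annotation_filename, points in annotations.items():
    for point in points:
        assert 0 <= point["location"]["x"] <= max_tomo_x_limit
        assert 0 <= point["location"]["y"] <= max_tomo_y_limit
        assert 0 <= point["location"]["z"] <= max_tomo_z_limit
```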
```python
for metadata in annotation_metadata.values():
    assert isinstance(metadata["annotation_object"], dict)
    assert isinstance(metadata["annotation_object"]["name"], str)
    assert isinstance(metadata["annotation_object"]["id"], str)
```
I believe this could be null.
As per an in-person discussion, we determined it can't be null 👍
```python
# By setting this scope to session, scope="session" fixtures will be reinitialized for each run + voxel_spacing combination
@pytest.mark.annotation
@pytest.mark.parametrize("run_name, voxel_spacing", pytest.run_spacing_combinations, scope="session")
class TestSegmentationMask(HelperTestMRCHeader):
```
Could you also add this comment in the code? 🤓
```python
from mrcfile.mrcinterpreter import MrcInterpreter


class HelperTestMRCHeader:
```
Could this be `HelperTestVolumeHeader`, so it could test both mrc and zarr files? We could have a `validate_zarr` property that is set/overridden by the inheriting classes, to handle cases such as gains where there are no zarr files.
I've created a separate class called HelperTestZarrHeader in later PRs to achieve the same functionality, if that's alright. I feel like it separates the concerns of the classes better.
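For illustration, a rough sketch of the property-based design suggested above (names hypothetical; the merged code uses a separate `HelperTestZarrHeader` class instead):

```python
import pytest


class HelperTestVolumeHeader:
    # Inheriting classes override this; e.g. gains have no zarr files.
    validate_zarr = True

    def test_zarr_header(self, zarr_headers):
        if not self.validate_zarr:
            pytest.skip("This volume type has no zarr files to validate")
        # ... shared zarr header assertions would go here ...


class TestGain(HelperTestVolumeHeader):
    validate_zarr = False  # gain files are mrc-only
```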
```python
pytest.fail(
    f"Metadata file not found for {len(remaining_annotation_files)} {name} annotation files.",
)

if count == 0:
```
Nit: instead of using a counter variable for this, we could check whether `corresponding_annotation_files` is empty.
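That is, a sketch of the suggested check, using the names from this diff:

```python
# Check the collection directly instead of maintaining a separate counter.
if not corresponding_annotation_files:
    pytest.fail(f"No corresponding annotation files found for {name}.")
```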
```python
errors = []

for gain_file in gain_files:
    if not any(gain_file.endswith(ext) for ext in PERMITTED_FRAME_EXTENSIONS):
```
Should this be `PERMITTED_GAIN_EXTENSIONS`?
```python
    assert first_mrc_gain.header.nx == first_mrc_frame.header.nx
    assert first_mrc_gain.header.ny == first_mrc_frame.header.ny
else:
    pytest.skip("No MRC files found to compare pixel spacing")
```
Nit: can we make this error "Couldn't find one or both of the following: frame and gain"?
"""Check that the gain pixel spacing & dimensions matches the frame pixel spacing. Just need to check first MRC file of each.""" | ||
|
||
first_mrc_gain = None | ||
for _, gain_header in gain_headers.items(): |
Suggested change:
```diff
-for _, gain_header in gain_headers.items():
+for gain_header in gain_headers.values():
```
```python
    break

first_mrc_frame = None
for _, frame_header in frames_headers.items():
```
Suggested change:
```diff
-for _, frame_header in frames_headers.items():
+for frame_header in frames_headers.values():
```
Creates s3 validation for annotations, frames, and gains. To be merged in after #228. See the annotation, frames, and gains sections of https://docs.google.com/document/d/1yMKM0DW9KRhlcYiBGPcR7oW0liGtUew6NAmBhMg5U3w/edit
I've left comments to describe some possibly confusing parts of the PR, but there are also some parts that I thought were relatively straightforward and left out.