
feat: s3 data validation: annotations, frames, gains #207

Merged
merged 46 commits into main from daniel-ji/s3-data-validation-frames-gains on Sep 3, 2024

Conversation

@daniel-ji (Contributor) commented Aug 16, 2024:

Creates s3 validation for annotations, frames, and gains. To be merged in after #228. See the annotation, frames, and gains sections of https://docs.google.com/document/d/1yMKM0DW9KRhlcYiBGPcR7oW0liGtUew6NAmBhMg5U3w/edit

I've left comments explaining some potentially confusing parts of the PR; other parts seemed straightforward enough that I left them uncommented.

@daniel-ji (Contributor Author) commented Aug 16, 2024:

For reading frames files (*.tiff / *.eer)

@daniel-ji (Contributor Author) commented Aug 16, 2024:

    zarrays = {}
    for i in range(3):
        zarrays[i] = json.loads(fsstore[str(i) + "/.zarray"].decode())
    return {"zattrs": loc.root_attrs, "zarrays": zarrays}

@daniel-ji (Contributor Author) commented Aug 16, 2024:

^ gets zarr header data, while ensuring all the data is properly retrieved

@daniel-ji daniel-ji marked this pull request as ready for review August 22, 2024 16:28

from common.fs import FileSystemApi

# ==================================================================================================
# Helper functions
# ==================================================================================================

# block sizes are experimentally tested to be the fastest
MRC_HEADER_BLOCK_SIZE = 500 * 2**10
Contributor:

This feels like a very large size for just the headers.

@daniel-ji (Contributor Author) commented Aug 26, 2024:

Sorry, will correct (the 500KB block size is needed for bz2 compressed headers)

def get_mrc_header(mrcfile: str, fs: FileSystemApi) -> MrcInterpreter:
    """Get the MRC header for an mrc file."""
    try:
        with fs.open(mrcfile, "rb", block_size=MRC_HEADER_BLOCK_SIZE) as f:
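(The snippet is truncated here. A minimal sketch of how the body might continue, assuming mrcfile's MrcInterpreter, whose constructor accepts iostream, permissive, and header_only; the error handling is illustrative and assumes pytest is imported:)

            # Parse only the header from the already-opened stream.
            return MrcInterpreter(iostream=f, permissive=True, header_only=True)
    except Exception as e:
        # Fail the test rather than raising, matching the pytest-based validation style.
        pytest.fail(f"Failed to read MRC header for {mrcfile}: {e}")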
Contributor:

We could use fs.local_readable here and in similar cases below, as that prevents us from refetching the same data if we request the same mrc_header block outside this specific case.

@daniel-ji (Contributor Author):

I don't think there should be another way of getting mrc_header data? Functions that retrieve it are also "cached" with pytest fixtures, so we should never pull down one mrc file's header more than once. Additionally, local_readable seems to pull down the entire file even when a block_size is passed in, which is bad when the mrc files can be quite large and we only need a small part of them.

Contributor:

Not a blocker, but having a consistent file-access pattern would be better. If we want to read just a block of data, read_block would be the better choice.
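For illustration, a rough sketch of that pattern, assuming fs.s3fs is an fsspec-compatible filesystem (read_block is part of fsspec's AbstractFileSystem API; the helper name is hypothetical):

# Hypothetical helper: fetch only the first `length` bytes of the object
# instead of opening the whole file.
def read_header_block(fs: FileSystemApi, path: str, length: int = MRC_HEADER_BLOCK_SIZE) -> bytes:
    return fs.s3fs.read_block(path, offset=0, length=length)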

@daniel-ji (Contributor Author):

Noted, I'll add this as a future improvement in a separate PR (after all of these current ones 😭)


def get_zarr_headers(zarrfile: str, fs: FileSystemApi) -> Dict[str, Dict]:
"""Get the zattrs and zarray data for a zarr volume file."""
expected_children = {f"{child}" for child in [0, 1, 2, ".zattrs", ".zgroup"]}
Contributor:

I believe this is not being used outside a pytest fail message, where we could instead use its duplicate, expected_fsstore_children.

fsstore_children = set(fsstore.listdir())
expected_fsstore_children = {"0", "1", "2", ".zattrs", ".zgroup"}
if expected_fsstore_children != fsstore_children:
    pytest.fail(f"Expected zarr children: {expected_children}, Actual zarr children: {fsstore_children}")
Contributor:

We can replace expected_children with expected_fsstore_children here?

Comment on lines 54 to 55
fsstore = zarr.storage.FSStore(url=zarrfile, mode="r", fs=fs.s3fs, dimension_separator="/")
fsstore_children = set(fsstore.listdir())
Contributor:

We should keep it consistent and use either the zarr library functions or common.fs throughout. We shouldn't add fsstore as yet another library for interacting with the s3 files.

Suggested change
-fsstore = zarr.storage.FSStore(url=zarrfile, mode="r", fs=fs.s3fs, dimension_separator="/")
-fsstore_children = set(fsstore.listdir())
+file_paths = fs.glob(os.path.join(zarrfile, "*"))
+fsstore_children = {os.path.basename(file) for file in file_paths}

Comment on lines 60 to 63
loc = ZarrLocation(fsstore)
zarrays = {}
for binning_factor in [0, 1, 2]: # 1x, 2x, 4x
zarrays[binning_factor] = json.loads(fsstore[str(binning_factor) + "/.zarray"].decode())
Contributor:

Suggested change
-loc = ZarrLocation(fsstore)
-zarrays = {}
-for binning_factor in [0, 1, 2]:  # 1x, 2x, 4x
-    zarrays[binning_factor] = json.loads(fsstore[str(binning_factor) + "/.zarray"].decode())
+zarrays = {
+    binning: json.load(fs.local_readable(os.path.join(zarrfile, binning, ".zarray")))
+    for binning in BINNING_SCALES
+}

    return mdocs
def tiltseries_mdoc(tiltseries_mdoc_files: List[str], filesystem: FileSystemApi) -> pd.DataFrame:
    """Load the tiltseries mdoc files and return a concatenated DataFrame."""
    tiltseries_dataframes = [mdocfile.read(filesystem.localreadable(mdoc_file)) for mdoc_file in tiltseries_mdoc_files]
Contributor:

Correct me if I am wrong, but didn't we want to retain the individual file data without merging?

@daniel-ji (Contributor Author) commented Aug 26, 2024:

I thought merging was fine? The read command just returns a pandas DataFrame, and concatenating the DataFrames doesn't lose any information, but it gives us all the data in one DataFrame for easier validation checking.

Contributor:

I am not sure if there is a valid case for multiple mdoc files to exist.
cc: @uermel
But, if that is not the case, having multiple entries for a z-value can make it inconsistent. We should handle this case in source. We should fail if there are multiple mdoc files.

@daniel-ji (Contributor Author):

Note: after an in-person discussion, we are no longer merging.
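A minimal sketch of the unmerged variant, assuming the mdocfile package's read() returns a pandas DataFrame (the fixture shape here is illustrative, not the final code):

# Keep one DataFrame per mdoc file instead of concatenating, so entries
# from different files are never mixed.
def tiltseries_mdoc(tiltseries_mdoc_files: List[str], filesystem: FileSystemApi) -> Dict[str, pd.DataFrame]:
    return {
        mdoc_file: mdocfile.read(filesystem.localreadable(mdoc_file))
        for mdoc_file in tiltseries_mdoc_files
    }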

@@ -69,16 +70,20 @@ def run_meta_file(run_dir: str, filesystem: FileSystemApi) -> str:


@pytest.fixture(scope="session")
-def frames_dir(run_dir: str, tiltseries_metadata: Dict[str, Any], filesystem: FileSystemApi) -> str:
+def frames_dir(run_dir: str, filesystem: FileSystemApi) -> str:
    """[Dataset]/[ExperimentRun]/Frames"""
    dst = f"{run_dir}/Frames"
    if filesystem.s3fs.exists(dst):
Contributor:

Looks like we are using filesystem.s3fs.exists in multiple places, so we should support that as a method on filesystem.
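Something like this hypothetical wrapper on FileSystemApi (method name and placement assumed):

class FileSystemApi:
    ...
    def exists(self, path: str) -> bool:
        # Thin wrapper so callers don't reach into the underlying s3fs instance.
        return self.s3fs.exists(path)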

Comment on lines 107 to 109
dst = f"{run_dir}/Frames"
if filesystem.s3fs.exists(dst):
    return dst
Contributor:

This only validates that a frames directory exists and doesn't provide any information beyond frames_dir. Unfortunately, it has no relevance to the gain file itself; the gain_file test below already covers whether the file exists.

@daniel-ji (Contributor Author):

My understanding is that the gain_dir is the frames_dir (at least as of now?). So this fixture provides the gain_dir if it exists (otherwise it skips the test). The gain_dir is then used for the gain_file fixture.


def check_zattrs_path(header_data, _zarr_filename):
    del _zarr_filename
    for binning_factor in [0, 1, 2]:  # 1x, 2x, 4x
@daniel-ji (Contributor Author):

the 0, 1, 2 is replaced by a constant in the tiltseries PR
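(Presumably something like the BINNING_SCALES constant referenced in the suggestion above; the exact name and value live in that PR, so this is an assumption:)

# Assumed shape of the constant replacing the repeated literal list.
BINNING_SCALES = [0, 1, 2]  # zarr subdirectories for 1x, 2x, 4x binning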


def check_zattrs_voxel_spacings(header_data, _zarr_filename, voxel_spacing):
    del _zarr_filename
    for binning_factor in [0, 1, 2]:  # 1x, 2x, 4x
@daniel-ji (Contributor Author) commented Aug 27, 2024:

binning factors replaced by constant here as well

daniel-ji added a commit that referenced this pull request Aug 28, 2024
@daniel-ji daniel-ji changed the base branch from main to daniel-ji/s3-data-validation-metadata-keyphotos August 28, 2024 00:54
@daniel-ji daniel-ji changed the title s3 data validation: annotations, datasets, deposition, key photos, frames, gains s3 data validation: frames & gains Aug 28, 2024
@daniel-ji daniel-ji changed the title s3 data validation: frames & gains s3 data validation: annotations, frames, gains Aug 28, 2024
daniel-ji added a commit that referenced this pull request Aug 29, 2024
* break up #207 pr

* move more code over

* fixes from PR review

* fixes
Base automatically changed from daniel-ji/s3-data-validation-metadata-keyphotos to main August 29, 2024 18:48
for annotation_filename, points in annotations.items():
    print(f"\tFile: {annotation_filename}")
    for point in points:
        assert 0 <= point["location"]["x"] <= canonical_tomogram_metadata["size"]["x"] - 1
Contributor:

Could we create max_tomo_x_limit outside the loops and reuse it instead of canonical_tomogram_metadata["size"]["x"] - 1? Similarly for y and z. 😅
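A sketch of that refactor (the limit names follow the reviewer's suggestion; everything else mirrors the snippet above):

# Hoist the inclusive upper bounds out of the loops.
max_tomo_x_limit = canonical_tomogram_metadata["size"]["x"] - 1
max_tomo_y_limit = canonical_tomogram_metadata["size"]["y"] - 1
max_tomo_z_limit = canonical_tomogram_metadata["size"]["z"] - 1
for annotation_filename, points in annotations.items():
    for point in points:
        assert 0 <= point["location"]["x"] <= max_tomo_x_limit
        assert 0 <= point["location"]["y"] <= max_tomo_y_limit
        assert 0 <= point["location"]["z"] <= max_tomo_z_limit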

for metadata in annotation_metadata.values():
    assert isinstance(metadata["annotation_object"], dict)
    assert isinstance(metadata["annotation_object"]["name"], str)
    assert isinstance(metadata["annotation_object"]["id"], str)
Contributor:

I believe this could be null..

@daniel-ji (Contributor Author):

As per an in-person discussion, we determined it can't be null 👍

# By setting this scope to session, scope="session" fixtures will be reinitialized for each run + voxel_spacing combination
@pytest.mark.annotation
@pytest.mark.parametrize("run_name, voxel_spacing", pytest.run_spacing_combinations, scope="session")
class TestSegmentationMask(HelperTestMRCHeader):
Contributor:

Could you also add this comment in the code? 🤓

from mrcfile.mrcinterpreter import MrcInterpreter


class HelperTestMRCHeader:
Contributor:

Could this be HelperTestVolumeHeader so it could test both mrc and zarr files? We could have a validate_zarr property that is set/overridden by the inheriting classes to handle cases such as gains, where there are no zarr files.
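A rough sketch of that idea (class and attribute names are the reviewer's suggestion; the test body is illustrative):

class HelperTestVolumeHeader:
    # Subclasses for entities without zarr files (e.g. gains) override this.
    validate_zarr: bool = True

    def test_zarr_header(self, zarr_headers):
        if not self.validate_zarr:
            pytest.skip("This entity has no zarr files")
        # ... shared zarr header assertions ...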

@daniel-ji (Contributor Author):

I've created a separate class called HelperTestZarrHeader in later PRs to achieve the same functionality, if that's alright. I feel it separates the concerns of the classes better.


    pytest.fail(
        f"Metadata file not found for {len(remaining_annotation_files)} {name} annotation files.",
    )

if count == 0:
Contributor:

nit: Instead of using a counter variable for this, we could check if corresponding_annotation_files is not empty.
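(i.e., something like this, assuming corresponding_annotation_files is the collection the counter was tracking; the fail message is illustrative:)

# Emptiness check instead of a counter variable.
if not corresponding_annotation_files:
    pytest.fail(f"No corresponding annotation files found for {name}")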

errors = []

for gain_file in gain_files:
    if not any(gain_file.endswith(ext) for ext in PERMITTED_FRAME_EXTENSIONS):
Contributor:

Should this be PERMITTED_GAIN_EXTENSIONS?

    assert first_mrc_gain.header.nx == first_mrc_frame.header.nx
    assert first_mrc_gain.header.ny == first_mrc_frame.header.ny
else:
    pytest.skip("No MRC files found to compare pixel spacing")
Contributor:

nit: Can we make this error "Couldn't find one or both of the following: frame and gain"?

"""Check that the gain pixel spacing & dimensions matches the frame pixel spacing. Just need to check first MRC file of each."""

first_mrc_gain = None
for _, gain_header in gain_headers.items():
Contributor:

Suggested change
-for _, gain_header in gain_headers.items():
+for gain_header in gain_headers.values():

    break

first_mrc_frame = None
for _, frame_header in frames_headers.items():
Contributor:

Suggested change
-for _, frame_header in frames_headers.items():
+for frame_header in frames_headers.values():

@manasaV3 manasaV3 changed the title s3 data validation: annotations, frames, gains feat: s3 data validation: annotations, frames, gains Sep 2, 2024
@daniel-ji daniel-ji merged commit 835de67 into main Sep 3, 2024
9 checks passed
@daniel-ji daniel-ji deleted the daniel-ji/s3-data-validation-frames-gains branch September 3, 2024 00:28