feat: s3 data validation: annotations, frames, gains #207
Conversation
ingestion_tools/pyproject.toml
For reading frames files (`*.tiff` / `*.eer`).
For better assertion errors: https://stackoverflow.com/questions/41522767/pytest-assert-introspection-in-helper-function
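For reference, a minimal sketch of the pattern from that thread: registering the helper module for pytest's assertion rewriting, so plain `assert`s inside helper functions get full introspection (the module name below is hypothetical):

```python
# conftest.py
import pytest

# Must run before the helper module is first imported anywhere;
# "helper_mrc_zarr" is a hypothetical module name for illustration.
pytest.register_assert_rewrite("helper_mrc_zarr")
```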
```python
zarrays = {}
for i in range(3):
    zarrays[i] = json.loads(fsstore[str(i) + "/.zarray"].decode())
return {"zattrs": loc.root_attrs, "zarrays": zarrays}
```
^ Gets the zarr header data, while ensuring all the data is properly retrieved.
```python
from common.fs import FileSystemApi

# ==================================================================================================
# Helper functions
# ==================================================================================================

# block sizes are experimentally tested to be the fastest
MRC_HEADER_BLOCK_SIZE = 500 * 2**10
```
This feels like a very large size for just the headers.
Sorry, will correct (the 500 KB block size is needed for bz2-compressed headers).
```python
def get_mrc_header(mrcfile: str, fs: FileSystemApi) -> MrcInterpreter:
    """Get the header for an mrc file."""
    try:
        with fs.open(mrcfile, "rb", block_size=MRC_HEADER_BLOCK_SIZE) as f:
```
We could use `fs.local_readable` here and in similar cases below, as that prevents us from refetching the same data if we request the same mrc header block outside this specific case.
I don't think there should be another way of getting the mrc header data? Functions that retrieve it are also "cached" with pytest fixtures, so we should never be pulling down one mrc file's header more than once. Additionally, `local_readable` seems to pull down the entire file even if a `block_size` is passed in, which is not good when the mrc files can be quite large and we only need a small part of them.
Not a blocker, but having a consistent file access pattern would be better. If we want to read just a block of data, `read_block` would be the better choice.
Noted, I'll add this as a future improvement in a separate PR (after all of these current ones 😭)
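For reference, a minimal sketch of the `read_block` pattern mentioned above, assuming `fs.s3fs` is an fsspec-compatible filesystem (as it is used elsewhere in this PR):

```python
# Fetch only the first MRC_HEADER_BLOCK_SIZE bytes of the file,
# instead of opening a file object over the whole (possibly large) mrc file.
header_bytes = fs.s3fs.read_block(mrcfile, offset=0, length=MRC_HEADER_BLOCK_SIZE)
```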
```python
def get_zarr_headers(zarrfile: str, fs: FileSystemApi) -> Dict[str, Dict]:
    """Get the zattrs and zarray data for a zarr volume file."""
    expected_children = {f"{child}" for child in [0, 1, 2, ".zattrs", ".zgroup"]}
```
I believe this is not being used outside a pytest fail message, where we could instead use the duplicate of this variable (`expected_fsstore_children`).
```python
fsstore_children = set(fsstore.listdir())
expected_fsstore_children = {"0", "1", "2", ".zattrs", ".zgroup"}
if expected_fsstore_children != fsstore_children:
    pytest.fail(f"Expected zarr children: {expected_children}, Actual zarr children: {fsstore_children}")
```
We can replace `expected_children` with `expected_fsstore_children` here?
```python
fsstore = zarr.storage.FSStore(url=zarrfile, mode="r", fs=fs.s3fs, dimension_separator="/")
fsstore_children = set(fsstore.listdir())
```
We should keep it consistent and either use the `zarr` library functions or `common.fs` throughout. We shouldn't add `fsstore` as yet another addition to the mix of libraries being used to interact with the s3 files.
Suggested change:
```diff
-fsstore = zarr.storage.FSStore(url=zarrfile, mode="r", fs=fs.s3fs, dimension_separator="/")
-fsstore_children = set(fsstore.listdir())
+file_paths = fs.glob(os.path.join(zarrfile, "*"))
+fsstore_children = {os.path.basename(file) for file in file_paths}
```
```python
loc = ZarrLocation(fsstore)
zarrays = {}
for binning_factor in [0, 1, 2]:  # 1x, 2x, 4x
    zarrays[binning_factor] = json.loads(fsstore[str(binning_factor) + "/.zarray"].decode())
```
Suggested change:
```diff
-loc = ZarrLocation(fsstore)
-zarrays = {}
-for binning_factor in [0, 1, 2]:  # 1x, 2x, 4x
-    zarrays[binning_factor] = json.loads(fsstore[str(binning_factor) + "/.zarray"].decode())
+zarrays = {
+    binning: json.load(fs.local_readable(os.path.join(zarrfile, binning, ".zarray")))
+    for binning in BINNING_SCALES
+}
```
```python
return mdocs

def tiltseries_mdoc(tiltseries_mdoc_files: List[str], filesystem: FileSystemApi) -> pd.DataFrame:
    """Load the tiltseries mdoc files and return a concatenated DataFrame."""
    tiltseries_dataframes = [mdocfile.read(filesystem.localreadable(mdoc_file)) for mdoc_file in tiltseries_mdoc_files]
```
Correct me if I am wrong, but didn't we want to retain the individual file data without merging?
I thought merging was fine? The read command just returns a pandas dataframe, and concatenating the dataframes doesn't result in any information loss; now we have all the data in one dataframe for easier validation checking.
I am not sure if there is a valid case for multiple mdoc files to exist.
cc: @uermel
But if that is not the case, having multiple entries for a z-value can make it inconsistent. We should handle this case at the source and fail if there are multiple mdoc files.
Note: after an in-person discussion, we are no longer merging.
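A minimal sketch of the fail-fast check discussed above, reusing the fixture names from this diff:

```python
# Fail fast when a run unexpectedly contains more than one mdoc file,
# since multiple entries for the same z-value would be inconsistent.
if len(tiltseries_mdoc_files) > 1:
    pytest.fail(f"Expected exactly one mdoc file, found {len(tiltseries_mdoc_files)}")
```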
```diff
@@ -69,16 +70,20 @@ def run_meta_file(run_dir: str, filesystem: FileSystemApi) -> str:


 @pytest.fixture(scope="session")
-def frames_dir(run_dir: str, tiltseries_metadata: Dict[str, Any], filesystem: FileSystemApi) -> str:
+def frames_dir(run_dir: str, filesystem: FileSystemApi) -> str:
     """[Dataset]/[ExperimentRun]/Frames"""
     dst = f"{run_dir}/Frames"
     if filesystem.s3fs.exists(dst):
```
Looks like we are using `filesystem.s3fs.exists` in multiple places, so we should support that as a method on `filesystem`.
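A minimal sketch of the suggested wrapper (hypothetical; the real `FileSystemApi` internals may differ):

```python
class FileSystemApi:
    # ... existing attributes, including self.s3fs ...

    def exists(self, path: str) -> bool:
        """Check whether a path exists, so callers don't reach into s3fs directly."""
        return self.s3fs.exists(path)
```

Callers would then write `filesystem.exists(dst)` instead of `filesystem.s3fs.exists(dst)`.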
```python
dst = f"{run_dir}/Frames"
if filesystem.s3fs.exists(dst):
    return dst
```
This only validates that a frames directory exists and doesn't provide any more information than `frames_dir`. Unfortunately, it has no relevance to the gain file in itself; the `gain_file` test below covers whether the file exists or not.
My understanding is that the gain_dir is the frames_dir (at least as of now?). So this fixture provides the gain_dir if it exists (otherwise it skips the test). The gain_dir is then used for the gain_file fixture.
```python
def check_zattrs_path(header_data, _zarr_filename):
    del _zarr_filename
    for binning_factor in [0, 1, 2]:  # 1x, 2x, 4x
```
The `0, 1, 2` is replaced by a constant in the tiltseries PR.
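Presumably something along these lines (the constant name below is hypothetical; the tiltseries PR defines the real one):

```python
# Zarr multiscale levels and the binnings they correspond to (1x, 2x, 4x).
BINNING_FACTORS = (0, 1, 2)

for binning_factor in BINNING_FACTORS:
    ...
```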
```python
def check_zattrs_voxel_spacings(header_data, _zarr_filename, voxel_spacing):
    del _zarr_filename
    for binning_factor in [0, 1, 2]:  # 1x, 2x, 4x
```
Binning factors are replaced by the constant here as well.
Force-pushed from 84da6a5 to 251214b:
- break up #207 pr
- move more code over
- fixes from PR review
- fixes
```python
for annotation_filename, points in annotations.items():
    print(f"\tFile: {annotation_filename}")
    for point in points:
        assert 0 <= point["location"]["x"] <= canonical_tomogram_metadata["size"]["x"] - 1
```
Could we create `max_tomo_x_limit` outside the loops and reuse it instead of `canonical_tomogram_metadata["size"]["x"] - 1`? Similarly for y and z. 😅
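That is, something like this sketch (using the `max_tomo_*_limit` names proposed above):

```python
# Compute the inclusive upper bounds once, outside the loops.
max_tomo_x_limit = canonical_tomogram_metadata["size"]["x"] - 1
max_tomo_y_limit = canonical_tomogram_metadata["size"]["y"] - 1
max_tomo_z_limit = canonical_tomogram_metadata["size"]["z"] - 1

for annotation_filename, points in annotations.items():
    for point in points:
        assert 0 <= point["location"]["x"] <= max_tomo_x_limit
        assert 0 <= point["location"]["y"] <= max_tomo_y_limit
        assert 0 <= point["location"]["z"] <= max_tomo_z_limit
```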
```python
for metadata in annotation_metadata.values():
    assert isinstance(metadata["annotation_object"], dict)
    assert isinstance(metadata["annotation_object"]["name"], str)
    assert isinstance(metadata["annotation_object"]["id"], str)
```
I believe this could be null.
As per an in-person discussion, we determined it can't be null 👍
```python
# By setting this scope to session, scope="session" fixtures will be reinitialized for each run + voxel_spacing combination
@pytest.mark.annotation
@pytest.mark.parametrize("run_name, voxel_spacing", pytest.run_spacing_combinations, scope="session")
class TestSegmentationMask(HelperTestMRCHeader):
```
Could you also add this comment in the code? 🤓
```python
from mrcfile.mrcinterpreter import MrcInterpreter


class HelperTestMRCHeader:
```
Could this be `HelperTestVolumeHeader`, so it could test both mrc and zarr files? We could have a `validate_zarr` property that is set/overridden by the inheriting classes, to handle cases such as gains where there are no zarr files.
I've created a separate class called HelperTestZarrHeader in later PRs to achieve the same functionality, if that's alright. I feel like it separates the concerns of the classes better.
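For illustration, a rough sketch of the property-based design suggested above (names hypothetical; the merged code uses a separate `HelperTestZarrHeader` class instead):

```python
import pytest


class HelperTestVolumeHeader:
    # Inheriting classes override this; e.g. gains have no zarr files.
    validate_zarr = True

    def test_zarr_header(self, zarr_headers):
        if not self.validate_zarr:
            pytest.skip("This volume type has no zarr files to validate")
        # ... shared zarr header assertions would go here ...


class TestGain(HelperTestVolumeHeader):
    validate_zarr = False  # gain files are mrc-only
```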
```python
pytest.fail(
    f"Metadata file not found for {len(remaining_annotation_files)} {name} annotation files.",
)

if count == 0:
```
Nit: instead of using a counter variable for this, we could check whether `corresponding_annotation_files` is empty.
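That is, a sketch of the suggested check, using the names from this diff:

```python
# Check the collection directly instead of maintaining a separate counter.
if not corresponding_annotation_files:
    pytest.fail(f"No corresponding annotation files found for {name}.")
```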
```python
errors = []

for gain_file in gain_files:
    if not any(gain_file.endswith(ext) for ext in PERMITTED_FRAME_EXTENSIONS):
```
Should this be `PERMITTED_GAIN_EXTENSIONS`?
```python
    assert first_mrc_gain.header.nx == first_mrc_frame.header.nx
    assert first_mrc_gain.header.ny == first_mrc_frame.header.ny
else:
    pytest.skip("No MRC files found to compare pixel spacing")
```
Nit: can we make this error "Couldn't find one or both of the following: frame and gain"?
"""Check that the gain pixel spacing & dimensions matches the frame pixel spacing. Just need to check first MRC file of each.""" | ||
|
||
first_mrc_gain = None | ||
for _, gain_header in gain_headers.items(): |
Suggested change:
```diff
-for _, gain_header in gain_headers.items():
+for gain_header in gain_headers.values():
```
```python
    break

first_mrc_frame = None
for _, frame_header in frames_headers.items():
```
Suggested change:
```diff
-for _, frame_header in frames_headers.items():
+for frame_header in frames_headers.values():
```
Creates s3 validation for annotations, frames, and gains. To be merged in after #228. See the annotation, frames, and gains sections of https://docs.google.com/document/d/1yMKM0DW9KRhlcYiBGPcR7oW0liGtUew6NAmBhMg5U3w/edit
I've left comments to describe some possibly confusing parts of the PR, but there are also some parts that I thought were relatively straightforward and left out.