Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Validate inputs #85

Draft
wants to merge 7 commits into
base: main
Choose a base branch
from
Draft

Validate inputs #85

wants to merge 7 commits into from

Conversation

JMGilbert
Copy link
Contributor

So far the input validation checks:

  1. That there are two RCPs and that they have labels rcp45 and rcp85 (assertion)
  2. That there are 15 batches with labels batch0 - batch14 (assertion)
  3. That the chunksize for batch is 15 (assertion)
  4. That all of the chunksizes are exactly as we want them (warning)

In general, I have tried to design the validate_damages function to be pretty flexible so it should be easy to change what checks we run on the input damages.

@JMGilbert JMGilbert changed the base branch from main to dscim-v0.4.0_fixes June 15, 2023 16:43
@codecov
Copy link

codecov bot commented Jun 15, 2023

Codecov Report

Merging #85 (172a518) into dscim-v0.4.0_fixes (970c623) will increase coverage by 0.56%.
The diff coverage is 100.00%.

@@                  Coverage Diff                   @@
##           dscim-v0.4.0_fixes      #85      +/-   ##
======================================================
+ Coverage               68.15%   68.72%   +0.56%     
======================================================
  Files                      17       17              
  Lines                    1878     1912      +34     
======================================================
+ Hits                     1280     1314      +34     
  Misses                    598      598              
Impacted Files Coverage Δ
src/dscim/preprocessing/input_damages.py 89.57% <100.00%> (+1.21%) ⬆️
src/dscim/preprocessing/preprocessing.py 72.18% <100.00%> (+0.37%) ⬆️

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@JMGilbert
Copy link
Contributor Author

This validation script requires very strict input damages, but we may want to generalize in the future. The package itself requires very particular dimensions for each object, so we may want the config to include the dimensions which would allow more flexible input files.

Base automatically changed from dscim-v0.4.0_fixes to dscim-v0.4.0 July 6, 2023 17:53
@kemccusker kemccusker requested review from brews and kemccusker July 6, 2023 18:04
Copy link
Member

@brews brews left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey y'all. I gave this a quick look and I have some comments here and in-line.

  • Having validation throw assertions is very common; unfortunately, it's possible to turn off assertions entirely. It also makes it hard for users to catch validation errors, specifically if assertions are getting thrown from different bodies of code in this pkg. Consider throwing a custom exception or something else instead. I'd consider what happens or what failures you'd see if this ran without throwing assertions and how that changes behavior. This might be something to consider in another PR later.

  • Checking for chunking might be too rigid. Is there a really good reason for this? Some of these checks feel pretty rigid, generally. Just double-check that you actually need to validate this stuff. It's pretty common to dance around with how tight or loose checks are. I'd say you want to ensure the data has what you need, but you also want to avoid blocking people from doing good work in the future because you were overzealous with checks.

## Unreleased
### Added
- Function to validate input damages. ([PR #85](https://github.com/ClimateImpactLab/dscim/pull/83), [@JMGilbert](https://github.com/JMGilbert))

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why add this in a new "unreleased" section here and not in the one below?

@@ -863,3 +874,50 @@ def coastal_inputs(
consolidated=True,
mode="w",
)


def validate_damages(sector, path):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Docstr? What are the inputs params and what is being checked? Expected behavior?

Be clear about what you're actually checking for. Don't be shy with inline comments, too because people will usually be back to read this kind of checking code in the future. (You did a pretty good job with comments here so 👍 )



def validate_damages(sector, path):
inputs = xr.open_zarr(path)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Curious why have IO here instead of reading in a zarr store? Would this make it so you don't need to monkey patch it for tests?

chunks_expected = dict(zip(dims, chunks))

assert dims_expected == dict(inputs.dims)
for i in list(inputs.keys()):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be able to just have this as inputs.keys() without casting it to a list?

monkeypatch.setattr(
"dscim.preprocessing.input_damages.validate_damages", lambda *args: True
)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(also applies to all monkeypatching below)

Why does this need to be monkey patched out?

Might be better if the validation function takes a dataset or dataarray as input instead of doing its own IO?

Patching like this in tests might be a symptom of a design problem.



@pytest.mark.parametrize("sector", ["mortality", "coastal"])
def test_validate_damages_correct(tmp_path, sector):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need docstr. What behavior is this testing for?

validate_damages(sector, path) # No assertion error should be raised


def test_validate_damages_incorrect_batches(tmp_path):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need docstr. What's this testing. I can kinda get a feel for it from the title but might want to say more specifically what you're looking for.

These docstrs also can print when tests fail so its good to have even if they seem obvious.

)


def test_validate_damages_incorrect_rcps(tmp_path):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need docstr. Harder to grok what this is testing for exactly.



@pytest.mark.parametrize("sector", ["mortality", "coastal"])
def test_validate_damages_incorrect_chunk_sizes(tmp_path, sector):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Needs docstr.



@pytest.mark.parametrize("sector", ["mortality", "coastal"])
def test_validate_damages_incorrect_region_chunk_sizes(tmp_path, sector):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Neeeeds doooocstrr.

@brews
Copy link
Member

brews commented Jul 6, 2023

One more thing! Consider having an option to turn off hard, run-blocking validation in functions where you've hardcoded it. This is another thing that could help with testing or if the validation starts getting in the way for silly reasons.

Base automatically changed from dscim-v0.4.0 to main July 6, 2023 19:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants