Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support regular parquet in test statistics #330

Merged
merged 21 commits into from
Mar 13, 2024
Merged

Conversation

zigaLuksic
Copy link
Collaborator

WARNING: if you review this before Monday I will draw an unflattering doodle of you. Do not play games with me!

Not sure if we should have a special case where we still try to load the parquet as a geoparquet. Seems kinda hard to do in general.

@zigaLuksic zigaLuksic requested a review from mlubej February 21, 2024 13:40
@zigaLuksic
Copy link
Collaborator Author

i tested this in FB and the geometry column is problematic, not jsonifiable...

eogrow/utils/testing.py Outdated Show resolved Hide resolved
eogrow/utils/testing.py Outdated Show resolved Hide resolved
@zigaLuksic
Copy link
Collaborator Author

@mlubej any ideas how to solve the issue of WKB not being json serializable?

@zigaLuksic
Copy link
Collaborator Author

@mlubej one option i thought of was that the tester supplies the CRS. Wont work for parquets where each geometry potentially has a different CRS though...

@zigaLuksic
Copy link
Collaborator Author

tested it out a bit, the wkb->wkt also works fine. Kept the user-supplied CRS so that we can have more statistics in those cases

eogrow/utils/testing.py Outdated Show resolved Hide resolved
eogrow/utils/testing.py Outdated Show resolved Hide resolved
@mlubej
Copy link
Contributor

mlubej commented Mar 12, 2024

updates (@zigaLuksic):

  • added pyogrio as default gpd reading engine
  • set the num_random_values to 0 for parquet stats, code block is still there and redundant
  • saved also dtypes (as strings) along with columns for parquet and vector files

eogrow/utils/testing.py Outdated Show resolved Hide resolved
eogrow/utils/testing.py Outdated Show resolved Hide resolved
@mlubej
Copy link
Contributor

mlubej commented Mar 13, 2024

Added also an agg_stats for dataframe-related stats and included only mean/std/min/max.

The tests for tests/pipelines/test_import_vector.py need to be updated, but we don't have the mechanism in place to update the stats via CI artifacts.

@zigaLuksic do you think it makes sense to add that also for eo-grow? I can prepare that in a separate MR if you agree, otherwise I can just manually update from a linux machine

@zigaLuksic zigaLuksic merged commit e072650 into develop Mar 13, 2024
7 checks passed
@zigaLuksic zigaLuksic deleted the support-regular-parquet branch March 13, 2024 11:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants