add pandas_diff for fast-carpenter output #4

kreczko · 2019-05-21T15:30:15Z

@bkrikler Could you please send me some example output files?

kreczko · 2019-05-21T15:30:26Z

On 2019-03-04 Lukasz Kreczko (kreczko) wrote:

changed the description

kreczko · 2019-05-21T15:30:27Z

On 2019-03-04 Benjamin Krikler (bkrikler) wrote:

Thanks for putting this on an issue. The best bet for example outputs is the pipeline of the fast_cms_public_tutorial repository, e.g. https://gitlab.cern.ch/fast-hep/public/fast_cms_public_tutorial/-/jobs/3492813/artifacts/browse/pipeline/carpenter/ (I've "kept" the job artifacts for that specific pipeline now).

kreczko · 2019-05-21T15:30:28Z

On 2019-03-04 Benjamin Krikler (bkrikler) wrote:

Also, I had a primitive set of pandas_diff-like tests running in the old FAST-RA1 project: https://gitlab.cern.ch/fast-cms/FAST-RA1/blob/master/tests/integrations/run_tests.py#L83-131. The tests there only checked for exact equality between two reloaded dataframes, but they might provide a starting point for this. That said, the rest of the code is pretty simple, so maybe it's not really adding anything for you...
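For illustration, an exact-equality check between two reloaded dataframes could look roughly like the sketch below. This is not the FAST-RA1 code; the helper name `frames_identical` and the column names (`dataset`, `nMuon`, `n`) are assumptions made up for the example.

```python
import io

import pandas as pd
from pandas.testing import assert_frame_equal


def frames_identical(csv_a, csv_b, index_cols):
    """Reload two CSV outputs and report whether they match exactly.

    `index_cols` names the binning columns; this is an assumption about
    the file layout, not something read from the files themselves.
    """
    df_a = pd.read_csv(csv_a, index_col=index_cols)
    df_b = pd.read_csv(csv_b, index_col=index_cols)
    try:
        # check_exact=True disables the default floating-point tolerance
        assert_frame_equal(df_a, df_b, check_exact=True)
    except AssertionError:
        return False
    return True


# Hypothetical in-memory stand-ins for two output files
reference = "dataset,nMuon,n\ndata,0,10\ndata,1,4\n"
changed = "dataset,nMuon,n\ndata,0,10\ndata,1,5\n"

bins = ["dataset", "nMuon"]
same = frames_identical(io.StringIO(reference), io.StringIO(reference), bins)
different = frames_identical(io.StringIO(reference), io.StringIO(changed), bins)
```

Such a check only says whether anything differs, not where or by how much, which is why it is at most a starting point.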

kreczko · 2019-05-21T15:30:30Z

On 2019-03-04 Lukasz Kreczko (kreczko) wrote:

Thanks for the examples, this will be useful.

I am trying to get the diff into a similar shape to the ROOT version:

- calculate KS & p-value for all 1D projections
- display differing projections

For the current CSV files that's essentially identifying the category, variables & statistical data.
I will have to think about how to do this in a general way (like for ROOT) without being too verbose with the settings (e.g. `pandas_diff -c dataset, --var nMuon, nIsoMuons, -n n`).

Maybe worth looking at fast-plotter for this?
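A rough sketch of the "KS & p-value for all 1D projections" idea on binned dataframes is given below. It is only an illustration under stated assumptions: the function names (`kolmogorov_prob`, `binned_ks`, `diff_projections`), the column names and the example data are hypothetical, the counts are treated as unweighted event numbers, and the p-value comes from the asymptotic Kolmogorov distribution applied to binned data, which is an approximation. It is not claimed to match what the ROOT version does.

```python
import io

import numpy as np
import pandas as pd


def kolmogorov_prob(z, terms=100):
    """Asymptotic Kolmogorov survival function Q(z) = 2 * sum_k (-1)^(k-1) exp(-2 k^2 z^2)."""
    if z < 0.2:
        # The series converges poorly for small z, where Q(z) is ~1 anyway
        return 1.0
    k = np.arange(1, terms + 1)
    q = 2.0 * np.sum((-1.0) ** (k - 1) * np.exp(-2.0 * k**2 * z**2))
    return float(min(max(q, 0.0), 1.0))


def binned_ks(counts_a, counts_b):
    """KS distance and approximate p-value between two binned 1D distributions."""
    # Give both series the same bins; bins missing on one side count as zero
    a, b = counts_a.align(counts_b, fill_value=0.0)
    a, b = a.sort_index(), b.sort_index()
    n_a, n_b = float(a.sum()), float(b.sum())
    if n_a <= 0 or n_b <= 0:
        return float("nan"), float("nan")
    # Largest gap between the two normalised cumulative distributions
    distance = float((a.cumsum() / n_a - b.cumsum() / n_b).abs().max())
    z = distance * np.sqrt(n_a * n_b / (n_a + n_b))
    return distance, kolmogorov_prob(z)


def diff_projections(df_a, df_b, value_col="n"):
    """Project onto each index level in turn and compare the 1D results."""
    results = {}
    for level in df_a.index.names:
        proj_a = df_a.groupby(level=level)[value_col].sum()
        proj_b = df_b.groupby(level=level)[value_col].sum()
        results[level] = binned_ks(proj_a, proj_b)
    return results


# Hypothetical example data: same total yield, different nMuon shape
bins = ["dataset", "nMuon"]
reference = pd.read_csv(
    io.StringIO("dataset,nMuon,n\ndata,0,100\ndata,1,50\ndata,2,10\n"),
    index_col=bins,
)
shifted = pd.read_csv(
    io.StringIO("dataset,nMuon,n\ndata,0,10\ndata,1,50\ndata,2,100\n"),
    index_col=bins,
)

results = diff_projections(reference, shifted)
for level, (distance, p_value) in results.items():
    print(f"{level}: KS distance = {distance:.4f}, p-value = {p_value:.3g}")
```

Note that a KS distance is only meaningful on an ordered axis such as `nMuon`; for an unordered category like `dataset` a per-bin comparison is probably the better choice.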

kreczko · 2019-05-21T15:30:33Z

On 2019-03-04 Benjamin Krikler (bkrikler) wrote:

Yes, fast-plotter could be quite helpful for this. It depends a bit on how generic or specific you want to be, however, i.e. is this a pandas-diff function, or a "fast binned dataframe"-diff? If it's the former, I think it could be tricky to do this in some meaningful but general way, at least if the pandas dataframes are stored as CSV files (with binary files you'd lose less info, like which columns are actually in the index). If you're comfortable being more specific to fast-carpenter's outputs then fast-plotter could be quite helpful, since it wraps reloading the CSV files and gives utilities to project and sum, plus potentially plot the resulting differences.
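To make the point about CSV files losing index information concrete, here is a small pandas-only round trip. The dataframe contents and column names are invented for the example, and fast-plotter's own reloading utilities are deliberately not shown.

```python
import io

import pandas as pd

# Hypothetical binned dataframe with a two-level index
binned = pd.DataFrame(
    {"dataset": ["data", "data"], "nMuon": [0, 1], "n": [10, 4]}
).set_index(["dataset", "nMuon"])

buffer = io.StringIO()
binned.to_csv(buffer)

# Reloading without extra knowledge: the binning columns come back as data columns
buffer.seek(0)
naive = pd.read_csv(buffer)

# Reloading with the index columns named explicitly restores the structure
buffer.seek(0)
restored = pd.read_csv(buffer, index_col=["dataset", "nMuon"])

print(list(naive.columns))           # binning and content columns are mixed together
print(list(restored.index.names))    # binning columns are back in the index
```

A generic pandas diff would need the caller to supply that `index_col` list, whereas a diff specific to fast-carpenter's outputs can rely on the known layout.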

kreczko · 2019-05-21T15:30:34Z

On 2019-03-04 Lukasz Kreczko (kreczko) wrote:

Yes, I am thinking more fast_binned_df_diff

kreczko added this to the Version 0.3.0 milestone May 21, 2019

kreczko added the "originally gitlab" label May 21, 2019
