Save result of pseudo-experiments/toys #18

Closed · marinang opened this issue Nov 25, 2019 · 13 comments
Labels: enhancement (New feature or request), help wanted (Extra attention is needed), hypotests

marinang (Member) commented Nov 25, 2019

In #14 the frequentist calculator is introduced, which uses toys to build test statistic distributions, for instance to compute p-values.

Since the generation of toys + fitting + scanning is quite CPU intensive and takes time, an improvement would be to store the results of the pseudo-experiments, for instance in an HDF5 file.
This is currently an issue for testing, which has time limits. For each pseudo-experiment, what needs to be stored is:

  • the values of the POI at generation
  • the best-fit values of the POI
  • the likelihood evaluated at the best-fit POI
  • the likelihood evaluated at the POI points you want to scan (for upper limits or confidence intervals)

The stored results can be reused in hepstats.hypotests with the frequentist calculator without regenerating the pseudo-experiments. This would also give users the possibility to generate pseudo-experiments outside of hepstats.

A first design I have in mind is to create a class, called ToyManager for instance, which collects and saves the results of the pseudo-experiments and can reopen them.
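
For concreteness, here is a minimal sketch of what such a class could look like, storing the quantities listed above with h5py; all names (ToyResult, ToyManager, the HDF5 group layout) are placeholders for discussion, not a proposed API:

# Sketch only: class names, attributes and the HDF5 layout are illustrative.
import h5py
import numpy as np


class ToyResult:
    """Quantities stored for the toys generated at one POI value."""

    def __init__(self, poi_gen, bestfit_poi, nll_bestfit, poi_scan, nll_scan):
        self.poi_gen = poi_gen                      # POI value used at generation
        self.bestfit_poi = np.asarray(bestfit_poi)  # best-fit POI, one entry per toy
        self.nll_bestfit = np.asarray(nll_bestfit)  # likelihood value at the best-fit POI, per toy
        self.poi_scan = np.asarray(poi_scan)        # POI values scanned
        self.nll_scan = np.asarray(nll_scan)        # likelihood value at each scanned POI, per toy


class ToyManager:
    """Collects toy results and saves/reloads them from an HDF5 file."""

    def __init__(self, results=None):
        self.results = list(results) if results is not None else []

    def add(self, result):
        self.results.append(result)

    def to_hdf5(self, filename):
        with h5py.File(filename, "w") as f:
            for i, r in enumerate(self.results):
                grp = f.create_group(f"toys_{i}")
                grp.attrs["poi_gen"] = r.poi_gen
                for name in ("bestfit_poi", "nll_bestfit", "poi_scan", "nll_scan"):
                    grp.create_dataset(name, data=getattr(r, name))

    @classmethod
    def from_hdf5(cls, filename):
        manager = cls()
        with h5py.File(filename, "r") as f:
            for key in f:
                grp = f[key]
                arrays = {name: grp[name][...] for name in
                          ("bestfit_poi", "nll_bestfit", "poi_scan", "nll_scan")}
                manager.add(ToyResult(poi_gen=grp.attrs["poi_gen"], **arrays))
        return manager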

What do you think, @eduardo-rodrigues @mayou36 @HDembinski?

marinang added the enhancement, help wanted and hypotests labels on Nov 25, 2019
eduardo-rodrigues (Member) commented:

Hi @marinang, Brian had the same use case when working on the visual module of the scikit-hep package, and he used fixtures to save material; see https://github.com/scikit-hep/scikit-hep/blob/master/tests/conftest.py, https://github.com/scikit-hep/scikit-hep/blob/master/tests/visual/test_hists.py, and related material. Hope that helps?

eduardo-rodrigues (Member) commented:

Of course a toy manager would work, as you say. You could have your class as a fixture with 'session' scope ...
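
For illustration, a session-scoped fixture along those lines could look like the following; the file path and the file layout (a flat file with one dataset per stored quantity) are assumptions, not existing hepstats code:

# Hypothetical pytest fixture: read pre-generated toy results once per test
# session; the file path and the layout of its contents are illustrative.
import h5py
import numpy as np
import pytest


@pytest.fixture(scope="session")
def stored_toys():
    with h5py.File("tests/data/toys_upper_limit.h5", "r") as f:
        return {name: np.asarray(f[name]) for name in f}


def test_toys_are_available(stored_toys):
    # A real test would feed these toys into the frequentist calculator;
    # here we only check that something was read from the file.
    assert len(stored_toys) > 0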

marinang (author) commented Nov 28, 2019

@eduardo-rodrigues the generation of toys (plus fit / plus scan) easily takes hours, even for the simple example of a Gaussian signal over an exponential background, so the toy generation certainly cannot be run in CI pipelines; the toys have to be read from a file for the tests.

The idea would be to let users generate toys outside of hepstats, using a batch system for instance, and to collect the results in a form that can be read by hepstats.FrequentistCalculator.

eduardo-rodrigues (Member) commented:

OK, then you have to store the data somewhere. Fair enough. You can still use a little fixture to make the data available in the tests, though …

I realise that scikit-hep-testdata still needs some work to function properly, but it might be a use case for you? Basically the package is meant to store big-ish data files for tests. @benkrikler wanted to get back to the package, and your use case seems like a good reason to … ;-)

marinang (author) commented Dec 2, 2019

I am not sure whether it is useful or not to put the results of pseudo-experiments for a particular statistical model in scikit-hep-testdata; who else would use it? I checked what I was doing with lauztat: I was using HDF5 files, and for an upper limit calculation the file size is a few MB.

eduardo-rodrigues (Member) commented:

That's a good point. Fair enough.

jonas-eschle (Contributor) commented:

I expect this file to be in fact quite small, with about 100-1000 numerical entries, do you agree? This would also allow the use of e.g. YAML.
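
As an illustration of that option, a small toy-result record dumped with PyYAML might look like this; the key names and values are made up, not a hepstats format:

# Illustrative only: keys and values are placeholders for discussion.
import yaml

toys = {
    "poi_gen": 0.0,
    "poi_scan": [0.5, 1.0],
    "bestfit_poi": [0.12, -0.05, 0.03],       # one entry per toy
    "nll_bestfit": [1203.4, 1198.7, 1201.1],  # likelihood value at the best-fit POI, per toy
    "nll_scan": [[1205.0, 1199.2, 1203.8],    # likelihood values at poi_scan[0], per toy
                 [1210.3, 1204.9, 1208.2]],   # likelihood values at poi_scan[1], per toy
}
print(yaml.safe_dump(toys))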

And it seems related, just at a lower level, to #19. Does it make sense to mix the two?

marinang (author) commented Dec 9, 2019

@mayou36 from experience, 1000 toys is the bare minimum to get a sensible result. For each toy you need to store the several numbers stated in the first message (more if you do a 2D confidence interval).

For a 3 sigma evidence, 1000 is the order of the number of toys you want; now imagine what is needed for a 5 sigma discovery.

For upper limits, ~1000 toys is fine, but you have to multiply this by the number of POI values you scan (usually >= 10).

So I don't think it is that small. In that regard, I am a bit hesitant to store this in the same YAML file where the stat results are stored.
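
A back-of-the-envelope estimate, assuming 8-byte floats, the numbers above (1000 toys per scanned POI value, ~10 scanned values), and that each toy stores the likelihood at every scanned POI value:

# Rough size estimate for upper-limit toys; all numbers are assumptions.
ntoys = 1000          # toys generated per scanned POI value
nscan = 10            # POI values scanned for the upper limit
per_toy = 3 + nscan   # POI at generation, best-fit POI, likelihood at best fit, likelihood at each scanned POI
size_mb = ntoys * nscan * per_toy * 8 / 1e6
print(f"~{size_mb:.1f} MB")  # ~1 MB of raw numbers, i.e. MB-scale rather than a short list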

jonas-eschle (Contributor) commented Dec 9, 2019

Yes, true, that seems like too much, at least in a human-readable format. YAML also offers the storage of arrays, and there are formats that combine YAML with HDF5, such as ASDF, which I often look at also with regard to likelihood storage.

But pure HDF5 also seems fine, I agree.
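
For reference, a minimal example of the ASDF format mentioned above, using the asdf package; the tree layout and file name are illustrative, and the random arrays only stand in for toy results:

# Small ASDF example; layout and file name are assumptions, not a convention.
import numpy as np
import asdf

tree = {
    "poi_gen": 0.0,
    "bestfit_poi": np.random.normal(size=1000),  # placeholder array standing in for toy results
    "nll_bestfit": np.random.normal(size=1000),
}
asdf.AsdfFile(tree).write_to("toys.asdf")

with asdf.open("toys.asdf") as af:
    print(af.tree["poi_gen"], af.tree["bestfit_poi"].shape)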

marinang (author) commented Dec 9, 2019

There is still one issue: when the toys are stored, there is no information stored about which model was used to generate them. So the user would have to be careful to match the toys with the corresponding loss in the FrequentistCalculator.
This gives me an idea: given that @mayou36 you are working on likelihood storage, would it make sense to allow some extra items to be written without interfering with the core content of the file?
This would allow matching the likelihood and the toys in one single file; then I could add a classmethod FrequentistCalculator.from_yaml(yaml_file, fitting_backend).

jonas-eschle (Contributor) commented:

That would be great, of course! It is, though, only a thin wrapper around
model = fitting_backend.from_yaml(...), right? So we would need to get that up first.

For the time being, we could leave the coordination to the user and let them name things properly?

marinang (author) commented Dec 9, 2019

Or better, loss = fitting_backend.from_yaml(...):

@classmethod
def from_yaml(cls, yaml_file, fitting_backend):
    # Rebuild the loss from the likelihood description stored in the file ...
    loss = fitting_backend.from_yaml(yaml_file)
    # ... and read the toy results stored alongside it (placeholder helper).
    toys = function_extracting_toys_from_yaml(yaml_file)
    return cls(loss, toys)

For the time being, we could leave the coordination to the user and let them name things properly?

Yes, for the moment that is all we can do.

eduardo-rodrigues (Member) commented:

It seems that, at least for a first implementation, the proposal to use HDF5 is viable. I would start an implementation with that.

The point you made in connection with likelihood storage and fitting tools in general is not to be underestimated; see my comment in #19. 👍 on your proposal.
