Save result of pseudo-experiments/toys #18

Closed · marinang opened this issue Nov 25, 2019 · 13 comments
Labels: enhancement (New feature or request), help wanted (Extra attention is needed), hypotests

marinang (Member) commented Nov 25, 2019

In #14 the frequentist calculator is introduced, which uses toys to build test statistic distributions, for instance to compute p-values.

Since the generation of toys + fitting + scanning is quite CPU intensive and takes time, an improvement would be to store the results of the pseudo-experiments, for instance in an HDF5 file.
This is currently an issue for testing, which has time limits. For each pseudo-experiment, what needs to be stored is:

  • the values of the POI at generation
  • the best-fit values of the POI
  • the likelihood evaluated at the best-fit POI
  • the likelihood evaluated at the POI points you want to scan (for upper limits or confidence intervals)

The stored results can be reused in hepstats.hypotests with the frequentist calculator without regenerating the pseudo-experiments. This would also give users the possibility to generate pseudo-experiments outside of hepstats.

A first design I have in mind is to create a class, called ToyManager for instance, which collects and saves the results of the pseudo-experiments and can reopen them.
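
For concreteness, here is a minimal sketch of what such a class could look like, storing the quantities listed above with h5py; all names (ToyResult, ToyManager, the HDF5 group layout) are placeholders for discussion, not a proposed API:

# Sketch only: class names, attributes and the HDF5 layout are illustrative.
import h5py
import numpy as np


class ToyResult:
    """Quantities stored for the toys generated at one POI value."""

    def __init__(self, poi_gen, bestfit_poi, nll_bestfit, poi_scan, nll_scan):
        self.poi_gen = poi_gen                      # POI value used at generation
        self.bestfit_poi = np.asarray(bestfit_poi)  # best-fit POI, one entry per toy
        self.nll_bestfit = np.asarray(nll_bestfit)  # likelihood value at the best-fit POI, per toy
        self.poi_scan = np.asarray(poi_scan)        # POI values scanned
        self.nll_scan = np.asarray(nll_scan)        # likelihood value at each scanned POI, per toy


class ToyManager:
    """Collects toy results and saves/reloads them from an HDF5 file."""

    def __init__(self, results=None):
        self.results = list(results) if results is not None else []

    def add(self, result):
        self.results.append(result)

    def to_hdf5(self, filename):
        with h5py.File(filename, "w") as f:
            for i, r in enumerate(self.results):
                grp = f.create_group(f"toys_{i}")
                grp.attrs["poi_gen"] = r.poi_gen
                for name in ("bestfit_poi", "nll_bestfit", "poi_scan", "nll_scan"):
                    grp.create_dataset(name, data=getattr(r, name))

    @classmethod
    def from_hdf5(cls, filename):
        manager = cls()
        with h5py.File(filename, "r") as f:
            for key in f:
                grp = f[key]
                arrays = {name: grp[name][...] for name in
                          ("bestfit_poi", "nll_bestfit", "poi_scan", "nll_scan")}
                manager.add(ToyResult(poi_gen=grp.attrs["poi_gen"], **arrays))
        return manager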

What do you think, @eduardo-rodrigues @mayou36 @HDembinski?

marinang added the enhancement, help wanted and hypotests labels on Nov 25, 2019
eduardo-rodrigues (Member) commented:

Hi @marinang, Brian had the same use case when working on the visual module of the scikit-hep package, and he used fixtures to save material; see https://github.com/scikit-hep/scikit-hep/blob/master/tests/conftest.py, https://github.com/scikit-hep/scikit-hep/blob/master/tests/visual/test_hists.py, and related material. Hope that helps?

eduardo-rodrigues (Member) commented:

Of course a toy manager would work, as you say. You could have your class as a fixture with 'session' scope ...
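
For illustration, a session-scoped fixture along those lines could look like the following; the file path and the file layout (a flat file with one dataset per stored quantity) are assumptions, not existing hepstats code:

# Hypothetical pytest fixture: read pre-generated toy results once per test
# session; the file path and the layout of its contents are illustrative.
import h5py
import numpy as np
import pytest


@pytest.fixture(scope="session")
def stored_toys():
    with h5py.File("tests/data/toys_upper_limit.h5", "r") as f:
        return {name: np.asarray(f[name]) for name in f}


def test_toys_are_available(stored_toys):
    # A real test would feed these toys into the frequentist calculator;
    # here we only check that something was read from the file.
    assert len(stored_toys) > 0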

marinang (author) commented Nov 28, 2019

@eduardo-rodrigues the generation of toys (plus fit / plus scan) easily takes hours, even for the simple example of a Gaussian signal over an exponential background, so the toy generation certainly cannot be run in CI pipelines; the toys have to be read from a file for the tests.

The idea would be to let users generate toys outside of hepstats, using a batch system for instance, and to collect the results in a form that can be read by hepstats.FrequentistCalculator.

eduardo-rodrigues (Member) commented:

OK, then you have to store the data somewhere. Fair enough. You can still use a little fixture to make the data available in the tests, though …

I realise that scikit-hep-testdata still needs some work to function properly, but it might be a use case for you? Basically the package is meant to store big-ish data files for tests. @benkrikler wanted to get back to the package, and your use case seems like a good reason to … ;-)

marinang (author) commented Dec 2, 2019

I am not sure whether it is useful or not to put the results of pseudo-experiments for a particular statistical model in scikit-hep-testdata; who else would use it? I checked what I was doing with lauztat: I was using HDF5 files, and for an upper limit calculation the file size is a few MB.

eduardo-rodrigues (Member) commented:

That's a good point. Fair enough.

jonas-eschle (Contributor) commented:

I expect this file to be in fact quite small, with about 100-1000 numerical entries, do you agree? This would also allow the use of e.g. YAML.
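
As an illustration of that option, a small toy-result record dumped with PyYAML might look like this; the key names and values are made up, not a hepstats format:

# Illustrative only: keys and values are placeholders for discussion.
import yaml

toys = {
    "poi_gen": 0.0,
    "poi_scan": [0.5, 1.0],
    "bestfit_poi": [0.12, -0.05, 0.03],       # one entry per toy
    "nll_bestfit": [1203.4, 1198.7, 1201.1],  # likelihood value at the best-fit POI, per toy
    "nll_scan": [[1205.0, 1199.2, 1203.8],    # likelihood values at poi_scan[0], per toy
                 [1210.3, 1204.9, 1208.2]],   # likelihood values at poi_scan[1], per toy
}
print(yaml.safe_dump(toys))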

And it seems related, just at a lower level, to #19. Does it make sense to mix the two?

marinang (author) commented Dec 9, 2019

@mayou36 from experience, 1000 toys is the bare minimum to get a sensible result. For each toy you need to store the several numbers stated in the first message (more if you do a 2D confidence interval).

For a 3 sigma evidence, 1000 is the order of the number of toys you want; now imagine what is needed for a 5 sigma discovery.

For upper limits, ~1000 toys is fine, but you have to multiply this by the number of POI values you scan (usually >= 10).

So I don't think it is that small. In that regard, I am a bit hesitant to store this in the same YAML file where the stat results are stored.
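
A back-of-the-envelope estimate, assuming 8-byte floats, the numbers above (1000 toys per scanned POI value, ~10 scanned values), and that each toy stores the likelihood at every scanned POI value:

# Rough size estimate for upper-limit toys; all numbers are assumptions.
ntoys = 1000          # toys generated per scanned POI value
nscan = 10            # POI values scanned for the upper limit
per_toy = 3 + nscan   # POI at generation, best-fit POI, likelihood at best fit, likelihood at each scanned POI
size_mb = ntoys * nscan * per_toy * 8 / 1e6
print(f"~{size_mb:.1f} MB")  # ~1 MB of raw numbers, i.e. MB-scale rather than a short list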

jonas-eschle (Contributor) commented Dec 9, 2019

Yes, true, that seems like too much, at least in a human-readable format. YAML also offers the storage of arrays, and there are formats that combine YAML with HDF5, such as ASDF, which I often look at also with regard to likelihood storage.

But pure HDF5 also seems fine, I agree.
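
For reference, a minimal example of the ASDF format mentioned above, using the asdf package; the tree layout and file name are illustrative, and the random arrays only stand in for toy results:

# Small ASDF example; layout and file name are assumptions, not a convention.
import numpy as np
import asdf

tree = {
    "poi_gen": 0.0,
    "bestfit_poi": np.random.normal(size=1000),  # placeholder array standing in for toy results
    "nll_bestfit": np.random.normal(size=1000),
}
asdf.AsdfFile(tree).write_to("toys.asdf")

with asdf.open("toys.asdf") as af:
    print(af.tree["poi_gen"], af.tree["bestfit_poi"].shape)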

marinang (author) commented Dec 9, 2019

There is still one issue: when the toys are stored, there is no information stored about which model was used to generate them. So the user would have to be careful to match the toys with the corresponding loss in the FrequentistCalculator.
This gives me an idea: given that @mayou36 you are working on likelihood storage, would it make sense to allow some extra items to be written without interfering with the core content of the file?
This would allow matching the likelihood and the toys in one single file; then I could add a classmethod FrequentistCalculator.from_yaml(yaml_file, fitting_backend).

jonas-eschle (Contributor) commented:

That would be great, of course! It is, though, only a thin wrapper around
model = fitting_backend.from_yaml(...), right? So we would need to get that up first.

For the time being, we could leave the coordination to the user and let them name things properly?

marinang (author) commented Dec 9, 2019

Or better, loss = fitting_backend.from_yaml(...):

@classmethod
def from_yaml(cls, yaml_file, fitting_backend):
    # Rebuild the loss from the likelihood description stored in the file ...
    loss = fitting_backend.from_yaml(yaml_file)
    # ... and read the toy results stored alongside it (placeholder helper).
    toys = function_extracting_toys_from_yaml(yaml_file)
    return cls(loss, toys)

For the time being, we could leave the coordination to the user and let them name things properly?

Yes, for the moment that is all we can do.

eduardo-rodrigues (Member) commented:

It seems that, at least for a first implementation, the proposal to use HDF5 is viable. I would start an implementation with that.

The point you made in connection with likelihood storage and fitting tools in general is not to be underestimated; see my comment in #19. 👍 on your proposal.
