Change energy dispersion format #67
Comments
I don't have a strong opinion either way. Yes, using Note that EDISP can quickly dominate the DL3 data file size that's distributed to end users, so this question of how to store it well is relevant. The question of whether we use any compression to squeeze out zero bytes is related, comments on #46 would be welcome. For now my recommendation would be to stick with what we have and just extend the range in the HESS HAP exporters to avoid the clipping. It's very hard to find people to work on exporters or contribute to the format spec at all. Once CTA starts to produce EDISP (cc @TarekHC), and someone does a small study how to do it well and a concrete proposal how CTA EDISP DL3 should be distributed to users, I think we can add or change this. Of course, if people want to change now, that's fine with me as well. My point is that what we have now works for current IACTs and it's not clear to me that the proposed change really is an improvement. |
As for the file size, somebody should look into using FITS tiled image compression on it, which should reduce the size by a large factor (though if it is implemented as a TABLE and not an IMAGE extension, that may not work). That could be a future study though. Even using gzip/bzip2 on a sparsely populated table should give reasonable results, since most elements in the off-diagonal parts of the matrix will be 0. https://heasarc.gsfc.nasa.gov/docs/software/fitsio/compression.html |
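A minimal sketch of the gzip point above, assuming a toy 200 x 200 migration matrix that is non-zero only in a narrow band around the diagonal. The matrix shape, band width, and the helper name `gzip_ratio` are illustrative assumptions and not part of any DL3 exporter; real EDISP files will compress differently.

```python
import gzip

import numpy as np


def gzip_ratio(matrix: np.ndarray) -> float:
    """Return raw size / gzip-compressed size for a float32 copy of the matrix."""
    raw = matrix.astype(np.float32).tobytes()
    return len(raw) / len(gzip.compress(raw))


# Hypothetical migration matrix: non-zero only within 5 bins of the diagonal.
n = 200
idx = np.arange(n)
dist = np.abs(idx[:, None] - idx[None, :])
edisp = np.where(dist <= 5, np.exp(-0.5 * (dist / 2.0) ** 2), 0.0)

# Normalise each row so that it sums to 1, like a probability distribution.
edisp /= edisp.sum(axis=1, keepdims=True)

ratio = gzip_ratio(edisp)
print(f"gzip compression ratio for the toy matrix: {ratio:.1f}")
```

The long runs of zero bytes in the off-diagonal region are what the compressor removes; the format itself stays a plain dense array.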
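Hi Christoph, Another suggestion I made is to store Edisp rather as yy = e_true / e_reco. For the general case e_true ~= e_reco, so yy = e_true / e_reco ~= 1. For the case near and just below threshold, what happens is that the system triggers on upward fluctuations of showers, so e_reco is then overestimated. Then yy = e_true / e_reco < 1 and goes further towards zero as the energy decreases further (whereas the current y = e_reco / e_true increases as you go below threshold, and gets clipped easily, as can be seen on the plots).
Energy reconstruction hardly ever gives a systematic underestimate of the energy, so the stored e_true / e_reco is naturally lower-bounded by 0, and the upper bound is likely to be closer to 1 than 2. So, though I agree with @kosack that it would be best to stick to the RMF standard, a compromise would be to use e_true / e_reco on the y-axis instead. Good Luck, Michael. |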
@mdpunch - Thanks for the comment about the One comment concerning RMF: it has this kinda complex zero-suppression scheme (storing groups of pixels above threshold) and how to store that in FITS. We might not want to use this for DL3, if we zero-suppress we might want a more generic scheme that we then also use e.g. for the table PSF format, or maybe we just want to let gzip on the FITS files take out the zeros and not do any compression in the format itself. And RMF doesn't support extra axes, like the offset axis that we currently have in the EDISP format. So I think what we need is a small study / evaluation / summary which of the options that have been suggested here really have advantages (in terms of simplicity / accuracy / storage or processing space):
My guess is that it doesn't matter much, all would work. One concern we have in HESS is low MC stats e.g. at high energies. We will have to start to implement EDISP smoothing methods. Some axis / binning formats might be better suited for this than others, I don't know. @registerrier, @GernotMaier, all - Thoughts on how you'd like EDISP to be stored? |
The only advantage to option 2 is that that is how you use the data, so no further "untransformation" is needed to get a probability distribution for a given Note that in any case, we should also store some sort of error bars with the matrix (something we learned when doing sensitivity curves). The reason for that is that when doing the transform, one needs to weight by the bin error in order to avoid crazy artifacts coming from sampling noise. I would not suggest following the RMF format, but rather just the concept (we can always make a converter to export to RMF). You can just store an image of logEtrue vs logEreco, and zero everything that has too low stats. Gzip will efficiently compress that, i just checked with an example matrix (you get a factor of up to 10 compression). The reasoning may also be that for CTA Ereco is ~ Etrue, but if this is to be general, it may not be the case. Especially for some other analysis like cosmic rays, it could be very different in shape. |
Currently the assumption is that DL3 IRF producers create responses like EDISP (but also the others like AEFF, BKG, PSF) that are noise-free and well-sampled! We should make that clearer in the current DL3 specs. @kosack - If you think this doesn't work for CTA or is not a good way to do things, please open a new separate issue! There have been discussions about IRF errors of course, but as far as I know no-one is concretely working on implementing a DL3 format for that or science tool code to handle that input.
That already exists in Gammapy (and probably Gammalib as well), there's an example how to do it here. |
Agreed, that is fine. It just needs to be a clear specification that all IRFs need to be smoothed or well-enough sampled such that noise is not a problem. How do you then deal with edges where the sampling is bad and you really don't know the response? |
having a NaN value in bins that should not be trusted or something could be one way to avoid border cases |
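One way the NaN idea could look in practice is sketched below. The helper name `mask_low_stats`, the `min_counts` threshold, and the toy arrays are assumptions made for illustration; nothing here is prescribed by the current DL3 spec.

```python
import numpy as np


def mask_low_stats(pdf, counts, min_counts=10):
    """Return a copy of pdf with NaN in bins backed by fewer than min_counts MC events."""
    out = np.array(pdf, dtype=float, copy=True)
    out[np.asarray(counts) < min_counts] = np.nan
    return out


# Toy response values and the number of simulated events behind each bin.
pdf = np.array([[0.1, 0.8, 0.1], [0.2, 0.6, 0.2]])
counts = np.array([[3, 500, 40], [50, 400, 2]])

masked = mask_low_stats(pdf, counts)

# A consumer can test which bins are trustworthy before interpolating.
trusted = np.isfinite(masked)
print(masked)
print(trusted)
```

A reader of such a file would have to check `np.isfinite` (or use NaN-aware reductions) before interpolating, which is the cost of marking untrusted bins in the data itself.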
Let's keep this issue focused on EDISP y-axis, and try to have one GH issue per suggestion / topic. |
It is significant work to change this. In addition to the spec change, @jknodlseder would have to change ctools, @mimayer the HESS ParisAnalysis exporters and we the HESS HAP exporters and Gammapy, someone the VERITAS exporters. If this is a good change that results in a better format for the next decade, we should of course do it. I'm just mentioning this to make it clear that this is not something we should change often, i.e. what we change to now should be what we use at least for the next ~ year (until we learn more and find an even better way to do it). @jknodlseder @mimayer - Do you agree the suggested change here is a good one and that we should make it now? Would you be willing to implement it in Gammalib / the HESS ParisAnalysis exporters? Do you have a preference for using
Yes, that was a plotting bug in the axis labels. Thanks for pointing it out! Just to have another data point, and to illustrate that the cutoff issue in the HAP EDISP can easily be avoided, the HESS ParisAnalysis exporters currently use much more coarse binning in |
Another "data point", as you say, is this plot from CAT (http://arXiv.org/abs/astro-ph/9804133v1):
...whereas the usual threshold definition (max. convolved with Crab spectrum gives):
(you can understand my interest in this now, given the comments at the HESS meeting last week concerning threshold definitions) |
For practical purposes (interpolation), storing y = e_reco / e_true is much better, and it's also more efficient in terms of space, avoiding complex compression schemes.
On 4 Oct 2016 at 14:42, mdpunch ***@***.***> wrote:
Hi Christoph,
Another suggestion I made is to store Edisp rather as yy = e_true / e_reco.
For the general case e_true ~= e_reco, so yy = e_true / e_reco ~= 1
For the case near and just below threshold, what happens is that the system triggers on upward fluctuations of showers, so e_reco is then overestimated. Then yy = e_true / e_reco < 1 and goes further towards zero as the energy decreases further (whereas the current y = e_reco / e_true increases as you go below threshold, and gets clipped easily, as can be seen on the plots).
Energy reconstruction hardly ever gives a systematic underestimate of the energy, so the stored e_true / e_reco is naturally lower-bounded by 0, and the upper bound is likely to be closer to 1 than 2.
So, though I agree with @kosack <https://github.com/kosack> that it would be best to stick to the RMF standard, a compromise would be to use e_true / e_reco on the y-axis instead.
Good Luck,
Michael.
|
Why is information lost?
On 4 Oct 2016 at 17:49, mdpunch ***@***.***> wrote:
Okay, but does there need to be a long discussion on this? For me it's clear that y = e_reco / e_true is causing information loss, and it seems evident that it needs to be changed:
<https://cloud.githubusercontent.com/assets/17853426/19080866/b2a9feb6-8a58-11e6-8e10-10d4f02eb0ad.png>
Actually, this last plot looks weird, and I think the axis labels are swapped, can you check?
Assuming that, then if you have an e_reco of 400 GeV, the clipping of the plot is hiding that your e_true can have a huge error bar.
|
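Hi @jknodlseder , Information is lost because of clipping. At the threshold, only showers with upward fluctuations trigger the detector. These showers are reconstructed to be at the threshold energy, whereas their true energy is lower. So e_reco/e_true for these showers tends to be very large. You can literally see this effect on the e_reco vs. e_true plot as reconstructed from the stored y = e_reco/e_true values, see my comment on 4th October. In addition to the low threshold effect, there also seems to be an effect at high energies: I haven't tried to figure out why there are so many events with e_reco/e_true>>1... but that effect is there too. So, to avoid information loss, one needs a much larger range for the migration matrix in e_reco/e_true vs e_true. Storing y* = e_true/e_reco gets around this clipping effect, because it is rare that the reconstructed energy is much below the true energy. It is the simplest solution if we want to store things in linear scale. Another possibility, if it's really more practical to have e_reco/e_true, is to store rather log(e_reco/e_true). I think this could take into account nicely the outliers without needing much bigger matrices, and even seems more natural to do this since ideally that distribution should be Gaussian, so storing it in log is better, Michael. |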
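The clipping argument can be illustrated with a toy simulation. This is only a sketch: the log-normal shape, the upward bias of 1.8, the width of 0.5, and the axis range of [-1, 1] for the log10 option are all assumed for illustration and are not taken from any HESS or CTA response.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy near-threshold events: e_reco is biased upward relative to e_true.
# The log-normal shape and its parameters are assumptions for illustration.
ratio = rng.lognormal(mean=np.log(1.8), sigma=0.5, size=100_000)  # e_reco / e_true


def clipped_fraction(values, lo, hi):
    """Fraction of values that fall outside the stored axis range [lo, hi]."""
    values = np.asarray(values)
    return float(np.mean((values < lo) | (values > hi)))


# y = e_reco / e_true stored on [0, 2] (the range discussed in this issue).
frac_ratio = clipped_fraction(ratio, 0.0, 2.0)

# y* = e_true / e_reco stored on the same [0, 2] range.
frac_inverse = clipped_fraction(1.0 / ratio, 0.0, 2.0)

# log10(e_reco / e_true) stored on a hypothetical [-1, 1] range.
frac_log = clipped_fraction(np.log10(ratio), -1.0, 1.0)

print(f"clipped, e_reco/e_true on [0, 2]:      {frac_ratio:.4f}")
print(f"clipped, e_true/e_reco on [0, 2]:      {frac_inverse:.4f}")
print(f"clipped, log10(e_reco/e_true) [-1, 1]: {frac_log:.4f}")
```

With an upward-biased toy distribution like this one, the direct ratio loses a large share of events above y = 2, while the inverted ratio and the log axis lose very few. How this plays out for real responses would need the small study mentioned earlier in the thread.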
I see, your proposal is to invert the ratio.
I think this is in principle fine, at least I see no immediate show stopper.
I guess before doing the change I would like to make a test implementation to see what the technical implications are.
On 13 Jan 2017 at 13:16, mdpunch ***@***.***> wrote:
Hi @jknodlseder <https://github.com/jknodlseder> ,
Information is lost because of clipping.
At the threshold, only showers with upward fluctuations trigger the detector. These showers are reconstructed to be at the threshold energy, whereas their true energy is lower. So e_reco/e_true for these showers tends to be very large.
You can literally see this effect on the e_reco vs. e_true plot as reconstructed from the stored y = e_reco/e_true values, see my comment on 4th October.
In addition to the low threshold effect, there also seems to be an effect at high energies: I haven't tried to figure out why there are so many events with e_reco/e_true>>1... but that effect is there too.
So, to avoid information loss, one needs a much larger range for the migration matrix in e_reco/e_true vs e_true.
Storing y* = e_true/e_reco gets around this clipping effect, because it is rare that the reconstructed energy is much below the true energy. It is the simplest solution if we want to store things in linear scale.
Another possibility, if it's really more practical to have e_reco/e_true, is to store rather log(e_reco/e_true). I think this could take into account nicely the outliers without needing much bigger matrices, and even seems more natural to do this since ideally that distribution should be Gaussian, so storing it in log is better,
Michael.
|
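Well, either to invert (from e_reco/e_true to e_true/e_reco), or to store as log... or both! Maybe log(e_reco/e_true) is simplest to implement right now? Whatever works... |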
Not sure ... going from lin to log always implies keeping track of the Jacobian terms in the pdf computation.
Also, I think with linear you get a nice Gaussian for higher energies.
On 13 Jan 2017 at 17:44, mdpunch ***@***.***> wrote:
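For reference, the Jacobian terms in question follow from the standard change-of-variables rule for probability densities at fixed true energy. The symbols $\mu$, $u$ and $\nu$ below are introduced here only for illustration and do not appear in the format spec; the natural log is assumed.

```latex
\mu = \frac{E_\mathrm{reco}}{E_\mathrm{true}}, \qquad
u = \ln \mu, \qquad
\nu = \frac{1}{\mu} = \frac{E_\mathrm{true}}{E_\mathrm{reco}}

% Log axis: the density picks up a factor \mu = e^{u}
p_u(u) = p_\mu(\mu)\,\left|\frac{d\mu}{du}\right| = e^{u}\, p_\mu\!\left(e^{u}\right)

% Inverted ratio: the density picks up a factor 1/\nu^{2}
p_\nu(\nu) = p_\mu(\mu)\,\left|\frac{d\mu}{d\nu}\right| = \frac{1}{\nu^{2}}\, p_\mu\!\left(\frac{1}{\nu}\right)
```

So both the log option and the inverted-ratio option need a Jacobian factor when converting back to a density in e_reco/e_true; only the form of the factor differs.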
Well, either to invert (from e_reco/e_true to e_true/e_reco), or to store as log... or both!
Maybe log(e_reco/e_true) is simplest to implement right now?
Whatever works...
|
@woodmd and I discussed this just now, and he also said that he would try Another orthogonal suggestion he made was to try splines, because they:
I completely agree that this should be done. But it only makes sense to continue the discussion when there is someone willing to implement / prototype something, which as far as I can see is not the case at the moment. Hopefully this discussion will be useful to whoever does this work in the future. |
This is a suggestion for a change I got via email from Michael Punch and @kosack, after presenting the FITS data formats for H.E.S.S. at the collaboration meeting last week.
There we were using EDISP in the current format (see here) with y-axis range up to
y = e_reco / e_true = 2
which clipped the distribution. To avoid this, they suggested instead using
y = log(e_reco)
which is what OGIP RMF does, and also using that for the DL3 responses (where there are extra axes, e.g. for field of view offset, and no sparse matrix zero-suppression scheme).