Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

NaNs in Cl fix for PolyChord #231

Open
wants to merge 7 commits into
base: master
Choose a base branch
from

Conversation

williamjameshandley
Copy link
Contributor

In the initial stages of nested sampling, when it is drawing samples deep in the tails of the posterior distribution one can occasionally get NaN values for the Cls. When these are passed to clik code, this can cause segfaults or other undefined behaviour. The safest thing to do is to catch them and return a likelihood of logzero as an 'unphysical' point, consistent with other situations in the code

Credit to @lukashergt for writing and testing this code. I'm in the process of updating cobaya+polychord, and this will be the first in a sequence of pull requests

@codecov-commenter
Copy link

codecov-commenter commented Feb 11, 2022

Codecov Report

Merging #231 (2e420f6) into master (d945838) will decrease coverage by 0.01%.
The diff coverage is 78.57%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #231      +/-   ##
==========================================
- Coverage   87.88%   87.86%   -0.02%     
==========================================
  Files          92       92              
  Lines        8335     8347      +12     
==========================================
+ Hits         7325     7334       +9     
- Misses       1010     1013       +3     
Impacted Files Coverage Δ
cobaya/likelihoods/base_classes/planck_pliklite.py 85.85% <75.00%> (-0.46%) ⬇️
cobaya/likelihoods/base_classes/cmblikes.py 85.25% <80.00%> (-0.09%) ⬇️
cobaya/likelihoods/base_classes/planck_clik.py 84.45% <80.00%> (-0.21%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d945838...2e420f6. Read the comment docs.

@williamjameshandley
Copy link
Contributor Author

@JesusTorrado, I think that (a) I've caught all of the relevant portions of the code and (b) that it needs to be checked and acted upon in each of these cases. In principle this could be pushed back into self.provider.check_nan_Cl(Cl) or something similar to avoid code repetition, although having just had a go at that it doesn't actually reduce the number of lines of code.

@williamjameshandley
Copy link
Contributor Author

The Travis CI timed out -- I'm not sure this is related to this update.

@cmbant
Copy link
Collaborator

cmbant commented Feb 11, 2022

I re-ran the failed one and was OK

@cmbant
Copy link
Collaborator

cmbant commented Feb 11, 2022

Why/from where is it giving NaN at all? Would be better to handle this at a higher level somewhere rather than in each likelihood.

@JesusTorrado
Copy link
Contributor

@williamjameshandley and @lukashergt, thanks a lot! I'll come back to this as soon as I am finished with #222 (1-2 working days)

As Antony said, ideally clik would do this. Do you have particular configurations that produce this error, so that we can send them to Karim?

But I think that this makes sense as a provisional solution, as opposed to having to tell people to reduce the size of the prior manually before running PolyChord. Have you checked how large is the overhead? (not significant, I guess)

@williamjameshandley
Copy link
Contributor Author

@cmbant -- these nans are very unusual. They only occur when sampling across the full prior range, and only ~O(10^-6) of the time. They're therefore quite hard to catch in debug mode, and are almost certainly from extreme corner cases. I can move the Cl checks further back, but in my view the Cl calculation is probably returning the 'right thing', and the correct thing to do is to catch it at an 'unphysical parameter point' level i.e. at the loglikelihood.

@williamjameshandley
Copy link
Contributor Author

@JesusTorrado, I'll put in some logging statements in my local code to output the parameter values if/when it catches them and get back to you.

However, I don't think we would send these to Karim. The problem is that click is undefined for nan inputs. Ideally it could return an error if provided nans (and this wouldn't require specific parameter values), but equally you could say it's the user's responsibility to not hand nonsense to the likelihood code. The 'issue' fundamentally lies with in camb for producing nan Cls, although again, I'm not certain it's wrong to do so for such unreasonable inputs. I think that the correct place to handle this is at the likelihood/modelling level.

@JesusTorrado
Copy link
Contributor

In that case, it sounds like CAMB/CLASS (which one?) should detect a failed computation and not even run the likelihood, I think.

@williamjameshandley
Copy link
Contributor Author

In this case it was CAMB, but it would likely apply to both. If 'detect a failed computation' is equivalent to 'have calculated nan cls', where is the best place for camb to check this, and what exception should be thrown?

@JesusTorrado
Copy link
Contributor

JesusTorrado commented Feb 11, 2022

Hard to say where it should be tested. One possibility would be to add to the Collector's (than handles how the products of CAMB/CLASS are gathered) an optional test property as a function that can be called in the calculate method after retrieval, and have that method return an error. @cmbant what do you think?

@cmbant
Copy link
Collaborator

cmbant commented Mar 5, 2022

For CAMB we could add a Collector "post" for Cl results, and raise and error there if any element of the array is NaN?

@JesusTorrado
Copy link
Contributor

JesusTorrado commented Mar 5, 2022 via email

@JesusTorrado
Copy link
Contributor

Just after this:

for spectrum, lmax in zip(self.requested_cls, self.l_maxs_cls)])

@cmbant
Copy link
Collaborator

cmbant commented Mar 5, 2022

But this is not clik specific?

@JesusTorrado
Copy link
Contributor

There are actually 2 problems:

  • CAMB-specific: failing silently by producing unphysical results, instead of raising an error.
  • clik-specific: segafulting instead of raising a Python error when receiving nan's

Ideally both should be fixed in their respective codes, not their Cobaya interface. Since that's not happening, let's fix both in the respective interfaces. (Even if fixing the first one fixes the second one, avoiding a segfault is worth a small amount of overhead.)

Agreed?

@williamjameshandley @lukashergt have you noticed this happening with CLASS too?

@lukashergt
Copy link
Contributor

@williamjameshandley @lukashergt have you noticed this happening with CLASS too?

Yes, I initially ran into this with Cobaya+PolyChord+CLASS+Planck runs.

@cmbant
Copy link
Collaborator

cmbant commented Mar 6, 2022

@williamjameshandley, if this fix is required for runs to work, that means it must be reproducible, so you could identify the underlying problematic models?

@JesusTorrado
Copy link
Contributor

Hi @williamjameshandley @lukashergt . Sorry for taking so long to come back to this!

Looking at the changes in this PR, they can be grouped as
a) clik-related (planck_clik.py)
b) DataLikelihood._fast_chi_squared-related.

About a), happy to include your proposed changes, but I would instead do the test in the log_likelihood method, just before the line with the comment # fill with likelihood parameters. This way, there is no additional loop, and nan's are only looked for in the ell range that matters (the granularity of knowing which Cl has the nan disappears, in exchange for greater speed; but we can always backtrack and find the responsible Cl within the if checking for the error).

About b), the only possible source of segfaults I can think of would be CAMB's camb.mathutils.chi_squared, so I would say in principle any test should appear just before its call, not inside each likelihood (again, loss of granularity in favour of speed). But I have played with that function for a while and I have not been able to make it segfault at all (with clik it was fairly easy). Are you completely sure that any of the three non-clik cases for which you send changes has actually segfaulted? Could you pass me a point in parameter space that reproduces it?

Regarding the test itself, apparently np.isnan(np.dot(a, a)) is the fastest way to check for a nan in a (see https://stackoverflow.com/questions/6736590/fast-check-for-nan-in-numpy ; I have tested it locally).

@JesusTorrado
Copy link
Contributor

clik done in master. Waiting for updates on the camb.mathutils.chi_squared cases.

@lukashergt
Copy link
Contributor

clik done in master. Waiting for updates on the camb.mathutils.chi_squared cases.

I've had issues with CLASS in the past, any issues with CAMB I managed to fix. So I guess that question is for @williamjameshandley...?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants