NaNs in Cl fix for PolyChord #231

williamjameshandley · 2022-02-11T12:17:17Z

In the initial stages of nested sampling, when it is drawing samples deep in the tails of the posterior distribution one can occasionally get NaN values for the Cls. When these are passed to clik code, this can cause segfaults or other undefined behaviour. The safest thing to do is to catch them and return a likelihood of logzero as an 'unphysical' point, consistent with other situations in the code

Credit to @lukashergt for writing and testing this code. I'm in the process of updating cobaya+polychord, and this will be the first in a sequence of pull requests

codecov-commenter · 2022-02-11T12:32:13Z

Codecov Report

Merging #231 (2e420f6) into master (d945838) will decrease coverage by 0.01%.
The diff coverage is 78.57%.

@@            Coverage Diff             @@
##           master     #231      +/-   ##
==========================================
- Coverage   87.88%   87.86%   -0.02%     
==========================================
  Files          92       92              
  Lines        8335     8347      +12     
==========================================
+ Hits         7325     7334       +9     
- Misses       1010     1013       +3

Impacted Files	Coverage Δ
cobaya/likelihoods/base_classes/planck_pliklite.py	`85.85% <75.00%> (-0.46%)`	⬇️
cobaya/likelihoods/base_classes/cmblikes.py	`85.25% <80.00%> (-0.09%)`	⬇️
cobaya/likelihoods/base_classes/planck_clik.py	`84.45% <80.00%> (-0.21%)`	⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d945838...2e420f6. Read the comment docs.

williamjameshandley · 2022-02-11T12:42:20Z

@JesusTorrado, I think that (a) I've caught all of the relevant portions of the code and (b) that it needs to be checked and acted upon in each of these cases. In principle this could be pushed back into self.provider.check_nan_Cl(Cl) or something similar to avoid code repetition, although having just had a go at that it doesn't actually reduce the number of lines of code.

williamjameshandley · 2022-02-11T12:51:11Z

The Travis CI timed out -- I'm not sure this is related to this update.

cmbant · 2022-02-11T13:50:29Z

I re-ran the failed one and was OK

cmbant · 2022-02-11T13:55:14Z

Why/from where is it giving NaN at all? Would be better to handle this at a higher level somewhere rather than in each likelihood.

JesusTorrado · 2022-02-11T14:35:19Z

@williamjameshandley and @lukashergt, thanks a lot! I'll come back to this as soon as I am finished with #222 (1-2 working days)

As Antony said, ideally clik would do this. Do you have particular configurations that produce this error, so that we can send them to Karim?

But I think that this makes sense as a provisional solution, as opposed to having to tell people to reduce the size of the prior manually before running PolyChord. Have you checked how large is the overhead? (not significant, I guess)

williamjameshandley · 2022-02-11T14:35:41Z

@cmbant -- these nans are very unusual. They only occur when sampling across the full prior range, and only ~O(10^-6) of the time. They're therefore quite hard to catch in debug mode, and are almost certainly from extreme corner cases. I can move the Cl checks further back, but in my view the Cl calculation is probably returning the 'right thing', and the correct thing to do is to catch it at an 'unphysical parameter point' level i.e. at the loglikelihood.

williamjameshandley · 2022-02-11T14:40:33Z

@JesusTorrado, I'll put in some logging statements in my local code to output the parameter values if/when it catches them and get back to you.

However, I don't think we would send these to Karim. The problem is that click is undefined for nan inputs. Ideally it could return an error if provided nans (and this wouldn't require specific parameter values), but equally you could say it's the user's responsibility to not hand nonsense to the likelihood code. The 'issue' fundamentally lies with in camb for producing nan Cls, although again, I'm not certain it's wrong to do so for such unreasonable inputs. I think that the correct place to handle this is at the likelihood/modelling level.

JesusTorrado · 2022-02-11T14:41:59Z

In that case, it sounds like CAMB/CLASS (which one?) should detect a failed computation and not even run the likelihood, I think.

williamjameshandley · 2022-02-11T15:13:50Z

In this case it was CAMB, but it would likely apply to both. If 'detect a failed computation' is equivalent to 'have calculated nan cls', where is the best place for camb to check this, and what exception should be thrown?

JesusTorrado · 2022-02-11T15:23:08Z

Hard to say where it should be tested. One possibility would be to add to the Collector's (than handles how the products of CAMB/CLASS are gathered) an optional test property as a function that can be called in the calculate method after retrieval, and have that method return an error. @cmbant what do you think?

cmbant · 2022-03-05T19:45:59Z

For CAMB we could add a Collector "post" for Cl results, and raise and error there if any element of the array is NaN?

JesusTorrado · 2022-03-05T20:20:48Z

Maybe the simplest solution is a check just before the clik call, when the array containing Cl's and nuisance is constructed. Works for CAMB and CLASS.

JesusTorrado · 2022-03-05T20:28:48Z

Just after this:

cobaya/cobaya/likelihoods/base_classes/planck_clik.py

Line 126 in d945838

for spectrum, lmax in zip(self.requested_cls, self.l_maxs_cls)])

cmbant · 2022-03-05T20:53:44Z

But this is not clik specific?

JesusTorrado · 2022-03-05T21:41:02Z

There are actually 2 problems:

CAMB-specific: failing silently by producing unphysical results, instead of raising an error.
clik-specific: segafulting instead of raising a Python error when receiving nan's

Ideally both should be fixed in their respective codes, not their Cobaya interface. Since that's not happening, let's fix both in the respective interfaces. (Even if fixing the first one fixes the second one, avoiding a segfault is worth a small amount of overhead.)

Agreed?

@williamjameshandley @lukashergt have you noticed this happening with CLASS too?

lukashergt · 2022-03-05T22:55:19Z

@williamjameshandley @lukashergt have you noticed this happening with CLASS too?

Yes, I initially ran into this with Cobaya+PolyChord+CLASS+Planck runs.

cmbant · 2022-03-06T08:54:24Z

@williamjameshandley, if this fix is required for runs to work, that means it must be reproducible, so you could identify the underlying problematic models?

JesusTorrado · 2022-05-09T16:30:52Z

Hi @williamjameshandley @lukashergt . Sorry for taking so long to come back to this!

Looking at the changes in this PR, they can be grouped as
a) clik-related (planck_clik.py)
b) DataLikelihood._fast_chi_squared-related.

About a), happy to include your proposed changes, but I would instead do the test in the log_likelihood method, just before the line with the comment # fill with likelihood parameters. This way, there is no additional loop, and nan's are only looked for in the ell range that matters (the granularity of knowing which Cl has the nan disappears, in exchange for greater speed; but we can always backtrack and find the responsible Cl within the if checking for the error).

About b), the only possible source of segfaults I can think of would be CAMB's camb.mathutils.chi_squared, so I would say in principle any test should appear just before its call, not inside each likelihood (again, loss of granularity in favour of speed). But I have played with that function for a while and I have not been able to make it segfault at all (with clik it was fairly easy). Are you completely sure that any of the three non-clik cases for which you send changes has actually segfaulted? Could you pass me a point in parameter space that reproduces it?

Regarding the test itself, apparently np.isnan(np.dot(a, a)) is the fastest way to check for a nan in a (see https://stackoverflow.com/questions/6736590/fast-check-for-nan-in-numpy ; I have tested it locally).

JesusTorrado · 2022-05-13T13:13:21Z

clik done in master. Waiting for updates on the camb.mathutils.chi_squared cases.

lukashergt · 2022-05-18T22:09:15Z

clik done in master. Waiting for updates on the camb.mathutils.chi_squared cases.

I've had issues with CLASS in the past, any issues with CAMB I managed to fix. So I guess that question is for @williamjameshandley...?

williamjameshandley added 6 commits February 1, 2022 23:04

Updated to be resilient to nans in cl calculations

ea315fd

Updated polychord input parameters for 1.20

032a16b

Merge branch 'upstream' into nans_in_cl

bdb772e

reverted polychord changes

2d3f905

Updated pliklite and CamSpec to also check for nan cls

4ef537c

That should be all of the relevant likelihood calls to get_Cl

c0a9db2

williamjameshandley mentioned this pull request Feb 11, 2022

Fix and extension to the way that PolyChord treats priors #233

Open

Merge branch 'master' into nans_in_cl

2e420f6

williamjameshandley mentioned this pull request Mar 5, 2022

Polychord 1.20.0 #232

Merged

JesusTorrado added a commit that referenced this pull request May 13, 2022

planck_clik: fixes segfault when nan in Cl's. partially implements #231

4e85c4d

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

NaNs in Cl fix for PolyChord #231

NaNs in Cl fix for PolyChord #231

williamjameshandley commented Feb 11, 2022

codecov-commenter commented Feb 11, 2022 •

edited

Loading

williamjameshandley commented Feb 11, 2022

williamjameshandley commented Feb 11, 2022

cmbant commented Feb 11, 2022

cmbant commented Feb 11, 2022

JesusTorrado commented Feb 11, 2022

williamjameshandley commented Feb 11, 2022

williamjameshandley commented Feb 11, 2022

JesusTorrado commented Feb 11, 2022

williamjameshandley commented Feb 11, 2022

JesusTorrado commented Feb 11, 2022 •

edited

Loading

cmbant commented Mar 5, 2022

JesusTorrado commented Mar 5, 2022 via email

JesusTorrado commented Mar 5, 2022

cmbant commented Mar 5, 2022

JesusTorrado commented Mar 5, 2022

lukashergt commented Mar 5, 2022

cmbant commented Mar 6, 2022

JesusTorrado commented May 9, 2022

JesusTorrado commented May 13, 2022

lukashergt commented May 18, 2022

NaNs in Cl fix for PolyChord #231

Are you sure you want to change the base?

NaNs in Cl fix for PolyChord #231

Conversation

williamjameshandley commented Feb 11, 2022

codecov-commenter commented Feb 11, 2022 • edited Loading

Codecov Report

williamjameshandley commented Feb 11, 2022

williamjameshandley commented Feb 11, 2022

cmbant commented Feb 11, 2022

cmbant commented Feb 11, 2022

JesusTorrado commented Feb 11, 2022

williamjameshandley commented Feb 11, 2022

williamjameshandley commented Feb 11, 2022

JesusTorrado commented Feb 11, 2022

williamjameshandley commented Feb 11, 2022

JesusTorrado commented Feb 11, 2022 • edited Loading

cmbant commented Mar 5, 2022

JesusTorrado commented Mar 5, 2022 via email

JesusTorrado commented Mar 5, 2022

cmbant commented Mar 5, 2022

JesusTorrado commented Mar 5, 2022

lukashergt commented Mar 5, 2022

cmbant commented Mar 6, 2022

JesusTorrado commented May 9, 2022

JesusTorrado commented May 13, 2022

lukashergt commented May 18, 2022

codecov-commenter commented Feb 11, 2022 •

edited

Loading

JesusTorrado commented Feb 11, 2022 •

edited

Loading