
RooFit EvalBackend("cpu") Disables Multi-Core Support During RooMinimizer Minimization #17344

JieWu-GitHub opened this issue Jan 1, 2025 · 4 comments


JieWu-GitHub commented Jan 1, 2025

Check duplicate issues.

  • Checked for duplicates

Description

When minimizing a log-likelihood created with createNLL in RooFit, specifying the default backend

```python
nll1 = pdf1.createNLL(data1, RooFit.NumCPU(4), RooFit.EvalBackend("cpu"))
nll2 = pdf2.createNLL(data2, RooFit.NumCPU(4), RooFit.EvalBackend("cpu"))
```

unexpectedly disables multi-core usage. To leverage multiple cores, the backend must be set to "legacy":

```python
nll1 = pdf1.createNLL(data1, RooFit.NumCPU(4), RooFit.EvalBackend("legacy"))
nll2 = pdf2.createNLL(data2, RooFit.NumCPU(4), RooFit.EvalBackend("legacy"))
```

Reproducer


```python
import ROOT
from ROOT import (RooFit, RooRealVar, RooGaussian, RooArgSet, RooArgList,
                  RooAddition, RooMinimizer, RooAddPdf)


def simultaneous_fit_with_roo_minimizer():
    # 1) Define a shared observable
    x = RooRealVar("x", "x", -10, 10)

    # 2) Define SHARED parameter(s), e.g. one mean for both Gaussians
    mean = RooRealVar("mean", "shared mean", 0, -10, 10)  # common
    mean2 = RooRealVar("mean2", "shared mean2", 2, -10, 10)  # common

    # Distinct widths for the six Gaussian components
    sigma1 = RooRealVar("sigma1", "width 1", 1.5, 0.1, 5.0)
    sigma2 = RooRealVar("sigma2", "width 2", 0.5, 0.1, 4.0)
    sigma3 = RooRealVar("sigma3", "width 3", 1.5, 0.1, 5.0)
    sigma4 = RooRealVar("sigma4", "width 4", 2.5, 0.1, 5.0)
    sigma5 = RooRealVar("sigma5", "width 5", 3.5, 0.1, 5.0)
    sigma6 = RooRealVar("sigma6", "width 6", 0.5, 0.1, 5.0)

    # Component yields (upper bounds widened so that all initial values
    # lie inside the allowed range)
    yield1 = RooRealVar("yield1", "yield 1", 100000, 0, 1000000)
    yield2 = RooRealVar("yield2", "yield 2", 20000, 0, 1000000)
    yield3 = RooRealVar("yield3", "yield 3", 300000, 0, 1000000)
    yield4 = RooRealVar("yield4", "yield 4", 400000, 0, 1000000)
    yield5 = RooRealVar("yield5", "yield 5", 500000, 0, 1000000)
    yield6 = RooRealVar("yield6", "yield 6", 600000, 0, 1000000)

    # 3) Build two separate PDFs that share 'mean' but differ in sigma
    _pdf1 = RooGaussian("_pdf1", "_pdf1", x, mean, sigma1)
    _pdf2 = RooGaussian("_pdf2", "_pdf2", x, mean, sigma2)
    _pdf3 = RooGaussian("_pdf3", "_pdf3", x, mean, sigma3)
    _pdf4 = RooGaussian("_pdf4", "_pdf4", x, mean2, sigma4)
    _pdf5 = RooGaussian("_pdf5", "_pdf5", x, mean, sigma5)
    _pdf6 = RooGaussian("_pdf6", "_pdf6", x, mean2, sigma6)

    pdf1 = RooAddPdf("pdf1", "pdf1",
                     RooArgList(_pdf1, _pdf2, _pdf3, _pdf4),
                     RooArgList(yield1, yield2, yield3, yield4))
    pdf2 = RooAddPdf("pdf2", "pdf2",
                     RooArgList(_pdf5, _pdf6),
                     RooArgList(yield5, yield6))

    # 4) Generate two separate data sets from these shapes
    data1 = pdf1.generate(RooArgSet(x), int(1.5 * 300000))  # "first bin"
    data2 = pdf2.generate(RooArgSet(x), int(1.5 * 400000))  # "second bin"

    # 5) Create a negative log-likelihood (NLL) for each dataset.
    #    Note that 'mean' is shared, so the minimizer sees it as common.
    nll1 = pdf1.createNLL(data1, RooFit.NumCPU(4), RooFit.EvalBackend("cpu"))  # no multi-core
    nll2 = pdf2.createNLL(data2, RooFit.NumCPU(4), RooFit.EvalBackend("cpu"))  # no multi-core
    # nll1 = pdf1.createNLL(data1, RooFit.NumCPU(4), RooFit.EvalBackend("legacy"))  # multi-core enabled
    # nll2 = pdf2.createNLL(data2, RooFit.NumCPU(4), RooFit.EvalBackend("legacy"))  # multi-core enabled

    ROOT.Math.MinimizerOptions.SetDefaultMinimizer("Minuit")

    # 6) Sum the two NLL objects with RooAddition to get the total NLL
    nll_list = RooArgList(nll1, nll2)
    total_nll = RooAddition("total_nll", "sum of nll1 + nll2", nll_list)

    # 7) Create a RooMinimizer on total_nll and run the minimization
    minim = RooMinimizer(total_nll)
    minim.setPrintLevel(1)
    migrad_status = minim.migrad()
    hesse_status = minim.hesse()
    return migrad_status, hesse_status


if __name__ == "__main__":
    simultaneous_fit_with_roo_minimizer()
```

ROOT version


```
| Welcome to ROOT 6.32.02                        https://root.cern |
| (c) 1995-2024, The ROOT Team; conception: R. Brun, F. Rademakers |
| Built for linuxx8664gcc on Sep 18 2024, 20:01:03                 |
| From heads/master@tags/v6-32-02                                  |
| With                                                             |
| Try '.help'/'.?', '.demo', '.license', '.credits', '.quit'/'.q'  |
```

Installation method

conda

Operating system

openSUSE Leap 15.6



guitargeek commented Jan 1, 2025

Thank you very much for opening this issue! It reminds me that the documentation needs to be updated.

The scaling of the old RooFit multi-core support was not very good: no matter how many cores were requested with NumCPU(), I could not find any use case where the old "legacy" backend with multiprocessing was consistently faster than the new default "cpu" backend, even with the latter running on a single thread.

Your reproducer script confirms this once again. Here are the performance numbers I get with it on my machine:

[Plot: fit runtime versus the number of cores passed to NumCPU(), comparing the legacy and cpu backends]

As more processes are used, the legacy backend converges to about the same runtime as the new cpu backend for your reproducer on my machine.

Implementing multi-core support for the new backend was therefore not strictly necessary: there are no performance regressions. Moreover, parallelizing likelihood evaluations in the context of numeric minimization is hard to do in a way that scales well in the general case. In some cases it's better to parallelize over events, in others over likelihood components, and doing the wrong thing often results in even longer fitting times because of scheduling overhead. Hence, we didn't implement multi-core support for the new backend. Instead, users are encouraged to parallelize their workflows at a higher level, for example by running many fits at the same time (e.g. for toy studies or profile likelihood scans), as sketched below.
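
To illustrate what such higher-level parallelization can look like, here is a minimal sketch (not from this thread; the single-Gaussian model and the worker function are made up for the example) that runs independent toy fits in separate processes with Python's multiprocessing:

```python
# Hypothetical toy study: parallelize over independent fits rather than
# inside a single likelihood evaluation.
import multiprocessing as mp


def run_one_toy(seed):
    # Each worker imports ROOT and builds its own model, so the processes
    # share nothing and scale trivially with the number of cores.
    import ROOT
    from ROOT import RooFit, RooRealVar, RooGaussian

    ROOT.RooRandom.randomGenerator().SetSeed(seed)
    x = RooRealVar("x", "x", -10, 10)
    mean = RooRealVar("mean", "mean", 0, -10, 10)
    sigma = RooRealVar("sigma", "sigma", 1.5, 0.1, 5.0)
    pdf = RooGaussian("pdf", "pdf", x, mean, sigma)
    data = pdf.generate(ROOT.RooArgSet(x), 10000)
    # Each fit runs the single-threaded "cpu" backend internally.
    result = pdf.fitTo(data, RooFit.EvalBackend("cpu"), RooFit.Save(),
                       RooFit.PrintLevel(-1))
    return result.floatParsFinal().find("mean").getVal()


if __name__ == "__main__":
    with mp.Pool(processes=4) as pool:
        fitted_means = pool.map(run_one_toy, range(100))
    print(sum(fitted_means) / len(fitted_means))
```

Each toy is embarrassingly parallel, so this scales much better than splitting a single likelihood evaluation across cores.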

I will update the documentation to make this clear.

That being said, you are of course free to open a feature request asking for multi-processing with the new default backend! But I'm afraid the priority won't be very high. People were "fine" with the performance of the old backend, and the new backend generally beats its performance for an arbitrary number of cores used in the legacy backend, so squeezing out even more performance is not the focus right now.

However, the situation would of course change if you have a realistic use case where the new backend on one thread constitutes a significant performance regression compared to the old backend with a realistic number of threads, and no other parallelization is possible at the user level!

Let me know if this reasoning makes sense to you, and thanks in advance for any further feedback!

JieWu-GitHub (Author) commented:

Hi Jonas,

Thank you very much for your prompt and detailed response. I appreciate you taking the time to explain the reasons behind this!

I wanted to share some results from my own testing:

In my C++ project, which involves an angular analysis with weighted events, I handle approximately 800k input events. The fitting process performs a simultaneous fit across 8 subsamples with a total of 32 parameters across 3 observables (a 3D fit). Here are the performance metrics I observed:
• New cpu backend:
  • Time taken: slightly above 11 minutes
  • Configuration: default single-threaded cpu backend
• Legacy backend with multi-core support:
  • Time taken: slightly below 9 minutes
  • Configuration: 28 cores (Intel® Xeon® CPU E5-2667 v3 @ 3.20GHz), legacy backend

These results indicate that, in this specific use case, the legacy backend with multi-core support outperforms the new cpu backend by approximately 2 minutes (~20%).

I understand the difficulties in parallelising the likelihood calculation in the new backend. Given these results, I am comfortable using the new cpu backend as it reduces resource usage (from 28 cores to 1 core) with only a modest increase in computation time.

However, it would be highly beneficial to have a more intelligent mechanism to automatically select the optimal backend based on the specific analysis case. This could potentially maximise performance while minimising resource consumption without requiring manual configuration ;)
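
Purely as an illustration of what such a mechanism could look like at the user level (this helper is hypothetical, not a RooFit API), one could time a few NLL evaluations under each backend and pick the faster configuration:

```python
# Hypothetical helper: benchmark one NLL under both backends and return
# the name of the faster one. Rough heuristic only; a handful of
# evaluations is just a proxy for the cost of a full minimization.
import time

from ROOT import RooFit


def pick_backend(pdf, data, ncpu=4, n_evals=5):
    configs = {
        "cpu": (RooFit.EvalBackend("cpu"),),
        "legacy": (RooFit.NumCPU(ncpu), RooFit.EvalBackend("legacy")),
    }
    timings = {}
    for name, args in configs.items():
        nll = pdf.createNLL(data, *args)
        # Nudging a parameter marks the NLL dirty, forcing a real
        # re-evaluation on every getVal() call.
        par = nll.getVariables().first()
        nll.getVal()  # warm-up (lets the legacy backend spawn its workers)
        start = time.perf_counter()
        for _ in range(n_evals):
            par.setVal(par.getVal() * (1.0 + 1e-6))
            nll.getVal()
        timings[name] = time.perf_counter() - start
    return min(timings, key=timings.get)
```

Whether this pays off would depend on how representative the timed evaluations are of the parameter points visited during minimization.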

Thank you once again for your attention to this matter and for your contributions to the project!

guitargeek (Contributor) commented:

Hi @JieWu-GitHub, thanks also for reporting your measurements! A slowdown of 2 minutes (20%) is significant. Let's keep this issue open for now to see if there is something that can be done easily; I will do some more checks. At the very least, the issue should not be closed before the documentation is updated.

One more question about your use case: are the 800k events distributed uniformly over the subsamples in the simultaneous fit, e.g. does each of the 8 subsamples have 100k entries?

JieWu-GitHub (Author) commented:

Hi @guitargeek, thank you for looking into this and for your thoughtful response.

Regarding your question, the samples are not evenly distributed across the subsamples. The distribution fractions are approximately as follows:
• Subsamples 1–4: 10%, 15%, 15%, 10%
• Subsamples 5–8: 10%, 15%, 15%, 10%

This actually involves a simultaneous fit across 4 mass bins and 2 charge states, resulting in a total of 8 bins.
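
(For context, a fit like this is typically organized with a RooCategory over the bins and a RooSimultaneous holding one PDF per bin. A minimal structural sketch with hypothetical names and a placeholder per-bin model:)

```python
# Hypothetical sketch of an 8-bin (4 mass bins x 2 charge states)
# simultaneous-fit structure; the real analysis has 3 observables.
import ROOT
from ROOT import RooCategory, RooGaussian, RooRealVar, RooSimultaneous

x = RooRealVar("x", "x", -10, 10)
cat = RooCategory("bin", "mass bin x charge state")
sim_pdf = RooSimultaneous("sim_pdf", "simultaneous pdf", cat)

components = []  # keep Python references alive; RooSimultaneous does not own them
for mass_bin in range(1, 5):
    for charge in ("plus", "minus"):
        state = f"m{mass_bin}_{charge}"
        cat.defineType(state)
        # Placeholder per-bin shape standing in for the real model
        mean = RooRealVar(f"mean_{state}", "mean", 0, -10, 10)
        sigma = RooRealVar(f"sigma_{state}", "sigma", 1.0, 0.1, 5.0)
        pdf = RooGaussian(f"pdf_{state}", "pdf", x, mean, sigma)
        components.append((mean, sigma, pdf))
        sim_pdf.addPdf(pdf, state)
```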
