[BUG] Temperature and core frequency graph issue #64

Open
ezekriSCW opened this issue Nov 26, 2024 · 3 comments

Comments

@ezekriSCW

Describe the bug
During a full CPU load job, on the 'core_freq vs Thermal' graph, the CPU temperature suddenly drops at the end of the job and the core frequencies behave strangely.
The same behaviour occurs with the default 600s runtime and with a 420s runtime.

To Reproduce
Steps to reproduce the behavior:

  1. uv run hwbench -j configs/full_cpu_load.conf -m monitoring.cfg
  2. uv run hwgraph graph --traces hwbench-out-20241126111954/results.json:DL340:BMC.Server --outdir DL340

Additional context
Job graphs for the 600s and 400s runs, respectively:
image
image

@ezekriSCW
Author

image
Additional CPU package vs thermal graph

@anisse
Contributor

anisse commented Dec 4, 2024

Something is very weird; notably the fact that the core frequencies are moving towards the end. Some hypotheses:

  • your turbostat is not up-to-date (what version do you use?). I'm not sure it's this because I think it should just "not work" in this case
  • stress-ng crashed at some point and the benchmark stopped?
  • some other issue, maybe hardware-related?
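
One way to narrow down the second hypothesis is to find the sample where the temperature falls and compare its position with the configured runtime. Below is a minimal Python sketch, assuming the temperature readings have already been extracted into a plain list. The function name, the threshold, and the sample values are all hypothetical and are not part of hwbench or hwgraph.

```python
from typing import Optional, Sequence


def first_sharp_drop(samples: Sequence[float], threshold: float) -> Optional[int]:
    """Return the index of the first sample that is more than `threshold`
    below the previous sample, or None if no such drop exists."""
    for i in range(1, len(samples)):
        if samples[i - 1] - samples[i] > threshold:
            return i
    return None


# Hypothetical CPU temperatures (degrees C), one value per monitoring interval.
temps = [62.0, 71.0, 78.0, 80.0, 81.0, 80.5, 66.0, 58.0, 55.0]

drop_index = first_sharp_drop(temps, threshold=10.0)
print(drop_index)  # prints 6
```

If the drop index multiplied by the monitoring interval lands well before the configured runtime, the load probably stopped early, which would point towards a stress-ng problem rather than a graphing one.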

@anisse
Contributor

anisse commented Dec 4, 2024

How does this benchmark result look in terms of performance scalability?


2 participants