[BUG] Temperature and core frequency graph issue #64

ezekriSCW · 2024-11-26T16:50:17Z

Describe the bug
During a full CPU load job, on the 'core_freq vs Thermal' graph, the CPU temperature suddenly drops at the end of the job and the core frequencies behave strangely.
The same behaviour occurs with the default 600s runtime and with a 420s runtime.

To Reproduce
Steps to reproduce the behavior:

1. uv run hwbench -j configs/full_cpu_load.conf -m monitoring.cfg
2. uv run hwgraph graph --traces hwbench-out-20241126111954/results.json:DL340:BMC.Server --outdir DL340

Additional context
Job graphs for the 600s and 400s runs, respectively.
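
To put a number on when the temperature drop begins relative to the configured runtime, the monitoring samples can be scanned for the first sustained fall below the running peak. This is a minimal sketch, assuming the samples have already been extracted as (seconds, degrees Celsius) pairs. The function name `first_sustained_drop`, its thresholds, and the sample values are made up for illustration and are not taken from hwbench or its results.json.

```python
from typing import Optional, Sequence, Tuple


def first_sustained_drop(
    samples: Sequence[Tuple[float, float]],
    drop: float,
    hold: int = 3,
) -> Optional[float]:
    """Return the timestamp at which the value first falls at least `drop`
    below the running peak and stays there for `hold` consecutive samples.
    Return None if no such drop exists."""
    peak = float("-inf")
    run_start: Optional[float] = None
    run_len = 0
    for timestamp, value in samples:
        if value > peak:
            peak = value
        if peak - value >= drop:
            if run_len == 0:
                run_start = timestamp  # candidate start of the drop
            run_len += 1
            if run_len >= hold:
                return run_start
        else:
            run_len = 0  # the value recovered, so discard the candidate
    return None


# Hypothetical samples (seconds since job start, temperature in degrees C).
example = [(0, 40), (60, 70), (120, 72), (180, 73), (240, 55), (300, 50), (360, 48)]
drop_start = first_sustained_drop(example, drop=10, hold=3)
```

If the reported start time is well before the configured runtime, the load probably stopped early. If it matches the runtime, the drop is more likely just the normal cooldown once the job ends.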

ezekriSCW · 2024-12-04T08:29:32Z

Additional CPU pkg vs thermal graph.

anisse · 2024-12-04T16:12:18Z

Something is very weird; notably the fact that the core frequencies are moving towards the end. Some hypotheses:

- your turbostat is not up-to-date (what version do you use?). I'm not sure it's this, because I think it should just "not work" in that case
- stress-ng crashed at some point and the benchmark stopped?
- some other issue, maybe hardware-related?
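
To help answer the version question, here is a small sketch that reports the installed turbostat and stress-ng versions. It assumes both tools accept a `--version` flag and print a dotted version number on stdout or stderr. The helper names `parse_version` and `tool_version` are hypothetical and are not part of hwbench.

```python
import re
import shutil
import subprocess
from typing import Optional


def parse_version(output: str) -> Optional[str]:
    """Return the first dotted version-like token in `output`, or None."""
    match = re.search(r"\d+(?:\.\d+)+", output)
    return match.group(0) if match else None


def tool_version(binary: str, flag: str = "--version") -> Optional[str]:
    """Run `binary flag` and extract a version string.

    Returns None if the tool is missing, fails to start, or prints
    nothing that looks like a version."""
    if shutil.which(binary) is None:
        return None  # tool is not on PATH
    try:
        result = subprocess.run(
            [binary, flag], capture_output=True, text=True, timeout=10
        )
    except (OSError, subprocess.TimeoutExpired):
        return None
    # Some tools print their version on stderr, so inspect both streams.
    return parse_version(result.stdout + result.stderr)


if __name__ == "__main__":
    for name in ("turbostat", "stress-ng"):
        print(name, tool_version(name) or "not found or version not recognized")
```

Posting this output alongside the graphs makes it easier to rule the first hypothesis in or out.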

anisse · 2024-12-04T16:15:12Z

How does this benchmark result look in terms of performance scalability?
