monitoring: add support for builtin netdata #51
upstream:
  - repo: 'balena-netdata'
    url: 'https://github.com/balena-io-playground/balena-netdata'
I take it the repo is private so far? I tried looking it up and got a "not found" error 😮
Yep, it's a work-in-progress that we're hoping to get ready shortly!
@ptrm I've open-sourced that repo. If you'd like to give this netdata approach a spin, I'd love to hear what you think!
@ptrm I wanted to follow up here and see if you had a chance to test this PR locally?
Apologies, busy time. I'm deploying it now on my 4GB rpi4. I will push it to the others if it succeeds, though I assume a longer-term run would be needed to evaluate it better, right?
It looks good in itself and fast enough on my 4gb rpi4 :)
When it comes to other devices, the 1GB rpi3 runs fine (edit: and smoothly, too :) ). I got reboot loops on my two 2GB rpi4s. I have to investigate, though, because they were overclocked to 2.1 GHz to speed up the single task they are limited to, so not a clean environment ;) When booted with a monitor plugged in, no warning overlays showed, so it might be something else.
EDIT: the 2GB rpi4s somehow had a 4-core limit set; after limiting them back to 1 core they seem to be running fine.
The memory and CPU footprint worries me a bit, though, when it comes to 1GB devices. I think it might be good to disable netdata there, and on 2GB devices too, if a way to run 2+ tasks ever appears.
Having said that, I don't know whether it is possible to, e.g., enable a headless mode (so a LAN peer could gather the data), or to change the kind and amount of metrics gathered to lower netdata's footprint.
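As a rough illustration of the "disable it on small devices" idea, here is a minimal sketch of a memory check that a startup script could run before enabling monitoring. It is not part of this PR: the function names and the 1536 MiB threshold are hypothetical, and the only thing it relies on is the standard `MemTotal:` line in `/proc/meminfo`.

```python
def mem_total_mib(meminfo_text: str) -> int:
    """Return MemTotal in MiB from /proc/meminfo-style text."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemTotal:"):
            # Line format: "MemTotal:        3884372 kB"
            return int(line.split()[1]) // 1024
    raise ValueError("MemTotal not found")


def should_enable_monitoring(meminfo_text: str, min_mib: int = 1536) -> bool:
    """Hypothetical policy: only enable monitoring above min_mib of RAM."""
    return mem_total_mib(meminfo_text) >= min_mib


def read_meminfo(path: str = "/proc/meminfo") -> str:
    """Read the meminfo file; the path is a parameter so it can be tested."""
    with open(path) as f:
        return f.read()
```

On a device this would be called as `should_enable_monitoring(read_meminfo())`; the threshold would need tuning against real footprint measurements.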
Also, I don't know how much this PR was considered an aid in investigating #47, but does netdata support capturing oom_kill events? One of my suspicions is that the kernel might target the wrong process (e.g. some vital balenaOS one) when out of memory, and as a consequence we get a reboot. I will try to fiddle with it in other ways, too, this month.
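Independently of what netdata charts, the OOM suspicion can be checked by hand on kernels that expose an `oom_kill` counter in `/proc/vmstat` (an assumption about the kernel in use; older kernels do not have this field). The sketch below uses hypothetical helper names and simply reads that counter so it can be compared before and after a reboot-prone workload.

```python
from typing import Optional


def oom_kill_count(vmstat_text: str) -> Optional[int]:
    """Return the oom_kill counter from /proc/vmstat-style text, or None
    if the kernel does not expose it."""
    for line in vmstat_text.splitlines():
        fields = line.split()
        if len(fields) == 2 and fields[0] == "oom_kill":
            return int(fields[1])
    return None


def read_oom_kill_count(path: str = "/proc/vmstat") -> Optional[int]:
    """Read the counter from the live system (path is overridable)."""
    with open(path) as f:
        return oom_kill_count(f.read())
```

The counter resets on reboot, so it only helps if it is sampled (or logged) before the device goes down; which process was killed still has to come from the kernel log.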
@ptrm these are great notes, thank you very much for taking the time to test so thoroughly! I have opened a ticket to disable some features if we're in a low-mem/CPU situation: balena-io-examples/balena-netdata#10.
With respect to OOM events, we should collect all data up until the kernel pauses netdata as part of the OOM traversal. Unfortunately we'll lose any data for the time that it takes the kernel to traverse the page table. It would be worth reproducing #47 with netdata enabled to get a better idea of what's going on!
Change-type: patch Signed-off-by: Matthew McGinn <[email protected]>