Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Undiagnosed rare bug will set heartbeatStats.LongestMissingBeat to a much higher number than is accurate #29

Open
5HT2 opened this issue Apr 3, 2022 · 10 comments
Labels
bug Something isn't working priority:critical

Comments

@5HT2
Copy link
Member

5HT2 commented Apr 3, 2022

Note that this bug does not affect uptime (nor device.LongestMissingBeat). This is only an issue with where time.Now() is used, I think.

@5HT2 5HT2 added bug Something isn't working priority:critical labels Apr 3, 2022
@5HT2
Copy link
Member Author

5HT2 commented Apr 9, 2022

Bumping priority, I can't tell why but this issue seemingly still happens after rebooting heartbeat. It appears to always set around ~447 hours. I don't know if this number has any significance (447 - 12 (my initial hours) = 435 hours).

This seems to come back even after the system time has been set and heartbeat has been restarted.

@5HT2
Copy link
Member Author

5HT2 commented Apr 9, 2022

Oh, it's ALWAYS 1610833 seconds

@5HT2
Copy link
Member Author

5HT2 commented Apr 9, 2022

I don't think it's caused by malformed data as my devices look fine
image

@5HT2
Copy link
Member Author

5HT2 commented Apr 9, 2022

I suspect this might have just been caused by erroneous data in the DB, I fixed it manually and will monitor for further changes.

@5HT2
Copy link
Member Author

5HT2 commented Dec 18, 2022

Can't reproduce. This issue has not happened since April 8th.

@5HT2 5HT2 closed this as not planned Won't fix, can't repro, duplicate, stale Dec 18, 2022
@5HT2
Copy link
Member Author

5HT2 commented Jan 24, 2023

Reopening this as some sort of weird Redis behavior. Here's the appendonly.aof from the last time this happened on production, which is fairly recently:

{"last_beat_formatted":"2 minutes, 4 seconds","total_devices_formatted":"3","total_visits_formatted":"903,733","total_uptime_formatted":"11,773 hours, 27 minutes, 8 seconds"{"device_name":"mango","timestamp":1673974698}
*4
$14
JSON.ARRAPPEND
$5
beats
$1
.
$46
{"device_name":"mango","timestamp":1673974698}
*4
$8
JSON.SET
$5
stats
$1
.
$303
{"last_beat_formatted":"30 seconds","total_devices_formatted":"3","total_visits_formatted":"900,588","total_uptime_formatted":"11,599 hours, 47 minutes, 8 seconds","total_beats_formatted":"376,612","total_visits":900588,"total_uptime_milli":41759228479,"total_beats":376612,"longest_missing_beat":63127}
*4
$8
JSON.SET
$7
devices
$1
.
$399
[{"device_name":"lychee","last_beat":{"device_name":"lychee","timestamp":1673928714},"total_beats":4951,"longest_missing_beat":219481},{"device_name":"mango","last_beat":{"device_name":"mango","timestamp":1673974698},"total_beats":15873,"longest_missing_beat":40847},{"device_name":"blueberry","last_beat":{"device_name":"blueberry","timestamp":1671482520},"total_beats":1,"longest_missing_beat":0}]
*4
$8
JSON.SET
$9
last_beat
$1
.
$46
{"device_name":"mango","timestamp":1673974758}
*4
$14

@5HT2 5HT2 reopened this Jan 24, 2023
@5HT2 5HT2 pinned this issue Jan 24, 2023
@5HT2
Copy link
Member Author

5HT2 commented Jan 24, 2023

This time it is 692 hours (exactly 2492358 seconds in the DB). Unrelated to system timezone updates, I believe this issue might either be correlated to restarting Docker or adding a new device??? idfk I will solve this later, it happens so rarely that switching to the Rust rewrite might just be the way to go

@5HT2 5HT2 changed the title Changing system timezone while heartbeat is running will set heartbeatStats.LongestMissingBeat to a much higher number than is accurate Undiagnosed rare bug will set heartbeatStats.LongestMissingBeat to a much higher number than is accurate Jan 24, 2023
@5HT2
Copy link
Member Author

5HT2 commented Feb 19, 2023

Happened again. 1681153 seconds, or 466 hours. This HAS to be fixed ASAP because of canary / dead man switch reasons.

{"last_beat_formatted":"11 seconds","total_devices_formatted":"3","total_visits_formatted":"912,240","total_uptime_formatted":"12,354 hours, 39 minutes, 8 seconds","total_beats_formatted":"397,665","total_visits":912240,"total_uptime_milli":44476748495,"total_beats":397665,"longest_missing_beat":63127}
*4
$8
JSON.SET
$7
devices
$1
.
$407
[{"device_name":"lychee","last_beat":{"device_name":"lychee","timestamp":1676693440},"total_beats":10839,"longest_missing_beat":219481},{"device_name":"mango","last_beat":{"device_name":"mango","timestamp":1676693414},"total_beats":31029,"longest_missing_beat":47177},{"device_name":"blueberry","last_beat":{"device_name":"blueberry","timestamp":1675012321},"total_beats":10,"longest_missing_beat":2499120}]
*4
$8
JSON.SET
$9
last_beat
$1
.
$46
{"device_name":"mango","timestamp":1676693474}
*4
$14
JSON.ARRAPPEND
$5
beats
$1
.
$46
{"device_name":"mango","timestamp":1676693474}
*4
$8
JSON.SET
$9
last_beat
$1
.
$47
{"device_name":"lychee","timestamp":1676693500}
*4
$14
JSON.ARRAPPEND
$5
beats
$1
.
$47
{"device_name":"lychee","timestamp":1676693500}
*4
$8
JSON.SET
$5
stats
$1
.
$305
{"last_beat_formatted":"11 seconds","total_devices_formatted":"3","total_visits_formatted":"912,241","total_uptime_formatted":"12,354 hours, 40 minutes, 8 seconds","total_beats_formatted":"397,667","total_visits":912241,"total_uptime_milli":44476808495,"total_beats":397667,"longest_missing_beat":1681153}

@5HT2
Copy link
Member Author

5HT2 commented Feb 19, 2023

For future reference, if this happens again before I fix it, this is handy.

curl -X POST -H "Auth: $HEARTBEAT_AUTH" -i https://frog.icu/api/update/stats -d "$(curl -s https://frog.icu/api/stats | jq '.longest_missing_beat = 63127')"

@5HT2
Copy link
Member Author

5HT2 commented Mar 14, 2023

This is happening more frequently, and I need to fix it.

{"last_beat_formatted":"28 seconds","total_devices_formatted":"3","total_visits_formatted":"919,081","total_uptime_formatted":"12,806 hours, 35 minutes, 8 seconds","total_beats_formatted":"412,428","total_visits":919081,"total_uptime_milli":46103708499,"total_beats":412428,"longest_missing_beat":1618410}

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working priority:critical
Projects
None yet
Development

No branches or pull requests

1 participant