[BUG]: docker containers don't crash #188
Comments
Hi @haf - thank you for your message! It's definitely something that should be improved. We will investigate and get back to you with a fix as soon as possible.
Do we have any status on this? We seem to be running into something related, where the API suddenly returns 500 and a compose down/up fixes it.
@OscarKolsrud we have to dig into it a bit, since this is the way Rails/Sidekiq works today. We may have to make it customizable.
Hi, my org also noticed this problem. We ran into a lot of trouble because the lago clock stayed frozen in an error state for 2 days after a Redis disconnect (it kept running but stopped logging after the error, and customers weren't charged). We suspect it's the second or third time we've lost a customer-billing day to this type of outage (pods in an error state in the last or first days of the month). Do you have updates on this, or any suggestion of a mechanism to restart pods once they enter an error state?
https://en.wikipedia.org/wiki/Crash-only_software - this is a reference to a very sane way of building software, where you auto-correct faults by restarting at different levels.
@gabrielseibel1 if Redis is your problem, restarting the pod will not fix the issue; it will start working again once Redis is available again.
If the connectivity to redis is a problem, a restart will trigger a retry of the connection.
Each time a job is enqueued or wants to run, the connection to redis is retried.
If it's done using a liveness probe in k8s it would have solved my problem. If you've solved the bug from the stacktrace in the original post in this thread, you can close the issue. Do note that this stacktrace happens on container start though; not after a while / for networking errors - so in the case of this issue - you'd never have a successful health check.
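To illustrate the kind of check a liveness probe could run, here is a minimal sketch of a heartbeat-file approach. It is not Lago's implementation: the function names and the idea that the worker loop touches a file are assumptions made for this example. The worker would call `touch_heartbeat` on every loop iteration, and a Kubernetes `exec` liveness probe would run a small script that exits non-zero when `heartbeat_is_fresh` returns `False`, which causes the container to be restarted.

```python
import os
import time


def touch_heartbeat(path: str) -> None:
    """Record that the worker loop is still making progress (hypothetical helper)."""
    # Create the file if it does not exist yet, then bump its modification time.
    with open(path, "a"):
        pass
    os.utime(path, None)


def heartbeat_is_fresh(path: str, max_age_seconds: float) -> bool:
    """Return True if the heartbeat file was touched within max_age_seconds."""
    try:
        age = time.time() - os.path.getmtime(path)
    except OSError:
        # A missing or unreadable heartbeat file counts as unhealthy.
        return False
    return age <= max_age_seconds
```

Because a process that never starts successfully never writes the file, this check also fails in the startup case described above, provided the probe's initial delay is shorter than the staleness threshold.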
I'm currently working on the liveness probe in our Helm chart, so this is definitely something we'll release very soon!
it's been a year. the clock crashes often. how do you guys deal with this?
@doctorpangloss we have never had a clock crash in our cloud environment. If it crashes, it's always because of an unhealthy Redis service.
Here is another log from broken clock with lago v1.4. The pod of the lago-clock did not recover within 2 days. There are no errors in the redis logs (newest log from redis is 2024-07-12T05:23:25.509128941Z)
Thanks for this @grthr, I'm having a look.
## Context
- We often have an error message about redis timeout (0.1 seconds)
- After investigation, it's coming from the uniqueness job configuration

## Description
- Update the `redis_timeout` value to the same as redis, `5 second`. It's a lot but it will cover cases where redis can be slower (self hosted).

getlago/lago#188 (comment)
Describe the bug
If the database, or redis, is unavailable, the docker containers don't crash. This stops them from auto-healing (e.g. DNS recovering and injecting the env var).
To Reproduce
Steps to reproduce the behavior:
Expected behavior
If the environment a software service runs in is incorrectly configured, print an error message and exit the process. This lets monitoring software alert to the problem. Otherwise ops has to write app-specific event listeners, eventify the logs (including writing stacktrace parsers for ruby) and deploy listeners for the logs, then write API-integrations with the platform runtime (like kubernetes) to restart the pod.
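The fail-fast behavior described above can be sketched as a startup check that probes each dependency and reports a non-zero exit code when any of them is unreachable. This is only an illustration under stated assumptions: the function names are hypothetical, a plain TCP connect stands in for a real database or Redis handshake, and Lago itself is a Rails application, so this is not its code.

```python
import socket
import sys


def dependency_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False


def startup_exit_code(dependencies: dict[str, tuple[str, int]]) -> int:
    """Return 0 if every dependency is reachable, 1 otherwise.

    Each failure is printed to stderr so monitoring can alert on it.
    """
    exit_code = 0
    for name, (host, port) in dependencies.items():
        if not dependency_reachable(host, port):
            print(f"fatal: {name} unreachable at {host}:{port}", file=sys.stderr)
            exit_code = 1
    return exit_code
```

A service entrypoint could then call `sys.exit(startup_exit_code({"postgres": ("db", 5432), "redis": ("redis", 6379)}))` before starting its workers, where the host names are placeholders. Exiting lets the container runtime's restart policy retry once DNS or the dependency recovers.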
Support