Incidents may get stuck open #935

elfjes · 2024-11-07T13:59:33Z

We have a couple of incidents that the source system cannot close (ie END event). When sending an END event, the event is happily accepted, and nothing happens. Upon further investigation, I tracked it to the following code in EventViewSet.perform_create

        try:
            self.validate_event_type_for_incident(event_type, incident)
        except serializers.ValidationError as e:
            # Allow any event from source systems (as long as the user is allowed to post the event type),
            # even if the posted type is invalid for the incident - like if an `INCIDENT_END` event
            # is sent after the incident has been manually closed
            if not user.is_source_system:
                raise e

In case of an error (ie The incident already has a related event of type 'END'.) the end event is created, but not applied. The associated exception is swallowed. It would be really nice if instead of just dropping the exception, at least a message would be logged.

The incidents that can't be ENDed currently have an end_time == infinity and should be able to be close. Apparently something went wrong during the first END event, although I haven't figured out what yet. Unfortunately our logs don't go back long enough to see what happened the first time the incident was supposed to be ENDed so I can't pinpoint what was the issue there

I think that sending an END event now should still close the incident. Maybe it's better to look at the current end time, and end the incident if end_time==infinity, regardless of whether an END event was previously created. (Perhaps the same holds true for START events, but I'm not sure). Now we've ended up (pun intended) in a situation where an incident cannot be closed by the source system.

So, my suggestions

log a message when an event is accepted but ignored
update the logic on when an end event is excecuted based on its current end_time

The text was updated successfully, but these errors were encountered:

hmpf · 2024-11-13T08:59:56Z

Absolutely there should be logs. I can see several reasons why this happens but we need to log to find others.

A cause I can see: Both END and CLOSED sets end_time. An event can be CLOSED before a source system sets END so then I would say: let the END be created but keep the end_time as is, if it isn't already like that.

Another cause I can see: If end_time is still infinity even if there is an END- or CLOSE-event then we have been bitten by not having been atomic enough (those will be easier once we are rid of the current way of sending notifications. Not this year).

A third cause: race-conditions.

Fix: on a new END-event where there is already an END/CLOSED and end_time is still "infinity":

If there are no REOPEN events, set end_time to the timestamp of the oldest END/CLOSED. If there are REOPEN events, ignore END/CLOSE happening before the latest REOPEN and look at the oldest remaining END/CLOSED. There's hopefully a better way to set a write-lock on the incident-row instead of the "atomic" decorator/context manager.

We could maybe have a data migration to repair existing such in the database, and the actual code doing it should be a utils function so that it can also be run from elsewhere, like a management command or admin.

hmpf · 2024-11-13T09:16:12Z

If however there is already a CLOSED-event, end_time not infinity and another CLOSED-event arrives, that sounds like a bug or a hack-attempt (probably trivial via the admin, it has not been locked down I suspect). Log, including user pk, and toss/raise error.

hmpf linked a pull request Nov 15, 2024 that will close this issue

Fix wrong end_time #946

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Incidents may get stuck open #935

Incidents may get stuck open #935

elfjes commented Nov 7, 2024 •

edited

Loading

hmpf commented Nov 13, 2024

hmpf commented Nov 13, 2024

Incidents may get stuck open #935

Incidents may get stuck open #935

Comments

elfjes commented Nov 7, 2024 • edited Loading

hmpf commented Nov 13, 2024

hmpf commented Nov 13, 2024

elfjes commented Nov 7, 2024 •

edited

Loading