Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix wrong end_time #946

Open
wants to merge 14 commits into
base: master
Choose a base branch
from
Open

Fix wrong end_time #946

wants to merge 14 commits into from

Conversation

hmpf
Copy link
Contributor

@hmpf hmpf commented Nov 15, 2024

Hopefylly closes #935

Also, attempt to clean up/document event types, and event type validation in the API.

src/argus/incident/models.py Outdated Show resolved Hide resolved
src/argus/incident/models.py Outdated Show resolved Hide resolved
src/argus/incident/models.py Outdated Show resolved Hide resolved
Comment on lines +527 to +532
if self.stateless_event:
# Weird, stateless event without stateless end_time, fix
self.end_time = None
self.save()
LOG.warn("Mismatch between self %s end_time and event type: set stateless", self.pk)
return True
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like a normalization issue? Two ways of indicating that an event is stateless?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, and end_time = None is the more important/older one.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh well 🤷

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a question what should happen if there is a stateless event but end_time is wrong. Here, I trust the statetless event and unset end_time, but we could also delete the event. Though, then we would have to check that there is an incident-start event etc. etc. etc.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would trust the event log more than the actual event attributes, since these are both timestamped and have an authenticated author - i.e. I think it is OK to repair the end_time value in this unlikely corner case.

def event_already_exists(self, event_type):
return self.events.filter(type=event_type).exists()

def repair_end_time(self) -> Optional[bool]:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make sure that you have prefetched all events here, because otherwise thie function may cause a bunch of round trips to the db

Copy link
Contributor Author

@hmpf hmpf Dec 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure what you mean. This is on the model instance. Prefetch before a loop over instances?

self.events.prefetch_related(..) would work but the only relevant thing to prefetch would be acknowledgments, which is irrelevant for exists(), and might be handled behind the scenes anyway thanks to the OneToOneField.

Did you mean something in repair_end_time?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I meant in repair_end_time, should've made that clearer. It has a bunch of checks based on properties that each filter on self.events, so make sure you're not firing a query for every check. Maybe it's prudent to just collect all events and manually process the list. As long as it's not too large, remembering incidents with 1000s of events

src/argus/incident/models.py Show resolved Hide resolved
Comment on lines 505 to 500
incident.repair_end_time()
abort_due_to_too_many_events(incident, event_type)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is where we would end up in our case. There already is an END event, but the incident is reported open, so we fire another END event. The incident is repaired, so the end_time is correctly set (closing the incident), but we still get a 400 error back like we did something wrong. We're being gaslighted here: "I didn't make a mistake and just fixed it, you made a mistake!" 😂

perhaps look a the return value of the repair?

Copy link
Contributor Author

@hmpf hmpf Dec 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The difficulty is: we must prevent storing the new event since it is redundant after the repair. How do we report that fact back to the API client in the best way?

incident.repair_end_time does not create any events itself.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to report that back? The client sees the incident is open and after their request the incident is closed. The way I see it is: as far as the client is concerned, the request was succesful.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We might have to fake something in the API endpoint then, I'll dig into the drf-code...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll redirect to the GET incidents-endpoint if the client is provably outdated.

src/argus/incident/views.py Show resolved Hide resolved
@hmpf hmpf self-assigned this Nov 28, 2024
@hmpf hmpf force-pushed the fix-wrong-end-time branch 2 times, most recently from bd6b87d to e23c3b9 Compare December 5, 2024 08:30
hmpf added 5 commits December 9, 2024 14:19
/.. also, attempt to clean up/document event types, and event type
validation in the API.
@hmpf hmpf force-pushed the fix-wrong-end-time branch from e23c3b9 to aa56697 Compare December 9, 2024 13:19
@hmpf hmpf requested a review from elfjes December 10, 2024 09:42
@hmpf hmpf added the ddn Design decision needed label Dec 10, 2024
Comment on lines 505 to 500
incident.repair_end_time()
abort_due_to_too_many_events(incident, event_type)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to report that back? The client sees the incident is open and after their request the incident is closed. The way I see it is: as far as the client is concerned, the request was succesful.

src/argus/incident/views.py Show resolved Hide resolved
Comment on lines +536 to +539
if not self.stateful:
# the vital part for statelessness is set correctly
LOG.info("Incident %s: No mismatch, correctly stateless", self.pk)
return
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This implies that you only want to call repair_end_time on stateful events? is that worth documenting?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see how that is implied.

repair_end_time has been designed to be called from anywhere and to do the right thing even if nothing is wrong. It does not assume pre-validation.

If I removed this block, or the comment in it, then I would almost be willing to bet the big bux (I don't bet though, not even LOTTO. Too much knowledge of the right kind of mathematics.) that some time in the future some helpful soul complains that the method is incomplete because it lacks this block or its comment.

Comment on lines 504 to 509
repaired = incident.repair_end_time()
if repaired:
raise AttributeError("end_time mismatch repaired, see logs")
# should never happen
LOG.error("Something weird happened, see other logs")
raise AttributeError("end_time mismatch was in error, see logs")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AttributeError seems like a strange exception to raise here. Perhaps a custom Exception class instead?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See commit add new exceptions

src/argus/incident/models.py Show resolved Hide resolved
raise SuccessfulRepairException("end_time mismatch repaired, see logs")
# should never happen, insufficent preceeding logic construct?
LOG.error("Something weird happened, see other logs")
raise InconceivableException("Found end_time mismatch was in error, see logs")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Exception message seems to have a grammatical error, not sure what you wanted to say here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See new commit

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Uh, which exception message?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Found end_time mismatch was in error, see logs"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(The last line is hidden by a scroll bar.)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh wow 🤦

src/argus/util/exceptions.py Outdated Show resolved Hide resolved
src/argus/util/exceptions.py Show resolved Hide resolved
@hmpf
Copy link
Contributor Author

hmpf commented Dec 10, 2024

For some reason I'm not allowed to reply to @elfjes ' comments on line 507 to 510 in incident/views:

            if event_type in Event.CLOSING_TYPES and not incident.open:
                self._abort_due_to_type_validation_error("The incident is already closed.")
            if event_type == Event.Type.REOPEN and incident.open:
                self._abort_due_to_type_validation_error("The incident is already open.")

I presume these lines are also here for completeness sake just like the stateless checks in repair_end_time..

A source system is not allowed to reopen but that is handled earlier in the process, in validate_event_type_for_user.

If a client tries to reopen something that is correctly reopened already, or close something that is correctly closed already, then the client has an outdated view of the world. The server cannot fix that, or repair anything in itself to fix that, it can only report.

All we can do is either:

  • fail quietly and either ship back
    • nothing
    • the original event that the client doesn't know about
  • what we are currently doing: fail loudly and ship back a ValidationError.
  • fail loudly with something else than ValidationError and I haven't found any good status codes

@elfjes
Copy link
Collaborator

elfjes commented Dec 10, 2024

For some reason I'm not allowed to reply to @elfjes ' comments on line 507 to 510 in incident/views. I presume these lines are also here for completeness sake just like the stateful checks in repair_end_time..

A source system is not allowed to reopen but that is handled earlier in the process, in validate_event_type_for_user.

If a client tries to reopen something that is correctly reopened already, or close something that is correctly closed already, then the client has an outdated view of the world. The server cannot fix that, or repair anything in itself to fix that, it can only report.

All we can do is either:

* fail quietly and either ship back
  
  * nothing
  * the original event that the client doesn't know about

* what we are currently doing: fail loudly and ship back a ValidationError.

Yes, in case there was nothing to repair and it's all the clients fault, a ValidationError would be ok, even though i think failing quietly is a nicer experience. Not sure what would be the value of sending back of the original event; imho the current/actual state of the incident (ie the incident itself) would be more useful.

My comment was on 510 only (keep forgetting that GH adds the other lines and that I thus need to be more explicit in my comments): the text "Found end_time mismatch was in error, see logs" is not gramatically correct

@hmpf
Copy link
Contributor Author

hmpf commented Dec 10, 2024

All we can do is either:

  • fail quietly and either ship back

    • nothing
    • the original event that the client doesn't know about
  • what we are currently doing: fail loudly and ship back a ValidationError.

Yes, in case there was nothing to repair and it's all the clients fault, a ValidationError would be ok, even though i think failing quietly is a nicer experience. Not sure what would be the value of sending back of the original event; imho the current/actual state of the incident (ie the incident itself) would be more useful.

The endpoint returns a serialized copy of the newly created event if everything is alright, having it return a serialized incident after a repair makes no sense. After the repair it needs to re-fetch the incident, yes, so it needs to know that the repair happened.

We could change the type of the ValidationError: Now it sends {"type": message}. We could make it send {"repair": message}.

My comment was on 510 only (keep forgetting that GH adds the other lines and that I thus need to be more explicit in my comments): the text "Found end_time mismatch was in error, see logs" is not gramatically correct

Copy link
Member

@lunkwill42 lunkwill42 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was interrupted in the middle of this review and had to leave, so this covers no more than half the PR (and may contain thoughts that are not fully formed yet)

"""Repairs end_time if there is a mismatch between events and end_time
This can happen under race-conditions and because we still cannot use
the ``atomic``-decorator everwhere.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
the ``atomic``-decorator everwhere.
the ``atomic``-decorator everywhere.

* False if it was stateful and ok
* None if it was stateless and ok
"""
LOG.info("Incident %s: Detected potential mismatch of end_time and events", self.pk)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the first line of the method. Where did the mentioned detection take place?

Comment on lines +527 to +532
if self.stateless_event:
# Weird, stateless event without stateless end_time, fix
self.end_time = None
self.save()
LOG.warn("Mismatch between self %s end_time and event type: set stateless", self.pk)
return True
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would trust the event log more than the actual event attributes, since these are both timestamped and have an authenticated author - i.e. I think it is OK to repair the end_time value in this unlikely corner case.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ddn Design decision needed
Projects
Status: Ready for review
Development

Successfully merging this pull request may close these issues.

Incidents may get stuck open
3 participants