-
Notifications
You must be signed in to change notification settings - Fork 14
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix wrong end_time #946
base: master
Are you sure you want to change the base?
Fix wrong end_time #946
Conversation
d6468ba
to
95a3ba7
Compare
if self.stateless_event: | ||
# Weird, stateless event without stateless end_time, fix | ||
self.end_time = None | ||
self.save() | ||
LOG.warn("Mismatch between self %s end_time and event type: set stateless", self.pk) | ||
return True |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems like a normalization issue? Two ways of indicating that an event is stateless?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, and end_time = None
is the more important/older one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh well 🤷
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is a question what should happen if there is a stateless event but end_time is wrong. Here, I trust the statetless event and unset end_time, but we could also delete the event. Though, then we would have to check that there is an incident-start event etc. etc. etc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would trust the event log more than the actual event attributes, since these are both timestamped and have an authenticated author - i.e. I think it is OK to repair the end_time
value in this unlikely corner case.
def event_already_exists(self, event_type): | ||
return self.events.filter(type=event_type).exists() | ||
|
||
def repair_end_time(self) -> Optional[bool]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Make sure that you have prefetched all events here, because otherwise thie function may cause a bunch of round trips to the db
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure what you mean. This is on the model instance. Prefetch before a loop over instances?
self.events.prefetch_related(..)
would work but the only relevant thing to prefetch would be acknowledgments, which is irrelevant for exists()
, and might be handled behind the scenes anyway thanks to the OneToOneField.
Did you mean something in repair_end_time
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I meant in repair_end_time
, should've made that clearer. It has a bunch of checks based on properties that each filter on self.events, so make sure you're not firing a query for every check. Maybe it's prudent to just collect all events and manually process the list. As long as it's not too large, remembering incidents with 1000s of events
src/argus/incident/views.py
Outdated
incident.repair_end_time() | ||
abort_due_to_too_many_events(incident, event_type) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is where we would end up in our case. There already is an END event, but the incident is reported open, so we fire another END event. The incident is repaired, so the end_time is correctly set (closing the incident), but we still get a 400 error back like we did something wrong. We're being gaslighted here: "I didn't make a mistake and just fixed it, you made a mistake!" 😂
perhaps look a the return value of the repair?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The difficulty is: we must prevent storing the new event since it is redundant after the repair. How do we report that fact back to the API client in the best way?
incident.repair_end_time
does not create any events itself.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to report that back? The client sees the incident is open and after their request the incident is closed. The way I see it is: as far as the client is concerned, the request was succesful.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We might have to fake something in the API endpoint then, I'll dig into the drf-code...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll redirect to the GET incidents-endpoint if the client is provably outdated.
bd6b87d
to
e23c3b9
Compare
/.. also, attempt to clean up/document event types, and event type validation in the API.
e23c3b9
to
aa56697
Compare
src/argus/incident/views.py
Outdated
incident.repair_end_time() | ||
abort_due_to_too_many_events(incident, event_type) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to report that back? The client sees the incident is open and after their request the incident is closed. The way I see it is: as far as the client is concerned, the request was succesful.
if not self.stateful: | ||
# the vital part for statelessness is set correctly | ||
LOG.info("Incident %s: No mismatch, correctly stateless", self.pk) | ||
return |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This implies that you only want to call repair_end_time on stateful events? is that worth documenting?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see how that is implied.
repair_end_time
has been designed to be called from anywhere and to do the right thing even if nothing is wrong. It does not assume pre-validation.
If I removed this block, or the comment in it, then I would almost be willing to bet the big bux (I don't bet though, not even LOTTO. Too much knowledge of the right kind of mathematics.) that some time in the future some helpful soul complains that the method is incomplete because it lacks this block or its comment.
src/argus/incident/views.py
Outdated
repaired = incident.repair_end_time() | ||
if repaired: | ||
raise AttributeError("end_time mismatch repaired, see logs") | ||
# should never happen | ||
LOG.error("Something weird happened, see other logs") | ||
raise AttributeError("end_time mismatch was in error, see logs") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AttributeError seems like a strange exception to raise here. Perhaps a custom Exception class instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See commit add new exceptions
raise SuccessfulRepairException("end_time mismatch repaired, see logs") | ||
# should never happen, insufficent preceeding logic construct? | ||
LOG.error("Something weird happened, see other logs") | ||
raise InconceivableException("Found end_time mismatch was in error, see logs") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Exception message seems to have a grammatical error, not sure what you wanted to say here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See new commit
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Uh, which exception message?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"Found end_time mismatch was in error, see logs"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(The last line is hidden by a scroll bar.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh wow 🤦
For some reason I'm not allowed to reply to @elfjes ' comments on line 507 to 510 in incident/views:
I presume these lines are also here for completeness sake just like the stateless checks in repair_end_time.. A source system is not allowed to reopen but that is handled earlier in the process, in If a client tries to reopen something that is correctly reopened already, or close something that is correctly closed already, then the client has an outdated view of the world. The server cannot fix that, or repair anything in itself to fix that, it can only report. All we can do is either:
|
Yes, in case there was nothing to repair and it's all the clients fault, a ValidationError would be ok, even though i think failing quietly is a nicer experience. Not sure what would be the value of sending back of the original event; imho the current/actual state of the incident (ie the incident itself) would be more useful. My comment was on 510 only (keep forgetting that GH adds the other lines and that I thus need to be more explicit in my comments): the text |
The endpoint returns a serialized copy of the newly created event if everything is alright, having it return a serialized incident after a repair makes no sense. After the repair it needs to re-fetch the incident, yes, so it needs to know that the repair happened. We could change the type of the ValidationError: Now it sends
|
Quality Gate passedIssues Measures |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was interrupted in the middle of this review and had to leave, so this covers no more than half the PR (and may contain thoughts that are not fully formed yet)
"""Repairs end_time if there is a mismatch between events and end_time | ||
|
||
This can happen under race-conditions and because we still cannot use | ||
the ``atomic``-decorator everwhere. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the ``atomic``-decorator everwhere. | |
the ``atomic``-decorator everywhere. |
* False if it was stateful and ok | ||
* None if it was stateless and ok | ||
""" | ||
LOG.info("Incident %s: Detected potential mismatch of end_time and events", self.pk) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the first line of the method. Where did the mentioned detection take place?
if self.stateless_event: | ||
# Weird, stateless event without stateless end_time, fix | ||
self.end_time = None | ||
self.save() | ||
LOG.warn("Mismatch between self %s end_time and event type: set stateless", self.pk) | ||
return True |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would trust the event log more than the actual event attributes, since these are both timestamped and have an authenticated author - i.e. I think it is OK to repair the end_time
value in this unlikely corner case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This PR mixes a bit of refactoring in with a bit of "fixing"; it might have been easier to read as two separate PRs.
Nevertheless, the idea, the fix and the refactor seems solid to me.
Without spending to much time thinking of the actual implications, I think that if a client posts redundant END
events, it doesn't matter whether why the client did it, I think it should still receive an error response, since no new event was actually created (regardless of whether any kind of repair took place or not).
(oh, and the tests are also failing, at least here on GH)
Ensure that `end_time` is correct according to state: `None` for stateless, | ||
less than datetime.max for closed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ensure that `end_time` is correct according to state: `None` for stateless, | |
less than datetime.max for closed. | |
Ensure that an incident's `end_time` is correct according to its state: `None` for stateless, | |
less than datetime.max for closed. |
self._abort_due_to_type_validation_error(f"A source system cannot post events of type '{event_type}'.") | ||
else: | ||
if event_type not in Event.ALLOWED_TYPES_FOR_END_USERS: | ||
self._raise_type_validation_error(f"An end user cannot post events of type '{event_type}'.") | ||
self._abort_due_to_type_validation_error(f"An end user cannot post events of type '{event_type}'.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure I like this rename. I find myself questioning what "abort" means in this context, and I find I have to read the called function to understand it.
I find raise
to be pretty clear: raise
is a Python keyword with well-known operational implications, abort
is not. The intent of the called function seems to be to raise an exception if some validation rule fails.
|
||
|
||
class InconceivableException(Exception): | ||
# This should not happen, ever! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
# This should not happen, ever! | |
"""This should not happen, ever!""" |
# Something is missing or doubled up in the database, | ||
# or has been accidentally deleted | ||
# It cannot be fixed automatically so we must bail out |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
# Something is missing or doubled up in the database, | |
# or has been accidentally deleted | |
# It cannot be fixed automatically so we must bail out | |
"""Something is missing or doubled up in the database, or has been accidentally deleted. | |
It cannot be fixed automatically so we must bail out. | |
""" |
Hopefylly closes #935
Also, attempt to clean up/document event types, and event type validation in the API.