adds attempt to call failure callback on internal exception; adds tes… #23

Open

cgardens wants to merge 2 commits into master

Conversation

@cgardens cgardens commented Dec 7, 2018

  • adds attempt to call failure callback on internal exception
  • adds tests to JobletProcessHandler

in the case where the call to jobletStatusManager.getStatus(identifier) fails, the process dies without attempting to call failure callbacks. i don't think this behavior is very intuitive, because it does not allow the caller to realize that there's a problem. i'd prefer that in such a case daemon tries to call my failure callback. what do you guys think?

try {
  status = jobletStatusManager.getStatus(identifier);
} catch (Exception e) {
  status = JobletStatus.IN_PROGRESS;
}
Contributor

I do not have full context on the goals here, but I am highly skeptical of eaten exceptions -- can we LOG.error the exception, even if we aren't re-throwing it?
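
For illustration, a minimal sketch of that suggestion against the snippet above (the LOG field is assumed here, not taken from JobletProcessHandler):

try {
  status = jobletStatusManager.getStatus(identifier);
} catch (Exception e) {
  // log the failure so it is visible, even though we still swallow the exception
  LOG.error("Failed to read status for joblet " + identifier, e);
  status = JobletStatus.IN_PROGRESS;
}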

Contributor

Depending on what the parent code does, it might even make sense to catch this exception, run the failure callback, and then re-throw the exception, in case you want to mark the handler as "broken" (so it doesn't keep getting work to do and failing it).
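
A rough sketch of that shape, where runFailureCallback() is a hypothetical stand-in for however the handler invokes its failure callback:

try {
  status = jobletStatusManager.getStatus(identifier);
} catch (Exception e) {
  // attempt the failure callback, then surface the original exception
  runFailureCallback();
  throw e;
}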

Contributor

Basically agreeing with Ben here, I think the correct way to do this is to have the catch block try to run the failure callback, rather than messing with control flow in this way. You can, inside the catch block, wrap the failure callback call and throw some explicit exception type for when it's the failure callback itself that's doing the failing.
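
Building on the earlier sketch, that might look like the following; runFailureCallback() and FailureCallbackException are hypothetical names used only to illustrate the shape:

try {
  status = jobletStatusManager.getStatus(identifier);
} catch (Exception e) {
  try {
    runFailureCallback();
  } catch (Exception callbackFailure) {
    // distinguish "the failure callback itself failed" from the original failure
    throw new FailureCallbackException(callbackFailure);
  }
  throw e;
}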

Author

i definitely hear you on the flow control. one of the reasons i was a little reluctant to do this is that if i just 1) catch the exception, 2) attempt to run the failure callback, 3) re-throw the same exception or a new one, then

        jobletStatusManager.remove(identifier);
        configStorage.deleteConfig(identifier);

are not going to get called, so there's some additional clean up that's not going to happen. that being said, that's no worse than what currently happens in this case, so i think you guys are right. i'll make this change.

Author

i just pushed the change. lmk if this addresses your concerns.

@cgardens

cgardens commented Dec 7, 2018

was thinking about this a bit more last night and one concern that i have is that this is a change in behavior that may not be intuitive in all cases.

in the case where the status file is empty and throws an exception, as in the case described above, if the status is actually DONE then this change is going to incorrectly mark the job as failed (or call the failure callback). when i was writing this pr i was assuming that if the status was unreachable then the status was effectively not done. i still think this change makes sense and is a more conservative option than what we previously had, but i do want to call out that it affects this other edge case.

* still throws an exception on failure, but also attempt failure callback

* throws explicit exception if the failure callback fails

* adjusts tests to match new behavior
@pwestling (Contributor)

I hear what you're saying, but I think the only reasonable way for the code to react here is: if we can't determine that something is done, we determine that it has failed. idk to what extent we are good about this now, but our callbacks should be written with these transactional semantics in mind - a request should only count as done if its success callback executes, and it should never leave anything around that's a problem if its failure callback executes, no matter how far it has gotten. Additionally, failure callbacks have to be idempotent. Transactional semantics can be unintuitive, but I think they're important enough for people to try to change their intuitions on.

Maybe worth doing some broader dev outreach about failures and the difficulties of handling them and thinking about transactions if we don't think people are thinking about this the right way.
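
To make the idempotency point concrete, here is a hypothetical failure callback written so that running it twice is harmless; none of these names come from daemon itself:

// Hypothetical illustration of an idempotent failure callback.
public class MarkRequestFailedCallback {

  public interface RequestStore {
    void setStatus(String requestId, String status); // absolute write, not a transition
    void deleteIfExists(String path);                // no-op when already gone
  }

  private final RequestStore store;

  public MarkRequestFailedCallback(RequestStore store) {
    this.store = store;
  }

  public void onFailure(String requestId) {
    // absolute writes and "delete if exists" make a second invocation a no-op
    store.setStatus(requestId, "FAILED");
    store.deleteIfExists(requestId + ".tmp");
  }
}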

@pwestling (Contributor)

The current change actually doesn't address all of the bugs that are in this section. For example, if the success callback fails, we never try to execute the failure callback. In my head I think we restructure it like:

try {
  if (getStatus() == DONE) {
    successCallback();
    cleanup();
    return;
  } else {
    failureCallback();
    cleanup();
    return;
  }
} catch (Exception e) {
  try {
    failureCallback();
    cleanup();
  } catch (Exception e2) {
    // the failure callback (or cleanup) itself failed
    throw new PotentialInconsistentStateException(e2);
  }
  throw new DaemonException(e);
}

PotentialInconsistentStateException should maybe not be a DaemonException and should cause the daemon to crash entirely? Not certain on that part. The unintuitive but still necessary things that can happen here are:

  1. failure callback executes twice
  2. success callback executes, then failure callback

For 1 we need to write code so that this is fine, and 2 is definitely trickier, but it may mean that we should also be writing our failure callbacks to have no effect on things whose successes have already executed, for "true" correctness. Old FPS actually used to do this, though it resulted in as much pain as breakage prevention, I think.
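
As an illustration of point 2, a failure callback can also be written to refuse to undo a completed success (again hypothetical, and assuming the store can report a request's recorded status):

// Hypothetical: a failure callback that becomes a no-op once success has been recorded.
public void onFailure(String requestId) {
  if ("SUCCEEDED".equals(store.getStatus(requestId))) {
    return; // the success callback already ran; leave its effects alone
  }
  store.setStatus(requestId, "FAILED");
}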

@cgardens (Author)

got it. additions you're proposing:

  1. on failure of a success callback we should try the failure callback
  2. on failure of a failure callback we should try the failure callback again
  3. on failure of cleanup() we should trigger the failure callback and then try to clean up again.

does that feel like an accurate summation of what you're suggesting?

i don't think i'm super opinionated on 1 & 2. i can see arguments for both sides. failures in callbacks are already a pain point in daemon and either solution has tradeoffs. if we feel the contract is that a request has not succeeded until all of the user-provided code (including its callbacks) has succeeded, then what you suggest seems reasonable. it definitely is a bit unintuitive to have a callback double trigger though.

i don't think i agree with you on 3--though i may be misunderstanding your position. i think of cleanup as internal daemon bookkeeping. its failure shouldn't reflect on whether the request that daemon was processing was successful. to use a trite analogy, if i were a sculptor and i made you a statue, gave it to you, and then broke my back while cleaning up my shop after the job, i wouldn't then come to your house and smash up the sculpture too. this seems different to me than a callback failing. the callbacks are user-provided code, so if any of the code the user provides us fails, it doesn't seem unreasonable to say their job failed. that feels different from daemon not cleaning up some files on the file system.

@pwestling (Contributor)

For 3, I was assuming that failure to clean that stuff up would leave the possibility that the daemon finds the request again in the future and runs callbacks on it again. I think this is true for ForkedDaemonExecutor, but I could be wrong. If I'm right, then the request is not properly completed until you can guarantee that it won't later fail, and thus failure of cleanup is a failure. If I'm wrong then the cleanup should not be included.

@cgardens (Author)

cool. i'm going to dig in and figure out if that's true.

@cgardens (Author)

i think technically speaking, not cleaning up these resources given the current implementation does not risk a request getting picked up again. though given how much digging i had to do, i may have persuaded myself that your approach is the more conservative (and perhaps wiser) one.

here's what i found. in JobletProcessHandler there are two resources that get cleaned up:
the status manager (and underlying persistence) and the config storage. i'll outline why i think it's not required that these be cleaned up for a request to be considered complete.

  1. as far as i can tell the JobletStatusManager is only used in the ProcessJobletHandler (passed in from JobletExecutors) and ForkedJobletRunner.
    a) in ProcessJobletHandler, it exists only to check status of the joblet onRemove to determine if we should call the success or failure callback (aka the code we've been talking about).
    b) in ForkedJobletRunner it's just set to in progress right before we try to run the joblet and then set it to done if no exception is thrown.
    it's not used at all for picking up new configs as far as i can tell, just for deciding on the success of a joblet as it's spinning down. if a config in the future uses the exact same identifier as a non-cleaned-up status entry, in the DefaultJobletStatusManager case the old one will just get overwritten. for the kube version (KubernetesJobletStatusManagerWrapper) i think that's also true, but i'm not familiar enough with the gcp storage api to say definitively.

  2. the config storage is used throughout the library, but never for figuring out if we should start a new run as far as i can tell. similarly, everything that i can find that implements JobletConfigStorage appears to obliterate an existing config if it exists when a new config comes in, so it should be safe.

so it's happenstance that the implementations of JobletStatusManager and JobletConfigStorage happen to overwrite any old data they've persisted, but it's certainly not an explicit part of the contract. so i'm open to the argument that it's better to just be more conservative here and fail the request if cleanup doesn't happen.

the other thing giving me pause is that empirically there have been cases where an old serialized config has broken daemon entirely (here's one of your prs to make this less painful: #14). the model i've described doesn't account for something like that happening, since i think that old configs are nuked when new ones come in. so either there's a hole in my investigation or the deserialization problem doesn't relate to this PR in the way i thought it did. i can't find any stack traces of the old problem to verify one way or another. i can attempt to induce it locally to assuage my concern, but wanted to check in with you first.
