adds attempt to call failure callback on internal exception; adds tes… #23

Open

cgardens wants to merge 2 commits into master

Conversation

@cgardens cgardens commented Dec 7, 2018

  • adds attempt to call failure callback on internal exception
  • adds tests to JobletProcessHandler

in the case where the call to jobletStatusManager.getStatus(identifier) fails, the process dies without attempting to call failure callbacks. i don't think this behavior is very intuitive, because it does not allow the caller to realize that there's a problem. i'd prefer that in such a case daemon tries to call my failure callback. what do you guys think?

try {
  status = jobletStatusManager.getStatus(identifier);
} catch (Exception e) {
  status = JobletStatus.IN_PROGRESS;
}
Contributor

I do not have full context on the goals here, but I am highly skeptical of eaten exceptions -- can we LOG.error the exception, even if we aren't re-throwing it?
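
For illustration, a minimal sketch of that suggestion against the snippet above (the LOG field is assumed here, not taken from JobletProcessHandler):

try {
  status = jobletStatusManager.getStatus(identifier);
} catch (Exception e) {
  // log the failure so it is visible, even though we still swallow the exception
  LOG.error("Failed to read status for joblet " + identifier, e);
  status = JobletStatus.IN_PROGRESS;
}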

Contributor

Depending on what the parent code does, it might even make sense to catch this exception, run the failure callback, and then re-throw the exception, in case you want to mark the handler as "broken" (so it doesn't keep getting work to do and failing it).
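
A rough sketch of that shape, where runFailureCallback() is a hypothetical stand-in for however the handler invokes its failure callback:

try {
  status = jobletStatusManager.getStatus(identifier);
} catch (Exception e) {
  // attempt the failure callback, then surface the original exception
  runFailureCallback();
  throw e;
}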

Contributor

Basically agreeing with Ben here, I think the correct way to do this is to have the catch block try to run the failure callback, rather than messing with control flow in this way. You can, inside the catch block, wrap the failure callback call and throw some explicit exception type for when it's the failure callback itself that's doing the failing.
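
Building on the earlier sketch, that might look like the following; runFailureCallback() and FailureCallbackException are hypothetical names used only to illustrate the shape:

try {
  status = jobletStatusManager.getStatus(identifier);
} catch (Exception e) {
  try {
    runFailureCallback();
  } catch (Exception callbackFailure) {
    // distinguish "the failure callback itself failed" from the original failure
    throw new FailureCallbackException(callbackFailure);
  }
  throw e;
}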

Author

i definitely hear you on the flow control. one of the reasons i was a little reluctant to do this is that if i just 1) catch the exception, 2) attempt to run the failure callback, 3) re-throw the same exception or a new one, then

        jobletStatusManager.remove(identifier);
        configStorage.deleteConfig(identifier);

are not going to get called, so there's some additional clean up that's not going to happen. that being said, that's no worse than what currently happens in this case, so i think you guys are right. i'll make this change.

Author

i just pushed the change. lmk if this addresses your concerns.

@cgardens

cgardens commented Dec 7, 2018

was thinking about this a bit more last night and one concern that i have is that this is a change in behavior that may not be intuitive in all cases.

in the case where the status file is empty and throws an exception, as in the case described above, if the status is actually DONE then this change is going to incorrectly mark the job as failed (or call the failure callback). when i was writing this pr i was assuming that if the status was unreachable then the status was effectively not done. i still think this change makes sense and is a more conservative option than what we previously had, but i do want to call out that it affects this other edge case.

* still throws an exception on failure, but also attempt failure callback

* throws explicit exception if the failure callback fails

* adjusts tests to match new behavior
@pwestling (Contributor)

I hear what you're saying, but I think the only reasonable way for the code to react here is: if we can't determine that something is done, we determine that it has failed. idk to what extent we are good about this now, but our callbacks should be written with these transactional semantics in mind - a request should only count as done if its success callback executes, and it should never leave anything around that's a problem if its failure callback executes, no matter how far it has gotten. Additionally, failure callbacks have to be idempotent. Transactional semantics can be unintuitive, but I think they're important enough for people to try to change their intuitions on.

Maybe worth doing some broader dev outreach about failures and the difficulties of handling them and thinking about transactions if we don't think people are thinking about this the right way.
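
To make the idempotency point concrete, here is a hypothetical failure callback written so that running it twice is harmless; none of these names come from daemon itself:

// Hypothetical illustration of an idempotent failure callback.
public class MarkRequestFailedCallback {

  public interface RequestStore {
    void setStatus(String requestId, String status); // absolute write, not a transition
    void deleteIfExists(String path);                // no-op when already gone
  }

  private final RequestStore store;

  public MarkRequestFailedCallback(RequestStore store) {
    this.store = store;
  }

  public void onFailure(String requestId) {
    // absolute writes and "delete if exists" make a second invocation a no-op
    store.setStatus(requestId, "FAILED");
    store.deleteIfExists(requestId + ".tmp");
  }
}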

@pwestling (Contributor)

The current change actually doesn't address all of the bugs that are in this section. For example, if the success callback fails, we never try to execute the failure callback. In my head I think we restructure it like:

try {
  if (getStatus() == DONE) {
    successCallback();
    cleanup();
    return;
  } else {
    failureCallback();
    cleanup();
    return;
  }
} catch (Exception e) {
  try {
    failureCallback();
    cleanup();
  } catch (Exception e2) {
    // the failure callback (or cleanup) itself failed
    throw new PotentialInconsistentStateException(e2);
  }
  throw new DaemonException(e);
}

PotentialInconsistentStateException should maybe not be a DaemonException and should cause the daemon to crash entirely? Not certain on that part. The unintuitive but still necessary things that can happen here are:

  1. failure callback executes twice
  2. success callback executes, then failure callback

For 1 we need to write code so that this is fine, and 2 is definitely trickier, but it may mean that we should also be writing our failure callbacks to have no effect on things whose successes have already executed, for "true" correctness. Old FPS actually used to do this, though it resulted in as much pain as breakage prevention, I think.
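
As an illustration of point 2, a failure callback can also be written to refuse to undo a completed success (again hypothetical, and assuming the store can report a request's recorded status):

// Hypothetical: a failure callback that becomes a no-op once success has been recorded.
public void onFailure(String requestId) {
  if ("SUCCEEDED".equals(store.getStatus(requestId))) {
    return; // the success callback already ran; leave its effects alone
  }
  store.setStatus(requestId, "FAILED");
}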

@cgardens (Author)

got it. additions you're proposing:

  1. on failure of a success callback we should try the failure callback
  2. on failure of a failure callback we should try the failure callback again
  3. on failure of cleanup() we should trigger the failure callback and then try to clean up again.

does that feel like an accurate summation of what you're suggesting?

i don't think i'm super opinionated on 1 & 2. i can see arguments for both sides. failures in callbacks are already a pain point in daemon and either solution has tradeoffs. if we feel the contract is that a request has not succeeded until all of the user-provided code (including its callbacks) has succeeded, then what you suggest seems reasonable. it definitely is a bit unintuitive to have a callback double trigger though.

i don't think i agree with you on 3--though i may be misunderstanding your position. i think of cleanup as internal daemon bookkeeping. its failure shouldn't reflect on whether the request that daemon was processing was successful. to use a trite analogy, if i were a sculptor and i made you a statue, gave it to you, and then broke my back while cleaning up my shop after the job, i wouldn't then come to your house and smash up the sculpture too. this seems different to me than a callback failing. the callbacks are user-provided code, so if any of the code the user provides us fails, it doesn't seem unreasonable to say their job failed. that feels different from daemon not cleaning up some files on the file system.

@pwestling (Contributor)

For 3, I was assuming that failure to clean that stuff up would leave the possibility that the daemon finds the request again in the future and runs callbacks on it again. I think this is true for ForkedDaemonExecutor, but I could be wrong. If I'm right, then the request is not properly completed until you can guarantee that it won't later fail, and thus failure of cleanup is a failure. If I'm wrong then the cleanup should not be included.

@cgardens (Author)

cool. i'm going to dig in and figure out if that's true.

@cgardens (Author)

i think technically speaking, not cleaning up these resources given the current implementation does not risk a request getting picked up again. though given how much digging i had to do, i may have persuaded myself that your approach is the more conservative (and perhaps wiser) one.

here's what i found. in JobletProcessHandler there are two resources that get cleaned up:
the status manager (and underlying persistence) and the config storage. i'll outline why i think it's not required that these be cleaned up for a request to be considered complete.

  1. as far as i can tell the JobletStatusManager is only used in the ProcessJobletHandler (passed in from JobletExecutors) and ForkedJobletRunner.
    a) in ProcessJobletHandler, it exists only to check status of the joblet onRemove to determine if we should call the success or failure callback (aka the code we've been talking about).
    b) in ForkedJobletRunner it's just set to in progress right before we try to run the joblet and then set it to done if no exception is thrown.
    it's not used at all for picking up new configs as far as i can tell, just for deciding on the success of a joblet as it's spinning down. if a config in the future uses the exact same identifier as a non-cleaned-up status entry, in the DefaultJobletStatusManager case the old one will just get overwritten. for the kube version (KubernetesJobletStatusManagerWrapper) i think that's also true, but i'm not familiar enough with the gcp storage api to say definitively.

  2. the config storage is used throughout the library, but never for figuring out if we should start a new run as far as i can tell. similarly, everything that i can find that implements JobletConfigStorage appears to obliterate an existing config if it exists when a new config comes in, so it should be safe.

so it's happenstance that the implementations of JobletStatusManager and JobletConfigStorage happen to overwrite any old data they've persisted, but it's certainly not an explicit part of the contract. so i'm open to the argument that it's better to just be more conservative here and fail the request if cleanup doesn't happen.

the other thing giving me pause is that empirically there have been cases where an old serialized config has broken daemon entirely (here's one of your prs to make this less painful: #14). the model i've described doesn't account for something like that happening, since i think that old configs are nuked when new ones come in. so either there's a hole in my investigation or the deserialization problem doesn't relate to this PR in the way i thought it did. i can't find any stack traces of the old problem to verify one way or another. i can attempt to induce it locally to assuage my concern, but wanted to check in with you first.
