[CDAP-20832] Enable periodic restart when task workers are running concurrent requests #15575

Merged
sahusanket merged 5 commits into develop from feature/CDAP-20832-task-worker-restart on Apr 30, 2024

@arjan-bal arjan-bal commented Apr 1, 2024

CDAP-20832

This PR introduces a deadline for all task worker executions. The task worker now restarts periodically even when "user code isolation" is disabled, i.e. when it is running concurrent requests. When a periodic restart is scheduled, the task worker stops accepting new requests and waits up to the duration set by the new configuration "task.worker.taskExecutionDeadline.second". If all executing tasks finish before then, the task worker restarts immediately; otherwise it restarts when the deadline expires.
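
The drain-then-restart behavior described above can be sketched roughly as follows. This is an illustrative sketch, not the code in this PR: the class `DrainingWorker` and its methods `tryStartRequest`, `finishRequest`, and `drain` are hypothetical names, and the deadline is passed in milliseconds purely for convenience.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: stop accepting requests, then wait for in-flight
// requests to finish or for a deadline to expire, whichever comes first.
final class DrainingWorker {
  private final Object lock = new Object();
  private int running = 0;          // in-flight requests, guarded by lock
  private boolean accepting = true; // false once a restart is scheduled

  /** Registers a new request; returns false if a restart is already scheduled. */
  boolean tryStartRequest() {
    synchronized (lock) {
      if (!accepting) {
        return false;
      }
      running++;
      return true;
    }
  }

  /** Marks one request as finished and wakes up a waiting drain() call. */
  void finishRequest() {
    synchronized (lock) {
      running--;
      lock.notifyAll();
    }
  }

  /**
   * Stops accepting requests and blocks until all in-flight requests finish
   * or the deadline expires. Returns true if every request finished in time.
   * In both cases the caller would restart the worker afterwards.
   */
  boolean drain(long deadlineMillis) {
    long deadlineNanos = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(deadlineMillis);
    synchronized (lock) {
      accepting = false;
      try {
        while (running > 0) {
          long remaining = deadlineNanos - System.nanoTime();
          if (remaining <= 0) {
            return false; // deadline expired with requests still running
          }
          TimeUnit.NANOSECONDS.timedWait(lock, remaining);
        }
        return true; // everything finished, restart can happen immediately
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        return false;
      }
    }
  }
}
```

In this sketch the return value of `drain` only tells the caller whether the restart is "clean" or forced; either way the worker restarts, which matches the two cases in the description.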

@arjan-bal arjan-bal self-assigned this Apr 1, 2024
@arjan-bal arjan-bal added build Triggers github actions build 6.10 labels Apr 1, 2024
@arjan-bal arjan-bal force-pushed the feature/CDAP-20832-task-worker-restart branch from c510b17 to d8c51d2 Compare April 1, 2024 17:28
@arjan-bal arjan-bal force-pushed the feature/CDAP-20832-task-worker-restart branch from d8c51d2 to 7244a3f Compare April 1, 2024 19:28
taskDetails.emitMetrics(succeeded);
runningRequestCount.decrementAndGet();
requestProcessedCount.incrementAndGet();
runningTasks.remove(runningTaskDetails);
Contributor

Do we still need requestProcessedCount now that we have the runningTasks map?

Contributor Author

Reverted most of the changes to use a periodic restart approach. This map is not present anymore.

}
// we restart once ongoing request (which has set runningRequestCount to 1)
// finishes.
mustRestart.set(true);
Contributor

Why do we need both mustRestart and lameDock logic?

Contributor

Here, if there is an ongoing request, we can move the pod into lameDock.
Basically, I'm thinking of changing the lameDock semantics from "a task has timed out" to "the pod needs to restart at the next safe time". WDYT?

Contributor Author

Changed to carry out a periodic restart using the existing periodic restart code. I had used the lameduck concept to ensure the service is restarted only when a task is stuck. In the periodic restart approach, we may be restarting task workers even when no task is stuck.

@arjan-bal arjan-bal changed the title [CDAP-20832] Restart task workers when a task get stuck [CDAP-20832] Enable periodic restart when task workers are running concurrent requests Apr 10, 2024
@arjan-bal arjan-bal force-pushed the feature/CDAP-20832-task-worker-restart branch from a7fc249 to 3e72d2e Compare April 10, 2024 10:14
@arjan-bal arjan-bal requested a review from masoud-io April 10, 2024 10:15
return;
}

if (!enableUserCodeIsolationEnabled
Contributor

Why do we disable the killAfterRequestCount feature when code isolation is disabled? This is a useful feature

Contributor

I think this is the existing behavior, i.e. not changed in this PR.

Are you asking in reference to this PR, or is this a general design/implementation question?

Contributor

Well, we should either change it or change the comments. I believe cdap-defaults does not talk about any dependency.

Contributor

Changed the comment in cdap-default.xml.

PTAL

waitTime,
TimeUnit.SECONDS);

if (lowerBound <= 0) {
Contributor

Why? Did you mean to check the duration?

Contributor

I think this was done as an early exit, for code readability.

Previously it was:

if (lowerBound > 0) {
  // logic
}
// function ends, return
}

Now it's:

if (lowerBound <= 0) {
  return;
}

// logic
// function ends, return
}

Contributor

Can you change it to duration? It looks equivalent (lowerBound = duration*0.9, so lowerBound <= 0 is the same as duration <= 0), but checking the calculated value makes the logic harder to follow.

Contributor

done
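
For readers following this thread, here is a rough sketch of the check being discussed. It is not the code from this PR: the class `RestartDelay` and the method `pickWaitSeconds` are hypothetical, and the assumption that the wait time is drawn at random between `lowerBound` and the full duration is inferred from the `lowerBound = duration*0.9` remark above rather than confirmed by the diff.

```java
import java.util.OptionalLong;
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical sketch of picking a restart wait time with an early exit
// on the configured duration instead of on the derived lower bound.
final class RestartDelay {

  /**
   * Returns a wait time in seconds, or empty if periodic restart is disabled
   * (duration <= 0). The random range [0.9 * duration, duration] is an assumption.
   */
  static OptionalLong pickWaitSeconds(long durationSeconds) {
    // Early exit on the configured value itself, as requested in the review.
    if (durationSeconds <= 0) {
      return OptionalLong.empty();
    }
    long lowerBound = (long) (durationSeconds * 0.9);
    // nextLong's upper bound is exclusive, so add 1 to include the full duration.
    long waitTime = ThreadLocalRandom.current().nextLong(lowerBound, durationSeconds + 1);
    return OptionalLong.of(waitTime);
  }
}
```

One detail worth noting under these assumptions: if the lower bound is truncated to an integer as in this sketch, a duration of 1 gives `lowerBound == 0`, so `lowerBound <= 0` and `duration <= 0` are not strictly equivalent for very small values. Checking the duration directly avoids that edge case.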

@arjan-bal arjan-bal force-pushed the feature/CDAP-20832-task-worker-restart branch from 3e72d2e to 536f1d5 Compare April 11, 2024 18:08
@sahusanket sahusanket requested a review from tivv April 27, 2024 08:51
@arjan-bal arjan-bal removed their assignment Apr 29, 2024
@sahusanket sahusanket merged commit ebb1333 into develop Apr 30, 2024
11 of 12 checks passed
@sahusanket sahusanket deleted the feature/CDAP-20832-task-worker-restart branch April 30, 2024 06:20
@sahusanket sahusanket restored the feature/CDAP-20832-task-worker-restart branch April 30, 2024 08:12
Labels
6.10 build Triggers github actions build
4 participants