-
Notifications
You must be signed in to change notification settings - Fork 344
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[CDAP-20832] Enable periodic restart when task workers are running concurrent requests #15575
Conversation
c510b17
to
d8c51d2
Compare
d8c51d2
to
7244a3f
Compare
taskDetails.emitMetrics(succeeded); | ||
runningRequestCount.decrementAndGet(); | ||
requestProcessedCount.incrementAndGet(); | ||
runningTasks.remove(runningTaskDetails); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we still need requestProcessedCount now that we have runningTaks map?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reverted most of the changes to use a periodic restart approach. This map is not present anymore.
} | ||
// we restart once ongoing request (which has set runningRequestCount to 1) | ||
// finishes. | ||
mustRestart.set(true); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need both mustRestart and lameDock logic?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here, if there is an ongoing request, we can move the pod into lameDock.
Basically, I'm thinking of changing lameDock semantic from a task has timeout to pod needs to restart in the next safest time. WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed to carry out a periodic restart using the existing periodic restart code. I had used the lameduck concept to ensure the service is restarted only when a task is stuck. In the periodic restart approach, we may be restarting task workers even when no task is stuck.
This reverts commit 7244a3f.
a7fc249
to
3e72d2e
Compare
...-common/src/main/java/io/cdap/cdap/common/internal/remote/TaskWorkerHttpHandlerInternal.java
Outdated
Show resolved
Hide resolved
return; | ||
} | ||
|
||
if (!enableUserCodeIsolationEnabled |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we disable the killAfterRequestCount feature when code isolation is disabled? This is a useful feature
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is the existing behavior, i.e. not changed in this PR.
Are you asking in reference to this PR? or a design/implementation level question in general ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, we should either change it or change the comments. I believe cdap-defaults does not talk about any dependency.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
changed the comment in cdap-default.xml
PTAL
waitTime, | ||
TimeUnit.SECONDS); | ||
|
||
if (lowerBound <= 0) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why? did you mean to check the duration?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this was done for early exit in terms of code readability ?
Previously it was.
if (lowerBound > 0) {
//logic
}
//Function ends return
}
Now it's
if (lowerBound <= 0) {
return
}
//logic
//Function ends return
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you change it to duration? I mean, it looks equivalent (lowerBound = duration*0.9, so lowerBound <= 0 same as duration <= 0), but it's harder to understand the logic of why are we checking the calculated value.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
...-common/src/main/java/io/cdap/cdap/common/internal/remote/TaskWorkerHttpHandlerInternal.java
Outdated
Show resolved
Hide resolved
...-common/src/main/java/io/cdap/cdap/common/internal/remote/TaskWorkerHttpHandlerInternal.java
Outdated
Show resolved
Hide resolved
3e72d2e
to
536f1d5
Compare
CDAP-20832
This PR introduces a deadline for all task worker executions. The task worker now does a periodic restart even when "user code isolation" is disabled, i.e. its running concurrent requests. When a periodic restart is scheduled, the task worker stops accepting new requests. It waits for the new configuration "task.worker.taskExecutionDeadline.second" time to elapse. If all executing tasks finish, the task worker is restarted immediately. Otherwise the task worker is restarted after the deadline expires.