[CDAP-20832] Enable periodic restart when task workers are running concurrent requests #15575

arjan-bal · 2024-04-01T17:27:08Z

CDAP-20832

This PR introduces a deadline for all task worker executions. The task worker now restarts periodically even when "user code isolation" is disabled, i.e. when it is running concurrent requests. When a periodic restart is scheduled, the task worker stops accepting new requests and waits up to the duration set by the new configuration "task.worker.taskExecutionDeadline.second". If all executing tasks finish before then, the task worker restarts immediately; otherwise it restarts once the deadline expires.
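
The drain-then-restart behavior described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the class `DrainingRestarter` and its methods `tryStart`, `finish`, and `drain` are hypothetical names, and the deadline is passed in milliseconds purely for convenience.

```java
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch of "stop accepting, then wait for tasks or a deadline". */
class DrainingRestarter {
  private boolean accepting = true;
  private int running = 0;

  /** Registers a new request; returns false once a restart has been scheduled. */
  synchronized boolean tryStart() {
    if (!accepting) {
      return false;
    }
    running++;
    return true;
  }

  /** Marks one request as finished and wakes up a waiting drain() call. */
  synchronized void finish() {
    running--;
    notifyAll();
  }

  /**
   * Stops accepting requests and waits until all running requests finish
   * or the deadline elapses. Returns true if everything finished in time.
   * The caller is expected to restart the worker in either case.
   */
  synchronized boolean drain(long deadlineMillis) {
    accepting = false;
    long end = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(deadlineMillis);
    try {
      while (running > 0) {
        long remainingNanos = end - System.nanoTime();
        if (remainingNanos <= 0) {
          return false; // deadline expired with tasks still running
        }
        // Releases the monitor while waiting so finish() can run.
        TimeUnit.NANOSECONDS.timedWait(this, remainingNanos);
      }
      return true;
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      return false;
    }
  }
}
```

In this sketch the return value of `drain` only tells the caller whether tasks were cut off by the deadline; the restart itself happens regardless.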

masoud-io · 2024-04-09T20:35:01Z

...-common/src/main/java/io/cdap/cdap/common/internal/remote/TaskWorkerHttpHandlerInternal.java

        taskDetails.emitMetrics(succeeded);
        runningRequestCount.decrementAndGet();
        requestProcessedCount.incrementAndGet();
+        runningTasks.remove(runningTaskDetails);


Do we still need requestProcessedCount now that we have the runningTasks map?

Reverted most of the changes to use a periodic restart approach. This map is not present anymore.

masoud-io · 2024-04-09T20:43:31Z

...-common/src/main/java/io/cdap/cdap/common/internal/remote/TaskWorkerHttpHandlerInternal.java

+          }
+          // we restart once ongoing request (which has set runningRequestCount to 1)
+          // finishes.
+          mustRestart.set(true);


Why do we need both mustRestart and lameDock logic?

Here, if there is an ongoing request, we can move the pod into lameDock.
Basically, I'm thinking of changing the lameDock semantics from "a task has timed out" to "the pod needs to restart at the next safe time". WDYT?

Changed to carry out a periodic restart using the existing periodic restart code. I had used the lameduck concept to ensure the service is restarted only when a task is stuck. In the periodic restart approach, we may be restarting task workers even when no task is stuck.

This reverts commit 7244a3f.

tivv · 2024-04-10T23:20:28Z

...-common/src/main/java/io/cdap/cdap/common/internal/remote/TaskWorkerHttpHandlerInternal.java

+        return;
+      }
+
+      if (!enableUserCodeIsolationEnabled


Why do we disable the killAfterRequestCount feature when code isolation is disabled? This is a useful feature.

I think this is the existing behavior, i.e. not changed in this PR.

Are you asking in reference to this PR, or is this a general design/implementation question?

Well, we should either change it or change the comments. I believe cdap-defaults does not talk about any dependency.

Changed the comment in cdap-default.xml.

PTAL

tivv · 2024-04-10T23:24:17Z

...-common/src/main/java/io/cdap/cdap/common/internal/remote/TaskWorkerHttpHandlerInternal.java

-              waitTime,
-              TimeUnit.SECONDS);
+
+    if (lowerBound <= 0) {


Why? Did you mean to check the duration?

I think this was done as an early exit, for code readability.

Previously it was:

if (lowerBound > 0) { //logic } //Function ends return }

Now it's:

if (lowerBound <= 0) { return } //logic //Function ends return }

Can you change it to duration? It looks equivalent (lowerBound = duration*0.9, so lowerBound <= 0 is the same as duration <= 0), but checking the calculated value makes the logic harder to understand.
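
As an illustration of the suggestion above, here is a minimal sketch of an early exit that checks the configured duration rather than the derived bound. The class `RestartDelay`, the method `pick`, the integer types, and the upper bound are all assumptions made for the example; only `lowerBound = duration * 0.9` comes from the discussion.

```java
import java.util.OptionalInt;
import java.util.concurrent.ThreadLocalRandom;

/** Hypothetical helper; not the actual CDAP restart-scheduling code. */
class RestartDelay {
  /** Returns a jittered delay in seconds, or empty when periodic restart is disabled. */
  static OptionalInt pick(int durationSeconds) {
    // Early exit on the configured value itself, so the intent is obvious.
    if (durationSeconds <= 0) {
      return OptionalInt.empty();
    }
    // lowerBound = duration * 0.9 is taken from the review thread.
    int lowerBound = (int) (durationSeconds * 0.9);
    // Assumption for this sketch: the delay never exceeds the configured duration.
    int upperBound = durationSeconds;
    // nextInt(origin, bound) treats bound as exclusive, hence the + 1.
    return OptionalInt.of(ThreadLocalRandom.current().nextInt(lowerBound, upperBound + 1));
  }
}
```

One detail worth noting under these assumptions: if the bound is truncated to an integer as above, a duration of 1 yields a lowerBound of 0, so the two checks are not strictly identical for very small values. That is another reason to test the configured duration directly.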

arjan-bal self-assigned this Apr 1, 2024

arjan-bal added the build and 6.10 labels Apr 1, 2024

arjan-bal force-pushed the feature/CDAP-20832-task-worker-restart branch from c510b17 to d8c51d2 Compare April 1, 2024 17:28

Restart task workers when a task gets stuck

7244a3f

arjan-bal force-pushed the feature/CDAP-20832-task-worker-restart branch from d8c51d2 to 7244a3f Compare April 1, 2024 19:28

arjan-bal requested review from masoud-io, tivv and itsankit-google April 1, 2024 19:34

masoud-io reviewed Apr 9, 2024

Revert "Restart task workers when a task gets stuck"

038c1c7

This reverts commit 7244a3f.

arjan-bal changed the title ~~[CDAP-20832] Restart task workers when a task get stuck~~ [CDAP-20832] Enable periodic restart when task workers are running concurrent requests Apr 10, 2024

arjan-bal force-pushed the feature/CDAP-20832-task-worker-restart branch from a7fc249 to 3e72d2e Compare April 10, 2024 10:14

arjan-bal requested a review from masoud-io April 10, 2024 10:15

tivv reviewed Apr 10, 2024

Enable periodic restart in non user code isolation mode

536f1d5

arjan-bal force-pushed the feature/CDAP-20832-task-worker-restart branch from 3e72d2e to 536f1d5 Compare April 11, 2024 18:08

CDAP-20832 : address comments

9c545cb

sahusanket requested a review from tivv April 27, 2024 08:51

CDAP-20832 : address comments 2

8449688

arjan-bal removed their assignment Apr 29, 2024

tivv approved these changes Apr 30, 2024

sahusanket merged commit ebb1333 into develop Apr 30, 2024
11 of 12 checks passed

sahusanket deleted the feature/CDAP-20832-task-worker-restart branch April 30, 2024 06:20

sahusanket mentioned this pull request Apr 30, 2024

Revert "[CDAP-20832] Enable periodic restart when task workers are running concurrent requests" #15628

Merged

sahusanket restored the feature/CDAP-20832-task-worker-restart branch April 30, 2024 08:12

This was referenced Apr 30, 2024

[CDAP-20832] Enable periodic restart when task workers are running concurrent requests #15630

Merged

[🍒][6.10][CDAP-20832] Enable periodic restart when task workers are running co… #15631

Merged
