[CDAP-21096] Remove in-memory launching queue from RunRecordMonitorService #15800

vsethi09 · 2025-01-13T12:37:41Z

Context

Multiple instances of RunRecordMonitorService cannot run as a distributed service, because each instance keeps its own in-memory cache of the launching queue and the caches would become inconsistent with one another.

Removed the in-memory launching queue from RunRecordMonitorService and used the AppMetadataStore APIs instead.

For more context, see: #15773 (comment).

Note: RunRecordMonitorService is renamed to FlowControlMonitorService.
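
To illustrate the inconsistency described in the Context above, here is a minimal sketch. It does not use any CDAP classes: `InMemoryInstance`, `StoreBackedInstance`, and the list standing in for the shared store are hypothetical names made up for this example, and thread safety is ignored.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical instance that keeps the launching queue in its own memory. */
class InMemoryInstance {
  private final List<String> launching = new ArrayList<>();

  /** Adds a run and returns the launching count as seen by this instance only. */
  int addRequest(String runId) {
    launching.add(runId);
    return launching.size();
  }
}

/** Hypothetical instance that reads and writes a store shared by every instance. */
class StoreBackedInstance {
  private final List<String> sharedStore;

  StoreBackedInstance(List<String> sharedStore) {
    this.sharedStore = sharedStore;
  }

  /** Adds a run and returns the launching count across all instances. */
  int addRequest(String runId) {
    sharedStore.add(runId);
    return sharedStore.size();
  }
}
```

With two `InMemoryInstance` objects, each one reports a count that ignores requests handled by the other, so a flow-control limit could be exceeded. Two `StoreBackedInstance` objects built on the same store report the same total.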

Testing

Unit Tests
CDAP sandbox
Docker image

sonarqubecloud · 2025-01-13T15:53:13Z

Quality Gate passed

Issues
6 New issues
0 Accepted issues

Measures
0 Security Hotspots
89.9% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube Cloud

tivv · 2025-01-15T21:14:51Z

cdap-app-fabric/src/main/java/io/cdap/cdap/internal/app/services/FlowControlMonitorService.java

+   * @return Counter with total number of launching and running program runs.
+   */
+  public Counter getCount() {
+    return getFlowControlMetrics(true, true);


Let's not use the term "metrics", as it's strongly associated with the metrics service. Use "counts" instead.

Renamed to Counter.

tivv · 2025-01-15T21:22:47Z

cdap-app-fabric/src/main/java/io/cdap/cdap/internal/app/services/FlowControlMonitorService.java

+    }
+
+    int launchingCount = addRequest(programRunId, programOptions, programDescriptor);
+    int runningCount = getFlowControlMetrics(false, true).getRunningCount();


Is there value in having the getFlowControlMetrics method? In addRequestAndGetCount it still ends up creating 2 separate database transactions. I'd consider moving the transaction into addRequestAndGetCount/getCount and removing getFlowControlMetrics or splitting it into 2 methods without transaction handling.
As for emitFlowControlMetrics, as far as I can see all metrics are emitted in all but an edge case, so I'd just make it emit both metrics universally.
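
One possible shape for the suggestion above, as a sketch only: the public methods own a single transaction each, and the helpers they call do no transaction handling. `SketchStore`, `inTransaction`, `addLaunching`, `countRunning`, and `getLaunchingCount` are hypothetical names for this example; the real code would use CDAP's transaction and store APIs, which are not reproduced here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

/** Hypothetical stand-in for a transactional store; this is not the CDAP API. */
class SketchStore {
  final List<String> launching = new ArrayList<>();
  final List<String> running = new ArrayList<>();
  int transactionsStarted = 0;

  /** Runs the given work inside one simulated transaction and counts it. */
  <T> T inTransaction(Function<SketchStore, T> work) {
    transactionsStarted++;
    return work.apply(this);
  }
}

/** Value holder for the two counts; getLaunchingCount is an assumed accessor. */
class Counter {
  private final int launchingCount;
  private final int runningCount;

  Counter(int launchingCount, int runningCount) {
    this.launchingCount = launchingCount;
    this.runningCount = runningCount;
  }

  int getLaunchingCount() {
    return launchingCount;
  }

  int getRunningCount() {
    return runningCount;
  }
}

class FlowControlSketch {
  private final SketchStore store;

  FlowControlSketch(SketchStore store) {
    this.store = store;
  }

  // Helpers with no transaction handling; the caller supplies the context.
  private static int addLaunching(SketchStore ctx, String runId) {
    ctx.launching.add(runId);
    return ctx.launching.size();
  }

  private static int countRunning(SketchStore ctx) {
    return ctx.running.size();
  }

  /** Adds the request and reads both counts in a single transaction. */
  Counter addRequestAndGetCount(String runId) {
    return store.inTransaction(ctx -> new Counter(addLaunching(ctx, runId), countRunning(ctx)));
  }

  /** Reads both counts in a single transaction. */
  Counter getCount() {
    return store.inTransaction(ctx -> new Counter(ctx.launching.size(), countRunning(ctx)));
  }
}
```

Because both counts are read inside the same transaction, they describe one consistent snapshot, and each public call costs one transaction rather than two.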

tivv · 2025-01-15T21:27:30Z

...c/src/main/java/io/cdap/cdap/internal/app/services/ProgramNotificationSubscriberService.java

                    if (runRecordDetail.getStatus() == ProgramRunStatus.PENDING) {
-                      runRecordMonitorService.addRequest(runRecordDetail.getProgramRunId());
+                      flowControlMonitorService.addRequest(runRecordDetail.getProgramRunId(),


I don't think this is needed if the run is already in the database.

This is required for emitting flow control metrics.

Replaced addRequest() with an emitFlowControlMetrics() call.

Well, originally we did it to recreate the in-memory store from the database store. Now we read the database store directly, so I don't see a reason to immediately write back. Am I missing something?

tivv · 2025-01-15T21:29:13Z

cdap-app-fabric/src/main/java/io/cdap/cdap/internal/app/store/AppMetadataStore.java

+      throws IOException {
+    long startTs = RunIds.getTime(programRunId.getRun(), TimeUnit.SECONDS);
+    if (startTs == -1L) {
+      LOG.error(


I would fail it. Since we are in the handler, we can fail and pass the error to the user. We could not do that in the subscriber.

Done.

Throwing IllegalArgumentException for this case.
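
A minimal sketch of the fail-fast behaviour agreed on above. `RunStartTimes`, `getTime`, and `requireStartTime` are hypothetical names; `getTime` only imitates the "-1 means no time could be derived" contract seen in the diff and is not the real RunIds.getTime, and treating the run id as plain epoch seconds is an assumption made purely for illustration.

```java
/** Hypothetical helper showing a fail-fast check on the derived start time. */
class RunStartTimes {

  /**
   * Stand-in for a time lookup: returns -1 when no time can be derived.
   * Assumption for this sketch: a valid run id is a decimal epoch-seconds string.
   */
  static long getTime(String runId) {
    try {
      return Long.parseLong(runId);
    } catch (NumberFormatException e) {
      return -1L;
    }
  }

  /** Returns the start time, or throws so the handler can report the error to the caller. */
  static long requireStartTime(String runId) {
    long startTs = getTime(runId);
    if (startTs == -1L) {
      throw new IllegalArgumentException(
          "Cannot derive a start time from run id '" + runId + "'");
    }
    return startTs;
  }
}
```

Throwing here, rather than logging and continuing, lets a request handler surface the problem to the user, which is the distinction the review comment draws against the subscriber path.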

tivv · 2025-01-15T21:39:41Z

cdap-app-fabric/src/main/java/io/cdap/cdap/internal/app/store/AppMetadataStore.java

-          "Ignoring unexpected request to record rejected state for program run {} that has an existing "
-              + "run record in run state {} and cluster state {}.",
-          programRunId, existing.getStatus(), existing.getCluster().getStatus());
+          "Ignoring unexpected request to record rejected state for program run {} that has no existing run record.",


Why? I don't know in which case it may happen, but I'd still leave a trace. Just skip the delete.

tivv · 2025-01-15T21:43:31Z

cdap-app-fabric/src/main/java/io/cdap/cdap/internal/app/store/AppMetadataStore.java

+   */
+  public int getLaunchingCount(Set<ProgramType> programTypes, @Nullable Integer limit) throws IOException {
+    AtomicInteger count = new AtomicInteger(0);
+    try (CloseableIterator<RunRecordDetail> iterator = queryProgramRuns(


This is heavy. Can we use io.cdap.cdap.spi.data.StructuredTable#count? If needed, we can even add a field when we write the record, so that we can filter efficiently.
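
The idea in the comment above can be sketched without the real table API: if the status is written as a leading key field, a count becomes a key-range operation that never reads the records themselves. `StatusIndexedTable`, the "<status>:<runId>" key layout, and the status strings are assumptions for this example and do not reflect the actual StructuredTable schema.

```java
import java.util.TreeMap;

/** Hypothetical table keyed by "<status>:<runId>"; not the StructuredTable API. */
class StatusIndexedTable {
  private final TreeMap<String, byte[]> rows = new TreeMap<>();

  /** Writes a record with the status as the leading key field. */
  void write(String status, String runId, byte[] serializedRecord) {
    rows.put(status + ":" + runId, serializedRecord);
  }

  /** Counts keys with the given status prefix without reading any record value. */
  int count(String status) {
    String from = status + ":";
    String to = status + ";"; // ';' is the character right after ':' in ASCII
    return rows.subMap(from, true, to, false).size();
  }
}
```

In this stand-in the range is still walked in memory; the point is only that no record is deserialized. A real store could push the same range count down to the database.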

vsethi09 added the "build" label (Triggers github actions build) on Jan 13, 2025

vsethi09 force-pushed the feature/CDAP-21096_fix_RunRecordMonitorService_queue_cache branch 2 times, most recently from ef61989 to 00add6d, on January 13, 2025 14:19

vsethi09 force-pushed the feature/CDAP-21096_fix_RunRecordMonitorService_queue_cache branch 3 times, most recently from dc068c3 to 0ea30c6, on January 15, 2025 20:37

vsethi09 changed the title from "[WIP] Remove in-memory launching queue in RunRecordMonitorService" to "[CDAP-21096] Remove in-memory launching queue from RunRecordMonitorService" on Jan 15, 2025

vsethi09 marked this pull request as ready for review on January 15, 2025 20:42

vsethi09 force-pushed the feature/CDAP-21096_fix_RunRecordMonitorService_queue_cache branch from 0ea30c6 to abcd4c5 on January 15, 2025 20:49

tivv reviewed on Jan 15, 2025

Commit 343101b: Remove in-memory launching queue in RunRecordMonitorService

vsethi09 force-pushed the feature/CDAP-21096_fix_RunRecordMonitorService_queue_cache branch from abcd4c5 to 343101b on January 16, 2025 10:05
