Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CDAP-21096] Remove in-memory launching queue from RunRecordMonitorService #15800

Open
wants to merge 1 commit into
base: develop
Choose a base branch
from

Conversation

vsethi09
Copy link
Contributor

@vsethi09 vsethi09 commented Jan 13, 2025

Context

Multiple instances of RunRecordMonitoringService cannot run as distributed services as the in-memory cache of the launching queue will result in inconsistencies.

Removed the in-memory launching queue from the RunRecordMonitoringService and used AppMetadataStore APIs.

For more context, see: #15773 (comment).

Note: RunRecordMonitoringService is renamed as FlowControlMonitoringService.

Testing

  • Unit Tests
  • CDAP sandbox
  • Docker image

@vsethi09 vsethi09 added the build Triggers github actions build label Jan 13, 2025
@vsethi09 vsethi09 force-pushed the feature/CDAP-21096_fix_RunRecordMonitorService_queue_cache branch 2 times, most recently from ef61989 to 00add6d Compare January 13, 2025 14:19
@vsethi09 vsethi09 force-pushed the feature/CDAP-21096_fix_RunRecordMonitorService_queue_cache branch 3 times, most recently from dc068c3 to 0ea30c6 Compare January 15, 2025 20:37
@vsethi09 vsethi09 changed the title [WIP] Remove in-memory launching queue in RunRecordMonitorService [CDAP-21096] Remove in-memory launching queue from RunRecordMonitorService Jan 15, 2025
@vsethi09 vsethi09 marked this pull request as ready for review January 15, 2025 20:42
@vsethi09 vsethi09 force-pushed the feature/CDAP-21096_fix_RunRecordMonitorService_queue_cache branch from 0ea30c6 to abcd4c5 Compare January 15, 2025 20:49
* @return Counter with total number of launching and running program runs.
*/
public Counter getCount() {
return getFlowControlMetrics(true, true);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's not use term metrics as it's pretty strongly associated with metrics service. Use counts.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Renamed with Counter.

}

int launchingCount = addRequest(programRunId, programOptions, programDescriptor);
int runningCount = getFlowControlMetrics(false, true).getRunningCount();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are there value in having getFlowControlMetrics method? in addRequestAndGetCount it still ends up creating 2 separate database transactions. I'd consider moving transaction into addRequestAndGetCount/getCount and removing/splitting getFlowControlMetrics into 2 methods without transaction handling.
As of emitFlowControlMetrics, as far as I can see in all but edge case all metrics are emitted, so I'd just make it emit both metrics universaly.

if (runRecordDetail.getStatus() == ProgramRunStatus.PENDING) {
runRecordMonitorService.addRequest(runRecordDetail.getProgramRunId());
flowControlMonitorService.addRequest(runRecordDetail.getProgramRunId(),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is needed if the run is already in the database.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is required for emitting flow control metrics.

Replaces addRequest() with emitFlowControlMetrics() call.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, originally we did it to recreate the in-memory store from database store. Now we read a database store. I don't see a reason to imeddiately write back. Do I miss something?

throws IOException {
long startTs = RunIds.getTime(programRunId.getRun(), TimeUnit.SECONDS);
if (startTs == -1L) {
LOG.error(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would fail it. Since we are in the handler, we can fail and pass it to user. We could not do it in the subscriber

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Throwing IllegalArgumentException for this case.

"Ignoring unexpected request to record rejected state for program run {} that has an existing "
+ "run record in run state {} and cluster state {}.",
programRunId, existing.getStatus(), existing.getCluster().getStatus());
"Ignoring unexpected request to record rejected state for program run {} that has no existing run record.",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why? I don't know in which case it may happen, but I'd still leave a trace. Just skip delete.

*/
public int getLaunchingCount(Set<ProgramType> programTypes, @Nullable Integer limit) throws IOException {
AtomicInteger count = new AtomicInteger(0);
try (CloseableIterator<RunRecordDetail> iterator = queryProgramRuns(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is heavy. Can we use io.cdap.cdap.spi.data.StructuredTable#count? If needed, we can even add a field when we write the record to filter efficiently

@vsethi09 vsethi09 force-pushed the feature/CDAP-21096_fix_RunRecordMonitorService_queue_cache branch from abcd4c5 to 343101b Compare January 16, 2025 10:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
build Triggers github actions build
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants