[YUNIKORN-2926] placeholder counters incorrect #986

wilfred-s · 2024-10-18T08:23:07Z

What is this PR for?

Placeholder tracking data is maintained inside the application for scheduling. If the placeholder is released we update the counters in the tracking data. We have cases in which we do not do that correctly:

- placeholders are smaller than the real allocation
- placeholder does not have an allocation to replace
- all allocations are removed from an application

Tests updated to check all the counters inside the placeholder data for consistency.
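
As an illustration of what "checking the counters for consistency" can mean, here is a minimal sketch in Go. The type `placeholderCounters`, its fields, and its methods are hypothetical names chosen for this example and are not taken from the YuniKorn code base. The assumption is that each released placeholder is counted exactly once, either as replaced or as timed out, and never beyond the number created.

```go
package main

import "fmt"

// placeholderCounters is a hypothetical stand-in for the per task group
// tracking data. Field names are assumptions, not the YuniKorn API.
type placeholderCounters struct {
	Count    int64 // placeholders created for the task group
	Replaced int64 // placeholders swapped for a real allocation
	TimedOut int64 // placeholders released because of a timeout
}

// outstanding returns the placeholders that are neither replaced nor timed out.
func (p *placeholderCounters) outstanding() int64 {
	return p.Count - p.Replaced - p.TimedOut
}

// release records one released placeholder exactly once. It returns an
// error instead of letting the counters run past the created count.
func (p *placeholderCounters) release(timedOut bool) error {
	if p.outstanding() <= 0 {
		return fmt.Errorf("release without outstanding placeholder: %+v", *p)
	}
	if timedOut {
		p.TimedOut++
	} else {
		p.Replaced++
	}
	return nil
}

// consistent reports whether the counters satisfy the assumed invariant:
// no negative values and replaced plus timed out never exceeds the count.
func (p *placeholderCounters) consistent() bool {
	return p.Count >= 0 && p.Replaced >= 0 && p.TimedOut >= 0 &&
		p.Replaced+p.TimedOut <= p.Count
}
```

A test in this style would call `consistent()` after every release path (replacement, timeout, removal of all allocations) rather than checking a single counter in isolation.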

What type of PR is it?

- Bug Fix

What is the Jira issue?

YUNIKORN-2926

How should this be tested?

New unit tests added
e2e test with incorrectly sized placeholders is needed

codecov · 2024-10-18T08:25:09Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 81.25%. Comparing base (44705ae) to head (40e7231).
Report is 3 commits behind head on master.

Additional details and impacted files

@@            Coverage Diff             @@
##           master     #986      +/-   ##
==========================================
- Coverage   81.50%   81.25%   -0.25%     
==========================================
  Files          97       97              
  Lines       12625    15467    +2842     
==========================================
+ Hits        10290    12568    +2278     
- Misses       2052     2617     +565     
+ Partials      283      282       -1

craigcondit · 2024-10-18T22:24:46Z

Is the e2e test failure expected?

wilfred-s · 2024-10-19T06:50:00Z

The failure was not expected. It was another point of inconsistent handling of the tracking data not covered by the unit tests.

wilfred-s · 2024-10-21T11:59:07Z

The second e2e test is failing because the list of requests contains a mixture of allocated and unallocated entries, combined with the merging of the Ask and Allocation objects. The fix for the timeout tracking does not filter the request list for already allocated requests and needs to do that. We seem to be adding that same filter everywhere now to compensate for not removing the allocation from the request list.
This causes two issues:

- We send allocations to the shim twice to be released: once from the request list and once from the allocations list.
- Asks and allocations are processed the same way and both are returned to the core, which can cause them to be double counted as timed out.
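
The filtering described above can be sketched as follows. This is only an illustration under stated assumptions: the `request` type and the helpers `unallocated` and `releaseKeys` are hypothetical and do not mirror the actual core or shim code. The sketch skips already allocated entries in the request list and collects each key once, so nothing is released or counted twice.

```go
package main

// request is a hypothetical merged ask/allocation entry. The fields are
// assumptions for this example only.
type request struct {
	Key       string
	Allocated bool
}

// unallocated returns only the entries of the request list that have not
// been allocated yet. Allocated entries are handled via the allocation list.
func unallocated(requests []request) []request {
	pending := make([]request, 0, len(requests))
	for _, r := range requests {
		if !r.Allocated {
			pending = append(pending, r)
		}
	}
	return pending
}

// releaseKeys collects the keys to release: allocations first, then the
// unallocated requests. Each key is returned at most once.
func releaseKeys(requests, allocations []request) []string {
	seen := make(map[string]bool)
	keys := make([]string, 0, len(requests)+len(allocations))
	for _, list := range [][]request{allocations, unallocated(requests)} {
		for _, r := range list {
			if !seen[r.Key] {
				seen[r.Key] = true
				keys = append(keys, r.Key)
			}
		}
	}
	return keys
}
```

Removing the allocation from the request list at allocation time would make the filter unnecessary. The sketch only shows the compensating approach the comment describes.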

The other problem detected during log analysis is that the shim tries to clean up the same placeholders multiple times: first based on the core request, then based on internal logic in the placeholder code. All cleanup in the placeholder code fails because the core has already done it. It does trigger more release messages to be sent to the core, which are then ignored as the work is already done. That needs a cleanup in a follow-up Jira.

craigcondit

+1 LGTM.

pbacsko

LGTM

Placeholder tracking data is maintained inside the application for scheduling. If the placeholder is released we update the counters in the tracking data. We have cases in which we do not do that correctly:
* placeholders are smaller than the real allocation
* placeholder does not have an allocation to replace
* all allocations are removed from an application

Closes: #986
Signed-off-by: Craig Condit <[email protected]>

wilfred-s requested review from chia7712, lixmgl, zhuqi-lucas, craigcondit, pbacsko and steinsgateted October 18, 2024 08:23

wilfred-s self-assigned this Oct 18, 2024

additional changes for consistent handling of tracking

fc3c487

remove inconsistent tracking update on timeout extended testing and checks in placeholder timeout test cases

lint cleanup: taskgroup constants

40e7231

craigcondit approved these changes Oct 22, 2024

pbacsko approved these changes Oct 24, 2024

craigcondit closed this in 9869540 Oct 24, 2024

wilfred-s deleted the YUNIKORN-2926 branch November 11, 2024 01:40
