Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[YUNIKORN-2926] placeholder counters incorrect #986

Closed
wants to merge 3 commits into from

Conversation

wilfred-s
Copy link
Contributor

What is this PR for?

Placeholder tracking data is maintained inside the application for scheduling. If the placeholder is released we update the counters in the tracking data. We have cases in which we do not do that correctly:

  • placeholders are smaller than the real allocation
  • placeholder does not have an allocation to replace
  • all allocations are removed from an application

Tests updated to check all the counters inside the placeholder data for consistency.

What type of PR is it?

  • - Bug Fix

What is the Jira issue?

How should this be tested?

New unit tests added
e2e test with incorrectly sized placeholders is needed

Placeholder tracking data is maintained inside the application for
scheduling. If the placeholder is released we update the counters in the
tracking data. We have cases in which we do not do that correctly:
* placeholders are smaller than the real allocation
* placeholder does not have an allocation to replace
* all allocations are removed from an application

Tests updated to check all the counters inside the placeholder data for
consistency.
Copy link

codecov bot commented Oct 18, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 81.25%. Comparing base (44705ae) to head (40e7231).
Report is 3 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #986      +/-   ##
==========================================
- Coverage   81.50%   81.25%   -0.25%     
==========================================
  Files          97       97              
  Lines       12625    15467    +2842     
==========================================
+ Hits        10290    12568    +2278     
- Misses       2052     2617     +565     
+ Partials      283      282       -1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@craigcondit
Copy link
Contributor

Is the e2e test failure expected?

remove inconsistent tracking update on timeout
extended testing and checks in placeholder timeout test cases
@wilfred-s
Copy link
Contributor Author

The failure was not expected. It was another point of inconsistent handling of the tracking data not covered by the unit tests.

@wilfred-s
Copy link
Contributor Author

wilfred-s commented Oct 21, 2024

This second e2e test is failing due to the fact that the list of requests has a mixture of allocated and unallocated entries and the merging of the Ask and Allocation objects. The fix for the timeout tracking does not filter the requests lists for already allocated requests and needs to do that. We seem to be adding that same filter everywhere now to compensate for not removing the allocation from the request list.
This causes two issue:

  • We send allocations to the shim twice to be released. Once from the request list once from the allocations list.
  • Asks and allocations are processed the same way and both are returned to the core which can cause double counting them as timed out.

The other problem detected on log analysis is that the shim tries to clean up the same placeholders multiple times. First based on the core request then based on internal logic as part of the placeholder code. All cleanup in the placeholder code fails as the core has already done it. It does trigger more release messages to be sent to the core which then get ignored as the work is already done. That needs a cleanup in a follow up jira.

Copy link
Contributor

@craigcondit craigcondit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 LGTM.

Copy link
Contributor

@pbacsko pbacsko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

craigcondit pushed a commit that referenced this pull request Oct 24, 2024
Placeholder tracking data is maintained inside the application for
scheduling. If the placeholder is released we update the counters in the
tracking data. We have cases in which we do not do that correctly:
* placeholders are smaller than the real allocation
* placeholder does not have an allocation to replace
* all allocations are removed from an application

Closes: #986

Signed-off-by: Craig Condit <[email protected]>
@wilfred-s wilfred-s deleted the YUNIKORN-2926 branch November 11, 2024 01:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants