Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[YUNIKORN-2818] Fix state tracking metrics for app and queue #951

Closed
wants to merge 8 commits into from

Conversation

zhuqi-lucas
Copy link
Contributor

@zhuqi-lucas zhuqi-lucas commented Aug 20, 2024

What is this PR for?

State of appMetrics of Queue Metrics is incomplete and should be fixed

What type of PR is it?

  • - Bug Fix
  • - Improvement
  • - Feature
  • - Documentation
  • - Hot Fix
  • - Refactoring

Todos

  • - Task

What is the Jira issue?

https://issues.apache.org/jira/browse/YUNIKORN-2818

How should this be tested?

Screenshots (if appropriate)

Questions:

  • - The licenses files need update.
  • - There is breaking changes for older versions.
  • - It needs documentation.

@@ -84,8 +84,8 @@ func InitSchedulerMetrics() *SchedulerMetrics {
Help: "Total number of attempts to allocate containers. State of the attempt includes `allocated`, `rejected`, `error`, `released`",
}, []string{"state"})

s.applicationSubmission = prometheus.NewCounterVec(
prometheus.CounterOpts{
s.applicationSubmission = prometheus.NewGaugeVec(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Change to gauge because, we want to support dec also

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A submission can not be changed. When an application is submitted it will always be submitted. It should be a counter that keeps increasing. It is not a state that we track it is a pure counter from start to shutdown.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @wilfred-s for clarify for this field,it makes sense to me.

@zhuqi-lucas zhuqi-lucas self-assigned this Aug 20, 2024
@zhuqi-lucas zhuqi-lucas requested review from wilfred-s, pbacsko, chenyulin0719 and craigcondit and removed request for pbacsko August 20, 2024 07:21
@craigcondit
Copy link
Contributor

Please run make lint before submitting PRs.

@zhuqi-lucas
Copy link
Contributor Author

Please run make lint before submitting PRs.

Fix go lint now.

Copy link

codecov bot commented Aug 21, 2024

Codecov Report

Attention: Patch coverage is 94.57364% with 7 lines in your changes missing coverage. Please review.

Project coverage is 80.18%. Comparing base (9f46b81) to head (a477027).
Report is 6 commits behind head on master.

Files Patch % Lines
pkg/metrics/queue.go 90.90% 4 Missing ⚠️
pkg/metrics/scheduler.go 92.10% 3 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #951      +/-   ##
==========================================
+ Coverage   79.56%   80.18%   +0.61%     
==========================================
  Files          97       97              
  Lines       12275    12371      +96     
==========================================
+ Hits         9767     9920     +153     
+ Misses       2231     2176      -55     
+ Partials      277      275       -2     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Contributor

@chenyulin0719 chenyulin0719 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like "Rejected","Completed","Failed" to "Expired" app states are not tracked, neither in queue nor sheduler. Is it intentional?

Comment on lines 206 to 208
func (m *QueueMetrics) DecQueueApplicationsRejected() {
m.decQueueApplications(AppRejected)
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DecQueueApplicationsRejected() is not called in "leave_Rejected" state? Is it intentional?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rejected applications should only increase, never decrease: a simple counter.
It is a final state that the application only gets removed from as part of the cleanup. We keep it around for a while to make it traceable for the submitter. i.e. submit an application that gets rejected: find it is the list as such. If we would drop it immediately the submitter has no idea what happened.

Copy link
Contributor Author

@zhuqi-lucas zhuqi-lucas Aug 21, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@chenyulin0719 @wilfred-s
If i make sense right, we don't need to add decrease for following state? Because they are final state besides the expired:

  1. Rejected
  2. Failed
  3. Completed

And for expired, i don't think we need to tracking it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in latest PR.

Copy link
Contributor

@chenyulin0719 chenyulin0719 Aug 21, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@wilfred-s Understood, thanks for the explanation.

@zhuqi-lucas Not tracking expired app metrics make sense to me.
To avoid any confusion, I think we should remove all the decrement function for Rejected/Failed/Completed. WDYT?

Copy link
Contributor

@chenyulin0719 chenyulin0719 Aug 21, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean removing the below functions:

  • DecQueueApplicationsRejected()
  • DecQueueApplicationsFailed()
  • DecQueueApplicationsCompleted()

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it @chenyulin0719 , remove those unused code in latest PR!

pkg/metrics/queue.go Outdated Show resolved Hide resolved
pkg/metrics/queue.go Outdated Show resolved Hide resolved
pkg/metrics/scheduler.go Outdated Show resolved Hide resolved
Copy link
Contributor

@chenyulin0719 chenyulin0719 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

@chenyulin0719 chenyulin0719 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 LGTM

Copy link
Contributor

@craigcondit craigcondit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 LGTM.

@craigcondit craigcondit changed the title [YUNIKORN-2818] State of appMetrics of Queue Metrics is incomplete an… [YUNIKORN-2818] Fix state tracking metrics for app and queue Aug 21, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants