[YUNIKORN-2818] Fix state tracking metrics for app and queue #951
pkg/metrics/scheduler.go
Outdated
@@ -84,8 +84,8 @@ func InitSchedulerMetrics() *SchedulerMetrics {
 		Help: "Total number of attempts to allocate containers. State of the attempt includes `allocated`, `rejected`, `error`, `released`",
 	}, []string{"state"})

-	s.applicationSubmission = prometheus.NewCounterVec(
-		prometheus.CounterOpts{
+	s.applicationSubmission = prometheus.NewGaugeVec(
Changed to a gauge because we also want to support decrements.
A submission cannot be changed. Once an application is submitted, it will always have been submitted. This should be a counter that keeps increasing. It is not a state that we track; it is a pure counter from start to shutdown.
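The distinction drawn here can be sketched with two tiny stand-in types. This is only an illustration, not the YuniKorn code: `counter`, `gauge`, `appMetrics`, and `simulate` are hypothetical names, and the `running` field is an assumed example of a tracked state. In the Prometheus Go client the same split exists: a counter offers `Inc` and `Add` but no `Dec`, while a gauge offers both `Inc` and `Dec`.

```go
package main

import "fmt"

// counter is a hypothetical stand-in for a monotonic metric.
// It only exposes Inc, so its value can never go down.
type counter struct{ value int }

func (c *counter) Inc() { c.value++ }

// gauge is a hypothetical stand-in for a metric that tracks a current level.
type gauge struct{ value int }

func (g *gauge) Inc() { g.value++ }
func (g *gauge) Dec() { g.value-- }

// appMetrics is illustrative only; the field names are assumptions.
type appMetrics struct {
	submitted counter // total submissions since start, never decreases
	running   gauge   // applications currently in a tracked state
}

// simulate submits two applications, lets both run, then lets one leave
// the running state. It returns the final metric values.
func simulate() (submitted, running int) {
	m := &appMetrics{}
	for i := 0; i < 2; i++ {
		m.submitted.Inc()
		m.running.Inc()
	}
	// One application leaves the running state. The submission total is
	// unaffected because a submission cannot be undone.
	m.running.Dec()
	return m.submitted.value, m.running.value
}

func main() {
	submitted, running := simulate()
	fmt.Printf("submitted=%d running=%d\n", submitted, running)
}
```

Modelling the submission total as a type without `Dec` makes an accidental decrement a compile-time error rather than a silent metrics bug.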
Thanks @wilfred-s for clarifying this field, it makes sense to me.
Please run
Fix go lint now.
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #951      +/-   ##
==========================================
+ Coverage   79.56%   80.18%   +0.61%
==========================================
  Files          97       97
  Lines       12275    12371      +96
==========================================
+ Hits         9767     9920     +153
+ Misses       2231     2176      -55
+ Partials      277      275       -2
It looks like the app state transitions from "Rejected", "Completed", and "Failed" to "Expired" are not tracked, in either the queue or the scheduler. Is that intentional?
pkg/metrics/queue.go
Outdated
func (m *QueueMetrics) DecQueueApplicationsRejected() {
	m.decQueueApplications(AppRejected)
}
DecQueueApplicationsRejected() is not called on "leave_Rejected". Is that intentional?
Rejected applications should only increase, never decrease: a simple counter.
Rejected is a final state that the application only leaves when it is removed as part of the cleanup. We keep the application around for a while to make it traceable for the submitter. For example, someone who submits an application that gets rejected can find it in the list as rejected. If we dropped it immediately, the submitter would have no idea what happened.
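The rule described above can be sketched as a small transition handler. This is a minimal illustration under stated assumptions, not the code in this PR: `queueAppMetrics`, `transition`, `simulate`, and the lower-case state names are hypothetical, and the `accepted` and `running` states are assumed examples of non-final states. Leaving a non-final state decrements its count, entering a state increments it, final states are never decremented, and `expired` is not tracked at all.

```go
package main

import "fmt"

// Hypothetical state names used only for this sketch.
const (
	accepted  = "accepted"
	running   = "running"
	rejected  = "rejected"
	failed    = "failed"
	completed = "completed"
	expired   = "expired"
)

// finalStates only ever increase; untracked states are ignored entirely.
var finalStates = map[string]bool{rejected: true, failed: true, completed: true}
var untracked = map[string]bool{expired: true}

// queueAppMetrics is an illustrative stand-in for per-queue app metrics.
type queueAppMetrics struct{ counts map[string]int }

// transition updates the counts for an application moving from one state
// to another. An empty "from" means the application is new.
func (m *queueAppMetrics) transition(from, to string) {
	if from != "" && !finalStates[from] && !untracked[from] {
		m.counts[from]-- // leaving a non-final state
	}
	if !untracked[to] {
		m.counts[to]++ // entering a tracked state
	}
}

// simulate runs one application to completion and expiry, and has a
// second application rejected on submission.
func simulate() map[string]int {
	m := &queueAppMetrics{counts: map[string]int{}}
	m.transition("", accepted)
	m.transition(accepted, running)
	m.transition(running, completed)
	m.transition(completed, expired) // cleanup: completed is not decremented
	m.transition("", rejected)
	return m.counts
}

func main() {
	c := simulate()
	fmt.Printf("completed=%d rejected=%d running=%d\n", c[completed], c[rejected], c[running])
}
```

With this shape there is no call site that would ever need a decrement for Rejected, Failed, or Completed.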
@chenyulin0719 @wilfred-s
If I understand correctly, we don't need to add a decrement for the following states, because they are all final states apart from Expired:
- Rejected
- Failed
- Completed
And for Expired, I don't think we need to track it?
Addressed in the latest PR.
@wilfred-s Understood, thanks for the explanation.
@zhuqi-lucas Not tracking expired app metrics makes sense to me.
To avoid any confusion, I think we should remove all the decrement functions for Rejected/Failed/Completed. WDYT?
I mean removing the below functions:
- DecQueueApplicationsRejected()
- DecQueueApplicationsFailed()
- DecQueueApplicationsCompleted()
Got it @chenyulin0719, I removed that unused code in the latest PR!
Good catch, thanks @chenyulin0719 for the review!
+1 LGTM
+1 LGTM.
What is this PR for?
The state tracking in appMetrics of the queue metrics is incomplete and should be fixed.
What type of PR is it?
Todos
What is the Jira issue?
https://issues.apache.org/jira/browse/YUNIKORN-2818
How should this be tested?
Screenshots (if appropriate)
Questions: