Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP Add batching for gauge metrics #57

Draft
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

mplachter
Copy link
Contributor

  • POC Adding batching for metrics on a interval

TODO:

  • Add AVG during the batch interval
  • Add TimeStamp Filtering for old metrics

@mplachter mplachter force-pushed the add-batching-for-gauge-metrics branch from bbe7325 to bdbab36 Compare January 6, 2023 22:28
@@ -24,6 +24,7 @@ func Execute() {

rootCmd.PersistentFlags().StringSliceVar(&cfg.AuthUsers, "AuthUsers", []string{}, "List of allowed auth users and their passwords comma separated\n Example: \"user1=pass1,user2=pass2\"")
rootCmd.PersistentFlags().StringVar(&cfg.ApiListen, "apiListen", ":80", "Listen for API requests on this host/port.")
rootCmd.PersistentFlags().IntVar(&cfg.MetricBatchInterval, "metricBatchInterval", 5, "The amount of seconds to batch the metrics collected")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

resetting on a timer requires us to sync this up w/ prometheus' scraping schedule, right? Default seems to be 60 seconds, which means we would by (by default) averaging the last 5 seconds worth of gauge stats, and ignoring the 55 seconds before that.

wouldn't resetting the stats whenever the metrics got scraped avoid this issue?

Copy link
Contributor Author

@mplachter mplachter Jan 6, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Technically but than if someone/something just hits the endpoint for testing purposes all the metrics get jacked up

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What we could do is check if it's been scraped during the interval and if so then we can reset the gauges if the endpoint hasn't been scraped yet we can merge in gauges and if it has we can reset gauges....

Signed-off-by: Matt Plachter <[email protected]>
@mplachter mplachter force-pushed the add-batching-for-gauge-metrics branch from 1c4f10a to 2be4980 Compare January 6, 2023 22:50
@faangbait
Copy link

faangbait commented Mar 11, 2023

Hey, I see people are trying to solve the same problem I wandered over here from.

The solution I had was also "callback on GET /metrics that wipes after a scrape." If you're aggregating and squashing labels, that's the only way to prevent double counting.

As far as "what if a user touches the endpoint," that's easily solved with HTTP Basic Auth.

I'm sure everybody's in the same boat with "they don't build computers with the amount of memory we need, so let's aggregate some stuff." Wipe-on-scrape has the added benefit of being homeostatic.

Specifically, as the fleet that's being aggregated horizontally scales, memory usage increases at the aggregator level. You could tune memory usage by scraping more often, causing the aggregators to wipe more often. But this would increase memory requirements at the central server.

But then you just aggregate again through another layer of aggregators. From there, it's aggregators all the way down and the SREs are happy!

@mplachter
Copy link
Contributor Author

@faangbait, thanks for your insight here. I agree, I don't think duplicating all the metrics is valuable.

We do have a few issues with gauges tho, as we don't want to add them up we need to calculate a floating average for a given probed period. We also can't just keep averaging them out between probe intervals as thats dependent on the endpoint being scrapped, which could lead to an average over different time durations.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants