
choose a consistent histogram quantile decay time #4

Open
cburroughs opened this issue Sep 21, 2018 · 3 comments

Comments

@cburroughs
Contributor

My understanding from https://prometheus.io/docs/practices/histograms/ is that something like

histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))

means the 95th percentile, using a "5-minute decay time". I'm not sure why I would choose 5 minutes over 1 or 10, and I'm not sure what to make of "decay". Is it exponential?

However, I think most of our services are similar enough that we probably want to use the same value everywhere.

@cburroughs
Contributor Author

@richardkiene Do you have a recommendation?

@cburroughs
Copy link
Contributor Author

<joshw> I actually already had an opinion about that one; I just didn't see the ticket.
tl;dr: 1m is probably never a good idea. I think we should use 3m or 5m, depending on how well 5m performs.

@joshwilsdon
Contributor

The range selector http_request_duration_seconds_bucket[5m] grabs all the scraped data points (timestamp + value) for the last 5 minutes; rate() then calculates a per-second rate from them, extrapolating for missed scrapes and misalignments.

My recommendation against 1m is based on the fact that we're using a 15s scrape interval. That means that if we miss one or two scrapes, or have some polling delay, we're trying to determine a per-second rate across a minute using only 2-3 data points. A 3-minute window gives us up to 12 data points (in the best case) and would be more resilient to missed scrapes. It should also be more accurate, since we're basing the rate on 3x as much data. 5m obviously gives us even more data to work with, but at some point the amount of processing will become a problem. It will also give us smoother graphs, which may or may not be what we want (because we're averaging over a larger rather than a smaller sample).
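As a rough sketch of the sample counts (assuming our 15s scrape interval and no missed scrapes):

# samples available per window at a 15s scrape interval (best case)
# [1m]  ->  60s / 15s =  4 samples (only 2-3 if a scrape or two is missed)
# [3m]  -> 180s / 15s = 12 samples
# [5m]  -> 300s / 15s = 20 samples
rate(http_request_duration_seconds_bucket[3m])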

So far I have been using 5m, and I had intended to keep using that until it became a problem. But if we decided we wanted to standardize on 3m everywhere, I would be fine with that. I have read elsewhere that it really is not a good idea to mix different range selectors within a dashboard, but I cannot find the source for that. It does make sense, however: you will see different features with different values here, and if you mixed different values you might see a feature on one graph (likely the one with the smaller selector) that you don't see on another, and reach the wrong conclusion (that the data was not there, rather than that it was hidden/smoothed by averaging over a longer period).
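As a sketch of what standardizing would look like (assuming we settle on 3m; the _count series name here just follows the usual Prometheus histogram convention), every panel would use the same range:

# request-rate panel, 3m range
sum(rate(http_request_duration_seconds_count[3m]))

# latency panel, same 3m range so features line up across graphs
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[3m])) by (le))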

I guess to some degree this also depends on what we're trying to get from these graphs. If the goal is to identify longer-term trends, we should use larger values both for the range selectors and for the graphs themselves (e.g. rate(http_request_duration_seconds_bucket[5m]) and maybe a graph over 6 hours instead of 1 hour). But if we're trying to identify individual spikes or outliers, we'd probably be better off with smaller selectors (maybe even 1m, which could be pretty spiky) and graphs showing a shorter window of time.
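Sketching the two cases side by side (only the range duration and the graph window differ):

# longer-term trends: larger range, graphed over something like 6 hours
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))

# spikes/outliers: smaller range (spikier), graphed over a shorter window
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[1m])) by (le))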

My focus so far has been on trying to improve overall performance/scalability, and I have been looking more at longer trends than at short-term spikes.
