
choose a consistent histogram quantile decay time #4

Open
cburroughs opened this issue Sep 21, 2018 · 3 comments

Comments

@cburroughs
Contributor

My understanding from https://prometheus.io/docs/practices/histograms/ is that something like

histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))

means the 95th percentile, using a "5-minute decay time". I'm not sure why I would choose 5 minutes over 1 or 10, and I'm not sure what to make of "decay". Is it exponential?

However, I think most of our services are similar enough that we probably want to use the same value everywhere.

@cburroughs
Contributor Author

@richardkiene Do you have a recommendation?

@cburroughs
Copy link
Contributor Author

<joshw> I actually already had an opinion about that one; I just didn't see the ticket.
tl;dr: 1m is probably never a good idea. I think we should use 3m or 5m, depending on how well 5m performs.

@joshwilsdon
Contributor

The range selector http_request_duration_seconds_bucket[5m] grabs all the scraped data points (timestamp + value) for the last 5 minutes; rate() then calculates a per-second rate from them, extrapolating for missed scrapes and misalignments.

My recommendation against 1m is based on the fact that we're using a 15s scrape interval. That means that if we miss one or two scrapes, or have some polling delay, we're trying to determine a per-second rate across a minute using only 2-3 data points. A 3-minute window gives us up to 12 data points (in the best case) and would be more resilient to missed scrapes. It should also be more accurate, since we're basing the rate on 3x as much data. 5m obviously gives us even more data to work with, but at some point the amount of processing will become a problem. It will also give us smoother graphs, which may or may not be what we want (because we're averaging over a larger rather than a smaller sample).
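As a rough sketch of the sample counts (assuming our 15s scrape interval and no missed scrapes):

# samples available per window at a 15s scrape interval (best case)
# [1m]  ->  60s / 15s =  4 samples (only 2-3 if a scrape or two is missed)
# [3m]  -> 180s / 15s = 12 samples
# [5m]  -> 300s / 15s = 20 samples
rate(http_request_duration_seconds_bucket[3m])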

So far I have been using 5m, and I had intended to keep using that until it became a problem. But if we decided we wanted to standardize on 3m everywhere, I would be fine with that. I have read elsewhere that it really is not a good idea to mix different range selectors within a dashboard, but I cannot find the source for that. It does make sense, however: you will see different features with different values here, and if you mixed different values you might see a feature on one graph (likely the one with the smaller selector) that you don't see on another, and reach the wrong conclusion (that the data was not there, rather than that it was hidden/smoothed by averaging over a longer period).
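As a sketch of what standardizing would look like (assuming we settle on 3m; the _count series name here just follows the usual Prometheus histogram convention), every panel would use the same range:

# request-rate panel, 3m range
sum(rate(http_request_duration_seconds_count[3m]))

# latency panel, same 3m range so features line up across graphs
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[3m])) by (le))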

I guess to some degree this also depends on what we're trying to get from these graphs. If the goal is to identify longer-term trends, we should use larger values both for the range selectors and for the graphs themselves (e.g. rate(http_request_duration_seconds_bucket[5m]) and maybe a graph over 6 hours instead of 1 hour). But if we're trying to identify individual spikes or outliers, we'd probably be better off with smaller selectors (maybe even 1m, which could be pretty spiky) and graphs showing a shorter window of time.
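Sketching the two cases side by side (only the range duration and the graph window differ):

# longer-term trends: larger range, graphed over something like 6 hours
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))

# spikes/outliers: smaller range (spikier), graphed over a shorter window
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[1m])) by (le))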

My focus so far has been on trying to improve overall performance/scalability, and I have been looking more at longer trends than at short-term spikes.
