choose a consistent histogram quantile decay time #4
@richardkiene Do you have a recommendation?
What http_request_duration_seconds_bucket[5m] does is grab all the scraped data points for the last 5 minutes with timestamp+value. It then calculates a per-second rate, extrapolating for missed scrapes and misalignments. My recommendation against 1m is based on the fact that we're using a 15s scrape interval. That means if we miss one or two scrapes or have some polling delay, we're trying to determine a per-second rate across a minute using only 2-3 data points. Having 3 minutes gives us at least 12 data points (in the best case) and would be more resilient to missed scrapes. It should also be more accurate, since we're basing our rate on 3x as much data.

5m obviously gives us even more data to work with, but at some point the amount of processing will be a problem. I believe it will also give us smoother graphs, which may or may not be what we want (because we're averaging over a larger rather than a smaller sample).

So far I have been using 5m, and I had intended to use that until it became a problem. But if we decided we wanted to standardize on 3m everywhere, I would be fine with that.

I have read elsewhere that it is really not a good idea to mix different vector selectors within a dashboard, but I cannot find the source for that. It does make sense, however: different window sizes will surface different features, and if you mixed values you might see a feature on one graph (likely the one with the smaller selector) that you don't see on another and reach the wrong conclusion (that the data was not there, rather than that it was hidden/smoothed by averaging over a longer period).

I guess to some degree this also depends on what we're trying to get from these graphs. If the goal is to identify longer-term trends, we should use larger values both for the vector selectors and for the graphs themselves (e.g.
My focus so far has been on trying to improve overall performance/scalability, so I have been looking more at longer-term trends than at short-term spikes.
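The window-size tradeoff above can be sketched with a toy model. This is a simplified stand-in for Prometheus's `rate()` (the real implementation also extrapolates to the window boundaries and handles counter resets); the point is just how few samples a 1m window holds at a 15s scrape interval compared to 3m:

```python
# Simplified model (NOT Prometheus's exact algorithm) of computing a
# per-second rate from counter samples inside a range-vector window.

def simple_rate(samples):
    """samples: list of (timestamp, counter_value) within the window,
    oldest first. Returns an approximate per-second rate, or None if
    there are too few points to compute one."""
    if len(samples) < 2:
        return None  # rate() needs at least two points in the range
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    return (v1 - v0) / (t1 - t0)

# With a 15s scrape interval, a 1m window holds at most ~5 samples;
# losing one or two scrapes leaves very little to base a rate on.
scrape_interval = 15
samples_1m = [(t, t * 2.0) for t in range(0, 60 + 1, scrape_interval)]   # 5 points
samples_3m = [(t, t * 2.0) for t in range(0, 180 + 1, scrape_interval)]  # 13 points
print(simple_rate(samples_1m))  # 2.0 (per-second rate)
print(simple_rate(samples_3m))  # 2.0 (same rate, from 3x the data)
```

Both windows agree here because the toy counter is perfectly linear; with noisy or missing scrapes, the longer window averages over more points and is correspondingly more stable.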
My understanding from https://prometheus.io/docs/practices/histograms/ is that something like `histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))` means the 95th percentile, using a "5-minute decay time". I'm not sure why I would choose 5 over 1 or 10 minutes, and I'm not sure what to make of "decay". Is it exponential?
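For reference, `histogram_quantile()` estimates the quantile by linear interpolation inside the cumulative buckets, per the Prometheus docs (there is no exponential decay; the `[5m]` window is just the range the rates are averaged over). A rough sketch of the interpolation, with hypothetical bucket counts:

```python
# Sketch of histogram_quantile()'s bucket interpolation (simplified).
# The bucket bounds and counts below are made up for illustration.

def histogram_quantile(q, buckets):
    """buckets: list of (upper_bound, cumulative_count), sorted by bound,
    with float('inf') as the last bound. Returns the estimated q-quantile."""
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if bound == float('inf'):
                return prev_bound  # cannot interpolate into the +Inf bucket
            # Linearly interpolate within the bucket containing the rank.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count

# Hypothetical http_request_duration_seconds buckets (seconds):
buckets = [(0.1, 50), (0.5, 90), (1.0, 99), (float('inf'), 100)]
print(histogram_quantile(0.95, buckets))
```

One consequence worth noting: the estimate's accuracy depends entirely on how the bucket boundaries are chosen, since the interpolation assumes observations are uniformly spread within each bucket.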
However, I think most of our services are similar enough that we probably want to use the same value everywhere.