Prometheus metrics missing k8s_deployment_name attribute for short period after agent restart #37056
Comments
(Triage): The issue is well explained and contains all the information needed to reproduce it, so I am removing the needs-triage label and adding waiting-for-code-owners. |
I just looked into this. I believe this could be due to the fact that, unlike e.g. a DaemonSet, the deployment name for a given pod is retrieved indirectly via the related ReplicaSet (see opentelemetry-collector-contrib/processor/k8sattributesprocessor/internal/kube/client.go, lines 507 to 513 in 165a18f).
This is done at the time the pod is added or updated via the k8sattributesprocessor's Informer. My theory is that in case of a restart of the agent, during the initial sync of the informer where it retrieves all existing resources (i.e. pods, deployments, replicasets, etc.), the processor might be informed about a pod before it has the information about the related replicaset, and therefore the related deployment name is unavailable. One solution may be to, whenever a replicaset is received in the informer, check for any pods that reference this replicaset in their owner references and update the pod attributes accordingly, in case they do not have this information available yet. |
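The backfill idea described above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the processor's actual code: the `Cache`, `Pod`, and `ReplicaSet` types and their methods are hypothetical, simplified stand-ins for the processor's internal cache.

```go
package main

import "fmt"

// ReplicaSet is a hypothetical, simplified stand-in for a cached ReplicaSet.
type ReplicaSet struct {
	UID        string
	Deployment string // name of the owning Deployment
}

// Pod is a hypothetical, simplified stand-in for a cached pod.
type Pod struct {
	Name           string
	ReplicaSetUID  string // owner reference; empty if not owned by a ReplicaSet
	DeploymentName string // stays empty until the owning ReplicaSet is known
}

// Cache holds the objects delivered so far by the (simulated) informers.
type Cache struct {
	pods        map[string]*Pod
	replicaSets map[string]*ReplicaSet
}

func NewCache() *Cache {
	return &Cache{
		pods:        map[string]*Pod{},
		replicaSets: map[string]*ReplicaSet{},
	}
}

// AddPod stores the pod and resolves its deployment name only if the
// owning ReplicaSet has already been seen.
func (c *Cache) AddPod(p *Pod) {
	if rs, ok := c.replicaSets[p.ReplicaSetUID]; ok {
		p.DeploymentName = rs.Deployment
	}
	c.pods[p.Name] = p
}

// AddReplicaSet stores the ReplicaSet and backfills every pod that was
// added earlier, references it as owner, and has no deployment name yet.
func (c *Cache) AddReplicaSet(rs *ReplicaSet) {
	c.replicaSets[rs.UID] = rs
	for _, p := range c.pods {
		if p.ReplicaSetUID == rs.UID && p.DeploymentName == "" {
			p.DeploymentName = rs.Deployment
		}
	}
}

func main() {
	c := NewCache()

	// The pod arrives before its ReplicaSet, as suspected during the initial sync.
	c.AddPod(&Pod{Name: "web-abc", ReplicaSetUID: "rs-1"})
	fmt.Printf("before: %q\n", c.pods["web-abc"].DeploymentName)

	// The ReplicaSet arrives later and triggers the backfill.
	c.AddReplicaSet(&ReplicaSet{UID: "rs-1", Deployment: "web"})
	fmt.Printf("after: %q\n", c.pods["web-abc"].DeploymentName)
}
```

Note that the backfill loops over every cached pod each time a ReplicaSet is added, which is the cost this approach trades for correctness regardless of arrival order.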
Thanks for investigating this @bacherfl! I wonder if the In general, I think it's more efficient to wait for the sync instead of going back and looping over the whole Pod cache each time a ReplicaSet is added/updated: https://github.com/open-telemetry/opentelemetry-collector-contrib/pull/37088/files#diff-da945276e27b1e0a3eed6a7e4a97c5b1bb9d90da43bd8ef9fce8b533584d4292R1100-R1101 |
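The "wait for the sync" alternative can be illustrated with a small stdlib-only sketch. It is an assumption-laden simplification, not the code from the linked PR: `waitForSync` and the channel names are hypothetical, and each channel stands in for an informer signalling that its initial list has been loaded (client-go offers `cache.WaitForCacheSync` for a comparable purpose).

```go
package main

import (
	"fmt"
	"time"
)

// waitForSync is a hypothetical helper. It blocks until every channel in
// synced has been closed, or returns false once the timeout expires.
func waitForSync(timeout time.Duration, synced ...<-chan struct{}) bool {
	deadline := time.After(timeout)
	for _, ch := range synced {
		select {
		case <-ch:
			// This informer has finished its initial sync.
		case <-deadline:
			return false
		}
	}
	return true
}

func main() {
	podsSynced := make(chan struct{})
	replicaSetsSynced := make(chan struct{})

	// Simulate the pod informer finishing first and the ReplicaSet
	// informer finishing a little later.
	close(podsSynced)
	go func() {
		time.Sleep(10 * time.Millisecond)
		close(replicaSetsSynced)
	}()

	// Only start resolving pod attributes once both caches are populated.
	ok := waitForSync(time.Second, podsSynced, replicaSetsSynced)
	fmt.Println("all informers synced:", ok)
}
```

With this ordering, a pod is never resolved against an empty ReplicaSet cache, so no later pass over the pod cache is needed; the cost is a delayed start of processing.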
Thanks @ChrsMark! I agree that iterating over the pods is not ideal. I have posted a suggestion for a potential alternative solution in a comment on the PR. For now I will revert it to draft, and I will keep you updated as soon as I have tried out the other approach. |
Reopening as the fix is not working in 100% of the cases. See #37125. |
…tryGen` (#37131)
#### Description
The affected test was added to verify the solution implemented for #37056. However, this seems to not have been fully solved yet. Therefore, to avoid flaky test runs, this test is skipped for now, until the issue is fully resolved.
#### Link to tracking issue
Fixes #37125
#### Testing
Disabled the flaky test until #37056 is fully solved.
Signed-off-by: Florian Bacher <florian.bacher@dynatrace.com>
Component(s)
processor/k8sattributes
What happened?
Description
We have otel deployed as a daemon set in our cluster and have noticed that after a restart of the otel agent pods, prometheus metrics are missing the k8s_deployment_name attribute even though it is being extracted as part of the k8sattributes processor. After about 5 minutes, the problem resolves and the label is present on metrics. Notably, we have not observed this issue with the k8s_daemonset_name attribute. By looking at the logs, we found the k8s.deployment.name attribute is missing from the resource, while the other extracted metadata is present (see attached logs).
Our config sets the wait_for_metadata flag to true. The k8sattributes processor config we are currently using:
Steps to Reproduce
Deploy otel with a prometheus receiver and a pipeline that uses the k8sattributes processor with the config above. Use the debug exporter to output to stdout. Note that for ~5 min after a restart of the otel pod, k8s_deployment_name is missing from the metrics. After about 5 min, the resource attribute appears.
Expected Result
After the otel pods restart, the k8s_deployment_name label should be on all exported metrics.
Actual Result
For about 5 minutes after otel pods restart, exported prometheus metrics are missing the k8s_deployment_name label despite it being extracted as part of the k8sattribute processor.
Collector version
v0.116.1
Environment information
OpenTelemetry Collector configuration
Log output
Also notice that the k8s.daemonset.name attribute IS present immediately after otel pods restart:
After about 5 minutes, with no further changes, the resource DOES contain the k8s.deployment.name attribute: