RemovePodsHavingTooManyRestarts plugin doesn't delete pods #1579

Open
daniglyk opened this issue Dec 18, 2024 · 8 comments

Comments

@daniglyk

daniglyk commented Dec 18, 2024

@daniglyk
Author

Here is my policy:

policy.yaml: |
  apiVersion: "descheduler/v1alpha2"
  kind: "DeschedulerPolicy"
  nodeSelector: node-role.kubernetes.io/worker=true
  metricsCollector:
    enabled: true
  profiles:
  - name: default
    pluginConfig:
    - args:
        evictLocalStoragePods: true
        ignorePvcPods: true
      name: DefaultEvictor
    - name: RemoveDuplicates
    - args:
        includingInitContainers: true
        podRestartThreshold: 100
      name: RemovePodsHavingTooManyRestarts
    - args:
        nodeAffinityType:
        - requiredDuringSchedulingIgnoredDuringExecution
      name: RemovePodsViolatingNodeAffinity
    - name: RemovePodsViolatingNodeTaints
    - name: RemovePodsViolatingInterPodAntiAffinity
    - name: RemovePodsViolatingTopologySpreadConstraint
    - args:
        targetThresholds:
          cpu: 70
          memory: 45
          pods: 40
        thresholds:
          cpu: 60
          memory: 35
          pods: 30
      name: LowNodeUtilization
    plugins:
      balance:
        enabled:
        - RemoveDuplicates
        - RemovePodsViolatingTopologySpreadConstraint
        - LowNodeUtilization
      deschedule:
        enabled:
        - RemovePodsHavingTooManyRestarts
        - RemovePodsViolatingNodeAffinity
        - RemovePodsViolatingNodeTaints
        - RemovePodsViolatingInterPodAntiAffinity

Here are my logs:

I1218 13:52:03.689278 1 profile.go:345] "Total number of pods evicted" extension point="Balance" evictedPods=0
I1218 13:52:03.689315 1 topologyspreadconstraint.go:122] Processing namespaces for topology spread constraints
I1218 13:52:03.690874 1 profile.go:345] "Total number of pods evicted" extension point="Balance" evictedPods=0
I1218 13:52:03.692279 1 nodeutilization.go:207] "Node is overutilized" node="vt-dev-kubw-01" usage={"cpu":"5821m","memory":"14051Mi","pods":"41"} usagePercentage={"cpu":72.76,"memory":43.91,"pods":37.27}
I1218 13:52:03.692386 1 nodeutilization.go:207] "Node is overutilized" node="vt-dev-kubw-02" usage={"cpu":"5900m","memory":"11720Mi","pods":"27"} usagePercentage={"cpu":73.75,"memory":36.63,"pods":24.55}
I1218 13:52:03.692407 1 nodeutilization.go:207] "Node is overutilized" node="vt-dev-kubw-04" usage={"cpu":"5730m","memory":"15912Mi","pods":"50"} usagePercentage={"cpu":71.63,"memory":49.71,"pods":45.45}
I1218 13:52:03.692423 1 nodeutilization.go:207] "Node is overutilized" node="vt-dev-kubw-05" usage={"cpu":"5910m","memory":"13116Mi","pods":"45"} usagePercentage={"cpu":73.88,"memory":40.97,"pods":40.91}
I1218 13:52:03.692440 1 nodeutilization.go:207] "Node is overutilized" node="vt-dev-kubw-07" usage={"cpu":"6440m","memory":"9378Mi","pods":"26"} usagePercentage={"cpu":80.5,"memory":29.3,"pods":23.64}
I1218 13:52:03.692639 1 nodeutilization.go:207] "Node is overutilized" node="vt-dev-kubw-08-baremetal" usage={"cpu":"8876m","memory":"13396Mi","pods":"39"} usagePercentage={"cpu":73.97,"memory":42.3,"pods":35.45}
I1218 13:52:03.692663 1 nodeutilization.go:207] "Node is overutilized" node="vt-dev-kubw-03" usage={"cpu":"6650m","memory":"13696Mi","pods":"33"} usagePercentage={"cpu":83.13,"memory":42.78,"pods":30}
I1218 13:52:03.692679 1 nodeutilization.go:207] "Node is overutilized" node="vt-dev-kubw-09" usage={"cpu":"5811m","memory":"10492Mi","pods":"34"} usagePercentage={"cpu":72.64,"memory":32.78,"pods":30.91}
I1218 13:52:03.692870 1 nodeutilization.go:207] "Node is overutilized" node="vt-dev-kubw-11" usage={"cpu":"5901m","memory":"9287Mi","pods":"39"} usagePercentage={"cpu":73.76,"memory":29.01,"pods":35.45}
I1218 13:52:03.692889 1 nodeutilization.go:207] "Node is overutilized" node="vt-dev-kubw-06" usage={"cpu":"6010m","memory":"10800Mi","pods":"39"} usagePercentage={"cpu":75.13,"memory":33.74,"pods":35.45}
I1218 13:52:03.692906 1 lownodeutilization.go:135] "Criteria for a node under utilization" CPU=60 Mem=35 Pods=30
I1218 13:52:03.692926 1 lownodeutilization.go:136] "Number of underutilized nodes" totalNumber=0
I1218 13:52:03.692942 1 lownodeutilization.go:149] "Criteria for a node above target utilization" CPU=70 Mem=45 Pods=40
I1218 13:52:03.692954 1 lownodeutilization.go:150] "Number of overutilized nodes" totalNumber=10
I1218 13:52:03.692967 1 lownodeutilization.go:153] "No node is underutilized, nothing to do here, you might tune your thresholds further"
I1218 13:52:03.692990 1 profile.go:345] "Total number of pods evicted" extension point="Balance" evictedPods=0
I1218 13:52:03.693006 1 descheduler.go:179] "Number of evicted pods" totalEvicted=0

I have 10+ pods with more than 100 restarts. I tried switching states (Pending, Running, CrashLoopBackOff, etc.), but there are no results.
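
To check which pods should be candidates, the restart counts can be listed next to the configured podRestartThreshold. The following is a minimal sketch, not part of the descheduler: `pods_over_restart_threshold` is a hypothetical helper that reads `kubectl get pods -A --no-headers` output and assumes the default column layout, where the fifth field is RESTARTS. That column is kubectl's own summary and may differ from the count the plugin computes, particularly with includingInitContainers: true.

```shell
# Hypothetical helper (not part of the descheduler): print pods whose
# RESTARTS column is greater than the given threshold.
# Input: `kubectl get pods -A --no-headers` output on stdin, assuming the
# default columns NAMESPACE NAME READY STATUS RESTARTS ... AGE.
pods_over_restart_threshold() {
  threshold="$1"
  awk -v t="$threshold" '($5 + 0) > (t + 0) { print $1 "/" $2 " restarts=" $5 }'
}

# Usage (requires cluster access):
#   kubectl get pods -A --no-headers | pods_over_restart_threshold 100
```

Any pod printed here that the descheduler still leaves alone is likely being excluded by a filter rather than by the restart threshold itself.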

@googs1025
Member

How many nodes do you have? The descheduler does not seem to work on a single node.

@googs1025
Member

I also reproduced this problem:
configmap

root@VM-0-16-ubuntu:/home/ubuntu# kubectl get cm my-release-descheduler -nkube-system -oyaml
apiVersion: v1
data:
  policy.yaml: |
    apiVersion: "descheduler/v1alpha2"
    kind: "DeschedulerPolicy"
    profiles:
    - name: default
      pluginConfig:
      - args:
          evictLocalStoragePods: true
          ignorePvcPods: true
        name: DefaultEvictor
      - name: RemoveDuplicates
      - args:
          includingInitContainers: true
          podRestartThreshold: 5
        name: RemovePodsHavingTooManyRestarts
      - args:
          nodeAffinityType:
          - requiredDuringSchedulingIgnoredDuringExecution
        name: RemovePodsViolatingNodeAffinity
      - name: RemovePodsViolatingNodeTaints
      - name: RemovePodsViolatingInterPodAntiAffinity
      - name: RemovePodsViolatingTopologySpreadConstraint
      - args:
          targetThresholds:
            cpu: 50
            memory: 50
            pods: 50
          thresholds:
            cpu: 20
            memory: 20
            pods: 20
        name: LowNodeUtilization
      plugins:
        balance:
          enabled:
          - RemoveDuplicates
          - RemovePodsViolatingTopologySpreadConstraint
          - LowNodeUtilization
        deschedule:
          enabled:
          - RemovePodsHavingTooManyRestarts
          - RemovePodsViolatingNodeTaints
          - RemovePodsViolatingNodeAffinity
          - RemovePodsViolatingInterPodAntiAffinity
root@VM-0-16-ubuntu:/home/ubuntu# kubectl get pods -A
NAMESPACE            NAME                                             READY   STATUS             RESTARTS          AGE
default              always-restart-pod                               0/1     CrashLoopBackOff   7 (2m34s ago)     13m
koordinator-system   koord-descheduler-78d8d897b-9k9rv                1/1     Running            0                 2d
koordinator-system   koord-descheduler-78d8d897b-cptg2                1/1     Running            0                 2d
koordinator-system   koord-manager-59d8669cd6-mv7hx                   1/1     Running            0                 2d
koordinator-system   koord-manager-59d8669cd6-rggzb                   1/1     Running            0                 2d
koordinator-system   koord-scheduler-5b84476b7d-95bfz                 1/1     Running            0                 2d
koordinator-system   koord-scheduler-5b84476b7d-gvfm2                 1/1     Running            0                 2d
koordinator-system   koordlet-2knv9                                   0/1     CrashLoopBackOff   571 (3m19s ago)   2d
koordinator-system   koordlet-8b5g7                                   0/1     CrashLoopBackOff   571 (3m23s ago)   2d
koordinator-system   koordlet-d4tjq                                   0/1     CrashLoopBackOff   571 (4m58s ago)   2d
kube-system          coredns-668d6bf9bc-489fm                         1/1     Running            0                 2d
kube-system          coredns-668d6bf9bc-5s62j                         1/1     Running            0                 2d
kube-system          etcd-cluster1-control-plane                      1/1     Running            0                 2d
kube-system          kindnet-c29wj                                    1/1     Running            0                 2d
kube-system          kindnet-djgkf                                    1/1     Running            0                 2d
kube-system          kindnet-w6fdm                                    1/1     Running            0                 2d
kube-system          kube-apiserver-cluster1-control-plane            1/1     Running            0                 2d
kube-system          kube-controller-manager-cluster1-control-plane   1/1     Running            0                 2d
kube-system          kube-proxy-4xlj7                                 1/1     Running            0                 2d
kube-system          kube-proxy-6bnd7                                 1/1     Running            0                 2d
kube-system          kube-proxy-zddnh                                 1/1     Running            0                 2d
kube-system          kube-scheduler-cluster1-control-plane            1/1     Running            0                 2d
kube-system          my-release-descheduler-28918520-47mvj            0/1     Completed          0                 4m51s
kube-system          my-release-descheduler-28918522-m795j            0/1     Completed          0                 2m51s
kube-system          my-release-descheduler-28918524-lw5lj            0/1     Completed          0                 51s
local-path-storage   local-path-provisioner-58cc7856b6-dst4z          1/1     Running            0                 2d
root@VM-0-16-ubuntu:/home/ubuntu# kubectl logs -f my-release-descheduler-28918522-m795j -nkube-system
I1225 07:22:01.306377       1 secure_serving.go:57] Forcing use of http/1.1 only
I1225 07:22:01.306971       1 named_certificates.go:53] "Loaded SNI cert" index=0 certName="self-signed loopback" certDetail="\"apiserver-loopback-client@1735111321\" [serving] validServingFor=[apiserver-loopback-client] issuer=\"apiserver-loopback-client-ca@1735111321\" (2024-12-25 06:22:01 +0000 UTC to 2025-12-25 06:22:01 +0000 UTC (now=2024-12-25 07:22:01.306945326 +0000 UTC))"
I1225 07:22:01.307011       1 secure_serving.go:213] Serving securely on [::]:10258
I1225 07:22:01.307027       1 tracing.go:87] Did not find a trace collector endpoint defined. Switching to NoopTraceProvider
I1225 07:22:01.307475       1 tlsconfig.go:243] "Starting DynamicServingCertificateController"
I1225 07:22:01.316720       1 envvar.go:172] "Feature gate default state" feature="WatchListClient" enabled=false
I1225 07:22:01.316753       1 envvar.go:172] "Feature gate default state" feature="InformerResourceVersion" enabled=false
I1225 07:22:01.316767       1 reflector.go:305] Starting reflector *v1.Pod (0s) from k8s.io/client-go/informers/factory.go:160
I1225 07:22:01.316773       1 reflector.go:341] Listing and watching *v1.Pod from k8s.io/client-go/informers/factory.go:160
I1225 07:22:01.316815       1 reflector.go:305] Starting reflector *v1.Namespace (0s) from k8s.io/client-go/informers/factory.go:160
I1225 07:22:01.316825       1 reflector.go:341] Listing and watching *v1.Namespace from k8s.io/client-go/informers/factory.go:160
I1225 07:22:01.316937       1 reflector.go:305] Starting reflector *v1.Node (0s) from k8s.io/client-go/informers/factory.go:160
I1225 07:22:01.316948       1 reflector.go:341] Listing and watching *v1.Node from k8s.io/client-go/informers/factory.go:160
I1225 07:22:01.317009       1 reflector.go:305] Starting reflector *v1.PriorityClass (0s) from k8s.io/client-go/informers/factory.go:160
I1225 07:22:01.317016       1 reflector.go:341] Listing and watching *v1.PriorityClass from k8s.io/client-go/informers/factory.go:160
I1225 07:22:01.318590       1 reflector.go:368] Caches populated for *v1.PriorityClass from k8s.io/client-go/informers/factory.go:160
I1225 07:22:01.318874       1 reflector.go:368] Caches populated for *v1.Node from k8s.io/client-go/informers/factory.go:160
I1225 07:22:01.319102       1 reflector.go:368] Caches populated for *v1.Namespace from k8s.io/client-go/informers/factory.go:160
I1225 07:22:01.329286       1 reflector.go:368] Caches populated for *v1.Pod from k8s.io/client-go/informers/factory.go:160
I1225 07:22:01.417573       1 descheduler.go:173] Setting up the pod evictor
I1225 07:22:01.417704       1 toomanyrestarts.go:116] "Processing node" node="cluster1-control-plane"
I1225 07:22:01.417801       1 toomanyrestarts.go:116] "Processing node" node="cluster1-worker"
I1225 07:22:01.417848       1 toomanyrestarts.go:116] "Processing node" node="cluster1-worker2"
I1225 07:22:01.417900       1 profile.go:317] "Total number of pods evicted" extension point="Deschedule" evictedPods=0
I1225 07:22:01.417922       1 node_taint.go:108] "Processing node" node="cluster1-control-plane"
I1225 07:22:01.417960       1 node_taint.go:108] "Processing node" node="cluster1-worker"
I1225 07:22:01.417994       1 node_taint.go:108] "Processing node" node="cluster1-worker2"
I1225 07:22:01.418019       1 profile.go:317] "Total number of pods evicted" extension point="Deschedule" evictedPods=0
I1225 07:22:01.418041       1 node_affinity.go:81] "Executing for nodeAffinityType" nodeAffinity="requiredDuringSchedulingIgnoredDuringExecution"
I1225 07:22:01.418049       1 node_affinity.go:121] "Processing node" node="cluster1-control-plane"
I1225 07:22:01.418092       1 node_affinity.go:121] "Processing node" node="cluster1-worker"
I1225 07:22:01.418117       1 node_affinity.go:121] "Processing node" node="cluster1-worker2"
I1225 07:22:01.418144       1 profile.go:317] "Total number of pods evicted" extension point="Deschedule" evictedPods=0
I1225 07:22:01.418173       1 pod_antiaffinity.go:93] "Processing node" node="cluster1-control-plane"
I1225 07:22:01.418182       1 pod_antiaffinity.go:93] "Processing node" node="cluster1-worker"
I1225 07:22:01.418189       1 pod_antiaffinity.go:93] "Processing node" node="cluster1-worker2"
I1225 07:22:01.418199       1 profile.go:317] "Total number of pods evicted" extension point="Deschedule" evictedPods=0
I1225 07:22:01.418210       1 removeduplicates.go:107] "Processing node" node="cluster1-control-plane"
I1225 07:22:01.418243       1 removeduplicates.go:107] "Processing node" node="cluster1-worker"
I1225 07:22:01.418266       1 removeduplicates.go:107] "Processing node" node="cluster1-worker2"
I1225 07:22:01.418290       1 profile.go:345] "Total number of pods evicted" extension point="Balance" evictedPods=0
I1225 07:22:01.418303       1 topologyspreadconstraint.go:122] Processing namespaces for topology spread constraints
I1225 07:22:01.418388       1 profile.go:345] "Total number of pods evicted" extension point="Balance" evictedPods=0
I1225 07:22:01.418452       1 nodeutilization.go:210] "Node is appropriately utilized" node="cluster1-control-plane" usage={"cpu":"950m","memory":"290Mi","pods":"10"} usagePercentage={"cpu":23.75,"memory":3.95,"pods":9.09}
I1225 07:22:01.418501       1 nodeutilization.go:207] "Node is overutilized" node="cluster1-worker" usage={"cpu":"2100m","memory":"1074Mi","pods":"7"} usagePercentage={"cpu":52.5,"memory":14.64,"pods":6.36}
I1225 07:22:01.418509       1 nodeutilization.go:210] "Node is appropriately utilized" node="cluster1-worker2" usage={"cpu":"1600m","memory":"818Mi","pods":"7"} usagePercentage={"cpu":40,"memory":11.15,"pods":6.36}
I1225 07:22:01.418519       1 lownodeutilization.go:135] "Criteria for a node under utilization" CPU=20 Mem=20 Pods=20
I1225 07:22:01.418534       1 lownodeutilization.go:136] "Number of underutilized nodes" totalNumber=0
I1225 07:22:01.418542       1 lownodeutilization.go:149] "Criteria for a node above target utilization" CPU=50 Mem=50 Pods=50
I1225 07:22:01.418551       1 lownodeutilization.go:150] "Number of overutilized nodes" totalNumber=1
I1225 07:22:01.418562       1 lownodeutilization.go:153] "No node is underutilized, nothing to do here, you might tune your thresholds further"
I1225 07:22:01.418578       1 profile.go:345] "Total number of pods evicted" extension point="Balance" evictedPods=0
I1225 07:22:01.418590       1 descheduler.go:179] "Number of evicted pods" totalEvicted=0
I1225 07:22:01.418747       1 reflector.go:311] Stopping reflector *v1.PriorityClass (0s) from k8s.io/client-go/informers/factory.go:160
I1225 07:22:01.418790       1 tlsconfig.go:258] "Shutting down DynamicServingCertificateController"
I1225 07:22:01.418849       1 secure_serving.go:258] Stopped listening on [::]:10258
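
The LowNodeUtilization lines in both sets of logs follow from how the thresholds are applied: a node counts as underutilized only when cpu, memory, and pods are all at or below `thresholds`, and as overutilized when any one of them exceeds `targetThresholds`. The sketch below reproduces that classification from the usagePercentage values in the logs. `classify_node` is a hypothetical helper; comparing rounded percentages, and the exact handling of values that sit right on a threshold, are assumptions rather than the plugin's actual code.

```shell
# Hypothetical helper: classify a node from its usage percentages.
# Arguments: cpu mem pods  (usage %),
#            cpu mem pods  (thresholds),
#            cpu mem pods  (targetThresholds).
classify_node() {
  awk -v cpu="$1" -v mem="$2" -v pods="$3" \
      -v lc="$4" -v lm="$5" -v lp="$6" \
      -v hc="$7" -v hm="$8" -v hp="$9" 'BEGIN {
    # Underutilized: every resource is at or below its threshold.
    if (cpu + 0 <= lc + 0 && mem + 0 <= lm + 0 && pods + 0 <= lp + 0)
      print "underutilized"
    # Overutilized: at least one resource is above its target threshold.
    else if (cpu + 0 > hc + 0 || mem + 0 > hm + 0 || pods + 0 > hp + 0)
      print "overutilized"
    else
      print "appropriately utilized"
  }'
}

# vt-dev-kubw-01 with thresholds 60/35/30 and targets 70/45/40:
classify_node 72.76 43.91 37.27 60 35 30 70 45 40
# cluster1-control-plane with thresholds 20/20/20 and targets 50/50/50:
classify_node 23.75 3.95 9.09 20 20 20 50 50 50
```

With no node underutilized there is nowhere to move pods, which is why the Balance step reports zero evictions. This is separate from the RemovePodsHavingTooManyRestarts result reported under the Deschedule extension point.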

@googs1025
Member

@ingvagabund @a7i /PTAL Is there something wrong with the configuration?

@googs1025
Member

install step

helm repo add descheduler https://kubernetes-sigs.github.io/descheduler/
helm install my-release --namespace kube-system descheduler/descheduler

@daniglyk
Author

> How many nodes do you have? The descheduler does not seem to work on a single node.

11 workers

@daniglyk
Author

> install step
>
> helm repo add descheduler https://kubernetes-sigs.github.io/descheduler/
> helm install my-release --namespace kube-system descheduler/descheduler

Is this the solution?

@ingvagabund
Contributor

@googs1025 would you please increase the log level to at least 4? That should show whether any of the filters in https://github.com/kubernetes-sigs/descheduler/blob/master/pkg/framework/plugins/defaultevictor/defaultevictor.go#L262 are being violated.
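
Once the descheduler runs at the higher verbosity, the evictor's messages can be separated from the rest of the output by the source file name that klog prints in each line. This is a minimal sketch: `evictor_log_lines` is a hypothetical helper, and it assumes only that the messages of interest come from defaultevictor.go, the file linked above. It makes no assumption about their wording.

```shell
# Hypothetical helper: keep only klog lines emitted from defaultevictor.go.
# klog prefixes each message with "<file>.go:<line>]", as in the logs above.
evictor_log_lines() {
  # `|| true` keeps the exit status zero when nothing matches.
  grep -E '[[:space:]]defaultevictor\.go:[0-9]+\]' || true
}

# Usage (requires cluster access; the pod name is taken from the thread):
#   kubectl logs my-release-descheduler-28918522-m795j -n kube-system | evictor_log_lines
```

Empty output at log level 4 or higher would suggest the evictor filters are not what is rejecting the pods.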
