Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RemovePodsViolatingTopologySpreadConstraint does not work for cluster-level default constraints #1502

Open
odsevs opened this issue Aug 27, 2024 · 4 comments
Labels
kind/feature Categorizes issue or PR as related to a new feature.

Comments

@odsevs
Copy link

odsevs commented Aug 27, 2024

What version of descheduler are you using?

descheduler version: 0.30.1

Does this issue reproduce with the latest release?

yes

Which descheduler CLI options are you using?

I installed the helm chart as-is with the values described below.

Please provide a copy of your descheduler policy config file

# Default values for descheduler.
# This is a YAML-formatted file.
# Declare variables to be passed into your templates.

# CronJob or Deployment
kind: Deployment

image:
  repository: registry.k8s.io/descheduler/descheduler
  # Overrides the image tag whose default is the chart version
  tag: ""
  pullPolicy: IfNotPresent

imagePullSecrets:
#   - name: container-registry-secret

resources:
  requests:
    cpu: 500m
    memory: 256Mi
  # limits:
  #   cpu: 100m
  #   memory: 128Mi

securityContext:
  allowPrivilegeEscalation: false
  capabilities:
    drop:
      - ALL
  privileged: false
  readOnlyRootFilesystem: true
  runAsNonRoot: true
  runAsUser: 1000

# podSecurityContext -- [Security context for pod](https://kubernetes.io/docs/tasks/configure-pod-container/security-context/)
podSecurityContext: {}
  # fsGroup: 1000

nameOverride: ""
fullnameOverride: ""

# labels that'll be applied to all resources
commonLabels: {}

cronJobApiVersion: "batch/v1"
schedule: "*/2 * * * *"
suspend: false
# startingDeadlineSeconds: 200
# successfulJobsHistoryLimit: 3
# failedJobsHistoryLimit: 1
# ttlSecondsAfterFinished 600
# timeZone: Etc/UTC

# Required when running as a Deployment
deschedulingInterval: 1m

# Specifies the replica count for Deployment
# Set leaderElection if you want to use more than 1 replica
# Set affinity.podAntiAffinity rule if you want to schedule onto a node
# only if that node is in the same zone as at least one already-running descheduler
replicas: 1

# Specifies whether Leader Election resources should be created
# Required when running as a Deployment
# NOTE: Leader election can't be activated if DryRun enabled
leaderElection: {}
#  enabled: true
#  leaseDuration: 15s
#  renewDeadline: 10s
#  retryPeriod: 2s
#  resourceLock: "leases"
#  resourceName: "descheduler"
#  resourceNamescape: "kube-system"

command:
- "/bin/descheduler"

cmdOptions:
  v: 3

# Recommended to use the latest Policy API version supported by the Descheduler app version
deschedulerPolicyAPIVersion: "descheduler/v1alpha2"

# deschedulerPolicy contains the policies the descheduler will execute.
# To use policies stored in an existing configMap use:
# NOTE: The name of the cm should comply to {{ template "descheduler.fullname" . }}
# deschedulerPolicy: {}
deschedulerPolicy:
  # nodeSelector: "key1=value1,key2=value2"
  # maxNoOfPodsToEvictPerNode: 10
  # maxNoOfPodsToEvictPerNamespace: 10
  # ignorePvcPods: true
  # evictLocalStoragePods: true
  # evictDaemonSetPods: true
  # tracing:
  #   collectorEndpoint: otel-collector.observability.svc.cluster.local:4317
  #   transportCert: ""
  #   serviceName: ""
  #   serviceNamespace: ""
  #   sampleRate: 1.0
  #   fallbackToNoOpProviderOnError: true
  profiles:
    - name: default
      pluginConfig:
        - name: DefaultEvictor
          args:
            ignorePvcPods: true
            evictLocalStoragePods: false
        - name: RemoveDuplicates
        - name: RemovePodsHavingTooManyRestarts
          args:
            podRestartThreshold: 100
            includingInitContainers: true
        - name: RemovePodsViolatingNodeAffinity
          args:
            nodeAffinityType:
            - requiredDuringSchedulingIgnoredDuringExecution
        - name: RemovePodsViolatingNodeTaints
        - name: RemovePodsViolatingInterPodAntiAffinity
        - name: RemovePodsViolatingTopologySpreadConstraint
        - name: LowNodeUtilization
          args:
            thresholds:
              cpu: 65
              memory: 60
              pods: 30
            targetThresholds:
              cpu: 85
              memory: 80
              pods: 40
      plugins:
        balance:
          enabled:
#            - RemoveDuplicates
            - RemovePodsViolatingTopologySpreadConstraint
#            - LowNodeUtilization
        deschedule:
          enabled:
#            - RemovePodsHavingTooManyRestarts
#            - RemovePodsViolatingNodeTaints
#            - RemovePodsViolatingNodeAffinity
#            - RemovePodsViolatingInterPodAntiAffinity

priorityClassName: system-cluster-critical

nodeSelector: {}
#  foo: bar

affinity: {}
# nodeAffinity:
#   requiredDuringSchedulingIgnoredDuringExecution:
#     nodeSelectorTerms:
#     - matchExpressions:
#       - key: kubernetes.io/e2e-az-name
#         operator: In
#         values:
#         - e2e-az1
#         - e2e-az2
#  podAntiAffinity:
#    requiredDuringSchedulingIgnoredDuringExecution:
#      - labelSelector:
#          matchExpressions:
#            - key: app.kubernetes.io/name
#              operator: In
#              values:
#                - descheduler
#        topologyKey: "kubernetes.io/hostname"
topologySpreadConstraints: []
# - maxSkew: 1
#   topologyKey: kubernetes.io/hostname
#   whenUnsatisfiable: DoNotSchedule
#   labelSelector:
#     matchLabels:
#       app.kubernetes.io/name: descheduler
tolerations: []
# - key: 'management'
#   operator: 'Equal'
#   value: 'tool'
#   effect: 'NoSchedule'

rbac:
  # Specifies whether RBAC resources should be created
  create: true

serviceAccount:
  # Specifies whether a ServiceAccount should be created
  create: true
  # The name of the ServiceAccount to use.
  # If not set and create is true, a name is generated using the fullname template
  name:
  # Specifies custom annotations for the serviceAccount
  annotations: {}

podAnnotations: {}

podLabels: {}

dnsConfig: {}

livenessProbe:
  failureThreshold: 3
  httpGet:
    path: /healthz
    port: 10258
    scheme: HTTPS
  initialDelaySeconds: 3
  periodSeconds: 10

service:
  enabled: false
  # @param service.ipFamilyPolicy [string], support SingleStack, PreferDualStack and RequireDualStack
  #
  ipFamilyPolicy: ""
  # @param service.ipFamilies [array] List of IP families (e.g. IPv4, IPv6) assigned to the service.
  # Ref: https://kubernetes.io/docs/concepts/services-networking/dual-stack/
  # E.g.
  # ipFamilies:
  #   - IPv6
  #   - IPv4
  ipFamilies: []

serviceMonitor:
  enabled: false
  # The namespace where Prometheus expects to find service monitors.
  # namespace: ""
  # Add custom labels to the ServiceMonitor resource
  additionalLabels: {}
    # prometheus: kube-prometheus-stack
  interval: ""
  # honorLabels: true
  insecureSkipVerify: true
  serverName: null
  metricRelabelings: []
    # - action: keep
    #   regex: 'descheduler_(build_info|pods_evicted)'
    #   sourceLabels: [__name__]
  relabelings: []
    # - sourceLabels: [__meta_kubernetes_pod_node_name]
    #   separator: ;
    #   regex: ^(.*)$
    #   targetLabel: nodename
    #   replacement: $1
    #   action: replace

What k8s version are you using (kubectl version)?

kubectl version v1.28.11
$ kubectl version
```
Client Version: v1.28.11+rke2r1
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.28.11+rke2r1
```

What did you do?

I defined a default topology spread constraint for the cluster, as described here: https://kubernetes.io/docs/concepts/scheduling-eviction/topology-spread-constraints/#cluster-level-default-constraints

apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
clientConnection:
  kubeconfig: /var/lib/rancher/rke2/server/cred/scheduler.kubeconfig
profiles:
  - schedulerName: default-scheduler
    pluginConfig:
      - name: PodTopologySpread
        args:
          defaultConstraints:
            - maxSkew: 1
              topologyKey: kubernetes.io/hostname
              whenUnsatisfiable: DoNotSchedule
              nodeTaintsPolicy: Honor
          defaultingType: List

I then:

  1. Cordoned one of the three nodes in the cluster
  2. Created a deployment of 12 pods. Thanks to the KubeSchedulerConfiguration, I got 6 pods on each of the other two nodes.
  3. Uncordoned the node
  4. I installed descheduler with the config provided above

What did you expect to see?
Descheduler kicks in to enforce the default pod topology constraint

What did you see instead?
Descheduler does nothing.
In the logs:

I0827 15:15:31.256918       1 descheduler.go:156] Building a pod evictor
I0827 15:15:31.257002       1 topologyspreadconstraint.go:122] Processing namespaces for topology spread constraints
I0827 15:15:31.257732       1 profile.go:349] "Total number of pods evicted" extension point="Balance" evictedPods=0
I0827 15:15:31.257776       1 descheduler.go:170] "Number of evicted pods" totalEvicted=0
@odsevs odsevs added the kind/bug Categorizes issue or PR as related to a bug. label Aug 27, 2024
@damemi
Copy link
Contributor

damemi commented Aug 27, 2024

Hi, this is working as intended. The descheduler does not know about cluster-level default constraints because it does not have access to the cluster's KubeSchedulerConfiguration.

@damemi
Copy link
Contributor

damemi commented Aug 27, 2024

/remove-kind bug
/kind feature

@k8s-ci-robot k8s-ci-robot added kind/feature Categorizes issue or PR as related to a new feature. and removed kind/bug Categorizes issue or PR as related to a bug. labels Aug 27, 2024
@odsevs
Copy link
Author

odsevs commented Aug 28, 2024

Thanks for the reply.
It's a strange design decision that the KubeSchedulerConfiguration is not accessible from within the cluster, but indeed in that case, there is not much you can do. I assume, if this would become a feature, that this configuration would unfortunately need to be duplicated between the kube-scheduler and the descheduler.

@damemi
Copy link
Contributor

damemi commented Aug 28, 2024

Yes, exactly. This has been discussed before if you'd like some more info: #609

Essentially the KubeSchedulerConfiguration isn't an actual API object, it's just mounted to the scheduler pod. Users can also deploy custom or multiple schedulers so there isn't a good way to automatically know which default to go with.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/feature Categorizes issue or PR as related to a new feature.
Projects
None yet
Development

No branches or pull requests

3 participants