Swap pods to reduce fragmentation #1519

Open
jinglinliang opened this issue Sep 17, 2024 · 3 comments
Labels
lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale.

Comments

@jinglinliang

Some of our clusters run a few small deployments with anti-affinity, and these cause a lot of fragmentation. Here's a snapshot of one cluster:
[image: cluster snapshot]
The blue deployment has anti-affinity, and the cluster ended up in this state after the blue deployment restarted.

I'm looking into ways to alleviate this situation.

First, cluster autoscaler (CAS) is not scaling down those low utilization nodes because none of the blue pods can fit into the rest of the nodes, which are pretty much fully packed.
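To make the scale-down blocker concrete, here is a minimal toy model of the situation described above. It is only a sketch: `Node`, `can_host`, and `removable` are hypothetical names, the greedy first-fit check is a simplification, and the real Cluster Autoscaler simulation is considerably more involved.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    capacity: int                               # allocatable cores (toy unit)
    pods: list = field(default_factory=list)    # list of (deployment, cores)

    def free(self):
        return self.capacity - sum(cores for _, cores in self.pods)


def can_host(node, pod, anti_affinity):
    """Check capacity and a simplified 'one pod per node' anti-affinity rule."""
    deployment, cores = pod
    if cores > node.free():
        return False
    if deployment in anti_affinity and any(d == deployment for d, _ in node.pods):
        return False
    return True


def removable(node, others, anti_affinity):
    """True if every pod on `node` fits on some other node (greedy first fit)."""
    trial = [Node(n.name, n.capacity, list(n.pods)) for n in others]  # work on copies
    for pod in node.pods:
        target = next((n for n in trial if can_host(n, pod, anti_affinity)), None)
        if target is None:
            return False
        target.pods.append(pod)
    return True


# Two nearly empty nodes holding one blue pod each, plus one fully packed node.
a = Node("node-a", 8, [("blue", 2)])
b = Node("node-b", 8, [("blue", 2)])
c = Node("node-c", 8, [("green", 4), ("green", 4)])

print(removable(a, [b, c], anti_affinity={"blue"}))  # False: b already has blue, c is full
print(removable(a, [b, c], anti_affinity=set()))     # True: without anti-affinity, b has room
```

In this toy cluster the low-utilization node cannot be drained purely because of the anti-affinity rule, which matches the behavior observed above.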

I came across the HighNodeUtilization and LowNodeUtilization plugins in the descheduler, but it looks like their eviction logic is similar to CAS.

I'm wondering if it's possible to implement, or use existing, descheduler plugins to achieve some kind of swap function that swaps the blue pods with the non-anti-affinity pods on other nodes, so that each of the fully packed nodes ends up with one blue pod. The swapped-out non-anti-affinity pods could then be packed onto far fewer nodes.
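As a rough illustration of what such a swap could save, the sketch below compares node counts for a fragmented layout and a swapped layout. Everything here is an assumption made for illustration: the function names are hypothetical, pods are sized in whole cores, and packing is plain first-fit decreasing rather than anything the scheduler actually does.

```python
def first_fit(pod_sizes, used_per_node, capacity):
    """Place pods largest-first into the first node with room, opening new nodes as needed.

    `used_per_node` lists the cores already used on existing nodes; a new list is returned.
    """
    nodes = list(used_per_node)
    for size in sorted(pod_sizes, reverse=True):
        for i, used in enumerate(nodes):
            if used + size <= capacity:
                nodes[i] = used + size
                break
        else:
            nodes.append(size)  # no existing node had room
    return nodes


def nodes_fragmented(blue_count, green_sizes, capacity):
    """Each blue pod sits alone on its own node; green pods are packed elsewhere."""
    return blue_count + len(first_fit(green_sizes, [], capacity))


def nodes_after_swap(blue_count, blue_size, green_sizes, capacity):
    """One blue pod per node, with green pods filling the remaining space."""
    return len(first_fit(green_sizes, [blue_size] * blue_count, capacity))


# Toy numbers: 8-core nodes, 4 blue pods and 12 green pods of 2 cores each.
greens = [2] * 12
print(nodes_fragmented(4, greens, capacity=8))       # 7
print(nodes_after_swap(4, 2, greens, capacity=8))    # 4
```

With these toy numbers the swapped layout needs 4 nodes instead of 7, which is the kind of saving the swap idea is after.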

Any ideas are appreciated!

@ingvagabund
Contributor

ingvagabund commented Sep 22, 2024

Hi @jinglinliang. The issue description looks quite awesome. I love the snapshot picture. I wish there were more such reports :).

With either of the node utilization strategies, it's ultimately up to the scheduler to make the switch. The descheduler plugins might evict some of the non-anti-affinity pods, but those pods first need to get scheduled onto the nodes where the blue pods are. Running LowNodeUtilization might help with that, depending on the pods' priorities and pre-eviction filters. Once enough space has been freed, running HighNodeUtilization might evict some of the blue pods. It's a kind of "shaking the nodes" in the hope that the scheduler will redistribute the pods towards what's requested here. It is an iterative process, though, and success is not guaranteed. To provide a real swap, kubelets would need the ability to allocate slots.

I presume preemption and priorities do not help, since both blue and green pods have the same or very similar priority? I.e. the HighNodeUtilization plugin (with nodeFit disabled) evicting blue pods and the kube-scheduler preempting green pods to free space for the blue ones.

With the profiles you can configure something like:

apiVersion: "descheduler/v1alpha2"
kind: "DeschedulerPolicy"
profiles:
  - name: Round1Low
    # timeout: 1m or any reasonable number that takes into account number of nodes/pods in a cluster.
    pluginConfig:
    - name: "LowNodeUtilization"
      args:
        thresholds:
          "memory": 20
        targetThresholds:
          "memory": 70
    - name: "DefaultEvictor"
      args:
        ... # evict only green pods
    plugins:
      balance:
        enabled:
          - "LowNodeUtilization"
  - name: Round1High
    # timeout: 1m or any reasonable number that takes into account number of nodes/pods in a cluster.
    pluginConfig:
    - name: "HighNodeUtilization"
      args:
        thresholds:
          "memory": 20
    - name: "DefaultEvictor"
      args:
        ... # evict only blue pods
    plugins:
      balance:
        enabled:
          - "HighNodeUtilization"
  - name: Round2Low
    # timeout: 1m or any reasonable number that takes into account number of nodes/pods in a cluster.
    pluginConfig:
    - name: "LowNodeUtilization"
      args:
        thresholds:
          "memory": 20
        targetThresholds:
          "memory": 70
    - name: "DefaultEvictor"
      args:
        ... # evict only green pods
    plugins:
      balance:
        enabled:
          - "LowNodeUtilization"
  - name: Round2High
    pluginConfig:
    - name: "HighNodeUtilization"
      args:
        thresholds:
          "memory": 20
    - name: "DefaultEvictor"
      args:
        ... # evict only blue pods
    plugins:
      balance:
        enabled:
          - "HighNodeUtilization"
...

The idea is to perform the shaking multiple times. However, the current descheduler evicts pods quite quickly, so we'd have to implement a timeout between profiles that waits, e.g., 1 minute (user-configurable) before "shaking the nodes" again.
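The pause between profiles could look roughly like the loop below. This is purely a sketch of the control flow, not descheduler code: `shake`, `run_profile`, and `pause_seconds` are hypothetical names, and `run_profile` stands in for whatever would execute one profile and report how many pods it evicted.

```python
import time


def shake(profiles, run_profile, pause_seconds=60.0, sleep=time.sleep):
    """Run profiles in order, waiting between them so the scheduler can re-place pods.

    `run_profile(name)` is a hypothetical callable returning the number of evicted pods.
    Returns the total number of evictions across all rounds.
    """
    profiles = list(profiles)
    total_evicted = 0
    for i, name in enumerate(profiles):
        total_evicted += run_profile(name)
        if i < len(profiles) - 1:
            sleep(pause_seconds)  # give evicted pods time to be rescheduled
    return total_evicted


# Usage with a stand-in runner and a recorded (not real) sleep:
pauses = []
rounds = ["Round1Low", "Round1High", "Round2Low", "Round2High"]
evicted = shake(rounds, run_profile=lambda name: 1, pause_seconds=60, sleep=pauses.append)
print(evicted, pauses)  # 4 [60, 60, 60]
```

Injecting `sleep` keeps the loop testable without waiting; in a real implementation the wait would more likely be a per-profile setting like the commented-out `timeout` in the policy above.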

@jinglinliang
Author

Hi @ingvagabund. Thank you very much for the reply :)

Some clarifications:

  1. Priority does not help here. All deployments have the same priority.
  2. Our goal is to pack each of our clusters as tightly as possible, preferably with ~0 fragmented cores.
  3. The "blue" and "green" pods here are just examples. We have hundreds of deployments spread across thousands of clusters, so we need the descheduler configuration to be generic.

"Shaking the nodes" is an interesting idea but seems very unpredictable.

    - name: "DefaultEvictor"
      args:
        ... # evict only blue pods

It would be difficult to define the "blue" or "green" pods here. Also, the cluster may just settle into a stable state under these profiles, for example when all nodes are 50% allocated, with the total number of nodes staying the same as in the snapshot. (Please correct me if I'm wrong.)

Another idea we had is to set the HighNodeUtilization threshold, or similarly Cluster Autoscaler's scale-down threshold, to 100%, so that the non-blue nodes on the right side are torn down and their pods stack on top of the blue ones. However, this could cause a lot of unnecessary pod disruption.
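The stable-state concern can be illustrated with the thresholds from the profiles above (20 and 70 for memory). The sketch below is a simplified reading of how the two plugins classify nodes, using a single resource; the function names are hypothetical and the real plugins consider several resources and additional filters.

```python
def classify(utilization_pct, threshold, target_threshold):
    """Simplified single-resource classification in the style of LowNodeUtilization."""
    if utilization_pct < threshold:
        return "underutilized"
    if utilization_pct > target_threshold:
        return "overutilized"
    return "appropriately utilized"


def low_node_utilization_acts(utilizations, threshold=20, target_threshold=70):
    """Assumed rule: evict from overutilized nodes only if underutilized nodes exist."""
    labels = [classify(u, threshold, target_threshold) for u in utilizations]
    return "underutilized" in labels and "overutilized" in labels


def high_node_utilization_acts(utilizations, threshold=20):
    """Assumed rule: evict only from nodes below the threshold."""
    return any(u < threshold for u in utilizations)


# Every node at 50% allocation: neither plugin has anything to do.
steady = [50] * 6
print(low_node_utilization_acts(steady))    # False
print(high_node_utilization_acts(steady))   # False
```

Under these assumptions, a cluster where every node sits between the two thresholds is a fixed point: no round of shaking evicts anything, and the node count stays where it is.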

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label on Dec 23, 2024