Kubernetes v1.36 Beta: Dynamically Adjusting Pod Resources for Suspended Jobs

From Michili, the free encyclopedia of technology

Introduction

Kubernetes v1.36 introduces a powerful beta feature that allows modifying container resource requests and limits in the pod template of a suspended Job. This capability, initially released as alpha in v1.35, empowers queue controllers and cluster administrators to fine-tune CPU, memory, GPU, and extended resource specifications on a Job while it remains suspended—before it starts or resumes execution. By eliminating the need to recreate Jobs for resource adjustments, this feature greatly improves operational flexibility in dynamic cluster environments.

Why Mutable Pod Resources Matter

Batch and machine learning workloads often face uncertain resource requirements at Job creation time. The optimal allocation depends on real-time cluster capacity, queue priorities, and the availability of specialized hardware such as GPUs. Previously, once a Job's pod template resource fields were set, they became immutable—any change required deleting and recreating the entire Job, which caused loss of metadata, status, and history. For queue controllers like Kueue, this was a significant limitation.

With the new beta feature, queue controllers can now:

  • Adjust resource allocations for suspended Jobs based on current cluster load.
  • Avoid losing Job metadata or history when scaling resources.
  • Enable CronJob instances to run with reduced resources under heavy load instead of failing entirely.
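The sizing decision behind these bullet points can be sketched in a few lines. The following is a hypothetical illustration of a queue controller's logic, not code from Kueue or any real controller; the function name `scale_requests` and the proportional-scaling policy are assumptions made for this example.

```python
# Hypothetical sketch of a queue controller's sizing decision: scale a
# suspended Job's per-container requests down to fit available GPU capacity.
# The proportional-scaling policy here is illustrative, not a real API.

def scale_requests(requested_gpus: int, available_gpus: int,
                   cpu: int, memory_gib: int) -> dict:
    """Return resource requests scaled to the GPUs actually available."""
    if available_gpus >= requested_gpus:
        factor = 1.0  # enough capacity: keep the original request
    else:
        factor = available_gpus / requested_gpus
    return {
        "cpu": str(int(cpu * factor)),
        "memory": f"{int(memory_gib * factor)}Gi",
        "example-hardware-vendor.com/gpu": str(min(requested_gpus, available_gpus)),
    }

# The Job below asks for 4 GPUs / 8 CPU / 32Gi, but only 2 GPUs are free:
print(scale_requests(requested_gpus=4, available_gpus=2, cpu=8, memory_gib=32))
# → {'cpu': '4', 'memory': '16Gi', 'example-hardware-vendor.com/gpu': '2'}
```

A real controller would feed the result into a Job update while the Job remains suspended, as shown in the manifests that follow.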

Consider a machine learning training Job initially requesting 4 GPUs:

apiVersion: batch/v1
kind: Job
metadata:
  name: training-job-example-abcd123
  labels:
    app.kubernetes.io/name: trainer
spec:
  suspend: true
  template:
    metadata:
      annotations:
        kubernetes.io/description: "ML training, ID abcd123"
    spec:
      containers:
      - name: trainer
        image: example-registry.example.com/training:2026-04-23T150405.678
        resources:
          requests:
            cpu: "8"
            memory: "32Gi"
            example-hardware-vendor.com/gpu: "4"
          limits:
            cpu: "8"
            memory: "32Gi"
            example-hardware-vendor.com/gpu: "4"
      restartPolicy: Never

A queue controller evaluating cluster capacity might discover only 2 GPUs available. With this feature, it can update the Job’s resource requests before resuming:

apiVersion: batch/v1
kind: Job
metadata:
  name: training-job-example-abcd123
  labels:
    app.kubernetes.io/name: trainer
spec:
  suspend: true
  template:
    metadata:
      annotations:
        kubernetes.io/description: "ML training, ID abcd123"
    spec:
      containers:
      - name: trainer
        image: example-registry.example.com/training:2026-04-23T150405.678
        resources:
          requests:
            cpu: "4"
            memory: "16Gi"
            example-hardware-vendor.com/gpu: "2"
          limits:
            cpu: "4"
            memory: "16Gi"
            example-hardware-vendor.com/gpu: "2"
      restartPolicy: Never

Once updated, the controller resumes the Job by setting spec.suspend to false, and new Pods are created with the adjusted resource specifications.
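Concretely, a controller could express this update as a JSON Patch against the Job object. The sketch below only builds the patch document; actually sending it (for example via `kubectl patch --type=json` or a client library) is omitted, and the helper name `build_resize_and_resume_patch` is invented for this example. Note the `~1` escape for the `/` in the extended resource name, per JSON Pointer rules.

```python
# Minimal sketch: build a JSON Patch that lowers the first container's
# requests/limits and flips spec.suspend to false. Paths follow standard
# Kubernetes JSON Patch conventions; "/" inside the extended resource name
# must be escaped as "~1" (JSON Pointer, RFC 6901).
import json

def build_resize_and_resume_patch(cpu: str, memory: str, gpus: str) -> str:
    ops = []
    for field in ("requests", "limits"):
        base = f"/spec/template/spec/containers/0/resources/{field}"
        ops += [
            {"op": "replace", "path": f"{base}/cpu", "value": cpu},
            {"op": "replace", "path": f"{base}/memory", "value": memory},
            {"op": "replace",
             "path": f"{base}/example-hardware-vendor.com~1gpu", "value": gpus},
        ]
    # The resource update must be accepted while the Job is still suspended;
    # the resume op is appended here in one document for brevity.
    ops.append({"op": "replace", "path": "/spec/suspend", "value": False})
    return json.dumps(ops)

print(build_resize_and_resume_patch("4", "16Gi", "2"))
```

In practice a controller may prefer two separate updates, resizing first and resuming only after the resize is persisted.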

How It Works

Under the hood, the Kubernetes API server relaxes the immutability constraint on pod template resource fields—but only for suspended Jobs. No new API types were introduced; instead, the existing Job and pod template structures accommodate this change through a targeted relaxation of validation logic.

Implementation Details

When a Job is suspended (spec.suspend: true), the API server now allows updates to spec.template.spec.containers[*].resources.requests and spec.template.spec.containers[*].resources.limits. These modifications are applied before the Job resumes, ensuring that newly created Pods use the updated resource profile. Because the feature is beta in v1.36, it is enabled by default, so no explicit opt-in is required to use it.

Practical Benefits

  • No Job recreation: Adjust resources without losing Job history or associated metadata.
  • Graceful degradation: CronJob instances can continue running with reduced resources instead of failing under load.
  • Better scheduling: Queue controllers can optimize resource utilization across the cluster in real time.

This enhancement is particularly valuable for batch processing, ML training pipelines, and any environment where resource demands fluctuate. For more details, refer to the Kubernetes Job documentation.

Conclusion

The mutable pod resources feature for suspended Jobs in Kubernetes v1.36 (beta) marks a significant improvement in workload management. By enabling dynamic resource adjustments without Job recreation, it reduces operational overhead and increases cluster efficiency. Operators and developers using batch or ML workloads should evaluate this capability to simplify their resource orchestration strategies.