Running AI Workloads on Kubernetes at Scale
Why Kubernetes for AI?
Kubernetes has become the de facto orchestration platform for AI workloads in production. While Slurm dominates research clusters, Kubernetes excels when you need to run training, inference, and traditional microservices on shared infrastructure with unified observability and deployment pipelines. The ecosystem of operators, custom resources, and integrations makes it possible to build a complete AI platform on top of Kubernetes.
That said, running GPU workloads on Kubernetes requires careful configuration. The default scheduler does not understand GPU topology, memory constraints, or the communication patterns of distributed training. This guide covers the essential components and configurations for production-grade AI on Kubernetes.
GPU Device Plugin and Resource Management
The foundation is the NVIDIA GPU Operator, which automates the deployment of GPU drivers, the container toolkit, device plugins, and DCGM monitoring. Install it via Helm:
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator \
  --create-namespace \
  --set driver.enabled=true \
  --set toolkit.enabled=true \
  --set devicePlugin.enabled=true \
  --set dcgmExporter.enabled=true \
  --set migManager.enabled=true
Once deployed, GPUs appear as nvidia.com/gpu resources that pods can request. For inference workloads that do not need a full GPU, enable MIG partitioning or time-slicing to improve utilization. For training, always allocate whole GPUs to avoid noisy-neighbor performance degradation.
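As a quick sanity check after installing the operator, a minimal pod that requests a single whole GPU might look like this (the pod name and CUDA image tag are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test                  # illustrative name
spec:
  restartPolicy: Never
  containers:
    - name: cuda
      image: nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04   # example CUDA base image
      command: ["nvidia-smi"]           # prints the visible GPU if allocation worked
      resources:
        limits:
          nvidia.com/gpu: 1             # whole GPU; MIG slices appear as separate resource names
```

Note that `nvidia.com/gpu` must be set under `limits` (the request is set to match automatically), and GPU resources cannot be overcommitted.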
Scheduling Strategies for GPU Pods
The default Kubernetes scheduler assigns GPUs without considering topology. For distributed training, this means your pods might land on nodes connected by different network switches, introducing latency in collective operations. To solve this, implement topology-aware scheduling:
- Node affinity rules: Use labels to group nodes by rack, switch, or InfiniBand subnet. Schedule multi-node training jobs on nodes within the same failure domain.
- Topology Spread Constraints: Ensure pods in a training job are spread across nodes but not across racks, balancing fault tolerance with network performance.
- Gang scheduling: Use Volcano or Kueue to ensure all pods in a distributed training job are scheduled simultaneously. Without gang scheduling, partial allocations waste GPU resources while waiting for the remaining pods.
- Priority and preemption: Define PriorityClasses so that production inference workloads can preempt batch training jobs when resources are scarce.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: inference-critical
value: 1000000
globalDefault: false
description: "Priority class for production inference endpoints"
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: training-batch
value: 100000
globalDefault: false
preemptionPolicy: PreemptLowerPriority
description: "Priority class for batch training jobs"
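To illustrate the gang-scheduling point above, here is a sketch of a Volcano PodGroup that admits a distributed training job all-or-nothing (assumes Volcano is installed; the group name and member count are illustrative):

```yaml
apiVersion: scheduling.volcano.sh/v1beta1
kind: PodGroup
metadata:
  name: ddp-training              # illustrative name for an 8-worker job
spec:
  minMember: 8                    # no pod starts until all 8 can be placed
  priorityClassName: training-batch
```

Worker pods opt in by setting `schedulerName: volcano` in their spec; until all `minMember` pods are schedulable, Volcano holds the entire group rather than binding a partial set that would idle GPUs.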
Serving Inference Endpoints
For serving LLMs and other models, we recommend KServe or Triton Inference Server deployed as Kubernetes services. Key configurations include:
- Horizontal Pod Autoscaler (HPA): Scale inference pods based on custom metrics like request queue depth or GPU utilization, not just CPU. Use KEDA for more flexible scaling triggers.
- Resource requests and limits: Always set both CPU and GPU requests. Size memory requests to the model footprint plus a buffer of roughly 20%; keep in mind that for GPU-resident models the KV cache grows with concurrent requests and sequence length, so leave headroom beyond the weights alone.
- Readiness probes: Configure model-loading health checks. Large models can take 2-5 minutes to load into GPU memory; the pod should not receive traffic until loading completes.
- Model weight caching: Cache model weights on fast local NVMe storage (exposed via local PersistentVolumes or hostPath) and use a DaemonSet that pre-pulls models onto each node, reducing cold start times from minutes to seconds.
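Putting the probe and resource guidance together, a sketch of a Triton-based inference Deployment might look like the following (image tag, memory figure, and replica count are illustrative; Triton's HTTP readiness endpoint is `/v2/health/ready` on port 8000):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-inference               # illustrative name
spec:
  replicas: 2
  selector:
    matchLabels:
      app: llm-inference
  template:
    metadata:
      labels:
        app: llm-inference
    spec:
      priorityClassName: inference-critical
      containers:
        - name: server
          image: nvcr.io/nvidia/tritonserver:24.05-py3   # example image tag
          resources:
            requests:
              cpu: "4"
              memory: 48Gi          # model footprint plus ~20% buffer (illustrative figure)
              nvidia.com/gpu: 1
            limits:
              memory: 48Gi
              nvidia.com/gpu: 1
          readinessProbe:
            httpGet:
              path: /v2/health/ready   # Triton reports ready only after models are loaded
              port: 8000
            initialDelaySeconds: 60
            periodSeconds: 10
            failureThreshold: 30    # tolerates roughly 5 minutes of model loading
```

The generous `failureThreshold` keeps the pod out of the Service endpoints during the multi-minute model load without marking it failed.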
Networking for Distributed Training
The container network interface (CNI) layer is often the bottleneck for distributed training on Kubernetes. Standard overlay networks like Calico or Flannel add encapsulation overhead and cannot carry RDMA traffic at all, forcing NCCL to fall back to TCP. For serious training workloads, deploy the NVIDIA Network Operator alongside Multus CNI to provide secondary RDMA-capable network interfaces to training pods.
This configuration gives each training pod a high-performance InfiniBand or RoCE interface alongside the standard pod network, allowing NCCL to use RDMA for collective operations while keeping service discovery and health checks on the primary CNI.
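With Multus in place, a pod attaches to the secondary network through an annotation referencing a NetworkAttachmentDefinition. A sketch, assuming a NetworkAttachmentDefinition named `rdma-net` and an RDMA device-plugin resource (the resource name, image, and pod name below are assumptions that depend on your cluster's configuration):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: trainer-0                           # illustrative name
  annotations:
    k8s.v1.cni.cncf.io/networks: rdma-net   # secondary RDMA-capable interface via Multus
spec:
  containers:
    - name: trainer
      image: nvcr.io/nvidia/pytorch:24.05-py3   # example image tag
      resources:
        limits:
          nvidia.com/gpu: 8
          rdma/rdma_shared_device_a: 1      # assumed resource name; set by the RDMA device-plugin config
```

Service discovery, health checks, and kubectl traffic stay on the primary interface; NCCL discovers and uses the RDMA interface for collectives.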
Observability and Cost Tracking
Deploy DCGM Exporter to expose per-GPU metrics for Prometheus to scrape. Build dashboards that track GPU utilization by namespace, team, and workload type. Deploy Kubecost or OpenCost to attribute GPU costs to specific teams and projects; this visibility alone typically reduces waste by 20-30% as teams become accountable for their resource consumption.
Alert on GPU utilization below 40% for pods that have been running longer than one hour. This catches abandoned Jupyter notebooks and forgotten debug sessions that silently consume expensive GPU resources.
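Assuming the Prometheus Operator is installed, the low-utilization alert can be expressed as a PrometheusRule over DCGM Exporter's `DCGM_FI_DEV_GPU_UTIL` gauge (the alert name is illustrative, and label names vary with your exporter configuration):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: gpu-waste-alerts            # illustrative name
spec:
  groups:
    - name: gpu-utilization
      rules:
        - alert: GPUUnderutilized
          # average utilization over the past hour below 40%, sustained for an hour
          expr: avg_over_time(DCGM_FI_DEV_GPU_UTIL[1h]) < 40
          for: 1h
          labels:
            severity: warning
          annotations:
            summary: "GPU utilization below 40% for over an hour; check for idle notebooks or debug pods"
```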