GPU workloads in Cluster API workload clusters
This guide explains how to configure and use GPU-enabled nodes in Cluster API (CAPI) workload clusters to run GPU-accelerated workloads, focusing on NVIDIA GPUs. GPU support is available starting with release v30.1.0.
Overview
GPU-accelerated computing can significantly enhance performance for specific workloads like machine learning, video processing, scientific simulations, and other compute-intensive tasks. Giant Swarm’s CAPI workload clusters support GPU nodes on AWS (CAPA) and Azure (CAPZ).
Prerequisites
- A running CAPI workload cluster
- kubectl configured to access your workload cluster
Supported cloud providers and GPU types
AWS
Giant Swarm CAPI clusters on AWS support the following GPU instance families (including upcoming ones):
- p2 family: Cost-effective GPU instances with NVIDIA K80
- p3 family: High-performance instances with NVIDIA V100
- p4 family: Latest generation with NVIDIA A100
- p5 family: Advanced performance with NVIDIA H100
- p5e family: Enhanced performance with NVIDIA H200
- g3 family: Graphics-optimized with NVIDIA Tesla M60
- g4dn family: Balanced GPU compute with NVIDIA T4
- g5 family: Latest generation graphics-optimized with NVIDIA A10G
- g6 family: Next-generation GPU instances with NVIDIA L4
- g6e family: Enhanced performance with NVIDIA L40S Tensor Core
Azure
Giant Swarm CAPI clusters on Azure support the following GPU VM families and series (incl. upcoming series):
NC-family (Compute-intensive, Graphics-intensive)
- NC series: NVIDIA K80
- NCv2 series and NCv3 series: NVIDIA P100 and V100
- NCasT4_v3 series: NVIDIA T4
- NC_A100_v4 series: NVIDIA A100
- NCads_H100_v5 series and NCCads_H100_v5 series: NVIDIA H100
ND-family (Large memory compute-intensive workloads)
- ND_A100_v4 series and NDm_A100_v4 series: NVIDIA A100
- ND-H100-v5 series: NVIDIA H100
NV-family (Visualization and rendering)
- NV series: NVIDIA M60
- NVv3 series: NVIDIA Tesla M60
- NVadsA10_v5 series: NVIDIA A10
Adding GPU nodes to your Cluster API workload cluster
Configuring a GPU node pool
To add GPU nodes to your CAPI workload cluster, you need to create a new node pool with the appropriate GPU instance type. The following example shows how to add GPU nodes to an AWS (CAPA) workload cluster.
Update your cluster app’s values to add a new node pool with GPU instances:
nodePools:
  gpu-worker-pool-1:
    # instance type with GPU
    instanceType: g5.2xlarge
    maxSize: 1
    minSize: 1
    # root volume size in GB needs to be at least 15
    rootVolumeSizeGB: 15
    instanceWarmup: 600
    minHealthyPercentage: 90
    # taints which are required for GPU workloads
    customNodeTaints:
      - key: "nvidia.com/gpu"
        value: "Exists"
        effect: "NoSchedule"
Apply the updated configuration to your cluster.
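How you apply the change depends on how you manage the cluster app’s values. As a hypothetical example, if the values are kept in a user config ConfigMap on the management cluster, you could edit that ConfigMap in place (the ConfigMap name and namespace below are placeholders; if you manage the values through GitOps, commit the change to your repository instead):

kubectl edit configmap <cluster-name>-userconfig -n org-<organization>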
Installing the GPU Operator app
The recommended way to enable NVIDIA GPU support in your cluster is to use the GPU Operator app, which is needed for scheduling and running GPU workloads.
Install the GPU Operator app from the Giant Swarm Catalog in the kube-system namespace.
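As a rough sketch, assuming the app is installed through an App resource on the management cluster, the manifest could look roughly like this. The metadata name, namespace, and version are placeholders, and additional fields (for example, the kubeConfig settings that point the app at the workload cluster) depend on your installation, so check the Giant Swarm app platform documentation or install the app via the web interface:

apiVersion: application.giantswarm.io/v1alpha1
kind: App
metadata:
  # placeholder name and organization namespace
  name: mycluster-gpu-operator-app
  namespace: org-example
spec:
  catalog: giantswarm
  # app name as listed in the Giant Swarm Catalog (check the catalog for the exact name)
  name: gpu-operator-app
  # target namespace in the workload cluster
  namespace: kube-system
  # placeholder; pick the latest version from the catalog
  version: 1.0.0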
The GPU Operator app installs several components:
- NVIDIA Device Plugin
- NVIDIA MIG Manager (for A100 GPUs)
- Node Feature Discovery
- GPU Feature Discovery
We don’t install the NVIDIA driver and container toolkit through the GPU Operator, because they are already provided on the nodes by default.
Verifying the installation
Verify that the GPU Operator is installed and working correctly by waiting for all of its pods to be running.
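One way to list the relevant pods, assuming the app was installed in the kube-system namespace:

kubectl get pods -n kube-system | grep -E 'gpu-operator|gpu-feature|nvidia'

The output should look similar to this: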
gpu-feature-discovery-tjj65 1/1 Running 0 6s
gpu-operator-6d5b78c78f-f7dg8 1/1 Running 0 15s
gpu-operator-node-feature-discovery-gc-554ccf9b5-vzwd2 1/1 Running 0 15s
gpu-operator-node-feature-discovery-master-567bf66c77-xvsbz 1/1 Running 0 15s
gpu-operator-node-feature-discovery-worker-2lhks 1/1 Running 0 15s
gpu-operator-node-feature-discovery-worker-jp7dm 1/1 Running 0 15s
gpu-operator-node-feature-discovery-worker-mvq4t 1/1 Running 0 15s
gpu-operator-node-feature-discovery-worker-pxtxq 1/1 Running 0 15s
gpu-operator-node-feature-discovery-worker-s5zr6 1/1 Running 0 15s
gpu-operator-node-feature-discovery-worker-xfbl7 1/1 Running 0 15s
nvidia-dcgm-exporter-6l8vx 1/1 Running 0 6s
nvidia-device-plugin-daemonset-rr49p 1/1 Running 0 6s
nvidia-operator-validator-vjhxh 1/1 Running 0 7s
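You can also check that the device plugin has advertised the GPU resource on the node, for example (the node name is a placeholder):

kubectl describe node <gpu-node-name> | grep nvidia.com/gpu

The nvidia.com/gpu resource should appear under both Capacity and Allocatable.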
Running GPU workloads
To run workloads on GPU nodes, you need to request GPU resources and specify the runtimeClassName nvidia in your pod specification:
apiVersion: v1
kind: Pod
metadata:
  name: cuda-vector-add
  namespace: kube-system
spec:
  tolerations:
    - key: "nvidia.com/gpu"
      operator: "Exists"
      effect: "NoSchedule"
  # runtimeClassName is required to run GPU workloads
  runtimeClassName: nvidia
  restartPolicy: OnFailure
  containers:
    - name: cuda-vector-add
      image: "nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubuntu20.04"
      resources:
        limits:
          nvidia.com/gpu: 1 # Requesting 1 GPU
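To try it out, you could save the manifest as cuda-vector-add.yaml (the file name is only an example), apply it, and read the pod’s logs once it has completed:

kubectl apply -f cuda-vector-add.yaml
kubectl logs -n kube-system cuda-vector-add

The logs of the sample should report a successful vector addition test.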
Resource management
GPUs are only available through limits, not requests. The number specified in limits determines how many GPUs will be allocated to the pod.
Best practices
- Taints and tolerations: Use Kubernetes taints on GPU nodes and corresponding tolerations in pod specifications to prevent non-GPU workloads from being scheduled on expensive GPU resources.
- Resource quotas: Implement resource quotas to control GPU allocation in multi-tenant environments (see the sketch after this list).
- Node auto provisioning: Consider using Karpenter or cluster autoscaling with GPU node pools to automatically scale GPU resources based on demand.
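As an illustration of the resource quota recommendation, here is a minimal sketch of a ResourceQuota that caps GPU usage in one namespace (the namespace name and the limit of 4 GPUs are just examples):

apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-quota
  # example namespace of a tenant team
  namespace: team-a
spec:
  hard:
    # at most 4 GPUs can be requested by pods in this namespace
    requests.nvidia.com/gpu: "4"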
Getting support
If you encounter issues that cannot be resolved using the troubleshooting steps above, contact Giant Swarm support with the following information:
- Output of kubectl get nodes -o wide
- Output of kubectl describe node <gpu-node-name>
- Logs from the GPU Operator pods: kubectl logs -n kube-system gpu-operator-<deployment-id>-<pod-id>
Limitations
- GPU memory overcommitment is not supported by default
- Dynamic allocation of GPUs is not supported (a GPU is allocated at container start time)
- Specific GPU driver versions may be required for certain CUDA applications
Further reading
Need help, got feedback?
We listen on your Slack support channel. You can also reach us at support@giantswarm.io. And of course, we welcome your pull requests!