Last modified December 5, 2024
Clusters over multiple availability zones
In Giant Swarm platform you can easily launch clusters with worker nodes spread across multiple availability zones (AZ
). This will lower the risk that your cluster will become unavailable due to an incident in a particular AWS
, Azure
or on-premise data center.
What availability zones are good for
Cloud providers, and on-premise installations, supports availability zones within their regions. These zones still have good inter-connectivity but are separated from each other to isolate failures within one zone from the others. As a result, it’s more likely that one zone goes down than the whole region. If your clusters are running within multiple availability zones then Kubernetes
can easily shift your workloads from one zone to another.
Your cluster nodes have labels that indicate which availability zone they’re running in. You can influence the scheduling of your pods via node affinity and/or inter-pod affinity or anti-affinity.
This enables use cases such as:
Running stateless services balanced over multiple availability zones to increase availability.
Grouping certain services that communicate with each other often and make sure they always run together within the same availability zone.
Setup replication of your data across multiple availability zones by keeping each instance of your StatefulSet in a different zone.
Details you should know about
On AWS and Azure availability zones are randomized across accounts. There is no way to determine which zone you are really in, based on the name. For example, the zone
eu-central-1a
in the account of the cluster isn’t necessarily the sameeu-central-1a
in your account.Availability zones get selected randomly by the Giant Swarm management cluster. You only need to specify the required number of availability zones.
Nodes will get distributed evenly across availability zones. There is currently no way to determine which or how many nodes should be started in a particular availability zone. But the nodes will have a label
topology.kubernetes.io/zone
(or in Kubernetes before 1.17:failure-domain.beta.kubernetes.io/zone
) that indicates which availability zone the node is running in.Single availability zone clusters start in a random availability zone too. This is a means to minimize the risk of all your clusters becoming unavailable due to a failure in one particular zone.
Standard volumes can not be moved across availability zones. You need to take this into account when designing for high availability. If the zone with your volume goes down, there will be no way to reschedule the pod to another availability zone. You either need to create a new volume from a snapshot or you will have to replicate your data across zones. On AWS, you can also consider using EFS as a storage provider to be able to access volumes from several availability zones.
To make sure your pods and volumes end up on the same nodes, our recommendation is to specify
WaitForFirstConsumer
asvolumeBindingMode
in your storage classes. Your clusters come with a default storage class that contains this setting already. See the Volume Binding Mode section in the Kubernetes storage documentation for more information.
Spreading worker nodes over multiple availability zones can be configured per node pool and independent of the choice of a single control plane node vs. using multiple control plane nodes (currently multiple control plane nodes are only supported on AWS).
Example pod topology spread constraints and affinity
To make sure your workload gets scheduled over available worker nodes over availability zones you can make use of Pod Topology Spread Constraints
and affinity and anti-affinity.
In this example, the two constraints make sure, pods with the given labels are distributed across available zones. The first constraint makes sure the number of pods across nodes with the same topology.kubernetes.io/zone
label is one at max. The second constraint picks the Node with the lowest number of Pods still across all nodes and all zones. This might collide with the first constraint and no schedule would happen at all unless whenUnsatisfiable
is set to ScheduleAnyway
. The affinity
makes sure to not schedule any Pods on master
Nodes.
The following fields are part of the spec.template.spec
of Deployments
, StatefulSets
, and DaemonSets
:
topologySpreadConstraints:
- maxSkew: 1
topologyKey: topology.kubernetes.io/zone
whenUnsatisfiable: DoNotSchedule
labelSelector:
matchLabels:
app.kubernetes.io/name: hello-world-app
app.kubernetes.io/version: 0.1.0
- maxSkew: 1
topologyKey: kubernetes.io/hostname
whenUnsatisfiable: ScheduleAnyway
labelSelector:
matchLabels:
app.kubernetes.io/name: hello-world-app
app.kubernetes.io/version: 0.1.0
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: kubernetes.io/role
operator: NotIn
values:
- master
These constraints are checked during scheduling of new pods. In case some dis-balancing happened because of other reasons, Descheduler
for Kubernetes
can be used to re-schedule pods to be more balanced.
Please be aware of some undesirable edge-cases and caveats. These might show up on more complex topologySpreadConstraints
and affinity
setups. Discussions in issues kubernetes/kubernetes#106127, kubernetes/kubernetes#105977 and kubernetes/kubernetes#105291 touch some of the problems that could arise.
Get started
You can read kubectl gs template cluster
to learn how to create a cluster with multiple availability zones.
When inspecting details of such a cluster, you can get the node pool information using the node pool infrastructure resource depending on the provider.
The node pools in Cluster API for AWS are managed by the AWSMachinePool
resource. When you create a cluster with kubectl gs template cluster
you rely on Helm
cluster-aws
chart. To extend or modify the availability zones for the node pools you can provide a values.yaml
file as shown below.
Define the node pool with the availability zones in the values.yaml
file.
cat << EOF > /tmp/values.yaml
data:
values: |
global:
nodePools:
np001:
instanceType: m6i.xlarge
maxSize: 3
minSize: 1
availabilityZones:
- eu-west-1a
- eu-west-1b
EOF
Template the cluster with kubectl gs template cluster
:
kubectl gs template cluster \
--provider capa \
--name mycluster \
--organization testing \
--release 29.0.1 > /tmp/cluster.yaml
Append the node pool values configuration manually or using this yq
command:
yq eval-all '
select(fileIndex == 0) |
(
select(.kind == "ConfigMap") |
.data.values |= (
(. | from_yaml) * (load("/tmp/values.yaml").data.values | from_yaml) | to_yaml
)
),
select(fileIndex == 0 and .kind == "App")
' /tmp/cluster.yaml > /tmp/merged.yaml
Now you can apply the merged configuration:
kubectl apply -f /tmp/merged.yaml
To get the node pool information, query the AWSMachinePool
resource as shown below.
$ kubectl get AWSManagedMachinePool -l cluster.x-k8s.io/cluster-name=mycluster -oyaml
apiVersion: infrastructure.cluster.x-k8s.io/v1beta2
kind: AWSManagedMachinePool
metadata:
name: mycluster-np001
namespace: org-testing
spec:
availabilityZones:
- eu-west-1a
...
The node pools in Cluster API for EKS are managed by the AWSManagedMachinePool
resource. When you create a cluster with kubectl gs template cluster
you rely on Helm
cluster-eks
chart. To extend or modify the availability zones for the node pools you can provide a values.yaml
file as shown below.
Define the node pool with the availability zones in the values.yaml
file.
cat << EOF > /tmp/values.yaml
data:
values: |
global:
nodePools:
np001:
name: "np001"
instanceType: "m5.large"
minSize: 1
maxSize: 3
availabilityZones: ["us-west-2a", "us-west-2b", "us-west-2c"]
EOF
Template the cluster with kubectl gs template cluster
:
kubectl gs template cluster \
--provider eks \
--name mycluster \
--organization testing \
--release 29.0.1 > /tmp/cluster.yaml
Append the node pool values configuration manually or using this yq
command:
yq eval-all '
select(fileIndex == 0) |
(
select(.metadata.name == "mycluster-userconfig") |
.data.values |= (
(. | from_yaml) * (load("/tmp/values.yaml").data.values | from_yaml) | to_yaml
)
),
select(fileIndex == 0 and .metadata.name == "mycluster"),
select(fileIndex == 0 and .metadata.name == "mycluster-default-apps-userconfig"),
select(fileIndex == 0 and .metadata.name == "mycluster-default-apps")
' /tmp/cluster.yaml > /tmp/merged.yaml
Now you can apply the merged configuration:
kubectl apply -f /tmp/merged.yaml
To get the node pool information, query the AWSManagedMachinePool
resource as shown below.
$ kubectl get AWSManagedMachinePool -l cluster.x-k8s.io/cluster-name=mycluster -oyaml
apiVersion: infrastructure.cluster.x-k8s.io/v1beta2
kind: AWSManagedMachinePool
metadata:
name: mycluster-np001
namespace: org-testing
spec:
availabilityZones:
- us-west-2a
- us-west-2b
...
The node pools in Cluster API for Azure are managed by the MachineDeployment
resource. When you create a cluster with kubectl gs template cluster
you can define the availability zones for the worker nodes as shown below.
Define the failure domain for the node pool in the values.yaml
file.
cat << EOF > /tmp/cluster.yaml
data:
values: |
global:
nodePools:
np001:
instanceType: Standard_D4s_v5
replicas: 4
failureDomain: "2"
EOF
Template the cluster with kubectl gs template cluster
:
$ kubectl gs template cluster \
--provider capz \
--name mycluster \
--organization testing \
--region westeurope \
--azure-subscription-id xxxxxxxxx \
--release 29.0.1 > /tmp/cluster.yaml
Append the node pool values configuration manually or using this yq
command:
yq eval-all '
select(fileIndex == 0) |
(
select(.metadata.name == "mycluster-userconfig") |
.data.values |= (
(. | from_yaml) * (load("/tmp/values.yaml").data.values | from_yaml) | to_yaml
)
),
select(fileIndex == 0 and .metadata.name == "mycluster")
' /tmp/cluster.yaml > /tmp/merged.yaml
Now you can apply the merged configuration:
kubectl apply -f /tmp/merged.yaml
Get the node pool information by querying the MachineDeployment
resource as shown below.
$ kubectl get MachineDeployment -l cluster.x-k8s.io/cluster-name=mycluster -oyaml
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineDeployment
metadata:
name: mycluster-np001
namespace: org-testing
spec:
failureDomain: 2
...
Learn more about node pools and how to manage them.
Need help, got feedback?
We listen to your Slack support channel. You can also reach us at support@giantswarm.io. And of course, we welcome your pull requests!