Last modified December 13, 2022
Fine-tuning upgrade disruption on AWS
Introduction
Cluster upgrades (described in detail in our cluster upgrades reference) can disrupt workloads if the upgrade requires upgrading worker nodes.
We provide two ways of limiting the amount of disruption:
- Maximum batch size: the highest possible number of nodes to upgrade at the same time
- Pause time: the time to wait between batches of worker nodes
As worker node pools on AWS are based on Auto Scaling Groups built using CloudFormation, both settings directly affect configuration details of the CloudFormation stack’s update policy.
These settings have been configurable since workload cluster release v12.7.0 for AWS. The feature is currently at an early stage and its behaviour may change in the near future.
Adjustments to these settings require using the Management API to edit the AWSCluster resource of the cluster (for cluster-wide settings) or the AWSMachineDeployment resource of an individual node pool.
Maximum batch size
When a worker node update is necessary, nodes are updated (terminated and replaced by new ones) in groups or batches. The maximum batch size configures how many nodes are updated at the same time.
This setting maps directly to the MaxBatchSize property of the AWS CloudFormation update policy.
The default value for this setting is 0.3, which means that each batch will contain at most 30% of the nodes.
To override the default for either the entire cluster or a specific node pool, the annotation alpha.aws.giantswarm.io/update-max-batch-size must be set
- on the AWSCluster resource, to be applied as a default to all node pools of the cluster.
- on the AWSMachineDeployment resource, to be effective for only one node pool. A value here will override any value specified on the AWSCluster level.
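The precedence described above can be sketched in Python. This is an illustration only, not the operator's actual code: the helper name effective_annotation and the use of plain dictionaries to stand in for the annotations of each resource are assumptions made for this example. The default of "0.3" is the one stated in this section.

```python
from typing import Mapping

ANNOTATION = "alpha.aws.giantswarm.io/update-max-batch-size"
DEFAULT_MAX_BATCH_SIZE = "0.3"  # default stated in this document


def effective_annotation(
    cluster_annotations: Mapping[str, str],
    node_pool_annotations: Mapping[str, str],
    key: str = ANNOTATION,
    default: str = DEFAULT_MAX_BATCH_SIZE,
) -> str:
    """Return the value that applies to one node pool (hypothetical helper)."""
    # A value on the AWSMachineDeployment (node pool) takes precedence.
    if key in node_pool_annotations:
        return node_pool_annotations[key]
    # Otherwise fall back to the cluster-wide value on the AWSCluster.
    if key in cluster_annotations:
        return cluster_annotations[key]
    # Neither resource carries the annotation: use the default.
    return default
```

With the AWSCluster set to "10" and one node pool set to "0.1", that node pool would use "0.1" while all other node pools of the cluster use "10".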
You have two options to configure the maximum batch size:
- Absolute: using an integer number larger than zero, you’ll specify the maximum number of nodes in absolute terms.
- Relative: using a decimal number between 0.0 and 1.0 you can define the group size as a fraction of the node pool size. As an example, the value "0.25" would mean that all worker nodes would be divided into four groups, each containing 25 percent of the worker nodes (roughly).
The smaller you configure the maximum batch size, the less disruptive the upgrade will be, and the longer it will take.
Note: As with any Kubernetes annotation, the value must be of type String. In YAML this requires wrapping the value in double quotes.
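To make the two notations concrete, here is a minimal Python sketch that turns an annotation value into a number of nodes per batch. It is not the operator's implementation. The function name max_nodes_per_batch is hypothetical, and three details are assumptions made for the example: a value containing a decimal point is treated as relative, relative results are rounded down, and a batch always contains at least one node.

```python
from decimal import Decimal


def max_nodes_per_batch(value: str, node_pool_size: int) -> int:
    """Translate an annotation value into a node count (illustrative only)."""
    if "." in value:
        # Relative notation, e.g. "0.25" for a quarter of the node pool.
        fraction = Decimal(value)
        if not Decimal(0) < fraction <= Decimal(1):
            raise ValueError(f"relative value out of range: {value!r}")
        # Assumption: round down, but never below one node per batch.
        return max(1, int(fraction * node_pool_size))
    # Absolute notation, e.g. "10" for ten nodes.
    count = int(value)
    if count <= 0:
        raise ValueError(f"absolute value must be larger than zero: {value!r}")
    return count
```

Decimal is used instead of float so that a product such as 0.29 × 100 is computed exactly before rounding down.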
Examples
In this first example, we set the absolute value 10 to roll a maximum of ten worker nodes per batch.
apiVersion: infrastructure.giantswarm.io/v1alpha2
kind: AWSCluster
metadata:
  name: jni9x
  annotations:
    alpha.aws.giantswarm.io/update-max-batch-size: "10"
...
In this second example, we set the value to 0.1 to roll a maximum of 10 percent of the nodes of a node pool per batch.
apiVersion: infrastructure.giantswarm.io/v1alpha2
kind: AWSCluster
metadata:
  name: jni9x
  annotations:
    alpha.aws.giantswarm.io/update-max-batch-size: "0.1"
...
Pause time
After updating a batch of worker nodes, a pause time is applied. This time effectively allows Kubernetes to schedule workloads on the new worker nodes.
This setting maps to the PauseTime property of the AWS CloudFormation update policy.
By default, the pause time is set to 15 minutes.
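Batch size and pause time together determine how much waiting an upgrade adds. The Python sketch below estimates only the accumulated pause time for one node pool. It assumes that one pause follows every batch and it ignores the time needed to launch and drain nodes, so the result is a rough lower bound, not a prediction. The function name total_pause is hypothetical.

```python
from datetime import timedelta


def total_pause(node_count: int, nodes_per_batch: int, pause: timedelta) -> timedelta:
    """Estimate the summed pause time for rolling one node pool (rough sketch)."""
    if node_count < 0 or nodes_per_batch < 1:
        raise ValueError("node_count must be >= 0 and nodes_per_batch >= 1")
    # Number of batches, rounded up so a partial last batch still counts.
    batches = (node_count + nodes_per_batch - 1) // nodes_per_batch
    # Assumption: one pause is applied after every batch.
    return batches * pause
```

For example, a node pool of 10 nodes rolled 3 nodes at a time needs four batches, which at the default pause time of 15 minutes adds up to 60 minutes of pauses under these assumptions.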
This value can be changed via the annotation alpha.aws.giantswarm.io/update-pause-time.
Again, the setting can be defined on two levels:
- on the AWSCluster resource, to be applied as a default to all node pools of the cluster.
- on the AWSMachineDeployment resource, to be effective for only one node pool. A value here will override any value specified on the AWSCluster level.
The value must be a string in the ISO 8601 duration format. Value examples:
- PT10M: 10 minutes
- PT15S: 15 seconds
- PT1M30S: 1 minute and 30 seconds
The maximum pause time is one hour (PT1H).
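A value can be checked before it is applied to a resource. The following Python sketch parses the subset of ISO 8601 durations used in the examples above (hours, minutes and whole seconds after the PT prefix) and rejects anything longer than one hour. The function name parse_pause_time is hypothetical, and restricting the parser to this subset is a simplification for the example; the full ISO 8601 duration format allows more forms.

```python
import re
from datetime import timedelta

# Handles only hours, minutes and whole seconds, e.g. PT10M, PT15S, PT1M30S.
_DURATION = re.compile(r"^PT(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?$")
MAX_PAUSE = timedelta(hours=1)  # maximum stated in this document


def parse_pause_time(value: str) -> timedelta:
    """Parse and validate a pause time annotation value (illustrative only)."""
    match = _DURATION.match(value)
    # Reject non-matching strings and a bare "PT" without any component.
    if match is None or not any(match.groups()):
        raise ValueError(f"unsupported duration: {value!r}")
    hours, minutes, seconds = (int(g) if g else 0 for g in match.groups())
    pause = timedelta(hours=hours, minutes=minutes, seconds=seconds)
    if pause > MAX_PAUSE:
        raise ValueError(f"pause time exceeds one hour: {value!r}")
    return pause
```

Running such a check locally catches typos like "10m" or "PT2H" before the annotation is written to the AWSCluster or AWSMachineDeployment resource.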
Examples
In the first example, we set the value to PT1M30S to pause for one and a half minutes between batches.
apiVersion: infrastructure.giantswarm.io/v1alpha2
kind: AWSCluster
metadata:
  name: jni9x
  annotations:
    alpha.aws.giantswarm.io/update-pause-time: "PT1M30S"
...
In this second example, we set the value to PT5M on an AWSMachineDeployment to pause for five minutes between batches for this node pool only.
apiVersion: infrastructure.giantswarm.io/v1alpha2
kind: AWSMachineDeployment
metadata:
  name: 2suw9
  annotations:
    alpha.aws.giantswarm.io/update-pause-time: "PT5M"
...