Last modified December 13, 2022
Automatic termination of unhealthy nodes
AWS |
|
---|---|
Azure |
|
On-premises (KVM) |
|
Introduction
Degraded nodes in a Kubernetes cluster should be a rare issue, however when it occurs, it can have severe consequences for the workloads scheduled to the affected nodes. The goal should be to detect bad nodes early and remove them from the cluster, replacing them with healthy ones.
Starting with workload cluster release v12.6.0 for AWS, v13.1.0 for Azure and v14.0.0 for KVM, you have the option to automate the detection and termination of bad nodes. When enabled, all nodes in your cluster are periodically checked. If a node fails consecutive health checks over an extended time period, it will be drained and terminated.
From AWS workload cluster release v15.0.0 the feature is enabled by default. In order to enable this functionality for other providers, please refer to the applicable section below.
Technical details
The node’s health status is used to determine if a node needs to be terminated. A node reporting a Ready
status is considered healthy. If a node reports an unhealthy status continuously for a given time threshold it will be recycled.
An unhealthy status means the kubelet
on a given node has reported NotReady
status on consecutive checks (aproximately every 2-3 minutes) over a certain time threshold (approximately 15 minutes).
Enabling automatic termination
Automatic termination of unhealthy nodes is enabled by default on AWS as of release v15.0.0
. It continues to be disabled by default for older releases.
This section explains how you can enable the feature for each supported provider.
AWS
Workload cluster releases v12.6.x to v14.x.x
To enable the feature, edit the AWSCluster
resource of your cluster using the Management API.
Make sure the resource has the alpha.node.giantswarm.io/terminate-unhealthy
annotation. The value can be anything you like, as only the presence of that annotation is checked. Here is an example:
apiVersion: infrastructure.giantswarm.io/v1alpha2
kind: AWSCluster
metadata:
annotations:
alpha.node.giantswarm.io/terminate-unhealthy: "true"
labels:
giantswarm.io/cluster: jni9x
giantswarm.io/organization: giantswarm
release.giantswarm.io/version: 12.6.0
name: jni9x
namespace: default
spec:
...
If you want to disable the feature you must remove the annotation from the AWSCluster
custom resource.
Workload cluster releases v15.x.x and newer
To disable automatic termination of unhealthy nodes, edit the AWSCluster
resource of your cluster using the Management API.
Make sure the resource has the annotation
node.giantswarm.io/terminate-unhealthy: "false"
```yaml
apiVersion: infrastructure.giantswarm.io/v1alpha2
kind: AWSCluster
metadata:
annotations:
node.giantswarm.io/terminate-unhealthy: "false"
labels:
giantswarm.io/cluster: jni9x
giantswarm.io/organization: giantswarm
release.giantswarm.io/version: 12.6.0
name: jni9x
namespace: default
spec:
...
Azure
To enable automatic termination of unhealthy nodes, edit the Cluster
resource of your cluster using the Management API.
Make sure the resource has the alpha.node.giantswarm.io/terminate-unhealthy
annotation. The value can be anything you like, as just the presence of that annotation is checked. Here is an example:
apiVersion: cluster.x-k8s.io/v1alpha3
kind: Cluster
metadata:
annotations:
alpha.node.giantswarm.io/terminate-unhealthy: "true"
name: fn7t8
namespace: org-giantswarm
spec:
...
If you want to disable the feature you must remove the annotation from the Cluster
custom resource.
KVM
To enable it, you have to edit the KVMConfig
resource of your cluster using the Management API.
Make sure the resource has the alpha.node.giantswarm.io/terminate-unhealthy
annotation. The value can be anything you like, as only the presence of that annotation is checked. Here is an example:
apiVersion: provider.giantswarm.io/v1alpha1
kind: KVMConfig
metadata:
annotations:
alpha.node.giantswarm.io/terminate-unhealthy: "true"
name: fn7t8
namespace: default
spec:
...
If you want to disable the feature you must remove the annotation from the KVMConfig
custom resource.
Need help, got feedback?
We listen to your Slack support channel. You can also reach us at support@giantswarm.io. And of course, we welcome your pull requests!