Last modified November 23, 2020

Automatic termination of unhealthy nodes

Degraded nodes in a Kubernetes cluster should be rare, but when they occur, they can have severe consequences for the workloads scheduled on the affected nodes. The goal is to detect bad nodes early and remove them from the cluster, replacing them with healthy ones.

Starting with tenant cluster release v12.6.0 for AWS and v13.1.0 for Azure, you can automate the detection and termination of bad nodes. When enabled, all nodes in your cluster are checked periodically. If a node fails consecutive health checks over an extended period, it is drained and terminated.

This feature is currently available on AWS and Azure only.

Technical details

The node’s health status determines whether a node needs to be terminated. A node reporting a Ready status is considered healthy. A node is considered unhealthy when its kubelet reports a NotReady status on consecutive checks.

If a node stays unhealthy continuously for longer than a time threshold (approximately 15 minutes), it will be recycled.
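The threshold rule can be illustrated with a minimal sketch. This is not the actual implementation: the function name should_terminate, the shape of the input (a list of timestamp and readiness pairs) and the exact 15-minute value are assumptions made for illustration, based only on the description above.

```python
from datetime import datetime, timedelta

# Assumed value, taken from the "approximately 15 minutes" stated above.
UNHEALTHY_THRESHOLD = timedelta(minutes=15)


def should_terminate(checks, threshold=UNHEALTHY_THRESHOLD):
    """Return True if the node has been NotReady continuously for `threshold`.

    `checks` is a hypothetical list of (timestamp, is_ready) tuples,
    sorted from oldest to newest.
    """
    if not checks:
        return False
    latest_time, latest_ready = checks[-1]
    if latest_ready:
        # A single Ready report ends the unhealthy streak.
        return False
    # Walk backwards to find where the current NotReady streak started.
    streak_start = latest_time
    for timestamp, is_ready in reversed(checks):
        if is_ready:
            break
        streak_start = timestamp
    return latest_time - streak_start >= threshold


start = datetime(2020, 11, 23, 12, 0)
# NotReady at minutes 0, 5, 10 and 15: the streak spans 15 minutes.
down = [(start + timedelta(minutes=5 * i), False) for i in range(4)]
# Alternating Ready and NotReady reports, ending on Ready.
flapping = [(start + timedelta(minutes=5 * i), i % 2 == 0) for i in range(5)]

print(should_terminate(down))      # True
print(should_terminate(flapping))  # False
```

In this sketch, a node that recovers even briefly resets the timer, which matches the requirement that the unhealthy status is reported continuously.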

Enabling automatic termination

Automatic termination of unhealthy nodes is disabled by default. You can enable it for each individual cluster.

This section explains how you can enable the feature for each supported provider.

AWS

To enable the feature, edit the AWSCluster resource of your cluster using the Control Plane Kubernetes API.

Make sure the resource has the alpha.node.giantswarm.io/terminate-unhealthy annotation. The value can be anything you like, as only the presence of that annotation is checked. Here is an example:

apiVersion: infrastructure.giantswarm.io/v1alpha2
kind: AWSCluster
metadata:
  annotations:
    alpha.node.giantswarm.io/terminate-unhealthy: "true"
  labels:
    giantswarm.io/cluster: jni9x
    giantswarm.io/organization: giantswarm
    release.giantswarm.io/version: 12.6.0
  name: jni9x
  namespace: default
spec:
  ...

To disable the feature, remove the annotation from the AWSCluster custom resource.
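If you prefer patching over editing the resource by hand, adding and removing the annotation can both be expressed as a JSON merge patch. The following is a minimal sketch, assuming you apply the resulting document yourself; the helper name annotation_patch is hypothetical. It relies on the JSON merge patch rule (RFC 7386) that a null value deletes the key.

```python
import json

ANNOTATION = "alpha.node.giantswarm.io/terminate-unhealthy"


def annotation_patch(enabled):
    """Build a JSON merge patch that adds or removes the annotation.

    In a JSON merge patch, setting a key to null removes it, so
    disabling the feature maps to a null annotation value.
    """
    value = "true" if enabled else None
    return json.dumps({"metadata": {"annotations": {ANNOTATION: value}}})


print(annotation_patch(True))
# {"metadata": {"annotations": {"alpha.node.giantswarm.io/terminate-unhealthy": "true"}}}
print(annotation_patch(False))
# {"metadata": {"annotations": {"alpha.node.giantswarm.io/terminate-unhealthy": null}}}
```

A document like this could, for example, be passed to kubectl patch with the --type merge and -p flags against the resource in question.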

Azure

To enable the feature, edit the Cluster resource of your cluster using the Control Plane Kubernetes API.

Make sure the resource has the alpha.node.giantswarm.io/terminate-unhealthy annotation. The value can be anything you like, as only the presence of that annotation is checked. Here is an example:

apiVersion: cluster.x-k8s.io/v1alpha3
kind: Cluster
metadata:
  annotations:
    alpha.node.giantswarm.io/terminate-unhealthy: "true"
  name: fn7t8
  namespace: org-giantswarm
spec:
  ...

To disable the feature, remove the annotation from the Cluster custom resource.
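Because only the presence of the annotation is checked, setting its value to "false" would presumably not disable the feature. The following sketch illustrates that presence-only check on a resource loaded as a dictionary. The function name termination_enabled is hypothetical, and the logic is inferred from the description above rather than taken from the actual implementation.

```python
ANNOTATION = "alpha.node.giantswarm.io/terminate-unhealthy"


def termination_enabled(resource):
    """Return True if the annotation key is present, whatever its value."""
    metadata = resource.get("metadata") or {}
    annotations = metadata.get("annotations") or {}
    return ANNOTATION in annotations


cluster = {
    "apiVersion": "cluster.x-k8s.io/v1alpha3",
    "kind": "Cluster",
    "metadata": {
        "name": "fn7t8",
        "namespace": "org-giantswarm",
        # The value is ignored in this sketch; only the key matters.
        "annotations": {ANNOTATION: "false"},
    },
}

print(termination_enabled(cluster))  # True

# Removing the key is what turns the feature off.
del cluster["metadata"]["annotations"][ANNOTATION]
print(termination_enabled(cluster))  # False
```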