Tenant Cluster Releases for Azure

  • This is the first tenant cluster release to support Kubernetes 1.18 and node pools on Azure.

    A node pool is a subset of a cluster's nodes. Node pools enable having groups of nodes with different configurations (such as a different instance size) within one cluster. After a cluster has been created with one node pool, additional node pools can be freely added to and removed from the cluster.

    If you have access to the Control Plane API, you can manage your clusters directly from there. The clusters that you create are now represented by Cluster API CRDs (Custom Resource Definitions). Using our kubectl plugin, you can easily create the Custom Resources required to create a cluster.
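
    For illustration, here is a minimal sketch of creating such a cluster with the kubectl plugin (kubectl-gs); the flag names and values are placeholders and depend on your plugin version:

    ```sh
    # Hypothetical sketch: template the Cluster API custom resources for a new
    # Azure cluster and an additional node pool, then apply them against the
    # Control Plane API. Flags may differ between kubectl-gs versions.
    kubectl gs template cluster \
      --provider azure \
      --owner <organization> \
      --release <release-version> > cluster.yaml

    kubectl gs template nodepool \
      --provider azure \
      --owner <organization> \
      --cluster-id <cluster-id> > nodepool.yaml

    kubectl apply -f cluster.yaml -f nodepool.yaml
    ```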

    This tenant cluster release is marked as beta, which means that it contains all the content of the future v13.0.0 stable release and secures upgrade paths from and to that release. We recommend using and testing this beta version in non-production clusters. The only difference between this beta version and the stable one will be bug fixes for issues found during your testing.

    Change details

    kubernetes 1.18.12

    app-operator 2.7.0

    azure-operator 5.0.0-beta5

    calico 3.15.3

    cert-exporter 1.3.0

    chart-operator 2.5.0

    cluster-operator 0.23.18

    containerlinux 2605.7.0

    etcd 3.4.13

    external-dns 1.5.0

    kube-state-metrics 1.2.1

    node-exporter 1.6.0

    net-exporter 1.9.2

  • Nodes will be rolled during upgrade to this version.

    This patch release prevents an issue with QPS (Queries per Second) limits introduced by Docker Hub.

    Note before upgrade:

    Please contact your Solution Engineer before upgrading. The upgrade is automated. However, it includes a data migration from Helm 2 release ConfigMaps to Helm 3 release Secrets as well as some pre-upgrade checks, so we recommend monitoring the upgrade to ensure safety.

    Note for Solution Engineers:

    Please use Upgrading tenant clusters to Helm 3 as a guide on the upgrade process for the checks and monitoring steps.

    Note for future 12.x.x releases:

    Please persist this note and the one above until all customers are on Azure v12.1.x or above.

    Change details

    azure-operator 4.3.0

    • Pass the Docker Hub token to the kubelet for authorized image pulling.
    • Update k8scloudconfig to v7.2.0, containing a fix for Docker Hub QPS.
  • Change details

    azure-operator 5.0.0-alpha4

    kubernetes 1.18.10

    Design

    • Prevent logging of docker config contents if file is malformed (#95347, @sfowl) [SIG Auth and Node]

    Bug or Regression

    • Do not fail sorting empty elements. (#94666, @soltysh) [SIG CLI]
    • Ensure getPrimaryInterfaceID not panic when network interfaces for Azure VMSS are null (#94801, @nilo19) [SIG Cloud Provider]
    • Fix bug where loadbalancer deletion gets stuck because of missing resource group #75198 (#93962, @phiphi282) [SIG Cloud Provider]
    • Fix detach azure disk issue when vm not exist (#95177, @andyzhangx) [SIG Cloud Provider]
    • Fix etcd_object_counts metric reported by kube-apiserver (#94818, @tkashem) [SIG API Machinery]
    • Fix network_programming_latency metric reporting for Endpoints/EndpointSlice deletions, where we don’t have correct timestamp (#95363, @wojtek-t) [SIG Network and Scalability]
    • Fix scheduler cache snapshot when a Node is deleted before its Pods (#95154, @alculquicondor) [SIG Scheduling]
    • Fix the cloudprovider_azure_api_request_duration_seconds metric buckets to correctly capture the latency metrics. Previously, the majority of the calls would fall in the “+Inf” bucket. (#95375, @marwanad) [SIG Cloud Provider and Instrumentation]
    • Fix: azure disk resize error if source does not exist (#93011, @andyzhangx) [SIG Cloud Provider]
    • Fix: detach azure disk broken on Azure Stack (#94885, @andyzhangx) [SIG Cloud Provider]
    • Fixed a bug where improper storage and comparison of endpoints led to excessive API traffic from the endpoints controller (#94934, @damemi) [SIG Apps, Network and Testing]
    • Gracefully delete nodes when their parent scale set went missing (#95289, @bpineau) [SIG Cloud Provider]
    • Kubeadm: warn but do not error out on missing “ca.key” files for root CA, front-proxy CA and etcd CA, during “kubeadm join --control-plane” if the user has provided all certificates, keys and kubeconfig files which require signing with the given CA keys. (#94988, @neolit123) [SIG Cluster Lifecycle]

    Other (Cleanup or Flake)

    • Masks ceph RBD adminSecrets in logs when logLevel >= 4 (#95245, @sfowl) [SIG Storage]
  • If you are upgrading from 12.1.0, upgrading to this release will not roll your nodes.

    This patch release fixes a problem causing the accidental deletion and reinstallation of Preinstalled Apps (such as CoreDNS) in 12.x.x tenant clusters.

    Please upgrade all older clusters to this version in order to prevent possible downtime.

    Note before upgrade:

    Please contact your Solution Engineer before upgrading. The upgrade is automated. However, it includes a data migration from Helm 2 release ConfigMaps to Helm 3 release Secrets as well as some pre-upgrade checks, so we recommend monitoring the upgrade to ensure safety.

    Note for Solution Engineers:

    Please use Upgrading tenant clusters to Helm 3 as a guide on the upgrade process for the checks and monitoring steps.

    Note for future 12.x.x releases:

    Please persist this note and the one above until all customers are on Azure v12.1.x or above.

    Change details

    cluster-operator 0.23.18

    • Remove all chartconfig migration logic that caused accidental deletion and is no longer needed.

    app-operator 2.3.5

    • Add resource version for chart configmaps and secrets to the chart CR to reduce latency of update events.
    • Fix YAML comparison for chart configmaps and secrets.

    chart-operator 2.3.5

    • Updated Helm to v3.3.4.
    • Added event count metrics for delete, install, rollback and update of Helm releases.
  • If you are upgrading from 12.0.2, upgrading to this release will not roll your nodes. It will only update the apps.

    This release upgrades all Helm releases managed by Giant Swarm to use Helm v3.3.0.

    This lets us benefit from the improved security model and keep up to date with the community. We also remove the Tiller deployment, along with its gRPC endpoint, from the giantswarm namespace, which reduces operational complexity.
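
    For example, one way to sanity-check the result after the migration is sketched below (it assumes the upstream Helm storage conventions; namespaces may differ per installation):

    ```sh
    # The Tiller deployment and its gRPC endpoint should be gone from the
    # giantswarm namespace after the upgrade.
    kubectl -n giantswarm get deployment tiller-deploy

    # Helm 2 stored release data in ConfigMaps labelled OWNER=TILLER ...
    kubectl -n giantswarm get configmaps -l OWNER=TILLER

    # ... while Helm 3 stores release data in Secrets labelled owner=helm.
    kubectl get secrets --all-namespaces -l owner=helm
    ```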

    If you are still using Helm 2, these Helm releases will not be affected. However, we encourage you to upgrade to Helm 3, as Helm 2 support ends on November 13th, 2020. See https://helm.sh/blog/helm-v2-deprecation-timeline/ for details.

    Below, you can find more details on components that were changed with this release.

    Note before upgrade:

    Please contact your Solution Engineer before upgrading. The upgrade is automated. However, it includes a data migration from Helm 2 release ConfigMaps to Helm 3 release Secrets as well as some pre-upgrade checks, so we recommend monitoring the upgrade to ensure safety.

    Note for Solution Engineers:

    Please use Upgrading tenant clusters to Helm 3 as a guide on the upgrade process for the checks and monitoring steps.

    Note for future 12.x.x releases:

    Please persist this note and the one above until all customers are on Azure v12.1.x or above.

    Change details

    app-operator v2.1.1

    chart-operator v2.3.0

    cert-exporter v1.2.4

    • Adjusted vault token format check for base62 tokens.
  • If you are upgrading from 12.0.1, upgrading to this release will not roll your nodes. It will only update the apps.

    This release upgrades external-dns app to v1.4.0 to improve observability.

    Below, you can find more details on components that were changed with this release.

    external-dns v0.7.3 Giant Swarm app 1.4.0

    • Improve monitoring.
  • If you are upgrading from 12.0.0, upgrading to this release will not roll your nodes. It will only update the apps.

    This release upgrades external-dns app to v1.3.0.

    Below, you can find more details on components that were changed with this release.

    external-dns v0.7.3 Giant Swarm app 1.3.0

  • This is the first release to support Kubernetes 1.17 on Azure.

    Important Warning: During the master upgrade from 11.4.0 to 12.0.0, we noticed a spike in request failures within a time frame of about 30 seconds. This is most likely caused by the Azure CNI upgrade and, despite our best efforts, we have not found a solution that maintains the upgrade path and avoids this disturbance. Please keep this in mind when scheduling an upgrade window, and contact your SE if you have further questions.

    As of this release, the NGINX Ingress Controller App is an optional component on Azure and is no longer pre-installed. This allows NGINX App installations to be managed independently from the base platform lifecycle. This is a benefit, but also a new responsibility: NGINX App installations must be kept up to date separately from the rest of the cluster. Making NGINX optional also enables the use of other ingress controller alternatives without wasting resources where NGINX is not the preferred option. Upgrading existing tenant clusters with pre-installed NGINX will leave NGINX unchanged. Existing NGINX App custom resources will still carry the giantswarm.io/managed-by: cluster-operator label, but it will be ignored. The label will be cleaned up at a later point, after all tenant clusters have been upgraded and Azure platform releases older than v12.0.0 have been archived.

    This release also includes a fix for Quay being a single point of failure, by using the Docker registry mirroring feature. This ensures the availability of all images needed for node bootstrap, so cluster creation and scaling no longer depend on Quay availability.

    Apart from that, the release contains many changes in other components, including important security fixes in Kubernetes and Calico, and also brings technical changes that are needed to prepare the upcoming Cluster API release with Node Pools support.

    Notes for future 12.x.x releases: In order to proceed with the upgrade, clusters must be on release 11.4.0 or newer, otherwise the upgrade request will fail. Pay attention to the increased rate of request failures during the upgrade of the master node, and include this information in the release notes if the problem persists.

    Below, you can find more details on components that were changed with this release.

    Kubernetes v1.17.9

    • Cloud Provider Labels reach General Availability: Added as a beta feature way back in v1.2, v1.17 sees the general availability of cloud provider labels.
    • Volume Snapshot Moves to Beta: The Kubernetes Volume Snapshot feature is now beta in Kubernetes v1.17. It was introduced as alpha in Kubernetes v1.12, with a second alpha with breaking changes in Kubernetes v1.13.
    • CSI Migration Beta: The Kubernetes in-tree storage plugin to Container Storage Interface (CSI) migration infrastructure is now beta in Kubernetes v1.17. CSI migration was introduced as alpha in Kubernetes v1.14.
    • Includes a fix for CVE-2020-8559 that allowed a privilege escalation issue from a compromised node to the cluster.
    • Updated from v1.16.12 - to review all changes please read changelog since v1.17.0.

    azure-operator v4.2.0

    • Changed how Azure authentication works when connecting to a Subscription other than the Control Plane's, so that we can move forward with replacing the VPN gateway with VNET peering for connecting customer clusters to the Giant Swarm Control Plane, which increases performance and lowers costs.
    • Restricted storage account access to the local VNET only to increase security.
    • The Azure MSI extension for Linux was not used, so it is no longer deployed.

    Calico v3.15.1

    • Includes a fix for CVE-2020-13597 that could allow a compromised Pod to redirect all or part of the Node's network traffic to itself.
    • Updated from 3.10.1 - to review all changes please read changelogs since v3.15.0, v3.14, v3.13, v3.12, v3.11, and v3.10.

    CoreDNS v1.6.5 (Giant Swarm app v1.2.0)

    • Apply a readiness probe to improve reliability of the clusters.
    • Increase the liveness probe failure threshold from 5 failures to 7 failures.
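
    For illustration, the resulting probes on the CoreDNS Deployment would look roughly like the sketch below (ports and paths follow upstream CoreDNS defaults and are assumptions; only the failure threshold reflects this change):

    ```yaml
    # Hypothetical excerpt from the CoreDNS container spec.
    readinessProbe:
      httpGet:
        path: /ready       # served by the CoreDNS "ready" plugin (assumed)
        port: 8181
    livenessProbe:
      httpGet:
        path: /health      # served by the CoreDNS "health" plugin (assumed)
        port: 8080
      failureThreshold: 7  # raised from 5 failures to 7 failures
    ```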

    cluster-operator 0.23.12

    • Support for making NGINX IC App optional and not pre-installed.

    external-dns v0.7.2 Giant Swarm app 1.2.2

    • Updated from v0.5.18 - to review all changes please read changelog since v0.5.18.
  • In this release, NGINX Ingress Controller LoadBalancer Service external traffic policy has been made configurable, and the policy default has been changed from Cluster to Local.

    These changes:

    • Enable source IP preservation, needed for IP based security access control.
    • Remove unnecessary ingress traffic intra-cluster network hops, increasing concurrent connection capacity and reducing latency.
    • Allow configuration to be reverted, where these defaults and associated tradeoffs are found to be inappropriate.

    Management of NGINX IC LoadBalancer Service is moved from azure-operator to NGINX IC App itself to:

    • Enable external traffic policy configurability, in a way consistent with other NGINX IC configuration options.
    • Lay the foundation for making NGINX IC App optional and not pre-installed in a future Azure platform release.

    Along with azure-operator and NGINX IC, minor improvements were also made to CoreDNS, cluster-operator, Flatcar, and Kubernetes.

    When upgrading clusters, the migration of the LoadBalancer Service is largely automated but still requires manual steps and supervision. Therefore, please delegate cluster upgrades to your SE.

    Note for SEs:

    Before the cluster upgrade to 11.4.0, check whether the DNS zone has a TXT DNS record called <clusterid>ingress.<clusterid>.<region>.azure.gigantic.io. If the record is not there, but there is another TXT record called ingress.<clusterid>.<region>.azure.gigantic.io, please create a new TXT record using the right name (with the cluster id at the beginning) and the value copied from the old one (the one without the cluster id at the beginning). If you don't do that, the migration script execution (see below) will fail.
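
    A minimal sketch of this check and the record creation follows (the resource group, zone name and record names are placeholders; in Azure DNS, record names are relative to the zone):

    ```sh
    # Check whether the TXT record with the cluster id prefix exists.
    dig +short TXT <clusterid>ingress.<clusterid>.<region>.azure.gigantic.io

    # If it is missing, read the value of the old record and create the new one.
    # Resource group and zone names below are placeholders for the installation.
    az network dns record-set txt show \
      --resource-group <resource-group> \
      --zone-name <zone-name> \
      --name <old-record-name>
    az network dns record-set txt add-record \
      --resource-group <resource-group> \
      --zone-name <zone-name> \
      --record-set-name <new-record-name> \
      --value "<value copied from the old record>"
    ```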

    After the cluster upgrade to 11.4.0, both the old ingress-loadbalancer LoadBalancer Service managed by azure-operator and the new nginx-ingress-controller Service managed by the NGINX IC App remain on the cluster. To switch the ingress traffic to the new LoadBalancer and remove the old NGINX LoadBalancer Service without downtime, please:

    • Together with the customer, have any firewall in front of NGINX reconfigured to allow both the old and the new LoadBalancer Service IPs.
    • Next, use the migration script to switch the DNS records to the new load balancer IP. The script ensures an IP is assigned to the new LB and that the cluster DNS records resolve to it instead of the old IP.
    • Finally, delete the old NGINX IC LoadBalancer Service (the one called ingress-loadbalancer), as sketched below.
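
    A minimal sketch of the final step (the kube-system namespace is an assumption; the Service names are the ones mentioned above):

    ```sh
    # Confirm the new LoadBalancer Service has an external IP assigned.
    kubectl -n kube-system get service nginx-ingress-controller \
      -o jsonpath='{.status.loadBalancer.ingress[0].ip}'

    # Once the DNS records resolve to the new IP, delete the old Service.
    kubectl -n kube-system delete service ingress-loadbalancer
    ```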

    Note for future 11.4.x releases:

    To prevent downtime, please persist this and the above note until all customers are on 11.4.0 or above.

    Below, you can find more details on components that were changed with this release.

    azure-operator v4.1.0

    • Moved NGINX LoadBalancer Service management from azure-operator to nginx-ingress-controller app.
    • The default egress strategy for the worker node VMSS is now a VNET Gateway rather than the Load Balancer.

    CoreDNS v1.6.5 (Giant Swarm app v1.1.10)

    • Made forward options optional.
    • Made resources (requests/limits) configurable.

    cluster-operator v0.23.10

    • Fixed a bug in user values migration logic for apps.
    • Enabled NGINX LoadBalancer Service on Azure.

    Flatcar v2512.2.1

    Kubernetes v1.16.12

    • Updated from v1.16.8 - to review all changes please read the changelogs since v1.16.8, v1.16.9, v1.16.10 and v1.16.11.
    • Includes a fix for CVE-2019-11253 related to json/yaml decoding where large or malformed documents could consume excessive server resources. Request bodies for normal API requests (create/delete/update/patch operations of regular resources) are now limited to 3MB.

    nginx-ingress-controller v0.33.0 (Giant Swarm app v1.7.0)

    • Changed NGINX Service type from NodePort to LoadBalancer for Azure.
    • Made NGINX LoadBalancer Service external traffic policy configurable.
    • Use Local as NGINX LoadBalancer Service default external traffic policy.
    • Supported configuring (via controller.service.public configuration property) whether NGINX LoadBalancer Service should be public (default) or internal.
    • Upgraded to ingress-nginx 0.33.0.
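
    Putting the options above together, a user values override for the app could look roughly like this (controller.service.public is named above; the externalTrafficPolicy key name is an assumption):

    ```yaml
    # Hypothetical user values for the nginx-ingress-controller app.
    controller:
      service:
        public: true                   # public (default) or internal LoadBalancer
        externalTrafficPolicy: Local   # new default; set to Cluster to revert
    ```
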
  • If you are upgrading from 11.3.1 or 11.3.2, upgrading to this release will not roll your nodes. It will only update the apps.

    This release adds support for using a different CIDR for the control plane and the tenant clusters which makes it more flexible when users need to connect the tenant cluster network with their own virtual or on-premise networks.

    Note when upgrading from v11.2.x to v11.3.x:

    This release contains the replacement of CoreOS with Flatcar introduced in v11.3.0. Please carefully read the release notes for 11.3.0, including the Flatcar OS migration steps.

    Note for future v11.3.x releases:

    Please persist this and the above note until all customers are on v11.3.0 or above.

    Below, you can find more details on components that were changed with this release.

    metrics-server v0.3.3 (Giant Swarm app v1.1.0)

    • Add the installation’s IPAM CIDR to the allowed egress subnets.

    net-exporter (v1.8.1)

    • Add the installation’s IPAM CIDR to the allowed egress subnets.

    kube-state-metrics v1.9.5 (Giant Swarm app v1.1.0)

    • Add the installation’s IPAM CIDR to the allowed egress subnets.