Observability

  • Changed

    • Upgraded chart dependency to kube-prometheus-stack-85.2.0
      • prometheus-operator to 0.90.1
      • alertmanager image tag from v0.31.1 to v0.32.1
      • prometheus image tag from v3.10.0 to v3.11.3-distroless (distroless variant now enabled by default)
      • prometheus-node-exporter from 4.52.0 to 4.55.0, distroless variant enabled by default
      • grafana subchart from 11.2.3 to 12.3.3 (Grafana app upgraded to v13)
      • kube-state-metrics from 7.2.0 to 7.3.0
      • kube-webhook-certgen image tag from 1.7.8 to 1.8.2
  • Added

    • Add Strimzi Kafka operator dashboards to the Shared Org Grafana organization under the Kafka folder
  • Changed

    • Upgrade upstream loki helm chart from v11 (11.6.4) to v13 (13.5.0). Loki app version unchanged at 3.7.1.
    • Pin loki.deploymentMode to SimpleScalable in values.yaml. Upstream v13 changed the default to Monolithic; this preserves the existing backend/read/write topology.
    • Move gateway route configuration from loki.gateway.route to top-level gatewayRoute. Upstream v13 redefined gateway.route as a strict-schema map of named routes that rejected our flat structure. Consumers overriding loki.gateway.route.* must rename to gatewayRoute.*.
      • Note: Prefer the upstream loki.gateway.route where possible. The gatewayRoute section exists for compatibility and to keep a few extra features we had added.
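
    The resulting values layout can be sketched as follows. This is a minimal illustration, not the chart's actual defaults: the `hostnames` key and its value are hypothetical placeholders used only to show the rename, so substitute whatever keys you currently set under loki.gateway.route.

    ```yaml
    loki:
      # Pinned explicitly because upstream v13 defaults to Monolithic.
      deploymentMode: SimpleScalable

      # Before (flat structure, rejected by upstream v13's strict schema):
      # gateway:
      #   route:
      #     hostnames:            # hypothetical key, for illustration only
      #       - loki.example.com

    # After: the same keys, moved to the top-level gatewayRoute section.
    gatewayRoute:
      hostnames:                  # hypothetical key, for illustration only
        - loki.example.com
    ```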

    Notable upstream changes

    • New default livenessProbe on every loki pod (/loki/api/v1/status/buildinfo, 30s period, 10× failure threshold ≈ 5 min before kill).
    • Memberlist hardening: defaults add abort_if_cluster_join_fails, IPv6-friendly advertise_addr: ${HASH_RING_INSTANCE_ADDR} (auto-injected from status.podIP), join retry/backoff. Backend/read/write pods now run with -config.expand-env=true.
    • Server tuning defaults added (graceful_shutdown_timeout: 5s, gRPC keepalive, grpc_server_max_concurrent_streams: 1000, 100 MiB gRPC msg size, http_server_idle_timeout: 30s).
    • k8s-sidecar bumped 2.6.0 → 2.7.1; gains /healthz liveness/readiness probes on new http-sidecar port (8080).
    • Gateway nginx image bumped 1.29-alpine → 1.30-alpine.
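
    As a rough illustration of the liveness timing in the first bullet, the probe would correspond to a Kubernetes spec along these lines. Only the path, period, and failure threshold come from the notes above; the port name is an assumption, and the upstream chart may set additional fields.

    ```yaml
    livenessProbe:
      httpGet:
        path: /loki/api/v1/status/buildinfo
        port: http-metrics        # assumption: named container port
      periodSeconds: 30
      failureThreshold: 10        # 10 failures x 30s = 300s, about 5 min before restart
    ```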
  • Changed

    • Upgrade grafana chart: 11.3.7 => 12.3.0 (Grafana 13 by default).
    • Upgrade grafana (appVersion): 12.4.1 => 13.0.1.
  • Added

    • Add envoy-vs-nginx-loadtesting and envoy-vs-kong-loadtesting dashboards to public org, split from the former envoy-gateway-loadtesting private dashboard

    Fixed

    • Fix metric comparability in envoy-vs-nginx-loadtesting: align downstream RPS (rate/aggregation) and success rate window ([2m]→[5m]), replace the incorrect Nginx upstream RPS metric, and fix the Nginx downstream latency unit (seconds→ms)
    • Fix metric comparability in envoy-vs-kong-loadtesting: align downstream RPS (rate/aggregation), fix success rate regex to correctly exclude 4xx/5xx codes
  • Fixed

    • Remove a hardcoded label from one graph in the envoy-gateway-loadtesting dashboard.
  • Changed

    • Upgrade grafana chart: 11.3.6 => 11.3.7
    • Upgrade pg-cluster-recovery-test subchart: v0.4.1 => v0.5.0

    Fixed

    • Make sure Crossplane cannot delete the Azure storage accounts/containers and S3 buckets it manages.
  • Changed

    • Upgrade Tempo Vulture chart from 0.12.7 to 0.12.9
    • Upgrade Tempo chart from 2.6.2 to 2.14.3
      • Upgrades Tempo from 2.10.2 to 2.10.4

    Fixed

    • Make sure Crossplane cannot delete the Azure storage accounts/containers and S3 buckets it manages.