Observability

  • Changed

    • Upgrade upstream loki helm chart from v11 (11.6.4) to v13 (13.5.0). Loki app version unchanged at 3.7.1.
    • Pin loki.deploymentMode to SimpleScalable in values.yaml. Upstream v13 changed the default to Monolithic; this preserves the existing backend/read/write topology.
    • Move gateway route configuration from loki.gateway.route to top-level gatewayRoute. Upstream v13 redefined gateway.route as a strict-schema map of named routes that rejected our flat structure. Consumers overriding loki.gateway.route.* must rename to gatewayRoute.*.
      • Note: It’s better if you use the upstream loki.gateway.route. The gatewayRoute section is here for compatibility, and keeping a few extra features we had added.

    Notable upstream changes

    • New default livenessProbe on every loki pod (/loki/api/v1/status/buildinfo, 30s period, 10× failure threshold ≈ 5 min before kill).
    • Memberlist hardening: defaults add abort_if_cluster_join_fails, IPv6-friendly advertise_addr: ${HASH_RING_INSTANCE_ADDR} (auto-injected from status.podIP), join retry/backoff. Backend/read/write pods now run with -config.expand-env=true.
    • Server tuning defaults added (graceful_shutdown_timeout: 5s, gRPC keepalive, grpc_server_max_concurrent_streams: 1000, 100 MiB gRPC msg size, http_server_idle_timeout: 30s).
    • k8s-sidecar bumped 2.6.0 → 2.7.1; gains /healthz liveness/readiness probes on new http-sidecar port (8080).
    • Gateway nginx image bumped 1.29-alpine1.30-alpine.
  • Changed

    • Upgrade grafana chart: 11.3.7 => 12.3.0 (Grafana 13 by default).
    • Upgrade grafana (appVersion): 12.4.1 => 13.0.1.
  • Added

    • Add envoy-vs-nginx-loadtesting and envoy-vs-kong-loadtesting dashboards to public org, split from the former envoy-gateway-loadtesting private dashboard

    Fixed

    • Fix metric comparability in envoy-vs-nginx-loadtesting: align downstream RPS (rate/aggregation), success rate window ([2m]→[5m]), and replace wrong Nginx upstream RPS metric; fix Nginx downstream latency unit (seconds→ms)
    • Fix metric comparability in envoy-vs-kong-loadtesting: align downstream RPS (rate/aggregation), fix success rate regex to correctly exclude 4xx/5xx codes
  • Fixed

    • Removed hardcoded label in one of envoy-gateway-loadtesting dashboard’s graph.
  • Changed

    • Upgrade grafana chart: 11.3.6 => 11.3.7
    • Upgrade pg-cluster-recovery-test subchart: v0.4.1 => v0.5.0

    Fixed

    • Make sure crossplane cannot delete Crossplane azure storage accounts/containers and s3 buckets.
  • Fixed

    • Make sure crossplane cannot delete Crossplane azure storage accounts/containers and s3 buckets.
  • Fixed

    • Make sure crossplane cannot delete Crossplane azure storage accounts/containers and s3 buckets.
  • Changed

    • Upgrade Tempo Vulture chart from 0.12.7 to 0.12.9
    • Upgrade Tempo chart from to 2.6.2 to 2.14.3
      • Upgrades Tempo from 2.10.2 to 2.10.4

    Fixed

    • Make sure crossplane cannot delete Crossplane azure storage accounts/containers and s3 buckets.
  • Changed

    • Update envoy-gateway-loadtesting dashboard with added kong graphs
  • Fixed

    • Disabled per-component serviceaccounts so we rely on a central serviceaccount that has proper IRSA annotation.