Observability
Changed
- Upgrade upstream loki helm chart from v11 (11.6.4) to v13 (13.5.0). Loki app version unchanged at 3.7.1.
- Pin loki.deploymentMode to SimpleScalable in values.yaml. Upstream v13 changed the default to Monolithic; this preserves the existing backend/read/write topology.
- Move gateway route configuration from loki.gateway.route to top-level gatewayRoute. Upstream v13 redefined gateway.route as a strict-schema map of named routes that rejected our flat structure. Consumers overriding loki.gateway.route.* must rename to gatewayRoute.*.
  - Note: Prefer the upstream loki.gateway.route where possible. The gatewayRoute section remains for compatibility and to keep a few extra features we had added.
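The pin and the rename can be sketched in values.yaml roughly as follows. This is an illustration only: it assumes the upstream chart is consumed as a subchart under the loki key (as the paths above suggest), and the enabled and hostnames keys are hypothetical stand-ins for whatever was previously set under loki.gateway.route.

```yaml
# values.yaml (sketch; key names under gatewayRoute are illustrative only)

loki:
  # Upstream v13 defaults to Monolithic; pin the previous topology explicitly.
  deploymentMode: SimpleScalable

# Before the upgrade, overrides lived under the subchart:
# loki:
#   gateway:
#     route:
#       enabled: true          # hypothetical key
#       hostnames:             # hypothetical key
#         - loki.example.com

# After the upgrade, the same overrides move to the top level:
gatewayRoute:
  enabled: true                # hypothetical key
  hostnames:                   # hypothetical key
    - loki.example.com
```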
Notable upstream changes
- New default livenessProbe on every loki pod (/loki/api/v1/status/buildinfo, 30s period, 10× failure threshold ≈ 5 min before kill).
- Memberlist hardening: defaults add abort_if_cluster_join_fails, IPv6-friendly advertise_addr: ${HASH_RING_INSTANCE_ADDR} (auto-injected from status.podIP), join retry/backoff. Backend/read/write pods now run with -config.expand-env=true.
- Server tuning defaults added (graceful_shutdown_timeout: 5s, gRPC keepalive, grpc_server_max_concurrent_streams: 1000, 100 MiB gRPC msg size, http_server_idle_timeout: 30s).
- k8s-sidecar bumped 2.6.0 → 2.7.1; gains /healthz liveness/readiness probes on new http-sidecar port (8080).
- Gateway nginx image bumped 1.29-alpine → 1.30-alpine.
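As a rough illustration of how the new probe and memberlist defaults show up on a rendered pod, a container spec fragment might look like the one below. It is a sketch assembled from the values listed above, not the chart's actual template output; the container name and the http-metrics port name are assumptions.

```yaml
# Fragment of a rendered Loki container spec (sketch, not actual chart output)
containers:
  - name: loki                       # assumed container name
    args:
      - -config.expand-env=true      # lets ${HASH_RING_INSTANCE_ADDR} expand in the config
    env:
      - name: HASH_RING_INSTANCE_ADDR
        valueFrom:
          fieldRef:
            fieldPath: status.podIP  # pod IP, whether IPv4 or IPv6
    livenessProbe:
      httpGet:
        path: /loki/api/v1/status/buildinfo
        port: http-metrics           # assumed port name
      periodSeconds: 30
      failureThreshold: 10           # 10 x 30s = 300s, about 5 min before restart
```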
Changed
- Upgrade grafana chart: 11.3.7 => 12.3.0 (Grafana 13 by default).
- Upgrade grafana (appVersion): 12.4.1 => 13.0.1.
Added
- Add envoy-vs-nginx-loadtesting and envoy-vs-kong-loadtesting dashboards to public org, split from the former envoy-gateway-loadtesting private dashboard
Fixed
- Fix metric comparability in envoy-vs-nginx-loadtesting: align downstream RPS (rate/aggregation), success rate window ([2m]→[5m]), and replace wrong Nginx upstream RPS metric; fix Nginx downstream latency unit (seconds→ms)
- Fix metric comparability in envoy-vs-kong-loadtesting: align downstream RPS (rate/aggregation), fix success rate regex to correctly exclude 4xx/5xx codes
Fixed
- Remove hardcoded label in one of the envoy-gateway-loadtesting dashboard's graphs.
Changed
- Upgrade grafana chart: 11.3.6 => 11.3.7
- Upgrade pg-cluster-recovery-test subchart: v0.4.1 => v0.5.0
Fixed
- Make sure Crossplane cannot delete Crossplane-managed Azure storage accounts/containers and S3 buckets.
Changed
- Upgrade Tempo Vulture chart from 0.12.7 to 0.12.9
- Upgrade Tempo chart from 2.6.2 to 2.14.3
- Upgrade Tempo from 2.10.2 to 2.10.4
Fixed
- Make sure Crossplane cannot delete Crossplane-managed Azure storage accounts/containers and S3 buckets.
Changed
- Update envoy-gateway-loadtesting dashboard with added kong graphs
Fixed
- Disable per-component service accounts so that we rely on a central service account that has the proper IRSA annotation.
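For context, IRSA (IAM Roles for Service Accounts on EKS) is driven by an annotation on the ServiceAccount object, so a single shared account might look like the sketch below. The name, namespace, and role ARN are placeholders, not values taken from this chart.

```yaml
# Shared ServiceAccount carrying the IRSA annotation (sketch with placeholder values)
apiVersion: v1
kind: ServiceAccount
metadata:
  name: observability-central        # hypothetical name
  namespace: observability           # hypothetical namespace
  annotations:
    # Standard EKS annotation that binds the account to an IAM role.
    eks.amazonaws.com/role-arn: arn:aws:iam::111122223333:role/example-role
```

Pods that set serviceAccountName to this account pick up the role; per-component accounts without the annotation would not.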