Observability
Added
- Add monthly GitHub Actions workflow to auto-update Tempo, Mimir, Loki, and Alloy mixin dashboards
- Add Monitoring Landscape / Customer Audit dashboard to audit monitoring tools across workload clusters and compare resource consumption with the Giant Swarm observability platform
Changed
- Move Tempo dashboards from private_dashboards_mz to team_atlas under Giant Swarm/Observability/Tempo
- Move Loki dashboards from private_dashboards_al to team_atlas under Giant Swarm/Observability/Loki
- Refresh Loki dashboards from latest upstream mixin
- Move Mimir dashboards from private_dashboards_mz to team_atlas under Giant Swarm/Observability/Mimir
- Refresh Mimir mixin dashboards from upstream mimir-2.17.6
- Move Alloy dashboards from private_dashboards_al to team_atlas under Giant Swarm/Observability/Alloy
- Refresh Alloy mixin dashboards from upstream v1.15.0
- Add alloy-loki and alloy-otel-engine-overview dashboards
Fixed
- Update recovery-test AWS IAM to support unique names
Added
- Add E2E test suites for all Alloy topologies (metrics, logs, events) on WC using apptest-framework.
- Add Helm CI test values for controller types, network policies, Kyverno, secrets, and PodLogs.
Changed
- Upgrade Alloy upstream chart from 1.6.1 to 1.7.0 (CHANGELOG)
- This bumps the version of Alloy from 1.13.2 to 1.15.0 (CHANGELOG)
Removed
- Remove ATS (Python/pytest) test infrastructure in favour of apptest-framework.
Fixed
- Upgrade pg-cluster-recovery-test subchart: v0.4.0 => v0.4.1
Changed
- Update Tempo dashboards to mixins v2.10
- Fix bugs in the Tempo operational dashboard
- Update Network Traffic Analysis Overview dashboard
- Replace average network traffic gauges with total data transfer bar chart
- Add a time period selector for data transfer periods
- Make total visible in all panels
- Remove datasource variable
- Simplify traffic rate queries
Removed
- Remove the object-storage-operator
Changed
- Upgrade grafana chart: 11.3.3 => 11.3.6
- Upgrade pg-cluster-recovery-test subchart: v0.3.0 => v0.4.0
Fixed
- Fix Grafana Management Lifecycle Policy.
Added
- Added new gRPC routes for Loki and Tempo write
Changed
- Rename mimir.writeRewritePaths → mimir.write.stripPrefixPaths to clarify that the /prometheus prefix is stripped before forwarding; add equivalent stripPrefixPaths: [] defaults to loki and tempo write config.
- Expose Tempo gRPC backend config in values (tempo.read.grpc.backendService, tempo.read.grpc.backendPort) instead of hardcoding in the template.
- Expose Loki and Mimir backend config in values (loki.read.backendService, loki.read.backendPort, loki.write.backendService, loki.write.backendPort, mimir.read.backendService, mimir.read.backendPort, mimir.write.backendService, mimir.write.backendPort) instead of hardcoding in the template.
- Restructure Helm templates into per-service subdirectories (templates/loki/, templates/mimir/, templates/tempo/).
- Share HTTPRouteFilter resources within each service: a single headers-check filter and (for Mimir/Tempo) a single rewrite filter are now referenced by all routes in that service namespace.
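
The stripPrefixPaths rename above describes prefix stripping before a request is forwarded. As a rough illustration only, the general idea can be sketched in Python. The function name strip_prefix, the segment-boundary matching, and the assumption that each configured entry is a path prefix are all hypothetical; the chart performs the rewrite through its gateway filters, not through code like this.

```python
def strip_prefix(path: str, prefixes: list[str]) -> str:
    """Return path with the first matching prefix removed.

    Hypothetical helper: mirrors the idea behind stripPrefixPaths,
    not the chart's actual implementation.
    """
    for prefix in prefixes:
        base = prefix.rstrip("/")
        # Match only on a path-segment boundary, so "/prometheusx" is left alone.
        if path == base or path.startswith(base + "/"):
            stripped = path[len(base):]
            # An exact match leaves nothing behind; forward to the root path.
            return stripped or "/"
    # No configured prefix matched (e.g. the empty default list): forward as-is.
    return path


if __name__ == "__main__":
    print(strip_prefix("/prometheus/api/v1/push", ["/prometheus"]))
    print(strip_prefix("/api/v1/push", []))
```

With an empty list, matching the stripPrefixPaths: [] default mentioned for loki and tempo, the path is forwarded unchanged.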
Changed
- Upgrade grafana chart: 11.2.3 => 11.3.3
- Upgrade grafana (appVersion): 12.4.0 => 12.4.1
Changed
- Upgrade Tempo chart from 2.4.2 to 2.6.2
- Upgrade Tempo from 2.10.1 to 2.10.2
Added
- Add requests and limits for the distributor, which fixes its HPA
Fixed
- Reduce the vulture search window to 24h to avoid querying traces outside the retention period