Observability
Fixed
- Disabled per-component serviceaccounts so we rely on a central serviceaccount that has the proper IRSA annotation.
Changed
- Updated Loki Helm chart dependency to use the grafana-community/helm-charts repository.
- Upgrade upstream Loki Helm chart from v6 (6.53.0) to v11 (11.6.4), Loki app version 3.6.5 → 3.7.1.
- Remove custom ScaledObject wrapper templates (backend, gateway, read, write); upstream v11 ships native ScaledObject support, and keeping both caused duplicate resources.
- Remove kedaAutoscaling value blocks consumed only by the deleted wrapper templates.
- Remove selfMonitoring dead code; removed from upstream prior to v11.
- Remove redundant podSecurityContext and containerSecurityContext overrides for loki and gateway components; they are now identical to upstream v11 defaults.
- Rename sample_configs/ directory to examples/.
Added
- Add envoy-gateway-loadtesting dashboard
Changed
- Reorganize Grafana Cloud directories: dashboards/ → grafana-cloud/sources/, grafana-cloud/ → grafana-cloud/backup/, GC API scripts → grafana-cloud/scripts/
- Update backup workflow and README to reflect new paths
Added
- Add monthly GitHub Actions workflow to auto-update Tempo, Mimir, Loki, and Alloy mixin dashboards
- Add Monitoring Landscape / Customer Audit dashboard to audit monitoring tools across workload clusters and compare their resource consumption with the Giant Swarm observability platform
Changed
- Move Tempo dashboards from private_dashboards_mz to team_atlas under Giant Swarm/Observability/Tempo
- Move Loki dashboards from private_dashboards_al to team_atlas under Giant Swarm/Observability/Loki
- Refresh Loki dashboards from latest upstream mixin
- Move Mimir dashboards from private_dashboards_mz to team_atlas under Giant Swarm/Observability/Mimir
- Refresh Mimir mixin dashboards from upstream mimir-2.17.6
- Move Alloy dashboards from private_dashboards_al to team_atlas under Giant Swarm/Observability/Alloy
- Refresh Alloy mixin dashboards from upstream v1.15.0
- Adds alloy-loki and alloy-otel-engine-overview dashboards
Fixed
- Update recovery-test AWS IAM to support unique names
Added
- Add E2E test suites for all alloy topologies (metrics, logs, events) on WC using apptest-framework.
- Add Helm CI test values for controller types, network policies, Kyverno, secrets, and PodLogs.
Changed
- Upgrade Alloy upstream chart from 1.6.1 to 1.7.0
- This bumps the version of Alloy from 1.13.2 to 1.15.0
Removed
- Remove ATS (Python/pytest) test infrastructure in favour of apptest-framework.
Fixed
- Upgrade pg-cluster-recovery-test subchart: v0.4.0 => v0.4.1
Changed
- Updated Tempo dashboards to mixins v2.10
- Fix bugs in Tempo operational dashboard
- Update Network Traffic Analysis Overview dashboard
- Replace average network traffic gauges with total data transfer bar chart
- Add a time period selector for data transfer periods
- Make total visible in all panels
- Remove datasource variable
- Simplify traffic rate queries
Removed
- Remove the object-storage-operator
Changed
- Upgrade grafana chart: 11.3.3 => 11.3.6
- Upgrade pg-cluster-recovery-test subchart: v0.3.0 => v0.4.0
Fixed
- Fix Grafana Management Lifecycle Policy.