Observability

  • Fixed

    • Disable per-component service accounts so that all components rely on a central service account with the proper IRSA annotation.
  • Changed

    • Update Loki Helm chart dependency to use the grafana-community/helm-charts repository.
    • Upgrade upstream Loki Helm chart from v6 (6.53.0) to v11 (11.6.4); Loki app version 3.6.5 → 3.7.1.
    • Remove custom ScaledObject wrapper templates (backend, gateway, read, write); upstream v11 ships native ScaledObject support, and keeping both caused duplicate resources.
    • Remove kedaAutoscaling value blocks consumed only by the deleted wrapper templates.
    • Remove dead selfMonitoring code; selfMonitoring was removed upstream prior to v11.
    • Remove redundant podSecurityContext and containerSecurityContext overrides for loki and gateway components; they are now identical to upstream v11 defaults.
    • Rename sample_configs/ directory to examples/.
  • Added

    • Add envoy-gateway-loadtesting dashboard

    Changed

    • Reorganize Grafana Cloud directories: dashboards/grafana-cloud/sources/, grafana-cloud/grafana-cloud/backup/, GC API scripts → grafana-cloud/scripts/
    • Update backup workflow and README to reflect new paths
  • Added

    • Add monthly GitHub Actions workflow to auto-update Tempo, Mimir, Loki, and Alloy mixin dashboards
    • Add Monitoring Landscape / Customer Audit dashboard to audit monitoring tools across workload clusters and compare resource consumption with the Giant Swarm observability platform

    Changed

    • Move Tempo dashboards from private_dashboards_mz to team_atlas under Giant Swarm/Observability/Tempo
    • Move Loki dashboards from private_dashboards_al to team_atlas under Giant Swarm/Observability/Loki
    • Refresh Loki dashboards from latest upstream mixin
    • Move Mimir dashboards from private_dashboards_mz to team_atlas under Giant Swarm/Observability/Mimir
    • Refresh Mimir mixin dashboards from upstream mimir-2.17.6
    • Move Alloy dashboards from private_dashboards_al to team_atlas under Giant Swarm/Observability/Alloy
    • Refresh Alloy mixin dashboards from upstream v1.15.0
      • Add alloy-loki and alloy-otel-engine-overview dashboards
  • Fixed

    • Update recovery-test AWS IAM to support unique names
  • Added

    • Add E2E test suites for all Alloy topologies (metrics, logs, events) on workload clusters (WC) using apptest-framework.
    • Add Helm CI test values for controller types, network policies, Kyverno, secrets, and PodLogs.

    Changed

    • Upgrade Alloy upstream chart from 1.6.1 to 1.7.0 (CHANGELOG)
      • This bumps the version of Alloy from 1.13.2 to 1.15.0 (CHANGELOG)

    Removed

    • Remove ATS (Python/pytest) test infrastructure in favour of apptest-framework.
  • Fixed

    • Upgrade pg-cluster-recovery-test subchart: v0.4.0 => v0.4.1
  • Changed

    • Update Tempo dashboards to mixins v2.10
    • Fix bugs in the Tempo operational dashboard
    • Update Network Traffic Analysis Overview dashboard
      • Replace average network traffic gauges with total data transfer bar chart
      • Add a time period selector for data transfer periods
      • Make total visible in all panels
      • Remove datasource variable
      • Simplify traffic rate queries

    Removed

    • Remove the object-storage-operator
  • Changed

    • Upgrade Grafana chart: 11.3.3 => 11.3.6
    • Upgrade pg-cluster-recovery-test subchart: v0.3.0 => v0.4.0
  • Fixed

    • Fix Grafana Management Lifecycle Policy.