Observability

  • Added

    • Add monthly GitHub Actions workflow to auto-update Tempo, Mimir, Loki, and Alloy mixin dashboards
    • Add Monitoring Landscape / Customer Audit dashboard to audit monitoring tools across workload clusters and compare resource consumption with the Giant Swarm observability platform

    Changed

    • Move Tempo dashboards from private_dashboards_mz to team_atlas under Giant Swarm/Observability/Tempo
    • Move Loki dashboards from private_dashboards_al to team_atlas under Giant Swarm/Observability/Loki
    • Refresh Loki dashboards from latest upstream mixin
    • Move Mimir dashboards from private_dashboards_mz to team_atlas under Giant Swarm/Observability/Mimir
    • Refresh Mimir mixin dashboards from upstream mimir-2.17.6
    • Move Alloy dashboards from private_dashboards_al to team_atlas under Giant Swarm/Observability/Alloy
    • Refresh Alloy mixin dashboards from upstream v1.15.0
      • Adds alloy-loki and alloy-otel-engine-overview dashboards
  • Fixed

    • Update recovery-test AWS IAM to support unique names
  • Added

    • Add E2E test suites for all Alloy topologies (metrics, logs, events) on workload clusters (WC) using apptest-framework.
    • Add Helm CI test values for controller types, network policies, Kyverno, secrets, and PodLogs.

    Changed

    • Upgrade Alloy upstream chart from 1.6.1 to 1.7.0 (CHANGELOG)
      • This bumps the version of Alloy from 1.13.2 to 1.15.0 (CHANGELOG)

    Removed

    • Remove ATS (Python/pytest) test infrastructure in favour of apptest-framework.
  • Fixed

    • Upgrade pg-cluster-recovery-test subchart: v0.4.0 => v0.4.1
  • Changed

    • Update Tempo dashboards to mixin v2.10
    • Fix bugs in the Tempo operational dashboard
    • Update Network Traffic Analysis Overview dashboard
      • Replace average network traffic gauges with total data transfer bar chart
      • Add a time period selector for data transfer periods
      • Make total visible in all panels
      • Remove datasource variable
      • Simplify traffic rate queries

    Removed

    • Remove the object-storage-operator
  • Changed

    • Upgrade grafana chart: 11.3.3 => 11.3.6
    • Upgrade pg-cluster-recovery-test subchart: v0.3.0 => v0.4.0
  • Fixed

    • Fix Grafana Management Lifecycle Policy.
  • Added

    • Add new gRPC routes for Loki and Tempo writes

    Changed

    • Rename mimir.writeRewritePaths to mimir.write.stripPrefixPaths to clarify that the /prometheus prefix is stripped before forwarding; add equivalent stripPrefixPaths: [] defaults to the Loki and Tempo write config.
    • Expose Tempo gRPC backend config in values (tempo.read.grpc.backendService, tempo.read.grpc.backendPort) instead of hardcoding in the template.
    • Expose Loki and Mimir backend config in values (loki.read.backendService, loki.read.backendPort, loki.write.backendService, loki.write.backendPort, mimir.read.backendService, mimir.read.backendPort, mimir.write.backendService, mimir.write.backendPort) instead of hardcoding in the template.
    • Restructure Helm templates into per-service subdirectories (templates/loki/, templates/mimir/, templates/tempo/).
    • Share HTTPRouteFilter resources within each service: a single headers-check filter and (for Mimir/Tempo) a single rewrite filter are now referenced by all routes in that service namespace.
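
The prefix stripping mentioned for mimir.write.stripPrefixPaths can be pictured with a short Python sketch. This is only an illustration of the idea, not the chart's implementation: the function name strip_prefix and the whole-segment matching rule are assumptions, and the actual rewrite is performed by the gateway's route filters.

```python
def strip_prefix(path: str, prefixes: list[str]) -> str:
    """Return `path` with the first matching prefix removed (hypothetical helper)."""
    for prefix in prefixes:
        base = prefix.rstrip("/")
        # Match whole path segments only, so "/prometheusx" is left untouched.
        if path == base or path.startswith(base + "/"):
            # An exact match strips everything, so fall back to the root path.
            return path[len(base):] or "/"
    # No prefix matched (or the list is empty, like the Loki/Tempo default).
    return path
```

With the prefix list ["/prometheus"], a request path of /prometheus/api/v1/push would be forwarded as /api/v1/push, while an empty list leaves every path unchanged.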
  • Changed

    • Upgrade grafana chart: 11.2.3 => 11.3.3
    • Upgrade grafana (appVersion): 12.4.0 => 12.4.1
  • Changed

    • Upgrade Tempo chart from 2.4.2 to 2.6.2
      • Upgrades Tempo from 2.10.1 to 2.10.2

    Added

    • Add resource requests and limits for the distributor, which fixes its HPA

    Fixed

    • Reduce the Vulture search window to 24h to avoid querying traces outside the retention period