How to create alerting and recording rules

Learn how to create alerting and recording rules in the Giant Swarm Observability Platform using PrometheusRule resources.

The Giant Swarm Observability Platform provides an alerting pipeline that you can configure per tenant, so each tenant can define its own alerting and recording rules.

Following Giant Swarm’s GitOps approach, you define alerting and recording rules using Prometheus Operator PrometheusRule resources. You can deploy these rules to both management clusters and workload clusters.

Warning: The observability.giantswarm.io/tenant label on your rules must reference an existing tenant defined in a Grafana Organization. The system ignores any PrometheusRule that references a non-existent tenant. Learn more in Multi-tenancy in the observability platform.

Create alerting rules

Alerting rules define conditions that trigger notifications when specific issues occur in your infrastructure or applications. These rules use Prometheus (PromQL) or Loki (LogQL) expressions to evaluate conditions.

Basic alerting rule structure

Here’s a basic example of an alerting rule:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  labels:
    # Required: specifies which tenant this alert belongs to
    observability.giantswarm.io/tenant: myteam
  name: component-availability
  namespace: my-namespace
spec:
  groups:
  - name: reliability
    rules:
      - alert: ComponentDown
        annotations:
          # Human-readable summary
          summary: 'Component {{ $labels.service }} is down'
          # Detailed description with context
          description: 'Component {{ $labels.service }} has been down for more than 5 minutes.'
          # Optional: link to relevant dashboard
          __dashboardUid__: my-dashboard-uid
          # Optional: link to troubleshooting documentation
          runbook_url: https://my-runbook-url
        # PromQL expression that defines the alert condition
        expr: up{job=~"component/.*"} == 0
        # Duration the condition must be true before firing
        for: 5m
        labels:
          # Severity level for routing and prioritization
          severity: critical

Key components

  • alert: unique name for the alert rule
  • expr: PromQL or LogQL expression that defines when the alert should fire
  • for: duration the condition must be true before the alert fires
  • labels: key-value pairs for routing and grouping alerts
  • annotations: human-readable information about the alert

For guidance on writing effective PromQL queries, refer to the Prometheus querying documentation. You can also explore queries in your installation’s Grafana explore interface.
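
As a starting point, the sketch below shows a common pattern for an alert expression: comparing an error-rate ratio against a threshold. The metric name http_requests_total, its code label, and the 5% threshold are illustrative assumptions; substitute the metrics your workloads actually expose.

# Hypothetical example: fire when more than 5% of requests fail over 10 minutes.
# http_requests_total and the code label are assumptions; adjust to your metrics.
sum by (service) (rate(http_requests_total{code=~"5.."}[10m]))
  /
sum by (service) (rate(http_requests_total[10m]))
  > 0.05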

Create recording rules

Recording rules let you pre-compute frequently needed or computationally expensive expressions and save the results as new time series. This improves query performance and lets you build custom metrics for dashboards and alerts.

When to use recording rules

Recording rules are useful when you need to:

  • Improve performance by pre-calculating expensive aggregations used frequently
  • Create custom metrics by combining multiple metrics into business-specific indicators
  • Simplify complex queries by breaking them into manageable components

Basic recording rule structure

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  labels:
    # Required: specifies which tenant this rule belongs to
    observability.giantswarm.io/tenant: myteam
  name: cluster-resource-usage
  namespace: my-namespace
spec:
  groups:
  - name: cluster-resource-usage
    rules:
      - expr: |
          avg by (cluster_id) (
            node:node_cpu_utilization:ratio_rate5m
          )          
        record: cluster:node_cpu:ratio_rate5m
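
Once the recorded series exists, you can reference it like any other metric, for example in an alerting rule. The record name above follows the upstream Prometheus level:metric:operations naming convention. The rule fragment below is a hedged sketch that would sit inside a PrometheusRule like the ones shown earlier; the alert name and the 0.9 threshold are assumptions.

      # Hypothetical alert built on top of the recorded series; the threshold is an assumption.
      - alert: ClusterCPUUtilizationHigh
        annotations:
          summary: 'Cluster {{ $labels.cluster_id }} CPU utilization is above 90%'
        expr: cluster:node_cpu:ratio_rate5m > 0.9
        for: 15m
        labels:
          severity: warning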

Log-based alerting

Log-based alerting lets you monitor application logs for specific patterns, errors, or anomalies using LogQL queries. The Loki ruler evaluates these rules, giving you powerful application-level monitoring capabilities.

Configure log-based rules

To create log-based alerts, add labels that mark the rule for evaluation by the Loki ruler:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  labels:
    observability.giantswarm.io/tenant: myteam
    # Required: indicates this is a log-based rule
    observability.giantswarm.io/rule-type: logs
    # Deprecated but still required for compatibility
    application.giantswarm.io/prometheus-rule-kind: loki
  name: application-log-alerts
  namespace: my-namespace
spec:
  groups:
  - name: log-monitoring
    rules:
      - alert: HighErrorLogRate
        annotations:
          summary: 'High error rate in application logs'
          description: 'Application {{ $labels.app }} is producing {{ $value }} errors per minute'
        # LogQL expression to count error logs
        expr: |
          sum(rate({app="my-app"} |= "ERROR" [5m])) by (app) > 10          
        for: 2m
        labels:
          severity: warning

For more information about writing LogQL queries, refer to the Loki LogQL documentation.
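
If your application emits structured JSON logs, you can parse them in the query instead of matching raw strings. The expression below is a sketch: the level field name, its "error" value, and the threshold are all assumptions about your log format.

# Hypothetical LogQL expression for JSON logs with a "level" field (an assumption).
sum by (app) (
  rate({app="my-app"} | json | level="error" [5m])
) > 10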

Rule scoping

The Observability Platform provides scoping mechanisms to prevent conflicts when deploying the same rules across multiple clusters within a tenant.

Scoping behavior

Workload cluster deployment (cluster-scoped)

When you deploy a PrometheusRule in a workload cluster, the system automatically scopes the rules to that specific cluster. For example, deploying a rule with expression up{job="good"} > 0 in workload cluster alpha1 results in the loaded expression: up{cluster_id="alpha1", job="good"} > 0.
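
In other words, the platform injects a cluster_id matcher into the expression at load time. Side by side, using the example above:

# Expression as written in the PrometheusRule deployed to workload cluster alpha1
up{job="good"} > 0

# Expression as loaded by the ruler (cluster_id matcher injected automatically)
up{cluster_id="alpha1", job="good"} > 0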

Management cluster deployment (installation-scoped)

When you deploy a PrometheusRule in a management cluster, the rules target all clusters in the installation without modification.

Limitations

  • The system only provides cluster scoping for metric-based alerts due to upstream limitations
  • Scoping applies when teams deploy the same rule across multiple clusters
  • For rules you deploy per application in different namespaces, you must manage conflicts manually

For multi-environment deployments, consider using unique naming or namespace-specific labeling to avoid conflicts.
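
One hypothetical way to do this is to encode the namespace in both the rule name and the alert labels, so identical rules deployed from different namespaces stay distinguishable. Everything in this sketch (names, labels, the assumption that your targets carry a namespace label) is illustrative.

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  labels:
    observability.giantswarm.io/tenant: myteam
  # Hypothetical: include the namespace in the rule name to avoid collisions
  name: component-availability-team-a
  namespace: team-a
spec:
  groups:
  - name: reliability
    rules:
      - alert: ComponentDown
        expr: up{job=~"component/.*", namespace="team-a"} == 0
        for: 5m
        labels:
          severity: critical
          # Hypothetical label to keep alerts from different namespaces distinct
          deployment_namespace: team-a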

Using tenant federation for system data alerts

With the introduction of Alloy 1.9, the Giant Swarm Observability Platform supports tenant federation. This lets you create alerting and recording rules based on another tenant's data without duplicating data intake. To use this feature, add the monitoring.grafana.com/source_tenants label to your PrometheusRule.

For more information about multi-tenancy and tenant management, see our multi-tenancy documentation.

Example: Alerting on Giant Swarm system metrics

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  labels:
    observability.giantswarm.io/tenant: myteam
    # Defines the source tenant for the metrics used in the alert
    monitoring.grafana.com/source_tenants: giantswarm
  name: system-node-alerts
  namespace: my-namespace
spec:
  groups:
  - name: system-monitoring
    rules:
      - alert: NodeDown
        annotations:
          summary: 'Cluster node is down'
          description: 'Node {{ $labels.instance }} in cluster {{ $labels.cluster_id }} has been down for more than 5 minutes.'
          # Optional: link to a relevant system metrics dashboard in Grafana
          __dashboardUid__: system-metrics-dashboard
        # Query system metrics from the giantswarm tenant
        expr: up{job="node-exporter"} == 0
        for: 5m
        labels:
          severity: critical
