Alert rules
Alert rules define conditions that trigger notifications when specific issues occur in your infrastructure or applications. The Giant Swarm Observability Platform supports both metric-based and log-based alerting through Prometheus and Loki rulers.
How alert rules work
You define alerting and recording rules using Prometheus Operator `PrometheusRule` resources, following Giant Swarm's GitOps approach. Deploy these rules to both management clusters and workload clusters.
The platform evaluates your rules and routes alerts through the alerting pipeline to configured receivers.
Required tenant labeling
Important: All alert rules must include the `observability.giantswarm.io/tenant` label, which must reference an existing tenant defined in a Grafana Organization. The system ignores any `PrometheusRule` that references a non-existent tenant.
Get familiar with tenant management in our multi-tenancy documentation.
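For quick reference, the label sits under the resource's `metadata` (this fragment is excerpted from the full examples below):

```yaml
metadata:
  labels:
    # Must reference an existing tenant defined in a Grafana Organization
    observability.giantswarm.io/tenant: my_team
```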
Alerting rules
Alerting rules use Prometheus (PromQL) or Loki (LogQL) expressions to evaluate conditions and trigger notifications.
Alert example
```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  labels:
    # Required: specifies which tenant this alert belongs to
    observability.giantswarm.io/tenant: my_team
  name: component-availability
  namespace: my-namespace
spec:
  groups:
    - name: reliability
      rules:
        - alert: ComponentDown
          annotations:
            summary: 'Component {{ $labels.service }} is down'
            description: 'Component {{ $labels.service }} has been down for more than 5 minutes.'
            # Optional: link to relevant dashboard
            __dashboardUid__: my-dashboard-uid
            # Optional: link to troubleshooting documentation
            runbook_url: https://my-runbook-url
          # PromQL expression that defines the alert condition
          expr: up{job=~"component/.*"} == 0
          # Duration the condition must be true before firing
          for: 5m
          labels:
            # Severity level for routing and prioritization
            severity: critical
```
Key components
- `alert`: Unique name for the alert rule
- `expr`: PromQL or LogQL expression defining when the alert fires
- `for`: Duration the condition must remain true before firing
- `labels`: Key-value pairs for routing and grouping alerts
- `annotations`: Human-readable information about the alert
For guidance on writing effective PromQL queries, see the Prometheus querying documentation or our advanced PromQL tutorial. You can also explore queries in your installation's Grafana Explore interface.
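For example, you might iterate on an expression like the following in Explore before wiring it into a rule (the metric and label names here are hypothetical):

```promql
# Fraction of requests returning 5xx per service over the last 5 minutes (illustrative)
sum by (service) (rate(http_requests_total{status=~"5.."}[5m]))
  / sum by (service) (rate(http_requests_total[5m]))
```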
Recording rules
Recording rules pre-compute frequently needed or computationally expensive expressions, saving results as new time series. This improves query performance and enables custom metrics for dashboards and alerts.
When to use recording rules
Use recording rules to:
- Improve performance by pre-calculating expensive aggregations
- Create custom metrics by combining multiple metrics into business indicators
- Simplify complex queries by breaking them into manageable components
Recording rule example
```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  labels:
    observability.giantswarm.io/tenant: my_team
  name: cluster-resource-usage
  namespace: my-namespace
spec:
  groups:
    - name: cluster-resource-usage
      rules:
        - expr: |
            avg by (cluster_id) (
              node:node_cpu_utilization:ratio_rate5m
            )
          record: cluster:node_cpu:ratio_rate5m
```
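Once recorded, the new series behaves like any other metric and can back dashboards or alert expressions. A minimal sketch, assuming an illustrative 80% threshold:

```promql
# Alert expression using the recorded series; the 0.8 threshold is an example value
cluster:node_cpu:ratio_rate5m > 0.8
```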
Log-based alerting
Log-based alerting monitors application logs for specific patterns, errors, or anomalies using LogQL queries. The Loki ruler evaluates these rules, enabling powerful application-level monitoring.
For a deeper understanding of how logs flow through the platform, see our logging architecture documentation.
Configuration
Log-based rules need specific labels that tell the platform to route them to Loki for evaluation:
```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  labels:
    observability.giantswarm.io/tenant: my_team
    # Required: indicates this is a log-based rule
    observability.giantswarm.io/rule-type: logs
    # Deprecated but still required for compatibility
    application.giantswarm.io/prometheus-rule-kind: loki
  name: application-log-alerts
  namespace: my-namespace
spec:
  groups:
    - name: log-monitoring
      rules:
        - alert: HighErrorLogRate
          annotations:
            summary: 'High error rate in application logs'
            description: 'Application {{ $labels.app }} is producing {{ $value }} errors per minute'
          # LogQL expression to count error logs
          expr: |
            sum(rate({app="my-app"} |= "ERROR" [5m])) by (app) > 10
          for: 2m
          labels:
            severity: warning
```
For more information about writing LogQL queries, see the Loki LogQL documentation.
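LogQL can also parse structured logs rather than matching raw strings. A sketch assuming the application emits JSON logs with a `level` field (app name and field names are assumptions):

```logql
# Rate of parsed error-level log lines per application
sum by (app) (rate({app="my-app"} | json | level="error" [5m])) > 10
```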
Rule scoping
The platform provides scoping mechanisms to prevent conflicts when deploying the same rules across multiple clusters within a tenant.
Scoping behavior
Workload cluster deployment (cluster-scoped)
When you deploy a `PrometheusRule` in a workload cluster, the system automatically scopes rules to that specific cluster. For example, deploying a rule with expression `up{job="good"} > 0` in workload cluster `alpha1` results in `up{cluster_id="alpha1", job="good"} > 0`.
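Written out, the rewrite looks like this:

```promql
# Expression as authored in the PrometheusRule deployed to workload cluster alpha1
up{job="good"} > 0

# Expression as evaluated after automatic cluster scoping
up{cluster_id="alpha1", job="good"} > 0
```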
Management cluster deployment (installation-scoped)
When you deploy a `PrometheusRule` in a management cluster, rules target all clusters in the installation without modification.
Limitations
- Only metric-based alerts support cluster scoping, due to upstream limitations
- Rules deployed per application in different namespaces require manual conflict management
- For multi-environment deployments, consider unique rule naming or namespace-specific labeling, as sketched below
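A minimal sketch of that labeling idea, using a hypothetical `environment` label to keep otherwise identical rules distinct:

```yaml
# Rule fragment only; the environment label is a hypothetical example
- alert: ComponentDown
  expr: up{job=~"component/.*"} == 0
  for: 5m
  labels:
    severity: critical
    # Distinguishes copies of this rule deployed for different environments
    environment: staging
```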
Tenant federation
With Alloy 1.9, the platform supports tenant federation, letting you create rules based on other tenants' data without duplicating data intake. Just add the `monitoring.grafana.com/source_tenants` label to your `PrometheusRule`.
Example: System metrics alerting
```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  labels:
    observability.giantswarm.io/tenant: my_team
    # Define the source tenant for metrics used in the alert
    monitoring.grafana.com/source_tenants: giantswarm
  name: system-node-alerts
  namespace: my-namespace
spec:
  groups:
    - name: system-monitoring
      rules:
        - alert: NodeDown
          annotations:
            summary: 'Cluster node is down'
            description: 'Node {{ $labels.instance }} in cluster {{ $labels.cluster_id }} has been down for more than 5 minutes.'
            __dashboardUid__: system-metrics-dashboard
          # Query system metrics from the giantswarm tenant
          expr: up{job="node-exporter"} == 0
          for: 5m
          labels:
            severity: critical
```
Next steps
- Configure Alertmanager for your tenants to complete the alerting pipeline
- Review the alerting pipeline architecture to understand how alerts flow through the system
- Learn about data exploration to query and analyze the metrics and logs that drive your alerts
Related observability features
Alert rules work best when integrated with other platform capabilities:
- Data management: Use advanced querying techniques to test and refine your alert expressions before deploying them
- Logging architecture: Understand how log-based alerts work with Loki’s distributed logging system
- Multi-tenancy: Essential for understanding tenant labeling requirements and secure alert isolation
- Observability Platform API: Ingest external logs and events that can trigger alerts for comprehensive monitoring coverage
Need help, got feedback?
We listen to your Slack support channel. You can also reach us at support@giantswarm.io. And of course, we welcome your pull requests!