Last modified July 29, 2025
Alert rules
This guide shows you how to create and deploy alerting and recording rules using Kubernetes resources. For an overview of what these rules are and how they fit into the alerting pipeline, see the alert management overview.
How to deploy rules
You define alerting and recording rules using Prometheus Operator PrometheusRule resources, following Giant Swarm’s GitOps approach. You can deploy these rules to both management clusters and workload clusters; the deployment target determines how rules are scoped (see Rule scoping below).
The platform evaluates your rules and routes alerts through the alerting pipeline to configured receivers.
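For orientation, here is a minimal sketch of the overall shape of such a resource; the name, namespace, and tenant used here are placeholders, and the sections below show complete examples:
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  labels:
    # Tenant label, explained in the next section
    observability.giantswarm.io/tenant: my_team
  name: my-rules
  namespace: my-namespace
spec:
  groups:
  - name: my-rule-group
    # Alerting or recording rules go here
    rules: []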
Required tenant labeling
Important: All alert rules must include the observability.giantswarm.io/tenant label, which must reference an existing tenant defined in a Grafana Organization. The system ignores any PrometheusRule that references a non-existent tenant.
Get familiar with tenant management in our multi-tenancy documentation.
Alerting rule examples
Create alerting rules using Prometheus alerting rule syntax with PromQL or LogQL expressions.
Metric-based alert
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  labels:
    # Required: specifies which tenant this alert belongs to
    observability.giantswarm.io/tenant: my_team
  name: component-availability
  namespace: my-namespace
spec:
  groups:
  - name: reliability
    rules:
    - alert: ComponentDown
      annotations:
        summary: 'Component {{ $labels.service }} is down'
        description: 'Component {{ $labels.service }} has been down for more than 5 minutes.'
        # Optional: link to relevant dashboard
        __dashboardUid__: my-dashboard-uid
        # Optional: link to troubleshooting documentation
        runbook_url: https://my-runbook-url
      # PromQL expression that defines the alert condition
      expr: up{job=~"component/.*"} == 0
      # Duration the condition must be true before firing
      for: 5m
      labels:
        # Severity level for routing and prioritization
        severity: critical
Key components
- alert: Unique name for the alert rule
- expr: PromQL or LogQL expression defining when the alert fires
- for: Duration the condition must remain true before firing
- labels: Key-value pairs for routing and grouping alerts
- annotations: Human-readable information about the alert
For guidance on writing effective PromQL queries, see the Prometheus querying documentation or our advanced PromQL tutorial. You can also explore queries in your installation’s Grafana explore interface.
Recording rule examples
Create recording rules using Prometheus recording rule syntax to pre-compute expensive expressions.
Basic recording rule
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  labels:
    observability.giantswarm.io/tenant: my_team
  name: cluster-resource-usage
  namespace: my-namespace
spec:
  groups:
  - name: cluster-resource-usage
    rules:
    - expr: |
        avg by (cluster_id) (
          node:node_cpu_utilization:ratio_rate5m
        )
      record: cluster:node_cpu:ratio_rate5m
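The recorded series can then be referenced like any other metric. As an illustrative sketch only (the 80% threshold, alert name, and severity are assumptions, not platform defaults), an additional rule in the same group could alert on the pre-computed value instead of re-evaluating the underlying expression:
    - alert: ClusterCPUUsageHigh
      annotations:
        summary: 'High CPU usage in cluster {{ $labels.cluster_id }}'
      # Reuses the pre-computed series produced by the recording rule above
      expr: cluster:node_cpu:ratio_rate5m > 0.8
      for: 15m
      labels:
        severity: warning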
Log-based alerting examples
Create log-based alerts using LogQL queries. These require specific labels to route to the Loki ruler.
Log pattern alert
Add the following labels so that the rule is evaluated by the Loki ruler:
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  labels:
    observability.giantswarm.io/tenant: my_team
    # Required: indicates this is a log-based rule
    observability.giantswarm.io/rule-type: logs
    # Deprecated but still required for compatibility
    application.giantswarm.io/prometheus-rule-kind: loki
  name: application-log-alerts
  namespace: my-namespace
spec:
  groups:
  - name: log-monitoring
    rules:
    - alert: HighErrorLogRate
      annotations:
        summary: 'High error rate in application logs'
        description: 'Application {{ $labels.app }} is producing {{ $value }} errors per minute'
      # LogQL expression to count error logs
      expr: |
        sum(rate({app="my-app"} |= "ERROR" [5m])) by (app) > 10
      for: 2m
      labels:
        severity: warning
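As a variation, and assuming your application emits JSON logs with a level field (an assumption about the log format, not a platform requirement), the expression could parse the log line instead of matching a substring:
      expr: |
        sum by (app) (
          count_over_time({app="my-app"} | json | level="error" [5m])
        ) > 10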
For more information about writing LogQL queries, see the Loki LogQL documentation.
Rule scoping
The platform provides scoping mechanisms to prevent conflicts when deploying the same rules across multiple clusters within a tenant.
Scoping behavior
Workload cluster deployment (cluster-scoped)
When you deploy a PrometheusRule in a workload cluster, the system automatically scopes rules to that specific cluster. For example, deploying a rule with the expression up{job="good"} > 0 in workload cluster alpha1 results in up{cluster_id="alpha1", job="good"} > 0.
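Spelled out with the placeholders from that example:
# As written in the PrometheusRule applied to workload cluster alpha1
expr: up{job="good"} > 0
# As evaluated after the platform adds the cluster scope
expr: up{cluster_id="alpha1", job="good"} > 0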
Management cluster deployment (installation-scoped)
When you deploy a PrometheusRule in a management cluster, rules target all clusters in the installation without modification.
Limitations
- Only metric-based alerts support cluster scoping due to upstream limitations
- Manual conflict management required for rules deployed per application in different namespaces
- Consider unique naming or namespace-specific labeling for multi-environment deployments
Tenant federation
With Alloy 1.9, the platform supports tenant federation, letting you create rules based on other tenants’ data without duplicating data intake. Add the monitoring.grafana.com/source_tenants label to your PrometheusRule.
Example: System metrics alerting
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  labels:
    observability.giantswarm.io/tenant: my_team
    # Define the source tenant for metrics used in the alert
    monitoring.grafana.com/source_tenants: giantswarm
  name: system-node-alerts
  namespace: my-namespace
spec:
  groups:
  - name: system-monitoring
    rules:
    - alert: NodeDown
      annotations:
        summary: 'Cluster node is down'
        description: 'Node {{ $labels.instance }} in cluster {{ $labels.cluster_id }} has been down for more than 5 minutes.'
        __dashboardUid__: system-metrics-dashboard
      # Query system metrics from the giantswarm tenant
      expr: up{job="node-exporter"} == 0
      for: 5m
      labels:
        severity: critical
Next steps
- Configure alert routing for your tenants to complete the alerting pipeline
- Review the alerting pipeline architecture to understand how alerts flow through the system
- Learn about data exploration to query and analyze the metrics and logs that drive your alerts
Related observability features
Alert rules work best when integrated with other platform capabilities:
- Data management: Use advanced querying techniques to test and refine your alert expressions before deploying them
- Multi-tenancy: Essential for understanding tenant labeling requirements and secure alert isolation
- Data Import and Export: Import external logs that can trigger alerts and export alert data for comprehensive monitoring coverage across your infrastructure
Need help, got feedback?
We listen to your Slack support channel. You can also reach us at support@giantswarm.io. And of course, we welcome your pull requests!