Trace-derived metrics

Giant Swarm’s observability platform automatically generates metrics from your trace data using Tempo’s metrics-generator. This transformation enables you to create alerts and dashboards based on distributed tracing insights while using familiar Prometheus/PromQL tooling.

You can’t create alerts directly from trace data, so these automatically generated metrics bridge the gap between detailed trace analysis and reliable monitoring.

Understanding metrics derived from traces

Tempo’s metrics-generator automatically creates rate, error, and duration (RED) metrics from your trace data:

Rate

Request rate: The number of requests per second for each service and operation.

# Total request rate for a service
rate(tempo_service_graph_request_total[5m])

# Request rate by operation
rate(tempo_service_graph_request_total{operation="GET /api/users"}[5m])

Error

Error rate: The percentage of failed requests for each service and operation.

# Error rate for a service
(
  rate(tempo_service_graph_request_failed_total[5m]) / 
  rate(tempo_service_graph_request_total[5m])
) * 100

# Rate of requests with 5xx status codes (requests per second)
rate(tempo_service_graph_request_total{status_code=~"5.."}[5m])

Duration

Response time: Latency percentiles for each service and operation.

# 95th percentile latency
histogram_quantile(0.95, rate(tempo_service_graph_request_duration_seconds_bucket[5m]))

# Average response time
rate(tempo_service_graph_request_duration_seconds_sum[5m]) / 
rate(tempo_service_graph_request_duration_seconds_count[5m])

Available trace-derived metrics

Tempo’s metrics-generator creates several categories of metrics from your traces:

Service graph metrics

Metrics representing service-to-service communication:

# Request rate between services
tempo_service_graph_request_total{client="api-gateway", server="user-service"}

# Failed requests between services
tempo_service_graph_request_failed_total{client="api-gateway", server="user-service"}

# Request duration histogram between services (exposed as _bucket/_sum/_count series)
tempo_service_graph_request_duration_seconds_bucket{client="api-gateway", server="user-service"}

Span metrics

Metrics for individual operations within services:

# Span request rate by operation
tempo_span_metrics_calls_total{service_name="user-service", span_name="GET /api/users"}

# Span error rate
tempo_span_metrics_calls_total{service_name="user-service", status_code="STATUS_CODE_ERROR"}

# Span duration histogram (exposed as _bucket/_sum/_count series)
tempo_span_metrics_duration_seconds_bucket{service_name="user-service", span_name="database_query"}

Custom dimensions

Additional dimensions based on span attributes:

# Metrics by HTTP method
tempo_span_metrics_calls_total{http_method="POST"}

# Metrics by database operation
tempo_span_metrics_calls_total{db_operation="SELECT"}

# Custom business dimensions
tempo_span_metrics_calls_total{customer_tier="premium"}
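
Which span attributes are promoted to metric labels is controlled in the metrics-generator configuration. As a sketch, upstream Tempo accepts a dimensions list per processor (the attribute names below are illustrative, and how Giant Swarm exposes this setting may differ):

# Illustrative upstream Tempo configuration excerpt; attribute names are
# examples, and the managed platform may expose this setting differently
metrics_generator:
  processor:
    span_metrics:
      # Span attributes added as labels on generated metrics; dots in
      # attribute names become underscores (http.method -> http_method)
      dimensions:
        - http.method
        - db.operation
        - customer.tier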

Querying trace-derived metrics

Finding available metrics

Discover metrics generated from your traces:

# List all trace-derived metrics
{__name__=~"tempo_.*"}

# Service graph metrics
{__name__=~"tempo_service_graph.*"}

# Span metrics
{__name__=~"tempo_span_metrics.*"}

Common query patterns

Service health monitoring

# Service request rate (requests per second)
sum(rate(tempo_service_graph_request_total[5m])) by (server)

# Service error rates
sum(rate(tempo_service_graph_request_failed_total[5m])) by (server) /
sum(rate(tempo_service_graph_request_total[5m])) by (server)

# Service response times
histogram_quantile(0.95, 
  sum(rate(tempo_service_graph_request_duration_seconds_bucket[5m])) by (server, le)
)

Operation-level monitoring

# HTTP endpoint error rates
sum(rate(tempo_span_metrics_calls_total{status_code="STATUS_CODE_ERROR"}[5m])) by (span_name) /
sum(rate(tempo_span_metrics_calls_total[5m])) by (span_name)

# Database operation latency
histogram_quantile(0.99,
  sum(rate(tempo_span_metrics_duration_seconds_bucket{span_kind="SPAN_KIND_CLIENT"}[5m])) 
  by (span_name, le)
)

# External service dependencies
sum(rate(tempo_span_metrics_calls_total{span_kind="SPAN_KIND_CLIENT"}[5m])) 
by (service_name, span_name)

Cross-service analysis

# Traffic between service pairs
sum(rate(tempo_service_graph_request_total[5m])) by (client, server)

# Inter-service error propagation
sum(rate(tempo_service_graph_request_failed_total[5m])) by (client, server)

# Service dependency latency (average)
sum(rate(tempo_service_graph_request_duration_seconds_sum[5m])) by (client, server) /
sum(rate(tempo_service_graph_request_duration_seconds_count[5m])) by (client, server)

Setting up alerts based on trace data

Alert rule examples

Create alerting rules using trace-derived metrics:

High error rate alert

groups:
- name: trace-based-alerts
  rules:
  - alert: HighServiceErrorRate
    expr: |
      (
        sum(rate(tempo_service_graph_request_failed_total[5m])) by (server) /
        sum(rate(tempo_service_graph_request_total[5m])) by (server)
      ) * 100 > 5      
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High error rate detected for service {{ $labels.server }}"
      description: "Service {{ $labels.server }} has error rate of {{ $value }}% for 5 minutes"

High latency alert

- alert: HighServiceLatency
  expr: |
    histogram_quantile(0.95,
      sum(rate(tempo_service_graph_request_duration_seconds_bucket[5m])) by (server, le)
    ) > 2    
  for: 10m
  labels:
    severity: critical
  annotations:
    summary: "High latency detected for service {{ $labels.server }}"
    description: "Service {{ $labels.server }} 95th percentile latency is {{ $value }}s"

Service availability alert

- alert: ServiceUnavailable
  expr: |
    sum(rate(tempo_service_graph_request_total[5m] offset 10m)) by (server) > 0
    unless
    sum(rate(tempo_service_graph_request_total[5m])) by (server) > 0
  labels:
    severity: critical
  annotations:
    summary: "Service {{ $labels.server }} appears to be unavailable"
    description: "Service {{ $labels.server }} was receiving requests 10 minutes ago but none in the last 5 minutes"

Best practices for using trace-derived metrics

Alert design principles

  • Focus on business impact: Alert on conditions that affect user experience
  • Use appropriate time windows: Balance sensitivity with noise reduction (see the sketch after this list)
  • Set meaningful thresholds: Base thresholds on historical data and SLA requirements
  • Include context: Add relevant labels and annotations for effective incident response
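
For example, a rule that combines a short and a long window (a sketch reusing the service graph metrics above; the 5% threshold is illustrative) fires only when the error rate is elevated in both, suppressing brief spikes without hiding sustained problems:

- alert: ServiceErrorRateSustained
  expr: |
    (
      sum(rate(tempo_service_graph_request_failed_total[5m])) by (server) /
      sum(rate(tempo_service_graph_request_total[5m])) by (server)
    ) > 0.05
    and
    (
      sum(rate(tempo_service_graph_request_failed_total[1h])) by (server) /
      sum(rate(tempo_service_graph_request_total[1h])) by (server)
    ) > 0.05
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Sustained error rate above 5% for service {{ $labels.server }}"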

Common monitoring patterns

  1. Service-level monitoring: Track overall service health using RED metrics
  2. Dependency monitoring: Alert when upstream services affect downstream performance
  3. Capacity planning: Monitor request rates and latency trends over time (see the recording rule sketch after this list)
  4. Quality monitoring: Track degradation in service quality metrics
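
For capacity planning, precomputing these trends with recording rules keeps long-range dashboards fast. A minimal sketch (the rule names follow an illustrative naming convention, not a platform requirement):

groups:
- name: trace-derived-trends
  rules:
  # Per-service request rate, recorded for long-range trend dashboards
  - record: server:tempo_service_graph_request:rate5m
    expr: sum(rate(tempo_service_graph_request_total[5m])) by (server)
  # Per-service 95th percentile latency, recorded for trend analysis
  - record: server:tempo_service_graph_request_duration_seconds:p95_5m
    expr: |
      histogram_quantile(0.95,
        sum(rate(tempo_service_graph_request_duration_seconds_bucket[5m])) by (server, le)
      )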

Next steps

For more detailed configuration options, refer to the Tempo metrics-generator documentation.