Full Observability: Prometheus + Grafana + Tempo + Loki
Observability means being able to answer “what’s wrong and why” from the outside — without modifying the code. The three pillars: metrics (what happened), logs (what the code did), and traces (how a request flowed). This article wires them all together.
The Stack
Spring Boot App
├── Metrics → Micrometer → Prometheus scrape → Grafana dashboards
├── Traces → Micrometer Tracing → OTLP → Tempo → Grafana trace view
└── Logs → Logback → Loki4j → Loki → Grafana log explorer
All three converge in Grafana — click a metric spike to see the correlated logs and traces for that exact time window.
Dependencies
<!-- Metrics -->
<dependency>
    <groupId>io.micrometer</groupId>
    <artifactId>micrometer-registry-prometheus</artifactId>
</dependency>

<!-- Distributed tracing -->
<dependency>
    <groupId>io.micrometer</groupId>
    <artifactId>micrometer-tracing-bridge-otel</artifactId>
</dependency>
<dependency>
    <groupId>io.opentelemetry</groupId>
    <artifactId>opentelemetry-exporter-otlp</artifactId>
</dependency>

<!-- Loki log shipping -->
<dependency>
    <groupId>com.github.loki4j</groupId>
    <artifactId>loki-logback-appender</artifactId>
    <version>1.5.1</version>
</dependency>
Application Configuration
spring:
  application:
    name: order-service

management:
  endpoints:
    web:
      exposure:
        include: health,info,prometheus,metrics
  endpoint:
    health:
      probes:
        enabled: true
      show-details: when-authorized
  metrics:
    tags:
      application: ${spring.application.name}
      environment: ${ENVIRONMENT:local}
  tracing:
    sampling:
      probability: 1.0   # 100% in dev; 0.1 (10%) in prod
  otlp:
    tracing:
      endpoint: http://tempo:4318/v1/traces
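One way to get the dev/prod sampling split mentioned in the comment is a profile-specific override rather than editing the main file. A sketch, with the file name and value as illustrations:

# application-prod.yml - loaded only when the "prod" profile is active
management:
  tracing:
    sampling:
      probability: 0.1   # sample 10% of requests in production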
Logback with Trace Correlation
<!-- logback-spring.xml -->
<configuration>
    <springProperty scope="context" name="appName" source="spring.application.name"/>

    <!-- Dev: colorized console with trace IDs -->
    <springProfile name="dev,local">
        <appender name="CONSOLE" class="ch.qos.logback.core.ConsoleAppender">
            <encoder>
                <pattern>%d{HH:mm:ss.SSS} %highlight(%-5level) [%X{traceId},%X{spanId}] %cyan(%logger{36}) - %msg%n</pattern>
            </encoder>
        </appender>
        <root level="INFO">
            <appender-ref ref="CONSOLE"/>
        </root>
        <logger name="com.devopsmonk" level="DEBUG"/>
    </springProfile>

    <!-- Prod: JSON to Loki -->
    <springProfile name="prod">
        <appender name="LOKI" class="com.github.loki4j.logback.Loki4jAppender">
            <http>
                <url>http://loki:3100/loki/api/v1/push</url>
            </http>
            <format>
                <label>
                    <pattern>app=${appName},env=${ENVIRONMENT},host=${HOSTNAME}</pattern>
                </label>
                <message class="com.github.loki4j.logback.JsonLayout">
                    <!-- Include MDC fields in every log entry -->
                    <includeKeyValue>true</includeKeyValue>
                </message>
            </format>
        </appender>
        <root level="WARN">
            <appender-ref ref="LOKI"/>
        </root>
        <logger name="com.devopsmonk" level="INFO"/>
    </springProfile>
</configuration>
Micrometer Tracing automatically puts traceId and spanId into MDC — every log statement includes them without any code changes. In Grafana, you can click a trace and jump directly to the correlated logs.
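Since MDC fields end up in the Loki JSON payload (per the layout above), you can also push business identifiers into MDC so they become queryable alongside the trace IDs. A minimal sketch using plain SLF4J; the orderId key and the helper class are illustrative, not part of the earlier service code:

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.slf4j.MDC;

class OrderLoggingExample {
    private static final Logger log = LoggerFactory.getLogger(OrderLoggingExample.class);

    void reserveInventory(String orderId) {
        // Business identifiers placed in MDC ride along with traceId/spanId,
        // so in Loki you can filter on them, e.g. {app="order-service"} | json | orderId="..."
        MDC.put("orderId", orderId);
        try {
            log.info("Reserving inventory");   // carries orderId, traceId and spanId
        } finally {
            MDC.remove("orderId");             // MDC is thread-local, always clean up
        }
    }
}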
Custom Spans
@Service
@RequiredArgsConstructor
@Slf4j
public class OrderService {

    private final Tracer tracer;
    private final OrderRepository repository;
    private final InventoryClient inventoryClient;

    public Order createOrder(CreateOrderRequest request) {
        // Create a child span for the business operation
        Span span = tracer.nextSpan().name("order.create").start();
        try (Tracer.SpanInScope scope = tracer.withSpan(span)) {
            span.tag("customerId", request.customerId().toString());
            span.tag("itemCount", String.valueOf(request.items().size()));

            // Inventory check — creates its own child span via Feign instrumentation
            inventoryClient.checkAvailability(request.items());

            Order order = repository.save(buildOrder(request));
            span.tag("orderId", order.getId().toString());

            log.info("Order created: orderId={}", order.getId()); // includes traceId automatically
            return order;
        } catch (Exception e) {
            span.error(e);
            throw e;
        } finally {
            span.end();
        }
    }
}
Feign clients, Kafka producers/consumers, and (with the appropriate Micrometer JDBC instrumentation on the classpath) database calls are instrumented for you — they appear as child spans in Tempo without any changes to business code.
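When a whole method is the unit of work, the manual span plus try/finally above can be replaced with Micrometer's @Observed annotation, which produces both a timer metric and a span. A sketch, assuming spring-boot-starter-aop is on the classpath; registering the ObservedAspect bean is what makes the annotation take effect:

@Configuration
class ObservabilityConfig {
    // Required for @Observed to be honored; needs spring-boot-starter-aop
    @Bean
    ObservedAspect observedAspect(ObservationRegistry registry) {
        return new ObservedAspect(registry);
    }
}

@Service
class InventoryService {
    // Wraps each call in an observation named "inventory.reserve":
    // a timer in Prometheus and a span in Tempo, no Tracer boilerplate
    @Observed(name = "inventory.reserve", contextualName = "inventory-reserve")
    public void reserve(String orderId) {
        // ... business logic
    }
}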
Prometheus Configuration
# prometheus.yml
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: order-service
    metrics_path: /actuator/prometheus
    static_configs:
      - targets: ['order-service:8081']

  # Or for Kubernetes:
  - job_name: kubernetes-pods
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
Kubernetes annotation-based discovery — annotate pods to enable scraping:
# In the pod template spec:
metadata:
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/path: "/actuator/prometheus"
    prometheus.io/port: "8081"
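The relabel_configs above honor prometheus.io/scrape and prometheus.io/path but not the port annotation. The standard companion rule, which rewrites the scrape address from prometheus.io/port, belongs in the same relabel_configs list:

- source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
  action: replace
  regex: ([^:]+)(?::\d+)?;(\d+)
  replacement: $1:$2
  target_label: __address__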
Docker Compose: Full Stack Locally
# docker-compose.observability.yml
services:
  prometheus:
    image: prom/prometheus:v2.51.0
    volumes:
      - ./observability/prometheus.yml:/etc/prometheus/prometheus.yml
    ports:
      - "9090:9090"
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.retention.time=7d'

  grafana:
    image: grafana/grafana:10.4.0
    ports:
      - "3000:3000"
    environment:
      GF_AUTH_ANONYMOUS_ENABLED: "true"
      GF_AUTH_ANONYMOUS_ORG_ROLE: Admin
    volumes:
      - ./observability/grafana/provisioning:/etc/grafana/provisioning
      - ./observability/grafana/dashboards:/var/lib/grafana/dashboards

  tempo:
    image: grafana/tempo:2.4.1
    command: ["-config.file=/etc/tempo.yaml"]
    volumes:
      - ./observability/tempo.yaml:/etc/tempo.yaml
    ports:
      - "3200:3200"   # Tempo HTTP API (queried by Grafana)
      - "4318:4318"   # OTLP HTTP ingest

  loki:
    image: grafana/loki:2.9.6
    ports:
      - "3100:3100"
    command: -config.file=/etc/loki/local-config.yaml
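The provisioning directory mounted into Grafana above is where the three datasources are declared and cross-linked. A sketch of a datasources file follows; the uid values are arbitrary, and the derivedFields and tracesToLogsV2 options are the ones that make trace IDs in Loki logs clickable and add a "logs for this span" link in Tempo (check the exact option names against your Grafana version):

# observability/grafana/provisioning/datasources/datasources.yaml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    uid: prometheus
    url: http://prometheus:9090

  - name: Loki
    type: loki
    uid: loki
    url: http://loki:3100
    jsonData:
      derivedFields:
        # Turn the traceId field in the JSON logs into a link to Tempo
        # ($$ escapes the dollar sign for Grafana's provisioning interpolation)
        - name: traceId
          matcherRegex: '"traceId":"(\w+)"'
          url: '$${__value.raw}'
          datasourceUid: tempo

  - name: Tempo
    type: tempo
    uid: tempo
    url: http://tempo:3200
    jsonData:
      tracesToLogsV2:
        # "Logs for this span" in the trace view queries Loki for the same trace
        datasourceUid: loki
        filterByTraceID: true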
Grafana Dashboards
Grafana’s Spring Boot dashboard (ID: 17175) shows out-of-the-box:
- Request rate, error rate, latency (R.E.D. metrics)
- JVM heap, GC pauses, thread count
- HikariCP connection pool utilization
- Logback log rates by level
Import it: Grafana → Dashboards → Import → enter 17175.
Custom Dashboard Queries
# Request rate by endpoint
rate(http_server_requests_seconds_count{application="order-service"}[5m])
# Error rate
rate(http_server_requests_seconds_count{application="order-service",status=~"5.."}[5m])
# p99 latency
histogram_quantile(0.99, rate(http_server_requests_seconds_bucket{application="order-service"}[5m]))
# Active orders (custom gauge)
orders_pending_count{application="order-service"}
# Circuit breaker state (one time series per state tag; value 1 when the breaker is in that state)
resilience4j_circuitbreaker_state{application="order-service"}
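The raw 5xx rate above is often easier to read as a share of total traffic. A ratio query for a Grafana stat panel could look like this; the sum() on both sides keeps the division from matching series label-for-label:

# Error percentage across all endpoints
100 * sum(rate(http_server_requests_seconds_count{application="order-service",status=~"5.."}[5m]))
    / sum(rate(http_server_requests_seconds_count{application="order-service"}[5m]))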
Alerting
# alerting-rules.yaml
groups:
  - name: order-service
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_server_requests_seconds_count{application="order-service",status=~"5.."}[5m]))
            / sum(rate(http_server_requests_seconds_count{application="order-service"}[5m])) > 0.05
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Order service error rate > 5%"
          runbook: https://wiki.devopsmonk.com/runbooks/order-service

      - alert: CircuitBreakerOpen
        expr: resilience4j_circuitbreaker_state{state="open"} == 1
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Circuit breaker {{ $labels.name }} is open"

      - alert: HighP99Latency
        expr: |
          histogram_quantile(0.99,
            rate(http_server_requests_seconds_bucket{application="order-service"}[5m])) > 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Order service p99 latency > 2s"
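Prometheus only evaluates these rules if they are loaded via rule_files, and the alerts only go anywhere if an Alertmanager is configured. A minimal addition to prometheus.yml, assuming an alertmanager container reachable at alertmanager:9093 (not part of the compose file above):

rule_files:
  - /etc/prometheus/alerting-rules.yaml

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']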
The Debugging Workflow
When an alert fires:
1. Grafana Dashboards — see the spike in error rate or latency. What time? Which endpoint?
2. Prometheus — query http_server_requests_seconds_count{status="500",uri="/api/orders"} — confirm the failing endpoint.
3. Loki — query {app="order-service"} | json | level="ERROR" for that time window — see the error messages and stack traces.
4. Tempo — find a trace ID from the Loki log. Open it in Tempo — see every span: which service was slow, which database call failed, where the error originated.
5. Fix the code. With traceId correlating logs and traces, you see the full picture — not just that something failed, but exactly why and where.
What You’ve Learned
- Three pillars of observability: metrics (Micrometer/Prometheus), logs (Logback/Loki), traces (Micrometer Tracing/Tempo)
- Micrometer Tracing puts traceId and spanId in MDC automatically — logs and traces are correlated with no code changes
- Custom spans with Tracer.nextSpan() add business context to distributed traces
- Feign clients, Kafka, and database calls are automatically instrumented as child spans
- Grafana unifies all three signals — click from a metric alert to correlated logs and traces
- Alert on error rate, p99 latency, and circuit breaker state — not just “service is down”
This completes Part 10: Containers and Cloud. You now have everything needed to deploy, operate, and observe Spring Boot in production.
Next: Part 11 — Spring Boot 4 and Modern Java starts with Article 55: What’s New in Spring Boot 4.0.