Full Observability: Prometheus + Grafana + Tempo + Loki

Observability means being able to answer “what’s wrong and why” from the outside — without modifying the code. The three pillars: metrics (aggregate numbers: request rates, latencies, error counts), logs (what the code did, event by event), and traces (how a single request flowed across services). This article wires them all together.

The Stack

Spring Boot App
├── Metrics  → Micrometer → Prometheus scrape → Grafana dashboards
├── Traces   → Micrometer Tracing → OTLP → Tempo → Grafana trace view
└── Logs     → Logback → Loki4j → Loki → Grafana log explorer

All three converge in Grafana — click a metric spike to see the correlated logs and traces for that exact time window.
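
That click-through works when the Grafana data sources know about each other. Below is a minimal provisioning sketch, assuming the Docker Compose setup shown later in this article; the file path, datasource UIDs, and the traceId regex (which must match your log format) are assumptions:

# observability/grafana/provisioning/datasources/datasources.yaml (assumed path)
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    uid: prometheus
    url: http://prometheus:9090

  - name: Loki
    type: loki
    uid: loki
    url: http://loki:3100
    jsonData:
      derivedFields:
        # Turn the traceId in a log line into a link to the Tempo trace
        - name: TraceID
          matcherRegex: '"traceId":"(\w+)"'
          url: '$${__value.raw}'
          datasourceUid: tempo

  - name: Tempo
    type: tempo
    uid: tempo
    url: http://tempo:3200
    jsonData:
      tracesToLogsV2:
        datasourceUid: loki
        filterByTraceID: true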

Dependencies

<!-- Metrics -->
<dependency>
    <groupId>io.micrometer</groupId>
    <artifactId>micrometer-registry-prometheus</artifactId>
</dependency>

<!-- Distributed Tracing -->
<dependency>
    <groupId>io.micrometer</groupId>
    <artifactId>micrometer-tracing-bridge-otel</artifactId>
</dependency>
<dependency>
    <groupId>io.opentelemetry</groupId>
    <artifactId>opentelemetry-exporter-otlp</artifactId>
</dependency>

<!-- Loki log shipping -->
<dependency>
    <groupId>com.github.loki4j</groupId>
    <artifactId>loki-logback-appender</artifactId>
    <version>1.5.1</version>
</dependency>

Application Configuration

spring:
  application:
    name: order-service

management:
  endpoints:
    web:
      exposure:
        include: health,info,prometheus,metrics
  endpoint:
    health:
      probes:
        enabled: true
      show-details: when-authorized
  metrics:
    tags:
      application: ${spring.application.name}
      environment: ${ENVIRONMENT:local}
  tracing:
    sampling:
      probability: 1.0    # 100% in dev; 0.1 (10%) in prod
  otlp:
    tracing:
      endpoint: http://tempo:4318/v1/traces
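
One thing worth adding to this file: Spring Boot does not publish histogram buckets for http.server.requests by default, and the histogram_quantile() queries later in this article need the _bucket series. Enabling percentile histograms is a single property under management.metrics:

management:
  metrics:
    distribution:
      percentiles-histogram:
        http.server.requests: true   # publish _bucket series for latency quantiles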

Logback with Trace Correlation

<!-- logback-spring.xml -->
<configuration>
    <springProperty scope="context" name="appName" source="spring.application.name"/>

    <!-- Dev: colorized console with trace IDs -->
    <springProfile name="dev,local">
        <appender name="CONSOLE" class="ch.qos.logback.core.ConsoleAppender">
            <encoder>
                <pattern>%d{HH:mm:ss.SSS} %highlight(%-5level) [%X{traceId},%X{spanId}] %cyan(%logger{36}) - %msg%n</pattern>
            </encoder>
        </appender>
        <root level="INFO">
            <appender-ref ref="CONSOLE"/>
        </root>
        <logger name="com.devopsmonk" level="DEBUG"/>
    </springProfile>

    <!-- Prod: JSON to Loki -->
    <springProfile name="prod">
        <appender name="LOKI" class="com.github.loki4j.logback.Loki4jAppender">
            <http>
                <url>http://loki:3100/loki/api/v1/push</url>
            </http>
            <format>
                <label>
                    <pattern>app=${appName},env=${ENVIRONMENT},host=${HOSTNAME}</pattern>
                </label>
                <message class="com.github.loki4j.logback.JsonLayout">
                    <!-- Include MDC fields in every log entry -->
                    <includeKeyValue>true</includeKeyValue>
                </message>
            </format>
        </appender>
        <root level="WARN">
            <appender-ref ref="LOKI"/>
        </root>
        <logger name="com.devopsmonk" level="INFO"/>
    </springProfile>
</configuration>

Micrometer Tracing automatically puts traceId and spanId into MDC — every log statement includes them without any code changes. In Grafana, you can click a trace and jump directly to the correlated logs.
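
The trace context is also available programmatically via the Tracer bean. Here is a small sketch (the advice class and response shape are illustrative, not part of the order service shown later) that returns the current traceId to callers, so a support ticket can be matched to the exact trace in Tempo:

import io.micrometer.tracing.Span;
import io.micrometer.tracing.Tracer;
import org.springframework.http.HttpStatus;
import org.springframework.http.ResponseEntity;
import org.springframework.web.bind.annotation.ExceptionHandler;
import org.springframework.web.bind.annotation.RestControllerAdvice;

import java.util.Map;

@RestControllerAdvice
public class TraceAwareErrorHandler {

    private final Tracer tracer;

    public TraceAwareErrorHandler(Tracer tracer) {
        this.tracer = tracer;
    }

    @ExceptionHandler(Exception.class)
    ResponseEntity<Map<String, String>> onUnhandledError(Exception e) {
        // currentSpan() can be null when no trace is active (e.g. outside a request)
        Span current = tracer.currentSpan();
        String traceId = current != null ? current.context().traceId() : "unavailable";
        String message = e.getMessage() != null ? e.getMessage() : e.getClass().getSimpleName();
        return ResponseEntity.status(HttpStatus.INTERNAL_SERVER_ERROR)
                .body(Map.of("error", message, "traceId", traceId));
    }
}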

Custom Spans

@Service
@RequiredArgsConstructor
@Slf4j
public class OrderService {

    private final Tracer tracer;
    private final OrderRepository repository;
    private final InventoryClient inventoryClient;

    public Order createOrder(CreateOrderRequest request) {
        // Create a child span for the business operation
        Span span = tracer.nextSpan().name("order.create").start();

        try (Tracer.SpanInScope scope = tracer.withSpan(span)) {
            span.tag("customerId", request.customerId().toString());
            span.tag("itemCount", String.valueOf(request.items().size()));

            // Inventory check — creates its own child span via Feign instrumentation
            inventoryClient.checkAvailability(request.items());

            Order order = repository.save(buildOrder(request));
            span.tag("orderId", order.getId().toString());

            log.info("Order created: orderId={}", order.getId());  // includes traceId automatically
            return order;

        } catch (Exception e) {
            span.error(e);
            throw e;
        } finally {
            span.end();
        }
    }
}

Feign clients, Kafka producers/consumers, and database calls also show up as child spans in Tempo with no changes to your business code, provided the corresponding Micrometer instrumentation is on the classpath (see the sketch below).
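
What that means in practice for this stack is roughly the following additions; this is a sketch rather than a complete bill of materials, so verify artifact IDs and versions for your build:

<!-- Feign client spans (picked up by Spring Cloud OpenFeign when present) -->
<dependency>
    <groupId>io.github.openfeign</groupId>
    <artifactId>feign-micrometer</artifactId>
</dependency>

<!-- JDBC spans via the datasource-micrometer project (not managed by Boot; set an explicit version) -->
<dependency>
    <groupId>net.ttddyy.observation</groupId>
    <artifactId>datasource-micrometer-spring-boot</artifactId>
</dependency>

Kafka tracing ships with spring-kafka itself; it is switched on with the spring.kafka.template.observation-enabled and spring.kafka.listener.observation-enabled properties.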

Prometheus Configuration

# prometheus.yml
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: order-service
    metrics_path: /actuator/prometheus
    static_configs:
      - targets: ['order-service:8081']
  # Or, for Kubernetes, discover pods by annotation:
  - job_name: kubernetes-pods
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
      # Honor the prometheus.io/port annotation
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__

Kubernetes annotation-based discovery — annotate pods to enable scraping:

# In pod template spec:
metadata:
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/path: "/actuator/prometheus"
    prometheus.io/port: "8081"

Docker Compose: Full Stack Locally

# docker-compose.observability.yml
services:
  prometheus:
    image: prom/prometheus:v2.51.0
    volumes:
      - ./observability/prometheus.yml:/etc/prometheus/prometheus.yml
    ports:
      - "9090:9090"
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.retention.time=7d'

  grafana:
    image: grafana/grafana:10.4.0
    ports:
      - "3000:3000"
    environment:
      GF_AUTH_ANONYMOUS_ENABLED: "true"
      GF_AUTH_ANONYMOUS_ORG_ROLE: Admin
    volumes:
      - ./observability/grafana/provisioning:/etc/grafana/provisioning
      - ./observability/grafana/dashboards:/var/lib/grafana/dashboards

  tempo:
    image: grafana/tempo:2.4.1
    command: ["-config.file=/etc/tempo.yaml"]
    volumes:
      - ./observability/tempo.yaml:/etc/tempo.yaml
    ports:
      - "3200:3200"    # Tempo UI
      - "4318:4318"    # OTLP HTTP

  loki:
    image: grafana/loki:2.9.6
    ports:
      - "3100:3100"
    command: -config.file=/etc/loki/local-config.yaml
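
The compose file mounts ./observability/tempo.yaml without showing it. A minimal single-binary sketch (paths are assumptions; see the Tempo docs for production storage options) that accepts OTLP over HTTP on 4318 and keeps traces on local disk:

# observability/tempo.yaml (assumed content)
server:
  http_listen_port: 3200

distributor:
  receivers:
    otlp:
      protocols:
        http:    # OTLP/HTTP on 4318, matching the port mapping above

storage:
  trace:
    backend: local
    wal:
      path: /tmp/tempo/wal
    local:
      path: /tmp/tempo/blocks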

Grafana Dashboards

Grafana’s Spring Boot dashboard (ID: 17175) shows out-of-the-box:

  • Request rate, error rate, latency (R.E.D. metrics)
  • JVM heap, GC pauses, thread count
  • HikariCP connection pool utilization
  • Logback log rates by level

Import it: Grafana → Dashboards → Import → enter 17175.
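
Importing by hand works, but the compose file already mounts a dashboards directory, so Grafana can load dashboards automatically. A sketch of the dashboard provider config, assuming the provisioning layout from the compose volumes:

# observability/grafana/provisioning/dashboards/dashboards.yaml (assumed path)
apiVersion: 1
providers:
  - name: default
    type: file
    options:
      path: /var/lib/grafana/dashboards   # matches the compose volume mount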

Custom Dashboard Queries

# Request rate by endpoint
rate(http_server_requests_seconds_count{application="order-service"}[5m])

# Error rate
rate(http_server_requests_seconds_count{application="order-service",status=~"5.."}[5m])

# p99 latency
histogram_quantile(0.99, rate(http_server_requests_seconds_bucket{application="order-service"}[5m]))

# Active orders (custom gauge; registration sketch below)
orders_pending_count{application="order-service"}

# Circuit breaker state (one time series per state; value is 1 for the current state)
resilience4j_circuitbreaker_state{application="order-service"}
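
The orders_pending_count query above is a custom gauge, not something Spring Boot exposes on its own. Here is a registration sketch with Micrometer; the countPendingOrders repository method is an assumption, and the dot-separated name is rendered as orders_pending_count by the Prometheus registry:

import io.micrometer.core.instrument.Gauge;
import io.micrometer.core.instrument.MeterRegistry;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class OrderMetricsConfig {

    @Bean
    Gauge pendingOrdersGauge(MeterRegistry registry, OrderRepository orderRepository) {
        // Sampled on every scrape; keep the underlying query cheap
        return Gauge.builder("orders.pending.count", orderRepository, OrderRepository::countPendingOrders)
                .description("Orders waiting to be processed")
                .register(registry);
    }
}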

Alerting

# alerting-rules.yaml
groups:
  - name: order-service
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_server_requests_seconds_count{application="order-service",status=~"5.."}[5m]))
          / sum(rate(http_server_requests_seconds_count{application="order-service"}[5m])) > 0.05
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Order service error rate > 5%"
          runbook: https://wiki.devopsmonk.com/runbooks/order-service

      - alert: CircuitBreakerOpen
        expr: resilience4j_circuitbreaker_state{state="open"} == 1
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Circuit breaker {{ $labels.name }} is open"

      - alert: HighP99Latency
        expr: |
          histogram_quantile(0.99,
            sum by (le) (rate(http_server_requests_seconds_bucket{application="order-service"}[5m]))) > 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Order service p99 latency > 2s"

The Debugging Workflow

When an alert fires:

  1. Grafana Dashboards — see the spike in error rate or latency. What time? Which endpoint?

  2. Prometheus — query http_server_requests_seconds_count{status="500",uri="/api/orders"} — confirm the failing endpoint.

  3. Loki — query {app="order-service"} | json | level="ERROR" for that time window — see the error messages and stack traces.

  4. Tempo — find a trace ID from the Loki log. Open it in Tempo — see every span: which service was slow, which database call failed, where the error originated.

  5. Fix the code. With traceId correlating logs and traces, you see the full picture — not just that something failed, but exactly why and where.

What You’ve Learned

  • Three pillars of observability: metrics (Micrometer/Prometheus), logs (Logback/Loki), traces (Micrometer Tracing/Tempo)
  • Micrometer Tracing puts traceId and spanId in MDC automatically — logs and traces are correlated with no code changes
  • Custom spans with Tracer.nextSpan() add business context to distributed traces
  • Feign clients, Kafka, and database calls are automatically instrumented as child spans
  • Grafana unifies all three signals — click from a metric alert to correlated logs and traces
  • Alert on error rate, p99 latency, and circuit breaker state — not just “service is down”

This completes Part 10: Containers and Cloud. You now have everything needed to deploy, operate, and observe Spring Boot in production.


Next: Part 11 — Spring Boot 4 and Modern Java starts with Article 55: What’s New in Spring Boot 4.0.