Spring Kafka Production Checklist and Best Practices

Before You Ship

This is the checklist distilled from everything in this series. Work through it before your first production deployment. Each item links to the article where it’s covered in depth.


Producer Checklist

Durability

# Never lose data on leader failure
spring.kafka.producer.acks=all

# At least 2 brokers must acknowledge every write.
# Note: min.insync.replicas is a topic/broker setting, not a producer
# property — configure it on the topic itself (see the Topic Configuration
# checklist below); a producer-side entry is ignored.

# Prevents duplicate writes caused by producer retries
# (required for transactions)
spring.kafka.producer.properties.enable.idempotence=true

Do: Set acks=all on the producer and min.insync.replicas=2 on the topic for any topic that carries business data. Together these guarantee a write lands on at least 2 brokers before the producer considers it committed.

Don’t: Use acks=0 or acks=1 for orders, payments, or inventory. The 10–30% throughput gain is not worth the risk of data loss on broker failure.

Retries

# Maximum time to retry (includes all backoff)
spring.kafka.producer.properties.delivery.timeout.ms=120000

# Initial retry backoff
spring.kafka.producer.properties.retry.backoff.ms=100

# Max time to wait for a single request
spring.kafka.producer.properties.request.timeout.ms=30000

Set delivery.timeout.ms to match your SLA, keeping in mind that Kafka requires delivery.timeout.ms ≥ linger.ms + request.timeout.ms. If the produce call must resolve within 5 seconds, set it to 5000 — and lower request.timeout.ms below that, or the producer will reject the configuration at startup.
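To make the constraint concrete, here is a minimal standalone sketch of the check the producer itself performs at startup (class and method names are mine):

```java
// Sketch: Kafka rejects configs where
// delivery.timeout.ms < linger.ms + request.timeout.ms
public class ProducerTimeoutBudget {

    static boolean isValid(long deliveryTimeoutMs, long requestTimeoutMs, long lingerMs) {
        return deliveryTimeoutMs >= lingerMs + requestTimeoutMs;
    }

    public static void main(String[] args) {
        // The defaults above: 120000 >= 10 + 30000 -> valid
        System.out.println(isValid(120_000, 30_000, 10)); // true
        // A 5s SLA with the default 30s request timeout is rejected:
        System.out.println(isValid(5_000, 30_000, 10));   // false
    }
}
```

If you shrink the delivery timeout for a tight SLA, shrink the request timeout with it.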

Batching and Throughput

# Wait up to 10ms to fill a batch (reduces requests, increases throughput)
spring.kafka.producer.properties.linger.ms=10

# Batch size in bytes (32KB is a good starting point)
spring.kafka.producer.properties.batch.size=32768

# Enable LZ4 compression (good balance of speed and ratio)
spring.kafka.producer.properties.compression.type=lz4

Compression reduces network and storage usage. lz4 is the best default — fast and effective. For maximum compression at the cost of CPU, use zstd.


Consumer Checklist

Offset Management

# Never use auto-commit in production
spring.kafka.consumer.enable-auto-commit=false
spring.kafka.listener.ack-mode=batch

BATCH is the right default: commit after all records in a poll batch are processed successfully. Use MANUAL when you need per-record control or selective nack.
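For MANUAL mode, a per-record acknowledgment listener might look like this sketch — the topic, group id, and reserveStock are illustrative names, not part of any API:

```java
// Assumes spring.kafka.listener.ack-mode=manual; names are illustrative
@KafkaListener(topics = "orders-placed", groupId = "inventory-service")
public void onOrderPlaced(ConsumerRecord<String, OrderPlacedEvent> record,
        Acknowledgment ack) {
    reserveStock(record.value()); // hypothetical business method
    ack.acknowledge();            // mark this record's offset for commit
}
```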

Concurrency

Set concurrency equal to the number of partitions the consumer expects to own. If the orders topic has 6 partitions and you run 2 instances, set concurrency=3 per instance.

factory.setConcurrency(3);
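The sizing rule generalizes: round partitions up by instance count so every partition gets a thread and as few threads as possible sit idle. A standalone sketch (names are mine):

```java
// Sketch: per-instance concurrency so total threads cover all partitions
public class ConcurrencySizing {

    static int perInstanceConcurrency(int partitions, int instances) {
        // Ceiling division: threads beyond the partition count would sit idle,
        // but every partition must have an owner.
        return (partitions + instances - 1) / instances;
    }

    public static void main(String[] args) {
        // 6 partitions, 2 instances -> concurrency=3 per instance
        System.out.println(perInstanceConcurrency(6, 2)); // 3
    }
}
```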

Isolation Level (Transactional Topics)

# Read only committed records from transactional producers
spring.kafka.consumer.properties.isolation.level=read_committed

Always set read_committed when consuming from topics written by transactional producers.

Poll Timeout and Max Records

# Max time between two poll() calls before the consumer is considered dead
spring.kafka.consumer.properties.max.poll.interval.ms=300000

# Max records per poll (tune based on processing time)
spring.kafka.consumer.properties.max.poll.records=500

max.poll.interval.ms must be longer than the time to process a full batch. If processing 500 records takes 30 seconds, set it to at least 60 seconds. A poll interval timeout causes the consumer to leave the group and triggers a rebalance.
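The same sizing rule as a small standalone sketch — the 2x safety factor is a suggestion, not a Kafka constant:

```java
// Sketch: size max.poll.interval.ms from batch size and per-record latency
public class PollIntervalSizing {

    static long requiredPollIntervalMs(int maxPollRecords, long perRecordMs, double safetyFactor) {
        // Worst case: a full batch of slow records, plus headroom
        return (long) (maxPollRecords * perRecordMs * safetyFactor);
    }

    public static void main(String[] args) {
        // 500 records at 60ms each = 30s; with a 2x margin -> 60s
        System.out.println(requiredPollIntervalMs(500, 60, 2.0)); // 60000
    }
}
```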


Error Handling Checklist

Required Configuration

@Bean
public DefaultErrorHandler errorHandler(KafkaTemplate<String, Object> kafkaTemplate) {
    DeadLetterPublishingRecoverer recoverer =
        new DeadLetterPublishingRecoverer(kafkaTemplate);

    ExponentialBackOff backOff = new ExponentialBackOff(1000L, 2.0);
    backOff.setMaxElapsedTime(30_000L);

    DefaultErrorHandler handler = new DefaultErrorHandler(recoverer, backOff);

    handler.addNotRetryableExceptions(
        // Add your permanent failure exceptions
        IllegalArgumentException.class
    );

    return handler;
}

Every production listener needs:

  • An error handler with backoff (not the default “retry 10 times immediately”)
  • A recoverer — at minimum a logging recoverer; ideally DeadLetterPublishingRecoverer
  • Non-retryable exception classification
  • ErrorHandlingDeserializer wrapping your value deserializer

Dead Letter Topics

  • DLT topics created explicitly with longer retention (7–30 days)
  • A DLT consumer that logs, alerts, or auto-reprocesses failed records
  • Partition count on DLT matches source topic
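A minimal DLT consumer sketch along these lines — the topic, group id, and alerting action are assumptions (DeadLetterPublishingRecoverer publishes to the source topic name plus a .DLT suffix by default):

```java
@KafkaListener(topics = "orders-placed.DLT", groupId = "orders-dlt-monitor")
public void onDeadLetter(ConsumerRecord<String, byte[]> record,
        @Header(KafkaHeaders.DLT_EXCEPTION_MESSAGE) String error) {
    // Assumes an SLF4J logger named log
    log.error("Dead-lettered record {}-{}@{}: {}",
            record.topic(), record.partition(), record.offset(), error);
    // alert (PagerDuty/Slack) or persist for manual reprocessing here
}
```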

Deserialization

props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, ErrorHandlingDeserializer.class);
props.put(ErrorHandlingDeserializer.VALUE_DESERIALIZER_CLASS, JsonDeserializer.class.getName());

  • ErrorHandlingDeserializer wraps every deserializer
  • TRUSTED_PACKAGES set explicitly (not * in production unless all packages are internal)
  • Type mapping configured so full class names are not on the wire

Serialization Checklist

JSON (Simple Projects)

  • TYPE_MAPPINGS configured on both producer and consumer — do not put full class names on the wire across services
  • TRUSTED_PACKAGES restricted to your event packages

Avro (Multi-Team or Schema Evolution)

  • Schema Registry URL in producer and consumer config
  • Schema compatibility mode set to BACKWARD or FULL
  • New fields always have defaults
  • Schema compatibility checked in CI before merging

Topic Configuration Checklist

Setting             | Recommended                            | Reason
--------------------|----------------------------------------|--------------------------------
Partitions          | ≥ max expected concurrent consumers    | One partition per consumer thread
Replicas            | 3 (or 2 for dev)                       | Survive one broker failure
min.insync.replicas | 2                                      | Data committed to ≥ 2 brokers
retention.ms        | 7 days (production), 30 days (DLT)     | Replay window
cleanup.policy      | delete for events, compact for KTable  | Compaction for changelog topics
compression.type    | lz4 or zstd                            | Reduce storage and network
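These settings can be declared in code with Spring's TopicBuilder so the topic is created (or verified) at startup — the name and sizes here mirror the table and are examples:

```java
@Bean
public NewTopic ordersTopic() {
    return TopicBuilder.name("orders-placed")
            .partitions(6)
            .replicas(3)
            .config(TopicConfig.MIN_IN_SYNC_REPLICAS_CONFIG, "2")
            .config(TopicConfig.RETENTION_MS_CONFIG,
                    String.valueOf(7L * 24 * 60 * 60 * 1000)) // 7 days
            .config(TopicConfig.COMPRESSION_TYPE_CONFIG, "lz4")
            .build();
}
```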

Monitoring Checklist

  • Consumer lag exposed via Micrometer (kafka.consumer.fetch.manager.records.lag)
  • Alert: lag > threshold for > 5 minutes → PagerDuty/Slack
  • Alert: producer error rate > 0 for > 1 minute
  • Business metrics: records processed per second, processing latency (p99)
  • Spring Boot Actuator Kafka health check in liveness/readiness probe
  • Grafana dashboard: lag per partition, throughput, error rate, rebalance frequency
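The "lag over threshold for over 5 minutes" rule can be sketched as a small stateful check, independent of the alerting backend (all names are mine):

```java
import java.time.Duration;
import java.time.Instant;

// Sketch: fire only when lag stays above the threshold for a full window,
// so short spikes don't page anyone
public class LagAlert {
    private final long threshold;
    private final Duration window;
    private Instant breachedSince;

    LagAlert(long threshold, Duration window) {
        this.threshold = threshold;
        this.window = window;
    }

    /** Returns true when lag has exceeded the threshold for the whole window. */
    boolean observe(long lag, Instant now) {
        if (lag <= threshold) {
            breachedSince = null; // recovered — reset the clock
            return false;
        }
        if (breachedSince == null) breachedSince = now;
        return Duration.between(breachedSince, now).compareTo(window) >= 0;
    }

    public static void main(String[] args) {
        LagAlert alert = new LagAlert(1000, Duration.ofMinutes(5));
        Instant t0 = Instant.EPOCH;
        System.out.println(alert.observe(5000, t0));                  // false — just breached
        System.out.println(alert.observe(5000, t0.plusSeconds(300))); // true — 5 min elapsed
    }
}
```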

Security Checklist

For production clusters:

# TLS for data in transit
spring.kafka.security.protocol=SASL_SSL
spring.kafka.ssl.trust-store-location=classpath:kafka.truststore.jks
spring.kafka.ssl.trust-store-password=${KAFKA_TRUSTSTORE_PASSWORD}

# SASL authentication
spring.kafka.properties.sasl.mechanism=PLAIN
spring.kafka.properties.sasl.jaas.config=\
  org.apache.kafka.common.security.plain.PlainLoginModule required \
  username="${KAFKA_USERNAME}" \
  password="${KAFKA_PASSWORD}";

  • TLS enabled for all broker connections
  • SASL authentication configured
  • ACLs: each service can only produce to its output topics and consume from its input topics
  • Credentials stored in secrets manager (not hardcoded in application.properties)

Architectural Patterns

Topic Naming Convention

{domain}-{event-type}         # orders-placed, payments-confirmed
{service}-{entity}-{action}   # inventory-stock-reserved
{environment}.{domain}.events # prod.orders.events

Pick one and enforce it — naming drift is painful to fix later.
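One way to enforce a convention is a CI check; here is a sketch for the first pattern above ({domain}-{event-type}) — adjust the regex if you pick one of the others:

```java
import java.util.regex.Pattern;

// Sketch: validate topic names against {domain}-{event-type}
public class TopicNamePolicy {
    // lowercase words separated by single dashes, at least two segments
    private static final Pattern NAME = Pattern.compile("^[a-z]+(-[a-z]+)+$");

    static boolean isValid(String topic) {
        return NAME.matcher(topic).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("orders-placed")); // true
        System.out.println(isValid("OrdersPlaced"));  // false
    }
}
```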

Event Schema Contract

  • Events are append-only — never delete or modify published events
  • Fields can be added with defaults; never removed or renamed in backward-compatible schemas
  • Use a shared event library or Schema Registry to enforce contracts between services
  • Version the schema, not the topic — avoid orders-v2 topic proliferation

Idempotent Consumers

Every consumer should be idempotent — processing the same record twice must produce the same result:

public void reserveStock(OrderPlacedEvent event) {
    // Use orderId as idempotency key — second call is a no-op.
    // Back this with a unique constraint on orderId so two concurrent
    // consumers can't both pass the existence check.
    if (reservationRepository.existsByOrderId(event.getOrderId())) {
        log.info("Stock already reserved for order: {}", event.getOrderId());
        return;
    }
    reservationRepository.save(new StockReservation(event.getOrderId(), event.getItems()));
}

Consumer Group Design

flowchart TD
    T["orders topic\n(6 partitions)"]
    subgraph InventorySvc["inventory-service group"]
        I1["instance 1\nP0, P1, P2"]
        I2["instance 2\nP3, P4, P5"]
    end
    subgraph NotifySvc["notification-service group"]
        N1["instance 1\nP0-P5"]
    end
    subgraph AnalyticsSvc["analytics-service group"]
        A1["instance 1\nP0-P5"]
    end

    T --> InventorySvc
    T --> NotifySvc
    T --> AnalyticsSvc

Each service has its own consumer group — they all read from the same topic independently, each maintaining its own offset.

Startup Readiness

@Component
public class KafkaReadinessCheck implements ApplicationRunner {

    private final KafkaAdmin admin;

    public KafkaReadinessCheck(KafkaAdmin admin) {
        this.admin = admin;
    }

    @Override
    public void run(ApplicationArguments args) throws Exception {
        // Verify required topics exist; fail fast if the broker is unreachable
        try (AdminClient client = AdminClient.create(admin.getConfigurationProperties())) {
            client.describeTopics(List.of("orders-placed"))
                  .allTopicNames()
                  .get(10, TimeUnit.SECONDS); // throws if a topic is missing
        }
    }
}

Applications should verify their required topics exist at startup and fail fast if they don’t. Silent startup followed by mysterious consumer failures is worse than a clear error message.


The 10 Things Most Teams Get Wrong

  1. Auto-commit on — set enable.auto.commit=false and use Spring Kafka’s AckMode
  2. No error handler tuning — the default retries 10 times with no backoff, then logs the record and moves on; configure DefaultErrorHandler with backoff and a recoverer
  3. No DLT — failed records disappear silently; always use DeadLetterPublishingRecoverer
  4. No ErrorHandlingDeserializer — one corrupt message blocks the partition forever
  5. Full class names on the wire — breaks when producer and consumer are in different services; use type mappings
  6. TRUSTED_PACKAGES=* — a security risk; restrict to your event packages
  7. Concurrency > partition count — extra threads receive no partitions; size concurrency correctly
  8. No consumer lag alert — first sign of trouble is users complaining, not an alert
  9. No idempotent consumers — at-least-once delivery guarantees redelivery; your processing must handle it
  10. max.poll.interval.ms too short — slow processing causes the consumer to leave the group and trigger rebalances

Summary: The Production-Ready Configuration

// Producer
props.put(ProducerConfig.ACKS_CONFIG, "all");
props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, true);
props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "lz4");
props.put(ProducerConfig.LINGER_MS_CONFIG, 10);
props.put(JsonSerializer.TYPE_MAPPINGS, "orderPlaced:com.example.events.OrderPlacedEvent");

// Consumer
props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, false);
props.put(ConsumerConfig.ISOLATION_LEVEL_CONFIG, "read_committed");
props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, 500);
props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, ErrorHandlingDeserializer.class);
props.put(ErrorHandlingDeserializer.VALUE_DESERIALIZER_CLASS, JsonDeserializer.class.getName());
props.put(JsonDeserializer.TRUSTED_PACKAGES, "com.example.events");
props.put(JsonDeserializer.TYPE_MAPPINGS, "orderPlaced:com.example.inventory.events.OrderPlacedEvent");

// Factory
factory.setConcurrency(3);
factory.getContainerProperties().setAckMode(ContainerProperties.AckMode.BATCH);
ExponentialBackOff backOff = new ExponentialBackOff(1000L, 2.0);
backOff.setMaxElapsedTime(30_000L); // without a cap, retries never end and nothing reaches the DLT
factory.setCommonErrorHandler(new DefaultErrorHandler(
    new DeadLetterPublishingRecoverer(kafkaTemplate), backOff));

This completes the Spring Kafka Tutorial series. You now have everything needed to build production-grade Kafka applications with Spring Boot — from first principles through exactly-once transactions, Kafka Streams, and production monitoring.