Spring Kafka Production Checklist and Best Practices

Before You Ship

This is the checklist distilled from everything in this series. Work through it before your first production deployment. Each item links to the article where it’s covered in depth.


Producer Checklist

Durability

# Never lose data on leader failure
spring.kafka.producer.acks=all

# At least 2 brokers must acknowledge every write.
# Note: min.insync.replicas is a topic/broker setting, not a producer
# property — configure it on the topic itself (see the Topic Configuration
# checklist below); a producer-side entry is ignored.

# Prevents duplicate writes caused by producer retries
# (required for transactions)
spring.kafka.producer.properties.enable.idempotence=true

Do: Set acks=all on the producer and min.insync.replicas=2 on the topic for any topic that carries business data. Together these guarantee a write lands on at least 2 brokers before the producer considers it committed.

Don’t: Use acks=0 or acks=1 for orders, payments, or inventory. The 10–30% throughput gain is not worth the risk of data loss on broker failure.

Retries

# Maximum time to retry (includes all backoff)
spring.kafka.producer.properties.delivery.timeout.ms=120000

# Initial retry backoff
spring.kafka.producer.properties.retry.backoff.ms=100

# Max time to wait for a single request
spring.kafka.producer.properties.request.timeout.ms=30000

Set delivery.timeout.ms to match your SLA, keeping in mind that Kafka requires delivery.timeout.ms ≥ linger.ms + request.timeout.ms. If the produce call must resolve within 5 seconds, set it to 5000 — and lower request.timeout.ms below that, or the producer will reject the configuration at startup.
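To make the constraint concrete, here is a minimal standalone sketch of the check the producer itself performs at startup (class and method names are mine):

```java
// Sketch: Kafka rejects configs where
// delivery.timeout.ms < linger.ms + request.timeout.ms
public class ProducerTimeoutBudget {

    static boolean isValid(long deliveryTimeoutMs, long requestTimeoutMs, long lingerMs) {
        return deliveryTimeoutMs >= lingerMs + requestTimeoutMs;
    }

    public static void main(String[] args) {
        // The defaults above: 120000 >= 10 + 30000 -> valid
        System.out.println(isValid(120_000, 30_000, 10)); // true
        // A 5s SLA with the default 30s request timeout is rejected:
        System.out.println(isValid(5_000, 30_000, 10));   // false
    }
}
```

If you shrink the delivery timeout for a tight SLA, shrink the request timeout with it.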

Batching and Throughput

# Wait up to 10ms to fill a batch (reduces requests, increases throughput)
spring.kafka.producer.properties.linger.ms=10

# Batch size in bytes (32KB is a good starting point)
spring.kafka.producer.properties.batch.size=32768

# Enable LZ4 compression (good balance of speed and ratio)
spring.kafka.producer.properties.compression.type=lz4

Compression reduces network and storage usage. lz4 is the best default — fast and effective. For maximum compression at the cost of CPU, use zstd.


Consumer Checklist

Offset Management

# Never use auto-commit in production
spring.kafka.consumer.enable-auto-commit=false
spring.kafka.listener.ack-mode=batch

BATCH is the right default: commit after all records in a poll batch are processed successfully. Use MANUAL when you need per-record control or selective nack.
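For MANUAL mode, a per-record acknowledgment listener might look like this sketch — the topic, group id, and reserveStock are illustrative names, not part of any API:

```java
// Assumes spring.kafka.listener.ack-mode=manual; names are illustrative
@KafkaListener(topics = "orders-placed", groupId = "inventory-service")
public void onOrderPlaced(ConsumerRecord<String, OrderPlacedEvent> record,
        Acknowledgment ack) {
    reserveStock(record.value()); // hypothetical business method
    ack.acknowledge();            // mark this record's offset for commit
}
```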

Concurrency

Set concurrency equal to the number of partitions the consumer expects to own. If the orders topic has 6 partitions and you run 2 instances, set concurrency=3 per instance.

factory.setConcurrency(3);
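The sizing rule generalizes: round partitions up by instance count so every partition gets a thread and as few threads as possible sit idle. A standalone sketch (names are mine):

```java
// Sketch: per-instance concurrency so total threads cover all partitions
public class ConcurrencySizing {

    static int perInstanceConcurrency(int partitions, int instances) {
        // Ceiling division: threads beyond the partition count would sit idle,
        // but every partition must have an owner.
        return (partitions + instances - 1) / instances;
    }

    public static void main(String[] args) {
        // 6 partitions, 2 instances -> concurrency=3 per instance
        System.out.println(perInstanceConcurrency(6, 2)); // 3
    }
}
```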

Isolation Level (Transactional Topics)

# Read only committed records from transactional producers
spring.kafka.consumer.properties.isolation.level=read_committed

Always set read_committed when consuming from topics written by transactional producers.

Poll Timeout and Max Records

# Max time between two poll() calls before the consumer is considered dead
spring.kafka.consumer.properties.max.poll.interval.ms=300000

# Max records per poll (tune based on processing time)
spring.kafka.consumer.properties.max.poll.records=500

max.poll.interval.ms must be longer than the time to process a full batch. If processing 500 records takes 30 seconds, set it to at least 60 seconds. A poll interval timeout causes the consumer to leave the group and triggers a rebalance.
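The same sizing rule as a small standalone sketch — the 2x safety factor is a suggestion, not a Kafka constant:

```java
// Sketch: size max.poll.interval.ms from batch size and per-record latency
public class PollIntervalSizing {

    static long requiredPollIntervalMs(int maxPollRecords, long perRecordMs, double safetyFactor) {
        // Worst case: a full batch of slow records, plus headroom
        return (long) (maxPollRecords * perRecordMs * safetyFactor);
    }

    public static void main(String[] args) {
        // 500 records at 60ms each = 30s; with a 2x margin -> 60s
        System.out.println(requiredPollIntervalMs(500, 60, 2.0)); // 60000
    }
}
```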


Error Handling Checklist

Required Configuration

@Bean
public DefaultErrorHandler errorHandler(KafkaTemplate<String, Object> kafkaTemplate) {
    DeadLetterPublishingRecoverer recoverer =
        new DeadLetterPublishingRecoverer(kafkaTemplate);

    ExponentialBackOff backOff = new ExponentialBackOff(1000L, 2.0);
    backOff.setMaxElapsedTime(30_000L);

    DefaultErrorHandler handler = new DefaultErrorHandler(recoverer, backOff);

    handler.addNotRetryableExceptions(
        // Add your permanent failure exceptions
        IllegalArgumentException.class
    );

    return handler;
}

Every production listener needs:

  • An error handler with backoff (not the default “retry 10 times immediately”)
  • A recoverer — at minimum a logging recoverer; ideally DeadLetterPublishingRecoverer
  • Non-retryable exception classification
  • ErrorHandlingDeserializer wrapping your value deserializer

Dead Letter Topics

  • DLT topics created explicitly with longer retention (7–30 days)
  • A DLT consumer that logs, alerts, or auto-reprocesses failed records
  • Partition count on DLT matches source topic
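A minimal DLT consumer sketch along these lines — the topic, group id, and alerting action are assumptions (DeadLetterPublishingRecoverer publishes to the source topic name plus a .DLT suffix by default):

```java
@KafkaListener(topics = "orders-placed.DLT", groupId = "orders-dlt-monitor")
public void onDeadLetter(ConsumerRecord<String, byte[]> record,
        @Header(KafkaHeaders.DLT_EXCEPTION_MESSAGE) String error) {
    // Assumes an SLF4J logger named log
    log.error("Dead-lettered record {}-{}@{}: {}",
            record.topic(), record.partition(), record.offset(), error);
    // alert (PagerDuty/Slack) or persist for manual reprocessing here
}
```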

Deserialization

props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, ErrorHandlingDeserializer.class);
props.put(ErrorHandlingDeserializer.VALUE_DESERIALIZER_CLASS, JsonDeserializer.class.getName());

  • ErrorHandlingDeserializer wraps every deserializer
  • TRUSTED_PACKAGES set explicitly (not * in production unless all packages are internal)
  • Type mapping configured so full class names are not on the wire

Serialization Checklist

JSON (Simple Projects)

  • TYPE_MAPPINGS configured on both producer and consumer — do not put full class names on the wire across services
  • TRUSTED_PACKAGES restricted to your event packages

Avro (Multi-Team or Schema Evolution)

  • Schema Registry URL in producer and consumer config
  • Schema compatibility mode set to BACKWARD or FULL
  • New fields always have defaults
  • Schema compatibility checked in CI before merging

Topic Configuration Checklist

Setting             | Recommended                            | Reason
--------------------|----------------------------------------|--------------------------------
Partitions          | ≥ max expected concurrent consumers    | One partition per consumer thread
Replicas            | 3 (or 2 for dev)                       | Survive one broker failure
min.insync.replicas | 2                                      | Data committed to ≥ 2 brokers
retention.ms        | 7 days (production), 30 days (DLT)     | Replay window
cleanup.policy      | delete for events, compact for KTable  | Compaction for changelog topics
compression.type    | lz4 or zstd                            | Reduce storage and network
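These settings can be declared in code with Spring's TopicBuilder so the topic is created (or verified) at startup — the name and sizes here mirror the table and are examples:

```java
@Bean
public NewTopic ordersTopic() {
    return TopicBuilder.name("orders-placed")
            .partitions(6)
            .replicas(3)
            .config(TopicConfig.MIN_IN_SYNC_REPLICAS_CONFIG, "2")
            .config(TopicConfig.RETENTION_MS_CONFIG,
                    String.valueOf(7L * 24 * 60 * 60 * 1000)) // 7 days
            .config(TopicConfig.COMPRESSION_TYPE_CONFIG, "lz4")
            .build();
}
```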

Monitoring Checklist

  • Consumer lag exposed via Micrometer (kafka.consumer.fetch.manager.records.lag)
  • Alert: lag > threshold for > 5 minutes → PagerDuty/Slack
  • Alert: producer error rate > 0 for > 1 minute
  • Business metrics: records processed per second, processing latency (p99)
  • Spring Boot Actuator Kafka health check in liveness/readiness probe
  • Grafana dashboard: lag per partition, throughput, error rate, rebalance frequency
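The "lag over threshold for over 5 minutes" rule can be sketched as a small stateful check, independent of the alerting backend (all names are mine):

```java
import java.time.Duration;
import java.time.Instant;

// Sketch: fire only when lag stays above the threshold for a full window,
// so short spikes don't page anyone
public class LagAlert {
    private final long threshold;
    private final Duration window;
    private Instant breachedSince;

    LagAlert(long threshold, Duration window) {
        this.threshold = threshold;
        this.window = window;
    }

    /** Returns true when lag has exceeded the threshold for the whole window. */
    boolean observe(long lag, Instant now) {
        if (lag <= threshold) {
            breachedSince = null; // recovered — reset the clock
            return false;
        }
        if (breachedSince == null) breachedSince = now;
        return Duration.between(breachedSince, now).compareTo(window) >= 0;
    }

    public static void main(String[] args) {
        LagAlert alert = new LagAlert(1000, Duration.ofMinutes(5));
        Instant t0 = Instant.EPOCH;
        System.out.println(alert.observe(5000, t0));                  // false — just breached
        System.out.println(alert.observe(5000, t0.plusSeconds(300))); // true — 5 min elapsed
    }
}
```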

Security Checklist

For production clusters:

# TLS for data in transit
spring.kafka.security.protocol=SASL_SSL
spring.kafka.ssl.trust-store-location=classpath:kafka.truststore.jks
spring.kafka.ssl.trust-store-password=${KAFKA_TRUSTSTORE_PASSWORD}

# SASL authentication
spring.kafka.properties.sasl.mechanism=PLAIN
spring.kafka.properties.sasl.jaas.config=\
  org.apache.kafka.common.security.plain.PlainLoginModule required \
  username="${KAFKA_USERNAME}" \
  password="${KAFKA_PASSWORD}";

  • TLS enabled for all broker connections
  • SASL authentication configured
  • ACLs: each service can only produce to its output topics and consume from its input topics
  • Credentials stored in secrets manager (not hardcoded in application.properties)

Architectural Patterns

Topic Naming Convention

{domain}-{event-type}         # orders-placed, payments-confirmed
{service}-{entity}-{action}   # inventory-stock-reserved
{environment}.{domain}.events # prod.orders.events

Pick one and enforce it — naming drift is painful to fix later.
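One way to enforce a convention is a CI check; here is a sketch for the first pattern above ({domain}-{event-type}) — adjust the regex if you pick one of the others:

```java
import java.util.regex.Pattern;

// Sketch: validate topic names against {domain}-{event-type}
public class TopicNamePolicy {
    // lowercase words separated by single dashes, at least two segments
    private static final Pattern NAME = Pattern.compile("^[a-z]+(-[a-z]+)+$");

    static boolean isValid(String topic) {
        return NAME.matcher(topic).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("orders-placed")); // true
        System.out.println(isValid("OrdersPlaced"));  // false
    }
}
```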

Event Schema Contract

  • Events are append-only — never delete or modify published events
  • Fields can be added with defaults; never removed or renamed in backward-compatible schemas
  • Use a shared event library or Schema Registry to enforce contracts between services
  • Version the schema, not the topic — avoid orders-v2 topic proliferation

Idempotent Consumers

Every consumer should be idempotent — processing the same record twice must produce the same result:

public void reserveStock(OrderPlacedEvent event) {
    // Use orderId as idempotency key — second call is a no-op.
    // Back this with a unique constraint on orderId so two concurrent
    // consumers can't both pass the existence check.
    if (reservationRepository.existsByOrderId(event.getOrderId())) {
        log.info("Stock already reserved for order: {}", event.getOrderId());
        return;
    }
    reservationRepository.save(new StockReservation(event.getOrderId(), event.getItems()));
}

Consumer Group Design

flowchart TD
    T["orders topic\n(6 partitions)"]
    subgraph InventorySvc["inventory-service group"]
        I1["instance 1\nP0, P1, P2"]
        I2["instance 2\nP3, P4, P5"]
    end
    subgraph NotifySvc["notification-service group"]
        N1["instance 1\nP0-P5"]
    end
    subgraph AnalyticsSvc["analytics-service group"]
        A1["instance 1\nP0-P5"]
    end

    T --> InventorySvc
    T --> NotifySvc
    T --> AnalyticsSvc

Each service has its own consumer group — they all read from the same topic independently, each maintaining its own offset.

Startup Readiness

@Component
public class KafkaReadinessCheck implements ApplicationRunner {

    private final KafkaAdmin admin;

    public KafkaReadinessCheck(KafkaAdmin admin) {
        this.admin = admin;
    }

    @Override
    public void run(ApplicationArguments args) throws Exception {
        // Verify required topics exist; fail fast if the broker is unreachable
        try (AdminClient client = AdminClient.create(admin.getConfigurationProperties())) {
            client.describeTopics(List.of("orders-placed"))
                  .allTopicNames()
                  .get(10, TimeUnit.SECONDS); // throws if a topic is missing
        }
    }
}

Applications should verify their required topics exist at startup and fail fast if they don’t. Silent startup followed by mysterious consumer failures is worse than a clear error message.


The 10 Things Most Teams Get Wrong

  1. Auto-commit on — set enable.auto.commit=false and use Spring Kafka’s AckMode
  2. No error handler tuning — the default retries 10 times with no backoff, then logs the record and moves on; configure DefaultErrorHandler with backoff and a recoverer
  3. No DLT — failed records disappear silently; always use DeadLetterPublishingRecoverer
  4. No ErrorHandlingDeserializer — one corrupt message blocks the partition forever
  5. Full class names on the wire — breaks when producer and consumer are in different services; use type mappings
  6. TRUSTED_PACKAGES=* — a security risk; restrict to your event packages
  7. Concurrency > partition count — extra threads receive no partitions; size concurrency correctly
  8. No consumer lag alert — first sign of trouble is users complaining, not an alert
  9. No idempotent consumers — at-least-once delivery guarantees redelivery; your processing must handle it
  10. max.poll.interval.ms too short — slow processing causes the consumer to leave the group and trigger rebalances

Summary: The Production-Ready Configuration

// Producer
props.put(ProducerConfig.ACKS_CONFIG, "all");
props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, true);
props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "lz4");
props.put(ProducerConfig.LINGER_MS_CONFIG, 10);
props.put(JsonSerializer.TYPE_MAPPINGS, "orderPlaced:com.example.events.OrderPlacedEvent");

// Consumer
props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, false);
props.put(ConsumerConfig.ISOLATION_LEVEL_CONFIG, "read_committed");
props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, 500);
props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, ErrorHandlingDeserializer.class);
props.put(ErrorHandlingDeserializer.VALUE_DESERIALIZER_CLASS, JsonDeserializer.class.getName());
props.put(JsonDeserializer.TRUSTED_PACKAGES, "com.example.events");
props.put(JsonDeserializer.TYPE_MAPPINGS, "orderPlaced:com.example.inventory.events.OrderPlacedEvent");

// Factory
factory.setConcurrency(3);
factory.getContainerProperties().setAckMode(ContainerProperties.AckMode.BATCH);
ExponentialBackOff backOff = new ExponentialBackOff(1000L, 2.0);
backOff.setMaxElapsedTime(30_000L); // without a cap, retries never end and nothing reaches the DLT
factory.setCommonErrorHandler(new DefaultErrorHandler(
    new DeadLetterPublishingRecoverer(kafkaTemplate), backOff));

This completes the Spring Kafka Tutorial series. You now have everything needed to build production-grade Kafka applications with Spring Boot — from first principles through exactly-once transactions, Kafka Streams, and production monitoring.