Spring Kafka Production Checklist and Best Practices
Before You Ship
This is the checklist distilled from everything in this series. Work through it before your first production deployment. Each item links to the article where it’s covered in depth.
Producer Checklist
Durability
# Never lose data on leader failure
spring.kafka.producer.acks=all
# At least 2 brokers must acknowledge every write; note that min.insync.replicas
# is a topic/broker-level setting, so set it on the topic (see the Topic
# Configuration Checklist below), not in producer config
# Prevent duplicates from producer retries (required for transactions)
spring.kafka.producer.properties.enable.idempotence=true
Do: Set acks=all and min.insync.replicas=2 for any topic that carries business data. These two settings together mean data is written to at least 2 brokers before the producer considers it committed.
Don’t: Use acks=0 or acks=1 for orders, payments, or inventory. The 10–30% throughput gain is not worth the risk of data loss on broker failure.
Retries
# Maximum time to retry (includes all backoff)
spring.kafka.producer.properties.delivery.timeout.ms=120000
# Initial retry backoff
spring.kafka.producer.properties.retry.backoff.ms=100
# Max time to wait for a single request
spring.kafka.producer.properties.request.timeout.ms=30000
Set delivery.timeout.ms to match your SLA. If you need the produce call to succeed within 5 seconds, set it to 5000, and lower request.timeout.ms as well, because Kafka rejects a delivery.timeout.ms smaller than linger.ms + request.timeout.ms.
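For example, a sketch of producer timeouts for a hypothetical 5-second SLA (values are illustrative, not recommendations):
// delivery.timeout.ms must be >= linger.ms + request.timeout.ms, so the request
// timeout comes down along with the delivery timeout
props.put(ProducerConfig.LINGER_MS_CONFIG, 10);
props.put(ProducerConfig.REQUEST_TIMEOUT_MS_CONFIG, 4000);
props.put(ProducerConfig.DELIVERY_TIMEOUT_MS_CONFIG, 5000);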
Batching and Throughput
# Wait up to 10ms to fill a batch (reduces requests, increases throughput)
spring.kafka.producer.properties.linger.ms=10
# Batch size in bytes (32KB is a good starting point)
spring.kafka.producer.properties.batch.size=32768
# Enable LZ4 compression (good balance of speed and ratio)
spring.kafka.producer.properties.compression.type=lz4
Compression reduces network and storage usage. lz4 is the best default — fast and effective. For maximum compression at the cost of CPU, use zstd.
Consumer Checklist
Offset Management
# Never use auto-commit in production
spring.kafka.consumer.enable-auto-commit=false
spring.kafka.listener.ack-mode=batch
BATCH is the right default: commit after all records in a poll batch are processed successfully. Use MANUAL when you need per-record control or selective nack.
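When per-record control is needed, a MANUAL-ack listener looks roughly like this sketch (topic, group, and method names are illustrative):
// Requires spring.kafka.listener.ack-mode=manual (or manual_immediate)
@KafkaListener(topics = "orders-placed", groupId = "inventory-service")
public void onOrderPlaced(OrderPlacedEvent event, Acknowledgment ack) {
    reserveStock(event);  // business logic; an exception here skips the acknowledge
    ack.acknowledge();    // commit the offset only after successful processing
}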
Concurrency
Set concurrency equal to the number of partitions the consumer expects to own. If the orders topic has 6 partitions and you run 2 instances, set concurrency=3 per instance.
factory.setConcurrency(3);
Isolation Level (Transactional Topics)
# Read only committed records from transactional producers
spring.kafka.consumer.properties.isolation.level=read_committed
Always set read_committed when consuming from topics written by transactional producers.
Poll Timeout and Max Records
# Max allowed gap between poll() calls before the consumer is considered failed
spring.kafka.consumer.properties.max.poll.interval.ms=300000
# Max records per poll (tune based on processing time)
spring.kafka.consumer.properties.max.poll.records=500
max.poll.interval.ms must be longer than the time to process a full batch. If processing 500 records takes 30 seconds, set it to at least 60 seconds. A poll interval timeout causes the consumer to leave the group and triggers a rebalance.
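If a batch cannot be processed within the interval, shrink the batch, extend the interval, or both; a sketch with illustrative values:
// Example: a full 500-record batch that can take ~4 minutes to process needs
// either fewer records per poll or a longer allowed gap between polls
props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, 100);
props.put(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG, 600_000);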
Error Handling Checklist
Required Configuration
@Bean
public DefaultErrorHandler errorHandler(KafkaTemplate<String, Object> kafkaTemplate) {
    DeadLetterPublishingRecoverer recoverer =
            new DeadLetterPublishingRecoverer(kafkaTemplate);

    ExponentialBackOff backOff = new ExponentialBackOff(1000L, 2.0);
    backOff.setMaxElapsedTime(30_000L);

    DefaultErrorHandler handler = new DefaultErrorHandler(recoverer, backOff);
    handler.addNotRetryableExceptions(
            // Add your permanent failure exceptions
            IllegalArgumentException.class
    );
    return handler;
}
Every production listener needs:
- An error handler with backoff (not the default “retry 10 times immediately”)
- A recoverer — at minimum a logging recoverer; ideally DeadLetterPublishingRecoverer
- Non-retryable exception classification
- ErrorHandlingDeserializer wrapping your value deserializer
Dead Letter Topics
- DLT topics created explicitly with longer retention (7–30 days)
- A DLT consumer that logs, alerts, or auto-reprocesses failed records (a sketch follows this list)
- Partition count on DLT matches source topic
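A minimal sketch of such a DLT consumer, assuming the default DeadLetterPublishingRecoverer naming of <topic>.DLT and a byte[] value deserializer so that unparseable payloads can still be inspected (topic and group names are illustrative):
@KafkaListener(topics = "orders-placed.DLT", groupId = "orders-dlt-monitor")
public void onDeadLetter(ConsumerRecord<String, byte[]> record) {
    // Log and alert; a real handler might also persist the record for later reprocessing
    log.error("Dead-lettered record from {}-{} at offset {}",
            record.topic(), record.partition(), record.offset());
}
Consuming the DLT as raw bytes avoids a second deserialization failure on records that were dead-lettered for exactly that reason.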
Deserialization
props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, ErrorHandlingDeserializer.class);
props.put(ErrorHandlingDeserializer.VALUE_DESERIALIZER_CLASS, JsonDeserializer.class.getName());
- ErrorHandlingDeserializer wraps every deserializer
- TRUSTED_PACKAGES set explicitly (not * in production unless all packages are internal)
- Type mapping configured so full class names are not on the wire
Serialization Checklist
JSON (Simple Projects)
- TYPE_MAPPINGS configured on both producer and consumer — do not put full class names on the wire across services
- TRUSTED_PACKAGES restricted to your event packages
Avro (Multi-Team or Schema Evolution)
- Schema Registry URL in producer and consumer config (see the sketch after this checklist)
- Schema compatibility mode set to BACKWARD or FULL
- New fields always have defaults
- Schema compatibility checked in CI before merging
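A sketch of the producer side, assuming Confluent's Avro serializer and Schema Registry (the URL is illustrative):
// Requires the io.confluent:kafka-avro-serializer dependency
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, KafkaAvroSerializer.class);
props.put("schema.registry.url", "http://schema-registry:8081");
// Compatibility mode (BACKWARD/FULL) is set on the registry subject, not in producer config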
Topic Configuration Checklist
| Setting | Recommended | Reason |
|---|---|---|
| Partitions | ≥ max expected concurrent consumers | One partition per consumer thread |
| Replicas | 3 (or 2 for dev) | Survive one broker failure |
| min.insync.replicas | 2 | Data committed to ≥ 2 brokers |
| retention.ms | 7 days (production), 30 days (DLT) | Replay window |
| cleanup.policy | delete for events, compact for KTable | Compaction for changelog topics |
| compression.type | lz4 or zstd | Reduce storage and network |
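These settings can be declared in code with a NewTopic bean, which the auto-configured KafkaAdmin creates at application startup if the topic does not already exist (names and values below are illustrative):
@Bean
public NewTopic ordersTopic() {
    return TopicBuilder.name("orders-placed")
            .partitions(6)
            .replicas(3)
            .config(TopicConfig.MIN_IN_SYNC_REPLICAS_CONFIG, "2")    // min.insync.replicas
            .config(TopicConfig.RETENTION_MS_CONFIG, "604800000")    // 7 days
            .config(TopicConfig.COMPRESSION_TYPE_CONFIG, "lz4")
            .build();
}
Declaring topics as beans also keeps the expected partition count documented next to the code that depends on it.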
Monitoring Checklist
- Consumer lag exposed via Micrometer (kafka.consumer.fetch.manager.records.lag)
- Alert: lag > threshold for > 5 minutes → PagerDuty/Slack
- Alert: producer error rate > 0 for > 1 minute
- Business metrics: records processed per second, processing latency (p99)
- Spring Boot Actuator Kafka health check in liveness/readiness probe (a custom indicator is sketched below)
- Grafana dashboard: lag per partition, throughput, error rate, rebalance frequency
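Spring Boot does not ship a Kafka health indicator out of the box, so wiring one yourself is common; a minimal sketch using KafkaAdmin (timeout and detail names are illustrative):
@Component
public class KafkaHealthIndicator implements HealthIndicator {

    private final KafkaAdmin admin;

    public KafkaHealthIndicator(KafkaAdmin admin) {
        this.admin = admin;
    }

    @Override
    public Health health() {
        // Ping the cluster; an unreachable broker marks the application as DOWN
        try (AdminClient client = AdminClient.create(admin.getConfigurationProperties())) {
            String clusterId = client.describeCluster().clusterId().get(5, TimeUnit.SECONDS);
            return Health.up().withDetail("clusterId", clusterId).build();
        } catch (Exception ex) {
            return Health.down(ex).build();
        }
    }
}
Creating an AdminClient per probe keeps the sketch short; a production version would reuse a single client.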
Security Checklist
For production clusters:
# TLS for data in transit
spring.kafka.security.protocol=SASL_SSL
spring.kafka.ssl.trust-store-location=classpath:kafka.truststore.jks
spring.kafka.ssl.trust-store-password=${KAFKA_TRUSTSTORE_PASSWORD}
# SASL authentication
spring.kafka.properties.sasl.mechanism=PLAIN
spring.kafka.properties.sasl.jaas.config=\
org.apache.kafka.common.security.plain.PlainLoginModule required \
username="${KAFKA_USERNAME}" \
password="${KAFKA_PASSWORD}";
- TLS enabled for all broker connections
- SASL authentication configured
- ACLs: each service can only produce to its output topics and consume from its input topics
- Credentials stored in secrets manager (not hardcoded in application.properties)
Architectural Patterns
Topic Naming Convention
{domain}-{event-type} # orders-placed, payments-confirmed
{service}-{entity}-{action} # inventory-stock-reserved
{environment}.{domain}.events # prod.orders.events
Pick one and enforce it — naming drift is painful to fix later.
Event Schema Contract
- Events are append-only — never delete or modify published events
- Fields can be added with defaults; never removed or renamed in backward-compatible schemas
- Use a shared event library or Schema Registry to enforce contracts between services
- Version the schema, not the topic — avoid orders-v2 topic proliferation
Idempotent Consumers
Every consumer should be idempotent — processing the same record twice must produce the same result:
public void reserveStock(OrderPlacedEvent event) {
    // Use orderId as idempotency key — second call is a no-op
    if (reservationRepository.existsByOrderId(event.getOrderId())) {
        log.info("Stock already reserved for order: {}", event.getOrderId());
        return;
    }
    reservationRepository.save(new StockReservation(event.getOrderId(), event.getItems()));
}
Consumer Group Design
flowchart TD
T["orders topic\n(6 partitions)"]
subgraph InventorySvc["inventory-service group"]
I1["instance 1\nP0, P1, P2"]
I2["instance 2\nP3, P4, P5"]
end
subgraph NotifySvc["notification-service group"]
N1["instance 1\nP0-P5"]
end
subgraph AnalyticsSvc["analytics-service group"]
A1["instance 1\nP0-P5"]
end
T --> InventorySvc
T --> NotifySvc
T --> AnalyticsSvc
Each service has its own consumer group — they all read from the same topic independently, each maintaining its own offset.
Startup Readiness
@Component
public class KafkaReadinessCheck implements ApplicationRunner {

    @Autowired
    KafkaAdmin admin;

    @Override
    public void run(ApplicationArguments args) throws Exception {
        // Verify required topics exist; an exception here aborts startup, so a
        // missing topic or an unreachable broker fails fast
        try (AdminClient client = AdminClient.create(admin.getConfigurationProperties())) {
            client.describeTopics(List.of("orders-placed"))   // topics this service needs (example name)
                  .all()
                  .get(10, TimeUnit.SECONDS);
        }
    }
}
Applications should verify their required topics exist at startup and fail fast if they don’t. Silent startup followed by mysterious consumer failures is worse than a clear error message.
The 10 Things Most Teams Get Wrong
1. Auto-commit on — set enable.auto.commit=false and use Spring Kafka's AckMode
2. No error handler — the default is 10 immediate retries with no backoff; configure DefaultErrorHandler with backoff
3. No DLT — failed records disappear silently; always use DeadLetterPublishingRecoverer
4. No ErrorHandlingDeserializer — one corrupt message blocks the partition forever
5. Full class names on the wire — breaks when producer and consumer are in different services; use type mappings
6. TRUSTED_PACKAGES=* — a security risk; restrict to your event packages
7. Concurrency > partition count — extra threads receive no partitions; size concurrency correctly
8. No consumer lag alert — the first sign of trouble is users complaining, not an alert
9. No idempotent consumers — at-least-once delivery guarantees redelivery; your processing must handle it
10. max.poll.interval.ms too short — slow processing causes the consumer to leave the group and trigger rebalances
Summary: The Production-Ready Configuration
// Producer
props.put(ProducerConfig.ACKS_CONFIG, "all");
props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, true);
props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "lz4");
props.put(ProducerConfig.LINGER_MS_CONFIG, 10);
props.put(JsonSerializer.TYPE_MAPPINGS, "orderPlaced:com.example.events.OrderPlacedEvent");
// Consumer
props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, false);
props.put(ConsumerConfig.ISOLATION_LEVEL_CONFIG, "read_committed");
props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, 500);
props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, ErrorHandlingDeserializer.class);
props.put(ErrorHandlingDeserializer.VALUE_DESERIALIZER_CLASS, JsonDeserializer.class.getName());
props.put(JsonDeserializer.TRUSTED_PACKAGES, "com.example.events");
props.put(JsonDeserializer.TYPE_MAPPINGS, "orderPlaced:com.example.inventory.events.OrderPlacedEvent");
// Factory
factory.setConcurrency(3);
factory.getContainerProperties().setAckMode(ContainerProperties.AckMode.BATCH);
factory.setCommonErrorHandler(new DefaultErrorHandler(
        new DeadLetterPublishingRecoverer(kafkaTemplate),
        new ExponentialBackOff(1000L, 2.0)
));
This completes the Spring Kafka Tutorial series. You now have everything needed to build production-grade Kafka applications with Spring Boot — from first principles through exactly-once transactions, Kafka Streams, and production monitoring.