# Producer Retries: Backoff, Timeouts, and Retry Strategies
## Why Producers Need Retries
Network errors, leader elections, and broker restarts are normal events in a distributed system. Without retries, a transient broker hiccup causes permanent data loss from the producer’s perspective. With retries, the producer automatically re-sends failed records until either the broker accepts them or a timeout deadline is reached.
```mermaid
sequenceDiagram
    participant Producer
    participant Leader as Leader (Broker 1)
    participant NewLeader as New Leader (Broker 2)
    Producer->>Leader: ProduceRequest (offset 42)
    Note over Leader: Broker 1 crashes mid-write
    Leader--xProducer: No response (timeout)
    Note over Producer: retry.backoff.ms = 100ms wait
    Producer->>Producer: Wait 100ms
    Note over NewLeader: Broker 2 elected as new leader
    Producer->>NewLeader: ProduceRequest (retry 1)
    NewLeader-->>Producer: ProduceResponse ✓ (offset 42)
    Note over Producer: Retry succeeded transparently
```
## The Timeout Hierarchy
Before configuring retries, understand how Kafka’s three timeout settings interact:
```mermaid
flowchart LR
    subgraph DeliveryTimeout["delivery.timeout.ms (120,000ms default — outer deadline)"]
        subgraph RequestTimeout["request.timeout.ms (30,000ms — per-attempt deadline)"]
            Attempt["Send attempt 1"]
        end
        subgraph Backoff1["retry.backoff.ms (100ms)"]
            Wait1["Wait 100ms"]
        end
        subgraph Request2["request.timeout.ms"]
            Attempt2["Send attempt 2"]
        end
        subgraph Backoff2["retry.backoff.ms"]
            Wait2["Wait 100ms"]
        end
        subgraph Request3["request.timeout.ms"]
            Attempt3["Send attempt N"]
        end
        Attempt --> Wait1 --> Attempt2 --> Wait2 --> Attempt3
    end
    Expire["delivery.timeout.ms elapsed\n→ TimeoutException\nrecord dropped"]
    Attempt3 --> Expire
```
| Setting | Default | Controls |
|---|---|---|
| `delivery.timeout.ms` | 120,000 ms (2 min) | Total time a send can take, including all retries |
| `request.timeout.ms` | 30,000 ms (30 sec) | How long to wait for a single broker response |
| `retry.backoff.ms` | 100 ms | How long to wait between retry attempts |
| `retries` | Integer.MAX_VALUE | Max number of retry attempts |
`delivery.timeout.ms` is the outer deadline: when it expires, the record fails regardless of how many retries remain.
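To build intuition for how these settings interact, here is a back-of-envelope sketch of how many full send attempts fit inside the delivery window when every attempt times out. The `RetryBudget` class and its formula are illustrative, not a Kafka API:

```java
// Illustrative only: estimates how many full attempts fit in delivery.timeout.ms
// when every attempt hits request.timeout.ms, with retry.backoff.ms between them.
public class RetryBudget {

    static long maxFullAttempts(long deliveryTimeoutMs, long requestTimeoutMs, long backoffMs) {
        // n attempts cost n * requestTimeout + (n - 1) * backoff <= deliveryTimeout
        // => n <= (deliveryTimeout + backoff) / (requestTimeout + backoff)
        return (deliveryTimeoutMs + backoffMs) / (requestTimeoutMs + backoffMs);
    }

    public static void main(String[] args) {
        // With the defaults (120s window, 30s per attempt, 100ms backoff),
        // only 3 full attempts fit before the outer deadline expires.
        System.out.println(maxFullAttempts(120_000, 30_000, 100)); // prints 3
    }
}
```

This is why a setting like `retries=10` is rarely the limiting factor: with the default timeouts, the delivery window expires long before ten worst-case attempts complete.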
## Configuring Retries
### Approach 1: Simple Properties
```properties
# application.properties
spring.kafka.producer.retries=10
spring.kafka.producer.properties.retry.backoff.ms=200
spring.kafka.producer.properties.delivery.timeout.ms=60000
spring.kafka.producer.properties.request.timeout.ms=15000
```
### Approach 2: @Bean Configuration (Recommended)
```java
@Bean
public ProducerFactory<String, Object> producerFactory() {
    Map<String, Object> config = new HashMap<>();
    config.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapServers);
    config.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
    config.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, JsonSerializer.class);

    // Reliability
    config.put(ProducerConfig.ACKS_CONFIG, "all");
    config.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, true);

    // Retry settings
    config.put(ProducerConfig.RETRIES_CONFIG, Integer.MAX_VALUE); // retry until delivery.timeout.ms expires
    config.put(ProducerConfig.RETRY_BACKOFF_MS_CONFIG, 100);      // 100ms between retries

    // Timeout settings
    config.put(ProducerConfig.DELIVERY_TIMEOUT_MS_CONFIG, 120_000); // 2 min total deadline
    config.put(ProducerConfig.REQUEST_TIMEOUT_MS_CONFIG, 30_000);   // 30s per attempt

    // Batching
    config.put(ProducerConfig.LINGER_MS_CONFIG, 5); // wait up to 5ms to fill a batch

    return new DefaultKafkaProducerFactory<>(config);
}
```
## linger.ms: Batching and Retry Interaction
`linger.ms` controls how long the producer waits before sending a batch even if it is not full. It trades latency for throughput.
```mermaid
flowchart TB
    subgraph Linger0["linger.ms=0 (default)"]
        R0_1["Record arrives"] --> S0_1["Send immediately\n(small batch)"]
        R0_2["Record arrives"] --> S0_2["Send immediately\n(small batch)"]
        R0_3["Record arrives"] --> S0_3["Send immediately\n(small batch)"]
        Note0["More network requests\nLower latency\nLower throughput"]
    end
    subgraph Linger5["linger.ms=5"]
        R5_1["Record 1 arrives"] --> W5["Wait up to 5ms"]
        R5_2["Record 2 arrives"] --> W5
        R5_3["Record 3 arrives"] --> W5
        W5 --> S5["Send batch of 3\n(one network request)"]
        Note5["Fewer network requests\nSlightly higher latency\nHigher throughput"]
    end
```
For most services, `linger.ms=5` is a good default — the extra 5ms is imperceptible to users but significantly improves throughput under load.
## Retry Backoff: Fixed vs. Exponential
### Fixed Backoff (clients before 3.7)

With `retry.backoff.ms=100`, every retry waits exactly 100ms:

```
Attempt 1 → fail → wait 100ms → Attempt 2 → fail → wait 100ms → Attempt 3...
```
### Exponential Backoff (Kafka 3.7+, KIP-580)

Since Kafka 3.7 (KIP-580), the clients apply exponential backoff with jitter by default. Use `retry.backoff.max.ms` to cap the growth:
```properties
spring.kafka.producer.properties.retry.backoff.ms=100
spring.kafka.producer.properties.retry.backoff.max.ms=10000
```
```
Attempt 1 → fail → wait 100ms
Attempt 2 → fail → wait 200ms
Attempt 3 → fail → wait 400ms
...
Attempt N → fail → wait 10,000ms (capped at max)
```
Exponential backoff is gentler on an already-struggling broker — it gives the broker more time to recover as successive retries occur.
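The growth pattern above can be sketched as follows. `BackoffSchedule` is an illustrative helper, not a Kafka class, and the real client also adds random jitter, omitted here for clarity:

```java
// Illustrative sketch of exponential backoff with a cap:
// wait(n) = min(retry.backoff.ms * 2^n, retry.backoff.max.ms)
public class BackoffSchedule {

    static long backoffMs(long baseMs, long maxMs, int retryNumber) {
        double exponential = baseMs * Math.pow(2, retryNumber);
        return (long) Math.min(exponential, maxMs);
    }

    public static void main(String[] args) {
        // retry.backoff.ms=100, retry.backoff.max.ms=10000 (as configured above)
        for (int n = 0; n < 8; n++) {
            System.out.println("retry " + (n + 1) + ": wait " + backoffMs(100, 10_000, n) + "ms");
        }
        // waits: 100, 200, 400, 800, 1600, 3200, 6400, 10000 (capped)
    }
}
```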
## Message Reordering: A Critical Retry Pitfall
Without idempotence, retries can cause out-of-order delivery:
```mermaid
sequenceDiagram
    participant Producer
    participant Broker
    Producer->>Broker: Record A (seq=1)
    Producer->>Broker: Record B (seq=2)
    Note over Broker: Record A write fails (network hiccup)
    Broker-->>Producer: Error for Record A
    Broker-->>Producer: Ack for Record B ✓ (B is committed at offset 0)
    Producer->>Broker: Retry Record A
    Broker-->>Producer: Ack for Record A ✓ (A is committed at offset 1)
    Note over Broker: Final order on broker:\noffset 0 = Record B\noffset 1 = Record A\n❌ OUT OF ORDER!
```
This happens because Kafka allows multiple in-flight requests per partition (`max.in.flight.requests.per.connection=5` by default). Record B is accepted while Record A’s retry is still in flight.
### Solution: Enable Idempotence
```properties
spring.kafka.producer.properties.enable.idempotence=true
```
With idempotence enabled, the producer assigns a monotonically increasing sequence number to each record batch per partition. The broker uses these to detect and discard duplicates, and it refuses any batch that arrives out of sequence (`OUT_OF_ORDER_SEQUENCE_NUMBER`). Record B therefore cannot be committed while Record A’s retry is still pending; the producer re-sends both in the original order.

Idempotence also constrains related settings:

- `acks=all` (required for idempotence)
- `max.in.flight.requests.per.connection` must be 5 or less (5 is the maximum allowed with idempotence)
- `retries` must be greater than 0 (the default, `Integer.MAX_VALUE`, satisfies this)
Always enable idempotence when using retries. It is the safe default.
## What Errors Are Retried?
Not all errors are retriable. The producer classifies errors:
```mermaid
flowchart TD
    Error["ProduceResponse Error"]
    Retriable{Retriable?}
    Retry["Retry automatically\n(wait retry.backoff.ms)"]
    Fail["Fail immediately\n(no retry)"]
    Error --> Retriable
    Retriable -->|Yes| Retry
    Retriable -->|No| Fail
    RetryableExamples["Retriable errors:\n• LEADER_NOT_AVAILABLE\n• NOT_LEADER_OR_FOLLOWER\n• REQUEST_TIMED_OUT\n• BROKER_NOT_AVAILABLE\n• NETWORK_EXCEPTION\n• NOT_ENOUGH_REPLICAS (transient)"]
    NonRetryableExamples["Non-retriable errors:\n• INVALID_TOPIC_EXCEPTION\n• RECORD_TOO_LARGE\n• UNKNOWN_TOPIC_OR_PARTITION\n• AUTHORIZATION_FAILED\n• INVALID_REQUIRED_ACKS"]
    Retry --- RetryableExamples
    Fail --- NonRetryableExamples
```
`LEADER_NOT_AVAILABLE` and `NOT_LEADER_OR_FOLLOWER` are the most common retriable errors — they occur during leader elections (broker restart, partition rebalancing). The producer retries and successfully connects to the new leader.
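The producer’s classification boils down to a type check: transient errors extend `org.apache.kafka.common.errors.RetriableException`. The self-contained sketch below mimics that hierarchy with stand-in exception classes (the real classes live in `kafka-clients`):

```java
// Stand-ins mimicking the kafka-clients hierarchy: transient errors extend
// RetriableException, permanent ones extend ApiException directly.
class ApiException extends RuntimeException {}
class RetriableException extends ApiException {}
class NotLeaderOrFollowerException extends RetriableException {} // transient: retried
class RecordTooLargeException extends ApiException {}            // permanent: fails fast

public class ErrorClassifier {

    static boolean isRetriable(Exception e) {
        // An instanceof check against the marker type is essentially
        // how the producer decides whether to retry
        return e instanceof RetriableException;
    }

    public static void main(String[] args) {
        System.out.println(isRetriable(new NotLeaderOrFollowerException())); // true
        System.out.println(isRetriable(new RecordTooLargeException()));      // false
    }
}
```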
## Monitoring Send Failures in Spring Kafka
```java
@Bean
public KafkaTemplate<String, Object> kafkaTemplate(
        ProducerFactory<String, Object> producerFactory) {
    KafkaTemplate<String, Object> template = new KafkaTemplate<>(producerFactory);
    template.setProducerListener(new ProducerListener<>() {
        @Override
        public void onError(ProducerRecord<String, Object> record,
                            RecordMetadata metadata,
                            Exception exception) {
            // org.apache.kafka.common.errors.TimeoutException
            if (exception instanceof TimeoutException) {
                // delivery.timeout.ms exceeded — all retries exhausted
                log.error("[KAFKA] Delivery timeout — record permanently lost: "
                        + "topic={} key={}", record.topic(), record.key());
                // deadLetterService: application-specific persistence for failed records
                deadLetterService.saveFailedRecord(record, exception);
            } else {
                log.error("[KAFKA] Send error: topic={} key={}",
                        record.topic(), record.key(), exception);
            }
        }
    });
    return template;
}
```
## Recommended Retry Configuration
```mermaid
flowchart LR
    subgraph Config["Recommended Production Settings"]
        R1["retries = Integer.MAX_VALUE\n(let delivery.timeout.ms decide)"]
        R2["delivery.timeout.ms = 120,000\n(2 min total window)"]
        R3["request.timeout.ms = 30,000\n(30s per attempt)"]
        R4["retry.backoff.ms = 100\n(100ms between attempts)"]
        R5["enable.idempotence = true\n(no duplicates on retry)"]
        R6["linger.ms = 5\n(micro-batching for throughput)"]
    end
```
```java
config.put(ProducerConfig.RETRIES_CONFIG, Integer.MAX_VALUE);
config.put(ProducerConfig.DELIVERY_TIMEOUT_MS_CONFIG, 120_000);
config.put(ProducerConfig.REQUEST_TIMEOUT_MS_CONFIG, 30_000);
config.put(ProducerConfig.RETRY_BACKOFF_MS_CONFIG, 100);
config.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, true);
config.put(ProducerConfig.LINGER_MS_CONFIG, 5);
config.put(ProducerConfig.ACKS_CONFIG, "all"); // required by idempotence
```
The constraint that must hold: `delivery.timeout.ms >= request.timeout.ms + linger.ms`. The producer validates this at startup and throws a `ConfigException` if it is violated.
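That startup check can be sketched as follows. `TimeoutValidator` is a hypothetical helper; the real validation happens inside the producer’s configuration code:

```java
// Illustrative: the producer refuses to start when the outer deadline
// cannot even cover one linger window plus one request timeout.
public class TimeoutValidator {

    static void validate(long deliveryTimeoutMs, long requestTimeoutMs, long lingerMs) {
        if (deliveryTimeoutMs < requestTimeoutMs + lingerMs) {
            throw new IllegalArgumentException(
                "delivery.timeout.ms must be >= request.timeout.ms + linger.ms");
        }
    }

    public static void main(String[] args) {
        validate(120_000, 30_000, 5);    // OK: 120,000 >= 30,005
        try {
            validate(20_000, 30_000, 5); // violates the constraint
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```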
## Key Takeaways
- `retries` counts retry attempts; `delivery.timeout.ms` is the absolute deadline; whichever comes first fails the send
- `request.timeout.ms` is per-attempt; `retry.backoff.ms` is the wait between attempts
- Without `enable.idempotence=true`, retries can cause out-of-order delivery within a partition
- `enable.idempotence=true` requires `acks=all` and prevents both duplicates and reordering
- Retriable errors (leader unavailable, timeout) are retried automatically; non-retriable errors (invalid topic, message too large) fail immediately
- `linger.ms=5` is a good default for production throughput without noticeable latency increase
- Set `retries=Integer.MAX_VALUE` and let `delivery.timeout.ms` control the retry window
**Next:** *Idempotent Producers: Eliminating Duplicate Messages* — understand exactly how Kafka’s idempotence mechanism works internally, and how producer IDs and sequence numbers prevent duplicates.