Producer Retries: Backoff, Timeouts, and Retry Strategies

Why Producers Need Retries

Network errors, leader elections, and broker restarts are normal events in a distributed system. Without retries, a transient broker hiccup causes permanent data loss from the producer’s perspective. With retries, the producer automatically re-sends failed records until either the broker accepts them or a timeout deadline is reached.

sequenceDiagram
    participant Producer
    participant Leader as Leader (Broker 1)
    participant NewLeader as New Leader (Broker 2)

    Producer->>Leader: ProduceRequest (offset 42)
    Note over Leader: Broker 1 crashes mid-write
    Leader--xProducer: No response (timeout)

    Note over Producer: retry.backoff.ms = 100ms wait
    Producer->>Producer: Wait 100ms

    Note over NewLeader: Broker 2 elected as new leader
    Producer->>NewLeader: ProduceRequest (retry 1)
    NewLeader-->>Producer: ProduceResponse ✓ (offset 42)
    Note over Producer: Retry succeeded transparently

The Timeout Hierarchy

Before configuring retries, understand how Kafka’s three timeout settings interact:

flowchart LR
    subgraph DeliveryTimeout["delivery.timeout.ms (120,000ms default — outer deadline)"]
        subgraph RequestTimeout["request.timeout.ms (30,000ms — per-attempt deadline)"]
            Attempt["Send attempt 1"]
        end
        subgraph Backoff1["retry.backoff.ms (100ms)"]
            Wait1["Wait 100ms"]
        end
        subgraph Request2["request.timeout.ms"]
            Attempt2["Send attempt 2"]
        end
        subgraph Backoff2["retry.backoff.ms"]
            Wait2["Wait 100ms"]
        end
        subgraph Request3["request.timeout.ms"]
            Attempt3["Send attempt N"]
        end

        Attempt --> Wait1 --> Attempt2 --> Wait2 --> Attempt3
    end

    Expire["delivery.timeout.ms elapsed\n→ TimeoutException\nrecord dropped"]
    Attempt3 --> Expire
Setting                Default               Controls
delivery.timeout.ms    120,000 ms (2 min)    Total time a send can take, including all retries
request.timeout.ms     30,000 ms (30 sec)    How long to wait for a single broker response
retry.backoff.ms       100 ms                How long to wait between retry attempts
retries                Integer.MAX_VALUE     Maximum number of retry attempts

delivery.timeout.ms is the outer deadline — when it expires, the record fails regardless of how many retries remain.
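A quick back-of-envelope check of how these settings compose, using the defaults above. This sketch counts worst-case full attempts; in practice a failed attempt usually returns much faster than request.timeout.ms, so real attempt counts are higher:

```java
public class RetryBudget {

    // Worst case: every attempt burns the full request.timeout.ms, then
    // waits retry.backoff.ms. Counts attempts that can run to completion
    // before delivery.timeout.ms expires.
    static int worstCaseAttempts(long deliveryTimeoutMs,
                                 long requestTimeoutMs,
                                 long retryBackoffMs) {
        long perCycleMs = requestTimeoutMs + retryBackoffMs;
        // First attempt, plus however many further backoff+attempt cycles
        // fit in the remaining window.
        return 1 + (int) ((deliveryTimeoutMs - requestTimeoutMs) / perCycleMs);
    }

    public static void main(String[] args) {
        // Defaults: 120s window, 30s per attempt, 100ms backoff -> 3 full attempts
        System.out.println(worstCaseAttempts(120_000, 30_000, 100));
    }
}
```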


Configuring Retries

Approach 1: Simple Properties

# application.properties
spring.kafka.producer.retries=10
spring.kafka.producer.properties.retry.backoff.ms=200
spring.kafka.producer.properties.delivery.timeout.ms=60000
spring.kafka.producer.properties.request.timeout.ms=15000

Approach 2: Programmatic ProducerFactory

@Bean
public ProducerFactory<String, Object> producerFactory() {
    Map<String, Object> config = new HashMap<>();
    config.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapServers);
    config.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
    config.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, JsonSerializer.class);

    // Reliability
    config.put(ProducerConfig.ACKS_CONFIG, "all");
    config.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, true);

    // Retry settings
    config.put(ProducerConfig.RETRIES_CONFIG, Integer.MAX_VALUE);  // retry indefinitely
    config.put(ProducerConfig.RETRY_BACKOFF_MS_CONFIG, 100);       // 100ms between retries

    // Timeout settings
    config.put(ProducerConfig.DELIVERY_TIMEOUT_MS_CONFIG, 120_000); // 2 min total deadline
    config.put(ProducerConfig.REQUEST_TIMEOUT_MS_CONFIG, 30_000);   // 30s per attempt

    // Batching (affects when linger causes flush)
    config.put(ProducerConfig.LINGER_MS_CONFIG, 5);  // wait up to 5ms to fill batch

    return new DefaultKafkaProducerFactory<>(config);
}

linger.ms: Batching and Retry Interaction

linger.ms controls how long the producer waits before sending a batch even if it is not full. It trades latency for throughput.

flowchart TB
    subgraph Linger0["linger.ms=0 (default)"]
        R0_1["Record arrives"] --> S0_1["Send immediately\n(small batch)"]
        R0_2["Record arrives"] --> S0_2["Send immediately\n(small batch)"]
        R0_3["Record arrives"] --> S0_3["Send immediately\n(small batch)"]
        Note0["More network requests\nLower latency\nLower throughput"]
    end

    subgraph Linger5["linger.ms=5"]
        R5_1["Record 1 arrives"] --> W5["Wait up to 5ms"]
        R5_2["Record 2 arrives"] --> W5
        R5_3["Record 3 arrives"] --> W5
        W5 --> S5["Send batch of 3\n(one network request)"]
        Note5["Fewer network requests\nSlightly higher latency\nHigher throughput"]
    end

For most services, linger.ms=5 is a good default — the extra 5ms is imperceptible to users but significantly improves throughput under load.
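linger.ms works in tandem with batch.size: a batch is sent when either batch.size bytes accumulate or linger.ms elapses, whichever comes first. A sketch of the pair in the properties style used above (the 32 KB batch.size is an illustrative bump over the 16 KB default, not a recommendation from this text):

```properties
# Send a batch when 32 KB accumulates OR 5ms passes, whichever comes first.
spring.kafka.producer.properties.linger.ms=5
spring.kafka.producer.properties.batch.size=32768
```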


Retry Backoff: Fixed vs. Exponential

Fixed Backoff (Default)

With retry.backoff.ms=100, every retry waits exactly 100ms:

Attempt 1 → fail → wait 100ms → Attempt 2 → fail → wait 100ms → Attempt 3...

Exponential Backoff (Kafka 3.7+, via KIP-580)

Use retry.backoff.max.ms to cap exponential growth:

spring.kafka.producer.properties.retry.backoff.ms=100
spring.kafka.producer.properties.retry.backoff.max.ms=10000
Attempt 1 → fail → wait 100ms
Attempt 2 → fail → wait 200ms
Attempt 3 → fail → wait 400ms
...
Attempt N → fail → wait 10,000ms (capped at max)

Exponential backoff is gentler on an already-struggling broker — it gives the broker more time to recover as successive retries occur.
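The schedule above reduces to one formula: backoff(n) = min(retry.backoff.ms × 2^(n−1), retry.backoff.max.ms). The real client also applies random jitter so that many producers do not retry in lock-step; the jitter is omitted from this sketch for clarity:

```java
public class ExponentialBackoff {

    // backoff(n) = min(initial * 2^(n-1), max); n is the 1-based retry number.
    // Jitter, which the real client adds on top, is intentionally left out.
    static long backoffMs(int attempt, long initialMs, long maxMs) {
        double raw = initialMs * Math.pow(2, attempt - 1);
        return (long) Math.min(raw, maxMs);
    }

    public static void main(String[] args) {
        // With retry.backoff.ms=100 and retry.backoff.max.ms=10000:
        for (int n = 1; n <= 8; n++) {
            System.out.printf("after attempt %d: wait %dms%n",
                    n, backoffMs(n, 100, 10_000));
        }
    }
}
```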


Message Reordering: A Critical Retry Pitfall

Without idempotence, retries can cause out-of-order delivery:

sequenceDiagram
    participant Producer
    participant Broker

    Producer->>Broker: Record A (seq=1)
    Producer->>Broker: Record B (seq=2)
    Note over Broker: Record A write fails (network hiccup)
    Broker-->>Producer: Error for Record A
    Broker-->>Producer: Ack for Record B ✓ (B is committed at offset 0)
    Producer->>Broker: Retry Record A
    Broker-->>Producer: Ack for Record A ✓ (A is committed at offset 1)

    Note over Broker: Final order on broker:\noffset 0 = Record B\noffset 1 = Record A\n❌ OUT OF ORDER!

This happens because Kafka allows multiple in-flight requests per partition (max.in.flight.requests.per.connection=5 by default). Record B is accepted while Record A’s retry is still in flight.
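Before idempotence existed, the standard workaround was to cap in-flight requests at one, so a retry can never be overtaken by a later record, at a real throughput cost. A sketch in the properties style used earlier:

```properties
# Legacy reordering fix: one in-flight request per connection.
# Prefer enable.idempotence=true, which keeps ordering without the throughput hit.
spring.kafka.producer.properties.max.in.flight.requests.per.connection=1
```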

Solution: Enable Idempotence

spring.kafka.producer.properties.enable.idempotence=true

With idempotence enabled, the producer attaches a monotonically increasing sequence number to every batch, and the broker accepts batches only in sequence order. In the scenario above, Record B (sequence 2) is rejected as out of order until Record A's retry (sequence 1) succeeds, so ordering is preserved; a re-sent batch the broker has already written is recognized by its sequence number and discarded as a duplicate.

Enabling idempotence also constrains the related settings:

  • acks=all (required for idempotence; becomes the effective default)
  • max.in.flight.requests.per.connection ≤ 5 (5 is the maximum compatible with idempotence)
  • retries > 0 (defaults to Integer.MAX_VALUE)

Always enable idempotence when using retries. It is the safe default.


What Errors Are Retried?

Not all errors are retriable. The producer classifies errors:

flowchart TD
    Error["ProduceResponse Error"]
    Retriable{Retriable?}
    
    Retry["Retry automatically\n(wait retry.backoff.ms)"]
    Fail["Fail immediately\n(no retry)"]

    Retriable -->|Yes| Retry
    Retriable -->|No| Fail

    RetryableExamples["Retriable errors:\n• LEADER_NOT_AVAILABLE\n• NOT_LEADER_OR_FOLLOWER\n• UNKNOWN_TOPIC_OR_PARTITION (stale metadata)\n• REQUEST_TIMED_OUT\n• BROKER_NOT_AVAILABLE\n• NETWORK_EXCEPTION\n• NOT_ENOUGH_REPLICAS (transient)"]
    
    NonRetryableExamples["Non-retriable errors:\n• INVALID_TOPIC_EXCEPTION\n• RECORD_TOO_LARGE\n• TOPIC_AUTHORIZATION_FAILED\n• INVALID_REQUIRED_ACKS"]

    Retry --- RetryableExamples
    Fail --- NonRetryableExamples

LEADER_NOT_AVAILABLE and NOT_LEADER_OR_FOLLOWER are the most common retriable errors — they occur during leader elections (broker restart, partition rebalancing). The producer retries and successfully connects to the new leader.
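The two-way split can be mimicked with a tiny lookup over the error names above. This is a hypothetical helper for illustration only; the real Java client makes the distinction with the org.apache.kafka.common.errors.RetriableException marker type, which every retriable error extends:

```java
import java.util.Set;

public class ProduceErrorClassifier {

    // Hypothetical lookup mirroring the retriable list in the diagram above.
    static final Set<String> RETRIABLE = Set.of(
            "LEADER_NOT_AVAILABLE", "NOT_LEADER_OR_FOLLOWER", "REQUEST_TIMED_OUT",
            "BROKER_NOT_AVAILABLE", "NETWORK_EXCEPTION", "NOT_ENOUGH_REPLICAS");

    static boolean isRetriable(String errorCode) {
        return RETRIABLE.contains(errorCode);
    }

    public static void main(String[] args) {
        System.out.println(isRetriable("NOT_LEADER_OR_FOLLOWER")); // true
        System.out.println(isRetriable("RECORD_TOO_LARGE"));       // false
    }
}
```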


Monitoring Send Failures in Spring Kafka

@Bean
public KafkaTemplate<String, Object> kafkaTemplate(
        ProducerFactory<String, Object> producerFactory) {

    KafkaTemplate<String, Object> template = new KafkaTemplate<>(producerFactory);

    template.setProducerListener(new ProducerListener<>() {

        @Override
        public void onError(ProducerRecord<String, Object> record,
                            RecordMetadata metadata,
                            Exception exception) {
            if (exception instanceof TimeoutException) {
                // delivery.timeout.ms exceeded — all retries exhausted
                log.error("[KAFKA] Delivery timeout — record permanently lost: "
                    + "topic={} key={}", record.topic(), record.key());
                // deadLetterService: application-specific fallback store (not shown)
                deadLetterService.saveFailedRecord(record, exception);
            } else {
                log.error("[KAFKA] Send error: topic={} key={}",
                    record.topic(), record.key(), exception);
            }
        }
    });

    return template;
}

flowchart LR
    subgraph Config["Recommended Production Settings"]
        R1["retries = Integer.MAX_VALUE\n(let delivery.timeout.ms decide)"]
        R2["delivery.timeout.ms = 120,000\n(2 min total window)"]
        R3["request.timeout.ms = 30,000\n(30s per attempt)"]
        R4["retry.backoff.ms = 100\n(100ms between attempts)"]
        R5["enable.idempotence = true\n(no duplicates on retry)"]
        R6["linger.ms = 5\n(micro-batching for throughput)"]
    end
config.put(ProducerConfig.RETRIES_CONFIG, Integer.MAX_VALUE);
config.put(ProducerConfig.DELIVERY_TIMEOUT_MS_CONFIG, 120_000);
config.put(ProducerConfig.REQUEST_TIMEOUT_MS_CONFIG, 30_000);
config.put(ProducerConfig.RETRY_BACKOFF_MS_CONFIG, 100);
config.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, true);
config.put(ProducerConfig.LINGER_MS_CONFIG, 5);
config.put(ProducerConfig.ACKS_CONFIG, "all");  // required by idempotence

The constraint that must hold: delivery.timeout.ms >= request.timeout.ms + linger.ms. The producer validates this when it is constructed and throws a ConfigException if it is violated.
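The check itself is simple; a sketch of an equivalent validation (the real producer performs its own version of this at construction time):

```java
public class TimeoutValidation {

    // delivery.timeout.ms must cover at least linger.ms plus one full
    // request.timeout.ms, or not even a single attempt fits in the window.
    static void validate(long deliveryTimeoutMs, long requestTimeoutMs, long lingerMs) {
        if (deliveryTimeoutMs < requestTimeoutMs + lingerMs) {
            throw new IllegalArgumentException(
                    "delivery.timeout.ms (" + deliveryTimeoutMs + ") must be >= "
                    + "request.timeout.ms + linger.ms (" + (requestTimeoutMs + lingerMs) + ")");
        }
    }

    public static void main(String[] args) {
        validate(120_000, 30_000, 5); // the recommended settings above pass
    }
}
```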


Key Takeaways

  • retries counts retry attempts; delivery.timeout.ms is the absolute deadline — whichever comes first fails the send
  • request.timeout.ms is per-attempt; retry.backoff.ms is the wait between attempts
  • Without enable.idempotence=true, retries can cause out-of-order delivery within a partition
  • enable.idempotence=true requires acks=all and prevents both duplicates and reordering
  • Retriable errors (leader unavailable, timeout) are retried automatically; non-retriable errors (invalid topic, message too large) fail immediately
  • linger.ms=5 is a good default for production throughput without noticeable latency increase
  • Set retries=Integer.MAX_VALUE and let delivery.timeout.ms control the retry window

Next: Idempotent Producers: Eliminating Duplicate Messages — understand exactly how Kafka’s idempotence mechanism works internally, how producer IDs and sequence numbers prevent duplicates.