Retry Logic: Handling Transient Failures Gracefully
Introduction
Batch jobs interact with databases, REST APIs, and file systems — all of which fail transiently. A MySQL deadlock resolves itself in milliseconds. A network timeout to an external service clears up in seconds. Retrying these transient failures automatically is far better than failing the entire job and requiring a manual restart.
Spring Batch has built-in retry support at the step level, integrated with its transaction management. This article covers everything you need to configure robust retry behaviour.
Transient vs Fatal Exceptions
Before configuring retry, classify your exceptions:
| Category | Examples | Right response |
|---|---|---|
| Transient — self-resolving | DeadlockLoserDataAccessException, QueryTimeoutException, HTTP 503, connection timeouts | Retry with backoff |
| Fatal — won’t resolve with retry | FlatFileParseException, DataIntegrityViolationException, BadSqlGrammarException, NullPointerException | Skip or fail immediately |
Retrying a fatal exception wastes time and burns retry budget. Always explicitly include only retryable exception types.
Basic Step-Level Retry
Enable retry with .faultTolerant() on the step builder:
@Bean
public Step importOrdersStep(JobRepository jobRepository,
PlatformTransactionManager tx,
FlatFileItemReader<Order> reader,
JdbcBatchItemWriter<Order> writer) {
return new StepBuilder("importOrdersStep", jobRepository)
.<Order, Order>chunk(100, tx)
.reader(reader)
.writer(writer)
.faultTolerant()
.retry(DeadlockLoserDataAccessException.class)
.retryLimit(3) // 1 attempt + 2 retries = 3 total
.build();
}
When the writer throws DeadlockLoserDataAccessException, Spring Batch:
- Rolls back the chunk transaction.
- Retries the entire chunk — by default the items are replayed from Spring Batch's chunk-level cache (the reader is not re-invoked), re-processed, and re-written.
- After retryLimit attempts, falls back to item-by-item processing to isolate the bad item.
- If item-by-item processing also fails retryLimit times, the step fails.
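That retry-then-scan behaviour can be sketched in plain Java. This is a minimal illustration, not Spring Batch's actual implementation — writeWithRetry and the writeChunk callback are made-up names standing in for the framework's chunk handling:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

public class ChunkRetrySketch {

    // Returns the items that still fail after the whole-chunk retries
    // and the item-by-item scan (Spring Batch would skip or fail on them).
    static <T> List<T> writeWithRetry(List<T> chunk,
                                      Consumer<List<T>> writeChunk,
                                      int retryLimit) {
        // Phase 1: retry the whole chunk up to retryLimit attempts.
        for (int attempt = 1; attempt <= retryLimit; attempt++) {
            try {
                writeChunk.accept(chunk);
                return List.of();            // success — nothing failed
            } catch (RuntimeException e) {
                // transient failure: loop and try the whole chunk again
            }
        }
        // Phase 2: scan item-by-item to isolate the bad item(s).
        List<T> failed = new ArrayList<>();
        for (T item : chunk) {
            try {
                writeChunk.accept(List.of(item));
            } catch (RuntimeException e) {
                failed.add(item);
            }
        }
        return failed;
    }
}
```

The real framework also re-runs each phase inside its own transaction, which the sketch leaves out.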
RetryPolicy
SimpleRetryPolicy (default)
return new StepBuilder("step", jobRepository)
.<Order, Order>chunk(100, tx)
.reader(reader).writer(writer)
.faultTolerant()
.retryPolicy(new SimpleRetryPolicy(3, Map.of(
DeadlockLoserDataAccessException.class, true,
QueryTimeoutException.class, true,
DataIntegrityViolationException.class, false // don't retry
)))
.build();
In the Map<Class<? extends Throwable>, Boolean>, true marks an exception type as retryable and false as not retryable. Non-retryable exceptions propagate immediately without consuming retry budget.
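The map lookup resolves through the exception's class hierarchy, so an entry for a superclass covers its subclasses. A plain-Java sketch of that idea (isRetryable is a made-up helper, not Spring Retry's actual classifier, which can also traverse exception causes when configured):

```java
import java.util.Map;

public class RetryClassifierSketch {

    // Walk up the exception's class hierarchy until a configured entry
    // is found; the closest configured ancestor wins.
    static boolean isRetryable(Throwable t,
                               Map<Class<? extends Throwable>, Boolean> classified,
                               boolean defaultValue) {
        for (Class<?> type = t.getClass(); type != null; type = type.getSuperclass()) {
            Boolean decision = classified.get(type);
            if (decision != null) {
                return decision;
            }
        }
        return defaultValue;    // exception types not listed at all
    }
}
```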
TimeoutRetryPolicy
Retry until a time limit is reached, regardless of attempt count:
TimeoutRetryPolicy policy = new TimeoutRetryPolicy();
policy.setTimeout(30_000); // keep retrying for up to 30 seconds
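The deadline-driven behaviour can be approximated in plain Java. A minimal sketch, assuming no backoff between attempts (callWithDeadline is an illustrative name, not a Spring Retry API):

```java
import java.util.function.Supplier;

public class DeadlineRetrySketch {

    // Keep attempting until a wall-clock budget is spent,
    // regardless of how many attempts that takes.
    static <T> T callWithDeadline(Supplier<T> call, long timeoutMillis) {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        RuntimeException last = null;
        do {
            try {
                return call.get();
            } catch (RuntimeException e) {
                last = e;       // remember and retry while time remains
            }
        } while (System.currentTimeMillis() < deadline);
        throw last;             // budget exhausted — rethrow the last failure
    }
}
```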
CompositeRetryPolicy
Combine policies — stops retrying when any policy says stop:
CompositeRetryPolicy composite = new CompositeRetryPolicy();
TimeoutRetryPolicy timeoutPolicy = new TimeoutRetryPolicy();
timeoutPolicy.setTimeout(30_000); // OR 30 seconds total
composite.setPolicies(new RetryPolicy[]{
new SimpleRetryPolicy(5), // max 5 attempts
timeoutPolicy
});
// Whichever limit hits first wins
NeverRetryPolicy
Explicitly disable retry (useful when overriding a default):
.retryPolicy(new NeverRetryPolicy())
BackOffPolicy
Without a BackOffPolicy, retries happen immediately — no delay. For most transient errors, you want to wait before retrying to let the condition resolve.
FixedBackOffPolicy
FixedBackOffPolicy backOff = new FixedBackOffPolicy();
backOff.setBackOffPeriod(1_000); // 1 second between every retry
ExponentialBackOffPolicy
ExponentialBackOffPolicy backOff = new ExponentialBackOffPolicy();
backOff.setInitialInterval(200); // first wait: 200ms
backOff.setMultiplier(2.0); // doubles each time: 200, 400, 800, 1600...
backOff.setMaxInterval(10_000); // cap at 10 seconds
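For reference, the delay sequence these settings produce can be computed with a small plain-Java sketch (delays is an illustrative helper, not a Spring Retry API):

```java
public class BackoffSketch {

    // Exponential backoff: start at `initial`, multiply each time,
    // cap every delay at `max`.
    static long[] delays(long initial, double multiplier, long max, int attempts) {
        long[] result = new long[attempts];
        double interval = initial;
        for (int i = 0; i < attempts; i++) {
            result[i] = Math.min((long) interval, max);
            interval *= multiplier;
        }
        return result;
    }
}
// delays(200, 2.0, 10_000, 7) → 200, 400, 800, 1600, 3200, 6400, 10000
```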
ExponentialRandomBackOffPolicy
Adds random jitter to exponential backoff — prevents multiple threads retrying in lockstep (thundering herd):
ExponentialRandomBackOffPolicy backOff = new ExponentialRandomBackOffPolicy();
backOff.setInitialInterval(200);
backOff.setMultiplier(2.0);
backOff.setMaxInterval(10_000);
// Each delay is a random multiple of the current interval (between 1x and multiplier x)
Use ExponentialRandomBackOffPolicy in multi-threaded steps and remote-partitioned jobs.
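The jitter idea in isolation, as a plain-Java sketch (jitteredDelay is a made-up helper; Spring Retry's actual randomization differs in detail):

```java
import java.util.concurrent.ThreadLocalRandom;

public class JitterSketch {

    // Draw the delay uniformly between the current interval and
    // interval * multiplier (both capped at max), so concurrent workers
    // spread out instead of retrying in lockstep.
    static long jitteredDelay(long interval, double multiplier, long max) {
        long lower = Math.min(interval, max);
        long upper = Math.min((long) (interval * multiplier), max);
        return lower + (long) (ThreadLocalRandom.current().nextDouble() * (upper - lower));
    }
}
```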
Applying BackOffPolicy to a step
return new StepBuilder("importOrdersStep", jobRepository)
.<Order, Order>chunk(100, tx)
.reader(reader).writer(writer)
.faultTolerant()
.retry(DeadlockLoserDataAccessException.class)
.retryLimit(5)
.backOffPolicy(backOff)
.build();
Retrying MySQL Deadlocks
MySQL deadlocks are the most common transient error in batch jobs. Spring wraps MySQL error code 1213 as DeadlockLoserDataAccessException.
@Bean
public Step processOrdersStep(JobRepository jobRepository,
PlatformTransactionManager tx,
JdbcPagingItemReader<Order> reader,
JdbcBatchItemWriter<ProcessedOrder> writer) {
ExponentialBackOffPolicy backOff = new ExponentialBackOffPolicy();
backOff.setInitialInterval(100);
backOff.setMultiplier(2.0);
backOff.setMaxInterval(5_000);
return new StepBuilder("processOrdersStep", jobRepository)
.<Order, ProcessedOrder>chunk(200, tx)
.reader(reader)
.processor(orderProcessor())
.writer(writer)
.faultTolerant()
.retry(DeadlockLoserDataAccessException.class)
.retryLimit(3)
.backOffPolicy(backOff)
.build();
}
To also handle lock wait timeouts (MySQL error 1205):
.retry(DeadlockLoserDataAccessException.class)
.retry(PessimisticLockingFailureException.class) // parent of CannotAcquireLockException, which covers lock wait timeouts
.retryLimit(3)
noRollback — Skip the Rollback for Certain Exceptions
By default, any exception during a chunk causes a full transaction rollback. For exceptions that occur during reading (before any write happened), the rollback is unnecessary overhead.
return new StepBuilder("step", jobRepository)
.<Order, Order>chunk(100, tx)
.reader(reader).writer(writer)
.faultTolerant()
.noRollback(FlatFileParseException.class) // no DB write happened — skip rollback
.skip(FlatFileParseException.class)
.skipLimit(100)
.build();
noRollback is ignored during write exceptions — if a write already happened, you must roll it back to maintain consistency.
RetryListener — Log Every Attempt
@Bean
public RetryListener orderRetryListener() {
return new RetryListener() {
@Override
public <T, E extends Throwable> boolean open(RetryContext context,
RetryCallback<T, E> callback) {
return true; // return false to skip the retry entirely
}
@Override
public <T, E extends Throwable> void onError(RetryContext context,
RetryCallback<T, E> callback,
Throwable throwable) {
log.warn("Retry attempt {} failed: {} — {}",
context.getRetryCount(),
throwable.getClass().getSimpleName(),
throwable.getMessage());
}
@Override
public <T, E extends Throwable> void close(RetryContext context,
RetryCallback<T, E> callback,
Throwable throwable) {
if (throwable != null) {
log.error("All retry attempts exhausted after {} tries",
context.getRetryCount());
} else if (context.getRetryCount() > 0) {
log.info("Succeeded on attempt {}", context.getRetryCount() + 1);
}
}
};
}
Register on the step:
.faultTolerant()
.retry(DeadlockLoserDataAccessException.class)
.retryLimit(3)
.listener(orderRetryListener())
Step-Level Retry vs @Retryable in Processors
Spring Retry’s @Retryable can annotate methods in an ItemProcessor. Both approaches retry on failure — but they behave differently:
| | Step-level retry | @Retryable in processor |
|---|---|---|
| Scope | Entire chunk (replays cached items: re-process + re-write) | Single method call |
| Transaction rollback | Yes — full chunk rollback | No — the retry happens inside the method, so no exception escapes to trigger a rollback |
| Backoff | Configured on step | Configured on annotation |
| When to use | Writer failures (deadlocks, timeouts during INSERT) | Processor failures (REST API call, external service) |
Use @Retryable in a processor when the failure is in the processor itself (an HTTP call to a pricing API), not the writer:
@Component
public class PriceEnrichmentProcessor implements ItemProcessor<Order, Order> {
private final RestClient pricingClient;
@Retryable(
retryFor = {HttpServerErrorException.class, ResourceAccessException.class},
maxAttempts = 3,
backoff = @Backoff(delay = 500, multiplier = 2.0, maxDelay = 5_000)
)
@Override
public Order process(Order order) throws Exception {
BigDecimal marketPrice = pricingClient.get()
.uri("/price/{productId}", order.getProductId())
.retrieve()
.body(BigDecimal.class);
order.setMarketPrice(marketPrice);
return order;
}
@Recover
public Order recover(HttpServerErrorException ex, Order order) {
log.error("Pricing API unavailable for order {}. Using list price.", order.getOrderId());
order.setMarketPrice(order.getListPrice()); // fall back to list price
return order; // don't filter — use fallback
}
}
Enable @Retryable in Spring Boot:
@SpringBootApplication
@EnableRetry
public class BatchApplication { ... }
Complete Fault-Tolerant Step
@Bean
public Step faultTolerantImportStep(JobRepository jobRepository,
PlatformTransactionManager tx,
FlatFileItemReader<Order> reader,
ItemProcessor<Order, ProcessedOrder> processor,
JdbcBatchItemWriter<ProcessedOrder> writer,
OrderSkipListener skipListener,
RetryListener retryListener) {
ExponentialRandomBackOffPolicy backOff = new ExponentialRandomBackOffPolicy();
backOff.setInitialInterval(200);
backOff.setMultiplier(2.0);
backOff.setMaxInterval(10_000);
SimpleRetryPolicy retryPolicy = new SimpleRetryPolicy(3, Map.of(
DeadlockLoserDataAccessException.class, true,
PessimisticLockingFailureException.class, true,
QueryTimeoutException.class, true,
DataIntegrityViolationException.class, false, // never retry
FlatFileParseException.class, false // never retry
));
return new StepBuilder("faultTolerantImportStep", jobRepository)
.<Order, ProcessedOrder>chunk(200, tx)
.reader(reader)
.processor(processor)
.writer(writer)
.faultTolerant()
.retryPolicy(retryPolicy)
.backOffPolicy(backOff)
.noRollback(FlatFileParseException.class)
.skip(FlatFileParseException.class)
.skip(DataIntegrityViolationException.class)
.skipLimit(500)
.listener(skipListener)
.listener(retryListener)
.build();
}
Key Takeaways
- Always classify exceptions as transient (retry) or fatal (skip/fail) before configuring retry.
- Use ExponentialRandomBackOffPolicy for multi-threaded or distributed jobs — random jitter prevents a thundering herd.
- retryLimit is the total number of attempts (initial + retries): .retryLimit(3) = 1 initial + 2 retries.
- noRollback avoids a pointless transaction rollback for exceptions where no write occurred (like read-time parse errors).
- @Retryable in a processor retries at the method level without transaction rollback — use it for external API calls. Use step-level retry for database write failures.
- Always register a RetryListener in production so you have visibility into retry storms.
What’s Next
Article 18 covers skip logic in depth — custom SkipPolicy, dead-letter patterns, stopping a job intentionally, and designing jobs that restart cleanly after failure.