Retry Logic: Handling Transient Failures Gracefully
Introduction
Batch jobs interact with databases, REST APIs, and file systems — all of which fail transiently. A MySQL deadlock resolves itself in milliseconds. A network timeout to an external service clears up in seconds. Retrying these transient failures automatically is far better than failing the entire job and requiring a manual restart.
Spring Batch has built-in retry support at the step level, integrated with its transaction management. This article covers everything you need to configure robust retry behaviour.
Transient vs Fatal Exceptions
Before configuring retry, classify your exceptions:
| Category | Examples | Right response |
|---|---|---|
| Transient — self-resolving | DeadlockLoserDataAccessException, QueryTimeoutException, HTTP 503, connection timeouts | Retry with backoff |
| Fatal — won’t resolve with retry | FlatFileParseException, DataIntegrityViolationException, BadSqlGrammarException, NullPointerException | Skip or fail immediately |
Retrying a fatal exception wastes time and burns retry budget. Always explicitly include only retryable exception types.
Basic Step-Level Retry
Enable retry with .faultTolerant() on the step builder:
@Bean
public Step importOrdersStep(JobRepository jobRepository,
PlatformTransactionManager tx,
FlatFileItemReader<Order> reader,
JdbcBatchItemWriter<Order> writer) {
return new StepBuilder("importOrdersStep", jobRepository)
.<Order, Order>chunk(100, tx)
.reader(reader)
.writer(writer)
.faultTolerant()
.retry(DeadlockLoserDataAccessException.class)
.retryLimit(3) // 1 attempt + 2 retries = 3 total
.build();
}
When the writer throws DeadlockLoserDataAccessException, Spring Batch:
- Rolls back the chunk transaction.
- Retries the entire chunk — by default the items are replayed from Spring Batch's chunk-level cache (the reader is not re-invoked), re-processed, and re-written.
- After retryLimit attempts, falls back to item-by-item processing to isolate the bad item.
- If item-by-item processing also fails retryLimit times, the step fails.
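That retry-then-scan behaviour can be sketched in plain Java. This is a minimal illustration, not Spring Batch's actual implementation — writeWithRetry and the writeChunk callback are made-up names standing in for the framework's chunk handling:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

public class ChunkRetrySketch {

    // Returns the items that still fail after the whole-chunk retries
    // and the item-by-item scan (Spring Batch would skip or fail on them).
    static <T> List<T> writeWithRetry(List<T> chunk,
                                      Consumer<List<T>> writeChunk,
                                      int retryLimit) {
        // Phase 1: retry the whole chunk up to retryLimit attempts.
        for (int attempt = 1; attempt <= retryLimit; attempt++) {
            try {
                writeChunk.accept(chunk);
                return List.of();            // success — nothing failed
            } catch (RuntimeException e) {
                // transient failure: loop and try the whole chunk again
            }
        }
        // Phase 2: scan item-by-item to isolate the bad item(s).
        List<T> failed = new ArrayList<>();
        for (T item : chunk) {
            try {
                writeChunk.accept(List.of(item));
            } catch (RuntimeException e) {
                failed.add(item);
            }
        }
        return failed;
    }
}
```

The real framework also re-runs each phase inside its own transaction, which the sketch leaves out.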
RetryPolicy
SimpleRetryPolicy (default)
return new StepBuilder("step", jobRepository)
.<Order, Order>chunk(100, tx)
.reader(reader).writer(writer)
.faultTolerant()
.retryPolicy(new SimpleRetryPolicy(3, Map.of(
DeadlockLoserDataAccessException.class, true,
QueryTimeoutException.class, true,
DataIntegrityViolationException.class, false // don't retry
)))
.build();
In the Map<Class<? extends Throwable>, Boolean>, true marks an exception type as retryable and false as not retryable. Non-retryable exceptions propagate immediately without consuming retry budget.
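The map lookup resolves through the exception's class hierarchy, so an entry for a superclass covers its subclasses. A plain-Java sketch of that idea (isRetryable is a made-up helper, not Spring Retry's actual classifier, which can also traverse exception causes when configured):

```java
import java.util.Map;

public class RetryClassifierSketch {

    // Walk up the exception's class hierarchy until a configured entry
    // is found; the closest configured ancestor wins.
    static boolean isRetryable(Throwable t,
                               Map<Class<? extends Throwable>, Boolean> classified,
                               boolean defaultValue) {
        for (Class<?> type = t.getClass(); type != null; type = type.getSuperclass()) {
            Boolean decision = classified.get(type);
            if (decision != null) {
                return decision;
            }
        }
        return defaultValue;    // exception types not listed at all
    }
}
```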
TimeoutRetryPolicy
Retry until a time limit is reached, regardless of attempt count:
TimeoutRetryPolicy policy = new TimeoutRetryPolicy();
policy.setTimeout(30_000); // keep retrying for up to 30 seconds
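The deadline-driven behaviour can be approximated in plain Java. A minimal sketch, assuming no backoff between attempts (callWithDeadline is an illustrative name, not a Spring Retry API):

```java
import java.util.function.Supplier;

public class DeadlineRetrySketch {

    // Keep attempting until a wall-clock budget is spent,
    // regardless of how many attempts that takes.
    static <T> T callWithDeadline(Supplier<T> call, long timeoutMillis) {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        RuntimeException last = null;
        do {
            try {
                return call.get();
            } catch (RuntimeException e) {
                last = e;       // remember and retry while time remains
            }
        } while (System.currentTimeMillis() < deadline);
        throw last;             // budget exhausted — rethrow the last failure
    }
}
```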
CompositeRetryPolicy
Combine policies — stops retrying when any policy says stop:
CompositeRetryPolicy composite = new CompositeRetryPolicy();
TimeoutRetryPolicy timeoutPolicy = new TimeoutRetryPolicy();
timeoutPolicy.setTimeout(30_000); // OR 30 seconds total
composite.setPolicies(new RetryPolicy[]{
new SimpleRetryPolicy(5), // max 5 attempts
timeoutPolicy
});
// Whichever limit hits first wins
NeverRetryPolicy
Explicitly disable retry (useful when overriding a default):
.retryPolicy(new NeverRetryPolicy())
BackOffPolicy
Without a BackOffPolicy, retries happen immediately — no delay. For most transient errors, you want to wait before retrying to let the condition resolve.
FixedBackOffPolicy
FixedBackOffPolicy backOff = new FixedBackOffPolicy();
backOff.setBackOffPeriod(1_000); // 1 second between every retry
ExponentialBackOffPolicy
ExponentialBackOffPolicy backOff = new ExponentialBackOffPolicy();
backOff.setInitialInterval(200); // first wait: 200ms
backOff.setMultiplier(2.0); // doubles each time: 200, 400, 800, 1600...
backOff.setMaxInterval(10_000); // cap at 10 seconds
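For reference, the delay sequence these settings produce can be computed with a small plain-Java sketch (delays is an illustrative helper, not a Spring Retry API):

```java
public class BackoffSketch {

    // Exponential backoff: start at `initial`, multiply each time,
    // cap every delay at `max`.
    static long[] delays(long initial, double multiplier, long max, int attempts) {
        long[] result = new long[attempts];
        double interval = initial;
        for (int i = 0; i < attempts; i++) {
            result[i] = Math.min((long) interval, max);
            interval *= multiplier;
        }
        return result;
    }
}
// delays(200, 2.0, 10_000, 7) → 200, 400, 800, 1600, 3200, 6400, 10000
```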
ExponentialRandomBackOffPolicy
Adds random jitter to exponential backoff — prevents multiple threads retrying in lockstep (thundering herd):
ExponentialRandomBackOffPolicy backOff = new ExponentialRandomBackOffPolicy();
backOff.setInitialInterval(200);
backOff.setMultiplier(2.0);
backOff.setMaxInterval(10_000);
// Each delay is a random multiple of the current interval (between 1x and multiplier x)
Use ExponentialRandomBackOffPolicy in multi-threaded steps and remote-partitioned jobs.
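The jitter idea in isolation, as a plain-Java sketch (jitteredDelay is a made-up helper; Spring Retry's actual randomization differs in detail):

```java
import java.util.concurrent.ThreadLocalRandom;

public class JitterSketch {

    // Draw the delay uniformly between the current interval and
    // interval * multiplier (both capped at max), so concurrent workers
    // spread out instead of retrying in lockstep.
    static long jitteredDelay(long interval, double multiplier, long max) {
        long lower = Math.min(interval, max);
        long upper = Math.min((long) (interval * multiplier), max);
        return lower + (long) (ThreadLocalRandom.current().nextDouble() * (upper - lower));
    }
}
```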
Applying BackOffPolicy to a step
return new StepBuilder("importOrdersStep", jobRepository)
.<Order, Order>chunk(100, tx)
.reader(reader).writer(writer)
.faultTolerant()
.retry(DeadlockLoserDataAccessException.class)
.retryLimit(5)
.backOffPolicy(backOff)
.build();
Retrying MySQL Deadlocks
MySQL deadlocks are the most common transient error in batch jobs. Spring wraps MySQL error code 1213 as DeadlockLoserDataAccessException.
@Bean
public Step processOrdersStep(JobRepository jobRepository,
PlatformTransactionManager tx,
JdbcPagingItemReader<Order> reader,
JdbcBatchItemWriter<ProcessedOrder> writer) {
ExponentialBackOffPolicy backOff = new ExponentialBackOffPolicy();
backOff.setInitialInterval(100);
backOff.setMultiplier(2.0);
backOff.setMaxInterval(5_000);
return new StepBuilder("processOrdersStep", jobRepository)
.<Order, ProcessedOrder>chunk(200, tx)
.reader(reader)
.processor(orderProcessor())
.writer(writer)
.faultTolerant()
.retry(DeadlockLoserDataAccessException.class)
.retryLimit(3)
.backOffPolicy(backOff)
.build();
}
To also handle lock wait timeouts (MySQL error 1205):
.retry(DeadlockLoserDataAccessException.class)
.retry(PessimisticLockingFailureException.class) // parent of CannotAcquireLockException, which covers lock wait timeouts
.retryLimit(3)
noRollback — Skip the Rollback for Certain Exceptions
By default, any exception during a chunk causes a full transaction rollback. For exceptions that occur during reading (before any write happened), the rollback is unnecessary overhead.
return new StepBuilder("step", jobRepository)
.<Order, Order>chunk(100, tx)
.reader(reader).writer(writer)
.faultTolerant()
.noRollback(FlatFileParseException.class) // no DB write happened — skip rollback
.skip(FlatFileParseException.class)
.skipLimit(100)
.build();
noRollback is ignored during write exceptions — if a write already happened, you must roll it back to maintain consistency.
RetryListener — Log Every Attempt
@Bean
public RetryListener orderRetryListener() {
return new RetryListener() {
@Override
public <T, E extends Throwable> boolean open(RetryContext context,
RetryCallback<T, E> callback) {
return true; // return false to skip the retry entirely
}
@Override
public <T, E extends Throwable> void onError(RetryContext context,
RetryCallback<T, E> callback,
Throwable throwable) {
log.warn("Retry attempt {} failed: {} — {}",
context.getRetryCount(),
throwable.getClass().getSimpleName(),
throwable.getMessage());
}
@Override
public <T, E extends Throwable> void close(RetryContext context,
RetryCallback<T, E> callback,
Throwable throwable) {
if (throwable != null) {
log.error("All retry attempts exhausted after {} tries",
context.getRetryCount());
} else if (context.getRetryCount() > 0) {
log.info("Succeeded on attempt {}", context.getRetryCount() + 1);
}
}
};
}
Register on the step:
.faultTolerant()
.retry(DeadlockLoserDataAccessException.class)
.retryLimit(3)
.listener(orderRetryListener())
Step-Level Retry vs @Retryable in Processors
Spring Retry’s @Retryable can annotate methods in an ItemProcessor. Both approaches retry on failure — but they behave differently:
| | Step-level retry | @Retryable in processor |
|---|---|---|
| Scope | Entire chunk (replays cached items: re-process + re-write) | Single method call |
| Transaction rollback | Yes — full chunk rollback | No — the retry happens inside the method, so no exception escapes to trigger a rollback |
| Backoff | Configured on step | Configured on annotation |
| When to use | Writer failures (deadlocks, timeouts during INSERT) | Processor failures (REST API call, external service) |
Use @Retryable in a processor when the failure is in the processor itself (an HTTP call to a pricing API), not the writer:
@Component
public class PriceEnrichmentProcessor implements ItemProcessor<Order, Order> {
private final RestClient pricingClient;
@Retryable(
retryFor = {HttpServerErrorException.class, ResourceAccessException.class},
maxAttempts = 3,
backoff = @Backoff(delay = 500, multiplier = 2.0, maxDelay = 5_000)
)
@Override
public Order process(Order order) throws Exception {
BigDecimal marketPrice = pricingClient.get()
.uri("/price/{productId}", order.getProductId())
.retrieve()
.body(BigDecimal.class);
order.setMarketPrice(marketPrice);
return order;
}
@Recover
public Order recover(HttpServerErrorException ex, Order order) {
log.error("Pricing API unavailable for order {}. Using list price.", order.getOrderId());
order.setMarketPrice(order.getListPrice()); // fall back to list price
return order; // don't filter — use fallback
}
}
Enable @Retryable in Spring Boot:
@SpringBootApplication
@EnableRetry
public class BatchApplication { ... }
Complete Fault-Tolerant Step
@Bean
public Step faultTolerantImportStep(JobRepository jobRepository,
PlatformTransactionManager tx,
FlatFileItemReader<Order> reader,
ItemProcessor<Order, ProcessedOrder> processor,
JdbcBatchItemWriter<ProcessedOrder> writer,
OrderSkipListener skipListener,
RetryListener retryListener) {
ExponentialRandomBackOffPolicy backOff = new ExponentialRandomBackOffPolicy();
backOff.setInitialInterval(200);
backOff.setMultiplier(2.0);
backOff.setMaxInterval(10_000);
SimpleRetryPolicy retryPolicy = new SimpleRetryPolicy(3, Map.of(
DeadlockLoserDataAccessException.class, true,
PessimisticLockingFailureException.class, true,
QueryTimeoutException.class, true,
DataIntegrityViolationException.class, false, // never retry
FlatFileParseException.class, false // never retry
));
return new StepBuilder("faultTolerantImportStep", jobRepository)
.<Order, ProcessedOrder>chunk(200, tx)
.reader(reader)
.processor(processor)
.writer(writer)
.faultTolerant()
.retryPolicy(retryPolicy)
.backOffPolicy(backOff)
.noRollback(FlatFileParseException.class)
.skip(FlatFileParseException.class)
.skip(DataIntegrityViolationException.class)
.skipLimit(500)
.listener(skipListener)
.listener(retryListener)
.build();
}
Key Takeaways
- Always classify exceptions as transient (retry) or fatal (skip/fail) before configuring retry.
- Use ExponentialRandomBackOffPolicy for multi-threaded or distributed jobs — random jitter prevents a thundering herd.
- retryLimit is the total number of attempts (initial + retries): .retryLimit(3) = 1 initial + 2 retries.
- noRollback avoids a pointless transaction rollback for exceptions where no write occurred (like read-time parse errors).
- @Retryable in a processor retries at the method level without transaction rollback — use it for external API calls. Use step-level retry for database write failures.
- Always register a RetryListener in production so you have visibility into retry storms.
What’s Next
Article 18 covers skip logic in depth — custom SkipPolicy, dead-letter patterns, stopping a job intentionally, and designing jobs that restart cleanly after failure.