Let me paint you a picture. It is 2:47 AM on a Tuesday. Your phone lights up with a PagerDuty alert. The payment processing service is down. Not because Stripe is down, and not because your code has a bug. It is down because a single network hiccup caused one HTTP request to fail, and your application treated that as a fatal error and gave up.
I have seen this exact scenario play out at three different companies. The fix is always the same: a proper retry mechanism. But "proper" is doing a lot of heavy lifting in that sentence.
The Naive Approach (and Why It Will Hurt You)
The first thing most developers reach for is something like this:
```javascript
// DON'T DO THIS
async function fetchWithRetry(url, retries = 3) {
  for (let i = 0; i < retries; i++) {
    try {
      return await fetch(url);
    } catch (err) {
      if (i === retries - 1) throw err;
    }
  }
}
```
This looks reasonable. It retries three times. If all three fail, it throws. What could go wrong?
Everything. This pattern has three critical flaws:
- No backoff. If the upstream service is struggling under load, hammering it with immediate retries makes the problem worse. You are now part of the thundering herd.
- No jitter. If ten instances of your service all retry at exactly the same intervals, they will all hit the upstream service at the same instant. Synchronized retries are a denial-of-service attack wearing a trench coat.
- No discrimination. A 400 Bad Request is not a transient error. Retrying it will produce the same result every time. You need to distinguish between retryable errors (timeouts, 502s, 503s) and permanent failures (most 4xx responses, validation errors).
The Pattern That Actually Works
Here is the retry utility I have carried from project to project for the past five years. It has survived millions of requests in production across payment processors, third-party APIs, database connections, and message queues.
```javascript
const DEFAULT_OPTS = {
  maxRetries: 3,
  baseDelay: 1000,   // 1 second
  maxDelay: 30000,   // 30 seconds ceiling
  backoffFactor: 2,
  jitter: true,
  isRetryable: (err) => true,
  onRetry: null,
};

async function retry(fn, options = {}) {
  const opts = { ...DEFAULT_OPTS, ...options };
  let lastError;

  for (let attempt = 0; attempt <= opts.maxRetries; attempt++) {
    try {
      return await fn(attempt);
    } catch (err) {
      lastError = err;

      // Don't retry if we've exhausted attempts
      if (attempt >= opts.maxRetries) break;

      // Don't retry non-retryable errors
      if (!opts.isRetryable(err)) break;

      // Calculate delay with exponential backoff
      let delay = Math.min(
        opts.baseDelay * Math.pow(opts.backoffFactor, attempt),
        opts.maxDelay
      );

      // Add jitter: random value between 0 and delay
      if (opts.jitter) {
        delay = Math.random() * delay;
      }

      // Optional retry callback for logging/metrics
      if (opts.onRetry) {
        opts.onRetry(err, attempt + 1, delay);
      }

      await new Promise(r => setTimeout(r, delay));
    }
  }

  throw lastError;
}
```
Let me walk through each decision.
Exponential Backoff
The delay between retries grows exponentially: 1s, 2s, 4s, 8s, and so on. This gives transient failures time to resolve. A service that is briefly overloaded will recover during those longer pauses. A service that is truly down will not waste your resources with rapid-fire retries.
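To make that concrete, here is the schedule the defaults above produce before jitter is applied:

```javascript
// Backoff schedule for baseDelay = 1000, backoffFactor = 2, maxDelay = 30000.
for (let attempt = 0; attempt < 7; attempt++) {
  const delay = Math.min(1000 * Math.pow(2, attempt), 30000);
  console.log(`wait after attempt ${attempt}: ${delay}ms`);
}
// -> 1000, 2000, 4000, 8000, 16000, 30000, 30000 (the cap takes over at attempt 5)
```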
Full Jitter
I use "full jitter" rather than "decorrelated jitter" or "equal jitter." The delay is a random value between zero and the calculated exponential delay. AWS published an excellent analysis of this in their architecture blog, and after testing all three strategies in production, full jitter consistently produced the smoothest load distribution.
"Jitter is not optional. Without it, your retry logic is a synchronized flashmob performing a denial-of-service attack against your own infrastructure."
A lesson I learned the hard way at a previous company
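For reference, here is how the three strategies differ, expressed against an already-capped exponential delay. This is a sketch of the formulas as the AWS post describes them, not part of the utility above:

```javascript
const baseDelay = 1000, backoffFactor = 2, maxDelay = 30000;
const attempt = 3;
const delay = Math.min(baseDelay * Math.pow(backoffFactor, attempt), maxDelay);

// Full jitter (what the utility above uses): anywhere between 0 and the full delay.
const fullJitter = Math.random() * delay;

// Equal jitter: always wait at least half the computed delay.
const equalJitter = delay / 2 + Math.random() * (delay / 2);

// Decorrelated jitter: each sleep derives from the previous sleep, not the attempt:
// sleep = Math.min(maxDelay, baseDelay + Math.random() * (prevSleep * 3 - baseDelay));
```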
The isRetryable Callback
This is the most important piece, and the one most implementations skip entirely. Not all errors are retryable. Here is how I configure it for HTTP-based services:
```javascript
const RETRYABLE_STATUS = new Set([408, 429, 500, 502, 503, 504]);

const isRetryableHttp = (err) => {
  // Network errors are always retryable
  if (err.code === 'ECONNRESET') return true;
  if (err.code === 'ETIMEDOUT') return true;
  if (err.code === 'ECONNREFUSED') return true;

  // HTTP status-based retries
  if (err.status) {
    return RETRYABLE_STATUS.has(err.status);
  }

  return false;
};
```
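One subtlety worth flagging: fetch() resolves rather than rejects on 4xx and 5xx responses, so the err.status check only fires if the function you pass to retry() throws errors that carry the status. A minimal wrapper, assuming a JSON API (fetchJson is my name for it, not a standard):

```javascript
async function fetchJson(url, init) {
  const res = await fetch(url, init);
  if (!res.ok) {
    // Attach the status so isRetryableHttp can discriminate.
    const err = new Error(`HTTP ${res.status} from ${url}`);
    err.status = res.status;
    throw err;
  }
  return res.json();
}
```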
Watch out for 429 (Too Many Requests). If the upstream service includes a Retry-After header, honor it. Your calculated backoff should be the maximum of your exponential delay and the Retry-After value. I have seen teams get their API keys revoked for ignoring rate limit headers.
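A sketch of that rule. The err.headers shape is an assumption about your HTTP client; adapt the lookup to wherever your errors surface response headers:

```javascript
// Parse Retry-After, which may be delta-seconds or an HTTP date, into milliseconds.
function retryAfterMs(err) {
  const header = err.headers?.['retry-after'];
  if (!header) return 0;
  const seconds = Number(header);
  if (!Number.isNaN(seconds)) return seconds * 1000;
  const date = Date.parse(header);
  return Number.isNaN(date) ? 0 : Math.max(0, date - Date.now());
}

// Inside the retry loop, take whichever signal asks for the longer wait:
// delay = Math.max(delay, retryAfterMs(err));
```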
The onRetry Hook
Never retry silently. Every retry should produce a log entry or a metric. In production, I wire this into our structured logging pipeline:
```javascript
const result = await retry(
  () => stripe.charges.create(chargeData),
  {
    maxRetries: 4,
    isRetryable: isRetryableHttp,
    onRetry: (err, attempt, delay) => {
      logger.warn('Payment retry', {
        attempt,
        delay: Math.round(delay),
        error: err.message,
        chargeId: chargeData.metadata.orderId,
      });
    },
  }
);
```
Those log entries have saved me more than once. When a partner API starts degrading, you see retry counts climbing in your dashboards before the errors become outright failures. It is your early warning system.
The Ceiling That Saves You
The maxDelay option is a safety valve. Without it, exponential backoff grows unbounded: 1s, 2s, 4s, 8s, 16s, 32s, 64s, 128s... At some point you are not retrying, you are just waiting around hoping something changes. Thirty seconds is my default ceiling, but I adjust it based on context. For user-facing requests, I use 5 seconds. For background job processing, 60 seconds is often appropriate.
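As rough illustrations of those contexts (the ceilings come from above; the base delays and retry counts here are my assumptions, not universal truths):

```javascript
// User-facing: fail fast, tiny base delay, 5-second ceiling.
const userFacingOpts = { maxRetries: 2, baseDelay: 250, maxDelay: 5000 };

// Background jobs: more patience, 60-second ceiling.
const backgroundOpts = { maxRetries: 5, baseDelay: 1000, maxDelay: 60000 };
```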
When Not to Retry
Retry logic is not appropriate for every situation. Here are the cases where I explicitly disable it:
- Non-idempotent operations without deduplication. If you cannot safely repeat an operation, retrying it can create duplicate charges, double-sent emails, or duplicate database entries.
- Operations within a database transaction. If the transaction is going to roll back anyway, retrying individual queries inside it is pointless.
- Synchronous user-facing requests with tight SLAs. If a user is watching a spinner and your timeout is 3 seconds, there is no room for retries. Fail fast and show an error.
- Authentication failures. If the upstream API says your token is invalid, sending the same token again will not make it more valid.
Idempotency keys are your friend. For payment APIs and other critical non-idempotent operations, use idempotency keys. Stripe, PayPal, and most modern APIs support them. Pass a unique key per operation, and the API guarantees it will only process the operation once, regardless of how many times you send it.
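With Stripe's Node library, for instance, the key rides along as a per-request option. The key derivation here is illustrative; anything stable and unique per logical operation works:

```javascript
// Stripe deduplicates requests that arrive with the same idempotency key,
// so a retried charge is processed at most once.
const charge = await retry(
  () => stripe.charges.create(chargeData, { idempotencyKey: `charge-${order.id}` }),
  { maxRetries: 4, isRetryable: isRetryableHttp }
);
```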
Putting It All Together
Here is a real-world example from a service that processes order fulfillment. It calls an external shipping API that is reliable 99.5% of the time but occasionally hiccups under load:
```javascript
async function createShipment(order) {
  const shipment = await retry(
    async (attempt) => {
      if (attempt > 0) {
        metrics.increment('shipping.retry');
      }
      return shippingClient.create({
        from: warehouse.address,
        to: order.shippingAddress,
        parcels: buildParcels(order.items),
        options: { idempotencyKey: order.id },
      });
    },
    {
      maxRetries: 3,
      baseDelay: 2000,
      maxDelay: 15000,
      isRetryable: isRetryableHttp,
      onRetry: (err, attempt, delay) => {
        logger.warn('Shipping API retry', {
          orderId: order.id,
          attempt,
          nextDelay: Math.round(delay),
          error: err.message,
        });
      },
    }
  );
  return shipment;
}
```
Notice the idempotency key. Even though the shipping API call is not naturally idempotent (creating a shipment is a side effect), the idempotency key ensures that if our retry sends the same request twice, the API only processes it once.
Five Years of Lessons
This pattern has been remarkably stable. I have not had to change the core retry function in years. But the configurations I pass into it have evolved significantly based on operational experience:
- Default to conservative. Three retries with a one-second base delay is right for most situations. If you need more aggressive retries, you should be asking why the upstream service is so unreliable.
- Always cap the delay. Unbounded exponential backoff is a foot gun. Set a ceiling and enforce it.
- Log every retry. You cannot fix what you cannot see. Retry metrics are the canary in your coal mine.
- Test the unhappy path. Your retry tests should verify that non-retryable errors fail immediately, that backoff timing is roughly correct, and that jitter produces non-uniform delays (a sketch follows this list).
- Pair with circuit breakers. Retries handle transient failures. Circuit breakers handle sustained failures. Used together, they form a complete resilience strategy. But that is a topic for another post.
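Here is roughly what the first two of those tests look like with Node's built-in test runner; the shapes of the fake errors are illustrative:

```javascript
const test = require('node:test');
const assert = require('node:assert');

test('non-retryable errors fail immediately', async () => {
  let calls = 0;
  const fatal = Object.assign(new Error('Bad Request'), { status: 400 });
  await assert.rejects(
    retry(async () => { calls++; throw fatal; }, { isRetryable: isRetryableHttp }),
    fatal
  );
  assert.strictEqual(calls, 1); // a 400 never earns a second attempt
});

test('retryable errors use every allowed attempt', async () => {
  let calls = 0;
  const flaky = Object.assign(new Error('Bad Gateway'), { status: 502 });
  await assert.rejects(
    retry(async () => { calls++; throw flaky; },
          { maxRetries: 2, baseDelay: 1, isRetryable: isRetryableHttp })
  );
  assert.strictEqual(calls, 3); // one initial attempt plus two retries
});
```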
"The best error handling code is the code that never has to run. The second best is the code that runs, recovers, and leaves a paper trail."
Build your retry logic once, test it thoroughly, and carry it with you. The next 2:47 AM alert might never come.