A Retry Pattern That Actually Works in Production

Most retry implementations I see in the wild are either too naive or too clever by half. After five years of battle testing, here is the pattern I reach for every time, and the hard-won lessons behind each decision.

Let me paint you a picture. It is 2:47 AM on a Tuesday. Your phone lights up with a PagerDuty alert. The payment processing service is down. Not because Stripe is down, and not because your code has a bug. It is down because a single network hiccup caused one HTTP request to fail, and your application treated that as a fatal error and gave up.

I have seen this exact scenario play out at three different companies. The fix is always the same: a proper retry mechanism. But "proper" is doing a lot of heavy lifting in that sentence.

The Naive Approach (and Why It Will Hurt You)

The first thing most developers reach for is something like this:

JavaScript
// DON'T DO THIS
async function fetchWithRetry(url, retries = 3) {
  for (let i = 0; i < retries; i++) {
    try {
      return await fetch(url);
    } catch (err) {
      if (i === retries - 1) throw err;
    }
  }
}

This looks reasonable. It retries three times. If all three fail, it throws. What could go wrong?

Everything. This pattern has three critical flaws:

  1. No delay between attempts. The instant a request fails, the next one fires, hammering a service that is likely already struggling.
  2. No jitter. Every client that saw the failure retries on the same schedule, so the retries arrive in synchronized waves instead of being spread out.
  3. It retries everything. A 400 Bad Request or an authentication failure will never succeed on a second attempt, yet this code retries it anyway and buries the real error.

Diagram showing exponential backoff with jitter compared to naive retry timing
Exponential backoff with jitter spreads retry attempts over time, reducing the chance of synchronized load spikes.

The Pattern That Actually Works

Here is the retry utility I have carried from project to project for the past five years. It has survived millions of requests in production across payment processors, third-party APIs, database connections, and message queues.

JavaScript
const DEFAULT_OPTS = {
  maxRetries:    3,
  baseDelay:     1000,      // 1 second
  maxDelay:      30000,     // 30 seconds ceiling
  backoffFactor: 2,
  jitter:        true,
  isRetryable:   (err) => true,
  onRetry:       null,
};

async function retry(fn, options = {}) {
  const opts = { ...DEFAULT_OPTS, ...options };
  let lastError;

  for (let attempt = 0; attempt <= opts.maxRetries; attempt++) {
    try {
      return await fn(attempt);
    } catch (err) {
      lastError = err;

      // Don't retry if we've exhausted attempts
      if (attempt >= opts.maxRetries) break;

      // Don't retry non-retryable errors
      if (!opts.isRetryable(err)) break;

      // Calculate delay with exponential backoff
      let delay = Math.min(
        opts.baseDelay * Math.pow(opts.backoffFactor, attempt),
        opts.maxDelay
      );

      // Add jitter: random value between 0 and delay
      if (opts.jitter) {
        delay = Math.random() * delay;
      }

      // Optional retry callback for logging/metrics
      if (opts.onRetry) {
        opts.onRetry(err, attempt + 1, delay);
      }

      await new Promise(r => setTimeout(r, delay));
    }
  }

  throw lastError;
}

Let me walk through each decision.

Exponential Backoff

The delay between retries grows exponentially: 1s, 2s, 4s, 8s, and so on. This gives transient failures time to resolve. A service that is briefly overloaded will recover during those longer pauses. A service that is truly down will not waste your resources with rapid-fire retries.
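
With the defaults above (a 1 second base delay, a backoff factor of 2, a 30 second ceiling), the pre-jitter schedule is easy to sanity-check with a quick throwaway sketch like this one (backoffSchedule is not part of the utility, just an illustration):

JavaScript
// Print the pre-jitter backoff schedule for a given configuration
function backoffSchedule({ baseDelay = 1000, backoffFactor = 2, maxDelay = 30000, maxRetries = 3 } = {}) {
  return Array.from({ length: maxRetries }, (_, attempt) =>
    Math.min(baseDelay * Math.pow(backoffFactor, attempt), maxDelay)
  );
}

console.log(backoffSchedule());                  // [ 1000, 2000, 4000 ]
console.log(backoffSchedule({ maxRetries: 8 })); // [ 1000, 2000, 4000, 8000, 16000, 30000, 30000, 30000 ]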

Full Jitter

I use "full jitter" rather than "decorrelated jitter" or "equal jitter." The delay is a random value between zero and the calculated exponential delay. AWS published an excellent analysis of this in their architecture blog, and after testing all three strategies in production, full jitter consistently produced the smoothest load distribution.

"Jitter is not optional. Without it, your retry logic is a synchronized flashmob performing a denial-of-service attack against your own infrastructure."

A lesson I learned the hard way at a previous company

The isRetryable Callback

This is the most important piece, and the one most implementations skip entirely. Not all errors are retryable. Here is how I configure it for HTTP-based services:

JavaScript
const RETRYABLE_STATUS = new Set([408, 429, 500, 502, 503, 504]);

const isRetryableHttp = (err) => {
  // Network errors are always retryable
  if (err.code === 'ECONNRESET')  return true;
  if (err.code === 'ETIMEDOUT')   return true;
  if (err.code === 'ECONNREFUSED') return true;

  // HTTP status-based retries
  if (err.status) {
    return RETRYABLE_STATUS.has(err.status);
  }

  return false;
};

Watch out for 429 (Too Many Requests). If the upstream service includes a Retry-After header, honor it. Your calculated backoff should be the maximum of your exponential delay and the Retry-After value. I have seen teams get their API keys revoked for ignoring rate limit headers.
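
Here is a sketch of what honoring Retry-After can look like. It assumes the error object exposes the response headers as err.headers, which varies by HTTP client, and retryAfterMs is an illustrative helper rather than part of the utility above:

JavaScript
// Parse a Retry-After header (seconds or an HTTP date) into milliseconds.
const retryAfterMs = (err) => {
  const value = err.headers && err.headers['retry-after'];
  if (!value) return 0;
  const seconds = Number(value);
  if (!Number.isNaN(seconds)) return seconds * 1000;
  const date = Date.parse(value);
  return Number.isNaN(date) ? 0 : Math.max(0, date - Date.now());
};

// Then, inside the retry loop (after the jitter step), the delay becomes:
//   delay = Math.max(delay, retryAfterMs(err));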

The onRetry Hook

Never retry silently. Every retry should produce a log entry or a metric. In production, I wire this into our structured logging pipeline:

JavaScript
const result = await retry(
  () => stripe.charges.create(chargeData),
  {
    maxRetries: 4,
    isRetryable: isRetryableHttp,
    onRetry: (err, attempt, delay) => {
      logger.warn('Payment retry', {
        attempt,
        delay: Math.round(delay),
        error: err.message,
        orderId: chargeData.metadata.orderId,
      });
    },
  }
);

Those log entries have saved me more than once. When a partner API starts degrading, you see retry counts climbing in your dashboards before the errors become outright failures. It is your early warning system.

The Ceiling That Saves You

The maxDelay option is a safety valve. Without it, exponential backoff grows unbounded: 1s, 2s, 4s, 8s, 16s, 32s, 64s, 128s... At some point you are not retrying, you are just waiting around hoping something changes. Thirty seconds is my default ceiling, but I adjust it based on context. For user-facing requests, I use 5 seconds. For background job processing, 60 seconds is often appropriate.
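
In practice I keep a couple of named presets around rather than tuning numbers inline. The ceilings below come straight from the guidance above; the userFacing and background wrappers themselves are just illustrative, and every unspecified option falls back to DEFAULT_OPTS:

JavaScript
// Context-specific ceilings; other options fall back to DEFAULT_OPTS
const userFacing = (fn) => retry(fn, { maxDelay: 5000,  isRetryable: isRetryableHttp });
const background = (fn) => retry(fn, { maxDelay: 60000, isRetryable: isRetryableHttp });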

Monitoring dashboard showing retry metrics over time
A Grafana dashboard tracking retry counts, success rates, and average delays across services. The spike at 14:30 correlates with a brief upstream outage that resolved without any customer impact.

When Not to Retry

Retry logic is not appropriate for every situation. Here are the cases where I explicitly disable it:

  1. Non-idempotent operations without idempotency keys. If a duplicate request could charge a card twice or create a second shipment, do not retry until you have an idempotency key in place.
  2. Client errors that cannot succeed. A 400, 401, or 422 means the request itself is wrong. Retrying only delays the real fix, which is why isRetryableHttp returns false for them.

💡 Idempotency keys are your friend. For payment APIs and other critical non-idempotent operations, use idempotency keys. Stripe, PayPal, and most modern APIs support them. Pass a unique key per operation, and the API guarantees it will only process the operation once, regardless of how many times you send it.
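
With the stripe-node client, for example, the key rides along in the per-request options. This is a minimal sketch: order.id stands in for whatever uniquely identifies the logical operation in your system.

JavaScript
// The idempotency key makes a retried create() a no-op on Stripe's side
// instead of a second charge.
const charge = await retry(
  () => stripe.charges.create(chargeData, { idempotencyKey: order.id }),
  { maxRetries: 4, isRetryable: isRetryableHttp }
);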

Putting It All Together

Here is a real-world example from a service that processes order fulfillment. It calls an external shipping API that is reliable 99.5% of the time but occasionally hiccups under load:

JavaScript
async function createShipment(order) {
  const shipment = await retry(
    async (attempt) => {
      if (attempt > 0) {
        metrics.increment('shipping.retry');
      }

      return shippingClient.create({
        from:    warehouse.address,
        to:      order.shippingAddress,
        parcels: buildParcels(order.items),
        options: { idempotencyKey: order.id },
      });
    },
    {
      maxRetries:    3,
      baseDelay:     2000,
      maxDelay:      15000,
      isRetryable:   isRetryableHttp,
      onRetry: (err, attempt, delay) => {
        logger.warn('Shipping API retry', {
          orderId: order.id,
          attempt,
          nextDelay: Math.round(delay),
          error: err.message,
        });
      },
    }
  );

  return shipment;
}

Notice the idempotency key. Even though the shipping API call is not naturally idempotent (creating a shipment is a side effect), the idempotency key ensures that if our retry sends the same request twice, the API only processes it once.


Five Years of Lessons

This pattern has been remarkably stable. I have not had to change the core retry function in years. But the configurations I pass into it have evolved significantly based on operational experience:

  1. Default to conservative. Three retries with a one-second base delay is right for most situations. If you need more aggressive retries, you should be asking why the upstream service is so unreliable.
  2. Always cap the delay. Unbounded exponential backoff is a foot gun. Set a ceiling and enforce it.
  3. Log every retry. You cannot fix what you cannot see. Retry metrics are the canary in your coal mine.
  4. Test the unhappy path. Your retry tests should verify that non-retryable errors fail immediately, that backoff timing is roughly correct, and that jitter produces non-uniform delays. A minimal example follows this list.
  5. Pair with circuit breakers. Retries handle transient failures. Circuit breakers handle sustained failures. Used together, they form a complete resilience strategy. But that is a topic for another post.
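
For item 4, an unhappy-path test might look like the sketch below, assuming a Jest-style runner and the retry and isRetryableHttp definitions above:

JavaScript
// Non-retryable errors should fail on the first attempt, with no delay
test('non-retryable errors fail immediately', async () => {
  const err = Object.assign(new Error('bad request'), { status: 400 });
  const fn = jest.fn().mockRejectedValue(err);

  await expect(retry(fn, { isRetryable: isRetryableHttp })).rejects.toBe(err);
  expect(fn).toHaveBeenCalledTimes(1);
});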

"The best error handling code is the code that never has to run. The second best is the code that runs, recovers, and leaves a paper trail."

Build your retry logic once, test it thoroughly, and carry it with you. The next 2:47 AM alert might never come.