CHAPTERS
  • 0:00 Introduction and the problem statement
  • 3:42 Docker Swarm fundamentals
  • 11:15 Configuring Traefik as a reverse proxy
  • 19:30 Rolling update strategy
  • 27:08 Health checks and readiness probes
  • 34:20 Automatic rollback on failure
  • 39:45 Monitoring and wrap-up

There is a moment in every production deployment where you hold your breath. The old containers are draining, the new ones are spinning up, and somewhere in between, real users are clicking buttons that resolve to endpoints that may or may not exist for the next three seconds.

I have been chasing the zero-downtime dragon for years. Not theoretically; practically. On systems that handle real traffic, with real SLAs, where "a few dropped requests" means an angry Slack message from someone three levels above you in the org chart. This episode is the distillation of everything I have learned.

WHY DOCKER SWARM IN 2026

I know what you are thinking. Kubernetes won. The war is over. And you are right, mostly. But there is a class of deployment that Kubernetes is egregiously over-engineered for, and if you are running a handful of services on two to five nodes, Swarm remains the most elegant solution I have found.

The mental model is simpler. The networking is built in. The learning curve from single-host Docker to Swarm is a gentle slope rather than a cliff face. And critically, for the deployment pattern we are building today, Swarm's rolling update primitive is genuinely well-designed.
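That gentle slope is literal: standing up a swarm is two commands. A minimal sketch, where the IP address and join token are placeholders for your own:

```shell
# On the first node: initialize the swarm; this node becomes a manager
docker swarm init --advertise-addr 10.0.0.1

# swarm init prints a join command with a one-time token; run it on each
# worker node (the token below is a placeholder)
docker swarm join --token SWMTKN-1-<token> 10.0.0.1:2377

# Back on the manager: confirm every node shows up as Ready
docker node ls
```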

[Architecture diagram] The target architecture: Traefik sits at the edge, routing to service replicas managed by Swarm's orchestrator.

THE TRAEFIK CONFIGURATION

Traefik is one of those tools that feels like it was designed specifically for this use case. It watches the Docker socket, discovers services automatically, and reconfigures its routing table when containers come and go. No config reloads. No NGINX templates. It just works.

Here is the core of our Traefik stack definition:

YAML docker-compose.traefik.yml
version: "3.8"

services:
  traefik:
    image: traefik:v3.0
    command:
      - "--providers.swarm.endpoint=unix:///var/run/docker.sock"
      - "--providers.swarm.exposedByDefault=false"
      - "--entrypoints.web.address=:80"
      - "--entrypoints.websecure.address=:443"
      - "--api.dashboard=true"
    ports:
      - "80:80"
      - "443:443"
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock:ro
    deploy:
      placement:
        constraints:
          - node.role == manager

The key line is providers.swarm.exposedByDefault=false. This means Traefik will only route to services that explicitly opt in via labels. Defense in depth starts at the reverse proxy.
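Deploying this requires Traefik and the application services to share an overlay network. A sketch of the bootstrap, with the caveat that the network name proxy and stack name are my own conventions, not part of the compose file above (which would also need a matching networks: block):

```shell
# Create a shared overlay network that Traefik and the apps both attach to
# ("proxy" is an arbitrary name; declare it in each compose file's networks:)
docker network create --driver overlay --attachable proxy

# Deploy the Traefik stack; stack deploy must run on a manager node
docker stack deploy -c docker-compose.traefik.yml traefik
```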

ROLLING UPDATE STRATEGY

The deployment strategy is where everything comes together. Docker Swarm supports two update orders: stop-first (default) and start-first. For zero-downtime, we always want start-first. The new container must be healthy before the old one is drained.

YAML docker-compose.app.yml
services:
  api:
    image: registry.internal/api:${TAG}
    deploy:
      replicas: 3
      update_config:
        parallelism: 1
        delay: 15s
        order: start-first
        failure_action: rollback
        monitor: 30s
      rollback_config:
        parallelism: 0
        order: start-first
      labels:
        - "traefik.enable=true"
        - "traefik.http.routers.api.rule=Host(`api.example.com`)"
        - "traefik.http.services.api.loadbalancer.server.port=3000"
    healthcheck:
      # Note: curl must exist inside the image; if your base image does not
      # ship it, swap in wget or a small dedicated probe binary
      test: ["CMD", "curl", "-f", "http://localhost:3000/health"]
      interval: 10s
      timeout: 5s
      retries: 3
      start_period: 20s

Let me break down the critical parameters:

  • parallelism: 1 updates one replica at a time, so two-thirds of capacity is always serving traffic.
  • delay: 15s pauses between replicas, giving each new task time to settle before the next one moves.
  • order: start-first starts the replacement task before stopping the old one, which is the whole zero-downtime trick.
  • failure_action: rollback tells Swarm to revert automatically instead of pausing and waiting for a human.
  • monitor: 30s is the window after each task update during which a failure counts against the deploy.
  • In rollback_config, parallelism: 0 reverts every task at once. When things are broken, speed matters more than caution.
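With that configuration in place, a deploy is a single command. The stack name api-stack and the tag are illustrative; Swarm diffs the desired state against what is running and only touches what changed:

```shell
# Roll out a new image tag; the ${TAG} in the compose file is substituted
# from the environment at deploy time
TAG=v1.4.2 docker stack deploy -c docker-compose.app.yml api-stack

# Rerunning with the same TAG is a no-op, which makes deploys idempotent
```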

The best deployment pipeline is one you can run on a Friday afternoon without checking your phone all weekend. If your system needs you to babysit it, the system is not done yet.

Something I remind myself constantly

THE HEALTH CHECK DEEP DIVE

Health checks are the unsung hero of this entire pattern. A poorly written health check will give you false confidence. A missing health check will give you no confidence at all. Here is what a production-grade health endpoint actually looks like:

JavaScript src/routes/health.js
// Module paths for the shared clients are assumed; adjust to your project
const db = require('../db');       // e.g. a pg Pool instance
const redis = require('../redis'); // e.g. an ioredis client

const healthCheck = async (req, res) => {
  const checks = {
    uptime: process.uptime(),
    timestamp: Date.now(),
    database: 'unknown',
    redis: 'unknown',
  };

  // A node that cannot reach Postgres should not receive traffic
  try {
    await db.query('SELECT 1');
    checks.database = 'connected';
  } catch (err) {
    checks.database = 'disconnected';
    return res.status(503).json(checks);
  }

  // Same rule for the cache layer
  try {
    await redis.ping();
    checks.redis = 'connected';
  } catch (err) {
    checks.redis = 'disconnected';
    return res.status(503).json(checks);
  }

  res.status(200).json(checks);
};

module.exports = healthCheck;

Notice that we are not just returning 200. We are verifying that the application can actually reach its dependencies. A container that is running but cannot talk to the database is not healthy. It is a zombie, and zombies should not serve traffic.
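Before trusting the endpoint in a deploy, it is worth probing it by hand from inside the container network. A quick sketch; the port matches the compose file above, but the database container name is a placeholder:

```shell
# Exercise the health endpoint the same way the healthcheck does
curl -fsS http://localhost:3000/health

# Simulate the failure path: stop the database and confirm you get a 503,
# not a cheerful 200 from a zombie ("postgres" is a placeholder name)
docker stop postgres
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:3000/health
```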

[Demo video, 8:42] Watch the live demo.

MONITORING THE ROLLOUT

You can watch a deployment in real time with a single command. I keep a terminal split open during every deploy showing the service state:

Bash Terminal
# Watch the update roll through each replica
watch -n 2 docker service ps api \
  --format "table {{.ID}}\t{{.Image}}\t{{.CurrentState}}\t{{.Error}}"

# Or for a quick status check
docker service inspect api \
  --pretty | grep -A 5 "UpdateStatus"

If something goes wrong, you will see the update state flip to rolling_back and Swarm will revert to the previous image automatically. No pages. No panic. Just a clean rollback and a log entry you can investigate on Monday morning.
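Automatic rollback covers the common case, but sometimes you spot a problem the health check cannot see, a subtle latency regression, say. You can trigger the same rollback by hand:

```shell
# Manually revert the service to its previous spec
docker service update --rollback api

# Review recent task history to see exactly what ran, and when
docker service ps api --no-trunc
```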

[Monitoring dashboard] Grafana dashboard showing request latency during a rolling update. The brief dip corresponds to the drain period, but zero requests were dropped.

WRAPPING UP

This is not a complicated pattern. That is the point. The entire deployment configuration fits in a single compose file. The health check is a dozen lines of code. The rollback is automatic. There is no Helm chart, no custom operator, no cluster to manage.

If you are running a small to medium deployment, somewhere between "I have a Dockerfile" and "we need a platform team," this is the sweet spot. It has served me well across multiple production systems, and I suspect it will serve you well too.

As always, the full source code is linked in the description. Drop a comment if you have questions or if you have found a pattern that works better. I am always learning.