CHAPTERS
  • 0:00 Introduction and the problem statement
  • 3:42 Docker Swarm fundamentals
  • 11:15 Configuring Traefik as a reverse proxy
  • 19:30 Rolling update strategy
  • 27:08 Health checks and readiness probes
  • 34:20 Automatic rollback on failure
  • 39:45 Monitoring and wrap-up

There is a moment in every production deployment where you hold your breath. The old containers are draining, the new ones are spinning up, and somewhere in between, real users are clicking buttons that resolve to endpoints that may or may not exist for the next three seconds.

I have been chasing the zero-downtime dragon for years. Not theoretically; practically. On systems that handle real traffic, with real SLAs, where "a few dropped requests" means an angry Slack message from someone three levels above you in the org chart. This episode is the distillation of everything I have learned.

WHY DOCKER SWARM IN 2026

I know what you are thinking. Kubernetes won. The war is over. And you are right, mostly. But there is a class of deployment that Kubernetes is egregiously over-engineered for, and if you are running a handful of services on two to five nodes, Swarm remains the most elegant solution I have found.

The mental model is simpler. The networking is built in. The learning curve from single-host Docker to Swarm is a gentle slope rather than a cliff face. And critically, for the deployment pattern we are building today, Swarm's rolling update primitive is genuinely well-designed.
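That gentle slope is literal: standing up a swarm is two commands. A minimal sketch, where the IP address and join token are placeholders for your own:

```shell
# On the first node: initialize the swarm; this node becomes a manager
docker swarm init --advertise-addr 10.0.0.1

# swarm init prints a join command with a one-time token; run it on each
# worker node (the token below is a placeholder)
docker swarm join --token SWMTKN-1-<token> 10.0.0.1:2377

# Back on the manager: confirm every node shows up as Ready
docker node ls
```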

[Architecture diagram] The target architecture: Traefik sits at the edge, routing to service replicas managed by Swarm's orchestrator.

THE TRAEFIK CONFIGURATION

Traefik is one of those tools that feels like it was designed specifically for this use case. It watches the Docker socket, discovers services automatically, and reconfigures its routing table when containers come and go. No config reloads. No NGINX templates. It just works.

Here is the core of our Traefik stack definition:

YAML docker-compose.traefik.yml
version: "3.8"

services:
  traefik:
    image: traefik:v3.0
    command:
      - "--providers.swarm.endpoint=unix:///var/run/docker.sock"
      - "--providers.swarm.exposedByDefault=false"
      - "--entrypoints.web.address=:80"
      - "--entrypoints.websecure.address=:443"
      - "--api.dashboard=true"
    ports:
      - "80:80"
      - "443:443"
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock:ro
    deploy:
      placement:
        constraints:
          - node.role == manager

The key line is providers.swarm.exposedByDefault=false. This means Traefik will only route to services that explicitly opt in via labels. Defense in depth starts at the reverse proxy.
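Deploying this requires Traefik and the application services to share an overlay network. A sketch of the bootstrap, with the caveat that the network name proxy and stack name are my own conventions, not part of the compose file above (which would also need a matching networks: block):

```shell
# Create a shared overlay network that Traefik and the apps both attach to
# ("proxy" is an arbitrary name; declare it in each compose file's networks:)
docker network create --driver overlay --attachable proxy

# Deploy the Traefik stack; stack deploy must run on a manager node
docker stack deploy -c docker-compose.traefik.yml traefik
```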

ROLLING UPDATE STRATEGY

The deployment strategy is where everything comes together. Docker Swarm supports two update orders: stop-first (default) and start-first. For zero-downtime, we always want start-first. The new container must be healthy before the old one is drained.

YAML docker-compose.app.yml
services:
  api:
    image: registry.internal/api:${TAG}
    deploy:
      replicas: 3
      update_config:
        parallelism: 1
        delay: 15s
        order: start-first
        failure_action: rollback
        monitor: 30s
      rollback_config:
        parallelism: 0
        order: start-first
      labels:
        - "traefik.enable=true"
        - "traefik.http.routers.api.rule=Host(`api.example.com`)"
        - "traefik.http.services.api.loadbalancer.server.port=3000"
    healthcheck:
      # Note: curl must exist inside the image; if your base image does not
      # ship it, swap in wget or a small dedicated probe binary
      test: ["CMD", "curl", "-f", "http://localhost:3000/health"]
      interval: 10s
      timeout: 5s
      retries: 3
      start_period: 20s

Let me break down the critical parameters:

  • parallelism: 1 updates one replica at a time, so two-thirds of capacity is always serving traffic.
  • delay: 15s pauses between replicas, giving each new task time to settle before the next one moves.
  • order: start-first starts the replacement task before stopping the old one, which is the whole zero-downtime trick.
  • failure_action: rollback tells Swarm to revert automatically instead of pausing and waiting for a human.
  • monitor: 30s is the window after each task update during which a failure counts against the deploy.
  • In rollback_config, parallelism: 0 reverts every task at once. When things are broken, speed matters more than caution.
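With that configuration in place, a deploy is a single command. The stack name api-stack and the tag are illustrative; Swarm diffs the desired state against what is running and only touches what changed:

```shell
# Roll out a new image tag; the ${TAG} in the compose file is substituted
# from the environment at deploy time
TAG=v1.4.2 docker stack deploy -c docker-compose.app.yml api-stack

# Rerunning with the same TAG is a no-op, which makes deploys idempotent
```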

The best deployment pipeline is one you can run on a Friday afternoon without checking your phone all weekend. If your system needs you to babysit it, the system is not done yet.

Something I remind myself constantly

THE HEALTH CHECK DEEP DIVE

Health checks are the unsung hero of this entire pattern. A poorly written health check will give you false confidence. A missing health check will give you no confidence at all. Here is what a production-grade health endpoint actually looks like:

JavaScript src/routes/health.js
// Module paths for the shared clients are assumed; adjust to your project
const db = require('../db');       // e.g. a pg Pool instance
const redis = require('../redis'); // e.g. an ioredis client

const healthCheck = async (req, res) => {
  const checks = {
    uptime: process.uptime(),
    timestamp: Date.now(),
    database: 'unknown',
    redis: 'unknown',
  };

  // A node that cannot reach Postgres should not receive traffic
  try {
    await db.query('SELECT 1');
    checks.database = 'connected';
  } catch (err) {
    checks.database = 'disconnected';
    return res.status(503).json(checks);
  }

  // Same rule for the cache layer
  try {
    await redis.ping();
    checks.redis = 'connected';
  } catch (err) {
    checks.redis = 'disconnected';
    return res.status(503).json(checks);
  }

  res.status(200).json(checks);
};

module.exports = healthCheck;

Notice that we are not just returning 200. We are verifying that the application can actually reach its dependencies. A container that is running but cannot talk to the database is not healthy. It is a zombie, and zombies should not serve traffic.
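Before trusting the endpoint in a deploy, it is worth probing it by hand from inside the container network. A quick sketch; the port matches the compose file above, but the database container name is a placeholder:

```shell
# Exercise the health endpoint the same way the healthcheck does
curl -fsS http://localhost:3000/health

# Simulate the failure path: stop the database and confirm you get a 503,
# not a cheerful 200 from a zombie ("postgres" is a placeholder name)
docker stop postgres
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:3000/health
```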

[Demo video, 8:42] Watch the live demo.

MONITORING THE ROLLOUT

You can watch a deployment in real time with a single command. I keep a terminal split open during every deploy showing the service state:

Bash Terminal
# Watch the update roll through each replica
watch -n 2 docker service ps api \
  --format "table {{.ID}}\t{{.Image}}\t{{.CurrentState}}\t{{.Error}}"

# Or for a quick status check
docker service inspect api \
  --pretty | grep -A 5 "UpdateStatus"

If something goes wrong, you will see the update state flip to rolling_back and Swarm will revert to the previous image automatically. No pages. No panic. Just a clean rollback and a log entry you can investigate on Monday morning.
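Automatic rollback covers the common case, but sometimes you spot a problem the health check cannot see, a subtle latency regression, say. You can trigger the same rollback by hand:

```shell
# Manually revert the service to its previous spec
docker service update --rollback api

# Review recent task history to see exactly what ran, and when
docker service ps api --no-trunc
```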

[Monitoring dashboard] Grafana dashboard showing request latency during a rolling update. The brief dip corresponds to the drain period, but zero requests were dropped.

WRAPPING UP

This is not a complicated pattern. That is the point. The entire deployment configuration fits in a single compose file. The health check is a dozen lines of code. The rollback is automatic. There is no Helm chart, no custom operator, no cluster to manage.

If you are running a small to medium deployment, somewhere between "I have a Dockerfile" and "we need a platform team," this is the sweet spot. It has served me well across multiple production systems, and I suspect it will serve you well too.

As always, the full source code is linked in the description. Drop a comment if you have questions or if you have found a pattern that works better. I am always learning.