- 0:00 Introduction and the problem statement
- 3:42 Docker Swarm fundamentals
- 11:15 Configuring Traefik as a reverse proxy
- 19:30 Rolling update strategy
- 27:08 Health checks and readiness probes
- 34:20 Automatic rollback on failure
- 39:45 Monitoring and wrap-up
There is a moment in every production deployment where you hold your breath. The old containers are draining, the new ones are spinning up, and somewhere in between, real users are clicking buttons that resolve to endpoints that may or may not exist for the next three seconds.
I have been chasing the zero-downtime dragon for years. Not theoretically; practically. On systems that handle real traffic, with real SLAs, where "a few dropped requests" means an angry Slack message from someone three levels above you in the org chart. This episode is the distillation of everything I have learned.
WHY DOCKER SWARM IN 2026
I know what you are thinking. Kubernetes won. The war is over. And you are right, mostly. But there is a class of deployment that Kubernetes is egregiously over-engineered for, and if you are running a handful of services on two to five nodes, Swarm remains the most elegant solution I have found.
The mental model is simpler. The networking is built in. The learning curve from single-host Docker to Swarm is a gentle slope rather than a cliff face. And critically, for the deployment pattern we are building today, Swarm's rolling update primitive is genuinely well-designed.
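If you have never run Swarm before, the on-ramp really is short. Here is a minimal sketch, assuming a single host and an ordinary compose file already on disk; the stack name is just an example:

# Turn this Docker host into a single-node swarm (it becomes a manager)
docker swarm init

# Deploy a compose file as a Swarm stack
docker stack deploy -c docker-compose.yml mystack

# Confirm the services came up
docker service ls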
THE TRAEFIK CONFIGURATION
Traefik is one of those tools that feels like it was designed specifically for this use case. It watches the Docker socket, discovers services automatically, and reconfigures its routing table when containers come and go. No config reloads. No NGINX templates. It just works.
Here is the core of our Traefik stack definition:
version: "3.8" services: traefik: image: traefik:v3.0 command: - "--providers.swarm.endpoint=unix:///var/run/docker.sock" - "--providers.swarm.exposedByDefault=false" - "--entrypoints.web.address=:80" - "--entrypoints.websecure.address=:443" - "--api.dashboard=true" ports: - "80:80" - "443:443" volumes: - /var/run/docker.sock:/var/run/docker.sock:ro deploy: placement: constraints: - node.role == manager
The key line is providers.swarm.exposedByDefault=false. This means Traefik will only route to services that explicitly opt in via labels. Defense in depth starts at the reverse proxy.
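To bring the proxy up, deploy this file as its own stack. The stack and file names here are just examples:

# Deploy the proxy stack (Traefik lands on a manager node per the constraint above)
docker stack deploy -c traefik-stack.yml traefik

# Verify it is running
docker service ls --filter name=traefik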
ROLLING UPDATE STRATEGY
The deployment strategy is where everything comes together. Docker Swarm supports two update orders: stop-first (default) and start-first. For zero-downtime, we always want start-first. The new container must be healthy before the old one is drained.
services:
  api:
    image: registry.internal/api:${TAG}
    deploy:
      replicas: 3
      update_config:
        parallelism: 1
        delay: 15s
        order: start-first
        failure_action: rollback
        monitor: 30s
      rollback_config:
        parallelism: 0
        order: start-first
      labels:
        - "traefik.enable=true"
        - "traefik.http.routers.api.rule=Host(`api.example.com`)"
        - "traefik.http.services.api.loadbalancer.server.port=3000"
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:3000/health"]
      interval: 10s
      timeout: 5s
      retries: 3
      start_period: 20s
Let me break down the critical parameters:
- parallelism: 1 means we update one replica at a time. Conservative, but safe.
- delay: 15s introduces a cooling-off period between each replica update.
- monitor: 30s watches the new container for 30 seconds after it starts. If it crashes or fails its health check within that window, we trigger an automatic rollback.
- failure_action: rollback is the safety net. No manual intervention required.
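The same knobs exist as flags on docker service update, which is handy when you want to adjust a running service without re-deploying the whole stack. A sketch, using the api service from above:

# Tune the rolling-update behaviour of an existing service in place
docker service update \
  --update-parallelism 1 \
  --update-delay 15s \
  --update-order start-first \
  --update-failure-action rollback \
  --update-monitor 30s \
  api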
The best deployment pipeline is one you can run on a Friday afternoon without checking your phone all weekend. If your system needs you to babysit it, the system is not done yet.
(Something I remind myself constantly.)
THE HEALTH CHECK DEEP DIVE
Health checks are the unsung hero of this entire pattern. A poorly written health check will give you false confidence. A missing health check will give you no confidence at all. Here is what a production-grade health endpoint actually looks like:
const healthCheck = async (req, res) => {
  const checks = {
    uptime: process.uptime(),
    timestamp: Date.now(),
    database: 'unknown',
    redis: 'unknown',
  };

  try {
    await db.query('SELECT 1');
    checks.database = 'connected';
  } catch (err) {
    checks.database = 'disconnected';
    return res.status(503).json(checks);
  }

  try {
    await redis.ping();
    checks.redis = 'connected';
  } catch (err) {
    checks.redis = 'disconnected';
    return res.status(503).json(checks);
  }

  res.status(200).json(checks);
};
Notice that we are not just returning 200. We are verifying that the application can actually reach its dependencies. A container that is running but cannot talk to the database is not healthy. It is a zombie, and zombies should not serve traffic.
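You can poke at both halves of this from the command line. A quick sketch; the container ID is whatever docker ps shows you, and the curl assumes you run it somewhere port 3000 is actually reachable, for example inside the container via docker exec:

# Hit the endpoint the same way the compose healthcheck does; -f makes curl
# exit non-zero on a 503, which is what marks the task unhealthy
docker exec <container-id> curl -f http://localhost:3000/health

# Ask Docker for its own verdict on the container
docker inspect --format '{{.State.Health.Status}}' <container-id>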
MONITORING THE ROLLOUT
You can watch a deployment in real time with a single command. I keep a terminal split open during every deploy showing the service state:
# Watch the update roll through each replica (quote the command so watch
# passes the format string through intact)
watch -n 2 'docker service ps api --format "table {{.ID}}\t{{.Image}}\t{{.CurrentState}}\t{{.Error}}"'

# Or for a quick status check
docker service inspect api --pretty | grep -A 5 "UpdateStatus"
If something goes wrong, you will see the UpdateStatus state flip to rollback_started and Swarm will revert to the previous image automatically. No pages. No panic. Just a clean rollback and a log entry you can investigate on Monday morning.
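Two commands are worth keeping in your back pocket here. The format string only resolves once the service has been updated at least once, and the manual rollback covers the cases the monitor window does not catch:

# Query the rollout state directly (updating, completed, rollback_started, ...)
docker service inspect api --format '{{.UpdateStatus.State}}'

# Force a rollback by hand if you spot trouble yourself
docker service update --rollback api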
WRAPPING UP
This is not a complicated pattern. That is the point. The entire deployment configuration fits in a single compose file. The health check is a dozen lines of code. The rollback is automatic. There is no Helm chart, no custom operator, no cluster to manage.
If you are running a small to medium deployment, somewhere between "I have a Dockerfile" and "we need a platform team," this is the sweet spot. It has served me well across multiple production systems, and I suspect it will serve you well too.
As always, the full source code is linked in the description. Drop a comment if you have questions or if you have found a pattern that works better. I am always learning.