flackey@devbox:~/blog$ cat self-healing-docker-pipeline.log | less

FILE self-healing-docker-pipeline.log
AUTHOR flackey (Fred Lackey)
CREATED 2026-02-18T09:14:22-05:00
MODIFIED 2026-02-18T22:47:08-05:00
SIZE 12,847 bytes
TAGS devops docker docker-swarm self-healing ci-cd
================================================================
Building a Self-Healing Deployment Pipeline
with Docker Swarm
A field log on building deployments that fix themselves at 3AM so you don't have to.
 ____  ____  ____  ____  ____  ____  ____  ____
||  ||||  ||||  ||||  ||||  ||||  ||||  ||||  ||
||__||||__||||__||||__||||__||||__||||__||||__||
|/__\||/__\||/__\||/__\||/__\||/__\||/__\||/__\|
[2026-02-18T09:14:22-05:00] LOG ENTRY BEGIN
# 01. The 3AM Problem
Every ops engineer has a version of the same story. You're asleep. The phone buzzes. PagerDuty. A service is down. You fumble for your laptop, SSH into the box, and discover that a container got OOM-killed because someone merged a memory leak into the image that was deployed six hours ago. You roll back, restart the service, confirm health, and go back to bed knowing your alarm is in two hours.
I got tired of this story. Not because the incidents were complicated -- most of them were trivially fixable -- but because the fix was always the same: detect the failure, roll back to the last known good image, and restart. If the fix is always the same, the machine should be doing it.
THESIS: If 80% of your production incidents have the same remediation, automate the remediation. Save the pager for the other 20%.
This post documents the self-healing deployment pipeline I built for a fleet of 14 microservices running on Docker Swarm across three VPS nodes. The setup has been running for four months with zero manual rollbacks. Here's how it works.
[figure] fig01-deployment-overview.png -- Architecture overview diagram
 ____  ____  ____  ____  ____  ____  ____  ____
||  ||||  ||||  ||||  ||||  ||||  ||||  ||||  ||
||__||||__||||__||||__||||__||||__||||__||||__||
|/__\||/__\||/__\||/__\||/__\||/__\||/__\||/__\|
[2026-02-18T10:02:47-05:00] LOG ENTRY CONTINUED
# 02. Architecture Overview
The system has three layers. The deploy-agent runs on each Swarm node and handles image pulls, service updates, and health verification. The state-store is a lightweight SQLite database (yes, SQLite -- fight me) that records every deployment: image tag, timestamp, health check result, and whether the deploy was human-initiated or automated. The watchdog daemon runs every 30 seconds and compares current service health against expected baselines.
The key insight is that every deployment is recorded as a state transition. You always know what the "last known good" state was. Rollback isn't a special operation -- it's just a deployment to a previously-recorded state.
interface DeploymentRecord {
  id: string;
  service: string;
  imageTag: string;
  previousTag: string | null;
  timestamp: Date;
  healthStatus: 'pending' | 'healthy' | 'degraded' | 'failed';
  initiator: 'human' | 'ci' | 'watchdog';
  rollbackOf: string | null;
}

async function deploy(
  service: string,
  imageTag: string,
  initiator: DeploymentRecord['initiator']
): Promise<DeploymentRecord> {
  const current = await getCurrentDeployment(service);

  const record: DeploymentRecord = {
    id: generateId(),
    service,
    imageTag,
    previousTag: current?.imageTag ?? null,
    timestamp: new Date(),
    healthStatus: 'pending',
    initiator,
    rollbackOf: null,
  };

  await stateStore.insert(record);
  await swarmServiceUpdate(service, imageTag);
  await waitForConvergence(service, 120_000);

  record.healthStatus = await runHealthChecks(service);
  await stateStore.update(record);

  return record;
}
The waitForConvergence function is crucial. Docker Swarm's rolling updates don't complete instantly. We poll docker service ps until all replicas report Running state, with a configurable timeout (default: 2 minutes). If convergence fails, we already know we need to roll back before any health check runs.
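Stripped down, that polling loop looks something like this. A minimal sketch that shells out to the Docker CLI and polls every two seconds (the interval and the CLI-over-SDK choice here are illustrative):

import { execFile } from 'node:child_process';
import { promisify } from 'node:util';

const exec = promisify(execFile);

async function waitForConvergence(service: string, timeoutMs: number): Promise<void> {
  const deadline = Date.now() + timeoutMs;

  while (Date.now() < deadline) {
    // Ask Swarm for the current state of every task that should be running.
    const { stdout } = await exec('docker', [
      'service', 'ps', service,
      '--filter', 'desired-state=running',
      '--format', '{{.CurrentState}}',
    ]);

    const states = stdout.trim().split('\n').filter(Boolean);
    if (states.length > 0 && states.every(s => s.startsWith('Running'))) return;

    await new Promise(resolve => setTimeout(resolve, 2_000));
  }

  throw new Error(`${service} did not converge within ${timeoutMs}ms`);
}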
 ____  ____  ____  ____  ____  ____  ____  ____
||  ||||  ||||  ||||  ||||  ||||  ||||  ||||  ||
||__||||__||||__||||__||||__||||__||||__||||__||
|/__\||/__\||/__\||/__\||/__\||/__\||/__\||/__\|
[2026-02-18T11:38:05-05:00] LOG ENTRY CONTINUED
# 03. The Health Check Engine
Docker has built-in health checks, but they're limited. A container can report "healthy" while serving 500 errors to every request. We needed something smarter. The health check engine runs a configurable battery of checks per service:
1. TCP connectivity -- can we reach the service port?
2. HTTP probe -- does GET /health return 200?
3. Dependency verification -- can the service reach its database, cache, and message queue?
4. Smoke test -- does a sample request through the actual API return expected data?
5. Resource baseline -- are memory and CPU usage within 2 standard deviations of the rolling average?
type HealthCheckResult = {
  check: string;
  passed: boolean;
  latencyMs: number;
  details?: string;
};

async function runHealthChecks(
  service: string
): Promise<'healthy' | 'degraded' | 'failed'> {
  const config = await getServiceConfig(service);

  // Run all checks in parallel with individual timeouts
  const checks = [
    checkTCP(config.host, config.port),
    checkHTTP(config.healthEndpoint),
    ...config.dependencies.map(d => checkDependency(d)),
    checkSmokeTest(config.smokeTest),
    checkResourceBaseline(service),
  ];

  const settled = await Promise.allSettled(checks);

  const failures = settled.filter(
    r => r.status === 'rejected' || !r.value.passed
  );

  if (failures.length === 0) return 'healthy';
  if (failures.length <= 1) return 'degraded';
  return 'failed';
}
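Each individual check is deliberately small. A sketch of the HTTP probe, for instance, assuming Node 18+'s built-in fetch (the timeout value is illustrative):

async function checkHTTP(url: string, timeoutMs = 5_000): Promise<HealthCheckResult> {
  const started = Date.now();
  try {
    const res = await fetch(url, { signal: AbortSignal.timeout(timeoutMs) });
    return {
      check: 'http',
      passed: res.status === 200,
      latencyMs: Date.now() - started,
      details: `GET ${url} -> ${res.status}`,
    };
  } catch (err) {
    // Timeouts and connection errors count as failures, not crashes.
    return {
      check: 'http',
      passed: false,
      latencyMs: Date.now() - started,
      details: String(err),
    };
  }
}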
The distinction between degraded and failed matters. A degraded service gets logged and monitored but not rolled back -- maybe the dependency check failed because Redis had a brief hiccup. A failed service triggers immediate rollback. The threshold is configurable per service.
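The per-service knob doesn't need to be fancy. A rough sketch (the field name is illustrative, not the repo's actual config shape):

// Hypothetical per-service policy: how many failing checks still count
// as 'degraded' rather than 'failed'.
interface HealthPolicy {
  degradedTolerance: number;
}

// Example: order-service talks to three dependencies, so tolerate two
// flaky checks before treating it as failed and rolling back.
const orderServicePolicy: HealthPolicy = { degradedTolerance: 2 };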
[figure] fig02-health-dashboard.png -- Health check dashboard showing service status
 ____  ____  ____  ____  ____  ____  ____  ____
||  ||||  ||||  ||||  ||||  ||||  ||||  ||||  ||
||__||||__||||__||||__||||__||||__||||__||||__||
|/__\||/__\||/__\||/__\||/__\||/__\||/__\||/__\|
[2026-02-18T14:21:33-05:00] LOG ENTRY CONTINUED
# 04. Automated Rollback Logic
Here's where it gets interesting. The watchdog daemon runs every 30 seconds. On each tick, it checks every service's health. If a service reports "failed", the watchdog queries the state store for the most recent deployment with healthStatus === 'healthy' and triggers a rollback to that image tag.
WARNING: The rollback circuit breaker is critical. Without it, you can end up in an infinite rollback loop if the "last known good" image also fails (maybe the database schema changed). We cap automated rollbacks at 3 per service per hour.
async function watchdogTick(): Promise<void> {
  const services = await listManagedServices();

  for (const svc of services) {
    const health = await runHealthChecks(svc.name);

    if (health !== 'failed') continue;

    // Circuit breaker: max 3 rollbacks per service per hour
    const recentRollbacks = await stateStore.query({
      service: svc.name,
      initiator: 'watchdog',
      since: hoursAgo(1),
    });

    if (recentRollbacks.length >= 3) {
      alertEscalate(svc.name, 'circuit_breaker_tripped');
      continue;
    }

    const lastGood = await stateStore.findLastHealthy(svc.name);

    if (!lastGood) {
      alertEscalate(svc.name, 'no_healthy_state_found');
      continue;
    }

    log.warn(`Rolling back ${svc.name} to ${lastGood.imageTag}`);
    await deploy(svc.name, lastGood.imageTag, 'watchdog');
  }
}
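The findLastHealthy lookup is nothing clever -- it's one query against the state store. A minimal sketch, assuming a deployments table that mirrors DeploymentRecord and Node's built-in node:sqlite module (the db path is illustrative):

import { DatabaseSync } from 'node:sqlite';

const db = new DatabaseSync('/var/lib/deploy-agent/state.db');

function findLastHealthy(service: string): DeploymentRecord | undefined {
  const row = db
    .prepare(
      `SELECT * FROM deployments
        WHERE service = ? AND healthStatus = 'healthy'
        ORDER BY timestamp DESC
        LIMIT 1`
    )
    .get(service);

  return row as DeploymentRecord | undefined;
}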
Let me show you what a real automated rollback looks like in the logs. This happened on January 28th at 2:14 AM. Nobody was awake. The pipeline handled it:
[02:14:02] watchdog: health check tick starting
[02:14:03] watchdog: api-gateway ...... HEALTHY
[02:14:03] watchdog: auth-service ..... HEALTHY
[02:14:04] watchdog: order-service .... FAILED
[02:14:04]   -> TCP check: OK
[02:14:04]   -> HTTP /health: 503 Service Unavailable
[02:14:04]   -> Smoke test: TIMEOUT (5000ms)
[02:14:04]   -> Memory: 847MB (baseline: 220MB +/- 40MB) EXCEEDED
[02:14:05] watchdog: initiating rollback for order-service
[02:14:05]   current:  registry.local/order-service:v2.4.1
[02:14:05]   rollback: registry.local/order-service:v2.4.0
[02:14:06] deploy-agent: pulling image...
[02:14:08] deploy-agent: updating swarm service...
[02:14:12] deploy-agent: waiting for convergence...
[02:14:34] deploy-agent: all replicas running
[02:14:35] health-check: order-service ... HEALTHY
[02:14:35] watchdog: rollback complete. alerting on-call (informational).
[02:14:35] alert: INFO order-service auto-rolled-back v2.4.1 -> v2.4.0 (memory_exceeded)
Total time from detection to recovery: 33 seconds. No human intervention. The on-call engineer got an informational Slack message, reviewed the logs in the morning, and identified the memory leak in a new database query that wasn't using cursor pagination.
 ____  ____  ____  ____  ____  ____  ____  ____
||  ||||  ||||  ||||  ||||  ||||  ||||  ||||  ||
||__||||__||||__||||__||||__||||__||||__||||__||
|/__\||/__\||/__\||/__\||/__\||/__\||/__\||/__\|
[2026-02-18T16:45:19-05:00] LOG ENTRY CONTINUED
# 05. Alerting Without Alert Fatigue
The alerting layer is tiered. Not every event deserves to wake someone up. Here's the escalation matrix:
alerts:
  auto_rollback_success:
    channel: slack#deployments
    severity: info
    page: false
    # Self-healed. FYI only.

  auto_rollback_failed:
    channel: slack#incidents
    severity: critical
    page: true
    # Rollback didn't work. Human needed.

  circuit_breaker_tripped:
    channel: slack#incidents
    severity: critical
    page: true
    # Something is fundamentally wrong.

  degraded_service:
    channel: slack#monitoring
    severity: warning
    page: false
    auto_resolve_after: 5m
    # Might self-resolve. Wait and see.
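To make the matrix concrete, here's a sketch of the dispatch path: Slack incoming webhooks for the channel, the PagerDuty Events API v2 when page is true. The env var names and the sendAlert signature are illustrative:

interface AlertRule {
  channel: string;
  severity: 'info' | 'warning' | 'critical';
  page: boolean;
}

// One incoming-webhook URL per Slack channel named in the alerts config.
const slackWebhooks: Record<string, string | undefined> = {
  'slack#deployments': process.env.SLACK_WEBHOOK_DEPLOYMENTS,
  'slack#incidents': process.env.SLACK_WEBHOOK_INCIDENTS,
  'slack#monitoring': process.env.SLACK_WEBHOOK_MONITORING,
};

async function sendAlert(name: string, rule: AlertRule, message: string): Promise<void> {
  const webhook = slackWebhooks[rule.channel];
  if (webhook) {
    await fetch(webhook, {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ text: `[${rule.severity}] ${name}: ${message}` }),
    });
  }

  // Only page a human when the rule explicitly asks for it.
  if (rule.page) {
    await fetch('https://events.pagerduty.com/v2/enqueue', {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({
        routing_key: process.env.PAGERDUTY_ROUTING_KEY,
        event_action: 'trigger',
        payload: { summary: `${name}: ${message}`, severity: rule.severity, source: 'watchdog' },
      }),
    });
  }
}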
This approach cut our PagerDuty alerts from an average of 11 per week to 2 per week. And those 2 are genuinely important -- things the system can't fix on its own, like a bad database migration or a misconfigured environment variable.
[figure] fig03-alert-volume.png -- Graph showing dramatic reduction in alert volume over time
 ____  ____  ____  ____  ____  ____  ____  ____
||  ||||  ||||  ||||  ||||  ||||  ||||  ||||  ||
||__||||__||||__||||__||||__||||__||||__||||__||
|/__\||/__\||/__\||/__\||/__\||/__\||/__\||/__\|
[2026-02-18T22:47:08-05:00] LOG ENTRY CONTINUED
# 06. Results and Lessons Learned
After four months in production, the numbers tell the story:
PIPELINE STATS (Oct 2025 - Feb 2026)
---------------------------------------------
Total deployments:            847
  - CI-triggered:             712
  - Human-triggered:           98
  - Watchdog rollbacks:        37
Automated rollback success:   37/37 (100%)
Mean time to recovery:        41 seconds
PagerDuty pages:              9 total (from 44 previous period)
3AM wake-ups:                 0
The biggest lesson: keep the system simple and observable. I deliberately chose SQLite over Postgres for the state store because it's one fewer service to monitor. I chose a 30-second polling interval over an event-driven design because it's easier to reason about and debug. I wrote extensive structured logging so that when something does go wrong, the post-mortem writes itself.
KEY TAKEAWAY: The goal of self-healing infrastructure isn't to eliminate human judgment. It's to eliminate the gap between detection and remediation for known failure modes. Your team should still review every automated action -- just on their own schedule, not at 3AM.
A few specific gotchas I encountered along the way:
01. Image garbage collection. If you're rolling back to old image tags, those images still need to exist in your registry. We set a 90-day retention policy and a minimum of 10 retained tags per service. Learned this one the hard way when a rollback target had been garbage collected.
02. Database schema compatibility. The rollback only works if the previous code version is compatible with the current database schema. We enforce backward-compatible migrations as a CI check. This is non-negotiable.
03. The "degraded" state is your friend. Most issues self-resolve. A brief network hiccup, a slow garbage collection cycle, a dependency that's temporarily overloaded. If you roll back on every blip, you'll create more instability than you prevent.
04. Test the rollback path. We run monthly "chaos mornings" where we deliberately deploy a bad image and verify the pipeline catches it. It's like a fire drill for your infrastructure.
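A rough sketch of what one of those drills looks like, reusing deploy() and getCurrentDeployment() from earlier (the bad image tag is a placeholder):

async function chaosDrill(service: string): Promise<void> {
  const before = await getCurrentDeployment(service);

  // Deliberately ship an image that is known to fail its health checks.
  await deploy(service, 'registry.local/chaos/known-bad:latest', 'human');

  // Give the watchdog a few 30-second ticks to detect it and roll back.
  await new Promise(resolve => setTimeout(resolve, 3 * 60_000));

  const after = await getCurrentDeployment(service);
  if (after?.imageTag !== before?.imageTag) {
    throw new Error(`Chaos drill failed: ${service} was not rolled back`);
  }
  console.log(`Chaos drill passed: ${service} recovered to ${after?.imageTag}`);
}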
================================================================
The full source for the deploy-agent and watchdog is available on GitHub. It's about 1200 lines of TypeScript with zero external dependencies beyond the Docker SDK. MIT licensed. PRs welcome.

github.com/FredLackey/swarm-sentinel
[2026-02-18T22:47:08-05:00] LOG ENTRY END
================================================================
prev (older) <-- kubernetes-vs-compose-journey.log
next (newer) [end of log] -->

flackey@devbox:~/blog$
TTY: /dev/pts/0 ~/blog/self-healing-docker-pipeline.log 12.8K UTF-8 00:00:00