Let me set the scene. It's a Tuesday night. I've been staring at our Docker registry for about thirty minutes, watching the storage counter tick upward like a taxi meter in Manhattan traffic. 847 images. 340 GB. Most of them are ghosts -- artifacts of feature branches that died three months ago, CI runs that failed halfway through, and that one time someone pushed a 2 GB node_modules layer because they forgot the .dockerignore.
"I should clean this up," I think, with the same energy as someone saying "I should floss more." The difference is that flossing doesn't take down production.
The best time to clean your Docker registry was six months ago. The second best time is not 11 PM on a Tuesday.
-- Me, writing this in hindsight
The Setup
Our infrastructure at the time was straightforward: a handful of Node.js microservices, all containerized, deployed through a CI/CD pipeline to a Kubernetes cluster. Each service had its own Dockerfile, its own image repository, and its own tagging scheme -- which is to say, no consistent tagging scheme at all.
Some services tagged with git SHAs. Some used latest. One rogue service used date strings like 2026-01-15-hotfix-v2-final-FINAL. You know the type. You might be the type. I'm not judging. I was that type for two of the services.
The problem wasn't just storage cost (though that was real -- we were paying about $120/month just for image storage). The problem was cognitive overhead. Every time someone needed to figure out which image was actually deployed, they had to play archaeological detective, cross-referencing CI logs with Kubernetes manifests with git blame output.
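Incidentally, the cluster itself can answer the "what's actually deployed" question without any CI log spelunking. A minimal sketch, which you'd want to adjust for init containers and for whichever namespaces you actually care about:
# Every image reference the nodes have resolved, digests included
kubectl get pods --all-namespaces \
  -o jsonpath='{.items[*].status.containerStatuses[*].imageID}' \
  | tr ' ' '\n' | sort -u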
The Script That Started It All
Being a responsible engineer, I didn't just start deleting things manually. No. I wrote a script. A carefully considered, thoroughly thought-through, definitely-not-written-in-fifteen-minutes Bash script that would identify and remove stale images.
Here's what the script was supposed to do:
- List all images in each repository
- Identify images older than 30 days
- Cross-reference with currently deployed image tags
- Delete everything that was old AND not currently deployed
Here's what I actually wrote:
#!/bin/bash
# cleanup-images.sh
# "This will be fine" -- narrator: it was not fine
REGISTRY="our-registry.example.com"
CUTOFF_DATE=$(date -d "30 days ago" +%s)
for repo in $(curl -s "$REGISTRY/v2/_catalog" | jq -r '.repositories[]'); do
echo "Processing: $repo"
tags=$(curl -s "$REGISTRY/v2/$repo/tags/list" | jq -r '.tags[]')
for tag in $tags; do
created=$(curl -s "$REGISTRY/v2/$repo/manifests/$tag" \
| jq -r '.history[0].v1Compatibility' \
| jq -r '.created' \
| xargs -I{} date -d {} +%s)
if [ "$created" -lt "$CUTOFF_DATE" ]; then
echo " Deleting: $repo:$tag"
# THE BUG IS HERE. CAN YOU SEE IT?
curl -X DELETE "$REGISTRY/v2/$repo/manifests/$tag"
fi
done
done
echo "Done! Registry cleaned."
If you noticed that this script doesn't cross-reference currently deployed tags before deleting, congratulations -- you're smarter than Tuesday-night me. Step 3 from my plan? It never made it into the code. I "planned to add it later."
The "Oh No" Moment
I ran the script at 11:14 PM. It ran beautifully. It logged every deletion with satisfying output. Within six minutes, it had removed 612 images. I watched the storage counter drop from 340 GB to 89 GB. I felt like a genius.
I felt like a genius for approximately fourteen minutes.
At 11:28 PM, our Kubernetes cluster started throwing ImagePullBackOff errors. Three services couldn't pull their images during routine pod rescheduling. Because the images they were pointing to -- images tagged with git SHAs from five weeks ago -- no longer existed.
Events:
  Type     Reason     Age  From               Message
  ----     ------     ---- ----               -------
  Normal   Scheduled  2m   default-scheduler  Successfully assigned api/user-svc-7d8f9c to node-03
  Normal   Pulling    2m   kubelet            Pulling image "registry/user-svc:a3f8b21"
  Warning  Failed     1m   kubelet            Failed to pull image "registry/user-svc:a3f8b21": manifest unknown
  Warning  Failed     1m   kubelet            Error: ImagePullBackOff
The good news: the currently running pods were fine. Kubernetes doesn't re-pull images that are already cached on the node. The bad news: any pod that needed to restart, scale up, or get rescheduled to a different node was now bricked.
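If you ever need to take the same inventory under pressure, the pod status column is enough to map the blast radius; no clever jsonpath required:
# Which pods are currently failing to pull their image?
kubectl get pods --all-namespaces | grep -E 'ImagePullBackOff|ErrImagePull'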
Damage Report
Here's the full damage assessment from that night:
- 3 services affected (user-service, notification-service, analytics-worker)
- 612 images deleted (we needed about 8 of them)
- 2 hours of degraded service while we rebuilt and redeployed
- 1 very apologetic Slack message in #engineering at midnight
- 0 customers who noticed (we got incredibly lucky -- traffic was minimal)
The Immediate Fix
The fix was straightforward but tedious. For each affected service:
- Check out the exact commit that was deployed (thank god for Kubernetes annotations)
- Rebuild the Docker image from that commit
- Push it with the same tag
- Trigger a rolling restart
# For each affected service:
git checkout a3f8b21 # The exact commit from the deployment
docker build -t registry/user-svc:a3f8b21 .
docker push registry/user-svc:a3f8b21
# Then restart the pods
kubectl rollout restart deployment/user-svc -n api
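For what it's worth, the "exact commit" in step 1 is usually sitting right in the deployment spec, because the tag is the git SHA. A hedged one-liner (the deployment and namespace names here are just our setup):
# Which image -- and therefore which commit -- does the deployment expect?
kubectl get deployment user-svc -n api \
  -o jsonpath='{.spec.template.spec.containers[0].image}'
# -> registry/user-svc:a3f8b21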
By 1:30 AM, everything was back. I closed my laptop, stared at the ceiling for about twenty minutes contemplating my life choices, and then went to sleep.
What I Actually Learned
This is the part of the blog post where I'm supposed to act wise and share profound insights. Here's my honest list:
1. Never delete without a dry run
Any destructive script should have a mandatory --dry-run flag that's on by default. Print what you would delete. Stare at it. Sleep on it. Then delete it tomorrow when you're not running on caffeine and hubris.
2. Tag immutably, reference by digest
After this incident, we switched to a consistent tagging scheme:
# Format: {service}-{git-sha}-{build-number}
registry/user-svc:user-svc-a3f8b21-142
registry/user-svc:user-svc-b7e2c4d-143
# Deployments reference the full digest:
image: registry/user-svc@sha256:abc123def456...
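If you're wondering where the digest comes from: docker push prints it on its last line, and you can read it back afterwards too. A minimal sketch, assuming you're inspecting on the same machine that pushed the image:
docker push registry/user-svc:user-svc-a3f8b21-142
# Read back the immutable reference to paste into the deployment manifest
docker inspect --format='{{index .RepoDigests 0}}' \
  registry/user-svc:user-svc-a3f8b21-142
# -> registry/user-svc@sha256:abc123def456...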
3. Protect what's deployed
Here's the script I wrote the next day -- the one I should have written first:
#!/bin/bash
# cleanup-images-v2.sh -- now with 100% more sanity
set -euo pipefail
DRY_RUN=${DRY_RUN:-true} # Safe by default!
CUTOFF_DAYS=${CUTOFF_DAYS:-30}
REGISTRY="our-registry.example.com"
# Step 1: Get ALL currently deployed image refs
echo "Fetching deployed images from all namespaces..."
DEPLOYED_IMAGES=$(kubectl get pods --all-namespaces \
-o jsonpath='{.items[*].spec.containers[*].image}' \
| tr ' ' '\n' | sort -u)
echo "Found $(echo "$DEPLOYED_IMAGES" | wc -l) deployed image refs"
echo "---"
# Step 2: For each repo, check tags
DELETED=0
PROTECTED=0
for repo in $(curl -s "$REGISTRY/v2/_catalog" \
| jq -r '.repositories[]'); do
for tag in $(curl -s "$REGISTRY/v2/$repo/tags/list" \
| jq -r '.tags[]' 2>/dev/null); do
full_ref="$REGISTRY/$repo:$tag"
# NEVER delete deployed images (-x: exact line match, -F: no regex surprises)
if echo "$DEPLOYED_IMAGES" | grep -qxF "$full_ref"; then
echo "PROTECTED: $full_ref (currently deployed)"
PROTECTED=$((PROTECTED + 1)) # ((PROTECTED++)) would trip set -e on the first hit
continue
fi
# Check age...
# [age check logic here]
if [ "$DRY_RUN" = "true" ]; then
echo "WOULD DELETE: $full_ref"
else
echo "DELETING: $full_ref"
# actual delete call
fi
DELETED=$((DELETED + 1)) # same deal: safe under set -e, unlike ((DELETED++))
done
done
echo "---"
echo "Protected: $PROTECTED | Deleted: $DELETED"
[ "$DRY_RUN" = "true" ] && echo "(DRY RUN -- nothing was actually deleted)"
The difference between a junior and senior engineer isn't that seniors don't make mistakes. It's that seniors have made enough mistakes to build guardrails into everything they write.
4. Retention policies should be automated, not heroic
We now use a registry-level retention policy that automatically prunes images older than 90 days, but only if they don't match any tag pattern that could be a release. Combine that with a CI job that runs weekly to verify every deployed image still exists in the registry, and the system cleans up after itself and flags a missing image before it becomes an incident.
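That weekly verification job is less code than it sounds. Here's a sketch of the shape it can take -- docker manifest inspect is one way to ask the registry "does this still exist?", and this assumes CI is already authenticated against the registry:
#!/bin/bash
# verify-deployed-images.sh -- fail the build if anything running in
# the cluster can no longer be pulled from the registry
set -euo pipefail

MISSING=0
for image in $(kubectl get pods --all-namespaces \
    -o jsonpath='{.items[*].spec.containers[*].image}' \
    | tr ' ' '\n' | sort -u); do
  if ! docker manifest inspect "$image" > /dev/null 2>&1; then
    echo "MISSING FROM REGISTRY: $image"
    MISSING=$((MISSING + 1))
  fi
done

if [ "$MISSING" -gt 0 ]; then
  echo "$MISSING deployed image(s) are missing from the registry"
  exit 1
fi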
Most container registries (ECR, GCR, ACR, Harbor) have built-in lifecycle policies. Use them. They're battle-tested. They're smarter than your Bash script. They're definitely smarter than your 11 PM Bash script.
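To make that concrete, here's roughly what a minimal ECR lifecycle rule looks like -- this one only expires untagged images after 90 days, so treat it as a starting point rather than the release-tag-aware policy described above, and note that the other registries each have their own equivalent syntax:
# One-time setup per repository: expire untagged images after 90 days
cat > lifecycle-policy.json <<'EOF'
{
  "rules": [
    {
      "rulePriority": 1,
      "description": "Expire untagged images older than 90 days",
      "selection": {
        "tagStatus": "untagged",
        "countType": "sinceImagePushed",
        "countUnit": "days",
        "countNumber": 90
      },
      "action": { "type": "expire" }
    }
  ]
}
EOF

aws ecr put-lifecycle-policy \
  --repository-name user-svc \
  --lifecycle-policy-text file://lifecycle-policy.json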
Prevention Checklist
If you're sitting there right now with a bloated registry, here's my opinionated checklist before you start deleting anything:
- Audit what's actually deployed right now. Export the full list.
- Set up a retention policy at the registry level, not in a script.
- Standardize your tagging scheme across all services today.
- Add a CI check that verifies deployed images exist in the registry.
- If you must write a cleanup script, dry-run it for a full week.
- Never run destructive operations after 9 PM. Just don't.
The storage cost will still be there tomorrow morning. Your sanity might not be if you delete the wrong thing at midnight.
If this post saved you from making the same mistake, or if you've got an even worse Docker war story, drop me a line. Misery loves company, and I love a good incident report.
Until next time -- tag your images, sleep before you delete, and never trust a script you wrote after 10 PM.