Let me set the scene. It's a Tuesday night. I've been staring at our Docker registry for about thirty minutes, watching the storage counter tick upward like a taxi meter in Manhattan traffic. 847 images. 340 GB. Most of them are ghosts -- artifacts of feature branches that died three months ago, CI runs that failed halfway through, and that one time someone pushed a 2 GB node_modules layer because they forgot the .dockerignore.
"I should clean this up," I think, with the same energy as someone saying "I should floss more." The difference is that flossing doesn't take down production.
The best time to clean your Docker registry was six months ago. The second best time is not 11 PM on a Tuesday.
-- Me, writing this in hindsight
The Setup
Our infrastructure at the time was straightforward: a handful of Node.js microservices, all containerized, deployed through a CI/CD pipeline to a Kubernetes cluster. Each service had its own Dockerfile, its own image repository, and its own tagging scheme -- which is to say, no consistent tagging scheme at all.
Some services tagged with git SHAs. Some used latest. One rogue service used date strings like 2026-01-15-hotfix-v2-final-FINAL. You know the type. You might be the type. I'm not judging. I was that type for two of the services.
The problem wasn't just storage cost (though that was real -- we were paying about $120/month just for image storage). The problem was cognitive overhead. Every time someone needed to figure out which image was actually deployed, they had to play archaeological detective, cross-referencing CI logs with Kubernetes manifests with git blame output.
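Incidentally, the cluster itself can answer the "what's actually deployed" question without any CI log spelunking. A minimal sketch, which you'd want to adjust for init containers and for whichever namespaces you actually care about:
# Every image reference the nodes have resolved, digests included
kubectl get pods --all-namespaces \
  -o jsonpath='{.items[*].status.containerStatuses[*].imageID}' \
  | tr ' ' '\n' | sort -u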
The Script That Started It All
Being a responsible engineer, I didn't just start deleting things manually. No. I wrote a script. A carefully considered, thoroughly thought-through, definitely-not-written-in-fifteen-minutes Bash script that would identify and remove stale images.
Here's what the script was supposed to do:
- List all images in each repository
- Identify images older than 30 days
- Cross-reference with currently deployed image tags
- Delete everything that was old AND not currently deployed
Here's what I actually wrote:
#!/bin/bash
# cleanup-images.sh
# "This will be fine" -- narrator: it was not fine
REGISTRY="our-registry.example.com"
CUTOFF_DATE=$(date -d "30 days ago" +%s)
for repo in $(curl -s "$REGISTRY/v2/_catalog" | jq -r '.repositories[]'); do
echo "Processing: $repo"
tags=$(curl -s "$REGISTRY/v2/$repo/tags/list" | jq -r '.tags[]')
for tag in $tags; do
created=$(curl -s "$REGISTRY/v2/$repo/manifests/$tag" \
| jq -r '.history[0].v1Compatibility' \
| jq -r '.created' \
| xargs -I{} date -d {} +%s)
if [ "$created" -lt "$CUTOFF_DATE" ]; then
echo " Deleting: $repo:$tag"
# THE BUG IS HERE. CAN YOU SEE IT?
curl -X DELETE "$REGISTRY/v2/$repo/manifests/$tag"
fi
done
done
echo "Done! Registry cleaned."
If you noticed that this script doesn't cross-reference currently deployed tags before deleting, congratulations -- you're smarter than Tuesday-night me. Step 3 from my plan? It never made it into the code. I "planned to add it later."
The "Oh No" Moment
I ran the script at 11:14 PM. It ran beautifully. It logged every deletion with satisfying output. Within six minutes, it had removed 612 images. I watched the storage counter drop from 340 GB to 89 GB. I felt like a genius.
I felt like a genius for approximately fourteen minutes.
At 11:28 PM, our Kubernetes cluster started throwing ImagePullBackOff errors. Three services couldn't pull their images during routine pod rescheduling. Because the images they were pointing to -- images tagged with git SHAs from five weeks ago -- no longer existed.
Events:
  Type     Reason     Age  From               Message
  ----     ------     ---- ----               -------
  Normal   Scheduled  2m   default-scheduler  Successfully assigned api/user-svc-7d8f9c to node-03
  Normal   Pulling    2m   kubelet            Pulling image "registry/user-svc:a3f8b21"
  Warning  Failed     1m   kubelet            Failed to pull image "registry/user-svc:a3f8b21": manifest unknown
  Warning  Failed     1m   kubelet            Error: ImagePullBackOff
The good news: the currently running pods were fine. Kubernetes doesn't re-pull images that are already cached on the node. The bad news: any pod that needed to restart, scale up, or get rescheduled to a different node was now bricked.
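If you ever need to take the same inventory under pressure, the pod status column is enough to map the blast radius; no clever jsonpath required:
# Which pods are currently failing to pull their image?
kubectl get pods --all-namespaces | grep -E 'ImagePullBackOff|ErrImagePull'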
Damage Report
Here's the full damage assessment from that night:
- 3 services affected (user-service, notification-service, analytics-worker)
- 612 images deleted (we needed about 8 of them)
- 2 hours of degraded service while we rebuilt and redeployed
- 1 very apologetic Slack message in #engineering at midnight
- 0 customers who noticed (we got incredibly lucky -- traffic was minimal)
The Immediate Fix
The fix was straightforward but tedious. For each affected service:
- Check out the exact commit that was deployed (thank god for Kubernetes annotations)
- Rebuild the Docker image from that commit
- Push it with the same tag
- Trigger a rolling restart
# For each affected service:
git checkout a3f8b21 # The exact commit from the deployment
docker build -t registry/user-svc:a3f8b21 .
docker push registry/user-svc:a3f8b21
# Then restart the pods
kubectl rollout restart deployment/user-svc -n api
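For what it's worth, the "exact commit" in step 1 is usually sitting right in the deployment spec, because the tag is the git SHA. A hedged one-liner (the deployment and namespace names here are just our setup):
# Which image -- and therefore which commit -- does the deployment expect?
kubectl get deployment user-svc -n api \
  -o jsonpath='{.spec.template.spec.containers[0].image}'
# -> registry/user-svc:a3f8b21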
By 1:30 AM, everything was back. I closed my laptop, stared at the ceiling for about twenty minutes contemplating my life choices, and then went to sleep.
What I Actually Learned
This is the part of the blog post where I'm supposed to act wise and share profound insights. Here's my honest list:
1. Never delete without a dry run
Any destructive script should have a mandatory --dry-run flag that's on by default. Print what you would delete. Stare at it. Sleep on it. Then delete it tomorrow when you're not running on caffeine and hubris.
2. Tag immutably, reference by digest
After this incident, we switched to a consistent tagging scheme:
# Format: {service}-{git-sha}-{build-number}
registry/user-svc:user-svc-a3f8b21-142
registry/user-svc:user-svc-b7e2c4d-143
# Deployments reference the full digest:
image: registry/user-svc@sha256:abc123def456...
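If you're wondering where the digest comes from: docker push prints it on its last line, and you can read it back afterwards too. A minimal sketch, assuming you're inspecting on the same machine that pushed the image:
docker push registry/user-svc:user-svc-a3f8b21-142
# Read back the immutable reference to paste into the deployment manifest
docker inspect --format='{{index .RepoDigests 0}}' \
  registry/user-svc:user-svc-a3f8b21-142
# -> registry/user-svc@sha256:abc123def456...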
3. Protect what's deployed
Here's the script I wrote the next day -- the one I should have written first:
#!/bin/bash
# cleanup-images-v2.sh -- now with 100% more sanity
set -euo pipefail
DRY_RUN=${DRY_RUN:-true} # Safe by default!
CUTOFF_DAYS=${CUTOFF_DAYS:-30}
REGISTRY="our-registry.example.com"
# Step 1: Get ALL currently deployed image refs
echo "Fetching deployed images from all namespaces..."
DEPLOYED_IMAGES=$(kubectl get pods --all-namespaces \
-o jsonpath='{.items[*].spec.containers[*].image}' \
| tr ' ' '\n' | sort -u)
echo "Found $(echo "$DEPLOYED_IMAGES" | wc -l) deployed image refs"
echo "---"
# Step 2: For each repo, check tags
DELETED=0
PROTECTED=0
for repo in $(curl -s "$REGISTRY/v2/_catalog" \
| jq -r '.repositories[]'); do
for tag in $(curl -s "$REGISTRY/v2/$repo/tags/list" \
| jq -r '.tags[]' 2>/dev/null); do
full_ref="$REGISTRY/$repo:$tag"
# NEVER delete deployed images (-x: exact line match, -F: no regex surprises)
if echo "$DEPLOYED_IMAGES" | grep -qxF "$full_ref"; then
echo "PROTECTED: $full_ref (currently deployed)"
PROTECTED=$((PROTECTED + 1)) # ((PROTECTED++)) would trip set -e on the first hit
continue
fi
# Check age...
# [age check logic here]
if [ "$DRY_RUN" = "true" ]; then
echo "WOULD DELETE: $full_ref"
else
echo "DELETING: $full_ref"
# actual delete call
fi
DELETED=$((DELETED + 1)) # same deal: safe under set -e, unlike ((DELETED++))
done
done
echo "---"
echo "Protected: $PROTECTED | Deleted: $DELETED"
[ "$DRY_RUN" = "true" ] && echo "(DRY RUN -- nothing was actually deleted)"
The difference between a junior and senior engineer isn't that seniors don't make mistakes. It's that seniors have made enough mistakes to build guardrails into everything they write.
4. Retention policies should be automated, not heroic
We now use a registry-level retention policy that automatically prunes images older than 90 days, but only if they don't match any tag pattern that could be a release. Combine that with a CI job that runs weekly to verify every deployed image still exists in the registry, and the system cleans up after itself and flags a missing image before it becomes an incident.
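That weekly verification job is less code than it sounds. Here's a sketch of the shape it can take -- docker manifest inspect is one way to ask the registry "does this still exist?", and this assumes CI is already authenticated against the registry:
#!/bin/bash
# verify-deployed-images.sh -- fail the build if anything running in
# the cluster can no longer be pulled from the registry
set -euo pipefail

MISSING=0
for image in $(kubectl get pods --all-namespaces \
    -o jsonpath='{.items[*].spec.containers[*].image}' \
    | tr ' ' '\n' | sort -u); do
  if ! docker manifest inspect "$image" > /dev/null 2>&1; then
    echo "MISSING FROM REGISTRY: $image"
    MISSING=$((MISSING + 1))
  fi
done

if [ "$MISSING" -gt 0 ]; then
  echo "$MISSING deployed image(s) are missing from the registry"
  exit 1
fi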
Most container registries (ECR, GCR, ACR, Harbor) have built-in lifecycle policies. Use them. They're battle-tested. They're smarter than your Bash script. They're definitely smarter than your 11 PM Bash script.
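To make that concrete, here's roughly what a minimal ECR lifecycle rule looks like -- this one only expires untagged images after 90 days, so treat it as a starting point rather than the release-tag-aware policy described above, and note that the other registries each have their own equivalent syntax:
# One-time setup per repository: expire untagged images after 90 days
cat > lifecycle-policy.json <<'EOF'
{
  "rules": [
    {
      "rulePriority": 1,
      "description": "Expire untagged images older than 90 days",
      "selection": {
        "tagStatus": "untagged",
        "countType": "sinceImagePushed",
        "countUnit": "days",
        "countNumber": 90
      },
      "action": { "type": "expire" }
    }
  ]
}
EOF

aws ecr put-lifecycle-policy \
  --repository-name user-svc \
  --lifecycle-policy-text file://lifecycle-policy.json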
Prevention Checklist
If you're sitting there right now with a bloated registry, here's my opinionated checklist before you start deleting anything:
- Audit what's actually deployed right now. Export the full list.
- Set up a retention policy at the registry level, not in a script.
- Standardize your tagging scheme across all services today.
- Add a CI check that verifies deployed images exist in the registry.
- If you must write a cleanup script, dry-run it for a full week.
- Never run destructive operations after 9 PM. Just don't.
The storage cost will still be there tomorrow morning. Your sanity might not be if you delete the wrong thing at midnight.
If this post saved you from making the same mistake, or if you've got an even worse Docker war story, drop me a line. Misery loves company, and I love a good incident report.
Until next time -- tag your images, sleep before you delete, and never trust a script you wrote after 10 PM.