Cold starts were killing us. Not in the "slightly annoying" way that makes you file a low-priority ticket. In the "our enterprise clients are threatening to leave" way that makes you reconsider every architectural decision you've made in the last three years.
Our platform served API requests from 14 AWS Lambda functions behind an API Gateway. On paper, it was textbook serverless. In practice, our p99 latency had crept up to 2.3 seconds. For a developer tools company. The irony was not lost on anyone.
The Problem Nobody Warned Us About
When we first adopted Lambda in 2023, the cold start problem was well-documented. We did everything the blog posts told us: provisioned concurrency, kept functions warm with scheduled pings, minimized bundle sizes. And it worked -- for a while.
What nobody told us was that as your function count grows and your traffic becomes more spiky (as developer tools traffic tends to be), the warm pool becomes increasingly expensive to maintain. We were spending $4,200/month just on provisioned concurrency for functions that were idle 60% of the time.
The numbers in this post are real but rounded. I've anonymized client-specific data and normalized costs to a single-region deployment for clarity.
Measuring What Actually Matters
Before ripping anything out, we spent two weeks instrumenting everything. I mean everything. Here's the script that got us started:
```typescript
import { Histogram, Counter } from 'prom-client';

const requestLatency = new Histogram({
  name: 'api_request_duration_ms',
  help: 'API request duration in milliseconds',
  labelNames: ['route', 'method', 'cold_start'],
  buckets: [10, 50, 100, 250, 500, 1000, 2500, 5000],
});

const coldStartCount = new Counter({
  name: 'api_cold_starts_total',
  help: 'Total number of cold starts',
  labelNames: ['function_name'],
});

// Module-level flag: true only for the first invocation in a fresh
// execution environment, i.e. a cold start.
let isColdStart = true;

export function trackRequest(
  route: string,
  method: string,
  durationMs: number
) {
  requestLatency
    .labels(route, method, String(isColdStart))
    .observe(durationMs);

  if (isColdStart) {
    // Attribute the cold start to the function itself; Lambda sets
    // AWS_LAMBDA_FUNCTION_NAME in every execution environment.
    coldStartCount
      .labels(process.env.AWS_LAMBDA_FUNCTION_NAME ?? 'unknown')
      .inc();
    isColdStart = false;
  }
}
```
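For context, each handler wrapped its work in a timer and called trackRequest on the way out. The sketch below is illustrative rather than our actual handler code; the routeRequest helper and the ./metrics path are placeholders.

```typescript
// Illustrative only: a hypothetical API Gateway (HTTP API v2) handler
// wrapping trackRequest. routeRequest stands in for the real app logic.
import type { APIGatewayProxyEventV2, APIGatewayProxyStructuredResultV2 } from 'aws-lambda';
import { trackRequest } from './metrics';

async function routeRequest(event: APIGatewayProxyEventV2): Promise<unknown> {
  // ... real application logic would go here ...
  return { ok: true };
}

export async function handler(
  event: APIGatewayProxyEventV2
): Promise<APIGatewayProxyStructuredResultV2> {
  const start = Date.now();
  try {
    const body = await routeRequest(event);
    return { statusCode: 200, body: JSON.stringify(body) };
  } finally {
    // Record latency and cold-start status even if the handler throws.
    trackRequest(
      event.rawPath,
      event.requestContext.http.method,
      Date.now() - start
    );
  }
}
```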
After two weeks of data collection, the picture was grim. Cold starts accounted for 23% of all requests during peak hours (9-11am EST, when developer teams start their day). The median cold start added 1,800ms. That's not a latency spike -- that's a fundamentally broken user experience.
The Migration Path
We evaluated three options:
- Stay on Lambda, optimize harder -- Provisioned concurrency across all functions, SnapStart for Java-based services. Estimated cost: $8,400/month.
- Move to containers (ECS/Fargate) -- Predictable latency, but we'd lose the auto-scaling simplicity. Estimated cost: $3,200/month baseline.
- Edge functions (Cloudflare Workers) -- Near-zero cold starts, global deployment, V8 isolate model. Estimated cost: $420/month.
The cost difference alone was compelling, but the latency numbers sealed the deal. Workers spin up in under 5ms. Not 5 seconds. Five milliseconds.
The Worker Architecture
Here's the core of our routing layer after the migration:
```typescript
import { Router } from 'itty-router';
import { withAuth } from './middleware/auth';
import { withCache } from './middleware/cache';
import { handleAnalytics } from './handlers/analytics';
import { handleIngest } from './handlers/ingest';

const router = Router();

// Auth middleware runs at the edge --
// JWT verification in <1ms using Web Crypto API
router.all('/api/*', withAuth);

// Cache layer using Cloudflare KV for
// frequently accessed read paths
router.get('/api/analytics/*', withCache, handleAnalytics);
router.post('/api/ingest', handleIngest);

export default {
  async fetch(request, env, ctx): Promise<Response> {
    return router.handle(request, env, ctx);
  },
};
```
Simple. Almost suspiciously simple. But that's the point -- the complexity shifted from "managing infrastructure" to "writing good code." Which is exactly where I want it.
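Even the auth middleware is only a few dozen lines. Here's a simplified sketch, assuming HS256-signed tokens and a shared secret bound as JWT_SECRET (both assumptions; expiry and claim checks are omitted for brevity):

```typescript
// middleware/auth.ts -- simplified sketch of edge JWT verification with
// the Web Crypto API. Assumes HS256 tokens and a JWT_SECRET binding.
function base64UrlDecode(input: string): Uint8Array {
  const padded = input.replace(/-/g, '+').replace(/_/g, '/')
    .padEnd(Math.ceil(input.length / 4) * 4, '=');
  return Uint8Array.from(atob(padded), (c) => c.charCodeAt(0));
}

export async function withAuth(
  request: Request,
  env: { JWT_SECRET: string }
): Promise<Response | undefined> {
  const token = (request.headers.get('authorization') ?? '').replace(/^Bearer /, '');
  const [header, payload, signature] = token.split('.');
  if (!header || !payload || !signature) {
    return new Response('Unauthorized', { status: 401 });
  }

  const key = await crypto.subtle.importKey(
    'raw',
    new TextEncoder().encode(env.JWT_SECRET),
    { name: 'HMAC', hash: 'SHA-256' },
    false,
    ['verify']
  );

  const valid = await crypto.subtle.verify(
    'HMAC',
    key,
    base64UrlDecode(signature),
    new TextEncoder().encode(`${header}.${payload}`)
  );

  if (!valid) return new Response('Unauthorized', { status: 401 });
  // Returning nothing lets the matched handler run.
  return undefined;
}
```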
"The best infrastructure is the kind you forget exists. It should be invisible -- a platform your code runs on, not a problem your team debugs."
Kelsey Hightower, at KubeCon 2024
What Broke Along the Way
It wasn't all smooth sailing. Here are the three biggest problems we hit:
1. The 128MB Memory Ceiling
Workers have a hard memory limit. Our analytics aggregation function was buffering entire result sets in memory before streaming them to the client. We had to rewrite it to use TransformStream for chunked processing. Took two days, but the result was actually better than the original.
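For reference, the streaming rewrite follows roughly this shape. It's a sketch, not our production handler: handleAnalytics and fetchRows are illustrative names.

```typescript
// Illustrative sketch: stream aggregation results in chunks instead of
// buffering the full result set in memory.
export async function handleAnalytics(
  request: Request,
  env: unknown,
  ctx: { waitUntil(p: Promise<unknown>): void }
): Promise<Response> {
  const { readable, writable } = new TransformStream();
  const writer = writable.getWriter();
  const encoder = new TextEncoder();

  async function pump() {
    await writer.write(encoder.encode('['));
    let first = true;
    for await (const row of fetchRows(request)) {
      await writer.write(encoder.encode((first ? '' : ',') + JSON.stringify(row)));
      first = false;
    }
    await writer.write(encoder.encode(']'));
    await writer.close();
  }

  // Keep writing after the Response is returned; the client reads
  // chunks as they are produced, so memory usage stays flat.
  ctx.waitUntil(pump());

  return new Response(readable, {
    headers: { 'content-type': 'application/json' },
  });
}

// Hypothetical row source -- in practice this pages through the
// aggregation query and yields one row at a time.
async function* fetchRows(_request: Request): AsyncGenerator<unknown> {
  yield { example: true };
}
```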
2. No Native Database Drivers
Workers run on V8 isolates, not Node.js, so traditional drivers like pg and mysql2 -- which expect Node's TCP networking stack -- don't work out of the box. We moved to Cloudflare's Hyperdrive for Postgres connections, which pools connections at the edge and proxies them to our origin database. Latency impact: negligible.
If you're using connection-heavy ORMs like Prisma, test thoroughly before migrating. The connection model at the edge is fundamentally different, and you may hit connection limits faster than expected.
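Here's roughly what the query path looks like through Hyperdrive. A sketch under assumptions: the HYPERDRIVE binding name, the postgres.js client, and the example query are all placeholders.

```typescript
// Sketch of a Hyperdrive-backed query from a Worker. Assumes a
// [[hyperdrive]] binding named HYPERDRIVE in wrangler.toml and the
// postgres.js driver; the query itself is a placeholder.
import postgres from 'postgres';

interface Env {
  // The real binding type also carries host/port/user fields;
  // connectionString is all we need here.
  HYPERDRIVE: { connectionString: string };
}

export async function getProject(env: Env, projectId: string) {
  // The connection string points at Hyperdrive's edge pool, which
  // proxies to the origin Postgres instance.
  const sql = postgres(env.HYPERDRIVE.connectionString, {
    max: 5,             // keep per-isolate connections small
    fetch_types: false, // skip the type-fetch round trip on connect
  });

  try {
    return await sql`SELECT id, name FROM projects WHERE id = ${projectId}`;
  } finally {
    // Hand connections back before the invocation ends.
    await sql.end();
  }
}
```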
3. Debugging Is Different
CloudWatch logs don't exist in this world. We moved to wrangler tail for real-time log streaming during development and Baselime for production observability. The tooling gap is real but narrowing fast.
The Numbers, Four Months Later
Here's where we landed:
- p50 latency: 12ms (down from 180ms)
- p99 latency: 78ms (down from 2,300ms)
- Cold starts: Effectively zero
- Monthly cost: $380 (down from $4,200)
- Deployment time: 8 seconds globally (down from 4 minutes per region)
The 91% cost reduction was the headline number our CFO cared about. But the one that matters to me is the 97% reduction in p99 latency. Our users feel that every single time they interact with the platform.
The Human Cost of Technical Debt
I want to be honest about something the metrics don't show. This migration took three months. During that time, I missed my daughter's school play. I worked through two weekends that I'd promised to my family. The Slack messages at 11pm became routine.
The platform is faster now. The numbers are beautiful. But I've been thinking a lot about what we sacrifice when we treat every technical problem as urgent. This was the right call architecturally. I'm less sure it was the right call for the humans involved -- including me.
"We are what we repeatedly do. Excellence, then, is not an act, but a habit."
Will Durant, paraphrasing Aristotle
Should You Do This?
Maybe. Edge functions are not a universal solution. They're excellent for:
- API gateways and routing layers
- Authentication and authorization
- Content transformation and personalization
- Light data processing and aggregation
They're not great for:
- Long-running computations (30-second CPU time limit)
- Heavy database writes (connection pooling complexity)
- Workloads that need more than 128MB of memory
- Anything requiring Node.js-specific APIs
For us, it was the right move. Our workload was almost entirely I/O-bound API requests -- exactly what edge functions are designed for. Your mileage will vary. Measure first, migrate second.
1name = "platform-api" 2main = "src/router.ts" 3compatibility_date = "2026-01-15" 4 5# This single file deploys to 300+ locations. 6# Try doing that with CloudFormation. 7 8[vars] 9ENVIRONMENT = "production"
If you're exploring this path, start with one function. Pick your simplest, most stateless endpoint and migrate it. Watch the numbers. If they tell a similar story to ours, you'll know what to do next.
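That first step can be tiny: serve the one migrated route from the Worker and proxy everything else back to the existing API. A sketch, with a placeholder health endpoint and origin URL:

```typescript
// Incremental cutover sketch: one endpoint served at the edge, every
// other path proxied to the existing API unchanged. ORIGIN_URL and the
// /api/health route are placeholders.
export default {
  async fetch(request: Request, env: { ORIGIN_URL: string }): Promise<Response> {
    const url = new URL(request.url);

    // The single migrated endpoint.
    if (url.pathname === '/api/health' && request.method === 'GET') {
      return new Response(JSON.stringify({ ok: true, servedFrom: 'edge' }), {
        headers: { 'content-type': 'application/json' },
      });
    }

    // Everything else still hits the old API Gateway origin.
    const origin = new URL(url.pathname + url.search, env.ORIGIN_URL);
    return fetch(new Request(origin.toString(), request));
  },
};
```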
Ship fast. Measure everything. And remember: the best architecture is the one that lets you go home on time.