It started, as most ambitious projects do, with a whiteboard and a lie. "This should take about three months," I told my business partner, sketching boxes and arrows with the confidence of someone who had never built a multi-tenant system before. Eighteen months later, after scrapping the database schema twice, rewriting the authentication layer from scratch, and spending one particularly dark Thursday questioning whether I was even a real engineer, we shipped.
The platform serves 340 organizations today. It handles roughly 2.8 million API requests per day on infrastructure that costs less than a mid-range car payment. And while I'm proud of where it landed, the journey there was anything but smooth. This is the story of every significant decision, every painful mistake, and the handful of insights that actually mattered.
The Tenant Isolation Problem
Before writing a single line of code, we had to answer the foundational question of multi-tenancy: how do you keep one customer's data completely separate from another's? There are three common approaches, and each involves a different set of tradeoffs.
The first option is database-per-tenant: every customer gets their own Postgres database. Maximum isolation, but operational complexity scales linearly. The second is schema-per-tenant: one database, but each customer gets their own schema. A middle ground. The third is shared tables with a tenant_id column: everyone shares everything, separated only by a discriminator in every query.
We chose the third option. Here's why.
"The best architecture is the one your team can actually operate at 2 AM when something breaks. Everything else is theory." -- A senior engineer at a previous job, after the third incident that quarter
With 340 tenants, managing 340 separate databases or schemas would have required automation we didn't have time to build. Shared tables with row-level security gave us strong isolation guarantees without the operational overhead. The key was making sure every single query, without exception, included the tenant context.
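For concreteness, the shared-table shape looks roughly like this. This is a sketch: the post only shows projects and tasks later, so the exact columns and index here are illustrative.
-- Every tenant-scoped table carries the tenant discriminator
CREATE TABLE tenants (
  id   uuid PRIMARY KEY DEFAULT gen_random_uuid(),  -- built in on Postgres 13+
  name text NOT NULL
);

CREATE TABLE projects (
  id        uuid PRIMARY KEY DEFAULT gen_random_uuid(),
  tenant_id uuid NOT NULL REFERENCES tenants(id),
  name      text NOT NULL
);

-- Indexes that lead with tenant_id keep per-tenant queries cheap
CREATE INDEX projects_tenant_id_idx ON projects (tenant_id);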
Row-Level Security: The Foundation
Postgres has a built-in feature called Row-Level Security (RLS) that lets you define policies at the database level. This means even if your application code has a bug and forgets to filter by tenant_id, the database itself will enforce the boundary. It's defense in depth, and it saved us at least twice during development.
Here's the core pattern we used across every table:
-- Enable RLS on the projects table
ALTER TABLE projects ENABLE ROW LEVEL SECURITY;
-- Create a policy that restricts access to the current tenant
CREATE POLICY tenant_isolation ON projects
USING (tenant_id = current_setting('app.current_tenant')::uuid);
-- Apply the same pattern to every tenant-scoped table
ALTER TABLE tasks ENABLE ROW LEVEL SECURITY;
CREATE POLICY tenant_isolation ON tasks
USING (tenant_id = current_setting('app.current_tenant')::uuid);
The trick is the current_setting('app.current_tenant') call. At the start of every request, our middleware sets this session variable based on the authenticated user's organization. Every subsequent query in that transaction automatically scopes to the correct tenant.
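A quick way to sanity-check the policies from psql is to flip the setting between two tenants inside a transaction. This is a sketch with placeholder UUIDs. One caveat worth knowing: table owners and superusers bypass RLS unless you also run ALTER TABLE ... FORCE ROW LEVEL SECURITY, so run the check as the application role.
BEGIN;
SET LOCAL app.current_tenant = '11111111-1111-1111-1111-111111111111';
SELECT count(*) FROM projects;  -- only the first tenant's rows

SET LOCAL app.current_tenant = '22222222-2222-2222-2222-222222222222';
SELECT count(*) FROM projects;  -- a different tenant, a different count
COMMIT;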
The Middleware
On the Node.js side, setting the tenant context looks like this:
import { Request, Response, NextFunction } from 'express';
import { pool } from '../db';

// Assumes the Express Request type has been augmented elsewhere with
// `user` (from the auth layer) and `db` (the scoped client added here).
export async function tenantMiddleware(
  req: Request,
  res: Response,
  next: NextFunction
) {
  const tenantId = req.user?.tenantId;
  if (!tenantId) {
    return res.status(401).json({
      error: 'Tenant context required'
    });
  }

  // Acquire a connection and open a transaction; a transaction-scoped
  // setting has no effect outside a transaction.
  const client = await pool.connect();
  try {
    await client.query('BEGIN');
    // SET LOCAL can't take bind parameters, so set_config() with `true`
    // as its third argument does the same thing, scoped to this transaction.
    await client.query(
      `SELECT set_config('app.current_tenant', $1, true)`,
      [tenantId]
    );

    // Attach the client to the request; commit and release it
    // once the response has finished.
    req.db = client;
    res.on('finish', () => {
      client
        .query('COMMIT')
        .catch(() => {})
        .finally(() => client.release());
    });
    next();
  } catch (err) {
    await client.query('ROLLBACK').catch(() => {});
    client.release();
    next(err);
  }
}
There's a subtlety here that tripped us up for weeks. SET LOCAL -- or its set_config(..., true) equivalent, which the middleware uses because SET can't take bind parameters -- scopes the tenant setting to the current transaction. If you're using connection pooling (and you should be), you need to make sure each request gets its own transaction. Otherwise, tenant context bleeds between requests. We learned this the hard way when a staging test showed one customer's data leaking into another's dashboard. That was not a fun afternoon.
Always use SET LOCAL instead of SET for tenant context in pooled connections. Without the LOCAL qualifier, the setting persists on the connection and can bleed into the next request that reuses it. We caught this in staging. In production, it would have been a data breach.
Authentication and the JWT Dilemma
Every multi-tenant system needs to answer: how does the application know which tenant a request belongs to? We went through three iterations before landing on something solid.
Version 1 extracted the tenant from the subdomain. Clean, intuitive, but broke when customers wanted custom domains. Version 2 embedded the tenant ID in the JWT. Worked, but made token rotation painful and created a subtle bug where cached tokens could reference a tenant the user had been removed from. Version 3 -- what we shipped -- resolves the tenant server-side from the user's active membership, verified on every request.
interface TenantContext {
  tenantId: string;
  role: 'owner' | 'admin' | 'member' | 'viewer';
  permissions: string[];
}

// `db` is the app's shared query helper; ForbiddenError maps to a 403 response.
export async function resolveTenant(
  userId: string,
  requestedTenantId?: string
): Promise<TenantContext> {
  // Fetch all active memberships for this user
  const memberships = await db.query(`
    SELECT tenant_id, role, permissions
    FROM memberships
    WHERE user_id = $1
      AND status = 'active'
      AND revoked_at IS NULL
  `, [userId]);

  if (memberships.rows.length === 0) {
    throw new ForbiddenError('No active memberships');
  }

  // Map a membership row onto the camelCase context the rest of the app expects
  const toContext = (row: any): TenantContext => ({
    tenantId: row.tenant_id,
    role: row.role,
    permissions: row.permissions,
  });

  // If a specific tenant was requested, verify access
  if (requestedTenantId) {
    const match = memberships.rows.find(
      m => m.tenant_id === requestedTenantId
    );
    if (!match) throw new ForbiddenError('Access denied');
    return toContext(match);
  }

  // Default to the user's primary tenant
  return toContext(memberships.rows[0]);
}
This pattern is slightly more expensive -- it requires a database lookup on every request instead of just validating a JWT. But the tradeoff is worth it. Membership changes take effect immediately. There's no stale token problem. And the security model is dead simple to reason about.
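For reference, here's roughly how this slots into the request pipeline ahead of the tenant middleware. This is a sketch: the x-tenant-id header and the module paths are illustrative names, not necessarily what's in the codebase.
import express from 'express';
import { resolveTenant } from './auth/resolve-tenant';
import { tenantMiddleware } from './middleware/tenant';
import { authMiddleware } from './middleware/auth';

const app = express();

// authMiddleware verifies the session and populates req.user
app.use(authMiddleware);

// Resolve the tenant for this request before the database middleware runs
app.use(async (req, res, next) => {
  try {
    // Users who belong to several organizations can pick one explicitly;
    // otherwise they fall back to their primary tenant.
    const requested = req.header('x-tenant-id');
    const ctx = await resolveTenant(req.user!.id, requested);
    req.user!.tenantId = ctx.tenantId;
    next();
  } catch (err) {
    next(err);
  }
});

// tenantMiddleware opens the transaction and sets app.current_tenant
app.use(tenantMiddleware);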
The Connection Pool That Wasn't
About four months into development, we hit a wall. Response times were climbing. Database connections were maxing out. The Node.js process was running out of file descriptors. Everything pointed to a connection pool problem, but our pool was configured correctly -- or so we thought.
The issue was subtle. Our middleware was acquiring connections at the start of each request but only releasing them when the response finished. For most requests, that was fine. But for Server-Sent Events and long-polling endpoints, connections were held open for minutes. With 200 concurrent SSE connections, our pool of 20 connections was permanently exhausted.
"The pool isn't too small. Your connections are living too long." -- Me, at 1:47 AM, staring at pg_stat_activity output
The fix was architectural: SSE endpoints got their own lightweight connection strategy that used LISTEN/NOTIFY with a shared connection instead of holding individual pool connections open.
import { Client } from 'pg';
import { EventEmitter } from 'events';

class RealtimeHub extends EventEmitter {
  private client!: Client;

  async initialize() {
    // Single dedicated connection for all notifications
    this.client = new Client({ connectionString: process.env.DATABASE_URL });
    await this.client.connect();
    await this.client.query('LISTEN tenant_events');

    this.client.on('notification', (msg) => {
      if (!msg.payload) return;
      const payload = JSON.parse(msg.payload);
      // Emit to the specific tenant's channel
      this.emit(`tenant:${payload.tenant_id}`, payload);
    });
  }
}

export const realtimeHub = new RealtimeHub();
One connection now serves all real-time subscribers. Connection pool utilization dropped from 95% to 30% overnight.
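On the consuming side, an SSE endpoint just subscribes to its tenant's channel on the hub and writes events to the response. Here's a sketch; the route path, module path, and the shape of req.user are illustrative.
import { Router } from 'express';
import { realtimeHub } from '../realtime';

const router = Router();

router.get('/events', (req, res) => {
  res.set({
    'Content-Type': 'text/event-stream',
    'Cache-Control': 'no-cache',
    Connection: 'keep-alive',
  });
  res.flushHeaders();

  const channel = `tenant:${req.user!.tenantId}`;
  const forward = (payload: unknown) => {
    res.write(`data: ${JSON.stringify(payload)}\n\n`);
  };

  // The hub's single LISTEN connection fans out to every subscriber;
  // this request holds no connection from the pool.
  realtimeHub.on(channel, forward);
  req.on('close', () => realtimeHub.off(channel, forward));
});

export default router;
The publishing side isn't shown here: events reach the tenant_events channel via NOTIFY -- for example pg_notify('tenant_events', ...) with a JSON payload carrying tenant_id -- issued from the write path or a trigger.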
What I'd Do Differently
Looking back, there are three things I'd change if I started this project today:
- Start with RLS from day one. We added it retroactively, which meant auditing every query in the codebase. If it had been in place from the first migration, we'd have saved three weeks of work.
- Build the tenant-switching UI early. Users who belong to multiple organizations need to switch between them. We treated this as a "nice to have" and shipped it late. It should have been core from the start.
- Invest in integration tests over unit tests. The bugs that hurt us were never in isolated functions. They were in the interactions between middleware, database, and authentication. Integration tests caught what unit tests missed; a sketch of the kind of test I mean follows below.
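Something in this spirit, using node:test -- createTenant and apiAs are hypothetical helpers that seed an organization and return an API client authenticated as one of its users.
import test from 'node:test';
import assert from 'node:assert/strict';
// Hypothetical helpers: createTenant seeds an organization with a user and
// a project; apiAs returns an HTTP client authenticated as that user.
import { createTenant, apiAs } from './helpers';

test("a tenant can never read another tenant's data", async () => {
  const alpha = await createTenant('alpha');
  const beta = await createTenant('beta');

  // Each tenant sees only its own projects...
  const list = await apiAs(alpha.user).get('/projects');
  assert.ok(list.body.every((p: any) => p.tenantId === alpha.id));

  // ...and a direct fetch of the other tenant's project is rejected,
  // by the membership check and by RLS underneath it.
  const cross = await apiAs(alpha.user).get(`/projects/${beta.projectId}`);
  assert.ok(cross.status === 403 || cross.status === 404);
});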
The Human Side
I want to end with something that doesn't appear in technical blog posts often enough. Building this platform was one of the hardest things I've done professionally. There were weeks where I genuinely doubted whether I was capable of finishing it. The second database schema rewrite, in particular, felt like admitting defeat.
But it wasn't defeat. It was learning. The first schema failed because I didn't understand the access patterns. The second failed because I optimized for write performance before I had real traffic data. The third worked because I finally had enough scar tissue to make informed decisions.
If you're in the middle of something hard right now -- a rewrite, a migration, a project that feels like it's never going to ship -- keep going. The engineers who build great systems aren't the ones who get it right the first time. They're the ones who stay long enough to get it right the third time.
If you have questions about multi-tenant architecture, Postgres RLS, or anything else in this post, feel free to reach out. I read every message and do my best to respond. You can also find me on GitHub where most of my open-source work lives.