maxcrofts.com

Durable Outboxes

I have been fortunate enough to be extensively working with Cloudflare Workers as part of my day job as of late. Durable Objects in particular have become an indespensible part of my toolbox. For the uninitiated, Durable Objects extend the serverless compute primitive (think Lambda) to include an SQLite database to enable state persistence. Each Durable Object is closer to a live object instance than a memory-mapped class; state changes have to be persisted explicitly. But they do allow you to build complex, stateful, serverless applications while mostly forgetting about devops.

The Canonical Example

Kenton Varda, introducing the SQLite storage backend for Durable Objects in late 2024, detailed how an airline could manage seat assignment for a flight:

import {DurableObject} from "cloudflare:workers";

// All Workers that specify the same object ID (probably based on the flight
// number and date) will reach the same instance of FlightSeating.
export class FlightSeating extends DurableObject {
	sql = this.ctx.storage.sql;
	// ...
}

Each flight has its own object, and therefore database, instantiated on-demand. Once the hypthothetical plane takes to the skies, the object can safely be deleted, ideally serialized to some long term persistant store. Inside that object, the rest reads like ordinary SQL:

// Assign passenger to a seat.
assignSeat(seatId, occupant) {
	// Check that seat isn't occupied.
	let cursor = this.sql.exec(`SELECT occupant FROM seats WHERE seatId = ?`, seatId);
	let result = [...cursor][0];  // Get the first result from the cursor.
	if (!result) {
		throw new Error("No such seat: " + seatId);
	}
	if (result.occupant !== null) {
		throw new Error("Seat is occupied: " + seatId);
	}

	// If the occupant is already in a different seat, remove them.
	this.sql.exec(`UPDATE seats SET occupant = null WHERE occupant = ?`, occupant);

	// Assign the seat. Note: We don't have to worry that a concurrent request may
	// have grabbed the seat between the two queries, because the code is synchronous
	// (no `await`s) and the database is private to this Durable Object. Nothing else
	// could have changed since we checked that the seat was available earlier!
	this.sql.exec(`UPDATE seats SET occupant = ? WHERE seatId = ?`, occupant, seatId);
}

Look at the comment in assignSeat. There are no awaits between the read and the write, and the database is private to this Durable Object, so nothing else can grab the seat between checks. The race that elsewhere demands SELECT FOR UPDATE or optimistic locking isn't there.

A Quick Performance Detour

SQLite inside a Durable Object runs in-process. this.sql.exec() is a function call, not a network round trip; per-query overhead is in the microseconds.

This is what makes the N+1 pattern affordable. With a database 5ms away over the wire, fetching 100 parent rows and then their children one at a time costs 505ms. In a Durable Object, the same loop is bounded by the queries themselves, performing on par with an equivalent JOIN.

Playing Nice With Others

It is an inevitability that when developing any web app you will have to incorporate another service's API. Oftentimes you will find yourself at the mercy of an outage occurring at said service provider. Naive fetches go unanswered, throwing errors that look like your fault, or worse, allowing your database and the external API to quietly fall out of sync. This is especially important for state-changing APIs like payments, SMS, and transactional email.

Take Azure Communication Services. You hand it your email and, rather than a clean send-and-confirm, you get back an operation handle:

const response = await fetch(`${endpoint}/emails:send?api-version=2023-03-31`, {
	method: "POST",
	body: JSON.stringify({ /* recipients, subject, content */ }),
});

// 202 Accepted, with the verdict to be delivered separately.
const operationLocation = response.headers.get("Operation-Location");

The email hasn't actually gone out yet. To find out how it ended, you come back later and poll:

const operation = await fetch(operationLocation);
const { status } = await operation.json();
// "Running"... try again in a moment.
// "Succeeded"... finally.
// "Failed"... now what?

Each of these calls is its own opportunity to fail. The submit might 429. The poll might 5xx. The operation itself can sit in Running for an embarrassingly long time. By the time you know how it ended, the user has long since closed the tab; your application has to keep track for them.

Sound The Alarm

The outbox is a Durable Object with a SQLite queue and an alarm. Each row is a pending or in-flight item:

CREATE TABLE Queue (
	id            INTEGER PRIMARY KEY AUTOINCREMENT,
	payload       TEXT NOT NULL,
	status        TEXT NOT NULL DEFAULT 'queued',  -- 'queued' | 'sending' | 'sent' | 'failed'
	retries       INTEGER NOT NULL DEFAULT 0,
	operationUrl  TEXT,
	error         TEXT
);

Two entry points. enqueue adds rows and arms the alarm to fire immediately. alarm drains the queue and schedules itself again if there's still work.

async enqueue(payload: object): Promise<void> {
	this.ctx.storage.sql.exec(
		`INSERT INTO Queue (payload) VALUES (?)`,
		JSON.stringify(payload),
	);
	await this.ctx.storage.setAlarm(Date.now());
}

async alarm(): Promise<void> {
	await this.sendPass();
	await this.pollPass();

	const { count } = this.ctx.storage.sql
		.exec(`SELECT COUNT(*) as count FROM Queue WHERE status IN ('queued', 'sending')`)
		.one();

	if (count > 0) {
		await this.ctx.storage.setAlarm(Date.now() + 10_000);
	}
}

The queue row is a small state machine, moving from 'queued' to 'sending' to 'sent' or 'failed' as the passes advance it. Errors don't drop a row from the queue; they bump the retry counter and leave it for the next tick.

private async sendPass(): Promise<void> {
	const queued = this.ctx.storage.sql
		.exec(`SELECT * FROM Queue WHERE status = 'queued' LIMIT 15`)
		.toArray();

	for (const row of queued) {
		try {
			const { operationUrl } = await this.submit(JSON.parse(row.payload));
			this.ctx.storage.sql.exec(
				`UPDATE Queue SET status = 'sending', operationUrl = ? WHERE id = ?`,
				operationUrl, row.id,
			);
		} catch {
			this.ctx.storage.sql.exec(
				`UPDATE Queue SET retries = retries + 1 WHERE id = ?`,
				row.id,
			);
		}
	}
}

private async pollPass(): Promise<void> {
	const sending = this.ctx.storage.sql
		.exec(`SELECT * FROM Queue WHERE status = 'sending' LIMIT 20`)
		.toArray();

	for (const row of sending) {
		const result = await fetch(row.operationUrl).then(r => r.json());
		if (result.status === "Succeeded") {
			this.ctx.storage.sql.exec(`UPDATE Queue SET status = 'sent' WHERE id = ?`, row.id);
		} else if (result.status === "Failed") {
			this.ctx.storage.sql.exec(
				`UPDATE Queue SET status = 'failed', error = ? WHERE id = ?`,
				result.error?.message ?? "delivery failed", row.id,
			);
		}
	}
}

The alarm doubles as the rate limit. The same setAlarm(now + 10_000) that schedules the next tick also prevents the outbox from hitting the external API again for ten seconds. There is no separate token bucket or sleep loop.

At fifteen sends and twenty polls per ten-second tick, the outbox processes ninety items per minute. Tune for your external API's rate limit.

My first instinct was Queues. It does not fit here. Queues processes each event once. The outbox processes the same row across many ticks.

Enter: WebSockets

The user does not need to poll your application to see progress. A WebSocket can push updates as they happen.

Cloudflare provides hibernatable WebSockets for this. Calling ctx.acceptWebSocket() rather than socket.accept() hands the connection back to the platform; the Durable Object can be evicted while the socket stays connected, and only wakes when there is something to send. The connection survives idle time at no compute cost.

async fetch(request: Request): Promise<Response> {
	if (request.headers.get("Upgrade") === "websocket") {
		const pair = new WebSocketPair();
		this.ctx.acceptWebSocket(pair[1]); // hibernatable
		return new Response(null, { status: 101, webSocket: pair[0] });
	}
	return new Response("Not found", { status: 404 });
}

I chose Turbo Stream fragments rather than JSON. Turbo will not be everyone's pick. Sending rendered HTML and letting the browser apply it means there is no client-side state to manage. The server broadcasts what should change, and it changes.

private broadcast(html: string): void {
	for (const ws of this.ctx.getWebSockets()) {
		try { ws.send(html); } catch {}
	}
}

Every mutation that touches queue state can also call broadcast with the appropriate fragment. The Durable Object wakes when the alarm fires, runs, broadcasts, and is evicted again. Connected browsers update without ever sending a request.