Gotchas & Lessons
This is a catalogue of traps caught while designing and reviewing the multi-tenant build, grouped by how they would have hurt:
- Things that would have failed at runtime — crashes, silent drops, and 404s from platform or library behavior that contradicts a reasonable first guess.
- Security holes that would have been exploitable — attacks the naive design enabled, and the mitigation that closed each.
- Subtle correctness issues — the quiet bugs: wrong behavior, sequential scans, build-order failures.
The tables are for scanning — skim them as a checklist when you touch the multi-tenancy surface. Below them, Lessons learned distils the durable takeaways, and the v2 backlog lists what we deferred.
Things that would have failed at runtime
These are platform and library behaviors that contradict a reasonable first guess. Each one would have surfaced as a crash, a silent drop, or a 404 rather than a clean error at design time.
| Gotcha | What it actually does | Where it bites |
|---|---|---|
| Cloudflare Queues are not pub/sub | Only one active consumer per queue. A “queue fan-out” design for cache invalidation silently loses messages or throws when a second consumer registers. | Cache invalidation |
| certificate_authority: "google" is Enterprise-only | API error 1459 on Pro/Business. | Custom hostname creation |
| custom_metadata is Enterprise-only | Silently dropped on Pro/Business, so a reverse lookup that relies on it fails. | Custom hostname creation |
| Wrong CNAME target | app.example.com.cdn.cloudflare.net is undocumented; the correct target is customers.example.com, a proxied CNAME on your zone. | Custom hostname onboarding |
| Better Auth’s hooks API takes a single middleware function | In 2026 it is one createAuthMiddleware(...) function, not the older { matcher, handler } array shape. | All Better Auth hook plugins |
| Request headers are immutable | AUTH.fetch(c.req.raw) after mutating headers throws. You must build new Request(c.req.raw, { headers }). | Auth worker proxy |
| generateIdForModel("tenantHostname") falls through to the ent_* prefix | The switch in ids.ts is closed, so a new model silently gets the wrong prefix. Call generatePrefixedCuid(ID_PREFIXES.tenantHostname) directly. | New-table ID generation |
| organization.metadata is text, not jsonb | It is a Better Auth-managed column, so JSON queries against it fail. | Storing the enforce_sso flag |
| audit_logs.actor_id had an FK to users | Operator gad_* IDs violate it. Drop the FK before writing operator actor IDs. | Admin worker audit |
| Better Auth accept-invitation requires an already-authenticated user | The endpoint takes an { invitationId } body for a signed-in user — it is not a user-creation endpoint. Bootstrapping a user from an invite needs custom orchestration. | Tenant admin onboarding |
| Better Auth createUser is not idempotent | It returns USER_ALREADY_EXISTS on a duplicate email, so the recovery path must catch the error and look up the existing user. | Accept-invite retry |
| sendInvitationEmail is not wired in createOrganizationPlugin | Out of the box no invitation email is sent at all; it must be added. | Operator-led onboarding |
| An apps/app worker without a fetch handler returns 404 even with an ASSETS binding | The minimal (req, env) => env.ASSETS.fetch(req) handler is required. | Tenant SPA serving |
| Missing not_found_handling: "single-page-application" in the apps/app wrangler | TanStack Router client-side routes (e.g. /dashboard) 404 on reload. | Tenant SPA |
| Custom Domain syntax is pattern: "admin.example.com" | No /* and no zone_name; the pattern: "admin.example.com/*" with zone_name form is wrong. | apps/admin wrangler |
| secrets.required is now a real Wrangler config key | It is used for validation and type generation, but secret values are still managed with wrangler secret put. Assuming it is ignored is stale. | All worker wrangler files |
| Better Auth’s dynamic baseURL reads forwarded host/proto first | Raw proxying into the auth worker lets forwarded headers distort callback and trusted-origin behavior. | Auth worker proxy |
| The Turbo task is generate-openapi, not openapi:cache | The fabricated task name would fail the build pipeline. | Web app build |
| workers_dev: false was missing | The default name.account.workers.dev URL bypasses Cloudflare Access entirely. | Critical security exposure |
| There is no whereRole policy DSL builder | The codebase has whereOwner, whereTargetIsSelf, where(predicate), withRelation, and withOrgRole — no whereRole. The per-role matrix is unimplementable until you add whereGlobalAdminRole. | Operator authorization |
| systemAdminRoles short-circuits condition evaluation | Adding global_admin to systemAdminRoles while also using whereGlobalAdminRole policies makes the bypass kill the per-role check — every operator becomes super_admin. Do not add global_admin to systemAdminRoles. | Operator authorization |
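The immutable-headers row and the forwarded-header row in the table above point at the same small pattern in the auth worker proxy: copy the headers, drop what the auth worker should not see, and build a new Request. The following is a minimal sketch, not the project's code. The helper name and the exact list of stripped headers are assumptions; the source only says "forwarded host/proto".

```typescript
// Hypothetical list: the source names only "forwarded host/proto".
const STRIPPED_HEADERS = ["x-forwarded-host", "x-forwarded-proto", "forwarded"];

// Hypothetical helper. Inbound headers cannot be mutated in place in Workers,
// so the changes go on a copied Headers object that is passed to a new Request.
function sanitizeForAuth(original: Request): Request {
  const headers = new Headers(original.headers);
  for (const name of STRIPPED_HEADERS) {
    headers.delete(name);
  }
  return new Request(original, { headers });
}
```

Under these assumptions the proxy would call AUTH.fetch(sanitizeForAuth(c.req.raw)) instead of passing c.req.raw straight through.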
Security holes that would have been exploitable
Each row is an attack the naive design enabled and the mitigation that closed it. The parenthetical D-numbers point at the matching entry in the Decision Log; the full threat model lives in Security.
| Gotcha | Attack | Mitigation |
|---|---|---|
| Queue fan-out plus per-colo cache.delete | Cache invalidation does not reliably propagate, so a suspended tenant keeps serving traffic. | RPC fan-out plus KV cache versioning (D28) |
| Email-fallback for first-login cfAccessSub | An attacker registers the same email at an IdP, races the first login, and silently takes over an operator account. | Enrollment-token model (D31) |
| admin.support.query as a bufferable audit event | An operator scrapes 1000 tenants, queuing 1000 events that may be lost on worker eviction. | Classified CRITICAL plus row cap plus rate limit (D33) |
| Tenant suspension did not revoke active sessions | A 1h–7d window where the suspended tenant’s users keep operating on existing JWTs. | session_version bump plus session DELETE in the same transaction (D34) |
| INTERNAL_ADMIN_TOKEN shared secret with no clear injection point | A leak via logs or error stacks bypasses the organization.create gate, with no clear rotation mechanism. | Removed entirely; the service binding is the perimeter and the admin inserts orgs via Drizzle directly (D35) |
| audit_logs had no append-only invariant at the DB level | A super_admin who is also DB-credentialed could mutate audit history. | A Postgres trigger raises on UPDATE/DELETE (D30) |
| accountLinking.allowDifferentEmails: true (a default in some Better Auth versions) | A tenant-controlled SSO IdP attaches a different email to an existing user, enabling cross-tenant takeover. | Set explicitly to false |
| provisionUser runs after token exchange | A confused-deputy attack: an Acme IdP response replayed at globex’s callback creates a session for the wrong tenant. | ssoCallbackGuardPlugin runs before token exchange |
| trustedOrigins echo-back of the inbound host | If Host is ever spoofable, an attacker marks https://attacker.com as trusted. | A function validates the host against the tenant set |
| OIDC client secrets stored as plaintext in the DB | A backup leak compromises every tenant’s IdP integration. | pgcrypto plus a Postgres view plus log redaction (D13, D73) |
| Better Auth organization.create mounted publicly by default | An authenticated tenant user can create rogue orgs. | An unconditional before hook (D22, D35) |
| SameSite=strict does not isolate sibling subdomains | Tenant subdomains under the same registrable domain are still same-site, and strict cookies can interfere with OAuth/OIDC callback state. | Host-only cookies plus explicit origin/CSRF checks (D15) |
| JWT aud/iss global, no org claim | A JWT minted on tenant A validates against tenant B’s downstream services. | Per-tenant aud/iss plus org.host/org.id/sessionVersion claims (D12, D34) |
| disableSignUp: false (the current template default) | With operator-led onboarding, anyone could still sign up via Better Auth’s standard flow. | disableSignUp: true (D32) |
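The trustedOrigins row above replaces an echo-back of the inbound host with a function that checks the host against the tenant set. The sketch below shows one way such a check could look. The function name, the lookup callback, and the https-only rule are assumptions for illustration, and it deliberately does not show how the result is wired into Better Auth.

```typescript
// Hypothetical lookup: in a real build the tenant host set would come from the database or a cache.
type TenantHostLookup = (host: string) => boolean;

// Return the origin only when its host belongs to a known tenant; never echo it back blindly.
function resolveTrustedOrigin(
  originHeader: string | null,
  isTenantHost: TenantHostLookup,
): string | null {
  if (!originHeader) return null;
  let url: URL;
  try {
    url = new URL(originHeader);
  } catch {
    return null; // malformed Origin header
  }
  if (url.protocol !== "https:") return null; // assumption: tenants are served over https only
  return isTenantHost(url.hostname) ? url.origin : null;
}

const tenantHosts = new Set(["acme.example.com", "globex.example.com"]);
const isKnownHost: TenantHostLookup = (host) => tenantHosts.has(host);

const trusted = resolveTrustedOrigin("https://acme.example.com", isKnownHost);
const rejected = resolveTrustedOrigin("https://attacker.com", isKnownHost);
```

Here trusted is the Acme origin and rejected is null, so a spoofed host never reaches the trusted list.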
Subtle correctness issues
These would not crash and are not exploitable — they are the quiet bugs that produce wrong behavior, sequential scans, or build-order failures.
| Gotcha | Issue |
|---|---|
| The cache API key shape leaks across workers | All three workers must agree on a string format that lived nowhere as a single function. Centralized in @repo/tenancy (D51). |
| OpenAPI build chicken-and-egg | The web app’s code-gen depends on the worker’s openapi.cache.json, but worker builds depend on nothing. Fixed with Turbo dependsOn: ["^generate-openapi"]. |
| Cross-package wrangler asset-directory reference | apps/admin’s wrangler points at ../admin-ui/dist, so build order matters: apps/admin-ui#build must precede apps/admin#deploy. |
| The auth worker has no service binding back to the admin worker | Cache invalidation must be asymmetric: the admin fans out, while auth uses apps/auth → apps/server.invalidateTenant(...) instead. |
| Self-FKs on global_admins.created_by and deactivated_by | Drizzle’s circular self-reference pattern needs a (): AnyPgColumn type cast. |
| Better Auth’s SSO plugin reads the provider table directly from node_modules | Those reads can’t be intercepted, so encryption coexists with them via the sso_providers_decrypted view (D73). |
| pgcrypto SET LOCAL app.sso_key per session | The decryption key must not persist in connection state across requests. It is closure-scoped via withDecryptedSecret. |
| The apex host case is real | app.example.com is a legitimate request with no tenant. Routes that require a tenant must default-deny when c.var.tenant === null, with an allowlist of valid apex routes. |
| Reserved-slug enforcement missing at the DB level | A slug UNIQUE constraint catches collisions, but format and length are not enforced — add a CHECK constraint or rely on app-layer validation plus the UNIQUE constraint. |
| parseHostname must explicitly reject admin.example.com | Otherwise it is classified as a custom tenant lookup and leaks a 404 timing oracle. |
| audit_logs needs (actor_type, created_at DESC) and (organization_id, created_at DESC) indexes | Cross-tenant operator queries and tenant-scoped audit views are common; without indexes they are sequential scans. |
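The apex-host and parseHostname rows above describe a classification order: the apex first, reserved hosts next, and only then tenant lookups. The sketch below illustrates that order. The return shape, the constants, and the treatment of the bare root domain and of multi-label names are all assumptions; only the function name and the admin.example.com and app.example.com hosts come from the table.

```typescript
// Hypothetical result shape for illustration.
type HostClass =
  | { kind: "apex" }
  | { kind: "reserved" }
  | { kind: "subdomain"; slug: string }
  | { kind: "custom"; hostname: string };

const ROOT_DOMAIN = "example.com";
const APEX_HOST = "app.example.com";
// Assumption: the bare root is reserved alongside the admin host.
const RESERVED_HOSTS = new Set(["admin.example.com", ROOT_DOMAIN]);

function parseHostname(rawHost: string): HostClass {
  // Normalize case and drop an optional port.
  const host = rawHost.toLowerCase().replace(/:\d+$/, "");
  if (host === APEX_HOST) return { kind: "apex" };
  // Reserved hosts are rejected before any tenant lookup,
  // so they never fall through to the custom-hostname path.
  if (RESERVED_HOSTS.has(host)) return { kind: "reserved" };
  const suffix = "." + ROOT_DOMAIN;
  if (host.endsWith(suffix)) {
    const slug = host.slice(0, -suffix.length);
    // Only a single label counts as a tenant slug; deeper names are treated as reserved.
    return slug.length > 0 && !slug.includes(".")
      ? { kind: "subdomain", slug }
      : { kind: "reserved" };
  }
  return { kind: "custom", hostname: host };
}
```

A route that requires a tenant can then default-deny on the apex and reserved kinds without touching the database.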
Lessons learned
If you skip the tables, read this section. Each lesson is one trap above, generalised into something you can carry to your own multi-tenant build — the kind of thing you wish someone had told you before you wrote the code, not after the incident.
About multi-tenancy on Workers
Cache invalidation will be the hardest thing you build — design it first, not last. The intuitive answer (a queue everyone subscribes to) is the wrong primitive: Cloudflare Queues allow only one consumer, so a fan-out design silently drops messages. What worked was RPC fan-out plus KV cache versioning. And the workers aren’t symmetric — the admin fans out to everyone, but the auth worker has no binding back to admin, so it invalidates a different way. If a “cache invalidation” line item looks small on your plan, move it to the top.
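The cache-versioning half of that fix can be sketched in a few lines. This is an illustration of the general technique, not the @repo/tenancy implementation: the key formats and function names are hypothetical, and a Map stands in for a KV namespace, whose real reads and writes are asynchronous.

```typescript
// Stand-in for a KV namespace so the sketch runs anywhere.
const kv = new Map<string, string>();

// Hypothetical key formats. The point is that every worker builds keys through one function.
function versionKey(tenantId: string): string {
  return `tenant:${tenantId}:version`;
}

function currentVersion(tenantId: string): number {
  return Number(kv.get(versionKey(tenantId)) ?? "0");
}

function cacheKey(tenantId: string, resource: string): string {
  return `tenant:${tenantId}:v${currentVersion(tenantId)}:${resource}`;
}

// Invalidation bumps the version, so entries written under the old version
// are no longer addressed even if an explicit delete never reaches them.
function invalidateTenant(tenantId: string): void {
  kv.set(versionKey(tenantId), String(currentVersion(tenantId) + 1));
}

const keyBefore = cacheKey("acme", "settings");
invalidateTenant("acme");
const keyAfter = cacheKey("acme", "settings");
```

After the bump, keyAfter differs from keyBefore, so readers miss the stale entry. KV is itself eventually consistent, which is one reason the versioning is paired with RPC fan-out rather than used alone.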
Pass tenant context as a typed RPC parameter, never as a header. The moment tenancy rides in an HTTP header you’ve signed up for algorithm choice, replay protection, and downgrade attacks — a whole security surface, for free, that you didn’t want. A service-binding RPC call with a typed argument makes that entire class of bug unrepresentable. Prefer the boring typed call.
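As a rough illustration of the typed call, the sketch below models a service-binding stub with a plain interface. Every name here is hypothetical, and the in-memory class exists only so the example runs outside Workers.

```typescript
// Hypothetical shapes for illustration; not the actual @repo/tenancy types.
interface TenantContext {
  tenantId: string;
  host: string;
}

interface TenantRpc {
  invalidateTenant(ctx: TenantContext): Promise<void>;
}

// In-memory stand-in for a service-binding stub.
class FakeTenantRpc implements TenantRpc {
  readonly calls: TenantContext[] = [];

  invalidateTenant(ctx: TenantContext): Promise<void> {
    this.calls.push(ctx);
    return Promise.resolve();
  }
}

// Tenancy travels as an argument the compiler checks,
// so there is no header to sign, parse, replay, or forget.
function suspendTenant(rpc: TenantRpc, ctx: TenantContext): Promise<void> {
  return rpc.invalidateTenant(ctx);
}

const rpc = new FakeTenantRpc();
void suspendTenant(rpc, { tenantId: "acme", host: "acme.example.com" });
```

Omitting tenantId or passing a bare string fails at compile time, which is the property a header cannot give you.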
Per-host cookies are necessary but not sufficient. It is tempting to think
subdomains isolate tenants. They don’t: a.example.com and b.example.com are
same-site, so a strict-SameSite cookie does nothing to stop sibling-tenant
confusion. The real boundary is host-only cookies plus an explicit origin/CSRF check
on every mutation.
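Both halves of that boundary are small. The sketch below is illustrative only: the function names are hypothetical, and SameSite=Lax is an assumption chosen because the tables note that strict cookies can interfere with OAuth/OIDC callbacks.

```typescript
// Host-only cookie: omitting the Domain attribute scopes it to the exact host that set it.
function buildSessionCookie(name: string, value: string): string {
  // SameSite=Lax is an assumption, not a documented project setting.
  return `${name}=${encodeURIComponent(value)}; Path=/; HttpOnly; Secure; SameSite=Lax`;
}

const SAFE_METHODS = new Set(["GET", "HEAD", "OPTIONS"]);

// Allow a mutation only when the Origin header matches the origin of the request URL.
function isAllowedMutation(
  method: string,
  requestUrl: string,
  originHeader: string | null,
): boolean {
  if (SAFE_METHODS.has(method.toUpperCase())) return true;
  if (!originHeader) return false;
  return originHeader === new URL(requestUrl).origin;
}
```

With this check a POST to a.example.com carrying Origin: https://b.example.com is refused even though the two hosts are same-site.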
One JWT check is never enough — scope tokens on five axes. aud alone, iss alone,
even both together let a token minted for tenant A validate against tenant B. Per-tenant
aud/iss narrows it; the org.id and org.host claims pin the tenant; and sessionVersion is the part
people forget — without it you have no way to revoke, which is exactly what you need
the day you suspend a tenant.
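A sketch of the five-axis check follows. The claim names come from the security table above; the nesting of the org claim, the type names, and the function name are assumptions. Signature and expiry verification are assumed to have happened before this runs.

```typescript
// Assumed claim layout: org.id and org.host nested under a single org claim.
interface TenantClaims {
  aud: string;
  iss: string;
  org: { id: string; host: string };
  sessionVersion: number;
}

// What the current request's tenant expects to see in the token.
interface ExpectedTenant {
  aud: string;
  iss: string;
  orgId: string;
  host: string;
  sessionVersion: number; // the value currently stored for the tenant
}

// Returns the first failing axis, or null when all five match.
function firstClaimMismatch(claims: TenantClaims, expected: ExpectedTenant): string | null {
  if (claims.aud !== expected.aud) return "aud";
  if (claims.iss !== expected.iss) return "iss";
  if (claims.org.id !== expected.orgId) return "org.id";
  if (claims.org.host !== expected.host) return "org.host";
  // A bumped sessionVersion invalidates every token minted before the bump.
  if (claims.sessionVersion !== expected.sessionVersion) return "sessionVersion";
  return null;
}
```

Suspending a tenant then only has to bump the stored sessionVersion for every outstanding token to fail the last check.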
About Better Auth
For operator-led SaaS, turn self-signup off and mean it. Self-serve and operator-led
onboarding don’t mix gracefully — leave the default disableSignUp: false in place and
“anyone can sign up” quietly co-exists with your invite-only flow. Set
disableSignUp: true and build the one onboarding path you actually want.
accept-invitation assumes the user already exists — it won’t create one. It’s
designed for a signed-in user accepting an org invite, not for bootstrapping a brand-new
account from an email link. If your invite is the account-creation moment, you write
that orchestration yourself.
Run tenant-binding checks before the IdP code is exchanged, not after. provisionUser
fires after token exchange, which is too late: a response meant for Acme, replayed at
Globex’s callback, has already minted a session. The guard has to sit in front of the
exchange.
Treat account-linking defaults as hostile until you’ve pinned every one. A
permissive default (allowDifferentEmails: true) lets a tenant-controlled IdP attach a
different email to an existing user — a cross-tenant takeover. Set
accountLinking.enabled and allowDifferentEmails explicitly, keep
trustedProviders: [] empty, and approve linking by hand inside provisionUser.
Re-check the hooks API against the version you’re on. In 2026 a hook is a single
createAuthMiddleware(...) function, not the older { matcher, handler } array. Library
shapes drift between majors; a tutorial from last year will compile against your types
and then misbehave.
About Cloudflare for SaaS
You don’t need Enterprise for v1 — but you do need to know which line items are gated.
Pro/Business covers the v1 build. custom_metadata, certificate_authority: "google",
and wildcard custom certs are Enterprise-only, and the cruel part is how they fail:
custom_metadata is silently dropped, certificate_authority throws API error 1459.
Confirm the plan tier of every feature you lean on before you depend on it.
Verify domain ownership on your side before you ask the CA for a cert. HTTP DCV alone won’t let a competitor steal a cert for your customer’s domain — they can’t serve the challenge — but they can burn your cert quota by requesting hostnames you’ll never activate. A TXT pre-verification step on your side closes the squatting window.
Point the CNAME at a proxied CNAME on your own zone. Not a worker URL, and not the
...cdn.cloudflare.net form (which is undocumented and wrong). The correct target is a
proxied record on your zone, e.g. customers.example.com.
Webhooks exist for SSL and hostname state — know they’re there before you reach for polling. v1 polls every 60 seconds, which is simple and fine; just don’t mistake polling for the only option when activation latency starts to matter.
About Cloudflare Access
workers_dev: false is one line, and forgetting it un-protects everything. Access
binds protection at the hostname level, so the default name.account.workers.dev URL
sails right past your Access policy. The admin panel you carefully gated is reachable,
unauthenticated, on a URL you didn’t think about.
You can’t verify MFA inside the worker — there’s no amr claim. Don’t try to assert
“this user did MFA” from the Access JWT; rely on the Access policy and IdP-side
enforcement instead.
Reject service tokens on human-only panels. A service-token JWT has a different shape
(type: "app", common_name set, no email); if your admin panel assumes a human, an
automated token can walk in. Check the shape and refuse it.
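A shape check along those lines might look like the sketch below. The interface lists only the fields the check reads, and the function name is hypothetical. It keys on common_name and email rather than on type, on the assumption that type alone does not separate a service token from a human login.

```typescript
// Only the fields this check reads; the shape follows the description above.
interface AccessJwtPayload {
  sub?: string;
  email?: string;
  type?: string;
  common_name?: string;
}

// Treat the identity as human only when it has an email and no service-token common_name.
function isHumanIdentity(payload: AccessJwtPayload): boolean {
  if (payload.common_name) return false;
  if (!payload.email) return false;
  return true;
}

const serviceToken: AccessJwtPayload = { type: "app", common_name: "ci-bot.access", sub: "" };
const operator: AccessJwtPayload = { type: "app", email: "operator@example.com", sub: "user-1" };
```

A human-only panel would refuse the request whenever isHumanIdentity returns false, before any role lookup runs.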
payload.sub isn’t guaranteed stable — and the obvious workaround is a takeover hole.
Falling back to email for first-login identity is the natural fix and also the exact
vector an attacker uses to race-register an operator’s email at an IdP. Bind first login
with an enrollment token instead.
About architecture and module boundaries
Land your core runtime boundary early; deferring it is technical debt that compounds.
@repo/tenancy isn’t a “nice refactor for later” — it’s a first-order runtime boundary
every worker depends on, so it has to ship in Phase A. The genuinely deep modules that
can wait should wait (Phase C); the trick is telling the two apart.
Decide your dependency direction once and enforce it. @repo/tenancy imports schemas
from @repo/db, which makes the reverse import dangerously easy to add by reflex — and
that creates a cycle. Pick the arrow (db never imports tenancy) and hold the line.
Test deep modules at the boundary, not at every internal helper. The whole point of a deep module is a small interface over a lot of behavior; test that interface and resist the urge to pin down every private function — those tests just make refactoring painful.
Honor the project conventions that are easy to forget. Each new package needs an
AGENTS.md. Small, mechanical, skipped exactly when you’re moving fast.
Future work / v2 backlog
These never made it into a locked decision — they are deferred to a later version or to implementation-time follow-up.
- Custom-hostname allowlist refresh strategy — per-request DB lookup versus in-memory with a 5-minute refresh. Leaning toward in-memory.
- Webhook integration for hostname state changes, replacing or supplementing the 60-second polling.
- Queue-driven hostname reconciler, once sustained pending hostnames exceed ~6,000.
- Tenant-scoped user identity, to close passkey/two-factor cross-tenant data bleed.
- enforce_sso default — auto-flip to true on the first verified SSO provider, or always require explicit opt-in. Leaning opt-in.
- In-app TOTP for operators, as defense in depth above Cloudflare Access MFA.
- Customer notification on operator access — some SaaS notify the customer when a support engineer reads their data. v1 ships the audit log only; v2 considers email.
- Approval workflow for destructive operator actions (delete tenant, deactivate a super_admin). v1 ships single-actor; v2 may add a two-person rule.
- Operator activity dashboard — surface lastActiveAt and daily action counts.
- Per-tenant feature flags — the schema is not designed yet; v1 ships scaffolding.
- MFA verification in the worker via the Cloudflare Access Identity API.
- Logout UX in the admin panel — proxy /cdn-cgi/access/logout for an explicit sign-out, versus relying on a browser bookmark.
- Marketing / find-your-team page on the apex (D76) — v1 ships a static page; v2 adds the backend lookup.