Gotchas & Lessons

This is a catalogue of traps caught while designing and reviewing the multi-tenant build, grouped by how they would have hurt:

Things that would have failed at runtime — crashes, silent drops, and 404s from platform or library behavior that contradicts a reasonable first guess.
Security holes that would have been exploitable — attacks the naive design enabled, and the mitigation that closed each.
Subtle correctness issues — the quiet bugs: wrong behavior, sequential scans, build-order failures.

The tables are for scanning — skim them as a checklist when you touch the multi-tenancy surface. Below them, Lessons learned distils the durable takeaways, and the v2 backlog lists what we deferred.

Things that would have failed at runtime

These are platform and library behaviors that contradict a reasonable first guess. Each one would have surfaced as a crash, a silent drop, or a 404 rather than a clean error at design time.

Gotcha	What it actually does	Where it bites
Cloudflare Queues are not pub/sub	Only one active consumer per queue. A “queue fan-out” design for cache invalidation silently loses messages or throws when a second consumer registers.	Cache invalidation
`certificate_authority: "google"` is Enterprise-only	API error 1459 on Pro/Business.	Custom hostname creation
`custom_metadata` is Enterprise-only	Silently dropped on Pro/Business, so a reverse lookup that relies on it fails.	Custom hostname creation
Wrong CNAME target	`app.example.com.cdn.cloudflare.net` is undocumented; the correct target is `customers.example.com`, a proxied CNAME on your zone.	Custom hostname onboarding
Better Auth’s hooks API takes a single middleware function	In 2026 it is one `createAuthMiddleware(...)` function, not the older `{ matcher, handler }` array shape.	All Better Auth hook plugins
`Request` headers are immutable	`AUTH.fetch(c.req.raw)` after mutating headers throws. You must build `new Request(c.req.raw, { headers })`.	Auth worker proxy
`generateIdForModel("tenantHostname")` falls through to the `ent_*` prefix	The switch in `ids.ts` is closed, so a new model silently gets the wrong prefix. Call `generatePrefixedCuid(ID_PREFIXES.tenantHostname)` directly.	New-table ID generation
`organization.metadata` is `text`, not `jsonb`	It is a Better Auth-managed column, so JSON queries against it fail.	Storing the `enforce_sso` flag
`audit_logs.actor_id` had an FK to `users`	Operator `gad_*` IDs violate it. Drop the FK before writing operator actor IDs.	Admin worker audit
Better Auth `accept-invitation` requires an already-authenticated user	The endpoint takes an `{ invitationId }` body for a signed-in user — it is not a user-creation endpoint. Bootstrapping a user from an invite needs custom orchestration.	Tenant admin onboarding
Better Auth `createUser` is not idempotent	It returns `USER_ALREADY_EXISTS` on a duplicate email, so the recovery path must catch the error and look up the existing user.	Accept-invite retry
`sendInvitationEmail` is not wired in `createOrganizationPlugin`	Out of the box no invitation email is sent at all; it must be added.	Operator-led onboarding
An `apps/app` worker without a fetch handler returns 404 even with an `ASSETS` binding	The minimal `(req, env) => env.ASSETS.fetch(req)` handler is required.	Tenant SPA serving
Missing `not_found_handling: "single-page-application"` in the `apps/app` wrangler	TanStack Router client-side routes (e.g. `/dashboard`) 404 on reload.	Tenant SPA
Custom Domain syntax is `pattern: "admin.example.com"`	No `/` and no `zone_name`; the `pattern: "admin.example.com/"` with `zone_name` form is wrong.	`apps/admin` wrangler
`secrets.required` is now a real Wrangler config key	It is used for validation and type generation, but secret values are still managed with `wrangler secret put`. Assuming it is ignored is stale.	All worker wrangler files
Better Auth’s dynamic `baseURL` reads forwarded host/proto first	Raw proxying into the auth worker lets forwarded headers distort callback and trusted-origin behavior.	Auth worker proxy
The Turbo task is `generate-openapi`, not `openapi:cache`	The fabricated task name would fail the build pipeline.	Web app build
`workers_dev: false` was missing	The default `name.account.workers.dev` URL bypasses Cloudflare Access entirely.	`apps/admin` wrangler (critical security exposure)
There is no `whereRole` policy DSL builder	The codebase has `whereOwner`, `whereTargetIsSelf`, `where(predicate)`, `withRelation`, and `withOrgRole` — no `whereRole`. The per-role matrix is unimplementable until you add `whereGlobalAdminRole`.	Operator authorization
`systemAdminRoles` short-circuits condition evaluation	Adding `global_admin` to `systemAdminRoles` while also using `whereGlobalAdminRole` policies makes the bypass kill the per-role check — every operator becomes `super_admin`. Do not add `global_admin` to `systemAdminRoles`.	Operator authorization
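
The immutable-headers row and the forwarded-host row combine into one pattern: copy the headers, edit the copy, and build a fresh `Request`. What follows is a minimal sketch, not the project's actual proxy code. The helper name `rebuildRequest` is hypothetical, and the header names `x-forwarded-host` and `x-forwarded-proto` are assumed to be the forwarded headers in question.

```typescript
// Hypothetical helper: returns a new Request with edited headers instead of
// mutating the headers of an incoming request, which are immutable in Workers.
function rebuildRequest(
  original: Request,
  set: Record<string, string> = {},
  remove: string[] = ["x-forwarded-host", "x-forwarded-proto"], // assumed names
): Request {
  const headers = new Headers(original.headers); // mutable copy
  for (const name of remove) headers.delete(name);
  for (const [name, value] of Object.entries(set)) headers.set(name, value);
  // Method, URL, and body come from the original; only headers are replaced.
  return new Request(original, { headers });
}
```

In the proxy this would stand in for a direct `AUTH.fetch(c.req.raw)`, for example `AUTH.fetch(rebuildRequest(c.req.raw))`.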

Security holes that would have been exploitable

Each row is an attack the naive design enabled and the mitigation that closed it. The parenthetical D-numbers point at the matching entry in the Decision Log; the full threat model lives in Security.

Gotcha	Attack	Mitigation
Queue fan-out plus per-colo `cache.delete`	Cache invalidation does not reliably propagate, so a suspended tenant keeps serving traffic.	RPC fan-out plus KV cache versioning (D28)
Email-fallback for first-login `cfAccessSub`	An attacker registers the same email at an IdP, races the first login, and silently takes over an operator account.	Enrollment-token model (D31)
`admin.support.query` as a bufferable audit event	An operator scrapes 1000 tenants, queuing 1000 events that may be lost on worker eviction.	Classified CRITICAL plus row cap plus rate limit (D33)
Tenant suspension did not revoke active sessions	A 1h–7d window where the suspended tenant’s users keep operating on existing JWTs.	`session_version` bump plus session `DELETE` in the same transaction (D34)
`INTERNAL_ADMIN_TOKEN` shared secret with no clear injection point	A leak via logs or error stacks bypasses the `organization.create` gate, with no clear rotation mechanism.	Removed entirely; the service binding is the perimeter and the admin inserts orgs via Drizzle directly (D35)
`audit_logs` had no append-only invariant at the DB level	A `super_admin` who is also DB-credentialed could mutate audit history.	A Postgres trigger raises on `UPDATE`/`DELETE` (D30)
`accountLinking.allowDifferentEmails: true` (a default in some Better Auth versions)	A tenant-controlled SSO IdP attaches a different email to an existing user, enabling cross-tenant takeover.	Set explicitly to `false`
`provisionUser` runs after token exchange	A confused-deputy attack: an Acme IdP response replayed at Globex’s callback creates a session for the wrong tenant.	`ssoCallbackGuardPlugin` runs before token exchange
`trustedOrigins` echo-back of the inbound host	If `Host` is ever spoofable, an attacker marks `https://attacker.com` as trusted.	A function validates the host against the tenant set
OIDC client secrets stored as plaintext in the DB	A backup leak compromises every tenant’s IdP integration.	`pgcrypto` plus a Postgres view plus log redaction (D13, D73)
Better Auth `organization.create` mounted publicly by default	An authenticated tenant user can create rogue orgs.	An unconditional `before` hook (D22, D35)
`SameSite=strict` does not isolate sibling subdomains	Tenant subdomains under the same registrable domain are still same-site, and strict cookies can interfere with OAuth/OIDC callback state.	Host-only cookies plus explicit origin/CSRF checks (D15)
JWT `aud`/`iss` global, no `org` claim	A JWT minted on tenant A validates against tenant B’s downstream services.	Per-tenant `aud`/`iss` plus `org.host`/`org.id`/`sessionVersion` claims (D12, D34)
`disableSignUp: false` (the current template default)	With operator-led onboarding, anyone could still sign up via Better Auth’s standard flow.	`disableSignUp: true` (D32)
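
The `trustedOrigins` row is easiest to see in code. This sketch assumes the set of known tenant hosts is already loaded; where it comes from, and how the function is wired into the auth config, are outside what this page specifies. `trustedOriginsForHost` is a hypothetical name.

```typescript
// Hypothetical helper: derive trusted origins from an allowlist of tenant
// hosts rather than echoing back whatever Host header arrived.
function trustedOriginsForHost(
  inboundHost: string | null,
  tenantHosts: ReadonlySet<string>,
): string[] {
  if (!inboundHost) return [];
  const host = inboundHost.trim().toLowerCase();
  // Only an exact match against the known tenant set is trusted.
  return tenantHosts.has(host) ? [`https://${host}`] : [];
}
```

An unknown host yields an empty list, so a spoofed `Host` cannot add itself to the trusted set.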

Subtle correctness issues

These would not crash and are not exploitable — they are the quiet bugs that produce wrong behavior, sequential scans, or build-order failures.

Gotcha	Issue
The cache API key shape leaks across workers	All three workers must agree on a string format that lived nowhere as a single function. Centralized in `@repo/tenancy` (D51).
OpenAPI build chicken-and-egg	The web app’s code-gen depends on the worker’s `openapi.cache.json`, but worker builds depend on nothing. Fixed with Turbo `dependsOn: ["^generate-openapi"]`.
Cross-package wrangler asset-directory reference	`apps/admin`’s wrangler points at `../admin-ui/dist`, so build order matters: `apps/admin-ui#build` must precede `apps/admin#deploy`.
The auth worker has no service binding back to the admin worker	Cache invalidation must be asymmetric: the admin fans out, while auth uses `apps/auth → apps/server.invalidateTenant(...)` instead.
Self-FKs on `global_admins.created_by` and `deactivated_by`	Drizzle’s circular self-reference pattern needs a `(): AnyPgColumn` type cast.
Better Auth’s SSO plugin reads the provider table directly from `node_modules`	Those reads can’t be intercepted, so encryption coexists with them via the `sso_providers_decrypted` view (D73).
`pgcrypto` `SET LOCAL app.sso_key` per session	The decryption key must not persist in connection state across requests. It is closure-scoped via `withDecryptedSecret`.
The apex host case is real	`app.example.com` is a legitimate request with no tenant. Routes that require a tenant must default-deny when `c.var.tenant === null`, with an allowlist of valid apex routes.
Reserved-slug enforcement missing at the DB level	A slug `UNIQUE` constraint catches collisions, but reserved names, format, and length are not enforced at the DB level. Add a `CHECK` constraint, or rely on app-layer validation plus the `UNIQUE` constraint.
`parseHostname` must explicitly reject `admin.example.com`	Otherwise it is classified as a custom tenant lookup and leaks a 404 timing oracle.
`audit_logs` needs `(actor_type, created_at DESC)` and `(organization_id, created_at DESC)` indexes	Cross-tenant operator queries and tenant-scoped audit views are common; without indexes they are sequential scans.
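
The apex and `admin.example.com` rows both come down to classifying the host before any tenant lookup runs. The sketch below is a stand-in for `parseHostname`, not its real implementation: `classifyHost`, `HostConfig`, `routeAllowed`, and the `tenantSuffix` convention are all hypothetical names chosen for illustration.

```typescript
type HostClass =
  | { kind: "apex" }
  | { kind: "reserved" }
  | { kind: "tenant-subdomain"; slug: string }
  | { kind: "custom"; hostname: string };

interface HostConfig {
  apexHost: string;        // e.g. "app.example.com"
  reservedHosts: string[]; // e.g. ["admin.example.com"]
  tenantSuffix: string;    // assumed convention, e.g. ".example.com"
}

function classifyHost(rawHost: string, cfg: HostConfig): HostClass {
  const host = rawHost.trim().toLowerCase().replace(/:\d+$/, ""); // drop port
  if (host === cfg.apexHost) return { kind: "apex" };
  // Reserved hosts are rejected before any custom-hostname lookup can run.
  if (cfg.reservedHosts.includes(host)) return { kind: "reserved" };
  if (host.endsWith(cfg.tenantSuffix)) {
    const slug = host.slice(0, -cfg.tenantSuffix.length);
    if (slug.length > 0 && !slug.includes(".")) {
      return { kind: "tenant-subdomain", slug };
    }
  }
  return { kind: "custom", hostname: host };
}

// Default-deny: a route runs only with a resolved tenant, unless its path is
// on an explicit allowlist of apex routes.
function routeAllowed(
  tenant: object | null,
  path: string,
  apexAllowlist: ReadonlySet<string>,
): boolean {
  return tenant !== null || apexAllowlist.has(path);
}
```

Checking the apex and reserved hosts first means neither can fall through to the custom-hostname branch.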

Lessons learned

If you skip the tables, read this section. Each lesson is one trap above, generalised into something you can carry to your own multi-tenant build — the kind of thing you wish someone had told you before you wrote the code, not after the incident.

About multi-tenancy on Workers

Cache invalidation will be the hardest thing you build — design it first, not last. The intuitive answer (a queue everyone subscribes to) is the wrong primitive: Cloudflare Queues allow only one consumer, so a fan-out design silently drops messages. What worked was RPC fan-out plus KV cache versioning. And the workers aren’t symmetric — the admin fans out to everyone, but the auth worker has no binding back to admin, so it invalidates a different way. If a “cache invalidation” line item looks small on your plan, move it to the top.
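
One way to read "KV cache versioning" is that the version number is part of the cache key, so a bump makes every older entry unreachable without deleting anything per colo. The sketch below illustrates that idea with in-memory maps standing in for KV and the cache. The key format, `VersionStore`, and `tenantCacheKey` are hypothetical; the real key shape lives in `@repo/tenancy`.

```typescript
// In-memory stand-in for a KV namespace so the sketch runs anywhere.
// Real KV reads and writes are asynchronous.
class VersionStore {
  private versions = new Map<string, number>();

  current(tenantId: string): number {
    return this.versions.get(tenantId) ?? 1;
  }

  bump(tenantId: string): number {
    const next = this.current(tenantId) + 1;
    this.versions.set(tenantId, next);
    return next;
  }
}

// The version is part of the key, so bumping it orphans every older entry.
function tenantCacheKey(tenantId: string, version: number): string {
  return `tenant:${tenantId}:v${version}`;
}

const versions = new VersionStore();
const cache = new Map<string, string>();

// Populate the cache under the current version.
cache.set(tenantCacheKey("org_acme", versions.current("org_acme")), "active");

// Suspending the tenant bumps the version instead of deleting cache entries.
versions.bump("org_acme");

// Readers now compute a new key and miss, so they fall back to the database.
const hit = cache.get(tenantCacheKey("org_acme", versions.current("org_acme")));
```

The stale entry still exists under the old key, but nothing asks for that key again, so it simply ages out.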

Pass tenant context as a typed RPC parameter, never as a header. The moment tenancy rides in an HTTP header you have to sign that header, and with signing you’ve taken on algorithm choice, replay protection, and downgrade attacks: a whole security surface you didn’t want. A service-binding RPC call with a typed argument makes that entire class of bug unrepresentable. Prefer the boring typed call.

Per-host cookies are necessary but not sufficient. It is tempting to think subdomains isolate tenants. They don’t: a.example.com and b.example.com are same-site, so a strict-SameSite cookie does nothing to stop sibling-tenant confusion. The real boundary is host-only cookies plus an explicit origin/CSRF check on every mutation.
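
Both halves of that boundary fit in a few lines. This is a sketch under assumptions, not the project's middleware: the function names are hypothetical, and `SameSite=Lax` is an assumed choice made here because strict cookies can interfere with OAuth/OIDC callbacks.

```typescript
const SAFE_METHODS = new Set(["GET", "HEAD", "OPTIONS"]);

// Reject state-changing requests whose Origin is not the exact origin
// that is serving them. Sibling subdomains fail this check.
function isAllowedMutation(
  method: string,
  requestUrl: string,
  originHeader: string | null,
): boolean {
  if (SAFE_METHODS.has(method.toUpperCase())) return true;
  if (!originHeader) return false; // fail closed when Origin is missing
  try {
    return new URL(originHeader).origin === new URL(requestUrl).origin;
  } catch {
    return false; // malformed Origin or URL
  }
}

// Host-only cookie: no Domain attribute, so the browser never sends it to
// sibling subdomains. SameSite=Lax is an assumption of this sketch.
function hostOnlySessionCookie(name: string, value: string): string {
  return `${name}=${encodeURIComponent(value)}; Path=/; HttpOnly; Secure; SameSite=Lax`;
}
```

A `POST` to `a.example.com` carrying `Origin: https://b.example.com` is same-site but not same-origin, so the check rejects it.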

One JWT check is never enough: scope tokens on five axes (`aud`, `iss`, `org.id`, `org.host`, and `sessionVersion`). Global `aud` alone, `iss` alone, or even both together let a token minted for tenant A validate against tenant B. Per-tenant `aud`/`iss` narrows it; the `org` claims pin the tenant; and `sessionVersion` is the part people forget. Without it you have no way to revoke, which is exactly what you need the day you suspend a tenant.
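
A claim check along those five axes might look like the sketch below. It assumes the signature has already been verified, that `aud` is a single string, and that the org claims are nested as `org.id` and `org.host`. `failedAxes` and both interfaces are hypothetical names.

```typescript
interface TenantClaims {
  aud: string;
  iss: string;
  org: { id: string; host: string }; // assumed nesting for org.id / org.host
  sessionVersion: number;
}

interface ExpectedTenant {
  aud: string;
  iss: string;
  orgId: string;
  host: string;
  sessionVersion: number; // current value stored server-side
}

// Returns the names of every axis that does not match; empty means accept.
function failedAxes(claims: TenantClaims, expected: ExpectedTenant): string[] {
  const failures: string[] = [];
  if (claims.aud !== expected.aud) failures.push("aud");
  if (claims.iss !== expected.iss) failures.push("iss");
  if (claims.org.id !== expected.orgId) failures.push("org.id");
  if (claims.org.host !== expected.host) failures.push("org.host");
  if (claims.sessionVersion !== expected.sessionVersion) {
    failures.push("sessionVersion"); // stale after a suspension bump
  }
  return failures;
}
```

Returning the failing axes rather than a bare boolean makes rejected tokens easier to explain in logs.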

About Better Auth

For operator-led SaaS, turn self-signup off and mean it. Self-serve and operator-led onboarding don’t mix gracefully — leave the default disableSignUp: false in place and “anyone can sign up” quietly co-exists with your invite-only flow. Set disableSignUp: true and build the one onboarding path you actually want.

accept-invitation assumes the user already exists — it won’t create one. It’s designed for a signed-in user accepting an org invite, not for bootstrapping a brand-new account from an email link. If your invite is the account-creation moment, you write that orchestration yourself.

Run tenant-binding checks before the IdP code is exchanged, not after. provisionUser fires after token exchange, which is too late: a response meant for Acme, replayed at Globex’s callback, has already minted a session. The guard has to sit in front of the exchange.

Treat account-linking defaults as hostile until you’ve pinned every one. A permissive default (`allowDifferentEmails: true`) lets a tenant-controlled IdP attach a different email to an existing user, which is a cross-tenant takeover. Set `accountLinking.enabled` and `allowDifferentEmails` explicitly, keep `trustedProviders` empty (`[]`), and approve linking by hand inside `provisionUser`.
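
Pinned explicitly, the options might look like the fragment below. It is only a sketch: the nesting under `account.accountLinking` follows Better Auth's documented option shape as far as we know, so check it against the version you have installed. The value of `enabled` is an assumption, since the paragraph above only says to set it explicitly.

```typescript
// Option fragment only. In a real project this object would be part of the
// argument passed to betterAuth(); nothing here is imported from the library.
const accountOptions = {
  account: {
    accountLinking: {
      enabled: true, // assumed value; the point is that it is set explicitly
      allowDifferentEmails: false, // never rely on the version's default
      trustedProviders: [] as string[], // no provider is auto-trusted
    },
  },
};
```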

Re-check the hooks API against the version you’re on. In 2026 a hook is a single createAuthMiddleware(...) function, not the older { matcher, handler } array. Library shapes drift between majors; a tutorial from last year will compile against your types and then misbehave.

About Cloudflare for SaaS

You don’t need Enterprise for v1 — but you do need to know which line items are gated. Pro/Business covers the v1 build. custom_metadata, certificate_authority: "google", and wildcard custom certs are Enterprise-only, and the cruel part is how they fail: custom_metadata is silently dropped, certificate_authority throws API error 1459. Confirm the plan tier of every feature you lean on before you depend on it.

Verify domain ownership on your side before you ask the CA for a cert. HTTP DCV alone won’t let a competitor steal a cert for your customer’s domain — they can’t serve the challenge — but they can burn your cert quota by requesting hostnames you’ll never activate. A TXT pre-verification step on your side closes the squatting window.
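
The pre-verification step can be as small as comparing a TXT lookup result against a token you issued. In this sketch the record name (`_verify.<hostname>`), the value format, and both function names are hypothetical. The DNS lookup itself is left out, and `observed` stands for whatever TXT values that lookup returned.

```typescript
// Hypothetical record name and value format for ownership pre-verification.
function expectedTxtRecord(
  hostname: string,
  token: string,
): { name: string; value: string } {
  return {
    name: `_verify.${hostname.toLowerCase()}`,
    value: `tenant-verify=${token}`,
  };
}

// True only if one of the observed TXT values matches the issued token.
function ownershipVerified(
  observed: string[],
  hostname: string,
  token: string,
): boolean {
  const { value } = expectedTxtRecord(hostname, token);
  // Resolvers often return TXT values wrapped in double quotes.
  return observed.some((txt) => txt.trim().replace(/^"|"$/g, "") === value);
}
```

Only after this returns `true` would the custom hostname be submitted for certificate issuance.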

Point the CNAME at a proxied CNAME on your own zone. Not a worker URL, and not the ...cdn.cloudflare.net form (which is undocumented and wrong). The correct target is a proxied record on your zone, e.g. customers.example.com.

Webhooks exist for SSL and hostname state — know they’re there before you reach for polling. v1 polls every 60 seconds, which is simple and fine; just don’t mistake polling for the only option when activation latency starts to matter.

About Cloudflare Access

`workers_dev: false` is one line, and forgetting it un-protects everything. Access binds protection at the hostname level, so the default `name.account.workers.dev` URL sails right past your Access policy. The admin panel you carefully gated is reachable, unauthenticated, on a URL you didn’t think about.

You can’t verify MFA inside the worker — there’s no amr claim. Don’t try to assert “this user did MFA” from the Access JWT; rely on the Access policy and IdP-side enforcement instead.

Reject service tokens on human-only panels. A service-token JWT has a different shape (type: "app", common_name set, no email); if your admin panel assumes a human, an automated token can walk in. Check the shape and refuse it.
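
A shape check for that case can stay very small. The sketch below keys on two of the markers named above, a set `common_name` and a missing `email`, and assumes the JWT has already been verified. `AccessPayload` and `isHumanIdentity` are hypothetical names, and the interface lists only the fields this check reads.

```typescript
// Only the payload fields this check reads; a real Access JWT carries more.
interface AccessPayload {
  type?: string;
  common_name?: string;
  email?: string;
  sub?: string;
}

// Treats the payload as human only when it has an email and no common_name.
function isHumanIdentity(payload: AccessPayload): boolean {
  const looksLikeServiceToken = Boolean(payload.common_name) || !payload.email;
  return !looksLikeServiceToken;
}
```

On a human-only panel, anything for which `isHumanIdentity` returns `false` would be refused before role checks run.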

payload.sub isn’t guaranteed stable — and the obvious workaround is a takeover hole. Falling back to email for first-login identity is the natural fix and also the exact vector an attacker uses to race-register an operator’s email at an IdP. Bind first login with an enrollment token instead.

About architecture and module boundaries

Land your core runtime boundary early; deferring it is technical debt that compounds. @repo/tenancy isn’t a “nice refactor for later” — it’s a first-order runtime boundary every worker depends on, so it has to ship in Phase A. The genuinely deep modules that can wait should wait (Phase C); the trick is telling the two apart.

Decide your dependency direction once and enforce it. @repo/tenancy imports schemas from @repo/db, which makes the reverse import dangerously easy to add by reflex — and that creates a cycle. Pick the arrow (db never imports tenancy) and hold the line.

Test deep modules at the boundary, not at every internal helper. The whole point of a deep module is a small interface over a lot of behavior; test that interface and resist the urge to pin down every private function — those tests just make refactoring painful.

Honor the project conventions that are easy to forget. Each new package needs an AGENTS.md. Small, mechanical, skipped exactly when you’re moving fast.

Future work / v2 backlog

These never made it into a locked decision; they are deferred to a later version or to implementation-time follow-up.

Custom-hostname allowlist refresh strategy — per-request DB lookup versus in-memory with a 5-minute refresh. Leaning toward in-memory.
Webhook integration for hostname state changes, replacing or supplementing the 60-second polling.
Queue-driven hostname reconciler, once sustained pending hostnames exceed ~6,000.
Tenant-scoped user identity, to close passkey/two-factor cross-tenant data bleed.
enforce_sso default — auto-flip to true on the first verified SSO provider, or always require explicit opt-in. Leaning opt-in.
In-app TOTP for operators, as defense in depth above Cloudflare Access MFA.
Customer notification on operator access — some SaaS notify the customer when a support engineer reads their data. v1 ships the audit log only; v2 considers email.
Approval workflow for destructive operator actions (delete tenant, deactivate a super_admin). v1 ships single-actor; v2 may add a two-person rule.
Operator activity dashboard — surface lastActiveAt and daily action counts.
Per-tenant feature flags — the schema is not designed yet; v1 ships scaffolding.
MFA verification in the worker via the Cloudflare Access Identity API.
Logout UX in the admin panel — proxy /cdn-cgi/access/logout for an explicit sign-out, versus relying on a browser bookmark.
Marketing / find-your-team page on the apex (D76) — v1 ships a static page; v2 adds the backend lookup.