Context: Cancellation & Deadlines

context.Context is one of the most pervasive types in real Go code, and the multigres codebase (“Vitess for Postgres”) is no exception — a grep across its source touches several hundred files, with thousands of individual call sites. It is the spine of the latency-sensitive query path and of every RPC the system makes. If you internalize one cross-cutting Go concept, make it this one.

We’ll learn it the same way: by reading how a real distributed system threads a context through every hop, from the gateway down to a goroutine waiting on a ticker.

A context threaded through the request path

flowchart LR
Client["Client"] -->|"PG wire"| GW["multigateway"]
GW -->|"gRPC (deadline travels)"| PL["multipooler"]
PL -->|"pooled SQL"| PG["PostgreSQL"]
GW -. "ctx (deadline + cancel)" .-> PL
PL -. "same ctx" .-> PG

A statement_timeout set at the gateway becomes a deadline on the context, and that one context flows all the way down — so a cancellation at the top stops work running at the bottom, without any layer in between knowing about the others.

Why context exists

Go has no thread-local storage and no exceptions, so there’s no ambient place to stash “this request must finish by 10:42:03” or “the caller hung up, stop working.” Go’s answer is to make that state an explicit, ordinary value — a context.Context — that you pass as the first argument to every function that might block, do I/O, or spawn work.

A context is an immutable node in a tree: you derive children from a parent, and cancelling (or timing out) a parent cancels every descendant. That tree is how a cancellation at the gateway propagates down through the pooler and out to a goroutine waiting on a ticker, all without those layers knowing about each other.
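The tree behaviour described above can be checked with the standard library alone. This is a minimal sketch, not multigres code: the function names cancelPropagates and siblingUnaffected are made up for illustration.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// cancelPropagates builds a three-level tree and reports whether cancelling
// the root closed the grandchild's Done channel.
func cancelPropagates() bool {
	root, cancelRoot := context.WithCancel(context.Background())
	child, cancelChild := context.WithTimeout(root, time.Hour)
	defer cancelChild()
	grandchild, cancelGrandchild := context.WithCancel(child)
	defer cancelGrandchild()

	cancelRoot() // cancel at the top of the tree

	select {
	case <-grandchild.Done():
		return true
	case <-time.After(time.Second):
		return false
	}
}

// siblingUnaffected reports whether cancelling one child leaves its sibling
// and its parent alive: cancellation flows down the tree, never up or sideways.
func siblingUnaffected() bool {
	root, cancelRoot := context.WithCancel(context.Background())
	defer cancelRoot()
	a, cancelA := context.WithCancel(root)
	b, cancelB := context.WithCancel(root)
	defer cancelB()

	cancelA()
	return a.Err() != nil && b.Err() == nil && root.Err() == nil
}

func main() {
	fmt.Println("root cancel reaches grandchild:", cancelPropagates())
	fmt.Println("sibling and parent unaffected:", siblingUnaffected())
}
```

The second function is the half of the rule that is easy to forget: a child's cancellation never travels upward, which is why a layer can safely impose a tighter deadline on its own subtree.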

Context is a tiny interface — four methods:

type Context interface {
    Deadline() (deadline time.Time, ok bool) // when (if ever) this ctx expires
    Done() <-chan struct{}                    // closed when cancelled/expired
    Err() error                               // why Done() closed (nil if not yet)
    Value(key any) any                        // request-scoped lookup
}

You almost never implement this interface yourself. You get a root from the standard library and derive children with context.WithCancel, WithTimeout, WithDeadline, and WithValue.

Roots: `Background()` vs `TODO()`

context.Background() is the empty, never-cancelled root. Use it at the true top of a goroutine tree — process startup, a server’s main loop, a CLI command, a test.

func main() {
    ctx := context.Background() // the root; nothing above it
    run(ctx)
}

A service entry point that owns a long-lived lifecycle creates its root from Background() and immediately wraps it in WithCancel so it can shut the whole tree down later:

ctx, cancel := context.WithCancel(context.Background())

context.TODO() behaves identically to Background() at runtime (both are empty contexts that are never cancelled), but it means something different to a reader: “a real context should be threaded in here, but it isn’t yet.” It is a marker for incomplete plumbing, not a polite alias.

func (r *Reader) GetLeadershipView() (*LeadershipView, error) {
    ctx, cancel := context.WithTimeout(context.TODO(), r.interval)
    defer cancel()
    // ...
}

Here GetLeadershipView takes no ctx parameter, so it manufactures one from TODO(). The cost is real: because the parent is TODO()/Background(), the caller’s cancellation never reaches this query — if the heartbeat reader’s owner shuts down, this in-flight read only stops when r.interval elapses.
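To make that cost concrete, here is a small sketch comparing a timeout rooted at TODO() with one rooted at the caller's context. The names readFromTODO, readFromParent, and outcomes are hypothetical stand-ins, and the blocking receive simply simulates a slow query; none of this is the real heartbeat reader.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// readFromTODO mimics a method with no ctx parameter: its timeout is rooted
// at context.TODO(), so no caller can cancel it early.
func readFromTODO(interval time.Duration) error {
	ctx, cancel := context.WithTimeout(context.TODO(), interval)
	defer cancel()
	<-ctx.Done() // stands in for a slow query
	return ctx.Err()
}

// readFromParent is the threaded variant: the same timeout, but derived from
// the caller's context, so the caller's cancellation reaches it.
func readFromParent(parent context.Context, interval time.Duration) error {
	ctx, cancel := context.WithTimeout(parent, interval)
	defer cancel()
	<-ctx.Done() // stands in for a slow query
	return ctx.Err()
}

// outcomes runs both variants under an owner that has already shut down.
func outcomes() (fromTODO, fromParent string) {
	owner, shutdown := context.WithCancel(context.Background())
	shutdown() // the owner is gone before the read starts

	fromTODO = readFromTODO(50 * time.Millisecond).Error()
	fromParent = readFromParent(owner, 50*time.Millisecond).Error()
	return fromTODO, fromParent
}

func main() {
	a, b := outcomes()
	fmt.Println("TODO root:  ", a) // waits out the full interval
	fmt.Println("caller root:", b) // returns as soon as the owner is cancelled
}
```

The TODO-rooted read can only end with a deadline error after the whole interval, while the threaded one reports cancellation immediately.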

Deriving cancellable children, and the cancel func

WithCancel, WithTimeout, and WithDeadline each return two things: a derived context and a context.CancelFunc. Calling the cancel func releases the resources the context holds: it unlinks the child from its parent and, for WithTimeout/WithDeadline, stops the internal timer. You must always call it, even on the happy path, even after the timeout already fired. Calling it more than once is safe (subsequent calls are no-ops). The idiom is defer cancel() on the very next line.

ctx, cancel := context.WithTimeout(parent, 5*time.Second)
defer cancel() // releases the timer when this function returns, regardless of why
result, err := doRPC(ctx)

The codebase does exactly this before every consensus RPC, using a named timeout constant (more on those below):

rpcCtx, cancel := context.WithTimeout(ctx, timeouts.RuleWriteTimeout)
defer cancel()
resp, err := r.coordinator.rpcClient.Recruit(rpcCtx, p.MultiPooler, &consensusdatapb.RecruitRequest{
    TermRevocation: revocation,
})

Note the deliberate naming: the derived context is rpcCtx, distinct from the incoming ctx. The function logs against ctx (the broader request) but passes rpcCtx (the narrower deadline) to the RPC, so a slow Recruit cannot exceed RuleWriteTimeout even if the surrounding ctx would allow more.

The no-op cancel trick

Sometimes you want to conditionally apply a deadline but still let the caller defer cancel() unconditionally. Return a do-nothing cancel func in the “no deadline” branch:

func (h *MultiGatewayHandler) statementTimeoutCtx(ctx context.Context, state *MultiGatewayConnectionState, query ast.Stmt) (context.Context, context.CancelFunc) {
    timeout := ResolveStatementTimeout(
        ParseStatementTimeoutDirective(query),
        state.GetStatementTimeout(),
    )
    if timeout > 0 {
        return context.WithTimeout(ctx, timeout)
    }
    return ctx, func() {} // no deadline: return ctx unchanged + a harmless cancel
}

This is the query path’s statement-timeout enforcement. When a statement_timeout is in effect, callers get a context that will expire; when none is set, they get the original context back plus a func(){}. Either way the caller writes the same two lines — ctx, cancel := h.statementTimeoutCtx(...); defer cancel() — with no special-casing.
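Seen from the caller's side, the trick looks like the sketch below. withOptionalTimeout is a hypothetical, simplified stand-in for statementTimeoutCtx that takes the resolved duration directly instead of parsing a query.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// withOptionalTimeout applies a deadline only when timeout is positive.
// In the no-deadline branch it returns the context unchanged together with
// a no-op cancel func, so callers never need to branch.
func withOptionalTimeout(ctx context.Context, timeout time.Duration) (context.Context, context.CancelFunc) {
	if timeout > 0 {
		return context.WithTimeout(ctx, timeout)
	}
	return ctx, func() {}
}

// hasDeadline shows the caller's two lines: derive, then defer cancel().
func hasDeadline(timeout time.Duration) bool {
	ctx, cancel := withOptionalTimeout(context.Background(), timeout)
	defer cancel() // safe in both branches
	_, ok := ctx.Deadline()
	return ok
}

func main() {
	fmt.Println("timeout=0  has deadline:", hasDeadline(0))
	fmt.Println("timeout=1s has deadline:", hasDeadline(time.Second))
}
```

Returning func() {} rather than nil is the important detail: a nil CancelFunc would panic at the deferred call.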

Reacting to cancellation: `Done()`, `Err()`, and `select`

ctx.Done() returns a channel that is closed when the context is cancelled or its deadline passes. A receive from a closed channel returns immediately, so <-ctx.Done() is the “wake me when it’s time to stop” primitive. After Done() closes, ctx.Err() tells you why: context.Canceled (someone called the cancel func) or context.DeadlineExceeded (the timeout fired).
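A short sketch of the three states a context can report. The helper names (reason, cancelledReason, timedOutReason, liveReason) are made up for this example.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// reason classifies a context's state using the two sentinel errors.
func reason(ctx context.Context) string {
	switch err := ctx.Err(); {
	case err == nil:
		return "still running"
	case errors.Is(err, context.Canceled):
		return "cancelled by caller"
	case errors.Is(err, context.DeadlineExceeded):
		return "deadline exceeded"
	default:
		return "unknown"
	}
}

func cancelledReason() string {
	ctx, cancel := context.WithCancel(context.Background())
	cancel()
	<-ctx.Done() // returns immediately: the channel is already closed
	return reason(ctx)
}

func timedOutReason() string {
	ctx, cancel := context.WithTimeout(context.Background(), time.Millisecond)
	defer cancel()
	<-ctx.Done() // blocks until the timer fires
	return reason(ctx)
}

func liveReason() string {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	return reason(ctx) // Err() is nil while Done() is still open
}

func main() {
	fmt.Println(liveReason())
	fmt.Println(cancelledReason())
	fmt.Println(timedOutReason())
}
```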

The canonical pattern is a select that races your real work against ctx.Done(). This is how you make any blocking wait cancellable:

ticker := time.NewTicker(100 * time.Millisecond)
defer ticker.Stop()
for {
    select {
    case <-ctx.Done():
        return ctx.Err()
    case <-ticker.C:
        if done() {
            return nil
        }
    }
}

The WaitForLSN routine is exactly this — poll PostgreSQL every 100ms until a standby has replayed WAL up to a target position, but bail the instant the caller’s context expires:

for {
    select {
    case <-ctx.Done():
        pm.logger.ErrorContext(ctx, "WaitForLSN context cancelled or timed out",
            "target_lsn", targetLsn,
            "error", ctx.Err())
        return mterrors.Wrap(ctx.Err(), "context cancelled or timed out while waiting for LSN")
    case <-ticker.C:
        reachedTarget, err := pm.checkLSNReached(ctx, targetLsn)
        if err != nil {
            return err
        }
        if reachedTarget {
            return nil
        }
    }
}

Two things to notice. First, ctx.Err() is wrapped with mterrors.Wrap so the error chain still satisfies errors.Is(err, context.DeadlineExceeded) while carrying extra context — see Errors for why wrapping preserves the sentinel. Second, ctx is also passed into checkLSNReached, so the inner query inherits the same deadline; cancellation propagates all the way down, not just to the select you can see here.

A subtler variant: select over two done-like channels when there are two independent reasons to stop. A connection-pool waitlist races the pool being closed against the caller’s deadline:

select {
case <-closeChan:
    // Pool was closed while we were waiting. Remove ourselves from the waitlist...
case <-ctx.Done():
    // Context expired. Remove ourselves from the waitlist to prevent another
    // goroutine from trying to hand us a connection later on.
    if removed {
        return nil, context.Cause(ctx)
    }
}
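The waitlist snippet returns context.Cause(ctx) rather than ctx.Err(). The sketch below shows how the two differ in the standard library (context.WithCancelCause and context.Cause, available since Go 1.20). The errPoolClosed sentinel and the function names are assumptions made for illustration, not identifiers from the codebase.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// errPoolClosed is a hypothetical, more specific reason for cancellation.
var errPoolClosed = errors.New("pool closed")

// cancelWithCause cancels via a CancelCauseFunc and returns both views.
func cancelWithCause() (err, cause error) {
	ctx, cancel := context.WithCancelCause(context.Background())
	cancel(errPoolClosed)
	return ctx.Err(), context.Cause(ctx)
}

// cancelWithoutCause cancels via a plain CancelFunc and returns both views.
func cancelWithoutCause() (err, cause error) {
	ctx, cancel := context.WithCancel(context.Background())
	cancel()
	return ctx.Err(), context.Cause(ctx)
}

// causeIsSpecific: Err() stays the generic sentinel, Cause() carries the reason.
func causeIsSpecific() bool {
	err, cause := cancelWithCause()
	return errors.Is(err, context.Canceled) && errors.Is(cause, errPoolClosed)
}

// causeFallsBackToErr: with no explicit cause, Cause() equals Err().
func causeFallsBackToErr() bool {
	err, cause := cancelWithoutCause()
	return errors.Is(err, context.Canceled) && errors.Is(cause, context.Canceled)
}

func main() {
	fmt.Println("cause carries the specific reason:", causeIsSpecific())
	fmt.Println("cause falls back to Err():", causeFallsBackToErr())
}
```

Because Cause falls back to the ordinary sentinel when nobody supplied a reason, returning it is never less informative than returning ctx.Err().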

The “ctx is the first parameter” convention

Any method that does I/O, blocks, or makes an RPC takes ctx context.Context as its first parameter, conventionally named ctx. This is not enforced by the compiler; it is a universal Go convention that tooling and reviewers rely on. An entire client interface follows it:

type MultiPoolerClient interface {
    Recruit(ctx context.Context, pooler *clustermetadatapb.MultiPooler, request *consensusdatapb.RecruitRequest) (*consensusdatapb.RecruitResponse, error)
    Promote(ctx context.Context, pooler *clustermetadatapb.MultiPooler, request *consensusdatapb.PromoteRequest) (*consensusdatapb.PromoteResponse, error)
    SetPrimary(ctx context.Context, pooler *clustermetadatapb.MultiPooler, request *consensusdatapb.SetPrimaryRequest) (*consensusdatapb.SetPrimaryResponse, error)
    // ...
}

The corollary: do not store a Context in a struct field as a general habit. The Go docs say so explicitly. A context models a single call tree’s lifetime; stashing one in a struct means later method calls silently reuse a possibly-stale deadline instead of receiving a fresh ctx per call. Pass it, don’t park it.

The deliberate exception: self-owned background watchers

There is one legitimate reason to store a context in a struct: an object that owns a long-lived background goroutine and exposes a Start()/Stop() lifecycle. It derives a cancellable context once at construction, keeps it and its cancel func, loops on it in the goroutine, and calls the cancel func from Stop().

type CellPoolerDiscovery struct {
    // ...
    ctx        context.Context
    cancelFunc context.CancelFunc
    wg         sync.WaitGroup
}

func NewCellPoolerDiscovery(ctx context.Context, topoStore topoclient.Store, cell string, logger *slog.Logger) *CellPoolerDiscovery {
    discoveryCtx, cancel := context.WithCancel(ctx)
    return &CellPoolerDiscovery{
        ctx:        discoveryCtx,
        cancelFunc: cancel,
    }
}

// inside the goroutine launched by Start()
for {
    select {
    case <-pd.ctx.Done():
        return
    case watchData, ok := <-changes:
        // ...
    }
}

func (pd *CellPoolerDiscovery) Stop() {
    pd.cancelFunc()
    pd.wg.Wait()
}

The stored context belongs to the watcher itself, not to any inbound request. Stop() calls cancelFunc() (closing pd.ctx.Done(), which unblocks the loop) and then wg.Wait()s for the goroutine to actually exit. This is the idiomatic shape for a service-owned background worker. See Concurrency for the WaitGroup + cancel-func goroutine-shutdown pattern in depth.

Cancellation across gRPC boundaries

A context’s cancellation and deadline travel over the wire automatically. When you pass a ctx to a generated gRPC client stub, gRPC encodes its deadline as request metadata; the server materializes a context with that same deadline, and if the client cancels, the server’s ctx.Done() fires. This is what lets a statement_timeout set at the gateway actually stop work running inside the pooler.

The rpcclient here is a deliberately thin pass-through. It adds no timeout of its own — it forwards the caller’s ctx straight to the stub. Bounding the call is the caller’s job (which is exactly why rule_change.go wraps with WithTimeout(ctx, timeouts.RuleWriteTimeout) before calling).

func (c *Client) Recruit(ctx context.Context, pooler *clustermetadatapb.MultiPooler, request *consensusdatapb.RecruitRequest) (*consensusdatapb.RecruitResponse, error) {
    conn, closer, err := c.dialPersistent(ctx, pooler) // ctx also bounds connection acquisition
    if err != nil {
        return nil, err
    }
    defer func() { _ = closer() }()
    return conn.consensusClient.Recruit(ctx, request) // ctx flows over the wire as the gRPC deadline
}

Notice ctx does double duty: it bounds acquiring the connection and then propagates as the call’s deadline. Connection acquisition itself respects the context — when the pool is at capacity and must wait for an evictable connection, it polls in a loop that bails on ctx.Done():

for {
    select {
    case <-ctx.Done():
        cc.metrics.AddDialTimeout(ctx)
        return nil, nil, ctx.Err()
    default:
        if client, closer, found, err := cc.pollOnce(ctx, addr, poolerID); found {
            return client, closer, err
        }
    }
}

Long-lived streams use the context as the stream’s lifetime: such a stream stays open until the context passed to it is cancelled — cancelling the context is how you close the stream. See gRPC & Protobuf for the full picture of how generated stubs and deadlines interact.

Reading and shaving the inbound deadline at the server

On the server side, a handler can ask whether the caller imposed a deadline via ctx.Deadline(). The TriggerRecoveryNow handler does this. When no deadline exists, it imposes a 30-second default; when one exists, it derives a slightly shorter child deadline so it leaves headroom to actually serialize and send the response:

deadline, hasDeadline := ctx.Deadline()
if !hasDeadline {
    var cancel context.CancelFunc
    ctx, cancel = context.WithTimeout(ctx, 30*time.Second)
    defer cancel()
} else {
    // Subtract 200ms from deadline to allow time for response overhead.
    timeout := time.Until(deadline) - 200*time.Millisecond
    if timeout > 0 {
        var cancel context.CancelFunc
        ctx, cancel = context.WithTimeout(ctx, timeout)
        defer cancel()
    }
}
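The same logic can be factored into a reusable helper. The sketch below is one possible shape, assuming a hypothetical withHeadroom function; it is not a helper that exists in the codebase, and it keeps the handler's behaviour of leaving the context untouched when too little budget remains.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// withHeadroom returns a context that expires `headroom` before the inbound
// deadline, or after `fallback` when the caller set no deadline at all.
func withHeadroom(ctx context.Context, headroom, fallback time.Duration) (context.Context, context.CancelFunc) {
	deadline, ok := ctx.Deadline()
	if !ok {
		return context.WithTimeout(ctx, fallback)
	}
	remaining := time.Until(deadline) - headroom
	if remaining <= 0 {
		// Too little budget left to shave; keep the inbound deadline as is.
		return ctx, func() {}
	}
	return context.WithTimeout(ctx, remaining)
}

// headroomApplied checks that a 1s inbound deadline is shortened by about 200ms.
func headroomApplied() bool {
	parent, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()
	child, cancelChild := withHeadroom(parent, 200*time.Millisecond, 30*time.Second)
	defer cancelChild()

	parentDeadline, _ := parent.Deadline()
	childDeadline, _ := child.Deadline()
	gap := parentDeadline.Sub(childDeadline)
	return gap > 100*time.Millisecond && gap <= 200*time.Millisecond
}

// fallbackApplied checks that a deadline-free context receives the default.
func fallbackApplied() bool {
	child, cancel := withHeadroom(context.Background(), 200*time.Millisecond, 30*time.Second)
	defer cancel()
	_, ok := child.Deadline()
	return ok
}

func main() {
	fmt.Println("headroom applied:", headroomApplied())
	fmt.Println("fallback applied:", fallbackApplied())
}
```

Returning a no-op cancel in the tight-budget branch reuses the no-op cancel trick from earlier, so callers can always write defer cancel().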

Distinguishing “deadline hit” from “real failure”

When an RPC’s deadline fires, gRPC surfaces it on the client as a status error (code DeadlineExceeded), not as a bare context.DeadlineExceeded. But within a server, an operation that watched its own ctx will return the Go sentinel. A recovery handler treats a deadline or cancellation as an expected outcome (recovery ran out of time — that’s information, not a crash) and only maps genuine errors to Internal:

remainingProblems, err := s.engine.TriggerRecoveryNow(ctx, req.MaxCycles)
if err != nil && !errors.Is(err, context.DeadlineExceeded) && !errors.Is(err, context.Canceled) {
    return nil, status.Error(codes.Internal, fmt.Sprintf("recovery trigger failed: %v", err))
}

Use errors.Is against the sentinels — never string-match "context deadline exceeded". Wrapping with mterrors.Wrap(ctx.Err(), ...) preserves errors.Is reachability, which is precisely why the check above keeps working even through several wrap layers. Cross-reference Errors.

`context.WithValue` — request-scoped data, used sparingly

WithValue attaches a key/value pair to a derived context. It is the right tool for a narrow purpose: data that is genuinely request-scoped and must cross API boundaries where adding a parameter is impractical — a trace span, an auth principal, a proof-of-something token. It is the wrong tool for passing optional arguments or smuggling dependencies into a function.

A key must be a value of an unexported custom type, typically a zero-size struct{}, and never a bare string or int. String keys from different packages can silently collide; an unexported struct type cannot be named from outside the package and costs zero bytes.

type ctxKey struct{}                       // unexported, zero-size, collision-proof
ctx = context.WithValue(ctx, ctxKey{}, v)  // store
v, ok := ctx.Value(ctxKey{}).(*MyType)     // retrieve + type-assert
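A fuller, runnable version of those three lines might look like the sketch below. The Principal type and the WithPrincipal/PrincipalFrom accessors are hypothetical; wrapping the key behind a pair of exported functions is a common way to keep the key type private.

```go
package main

import (
	"context"
	"fmt"
)

// principalKey is unexported, so no other package can construct this key.
type principalKey struct{}

// Principal is a hypothetical piece of request-scoped data.
type Principal struct{ Name string }

// WithPrincipal returns a derived context carrying p.
func WithPrincipal(ctx context.Context, p *Principal) context.Context {
	return context.WithValue(ctx, principalKey{}, p)
}

// PrincipalFrom retrieves the principal; ok is false when none was stored.
func PrincipalFrom(ctx context.Context) (*Principal, bool) {
	p, ok := ctx.Value(principalKey{}).(*Principal)
	return p, ok
}

// lookupName falls back to a placeholder when the context carries no principal.
func lookupName(ctx context.Context) string {
	if p, ok := PrincipalFrom(ctx); ok {
		return p.Name
	}
	return "<anonymous>"
}

func nameWithPrincipal() string {
	ctx := WithPrincipal(context.Background(), &Principal{Name: "alice"})
	return lookupName(ctx)
}

func nameWithoutPrincipal() string {
	return lookupName(context.Background())
}

func main() {
	fmt.Println(nameWithPrincipal())
	fmt.Println(nameWithoutPrincipal())
}
```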

An “action lock” is the textbook good use. It threads proof that the caller holds the lock through a call chain, so deep functions can assert the lock is held without the lock object being passed down explicitly:

type actionLockKey struct{}

type actionLockValue struct {
    lockID    uint64
    operation string
    released  *atomic.Bool
}

// Acquire returns a NEW context carrying the proof:
return context.WithValue(ctx, actionLockKey{}, val), nil

// AssertActionLockHeld reads it back via type assertion:
func AssertActionLockHeld(ctx context.Context) error {
    val, ok := ctx.Value(actionLockKey{}).(*actionLockValue)
    if !ok {
        return errors.New("context does not hold an action lock")
    }
    if val.released.Load() {
        return errors.New("context's action lock has been released")
    }
    return nil
}

Acquire returns a derived context; callers thread that context downstream, and any function that needs to be sure the lock is held calls AssertActionLockHeld(ctx). The stored value is a pointer to a struct with an atomic.Bool released-flag, so Release can flip the flag and every context that captured that pointer immediately sees the lock as released — a neat interaction between WithValue and the sync/atomic primitives covered in Sync & the Memory Model.
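The shared-pointer behaviour is easy to verify in isolation. This is a stripped-down sketch with hypothetical names (lockKey, lockValue, acquire, assertHeld); it leaves out the real locking and keeps only the released flag.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync/atomic"
)

type lockKey struct{}

// lockValue is stored by pointer, so every derived context shares one flag.
type lockValue struct {
	released *atomic.Bool
}

// acquire returns a context carrying the proof plus a release func.
func acquire(ctx context.Context) (context.Context, func()) {
	val := &lockValue{released: new(atomic.Bool)}
	release := func() { val.released.Store(true) }
	return context.WithValue(ctx, lockKey{}, val), release
}

// assertHeld fails when the proof is missing or the lock was released.
func assertHeld(ctx context.Context) error {
	val, ok := ctx.Value(lockKey{}).(*lockValue)
	if !ok {
		return errors.New("context does not hold the lock")
	}
	if val.released.Load() {
		return errors.New("lock has been released")
	}
	return nil
}

// heldThenReleased checks a context derived from the locked one before and
// after release, and a context that never held the lock.
func heldThenReleased() (before, after, never bool) {
	locked, release := acquire(context.Background())
	child, cancel := context.WithCancel(locked) // derived later, same proof
	defer cancel()

	before = assertHeld(child) == nil
	release()
	after = assertHeld(child) == nil
	never = assertHeld(context.Background()) == nil
	return before, after, never
}

func main() {
	before, after, never := heldThenReleased()
	fmt.Println("held before release:", before)
	fmt.Println("held after release: ", after)
	fmt.Println("held without proof: ", never)
}
```

A plain bool copied into each context could not behave this way, because context values are immutable once attached; the pointer is what lets a later release be observed through contexts that were derived earlier.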

Detached contexts: work that must outlive the request

Sometimes a background task is triggered by a request but must outlive it — graceful shutdown, a post-response health check. If you reuse the request’s context, its cancellation (or deadline) will abort the very cleanup you need to run. The fix is to detach: build a fresh context that keeps the telemetry baggage but drops the cancellation linkage.

A ctxutil.Detach helper starts from Background() and copies over OpenTelemetry baggage and the parent span (stored separately so background work can link to rather than nest under the originating trace):

func Detach(parent context.Context) context.Context {
    ctx := context.Background() // start fresh - no cancellation inheritance
    if bag := baggage.FromContext(parent); bag.Len() > 0 {
        ctx = baggage.ContextWithBaggage(ctx, bag)
    }
    if span := trace.SpanFromContext(parent); span.SpanContext().IsValid() {
        ctx = context.WithValue(ctx, parentSpanContextKey{}, span.SpanContext())
    }
    return ctx
}

This contrasts with the standard library’s context.WithoutCancel (Go 1.21+): WithoutCancel keeps every value on the parent, including the active span, so the background work appears as a child of the request; Detach deliberately demotes it to a linked span so it gets its own trace. The graceful-shutdown hook uses Detach, then re-imposes a bound:

mp.senv.OnClose(func() {
    // Detach from startCtx so a cancelled startup ctx doesn't block
    // the shutdown write, while preserving any trace/telemetry values.
    ctx, cancel := context.WithTimeout(ctxutil.Detach(startCtx), 10*time.Second)
    defer cancel()
    mp.Shutdown(ctx)
})
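For comparison, the standard-library route mentioned above looks like this. It is a sketch of context.WithoutCancel (Go 1.21+) only, not of ctxutil.Detach; traceKey and the "trace-123" value are placeholders for whatever telemetry the request carries.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// traceKey is a placeholder for request-scoped telemetry.
type traceKey struct{}

// detachedSurvives cancels the request and checks that a context derived via
// WithoutCancel is still usable, still carries the request's values, and can
// be given a fresh bound of its own.
func detachedSurvives() bool {
	request, cancelRequest := context.WithCancel(
		context.WithValue(context.Background(), traceKey{}, "trace-123"))

	detached := context.WithoutCancel(request) // keeps values, drops cancellation
	cancelRequest()                            // the request ends

	// Re-impose a bound so the background work cannot run forever.
	bounded, cancel := context.WithTimeout(detached, 10*time.Second)
	defer cancel()

	return request.Err() != nil &&
		bounded.Err() == nil &&
		bounded.Value(traceKey{}) == "trace-123"
}

func main() {
	fmt.Println("detached context survives request cancellation:", detachedSurvives())
}
```

Either way, detaching removes the only upper bound the work had, so pairing it with a new WithTimeout, as the shutdown hook does, is the part that should not be skipped.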

Centralized timeout constants

Notice that the consensus RPC sites do not hardcode a magic 30 * time.Second. Deadline durations are collected into a timeouts package, each constant carrying a comment that justifies its specific value from operational experience:

// RemoteOperationTimeout is the default timeout for remote operations such as
// RPC calls, etcd data fetches, and synchronous replication health checks.
const RemoteOperationTimeout = 15 * time.Second

// RuleWriteTimeout is the timeout for rule writes and the election-flow RPCs
// (Recruit, Promote, SetPrimary).
const RuleWriteTimeout = 30 * time.Second

// ReadyTopoCheckTimeout bounds the etcd connectivity probe in /ready handlers.
const ReadyTopoCheckTimeout = 4 * time.Second

Callers reference the named constant: context.WithTimeout(ctx, timeouts.RuleWriteTimeout). This keeps every election RPC on the same budget and turns “why is this 30s?” into a documented decision rather than a guess scattered across the codebase. When you need a deadline on the latency path, look here first before inventing a number.

ctx-aware logging

A small but pervasive idiom: the codebase logs with slog’s *Context methods — logger.ErrorContext(ctx, ...), InfoContext, DebugContext — rather than the context-free Error/Info. Passing ctx lets the logging backend extract trace/baggage data carried in the context, so a log line can be correlated with the request that produced it. You saw it in WaitForLSN above (ErrorContext(ctx, "WaitForLSN context cancelled or timed out", ...)). When you add logging on a code path that has a ctx, prefer the *Context variant. See mterrors & Observability.
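How a backend might pull data out of ctx can be sketched with a small slog.Handler wrapper. This illustrates the mechanism only: requestIDKey and contextHandler are hypothetical, and the codebase's actual logging backend may work differently. Embedding slog.Handler keeps the sketch short, at the cost that WithAttrs and WithGroup return the inner handler without the wrapper.

```go
package main

import (
	"bytes"
	"context"
	"fmt"
	"log/slog"
	"strings"
)

// requestIDKey is a placeholder for whatever correlation data ctx carries.
type requestIDKey struct{}

// contextHandler adds a request_id attribute when the context carries one.
type contextHandler struct{ slog.Handler }

func (h contextHandler) Handle(ctx context.Context, r slog.Record) error {
	if id, ok := ctx.Value(requestIDKey{}).(string); ok {
		r.AddAttrs(slog.String("request_id", id))
	}
	return h.Handler.Handle(ctx, r)
}

// logLines emits the same message with and without a context.
func logLines() (withCtx, withoutCtx string) {
	var buf bytes.Buffer
	logger := slog.New(contextHandler{slog.NewTextHandler(&buf, nil)})
	ctx := context.WithValue(context.Background(), requestIDKey{}, "req-42")

	logger.InfoContext(ctx, "lsn reached") // handler sees ctx
	withCtx = buf.String()
	buf.Reset()

	logger.Info("lsn reached") // handler sees only a background context
	withoutCtx = buf.String()
	return withCtx, withoutCtx
}

// correlated reports whether only the ctx-aware call produced a request_id.
func correlated() bool {
	with, without := logLines()
	return strings.Contains(with, "request_id=req-42") &&
		!strings.Contains(without, "request_id")
}

func main() {
	with, without := logLines()
	fmt.Print(with)
	fmt.Print(without)
	fmt.Println("only the ctx-aware line is correlated:", correlated())
}
```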

Checkpoints

Why must you call the CancelFunc returned by context.WithTimeout even if the timeout has already fired or the operation succeeded?

The cancel func releases the context’s internal resources: the child’s registration with its parent and, for WithTimeout/WithDeadline, the timer that enforces the deadline. If you never call it, those resources live until the timer fires or the parent context is cancelled; for a plain WithCancel child of a long-lived parent, that can be the entire process lifetime: a slow leak. Calling it is idempotent (extra calls are no-ops), so the safe, universal habit is defer cancel() on the line right after creation. go vet’s lostcancel catches some omissions but not all.

The rule says “never store a Context in a struct,” yet CellPoolerDiscovery has a ctx field. Why is that not a contradiction?

The rule targets objects that act on behalf of a caller per method call — a request handler must receive a fresh ctx each call because the caller owns the request’s lifetime. CellPoolerDiscovery is the opposite: it owns a long-lived background goroutine with a Start()/Stop() lifecycle. It derives one cancellable context at construction (WithCancel(ctx)), the goroutine loops on it, and Stop() calls the stored cancelFunc. The stored context belongs to the watcher’s own subtree, so caching it is correct. The distinction is ownership: store a context only when the struct owns the goroutine that consumes it.

Why must a WithValue key be an unexported custom type like type actionLockKey struct{} rather than a string constant?

Context keys are compared by equality across the whole program. Two packages that both use the string "lock" as a key would silently collide, and whichever value sits nearer in the context chain would shadow the other. An unexported, zero-size struct type is unique to its package (nothing outside can name it), costs zero bytes, and is collision-proof. Retrieval uses the comma-ok type assertion (val, ok := ctx.Value(actionLockKey{}).(*actionLockValue)), so a missing or wrong-typed value returns cleanly as ok == false rather than panicking, which is exactly how AssertActionLockHeld distinguishes “no lock” from “lock present.”

TriggerRecoveryNow subtracts 200ms from the inbound deadline before forwarding work. What goes wrong without that shave?

If the server runs the inner operation up to the exact inbound deadline, it finishes at the same instant the client gives up. The client’s RPC then times out before it can receive the server’s response, so the (possibly successful) answer is lost and the caller sees a deadline/network error instead. Reserving a small budget (200ms here) ensures the server completes early enough to serialize and send the reply before the client’s clock expires. Budget headroom matters whenever a deadline is forwarded down a call chain.

Exercises

Trace the timeout constants. Open go/common/timeouts/rpc.go and pick three constants (e.g. RuleWriteTimeout, ReadyTopoCheckTimeout, RemoteOperationTimeout). Grep the repo for each constant name and find a call site that passes it to context.WithTimeout. For each, identify the blocking operation the deadline bounds (an RPC? an etcd probe?) and read the constant’s comment to explain why that specific duration was chosen.
Follow a WithValue end to end. In go/services/multipooler/internal/manager/actionlock/action_lock.go, trace the *actionLockValue pointer from Acquire (where it’s stored) to Release and to the package-level AssertActionLockHeld. Then find a caller in rpc_manager.go that asserts the lock is held via the context. Explain why the released *atomic.Bool is a pointer shared across the contexts rather than a plain bool copied into each.
Compare two select-on-Done loops. Read WaitForLSN (rpc_manager.go) and getOrDial (conn_cache.go). For each, name the other select case (a ticker channel vs. a default poll) and explain the difference between a blocking select and a busy default loop. Then explain why WaitForLSN wraps ctx.Err() with mterrors.Wrap while getOrDial returns ctx.Err() directly.
Detach vs. direct context. Read ctxutil.Detach and the OnClose hook in go/services/multipooler/init.go. Explain precisely what would break if the hook used startCtx directly instead of Detach(startCtx), and what would break if it used Detach(startCtx) without wrapping it in WithTimeout.

Next

Sync & the Memory Model: mutexes, atomics, and the happens-before rules that make the atomic.Bool released-flag in the action lock safe.
