Anatomy of a Service

What you will learn: how a long-running Go service is structured, brought to a serving state, kept healthy, and shut down gracefully — by dissecting one small lifecycle engine and the service that drives it end to end.

Prerequisites: interfaces & composition, concurrency, context, sync & the memory model, and errors. It also helps to have read cmd & cobra (process entry) and config & viperutil (where the flag values come from).

Every long-running binary in this system — gateway, pooler, orchestrator, admin, the Postgres controller — is a server, and they all share one bootstrapping engine called servenv (short for serving environment). Understand servenv plus one concrete service and you understand all of them. We’ll use multigateway as the worked example because its startup is the most complete and self-contained, and bring in multipooler for contrast.

A *servenv.ServEnv is the per-process object that owns the HTTP mux, the lifecycle hooks, readiness checks, telemetry, the PID file, and the signal channel. A service struct like MultiGateway holds a *ServEnv and a *GrpcServer rather than embedding the machinery itself — composition over inheritance, the Go way.

The big picture: a hook-driven state machine

servenv is built around six families of lifecycle hooks. A hook is just a registered function (func() or func() error). The service registers callbacks; servenv fires them at the right phase. This is the functional-callback idiom — see concurrency for why functions-as-values matter here, since hooks fire on goroutines.

The whole lifecycle is a small state machine. A process starts, the environment comes up (OnInit), the service wires its object graph and registers deferred work, then Run listens, fires its run hooks, and blocks while serving. A signal kicks it into a bounded, ordered shutdown:

Service lifecycle

Rendering diagram…

stateDiagram-v2
[*] --> Start: process start (cobra RunE)
Start --> Init: ServEnv.Init(id)
Init --> Init: fire OnInit hooks
note right of Init
  telemetry, hostname,
  /live + /ready, PID file
end note
Init --> Wired: service.Init builds object graph
note right of Wired
  register OnRun / OnRunE (gRPC handlers),
  RegisterReadyCheck, OnClose
end note
Wired --> Serving: ServEnv.Run
note right of Serving
  HTTP listen FIRST, then grpcServer.Create(),
  then FireRunHooks() (OnRun/OnRunE), then Serve()
end note
Serving --> Serving: block on signal channel
Serving --> Lameduck: SIGTERM / SIGINT
Lameduck --> TermSync: go OnTerm (async, not waited)
TermSync --> CloseHTTP: OnTermSync (waited, gRPC GracefulStop)
CloseHTTP --> Close: finish lameduck, close HTTP listener
Close --> Exit: OnClose (waited, Shutdown / topo unregister)
Exit --> [*]: telemetry shutdown, exit

The six registration methods give you a hook for each interesting moment:

func (se *ServEnv) OnInit(f func()) { se.onInitHooks.Add(f) }      // start of lifecycle
func (se *ServEnv) OnRun(f func()) { se.onRunHooks.Add(f) }        // start of Run
func (se *ServEnv) OnRunE(f func() error) { se.onRunEHooks.Add(f) } // start of Run, can error
func (se *ServEnv) OnTerm(f func()) { se.onTermHooks.Add(f) }      // on SIGTERM (async)
func (se *ServEnv) OnTermSync(f func()) { se.onTermSyncHooks.Add(f) } // on SIGTERM (waited)
func (sv *ServEnv) OnClose(f func()) { sv.onCloseHooks.Add(f) }    // end of lifecycle

Hook	When it fires	Waited on?	Timeout bound	Typical use
`OnInit`	inside `ServEnv.Init`	yes (sequential w.r.t. Init)	none	early one-time setup
`OnRun`	inside `Run`, after `Create`, before `Serve`	yes	none	register gRPC handlers
`OnRunE`	same as `OnRun`	yes, fail-fast	none	gRPC registration that can error
`OnTerm`	on SIGTERM	no (`go ...Fire()`)	none	best-effort async notices
`OnTermSync`	on SIGTERM, after async `OnTerm` launched	yes	`--onterm-timeout` (30s)	gRPC `GracefulStop`
`OnClose`	after lameduck period	yes	`--onclose-timeout` (10s)	service `Shutdown`, topo unregister

How hooks fire: parallel goroutines

The hook lists launch one goroutine per registered function and wait for all of them:

// Fire launches a goroutine for each function and waits for all to finish.
func (h *Hooks) Fire() {
  h.mu.Lock()
  defer h.mu.Unlock()
  wg := sync.WaitGroup{}
  for _, f := range h.funcs {
    wg.Go(f)
  }
  wg.Wait()
}

wg.Go(f) is the Go 1.25 convenience form of wg.Add(1); go func(){ defer wg.Done(); f() }() — see sync & the memory model. The error-returning variant is also parallel but fail-fast: it returns the first error received on a channel while the rest may still be running.

Phase 1 — process start: cobra to `Init`

The entry point is tiny. The main.go builds a cobra command whose RunE calls a run function that is just Init then RunDefault:

func run(ctx context.Context, mg *multigateway.MultiGateway) error {
  if err := mg.Init(ctx); err != nil {
    return err
  }
  return mg.RunDefault()
}

PreRunE is wired to load the config before RunE runs (see config & viperutil). main() calls cmd.Execute() and os.Exit(1) on error — and that’s the only place os.Exit is allowed.

Phase 2 — `ServEnv.Init`: the serving environment comes up

MultiGateway.Init calls senv.Init(...) first, handing it a service identity (name, instance ID, cell):

if err := mg.senv.Init(servenv.ServiceIdentity{
  ServiceName:       constants.ServiceMultigateway,
  ServiceInstanceID: serviceID,
  Cell:              cell,
}); err != nil {
  return fmt.Errorf("servenv init: %w", err)
}

ServEnv.Init is the one-time environment bring-up, in order:

Set up logging and OpenTelemetry from the identity (resource attributes like service.instance.id, cell, build revision).
Optionally ignore SIGPIPE.
Double-init guard: under a mutex, if the env is already inited it calls log.Fatal. Otherwise it marks itself inited.
Refuse to run as root: if os.Getuid() == 0 it returns an error.
Set the max stack and resolve the hostname (failing early on a bad one).
Fire the OnInit hooks — your early setup callbacks run here.
Register the PID file and the common HTTP endpoints (/live, /ready, /version, /config), pprof, and orphan detection.

After senv.Init returns, Init builds the service object graph — where the request-flow components are wired (see architecture & request flow). The first thing it does is create a service-lifetime context cancelled on shutdown:

// A service-lifetime context, cancelled once at shutdown.
mg.shutdownCtx, mg.shutdownCancel = context.WithCancel(ctx)

mg.shutdownCtx is then threaded into every long-running goroutine — pooler discovery, the load balancer, the failover buffer:

mg.poolerDiscovery = NewGlobalPoolerDiscovery(mg.shutdownCtx, mg.ts, mg.cell.Get(), logger)
loadBalancer := poolergateway.NewLoadBalancer(mg.shutdownCtx, mg.cell.Get(), logger, poolerTransportCreds)
mg.buffer = buffer.New(mg.shutdownCtx, mg.bufferConfig, logger)

This is the canonical Go pattern for fanning a cancel signal out to many goroutines: derive one cancellable context, store its CancelFunc, pass the context down, and cancel once at shutdown. See context.

Phase 3 — deferred gRPC registration via `OnRun`

Here is the single most important structural rule in one of these services. The service holds a *servenv.GrpcServer, but its embedded *grpc.Server field does not exist yet during Init — it’s created later, inside ServEnv.Run, by grpcServer.Create(). So you can’t register gRPC handlers in Init (you’d dereference nil), and you can’t register them after Serve() (gRPC panics). The handlers go in an OnRun hook, which fires in the narrow window in between:

// Register gRPC services via OnRun because grpcServer.Server is only
// created in servenv.Run() (after Create()), which runs after Init().
managerServer := NewManagerServer(mg.queryRegistry, mg.pgHandler)
mg.senv.OnRun(func() {
  mg.cancelManager.RegisterWithGRPCServer(mg.grpcServer.Server)
  managerServer.RegisterWithGRPCServer(mg.grpcServer.Server)
})

The constraint is documented right at the gRPC Serve site too: all services must register before Serve(), which is exactly why the run hooks fire after Create() and before Serve(). Otherwise the binary crashes with grpc: Server.RegisterService after Server.Serve.

service-map gating

Whether a gRPC sub-service is even registered is gated by a --service-map flag through GrpcServer.CheckServiceMap. The pooler shows this concretely:

func RegisterPoolerServices(senv *servenv.ServEnv, grpc *servenv.GrpcServer) {
  poolerserver.RegisterPoolerServices = append(poolerserver.RegisterPoolerServices, func(p *poolerserver.QueryPoolerServer) {
    if grpc.CheckServiceMap("pooler", senv) {
      srv := &poolerService{
        pooler: p,
        pubsub: p.PubSubListener(),
      }
      multipoolerpb.RegisterMultiPoolerServiceServer(grpc.Server, srv)
    }
  })
}

CheckServiceMap returns false outright if gRPC is disabled (no grpc-port/socket), and otherwise consults --service-map (grpc-pooler, grpc-consensus, and so on; a - prefix disables a name). One binary can selectively expose subsets of its services this way.

Note that poolerService embeds multipoolerpb.UnimplementedMultiPoolerServiceServer. That embedded “Unimplemented” struct is the standard gRPC forward-compatibility idiom: it satisfies the generated server interface so new RPCs in the proto don’t break the build. See interfaces & composition and gRPC & protobuf.

Phase 4 — `Run`: HTTP first, then serve, then block

RunDefault simply delegates to servenv, which runs the blocking serve-and-shutdown loop. Here is the startup half:

func (sv *ServEnv) Run(bindAddress string, port int, grpcServer *GrpcServer) error {
  sv.PopulateListeningURL(int32(port))

  // Start the HTTP server early so liveness/startup probes respond
  // before potentially-blocking run hooks.
  l, err := net.Listen("tcp", net.JoinHostPort(bindAddress, strconv.Itoa(port)))
  // ...
  go func() { sv.HTTPServe(l) /* ... */ }()

  if err := grpcServer.Create(); err != nil { /* ... */ }   // grpcServer.Server now exists
  if err := sv.FireRunHooks(); err != nil { /* ... */ }     // OnRun + OnRunE fire here
  if err := grpcServer.Serve(sv); err != nil { /* ... */ }

  signal.Notify(sv.exitChan, syscall.SIGTERM, syscall.SIGINT)
  slog.Info("service successfully started", "port", port)
  <-sv.exitChan   // block here while serving
  // ... shutdown ...
}

The ordering is deliberate:

The HTTP listener starts first, in its own goroutine. Kubernetes liveness/startup probes must be able to respond before the potentially-blocking OnRun hooks (which may wait on topology or manager readiness). On K8s 1.33+ native sidecars, a probe deadlock here would stall the whole pod.
grpcServer.Create() — now grpcServer.Server exists.
FireRunHooks() runs OnRun then OnRunE. The error result is joined and returned, so a registration error aborts startup with a wrapped error (%w — see errors).
grpcServer.Serve(sv) registers reflection, a build-identity service, and the standard gRPC health server, then serves on a goroutine.
signal.Notify(...) then <-sv.exitChan — the exact line where the process blocks while serving.

Health and readiness

A service exposes two independent health surfaces, and they mean different things.

HTTP endpoints are built into every service:

/live always returns ok — pure liveness.
/ready iterates the registered ready checks and returns 503 if any of them errors:

sv.HTTPHandleFunc("/ready", func(w http.ResponseWriter, r *http.Request) {
  sv.readyMu.RLock()
  checks := sv.readyChecks
  sv.readyMu.RUnlock()
  for _, check := range checks {
    if err := check(); err != nil {
      w.WriteHeader(http.StatusServiceUnavailable)
      _ = web.Templates.ExecuteTemplate(w, "isok.html", false)
      return
    }
  }
  _ = web.Templates.ExecuteTemplate(w, "isok.html", true)
})

A service adds its own readiness logic via RegisterReadyCheck. The gateway reports not-ready until topology registration has succeeded and at least one pooler has been discovered:

mg.senv.RegisterReadyCheck(func() error {
  mg.serverStatus.mu.Lock()
  defer mg.serverStatus.mu.Unlock()
  if len(mg.serverStatus.InitError) > 0 {
    return errors.New(mg.serverStatus.InitError)
  }
  if mg.poolerDiscovery.PoolerCount() == 0 {
    return errors.New("no poolers discovered")
  }
  return nil
})

serverStatus is a small struct with a mutex guarding its fields — the state shared between this check and the / index page. Guarding shared mutable state with a mutex is the textbook concurrency rule (see sync & the memory model).

gRPC health is auto-registered when the gRPC server starts. Every registered gRPC service is marked SERVING, alongside reflection and a build-identity service:

healthServer := health.NewServer()
healthpb.RegisterHealthServer(g.Server, healthServer)
for service := range g.Server.GetServiceInfo() {
  healthServer.SetServingStatus(service, healthpb.HealthCheckResponse_SERVING)
}

Phase 5 — shutdown: every step is timeout-bounded

Once <-sv.exitChan unblocks on SIGTERM/SIGINT, the shutdown sequence runs in a fixed order, and each waited step is bounded by a timeout:

startTime := time.Now()
slog.Info("entering lameduck mode", "period", sv.lameduckPeriod.Get())
go sv.onTermHooks.Fire()                        // async OnTerm, NOT waited on

sv.fireOnTermSyncHooks(sv.onTermTimeout.Get())  // OnTermSync, waited, ≤ onterm-timeout
if remain := sv.lameduckPeriod.Get() - time.Since(startTime); remain > 0 {
  time.Sleep(remain)                          // finish the lameduck window
}
_ = l.Close()                                   // stop HTTP listener

sv.fireOnCloseHooks(sv.onCloseTimeout.Get())    // OnClose, waited, ≤ onclose-timeout
// ... telemetry shutdown last, with a fresh 5s ctx ...

The exact ordered sequence on a signal:

Enter lameduck (default 50ms — short by design; K8s drains via endpoint removal).
Fire OnTerm asynchronously, fire-and-forget.
Fire OnTermSync, racing the hooks against a timer. This is where gRPC stops gracefully — the wired hook calls g.Server.GracefulStop().
Sleep the remainder of the lameduck window, if any.
Close the HTTP listener.
Fire OnClose — the service’s own teardown.
Shut telemetry down last, with a fresh 5s context, so spans from cleanup are still captured.

The timeout race

Both OnTermSync and OnClose funnel through one helper that selects the hooks-done channel against a timer:

func (se *ServEnv) fireHooksWithTimeout(timeout time.Duration, name string, hookFn func()) bool {
  timer := time.NewTimer(timeout)
  defer timer.Stop()
  done := make(chan struct{})
  go func() {
    hookFn()
    close(done)
  }()
  select {
  case <-done:
    return true
  case <-timer.C:
    slog.Info(name + " hooks timed out")
    return false
  }
}

This is the classic select-with-timer pattern from concurrency: a stuck cleanup hook cannot block shutdown past --onterm-timeout (30s, matched to the K8s grace period) or --onclose-timeout (10s). The hook goroutine may leak, but the process exits on schedule.

The service’s `Shutdown`: cancel first, then LIFO teardown

The gateway wires its Shutdown as the OnClose hook. The very first move is to cancel the service-lifetime context:

func (mg *MultiGateway) Shutdown() {
  // Cancel the service-lifetime context first so health-stream goroutines
  // stop promptly, before we close the gRPC connections they use.
  if mg.shutdownCancel != nil {
    mg.shutdownCancel()
  }
  // ... close listeners, cancelManager, executor, queryRegistry,
  //     buffer, poolerGateway, poolerDiscovery ...
  mg.tr.Unregister()
  mg.ts.Close()
}

Everything after the cancel tears down in reverse of construction order (LIFO), ending by unregistering from topology and closing the topology store.

The cross-network contract: `queryservice.QueryService`

The lifecycle exists to serve queries, and the request path is glued together by one interface: queryservice.QueryService. Its package doc says it’s implemented on the pooler side (the server of the gRPC connection), by PoolerGateway (which abstracts away managing pooler connections), and by the gateway-side gRPC client:

// All methods must be safe to be called concurrently.
type QueryService interface {
  ExecuteQuery(ctx context.Context, target *query.Target, sql string, options *query.ExecuteOptions) (*sqltypes.Result, *query.ReservedState, error)
  StreamExecute(ctx context.Context, target *query.Target, sql string, options *query.ExecuteOptions, /* ... */)
  // ... Describe, Close, COPY family, ConcludeTransaction ...
}

That same interface is satisfied by three implementations:

the pooler-side executor (real execution against PostgreSQL),
the gateway-side gRPC client grpcQueryService (whose constructor returns a queryservice.QueryService),
and PoolerGateway.

This is interface-based dependency inversion across a network boundary: the gateway calls QueryService methods; one implementation turns them into gRPC calls; another, on the pooler side, executes them. Every method takes context.Context first (so cancellation and deadlines propagate over the wire) and is documented as concurrency-safe. See interfaces & composition and context.

Contrast: the manager pattern

multipooler has the same skeleton — Init / RunDefault / Shutdown, a senv and a grpcServer — but organizes differently. It delegates its state and topology to a dedicated manager.MultiPoolerManager and registers gRPC through helper functions:

poolerManager.Start(mp.senv)
grpcmanagerservice.RegisterPoolerManagerServices(mp.senv, mp.grpcServer)
grpcconsensusservice.RegisterConsensusServices(mp.senv, mp.grpcServer)
grpcpoolerservice.RegisterPoolerServices(mp.senv, mp.grpcServer)

Topology registration and shutdown are delegated to the manager via OnRun / OnClose:

mp.senv.OnRun(func() {
  poolerManager.StartTopoRegistration(func(s string) {
    mp.serverStatus.mu.Lock()
    defer mp.serverStatus.mu.Unlock()
    mp.serverStatus.InitError = s
  })
})

mp.senv.OnClose(func() {
  ctx, cancel := context.WithTimeout(ctxutil.Detach(startCtx), 10*time.Second)
  defer cancel()
  mp.Shutdown(ctx)
})

Where the pooler hides topology behind a manager, the gateway wires it inline through a toporeg helper: Register retries in a background goroutine with backoff; RegisterSynchronous blocks until success (the gateway uses it to claim a unique PID prefix); Unregister cancels the retry goroutine, waits, then deregisters with a fresh short-lived context. The two styles do the same job — one factored into a manager, one inline — which is a useful lesson in itself about how a large codebase carries more than one idiom for the same task.

Putting it together

A single-paragraph trace of the whole machine: main() builds the cobra command and calls Execute; cobra loads config then runs run → Init. Init calls senv.Init (telemetry, hostname, OnInit fire, /live+/ready endpoints, double-init/root guards), creates shutdownCtx, builds the object graph passing that context into every goroutine, registers gRPC handlers in an OnRun hook (because the *grpc.Server is still nil), registers a ready check, and wires Shutdown as OnClose. Then RunDefault → senv.Run: HTTP listens first, grpcServer.Create(), FireRunHooks(), grpcServer.Serve(), then signal.Notify and block on the exit channel. On SIGTERM: lameduck → async OnTerm → waited OnTermSync (gRPC GracefulStop, ≤30s) → finish lameduck → close HTTP listener → waited OnClose (Shutdown: cancel ctx, LIFO teardown, topo unregister, ≤10s) → telemetry shutdown → exit.

Checkpoints

Why can’t a service register its gRPC handlers inside Init, and where must it do it instead?

Because grpcServer.Server (the *grpc.Server) is nil until grpcServer.Create() runs inside ServEnv.Run, which executes after Init returns. Registering in Init dereferences nil; registering after Serve panics with “grpc: Server.RegisterService after Server.Serve”. The correct place is an OnRun (or OnRunE) hook, which fires between Create and Serve.

A cleanup hook hangs forever. Does the process still exit? Why?

Yes. OnTermSync and OnClose fire through fireHooksWithTimeout, which selects the hooks-done channel against a time.Timer. After —onterm-timeout (30s) or —onclose-timeout (10s) the timer fires, the function returns false, and Run proceeds to exit. The stuck goroutine leaks but does not block shutdown.

Why does ServEnv.Run start the HTTP listener before firing the run hooks and before gRPC Serve?

So Kubernetes liveness/startup probes can respond while OnRun hooks (which may block on topology or manager readiness) are still running. On K8s 1.33+ native sidecars, a probe deadlock there would stall pod startup.

What is the very first thing MultiGateway.Shutdown does, and why does order matter?

It calls mg.shutdownCancel() to cancel the service-lifetime context first, so health-stream goroutines stop before the gRPC connections they use are closed. The rest of teardown then runs in reverse of construction order (LIFO), ending with topology unregister and closing the topo store.

Exercises

Trace the full lifecycle through the real files. Start at go/cmd/multigateway/main.go (main → cmd.Execute → RunE → run). Follow run into go/services/multigateway/init.go (Init); note where senv.Init runs, and where OnRun, RegisterReadyCheck, and OnClose are registered. Then jump to RunDefault → go/common/servenv/run.go (Run). Write down the exact line where the process blocks, and the ordered list of every step that runs after a SIGTERM.
Classify every hook in one service. Run grep -rn 'OnInit\|OnRun\|OnTermSync\|OnTerm(\|OnClose' go/services/multigateway/ go/common/servenv/grpc_server.go. For each match, classify it as init / run / term-async / term-sync / close, and predict the order and timeout under which it fires during shutdown. Verify against run.go.
Map QueryService to its implementations. Open go/common/queryservice/queryservice.go, then grep -rn 'queryservice.QueryService' go/ and confirm the pooler-side server, the gateway-side gRPC client, and PoolerGateway all satisfy it. For StreamExecute, trace one call from the gateway client across the wire to the pooler server.
Compare the two lifecycle styles. List the structural differences between the gateway (inline toporeg.RegisterSynchronous, inline Shutdown) and the pooler (a manager with StartTopoRegistration / StopTopoRegistration). Argue which is easier to test and why.

Continue to parser, lexer, AST & codegen — how the system turns SQL text into an AST and back.

Anatomy of a Service

The big picture: a hook-driven state machine

How hooks fire: parallel goroutines

Phase 1 — process start: cobra to `Init`

Phase 2 — `ServEnv.Init`: the serving environment comes up

Phase 3 — deferred gRPC registration via `OnRun`

service-map gating

Phase 4 — `Run`: HTTP first, then serve, then block

Health and readiness

Phase 5 — shutdown: every step is timeout-bounded

The timeout race

The service’s `Shutdown`: cancel first, then LIFO teardown

The cross-network contract: `queryservice.QueryService`

Contrast: the manager pattern

Putting it together

Checkpoints

Exercises

Next

See also

Anatomy of a Service

The big picture: a hook-driven state machine

How hooks fire: parallel goroutines

Phase 1 — process start: cobra to Init

Phase 2 — ServEnv.Init: the serving environment comes up

Phase 3 — deferred gRPC registration via OnRun

service-map gating

Phase 4 — Run: HTTP first, then serve, then block

Health and readiness

Phase 5 — shutdown: every step is timeout-bounded

The timeout race

The service’s Shutdown: cancel first, then LIFO teardown

The cross-network contract: queryservice.QueryService

Contrast: the manager pattern

Putting it together

Checkpoints

Exercises

Next

See also

Phase 1 — process start: cobra to `Init`

Phase 2 — `ServEnv.Init`: the serving environment comes up

Phase 3 — deferred gRPC registration via `OnRun`

Phase 4 — `Run`: HTTP first, then serve, then block

The service’s `Shutdown`: cancel first, then LIFO teardown

The cross-network contract: `queryservice.QueryService`