Anatomy of a Service
What you will learn: how a long-running Go service is structured, brought to a serving state, kept healthy, and shut down gracefully — by dissecting one small lifecycle engine and the service that drives it end to end.
Prerequisites: interfaces & composition, concurrency, context, sync & the memory model, and errors. It also helps to have read cmd & cobra (process entry) and config & viperutil (where the flag values come from).
Every long-running binary in this system — gateway, pooler, orchestrator, admin, the Postgres controller — is a server, and they all share one bootstrapping engine called servenv (short for serving environment). Understand servenv plus one concrete service and you understand all of them. We’ll use multigateway as the worked example because its startup is the most complete and self-contained, and bring in multipooler for contrast.
A *servenv.ServEnv is the per-process object that owns the HTTP mux, the lifecycle hooks, readiness checks, telemetry, the PID file, and the signal channel. A service struct like MultiGateway holds a *ServEnv and a *GrpcServer rather than embedding the machinery itself — composition over inheritance, the Go way.
The big picture: a hook-driven state machine
servenv is built around six families of lifecycle hooks. A hook is just a registered function (func() or func() error). The service registers callbacks; servenv fires them at the right phase. This is the functional-callback idiom; see concurrency for why functions-as-values matter here, since hooks fire on goroutines.
The whole lifecycle is a small state machine. A process starts, the environment comes up (OnInit), the service wires its object graph and registers deferred work, then Run listens, fires its run hooks, and blocks while serving. A signal kicks it into a bounded, ordered shutdown:
    stateDiagram-v2
        [*] --> Start: process start (cobra RunE)
        Start --> Init: ServEnv.Init(id)
        Init --> Init: fire OnInit hooks
        note right of Init
            telemetry, hostname, /live + /ready, PID file
        end note
        Init --> Wired: service.Init builds object graph
        note right of Wired
            register OnRun / OnRunE (gRPC handlers), RegisterReadyCheck, OnClose
        end note
        Wired --> Serving: ServEnv.Run
        note right of Serving
            HTTP listen FIRST, then grpcServer.Create(), then FireRunHooks() (OnRun/OnRunE), then Serve()
        end note
        Serving --> Serving: block on signal channel
        Serving --> Lameduck: SIGTERM / SIGINT
        Lameduck --> TermSync: go OnTerm (async, not waited)
        TermSync --> CloseHTTP: OnTermSync (waited, gRPC GracefulStop)
        CloseHTTP --> Close: finish lameduck, close HTTP listener
        Close --> Exit: OnClose (waited, Shutdown / topo unregister)
        Exit --> [*]: telemetry shutdown, exit
The six registration methods give you a hook for each interesting moment:
    func (se *ServEnv) OnInit(f func())       { se.onInitHooks.Add(f) }     // start of lifecycle
    func (se *ServEnv) OnRun(f func())        { se.onRunHooks.Add(f) }      // start of Run
    func (se *ServEnv) OnRunE(f func() error) { se.onRunEHooks.Add(f) }     // start of Run, can error
    func (se *ServEnv) OnTerm(f func())       { se.onTermHooks.Add(f) }     // on SIGTERM (async)
    func (se *ServEnv) OnTermSync(f func())   { se.onTermSyncHooks.Add(f) } // on SIGTERM (waited)
    func (sv *ServEnv) OnClose(f func())      { sv.onCloseHooks.Add(f) }    // end of lifecycle

| Hook | When it fires | Waited on? | Timeout bound | Typical use |
|---|---|---|---|---|
| OnInit | inside ServEnv.Init | yes (sequential w.r.t. Init) | none | early one-time setup |
| OnRun | inside Run, after Create, before Serve | yes | none | register gRPC handlers |
| OnRunE | same as OnRun | yes, fail-fast | none | gRPC registration that can error |
| OnTerm | on SIGTERM | no (go ...Fire()) | none | best-effort async notices |
| OnTermSync | on SIGTERM, after async OnTerm launched | yes | --onterm-timeout (30s) | gRPC GracefulStop |
| OnClose | after lameduck period | yes | --onclose-timeout (10s) | service Shutdown, topo unregister |
How hooks fire: parallel goroutines
The hook lists launch one goroutine per registered function and wait for all of them:

    // Fire launches a goroutine for each function and waits for all to finish.
    func (h *Hooks) Fire() {
        h.mu.Lock()
        defer h.mu.Unlock()
        wg := sync.WaitGroup{}
        for _, f := range h.funcs {
            wg.Go(f)
        }
        wg.Wait()
    }

wg.Go(f) is the Go 1.25 convenience form of wg.Add(1); go func(){ defer wg.Done(); f() }() (see sync & the memory model). The error-returning variant is also parallel but fail-fast: it returns the first error received on a channel while the rest may still be running.
Phase 1 — process start: cobra to Init
The entry point is tiny. The main.go builds a cobra command whose RunE calls a run function that is just Init then RunDefault:

    func run(ctx context.Context, mg *multigateway.MultiGateway) error {
        if err := mg.Init(ctx); err != nil {
            return err
        }
        return mg.RunDefault()
    }

PreRunE is wired to load the config before RunE runs (see config & viperutil). main() calls cmd.Execute() and os.Exit(1) on error, and that’s the only place os.Exit is allowed.
Phase 2 — ServEnv.Init: the serving environment comes up
MultiGateway.Init calls senv.Init(...) first, handing it a service identity (name, instance ID, cell):

    if err := mg.senv.Init(servenv.ServiceIdentity{
        ServiceName:       constants.ServiceMultigateway,
        ServiceInstanceID: serviceID,
        Cell:              cell,
    }); err != nil {
        return fmt.Errorf("servenv init: %w", err)
    }

ServEnv.Init is the one-time environment bring-up, in order:

- Set up logging and OpenTelemetry from the identity (resource attributes like service.instance.id, cell, build revision).
- Optionally ignore SIGPIPE.
- Double-init guard: under a mutex, if the env is already inited it calls log.Fatal. Otherwise it marks itself inited.
- Refuse to run as root: if os.Getuid() == 0 it returns an error.
- Set the max stack and resolve the hostname (failing early on a bad one).
- Fire the OnInit hooks; your early setup callbacks run here.
- Register the PID file and the common HTTP endpoints (/live, /ready, /version, /config), pprof, and orphan detection.
After senv.Init returns, Init builds the service object graph — where the request-flow components are wired (see architecture & request flow). The first thing it does is create a service-lifetime context cancelled on shutdown:
    // A service-lifetime context, cancelled once at shutdown.
    mg.shutdownCtx, mg.shutdownCancel = context.WithCancel(ctx)

mg.shutdownCtx is then threaded into every long-running goroutine (pooler discovery, the load balancer, the failover buffer):

    mg.poolerDiscovery = NewGlobalPoolerDiscovery(mg.shutdownCtx, mg.ts, mg.cell.Get(), logger)
    loadBalancer := poolergateway.NewLoadBalancer(mg.shutdownCtx, mg.cell.Get(), logger, poolerTransportCreds)
    mg.buffer = buffer.New(mg.shutdownCtx, mg.bufferConfig, logger)

This is the canonical Go pattern for fanning a cancel signal out to many goroutines: derive one cancellable context, store its CancelFunc, pass the context down, and cancel once at shutdown. See context.
Phase 3 — deferred gRPC registration via OnRun
Here is the single most important structural rule in one of these services. The service holds a *servenv.GrpcServer, but its embedded *grpc.Server field does not exist yet during Init; it’s created later, inside ServEnv.Run, by grpcServer.Create(). So you can’t register gRPC handlers in Init (you’d dereference nil), and you can’t register them after Serve() (grpc-go logs a fatal error and the process exits). The handlers go in an OnRun hook, which fires in the narrow window in between:

    // Register gRPC services via OnRun because grpcServer.Server is only
    // created in servenv.Run() (after Create()), which runs after Init().
    managerServer := NewManagerServer(mg.queryRegistry, mg.pgHandler)
    mg.senv.OnRun(func() {
        mg.cancelManager.RegisterWithGRPCServer(mg.grpcServer.Server)
        managerServer.RegisterWithGRPCServer(mg.grpcServer.Server)
    })

The constraint is documented right at the gRPC Serve site too: all services must register before Serve(), which is exactly why the run hooks fire after Create() and before Serve(). Otherwise the binary crashes with grpc: Server.RegisterService after Server.Serve.
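To see why the closure works even though it is registered while the server is still nil, here is a toy model of the Create, run hooks, Serve ordering. Every type in it (toyServer, wrapper, env) is hypothetical and deliberately avoids the real gRPC library; the toy returns an error for late registration where the real server would abort.

```go
package main

import (
	"errors"
	"fmt"
)

// toyServer mimics a server that rejects registration once serving begins.
type toyServer struct {
	services []string
	serving  bool
}

func (s *toyServer) Register(name string) error {
	if s.serving {
		return errors.New("register after serve: " + name)
	}
	s.services = append(s.services, name)
	return nil
}

// wrapper mirrors the shape described above: Server is nil until Create.
type wrapper struct{ Server *toyServer }

func (w *wrapper) Create() { w.Server = &toyServer{} }
func (w *wrapper) Serve()  { w.Server.serving = true }

// env stores run hooks and fires them between Create and Serve.
type env struct{ onRun []func() }

func (e *env) OnRun(f func()) { e.onRun = append(e.onRun, f) }

func (e *env) Run(w *wrapper) {
	w.Create()
	for _, f := range e.onRun {
		f()
	}
	w.Serve()
}

func main() {
	w := &wrapper{}
	e := &env{}
	// Registered during "Init", while w.Server is still nil. The closure reads
	// w.Server only when it runs, by which time Create has populated it.
	e.OnRun(func() { _ = w.Server.Register("manager") })
	e.Run(w)
	fmt.Println(w.Server.services, w.Server.Register("late"))
	// prints: [manager] register after serve: late
}
```

The key detail is that the hook captures the wrapper, not the inner server, so the field is dereferenced at fire time rather than at registration time.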
service-map gating
Whether a gRPC sub-service is even registered is gated by a --service-map flag through GrpcServer.CheckServiceMap. The pooler shows this concretely:

    func RegisterPoolerServices(senv *servenv.ServEnv, grpc *servenv.GrpcServer) {
        poolerserver.RegisterPoolerServices = append(poolerserver.RegisterPoolerServices, func(p *poolerserver.QueryPoolerServer) {
            if grpc.CheckServiceMap("pooler", senv) {
                srv := &poolerService{
                    pooler: p,
                    pubsub: p.PubSubListener(),
                }
                multipoolerpb.RegisterMultiPoolerServiceServer(grpc.Server, srv)
            }
        })
    }

CheckServiceMap returns false outright if gRPC is disabled (no grpc-port/socket), and otherwise consults --service-map (grpc-pooler, grpc-consensus, and so on; a - prefix disables a name). One binary can selectively expose subsets of its services this way.
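A rough sketch of how such gating could work is shown below. It is not the real CheckServiceMap: parseServiceMap and checkServiceMap are hypothetical helpers, and the comma-separated syntax, the "later entry wins" rule, and the "absent means disabled" default are all assumptions made for the example.

```go
package main

import (
	"fmt"
	"strings"
)

// parseServiceMap is a hypothetical parser: names are comma-separated and a
// leading "-" marks a name as disabled. Later entries override earlier ones.
func parseServiceMap(flag string) map[string]bool {
	m := make(map[string]bool)
	for _, name := range strings.Split(flag, ",") {
		name = strings.TrimSpace(name)
		if name == "" || name == "-" {
			continue
		}
		if strings.HasPrefix(name, "-") {
			m[name[1:]] = false
		} else {
			m[name] = true
		}
	}
	return m
}

// checkServiceMap reports whether a gRPC service should be registered.
// Assumption: a name that never appears in the map is treated as disabled.
func checkServiceMap(m map[string]bool, grpcEnabled bool, name string) bool {
	if !grpcEnabled {
		return false // no gRPC listener at all, so nothing is registered
	}
	return m["grpc-"+name]
}

func main() {
	m := parseServiceMap("grpc-pooler,grpc-consensus,-grpc-consensus")
	fmt.Println(checkServiceMap(m, true, "pooler"))    // true
	fmt.Println(checkServiceMap(m, true, "consensus")) // false
	fmt.Println(checkServiceMap(m, false, "pooler"))   // false
}
```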
Note that poolerService embeds multipoolerpb.UnimplementedMultiPoolerServiceServer. That embedded “Unimplemented” struct is the standard gRPC forward-compatibility idiom: it satisfies the generated server interface so new RPCs in the proto don’t break the build. See interfaces & composition and gRPC & protobuf.
Phase 4 — Run: HTTP first, then serve, then block
RunDefault simply delegates to servenv, which runs the blocking serve-and-shutdown loop. Here is the startup half:

    func (sv *ServEnv) Run(bindAddress string, port int, grpcServer *GrpcServer) error {
        sv.PopulateListeningURL(int32(port))

        // Start the HTTP server early so liveness/startup probes respond
        // before potentially-blocking run hooks.
        l, err := net.Listen("tcp", net.JoinHostPort(bindAddress, strconv.Itoa(port)))
        // ...
        go func() { sv.HTTPServe(l) /* ... */ }()

        if err := grpcServer.Create(); err != nil { /* ... */ } // grpcServer.Server now exists
        if err := sv.FireRunHooks(); err != nil { /* ... */ }   // OnRun + OnRunE fire here
        if err := grpcServer.Serve(sv); err != nil { /* ... */ }

        signal.Notify(sv.exitChan, syscall.SIGTERM, syscall.SIGINT)
        slog.Info("service successfully started", "port", port)
        <-sv.exitChan // block here while serving
        // ... shutdown ...
    }

The ordering is deliberate:

- The HTTP listener starts first, in its own goroutine. Kubernetes liveness/startup probes must be able to respond before the potentially-blocking OnRun hooks (which may wait on topology or manager readiness). On K8s 1.33+ native sidecars, a probe deadlock here would stall the whole pod.
- grpcServer.Create() runs next; from this point grpcServer.Server exists.
- FireRunHooks() runs OnRun then OnRunE. The error result is joined and returned, so a registration error aborts startup with a wrapped error (%w; see errors).
- grpcServer.Serve(sv) registers reflection, a build-identity service, and the standard gRPC health server, then serves on a goroutine.
- signal.Notify(...) then <-sv.exitChan, the exact line where the process blocks while serving.
Health and readiness
A service exposes two independent health surfaces, and they mean different things.
HTTP endpoints are built into every service:

- /live always returns ok: pure liveness.
- /ready iterates the registered ready checks and returns 503 if any of them errors:

    sv.HTTPHandleFunc("/ready", func(w http.ResponseWriter, r *http.Request) {
        sv.readyMu.RLock()
        checks := sv.readyChecks
        sv.readyMu.RUnlock()
        for _, check := range checks {
            if err := check(); err != nil {
                w.WriteHeader(http.StatusServiceUnavailable)
                _ = web.Templates.ExecuteTemplate(w, "isok.html", false)
                return
            }
        }
        _ = web.Templates.ExecuteTemplate(w, "isok.html", true)
    })

A service adds its own readiness logic via RegisterReadyCheck. The gateway reports not-ready until topology registration has succeeded and at least one pooler has been discovered:

    mg.senv.RegisterReadyCheck(func() error {
        mg.serverStatus.mu.Lock()
        defer mg.serverStatus.mu.Unlock()
        if len(mg.serverStatus.InitError) > 0 {
            return errors.New(mg.serverStatus.InitError)
        }
        if mg.poolerDiscovery.PoolerCount() == 0 {
            return errors.New("no poolers discovered")
        }
        return nil
    })

serverStatus is a small struct with a mutex guarding its fields, the state shared between this check and the / index page. Guarding shared mutable state with a mutex is the textbook concurrency rule (see sync & the memory model).
gRPC health is auto-registered when the gRPC server starts. Every registered gRPC service is marked SERVING, alongside reflection and a build-identity service:
    healthServer := health.NewServer()
    healthpb.RegisterHealthServer(g.Server, healthServer)
    for service := range g.Server.GetServiceInfo() {
        healthServer.SetServingStatus(service, healthpb.HealthCheckResponse_SERVING)
    }

Phase 5 — shutdown: every step is timeout-bounded
Once <-sv.exitChan unblocks on SIGTERM/SIGINT, the shutdown sequence runs in a fixed order, and each waited step is bounded by a timeout:
    startTime := time.Now()
    slog.Info("entering lameduck mode", "period", sv.lameduckPeriod.Get())
    go sv.onTermHooks.Fire() // async OnTerm, NOT waited on

    sv.fireOnTermSyncHooks(sv.onTermTimeout.Get()) // OnTermSync, waited, ≤ onterm-timeout
    if remain := sv.lameduckPeriod.Get() - time.Since(startTime); remain > 0 {
        time.Sleep(remain) // finish the lameduck window
    }
    _ = l.Close() // stop HTTP listener

    sv.fireOnCloseHooks(sv.onCloseTimeout.Get()) // OnClose, waited, ≤ onclose-timeout
    // ... telemetry shutdown last, with a fresh 5s ctx ...

The exact ordered sequence on a signal:

- Enter lameduck (default 50ms, short by design; K8s drains via endpoint removal).
- Fire OnTerm asynchronously, fire-and-forget.
- Fire OnTermSync, racing the hooks against a timer. This is where gRPC stops gracefully: the wired hook calls g.Server.GracefulStop().
- Sleep the remainder of the lameduck window, if any.
- Close the HTTP listener.
- Fire OnClose, the service’s own teardown.
- Shut telemetry down last, with a fresh 5s context, so spans from cleanup are still captured.
The timeout race
Both OnTermSync and OnClose funnel through one helper that selects the hooks-done channel against a timer:

    func (se *ServEnv) fireHooksWithTimeout(timeout time.Duration, name string, hookFn func()) bool {
        timer := time.NewTimer(timeout)
        defer timer.Stop()
        done := make(chan struct{})
        go func() {
            hookFn()
            close(done)
        }()
        select {
        case <-done:
            return true
        case <-timer.C:
            slog.Info(name + " hooks timed out")
            return false
        }
    }

This is the classic select-with-timer pattern from concurrency: a stuck cleanup hook cannot block shutdown past --onterm-timeout (30s, matched to the K8s grace period) or --onclose-timeout (10s). The hook goroutine may leak, but the process exits on schedule.
The service’s Shutdown: cancel first, then LIFO teardown
The gateway wires its Shutdown as the OnClose hook. The very first move is to cancel the service-lifetime context:

    func (mg *MultiGateway) Shutdown() {
        // Cancel the service-lifetime context first so health-stream goroutines
        // stop promptly, before we close the gRPC connections they use.
        if mg.shutdownCancel != nil {
            mg.shutdownCancel()
        }
        // ... close listeners, cancelManager, executor, queryRegistry,
        // buffer, poolerGateway, poolerDiscovery ...
        mg.tr.Unregister()
        mg.ts.Close()
    }

Everything after the cancel tears down in reverse of construction order (LIFO), ending by unregistering from topology and closing the topology store.
The cross-network contract: queryservice.QueryService
The lifecycle exists to serve queries, and the request path is glued together by one interface: queryservice.QueryService. Its package doc says it’s implemented on the pooler side (the server of the gRPC connection), by PoolerGateway (which abstracts away managing pooler connections), and by the gateway-side gRPC client:

    // All methods must be safe to be called concurrently.
    type QueryService interface {
        ExecuteQuery(ctx context.Context, target *query.Target, sql string, options *query.ExecuteOptions) (*sqltypes.Result, *query.ReservedState, error)
        StreamExecute(ctx context.Context, target *query.Target, sql string, options *query.ExecuteOptions, /* ... */)
        // ... Describe, Close, COPY family, ConcludeTransaction ...
    }

That same interface is satisfied by three implementations:

- the pooler-side executor (real execution against PostgreSQL),
- the gateway-side gRPC client grpcQueryService (whose constructor returns a queryservice.QueryService),
- and PoolerGateway.
This is interface-based dependency inversion across a network boundary: the gateway calls QueryService methods; one implementation turns them into gRPC calls; another, on the pooler side, executes them. Every method takes context.Context first (so cancellation and deadlines propagate over the wire) and is documented as concurrency-safe. See interfaces & composition and context.
Contrast: the manager pattern
multipooler has the same skeleton (Init / RunDefault / Shutdown, a senv and a grpcServer) but organizes differently. It delegates its state and topology to a dedicated manager.MultiPoolerManager and registers gRPC through helper functions:

    poolerManager.Start(mp.senv)
    grpcmanagerservice.RegisterPoolerManagerServices(mp.senv, mp.grpcServer)
    grpcconsensusservice.RegisterConsensusServices(mp.senv, mp.grpcServer)
    grpcpoolerservice.RegisterPoolerServices(mp.senv, mp.grpcServer)

Topology registration and shutdown are delegated to the manager via OnRun / OnClose:

    mp.senv.OnRun(func() {
        poolerManager.StartTopoRegistration(func(s string) {
            mp.serverStatus.mu.Lock()
            defer mp.serverStatus.mu.Unlock()
            mp.serverStatus.InitError = s
        })
    })

    mp.senv.OnClose(func() {
        ctx, cancel := context.WithTimeout(ctxutil.Detach(startCtx), 10*time.Second)
        defer cancel()
        mp.Shutdown(ctx)
    })

Where the pooler hides topology behind a manager, the gateway wires it inline through a toporeg helper: Register retries in a background goroutine with backoff; RegisterSynchronous blocks until success (the gateway uses it to claim a unique PID prefix); Unregister cancels the retry goroutine, waits, then deregisters with a fresh short-lived context. The two styles do the same job, one factored into a manager and one inline, which is a useful lesson in itself about how a large codebase carries more than one idiom for the same task.
Putting it together
A single-paragraph trace of the whole machine: main() builds the cobra command and calls Execute; cobra loads config then runs run → Init. Init calls senv.Init (telemetry, hostname, OnInit fire, /live+/ready endpoints, double-init/root guards), creates shutdownCtx, builds the object graph passing that context into every goroutine, registers gRPC handlers in an OnRun hook (because the *grpc.Server is still nil), registers a ready check, and wires Shutdown as OnClose. Then RunDefault → senv.Run: HTTP listens first, grpcServer.Create(), FireRunHooks(), grpcServer.Serve(), then signal.Notify and block on the exit channel. On SIGTERM: lameduck → async OnTerm → waited OnTermSync (gRPC GracefulStop, ≤30s) → finish lameduck → close HTTP listener → waited OnClose (Shutdown: cancel ctx, LIFO teardown, topo unregister, ≤10s) → telemetry shutdown → exit.
Checkpoints
Why can’t a service register its gRPC handlers inside Init, and where must it do it instead?
Because grpcServer.Server (the *grpc.Server) is nil until grpcServer.Create() runs inside ServEnv.Run, which executes after Init returns. Registering in Init dereferences nil; registering after Serve aborts the process with “grpc: Server.RegisterService after Server.Serve”. The correct place is an OnRun (or OnRunE) hook, which fires between Create and Serve.

A cleanup hook hangs forever. Does the process still exit? Why?
Yes. OnTermSync and OnClose fire through fireHooksWithTimeout, which selects the hooks-done channel against a time.Timer. After --onterm-timeout (30s) or --onclose-timeout (10s) the timer fires, the function returns false, and Run proceeds to exit. The stuck goroutine leaks but does not block shutdown.

Why does ServEnv.Run start the HTTP listener before firing the run hooks and before gRPC Serve?
So Kubernetes liveness/startup probes can respond while OnRun hooks (which may block on topology or manager readiness) are still running. On K8s 1.33+ native sidecars, a probe deadlock there would stall pod startup.

What is the very first thing MultiGateway.Shutdown does, and why does order matter?
It calls mg.shutdownCancel() to cancel the service-lifetime context first, so health-stream goroutines stop before the gRPC connections they use are closed. The rest of teardown then runs in reverse of construction order (LIFO), ending with topology unregister and closing the topo store.

Exercises
- Trace the full lifecycle through the real files. Start at go/cmd/multigateway/main.go (main → cmd.Execute → RunE → run). Follow run into go/services/multigateway/init.go (Init); note where senv.Init runs, and where OnRun, RegisterReadyCheck, and OnClose are registered. Then jump to RunDefault → go/common/servenv/run.go (Run). Write down the exact line where the process blocks, and the ordered list of every step that runs after a SIGTERM.
- Classify every hook in one service. Run grep -rn 'OnInit\|OnRun\|OnTermSync\|OnTerm(\|OnClose' go/services/multigateway/ go/common/servenv/grpc_server.go. For each match, classify it as init / run / term-async / term-sync / close, and predict the order and timeout under which it fires during shutdown. Verify against run.go.
- Map QueryService to its implementations. Open go/common/queryservice/queryservice.go, then grep -rn 'queryservice.QueryService' go/ and confirm the pooler-side server, the gateway-side gRPC client, and PoolerGateway all satisfy it. For StreamExecute, trace one call from the gateway client across the wire to the pooler server.
- Compare the two lifecycle styles. List the structural differences between the gateway (inline toporeg.RegisterSynchronous, inline Shutdown) and the pooler (a manager with StartTopoRegistration / StopTopoRegistration). Argue which is easier to test and why.
Continue to parser, lexer, AST & codegen — how the system turns SQL text into an AST and back.