The SQL Parser: Lexer, goyacc & AST Codegen
What you will learn: how a real system turns SQL text into a typed AST — a hand-written lexer, a goyacc-generated LALR parser, an ast.Node type system built on struct embedding, and a code generator that writes the clone/rewrite boilerplate so nobody has to maintain it by hand.
Prerequisites: interfaces & composition (the Node interface, BaseNode embedding, type switches), types, structs & methods (struct embedding, the New* constructor idiom), packages & go:generate, generics (to see why this code uses codegen instead), and sync & the memory model (sync.Pool).
This is the most advanced Go in the multigres codebase. It lives under go/common/parser/ and is a direct, file-by-file port of PostgreSQL’s own scanner and grammar, glued together with three layers of code generation. Once you understand the pipeline you can navigate roughly 2 MB of generated code with confidence, because you will know which source file each generated artifact came from.
The pipeline at a glance
Section titled “The pipeline at a glance”A query goes through four stages: raw text becomes tokens, tokens drive a grammar, the grammar builds an AST, and a separate generator produces the helpers that walk that AST. Here is the front half — text to tree:
flowchart TB
SQL(["SQL string"])
Lexer["NewLexer(input)<br/>lexer.go"]
Token["*Token { Type, Value, Position, Text }<br/>tokens.go"]
Parse["yyParse loop (LALR tables)<br/>postgres.go — generated from postgres.y"]
AST(["[]ast.Stmt — the AST<br/>ast/*.go"])
SQL --> Lexer
Lexer -->|"l.NextToken()"| Token
Token -->|"Lexer.Lex copies token to yySymType"| Parse
Parse -->|"grammar actions build nodes"| AST
Two more files are generated from the AST itself — the deep-copy and the tree-rewriter:
flowchart LR Nodes["ast/*.go<br/>(Node implementations)"] Gen["asthelpergen<br/>(reads the Node interface<br/>by type-checking)"] Clone["ast_clone.go<br/>deep copy"] Rewrite["ast_rewrite.go<br/>visitor / tree rewriter"] Nodes --> Gen Gen --> Clone Gen --> Rewrite
The public entry point is ParseSQL. It lives in the generated postgres.go because it is copied verbatim from the grammar file’s preamble:
func ParseSQL(input string) ([]ast.Stmt, error) { lexer := NewLexer(input) parser := parserPool.Get().(yyParser) parser.Parse(lexer) parserPool.Put(parser)
if lexer.HasErrors() { return nil, lexer.GetErrors()[0] }
return lexer.GetParseTree(), nil}Note the sync.Pool of parsers. Parser state tables are large and reusable, so pooling avoids re-allocating them per query on the hot query path. The pool stores yyParser values produced by the generated yyNewParser(). See sync & the memory model for sync.Pool semantics — anything in a pool can be garbage-collected or handed to another goroutine, so it must be reset on Parse.
Layer 1 — the hand-written lexer
Section titled “Layer 1 — the hand-written lexer”The lexer is not generated. It is a manual port of PostgreSQL’s scan.l and parser.c, spread across lexer.go, tokens.go, keywords.go, and charclass.go.
Tokens
Section titled “Tokens”type TokenValue struct { Ival int // integer literals (ICONST, PARAM) Str string // identifiers and non-integer literals Keyword string // canonical spelling of keywords}
type Token struct { Type int // parser-generated token constant Value TokenValue // the token's value (a "union") Position int // byte offset from start of input Text string // raw text that produced this token}TokenValue is Go’s answer to a C union: a struct where only one field is meaningful at a time, by convention. The lexer fills Ival for ICONST, Str for identifiers and strings, and Keyword for keywords.
NextToken and PostgreSQL’s two-phase lex
Section titled “NextToken and PostgreSQL’s two-phase lex”type Lexer struct { context *ParseContext // unified parse context (input, position, errors) parseTree []ast.Stmt // parse result, filled by grammar actions}NextToken runs two phases, mirroring PostgreSQL’s base_yylex:
func (l *Lexer) NextToken() *Token { token, err := l.nextTokenInternal() if err != nil { l.RecordError(err) return NewToken(INVALID, l.context.CurrentPosition(), "") } // keyword lookahead (WITH_LA, etc.) + Unicode conversion (UIDENT->IDENT, USCONST->SCONST) return l.applyTokenFilter(token)}applyTokenFilter is the filter phase. Some PostgreSQL keywords change meaning based on the next token — for example WITH followed by ORDINALITY versus WITH opening a CTE. The filter peeks ahead for FORMAT, NOT, NULLS_P, WITH, and WITHOUT and rewrites the token type, and it converts the Unicode-escape token forms UIDENT/USCONST into plain IDENT/SCONST. This is how a single NextToken can do lookahead without the grammar needing to.
Keywords
Section titled “Keywords”type KeywordInfo struct { Name string // lowercase keyword name TokenType int // token constant (from postgres.go) Category KeywordCategory // Unreserved / ColName / TypeFuncName / Reserved CanBareLabel bool}
// the ~491-entry flat tablevar Keywords = []KeywordInfo{ {"abort", ABORT_P, UnreservedKeyword, true}, // ... hundreds more}The four KeywordCategory values mirror PostgreSQL’s reservation classes and decide whether a keyword may double as a column or function name. LookupKeyword does an ASCII-only lowercase, then a map lookup. The token constant in each row (ABORT_P, SELECT, and the rest) again comes from the generated postgres.go.
Layer 2 — the goyacc grammar
Section titled “Layer 2 — the goyacc grammar”The grammar source: postgres.y
Section titled “The grammar source: postgres.y”postgres.y is a large yacc grammar with three parts: a Go preamble (%{ ... %}), declarations (%union/%struct, %token), and BNF rules whose actions build AST nodes. Here is the rule that builds a table reference:
qualified_name: ColId { $$ = &ast.RangeVar{ RelName: $1, Inh: true, // inheritance enabled by default } } | ColId indirection { ... builds schema-qualified RangeVar from $1.$2 ... } ;In yacc-speak $$ is the value this rule produces, and $1, $2 are the values of the symbols on the right-hand side. So qualified_name becomes a *ast.RangeVar whose RelName is the identifier text. The grammar action is literally Go code spliced into the generated parser.
The generated parser: postgres.go
Section titled “The generated parser: postgres.go”goyacc compiles the grammar into the giant LALR tables (yyAct, yyExca, and friends) plus the yyParse driver loop. You will essentially never read this file top to bottom — but you do need to know how it talks to the lexer.
The lexer-to-parser contract
Section titled “The lexer-to-parser contract”goyacc emits an interface the lexer must satisfy:
type yyLexer interface { Lex(lval *yySymType) int Error(s string)}This is the classic Go idiom of the consumer defining the interface (see interfaces & composition). The hand-written Lexer satisfies it via two methods that live in the grammar preamble, and so end up in postgres.go:
func (l *Lexer) Lex(lval *yySymType) int { token := l.NextToken() if token == nil { return EOF // EOF = 0, exactly what yacc expects } lval.location = token.Position lval.str = token.Value.Str lval.ival = token.Value.Ival return token.Type}
func (l *Lexer) Error(s string) { l.RecordError(fmt.Errorf("parse error at position %d: %s", l.GetPosition(), s))}yyParse calls Lex repeatedly to pull tokens, copying each token’s value into the parser’s semantic-value struct yySymType, and calls Error on a syntax error. The grammar actions then assemble nodes and stash the final tree in l.parseTree, which ParseSQL returns via GetParseTree().
Layer 3 — the AST package
Section titled “Layer 3 — the AST package”go/common/parser/ast/ is around 30 .go files split by statement category, plus the two generated helper files:
Directorygo/common/parser/ast/
- nodes.go the
Nodeinterface,BaseNode,NodeTag - statements.go core statements (
RangeVarlives here) - ddl_statements.go DDL nodes
- expressions.go expression nodes
- utility_statements.go utility nodes
- … more category files
- ast_clone.go generated — deep copy
- ast_rewrite.go generated — visitor / rewriter
- nodes.go the
At its core is one interface.
The Node interface and BaseNode
Section titled “The Node interface and BaseNode”type Node interface { NodeTag() NodeTag // type tag (mirrors PG's NodeTag enum) Location() int // byte offset in source, or -1 SetLocation(location int) String() string // debug representation SqlString() string // SQL deparse — enables SQL -> AST -> SQL round-trip}You almost never implement these five methods by hand. Instead every concrete node embeds BaseNode, which implements four of them for free:
type BaseNode struct { Tag NodeTag // node type tag Loc int // source location in bytes}
func (n *BaseNode) NodeTag() NodeTag { return n.Tag }func (n *BaseNode) Location() int { return n.Loc }func (n *BaseNode) SetLocation(loc int) { n.Loc = loc }func (n *BaseNode) String() string { return fmt.Sprintf("%s@%d", n.Tag, n.Loc) }This is Go struct embedding doing interface satisfaction — see interfaces & composition and types & methods. By embedding BaseNode, a struct promotes those methods and so already implements four-fifths of Node with zero code.
The fifth method, SqlString(), is intentionally not usable from BaseNode:
func (n *BaseNode) SqlString() string { if n.Tag == T_Invalid { return "<INVALID>" } panic(fmt.Sprintf("SqlString() not implemented for node type %s ...", n.Tag))}NodeTag — the type enum
Section titled “NodeTag — the type enum”type NodeTag int
const ( T_Invalid NodeTag = iota T_Node T_Query T_SelectStmt // ... 200+ more, mirroring postgres nodes.h // (T_RangeVar, T_Alias, ... appear further down the same block))NodeTag is an int enum built with iota (the standard idiom — see stdlib & idioms), one tag per node type, mirroring PostgreSQL’s NodeTag from nodes.h. Each concrete node sets its tag in its New* constructor. NodeTag also has a String() method so tags print readably in debug output.
A concrete node, start to finish
Section titled “A concrete node, start to finish”type RangeVar struct { BaseNode CatalogName string // database name, or empty SchemaName string // schema name, or empty RelName string // relation/sequence name Inh bool // expand by inheritance? RelPersistence rune Alias *Alias // table alias & optional column aliases}
// the New* constructor sets the tagfunc NewRangeVar(relName string, schemaName, catalogName string) *RangeVar { return &RangeVar{ BaseNode: BaseNode{Tag: T_RangeVar}, RelName: relName, SchemaName: schemaName, CatalogName: catalogName, Inh: true, }}
// SqlString deparses (note the ONLY / alias logic)func (r *RangeVar) SqlString() string { result := FormatFullyQualifiedName(r.CatalogName, r.SchemaName, r.RelName) if !r.Inh { result = "ONLY " + result } if r.Alias != nil { if aliasStr := r.Alias.SqlString(); aliasStr != "" { result += " " + aliasStr } } return result}Notice the field shapes, because they determine what the generated code does:
CatalogName,SchemaName,RelName(string),Inh(bool),RelPersistence(rune) — basic types.Alias *Alias— a pointer to anotherNode-implementing type.
These two categories are treated very differently by codegen, which is the whole point of the next section.
Layer 4 — the codegen: asthelpergen
Section titled “Layer 4 — the codegen: asthelpergen”With more than 200 node types, hand-writing a deep-clone and a tree-rewriter for each would be roughly 10,000 lines of error-prone boilerplate. Instead, the codebase generates them.
The trigger
Section titled “The trigger”//go:generate go run ../../../tools/asthelpergen/main --in . --iface github.com/multigres/multigres/go/common/parser/ast.NodeThe flags: --in is the package(s) to scan, --iface is the (required) fully-qualified root interface, plus --clone_exclude and --verify.
How the tool discovers node types
Section titled “How the tool discovers node types”It does not parse comments or maintain a list. It loads the package with golang.org/x/tools/go/packages, gets the type-checker’s view, and asks the compiler “who implements Node?”:
func findImplementations(scope *types.Scope, iff *types.Interface, impl func(types.Type) error) error { for _, name := range scope.Names() { obj := scope.Lookup(name) if _, ok := obj.(*types.TypeName); !ok { continue } baseType := obj.Type() if types.Implements(baseType, iff) { // value type implements? if err := impl(baseType); err != nil { return err } continue } pointerT := types.NewPointer(baseType) // or *T implements? if types.Implements(pointerT, iff) { if err := impl(pointerT); err != nil { return err } continue } } return nil}From each discovered type the tool walks the fields, and any field type it sees (*Alias, []Node, and so on) is added to a todo queue, so the full transitive set of types is generated. readValueOfType in clone_gen.go calls spi.addType(t) for non-basic field types and emits a CloneRefOf<T>(...) call; basic types are returned as-is (just copied).
How it emits Go: dave/jennifer
Section titled “How it emits Go: dave/jennifer”The tool builds Go source programmatically with the jen package (dave/jennifer). For example:
func newCloneGen(pkgname string, options *CloneOptions) *cloneGen { file := jen.NewFile(pkgname) addLicenseHeader(file) file.Comment("Code generated by asthelpergen. DO NOT EDIT.") return &cloneGen{exclude: options.Exclude, file: file}}
// names the output filefunc (c *cloneGen) genFile(_ generatorSPI) (string, *jen.File) { return "ast_clone.go", c.file}Builders in the jen.Func().Id(...).Call(...).Block(...) style construct each function; the result is rendered to ast_clone.go and ast_rewrite.go (the rewrite half lives in rewrite_gen.go).
Generated clone — ast_clone.go
Section titled “Generated clone — ast_clone.go”func CloneNode(in Node) Node { if in == nil { return nil } switch in := in.(type) { case *A_ArrayExpr: return CloneRefOfA_ArrayExpr(in) case *A_Const: // ... one case per node type }}CloneNode is a big type switch (interfaces & composition) dispatching to a per-type function. Here is the one for our RangeVar:
func CloneRefOfRangeVar(n *RangeVar) *RangeVar { if n == nil { return nil } out := *n // shallow struct copy — copies ALL basic fields out.BaseNode = CloneBaseNode(n.BaseNode) out.Alias = CloneRefOfAlias(n.Alias) // deep-clone the pointer field return &out}Map every line back to the struct:
out := *ncopies the whole struct by value, soCatalogName,SchemaName,RelName,Inh, andRelPersistenceare deep-copied automatically (strings, bools, and runes are value types — see pointers, values & memory).out.Alias = CloneRefOfAlias(n.Alias)overwrites the pointer field with a fresh deep clone, because the struct copy only copied the pointer, not the pointee.CloneRefOfAliasis nil-safe, so anilalias staysnil.
That is the entire rule: basic fields ride along on out := *n; pointer, slice, and interface fields get a recursive CloneRefOf.../CloneSliceOf... call.
Generated rewrite — ast_rewrite.go (the visitor)
Section titled “Generated rewrite — ast_rewrite.go (the visitor)”The public API is hand-written in rewriter_api.go; the per-type walkers are generated. You call:
func Rewrite(node Node, pre, post ApplyFunc) (result Node) { parent := &RootNode{node} replacer := func(newNode Node, _ Node) { parent.Node = newNode } a := &application{pre: pre, post: post} a.rewriteNode(parent, node, replacer) return parent.Node}pre runs before a node’s children (return false to prune them); post runs after (return false to abort). The callback receives a *Cursor:
type Cursor struct { parent, node Node; replacer replacerFunc; revisit bool }func (c *Cursor) Node() Node { return c.node }func (c *Cursor) Parent() Node { return c.parent }func (c *Cursor) Replace(newNode Node) { c.replacer(newNode, c.parent); c.node = newNode }func (c *Cursor) ReplaceAndRevisit(newNode Node) { /* replace, then re-walk the new node */ }The generated per-type walker for RangeVar shows the clever part — the replacer closure:
func (a *application) rewriteRefOfRangeVar(parent Node, node *RangeVar, replacer replacerFunc) bool { if node == nil { return true } if a.pre != nil { a.cur.replacer = replacer a.cur.parent = parent a.cur.node = node kontinue := !a.pre(&a.cur) if a.cur.revisit { a.cur.revisit = false; return a.rewriteNode(parent, a.cur.node, replacer) } if kontinue { return true } } if !a.rewriteRefOfAlias(node, node.Alias, func(newNode, parent Node) { parent.(*RangeVar).Alias = newNode.(*Alias) // typed write-back into the parent }) { return false } if a.post != nil { /* call a.post(&a.cur) */ } return true}When you Replace the Alias child, the visitor needs to write the new node back into its parent’s correct field. It can’t do that generically, so the generator emits a closure capturing the exact field and the exact concrete types:
parent.(*RangeVar).Alias = newNode.(*Alias)How to add a new AST node type
Section titled “How to add a new AST node type”Walking the full loop is the fastest way to see how the four layers fit together. The exact files to touch, in order:
- Define the struct in an
ast/*.gofile, embeddingBaseNode:go/common/parser/ast/your_file.go type MyNewStmt struct {BaseNodeTarget *RangeVarVerbose bool} - Add a tag to the
NodeTagenum inast/nodes.go(for exampleT_MyNewStmt) and acasein theNodeTag.String()switch so debug output is readable. - Write a
New*constructor that setsBaseNode{Tag: T_MyNewStmt}(followNewRangeVar). - Implement
SqlString()— or deparsing this node will panic at runtime (see theBaseNode.SqlString()gotcha above). - Add grammar rule(s) in
postgres.ywhose action builds`&ast.MyNewStmt{...}`. - Regenerate and commit. Run the build, then commit the regenerated
postgres.go,ast_clone.go, andast_rewrite.go.
After step 6 you will see brand-new CloneRefOfMyNewStmt and rewriteRefOfMyNewStmt functions appear — because findImplementations now sees your type implementing Node.
The regen workflow
Section titled “The regen workflow”This is a large project, so it wraps codegen and the build behind make targets rather than asking you to invoke each generator by hand:
make parser # go generate over the parser package: runs goyacc AND asthelpergenmake build-all # proto + parser + metrics + build (the complete command)make validate-generated-files # what CI runs: clean build-all, then fail if git is dirtyThe CI gate is real — it rebuilds everything from source and fails if anything regenerated differently from what’s committed:
validate-generated-files: clean build-all ## Validate that generated files match source. go mod tidy MODIFIED_FILES=$$(git status --porcelain | grep "^ M" | awk '{print $$2}') ; \ if [ -n "$$MODIFIED_FILES" ]; then ... exit 1; fiA go:generate directive on the parser package only runs goyacc, which can mislead you into thinking AST helpers aren’t regenerated. They are: make parser runs go generate recursively, so it also hits the AST package’s directive and runs asthelpergen. Either way, prefer make build-all — it’s the complete command and it’s exactly what validate-generated-files runs, so it’s the one that guarantees CI will pass. (The generator’s --verify mode regenerates in memory and byte-compares against the on-disk files, which is how CI detects an un-regenerated AST change.)
Checkpoints
Section titled “Checkpoints”Where is the token constant SELECT actually defined, and why isn’t it in tokens.go?
In the generated go/common/parser/postgres.go, emitted by goyacc from postgres.y’s %token declarations. tokens.go only defines INVALID = -1 and EOF = 0 plus the type TokenType = int alias, because the authoritative token constants are owned by the parser, not the lexer. ASCII chars 32 to 126 are their own values.In CloneRefOfRangeVar, why does Alias get a recursive CloneRefOfAlias call but RelName does not?
out := *n is a shallow struct copy. For value-typed fields like RelName (string), Inh (bool), RelPersistence (rune) that copy is a deep copy. But Alias is a *Alias pointer; the struct copy only duplicates the pointer, still aimed at the original Alias. To make the clone independent, the generator overwrites it with CloneRefOfAlias(n.Alias), which is nil-safe.You add a new node struct but forget to embed BaseNode. What happens, and when do you find out?
The type won’t satisfy the five-method Node interface, so findImplementations (via types.Implements) won’t pick it up and no CloneRefOf/rewriteRefOf is generated for it. It may still compile if nothing references it as a Node, but it gets zero clone/rewrite support — a silent gap discovered only when something tries to clone or walk it.Which command should you run after changing AST node types, and what CI job enforces it?
make build-all, then commit the regenerated postgres.go, ast_clone.go, and ast_rewrite.go. The validate-generated-files job runs clean build-all and fails if git status shows any modified files — that is, if you forgot to regenerate or commit.Exercises
Section titled “Exercises”-
Trace one query through the files. Take the SQL
SELECT 1 FROM only foo.bar. Starting atParseSQL, follow it intoLexer.Lex/NextTokenproducing tokens (usekeywords.goto find theSELECT/FROM/ONLYtoken types), then into the grammar rulesqualified_name(builds the base*ast.RangeVarwithSchemaName="foo",RelName="bar") andextended_relation_expr’sONLYform (which flipsInh=false). Write down which file performed each transformation, and confirm the resultingRangeVar.SqlString()reproduces theONLY foo.barform. -
Read the two generated halves side by side. Open
CloneRefOfRangeVar(ast_clone.go) andrewriteRefOfRangeVar(ast_rewrite.go) next to theRangeVarstruct (statements.go). Map every line of each function to a struct field, and explain in one sentence why onlyAliasneeds special handling in both generated functions. -
Round-trip an AST. Pick a query from
go/common/parser/testdata/select_cases.json, callParseSQLon it (model it onparse_test.go), callSqlString()on the result, thenParseSQLthe deparsed string again. Identify which node’sSqlString()controls the formatting of one clause (for example theFROMclause). For clone-independence, seeclone_test.go(TestCloneDeepCopy): modify a clone and assert the original is unchanged. -
Follow type discovery in the generator. In
go/tools/asthelpergen/clone_gen.go, readreadValueOfTypeand explain how a struct field of type*SomeNodecausesSomeNodeto be enqueued viaspi.addType(t)and aCloneRefOfSomeNode(...)call to be emitted — that is, how the generator reaches the full transitive set of node types starting from just theNodeinterface.
The PostgreSQL wire protocol & sqltypes — where parsed queries and their results meet the wire format on the way back to the client.
See also
Section titled “See also”- gRPC & protobuf — the other big “generated, DO NOT EDIT, must be committed” subsystem; compare protoc to goyacc/asthelpergen.
- Service anatomy — where parsed queries flow next in the gateway/pooler query path.
- Interfaces & composition — the
Nodeinterface,BaseNodeembedding, type switches inCloneNode, and the interface-satisfaction that asthelpergen relies on. - Types, structs & methods — struct embedding, method promotion, the
New*constructor idiom. - Generics — the contrast: this code uses codegen instead of generics; the “Why codegen” note above explains why.
- Sync & the memory model — the
sync.Poolof parsers inParseSQL. - Testing —
clone_test.goandparse_test.gotable-driven JSON-fixture patterns. - Modules, deps & codegen —
make parser/make build-all/validate-generated-files. - Glossary — lexer, token, goyacc/LALR, AST, NodeTag, visitor/Rewrite, deparse.
- Idioms & gotchas — DO-NOT-EDIT generated files; commit-regenerated-output rule.
- Orientation — where the parser sits in the overall map.