# Precise column name tracking - Associated: https://github.com/MaterializeInc/database-issues/issues/8889 Our `EXPLAIN` output typically uses column numbers. It should as much as possible use column names. ## The Problem Our `EXPLAIN WITH (humanized_expressions)` (the default) propagates names upwards from tables---which means we miss many opportunities for column names. Consider this simple example: ```sql CREATE TABLE t(x INT, y TEXT); EXPLAIN SELECT x + 1 AS succ_x, y FROM t; ``` Notice that the name `y` is kept in the _named_ column reference `#1{y}`, but `succ_x` is lost in the column reference `#2`. ``` Explained Query: Project (#2, #1{y}) // { arity: 2 } Map ((#0{x} + 1)) // { arity: 3 } ReadStorage materialize.public.t // { arity: 2 } Source materialize.public.t Target cluster: quickstart ``` ## Success Criteria For an MIR expression `q`, define: ``` NAMESCORE(q) = |# of named column references| ------------------------------ |# of column references| ``` i.e., `NAMESCORE(q)` is the percentage of column references which we can attach a name to. We should expect to see a serious improvement in `NAMESCORE` across both our tests and customer queries. ## Out of Scope We should _not_ expect `NAMESCORE(q) = 1.0` for all queries `q`. In particular, window functions may significantly complicate our ability to keep track of names. We should not reimplement all of the `Analysis` framework for LIR. ## Solution Proposal We will need to keep track of more column name information. We propose doing so in a few positions: 1. Keep track of column names all `ScalarExpr` constructors. 2. Keep track of column names on `usize` column references in `*RelationExpr`. In both cases, we will work with `Option>`, being careful to generate a single reference for each string, i.e., we will do lightweight, reference-counted string internment to avoid tons of extra allocations. ### Column names on `{Hir,Mir}ScalarExpr::Column` Extend `HirScalarExpr` and `MirScalarExpr` to optionally keep track of column names, i.e.: ```rust pub enum HirScalarExpr { Column(ColumnRef, Option>), CallUnary { func: UnaryFunc, expr: Box, name: Option>>, } // ... } impl HirScalarExpr { pub fn name(&self) -> Option> { match self { HirScalarExpr::Column(_, name) | HirScalrExpr::CallUnary { name, .. } | ... => name.clone() } } } ``` Keeping track of column names on _every_ expression makes it so we can remember names for SQL CTEs. Lowering would change to generate scalar expressions with names for: - column references - `SELECT expr AS name, ...` - `CREATE VIEW v(name, ...) AS ...` - `WITH t(name, ...), ...` - `WITH MUTUALLY RECURSIVE t(name type, ...), ...)` ### Column names in `{Hir,Mir}RelationExpr::{Project,Reduce,TopK}` Extend `HirScalarExpr` and `MirScalarExpr` to optionally keep track of the column name. ```rust pub enum HirRelationExpr { Project { input: Box, outputs: Vec<(usize, Option>)>, }, // ... } pub enum MirRelationExpr { Project { input: Box, outputs: Vec<(usize, Option>)>, }, // ... } ``` The `Project` is the most critical, as projections appear at the top of nearly every `SELECT` query---the missing `succ_x` in the example above appears in a `Project`, for example. We will need to similarly track names in `Mfp`s, which will involve updating `MapFilterProject::extract_*` methods. ### The new `EXPLAIN AS TEXT` format requires names for LIR https://github.com/MaterializeInc/materialize/pull/31643 proposes a new `EXPLAIN AS TEXT` format based on LIR. We will very much want good column names in the new format, which means keeping track of names at the LIR level, too. We currently compute arity, names, and other metadata using the `Analysis` bottom-up framework. But `Analysis` is only defined for MIR. We should build simple, direct versions of these analyses for LIR---and we can generalize it to a lower-level `Analysis` framework if we later need it. We may be able to use the existing LIR abstract interpretation framework. ## Alternatives Rather than attaching column names on every `ScalarExpr` node, we could just record them on `*ScalarExpr::Column`. Similarly, rather than attaching column names to to `*RelationExpr` nodes, an `EXPLAIN` plan has sufficient information to reconstruct the names of the outermost projection, as the original SQL select (with column name information) should be present: ```rust ExplainPlanPlan.explainee == Explainee::Statement( ExplaineeStatement::Select { plan: SelectPlan { select: Option>>, .. }, .. }, ) ``` Neither of these approach doesn't scale naturally to having multiple views or CTEs, though. ## Open questions Is it possible to get a good `NAMESCORE` without changing `{Hir,Mir}RelationExpr`?