# Query Compilation 101

## Prerequisite videos

* [Materialized performance 101](https://drive.google.com/file/d/1BlCFHVJsi6-YfQQpPMaWOhDOvaQDdcWV/view?usp=sharing)
  (skip forward to 5:13)
* [Materialized internals 101](https://drive.google.com/file/d/1_SlM-zQR2FifNMeECnFTnwRcTc7zxRuc/view).
* [Introduction to one-off queries](https://drive.google.com/file/d/1LsyMY1OMmDS7uQS6cT6IFmdROGiPB4Im/view?usp=sharing).
* [Materialize Decorrelation explained in Jamie Brandon’s Blog](https://www.scattered-thoughts.net/writing/materialize-decorrelation/).

## Current Compiler Pipeline

Representations:

* `SQL` — source language
* [`AST`](https://github.com/MaterializeInc/materialize/blob/main/src/sql/src/plan.rs) — a parsed version of a SQL query.
* [`HIR`](https://github.com/MaterializeInc/materialize/blob/main/src/sql/src/plan/expr.rs) — high-level intermediate representation.
* [`MIR`](https://github.com/MaterializeInc/materialize/blob/main/src/expr/src/relation/mod.rs) — mid-level intermediate representation.
* [`LIR`](https://github.com/MaterializeInc/materialize/blob/main/src/compute-client/src/plan/mod.rs) — low-level intermediate representation.
* `TDO` — target language (timely & differential operators).

Transformations in the compile-time lifecycle of a dataflow.

* [`SQL ⇒ AST`](https://github.com/materializeinc/materialize/blob/main/src/sql-parser/src/parser.rs#L55).
    * Parsing the SQL query.
* [`AST ⇒ AST`](https://github.com/MaterializeInc/materialize/blob/main/src/adapter/src/coord.rs#L1876)
    * [Resolving names against the catalog.](https://github.com/MaterializeInc/materialize/blob/main/src/sql/src/names.rs#L1035-L1053)
        * [`CatalogItemType`](https://github.com/MaterializeInc/materialize/blob/main/src/sql/src/catalog.rs#L336)
            lists the kinds of objects that can be resolved against the catalog.
* [`AST ⇒ HIR`](https://github.com/MaterializeInc/materialize/blob/main/src/sql/src/plan/query.rs#L90-L129).
    * [Resolving column references, column aliases, and table aliases](https://github.com/MaterializeInc/materialize/blob/main/src/sql/src/plan/scope.rs)
    * If the SQL query is a one-off, the outermost `TopK` is converted to a
      RowSetFinishing at this point.
    * `EXPLAIN RAW PLAN` returns the result of transformations up to this point.
* [`HIR ⇒ HIR`](https://github.com/MaterializeInc/materialize/blob/main/src/sql/src/plan/lowering.rs#L149-L150)
    * Predecorrelation rewrites:
        * [Split out subquery conditions out as a separate predicate.](https://github.com/MaterializeInc/materialize/blob/main/src/sql/src/plan/transform_expr.rs#L54)
        * [Try to rewrite other types of subqueries into EXISTS subqueries.](https://github.com/MaterializeInc/materialize/blob/main/src/sql/src/plan/transform_expr.rs#L156)
* [`HIR ⇒ MIR`](https://github.com/MaterializeInc/materialize/blob/main/src/sql/src/plan/lowering.rs).
    * Decorrelation:
        * Correlated queries are rewritten as graphs with join and distinct.
    * Lowering — express SQL-specific concepts as dataflow sub-graphs:
        * Outer joins are decomposed into multiple inner joins ([see README.md](https://github.com/aalexandrov/mzt-repos/blob/main/simplify_outer_joins/README.md)).
        * Machinery for introducing defaults in empty global aggregates.
        * Machinery for introducing errors for `SELECT` subqueries with more than one return value.
    * `EXPLAIN DECORRELATED PLAN` returns the result of transformations up to this point.
* [`MIR ⇒ MIR`](https://github.com/MaterializeInc/materialize/blob/main/src/transform).
    * [If the query is a view
      definition](https://github.com/MaterializeInc/materialize/blob/main/src/adapter/src/catalog.rs#L3325),
      run per-view logical optimizations against the SQL query. The catalog
      stores the result of transformations up to this point.
    * [Construct a dataflow for the query](https://github.com/MaterializeInc/materialize/blob/main/src/adapter/src/coord/dataflows.rs):
        * If the query depends on not-materialized views, the definitions of the
          not-materialized views get inlined.
        * For each materialized view that a query depends on, import all of its
          materializations. (This corresponds to all indexes on that view, which
          you can see if you call `SHOW INDEXES IN <view>`).
    * Run optimizations against the dataflow:
        * [Per-view logical](https://github.com/MaterializeInc/materialize/blob/main/src/transform/src/lib.rs#L282-L337).
        * [Cross-view logical](https://github.com/MaterializeInc/materialize/blob/main/src/transform/src/dataflow.rs#L31-L60).
            * Propagating source information up: optimize_dataflow_monotonic
            * Pushing optimizations down to sources: `LinearOperators`
                * [CSV decoding](https://github.com/MaterializeInc/materialize/blob/main/src/storage/src/decode/csv.rs)
                * [Upsert](https://github.com/MaterializeInc/materialize/blob/main/src/storage/src/render/upsert.rs)
            * View inlining.
            * Theoretically supports producing more than one index/sink in the same dataflow.
        * [Per-view logical](https://github.com/MaterializeInc/materialize/blob/main/src/transform/src/lib.rs#L281-L337) (second round).
        * [Per-view physical](https://github.com/MaterializeInc/materialize/blob/main/src/transform/src/lib.rs#L345-L367).
    * `EXPLAIN OPTIMIZED PLAN` returns the result of transformations up to this point.
* [`MIR ⇒ LIR`](https://github.com/MaterializeInc/materialize/blob/main/src/compute-client/src/plan/mod.rs#L882-L897).
    * Decisions are made regarding rendering.
        * All aggregations are created equal in MIR, but from the rendering perspective, [aggregations are evaluated differently according to what data needs to be kept to recalculate the aggregation after receiving a diff](https://github.com/MaterializeInc/materialize/blob/main/src/compute-client/src/plan/reduce.rs). A pictorial version can be found [here](https://github.com/MaterializeInc/materialize/blob/main/doc/developer/arrangements.md).
        * [Joins are broken down into multiple stages](https://github.com/MaterializeInc/materialize/blob/main/src/compute-client/src/plan/join/linear_join.rs), and filters + projects run between each stage to shrink the intermediate result.
    * RelationTypes (column types + unique keys) are discarded since we do no key or type of validation at render time.
    * `EXPLAIN PHYSICAL PLAN` returns the result of transformations up to this point.
* [`LIR ⇒ TDO`](https://github.com/MaterializeInc/materialize/blob/main/src/compute/src/render/mod.rs).

For a one-off query, we run all the transformations until the MIR stage. Then we
determine whether we need to serve the query on the "slow path", that is,
creating a temporary dataflow and then deleting it. If we don't need to serve
the query on the "slow path", then we can skip the `MIR ⇒ LIR` and the `LIR ⇒ TDO` steps.
Existing "fast paths" include:
* [reading from an existing dataflow.](https://github.com/MaterializeInc/materialize/blob/main/src/compute/src/compute_state.rs#L704)
* [the adapter itself spitting out a constant set of rows.](https://github.com/MaterializeInc/materialize/blob/main/src/adapter/src/coord/peek.rs#L213)

Currently, the optimization team is mostly concerned with the `HIR ⇒ MIR` and `MIR ⇒ MIR` stages.

## Testing

### Integration tests

* [Sqllogictest](https://github.com/MaterializeInc/materialize/blob/main/doc/developer/sqllogictest.md)
    * [Philip’s RQG tests](https://docs.google.com/presentation/d/1PvUzdeblYwLIWMpBLCtKY1Gys4L92jmr7fTdo4zE2g4/edit) will be in this format.
        * Add Philip to any PR where query plans may change.
    * A PR can be merged if it passes Fast SLT.
    * A PR does not need to pass Full SLT tests (`test/sqllogictest/sqlite`) to be merged.
        * Full SLT tests take 2-3 hours.
        * You can manually initiate full SLT tests on your branch [here](https://buildkite.com/materialize/slt).
* [Testdrive](https://github.com/MaterializeInc/materialize/blob/main/doc/developer/testdrive.md)
    * We generally do not use testdrive except to see [linear operators in action](https://github.com/MaterializeInc/materialize/blob/main/test/testdrive/source-linear-operators.td).

### Unit tests

* [Datadriven](https://github.com/MaterializeInc/materialize/blob/main/doc/developer/guide-testing.md#datadriven)
    * [Transform unit tests ](https://github.com/MaterializeInc/materialize/tree/main/src/transform)currently allow:
        * testing each transformation independently of the others.
        * Printing out which block of transformations change the plan and how.
    * [Unit tests in the mz-expr](https://github.com/MaterializeInc/materialize/tree/main/src/expr/tests) crate currently allow:
        * Testing the simplifying MirScalarExpr, predicates, join equivalences.
        * Testing MapFilterProject.
    * [There is a DSL to specifying arbitrary MIRs.](https://github.com/MaterializeInc/materialize/tree/main/src/expr-test-util)
    * [DSL to specify arbitrary enums and structs.](https://github.com/MaterializeInc/materialize/tree/main/src/lowertest)

### Performance tests

* [TPCH](https://materializeinc.slack.com/archives/C01BE3RN82F/p1611161615021000)

## Tooling

* [mzt](https://github.com/aalexandrov/mzt) — can be used to create repositories of plans and write up a markdown that explains something based on those plans (see Alexander’s [mzt-repos](https://github.com/aalexandrov/mzt-repos) for example).