WITH MUTUALLY RECURSIVE
Progressively introduce support for WITH MUTUALLY RECURSIVE
common table expressions, as demonstrated in https://github.com/MaterializeInc/database-issues/issues/3264. This alternate form of WITH
requires column names and identifiers for each binding, but allows all bindings to reference each other bindings, and introduces support for recursion and mutual recursion.
Support for a correct implementation of WITH MUTUALLY RECURSIVE
.
There are likely several subgoals, corresponding to steps along the way. These include support in at least each of the following layers.
The one intentional restriction is to avoid support for nested mutually recursive fragments. These are not likely to be renderable in the foreseeable future, as they cannot necessarily be flattened to a single mutually recursive scope.
Support for SQL's WITH RECURSIVE
.
This is a tortured construct with surprising semantics.
Instead WITH MUTUALLY RECURSIVE
be easier to parse, plan, optimize, and render.
We should learn from this in ways that may inform whether we want to later support WITH RECURSIVE
.
Optimization for recursive queries.
There are several optimization patterns that recursive queries benefit from. For this design doc, we are only hoping to produce not-incorrect recursive queries.
Nested mutual recursion.
There should either be at most one WITH MUTUALLY RECURSIVE
, or appear so after query optimization.
We are not able to render nested mutually recursive fragements.
Much of this is taken from https://github.com/MaterializeInc/database-issues/issues/3264.
We can extend SQL's WITH
fragment with a new variant, not strictly more general, which looks like
WITH MUTUALLY RECURSIVE
name_1 (col1a type1a, col1b type1b, ... ) AS ( <SELECT STATEMENT 1> ),
name_2 (col2a type2a, col2b type2b, ... ) AS ( <SELECT STATEMENT 2> ),
name_3 (col3a type3a, col3b type3b, ... ) AS ( <SELECT STATEMENT 3> ),
...
This fragment binds multiple identifiers, must specify all column names and types, and all bound names are available in each SELECT
statement.
The value of each bound name in the following SELECT
block (as for a standard WITH
) is as if each were initially empty, and then each are updated in sequence using the most recent contents of each binding.
Mechanically, one could evaluate this as:
let mut bindings = HashMap::new();
for (name, _) in bindings.iter() {
bindings.insert(name, empty());
}
let mut changed = true;
while changed {
changed = false;
for (name, statement) in bindings {
let new_binding = statement.eval(&bindings);
if new_binding != binding[name] {
binding.insert(name, new_binding);
changed = true;
}
}
}
As an example, one can write "standard" WITH RECURSIVE
queries
WITH MUTUALLY RECURSIVE
reach (a int, b int) AS (
SELECT * FROM edges
UNION
SELECT edges.a, reach.b FROM edges, reach WHERE edges.b = reach.a
),
SELECT * FROM reach;
One can also write more complex non-linear, non-monotonic, mutual recursion
WITH MUTUALLY RECURSIVE
odd_path (a int, b int) AS (
SELECT * FROM edges
UNION
SELECT edges.a, even_path.b FROM edges, even_path WHERE edges.b = even_path.a
),
even_path (a int, b int) AS (
SELECT edges.a, odd_path.b FROM edges, odd_path WHERE edges.b = odd_path.a
UNION
-- This is optional, but demonstrates non-linear queries and reduces the "depth" of derivations.
SELECT path1.a, path2.b FROM odd_path path1, odd_path path2 WHERE path1.b = path2.b
),
SELECT * FROM reach;
The SELECT
blocks can be anything we support, except we are unlikely to support additional WITH MUTUALLY RECURSIVE
blocks in them for some time.
The sequence of goals have their own design considerations
SQL parsing
We must support the WITH MUTUALLY RECURSIVE
fragment and capture the column names and types.
We are in a position to allow and capture further constraint information, such as nullability and unique keys.
This happens by default with parse_columns
, though we could intentionally avoid accepting that information until we know what to do with it.
SQL name resolution
SQL name resolution currently introduces CTE bindings after resolving names in the CTE. We would need to generalize this to introduce the names before name resolving recursive CTEs.
SQL planning
SQL planning currently introduces CTE bindings (types mainly) after planning the CTE. We would need to generalize this to introduce the types before planning recursive CTEs.
HIR generalization
HIR has a single-binding Let
variant that cannot house mutually recursive bindings.
We could introduce a LetRec
or MultiLet
variant to support this, and even migrate other uses to it.
Lowering / decorrelation
It is not clear that lowering and decorrelation are complicated.
As existing single-binding Let
constructs, the value
and body
are each prefixed with correlated columns of the outer scope, and this should continue to be the case.
MIR generalization
MIR also requires a multi-binding Let
, which could be prototyped as LetRec
or MultiLet
.
MIR optimization corrections
All MIR optimizations are at risk of being incorrect with recursive Let
bindings.
Many of them propagate information about let bindings and must either a. start with conservative information (e.g. all columns required) or b. iterate to fixed point before applying their conclusions.
LIR generalization
LIR may want to remove Let
and move all bindings in to the sequence of dataflow objects to construct.
This is already roughly the same as having a multiple binding Let
block around the whole statement.
Rendering
Rendering will need a few modifications, but they have been prototyped and seen to work multiple times.
The main change is having each bindings result in a differential dataflow Variable
which is then used instead of the collection itself.
Once the definition of each object is complete, it is bound to its variable.
We have not previously prototyped sequenced recursion, where there are two recursive blocks in sequence. We can support this in the dataflow layer, but we would likely need to change our dataflow descriptions to accommodate it.
The only known alternative is SQL's WITH RECURSIVE
which is sufficiently fraught that we ought not implement it directly.
There are several distinctions it has from the proposed WITH MUTUALLY RECURSIVE
that it may help to visit.
SQL's WITH RECURSIVE
does not require column names and types.
In principle the names and types could be inferred.
In practice this appears to be a complicated task that would require substantial reconsideration.
Given that WITH MUTUALLY RECURSIVE
has no compatibility requirements, it seems reasonable to try out requiring this column information.
SQL's WITH RECURSIVE
requires each binding be used only once.
This is confusing and only enforced syntactically. One can introduce a CTE renaming a binding and then use that rebinding multiple times. In principle and practice, the requirement does not exist.
SQL's WITH RECURSIVE
requires each select statement have a "base case" (UNION
, UNION ALL
, etc with a non-recursive fragment).
This is not required for query planning, optimization, or execution. This may be helpful from a query admission point of view (should we allow you to run your query). We could likely introduce this constraint if we discover what it is valuable for.
There is an alternate evaluation order that updates all collections concurrently, rather than in sequence. This would mean that it takes "one step" for each value to propogate to each other binding, rather than be immediately available. This results in a different semantics, especially in the case of non-monotonic expressions (e.g. aggregation, retraction). However, it can always be "faked" in the sequential update model, by introducing "mirroring" collections at the end of the block and only using these in other bindings.
The concurrent update pattern also makes life unneccesarily hard for the optimization team. Ideally, all existing optimizations would apply now without changes, which is the case for sequential evaluation. However, with concurrent evaluation it would be incorrect to apply certain transformations:
WITH MUTUALLY RECURSIVE (CONCURRENT)
A (x int) as ... ,
B (x int) as SELECT * FROM A,
C (x int) as SELECT * FROM A EXCEPT ALL SELECT * FROM B
...
Ideally one could reason that C
is empty, which is the case in sequential evaluation but not in concurrent evaluation.
In concurrent evaluation, C
would contain version of A
minus the prior iterate of A
.
Despite working on a system that manages changing data, we've avoided until now the issue that the same name can be used for different versions of data.