WITH MUTUALLY RECURSIVE
The `WITH MUTUALLY RECURSIVE` (WMR) implementation that was shipped as part of the previous design doc[^wmr] has some loose ends.
The aim of this design doc is to identify those and come up with a rollout plan for WMR to production environments.
Stabilizing support for `WITH MUTUALLY RECURSIVE` is one of the technical bets that we are making in FY2024.
Adding first-class support for recursive queries will:
We should fill in the implementation gaps that were intentionally left as TODOs during MaterializeInc/database-issues#3264 and bring the feature into a shape where it can be gradually rolled out behind a feature toggle and ultimately stabilized.
The original WMR design doc[^wmr] and the technical article[^wmr_article] on our website provide a good explanation of the syntax and semantics of WMR. From the implementation point of view, adding WMR support requires extensions across every stage of the query lifecycle. The original design doc laid out a plan for adding support in each of the following layers.
The outstanding tracks of work can be summarized as follows:
Progress on all tracks can happen concurrently.
Mostly, the actual work revolves around enumerating and addressing unimplemented code blocks where the corresponding part of our compilation pipeline needs to handle WMR fragments.
- `hir_to_mir` lowering: We might have to check again the changes from MaterializeInc/materialize#16561. Mostly, my concern is what happens in the presence of:
- `mir` transformations: MaterializeInc/materialize#16561 extended `MirRelationExpr` with a new `LetRec` variant.
The `Transform` trait was extended with a `recursion_safe` method which returns `true` iff the `Transform` implementation is claiming to operate correctly in the presence of `LetRec` nodes.
At the moment, the optimizer skips `Transform` implementations that are not `recursion_safe`.
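To make the skipping mechanism concrete, here is a schematic Python model of the behavior described above. Note that the real implementation is in Rust, and all names here (`Transform.recursion_safe`, `optimize`, `FoldConstants`) are illustrative stand-ins rather than the actual API:

```python
# Schematic model: transforms declare whether they can handle plans that
# contain LetRec nodes; the optimizer skips the ones that cannot.
class Transform:
    def recursion_safe(self) -> bool:
        return False  # conservative default: not safe around LetRec

    def apply(self, plan):
        raise NotImplementedError

class FoldConstants(Transform):
    # A purely local rewrite, so it is safe regardless of enclosing
    # LetRec bindings (hypothetical example).
    def recursion_safe(self) -> bool:
        return True

    def apply(self, plan):
        return plan  # actual rewriting elided

def optimize(plan, transforms, plan_contains_letrec):
    for t in transforms:
        if plan_contains_letrec and not t.recursion_safe():
            continue  # skip transforms that cannot yet handle LetRec
        plan = t.apply(plan)
    return plan
```

In this model, running the pipeline on a recursive plan silently skips the non-safe transforms, which is why the rollout plan below focuses on shrinking the set of skipped transforms.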
The following table summarizes work that needs to be done for each transform.
Work estimates for each transform are given in relative t-shirt sizes.
The `?` suffix denotes uncertainty of absolute size 1 (`M?` can be `L` or `S`).
The proposed implementation plan is summarized after the table.
We have 36 `Transform` implementations, of which 3 are currently marked as `recursion_safe`.
All but 16 can be trivially marked as recursion safe (done in MaterializeInc/database-issues#5317) because they represent local transformations that do not depend on context derived from the `Let` bindings currently in scope.
From the remaining 16, based on an initial analysis it seems that:
- … (marked with `M?`),
- … depend on `Let`-based context and need case-by-case reasoning (marked with `L?`).

For most non-trivial transforms, we have multiple solutions at our disposal:
- … `LetRec` nodes. Transforms using this solution are marked with `solution = advanced` in the table above. For the basic transforms the advanced solution is sketched in the linked issue in case we want to improve them as future work.
- … `Transform` code if we implement MaterializeInc/database-issues#5343. Those are marked with `solution = basic` and the corresponding issue as depending on MaterializeInc/database-issues#5487.

This should be mostly handled by MaterializeInc/materialize#17705. There is also an additional feature request for an optional max recursion limit in MaterializeInc/database-issues#4869, which will affect how plans are rendered. We might have to add more tests for that (see Testing and observability).
The WMR feature is currently only enabled in `--unsafe-mode`.
As part of the enclosing epic, we will introduce a dedicated `with_mutually_recursive` feature flag.
The feature will first be made available on all `staging` environments (alpha testers) and then rolled out to `production` environments for "public preview".
The following aspects need special attention:
- … `ERROR AT RECURSION LIMIT 1000`, if the user wants an additional guardrail.

To validate (1), I suggest to:
Validating (2) is an open question.
We plan to build up confidence in the updated query optimization pipeline by adding new tests and revisiting existing tests. Test scenarios can be categorized along two dimensions:
Unit tests. We aim to have one unit test per transform. We can invest time proportional to the complexity of the transform to ensure that each transform is correct.
Integration tests.
We will add test scenarios inspired by the use cases of our prospects as end-to-end `*.slt` tests.
We will also add at least one long-running `mzcompose` test that runs as part of our nightly tests and is used when qualifying future releases.
As those tests will include expected results, it would be helpful to have a reference external iteration driver for the semantics proposed in the original design doc[^wmr].
That way we can cross-check the results of the reference and the internal implementation of WMR support and ensure that both produce equal results.
We can implement such a driver in Python and integrate it into our `mzcompose` tests.
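A minimal sketch of what such a reference driver could look like is below. This is an assumption-laden illustration, not an agreed design: the function name `wmr_fixpoint`, the representation of bindings as row-set-valued functions, and the limit handling are all hypothetical. It re-evaluates all bindings in declaration order until a fixed point is reached, with an optional iteration cap mirroring the proposed recursion-limit guardrail:

```python
# Hypothetical reference driver for WMR semantics: repeatedly re-evaluate
# all mutually recursive bindings in order until a fixed point is reached.
def wmr_fixpoint(bindings, max_iterations=None):
    """bindings: ordered dict of name -> function(env) returning a frozenset of rows."""
    env = {name: frozenset() for name in bindings}
    iterations = 0
    while True:
        new_env = dict(env)
        for name, rule in bindings.items():
            # Each binding sees this iteration's values for earlier bindings
            # and the previous iteration's values for later ones.
            new_env[name] = rule(new_env)
        iterations += 1
        if new_env == env:
            return env  # fixed point reached
        if max_iterations is not None and iterations >= max_iterations:
            raise RuntimeError(f"recursion limit {max_iterations} reached")
        env = new_env

# Transitive closure over a tiny edge set, mirroring the SQL example
# later in this document.
edges = {("a", "b"), ("b", "c"), ("c", "d")}
bindings = {
    "base": lambda env: frozenset(edges),
    "reach": lambda env: env["base"]
        | frozenset((s1, t2)
                    for (s1, t1) in env["reach"]
                    for (s2, t2) in env["reach"] if t1 == s2),
}
result = wmr_fixpoint(bindings)
```

Comparing the row sets produced by a driver like this against Materialize's output for the same bindings would give the cross-check described above.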
End-to-end experiments.
We will perform a set of end-to-end experiments (available in the `letrec-bench` GitHub repository) to get a sense of the resource consumption and stability of the feature under a sustained load of concurrently occurring updates.
Synthetic tests. (punted as follow-up work) The best synthetic use case that we have identified so far seems to be the LDBC social network benchmark[^ldbc]. Within the scope of the dedicated epic (MaterializeInc/database-issues#5110), we will select a subset of the work in order to bootstrap a testing environment that consists of (a) LDBC data + updates, and (b) several of the recursive queries defined by the benchmark. We can use the choke-point characterization of each query to figure out the most representative subset.
Use-case driven. It is unclear how useful these will be as load tests, as we don't have the resources to write realistic data generators to replicate the domain of specific customers. However, we can try to map some of the customer use cases to the LDBC dataset. Also, we might need to be careful about the specific problems we try to solve and use to showcase WMR. The power of incremental recursive computation only shines if the data dependency that is carried across iterations is somewhat bounded. Intuitively, this means an algorithm that does something like dynamic programming or reachability on graphs with some locality properties might handle small deltas in its input better than something like gradient descent.
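The bounded-dependency intuition can be illustrated with a toy example (purely illustrative, not a benchmark): on a chain-shaped graph, inserting one edge changes only the transitive-closure pairs that involve the newly reachable node, so the delta stays small relative to the full result.

```python
# Toy illustration of bounded deltas: how much of a transitive-closure
# result changes when a single edge is added to a chain-shaped graph.
def closure(edges):
    reach = set(edges)
    while True:
        new = reach | {(a, d) for (a, b) in reach for (c, d) in reach if b == c}
        if new == reach:
            return reach
        reach = new

chain = {(i, i + 1) for i in range(10)}      # path 0 -> 1 -> ... -> 10
before = closure(chain)                      # all 55 pairs (i, j) with i < j
after = closure(chain | {(10, 11)})          # append one edge at the end
delta = after - before                       # only the 11 new pairs (i, 11)
```

An update with this locality property is cheap to maintain incrementally; a computation where every output depends on every input (gradient-descent-style) would not show such a small delta.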
Sourced from elsewhere. (punted as follow-up work)
We can check what tests Postgres has for their `WITH RECURSIVE` support.
We plan to roll the implementation out behind a `with_mutually_recursive` feature flag.
It should be OK to turn the feature flag on for individual environments at all times.
It should be OK to turn the feature flag off for customers as long as they don't have catalog objects that use the feature.
The feature will go through an `alpha`/`beta`/`stable` lifecycle.
Once we have reworked WMR to be behind a dedicated feature flag, we will enable this flag for all `staging` environments, thereby entering the alpha stage.
The feature will be promoted to `beta` (public preview) when the following conditions are met:
In the `beta` testing phase, we will work with selected customers / prospects, who have previously explicitly voiced their interest in the feature and have a clear use case to demonstrate its value.
We will remain in close contact with those customers and treat their use cases as proof-of-concept in order to iron out potential operational and stability issues.
Once we have established the above and have built up confidence about the optimizer and runtime stability of recursive dataflows running in production, we will open the feature to everybody. This needs to be coordinated with the GTM team, as most probably we will want to advertise this accordingly.
I think the main question here is: Why should we not do this at the moment?
I can think of two reasons:

- …
- … `production` environment.

I think that we can re-evaluate these points as part of an "end of epic" retrospective.
Is `WITH MUTUALLY RECURSIVE` known to play well with incremental computations? See the discussion of use-case driven tests in By test scenario.

We are still lacking operational observability (tracked in MaterializeInc/database-issues#5271).
At the very least, we can export a Prometheus metric that tracks the number of indexed or materialized views that have recursive CTEs.
Once we have anonymized query logging, we can get some deeper insights, which would be useful for product analytics.
Within the scope of MaterializeInc/database-issues#4922 we only provided the basic case for most non-trivial transforms. Issues marked in the above table with solution = basic represent opportunities for improvement.
UI/UX improvements:

- The `EXPLAIN` output of plans that have recursive queries does not work. We will need to revisit this if we ever decide to make this the default or we have people that use it on a daily basis. Tracked in MaterializeInc/database-issues#5631.

Due to time constraints, benchmarking of WMR based on LDBC has been punted in favor of a more limited benchmark available in the `letrec-bench` GitHub repository.
The tracking epic for this is MaterializeInc/database-issues#5110.
Improve query planning by implementing the TODO from the `plan_ctes` function.
This should be done only after investigating the impact of having an extra `Map` and `Project` on our optimization potential.
The transitive closure of `mz_internal.mz_object_dependencies` might be of interest to @umanwizard for MaterializeInc/materialize#17836.

```sql
with mutually recursive
    base(src text, tgt text) as (
        select object_id, referenced_object_id from mz_internal.mz_object_dependencies
    ),
    reach(src text, tgt text) as (
        select * from base
        union
        select r1.src, r2.tgt from reach r1 join reach r2 on r1.tgt = r2.src
    )
select * from reach;
```
Session windows can be defined in an easier way (see MaterializeInc/database-issues#2664). @sploiselle was kind enough to add a PR for a prototype of that function in MaterializeInc/materialize#18330.
One of our cluster `mzcompose` tests already uses `WITH MUTUALLY RECURSIVE` (see MaterializeInc/materialize#18295).
@parkerhendo wants to answer the following question (slack): "What sources rely on this particular connection?" Using data from `mz_sources` and `mz_object_dependencies`.
[^wmr]: Original `WITH MUTUALLY RECURSIVE` design doc
[^wmr_article]: Recursion in Materialize blog post
[^ldbc]: LDBC Social Network Benchmark (LDBC-SNB)