Optimizer changes are tricky to implement and deploy in a robust and predictable manner. The main problem is that, most of the time, a change to our optimizer will not only improve the performance of some SQL queries (hopefully a majority), but also introduce performance regressions in others.
The expected plan regressions and improvements of an optimizer change can, in general, be identified by running random query workloads using one of the fuzzers available in our CI pipeline (RQG, SQLSmith, SQLancer). However, we currently lack the right tooling to determine the specific impact of such changes on our current production environments.
The latter is of particular practical importance when prototyping and validating possible optimizer changes behind a feature flag. There are two dimensions of assessing the quality of SQL queries in production:

1. Statically, by inspecting how optimized plans change under the new feature flag configuration.
2. Dynamically, by observing the runtime behavior of dataflows that were optimized under the new feature flag configuration.

We have recently made progress towards (1) with the `mzexplore` tool (MaterializeInc/materialize#22892) and with adding the ability to control optimizer feature flags using the `EXPLAIN WITH(...)` syntax (MaterializeInc/materialize#22744). There are some known issues with the current state of the code, but we have fixes for those that should be merged soon (pending review). The current document proposes tooling and workflows in aid of (2).
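For example, assuming boolean feature flags can be passed in the `EXPLAIN` option list like other `EXPLAIN` options, the plan-level impact of a flag can already be inspected per statement (the flag name below is just one example of an optimizer feature flag):

```sql
-- Plan under the current system defaults.
EXPLAIN SELECT l.x, r.y FROM l LEFT JOIN r ON l.x = r.x;

-- Plan with a feature flag override for this statement only.
EXPLAIN WITH (enable_new_outer_join_lowering = true)
SELECT l.x, r.y FROM l LEFT JOIN r ON l.x = r.x;
```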
Members of the compute team have agreed upon and developed the minimal amount of tooling and code infrastructure required to enable the evaluation of proposed optimizer changes in production.
The new feature has been used to assess the potential impact of the following optimizer changes:
Things that would be nice to have, but are intentionally left out of scope of the current proposal, are listed below with a brief motivation for each decision.
- Removing the need for `mz_system` access in order to do a runtime validation experiment. This doesn't seem urgent because we will always use `mzexplore` to first identify a small set of candidate dataflows. For those, we will in turn run a dedicated experiment together with a designated engineer who can temporarily request `mz_system` privileges with Teleport (similar to the lgalloc experiments).
- Adding `FEATURES` support for unmanaged clusters. This will require more code changes, and since the unmanaged cluster syntax is now deprecated, there is no need to additionally complicate this code.
- Changing `CLUSTER`-specific features with `ALTER CLUSTER`. For the same reasons as above: for experimentation purposes we only need short-lived clusters.

The suggested high-level workflow for assessing the potential impact of an optimizer feature will be:
1. Use `bin/mzexplore` to identify dataflows and clusters that might be affected by an optimizer change gated behind a feature flag.
2. Run targeted experiments in the affected customer environments in order to validate the expected effect on planning and runtime behavior.

The suggested workflow for running an experiment on a specific customer environment will be:
1. Connect with `mz_system` privileges through Teleport.
2. Create an experiment cluster, deriving it from the original cluster's `CREATE CLUSTER` definition extended with the feature flags under test.
3. Create an `UNBILLED` replica for that cluster.
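A hypothetical session for steps (2) and (3) might look as follows (cluster names, sizes, flag names, and the unbilled replica options are all illustrative; the `FEATURES` clause is the extension proposed below):

```sql
-- Step 2: look up the definition of the cluster under test.
SELECT name, size, replication_factor
FROM mz_catalog.mz_clusters
WHERE name = 'prod';

-- Step 2 (cont.): re-create it under a new name, with the feature
-- flags under test enabled.
CREATE CLUSTER prod_experiment (SIZE = '400cc', REPLICATION FACTOR = 0)
FEATURES (enable_eager_delta_joins = true);

-- Step 3: attach a replica that is not billed to the customer
-- (the unbilled replica option syntax is shown schematically).
CREATE CLUSTER REPLICA prod_experiment.r1 SIZE = '400cc', BILLED AS 'unbilled';
```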
In order to facilitate this workflow, we propose the following changes (discussed in detail below):

1. Extensions to the `CREATE CLUSTER` syntax.
2. Changes to the `optimize` API.

### `CREATE CLUSTER` syntax

Extend the `CREATE CLUSTER` syntax
for managed cluster plans as follows:
```sql
CREATE CLUSTER <name> ... FEATURES (...)
```
The newly added `FEATURES` clause will be visible only to `mz_system`. We will extend `CreateClusterManagedPlan` and `CreateClusterManaged` with a new `ClusterFeatures` struct that models the feature flag configuration that can be overridden on a per-cluster level.
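Concretely, an invocation might look like this (the flag name is illustrative; any flag modeled by `ClusterFeatures` could appear in the list):

```sql
-- Only accepted when connected as mz_system; regular users cannot
-- see or set the FEATURES clause.
CREATE CLUSTER scratch (SIZE = '100cc')
FEATURES (enable_new_outer_join_lowering = true);
```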
### Changes to the `optimize` API

Here we can benefit from the unified optimizer interface introduced with MaterializeInc/materialize#20569. As part of the associated changes we introduced an `mz_adapter::optimize::OptimizerConfig` struct that can already be configured in a layered way:

1. Defaults are bound from the `SystemVars`.
2. `EXPLAIN`-specific overrides are bound from the `ExplainContext`, which is available when the `Coordinator` methods that drive the optimization process are initiated from `sequence_explain_plan`.

Since all `Optimizer::new(...)` constructor calls in the `Coordinator` happen at a time when the target cluster for the optimized statement is already resolved, we can just add a new layer for cluster-specific overrides between (1) and (2). A sketch of the proposed design can be found in the MVP draft PR[^3].
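Assuming the usual rule that more specific layers win, the resulting behavior can be illustrated as follows (the flag name is illustrative):

```sql
-- Layer (1): the system-wide default, bound from SystemVars.
ALTER SYSTEM SET enable_new_outer_join_lowering = false;

-- New layer: a cluster-specific override; dataflows installed on
-- this cluster are optimized with the flag turned on.
CREATE CLUSTER scratch (SIZE = '100cc')
FEATURES (enable_new_outer_join_lowering = true);

SET cluster = scratch;

-- Layer (2): an EXPLAIN-specific override that wins over both of
-- the above for this single statement.
EXPLAIN WITH (enable_new_outer_join_lowering = false)
SELECT l.x FROM l LEFT JOIN r ON l.x = r.x;
```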
Nothing is done yet, but once we agree on the SQL extensions, the changes to get something working end-to-end should be quick.
## Alternatives

There is prior art for this in PostgreSQL: you can set (most) system parameters at the database, role, or session level, and more specific settings take precedence:

`system < database < role < session`
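For reference, in stock PostgreSQL the same parameter can be set at each of these levels, and a session sees the most specific applicable setting:

```sql
-- PostgreSQL: one parameter set at increasingly specific scopes.
ALTER SYSTEM SET work_mem = '4MB';           -- system
ALTER DATABASE app SET work_mem = '16MB';    -- database
ALTER ROLE analyst SET work_mem = '64MB';    -- role
SET work_mem = '256MB';                      -- session
```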
If we get more use cases for such layering, we can invest the time to teach the `SystemVars` about cluster-specific parameters. Once we do that, we should be able to enable feature flags on a per-cluster basis through LaunchDarkly.
The approach is rejected (for now) because it requires more substantial changes to the current state of our LaunchDarkly setup in `main`. At the moment we pull parameters from LaunchDarkly in a loop using a fixed `ld::Context`[^1] that consists of:

- a context with `kind = environment` that models the current Materialize environment, and
- a context with `kind = organization` that models the environment owner.

If we want to provide cluster-specific configuration through LaunchDarkly, we would need to extend the `system_parameter_sync` loop[^2] to run a `frontend.pull(...)` call with a different `ld::Context` for each cluster. We would then use the `CLUSTER`-specific `ALTER SYSTEM` extensions in `SystemVars` in the `backend.push(...)` call.
## Open questions

N/A
## Future work

From @benesch:

> As future work, it seems like it'd also be interesting to allow users to use the `FEATURES` flag, limited to a restricted set of features that are worth allowing users to control.
>
> We might also consider introducing `CREATE TEMPORARY CLUSTER ... FEATURES (...)` and limit certain features for use only in temporary clusters. These clusters would only last for a single SQL session, and therefore wouldn't survive an `envd` restart, and therefore wouldn't prevent us from removing the feature flag in a future version of Materialize.
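If we pursue this, the syntax might look as follows (entirely hypothetical, mirroring the proposal above):

```sql
-- Hypothetical: a session-scoped cluster with feature overrides.
-- The cluster (and its overrides) disappears at session end.
CREATE TEMPORARY CLUSTER tmp (SIZE = '100cc')
FEATURES (enable_eager_delta_joins = true);
```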
Another item of future work: automatically duplicating dataflow-backed catalog items for experimentation.
[^1]: `ld_ctx` definition
[^2]: `system_parameter_sync` definition
[^3]: MVP sketch (draft PR)