Associated issues/prs: #20010 #26401
Reconfiguring a managed cluster via `ALTER CLUSTER` leads to downtime. This is because the cluster sequencer first removes the old replicas and then adds new replicas that match the new configuration; the downtime extends through the full period it takes to rehydrate the new replicas. A mechanism should be provided that allows users to alter a managed cluster while delaying the deletion of old replicas until the new replicas are spun up and hydrated.
This feature will introduce new SQL in `ALTER CLUSTER` which will create new replicas matching the alter statement; however, unlike the existing alter mechanism, it will delay the cleanup of old replicas until the provided `WAIT` condition is met. The SQL will need to define the type of check to be performed, parameters for that check, and whether or not to background the operation.
Suggested syntax:
```sql
ALTER CLUSTER c1 SET (SIZE 'small') WITH (
    WAIT = <condition>,
    BACKGROUND = {true|false}
)
```
The two types of checks, corresponding to the V1 and V2 scopes respectively, are:

`FOR`, which waits for a provided duration before proceeding with finalization.

```sql
ALTER CLUSTER c1 SET (SIZE 'small') WITH (
    WAIT = FOR <interval>,
    BACKGROUND = {true|false}
)
```

`UNTIL CAUGHT UP`, which will wait for a "catch up" mechanism to return true or for a timeout to be met. It will roll back or forward depending on the value provided to `ON TIMEOUT`.
```sql
ALTER CLUSTER c1 SET (SIZE 'small') WITH (
    WAIT = UNTIL CAUGHT UP (
        TIMEOUT = <interval>,
        ON TIMEOUT = {CONTINUE | ABORT}
    ),
    BACKGROUND = {true|false}
)
```
All replicas will be billed during reconfiguration (except explicitly unbilled replicas).
Cancelation

For v1, all `ALTER CLUSTER` statements will block, and graceful reconfiguration will only be cancelable by canceling the query from the connection that issued it.

For v2, we will introduce backgrounded reconfiguration, which will need new semantics to cancel. Something like the following should suffice:

```sql
CANCEL ALTER CLUSTER <cluster>
```
Reconfiguration replicas will be created with a `-pending` suffix. They will move to active once the reconfiguration is finalized. These will be stored in the `mz_pending_cluster_replicas` table.

Reconfiguration replicas will be identifiable for purposes of reporting and visibility in the UI: they will have a `-pending` suffix, and the `pending` column in `mz_cluster_replicas` will be true for them.

Active replicas will continue with the same naming scheme (r1, r2, ...). Reconfiguration replicas will use the `-pending` suffix (e.g. `r1-pending`, `r2-pending`), which will be removed once they are moved to active.
We want to accomplish the following:

A `pending: bool` field will be added to the in-memory and durable `ClusterReplica` catalog structs. This will be set to true for reconfiguration replicas and will help either recover or clean up on environmentd crashes. `mz_cluster_replicas` will show the value of this `pending` field.
Below, this process is broken down into four components.
`sequence_alter_cluster_managed_to_managed_graceful`
*In v2, steps 3 and 4 may be swapped.
Interpreting the Query
This will use standard parsing/planning. The only notable callout here is that we'll want `mz_sql::plan::AlterClusterOptions` to contain an `alter_mechanism` field which holds an optional `AlterClusterMechanism`:
```rust
enum AlterClusterMechanism {
    Duration {
        timeout: Duration,
        ...
    },
    UntilCaughtUp {
        timeout: Duration,
        continue_on_timeout: bool,
        ...
    },
}
```
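To make the relationship between the `WITH (...)` clause and this enum concrete, here is a minimal sketch of how the parsed options might be planned into an `AlterClusterOptions`. The `plan_with_clause` function, its signature, and the filled-in struct fields are assumptions for illustration; only the type names come from this doc.

```rust
use std::time::Duration;

// Illustrative stand-ins for the planner types described above; the field
// sets are simplified relative to the real structs.
#[derive(Debug, PartialEq)]
pub enum AlterClusterMechanism {
    Duration { timeout: Duration },
    UntilCaughtUp { timeout: Duration, continue_on_timeout: bool },
}

#[derive(Debug, Default)]
pub struct AlterClusterOptions {
    pub alter_mechanism: Option<AlterClusterMechanism>,
    pub background: bool,
}

/// Hypothetical planning step: translate the WAIT clause (at most one
/// condition) and BACKGROUND flag into plan options.
pub fn plan_with_clause(
    wait_for: Option<Duration>,
    until_caught_up: Option<(Duration, bool)>,
    background: bool,
) -> Result<AlterClusterOptions, String> {
    let alter_mechanism = match (wait_for, until_caught_up) {
        (Some(_), Some(_)) => return Err("WAIT accepts only one condition".to_string()),
        (Some(timeout), None) => Some(AlterClusterMechanism::Duration { timeout }),
        (None, Some((timeout, continue_on_timeout))) => {
            Some(AlterClusterMechanism::UntilCaughtUp { timeout, continue_on_timeout })
        }
        // No WAIT clause: fall back to the existing disruptive alter path.
        (None, None) => None,
    };
    Ok(AlterClusterOptions { alter_mechanism, background })
}
```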
Cluster Sequencer
The following outlines the new logic for the cluster sequencer when performing an alter cluster on a managed cluster.
Because this will need to be cancelable and will wait for a significant duration before responding, the current `sequence_cluster_alter` process cannot be used. This implementation will rely on migrating `sequence_cluster_alter` to use the `sequence_staged` flow, allowing us to kick off a thread/stage which performs waits/validations.
The new `sequence_alter_cluster` function will need to perform the current validations, used in both `sequence_alter_cluster` and `sequence_alter_cluster_managed_to_managed`, but will kick off a new `StageResult::Handle` when an `alter_mechanism` is provided. The task in this handle will perform waits/validations and pass off the finalization to a `Finalize` stage.
Additionally, we'll need to perform some extra validations to ensure the `alter_mechanism` and `AlterClusterPlan` are compatible. These are:
The `sequence_alter_cluster_managed_to_managed` method will need to be updated to apply the correct name/pending status for new replicas and to only remove existing replicas when no `alter_mechanism` is provided.
For V2:
We will need to implement a message strategy that kicks the work off of the `sequence_staged` flow. In this scenario, we will still create new replicas in the foreground, but the wait/checks will be performed in a separate message handler by emitting a `ClusterReconfigureCheckReady` message. The state for the reconfiguration will be stored in a `PendingClusterReconfiguration` struct.
```rust
struct PendingClusterReconfiguration {
    ctx: ExecuteContext,
    // Needed to determine timeouts.
    start_time: Instant,
    // The plan will hold the cluster_id, config, mechanism, etc...
    plan: AlterClusterPlan,
    // The idempotency key will need to be part of the message and will need
    // to be validated to ensure we're looking at the correct plan, in case we
    // miss a cancel and a new entry was put in pending_cluster_reconfigurations.
    idempotency_key: Uuid,
}
```
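The idempotency-key validation can be sketched as follows. The `lookup_pending` helper, the `ConnectionId` key type, and the `u128` stand-in for `Uuid` are all assumptions for illustration; the point is only that a check message whose key doesn't match the stored entry must be ignored.

```rust
use std::collections::BTreeMap;

// Hypothetical stand-ins: ConnectionId for the session key, u128 for
// uuid::Uuid, and a trimmed-down PendingClusterReconfiguration.
type ConnectionId = u32;
type Uuid = u128;

pub struct PendingClusterReconfiguration {
    pub idempotency_key: Uuid,
    // ctx, start_time, and plan elided for brevity.
}

/// Return the pending entry only when the message's idempotency key matches,
/// so a stale ClusterReconfigureCheckReady (e.g. one that raced with a cancel
/// followed by a new ALTER) cannot finalize the wrong plan.
pub fn lookup_pending(
    pending: &BTreeMap<ConnectionId, PendingClusterReconfiguration>,
    conn_id: ConnectionId,
    message_key: Uuid,
) -> Option<&PendingClusterReconfiguration> {
    pending.get(&conn_id).filter(|p| p.idempotency_key == message_key)
}
```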
In order to cancel, we'll keep a map, `pending_cluster_reconfigurations`, of session `ClusterId`s to `PendingClusterReconfiguration`s.

*Currently, due to limitations in sources and sinks, we will need to maintain the disruptive alter mechanism. Once multiple replicas are allowed for sources and sinks, we can remove the disruptive mechanism and default the wait to `WAIT = FOR 0 seconds`.
The responsibilities of this operation will be checking the `WAIT` condition and finalizing the reconfiguration.

Checking the WAIT Condition
The message handler will perform all checks on every run. If the wait condition has not yet been met, we will reschedule the message to be resent. For each reschedule, we'll spawn a task that waits some duration and sends a `ClusterReconfigureCheckReady` message with the connection_id.

For `WAIT FOR`, we can wait for the exact duration that would cause the next check to succeed. For `WAIT UNTIL CAUGHT UP`, we can just wait for three seconds.
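The reschedule-delay choice above can be sketched as a small function. The `WaitCondition` enum and `next_check_delay` are illustrative names, not the real implementation; the three-second poll interval comes from the text.

```rust
use std::time::{Duration, Instant};

// Illustrative wait-condition type for the two WAIT variants.
pub enum WaitCondition {
    For { duration: Duration },
    UntilCaughtUp,
}

/// How long to sleep before re-sending ClusterReconfigureCheckReady.
pub fn next_check_delay(cond: &WaitCondition, started: Instant, now: Instant) -> Duration {
    match cond {
        // WAIT FOR: sleep exactly long enough that the next check succeeds.
        WaitCondition::For { duration } => duration.saturating_sub(now.duration_since(started)),
        // WAIT UNTIL CAUGHT UP: readiness can't be predicted, so poll every 3s.
        WaitCondition::UntilCaughtUp => Duration::from_secs(3),
    }
}
```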
Finalize the Reconfiguration

Finalizing the reconfiguration will occur once the `wait` condition has been met. During this phase, reconfiguration replicas will be promoted to active: the `-pending` suffix will be removed from their names, and they will no longer be marked `pending` in `mz_cluster_replicas`.

Handling Dropped Connections
Dropped connections will be handled by storing a `clean_cluster_alters: BTreeSet<GlobalId>` in the `ConnMeta`. We will use this along with a `retire_cluster_alter` function which can be called during connection termination. `retire_compute_sync_for_conn` provides an example of this.
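A minimal sketch of that cleanup path, assuming a simplified `ConnMeta` and a hypothetical map from cluster id to its pending replica names (the real coordinator state is richer):

```rust
use std::collections::{BTreeMap, BTreeSet};

// GlobalId and the ConnMeta field name follow the doc; everything else here
// is an illustrative stand-in for coordinator state.
type GlobalId = u64;

#[derive(Default)]
pub struct ConnMeta {
    pub clean_cluster_alters: BTreeSet<GlobalId>,
}

/// Called during connection termination: for every cluster with an in-flight
/// graceful alter, drop its pending replicas and leave the original
/// configuration untouched.
pub fn retire_cluster_alters(
    conn: &mut ConnMeta,
    pending_replicas: &mut BTreeMap<GlobalId, Vec<String>>,
) {
    for cluster_id in std::mem::take(&mut conn.clean_cluster_alters) {
        pending_replicas.remove(&cluster_id);
    }
}
```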
Handling Cancelation
The mechanism used for handling dropped connections can be used for cancelation.
`cancel_compute_sync_for_conn` provides an example of this.
v2 considerations
Reverting the reconfiguration
On failure or timeout of the reconfiguration, this task will revert to the state prior to the `ALTER CLUSTER` DDL. It will remove reconfiguration replicas and ensure the cluster config matches the original version.
For v1, environmentd restarts will remove `pending` replicas from the Catalog during Catalog initialization. This will cause the compute controller to remove any pending physical replicas during its initialization (this is an already existing feature).
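The v1 restart behavior amounts to a filter over the durable replica set during catalog initialization. A sketch, with an illustrative struct shape (the real `ClusterReplica` has more fields):

```rust
// Illustrative trimmed-down replica struct; only `pending` matters here.
pub struct ClusterReplica {
    pub name: String,
    pub pending: bool,
}

/// During catalog initialization, drop any replica whose durable `pending`
/// flag is set; the compute controller then tears down the matching physical
/// replicas as part of its own (pre-existing) reconciliation.
pub fn remove_pending_on_boot(replicas: Vec<ClusterReplica>) -> Vec<ClusterReplica> {
    replicas.into_iter().filter(|r| !r.pending).collect()
}
```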
For v2, we will need to store the AlterClusterPlan durably for all backgrounded reconfigurations. We will then need the coordinator to attempt to continue an ongoing reconfiguration on restart.
V1: We will initially disallow any reconfiguration of a cluster with a schedule.

V2: For refresh schedules, we will only allow graceful reconfiguration on clusters whose refresh schedule is "ON"; we would also need to prevent the clusters from being turned off until the reconfiguration is finished.

Future schedules: Some proposed schedules (auto-scaling, for instance) may want to use graceful reconfiguration under the hood. These would need specific, well-defined interactions with user-invoked graceful reconfiguration.
The initial version of this doc detailed a potential solution where reconfiguration was treated as a cluster schedule. In that plan, roughly, a new schedule would be added to the cluster being gracefully reconfigured. Rather than having a task handle the reconfiguration, `HandleClusterSchedule` would have logic that looked for `ReconfigurationSchedule`s and would send `ReconfigurationFinalization` decisions when the `wait` condition was met. This decision would then promote pending replicas to active replicas, and remove the `ReconfigurationSchedule` in `handle_schedule_decision`.
The current proposal uses the `WITH (...)` syntax; it may be more straightforward to add this to the cluster as a config attribute, though:

```sql
ALTER CLUSTER SET (CLEANUP TIMEOUT 5 minutes)
```

The above syntax would add a permanent reconfiguration timeout to the cluster that would be applied on every reconfigure. For this to work, we'd need to update the cluster to know when a reconfigure event started; then we could compare the event time plus the duration to now and send a `ReconfigureFinalization` decision.

It seems like the reconfiguration task exists only to interact with actions taken from `ALTER CLUSTER` statements and is not a property of the cluster, so I'm not very bullish on this syntax. I could be convinced that this is more ergonomic.
There is a current limitation around clusters with sources where the replication factor must be 1 or 0. Given this, we must disallow any reconfiguration schedules for clusters with sources (and sinks?). This is a newly annoying issue, as we now allow clusters to perform both compute and source roles. We should consider some ways to handle this; one option is to keep source activity to a single replica in a multi-replica cluster. Alternatively, we just don't allow this and add some big disclaimers to clusters with sources.
There is currently the beginning of a mechanism to look at hydration-based metrics for scheduling in `compute-client/src/controller/instance:update_hydration_status`; this works very well for a single dataflow, but can't gauge the hydration status of a replica as a whole.
What we likely want is a mechanism discussed here, where we compare the data-flow frontiers between reconfiguration and active replicas, and when the reconfiguration replicas have all data-flows within some threshold of the active, we send a claim that the check condition has been met and proceed with finalization of the reconfiguration.
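That frontier-comparison check can be sketched as follows. The `DataflowId`/`Timestamp` aliases, the millisecond lag threshold, and the function itself are assumptions for illustration, not the controller's real API; real frontiers are antichains rather than single timestamps.

```rust
use std::collections::BTreeMap;

// Illustrative stand-ins: one write-frontier timestamp per dataflow.
type DataflowId = u64;
type Timestamp = u64;

/// The pending replicas are "caught up" once every dataflow's frontier is
/// within `max_lag_ms` of the corresponding frontier on the active replicas.
pub fn caught_up(
    active: &BTreeMap<DataflowId, Timestamp>,
    pending: &BTreeMap<DataflowId, Timestamp>,
    max_lag_ms: Timestamp,
) -> bool {
    active.iter().all(|(id, active_ts)| match pending.get(id) {
        Some(pending_ts) => active_ts.saturating_sub(*pending_ts) <= max_lag_ms,
        // A dataflow not yet installed on the pending replicas is not caught up.
        None => false,
    })
}
```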
Potential challenges with using hydration status: It is difficult to know when "hydration" has finished because:
For hydration-based scheduling, the `UNTIL CAUGHT UP` syntax will be used.
There is also a question on the mechanism for tracking hydration or frontiers. We could check at a regular interval if the frontiers between original and reconfiguration replicas match. Another possible alternative is spinning up a thread that subscribes to the frontiers and kicks off finalization once the chosen condition is met.
Currently we are opting not to allow graceful reconfiguration of clusters with sources and sinks, due to a single replica limit for these clusters. However, it may be possible to allow some graceful reconfiguration on these clusters, if we postpone installing the sources or sinks on the pending replicas until the replica is promoted. This will ensure that it is the only replica running. We would also have to ensure the new replication factor is greater than or equal to one in the new config. Multi-replica sources and sinks will implicitly resolve this issue, so, for now, it seems reasonable to wait for that.