NOTE: At the time this was merged, we were already past milestone 2 of zero-downtime upgrades. We merged it anyway so that the historic context is available in our repository of design docs. Some of the descriptions in here are half baked, but overall the implementation followed closely what we laid out here.
We want to build out the ability to do zero-downtime upgrades. Here we outline the first of multiple incremental milestones towards "full" zero-downtime upgrades. We spell out trade-offs and hint at future milestones at the end.
Zero-downtime upgrades can encompass many things, so we carefully carve out what we do want to provide with this first milestone and what we don't. We will use two basic notions in the discussion:

- responsiveness: The system, a part of it, or an object responds to queries at all, possibly only at a relaxed consistency level.
- freshness: The system, a part of it, or an object responds to queries with data that is "sufficiently recent".
  - "sufficient" to be determined by customers, for their workloads
We use the term availability to describe the combination of responsiveness and freshness.
An object that is not responsive is unavailable: it cannot be queried at any timestamp. Objects that are only responsive (but not fresh) can be queried when downgrading the consistency requirement to serializable. Only objects that are available can be queried at our default strict serializable isolation with sufficiently low latency.
Today, we provide neither responsiveness nor freshness for in-memory objects (read: indexes) because when we upgrade we immediately cut over to a new environmentd process and only then spin up clusters, which then start re-hydrating those in-memory objects.
The work around zero-downtime upgrades can largely be seen as futzing with the availability and freshness of objects, and if/when we provide them.
We define the goals in terms of which of responsiveness and freshness we want to provide for objects during or right after an upgrade.
Implicitly, everything not mentioned above is a non-goal, but explicitly:
Sources are always responsive, because they are ultimately backed by persist shards, and you can always read from those. We do have "almost freshness" for most sources, because only UPSERT sources require re-hydration, but there is a window of non-freshness while the source machinery spins up after an upgrade.
Reducing that window for all types of sources is not a goal of this first milestone! Getting all sources to be fresh after upgrades is sufficiently hard that we're punting on it for the first milestone. Transitively, this means we cannot provide freshness for compute objects that depend on sources.
Today, in-memory compute objects offer neither responsiveness nor freshness immediately after an upgrade: they are backed by ephemeral in-memory state that has to be re-hydrated. This is different from sources and materialized views, which already offer responsiveness today.
The other non-goals also require non-obvious solutions and some light thinking, but we believe any kind of zero-downtime cutover already provides benefits, so we want to deliver that first and then incrementally build on top of it in future milestones.
Today, an upgrade roughly works like this:

1. Spin up environmentd at the new version.
2. The new environmentd checks that it can migrate over durable environment state (catalog), then signals readiness for taking over and sits and waits for the go signal. Crucially, by this time it does not spin up controllers/clusters.
3. On the go signal, the new environmentd takes over.
4. We tear down the old environmentd and cluster processes and spin up clusters at the new version, which start re-hydrating in-memory objects.

The fact that we only spin up new clusters in step #4 is what leads to user-perceived downtime for compute objects and sources that need re-hydration.
A number of components need to learn to start up in a read-only mode, in which they don't make changes to durable environment state, including the persist shards that they would normally write to.
Then we can do this when upgrading:
1. Spin up environmentd at the new version.
2. The new environmentd checks that it can migrate over durable environment state (catalog), then spins up controllers (and therefore clusters) in read-only mode. Only when everything is re-hydrated do we signal readiness for taking over.
3. On the go signal, the new environmentd takes over.
4. We tear down the old environmentd and cluster processes; the new clusters are already running and hydrated, so they only have to transition out of read-only mode.

The new parts in step #2 and #4 make it so that we re-hydrate compute objects (and later on also sources) before we cut over to the new version. Compute objects are immediately fresh when we cut over.
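To make the sequencing concrete, here is a minimal, self-contained sketch of the states the new deployment moves through and the events that advance it. The names are hypothetical and this is not the actual environmentd code; it only illustrates that hydration now happens before we signal readiness, so the cut-over itself no longer waits on re-hydration.

```rust
// Hypothetical sketch, not the real environmentd code: the states a new
// deployment moves through under the proposed flow, and the two events
// (hydration finished, go signal) that advance it.

#[derive(Debug, PartialEq, Clone, Copy)]
enum DeployState {
    /// Catalog migration checked; controllers and clusters are running in
    /// read-only mode and re-hydrating (new in step #2).
    ReadOnlyHydrating,
    /// Everything is hydrated; we signalled readiness and wait for the go
    /// signal without touching durable state.
    ReadyForTakeover,
    /// Go signal received; we have transitioned out of read-only mode and
    /// the old deployment can be torn down (new in step #4).
    ReadWrite,
}

fn advance(state: DeployState, hydrated: bool, go_signal: bool) -> DeployState {
    match state {
        DeployState::ReadOnlyHydrating if hydrated => DeployState::ReadyForTakeover,
        DeployState::ReadyForTakeover if go_signal => DeployState::ReadWrite,
        other => other,
    }
}

fn main() {
    let mut state = DeployState::ReadOnlyHydrating;
    // The go signal alone does not let us take over; we must be hydrated first.
    state = advance(state, false, true);
    assert_eq!(state, DeployState::ReadOnlyHydrating);
    state = advance(state, true, false);
    state = advance(state, true, true);
    assert_eq!(state, DeployState::ReadWrite);
}
```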
These parts need to learn to start in read-only mode:
Compute Controller (including the things it controls: clusters/compute objects):
I (aljoscha) don't think it's too hard. We need to thread a read-only mode through the controllers to the cluster. The only thing doing writes in compute is persist_sink, and that's also not too hard to wire up; there is a sketch of this after this list.
Storage Controller (including the things it controls: sources, sinks, writing to collections, etc.):
"Real" read-only mode for sources is hard but I (aljoscha) think we can start with a pseudo read-only mode for milestone 1: while the storage controller is in read-only mode it simply doesn't send any commands to the cluster and doesn't attempt any writes. This has the effect that we don't re-hydrate sources while in read-only mode, but that is okay for milestone 1. Later on, we can teach STORAGE to have a real read-only mode and do proper re-hydration.
StorageCollections:
This is a recently added component that encapsulates persist SinceHandles for collections and hands out read holds. The fact that this needs a read-only mode is perhaps surprising, but: each collection has one logical SinceHandle, and when a new environmentd takes over it also takes over control of the since handles and downgrades them from then on. This is a destructive/write operation. While a compute controller (and its clusters) are re-hydrating, they need read holds.
I (aljoscha) believe that a StorageCollections that is starting in read-only mode should acquire leased read handles for collections and use those to back read holds it hands out to the compute controller et al. Only when it is told to go out of read-only mode will it take over the actual SinceHandles. Relatedly, a StorageCollections in read-only mode is not allowed to do any changes to durable environment state.
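As a rough illustration of the read-only mode threading mentioned in the list above (hypothetical names, not the actual controller or persist_sink code): a read-only flag could be broadcast from environmentd down to the write path, which simply refrains from writing until the flag flips. Flipping the flag is the graceful transition out of read-only mode.

```rust
// Hypothetical sketch: a read-only flag broadcast via a tokio watch channel,
// checked by the one place in compute that writes to persist.
use tokio::sync::watch;

struct SinkStub {
    read_only: watch::Receiver<bool>,
}

impl SinkStub {
    /// Write a batch of updates, unless we are in read-only mode. While
    /// read-only, the dataflow still runs and hydrates; it just must not
    /// change durable state.
    async fn maybe_append(&self, updates: Vec<(String, i64)>) {
        if *self.read_only.borrow() {
            println!("read-only: holding back {} updates", updates.len());
        } else {
            println!("read-write: appending {} updates", updates.len());
        }
    }
}

#[tokio::main]
async fn main() {
    // The new environmentd starts everything with read_only = true.
    let (allow_writes, read_only) = watch::channel(true);
    let sink = SinkStub { read_only };

    sink.maybe_append(vec![("a".into(), 1)]).await;

    // On the go signal, environmentd flips the flag and the same running
    // dataflows start writing, without being restarted.
    allow_writes.send(false).unwrap();
    sink.maybe_append(vec![("b".into(), 1)]).await;
}
```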
Here's a sketch of the reasoning for a) the correctness, and b) the efficacy of this StorageCollections approach.
Correctness: meaning we don't violate the somewhat nebulous correctness guarantees of Materialize. Roughly, that we uphold strict serializability, don't lose data, and don't get into a state where components can't successfully restart.
- Leased read handles expire on their own, so nothing depends on them surviving clusterd and environmentd processes restarting.
- The critical since handles are what matters for correctness: on restart, environmentd will re-acquire them and then decide as_ofs for rendering dataflows based on the sinces that the critical handles lock in place. Therefore, leased read handles are never required for correctness.
- While in read-only mode, the new environmentd will not touch the critical since handles.
- An environmentd deployment restarting in read-write mode will acquire the critical since handles, same as on a regular failure/restart cycle. It will then determine as_ofs for dataflows based on them and send them out again to already running clusters.

Efficacy: meaning in read-only mode the leased read handles do something to help hydrate dataflows/clusters, without interacting with the critical since handles.
- Leased read handles are sufficient (in the clusterd process) to install persist sources, as long as they're not expired. As seen in CI failures, when we activate the persist_use_critical_since_* LD flags, the leased read handles holding back the shard frontier are "load bearing" for this use case.
- They keep dataflows hydrated and running on the clusterd processes for when a restarted environmentd, in read-write mode, with critical since handles in place, tries to reconcile with them. Ideally, none of the already-running dataflows have to be restarted due to incompatibilities. Dataflows would be incompatible if their as_of were later than what the new environmentd process requires.

There are at least two ways for transitioning components out of read-only mode:

1. Graceful transition: tell the running component to switch out of read-only mode and start writing, without restarting it.
2. Crash-and-burn: restart the component (or its process) in read-write mode.
At least for clusters, we need approach #1 because the whole idea is that they re-hydrate and then stay running when we cut over to them and transition out of read-only mode.
For the other components, both approaches look feasible and it is hard to tell which one is easier/quicker to achieve for milestone 1.
I (aljoscha) think that in the long run we need to do approach #1 (graceful transition) for all components because it will always be faster and lead to less downtime.
If we want to gracefully transition the adapter/coordinator out of read-only mode, we need the work around Catalog follower/subscribers and then teach the Coordinator to juggle migrated in-memory state while it awaits taking over. With the crash-and-burn approach, the Coordinator could get a recent snapshot of the catalog, migrate it, and work off of that. It doesn't need to juggle migrated state and be ready to eventually take over.
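To illustrate what a graceful transition could look like for StorageCollections specifically, here is a purely illustrative sketch (these are not the real persist or StorageCollections types): while read-only, read holds are backed by a leased handle and the critical handle is left alone; on promotion the component takes over the critical since handle without dropping the read holds that hydrating dataflows already rely on.

```rust
// Purely illustrative, not the real persist API: what backs a collection's
// read holds in read-only vs. read-write mode, and the graceful transition
// between the two.

/// Expires with its lease; safe to hold from a read-only deployment because
/// it is never the source of truth for picking as_ofs.
#[derive(Debug)]
struct LeasedReadHandle {
    since: u64,
}

/// The one logical handle per collection; downgrading it moves the shard's
/// since, which is a destructive/write operation.
#[derive(Debug)]
struct CriticalSinceHandle {
    since: u64,
}

#[derive(Debug)]
enum ReadHoldBacking {
    /// Read-only mode: hand out read holds backed by a leased handle only;
    /// the critical handle is left untouched.
    Leased(LeasedReadHandle),
    /// Read-write mode: we have taken over the critical handle and keep
    /// downgrading it as read holds are released.
    Critical(CriticalSinceHandle),
}

impl ReadHoldBacking {
    /// Graceful transition out of read-only mode: take over the critical
    /// handle while keeping the read holds that hydrated dataflows depend on.
    fn promote(self) -> ReadHoldBacking {
        match self {
            ReadHoldBacking::Leased(leased) => {
                ReadHoldBacking::Critical(CriticalSinceHandle { since: leased.since })
            }
            already_read_write => already_read_write,
        }
    }
}

fn main() {
    let backing = ReadHoldBacking::Leased(LeasedReadHandle { since: 42 });
    let backing = backing.promote();
    assert!(matches!(
        backing,
        ReadHoldBacking::Critical(CriticalSinceHandle { since: 42 })
    ));
}
```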
This section is a bit in-the-weeds!
The way most storage-managed collections work today is that we "reset" them to empty on startup and then assume that we are the only writer and panic when an append fails.
This approach will not be feasible for long. At least when we want to do the graceful-transitioning approach, we need a way for the storage controller to buffer writes to storage-managed collections and reconcile the desired state with the actual state. This is very similar to the self-correcting persist_sink.
There are some storage-managed collections for which this approach doesn't work, because they are more append-only in nature. We have to, at the very least, audit all storage-managed collections and see how they would work in a zero-downtime/read-only/transition world.
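As a rough sketch of the self-correcting idea (hypothetical code, not the storage controller's actual implementation): instead of resetting a collection to empty and assuming sole writership, we compute the difference between desired and actual contents and append only that difference.

```rust
// Hypothetical sketch of a self-correcting storage-managed collection:
// compute the updates that turn the collection's actual contents into the
// desired contents, rather than resetting it and re-writing from scratch.
use std::collections::BTreeMap;

/// Contents of a collection as (row, multiplicity) pairs.
type Contents = BTreeMap<String, i64>;

/// The update batch that, when appended, reconciles `actual` with `desired`.
fn correction(desired: &Contents, actual: &Contents) -> Vec<(String, i64)> {
    let mut updates = Vec::new();
    for (row, &want) in desired {
        let have = actual.get(row).copied().unwrap_or(0);
        if want != have {
            updates.push((row.clone(), want - have));
        }
    }
    for (row, &have) in actual {
        if !desired.contains_key(row) {
            // Retract rows that should no longer be present.
            updates.push((row.clone(), -have));
        }
    }
    updates
}

fn main() {
    let desired = Contents::from([("replica_a".into(), 1), ("replica_b".into(), 1)]);
    let actual = Contents::from([("replica_a".into(), 1), ("stale_row".into(), 1)]);
    // Appending this batch fixes up the collection without a destructive reset.
    assert_eq!(
        correction(&desired, &actual),
        vec![("replica_b".into(), 1), ("stale_row".into(), -1)]
    );
}
```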
These can be done in any order:
After the above are done:
Milestone 2:
Milestone 3:
Future Milestones:
I (aljoscha) am not sure that we ever need a consistent cut-over point or smart handling of behaviour changes. It's very much a product question.
If we want to quickly deliver milestone 2 (freshness for sources), the storage team should get started on thinking about read-only mode for sources and how we can transition them during upgrades rather sooner than later.
Graceful transition out of read-only mode or crash-and-burn at some level?