Materialize stores data in S3 and metadata in CRDB. When restoring from backup, it's important that the data and the metadata stay consistent, but it's impossible to take a consistent snapshot across both. Historically, restoring an environment to a previous state has involved a significant amount of manual, error-prone work.
Reasons one might want backup/restore, and whether they’re in scope for this design —
Scenario | Example | In scope? |
---|---|---|
User-level backups | A user wants to explicitly back up a table, either ad-hoc or on some regular cadence. | No. (An S3 sink/source would provide similar functionality and integrate better with the rest of the product surface.) |
User error | A user puts a bunch of valuable data into a table, deletes a bunch of rows by accident, and asks us to restore it for them. | No. (Possibly useful for a system of record, but substantially more complex.) |
Controller bug | The compute controller fails to hold back the since of a shard far enough in a new version of Materialize, losing data needed by a downstream materialized view. | Yes! |
Persist bug | A harmless-seeming change to garbage collection accidentally deletes too much. | Yes! |
Snapshot | A shard causes an unusual performance problem for some compute operator, and we’d like to inspect a previous state to investigate. | Nice to have. |
Operator error | An operator typos an aws CLI command, accidentally deleting blobs that are still referenced. | Yes. (Impossible to prevent an admin from deleting data entirely, but it’s good if we can make ordinary operations less risky.) |
Motivated by the above and some other feedback, this design doc focuses on infrastructure-level backups (with no product surface area) that optimize for disaster recovery. For other possible approaches or extensions to backup/restore, see the section on future work.
This means backups should be:
It’s helpful if we’re also able to use backups for investigations and debugging historical state, but this is secondary to the goals above.
The state of a Materialize environment is held in a set of Persist shards. (Along with some other CRDB state, like the stash.) From Persist's perspective, each shard is totally independent of the others; `environmentd` is responsible for coordinating across shards in a way that presents a consistent state to the user.
This design is thus roughly composed of two distinct parts:
Persist is already a persistent data structure: the full state of a shard is uniquely identified by the shard's seqno. A valid backup of a Persist shard is just some mechanism for making sure that we can restore a shard to the contents it had at a particular seqno.
We plan to take advantage of features built into our infrastructure:
To restore a shard at a particular time, we can:
While this imposes some additional cost, it’s modest both in absolute terms and relative to our other options. See the appendix for more details.
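Very roughly, a per-shard restore built on a restored CRDB backup plus S3 versioning might look like the sketch below. The `Consensus` and `VersionedBlob` traits here are hypothetical stand-ins for the real Persist abstractions (the names and signatures are illustrative only), and "undelete" relies on S3 versioning semantics: removing a delete marker brings back the most recent live version of an object.

```rust
use std::collections::BTreeSet;

/// Hypothetical stand-ins for the real Persist abstractions.
pub struct SeqNo(pub u64);

pub struct ShardState {
    /// Every blob key referenced by the shard's state at this seqno
    /// (rollups plus batch parts).
    pub referenced_blob_keys: BTreeSet<String>,
}

pub trait Consensus {
    /// Read the shard's state as of the restored CRDB backup.
    fn state_at(&self, shard_id: &str, seqno: &SeqNo) -> ShardState;
}

pub trait VersionedBlob {
    /// True if the most recent version of the key is a delete marker.
    fn is_deleted(&self, key: &str) -> bool;
    /// Remove the delete marker, restoring the most recent live version.
    fn undelete(&self, key: &str);
}

/// Restore a single shard to its contents at `seqno`: walk every blob the
/// restored consensus state references and make sure it is live again.
pub fn restore_shard(
    consensus: &dyn Consensus,
    blob: &dyn VersionedBlob,
    shard_id: &str,
    seqno: &SeqNo,
) {
    let state = consensus.state_at(shard_id, seqno);
    for key in &state.referenced_blob_keys {
        if blob.is_deleted(key) {
            blob.undelete(key);
        }
    }
}
```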
Unlike for a single shard, the state of an entire environment cannot be represented with a single number. (Not even the mz timestamp: the timestamp of a shard can lag arbitrarily behind if, e.g., the cluster that populates that shard is crashing.) Conceptually, an environment backup is a mapping between shard id and the seqno of the per-shard backup. (Along with any additional environment-level state, like the stash.)
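To make that shape concrete, one could imagine representing an environment backup as something like the following; this is illustrative only, not an existing type in the codebase.

```rust
use std::collections::BTreeMap;

#[derive(PartialEq, Eq, PartialOrd, Ord)]
pub struct ShardId(pub String);

pub struct SeqNo(pub u64);

/// Conceptual shape of an environment-level backup: one per-shard backup
/// seqno for every shard, plus any environment-level extras.
pub struct EnvironmentBackup {
    /// For every shard in the environment, the seqno of its per-shard backup.
    pub shard_backups: BTreeMap<ShardId, SeqNo>,
    /// Additional environment-level state captured alongside the shards,
    /// e.g. a snapshot of the stash.
    pub stash_snapshot: Vec<u8>,
}
```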
Correctness criteria are also a bit more subtle, and involve reasoning about the dependencies between collections. We’re aware of two important correctness properties:
In a running environment, property 1 is carefully enforced by the various controllers, and property 2 is a consequence of causality. We need to make sure that our restore process respects these properties as well.
Our proposed restore implementation relies on the ordering properties guaranteed by CRDB:
This is straightforward to implement, and does not require any significant changes to the running Materialize environment.
This is not strictly safe, since CRDB is not a strictly serializable store; in particular, the docs describe the risk of a causal reverse, where a read concurrent with updates A and B may observe B but not A, even if B is a real-time consequence of A. In the context of Materialize, if A is a compare-and-append and B is a compare-and-append on some downstream shard, a causal reverse could cause our backup of B to contain newer data than our backup of A, breaking correctness requirement 2.
A causal reverse is expected to be very rare, and if encountered we can just try restoring another backup.
We already back up our CRDB data, and our S3 buckets have versioning and a lifecycle policy enabled.
Since there's no user-facing surface area to this feature, we do not expect to add additional code to the running environment. However, we will need code that can rewrite CRDB and update data in S3 while an environment is down. `persistcli` seems like a natural place for this, and restores can run as ad-hoc Kubernetes jobs.
The code we write for the restore process should be tested to the usual standard of Persist code, including extensive unit tests. (This may require writing a new Blob implementation to fake versioning behaviour, or running one of the many S3 test fakes available.)
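For example, a minimal in-memory fake with S3-style versioning semantics could be quite small; the sketch below is purely illustrative and is not an existing test utility.

```rust
use std::collections::HashMap;

/// A minimal in-memory blob fake with S3-style versioning: every write appends
/// a new version, deletes append a delete marker, and "undelete" pops it.
#[derive(Default)]
pub struct VersionedMemBlob {
    // For each key, the full version history; `None` is a delete marker.
    versions: HashMap<String, Vec<Option<Vec<u8>>>>,
}

impl VersionedMemBlob {
    pub fn set(&mut self, key: &str, value: Vec<u8>) {
        self.versions.entry(key.to_string()).or_default().push(Some(value));
    }

    pub fn delete(&mut self, key: &str) {
        self.versions.entry(key.to_string()).or_default().push(None);
    }

    /// Read the latest version, if it is live.
    pub fn get(&self, key: &str) -> Option<&[u8]> {
        self.versions.get(key)?.last()?.as_deref()
    }

    /// Remove a trailing delete marker, restoring the previous version.
    pub fn undelete(&mut self, key: &str) {
        if let Some(versions) = self.versions.get_mut(key) {
            if matches!(versions.last(), Some(None)) {
                versions.pop();
            }
        }
    }
}
```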
Since backups are often the recovery option of last resort, it’s useful to be able to test that the real-world production backups are working as expected. We should consider running a periodic job that restores a production backup to some secondary environment, brings that environment back up, and checks that the restored environment is healthy and can respond to queries.
The restore process does not run in a user’s environment, so it does not go through the usual deployment lifecycle. However, any periodic restore testing we choose to run would need resources provisioned somewhere.
The proposed approach to backups leans pretty heavily on the specific choices of blob and consensus store, which are otherwise well abstracted in the code. This means that any change in the stores we use would involve substantial rework of the approach to backup/restore. We consider this to be fairly low likelihood.
The proposed CRDB backup strategy may lead to an unrestorable backup if we experience a causal reverse. However, we expect this to be rare and straightforward to work around when it does occur.
One natural way to make a backup of a shard would be to copy a state rollup, and all the files it recursively references, out to some external location. This is straightforward to implement using the existing blob and consensus APIs and requires no new infrastructure.
If `environmentd` is coordinating the environment-wide backup process, making it copy every file in every shard is a lot of extra work. (Though thankfully S3 can copy files without transferring all the data through the client.) We might need to introduce an additional sidecar just for backups, which is a significant operational burden.

If we've decided we don't want to copy data, another option would be to teach the current Persist implementation to designate certain rollups as "backup" rollups, and avoid deleting them and the files they reference. This is a more invasive change to Persist, but would allow us to restore the chosen backups without having to "undelete" any data.
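As a rough sketch of what designating backup rollups could mean (a hypothetical field and check, not the current Persist state machine), garbage collection would simply refuse to truncate past the oldest pinned rollup:

```rust
use std::collections::BTreeSet;

/// Hypothetical sketch of a "backup rollup" mechanism.
pub struct GcState {
    /// Seqnos whose rollups (and the blobs they reference) are pinned as backups.
    pub backup_rollups: BTreeSet<u64>,
    /// The seqno GC would otherwise be allowed to truncate up to.
    pub gc_candidate_seqno: u64,
}

impl GcState {
    /// The seqno GC may actually truncate up to: never past a pinned backup.
    pub fn effective_gc_seqno(&self) -> u64 {
        match self.backup_rollups.iter().next() {
            Some(oldest_backup) => self.gc_candidate_seqno.min(*oldest_backup),
            None => self.gc_candidate_seqno,
        }
    }
}
```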
Instead of relying on CRDB's ordering guarantees, we could instead coordinate consistent, environment-wide backups from `environmentd`.
Imagine a fictional controller that:
In this world, a toy backup algorithm for a whole Materialize environment might look like:
Correctness property 2 is guaranteed by processing shards in reverse dependency order, so the backup of a shard will always be older than the backups of the shards it depends on. We maintain property 1 with the since hold juggling; if the hold was valid at the time the downstream shard was backed up, that same since hold will be maintained at the time the upstream gets backed up.
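A sketch of that toy algorithm, consistent with the argument above and using entirely hypothetical controller hooks (nothing with this shape exists today), might look like:

```rust
/// Hypothetical controller hooks; illustrative only.
pub trait ToyController {
    /// Shard ids in reverse dependency order: every shard appears before the
    /// shards it reads from (downstreams first, upstreams last).
    fn shards_reverse_dependency_order(&self) -> Vec<String>;
    /// Acquire a hold that prevents the shards this shard depends on from
    /// compacting past what this shard's backup will need.
    fn acquire_since_hold(&self, shard_id: &str) -> SinceHold;
    /// Record the current seqno of the shard as its backup.
    fn backup_shard(&self, shard_id: &str) -> u64;
    fn release(&self, hold: SinceHold);
}

pub struct SinceHold;

/// Back up downstream shards first, keeping every since hold until all shards
/// are done: each shard's backup is taken before (and so is no newer than) the
/// backups of the shards it depends on, and the holds taken for a downstream
/// are still in place when its upstreams get backed up.
pub fn toy_environment_backup(ctl: &dyn ToyController) -> Vec<(String, u64)> {
    let mut backups = Vec::new();
    let mut holds = Vec::new();
    for shard_id in ctl.shards_reverse_dependency_order() {
        let hold = ctl.acquire_since_hold(&shard_id);
        let seqno = ctl.backup_shard(&shard_id);
        backups.push((shard_id, seqno));
        holds.push(hold);
    }
    for hold in holds {
        ctl.release(hold);
    }
    backups
}
```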
A real version of this algorithm would need to be implemented in `environmentd`. For our favoured per-shard backup approach, the performance cost should be pretty marginal. However, it does introduce complexity into an already very challenging-to-reason-about codepath.

While this approach is believed to work, it's much more invasive and complex to implement, and the issues it's working around are expected to be vanishingly rare.
AWS Backup allows making periodic or incremental backups of an S3 bucket, with no additional transfer costs. It relies on S3 versioning, and will back up every version of an object in the bucket.
Doing nothing is always an option, and maybe a fine one: we’re explicit with our users that we’re not to be treated as a system of record yet.
Materialize’s sources and sinks also store data in the systems they connect to. In general, we’re unable to back up and restore this data fully. (If we restore to a week-old backup, the data our Kafka source had ingested since then might be compacted away.)
How sources and sinks handle snapshots is expected to vary:
We expect to tackle the "low-hanging fruit" during the original epic, but generally this is something that individual sources and sinks will need to handle and may require some ongoing work.
This document describes approaches to backing up a single Persist shard, as well as the full environment. In some cases, we may actually only want to restore a subset of the shards; for example, if a source breaks with some bug, we may want to restore its state along with the state of all its downstreams. The resulting state does not correspond to any historical state of the system!
It seems plausible that one could reconstruct such a state by inspecting the full history of the relevant shards and with a good understanding of the dependency relationships between them.
One can imagine wanting a user-facing syntax for backup and restore: for example, to make an ad-hoc snapshot of a table before making changes, or to periodically back up some critical dataset.
We expect this need to be served by future sources and sinks, like a potential S3 integration, instead of relying on any infrastructure-level backup.
It’s difficult to attribute costs exactly, since we have only rough aggregate numbers and both S3 and CRDB are shared resources. Nonetheless, and very approximately:
The only cost of S3 Versioning is the additional object storage. If we decide to adopt S3 Versioning and keep all versions for a month, we’d add ~10% to our overall S3 spend.
If we chose an object-copying approach to backups and copied objects hourly, the average object would be present in 24 backups. If we took the naive approach of copying every object at backup time — which is priced as a PUT — this would add an order of magnitude to our S3 costs. Even with the maximally clever approach, where we deduplicate and reference-count our backed-up objects, we'd be increasing our spend by about half even before accounting for the additional storage costs.
We spend roughly an order of magnitude more on CRDB than S3, so even significant percentage changes in our S3 usage have a relatively modest impact on our overall spend. (Without accounting for the CPU cost of interacting with Persist in EC2.)