There are instances where we want to migrate the data stored in the Stash. For example, as we build Role Based Access Control, we want to add new fields to the `RoleValue` type. While we do have a migration flow today, there are a few issues with it, namely:

1. We keep no record of the previous shape of our types, so a migration is limited to what a single type can both deserialize and serialize.
2. Migrations are a fragile, ever-growing list of closures.
3. Initializing a brand new Stash means replaying every migration, starting from version 0.

We can fix problem 1 by maintaining some record of the previous types in the Stash (e.g. snapshotting), problem 2 by structuring our list of migrations in a more defined way, and problem 3 by creating a specific "initialize" step for the Stash.
Our overall goal is to create "fearless Stash migrations". In other words, we want to make it so we can assign an issue that requires a Stash migration to a new-hire, and have confidence that if our builds and tests are passing, then the migration won't break anything in production.
There have been several issues caused by Stash migrations:

- incident-47
The Stash is a time-varying key-value store, i.e. it stores collections of `(key, value, timestamp, diff)` tuples, which we use to persist metadata for a Materialize deployment. For example, we use the Stash to persist metadata for all MATERIALIZED VIEWS that a user has created. This way, when restarting an environment, we know which Materialized Views to recreate on Compute. Concretely, we use CockroachDB, and all keys and values are serialized as JSON for human readability.
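To make that data model concrete, here is a minimal sketch of what one update in a collection carries; the names are illustrative, not the actual `stash` crate API:

```rust
/// Illustrative only; the real `stash` crate types differ.
/// One update in a time-varying collection: at `timestamp`, the
/// multiplicity of the `(key, value)` pair changed by `diff`.
struct StashEntry<K, V> {
    key: K,
    value: V,
    timestamp: u64,
    /// +1 records an insertion of the pair, -1 a retraction.
    diff: i64,
}
```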
What makes Stash migrations hard to write and hard to reason about is that you need to define a single type that supports both deserializing the current Stash and serializing the new Stash. For example, if you want to add a field to a struct, you need to wrap the new field in an `Option<...>`. This way your struct can deserialize when the field doesn't exist, yet still provide the Stash with the new data. There is currently no way to do a migration like converting a type from a `u64` to an `enum`.
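As a concrete illustration, adding a field today looks something like this sketch (the `RoleValue` shape and the new field are simplified, not the real definitions):

```rust
use serde::{Deserialize, Serialize};

#[derive(Serialize, Deserialize)]
struct RoleValue {
    create_db: bool,
    // The new field must be an `Option` so that JSON written before the
    // field existed still deserializes; serde fills in `None` for
    // missing `Option` fields.
    create_cluster: Option<bool>,
}
```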
The proposal to fix this is to "snapshot" our types. This would allow us to represent, in a type-safe way, both the current data we're reading from and the new types we want to migrate to. Then we'd be able to do arbitrarily complex migrations from one type to another.
Concretely, what this means is having types like:

```rust
struct RoleValuesV5 {
    create_db: bool,
}

struct RoleValuesV6 {
    /// A bit flag of permissions this user is allowed.
    perms: u64,
}
```
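The migration then becomes an ordinary function between the two snapshotted types. A minimal sketch, where the specific permission bit is an assumption:

```rust
/// Hypothetical bit for the CREATE DATABASE permission.
const CREATE_DB: u64 = 1 << 0;

/// Convert a v5 role into a v6 role. Because both types exist
/// simultaneously, no `Option` wrapping is needed and the shape of the
/// data can change arbitrarily (here, bool -> bit flag).
fn migrate_role(old: RoleValuesV5) -> RoleValuesV6 {
    RoleValuesV6 {
        perms: if old.create_db { CREATE_DB } else { 0 },
    }
}
```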
Today migrations are a list of closures, and we index into this list based off of the current version we read from our database. This is fragile: the list must be ever increasing in length, we can't move the position of a migration in the list, and nothing indicates that a new migration needs to be added.
The proposal is to maintain a notion of a "current stash version" and restructure the migrations based on this version, for example:
```rust
const STASH_VERSION: u64 = 3;

async fn migrate(...) -> Result<...> {
    let version = stash.read_version().await?;
    const NEXT_VERSION: u64 = STASH_VERSION + 1;
    match version {
        0 => // migrate
        1 => // migrate
        2 => // migrate
        STASH_VERSION => return Ok(()),
        NEXT_VERSION.. => panic!("Traveled to the future!"),
    }
    stash.set_version(version + 1).await;
}
```
This way, if someone bumps `STASH_VERSION`, we'll fail to compile because there will be an unhandled case in the `match` statement, and the `match` statement also makes it easier to understand which migrations are running and when. To keep logic simple, we'd also assert an invariant that we only ever upgrade one version at a time, e.g. from `N` to `N + 1`; we would not support arbitrary upgrades from `N` to `N + x`.
Also, with the `match` statement we'd be able to add guards on the current build version, e.g.:

```rust
match version {
    // ... PSEUDO CODE
    3 if BUILD_NUMBER < "0.50" => // migrate
    3 => (),
    // ...
}
```
which is useful if we ever need to ensure that a migration runs for only a specific build.
Today when creating a Stash for the first time, we initialize it with a version of 0 and run it through all of the existing migrations, the first of which actually inserts the necessary initial data. This is problematic for three reasons:

1. Old migrations can never be deprecated or deleted, because brand new Stashes still depend on them.
2. The startup time of a brand new `environmentd` will continuously increase, since we need to run through all of the different migrations.
3. Initialization takes `STASH_VERSION` steps when it could be a single step.

The proposal is to create a specific "initialization" step for the Stash, that uses the "current" Stash types, and initializes it to the `STASH_VERSION` that is defined in part 2.
To snapshot the types we store in the Stash, we should define them in, and store them serialized as, protobufs.

Note: the Stash used to be serialized protobufs, but we changed the format to JSON for human readability. I believe that now, with the `stash-debug` tool, there is less of a reason to require the serialized format to be human readable. See materialize#14298 for the PR that changed it to JSON.
We would define the types we store in the Stash in `.proto` files, and we'd introduce the following file layout in the `stash` crate:

```
/stash
  /proto
    objects.proto
    /old
      objects_v1.proto
      objects_v2.proto
      ...
  /src
    ...
```

`stash/proto/objects.proto` would contain all of our "current" types, and the "old" types would exist inside of the `stash/proto/old` folder.
To generate Rust types from our `.proto` schemas, we'd have two steps:

1. Use `prost_build` to do the generation, and as part of the build process we'd specifically omit any "includes", which would force our `.proto` files to have no dependencies.
2. In the build script (i.e. `build.rs`) we'd also maintain a hash for the current version of the protos (i.e. `stash/proto/objects.proto`) and a hash for each of the snapshotted versions (see the sketch below). This would allow us to assert at build time that none of our protos have changed. We can also assert that we have the correct number of snapshots in `stash/proto/old` based on the `STASH_VERSION` number introduced in the solution to problem 2.

Protobufs also have a well-defined serialization format, so there is no risk of the serialization changing as long as the `.proto` files themselves don't change.
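A sketch of what that `build.rs` could look like. The digest table, file names, and the use of the `prost-build`, `sha2`, and `hex` crates are all assumptions for illustration:

```rust
// build.rs -- sketch only; digests and paths are placeholders.
use sha2::{Digest, Sha256};
use std::fs;

/// Expected digests of the snapshotted protos. If a snapshot is edited
/// after the fact, its digest changes and the build fails.
const SNAPSHOT_HASHES: &[(&str, &str)] = &[
    ("proto/old/objects_v1.proto", "<hex digest>"),
    ("proto/old/objects_v2.proto", "<hex digest>"),
];

fn main() {
    // Assert that no snapshotted proto has changed since it was taken.
    for (path, expected) in SNAPSHOT_HASHES {
        let bytes = fs::read(path).expect("snapshot must exist");
        let actual = hex::encode(Sha256::digest(&bytes));
        assert_eq!(actual, *expected, "{path} changed after being snapshotted");
    }

    // Generate Rust types. Note the single include path: the protos
    // cannot reference anything outside the stash crate's own schemas.
    prost_build::compile_protos(
        &[
            "proto/objects.proto",
            "proto/old/objects_v1.proto",
            "proto/old/objects_v2.proto",
        ],
        &["proto/"],
    )
    .expect("failed to compile protos");
}
```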
This setup has several benefits:

- `.proto` files cannot have any dependencies. To snapshot the "current" version, all you have to do is copy `objects.proto` and paste it into `/old/objects_vX.proto`.
- The build-time hashes guarantee that no one can change existing `.proto` definitions. This goes a long way to making the migrations "fearless". If someone changes the current proto definitions, changes the old versions, or bumps the `STASH_VERSION` without snapshotting, we can emit helpful build errors.
- All of these checks are enforced automatically by the `build.rs` script.

Along with maintaining `const STASH_VERSION` and using a `match` statement as described above, we'd set up a pattern of defining migrations in their own modules. Specifically we'd introduce the following layout in our repo:
```
/stash
  /src
    /migrations
      v0_to_v1.rs
      v1_to_v2.rs
      v2_to_v3.rs
      ...
    ...
  /proto
    ...
```
and our migration code would then look something like this:
```rust
const STASH_VERSION: u64 = 3;

async fn migrate(...) -> Result<...> {
    let version = stash.read_version().await?;
    const NEXT_VERSION: u64 = STASH_VERSION + 1;
    match version {
        0 => migrations::v0_to_v1::migrate(...).await?,
        1 => migrations::v1_to_v2::migrate(...).await?,
        2 => migrations::v2_to_v3::migrate(...).await?,
        STASH_VERSION => return Ok(()),
        NEXT_VERSION.. => panic!("Traveled to the future!"),
    }
    stash.set_version(version + 1).await;
}
```
There are a few benefits to putting each migration function in its own module:

- The only code that depends on the types generated from `object_v1.proto` would be the `v0_to_v1` and `v1_to_v2` modules. This helps prevent us from depending on the snapshotted versions anywhere besides the necessary migration paths.

Each migration function would be provided a `stash::Transaction`, which would allow us to open `StashCollection`s, which should provide enough flexibility to facilitate any migration.
Note: `StashCollection`s do not enforce uniqueness like `TableTransaction`s do; some more thought is needed here as to whether or not uniqueness checks are required for migrations. They probably are?
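As a sketch of what one of those modules could contain, assuming hypothetical `Transaction` and collection helper methods (the real stash crate API will differ):

```rust
// stash/src/migrations/v1_to_v2.rs -- sketch; the `Transaction` type
// and `rewrite_collection` method are assumed for illustration.
use crate::proto::old::{objects_v1, objects_v2};
use crate::{StashError, Transaction};

pub async fn migrate(tx: &mut Transaction) -> Result<(), StashError> {
    // This module (and v0_to_v1) are the only places allowed to name
    // the objects_v1 types.
    tx.rewrite_collection("roles", |old: objects_v1::RoleValue| {
        objects_v2::RoleValue {
            perms: if old.create_db { 1 } else { 0 },
        }
    })
    .await
}
```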
We add a new method to the Stash, `async fn initialize(...) -> Result<...>`, that will assert the Stash is empty and then populate it with our initial values. We already have logic to detect when a Stash has not been initialized; instead of returning a version of `0`, we would now call this new `initialize(...)` method. The method itself would be very similar, if not identical, to our first migration step; the new behavior is that it would initialize `"user_version"` to the `STASH_VERSION` defined in part 2.
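A sketch of that method, with the helper names assumed rather than taken from the real Stash API:

```rust
/// Sketch; `is_empty`, `insert_initial_objects`, and `set_user_version`
/// are assumed helpers, not the real Stash API.
async fn initialize(stash: &mut Stash) -> Result<(), StashError> {
    // Initialization must only ever run against a brand new Stash.
    assert!(stash.is_empty().await?, "tried to initialize a non-empty Stash");

    // Populate the initial objects using the *current* types...
    stash.insert_initial_objects().await?;

    // ...and stamp "user_version" with STASH_VERSION directly, skipping
    // every migration an upgrading environment would have to run.
    stash.set_user_version(STASH_VERSION).await?;
    Ok(())
}
```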
Creating this separate initialization step allows us to deprecate old migrations, e.g. the `migrate(...)` function would be able to look something like:
```rust
const MIN_STASH_VERSION: u64 = 3;
const STASH_VERSION: u64 = 5;

async fn migrate(...) -> Result<...> {
    let version = stash.read_version().await?;
    const NEXT_VERSION: u64 = STASH_VERSION + 1;
    match version {
        ..MIN_STASH_VERSION => panic!("Deprecated stash version!"),
        3 => migrations::v3_to_v4::migrate(...).await?,
        4 => migrations::v4_to_v5::migrate(...).await?,
        STASH_VERSION => return Ok(()),
        NEXT_VERSION.. => panic!("Traveled to the future!"),
    }
    stash.set_version(version + 1).await;
}
```
It also allows us to enforce the invariant of "once a Stash migration is written it should never be modified", and allows Stash initialization to be a single step instead of `STASH_VERSION` number of steps.
We need to write a migration from our JSON types today to this new migration framework that uses protobufs. I propose we do that in the following steps:

1. Define all of our current types in `objects.proto`, and snapshot this initial version as `objects_v15.proto`. (The current Stash version is 14.)
2. Move the existing JSON types out of `storage.rs` into a new `legacy_json_objects.rs`; these will only continue to exist to facilitate the migration to the new protobufs. Leave an extensive doc comment explaining as much.
3. Update the existing migrations to reference the types in `legacy_json_objects.rs`.
4. Bump `STASH_VERSION` to 15 and write a `v14_to_v15(...)` migration that migrates us from the types in `legacy_json_objects.rs` to the protos we snapshotted in `objects_v15.proto` (a sketch of this conversion follows below). Introduce the "initialization" step as described in part 3, so new users will immediately have `STASH_VERSION: 15`, which will contain the protobufs.
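The core of the `v14_to_v15(...)` migration is a per-collection conversion from a JSON-serialized legacy type to a protobuf-serialized one. A sketch for a single value, with placeholder types:

```rust
use prost::Message;

/// Sketch: `legacy::RoleValue` is a serde type from
/// legacy_json_objects.rs and `v15::RoleValue` is the prost type
/// generated from objects_v15.proto; both are placeholders.
fn convert_role(raw_json: &[u8]) -> anyhow::Result<Vec<u8>> {
    // Deserialize with the legacy JSON types...
    let old: legacy::RoleValue = serde_json::from_slice(raw_json)?;

    // ...map onto the snapshotted proto type...
    let new = v15::RoleValue { create_db: old.create_db };

    // ...and re-serialize as protobuf bytes.
    Ok(new.encode_to_vec())
}
```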
Note: There are two alternative approaches I thought of, but don't think are great:
- Before running any of the existing Stash migrations, switch everything to protobufs and rewrite the existing migrations using protos. I don't like this approach because we'd then have two version fields for the Stash, i.e. "version number" and "is proto", and we'd need to generate multiple `object_v11.proto`, `object_v12.proto`, etc. files that would be identical.
- Write a single `legacy_to_v15(...)` migration code path that handles upgrading from all of the existing versions of the Stash to v15. This wouldn't be too bad, but it does break the invariant we wanted to uphold of only ever upgrading one version at a time. With this approach we could theoretically upgrade from, say, v11 to v15.
At our current scale, I don't believe the benefit of partially rolling out a new format for the Stash outweighs the complexity of concurrently maintaining two separate implementations. To get testing coverage similar to what a partial rollout would provide, we can validate that the new Stash format and migrations work by running the `stash-debug` tool's `upgrade-check` command against select customer environments.
This change will be tested manually, via Rust tests, and via our existing testdrive upgrade tests. I don't believe there are any sqllogictest tests that would be able to exercise this change in a unique way.

We can test this change in the following ways:

- Start `environmentd` before this change, then restart `environmentd` after this change has been applied, and assert we're able to migrate successfully.
- Run the `stash-debug` tool's `upgrade-check` command to make sure customer environments would be able to successfully upgrade.

I believe testing methods 1 - 3 should block the rollout/merging of this change, while 4 - 6 are changes that can be done in the future, or fast-followed if deemed a high enough priority.
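Because each per-version migration is a plain function over plain types, the Rust tests can exercise them directly. A sketch, reusing the hypothetical v5-to-v6 conversion from earlier:

```rust
#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn v5_to_v6_maps_create_db_to_perm_bit() {
        let old = RoleValuesV5 { create_db: true };
        let new = migrate_role(old);
        assert_eq!(new.perms & CREATE_DB, CREATE_DB);
    }
}
```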
As mentioned above, I don't believe we should do a partial rollout of this change. Therefore we wouldn't have any feature flags, or any lifecycle for this change.
- Keep serializing the Stash as JSON so that it remains human readable. The `stash-debug` tool removes this requirement because the tool itself can deserialize the protobufs. Also, a possible benefit of requiring the use of `stash-debug` is that it discourages the use of tools like `psql` to connect directly to production data, which is scary because you could accidentally modify user data, unlike `stash-debug`, which protects against this.
- Write a `#[stash_object]` procedural macro that snapshots our Rust types for us. For example, a field annotated with `#[stash(introduced = 5)]` would be generated in all versions of the struct >= 5.
- Use `serde_json::Value` in the migration code path, to avoid snapshotting altogether. Migrations would check and modify the fields of raw `serde_json::Value` structs at runtime. The problem is that the Stash has a `trait Data` that all of our keys and values must implement. A requirement of `Data` is that the type must implement `Ord`, which seems to be required for consolidating rows. `serde_json::Value` is not orderable, and there isn't a good way to define how a generic JSON value should be ordered.
- Snapshot our types as Rust definitions, e.g. an `objects_vX.rs` with Rust types instead of an `objects_vX.proto`. Unlike `.proto` files, though, there is no easy way to guarantee that the Rust types have no dependencies or haven't changed, and it isn't clear which `objects_vX.rs` files these types would be in.

I believe this is the right design because of the following reasons:

- Snapshotting makes it possible to express arbitrarily complex migrations from a `TypeA` to a `TypeB`.

Overall I believe it achieves the goal of "fearless migrations", for the lowest possible engineering cost.
`environmentd` and `stash-debug` are the only two things that directly depend on the Stash. If application code wants to access the Stash, it needs to do so through `environmentd`.

Once `stash-debug` removes the need for the Stash to be human readable, that opens up the possibility of compressing the SQL statements stored in the `create_sql` column of the `"items"` collection with an algorithm like `brotli`. Theoretically this reduces the amount of data we're storing in CockroachDB and improves the speed of the Stash. However, CockroachDB already compresses data with `snappy`, so we probably wouldn't move the needle much if we also compressed the data ourselves.