Platform v2: A Modularized/Decoupled Storage Controller
Context
As part of the platform v2 work (specifically use-case isolation), we want to
develop a scalable and isolated serving layer that is made up of multiple
processes that interact with distributed primitives only at the moments where
coordination is required.
The way the compute controller and storage controller currently work will not
work well for that, because they are coupled together inside one process.
Goals
Overarching/long-term goal: make it possible to run controller code distributed
across different processes. For example, one compute controller per cluster,
running local to that cluster.
Finer-grained goals:
- Pull out the part of StorageController that ComputeController needs. Let each
ComputeController have its own "handle". We will call this part
StorageCollections.
- Pull out the part of StorageController that deals with table
writes/transactions. Make it so multiple processes can hold a "handle" to
this. We will call this part TableWriter.
- Make the rest of StorageController (the part that deals with running
computation on a cluster) more like ComputeController, where each cluster has
its own piece of StorageController managing its storage-flavored computation.
Non-Goals
- Actually separate things out into different processes.
Context/Current Architecture
- We want to have Coordinator-like things running in different processes. We
want each cluster's (compute) controller to be separate from other
controllers, running in separate processes.
- Currently, the compute controller is given a "mutable reference" to the
storage controller whenever it needs to do any work.
- The storage controller has come to be responsible for three things that can
and should be separated out (see the sketch after this list):
- StorageCollections: holds on to since handles and tells a compute
controller at what time collections are readable, when asked. Allows
installing read holds.
- TableWriter: facilitates writes to tables. Could also be called TxnWriter.
- StorageController: manages storage-flavored computation running on a
cluster. Think sources and sinks. This is the same as ComputeController,
but for running storage-flavored computation on a cluster.
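To make the intended split concrete, here is a minimal Rust sketch of the three pieces as separate traits. All names, types, and signatures here are hypothetical stand-ins rather than the actual Materialize APIs; the sketch only illustrates how the responsibilities could be divided.
```rust
use std::collections::BTreeMap;

// Hypothetical stand-ins for the real identifier and timestamp types.
type GlobalId = u64;
type Timestamp = u64;

/// A handle that releases its read hold when dropped (hypothetical).
struct ReadHold;

/// What the compute controller needs: read-side metadata and read holds.
trait StorageCollections {
    /// The time(s) at which a collection is readable.
    fn read_frontier(&self, id: GlobalId) -> Vec<Timestamp>;
    /// Install a read hold so the since of `id` cannot advance past `since`.
    fn acquire_read_hold(&mut self, id: GlobalId, since: Timestamp) -> ReadHold;
}

/// Table writes/transactions; multiple processes can hold a handle to this.
trait TableWriter {
    /// Atomically append updates to a set of tables at `write_ts`.
    fn append(&mut self, updates: BTreeMap<GlobalId, Vec<(String, i64)>>, write_ts: Timestamp);
}

/// Per-cluster management of storage-flavored computation (sources and sinks).
trait StorageController {
    fn create_ingestion(&mut self, id: GlobalId);
    fn drop_ingestion(&mut self, id: GlobalId);
}
```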
Overview
- We need to take apart the naturally grown responsibilities of the current
StorageController to make it fit into a distributed, post-platform-v2 world.
- StorageCollections and TableWriter will become, roughly, thin clients that use
internal "widgets" to do distributed coordination:
- For StorageCollections, those widgets are persist since handles for holding a
since, and other handles for learning about upper advancement.
- For TableWriter, that widget is persist-txn. But there is more work in that
area; specifically, we need to replace the current process-level write lock.
StorageCollections
- The compute controller only needs a StorageCollections, which has a much
reduced interface compared to the current StorageController, greatly reducing
the surface area/coupling between the two.
- We will revive an old idea where we use persist directly to learn about the
upper of collections: the StorageCollections does not rely on feedback from
the cluster to learn about uppers. It uses persist, and drives its since
handles forward based on that.
- Each compute controller is given its own "instance" of a StorageCollections,
which it owns (see the usage sketch after this list).
- A StorageCollections will not initially acquire since handles for all
collections but only for those in which the compute controller expresses an
interest.
- In the past, we were hesitant about this approach because we didn't want to
regularly poll persist for uppers. Now, with persist pubsub, that won't be a
problem anymore. Persist pubsub will essentially become the fabric that ships
upper information around a whole environment, between different processes.
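As a usage sketch, assuming the hypothetical StorageCollections trait from the sketch above: each compute controller owns its own instance and only expresses interest in (and thereby ends up holding since handles for) the collections its dataflows actually read.
```rust
/// Hypothetical: a compute controller that owns its own StorageCollections.
struct ComputeController {
    storage: Box<dyn StorageCollections>,
}

impl ComputeController {
    fn create_dataflow(&mut self, inputs: &[GlobalId]) {
        for &id in inputs {
            // Expressing interest lazily: only now would the StorageCollections
            // open a since handle for this collection and start tracking its
            // upper via persist pubsub, rather than holding handles for every
            // collection in the environment up front.
            let frontier = self.storage.read_frontier(id);
            let as_of = frontier.first().copied().unwrap_or_default();
            let _hold = self.storage.acquire_read_hold(id, as_of);
            // ... the hold would be handed to the dataflow so it is released
            // when the dataflow is dropped ...
        }
    }
}
```
The relevant property is that the read hold is owned by this one compute controller, so acquiring and releasing it never requires coordinating with a central storage controller.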
Advantages of this approach:
- Reduced coupling/much reduced surface area between ComputeController and
StorageCollections.
- Makes use-case isolation/distributed processes possible!
Implications:
- In the limit, it can now happen that we hold num_collection * num_clusters
since handles, where before it was only num_collection handles.
- More load on persist pubsub.
- One could argue there would be more latency in learning about a new upper,
but I don't think that's a valid concern: persist pubsub has so far proven to
be low latency in practice. And there's motivation to invest in fixing issues
with persist pubsub because it is used in all places where persist shards are
being read.
TableWriter
We can move table-related things out into a TableWriter because the
StorageController doesn't do much with/to tables, except:
- Learn about upper updates through a channel: when the adapter writes to
tables, an update gets sent through a channel, and the StorageController
absorbs those updates similarly to how it absorbs upper updates from running
ingestions.
- Acquire since holds when sinking tables to an export.
For both of these use cases, the StorageController can be given access to a
StorageCollections and acquire read holds the same as everyone else (the same
as compute and the adapter). Upper updates will no longer have to flow through
a special channel; the StorageCollections will keep uppers/sinces up to date
the same way as for other collections: through persist pubsub.
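A minimal sketch of the export case, again using the hypothetical types from the earlier sketches: the StorageController no longer receives table uppers over a dedicated channel, it simply asks its StorageCollections (which tracks uppers/sinces via persist pubsub) and takes a read hold like any other client.
```rust
/// Hypothetical: setting up an export that sinks a table. There is nothing
/// table-specific here; the read hold is acquired the same way compute and
/// the adapter acquire theirs.
fn create_table_export(
    storage_collections: &mut dyn StorageCollections,
    table_id: GlobalId,
) -> ReadHold {
    // Pick an as-of from the readable frontier reported by StorageCollections.
    let frontier = storage_collections.read_frontier(table_id);
    let as_of = frontier.first().copied().unwrap_or_default();
    // Hold the since back while the export catches up.
    storage_collections.acquire_read_hold(table_id, as_of)
}
```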
StorageController (the per-cluster part)
If we want to achieve full physical use-case isolation, where we have the
serving work (and therefore also the controller work) of an environment split
across multiple processes and not one centralized environmentd, we also need
StorageController to work in that world. That is, it needs to become more like
ComputeController, where there is a per-cluster controller and not one
monolithic controller inside environmentd.
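To illustrate the target shape (a hypothetical sketch building on the trait definitions above, not current code): storage control becomes per-cluster state, instantiated alongside the compute controller for each cluster, instead of a single monolithic controller inside environmentd.
```rust
// Hypothetical stand-ins.
type ClusterId = u64;
struct ComputeControllerHandle;

/// Per-cluster controllers: each cluster gets its own storage-flavored
/// controller and its own StorageCollections handle.
struct ClusterControllers {
    compute: ComputeControllerHandle,
    storage: Box<dyn StorageController>,
    collections: Box<dyn StorageCollections>,
}

/// In the fully distributed world these would live in separate processes;
/// the map here only illustrates the per-cluster shape.
struct Controllers {
    per_cluster: std::collections::BTreeMap<ClusterId, ClusterControllers>,
}
```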
Rollout
- We can immediately get started on factoring out TableWriter and
StorageCollections.
- Remodeling StorageController will come as a next step, but needs to happen
for fully-realized use-case isolation.
We only need #1 for use-case isolation Milestone 2, where we want
better-isolated components and use-case isolation within the single
environmentd.
For Milestone 3, full physical use-case isolation, we also need #2.
Alternatives
Centralized StorageController and RPC
We can keep a centralized StorageController that runs as a singleton in one
process. Whenever other processes want to, for example, acquire or release read
holds, they have to talk to this process via RPC.
Arguments against this alternative:
- We need to worry about processes timing out and then, for example, not
releasing read holds.
- We would introduce a special-case RPC protocol and a new service while we
already have persist pubsub as a general-purpose fabric that works for our
purposes.
- Using SinceHandles (and other resources) for each cluster makes it clearer
who is holding on to things.
Open questions
None so far.