This document proposes an explicit SQL API for managing cluster replicas. The `ALTER CLUSTER` statement will be removed. The `CREATE CLUSTER` and `CREATE CLUSTER REPLICA` statements will be evolved as follows:
```
create_cluster_stmt ::=
  CREATE CLUSTER <name>
  [<inline_replica> [, <inline_replica> ...]]

create_cluster_replica_stmt ::=
  CREATE CLUSTER REPLICA <cluster_name>.<replica_name>
  [<replica_option> [, <replica_option> ...]]

inline_replica ::= REPLICA <name> [(<replica_option> [, <replica_option> ...])]

replica_option ::=
  | SIZE '<size>'
  | AVAILABILITY ZONE '<az>'
  | REMOTE (['address' [, 'address' ...]])
```
It will be valid to create a cluster with no replicas. Creating a replica will be the moment at which compute resources are provisioned.
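As a sketch of the intended usage (cluster and replica names and sizes are illustrative):

```sql
-- Provision a cluster with two replicas up front.
CREATE CLUSTER foo REPLICA r1 (SIZE 'small'), REPLICA r2 (SIZE 'small');

-- Create an empty cluster; no compute resources are provisioned yet.
CREATE CLUSTER bar;

-- Compute resources for bar are provisioned at this point.
CREATE CLUSTER REPLICA bar.r1 SIZE 'medium';
```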
The `SIZE` and `REMOTE` options must not be specified together.
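For illustration (replica names and addresses are hypothetical), a replica is either sized by Materialize or points at manually managed `computed` processes, never both:

```sql
-- Managed replica: Materialize provisions resources of the named size.
CREATE CLUSTER REPLICA foo.managed SIZE 'large';

-- Unmanaged replica: attach externally managed computed processes.
CREATE CLUSTER REPLICA foo.unmanaged REMOTE ('computed-1:2100', 'computed-2:2100');

-- Invalid: SIZE and REMOTE may not be combined.
-- CREATE CLUSTER REPLICA foo.bad SIZE 'large', REMOTE ('computed-1:2100');
```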
The `AVAILABILITY ZONE` option is new. The set of allowable availability zones will be determined by a new `--availability-zone=AZ` command line option. If no availability zone is specified explicitly, and at least one availability zone is available, Materialize should automatically assign the availability zone with the fewest existing replicas.
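For example (the AZ ID is illustrative), the option can be given explicitly or omitted to let Materialize choose:

```sql
-- Pin the replica to a specific availability zone.
CREATE CLUSTER REPLICA foo.r3 SIZE 'medium', AVAILABILITY ZONE 'use1-az1';

-- Omit the option; Materialize assigns the allowable zone with the fewest
-- existing replicas.
CREATE CLUSTER REPLICA foo.r4 SIZE 'medium';
```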
The `SIZE` option will be changed from the existing `SIZE` option to validate the specified size against the set of allowable sizes configured by a new `--cluster-replica-sizes` command line option.
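For example, if `--cluster-replica-sizes` defines only sizes like 'small' and 'medium', a request for an unknown size would be rejected at planning time (the error text here is hypothetical):

```sql
CREATE CLUSTER REPLICA foo.r5 SIZE 'gigantic';
-- ERROR: unknown cluster replica size 'gigantic'
```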
The `DROP CLUSTER REPLICA` statement will be added:

```
DROP CLUSTER REPLICA [IF EXISTS] <cluster-name>.<replica-name>
```

It will have the usual semantics for a `DROP` operation.
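For example:

```sql
-- Remove a replica; IF EXISTS suppresses the error if it does not exist.
DROP CLUSTER REPLICA IF EXISTS foo.r1;
```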
The `SHOW CLUSTER REPLICAS` statement

```
SHOW CLUSTER REPLICAS [FOR <cluster>] [{ LIKE 'pattern' | WHERE <expr> }]
```

will list the replicas of the provided cluster, or of the current cluster if no cluster is specified. The results will be optionally filtered by the provided `LIKE` pattern or `WHERE` expression, which work analogously to the same clauses in the `SHOW SCHEMAS` statement.
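For example, using the cluster from the earlier sketches:

```sql
-- List the replicas of the current cluster.
SHOW CLUSTER REPLICAS;

-- List the replicas of cluster foo whose names start with 'r'.
SHOW CLUSTER REPLICAS FOR foo LIKE 'r%';
```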
A new `mz_cluster_replicas` table will be added with the following schema:

Field | Type | Meaning
------|------|--------
`cluster_id` | `bigint` | The ID of the cluster.
`name` | `text` | The name of the cluster replica.
`size` | `text` | The size of the cluster replica. `NULL` for a remote replica.
`availability_zone` | `text` | The availability zone of the cluster replica, if any. May be `NULL`.
`status` | `text` | The status of the replica. Either `unhealthy` or `healthy`.
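Because `mz_cluster_replicas` is queryable like any other relation, operators can inspect replica health directly. A hypothetical query:

```sql
-- Find replicas that are currently unhealthy, along with their placement.
SELECT cluster_id, name, size, availability_zone
FROM mz_cluster_replicas
WHERE status = 'unhealthy';
```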
The controller will use the replica size to determine the following properties of a cluster replica:

* the memory limit for each process
* the CPU limit for each process
* the number of processes (the scale)
* the `--workers` argument to `computed`
The mapping between replica size and the above properties will be determined by the new `--cluster-replica-sizes` command line option, which accepts a JSON object that can be deserialized into a `ClusterReplicaSizeMap`:
```rust
use std::collections::HashMap;
use serde::Deserialize;

// `MemoryLimit` and `CpuLimit` are assumed to be defined elsewhere.
#[derive(Deserialize)]
struct ClusterReplicaSizeConfig {
    memory_limit: Option<MemoryLimit>,
    cpu_limit: Option<CpuLimit>,
    scale: usize,
    workers: usize,
}

// Maps a size name (e.g., "small") to its configuration.
type ClusterReplicaSizeMap = HashMap<String, ClusterReplicaSizeConfig>;
```
The default mapping if unspecified will be:
```
{
    "1": {"scale": 1, "workers": 1},
    "2": {"scale": 1, "workers": 2},
    "4": {"scale": 1, "workers": 4},
    // ...
    "32": {"scale": 1, "workers": 32},
    // Testing with multiple processes on a single machine is a novelty, so
    // we don't bother providing many options.
    "2-1": {"scale": 2, "workers": 1},
    "2-2": {"scale": 2, "workers": 2},
    "2-4": {"scale": 2, "workers": 4},
}
```
This is convenient for development in local process mode, where the memory and CPU limits don't have any effect.
Materialize Cloud should use a mapping along the lines of the following:
```
{
    "small": {"cpu_limit": 2, "memory_limit": 8, "scale": 1, "workers": 1},
    "medium": {"cpu_limit": 8, "memory_limit": 32, "scale": 1, "workers": 4},
    "large": {"cpu_limit": 32, "memory_limit": 128, "scale": 1, "workers": 16},
    "xlarge": {"cpu_limit": 128, "memory_limit": 512, "scale": 1, "workers": 64},
    "2xlarge": {"cpu_limit": 128, "memory_limit": 512, "scale": 4, "workers": 64},
    "3xlarge": {"cpu_limit": 128, "memory_limit": 512, "scale": 16, "workers": 64},
}
```
The actual names and limits are strawmen. The key point is that the sizes have friendly names that abstract away the details.
Determining when to transition from scaling vertically to scaling horizontally is an open question that will need to be answered by running experiments on representative workloads. Prior work on benchmarking timely/differential in isolation showed that, due to allocator overhead, performance improves when switching to horizontal scaling before maxing out the resources on a single machine.
The controller will need to inform the adapter when a replica changes status, so that the new status can be reflected in the `mz_cluster_replicas` table. The easiest way to do this is likely via a new `ComputeResponse::ReplicaStatusChange` variant.
The `ServiceConfig` trait will be evolved with the option to specify an `availability_zone: Option<String>`.
When an availability zone is specified on a service, the `KubernetesOrchestrator` will attach a `nodeSelector` of `availability-zone=AZ` to the pods created by the service.
Note that the availability zone for AWS should be the stable AZ ID, like `use1-az1`, not the name, like `us-east-1a`, which can map to different physical AZs in different accounts.
The Pulumi infrastructure will need to provide node groups in several availability zones in production. The environment controller will plumb the list of available availability zones to `materialized` via the new `--availability-zone` argument.
Manually managing replicas is good for power users who want the flexibility, but adds friction for average users. In the future, we intend to support auto-scaling and auto-replicating clusters, where users declare the desired properties of their replica set, e.g.:
```sql
CREATE CLUSTER foo REPLICATION FACTOR 3, MIN SIZE 'small', MAX SIZE 'xlarge'
```
and Materialize automatically turns on the appropriate number of replicas, scaling them up and down in size in response to observed load.