
# Cluster replicas

## Summary

This document proposes an explicit SQL API for managing cluster replicas.

## Design

### SQL syntax and semantics

The `ALTER CLUSTER` statement will be removed.

The `CREATE CLUSTER` and `CREATE CLUSTER REPLICA` statements will be evolved as follows:

```
create_cluster_stmt ::=
  CREATE CLUSTER <name>
    [<inline_replica> [, <inline_replica> ...]]

create_cluster_replica_stmt ::=
  CREATE CLUSTER REPLICA <cluster_name>.<replica_name>
    [<replica_option> [, <replica_option> ...]]

inline_replica ::= REPLICA <name> [(<replica_option> [, <replica_option> ...])]

replica_option ::=
    | SIZE '<size>'
    | AVAILABILITY ZONE '<az>'
    | REMOTE (['<address>' [, '<address>' ...]])
```

It will be valid to create a cluster with no replicas. Creating a replica will be the moment at which compute resources are provisioned.
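
As a sketch of the intended usage (the cluster names, replica names, and sizes below are illustrative, not defaults mandated by this design):

```sql
-- Create a cluster with no replicas; no compute resources are provisioned yet.
CREATE CLUSTER prod;

-- Provisioning happens when a replica is created.
CREATE CLUSTER REPLICA prod.r1 SIZE 'medium';

-- Alternatively, declare replicas inline when creating the cluster.
CREATE CLUSTER staging REPLICA r1 (SIZE 'small'), REPLICA r2 (SIZE 'small');
```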

The `SIZE` and `REMOTE` options must not be specified together.

The `AVAILABILITY ZONE` option is new. The set of allowable availability zones will be determined by a new `--availability-zone=AZ` command line option. If no availability zone is specified explicitly, and at least one availability zone is available, Materialize should automatically assign the replica to the availability zone with the fewest existing replicas.
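
For example, a replica can be pinned to a specific zone (the zone ID here is illustrative), while replicas created without the option are spread across the configured zones automatically:

```sql
-- Explicitly place the replica in a particular availability zone.
CREATE CLUSTER REPLICA prod.r2 SIZE 'medium', AVAILABILITY ZONE 'use1-az2';

-- No zone specified: Materialize picks the zone with the fewest replicas.
CREATE CLUSTER REPLICA prod.r3 SIZE 'medium';
```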

The `SIZE` option carries over from the existing `SIZE` option, except that the specified size will now be validated against the set of allowable sizes configured by a new `--cluster-replica-sizes` command line option.

The `DROP CLUSTER REPLICA` statement will be added:

```
DROP CLUSTER REPLICA [IF EXISTS] <cluster_name>.<replica_name>
```

It will have the usual semantics for a DROP operation.
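
For example, using the illustrative names from above:

```sql
DROP CLUSTER REPLICA IF EXISTS prod.r2;
```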

The `SHOW CLUSTER REPLICAS` statement

```
SHOW CLUSTER REPLICAS [FOR <cluster>] [{ LIKE 'pattern' | WHERE <expr> }]
```

will list the replicas of the provided cluster, or of the current cluster if no cluster is specified. The results can optionally be filtered by the provided `LIKE` pattern or `WHERE` expression, which work analogously to the same clauses in the `SHOW SCHEMAS` statement.
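
For example, to list the replicas of a hypothetical prod cluster whose names start with `r`:

```sql
SHOW CLUSTER REPLICAS FOR prod LIKE 'r%';
```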

### System catalog changes

A new `mz_cluster_replicas` table will be added with the following schema:

| Field               | Type     | Meaning                                                               |
|---------------------|----------|-----------------------------------------------------------------------|
| `cluster_id`        | `bigint` | The ID of the cluster the replica belongs to.                         |
| `name`              | `text`   | The name of the cluster replica.                                      |
| `size`              | `text`   | The size of the cluster replica. `NULL` for remote replicas.          |
| `availability_zone` | `text`   | The availability zone of the cluster replica, if any. May be `NULL`.  |
| `status`            | `text`   | The status of the replica: either `healthy` or `unhealthy`.           |
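
With that table in place, replica health can be inspected with ordinary SQL; for instance (a sketch, assuming the statuses shown above):

```sql
-- List replicas that are currently reported as unhealthy.
SELECT cluster_id, name, size, availability_zone
FROM mz_cluster_replicas
WHERE status = 'unhealthy';
```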

### Controller changes

#### Replica size

The controller will use the replica size to determine the memory limit, CPU limit, scale (i.e., the number of processes), and number of worker threads per process for a cluster replica.

The mapping between replica size and these properties will be determined by the new `--cluster-replica-sizes` command line option, which accepts a JSON object that can be deserialized into a `ClusterReplicaSizeMap`:

```rust
use std::collections::HashMap;

use serde::Deserialize;

/// The resource limits and topology for a single replica size.
#[derive(Deserialize)]
struct ClusterReplicaSizeConfig {
    memory_limit: Option<MemoryLimit>,
    cpu_limit: Option<CpuLimit>,
    /// The number of processes in the replica.
    scale: usize,
    /// The number of worker threads per process.
    workers: usize,
}

/// Maps a size name (e.g., "small") to its configuration.
type ClusterReplicaSizeMap = HashMap<String, ClusterReplicaSizeConfig>;
```

The default mapping if unspecified will be:

```
{
    "1": {"scale": 1, "workers": 1},
    "2": {"scale": 1, "workers": 2},
    "4": {"scale": 1, "workers": 4},
    // ...
    "32": {"scale": 1, "workers": 32},
    // Testing with multiple processes on a single machine is a novelty, so
    // we don't bother providing many options.
    "2-1": {"scale": 2, "workers": 1},
    "2-2": {"scale": 2, "workers": 2},
    "2-4": {"scale": 2, "workers": 4},
}
```

This is convenient for development in local process mode, where the memory and CPU limits don't have any effect.

Materialize Cloud should use a mapping along the lines of the following:

```
{
    "small": {"cpu_limit": 2, "memory_limit": 8, "scale": 1, "workers": 1},
    "medium": {"cpu_limit": 8, "memory_limit": 32, "scale": 1, "workers": 4},
    "large": {"cpu_limit": 32, "memory_limit": 128, "scale": 1, "workers": 16},
    "xlarge": {"cpu_limit": 128, "memory_limit": 512, "scale": 1, "workers": 64},
    "2xlarge": {"cpu_limit": 128, "memory_limit": 512, "scale": 4, "workers": 64},
    "3xlarge": {"cpu_limit": 128, "memory_limit": 512, "scale": 16, "workers": 64},
}
```

The actual names and limits are strawmen. The key point is that the sizes have friendly names that abstract away the details.

Determining when to transition from scaling vertically to scaling horizontally is an open question that will need to be answered by running experiments on representative workloads. Prior work benchmarking timely/differential in isolation showed that, due to allocator overhead, performance improves when switching to horizontal scaling before maxing out the resources of a single machine.

#### Replica status

The controller will need to inform the adapter when a replica changes status, so that the new status can be reflected in the `mz_cluster_replicas` table. The easiest way to do this is likely via a new `ComputeResponse::ReplicaStatusChange` variant.
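
A minimal sketch of such a variant follows; the variant name comes from this design, but the replica identifier, status type, and field names are assumptions for illustration only:

```rust
/// Illustrative status type; the catalog exposes these as `healthy`/`unhealthy`.
pub enum ReplicaStatus {
    Healthy,
    Unhealthy,
}

pub enum ComputeResponse {
    // ... existing variants elided ...
    /// Reports that the identified replica transitioned to a new status.
    ReplicaStatusChange {
        replica_id: u64, // assumed identifier type
        status: ReplicaStatus,
    },
}
```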

### Orchestrator changes

The `ServiceConfig` struct will be evolved with the option to specify an `availability_zone: Option<String>`.

When an availability zone is specified for a service, the `KubernetesOrchestrator` will attach a node selector of `availability-zone=AZ` to the pods created for the service.

Note that the availability zone for AWS should be the stable AZ ID, like `use1-az1`, not the name, like `us-east-1a`, which can map to different physical AZs in different accounts.

### Cloud changes

The Pulumi infrastructure will need to provide node groups in several availability zones in production. The environment controller will plumb the list of available availability zones to materialized via the new `--availability-zone` argument.

## Future work

Manually managing replicas is good for power users who want the flexibility, but adds friction for average users. In the future, we intend to support auto-scaling and auto-replicating clusters, where users declare the desired properties of their replica set, e.g.:

```sql
CREATE CLUSTER foo REPLICATION FACTOR 3, MIN SIZE 'small', MAX SIZE 'xlarge'
```

and Materialize automatically turns on the appropriate number of replicas, scaling them up and down in size in response to observed load.