# Cluster replicas ## Summary This document proposes an explicit SQL API for managing [cluster replicas](../platform/ux.md#cluster-replica). ## Design ### SQL syntax and semantics The `ALTER CLUSTER` statement will be removed. The `CREATE CLUSTER` and `CREATE CLUSTER REPLICA` statement will be evolved as follows: ```sql create_cluster_stmt ::= CREATE CLUSTER [ [, ...]] create_cluster_replica_stmt ::= CREATE CLUSTER REPLICA . [ [, ...]] inline_replica ::= REPLICA [( [, ...])] replica_option ::= | SIZE '' | AVAILABILITY ZONE '' | REMOTE (['address' [, 'address' ...]]) ``` It will be valid to create a cluster with no replicas. Creating a replica will be the moment at which compute resources are provisioned. The `SIZE` and `REMOTE` options must not be specified together. The `AVAILABILITY ZONE` option is new. The set of allowable availability zones will be determined by a new `--availability-zone=AZ` command line option. If no availability zone is specified explicitly, and at least one availability zone is available, Materialize should automatically assign the availability zone with the least existing replicas. The `SIZE` option will be changed from the existing `SIZE` option to validate the set of allowable sizes against a new `--cluster-replica-sizes` command line option. The `DROP CLUSTER REPLICA` statement will be added: ``` DROP CLUSTER REPLICA [IF EXISTS] . ``` It will have the usual semantics for a `DROP` operation. The `SHOW CLUSTER REPLICAS` statement ``` SHOW CLUSTER REPLICAS [FOR ] [{ LIKE 'pattern' | WHERE }] ``` will list the replicas of the provided cluster, or of the current cluster if no cluster is specified. The results will be optionally filtered by the provided `LIKE` pattern or `WHERE` expression, which work analogously to the same clauses in the `SHOW SCHEMA` statement. ### System catalog changes A new `mz_cluster_replicas` table will be added with the following schema: Field | Type | Meaning ------------------|-----------|-------- cluster_id | `bigint` | The ID of the cluster. name | `text` | The name of the cluster replica. size | `text` | The size of the cluster replica. `NULL` for a remote cluster. availability_zone | `text` | The availability zone of the cluster replica, if any. May be `NULL`. status | `text` | The status of the replica. Either `unhealthy` or `healthy`. ### Controller changes #### Replica size The controller will use the replica size to determine the following properties of a cluster replica: * [Memory limit](https://dev.materialize.com/api/rust/mz_orchestrator/struct.ServiceConfig.html#structfield.memory_limit) * [CPU limit](https://dev.materialize.com/api/rust/mz_orchestrator/struct.ServiceConfig.html#structfield.cpu_limit) * [Scale](https://dev.materialize.com/api/rust/mz_orchestrator/struct.ServiceConfig.html#structfield.scale) * Timely worker threads, specified as the `--workers` argument to `computed` The mapping between replica size and the above properties will be determined by the new `--cluster-replica-sizes` comand line option, which accepts a JSON object that can be deserialized into `ClusterReplicaSizeMap`: ```rust #[derive(Deserialize)] struct ClusterReplicaSizeConfig { memory_limit: Option, cpu_limit: Option, scale: usize, workers: usize, } type ClusterReplicaSizeMap = HashMap; ``` The default mapping if unspecified will be: ```jsonc { "1": {"scale": 1, "workers": 1}, "2": {"scale": 1, "workers": 2}, "4": {"scale": 1, "workers": 4}, // ... "32": {"scale": 1, "workers": 32}, // Testing with multiple processes on a single machine is a novelty, so // we don't bother providing many options. "2-1": {"scale": 2, "workers": 1}, "2-2": {"scale": 2, "workers": 2}, "2-4": {"scale": 2, "workers": 4}, } ``` This is convenient for development in local process mode, where the memory and size limits don't have any effect. Materialize Cloud should use a mapping along the lines of the following: ```json { "small": {"cpu_limit": 2, "memory_limit": 8, "scale": 1, "workers": 1}, "medium": {"cpu_limit": 8, "memory_limit": 32, "scale": 1, "workers": 4}, "large": {"cpu_limit": 32, "memory_limit": 128, "scale": 1, "workers": 16}, "xlarge": {"cpu_limit": 128, "memory_limit": 512, "scale": 1, "workers": 64}, "2xlarge": {"cpu_limit": 128, "memory_limit": 512, "scale": 4, "workers": 64}, "3xlarge": {"cpu_limit": 128, "memory_limit": 512, "scale": 16, "workers": 64}, } ``` The actual names and limits are strawmen. The key point is that the sizes have friendly names that abstract away the details. Determining when to transition from scaling vertically to scaling horizontally is an open question that will need to be answered by running experiments on representative workloads. Prior work on benchmarking timely/differential in isolation showed performance improvements when switching to horizontal scalability before maxing out the resources on a single machine due to allocator overhead. #### Replica status The controller will need to inform the adapter when a replica changes status, so that the new status can be reflected in the `mz_cluster_replicas` table. The easiest way to do this is likely via a new `ComputeResponse::ReplicaStatusChange` variant. ### Orchestrator changes The `ServiceConfig` trait will be evolved with the option to specify an `availability_zone: Option`. When an availability zone is specified on a service, the `KubernetesOrchestrator` will attach a `nodeSelector` to the pods created by the service of `availability-zone=AZ`. Note that the availability zone for AWS should be the [stable AZ ID](https://docs.aws.amazon.com/ram/latest/userguide/working-with-az-ids.html), like `use1-az1`, not the name like `us-east-1a`, which can map to different physical AZs in different accounts. ### Cloud changes The Pulumi infrastructure will need to provide nodes groups in several availability zones in production. The environment controller will plumb the list of... available... availability zones to `materialized` via the new `--availability-zone` argument. ## Future work Manually managing replicas is good for power users who want the flexibility, but adds friction for average users. In the future, we intend to support auto-scaling and auto-replicating clusters, where users declare the desired properties of their replica set, e.g.: ``` CREATE CLUSTER foo REPLICATION FACTOR 3, MIN SIZE 'small', MAX SIZE 'xlarge' ``` and Materialize automatically turns on the appropriate number of replicas, scaling them up and down in size in response to observed load.