# Persist Inline Writes

- Associated: [#24832](https://github.com/MaterializeInc/materialize/pull/24832)

## The Problem

A non-empty persist write currently always requires a write to S3 and a write to CockroachDB. S3 writes are much slower than CRDB writes and incur a per-PUT charge, so for very small writes this is wasteful of both latency and cost.

Persist latencies directly impact the user experience of Materialize in a number of ways. The above waste is particularly egregious in DDL, which may serially perform a number of small persist writes.

## Success Criteria

Eligible persist writes have latency of `O(crdb write)` and not `O(s3 write)`.

## Out of Scope

- Latency guarantees: in particular, to keep persist metadata small, inline writes are an optimization; they are not guaranteed.
- Read latencies.

## Solution Proposal

Persist internally has the concept of a _part_, which corresponds 1:1:1 with a persist _blob_ and an _object_ in S3. Currently, all data in a persist shard is stored in S3, and a _hollow_ reference to it is stored in CRDB. This reference also includes metadata, such as pushdown statistics and the encoded size.

We make this reference instead a two-variant enum: `Hollow` and `Inline`. The `Inline` variant stores the same data as would be written to S3, but as an encoded protobuf. This protobuf is only decoded in data fetch paths and is otherwise passed around as opaque bytes to save allocations and CPU cycles. Pushdown statistics and the encoded size are both unnecessary for inline parts. (An illustrative sketch of this reference appears at the end of this document.)

The persist state stored in CRDB is a control plane concept, so there is both a performance and a stability risk from mixing the data plane into it. We reduce the inline parts over time by making compaction flush them out to S3, never inlining them. However, nothing prevents new writes from arriving faster than compaction can pull them out. We protect the control plane with the following two limits to create a hard upper bound on how much data can be inline:

- `persist_inline_writes_single_max_bytes`: An (exclusive) maximum size of a write that persist will inline in metadata.
- `persist_inline_writes_total_max_bytes`: An (inclusive) maximum total size of inline writes in metadata. Any attempted inline writes beyond this threshold instead fall through to the S3 path.

## Alternatives

- In addition to S3, also store _blobs_ in CRDB (or a third technology). CRDB [self-documents as not a good fit][crdb-large-blob] for workloads involving large binary blobs: "it's recommended to keep values under 1 MB to ensure adequate performance". Given that inline writes are small and directly translate to a CRDB write anyway, this would likely be a workable alternative. However, CRDB is already a dominant cost of persist, and a large part of that cost is a function of the rate of SQL values changed, regardless of the size or batching of those writes. Additionally, putting the writes inline in persist state allows pubsub to push them around for us, meaning that a reader that is subscribed/listening to the shard doesn't hit the network when it gets the new seqno. A third technology would not be worth the additional operational burden.
- Make S3 faster. This is not actionable in the short term.

[crdb-large-blob]: https://www.cockroachlabs.com/docs/stable/bytes#size

## Open Questions

- How do we tune the two thresholds?
- Should every inline write result in a compaction request for the new batch, immediately flushing it out to S3?
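
## Appendix: Illustrative Sketch

The Rust sketch below shows the shape of the two-variant part reference and the two inline limits from the Solution Proposal. All names, fields, and the `should_inline` helper are illustrative assumptions for this document, not the actual persist types or APIs.

```rust
/// A reference to one part of a persist batch, as stored in consensus (CRDB).
/// Illustrative only: the real persist types and field names differ.
#[allow(dead_code)]
enum BatchPartRef {
    /// The common case: the data lives in S3 and consensus stores only a
    /// hollow pointer plus metadata (pushdown stats, encoded size).
    Hollow {
        blob_key: String,
        encoded_size_bytes: usize,
        stats: Option<PartStats>,
    },
    /// The small-write case: the encoded updates are stored inline in
    /// consensus as opaque protobuf bytes and decoded only on fetch paths.
    /// Pushdown stats and encoded size are unnecessary here.
    Inline { updates: EncodedUpdates },
}

/// Already-encoded proto bytes, passed around without decoding to save
/// allocations and CPU cycles.
#[allow(dead_code)]
struct EncodedUpdates(Vec<u8>);

/// Placeholder for pushdown statistics.
#[allow(dead_code)]
struct PartStats;

/// Decide whether a single write may be inlined, enforcing both limits.
fn should_inline(
    write_bytes: usize,
    inline_bytes_already_in_state: usize,
    single_max_bytes: usize, // persist_inline_writes_single_max_bytes (exclusive)
    total_max_bytes: usize,  // persist_inline_writes_total_max_bytes (inclusive)
) -> bool {
    write_bytes < single_max_bytes
        && inline_bytes_already_in_state + write_bytes <= total_max_bytes
}

fn main() {
    // A 1 KiB write inlines under a 4 KiB single limit and 1 MiB total limit;
    // a write that would push the inline total past the cap falls through to S3.
    assert!(should_inline(1 << 10, 0, 4 << 10, 1 << 20));
    assert!(!should_inline(1 << 10, 1 << 20, 4 << 10, 1 << 20));
}
```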