There exists no -- or very limited -- visiblity into the body of statements that users are running against Materialize, either for us or for the users themselves. This creates several problems. Some examples, in no particular order:
Note that two meaningfully distinct features arise from these. The first is a user-visible statement log; that feature is the subject of this design doc ("Chapter One"). An upcoming design doc ("Chapter Two") will discuss the second feature: logging statements in our infrastructure, for our own benefit. It is thought that implementing Chapter One is a prerequisite for implementing Chapter Two.
SELECT
queries.
In the future, we may allow such durability as a specially feature
for users who request it, are willing to pay for it, and are also
willing to accept increased latency on the read path. It is not hard
to implement, but we would want to be careful about the tradeoffs
involves. That feature is, however, temporarily out of scope;
i.e., it is not discussed further in this document.The coordinator maintains a buffer of pending events related to
statement execution. Whenever a prepared statement is created, a
corresponding event is added to the buffer. Whenever a statement is
executed, a system var called statement_history_sample_rate
is
queried, and a random number between 0 and 1 is generated. If the
random number is greater than the sampling rate, a description of the
statement invocation is written into the buffer.
Every so often (controlled by statement_log_flush_interval
), the
buffer is converted to updates that are written out to
mz_statement_execution_history
and mz_prepared_statement_history
,
with a column in the former referencing the latter. These system
tables are persistent, but entries older than statement_log_max_age
are periodically deleted.
We will also create an mz_session_history
table that will track all
user sessions, so that we can track which statements were issued by
which users.
The coordinator maintains a buffer of pending events, which can be of the following variants:
enum StatementLogEvent {
StatementPrepared {
id: Uuid,
sql: String,
name: String, // "" for the default name
session_id: Uuid,
},
StatementExecutionBegan {
id: Uuid,
prepared_statement_id: Uuid,
params: Vec<Datum>,
occurred_at: DateTime,
// Additional metadata might eventually
// go here, e.g. `plan: Option<OptimizedMirRelationExpr>`
},
StatementExecutionEnded {
id: Uuid,
occurred_at: DateTime,
was_fast_path: bool,
result: StatementResult,
},
}
where StatementResult
is defined as follows:
enum StatementResult {
Canceled,
Error {
message: String,
},
Success {
// None for statements that don't return a result
rows_returned: Option<usize>,
}
}
The StatementPrepared
and StatementExecutionBegan
events are
generated and stored in the buffer in fn handle_execute
, after
checking a random value against statement_history_sample_rate
. This
requires keeping track of the prepared statement bound to each name,
and whether it has been logged yet (as we don't want to log each
statement several times). The StatementExecutionFinished
events are
generated in several different places, depending on the type of
statement. For example, the event for SELECT
queries is generated in
response to the ControllerResponse::PeekResponse
message.
The coordinator also holds on to an interval object whose tick rate is
controlled by statement_log_flush_interval
. It listens for ticks
on that interval in its main serve
loop; when such a tick occurs, it
emits data to the system tables, as follows:
StatementPrepared
event generates an entry in
mz_prepared_statement_history
.StatementExecutionBegan
event generates an entry in
mz_statement_execution_history
, and records that entry for
retraction as part of a later update.StatementExecutionEnded
event generates an update to the
entry described above.These entries are consolidated and then emitted to the builtin tables, and also to the catalog, ensuring durability.
On system restart, all entries in mz_statement_execution_history
older than statement_log_max_age
are deleted. Then "garbage
collection" is done to delete entries in
mz_prepared_statement_history
and mz_session_history
that are no
longer referenced.
The schemas of the tables are as follows. Note that deciding on a particular schema now doesn't preclude us from adding more fields in the future, if we decide we want more granular information about particular queries (for example, whether WMR was used).
mz_prepared_statement_history
Field | Type | Meaning |
---|---|---|
id |
uuid |
The unique identifier of the prepared statement. |
session_id |
uuid |
The ID of the session that prepared the statement. |
name |
text |
The name of the prepared statement. |
sql |
text |
The SQL text of the statement. |
prepared_at |
timestamp with time zone |
The timestamp at which the statement was prepared. |
mz_statement_execution_history
Field | Type | Meaning |
---|---|---|
id |
uuid |
The unique identifier of the statement execution. |
prepared_statement_id |
uuid |
The unique identifier of the corresponding item in mz_prepared_statement_history . |
sample_rate |
float8 |
The value of statement_history_sample_rate at the time of execution. |
params |
text list |
The textual representations of the supplied arguments. |
began_at |
timestamp with time zone |
The time at which the coordinator received the command and began executing the statement. |
finished_at |
timestamp with time zone |
The time at which the statement finished, or null if it is still running. |
was_successful |
bool |
Whether the statement terminated successfully, or null if it is still running. |
was_canceled |
bool |
Whether the statement was canceled, or null if it is still running. |
was_aborted |
bool |
Whether the statement was aborted; that is, environmentd crashed before execution finished. |
error_message |
text |
The error message. Non-null if and only if the statement finished without succeeding or being cancelled. |
rows_returned |
bigint |
How many rows the statement returned, or null if it is still running, errored or was canceled. |
was_fast_path |
bool |
Whether the statement executed on the fast path, or null if it is still running. |
mz_session_history
Field | Type | Meaning |
---|---|---|
id |
uuid |
The unique identifier of the session. |
application_name |
text |
The name of the application that created the session. |
user |
text |
The name of the user who owns the session. |
If we misjudge the size and volume of our typical users' queries, we will log much more data than we expect. Thus, rollout should proceed in two phases. First, we should enable the feature in "dry run" mode, which does not actually log any queries, but tracks (as a Prometheus metric) how much data would have been logged had the feature actually been turned on. Only once we validate that it meets expectations should we actually enable it. We should do something similar before adding any new columns to the tables (e.g., the plans).
For testing, we will probably need a way to make the random sampling
deterministic; this can be accomplished by adding a
statement_log_random_seed
system var that we set from within
testdrive or SLTs.
Log all executed statements, rather than sampling. This approach was rejected because users might come to rely on all statements being logged, which would require us making all statements durable, dramatically increasing the cost of high-QPS usage patterns. Instead, we will not allow setting the sampling rate above some value slightly less than 1, e.g. 0.99.