# Statement Logging (Chapter One) ## Context There exists no -- or very limited -- visiblity into the body of statements that users are running against Materialize, either for us or for the users themselves. This creates several problems. Some examples, in no particular order: * We can't allocate engineering resources effectively. For example, we don't know which query patterns are particularly common and thus that we should put particular effort into optimizing. * Administrators in large organizations have no way to see what queries people (or machines) in their organization are commonly issuing * If a user's cluster is crashing, support (or the users themselves) are not able to attempt to debug by checking whether a particularly expensive or dangerous query was run recently. Note that two meaningfully distinct features arise from these. The first is a user-visible statement log; that feature is the subject of this design doc ("Chapter One"). An upcoming design doc ("Chapter Two") will discuss the second feature: logging statements in _our_ infrastructure, for our own benefit. It is thought that implementing Chapter One is a prerequisite for implementing Chapter Two. ## Goals * Administrators should be able to access a table that records a representative sample of the statements issued in the recent past (e.g., last 30 days), along with other relevant information (e.g., timestamp information, time to return, current cluster, current user, plan information, etc.) ## Non-Goals * Chapter Two (getting the queries into our systems for our own analytics) is explicitly out of scope of _this_ design, but is necessary to unlock its full value, and so should be done as a follow-up ASAP. * 100% durability of queries for all customers is permanently out of scope. We have to accept the possibility of losing data under adverse conditions. The reason for this design choice is that otherwise we would have to persist the text of every statement (in either Cockroach or S3), which we consider unacceptable for `SELECT` queries. In the future, we may allow such durability as a specially feature for users who request it, are willing to pay for it, and are also willing to accept increased latency on the read path. It is not hard to implement, but we would want to be careful about the tradeoffs involves. That feature is, however, _temporarily_ out of scope; i.e., it is not discussed further in this document. ## Overview The coordinator maintains a buffer of pending events related to statement execution. Whenever a prepared statement is created, a corresponding event is added to the buffer. Whenever a statement is executed, a system var called `statement_history_sample_rate` is queried, and a random number between 0 and 1 is generated. If the random number is greater than the sampling rate, a description of the statement invocation is written into the buffer. Every so often (controlled by `statement_log_flush_interval`), the buffer is converted to updates that are written out to `mz_statement_execution_history` and `mz_prepared_statement_history`, with a column in the former referencing the latter. These system tables are persistent, but entries older than `statement_log_max_age` are periodically deleted. We will also create an `mz_session_history` table that will track all user sessions, so that we can track which statements were issued by which users. ## Detailed description The coordinator maintains a buffer of pending events, which can be of the following variants: ``` rust enum StatementLogEvent { StatementPrepared { id: Uuid, sql: String, name: String, // "" for the default name session_id: Uuid, }, StatementExecutionBegan { id: Uuid, prepared_statement_id: Uuid, params: Vec, occurred_at: DateTime, // Additional metadata might eventually // go here, e.g. `plan: Option` }, StatementExecutionEnded { id: Uuid, occurred_at: DateTime, was_fast_path: bool, result: StatementResult, }, } ``` where `StatementResult` is defined as follows: ``` rust enum StatementResult { Canceled, Error { message: String, }, Success { // None for statements that don't return a result rows_returned: Option, } } ``` The `StatementPrepared` and `StatementExecutionBegan` events are generated and stored in the buffer in `fn handle_execute`, after checking a random value against `statement_history_sample_rate` . This requires keeping track of the prepared statement bound to each name, and whether it has been logged yet (as we don't want to log each statement several times). The `StatementExecutionFinished` events are generated in several different places, depending on the type of statement. For example, the event for `SELECT` queries is generated in response to the `ControllerResponse::PeekResponse` message. The coordinator also holds on to an interval object whose tick rate is controlled by `statement_log_flush_interval`. It listens for ticks on that interval in its main `serve` loop; when such a tick occurs, it emits data to the system tables, as follows: * Each `StatementPrepared` event generates an entry in `mz_prepared_statement_history`. * Each `StatementExecutionBegan` event generates an entry in `mz_statement_execution_history`, and records that entry for retraction as part of a later update. * Each `StatementExecutionEnded` event generates an update to the entry described above. These entries are consolidated and then emitted to the builtin tables, and also to the catalog, ensuring durability. On system restart, all entries in `mz_statement_execution_history` older than `statement_log_max_age` are deleted. Then "garbage collection" is done to delete entries in `mz_prepared_statement_history` and `mz_session_history` that are no longer referenced. The schemas of the tables are as follows. Note that deciding on a particular schema now doesn't preclude us from adding more fields in the future, if we decide we want more granular information about particular queries (for example, whether WMR was used). **mz_prepared_statement_history** | Field | Type | Meaning | |---------------|----------------------------|----------------------------------------------------| | `id` | `uuid` | The unique identifier of the prepared statement. | | `session_id` | `uuid` | The ID of the session that prepared the statement. | | `name` | `text` | The name of the prepared statement. | | `sql` | `text` | The SQL text of the statement. | | `prepared_at` | `timestamp with time zone` | The timestamp at which the statement was prepared. | **mz_statement_execution_history** | Field | Type | Meaning | |-------------------------|----------------------------|----------------------------------------------------------------------------------------------------------| | `id` | `uuid` | The unique identifier of the statement execution. | | `prepared_statement_id` | `uuid` | The unique identifier of the corresponding item in `mz_prepared_statement_history`. | | `sample_rate` | `float8` | The value of `statement_history_sample_rate` at the time of execution. | | `params` | `text list` | The textual representations of the supplied arguments. | | `began_at` | `timestamp with time zone` | The time at which the coordinator received the command and began executing the statement. | | `finished_at` | `timestamp with time zone` | The time at which the statement finished, or null if it is still running. | | `was_successful` | `bool` | Whether the statement terminated successfully, or null if it is still running. | | `was_canceled` | `bool` | Whether the statement was canceled, or null if it is still running. | | `was_aborted` | `bool` | Whether the statement was aborted; that is, `environmentd` crashed before execution finished. | | `error_message` | `text` | The error message. Non-null if and only if the statement finished without succeeding or being cancelled. | | `rows_returned` | `bigint` | How many rows the statement returned, or null if it is still running, errored or was canceled. | | `was_fast_path` | `bool` | Whether the statement executed on the fast path, or null if it is still running. | **mz_session_history** | Field | Type | Meaning | |--------------------|--------|-------------------------------------------------------| | `id` | `uuid` | The unique identifier of the session. | | `application_name` | `text` | The name of the application that created the session. | | `user` | `text` | The name of the user who owns the session. | ### Observability and Rollout If we misjudge the size and volume of our typical users' queries, we will log much more data than we expect. Thus, rollout should proceed in two phases. First, we should enable the feature in "dry run" mode, which does not actually log any queries, but tracks (as a Prometheus metric) how much data would have been logged had the feature actually been turned on. Only once we validate that it meets expectations should we actually enable it. We should do something similar before adding any new columns to the tables (e.g., the plans). ### Testing For testing, we will probably need a way to make the random sampling deterministic; this can be accomplished by adding a `statement_log_random_seed` system var that we set from within testdrive or SLTs. ## Alternatives Log all executed statements, rather than sampling. This approach was rejected because users might come to rely on all statements being logged, which would require us making all statements durable, dramatically increasing the cost of high-QPS usage patterns. Instead, we will not allow setting the sampling rate above some value slightly less than 1, e.g. 0.99. ## Open questions * Should we let the user set their own sampling rate, or does this have the potential to put unacceptable load on e.g. Cockroach? * What should be the default values of the system vars?