In this document we will propose a few features related to "external" introspection; that is, introspection by end-users. These features will be exposed in the web UI.
It is currently difficult for all but the most sophisticated users to debug issues with compute-maintained features: understanding the health and stability of their replicas, the performance of their queries, and so on.
We will explain each proposed feature separately.
The first, the hierarchical dataflow visualizer, will provide the functionality available now in the `/memory` and `/hierarchical-memory` visualizations: visualizing the graph of operators in a dataflow, including the transfer of data along channels and the number of records in arrangements.
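As a sketch of the data access involved (assuming the `mz_internal` schemas as of recent Materialize versions; exact column sets may drift), the operator nodes and their arrangement sizes could come from a query like:

```sql
-- Sketch: one row per dataflow operator, with arrangement sizes where an
-- operator maintains an arrangement. Assumes mz_dataflow_operators
-- (id, name) and mz_arrangement_sizes (operator_id, records, batches).
SELECT
    o.id,
    o.name,
    a.records,
    a.batches
FROM mz_internal.mz_dataflow_operators AS o
LEFT JOIN mz_internal.mz_arrangement_sizes AS a
    ON a.operator_id = o.id
ORDER BY a.records DESC;
```

Channel edges would come analogously from `mz_dataflow_channels`, and the scope nesting from `mz_dataflow_addresses`.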
This will differ from the presently-available GUIs in the following ways:
The global frontier lag visualizer will allow users to see at a glance whether dataflows are falling behind. It will display the controller's view of each source and export frontier, and color nodes if their outputs lag significantly behind their inputs.
There may be multiple values corresponding to a single export, if the cluster on which the source dataflow is running has more than one replica. Each edge between dataflows will be labeled with the maximum frontier across all replicas.
The view will incorporate all possible inputs and outputs of a dataflow: indexes, materialized views, sources, and sinks.
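A rough sketch of the underlying query, assuming the current shapes of `mz_cluster_replica_frontiers` (object_id, replica_id, write_frontier) and `mz_object_dependencies` (object_id, referenced_object_id):

```sql
-- Sketch: take the maximum write frontier per object across replicas,
-- then join against the dependency graph so each edge between objects
-- can be labeled (and lag-colored) in the UI.
WITH frontiers AS (
    SELECT object_id, max(write_frontier) AS frontier
    FROM mz_internal.mz_cluster_replica_frontiers
    GROUP BY object_id
)
SELECT
    d.referenced_object_id AS input_id,
    d.object_id            AS output_id,
    fi.frontier            AS input_frontier,
    fo.frontier            AS output_frontier
FROM mz_internal.mz_object_dependencies AS d
JOIN frontiers AS fi ON fi.object_id = d.referenced_object_id
JOIN frontiers AS fo ON fo.object_id = d.object_id;
```

Whether a node is colored as lagging would then be a client-side comparison of input and output frontiers against a threshold.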
By "resource usage information" we mean the same things that are displayed per-node in the hierarchical dataflow visualizer.
The scope structure should let us compute various values (time spent, records contained) and display them in a tree-like visualization, as shown below.
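A sketch of the rollup (treat this as illustrative: `mz_scheduling_elapsed` is not in the relation list below, so its name and columns are assumptions, as is list subscripting on the `address` column):

```sql
-- Sketch: elapsed time and arranged records summed per top-level scope,
-- keyed by the first element of each operator's address. The UI would
-- repeat this rollup at each depth of the address tree to build the
-- flamegraph's nesting.
SELECT
    addr.address[1]       AS dataflow_id,
    sum(sched.elapsed_ns) AS elapsed_ns,
    sum(arr.records)      AS records
FROM mz_internal.mz_dataflow_addresses AS addr
LEFT JOIN mz_internal.mz_scheduling_elapsed AS sched
    ON sched.id = addr.id
LEFT JOIN mz_internal.mz_arrangement_sizes AS arr
    ON arr.operator_id = addr.id
GROUP BY addr.address[1];
```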
We should collect the following metadata for each query:
We can then save all this information in a table with non-trivial retention, and surface it from the web UI.
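Purely as an illustration (the metadata columns are not yet decided, so `sql_text`, `began_at`, and `duration` below are hypothetical placeholders), the web UI might surface the slowest recent queries with something like:

```sql
-- Hypothetical: every mz_query_metadata column named here is a
-- placeholder, since the metadata we will collect is not yet fixed.
SELECT sql_text, began_at, duration
FROM mz_internal.mz_query_metadata
WHERE began_at > now() - INTERVAL '1 hour'
ORDER BY duration DESC
LIMIT 20;
```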
Currently we have text-only `EXPLAIN PLAN` output. We should render this data in visual form, as a graph. We should also (whenever possible) flow column names from the source relations through the nodes, so that we can show something more useful than `#0`, etc.
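For example (using a hypothetical user table `orders`), the JSON form of the plan is probably the most convenient input for a graph renderer:

```sql
-- The text output refers to columns positionally (#0, #1, ...).
-- Requesting the optimized plan as JSON yields a structured form that
-- the web UI can walk to lay out nodes and attach column names.
EXPLAIN OPTIMIZED PLAN AS JSON FOR
SELECT customer_id, count(*)
FROM orders
GROUP BY customer_id;
```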
All features will be implemented in the web UI using React, querying the `mz_internal` relations for the necessary data. Relations that will be used include:
- `mz_dataflow_operators`
- `mz_dataflow_channels`
- `mz_arrangement_sizes`
- `mz_dataflow_addresses`
- `mz_object_dependencies`
- `mz_cluster_replica_frontiers`
- `mz_query_metadata` (the new table proposed above)

The plan visualizer will additionally need `EXPLAIN PLAN` output (possibly from `mz_query_metadata`, or entered manually by the user).

We will use the d3-graphviz library for layout and rendering of the hierarchical dataflow visualizer, the global frontier lag visualizer, and the user-friendly plan rendering. We will use d3-flame-graph (which we are already using on the internal side) for the flamegraph visualizer.
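To give a flavor of how query results feed d3-graphviz, a sketch follows (the real implementation would first resolve the scope-local `from_index`/`to_index` to operator ids and names via `mz_dataflow_addresses`, and would likely assemble the DOT string in the frontend rather than in SQL):

```sql
-- Sketch: collapse channel endpoints into a DOT digraph that d3-graphviz
-- can lay out and render. Indices here are scope-local and would need
-- resolution to operator names for a readable graph.
SELECT 'digraph {' ||
       string_agg(from_index::text || ' -> ' || to_index::text, '; ') ||
       '}' AS dot
FROM mz_internal.mz_dataflow_channels;
```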
We will consider this an experimental feature until we can involve professional product designers and front-end engineers. Until then, the level of polish of the UX may not reach the same standards as the rest of the site, so we will require users to click a link labeled something like "Advanced Features" in order to access the tools.
We will also use LaunchDarkly to gate access to the feature, and not launch it at all to the public until we have gotten some internal feedback from support and DevEx that it is useful.
We will have a separate LaunchDarkly flag just for the `mz_query_metadata` table, since it will have potentially large cost in high-QPS scenarios.
(TBD -- will cover this in our meeting with Robin tomorrow and get an overview from him of the testing strategy for this kind of feature)
The `mz_query_metadata` table will introduce overhead on persist, especially in high-QPS scenarios. We should measure this before implementing.

What will the cost of the `mz_query_metadata` table be, and is it acceptable?

Should we emit `NOTICE` messages in this case?

Not sure of any. We should launch an MVP, get feedback, and then iterate.