- Feature name: External introspection - Associated: (Insert list of associated epics, issues, or PRs) # Summary [summary]: #summary In this document we will propose a few features related to "external" introspection; that is, introspection by end-users. These features will be exposed in the web UI. # Motivation [motivation]: #motivation It is currently difficult for anyone but the most unusually sophisticated users to debug issues with compute-maintained features: understand the health and stability of their replicas, understand the performance of their queries, and so on. # Explanation [explanation]: #explanation We will explain each proposed feature separately. ## Hierarchical dataflow visualizer This will provide the functionality available now in the `memory` and `/hierarchical-memory` visualizations: visualizing the graph of operators in a dataflow, including the transfer of data along channels and the number of records in arrangements. This will differ from the presently-available GUIs in the following ways: * It will be externally visible * The user will be able to dynamically refine the zoom level by clicking (rather than having to scroll down as in the current hierarchical UI) * We will include scheduling durations alongside the other per-node information (arrangement sizes / channel traffic) * If possible (TBD) we should figure out how to filter out the error paths. Mockup: ![diagram of proposed feature](hdv.png) ## Global frontier lag visualizer The global frontier lag visualizer will allow users to see at a glance whether dataflows are falling behind. It will display the controller's view of each source and export frontier, and color nodes if their outputs lag significantly behind their inputs. There may be multiple values corresponding to a single export, if the cluster on which the source dataflow is running has more than one replica. Each edge between dataflows will be labeled with the maximum frontier across all replicas. The view will incorporate all possible inputs and outputs of a dataflow: indexes, MVs, sources, and sinks. Mockup: ![diagram of proposed feature](gflv.png) ## "Flamegraph" views of resource usage information By "resource usage information" we mean the same things that are displayed per-node in the hierarchical dataflow visualizer. The scope structure should let us compute various values (time spent, records contained) and display them in a tree-like visualization, as shown below. ![diagram of proposed feature](fg.png) ## Per-query metadata We should collected the following metadata for each query: * The timestamp at which the query executed * The frontiers of all dependencies * The optimized and physical query plans * The ID of the dataflow (if it is not a simple peek) used to service the query * The SQL text of the query We can then save all this information in a table with non-trivial retention, and surface it from the web UI. ## User-friendly rendering of query plans Currently we have text-only `EXPLAIN PLAN` output. We should render this data in visual form, as a graph. We should also (whenever possible) flow column names from the source relations through the nodes, so that we can show something more useful than `#0`, etc. # Reference explanation [reference-explanation]: #reference-explanation ## Data sources All features will be implemented in the web UI using React, querying the `mz_internal` relations for the necessary data. Relations that will be used include: * For the hierarchical dataflow visualizer: * `mz_dataflow_operators` * `mz_dataflow_channels` * `mz_arrangement_sizes` * `mz_dataflow_addresses` * For the global frontier lag visualizer: * `mz_object_dependencies` * `mz_cluster_replica_frontiers` * For the flamegraph views: same as the hierarchical dataflow visualizer * For the per-query metadata: New table to be created (`mz_query_metadata`). * For the user-friendly plan rendering: parsed `EXPLAIN PLAN` output (possibly from `mz_query_metadata`, or entered manually by the user). ## Layout and rendering We will use the [d3-graphviz](https://github.com/magjac/d3-graphviz) library for layout and rendering of the hierarchical dataflow visualizer, global frontier lag visualizer, and user-friendly plan rendering. We will use [d3-flame-graph](https://github.com/spiermar/d3-flame-graph) -- which we are already using on the internal side -- for the flamegraph visualizer. # Rollout and Lifecycle We will be considering this an experimental feature until such time as we can involve professional product designers and front-end engineers. Until that time, the level of polish of the UX may not reach the same standards as the rest of the site, so we will require users to click a link with a label like "Advanced Features" or similar in order to access the tools. We will also use LaunchDarkly to gate access to the feature, and not launch it at all to the public until we have gotten some internal feedback from support and DevEx that it is useful. We will have a separate LaunchDarkly flag just for the `mz_query_metadata` table, since this will have potentially large cost in high-QPS scenarios. ## Testing and observability (TBD -- will cover this in our meeting with Robin tomorrow and get an overview from him of the testing strategy for this kind of feature) # Drawbacks [drawbacks]: #drawbacks * If any of the features don't prove useful, we are cluttering our UX unnecessarily. * Maintaining the `mz_query_metadata` table will introduce overhead on persist, especially in high-QPS scenarios. We should measure this before implementing. # Conclusion and alternatives [conclusion-and-alternatives]: #conclusion-and-alternatives - I am unaware of any other possible designs # Unresolved questions [unresolved-questions]: #unresolved-questions * What will be the overhead of the `mz_query_metadata` table, and is it acceptable? * How can we communicate query IDs back to the user for looking up the query in the per-query metadata view? Is it acceptable to use `NOTICE` messages in this case? # Future work [future-work]: #future-work Not sure of any. We should launch an MVP, get feedback and then iterate.