# Compute Introspection sources/tables/views improvements

## Summary

We have identified several issues with the sources documented
[here](https://materialize.com/docs/sql/system-catalog/mz_internal/). In
particular, some key views are missing (e.g., a way to connect channel
IDs to the operators they connect), and there are various existing ones that
have inconsistently named fields or inconsistent
semantics. Furthermore, the documentation of the various relations is
incomplete (e.g., it doesn't make clear whether each relation is
per-replica or global).

## Goals

* Compute introspection relations should be useful and easy to understand
for internal debugging.

* They should be useful and easy to use for building external
  introspection UI.

## Non-Goals

* The relations do not need to be polished enough for external
consumption (although it's hoped they prove useful, we don't consider
it a bug if they're not). Customer-friendly introspection will be
provided in the web UI by the "external introspection" project.

* We make no attempt to maintain backwards compatibility with
  previously-existing queries, although as part of this work we will
  update the internal tools that use the introspection relations.

## Description

### Schema inconsistencies and inelegancies

In this section, we describe the various issues that make the schema
confusing.

#### Per-worker vs. aggregated

Several relations have a `worker_id` column. Technically, the
semantics of this column are the same across all the relations: it
corresponds to the worker on which the data was measured. However, in
practice the behavior is very different, and falls into a few separate
categories, as described below.

##### Sharded relations

*Example:* `mz_arrangement_sizes`. These are relations that report
some value that is expected to differ across all workers, such that
the semantically meaningful value is (often) the sum of all the
per-worker values. For example, a user interested in the total size of
each arrangement would normally want to issue a query like `SELECT
operator_id, sum(records), sum(batches) FROM mz_arrangement_sizes
GROUP BY operator_id`.

There are also a few relations where `sum` is not the most interesting
aggregate; for example, the `min` across workers in
`mz_compute_frontiers`. "Sharded" might not be the most precise way to
describe these, but I include them in this section because the point
is basically the same.

The existence of the per-worker underlying sources is not completely
useless, e.g. because they are useful for detecting skew. However,
they are probably less commonly useful than a summed view would
be. Unforunately, this summed view does not exist for most such
relations (one counterexample is `mz_records_per_dataflow` and
`mz_records_per_dataflow_global`).

##### Duplicated relations

*Example:* `mz_dataflow_channels`. These relations report static
structural properties of dataflows. As such, they will always report
exactly the same values across all workers, except in one of two
cases:

* Our dataflow-rendering logic is buggy, and we invalidly produce
  differently-shaped dataflows on different workers, or
* Due to a race condition, where the relation is queried just after
  one worker has emitted a log message corresponding to the dataflow,
  but just before another worker has.

These cases are unlikely to be observable in practice, the first because
dataflow mismatches will usually cause unpredictable behavior
(including crashes) long before we have a chance to usefully
investigate them via introspection sources, and the second because
such race conditions will typically be very short-lived.

Thus, the argument for making global views (e.g., that simply filter
the underlying sources by worker 0) is even stronger than for the
"[Sharded relations](#sharded-relations)" case.


#### `_internal` and `raw` variants

Several of the relations are views over very similar underlying
sources whose names are suffixed with `_internal` or infixed with
`raw`, which are similar enough to be possibly confusing, but mean
something completely different. `_internal` relations are those that
count (or sum) something in the Differential Dataflow count field; the
corresponding non-`_internal` views are simple aggregates that move
that count into the data. For example, `mz_arrangement_sharing` is
defined as

``` sql
SELECT
    operator_id,
    worker_id,
    pg_catalog.count(*) AS count
FROM mz_internal.mz_arrangement_sharing_internal
GROUP BY operator_id, worker_id"
```

This dimension is applied inconsistently: not all relations that
report a count or sum in the data are derived from a corresponding
`_internal` source.

`raw` variants, on the other hand, measure some time value in
nanoseconds, whereas their corresponding non-`raw` views convert the
times to the SQL interval type.

Confusingly, not all times are reported as intervals; for example,
`mz_scheduling_parks` has `slept_for` and `requested` columns, whose
type is `bigint` and whose unit is undocumented (in fact, it is nanoseconds).

The trade-off between intervals and nanoseconds timestamps is that the
former is more ergonomic but also has only microsecond precision.

#### Inconsistent presence of `sum` column in histograms

Some histograms include a per-bucket sum, which allows consumers to
compute average or total values precisely. Confusingly, not all
histograms report such a column (e.g. `mz_worker_compute_delays` does
not, whereas `mz_worker_raw_compute_delays` does).

We introduced the `sum` column initially to support Prometheus
histograms, which expects to know about cumulative sums and cannot be
convinced otherwise. However, it is possible to provide approximate
sums, based on multiplying the bucket values with the bucket counts.

#### Lack of clarity around per-replica vs. global relations

Many of the relations are logging sources that run locally in
clusters, are not persisted, and report different data depending on
the currently selected replica. Views based on these sources are also
per-replica. It is not straightforward to tell from our docs which is
the case for each relation.

Furthermore, sometimes the same logical value is reported by two
relations, one reporting `clusterd`'s view of the world, and the other
reporting `environmentd`'s. For example, `mz_compute_frontiers`
vs. `mz_cluster_replica_frontiers`. However, the names do not make it
clear what the distinction is.

### Suggested fixes for schema inconsistencies

* All per-worker relations should have a corresponding "global" view
  where the results are aggregated across workers in the way most
  likely to be useful (for example, summing, taking the minimum, or
  restricting to worker 0). Because these global views are more likely
  to be useful than the underlying sources, they should have the basic
  name (e.g., `mz_arrangement_sharing`). The underlying source should
  have a name with `per_worker` suffixed (e.g.,
  `mz_arrangement_sharing_per_worker`). Furthermore, the per-worker
  variant should be clearly less prominent in documentation (either at
  the bottom of the page, or in a "see more" section that has to be
  clicked to expand).
* The expectations about per-worker relations should be clearly
  documented (e.g., we should clearly say whether the values for each
  worker are expected to be the same during normal operation).
* `raw` variants should be removed, and all introspection relations
  should present durations as nanoseconds timestamps. This allows us to
  provide the full precision, while also allowing users to cast the
  durations to intervals if they find those more convenient.
* All relations that report a simple count or sum should be views based on
  a `*_raw` variant that maintains the count or sum in the Differential
  diff field. These relations previously had an `_internal` suffix, but
  `_raw` is less likely to be confused with the `mz_internal` schema,
  and better reflects that fact that these relations contain unaggregated
  data.
* All columns containing nanoseconds timestamps should have names
  suffixed with `_ns`.
* The `sum` column should be removed from all histogram sources.
  Maintaining that field has some cost, in the form of additional state
  we must keep, a roughly doubled amount of updates, and code complexity.
  On the other hand, it is not clear how much value the `sum` field
  provides given that we can approximate it.
* All histogram views or sources should be suffixed with `_histogram`,
  to make their role more obvious.

### Possibly useful missing views

#### Channel-to-operator mapping

See [GitHub
PR](https://github.com/MaterializeInc/materialize/pull/17825).

#### Scope child to parent mapping

The only way to understand the structure of scope nesting currently is
to use the `mz_dataflow_addresses` relation and do some tedious list
manipulation. A simple view of `(id, parent_id)` would make this much
simpler.

#### Transitive object dependencies and end-to-end lag

We should add the transitive closure of `mz_object_dependencies`,
called e.g. `mz_transitive_object_dependencies`.

This will allow us to see the end-to-end frontier lag for any
dataflow.

#### More views enriched with operator names, dataflow names, etc.

(Not quite sure about this -- it's tricky to balance ease-of-use with
not blowing up the set of documented relations)

## Alternatives

None known to me

## Open questions

* Are there further views that could be useful but that we don't have
access to? We should fill this out as more become known to us (e.g.,
while debugging customer issues)

* Is there a better name than `_internal` for the non-aggregated sources?