We have identified several issues with the sources documented here. In particular, some key views are missing (e.g., a way to connect channel IDs to the operators they connect), and there are various existing ones that have inconsistently named fields or inconsistent semantics. Furthermore, the documentation of the various relations is incomplete (e.g., it doesn't make clear whether each relation is per-replica or global).
Compute introspection relations should be useful and easy to understand for internal debugging.
They should be useful and easy to use for building external introspection UI.
The relations do not need to be polished enough for external consumption (although it's hoped they prove useful, we don't consider it a bug if they're not). Customer-friendly introspection will be provided in the web UI by the "external introspection" project.
We make no attempt to maintain backwards compatibility with previously-existing queries, although as part of this work we will update the internal tools that use the introspection relations.
In this section, we describe the various issues that make the schema confusing.
Several relations have a worker_id
column. Technically, the
semantics of this column are the same across all the relations: it
corresponds to the worker on which the data was measured. However, in
practice the behavior is very different, and falls into a few separate
categories, as described below.
Example: mz_arrangement_sizes
. These are relations that report
some value that is expected to differ across all workers, such that
the semantically meaningful value is (often) the sum of all the
per-worker values. For example, a user interested in the total size of
each arrangement would normally want to issue a query like SELECT
operator_id, sum(records), sum(batches) FROM mz_arrangement_sizes
GROUP BY operator_id
.
There are also a few relations where sum
is not the most interesting
aggregate; for example, the min
across workers in
mz_compute_frontiers
. "Sharded" might not be the most precise way to
describe these, but I include them in this section because the point
is basically the same.
The existence of the per-worker underlying sources is not completely
useless, e.g. because they are useful for detecting skew. However,
they are probably less commonly useful than a summed view would
be. Unforunately, this summed view does not exist for most such
relations (one counterexample is mz_records_per_dataflow
and
mz_records_per_dataflow_global
).
Example: mz_dataflow_channels
. These relations report static
structural properties of dataflows. As such, they will always report
exactly the same values across all workers, except in one of two
cases:
These cases are unlikely to be observable in practice, the first because dataflow mismatches will usually cause unpredictable behavior (including crashes) long before we have a chance to usefully investigate them via introspection sources, and the second because such race conditions will typically be very short-lived.
Thus, the argument for making global views (e.g., that simply filter the underlying sources by worker 0) is even stronger than for the "Sharded relations" case.
_internal
and raw
variantsSeveral of the relations are views over very similar underlying
sources whose names are suffixed with _internal
or infixed with
raw
, which are similar enough to be possibly confusing, but mean
something completely different. _internal
relations are those that
count (or sum) something in the Differential Dataflow count field; the
corresponding non-_internal
views are simple aggregates that move
that count into the data. For example, mz_arrangement_sharing
is
defined as
SELECT
operator_id,
worker_id,
pg_catalog.count(*) AS count
FROM mz_internal.mz_arrangement_sharing_internal
GROUP BY operator_id, worker_id"
This dimension is applied inconsistently: not all relations that
report a count or sum in the data are derived from a corresponding
_internal
source.
raw
variants, on the other hand, measure some time value in
nanoseconds, whereas their corresponding non-raw
views convert the
times to the SQL interval type.
Confusingly, not all times are reported as intervals; for example,
mz_scheduling_parks
has slept_for
and requested
columns, whose
type is bigint
and whose unit is undocumented (in fact, it is nanoseconds).
The trade-off between intervals and nanoseconds timestamps is that the former is more ergonomic but also has only microsecond precision.
sum
column in histogramsSome histograms include a per-bucket sum, which allows consumers to
compute average or total values precisely. Confusingly, not all
histograms report such a column (e.g. mz_worker_compute_delays
does
not, whereas mz_worker_raw_compute_delays
does).
We introduced the sum
column initially to support Prometheus
histograms, which expects to know about cumulative sums and cannot be
convinced otherwise. However, it is possible to provide approximate
sums, based on multiplying the bucket values with the bucket counts.
Many of the relations are logging sources that run locally in clusters, are not persisted, and report different data depending on the currently selected replica. Views based on these sources are also per-replica. It is not straightforward to tell from our docs which is the case for each relation.
Furthermore, sometimes the same logical value is reported by two
relations, one reporting clusterd
's view of the world, and the other
reporting environmentd
's. For example, mz_compute_frontiers
vs. mz_cluster_replica_frontiers
. However, the names do not make it
clear what the distinction is.
mz_arrangement_sharing
). The underlying source should
have a name with per_worker
suffixed (e.g.,
mz_arrangement_sharing_per_worker
). Furthermore, the per-worker
variant should be clearly less prominent in documentation (either at
the bottom of the page, or in a "see more" section that has to be
clicked to expand).raw
variants should be removed, and all introspection relations
should present durations as nanoseconds timestamps. This allows us to
provide the full precision, while also allowing users to cast the
durations to intervals if they find those more convenient.*_raw
variant that maintains the count or sum in the Differential
diff field. These relations previously had an _internal
suffix, but
_raw
is less likely to be confused with the mz_internal
schema,
and better reflects that fact that these relations contain unaggregated
data._ns
.sum
column should be removed from all histogram sources.
Maintaining that field has some cost, in the form of additional state
we must keep, a roughly doubled amount of updates, and code complexity.
On the other hand, it is not clear how much value the sum
field
provides given that we can approximate it._histogram
,
to make their role more obvious.See GitHub PR.
The only way to understand the structure of scope nesting currently is
to use the mz_dataflow_addresses
relation and do some tedious list
manipulation. A simple view of (id, parent_id)
would make this much
simpler.
We should add the transitive closure of mz_object_dependencies
,
called e.g. mz_transitive_object_dependencies
.
This will allow us to see the end-to-end frontier lag for any dataflow.
(Not quite sure about this -- it's tricky to balance ease-of-use with not blowing up the set of documented relations)
None known to me
Are there further views that could be useful but that we don't have access to? We should fill this out as more become known to us (e.g., while debugging customer issues)
Is there a better name than _internal
for the non-aggregated sources?