Associated issue: #21557
Avro schemas allow users to annotate records, record fields, and enums with
documentation. Note the doc
fields in the following example Avro schema:
{
"type": "record",
"name": "row",
"doc": "represents a row in some materialized view",
"fields" : [
{"name": "c1", "type": "string", "doc": "this is the meaning of column c1"},
{"name": "c2", "type": "int"}
]
}
When using an Avro-formatted sink, Materialize generates the Avro schema based on the shape of the underlying relation. For example, given the following sink:
CREATE MATERIALIZED VIEW v AS SELECT 'text' AS c1, '3'::integer as c2;
CREATE SINK FROM v INTO KAFKA ... FORMAT AVRO USING CONFLUENT SCHEMA REGISTRY ...;
Materialize will automatically generate the following Avro schema and publish it to the schema registry:
{
"type": "record",
"name": "envelope",
"fields" : [
{"name": "c1", "type": "string"},
{"name": "c2", "type": "int"}
]
}
Today, there is no way to instruct Materialize to include doc
fields in the
generated Avro schema. This poses a challenge for organizations with
requirements on the use of documentation—e.g., a policy that requires that every
field in every schema published to the organization's schema registry has
documentation.
Must haves:
doc
fields in Avro
schemas to use Avro-formatted sinks.Nice to haves:
Consistency with Materialize's existing approach to generating Avro schemas from SQL relations.
Good ergonomics.
None yet.
We'll extend the FORMAT AVRO
clause with a new [KEY|VALUE] DOC ON
[TYPE|COLUMN]
option. For example:
CREATE SINK ...
FORMAT AVRO USING CONFLUENT SCHEMA REGISTRY ... (
DOC ON COLUMN <column-specifier> = '<documentation string>',
KEY DOC ON TYPE <type-specifier> = '<documentation string>',
VALUE DOC ON COLUMN <column-specifier> = '<documentation string>',
...
)
WITH (...)
The option will be located in the existing parenthesized option list immediately
following the existing FORMAT AVRO
clause. After this change,
[KEY|VALUE] DOC ON [TYPE|COLUMN]
will live alongside the previously existing
AVRO KEY FULLNAME
and AVRO VALUE FULLNAME
options and we expect to add more
options to control the Avro schema in the future.
Aside: the current design of the USING CONFLUENT SCHEMA REGISTRY
option is
historical and predates the introduction of the standardized option list. In the
future, we may also move the USING CONFLUENT SCHEMA REGISTRY
into the option
list, where it can be freely intermixed with other options, e.g.:
FORMAT AVRO (
DOC ON ...,
CONFLUENT SCHEMA REGISTRY = '...',
AVRO KEY FULLNAME = '...',
KEY DOC ON ...,
)
Each [KEY|VALUE] DOC ON ...
option describes a documentation comment that should
be attached to a single field or type in the Avro schema. A KEY DOC ON ...
option applies only to the key schema; a VALUE DOC ON ...
option applies only
to the value schema; a DOC ON ...
option applies to both types of schemas.
Each DOC ON
option applies to either a column or a type:
Column comments ([KEY|VALUE] DOC ON COLUMN ...
) name a column in the
format [[db.]schema.]object.column
. Object names are looked up according to usual
SQL name resolution rules for the search path and active database.
Type comments ([KEY|VALUE] DOC ON TYPE ...
) specify a type in the
format [[db.]schema.]object
. Object names are looked up according to usual
SQL name resolution rules for the search path and active database.
Materialize will automatically use comments on the columns of the materialized view and their constituent record types as the default documentation for the fields and record types, respectively, in the Avro schema.
For a given field in the Avro schema, Materialize will consider the following sources of documentation in order:
KEY DOC ON COLUMN
option specifically naming that
field. If the value schema, a VALUE DOC ON COLUMN
option specifically
naming that field.DOC ON COLUMN
option specifically naming that field.For a given record type in the Avro schema, Materialize will consider the following sources of documentation in order:
KEY DOC ON TYPE
option specifically naming the type.
If the value schema, a VALUE DOC ON TYPE
option specifically naming the
type.DOC ON TYPE
option specifically naming the type.For example, consider the following sink:
CREATE TYPE point AS (x integer, y integer);
CREATE MATERIALIZED VIEW v AS SELECT ROW(1, 1)::point AS c1, 'text' AS c2;
CREATE SINK FROM v INTO KAFKA ... FORMAT AVRO ...;
Materialize will generate the following Avro value schema:
{
"type": "record",
"name": "envelope",
"doc": "documentation for materialized view underlying the sink, if any",
"fields": [
{
"name": "c1",
"doc": "documentation for field `envelope::c1``, if any",
"type": {
"type": "record",
"name": "record0",
"namespace": "com.materialize.sink",
"doc": "documentation for type `com.materialize.sink.record0`, if any",
"fields": [
{
"name": "x",
"type": [
"null",
"int"
],
"doc": "documentation for field `com.materialize.sink.record0::x`, if any"
},
{
"name": "y",
"type": [
"null",
"int"
],
"doc": "documentation for field `com.materialize.sink.record0::y``, if any"
}
]
}
},
{
"name": "c2",
"doc": "documentation for field `envelope::c1``, if any",
"type": "string"
},
]
}
For the field envelope::c1
, Materialize will consider the following sources of
documentation in order:
KEY DOC ON COLUMN
or VALUE DOC ON COLUMN
option naming v.c1
.DOC ON COLUMN
option naming v.c1
.v.c1
.The field envelope::c2
is analogous.
For the type com.materialize.sink.record0
, Materialize will consider the
following sources of documentation in order:
KEY DOC ON TYPE
or VALUE DOC ON TYPE
option naming point
.DOC ON TYPE
option naming point
.point
.For the field com.materialize.sink.record0::x
, Materialize will consider the
following sources of documentation in order:
KEY DOC ON COLUMN
or VALUE DOC ON COLUMN
option naming point.x
.DOC ON COLUMN
option naming point.x
.point.x
.com.materialize.sink.record0::y
is analogous.
Materialize will also use comments on the materialized view itself as the
default documentation for the top-level record (named row
for the key schema
and envelope
for the value schema). In the example above, Materialize will
consider the following sources of documentation in order:
DOC ON TYPE
comment naming v
.v
.Changes made to comments after the sink is created will not be reflected in the Avro schema.
The details in the Design section above will need to be worked into the Avro sink documentation. In particular, we should be sure to precisely explain the order in which Materialize searches for sources of documentation.
Note that we expect the COMMENT
interface to be the most common method by
which users add comments to their sink schemas. The DOC ON
option is primarily
useful for the implementation, so that a CREATE SINK
command can be fully self
describing. We do expect users to use DOC ON
in a pinch, e.g. if a single sink
needs to customize the documentation for a nested type, rather than pulling its
default description from its comment.
The extensions to the parser are straightforward.
The planner will "freeze" any comments that have been promoted to documentation
by rewriting the CREATE SINK
statement to contain those comments as explicit
DOC ON
options. This ensures that the Avro schema generated for a sink is
deterministic—i.e., that future changes to comments on the columns and types
used by the sink will not result in Materialize generating a different schema
on restart.
No changes to the catalog are required. The documentation will be automatically
stored in the catalog via the addition of the DOC ON
option, as the catalog
stores the raw DDL statement for each sink.
The interchange::json::build_row_schema_json
will need to be adjusted to
take the requested documentation as a parameter. Loosely, it could grow a
docs: BTreeMap<DocTarget, String>
parameter, where the key DocTarget
is
defined like the following to represent key or type specifiers,
enum DocTarget {
Type(GlobalId),
Field { object_id: GlobalId, column_name: String }
}
and, the value represents the requested documentation for the specifier.
The sink code in the coordinator would need to convert the DOC ON
comments
to this BTreeMap
.
Materialize could instead allow users to provide an Avro schema with all
required doc
fields. For example, users could pre-publish the desired schema
to the schema registry, and then Materialize could load that schema at CREATE
SINK
time.
There are several issues with this approach:
It is not ergonomic. The provided schema must exactly match the schema
Materialize generates, except for the doc
fields. Minor errors in
constructing the schema (e.g., using a long
where an int
is required, or
ordering fields wrong) will result in hard to debug failures.
It prevents Materialize from making backwards compatible changes to sink schemas.
Today, we can make changes the generated Avro sink schemas as long as they are backwards compatible, according to Avro's schema resolution rules. For example, Materialize could choose to add a new field to the top-level record containing additional metadata. This is a backwards compatible change as long as Materialize can publish a new schema to the registry containing the new field. Downstream consumers using the old schema (as a reader schema) will simply discard the new field.
We'd not be able to evolve schemas like this if users provide Materialized with the schema to use. The user-provided schema would not contain the new fields, and Materialize would not have the authority to edit that schema to contain the new fields.
At a high level, allowing users to specify their desired Avro schema would make Materialize responsible for reconciling that Avro schema with the shape of the materialized view provided to the sink. This reconciliation would be quite delicate and difficult, and liable to cause sink breakage during version upgrades.
So, as a philosophy, we propose that Materialize is always solely responsible
for generating the Avro schema used by an Avro-formatted sink. Customizations to
the generated Avro schema must be made via purpose-built options in the FORMAT
AVRO
clause.
The DOC ON
options, rather than specifying columns and types to which to
attach comments according to their SQL names, could instead specify fields in
the Avro schema by their Avro names. Brennan invented a way of
unambiguously naming Avro schema nodes in the kgen
load generator; see the
next section for a description of its syntax and semantics.
This alternative was rejected because it too tightly couples the syntax of
CREATE SINK
to the particulars of how we generate Avro schemas. For example,
we don't currently propagate SQL names for nested record types to the Avro
schema, and instead install an automatically generated name that the user
would need to guess. For example, consider again the example sink:
CREATE TYPE point AS (x integer, y integer);
CREATE MATERIALIZED VIEW v AS SELECT ROW(1, 1)::point AS c1, 'text' AS c2;
CREATE SINK FROM v INTO KAFKA ... FORMAT AVRO ...;
Adding a comment to point.x
would require an option like DOC ON
'com.materialize.sink.record0::x' ...
, where the user hardcodes the
autogenerated record0
name. This is not only painful for users, but painful
for us as well, as we'd be unable to change how we autogenerate record names
in the future without breaking user's DOC ON
options.
Note also the user pain grows with the number of referenced types. With only one
nested record, it's easy to hardcode the record0
rule. With dozens of
referenced types, it becomes all but impossible.
An Avro schema node can appear in the following contexts:
We define the "unambiguous name" of a schema node recursively as follows:
.
, followed by its name if it is.::
, followed by
the field name. Note that you don't need to account for multiple levels of
nesting here-- if record my.namespace.a
contains record my.namespace.b
which contains record my.namespace.c
which contains field f1
, you can
just call that field my.namespace.c::f1
, unambiguously..
, followed by the
node's index within a union. I think we should actually change this and use
the type name instead (e.g. union_unambiguous_name.long
instead of
union_unambiguous_name.4
). That would work because unions are not allowed
to contain multiple variants of the same kind of type unless they are named
types.[]
.Possibly future work includes:
Hypothetical syntax:
ALTER SINK ... FORMAT AVRO (
ADD DOC ON <field-specifier> = '<documentation string>',
ADD DOC ON <field-specifier> = '<documentation string>',
...
);
Hypothetical syntax:
ALTER SINK ... FORMAT AVRO (REFRESH DOCS);
Hypothetical syntax:
CREATE SINK ... FORMAT AVRO ... (IGNORE COMMENTS);
Hypothetical syntax:
CREATE SINK POLICY ... FORMAT AVRO (REQUIRE DOCUMENTATION FOR ALL FIELDS);
None.