For context, Kafka -- unlike most other message transports -- exposes "key" information as part of every message. Keys can be arbitrary serialized bytes, but so far our customers commonly use the same serialization for keys as for the value of each message.
We do not have general support for Kafka keys, only a little bit of (differently) special-cased support in our two UPSERT envelopes and our DEBEZIUM envelope.
The specific proposal here is to move the special casing out of the FORMAT (and ENVELOPE) syntax into a new KEY FORMAT/VALUE FORMAT syntax.
The inciting goal for this is to unify the syntax and implementation for UPSERT and DEBEZIUM envelopes, but an equally important goal is to set us up to better handle key/value message transports in the future.
This project should:
While we don't want to specifically make much progress on them, it is important that the work in this project sets us up for, or at least does not preclude:
key fields from Kafka in user dataflows
The Kafka source type syntax will change to support explicitly defining the key encoding via a KEY FORMAT specifier, in which case the value format will need to be specified as VALUE FORMAT. The format syntax fragment will become:
'FORMAT' format_specifier | 'KEY FORMAT' format_specifier 'VALUE FORMAT' format_specifier
The plain FORMAT specifier will still be allowed to specify a Confluent Schema Registry config, which may define both the key and value formats.
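To make the two accepted shapes of the fragment concrete, here is a minimal Rust sketch of a recognizer for it. It is not the sql-parser implementation: `FormatClause` and `parse_format_clause` are hypothetical names, and each format specifier is reduced to a single word, whereas real specifiers carry options.

```rust
/// Hypothetical result of parsing the format fragment of a Kafka source.
#[derive(Debug, PartialEq)]
enum FormatClause {
    /// `FORMAT <spec>`: a single specifier, which may cover key and value.
    Bare(String),
    /// `KEY FORMAT <spec> VALUE FORMAT <spec>`.
    KeyValue { key: String, value: String },
}

/// Recognizes the fragment from whitespace-separated words.
/// Assumption of this sketch: every specifier is exactly one word.
/// All words are upper-cased so that keywords match case-insensitively.
fn parse_format_clause(input: &str) -> Result<FormatClause, String> {
    let owned: Vec<String> = input
        .split_whitespace()
        .map(|word| word.to_uppercase())
        .collect();
    let tokens: Vec<&str> = owned.iter().map(String::as_str).collect();

    match tokens.as_slice() {
        ["FORMAT", spec] => Ok(FormatClause::Bare(spec.to_string())),
        ["KEY", "FORMAT", key, "VALUE", "FORMAT", value] => Ok(FormatClause::KeyValue {
            key: key.to_string(),
            value: value.to_string(),
        }),
        // A key format on its own is rejected: the value format must be explicit.
        ["KEY", "FORMAT", ..] => {
            Err("KEY FORMAT requires a matching VALUE FORMAT".to_string())
        }
        _ => Err(format!("unrecognized format clause: {}", input)),
    }
}
```

Under these assumptions, `parse_format_clause("KEY FORMAT TEXT VALUE FORMAT AVRO")` yields the `KeyValue` case, while `KEY FORMAT TEXT` alone is an error because the value format was not specified.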
The Encoding enum in the sql-parser AST will grow new variants:
KeyValue { key: Box<Encoding>, value: Box<Encoding> }, which will be used if the user specifies both a key and a value format.
Undefined, which will only be used as the key variant if the user specifies the Confluent Schema Registry format and we are unable to determine the schema for the key. This allows us to error if the envelope (or later SQL) depends on a key schema.
The default ENVELOPE NONE does not care about the key schema, but initially we will just error out in all cases instead of supporting this.
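The two new variants and the initial error behavior can be sketched as follows. This is a simplified stand-in, not the actual sql-parser definition: only `KeyValue` and `Undefined` come from the proposal, while the `Avro` and `Text` variants and the `key_encoding` helper are assumptions made for illustration.

```rust
/// Simplified stand-in for the sql-parser `Encoding` enum.
/// `Avro` and `Text` are placeholder variants for this sketch.
#[derive(Debug, Clone, PartialEq)]
enum Encoding {
    Avro,
    Text,
    /// Explicit key and value encodings.
    KeyValue { key: Box<Encoding>, value: Box<Encoding> },
    /// Key schema could not be determined from the schema registry.
    Undefined,
}

/// Hypothetical helper: returns the key encoding, if there is one.
/// Mirrors the initial behavior described above, where an `Undefined`
/// key is an error in all cases, even for envelopes that ignore the key.
fn key_encoding(encoding: &Encoding) -> Result<Option<&Encoding>, String> {
    match encoding {
        Encoding::KeyValue { key, .. } => match key.as_ref() {
            Encoding::Undefined => {
                Err("could not determine a schema for the key".to_string())
            }
            other => Ok(Some(other)),
        },
        // A plain encoding carries no separate key encoding.
        _ => Ok(None),
    }
}
```

Keeping `Undefined` as an explicit variant, rather than silently dropping the key, leaves room to relax the error later for envelopes such as ENVELOPE NONE that do not depend on the key schema.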
The current syntax for UPSERT will continue to work for at least one release cycle, but will be normalized to use the new implementation.
Regarding point 3 of the non-goals, once key encodings are well integrated it will be possible, as future work, to access Kafka keys with a small amount of new syntax, e.g. 'KEEPING' (KEY | VALUE | BOTH) or a with-option.
This was tried in 6289, but it seems like the wrong idea.
Unknown.