The dataflow and optimization layers are hard to test. Currently, the way you
test most of the code is that you have to come up with a SQL query that will
result in the MirRelationExpr you want.
A blocker to more directly testing these two layers is that specifying a
particular Mir{Relation, Scalar}Expr
in Rust is onerous. Even if you put the
effort in to specify something like this code below, the test becomes difficult
to read.
let mut test_expr = MirRelationExpr::Filter {
    input: Box::new(MirRelationExpr::FlatMap {
        input: Box::new(MirRelationExpr::Map {
            input: Box::new(MirRelationExpr::Get {
                id: Id::Global(GlobalId::User(0)),
                typ: RelationType::new(vec![
                    ColumnType {
                        nullable: true,
                        scalar_type: ScalarType::Int32,
                    },
                    ColumnType {
                        nullable: true,
                        scalar_type: ScalarType::Int64,
                    },
                ]),
            }),
            scalars: vec![MirScalarExpr::literal_null(ScalarType::Int32)],
        }),
        func: TableFunc::GenerateSeriesInt32,
        exprs: vec![MirScalarExpr::Column(1)],
        demand: None,
    }),
    predicates: vec![MirScalarExpr::CallBinary {
        func: BinaryFunc::Eq,
        expr1: Box::new(MirScalarExpr::Column(0)),
        expr2: Box::new(MirScalarExpr::Column(3)),
    }],
};
Justin had started on a system, which can be found at /src/transform/tests,
where you can specify a Mir{Relation, Scalar}Expr using a much simpler and
more readable syntax:
(
filter
(
flat-map
(map (get x) [(null int32)])
generate-series-int32
[#1]
)
[(call-binary eq #0 #3)]
)
but implementing the syntax itself is onerous because Rust can't take a string
and just convert it to an enum. The code for constructing a BinaryFunc
would look like:
match func {
    "eq" => BinaryFunc::Eq,
    "isregexpmatch" => {
        let case_insensitive = self.next_sexp().parse::<bool>();
        BinaryFunc::IsRegexpMatch { case_insensitive }
    }
    // <repeat for every single BinaryFunc variant>
}
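To make the cost concrete, here is a self-contained, compilable version of the same idea. The enum and the parse helper below are simplified stand-ins for the real BinaryFunc and sexp reader, not the actual definitions:

```rust
// Simplified stand-ins for the real BinaryFunc and sexp reader, to show
// the shape of the hand-written string-to-enum boilerplate.
#[derive(Debug, PartialEq)]
enum BinaryFunc {
    Eq,
    IsRegexpMatch { case_insensitive: bool },
}

// Parse a function name (plus any trailing argument tokens) into a BinaryFunc.
fn parse_binary_func(spec: &str) -> Option<BinaryFunc> {
    let mut tokens = spec.split_whitespace();
    match tokens.next()? {
        "eq" => Some(BinaryFunc::Eq),
        "isregexpmatch" => {
            let case_insensitive = tokens.next()?.parse::<bool>().ok()?;
            Some(BinaryFunc::IsRegexpMatch { case_insensitive })
        }
        // ...repeated for every single BinaryFunc variant.
        _ => None,
    }
}
```

Two variants already take ~15 lines; the real enums have dozens of variants each.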
As such, currently Justin's system only supports the following flavors:
- MirScalarExpr => columns, numeric literals, bool literals
- MirRelationExpr => get, let, map, filter, project, constant, join (only JoinImplementation::Unimplemented), union, negate, arrange-by

Figuring out how to more efficiently implement the syntax has been an ongoing issue. There have been some earlier design discussions on this, which are documented in the comments of #5684.
The goal is to make it easy to specify an arbitrary Mir{Relation, Scalar}Expr. Being able to do this will help a lot with:
- Testing transforms. Manually modified Mirs can allow for testing the effects of proposed transforms without having to write code for the transform.
- Experimenting with changes to the Mir representation.
- Eventually specifying other types, such as:
  - ReducePlan (a.k.a LIRR, "LIR for Reduce")
  - Lir expressions
  - Hir expressions? We have been talking about doing some rewrites before the decorrelation stage.
  - DataflowDescs to test optimization across dataflows?

The Mir{Relation, Scalar}Expr builder should not bother with attempting to
infer the type of a column. E.g. when doing a sum reduce, it is up to the
test-writer to select the correct AggregateFunc from SumInt32, SumInt64, etc.

Let's call this approach "Minimal boilerplate".
I propose to retain Justin's syntax and invest in a mechanism that can automatically convert it into a JSON formatted string that can be correctly deserialized by serde_json.
The mechanism has 3 components:
1. A parser. See the section Parser changes for more details.
2. A derive macro that generates, for each enum, a mapping from the enum's variants to the names and the types of the variant's fields.
3. Code that constructs the JSON formatted string, stitching the parsed tokens to the information about the enums.
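As a rough illustration of component (2), here is a hand-written sketch of the kind of lookup the derive macro could generate: a mapping from an enum's variants to the names and types of their fields. The function name and string-based representation are hypothetical; the two entries shown mirror MirRelationExpr::Map (which has input and scalars fields) and the fieldless BinaryFunc::Eq:

```rust
// Hypothetical sketch of derive-macro output: given an enum and variant
// name, return the names and types of that variant's fields.
fn variant_fields(enum_name: &str, variant: &str) -> Option<Vec<(&'static str, &'static str)>> {
    match (enum_name, variant) {
        // MirRelationExpr::Map { input: Box<MirRelationExpr>, scalars: Vec<MirScalarExpr> }
        ("MirRelationExpr", "Map") => Some(vec![
            ("input", "Box<MirRelationExpr>"),
            ("scalars", "Vec<MirScalarExpr>"),
        ]),
        // BinaryFunc::Eq is a unit variant with no fields.
        ("BinaryFunc", "Eq") => Some(vec![]),
        _ => None,
    }
}
```

Component (3) would consult this mapping to know which field name each parsed argument belongs to.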
Pseudocode for (3) looks like this. To construct the JSON string for an enum of type Type specified as (variant <arg1> .. <argn>):
1. If the (type, variant) combination requires special action, take it.
2. Otherwise, construct the JSON string in the default manner, which is:
   a. Look up the field information recorded for Type::Variant.
   b. Convert each <argi> based on the type information.
   c. If the variant has named fields, the result looks like {"Variant": {"Arg1Name": <arg1_json>, ... "ArgNName": <argn_json>}}. If the variant has unnamed fields, it looks like {"Variant": [<arg1_json>, .., <argn_json>]}. If the variant has no fields, it looks like "Variant".

(3) supports optional arguments. For example, if you have the spec
(join <input> <equivalences>)
, the corresponding JSON string
{"Join": {"input": <input>, "equivalences": <equivalences>}}
will deserialize fine into a MirRelationExpr::Join
as long as you add
the attribute #[serde(default)]
to the definition of the optional fields
implementation
and demand
.
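The default construction for named fields, including the omission of trailing optional arguments, can be sketched in plain Rust. The function below is a simplified stand-in (real field names would come from the derive macro's metadata, and the argument JSON from recursive conversion):

```rust
// Build the default JSON string for a variant with named fields:
// {"Variant": {"field1": <json1>, ..., "fieldn": <jsonn>}}.
// Trailing optional fields that were not specified are simply omitted;
// #[serde(default)] on the Rust side fills them in at deserialization.
fn default_json(variant: &str, fields: &[&str], args: &[String]) -> String {
    let body: Vec<String> = fields
        .iter()
        .zip(args.iter()) // zip stops at the shorter list, dropping unspecified fields
        .map(|(name, json)| format!("{:?}: {}", name, json))
        .collect();
    format!("{{{:?}: {{{}}}}}", variant, body.join(", "))
}
```

For the (join <input> <equivalences>) example, passing only two argument strings against Join's four fields yields exactly the {"Join": {"input": ..., "equivalences": ...}} shape described above.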
Step 1 in the pseudocode enables support for irregular syntax. For example, the
syntax (null int32)
corresponds to the MirScalarExpr
null literal, which has
a corresponding JSON string
{"Literal":[{"Ok":{"data":[0]}},{"nullable":true,"scalar_type":"Int32"}]}
.
In this case, the MirScalarExpr builder sees a spec whose first token is null
and goes to a separate branch to construct a MirScalarExpr::Literal. The
MirScalarExpr::Literal gets re-serialized into JSON and pasted into the string
for the enclosing map.
While I have been focusing on describing how to convert Justin's syntax into a string that serde_json will correctly deserialize as an enum, it should be easy to see how the mechanism can support converting the syntax into a string that serde_json will correctly deserialize as a struct.
I propose replacing Justin's existing sexp parser with proc_macro2::TokenStream.
The primary motivation is "why write your own parser when you don't have to?"
Also, I like how, despite its name, proc_macro2::TokenStream turns the syntax
into a tree that mirrors the structure of the tree that we're trying to build.
For example, proc_macro2::TokenStream parses
(map (get x) [(null int32)])
into something like:
ParenGroup [0] [1] [2]
| | |
V V V
Ident(map) ParenGroup [0] [1] BracketGroup [0]
| | |
V V V
Ident(get) Ident(x) ParenGroup [0] [1]
| |
V V
Ident(null) Ident(int32)
A consequence of doing this replacement is that idents in Justin's original
syntax that are in kebab-case need to be changed to snake_case because the
parser will recognize stuff_in_snake_case
as one ident but will think
stuff-in-kebab-case
is four different idents separated by '-'.
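The converter will also need to turn each snake_case ident into the CamelCase variant name that serde_json expects (e.g. generate_series_int32 into GenerateSeriesInt32). A sketch, under the assumption that variant names are straightforward CamelCase versions of the idents:

```rust
// Convert a snake_case ident from the spec (e.g. generate_series_int32)
// into the CamelCase variant name serde_json expects (GenerateSeriesInt32).
fn snake_to_variant(ident: &str) -> String {
    ident
        .split('_')
        .map(|word| {
            let mut chars = word.chars();
            match chars.next() {
                Some(first) => first.to_ascii_uppercase().to_string() + chars.as_str(),
                None => String::new(),
            }
        })
        .collect()
}
```

Note that this mechanical rule only works if the spec spells internal capitals with underscores: IsRegexpMatch would have to be written is_regexp_match, not isregexpmatch.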
Creating a reversal mechanism, a.k.a. converting the JSON string serialization back to Justin's syntax, is not a priority compared to enabling testing/experimentation/debugging. I have a hard time thinking of a use for round-trip support other than test migration (see below).
That said, the reverse mechanism works just like the forward mechanism, with the exception that its parser is the json crate. In order to translate substrings back to irregular syntax, it will use the same derive macro to track the types that substrings correspond to.
If we have round-trip support, then it becomes easy to support migrating tests if the structure of a class of expression changes.
For example, if MirRelationExpr::{Map, Filter, Project} were being replaced by MirRelationExpr::MapFilterProject, you would:
1. Add support for specifying a MirRelationExpr::MapFilterProject.
2. Write a transform that replaces MirRelationExpr::{Map, Filter, Project} with MirRelationExpr::MapFilterProject.
3. Parse each test's spec into a MirRelationExpr.
4. Run the transform on the MirRelationExpr.
5. Serialize the result back into the syntax and remove support for MirRelationExpr::{Map, Filter, Project}.
To reduce the amount of work required to migrate tests, I propose creating a template for step (3) when there is available bandwidth or when doing a test migration for the first time. The template can do all the test rewrites, and one just has to supply the transform.
This approach was rejected in the earlier discussion, but I am listing it for the sake of completeness.
The idea was to use the strum
crate to derive a method that will construct the
enum from a string.
The original objection to this approach is the concern that the strum crate is
insufficiently mature to depend on.
It should also be noted that for enum variants with fields, the macro will populate the fields with default values, so:
- Defaults do not exist for Box fields like the one in MirRelationExpr::Map.
- Even if a variant does not have a Box field, it will take extra boilerplate code to enable specifying an enum with non-default values.

This proposal may bring up the question of "Why use Justin's syntax instead of just specifying all expressions using JSON strings?"
There are two reasons:
1. Certain expressions, like MirRelationExpr::Constant, are very hard to specify as JSON strings that will deserialize properly.
2. Since we would need some kind of parsing mechanism to convert part of the spec into stuff like MirRelationExpr::Constant anyway, we might as well use the parsing mechanism to enable specifying the whole thing in a simpler syntax.

There is also the question of, "Why construct a JSON string and then deserialize that instead of creating a macro that generates a method that directly converts Justin's syntax to the expression?"
My belief is that macros are hard to write, introspect, and debug. Using
serde_json gets you string-to-enum conversion with a lot less coding effort and
with more insight into where you may have made a mistake in your syntax if the
deserialization fails.
A.k.a. just grind through all the enums and enum variants, filling in the entirety of methods like:
match func {
    "eq" => BinaryFunc::Eq,
    "isregexpmatch" => {
        let case_insensitive = self.next_sexp().parse::<bool>();
        BinaryFunc::IsRegexpMatch { case_insensitive }
    }
    // <repeat for every single BinaryFunc variant>
}
This is essentially the current state of unit testing with
Mir{Relation, Scalar}Expr
.
Pros: No macros required. No major syntax change.
Cons: There are a lot of enums and enum variants. This is especially the case
for enums whose type names end in Func
. Also, new variants for enums
whose type names end in Func
are ever-increasing, and these variants are being
added by different team members from those who are interested in constructing
arbitrary Mir{Relation, Scalar}Expr
.
The status quo is that we attempt to amortize the time costs of putting in boilerplate by adding boilerplate as test cases need it. I feel that there are a couple of problems with the status quo:
- It discourages constructing an arbitrary Mir{Relation, Scalar}Expr for the purposes of debugging/experimentation/testing the dataflow layer directly because of the amount of boilerplate required to get started.

The idea here is to take advantage of the fact that while enums whose names end
with Func have an ever-increasing number of variants, the number of variants
for enums whose names end in Expr is pretty stable.
Keep Justin's syntax for MirRelationExpr and MirScalarExpr and implement
Justin's syntax with boilerplate, but specify *Func enums using JSON strings
that can be correctly deserialized by serde_json.
Most *Func
variants are unit variants, and you can get the unit enum variant
of your choice by using serde_json to deserialize the name.
serde_json::from_str::<TableFunc>("GenerateSeriesInt32")
Most of the *Func
variants that are not unit variants have only one
argument, so while specifying those variants using a JSON string involves
making the syntax more verbose, it doesn't look all that bad. Examples:
(flat_map (get x) {"CsvExtract":2} [#0])
(flat_map (get x) {"JsonbArrayElements": {"stringify": true}} [#0])
There are a few worst-case variants where the JSON string can be rather
onerous to specify, for example TableFunc::RegexpExtract:
{"RegexpExtract":
  [
    <the regex>,
    [{"index": 0, "name": "something", "nullable": false}, ..., {..}]
  ]
}
which we can address by either:
This approach requires no macros. It requires less boilerplate than the
"All boilerplate" option, but more than the "Minimal boilerplate" option. If we
just want to use Justin's syntax to construct Mir{Relation, Scalar}Exprs, the
difference in boilerplate will not be that much, but the difference will be
more noticeable if we start wanting to use the syntax to construct other
things.