Created at: December 16, 2022 5:35 PM; GitHub epic: https://github.com/MaterializeInc/database-issues/issues/4838
This document aims to capture a high-level understanding of the TopK rendering code, as well as to propose potential improvements to the rendering of monotonic top-k plans with strategies that are more efficient when k is small, as is common in practice.
Presently, our rendering of TopK plans treats three cases: MonotonicTop1Plan, MonotonicTopKPlan, and BasicTopKPlan. Taking a step back, there seem to be at least two aspects to consider in our rendering options: monotonicity and boundedness. Monotonicity is an input property that allows us to come up with more specialized renderings for TopK plans, since we do not have to account for retractions and thus need not maintain as much state. Boundedness can refer to the input or to the output sizes. It has been observed in issue #14444 that input size boundedness can be exploited to simplify TopK chains. Output size boundedness is exploited in the rendering of MonotonicTop1Plan as well as MonotonicTopKPlan. However, there may be opportunities to improve the rendering of MonotonicTopKPlan even further, especially for small k.
We do not aim to provide improvements to MonotonicTop1Plan or BasicTopKPlan at this point. However, the analysis contained herein may prove informative for future efforts aimed at these other rendering flavors.
We provide below an overview of, and discuss general observations about, the present rendering strategies for TopK plans. For simplicity, the description here ignores group-by keys, but extending the strategies to also include this additional partitioning consideration is possible (and done in our implementation). Additionally, the methods are meant to be easily extensible to function in a multi-worker setting. However, again for simplicity, their analysis is sketched without reference to the number of workers.
BasicTopKPlan
Overview of evaluation strategy: We build a stack of arrangements, each produced by a reduction. Each reduction works on a set of top-k groups. The groups are formed by a composite key, namely a pair (group_by_key, hash_modulus_key). Here, the second component of the pair is the more interesting one: a hash of the row is taken modulo a predefined factor for each level of the hierarchy. These factors are chosen such that a merge structure is created. Here is the code where the factors are declared. We build the key in the following fragments: 1) group-by key plus row hash first; 2) then, we map the second component of the pair modulo the factor of that top-k stage.
The bottom arrangement in the reduction stack is “fat”: it has many keys for a given intended top-k group; the top arrangement is “thin”: it has one key per final top-k group. Each key in an arrangement holds a number of values that is at worst proportional to (not necessarily equal to) $k$. The height of the stack is logarithmic in an upper bound $N$ on the total number of rows $n$ (currently taken as the size of the 64-bit integer domain). At every level, we emit all rows in the group and retract the rows that are in the partial top-k. Again at every level, these results are negated and concatenated with the input for that level, thus resulting in only the partial top-k rows. This mechanic allows retractions outside of the partial top-k rows to be absorbed at a given level and not propagated upwards. Additionally, it reduces memory usage at each level by avoiding keeping input rows more than once per level. At the last level, the final top-k elements are produced.
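To make the hierarchy concrete, below is a minimal, single-threaded sketch of the bucketing idea, assuming integer rows in ascending order and a hypothetical factor sequence of powers of 16; the actual rendering declares its own factors and operates over arrangements keyed by (group_by_key, hash_modulus_key).

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

fn hash_row(row: &i64) -> u64 {
    let mut h = DefaultHasher::new();
    row.hash(&mut h);
    h.finish()
}

/// Keep at most `k` smallest rows per `hash % modulus` bucket (group-by
/// keys are ignored here, as in the description above).
fn shrink_level(rows: Vec<i64>, modulus: u64, k: usize) -> Vec<i64> {
    let mut buckets: HashMap<u64, Vec<i64>> = HashMap::new();
    for row in rows {
        buckets.entry(hash_row(&row) % modulus).or_default().push(row);
    }
    let mut out = Vec::new();
    for (_bucket, mut rows) in buckets {
        rows.sort(); // partial top-k within the bucket
        rows.truncate(k);
        out.extend(rows);
    }
    out
}

fn main() {
    let k = 3;
    let mut rows: Vec<i64> = (0..1000).rev().collect();
    // Hypothetical shrinking factors; each divides the previous one, so
    // every level merges 16 buckets of the level below into one.
    for modulus in [4096u64, 256, 16, 1] {
        rows = shrink_level(rows, modulus, k);
    }
    println!("top-{k}: {rows:?}"); // the last level has a single bucket
}
```

Because each factor divides the previous one, the top-k of a coarse bucket is guaranteed to lie in the union of the partial top-ks of its finer sub-buckets, which is what makes the merge structure correct.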
Summarized Analysis:
Construction work is dominated by the levels at which we must_shrink (see code). There, the $n$ rows form groups of size proportional to $k$, which are sorted. Subsequently, we just propagate partial top-k rows up, doing at most the same amount of work at each level. Since there are at most $n/k$ groups at the stage where we must_shrink, we get $O(n \cdot \log k \cdot \log N)$. The work for negate/concat/other operators is $O(n)$, and update work is proportional to the size of the change (so $O(1)$ for a single update). Additionally, we are sorting the input rows for the reduction, but this $O(n \cdot \log n)$ cost is also bounded by the other factors articulated. Thus, these additional costs do not change the overall complexity of the operations argued above.
MonotonicTopKPlan
Overview of evaluation strategy: The strategy is similar to the construction for the BasicTopKPlan; however, the monotonic strategy makes changes that affect how much state from the input collection ends up being retained. Essentially, it is observed that the input can be “thinned” once the top-k rows are determined, since all “loser” rows outside of the top-k can never again make it into the top-k set. The latter holds because the input is monotonic, so there are no input retractions. The rendering thus removes all non-top-k rows from the input after the top-k set is found. This rendering strategy is not applied when an offset is given.
The behavior above is concretely implemented by a feedback operator, triggered by the usage of delay and a variable here. This strategy results in an even more complex rendering than for BasicTopKPlan.
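As a rough illustration of the thinning step only, the sketch below computes the retractions that would be fed back to shrink the retained input, assuming integer rows, an ascending order, and a full top-k set; the names are illustrative, and the actual implementation works with timely/differential operators and the feedback loop described above.

```rust
/// Given the rows currently retained from a monotonic input and the
/// current top-k (sorted ascending, assumed to hold exactly k rows),
/// drop every "loser" row and return its retraction as a (row, diff)
/// pair. Without input retractions, such rows can never re-enter the
/// top-k, so it is safe to forget them.
fn thin_input(retained: &mut Vec<i64>, topk: &[i64]) -> Vec<(i64, i64)> {
    let threshold = match topk.last() {
        Some(t) => *t,
        None => return Vec::new(),
    };
    let mut retractions = Vec::new();
    retained.retain(|row| {
        if *row > threshold {
            retractions.push((*row, -1)); // diff -1: retract from the input
            false
        } else {
            true
        }
    });
    retractions
}

fn main() {
    let mut retained = vec![5, 1, 9, 3, 7];
    let topk = vec![1, 3, 5]; // current top-3
    let dropped = thin_input(&mut retained, &topk);
    println!("retained {retained:?}, retracted {dropped:?}");
}
```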
Summarized Analysis:
The analysis is the same as for BasicTopKPlan. However, a substantial amount of memory can be released for all the input rows that are never going to make it into the top-k elements after the initial top-k elements are computed.
MonotonicTop1Plan
Overview of evaluation strategy: We move each row to the diff field by wrapping it in a Top1Monoid, and then use a reduction to keep only the top row.
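The following is a minimal sketch of that idea, with a byte vector standing in for a Materialize Row and a plain method standing in for the difference-type addition; the actual Top1Monoid also carries the ORDER BY comparison rather than plain lexicographic order.

```rust
/// A stand-in for the row-in-the-diff-field idea: "adding" two diffs
/// keeps only the row that sorts first, so a reduction over these diffs
/// leaves exactly the top row per group.
#[derive(Clone, Debug, PartialEq, Eq)]
struct Top1Monoid {
    row: Vec<u8>, // stand-in for a Materialize Row
}

impl Top1Monoid {
    fn plus(self, other: Top1Monoid) -> Top1Monoid {
        // Lexicographic comparison as a stand-in for the ORDER BY.
        if self.row <= other.row {
            self
        } else {
            other
        }
    }
}

fn main() {
    let a = Top1Monoid { row: b"banana".to_vec() };
    let b = Top1Monoid { row: b"apple".to_vec() };
    assert_eq!(a.plus(b).row, b"apple".to_vec());
}
```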
Summarized Analysis:
Potential Improvements
The strategies described in the previous section have the interesting property that they are resilient to large $k$, e.g., even comparable to $n$, or to skew in the formation of top-k groups. However, there may be opportunities for improvements when evaluation is restricted to small values of $k \ll n$. We are particularly interested in improvements to monotonic top-k rendering, since the current rendering exhibits an allocation spike with a multiplicative effect due to the arrangement stack created in BasicTopKPlan.
In the descriptions, we gloss over the details of row multiplicities for ease of understanding. An implementation will need to consider these details to be correct.
MonotonicTopKPlan Only
It is known from the literature that the top-k rows in a traditional query processing setting can be computed by keeping a priority queue, in $O(n \cdot \log k)$. That setting is similar to having monotonic sources, but we must deal with the additional complexities of maintenance under additions, as well as formulate the strategy in a manner that is idiomatic for implementation in DD.
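For reference, the textbook strategy can be sketched as follows, assuming integer rows and an ascending order: the k best rows seen so far live in a bounded max-heap, and each of the $n$ inserts costs $O(\log k)$.

```rust
use std::collections::BinaryHeap;

/// Classic one-pass top-k: keep the k smallest rows in a max-heap.
fn top_k(rows: impl IntoIterator<Item = i64>, k: usize) -> Vec<i64> {
    let mut heap: BinaryHeap<i64> = BinaryHeap::with_capacity(k);
    for row in rows {
        if heap.len() < k {
            heap.push(row);
        } else if let Some(mut top) = heap.peek_mut() {
            // Dislodge the current maximum if the new row beats it; the
            // heap is restored in O(log k) when `top` is dropped.
            if row < *top {
                *top = row;
            }
        }
    }
    let mut out = heap.into_vec();
    out.sort();
    out
}

fn main() {
    assert_eq!(top_k([9, 1, 8, 2, 7, 3], 3), vec![1, 2, 3]);
}
```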
Alternative 1: TopKMonoid
Overview of evaluation strategy: We create a TopKMonoid difference type for MonotonicTopKPlan that keeps a collection of up to $k$ rows in a priority queue, namely encoded as a max-heap. With this data structure, we can combine two instances in amortized $O(k)$, since each data structure is at most of size $k$. Now, we could have a special case for combining an instance with another that contains only a single element. In this case, the single element either does not make it into the partial top-k and is discarded, or it dislodges the maximum element in the partial top-k, which can be achieved in $O(\log k)$.
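A sketch of such a difference type follows, with illustrative names and integer rows; the singleton fast path runs in $O(\log k)$, and the general merge uses linear-time selection so that combining two instances stays within $O(k)$ (assuming $k \geq 1$).

```rust
use std::collections::BinaryHeap;

/// A diff that carries up to k rows, encoded as a max-heap.
#[derive(Clone, Debug)]
struct TopKMonoid {
    k: usize,
    heap: BinaryHeap<i64>,
}

impl TopKMonoid {
    fn singleton(k: usize, row: i64) -> Self {
        let mut heap = BinaryHeap::with_capacity(k);
        heap.push(row);
        TopKMonoid { k, heap }
    }

    fn plus(mut self, mut other: TopKMonoid) -> TopKMonoid {
        // Fast path: absorbing a single element costs O(log k).
        if other.heap.len() == 1 {
            let row = other.heap.pop().unwrap();
            if self.heap.len() < self.k {
                self.heap.push(row);
            } else if let Some(mut top) = self.heap.peek_mut() {
                if row < *top {
                    *top = row; // dislodge the current maximum
                }
            }
            return self;
        }
        // General merge: O(k) selection of the k smallest of the union.
        let mut all = self.heap.into_vec();
        all.extend(other.heap.into_vec());
        if all.len() > self.k {
            all.select_nth_unstable(self.k - 1);
            all.truncate(self.k);
        }
        TopKMonoid { k: self.k, heap: BinaryHeap::from(all) }
    }
}

fn main() {
    let a = TopKMonoid { k: 3, heap: BinaryHeap::from(vec![1, 4, 6]) };
    let merged = a.plus(TopKMonoid::singleton(3, 2));
    let mut rows = merged.heap.into_vec();
    rows.sort();
    assert_eq!(rows, vec![1, 2, 4]);
}
```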
Summarized Analysis:
Alternative 2: Special-Purpose Operator
Overview of evaluation strategy: We create a special-purpose operator that implements the classic strategy of keeping a priority queue, namely a max-heap. However, we are guaranteed to keep a single priority queue per instance of the operator, instead of having to merge priority queues as in the TopKMonoid strategy. Note that we could run an instance of the operator per Timely worker, so we need a final top-k stage that performs a reduction merging the partial top-k results produced by each operator instance.
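The two-stage shape can be sketched as follows, simulating workers with plain partitions of an integer input; the per-instance priority queue from the previous sketch is replaced by a simple sort-and-truncate for brevity.

```rust
/// Partial top-k over one worker's share of the input (a heap-based
/// implementation, as sketched earlier, would be used in practice).
fn partial_top_k(rows: &[i64], k: usize) -> Vec<i64> {
    let mut sorted = rows.to_vec();
    sorted.sort();
    sorted.truncate(k);
    sorted
}

fn main() {
    let k = 3;
    // Hypothetical per-worker shards of a monotonic input.
    let shards = vec![vec![9, 1, 8], vec![2, 7, 3], vec![6, 0, 5]];
    // Stage 1: one partial top-k per "worker" instance.
    let partials: Vec<i64> = shards.iter().flat_map(|s| partial_top_k(s, k)).collect();
    // Stage 2: a final reduction over at most num_workers * k rows.
    assert_eq!(partial_top_k(&partials, k), vec![0, 1, 2]);
}
```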
Summarized Analysis:
The considerations here are similar to those for MonotonicTop1Plan above. So some care needs to be exercised when maintaining a data structure to compute the top-k vs. ownership/copying of data.
BasicTopKPlan and MonotonicTopKPlan
Alternative 3: Special-Purpose Operator
Overview of evaluation strategy: We can create a special-purpose operator that computes partial top-k results per Timely worker, which are then combined by a final top-k stage performing a reduction. Even though at a high level this strategy is similar to Alternative 2 above, the special-purpose operator needs to be carefully constructed, since we cannot rely on monotonicity. In particular, an outline of the necessary data structures could be: 1) a row B-tree, indexing all retained input rows; 2) a top-k B-tree, holding the current top-k elements. These are operated as follows (see also the sketch after this description).
For every incoming row, if we are adding the row, we first record it in the row B-tree data structure. Then, we check if the row would make it into the top-k by querying the top-k B-tree for its maximum. If not, the row is discarded; otherwise, we dislodge the maximum element and insert the row into the top-k B-tree.
If, on the other hand, we are retracting the row, we first delete the row from the row B-tree. Then, we check if the row is in the top-k elements by looking it up in the top-k B-tree. If not, then nothing else needs to be done. Otherwise, we remove the row from the top-k B-tree. Then, we query the top-k B-tree to find the new maximum element $e$. Now, we can query the row B-tree to find the next element that is greater than $e$. This element is then inserted into the top-k B-tree.
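The outline above can be sketched with Rust's standard B-tree containers, glossing over row multiplicities as the rest of this document does (rows are assumed distinct) and using integers in ascending order as stand-in rows; all names are illustrative.

```rust
use std::collections::BTreeSet;
use std::ops::Bound;

struct TopKState {
    k: usize,
    rows: BTreeSet<i64>, // row B-tree: all retained input rows
    topk: BTreeSet<i64>, // top-k B-tree: the k smallest of `rows`
}

impl TopKState {
    fn insert(&mut self, row: i64) {
        self.rows.insert(row);
        if self.topk.len() < self.k {
            self.topk.insert(row);
        } else if let Some(&max) = self.topk.iter().next_back() {
            if row < max {
                // Dislodge the current maximum in favor of the new row.
                self.topk.remove(&max);
                self.topk.insert(row);
            }
        }
    }

    fn retract(&mut self, row: i64) {
        self.rows.remove(&row);
        if self.topk.remove(&row) {
            // Refill: the next candidate is the successor (in the row
            // B-tree) of the new top-k maximum e, or the global minimum
            // if the top-k emptied out.
            let next = match self.topk.iter().next_back().copied() {
                Some(e) => self
                    .rows
                    .range((Bound::Excluded(e), Bound::Unbounded))
                    .next()
                    .copied(),
                None => self.rows.iter().next().copied(),
            };
            if let Some(next) = next {
                self.topk.insert(next);
            }
        }
    }
}

fn main() {
    let mut state = TopKState { k: 2, rows: BTreeSet::new(), topk: BTreeSet::new() };
    for row in [5, 3, 9, 1] {
        state.insert(row);
    }
    assert_eq!(state.topk.iter().copied().collect::<Vec<_>>(), vec![1, 3]);
    state.retract(1);
    assert_eq!(state.topk.iter().copied().collect::<Vec<_>>(), vec![3, 5]);
}
```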
Summarized Analysis:
Since this alternative does not depend on monotonicity, rather than targeting only the MonotonicTopKPlan plan, we might wish to instead build this alternative into the BasicTopKPlan and not update the row B-tree when monotonicity can be asserted. However, the implementation complexity is higher overall. Additionally, with large $k$ comparable to $n$, we would be keeping 2x the input size allocated in this alternative.
A variation of Alternative 2 ended up being implemented in this PR. Similarly to Alternative 2, a special-purpose operator computes top-k results per worker as a thinning stage per timestamp. However, a data structure different from a binary heap is employed to reduce memory allocations. This leads to a slight asymptotic loss, aligning with $O(n \cdot \log n)$ for construction.
In order to bound results from an inter-timestamp perspective, the feedback loop is maintained, but now operating only on pre-thinned input. We can then assert that the input to the final top-k reduction is bounded by num_workers * k * constant_number_of_timestamps. As such, it is safe to render the final top-k reduction with a single stage, instead of a full stack of arrangements as for the BasicTopKPlan. By reusing the existing rendering for the final top-k stage, update (1 row) complexity aligns with $O(k \cdot \log k)$, which is thus favorable for small $k$.
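To illustrate the implemented flavor's thinning step under the description above, here is a sketch that buffers each timestamp's rows in a plain vector and then sorts and truncates, trading the heap's $O(n \cdot \log k)$ for $O(n \cdot \log n)$ while avoiding per-element heap maintenance; the names and types are illustrative rather than taken from the PR.

```rust
use std::collections::BTreeMap;

type Timestamp = u64;

/// Per-worker, per-timestamp thinning: buffer, sort, and keep only k rows.
fn thin_per_timestamp(
    input: impl IntoIterator<Item = (Timestamp, i64)>,
    k: usize,
) -> BTreeMap<Timestamp, Vec<i64>> {
    let mut buffers: BTreeMap<Timestamp, Vec<i64>> = BTreeMap::new();
    for (time, row) in input {
        buffers.entry(time).or_default().push(row);
    }
    for rows in buffers.values_mut() {
        rows.sort();      // O(n log n) within a timestamp
        rows.truncate(k); // at most k rows survive per worker and timestamp
    }
    // The final single-stage top-k reduction then sees at most
    // num_workers * k * number_of_timestamps rows in total.
    buffers
}

fn main() {
    let input = vec![(0, 9), (0, 1), (0, 8), (1, 2), (1, 7), (1, 3)];
    let thinned = thin_per_timestamp(input, 2);
    assert_eq!(thinned[&0], vec![1, 8]);
    assert_eq!(thinned[&1], vec![2, 3]);
}
```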