LDBC SNB Business Intelligence benchmark

To run this benchmark, you will need:

- inputs at an appropriate scale factor
- parameters at that same scale factor

You can download pre-generated datasets and parameters from https://github.com/ldbc/ldbc_snb_bi/blob/main/snb-bi-pre-generated-data-sets.md. (Note that validation parameters exist only for scale factor 10.)

What does the benchmark run?

LDBC BI runs a variety of analytical queries on a generated social-network-like graph. After an initial bulk load, several batches of updates are applied and the queries are rerun after each batch. The benchmark specifies a protocol for running "enough" queries on each batch of updates.

Each query of the benchmark takes parameters that tie it to the data. For example, query 1 takes a parameter datetime; it calculates statistics on the number of messages in each 'message length category', counting only messages created in the same year as datetime but before datetime itself. The generated parameter sets offer a number of interesting datetime values to choose from.
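To make the shape of query 1 concrete, here is a minimal Python sketch of the filter-and-group step as described above. It is an illustration only: the real query is SQL over the generated dataset, and the (creation time, length) tuples, the length_category helper, and its thresholds are assumptions made up for this example, not the benchmark's definitions.

```python
from collections import Counter
from datetime import datetime


def length_category(length: int) -> int:
    # Hypothetical thresholds, chosen only for illustration.
    if length < 40:
        return 0
    if length < 80:
        return 1
    if length < 160:
        return 2
    return 3


def message_counts(messages, cutoff: datetime) -> Counter:
    """Count messages per length category, keeping only messages created
    in the same year as `cutoff` and strictly before it."""
    counts = Counter()
    for created, length in messages:
        if created.year == cutoff.year and created < cutoff:
            counts[length_category(length)] += 1
    return counts


# Toy data: (creation time, message length) pairs.
messages = [
    (datetime(2011, 3, 1), 10),
    (datetime(2011, 6, 1), 100),
    (datetime(2011, 12, 31), 10),  # same year, but after the cutoff
    (datetime(2010, 6, 1), 10),    # before the cutoff, but a different year
]
print(message_counts(messages, datetime(2011, 12, 1)))
```

Only the first two toy messages pass both conditions, which is why different datetime parameter values can select very different slices of the data.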

The Umbra implementation does some pre-computation in manually materialized views. We translate those to true materialized views.

How do I run it?

The script init.sql does the initial bulk load. Before running it, download the scale factor 1 dataset and unzip it into test/ldbc-bi/bi-sf1-composite-merged. Then run find ${UMBRA_CSV_DIR} -name "*.csv.gz" -print0 | parallel -q0 gunzip to decompress each individual CSV file (this assumes UMBRA_CSV_DIR points at the directory holding the compressed CSV files).
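If GNU parallel is not available, the decompression step can be approximated with the Python standard library. The sketch below is an assumption-laden alternative to the find/parallel command above, not part of the benchmark tooling: gunzip_tree is a hypothetical helper name, it runs serially rather than in parallel, and it reads the same UMBRA_CSV_DIR environment variable.

```python
import gzip
import os
import shutil
from pathlib import Path


def gunzip_tree(root: Path) -> int:
    """Decompress every *.csv.gz under `root` next to the original and
    delete the compressed file afterwards, similar to plain `gunzip`."""
    count = 0
    # sorted() materializes the listing before we start modifying the tree.
    for gz_path in sorted(root.rglob("*.csv.gz")):
        csv_path = gz_path.with_suffix("")  # strips only the trailing ".gz"
        with gzip.open(gz_path, "rb") as src, open(csv_path, "wb") as dst:
            shutil.copyfileobj(src, dst)
        gz_path.unlink()
        count += 1
    return count


if __name__ == "__main__":
    # Does nothing unless UMBRA_CSV_DIR is set in the environment.
    root_dir = os.environ.get("UMBRA_CSV_DIR")
    if root_dir:
        print(f"decompressed {gunzip_tree(Path(root_dir))} files under {root_dir}")
```

For larger scale factors the serial loop will be noticeably slower than the parallel command, so this is mainly useful for scale factor 1.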

You can then run \i init.sql from a psql session connected to Materialize. It will take a few minutes to load the data and define the materialized views.

Run \i qXX.sql to run query XX. Each query has appropriate \set commands at the beginning to fill in a parameter value (the first one in the parameter set, arbitrarily).
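To try a query with a different parameter value, the \set lines can be regenerated from a parameter file. The following Python sketch shows one way this might be done. It assumes the parameter file is delimiter-separated text with a header row whose column names may carry a ":TYPE" suffix; that layout, the set_commands helper, and the sample content are all assumptions for illustration and should be checked against the actual parameter files.

```python
import csv
import io


def set_commands(param_csv: str, delimiter: str = "|") -> list[str]:
    """Build psql \\set lines from the first data row of a parameter file."""
    reader = csv.reader(io.StringIO(param_csv), delimiter=delimiter)
    header = next(reader)
    first_row = next(reader)
    lines = []
    for column, value in zip(header, first_row):
        name = column.split(":", 1)[0]  # drop an optional ":TYPE" suffix
        # Inside psql single quotes, backslashes and quotes need escaping.
        quoted = value.replace("\\", "\\\\").replace("'", "''")
        lines.append(f"\\set {name} '{quoted}'")
    return lines


# Made-up sample content, not copied from the real parameter files.
sample = (
    "datetime:DATETIME\n"
    "2011-12-01T00:00:00.000+00:00\n"
    "2012-06-01T00:00:00.000+00:00\n"
)
print("\n".join(set_commands(sample)))
```

The generated lines can be pasted over the existing \set commands at the top of a query file; picking a row other than the first only requires skipping more rows of the reader.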

Ideas, questions, and tasks

During the conversion process, each of the benchmark queries is run as a one-shot SELECT. It may be more interesting to treat the queries as materialized views; we would then want to track not just the total time spent querying in the benchmark, but also some measure of latency.

Local changes

We've manually reordered weights in the PathQ19 view of query 19 to accommodate the way delta joins hydrate (they follow the join plan of the first syntactic table, which happened to be a poor choice for this query).

TODO

- apply updates
  - do we want to measure "liveness" of the views as we run?
- fully automate locally/in staging
- load generator
