Bazel is an open source distributed build system maintained by Google, a fork of their internal build system known as Blaze.
tl;dr: These tips should get you started with Bazel:

- To generate a `BUILD.bazel` file, run `bin/bazel gen`.
- When running Bazel, a target is defined like `//src/catalog:mz_catalog`, where `//` signifies the root of our repository, `src/catalog` is the path to a directory containing a `BUILD.bazel` file, and `:mz_catalog` is a named target within that `BUILD.bazel` file.
- To see what targets are available you can use the `query` subcommand, e.g. `bin/bazel query //src/catalog/...`.
Bazel's main components for building code are "rules", which are provided by open source rule sets, e.g. `rules_rust`. When using a rule, e.g. `rust_library`, you define all of the inputs (e.g. source files) and extra parameters (e.g. compiler flags) required to build your target. Bazel then computes a build graph, which is used to order operations and determine when something needs to be rebuilt.
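As a sketch, a hand-written rule invocation in a `BUILD.bazel` file might look like the following (the crate name, flags, and dependency here are hypothetical; in practice our files are generated, as described below):

```starlark
load("@rules_rust//rust:defs.bzl", "rust_library")

rust_library(
    name = "my_crate",
    # Inputs: the source files for the target.
    srcs = glob(["src/**/*.rs"]),
    # Extra parameters: flags passed to the compiler.
    rustc_flags = ["--cfg=tokio_unstable"],
    # Dependencies become edges in Bazel's build graph.
    deps = [":some_other_crate"],
)
```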
A common annoyance with Bazel is that it operates in a sandbox. A build that otherwise succeeds on your machine might fail when run with Bazel because it has a different version of a compiler, or can't find some necessary file. This is a key feature though because it makes builds hermetic and allows Bazel to aggressively cache artifacts, which reduces build times.
## `bazelisk`

To use `bazel` you first need to install `bazelisk`, which is a launcher that automatically makes sure you have the correct version of Bazel installed.

Note: We have a `.bazelversion` file in our repository that ensures everyone is using the same version.
On macOS you can do this with Homebrew:
brew install bazelisk
For Linux distributions you'll need to grab a binary from their releases page and put it into your `PATH` as `bazel`:
chmod +x bazelisk-linux-amd64
sudo mv bazelisk-linux-amd64 /usr/local/bin/bazel
## `.bazelrc` file

Bazel has numerous command line options, which can be defined in a `.bazelrc` file to create different configurations that you run Bazel with. We have a `.bazelrc` in the root of our repository that defines several common build configurations, but it's also recommended that you create a `.bazelrc` in your home directory (i.e. `~/.bazelrc`) to customize how you run Bazel locally. Options specified in your home RC file will override those of the workspace RC file.
A good default to start with is:
# Bazel will use all but one CPU core, so your machine is still responsive.
common --local_resources=cpu="HOST_CPUS-1"
# Define a shared disk cache so builds from different Materialize repos can share artifacts.
build --disk_cache=~/.cache/bazel
# Optional. The workspace RC already sets a max disk cache size and artifact age,
# but you can override them if you have more limited disk space.
common --experimental_disk_cache_gc_max_size=40G
common --experimental_disk_cache_gc_max_age=7d
Bazel supports reading and writing artifacts to a remote cache. We currently have two set up in `i2` that are backed by S3 and running `bazel-remote`. One is accessible by developers and used by PR builds in CI; we treat this as semi-poisoned. The other is only accessible by CI and used for builds from `main` and tagged builds.
To enable remote caching as a developer you must do the following:

1. Make sure you have `tsh` installed.
2. Create `~/.config/materialize/build.toml` and add the following:

```toml
[bazel]
remote_cache = "teleport:bazel-remote-cache"
```
When running Bazel via `bin/bazel` we will read the build config and spawn a Teleport proxy via `tsh` if one isn't already running, then specify `--remote_cache` to `bazel` with the correct URL.
In some cases you might see a warning printed when calling `bin/bazel` indicating the Teleport proxy failed to start, e.g.
Teleport proxy failed to start, 'tsh' process already running!
existing 'tsh' processes: [10001]
exit code: 1
Generally this means there is a Teleport proxy already running that we've lost track of. You can fix this issue by terminating the existing `tsh` process with the PID specified in the warning message.
We maintain two remote caches in the "Materialize Core" AWS account stored under S3 buckets:

- `materialize-bazel-remote`: Used for PR builds and accessible by developers
- `materialize-bazel-remote-pa`: Used for main branch and tagged builds (CI only)

Each bucket contains two main folders, `cas.v2` and `ac`. To force Bazel to rebuild each cache from scratch, you can delete these folders. Note that you'll need the appropriate AWS permissions to perform these operations.
Bazel has been integrated into `mzbuild`, which means you can use it for other tools as well, like `mzimage` and `mzcompose`! To enable Bazel, specify the `--bazel` flag like you would specify the `--dev` flag, e.g. `bin/mzcompose --bazel ...`.
Otherwise Bazel can be used just like `cargo`, to build individual targets and run tests. We provide a thin wrapper around the `bazel` command in the form of `bin/bazel`. This sets up remote caching and provides the `fmt` and `gen` subcommands; otherwise it forwards all commands onto `bazel` itself.
All Rust crates in our Cargo Workspace have a `BUILD.bazel` file that defines the different build targets for the crate. You don't have to write these files; they are automatically generated from the crate's `Cargo.toml`. For more details see the Generating `BUILD.bazel` files section.
tl;dr: to build a crate, run `bin/bazel build //src/<crate-name>` from the root of the repo.
To determine what targets are available for a crate you can use the `query` subcommand, e.g.
$ bin/bazel query //src/adapter/...
//src/adapter:adapter
//src/adapter:mz_adapter
//src/adapter:mz_adapter_doc_test
//src/adapter:mz_adapter_lib_tests
//src/adapter:mz_adapter_parameters_tests
//src/adapter:mz_adapter_sql_tests
//src/adapter:mz_adapter_timestamp_selection_tests
Every Rust crate has at least one Bazel target, which is the name of the crate. In the example above the "adapter" crate has the target `mz_adapter`, so you can build the `mz_adapter` crate by running the following:
$ bin/bazel build //src/adapter:mz_adapter
For convenience we also alias the primary target to have the same name as the folder; in the example above we alias `mz_adapter` to `adapter`. This allows a shorthand syntax for building a crate:
# Builds the same target as the example above!
$ bin/bazel build //src/adapter
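Under the hood this shorthand uses Bazel's built-in `alias` rule; a sketch of what the generated `BUILD.bazel` entry might look like (the exact generated form may differ):

```starlark
# Alias the primary crate target to the directory name, so that
# `//src/adapter` resolves to `//src/adapter:mz_adapter`.
alias(
    name = "adapter",
    actual = ":mz_adapter",
)
```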
When adding a new crate to our workspace, follow the normal flow that you would with Cargo, e.g. run `cargo new --lib my_crate`. Once it's created you'll need to add an entry to the Bazel `WORKSPACE` in the root of our repository. In that file search for "crates_repository" and then find the manifests section; it should look something like this:
crates_repository(
name = "crates_io",
    # ...
manifests = [
"//:Cargo.toml",
"//:src/adapter-types/Cargo.toml",
<add your new crate to this list>
],
)
The `crates_repository` Bazel rule aggregates all of the third-party crates that we use and automatically generates `BUILD.bazel` files for them.
Once your new crate is added to `crates_repository`, run `bin/bazel gen` to generate a new `BUILD.bazel` file, and you should be all set!
Note: Support for running Rust tests with Bazel is still experimental. We're waiting on #29266.
Defined in a crate's `BUILD.bazel` are test targets. The following targets are automatically generated:

- `<crate_name>_lib_tests`
- `<crate_name>_doc_tests`
- `<crate_name>_<integration_test_file_name>_tests`
For example, at the time of writing the `ore` crate has three files underneath `ore/tests`: `future.rs`, `panic.rs`, and `task.rs`. As such the `BUILD.bazel` file for the `ore` crate has the following test targets:

- `mz_ore_lib_tests`
- `mz_ore_doc_tests`
- `mz_ore_future_tests`
- `mz_ore_panic_tests`
- `mz_ore_task_tests`
You can run the tests in `future.rs` by running the following command:
bin/bazel test //src/ore:mz_ore_future_tests
You can provide arguments to the underlying test binary with the `--test_arg` command line option. This allows you to provide a filter to Rust's test framework, e.g.
bin/bazel test //src/ore:mz_ore_future_tests --test_arg=catch_panic_async
This would run only the tests in `future.rs` matching the filter "catch_panic_async".
## `WORKSPACE`, `BUILD.bazel`, and `*.bzl` files

There are three kinds of files in our Bazel setup:

- `WORKSPACE`: Defines the root of our workspace; we only have one of these. This is where we load all of our rule sets, download remote repositories, and register toolchains.
- `BUILD.bazel`: Defines how a library/crate is built, where you use "rules". This is generally equivalent to a `Cargo.toml`, one per crate.
- `*.bzl`: Used to define new functions or macros that can be used in `BUILD.bazel` files, written in Starlark. As a general developer you should rarely if ever need to interact with these files.

## Generating `BUILD.bazel` files

tl;dr: run `bin/bazel gen` from the root of the repository.
Just like `Cargo.toml`, associated with every crate is a `BUILD.bazel` file that provides targets that Bazel can build. We auto-generate these files with `cargo-gazelle`, which developers can easily run via `bin/bazel gen`.
There are times though when `Cargo.toml` doesn't provide all of the information required to build a crate; for example, the `std::include_str!` macro adds an implicit dependency on the file being included. Bazel operates in a sandbox and thus will fail unless you tell it about the file! For these cases you can add the dependency via a `[package.metadata.cargo-gazelle.<target>]` section in the `Cargo.toml`. For example:
[package.metadata.cargo-gazelle.lib]
compile_data = ["path/to/my/file.txt"]
This will add `"path/to/my/file.txt"` to the `compile_data` attribute on the resulting `rust_library` Bazel target.
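In other words, the generated target would end up with the file listed in its `compile_data`, conceptually something like this (the crate name and glob are illustrative):

```starlark
load("@rules_rust//rust:defs.bzl", "rust_library")

rust_library(
    name = "my_crate",
    srcs = glob(["src/**/*.rs"]),
    # Added by the [package.metadata.cargo-gazelle.lib] section above;
    # makes the file visible inside Bazel's sandbox for include_str!.
    compile_data = ["path/to/my/file.txt"],
)
```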
## `cargo-gazelle`

`gazelle` is a semi-official `BUILD.bazel` file generator that supports Golang and protobuf. There exists a `gazelle_rust` plugin, but it's not yet mature enough to fit our needs. Still, it's important for productivity that developers who don't want to interact with Bazel shouldn't have to, so generating a `BUILD.bazel` file from a `Cargo.toml` is quite important.
Thus we decided to write our own generator, `cargo-gazelle`! It's not a plugin for the existing `gazelle` tool but theoretically could be. It's designed to be fully generic, with very few (if any) Materialize-specific configurations built in.
`cargo-gazelle` supports the following configuration in a `Cargo.toml` file.
# Configuration for the crate as a whole.
[package.metadata.cargo-gazelle]
# Will skip generating a BUILD.bazel entirely.
#
# If you specify this setting please include a reason at the top of the
# BUILD.bazel file explaining why we skip generating.
skip_generating = (true | false)
# Concatenate the specified string at the end of the generated BUILD.bazel file.
#
# This is largely an escape hatch and should be avoided if possible.
additive_content = "String"
# Configuration for the library target of the crate.
[package.metadata.cargo-gazelle.lib]
# Skip generating the library target.
skip = (true | false)
# Extra data that will be provided to the Bazel target at compile time.
compile_data = ["String Array"]
# Extra data that will be provided to the Bazel target at compile and run time.
data = ["String Array"]
# Extra flags for rustc.
rustc_flags = ["String Array"]
# Environment variables to set for rustc.
[package.metadata.cargo-gazelle.lib.rustc_env]
var1 = "my_value"
# By default Bazel enables all features of a crate, if provided we will
# _override_ that set with this list.
features_override = ["String Array"]
# Extra dependencies to include for the target.
extra_deps = ["String Array"]
# Extra proc-macro dependencies to include for the target.
extra_proc_macro_deps = ["String Array"]
# Configuration for the crate's build script.
[package.metadata.cargo-gazelle.build]
# Skip generating the build script target.
skip = (true | false)
# Extra data that will be provided to the Bazel target at compile time.
compile_data = ["String Array"]
# Extra data that will be provided to the Bazel target at compile and run time.
data = ["String Array"]
# Extra flags for rustc.
rustc_flags = ["String Array"]
# Environment variables to set for rustc.
[package.metadata.cargo-gazelle.build.rustc_env]
var1 = "my_value"
# Environment variables to set for the build script.
build_script_env = ["String Array"]
# Skip the automatic search for protobuf dependencies.
skip_proto_search = (true | false)
# Configuration for test targets in the crate.
#
# * Library tests are named "lib"
# * Doc tests are named "doc"
#
[package.metadata.cargo-gazelle.test.<name>]
# Skip generating the test target.
skip = (true | false)
# Extra data that will be provided to the Bazel target at compile time.
compile_data = ["String Array"]
# Extra data that will be provided to the Bazel target at compile and run time.
data = ["String Array"]
# Extra flags for rustc.
rustc_flags = ["String Array"]
# Environment variables to set for rustc.
[package.metadata.cargo-gazelle.test.<name>.rustc_env]
var1 = "my_value"
# Bazel test size.
#
# See <https://bazel.build/reference/be/common-definitions#common-attributes-tests>.
size = "String"
# Environment variables to set for test execution.
[package.metadata.cargo-gazelle.test.<name>.env]
var1 = "my_value"
# Configuration for binary targets of the crate.
[package.metadata.cargo-gazelle.binary.<name>]
# Skip generating the binary target.
skip = (true | false)
# Extra data that will be provided to the Bazel target at compile time.
compile_data = ["String Array"]
# Extra data that will be provided to the Bazel target at compile and run time.
data = ["String Array"]
# Extra flags for rustc.
rustc_flags = ["String Array"]
# Environment variables to set for rustc.
[package.metadata.cargo-gazelle.binary.<name>.rustc_env]
var1 = "my_value"
# Environment variables to set when running the binary.
[package.metadata.cargo-gazelle.binary.<name>.env]
var1 = "my_value"
If all else fails, the code that handles this configuration lives in `misc/bazel/cargo-gazelle`!
Bazel is designed to be run on a variety of hardware, operating systems, and system configurations. To manage this complexity Bazel has a concept of "constraints" to allow conditional configuration of rules, and "platforms" to manage hardware differences. There are three roles that a platform can serve: the host platform (the machine Bazel itself runs on), the execution platform (the machines that execute build actions), and the target platform (the machine the built output will run on).
The platforms that we build for are defined in `/platforms/BUILD.bazel`.
A common way to configure a build based on platform is to use the `select` function. This allows you to return different values depending on the platform we're targeting.
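For example, a `select` might pick platform-specific compiler flags like this (the constraint labels are standard `@platforms` values; the crate name and flags are illustrative):

```starlark
load("@rules_rust//rust:defs.bzl", "rust_library")

rust_library(
    name = "my_crate",
    srcs = glob(["src/**/*.rs"]),
    rustc_flags = select({
        # Keyed on the target platform's operating system.
        "@platforms//os:linux": ["--cfg=use_epoll"],
        "@platforms//os:macos": ["--cfg=use_kqueue"],
        "//conditions:default": [],
    }),
)
```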
Not necessarily related to platforms, but still defined in `/platforms/BUILD.bazel` are our custom build flags. Currently we have custom build settings for the following features:
While most build settings can get defined in the `.bazelrc`, these features require slightly more complex configuration. For example, if we're building with a sanitizer we need to disable `jemalloc`, because sanitizers commonly have their own allocator. To do this we create a new build flag with the `string_flag` rule from the Bazel Skylib rule set, and match on this using the `config_setting` rule that is built in to Bazel. The `config_setting` is then what we can match on in our `BUILD.bazel` files with a `select({ ... })` function.
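A minimal sketch of that pattern (the flag name and values here are hypothetical, not our actual settings):

```starlark
load("@bazel_skylib//rules:common_settings.bzl", "string_flag")

# Custom build flag, settable on the command line, e.g.
# `--//platforms:sanitizer=address`.
string_flag(
    name = "sanitizer",
    build_setting_default = "none",
)

# Matchable condition derived from the flag's value.
config_setting(
    name = "sanitizer_address",
    flag_values = {":sanitizer": "address"},
)
```

A `BUILD.bazel` file could then `select` on `:sanitizer_address` to, for example, drop the `jemalloc` dependency.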
Bazel has a specific framework to manage compiler toolchains. For example, instead of having to specify a Rust toolchain every time you use the `rust_library` rule, you instead register a global Rust toolchain that rules resolve during analysis.
Toolchains are defined and registered in the `WORKSPACE` file. We currently use Clang/LLVM to build C/C++ code (via the `toolchains_llvm` rule set), where the version is defined by the `LLVM_VERSION` constant. For Rust we support both stable and nightly, where the versions are defined by the `RUST_VERSION` and `RUST_NIGHTLY_VERSION` constants, respectively.
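Roughly, registering the C toolchain in a `WORKSPACE` looks something like this (the version number is a placeholder; our actual setup uses the `LLVM_VERSION` constant and custom toolchain archives):

```starlark
load("@toolchains_llvm//toolchain:rules.bzl", "llvm_toolchain")

# Downloads an LLVM distribution and generates a Bazel C/C++ toolchain.
llvm_toolchain(
    name = "llvm_toolchain",
    llvm_version = "17.0.6",
)

load("@llvm_toolchain//:toolchains.bzl", "llvm_register_toolchains")

# Makes the toolchain available for rules to resolve during analysis.
llvm_register_toolchains()
```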
Both `toolchains_llvm` and `rules_rust` have "process wrappers". These are small wrappers around `clang` and `rustc` that are able to inspect the absolute path they are being invoked from. Bazel does not expose absolute paths at all, so these wrappers are how arguments like `--remap-path-prefix` get set. These wrappers are helpful but can also cause issues like toolchains_llvm#421.
The upstream LLVM toolchains are very large and built for bespoke CPU architectures. While maybe not ideal, we build our own LLVM toolchains, which live in the MaterializeInc/toolchains repo. This ensures we're using the same version of `clang` across all architectures we support and greatly improves the speed of cold builds.

Note: The upstream LLVM toolchains are ~1 GiB and compressed with gzip; end-to-end they take about 3 minutes to download and set up. Our toolchains are ~80 MiB and compressed with zstd, and end-to-end take less than 30 seconds to download and set up.
Along with a C toolchain we also provide a system root for our builds. A system root contains things like `libc`, `libm`, and `libpthread`, as well as their associated header files. Our system roots also live in the MaterializeInc/toolchains repo.
For building Rust code we use `rules_rust`. Its primary component is the `crates_repository` rule.
## `crates_repository`

Normally when building a Rust library you define external dependencies in a `Cargo.toml`, and `cargo` handles fetching the relevant crates, generally from crates.io. The `crates_repository` rule does the same thing: we define a set of manifests (`Cargo.toml` files), and it will analyze them and create a Bazel repository containing all of the necessary external dependencies.
Then to build our crates, e.g. `mz-adapter`, we use the handy `all_crate_deps` macro. When using this macro in a `BUILD.bazel` file, it determines which package we're in (e.g. `mz-adapter`) and expands to all of the necessary external dependencies. Unfortunately it does not include dependencies from within our own workspace, so we still need to do a bit of manual work of specifying dependencies when writing our `BUILD.bazel` files.
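A sketch of how that looks in a `BUILD.bazel` file (the workspace-internal dependency shown is illustrative):

```starlark
load("@crates_io//:defs.bzl", "all_crate_deps")
load("@rules_rust//rust:defs.bzl", "rust_library")

rust_library(
    name = "mz_adapter",
    srcs = glob(["src/**/*.rs"]),
    deps = [
        # Workspace-internal dependencies must be listed manually.
        "//src/catalog:mz_catalog",
    ] + all_crate_deps(normal = True),  # external deps from Cargo.toml
)
```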
In the `WORKSPACE` file we define a "root" `crates_repository` named `crates_io`.
## `-sys` crates

There are some Rust crates that are wrappers around C libraries; for example, `decnumber-sys` is a wrapper around `libdecnumber`. `cargo-gazelle` will generate a Bazel target for the crate's build script, but it's likely this build script will fail because it can't find tools like `cmake` or our system root, or implicitly depends on some other C library.
The general approach we've used to get these crates to build is to duplicate the logic from the `-sys` crate's `build.rs` script into a Bazel target. See bazel/c_deps/rust-sys for some examples. Once you write a `BUILD.bazel` file for the C dependency, we add a `crate.annotation` in our `WORKSPACE` file that appends your newly written `BUILD.bazel` file to the one generated for the Rust crate.
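A sketch of such an annotation in the `WORKSPACE` file (the crate name and file path are illustrative):

```starlark
load("@rules_rust//crate_universe:defs.bzl", "crate", "crates_repository")

crates_repository(
    name = "crates_io",
    # ...
    annotations = {
        "decnumber-sys": [crate.annotation(
            # Appended to the BUILD.bazel generated for the crate.
            additive_build_file = "//bazel/c_deps/rust-sys:BUILD.decnumber.bazel",
        )],
    },
)
```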
Duplicating logic is never great, but having Bazel explicitly build these C dependencies provides better caching and more control over the process which unlocks features like cross language LTO.
There are a few C dependencies which are used both by a Rust `-sys` crate and another C dependency. For example, `zstd` is used by both the `zstd-sys` Rust crate and the `rocksdb` C library. For these cases, instead of depending on the version included via the Rust `-sys` crate, we "manually" include them by downloading the source files as an `http_archive`.
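A sketch of such a manual inclusion (the URL, version, checksum, and build file label are placeholders; the real definitions live in `bazel/c_deps/repositories.bzl`):

```starlark
load("@bazel_tools//tools/build_defs/repo:http.bzl", "http_archive")

http_archive(
    name = "zstd",
    # Placeholder values for illustration only.
    urls = ["https://example.com/zstd-1.5.0.tar.gz"],
    sha256 = "0000000000000000000000000000000000000000000000000000000000000000",
    strip_prefix = "zstd-1.5.0",
    # A hand-written BUILD file that tells Bazel how to build the C library.
    build_file = "//bazel/c_deps:BUILD.zstd.bazel",
)
```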
All cases of external C dependencies live in `bazel/c_deps/repositories.bzl`.
Nearly all of our Rust build scripts do a single thing, and that's generate Rust bindings to protobuf definitions. `rules_rust` includes rules for generating protobuf bindings when using Prost and Tonic, but they don't interact with Cargo build scripts very well. Instead we added a new crate called `build-tools` whose purpose is to abstract over whatever build system you're using and provide the tools a build script might need, like `protoc`.
For Bazel we provide the necessary tools via "runfiles", which are defined in the `data` field of the `rust_library` target. Bazel "runfiles" are a set of files that are provided at runtime execution. So in your build script, to get the current path of the `protoc` executable you would call `mz_build_tools::protoc` (example), which returns a different path depending on the build system currently being used.
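Conceptually, the generated target wires the tool in via `data`, something like this (the target name and protoc label are illustrative):

```starlark
load("@rules_rust//cargo:defs.bzl", "cargo_build_script")

cargo_build_script(
    name = "my_crate_build_script",
    srcs = ["build.rs"],
    # Runfiles: makes the protoc binary available to the build
    # script when it executes inside the sandbox.
    data = ["@protobuf//:protoc"],
)
```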
Development builds of Materialize include the current git hash in their version number. The sandbox that Bazel creates when building a Rust library does not include any git info, so attempts to get the current hash will fail.
But! Bazel has a concept of "stamping" builds, which allows you to provide local system information as part of the build process; this information is known as the workspace status. Generating the workspace status and providing it to Rust libraries requires a few steps, all of which are described in the `bazel/build-info/BUILD.bazel` file.
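For reference, stamping is typically driven by a workspace status command configured in a `.bazelrc` (the script path below is hypothetical; our actual wiring lives in `bazel/build-info`):

```
# Hypothetical example: the script's stdout becomes the workspace
# status, a set of key-value pairs (e.g. a git hash).
build --workspace_status_command=misc/bazel/workspace_status.sh
build --stamp
```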
Unfortunately this isn't the whole story though. It turns out workspace status and stamping builds cause poor remote cache performance. On a new build Bazel will regenerate the `volatile-status.txt` file used in workspace stamping, which causes any stamped libraries to not be fetched from the remote cache; see bazelbuild#10075. For us this caused a pretty serious regression in build times, so we came up with a workaround:

1. `mzbuild.py` will write out the current git hash to a temporary file.
2. The `build-info` Rust crate knows to read from this temporary file in a non-hermetic/side-channel way to get the git hash into the current build without invalidating the remote cache.

While definitely hacky, our side-channel for the git hash does provide a substantial improvement in build times, while providing similar guarantees to the Cargo build with respect to when the hash gets re-computed.