A 5-minute tour of the lakebench CLI. Get from zero to a measured benchmark
run on your laptop without touching any Python.
# pip — pick the engines you want; DuckDB has the smallest footprint
pip install 'lakebench[duckdb,tpch_datagen]'
Verify:
lakebench --version
lakebench --help
Using uv instead of pip? Every command below works with the same arguments;
just prefix it with uv run, e.g. uv run lakebench --version. To set up the dev
environment from a clone:
uv sync --group dev --extra duckdb --extra tpch_datagen
Install uv with:
curl -LsSf https://astral.sh/uv/install.sh | sh
lakebench datagen \
--benchmark tpch \
--scale-factor 1 \
--output /tmp/tpch_sf1
That writes the 8 TPC-H tables as Parquet under /tmp/tpch_sf1/. Use
--scale-factor 0.1 if you want it to finish in seconds.
You can run with no profile at all:
lakebench run \
--engine duckdb \
--benchmark tpch --scenario sf1 --scale-factor 1 \
--input-uri /tmp/tpch_sf1
--engine builds an ad-hoc profile inline. Local engines (duckdb, polars,
daft, sail) get a working-directory URI under $TMPDIR/lakebench-scratch
unless you override it with -E schema_or_working_directory_uri=....
Drop --engine and the CLI will auto-create ~/.lakebench.json the first
time, picking the first installed local engine (priority: duckdb → polars →
daft → spark → sail). You'll see one warning line:
WARNING lakebench: No profile config found — created starter at /home/you/.lakebench.json
(re-run with --engine to override).
After that, future runs use the saved default with no flags needed.
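The selection order described above can be sketched as a small shell function. This is only an illustration of the documented priority list, not lakebench's actual implementation; pick_default_engine is a hypothetical helper that takes the installed engines as arguments.

```shell
# Illustration only: mirrors the documented priority order, not lakebench's code.
pick_default_engine() {
  # Arguments: the engines that are installed, in any order.
  for candidate in duckdb polars daft spark sail; do
    for installed in "$@"; do
      if [ "$installed" = "$candidate" ]; then
        echo "$candidate"
        return 0
      fi
    done
  done
  return 1   # none of the known engines is installed
}

pick_default_engine sail polars   # prints: polars
```

With only sail and polars installed, polars wins because it sits earlier in the priority list.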
For more than one engine or non-default settings, create
./lakebench.json in the repo root (project-level):
{
  "defaults": { "profile": "local-duckdb" },
  "profiles": {
    "local-duckdb": {
      "engine": "duckdb",
      "engine_options": {
        "schema_or_working_directory_uri": "/tmp/lakebench-duckdb"
      }
    }
  }
}
Inspect what the CLI actually sees:
lakebench profiles list
lakebench profiles show local-duckdb
Then run the benchmark:
lakebench run \
--benchmark tpch \
--scenario sf1 \
--scale-factor 1 \
--input-uri /tmp/tpch_sf1
Because defaults.profile is set, you didn't need --profile. Add
--print-config (or --dry-run) first if you want to see the merged config
without actually launching an engine:
lakebench run --benchmark tpch --scenario sf1 \
--scale-factor 1 --input-uri /tmp/tpch_sf1 --print-config
Then inspect the results:
lakebench results latest                    # most recent run
lakebench results list --benchmark tpch     # filter by benchmark
lakebench results show <run_id_prefix>      # 6-char prefix is enough
lakebench results stats --benchmark tpch    # n / mean / p50 / p95
Runs land in ./results/ by default; change this with --results-dir DIR or by
setting LAKEBENCH_RESULTS_DIR.
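To keep results out of the working tree, one option is to export LAKEBENCH_RESULTS_DIR once per shell session. A minimal sketch, assuming the variable is read from the environment; the directory path is an arbitrary example, and the writability check is a precaution added here rather than something lakebench requires:

```shell
# Example location only; any writable directory works.
export LAKEBENCH_RESULTS_DIR="${TMPDIR:-/tmp}/lakebench-results"

# Create the directory up front and confirm it is writable.
mkdir -p "$LAKEBENCH_RESULTS_DIR"
if [ -w "$LAKEBENCH_RESULTS_DIR" ]; then
  echo "results will be written to $LAKEBENCH_RESULTS_DIR"
else
  echo "cannot write to $LAKEBENCH_RESULTS_DIR" >&2
fi
```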
Pointing LakeBench at a Fabric workspace or Databricks catalog for the first time? Ask it what's there:
lakebench discover --profile my-fabric
Example output:
catalog        schema        benchmark         confidence  matched/expected
spark_catalog  tpcds_sf1000  tpcds | eltbench  100%        24/24
spark_catalog  tpch_sf1000   tpch              100%        8/8
spark_catalog  clickbench    clickbench        100%        1/1
Now you know which schema to pass as --input-uri / schema_name in a
subsequent lakebench run. Also works with --engine duckdb against a local
scratch dir. --min-confidence 0.8 hides partial matches; --format json
emits machine-readable output for scripting.
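For quick interactive use, the table format can be filtered with standard tools. The sketch below runs awk over a copy of the example output rather than a live lakebench discover call. schema_for_benchmark is a hypothetical helper, and it assumes whitespace-separated columns with confidence and matched/expected as the last two fields. For real scripting, --format json is the safer input, but its schema is not shown here.

```shell
# Sample copied from the example output; real column spacing may differ.
sample_output='catalog schema benchmark confidence matched/expected
spark_catalog tpcds_sf1000 tpcds | eltbench 100% 24/24
spark_catalog tpch_sf1000 tpch 100% 8/8
spark_catalog clickbench clickbench 100% 1/1'

# Print the schema of the first row whose benchmark column lists the given name.
schema_for_benchmark() {
  awk -v want="$1" '
    NR == 1 { next }   # skip the header row
    {
      # Fields 3 .. NF-2 hold the benchmark name(s); the last two fields
      # are confidence and matched/expected.
      for (i = 3; i <= NF - 2; i++) {
        if ($i == want) { print $2; exit }
      }
    }'
}

tpch_schema=$(printf '%s\n' "$sample_output" | schema_for_benchmark tpch)
echo "$tpch_schema"   # prints: tpch_sf1000
```

The printed schema is the value to pass on to a subsequent lakebench run.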
Once discover tells you what's in the lakehouse, run queries against it
without re-loading. Use --mode query, --database <schema>, and (for
multi-catalog engines) --catalog <name>:
# Fabric / Synapse / HDInsight via Livy
lakebench run --profile my-fabric \
--benchmark tpcds --scenario sf1000 --scale-factor 1000 \
--database tpcds_sf1000 --mode query
# Databricks (Unity Catalog or hive_metastore)
lakebench run --profile my-databricks \
--benchmark tpch --scenario sf100 --scale-factor 100 \
--catalog hive_metastore --database tpch_sf100 --mode query
--database (alias: --schema) overlays onto engine_options.schema_name,
and --catalog onto engine_options.catalog_name. Queries are auto-qualified
with the resolved catalog/schema, so no SQL edits are required.
Before debugging a flaky run, ask the CLI to self-check:
lakebench doctor
lakebench doctor --profile local-duckdb
This catches missing extras, a broken profile, datagen tools not on PATH, an
unwritable results dir, and a missing or unauthenticated az CLI when any
profile uses auth: az (Fabric / Databricks / Synapse / HDInsight).
Two override flags, last-one-wins, deep-merged into the profile:
# -E: any key under engine_options (JSON-aware, dotted nesting)
lakebench run --benchmark tpch --scenario sf1 \
--scale-factor 1 --input-uri /tmp/tpch_sf1 \
-E "compute_stats_all_cols=true"
# --conf: shortcut for engine_options.session_conf.<key>
lakebench run --benchmark tpch --scenario sf1 ... \
--conf spark.sql.shuffle.partitions=200
Both also have file forms: --engine-options-file foo.json and
--conf-file foo.properties.
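Last-one-wins means that when the same key is given twice, the later value replaces the earlier one. A minimal sketch of that rule for flat key=value pairs follows. resolve_overrides is a hypothetical helper, not part of lakebench, and it does not model dotted nesting, deep merging, or JSON-aware parsing.

```shell
resolve_overrides() {
  # Each argument is one key=value pair, in command-line order.
  printf '%s\n' "$@" | awk '
    {
      eq = index($0, "=")
      if (eq == 0) next                        # ignore arguments without "="
      key = substr($0, 1, eq - 1)
      if (!(key in value)) order[++n] = key    # remember first-seen order
      value[key] = substr($0, eq + 1)          # later pairs overwrite earlier ones
    }
    END { for (i = 1; i <= n; i++) print order[i] "=" value[order[i]] }'
}

resolve_overrides \
  spark.sql.shuffle.partitions=200 \
  spark.sql.adaptive.enabled=true \
  spark.sql.shuffle.partitions=64
# prints:
#   spark.sql.shuffle.partitions=64
#   spark.sql.adaptive.enabled=true
```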
# bash
eval "$(lakebench --shell-init bash)"
# zsh
eval "$(lakebench --shell-init zsh)"
# fish
lakebench --shell-init fish | source
These enable tab completion. They require argcomplete (pip install argcomplete); without it, the commands are a no-op.
| Task | Command |
|---|---|
| List supported run modes for a benchmark | lakebench list-modes tpch |
| Compare two runs side-by-side | lakebench results compare <a> <b> |
| Tag a run | lakebench results tag <run_id> baseline production |
| Add a note | lakebench results notes <run_id> "warm cache, after vacuum" |
| Export to CSV / Markdown | lakebench results export --format md --output report.md |
| Purge old runs | lakebench results purge --older-than 30d |
| Get full traceback on error | add --debug |
| Continue past engine crash, exit 2 instead of 3 | add --continue-on-error |
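The last row implies that a wrapper script can tell a partially completed run from an aborted one by its exit status. A sketch, assuming 0 means success as usual and using only the two non-zero codes named in the table; describe_exit is a hypothetical helper, and any other codes are not documented here:

```shell
describe_exit() {
  case "$1" in
    0) echo "run completed" ;;
    2) echo "run continued past an engine crash (--continue-on-error)" ;;
    3) echo "run stopped on an engine crash" ;;
    *) echo "unrecognized exit status $1" ;;
  esac
}

# Typical use right after a lakebench run command (shown as a comment so the
# sketch runs anywhere):
#   describe_exit "$?"
describe_exit 2
```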
- docs/cli-reference.md: every flag, every subcommand, all defaults.
- docs/install-fabric.md: Fabric-specific install + first run.
- docs/install-databricks.md: Databricks-specific install + first run.
- README.md: Python-API usage, custom benchmarks/engines.
- lakebench doctor: first stop when something doesn't work.