Add quickstart and engine benchmark comparison to README (#497)

gabotechs · web-flow · commit 4e7303e388fd · 2026-06-14T18:54:22.000+02:00
## What

Reworks the README to help new users get started and to surface our benchmark results.

### Quickstart

- Adds a runnable two-binary walkthrough (worker + coordinator) querying a public ClickHouse Parquet dataset over HTTP.
- `Cargo.toml` and `use` statements are tucked into `<details>` blocks to keep the steps code-forward.

### Benchmarks

- Adds a headline chart comparing DataFusion Distributed against Ballista, Spark, and Trino across TPC-H (SF1/SF10/SF100) and TPC-DS SF1, plus per-dataset charts and a totals table in a `<details>` block.
- Methodology: each engine's total is the sum of per-query median (p50) latencies over the queries all compared engines completed successfully (lower is better). df-dist numbers come from the `dynamic-task-count` branch.
- **Conditions**: 12× AWS EC2 `c5n.2xlarge` instances reading Parquet files from Amazon S3.
- Ballista is marked N/A on TPC-H SF100 (only completes 4/22 queries, see #1836).
- Charts are stored under `docs/source/_static/images/` and optimized (PNG-8 palette, ~9-16k each).
- Removes the now-unused `dist-df-vs-df-vs-trino.png`.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
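The methodology described above (sum of per-query p50 latencies, restricted to the queries every compared engine completed) can be sketched as follows. This is a minimal illustration, not the benchmark harness from `benchmarks/cdk`: the names `Runs`, `median_ms`, `totals_over_common_queries`, and `toy_runs` are hypothetical, the numbers are made-up toy values rather than measured results, and treating an empty sample list as "query did not complete" is an assumption.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// engine name -> query name -> latency samples in milliseconds.
/// Assumption: an empty sample list means the query did not complete.
type Runs<'a> = BTreeMap<&'a str, BTreeMap<&'a str, Vec<f64>>>;

/// Median (p50) of a set of latency samples, or `None` if there are none.
fn median_ms(samples: &[f64]) -> Option<f64> {
    if samples.is_empty() {
        return None;
    }
    let mut sorted = samples.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));
    let mid = sorted.len() / 2;
    Some(if sorted.len() % 2 == 0 {
        (sorted[mid - 1] + sorted[mid]) / 2.0
    } else {
        sorted[mid]
    })
}

/// Returns each engine's total (sum of per-query medians) over the queries
/// that every engine completed, plus the number of queries compared.
fn totals_over_common_queries<'a>(runs: &Runs<'a>) -> (BTreeMap<&'a str, f64>, usize) {
    // Intersect the sets of completed queries across all engines.
    let mut common: Option<BTreeSet<&'a str>> = None;
    for queries in runs.values() {
        let completed: BTreeSet<&'a str> = queries
            .iter()
            .filter(|(_, samples)| !samples.is_empty())
            .map(|(query, _)| *query)
            .collect();
        common = Some(match common {
            None => completed,
            Some(prev) => prev.intersection(&completed).copied().collect(),
        });
    }
    let common = common.unwrap_or_default();

    // Sum the per-query medians over the common queries only.
    let totals: BTreeMap<&'a str, f64> = runs
        .iter()
        .map(|(engine, queries)| {
            let total: f64 = common
                .iter()
                .filter_map(|query| queries.get(query))
                .filter_map(|samples| median_ms(samples))
                .sum();
            (*engine, total)
        })
        .collect();
    (totals, common.len())
}

/// Made-up toy data: `engine_b` fails `q2`, so `q2` is excluded for everyone.
fn toy_runs() -> Runs<'static> {
    let mut runs: Runs<'static> = BTreeMap::new();
    runs.insert(
        "engine_a",
        BTreeMap::from([
            ("q1", vec![100.0, 120.0, 110.0]),
            ("q2", vec![200.0, 220.0]),
            ("q3", vec![50.0]),
        ]),
    );
    runs.insert(
        "engine_b",
        BTreeMap::from([
            ("q1", vec![300.0, 280.0, 290.0]),
            ("q2", vec![]),
            ("q3", vec![70.0, 90.0]),
        ]),
    );
    runs
}

fn main() {
    let (totals, compared) = totals_over_common_queries(&toy_runs());
    println!("queries compared: {compared}");
    for (engine, total) in &totals {
        println!("{engine}: {total} ms");
    }
}
```

Restricting the sum to the common query set is what makes the "Queries compared" column matter: an engine that fails a query is not rewarded with a smaller total, and the totals across engines stay comparable.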
diff --git a/README.md b/README.md
@@ -1,16 +1,225 @@
 # DataFusion Distributed
 
-Library that brings distributed execution capabilities to [Apache DataFusion](https://github.com/apache/datafusion).
+Library for building distributed query execution engines based on [Apache DataFusion](https://github.com/apache/datafusion).
 
 > [!NOTE]  
-> This is project is not part of Apache DataFusion
+> This project is not part of Apache DataFusion
+
+## Quickstart
+
+Starting with the following dependencies:
+
+<details>
+<summary><code>Cargo.toml</code></summary>
+
+```toml
+[package]
+name = "datafusion-distributed-quick-start"
+version = "0.1.0"
+edition = "2024"
+default-run = "main"
+
+[dependencies]
+datafusion = "54"
+datafusion-distributed = "2"
+object_store = { version = "0.13.2", features = ["http"] }
+tokio = { version = "1", features = ["full"] }
+tonic = "0"
+url = "2"
+alloc-no-stdlib = "=2.0.4"
+
+[patch.crates-io]
+# Workaround for https://github.com/dropbox/rust-brotli/issues/256 (unrelated to this project). This patch can
+# be removed as soon as rust-brotli fixes it.
+alloc-no-stdlib = { git = "https://github.com/dropbox/rust-alloc-no-stdlib", rev = "6032b6a9b20e03737135c55a0270ccffcc1438ef" }
+alloc-stdlib = { git = "https://github.com/dropbox/rust-alloc-no-stdlib", rev = "6032b6a9b20e03737135c55a0270ccffcc1438ef" }
+
+[[bin]]
+name = "main"
+path = "src/main.rs"
+
+[[bin]]
+name = "worker"
+path = "src/worker.rs"
+
+```
+
+</details>
+
+---
+
+`src/worker.rs`: Spawn a Distributed DataFusion worker on a localhost port.
+
+<details>
+<summary><code>use</code> statements</summary>
+
+```rust
+use datafusion::execution::runtime_env::RuntimeEnvBuilder;
+use datafusion_distributed::Worker;
+use object_store::http::HttpBuilder;
+use std::net::{IpAddr, Ipv4Addr, SocketAddr};
+use std::sync::Arc;
+use tonic::transport::Server;
+use url::Url;
+```
+
+</details>
+
+```rust
+#[tokio::main]
+async fn main() -> Result<(), Box<dyn std::error::Error>> {
+    // 1. Spawn a Distributed DataFusion worker on a localhost port.
+    let base = Url::parse("https://datasets.clickhouse.com")?;
+    let store = HttpBuilder::new().with_url(base.clone()).build()?;
+    let runtime = RuntimeEnvBuilder::new().build_arc()?;
+    runtime.register_object_store(&base, Arc::new(store));
+
+    let worker = Worker::default().with_runtime_env(runtime);
+
+    let port = std::env::var("PORT")?.parse()?;
+    let addr = SocketAddr::new(IpAddr::V4(Ipv4Addr::LOCALHOST), port);
+    println!("Distributed DataFusion worker listening on {addr}...");
+    Ok(Server::builder()
+        .add_service(worker.into_worker_server())
+        .serve(addr)
+        .await?)
+}
+```
+
+---
+
+`src/main.rs`: Prepare the `SessionContext` with all the pieces necessary to communicate with the workers above.
+
+<details>
+<summary><code>use</code> statements</summary>
+
+```rust
+use datafusion::error::DataFusionError;
+use datafusion::execution::SessionStateBuilder;
+use datafusion::prelude::{ParquetReadOptions, SessionConfig, SessionContext};
+use datafusion_distributed::{
+    DistributedExt, SessionStateBuilderExt, WorkerResolver, display_plan_ascii,
+};
+use object_store::http::HttpBuilder;
+use std::sync::Arc;
+use url::Url;
+```
+
+</details>
+
+```rust
+#[tokio::main]
+async fn main() -> Result<(), Box<dyn std::error::Error>> {
+    // 2. Create a WorkerResolver implementation that knows how to resolve
+    //    Distributed DataFusion workers running remotely.
+    let workers = std::env::var("WORKERS").unwrap_or_default();
+    let mut urls: Vec<Url> = vec![];
+    for port in workers.split(",").filter(|v| !v.is_empty()) {
+        urls.push(Url::parse(&format!("http://127.0.0.1:{port}"))?);
+    }
+
+    struct LocalhostWorkerResolver(Vec<Url>);
+    impl WorkerResolver for LocalhostWorkerResolver {
+        fn get_urls(&self) -> Result<Vec<Url>, DataFusionError> {
+            Ok(self.0.clone())
+        }
+    }
+
+    // 3. Build the SessionContext as usual. Distributed queries will use
+    //    this SessionContext as a "coordinator", as it'll be in charge of
+    //    distributed planning + fanning out tasks to workers.
+    let state = SessionStateBuilder::new()
+        .with_default_features()
+        .with_config(SessionConfig::new().with_information_schema(true))
+        .with_distributed_worker_resolver(LocalhostWorkerResolver(urls))
+        .with_distributed_planner()
+        // A very low value forces queries to be heavily distributed.
+        .with_distributed_file_scan_config_bytes_per_partition(1)?
+        .build();
+
+    let ctx = SessionContext::from(state);
+
+    let base = Url::parse("https://datasets.clickhouse.com")?;
+    let store = HttpBuilder::new().with_url(base.clone()).build()?;
+    ctx.register_object_store(&base, Arc::new(store));
+    ctx.register_parquet(
+        "hits",
+        "https://datasets.clickhouse.com/hits_compatible/athena_partitioned/hits_0.parquet",
+        ParquetReadOptions::default(),
+    )
+    .await?;
+
+    // 4. Issue the SQL query, and get a nice visualization for distributed plans 
+    //    with `display_plan_ascii`.
+    let sql = std::env::args().skip(1).collect::<Vec<_>>().join(" ");
+    let df = ctx.sql(&sql).await?;
+    let plan = df.create_physical_plan().await?;
+    println!("{}", display_plan_ascii(plan.as_ref(), false));
+    df.show().await?;
+
+    Ok(())
+}
+```
+
+---
+
+Start a few workers, each on its own port:
+
+```sh
+PORT=8080 cargo run --bin worker &
+PORT=8081 cargo run --bin worker &
+PORT=8082 cargo run --bin worker &
+```
+
+Then point the main script at them and run a query:
+
+```sh
+WORKERS=8080,8081,8082 cargo run -- "SELECT COUNT(*), AVG(\"ResolutionWidth\") FROM hits"
+```
+
+You'll see the distributed plan printed, followed by the query results.
+
+## Benchmarks
+
+In the runs summarized below, DataFusion Distributed had the lowest total latency on every dataset tested (TPC-H
+SF1/SF10/SF100 and TPC-DS SF1). The chart shows how much slower each engine is relative to DataFusion Distributed
+(lower is better):
+
+![How much slower than DataFusion Distributed?](./docs/source/_static/images/summary_relative.png)
+
+<details>
+<summary>Per-dataset totals</summary>
+
+| Dataset     | df-dist |                                                                                Ballista | Spark | Trino | Queries compared |
+|-------------|--------:|----------------------------------------------------------------------------------------:|------:|------:|-----------------:|
+| TPC-H SF1   |  **7s** |                                                                                     11s |   30s |   18s |               22 |
+| TPC-H SF10  | **10s** |                                                                                     57s |   51s |   33s |               22 |
+| TPC-H SF100 | **42s** | N/A ([#1836](https://github.com/datafusion-contrib/datafusion-distributed/issues/1836)) |  261s |   93s |               19 |
+| TPC-DS SF1  | **29s** |                                                                                     72s |  101s |   85s |               67 |
+
+![TPC-H SF1](./docs/source/_static/images/tpch_sf1.png)
+![TPC-H SF10](./docs/source/_static/images/tpch_sf10.png)
+![TPC-H SF100](./docs/source/_static/images/tpch_sf100.png)
+![TPC-DS SF1](./docs/source/_static/images/tpcds_sf1.png)
+
+</details>
+
+**Conditions.** All engines ran on the same cluster: 12 AWS EC2 `c5n.2xlarge` instances (8 vCPUs and
+21 GiB of memory each, with up to 25 Gbps networking) reading Parquet files stored in Amazon S3. Each
+engine's total is the sum of per-query median (p50) latencies over the queries that all compared engines
+completed successfully; lower is better.
+
+The benchmarking code is public and open for anyone to reproduce. It uses AWS CDK to automate the creation
+of the benchmarking cluster, so the same setup can be deployed in your own AWS account. The code can
+be found in the [benchmarks/cdk](./benchmarks/cdk) directory.
 
 ## What can you do with this crate?
 
-This crate is a toolkit that extends [Apache DataFusion](https://github.com/apache/datafusion) with distributed
-capabilities,
-providing a developer experience as close as possible to vanilla DataFusion while being unopinionated about the
-networking stack used for hosting the different workers involved in a query.
+This crate is a toolkit that extends [Apache DataFusion](https://github.com/apache/datafusion) with distributed capabilities, providing a developer 
+experience as close as possible to vanilla DataFusion while being unopinionated about the networking stack.
+
+It's not an out-of-the-box distributed engine; instead, it's a library for building distributed query engines with
+sane defaults for when the data sources are just files.
 
 Users of this library can expect to take their existing single-node DataFusion-based systems and add distributed
 capabilities with minimal changes.
@@ -21,49 +230,14 @@ capabilities with minimal changes.
   a familiar API for building applications.
 - Unopinionated about networking. This crate does not take any opinion about the networking stack, and users are
   expected to leverage their own infrastructure for hosting DataFusion nodes.
-- No coordinator-worker architecture. To keep infrastructure simple, any node can act as a coordinator or a worker.
-
-# Benchmarks
-
-The benchmarking code is public an open for anyone to easily reproduce. It uses AWS CDK for automating the creation
-of the benchmarking cluster so that anyone can reproduce the same results in their own AWS account. The code can
-be found in the [benchmarks/cdk](./benchmarks/cdk) directory.
-
-### TPC-H SF1
-
-![benchmarks_sf1.png](https://github.com/user-attachments/assets/2f922066-7382-4c31-9e76-74b1ca053bfc)
-
-### TPC-H SF10
-
-![benchmarks_sf10.png](https://github.com/user-attachments/assets/08fd3090-92bf-43fd-b80c-12e3a127e724)
+- No coordinator-worker split. To keep infrastructure simple, any node can act as a coordinator or a worker.
 
 # Docs
 
-The user  guide can be found here:
+The user guide can be found here:
 
 https://datafusion-contrib.github.io/datafusion-distributed
 
 If you'd like to contribute, see the contributor guide:
 
 https://datafusion-contrib.github.io/datafusion-distributed/contributor-guide/index.html
-
-## Getting familiar with distributed DataFusion
-
-There are some runnable examples showcasing how to provide a localhost implementation for Distributed DataFusion in
-[examples/](examples):
-
-- [localhost_worker.rs](examples/localhost_worker.rs): code that spawns a Worker listening for physical
-  plans over the network.
-- [localhost_run.rs](examples/localhost_run.rs): code that distributes a query across the spawned Workers and executes
-  it.
-
-The integration tests also provide an idea about how to use the library and what can be achieved with it:
-
-- [tpch_validation_test.rs](tests/tpch_plans_test.rs): executes all TPCH queries and performs assertions over the
-  distributed plans.
-- [custom_config_extension.rs](tests/custom_config_extension.rs): showcases how to propagate custom DataFusion config
-  extensions.
-- [custom_extension_codec.rs](tests/custom_extension_codec.rs): showcases how to propagate custom physical extension
-  codecs.
-- [distributed_aggregation.rs](tests/distributed_aggregation.rs): showcases how to manually place `ArrowFlightReadExec`
-  nodes in a plan and build a distributed query out of it.
diff --git a/docs/source/_static/images/dist-df-vs-df-vs-trino.png b/docs/source/_static/images/dist-df-vs-df-vs-trino.png
diff --git a/docs/source/_static/images/summary_relative.png b/docs/source/_static/images/summary_relative.png
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:8e4daeee88e5ee6498ab603578953221e41d73f1ff130af85d3c9cee91351e03
+size 16913
diff --git a/docs/source/_static/images/tpcds_sf1.png b/docs/source/_static/images/tpcds_sf1.png
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:352f994011a0a965d56c155b9219e24e7adc82cc75ef055b9627968f19a3c121
+size 10133
diff --git a/docs/source/_static/images/tpch_sf1.png b/docs/source/_static/images/tpch_sf1.png
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:fce1d3536639544f3d0d6faf4c07d47ce1c1d6232cd9fefdaef9ade363e5763d
+size 9785
diff --git a/docs/source/_static/images/tpch_sf10.png b/docs/source/_static/images/tpch_sf10.png
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:238ba60b9f35f33a8ab1b052e31cf6f258d6b4a22f70773d129777bf67f041b4
+size 10072
diff --git a/docs/source/_static/images/tpch_sf100.png b/docs/source/_static/images/tpch_sf100.png
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:5abdb5af7cf714a25fc26af7559419d5df2b115995e949a4c3789f393fb80f1b
+size 11394