Skip to content

Commit 06e3fc5

Browse files
claudeconnortsui20
authored andcommitted
benchmarks-website/server: snapshot in Vortex format, not CSV
Switches /api/admin/snapshot from `EXPORT DATABASE … (FORMAT csv)` to per-table `COPY (SELECT * FROM <table>) TO … (FORMAT vortex)`. Dogfoods the project's own format and compresses an order of magnitude better than gzipped CSV on this shape (BIGINT[] runtime arrays + short strings). - schema.rs grows a `pub const TABLES: &[&str]` so the snapshot loop has a stable list to iterate. - admin::snapshot writes `schema.sql` from `SCHEMA_DDL` verbatim, then `INSTALL vortex FROM community; LOAD vortex;` (idempotent — autoload is enabled in the bundled libduckdb-sys), then one COPY per table. - ops/README.md restore section rewritten: untar → `.read schema.sql` → `INSERT … SELECT * FROM read_vortex(<file>)` per table. - Two snapshot tests are marked `#[ignore]` because they need outbound network to fetch the vortex extension. Run them by hand before merge: cargo test -p vortex-bench-server --test admin -- --ignored Signed-off-by: Claude <noreply@anthropic.com>
1 parent 9a52e88 commit 06e3fc5

5 files changed

Lines changed: 122 additions & 55 deletions

File tree

benchmarks-website/ops/README.md

Lines changed: 31 additions & 17 deletions
Original file line numberDiff line numberDiff line change
@@ -23,14 +23,14 @@ out-of-tree state — every script and unit lives in
2323
range touch website-relevant paths it builds, atomically swaps the
2424
binary, and restarts the server. Otherwise it fast-forwards the
2525
working tree and exits.
26-
- A second timer fires hourly, asks the server to `EXPORT DATABASE` to
27-
a CSV snapshot, `tar czf`s it into a single archive, and uploads it
28-
to `s3://vortex-benchmark-results-database/v3-backups/<UTC ts>.tar.gz`.
29-
CSV is the default DuckDB EXPORT format (only format the `bundled`
30-
libduckdb-sys feature ships with), and gzip reclaims 5–7× on this
31-
shape — most data lands in the BIGINT[] runtime columns which
32-
bloat 2–3× as text. Flipping to parquet or a Vortex layout later is
33-
a one-line change in [`server/src/admin.rs`](../server/src/admin.rs).
26+
- A second timer fires hourly, asks the server to write a per-table
27+
Vortex snapshot (`schema.sql` + one `<table>.vortex` per table),
28+
`tar czf`s it, and uploads to
29+
`s3://vortex-benchmark-results-database/v3-backups/<UTC ts>.tar.gz`.
30+
The vortex DuckDB extension is auto-installed from the community
31+
repo on first call. Vortex compresses the BIGINT[] runtime arrays
32+
and string columns roughly an order of magnitude better than
33+
gzipped CSV — and dogfoods the project's own format.
3434
- For ad-hoc reads, `inspect.sh` calls a bearer-gated `/api/admin/sql`
3535
endpoint instead of stopping the server.
3636
- For DB-replacing operations (re-running the v2→v3 migration),
@@ -52,7 +52,7 @@ out-of-tree state — every script and unit lives in
5252
│ bin/ │
5353
│ vortex-bench-server ← symlink → versioned binary │
5454
│ vortex-bench-server.<ts> ← versioned, last $KEEP_BINARIES (3) │
55-
│ snapshots/<ts>/ ← transient EXPORT DATABASE landing │
55+
│ snapshots/<ts>/ ← transient vortex-snapshot landing │
5656
│ last-deployed-sha ← stamp file for the deploy timer │
5757
│ .deploy.lock ← flock guard │
5858
│ ops -> /home/ec2-user/vortex/benchmarks-website/ops │
@@ -78,8 +78,7 @@ out-of-tree state — every script and unit lives in
7878
│ <UTC ts>.tar.gz │
7979
│ <UTC ts>/ │
8080
│ schema.sql │
81-
│ load.sql │
82-
│ <table>.csv │
81+
│ <table>.vortex │
8382
└───────────────────────────────────────┘
8483
```
8584

@@ -335,20 +334,35 @@ aws s3 ls s3://vortex-benchmark-results-database/v3-backups/ | tail -20
335334
```
336335

337336
Each `<ts>.tar.gz` archive contains a single directory `<ts>/` with
338-
the artifacts of DuckDB's `EXPORT DATABASE`: a `schema.sql`, a
339-
`load.sql`, and one CSV file per table. Restore on a fresh box:
337+
a `schema.sql` (verbatim DDL the server applies on boot) and one
338+
`<table>.vortex` per table. Restore on a fresh box:
340339

341340
```bash
342341
sudo systemctl stop vortex-bench-server
343342
cd /tmp
344343
aws s3 cp s3://vortex-benchmark-results-database/v3-backups/<ts>.tar.gz .
345344
tar xzf <ts>.tar.gz # extracts ./<ts>/
346-
duckdb /var/lib/vortex-bench/bench.duckdb -c "IMPORT DATABASE '/tmp/<ts>'"
345+
ts=<ts> # e.g. 20260508T010000Z
346+
sudo -u ec2-user rm -f /var/lib/vortex-bench/bench.duckdb \
347+
/var/lib/vortex-bench/bench.duckdb.wal
348+
duckdb /var/lib/vortex-bench/bench.duckdb <<EOF
349+
INSTALL vortex FROM community;
350+
LOAD vortex;
351+
.read /tmp/${ts}/schema.sql
352+
INSERT INTO commits SELECT * FROM read_vortex('/tmp/${ts}/commits.vortex');
353+
INSERT INTO query_measurements SELECT * FROM read_vortex('/tmp/${ts}/query_measurements.vortex');
354+
INSERT INTO compression_times SELECT * FROM read_vortex('/tmp/${ts}/compression_times.vortex');
355+
INSERT INTO compression_sizes SELECT * FROM read_vortex('/tmp/${ts}/compression_sizes.vortex');
356+
INSERT INTO random_access_times SELECT * FROM read_vortex('/tmp/${ts}/random_access_times.vortex');
357+
INSERT INTO vector_search_runs SELECT * FROM read_vortex('/tmp/${ts}/vector_search_runs.vortex');
358+
EOF
347359
sudo systemctl start vortex-bench-server
348360
```
349361

350-
(`duckdb` CLI version doesn't strictly have to match — DuckDB's
351-
import format is forward/backward compatible across recent versions.)
362+
The `duckdb` CLI version needs to be recent enough that the vortex
363+
community extension is published for it. If `INSTALL vortex FROM
364+
community` fails, upgrade the CLI to match (or exceed) the version
365+
the server was built against (`duckdb` crate `1.10502` ≈ DuckDB 1.5.x).
352366

353367
If you want to take an out-of-band snapshot (e.g. before a risky
354368
operation), just call the same endpoint the timer does:
@@ -426,7 +440,7 @@ also constitute the public admin contract for any future tooling.
426440
| Method + path | Bearer | Notes |
427441
|------------------------------------------------------------------|---------------|------------------------------------------------------------------------------------------------|
428442
| `GET /health` | none | `deploy.sh` polls for liveness after a restart. |
429-
| `POST /api/admin/snapshot?ts=<id>` | admin | Triggers `EXPORT DATABASE`. `ts` must match `[A-Za-z0-9_-]{1,64}`. 409 if the dir exists. |
443+
| `POST /api/admin/snapshot?ts=<id>` | admin | Writes `schema.sql` + per-table `.vortex` files. `ts` must match `[A-Za-z0-9_-]{1,64}`. 409 if the dir exists. |
430444
| `POST /api/admin/sql` (body `{"sql": …}`, `?format=json\|table`) | admin | Read-only SQL only — `SELECT`/`WITH`/`PRAGMA`/`SHOW`/`DESCRIBE`/`EXPLAIN`. |
431445
| `POST /api/ingest` | ingest | Used by CI, not by these scripts. Documented under [`crate::ingest`]. |
432446

benchmarks-website/ops/backup.sh

Lines changed: 9 additions & 9 deletions
Original file line numberDiff line numberDiff line change
@@ -4,16 +4,16 @@
44
#
55
# Hourly snapshot to S3, called by vortex-bench-backup.timer.
66
#
7-
# Asks the running server to EXPORT DATABASE via /api/admin/snapshot
8-
# (so the export uses the same DuckDB process that owns the file — no
9-
# stop required), `tar czf`s the resulting CSV dir into a single
10-
# archive, uploads it to $S3_BACKUP_PREFIX/<ts>.tar.gz, and deletes
11-
# the local copies.
7+
# Asks the running server to write a per-table Vortex snapshot via
8+
# /api/admin/snapshot (so the writer uses the same DuckDB process
9+
# that owns the file — no stop required), `tar czf`s the resulting
10+
# directory into a single archive, uploads it to
11+
# $S3_BACKUP_PREFIX/<ts>.tar.gz, and deletes the local copies.
1212
#
13-
# We gzip rather than uploading raw CSVs because DuckDB's CSV EXPORT
14-
# is verbose for our shape (most data lands in BIGINT[] runtime
15-
# columns that bloat 2–3× as text); gzip typically reclaims 5–7× on
16-
# this kind of payload.
13+
# Vortex compresses our shape (mostly BIGINT[] runtime arrays + short
14+
# strings) far better than gzipped CSV; the additional gzip on the
15+
# tarball is largely catching schema.sql and tar metadata, not the
16+
# data files themselves.
1717
#
1818
# The instance IAM role must already permit s3:PutObject under
1919
# $S3_BACKUP_PREFIX. (Same bucket the v2 backup script used.)

benchmarks-website/server/src/admin.rs

Lines changed: 50 additions & 28 deletions
Original file line numberDiff line numberDiff line change
@@ -14,19 +14,23 @@
1414
//!
1515
//! ### `POST /api/admin/snapshot?ts=<id>`
1616
//!
17-
//! Runs `EXPORT DATABASE '<snapshot_dir>/<ts>/' (FORMAT csv)` against the
18-
//! live DuckDB connection. CSV is the only EXPORT format the
19-
//! `bundled` libduckdb-sys feature ships with — switching to parquet or
20-
//! a Vortex layout later means flipping the duckdb feature flag and
21-
//! changing one literal below. CSV round-trips losslessly through
22-
//! `IMPORT DATABASE` (a `schema.sql` is written alongside the data so
23-
//! types and array columns rehydrate correctly).
17+
//! Writes a snapshot directory `<snapshot_dir>/<ts>/` containing:
18+
//! - `schema.sql` — verbatim copy of [`crate::schema::SCHEMA_DDL`], so a
19+
//! restore knows how to recreate the tables before bulk-loading.
20+
//! - `<table>.vortex` for every table in [`crate::schema::TABLES`] —
21+
//! each produced by a `COPY (SELECT * FROM <table>) TO …
22+
//! (FORMAT vortex)`. The vortex DuckDB extension is auto-installed
23+
//! from the community repo on first call, then `LOAD`ed.
24+
//!
25+
//! Vortex compresses the BIGINT[] runtime arrays and string columns
26+
//! roughly an order of magnitude better than gzipped CSV on this shape;
27+
//! it is also the project's own format, which is the obvious dogfood.
2428
//!
2529
//! `ts` must match `[A-Za-z0-9_-]{1,64}`; the snapshot script
2630
//! conventionally passes a UTC timestamp like `20260508T010000Z`. The
27-
//! target subdirectory must not already exist (409 otherwise). The export
28-
//! is transactionally consistent: writes during the export queue on the
29-
//! connection mutex.
31+
//! target subdirectory must not already exist (409 otherwise). All
32+
//! per-table COPY statements run on a connection cloned from the
33+
//! shared handle, so concurrent ingest writes are not blocked.
3034
//!
3135
//! ### `POST /api/admin/sql`
3236
//!
@@ -62,6 +66,7 @@ use thiserror::Error;
6266

6367
use crate::app::AppState;
6468
use crate::db;
69+
use crate::schema;
6570

6671
/// Errors surfaced by `/api/admin/*` handlers. Auth (401) is handled by
6772
/// [`require_admin_bearer`] and never reaches a handler.
@@ -162,8 +167,9 @@ pub struct SnapshotResponse {
162167
pub snapshot_dir: String,
163168
}
164169

165-
/// Handler for `POST /api/admin/snapshot?ts=<id>`. Runs `EXPORT DATABASE`
166-
/// to a fresh subdirectory under [`AppState::snapshot_dir`].
170+
/// Handler for `POST /api/admin/snapshot?ts=<id>`. Writes
171+
/// `schema.sql` plus one `<table>.vortex` file per fact/dim table into
172+
/// a fresh subdirectory under [`AppState::snapshot_dir`].
167173
pub async fn snapshot(
168174
State(state): State<AppState>,
169175
Query(q): Query<SnapshotQuery>,
@@ -176,27 +182,43 @@ pub async fn snapshot(
176182
target.display()
177183
)));
178184
}
179-
if let Some(parent) = target.parent() {
180-
std::fs::create_dir_all(parent)
181-
.with_context(|| format!("creating snapshot parent {}", parent.display()))?;
182-
}
183-
let target_str = target
184-
.to_str()
185-
.ok_or_else(|| AdminError::Internal(anyhow::anyhow!("snapshot path is not UTF-8")))?
186-
.to_string();
187-
let target_for_response = target.clone();
185+
std::fs::create_dir_all(&target)
186+
.with_context(|| format!("creating snapshot dir {}", target.display()))?;
187+
188+
// Schema is just our DDL string verbatim; restore reads this with
189+
// `duckdb -init schema.sql` (or `.read schema.sql`) before
190+
// bulk-loading the per-table vortex files.
191+
std::fs::write(target.join("schema.sql"), schema::SCHEMA_DDL)
192+
.with_context(|| format!("writing schema.sql under {}", target.display()))?;
193+
194+
let target_for_db = target.clone();
188195
db::run_blocking(&state.db, move |conn| {
189-
// `target_str` is composed from the configured snapshot dir + a
190-
// validated [A-Za-z0-9_-] timestamp, so single-quote escaping is
191-
// a non-issue here.
192-
let sql = format!("EXPORT DATABASE '{target_str}' (FORMAT csv)");
193-
conn.execute_batch(&sql)
194-
.with_context(|| format!("EXPORT DATABASE to {target_str}"))
196+
// Idempotent — `INSTALL` is a no-op if the extension is already
197+
// present, `LOAD` is cheap once the binary is on disk. The
198+
// bundled libduckdb-sys has autoload enabled, so the very first
199+
// call also auto-fetches the extension from the DuckDB
200+
// community repo. Subsequent calls are entirely local.
201+
conn.execute_batch("INSTALL vortex FROM community; LOAD vortex;")
202+
.context("INSTALL/LOAD vortex extension")?;
203+
for table in schema::TABLES {
204+
// Single-quote escaping is a non-issue: `target_for_db`
205+
// is composed from the operator-configured snapshot dir +
206+
// a validated [A-Za-z0-9_-] timestamp, and table names
207+
// come from the closed const list in schema.rs.
208+
let path = target_for_db.join(format!("{table}.vortex"));
209+
let path_str = path
210+
.to_str()
211+
.ok_or_else(|| anyhow::anyhow!("snapshot path is not UTF-8: {}", path.display()))?;
212+
let sql = format!("COPY (SELECT * FROM {table}) TO '{path_str}' (FORMAT vortex)");
213+
conn.execute_batch(&sql)
214+
.with_context(|| format!("COPY {table} TO {path_str}"))?;
215+
}
216+
Ok(())
195217
})
196218
.await
197219
.map_err(AdminError::Internal)?;
198220
Ok(Json(SnapshotResponse {
199-
snapshot_dir: target_for_response.display().to_string(),
221+
snapshot_dir: target.display().to_string(),
200222
}))
201223
}
202224

benchmarks-website/server/src/schema.rs

Lines changed: 13 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -183,3 +183,16 @@ CREATE TABLE IF NOT EXISTS vector_search_runs (
183183
/// Schema version expected by the server. The ingest envelope's
184184
/// `run_meta.schema_version` must match this exactly at alpha.
185185
pub const SCHEMA_VERSION: i32 = 1;
186+
187+
/// Every table in the schema, in the order a fresh boot creates them.
188+
/// Used by the snapshot endpoint to drive a per-table `COPY ... TO`
189+
/// across the whole DB and by the restore docs to document the same
190+
/// list. `commits` is the dim table; the rest are facts.
191+
pub const TABLES: &[&str] = &[
192+
"commits",
193+
"query_measurements",
194+
"compression_times",
195+
"compression_sizes",
196+
"random_access_times",
197+
"vector_search_runs",
198+
];

benchmarks-website/server/tests/admin.rs

Lines changed: 19 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -240,7 +240,13 @@ async fn admin_unmounted_when_admin_token_absent() -> Result<()> {
240240
Ok(())
241241
}
242242

243+
// The snapshot endpoint INSTALLs and LOADs the vortex DuckDB community
244+
// extension on first call; that needs outbound network to
245+
// `community-extensions.duckdb.org` which sandboxed CI environments
246+
// generally don't allow. Run manually before merge:
247+
// cargo test -p vortex-bench-server --test admin -- --ignored
243248
#[tokio::test]
249+
#[ignore = "needs network to install the vortex DuckDB community extension"]
244250
async fn admin_snapshot_creates_export_directory() -> Result<()> {
245251
let server = Server::start_with_admin().await?;
246252
let client = reqwest::Client::new();
@@ -260,11 +266,22 @@ async fn admin_snapshot_creates_export_directory() -> Result<()> {
260266
.context("snapshot_dir field")?;
261267
let dir_path = std::path::PathBuf::from(dir);
262268
assert!(dir_path.exists(), "{dir} should exist");
263-
// EXPORT DATABASE always writes a schema.sql.
269+
// schema.sql is written verbatim from SCHEMA_DDL.
264270
assert!(
265271
dir_path.join("schema.sql").exists(),
266272
"{dir}/schema.sql should exist"
267273
);
274+
// One .vortex file per table — `commits` is the dim table and is
275+
// present even when the DB is otherwise empty (the schema was
276+
// applied at AppState::open).
277+
assert!(
278+
dir_path.join("commits.vortex").exists(),
279+
"{dir}/commits.vortex should exist"
280+
);
281+
assert!(
282+
dir_path.join("query_measurements.vortex").exists(),
283+
"{dir}/query_measurements.vortex should exist"
284+
);
268285
// And the directory should be under the configured snapshot dir.
269286
assert!(
270287
dir_path.starts_with(&server.snapshot_dir),
@@ -275,6 +292,7 @@ async fn admin_snapshot_creates_export_directory() -> Result<()> {
275292
}
276293

277294
#[tokio::test]
295+
#[ignore = "needs network to install the vortex DuckDB community extension"]
278296
async fn admin_snapshot_rejects_existing_directory() -> Result<()> {
279297
let server = Server::start_with_admin().await?;
280298
let client = reqwest::Client::new();

0 commit comments

Comments
 (0)