
fix(db): exclude eval_samples/server_logs from dump to fit GitHub 2 GiB cap #418

Draft
arygupt wants to merge 1 commit into master from fix/db-dump-exclude-bloat-tables

arygupt commented Jun 4, 2026

Problem

The weekly public DB dump (db-backup.yml, cron Mondays 0 0 * * 1) publishes a zip as a GitHub release asset, which is hard-capped at 2 GiB (2_147_483_648 bytes).

Every dump since 2026-05-18 has failed during gh release create with:

size must be less than 2147483648

The 2026-05-11 archive (the last good one) was already 2.072 GB (decimal), which is 96.5% of the 2,147,483,648-byte cap. The following week's dump tipped over.
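The headroom arithmetic can be checked with a minimal sketch. The names `RELEASE_ASSET_CAP_BYTES`, `capUsage`, and `fitsCap` are hypothetical and are not part of the repo; the comparison is strict because the error message says "less than".

```typescript
// 2 GiB, the limit quoted in the gh error message.
const RELEASE_ASSET_CAP_BYTES = 2 * 1024 ** 3; // 2_147_483_648

// Hypothetical helper: fraction of the cap used by an archive of this size.
function capUsage(sizeBytes: number): number {
  return sizeBytes / RELEASE_ASSET_CAP_BYTES;
}

// Hypothetical helper: the error reads "size must be less than 2147483648",
// so an asset of exactly the cap size is assumed to be rejected.
function fitsCap(sizeBytes: number): boolean {
  return sizeBytes < RELEASE_ASSET_CAP_BYTES;
}

// The 2026-05-11 archive, taking "2.072 GB" as decimal gigabytes.
const lastGoodBytes = 2.072e9;
console.log(`${(capUsage(lastGoodBytes) * 100).toFixed(1)}% of cap`);
console.log(`fits: ${fitsCap(lastGoodBytes)}`);
```

A check like this could run before `gh release create` to fail early with a clearer message, though where such a guard would live in the workflow is left open here.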

Root cause

Two tables dominate the archive and are ~99% of its size:

| table | compressed | share |
| --- | --- | --- |
| eval_samples | ~1.7 GB | ~82% |
| server_logs | ~345 MB | ~17% |
| benchmark_results (the analytically useful one) | ~20 MB | ~1% |

Fix

Exclude eval_samples and server_logs from the dump by default. This drops the zip from ~2.07 GB → ~0.36 GB, well under the cap, and unblocks the weekly release.

  • Opt back in to a full backup with DUMP_INCLUDE_ALL=1.
  • load-dump.ts already skips missing table files (skip <table> (file not found)), so restores round-trip cleanly without these two files.
  • No change to benchmark_results / configs / workflow_runs / run_stats etc. — all still dumped.
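The default-exclude behaviour and the loader's skip-on-missing behaviour described in the bullets above can be sketched as follows. This is an illustration under assumptions, not the code in dump-db.ts or load-dump.ts: the table list is only the subset named in this PR, the `<table>.json` file naming is assumed, and `tablesToDump` and `planRestore` are hypothetical names. It also assumes the opt-in is an exact match on `DUMP_INCLUDE_ALL=1`.

```typescript
// Tables named in this PR; the real list is longer ("etc." above).
const ALL_TABLES = [
  "benchmark_results",
  "configs",
  "workflow_runs",
  "run_stats",
  "eval_samples",
  "server_logs",
] as const;

// The two tables excluded by default.
const BLOAT_TABLES: ReadonlySet<string> = new Set(["eval_samples", "server_logs"]);

// Hypothetical: pick the tables to dump, honouring the DUMP_INCLUDE_ALL opt-in.
function tablesToDump(env: Record<string, string | undefined>): string[] {
  const includeAll = env.DUMP_INCLUDE_ALL === "1";
  return ALL_TABLES.filter((t) => includeAll || !BLOAT_TABLES.has(t));
}

// Hypothetical: decide which tables a restore would load. Missing files are
// skipped with the same message the PR quotes, rather than treated as errors.
function planRestore(
  dumpDir: string,
  tables: readonly string[],
  fileExists: (path: string) => boolean,
): { toLoad: string[]; skipped: string[] } {
  const toLoad: string[] = [];
  const skipped: string[] = [];
  for (const table of tables) {
    const file = `${dumpDir}/${table}.json`; // assumed file naming
    if (fileExists(file)) {
      toLoad.push(table);
    } else {
      console.log(`skip ${table} (file not found)`);
      skipped.push(table);
    }
  }
  return { toLoad, skipped };
}
```

In a Node script, `tablesToDump(process.env)` and `planRestore(dir, ALL_TABLES, existsSync)` (with `existsSync` from `node:fs`) would be the natural call sites. Injecting `fileExists` keeps the sketch free of filesystem dependencies.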

Verification

  • Manually trigger db-backup.yml (workflow_dispatch) on this branch and confirm the asset uploads.

🤖 Generated with Claude Code

…iB cap

The weekly public DB dump is published as a GitHub release asset, which is
hard-capped at 2 GiB. eval_samples (~1.7 GB compressed) + server_logs
(~345 MB) make up ~99% of the archive, pushing it past the cap — every dump
since 2026-05-18 has failed with `size must be less than 2147483648`.

Exclude both tables from the dump by default (set DUMP_INCLUDE_ALL=1 for a
full backup). This drops the zip from ~2.07 GB to ~0.36 GB and unblocks the
weekly release. The analytically useful tables are unaffected
(benchmark_results is only ~20 MB). load-dump already skips missing table
files, so restores round-trip cleanly.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
vercel Bot commented Jun 4, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| inferencemax-app | Ready | Preview, Comment | Jun 4, 2026 8:55pm |

adibarra added a commit that referenced this pull request Jun 5, 2026
The weekly dump zipped the per-table JSON into a single release asset, which
now exceeds GitHub's 2 GiB per-asset cap (~23 GB raw -> ~2 GB zip). Rather than
dropping tables (the #418 stopgap, which breaks anyone rebuilding the DB from
the dump), tar the dump, compress with xz/lzma2 (preset=9e, 192 MiB dict), and
split into <2 GiB parts. split always emits >=1 part, so consumers use one
uniform 'cat parts | xz -d | tar -x' flow.

xz -9e is CPU-bound, so run the job on a 32-vCPU Blacksmith runner to stay
within the 30-min cap. Update README + inferencex-data skill consumer docs to
match. dump-db.ts/load-dump.ts unchanged (still a plain JSON dir).