fix(db): exclude eval_samples/server_logs from dump to fit GitHub 2 GiB cap #418
Draft
arygupt wants to merge 1 commit into
…iB cap

The weekly public DB dump is published as a GitHub release asset, which is hard-capped at 2 GiB. eval_samples (~1.7 GB compressed) + server_logs (~345 MB) make up ~99% of the archive, pushing it past the cap: every dump since 2026-05-18 has failed with `size must be less than 2147483648`.

Exclude both tables from the dump by default (set DUMP_INCLUDE_ALL=1 for a full backup). This drops the zip from ~2.07 GB to ~0.36 GB and unblocks the weekly release. The analytically useful tables are unaffected (benchmark_results is only ~20 MB). load-dump already skips missing table files, so restores round-trip cleanly.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
adibarra added a commit that referenced this pull request Jun 5, 2026
The weekly dump zipped the per-table JSON into a single release asset, which now exceeds GitHub's 2 GiB per-asset cap (~23 GB raw -> ~2 GB zip).

Rather than dropping tables (the #418 stopgap, which breaks anyone rebuilding the DB from the dump), tar the dump, compress with xz/lzma2 (preset=9e, 192 MiB dict), and split into <2 GiB parts. split always emits >=1 part, so consumers use one uniform 'cat parts | xz -d | tar -x' flow. xz -9e is CPU-bound, so run the job on a 32-vCPU Blacksmith runner to stay within the 30-min cap.

Update README + inferencex-data skill consumer docs to match. dump-db.ts/load-dump.ts unchanged (still a plain JSON dir).
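The split step in that commit message can be pictured with a small sketch. This is not the workflow's code (the commit describes tar, xz, and split running on the runner); `planParts` and the 1.9 GiB part size are hypothetical, chosen here only so each part stays under the 2 GiB cap.

```typescript
const GIB = 2 ** 30;
const PER_ASSET_CAP_BYTES = 2 * GIB; // GitHub's per-asset cap quoted in the commit
// Assumed part size; the commit only says parts are "<2 GiB".
const PART_BYTES = Math.floor(1.9 * GIB);

// Hypothetical helper: sizes of the contiguous parts that fixed-size splitting
// would produce for a non-empty archive of `totalBytes`.
function planParts(totalBytes: number, maxPartBytes: number): number[] {
  if (totalBytes <= 0 || maxPartBytes <= 0) {
    throw new RangeError("totalBytes and maxPartBytes must be positive");
  }
  if (maxPartBytes >= PER_ASSET_CAP_BYTES) {
    throw new RangeError("each part must be smaller than the per-asset cap");
  }
  const sizes: number[] = [];
  let remaining = totalBytes;
  while (remaining > 0) {
    // Every part is full-sized except possibly the last one.
    const size = Math.min(remaining, maxPartBytes);
    sizes.push(size);
    remaining -= size;
  }
  return sizes;
}
```

Because the parts are contiguous slices in order, concatenating them restores the original compressed stream, which is why a single 'cat parts | xz -d | tar -x' flow works whether there is one part or several.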
Problem
The weekly public DB dump (`db-backup.yml`, cron Mondays `0 0 * * 1`) publishes a zip as a GitHub release asset, which is hard-capped at 2 GiB (`2_147_483_648` bytes).

Every dump since 2026-05-18 has failed during `gh release create` with: `size must be less than 2147483648`

The 2026-05-11 archive (the last good one) was already 2.072 GB, 96.5% of the cap. The next week tipped over.
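The 96.5% figure can be reproduced with a few lines of arithmetic. The sketch below assumes the reported 2.072 GB is in decimal gigabytes (10^9 bytes), which is what makes the percentage match; `capUsage` is a hypothetical helper, not something from the repo.

```typescript
// GitHub's per-asset limit as quoted above: 2 GiB = 2^31 bytes.
const RELEASE_ASSET_CAP_BYTES = 2 ** 31; // 2_147_483_648

// Hypothetical helper: fraction of the cap used by an archive of `bytes`.
function capUsage(bytes: number): number {
  return bytes / RELEASE_ASSET_CAP_BYTES;
}

// Last good archive, assuming "2.072 GB" means 2.072 * 10^9 bytes.
const lastGoodBytes = 2.072e9;

const usagePercent = (capUsage(lastGoodBytes) * 100).toFixed(1); // "96.5"
const headroomBytes = RELEASE_ASSET_CAP_BYTES - lastGoodBytes; // 75_483_648, about 75.5 MB
```

With only about 75 MB of headroom left, one more week of growth in the archive was enough to cross the cap.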
Root cause
Two tables dominate the archive and are ~99% of its size:
- `eval_samples`: ~1.7 GB compressed
- `server_logs`: ~345 MB
- `benchmark_results` (the analytically useful one): ~20 MB

Fix
Exclude `eval_samples` and `server_logs` from the dump by default. This drops the zip from ~2.07 GB → ~0.36 GB, well under the cap, and unblocks the weekly release.
- Set `DUMP_INCLUDE_ALL=1` for a full backup.
- `load-dump.ts` already skips missing table files (`skip <table> (file not found)`), so restores round-trip cleanly without these two files.
- `benchmark_results` / `configs` / `workflow_runs` / `run_stats` etc.: all still dumped.

Verification
- Run `db-backup.yml` (workflow_dispatch) on this branch and confirm the asset uploads.

🤖 Generated with Claude Code
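For reviewers, here is a minimal sketch of the behaviour described under Fix: default exclusion, the `DUMP_INCLUDE_ALL=1` override, and a restore that skips missing files. It is not the code in `dump-db.ts` or `load-dump.ts`; `tablesToDump`, `planRestore`, and the `Env` type are hypothetical names, and only the two excluded table names, the env variable, and the skip message come from this PR.

```typescript
// Tables excluded from the dump by default, per this PR.
const EXCLUDED_BY_DEFAULT: readonly string[] = ["eval_samples", "server_logs"];

// Stand-in for process.env so the sketch stays free of Node typings.
type Env = Record<string, string | undefined>;

// Hypothetical helper: which tables the dump writes for a given environment.
function tablesToDump(allTables: readonly string[], env: Env): string[] {
  if (env["DUMP_INCLUDE_ALL"] === "1") {
    return allTables.slice(); // full backup: nothing is filtered out
  }
  return allTables.filter((table) => EXCLUDED_BY_DEFAULT.indexOf(table) === -1);
}

// Hypothetical restore planner: load tables whose file is present and log a
// skip line for the rest, mirroring the "skip <table> (file not found)" message.
function planRestore(
  expectedTables: readonly string[],
  presentTables: readonly string[],
): { load: string[]; log: string[] } {
  const load: string[] = [];
  const log: string[] = [];
  for (const table of expectedTables) {
    if (presentTables.indexOf(table) !== -1) {
      load.push(table);
    } else {
      log.push(`skip ${table} (file not found)`);
    }
  }
  return { load, log };
}
```

Feeding the output of `tablesToDump` into `planRestore` as the set of present files shows the round trip: the two excluded tables produce skip lines while every other table is loaded.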