Add sharded data-parallel runs sharing one run_path by luciaquirke · Pull Request #295 · EleutherAI/bergson

luciaquirke · 2026-06-05T13:08:29Z

Summary

Data attribution with IFs is embarrassingly parallel, but the SLURM example required users to hand-compute --split slices per node, write to per-node run paths, and manually stitch the resulting memmaps. This PR lets independent bergson build/bergson score invocations (e.g. one per SLURM array task) write into the same run_path with no stitching at either end, and survive nodes dying and being restarted.

--num_shards N --shard_id i on build/score (shard_id inferred from SLURM_ARRAY_TASK_ID/SLURM_PROCID when unset). Each shard processes one contiguous dataset slice and keeps its usual intra-node NCCL rendezvous.
Shards write to run_path/shards/<i>-of-<n>.part and atomically rename into place, recording provenance in shard.json (dataset row range, host, SLURM ids). A crashed shard leaves a .part dir and is rebuilt by re-running the same command; a published shard is skipped — requeued array tasks are idempotent.
The first shard atomically writes a canonical run_path/config.yaml (per-invocation fields shard_id/overwrite/node_rank stripped); later shards verify equality, so differently-configured shards can never mix in one run_path.
New bergson/sharding.py with ShardedMemmap, a lazy concatenated view over per-shard memmaps. load_gradients, load_scores, load_token_gradients, load_gradient_dataset, GradientProcessor.load, and the FAISS builder transparently resolve sharded run paths (allow_partial to peek at in-flight runs).
New bergson status <run_path> reports published / in-progress / missing shards.
examples/slurm/data_parallel_score.sh rewritten as sbatch --array --requeue job array; sharding docs added to docs/cli.rst.
Hessian estimation, aggregation/reduce, and pipeline commands explicitly reject sharded mode (factors can't be merged across independent shards yet).

Test plan

tests/test_sharded_runs.py unit tests: shard ranges match Dataset.shard(contiguous=True), ShardedMemmap indexing vs np.concatenate (flat + structured), canonical config publish/verify, shard inventory/coverage errors, config validation
GPU lifecycle test: SIGKILL a shard mid-build → .part left, not published; restart rebuilds; re-running a published shard is a no-op; mismatched config rejected; sharded index reads back identical to a non-sharded build of the same data
Manual end-to-end sharded bergson score (2 shards): load_scores returns one concatenated array, is_written(), bergson status reports complete
Regression: 25 tests across test_build/data/reduce/multinode/truncation/advantages/batch_size_invariance pass; pre-commit run --all-files clean

🤖 Generated with Claude Code

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

for more information, see https://pre-commit.ci

luciaquirke and others added 2 commits June 5, 2026 13:07

Add sharded data-parallel runs sharing one run_path

d70ec13

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

[pre-commit.ci] auto fixes from pre-commit.com hooks

c229883

for more information, see https://pre-commit.ci

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add sharded data-parallel runs sharing one run_path#295

Add sharded data-parallel runs sharing one run_path#295
luciaquirke wants to merge 2 commits into
mainfrom
feat/sharded-runs

luciaquirke commented Jun 5, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

luciaquirke commented Jun 5, 2026

Summary

Test plan

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant