Skip to content

Add sharded data-parallel runs sharing one run_path#295

Open
luciaquirke wants to merge 2 commits into
mainfrom
feat/sharded-runs
Open

Add sharded data-parallel runs sharing one run_path#295
luciaquirke wants to merge 2 commits into
mainfrom
feat/sharded-runs

Conversation

@luciaquirke
Copy link
Copy Markdown
Collaborator

Summary

Data attribution with IFs is embarrassingly parallel, but the SLURM example required users to hand-compute --split slices per node, write to per-node run paths, and manually stitch the resulting memmaps. This PR lets independent bergson build/bergson score invocations (e.g. one per SLURM array task) write into the same run_path with no stitching at either end, and survive nodes dying and being restarted.

  • --num_shards N --shard_id i on build/score (shard_id inferred from SLURM_ARRAY_TASK_ID/SLURM_PROCID when unset). Each shard processes one contiguous dataset slice and keeps its usual intra-node NCCL rendezvous.
  • Shards write to run_path/shards/<i>-of-<n>.part and atomically rename into place, recording provenance in shard.json (dataset row range, host, SLURM ids). A crashed shard leaves a .part dir and is rebuilt by re-running the same command; a published shard is skipped — requeued array tasks are idempotent.
  • The first shard atomically writes a canonical run_path/config.yaml (per-invocation fields shard_id/overwrite/node_rank stripped); later shards verify equality, so differently-configured shards can never mix in one run_path.
  • New bergson/sharding.py with ShardedMemmap, a lazy concatenated view over per-shard memmaps. load_gradients, load_scores, load_token_gradients, load_gradient_dataset, GradientProcessor.load, and the FAISS builder transparently resolve sharded run paths (allow_partial to peek at in-flight runs).
  • New bergson status <run_path> reports published / in-progress / missing shards.
  • examples/slurm/data_parallel_score.sh rewritten as sbatch --array --requeue job array; sharding docs added to docs/cli.rst.
  • Hessian estimation, aggregation/reduce, and pipeline commands explicitly reject sharded mode (factors can't be merged across independent shards yet).

Test plan

  • tests/test_sharded_runs.py unit tests: shard ranges match Dataset.shard(contiguous=True), ShardedMemmap indexing vs np.concatenate (flat + structured), canonical config publish/verify, shard inventory/coverage errors, config validation
  • GPU lifecycle test: SIGKILL a shard mid-build → .part left, not published; restart rebuilds; re-running a published shard is a no-op; mismatched config rejected; sharded index reads back identical to a non-sharded build of the same data
  • Manual end-to-end sharded bergson score (2 shards): load_scores returns one concatenated array, is_written(), bergson status reports complete
  • Regression: 25 tests across test_build/data/reduce/multinode/truncation/advantages/batch_size_invariance pass; pre-commit run --all-files clean

🤖 Generated with Claude Code

luciaquirke and others added 2 commits June 5, 2026 13:07
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant