Add shard-count suggestion and render knobs used by Claude skills
The skills claimed that `xfer manifest analyze` emitted
`suggested_shard_count`, that `xfer slurm render` picked up a rebased
manifest automatically, and that render could be run without a local
rclone.conf. None of these claims matched the CLI.
Make the CLI match the documented behavior:
* `xfer manifest analyze` now emits `suggested_shard_count`,
`shard_count_reasoning`, and `shard_count_assumptions` based on a
10 TiB per-shard cap, `4 * array_concurrency` slack, and an optional
core budget. New `--assumed-*` / `--max-shard-bytes-tb` flags let the
user sharpen the suggestion.
* `xfer slurm render` gains `--manifest`, so rebased manifests can be
consumed without clobbering `run/manifest.jsonl`, and its
`--rclone-config` no longer requires a local file (it warns instead).
* Skill docs updated to match, plus pytest coverage for the shard-count
heuristic.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
+| `--assumed-core-budget` | unset | Total cores the partition will make available (supply from `sinfo`). |
+| `--max-shard-bytes-tb` | `10` | Per-shard byte cap. No single shard should carry more than this. |
+| `--base-flags "<flags>"` | — | Prepend the user's preferred rclone flags to the suggested ones. |
+
+If the user already knows the transfer cluster's available core budget, pass it — the shard-count suggestion will be sharper. Otherwise the default (concurrency + bytes-only) is fine.
 
 ## Step 3 — Report
 
 Read `run/analyze.json` and report to the user:
 
-1. **Dataset shape**: total object count, total bytes, median size, p90/p99 sizes, and the histogram bin counts (power-of-2 edges).
+1. **Dataset shape**: total object count, total bytes, median size, p10/p90 sizes, and the histogram bin counts (power-of-2 edges).
 2. **Profile classification**: which profile the analyzer picked (`small_files`, `large_files`, or `mixed`) and the reasoning (e.g., ">70% of objects are under 1 MiB").
-3. **Suggested rclone flags**: the concrete string to pass to `--rclone-flags` for render. Typical examples:
+3. **Suggested rclone flags** (`suggested_flags`): the concrete string to pass to `--rclone-flags` for render. Typical examples:
-4. **Suggested shard count**: derive from total object count / target objects-per-shard (aim for ~10k–50k objects per shard for small-file workloads, smaller for large-file). Cap at what the chosen transfer cluster can reasonably host in a job array. If you don't yet know the transfer cluster, give a range and defer to `xfer-manifest-shard`.
+4. **Suggested shard count** (`suggested_shard_count`, plus `shard_count_reasoning` and `shard_count_assumptions`). The heuristic:
+   - If `total_bytes` is below the per-shard cap (default 10 TiB), **1 shard** — don't shard small datasets.
+   - Otherwise `ceil(total_bytes / cap)` shards, upper-bounded by `4 × array_concurrency` and (if a core budget was supplied) `core_budget // cpus_per_task`.
+
+   Quote `shard_count_reasoning` verbatim back to the user so they can see the trade-offs.
 
 ## Step 4 — Persist for downstream skills
 
-`run/analyze.json` is the source of truth for flag/shard decisions. `xfer-manifest-shard` and `xfer-slurm-render` both read it. Don't re-derive flags by hand in those skills — point at this file.
+`run/analyze.json` is the source of truth for flag/shard decisions. `xfer-manifest-shard` reads `suggested_shard_count` and `xfer-slurm-render` reads `suggested_flags` — point at this file, don't re-derive.
+
+If the user's plan changes (different transfer cluster, different concurrency cap), re-run `xfer manifest analyze` with updated `--assumed-*` flags before calling `xfer-manifest-shard`.
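The shard-count heuristic described in Step 3 can be sketched in a few lines of Python. This is an illustrative reading of the documented rule, not the code from the commit: the function name `suggest_shard_count`, the `cpus_per_task` default of 4, and the floor of one shard when the core budget is very small are all assumptions made for the example.

```python
from typing import Optional

TIB = 1024 ** 4  # bytes per TiB


def suggest_shard_count(
    total_bytes: int,
    max_shard_bytes_tb: int = 10,       # documented default for --max-shard-bytes-tb
    array_concurrency: int = 64,        # documented default for --assumed-array-concurrency
    cpus_per_task: int = 4,             # hypothetical value, not taken from the source
    core_budget: Optional[int] = None,  # None means the core constraint is skipped
) -> int:
    """Sketch of the documented heuristic; not the project's implementation."""
    cap = max_shard_bytes_tb * TIB
    if total_bytes < cap:
        # Small datasets are not sharded at all.
        return 1
    # Integer ceiling of total_bytes / cap, avoiding float rounding.
    count = -(-total_bytes // cap)
    # Upper bound from the scheduler slack rule.
    count = min(count, 4 * array_concurrency)
    if core_budget is not None:
        # Upper bound from available cores; the floor of 1 is an assumption.
        count = min(count, max(1, core_budget // cpus_per_task))
    return count
```

Under these assumptions, a 25 TiB dataset yields 3 shards, while a 10,000 TiB dataset is held to 256 by the `4 × array_concurrency` bound. In that second case the upper bound wins over the byte cap, so individual shards would carry more than 10 TiB.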
.claude/skills/xfer-manifest-rebase/SKILL.md: 16 additions, 4 deletions
@@ -40,7 +40,9 @@ Always write to a new file (don't overwrite `manifest.jsonl`). Keeping the origi
 
 ## Step 3 — Re-shard
 
-Sharding is derived from the manifest, so **re-shard after rebasing**:
+Sharding is derived from the manifest, so **re-shard after rebasing**. The existing `run/shards/` directory contains pre-rebase paths and must be replaced.
+
+Confirm with the user before removing the old shards (`run/shards` is small but removing it is irreversible locally):
 
 ```bash
 rm -rf run/shards
@@ -50,16 +52,26 @@ uv run xfer manifest shard \
   --num-shards <same-N-as-before>
 ```
 
-(Or invoke `xfer-manifest-shard`.) Byte balance won't change meaningfully, but the shard files need to carry the rebased paths or workers will try to copy from the wrong URI.
+(Or invoke `xfer-manifest-shard` with the rebased manifest as input.) Byte balance won't change meaningfully, but the shard files need to carry the rebased paths or workers will try to copy from the wrong URI.
+
+## Step 4 — Point `xfer slurm render` at the rebased manifest
+
+`xfer slurm render` reads `source_root` and `dest_root` from a manifest file. By default it reads `<run_dir>/manifest.jsonl`, which is intentionally left at the pre-rebase vantage point as an audit record. Pass `--manifest` to read the rebased file instead:
 
-## Step 4 — Point downstream skills at the rebased manifest
+```bash
+uv run xfer slurm render \
+  --run-dir run \
+  --manifest run/manifest.rebased.jsonl \
+  ...
+```
 
-When you invoke `xfer-slurm-render` next, pass `--run-dir run` but ensure `config.resolved.json` references the rebased manifest. If the user plans a fresh `xfer slurm render`, that's automatic (render reads from `run/shards/`).
+Without `--manifest`, render would use the original roots and every array task would target the wrong URI.
 
 ## Safety
 
 - Never delete the original manifest — always keep `run/manifest.jsonl` as an audit trail alongside `run/manifest.rebased.jsonl`.
 - Rebase is a remap, not a content migration. It does not move data. It only relabels what each shard points to.
+- Confirm before `rm -rf run/shards` — the user may want to move the old shards aside rather than delete them.
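The last safety note mentions moving the old shards aside instead of deleting them. One way that could look is sketched below in Python. The helper name `move_shards_aside` and the `shards.pre-rebase.<suffix>` naming scheme are hypothetical and are not part of the `xfer` CLI; only the `run/shards` location comes from the skill text.

```python
import time
from pathlib import Path
from typing import Optional


def move_shards_aside(run_dir: Path, suffix: Optional[str] = None) -> Optional[Path]:
    """Rename run_dir/shards to a sibling backup directory instead of deleting it.

    Returns the backup path, or None if there was nothing to move.
    Hypothetical helper; the backup naming scheme is an assumption.
    """
    shards = run_dir / "shards"
    if not shards.exists():
        return None
    # Default to a timestamp so repeated rebases do not collide.
    suffix = suffix or time.strftime("%Y%m%dT%H%M%S")
    backup = run_dir / f"shards.pre-rebase.{suffix}"
    if backup.exists():
        raise FileExistsError(f"backup directory already exists: {backup}")
    # A rename within the same directory keeps the old shard files intact.
    shards.rename(backup)
    return backup
```

After a call such as `move_shards_aside(Path("run"))`, the re-shard step can write a fresh `run/shards` while the pre-rebase shard files remain available for inspection.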
.claude/skills/xfer-manifest-shard/SKILL.md: 10 additions, 11 deletions
@@ -13,22 +13,21 @@ Runs **locally on the workstation**. Pure file processing. No Slurm/SSH needed.
 
 ## Step 1 — Read the analyze output
 
-Read `run/analyze.json` (from `xfer-manifest-analyze`). Use its `suggested_shard_count` / profile as the starting point. If analyze hasn't been run yet, invoke `xfer-manifest-analyze` first — don't guess shard counts from the raw manifest.
+Read `run/analyze.json` (from `xfer-manifest-analyze`) and use `suggested_shard_count` directly as the shard count. The analyzer already factors in the 10 TiB/shard cap, the expected array concurrency, and (if supplied) the core budget.
 
-## Step 2 — Reconcile shard count with cluster resources
+If `run/analyze.json` doesn't exist yet, invoke `xfer-manifest-analyze` first — don't guess shard counts from the raw manifest.
 
-The right shard count depends on **both** rclone settings (from analyze) **and** the transfer cluster's available resources. Ask the user:
+## Step 2 — Decide whether to override
 
-1. Which cluster will run the transfer? (Same as build host, or different?)
-2. What's the target array concurrency — how many shards should run at once? Typical range: 32–256, capped by the partition's `MaxArraySize` and the throughput both S3 endpoints can handle.
-3. What's the partition's per-node core/memory budget?
+Only override `suggested_shard_count` if one of the inputs that fed it has changed since analyze ran:
 
-Rule of thumb:
-- Total shards ≈ max(suggested_shard_count_from_analyze, 4 × array_concurrency). This gives the scheduler enough slack to keep the array fully packed even as slow shards trail.
-- For small-files profiles (heavy listing, light bytes), bias toward **more shards** of fewer objects each.
-- For large-files profiles (heavy bytes, few objects), bias toward **fewer shards** with more bytes each — byte-balancing matters more than object count.
+- The transfer cluster is different from what analyze assumed (different core budget).
+- The array concurrency cap is different from what analyze assumed (defaults: `--assumed-array-concurrency=64`).
+- The user wants a different per-shard byte cap (default 10 TiB).
 
-State your recommendation with the reasoning, then confirm before running.
+In that case, **re-run `xfer-manifest-analyze`** with updated `--assumed-*` flags rather than hand-picking a new number here. Sharing the reasoning/assumptions via `run/analyze.json` is how downstream skills stay coherent.
+
+Show the user `suggested_shard_count` alongside `shard_count_reasoning` from analyze, then confirm before running.
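Reading the suggestion out of `run/analyze.json`, as Steps 1 and 2 describe, might look like the sketch below. The field names come from the skill text, but their placement as top-level keys of the JSON document is an assumption, and `load_shard_suggestion` is a hypothetical helper rather than part of `xfer`.

```python
import json
from pathlib import Path

# Field names as documented; their top-level position in the file is assumed.
SHARD_KEYS = ("suggested_shard_count", "shard_count_reasoning", "shard_count_assumptions")


def load_shard_suggestion(analyze_path: Path) -> dict:
    """Return the shard-count fields from analyze.json, failing loudly if absent."""
    data = json.loads(analyze_path.read_text())
    missing = [key for key in SHARD_KEYS if key not in data]
    if missing:
        # A missing field most likely means analyze was run by an older CLI.
        raise KeyError(f"{analyze_path} lacks {missing}; re-run xfer manifest analyze")
    return {key: data[key] for key in SHARD_KEYS}
```

Failing on missing keys, rather than falling back to a guessed number, matches the skill's instruction not to derive shard counts by hand.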
.claude/skills/xfer-slurm-render/SKILL.md: 11 additions, 7 deletions
@@ -57,16 +57,19 @@ Collect from the user (with defaults from `run/analyze.json` and the chosen part
 | `--rclone-image` | `rclone/rclone:latest` |
 | `--rclone-config` | absolute path to rclone.conf **on the transfer cluster's compute nodes** (see note below) |
 | `--rclone-flags` | `suggested_flags` from `run/analyze.json` |
-| `--max-attempts` | 3 (default) |
+| `--max-attempts` | 5 (default) |
 | `--sbatch-extras` | site-specific `--account=...`, `--qos=...`, etc. |
 | `--pyxis-extra` | extra `srun --container-*` flags if site requires them |
+| `--manifest` | optional; path to a specific manifest (pass `run/manifest.rebased.jsonl` if the rebase skill ran) |
 
-The `--rclone-config` path is baked into `sbatch_array.sh` and resolved **on the transfer cluster at job time**, not on the workstation. It must:
+The `--rclone-config` path is baked into `sbatch_array.sh` and resolved **on the transfer cluster at job time**, not at render time. Render itself no longer requires the file to exist on the workstation — it only prints a warning if the local path is missing, since the actual consumer is the compute node. Still:
 
-- Be an absolute path valid on that cluster's compute nodes (home dirs and shared paths differ between sites).
-- Exist with `0600` permissions before the job starts.
+- The path must be an absolute path valid on the cluster's compute nodes.
+- It must exist with `0600` permissions **on the cluster** before the job starts.
 
-If the user doesn't already have the config deployed to this cluster at a known path, **stop and invoke `xfer-rclone-config`** to set it up and record the path. Don't guess the path — a wrong value here means every array task will fail identically at container start.
+If the user doesn't already have the config deployed to the cluster at a known path, invoke `xfer-rclone-config` to set it up and record the path. A wrong value here means every array task will fail identically at container start, so double-check the path.
+
+Use `--manifest` whenever the user ran `xfer-manifest-rebase`: render reads `source_root` / `dest_root` from a manifest file, and the default path (`<run_dir>/manifest.jsonl`) is intentionally left at the pre-rebase vantage point as an audit record. Passing `--manifest run/manifest.rebased.jsonl` is how render picks up the rebased roots.
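The two requirements on the `--rclone-config` path (absolute, and present with `0600` permissions on the cluster) could be checked with a small script run on a compute node before submitting. The sketch below is a hypothetical helper, not something `xfer` ships; `check_rclone_config` and its message strings are invented for illustration, and it assumes a POSIX filesystem.

```python
import stat
from pathlib import Path
from typing import List


def check_rclone_config(path_str: str) -> List[str]:
    """Return a list of problems with the rclone config path; empty means OK.

    Intended to run on the transfer cluster, since that is where the path is resolved.
    """
    problems: List[str] = []
    path = Path(path_str)
    if not path.is_absolute():
        problems.append("path is not absolute")
    if not path.is_file():
        problems.append("file does not exist on this host")
        return problems  # permissions cannot be checked without the file
    # Keep only the permission bits (e.g. 0o600) from the full mode.
    mode = stat.S_IMODE(path.stat().st_mode)
    if mode != 0o600:
        problems.append(f"permissions are {mode:04o}, expected 0600")
    return problems
```

Returning a list of problems rather than raising on the first one lets the caller report everything that needs fixing in a single pass.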
+        help="Cores per shard worker (matches slurm_render default). Used only for shard-count suggestion.",
+    ),
+    assumed_array_concurrency: int = typer.Option(
+        64,
+        min=1,
+        help="Expected Slurm array concurrency (matches slurm_render default). Used only for shard-count suggestion.",
+    ),
+    assumed_core_budget: Optional[int] = typer.Option(
+        None,
+        help="Total cores the transfer cluster's partition will make available. Used only for shard-count suggestion. If omitted, the core constraint is skipped.",
+    ),
+    max_shard_bytes_tb: int = typer.Option(
+        10,
+        min=1,
+        help="Per-shard byte cap in TiB (no single shard should exceed this).",
+    ),
 ) -> None:
     """
     Analyze manifest file sizes and suggest optimal rclone flags.
+        help="Absolute path to rclone.conf on the transfer cluster's compute nodes. Not required to exist on this host; a warning is emitted if it is missing locally.",
         resolve_path=True,
     ),
     rclone_conf_in_container: str = typer.Option(
@@ -911,6 +939,11 @@ def slurm_render(
     pyxis_extra: str = typer.Option(
         "", help="Extra pyxis flags (string placed after --container-mounts...)"
     ),
+    manifest: Optional[Path] = typer.Option(
+        None,
+        help="Manifest JSONL to read source/dest_root from. Defaults to <run_dir>/manifest.jsonl. Use this after `xfer manifest rebase` to point render at the rebased file.",
+        resolve_path=True,
+    ),
 ) -> None:
     """
     Render worker.sh, sbatch_array.sh, and submit.sh under run_dir.
@@ -919,12 +952,19 @@ def slurm_render(
     mkdirp(run_dir / "logs")
     mkdirp(run_dir / "state")
 
-    # If source/dest not provided, try to read first line of manifest.jsonl (if present)
+    if not rclone_config.exists():
+        eprint(
+            f"WARNING: rclone_config {rclone_config} does not exist on this host. "
+            "That is fine if the path is valid on the transfer cluster's compute nodes. "
+            "Verify before submitting."
+        )
+
+    # If source/dest not provided, try to read first line of the manifest (if present)
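The diff is cut off before the code that follows this comment. As a rough illustration of what "read first line of the manifest" could mean together with the new `--manifest` option, here is a standalone sketch. It is not the commit's code: `pick_manifest` and `read_roots` are hypothetical names, and it assumes each JSONL record carries `source_root` and `dest_root` as top-level keys.

```python
import json
from pathlib import Path
from typing import Optional, Tuple


def pick_manifest(run_dir: Path, manifest: Optional[Path] = None) -> Path:
    """Use the explicit manifest if given, else the documented default location."""
    return manifest if manifest is not None else run_dir / "manifest.jsonl"


def read_roots(manifest: Path) -> Tuple[Optional[str], Optional[str]]:
    """Read source_root/dest_root from the first JSONL record, if the file exists."""
    if not manifest.exists():
        return None, None
    with manifest.open() as fh:
        first = fh.readline().strip()
    if not first:
        return None, None  # empty manifest: nothing to infer
    record = json.loads(first)
    # Assumed layout: both roots are top-level keys of each record.
    return record.get("source_root"), record.get("dest_root")
```

With this shape, `read_roots(pick_manifest(run_dir, Path("run/manifest.rebased.jsonl")))` would return the rebased roots, while omitting the second argument falls back to the pre-rebase `manifest.jsonl`.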