Skip to content

Commit aa883aa

Browse files
[adapter,hparams] fix: address Copilot review on PR X-GenGroup#160
Three fixes from the Copilot review on the upstream PR: 1. Path-traversal validation (utils/checkpoint.py): parse_hf_checkpoint_path now rejects '.', '..', and backslash segments with an informative ValueError that preserves the original spec. Validates at the parser front door rather than at the download site so error messages keep the user's exact input. Without this, a spec like 'owner/repo/..' would escape the snapshot directory via os.path.join. 2. Un-gate _resolve_checkpoint_path (models/abc.py): Remove the `if is_local_main_process:` gate and the post-barrier double-call pattern. All ranks now call snapshot_download directly; huggingface_hub's per-blob WeakFileLock serializes concurrent calls within each filesystem domain (cross-node on POSIX-locking shared FS, per-node on non-shared FS), so we still get exactly one download per filesystem domain. This eliminates the distributed-deadlock hazard where a download failure on the gated rank would raise before reaching wait_for_everyone(), leaving siblings blocked at the barrier until NCCL watchdog timeout. The trailing wait_for_everyone() is kept to maintain lockstep entry into the downstream loaders. Residual asymmetric-failure risk (one rank's network blip while others succeed) is documented in the docstring. 3. Skill-checklist alignment (.agents/skills/ff-review/SKILL.md): Replace the duplicated import-exception list with a reference to constraint X-GenGroup#22, where the full set of three sanctioned exceptions (optional deps, backend-gated runtime feature checks, circular imports) lives. Prevents future drift between the two documents. Verified with 8 happy-path + 5 original-error + 6 path-traversal parser test cases (all pass). Co-authored-by: Cursor <cursoragent@cursor.com>
1 parent 5fb0b14 commit aa883aa

3 files changed

Lines changed: 38 additions & 18 deletions

File tree

.agents/skills/ff-review/SKILL.md

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -60,7 +60,7 @@ git status # Modified files
6060
- [ ] English comments and docstrings
6161
- [ ] Apache 2.0 license header on new files
6262
- [ ] No unnecessary wildcard imports (except `hparams`)
63-
- [ ] **Top-level imports only** — no `import` / `from ... import ...` inside function or method bodies (constraint #22). Exceptions: documented optional deps via `try/except ImportError`, or genuine unresolvable circular imports.
63+
- [ ] **Top-level imports only** (constraint #22) — see that file for the three sanctioned exceptions (optional deps via `try/except ImportError`, backend-gated runtime feature checks like DeepSpeed/FSDP, unresolvable circular imports).
6464

6565
### Documentation
6666
- [ ] `guidance/` docs updated if behavior changed

src/flow_factory/models/abc.py

Lines changed: 26 additions & 17 deletions
Original file line numberDiff line numberDiff line change
@@ -1470,11 +1470,17 @@ def _resolve_checkpoint_path(self, path: str) -> str:
14701470
3. Otherwise, parse as ``owner/repo[/subfolder][@revision]`` and download
14711471
via Hugging Face Hub.
14721472
1473-
Multi-node-safe: the download is gated on ``is_local_main_process`` (one
1474-
process per node), not ``is_main_process`` (one global). This populates the
1475-
per-node HF cache exactly once on non-shared filesystems; on shared
1476-
filesystems, ``huggingface_hub``'s per-blob ``WeakFileLock`` dedupes the
1477-
concurrent ``snapshot_download`` calls so only one node transfers bytes.
1473+
Multi-node-safe: all ranks call ``snapshot_download`` directly. Hugging
1474+
Face Hub's per-blob ``WeakFileLock`` serializes concurrent calls within
1475+
each filesystem domain (cross-node on POSIX-locking shared FS, per-node
1476+
on non-shared FS), so exactly one rank per filesystem domain actually
1477+
transfers bytes. Un-gated (rather than ``is_local_main_process`` plus a
1478+
barrier) so a failed download raises uniformly on every affected rank
1479+
instead of leaving siblings deadlocked at a barrier the failing rank
1480+
never reaches. Residual hazard: a rare single-rank transient failure
1481+
(e.g. one node's network blip) can produce asymmetric progress, in
1482+
which case the surviving ranks will eventually trip the NCCL watchdog
1483+
on the final barrier below.
14781484
14791485
Args:
14801486
path: Local filesystem path or HF spec (with or without ``hf://`` prefix).
@@ -1493,19 +1499,8 @@ def _resolve_checkpoint_path(self, path: str) -> str:
14931499

14941500
repo_id, subfolder, revision = parse_hf_checkpoint_path(spec)
14951501

1496-
if self.accelerator.is_local_main_process:
1497-
local_path = download_hf_checkpoint(repo_id, subfolder, revision)
1498-
logger.info(
1499-
f"[local rank 0 / global rank {self.accelerator.process_index}] "
1500-
f"resolved checkpoint '{path}' -> {local_path}"
1501-
)
1502-
self.accelerator.wait_for_everyone()
1503-
1504-
# All ranks call again; on the populated cache this is a metadata-only
1505-
# path lookup. Narrow re-raise for the specific HF-Hub failure modes so
1506-
# users see a single actionable message instead of a raw HTTPError.
15071502
try:
1508-
return download_hf_checkpoint(repo_id, subfolder, revision)
1503+
local_path = download_hf_checkpoint(repo_id, subfolder, revision)
15091504
except (RepositoryNotFoundError, HfHubHTTPError) as e:
15101505
raise FileNotFoundError(
15111506
f"Checkpoint {path!r} not found locally and could not be fetched "
@@ -1514,6 +1509,20 @@ def _resolve_checkpoint_path(self, path: str) -> str:
15141509
f"on ALL nodes."
15151510
) from e
15161511

1512+
# Sync after download so downstream loaders enter the lockstep dispatch
1513+
# together. On symmetric failure every rank raises above before this
1514+
# barrier is reached, so no deadlock; the residual asymmetric-failure
1515+
# case is documented in the docstring.
1516+
self.accelerator.wait_for_everyone()
1517+
1518+
if self.accelerator.is_local_main_process:
1519+
logger.info(
1520+
f"[local rank 0 / global rank {self.accelerator.process_index}] "
1521+
f"resolved checkpoint '{path}' -> {local_path}"
1522+
)
1523+
1524+
return local_path
1525+
15171526
@staticmethod
15181527
def load_sharded_checkpoint(checkpoint_dir: str, index_file: str) -> Dict[str, torch.Tensor]:
15191528
"""Load sharded safetensors checkpoint."""

src/flow_factory/utils/checkpoint.py

Lines changed: 11 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -189,6 +189,17 @@ def parse_hf_checkpoint_path(path: str) -> Tuple[str, Optional[str], Optional[st
189189
f"(expected at least 'owner/repo', got {len(parts)} non-empty segments)"
190190
)
191191

192+
# Reject path-traversal segments. Without this, a spec like
193+
# 'owner/repo/..' would resolve via os.path.join to a directory outside
194+
# the snapshot root and let downstream loaders read from unintended
195+
# locations. Backslashes are rejected to block Windows-style traversal.
196+
for seg in parts:
197+
if seg in (".", "..") or "\\" in seg:
198+
raise ValueError(
199+
f"invalid segment {seg!r} in HF checkpoint path: {path!r} "
200+
f"('.', '..', and backslashes are not allowed)"
201+
)
202+
192203
repo_id = "/".join(parts[:2])
193204
subfolder = "/".join(parts[2:]) if len(parts) > 2 else None
194205
return repo_id, subfolder, revision

0 commit comments

Comments
 (0)