Commit bfd0f14
[adapter,hparams] feat: support resuming from Hugging Face checkpoint paths (#160)
* [adapter,hparams] feat: support resuming from Hugging Face checkpoint paths
Extend `BaseAdapter.load_checkpoint` to transparently resolve a Hugging
Face repo spec (either `hf://owner/repo[/subdir][@rev]` or bare
`owner/repo[/...]`) to a local cache directory via
`huggingface_hub.snapshot_download`, reusing the existing `lora` /
`full` / `state` loading branches unchanged.
Resolution priority in `_resolve_checkpoint_path`:
1. `hf://` prefix -> force HF (overrides any colliding local dir)
2. Local path exists -> return as-is
3. Otherwise parse as `owner/repo[/subfolder][@revision]` and download
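The three-step priority above can be sketched as follows. This is a minimal illustration, not the code from this commit: `split_spec`, `resolve_checkpoint_path`, and the injected `download` callable are hypothetical names, and `download` stands in for a wrapper around `huggingface_hub.snapshot_download` so the sketch runs without network access.

```python
import os
from typing import Callable, Optional, Tuple

HF_PREFIX = "hf://"

# Hypothetical callable type: (repo_id, subfolder, revision) -> local directory.
Downloader = Callable[[str, Optional[str], Optional[str]], str]


def split_spec(spec: str) -> Tuple[str, Optional[str], Optional[str]]:
    """Split 'owner/repo[/subfolder][@revision]' into its three parts."""
    body, _, revision = spec.partition("@")
    parts = body.split("/")
    if len(parts) < 2 or not parts[0] or not parts[1]:
        raise ValueError(f"Not a valid Hugging Face spec: {spec!r}")
    repo_id = "/".join(parts[:2])
    subfolder = "/".join(parts[2:]) or None
    return repo_id, subfolder, revision or None


def resolve_checkpoint_path(path: str, download: Downloader) -> str:
    # 1. An explicit hf:// prefix always means Hugging Face, even if a
    #    local directory with a colliding name exists.
    if path.startswith(HF_PREFIX):
        return download(*split_spec(path[len(HF_PREFIX):]))
    # 2. An existing local path is returned unchanged.
    if os.path.exists(path):
        return path
    # 3. Anything else is parsed as a bare repo spec and downloaded.
    return download(*split_spec(path))
```

Injecting the downloader keeps the priority logic testable in isolation; a real caller would pass a function that downloads the snapshot and joins the subfolder onto the returned cache directory.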
Multi-node-safe: download gated on `is_local_main_process` (one per
node), not `is_main_process` (one global). On non-shared filesystems
each node populates its own HF cache once; on shared filesystems
huggingface_hub's per-blob `WeakFileLock` dedupes the concurrent
`snapshot_download` calls so only one node transfers bytes.
Fail-fast: narrow `except (RepositoryNotFoundError, HfHubHTTPError)`
re-raised as `FileNotFoundError` with full context (path, repo_id,
subfolder, revision, HF_TOKEN hint) for actionable error messages.
Also updates the `resume_path` help text and normalizes the inline
comment on all 59 example YAMLs to document the new HF support. No
new config fields; `resume_path: Optional[str]` keeps the same shape.
Note: pre-existing black/isort lint debt in the touched src files is
unrelated to this change.
Co-authored-by: Cursor <cursoragent@cursor.com>
* [adapter,docs] refactor: hoist HF imports to module top; codify rule
Move the two `huggingface_hub` imports added in the previous commit
out of function bodies and up to the module's import block:
- `from huggingface_hub import snapshot_download` in
`src/flow_factory/utils/checkpoint.py`
- `from huggingface_hub.errors import RepositoryNotFoundError,
HfHubHTTPError` in `src/flow_factory/models/abc.py`
`huggingface_hub` is already a hard dependency (`pyproject.toml:39`),
so there is no import-cost concern; lazy-loading only hid the
dependency surface from readers and `isort`.
Codify the rule so this pattern is caught in review going forward:
- Extend constraint #22 ("Import Style") in
`.agents/knowledge/constraints.md` with an explicit "Top-level imports
only" bullet, listing the three sanctioned exceptions (optional deps
via `try/except ImportError`, backend-gated imports under runtime
feature checks like DeepSpeed/FSDP, and unresolvable circular imports).
- Add a matching checklist item to the Code Style section of
`.agents/skills/ff-review/SKILL.md`.
Pre-existing inline FSDP/DeepSpeed imports in `models/abc.py` (lines
~997-1152) are grandfathered under the new rule's second exception
(backend-gated imports).
Co-authored-by: Cursor <cursoragent@cursor.com>
* [adapter,hparams] fix: address Copilot review on PR #160
Three fixes from the Copilot review on the upstream PR:
1. Path-traversal validation (utils/checkpoint.py):
parse_hf_checkpoint_path now rejects '.', '..', and backslash
segments with an informative ValueError that preserves the original
spec. Validates at the parser front door rather than at the
download site so error messages keep the user's exact input.
Without this, a spec like 'owner/repo/..' would escape the snapshot
directory via os.path.join.
2. Un-gate _resolve_checkpoint_path (models/abc.py):
Remove the `if is_local_main_process:` gate and the post-barrier
double-call pattern. All ranks now call snapshot_download directly;
huggingface_hub's per-blob WeakFileLock serializes concurrent calls
within each filesystem domain (cross-node on POSIX-locking shared
FS, per-node on non-shared FS), so we still get exactly one
download per filesystem domain.
This eliminates the distributed-deadlock hazard where a download
failure on the gated rank would raise before reaching
wait_for_everyone(), leaving siblings blocked at the barrier until
NCCL watchdog timeout. The trailing wait_for_everyone() is kept to
maintain lockstep entry into the downstream loaders.
Residual asymmetric-failure risk (one rank's network blip while
others succeed) is documented in the docstring.
3. Skill-checklist alignment (.agents/skills/ff-review/SKILL.md):
Replace the duplicated import-exception list with a reference to
constraint #22, where the full set of three sanctioned exceptions
(optional deps, backend-gated runtime feature checks, circular
imports) lives. Prevents future drift between the two documents.
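The deadlock hazard described in item 2 can be reproduced in miniature. The sketch below is an analogy under stated assumptions, not the training code: threads stand in for ranks, `threading.Barrier` stands in for `wait_for_everyone()`, the barrier timeout stands in for the NCCL watchdog, and `run_ranks` is a hypothetical helper.

```python
import threading
from typing import List, Optional


def run_ranks(world_size: int, gated: bool, timeout: float = 0.2) -> List[Optional[str]]:
    """Simulate ranks that download a checkpoint and then meet at a barrier."""
    barrier = threading.Barrier(world_size)
    outcomes: List[Optional[str]] = [None] * world_size

    def download(rank: int) -> None:
        # Simulated hard failure, e.g. the repo does not exist.
        raise FileNotFoundError(f"rank {rank}: simulated download failure")

    def worker(rank: int) -> None:
        try:
            # Gated: only rank 0 downloads. Un-gated: every rank downloads.
            if not gated or rank == 0:
                download(rank)
            barrier.wait(timeout=timeout)
            outcomes[rank] = "ok"
        except FileNotFoundError:
            # The failing rank exits before ever reaching the barrier.
            outcomes[rank] = "download failed"
        except threading.BrokenBarrierError:
            # The rank waited for a sibling that never arrived.
            outcomes[rank] = "stuck at barrier"

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(world_size)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return outcomes
```

With `gated=True`, rank 0 fails and the other ranks wait out the timeout at the barrier; with `gated=False`, a failure that affects every rank surfaces on every rank immediately. The asymmetric case, where only one rank fails, still leaves siblings waiting, which is the residual risk noted above.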
Verified with 8 happy-path + 5 original-error + 6 path-traversal
parser test cases (all pass).
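The path-traversal check from item 1 might look like the sketch below. It is an assumption-laden illustration: `validate_subfolder` is a hypothetical helper, not the body of `parse_hf_checkpoint_path`, and the error wording is invented. The first lines show why the check matters, using `posixpath` so the behaviour is the same on every platform.

```python
import posixpath

# Why '..' is dangerous: joining it onto a snapshot directory and
# normalising the result climbs out of that directory.
snapshot_dir = "/cache/snapshots/abc123"
escaped = posixpath.normpath(posixpath.join(snapshot_dir, ".."))
assert escaped == "/cache/snapshots"


def validate_subfolder(spec: str, subfolder: str) -> None:
    """Reject '.', '..', and backslash-containing segments in a subfolder.

    `spec` is the user's original input and is echoed verbatim in the
    error so the message points at exactly what was typed.
    """
    for segment in subfolder.split("/"):
        if segment in (".", "..") or "\\" in segment:
            raise ValueError(
                f"Invalid checkpoint spec {spec!r}: "
                f"path segment {segment!r} is not allowed"
            )
```

Validating while parsing, rather than at the download site, means the rejected value never reaches `os.path.join` at all.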
Co-authored-by: Cursor <cursoragent@cursor.com>
---------
Co-authored-by: Cursor <cursoragent@cursor.com>
1 parent 01f0a52 commit bfd0f14
64 files changed
Lines changed: 261 additions & 62 deletions
File tree
- .agents
- knowledge
- skills/ff-review
- examples
- awm/lora
- flux1
- flux2_klein_base
- sd3_5
- crd/lora/sd3_5
- dgpo/lora/sd3_5
- dpo/lora/sd3_5
- grpo
- full
- flux1_kontext
- flux1
- flux2_klein_base
- flux2_klein
- flux2
- qwen_image_edit_plus
- qwen_image
- sd3_5
- wan21
- wan22
- z_image_turbo
- z_image
- lora
- flux1_kontext
- flux1
- flux2_klein_base
- flux2_klein
- flux2
- ltx2
- qwen_image_edit_plus
- qwen_image
- sd3_5
- wan21
- wan22
- z_image_turbo
- z_image
- nft
- full
- flux1
- flux2_klein_base
- wan22
- z_image_turbo
- z_image
- lora
- flux1_kontext
- flux1
- flux2_klein_base
- qwen_image_edit_plus
- qwen_image
- sd3_5
- wan21
- wan22
- z_image
- template/sd3_5
- src/flow_factory
- hparams
- models
- utils