Skip to content

Commit bfd0f14

Browse files
[adapter,hparams] feat: support resuming from Hugging Face checkpoint paths (#160)
* [adapter,hparams] feat: support resuming from Hugging Face checkpoint paths Extend `BaseAdapter.load_checkpoint` to transparently resolve a Hugging Face repo spec (either `hf://owner/repo[/subdir][@rev]` or bare `owner/repo[/...]`) to a local cache directory via `huggingface_hub.snapshot_download`, reusing the existing `lora` / `full` / `state` loading branches unchanged. Logic priority at `_resolve_checkpoint_path`: 1. `hf://` prefix -> force HF (overrides any colliding local dir) 2. Local path exists -> return as-is 3. Otherwise parse as `owner/repo[/subfolder][@revision]` and download Multi-node-safe: download gated on `is_local_main_process` (one per node), not `is_main_process` (one global). On non-shared filesystems each node populates its own HF cache once; on shared filesystems huggingface_hub's per-blob `WeakFileLock` dedupes the concurrent `snapshot_download` calls so only one node transfers bytes. Fail-fast: narrow `except (RepositoryNotFoundError, HfHubHTTPError)` re-raised as `FileNotFoundError` with full context (path, repo_id, subfolder, revision, HF_TOKEN hint) for actionable error messages. Also updates the `resume_path` help text and normalizes the inline comment on all 59 example YAMLs to document the new HF support. No new config fields; `resume_path: Optional[str]` keeps the same shape. Note: pre-existing black/isort lint debt in the touched src files is unrelated to this change. Co-authored-by: Cursor <cursoragent@cursor.com> * [adapter,docs] refactor: hoist HF imports to module top; codify rule Move the two `huggingface_hub` imports added in the previous commit out of function bodies and up to the module's import block: - `from huggingface_hub import snapshot_download` in `src/flow_factory/utils/checkpoint.py` - `from huggingface_hub.errors import RepositoryNotFoundError, HfHubHTTPError` in `src/flow_factory/models/abc.py` `huggingface_hub` is already a hard dependency (`pyproject.toml:39`), so there is no import-cost concern; lazy-loading only hid the dependency surface from readers and `isort`. Codify the rule so this pattern is caught in review going forward: - Extend constraint #22 ("Import Style") in `.agents/knowledge/constraints.md` with an explicit "Top-level imports only" bullet, listing the three sanctioned exceptions (optional deps via `try/except ImportError`, backend-gated imports under runtime feature checks like DeepSpeed/FSDP, and unresolvable circular imports). - Add a matching checklist item to the Code Style section of `.agents/skills/ff-review/SKILL.md`. Pre-existing inline FSDP/DeepSpeed imports in `models/abc.py` (lines ~997-1152) are grandfathered under the new rule's exception (b). Co-authored-by: Cursor <cursoragent@cursor.com> * [adapter,hparams] fix: address Copilot review on PR #160 Three fixes from the Copilot review on the upstream PR: 1. Path-traversal validation (utils/checkpoint.py): parse_hf_checkpoint_path now rejects '.', '..', and backslash segments with an informative ValueError that preserves the original spec. Validates at the parser front door rather than at the download site so error messages keep the user's exact input. Without this, a spec like 'owner/repo/..' would escape the snapshot directory via os.path.join. 2. Un-gate _resolve_checkpoint_path (models/abc.py): Remove the `if is_local_main_process:` gate and the post-barrier double-call pattern. All ranks now call snapshot_download directly; huggingface_hub's per-blob WeakFileLock serializes concurrent calls within each filesystem domain (cross-node on POSIX-locking shared FS, per-node on non-shared FS), so we still get exactly one download per filesystem domain. This eliminates the distributed-deadlock hazard where a download failure on the gated rank would raise before reaching wait_for_everyone(), leaving siblings blocked at the barrier until NCCL watchdog timeout. The trailing wait_for_everyone() is kept to maintain lockstep entry into the downstream loaders. Residual asymmetric-failure risk (one rank's network blip while others succeed) is documented in the docstring. 3. Skill-checklist alignment (.agents/skills/ff-review/SKILL.md): Replace the duplicated import-exception list with a reference to constraint #22, where the full set of three sanctioned exceptions (optional deps, backend-gated runtime feature checks, circular imports) lives. Prevents future drift between the two documents. Verified with 8 happy-path + 5 original-error + 6 path-traversal parser test cases (all pass). Co-authored-by: Cursor <cursoragent@cursor.com> --------- Co-authored-by: Cursor <cursoragent@cursor.com>
1 parent 01f0a52 commit bfd0f14

64 files changed

Lines changed: 261 additions & 62 deletions

File tree

Some content is hidden

Large Commits have some content hidden by default. Use the searchbox below for content that may be hidden.

.agents/knowledge/constraints.md

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -131,6 +131,7 @@ The adapter sets inference dtype for frozen components and training dtype for tr
131131
- Use relative imports within `flow_factory` package (e.g., `from ..hparams import *`)
132132
- Use absolute imports for external packages
133133
- Follow existing wildcard import patterns for `hparams`
134+
- **Top-level imports only**: All `import` / `from ... import ...` statements MUST live at the top of the module, never inside function bodies, methods, `__init__`, or conditional branches. Sanctioned exceptions: (a) optional dependencies wrapped in `try/except ImportError` (e.g., `deepspeed`, `xformers`); (b) backend-gated imports where the target symbol is only resolvable under a specific runtime backend already selected by a preceding feature check (e.g., DeepSpeed/FSDP submodules guarded by `is_deepspeed()` / `is_fsdp2()` in `models/abc.py`); (c) genuine unresolvable circular imports documented inline. Lazy imports added merely for "import speed" or "to keep the module light" are NOT acceptable — every hard dependency already runs through Python's import machinery on a typical import path. Inline imports hide the dependency surface from readers, `isort`, and static-analysis tools, and re-execute on every call in hot loops.
134135

135136
### 23. Type Annotations
136137
All public methods must have type annotations. Use `typing` module types (`List`, `Dict`, `Optional`, `Tuple`, `Union`) for Python 3.10 compatibility.

.agents/skills/ff-review/SKILL.md

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -60,6 +60,7 @@ git status # Modified files
6060
- [ ] English comments and docstrings
6161
- [ ] Apache 2.0 license header on new files
6262
- [ ] No unnecessary wildcard imports (except `hparams`)
63+
- [ ] **Top-level imports only** (constraint #22) — see that file for the three sanctioned exceptions (optional deps via `try/except ImportError`, backend-gated runtime feature checks like DeepSpeed/FSDP, unresolvable circular imports).
6364

6465
### Documentation
6566
- [ ] `guidance/` docs updated if behavior changed

examples/awm/lora/flux1/default.yaml

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -23,7 +23,7 @@ model:
2323
target_modules: "default" # Options: all, default, or list of module names like ["to_k", "to_q", "to_v", "to_out.0"]
2424
model_name_or_path: "black-forest-labs/FLUX.1-dev" # HuggingFace model ID or local path
2525
model_type: "flux1"
26-
resume_path: null # Path to load previous checkpoint/lora adapter
26+
resume_path: null # Local path or HF repo id (e.g. 'owner/repo[/subdir][@rev]') for previous checkpoint/lora adapter
2727
resume_type: null # Options: lora, full, state. Null to auto-detect based on `finetune_type`
2828
# attn_backend: '_flash_3_hub'
2929

examples/awm/lora/flux2_klein_base/default.yaml

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -23,7 +23,7 @@ model:
2323
target_modules: "default" # Options: all, default, or list of module names like ["to_k", "to_q", "to_v", "to_out.0"]
2424
model_name_or_path: "black-forest-labs/FLUX.2-klein-base-4B" # Options: black-forest-labs/FLUX.2-klein-base-4B, black-forest-labs/FLUX.2-klein-base-9B
2525
model_type: "flux2-klein"
26-
resume_path: null # Path to load previous checkpoint/lora adapter
26+
resume_path: null # Local path or HF repo id (e.g. 'owner/repo[/subdir][@rev]') for previous checkpoint/lora adapter
2727
resume_type: null # Options: lora, full, state. Null to auto-detect based on `finetune_type`
2828
# attn_backend: '_flash_3_hub' # Attention backend for training.
2929

examples/awm/lora/sd3_5/default.yaml

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -24,7 +24,7 @@ model:
2424
target_modules: "default" # Options: all, default, or list of module names like ["to_k", "to_q", "to_v", "to_out.0"]
2525
model_name_or_path: "stabilityai/stable-diffusion-3.5-medium"
2626
model_type: "sd3-5"
27-
resume_path: null # Path to load previous checkpoint/lora adapter
27+
resume_path: null # Local path or HF repo id (e.g. 'owner/repo[/subdir][@rev]') for previous checkpoint/lora adapter
2828
resume_type: null # Options: lora, full, state. Null to auto-detect based on `finetune_type`
2929
# attn_backend: '_flash_3_hub' # Attention backend for training.
3030

examples/crd/lora/sd3_5/default.yaml

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -25,7 +25,7 @@ model:
2525
target_modules: "default" # Options: all, default, or list of module names like ["to_k", "to_q", "to_v", "to_out.0"]
2626
model_name_or_path: "stabilityai/stable-diffusion-3.5-medium"
2727
model_type: "sd3-5"
28-
resume_path: null # Path to load previous checkpoint/lora adapter
28+
resume_path: null # Local path or HF repo id (e.g. 'owner/repo[/subdir][@rev]') for previous checkpoint/lora adapter
2929
resume_type: null # Options: lora, full, state. Null to auto-detect based on `finetune_type`
3030
# attn_backend: '_flash_3_hub' # Attention backend for training.
3131

examples/dgpo/lora/sd3_5/default.yaml

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -48,7 +48,7 @@ model:
4848
target_modules: "default"
4949
model_name_or_path: "stabilityai/stable-diffusion-3.5-medium" # config.pretrained.model
5050
model_type: "sd3-5"
51-
resume_path: null
51+
resume_path: null # Local path or HF repo id (e.g. 'owner/repo[/subdir][@rev]') for previous checkpoint/lora adapter
5252
resume_type: null
5353

5454
# Training Configuration

examples/dgpo/lora/sd3_5/nocfg.yaml

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -39,7 +39,7 @@ model:
3939
target_modules: "default"
4040
model_name_or_path: "stabilityai/stable-diffusion-3.5-medium" # config.pretrained.model
4141
model_type: "sd3-5"
42-
resume_path: null
42+
resume_path: null # Local path or HF repo id (e.g. 'owner/repo[/subdir][@rev]') for previous checkpoint/lora adapter
4343
resume_type: null
4444

4545
# Training Configuration

examples/dpo/lora/sd3_5/default.yaml

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -50,7 +50,7 @@ model:
5050
target_modules: "default"
5151
model_name_or_path: "stabilityai/stable-diffusion-3.5-medium" # Same as flow_grpo
5252
model_type: "sd3-5"
53-
resume_path: null
53+
resume_path: null # Local path or HF repo id (e.g. 'owner/repo[/subdir][@rev]') for previous checkpoint/lora adapter
5454
resume_type: null
5555

5656
log:

examples/grpo/full/flux1/default.yaml

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -21,7 +21,7 @@ model:
2121
target_modules: "default" # Options: all, default, or list of module names like ["to_k", "to_q", "to_v", "to_out.0"]
2222
model_name_or_path: "black-forest-labs/FLUX.1-dev" # HuggingFace model ID or local path
2323
model_type: "flux1"
24-
resume_path: null # Directory contains previous checkpoint/lora adapter
24+
resume_path: null # Local path or HF repo id (e.g. 'owner/repo[/subdir][@rev]') for previous checkpoint/lora adapter
2525
resume_type: null # Options: lora, full, state. Null to auto-detect based on `finetune_type`
2626

2727
log:

0 commit comments

Comments
 (0)