Commit 2207c92

[adapter,hparams] feat: support resuming from Hugging Face checkpoint paths
Extend `BaseAdapter.load_checkpoint` to transparently resolve a Hugging Face repo spec (either `hf://owner/repo[/subdir][@rev]` or bare `owner/repo[/...]`) to a local cache directory via `huggingface_hub.snapshot_download`, reusing the existing `lora` / `full` / `state` loading branches unchanged.

Logic priority at `_resolve_checkpoint_path`:
  1. `hf://` prefix -> force HF (overrides any colliding local dir)
  2. Local path exists -> return as-is
  3. Otherwise parse as `owner/repo[/subfolder][@revision]` and download

Multi-node-safe: download gated on `is_local_main_process` (one per node), not `is_main_process` (one global). On non-shared filesystems each node populates its own HF cache once; on shared filesystems huggingface_hub's per-blob `WeakFileLock` dedupes the concurrent `snapshot_download` calls so only one node transfers bytes.

Fail-fast: narrow `except (RepositoryNotFoundError, HfHubHTTPError)` re-raised as `FileNotFoundError` with full context (path, repo_id, subfolder, revision, HF_TOKEN hint) for actionable error messages.

Also updates the `resume_path` help text and normalizes the inline comment on all 59 example YAMLs to document the new HF support. No new config fields; `resume_path: Optional[str]` keeps the same shape.

Note: pre-existing black/isort lint debt in the touched src files is unrelated to this change.

Co-authored-by: Cursor <cursoragent@cursor.com>
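The resolution priority described in the commit message can be sketched roughly as follows. This is a minimal illustration, not the code from this commit: the names `HFSpec`, `parse_hf_spec`, `download_snapshot`, and `resolve_checkpoint_path` are hypothetical, and restricting the download to a subfolder with `allow_patterns` is an assumption about how the subfolder might be handled. Only `huggingface_hub.snapshot_download` and its `repo_id`, `revision`, and `allow_patterns` parameters are taken from the real library.

```python
import os
from typing import Callable, NamedTuple, Optional

HF_PREFIX = "hf://"


class HFSpec(NamedTuple):
    """Parsed form of 'owner/repo[/subdir][@rev]' (hypothetical helper type)."""
    repo_id: str
    subfolder: Optional[str]
    revision: Optional[str]


def parse_hf_spec(spec: str) -> HFSpec:
    """Split a repo spec into repo id, optional subfolder, optional revision."""
    if spec.startswith(HF_PREFIX):
        spec = spec[len(HF_PREFIX):]
    # Everything after the first '@' is treated as the revision.
    body, _, revision = spec.partition("@")
    parts = [p for p in body.split("/") if p]
    if len(parts) < 2:
        raise ValueError(f"expected 'owner/repo[/subdir][@rev]', got {spec!r}")
    repo_id = "/".join(parts[:2])
    subfolder = "/".join(parts[2:]) or None
    return HFSpec(repo_id, subfolder, revision or None)


def download_snapshot(spec: HFSpec) -> str:
    """Download into the local HF cache and return the checkpoint directory."""
    # Imported lazily so that purely local paths never need huggingface_hub.
    from huggingface_hub import snapshot_download

    # Assumption: fetch only the requested subfolder when one is given.
    allow = [f"{spec.subfolder}/*"] if spec.subfolder else None
    root = snapshot_download(
        repo_id=spec.repo_id, revision=spec.revision, allow_patterns=allow
    )
    return os.path.join(root, spec.subfolder) if spec.subfolder else root


def resolve_checkpoint_path(
    path: str, download: Callable[[HFSpec], str] = download_snapshot
) -> str:
    """Apply the three-step priority from the commit message."""
    # 1. An explicit hf:// prefix always wins, even if a local dir collides.
    if path.startswith(HF_PREFIX):
        return download(parse_hf_spec(path))
    # 2. An existing local path is returned unchanged.
    if os.path.exists(path):
        return path
    # 3. Otherwise treat the string as a repo spec and download it.
    return download(parse_hf_spec(path))
```

Passing the downloader as a parameter is a choice made for this sketch only; it lets the priority logic be exercised without network access, for example `resolve_checkpoint_path("hf://owner/repo@v1", download=lambda spec: "/tmp/fake")`.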
1 parent 01f0a52 commit 2207c92

62 files changed

Lines changed: 241 additions & 62 deletions

examples/awm/lora/flux1/default.yaml

Lines changed: 1 addition & 1 deletion
@@ -23,7 +23,7 @@ model:
 target_modules: "default" # Options: all, default, or list of module names like ["to_k", "to_q", "to_v", "to_out.0"]
 model_name_or_path: "black-forest-labs/FLUX.1-dev" # HuggingFace model ID or local path
 model_type: "flux1"
-resume_path: null # Path to load previous checkpoint/lora adapter
+resume_path: null # Local path or HF repo id (e.g. 'owner/repo[/subdir][@rev]') for previous checkpoint/lora adapter
 resume_type: null # Options: lora, full, state. Null to auto-detect based on `finetune_type`
 # attn_backend: '_flash_3_hub'

examples/awm/lora/flux2_klein_base/default.yaml

Lines changed: 1 addition & 1 deletion
@@ -23,7 +23,7 @@ model:
 target_modules: "default" # Options: all, default, or list of module names like ["to_k", "to_q", "to_v", "to_out.0"]
 model_name_or_path: "black-forest-labs/FLUX.2-klein-base-4B" # Options: black-forest-labs/FLUX.2-klein-base-4B, black-forest-labs/FLUX.2-klein-base-9B
 model_type: "flux2-klein"
-resume_path: null # Path to load previous checkpoint/lora adapter
+resume_path: null # Local path or HF repo id (e.g. 'owner/repo[/subdir][@rev]') for previous checkpoint/lora adapter
 resume_type: null # Options: lora, full, state. Null to auto-detect based on `finetune_type`
 # attn_backend: '_flash_3_hub' # Attention backend for training.

examples/awm/lora/sd3_5/default.yaml

Lines changed: 1 addition & 1 deletion
@@ -24,7 +24,7 @@ model:
 target_modules: "default" # Options: all, default, or list of module names like ["to_k", "to_q", "to_v", "to_out.0"]
 model_name_or_path: "stabilityai/stable-diffusion-3.5-medium"
 model_type: "sd3-5"
-resume_path: null # Path to load previous checkpoint/lora adapter
+resume_path: null # Local path or HF repo id (e.g. 'owner/repo[/subdir][@rev]') for previous checkpoint/lora adapter
 resume_type: null # Options: lora, full, state. Null to auto-detect based on `finetune_type`
 # attn_backend: '_flash_3_hub' # Attention backend for training.

examples/crd/lora/sd3_5/default.yaml

Lines changed: 1 addition & 1 deletion
@@ -25,7 +25,7 @@ model:
 target_modules: "default" # Options: all, default, or list of module names like ["to_k", "to_q", "to_v", "to_out.0"]
 model_name_or_path: "stabilityai/stable-diffusion-3.5-medium"
 model_type: "sd3-5"
-resume_path: null # Path to load previous checkpoint/lora adapter
+resume_path: null # Local path or HF repo id (e.g. 'owner/repo[/subdir][@rev]') for previous checkpoint/lora adapter
 resume_type: null # Options: lora, full, state. Null to auto-detect based on `finetune_type`
 # attn_backend: '_flash_3_hub' # Attention backend for training.

examples/dgpo/lora/sd3_5/default.yaml

Lines changed: 1 addition & 1 deletion
@@ -48,7 +48,7 @@ model:
 target_modules: "default"
 model_name_or_path: "stabilityai/stable-diffusion-3.5-medium" # config.pretrained.model
 model_type: "sd3-5"
-resume_path: null
+resume_path: null # Local path or HF repo id (e.g. 'owner/repo[/subdir][@rev]') for previous checkpoint/lora adapter
 resume_type: null

 # Training Configuration

examples/dgpo/lora/sd3_5/nocfg.yaml

Lines changed: 1 addition & 1 deletion
@@ -39,7 +39,7 @@ model:
 target_modules: "default"
 model_name_or_path: "stabilityai/stable-diffusion-3.5-medium" # config.pretrained.model
 model_type: "sd3-5"
-resume_path: null
+resume_path: null # Local path or HF repo id (e.g. 'owner/repo[/subdir][@rev]') for previous checkpoint/lora adapter
 resume_type: null

 # Training Configuration

examples/dpo/lora/sd3_5/default.yaml

Lines changed: 1 addition & 1 deletion
@@ -50,7 +50,7 @@ model:
 target_modules: "default"
 model_name_or_path: "stabilityai/stable-diffusion-3.5-medium" # Same as flow_grpo
 model_type: "sd3-5"
-resume_path: null
+resume_path: null # Local path or HF repo id (e.g. 'owner/repo[/subdir][@rev]') for previous checkpoint/lora adapter
 resume_type: null

 log:

examples/grpo/full/flux1/default.yaml

Lines changed: 1 addition & 1 deletion
@@ -21,7 +21,7 @@ model:
 target_modules: "default" # Options: all, default, or list of module names like ["to_k", "to_q", "to_v", "to_out.0"]
 model_name_or_path: "black-forest-labs/FLUX.1-dev" # HuggingFace model ID or local path
 model_type: "flux1"
-resume_path: null # Directory contains previous checkpoint/lora adapter
+resume_path: null # Local path or HF repo id (e.g. 'owner/repo[/subdir][@rev]') for previous checkpoint/lora adapter
 resume_type: null # Options: lora, full, state. Null to auto-detect based on `finetune_type`

 log:

examples/grpo/full/flux1_kontext/default.yaml

Lines changed: 1 addition & 1 deletion
@@ -21,7 +21,7 @@ model:
 target_modules: "default" # Options: all, default, or list of module names like ["to_k", "to_q", "to_v", "to_out.0"]
 model_name_or_path: "black-forest-labs/FLUX.1-Kontext-dev" # HuggingFace model ID or local path
 model_type: "flux1-kontext"
-resume_path: null # Directory contains previous checkpoint/lora adapter
+resume_path: null # Local path or HF repo id (e.g. 'owner/repo[/subdir][@rev]') for previous checkpoint/lora adapter
 resume_type: null # Options: lora, full, state. Null to auto-detect based on `finetune_type`

 log:

examples/grpo/full/flux2/i2i.yaml

Lines changed: 1 addition & 1 deletion
@@ -24,7 +24,7 @@ model:
 target_modules: ["attn.to_q", "attn.to_k", "attn.to_v", "attn.to_out.0"]
 model_name_or_path: "black-forest-labs/FLUX.2-dev" # HuggingFace model ID or local path
 model_type: "flux2" # Options: flux1, flux1-kontext, flux2, qwenimage, qwenimage-edit
-resume_path: null # Directory contains previous checkpoint/lora adapter
+resume_path: null # Local path or HF repo id (e.g. 'owner/repo[/subdir][@rev]') for previous checkpoint/lora adapter
 resume_type: null # Options: lora, full, state. Null to auto-detect based on `finetune_type`

 log:
