fix(train/data): default create_empty_sample to bfloat16 to match training dtype

coolthor · coolthor · commit 0d69f0f2bdfb · 2026-05-17T00:36:13.000+08:00
When vLLM hidden-state extraction times out during training, the training
loop substitutes an empty sample via `create_empty_sample()`. The original
implementation built `torch.empty(0, ...)` tensors with no `dtype` argument,
so PyTorch fell back to `float32`.

Downstream EAGLE-3 layers (`fc`, `verifier_lm_head`) load `bfloat16` weights
when training a bf16 verifier. The first time the empty sample reached one
of those layers we got an explicit dtype mismatch crash, taking the whole
job down at a random late-epoch step.
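
The mismatch can be sketched in isolation. This is a minimal illustration
rather than the project's code: `hidden_size = 8` and the standalone
`torch.nn.Linear` standing in for `fc` are assumptions, and the exact error
text may differ by PyTorch version.

```python
import torch

hidden_size = 8  # illustrative size, not taken from the patch

# No dtype argument: torch.empty uses the global default dtype
# (float32 unless the default has been changed).
placeholder = torch.empty(0, 3 * hidden_size)

# Hypothetical stand-in for a bf16 EAGLE-3 layer such as `fc`.
fc = torch.nn.Linear(3 * hidden_size, hidden_size, dtype=torch.bfloat16)

try:
    fc(placeholder)
    raised = False
except RuntimeError as err:
    # Float32 input meeting bfloat16 weights is expected to land here.
    raised = True
    print(f"dtype mismatch: {err}")

# With an explicit bf16 placeholder the layer accepts the empty batch.
fixed = torch.empty(0, 3 * hidden_size, dtype=torch.bfloat16)
out = fc(fixed)
print(placeholder.dtype, fixed.dtype, tuple(out.shape), out.dtype)
```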

This patch:

- Adds a `dtype: torch.dtype = torch.bfloat16` keyword argument (covers the
  common case of training against a bf16 verifier).
- Threads it through the `hidden_states` and `verifier_last_hidden_states`
  tensors so the empty placeholders match downstream weight dtype.
- Also pins `input_ids` to `torch.long` (it previously took the default
  float dtype, which is wrong for token ids).

Reproducer: train an EAGLE-3 drafter against any bf16 verifier with
`extract_hidden_states` + `ExampleHiddenStatesConnector`. Over a long enough
run, a vLLM extraction time-out eventually triggers the empty-sample path
and the run crashes with `RuntimeError: ... expected Float, got BFloat16`.
With the new default, the empty sample passes through the bf16 layers
cleanly.

Callers that train against a non-bf16 verifier can override the dtype
explicitly.
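
An override might look like the sketch below. The function here is a local
stand-in that mirrors only the fields visible in this diff, so the snippet
runs on its own; real callers would use the function from
`src/speculators/train/data.py`, and the `float16` verifier is an assumed
example.

```python
import torch


def create_empty_sample(hidden_size: int, dtype: torch.dtype = torch.bfloat16):
    # Local stand-in mirroring the patched function; not the project import.
    return {
        "hidden_states": torch.empty(0, 3 * hidden_size, dtype=dtype),
        "input_ids": torch.empty(0, dtype=torch.long),
        "verifier_last_hidden_states": torch.empty(0, hidden_size, dtype=dtype),
        "loss_mask": torch.empty(0),
        "lengths": torch.tensor([0], dtype=torch.long),
        "position_ids": torch.arange(0, dtype=torch.long),
    }


# Default: placeholders match a bf16 verifier.
bf16_sample = create_empty_sample(hidden_size=16)

# Override: placeholders for a (hypothetical) float16 verifier.
fp16_sample = create_empty_sample(hidden_size=16, dtype=torch.float16)

print(bf16_sample["hidden_states"].dtype, fp16_sample["hidden_states"].dtype)
```

Only the two hidden-state tensors follow `dtype`; `input_ids`, `lengths`, and
`position_ids` stay `torch.long` regardless of the override.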
diff --git a/src/speculators/train/data.py b/src/speculators/train/data.py
@@ -64,7 +64,7 @@ def split_files(datapath: str, ratio: float = 0.9, seed: int = 0):
 StandardizeFnSig = Callable[[dict[str, Any]], dict[str, Any]]
 
 
-def create_empty_sample(hidden_size: int):
+def create_empty_sample(hidden_size: int, dtype: torch.dtype = torch.bfloat16):
     # data structure: {
     #     "hidden_states": [seq_len, 3 * hidden_size],
     #     "input_ids": [seq_len],
@@ -73,11 +73,15 @@ def create_empty_sample(hidden_size: int):
     #     "lengths": [1],
     #     "position_ids": [seq_len],
     # }
+    # Default dtype is bfloat16 to match the hidden_states dtype used downstream.
+    # When this fallback is used (e.g. vLLM hidden-state extraction times out and
+    # we substitute an empty sample), the implicit float32 placeholders crashed
+    # bf16 EAGLE-3 layers (fc, verifier_lm_head) with a dtype mismatch.
 
     return {
-        "hidden_states": torch.empty(0, 3 * hidden_size),
-        "input_ids": torch.empty(0),
-        "verifier_last_hidden_states": torch.empty(0, hidden_size),
+        "hidden_states": torch.empty(0, 3 * hidden_size, dtype=dtype),
+        "input_ids": torch.empty(0, dtype=torch.long),
+        "verifier_last_hidden_states": torch.empty(0, hidden_size, dtype=dtype),
         "loss_mask": torch.empty(0),
         "lengths": torch.tensor([0], dtype=torch.long),
         "position_ids": torch.arange(0, dtype=torch.long),