Add Qwen3.6-35B-A3B MoE LoRA function-calling test case (Megatron-Bridge + Kubeflow PyTorchJob) by yhou-uk · Pull Request #1091 · awslabs/awsome-distributed-ai

yhou-uk · 2026-05-14T13:28:26Z

Purpose

Add a self-contained Kubernetes test case for parameter-efficient fine-tuning of a Mixture-of-Experts model. This is the first MoE PEFT recipe under 3.test_cases/megatron/megatron-lm/ and the first to exercise Megatron-Bridge 0.4's expert_model_parallel_size dimension end-to-end (training, adapter export to HuggingFace PEFT format, and vLLM serving).

Changes

New test case at 3.test_cases/megatron/megatron-lm/qwen36-moe-lora/ (28 files) following the canonical <framework>/<library>/<model>/{kubernetes,...} layout
Training driver (src/xlam_runner.py) wraps Megatron-Bridge's qwen35_vl_35b_a3b_peft_config recipe with overrides for text-only SFT (vision encoder frozen, HuggingFace tokenizer, xLAM dataset)
Adapter exporter (src/export_lora_adapter.py) converts the distributed Megatron-Bridge LoRA checkpoint to standard HuggingFace PEFT format using AutoBridge.save_hf_adapter with adapter-only state-dict filtering
Eight numbered Kubernetes manifests covering: storage (FSx Lustre 2.15), weight pre-cache, dataset prep, HF→Bridge conversion, training (Kubeflow PyTorchJob), adapter export, vLLM inference (Deployment + Service), and evaluation (Job)
Nine numbered shell scripts that drive the manifests via envsubst with explicit variable whitelists
Two-gate evaluation harness (src/eval_function_calling.py): 10 hand-crafted function-calling prompts, plus 50 held-out xLAM validation prompts judged by Claude Opus 4.7 via Amazon Bedrock
Documentation: top-level README, docs/PERFORMANCE.md (observed step time, MFU, parallelism topology), docs/EVALUATION.md (methodology + reference results), docs/TROUBLESHOOTING.md (failure modes encountered during bring-up)

Test Plan

Environment:

AWS Service: Amazon SageMaker HyperPod (EKS mode)
Instance type: ml.p5e.48xlarge (8x NVIDIA H200 141 GB, 32 EFA NICs per node)
Number of nodes: 2 (16 H200 total)

Storage: FSx for Lustre 2.15 (Scratch_2, 1.2 TiB)
Container: nvcr.io/nvidia/nemo:26.04 (Megatron-Bridge 0.4.0rc0, Megatron-Core 0.17.0rc0)
Parallelism: TP=2, PP=1, EP=4, DP=2

Test commands:

cd 3.test_cases/megatron/megatron-lm/qwen36-moe-lora
cp kubernetes/scripts/env.example kubernetes/scripts/env.sh
# fill in FSX_SUBNET_ID, FSX_SECURITY_GROUP, etc.
source kubernetes/scripts/env.sh

./kubernetes/scripts/0.setup-storage.sh        # FSx Lustre PVC + ConfigMap
./kubernetes/scripts/1.precache-weights.sh     # ~30 min, 70 GB
./kubernetes/scripts/2.prep-dataset.sh         # ~2 min
./kubernetes/scripts/3.convert-to-bridge.sh    # ~25 min, one-time
./kubernetes/scripts/4.train.sh                # ~2 h 12 min
./kubernetes/scripts/5.export-adapter.sh       # ~10 min
./kubernetes/scripts/6.deploy-inference.sh     # ~8 min cold start
GATE=all ./kubernetes/scripts/7.run-eval.sh    # ~6 min

To skip training and serve the published reference adapter directly:

export LORA_SOURCE=hf
export LORA_REPO=ying2022/qwen3-6-35b-xlam-tools-lora
./kubernetes/scripts/6.deploy-inference.sh

Test Results

End-to-end run executed on 2x ml.p5e.48xlarge.

Training:

Metric	Value
Wall-clock, 4,200 iterations	2 h 12 min
Step time (mean)	1.73 s
Per-GPU throughput	~77 MODEL TFLOP/s
Aggregate throughput	~1.2 PFLOPS
Total tokens trained	413 M (3.43 epochs over 58.8k samples)
Memory per GPU	56-68 GB / 141 GB (40-48% utilization)
LoRA adapter output	108 MB (rank 64, alpha 128, target `linear_qkv` + `linear_proj`)
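
The derived rows in the table can be cross-checked from the raw figures. The sketch below only re-does the arithmetic on numbers quoted above; the variable names are illustrative and nothing here is measured.

```python
# Figures copied from the Training table and the Environment section.
NUM_GPUS = 16                  # 2 nodes x 8 H200
PER_GPU_TFLOPS = 77.0          # approximate model TFLOP/s per GPU
HBM_GB = 141.0                 # H200 memory per GPU
MEM_RANGE_GB = (56.0, 68.0)    # observed memory per GPU
EPOCHS = 3.43
TRAIN_SAMPLES = 58_800

aggregate_pflops = NUM_GPUS * PER_GPU_TFLOPS / 1000.0
mem_util = tuple(round(100 * gb / HBM_GB) for gb in MEM_RANGE_GB)
samples_seen = EPOCHS * TRAIN_SAMPLES

print(f"aggregate throughput ~ {aggregate_pflops:.2f} PFLOP/s")
print(f"memory utilization ~ {mem_util[0]}-{mem_util[1]}%")
print(f"samples seen ~ {samples_seen:,.0f}")
```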

Evaluation:

Gate	Metric	Result
Gate 1 (10 hand-crafted prompts)	Pass rate	base 6/10 -> LoRA 9/10 (+30 pp)
Gate 2 (50 xLAM val prompts, Claude Opus 4.7 judge)	LoRA win rate	47/50 (94%)

Reference adapter (publicly available; use to skip training and reproduce only the eval): https://huggingface.co/ying2022/qwen3-6-35b-xlam-tools-lora

Directory Structure

3.test_cases/
└── megatron/
    └── megatron-lm/
        └── qwen36-moe-lora/
            ├── README.md
            ├── .gitignore
            ├── src/                       # platform-agnostic Python (training, eval, export)
            │   ├── __init__.py
            │   ├── prep_xlam_dataset.py
            │   ├── xlam_runner.py
            │   ├── export_lora_adapter.py
            │   └── eval_function_calling.py
            ├── docs/                      # platform-agnostic docs
            │   ├── PERFORMANCE.md
            │   ├── EVALUATION.md
            │   └── TROUBLESHOOTING.md
            └── kubernetes/
                ├── manifests/             # 8 envsubst-driven YAML templates
                └── scripts/               # env.example + 9 numbered shell scripts

Layout follows the canonical <framework>/<library>/<model>/{kubernetes,...} structure documented in .github/PULL_REQUEST_TEMPLATE.md. Platform-agnostic Python and docs sit above kubernetes/ so a future Slurm or HyperPod-Slurm path can be added without relocating any code.

Checklist

I have read the contributing guidelines.
I am working against the latest main branch.
I have searched existing open and recently merged PRs to confirm this is not a duplicate.
The contribution is self-contained with documentation and scripts.
External dependencies are pinned to a specific version or tag (no latest).
A README is included or updated with prerequisites, instructions, and known issues.
New test cases follow the expected directory structure.

First MoE PEFT test case under 3.test_cases/megatron/megatron-lm/kubernetes/. Exercises Megatron-Bridge 0.4 expert parallelism (EP=4), a LoRA rank-64 fine-tune on xLAM-60k for structured tool-call generation, adapter export to HF PEFT format, vLLM serving, and a 2-gate evaluation (hand-crafted prompts + LLM-as-judge via Bedrock). Tested on 2x ml.p5e.48xlarge (16x H200). Reference results: base 6/10 -> LoRA 9/10 on 10 hand-crafted function-calling prompts; LoRA wins 47/50 (94%) on 50 xLAM val prompts. Reference adapter: ying2022/qwen3-6-35b-xlam-tools-lora

KeitaW

Review batch 1/5 — Overview, Structure & Deployment Pipeline

Posting a multi-batch themed review. Each batch covers one category; inline comments and cross-cutting body findings are split across batches for readability. Final batch will close with kudos and source citations.

Summary

This is a thorough, self-contained Kubernetes test case that wires Megatron-Bridge's expert-parallelism dimension end-to-end (convert → train → export → serve → eval). The structure is clean — platform-agnostic Python under src/, manifests + scripts under kubernetes/, three substantive docs covering performance, evaluation, and troubleshooting.

Three classes of issue compound, and they need to be unpicked in order:

1. Evaluation methodology validity. Gate 2 evaluates on a local random partition of xLAM-60k — but Salesforce ships xLAM-60k with no upstream train/val split, so the "validation" set shares ~100% of schemas with the training partition. It's structurally test-on-train. Single-judge LLM-as-judge for tool-use specifically agrees with humans at only κ=0.34–0.57 (vs. MT-Bench's κ≈0.80 for open-text); N=50 with no CI puts the 94% headline in a Wilson [83.8%, 97.9%] band before adding judge-run variance and self-preference bias on top. The base 6/10 Gate-1 score is also implausibly low vs. Qwen3-30B-A3B's documented 69% BFCL proxy — strongly suggesting harness misconfig is degrading the base, in which case the LoRA's "+30 pp gain" is partly the adapter compensating for the harness. See the new "Evaluation Methodology Validity" section.

2. Implementation bugs on the path from "training finished" to the published numbers. Four issues — vLLM's Hermes parser fighting Qwen3.6's qwen3_coder XML format, str(None) landing in the judge prompt, freeze_* flags not round-tripping between training and export, and strict=False masking adapter-load mismatches — each independently produce numbers that look reasonable without measuring what the README claims.

3. Placement, operational, and infrastructure polish. The contribution lives under megatron-lm/ but uses Megatron-Bridge (PR #1065 establishes megatron-bridge/ as the canonical home); FSx reclaimPolicy: Delete is a data-loss footgun; a handful of K8s manifest tightenings (livenessProbe, conflicting CUDA allocator flags, security context, parser-flag fix).

If the harness misconfig (#2) is fixed first and the eval design (#1) is then upgraded to BFCL v4 / MTU-Bench / τ²-bench with a multi-judge panel and contamination audit, the numbers would mean what the README claims. The current 94% is best framed as a smoke test, not a capability claim.
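
For reference, the Wilson band quoted in item 1 can be reproduced with a few lines. This is a minimal sketch of the standard Wilson score interval at z = 1.96; the function name is mine and is not part of the harness.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Gate 2 headline: 47 LoRA wins out of 50 prompts.
low, high = wilson_interval(47, 50)
print(f"{low:.1%} .. {high:.1%}")
```

The interval covers sampling error only. Judge-run variance and self-preference bias come on top of it.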

Structure & Repository Hygiene

This test case uses Megatron-Bridge, not Megatron-LM — consider placing it under `megatron-bridge/`

The contribution lives at 3.test_cases/megatron/megatron-lm/qwen36-moe-lora/, but the runtime stack is Megatron-Bridge 0.4.0rc0 + Megatron-Core 0.17.0rc0 (per docs/PERFORMANCE.md:6), not Megatron-LM. The training driver imports from megatron.bridge.* (e.g. xlam_runner.py:26-30: megatron.bridge.recipes, megatron.bridge.training.finetune), and the README's first sentence describes it as "the first MoE PEFT recipe under 3.test_cases/megatron/megatron-lm/ and the first to exercise Megatron-Bridge 0.4's expert_model_parallel_size dimension end-to-end" — which itself reads as an acknowledgement that the placement is awkward.

There's a parallel in-flight PR — #1065 "Adding a megatron-bridge sample" — that's establishing 3.test_cases/megatron/megatron-bridge/ as the canonical home for Megatron-Bridge test cases (it adds aws-megatron-bridge.Dockerfile, a kubernetes/ subdir, and a README at that path). Landing this PR under megatron-lm/ while #1065 lands megatron-bridge/ would split Megatron-Bridge contributions across two libraries and almost certainly trigger a rename PR within a release.

I'd suggest moving this contribution to 3.test_cases/megatron/megatron-bridge/qwen36-moe-lora/ to align with #1065, and coordinating merge order with #1065's author so the directory exists when this lands.

A rename is mechanically cheap right now (28 files, all new) and saves a future git mv PR plus broken doc links from anyone who bookmarks the current path.

Deployment Pipeline (K8s / Slurm)

Only `4.train.sh` refreshes the ConfigMap — re-running prep/export/eval silently uses stale source

Scripts 2.prep-dataset.sh, 5.export-adapter.sh, and 7.run-eval.sh kubectl apply their manifests but never refresh the ConfigMap, so they execute whatever 0.setup-storage.sh baked in at setup time. A user iterating on eval_function_calling.py (the parser regex, Gate 1 prompt set, or judge prompt) and re-running only 7.run-eval.sh will get the original version's output — same for prep_xlam_dataset.py (script 2) and export_lora_adapter.py (script 5).

I'd suggest extracting the configmap-refresh block from 0.setup-storage.sh:24 into a small kubernetes/scripts/_lib.sh and sourcing it at the start of every numbered script. That way 4.train.sh's existing refresh becomes the standard, not the exception.

Script and manifest numbering are off by one

A reader stepping through this test case has to keep two parallel numbering schemes in their head: scripts are numbered 0.setup-storage → 7.run-eval, but the manifests they apply are storage + 0.precache-weights → 6.eval-job. The mapping is:

Script	Manifest
`0.setup-storage.sh`	`storage.yaml-template`
`1.precache-weights.sh`	`0.precache-weights.yaml-template`
`2.prep-dataset.sh`	`1.prep-dataset.yaml-template`
`3.convert-to-bridge.sh`	`2.convert-to-bridge.yaml-template`
`4.train.sh`	`3.pytorchjob-train.yaml-template`
`5.export-adapter.sh`	`4.export-adapter.yaml-template`
`6.deploy-inference.sh`	`5.inference-vllm.yaml-template`
`7.run-eval.sh`	`6.eval-job.yaml-template`

I'd suggest renaming the manifests to match their driving script (0.setup-storage.yaml-template, 1.precache-weights.yaml-template, …) so the indices line up. Avoids the off-by-one when someone debugs by eyeballing kubectl get pods against the manifest tree, and lets a future contributor insert a step without re-numbering both halves separately.

KeitaW

Review batch 2/5 — Training & Evaluation Code Quality

7 inline findings on the path from "training finished" to the published numbers. Each is anchored to a specific line via an inline comment with a one-click suggestion block where the fix is a direct line replacement.

Most consequential are the four that compound to distort the headline Gate 2 result: vLLM's Hermes parser fighting Qwen3.6's qwen3_coder XML, str(None) landing in the judge prompt, freeze_* flags not round-tripping between training and export, and strict=False masking adapter-load mismatches. They each independently produce numbers that look reasonable without measuring what the README claims.

KeitaW · 2026-05-16T00:30:05Z

+                --enable-auto-tool-choice \
+                --tool-call-parser hermes \


vLLM's Hermes parser and the harness's XML parser are fighting over the same bytes

File: kubernetes/manifests/5.inference-vllm.yaml-template (lines 56-57), cross-referenced in src/eval_function_calling.py (lines 18-21, 42-65)

There are two tool-call parsers in this pipeline, and they expect different formats:

vLLM's --tool-call-parser flag selects which server-side parser interprets the model's raw output and surfaces structured tool_calls on the OpenAI-compatible response. The hermes parser specifically is built for the Hermes-2-Pro / NousResearch family, which emits JSON inside <tool_call> tags:

<tool_call>
{"name": "search_flights", "arguments": {"origin": "JFK"}}
</tool_call>

With --enable-auto-tool-choice --tool-call-parser hermes active, vLLM intercepts the stream, tries to extract that JSON, surfaces it as a structured tool_calls field on the response, and strips the wrapper from content in the process.

Qwen3.6 emits a different format — nested XML, not JSON — and the harness comment at eval_function_calling.py:18-21 plus docs/EVALUATION.md both call this out:

<tool_call>
<function=search_flights>
<parameter=origin>JFK</parameter>
</function>
</tool_call>

The harness knows about the mismatch and has its own parse_qwen_tool_call that regex-matches the XML out of msg.content, ignoring msg.tool_calls entirely. So in principle the Hermes parser is just inert here — but in practice it isn't, because it still rewrites content even when it fails to find valid JSON. What gets left depends on vLLM patch version and streaming mode, anywhere from:

Best case: Hermes finds no JSON, leaves content untouched → harness regex works.

Middle case: Hermes strips the outer <tool_call>...</tool_call> wrapper but leaves the inner <function=...>...</function> → harness regex still matches.

Bad case: Hermes consumes part of <function= while trying to interpret it as JSON, leaves a malformed fragment → harness regex fails → returns None → scored as "no tool called".

If the bad case ever triggers, scores can fall as low as 1/10 in the worst case — the logical floor derived from the harness's score_response function (eval_function_calling.py:1804-1827) plus the GATE1_PROMPTS list: when parse_qwen_tool_call returns None for every prompt, right_name evaluates True only when expected_tool is None, which holds for exactly one of the ten prompts (#9, "ambiguous - no tool"). Both base and LoRA would degrade the same way under this floor, so the LoRA-vs-base delta survives but the absolute scores collapse. vLLM logs no error. A 9 → 1 collapse across a parser-version bump would look identical to "the adapter is broken" and get misdiagnosed as a real regression. This is the worst case of a distribution of behaviors (best case: parser gives up, harness regex still works, scores unchanged); the unverified part is which patch of vLLM 0.10.x lands where in that distribution.
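
The three cases and the 1/10 floor can be sketched as below. `parse_tool_call` is a hypothetical stand-in for the harness's `parse_qwen_tool_call` (whose body is not quoted in this review), and it uses a slightly wider name pattern than the harness's `\w+`. The "bad" string is an assumed example of a mangled fragment, not captured vLLM output.

```python
import re

# Hypothetical stand-in for the harness parser; patterns are assumptions.
_FUNCTION_RE = re.compile(r"<function=([\w.\-]+)>(.*?)</function>", re.S)
_PARAM_RE = re.compile(r"<parameter=([\w.\-]+)>(.*?)</parameter>", re.S)


def parse_tool_call(content):
    """Return {"name", "arguments"} for the first XML tool call, else None."""
    match = _FUNCTION_RE.search(content or "")
    if match is None:
        return None
    arguments = {k: v.strip() for k, v in _PARAM_RE.findall(match.group(2))}
    return {"name": match.group(1), "arguments": arguments}


inner = "<function=search_flights> <parameter=origin>JFK</parameter> </function>"
cases = {
    "best": f"<tool_call> {inner} </tool_call>",   # content untouched
    "middle": inner,                               # outer wrapper stripped
    "bad": inner[len("<function"):],               # opening tag partly consumed
}
parsed = {name: parse_tool_call(text) for name, text in cases.items()}

# Score floor: if every parse returns None, only prompts that expect
# "no tool" can pass. Nine placeholder names plus one no-tool prompt.
expected_tools = ["some_tool"] * 9 + [None]
floor = sum(1 for expected in expected_tools if expected is None)
```

In this sketch the best and middle cases parse identically, the bad case yields `None`, and the floor comes out at one prompt in ten.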

The fix is to switch vLLM to its built-in parser for exactly this format — vLLM ships qwen3_coder as a --tool-call-parser value specifically for Qwen3.6's coder XML grammar (per the Qwen3.6-35B-A3B HF model card and listed in vLLM's supported parsers registry). The harness can then stop fighting vLLM for content and just consume msg.tool_calls cleanly.

Suggested change

  --enable-auto-tool-choice \
- --tool-call-parser hermes \
+ --tool-call-parser qwen3_coder \

(replaces --tool-call-parser hermes; keep --enable-auto-tool-choice.) If qwen3_coder isn't available in the pinned vLLM 0.10.2 image — verify with vllm serve --help | grep tool-call-parser — drop both flags entirely instead, since the harness's regex-against-raw-content fallback is correct for the qwen3_coder XML format. Either way the symptom path (silent decode failures degrading Gate 1) closes.

KeitaW · 2026-05-16T00:30:05Z

+            base_out = base_msg.get("content") or str(base_msg.get("tool_calls"))
+            lora_out = lora_msg.get("content") or str(lora_msg.get("tool_calls"))


Gate 2 stringifies missing model output as the literal string "None"

File: src/eval_function_calling.py (lines 301-302)

base_out = base_msg.get("content") or str(base_msg.get("tool_calls"))
lora_out = lora_msg.get("content") or str(lora_msg.get("tool_calls"))

dict.get("tool_calls") returns None when absent, and str(None) == "None". So when a model returns {"content": "", "tool_calls": None} — which is what happens when vLLM's parser consumes XML and leaves nothing useful (see the Hermes-parser finding above) — base_out becomes the four-character string "None". That literal is then placed into the judge prompt as OUTPUT A:\nNone, and the judge compares "None" against whatever the other model produced. Combined with the rolled-up gate2_win_rate = wins / total, this directly affects the headline 94% number.

I'd suggest an explicit serializer that distinguishes empty from absent:

Suggested change

- base_out = base_msg.get("content") or str(base_msg.get("tool_calls"))
- lora_out = lora_msg.get("content") or str(lora_msg.get("tool_calls"))
+ base_out = _stringify(base_msg)
+ lora_out = _stringify(lora_msg)

And add the helper near the top of the file:

def _stringify(msg: dict) -> str:
    tc = msg.get("tool_calls")
    if tc:
        return json.dumps(tc)
    return msg.get("content") or "<empty>"
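
To make the difference concrete, here is a small side-by-side of the current expression and the suggested helper on an empty response. The message dicts are made-up examples in the shape the harness reads, not captured model output.

```python
import json


def stringify_original(msg):
    # Pattern from the PR: falls through to str(None) when both fields are empty.
    return msg.get("content") or str(msg.get("tool_calls"))


def stringify_explicit(msg):
    # Suggested helper: structured calls first, then content, then a marker.
    tool_calls = msg.get("tool_calls")
    if tool_calls:
        return json.dumps(tool_calls)
    return msg.get("content") or "<empty>"


empty_msg = {"content": "", "tool_calls": None}
call_msg = {
    "content": "",
    "tool_calls": [{"name": "search_flights", "arguments": {"origin": "JFK"}}],
}

print(repr(stringify_original(empty_msg)))   # the literal string 'None'
print(repr(stringify_explicit(empty_msg)))   # '<empty>'
print(stringify_explicit(call_msg))          # JSON the judge can read
```

The explicit marker also makes empty outputs easy to count before they reach the judge.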

KeitaW · 2026-05-16T00:30:05Z

+        allowed = {"target_modules", "dim", "alpha", "dropout",
+                   "dropout_position",
+                   "freeze_language_model", "freeze_vision_model",
+                   "freeze_vision_projection"}


freeze_* flags are written on cfg.model during training but read from the peft section on export — the round trip is broken

File: src/xlam_runner.py (lines 87-88) vs src/export_lora_adapter.py (lines 123-126)

Training sets:

cfg.model.freeze_vision_model = True
cfg.model.freeze_vision_projection = True

Export's allowlist reads these keys from run_cfg_dict.get("peft", {}):

allowed = {"target_modules", "dim", "alpha", "dropout",
           "dropout_position",
           "freeze_language_model", "freeze_vision_model",
           "freeze_vision_projection"}

read_run_config will never produce these under peft — they live under model. So peft_cfg ends up empty for the freeze flags, and peft_class(**peft_cfg) instantiates LoRA with the class's default freeze behavior. If those defaults differ from what training applied, the export attaches LoRA modules to layers training never touched (or skips layers training did), and the next finding (strict=False) makes the mismatch silent.

I'd suggest either:

Drop the freeze_* keys from export's allowed set entirely and infer the PEFT class (LoRA vs VLMLoRA) from training's model config only, or

Persist freeze flags into run_config.yaml's peft section in training (probably the cleaner fix since VLMLoRA is where they actually live in Megatron-Bridge).
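
A sketch of how the flags get lost, plus a cheap check the export script could run before instantiating the PEFT class. The `run_cfg` layout is an assumed shape for what `read_run_config` returns, and `missing_freeze_flags` is a hypothetical helper. The rank, alpha, and target modules are taken from the PR description.

```python
ALLOWED = {
    "target_modules", "dim", "alpha", "dropout", "dropout_position",
    "freeze_language_model", "freeze_vision_model", "freeze_vision_projection",
}
FREEZE_KEYS = {k for k in ALLOWED if k.startswith("freeze_")}

# Assumed shape: training wrote the freeze flags under "model", not "peft".
run_cfg = {
    "model": {"freeze_vision_model": True, "freeze_vision_projection": True},
    "peft": {"target_modules": ["linear_qkv", "linear_proj"], "dim": 64, "alpha": 128},
}


def peft_kwargs_as_exported(cfg):
    """Mirror the export allowlist: only keys found under cfg['peft'] survive."""
    return {k: v for k, v in cfg.get("peft", {}).items() if k in ALLOWED}


def missing_freeze_flags(cfg):
    """Freeze flags set on the model section that never reach the PEFT kwargs."""
    kwargs = peft_kwargs_as_exported(cfg)
    model_cfg = cfg.get("model", {})
    return sorted(k for k in FREEZE_KEYS if k in model_cfg and k not in kwargs)


peft_cfg = peft_kwargs_as_exported(run_cfg)
lost = missing_freeze_flags(run_cfg)
```

If `lost` is non-empty at export time, raising an error there would surface the drift regardless of which of the two fixes is chosen.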

KeitaW · 2026-05-16T00:30:05Z

+        else next(k for k in loaded_sd if k.startswith("model"))
+    )
+    adapter_sd = loaded_sd[model_section_key]
+    model[0].load_state_dict(adapter_sd, strict=False)


model[0].load_state_dict(adapter_sd, strict=False) swallows every adapter-shape mismatch silently

File: src/export_lora_adapter.py (line 153)

model[0].load_state_dict(adapter_sd, strict=False)

dist_checkpointing.load(..., strict=StrictHandling.LOG_UNEXPECTED) only logs unexpected keys — it doesn't raise. Then strict=False here additionally swallows missing keys. Combined with the freeze-flag mismatch above, the script can complete successfully while having attached LoRA wrappers to one set of modules at export time and loaded weights for a different set from the training checkpoint.

I'd suggest an explicit invariant check immediately before the load, and flip to strict=True:

expected = {
    k for k, _ in model[0].named_parameters()
    if "lora" in k.lower() or "adapter" in k.lower()
}
got = set(adapter_sd.keys())
if expected != got:
    raise RuntimeError(
        f"adapter key mismatch:\n  missing: {sorted(expected - got)[:10]}\n  extra: {sorted(got - expected)[:10]}"
    )
model[0].load_state_dict(adapter_sd, strict=True)

That way, the export job fails loudly on round-trip drift instead of producing a partially-correct artifact.
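
The same invariant can be exercised without loading a model, which makes it easy to unit-test. This is a framework-free sketch: `check_adapter_keys` is a hypothetical helper and the key names are invented for illustration, not real Megatron-Bridge parameter names.

```python
def check_adapter_keys(expected, got, limit=10):
    """Raise if the adapter keys built at export differ from the checkpoint's."""
    expected, got = set(expected), set(got)
    if expected != got:
        missing = sorted(expected - got)[:limit]
        extra = sorted(got - expected)[:limit]
        raise RuntimeError(
            f"adapter key mismatch:\n  missing: {missing}\n  extra: {extra}"
        )


# Invented key names, for illustration only.
export_keys = {"layers.0.linear_qkv.adapter.weight", "layers.0.linear_proj.adapter.weight"}
ckpt_keys = {"layers.0.linear_qkv.adapter.weight", "vision.proj.adapter.weight"}

check_adapter_keys(export_keys, export_keys)  # identical sets pass silently

try:
    check_adapter_keys(export_keys, ckpt_keys)
    mismatch_message = None
except RuntimeError as err:
    mismatch_message = str(err)
```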

KeitaW · 2026-05-16T00:30:05Z

+    cfg = qwen35_vl_35b_a3b_peft_config(
+        peft_scheme="lora",
+        hf_path=hf_model_id,
+    )


Reusing qwen35_vl_35b_a3b_peft_config — recipe choice is defensible; freeze-flag and HBM impact still need verification

File: src/xlam_runner.py (lines 26, 74-77, 87-88)

Update: Qwen/Qwen3.6-35B-A3B is a real, distinct HF checkpoint (released April 2026) and it is a multimodal VLM with Gated DeltaNet linear attention interleaved with standard attention and a 262K context. So the recipe import (qwen35_vl_35b_a3b_peft_config) targeting a VL architecture is defensible — my earlier suggestion that this was a text-only model was wrong.

Three follow-up questions still worth answering before merge, given the VL architecture:

HBM cost of the frozen vision tower: freeze_vision_model=True stops gradients but doesn't remove the tower from the parameter graph. The PERFORMANCE.md memory table (~56-68 GB / 141 GB) doesn't break out vision-tower residency. Worth measuring how much of that is the (frozen, unused) vision encoder — if it's >5 GB per rank, removing it from instantiation would free real headroom for MICRO_BS increases.

gpt_forward_step against a VL model: the runner passes the GPT step to finetune(). Does Megatron-Bridge's gpt_forward_step know to skip the vision-encoder forward when called against a VL-shaped model, or does it iterate over the tower with no inputs? Worth a brief comment in the runner pointing to whichever upstream invariant guarantees this works.

Freeze-flag round-trip: the related finding above on freeze_* keys mismatched between training (cfg.model) and export (cfg.peft) — that bug is more acute on a VL model because the export-time class choice (LoRA vs VLMLoRA) actually changes which modules get LoRA wrappers.

Not blocking. Just capture the answers in TROUBLESHOOTING.md so a future Megatron-Bridge release doesn't silently break this path.

KeitaW · 2026-05-16T00:30:05Z

+
+    # 1. Build the HF bridge config (no weights yet).
+    print_rank_0(f"[export] Building AutoBridge from {args.hf_model}")
+    bridge = AutoBridge.from_hf_pretrained(args.hf_model, trust_remote_code=True)


trust_remote_code=True set unconditionally without a rationale

File: src/export_lora_adapter.py (line 72)

bridge = AutoBridge.from_hf_pretrained(args.hf_model, trust_remote_code=True)

trust_remote_code=True is required for Qwen3 modelling code, so the flag itself is fine — but it executes arbitrary Python from the Hub at load time, and args.hf_model is user-controlled, so it's worth a one-line comment so future readers (and security audits) understand the choice is deliberate.

Suggested change

- bridge = AutoBridge.from_hf_pretrained(args.hf_model, trust_remote_code=True)
+ # trust_remote_code is required for the Qwen3 modelling code shipped with the Hub repo.
+ bridge = AutoBridge.from_hf_pretrained(args.hf_model, trust_remote_code=True)

KeitaW · 2026-05-16T00:30:05Z

+# vLLM's built-in --tool-call-parser hermes expects JSON-in-XML instead, so we
+# parse the XML ourselves rather than relying on msg.tool_calls being populated.
+# ---------------------------------------------------------------------------
+_TOOL_CALL_RE = re.compile(r"<function=(\w+)>(.*?)</function>", re.S)


_TOOL_CALL_RE excludes function names with - or .

File: src/eval_function_calling.py (line 42)

_TOOL_CALL_RE = re.compile(r"<function=(\w+)>(.*?)</function>", re.S)

\w is [A-Za-z0-9_] — names containing - (search-flights) or . (namespace.method) won't match. The Gate 1 prompts all use snake_case so they're fine, but the Gate 2 xLAM validation split has 3,673 unique tool schemas and almost certainly contains some with - or . in the name. Those rows would be silently scored as "no tool emitted" regardless of model output.

Suggested change

- _TOOL_CALL_RE = re.compile(r"<function=(\w+)>(.*?)</function>", re.S)
+ _TOOL_CALL_RE = re.compile(r"<function=([\w.\-]+)>(.*?)</function>", re.S)
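
A quick check of both patterns against the three naming styles mentioned above. The function names are examples only; I have not confirmed which hyphenated or dotted names actually occur in the xLAM split.

```python
import re

NARROW = re.compile(r"<function=(\w+)>(.*?)</function>", re.S)
WIDE = re.compile(r"<function=([\w.\-]+)>(.*?)</function>", re.S)

names = ["search_flights", "search-flights", "namespace.method"]


def matched_name(pattern, name):
    """Return the captured function name, or None if the pattern misses."""
    match = pattern.search(f"<function={name}></function>")
    return match.group(1) if match else None


narrow_hits = [matched_name(NARROW, n) for n in names]
wide_hits = [matched_name(WIDE, n) for n in names]
print(narrow_hits)
print(wide_hits)
```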

KeitaW

Review batch 3/5 — Evaluation Methodology Validity

This batch is methodology-level rather than implementation-level. The previous batch covered implementation bugs in the eval code; this one is about whether the eval design — even if implemented bug-free — actually measures what the README claims. The short answer from a literature pass on current (2025-2026) function-calling eval practice is: the design has four independent issues that compound, and the headline "Base 6/10 → LoRA 9/10, Gate 2 94%" is not interpretable as written.

Gate 2 evaluates on a slice of its own training distribution

Salesforce ships xLAM-60k as a single 60,000-row file with no upstream train/val split. The PR's "2% validation slice" is a local random partition created at training time — prep_xlam_dataset.py:2363-2368 does random.seed(42); random.shuffle(rows) then takes the first 2% as validation, no stratification. The synthesis pipeline behind xLAM is described in the APIGen / xLAM paper (Liu et al., arXiv:2406.18518).

Measured overlap (replicating the PR's exact seed=42 partition). "Schema" here means the called function — i.e. the assistant emits tool_calls[*].function.name and that's what the model is being asked to produce. I measured the overlap under three different precisions of that definition, since the right grain is non-obvious:

Definition of "schema"	Distinct in val	Novel to val	Val-side overlap
Function name alone (e.g. `search_flights`)	999	2	99.80%
Function (name, full parameter signature) — strictest definition	1,071	2	99.81%
Full available-tools bundle shown in the prompt (the model picks one)	1,080	70	93.52%

The first two collapse onto each other because 91.8% of function names in xLAM have exactly one canonical parameter signature — name is the schema for most of the catalog. The exceptions are 15 overloaded names; the worst case is search with 77 distinct signatures (reused across book/flight/web/etc. APIs). For the other 3,310 names, name uniquely determines the signature.

The third row (bundles) is lower because bundles are combinatorial: even when every individual tool has been seen, a specific combination of N tools shown together in one prompt can be novel. 70 of 1,080 val bundles are novel combinations — but only 2 of those rows test a called function name the model has never seen called.

The load-bearing number for the Gate 2 claim is the first row — 99.80%. Gate 2 asks "given this user query and these N tools, emit the right call"; the model's ability to do that depends on whether the target function was seen during training, not whether the precise tools-list combination was. The 6.5% bundle-novelty signal is the only dimension along which Gate 2 has any held-out content at all, and it's the weakest dimension on which to claim "the LoRA learned function-calling as a skill".
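
The measurement can be sketched as follows, assuming the partition works as described (seeded shuffle, first 2% as validation). `split_and_measure` and the toy rows are hypothetical. To reproduce the table, swap the toy rows for the real dataset and change `key` to match each definition of "schema".

```python
import random


def split_and_measure(rows, key, val_frac=0.02, seed=42):
    """Shuffle with a fixed seed, take the head as val, count keys unseen in train."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)  # private RNG, global state untouched
    n_val = int(len(rows) * val_frac)
    val, train = rows[:n_val], rows[n_val:]
    train_keys = {key(r) for r in train}
    val_keys = {key(r) for r in val}
    novel = val_keys - train_keys
    overlap = 1 - len(novel) / len(val_keys) if val_keys else float("nan")
    return {
        "distinct_in_val": len(val_keys),
        "novel_to_val": len(novel),
        "overlap": overlap,
    }


# Toy data: 20 placeholder functions with 50 rows each (not xLAM).
rows = [{"function": f"fn_{i % 20}"} for i in range(1000)]
stats = split_and_measure(rows, key=lambda r: r["function"])
```

On the toy data the validation slice holds 20 rows, so no function with 50 rows can be absent from train and the overlap is necessarily 100%. A milder version of the same mechanism is at work in the real split.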

Distribution shape (controls how the overlap behaves under different seeds): mean 26.2 rows per called function, median 15, singletons (1 row only) 89 functions = 2.8% of the catalog, rare (≤5 rows) 17.8%, common (≥50 rows) 8.1%. Heaviest function search appears in 1,469 rows; P99 occurrence count is 271. The distribution is heavily long-tailed but with a thinner singleton tail than I'd assumed — better for the model (more rows per function = stronger lexical learning) and worse for the eval (val rows are almost guaranteed to test functions the model saw 10+ times in training).

Why 99.80% overlap means the val isn't measuring generalization: the 2 novel-name functions (of 999) carry almost no statistical weight in a 50-prompt Gate 2 sample. The expected number of novel-function prompts in those 50 rows is 50 × 2/999 ≈ 0.1, so the probability that the sample contains none at all is (1 − 2/999)^50 ≈ 90%. In roughly nine draws out of ten, the entire Gate 2 sample comes from functions the model trained on, and each of those functions had ~15-26 sibling training examples teaching the schema-to-prompt mapping. That's in-distribution recall of learned associations, not generalization.
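
The arithmetic, under the simplifying assumption that each of the 50 prompts independently hits a novel function with probability 2/999:

```python
novel_functions = 2
distinct_functions = 999
sample_size = 50

p_novel = novel_functions / distinct_functions
expected_novel = sample_size * p_novel        # expected novel-function prompts
p_none_novel = (1 - p_novel) ** sample_size   # chance the sample has none

print(f"expected novel prompts: {expected_novel:.2f}")
print(f"P(no novel prompt in sample): {p_none_novel:.1%}")
```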

A concrete example. Take the function calculate_investment_return, which appears in 318 training rows and 8 val rows under the seed=42 partition. The schema definition presented to the model is byte-identical (signature hash 1bc0fa30ea8e) across all 326 rows — same three parameters (initial_amount: integer, interest_rate: number, num_years: integer), all three required, no additionalProperties. Two training rows the model sees:

"Determine the final amount from an initial investment of $15000 at 6% interest over 15 years." →
calculate_investment_return({"initial_amount": 15000, "interest_rate": 0.06, "num_years": 15})

"Calculate the investment return for an initial deposit of $5000 at an annual interest rate of 3% over 10 years." →
calculate_investment_return({"initial_amount": 5000, "interest_rate": 0.03, "num_years": 10})

Two val rows it's then evaluated on:

"What will $9000 become after 14 years at a 6.5% annual interest rate?" →
calculate_investment_return({"initial_amount": 9000, "interest_rate": 0.065, "num_years": 14})

"Suppose I invest $5,000 at an interest rate of 4% annually for 10 years. What will be the total amount after the investment period?" →
calculate_investment_return({"initial_amount": 5000, "interest_rate": 0.04, "num_years": 10})

The val prompts are new wordings — they're not byte-identical to any training row. But the task is the one the model drilled on 318 times: extract principal + rate + duration from natural language, convert percent to decimal, emit the three named integers/numbers. A val row that hits this schema is testing whether the LoRA can apply a pattern it was explicitly trained on, paraphrased once more. That's a fine smoke test of "did training take" but it doesn't tell you whether the model can call functions it hasn't been trained on. This pattern repeats for all 997 of the 999 functions present in val.

The contrast that would measure generalization: hand-author a new schema like compute_loan_payoff(principal, apr, term_months) — never in xLAM, semantically adjacent — and check whether the LoRA can pick it from a tools list and fill its arguments. If the LoRA collapses there, it learned the xLAM catalog, not the skill of function-calling. That's the canary-set check the methodology section recommends.
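
A canary entry along those lines might look like the sketch below. Only the function name and its three parameter names come from the paragraph above. The description, the parameter types, and the two helper functions are assumptions made for illustration.

```python
# OpenAI-style tool schema for the canary; types and description are assumed.
canary_tool = {
    "type": "function",
    "function": {
        "name": "compute_loan_payoff",
        "description": "Total amount repaid on a loan over its full term.",
        "parameters": {
            "type": "object",
            "properties": {
                "principal": {"type": "number"},
                "apr": {"type": "number"},
                "term_months": {"type": "integer"},
            },
            "required": ["principal", "apr", "term_months"],
        },
    },
}


def is_held_out(tool, training_function_names):
    """A canary only counts if its name never appears in the training catalog."""
    return tool["function"]["name"] not in training_function_names


def call_matches(parsed_call, tool):
    """Right name and every required argument present."""
    fn = tool["function"]
    if parsed_call is None or parsed_call.get("name") != fn["name"]:
        return False
    return all(k in parsed_call.get("arguments", {}) for k in fn["parameters"]["required"])


held_out = is_held_out(canary_tool, {"calculate_investment_return", "search"})
good_call = {"name": "compute_loan_payoff",
             "arguments": {"principal": 20000, "apr": 0.07, "term_months": 60}}
wrong_call = {"name": "calculate_investment_return", "arguments": {}}
```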

The Hammer paper from MadeAgents (Lin et al., arXiv:2410.04587) documents exactly this failure mode and identifies xLAM-trained models as the leading example. From §1 of the paper: "models tend to perform well on benchmarks that closely align with the naming conventions present in the training data but suffer notable performance declines when encountering benchmarks with differing naming styles." Table 1 of that paper makes the case empirically — xLAM-7B-fc is the best-performing model on BFCL (79.41) but the worst on overall cross-benchmark average (69.05), collapsing to 57.5 on NexusRaven and 59.0 on Tool-Alpaca. The Hammer authors interpret this as the model learning the xLAM catalog's specific naming conventions rather than the semantic mapping from user intent → parameter description → call, and their proposed fix ("function masking") replaces function names with random placeholders during training to force semantic learning. An eval drawn from the same naming distribution as training rewards the brittle pattern that fails cross-benchmark.
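
To illustrate the idea behind function masking, here is a minimal sketch that swaps each function name for a random placeholder consistently across the tools list and the target call. It is my own simplification, not the Hammer authors' implementation, and the flat `{"name", "description"}` row layout is an assumption.

```python
import random
import string


def mask_function_names(tools, calls, rng):
    """Replace every function name with a random placeholder, consistently."""
    mapping = {}
    for index, tool in enumerate(tools):
        suffix = "".join(rng.choice(string.ascii_lowercase) for _ in range(8))
        mapping[tool["name"]] = f"fn_{index}_{suffix}"  # index keeps names unique
    masked_tools = [dict(tool, name=mapping[tool["name"]]) for tool in tools]
    masked_calls = [dict(call, name=mapping[call["name"]]) for call in calls]
    return masked_tools, masked_calls


tools = [
    {"name": "calculate_investment_return", "description": "Compute the final amount of an investment."},
    {"name": "search", "description": "Search a catalog."},
]
calls = [{"name": "calculate_investment_return",
          "arguments": {"initial_amount": 9000, "interest_rate": 0.065, "num_years": 14}}]

masked_tools, masked_calls = mask_function_names(tools, calls, random.Random(0))
```

With the name gone, the only remaining signal for picking the right tool is the description and parameter schema.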

For context: published xLAM fine-tunes typically score 78-88% on BFCL v2 Live (a genuinely held-out, AST-scored benchmark, per the Berkeley Function Calling Leaderboard). The PR's 94% on a self-split isn't comparable to those numbers; the first 50 rows of xLAM's local validation slice are the worst-fit option among current function-calling benchmarks. Replacement recommendations below.

The 6/10 base Gate-1 score is implausibly low — most likely cause is harness misconfiguration

Qwen3-30B-A3B (the closest publicly-benchmarked proxy for Qwen3.6-35B-A3B as of 2026-05-15) scores ~69% on BFCL v2 Live overall per the Berkeley leaderboard. The PR's base Gate-1 result of 6/10 = 60% on 10 hand-crafted prompts is below that — on a less standardized, often easier prompt set. The numbers don't add up.

The most likely cause is harness misconfiguration biasing the base score downward. Candidates in decreasing likelihood:

Hermes parser fighting Qwen3.6's qwen3_coder XML output (covered separately above).
Thinking-mode <think>...</think> tokens not stripped before regex-parsing tool calls. The harness already documents the fix (enable_thinking=False) in TROUBLESHOOTING.md; worth confirming it's actually being passed at request time, not just documented. See the Qwen3.6-35B-A3B model card for the documented behavior.
Chat template mismatch — Qwen3.6 needs qwen-3.6.jinja (or equivalent), not the generic Qwen3 template (see the model card's tokenizer_config.json).
Sampling-param mismatch — the Qwen3.6-35B-A3B model card requires presence_penalty=1.5 for instruct mode; PR uses temperature=0.1 only.

If any of these is degrading the base score, the "+30 pp LoRA gain" headline is partly the adapter compensating for the broken harness, not measuring genuine capability improvement. The diagnostic is cheap: re-run Gate-1 against the base alone with the corrected parser + chat template + sampling params. If the score jumps from 6/10 to the 7-9/10 range, harness misconfig is confirmed and the LoRA delta needs to be re-stated against the corrected base.

Single-judge LLM-as-judge with N=50 is below current standards for tool-use evaluation

Two issues that compound:

Judge agreement with humans on tool-use is much lower than the open-text figures suggest. The often-cited "80%+ judge agreement with humans" (Zheng et al. 2023, Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena, arXiv:2306.05685) is for free-text instruction following. For tool-use specifically, single-judge agreement with humans is κ = 0.34–0.57 depending on judge model, with 3-model ensembles landing at κ ≈ 0.43 (AgentProp-Bench, arXiv:2604.16706). That's well below MT-Bench's κ ≈ 0.80. PR #1091 uses a single Claude Opus 4.7 judge with no per-judge κ characterization — its measurement instrument is somewhere in that 0.34–0.57 band on this task class.

N=50 with a single trial leaves the headline poorly constrained. Wilson 95% binomial CI on 47/50 is [83.8%, 97.9%] — 14.2 percentage points wide (Wilson 1927 / Newcombe 1998 for the standard CI formula). Standard practice on BFCL is N ≈ 1,700 per category; PR #1091's N=50 sits in the bottom 5th percentile of current FC eval practice. The McNemar significance test (χ² = 36.98, p ≪ 0.001) confirms the direction of the effect, but the CI width means "94%" is consistent with anywhere from 84% to 98% — the difference between "moderate" and "remarkable" improvement.
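The interval quoted above can be reproduced with the standard Wilson score formula. This is a minimal standard-library sketch; the function name `wilson_interval` is an illustrative choice.

```python
from statistics import NormalDist


def wilson_interval(successes, n, confidence=0.95):
    """Wilson score interval for a binomial proportion."""
    if n <= 0:
        raise ValueError("n must be positive")
    # Two-sided critical value, about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * ((p * (1 - p) / n + z * z / (4 * n * n)) ** 0.5) / denom
    return center - half, center + half


low, high = wilson_interval(47, 50)
print(f"{low:.1%} to {high:.1%}")  # 83.8% to 97.9%
```

Running the same function with a larger N at the same proportion shows how quickly the interval tightens, which is the practical argument for a bigger evaluation set.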

Layer on self-preference bias (Wataoka et al. 2024, Self-Preference Bias in LLM-as-a-Judge, arXiv:2410.21819): frontier judges inflate their own family's outputs, with a TPR-vs-TNR gap of ~52 pp. If Claude Opus 4.7 ever sees outputs that look Claude-shaped (system prompts, formatting choices), it'll prefer them by 10-25% on top of any genuine quality signal.

Drop LLM-as-judge entirely. Function-calling is one of the few NLP tasks where the field has converged on deterministic structural verification as the gold standard — exactly because the failure modes catalogued above (low tool-use κ, single-judge variance, self-preference) make LLM judges the wrong instrument for this task class. The Berkeley Function Calling Leaderboard, τ²-bench, and MTU-Bench all avoid LLM-as-judge by construction — they use AST-equivalence, executable correctness, or decomposed name-vs-parameter matching. That's what the recommended replacement section below specifies; that's the methodology this PR should adopt.

Recommended replacement evaluation scheme

The shortest path to a defensible methodology is to anchor on standard function-calling benchmarks that the field already trusts, then layer the LLM-as-judge work as a secondary qualitative signal — not the headline.

Primary metric — held-out structural correctness:

BFCL v4 (current Berkeley leaderboard release, supersedes v2 Live as of 2026): Live Multiple sub-slice (≈1,053 samples) + Live Irrelevance sub-slice (≈882). AST-scored, temporally separated from xLAM. Combined N ≈ 1,935 — comfortably above any reviewer floor. Source code and harness: github.com/ShishirPatil/gorilla.

Secondary — decomposed name vs. argument accuracy:

MTU-Bench S-S + S-M (431 samples) — reports TS (name) and PS (parameter) separately. This is what Gate-1 was trying to do, done in the standard way the field reports it.
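To make the name-versus-parameter decomposition concrete, here is a small sketch of a deterministic scorer. It is a simplified illustration under stated assumptions (one call per sample, exact equality on parsed arguments), not MTU-Bench's TS/PS code or BFCL's AST checker; `score_call` is a hypothetical name, and the example reuses the `compute_loan_payoff` schema mentioned earlier in this review with made-up argument values.

```python
import json


def score_call(predicted, gold):
    """Compare one predicted call with the gold call.

    Returns (name_ok, params_ok). Arguments are compared as parsed JSON
    values, so key order and whitespace do not matter.
    """
    name_ok = predicted.get("name") == gold.get("name")
    pred_args = predicted.get("arguments", {})
    # Some servers return arguments as a JSON string; normalize to a dict.
    if isinstance(pred_args, str):
        try:
            pred_args = json.loads(pred_args)
        except json.JSONDecodeError:
            return name_ok, False
    # Parameters only count when the right function was selected.
    params_ok = name_ok and pred_args == gold.get("arguments", {})
    return name_ok, params_ok


gold = {"name": "compute_loan_payoff",
        "arguments": {"principal": 250000, "apr": 0.065, "term_months": 360}}
good = {"name": "compute_loan_payoff",
        "arguments": '{"term_months": 360, "apr": 0.065, "principal": 250000}'}
wrong_arg = {"name": "compute_loan_payoff",
             "arguments": {"principal": 250000, "apr": 6.5, "term_months": 360}}
```

Reporting the two booleans separately shows whether a model fails at tool selection or at argument filling, which a single judge verdict hides.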

Generalization check:

BFCL v4 multi-turn / agentic components — verifies the adapter didn't overfit single-turn patterns (leaderboard).

Reliability check:

τ²-bench (Barres et al., arXiv:2506.07982) pass^1 + pass^3 — supersedes τ-bench (Yao et al., arXiv:2406.12045) retail; measures consistency across repeated trials.

Contamination audit (the dimension all of the above still don't cover):
Even after switching to BFCL v4, the benchmarks themselves may sit in Qwen3.6's pretraining corpus. Before reporting any external-benchmark number:

N-gram dedup of BFCL / MTU / τ² prompts against the xLAM-60k training set (and against the base model's public pretraining shards where available).
A schema-disjoint private canary set of 50-100 hand-authored tool schemas using internal or novel APIs. A LoRA whose advantage collapses on the canary set is fitting public schemas, not learning to call functions.
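The n-gram dedup step could start from something as simple as the sketch below, which flags evaluation prompts that share any word-level n-gram with the training text. The window size of 8 and the tokenization rule are assumptions to tune, and the function names are hypothetical; a real audit over xLAM-60k would likely want hashing or a Bloom filter to keep memory bounded.

```python
import re


def ngrams(text, n=8):
    """Set of word-level n-grams after lowercasing and dropping punctuation."""
    words = re.findall(r"[a-z0-9_]+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def flag_overlaps(eval_prompts, train_texts, n=8):
    """Indices of eval prompts sharing at least one n-gram with training text."""
    seen = set()
    for text in train_texts:
        seen |= ngrams(text, n)
    return [i for i, prompt in enumerate(eval_prompts) if ngrams(prompt, n) & seen]


# Tiny demonstration with a short window so the overlap is visible.
train = ["Calculate the monthly payment for a loan of 5000 dollars"]
evals = ["Please calculate the monthly payment now",
         "What is the weather in Paris today"]
flagged = flag_overlaps(evals, train, n=3)
```

Flagged prompts are candidates for manual review or removal, not proof of contamination; paraphrased leakage will pass an exact n-gram check, which is why the canary set is listed alongside it.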

Statistical reporting:

≥ 3 fine-tuning seeds, report mean ± std (LoRA seed variance is 0.07-18.22 pp depending on task, per arXiv:2503.07329).
Wilson 95% CI alongside every point estimate (see the HELM methodology paper, Liang et al., arXiv:2211.09110 for the established standard).
If LLM-as-judge is retained: multi-judge panel + full swap-and-compare protocol + explicit rubric.

Ablation grid (staged, not cross-product):
A full rank × alpha × targets × data-fraction cross-product × 3 seeds lands at 400+ runs — unshippable at PR scale. The defensible alternative is staged one-factor-at-a-time:

Stage 1 (hyperparameter sanity) — sweep LR, schedule, warmup, LoRA dropout, batch size; single-seed each.
Stage 2 (LoRA structure) — rank ∈ {8, 16, 32, 64} with matched α, targets ∈ {attention-only, attention + linear_proj, all-attention}, data fraction ∈ {25%, 50%, 100%}; single-seed each.
Stage 3 (finalists) — 2-3 best Stage-2 configs × ≥ 3 seeds → published numbers.

The most important baseline to add is prompt-only. Give the base Qwen3.6-35B-A3B an explicit 10-shot system prompt teaching the qwen3_coder XML format (the model was trained to emit it; it just needs to be told to), and report its score on the same benchmark suite. If prompt-only closes ≥ 80% of the LoRA's headline gap, the LoRA contribution should default to the cheapest recipe or be dropped in favor of prompting.
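The "closes ≥ 80% of the gap" criterion is plain arithmetic, sketched here for clarity. The base and LoRA values are the Gate-1 figures reported in the PR (6/10 and 9/10); the prompt-only score of 0.85 is a made-up placeholder, since that baseline has not been run.

```python
def gap_closed(base, prompt_only, lora):
    """Fraction of the base-to-LoRA gap recovered by prompting alone."""
    gap = lora - base
    if gap <= 0:
        raise ValueError("LoRA score must exceed the base score")
    return (prompt_only - base) / gap


# 0.85 is hypothetical; replace it with the measured prompt-only score.
fraction = gap_closed(base=0.60, prompt_only=0.85, lora=0.90)
meets_threshold = fraction >= 0.80
```

The same calculation applies unchanged to any benchmark in the suite, as long as all three scores come from the same prompt set and scoring rule.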

Framing: smoke test vs. research evidence

The right way to think about the current eval is against two goalposts:

As a PR-level smoke test that the LoRA pipeline runs end-to-end and produces output a judge prefers more often than not — adequate, once the harness misconfig is fixed. The current 94% is uninterpretable until then.
As evidence that the LoRA improves general function-calling ability — not defensible. The framing in the current README ("94% win rate") invites the standard the methodology fails to meet.

A defensible reframing for the README without changing any eval code:

"In a pilot comparison of N=50 in-distribution prompts from xLAM-60k's validation split, the fine-tuned model was preferred 47/50 times (94%, Wilson 95% CI 84-98%) by a single Claude Opus 4.7 judge with position randomization. This is a smoke test on in-distribution data; capability claims await validation on BFCL v4 / τ²-bench."

That sentence makes the headline number honest without removing any of the work. The deeper fix — switching to BFCL v4 / MTU-Bench / τ²-bench with a multi-judge panel and contamination audit — is what the contribution would need to make capability claims.

KeitaW

Review batch 4/5 — Infrastructure & NCCL Configuration

7 inline findings + 4 cross-cutting hardening items (below). Most consequential inline finding is reclaimPolicy: Delete on the FSx StorageClass — a data-loss footgun that destroys 2h12min of GPU time on any accidental kubectl delete pvc. The cross-cutting items below apply across multiple manifests and are easier to read as one body block than split inline.

`elasticPolicy.maxRestarts: 0` combined with Worker `restartPolicy: OnFailure` is internally contradictory

File: kubernetes/manifests/3.pytorchjob-train.yaml-template (lines 24, 28)

The Kubeflow Training Operator reads elasticPolicy.maxRestarts: 0 as "do not restart the job at the operator level". Meanwhile, restartPolicy: OnFailure on the Worker template tells the kubelet to restart the container in-place on non-zero exit. The two don't compose — if the rendezvous-leader pod's container dies, the kubelet restarts it alone, but the operator won't recreate sibling pods, and c10d rendezvous can't recover. The job wedges with zombie ranks instead of failing cleanly.

I'd suggest aligning the two policies:

For a strict no-restart test case: set both maxRestarts: 0 and Worker restartPolicy: Never.
For a more production-like setup: bump maxRestarts to 2-3 and keep restartPolicy: OnFailure for the kubelet-level container restart, accepting that whole-job restart is what actually rebuilds rendezvous.

Pods run as UID 0 with `IPC_LOCK` and no `runAsNonRoot` / `seccompProfile`

IPC_LOCK is correctly added on the GPU pods (needed for EFA RDMA pinning), but securityContext doesn't set runAsNonRoot: false explicitly, doesn't set allowPrivilegeEscalation: false, and doesn't pin a seccomp profile. The NeMo image runs as root by default, so any cluster with PodSecurityAdmission at baseline or restricted will reject these manifests with hard-to-diagnose admission errors.

Affected manifests:

kubernetes/manifests/2.convert-to-bridge.yaml-template (securityContext at end)
kubernetes/manifests/3.pytorchjob-train.yaml-template (securityContext at end of worker)
kubernetes/manifests/4.export-adapter.yaml-template (securityContext at end)

At minimum I'd suggest adding allowPrivilegeEscalation: false everywhere and an explicit comment noting that root is required by the NeMo image. If you can move to a non-root NeMo build later, that's a separate cleanup.

`tolerations: [{operator: Exists}]` tolerates every taint, including control-plane and draining nodes

Every manifest uses an unkeyed operator: Exists toleration, which means the pod can be scheduled onto any tainted node that the nodeSelector matches. On a single-tenant HyperPod cluster with one GPU instance type this is functionally fine, but the moment a cluster has Karpenter's karpenter.sh/disrupted:NoSchedule taint on draining nodes, or a dedicated-namespace taint, the workloads can still be scheduled onto those nodes.

I'd suggest narrowing to the specific taints these pods actually need to tolerate, e.g. for HyperPod:

tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
  - key: hyperpod
    operator: Exists

Applies to every manifest in kubernetes/manifests/.

Deprecated `TRANSFORMERS_CACHE` env var carried through three manifests

The HuggingFace transformers library deprecated TRANSFORMERS_CACHE in favor of HF_HOME several releases ago — newer versions emit a FutureWarning and the var is slated for removal. Each of these manifests already sets HF_HOME=/fsx/hf_cache, so the TRANSFORMERS_CACHE line is redundant noise that will start logging warnings as the base image rolls forward:

kubernetes/manifests/3.pytorchjob-train.yaml-template line 79
kubernetes/manifests/4.export-adapter.yaml-template line 55
kubernetes/manifests/5.inference-vllm.yaml-template line 64

I'd just delete all three lines.

KeitaW · 2026-05-16T00:30:28Z

+  storageType: SSD
+  subnetId: ${FSX_SUBNET_ID}
+provisioner: fsx.csi.aws.com
+reclaimPolicy: Delete


FSx StorageClass reclaimPolicy: Delete destroys the precache + Bridge checkpoint + trained adapter on any kubectl delete pvc

File: kubernetes/manifests/storage.yaml-template (line 29)

reclaimPolicy: Delete

cleanup.sh documents kubectl delete pvc qwen-moe-lustre as the data-purge command, but the same command is what any namespace teardown, GitOps drift reconcile, or copy-paste accident will run — and reclaimPolicy: Delete means the underlying FSx filesystem (plus the ~70 GB precache, the ~68 GB Bridge checkpoint, training checkpoints, and the exported adapter) disappears with it. That's a data-loss footgun on top of ~2h12min of GPU time per recovery.

I'd suggest defaulting to Retain and making purge an explicit opt-in:

Suggested change

reclaimPolicy: Delete

reclaimPolicy: Retain

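Then in cleanup.sh, gate destructive teardown behind a flag (e.g. PURGE_DATA=1) that patches the bound PV's persistentVolumeReclaimPolicy to Delete before removing the PVC, so accidental deletes are recoverable. (The StorageClass value only applies to newly provisioned volumes; an existing PV keeps the policy it was created with.)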

KeitaW · 2026-05-16T00:30:28Z

+                # HF cache on shared Lustre
+                - { name: HF_HOME,               value: /fsx/hf_cache }
+                - { name: TRANSFORMERS_CACHE,    value: /fsx/hf_cache }
+                - { name: PYTORCH_CUDA_ALLOC_CONF, value: "expandable_segments:True,max_split_size_mb:512" }


PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True,max_split_size_mb:512 — these two settings are mutually exclusive

File: kubernetes/manifests/3.pytorchjob-train.yaml-template (line 80)

Per the PyTorch allocator docs, when expandable_segments:True is enabled, the allocator uses the expandable-segments path where max_split_size_mb does not apply (segments are grown, not split by the legacy splitting policy). One of the two settings is ignored at runtime, and downstream users debugging OOM against this manifest will tune a knob that does nothing.

For MoE with EP=4 and frequent expert-routing reshapes, expandable_segments:True alone is the right choice:

Suggested change

- { name: PYTORCH_CUDA_ALLOC_CONF, value: "expandable_segments:True,max_split_size_mb:512" }

- { name: PYTORCH_CUDA_ALLOC_CONF, value: "expandable_segments:True" }

KeitaW · 2026-05-16T00:30:28Z

+          ports:
+            - containerPort: 8000
+              protocol: TCP
+          readinessProbe:


vLLM Deployment has no livenessProbe; a hung worker stays Ready indefinitely

File: kubernetes/manifests/5.inference-vllm.yaml-template (line 69, where readinessProbe is defined)

Only a readinessProbe against /v1/models is configured. vLLM workers can deadlock (CUDA driver hangs, asyncio task pileups, spawn orphans) while the HTTP server still answers 200 OK to /v1/models. Without a liveness probe, the pod is never restarted, the Service keeps routing traffic to the dead backend, and the Gate 2 Job hangs at its 180-second per-request timeout for every prompt — up to 100 × 180s = 5 hours before Kubernetes intervenes.

I'd suggest adding a liveness probe as a baseline, so that a pod whose HTTP server stops responding is restarted:

livenessProbe:
  httpGet:
    path: /v1/models
    port: 8000
  initialDelaySeconds: 600
  periodSeconds: 30
  timeoutSeconds: 10
  failureThreshold: 3

(A POST to /v1/chat/completions would be a stronger check but is harder to set up via httpGet; a polling sidecar is the alternative if you want the stricter probe.)

KeitaW · 2026-05-16T00:30:28Z

+      restartPolicy: OnFailure
+      template:
+        spec:
+          hostIPC: true


hostIPC: true on training workers is unnecessary alongside the in-pod /dev/shm emptyDir

File: kubernetes/manifests/3.pytorchjob-train.yaml-template (line 31)

The Pod already mounts a 128 GiB emptyDir{medium: Memory} at /dev/shm, which is sufficient for PyTorch DataLoader multiprocessing and NCCL CUDA IPC. hostIPC: true shares the node's IPC namespace with the Pod. That is useful for some legacy NCCL setups, but not required when the in-pod shm mount is sized correctly. It widens the blast radius (co-tenant pods + host processes can see each other's SysV semaphores and POSIX shm) for no observed perf gain.

I'd suggest removing it. If a specific NCCL/UCX feature genuinely needs it, that's worth a comment explaining why.

Suggested change

hostIPC: true

hostIPC: false

(Or drop the key entirely — the K8s default is false.)

KeitaW · 2026-05-16T00:30:28Z

+export GATE=1                                     # 1 | 2 | all
+export N_SAMPLES=50                               # only used by Gate 2
+export BEDROCK_REGION=us-west-2
+export JUDGE_MODEL=us.anthropic.claude-opus-4-7


JUDGE_MODEL=us.anthropic.claude-opus-4-7 and the temperature-deprecation comment both need verification

File: kubernetes/scripts/env.example (line 48), src/eval_function_calling.py (lines 265-266)

Two things here that I'd want confirmed before merge:

Bedrock cross-region inference-profile IDs usually carry a date+version suffix (...-YYYYMMDD-v1:0). The bare prefix us.anthropic.claude-opus-4-7 looks incomplete — I'd want a confirming round-trip from aws bedrock list-inference-profiles --region us-west-2 showing this exact string is invocable.

The comment at eval_function_calling.py:265-266 claims Opus 4.7 deprecated the temperature parameter. I couldn't find a primary source for that — Opus 4.7's documented deprecation is around top_k under thinking mode. If the comment is wrong, future maintainers may strip temperature from other judge calls unnecessarily.

I'd suggest pinning the profile ID with its full suffix in env.example, and either citing a source for the temperature-deprecation claim or removing the comment.

KeitaW · 2026-05-16T00:30:28Z

+              args:
+                - |
+                  set -ex
+                  python -c 'from megatron.core import __version__ as v; assert v >= "0.13", f"megatron-core >= 0.13 required, got {v}"'


assert v >= "0.13" compares versions as strings

File: kubernetes/manifests/3.pytorchjob-train.yaml-template (line 44)

String comparison happens to give the right answer for 0.17 vs 0.13 ("0.17" > "0.13" lexicographically because '7' > '3'), but it lies for anything where digit-count differs across the components — "0.2" > "0.13" is True as strings even though 0.2 < 0.13 as a semver. If a future bump moves the floor to 0.20 and someone runs against 0.5, the gate will let it through.

Suggested change

python -c 'from megatron.core import __version__ as v; assert v >= "0.13", f"megatron-core >= 0.13 required, got {v}"'

python -c 'from packaging.version import parse as V; from megatron.core import __version__ as v; assert V(v) >= V("0.13"), f"megatron-core >= 0.13 required, got {v}"'
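The difference between the two comparisons can be demonstrated without any third-party import. The helper below is a deliberately simplified, hypothetical stand-in for `packaging.version.parse` that keeps only the leading numeric components; the suggested change above, which uses `packaging`, remains the more robust option for real version strings.

```python
import re


def version_tuple(version):
    """Leading numeric components of a version string as a tuple of ints.

    Simplified illustration: pre-release and local suffixes such as "rc1"
    are ignored, and trailing zeros are dropped so "0.13.0" equals "0.13".
    """
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"no numeric version prefix in {version!r}")
    parts = [int(part) for part in match.group(0).split(".")]
    while len(parts) > 1 and parts[-1] == 0:
        parts.pop()
    return tuple(parts)


# String comparison goes character by character, so "2" beats "1".
string_result = "0.2" >= "0.13"
# Tuple comparison goes component by component, so 2 is below 13.
numeric_result = version_tuple("0.2") >= version_tuple("0.13")
```

Here `string_result` is `True` and `numeric_result` is `False`, which is exactly the case where the original assert would let an older megatron-core through.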

KeitaW · 2026-05-16T00:30:28Z

+              env:
+                # Training hyperparams (consumed by xlam_runner.py)
+                - { name: HF_MODEL_ID,          value: "${HF_MODEL_ID}" }
+                - { name: TOKENIZER_PATH,       value: "/fsx/hf_cache/models--Qwen--Qwen3.6-35B-A3B" }


Hardcoded TOKENIZER_PATH won't follow HF_MODEL_ID

File: kubernetes/manifests/3.pytorchjob-train.yaml-template (line 62)

- { name: TOKENIZER_PATH, value: "/fsx/hf_cache/models--Qwen--Qwen3.6-35B-A3B" }

HF_MODEL_ID is parameterized via env.example, but TOKENIZER_PATH is a frozen literal. If a user swaps in a different Qwen variant by changing HF_MODEL_ID alone, training will try to load the tokenizer from a stale path that doesn't exist. The same hardcoded default lives in src/xlam_runner.py:60. The precache pod already derives the cache path from HF_MODEL_ID (models--{HF_MODEL_ID.replace('/','--')}); applying the same derivation here keeps the two in sync.

I'd suggest either dropping the env var entirely and letting xlam_runner.py derive it (f"/fsx/hf_cache/models--{HF_MODEL_ID.replace('/','--')}"), or computing it in the script entrypoint before launching torchrun.
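A sketch of the first option, deriving the path inside the runner. It mirrors the `models--{HF_MODEL_ID.replace('/','--')}` pattern quoted above and the `/fsx/hf_cache` root used in the manifests; the helper name `hf_cache_dir` and the choice to let an explicit `TOKENIZER_PATH` override the derived value are assumptions, not the PR's current behavior.

```python
import os


def hf_cache_dir(model_id, cache_root="/fsx/hf_cache"):
    """Cache directory for a Hub model, following the models--ORG--NAME pattern."""
    return f"{cache_root}/models--{model_id.replace('/', '--')}"


# An explicit TOKENIZER_PATH still wins; otherwise the path follows HF_MODEL_ID.
model_id = os.environ.get("HF_MODEL_ID", "Qwen/Qwen3.6-35B-A3B")
tokenizer_path = os.environ.get("TOKENIZER_PATH") or hf_cache_dir(model_id)
```

Keeping the override means existing deployments that set `TOKENIZER_PATH` keep working, while a user who changes only `HF_MODEL_ID` no longer ends up pointing at a stale directory.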

KeitaW

Review batch 5/5 — Documentation, Minor follow-ups, Sources, and Kudos

Documentation Consistency

Doc snippets use `./scripts/...` but scripts live at `./kubernetes/scripts/...`

Several doc-embedded shell snippets reference scripts at the wrong relative path. Copy-pasting them from the test-case root won't work. Affected:

docs/PERFORMANCE.md:85 — ./scripts/3.convert-to-bridge.sh
docs/EVALUATION.md:116, 118 — ./scripts/6.deploy-inference.sh, ./scripts/7.run-eval.sh
docs/TROUBLESHOOTING.md:103, 109, 125 — ./scripts/3.convert-to-bridge.sh, ./scripts/5.export-adapter.sh (×2)

The README's walkthrough uses the correct ./kubernetes/scripts/... form; the docs just need to match.

The "p5e AMI" attribution for the FSx 2.15 requirement is imprecise — Lustre client version comes from the AMI, not the instance type

The Lustre 2.15.6 / FSx 2.10 incompatibility is real, but several places in the docs and the manifest attribute the 2.15 client to the instance type, when it's actually a property of the AMI chosen at cluster-creation time. The same p5e instance can run the HyperPod base AMI (which ships one Lustre client version), the Ubuntu DLAMI (another), or stock AL2023 (no Lustre client at all until the user installs one). Tying the requirement to "the p5e AMI" reads as if p5e implies a fixed AMI, which it doesn't.

Affected locations:

README.md:81 — "the FSx default of 2.10 is incompatible with the p5e AMI's Lustre client"
kubernetes/manifests/storage.yaml-template:7 — "incompatible with the Lustre 2.15.6 client bundled in the p5e AMI"
docs/TROUBLESHOOTING.md:18 — "bundled in the p5e AMI"
docs/TROUBLESHOOTING.md:33 — "The HyperPod p5e AMI has the Lustre client installed but doesn't auto-load…"

I'd suggest rewording to name the AMI explicitly, e.g. "the Lustre 2.15.6 client shipped in the HyperPod base AMI for p5/p5e nodes (as of <AMI version or date>)". That keeps the warning useful for users on that exact AMI while making clear it doesn't generalize to every cluster setup. The 1.architectures/7.sagemaker-hyperpod-eks/ reference in the README's Prerequisites is the right spot to link to so readers can verify which AMI their cluster is actually running.

Reference adapter is published under a personal HF account

The "skip training" path documented in the README, env.example, and EVALUATION.md depends on ying2022/qwen3-6-35b-xlam-tools-lora — a personal HuggingFace account. If that account is removed or the artifact deleted, the documented LORA_SOURCE=hf flow breaks for every future reader.

References: README.md:23, 164, kubernetes/scripts/env.example:39, docs/EVALUATION.md:115.

Two options worth considering: mirror the adapter under an awslabs/ or aws-samples/ HF org if one is appropriate, or add a one-line disclaimer in the README so users know the dependency before they hit a 404. Not blocking — just worth deciding deliberately.

Minor follow-ups

A few items I noticed but wouldn't block on:

src/xlam_runner.py:80-83 — apply_dataset_override(..., seq_length=seq_length) followed by cfg.dataset.seq_length = seq_length is redundant. The post-hoc cfg.dataset.dataset_root = dataset_root assignment also suggests the helper doesn't accept that kwarg; worth checking whether direct assignment of all three fields is cleaner than the helper call.
src/eval_function_calling.py:77,91 — timeout=180 is a scalar; a (10, 180) connect/read tuple combined with a wall-time cap on the eval loop would prevent unbounded hangs when vLLM is sick.
src/prep_xlam_dataset.py + src/eval_function_calling.py:268-273 — validation split is a non-stratified random slice; Gate 2's 50-sample result is sensitive to args.seed=42. Worth documenting the seed-dependence in EVALUATION.md, or stratifying validation by tool name.
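For the timeout item, a wall-time cap on the eval loop could look like the sketch below. `run_with_wall_cap` and its `send` argument are hypothetical: `send` stands in for whatever function issues the HTTP request, for example one that passes `timeout=(10, 180)` to `requests.post`. The clock is injectable so the cap can be tested without waiting.

```python
import time


def run_with_wall_cap(prompts, send, wall_cap_s, clock=time.monotonic):
    """Call send(prompt) for each prompt until the wall-time cap is reached.

    Returns (results, skipped): completed results in order, plus the prompts
    that were never attempted because the cap had already passed.
    """
    deadline = clock() + wall_cap_s
    results = []
    for index, prompt in enumerate(prompts):
        # Checked before each request; a request in flight is still bounded
        # by its own connect/read timeout.
        if clock() >= deadline:
            return results, list(prompts[index:])
        results.append(send(prompt))
    return results, []
```

Returning the skipped prompts lets the eval Job exit non-zero with a clear count of what was not scored, instead of silently reporting a score over a partial set.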

Sources

Citations supporting the Evaluation Methodology Validity section, in the order they appear. Every claim that depends on external research has a primary source pinned here; full research report at /mnt/fsx/ubuntu/.claude/plugins/data/lt-lieutenant/research/pr1091-eval-validity/report.md.

Datasets and benchmarks (primary):

Salesforce xLAM-60k dataset card — confirms single 60k file with no upstream train/val split
APIGen / xLAM paper, Liu et al., arXiv:2406.18518 — xLAM-60k synthesis pipeline
Berkeley Function Calling Leaderboard (BFCL) — leaderboard for Qwen3-30B-A3B ~69%, published xLAM fine-tunes 78-88% on v2 Live
BFCL harness source — implementation
τ-bench (Yao et al., arXiv:2406.12045) and τ²-bench (Barres et al., arXiv:2506.07982)
MTU-Bench (Wang et al., arXiv:2410.11710)
Hammer paper (Lin et al., arXiv:2410.04587) — Table 1 + §1: xLAM-7B-fc tops BFCL (79.41) but bottoms cross-benchmark average (69.05); paper introduces "function masking" to mitigate naming-convention overfitting. Also Hammer model collection on HF.

Model identity and tool-call format (primary):

Qwen/Qwen3.6-35B-A3B model card — checkpoint identity, qwen3_coder XML format, sampling-param recommendations
Qwen/Qwen3-30B-A3B model card — closest publicly-benchmarked proxy
vLLM tool-calling docs and the --tool-call-parser flag reference — confirms qwen3_coder is a supported parser value

LLM-as-judge methodology (primary, confirmed):

Zheng et al. 2023, Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena, arXiv:2306.05685 — foundational bias taxonomy; MT-Bench κ ≈ 0.80 for free-text
Wataoka et al. 2024, Self-Preference Bias in LLM-as-a-Judge, arXiv:2410.21819: frontier judges inflate own family by ~52 pp
Verga et al. 2024, Replacing Judges with Juries (PoLL), arXiv:2404.18796 — panel methodology
Dubois et al. 2024, Length-Controlled AlpacaEval, arXiv:2404.04475 — verbosity-bias correction
AgentProp-Bench, arXiv:2604.16706 — single-judge tool-use κ = 0.34-0.57; ensemble κ = 0.432

Statistical methodology (primary):

Wilson 1927 / Newcombe 1998 — Wilson binomial CI
HELM, Liang et al., arXiv:2211.09110 — large-N evaluation methodology
arXiv:2503.07329 — LoRA seed variance 0.07-18.22 pp

Citations pending verification (treat as Tier 2 until confirmed):

arXiv:2506.13639 — claim: explicit rubric → +7 pp human alignment with GPT-4o judges. Not explicitly confirmed in Codex web-search cross-check; verify before citing externally.

Things That Look Great

Every shell script opens with set -euo pipefail plus the MIT-0 copyright header, and uses : "${VAR:?source scripts/env.sh first}" to fail fast on missing config — the most defensive script preamble I've reviewed in this repo recently.
envsubst calls in every script pass an explicit variable whitelist (envsubst '$NAMESPACE $IMAGE ...'), so a stray $ in a manifest comment can't accidentally get substituted.
NCCL_SOCKET_IFNAME=^lo exclusion pattern used in every manifest that runs NCCL — exactly right for EFA-equipped instances.
cleanup.sh correctly drops the pytorchjob resource type rather than a plain job, which is one of the more common copy-paste mistakes in this repo's test cases.
src/xlam_runner.py:96 explicitly excludes linear_fc1/linear_fc2 from LORA_TARGET_MODULES and explains why in a comment ("applying LoRA there would bloat the adapter ~256× and break EP sharding") — exactly the kind of MoE-specific tribal knowledge that's normally lost.
Two-gate evaluation harness (deterministic hand-crafted prompts + position-randomized LLM-as-judge over held-out validation) — much stronger than the typical "eval loss went down" claim in test-case PRs.
docs/TROUBLESHOOTING.md catalogs the actual failure modes encountered during bring-up (FSx 2.10/2.15 client version mismatch, NullTokenizer.space_sensitive, apply_factory_merges key-set mismatch, vLLM Hermes parser vs Qwen XML tool-call format) with concrete symptoms and root causes — this is the most useful doc in the contribution.
kubernetes/manifests/2.convert-to-bridge.yaml-template:36-39 short-circuits on a .done marker so the 25-minute conversion isn't accidentally re-run; the training and export pods test for it as a hard precondition.
Reference performance and eval results are reported with raw numerators (Base 6/10 → LoRA 9/10, 47/50) rather than just percentages — easy to audit, easy to compare against a re-run.
Container images are all pinned (nvcr.io/nvidia/nemo:26.04, vllm/vllm-openai:v0.10.2) — no :latest.

KeitaW

Few comments

yhou-uk force-pushed the megatron-bridge-qwen36-moe-lora branch from c9d8e60 to 0c3fff7 on May 14, 2026 13:54

KeitaW reviewed May 16, 2026


KeitaW requested changes May 16, 2026


	--enable-auto-tool-choice \
	--tool-call-parser hermes \
	--tool-call-parser qwen3_coder \

		base_out = base_msg.get("content") or str(base_msg.get("tool_calls"))
		lora_out = lora_msg.get("content") or str(lora_msg.get("tool_calls"))

	bridge = AutoBridge.from_hf_pretrained(args.hf_model, trust_remote_code=True)
	# trust_remote_code is required for the Qwen3 modelling code shipped with the Hub repo.
	bridge = AutoBridge.from_hf_pretrained(args.hf_model, trust_remote_code=True)

	_TOOL_CALL_RE = re.compile(r"<function=(\w+)>(.*?)</function>", re.S)
	_TOOL_CALL_RE = re.compile(r"<function=([\w.\-]+)>(.*?)</function>", re.S)

