test(mtp): add RED-bar scripts for converter + e2e determinism#242
Open
dusterbloom wants to merge 2 commits into
Open
test(mtp): add RED-bar scripts for converter + e2e determinism#242dusterbloom wants to merge 2 commits into
dusterbloom wants to merge 2 commits into
Conversation
Two shell tests carved out of Luce-Org#241 per @howard0su's review request: - test_mtp_converter.sh: Phase 1 RED for convert_dflash_to_gguf.py --mtp-assistant mode. Validates the converter output matches the AtomicChat gold reference (tensor list, required MTP tensors, GGUF metadata keys, requires_target_arch == gemma4). Exits 77 cleanly if gold reference missing. - test_mtp_e2e.sh: Phase 3 RED for --draft-method mtp byte-identical output vs --draft-method none at temp=0 seed=0. Exits 77 cleanly if test_gemma4_dflash binary or models missing. Both safe to land now: scripts skip gracefully when their prerequisites aren't on disk, so they're CI-stable. Will turn GREEN when the corresponding MTP work lands.
Contributor
There was a problem hiding this comment.
2 issues found across 2 files
Reply with feedback, questions, or to request a fix.
Re-trigger cubic
Two cubic findings on PR Luce-Org#242 (both valid P2): - Tensor-list extraction used 'grep -E \d' which matches a literal backslash-d in POSIX ERE, not a digit. Switched to [0-9]. - target-arch check was substring-based and would pass for values merely containing 'gemma4'. Switched to exact-equality. Script still exits 77 cleanly when prerequisites (gold model, safetensors) are absent.
Contributor
|
Submodule rollback hidden in diff, can you fix? |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Two shell test scripts carved out of #241 per @howard0su's review request to split the rename PR from the test additions.
What's added
dflash/test/test_mtp_converter.sh— Phase 1 RED test forconvert_dflash_to_gguf.py --mtp-assistantmode. Validates the converter's GGUF output matches the AtomicChat gold reference on (1) tensor list, (2) required MTP tensors, (3) required GGUF metadata keys, (4)requires_target_arch == "gemma4". Exits 77 (autotools skip code) cleanly if the gold reference at$MTP_GOLD_GGUFis missing.dflash/test/test_mtp_e2e.sh— Phase 3 RED test for--draft-method mtpbyte-identical output vs--draft-method noneattemp=0 seed=0. Exits 77 cleanly if thetest_gemma4_dflashbinary or the model files are missing.Safe to land now
Both scripts:
#!/usr/bin/env bash+set -euo pipefail/tmpThey'll turn GREEN when the corresponding MTP work lands. Until then they're CI-stable skips.
Not in this PR
The third test file flagged in #241,
dflash/test/test_flash_attn_sparse_mask.cpp, is deferred. It's a TDD red-bar that intentionally references symbols (flash_prefill_forward_bf16_masked(), 6-argggml_flash_attn_sparse()) that don't exist yet, so adding it to main without the spike that turns it green would leave a permanently broken build target. It will ship with the sparse-mask spike PR.