
Arm backend: Add static cache integration test with llama #18404

Open

xingguo01 wants to merge 1 commit into pytorch:main from xingguo01:arm-backend-stat-cache-integration-llama

Conversation

xingguo01 (Collaborator) commented Mar 23, 2026

pytorch-bot commented Mar 23, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/18404

Note: Links to docs will display an error until the docs builds have been completed.

❌ 3 New Failures, 6 Pending, 3 Unrelated Failures

As of commit c175435 with merge base e638059:

NEW FAILURES - The following jobs have failed:

BROKEN TRUNK - The following jobs failed but were present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

meta-cla bot added the CLA Signed label Mar 23, 2026
xingguo01 added the partner: arm, ciflow/trunk, and release notes: arm labels Mar 23, 2026
zingo (Collaborator) commented Mar 23, 2026

Hi @digantdesai, this PR touches code outside the Arm backend and needs a Meta review. Thanks!

Change-Id: I881fa107f43c9682c18480d01996a5795ae7f086
Signed-off-by: Xingguo Li <xingguo.li@arm.com>
Copilot AI review requested due to automatic review settings April 14, 2026 17:22
xingguo01 force-pushed the arm-backend-stat-cache-integration-llama branch from de5d980 to c175435 on April 14, 2026 17:22
Copilot AI (Contributor) left a comment

Pull request overview

This PR adds an Arm backend integration test that exercises HuggingFace LLaMA StaticCache lowering, and adjusts backend transforms/passes to better support LLaMA-style attention and graph-signature constraints during lowering.

Changes:

  • Extend SDPA decomposition to handle LLaMA-style GQA (Q heads != KV heads) and refactor SDPA graph-copying helpers.
  • Add a new HuggingFace StaticCache-based LLaMA INT TOSA integration test.
  • Fix bias placeholder insertion ordering for rewritten convs to satisfy constant-vs-user-input placeholder constraints.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.

Files reviewed:

  • backends/transforms/decompose_sdpa.py: Refactors SDPA decomposition and adds a GQA wrapper for LLaMA-style head mismatch.
  • backends/arm/test/models/test_llama.py: Adds HF StaticCache LLaMA test and tweaks existing LLaMA TOSA pipeline settings.
  • backends/arm/_passes/rewrite_conv_pass.py: Ensures synthetic bias placeholders are inserted before user-input placeholders.
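
For context on the rewrite_conv_pass.py change: in an ExportedProgram, constant placeholders (parameters, buffers, lifted constants) must precede user-input placeholders, so a synthetic conv bias cannot simply be appended after the existing inputs. A minimal sketch of the pattern, assuming the simplest case where inserting before the first existing placeholder suffices (the actual pass may instead locate the constant/user-input boundary via the graph signature; the helper name is hypothetical):

    import torch.fx as fx

    def insert_constant_placeholder(graph: fx.Graph, name: str) -> fx.Node:
        # Insert a new placeholder ahead of all existing placeholders so the
        # ExportedProgram's constant-before-user-input ordering stays valid.
        # Assumes the graph already has at least one placeholder.
        first = next(n for n in graph.nodes if n.op == "placeholder")
        with graph.inserting_before(first):
            return graph.placeholder(name)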


Comment on lines 208 to 212

        custom_path="llama_tosa_fb_int",
    -   run_on_tosa_ref_model=False,  # Just want to write TOSA FB to disk
    +   run_on_tosa_ref_model=True,  # Just want to write TOSA FB to disk
        use_to_edge_transform_and_lower=True,
        frobenius_threshold=None,
        cosine_threshold=None,
Copilot AI commented Apr 14, 2026
The inline comment says this test “Just want to write TOSA FB to disk”, but run_on_tosa_ref_model is now True (and the explicit serialize stage was removed). Either update the comment to match the new behavior (running the TOSA ref model) or set run_on_tosa_ref_model=False if the intent is still artifact-only.

Comment on lines +45 to +57

    @staticmethod
    def _extract_input_tensors(node: torch.fx.Node) -> tuple[object, ...]:
        def _extract_arg_value(arg):
            if isinstance(arg, torch.fx.Node):
                if "val" not in arg.meta:
                    raise RuntimeError(
                        f"Missing meta['val'] for SDPA arg node: {arg.name}"
                    )
                return arg.meta["val"]
            return arg

        return tuple(_extract_arg_value(arg) for arg in node.args)
Copilot AI commented Apr 14, 2026
_extract_input_tensors only walks node.args and ignores node.kwargs. For aten.scaled_dot_product_attention it’s common for attn_mask / dropout_p / is_causal / scale to be provided as kwargs, so the make_fx trace here can silently use defaults and decompose the wrong computation. Consider canonicalizing the SDPA call into a full positional arg list (q,k,v,attn_mask,dropout_p,is_causal,scale) by merging args+kwargs+defaults, and use that both for tracing and for the later scale adjustment (including handling scale passed positionally).

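One way to implement that canonicalization, as a rough sketch: the parameter order and defaults below follow torch.nn.functional.scaled_dot_product_attention (newer PyTorch releases add further kwargs such as enable_gqa), and the helper name is hypothetical:

    import torch

    # Core positional signature of scaled_dot_product_attention:
    # (query, key, value, attn_mask=None, dropout_p=0.0, is_causal=False, scale=None)
    _SDPA_PARAMS = ("query", "key", "value", "attn_mask", "dropout_p", "is_causal", "scale")
    _SDPA_DEFAULTS = {"attn_mask": None, "dropout_p": 0.0, "is_causal": False, "scale": None}

    def canonicalize_sdpa_args(node: torch.fx.Node) -> tuple:
        # Merge positional args, kwargs, and defaults into one full positional
        # tuple, so tracing and the later scale adjustment see the same
        # argument list regardless of how the original call was written.
        merged = dict(zip(_SDPA_PARAMS, node.args))
        merged.update(node.kwargs)  # kwargs fill the slots not bound positionally
        for param, default in _SDPA_DEFAULTS.items():
            merged.setdefault(param, default)
        return tuple(merged[param] for param in _SDPA_PARAMS)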
Comment on lines +80 to +83

    for decomposed_node in decomposed_module.graph.nodes:
        node.meta["nn_module_stack"] = decomposed_node.meta.get("nn_module_stack")
        if decomposed_node.op == "placeholder":
            continue
Copilot AI commented Apr 14, 2026
In _copy_decomposed_graph, nn_module_stack metadata is being written onto the original SDPA node (which is erased) instead of propagating from the original node to the decomposed nodes / copied subgraph nodes. This likely drops nn_module_stack on the new nodes and breaks downstream tooling relying on that metadata. The direction should match other decomposition passes (e.g., set decomposed_node.meta["nn_module_stack"] = node.meta.get("nn_module_stack") before node_copy, or set it on subgraph_node after copying).

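A sketch of the suggested direction, assuming the stack should be stamped onto the decomposed nodes before node_copy runs (the function name is hypothetical; setting the metadata on subgraph_node after copying would work equally well):

    import torch.fx as fx

    def propagate_module_stack(original: fx.Node, decomposed_graph: fx.Graph) -> None:
        # The original SDPA node is erased after the copy, so metadata
        # written onto it is lost; the decomposed nodes are what survive
        # in the outer graph and should receive the stack.
        stack = original.meta.get("nn_module_stack")
        for n in decomposed_graph.nodes:
            if n.op == "placeholder":
                continue
            n.meta["nn_module_stack"] = stack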
    Hk = k.shape[1]
    if Hq != Hk:
        # LLaMA-style GQA: tile K and V heads to match Q
        assert Hq % Hk == 0, f"GQA mismatch: Hq={Hq}, Hk={Hk}"
Copilot AI commented Apr 14, 2026
Using a bare assert for the GQA head-ratio check makes this validation disappear under Python -O and can turn a shape mismatch into harder-to-debug downstream errors during tracing. Prefer raising a RuntimeError / ValueError with the same message so it is always enforced.

Suggested change

    -   assert Hq % Hk == 0, f"GQA mismatch: Hq={Hq}, Hk={Hk}"
    +   if Hq % Hk != 0:
    +       raise ValueError(f"GQA mismatch: Hq={Hq}, Hk={Hk}")

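For readers unfamiliar with the tiling itself, a minimal sketch of LLaMA-style GQA head expansion, assuming a [batch, heads, seq, head_dim] layout with heads on dim 1 as in the snippet above (repeat_interleave is one common implementation, not necessarily what this pass emits):

    import torch

    def expand_kv_heads(
        k: torch.Tensor, v: torch.Tensor, n_q_heads: int
    ) -> tuple[torch.Tensor, torch.Tensor]:
        # Tile K/V heads so Hk matches Hq for grouped-query attention:
        # each KV head serves n_rep consecutive query heads.
        n_kv_heads = k.shape[1]
        if n_q_heads % n_kv_heads != 0:
            raise ValueError(f"GQA mismatch: Hq={n_q_heads}, Hk={n_kv_heads}")
        n_rep = n_q_heads // n_kv_heads
        return k.repeat_interleave(n_rep, dim=1), v.repeat_interleave(n_rep, dim=1)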
Comment on lines 188 to 191

        custom_path="llama_tosa_fb",
    -   run_on_tosa_ref_model=False,  # Just want to write TOSA FB to disk
    +   run_on_tosa_ref_model=True,  # Just want to write TOSA FB to disk
        use_to_edge_transform_and_lower=True,
        transform_passes=[InsertInt32CastsAfterInt64PlaceholdersPass()],
Copilot AI commented Apr 14, 2026
The inline comment says this test “Just want to write TOSA FB to disk”, but run_on_tosa_ref_model is now True (and the explicit serialize stage was removed). Either update the comment to match the new behavior (running the TOSA ref model) or set run_on_tosa_ref_model=False if the intent is still artifact-only.

    )(*input_tensors)

    with graph.inserting_before(node):
        name_to_input_tensor_map = {}
Contributor commented

Is this just a refactor?

digantdesai (Contributor) left a comment

Does it work on the tosa ref model though? Just curious.


Labels

  • ciflow/trunk
  • CLA Signed (managed by the Facebook bot; authors need to sign the CLA before a PR can be reviewed)
  • partner: arm (for backend delegation, kernels, demo, etc. from the 3rd-party partner, Arm)
  • release notes: arm (changes to the ARM backend delegate)
