Skip to content

Arm backend: fix(arm): validate partitions for dependency cycles after Q/DQ de-tagging#18191

Open
beomwookang wants to merge 4 commits intopytorch:mainfrom
beomwookang:fix/partition-cycle-after-qdq-detag
Open

Arm backend: fix(arm): validate partitions for dependency cycles after Q/DQ de-tagging#18191
beomwookang wants to merge 4 commits intopytorch:mainfrom
beomwookang:fix/partition-cycle-after-qdq-detag

Conversation

@beomwookang
Copy link
Copy Markdown

@beomwookang beomwookang commented Mar 16, 2026

Fixes #18190

Summary

_detag_boundary_nodes() removes Q/DQ nodes from partition boundaries after CapabilityBasedPartitioner has produced cycle-free partitions.
However, this de-tagging can introduce dependency cycles for models with complex attention blocks (e.g. MobileViT, where CNN and Transformer ops are grouped into a single large partition).

The cycle occurs because removing Q/DQ bridge nodes creates paths that exit the partition and re-enter it through the now-unpartitioned nodes, making it impossible to extract the partition as a valid subgraph:

[partition] Linear_Q → [de-tagged Q] → [outside] → [de-tagged DQ] → [partition] Matmul
[partition] Linear_K → [de-tagged Q] → [outside] → [de-tagged DQ] → [partition] Matmul

Changes

After _detag_boundary_nodes(), validate each partition for dependency cycles.
When a cycle is detected, split the partition into connected components of the surviving tagged nodes.
Each component becomes a separate valid partition.

  • Add _validate_partition(): BFS-based cycle detection (same algorithm as torch.fx.passes.utils.fuser_utils.validate_partition)
  • Add _find_connected_components(): undirected graph traversal to split surviving nodes into disjoint sub-partitions
  • Guard the nocompute-partition tags.remove() against already-removed tags from the cycle-split path

Test Results

  • MobileViT-S on Ethos-U85: previously failed with AssertionError: Invalid partition, found dependency cycles, now successfully produces a .pte file (5.7 MB). Nine attention-block partitions are each split into 3 sub-partitions. All sub-partitions remain on NPU (no CPU fallback).
  • CNN-only models (ResNet, MobileNetV2, EfficientNet): unaffected — no cycles after de-tagging, no partition splitting occurs.

How to test

import torch
from executorch.backends.arm.ethosu import EthosUPartitioner, EthosUCompileSpec
from executorch.exir import to_edge_transform_and_lower, EdgeCompileConfig

# Load a MobileViT (or any CNN+Transformer hybrid) model
exported_program = torch.export.load("mobilevit_s.pt2")
model = exported_program.module()
example_inputs = (torch.randn(1, 3, 256, 256),)

compile_spec = EthosUCompileSpec("ethos-u85-256")
partitioner = EthosUPartitioner(compile_spec)

# This previously raised: AssertionError: Invalid partition, found
dependency cycles
edge = to_edge_transform_and_lower(
    exported_program,
    partitioner=[partitioner],
    compile_config=EdgeCompileConfig(_check_ir_validity=False),
)

cc @digantdesai @freddan80 @per @zingo @oscarandersson8218 @mansnils @Sebastian-Larsson @robell

@pytorch-bot
Copy link
Copy Markdown

pytorch-bot Bot commented Mar 16, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/18191

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEVs

There are 1 currently active SEVs. If your PR is affected, please view them below:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla meta-cla Bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Mar 16, 2026
@github-actions
Copy link
Copy Markdown

This PR needs a release notes: label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

@gggekov gggekov added partner: arm For backend delegation, kernels, demo, etc. from the 3rd-party partner, Arm ciflow/trunk labels Mar 16, 2026
@pytorch-bot
Copy link
Copy Markdown

pytorch-bot Bot commented Mar 16, 2026

To add the ciflow label ciflow/trunk please first approve the workflows that are awaiting approval (scroll to the bottom of this page).

This helps ensure we don't trigger CI on this PR until it is actually authorized to do so. Please ping one of the reviewers if you do not have access to approve and run workflows.

@rascani rascani requested a review from zingo March 16, 2026 17:00
@zingo zingo changed the title fix(arm): validate partitions for dependency cycles after Q/DQ de-tagging Arm backend: fix(arm): validate partitions for dependency cycles after Q/DQ de-tagging Mar 17, 2026
@nil-is-all
Copy link
Copy Markdown
Contributor

Hi @beomwookang, could you resolve the failing CI tests and check the lintrunner? Do let us know here once done for a review. Thanks for the bug report and the PR!

@nil-is-all
Copy link
Copy Markdown
Contributor

@beomwookang pinging again to resolve CI tests failures

…ging

`_detag_boundary_nodes` removes Q/DQ nodes from partition boundaries
after `CapabilityBasedPartitioner` has produced cycle-free partitions.
However, this de-tagging can introduce dependency cycles for models
with complex attention blocks (e.g. MobileViT, where CNN and
Transformer ops are grouped into a single large partition).

The cycle occurs because removing Q/DQ bridge nodes creates paths
that exit the partition and re-enter it through the now-unpartitioned
nodes, making it impossible to extract the partition as a valid
subgraph.

This change adds cycle validation after `_detag_boundary_nodes`. When
a cycle is detected, the partition is split into connected components
of the surviving (still-tagged) nodes. Each component becomes a
separate partition that is individually cycle-free after de-tagging.

- Add `_validate_partition()`: BFS-based cycle detection (same
  algorithm as `torch.fx.passes.utils.fuser_utils.validate_partition`)
- Add `_find_connected_components()`: undirected graph traversal to
  split surviving nodes into disjoint sub-partitions
- Guard the nocompute-partition `tags.remove()` against already-removed
  tags from the cycle-split path

Tested with MobileViT-S on Ethos-U85: previously failed with
`AssertionError: Invalid partition, found dependency cycles`, now
successfully produces a .pte file (5.7 MB). Nine attention-block
partitions are each split into 3 sub-partitions. All sub-partitions
remain on NPU (no CPU fallback). Existing CNN-only models (ResNet,
MobileNetV2, EfficientNet) are unaffected as their partitions have
no cycles after de-tagging.
@beomwookang beomwookang force-pushed the fix/partition-cycle-after-qdq-detag branch from 5f597ab to c94b062 Compare April 15, 2026 02:36
@beomwoo-kang
Copy link
Copy Markdown

@nil-is-all
Hello, sorry for the late response (it seems like I missed the notification)

All remaining CI failures on this PR are pre-existing issues unrelated to the backends/arm/tosa/partitioner.py change:

  • test-torchao-huggingface-checkpoints (lfm2_5_1_2b, qwen3_4b, phi_4_mini) on linux.arm64.2xlarge: pre-existing failure on main — verified on commit 7761476 where these same three jobs fail with NotImplementedError: to() is not implemented for IntxOpaqueTensor in examples/models/llama/export_llama_lib.py. The x86 linux.2xlarge variants of the same jobs pass on both main and this PR.
  • test-multimodal-linux (gemma3-4b): C++ segfault (exit 139) during multimodal runner execution with 7.5 GiB RSS and swap pressure. Export phase succeeded; failure is in extension/llm/runner/multimodal_runner.cpp runtime. Uses the xnnpack recipe, not ARM
  • unittest/unittest-editable/unittest-release on windows: 120-minute timeout cancellations. All tests actually PASSED (reached [100%]) around the 71-minute mark, but pytest process hung during teardown. Reproduces on main — unittest-release / windows on main commit 7761476 was also cancelled at exactly 2:00:25
  • unittest-editable / macos: 1 failed / 1474 passed. The single failure is test_target_recipes.py::test_mv3_model with AssertionError: Model 'mv3' Recipe: ios-arm64-coreml-fp16, tolerance check failed. Despite the "arm64" in the recipe name (which refers to Apple Silicon CPU architecture), this recipe is [CoreMLRecipeType.FP16, XNNPackRecipeType.FP32] — it exercises the CoreML and XNNPACK backends, not the ARM (Ethos-U/TOSA) backend modified by this PR. The log also confirms Skipping ARM flow registration: No module named 'tosa_serializer', so the ARM backend is not even registered in this run. This is a flaky CoreML FP16 numerical tolerance check.
  • test-models-macos-coreml (edsr): 90-minute timeout cancellation on the macOS CoreML runner. The same job completes successfully in ~13 minutes on main (commit 7761476). Uses the CoreML recipe; no ARM code involved. Infrastructure hang on this specific runner.

This PR touches only backends/arm/tosa/partitioner.py and none of these failing jobs exercise that code path

@digantdesai
Copy link
Copy Markdown
Contributor

@claude Review this PR

@claude
Copy link
Copy Markdown

claude Bot commented Apr 22, 2026

Claude finished @digantdesai's task in 3m 35s —— View job


Review of PR #18191

  • Read the changed file and understand the diff
  • Review the cycle detection algorithm (_validate_partition)
  • Review the connected components algorithm (_find_connected_components)
  • Review the integration into _tag_module()
  • Check for edge cases and potential issues
  • Post review feedback

Summary

This PR fixes a real bug where _detag_boundary_nodes() can introduce dependency cycles by removing Q/DQ bridge nodes from partitions in models with complex attention blocks (e.g. MobileViT). The approach — detect cycles post-de-tagging, then split into connected components — is sound. The algorithms are correct and the implementation is clean. I have a few observations, mostly around an interaction between the cycle-split path and the nocompute-partition check.


Algorithm Correctness

_validate_partition (partitioner.py:135-168): The BFS-based cycle detection is correct and matches the algorithm from torch.fx.passes.utils.fuser_utils.validate_partition. It starts from external users of the partition and walks forward through the graph — if it reaches back into the partition, a cycle exists. The use of deque for proper BFS is appropriate.

_find_connected_components (partitioner.py:171-203): The undirected BFS traversal to find connected components is correct. Using both all_input_nodes and users ensures undirected connectivity. The remaining set shrinks with each component found, guaranteeing termination and O(V+E) complexity.


Issues

1. Nocompute-partition check doesn't cover sub-partitions after cycle split (moderate)

After the cycle-split path (lines 360-382), the nocompute check at lines 384-403 still iterates over the original partition.nodes and only attempts to remove the original tag. If a cycle split occurred:

  • The original tag was already removed from tags at line 376
  • The nodes were re-tagged with new_tag values at lines 378-382
  • The nocompute check on partition.nodes will still find delegation_tag on the re-tagged nodes, but reject_partition() will delete those new tags
  • The tags.remove(tag) at line 403 is guarded by if tag in tags, so it won't crash — but it also won't remove the new_tag entries from tags

In practice, this means the nocompute check after a cycle split is partially broken: it may incorrectly strip tags from newly created sub-partitions without removing their corresponding tags from the tags set, potentially leaving orphan tags in the return value. This is unlikely to hit in practice (a cycle split on a partition with only noops seems improbable), but worth fixing for correctness.

Fix this →

2. _validate_partition could be called on an empty set (minor)

At line 366, surviving and not _validate_partition(surviving) guards against an empty set, so this is correctly handled already. No issue here.

3. Connected components don't guarantee cycle-free sub-partitions (theoretical)

Splitting by connected components guarantees that each component is a connected subgraph, but does not strictly guarantee each component is cycle-free. In practice, a single connected component that is contiguous in the directed graph should be extractable without creating external cycles — the cycles are caused by disconnected groups of nodes within the same partition that have paths through external nodes between them. So splitting by connectivity is the right fix here. But it may be worth adding a debug assertion to verify each component passes _validate_partition as a safety net.


Style Observations

  • The code matches the existing style of the file well
  • The docstrings are clear and follow the existing conventions
  • The logger.info message at line 368 is helpful for debugging partition issues in production

Missing Tests

There are no unit tests for _validate_partition or _find_connected_components. The existing ARM test suite (backends/arm/test/misc/) has tests for custom partitioning and decomposed quantized ops, but nothing that exercises the cycle-detection-and-split path. Adding a targeted test with a synthetic graph that triggers a cycle after Q/DQ de-tagging would strengthen confidence in this fix and prevent regressions.


Verdict

The core fix is correct and addresses a real issue with CNN+Transformer hybrid models on the ARM backend. The two new utility functions are well-implemented. I'd recommend addressing the nocompute check interaction (issue #1) and adding a unit test before merging. The guard change at line 402 (if tag in tags) is a necessary defensive fix given the cycle-split path can remove tag from tags earlier.

…ions

  After a cycle split, the nocompute check still iterated over the original
  partition nodes and only attempted to remove the original tag.  This left
  orphan tags in the returned set when sub-partitions were rejected.

  Group the nocompute check by active tag so each sub-partition is evaluated
  and cleaned up independently.  Also update reject_partition() to accept an
  iterable of nodes instead of a Partition object.
  Add tests for _validate_partition and _find_connected_components using
  synthetic torch.fx graphs.  Cover contiguous/non-contiguous partitions,
  single nodes, branching graphs, and empty sets.
@github-actions github-actions Bot added the module: arm Issues related to arm backend label Apr 23, 2026
@beomwoo-kang
Copy link
Copy Markdown

@digantdesai I reflected the review from Claude. Please check:)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ciflow/trunk CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. module: arm Issues related to arm backend partner: arm For backend delegation, kernels, demo, etc. from the 3rd-party partner, Arm

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[ARM/TOSA] Dependency cycle after Q/DQ de-tagging in TOSAPartitioner breaks models with attention blocks

6 participants