Skip to content

Canonicalize extract past distinct-index insert#2965

Open
rylanmalarchick wants to merge 4 commits into
PennyLaneAI:mainfrom
rylanmalarchick:extract-canon-distinct-index
Open

Canonicalize extract past distinct-index insert#2965
rylanmalarchick wants to merge 4 commits into
PennyLaneAI:mainfrom
rylanmalarchick:extract-canon-distinct-index

Conversation

@rylanmalarchick

Copy link
Copy Markdown

Context:

quantum.extract / quantum.insert thread qubits through an SSA register value. When a circuit extracts qubit b after an insert at a distinct constant index a, the extract reads the post-insert register even though that insert never touched qubit b. That is a false data dependency between wires acting on different qubits of the same register (issue #2725).

Description of the Change:

Extend ExtractOp::canonicalize to handle %q = extract (insert %reg[a], v)[b] where a and b are distinct static constants: rewrite the extract to read %reg[b] directly, since the insert at index a leaves qubit b unchanged.

The rewrite only fires while the bypassed insert still has a non-extract user, so it can never be left dead. A dead insert (NoMemoryEffect) would be removed by DCE along with the gate feeding it, which would silently drop a gate. This was observed in --decompose-lowering during development, and the guard prevents it.

Verified with make test-mlir (check-dialects): 147/147 pass.

Benefits:

Breaks the false wire dependency the issue describes, exposing independent qubits to downstream analyses. In most cases it also simplifies the IR by collapsing inverse extract/insert round-trips: the QRef/SemanticConversion goldens shrink, and the long-standing TODO in TestFlatCircuits.mlir about distinct-index inverse pairs is now satisfied.

Possible Drawbacks:

In one --decompose-lowering test module (@different_qreg_values) the rewrite adds two register ops (an extract and an insert) while preserving all gate counts. It is value-correct, but it grows the IR there rather than shrinking it. Goldens in QRef/SemanticConversion and DecomposeLoweringTest change because those tests run --canonicalize.

Related GitHub Issues:

Fixes #2725. Also satisfies the distinct-static-index case of the inverse extract/insert TODO in TestFlatCircuits.mlir; the insert-chain case (#2726) remains open.

Extend ExtractOp::canonicalize so an extract whose index is a
statically-distinct constant from the index of the quantum.insert
defining its register reads from the register feeding the insert.
This removes the false data dependency between wires that act on
different qubits of the same register.

The rewrite only fires while the bypassed insert keeps a non-extract
user, so it is never left dead. A dead insert (NoMemoryEffect) would
otherwise be removed by DCE along with the gate feeding it.

Add lit coverage for the rewrite and for the leaf-insert guard, and
update the canonicalization, QRef semantic-conversion, and
decompose-lowering golden tests that run --canonicalize.

Fixes PennyLaneAI#2725
@github-actions github-actions Bot added the external PRs where the author is not a part of PennyLane Org (or part of external contributors team) label Jun 18, 2026
@dime10 dime10 requested a review from paul0403 June 18, 2026 18:25
@dime10

dime10 commented Jun 18, 2026

Copy link
Copy Markdown
Contributor

Thanks @rylanmalarchick! We're currently resolving some CI issue and will get back to you afterwards.

@paul0403 paul0403 left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on this!

Comment thread mlir/lib/Quantum/IR/QuantumOps.cpp
Comment thread mlir/lib/Quantum/IR/QuantumOps.cpp Outdated
Comment on lines +174 to +175
bool insertOnRegisterChain = llvm::any_of(
insert.getResult().getUsers(), [](Operation *user) { return !isa<ExtractOp>(user); });

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be clearer to change the condition to be, perform the rewrite if the insert op has any users that is not the current extract op? Because the point is to prevent DCE eliminating the insert op if the extract op was the only user before the rewrite right?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried that exact condition (rewrite when the insert has any user other than this extract) and it brings back the gate drop this guard prevents: DecomposeLoweringTest fails and an RX disappears, with the custom-op count going from 43 to 42. The reason is that "a user other than this extract" is also true when the only other user is another extract, which the greedy driver then redirects away as well; the insert ends up with no users and DCE removes it along with the gate it was feeding. Requiring a non-extract user keeps the insert on the register chain (a downstream insert or dealloc), so it can't be stranded. I kept that condition and renamed it to insertHasNonExtractUser to make the intent clear.

Comment thread mlir/test/QRef/SemanticConversion/TestFlatCircuits.mlir
Comment thread mlir/test/Quantum/CanonicalizationTest.mlir Outdated
Comment thread mlir/test/Quantum/CanonicalizationTest.mlir
Comment on lines +752 to +754
// CHECK: [[q2_pre:%.+]] = quantum.extract [[reg]][ 2] : !quantum.reg -> !quantum.bit
// CHECK: [[insert_2:%.+]] = quantum.insert [[H_insert]][ 2], [[q2_pre]] : !quantum.reg, !quantum.bit
// CHECK: [[full_insert:%.+]] = quantum.insert [[insert_2]][ 1], [[q1]] : !quantum.reg, !quantum.bit

@paul0403 paul0403 Jun 19, 2026

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Trying to understand what was adding these new ops. Can you explain a little bit? Because the new pattern should only be changing an operand on an extract op right?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right that the pattern only swaps the extract's qreg operand; it doesn't add ops. The extra extract/insert come from that swap turning off a downstream fold. In the @my_cnot decomposition, qubit 2 gets re-inserted into %3 at index 2. InsertOp::canonicalize only folds an insert of an extracted qubit when the extract and insert share the register (its sameQreg check). Before this change qubit 2 was extract %3[2], so insert %3[2], extract(%3[2]) hit that case and folded away. After the redirect qubit 2 is extract %0[2], so the extract reads %0 while the insert targets %3, sameQreg is false, and the pair stays. It's semantically identical, since the only insert into %3 is at index 0, so %0[2] and %3[2] are the same qubit, and the gate counts match baseline. So it trades the false dependency for two register ops that previously cancelled.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that this is semantically equivalent, but it looks slightly awkward. If I understood correctly, this is a case where the pattern ordering is resulting in a different canonicalized form because if say the insert fold was triggered first, it would be folded away right? If that's the case, maybe it is worth having a similar mirrored pattern for the insertOp as well. That could ensure that no matter what is the order of applied canonicalization patterns, we would get the same result.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah that is the cause of it. The leftover pair only survives because 'InsertOp::canonicalize' folds an inserted-extracted qubit only when the extract and insert name the same register value (its 'sameQreg' check), and the distinct-index rewrite makes the extract read the earlier register while the insert still targets the later one. So the result ends up depending on which pattern the greedy driven happens to apply first.

A mirrored 'InsertOp' pattern is the correct fix in this case. Generalize that 'sameQreg' check from "same register" to "same register up to inserts at distinct static indicies" so 'insert(reg2, extract(reg1, i), i) folds when 'reg1' and 'reg2' agree at index 'i'. That makes the canonical form order-independent, and as a bonus it should (!) fold this awkward extract/insert pair away instead of leaving it, which should also remove the +2-ops drawback noted above.

One coupling I should flag is that this presupposes the rewrite stays in 'canonicalize'. (my open question from my reply last night regardging keeping in 'canonicalize' as the delay-insert form versus moving it to a pass after 'apply-transform-sequence') ; If it is kept in 'canonicalize' I would add the mirrored pattern with its own tests; if it moves to a dedicated pass the confluence concern mostly goes away. Either can be done

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah that makes sense. I think If making the existing passes' (including the parity-synth pass) assumptions inline with this canonical form then keeping it in the same pass is ideal, but I am not super familiar with the parity-synth implementation to know if it is trivial or not. I'll let @dime10 comment on that.

@paul0403

Copy link
Copy Markdown
Member

The CI issues have been resolved. Can you merge from main again? Then we will run the CI and see the full impacts of this change in canonicalization 👍

Rename the distinct-index guard from insertOnRegisterChain to
insertHasNonExtractUser to make its intent explicit; behavior is
unchanged. Switch the distinct-static-fold test to the [[...]] FileCheck
capture style, and expand both new canonicalization tests to check the
full extract/insert dataflow.
@rylanmalarchick

Copy link
Copy Markdown
Author

Pushed the review updates and merged main; should be ready for CI now. Thanks both for the careful look.

@rylanmalarchick

Copy link
Copy Markdown
Author

Just wanted to nudge this. Merged main on the 19th as @paul0403 noted, so should be ready for CI now. No rush just wanted to make sure this was noted.

@mehrdad2m

Copy link
Copy Markdown
Contributor

Hi @rylanmalarchick, Thank you so much for working on this issue, and apologies on the delay in the review process. We'll try to complete the review as soon as possible. In the meanwhile, we ran the CI and it seems like there are a few minor failures.
Cheers!

@rylanmalarchick

Copy link
Copy Markdown
Author

@mehrdad2m no worries, thanks! I looked into those 2 frontend failures locally. One is a pretty clear bug with a one line fix and the other is a more subtle interaction that you guys will probably want to steer me on.

  1. SIGABRT in convert-to-value-semantics (test_dev_wires_have_pytree, both backends)

The distinct-index rewrite reassigns the extract's qreg operand but leaves the extract positioned after the insert. So it reads a register version a later insert has already superseded. This then trips IRMapping::lookup ("expected 'from' to be contained within the map") during value-semantics threading. Fix would be to also move the extract ahead of the insert:

rewriter.modifyOpInPlace(extract, [&] {
extract.getQregMutable().assign(insert.getInQreg());
extract->moveBefore(insert);
});

I verified locally and crash is gone with output numerically identical to baseline (maxdiff 0.0).

  1. Parity-synth emits more CNOTs (test_xdsl_parity_synth::test_qjit_with_dynamic_allocation)

Structural thing, not just SSA renaming; the canonicalization runs before apply-transform-sequence, and hoisting the extract across the insert drops a quantum.insert mid-region. ParitySynthPattern treats InsertOp as a block boundary, so it synthesizes two smaller parity networks instead of one with 7 CNOTs vs the expected 4. I was able to confirm by toggling the pattern (off → 4, on → 7); numerics were correct either way. A local guard cannot fix this I think, as a canonicalizer can't see downstream block fragmentation, and the pattern fires in every canonicalize run.

Two paths I see here:

  • Move the hoisting into a dedicated pass after apply-transform-sequence. Would keep the optimization, stops it clobbering parity-synth.
  • Keep it in canonicalize and update the CHECK lines. This would be more simple, but it ships the CNOT regression.

Or other options if you have better ideas than me :)

@dime10

dime10 commented Jun 27, 2026

Copy link
Copy Markdown
Contributor

Hi @rylanmalarchick, thanks for checking out those issues so quickly.

  1. SIGABRT in convert-to-value-semantics (test_dev_wires_have_pytree, both backends)

Good analysis! It looks like we are running into some hidden assumptions about the structure of inserts/extracts. The question boils down to whether these operations are pure (that is they may appear in any order as long as SSA dominance is satisfied), or whether their positions are relevant as well. Technically, they are declared side-effect free as seen here:

def ExtractOp : Memory_Op<"extract", [NoMemoryEffect]> {

I'd like to take a closer look before okaying the proposed solution. Swapping the extract and insert does solve their ordering, but it still leaves gates acting on the extracted qubit after the insert has "modified" the register. It might be fine (it should be if these operations were truly pure), but worth double checking.
The safest solution would be to follow the approach outlined in the issue:

  %1 = quantum.extract %0[ 0] : !quantum.reg -> !quantum.bit
  %out_qubits = quantum.custom "PauliX"() %1 : !quantum.bit

  %3 = quantum.extract %0[ 1] : !quantum.reg -> !quantum.bit
  %out_qubits_0 = quantum.custom "PauliZ"() %3 : !quantum.bit

  %2 = quantum.insert %0[ 0], %out_qubits : !quantum.reg, !quantum.bit
  %4 = quantum.insert %2[ 1], %out_qubits_0 : !quantum.reg, !quantum.bit

We can see it's actually the insert that is moved down (delayed) as far as possible, which leads to a nice structure.

@rylanmalarchick

Copy link
Copy Markdown
Author

Thanks @dime10, I think that was the right cal here. I took the two issues above as separate, and rebuilt to test. They are less coupled than I implied above.

For SIGABRT: since extract/insert are NoMemoryEffect, the rewrite is value-correct regardless of position; extract %0[b] reads the same qubit whether it sits before or after insert %0[a] at a distinct index, because the register is a functional SSA value rather than mutable storage. So I think the crash wasn't a semantic error, it was a stale-position assumption in convert-to-value-semantics (IRMapping::lookup expecting the extract to read the still-current register version).

So the instinct to delay the insert downward rather than hoist the extract is the cleaner fix, because it satisfies that assumption and lands the nice extracts-up / inserts-down structure from the issue. I did a rewrite that reads the pre-insert register and then sinks the bypassed insert to just above the earliest user of its result, so inserts cluster after the gate run instead of being stranded mid-circuit.

I also isolated the cause on parity-synth. I think it is worth laying out because it bears on where this rewrite should live. ParitySynthPattern walks ops in textual order and flushes the accumulated phase-polynomial run whenever it hits an InsertOp, so the synthesized network (and its CNOT count) depends on exactly where inserts sit relative to the gates. I measured the checked circuit CNOTs with 3 separate ways. Baseline (no rewrite) is 8; the current operand-redirect is 11 (+3); the delay-insert form is 9 (+1). So the delay-insert structure you suggested cuts the regression substantially and fixes the crash, and 54/55 of the parity-synth tests pass with it.

I wanted input on the remaining +1. I think I had assumed a more contiguous gate run would always synthesize to fewer CNOTs, but that's not the case here. It seems that the baseline's particular insert placement happens to split the run at a point that synthesizes better than the fully-sunk form does. Looks like the synthesis isn't monotone in how the gates are grouped, so any reshape that runs before parity-synth carries some small, circuit-dependent delta. It seems the delay-insert form minimizes it but can't reliably drive it to zero. The only way I see to guarantee no parity-synth change is to not reshape the IR before it runs; i.e. move the distinct-index rewrite out of canonicalize into a pass scheduled after apply-transform-sequence. That keeps the #2725 false-dependency fix but trades away its benefit for any analysis (like parity-synth) that runs inside the transform sequence and actually wants the threaded form.

So this comes down a policy call (that I think you would rather take); keep it in canonicalize as the delay-insert form (fixes the crash, +1 on this one synthetic test, value-correct, and I would update the goldens), or move it to a dedicated post-transform pass for an exact-baseline parity-synth at the cost of the early-analysis benefit. I have the delay-insert version ready to push. But can take other direction if needed as well just lmk

@mehrdad2m mehrdad2m left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks again @rylanmalarchick for the great work, I added two comments.


LogicalResult ExtractOp::canonicalize(ExtractOp extract, mlir::PatternRewriter &rewriter)
{
// Handle the pattern: %reg2 = insert %reg1[idx], %qubit -> %q = extract %reg2[idx]

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we update this comment as well to reflect the new pattern?

Comment on lines +752 to +754
// CHECK: [[q2_pre:%.+]] = quantum.extract [[reg]][ 2] : !quantum.reg -> !quantum.bit
// CHECK: [[insert_2:%.+]] = quantum.insert [[H_insert]][ 2], [[q2_pre]] : !quantum.reg, !quantum.bit
// CHECK: [[full_insert:%.+]] = quantum.insert [[insert_2]][ 1], [[q1]] : !quantum.reg, !quantum.bit

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that this is semantically equivalent, but it looks slightly awkward. If I understood correctly, this is a case where the pattern ordering is resulting in a different canonicalized form because if say the insert fold was triggered first, it would be folded away right? If that's the case, maybe it is worth having a similar mirrored pattern for the insertOp as well. That could ensure that no matter what is the order of applied canonicalization patterns, we would get the same result.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

external PRs where the author is not a part of PennyLane Org (or part of external contributors team)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Canonicalization does not recognize extracts from different indices on the same qreg

4 participants