Canonicalize extract past distinct-index insert#2965
Conversation
Extend ExtractOp::canonicalize so an extract whose index is a statically-distinct constant from the index of the quantum.insert defining its register reads from the register feeding the insert. This removes the false data dependency between wires that act on different qubits of the same register. The rewrite only fires while the bypassed insert keeps a non-extract user, so it is never left dead. A dead insert (NoMemoryEffect) would otherwise be removed by DCE along with the gate feeding it. Add lit coverage for the rewrite and for the leaf-insert guard, and update the canonicalization, QRef semantic-conversion, and decompose-lowering golden tests that run --canonicalize. Fixes PennyLaneAI#2725
|
Thanks @rylanmalarchick! We're currently resolving some CI issue and will get back to you afterwards. |
| bool insertOnRegisterChain = llvm::any_of( | ||
| insert.getResult().getUsers(), [](Operation *user) { return !isa<ExtractOp>(user); }); |
There was a problem hiding this comment.
Would it be clearer to change the condition to be, perform the rewrite if the insert op has any users that is not the current extract op? Because the point is to prevent DCE eliminating the insert op if the extract op was the only user before the rewrite right?
There was a problem hiding this comment.
I tried that exact condition (rewrite when the insert has any user other than this extract) and it brings back the gate drop this guard prevents: DecomposeLoweringTest fails and an RX disappears, with the custom-op count going from 43 to 42. The reason is that "a user other than this extract" is also true when the only other user is another extract, which the greedy driver then redirects away as well; the insert ends up with no users and DCE removes it along with the gate it was feeding. Requiring a non-extract user keeps the insert on the register chain (a downstream insert or dealloc), so it can't be stranded. I kept that condition and renamed it to insertHasNonExtractUser to make the intent clear.
| // CHECK: [[q2_pre:%.+]] = quantum.extract [[reg]][ 2] : !quantum.reg -> !quantum.bit | ||
| // CHECK: [[insert_2:%.+]] = quantum.insert [[H_insert]][ 2], [[q2_pre]] : !quantum.reg, !quantum.bit | ||
| // CHECK: [[full_insert:%.+]] = quantum.insert [[insert_2]][ 1], [[q1]] : !quantum.reg, !quantum.bit |
There was a problem hiding this comment.
Trying to understand what was adding these new ops. Can you explain a little bit? Because the new pattern should only be changing an operand on an extract op right?
There was a problem hiding this comment.
You're right that the pattern only swaps the extract's qreg operand; it doesn't add ops. The extra extract/insert come from that swap turning off a downstream fold. In the @my_cnot decomposition, qubit 2 gets re-inserted into %3 at index 2. InsertOp::canonicalize only folds an insert of an extracted qubit when the extract and insert share the register (its sameQreg check). Before this change qubit 2 was extract %3[2], so insert %3[2], extract(%3[2]) hit that case and folded away. After the redirect qubit 2 is extract %0[2], so the extract reads %0 while the insert targets %3, sameQreg is false, and the pair stays. It's semantically identical, since the only insert into %3 is at index 0, so %0[2] and %3[2] are the same qubit, and the gate counts match baseline. So it trades the false dependency for two register ops that previously cancelled.
There was a problem hiding this comment.
I agree that this is semantically equivalent, but it looks slightly awkward. If I understood correctly, this is a case where the pattern ordering is resulting in a different canonicalized form because if say the insert fold was triggered first, it would be folded away right? If that's the case, maybe it is worth having a similar mirrored pattern for the insertOp as well. That could ensure that no matter what is the order of applied canonicalization patterns, we would get the same result.
There was a problem hiding this comment.
Yeah that is the cause of it. The leftover pair only survives because 'InsertOp::canonicalize' folds an inserted-extracted qubit only when the extract and insert name the same register value (its 'sameQreg' check), and the distinct-index rewrite makes the extract read the earlier register while the insert still targets the later one. So the result ends up depending on which pattern the greedy driven happens to apply first.
A mirrored 'InsertOp' pattern is the correct fix in this case. Generalize that 'sameQreg' check from "same register" to "same register up to inserts at distinct static indicies" so 'insert(reg2, extract(reg1, i), i) folds when 'reg1' and 'reg2' agree at index 'i'. That makes the canonical form order-independent, and as a bonus it should (!) fold this awkward extract/insert pair away instead of leaving it, which should also remove the +2-ops drawback noted above.
One coupling I should flag is that this presupposes the rewrite stays in 'canonicalize'. (my open question from my reply last night regardging keeping in 'canonicalize' as the delay-insert form versus moving it to a pass after 'apply-transform-sequence') ; If it is kept in 'canonicalize' I would add the mirrored pattern with its own tests; if it moves to a dedicated pass the confluence concern mostly goes away. Either can be done
There was a problem hiding this comment.
Yeah that makes sense. I think If making the existing passes' (including the parity-synth pass) assumptions inline with this canonical form then keeping it in the same pass is ideal, but I am not super familiar with the parity-synth implementation to know if it is trivial or not. I'll let @dime10 comment on that.
|
The CI issues have been resolved. Can you merge from main again? Then we will run the CI and see the full impacts of this change in canonicalization 👍 |
Rename the distinct-index guard from insertOnRegisterChain to insertHasNonExtractUser to make its intent explicit; behavior is unchanged. Switch the distinct-static-fold test to the [[...]] FileCheck capture style, and expand both new canonicalization tests to check the full extract/insert dataflow.
|
Pushed the review updates and merged |
|
Just wanted to nudge this. Merged main on the 19th as @paul0403 noted, so should be ready for CI now. No rush just wanted to make sure this was noted. |
|
Hi @rylanmalarchick, Thank you so much for working on this issue, and apologies on the delay in the review process. We'll try to complete the review as soon as possible. In the meanwhile, we ran the CI and it seems like there are a few minor failures. |
|
@mehrdad2m no worries, thanks! I looked into those 2 frontend failures locally. One is a pretty clear bug with a one line fix and the other is a more subtle interaction that you guys will probably want to steer me on.
The distinct-index rewrite reassigns the extract's qreg operand but leaves the extract positioned after the insert. So it reads a register version a later insert has already superseded. This then trips IRMapping::lookup ("expected 'from' to be contained within the map") during value-semantics threading. Fix would be to also move the extract ahead of the insert: rewriter.modifyOpInPlace(extract, [&] { I verified locally and crash is gone with output numerically identical to baseline (maxdiff 0.0).
Structural thing, not just SSA renaming; the canonicalization runs before apply-transform-sequence, and hoisting the extract across the insert drops a quantum.insert mid-region. ParitySynthPattern treats InsertOp as a block boundary, so it synthesizes two smaller parity networks instead of one with 7 CNOTs vs the expected 4. I was able to confirm by toggling the pattern (off → 4, on → 7); numerics were correct either way. A local guard cannot fix this I think, as a canonicalizer can't see downstream block fragmentation, and the pattern fires in every canonicalize run. Two paths I see here:
Or other options if you have better ideas than me :) |
|
Hi @rylanmalarchick, thanks for checking out those issues so quickly.
Good analysis! It looks like we are running into some hidden assumptions about the structure of inserts/extracts. The question boils down to whether these operations are pure (that is they may appear in any order as long as SSA dominance is satisfied), or whether their positions are relevant as well. Technically, they are declared side-effect free as seen here: I'd like to take a closer look before okaying the proposed solution. Swapping the extract and insert does solve their ordering, but it still leaves gates acting on the extracted qubit after the insert has "modified" the register. It might be fine (it should be if these operations were truly pure), but worth double checking. %1 = quantum.extract %0[ 0] : !quantum.reg -> !quantum.bit
%out_qubits = quantum.custom "PauliX"() %1 : !quantum.bit
%3 = quantum.extract %0[ 1] : !quantum.reg -> !quantum.bit
%out_qubits_0 = quantum.custom "PauliZ"() %3 : !quantum.bit
%2 = quantum.insert %0[ 0], %out_qubits : !quantum.reg, !quantum.bit
%4 = quantum.insert %2[ 1], %out_qubits_0 : !quantum.reg, !quantum.bitWe can see it's actually the insert that is moved down (delayed) as far as possible, which leads to a nice structure. |
|
Thanks @dime10, I think that was the right cal here. I took the two issues above as separate, and rebuilt to test. They are less coupled than I implied above. For SIGABRT: since So the instinct to delay the insert downward rather than hoist the extract is the cleaner fix, because it satisfies that assumption and lands the nice extracts-up / inserts-down structure from the issue. I did a rewrite that reads the pre-insert register and then sinks the bypassed insert to just above the earliest user of its result, so inserts cluster after the gate run instead of being stranded mid-circuit. I also isolated the cause on parity-synth. I think it is worth laying out because it bears on where this rewrite should live. I wanted input on the remaining +1. I think I had assumed a more contiguous gate run would always synthesize to fewer CNOTs, but that's not the case here. It seems that the baseline's particular insert placement happens to split the run at a point that synthesizes better than the fully-sunk form does. Looks like the synthesis isn't monotone in how the gates are grouped, so any reshape that runs before parity-synth carries some small, circuit-dependent delta. It seems the delay-insert form minimizes it but can't reliably drive it to zero. The only way I see to guarantee no parity-synth change is to not reshape the IR before it runs; i.e. move the distinct-index rewrite out of So this comes down a policy call (that I think you would rather take); keep it in |
mehrdad2m
left a comment
There was a problem hiding this comment.
Thanks again @rylanmalarchick for the great work, I added two comments.
|
|
||
| LogicalResult ExtractOp::canonicalize(ExtractOp extract, mlir::PatternRewriter &rewriter) | ||
| { | ||
| // Handle the pattern: %reg2 = insert %reg1[idx], %qubit -> %q = extract %reg2[idx] |
There was a problem hiding this comment.
Could we update this comment as well to reflect the new pattern?
| // CHECK: [[q2_pre:%.+]] = quantum.extract [[reg]][ 2] : !quantum.reg -> !quantum.bit | ||
| // CHECK: [[insert_2:%.+]] = quantum.insert [[H_insert]][ 2], [[q2_pre]] : !quantum.reg, !quantum.bit | ||
| // CHECK: [[full_insert:%.+]] = quantum.insert [[insert_2]][ 1], [[q1]] : !quantum.reg, !quantum.bit |
There was a problem hiding this comment.
I agree that this is semantically equivalent, but it looks slightly awkward. If I understood correctly, this is a case where the pattern ordering is resulting in a different canonicalized form because if say the insert fold was triggered first, it would be folded away right? If that's the case, maybe it is worth having a similar mirrored pattern for the insertOp as well. That could ensure that no matter what is the order of applied canonicalization patterns, we would get the same result.
Context:
quantum.extract/quantum.insertthread qubits through an SSA register value. When a circuit extracts qubitbafter an insert at a distinct constant indexa, the extract reads the post-insert register even though that insert never touched qubitb. That is a false data dependency between wires acting on different qubits of the same register (issue #2725).Description of the Change:
Extend
ExtractOp::canonicalizeto handle%q = extract (insert %reg[a], v)[b]whereaandbare distinct static constants: rewrite the extract to read%reg[b]directly, since the insert at indexaleaves qubitbunchanged.The rewrite only fires while the bypassed insert still has a non-extract user, so it can never be left dead. A dead insert (
NoMemoryEffect) would be removed by DCE along with the gate feeding it, which would silently drop a gate. This was observed in--decompose-loweringduring development, and the guard prevents it.Verified with
make test-mlir(check-dialects): 147/147 pass.Benefits:
Breaks the false wire dependency the issue describes, exposing independent qubits to downstream analyses. In most cases it also simplifies the IR by collapsing inverse extract/insert round-trips: the
QRef/SemanticConversiongoldens shrink, and the long-standing TODO inTestFlatCircuits.mlirabout distinct-index inverse pairs is now satisfied.Possible Drawbacks:
In one
--decompose-loweringtest module (@different_qreg_values) the rewrite adds two register ops (an extract and an insert) while preserving all gate counts. It is value-correct, but it grows the IR there rather than shrinking it. Goldens inQRef/SemanticConversionandDecomposeLoweringTestchange because those tests run--canonicalize.Related GitHub Issues:
Fixes #2725. Also satisfies the distinct-static-index case of the inverse extract/insert TODO in
TestFlatCircuits.mlir; the insert-chain case (#2726) remains open.