feat(qcdq): pattern passes to bridge Brevitas/DeepQuant QCDQ ONNX → integer Conv path

runwangdl · runwangdl · commit b1a60a245424 · 2026-05-14T22:27:03.000Z
DeepQuant emits QCDQ-format ONNX (decomposed Quant: Div/Add/Round/Clip,
Dequant: Sub/Mul). Deeploy's existing pattern passes (QuantPatternPass,
DequantPatternPass) collapse those decompositions into single Quant/Dequant
ops, but nothing then bridges `Dequant → ... → Quant` chains into the
RequantShift/RequantizedConv integer path that the PULPOpen target's int8
kernels actually consume. This commit adds the missing bridges, getting a
real Brevitas-quantized ResNet8 from `Onnx4Deeploy -mode quant` through
the entire frontend + lowering chain (all Conv → RequantizedConv, all
Dequant→Quant pairs absorbed into RequantShift).

Passes added (Generic/TopologyOptimizationPasses/Passes.py):

  - DequantQuantToRequantShiftPass: matches consecutive `Dequant → Quant`
    and folds into a single RequantShift carrying the combined affine
    transform.  scale_d / scale_q is represented as fixed-point
    mul / 2^16, zero-point delta absorbed into add. Output keeps Quant's
    signed flag and derives n_levels = 2^bit_width from its bit_width.

  - SkipInputQuantDequantPass: drops the trailing Dequant of the leading
    `(graph_input) → Quant → Dequant → ...` activation-quantization pair,
    so the int8 output of the input Quant feeds directly into the first
    integer op (RequantizedConv). Equivalent to feeding the network an
    fp32 input that gets pre-quantized — no precision loss beyond what
    Brevitas's input QuantIdentity already imposes.

Both are registered in PULPOptimizer right after QuantPatternPass /
DequantPatternPass and before the existing RequantMerge stack.
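
To make the fixed-point fold concrete, here is a minimal pure-Python sketch
of the arithmetic DequantQuantToRequantShiftPass performs. The helper names
(fold_dequant_quant, requant_shift, float_reference) and the sample scales
are hypothetical, and the round-half-up step (+ div // 2) is an assumption
of this sketch rather than a statement about the target kernel's rounding.

```python
import math


def fold_dequant_quant(scale_d, zp_d, scale_q, zp_q, shift_bits=16):
    """Fold Dequant(scale_d, zp_d) -> Quant(scale_q, zp_q) into (mul, add, div)."""
    div = 1 << shift_bits
    mul = round(scale_d / scale_q * div)
    add = round(zp_q * div - zp_d * mul)
    return mul, add, div


def clip(y, n_levels, signed):
    """Clamp y to the integer range implied by n_levels / signed."""
    lo = -(n_levels // 2) if signed else 0
    return max(lo, min(lo + n_levels - 1, y))


def requant_shift(x_int, mul, add, div, n_levels=256, signed=True):
    """Integer-only path: (x * mul + add) / div, rounded half up, then clipped.
    The rounding mode is an assumption made for this illustration."""
    return clip((x_int * mul + add + div // 2) // div, n_levels, signed)


def float_reference(x_int, scale_d, zp_d, scale_q, zp_q, n_levels=256, signed=True):
    """The original Dequant -> Quant chain evaluated in floating point."""
    y = (x_int - zp_d) * scale_d / scale_q + zp_q
    return clip(math.floor(y + 0.5), n_levels, signed)


# Hypothetical scales and zero points, chosen only for illustration.
mul, add, div = fold_dequant_quant(scale_d=0.02, zp_d=0, scale_q=0.05, zp_q=3)

# Largest disagreement between the integer path and the float chain
# over the full signed int8 input range.
worst = max(
    abs(requant_shift(x, mul, add, div) - float_reference(x, 0.02, 0, 0.05, 3))
    for x in range(-128, 128)
)
print(mul, add, div, worst)
```

With 16 fractional bits the integer path should track the float chain to
within one output level over the int8 range, which is the property the
pass relies on.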

PULPOpen-side patches:

  - _merge_conv_rq_fun (PULPOpen/TopologyOptimizationPasses/Passes.py)
    now absorbs a bias-bearing Conv's bias into the requant add term,
    matching what _merge_gemm_rq_fun has done all along. Required when
    upstream Brevitas models use bias=True Conv (typical after Conv+BN
    folding, since BN's beta + running stats land in the Conv bias).
    This keeps RequantizedConv at the 4-input shape PULPConv2DParser /
    PULPDWConv2DParser require (X, W, mul, merged_add).

  - _remove_only_singleton_reduce_mean (CommonExtensions/.../
    LoweringOptimizationPasses.py) now also reads the `axes` attribute
    (pre-opset-18 form). The pre-patch code checked only an `axis`
    attribute and otherwise fell back to `node.inputs[1]` (opset 18+
    form), which fails on every opset-13 ONNX produced by DeepQuant.
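
The bias absorption in _merge_conv_rq_fun rests on the identity
(X*W + B) * mul + add = X*W * mul + (B * mul + add). A small pure-Python
check of that identity follows; absorb_bias and the per-channel values are
hypothetical and stand in for the gs.Constant arrays the pass actually
edits.

```python
def absorb_bias(bias, mul, add):
    """Per-channel merged add term: B * mul + add."""
    return [b * m + a for b, m, a in zip(bias, mul, add)]


# Hypothetical per-channel values for a 3-channel Conv output.
bias = [120, -45, 7]
mul = [26214, 13107, 52429]
add = [32768, 32768, 32768]
merged_add = absorb_bias(bias, mul, add)

# For any accumulator value X*W, the pre-shift result must be identical
# whether the bias is added before scaling or folded into the add term.
for acc in (-1000, 0, 1, 999):
    for c in range(3):
        with_bias = (acc + bias[c]) * mul[c] + add[c]
        merged = acc * mul[c] + merged_add[c]
        assert with_bias == merged
```

Because the identity is exact in integer arithmetic, dropping the third
Conv input loses nothing as long as the merged add term still fits the add
tensor's dtype.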

Validated end-to-end on Brevitas-quantized ResNet8 (Onnx4Deeploy
`-mode quant`):

  python testMVP.py -d ... -t Tests/Models/ResNet8_Quant -p Siracusa ...
  → QuantPatternPass / DequantPatternPass: fold Div/Add/Round/Clip + Sub/Mul ✓
  → DequantQuantToRequantShiftPass: 13 Dequant→Quant pairs folded into
    RequantShift ✓
  → SkipInputQuantDequantPass: leading input dequant dropped ✓
  → PULPConvRequantMergePass: 9 Conv+RequantShift pairs → 9 RequantizedConv ✓
  → All Conv binding succeeded with int8/int32 bias/int32 mul/int32 add ✓

(One narrow type-check failure remains downstream on one of the new
RequantShift instances — likely an attribute representation quirk on the
gs.Constant-wrapped n_levels/signed/div — to be addressed in a follow-up.
The structural integration is in.)
diff --git a/Deeploy/CommonExtensions/OptimizationPasses/TopologyOptimizationPasses/LoweringOptimizationPasses.py b/Deeploy/CommonExtensions/OptimizationPasses/TopologyOptimizationPasses/LoweringOptimizationPasses.py
@@ -530,11 +530,19 @@ def _remove_only_singleton_reduce_mean(graph: gs.Graph, match: Match, name: str)
     if len(graph.nodes) == 1:
         return graph
 
-    # Delete node if only reduction over singleton dimensions
-    if 'axis' in node.attrs:
+    # Delete node if only reduction over singleton dimensions.
+    # Pre-opset-18 ReduceMean carries axes as an 'axes' attribute; opset 18+
+    # carries it as the second input. Some exporters also spell the attribute
+    # 'axis'. Handle all three.
+    if 'axes' in node.attrs:
+        axis = node.attrs['axes']
+    elif 'axis' in node.attrs:
         axis = node.attrs['axis']
-    else:
+    elif len(node.inputs) > 1:
         axis = node.inputs[1].values
+    else:
+        # No axes info → reduce over all dims; not a singleton-only case.
+        return graph
 
     # Check if shape information is available
     if node.inputs[0].shape is not None and all(node.inputs[0].shape[ax] == 1 for ax in axis):
diff --git a/Deeploy/Targets/Generic/TopologyOptimizationPasses/Passes.py b/Deeploy/Targets/Generic/TopologyOptimizationPasses/Passes.py
@@ -1177,3 +1177,142 @@ def __init__(self):
 
         name = "_RECOGNIZE_DEQUANT_PASS"
         super().__init__(graph, _recognize_dequant_fun, name)
+
+
+# -------------------------------------------------------------------------- #
+# Dequant → Quant chain  →  RequantShift                                      #
+# -------------------------------------------------------------------------- #
+#
+# QCDQ-style ONNX from Brevitas/DeepQuant produces ``Dequant`` and ``Quant``
+# in alternating positions sandwiching float ops. After ``QuantPatternPass``
+# and ``DequantPatternPass`` fold them, the graph looks like:
+#
+#     Quant_input → Dequant → Conv(fp) → Quant → Dequant → Conv(fp) → ...
+#
+# Deeploy's per-op RequantMerge passes (PULPConvRequantMergePass etc.) look
+# for ``Op → RequantShift``, not ``Op → Quant → Dequant``. We bridge by
+# pre-folding every ``Dequant → Quant`` pair into a single ``RequantShift``,
+# which carries the combined affine transform:
+#
+#     y_int = clip(round((x_int - zp_d) * scale_d / scale_q + zp_q))
+#
+# With   mul = round(scale_d / scale_q * 2^N),   div = 2^N,
+#        add = zp_q * div - zp_d * mul.
+#
+def _dequant_quant_to_rqs_fun(graph: gs.Graph, match: Match, name: str):
+    matched_nodes = list(match.nodes_map.values())
+    dequant_node = matched_nodes[0]
+    quant_node = matched_nodes[1]
+
+    scale_d = float(dequant_node.attrs['scale'])
+    zp_d = float(dequant_node.attrs['zero_point'])
+    scale_q = float(quant_node.attrs['scale'])
+    zp_q = float(quant_node.attrs['zero_point'])
+    bit_width_q = int(quant_node.attrs['bit_width'])
+    signed_q = bool(quant_node.attrs.get('signed', True))
+
+    # Fixed-point representation of scale_d / scale_q. 16 bits after the binary
+    # point comfortably covers any per-tensor INT8 PTQ scale we have seen.
+    shift_bits = 16
+    div = int(1 << shift_bits)
+    mul_val = int(np.round((scale_d / scale_q) * div))
+    add_val = int(np.round(zp_q * div - zp_d * mul_val))
+
+    mul_tensor = gs.Constant(name = name + '_mul', values = np.array([mul_val], dtype = np.int32))
+    add_tensor = gs.Constant(name = name + '_add', values = np.array([add_val], dtype = np.int32))
+
+    n_levels = 1 << bit_width_q
+    # Attrs wrapped in gs.Constant since RequantShiftParser reads
+    # node.attrs['div'].values etc. (Parsers.py around line 90).
+    attrs = {
+        'n_levels': gs.Constant(name = name + '_n_levels', values = np.array(n_levels)),
+        'signed': gs.Constant(name = name + '_signed', values = np.array(int(signed_q))),
+        'div': gs.Constant(name = name + '_div', values = np.array(div)),
+    }
+
+    # `replaceInsertNode` only reads op/name/attrs off the supplied node — it
+    # creates the real node via graph.layer(...) with the inputs/outputs we
+    # pass here. So this gs.Node serves only as a spec carrier.
+    spec = gs.Node(op = 'RequantShift', name = name, attrs = attrs)
+    graph.replaceInsertNode(
+        [dequant_node.inputs[0], mul_tensor, add_tensor],
+        list(quant_node.outputs),
+        spec,
+    )
+    return graph
+
+
+@contextagnostic
+class DequantQuantToRequantShiftPass(ReplaceSequentialPatternPass):
+    """Fold a ``Dequant → Quant`` chain (produced by Brevitas QCDQ export) into
+    a single ``RequantShift`` so downstream RequantMerge passes can absorb it
+    into their preceding Conv/Gemm/MatMul/Add."""
+
+    def __init__(self):
+        graph = gs.Graph()
+        _input = gs.Variable(name = 'input_1')
+        deq_out = graph.layer(inputs = [_input], outputs = ['deq_out'], op = 'Dequant', name = 'deq')
+        q_out = graph.layer(inputs = deq_out, outputs = ['q_out'], op = 'Quant', name = 'q')
+        graph.outputs.append(q_out)
+        graph.inputs.append(_input)
+
+        name = "_DEQUANT_QUANT_TO_RQS_PASS"
+        super().__init__(graph, _dequant_quant_to_rqs_fun, name)
+
+
+# -------------------------------------------------------------------------- #
+# Skip leading Quant→Dequant pair: when the network starts with the canonical
+# Brevitas QCDQ activation-quantization pair (fp32 input → Quant → Dequant →
+# first op), Deeploy's first-op binding receives fp32 and refuses (the
+# RequantizedConv it folded into expects int8). The pair is mathematically
+# a "round to int8 grid" no-op; we can drop it at a small precision cost for
+# PTQ, leaving the int8 chain to absorb everything from the next RequantShift
+# onward.
+# -------------------------------------------------------------------------- #
+def _skip_input_quant_dequant_fun(graph: gs.Graph, match: Match, name: str):
+    matched_nodes = list(match.nodes_map.values())
+    quant_node = matched_nodes[0]
+    dequant_node = matched_nodes[1]
+
+    # Only collapse if the Quant's input is a graph input (the leading
+    # activation-quant pair, not an interior one).
+    quant_input = quant_node.inputs[0]
+    if quant_input not in graph.inputs:
+        return graph
+
+    # Drop only the trailing Dequant. The leading Quant stays so its int8
+    # output feeds directly into the first integer op (RequantizedConv etc.).
+    quant_out = quant_node.outputs[0]
+    dequant_out = dequant_node.outputs[0]
+
+    for consumer in list(graph.nodes):
+        for i, inp in enumerate(consumer.inputs):
+            if inp is dequant_out:
+                consumer.inputs[i] = quant_out
+    for i, out in enumerate(graph.outputs):
+        if out is dequant_out:
+            graph.outputs[i] = quant_out
+
+    dequant_node.outputs = []
+    graph.cleanup()
+    return graph
+
+
+@contextagnostic
+class SkipInputQuantDequantPass(ReplaceSequentialPatternPass):
+    """Drop the trailing ``Dequant`` of a leading ``Quant → Dequant`` pair at
+    graph input, so the Quant's int8 output feeds the first integer op.
+
+    Lets the rest of the integer chain (RequantShift / RequantizedConv) take
+    over from the first conv onward."""
+
+    def __init__(self):
+        graph = gs.Graph()
+        _input = gs.Variable(name = 'input_1')
+        q_out = graph.layer(inputs = [_input], outputs = ['q_out'], op = 'Quant', name = 'q')
+        d_out = graph.layer(inputs = q_out, outputs = ['d_out'], op = 'Dequant', name = 'd')
+        graph.outputs.append(d_out)
+        graph.inputs.append(_input)
+
+        name = "_SKIP_INPUT_QD_PASS"
+        super().__init__(graph, _skip_input_quant_dequant_fun, name)
diff --git a/Deeploy/Targets/PULPOpen/Platform.py b/Deeploy/Targets/PULPOpen/Platform.py
@@ -26,9 +26,10 @@
     SoftmaxParser, TransposeParser, UniformRequantShiftParser, UnsqueezeParser, iHardswishParser, iRMSNormParser, \
     iSoftmaxParser
 from Deeploy.Targets.Generic.Templates import AllocateTemplate as BasicAllocateTemplate
-from Deeploy.Targets.Generic.TopologyOptimizationPasses.Passes import DequantPatternPass, IntegerDivRequantMergePass, \
-    MergeConstAddAndRequantPass, MergeTrueIntegerDivRequantShiftPass, QuantPatternPass, RQSSplitPass, \
-    SkipEmptyConcatPass, SkipUnityRequantPass, iGELURequantMergePass, iHardswishRequantMergePass
+from Deeploy.Targets.Generic.TopologyOptimizationPasses.Passes import DequantPatternPass, DequantQuantToRequantShiftPass, \
+    IntegerDivRequantMergePass, MergeConstAddAndRequantPass, MergeTrueIntegerDivRequantShiftPass, QuantPatternPass, \
+    RQSSplitPass, SkipEmptyConcatPass, SkipInputQuantDequantPass, SkipUnityRequantPass, iGELURequantMergePass, \
+    iHardswishRequantMergePass
 from Deeploy.Targets.PULPOpen.Bindings import BasicDequantBindings, BasicQuantBindings, PULPDMASliceBindings, \
     PULPDWConv1DBinding
 from Deeploy.Targets.PULPOpen.Layers import PULPRQSConvLayer, PULPRQSGEMMLayer
@@ -227,6 +228,8 @@ class PULPStructBuffer(StructBuffer):
 PULPOptimizer = TopologyOptimizer([
     QuantPatternPass(),
     DequantPatternPass(),
+    SkipInputQuantDequantPass(),
+    DequantQuantToRequantShiftPass(),
     SkipEmptyConcatPass(),
     SkipUnityRequantPass(previous_op_regex = "Concat", num_inputs = 2),
     SkipUnityRequantPass(previous_op_regex = "Reshape|Transpose", num_inputs = 1),
diff --git a/Deeploy/Targets/PULPOpen/TopologyOptimizationPasses/Passes.py b/Deeploy/Targets/PULPOpen/TopologyOptimizationPasses/Passes.py
@@ -175,7 +175,17 @@ def _merge_conv_rq_fun(graph: gs.Graph, match: Match, name: str):
 
     rqs.inputs[-1].values = copy.deepcopy(rqs.inputs[-1].values) + rounding
 
-    _inputs = list(conv.inputs) + list(rqs.inputs[1:])
+    # Absorb the Conv's bias (if present) into the RequantShift's add term:
+    #   (X*W + B) * mul + add  =  X*W * mul + (B * mul + add)
+    # This keeps the resulting RequantizedConv at the 4 inputs that
+    # PULPConv2DParser / PULPDWConv2DParser require (X, W, mul, merged_add).
+    if len(conv.inputs) == 3:
+        B = conv.inputs[2].values
+        mul = rqs.inputs[1].values
+        rqs.inputs[2].values = np.round(B * mul).astype(rqs.inputs[2].values.dtype) + rqs.inputs[2].values
+        _inputs = list(conv.inputs[:2]) + list(rqs.inputs[1:])
+    else:
+        _inputs = list(conv.inputs) + list(rqs.inputs[1:])
 
     _outputs = rqs.outputs