
[Deeploy PR] NE16 Linear Layer Kernels #184

Closed
pauloohaha wants to merge 1 commit into pulp-platform:devel from pauloohaha:fix/NE16Linear

Conversation

@pauloohaha
Contributor


Added

  • Add NE16 linear layer kernels, including a topology pass, NE16 templates, parsers, tile constraints, and bindings
  • The topology pass recognizes NE16-compatible GEMM layers, adjusts the weight layout for the NE16, and converts the requant shift/scale to the NE16 format
  • The template detects whether the input is signed; if so, it adds a +128 offset to the input at runtime in the generated C code and compensates via the bias
  • Add GAP9 SDK-based Dequant/Quant templates using CNN_Copy.c kernels, replacing the generic templates
  • Add a generic DequantQuantMergePass that folds adjacent Dequant→Quant pairs into identity or RequantShift
  • Add a GAP9-specific TopologyOptimizer (GAP9Optimizer) to replace PULPOptimizer
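The signed-input handling above amounts to an offline bias correction. A minimal numpy sketch, assuming the helper name and (out_ch, in_ch) weight layout are illustrative and not Deeploy's actual API:

```python
import numpy as np

# Hypothetical helper; names and shapes are illustrative, not Deeploy's API.
def compensate_bias_for_uint8_input(weights, bias):
    # NE16 consumes uint8 activations, so a signed int8 input x is shifted
    # to u = x + 128 at runtime. Since W @ u = W @ x + 128 * rowsum(W),
    # subtracting 128 * rowsum(W) from the bias restores W @ x + b.
    # weights: (out_ch, in_ch), bias: (out_ch,)
    return bias.astype(np.int64) - 128 * weights.astype(np.int64).sum(axis=1)

# Equivalence check on random int8-range data
rng = np.random.default_rng(0)
W = rng.integers(-128, 128, size=(8, 16))
b = rng.integers(-1000, 1000, size=(8,))
x = rng.integers(-128, 128, size=(16,))
assert np.array_equal(W @ x + b, W @ (x + 128) + compensate_bias_for_uint8_input(W, b))
```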

Changed

Fixed

  • Add output signedness check in QuantChecker
  • Fix L3 DMA template (add proper casts) and remove the blocking L3 DMA hack
  • Isolate dory memory functions from other libraries in CMakeLists so they compile with -Og while compute kernels compile with -O3
  • Disable PULPAddRequantMergePass due to incorrect pattern matching when Add has multiple consumers

PR Merge Checklist

  1. The PR is rebased on the latest devel commit and pointing to devel.
  2. Your PR has been reviewed and approved.
  3. All checks are passing.
  4. The CHANGELOG.md file has been updated.
  5. If the Docker image was modified, change its link back after review.

@coderabbitai
Contributor

coderabbitai Bot commented Apr 14, 2026

Caution: Review failed (the pull request was closed or merged during review).

📝 Walkthrough

Summary by CodeRabbit

  • New Features

    • Added NE16 GEMM acceleration support for quantized and integer matrix operations on GAP9.
    • Introduced GAP9-specific SDK templates for improved quantization and dequantization operations.
    • Added NE16 int8-to-uint8 conversion utility kernel.
  • Bug Fixes

    • Fixed L3 DMA 2D transfer implementation with proper type casting for parameters.
  • Chores

    • Updated build configuration to include NE16 and CNN kernel libraries.
    • Enhanced platform optimizer with new topology optimization passes for NE16 GEMM weight layout adjustment.

Walkthrough

This PR introduces NE16 (Neural Engine 16) backend support for GEMM operations on GAP9, replaces the L3 DMA blocking adapter pattern with direct instantiation, adds GAP9 SDK-specific quantization templates, and updates the build system to include NE16 kernel sources with a separate dory memory/DMA library.

Changes

  • L3 DMA Refactoring (Deeploy/Targets/GAP9/DMA/L3Dma.py, Deeploy/Targets/GAP9/Bindings.py): Replaced the gap9L3DmaHack blocking adapter export with direct GAP9L3Dma() instantiation in transformers; added type casts (uint32_t, void *) in the L3 2D transfer template call.
  • NE16 GEMM Template & Constraints (Deeploy/Targets/GAP9/Templates/NE16GEMMTemplate.py, Deeploy/Targets/GAP9/TileConstraints/NE16GEMMTileConstraint.py, Deeploy/Targets/GAP9/Parsers.py): Implemented a new NE16 GEMM template with weight bitplane packing, input bias compensation, and two template variants (8-bit and int32 output); added an NE16 GEMM tile constraint for geometrical/policy constraints and tiling-solution serialization; created NE16GEMMParser for RequantizedGemm node parsing with 5-input validation.
  • NE16 GEMM Bindings & Tiling (Deeploy/Targets/GAP9/Bindings.py, Deeploy/Targets/GAP9/Tiler.py): Added GAP9NE16RQSGEMMBindings and GAP9NE16GEMMInt32Bindings for the NE16 backend; created GAP9NE16RQSGEMMTilingReadyBindings and GAP9NE16GEMMInt32TilingReadyBindings with NE16GEMMTileConstraint.
  • Quantization/Dequantization Templates (Deeploy/Targets/GAP9/Templates/GAP9SDKDequantQuantTemplate.py, Deeploy/Targets/Generic/TypeCheckers.py): Added eight GAP9 SDK templates for quant/dequant operations (fp16/fp32↔int8/uint8) using the corresponding CNN kernels; added QuantChecker.checkOutputType() to validate signedness compatibility.
  • Quantization/Dequantization Bindings (Deeploy/Targets/GAP9/Bindings.py, Deeploy/Targets/GAP9/Tiler.py): Updated GAP9QuantBindings and GAP9DequantBindings to use the GAP9 SDK templates; created QuantTilingReadyBindings and DeQuantTilingReadyBindings with UnaryTileConstraint.
  • Platform Integration & Optimizer (Deeploy/Targets/GAP9/Platform.py, DeeployTest/testUtils/platformMapping.py): Introduced GAP9Optimizer with topology/pattern/merge/split passes; updated quantization/GEMM layer mappers to use the new bindings and parsers; swapped the default loweringOptimizer from PULPOptimizer to GAP9Optimizer for GAP9 platforms.
  • NE16 Topology Optimization (Deeploy/Targets/GAP9/TopologyOptimizationPasses/Passes.py): Added NE16AdjustGEMMWeightLayoutPass to transpose weights when transB == 0, compute per-channel NE16 scales, and rescale bias/mul tensors for GEMM/RequantizedGemm nodes.
  • Dequant-Quant Merge Optimization (Deeploy/Targets/Generic/TopologyOptimizationPasses/Passes.py): Added DequantQuantMergePass to fuse consecutive Dequant→Quant operations into identity or RequantShift paths based on scale/zero-point compatibility.
  • NE16 Utility Kernels & Build System (TargetLibraries/GAP9/inc/ne16_utils.h, TargetLibraries/GAP9/src/ne16_utils.c, TargetLibraries/GAP9/CMakeLists.txt): Implemented the ne16_int8_to_uint8() multi-core SIMD kernel with a +128 offset; expanded CMake to glob NE16/CNN autotiler sources, created a separate dory_lib static library for memory/DMA code, and extended deeploygap9 includes/flags/links.
  • Build Configuration (DeeployTest/Platforms/GAP9/CMakeLists.txt, Deeploy/Targets/GAP9/TopologyOptimizationPasses/__init__.py): Added the -O3 optimization flag to the GAP9 test network target; added an SPDX license header to the optimization passes module.
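The transB handling described for NE16AdjustGEMMWeightLayoutPass can be sketched as follows. This is a hedged sketch with a hypothetical function name; the NE16 bitplane packing and per-channel scale conversion that the pass also performs are omitted:

```python
import numpy as np

# Hypothetical sketch of the transB handling: when transB == 0 the GEMM
# computes A @ B, but the accelerator expects weights laid out with output
# channels as rows, so B is transposed offline. The actual pass also
# rescales bias/mul tensors, which this sketch does not cover.
def adjust_weight_layout(B, transB):
    return B if transB else np.ascontiguousarray(B.T)

A = np.arange(6).reshape(2, 3)
B = np.arange(12).reshape(3, 4)
W = adjust_weight_layout(B, transB=0)   # shape (4, 3): out_ch x in_ch
assert np.array_equal(A @ W.T, A @ B)   # GEMM semantics preserved
```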

Sequence Diagram(s)

sequenceDiagram
    participant Client as Client
    participant Parser as NE16GEMMParser
    participant Tiler as NE16GEMMTiler
    participant Template as NE16GEMMTemplate
    participant Kernel as NE16 SDK Kernel

    Client->>Parser: Parse RequantizedGemm node (A, B, C, mul, scale_n)
    Parser->>Parser: Validate 5 inputs & shift attribute
    Parser-->>Client: Return parse success with context mapping

    Client->>Tiler: Apply NE16GEMMTileConstraint
    Tiler->>Tiler: Add geometrical constraints (M, O, N dimensions)
    Tiler->>Tiler: Add policy constraints (N untiled, O divisible by 32)
    Tiler-->>Client: Return tiling solution with cubes

    Client->>Template: Align to context with operatorRepresentation
    Template->>Template: Derive signedness from type metadata
    Template->>Template: Compute weight layout (bitplane-pack with +128 offset)
    Template->>Template: Apply input bias compensation (128 * w_sum)
    Template->>Template: Rescale per-channel scales
    Template-->>Client: Return updated context & schedule

    Client->>Kernel: Execute generated code
    Kernel->>Kernel: Convert int8 inputs to uint8 (+128 offset)
    Kernel->>Kernel: Perform NE16 1x1 GEMM
    Kernel->>Kernel: Apply requantization
    Kernel-->>Client: Return output
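The int8-to-uint8 conversion step in the sequence above has a simple reference model. The real ne16_int8_to_uint8() is a multi-core SIMD C kernel; this numpy version only illustrates the arithmetic:

```python
import numpy as np

# Reference model for the int8 -> uint8 conversion step; the production
# kernel is multi-core SIMD C, this only checks the value mapping.
def int8_to_uint8(x):
    # Adding +128 maps the int8 range [-128, 127] onto uint8 [0, 255]
    return (x.astype(np.int16) + 128).astype(np.uint8)

x = np.arange(-128, 128, dtype=np.int8)
y = int8_to_uint8(x)
assert y[0] == 0 and y[-1] == 255          # endpoints map correctly
assert np.array_equal(np.sort(y), y)       # ordering is preserved
```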
sequenceDiagram
    participant Graph as Computation Graph
    participant Pass as DequantQuantMergePass
    participant Merger as Scale/ZeroPoint Analyzer
    participant Optimizer as Requantization Builder

    Graph->>Pass: Match Dequant→Quant pattern
    Pass->>Merger: Compute effective scaling ratio
    Merger->>Merger: Extract Dequant scale/zero_point
    Merger->>Merger: Extract Quant scale/zero_point
    Merger-->>Pass: Return scaling ratio & zero_point deltas

    alt Ratio ≈ 1.0 & zero_points ≈ 0 & signed==true
        Pass->>Graph: Rewire Quant consumers to Dequant input
        Pass->>Graph: Remove Dequant & Quant nodes
        Pass-->>Graph: Identity path (no intermediate compute)
    else Fallback to requantization
        Pass->>Optimizer: Compute integer mul/add from scales
        Optimizer->>Optimizer: Use right shift 2**16 for quantization
        Optimizer-->>Pass: Return mul/add constants
        Pass->>Graph: Insert RequantShift node
        Pass-->>Graph: Requantized path
    end
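The mul/add computation in the fallback branch above can be sketched in fixed-point arithmetic. This is a hedged sketch assuming a round-to-nearest (x * mul + add) >> 16 scheme consistent with the 2**16 shift the diagram mentions; Deeploy's exact RequantShift semantics may differ:

```python
import numpy as np

# Hedged sketch of folding Dequant -> Quant into one integer requant step.
def fold_dequant_quant(s_deq, z_deq, s_q, z_q, shift=16):
    ratio = s_deq / s_q                       # effective rescale factor
    mul = int(round(ratio * (1 << shift)))    # fixed-point multiplier
    # Fold both zero-points and round-to-nearest into the additive term
    add = -z_deq * mul + (z_q << shift) + (1 << (shift - 1))
    return mul, add

def requant(x, mul, add, shift=16):
    return (x.astype(np.int64) * mul + add) >> shift

# Identity case: same scale, zero offsets -> values pass through
x = np.array([-7, 0, 42, 100], dtype=np.int32)
mul, add = fold_dequant_quant(s_deq=0.05, z_deq=0, s_q=0.05, z_q=0)
assert np.array_equal(requant(x, mul, add), x)
```

When the scale ratio is close to 1.0 and the zero-point deltas vanish (the identity branch in the diagram), the fold degenerates to a plain rewiring and no RequantShift node is needed at all.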

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

  • PR #143: Continuation of GAP9 platform enhancements—directly builds upon prior work with same modules (L3Dma, Bindings, Platform) and replaces blocking adapter patterns with explicit instantiation.
  • PR #114: Related L3 DMA implementation changes—both modify blocking adapter usage and export patterns for asynchronous DMA integration.
  • PR #105: Related L3 DMA adapter refactoring—both address removal/replacement of module-level blocking adapter instances in favor of explicit instantiation patterns.

Suggested labels

Feature

Suggested reviewers

  • Xeratec
  • Victor-Jung
  • runwangdl
🚥 Pre-merge checks | ✅ 2 passed | ❌ 1 failed

❌ Failed checks (1 warning)

  • Docstring Coverage (⚠️ Warning): Docstring coverage is 14.29%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

  • Title check (✅ Passed): The title '[Deeploy PR] NE16 Linear Layer Kernels' directly and clearly describes the main addition: NE16 linear layer kernel support, which is the primary focus across all changed files.
  • Description check (✅ Passed): The description is comprehensive and directly related to the changeset, covering added features (NE16 kernels, templates, passes), changed aspects, and fixes. It clearly explains the intent and scope of the PR.

