
[Deeploy PR] NE16 Linear Layer Kernels #184

Closed
pauloohaha wants to merge 1 commit into pulp-platform:devel from pauloohaha:fix/NE16Linear

Conversation

@pauloohaha
Contributor


Added

  • Add NE16 linear layer kernels, including a topology pass, NE16 templates, parsers, tile constraints, and bindings
  • The topology pass recognizes NE16-compatible GEMM layers, adjusts the weight layout for the NE16, and converts the requant shift/scale to the NE16 format
  • The template detects whether the input is signed; if so, it adds a +128 offset to the input at runtime in the generated C code and compensates via the bias
  • Add GAP9 SDK-based Dequant/Quant templates using CNN_Copy.c kernels, replacing the generic templates
  • Add a generic DequantQuantMergePass that folds adjacent Dequant→Quant pairs into identity or RequantShift
  • Add a GAP9-specific TopologyOptimizer (GAP9Optimizer) to replace PULPOptimizer
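The signed-input handling above amounts to an offline bias correction. A minimal numpy sketch, assuming the helper name and (out_ch, in_ch) weight layout are illustrative and not Deeploy's actual API:

```python
import numpy as np

# Hypothetical helper; names and shapes are illustrative, not Deeploy's API.
def compensate_bias_for_uint8_input(weights, bias):
    # NE16 consumes uint8 activations, so a signed int8 input x is shifted
    # to u = x + 128 at runtime. Since W @ u = W @ x + 128 * rowsum(W),
    # subtracting 128 * rowsum(W) from the bias restores W @ x + b.
    # weights: (out_ch, in_ch), bias: (out_ch,)
    return bias.astype(np.int64) - 128 * weights.astype(np.int64).sum(axis=1)

# Equivalence check on random int8-range data
rng = np.random.default_rng(0)
W = rng.integers(-128, 128, size=(8, 16))
b = rng.integers(-1000, 1000, size=(8,))
x = rng.integers(-128, 128, size=(16,))
assert np.array_equal(W @ x + b, W @ (x + 128) + compensate_bias_for_uint8_input(W, b))
```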

Changed

Fixed

  • Add output signedness check in QuantChecker
  • Fix L3 DMA template (add proper casts) and remove the blocking L3 DMA hack
  • Isolate dory memory functions from other libraries in CMakeLists so they compile with -Og while compute kernels compile with -O3
  • Disable PULPAddRequantMergePass due to incorrect pattern matching when Add has multiple consumers

PR Merge Checklist

  1. The PR is rebased on the latest devel commit and pointing to devel.
  2. Your PR has been reviewed and approved.
  3. All checks are passing.
  4. The CHANGELOG.md file has been updated.
  5. If the Docker image was modified, change its link back after review.

@coderabbitai
Contributor

coderabbitai Bot commented Apr 14, 2026

Caution: Review failed (the pull request was closed or merged during review).

📝 Walkthrough

Summary by CodeRabbit

  • New Features

    • Added NE16 GEMM acceleration support for quantized and integer matrix operations on GAP9.
    • Introduced GAP9-specific SDK templates for improved quantization and dequantization operations.
    • Added NE16 int8-to-uint8 conversion utility kernel.
  • Bug Fixes

    • Fixed L3 DMA 2D transfer implementation with proper type casting for parameters.
  • Chores

    • Updated build configuration to include NE16 and CNN kernel libraries.
    • Enhanced platform optimizer with new topology optimization passes for NE16 GEMM weight layout adjustment.

Walkthrough

This PR introduces NE16 (Neural Engine 16) backend support for GEMM operations on GAP9, replaces the L3 DMA blocking adapter pattern with direct instantiation, adds GAP9 SDK-specific quantization templates, and updates the build system to include NE16 kernel sources with a separate dory memory/DMA library.

Changes

  • L3 DMA Refactoring (Deeploy/Targets/GAP9/DMA/L3Dma.py, Deeploy/Targets/GAP9/Bindings.py): Replaced the gap9L3DmaHack blocking adapter export with direct GAP9L3Dma() instantiation in transformers; added type casts (uint32_t, void *) in the L3 2D transfer template call.
  • NE16 GEMM Template & Constraints (Deeploy/Targets/GAP9/Templates/NE16GEMMTemplate.py, Deeploy/Targets/GAP9/TileConstraints/NE16GEMMTileConstraint.py, Deeploy/Targets/GAP9/Parsers.py): Implemented a new NE16 GEMM template with weight bitplane packing, input bias compensation, and two template variants (8-bit and int32 output); added an NE16 GEMM tile constraint for geometrical/policy constraints and tiling-solution serialization; created NE16GEMMParser for RequantizedGemm node parsing with 5-input validation.
  • NE16 GEMM Bindings & Tiling (Deeploy/Targets/GAP9/Bindings.py, Deeploy/Targets/GAP9/Tiler.py): Added GAP9NE16RQSGEMMBindings and GAP9NE16GEMMInt32Bindings for the NE16 backend; created GAP9NE16RQSGEMMTilingReadyBindings and GAP9NE16GEMMInt32TilingReadyBindings with NE16GEMMTileConstraint.
  • Quantization/Dequantization Templates (Deeploy/Targets/GAP9/Templates/GAP9SDKDequantQuantTemplate.py, Deeploy/Targets/Generic/TypeCheckers.py): Added eight GAP9 SDK templates for quant/dequant operations (fp16/fp32↔int8/uint8) using the corresponding CNN kernels; added QuantChecker.checkOutputType() to validate signedness compatibility.
  • Quantization/Dequantization Bindings (Deeploy/Targets/GAP9/Bindings.py, Deeploy/Targets/GAP9/Tiler.py): Updated GAP9QuantBindings and GAP9DequantBindings to use the GAP9 SDK templates; created QuantTilingReadyBindings and DeQuantTilingReadyBindings with UnaryTileConstraint.
  • Platform Integration & Optimizer (Deeploy/Targets/GAP9/Platform.py, DeeployTest/testUtils/platformMapping.py): Introduced GAP9Optimizer with topology/pattern/merge/split passes; updated quantization/GEMM layer mappers to use the new bindings and parsers; swapped the default loweringOptimizer from PULPOptimizer to GAP9Optimizer for GAP9 platforms.
  • NE16 Topology Optimization (Deeploy/Targets/GAP9/TopologyOptimizationPasses/Passes.py): Added NE16AdjustGEMMWeightLayoutPass to transpose weights when transB == 0, compute per-channel NE16 scales, and rescale bias/mul tensors for GEMM/RequantizedGemm nodes.
  • Dequant-Quant Merge Optimization (Deeploy/Targets/Generic/TopologyOptimizationPasses/Passes.py): Added DequantQuantMergePass to fuse consecutive Dequant→Quant operations into identity or RequantShift paths based on scale/zero-point compatibility.
  • NE16 Utility Kernels & Build System (TargetLibraries/GAP9/inc/ne16_utils.h, TargetLibraries/GAP9/src/ne16_utils.c, TargetLibraries/GAP9/CMakeLists.txt): Implemented the ne16_int8_to_uint8() multi-core SIMD kernel with a +128 offset; expanded CMake to glob NE16/CNN autotiler sources, created a separate dory_lib static library for memory/DMA code, and extended deeploygap9 includes/flags/links.
  • Build Configuration (DeeployTest/Platforms/GAP9/CMakeLists.txt, Deeploy/Targets/GAP9/TopologyOptimizationPasses/__init__.py): Added the -O3 optimization flag to the GAP9 test network target; added an SPDX license header to the optimization passes module.
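The transB handling described for NE16AdjustGEMMWeightLayoutPass can be sketched as follows. This is a hedged sketch with a hypothetical function name; the NE16 bitplane packing and per-channel scale conversion that the pass also performs are omitted:

```python
import numpy as np

# Hypothetical sketch of the transB handling: when transB == 0 the GEMM
# computes A @ B, but the accelerator expects weights laid out with output
# channels as rows, so B is transposed offline. The actual pass also
# rescales bias/mul tensors, which this sketch does not cover.
def adjust_weight_layout(B, transB):
    return B if transB else np.ascontiguousarray(B.T)

A = np.arange(6).reshape(2, 3)
B = np.arange(12).reshape(3, 4)
W = adjust_weight_layout(B, transB=0)   # shape (4, 3): out_ch x in_ch
assert np.array_equal(A @ W.T, A @ B)   # GEMM semantics preserved
```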

Sequence Diagram(s)

sequenceDiagram
    participant Client as Client
    participant Parser as NE16GEMMParser
    participant Tiler as NE16GEMMTiler
    participant Template as NE16GEMMTemplate
    participant Kernel as NE16 SDK Kernel

    Client->>Parser: Parse RequantizedGemm node (A, B, C, mul, scale_n)
    Parser->>Parser: Validate 5 inputs & shift attribute
    Parser-->>Client: Return parse success with context mapping

    Client->>Tiler: Apply NE16GEMMTileConstraint
    Tiler->>Tiler: Add geometrical constraints (M, O, N dimensions)
    Tiler->>Tiler: Add policy constraints (N untiled, O divisible by 32)
    Tiler-->>Client: Return tiling solution with cubes

    Client->>Template: Align to context with operatorRepresentation
    Template->>Template: Derive signedness from type metadata
    Template->>Template: Compute weight layout (bitplane-pack with +128 offset)
    Template->>Template: Apply input bias compensation (128 * w_sum)
    Template->>Template: Rescale per-channel scales
    Template-->>Client: Return updated context & schedule

    Client->>Kernel: Execute generated code
    Kernel->>Kernel: Convert int8 inputs to uint8 (+128 offset)
    Kernel->>Kernel: Perform NE16 1x1 GEMM
    Kernel->>Kernel: Apply requantization
    Kernel-->>Client: Return output
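The int8-to-uint8 conversion step in the sequence above has a simple reference model. The real ne16_int8_to_uint8() is a multi-core SIMD C kernel; this numpy version only illustrates the arithmetic:

```python
import numpy as np

# Reference model for the int8 -> uint8 conversion step; the production
# kernel is multi-core SIMD C, this only checks the value mapping.
def int8_to_uint8(x):
    # Adding +128 maps the int8 range [-128, 127] onto uint8 [0, 255]
    return (x.astype(np.int16) + 128).astype(np.uint8)

x = np.arange(-128, 128, dtype=np.int8)
y = int8_to_uint8(x)
assert y[0] == 0 and y[-1] == 255          # endpoints map correctly
assert np.array_equal(np.sort(y), y)       # ordering is preserved
```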
sequenceDiagram
    participant Graph as Computation Graph
    participant Pass as DequantQuantMergePass
    participant Merger as Scale/ZeroPoint Analyzer
    participant Optimizer as Requantization Builder

    Graph->>Pass: Match Dequant→Quant pattern
    Pass->>Merger: Compute effective scaling ratio
    Merger->>Merger: Extract Dequant scale/zero_point
    Merger->>Merger: Extract Quant scale/zero_point
    Merger-->>Pass: Return scaling ratio & zero_point deltas

    alt Ratio ≈ 1.0 & zero_points ≈ 0 & signed==true
        Pass->>Graph: Rewire Quant consumers to Dequant input
        Pass->>Graph: Remove Dequant & Quant nodes
        Pass-->>Graph: Identity path (no intermediate compute)
    else Fallback to requantization
        Pass->>Optimizer: Compute integer mul/add from scales
        Optimizer->>Optimizer: Use right shift 2**16 for quantization
        Optimizer-->>Pass: Return mul/add constants
        Pass->>Graph: Insert RequantShift node
        Pass-->>Graph: Requantized path
    end
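The mul/add computation in the fallback branch above can be sketched in fixed-point arithmetic. This is a hedged sketch assuming a round-to-nearest (x * mul + add) >> 16 scheme consistent with the 2**16 shift the diagram mentions; Deeploy's exact RequantShift semantics may differ:

```python
import numpy as np

# Hedged sketch of folding Dequant -> Quant into one integer requant step.
def fold_dequant_quant(s_deq, z_deq, s_q, z_q, shift=16):
    ratio = s_deq / s_q                       # effective rescale factor
    mul = int(round(ratio * (1 << shift)))    # fixed-point multiplier
    # Fold both zero-points and round-to-nearest into the additive term
    add = -z_deq * mul + (z_q << shift) + (1 << (shift - 1))
    return mul, add

def requant(x, mul, add, shift=16):
    return (x.astype(np.int64) * mul + add) >> shift

# Identity case: same scale, zero offsets -> values pass through
x = np.array([-7, 0, 42, 100], dtype=np.int32)
mul, add = fold_dequant_quant(s_deq=0.05, z_deq=0, s_q=0.05, z_q=0)
assert np.array_equal(requant(x, mul, add), x)
```

When the scale ratio is close to 1.0 and the zero-point deltas vanish (the identity branch in the diagram), the fold degenerates to a plain rewiring and no RequantShift node is needed at all.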

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

  • PR #143: Continuation of GAP9 platform enhancements—directly builds upon prior work with same modules (L3Dma, Bindings, Platform) and replaces blocking adapter patterns with explicit instantiation.
  • PR #114: Related L3 DMA implementation changes—both modify blocking adapter usage and export patterns for asynchronous DMA integration.
  • PR #105: Related L3 DMA adapter refactoring—both address removal/replacement of module-level blocking adapter instances in favor of explicit instantiation patterns.

Suggested labels

Feature

Suggested reviewers

  • Xeratec
  • Victor-Jung
  • runwangdl
🚥 Pre-merge checks | ✅ 2 passed | ❌ 1 failed

❌ Failed checks (1 warning)

  • Docstring Coverage (⚠️ Warning): Docstring coverage is 14.29%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

  • Title check (✅ Passed): The title '[Deeploy PR] NE16 Linear Layer Kernels' directly and clearly describes the main addition: NE16 linear layer kernel support, which is the primary focus across all changed files.
  • Description check (✅ Passed): The description is comprehensive and directly related to the changeset, covering added features (NE16 kernels, templates, passes), changed aspects, and fixes. It clearly explains the intent and scope of the PR.

