This document gives guidelines for designing neural network models that run efficiently on the TI NPU (Neural Processing Unit) accelerator found in devices such as the F28P55 and F28P65.
- Overview
- Supported Layer Types
- Terminology and Notation
- Layer Configuration Constraints
- Optimal Design Patterns
- Common Pitfalls to Avoid
- Model Design Checklist
The TI NPU accelerator provides hardware acceleration for common neural network operations. However, to fully leverage the NPU, models must conform to specific layer configurations. Layers that don't meet these constraints will fall back to software execution, significantly reducing performance.
Key Principle: Design models with NPU constraints in mind from the start, rather than trying to adapt existing models.
| Layer Type | NPU Name | Description |
|---|---|---|
| First Convolution | FCONV | Convolution with input channel = 1 |
| Generic Convolution | GCONV | Standard convolution with input channels as multiple of 4 |
| Depth-Wise Convolution | DWCONV | Convolution where groups = input channels |
| Point-Wise Convolution | PWCONV | 1x1 convolution for channel mixing |
| Point-Wise Conv + Residual | PWCONVRES | 1x1 convolution with residual addition |
| Transposed Convolution | TCONV | Upsampling convolution |
| Fully-Connected | FC | Dense/Linear layer |
| Average Pooling | AVGPOOL | Global and non-global average pooling |
| Max Pooling | MAXPOOL | Maximum pooling |
| Symbol | Meaning | Example |
|---|---|---|
| iB | Input bit-width | 8 (8-bit quantized) |
| oB | Output bit-width | 8 |
| kB | Kernel/weight bit-width | 2, 4, or 8 |
| iH | Input height | Sequence length for 1D |
| iW | Input width | 1 for 1D time series |
| iC | Input channels | Number of input features |
| oH | Output height | After convolution/pooling |
| oW | Output width | After convolution/pooling |
| oC | Output channels | Number of output features |
| kH | Kernel height | Convolution kernel size |
| kW | Kernel width | Convolution kernel size |
| sH | Stride height | Vertical stride |
| sW | Stride width | Horizontal stride |
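The output dimensions oH and oW follow from the input size, kernel size, and stride. The sketch below assumes the standard convolution arithmetic used by PyTorch's `Conv2d` with dilation 1. The `padding` argument and the name `conv_output_size` are illustrative additions that do not appear in the table above.

```python
def conv_output_size(in_size: int, kernel: int, stride: int = 1, padding: int = 0) -> int:
    """Output length along one axis (oH from iH/kH/sH, or oW from iW/kW/sW)."""
    if in_size + 2 * padding < kernel:
        raise ValueError("kernel is larger than the padded input")
    # Standard formula: floor((in + 2*pad - kernel) / stride) + 1
    return (in_size + 2 * padding - kernel) // stride + 1


# 1D time series of length 128 (iH=128, iW=1) through a (5, 1) kernel, stride 1
oH = conv_output_size(128, kernel=5)
oW = conv_output_size(1, kernel=1)
print(oH, oW)
```

The same function applies to pooling layers, since they use the same sliding-window arithmetic.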
| Notation | Meaning | Examples |
|---|---|---|
| any | Any positive integer | 1, 2, 3, ... |
| m4 | Multiples of 4 | 4, 8, 12, 16, 20, ... |
| m8 | Multiples of 8 | 8, 16, 24, 32, ... |
| m1b2e7 | Range from 2 to 7 | 2, 3, 4, 5, 6, 7 |
| m1b8 | Minimum 8, any value | 8, 9, 10, 11, ... |
| m1b16 | Minimum 16, any value | 16, 17, 18, ... |
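The notation tokens can be read as "multiple of N, beginning at B, ending at E". The sketch below infers that grammar only from the examples in the table above, so treat it as an assumption rather than the compiler's definition. The function name `matches` is hypothetical.

```python
import re

# m<step>, optionally followed by b<begin> and e<end>, as inferred from the table
_PATTERN = re.compile(r"^m(\d+)(?:b(\d+))?(?:e(\d+))?$")


def matches(notation: str, value: int) -> bool:
    """Return True if value satisfies a token such as 'any', 'm4', or 'm1b2e7'."""
    if value < 1:
        return False
    if notation == "any":
        return True
    m = _PATTERN.match(notation)
    if m is None:
        raise ValueError(f"unrecognized notation: {notation!r}")
    step = int(m.group(1))
    if step == 0:
        raise ValueError("step must be positive")
    begin = int(m.group(2)) if m.group(2) else step   # 'm4' starts at 4
    end = int(m.group(3)) if m.group(3) else None     # no 'e' part means unbounded
    if value < begin or (end is not None and value > end):
        return False
    return value % step == 0


for token, value in [("m4", 12), ("m4", 6), ("m1b2e7", 7), ("m1b8", 7)]:
    print(token, value, matches(token, value))
```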
FCONV (First Convolution): use when the input has exactly 1 channel (e.g., single-variable time series).
| Parameter | Constraint | Notes |
|---|---|---|
| Input Channels (iC) | 1 | Fixed - this defines FCONV |
| Output Channels (oC) | m4 | Must be 4, 8, 12, 16, ... |
| Kernel Height (kH) | any | Flexible |
| Kernel Width (kW) | 1-8 | Maximum 8 for 1D convolutions |
| Kernel Bit-width (kB) | 2, 4, or 8 | 8-bit most common |
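The FCONV rows above can be turned into a small pre-flight check. This is a minimal sketch that encodes only the constraints listed in the table; the function name `check_fconv` is hypothetical, and it does not replace the compiler's own validation.

```python
from typing import List


def check_fconv(iC: int, oC: int, kH: int, kW: int, kB: int = 8) -> List[str]:
    """Return a list of constraint violations; an empty list means no issue found."""
    problems = []
    if iC != 1:
        problems.append(f"iC={iC}: FCONV requires exactly 1 input channel")
    if oC < 4 or oC % 4 != 0:
        problems.append(f"oC={oC}: output channels must be a multiple of 4")
    if kH < 1:
        problems.append(f"kH={kH}: kernel height must be positive")
    if not 1 <= kW <= 8:
        problems.append(f"kW={kW}: kernel width must be in 1-8")
    if kB not in (2, 4, 8):
        problems.append(f"kB={kB}: weight bit-width must be 2, 4, or 8")
    return problems


print(check_fconv(iC=1, oC=8, kH=5, kW=1))   # no violations expected
print(check_fconv(iC=1, oC=6, kH=5, kW=1))   # oC is not a multiple of 4
```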
Example (PyTorch):
# Good: FCONV with iC=1, oC=8
Conv2d(in_channels=1, out_channels=8, kernel_size=(5, 1))
# Bad: oC=6 is not m4
Conv2d(in_channels=1, out_channels=6, kernel_size=(5, 1))
GCONV (Generic Convolution): use for intermediate convolution layers where input channels > 1.
| Parameter | Constraint | Notes |
|---|---|---|
| Input Channels (iC) | m4 | Will be padded to m4 if not |
| Output Channels (oC) | m4 | Must be 4, 8, 12, 16, ... |
| Kernel Height (kH) | any (if kW=1) or specific (if kW>1) | Can be flexible with certain kW values |
| Kernel Width (kW) | any (if kH=1) or specific (if kH>1) | Can be flexible with certain kH values |
| Kernel Bit-width (kB) | 2, 4, or 8 | 8-bit most common |
Rule: For 1D convolutions (kW=1), kH can be any value (unlimited). Similar flexibility applies for other configurations where one dimension is constrained.
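The rule above can be sketched as a check that accepts any 1D kernel and flags everything else for manual review. The name `check_gconv_1d` is hypothetical. The sketch deliberately does not encode the supported 2D kernel sizes, because this document defers those to the compiler guide.

```python
from typing import List


def check_gconv_1d(iC: int, oC: int, kH: int, kW: int) -> List[str]:
    """Return notes for a GCONV layer; an empty list means nothing to review."""
    notes = []
    if iC % 4 != 0:
        # Per the table, non-m4 input channels are padded rather than rejected
        notes.append(f"iC={iC} is not a multiple of 4 and would be padded")
    if oC % 4 != 0:
        notes.append(f"oC={oC} is not a multiple of 4")
    if kH != 1 and kW != 1:
        notes.append(f"kernel ({kH}, {kW}) is 2D; check the compiler guide for supported sizes")
    return notes


print(check_gconv_1d(16, 32, kH=100, kW=1))  # 1D along height
print(check_gconv_1d(16, 32, kH=3, kW=3))    # 2D kernel, flagged for review
```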
Example (PyTorch):
# Good: 1D convolution - kW=1 allows any kH
Conv2d(in_channels=16, out_channels=32, kernel_size=(5, 1)) # kH=5, kW=1
Conv2d(in_channels=16, out_channels=32, kernel_size=(100, 1)) # kH=any, kW=1 ✓
# Good: 1D convolution with kH=1
Conv2d(in_channels=16, out_channels=32, kernel_size=(1, 5)) # kH=1, kW=5
DWCONV (Depth-Wise Convolution): use for efficient spatial filtering with groups=in_channels.
| Parameter | Constraint | Notes |
|---|---|---|
| Input Channels (iC) | m4 | Must be 4, 8, 12, 16, ... |
| Output Channels (oC) | m4 | Equal to iC for true depthwise |
| Kernel Height (kH) | any (if kW constrained) or specific (if kW=any) | Can be flexible with certain kW values |
| Kernel Width (kW) | any (if kH constrained) or specific (if kH=any) | Can be flexible with certain kH values |
| Groups | iC | Must equal input channels |
Rule: As with GCONV, one kernel dimension can often be flexible while the other is constrained. The compiler guide shows configurations such as kH=9 with kW=1.
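A depthwise layer is easy to misconfigure by leaving `groups` at its default or by changing the channel count. The sketch below checks the DWCONV table rows; `check_dwconv` is a hypothetical helper. Treating `oC != iC` as a violation is an assumption based on the table note "Equal to iC for true depthwise".

```python
from typing import List


def check_dwconv(iC: int, oC: int, groups: int, kH: int, kW: int) -> List[str]:
    """Return a list of DWCONV constraint violations (empty if none found)."""
    problems = []
    if iC % 4 != 0:
        problems.append(f"iC={iC} is not a multiple of 4")
    if groups != iC:
        problems.append(f"groups={groups} must equal iC={iC}")
    if oC != iC:
        problems.append(f"oC={oC} differs from iC={iC}; not a true depthwise layer")
    if kH != 1 and kW != 1:
        problems.append(f"kernel ({kH}, {kW}) is 2D; check the compiler guide")
    return problems


print(check_dwconv(iC=16, oC=16, groups=16, kH=9, kW=1))
print(check_dwconv(iC=16, oC=32, groups=1, kH=3, kW=1))
```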
Example (PyTorch):
# Good: Depthwise with kW=1, allows various kH
Conv2d(in_channels=16, out_channels=16, kernel_size=(3, 1), groups=16) # kH=3, kW=1 ✓
Conv2d(in_channels=16, out_channels=16, kernel_size=(9, 1), groups=16) # kH=9, kW=1 ✓
Conv2d(in_channels=16, out_channels=16, kernel_size=(5, 1), groups=16) # kH=5, kW=1 ✓
PWCONV (Point-Wise Convolution): use for channel mixing after depthwise convolution (1x1 convolution).
| Parameter | Constraint | Notes |
|---|---|---|
| Input Channels (iC) | m4 | Must be 4, 8, 12, 16, ... |
| Output Channels (oC) | m4 | Must be 4, 8, 12, 16, ... |
| Kernel Size | (1, 1) | Fixed for pointwise |
| Stride | (1, 1) | Fixed |
Example (PyTorch):
# Good: 1x1 conv with m4 channels
Conv2d(in_channels=16, out_channels=32, kernel_size=(1, 1))
FC (Fully-Connected):
| Parameter | Constraint (8-bit) | Constraint (4-bit) |
|---|---|---|
| Input Features | >= 16 | >= 8 |
| Output Features | any | any |
Critical: Ensure sufficient input features before FC layer!
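The number of features reaching the FC layer is the product of the channel count and the pooled spatial size. The sketch below computes that product and compares it with the minimums from the table above; the helper names are hypothetical. The table gives no minimum for 2-bit weights, so that case raises an error rather than guessing.

```python
# Minimum FC input features per weight bit-width, taken from the table above
MIN_FC_INPUT_FEATURES = {8: 16, 4: 8}


def fc_input_features(channels: int, oH: int, oW: int = 1) -> int:
    """Flattened feature count after pooling to (oH, oW)."""
    return channels * oH * oW


def fc_input_ok(features: int, kB: int = 8) -> bool:
    """True if the feature count meets the documented minimum for this bit-width."""
    if kB not in MIN_FC_INPUT_FEATURES:
        raise ValueError(f"no documented FC minimum for kB={kB}")
    return features >= MIN_FC_INPUT_FEATURES[kB]


features = fc_input_features(channels=16, oH=4, oW=1)
print(features, fc_input_ok(features))
print(fc_input_ok(fc_input_features(channels=4, oH=1)))
```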
Example (PyTorch):
# Good: input features = 64 (from 16 channels * 4 spatial)
AdaptiveAvgPool2d((4, 1)) # With 16 channels -> 64 features
Linear(in_features=64, out_features=num_classes)
# Bad: input features = 4 (below minimum)
AdaptiveAvgPool2d((1, 1)) # With 4 channels -> 4 features
Linear(in_features=4, out_features=num_classes)
MAXPOOL (Max Pooling):
| Parameter | Constraint | Notes |
|---|---|---|
| Input Channels (iC) | m4 | Must be 4, 8, 12, 16, ... |
| Output Channels (oC) | m4 | Same as input |
| Kernel Height (kH) | 1-4 or any | If kW is fixed (1-4), then kH can be any; if kH is fixed, max is 4 |
| Kernel Width (kW) | 1-4 or any | If kH is fixed (1-4), then kW can be any; if kW is fixed, max is 4 |
Rule: At least one dimension must be constrained to 1-4; the other can be flexible (any).
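The sketch below is a conservative reading of the MaxPool kernel rule. It accepts a kernel when both dimensions are within 1-4, or when one dimension is exactly 1. That is stricter than the rule as worded, which would also admit (128, 2). It is chosen because it agrees with every valid and invalid example listed in this section. The name `maxpool_kernel_ok` is hypothetical; confirm edge cases against the compiler guide.

```python
def maxpool_kernel_ok(kH: int, kW: int) -> bool:
    """Conservative MaxPool kernel check that matches the examples in this section."""
    if kH < 1 or kW < 1:
        return False
    both_small = kH <= 4 and kW <= 4      # ordinary small 2D window
    one_dimensional = kH == 1 or kW == 1  # 1D window of any length
    return both_small or one_dimensional


for kernel in [(3, 1), (1, 4), (8, 1), (256, 1), (1, 128), (8, 8), (128, 2)]:
    print(kernel, maxpool_kernel_ok(*kernel))
```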
Valid Examples (PyTorch):
# All valid - one dimension is fixed to 1-4
MaxPool2d(kernel_size=(3, 1), stride=(2, 1)) # kH=3, kW=1 ✓
MaxPool2d(kernel_size=(1, 4), stride=(1, 2)) # kH=1, kW=4 ✓
MaxPool2d(kernel_size=(8, 1), stride=(4, 1)) # kH=any, kW=1 ✓ (VALID!)
MaxPool2d(kernel_size=(256, 1), stride=(2, 1)) # kH=any, kW=1 ✓
MaxPool2d(kernel_size=(1, 128), stride=(1, 2)) # kH=1, kW=any ✓
Invalid Examples:
# Invalid - kernel shape outside the supported combinations
MaxPool2d(kernel_size=(8, 8), stride=(4, 4)) # kH=8, kW=8 (both > 4) ✗
MaxPool2d(kernel_size=(128, 2), stride=(2, 1)) # kH=128, kW=2 (kH > 4) ✗
Global Average Pooling:
| Parameter | Constraint | Notes |
|---|---|---|
| Input Channels (iC) | m4 | Must be 4, 8, 12, 16, ... |
| Output Size | (1, 1) | Global pooling |
| Condition | (iH * iW) > 2 | Must have spatial dimensions |
Non-Global Average Pooling: Converted to DWCONV internally. Follow DWCONV constraints.
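The global average pooling rows can be checked from the shape of the tensor entering the pool. This is a minimal sketch of the two conditions in the table; `check_global_avgpool` is a hypothetical name.

```python
from typing import List


def check_global_avgpool(iC: int, iH: int, iW: int) -> List[str]:
    """Return violations of the global average pooling constraints (empty if none)."""
    problems = []
    if iC % 4 != 0:
        problems.append(f"iC={iC} is not a multiple of 4")
    if iH * iW <= 2:
        problems.append(f"iH*iW={iH * iW} must be greater than 2")
    return problems


print(check_global_avgpool(iC=16, iH=32, iW=1))
print(check_global_avgpool(iC=16, iH=2, iW=1))
```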
The most efficient pattern for NPU acceleration:
# Depthwise convolution (spatial filtering)
Conv2d(in_channels=16, out_channels=16, kernel_size=(3, 1), groups=16)
BatchNorm2d(16)
ReLU()
# Pointwise convolution (channel mixing)
Conv2d(in_channels=16, out_channels=32, kernel_size=(1, 1))
BatchNorm2d(32)
ReLU()
Efficient channel progression that maintains the m4 constraint:
# Start: 1 -> 8 (FCONV)
# Then: 8 -> 16 -> 32 -> 64 (GCONV)
channels = [1, 8, 16, 32, 64]
Instead of one large kernel, use multiple smaller ones:
# Avoid where the kernel size is constrained: a single large kernel
# (for 1D GCONV with kW=1 any kH is allowed, so this matters only for constrained configurations)
Conv2d(in_channels=16, out_channels=32, kernel_size=(9, 1))
# Alternative: two stacked kH=5 kernels cover the same 9-sample receptive field
Conv2d(in_channels=16, out_channels=16, kernel_size=(5, 1))
Conv2d(in_channels=16, out_channels=32, kernel_size=(5, 1))
Design pooling to ensure minimum FC input features:
# Good: Ensure >= 16 features for FC
# Option A: More channels
AdaptiveAvgPool2d((1, 1)) # With 16+ channels
# Option B: Larger spatial output
AdaptiveAvgPool2d((4, 1)) # With 4+ channels -> 16+ features
Problem: Both kernel dimensions exceed limits simultaneously.
# WRONG - Both dimensions too large
Conv2d(in_channels=16, out_channels=32, kernel_size=(8, 8)) # kH=8, kW=8 ✗
Conv2d(in_channels=16, out_channels=32, kernel_size=(128, 5)) # Both > limits ✗
# CORRECT - One dimension can be large if other is fixed
Conv2d(in_channels=16, out_channels=32, kernel_size=(100, 1)) # kH=any, kW=1 ✓
Conv2d(in_channels=16, out_channels=32, kernel_size=(1, 100)) # kH=1, kW=any ✓
Solution: Ensure at least one dimension is within bounds. For 1D convolutions (kW=1 or kH=1), the other dimension can be flexible.
Problem: Channels not divisible by 4.
# First layer is fine; second layer is WRONG
Conv2d(in_channels=8, out_channels=12, kernel_size=(3, 1)) # 12 is m4: OK
Conv2d(in_channels=12, out_channels=18, kernel_size=(3, 1)) # 18 is NOT m4!
Solution: Always use channel counts that are multiples of 4: 4, 8, 12, 16, 20, 24, 32, 48, 64, ...
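One way to repair an existing channel plan is to round each width up to the next multiple of 4. The sketch below does that while leaving a leading 1 untouched, so the first layer can still be an FCONV. Rounding up rather than down is a design choice made here, not something the NPU requires, and both helper names are hypothetical.

```python
from typing import List


def round_up_to_m4(channels: int) -> int:
    """Smallest multiple of 4 that is >= channels."""
    if channels < 1:
        raise ValueError("channel count must be positive")
    return ((channels + 3) // 4) * 4


def fix_channel_plan(plan: List[int]) -> List[int]:
    """Round every width to m4, keeping a single-channel model input as-is."""
    return [
        c if (i == 0 and c == 1) else round_up_to_m4(c)
        for i, c in enumerate(plan)
    ]


print(fix_channel_plan([1, 6, 18, 30]))
```

Rounding up increases parameter count and compute slightly, so rounding down may be preferable when memory is tight.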
Problem: FC layer receives fewer than 16 features (8-bit) or 8 features (4-bit).
# WRONG: Only 4 features to FC
Conv2d(1, 4, kernel_size=(3, 1))
AdaptiveAvgPool2d((1, 1)) # 4 channels * 1 * 1 = 4 features
Linear(4, num_classes) # FAILS on NPU!
Solution: Increase channels or spatial size before FC.
Problem: MaxPool with both dimensions exceeding the 1-4 limit.
# WRONG - kernel shape outside the supported combinations
MaxPool2d(kernel_size=(8, 8), stride=(4, 4)) # kH=8, kW=8 ✗
MaxPool2d(kernel_size=(128, 2), stride=(2, 1)) # kH=128, kW=2 ✗ (kH too large)
# CORRECT - One dimension can be large if other is fixed 1-4
MaxPool2d(kernel_size=(8, 1), stride=(4, 1)) # kH=any, kW=1 ✓
MaxPool2d(kernel_size=(256, 1), stride=(2, 1)) # kH=any, kW=1 ✓
MaxPool2d(kernel_size=(1, 128), stride=(1, 2)) # kH=1, kW=any ✓
Solution: Ensure at least one dimension is 1-4. The other dimension can be larger if the first is fixed within the limit.
Use this checklist when designing or reviewing models for NPU compatibility:
- First layer input channels = 1 (for FCONV) OR multiple of 4
- All intermediate layer channels are multiples of 4
- Output channels of all convolutions are multiples of 4
- GCONV: For 1D (kW=1), any kH allowed; for 2D, check constraints per compiler guide
- DWCONV: For 1D (kW=1), any kH allowed; for 2D, check constraints per compiler guide
- FCONV: Verify kH and kW per compiler guide (examples show kH up to 10)
- MaxPool: At least one dimension must be 1-4; other can be any (not both > 4)
- FC input features >= 16 (for 8-bit weights)
- FC input features >= 8 (for 4-bit weights)
- Global AvgPool has (iH * iW) > 2
- Non-global AvgPool follows DWCONV constraints
- MaxPool: at least one kernel dimension is <= 4
- DWCONV: groups parameter equals input channels
- DWCONV: both input and output channels are m4
- No padding requirements exceed NPU support
- Stride values are supported for each layer type
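Several checklist items can be automated over a plain description of the model. The sketch below uses a hypothetical `LayerSpec` record (not a TI or PyTorch type) and covers only the channel, depthwise-groups, MaxPool, and FC items. Padding and stride support still need to be checked against the compiler guide.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class LayerSpec:
    """Hypothetical description of one layer, for illustration only."""
    kind: str        # "conv", "dwconv", "maxpool", or "fc"
    iC: int          # input channels (input features for "fc")
    oC: int          # output channels (output features for "fc")
    kH: int = 1
    kW: int = 1
    groups: int = 1
    kB: int = 8      # weight bit-width

FC_MINIMUM = {8: 16, 4: 8}  # from the checklist above


def review(layers: List[LayerSpec]) -> List[str]:
    """Return one finding per violated checklist item."""
    findings = []
    for idx, layer in enumerate(layers):
        tag = f"layer {idx} ({layer.kind})"
        if layer.kind == "fc":
            minimum = FC_MINIMUM.get(layer.kB)
            if minimum is None:
                findings.append(f"{tag}: no documented minimum for kB={layer.kB}")
            elif layer.iC < minimum:
                findings.append(f"{tag}: {layer.iC} input features < {minimum}")
            continue
        # A first conv layer with a single input channel is treated as FCONV
        first_conv = idx == 0 and layer.kind == "conv" and layer.iC == 1
        if not first_conv and layer.iC % 4 != 0:
            findings.append(f"{tag}: iC={layer.iC} is not a multiple of 4")
        if layer.oC % 4 != 0:
            findings.append(f"{tag}: oC={layer.oC} is not a multiple of 4")
        if layer.kind == "dwconv" and layer.groups != layer.iC:
            findings.append(f"{tag}: groups={layer.groups} != iC={layer.iC}")
        if layer.kind == "maxpool" and layer.kH > 4 and layer.kW > 4:
            findings.append(f"{tag}: both kernel dimensions exceed 4")
    return findings


model = [
    LayerSpec("conv", 1, 8, kH=5),
    LayerSpec("dwconv", 8, 8, kH=3, groups=8),
    LayerSpec("conv", 8, 16),
    LayerSpec("fc", 16, 3),
]
print(review(model))
```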
| Layer Type | kH | kW | iC Constraint | oC Constraint |
|---|---|---|---|---|
| FCONV | 10 or any | 1, 4, or any | 1 | m4 |
| GCONV | any (kW=1) or specific | any (kH=1) or specific | m4 (padded) | m4 |
| DWCONV | any/3/9 | any/1/3 | m4 | m4 |
| PWCONV | 1 | 1 | m4 | m4 |
| MAXPOOL | 1-4 or any | 1-4 or any | m4 | m4 |
| FC | N/A | N/A | >=16 (8-bit) | any |
- TI Neural Network Compiler for MCUs User's Guide v2.1.0
- Section 5: Layer Configurations Supported on the NPU
- TI Software Download
Document Version: 1.0
Last Updated: January 2025