[Bug] Engine build segfaults when SparseConvolution layer exceeds 128 channels (v1.3.2) #363

First of all, thank you so much @hopef for the swift response on #361 and for providing the `aarch64_cuda13.0` binary with sm_120 support! The binary works perfectly for models with up to 128 channels on our GB10 (sm_121) platform.

However, we've discovered a critical bug: **`spconv::load_engine_from_onnx()` segfaults during engine build when any `SparseConvolution` layer has more than 128 channels.**

## Summary

| in/out channels | Engine build | Inference |
|:-:|:-:|:-:|
| 64 | OK | OK |
| 128 | OK | OK |
| **129** | **SEGFAULT** | - |
| **256** | **SEGFAULT** | - |

The crash occurs **during engine building** (inside `load_engine_from_onnx()`), not at inference time. This affects both `SubMConv3d` and `SparseConv3d` with >128 channels.

## Impact

We are deploying **PillarNet-18** (PillarRes18BackBone8x from OpenPCDet) for real-time 3D object detection. The sparse backbone has 4 stages:

| Stage | Type | Channels | libspconv |
|:-:|:-:|:-:|:-:|
| conv1 | SubMConv3d | 32 | OK |
| conv2 | SparseConv3d + SubMConv3d | 64 | OK |
| conv3 | SparseConv3d + SubMConv3d | 128 | OK |
| **conv4** | **SparseConv3d + SubMConv3d** | **256** | **CRASH** |

Because of this bug, we cannot export the full sparse backbone. Our current workaround is a hybrid pipeline (libspconv for conv1–conv3, dense Conv2d via TensorRT for conv4 onward), but it gives up the sparsity benefit in the most expensive stage.

## Crash Log

```
[engine.cu:2596]: Engine build failure on SparseConvolution layer with 256 channels
Segmentation fault (core dumped)
```

The crash occurs inside `libspconv.so` at `engine.cu:2596` during the engine build phase.
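
If a stack trace would help with triage, one option is to rerun the failing case under gdb in batch mode. The snippet below is only a sketch: it assumes `gdb` is installed on the target, `gdb_backtrace_command` is a hypothetical helper name, and since `libspconv.so` ships as a prebuilt binary the frames inside it may come without symbols.

```python
import subprocess

def gdb_backtrace_command(cmd):
    """Wrap a command so gdb runs it non-interactively and prints a backtrace on crash."""
    return ["gdb", "-batch", "-ex", "run", "-ex", "bt", "--args", *cmd]

def capture_backtrace(cmd, cwd="."):
    """Run the wrapped command and return gdb's combined stdout/stderr as text."""
    proc = subprocess.run(
        gdb_backtrace_command(cmd),
        cwd=cwd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
    )
    return proc.stdout
```

For the 256-channel case from the reproduction section, `cmd` would be the same `./infer --onnx=... --feature=... --indice=... --grid_size=2,100,100 --verbose` argument list, passed as a Python list and run from the `workspace` directory.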

## Environment

- **Platform**: aarch64, NVIDIA GB10 (sm_121, Blackwell)
- **OS**: Ubuntu 24.04
- **CUDA**: 12.8
- **libspconv**: v1.3.2 (`aarch64_cuda13.0/libspconv.so`, sm_120 cubin from #361)
- **spconv-cu128**: 2.3.6 (Python, for ONNX export)
- **Python**: 3.10

## Reproduction

### Minimal test script

Save the following as `reproduce_128ch_crash.py` in the `tool/pillarnet-export/` directory (or any directory containing `exptool.py` and `funcs.py` from `tool/centerpoint-export/`):

```python
"""
Minimal reproduction: libspconv segfaults when SparseConvolution > 128 channels.

Usage:
  cd Lidar_AI_Solution/libraries/3DSparseConvolution/tool/pillarnet-export
  python reproduce_128ch_crash.py --out-dir /tmp/libspconv_test

Then build & run:
  cd Lidar_AI_Solution/libraries/3DSparseConvolution
  CUDA_HOME=/usr/local/cuda SPCONV_CUDA_VERSION=13.0 make workspace/infer -j

  cd workspace
  # This should SUCCEED (128 channels):
  ./infer --onnx=/tmp/libspconv_test/test_128ch.onnx \
          --feature=/tmp/libspconv_test/test_128ch.voxels \
          --indice=/tmp/libspconv_test/test_128ch.coors \
          --grid_size=2,100,100 --verbose

  # This should SEGFAULT (256 channels):
  ./infer --onnx=/tmp/libspconv_test/test_256ch.onnx \
          --feature=/tmp/libspconv_test/test_256ch.voxels \
          --indice=/tmp/libspconv_test/test_256ch.coors \
          --grid_size=2,100,100 --verbose

Requires: torch, spconv-cu1xx, cumm, onnx, numpy
"""

import sys
import os
import argparse

# Add exptool to path
script_dir = os.path.dirname(os.path.abspath(__file__))
sys.path.insert(0, script_dir)
sys.path.insert(0, os.path.join(script_dir, "..", "centerpoint-export"))

import torch
import torch.nn as nn
import numpy as np
import struct
import spconv.pytorch as spconv
import cumm.tensorview as tv
import exptool


# ─── Tensor save (libspconv Tensor::load format) ─────────────────────
TENSOR_MAGIC = 0x33ff1101
DTYPE_MAP = {
    np.dtype(np.int32):   1,
    np.dtype(np.float16): 2,
    np.dtype(np.float32): 3,
}

def save_tensor(tensor, filename):
    if isinstance(tensor, torch.Tensor):
        data = tensor.cpu().numpy()
    elif isinstance(tensor, list):
        data = np.array(tensor, dtype=np.int32)
    else:
        data = np.array(tensor)
    if data.dtype == np.int64:
        data = data.astype(np.int32)
    dtype_id = DTYPE_MAP[data.dtype]
    with open(filename, "wb") as f:
        f.write(struct.pack('<I', TENSOR_MAGIC))
        f.write(struct.pack('<i', len(data.shape)))
        f.write(struct.pack('<i', dtype_id))
        for s in data.shape:
            f.write(struct.pack('<i', s))
        f.write(data.tobytes())
    print(f"  Saved {filename}: shape={list(data.shape)}, dtype={data.dtype}")


# ─── Minimal model: single SubMConv3d ────────────────────────────────
class SingleSubMConv3d(nn.Module):
    """Minimal model with one SubMConv3d + ScatterDense."""

    def __init__(self, channels):
        super().__init__()
        self.conv = spconv.SubMConv3d(
            channels, channels,
            kernel_size=3, padding=1, bias=True,
            indice_key='subm_test'
        )
        # Fused ReLU activation (as used in real models after BN fusion)
        self.conv.act_type = tv.gemm.Activation.ReLU

    def forward(self, voxels, coors, batch_size, spatial_shape):
        x = spconv.SparseConvTensor(
            features=voxels,
            indices=coors.int(),
            spatial_shape=spatial_shape,
            batch_size=batch_size,
        )
        y = self.conv(x)
        return [y.dense()]


# ─── Main ─────────────────────────────────────────────────────────────
def export_test_case(channels, out_dir):
    """Export a single-layer SubMConv3d model with given channel count."""
    prefix = os.path.join(out_dir, f"test_{channels}ch")
    onnx_path = f"{prefix}.onnx"

    print(f"\n{'=' * 60}")
    print(f"Exporting test case: {channels} channels")
    print(f"{'=' * 60}")

    model = SingleSubMConv3d(channels).cuda().eval().half()

    # Initialize weights (values don't matter for crash reproduction)
    nn.init.kaiming_normal_(model.conv.weight)
    model.conv.bias.data.zero_()

    # Spatial shape: Z=2, H=100, W=100 (small grid for fast testing)
    spatial_shape = [2, 100, 100]
    n_points = 500

    # Generate random sparse input
    voxels = torch.randn(n_points, channels, dtype=torch.float16, device='cuda')
    coors = torch.zeros(n_points, 4, dtype=torch.int32, device='cuda')
    coors[:, 0] = 0  # batch index
    coors[:, 1] = 0  # z (all at z=0)
    # Sample distinct (y, x) cells so that no voxel coordinate is duplicated
    flat = torch.randperm(100 * 100, device='cuda')[:n_points]
    coors[:, 2] = torch.div(flat, 100, rounding_mode='floor')  # y
    coors[:, 3] = flat % 100                                   # x

    # Export ONNX via exptool tracing
    exptool.export_onnx(
        model, voxels, coors,
        batch_size=1,
        spatial_shape=spatial_shape,
        save_onnx=onnx_path,
        save_tensor=prefix,
    )

    # Also save tensors explicitly (exptool saves them too, but let's be sure)
    save_tensor(voxels, f"{prefix}.voxels")
    save_tensor(coors, f"{prefix}.coors")

    print(f"\nFiles generated:")
    print(f"  ONNX:    {onnx_path}")
    print(f"  Voxels:  {prefix}.voxels")
    print(f"  Coors:   {prefix}.coors")
    print(f"\nTo test with ./infer:")
    print(f"  ./infer --onnx={onnx_path} \\")
    print(f"          --feature={prefix}.voxels \\")
    print(f"          --indice={prefix}.coors \\")
    print(f"          --grid_size=2,100,100 --verbose")


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--out-dir", default="/tmp/libspconv_test",
                        help="Output directory for test files")
    args = parser.parse_args()

    os.makedirs(args.out_dir, exist_ok=True)

    # 128 channels — should BUILD and RUN successfully
    export_test_case(128, args.out_dir)

    # 256 channels — should SEGFAULT during engine build
    export_test_case(256, args.out_dir)

    print(f"\n{'=' * 60}")
    print("Done! Now build and run the infer tool:")
    print(f"{'=' * 60}")
    print("""
  cd Lidar_AI_Solution/libraries/3DSparseConvolution
  CUDA_HOME=/usr/local/cuda SPCONV_CUDA_VERSION=13.0 make workspace/infer -j

  cd workspace

  # PASS (128ch):
  ./infer --onnx={0}/test_128ch.onnx \\
          --feature={0}/test_128ch.voxels \\
          --indice={0}/test_128ch.coors \\
          --grid_size=2,100,100 --verbose

  # CRASH (256ch):
  ./infer --onnx={0}/test_256ch.onnx \\
          --feature={0}/test_256ch.voxels \\
          --indice={0}/test_256ch.coors \\
          --grid_size=2,100,100 --verbose
""".format(args.out_dir))
```
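
As a sanity check before running `./infer`, the generated `.voxels` and `.coors` files can be inspected on the host. The sketch below parses only the header that `save_tensor` in the script above writes (magic, rank, dtype id, dimensions) and checks that the payload size matches. `read_tensor_header` and `ITEM_SIZE` are names introduced here for illustration and are not part of libspconv. The layout is taken from the script's writer, not from the library source.

```python
import os
import struct

TENSOR_MAGIC = 0x33ff1101        # same magic value the script writes
ITEM_SIZE = {1: 4, 2: 2, 3: 4}   # dtype id -> bytes per element (int32, float16, float32)

def read_tensor_header(path):
    """Return (shape, dtype_id) after validating the magic and the payload size."""
    with open(path, "rb") as f:
        magic, ndim, dtype_id = struct.unpack("<Iii", f.read(12))
        if magic != TENSOR_MAGIC:
            raise ValueError(f"{path}: bad magic 0x{magic:08x}")
        shape = list(struct.unpack(f"<{ndim}i", f.read(4 * ndim)))

    # Expected payload = element size * product of all dimensions
    expected = ITEM_SIZE[dtype_id]
    for dim in shape:
        expected *= dim

    actual = os.path.getsize(path) - 12 - 4 * ndim
    if actual != expected:
        raise ValueError(f"{path}: payload is {actual} bytes, expected {expected}")
    return shape, dtype_id
```

For the files produced by the script, `test_256ch.voxels` should report shape `[500, 256]` with dtype id 2 (float16), and `test_256ch.coors` shape `[500, 4]` with dtype id 1 (int32).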

### Build & run

```bash
# 1. Build the infer tool
cd Lidar_AI_Solution/libraries/3DSparseConvolution
CUDA_HOME=/usr/local/cuda SPCONV_CUDA_VERSION=13.0 make workspace/infer -j

# 2. Generate test ONNX + tensor files
cd tool/pillarnet-export
python reproduce_128ch_crash.py --out-dir /tmp/libspconv_test

# 3. Test 128ch (should succeed)
cd ../../workspace
./infer --onnx=/tmp/libspconv_test/test_128ch.onnx \
        --feature=/tmp/libspconv_test/test_128ch.voxels \
        --indice=/tmp/libspconv_test/test_128ch.coors \
        --grid_size=2,100,100 --verbose

# 4. Test 256ch (should segfault at engine build)
./infer --onnx=/tmp/libspconv_test/test_256ch.onnx \
        --feature=/tmp/libspconv_test/test_256ch.voxels \
        --indice=/tmp/libspconv_test/test_256ch.coors \
        --grid_size=2,100,100 --verbose
```

### Expected results

**128ch** (PASS):
```
Load inference task from arguments: /tmp/libspconv_test/test_128ch.onnx
  ...
Run inference task: /tmp/libspconv_test/test_128ch.onnx
Save output[0] to output0_2.tensor
Done inference task: /tmp/libspconv_test/test_128ch.onnx
```

**256ch** (CRASH):
```
[engine.cu:2596]: Engine build failure on SparseConvolution layer with 256 channels
Segmentation fault (core dumped)
```

## Channel limit boundary

We tested systematically on our GB10:

| Channels | `load_engine_from_onnx()` |
|:-:|:-:|
| 64 | OK |
| 96 | OK |
| 128 | OK |
| 129 | **SEGFAULT** |
| 192 | **SEGFAULT** |
| 256 | **SEGFAULT** |

The hard boundary is exactly at **128 → 129 channels**.
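
A boundary sweep like the one in the table can be scripted by running `./infer` once per channel count in a child process, so that a segfault during engine build does not take down the driver. This is a minimal sketch under stated assumptions: `infer_command`, `classify_returncode`, and `sweep` are hypothetical helper names, and it assumes `test_<N>ch.onnx`, `.voxels`, and `.coors` files have already been exported for every `N` of interest (the reproduction script as posted exports only 128 and 256).

```python
import signal
import subprocess

def infer_command(out_dir, channels, infer_bin="./infer"):
    """Build the ./infer command line for one exported test case."""
    prefix = f"{out_dir}/test_{channels}ch"
    return [
        infer_bin,
        f"--onnx={prefix}.onnx",
        f"--feature={prefix}.voxels",
        f"--indice={prefix}.coors",
        "--grid_size=2,100,100",
        "--verbose",
    ]

def classify_returncode(rc):
    """Map a child exit status to OK, SEGFAULT, or FAIL(<rc>)."""
    segv = int(signal.SIGSEGV)
    if rc == 0:
        return "OK"
    # subprocess reports death by signal as -signum; a shell wrapper reports 128 + signum
    if rc in (-segv, 128 + segv):
        return "SEGFAULT"
    return f"FAIL({rc})"

def sweep(out_dir, channel_counts, cwd="."):
    """Run each case in its own process and record the outcome per channel count."""
    results = {}
    for ch in channel_counts:
        proc = subprocess.run(infer_command(out_dir, ch), cwd=cwd,
                              capture_output=True, text=True)
        results[ch] = classify_returncode(proc.returncode)
    return results
```

A call such as `sweep("/tmp/libspconv_test", [64, 96, 128, 129, 192, 256], cwd="workspace")` would then return one status per channel count, which can be compared against the table above.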

## Workaround

We split the model into:
- **conv1–conv3** (max 128ch) → libspconv engine (works)
- **conv4** (256ch) → Dense Conv2d via TensorRT

This loses sparsity acceleration for the 256ch stage, which is the most computationally expensive part.

## Request

Could you investigate and fix the 128-channel limit in `engine.cu`? Models like PillarNet-18, CenterPoint, and BEVFusion commonly use 256+ channel sparse convolutions. Lifting this limit would enable full sparse inference for these architectures.

Thank you again for the excellent work on libspconv and for the sm_120 support!
