[Bug] Engine build segfaults when SparseConvolution layer exceeds 128 channels (v1.3.2) #363

@kkurang

Description

First of all, thank you so much @hopef for the swift response on #361 and for providing the aarch64_cuda13.0 binary with sm_120 support! The binary works perfectly for models with up to 128 channels on our GB10 (sm_121) platform.

However, we've discovered a critical bug: spconv::load_engine_from_onnx() segfaults during engine build when any SparseConvolution layer has more than 128 channels.

Summary

| in/out channels | Engine build | Inference |
|---|---|---|
| 64 | OK | OK |
| 128 | OK | OK |
| 129 | SEGFAULT | - |
| 256 | SEGFAULT | - |

The crash occurs during engine building (inside load_engine_from_onnx()), not at inference time. This affects both SubMConv3d and SparseConv3d with >128 channels.

Impact

We are deploying PillarNet-18 (PillarRes18BackBone8x from OpenPCDet) for real-time 3D object detection. The sparse backbone has 4 stages:

| Stage | Type | Channels | libspconv |
|---|---|---|---|
| conv1 | SubMConv3d | 32 | OK |
| conv2 | SparseConv3d + SubMConv3d | 64 | OK |
| conv3 | SparseConv3d + SubMConv3d | 128 | OK |
| conv4 | SparseConv3d + SubMConv3d | 256 | CRASH |

Because of this bug, we cannot build a libspconv engine for the full sparse backbone. We currently use a hybrid workaround (libspconv for conv1–conv3, TensorRT dense Conv2d for conv4 onward), but this gives up the sparsity benefit in the most expensive stage.

Crash Log

[engine.cu:2596]: Engine build failure on SparseConvolution layer with 256 channels
Segmentation fault (core dumped)

The crash occurs inside libspconv.so at engine.cu:2596 during the engine build phase.

Environment

Reproduction

Minimal test script

Save the following as reproduce_128ch_crash.py in the tool/pillarnet-export/ directory (or any directory containing exptool.py and funcs.py from tool/centerpoint-export/):

"""
Minimal reproduction: libspconv segfaults when SparseConvolution > 128 channels.

Usage:
  cd Lidar_AI_Solution/libraries/3DSparseConvolution/tool/pillarnet-export
  python reproduce_128ch_crash.py --out-dir /tmp/libspconv_test

Then build & run:
  cd Lidar_AI_Solution/libraries/3DSparseConvolution
  CUDA_HOME=/usr/local/cuda SPCONV_CUDA_VERSION=13.0 make workspace/infer -j

  cd workspace
  # This should SUCCEED (128 channels):
  ./infer --onnx=/tmp/libspconv_test/test_128ch.onnx \
          --feature=/tmp/libspconv_test/test_128ch.voxels \
          --indice=/tmp/libspconv_test/test_128ch.coors \
          --grid_size=2,100,100 --verbose

  # This should SEGFAULT (256 channels):
  ./infer --onnx=/tmp/libspconv_test/test_256ch.onnx \
          --feature=/tmp/libspconv_test/test_256ch.voxels \
          --indice=/tmp/libspconv_test/test_256ch.coors \
          --grid_size=2,100,100 --verbose

Requires: torch, spconv-cu1xx, cumm, onnx, numpy
"""

import sys
import os
import argparse

# Add exptool to path
script_dir = os.path.dirname(os.path.abspath(__file__))
sys.path.insert(0, script_dir)
sys.path.insert(0, os.path.join(script_dir, "..", "centerpoint-export"))

import torch
import torch.nn as nn
import numpy as np
import struct
import spconv.pytorch as spconv
import cumm.tensorview as tv
import exptool


# ─── Tensor save (libspconv Tensor::load format) ─────────────────────
TENSOR_MAGIC = 0x33ff1101
DTYPE_MAP = {
    np.dtype(np.int32):   1,
    np.dtype(np.float16): 2,
    np.dtype(np.float32): 3,
}

def save_tensor(tensor, filename):
    if isinstance(tensor, torch.Tensor):
        data = tensor.cpu().numpy()
    elif isinstance(tensor, list):
        data = np.array(tensor, dtype=np.int32)
    else:
        data = np.array(tensor)
    if data.dtype == np.int64:
        data = data.astype(np.int32)
    dtype_id = DTYPE_MAP[data.dtype]
    with open(filename, "wb") as f:
        f.write(struct.pack('<I', TENSOR_MAGIC))
        f.write(struct.pack('<i', len(data.shape)))
        f.write(struct.pack('<i', dtype_id))
        for s in data.shape:
            f.write(struct.pack('<i', s))
        f.write(data.tobytes())
    print(f"  Saved {filename}: shape={list(data.shape)}, dtype={data.dtype}")


# ─── Minimal model: single SubMConv3d ────────────────────────────────
class SingleSubMConv3d(nn.Module):
    """Minimal model with one SubMConv3d + ScatterDense."""

    def __init__(self, channels):
        super().__init__()
        self.conv = spconv.SubMConv3d(
            channels, channels,
            kernel_size=3, padding=1, bias=True,
            indice_key='subm_test'
        )
        # Fused ReLU activation (as used in real models after BN fusion)
        self.conv.act_type = tv.gemm.Activation.ReLU

    def forward(self, voxels, coors, batch_size, spatial_shape):
        x = spconv.SparseConvTensor(
            features=voxels,
            indices=coors.int(),
            spatial_shape=spatial_shape,
            batch_size=batch_size,
        )
        y = self.conv(x)
        return [y.dense()]


# ─── Main ─────────────────────────────────────────────────────────────
def export_test_case(channels, out_dir):
    """Export a single-layer SubMConv3d model with given channel count."""
    prefix = os.path.join(out_dir, f"test_{channels}ch")
    onnx_path = f"{prefix}.onnx"

    print(f"\n{'=' * 60}")
    print(f"Exporting test case: {channels} channels")
    print(f"{'=' * 60}")

    model = SingleSubMConv3d(channels).cuda().eval().half()

    # Initialize weights (values don't matter for crash reproduction)
    nn.init.kaiming_normal_(model.conv.weight)
    model.conv.bias.data.zero_()

    # Spatial shape: Z=2, H=100, W=100 (small grid for fast testing)
    spatial_shape = [2, 100, 100]
    n_points = 500

    # Generate random sparse input at unique (y, x) locations;
    # spconv expects voxel coordinates without duplicates
    voxels = torch.randn(n_points, channels, dtype=torch.float16, device='cuda')
    coors = torch.zeros(n_points, 4, dtype=torch.int32, device='cuda')  # batch index 0, z=0
    flat = torch.randperm(100 * 100, device='cuda')[:n_points]
    coors[:, 2] = flat // 100  # y
    coors[:, 3] = flat % 100   # x

    # Export ONNX via exptool tracing
    exptool.export_onnx(
        model, voxels, coors,
        batch_size=1,
        spatial_shape=spatial_shape,
        save_onnx=onnx_path,
        save_tensor=prefix,
    )

    # Also save tensors explicitly (exptool saves them too, but let's be sure)
    save_tensor(voxels, f"{prefix}.voxels")
    save_tensor(coors, f"{prefix}.coors")

    print(f"\nFiles generated:")
    print(f"  ONNX:    {onnx_path}")
    print(f"  Voxels:  {prefix}.voxels")
    print(f"  Coors:   {prefix}.coors")
    print(f"\nTo test with ./infer:")
    print(f"  ./infer --onnx={onnx_path} \\")
    print(f"          --feature={prefix}.voxels \\")
    print(f"          --indice={prefix}.coors \\")
    print(f"          --grid_size=2,100,100 --verbose")


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--out-dir", default="/tmp/libspconv_test",
                        help="Output directory for test files")
    args = parser.parse_args()

    os.makedirs(args.out_dir, exist_ok=True)

    # 128 channels — should BUILD and RUN successfully
    export_test_case(128, args.out_dir)

    # 256 channels — should SEGFAULT during engine build
    export_test_case(256, args.out_dir)

    print(f"\n{'=' * 60}")
    print("Done! Now build and run the infer tool:")
    print(f"{'=' * 60}")
    print("""
  cd Lidar_AI_Solution/libraries/3DSparseConvolution
  CUDA_HOME=/usr/local/cuda SPCONV_CUDA_VERSION=13.0 make workspace/infer -j

  cd workspace

  # PASS (128ch):
  ./infer --onnx={0}/test_128ch.onnx \\
          --feature={0}/test_128ch.voxels \\
          --indice={0}/test_128ch.coors \\
          --grid_size=2,100,100 --verbose

  # CRASH (256ch):
  ./infer --onnx={0}/test_256ch.onnx \\
          --feature={0}/test_256ch.voxels \\
          --indice={0}/test_256ch.coors \\
          --grid_size=2,100,100 --verbose
""".format(args.out_dir))

Build & run

# 1. Build the infer tool
cd Lidar_AI_Solution/libraries/3DSparseConvolution
CUDA_HOME=/usr/local/cuda SPCONV_CUDA_VERSION=13.0 make workspace/infer -j

# 2. Generate test ONNX + tensor files
cd tool/pillarnet-export
python reproduce_128ch_crash.py --out-dir /tmp/libspconv_test

# 3. Test 128ch (should succeed)
cd ../../workspace
./infer --onnx=/tmp/libspconv_test/test_128ch.onnx \
        --feature=/tmp/libspconv_test/test_128ch.voxels \
        --indice=/tmp/libspconv_test/test_128ch.coors \
        --grid_size=2,100,100 --verbose

# 4. Test 256ch (should segfault at engine build)
./infer --onnx=/tmp/libspconv_test/test_256ch.onnx \
        --feature=/tmp/libspconv_test/test_256ch.voxels \
        --indice=/tmp/libspconv_test/test_256ch.coors \
        --grid_size=2,100,100 --verbose
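
Before blaming the engine, it can be worth ruling out a malformed input file. The sketch below re-reads the header that save_tensor() in the script writes (magic, number of dimensions, dtype id, shape) and checks that the file size matches. It mirrors only the writer shown in this issue; read_tensor_header and DTYPE_INFO are names made up for this example, and nothing here is taken from libspconv's own Tensor::load.

```python
import math
import os
import struct

TENSOR_MAGIC = 0x33FF1101
# dtype id -> (name, bytes per element), mirroring DTYPE_MAP in the script
DTYPE_INFO = {1: ("int32", 4), 2: ("float16", 2), 3: ("float32", 4)}


def read_tensor_header(filename):
    """Parse the header written by save_tensor() and check the payload size."""
    with open(filename, "rb") as f:
        # Header layout: uint32 magic, int32 ndim, int32 dtype id (little-endian)
        magic, ndim, dtype_id = struct.unpack("<Iii", f.read(12))
        if magic != TENSOR_MAGIC:
            raise ValueError(f"unexpected magic 0x{magic:08x}")
        # One int32 per dimension follows the fixed header
        shape = struct.unpack(f"<{ndim}i", f.read(4 * ndim))
    name, itemsize = DTYPE_INFO[dtype_id]
    expected = 12 + 4 * ndim + math.prod(shape) * itemsize
    actual = os.path.getsize(filename)
    if actual != expected:
        raise ValueError(f"file is {actual} bytes, header implies {expected}")
    return shape, name
```

For the 256-channel case, read_tensor_header("/tmp/libspconv_test/test_256ch.voxels") should return a shape of (500, 256) with dtype float16 if the file was written by the script unchanged.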

Expected results

128ch (PASS):

Load inference task from arguments: /tmp/libspconv_test/test_128ch.onnx
  ...
Run inference task: /tmp/libspconv_test/test_128ch.onnx
Save output[0] to output0_2.tensor
Done inference task: /tmp/libspconv_test/test_128ch.onnx

256ch (CRASH):

[engine.cu:2596]: Engine build failure on SparseConvolution layer with 256 channels
Segmentation fault (core dumped)

Channel limit boundary

We tested systematically on our GB10:

| Channels | load_engine_from_onnx() |
|---|---|
| 64 | OK |
| 96 | OK |
| 128 | OK |
| 129 | SEGFAULT |
| 192 | SEGFAULT |
| 256 | SEGFAULT |

The hard boundary is exactly at 128 → 129 channels.
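
For anyone who wants to repeat the sweep, here is a rough sketch of how it could be automated. It is not the exact harness we used; classify, run_case and first_failure are helper names invented for this example. It relies on the POSIX behaviour of Python's subprocess module, which reports a process killed by a signal as a negative return code.

```python
import signal
import subprocess


def classify(returncode):
    """Map a process return code to OK, SEGFAULT or FAIL(n)."""
    if returncode == 0:
        return "OK"
    # On POSIX, death by signal N is reported as return code -N
    if returncode == -signal.SIGSEGV:
        return "SEGFAULT"
    return f"FAIL({returncode})"


def run_case(cmd):
    """Run one command (a list of argv strings) and classify how it exited."""
    proc = subprocess.run(cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    return classify(proc.returncode)


def first_failure(results):
    """Return the smallest channel count whose status is not OK, or None."""
    failing = [ch for ch, status in results.items() if status != "OK"]
    return min(failing) if failing else None
```

To use it, call run_case once per channel count with the same ./infer arguments as in the Build & run steps (for example ["./infer", "--onnx=/tmp/libspconv_test/test_129ch.onnx", ...]), collect the statuses in a dict keyed by channel count, and pass that dict to first_failure. The reproduction script as posted only exports 128 and 256 channels, so it needs extra export_test_case() calls for the other widths.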

Workaround

We split the model into:

  • conv1–conv3 (max 128ch) → libspconv engine (works)
  • conv4 (256ch) → Dense Conv2d via TensorRT

This loses sparsity acceleration for the 256ch stage, which is the most computationally expensive part.
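
The split point can be expressed as a small rule. The sketch below is illustrative only and is not our export code: the 128 threshold is the limit observed in this report rather than a documented libspconv constant, and split_backbone and STAGES are names made up for the example.

```python
MAX_LIBSPCONV_CHANNELS = 128  # limit observed in this report, not a documented constant

# (stage name, channel count) taken from the table in the Impact section
STAGES = [("conv1", 32), ("conv2", 64), ("conv3", 128), ("conv4", 256)]


def split_backbone(stages, limit=MAX_LIBSPCONV_CHANNELS):
    """Split stages into a sparse prefix within the limit and a dense remainder."""
    for i, (_, channels) in enumerate(stages):
        if channels > limit:
            # Everything from the first over-limit stage onward runs dense
            return stages[:i], stages[i:]
    return stages, []


sparse_part, dense_part = split_backbone(STAGES)
print("libspconv:", [name for name, _ in sparse_part])
print("dense TensorRT:", [name for name, _ in dense_part])
```

The split is a prefix/suffix cut rather than a per-stage choice because once the features have been converted to a dense tensor, all later stages have to run dense as well.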

Request

Could you investigate and fix the 128-channel limit in engine.cu? Models like PillarNet-18, CenterPoint, and BEVFusion commonly use 256+ channel sparse convolutions. Lifting this limit would enable full sparse inference for these architectures.

Thank you again for the excellent work on libspconv and for the sm_120 support!
