First of all, thank you so much @hopef for the swift response on #361 and for providing the aarch64_cuda13.0 binary with sm_120 support! The binary works perfectly for models with up to 128 channels on our GB10 (sm_121) platform.
However, we've discovered a critical bug: spconv::load_engine_from_onnx() segfaults during engine build when any SparseConvolution layer has more than 128 channels.
Summary
| in/out channels | Engine build | Inference |
| --- | --- | --- |
| 64 | OK | OK |
| 128 | OK | OK |
| 129 | SEGFAULT | - |
| 256 | SEGFAULT | - |
The crash occurs during engine building (inside load_engine_from_onnx()), not at inference time. This affects both SubMConv3d and SparseConv3d with >128 channels.
Impact
We are deploying PillarNet-18 (PillarRes18BackBone8x from OpenPCDet) for real-time 3D object detection. The sparse backbone has 4 stages:
| Stage | Type | Channels | libspconv |
| --- | --- | --- | --- |
| conv1 | SubMConv3d | 32 | OK |
| conv2 | SparseConv3d + SubMConv3d | 64 | OK |
| conv3 | SparseConv3d + SubMConv3d | 128 | OK |
| conv4 | SparseConv3d + SubMConv3d | 256 | CRASH |
Because of this bug, we cannot export the full sparse backbone. We are currently using a workaround (hybrid: libspconv for conv1-3, TensorRT Dense Conv2d for conv4+), but this loses the sparsity benefit of the most expensive stage.
Crash Log
[engine.cu:2596]: Engine build failure on SparseConvolution layer with 256 channels
Segmentation fault (core dumped)
The crash occurs inside libspconv.so at engine.cu:2596 during the engine build phase.
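Environment
- libspconv: aarch64_cuda13.0/libspconv.so, sm_120 cubin from #361 ([Request] Add aarch64 libspconv binary with SM120 (Blackwell GB10) support)
- Platform: GB10 (sm_121)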
Reproduction
Minimal test script
Save the following as reproduce_128ch_crash.py in the tool/pillarnet-export/ directory (or any directory containing exptool.py and funcs.py from tool/centerpoint-export/):
"""
Minimal reproduction: libspconv segfaults when SparseConvolution > 128 channels.

Usage:
    cd Lidar_AI_Solution/libraries/3DSparseConvolution/tool/pillarnet-export
    python reproduce_128ch_crash.py --out-dir /tmp/libspconv_test

Then build & run:
    cd Lidar_AI_Solution/libraries/3DSparseConvolution
    CUDA_HOME=/usr/local/cuda SPCONV_CUDA_VERSION=13.0 make workspace/infer -j
    cd workspace

    # This should SUCCEED (128 channels):
    ./infer --onnx=/tmp/libspconv_test/test_128ch.onnx \
            --feature=/tmp/libspconv_test/test_128ch.voxels \
            --indice=/tmp/libspconv_test/test_128ch.coors \
            --grid_size=2,100,100 --verbose

    # This should SEGFAULT (256 channels):
    ./infer --onnx=/tmp/libspconv_test/test_256ch.onnx \
            --feature=/tmp/libspconv_test/test_256ch.voxels \
            --indice=/tmp/libspconv_test/test_256ch.coors \
            --grid_size=2,100,100 --verbose

Requires: torch, spconv-cu1xx, cumm, onnx, numpy
"""
import sys
import os
import argparse
# Add exptool to path
script_dir = os.path.dirname(os.path.abspath(__file__))
sys.path.insert(0, script_dir)
sys.path.insert(0, os.path.join(script_dir, "..", "centerpoint-export"))
import torch
import torch.nn as nn
import numpy as np
import struct
import spconv.pytorch as spconv
import cumm.tensorview as tv
import exptool
# ─── Tensor save (libspconv Tensor::load format) ─────────────────────
TENSOR_MAGIC = 0x33ff1101
DTYPE_MAP = {
    np.dtype(np.int32): 1,
    np.dtype(np.float16): 2,
    np.dtype(np.float32): 3,
}


def save_tensor(tensor, filename):
    # Convert the input to a NumPy array on the host
    if isinstance(tensor, torch.Tensor):
        data = tensor.cpu().numpy()
    elif isinstance(tensor, list):
        data = np.array(tensor, dtype=np.int32)
    else:
        data = np.array(tensor)
    # int64 has no dtype id in DTYPE_MAP, so downcast to int32
    if data.dtype == np.int64:
        data = data.astype(np.int32)
    dtype_id = DTYPE_MAP[data.dtype]
    # Header: magic, ndim, dtype id, then one int32 per dimension; raw data follows
    with open(filename, "wb") as f:
        f.write(struct.pack('<I', TENSOR_MAGIC))
        f.write(struct.pack('<i', len(data.shape)))
        f.write(struct.pack('<i', dtype_id))
        for s in data.shape:
            f.write(struct.pack('<i', s))
        f.write(data.tobytes())
    print(f" Saved {filename}: shape={list(data.shape)}, dtype={data.dtype}")


# ─── Minimal model: single SubMConv3d ────────────────────────────────
class SingleSubMConv3d(nn.Module):
    """Minimal model with one SubMConv3d + ScatterDense."""

    def __init__(self, channels):
        super().__init__()
        self.conv = spconv.SubMConv3d(
            channels, channels,
            kernel_size=3, padding=1, bias=True,
            indice_key='subm_test'
        )
        # Fused ReLU activation (as used in real models after BN fusion)
        self.conv.act_type = tv.gemm.Activation.ReLU

    def forward(self, voxels, coors, batch_size, spatial_shape):
        x = spconv.SparseConvTensor(
            features=voxels,
            indices=coors.int(),
            spatial_shape=spatial_shape,
            batch_size=batch_size,
        )
        y = self.conv(x)
        return [y.dense()]


# ─── Main ─────────────────────────────────────────────────────────────
def export_test_case(channels, out_dir):
    """Export a single-layer SubMConv3d model with given channel count."""
    prefix = os.path.join(out_dir, f"test_{channels}ch")
    onnx_path = f"{prefix}.onnx"
    print(f"\n{'=' * 60}")
    print(f"Exporting test case: {channels} channels")
    print(f"{'=' * 60}")

    model = SingleSubMConv3d(channels).cuda().eval().half()

    # Initialize weights (values don't matter for crash reproduction)
    nn.init.kaiming_normal_(model.conv.weight)
    model.conv.bias.data.zero_()

    # Spatial shape: Z=2, H=100, W=100 (small grid for fast testing)
    spatial_shape = [2, 100, 100]
    n_points = 500

    # Generate random sparse input
    voxels = torch.randn(n_points, channels, dtype=torch.float16, device='cuda')
    coors = torch.zeros(n_points, 4, dtype=torch.int32, device='cuda')
    coors[:, 0] = 0  # batch index
    coors[:, 1] = 0  # z (all at z=0)
    coors[:, 2] = torch.randint(0, 100, (n_points,), device='cuda')  # y
    coors[:, 3] = torch.randint(0, 100, (n_points,), device='cuda')  # x

    # Export ONNX via exptool tracing
    exptool.export_onnx(
        model, voxels, coors,
        batch_size=1,
        spatial_shape=spatial_shape,
        save_onnx=onnx_path,
        save_tensor=prefix,
    )

    # Also save tensors explicitly (exptool saves them too, but let's be sure)
    save_tensor(voxels, f"{prefix}.voxels")
    save_tensor(coors, f"{prefix}.coors")

    print("\nFiles generated:")
    print(f" ONNX: {onnx_path}")
    print(f" Voxels: {prefix}.voxels")
    print(f" Coors: {prefix}.coors")
    print("\nTo test with ./infer:")
    print(f" ./infer --onnx={onnx_path} \\")
    print(f" --feature={prefix}.voxels \\")
    print(f" --indice={prefix}.coors \\")
    print(" --grid_size=2,100,100 --verbose")


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--out-dir", default="/tmp/libspconv_test",
                        help="Output directory for test files")
    args = parser.parse_args()
    os.makedirs(args.out_dir, exist_ok=True)

    # 128 channels: should BUILD and RUN successfully
    export_test_case(128, args.out_dir)

    # 256 channels: should SEGFAULT during engine build
    export_test_case(256, args.out_dir)

    print(f"\n{'=' * 60}")
    print("Done! Now build and run the infer tool:")
    print(f"{'=' * 60}")
    print("""
cd Lidar_AI_Solution/libraries/3DSparseConvolution
CUDA_HOME=/usr/local/cuda SPCONV_CUDA_VERSION=13.0 make workspace/infer -j
cd workspace
# PASS (128ch):
./infer --onnx={0}/test_128ch.onnx \\
--feature={0}/test_128ch.voxels \\
--indice={0}/test_128ch.coors \\
--grid_size=2,100,100 --verbose
# CRASH (256ch):
./infer --onnx={0}/test_256ch.onnx \\
--feature={0}/test_256ch.voxels \\
--indice={0}/test_256ch.coors \\
--grid_size=2,100,100 --verbose
""".format(args.out_dir))
Build & run
# 1. Build the infer tool
cd Lidar_AI_Solution/libraries/3DSparseConvolution
CUDA_HOME=/usr/local/cuda SPCONV_CUDA_VERSION=13.0 make workspace/infer -j
# 2. Generate test ONNX + tensor files
cd tool/pillarnet-export
python reproduce_128ch_crash.py --out-dir /tmp/libspconv_test
# 3. Test 128ch (should succeed)
cd ../../workspace
./infer --onnx=/tmp/libspconv_test/test_128ch.onnx \
--feature=/tmp/libspconv_test/test_128ch.voxels \
--indice=/tmp/libspconv_test/test_128ch.coors \
--grid_size=2,100,100 --verbose
# 4. Test 256ch (should segfault at engine build)
./infer --onnx=/tmp/libspconv_test/test_256ch.onnx \
--feature=/tmp/libspconv_test/test_256ch.voxels \
--indice=/tmp/libspconv_test/test_256ch.coors \
--grid_size=2,100,100 --verbose
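To rule out malformed input files as the cause of a crash, the `.voxels` and `.coors` files can be checked before running `./infer`. The following is a minimal sketch that only assumes the header layout written by `save_tensor` in the script above (magic, number of dimensions, dtype id, then one int32 per dimension); `read_tensor_header` and `DTYPE_SIZES` are hypothetical names, not part of libspconv.

```python
import struct

TENSOR_MAGIC = 0x33FF1101  # same magic value that save_tensor writes
# Element sizes in bytes for the dtype ids used in DTYPE_MAP (assumed from the script)
DTYPE_SIZES = {1: 4, 2: 2, 3: 4}  # int32, float16, float32


def read_tensor_header(filename):
    """Parse the header written by save_tensor and validate the payload size."""
    with open(filename, "rb") as f:
        magic, ndim, dtype_id = struct.unpack("<Iii", f.read(12))
        if magic != TENSOR_MAGIC:
            raise ValueError(f"bad magic 0x{magic:08x} in {filename}")
        shape = struct.unpack(f"<{ndim}i", f.read(4 * ndim))
        payload = f.read()
    # The payload must hold exactly prod(shape) elements of the declared dtype
    n_elems = 1
    for s in shape:
        n_elems *= s
    expected = n_elems * DTYPE_SIZES[dtype_id]
    if len(payload) != expected:
        raise ValueError(f"payload is {len(payload)} bytes, expected {expected}")
    return dtype_id, list(shape)
```

For the 256-channel case generated by the script, `read_tensor_header("/tmp/libspconv_test/test_256ch.voxels")` should report dtype id 2 (float16) and shape `[500, 256]` if the file was written as intended.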
Expected results
128ch (PASS):
Load inference task from arguments: /tmp/libspconv_test/test_128ch.onnx
...
Run inference task: /tmp/libspconv_test/test_128ch.onnx
Save output[0] to output0_2.tensor
Done inference task: /tmp/libspconv_test/test_128ch.onnx
256ch (CRASH):
[engine.cu:2596]: Engine build failure on SparseConvolution layer with 256 channels
Segmentation fault (core dumped)
Channel limit boundary
We tested systematically on our GB10:
| Channels | load_engine_from_onnx() |
| --- | --- |
| 64 | OK |
| 96 | OK |
| 128 | OK |
| 129 | SEGFAULT |
| 192 | SEGFAULT |
| 256 | SEGFAULT |
The hard boundary is exactly at 128 → 129 channels.
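A sweep like the one in the table can be automated by running `./infer` once per channel count and inspecting the exit status. The sketch below is not the harness we used; it relies on the POSIX behaviour of Python's `subprocess`, where a process killed by a signal reports a negative return code. `classify_exit`, `sweep`, and `infer_command` are hypothetical helper names, and the test files for each channel count are assumed to have been exported beforehand (the script above only exports 128 and 256).

```python
import signal
import subprocess


def classify_exit(returncode):
    """Map a subprocess return code to a short status label."""
    if returncode == 0:
        return "OK"
    if returncode == -signal.SIGSEGV:  # killed by SIGSEGV (POSIX convention)
        return "SEGFAULT"
    return f"FAILED({returncode})"


def sweep(channel_counts, make_command):
    """Run one command per channel count and collect its status."""
    results = {}
    for channels in channel_counts:
        proc = subprocess.run(make_command(channels), capture_output=True)
        results[channels] = classify_exit(proc.returncode)
    return results


def infer_command(channels, out_dir="/tmp/libspconv_test"):
    """Build the ./infer command line from the steps above for one test case."""
    prefix = f"{out_dir}/test_{channels}ch"
    return [
        "./infer",
        f"--onnx={prefix}.onnx",
        f"--feature={prefix}.voxels",
        f"--indice={prefix}.coors",
        "--grid_size=2,100,100",
        "--verbose",
    ]
```

Run from the `workspace` directory, `sweep([64, 96, 128, 129, 192, 256], infer_command)` returns a dictionary mapping each channel count to its status label.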
Workaround
We split the model into:
- conv1–conv3 (max 128ch) → libspconv engine (works)
- conv4 (256ch) → Dense Conv2d via TensorRT
This loses sparsity acceleration for the 256ch stage, which is the most computationally expensive part.
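The split point follows directly from the channel counts: every stage up to the first one above 128 channels stays in the libspconv engine, and everything from that stage onward runs dense. A minimal sketch of that partitioning, where `split_stages` and `MAX_SPARSE_CHANNELS` are hypothetical names and 128 is the limit we observed rather than a documented constant:

```python
MAX_SPARSE_CHANNELS = 128  # limit observed in our tests, not a documented constant


def split_stages(stages, limit=MAX_SPARSE_CHANNELS):
    """Split an ordered list of (name, channels) into a sparse prefix and a dense tail."""
    for i, (_, channels) in enumerate(stages):
        if channels > limit:
            # Once the tensor is densified, all later stages stay dense too
            return stages[:i], stages[i:]
    return stages, []


stages = [("conv1", 32), ("conv2", 64), ("conv3", 128), ("conv4", 256)]
sparse_part, dense_part = split_stages(stages)
print("libspconv:", sparse_part)
print("dense:", dense_part)
```

For the PillarNet-18 backbone this places conv1 through conv3 in the sparse part and conv4 in the dense part, matching the split described above.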
Request
Could you investigate and fix the 128-channel limit in engine.cu? Models like PillarNet-18, CenterPoint, and BEVFusion commonly use 256+ channel sparse convolutions. Lifting this limit would enable full sparse inference for these architectures.
Thank you again for the excellent work on libspconv and for the sm_120 support!