System: macOS / Electron / WebGPU Date: 2025-11-10 Test Method: Binary search with completion flag detection
This GPU has severe concurrent execution limits that affect production workloads:
- Workgroup Dimension Limit: Any dimension ≥32 fails (max(X,Y,Z) ≤ 16)
- Total Thread Limit: ~256-384 concurrent threads across all workgroups
- Production Impact: Batched radial rasterization exceeds these limits
Method: Single workgroup dispatch with varying dimensions and work intensity
- 16×16×1 (256 threads): Can handle 1 billion tests/thread ✓
- 32×32×1 (1024 threads): Fails even at minimum work ❌
- 32×8×1 (256 threads): Fails despite same thread count as 16×16×1 ❌
Conclusion: Limit is max(X, Y, Z) ≤ 16, not total thread count.
Method: Multiple workgroups dispatched simultaneously with moderate work (1000 tests/thread)
| Workgroup Size | Threads/WG | Max Workgroups | Total Threads | Status |
|---|---|---|---|---|
| 16×16×1 | 256 | 1 | 256 | ✓ |
| 8×8×1 | 64 | 4 | 256 | ✓ |
| 4×4×1 | 16 | 24 | 384 | ✓ |
Conclusion: Limit is ~256-384 total concurrent threads across all workgroups in a single dispatch.
✓ 24 workgroups: 384 threads in 2.8ms
❌ 25 workgroups: 400 threads in 2.7ms (4 failed)
❌ 27 workgroups: 432 threads in 2.7ms (20 failed)
The limit appears to be around 384-400 threads total.
// path-radial.js - Current batching strategy
const anglesPerBatch = 360; // Process 360 angles at once
passEncoder.dispatchWorkgroups(angleDispatchSize, bucketCount, 1);Problem: For a model with 100 X-buckets and 1 batch (360 angles):
- Dispatches: 360 workgroups × 100 = 36,000 workgroups
- Threads: 36,000 × 256 = 9,216,000 concurrent threads
- This exceeds the ~384 thread limit by 24,000×
The observed "flaky" behavior where re-running produces different results is likely due to:
- GPU thermal throttling
- OS scheduler preemption
- Other GPU workloads interfering
- Watchdog timer variance
Instead of dispatching all workgroups at once, break into chunks:
// Process in batches of workgroups that stay under thread limit
const MAX_CONCURRENT_THREADS = 256; // Conservative
const WORKGROUP_SIZE = 256;
const maxWorkgroupsPerDispatch = Math.floor(MAX_CONCURRENT_THREADS / WORKGROUP_SIZE);
for (let batch = 0; batch < totalWorkgroups; batch += maxWorkgroupsPerDispatch) {
const count = Math.min(maxWorkgroupsPerDispatch, totalWorkgroups - batch);
passEncoder.dispatchWorkgroups(count, 1, 1);
}Reduce from 16×16×1 to 8×8×1 to allow 4× more concurrent dispatches:
// Current: 16×16×1 = 256 threads, max 1 workgroup
// Better: 8×8×1 = 64 threads, max 4 workgroups
@compute @workgroup_size(8, 8, 1)Add auto-calibration as config parameter:
const raster = new RasterPath({
mode: 'radial',
autoCalibrate: true, // NEW: Run calibration on init
resolution: 0.1,
});
await raster.init();
// Calibration runs automatically, stores results in localStorage// After calibration, use detected limits
const { maxConcurrentThreads, maxWorkgroupSize } = raster.gpuLimits;
const threadsPerWorkgroup = maxWorkgroupSize[0] * maxWorkgroupSize[1];
const maxWorkgroupsPerDispatch = Math.floor(maxConcurrentThreads / threadsPerWorkgroup);
// Batch angle processing to stay under limit
const anglesPerBatch = Math.min(360, maxWorkgroupsPerDispatch);Problem: GPU watchdog kills are silent - threads just stop executing.
Solution: Completion flag detection
- Initialize output buffer to all zeros
- Each thread writes 1 on completion
- Any remaining zeros = that thread was killed by watchdog
@compute @workgroup_size(16, 16, 1)
fn main(@builtin(global_invocation_id) global_id: vec3<u32>) {
// ... do work ...
// Write completion flag
completion_flags[thread_index] = 1u;
}// CPU side: check for failures
for (let i = 0; i < totalThreads; i++) {
if (completionData[i] === 0) {
failedThreads++; // This thread was killed
}
}- Integrate calibration into RasterPath API (src/core/raster-path.js)
- Add localStorage caching to avoid re-running calibration
- Update radial batching to use calibrated limits (path-radial.js)
- Add fallback behavior for systems with extreme limits
- Test on multiple GPUs to gather more data points
Run calibration tests:
# Workgroup dimension limits
npm run test:calibrate
# Dispatch count limits
npm run test:calibrateEdit src/test/calibrate-test.cjs to configure:
calibrationType: 'workgroup' or 'dispatch'workgroupSize: [X, Y, Z]triangleTests: Work intensity per threadmaxDispatch: Maximum workgroups to test
maxComputeWorkgroupSizeX: 256
maxComputeWorkgroupSizeY: 256
maxComputeWorkgroupSizeZ: 64
maxComputeWorkgroupsPerDimension: 65535
Note: Hardware reports support for much larger workgroups (256×256), but watchdog limits are far more restrictive in practice.