docs: add GitHub issue templates for experimental features

codenimja · codenimja · commit 6b82f2f855dd · 2025-11-02T00:43:18.000-04:00
- ISSUE_TASKGROUP_BUG.md: Nested async macro issues (blocking v0.3.0)
- ISSUE_MPSC_CHANNELS.md: Multi-producer channel implementation (high priority)
- ISSUE_NUMA_VALIDATION.md: Cross-socket performance testing (medium priority)

Ready to attract contributors with clear, actionable issues
diff --git a/.github/ISSUE_MPSC_CHANNELS.md b/.github/ISSUE_MPSC_CHANNELS.md
@@ -0,0 +1,95 @@
+# MPSC: Multi-Producer Single-Consumer Channels
+
+## Description
+Implement production-ready MPSC (Multi-Producer Single-Consumer) channels to enable multi-threaded actor systems and parallel workloads.
+
+## Current Status
+- **Module**: `src/nimsync/channels.nim` (SPSC only)
+- **MPSC**: ⚠️ Experimental / Not implemented
+- **Blocking**: Actor system, parallel processing
+
+## Why MPSC Matters
+Current SPSC (Single-Producer Single-Consumer) channels work great for:
+- ✅ Pipeline stages (one producer → one consumer)
+- ✅ Thread-to-thread communication
+- ✅ Lock-free performance (615M ops/sec)
+
+But many real-world patterns need multiple producers:
+- ❌ Multiple workers → single aggregator
+- ❌ Actor mailboxes (many senders → one actor)
+- ❌ Event bus patterns
+- ❌ Work-stealing schedulers
+
+## Technical Challenges
+MPSC is harder than SPSC because:
+1. **Contention**: Multiple producers need coordination
+2. **Lock-free is complex**: CAS operations, ABA problem
+3. **Performance**: Goal is <100ns P99 latency (vs SPSC's 31ns)
+
+## Design Options
+
+### Option 1: Lock-Based (Simplest)
+```nim
+import std/[deques, locks]
+
+type
+  MPSCChannel[T] = object
+    queue: Deque[T]
+    lock: Lock  # Guards `queue` for producers and the consumer (Deque is not thread-safe)
+```
+**Pros**: Easy to implement, correct by default  
+**Cons**: Lock contention under high load, not truly lock-free
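+
+To make Option 1 concrete, here is a minimal sketch, assuming Nim 2.x defaults (`--mm:orc`, threads enabled). `LockedMpsc`, `init`, `send`, and `tryRecv` are hypothetical names, not existing nimsync API. One lock guards the whole `Deque`, because the consumer mutates the same structure the producers append to.
+
+```nim
+import std/[deques, locks]
+
+type
+  LockedMpsc[T] = object        # hypothetical name, not part of nimsync
+    lock: Lock                  # guards `queue` for every thread
+    queue: Deque[T]
+
+proc init[T](ch: var LockedMpsc[T]) =
+  initLock(ch.lock)
+  ch.queue = initDeque[T]()
+
+proc send[T](ch: var LockedMpsc[T]; item: T) =
+  # Any number of producers may call this concurrently.
+  withLock ch.lock:
+    ch.queue.addLast(item)
+
+proc tryRecv[T](ch: var LockedMpsc[T]; item: var T): bool =
+  # Single consumer; returns false when the queue is empty.
+  withLock ch.lock:
+    if ch.queue.len > 0:
+      item = ch.queue.popFirst()
+      result = true
+
+proc producer(ch: ptr LockedMpsc[int]) {.thread.} =
+  for i in 1 .. 1000:
+    ch[].send(i)
+
+when isMainModule:
+  var ch: LockedMpsc[int]
+  ch.init()
+  var threads: array[4, Thread[ptr LockedMpsc[int]]]
+  for t in threads.mitems:
+    createThread(t, producer, addr ch)
+  joinThreads(threads)
+  var total, item = 0
+  while ch.tryRecv(item):
+    total += item
+  doAssert total == 4 * 500_500   # four producers, each sending 1..1000
+```
+
+A blocking `recv` would likely add a `Cond` next to the lock; that is left out here to keep the sketch short.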
+
+### Option 2: CAS-Based Lock-Free (Industry Standard)
+```nim
+# Based on the Michael-Scott queue or similar
+import std/atomics
+
+type
+  MPSCNode[T] = object
+    data: T
+    next: Atomic[ptr MPSCNode[T]]
+  
+  MPSCChannel[T] = object
+    head: Atomic[ptr MPSCNode[T]]  # Consumer only
+    tail: Atomic[ptr MPSCNode[T]]  # Producers compete with CAS
+```
+**Pros**: Lock-free, better scaling  
+**Cons**: Complex, ABA problem, memory management tricky
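+
+The CAS competition between producers can be illustrated with a simpler structure than the one above. This sketch is not the Michael-Scott queue: producers CAS-push onto a shared list head, and the single consumer detaches the whole list with one atomic exchange and reverses it to recover FIFO order. Because the consumer never pops individual nodes with CAS, this variant sidesteps the ABA problem. `BatchMpsc`, `push`, and `drain` are hypothetical names, and the code assumes Nim 2.x with `--mm:orc`.
+
+```nim
+import std/atomics
+
+type
+  Node[T] = object
+    data: T
+    next: ptr Node[T]
+
+  BatchMpsc[T] = object            # hypothetical name, not nimsync API
+    top: Atomic[ptr Node[T]]       # most recently pushed node
+
+proc push[T](ch: var BatchMpsc[T]; item: T) =
+  let node = createShared(Node[T])   # zero-initialised shared memory
+  node.data = item
+  var old = ch.top.load(moRelaxed)
+  while true:
+    node.next = old
+    # On failure `old` is refreshed with the current top, so just retry.
+    if ch.top.compareExchangeWeak(old, node, moRelease, moRelaxed):
+      break
+
+proc drain[T](ch: var BatchMpsc[T]): seq[T] =
+  # Consumer only: detach the whole list in one step.
+  var node = ch.top.exchange(nil, moAcquire)
+  # The list is newest-first; reverse it to restore push order.
+  var fifo: ptr Node[T] = nil
+  while node != nil:
+    let next = node.next
+    node.next = fifo
+    fifo = node
+    node = next
+  while fifo != nil:
+    let next = fifo.next
+    result.add(move fifo.data)
+    deallocShared(fifo)
+    fifo = next
+
+proc producer(ch: ptr BatchMpsc[int]) {.thread.} =
+  for i in 1 .. 1000:
+    ch[].push(i)
+
+when isMainModule:
+  var ch: BatchMpsc[int]
+  var threads: array[4, Thread[ptr BatchMpsc[int]]]
+  for t in threads.mitems:
+    createThread(t, producer, addr ch)
+  joinThreads(threads)
+  let items = ch.drain()
+  var total = 0
+  for x in items:
+    total += x
+  doAssert items.len == 4000 and total == 4 * 500_500
+```
+
+The trade-off is one heap allocation per message and batch-style consumption, so a bounded ring buffer may still be the better fit for the latency goals in this issue.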
+
+### Option 3: Hybrid Approach
+Use a lock-free fast path and fall back to a lock under contention (for example, after repeated CAS failures).
+
+## Acceptance Criteria
+- [ ] MPSC channel implementation passes all tests
+- [ ] Performance benchmarks:
+  - [ ] 2 producers: >400M ops/sec total throughput
+  - [ ] 8 producers: >300M ops/sec total throughput
+  - [ ] P99 latency <100ns
+  - [ ] Contention <10% under stress (for example, the share of sends that must retry a CAS or wait on a lock)
+- [ ] Memory safety verified (no leaks, no use-after-free)
+- [ ] Integration tests with actor system
+- [ ] Documentation with examples
+- [ ] Comparison benchmarks vs Go channels, Tokio mpsc
+
+## Reference Implementations
+- **Tokio MPSC**: https://github.com/tokio-rs/tokio/tree/master/tokio/src/sync/mpsc
+- **Crossbeam**: https://github.com/crossbeam-rs/crossbeam/tree/master/crossbeam-channel
+- **Go channels**: https://github.com/golang/go/blob/master/src/runtime/chan.go
+- **Michael-Scott Queue**: Classic lock-free MPMC queue algorithm; MPSC is a special case of it
+
+## Help Wanted
+**Skills needed**: Concurrent data structures, atomic operations, memory ordering, benchmarking
+
+**Resources**:
+- "The Art of Multiprocessor Programming" (Herlihy & Shavit)
+- Linux kernel's `kfifo` (lock-free only with one reader and one writer; multiple writers need external locking)
+- Chronos async internals for integration
+
+**Mentorship**: Available - @boonzy can provide guidance on nimsync architecture and benchmarking standards
+
+---
+
+- **Priority**: High 🔴 (enables actor system)
+- **Difficulty**: Very Hard 🔴🔴 (lock-free concurrency is complex)
+- **Impact**: Very High 🟢🟢 (unlocks entire actor ecosystem)
+
+## Bonus: MPMC Later
+After MPSC works, consider MPMC (Multi-Producer Multi-Consumer) for work-stealing schedulers. But MPSC is the critical path.
diff --git a/.github/ISSUE_NUMA_VALIDATION.md b/.github/ISSUE_NUMA_VALIDATION.md
@@ -0,0 +1,98 @@
+# NUMA: Cross-Socket Performance Validation
+
+## Description
+Validate and optimize nimsync's performance on NUMA (Non-Uniform Memory Access) architectures with multiple CPU sockets.
+
+## Current Status
+- **Testing**: ❌ Not validated on multi-socket systems
+- **Optimization**: ⚠️ Unknown if cache-line alignment helps/hurts across sockets
+- **Blocking**: Large server deployments (2+ socket systems)
+
+## Why NUMA Matters
+Modern servers often have multiple CPU sockets:
+- **2-socket systems**: AMD EPYC, Intel Xeon (common in cloud)
+- **4-socket systems**: High-end servers
+- **8+ socket systems**: Specialized HPC
+
+NUMA introduces memory access latency differences:
+- **Local memory**: ~70ns access time
+- **Remote socket**: ~140ns access time (2x slower!)
+- **Cache effects**: Cross-socket cache coherency traffic
+
+## Current Unknowns
+1. **Does SPSC work well across sockets?**
+   - If producer on socket 0, consumer on socket 1, does 615M ops/sec hold?
+   - Or does it degrade to 100M ops/sec due to remote memory access?
+
+2. **Is cache-line alignment (64 bytes) optimal?**
+   - Current padding prevents false sharing on single socket
+   - But does it cause excessive cache coherency traffic on NUMA?
+
+3. **Should we pin threads to cores?**
+   - Prevents migration across sockets
+   - But reduces OS flexibility
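+
+For the pinning question, a cross-socket test probably has to pin two threads inside one process, since the channel lives in shared memory. The sketch below uses `pinToCpu` from Nim's thread API. The CPU ids `0` and `1` are placeholders that would be replaced with cores on different sockets, and the shared atomic counter is only a stand-in for a nimsync channel.
+
+```nim
+import std/atomics
+
+var counter: Atomic[int]   # stand-in for the channel under test
+
+proc worker(iterations: int) {.thread.} =
+  for _ in 1 .. iterations:
+    counter.atomicInc()
+
+when isMainModule:
+  var producer, consumer: Thread[int]
+  createThread(producer, worker, 100_000)
+  createThread(consumer, worker, 100_000)
+  # Placeholder CPU ids: pick one core per socket on the target machine.
+  pinToCpu(producer, 0)
+  pinToCpu(consumer, 1)
+  joinThreads(producer, consumer)
+  doAssert counter.load() == 200_000
+```
+
+Pinning after `createThread` means the first few iterations may run unpinned; a real benchmark would likely add a start barrier before measuring.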
+
+## Testing Needed
+### Hardware
+- Access to 2+ socket AMD EPYC or Intel Xeon system
+- `numactl` for thread/memory pinning
+- Hardware performance counters (perf)
+
+### Benchmarks
+```bash
+# Same socket (baseline)
+numactl --cpunodebind=0 --membind=0 ./benchmark_spsc_simple
+
+# Cross socket (worst case)
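+# Producer on socket 0, consumer on socket 1.
+# CPU 64 is assumed to sit on socket 1; confirm the topology with `lscpu` or `numactl --hardware`.
+taskset -c 0 ./producer & taskset -c 64 ./consumer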
+
+# Measure:
+# - Throughput degradation
+# - Latency increase
+# - Cache miss rates (perf stat -e LLC-load-misses)
+```
+
+## Expected Outcomes
+1. **Quantify NUMA penalty**: "Cross-socket reduces throughput by X%"
+2. **Optimization guide**: "For best performance on NUMA, do Y"
+3. **Code changes if needed**: 
+   - NUMA-aware allocation (`numa_alloc_onnode`)
+   - Socket-specific optimizations
+   - Documentation on thread pinning
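+
+The "reduces throughput by X%" figure from outcome 1 could be computed as below. `numaPenaltyPct` is a hypothetical helper, and the inputs are the hypothetical 615M and 100M ops/sec values from the questions earlier in this issue, not measurements.
+
+```nim
+func numaPenaltyPct(sameSocketOps, crossSocketOps: float): float =
+  ## Throughput lost when moving from same-socket to cross-socket, in percent.
+  (sameSocketOps - crossSocketOps) / sameSocketOps * 100.0
+
+when isMainModule:
+  # Hypothetical inputs, not measured results.
+  let penalty = numaPenaltyPct(615e6, 100e6)
+  doAssert abs(penalty - 83.74) < 0.01
+```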
+
+## Acceptance Criteria
+- [ ] Benchmarks run on 2-socket system
+- [ ] Document same-socket vs cross-socket performance
+- [ ] Recommendations for NUMA deployments
+- [ ] (Optional) NUMA-aware channel allocation API
+- [ ] CI tests on NUMA hardware (if available)
+
+## Reference Implementations
+- **DPDK**: Heavily NUMA-optimized, good patterns to study
+- **ScyllaDB**: Sharded architecture for NUMA
+- **LMAX Disruptor**: NUMA considerations in ring buffer
+
+## Help Wanted
+**Skills needed**: NUMA architecture understanding, systems programming, performance analysis
+
+**Resources**:
+- `man numa` and `man numactl`
+- Intel's NUMA optimization guide
+- AMD EPYC tuning guide
+
+**Hardware access**: This is the blocker - need access to multi-socket system for testing
+
+---
+
+- **Priority**: Medium 🟡 (not blocking single-socket deployments)
+- **Difficulty**: Medium 🟡 (testing complexity, not implementation)
+- **Impact**: Medium 🟡 (only affects large server deployments)
+
+## Current Workaround
+For now, users on NUMA systems should:
+- Pin producer/consumer to same socket
+- Use one channel per socket
+- Benchmark their specific workload
+
+But proper validation and docs would be better!
diff --git a/.github/ISSUE_TASKGROUP_BUG.md b/.github/ISSUE_TASKGROUP_BUG.md
@@ -0,0 +1,64 @@
+# TaskGroup: Nested async macros fail
+
+## Description
+The TaskGroup implementation has bugs with nested async macro contexts, which prevents it from being exported in the public API.
+
+## Current Status
+- **Module**: `src/nimsync/group.nim`
+- **Exported**: ❌ No (commented out in `src/nimsync.nim`)
+- **Blocking**: v0.3.0 release
+
+## Problem Details
+Nested async macros fail when TaskGroup tries to coordinate multiple async operations. The macro expansion doesn't properly handle nested contexts.
+
+## Expected Behavior
+```nim
+import nimsync
+
+proc example() {.async.} =
+  var group = newTaskGroup()
+  
+  group.spawn:
+    await someAsyncOp()
+  
+  group.spawn:
+    await anotherAsyncOp()
+  
+  await group.wait()  # Should wait for all tasks
+```
+
+## Current Behavior
+- Macro expansion errors in nested async contexts
+- Compilation fails with async macro nesting
+- Not safe to export publicly
+
+## Impact
+- **Structured concurrency** unavailable
+- Users must manually track async operations
+- Blocks adoption of modern async patterns
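+
+For comparison, manual tracking looks roughly like the sketch below. It uses `std/asyncdispatch` only to stay dependency-free; nimsync builds on Chronos, where `allFutures` plays a similar role. The two operations are hypothetical stand-ins that sleep briefly.
+
+```nim
+import std/asyncdispatch
+
+var done = 0
+
+proc someAsyncOp() {.async.} =      # hypothetical stand-in
+  await sleepAsync(10)
+  inc done
+
+proc anotherAsyncOp() {.async.} =   # hypothetical stand-in
+  await sleepAsync(20)
+  inc done
+
+proc example() {.async.} =
+  # Manual tracking: keep every future and wait on all of them explicitly.
+  let pending = @[someAsyncOp(), anotherAsyncOp()]
+  await all(pending)
+
+when isMainModule:
+  waitFor example()
+  doAssert done == 2
+```
+
+Every call site has to remember to collect and await its futures, which is the bookkeeping a working TaskGroup would take over.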
+
+## Acceptance Criteria
+- [ ] TaskGroup works with nested async macros
+- [ ] All tests pass in `tests/unit/test_taskgroup.nim`
+- [ ] Can be exported in public API
+- [ ] Documentation updated with examples
+- [ ] Benchmark shows <5% overhead vs manual coordination
+
+## Related Issues
+- Linked to the MPSC implementation (needs TaskGroup for coordination)
+- Blocks the actor system (requires task groups for supervision)
+
+## Help Wanted
+**Skills needed**: Nim macro system, async/await internals, Chronos knowledge
+
+**Resources**:
+- Chronos async internals: https://github.com/status-im/nim-chronos
+- Nim macro docs: https://nim-lang.org/docs/manual.html#macros
+
+**Mentorship**: Available - @boonzy can provide guidance on codebase architecture
+
+---
+
+- **Priority**: High 🔴 (blocking v0.3.0)
+- **Difficulty**: Hard 🔴 (requires deep Nim macro knowledge)
+- **Impact**: High 🟢 (enables structured concurrency)