
Commit 6b82f2f

docs: add GitHub issue templates for experimental features
- ISSUE_TASKGROUP_BUG.md: Nested async macro issues (blocking v0.3.0)
- ISSUE_MPSC_CHANNELS.md: Multi-producer channel implementation (high priority)
- ISSUE_NUMA_VALIDATION.md: Cross-socket performance testing (medium priority)

Ready to attract contributors with clear, actionable issues
1 parent 78667f8 commit 6b82f2f

3 files changed

Lines changed: 257 additions & 0 deletions


.github/ISSUE_MPSC_CHANNELS.md

Lines changed: 95 additions & 0 deletions
@@ -0,0 +1,95 @@
1+
# MPSC: Multi-Producer Single-Consumer Channels
2+
3+
## Description
4+
Implement production-ready MPSC (Multi-Producer Single-Consumer) channels to enable multi-threaded actor systems and parallel workloads.
5+
6+
## Current Status
7+
- **Module**: `src/nimsync/channels.nim` (SPSC only)
8+
- **MPSC**: ⚠️ Experimental / Not implemented
9+
- **Blocking**: Actor system, parallel processing
10+
11+
## Why MPSC Matters
12+
Current SPSC (Single-Producer Single-Consumer) channels work great for:
13+
- ✅ Pipeline stages (one producer → one consumer)
14+
- ✅ Thread-to-thread communication
15+
- ✅ Lock-free performance (615M ops/sec)
16+
17+
But many real-world patterns need multiple producers:
18+
- ❌ Multiple workers → single aggregator
19+
- ❌ Actor mailboxes (many senders → one actor)
20+
- ❌ Event bus patterns
21+
- ❌ Work-stealing schedulers
22+
23+
## Technical Challenges
24+
MPSC is harder than SPSC because:
25+
1. **Contention**: Multiple producers need coordination
26+
2. **Lock-free is complex**: CAS operations, ABA problem
27+
3. **Performance**: Goal is <100ns P99 latency (vs SPSC's 31ns)
28+
29+
## Design Options
30+
31+
### Option 1: Lock-Based (Simplest)
32+
```nim
import std/[atomics, deques, locks]

type
  MPSCChannel[T] = object
    queue: Deque[T]
    lock: Lock  # Guards `queue`; std Deque is not thread-safe, so the consumer must take it too
    consumerHead: Atomic[int]
```
39+
**Pros**: Easy to implement and to reason about
40+
**Cons**: Lock contention under high load, not truly lock-free
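
To make Option 1 concrete, here is a minimal sketch of a lock-based channel. It is not nimsync code: `LockedMpsc`, `init`, `send`, and `tryRecv` are hypothetical names, and a single lock guards both ends of the queue. It assumes Nim 2.x defaults (threads enabled, ARC/ORC), and the usage shown runs on one thread only to illustrate FIFO behaviour.

```nim
import std/[deques, locks]

type
  LockedMpsc[T] = object        # hypothetical name, not part of nimsync
    lock: Lock
    queue: Deque[T]

proc init[T](ch: var LockedMpsc[T]) =
  initLock(ch.lock)
  ch.queue = initDeque[T]()

proc send[T](ch: var LockedMpsc[T], item: T) =
  # Any number of producers may call this; the lock serialises them.
  withLock ch.lock:
    ch.queue.addLast(item)

proc tryRecv[T](ch: var LockedMpsc[T], item: var T): bool =
  # The single consumer takes the same lock because Deque is not thread-safe.
  withLock ch.lock:
    if ch.queue.len > 0:
      item = ch.queue.popFirst()
      result = true

var ch: LockedMpsc[int]
ch.init()
ch.send(1)
ch.send(2)

var first, second, extra: int
let gotFirst = ch.tryRecv(first)
let gotSecond = ch.tryRecv(second)
let gotExtra = ch.tryRecv(extra)   # queue is empty by now
```

Every `send` and `tryRecv` passes through the same lock, which is where the contention cost listed under Cons comes from.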
41+
42+
### Option 2: CAS-Based Lock-Free (Industry Standard)
43+
```nim
# Based on Michael-Scott queue or similar
import std/atomics

type
  MPSCNode[T] = object
    data: T
    next: Atomic[ptr MPSCNode[T]]

  MPSCChannel[T] = object
    head: Atomic[ptr MPSCNode[T]]  # Consumer only
    tail: Atomic[ptr MPSCNode[T]]  # Producers compete with CAS
```
54+
**Pros**: Lock-free, better scaling
55+
**Cons**: Complex, ABA problem, memory management tricky
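
As a rough illustration of Option 2, the sketch below follows Dmitry Vyukov's node-based MPSC queue rather than Michael-Scott. Producers claim the tail with one atomic `exchange` instead of a CAS retry loop, and only the consumer frees nodes. All names are hypothetical, the payload is fixed to `int` to keep manual memory management simple, and the usage runs on one thread to show ordering only.

```nim
import std/atomics

type
  Node = object
    next: Atomic[ptr Node]
    data: int                 # plain payload; GC-managed types need more care

  MpscQueue = object          # hypothetical name, not part of nimsync
    head: ptr Node            # consumer only: points at the current stub node
    tail: Atomic[ptr Node]    # producers swap themselves in here

proc init(q: var MpscQueue) =
  let stub = createShared(Node)   # zero-initialised, so stub.next is nil
  q.head = stub
  q.tail.store(stub, moRelaxed)

proc push(q: var MpscQueue, item: int) =
  let node = createShared(Node)
  node.data = item
  let prev = q.tail.exchange(node, moAcquireRelease)  # claim the tail slot
  prev.next.store(node, moRelease)                    # publish the link

proc tryPop(q: var MpscQueue, item: var int): bool =
  let stub = q.head
  let next = stub.next.load(moAcquire)
  if next.isNil:
    return false              # empty, or a producer has not linked its node yet
  item = next.data
  q.head = next               # `next` becomes the new stub
  deallocShared(stub)
  true

proc destroy(q: var MpscQueue) =
  var scratch: int
  while q.tryPop(scratch): discard
  deallocShared(q.head)

var q: MpscQueue
q.init()
for i in 1 .. 3: q.push(i)

var got: seq[int]
var item: int
while q.tryPop(item): got.add item
q.destroy()
```

One trade-off to note: a producer that is preempted between the `exchange` and the `store` temporarily hides later nodes from the consumer, so `tryPop` can report empty while items are in flight.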
56+
57+
### Option 3: Hybrid Approach
58+
Lock-free fast path, fallback to lock on contention
59+
60+
## Acceptance Criteria
61+
- [ ] MPSC channel implementation passes all tests
62+
- [ ] Performance benchmarks:
63+
- [ ] 2 producers: >400M ops/sec total throughput
64+
- [ ] 8 producers: >300M ops/sec total throughput
65+
- [ ] P99 latency <100ns
66+
- [ ] Contention <10% under stress
67+
- [ ] Memory safety verified (no leaks, no use-after-free)
68+
- [ ] Integration tests with actor system
69+
- [ ] Documentation with examples
70+
- [ ] Comparison benchmarks vs Go channels, Tokio mpsc
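
The P99 target above needs a consistent definition. The following is a sketch of a nearest-rank percentile over recorded latencies; `percentile` is a hypothetical helper and the sample data is synthetic (1..100 ns), not a measurement.

```nim
import std/algorithm

proc percentile(samples: openArray[int], pct: range[1..100]): int =
  ## Nearest-rank percentile of a non-empty sample set.
  doAssert samples.len > 0
  var sorted = @samples
  sorted.sort()
  let rank = (pct * sorted.len + 99) div 100   # integer ceil(pct/100 * n)
  sorted[rank - 1]

# Synthetic latencies in nanoseconds, for illustration only.
var latenciesNs: seq[int]
for i in 1 .. 100: latenciesNs.add i

let p99 = percentile(latenciesNs, 99)
let meetsTarget = p99 < 100
```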
71+
72+
## Reference Implementations
73+
- **Tokio MPSC**: https://github.com/tokio-rs/tokio/tree/master/tokio/src/sync/mpsc
74+
- **Crossbeam**: https://github.com/crossbeam-rs/crossbeam/tree/master/crossbeam-channel
75+
- **Go channels**: https://github.com/golang/go/blob/master/src/runtime/chan.go
76+
- **Michael-Scott Queue**: Classic lock-free MPMC queue algorithm; it also covers the MPSC case
77+
78+
## Help Wanted
79+
**Skills needed**: Concurrent data structures, atomic operations, memory ordering, benchmarking
80+
81+
**Resources**:
82+
- "The Art of Multiprocessor Programming" (Herlihy & Shavit)
83+
- Linux kernel's `kfifo` ring buffer (lock-free only with one reader and one writer; multiple producers need external locking)
84+
- Chronos async internals for integration
85+
86+
**Mentorship**: Available - @boonzy can provide guidance on nimsync architecture and benchmarking standards
87+
88+
---
89+
90+
**Priority**: High 🔴 (enables actor system)
91+
**Difficulty**: Very Hard 🔴🔴 (lock-free concurrency is complex)
92+
**Impact**: Very High 🟢🟢 (unlocks entire actor ecosystem)
93+
94+
## Bonus: MPMC Later
95+
After MPSC works, consider MPMC (Multi-Producer Multi-Consumer) for work-stealing schedulers. But MPSC is the critical path.

.github/ISSUE_NUMA_VALIDATION.md

Lines changed: 98 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,98 @@
1+
# NUMA: Cross-Socket Performance Validation
2+
3+
## Description
4+
Validate and optimize nimsync's performance on NUMA (Non-Uniform Memory Access) architectures with multiple CPU sockets.
5+
6+
## Current Status
7+
- **Testing**: ❌ Not validated on multi-socket systems
8+
- **Optimization**: ⚠️ Unknown if cache-line alignment helps/hurts across sockets
9+
- **Blocking**: Large server deployments (2+ socket systems)
10+
11+
## Why NUMA Matters
12+
Modern servers often have multiple CPU sockets:
13+
- **2-socket systems**: AMD EPYC, Intel Xeon (common in cloud)
14+
- **4-socket systems**: High-end servers
15+
- **8+ socket systems**: Specialized HPC
16+
17+
NUMA introduces memory access latency differences:
18+
- **Local memory**: ~70ns access time
19+
- **Remote socket**: ~140ns access time (2x slower!)
20+
- **Cache effects**: Cross-socket cache coherency traffic
21+
22+
## Current Unknowns
23+
1. **Does SPSC work well across sockets?**
24+
- If producer on socket 0, consumer on socket 1, does 615M ops/sec hold?
25+
- Or does it degrade to 100M ops/sec due to remote memory access?
26+
27+
2. **Is cache-line alignment (64 bytes) optimal?**
28+
- Current padding prevents false sharing on single socket
29+
- But does it cause excessive cache coherency traffic on NUMA?
30+
31+
3. **Should we pin threads to cores?**
32+
- Prevents migration across sockets
33+
- But reduces OS flexibility
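
For the pinning question, Nim's standard threads API already exposes `pinToCpu`. The sketch below pins two worker threads to given CPU ids; `runPinned` and `worker` are hypothetical names. Which CPU ids belong to which socket depends on the machine, so the ids passed in are an assumption to check against `lscpu`. On macOS `pinToCpu` has no effect.

```nim
# Nim 2.x enables threads by default; on Nim 1.x compile with --threads:on.
var ranOn: array[2, bool]   # plain (non-GC) global, safe to touch from threads

proc worker(slot: int) {.thread.} =
  # Stand-in for a producer or consumer loop.
  ranOn[slot] = true

proc runPinned(cpus: array[2, int]) =
  var threads: array[2, Thread[int]]
  for i in 0 .. 1:
    createThread(threads[i], worker, i)
    pinToCpu(threads[i], cpus[i])   # restrict the thread to one CPU
  joinThreads(threads)

# Both workers on CPU 0 here so the sketch runs on any machine;
# a cross-socket test would pass one CPU id from each socket.
runPinned([0, 0])
```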
34+
35+
## Testing Needed
36+
### Hardware
37+
- Access to 2+ socket AMD EPYC or Intel Xeon system
38+
- `numactl` for thread/memory pinning
39+
- Hardware performance counters (perf)
40+
41+
### Benchmarks
42+
```bash
# Same socket (baseline)
numactl --cpunodebind=0 --membind=0 ./benchmark_spsc_simple

# Cross socket (worst case)
# Producer on socket 0, consumer on socket 1.
# CPU 64 is assumed to sit on socket 1; confirm with `numactl --hardware` or `lscpu`.
taskset -c 0 ./producer & taskset -c 64 ./consumer

# Measure:
# - Throughput degradation
# - Latency increase
# - Cache miss rates (perf stat -e LLC-load-misses)
```
55+
56+
## Expected Outcomes
57+
1. **Quantify NUMA penalty**: "Cross-socket reduces throughput by X%"
58+
2. **Optimization guide**: "For best performance on NUMA, do Y"
59+
3. **Code changes if needed**:
60+
- NUMA-aware allocation (`numa_alloc_onnode`)
61+
- Socket-specific optimizations
62+
- Documentation on thread pinning
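
The "X%" in outcome 1 can be computed as below. The inputs are the hypothetical figures from Current Unknowns (615M ops/sec same-socket, 100M ops/sec cross-socket), not measurements, and `throughputLossPercent` is an illustrative helper name.

```nim
proc throughputLossPercent(baselineOps, crossSocketOps: float): float =
  ## Share of same-socket throughput lost when crossing sockets, in percent.
  (baselineOps - crossSocketOps) / baselineOps * 100.0

# Hypothetical figures taken from the questions above, not benchmark results.
let loss = throughputLossPercent(615e6, 100e6)

# Latency ratio from the approximate numbers quoted earlier (~140 ns vs ~70 ns).
let latencyRatio = 140.0 / 70.0
```

With those inputs the loss comes out at roughly 83.7%, which is the kind of single headline number the optimization guide could report.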
63+
64+
## Acceptance Criteria
65+
- [ ] Benchmarks run on 2-socket system
66+
- [ ] Document same-socket vs cross-socket performance
67+
- [ ] Recommendations for NUMA deployments
68+
- [ ] (Optional) NUMA-aware channel allocation API
69+
- [ ] CI tests on NUMA hardware (if available)
70+
71+
## Reference Implementations
72+
- **DPDK**: Heavily NUMA-optimized, good patterns to study
73+
- **ScyllaDB**: Sharded architecture for NUMA
74+
- **LMAX Disruptor**: NUMA considerations in ring buffer
75+
76+
## Help Wanted
77+
**Skills needed**: NUMA architecture understanding, systems programming, performance analysis
78+
79+
**Resources**:
80+
- `man numa` and `man numactl`
81+
- Intel's NUMA optimization guide
82+
- AMD EPYC tuning guide
83+
84+
**Hardware access**: This is the blocker - need access to multi-socket system for testing
85+
86+
---
87+
88+
**Priority**: Medium 🟡 (not blocking single-socket deployments)
89+
**Difficulty**: Medium 🟡 (testing complexity, not implementation)
90+
**Impact**: Medium 🟡 (only affects large server deployments)
91+
92+
## Current Workaround
93+
For now, users on NUMA systems should:
94+
- Pin producer/consumer to same socket
95+
- Use one channel per socket
96+
- Benchmark their specific workload
97+
98+
But proper validation and docs would be better!

.github/ISSUE_TASKGROUP_BUG.md

Lines changed: 64 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,64 @@
1+
# TaskGroup: Nested async macros fail
2+
3+
## Description
4+
TaskGroup implementation has bugs with nested async macro contexts, preventing it from being exported in the public API.
5+
6+
## Current Status
7+
- **Module**: `src/nimsync/group.nim`
8+
- **Exported**: ❌ No (commented out in `src/nimsync.nim`)
9+
- **Blocking**: v0.3.0 release
10+
11+
## Problem Details
12+
Nested async macros fail when TaskGroup tries to coordinate multiple async operations. The macro expansion doesn't properly handle nested contexts.
13+
14+
## Expected Behavior
15+
```nim
import nimsync

proc example() {.async.} =
  var group = newTaskGroup()

  group.spawn:
    await someAsyncOp()

  group.spawn:
    await anotherAsyncOp()

  await group.wait()  # Should wait for all tasks
```
29+
30+
## Current Behavior
31+
- Macro expansion errors in nested async contexts
32+
- Compilation fails with async macro nesting
33+
- Not safe to export publicly
34+
35+
## Impact
36+
- **Structured concurrency** unavailable
37+
- Users must manually track async operations
38+
- Blocks adoption of modern async patterns
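
Until TaskGroup is exported, the manual tracking mentioned above looks roughly like this. To stay self-contained, the sketch uses the standard library's `std/asyncdispatch` rather than Chronos or nimsync; `job` and `main` are hypothetical names.

```nim
import std/asyncdispatch

proc job(id: int): Future[int] {.async.} =
  await sleepAsync(10)        # stand-in for real async work
  return id * 2

proc main(): Future[seq[int]] {.async.} =
  # Manual bookkeeping: collect every future, then wait for all of them.
  var pending: seq[Future[int]]
  for i in 1 .. 3:
    pending.add job(i)
  return await all(pending)   # results keep the order of `pending`

let results = waitFor main()
```

The caller has to remember to collect and await every future, which is the bookkeeping a working TaskGroup would take over.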
39+
40+
## Acceptance Criteria
41+
- [ ] TaskGroup works with nested async macros
42+
- [ ] All tests pass in `tests/unit/test_taskgroup.nim`
43+
- [ ] Can be exported in public API
44+
- [ ] Documentation updated with examples
45+
- [ ] Benchmark shows <5% overhead vs manual coordination
46+
47+
## Related Issues
48+
- Linked to MPSC implementation (needs TaskGroup for coordination)
49+
- Blocks actor system (requires task groups for supervision)
50+
51+
## Help Wanted
52+
**Skills needed**: Nim macro system, async/await internals, Chronos knowledge
53+
54+
**Resources**:
55+
- Chronos async internals: https://github.com/status-im/nim-chronos
56+
- Nim macro docs: https://nim-lang.org/docs/manual.html#macros
57+
58+
**Mentorship**: Available - @boonzy can provide guidance on codebase architecture
59+
60+
---
61+
62+
**Priority**: High 🔴 (blocking v0.3.0)
63+
**Difficulty**: Hard 🔴 (requires deep Nim macro knowledge)
64+
**Impact**: High 🟢 (enables structured concurrency)
