Exp function lacks SIMD implementation - requires AVX optimization

## Problem

The `Exp` (exponential) function currently only has a Pure Go implementation using `math.Exp`. It lacks SIMD optimizations for both AMD64 and ARM64 platforms.

**Current status:**
- ✅ f32/Exp and f64/Exp exposed in public API
- ✅ Pure Go implementation with overflow protection
- ❌ No AVX implementation (AMD64)
- ❌ No NEON implementation (ARM64)

## Implementation Details

### Current Pure Go Code

```go
import "math"

// exp32Go computes e^x element-wise, saturating out-of-range inputs.
// Inputs above 88 return +Inf and inputs below -88 return 0
// (the f64 variant uses ±709).
func exp32Go(dst, src []float32) {
    for i := range dst {
        x := src[i]
        // Saturate extreme values instead of calling math.Exp
        if x > 88.0 {
            dst[i] = float32(math.Inf(1)) // math.Inf returns a float64
        } else if x < -88.0 {
            dst[i] = 0
        } else {
            dst[i] = float32(math.Exp(float64(x)))
        }
    }
}
```
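
The ±88 and ±709 limits are rounded-down versions of the largest inputs whose result is still finite: ln(MaxFloat32) ≈ 88.72 and ln(MaxFloat64) ≈ 709.78. The sketch below derives those cut-offs from Go's `math` constants, in case a SIMD implementation wants exact thresholds rather than the rounded ones. The function names are hypothetical and not part of the codebase.

```go
package main

import "math"

// overflowLimit32 returns the largest x for which e^x still fits in a float32.
// Hypothetical helper, shown only to explain where the ±88 limit comes from.
func overflowLimit32() float64 {
	return math.Log(math.MaxFloat32) // ≈ 88.72
}

// overflowLimit64 is the float64 counterpart (≈ 709.78).
func overflowLimit64() float64 {
	return math.Log(math.MaxFloat64)
}
```

With the current limits, inputs between 88.0 and roughly 88.72 return +Inf even though their true result is representable in float32. Whether that matters depends on how closely the SIMD path needs to match `math.Exp`.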

## Performance Opportunity

Based on benchmarks of similar activation functions:
- **AMD64 AVX**: Potential 10-30x speedup
- **ARM64 NEON**: Potential 5-15x speedup

### Reference Performance (Sigmoid at 1024 elements)
- AMD64 AVX: **43x speedup** @ 59.3 GB/s
- ARM64 NEON: Better throughput with vector operations

## Implementation Requirements

### AMD64 AVX (f32)

```
- Load source values into YMM registers (8x float32)
- Clamp values to ±88.0 using VCMPPS + VBLENDVPS
- Compute e^x using polynomial approximation or exp2 + scale
- Handle out-of-range inputs: +Inf above 88.0, 0 below -88.0
- Store results with VMOVUPS
- Include scalar remainder handling
```

### AMD64 AVX (f64)

```
- Similar to f32 but with 4x float64 per YMM register
- Clamp values to ±709.0
- Compute e^x with higher precision
```

### ARM64 NEON (both f32 and f64)

```
- Vector clamping with FCMGT (operands swapped for the less-than case) or FMAX/FMIN
- Polynomial approximation for e^x
- Saturate extreme values: +Inf on overflow, 0 on underflow
- Scalar remainder handling
```

## Math Approximation

For SIMD implementations, consider:

1. **Polynomial approximation** (fast, good accuracy)
   ```
   e^x ≈ 1 + x + x²/2! + x³/3! + ... + x^n/n!
   ```
   Evaluate with Horner's method, and reduce the range of x first: the truncated series is only accurate for small |x|

2. **Exp2-based approach** (alternative)
   ```
   e^x = 2^(x / ln(2))
   Decompose x / ln(2) = k + f where k is an integer and 0 ≤ f < 1
   e^x = 2^k * 2^f
   ```

3. **Use existing SIMD exp functions** (if available in SVML or similar)

## References

- [math.Exp documentation](https://golang.org/pkg/math/#Exp)
- [ARM NEON floating point operations](https://developer.arm.com/architectures/instruction-sets/simd-isas/neon)
- [Agner Fog's x86 optimization manuals and instruction tables](https://www.agner.org/optimize/)
- [Sigmoid/Tanh SIMD implementations in codebase](../f32/f32_amd64.s) (reference)

## Related Issues

- #6 ARM64 tanh NEON debugging
- Activation functions feature (merged in v1.0.15)

## Priority

**Medium** - Exp is less commonly used than Sigmoid/Tanh/ReLU in neural networks, but still valuable for certain applications. Pure Go fallback is acceptable.

## Acceptance Criteria

- [ ] AVX implementation for f32 Exp (AMD64)
- [ ] AVX implementation for f64 Exp (AMD64)
- [ ] NEON implementation for f32 Exp (ARM64)
- [ ] NEON implementation for f64 Exp (ARM64)
- [ ] Benchmarks showing speedup over Pure Go
- [ ] Tests passing on all platforms
- [ ] Proper range handling (+Inf on overflow, 0 on underflow)
- [ ] Documentation updated with performance data
