|
| 1 | +# qd.precise |
| 2 | + |
| 3 | +`qd.precise(expr)` marks a floating-point expression as IEEE-strict. Every binary and unary FP op inside the wrapped subtree is evaluated in source order with no reassociation, no FMA contraction, and no algebraic simplification, regardless of the module-level `fast_math` setting. It is the moral equivalent of the `precise` keyword in MSL / HLSL. |
| 4 | + |
| 5 | +## Why |
| 6 | + |
| 7 | +Quadrants compiles kernels with `fast_math=True` by default. Under that mode the compiler is free to: |
| 8 | + |
| 9 | +- **reassociate** FP ops (e.g. `(a + b) + c -> a + (b + c)`) |
| 10 | +- **contract** mul-then-add into FMA |
| 11 | +- **substitute approximations** for `sqrt`, `sin`, `cos`, `log`, `1/x` |
| 12 | +- **algebraically simplify** (e.g. `a - a -> 0`, `a / a -> 1`) |
| 13 | + |
| 14 | +This silently destroys compensated-arithmetic primitives (Dekker / Kahan 2Sum, Veltkamp split, double-single accumulators) whose entire correctness rests on the fact that `(a - aa) + (b - bb)` is non-zero under IEEE arithmetic. The traditional workaround is to flip the global `fast_math=False` switch, but that pays the perf cost everywhere, even when only a handful of lines need IEEE semantics. |
| 15 | + |
| 16 | +`qd.precise(expr)` is the per-expression opt-in: keep `fast_math=True` globally for speed, and wrap the expressions that must be IEEE-exact. |
| 17 | + |
| 18 | +## Basic usage |
| 19 | + |
| 20 | +```python |
| 21 | +@qd.func |
| 22 | +def fast_two_sum(a, b): |
| 23 | + s = qd.precise(a + b) |
| 24 | + e = qd.precise(b - (s - a)) # would fold to 0 under fast-math without precise |
| 25 | + return s, e |
| 26 | +``` |
| 27 | + |
| 28 | +Any expression value can be wrapped. The wrapper returns the same expression with every reachable FP op tagged as precise; at codegen time the tagged ops opt out of the optimizations above. |
| 29 | + |
| 30 | +## What gets protected |
| 31 | + |
| 32 | +`qd.precise` walks the wrapped expression tree and tags: |
| 33 | + |
| 34 | +- Every `BinaryOp` (`+`, `-`, `*`, `/`, `%`, comparisons, bit ops on FP types) |
| 35 | +- Every `UnaryOp` (`neg`, `sqrt`, `sin`, `cos`, `log`, `exp`, `rsqrt`, casts, bit_cast, ...) |
| 36 | + |
| 37 | +The walker descends through `BinaryOp`, `UnaryOp`, and `TernaryOp` (e.g. `qd.select`) nodes, so wrapping a composite expression protects the inner ops too: |
| 38 | + |
| 39 | +```python |
| 40 | +# All three FP ops below are tagged: the outer sqrt, the inner add, and the inner mul. |
| 41 | +r = qd.precise(qd.sqrt(a * a + b * b)) |
| 42 | + |
| 43 | +# Ternary is traversed through; the two branches and the condition's inner ops are tagged. |
| 44 | +r = qd.precise(qd.select(cond, a + b, a - b)) |
| 45 | +``` |
| 46 | + |
| 47 | +## Where the walker stops |
| 48 | + |
| 49 | +`qd.precise` does not descend into: |
| 50 | + |
| 51 | +- Loads (ndarray indexing, field access) |
| 52 | +- Constants |
| 53 | +- `qd.func` call sites |
| 54 | +- Atomic ops |
| 55 | + |
| 56 | +Semantics inside a `qd.func` body are governed by that body's own ops. If you want IEEE-strict behavior inside a called function, wrap the relevant ops inside the function's body, not at the call site: |
| 57 | + |
| 58 | +```python |
| 59 | +@qd.func |
| 60 | +def dot_precise(a, b, c, d): |
| 61 | + # Wrap inside the body, not at the caller. |
| 62 | + return qd.precise(a * b + c * d) |
| 63 | + |
| 64 | +@qd.kernel |
| 65 | +def k(...): |
| 66 | + r = dot_precise(x, y, z, w) # inner ops are already precise |
| 67 | +``` |
| 68 | + |
| 69 | +## Interaction with fast_math |
| 70 | + |
| 71 | +`qd.precise` is a per-op override. It takes effect whether `fast_math` is on or off: |
| 72 | + |
| 73 | +| Setting | Non-precise op | `qd.precise` op | |
| 74 | +|---|---|---| |
| 75 | +| `fast_math=True` | reassoc / contract / simplify | IEEE-strict | |
| 76 | +| `fast_math=False` | IEEE-strict | IEEE-strict (redundant but harmless) | |
| 77 | + |
| 78 | +The recommended workflow is to leave `fast_math=True` globally for throughput and reach for `qd.precise` only in the handful of spots that need IEEE behavior. |
| 79 | + |
| 80 | +## Backend coverage |
| 81 | + |
| 82 | +| Backend | Reassoc / contraction / algebraic folds | Approximate transcendentals (`sin` / `cos` / `log`) | |
| 83 | +|---|---|---| |
| 84 | +| CPU | LLVM FMF cleared | libc `sinf` is already correctly rounded | |
| 85 | +| CUDA | LLVM FMF cleared | libdevice `__nv_<fn>f` (non-fast) selected | |
| 86 | +| AMDGPU | LLVM FMF cleared | `__ocml_<fn>` already correctly rounded | |
| 87 | +| Vulkan / MoltenVK | SPIR-V `NoContraction` decoration | best-effort: driver stdlib default (typ. 1-2 ULP) | |
| 88 | +| Metal | SPIR-V `NoContraction` decoration | best-effort: driver stdlib default (typ. 1-2 ULP) | |
| 89 | + |
| 90 | +On SPIR-V backends, `NoContraction` is defined by the spec to apply to arithmetic instructions only; most consumers ignore it on the `OpExtInst` calls used for transcendentals. The decoration is still emitted (it is harmless and future-proofs against downstream toolchains that start honoring it), but correctness of `qd.precise(qd.sin(x))` on Metal / Vulkan currently relies on the driver's default (non-fast-math) transcendental implementation being accurate enough for your use case. |
| 91 | + |
| 92 | +## Example: Dekker 2Sum |
| 93 | + |
| 94 | +A textbook compensated addition that computes `s + e = a + b` exactly in f32: |
| 95 | + |
| 96 | +```python |
| 97 | +@qd.func |
| 98 | +def two_sum(a, b): |
| 99 | + s = qd.precise(a + b) |
| 100 | + bb = qd.precise(s - a) |
| 101 | + aa = qd.precise(s - bb) |
| 102 | + e = qd.precise((a - aa) + (b - bb)) |
| 103 | + return s, e |
| 104 | +``` |
| 105 | + |
| 106 | +Without the `qd.precise` wrappers, under `fast_math=True` the compiler recognizes `(a - (s - (s - a))) + (b - (s - a))` as algebraically zero and folds `e` to `0`. The wrappers prevent that fold, and `s + e` reproduces `a + b` to full precision. |
| 107 | + |
| 108 | +## Caveats |
| 109 | + |
| 110 | +- `qd.precise` is a scalar primitive. Passing a `Vector` / `Matrix` will raise. Apply it to individual components instead, or refactor your expression to use scalar ops inside. |
| 111 | +- The tag is a property of the expression value, not the use site. If you alias a subexpression and then wrap one alias, both uses get IEEE semantics. |
| 112 | +- Caching: a kernel that uses `qd.precise` has a different offline-cache key than a structurally-identical kernel without it, so the two can coexist without collisions. |
0 commit comments