Skip to content

Commit 97d7fb6

Browse files
committed
[Doc] User guide entry for qd.precise
1 parent 44c99f8 commit 97d7fb6

2 files changed

Lines changed: 113 additions & 0 deletions

File tree

docs/source/user_guide/index.md

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -19,6 +19,7 @@ scalar_tensors
1919
matrix_vector
2020
compound_types
2121
static
22+
precise
2223
sub_functions
2324
parallelization
2425
```

docs/source/user_guide/precise.md

Lines changed: 112 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,112 @@
1+
# qd.precise
2+
3+
`qd.precise(expr)` marks a floating-point expression as IEEE-strict. Every binary and unary FP op inside the wrapped subtree is evaluated in source order with no reassociation, no FMA contraction, and no algebraic simplification, regardless of the module-level `fast_math` setting. It is the moral equivalent of the `precise` keyword in MSL / HLSL.
4+
5+
## Why
6+
7+
Quadrants compiles kernels with `fast_math=True` by default. Under that mode the compiler is free to:
8+
9+
- **reassociate** FP ops (e.g. `(a + b) + c -> a + (b + c)`)
10+
- **contract** mul-then-add into FMA
11+
- **substitute approximations** for `sqrt`, `sin`, `cos`, `log`, `1/x`
12+
- **algebraically simplify** (e.g. `a - a -> 0`, `a / a -> 1`)
13+
14+
This silently destroys compensated-arithmetic primitives (Dekker / Kahan 2Sum, Veltkamp split, double-single accumulators) whose entire correctness rests on the fact that `(a - aa) + (b - bb)` is non-zero under IEEE arithmetic. The traditional workaround is to flip the global `fast_math=False` switch, but that pays the perf cost everywhere, even when only a handful of lines need IEEE semantics.
15+
16+
`qd.precise(expr)` is the per-expression opt-in: keep `fast_math=True` globally for speed, and wrap the expressions that must be IEEE-exact.
17+
18+
## Basic usage
19+
20+
```python
21+
@qd.func
22+
def fast_two_sum(a, b):
23+
s = qd.precise(a + b)
24+
e = qd.precise(b - (s - a)) # would fold to 0 under fast-math without precise
25+
return s, e
26+
```
27+
28+
Any expression value can be wrapped. The wrapper returns the same expression with every reachable FP op tagged as precise; at codegen time the tagged ops opt out of the optimizations above.
29+
30+
## What gets protected
31+
32+
`qd.precise` walks the wrapped expression tree and tags:
33+
34+
- Every `BinaryOp` (`+`, `-`, `*`, `/`, `%`, comparisons, bit ops on FP types)
35+
- Every `UnaryOp` (`neg`, `sqrt`, `sin`, `cos`, `log`, `exp`, `rsqrt`, casts, bit_cast, ...)
36+
37+
The walker descends through `BinaryOp`, `UnaryOp`, and `TernaryOp` (e.g. `qd.select`) nodes, so wrapping a composite expression protects the inner ops too:
38+
39+
```python
40+
# All three FP ops below are tagged: the outer sqrt, the inner add, and the inner mul.
41+
r = qd.precise(qd.sqrt(a * a + b * b))
42+
43+
# Ternary is traversed through; the two branches and the condition's inner ops are tagged.
44+
r = qd.precise(qd.select(cond, a + b, a - b))
45+
```
46+
47+
## Where the walker stops
48+
49+
`qd.precise` does not descend into:
50+
51+
- Loads (ndarray indexing, field access)
52+
- Constants
53+
- `qd.func` call sites
54+
- Atomic ops
55+
56+
Semantics inside a `qd.func` body are governed by that body's own ops. If you want IEEE-strict behavior inside a called function, wrap the relevant ops inside the function's body, not at the call site:
57+
58+
```python
59+
@qd.func
60+
def dot_precise(a, b, c, d):
61+
# Wrap inside the body, not at the caller.
62+
return qd.precise(a * b + c * d)
63+
64+
@qd.kernel
65+
def k(...):
66+
r = dot_precise(x, y, z, w) # inner ops are already precise
67+
```
68+
69+
## Interaction with fast_math
70+
71+
`qd.precise` is a per-op override. It takes effect whether `fast_math` is on or off:
72+
73+
| Setting | Non-precise op | `qd.precise` op |
74+
|---|---|---|
75+
| `fast_math=True` | reassoc / contract / simplify | IEEE-strict |
76+
| `fast_math=False` | IEEE-strict | IEEE-strict (redundant but harmless) |
77+
78+
The recommended workflow is to leave `fast_math=True` globally for throughput and reach for `qd.precise` only in the handful of spots that need IEEE behavior.
79+
80+
## Backend coverage
81+
82+
| Backend | Reassoc / contraction / algebraic folds | Approximate transcendentals (`sin` / `cos` / `log`) |
83+
|---|---|---|
84+
| CPU | LLVM FMF cleared | libc `sinf` is already correctly rounded |
85+
| CUDA | LLVM FMF cleared | libdevice `__nv_<fn>f` (non-fast) selected |
86+
| AMDGPU | LLVM FMF cleared | `__ocml_<fn>` already correctly rounded |
87+
| Vulkan / MoltenVK | SPIR-V `NoContraction` decoration | best-effort: driver stdlib default (typ. 1-2 ULP) |
88+
| Metal | SPIR-V `NoContraction` decoration | best-effort: driver stdlib default (typ. 1-2 ULP) |
89+
90+
On SPIR-V backends, `NoContraction` is defined by the spec to apply to arithmetic instructions only; most consumers ignore it on the `OpExtInst` calls used for transcendentals. The decoration is still emitted (it is harmless and future-proofs against downstream toolchains that start honoring it), but correctness of `qd.precise(qd.sin(x))` on Metal / Vulkan currently relies on the driver's default (non-fast-math) transcendental implementation being accurate enough for your use case.
91+
92+
## Example: Dekker 2Sum
93+
94+
A textbook compensated addition that computes `s + e = a + b` exactly in f32:
95+
96+
```python
97+
@qd.func
98+
def two_sum(a, b):
99+
s = qd.precise(a + b)
100+
bb = qd.precise(s - a)
101+
aa = qd.precise(s - bb)
102+
e = qd.precise((a - aa) + (b - bb))
103+
return s, e
104+
```
105+
106+
Without the `qd.precise` wrappers, under `fast_math=True` the compiler recognizes `(a - (s - (s - a))) + (b - (s - a))` as algebraically zero and folds `e` to `0`. The wrappers prevent that fold, and `s + e` reproduces `a + b` to full precision.
107+
108+
## Caveats
109+
110+
- `qd.precise` is a scalar primitive. Passing a `Vector` / `Matrix` will raise. Apply it to individual components instead, or refactor your expression to use scalar ops inside.
111+
- The tag is a property of the expression value, not the use site. If you alias a subexpression and then wrap one alias, both uses get IEEE semantics.
112+
- Caching: a kernel that uses `qd.precise` has a different offline-cache key than a structurally-identical kernel without it, so the two can coexist without collisions.

0 commit comments

Comments
 (0)