Commit 1dd4d5b
FP8 Phase 3 (PTX/CUDA) COMPLETE: portable bit-manip - FP8 now on ALL 6 BACKENDS
Wires the portable FP8<->f32 PTX conversion (EmitFP8BitsToF32 / EmitF32ToFP8Bits, committed unwired
in 3bd846d) into the CUDA load, store, convert, and scalar-param paths. FP8 has no native PTX cvt on
the cards we target, so, exactly like the bf16 pre-Ampere fix, it uses only basic integer ops
(branchless setp/selp, unrolled subnormal normalization) and works on every CUDA arch, including the
1080 (sm_61) and 2060 (sm_75).
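
As a rough illustration of the "basic integer ops only" decode described above, here is a minimal Python sketch. It is not the PTX that EmitFP8BitsToF32 emits. It assumes the OCP FP8 layouts (E4M3 with bias 7, no Inf, and a single NaN pattern; E5M2 with bias 15 and IEEE-style Inf/NaN), and the names `select`, `fp8_bits_to_f32_bits`, `e4m3_to_f32`, and `e5m2_to_f32` are hypothetical.

```python
import struct

def select(cond, a, b):
    """Stand-in for PTX selp: pick a when the predicate holds, else b."""
    return a if cond else b

def fp8_bits_to_f32_bits(code, exp_bits, man_bits, bias, ieee_specials):
    """Decode one FP8 byte to an f32 bit pattern with shifts, masks, selects."""
    sign = (code >> 7) & 1
    exp = (code >> man_bits) & ((1 << exp_bits) - 1)
    man = code & ((1 << man_bits) - 1)

    # Normal number: rebias the exponent, left-align the mantissa in 23 bits.
    normal = ((exp - bias + 127) << 23) | (man << (23 - man_bits))

    # Subnormal: value = man * 2^(1 - bias - man_bits). Find the highest set
    # mantissa bit p with an unrolled select chain, then renormalize.
    p = 0
    for k in range(1, man_bits):
        p = select((man >> k) & 1, k, p)
    sub = ((p + 1 - bias - man_bits + 127) << 23) | ((man << (23 - p)) & 0x7FFFFF)

    bits = select(exp == 0, select(man == 0, 0, sub), normal)

    exp_max = (1 << exp_bits) - 1
    man_max = (1 << man_bits) - 1
    if ieee_specials:
        # Assumed E5M2 rule: top exponent is Inf (man == 0) or NaN (man != 0).
        bits = select(exp == exp_max, 0x7F800000 | (man << (23 - man_bits)), bits)
    else:
        # Assumed E4M3 rule: only all-ones exponent and mantissa is NaN; no Inf.
        bits = select(exp == exp_max and man == man_max, 0x7FC00000, bits)
    return bits | (sign << 31)

def bits_to_float(bits):
    """Reinterpret a 32-bit pattern as an f32 value."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]

def e4m3_to_f32(code):
    return bits_to_float(fp8_bits_to_f32_bits(code, 4, 3, 7, False))

def e5m2_to_f32(code):
    return bits_to_float(fp8_bits_to_f32_bits(code, 5, 2, 15, True))
```

The subnormal branch is the only part that needs a search for the leading one; with 2 or 3 mantissa bits the chain is short enough to unroll fully, which is why no loop or count-leading-zeros instruction is needed.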
- ConvertValue: FP8<->f32 (and FP8<->FP8) is a register no-op (f32-register model, like bf16). This
is what makes PrecisionConvert.ConvertToSingle/ConvertFromSingle<FP8> lower to nothing on PTX, and
it closes the earlier "Float32 -> Float8E4M3 does not have an intrinsic implementation" error on CUDA.
- Load: ArrayView<FP8> -> ld.u8 -> EmitFP8BitsToF32. Store: EmitF32ToFP8Bits -> st.u8 (keyed off the
target buffer element type, like bf16).
- FP8 scalar param: .b8 param declaration (AppendParamDeclaration) + ld.param.u8 + widen
(BindParameters), following the f32-register model, so the host's 1-byte pack lines up (the value
was previously arriving as 0).
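
The store side of the list above (f32 register, narrowed to one byte, then written to the buffer) can be sketched in the same spirit. This is a hedged Python model, not the output of EmitF32ToFP8Bits: it assumes round-to-nearest-even, saturation to the largest finite value on overflow and on Inf (the commit does not state the overflow policy), and the OCP FP8 layouts. The names `f32_to_fp8_bits`, `store_e4m3`, and `store_e5m2` are hypothetical.

```python
import struct

def f32_to_fp8_bits(x, man_bits, bias, max_finite, nan_code):
    """Narrow an f32 value to an FP8 byte using only integer operations."""
    u = struct.unpack("<I", struct.pack("<f", x))[0]
    sign = (u >> 24) & 0x80
    e = (u >> 23) & 0xFF
    m = u & 0x7FFFFF
    if e == 0xFF:
        # NaN stays NaN; Inf saturates (assumed policy).
        return sign | (nan_code if m else max_finite)
    if e == 0:
        # Zero and f32 subnormals are far below FP8 range: signed zero.
        return sign

    E = e - 127 + bias                      # target biased exponent
    E_eff = max(E, 1)                       # subnormal targets use exponent 1
    shift = (23 - man_bits) + (E_eff - E)   # extra shift denormalizes
    sig = m | 0x800000                      # 24-bit significand, implicit one

    q = sig >> shift
    rem = sig & ((1 << shift) - 1)
    half = 1 << (shift - 1)
    if rem > half or (rem == half and (q & 1)):
        q += 1                              # round to nearest, ties to even

    # q still carries the implicit one, so add it onto (E_eff - 1); a
    # mantissa carry then rolls over into the exponent field on its own.
    code = ((E_eff - 1) << man_bits) + q
    return sign | min(code, max_finite)

def store_e4m3(buf, i, x):
    """Model of the store path: narrow, then a one-byte write (st.u8)."""
    buf[i] = f32_to_fp8_bits(x, 3, 7, 0x7E, 0x7F)

def store_e5m2(buf, i, x):
    buf[i] = f32_to_fp8_bits(x, 2, 15, 0x7B, 0x7F)
```

Keying the narrowing off the destination buffer's element type, as the commit describes, means the arithmetic in between never sees an 8-bit value; only the final write does.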
VERIFIED on the 4070 (`DemoConsole -- fp8-verify`): CPU + OpenCL + CUDA all E4M3 257/257 + E5M2
257/257 (the relu(x*scale+bias) generic kernel through the FP8 load/store/arith/scalar/convert paths).
Only basic ops are used, so a correct result on the 4070 implies a correct result on the 1080/2060
(universal instruction set). Test skip removed - FP8 now runs on all 6 backends; full PMT round-trip
next.
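
For readers who want a feel for what the relu(x*scale+bias) check exercises, here is a small host-side reference model in Python. It is not the `fp8-verify` test itself: the conversions are done arithmetically (table lookup plus nearest-value search) rather than by bit manipulation, only E4M3 is covered, the OCP E4M3 layout and saturating round-to-nearest-even are assumed, and every name (`decode_e4m3`, `encode_e4m3`, `relu_scale_bias`) is hypothetical.

```python
import math
import struct

def f32(v):
    """Round a Python float to f32 precision, modelling an f32 register."""
    return struct.unpack("<f", struct.pack("<f", v))[0]

def decode_e4m3(code):
    """Arithmetic reference decode of one E4M3 byte (assumed OCP layout)."""
    sign = -1.0 if code & 0x80 else 1.0
    exp, man = (code >> 3) & 0xF, code & 0x7
    if exp == 0xF and man == 0x7:
        return math.nan
    if exp == 0:
        return sign * math.ldexp(man, -9)         # subnormal: man * 2^-9
    return sign * math.ldexp(8 + man, exp - 10)   # (1 + man/8) * 2^(exp - 7)

# Every finite code with its value, used for nearest-value encoding.
FINITE = [(decode_e4m3(c), c) for c in range(256)
          if not math.isnan(decode_e4m3(c))]

def encode_e4m3(x):
    """Reference encode: saturate, then nearest code with ties to even."""
    if math.isnan(x):
        return 0x7F
    x = max(-448.0, min(448.0, x))
    return min(FINITE, key=lambda vc: (abs(vc[0] - x), vc[1] & 1))[1]

def relu_scale_bias(src, scale_code, bias_code):
    """Load FP8, compute in f32, store FP8; scalars arrive as single bytes."""
    scale = decode_e4m3(scale_code)   # scalar param: 1 byte, widened to f32
    bias = decode_e4m3(bias_code)
    out = bytearray(len(src))
    for i, code in enumerate(src):
        x = decode_e4m3(code)                      # load path
        y = max(0.0, f32(f32(x * scale) + bias))   # arithmetic in f32
        out[i] = encode_e4m3(y)                    # store path
    return out
```

With scale 1.0 (`0x38`) and bias 0.0 (`0x00`), every non-negative finite input code should come back unchanged and every negative one should come back as `0x00`, which makes this a convenient identity check for a backend's load and store paths.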
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
1 parent 3bd846d commit 1dd4d5b
3 files changed
Lines changed: 93 additions & 9 deletions
Diff bodies were not captured; only hunk positions survive (file names not captured):
- File 1: +19 lines (new lines 244-262), +21 lines (new lines 960-980), +21 lines (new lines 1115-1135)
- File 2: +19 lines (new lines 1017-1035), +10 lines (new lines 1129-1138)
- File 3: 3 additions & 9 deletions (new lines 82-83 and 85 added; old lines 83-91 removed)