
Accumulate CPU half-precision sums in float32 #3488

Open

sofinvalery wants to merge 1 commit into ml-explore:main from sofinvalery:sum-float-accum

Conversation

@sofinvalery

Proposed changes

Accumulate CPU float16 and bfloat16 sum reductions in float32, while preserving the output dtype. This fixes precision loss in ops that use sum().

Fixes #3326.
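To motivate the change, here is a minimal, self-contained illustration of why accumulator width matters. It is plain C++, not MLX code, and uses float vs. double as a stand-in for float16 vs. float32, since standard C++ lacks a portable half type; the effect is analogous (float16's 11-bit significand makes a sum of ones stall at 2048, and float32 shows the same stagnation at larger magnitudes).

```cpp
#include <cassert>
#include <cmath>

// A narrow accumulator loses low-order bits once the running sum dwarfs each
// addend; a wider accumulator absorbs the rounding error.
float sum_with_float_acc(int n) {
  float acc = 0.0f;
  for (int i = 0; i < n; ++i) acc += 0.1f;  // rounding error compounds
  return acc;
}

double sum_with_double_acc(int n) {
  double acc = 0.0;
  for (int i = 0; i < n; ++i) acc += 0.1f;  // wide accumulator, same inputs
  return acc;
}
```

With ten million additions of 0.1f, the double accumulator lands within a fraction of a unit of the true 1,000,000, while the float accumulator drifts off by tens of thousands.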

Checklist

Put an x in the boxes that apply.

  • I have read the CONTRIBUTING document
  • I have run pre-commit run --all-files to format my code / installed pre-commit prior to committing changes
  • I have added tests that prove my fix is effective or that my feature works
  • I have updated the necessary documentation (if needed)

@sofinvalery
Author

Made it inline. Feels cleaner.

Collaborator

@zcbenz left a comment


Allocating a new array would carry a heavy performance penalty; the correct way would be to refactor strided_reduce/contiguous_reduce to accumulate in float32 rather than the output type.
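The suggested refactor might look roughly like the following sketch. The signature is simplified and hypothetical, not MLX's actual contiguous_reduce; the point is that the loop carries the running value in a separate accumulator type and narrows to the output type only once, with no temporary array.

```cpp
#include <cassert>

// Sketch: accumulate in AccT (e.g. float for a float16 input), converting
// each element as it is read, then narrow back to T at the end.
template <typename T, typename AccT, typename Op>
T contiguous_reduce(const T* x, int size, Op op, AccT init) {
  AccT acc = init;
  for (int i = 0; i < size; ++i) {
    acc = op(acc, static_cast<AccT>(x[i]));  // widen element-by-element
  }
  return static_cast<T>(acc);  // narrow once to the output dtype
}
```

Calling it with AccT = float for a half-precision input keeps every intermediate sum in full precision while the output dtype is preserved.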

@sofinvalery
Author

Refactored strided_reduce/contiguous_reduce to support accumulation in a separate type.
float16/bfloat16 sums now accumulate in float32 without allocating a temp array.
Also added tests for reductions across full, contiguous, and strided axes.

```diff
 for (int i = 0; i < out.size(); i++, out_ptr++, in_ptr += reduction_size) {
-  *out_ptr = init;
-  contiguous_reduce(in_ptr, out_ptr, reduction_size, Op{}, init);
+  *out_ptr = contiguous_reduce(in_ptr, reduction_size, Op{}, init, init);
```
Collaborator


Is it supposed to be:

Suggested change:

```diff
-*out_ptr = contiguous_reduce(in_ptr, reduction_size, Op{}, init, init);
+*out_ptr = contiguous_reduce(in_ptr, reduction_size, Op{}, *out_ptr, init);
```

```cpp
constexpr int N = std::min(simd::max_size<T>, simd::max_size<U>);
simd::Simd<U, N> accumulator_v(init);
while (size >= N) {
  accumulator_v = op(accumulator_v, simd::Simd<U, N>(simd::load<T, N>(x)));
```
Collaborator


When building for macOS 14 it seems that we can't simply convert Simd<float16> to Simd<float>:

```
/Users/runner/actions-runner/_work/mlx/mlx/mlx/backend/cpu/simd/accelerate_simd.h:63:40: error: no matching function for call to 'convert'
   63 |   Simd<T, N>(Simd<U, N> other) : value(asd::convert<scalar_t>(other.value)) {}
      |                                        ^~~~~~~~~~~~~~~~~~~~~~
/Users/runner/actions-runner/_work/mlx/mlx/mlx/backend/cpu/reduce.cpp:113:39: note: in instantiation of function template specialization 'mlx::core::simd::Simd<float, 8>::Simd<__fp16>' requested here
  113 |     accumulator_v = op(accumulator_v, simd::Simd<U, N>(simd::load<T, N>(x)));
      |                                       ^
```
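One portable fallback when a vendor-level vector convert overload is missing for a source/destination pair such as __fp16 to float is a lane-by-lane scalar cast. The sketch below is hypothetical: Vec and convert_lanes are stand-ins, not MLX's real Simd class or its conversion path, and a scalar loop gives up vectorized conversion speed in exchange for compiling everywhere.

```cpp
#include <cstddef>

// Minimal stand-in for a fixed-width SIMD wrapper.
template <typename T, int N>
struct Vec {
  T lanes[N];
};

// Convert each lane with an ordinary scalar cast; works for any T -> U pair
// the compiler can cast, regardless of platform intrinsics.
template <typename U, typename T, int N>
Vec<U, N> convert_lanes(const Vec<T, N>& x) {
  Vec<U, N> out;
  for (int i = 0; i < N; ++i) {
    out.lanes[i] = static_cast<U>(x.lanes[i]);  // scalar cast per lane
  }
  return out;
}
```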

```diff
     const std::vector<int>& axes) {
   if (rtype == Reduce::And) {
-    reduction_op<InT, bool, AndReduce>(in, out, axes, true);
+    reduction_op<InT, bool, AndReduce, bool>(in, out, axes, true);
```
Collaborator


I prefer omitting the optional AccT where it matches the output type, so it is obvious when the code is accumulating in a different type.
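The preference can be sketched with a simplified, hypothetical signature (sum_reduce here is not MLX's reduction_op): the accumulator type defaults to the output type, so spelling it out at a call site is itself the signal that accumulation happens in a wider type.

```cpp
#include <cassert>

// AccT defaults to OutT; an explicit third template argument at a call site
// visibly marks accumulation in a different type.
template <typename InT, typename OutT, typename AccT = OutT>
OutT sum_reduce(const InT* x, int n) {
  AccT acc{};
  for (int i = 0; i < n; ++i) acc += static_cast<AccT>(x[i]);
  return static_cast<OutT>(acc);
}
```

At call sites, `sum_reduce<float, float>(...)` reads as ordinary same-type accumulation, while `sum_reduce<float, float, double>(...)` flags the wider accumulator at a glance.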



Development

Successfully merging this pull request may close these issues.

"test random uniform" fails for CPU backend
