Add sigmoid_fp64 kernels #10272
Merged
And rewrite the sigmoid_fp32 kernel using the same technique.

It turns out that this kernel is faster, which is a little surprising. It does have less "overhead" (special cases, piecewise branches, etc.) in exchange for more polynomial arithmetic.

Change in performance for `sigmoid_fp32`:

```
name                                                               time/op       time/op       vs base
bench/sigmoid_fp32_1x32_x86_avx512f_avx512bw/m:1/n:4096/real_time  1.759µ ± 3%   1.771µ ± 21%  ~ (p=0.818 n=6)
bench/sigmoid_fp32_1x32_x86_avx512f_avx512bw/m:4/n:1024/real_time  1.740µ ± 3%   1.774µ ± 3%   ~ (p=0.065 n=6)
bench/sigmoid_fp32_1x32_x86_avx512f_avx512bw/m:16/n:256/real_time  1.773µ ± 2%   1.783µ ± 1%   ~ (p=0.699 n=6)
bench/sigmoid_fp32_1x16_x86_avx2/m:1/n:4096/real_time              4.909µ ± 1%   3.670µ ± 2%   -25.24% (p=0.002 n=6)
bench/sigmoid_fp32_1x16_x86_avx2/m:4/n:1024/real_time              4.830µ ± 2%   3.706µ ± 4%   -23.28% (p=0.002 n=6)
bench/sigmoid_fp32_1x16_x86_avx2/m:16/n:256/real_time              4.912µ ± 1%   3.740µ ± 2%   -23.87% (p=0.002 n=6)
bench/sigmoid_fp32_1x32_x86_sse2/m:1/n:4096/real_time              6.632µ ± 3%   5.437µ ± 2%   -18.02% (p=0.002 n=6)
bench/sigmoid_fp32_1x32_x86_sse2/m:4/n:1024/real_time              6.637µ ± 3%   5.524µ ± 3%   -16.77% (p=0.002 n=6)
bench/sigmoid_fp32_1x32_x86_sse2/m:16/n:256/real_time              6.692µ ± 4%   5.493µ ± 1%   -17.92% (p=0.002 n=6)
geomean                                                            3.851µ        3.305µ        -14.19%
```

`sigmoid_fp64` compared to other kernels:

```
----------------------------------------------------------------------------------------------------------------------------
Benchmark                                                              Time         CPU   Iterations UserCounters...
----------------------------------------------------------------------------------------------------------------------------
bench_reference/sigmoid_float/m:1/n:4096/real_time                 38579 ns    38571 ns     7348 Bytes=849.382M/s Op=106.173M/s
bench_reference/sigmoid_float/m:4/n:1024/real_time                 38345 ns    38338 ns     7440 Bytes=854.556M/s Op=106.819M/s
bench_reference/sigmoid_float/m:16/n:256/real_time                 39192 ns    39187 ns     7190 Bytes=836.095M/s Op=104.512M/s
bench_reference/sigmoid_double/m:1/n:4096/real_time                91326 ns    91313 ns     3045 Bytes=717.606M/s Op=44.8504M/s
bench_reference/sigmoid_double/m:4/n:1024/real_time                91307 ns    91290 ns     3043 Bytes=717.757M/s Op=44.8598M/s
bench_reference/sigmoid_double/m:16/n:256/real_time                93505 ns    93486 ns     3018 Bytes=700.885M/s Op=43.8053M/s
bench/sigmoid_fp32_1x32_x86_avx512f_avx512bw/m:1/n:4096/real_time   1786 ns     1786 ns   155658 Bytes=18.3422G/s Op=2.29277G/s
bench/sigmoid_fp32_1x32_x86_avx512f_avx512bw/m:4/n:1024/real_time   1802 ns     1802 ns   157599 Bytes=18.1847G/s Op=2.27309G/s
bench/sigmoid_fp32_1x32_x86_avx512f_avx512bw/m:16/n:256/real_time   1791 ns     1791 ns   156134 Bytes=18.2963G/s Op=2.28704G/s
bench/sigmoid_fp64_1x16_x86_avx512f_avx512bw/m:1/n:4096/real_time   4475 ns     4475 ns    60425 Bytes=14.6433G/s Op=915.207M/s
bench/sigmoid_fp64_1x16_x86_avx512f_avx512bw/m:4/n:1024/real_time   4822 ns     4821 ns    59593 Bytes=13.5913G/s Op=849.459M/s
bench/sigmoid_fp64_1x16_x86_avx512f_avx512bw/m:16/n:256/real_time   4842 ns     4840 ns    56596 Bytes=13.5363G/s Op=846.016M/s
bench/sigmoid_fp32_1x16_x86_avx2/m:1/n:4096/real_time               3789 ns     3788 ns    69486 Bytes=8.64752G/s Op=1.08094G/s
bench/sigmoid_fp32_1x16_x86_avx2/m:4/n:1024/real_time               3892 ns     3892 ns    74142 Bytes=8.41825G/s Op=1.05228G/s
bench/sigmoid_fp32_1x16_x86_avx2/m:16/n:256/real_time               3757 ns     3756 ns    72827 Bytes=8.72073G/s Op=1.09009G/s
bench/sigmoid_fp64_1x8_x86_avx2/m:1/n:4096/real_time               10451 ns    10450 ns    26516 Bytes=6.27103G/s Op=391.939M/s
bench/sigmoid_fp64_1x8_x86_avx2/m:4/n:1024/real_time               11010 ns    11007 ns    24451 Bytes=5.95261G/s Op=372.038M/s
bench/sigmoid_fp64_1x8_x86_avx2/m:16/n:256/real_time               10475 ns    10472 ns    26374 Bytes=6.2567G/s Op=391.044M/s
bench/sigmoid_fp32_1x32_x86_sse2/m:1/n:4096/real_time               5649 ns     5648 ns    49675 Bytes=5.80048G/s Op=725.06M/s
bench/sigmoid_fp32_1x32_x86_sse2/m:4/n:1024/real_time               5646 ns     5645 ns    50916 Bytes=5.80353G/s Op=725.441M/s
bench/sigmoid_fp32_1x32_x86_sse2/m:16/n:256/real_time               5571 ns     5571 ns    48792 Bytes=5.88151G/s Op=735.188M/s
bench/sigmoid_fp64_1x8_x86_sse2/m:1/n:4096/real_time               15957 ns    15952 ns    17116 Bytes=4.10712G/s Op=256.695M/s
bench/sigmoid_fp64_1x8_x86_sse2/m:4/n:1024/real_time               15657 ns    15654 ns    17451 Bytes=4.18581G/s Op=261.613M/s
bench/sigmoid_fp64_1x8_x86_sse2/m:16/n:256/real_time               15748 ns    15744 ns    17804 Bytes=4.16163G/s Op=260.102M/s
```

PiperOrigin-RevId: 918081537
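For readers who want to cross-check the benchmark figures, here is a minimal sketch of how the derived columns appear to relate to the raw times. The helper names are hypothetical and not part of this change. The assumption that `Bytes` counts one read plus one write per element is inferred from the reported counters, not stated anywhere in the PR.

```python
# Sketch only: helper names are hypothetical, not taken from the PR.


def percent_change(base, new):
    """Relative change in time/op as a percentage; negative means faster."""
    return (new - base) / base * 100.0


def ops_per_second(m, n, time_ns):
    """Elements processed per second for an m x n input."""
    return m * n / (time_ns * 1e-9)


def bytes_per_second(m, n, time_ns, element_bytes):
    """Bytes moved per second.

    Assumption (inferred from the reported counters, not stated in the PR):
    each element is read once and written once.
    """
    return 2 * element_bytes * ops_per_second(m, n, time_ns)


if __name__ == "__main__":
    # sigmoid_fp32 AVX2, m:1/n:4096: 4.909 us before, 3.670 us after.
    print(f"{percent_change(4.909, 3.670):.2f}%")
    # sigmoid_fp64 AVX512, m:1/n:4096, 4475 ns per iteration.
    print(f"{ops_per_second(1, 4096, 4475) / 1e6:.1f} M op/s")
    print(f"{bytes_per_second(1, 4096, 4475, 8) / 1e9:.2f} GB/s")
    # Speedup of the AVX512 fp64 kernel over the reference double loop.
    print(f"{91326 / 4475:.1f}x")
```

Under these assumptions the computed values match the reported `Op` and `Bytes` counters to within about 0.1%. The residual is presumably due to the rounded times in the table.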