diff --git a/docs_input/api/signalimage/radar/sar_bp.rst b/docs_input/api/signalimage/radar/sar_bp.rst index 11fe5854c..ef51fe85a 100644 --- a/docs_input/api/signalimage/radar/sar_bp.rst +++ b/docs_input/api/signalimage/radar/sar_bp.rst @@ -11,9 +11,361 @@ namespace as its API is subject to change. .. doxygenfunction:: sar_bp(const ImageType &initial_image, const RangeProfilesType &range_profiles, const PlatPosType &platform_positions, const VoxLocType &voxel_locations, const RangeToMcpType &range_to_mcp, const SarBpParams ¶ms) .. doxygenenum:: matx::SarBpComputeType .. doxygenenum:: matx::SarBpFeature +.. doxygenstruct:: matx::PropSarBpTaylorFastAddThirdOrder .. doxygenstruct:: matx::SarBpParams :members: +TaylorFast Range Approximation +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +``SarBpComputeType::TaylorFast`` approximates the range from each pulse to each +pixel by expanding the range about a reference pixel for the current thread +block. The goal is to replace most per-pixel range work with arithmetic in a +local coordinate system, while computing the expensive reference quantities once +per pulse and per thread block. By default, ``TaylorFast`` keeps terms through +second order in the local pixel offset. Users can add the third-order term with +the ``PropSarBpTaylorFastAddThirdOrder`` property. The third-order term is most +useful for short stand-off ranges where the second-order approximation may not +provide sufficient accuracy. + +The derivation of the Taylor approximation follows. +For pulse :math:`p`, let :math:`a_p` be the platform position and let :math:`m_p` be the +range to the motion compensation point. For image pixel position :math:`x`, the +range quantity used by backprojection is + +.. math:: + + \Delta R_p(x) = \|x - a_p\| - m_p. + +For a thread block, choose a reference pixel :math:`x_0` near the center of the +block. Define + +.. math:: + + r_0 = x_0 - a_p, \qquad R_0 = \|r_0\|, \qquad u = \frac{r_0}{R_0}, + +where :math:`u` is the unit vector pointing from the platform position to the reference pixel. Hereafter, we generally drop +the :math:`p` subscript for brevity. For any pixel in the same thread block, write its local offset as + +.. math:: + + d = x - x_0. + +Then the exact range to the pixel is + +.. math:: + + R(d) = \|r_0 + d\| = \|(x_0 - a_p) + (x - x_0)\| = \|x - a_p\|. + +Decompose :math:`d` into the component along :math:`u` and the component perpendicular +to :math:`u`: + +.. math:: + + s = u \cdot d. + +Here, :math:`s` is the signed scalar projection of :math:`d` along :math:`u` and :math:`u` is the +along-range direction. The perpendicular, or cross-range, vector +is then :math:`d - s u`. The squared norm of this perpendicular component is: + +.. math:: + + \begin{aligned} + \|d - s u\|^2 + &= d \cdot d - 2s(u \cdot d) + s^2(u \cdot u) \\ + &= \|d\|^2 - 2s(u \cdot d) + s^2 \\ + &= \|d\|^2 - 2s^2 + s^2 \\ + &= \|d\|^2 - s^2 + \end{aligned} + +We denote this squared norm term as :math:`q`: + +.. math:: + + q = \|d\|^2 - s^2. + +Mathematically, :math:`q` is the squared perpendicular distance from the local pixel +to the pulse-to-reference-pixel line. With these definitions, recall that +:math:`R(d) = \|r_0 + d\|`. Thus, + +.. math:: + + \begin{aligned} + R(d)^2 &= \|r_0 + d\|^2 \\ + &= \|R_0 u + d\|^2 = (R_0 u + d) \cdot (R_0 u + d) \\ + &= R_0^2 (u \cdot u) + 2 R_0 (u \cdot d) + d \cdot d \\ + &= R_0^2 + 2 R_0 s + \|d\|^2 \\ + &= R_0^2 + 2 R_0 s + s^2 + q \\ + &= (R_0 + s)^2 + q + \end{aligned} + +where we used :math:`q = \|d\|^2 - s^2` and thus :math:`\|d\|^2 = s^2 + q`. + +Finally, + +.. math:: + + R(d) = \sqrt{(R_0 + s)^2 + q}. + +:math:`R(d)` is the exact range to the pixel :math:`x` and the range difference relative to the block reference is + +.. math:: + + \Delta R(d) = R(d) - R_0. + +We started this section with the differential range from pixel :math:`x` to the motion compensation range +:math:`m_p`. We can now write this as: + +.. math:: + + \begin{aligned} + \Delta R_p(x) &= \|x - a_p\| - m_p \\ + &= R_p(x) - m_p \\ + &= (R_0 + \Delta R(d)) - m_p \\ + &= \Delta R(d) + (R_0 - m_p) + \end{aligned} + +The bin coordinate for pixel :math:`x` is typically: + +.. math:: + + b(x) = \frac{\Delta R_p(x)}{\Delta r} + b_{\mathrm{offset}} + +Here :math:`\Delta r` is the range-bin spacing and :math:`b_{\mathrm{offset}}` is the centered +range-bin offset used by the SAR backprojection operator. We can now reformulate this as: + +.. math:: + + \begin{aligned} + b(x) &= \frac{\Delta R_p(x)}{\Delta r} + b_{\mathrm{offset}} \\ + &= \frac{\Delta R(d) + (R_0 - m_p)}{\Delta r} + b_{\mathrm{offset}} \\ + &= \frac{\Delta R(d)}{\Delta r} + \frac{R_0 - m_p}{\Delta r} + b_{\mathrm{offset}} + \end{aligned} + +We can precompute the right-hand terms as: + +.. math:: + + b_0 = \frac{R_0 - m_p}{\Delta r} + b_{\mathrm{offset}} + +and thus: + +.. math:: + + b(x) = b_0 + \frac{\Delta R(d)}{\Delta r} + +The following sections derive an approximation of :math:`\Delta R(d)` using a Taylor series expansion. + +Derivation of the Taylor Expansion +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +Starting from + +.. math:: + + R(d) = \sqrt{R_0^2 + 2 R_0 s + \|d\|^2}, + +factor out :math:`R_0`: + +.. math:: + + R(d) + = R_0 + \sqrt{1 + \frac{2s}{R_0} + \frac{\|d\|^2}{R_0^2}}. + +Let + +.. math:: + + y = \frac{2s}{R_0} + \frac{\|d\|^2}{R_0^2}. + +The scalar Taylor series is applied to the one-variable function +:math:`f(y) = \sqrt{1 + y}` about :math:`y = 0`: + +.. math:: + + f(y) = \sqrt{1 + y} + = 1 + \frac{y}{2} - \frac{y^2}{8} + \frac{y^3}{16} + + O(y^4), + +so + +.. math:: + + R(d) + = R_0 + \left( + 1 + \frac{y}{2} - \frac{y^2}{8} + \frac{y^3}{16} + \right) + + O(R_0 y^4). + +The order used by ``TaylorFast`` is not the number of scalar Taylor-series +terms retained in :math:`f(y)`. It is the order in the local pixel offset +:math:`d`. Recall that :math:`s = u \cdot d`, so :math:`s` is linear in :math:`d`. + +The following table shows how the scalar Taylor-series terms contribute to the +local-offset orders used by the ``TaylorFast`` approximation. + +.. list-table:: + :header-rows: 1 + :widths: 24 42 34 + + * - Scalar term + - Contribution + - Local-offset order + * - :math:`R_0` + - :math:`R_0` + - Zeroth order. This is the reference range and cancels in :math:`\Delta R(d) = R(d) - R_0`. + * - :math:`R_0 y / 2` + - :math:`s + \dfrac{s^2 + q}{2R_0}` + - First-order term :math:`s`; second-order term :math:`(s^2 + q)/(2R_0)`. + * - :math:`-R_0 y^2 / 8` + - :math:`-\dfrac{s^2}{2R_0} - \dfrac{s(s^2 + q)}{2R_0^2} + O(d^4)` + - Second-order term :math:`-s^2/(2R_0)`; third-order term :math:`-s(s^2 + q)/(2R_0^2)`. + * - :math:`R_0 y^3 / 16` + - :math:`\dfrac{s^3}{2R_0^2} + O(d^4)` + - Third-order term :math:`s^3/(2R_0^2)`. + +The first-order local-offset term is therefore + +.. math:: + + s. + +The second-order local-offset terms are + +.. math:: + + R_0 + \left( + \frac{s^2 + q}{2 R_0^2} + - \frac{4s^2}{8 R_0^2} + \right) + = + \frac{q}{2 R_0}. + +The third-order local-offset terms are + +.. math:: + + R_0 + \left( + -\frac{4s(s^2 + q)}{8 R_0^3} + + \frac{8s^3}{16 R_0^3} + \right) + = + -\frac{s q}{2 R_0^2}. + +Combining terms gives + +.. math:: + + R(d) + = + R_0 + + s + + \frac{q}{2 R_0} + - \frac{s q}{2 R_0^2} + + O\left(\frac{\|d\|^4}{R_0^3}\right). + +Equivalently, the local range delta is + +.. math:: + + \Delta R(d) + = + s + + \frac{q}{2 R_0} + - \frac{s q}{2 R_0^2} + + O\left(\frac{\|d\|^4}{R_0^3}\right). + + +Second-Order Approximation +^^^^^^^^^^^^^^^^^^^^^^^^^^ + +The second-order approximation keeps the linear along-range term and the +quadratic cross-range correction: + +.. math:: + + \Delta R(d) + = + s + \frac{q}{2 R_0}. + +We then compute the bin coordinate as + +.. math:: + + b(d) \approx b_0 + \frac{1}{\Delta r} + \left(s + \frac{q}{2R_0}\right). + +The leading omitted term is + +.. math:: + + \epsilon(d) + \approx + -\frac{s q}{2 R_0^2}. + +This is the default ``TaylorFast`` implementation. It is most accurate when the pixel block +is small relative to the platform stand-off range and when the along-range +offset :math:`s` remains small. + +Third-Order Approximation +^^^^^^^^^^^^^^^^^^^^^^^^^ + +The ``PropSarBpTaylorFastAddThirdOrder`` property instantiates a ``TaylorFast`` +kernel variant that also keeps the cubic term: + +.. math:: + + \Delta R(d) + = + s + + \frac{q}{2 R_0} + - \frac{s q}{2 R_0^2}. + +The corresponding bin coordinate is + +.. math:: + + b(d) \approx b_0 + \frac{1}{\Delta r} + \left( + s + + \frac{q}{2R_0} + - \frac{s q}{2R_0^2} + \right). + +With the third-order term included, the leading omitted fourth-order terms are + +.. math:: + + \epsilon(d) + \approx + \frac{s^2 q}{2 R_0^3} + - \frac{q^2}{8 R_0^3}. + +Compared to the second-order form, this requires additional per-pixel arithmetic +and will thus have correspondingly lower throughput than the second-order form. +Keeping the third-order term avoids the short-range accuracy loss that occurs when +the image block is large relative to :math:`R_0` or when the platform is close enough +that the :math:`s q / R_0^2` term is no longer negligible. + +Accuracy Considerations +^^^^^^^^^^^^^^^^^^^^^^^ + +The approximation is local to a thread block. Its accuracy depends on the ratio +between the maximum local pixel offset and :math:`R_0`. For satellite-scale data, +:math:`R_0` is typically very large compared to the dimensions of a CUDA thread block, so +the second-order approximation is likely sufficient. It is an assumption that adjacent +pixels are spatially near one another and thus a compact pixel tile has a relatively small +spatial extent. + +For shorter stand-off ranges, larger image tiles, or geometries where the look direction varies rapidly across +a thread block, the third-order term becomes more important. The third-order +property uses a separate kernel instantiation, which avoids a run-time order +dispatch inside the backprojection kernel. + Examples ~~~~~~~~ @@ -22,3 +374,16 @@ Examples :start-after: example-begin sar-bp-1 :end-before: example-end sar-bp-1 :dedent: + +The ``PropSarBpTaylorFastAddThirdOrder`` property is a compile-time MatX operator property. The compute type is +selected separately through ``SarBpParams``; the property only adds the +third-order term to a ``TaylorFast`` launch: + +.. literalinclude:: ../../../../test/00_transform/SarBp.cu + :language: cpp + :start-after: example-begin sar-bp-2 + :end-before: example-end sar-bp-2 + :dedent: + +The property only affects ``SarBpComputeType::TaylorFast``. Other compute types +continue to use their ordinary kernel instantiations. diff --git a/examples/sarbp/README.md b/examples/sarbp/README.md index b3d05c1fe..86a5a7a64 100644 --- a/examples/sarbp/README.md +++ b/examples/sarbp/README.md @@ -106,7 +106,8 @@ real/imag, row-major), written to `output_image.raw` in this example. | `-w {hamming,none}` | Window for range compression (default: hamming) | | `-b {auto,all,0,N}` | Pulses per processing block. `auto` uses the GPU L2 cache size to choose a block size; `all` and `0` use all pulses (default: auto) | | `--image-tiles N` | Process the image as N x N tiles during backprojection (default: 1) | -| `--precision {double,float,fltflt,mixed}` | Backprojection compute precision (default: mixed) | +| `--taylor-fast-third-order` | Add the third-order range term when using `--precision taylor_fast` | +| `--precision {double,float,fltflt,mixed,taylor_fast}` | Backprojection compute precision (default: mixed) | | `--warmup` | Warmup GPU kernels and FFT plans before timed run | The `--precision` flag controls the arithmetic used by the `sar_bp` operator. For spaceborne SAR, `float` does not provide enough precision to store fractional wavelengths at the range-to-MCP magnitudes (hundreds of km), so pure `float` is not sufficient to produce focused images. The available modes are: @@ -114,6 +115,7 @@ The `--precision` flag controls the arithmetic used by the `sar_bp` operator. Fo - `double` -- full double-precision arithmetic. Most accurate. - `mixed` -- double-precision for range computation, single-precision elsewhere. Default. Close to `double` in image quality with slightly higher throughput on GPUs with reduced double-precision throughput. Other than `float`, this is the fastest option on hardware with full-throughput double-precision (e.g., A100, H100/H200, B200). - `fltflt` -- float-float evaluation using two `float` values for the high-precision range math. Significantly higher throughput on GPUs where `double` throughput is reduced (e.g., RTX PROs, Jetson Orin/Thor, gaming GPUs). +- `taylor_fast` -- local Taylor approximation of the pulse-to-pixel range about a centered per-thread-block reference point. Highest-throughput experimental mode for spaceborne SAR geometries where moderate approximation error is acceptable. - `float` -- single-precision throughout. Fastest but not accurate enough for most spaceborne data. ## 6. View the Result diff --git a/examples/sarbp/sarbp.cu b/examples/sarbp/sarbp.cu index 10853014e..38067bdd5 100755 --- a/examples/sarbp/sarbp.cu +++ b/examples/sarbp/sarbp.cu @@ -226,6 +226,7 @@ struct BpRunCtx { bool is_fx_domain; bool is_int16_mode; bool apply_window; + bool taylor_fast_add_third_order; int sgn; bool do_warmup; std::string output_file; @@ -316,9 +317,13 @@ static int run_bp_device(PosTensor blk_positions, RtmTensor blk_rtm, auto cur_image = matx::slice(image, {y0, x0}, {y1, x1}); auto cur_voxel_locations = matx::slice(voxel_locations, {y0, x0}, {y1, x1}); - (cur_image = matx::experimental::sar_bp( - cur_image, cur_profiles, cur_positions, cur_voxel_locations, cur_rtm, ctx.params)) - .run(ctx.exec); + auto bp = matx::experimental::sar_bp( + cur_image, cur_profiles, cur_positions, cur_voxel_locations, cur_rtm, ctx.params); + if (ctx.taylor_fast_add_third_order) { + (cur_image = bp.template props()).run(ctx.exec); + } else { + (cur_image = bp).run(ctx.exec); + } } } }; @@ -388,11 +393,12 @@ static int run_bp_device(PosTensor blk_positions, RtmTensor blk_rtm, std::cout << " done" << std::endl; } - // Pre-allocate pinned host buffer for image output + // Pre-allocate pinned host buffer for image output. const size_t num_pixels = static_cast(ctx.image_height) * static_cast(ctx.image_width); + const size_t image_bytes = num_pixels * sizeof(complex_t); complex_t *h_image = nullptr; - MATX_CUDA_CHECK(cudaHostAlloc(&h_image, num_pixels * sizeof(complex_t), cudaHostAllocDefault)); + MATX_CUDA_CHECK(cudaHostAlloc(&h_image, image_bytes, cudaHostAllocDefault)); std::cout << "Running backprojection (" << ctx.output_range_bins << " range bins, del_r=" << ctx.del_r << " m)..." << std::endl; @@ -533,8 +539,8 @@ static int run_bp_device(PosTensor blk_positions, RtmTensor blk_rtm, } } - // Copy result to pinned host buffer (included in timed region) - MATX_CUDA_CHECK(cudaMemcpyAsync(h_image, image.Data(), num_pixels * sizeof(complex_t), + // Copy result to host buffer (included in timed region) + MATX_CUDA_CHECK(cudaMemcpyAsync(h_image, image.Data(), image_bytes, cudaMemcpyDeviceToHost, ctx.stream)); MATX_CUDA_CHECK(cudaEventRecord(ev_stop, ctx.stream)); @@ -582,7 +588,7 @@ static int run_bp_device(PosTensor blk_positions, RtmTensor blk_rtm, return 1; } out.write(reinterpret_cast(h_image), - static_cast(num_pixels * sizeof(complex_t))); + static_cast(image_bytes)); out.close(); std::cout << "Wrote " << ctx.image_height << " x " << ctx.image_width @@ -674,8 +680,10 @@ int main(int argc, char **argv) { << " -b, --block-size \n" << " Pulses per block; 0/all use all pulses, auto uses an L2-cache heuristic (default: auto)\n" << " --image-tiles Process image as N x N tiles (default: 1)\n" + << " --taylor-fast-third-order\n" + << " Add the third-order term for --precision taylor_fast\n" << " --warmup Warmup GPU kernels and FFT plans before timed run\n" - << " --precision Compute precision: double, float, fltflt, mixed (default: mixed)\n" + << " --precision Compute precision: double, float, fltflt, mixed, taylor_fast (default: mixed)\n" << " -h, --help Print this help message and exit\n"; }; @@ -691,6 +699,7 @@ int main(int argc, char **argv) { std::string block_size_arg = "auto"; index_t image_tiles = 1; bool do_warmup = false; + bool taylor_fast_add_third_order = false; std::string precision_type = "mixed"; auto needs_value = [&](int i) -> bool { @@ -728,6 +737,8 @@ int main(int argc, char **argv) { } } else if (std::strcmp(argv[i], "--warmup") == 0) { do_warmup = true; + } else if (std::strcmp(argv[i], "--taylor-fast-third-order") == 0) { + taylor_fast_add_third_order = true; } else if (std::strcmp(argv[i], "--precision") == 0) { if (!needs_value(i)) return 1; precision_type = argv[++i]; @@ -765,6 +776,12 @@ int main(int argc, char **argv) { output_file = (dot != std::string::npos ? input_file.substr(0, dot) : input_file) + ".raw"; } + if (taylor_fast_add_third_order && precision_type != "taylor_fast") { + std::cerr << "ERROR: --taylor-fast-third-order requires --precision taylor_fast" << std::endl; + print_usage(); + return 1; + } + // ------------------------------------------------------------------- // Read .sarbp file header // ------------------------------------------------------------------- @@ -784,6 +801,15 @@ int main(int argc, char **argv) { const bool is_int16_mode = (hdr.flags & 0x2) != 0; const int sgn = hdr.sgn; + if (num_pulses <= 0 || num_range_bins <= 0 || + image_width <= 0 || image_height <= 0) { + std::cerr << "ERROR: invalid .sarbp dimensions: pulses=" << num_pulses + << ", samples=" << num_range_bins + << ", image=" << image_height << " x " << image_width + << std::endl; + return 1; + } + if (image_tiles > image_width || image_tiles > image_height) { std::cerr << "ERROR: --image-tiles " << image_tiles << " exceeds image dimensions " << image_height << " x " @@ -904,7 +930,9 @@ int main(int argc, char **argv) { // incompatible types and is undefined behaviour under strict aliasing. // unsigned char* may alias any object type, and std::memcpy of // trivially-copyable types is the standards-blessed way to bit-cast. - if (precision_type == "fltflt") { + const bool use_fltflt_platform_inputs = + precision_type == "fltflt"; + if (use_fltflt_platform_inputs) { auto *pos_bytes = reinterpret_cast(h_positions); for (size_t i = 0; i < static_cast(num_pulses) * 3; i++) { unsigned char *slot = pos_bytes + i * sizeof(double); @@ -985,12 +1013,16 @@ int main(int argc, char **argv) { params.compute_type = SarBpComputeType::FloatFloat; } else if (precision_type == "mixed") { params.compute_type = SarBpComputeType::Mixed; + } else if (precision_type == "taylor_fast") { + params.compute_type = SarBpComputeType::TaylorFast; } else { std::cerr << "ERROR: unknown precision type '" << precision_type - << "' (use double, float, fltflt, or mixed)" << std::endl; + << "' (use double, float, fltflt, mixed, or taylor_fast)" << std::endl; return 1; } - params.features = SarBpFeature::PhaseLUTOptimization; + if (params.compute_type != SarBpComputeType::Double) { + params.features = SarBpFeature::PhaseLUTOptimization; + } params.center_frequency = (sgn >= 0) ? -hdr.center_frequency : hdr.center_frequency; params.del_r = del_r; @@ -1034,7 +1066,11 @@ int main(int argc, char **argv) { << static_cast(phase_lut_bytes) / (1024.0 * 1024.0) << " MiB" << std::endl; } - std::cout << "BP precision : " << precision_type << std::endl; + std::cout << "BP precision : " << precision_type; + if (precision_type == "taylor_fast") { + std::cout << (taylor_fast_add_third_order ? " (third order)" : " (second order)"); + } + std::cout << std::endl; std::cout << "Image tiles : " << image_tiles << " x " << image_tiles << std::endl; @@ -1047,11 +1083,11 @@ int main(int argc, char **argv) { // Bundle non-tensor state for run_bp_device(). The platform-positions and // range-to-mcp tensors are allocated below with a type that matches - // `precision_type`: `tensor` / `tensor` for the standard - // paths, or `tensor` / `tensor` when - // `precision_type == "fltflt"`. In the fltflt case we already converted - // the pinned host buffers in-place from double to fltflt right after the - // file read above (same byte layout, no extra storage). The kernel + // `use_fltflt_platform_inputs`: `tensor` / `tensor` for the + // standard paths, or `tensor` / `tensor` for the FloatFloat + // path. In the fltflt case we already converted the pinned host buffers + // in-place from double to fltflt right after the file read above (same byte + // layout, no extra storage). The kernel // template (`SarBp<...>`) already handles both rank-1 (vector-of-double3 // / -fltflt3-style) and rank-2 (matrix of [pulses, 3]) layouts for // platform_positions, so run_bp_device() branches internally only on the @@ -1073,6 +1109,7 @@ int main(int argc, char **argv) { .is_fx_domain = is_fx_domain, .is_int16_mode = is_int16_mode, .apply_window = apply_window, + .taylor_fast_add_third_order = taylor_fast_add_third_order, .sgn = sgn, .do_warmup = do_warmup, .output_file = output_file, @@ -1086,7 +1123,7 @@ int main(int argc, char **argv) { }; int dev_status; - if (precision_type == "fltflt") { + if (use_fltflt_platform_inputs) { auto blk_positions = make_tensor({block_size, 3}, matx::MATX_DEVICE_MEMORY); auto blk_rtm = make_tensor({block_size}, matx::MATX_DEVICE_MEMORY); dev_status = run_bp_device(blk_positions, blk_rtm, voxel_locations, ctx); @@ -1098,8 +1135,8 @@ int main(int argc, char **argv) { if (dev_status != 0) return dev_status; - MATX_CUDA_CHECK(cudaFreeHost(h_positions)); - MATX_CUDA_CHECK(cudaFreeHost(h_range_to_mcp)); + if (h_positions) MATX_CUDA_CHECK(cudaFreeHost(h_positions)); + if (h_range_to_mcp) MATX_CUDA_CHECK(cudaFreeHost(h_range_to_mcp)); if (h_range_profiles) MATX_CUDA_CHECK(cudaFreeHost(h_range_profiles)); if (h_range_profiles_i16) MATX_CUDA_CHECK(cudaFreeHost(h_range_profiles_i16)); if (h_ampsf) MATX_CUDA_CHECK(cudaFreeHost(h_ampsf)); diff --git a/include/matx/kernels/sar_bp.cuh b/include/matx/kernels/sar_bp.cuh index 62f16f932..00d9b3019 100644 --- a/include/matx/kernels/sar_bp.cuh +++ b/include/matx/kernels/sar_bp.cuh @@ -74,7 +74,158 @@ static __device__ __forceinline__ double NewtonRaphsonSqrt(double x) { return NR_2; } -// ComputeBinToPixelFloatFloat() fuses the FloatFloat path's +// The base range bin index and interpolation weight. These are bundled together +// so that they can be returned from a single function call. +template +struct SarBpBinAndWeight { + weight_t w; + int32_t bin_floor_int; +}; + +template +__device__ inline strict_compute_t ComputeRangeToPixel(const PlatPosAccessor &ant_pos, const index_t pulse_idx, const loose_compute_t px, const loose_compute_t py, const loose_compute_t z0) { + using plat_pos_t = typename PlatPosAccessor::value_type; + constexpr int Rank = PlatPosAccessor::Rank; + constexpr bool IsFloat = ComputeType == SarBpComputeType::Float; + constexpr bool IsMixed = ComputeType == SarBpComputeType::Mixed; + + strict_compute_t dx, dy, dz; + static_assert((Rank == 1 && (cuda::std::is_same_v || cuda::std::is_same_v || + cuda::std::is_same_v || cuda::std::is_same_v)) || Rank == 2, + "ComputeRangeToPixel: plat_pos_t must be a 1D tensor of 3D or 4D vectorized type or a 2D tensor with size 3 (x,y,z) in the second dimension"); + + if constexpr (Rank == 1) { + const plat_pos_t ant_pos_p = ant_pos(pulse_idx); + dx = static_cast(px) - static_cast(ant_pos_p.x); + dy = static_cast(py) - static_cast(ant_pos_p.y); + dz = static_cast(z0) - static_cast(ant_pos_p.z); + } else { + dx = static_cast(px) - static_cast(ant_pos(pulse_idx, 0)); + dy = static_cast(py) - static_cast(ant_pos(pulse_idx, 1)); + dz = static_cast(z0) - static_cast(ant_pos(pulse_idx, 2)); + } + + if constexpr (IsFloat) { + return ::sqrtf(dx*dx + dy*dy + dz*dz); + } else { + if constexpr (IsMixed) { +#if __CUDA_ARCH__ == 700 || __CUDA_ARCH__ == 800 || __CUDA_ARCH__ == 900 || __CUDA_ARCH__ == 1000 + return ::sqrt(dx*dx + dy*dy + dz*dz); +#else + // Only use the Newton-Raphson approach on systems with reduced FP64 throughput + return NewtonRaphsonSqrt(dx*dx + dy*dy + dz*dz); +#endif + } else { + return ::sqrt(dx*dx + dy*dy + dz*dz); + } + } +} + +template +__global__ void SarBpFillPhaseLUT(cuda::std::complex *phase_lut, ComputeType ref_freq, ComputeType dr, index_t num_range_bins) +{ + const int tid = blockIdx.x * blockDim.x + threadIdx.x; + if (tid >= num_range_bins) return; + constexpr ComputeType four_pi_over_c = static_cast(4.0 * M_PI / SPEED_OF_LIGHT); + const ComputeType range_bin_start = static_cast(tid - 0.5 * (num_range_bins-1)) * dr; + const ComputeType phase = four_pi_over_c * ref_freq * range_bin_start; + if constexpr (cuda::std::is_same_v) { + ComputeType sinx, cosx; + ::sincosf(phase, &sinx, &cosx); + phase_lut[tid] = cuda::std::complex( + static_cast(cosx), static_cast(sinx)); + } else { + ComputeType sinx, cosx; + ::sincos(phase, &sinx, &cosx); + phase_lut[tid] = cuda::std::complex( + static_cast(cosx), static_cast(sinx)); + } +} + +// Template alias for the strict compute parameter type used in SarBp kernel +template +using strict_compute_param_t = typename std::conditional< + ComputeType == SarBpComputeType::Double || + ComputeType == SarBpComputeType::Mixed || + ComputeType == SarBpComputeType::FloatFloat || + ComputeType == SarBpComputeType::TaylorFast, + double, float>::type; + +template +using strict_or_ff_compute_param_t = typename std::conditional< + ComputeType == SarBpComputeType::FloatFloat, + fltflt, strict_compute_param_t>::type; + +template +using sarbp_strict_compute_t = typename std::conditional< + ComputeType == SarBpComputeType::Double || + ComputeType == SarBpComputeType::Mixed || + ComputeType == SarBpComputeType::TaylorFast, + double, float>::type; + +template +using loose_compute_param_t = typename std::conditional< + ComputeType == SarBpComputeType::Double, + double, float>::type; + +// Shared-memory layout for per-pulse-block cached state. Some compute modes +// cache per-pulse antenna data or Taylor reference values once per pulse block, +// amortizing global FP64 loads and FP64->compute-type casts across all threads. +template +struct SarBpPulseBlockCache {}; + +template +struct SarBpPulseBlockCache { + fltflt ant_pos[PULSE_BLOCK_SIZE][4]; +}; + +template <> +struct SarBpPulseBlockCache { + // Slots: [0..2] = ant_pos.x/y/z as float, [3] = bin_offset - rtm*dr_inv + float ant_pos[PULSE_BLOCK_SIZE][4]; +}; + +template <> +struct SarBpPulseBlockCache { + // Same layout as Float, but with double elements so strict_compute_t (double) + // arithmetic in the inner loop reads directly without up-casting. + double ant_pos[PULSE_BLOCK_SIZE][4]; +}; + +struct SarBpTaylorFastSharedMemory { + int32_t ref_bin_int[PULSE_BLOCK_SIZE]; + float ref_bin_frac[PULSE_BLOCK_SIZE]; + float ux [PULSE_BLOCK_SIZE]; + float uy [PULSE_BLOCK_SIZE]; + float uz [PULSE_BLOCK_SIZE]; + float half_inv_R [PULSE_BLOCK_SIZE]; + double ref_x; + double ref_y; + double ref_z; +}; + +template <> +struct SarBpPulseBlockCache : + SarBpTaylorFastSharedMemory {}; + +template +struct SarBpThreadState {}; + +template +struct SarBpThreadState { + sarbp_strict_compute_t range_delta; +}; + +template <> +struct SarBpThreadState { + float ref_dx = 0.0f; + float ref_dy = 0.0f; + float ref_dz = 0.0f; + float ref_dsq = 0.0f; + float dr_inv_f32 = 0.0f; +}; + +// ComputeBinWeightToPixelFloatFloat() fuses the FloatFloat path's // coordinate-difference, squared-norm, sqrt, and scale/add steps that would // otherwise be a separate norm3d + fma_approx pair. The kernel only needs the // range bin (FloatFloat requires PhaseLUT, so the bare range R is never used @@ -104,12 +255,17 @@ static __device__ __forceinline__ double NewtonRaphsonSqrt(double x) { // Domain: at least one of dx/dy/dz must not fully cancel (sum of squared // distances must be strictly positive) -- three-way simultaneous // cancellation is not a meaningful SAR geometry and is not supported. -__device__ inline fltflt ComputeBinToPixelFloatFloat( - fltflt apx, fltflt apy, fltflt apz, +__device__ inline SarBpBinAndWeight ComputeBinWeightToPixelFloatFloat( + const SarBpPulseBlockCache &sh_mem, + index_t ip, float px, float py, float pz, - fltflt dr_inv, - fltflt mcp_partial) + fltflt dr_inv) { + const fltflt apx = sh_mem.ant_pos[ip][0]; + const fltflt apy = sh_mem.ant_pos[ip][1]; + const fltflt apz = sh_mem.ant_pos[ip][2]; + const fltflt mcp_partial = sh_mem.ant_pos[ip][3]; + // Stage 1: loose dx/dy/dz. We use the general fltflt_two_sum (6 fp32 ops) // for all three dimensions. When it is known that one coordinate is // always greater than the other (e.g., |apz.hi| >= |pz|), the faster @@ -176,104 +332,100 @@ __device__ inline fltflt ComputeBinToPixelFloatFloat( // non-canonical sqrt result. fma_approx tolerates this because its only // dropped term is a.lo*b.lo = r_lo * dr_inv.lo, which is O(ULP^2 * R) // regardless of whether (yn, r_lo) has been canonicalized. - return fltflt_fma_approx(fltflt{yn, r_lo}, dr_inv, mcp_partial); + const fltflt bin = fltflt_fma_approx(fltflt{yn, r_lo}, dr_inv, mcp_partial); + float floor_hi = ::floorf(bin.hi); + float frac = (bin.hi - floor_hi) + bin.lo; + // bin.lo may push bin over a boundary, in which case floor and frac are incorrect. + // Compute an adjustment based on whether or not the fractional part is outside (0.0, 1.0). + const float adjust = ::floorf(frac); // -1, 0, or 1 + return SarBpBinAndWeight{ + frac - adjust, + static_cast(floor_hi + adjust) + }; } -template -__device__ inline strict_compute_t ComputeRangeToPixel(const PlatPosAccessor &ant_pos, const index_t pulse_idx, const loose_compute_t px, const loose_compute_t py, const loose_compute_t z0) { - using plat_pos_t = typename PlatPosAccessor::value_type; - constexpr int Rank = PlatPosAccessor::Rank; - - strict_compute_t dx, dy, dz; - static_assert((Rank == 1 && (cuda::std::is_same_v || cuda::std::is_same_v || - cuda::std::is_same_v || cuda::std::is_same_v)) || Rank == 2, - "ComputeRangeToPixel: plat_pos_t must be a 1D tensor of 3D or 4D vectorized type or a 2D tensor with size 3 (x,y,z) in the second dimension"); - - if constexpr (Rank == 1) { - const plat_pos_t ant_pos_p = ant_pos(pulse_idx); - dx = static_cast(px) - static_cast(ant_pos_p.x); - dy = static_cast(py) - static_cast(ant_pos_p.y); - dz = static_cast(z0) - static_cast(ant_pos_p.z); - } else { - dx = static_cast(px) - static_cast(ant_pos(pulse_idx, 0)); - dy = static_cast(py) - static_cast(ant_pos(pulse_idx, 1)); - dz = static_cast(z0) - static_cast(ant_pos(pulse_idx, 2)); +template +__device__ inline SarBpBinAndWeight ComputeBinWeightToPixelTaylorFast( + const SarBpPulseBlockCache &sh_mem, + const SarBpThreadState &thread_state, + index_t ip) +{ + // In the mathematical derivation in the documentation, we defined s = dot(u, d) as + // the dot product of the along-range unit vector u=(ux, uy, uz) with the local offset d, + // and q as the squared perpendicular distance from the local pixel + // to the pulse-to-reference-pixel line. These calculations then map directly to the + // derivation. We split the reference bin b_0 from the writeup into ref_bin_int and + // ref_bin_frac to preserve the fractional interpolation weight: otherwise, + // b_0 can be large enough that storing it as a float loses low-order bin + // precision. The inner loop computes only ref_bin_frac + delta_bin in fp32; + // the integer component is added back after floorf(), relying on the exact + // identity floor(n + x) = n + floor(x) for integer n while avoiding the + // lossy fp32 computation of n + x. ref_bin_int and ref_bin_frac + // are per-pulse values that were populated in FillPulseBlockCache. + const float s = sh_mem.ux[ip] * thread_state.ref_dx + + sh_mem.uy[ip] * thread_state.ref_dy + + sh_mem.uz[ip] * thread_state.ref_dz; + const float q = fmaf(-s, s, thread_state.ref_dsq); + const float half_inv_R = sh_mem.half_inv_R[ip]; + const float dR_2nd = fmaf(q, half_inv_R, s); + float dR = dR_2nd; + if constexpr (TaylorFastAddThirdOrder) { + dR = fmaf(-2.0f * half_inv_R * half_inv_R * s, q, dR_2nd); } + const float bin_loc = fmaf(dR, thread_state.dr_inv_f32, sh_mem.ref_bin_frac[ip]); + const float bin_floor = ::floorf(bin_loc); + return SarBpBinAndWeight{ + bin_loc - bin_floor, + sh_mem.ref_bin_int[ip] + static_cast(bin_floor) + }; +} - if constexpr (ComputeType == SarBpComputeType::Float) { - return ::sqrtf(dx*dx + dy*dy + dz*dz); +template +__device__ inline SarBpBinAndWeight ComputeBinWeightToPixelPulseBlockCache( + const SarBpPulseBlockCache &sh_mem, + index_t ip, + loose_compute_t px, + loose_compute_t py, + loose_compute_t pz, + strict_compute_t dr_inv) +{ + constexpr bool IsFloat = ComputeType == SarBpComputeType::Float; + constexpr bool IsMixed = ComputeType == SarBpComputeType::Mixed; + static_assert(IsFloat || IsMixed, "SarBp: pulse-block cache bin helper only supports Float and Mixed"); + + const strict_compute_t apx = sh_mem.ant_pos[ip][0]; + const strict_compute_t apy = sh_mem.ant_pos[ip][1]; + const strict_compute_t apz = sh_mem.ant_pos[ip][2]; + const strict_compute_t mcp_partial = sh_mem.ant_pos[ip][3]; + const strict_compute_t dx = static_cast(px) - apx; + const strict_compute_t dy = static_cast(py) - apy; + const strict_compute_t dz = static_cast(pz) - apz; + strict_compute_t dist; + if constexpr (IsFloat) { + dist = ::sqrtf(dx*dx + dy*dy + dz*dz); } else { - if constexpr (ComputeType == SarBpComputeType::Mixed) { #if __CUDA_ARCH__ == 700 || __CUDA_ARCH__ == 800 || __CUDA_ARCH__ == 900 || __CUDA_ARCH__ == 1000 - return ::sqrt(dx*dx + dy*dy + dz*dz); + dist = ::sqrt(dx*dx + dy*dy + dz*dz); #else - // Only use the Newton-Raphson approach on systems with reduced FP64 throughput - return NewtonRaphsonSqrt(dx*dx + dy*dy + dz*dz); + dist = NewtonRaphsonSqrt(dx*dx + dy*dy + dz*dz); #endif - } else { - return ::sqrt(dx*dx + dy*dy + dz*dz); - } } -} -template -__global__ void SarBpFillPhaseLUT(cuda::std::complex *phase_lut, ComputeType ref_freq, ComputeType dr, index_t num_range_bins) -{ - const int tid = blockIdx.x * blockDim.x + threadIdx.x; - if (tid >= num_range_bins) return; - constexpr ComputeType four_pi_over_c = static_cast(4.0 * M_PI / SPEED_OF_LIGHT); - const ComputeType range_bin_start = static_cast(tid - 0.5 * (num_range_bins-1)) * dr; - const ComputeType phase = four_pi_over_c * ref_freq * range_bin_start; - if constexpr (cuda::std::is_same_v) { - ComputeType sinx, cosx; - ::sincosf(phase, &sinx, &cosx); - phase_lut[tid] = cuda::std::complex( - static_cast(cosx), static_cast(sinx)); + // bin = (dist - rtm)*dr_inv + bin_offset = dist*dr_inv + mcp_partial + const strict_compute_t bin = dist * dr_inv + mcp_partial; + strict_compute_t bin_floor; + if constexpr (cuda::std::is_same_v) { + bin_floor = ::floor(bin); } else { - ComputeType sinx, cosx; - ::sincos(phase, &sinx, &cosx); - phase_lut[tid] = cuda::std::complex( - static_cast(cosx), static_cast(sinx)); + bin_floor = ::floorf(bin); } + return SarBpBinAndWeight{ + static_cast(bin - bin_floor), + static_cast(bin_floor) + }; } -// Template alias for the strict compute parameter type used in SarBp kernel -template -using strict_compute_param_t = typename std::conditional::type; - -template -using strict_or_ff_compute_param_t = typename std::conditional>::type; - -template -using loose_compute_param_t = typename std::conditional::type; - -// Shared-memory layout is parameterized on whether the kernel runs the -// cooperative preamble (see UseSharedPreamble in SarBp). The preamble loads -// per-pulse antenna positions and pre-computes mcp_partial = bin_offset -// - rtm*dr_inv once per pulse block, amortizing the global FP64 loads and -// FP64->compute-type casts across all threads. -template -struct SarBpSharedMemory {}; - -template -struct SarBpSharedMemory { - fltflt ant_pos[PULSE_BLOCK_SIZE][4]; -}; - -template <> -struct SarBpSharedMemory { - // Slots: [0..2] = ant_pos.x/y/z as float, [3] = bin_offset - rtm*dr_inv - float ant_pos[PULSE_BLOCK_SIZE][4]; -}; - -template <> -struct SarBpSharedMemory { - // Same layout as Float, but with double elements so strict_compute_t (double) - // arithmetic in the inner loop reads directly without up-casting. - double ant_pos[PULSE_BLOCK_SIZE][4]; -}; - -template +template __launch_bounds__(16*16) __global__ void SarBp(OutImageType output, const InitialImageType initial_image, const __grid_constant__ RangeProfilesType range_profiles, const __grid_constant__ PlatPosType platform_positions, const __grid_constant__ VoxLocType voxel_locations, const __grid_constant__ RangeToMcpType range_to_mcp, strict_or_ff_compute_param_t dr_inv, @@ -298,16 +450,22 @@ __global__ void SarBp(OutImageType output, const InitialImageType initial_image, cuda::std::is_same_v, "RangeToMcpType::value_type must be float, double, or fltflt"); + constexpr bool IsDouble = ComputeType == SarBpComputeType::Double; + constexpr bool IsMixed = ComputeType == SarBpComputeType::Mixed; + constexpr bool IsFloatFloat = ComputeType == SarBpComputeType::FloatFloat; + constexpr bool IsFloat = ComputeType == SarBpComputeType::Float; + constexpr bool IsTaylorFast = ComputeType == SarBpComputeType::TaylorFast; + using initial_image_t = typename InitialImageType::value_type; using image_t = typename OutImageType::value_type; using range_profiles_t = typename RangeProfilesType::value_type; using plat_pos_t = typename PlatPosType::value_type; using voxel_loc_t = typename VoxLocType::value_type; - using compute_t = typename std::conditional::type; - using strict_compute_t = typename std::conditional::type; - using strict_or_ff_compute_t = typename std::conditional::type; + using compute_t = typename std::conditional::type; + using strict_compute_t = sarbp_strict_compute_t; + using strict_or_ff_compute_t = typename std::conditional::type; using strict_complex_compute_t = cuda::std::complex; - using loose_compute_t = typename std::conditional::type; + using loose_compute_t = typename std::conditional::type; using loose_complex_compute_t = cuda::std::complex; const index_t image_height = output.Size(0); @@ -326,21 +484,20 @@ __global__ void SarBp(OutImageType output, const InitialImageType initial_image, "SarBp IsUnitStride path requires platform_positions to be a tensor view"); } - // The cooperative shared-memory preamble is required for FloatFloat (its - // inner loop reads directly from shared memory) and enabled for Float and - // Mixed when PhaseLUT is on. With PhaseLUT=false the inner loop still - // needs diffR = dist - rtm for sincos, which the preamble does not - // preserve, so fall back to direct global reads in that case. - constexpr bool UseSharedPreamble = - (ComputeType == SarBpComputeType::FloatFloat) || - ((ComputeType == SarBpComputeType::Float || - ComputeType == SarBpComputeType::Mixed) && PhaseLUT); + // Some compute modes cache per-pulse data in shared memory once per pulse + // block. With PhaseLUT=false, Float/Mixed still need the range-to-MCP + // delta for sincos, which the cache does not preserve, so fall back to + // direct global reads in that case. + constexpr bool UsePulseBlockCache = + IsFloatFloat || + IsTaylorFast || + ((IsFloat || IsMixed) && PhaseLUT); - __shared__ SarBpSharedMemory sh_mem; + __shared__ SarBpPulseBlockCache sh_mem; const bool is_valid = ix < image_width && iy < image_height; - if constexpr (!UseSharedPreamble) { - // When the preamble is enabled, keep all threads active so they can + if constexpr (!UsePulseBlockCache) { + // When the cache is enabled, keep all threads active so they can // participate in the CTA-wide cooperative loads of antenna positions. if (! is_valid) return; } @@ -381,32 +538,31 @@ __global__ void SarBp(OutImageType output, const InitialImageType initial_image, }; [[maybe_unused]] const loose_compute_t phase_correction_partial_loose = static_cast(phase_correction_partial); - const auto get_reference_phase = [&phase_lut, &phase_correction_partial, &phase_correction_partial_loose](strict_or_ff_compute_t diffR, int32_t bin_floor_int, loose_compute_t w) -> loose_complex_compute_t { - if constexpr (PhaseLUT) { - const loose_complex_compute_t base_phase = phase_lut[bin_floor_int]; - loose_compute_t incr_sinx, incr_cosx; - if constexpr (cuda::std::is_same_v) { - ::sincos(phase_correction_partial_loose * w, &incr_sinx, &incr_cosx); - } else { - __sincosf(phase_correction_partial_loose * w, &incr_sinx, &incr_cosx); - } + const auto get_reference_phase_lut = [&phase_lut, &phase_correction_partial_loose](int32_t bin_floor_int, loose_compute_t w) -> loose_complex_compute_t { + const loose_complex_compute_t base_phase = phase_lut[bin_floor_int]; + loose_compute_t incr_sinx, incr_cosx; + if constexpr (cuda::std::is_same_v) { + ::sincos(phase_correction_partial_loose * w, &incr_sinx, &incr_cosx); + } else { + __sincosf(phase_correction_partial_loose * w, &incr_sinx, &incr_cosx); + } + + return loose_complex_compute_t{ + base_phase.real() * incr_cosx - base_phase.imag() * incr_sinx, + base_phase.real() * incr_sinx + base_phase.imag() * incr_cosx + }; + }; - return loose_complex_compute_t{ - base_phase.real() * incr_cosx - base_phase.imag() * incr_sinx, - base_phase.real() * incr_sinx + base_phase.imag() * incr_cosx - }; + const auto get_reference_phase_direct = [&phase_correction_partial](strict_compute_t range_delta) -> loose_complex_compute_t { + strict_compute_t sinx, cosx; + if constexpr (cuda::std::is_same_v) { + ::sincos(phase_correction_partial * range_delta, &sinx, &cosx); } else { - // With PhaseLUT == false, strict_or_ff_compute_t is either float or double, so we can use sincos[f] directly. - strict_or_ff_compute_t sinx, cosx; - if constexpr (cuda::std::is_same_v) { - ::sincos(phase_correction_partial * diffR, &sinx, &cosx); - } else { - ::sincosf(phase_correction_partial * diffR, &sinx, &cosx); - } - return loose_complex_compute_t{ - static_cast(cosx), static_cast(sinx) - }; + ::sincosf(phase_correction_partial * range_delta, &sinx, &cosx); } + return loose_complex_compute_t{ + static_cast(cosx), static_cast(sinx) + }; }; // Explicitly use FMA instructions for pixel accumulations @@ -431,17 +587,47 @@ __global__ void SarBp(OutImageType output, const InitialImageType initial_image, [[maybe_unused]] const int tid = threadIdx.x + threadIdx.y * blockDim.x; + SarBpThreadState thread_state{}; + if constexpr (IsTaylorFast) { + static_assert(PhaseLUT == true, "SarBp: TaylorFast compute type requires PhaseLUT optimization"); + if constexpr (PhaseLUT) { + thread_state.dr_inv_f32 = static_cast(dr_inv); + if (tid == 0) { + // Use the midpoint of the CTA's valid image footprint. For + // partial right/bottom edge CTAs, clamping the nominal CTA + // center to image_width-1/image_height-1 would choose an edge + // or corner reference and unnecessarily increase local offsets. + const index_t block_x0 = static_cast(blockIdx.x * blockDim.x); + const index_t block_y0 = static_cast(blockIdx.y * blockDim.y); + const index_t valid_width = ((image_width - block_x0) < static_cast(blockDim.x)) ? + (image_width - block_x0) : static_cast(blockDim.x); + const index_t valid_height = ((image_height - block_y0) < static_cast(blockDim.y)) ? + (image_height - block_y0) : static_cast(blockDim.y); + const index_t ax_c = block_x0 + valid_width / 2; + const index_t ay_c = block_y0 + valid_height / 2; + const voxel_loc_t ref_loc = voxel_locations(ay_c, ax_c); + sh_mem.ref_x = static_cast(ref_loc.x); + sh_mem.ref_y = static_cast(ref_loc.y); + sh_mem.ref_z = static_cast(ref_loc.z); + } + __syncthreads(); + + thread_state.ref_dx = is_valid ? static_cast(static_cast(voxel_loc.x) - sh_mem.ref_x) : 0.0f; + thread_state.ref_dy = is_valid ? static_cast(static_cast(voxel_loc.y) - sh_mem.ref_y) : 0.0f; + thread_state.ref_dz = is_valid ? static_cast(static_cast(voxel_loc.z) - sh_mem.ref_z) : 0.0f; + thread_state.ref_dsq = fmaf(thread_state.ref_dx, thread_state.ref_dx, + fmaf(thread_state.ref_dy, thread_state.ref_dy, + thread_state.ref_dz * thread_state.ref_dz)); + } + } + loose_complex_compute_t accum{}; const loose_compute_t bin_offset = static_cast(0.5) * static_cast(num_range_bins-1); - const int num_pulse_blocks = static_cast( - (num_pulses + static_cast(PULSE_BLOCK_SIZE) - 1) / static_cast(PULSE_BLOCK_SIZE)); - for (int block = 0; block < num_pulse_blocks; ++block) { - const index_t pulse_base = static_cast(block) * static_cast(PULSE_BLOCK_SIZE); - const index_t pulses_remaining = num_pulses - pulse_base; - const index_t num_pulses_in_block = - (pulses_remaining < static_cast(PULSE_BLOCK_SIZE)) ? pulses_remaining : static_cast(PULSE_BLOCK_SIZE); - if constexpr (UseSharedPreamble) { + // Most ComputeTypes amortize some redundant computations or data conversions by leveraging + // per-pulse-block shared memory. + const auto FillPulseBlockCache = [&](index_t pulse_base, index_t num_pulses_in_block) { + if constexpr (UsePulseBlockCache) { __syncthreads(); for (index_t ip = tid; ip < num_pulses_in_block; ip += blockDim.x * blockDim.y) { const index_t p = pulse_base + ip; @@ -457,13 +643,36 @@ __global__ void SarBp(OutImageType output, const InitialImageType initial_image, }; const auto xyz = load_xyz(); - if constexpr (ComputeType == SarBpComputeType::FloatFloat) { + if constexpr (IsFloatFloat) { sh_mem.ant_pos[ip][0] = static_cast(cuda::std::get<0>(xyz)); sh_mem.ant_pos[ip][1] = static_cast(cuda::std::get<1>(xyz)); sh_mem.ant_pos[ip][2] = static_cast(cuda::std::get<2>(xyz)); const fltflt rtm = static_cast(r_to_mcp(p)); const fltflt neg_rtm = fltflt{-rtm.hi, -rtm.lo}; sh_mem.ant_pos[ip][3] = fltflt_fma_approx(neg_rtm, dr_inv, bin_offset); + } else if constexpr (IsTaylorFast) { + const strict_compute_t pcx = static_cast(cuda::std::get<0>(xyz)); + const strict_compute_t pcy = static_cast(cuda::std::get<1>(xyz)); + const strict_compute_t pcz = static_cast(cuda::std::get<2>(xyz)); + const strict_compute_t rtm = static_cast(r_to_mcp(p)); + + const strict_compute_t dxa = sh_mem.ref_x - pcx; + const strict_compute_t dya = sh_mem.ref_y - pcy; + const strict_compute_t dza = sh_mem.ref_z - pcz; + // We assume double for TaylorFast strict_compute_t and thus use ::sqrt()/::floor() here + static_assert(cuda::std::is_same_v, "SarBp: TaylorFast compute type requires double precision"); + const strict_compute_t ref_range_pure = ::sqrt(dxa*dxa + dya*dya + dza*dza); + const strict_compute_t ref_range = ref_range_pure - rtm; + const strict_compute_t ref_bin = ref_range * dr_inv + static_cast(bin_offset); + const strict_compute_t bin_floor = ::floor(ref_bin); + const strict_compute_t inv_R_d = static_cast(1.0) / ref_range_pure; + + sh_mem.ref_bin_int [ip] = static_cast(bin_floor); + sh_mem.ref_bin_frac[ip] = static_cast(ref_bin - bin_floor); + sh_mem.ux [ip] = static_cast(dxa * inv_R_d); + sh_mem.uy [ip] = static_cast(dya * inv_R_d); + sh_mem.uz [ip] = static_cast(dza * inv_R_d); + sh_mem.half_inv_R [ip] = static_cast(static_cast(0.5) * inv_R_d); } else { // Float / Mixed: cast inputs to strict_compute_t (float / double) // once per pulse here, instead of once per pulse per pixel. @@ -479,111 +688,99 @@ __global__ void SarBp(OutImageType output, const InitialImageType initial_image, } } __syncthreads(); - if (! is_valid) { - continue; - } } + }; + + const int num_pulse_blocks = static_cast( + (num_pulses + static_cast(PULSE_BLOCK_SIZE) - 1) / static_cast(PULSE_BLOCK_SIZE)); + for (int block = 0; block < num_pulse_blocks; ++block) { + const index_t pulse_base = static_cast(block) * static_cast(PULSE_BLOCK_SIZE); + const index_t pulses_remaining = num_pulses - pulse_base; + const index_t num_pulses_in_block = + (pulses_remaining < static_cast(PULSE_BLOCK_SIZE)) ? pulses_remaining : static_cast(PULSE_BLOCK_SIZE); + + FillPulseBlockCache(pulse_base, num_pulses_in_block); + if (! is_valid) { + continue; + } + #pragma unroll 4 for (index_t ip = 0; ip < num_pulses_in_block; ++ip) { const index_t p = pulse_base + ip; - strict_or_ff_compute_t diffR; - loose_compute_t w; - int32_t bin_floor_int; - if constexpr (ComputeType == SarBpComputeType::FloatFloat) { - // ComputeBinToPixelFloatFloat fuses the coordinate-difference, - // squared-norm, sqrt, and (R-mcp)*dr_inv + bin_offset chain into - // one helper that never materializes a canonical R. The - // shared-memory slot sh_mem.ant_pos[ip][3] holds - // -mcp*dr_inv + bin_offset, precomputed in the pulse-block - // preamble. - const fltflt bin = ComputeBinToPixelFloatFloat( - sh_mem.ant_pos[ip][0], sh_mem.ant_pos[ip][1], sh_mem.ant_pos[ip][2], - px, py, pz, dr_inv, sh_mem.ant_pos[ip][3]); - diffR = bin; // unused below (FloatFloat requires PhaseLUT); assign to silence maybe-uninitialized warning - float floor_hi = ::floorf(bin.hi); - float frac = (bin.hi - floor_hi) + bin.lo; - // bin.lo may push bin over a boundary, in which case floor and frac are incorrect. - // Compute an adjustment based on whether or not the fractional part is outside (0.0, 1.0). - const float adjust = ::floorf(frac); // -1, 0, or 1 - bin_floor_int = static_cast(floor_hi + adjust); - w = frac - adjust; - } else if constexpr (UseSharedPreamble) { - // Float / Mixed with shared-mem preamble: antenna position and - // mcp_partial have been pre-loaded / pre-computed in shared - // memory, so the inner loop is pure strict_compute_t arithmetic. - const strict_compute_t apx = sh_mem.ant_pos[ip][0]; - const strict_compute_t apy = sh_mem.ant_pos[ip][1]; - const strict_compute_t apz = sh_mem.ant_pos[ip][2]; - const strict_compute_t mcp_partial = sh_mem.ant_pos[ip][3]; - const strict_compute_t dx = static_cast(px) - apx; - const strict_compute_t dy = static_cast(py) - apy; - const strict_compute_t dz = static_cast(pz) - apz; - strict_compute_t dist; - if constexpr (ComputeType == SarBpComputeType::Float) { - dist = ::sqrtf(dx*dx + dy*dy + dz*dz); - } else { -#if __CUDA_ARCH__ == 700 || __CUDA_ARCH__ == 800 || __CUDA_ARCH__ == 900 || __CUDA_ARCH__ == 1000 - dist = ::sqrt(dx*dx + dy*dy + dz*dz); -#else - dist = NewtonRaphsonSqrt(dx*dx + dy*dy + dz*dz); -#endif - } - // bin = (dist - rtm)*dr_inv + bin_offset = dist*dr_inv + mcp_partial - const strict_compute_t bin = dist * dr_inv + mcp_partial; - strict_compute_t bin_floor; - if constexpr (cuda::std::is_same_v) { - bin_floor = ::floor(bin); - } else { - bin_floor = ::floorf(bin); - } - w = static_cast(bin - bin_floor); - bin_floor_int = static_cast(bin_floor); - // diffR is unused when PhaseLUT=true (required for this branch); - // assign to avoid any maybe-uninitialized warning downstream. - diffR = dist; + + // The following branches all compute at least the base range bin index and interpolation weight + // for the current pulse. Paths not using the PhaseLUT optimization will also populate the + // thread_state.range_delta member for use in get_reference_phase_direct(). + SarBpBinAndWeight bin_weight; + if constexpr (IsFloatFloat) { + bin_weight = ComputeBinWeightToPixelFloatFloat(sh_mem, ip, px, py, pz, dr_inv); + } else if constexpr (IsTaylorFast) { + static_assert(PhaseLUT == true, "SarBp: TaylorFast compute type requires PhaseLUT optimization"); + bin_weight = ComputeBinWeightToPixelTaylorFast(sh_mem, thread_state, ip); + } else if constexpr (UsePulseBlockCache) { + bin_weight = ComputeBinWeightToPixelPulseBlockCache(sh_mem, ip, px, py, pz, dr_inv); } else { - diffR = ComputeRangeToPixel( - pp, p, px, py, pz) - static_cast(r_to_mcp(p)); - const strict_compute_t bin = diffR * dr_inv + bin_offset; + // Below is the most direct / standard path for backprojection. We compute the distance + // from the antenna phase center to the pixel for this pulse and from that the differential range + // to the MCP. From that differential range, we compute the base range bin index and interpolation weight + // for range profile sampling. Finally, we use the differential range to compute the reference phase + // and accumulate the contribution to the pixel. All other paths implement similar logic using various + // optimizations. + const strict_compute_t range_delta = + ComputeRangeToPixel( + pp, p, px, py, pz) - static_cast(r_to_mcp(p)); + if constexpr (!PhaseLUT) { + thread_state.range_delta = range_delta; + } + const strict_compute_t bin = range_delta * dr_inv + bin_offset; strict_compute_t bin_floor; if constexpr (cuda::std::is_same_v) { bin_floor = ::floor(bin); } else { bin_floor = ::floorf(bin); } - w = static_cast(bin - bin_floor); - bin_floor_int = static_cast(bin_floor); + bin_weight.w = static_cast(bin - bin_floor); + bin_weight.bin_floor_int = static_cast(bin_floor); } - if (bin_floor_int >= 0 && bin_floor_int < num_range_bins-1) { - // rp accessor picks the fast pointer path on IsUnitStride or - // falls through to operator(). - const range_profiles_t sample_lo = rp(p, bin_floor_int); - const range_profiles_t sample_hi = rp(p, bin_floor_int + 1); - - const loose_complex_compute_t sample = [&sample_lo, &sample_hi, &w]() -> loose_complex_compute_t { - const loose_complex_compute_t loose_sample_lo = static_cast(sample_lo); - const loose_complex_compute_t loose_sample_hi = static_cast(sample_hi); - if constexpr (cuda::std::is_same_v) { - return loose_complex_compute_t{ - __fmaf_rn(w, loose_sample_hi.real(), __fmaf_rn(-w, loose_sample_lo.real(), loose_sample_lo.real())), - __fmaf_rn(w, loose_sample_hi.imag(), __fmaf_rn(-w, loose_sample_lo.imag(), loose_sample_lo.imag())) - }; - } else { - return loose_complex_compute_t{ - fma(w, loose_sample_hi.real(), fma(-w, loose_sample_lo.real(), loose_sample_lo.real())), - fma(w, loose_sample_hi.imag(), fma(-w, loose_sample_lo.imag(), loose_sample_lo.imag())) - }; - } - }(); - - // For FloatFloat mode, diffR has been set to the distance to the pixel rather than the differential range to the MCP. - // However, FloatFloat mode currently requires PhaseLUT optimization due to missing fltflt sin/cos implementations, - // so diffR will not actually be used in get_reference_phase() below. - static_assert(ComputeType != SarBpComputeType::FloatFloat || PhaseLUT == true, "SarBp: FloatFloat compute type requires PhaseLUT optimization"); - const loose_complex_compute_t ref_phase = get_reference_phase(diffR, bin_floor_int, w); - accum = accumulate_contribution(accum, sample, ref_phase); + if (bin_weight.bin_floor_int < 0 || bin_weight.bin_floor_int >= num_range_bins-1) { + continue; } + + // rp accessor picks the fast pointer path on IsUnitStride or + // falls through to operator(). + const range_profiles_t sample_lo = rp(p, bin_weight.bin_floor_int); + const range_profiles_t sample_hi = rp(p, bin_weight.bin_floor_int + 1); + + const loose_complex_compute_t sample = [&sample_lo, &sample_hi, &bin_weight]() -> loose_complex_compute_t { + const loose_complex_compute_t loose_sample_lo = static_cast(sample_lo); + const loose_complex_compute_t loose_sample_hi = static_cast(sample_hi); + const loose_compute_t w = bin_weight.w; + if constexpr (cuda::std::is_same_v) { + return loose_complex_compute_t{ + __fmaf_rn(w, loose_sample_hi.real(), __fmaf_rn(-w, loose_sample_lo.real(), loose_sample_lo.real())), + __fmaf_rn(w, loose_sample_hi.imag(), __fmaf_rn(-w, loose_sample_lo.imag(), loose_sample_lo.imag())) + }; + } else { + return loose_complex_compute_t{ + fma(w, loose_sample_hi.real(), fma(-w, loose_sample_lo.real(), loose_sample_lo.real())), + fma(w, loose_sample_hi.imag(), fma(-w, loose_sample_lo.imag(), loose_sample_lo.imag())) + }; + } + }(); + + static_assert((!IsFloatFloat && + !IsTaylorFast) || PhaseLUT == true, + "SarBp: FloatFloat and TaylorFast compute types require PhaseLUT optimization"); + const loose_complex_compute_t ref_phase = [&]() { + if constexpr (PhaseLUT) { + return get_reference_phase_lut(bin_weight.bin_floor_int, bin_weight.w); + } else { + return get_reference_phase_direct(thread_state.range_delta); + } + }(); + + accum = accumulate_contribution(accum, sample, ref_phase); } // pulse } // pulse block diff --git a/include/matx/operators/sar_bp.h b/include/matx/operators/sar_bp.h index c7ac37451..f50fc945e 100644 --- a/include/matx/operators/sar_bp.h +++ b/include/matx/operators/sar_bp.h @@ -32,6 +32,7 @@ #pragma once +#include "matx/core/props.h" #include "matx/core/type_utils.h" #include "matx/operators/base_operator.h" @@ -44,9 +45,14 @@ namespace matx { * While the inputs (range profiles, antenna positions, etc.) and output (image) have their own data * types, we may wish to use a different precision for the internal calculations. For example, the * output may be cuda::std::complex while the intermediate calculations are done in double. + * For some compute types, additional approximations will be applied to further improve run-time performance. */ enum class SarBpComputeType { - Double, //!< Uses double precision for all intermediate calculations. + Double, /**< Uses double precision for all intermediate calculations. The backprojection kernel works with user-provided input + for the range profiles, platform positions, ranges to mocomp point, and voxel locations. Thus, the user should provide + values in double precision if needed. Typically, the platform positions and ranges to mocomp point will be double + precision and range profiles and voxel locations will be single precision, but users can provide double precision + range profiles and voxel locations if needed. */ Mixed, /**< Uses mixed precision for intermediate calculations. This compute type offers a trade-off between performance and precision. With \p Mixed precision, the range calculated per pixel-pulse pair will still typically be done in double-precision, but interpolation and accumulation will be single-precision. @@ -56,8 +62,19 @@ enum class SarBpComputeType { float-float handling of the values for which fp64 would otherwise be needed. The float-float representation offers increased precision relative to fp32, but not full fp64 precision, through the use of increased fp32 computation and representing each value as an unevaluated - sum of two fp32 components. */ - Float //!< Uses single precision for all intermediate calculations. + sum of two fp32 components. For optimal performance, users can provide any inputs that require + additional precision (e.g., platform positions and range to mocomp point) directly in \c fltflt format. + Otherwise, the kernel will perform double-to-fltflt conversions internally. + FloatFloat requires \p PhaseLUTOptimization. */ + TaylorFast, /**< Uses a local Taylor approximation of the pulse-to-pixel range about a centered per-thread-block + reference point. This mode prioritizes throughput and requires \p PhaseLUTOptimization. + The \p PropSarBpTaylorFastAddThirdOrder property can be used to add a third-order term to the Taylor approximation. + On systems with reduced double-precision throughput, TaylorFast is typically the fastest compute type using either + second order or third order Taylor approximation. TaylorFast will divide by reference range values to certain image + pixels and thus does not support geometries where the antenna phase center is located at any of the image pixels + (i.e., at range 0 from some pixel). In such cases, the user should choose a different compute type.*/ + Float /**< Uses single precision for all intermediate calculations. This mode is fast, but typically does not provide + sufficient precision for cases where the ranges exceed several kilometers. It is not suited for spaceborne SAR geometries. */ }; /** @@ -69,9 +86,21 @@ enum class SarBpFeature : uint32_t { to store partial values for the reference phases used during backprojection. The value from the lookup table will be combined with an incremental phase calculation within a single range bin that is computed using the lower-precision intrinsic sine/cosine functions. This optimization will utilize a small amount of device memory as a workspace - buffer. This optimization is typically only useful for the \p Mixed and \p FloatFloat compute types. */ + buffer. This optimization is typically useful for the \p Mixed, \p FloatFloat, and \p TaylorFast compute types. + It is required for \p FloatFloat and \p TaylorFast. */ }; +/** + * @brief Adds the third-order local range term to SarBpComputeType::TaylorFast. + * + * TaylorFast uses a second-order local Taylor range approximation by default. + * This property instantiates a TaylorFast kernel variant that also evaluates + * the third-order term. It has no effect unless SarBpParams::compute_type is + * SarBpComputeType::TaylorFast. This is primarily useful for short stand-off + * ranges where the second-order approximation may not be accurate enough. + */ +struct PropSarBpTaylorFastAddThirdOrder {}; + // Enable bitmask operations for SarBpFeature constexpr SarBpFeature operator|(SarBpFeature lhs, SarBpFeature rhs) noexcept { return static_cast(static_cast(lhs) | static_cast(rhs)); @@ -107,8 +136,11 @@ struct SarBpParams { namespace matx { namespace detail { - template - class SarBpOp : public BaseOp> + template + struct make_sar_bp_op; + + template + class SarBpOp : public BaseOp> { static_assert(is_complex_v, "Phase history must be complex"); static_assert(is_complex_v, "Image must be complex"); @@ -169,7 +201,11 @@ namespace detail { void Exec(Out &&out, Executor &&ex) const { static_assert(is_cuda_executor_v, "sarbp() only supports the CUDA executor currently"); - sar_bp_impl(cuda::std::get<0>(out), initial_image_, range_profiles_, platform_positions_, voxel_locations_, range_to_mcp_, params_, ex.getStream()); + constexpr bool TaylorFastAddThirdOrder = + has_property_tag; + sar_bp_impl( + cuda::std::get<0>(out), initial_image_, range_profiles_, platform_positions_, + voxel_locations_, range_to_mcp_, params_, ex.getStream()); } static __MATX_INLINE__ constexpr __MATX_HOST__ __MATX_DEVICE__ int32_t Rank() @@ -226,22 +262,42 @@ namespace detail { return out_dims_[dim]; } + template + constexpr auto props() const { + using AllProps = typename merge_props_unique, NewProps...>::type; + return make_sar_bp_op::make( + initial_image_, range_profiles_, platform_positions_, voxel_locations_, range_to_mcp_, params_); + } + + }; + + template + struct make_sar_bp_op> { + static constexpr auto make(const ImageType &initial_image, const RangeProfilesType &range_profiles, + const PlatPosType &platform_positions, const VoxLocType &voxel_locations, + const RangeToMcpType &range_to_mcp, const SarBpParams ¶ms) { + return SarBpOp( + initial_image, range_profiles, platform_positions, voxel_locations, range_to_mcp, params); + } }; } namespace experimental { /** -* @brief SAR backprojection. +* @brief sar_bp implements standard or direct SAR backprojection. Backprojection is an image formation algorithm +* that reconstructs a complex-valued image from a set of range-compressed complex samples. The runtime complexity +* of the sar_bp transform is O(MNP) where there are MxN pixels and P pulses. In the case that M=N=P, sar_bp is +* O(N^3). * -* @note The number of range bins (\p range_profiles second dimension) is constrained as follows: -* For the Float, Mixed, and FloatFloat compute types, num_range_bins is capped at 2^24 (16,777,216). -* Those paths compute the per-pulse bin offset and (for Float / FloatFloat) the bin floor in fp32, and -* fp32 can exactly represent all integers only in [-2^24, 2^24]; above 2^24 the representable gaps grow -* (2.0 at 2^24+, 4.0 at 2^25+, ...), so the bin index would lose precision near the upper boundary. -* The Double compute type uses fp64 throughout for the bin computation and is only constrained by the -* int32_t tensor-indexing limit (num_range_bins <= cuda::std::numeric_limits::max()). The transform throws -* \c matxInvalidParameter at launch if these limits are exceeded. Typical raw num_range_bins is on the +* @note The number of range bins is constrained as follows: +* - For the Float, Mixed, FloatFloat, and TaylorFast compute types, num_range_bins is capped at 2^24 (16,777,216). +* Those paths compute the per-pulse bin offset and (for Float / FloatFloat / TaylorFast) the bin floor in fp32, and +* fp32 can exactly represent all integers only in [-2^24, 2^24]. Exceeding 2^24 would result in loss of precision +* in the bin index used for range profile sampling. +* - The Double compute type uses fp64 throughout for the bin computation and is only constrained by the +* int32_t tensor-indexing limit (num_range_bins <= cuda::std::numeric_limits::max()). The transform throws \c matxInvalidParameter +* at launch if these limits are exceeded. Typical raw num_range_bins is on the * order of 10^4-10^5 and well below the 2^24 bound. The fp32 limit can be reached, however, when heavy * range oversampling is applied. An upsample factor that pushes the FFT length above ~2^24 / * num_samples_raw (e.g., upsampling 32k raw samples by 512x or more) for non-Double compute types will trigger the runtime check. @@ -251,21 +307,26 @@ namespace experimental { * @tparam RangeProfilesType Type of range_profiles. RangeProfilesType must represent a 2D operator of size num_pulses x num_range_bins containing the range-compressed complex samples. * RangeProfilesType must be a complex type. Typical data types are cuda::std::complex or cuda::std::complex. * @tparam PlatPosType Type of platform positions. PlatPosType must represent a 1D operator of size num_pulses containing the platform positions. Currently, the only supported data -* types for PlatPosType are double3, double4, float3, and float4. If the user has three separate operators for the x, y, and z coordinates, they can be combined using the zipvec operator. +* types for PlatPosType are double3, double4, float3, and float4. For the float4 and double4 data types, the w coordinate is ignored. +* If the user has three separate operators for the x, y, and z coordinates, they can be combined using the zipvec operator. * @tparam VoxLocType Type of voxel locations. VoxLocType must represent a 2D operator of size image_height x image_width containing the voxel locations. Currently, the only supported * data types for VoxLocType are double3, double4, float3, and float4. For the float4 and double4 data types, the w coordinate is ignored. * If the user has three separate operators for the x, y, and z coordinates, they can be combined using the zipvec operator. -* @tparam RangeToMcpType Type of range to motion compensation point. RangeToMcpType must represent a 0D or 1D real-valued operator of size 1 or num_pulses. +* @tparam RangeToMcpType Type of range to motion compensation point. RangeToMcpType must represent a 0D or 1D real-valued operator of size 1 or num_pulses, respectively. * @param initial_image Initial image. Initial image must represent a 2D operator of size image_height x image_width for an image of the corresponding dimensions. Contributions computed * during backprojection will be added to the initial image. The user can use the zeros generator (i.e., matx::zeros) if no initial image is needed. +* See \p ImageType documentation for details on supported rank and data types. * @param range_profiles Range profiles. Range profiles must represent a 2D operator of size num_pulses x num_range_bins containing the range-compressed complex samples. +* See \p RangeProfilesType documentation for details on supported rank and data types. * @param platform_positions Platform positions represent the x, y, and z coordinates of the aperture phase center for each pulse. The coordinates should be in * the same coordinate system and units as the voxel locations. See \p PlatPosType documentation for details on supported rank and data types. * @param voxel_locations Voxel locations represent the x, y, and z coordinates of the voxels in the image. The coordinates should be in * the same coordinate system and units as the platform positions. See \p VoxLocType documentation for details on supported rank and data types. * @param range_to_mcp Range to motion compensation point is the distance (range) from each platform position to the motion compensation point. * See \p RangeToMcpType documentation for details on supported rank and data types. -* @param params SAR backprojection parameters. See \p SarBpParams documentation for details on supported parameters. +* @param params SAR backprojection parameters. See \p SarBpParams documentation for details on supported parameters. Note that the \p TaylorFast +* compute type does not support geometries where the antenna phase center is located at any of the image pixels (i.e., at range 0 from some pixel). +* In such cases, the user should choose another compute type to avoid potential divide-by-zero errors at run-time. */ template inline auto sar_bp(const ImageType &initial_image, const RangeProfilesType &range_profiles, diff --git a/include/matx/transforms/sar_bp.h b/include/matx/transforms/sar_bp.h index 323f15a79..77bf9cd47 100644 --- a/include/matx/transforms/sar_bp.h +++ b/include/matx/transforms/sar_bp.h @@ -44,7 +44,7 @@ namespace matx { -template +template inline void sar_bp_impl(OutImageType &out, const InitialImageType &initial_image, const RangeProfilesType &range_profiles, const PlatPosType &platform_positions, const VoxLocType &voxel_locations, const RangeToMcpType &range_to_mcp, const SarBpParams ¶ms, cudaStream_t stream = 0) { #ifdef __CUDACC__ @@ -58,11 +58,12 @@ inline void sar_bp_impl(OutImageType &out, const InitialImageType &initial_image MATX_STATIC_ASSERT_STR(RangeProfilesType::Rank() == 2, matxInvalidDim, "sar_bp: range profiles must be a 2D tensor"); const bool phase_lut_optimization = has_feature(params.features, SarBpFeature::PhaseLUTOptimization); - if (params.compute_type == SarBpComputeType::FloatFloat && ! phase_lut_optimization) { - // We currently require that phase LUT optimization be enabled for the FloatFloat compute type + if ((params.compute_type == SarBpComputeType::FloatFloat || + params.compute_type == SarBpComputeType::TaylorFast) && ! phase_lut_optimization) { + // We currently require that phase LUT optimization be enabled for these compute types // because we do not yet have float-float based sin/cos implementations. Thus, we would fall back - // to double-precision sincos() computations, which defeats the purpose of the FloatFloat compute type. - MATX_THROW(matxInvalidParameter, "sar_bp: FloatFloat compute type requires phase LUT optimization"); + // to double-precision sincos() computations for FloatFloat and TaylorFast is defined as a LUT-backed mode. + MATX_THROW(matxInvalidParameter, "sar_bp: FloatFloat and TaylorFast compute types require phase LUT optimization"); } const index_t num_pulses = range_profiles.Size(0); @@ -83,12 +84,12 @@ inline void sar_bp_impl(OutImageType &out, const InitialImageType &initial_image "the kernel indexes range bins via a 32-bit integer"); } - // The Float, Mixed, and FloatFloat compute types all use loose_compute_t = + // The Float, Mixed, FloatFloat, and TaylorFast compute types all use loose_compute_t = // float, which makes the per-pulse `bin_offset = 0.5 * (num_range_bins - 1)` // an fp32 value. fp32 can exactly represent all integers in [-2^24, 2^24]; // above that, the gaps grow (2.0 at 2^24+, 4.0 at 2^25+, ...), so // bin_offset would lose precision and bin_floor_int would be off by up to - // ~1 bin for every pixel. (Float and FloatFloat additionally use floorf() + // ~1 bin for every pixel. (Float, FloatFloat, and TaylorFast additionally use floorf() // on an fp32 bin value, which is constrained by the same limit.) We // therefore cap num_range_bins at 2^24 for those paths. // @@ -100,7 +101,7 @@ inline void sar_bp_impl(OutImageType &out, const InitialImageType &initial_image if (fp32_bin_path && range_profiles.Size(1) > FP32_MAX_RANGE_BINS) { MATX_THROW(matxInvalidParameter, "sar_bp: num_range_bins exceeds the maximum supported value of 2^24 " - "(16,777,216) for Float/Mixed/FloatFloat compute types -- fp32 mantissa " + "(16,777,216) for Float/Mixed/FloatFloat/TaylorFast compute types -- fp32 mantissa " "cannot exactly represent bin indices above 2^24. Use the Double " "compute type for larger range-bin counts."); } @@ -177,6 +178,11 @@ inline void sar_bp_impl(OutImageType &out, const InitialImageType &initial_image SarBpFillPhaseLUT<<>>(phase_lut, params.center_frequency, params.del_r, range_profiles.Size(1)); SarBp<<>>( out, initial_image, range_profiles, platform_positions, voxel_locations, range_to_mcp, static_cast(dr_inv), phase_correction_partial, phase_lut); + } else if (params.compute_type == SarBpComputeType::TaylorFast) { + cuda::std::complex *phase_lut = static_cast *>(workspace); + SarBpFillPhaseLUT<<>>(phase_lut, params.center_frequency, params.del_r, range_profiles.Size(1)); + SarBp<<>>( + out, initial_image, range_profiles, platform_positions, voxel_locations, range_to_mcp, dr_inv, phase_correction_partial, phase_lut); } else { cuda::std::complex *phase_lut = static_cast *>(workspace); SarBpFillPhaseLUT<<>>(phase_lut, static_cast(params.center_frequency), static_cast(params.del_r), range_profiles.Size(1)); @@ -193,10 +199,11 @@ inline void sar_bp_impl(OutImageType &out, const InitialImageType &initial_image } else if (params.compute_type == SarBpComputeType::Mixed) { SarBp<<>>( out, initial_image, range_profiles, platform_positions, voxel_locations, range_to_mcp, dr_inv, phase_correction_partial, nullptr); - } else if (params.compute_type == SarBpComputeType::FloatFloat) { - // We currently require that phase LUT optimization be enabled for the FloatFloat compute type. See comment + } else if (params.compute_type == SarBpComputeType::FloatFloat || + params.compute_type == SarBpComputeType::TaylorFast) { + // We currently require that phase LUT optimization be enabled for these compute types. See comment // in run-time check higher in this function. - MATX_THROW(matxInvalidParameter, "sar_bp: FloatFloat compute type requires phase LUT optimization"); + MATX_THROW(matxInvalidParameter, "sar_bp: FloatFloat and TaylorFast compute types require phase LUT optimization"); } else { SarBp<<>>( out, initial_image, range_profiles, platform_positions, voxel_locations, range_to_mcp, diff --git a/test/00_transform/SarBp.cu b/test/00_transform/SarBp.cu index 9829d7a0e..614e67db0 100644 --- a/test/00_transform/SarBp.cu +++ b/test/00_transform/SarBp.cu @@ -136,7 +136,7 @@ TYPED_TEST(SarBpTestNonComplexNonHalfFloatTypes, NonMixedTypes) MATX_EXIT_HANDLER(); } -// Test Mixed and FloatFloat precisions. These precisions are used when the user wants +// Test Mixed, FloatFloat, and TaylorFast precisions. These precisions are used when the user wants // better than fp32 precision at lower cost than fp64 (especially for GPUs with reduced FP64 throughput). TYPED_TEST(SarBpTestDoubleType, MixedPrecision) { @@ -193,9 +193,10 @@ TYPED_TEST(SarBpTestDoubleType, MixedPrecision) this->exec.sync(); } - // Enable PhaseLUTOptimization + // Enable PhaseLUTOptimization for all remaining tests + params.features = SarBpFeature::PhaseLUTOptimization; + { - params.features = SarBpFeature::PhaseLUTOptimization; (image = matx::experimental::sar_bp(zero_image, range_profiles, platform_positions, voxel_locations, range_to_mcp, params)).run(this->exec); this->exec.sync(); } @@ -208,11 +209,28 @@ TYPED_TEST(SarBpTestDoubleType, MixedPrecision) this->exec.sync(); } + { + params.compute_type = SarBpComputeType::TaylorFast; + (image = matx::experimental::sar_bp(zero_image, range_profiles, platform_positions, voxel_locations, range_to_mcp, params)).run(this->exec); + this->exec.sync(); + + // We reset TaylorFast and enable PhaseLUTOptimization below to include the text in the + // example that will be used in the documentation. + // example-begin sar-bp-2 + params.compute_type = SarBpComputeType::TaylorFast; + params.features = SarBpFeature::PhaseLUTOptimization; + + (image = matx::experimental::sar_bp(zero_image, range_profiles, platform_positions, voxel_locations, range_to_mcp, params) + .props()).run(this->exec); + // example-end sar-bp-2 + this->exec.sync(); + } + MATX_EXIT_HANDLER(); } // Verify the num_range_bins boundary at 2^24 for compute types that compute -// the per-pulse bin_offset in fp32 (Float, Mixed, FloatFloat). fp32 can +// the per-pulse bin_offset in fp32 (Float, Mixed, FloatFloat, TaylorFast). fp32 can // exactly represent all integers in [-2^24, 2^24] but not above, so those // paths accept num_range_bins == 2^24 and reject anything larger. // @@ -317,8 +335,10 @@ TYPED_TEST(SarBpTestDoubleType, PointTarget) using ExecType = cuda::std::tuple_element_t<1, TypeParam>; - constexpr matx::index_t image_width = 128; - constexpr matx::index_t image_height = 128; + // Divisible by 4 for target-location checks, but not by the 16x16 CUDA block + // size, so TaylorFast exercises partial edge CTAs. + constexpr matx::index_t image_width = 132; + constexpr matx::index_t image_height = 132; const matx::index_t num_pulses = 128; const matx::index_t num_range_bins = 128; @@ -337,7 +357,7 @@ TYPED_TEST(SarBpTestDoubleType, PointTarget) // Target range relative to mocomp point (0,0,0) const double target_R = ::sqrt(target_x * target_x + target_y * target_y); - matx::SarBpParams bp_params; + matx::SarBpParams bp_params{}; bp_params.del_r = ::sqrt((image_width * pix_dx) * (image_width * pix_dx) + (image_height * pix_dy) * (image_height * pix_dy)) / num_range_bins; bp_params.center_frequency = 10.0e9; const double bin_offset = 0.5 * (num_range_bins - 1); @@ -401,6 +421,9 @@ TYPED_TEST(SarBpTestDoubleType, PointTarget) } } }; + // Single-precision machine epsilon is ~1.1921e-7; allow one fp32 epsilon + // per pulse of accumulation at the focused target pixel. + const double f32_accum_re_thresh = 1.2e-7 * static_cast(num_pulses); // Start with fully fp64 backprojector. The range profiles and all position values are double precision. { @@ -480,7 +503,7 @@ TYPED_TEST(SarBpTestDoubleType, PointTarget) const float expected = 0.98f * num_pulses; const float actual = image(i, j).real(); ASSERT_GT(actual, expected); - ASSERT_LT(std::abs(image(i, j).imag()), 1.0f); + ASSERT_LT(std::abs(image(i, j).imag()), 3.0f); } else if (i < image_height / 8 || i >= 3 * image_height / 8) { // Away from the scatterer in range, we should not have any non-zero values. ASSERT_EQ(abs(image(i, j).real()), 0.0f); @@ -494,30 +517,66 @@ TYPED_TEST(SarBpTestDoubleType, PointTarget) } } - // Mixed and FloatFloat versions of the backprojector. For these, we need double-precision platform positions - // and range to the mcp. + // Mixed, FloatFloat, and TaylorFast versions of the backprojector. For these, we need double-precision + // platform positions and range to the mcp. { - bp_params.features = matx::SarBpFeature::PhaseLUTOptimization; - auto platform_positions = matx::make_tensor({num_pulses}); auto range_to_mcp = matx::make_tensor({num_pulses}); MATX_CUDA_CHECK(cudaMemcpy( platform_positions.Data(), h_antenna_phase_centers.data(), h_antenna_phase_centers.size() * sizeof(double3), cudaMemcpyHostToDevice)); MATX_CUDA_CHECK(cudaMemcpy(range_to_mcp.Data(), h_range_to_mcp.data(), h_range_to_mcp.size() * sizeof(double), cudaMemcpyHostToDevice)); - bp_params.compute_type = matx::SarBpComputeType::Mixed; - // example-begin sar-bp-1 - (image = matx::experimental::sar_bp(zero_image, range_profiles, platform_positions, voxel_locations, range_to_mcp, bp_params)).run(this->exec); - // example-end sar-bp-1 - this->exec.sync(); + { + SCOPED_TRACE("Mixed no PhaseLUT"); - validate(image, 1.05e-10, 1.0e-2); + // example-begin sar-bp-1 + bp_params.compute_type = matx::SarBpComputeType::Mixed; + (image = matx::experimental::sar_bp(zero_image, range_profiles, platform_positions, voxel_locations, range_to_mcp, bp_params)).run(this->exec); + // example-end sar-bp-1 + this->exec.sync(); - bp_params.compute_type = matx::SarBpComputeType::FloatFloat; - (image = matx::experimental::sar_bp(zero_image, range_profiles, platform_positions, voxel_locations, range_to_mcp, bp_params)).run(this->exec); - this->exec.sync(); + validate(image, f32_accum_re_thresh, 1.0e-7); + } + + bp_params.features = matx::SarBpFeature::PhaseLUTOptimization; + + { + SCOPED_TRACE("Mixed PhaseLUT"); + + bp_params.compute_type = matx::SarBpComputeType::Mixed; + (image = matx::experimental::sar_bp(zero_image, range_profiles, platform_positions, voxel_locations, range_to_mcp, bp_params)).run(this->exec); + this->exec.sync(); + + validate(image, f32_accum_re_thresh, 1.0e-2); + } + + { + SCOPED_TRACE("FloatFloat"); - validate(image, 1.05e-10, 1.0e-2); + bp_params.compute_type = matx::SarBpComputeType::FloatFloat; + (image = matx::experimental::sar_bp(zero_image, range_profiles, platform_positions, voxel_locations, range_to_mcp, bp_params)).run(this->exec); + this->exec.sync(); + + validate(image, f32_accum_re_thresh, 1.0e-2); + } + + { + SCOPED_TRACE("TaylorFast second order"); + bp_params.compute_type = matx::SarBpComputeType::TaylorFast; + (image = matx::experimental::sar_bp(zero_image, range_profiles, platform_positions, voxel_locations, range_to_mcp, bp_params)).run(this->exec); + this->exec.sync(); + + validate(image, f32_accum_re_thresh, 1.8e-2); + } + + { + SCOPED_TRACE("TaylorFast third order"); + (image = matx::experimental::sar_bp(zero_image, range_profiles, platform_positions, voxel_locations, range_to_mcp, bp_params) + .props()).run(this->exec); + this->exec.sync(); + + validate(image, f32_accum_re_thresh, 1.8e-2); + } } MATX_EXIT_HANDLER();