diff --git a/docs_input/api/signalimage/radar/sar_bp.rst b/docs_input/api/signalimage/radar/sar_bp.rst
index 11fe5854c..ef51fe85a 100644
--- a/docs_input/api/signalimage/radar/sar_bp.rst
+++ b/docs_input/api/signalimage/radar/sar_bp.rst
@@ -11,9 +11,361 @@ namespace as its API is subject to change.
 .. doxygenfunction:: sar_bp(const ImageType &initial_image, const RangeProfilesType &range_profiles, const PlatPosType &platform_positions, const VoxLocType &voxel_locations, const RangeToMcpType &range_to_mcp, const SarBpParams &params)
 .. doxygenenum:: matx::SarBpComputeType
 .. doxygenenum:: matx::SarBpFeature
+.. doxygenstruct:: matx::PropSarBpTaylorFastAddThirdOrder
 .. doxygenstruct:: matx::SarBpParams
    :members:
 
+TaylorFast Range Approximation
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+``SarBpComputeType::TaylorFast`` approximates the range from each pulse to each
+pixel by expanding the range about a reference pixel for the current thread
+block. The goal is to replace most per-pixel range work with arithmetic in a
+local coordinate system, while computing the expensive reference quantities once
+per pulse and per thread block. By default, ``TaylorFast`` keeps terms through
+second order in the local pixel offset. Users can add the third-order term with
+the ``PropSarBpTaylorFastAddThirdOrder`` property. The third-order term is most
+useful for short stand-off ranges where the second-order approximation may not
+provide sufficient accuracy.
+
+The derivation of the Taylor approximation follows.
+For pulse :math:`p`, let :math:`a_p` be the platform position and let :math:`m_p` be the
+range to the motion compensation point. For image pixel position :math:`x`, the
+range quantity used by backprojection is
+
+.. math::
+
+   \Delta R_p(x) = \|x - a_p\| - m_p.
+
+For a thread block, choose a reference pixel :math:`x_0` near the center of the
+block. Define
+
+.. math::
+
+   r_0 = x_0 - a_p, \qquad R_0 = \|r_0\|, \qquad u = \frac{r_0}{R_0},
+
+where :math:`u` is the unit vector pointing from the platform position to the reference pixel. Hereafter, we generally drop
+the :math:`p` subscript for brevity. For any pixel in the same thread block, write its local offset as
+
+.. math::
+
+   d = x - x_0.
+
+Then the exact range to the pixel is
+
+.. math::
+
+   R(d) = \|r_0 + d\| = \|(x_0 - a_p) + (x - x_0)\| = \|x - a_p\|.
+
+Decompose :math:`d` into the component along :math:`u` and the component perpendicular
+to :math:`u`:
+
+.. math::
+
+   s = u \cdot d.
+
+Here, :math:`s` is the signed scalar projection of :math:`d` along :math:`u` and :math:`u` is the
+along-range direction. The perpendicular, or cross-range, vector
+is then :math:`d - s u`. The squared norm of this perpendicular component is:
+
+.. math::
+
+   \begin{aligned}
+   \|d - s u\|^2
+     &= d \cdot d - 2s(u \cdot d) + s^2(u \cdot u) \\
+     &= \|d\|^2 - 2s(u \cdot d) + s^2 \\
+     &= \|d\|^2 - 2s^2 + s^2 \\
+     &= \|d\|^2 - s^2
+   \end{aligned}
+
+We denote this squared norm term as :math:`q`:
+
+.. math::
+
+   q = \|d\|^2 - s^2.
+
+Mathematically, :math:`q` is the squared perpendicular distance from the local pixel
+to the pulse-to-reference-pixel line. With these definitions, recall that
+:math:`R(d) = \|r_0 + d\|`. Thus,
+
+.. math::
+
+   \begin{aligned}
+   R(d)^2 &= \|r_0 + d\|^2 \\
+          &= \|R_0 u + d\|^2 = (R_0 u + d) \cdot (R_0 u + d) \\
+          &= R_0^2 (u \cdot u) + 2 R_0 (u \cdot d) + d \cdot d \\
+          &= R_0^2 + 2 R_0 s + \|d\|^2 \\
+          &= R_0^2 + 2 R_0 s + s^2 + q \\
+          &= (R_0 + s)^2 + q
+   \end{aligned}
+
+where we used :math:`q = \|d\|^2 - s^2` and thus :math:`\|d\|^2 = s^2 + q`.
+
+Finally,
+
+.. math::
+
+   R(d) = \sqrt{(R_0 + s)^2 + q}.
+
+:math:`R(d)` is the exact range to the pixel :math:`x` and the range difference relative to the block reference is
+
+.. math::
+
+   \Delta R(d) = R(d) - R_0.
+
+We started this section with the differential range from pixel :math:`x` to the motion compensation range
+:math:`m_p`. We can now write this as:
+
+.. math::
+
+   \begin{aligned}
+   \Delta R_p(x) &= \|x - a_p\| - m_p \\
+                 &= R_p(x) - m_p \\
+                 &= (R_0 + \Delta R(d)) - m_p \\
+                 &= \Delta R(d) + (R_0 - m_p)
+   \end{aligned}
+
+The bin coordinate for pixel :math:`x` is typically:
+
+.. math::
+
+   b(x) = \frac{\Delta R_p(x)}{\Delta r} + b_{\mathrm{offset}}
+
+Here :math:`\Delta r` is the range-bin spacing and :math:`b_{\mathrm{offset}}` is the centered
+range-bin offset used by the SAR backprojection operator. We can now reformulate this as:
+
+.. math::
+
+   \begin{aligned}
+   b(x) &= \frac{\Delta R_p(x)}{\Delta r} + b_{\mathrm{offset}} \\
+        &= \frac{\Delta R(d) + (R_0 - m_p)}{\Delta r} + b_{\mathrm{offset}} \\
+        &= \frac{\Delta R(d)}{\Delta r} + \frac{R_0 - m_p}{\Delta r} + b_{\mathrm{offset}}
+   \end{aligned}
+
+We can precompute the right-hand terms as:
+
+.. math::
+
+   b_0 = \frac{R_0 - m_p}{\Delta r} + b_{\mathrm{offset}}
+
+and thus:
+
+.. math::
+
+   b(x) = b_0 + \frac{\Delta R(d)}{\Delta r}
+
+The following sections derive an approximation of :math:`\Delta R(d)` using a Taylor series expansion.
+
+Derivation of the Taylor Expansion
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Starting from
+
+.. math::
+
+   R(d) = \sqrt{R_0^2 + 2 R_0 s + \|d\|^2},
+
+factor out :math:`R_0`:
+
+.. math::
+
+   R(d)
+     = R_0
+       \sqrt{1 + \frac{2s}{R_0} + \frac{\|d\|^2}{R_0^2}}.
+
+Let
+
+.. math::
+
+   y = \frac{2s}{R_0} + \frac{\|d\|^2}{R_0^2}.
+
+The scalar Taylor series is applied to the one-variable function
+:math:`f(y) = \sqrt{1 + y}` about :math:`y = 0`:
+
+.. math::
+
+   f(y) = \sqrt{1 + y}
+     = 1 + \frac{y}{2} - \frac{y^2}{8} + \frac{y^3}{16}
+       + O(y^4),
+
+so
+
+.. math::
+
+   R(d)
+     = R_0
+       \left(
+         1 + \frac{y}{2} - \frac{y^2}{8} + \frac{y^3}{16}
+       \right)
+       + O(R_0 y^4).
+
+The order used by ``TaylorFast`` is not the number of scalar Taylor-series
+terms retained in :math:`f(y)`. It is the order in the local pixel offset
+:math:`d`. Recall that :math:`s = u \cdot d`, so :math:`s` is linear in :math:`d`.
+
+The following table shows how the scalar Taylor-series terms contribute to the
+local-offset orders used by the ``TaylorFast`` approximation.
+
+.. list-table::
+   :header-rows: 1
+   :widths: 24 42 34
+
+   * - Scalar term
+     - Contribution
+     - Local-offset order
+   * - :math:`R_0`
+     - :math:`R_0`
+     - Zeroth order. This is the reference range and cancels in :math:`\Delta R(d) = R(d) - R_0`.
+   * - :math:`R_0 y / 2`
+     - :math:`s + \dfrac{s^2 + q}{2R_0}`
+     - First-order term :math:`s`; second-order term :math:`(s^2 + q)/(2R_0)`.
+   * - :math:`-R_0 y^2 / 8`
+     - :math:`-\dfrac{s^2}{2R_0} - \dfrac{s(s^2 + q)}{2R_0^2} + O(d^4)`
+     - Second-order term :math:`-s^2/(2R_0)`; third-order term :math:`-s(s^2 + q)/(2R_0^2)`.
+   * - :math:`R_0 y^3 / 16`
+     - :math:`\dfrac{s^3}{2R_0^2} + O(d^4)`
+     - Third-order term :math:`s^3/(2R_0^2)`.
+
+The first-order local-offset term is therefore
+
+.. math::
+
+   s.
+
+The second-order local-offset terms are
+
+.. math::
+
+   R_0
+   \left(
+     \frac{s^2 + q}{2 R_0^2}
+     - \frac{4s^2}{8 R_0^2}
+   \right)
+   =
+   \frac{q}{2 R_0}.
+
+The third-order local-offset terms are
+
+.. math::
+
+   R_0
+   \left(
+     -\frac{4s(s^2 + q)}{8 R_0^3}
+     + \frac{8s^3}{16 R_0^3}
+   \right)
+   =
+   -\frac{s q}{2 R_0^2}.
+
+Combining terms gives
+
+.. math::
+
+   R(d)
+     =
+     R_0
+     + s
+     + \frac{q}{2 R_0}
+     - \frac{s q}{2 R_0^2}
+     + O\left(\frac{\|d\|^4}{R_0^3}\right).
+
+Equivalently, the local range delta is
+
+.. math::
+
+   \Delta R(d)
+     =
+     s
+     + \frac{q}{2 R_0}
+     - \frac{s q}{2 R_0^2}
+     + O\left(\frac{\|d\|^4}{R_0^3}\right).
+
+
+Second-Order Approximation
+^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+The second-order approximation keeps the linear along-range term and the
+quadratic cross-range correction:
+
+.. math::
+
+   \Delta R(d)
+     =
+     s + \frac{q}{2 R_0}.
+
+We then compute the bin coordinate as
+
+.. math::
+
+   b(d) \approx b_0 + \frac{1}{\Delta r}
+            \left(s + \frac{q}{2R_0}\right).
+
+The leading omitted term is
+
+.. math::
+
+   \epsilon(d)
+     \approx
+     -\frac{s q}{2 R_0^2}.
+
+This is the default ``TaylorFast`` implementation. It is most accurate when the pixel block
+is small relative to the platform stand-off range and when the along-range
+offset :math:`s` remains small.
+
+Third-Order Approximation
+^^^^^^^^^^^^^^^^^^^^^^^^^
+
+The ``PropSarBpTaylorFastAddThirdOrder`` property instantiates a ``TaylorFast``
+kernel variant that also keeps the cubic term:
+
+.. math::
+
+   \Delta R(d)
+     =
+     s
+     + \frac{q}{2 R_0}
+     - \frac{s q}{2 R_0^2}.
+
+The corresponding bin coordinate is
+
+.. math::
+
+   b(d) \approx b_0 + \frac{1}{\Delta r}
+            \left(
+              s
+              + \frac{q}{2R_0}
+              - \frac{s q}{2R_0^2}
+            \right).
+
+With the third-order term included, the leading omitted fourth-order terms are
+
+.. math::
+
+   \epsilon(d)
+     \approx
+     \frac{s^2 q}{2 R_0^3}
+     - \frac{q^2}{8 R_0^3}.
+
+Compared to the second-order form, this requires additional per-pixel arithmetic
+and will thus have correspondingly lower throughput than the second-order form.
+Keeping the third-order term avoids the short-range accuracy loss that occurs when
+the image block is large relative to :math:`R_0` or when the platform is close enough
+that the :math:`s q / R_0^2` term is no longer negligible.
+
+Accuracy Considerations
+^^^^^^^^^^^^^^^^^^^^^^^
+
+The approximation is local to a thread block. Its accuracy depends on the ratio
+between the maximum local pixel offset and :math:`R_0`. For satellite-scale data,
+:math:`R_0` is typically very large compared to the dimensions of a CUDA thread block, so
+the second-order approximation is likely sufficient. It is an assumption that adjacent
+pixels are spatially near one another and thus a compact pixel tile has a relatively small
+spatial extent.
+
+For shorter stand-off ranges, larger image tiles, or geometries where the look direction varies rapidly across
+a thread block, the third-order term becomes more important. The third-order
+property uses a separate kernel instantiation, which avoids a run-time order
+dispatch inside the backprojection kernel.
+
 Examples
 ~~~~~~~~
 
@@ -22,3 +374,16 @@ Examples
    :start-after: example-begin sar-bp-1
    :end-before: example-end sar-bp-1
    :dedent:
+
+The ``PropSarBpTaylorFastAddThirdOrder`` property is a compile-time MatX operator property. The compute type is
+selected separately through ``SarBpParams``; the property only adds the
+third-order term to a ``TaylorFast`` launch:
+
+.. literalinclude:: ../../../../test/00_transform/SarBp.cu
+   :language: cpp
+   :start-after: example-begin sar-bp-2
+   :end-before: example-end sar-bp-2
+   :dedent:
+
+The property only affects ``SarBpComputeType::TaylorFast``. Other compute types
+continue to use their ordinary kernel instantiations.
diff --git a/examples/sarbp/README.md b/examples/sarbp/README.md
index b3d05c1fe..86a5a7a64 100644
--- a/examples/sarbp/README.md
+++ b/examples/sarbp/README.md
@@ -106,7 +106,8 @@ real/imag, row-major), written to `output_image.raw` in this example.
 | `-w {hamming,none}` | Window for range compression (default: hamming) |
 | `-b {auto,all,0,N}` | Pulses per processing block. `auto` uses the GPU L2 cache size to choose a block size; `all` and `0` use all pulses (default: auto) |
 | `--image-tiles N` | Process the image as N x N tiles during backprojection (default: 1) |
-| `--precision {double,float,fltflt,mixed}` | Backprojection compute precision (default: mixed) |
+| `--taylor-fast-third-order` | Add the third-order range term when using `--precision taylor_fast` |
+| `--precision {double,float,fltflt,mixed,taylor_fast}` | Backprojection compute precision (default: mixed) |
 | `--warmup` | Warmup GPU kernels and FFT plans before timed run |
 
 The `--precision` flag controls the arithmetic used by the `sar_bp` operator. For spaceborne SAR, `float` does not provide enough precision to store fractional wavelengths at the range-to-MCP magnitudes (hundreds of km), so pure `float` is not sufficient to produce focused images. The available modes are:
@@ -114,6 +115,7 @@ The `--precision` flag controls the arithmetic used by the `sar_bp` operator. Fo
 - `double` -- full double-precision arithmetic. Most accurate.
 - `mixed` -- double-precision for range computation, single-precision elsewhere. Default. Close to `double` in image quality with slightly higher throughput on GPUs with reduced double-precision throughput. Other than `float`, this is the fastest option on hardware with full-throughput double-precision (e.g., A100, H100/H200, B200).
 - `fltflt` -- float-float evaluation using two `float` values for the high-precision range math. Significantly higher throughput on GPUs where `double` throughput is reduced (e.g., RTX PROs, Jetson Orin/Thor, gaming GPUs).
+- `taylor_fast` -- local Taylor approximation of the pulse-to-pixel range about a centered per-thread-block reference point. Highest-throughput experimental mode for spaceborne SAR geometries where moderate approximation error is acceptable.
 - `float` -- single-precision throughout. Fastest but not accurate enough for most spaceborne data.
 
 ## 6. View the Result
diff --git a/examples/sarbp/sarbp.cu b/examples/sarbp/sarbp.cu
index 10853014e..38067bdd5 100755
--- a/examples/sarbp/sarbp.cu
+++ b/examples/sarbp/sarbp.cu
@@ -226,6 +226,7 @@ struct BpRunCtx {
   bool is_fx_domain;
   bool is_int16_mode;
   bool apply_window;
+  bool taylor_fast_add_third_order;
   int sgn;
   bool do_warmup;
   std::string output_file;
@@ -316,9 +317,13 @@ static int run_bp_device(PosTensor blk_positions, RtmTensor blk_rtm,
 
         auto cur_image = matx::slice(image, {y0, x0}, {y1, x1});
         auto cur_voxel_locations = matx::slice(voxel_locations, {y0, x0}, {y1, x1});
-        (cur_image = matx::experimental::sar_bp(
-            cur_image, cur_profiles, cur_positions, cur_voxel_locations, cur_rtm, ctx.params))
-            .run(ctx.exec);
+        auto bp = matx::experimental::sar_bp(
+            cur_image, cur_profiles, cur_positions, cur_voxel_locations, cur_rtm, ctx.params);
+        if (ctx.taylor_fast_add_third_order) {
+          (cur_image = bp.template props<matx::PropSarBpTaylorFastAddThirdOrder>()).run(ctx.exec);
+        } else {
+          (cur_image = bp).run(ctx.exec);
+        }
       }
     }
   };
@@ -388,11 +393,12 @@ static int run_bp_device(PosTensor blk_positions, RtmTensor blk_rtm,
     std::cout << " done" << std::endl;
   }
 
-  // Pre-allocate pinned host buffer for image output
+  // Pre-allocate pinned host buffer for image output.
   const size_t num_pixels =
       static_cast<size_t>(ctx.image_height) * static_cast<size_t>(ctx.image_width);
+  const size_t image_bytes = num_pixels * sizeof(complex_t);
   complex_t *h_image = nullptr;
-  MATX_CUDA_CHECK(cudaHostAlloc(&h_image, num_pixels * sizeof(complex_t), cudaHostAllocDefault));
+  MATX_CUDA_CHECK(cudaHostAlloc(&h_image, image_bytes, cudaHostAllocDefault));
 
   std::cout << "Running backprojection (" << ctx.output_range_bins << " range bins, del_r="
             << ctx.del_r << " m)..." << std::endl;
@@ -533,8 +539,8 @@ static int run_bp_device(PosTensor blk_positions, RtmTensor blk_rtm,
     }
   }
 
-  // Copy result to pinned host buffer (included in timed region)
-  MATX_CUDA_CHECK(cudaMemcpyAsync(h_image, image.Data(), num_pixels * sizeof(complex_t),
+  // Copy result to host buffer (included in timed region)
+  MATX_CUDA_CHECK(cudaMemcpyAsync(h_image, image.Data(), image_bytes,
              cudaMemcpyDeviceToHost, ctx.stream));
 
   MATX_CUDA_CHECK(cudaEventRecord(ev_stop, ctx.stream));
@@ -582,7 +588,7 @@ static int run_bp_device(PosTensor blk_positions, RtmTensor blk_rtm,
     return 1;
   }
   out.write(reinterpret_cast<const char *>(h_image),
-            static_cast<std::streamsize>(num_pixels * sizeof(complex_t)));
+            static_cast<std::streamsize>(image_bytes));
   out.close();
 
   std::cout << "Wrote " << ctx.image_height << " x " << ctx.image_width
@@ -674,8 +680,10 @@ int main(int argc, char **argv) {
         << "  -b, --block-size <N|0|auto|all>\n"
         << "                          Pulses per block; 0/all use all pulses, auto uses an L2-cache heuristic (default: auto)\n"
         << "  --image-tiles <N>      Process image as N x N tiles (default: 1)\n"
+        << "  --taylor-fast-third-order\n"
+        << "                          Add the third-order term for --precision taylor_fast\n"
         << "  --warmup               Warmup GPU kernels and FFT plans before timed run\n"
-        << "  --precision <type>     Compute precision: double, float, fltflt, mixed (default: mixed)\n"
+        << "  --precision <type>     Compute precision: double, float, fltflt, mixed, taylor_fast (default: mixed)\n"
         << "  -h, --help             Print this help message and exit\n";
   };
 
@@ -691,6 +699,7 @@ int main(int argc, char **argv) {
   std::string block_size_arg = "auto";
   index_t image_tiles = 1;
   bool do_warmup = false;
+  bool taylor_fast_add_third_order = false;
   std::string precision_type = "mixed";
 
   auto needs_value = [&](int i) -> bool {
@@ -728,6 +737,8 @@ int main(int argc, char **argv) {
       }
     } else if (std::strcmp(argv[i], "--warmup") == 0) {
       do_warmup = true;
+    } else if (std::strcmp(argv[i], "--taylor-fast-third-order") == 0) {
+      taylor_fast_add_third_order = true;
     } else if (std::strcmp(argv[i], "--precision") == 0) {
       if (!needs_value(i)) return 1;
       precision_type = argv[++i];
@@ -765,6 +776,12 @@ int main(int argc, char **argv) {
     output_file = (dot != std::string::npos ? input_file.substr(0, dot) : input_file) + ".raw";
   }
 
+  if (taylor_fast_add_third_order && precision_type != "taylor_fast") {
+    std::cerr << "ERROR: --taylor-fast-third-order requires --precision taylor_fast" << std::endl;
+    print_usage();
+    return 1;
+  }
+
   // -------------------------------------------------------------------
   // Read .sarbp file header
   // -------------------------------------------------------------------
@@ -784,6 +801,15 @@ int main(int argc, char **argv) {
   const bool is_int16_mode  = (hdr.flags & 0x2) != 0;
   const int sgn             = hdr.sgn;
 
+  if (num_pulses <= 0 || num_range_bins <= 0 ||
+      image_width <= 0 || image_height <= 0) {
+    std::cerr << "ERROR: invalid .sarbp dimensions: pulses=" << num_pulses
+              << ", samples=" << num_range_bins
+              << ", image=" << image_height << " x " << image_width
+              << std::endl;
+    return 1;
+  }
+
   if (image_tiles > image_width || image_tiles > image_height) {
     std::cerr << "ERROR: --image-tiles " << image_tiles
               << " exceeds image dimensions " << image_height << " x "
@@ -904,7 +930,9 @@ int main(int argc, char **argv) {
   // incompatible types and is undefined behaviour under strict aliasing.
   // unsigned char* may alias any object type, and std::memcpy of
   // trivially-copyable types is the standards-blessed way to bit-cast.
-  if (precision_type == "fltflt") {
+  const bool use_fltflt_platform_inputs =
+      precision_type == "fltflt";
+  if (use_fltflt_platform_inputs) {
     auto *pos_bytes = reinterpret_cast<unsigned char *>(h_positions);
     for (size_t i = 0; i < static_cast<size_t>(num_pulses) * 3; i++) {
       unsigned char *slot = pos_bytes + i * sizeof(double);
@@ -985,12 +1013,16 @@ int main(int argc, char **argv) {
     params.compute_type = SarBpComputeType::FloatFloat;
   } else if (precision_type == "mixed") {
     params.compute_type = SarBpComputeType::Mixed;
+  } else if (precision_type == "taylor_fast") {
+    params.compute_type = SarBpComputeType::TaylorFast;
   } else {
     std::cerr << "ERROR: unknown precision type '" << precision_type
-              << "' (use double, float, fltflt, or mixed)" << std::endl;
+              << "' (use double, float, fltflt, mixed, or taylor_fast)" << std::endl;
     return 1;
   }
-  params.features = SarBpFeature::PhaseLUTOptimization;
+  if (params.compute_type != SarBpComputeType::Double) {
+    params.features = SarBpFeature::PhaseLUTOptimization;
+  }
   params.center_frequency = (sgn >= 0) ? -hdr.center_frequency : hdr.center_frequency;
   params.del_r = del_r;
 
@@ -1034,7 +1066,11 @@ int main(int argc, char **argv) {
               << static_cast<double>(phase_lut_bytes) / (1024.0 * 1024.0)
               << " MiB" << std::endl;
   }
-  std::cout << "BP precision     : " << precision_type << std::endl;
+  std::cout << "BP precision     : " << precision_type;
+  if (precision_type == "taylor_fast") {
+    std::cout << (taylor_fast_add_third_order ? " (third order)" : " (second order)");
+  }
+  std::cout << std::endl;
   std::cout << "Image tiles      : " << image_tiles << " x " << image_tiles
             << std::endl;
 
@@ -1047,11 +1083,11 @@ int main(int argc, char **argv) {
 
   // Bundle non-tensor state for run_bp_device(). The platform-positions and
   // range-to-mcp tensors are allocated below with a type that matches
-  // `precision_type`: `tensor<double3>` / `tensor<double>` for the standard
-  // paths, or `tensor<fltflt>` / `tensor<fltflt>` when
-  // `precision_type == "fltflt"`. In the fltflt case we already converted
-  // the pinned host buffers in-place from double to fltflt right after the
-  // file read above (same byte layout, no extra storage). The kernel
+  // `use_fltflt_platform_inputs`: `tensor<double3>` / `tensor<double>` for the
+  // standard paths, or `tensor<fltflt>` / `tensor<fltflt>` for the FloatFloat
+  // path. In the fltflt case we already converted the pinned host buffers
+  // in-place from double to fltflt right after the file read above (same byte
+  // layout, no extra storage). The kernel
   // template (`SarBp<...>`) already handles both rank-1 (vector-of-double3
   // / -fltflt3-style) and rank-2 (matrix of [pulses, 3]) layouts for
   // platform_positions, so run_bp_device() branches internally only on the
@@ -1073,6 +1109,7 @@ int main(int argc, char **argv) {
       .is_fx_domain          = is_fx_domain,
       .is_int16_mode         = is_int16_mode,
       .apply_window          = apply_window,
+      .taylor_fast_add_third_order = taylor_fast_add_third_order,
       .sgn                   = sgn,
       .do_warmup             = do_warmup,
       .output_file           = output_file,
@@ -1086,7 +1123,7 @@ int main(int argc, char **argv) {
   };
 
   int dev_status;
-  if (precision_type == "fltflt") {
+  if (use_fltflt_platform_inputs) {
     auto blk_positions = make_tensor<matx::fltflt>({block_size, 3}, matx::MATX_DEVICE_MEMORY);
     auto blk_rtm       = make_tensor<matx::fltflt>({block_size},    matx::MATX_DEVICE_MEMORY);
     dev_status = run_bp_device(blk_positions, blk_rtm, voxel_locations, ctx);
@@ -1098,8 +1135,8 @@ int main(int argc, char **argv) {
   if (dev_status != 0) return dev_status;
 
 
-  MATX_CUDA_CHECK(cudaFreeHost(h_positions));
-  MATX_CUDA_CHECK(cudaFreeHost(h_range_to_mcp));
+  if (h_positions) MATX_CUDA_CHECK(cudaFreeHost(h_positions));
+  if (h_range_to_mcp) MATX_CUDA_CHECK(cudaFreeHost(h_range_to_mcp));
   if (h_range_profiles) MATX_CUDA_CHECK(cudaFreeHost(h_range_profiles));
   if (h_range_profiles_i16) MATX_CUDA_CHECK(cudaFreeHost(h_range_profiles_i16));
   if (h_ampsf) MATX_CUDA_CHECK(cudaFreeHost(h_ampsf));
diff --git a/include/matx/kernels/sar_bp.cuh b/include/matx/kernels/sar_bp.cuh
index 62f16f932..00d9b3019 100644
--- a/include/matx/kernels/sar_bp.cuh
+++ b/include/matx/kernels/sar_bp.cuh
@@ -74,7 +74,158 @@ static __device__ __forceinline__ double NewtonRaphsonSqrt(double x) {
     return NR_2;
 }
 
-// ComputeBinToPixelFloatFloat() fuses the FloatFloat path's
+// The base range bin index and interpolation weight. These are bundled together
+// so that they can be returned from a single function call.
+template <typename weight_t>
+struct SarBpBinAndWeight {
+    weight_t w;
+    int32_t bin_floor_int;
+};
+
+template <typename PlatPosAccessor, SarBpComputeType ComputeType, typename strict_compute_t, typename loose_compute_t>
+__device__ inline strict_compute_t ComputeRangeToPixel(const PlatPosAccessor &ant_pos, const index_t pulse_idx, const loose_compute_t px, const loose_compute_t py, const loose_compute_t z0) {
+    using plat_pos_t = typename PlatPosAccessor::value_type;
+    constexpr int Rank = PlatPosAccessor::Rank;
+    constexpr bool IsFloat = ComputeType == SarBpComputeType::Float;
+    constexpr bool IsMixed = ComputeType == SarBpComputeType::Mixed;
+
+    strict_compute_t dx, dy, dz;
+    static_assert((Rank == 1 && (cuda::std::is_same_v<plat_pos_t, double3> || cuda::std::is_same_v<plat_pos_t, double4> ||
+        cuda::std::is_same_v<plat_pos_t, float3> || cuda::std::is_same_v<plat_pos_t, float4>)) || Rank == 2,
+        "ComputeRangeToPixel: plat_pos_t must be a 1D tensor of 3D or 4D vectorized type or a 2D tensor with size 3 (x,y,z) in the second dimension");
+
+    if constexpr (Rank == 1) {
+        const plat_pos_t ant_pos_p = ant_pos(pulse_idx);
+        dx = static_cast<strict_compute_t>(px) - static_cast<strict_compute_t>(ant_pos_p.x);
+        dy = static_cast<strict_compute_t>(py) - static_cast<strict_compute_t>(ant_pos_p.y);
+        dz = static_cast<strict_compute_t>(z0) - static_cast<strict_compute_t>(ant_pos_p.z);
+    } else {
+        dx = static_cast<strict_compute_t>(px) - static_cast<strict_compute_t>(ant_pos(pulse_idx, 0));
+        dy = static_cast<strict_compute_t>(py) - static_cast<strict_compute_t>(ant_pos(pulse_idx, 1));
+        dz = static_cast<strict_compute_t>(z0) - static_cast<strict_compute_t>(ant_pos(pulse_idx, 2));
+    }
+
+    if constexpr (IsFloat) {
+        return ::sqrtf(dx*dx + dy*dy + dz*dz);
+    } else {
+        if constexpr (IsMixed) {
+#if __CUDA_ARCH__ == 700 || __CUDA_ARCH__ == 800 || __CUDA_ARCH__ == 900 || __CUDA_ARCH__ == 1000
+            return ::sqrt(dx*dx + dy*dy + dz*dz);
+#else
+            // Only use the Newton-Raphson approach on systems with reduced FP64 throughput
+            return NewtonRaphsonSqrt(dx*dx + dy*dy + dz*dz);
+#endif
+        } else {
+            return ::sqrt(dx*dx + dy*dy + dz*dz);
+        }
+    }
+}
+
+template <typename ComputeType, typename StorageType>
+__global__ void SarBpFillPhaseLUT(cuda::std::complex<StorageType> *phase_lut, ComputeType ref_freq, ComputeType dr, index_t num_range_bins)
+{
+    const int tid = blockIdx.x * blockDim.x + threadIdx.x;
+    if (tid >= num_range_bins) return;
+    constexpr ComputeType four_pi_over_c = static_cast<ComputeType>(4.0 * M_PI / SPEED_OF_LIGHT);
+    const ComputeType range_bin_start = static_cast<ComputeType>(tid - 0.5 * (num_range_bins-1)) * dr;
+    const ComputeType phase = four_pi_over_c * ref_freq * range_bin_start;
+    if constexpr (cuda::std::is_same_v<ComputeType, float>) {
+        ComputeType sinx, cosx;
+        ::sincosf(phase, &sinx, &cosx);
+        phase_lut[tid] = cuda::std::complex<StorageType>(
+            static_cast<StorageType>(cosx), static_cast<StorageType>(sinx));
+    } else {
+        ComputeType sinx, cosx;
+        ::sincos(phase, &sinx, &cosx);
+        phase_lut[tid] = cuda::std::complex<StorageType>(
+            static_cast<StorageType>(cosx), static_cast<StorageType>(sinx));
+    }
+}
+
+// Template alias for the strict compute parameter type used in SarBp kernel
+template <SarBpComputeType ComputeType>
+using strict_compute_param_t = typename std::conditional<
+    ComputeType == SarBpComputeType::Double ||
+    ComputeType == SarBpComputeType::Mixed ||
+    ComputeType == SarBpComputeType::FloatFloat ||
+    ComputeType == SarBpComputeType::TaylorFast,
+    double, float>::type;
+
+template <SarBpComputeType ComputeType>
+using strict_or_ff_compute_param_t = typename std::conditional<
+    ComputeType == SarBpComputeType::FloatFloat,
+    fltflt, strict_compute_param_t<ComputeType>>::type;
+
+template <SarBpComputeType ComputeType>
+using sarbp_strict_compute_t = typename std::conditional<
+    ComputeType == SarBpComputeType::Double ||
+    ComputeType == SarBpComputeType::Mixed ||
+    ComputeType == SarBpComputeType::TaylorFast,
+    double, float>::type;
+
+template <SarBpComputeType ComputeType>
+using loose_compute_param_t = typename std::conditional<
+    ComputeType == SarBpComputeType::Double,
+    double, float>::type;
+
+// Shared-memory layout for per-pulse-block cached state. Some compute modes
+// cache per-pulse antenna data or Taylor reference values once per pulse block,
+// amortizing global FP64 loads and FP64->compute-type casts across all threads.
+template <SarBpComputeType ComputeType, bool CacheEnabled>
+struct SarBpPulseBlockCache {};
+
+template <bool CacheEnabled>
+struct SarBpPulseBlockCache<SarBpComputeType::FloatFloat, CacheEnabled> {
+    fltflt ant_pos[PULSE_BLOCK_SIZE][4];
+};
+
+template <>
+struct SarBpPulseBlockCache<SarBpComputeType::Float, true> {
+    // Slots: [0..2] = ant_pos.x/y/z as float, [3] = bin_offset - rtm*dr_inv
+    float ant_pos[PULSE_BLOCK_SIZE][4];
+};
+
+template <>
+struct SarBpPulseBlockCache<SarBpComputeType::Mixed, true> {
+    // Same layout as Float, but with double elements so strict_compute_t (double)
+    // arithmetic in the inner loop reads directly without up-casting.
+    double ant_pos[PULSE_BLOCK_SIZE][4];
+};
+
+struct SarBpTaylorFastSharedMemory {
+    int32_t ref_bin_int[PULSE_BLOCK_SIZE];
+    float  ref_bin_frac[PULSE_BLOCK_SIZE];
+    float  ux          [PULSE_BLOCK_SIZE];
+    float  uy          [PULSE_BLOCK_SIZE];
+    float  uz          [PULSE_BLOCK_SIZE];
+    float  half_inv_R  [PULSE_BLOCK_SIZE];
+    double ref_x;
+    double ref_y;
+    double ref_z;
+};
+
+template <>
+struct SarBpPulseBlockCache<SarBpComputeType::TaylorFast, true> :
+    SarBpTaylorFastSharedMemory {};
+
+template <SarBpComputeType ComputeType, bool PhaseLUT>
+struct SarBpThreadState {};
+
+template <SarBpComputeType ComputeType>
+struct SarBpThreadState<ComputeType, false> {
+    sarbp_strict_compute_t<ComputeType> range_delta;
+};
+
+template <>
+struct SarBpThreadState<SarBpComputeType::TaylorFast, true> {
+    float ref_dx = 0.0f;
+    float ref_dy = 0.0f;
+    float ref_dz = 0.0f;
+    float ref_dsq = 0.0f;
+    float dr_inv_f32 = 0.0f;
+};
+
+// ComputeBinWeightToPixelFloatFloat() fuses the FloatFloat path's
 // coordinate-difference, squared-norm, sqrt, and scale/add steps that would
 // otherwise be a separate norm3d + fma_approx pair. The kernel only needs the
 // range bin (FloatFloat requires PhaseLUT, so the bare range R is never used
@@ -104,12 +255,17 @@ static __device__ __forceinline__ double NewtonRaphsonSqrt(double x) {
 // Domain: at least one of dx/dy/dz must not fully cancel (sum of squared
 // distances must be strictly positive) -- three-way simultaneous
 // cancellation is not a meaningful SAR geometry and is not supported.
-__device__ inline fltflt ComputeBinToPixelFloatFloat(
-    fltflt apx, fltflt apy, fltflt apz,
+__device__ inline SarBpBinAndWeight<float> ComputeBinWeightToPixelFloatFloat(
+    const SarBpPulseBlockCache<SarBpComputeType::FloatFloat, true> &sh_mem,
+    index_t ip,
     float px, float py, float pz,
-    fltflt dr_inv,
-    fltflt mcp_partial)
+    fltflt dr_inv)
 {
+    const fltflt apx = sh_mem.ant_pos[ip][0];
+    const fltflt apy = sh_mem.ant_pos[ip][1];
+    const fltflt apz = sh_mem.ant_pos[ip][2];
+    const fltflt mcp_partial = sh_mem.ant_pos[ip][3];
+
     // Stage 1: loose dx/dy/dz. We use the general fltflt_two_sum (6 fp32 ops)
     // for all three dimensions. When it is known that one coordinate is
     // always greater than the other (e.g., |apz.hi| >= |pz|), the faster
@@ -176,104 +332,100 @@ __device__ inline fltflt ComputeBinToPixelFloatFloat(
     // non-canonical sqrt result. fma_approx tolerates this because its only
     // dropped term is a.lo*b.lo = r_lo * dr_inv.lo, which is O(ULP^2 * R)
     // regardless of whether (yn, r_lo) has been canonicalized.
-    return fltflt_fma_approx(fltflt{yn, r_lo}, dr_inv, mcp_partial);
+    const fltflt bin = fltflt_fma_approx(fltflt{yn, r_lo}, dr_inv, mcp_partial);
+    float floor_hi = ::floorf(bin.hi);
+    float frac = (bin.hi - floor_hi) + bin.lo;
+    // bin.lo may push bin over a boundary, in which case floor and frac are incorrect.
+    // Compute an adjustment based on whether or not the fractional part is outside (0.0, 1.0).
+    const float adjust = ::floorf(frac);  // -1, 0, or 1
+    return SarBpBinAndWeight<float>{
+        frac - adjust,
+        static_cast<int32_t>(floor_hi + adjust)
+    };
 }
 
-template <typename PlatPosAccessor, SarBpComputeType ComputeType, typename strict_compute_t, typename loose_compute_t>
-__device__ inline strict_compute_t ComputeRangeToPixel(const PlatPosAccessor &ant_pos, const index_t pulse_idx, const loose_compute_t px, const loose_compute_t py, const loose_compute_t z0) {
-    using plat_pos_t = typename PlatPosAccessor::value_type;
-    constexpr int Rank = PlatPosAccessor::Rank;
-
-    strict_compute_t dx, dy, dz;
-    static_assert((Rank == 1 && (cuda::std::is_same_v<plat_pos_t, double3> || cuda::std::is_same_v<plat_pos_t, double4> ||
-        cuda::std::is_same_v<plat_pos_t, float3> || cuda::std::is_same_v<plat_pos_t, float4>)) || Rank == 2,
-        "ComputeRangeToPixel: plat_pos_t must be a 1D tensor of 3D or 4D vectorized type or a 2D tensor with size 3 (x,y,z) in the second dimension");
-
-    if constexpr (Rank == 1) {
-        const plat_pos_t ant_pos_p = ant_pos(pulse_idx);
-        dx = static_cast<strict_compute_t>(px) - static_cast<strict_compute_t>(ant_pos_p.x);
-        dy = static_cast<strict_compute_t>(py) - static_cast<strict_compute_t>(ant_pos_p.y);
-        dz = static_cast<strict_compute_t>(z0) - static_cast<strict_compute_t>(ant_pos_p.z);
-    } else {
-        dx = static_cast<strict_compute_t>(px) - static_cast<strict_compute_t>(ant_pos(pulse_idx, 0));
-        dy = static_cast<strict_compute_t>(py) - static_cast<strict_compute_t>(ant_pos(pulse_idx, 1));
-        dz = static_cast<strict_compute_t>(z0) - static_cast<strict_compute_t>(ant_pos(pulse_idx, 2));
+template <bool TaylorFastAddThirdOrder = false>
+__device__ inline SarBpBinAndWeight<float> ComputeBinWeightToPixelTaylorFast(
+    const SarBpPulseBlockCache<SarBpComputeType::TaylorFast, true> &sh_mem,
+    const SarBpThreadState<SarBpComputeType::TaylorFast, true> &thread_state,
+    index_t ip)
+{
+    // In the mathematical derivation in the documentation, we defined s = dot(u, d) as
+    // the dot product of the along-range unit vector u=(ux, uy, uz) with the local offset d,
+    // and q as the squared perpendicular distance from the local pixel
+    // to the pulse-to-reference-pixel line. These calculations then map directly to the
+    // derivation. We split the reference bin b_0 from the writeup into ref_bin_int and
+    // ref_bin_frac to preserve the fractional interpolation weight: otherwise,
+    // b_0 can be large enough that storing it as a float loses low-order bin
+    // precision. The inner loop computes only ref_bin_frac + delta_bin in fp32;
+    // the integer component is added back after floorf(), relying on the exact
+    // identity floor(n + x) = n + floor(x) for integer n while avoiding the
+    // lossy fp32 computation of n + x. ref_bin_int and ref_bin_frac
+    // are per-pulse values that were populated in FillPulseBlockCache.
+    const float s = sh_mem.ux[ip] * thread_state.ref_dx +
+                    sh_mem.uy[ip] * thread_state.ref_dy +
+                    sh_mem.uz[ip] * thread_state.ref_dz;
+    const float q = fmaf(-s, s, thread_state.ref_dsq);
+    const float half_inv_R = sh_mem.half_inv_R[ip];
+    const float dR_2nd = fmaf(q, half_inv_R, s);
+    float dR = dR_2nd;
+    if constexpr (TaylorFastAddThirdOrder) {
+        dR = fmaf(-2.0f * half_inv_R * half_inv_R * s, q, dR_2nd);
     }
+    const float bin_loc = fmaf(dR, thread_state.dr_inv_f32, sh_mem.ref_bin_frac[ip]);
+    const float bin_floor = ::floorf(bin_loc);
+    return SarBpBinAndWeight<float>{
+        bin_loc - bin_floor,
+        sh_mem.ref_bin_int[ip] + static_cast<int32_t>(bin_floor)
+    };
+}
 
-    if constexpr (ComputeType == SarBpComputeType::Float) {
-        return ::sqrtf(dx*dx + dy*dy + dz*dz);
+template <SarBpComputeType ComputeType, typename strict_compute_t, typename loose_compute_t>
+__device__ inline SarBpBinAndWeight<loose_compute_t> ComputeBinWeightToPixelPulseBlockCache(
+    const SarBpPulseBlockCache<ComputeType, true> &sh_mem,
+    index_t ip,
+    loose_compute_t px,
+    loose_compute_t py,
+    loose_compute_t pz,
+    strict_compute_t dr_inv)
+{
+    constexpr bool IsFloat = ComputeType == SarBpComputeType::Float;
+    constexpr bool IsMixed = ComputeType == SarBpComputeType::Mixed;
+    static_assert(IsFloat || IsMixed, "SarBp: pulse-block cache bin helper only supports Float and Mixed");
+
+    const strict_compute_t apx = sh_mem.ant_pos[ip][0];
+    const strict_compute_t apy = sh_mem.ant_pos[ip][1];
+    const strict_compute_t apz = sh_mem.ant_pos[ip][2];
+    const strict_compute_t mcp_partial = sh_mem.ant_pos[ip][3];
+    const strict_compute_t dx = static_cast<strict_compute_t>(px) - apx;
+    const strict_compute_t dy = static_cast<strict_compute_t>(py) - apy;
+    const strict_compute_t dz = static_cast<strict_compute_t>(pz) - apz;
+    strict_compute_t dist;
+    if constexpr (IsFloat) {
+        dist = ::sqrtf(dx*dx + dy*dy + dz*dz);
     } else {
-        if constexpr (ComputeType == SarBpComputeType::Mixed) {
 #if __CUDA_ARCH__ == 700 || __CUDA_ARCH__ == 800 || __CUDA_ARCH__ == 900 || __CUDA_ARCH__ == 1000
-            return ::sqrt(dx*dx + dy*dy + dz*dz);
+        dist = ::sqrt(dx*dx + dy*dy + dz*dz);
 #else
-            // Only use the Newton-Raphson approach on systems with reduced FP64 throughput
-            return NewtonRaphsonSqrt(dx*dx + dy*dy + dz*dz);
+        dist = NewtonRaphsonSqrt(dx*dx + dy*dy + dz*dz);
 #endif
-        } else {
-            return ::sqrt(dx*dx + dy*dy + dz*dz);
-        }
     }
-}
 
-template <typename ComputeType, typename StorageType>
-__global__ void SarBpFillPhaseLUT(cuda::std::complex<StorageType> *phase_lut, ComputeType ref_freq, ComputeType dr, index_t num_range_bins)
-{
-    const int tid = blockIdx.x * blockDim.x + threadIdx.x;
-    if (tid >= num_range_bins) return;
-    constexpr ComputeType four_pi_over_c = static_cast<ComputeType>(4.0 * M_PI / SPEED_OF_LIGHT);
-    const ComputeType range_bin_start = static_cast<ComputeType>(tid - 0.5 * (num_range_bins-1)) * dr;
-    const ComputeType phase = four_pi_over_c * ref_freq * range_bin_start;
-    if constexpr (cuda::std::is_same_v<ComputeType, float>) {
-        ComputeType sinx, cosx;
-        ::sincosf(phase, &sinx, &cosx);
-        phase_lut[tid] = cuda::std::complex<StorageType>(
-            static_cast<StorageType>(cosx), static_cast<StorageType>(sinx));
+    // bin = (dist - rtm)*dr_inv + bin_offset = dist*dr_inv + mcp_partial
+    const strict_compute_t bin = dist * dr_inv + mcp_partial;
+    strict_compute_t bin_floor;
+    if constexpr (cuda::std::is_same_v<strict_compute_t, double>) {
+        bin_floor = ::floor(bin);
     } else {
-        ComputeType sinx, cosx;
-        ::sincos(phase, &sinx, &cosx);
-        phase_lut[tid] = cuda::std::complex<StorageType>(
-            static_cast<StorageType>(cosx), static_cast<StorageType>(sinx));
+        bin_floor = ::floorf(bin);
     }
+    return SarBpBinAndWeight<loose_compute_t>{
+        static_cast<loose_compute_t>(bin - bin_floor),
+        static_cast<int32_t>(bin_floor)
+    };
 }
 
-// Template alias for the strict compute parameter type used in SarBp kernel
-template <SarBpComputeType ComputeType>
-using strict_compute_param_t = typename std::conditional<ComputeType == SarBpComputeType::Double || ComputeType == SarBpComputeType::Mixed || ComputeType == SarBpComputeType::FloatFloat, double, float>::type;
-
-template <SarBpComputeType ComputeType>
-using strict_or_ff_compute_param_t = typename std::conditional<ComputeType == SarBpComputeType::FloatFloat, fltflt, strict_compute_param_t<ComputeType>>::type;
-
-template <SarBpComputeType ComputeType>
-using loose_compute_param_t = typename std::conditional<ComputeType == SarBpComputeType::Double, double, float>::type;
-
-// Shared-memory layout is parameterized on whether the kernel runs the
-// cooperative preamble (see UseSharedPreamble in SarBp). The preamble loads
-// per-pulse antenna positions and pre-computes mcp_partial = bin_offset
-// - rtm*dr_inv once per pulse block, amortizing the global FP64 loads and
-// FP64->compute-type casts across all threads.
-template <SarBpComputeType ComputeType, bool PreambleEnabled>
-struct SarBpSharedMemory {};
-
-template <bool PreambleEnabled>
-struct SarBpSharedMemory<SarBpComputeType::FloatFloat, PreambleEnabled> {
-    fltflt ant_pos[PULSE_BLOCK_SIZE][4];
-};
-
-template <>
-struct SarBpSharedMemory<SarBpComputeType::Float, true> {
-    // Slots: [0..2] = ant_pos.x/y/z as float, [3] = bin_offset - rtm*dr_inv
-    float ant_pos[PULSE_BLOCK_SIZE][4];
-};
-
-template <>
-struct SarBpSharedMemory<SarBpComputeType::Mixed, true> {
-    // Same layout as Float, but with double elements so strict_compute_t (double)
-    // arithmetic in the inner loop reads directly without up-casting.
-    double ant_pos[PULSE_BLOCK_SIZE][4];
-};
-
-template <SarBpComputeType ComputeType, typename OutImageType, typename InitialImageType, typename RangeProfilesType, typename PlatPosType, typename VoxLocType, typename RangeToMcpType, bool PhaseLUT, bool IsUnitStride>
+template <SarBpComputeType ComputeType, typename OutImageType, typename InitialImageType, typename RangeProfilesType, typename PlatPosType, typename VoxLocType, typename RangeToMcpType, bool PhaseLUT, bool IsUnitStride, bool TaylorFastAddThirdOrder = false>
 __launch_bounds__(16*16)
 __global__ void SarBp(OutImageType output, const InitialImageType initial_image, const __grid_constant__ RangeProfilesType range_profiles, const __grid_constant__ PlatPosType platform_positions, const __grid_constant__ VoxLocType voxel_locations, const __grid_constant__ RangeToMcpType range_to_mcp,
                       strict_or_ff_compute_param_t<ComputeType> dr_inv,
@@ -298,16 +450,22 @@ __global__ void SarBp(OutImageType output, const InitialImageType initial_image,
                   cuda::std::is_same_v<typename RangeToMcpType::value_type, fltflt>,
         "RangeToMcpType::value_type must be float, double, or fltflt");
 
+    constexpr bool IsDouble = ComputeType == SarBpComputeType::Double;
+    constexpr bool IsMixed = ComputeType == SarBpComputeType::Mixed;
+    constexpr bool IsFloatFloat = ComputeType == SarBpComputeType::FloatFloat;
+    constexpr bool IsFloat = ComputeType == SarBpComputeType::Float;
+    constexpr bool IsTaylorFast = ComputeType == SarBpComputeType::TaylorFast;
+
     using initial_image_t = typename InitialImageType::value_type;
     using image_t = typename OutImageType::value_type;
     using range_profiles_t = typename RangeProfilesType::value_type;
     using plat_pos_t = typename PlatPosType::value_type;
     using voxel_loc_t = typename VoxLocType::value_type;
-    using compute_t = typename std::conditional<ComputeType == SarBpComputeType::Double, double, float>::type;
-    using strict_compute_t = typename std::conditional<ComputeType == SarBpComputeType::Double || ComputeType == SarBpComputeType::Mixed, double, float>::type;
-    using strict_or_ff_compute_t = typename std::conditional<ComputeType == SarBpComputeType::FloatFloat, fltflt, strict_compute_t>::type;
+    using compute_t = typename std::conditional<IsDouble, double, float>::type;
+    using strict_compute_t = sarbp_strict_compute_t<ComputeType>;
+    using strict_or_ff_compute_t = typename std::conditional<IsFloatFloat, fltflt, strict_compute_t>::type;
     using strict_complex_compute_t = cuda::std::complex<strict_compute_t>;
-    using loose_compute_t = typename std::conditional<ComputeType == SarBpComputeType::Double, double, float>::type;
+    using loose_compute_t = typename std::conditional<IsDouble, double, float>::type;
     using loose_complex_compute_t = cuda::std::complex<loose_compute_t>;
 
     const index_t image_height = output.Size(0);
@@ -326,21 +484,20 @@ __global__ void SarBp(OutImageType output, const InitialImageType initial_image,
                       "SarBp IsUnitStride path requires platform_positions to be a tensor view");
     }
 
-    // The cooperative shared-memory preamble is required for FloatFloat (its
-    // inner loop reads directly from shared memory) and enabled for Float and
-    // Mixed when PhaseLUT is on.  With PhaseLUT=false the inner loop still
-    // needs diffR = dist - rtm for sincos, which the preamble does not
-    // preserve, so fall back to direct global reads in that case.
-    constexpr bool UseSharedPreamble =
-        (ComputeType == SarBpComputeType::FloatFloat) ||
-        ((ComputeType == SarBpComputeType::Float ||
-          ComputeType == SarBpComputeType::Mixed) && PhaseLUT);
+    // Some compute modes cache per-pulse data in shared memory once per pulse
+    // block. With PhaseLUT=false, Float/Mixed still need the range-to-MCP
+    // delta for sincos, which the cache does not preserve, so fall back to
+    // direct global reads in that case.
+    constexpr bool UsePulseBlockCache =
+        IsFloatFloat ||
+        IsTaylorFast ||
+        ((IsFloat || IsMixed) && PhaseLUT);
 
-    __shared__ SarBpSharedMemory<ComputeType, UseSharedPreamble> sh_mem;
+    __shared__ SarBpPulseBlockCache<ComputeType, UsePulseBlockCache> sh_mem;
 
     const bool is_valid = ix < image_width && iy < image_height;
-    if constexpr (!UseSharedPreamble) {
-        // When the preamble is enabled, keep all threads active so they can
+    if constexpr (!UsePulseBlockCache) {
+        // When the cache is enabled, keep all threads active so they can
         // participate in the CTA-wide cooperative loads of antenna positions.
         if (! is_valid) return;
     }
@@ -381,32 +538,31 @@ __global__ void SarBp(OutImageType output, const InitialImageType initial_image,
     };
 
     [[maybe_unused]] const loose_compute_t phase_correction_partial_loose = static_cast<loose_compute_t>(phase_correction_partial);
-    const auto get_reference_phase = [&phase_lut, &phase_correction_partial, &phase_correction_partial_loose](strict_or_ff_compute_t diffR, int32_t bin_floor_int, loose_compute_t w) -> loose_complex_compute_t {
-        if constexpr (PhaseLUT) {
-            const loose_complex_compute_t base_phase = phase_lut[bin_floor_int];
-            loose_compute_t incr_sinx, incr_cosx;
-            if constexpr (cuda::std::is_same_v<loose_compute_t, double>) {
-                ::sincos(phase_correction_partial_loose * w, &incr_sinx, &incr_cosx);
-            } else {
-                __sincosf(phase_correction_partial_loose * w, &incr_sinx, &incr_cosx);
-            }
+    const auto get_reference_phase_lut = [&phase_lut, &phase_correction_partial_loose](int32_t bin_floor_int, loose_compute_t w) -> loose_complex_compute_t {
+        const loose_complex_compute_t base_phase = phase_lut[bin_floor_int];
+        loose_compute_t incr_sinx, incr_cosx;
+        if constexpr (cuda::std::is_same_v<loose_compute_t, double>) {
+            ::sincos(phase_correction_partial_loose * w, &incr_sinx, &incr_cosx);
+        } else {
+            __sincosf(phase_correction_partial_loose * w, &incr_sinx, &incr_cosx);
+        }
+
+        return loose_complex_compute_t{
+            base_phase.real() * incr_cosx - base_phase.imag() * incr_sinx,
+            base_phase.real() * incr_sinx + base_phase.imag() * incr_cosx
+        };
+    };
 
-            return loose_complex_compute_t{
-                base_phase.real() * incr_cosx - base_phase.imag() * incr_sinx,
-                base_phase.real() * incr_sinx + base_phase.imag() * incr_cosx
-            };
+    const auto get_reference_phase_direct = [&phase_correction_partial](strict_compute_t range_delta) -> loose_complex_compute_t {
+        strict_compute_t sinx, cosx;
+        if constexpr (cuda::std::is_same_v<strict_compute_t, double>) {
+            ::sincos(phase_correction_partial * range_delta, &sinx, &cosx);
         } else {
-            // With PhaseLUT == false, strict_or_ff_compute_t is either float or double, so we can use sincos[f] directly.
-            strict_or_ff_compute_t sinx, cosx;
-            if constexpr (cuda::std::is_same_v<strict_compute_t, double>) {
-                ::sincos(phase_correction_partial * diffR, &sinx, &cosx);
-            } else {
-                ::sincosf(phase_correction_partial * diffR, &sinx, &cosx);
-            }
-            return loose_complex_compute_t{
-                static_cast<loose_compute_t>(cosx), static_cast<loose_compute_t>(sinx)
-            };
+            ::sincosf(phase_correction_partial * range_delta, &sinx, &cosx);
         }
+        return loose_complex_compute_t{
+            static_cast<loose_compute_t>(cosx), static_cast<loose_compute_t>(sinx)
+        };
     };
 
     // Explicitly use FMA instructions for pixel accumulations
@@ -431,17 +587,47 @@ __global__ void SarBp(OutImageType output, const InitialImageType initial_image,
 
     [[maybe_unused]] const int tid = threadIdx.x + threadIdx.y * blockDim.x;
 
+    SarBpThreadState<ComputeType, PhaseLUT> thread_state{};
+    if constexpr (IsTaylorFast) {
+        static_assert(PhaseLUT == true, "SarBp: TaylorFast compute type requires PhaseLUT optimization");
+        if constexpr (PhaseLUT) {
+            thread_state.dr_inv_f32 = static_cast<float>(dr_inv);
+            if (tid == 0) {
+                // Use the midpoint of the CTA's valid image footprint. For
+                // partial right/bottom edge CTAs, clamping the nominal CTA
+                // center to image_width-1/image_height-1 would choose an edge
+                // or corner reference and unnecessarily increase local offsets.
+                const index_t block_x0 = static_cast<index_t>(blockIdx.x * blockDim.x);
+                const index_t block_y0 = static_cast<index_t>(blockIdx.y * blockDim.y);
+                const index_t valid_width = ((image_width - block_x0) < static_cast<index_t>(blockDim.x)) ?
+                    (image_width - block_x0) : static_cast<index_t>(blockDim.x);
+                const index_t valid_height = ((image_height - block_y0) < static_cast<index_t>(blockDim.y)) ?
+                    (image_height - block_y0) : static_cast<index_t>(blockDim.y);
+                const index_t ax_c = block_x0 + valid_width / 2;
+                const index_t ay_c = block_y0 + valid_height / 2;
+                const voxel_loc_t ref_loc = voxel_locations(ay_c, ax_c);
+                sh_mem.ref_x = static_cast<strict_compute_t>(ref_loc.x);
+                sh_mem.ref_y = static_cast<strict_compute_t>(ref_loc.y);
+                sh_mem.ref_z = static_cast<strict_compute_t>(ref_loc.z);
+            }
+            __syncthreads();
+
+            thread_state.ref_dx = is_valid ? static_cast<float>(static_cast<strict_compute_t>(voxel_loc.x) - sh_mem.ref_x) : 0.0f;
+            thread_state.ref_dy = is_valid ? static_cast<float>(static_cast<strict_compute_t>(voxel_loc.y) - sh_mem.ref_y) : 0.0f;
+            thread_state.ref_dz = is_valid ? static_cast<float>(static_cast<strict_compute_t>(voxel_loc.z) - sh_mem.ref_z) : 0.0f;
+            thread_state.ref_dsq = fmaf(thread_state.ref_dx, thread_state.ref_dx,
+                                        fmaf(thread_state.ref_dy, thread_state.ref_dy,
+                                             thread_state.ref_dz * thread_state.ref_dz));
+        }
+    }
+
     loose_complex_compute_t accum{};
     const loose_compute_t bin_offset = static_cast<loose_compute_t>(0.5) * static_cast<loose_compute_t>(num_range_bins-1);
 
-    const int num_pulse_blocks = static_cast<int>(
-        (num_pulses + static_cast<index_t>(PULSE_BLOCK_SIZE) - 1) / static_cast<index_t>(PULSE_BLOCK_SIZE));
-    for (int block = 0; block < num_pulse_blocks; ++block) {
-        const index_t pulse_base = static_cast<index_t>(block) * static_cast<index_t>(PULSE_BLOCK_SIZE);
-        const index_t pulses_remaining = num_pulses - pulse_base;
-        const index_t num_pulses_in_block =
-            (pulses_remaining < static_cast<index_t>(PULSE_BLOCK_SIZE)) ? pulses_remaining : static_cast<index_t>(PULSE_BLOCK_SIZE);
-        if constexpr (UseSharedPreamble) {
+    // Most ComputeTypes amortize some redundant computations or data conversions by leveraging
+    // per-pulse-block shared memory.
+    const auto FillPulseBlockCache = [&](index_t pulse_base, index_t num_pulses_in_block) {
+        if constexpr (UsePulseBlockCache) {
             __syncthreads();
             for (index_t ip = tid; ip < num_pulses_in_block; ip += blockDim.x * blockDim.y) {
                 const index_t p = pulse_base + ip;
@@ -457,13 +643,36 @@ __global__ void SarBp(OutImageType output, const InitialImageType initial_image,
                 };
                 const auto xyz = load_xyz();
 
-                if constexpr (ComputeType == SarBpComputeType::FloatFloat) {
+                if constexpr (IsFloatFloat) {
                     sh_mem.ant_pos[ip][0] = static_cast<fltflt>(cuda::std::get<0>(xyz));
                     sh_mem.ant_pos[ip][1] = static_cast<fltflt>(cuda::std::get<1>(xyz));
                     sh_mem.ant_pos[ip][2] = static_cast<fltflt>(cuda::std::get<2>(xyz));
                     const fltflt rtm = static_cast<fltflt>(r_to_mcp(p));
                     const fltflt neg_rtm = fltflt{-rtm.hi, -rtm.lo};
                     sh_mem.ant_pos[ip][3] = fltflt_fma_approx(neg_rtm, dr_inv, bin_offset);
+                } else if constexpr (IsTaylorFast) {
+                    const strict_compute_t pcx = static_cast<strict_compute_t>(cuda::std::get<0>(xyz));
+                    const strict_compute_t pcy = static_cast<strict_compute_t>(cuda::std::get<1>(xyz));
+                    const strict_compute_t pcz = static_cast<strict_compute_t>(cuda::std::get<2>(xyz));
+                    const strict_compute_t rtm = static_cast<strict_compute_t>(r_to_mcp(p));
+
+                    const strict_compute_t dxa = sh_mem.ref_x - pcx;
+                    const strict_compute_t dya = sh_mem.ref_y - pcy;
+                    const strict_compute_t dza = sh_mem.ref_z - pcz;
+                    // We assume double for TaylorFast strict_compute_t and thus use ::sqrt()/::floor() here
+                    static_assert(cuda::std::is_same_v<strict_compute_t, double>, "SarBp: TaylorFast compute type requires double precision");
+                    const strict_compute_t ref_range_pure = ::sqrt(dxa*dxa + dya*dya + dza*dza);
+                    const strict_compute_t ref_range = ref_range_pure - rtm;
+                    const strict_compute_t ref_bin = ref_range * dr_inv + static_cast<strict_compute_t>(bin_offset);
+                    const strict_compute_t bin_floor = ::floor(ref_bin);
+                    const strict_compute_t inv_R_d = static_cast<strict_compute_t>(1.0) / ref_range_pure;
+
+                    sh_mem.ref_bin_int [ip] = static_cast<int32_t>(bin_floor);
+                    sh_mem.ref_bin_frac[ip] = static_cast<float>(ref_bin - bin_floor);
+                    sh_mem.ux          [ip] = static_cast<float>(dxa * inv_R_d);
+                    sh_mem.uy          [ip] = static_cast<float>(dya * inv_R_d);
+                    sh_mem.uz          [ip] = static_cast<float>(dza * inv_R_d);
+                    sh_mem.half_inv_R  [ip] = static_cast<float>(static_cast<strict_compute_t>(0.5) * inv_R_d);
                 } else {
                     // Float / Mixed: cast inputs to strict_compute_t (float / double)
                     // once per pulse here, instead of once per pulse per pixel.
@@ -479,111 +688,99 @@ __global__ void SarBp(OutImageType output, const InitialImageType initial_image,
                 }
             }
             __syncthreads();
-            if (! is_valid) {
-                continue;
-            }
         }
+    };
+
+    const int num_pulse_blocks = static_cast<int>(
+        (num_pulses + static_cast<index_t>(PULSE_BLOCK_SIZE) - 1) / static_cast<index_t>(PULSE_BLOCK_SIZE));
+    for (int block = 0; block < num_pulse_blocks; ++block) {
+        const index_t pulse_base = static_cast<index_t>(block) * static_cast<index_t>(PULSE_BLOCK_SIZE);
+        const index_t pulses_remaining = num_pulses - pulse_base;
+        const index_t num_pulses_in_block =
+            (pulses_remaining < static_cast<index_t>(PULSE_BLOCK_SIZE)) ? pulses_remaining : static_cast<index_t>(PULSE_BLOCK_SIZE);
+
+        FillPulseBlockCache(pulse_base, num_pulses_in_block);
+        if (! is_valid) {
+            continue;
+        }
+
         #pragma unroll 4
         for (index_t ip = 0; ip < num_pulses_in_block; ++ip) {
             const index_t p = pulse_base + ip;
-            strict_or_ff_compute_t diffR;
-            loose_compute_t w;
-            int32_t bin_floor_int;
-            if constexpr (ComputeType == SarBpComputeType::FloatFloat) {
-                // ComputeBinToPixelFloatFloat fuses the coordinate-difference,
-                // squared-norm, sqrt, and (R-mcp)*dr_inv + bin_offset chain into
-                // one helper that never materializes a canonical R. The
-                // shared-memory slot sh_mem.ant_pos[ip][3] holds
-                // -mcp*dr_inv + bin_offset, precomputed in the pulse-block
-                // preamble.
-                const fltflt bin = ComputeBinToPixelFloatFloat(
-                    sh_mem.ant_pos[ip][0], sh_mem.ant_pos[ip][1], sh_mem.ant_pos[ip][2],
-                    px, py, pz, dr_inv, sh_mem.ant_pos[ip][3]);
-                diffR = bin;  // unused below (FloatFloat requires PhaseLUT); assign to silence maybe-uninitialized warning
-                float floor_hi = ::floorf(bin.hi);
-                float frac = (bin.hi - floor_hi) + bin.lo;
-                // bin.lo may push bin over a boundary, in which case floor and frac are incorrect.
-                // Compute an adjustment based on whether or not the fractional part is outside (0.0, 1.0).
-                const float adjust = ::floorf(frac);  // -1, 0, or 1
-                bin_floor_int = static_cast<int32_t>(floor_hi + adjust);
-                w = frac - adjust;
-            } else if constexpr (UseSharedPreamble) {
-                // Float / Mixed with shared-mem preamble: antenna position and
-                // mcp_partial have been pre-loaded / pre-computed in shared
-                // memory, so the inner loop is pure strict_compute_t arithmetic.
-                const strict_compute_t apx = sh_mem.ant_pos[ip][0];
-                const strict_compute_t apy = sh_mem.ant_pos[ip][1];
-                const strict_compute_t apz = sh_mem.ant_pos[ip][2];
-                const strict_compute_t mcp_partial = sh_mem.ant_pos[ip][3];
-                const strict_compute_t dx = static_cast<strict_compute_t>(px) - apx;
-                const strict_compute_t dy = static_cast<strict_compute_t>(py) - apy;
-                const strict_compute_t dz = static_cast<strict_compute_t>(pz) - apz;
-                strict_compute_t dist;
-                if constexpr (ComputeType == SarBpComputeType::Float) {
-                    dist = ::sqrtf(dx*dx + dy*dy + dz*dz);
-                } else {
-#if __CUDA_ARCH__ == 700 || __CUDA_ARCH__ == 800 || __CUDA_ARCH__ == 900 || __CUDA_ARCH__ == 1000
-                    dist = ::sqrt(dx*dx + dy*dy + dz*dz);
-#else
-                    dist = NewtonRaphsonSqrt(dx*dx + dy*dy + dz*dz);
-#endif
-                }
-                // bin = (dist - rtm)*dr_inv + bin_offset = dist*dr_inv + mcp_partial
-                const strict_compute_t bin = dist * dr_inv + mcp_partial;
-                strict_compute_t bin_floor;
-                if constexpr (cuda::std::is_same_v<strict_compute_t, double>) {
-                    bin_floor = ::floor(bin);
-                } else {
-                    bin_floor = ::floorf(bin);
-                }
-                w = static_cast<loose_compute_t>(bin - bin_floor);
-                bin_floor_int = static_cast<int32_t>(bin_floor);
-                // diffR is unused when PhaseLUT=true (required for this branch);
-                // assign to avoid any maybe-uninitialized warning downstream.
-                diffR = dist;
+
+            // The following branches all compute at least the base range bin index and interpolation weight
+            // for the current pulse. Paths not using the PhaseLUT optimization will also populate the
+            // thread_state.range_delta member for use in get_reference_phase_direct().
+            SarBpBinAndWeight<loose_compute_t> bin_weight;
+            if constexpr (IsFloatFloat) {
+                bin_weight = ComputeBinWeightToPixelFloatFloat(sh_mem, ip, px, py, pz, dr_inv);
+            } else if constexpr (IsTaylorFast) {
+                static_assert(PhaseLUT == true, "SarBp: TaylorFast compute type requires PhaseLUT optimization");
+                bin_weight = ComputeBinWeightToPixelTaylorFast<TaylorFastAddThirdOrder>(sh_mem, thread_state, ip);
+            } else if constexpr (UsePulseBlockCache) {
+                bin_weight = ComputeBinWeightToPixelPulseBlockCache<ComputeType>(sh_mem, ip, px, py, pz, dr_inv);
             } else {
-                diffR = ComputeRangeToPixel<decltype(pp), ComputeType, strict_compute_t, loose_compute_t>(
-                    pp, p, px, py, pz) - static_cast<strict_compute_t>(r_to_mcp(p));
-                const strict_compute_t bin = diffR * dr_inv + bin_offset;
+                // Below is the most direct / standard path for backprojection. We compute the distance
+                // from the antenna phase center to the pixel for this pulse and from that the differential range
+                // to the MCP. From that differential range, we compute the base range bin index and interpolation weight
+                // for range profile sampling. Finally, we use the differential range to compute the reference phase
+                // and accumulate the contribution to the pixel. All other paths implement similar logic using various
+                // optimizations.
+                const strict_compute_t range_delta =
+                    ComputeRangeToPixel<decltype(pp), ComputeType, strict_compute_t, loose_compute_t>(
+                        pp, p, px, py, pz) - static_cast<strict_compute_t>(r_to_mcp(p));
+                if constexpr (!PhaseLUT) {
+                    thread_state.range_delta = range_delta;
+                }
+                const strict_compute_t bin = range_delta * dr_inv + bin_offset;
                 strict_compute_t bin_floor;
                 if constexpr (cuda::std::is_same_v<strict_compute_t, double>) {
                     bin_floor = ::floor(bin);
                 } else {
                     bin_floor = ::floorf(bin);
                 }
-                w = static_cast<loose_compute_t>(bin - bin_floor);
-                bin_floor_int = static_cast<int32_t>(bin_floor);
+                bin_weight.w = static_cast<loose_compute_t>(bin - bin_floor);
+                bin_weight.bin_floor_int = static_cast<int32_t>(bin_floor);
             }
-            if (bin_floor_int >= 0 && bin_floor_int < num_range_bins-1) {
-                // rp accessor picks the fast pointer path on IsUnitStride or
-                // falls through to operator().
-                const range_profiles_t sample_lo = rp(p, bin_floor_int);
-                const range_profiles_t sample_hi = rp(p, bin_floor_int + 1);
-
-                const loose_complex_compute_t sample = [&sample_lo, &sample_hi, &w]() -> loose_complex_compute_t {
-                    const loose_complex_compute_t loose_sample_lo = static_cast<loose_complex_compute_t>(sample_lo);
-                    const loose_complex_compute_t loose_sample_hi = static_cast<loose_complex_compute_t>(sample_hi);
-                    if constexpr (cuda::std::is_same_v<loose_compute_t, float>) {
-                        return loose_complex_compute_t{
-                            __fmaf_rn(w, loose_sample_hi.real(), __fmaf_rn(-w, loose_sample_lo.real(), loose_sample_lo.real())),
-                            __fmaf_rn(w, loose_sample_hi.imag(), __fmaf_rn(-w, loose_sample_lo.imag(), loose_sample_lo.imag()))
-                        };
-                    } else {
-                        return loose_complex_compute_t{
-                            fma(w, loose_sample_hi.real(), fma(-w, loose_sample_lo.real(), loose_sample_lo.real())),
-                            fma(w, loose_sample_hi.imag(), fma(-w, loose_sample_lo.imag(), loose_sample_lo.imag()))
-                        };
-                    }
-                }();
-
-                // For FloatFloat mode, diffR has been set to the distance to the pixel rather than the differential range to the MCP.
-                // However, FloatFloat mode currently requires PhaseLUT optimization due to missing fltflt sin/cos implementations,
-                // so diffR will not actually be used in get_reference_phase() below.
-                static_assert(ComputeType != SarBpComputeType::FloatFloat || PhaseLUT == true, "SarBp: FloatFloat compute type requires PhaseLUT optimization");
-                const loose_complex_compute_t ref_phase = get_reference_phase(diffR, bin_floor_int, w);
 
-                accum = accumulate_contribution(accum, sample, ref_phase);
+            if (bin_weight.bin_floor_int < 0 || bin_weight.bin_floor_int >= num_range_bins-1) {
+                continue;
             }
+
+            // rp accessor picks the fast pointer path on IsUnitStride or
+            // falls through to operator().
+            const range_profiles_t sample_lo = rp(p, bin_weight.bin_floor_int);
+            const range_profiles_t sample_hi = rp(p, bin_weight.bin_floor_int + 1);
+
+            const loose_complex_compute_t sample = [&sample_lo, &sample_hi, &bin_weight]() -> loose_complex_compute_t {
+                const loose_complex_compute_t loose_sample_lo = static_cast<loose_complex_compute_t>(sample_lo);
+                const loose_complex_compute_t loose_sample_hi = static_cast<loose_complex_compute_t>(sample_hi);
+                const loose_compute_t w = bin_weight.w;
+                if constexpr (cuda::std::is_same_v<loose_compute_t, float>) {
+                    return loose_complex_compute_t{
+                        __fmaf_rn(w, loose_sample_hi.real(), __fmaf_rn(-w, loose_sample_lo.real(), loose_sample_lo.real())),
+                        __fmaf_rn(w, loose_sample_hi.imag(), __fmaf_rn(-w, loose_sample_lo.imag(), loose_sample_lo.imag()))
+                    };
+                } else {
+                    return loose_complex_compute_t{
+                        fma(w, loose_sample_hi.real(), fma(-w, loose_sample_lo.real(), loose_sample_lo.real())),
+                        fma(w, loose_sample_hi.imag(), fma(-w, loose_sample_lo.imag(), loose_sample_lo.imag()))
+                    };
+                }
+            }();
+
+            static_assert((!IsFloatFloat &&
+                            !IsTaylorFast) || PhaseLUT == true,
+                            "SarBp: FloatFloat and TaylorFast compute types require PhaseLUT optimization");
+            const loose_complex_compute_t ref_phase = [&]() {
+                if constexpr (PhaseLUT) {
+                    return get_reference_phase_lut(bin_weight.bin_floor_int, bin_weight.w);
+                } else {
+                    return get_reference_phase_direct(thread_state.range_delta);
+                }
+            }();
+
+            accum = accumulate_contribution(accum, sample, ref_phase);
         } // pulse
     } // pulse block
 
diff --git a/include/matx/operators/sar_bp.h b/include/matx/operators/sar_bp.h
index c7ac37451..f50fc945e 100644
--- a/include/matx/operators/sar_bp.h
+++ b/include/matx/operators/sar_bp.h
@@ -32,6 +32,7 @@
 
 #pragma once
 
+#include "matx/core/props.h"
 #include "matx/core/type_utils.h"
 #include "matx/operators/base_operator.h"
 
@@ -44,9 +45,14 @@ namespace matx {
  * While the inputs (range profiles, antenna positions, etc.) and output (image) have their own data
  * types, we may wish to use a different precision for the internal calculations. For example, the
  * output may be cuda::std::complex<float> while the intermediate calculations are done in double.
+ * For some compute types, additional approximations will be applied to further improve run-time performance.
  */
 enum class SarBpComputeType {
-  Double, //!< Uses double precision for all intermediate calculations.
+  Double, /**< Uses double precision for all intermediate calculations. The backprojection kernel works with user-provided input
+              for the range profiles, platform positions, ranges to mocomp point, and voxel locations. Thus, the user should provide
+              values in double precision if needed. Typically, the platform positions and ranges to mocomp point will be double
+              precision and range profiles and voxel locations will be single precision, but users can provide double precision
+              range profiles and voxel locations if needed. */
   Mixed, /**< Uses mixed precision for intermediate calculations. This compute type offers a trade-off between
               performance and precision. With \p Mixed precision, the range calculated per pixel-pulse pair will 
               still typically be done in double-precision, but interpolation and accumulation will be single-precision.
@@ -56,8 +62,19 @@ enum class SarBpComputeType {
               float-float handling of the values for which fp64 would otherwise be needed. The float-float
               representation offers increased precision relative to fp32, but not full fp64 precision,
               through the use of increased fp32 computation and representing each value as an unevaluated
-              sum of two fp32 components. */
-  Float //!< Uses single precision for all intermediate calculations.
+              sum of two fp32 components. For optimal performance, users can provide any inputs that require
+              additional precision (e.g., platform positions and range to mocomp point) directly in \c fltflt format.
+              Otherwise, the kernel will perform double-to-fltflt conversions internally.
+              FloatFloat requires \p PhaseLUTOptimization. */
+  TaylorFast, /**< Uses a local Taylor approximation of the pulse-to-pixel range about a centered per-thread-block
+              reference point. This mode prioritizes throughput and requires \p PhaseLUTOptimization.
+              The \p PropSarBpTaylorFastAddThirdOrder property can be used to add a third-order term to the Taylor approximation.
+              On systems with reduced double-precision throughput, TaylorFast is typically the fastest compute type using either
+              second order or third order Taylor approximation. TaylorFast will divide by reference range values to certain image
+              pixels and thus does not support geometries where the antenna phase center is located at any of the image pixels
+              (i.e., at range 0 from some pixel). In such cases, the user should choose a different compute type.*/
+  Float /**< Uses single precision for all intermediate calculations. This mode is fast, but typically does not provide
+              sufficient precision for cases where the ranges exceed several kilometers. It is not suited for spaceborne SAR geometries. */
 };
 
 /**
@@ -69,9 +86,21 @@ enum class SarBpFeature : uint32_t {
   to store partial values for the reference phases used during backprojection. The value from the lookup table will
   be combined with an incremental phase calculation within a single range bin that is computed using the lower-precision
   intrinsic sine/cosine functions. This optimization will utilize a small amount of device memory as a workspace
-  buffer. This optimization is typically only useful for the \p Mixed and \p FloatFloat compute types. */
+  buffer. This optimization is typically useful for the \p Mixed, \p FloatFloat, and \p TaylorFast compute types.
+  It is required for \p FloatFloat and \p TaylorFast. */
 };
 
+/**
+ * @brief Adds the third-order local range term to SarBpComputeType::TaylorFast.
+ *
+ * TaylorFast uses a second-order local Taylor range approximation by default.
+ * This property instantiates a TaylorFast kernel variant that also evaluates
+ * the third-order term. It has no effect unless SarBpParams::compute_type is
+ * SarBpComputeType::TaylorFast. This is primarily useful for short stand-off
+ * ranges where the second-order approximation may not be accurate enough.
+ */
+struct PropSarBpTaylorFastAddThirdOrder {};
+
 // Enable bitmask operations for SarBpFeature
 constexpr SarBpFeature operator|(SarBpFeature lhs, SarBpFeature rhs) noexcept {
   return static_cast<SarBpFeature>(static_cast<uint32_t>(lhs) | static_cast<uint32_t>(rhs));
@@ -107,8 +136,11 @@ struct SarBpParams {
 namespace matx {
 
 namespace detail {
-  template<typename ImageType, typename RangeProfilesType, typename PlatPosType, typename VoxLocType, typename RangeToMcpType>
-  class SarBpOp : public BaseOp<SarBpOp<ImageType, RangeProfilesType, PlatPosType, VoxLocType, RangeToMcpType>>
+  template <typename ImageType, typename RangeProfilesType, typename PlatPosType, typename VoxLocType, typename RangeToMcpType, typename List>
+  struct make_sar_bp_op;
+
+  template<typename ImageType, typename RangeProfilesType, typename PlatPosType, typename VoxLocType, typename RangeToMcpType, typename... CurrentProps>
+  class SarBpOp : public BaseOp<SarBpOp<ImageType, RangeProfilesType, PlatPosType, VoxLocType, RangeToMcpType, CurrentProps...>>
   {
     static_assert(is_complex_v<typename RangeProfilesType::value_type>, "Phase history must be complex");
     static_assert(is_complex_v<typename ImageType::value_type>, "Image must be complex");
@@ -169,7 +201,11 @@ namespace detail {
       void Exec(Out &&out, Executor &&ex) const {
         static_assert(is_cuda_executor_v<Executor>, "sarbp() only supports the CUDA executor currently");
 
-        sar_bp_impl(cuda::std::get<0>(out), initial_image_, range_profiles_, platform_positions_, voxel_locations_, range_to_mcp_, params_, ex.getStream());
+        constexpr bool TaylorFastAddThirdOrder =
+          has_property_tag<PropSarBpTaylorFastAddThirdOrder, CurrentProps...>;
+        sar_bp_impl<TaylorFastAddThirdOrder>(
+          cuda::std::get<0>(out), initial_image_, range_profiles_, platform_positions_,
+          voxel_locations_, range_to_mcp_, params_, ex.getStream());
       }
 
       static __MATX_INLINE__ constexpr __MATX_HOST__ __MATX_DEVICE__ int32_t Rank()
@@ -226,22 +262,42 @@ namespace detail {
         return out_dims_[dim];
       }
 
+      template <typename... NewProps>
+      constexpr auto props() const {
+        using AllProps = typename merge_props_unique<type_list<CurrentProps...>, NewProps...>::type;
+        return make_sar_bp_op<ImageType, RangeProfilesType, PlatPosType, VoxLocType, RangeToMcpType, AllProps>::make(
+          initial_image_, range_profiles_, platform_positions_, voxel_locations_, range_to_mcp_, params_);
+      }
+
+  };
+
+  template <typename ImageType, typename RangeProfilesType, typename PlatPosType, typename VoxLocType, typename RangeToMcpType, typename... Props>
+  struct make_sar_bp_op<ImageType, RangeProfilesType, PlatPosType, VoxLocType, RangeToMcpType, type_list<Props...>> {
+    static constexpr auto make(const ImageType &initial_image, const RangeProfilesType &range_profiles,
+      const PlatPosType &platform_positions, const VoxLocType &voxel_locations,
+      const RangeToMcpType &range_to_mcp, const SarBpParams &params) {
+      return SarBpOp<ImageType, RangeProfilesType, PlatPosType, VoxLocType, RangeToMcpType, Props...>(
+        initial_image, range_profiles, platform_positions, voxel_locations, range_to_mcp, params);
+    }
   };
 }
 
 namespace experimental {
 
 /**
-* @brief SAR backprojection.
+* @brief sar_bp implements standard or direct SAR backprojection. Backprojection is an image formation algorithm
+* that reconstructs a complex-valued image from a set of range-compressed complex samples. The runtime complexity
+* of the sar_bp transform is O(MNP) where there are MxN pixels and P pulses. In the case that M=N=P, sar_bp is
+* O(N^3).
 *
-* @note The number of range bins (\p range_profiles second dimension) is constrained as follows:
-* For the Float, Mixed, and FloatFloat compute types, num_range_bins is capped at 2^24 (16,777,216).
-* Those paths compute the per-pulse bin offset and (for Float / FloatFloat) the bin floor in fp32, and
-* fp32 can exactly represent all integers only in [-2^24, 2^24]; above 2^24 the representable gaps grow
-* (2.0 at 2^24+, 4.0 at 2^25+, ...), so the bin index would lose precision near the upper boundary.
-* The Double compute type uses fp64 throughout for the bin computation and is only constrained by the
-* int32_t tensor-indexing limit (num_range_bins <= cuda::std::numeric_limits<int32_t>::max()). The transform throws
-* \c matxInvalidParameter at launch if these limits are exceeded. Typical raw num_range_bins is on the
+* @note The number of range bins is constrained as follows:
+* - For the Float, Mixed, FloatFloat, and TaylorFast compute types, num_range_bins is capped at 2^24 (16,777,216).
+* Those paths compute the per-pulse bin offset and (for Float / FloatFloat / TaylorFast) the bin floor in fp32, and
+* fp32 can exactly represent all integers only in [-2^24, 2^24]. Exceeding 2^24 would result in loss of precision
+* in the bin index used for range profile sampling.
+* - The Double compute type uses fp64 throughout for the bin computation and is only constrained by the
+* int32_t tensor-indexing limit (num_range_bins <= cuda::std::numeric_limits<int32_t>::max()). The transform throws \c matxInvalidParameter
+* at launch if these limits are exceeded. Typical raw num_range_bins is on the
 * order of 10^4-10^5 and well below the 2^24 bound. The fp32 limit can be reached, however, when heavy
 * range oversampling is applied. An upsample factor that pushes the FFT length above ~2^24 /
 * num_samples_raw (e.g., upsampling 32k raw samples by 512x or more) for non-Double compute types will trigger the runtime check.
@@ -251,21 +307,26 @@ namespace experimental {
 * @tparam RangeProfilesType Type of range_profiles. RangeProfilesType must represent a 2D operator of size num_pulses x num_range_bins containing the range-compressed complex samples.
 * RangeProfilesType must be a complex type. Typical data types are cuda::std::complex<float> or cuda::std::complex<double>.
 * @tparam PlatPosType Type of platform positions. PlatPosType must represent a 1D operator of size num_pulses containing the platform positions. Currently, the only supported data
-* types for PlatPosType are double3, double4, float3, and float4. If the user has three separate operators for the x, y, and z coordinates, they can be combined using the zipvec operator.
+* types for PlatPosType are double3, double4, float3, and float4. For the float4 and double4 data types, the w coordinate is ignored.
+* If the user has three separate operators for the x, y, and z coordinates, they can be combined using the zipvec operator.
 * @tparam VoxLocType Type of voxel locations. VoxLocType must represent a 2D operator of size image_height x image_width containing the voxel locations. Currently, the only supported
 * data types for VoxLocType are double3, double4, float3, and float4. For the float4 and double4 data types, the w coordinate is ignored.
 * If the user has three separate operators for the x, y, and z coordinates, they can be combined using the zipvec operator.
-* @tparam RangeToMcpType Type of range to motion compensation point. RangeToMcpType must represent a 0D or 1D real-valued operator of size 1 or num_pulses.
+* @tparam RangeToMcpType Type of range to motion compensation point. RangeToMcpType must represent a 0D or 1D real-valued operator of size 1 or num_pulses, respectively.
 * @param initial_image Initial image. Initial image must represent a 2D operator of size image_height x image_width for an image of the corresponding dimensions. Contributions computed
 * during backprojection will be added to the initial image. The user can use the zeros generator (i.e., matx::zeros) if no initial image is needed.
+* See \p ImageType documentation for details on supported rank and data types.
 * @param range_profiles Range profiles. Range profiles must represent a 2D operator of size num_pulses x num_range_bins containing the range-compressed complex samples.
+* See \p RangeProfilesType documentation for details on supported rank and data types.
 * @param platform_positions Platform positions represent the x, y, and z coordinates of the aperture phase center for each pulse. The coordinates should be in
 * the same coordinate system and units as the voxel locations. See \p PlatPosType documentation for details on supported rank and data types.
 * @param voxel_locations Voxel locations represent the x, y, and z coordinates of the voxels in the image. The coordinates should be in
 * the same coordinate system and units as the platform positions. See \p VoxLocType documentation for details on supported rank and data types.
 * @param range_to_mcp Range to motion compensation point is the distance (range) from each platform position to the motion compensation point.
 * See \p RangeToMcpType documentation for details on supported rank and data types.
-* @param params SAR backprojection parameters. See \p SarBpParams documentation for details on supported parameters.
+* @param params SAR backprojection parameters. See \p SarBpParams documentation for details on supported parameters. Note that the \p TaylorFast
+* compute type does not support geometries where the antenna phase center is located at any of the image pixels (i.e., at range 0 from some pixel).
+* In such cases, the user should choose another compute type to avoid potential divide-by-zero errors at run-time.
 */
 template <typename ImageType, typename RangeProfilesType, typename PlatPosType, typename VoxLocType, typename RangeToMcpType>
 inline auto sar_bp(const ImageType &initial_image, const RangeProfilesType &range_profiles,
diff --git a/include/matx/transforms/sar_bp.h b/include/matx/transforms/sar_bp.h
index 323f15a79..77bf9cd47 100644
--- a/include/matx/transforms/sar_bp.h
+++ b/include/matx/transforms/sar_bp.h
@@ -44,7 +44,7 @@
 
 namespace matx {
 
-template <typename OutImageType, typename InitialImageType, typename RangeProfilesType, typename PlatPosType, typename VoxLocType, typename RangeToMcpType>
+template <bool TaylorFastAddThirdOrder = false, typename OutImageType, typename InitialImageType, typename RangeProfilesType, typename PlatPosType, typename VoxLocType, typename RangeToMcpType>
 inline void sar_bp_impl(OutImageType &out, const InitialImageType &initial_image, const RangeProfilesType &range_profiles, const PlatPosType &platform_positions,
   const VoxLocType &voxel_locations, const RangeToMcpType &range_to_mcp, const SarBpParams &params, cudaStream_t stream = 0) {
 #ifdef __CUDACC__
@@ -58,11 +58,12 @@ inline void sar_bp_impl(OutImageType &out, const InitialImageType &initial_image
   MATX_STATIC_ASSERT_STR(RangeProfilesType::Rank() == 2, matxInvalidDim, "sar_bp: range profiles must be a 2D tensor");
 
   const bool phase_lut_optimization = has_feature(params.features, SarBpFeature::PhaseLUTOptimization);
-  if (params.compute_type == SarBpComputeType::FloatFloat && ! phase_lut_optimization) {
-    // We currently require that phase LUT optimization be enabled for the FloatFloat compute type
+  if ((params.compute_type == SarBpComputeType::FloatFloat ||
+       params.compute_type == SarBpComputeType::TaylorFast) && ! phase_lut_optimization) {
+    // We currently require that phase LUT optimization be enabled for these compute types
     // because we do not yet have float-float based sin/cos implementations. Thus, we would fall back
-    // to double-precision sincos() computations, which defeats the purpose of the FloatFloat compute type.
-    MATX_THROW(matxInvalidParameter, "sar_bp: FloatFloat compute type requires phase LUT optimization");
+    // to double-precision sincos() computations for FloatFloat and TaylorFast is defined as a LUT-backed mode.
+    MATX_THROW(matxInvalidParameter, "sar_bp: FloatFloat and TaylorFast compute types require phase LUT optimization");
   }
 
   const index_t num_pulses = range_profiles.Size(0);
@@ -83,12 +84,12 @@ inline void sar_bp_impl(OutImageType &out, const InitialImageType &initial_image
                "the kernel indexes range bins via a 32-bit integer");
   }
 
-  // The Float, Mixed, and FloatFloat compute types all use loose_compute_t =
+  // The Float, Mixed, FloatFloat, and TaylorFast compute types all use loose_compute_t =
   // float, which makes the per-pulse `bin_offset = 0.5 * (num_range_bins - 1)`
   // an fp32 value. fp32 can exactly represent all integers in [-2^24, 2^24];
   // above that, the gaps grow (2.0 at 2^24+, 4.0 at 2^25+, ...), so
   // bin_offset would lose precision and bin_floor_int would be off by up to
-  // ~1 bin for every pixel. (Float and FloatFloat additionally use floorf()
+  // ~1 bin for every pixel. (Float, FloatFloat, and TaylorFast additionally use floorf()
   // on an fp32 bin value, which is constrained by the same limit.) We
   // therefore cap num_range_bins at 2^24 for those paths.
   //
@@ -100,7 +101,7 @@ inline void sar_bp_impl(OutImageType &out, const InitialImageType &initial_image
   if (fp32_bin_path && range_profiles.Size(1) > FP32_MAX_RANGE_BINS) {
     MATX_THROW(matxInvalidParameter,
                "sar_bp: num_range_bins exceeds the maximum supported value of 2^24 "
-               "(16,777,216) for Float/Mixed/FloatFloat compute types -- fp32 mantissa "
+               "(16,777,216) for Float/Mixed/FloatFloat/TaylorFast compute types -- fp32 mantissa "
                "cannot exactly represent bin indices above 2^24. Use the Double "
                "compute type for larger range-bin counts.");
   }
@@ -177,6 +178,11 @@ inline void sar_bp_impl(OutImageType &out, const InitialImageType &initial_image
         SarBpFillPhaseLUT<double, float><<<lut_grid, lut_block, 0, stream>>>(phase_lut, params.center_frequency, params.del_r, range_profiles.Size(1));
         SarBp<SarBpComputeType::FloatFloat, OutImageType, InitialImageType, RangeProfilesType, PlatPosType, VoxLocType, RangeToMcpType, PhaseLUT, IsUnitStride><<<grid, block, 0, stream>>>(
           out, initial_image, range_profiles, platform_positions, voxel_locations, range_to_mcp, static_cast<fltflt>(dr_inv), phase_correction_partial, phase_lut);
+      } else if (params.compute_type == SarBpComputeType::TaylorFast) {
+        cuda::std::complex<float> *phase_lut = static_cast<cuda::std::complex<float> *>(workspace);
+        SarBpFillPhaseLUT<double, float><<<lut_grid, lut_block, 0, stream>>>(phase_lut, params.center_frequency, params.del_r, range_profiles.Size(1));
+        SarBp<SarBpComputeType::TaylorFast, OutImageType, InitialImageType, RangeProfilesType, PlatPosType, VoxLocType, RangeToMcpType, PhaseLUT, IsUnitStride, TaylorFastAddThirdOrder><<<grid, block, 0, stream>>>(
+          out, initial_image, range_profiles, platform_positions, voxel_locations, range_to_mcp, dr_inv, phase_correction_partial, phase_lut);
       } else {
         cuda::std::complex<float> *phase_lut = static_cast<cuda::std::complex<float> *>(workspace);
         SarBpFillPhaseLUT<float, float><<<lut_grid, lut_block, 0, stream>>>(phase_lut, static_cast<float>(params.center_frequency), static_cast<float>(params.del_r), range_profiles.Size(1));
@@ -193,10 +199,11 @@ inline void sar_bp_impl(OutImageType &out, const InitialImageType &initial_image
       } else if (params.compute_type == SarBpComputeType::Mixed) {
         SarBp<SarBpComputeType::Mixed, OutImageType, InitialImageType, RangeProfilesType, PlatPosType, VoxLocType, RangeToMcpType, PhaseLUT, IsUnitStride><<<grid, block, 0, stream>>>(
           out, initial_image, range_profiles, platform_positions, voxel_locations, range_to_mcp, dr_inv, phase_correction_partial, nullptr);
-      } else if (params.compute_type == SarBpComputeType::FloatFloat) {
-        // We currently require that phase LUT optimization be enabled for the FloatFloat compute type. See comment
+      } else if (params.compute_type == SarBpComputeType::FloatFloat ||
+                 params.compute_type == SarBpComputeType::TaylorFast) {
+        // We currently require that phase LUT optimization be enabled for these compute types. See comment
         // in run-time check higher in this function.
-        MATX_THROW(matxInvalidParameter, "sar_bp: FloatFloat compute type requires phase LUT optimization");
+        MATX_THROW(matxInvalidParameter, "sar_bp: FloatFloat and TaylorFast compute types require phase LUT optimization");
       } else {
         SarBp<SarBpComputeType::Float, OutImageType, InitialImageType, RangeProfilesType, PlatPosType, VoxLocType, RangeToMcpType, PhaseLUT, IsUnitStride><<<grid, block, 0, stream>>>(
           out, initial_image, range_profiles, platform_positions, voxel_locations, range_to_mcp,
diff --git a/test/00_transform/SarBp.cu b/test/00_transform/SarBp.cu
index 9829d7a0e..614e67db0 100644
--- a/test/00_transform/SarBp.cu
+++ b/test/00_transform/SarBp.cu
@@ -136,7 +136,7 @@ TYPED_TEST(SarBpTestNonComplexNonHalfFloatTypes, NonMixedTypes)
   MATX_EXIT_HANDLER();
 }
 
-// Test Mixed and FloatFloat precisions. These precisions are used when the user wants
+// Test Mixed, FloatFloat, and TaylorFast precisions. These precisions are used when the user wants
 // better than fp32 precision at lower cost than fp64 (especially for GPUs with reduced FP64 throughput).
 TYPED_TEST(SarBpTestDoubleType, MixedPrecision)
 {
@@ -193,9 +193,10 @@ TYPED_TEST(SarBpTestDoubleType, MixedPrecision)
     this->exec.sync();
   }
 
-  // Enable PhaseLUTOptimization
+  // Enable PhaseLUTOptimization for all remaining tests
+  params.features = SarBpFeature::PhaseLUTOptimization;
+
   {
-    params.features = SarBpFeature::PhaseLUTOptimization;
     (image = matx::experimental::sar_bp(zero_image, range_profiles, platform_positions, voxel_locations, range_to_mcp, params)).run(this->exec);
     this->exec.sync();
   }
@@ -208,11 +209,28 @@ TYPED_TEST(SarBpTestDoubleType, MixedPrecision)
     this->exec.sync();
   }
 
+  {
+    params.compute_type = SarBpComputeType::TaylorFast;
+    (image = matx::experimental::sar_bp(zero_image, range_profiles, platform_positions, voxel_locations, range_to_mcp, params)).run(this->exec);
+    this->exec.sync();
+
+    // We reset TaylorFast and enable PhaseLUTOptimization below to include the text in the
+    // example that will be used in the documentation.
+    // example-begin sar-bp-2
+    params.compute_type = SarBpComputeType::TaylorFast;
+    params.features = SarBpFeature::PhaseLUTOptimization;
+
+    (image = matx::experimental::sar_bp(zero_image, range_profiles, platform_positions, voxel_locations, range_to_mcp, params)
+      .props<PropSarBpTaylorFastAddThirdOrder>()).run(this->exec);
+    // example-end sar-bp-2
+    this->exec.sync();
+  }
+
   MATX_EXIT_HANDLER();
 }
 
 // Verify the num_range_bins boundary at 2^24 for compute types that compute
-// the per-pulse bin_offset in fp32 (Float, Mixed, FloatFloat). fp32 can
+// the per-pulse bin_offset in fp32 (Float, Mixed, FloatFloat, TaylorFast). fp32 can
 // exactly represent all integers in [-2^24, 2^24] but not above, so those
 // paths accept num_range_bins == 2^24 and reject anything larger.
 //
@@ -317,8 +335,10 @@ TYPED_TEST(SarBpTestDoubleType, PointTarget)
 
   using ExecType = cuda::std::tuple_element_t<1, TypeParam>;
 
-  constexpr matx::index_t image_width = 128;
-  constexpr matx::index_t image_height = 128;
+  // Divisible by 4 for target-location checks, but not by the 16x16 CUDA block
+  // size, so TaylorFast exercises partial edge CTAs.
+  constexpr matx::index_t image_width = 132;
+  constexpr matx::index_t image_height = 132;
   const matx::index_t num_pulses = 128;
   const matx::index_t num_range_bins = 128;
 
@@ -337,7 +357,7 @@ TYPED_TEST(SarBpTestDoubleType, PointTarget)
   // Target range relative to mocomp point (0,0,0)
   const double target_R = ::sqrt(target_x * target_x + target_y * target_y);
 
-  matx::SarBpParams bp_params;
+  matx::SarBpParams bp_params{};
   bp_params.del_r = ::sqrt((image_width * pix_dx) * (image_width * pix_dx) + (image_height * pix_dy) * (image_height * pix_dy)) / num_range_bins;
   bp_params.center_frequency = 10.0e9;
   const double bin_offset = 0.5 * (num_range_bins - 1);
@@ -401,6 +421,9 @@ TYPED_TEST(SarBpTestDoubleType, PointTarget)
       }
     }
   };
+  // Single-precision machine epsilon is ~1.1921e-7; allow one fp32 epsilon
+  // per pulse of accumulation at the focused target pixel.
+  const double f32_accum_re_thresh = 1.2e-7 * static_cast<double>(num_pulses);
 
   // Start with fully fp64 backprojector. The range profiles and all position values are double precision.
   {
@@ -480,7 +503,7 @@ TYPED_TEST(SarBpTestDoubleType, PointTarget)
           const float expected = 0.98f * num_pulses;
           const float actual = image(i, j).real();
           ASSERT_GT(actual, expected);
-          ASSERT_LT(std::abs(image(i, j).imag()), 1.0f);
+          ASSERT_LT(std::abs(image(i, j).imag()), 3.0f);
         } else if (i < image_height / 8 || i >= 3 * image_height / 8) {
           // Away from the scatterer in range, we should not have any non-zero values.
           ASSERT_EQ(abs(image(i, j).real()), 0.0f);
@@ -494,30 +517,66 @@ TYPED_TEST(SarBpTestDoubleType, PointTarget)
     }
   }
 
-  // Mixed and FloatFloat versions of the backprojector. For these, we need double-precision platform positions
-  // and range to the mcp.
+  // Mixed, FloatFloat, and TaylorFast versions of the backprojector. For these, we need double-precision
+  // platform positions and range to the mcp.
   {
-    bp_params.features = matx::SarBpFeature::PhaseLUTOptimization;
-  
     auto platform_positions = matx::make_tensor<double3>({num_pulses});
     auto range_to_mcp = matx::make_tensor<double>({num_pulses});
     MATX_CUDA_CHECK(cudaMemcpy(
       platform_positions.Data(), h_antenna_phase_centers.data(), h_antenna_phase_centers.size() * sizeof(double3), cudaMemcpyHostToDevice));
     MATX_CUDA_CHECK(cudaMemcpy(range_to_mcp.Data(), h_range_to_mcp.data(), h_range_to_mcp.size() * sizeof(double), cudaMemcpyHostToDevice));
 
-    bp_params.compute_type = matx::SarBpComputeType::Mixed;
-    // example-begin sar-bp-1
-    (image = matx::experimental::sar_bp(zero_image, range_profiles, platform_positions, voxel_locations, range_to_mcp, bp_params)).run(this->exec);
-    // example-end sar-bp-1
-    this->exec.sync();  
+    {
+      SCOPED_TRACE("Mixed no PhaseLUT");
 
-    validate(image, 1.05e-10, 1.0e-2);
+      // example-begin sar-bp-1
+      bp_params.compute_type = matx::SarBpComputeType::Mixed;
+      (image = matx::experimental::sar_bp(zero_image, range_profiles, platform_positions, voxel_locations, range_to_mcp, bp_params)).run(this->exec);
+      // example-end sar-bp-1
+      this->exec.sync();
 
-    bp_params.compute_type = matx::SarBpComputeType::FloatFloat;
-    (image = matx::experimental::sar_bp(zero_image, range_profiles, platform_positions, voxel_locations, range_to_mcp, bp_params)).run(this->exec);
-    this->exec.sync();  
+      validate(image, f32_accum_re_thresh, 1.0e-7);
+    }
+
+    bp_params.features = matx::SarBpFeature::PhaseLUTOptimization;
+
+    {
+      SCOPED_TRACE("Mixed PhaseLUT");
+
+      bp_params.compute_type = matx::SarBpComputeType::Mixed;
+      (image = matx::experimental::sar_bp(zero_image, range_profiles, platform_positions, voxel_locations, range_to_mcp, bp_params)).run(this->exec);
+      this->exec.sync();
+
+      validate(image, f32_accum_re_thresh, 1.0e-2);
+    }
+
+    {
+      SCOPED_TRACE("FloatFloat");
 
-    validate(image, 1.05e-10, 1.0e-2);
+      bp_params.compute_type = matx::SarBpComputeType::FloatFloat;
+      (image = matx::experimental::sar_bp(zero_image, range_profiles, platform_positions, voxel_locations, range_to_mcp, bp_params)).run(this->exec);
+      this->exec.sync();
+
+      validate(image, f32_accum_re_thresh, 1.0e-2);
+    }
+
+    {
+      SCOPED_TRACE("TaylorFast second order");
+      bp_params.compute_type = matx::SarBpComputeType::TaylorFast;
+      (image = matx::experimental::sar_bp(zero_image, range_profiles, platform_positions, voxel_locations, range_to_mcp, bp_params)).run(this->exec);
+      this->exec.sync();
+
+      validate(image, f32_accum_re_thresh, 1.8e-2);
+    }
+
+    {
+      SCOPED_TRACE("TaylorFast third order");
+      (image = matx::experimental::sar_bp(zero_image, range_profiles, platform_positions, voxel_locations, range_to_mcp, bp_params)
+        .props<matx::PropSarBpTaylorFastAddThirdOrder>()).run(this->exec);
+      this->exec.sync();
+
+      validate(image, f32_accum_re_thresh, 1.8e-2);
+    }
   }
 
   MATX_EXIT_HANDLER();