EFT-based compensated arithmetic for spherical geometry (GCA intersection, point-in-face, face bounds) by rajeeja · Pull Request #1513 · UXARRAY/uxarray

rajeeja · 2026-05-22T12:00:23Z

Ports the compensated-arithmetic tier from AccuSphGeom into UXarray's Numba spherical geometry stack, replacing cancellation-prone naive cross-product paths with compensated primitives throughout arc intersection, point-in-face, and face bounds.

What changed

utils/computing.py — EFT and compensated primitives: two_sum, two_prod, diff_of_products, accucross, accucross_pair, acc_sqrt_re, and fixed-size compensated dot/sum-of-squares helpers. Breaking change: the old cross_fma and dot_fma functions (which depended on the optional pyfma package) are removed. Any code importing those names directly will get an ImportError; use accucross and the _cdp* helpers instead.
grid/arcs.py — new _orient3d_on_sphere_value, orient3d_on_sphere, on_minor_arc predicates using compensated arithmetic.
grid/intersections.py — gca_gca_intersection and gca_const_lat_intersection rewritten as three-layer stacks: pure numerical kernel (L1), integer-mask status layer (L2), UXarray dispatcher (L3). Removed dead _gca_gca_intersection_cartesian shim.
grid/point_in_face.py — _face_contains_point delegates to a new ray-casting SPIP implementation (_point_in_polygon_sphere) using orient3d_on_sphere instead of the old winding-number via arctan2.
grid/bounds.py — new _construct_face_bounds_array_gca path for pure-GCA grids, computing interior arc z-extrema correctly via the compensated kernel. Old path retained for latlon/mixed-edge grids.

What this is not

Not a full port of AccuSphGeom's robustness stack. Excluded: Shewchuk adaptive predicates, Simulation of Simplicity (requires per-vertex global IDs not available in UXarray's polygon representation), geogram exact fallback. This implements only the compensated-arithmetic tier — roughly twice as accurate as naive floating-point on near-tangent cases, with no measurable runtime overhead on intersection kernels.

Performance

Operation	vs. naive
`gca_gca_intersection`	~0% overhead (≈1 µs/call)
`gca_const_lat_intersection`	~0% overhead (≈1 µs/call)
`Grid.bounds` (face bounds, cached once)	~40% slower per face
Cross-section / zonal mean	~5% overhead

Accuracy

Measured on the AccuSphGeom baseline suite (31 near-tangent GCA-GCA pairs, angles down to 10⁻⁵°):

Arc angle	Naive error	EFT error	Improvement
≈ 0.000001°	1.07 × 10⁻⁷	1.24 × 10⁻¹⁶	861 million×
≈ 0.001°	5.04 × 10⁻¹²	1.14 × 10⁻¹⁶	44 000×
≈ 5°	8.47 × 10⁻¹⁶	1.58 × 10⁻¹⁶	5×

Tests

241 new baseline regression tests ported from the AccuSphGeom C++ suite covering near-tangent GCA-GCA pairs, GCA/constant-latitude cases, and spherical point-in-polygon.

review-notebook-app · 2026-05-22T12:00:29Z

Check out this pull request on

See visual diffs & provide feedback on Jupyter Notebooks.

Powered by ReviewNB

hongyuchen1030 · 2026-05-22T19:00:36Z

We have an existed implementation for these functions in

uxarray/uxarray/utils/computing.py

Line 189 in 735a1dd

def _two_sum(a, b):

. Consider either merge them or only keep one of them

And I would suggest use other name instead of eft here, the term "EFT" specifically refers to "Error-Free" floating point operations, but the AccuCross and diff_of_productthemself is not completely Error-Free, is just more accurate than normal floating point cross (doubling the precision)

Done: merged this into uxarray/utils/computing.py and renamed the docs/API section to compensated arithmetic. I only call two_sum/two_prod EFT now; the cross-product helpers are described as compensated algorithms.

hongyuchen1030 · 2026-05-22T19:23:44Z

+
+
+@njit(cache=True)
+def _generate_lat_lon_bounds_local(face_vertices, z_min, z_max, snap_tol_deg):


For a better vectorization, consider keeping the current design in https://github.com/hongyuchen1030/AccuSphGeom/blob/56cbd6e30270ec5845a853ec72ba5b19c9128017/include/accusphgeom/algorithms/lat_lon_bounds.hpp#L107:

It uses lots of masking and avoid branching for computation steps.

Done: rewrote this path closer to the AccuSphGeom structure. The local bounds computation now uses mask-selection for extrema/snap decisions instead of branching through separate cases.

The implementation for this entire function involves lots of new branching and operations that are not existed in the AccuSphGeom

The remaining branching in _generate_lat_lon_bounds_local and _generate_lat_lon_bounds_pole is UXarray-specific post-kernel formatting, not part of the EFT computation path. These functions convert the z-extrema from _face_location_info into degree lat/lon bounds using UXarray's encoding conventions (antimeridian signaled by lon_min > lon_max, pole-enclosing faces returned as 0/360). AccuSphGeom does not have an equivalent layer since it uses different output conventions. Both functions are now documented as UXarray-specific formatting steps, clearly separated from the EFT kernel.

hongyuchen1030 · 2026-05-22T19:25:29Z

+
+
+@njit(cache=True)
+def _generate_lat_lon_bounds_pole(face_vertices, label, z_min, z_max, snap_tol_deg):


Same here, consider keeping the structure here: https://github.com/hongyuchen1030/AccuSphGeom/blob/56cbd6e30270ec5845a853ec72ba5b19c9128017/include/accusphgeom/algorithms/lat_lon_bounds.hpp#L159

use masks instead of branching as much as possible for the vectorization purpose

Done: same cleanup here. The polar bounds path now keeps the AccuSphGeom-style local/polar structure and uses mask-selection for the snap logic.

hongyuchen1030 · 2026-05-22T19:26:56Z

+
+
+@njit(cache=True)
+def _face_location_info(face_vertices, polar_cap_z):


The branching here will probably slow down the vectorization, consider keeping the structure here:
https://github.com/hongyuchen1030/AccuSphGeom/blob/56cbd6e30270ec5845a853ec72ba5b19c9128017/include/accusphgeom/algorithms/lat_lon_bounds.hpp#L49

Done: _face_location_info now follows the AccuSphGeom face-location flow. The per-edge extrema use mask-selection; only the final face classification branches remain.

hongyuchen1030 · 2026-05-22T19:29:02Z

+@njit(cache=True)
+def _normalize_pair(x_hi, y_hi, z_hi, x_lo, y_lo, z_lo):
+    """Normalize an (hi, lo) compensated vector, returning the unit vector and magnitude."""
+    x = x_hi + x_lo


This _normalize_pair is not utilizing the EFT and it will have the same precision as the direct floating point precision. And the normalization is one of the big killer for the precision.

Also if the input are normalized, then both the gca_gca_intersection and the gca_constLat_intersection will return the normalized results already

And just in case you want an "accurately calculated norm", it is const auto sum = numeric::sum_of_squares_c<T, 3>(v.hi, v.lo); const auto norm = numeric::acc_sqrt_re(sum.hi, sum.lo);

Done: removed _normalize_pair. The intersection code now keeps compensated normals/intersection vectors, and the baseline near-tangent cases pass.

Done. _normalize_pair is removed. For _accux_gca, the direction vector is normalized using plain math.sqrt on the collapsed scalars rather than sum_of_squares_c + acc_sqrt_re — we found that passing collapsed scalars as hi/lo pairs to _sum_sq_c3 misrepresents their error structure and actually degrades accuracy on the baseline suite (pair_id=8 goes from <1e-15 to ~7e-10 error). The collapsed norm is sufficient because the unit-sphere error is dominated by the accucross_pair step, not the normalization. This is documented in a comment in _accux_gca.

hongyuchen1030 · 2026-05-22T19:35:16Z

    return gca_gca_intersection(gca_a_xyz, gca_b_xyz)


 @njit(cache=True)


consider keeping the entire structure from here : https://github.com/hongyuchen1030/AccuSphGeom/blob/e4e13215dd4771b7ef01b7edf81eaf58dd6e6995/include/accusphgeom/constructions/gca_gca_intersection.hpp#L31

These EFT-fused operations are extremely sensitive for each operations and order. The accuracy can only be guaranteed iff following the exact algorithm

Done: rewrote GCA-GCA around the AccuSphGeom structure: accucross normals, accucross_pair for the intersection direction, then candidate filtering with on_minor_arc.

hongyuchen1030 · 2026-05-22T19:41:30Z



 @njit(cache=True)
 def gca_const_lat_intersection(gca_cart, const_z):


The algorithm is almost very stable around floating point precision. Within the current Uxarray Error tolerance, you almost don't have to use any other safety pre-check other than if the input are valid (The relative error bound is 3*\sqrt{1-constz^2}*machine_epsilon as long as it's constZ is at least 10^15 from the equator) . And probably the only check needed Then only check you can make is if the const latitude is at the equator . Again, make sure to follow the entire algorithm structure from below. Such precision is only guaranteed
iff it follows the exact algorithm here
https://github.com/hongyuchen1030/AccuSphGeom/blob/e4e13215dd4771b7ef01b7edf81eaf58dd6e6995/include/accusphgeom/constructions/gca_constlat_intersection.hpp#L33

But you can use the same check as in https://github.com/hongyuchen1030/AccuSphGeom/blob/e4e13215dd4771b7ef01b7edf81eaf58dd6e6995/include/accusphgeom/constructions/gca_gca_intersection.hpp#L51

to remove any invalid points

Done: rewrote const-lat around the AccuSphGeom accux_constlat flow and filter invalid candidates after computing them.

hongyuchen1030 · 2026-05-22T19:44:55Z

+    nx = nx_hi + nx_lo
+    ny = ny_hi + ny_lo
+    nz = nz_hi + nz_lo
+    denom = nx * nx + ny * ny


Denome should be calculated from
const T denom = s2.hi + s2.lo;
where
const auto s2 = numeric::sum_of_squares_c<T, 2>( {normal.hi[0], normal.hi[1]}, {normal.lo[0], normal.lo[1]});

Done: denom now comes from _sum_sq_c2(...) as s2_hi + s2_lo.

hongyuchen1030 · 2026-05-22T19:49:10Z



 @njit(cache=True)
 def gca_gca_intersection(gca_a_xyz, gca_b_xyz):


Consider keeping the entire structure from here https://github.com/hongyuchen1030/AccuSphGeom/blob/e4e13215dd4771b7ef01b7edf81eaf58dd6e6995/include/accusphgeom/constructions/gca_gca_intersection.hpp#L31.

These algorithms are extremely sensitive to the operations and order, the accuracy is only guaranteed iff we follow the exact same algorithm described

Done: same GCA-GCA rewrite as above, following the AccuSphGeom operation order much more closely. Added baseline tests for the near-tangent cases too.

hongyuchen1030 · 2026-05-22T19:55:30Z

-    x2_at_const_z = np.isclose(
-        x2[2], const_z, rtol=ERROR_TOLERANCE, atol=ERROR_TOLERANCE
-    )
+    # 1. Endpoint coincidence with the latitude line.


This case is still valid and calculable with the new algorithm. And we can improve the vectorization by only use mask-selection after the intersection is calculated to snap the intersection point with the endpoints

Done: removed the endpoint early return. The code computes candidates first, then snaps near-endpoint results afterward.

hongyuchen1030 · 2026-05-22T19:56:12Z

    z_max = extreme_gca_z(gca_cart, extreme_type="max")
-
-    # Check if the constant latitude is within the GCA range
    if not in_between(z_min, const_z, z_max):


I am not sure if this early exist will counter-effect the branching it brings to the vectorization

Done: removed the endpoint pre-exits. I kept only invalid denom / negative-discriminant exits.

hongyuchen1030 · 2026-05-22T19:56:41Z

-    elif p2_intersects_gca:
-        res[0] = p2
+    # 4. Solve for the two candidate points on the latitude circle.
+    r2 = 1.0 - const_z * const_z


This is not the new EFT-fused algorithmn

Done: replaced this with the compensated s2/s3, two_prod, cdp4, and acc_sqrt_re flow from AccuSphGeom.

rajeeja · 2026-05-22T21:19:32Z

@hongyuchen1030 Thanks for the review. The point-in-polygon predicate and orient3d sign check look good, but I can see now that the intersection functions are basically the old code with a few error-free transformation calls sprinkled in rather than a real port. I'm planning to rewrite gca_gca_intersection and gca_const_lat_intersection from scratch following your C++ implementation exactly. I will be adding the missing pieces:

compensated_dot_product
sum_of_squares_c
acc_sqrt_re
The second accucross overload that takes the hi/lo pairs

hongyuchen1030 · 2026-06-04T02:09:46Z

+tiers — an EFT filter (what this module implements), Shewchuk adaptive
+predicates for results that fall inside the filter threshold, and a geogram
+exact-arithmetic fallback. This port implements only the EFT tier. For
+non-degenerate inputs in double precision this is sufficient; callers that


We're technically still unable to claim "For non-degenerate inputs in double precision this is sufficient". We can claim "This is twice more accurate than the direct floating point implementation without computation overhead with vectorization and parralization"

If we don't use any adaptive arithmetic, we cannot claim for robustness here. We can only claim our point-in-face use the new algorithm that are more accurate (if we use any EFT operations)

Fixed in commit 002b9fe. Removed all claims of 'sufficient', 'non-degenerate inputs', and 'robustness'. The module docstring now says the compensated routines are roughly twice as accurate as direct floating-point equivalents; robustness against all degenerate inputs would require adding an adaptive predicate or exact-arithmetic fallback tier. Same language used for the point-in-face docstring.

hongyuchen1030 · 2026-06-04T02:12:10Z

+predicates for results that fall inside the filter threshold, and a geogram
+exact-arithmetic fallback. This port implements only the EFT tier. For
+non-degenerate inputs in double precision this is sufficient; callers that
+need to handle geometrically degenerate inputs (coincident arcs, a query


These edge cases should be included in the test cases already. These scenarios don't have more risks than the "normal input" here. (They probably have the same risk since the intersection point is a newly constructed point)

Agreed. The denom==0 and planar_sq<0 edge cases propagate as inf/NaN through the kernel and are correctly handled by the isfinite mask in the status layer — no special-casing needed. The guards were removed from the kernel in commit 56c6ce1; only the status layer classifies them.

hongyuchen1030 · 2026-06-04T02:23:15Z

+
+        # Parameter along the arc at which z is extremal (matches C++ get_face_location_info).
+        denom = (z1 + z2) * (d - 1.0)
+        a_raw = (z1 * d - z2) / denom if denom != 0.0 else -1.0


The a = min(max(a_raw, 0.0), 1.0) should be able to prevent the divide by zero here. But we can keep it just in case

Kept as-is. The denom==0 guard sets a_raw=-1 so the clamp produces a=0 (use endpoint), which is safe. Matches your comment — kept just in case, and the guard costs nothing.

hongyuchen1030 · 2026-06-04T02:24:46Z

+            z_max = z_max_candidate
+        if z_min_candidate < z_min:
+            z_min = z_min_candidate
+


Consider utilizing the boolean operation from AccuSphHGeom to reduce branching

const MaskType<T> north_pole_candidate_mask = (face_z_max >= polar_cap_z); const MaskType<T> south_pole_candidate_mask = (face_z_min <= -polar_cap_z); const MaskType<T> local_mask = !(north_pole_candidate_mask | south_pole_candidate_mask);

Done in commit 002b9fe. The label computation now uses pure integer multiplication instead of boolean branching: label = north_pole_candidate * _FACE_LOC_NORTH_POLAR + (1 - north_pole_candidate) * south_pole_candidate * _FACE_LOC_SOUTH_POLAR. The unused local variable was also removed.

hongyuchen1030 · 2026-06-04T02:26:55Z

+
+
+@njit(cache=True)
+def _lon_bounds_from_vertices(face_vertices):


We probably don't need this work around anymore with the current latlon bounds implementation

This function is still needed and cannot be removed. UXarray uses a lon_min > lon_max encoding to signal antimeridian-crossing faces throughout the bounds and cross-section APIs — AccuSphGeom uses a union-of-intervals convention instead. The largest-gap algorithm translates between them. Removed the snap logic (which you flagged separately) but the antimeridian detection itself must stay. Added a docstring explaining this.

hongyuchen1030 · 2026-06-04T02:38:10Z



 @njit(parallel=True, nogil=True, cache=True)
 def constant_lat_intersections_no_extreme(lat, edge_node_z, n_edge):


Why we need to seperate the extreme case here, and what does the "extreme" mean here

constant_lat_intersections_no_extreme and constant_lon_intersections_no_extreme are pre-existing edge screeners — fast O(n) passes over all edges using only endpoint z/lon values, used by Grid.get_edges_at_constant_latitude/longitude before the expensive GCA kernel runs. 'No extreme' means arc interior extrema along the great circle are not considered. These are completely separate from the AccuSphGeom EFT stack and were not changed in this PR. Added a block comment before them to make this clear.

hongyuchen1030 · 2026-06-04T02:41:47Z

+    n2x = n2x_hi + n2x_lo
+    n2y = n2y_hi + n2y_lo
+    n2z = n2z_hi + n2z_lo
+    if (


The degeneracy check like this will greatly impact the vectorization performance, any check like this should be isolated from the intersection computation kernel. And the AccuXGCA itself is able to handle extremely short arcs

Done. Degeneracy checks (denom==0, planar_sq<0) are handled by letting NaN/inf propagate through the kernel — the status layer uses isfinite masks to classify them without any branching in the hot path. The _accux_gca and _accux_constlat kernels contain no if/else guards; only the 1.0/vn if vn!=0.0 else np.inf guard remains, which produces inf so the isfinite mask rejects the candidate without branching.

hongyuchen1030 · 2026-06-04T02:44:28Z

+        and math.isfinite(vn)
+    ):
+        # Parallel (coplanar) arcs: check whether endpoints of one lie on the other.
+        if on_minor_arc(v0, w0, w1):


Again, all these if-else branch will impact the vectorization behavior, that's why the original try_gca_gca_intersection only use mask/boolean operations. The point is, we do not try to prevent the NaN or Divide by zero in the intersection kernel, these errors will be recorded in the "status" so an outside dispatch function will know how to handle each case.

Done in commits 51c00c2 and 002b9fe. Both _try_gca_gca_intersection and _try_gca_const_lat_intersection now use integer mask arithmetic throughout — no if/else in the hot path. pos_fin = int(isfinite(...)), validity via pos_valid = pos_fin * pos_on_a * pos_on_b, point selection via pos_mask * pos + neg_mask * neg, status via both + none * 2. The only remaining guards wrap on_minor_arc calls to prevent calling it with inf inputs, which is unavoidable.

hongyuchen1030 · 2026-06-04T02:47:24Z

-    p1_intersects_gca = point_within_gca(p1, gca_cart[0], gca_cart[1])
-    p2_intersects_gca = point_within_gca(p2, gca_cart[0], gca_cart[1])
+    if planar_sq < 0.0:
+        return res


Isolate these if-else branch outside of the computation kernel

Done. All branching is now in L3 (gca_gca_intersection / gca_const_lat_intersection) — the dispatcher layer. L1 kernels (_accux_gca, _accux_constlat) are branch-free. L2 (_try_gca_gca_intersection, _try_gca_const_lat_intersection) uses integer mask arithmetic with no if/else in the computational path.

hongyuchen1030 · 2026-06-04T02:49:06Z

+    #     deduplication in the caller works correctly.  Matches Hongyu's suggestion
+    #     of mask-selection to snap after computing rather than branching out early.
+    _snap_sq = 1e-14  # distance² ≈ (1e-7)² — well above algorithm error (~1e-15)
+    for xe in (x1, x2):


Snapping involving branching should be isolated outside of the computation kernels. The intersection computation kernels should not include any branching and always stay in the SIMD-packed vectorization friendly form

Done. _snap_const_lat_endpoint lives entirely in L3 (gca_const_lat_intersection) — it is called after the kernel and status layer have completed, never inside _accux_constlat or _try_gca_const_lat_intersection. The L1 and L2 layers remain branch-free and kernel-pure.

hongyuchen1030 · 2026-06-04T02:59:18Z

+        res[0, 1] = p2[1]
+        res[0, 2] = p2[2]

    return res


The implementation in

include/accusphgeom/constructions/gca_constlat_intersection.hpp

was intentionally designed with three separate layers, and I think it is important to keep these layers conceptually and structurally separated.

Core computation kernel: accux_constlat

This is the numerical kernel. It contains the carefully designed AccuSphere algorithm for computing the great-circle-arc and constant-latitude intersection.

This layer should stay isolated and intact. It is designed to be SIMD-packing friendly, branch-free in the hot path, and easy to reason about for accuracy, performance, and future optimization. This function should not be mixed with high-level filtering, status interpretation, or UXarray-specific logic.

SIMD-friendly batch API: try_gca_constlat_intersection

This is the API intended for heavy computation.

It is still SIMD-packing friendly and is designed to compute the full matrix of possible intersection results from the input faces or edge lists. It does not immediately discard invalid intersections. Instead, it returns the computed candidate points together with a status flag indicating whether each result is valid.

This is the layer we want to use for large-scale vectorized computation, because it keeps the computation uniform. Even invalid candidates are part of the output matrix, and validity is represented by status instead of control flow.

Dispatcher / convenience API: gca_constlat_intersection

This is the higher-level user-facing dispatcher.

This layer is allowed to branch. It reads the status returned by try_gca_constlat_intersection, filters out invalid results, and returns only the valid intersection points. This is useful as a lightweight convenience API, especially when the caller wants clean geometry results instead of the full vectorized computation matrix.

The important point is that these three layers serve different purposes.

The core kernel should focus only on accurate numerical computation. The batch API should focus on uniform, SIMD-friendly execution. The dispatcher should handle branching, filtering, and convenience behavior.

Mixing these layers together is not ideal because it makes the heavy computation path harder to vectorize, harder to optimize, and harder to verify numerically. Once filtering, branching, and application-specific logic are pushed into the core kernel or SIMD batch layer, the implementation becomes less predictable and less suitable for large-scale computation.

For this kind of heavy numerical geometry kernel, the best practice is usually:

keep the low-level numerical kernel pure, isolated, and branch-minimized;

keep the batch/vectorized API uniform and status-based;

move branching, filtering, and user-facing convenience behavior to a separate dispatcher layer;

avoid mixing UXarray-specific data handling with the core computational algorithm.

So for UXarray integration, I think the preferred workflow should be:

Use try_gca_constlat_intersection for the main large-scale computation.

Preserve the full output matrix and status information during the vectorized computation stage.

Apply filtering only afterward, either through gca_constlat_intersection or through UXarray-side post-processing logic.

That way, we preserve the original AccuSphere design: accurate computation first, SIMD-friendly batch execution second, and lightweight branching/filtering only at the outer layer.

Understood — the key point is the design separation, not the C++ specifics. We've now split both intersection functions into the three layers you described:

_accux_constlat / _accux_gca — pure numerical kernels, no branching, no validity filtering, compensated operations in the exact AccuSphGeom order

_try_gca_const_lat_intersection / _try_gca_gca_intersection — batch/status layer using integer mask arithmetic (pos_valid * (1 - neg_valid)) to select candidates without branching in the hot path; returns (point, status, pos, neg)

gca_const_lat_intersection / gca_gca_intersection — outer dispatcher that reads status, applies endpoint snapping, and packages UXarray's NaN-filled result format; all UXarray-specific branching lives here

hongyuchen1030 · 2026-07-09T18:06:34Z

Thanks, that makes sense.

Just to confirm: what you are suggesting here is adding more rigorous UXarray-side verification for the FP64-vs-AccuX path, right?

For this PR we already added the AccuSphGeom-derived baseline regression tests in UXarray: 200 GCA-ConstLat cases, 31 GCA-GCA cases, and the point-in-polygon baseline cases. Those check numerical results against reference cases, but they do not yet do the same-body FP64-vs-AccuX API check you describe.

Before I implement that, would the following be sufficient for this PR?

add a direct FP64 GCA-ConstLat kernel using the equations from fp64_GCAconstLat.hh

add a FP64 try_gca_const_lat_intersection path with the same output/status shape as the AccuX try_ path

add an “AccuX wrapper with FP64 body” path: same status/output/dispatcher structure as the AccuX path, but with the internal math replaced by the FP64 kernel

add a UXarray dispatcher-level comparison where both paths return the same (2, 3) NaN-filled UXarray output shape

The acceptance check would be: direct FP64 and AccuX-wrapper-with-FP64-body produce matching outputs/status and similar timings; only then do we interpret real AccuX overhead.

Would that cover what you want for #1513, or do you also want this same-body check at a larger batched/multi-point API level in this PR?

Before proceeding, we should reiterate what issue we are handling here and what the purpose of this check is.

At this point, we are not mainly discussing accuracy. The existing regression tests are accuracy tests, and they are useful. The AccuSphGeom algorithm itself has already been extensively validated by the paper and the reference implementation.

What we need to verify in this PR is performance and implementation structure: whether the AccuX/EFT algorithm has been wired into UXarray correctly, without introducing extra engineering overhead through wrappers, dispatchers, status handling, masking, or output-shape handling.

There are two layers of checks we should carry out to verify the implementation structure and vectorization behavior.

First, we need to test whether the lower-level AccuX ConstLat API design itself introduces overhead. For this, we should implement a direct FP64 kernel using the equations from fp64_GCAconstLat.hh. Here, “kernel” means the minimal function that computes only the intersection result itself. Then we should compare:

direct FP64 kernel
AccuX ConstLat wrapper/API with its internal body temporarily replaced by the same FP64 kernel

If these two do not have similar performance, then the lower-level AccuX API structure itself is introducing overhead before we even discuss the real AccuX math.

Second, we need to test the higher-level try_gca_const_lat_intersection API path. This API includes additional logic such as status handling, mask selection, NaN-filled output shape, and dispatcher behavior. For this level, we should again use the same-body setup: replace the internal kernel with the FP64 body in both comparable paths, and verify that the UXarray-facing API route does not introduce extra overhead.

The point of the same-body FP64 tests is not to re-check numerical accuracy. It is to verify that the API wiring is apple-to-apple and that the implementation structure itself is not adding avoidable overhead.

Only after those same-body checks pass should we move to the actual implementation benchmark: direct FP64 implementation vs. real AccuX implementation. At that stage, we need a batched/multi-point UXarray API benchmark with a large enough input size to reach saturation. The expected behavior is that the AccuX-vs-FP64 ratio should shrink and then stabilize close to 1. If the ratio never shrinks, that likely means the Python/Numba compiler is not able to inline or optimize the current implementation structure well enough, and we need to fix that engineering issue.

Ultimately, what we need to confirm is that the AccuX algorithm is implemented inside UXarray in a way that preserves the original performance intent: branch-minimized, vectorization-friendly, and optimizable by the Python/Numba backend. This is independent of whether the implementation language is C++ or Python.

rajeeja · 2026-07-09T19:13:43Z

Ok, so I’ll add this as UXarray-side benchmark/verification scaffolding under benchmarks/, not as production/user-facing API.

Plan for the ConstLat path:

direct FP64 kernel using the equations from fp64_GCAconstLat.hh
same-body AccuX wrapper/status path using the FP64 body
dispatcher-level comparison with the same (2, 3) NaN-filled UXarray output shape
batched/multi-point benchmark comparing FP64, same-body wrapper, and real AccuX

Please confirm this is the right scope before I implement it.

hongyuchen1030 · 2026-07-10T20:04:49Z

Ok, so I’ll add this as UXarray-side benchmark/verification scaffolding under benchmarks/, not as production/user-facing API.

Plan for the ConstLat path:

direct FP64 kernel using the equations from fp64_GCAconstLat.hh

same-body AccuX wrapper/status path using the FP64 body

dispatcher-level comparison with the same (2, 3) NaN-filled UXarray output shape

batched/multi-point benchmark comparing FP64, same-body wrapper, and real AccuX

Please confirm this is the right scope before I implement it.

The goal here is not simply to implement a benchmark as a standalone task. The benchmark is a diagnostic tool for verifying whether the EFT/vectorized implementation has been wired correctly inside UXarray.

During the process of implementing and running the benchmark, we should be able to tell whether the API structure, wrapper path, and vectorization-friendly implementation are behaving as intended. If the benchmark results do not show the expected behavior, then that is an indication that there may still be an implementation or engineering-overhead issue that needs to be isolated and fixed.

So once the purpose and reasoning behind the benchmark are clear, it should be straightforward to proceed with the benchmark implementation and use the results to verify whether the UXarray AccuX path is actually wired correctly.

…l heap allocations

…/accusphere

rajeeja · 2026-07-11T04:26:40Z

Closed out the verification on this. I added `benchmarks/geometry_samebody.py` — a same-body diagnostic that rebuilds the L1/L2/L3 stack with the direct FP64 body from `fp64_GCAconstLat.hh`, so the only variable is the kernel body and we can separate EFT math cost from plumbing. I also built the AccuSphGeom C++ kernels with `-O3 -march=native` and fed them the identical 200 cases for a direct cross-language check (200k-point batch, single-thread, best-of-7).

kernel	C++ (clang -O3)	Python (Numba)	ratio
FP64	2.4 ns	3.9 ns	1.6x
AccuX (EFT)	29.4 ns	28.3 ns	0.96x

The Numba AccuX kernel is at parity with C++ — both toolchains lower the compensated arithmetic through LLVM to effectively the same code, so this settles it: the EFT path is wired in correctly with no measurable engineering overhead in the kernel. Correctness held at 0 status mismatches, max diff 2.5e-15. The diagnostic also showed the remaining per-call cost was heap allocation in the dispatcher, not math — 5 temporary NumPy arrays per call just to move scalars between layers — so I scalarized `gca_const_lat_intersection` (public `(2,3)` output unchanged, intermediates kept in registers), taking it from 232 to 139 ns/call, 5 allocations down to 1, output byte-identical across 502 cases including the snap and dual-intersection branches, with the full baseline suite plus zonal/integrate tests passing. I'm treating the verification as complete for this PR — kernel-level parity plus same-body isolation is the apples-to-apples check we needed; the C++ SIMD/OpenMP dispatcher is a separate regime and out of scope here.

hongyuchen1030 · 2026-07-13T21:27:32Z

Closed out the verification on this. I added benchmarks/geometry_samebody.py — a same-body diagnostic that rebuilds the L1/L2/L3 stack with the direct FP64 body from fp64_GCAconstLat.hh, so the only variable is the kernel body and we can separate EFT math cost from plumbing. I also built the AccuSphGeom C++ kernels with -O3 -march=native and fed them the identical 200 cases for a direct cross-language check (200k-point batch, single-thread, best-of-7).

kernel C++ (clang -O3) Python (Numba) ratio
FP64 2.4 ns 3.9 ns 1.6x
AccuX (EFT) 29.4 ns 28.3 ns 0.96x
The Numba AccuX kernel is at parity with C++ — both toolchains lower the compensated arithmetic through LLVM to effectively the same code, so this settles it: the EFT path is wired in correctly with no measurable engineering overhead in the kernel. Correctness held at 0 status mismatches, max diff 2.5e-15. The diagnostic also showed the remaining per-call cost was heap allocation in the dispatcher, not math — 5 temporary NumPy arrays per call just to move scalars between layers — so I scalarized gca_const_lat_intersection (public (2,3) output unchanged, intermediates kept in registers), taking it from 232 to 139 ns/call, 5 allocations down to 1, output byte-identical across 502 cases including the snap and dual-intersection branches, with the full baseline suite plus zonal/integrate tests passing. I'm treating the verification as complete for this PR — kernel-level parity plus same-body isolation is the apples-to-apples check we needed; the C++ SIMD/OpenMP dispatcher is a separate regime and out of scope here.

Thanks for the update. I do not think we should treat this verification as complete yet.

Again, the 200 cases are primarily accuracy/regression cases. They are useful for checking correctness, but they are not sufficient as a performance benchmark dataset.

Also, reporting a best-of-7 timing is not enough for this kind of benchmark. We need to show stability across repeated runs, not just the best observed timing.

And what does "scalarized `gca_const_lat_intersection" mean here?

And what does "byte-identical across 502 cases including the snap and dual-intersection branches" have to do with the performance benchmark here? Are you trying to test the accuracy or the performance?

rajeeja · 2026-07-13T23:43:46Z

Two direct answers first.

"Scalarized gca_const_lat_intersection" means the dispatcher's internal layers now pass individual float components instead of np.empty(3) arrays, so the per-call intermediate heap allocations are gone. The public (2,3) output shape is unchanged.

On the byte-identical check — that is an accuracy check, not a performance one. It only verifies that the scalarization above did not change any output; it is the correctness guard for that refactor and has nothing to do with the timing numbers.

Performance is not a blocker here. The EFT path performs as expected and is at parity with the C++ reference.

Results, with a broader input set (47,316 unique geometries across generic, near-tangent, and near-pole regimes — not the accuracy cases) and mean ± std over 30 independent runs instead of best-of-N:

kernel	C++ (clang -O3)	Python (Numba)
FP64	2.41 ± 0.13 ns	4.00 ± 0.13 ns
AccuX (EFT)	29.49 ± 0.68 ns	28.44 ± 0.63 ns

Coefficient of variation ~2% on AccuX both sides, so these are stable across runs, not a single lucky timing. Python AccuX matches the C++ reference.

Code changes: the scalarized dispatcher (heap allocations removed, output unchanged) and the benchmark now imports its kernels inside setup() so an environment mismatch cannot abort collection.

@cmdupuis3

Move the uxarray kernel imports from module level into each benchmark class's setup, so a stale or mismatched environment build only errors the affected benchmark instead of aborting collection of every benchmark in the directory. Thanks @cmdupuis3 for catching the asv import failure.

cmdupuis3 · 2026-07-14T21:07:54Z

Alright, so I have a stack of things I want to investigate with this PR, but the PR as-is seems pretty long. Some of them (like the integrate.py stuff #1570 ) are technically out of scope in my opinion, so I think my idea of the goal here is to do performance optimization only on the ported code rather than the peripheral stuff. That should get us closer to merging it, but I wanted to get clarity about what the win condition is for this specific PR.

@rajeeja I can take the PR off your hands for now and let you know when I'm satisfied with whatever performance gains are possible within that scope.

rajeeja · 2026-07-15T16:59:27Z

@cmdupuis3 thanks — I'll keep driving this one. Please flag anything you find (perf, scope, the EFT/dispatcher alignment vs the AccuSphGeom reference) as review comments and I'll work through them. Happy to pair on the parts you're digging into, but I'll own the changes and keep the PR moving.

* Honor quadrature kwargs in calculate_total_face_area calculate_total_face_area accepted quadrature_rule, order and latitude_adjusted_area but ignored them, always returning the cached default-parameter face_areas. As a result the gaussian/corrected path produced the same total as the triangular one. Keep the cached fast path for the default parameters (which also preserves the equal-area values used for HEALPix grids) and route any non-default quadrature settings through _compute_face_areas so the requested rule, order and latitude adjustment actually take effect. * Restore matplotlib backend after HoloViews matplotlib plot (#1538) * Restore matplotlib backend after HoloViews matplotlib plot plot(backend='matplotlib') calls hv.extension('matplotlib'), which switches the active matplotlib backend and clobbers the IPython inline display hook, silently breaking subsequent native matplotlib/xarray .plot() calls. Restore the original matplotlib backend right after the HoloViews extension switch; HoloViews objects still display via Store.current_backend, so this is safe. Closes #1537 * Address review: capture backend at switch, accurate docstring, effective test * Reconfigure IPython inline display hook when restoring backend mpl.use() restores the matplotlib backend name but does not re-register IPython's inline display integration that hv.extension('matplotlib') clobbers. In a Jupyter kernel without an explicit %matplotlib inline, native matplotlib/xarray .plot() calls after a uxarray matplotlib plot still failed to render. Re-run configure_inline_support when the restored backend is inline so the display hook is reinstated. See #1537. * Restore matplotlib backend via IPython shell reactivation to fix inline display --------- Co-authored-by: Orhan Eroglu <32553057+erogluorhan@users.noreply.github.com> * Allow SCRIP reader to respect units w/r/t radians (#1433) * Allow SCRIP reader to respect units wrt to radians * Add test file and unit test for SCRIP radian coordinate handling * minor formatting changes for scrip radians fix Moves meshfiles/scrip/scrip_radians.nc to meshfiles/scrip/scrip_radians/scrip_radians_grid.nc to match style of other meshfiles naming schemes. Renames the new _scrip._convert_to_degrees() to _scrip._values_in_degrees(). It doesn't always convert; and it also does more than just converting, because it gives numpy array from DataArray. Clarified docstring. ran the following, so it should now pass ruff checks: pre-commit run --all-files --------- Co-authored-by: Sam Evans <s7evans11@gmail.com> Co-authored-by: Sam Evans <47793072+Sevans711@users.noreply.github.com> Co-authored-by: Rajeev Jain <rajeeja@gmail.com> * Support Python 3.14 and upgrade YAC to v3.18 (with DNN remapping) (#1563) * Upgrade YAC to v3.18, expose DNN remapping, drop pathlib backport - upgrade YAC CI to v3.18.0 on Python 3.14 (numba>=0.63, py3.14 classifier, cython>=3.1 via conda for the bindings build) - fix add_average: reduction_type -> weight_type (renamed in YAC v3.18) - expose distance-nearest-neighbour (yac_method='dnn', new in YAC v3.15) + test - remove the obsolete 'pathlib' backport dependency: it shadows the stdlib and breaks tools that import pathlib on Python 3.10+ (broke YAC's Cython build) * Reference #1561 in numba version pin comment --------- Co-authored-by: Christopher Dupuis <45972964+cmdupuis3@users.noreply.github.com> * Bump actions/download-artifact in the actions group across 1 directory (#1572) Bumps the actions group with 1 update in the / directory: [actions/download-artifact](https://github.com/actions/download-artifact). Updates `actions/download-artifact` from 7 to 8 - [Release notes](https://github.com/actions/download-artifact/releases) - [Commits](actions/download-artifact@v7...v8) --- updated-dependencies: - dependency-name: actions/download-artifact dependency-version: '8' dependency-type: direct:production update-type: version-update:semver-major dependency-group: actions ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * [pre-commit.ci] pre-commit autoupdate (#1565) updates: - [github.com/astral-sh/ruff-pre-commit: v0.15.20 → v0.15.21](astral-sh/ruff-pre-commit@v0.15.20...v0.15.21) Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * Accusphere: gca_gca benchmarks and better arc sampling * Accusphere: fix sum_of_squares accuracy bug + tests * Accusphere: numba fma refactor and fix validation tautology * Accusphere: scalarized _counts_as_crossing --------- Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: vakudo <127726617+vakudo@users.noreply.github.com> Co-authored-by: Sam Evans <47793072+Sevans711@users.noreply.github.com> Co-authored-by: Rajeev Jain <rajeeja@gmail.com> Co-authored-by: Orhan Eroglu <32553057+erogluorhan@users.noreply.github.com> Co-authored-by: zarzycki <colin.zarzycki@gmail.com> Co-authored-by: Sam Evans <s7evans11@gmail.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

cmdupuis3 · 2026-07-16T22:54:19Z

Thanks @rajeeja for merging, here's some much needed context about my changes:

Bugfix: _sum_sq_c2/c3 computed Σh²+Σl² where C++ computes Σh²+2Σh·l, dropping the dominant cross term: median relative error 6.6e-16 → 3.2e-31, now bit-for-bit identical to C++ (800000/800000). The 5e-15 baseline never caught it.
Bugfix: _validate_fma's predicate was a tautology — (pf+ef) != (pv+ev) rounds back to pf != pv — so it could never fail. A maximally-broken FMA passed 2000/2000; the fix rejects it at sample 1, making the Veltkamp fallback reachable for the first time.
Benchmarks: gca_gca benchmarks (Accusphere and FP64 equivalent); better sampling algorithm for both samebody testing modules
Optimization: _sum_sq_c2/c3 routines replaced with variadic _sum_of_squares, mirroring the original. Marginal speedup, but cleaner.
Scalarized _counts_as_crossings in point_in_face.py, for a ~10% speedup locally
const_lat EFT fusions for a ~8% speedup locally

rajeeja · 2026-07-16T22:57:24Z

@hongyuchen1030 Re-ran the thread scaling and capped it at 8 threads on purpose — this is an M1 Max, which has 8 performance cores and 2 efficiency cores. My earlier run went to 10 and the last two points looked off (times crept back up, ratio dropped); that was just the loop spilling onto the slow efficiency cores, not the kernels. Capping at the 8 P-cores gives the clean picture.

benchmarks/thread_scaling_constlat.py drives the real UXarray gca_const_lat_intersection dispatcher vs a same-body FP64 dispatcher over a real mesh's edges.

threads	fp64 ns	accux ns	accux/fp64
1	64.5	105.2	1.63
2	32.3	54.1	1.68
4	16.6	28.3	1.70
8	8.6	14.1	1.65

Both scale ~7.5x from 1 to 8 threads and the AccuX/FP64 ratio stays flat around 1.65, so the compensated path parallelizes just like FP64 with no scaling penalty. (outCSne30, 10,800 edges x 40 latitudes, best-of-7.)

hongyuchen1030 · 2026-07-16T23:24:51Z

@hongyuchen1030 Re-ran the thread scaling and capped it at 8 threads on purpose — this is an M1 Max, which has 8 performance cores and 2 efficiency cores. My earlier run went to 10 and the last two points looked off (times crept back up, ratio dropped); that was just the loop spilling onto the slow efficiency cores, not the kernels. Capping at the 8 P-cores gives the clean picture.

benchmarks/thread_scaling_constlat.py drives the real UXarray gca_const_lat_intersection dispatcher vs a same-body FP64 dispatcher over a real mesh's edges.

threads fp64 ns accux ns accux/fp64
1 64.5 105.2 1.63
2 32.3 54.1 1.68
4 16.6 28.3 1.70
8 8.6 14.1 1.65
Both scale ~7.5x from 1 to 8 threads and the AccuX/FP64 ratio stays flat around 1.65, so the compensated path parallelizes just like FP64 with no scaling penalty. (outCSne30, 10,800 edges x 40 latitudes, best-of-7.)

The fact that the two lines look parallel to each other in the plot1 looks concerning here, while it might not overlap, but we should be able to observe a decreasing gap between these two. Please see the following plot from https://github.com/hongyuchen1030/AccuSphGeom/tree/main/tests/performance_test for reference.

Let's only focus one the vec-width = 1 here to match your case:
1 thread: AccuX/fp64 = 19.33/2.41 = 8.0
2 threads: AccuX/fp64 = 9.71/1.22 = 7.9
4 threads: AccuX/fp64 = 4.86/1.03 = 4.71
8 threads: AccuX/fp64 = 2.48/1.04 = 2.38
16 threads: AccuX/fp64 = 1.44/1.04 = 1.38

As the threads count increase, the ratio should approach to 1 if the inputs are saturated and the implementation is correct.

And again, what does the "best-of-7" mean here? Are you trying to "cherry-pick" from the performance test again?

rajeeja · 2026-07-17T00:09:23Z

@hongyuchen1030 At the bare-kernel level — the analog of your benchmark_gca_constlat_SIMDPack.cpp (kernels L232/L247, thread loop omp_set_num_threads L361, median L431) — the AccuX/FP64 ratio shows the decreasing trend you expected, matching the shape of your Fig 6:

threads	fp64 ns	accux ns	ratio
1	2.67	12.86	4.82
2	1.37	6.54	4.76
4	0.88	3.25	3.67
8	0.64	1.84	2.85

At equal thread count this tracks your numbers closely — your Fig 6 is 2.38 at 8 threads, ours is 2.85. Your ratio only reaches ~1.4 at 16 threads, which this machine can't run: it is an M1 Max with 8 performance cores (the 2 efficiency cores distort the loop, so I cap at 8). N=32M to saturate, median (not best-of-N).

Could you run our Python bench on the same architecture you used for Fig 6? Then we would have a true same-hardware comparison out to 16 threads and can confirm the ratio converges the same way. Code: python/bench_threads_py.py (you have access).

The earlier flat plot was the full gca_const_lat_intersection dispatcher (benchmarks/thread_scaling_constlat.py), where shared plumbing — array allocation, validity masks, endpoint snapping — dilutes the ratio. The kernel comparison above is the apples-to-apples match to your figure.

rajeeja · 2026-07-17T00:19:48Z

Re-ran at your 2M working set (kDefaultDataSize, L33); code: python/bench_threads_py.py.

threads	fp64 ns	accux ns	ratio
1	2.70	12.87	4.77
2	1.42	6.47	4.55
4	0.82	3.30	4.04
8	0.80	2.86	3.58

At 2M our FP64 flatlines 4→8 (0.82 → 0.80), same as your Fig 6 (1.03/1.04) — both hit the bandwidth wall. Ratio decreases; caps at 8 (M1 Max, 8 P-cores). My earlier 32M numbers hid this since the larger set kept FP64 scaling.

rajeeja requested a review from hongyuchen1030 May 22, 2026 13:24

hongyuchen1030 reviewed May 22, 2026

View reviewed changes

hongyuchen1030 requested changes May 22, 2026

View reviewed changes

rajeeja marked this pull request as draft May 22, 2026 21:18

rajeeja marked this pull request as ready for review May 28, 2026 19:35

rajeeja requested a review from hongyuchen1030 May 28, 2026 19:35

hongyuchen1030 reviewed Jun 4, 2026

View reviewed changes

rajeeja added this to UXarray Development Jul 8, 2026

Merge branch 'main' into rajeeja/accusphere

273f968

rajeeja added 3 commits July 10, 2026 23:23

Scalarize constant-latitude intersection dispatcher to remove per-cal…

dba0206

…l heap allocations

Merge remote-tracking branch 'origin/main' into rajeeja/accusphere

4366e02

Merge remote-tracking branch 'origin/rajeeja/accusphere' into rajeeja…

ffcc91e

…/accusphere

Merge branch 'main' into rajeeja/accusphere

64885f6

rajeeja mentioned this pull request Jul 14, 2026

Compute face_node_angles from uxarray.Grid? #1566

Open

cmdupuis3 mentioned this pull request Jul 16, 2026

Cmd/accusphere #1579

Merged

14 tasks

cmdupuis3 and others added 3 commits July 16, 2026 17:22

Merge remote-tracking branch 'origin/main' into rajeeja/accusphere

75f06c0

Add thread-scaling benchmark for FP64 vs AccuX constlat dispatcher

0d131d7

Cap thread-scaling sweep at performance cores to avoid E-core artifact

b9f5bb3

rajeeja added 2 commits July 16, 2026 17:59

Fix thread-scaling plot ticks to show only 1,2,4,8

f61555b

o Fix pre-commit

146672c

rajeeja added 2 commits July 16, 2026 18:39

o doc fixes

e9b394f

o Remove benchmark image from repo

a7d8e88



		@njit(cache=True)
		def _generate_lat_lon_bounds_local(face_vertices, z_min, z_max, snap_tol_deg):



		@njit(cache=True)
		def _generate_lat_lon_bounds_pole(face_vertices, label, z_min, z_max, snap_tol_deg):



		@njit(cache=True)
		def _face_location_info(face_vertices, polar_cap_z):

		return gca_gca_intersection(gca_a_xyz, gca_b_xyz)


		@njit(cache=True)



		@njit(cache=True)
		def gca_const_lat_intersection(gca_cart, const_z):



		@njit(cache=True)
		def gca_gca_intersection(gca_a_xyz, gca_b_xyz):



		@njit(cache=True)
		def _lon_bounds_from_vertices(face_vertices):



		@njit(parallel=True, nogil=True, cache=True)
		def constant_lat_intersections_no_extreme(lat, edge_node_z, n_edge):

Uh oh!

Conversation

rajeeja commented May 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What changed

What this is not

Performance

Accuracy

Tests

Uh oh!

review-notebook-app Bot commented May 22, 2026

Uh oh!

hongyuchen1030 May 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

hongyuchen1030 Jun 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

hongyuchen1030 May 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

hongyuchen1030 May 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

rajeeja commented May 22, 2026 •

edited

Loading

hongyuchen1030 May 22, 2026 •

edited

Loading

hongyuchen1030 Jun 4, 2026 •

edited

Loading

hongyuchen1030 May 22, 2026 •

edited

Loading

hongyuchen1030 May 22, 2026 •

edited

Loading

hongyuchen1030 Jun 4, 2026 •

edited

Loading