Commit 3da31da

arena: rename SIMD-align knob, expose as TA CMake var, drop default 64 -> 32

Rename the in-arena element-storage alignment knob to TA_ARENATENSOR_SIMD_ALIGN (matching the TA_ prefix used by every other TA CMake option/var), and wire it through the same CMake -> config.h.in -> header pipeline as TA_MAX_SOO_RANK_METADATA. The previous TILEDARRAY_INNER_SIMD_ALIGN was a header-only #ifndef/#define knob with no CMake surface; the new form is a proper cache variable, documented in INSTALL.md.

Drop the default from 64 B to 32 B: 32 B covers AVX2 YMM loads/stores (the most common x86_64 SIMD target today) and shaves 32 B/cell off the in-arena padding. AVX-512 builds that want a wider floor are one `-DTA_ARENATENSOR_SIMD_ALIGN=64` away. The doc comment and INSTALL.md entry also call out the NEON / Apple-Silicon options (16 / 128).

No backward-compatible alias for the old macro/constant names -- there are no external users yet.

1 parent d327b82 commit 3da31da

6 files changed

Lines changed: 37 additions & 21 deletions

CMakeLists.txt

Lines changed: 6 additions & 0 deletions
@@ -159,6 +159,12 @@ add_feature_info(SIGNED_1INDEX_TYPE TA_SIGNED_1INDEX_TYPE "Use of signed 1-index
 # define this as needed
 set(TA_MAX_SOO_RANK_METADATA 8 CACHE STRING "Specifies the max rank for which small object optimization will be used (hence, heap use avoided) for metadata objects")

+# Alignment, in bytes, of element storage inside an ArenaTensor cell. Must be a
+# power of two and at least sizeof(void*). Default 32 covers AVX2 YMM; raise to
+# 64 for AVX-512, drop to 16 for NEON-only / Apple Silicon, raise to 128 for an
+# Apple-Silicon L1 cache-line floor. See src/TiledArray/tensor/arena_tensor.h.
+set(TA_ARENATENSOR_SIMD_ALIGN 32 CACHE STRING "Alignment (B) of in-arena element storage for ArenaTensor; power of two, default 32 covers AVX2 (set 64 for AVX-512, 16 for NEON-only, 128 for Apple-Silicon cache-line floor)")
+
 option(TA_TRACE_TASKS "Enable debug tracing of MADNESS tasks in (some components of) TiledArray" OFF)
 add_feature_info(TASK_TRACE_DEBUG TA_TRACE_TASKS "Debug tracing of MADNESS tasks in (some components of) TiledArray")
 set(TILEDARRAY_ENABLE_TASK_DEBUG_TRACE ${TA_TRACE_TASKS})

INSTALL.md

Lines changed: 1 addition & 0 deletions
@@ -431,6 +431,7 @@ support may be added.
 * `TA_WERROR` -- Set to `ON` to treat compiler warnings as errors when compiling TiledArray's own translation units (the `tiledarray` library and in-tree tests/examples). Also implies `MADNESS_WERROR=ON` for the MADworld translation units built as part of TA's FetchContent tree. Does **not** propagate to consumers of the installed `tiledarray` target (i.e. `find_package(tiledarray)` users do not inherit `-Werror`). Honored on GNU/Clang/AppleClang/IntelLLVM. [Default=OFF].
 * `TA_SIGNED_1INDEX_TYPE` -- Set to `OFF` to use unsigned 1-index coordinate type (default for TiledArray 1.0.0-alpha.2 and older). The default is `ON`, which enables the use of negative indices in coordinates.
 * `TA_MAX_SOO_RANK_METADATA` -- Specifies the maximum rank for which to use Small Object Optimization (hence, avoid the use of the heap) for metadata. The default is `8`.
+* `TA_ARENATENSOR_SIMD_ALIGN` -- Alignment (in bytes) of element storage inside an `ArenaTensor` cell. Must be a power of two. The default is `32` (covers AVX2 YMM). Set to `64` for AVX-512 ZMM (also matches the x86_64 cache line), `16` for NEON-only / Apple Silicon (NEON has no wider register and Apple Silicon does not implement SVE), or `128` for a two-cache-line / Apple-Silicon L1-line floor. Each `ArenaTensor` cell pads from `sizeof(Cell)` up to this alignment before its element storage, so lowering the value cuts per-cell padding at the cost of narrower vectorized loads/stores.
 * `TA_TENSOR_MEM_PROFILE` -- Set to `ON` to profile host memory allocations used by TA::Tensor. This causes the use of Umpire for host memory allocation. This also enables additional tracing facilities provided by Umpire; these can be controlled via [environment variable `UMPIRE_LOG_LEVEL`](https://umpire.readthedocs.io/en/develop/sphinx/features/logging_and_replay.html), but note that the default is to log Umpire info into a file rather than stdout.
 * `TA_TENSOR_MEM_TRACE` -- Set to `ON` to *trace* host memory allocations used by TA::Tensor. This turns on support for tracking memory used by `Tensor` objects; such tracking must be enabled programmatically. This can greatly increase memory consumption by the application and is only intended for expert developers troubleshooting memory use by TiledArray.
 * `TA_UT_CTEST_TIMEOUT` -- The value (in seconds) of the timeout to use for running the TA unit tests via CTest when building the `check`/`check-tiledarray` targets. The default timeout is 1500s.
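
The padding trade-off described in the `TA_ARENATENSOR_SIMD_ALIGN` entry can be made concrete with a little arithmetic. The following is a minimal sketch, not TiledArray code: `round_up` and `header_padding` are hypothetical helpers, the header sizes (24 and 40 bytes) are assumed stand-ins for `sizeof(Cell)`, and the cell itself is assumed to start on an address that is a multiple of the alignment.

```cpp
#include <cstddef>

// Hypothetical helper: round `n` up to the next multiple of `align`.
// `align` must be a nonzero power of two.
constexpr std::size_t round_up(std::size_t n, std::size_t align) {
  return (n + align - 1) & ~(align - 1);
}

// Bytes of padding between a cell header of `header_bytes` and element
// storage that must start on an `align`-byte boundary.
constexpr std::size_t header_padding(std::size_t header_bytes,
                                     std::size_t align) {
  return round_up(header_bytes, align) - header_bytes;
}

// Assumed 24-byte header: 64 B alignment pads 40 B, 32 B alignment pads 8 B.
static_assert(header_padding(24, 64) == 40);
static_assert(header_padding(24, 32) == 8);

// Assumed 40-byte header: both settings round up to 64, so no saving.
static_assert(header_padding(40, 64) == 24);
static_assert(header_padding(40, 32) == 24);
```

Under these assumptions the saving from lowering the value is either 32 B or nothing, depending on where the header size falls relative to the two boundaries.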

src/TiledArray/config.h.in

Lines changed: 3 additions & 0 deletions
@@ -172,6 +172,9 @@
 /* Specifies the rank for up to which to use Small Object Optimization for metadata (e.g. Range, Range::index, etc.) */
 #cmakedefine TA_MAX_SOO_RANK_METADATA @TA_MAX_SOO_RANK_METADATA@

+/* Alignment (in bytes) of element storage inside an ArenaTensor cell; see src/TiledArray/tensor/arena_tensor.h */
+#cmakedefine TA_ARENATENSOR_SIMD_ALIGN @TA_ARENATENSOR_SIMD_ALIGN@
+
 /* Enables tracing MADNESS tasks in TiledArray */
 #cmakedefine TILEDARRAY_ENABLE_TASK_DEBUG_TRACE 1

src/TiledArray/tensor/arena_tensor.h

Lines changed: 19 additions & 13 deletions
@@ -26,18 +26,23 @@

 namespace TiledArray {

-/// Alignment of in-arena element storage, in bytes. Sized to cover the
-/// widest common SIMD register (AVX-512 ZMM = 64 B) and a single x86_64
-/// cache line. Override at configure time by defining
-/// TILEDARRAY_INNER_SIMD_ALIGN to a larger power-of-two (e.g. 128 for
-/// two-cache-line floor / Apple-Silicon L1 line size).
-#ifndef TILEDARRAY_INNER_SIMD_ALIGN
-#define TILEDARRAY_INNER_SIMD_ALIGN 64
-#endif
-
-inline constexpr std::size_t kInnerSimdAlign = TILEDARRAY_INNER_SIMD_ALIGN;
-static_assert((kInnerSimdAlign & (kInnerSimdAlign - 1)) == 0,
-              "kInnerSimdAlign must be a power of two");
+/// Alignment of in-arena element storage, in bytes. Supplied via CMake
+/// (cache variable `TA_ARENATENSOR_SIMD_ALIGN`, propagated through
+/// `TiledArray/config.h`). The default (32 B) matches the SSE/AVX/AVX2
+/// family — AVX2's 256-bit YMM registers being the most common x86_64
+/// SIMD target today. Override at configure time with
+/// `-DTA_ARENATENSOR_SIMD_ALIGN=<N>` for another power of two:
+/// - 64 for AVX-512 ZMM (also matches an x86_64 cache line);
+/// - 16 for NEON-only targets — NEON has no wider register (Apple
+/// Silicon does not implement SVE), so 16 is sufficient there;
+/// - 128 for a two-cache-line / Apple-Silicon L1-line floor (useful
+/// only if cells need that as a false-sharing boundary).
+/// Each ArenaTensor cell pads from `sizeof(Cell)` up to this alignment
+/// before its element storage, so lowering the value cuts per-cell
+/// padding at the cost of narrower vectorized loads/stores.
+inline constexpr std::size_t kArenaTensorSimdAlign = TA_ARENATENSOR_SIMD_ALIGN;
+static_assert((kArenaTensorSimdAlign & (kArenaTensorSimdAlign - 1)) == 0,
+              "TA_ARENATENSOR_SIMD_ALIGN must be a power of two");

 template <typename T, typename Range_ = ::btas::zb::RangeNd<>>
 class ArenaTensor;
@@ -79,7 +84,8 @@ class ArenaTensor {
   /// arena slots must honour this so SIMD loads/stores on `data()` are
   /// aligned without an extra runtime check.
   static constexpr size_type data_alignment() noexcept {
-    return alignof(T) > kInnerSimdAlign ? alignof(T) : kInnerSimdAlign;
+    return alignof(T) > kArenaTensorSimdAlign ? alignof(T)
+                                              : kArenaTensorSimdAlign;
   }

   /// Offset (in bytes) of the first element past the cell header.
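
The CMake comment asks for a power of two that is at least `sizeof(void*)`, while the `static_assert` in this diff tests only the bit pattern, which zero also satisfies. A stricter compile-time check could look like the sketch below. It is an illustration only: `is_valid_simd_align`, `kSimdAlign`, `data_alignment_for` and `Wide` are hypothetical names, and the `#ifndef` fallback exists solely so the snippet compiles without `TiledArray/config.h`.

```cpp
#include <cstddef>

// Stand-in for the value that TiledArray/config.h would supply; defined
// here only so the sketch compiles on its own.
#ifndef TA_ARENATENSOR_SIMD_ALIGN
#define TA_ARENATENSOR_SIMD_ALIGN 32
#endif

// Hypothetical predicate: nonzero power of two, at least pointer-sized.
constexpr bool is_valid_simd_align(std::size_t a) {
  return a != 0 && (a & (a - 1)) == 0 && a >= sizeof(void*);
}

inline constexpr std::size_t kSimdAlign = TA_ARENATENSOR_SIMD_ALIGN;
static_assert(is_valid_simd_align(kSimdAlign),
              "alignment must be a power of two >= sizeof(void*)");

// Mirrors the max(alignof(T), knob) rule of data_alignment() in the diff.
template <typename T>
constexpr std::size_t data_alignment_for() {
  return alignof(T) > kSimdAlign ? alignof(T) : kSimdAlign;
}

// An over-aligned element type, to show that alignof(T) can win.
struct alignas(128) Wide {
  char bytes[128];
};

static_assert(data_alignment_for<double>() == kSimdAlign);
static_assert(data_alignment_for<Wide>() == 128);
```

With the default of 32, `double` elements get the knob value, while the over-aligned `Wide` type keeps its own 128-byte requirement.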

tests/arena_tensor.cpp

Lines changed: 4 additions & 4 deletions
@@ -69,16 +69,16 @@ BOOST_AUTO_TEST_CASE(sizeof_invariant_across_range_parameterizations) {
 }

 BOOST_AUTO_TEST_CASE(element_data_is_simd_aligned) {
-  // data_alignment() should be at least kInnerSimdAlign; cell_alignment()
+  // data_alignment() should be at least kArenaTensorSimdAlign; cell_alignment()
   // should propagate that so the element pointer is SIMD-aligned.
-  BOOST_CHECK(Inner::data_alignment() >= TA::kInnerSimdAlign);
-  BOOST_CHECK_EQUAL(Inner::data_alignment() % TA::kInnerSimdAlign, 0u);
+  BOOST_CHECK(Inner::data_alignment() >= TA::kArenaTensorSimdAlign);
+  BOOST_CHECK_EQUAL(Inner::data_alignment() % TA::kArenaTensorSimdAlign, 0u);
   BOOST_CHECK(Inner::cell_alignment() >= Inner::data_alignment());
   CellBuf buf(8);
   Inner x =
       TA::detail::make_arena_tensor_in<double>(buf.aligned_ptr, TA::Range{8});
   auto addr = reinterpret_cast<std::uintptr_t>(x.data());
-  BOOST_CHECK_EQUAL(addr % TA::kInnerSimdAlign, 0u);
+  BOOST_CHECK_EQUAL(addr % TA::kArenaTensorSimdAlign, 0u);
 }

 BOOST_AUTO_TEST_CASE(default_constructed_is_null) {
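
The address-modulo check used in this test can be tried in isolation. The sketch below assumes C++17 aligned `operator new` and does not use TiledArray's `CellBuf` or `make_arena_tensor_in`; `is_aligned` and `aligned_buffer_roundtrip` are hypothetical helpers.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <new>

// Hypothetical helper: true if address `p` is a multiple of `align` bytes.
inline bool is_aligned(const void* p, std::size_t align) {
  assert(align != 0);
  return reinterpret_cast<std::uintptr_t>(p) % align == 0;
}

// Allocate room for `n` doubles on an `align`-byte boundary with C++17
// aligned new, check the returned address, then release the buffer.
// `align` must be a power of two.
inline bool aligned_buffer_roundtrip(std::size_t n, std::size_t align) {
  void* raw = ::operator new(n * sizeof(double), std::align_val_t{align});
  const bool ok = is_aligned(raw, align);
  ::operator delete(raw, std::align_val_t{align});
  return ok;
}
```

Calling `aligned_buffer_roundtrip(8, 32)` mirrors the shape of the test above: eight `double` elements whose first address is expected to be a multiple of the configured alignment.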

tests/arena_tensor_kernels.cpp

Lines changed: 4 additions & 4 deletions
@@ -31,7 +31,7 @@ BOOST_AUTO_TEST_CASE(builds_outer_with_uniform_inners) {
     BOOST_CHECK(bool(inner));
     BOOST_CHECK_EQUAL(inner.size(), 8u);
     auto addr = reinterpret_cast<std::uintptr_t>(inner.data());
-    BOOST_CHECK_EQUAL(addr % TA::kInnerSimdAlign, 0u);
+    BOOST_CHECK_EQUAL(addr % TA::kArenaTensorSimdAlign, 0u);
   }
 }

@@ -106,7 +106,7 @@ BOOST_AUTO_TEST_CASE(jagged_inner_shapes_round_trip) {
       BOOST_REQUIRE(bool(inner));
       BOOST_CHECK_EQUAL(inner.size(), static_cast<std::size_t>(sizes[ord]));
       auto addr = reinterpret_cast<std::uintptr_t>(inner.data());
-      BOOST_CHECK_EQUAL(addr % TA::kInnerSimdAlign, 0u);
+      BOOST_CHECK_EQUAL(addr % TA::kArenaTensorSimdAlign, 0u);
     }
   }
 }
@@ -320,7 +320,7 @@ BOOST_AUTO_TEST_CASE(contraction_arena_plan_reserve_and_construct_inner) {
     BOOST_REQUIRE(bool(inner));
     BOOST_CHECK_EQUAL(inner.size(), 24u);
     auto addr = reinterpret_cast<std::uintptr_t>(inner.data());
-    BOOST_CHECK_EQUAL(addr % TA::kInnerSimdAlign, 0u);
+    BOOST_CHECK_EQUAL(addr % TA::kArenaTensorSimdAlign, 0u);
   }
 }

@@ -441,7 +441,7 @@ BOOST_AUTO_TEST_CASE(outer_tile_serialize_round_trip_arena_tensor) {
     // The loaded cell's data pointer is SIMD-aligned via
     // arena_outer_init.
     auto addr = reinterpret_cast<std::uintptr_t>(d.data());
-    BOOST_CHECK_EQUAL(addr % TA::kInnerSimdAlign, 0u);
+    BOOST_CHECK_EQUAL(addr % TA::kArenaTensorSimdAlign, 0u);
   }
 }
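
These kernel tests check that every inner cell carved out of an arena, including cells of different sizes, ends up with aligned element storage. One way such a property can hold is sketched below as a simple bump-style layout plan. This is an assumed scheme for illustration, not TiledArray's arena algorithm: `CellLayout`, `align_up` and `plan_cells` are hypothetical, and offsets are relative to an arena base that is itself assumed to be aligned.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical record of where one cell and its element storage start,
// as byte offsets from an aligned arena base.
struct CellLayout {
  std::size_t cell_offset;
  std::size_t data_offset;
};

// Round `n` up to the next multiple of `align` (nonzero power of two).
constexpr std::size_t align_up(std::size_t n, std::size_t align) {
  return (n + align - 1) & ~(align - 1);
}

// Place cells back to back. Each cell starts on an `align` boundary and its
// elements start after the header, padded up to `align`.
inline std::vector<CellLayout> plan_cells(
    const std::vector<std::size_t>& element_counts, std::size_t header_bytes,
    std::size_t element_bytes, std::size_t align) {
  assert(align != 0 && (align & (align - 1)) == 0);
  std::vector<CellLayout> plan;
  std::size_t cursor = 0;
  for (std::size_t count : element_counts) {
    const std::size_t cell = align_up(cursor, align);
    const std::size_t data = cell + align_up(header_bytes, align);
    plan.push_back({cell, data});
    cursor = data + count * element_bytes;  // first free byte after this cell
  }
  return plan;
}
```

For jagged element counts such as 3, 8 and 1 with an assumed 24-byte header, 8-byte elements and 32-byte alignment, every `data_offset` in the plan is a multiple of 32, which is the invariant the `addr % TA::kArenaTensorSimdAlign` checks assert.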
