Commit b29b61a

caugonnet and andralex authored
[STF] Add C bindings for the places layer (#9232)
* [STF] Add C bindings for the places layer

  Extends the experimental STF C API to mirror the C++ places layer:

  - green_context_helper (create/destroy/count/device id) and green-context exec_place / data_place factories (CUDA 12.4+).
  - exec_place scope enter/exit (RAII context activation), affine data_place accessor, and grid sub-place accessor (get_place).
  - data_place stream-ordered allocate/deallocate and an allocation_is_stream_ordered query, plus machine_init.
  - task grid accessors: get_grid_dims and get_custream_at_index.

  Adds coverage in test_places.cpp. Extracted from the python-bindings PR to keep that change reviewable.

* [STF] Harden places C API at the C/FFI boundary

  Address CodeRabbit review feedback:

  - stf_exec_place_scope_enter now rejects out-of-range indices with NULL, matching the contract of the neighboring index-based accessors.
  - stf_data_place_deallocate catches and maps C++ exceptions instead of letting them escape the extern "C" entry point.

* [STF] Use range-based for loops in test_places (clang-tidy)

  Fix modernize-loop-convert clang-tidy errors by iterating the places array with range-based for loops instead of index-based loops.

* [STF] Add unified_task grid introspection used by places C API

  The places C bindings (stf_task_get_grid_dims / stf_task_get_custream_at_index) call get_grid_dims(dim4*) and get_stream(size_t) on context::unified_task<>, but those overloads were never declared on unified_task in this branch, so stf.cu failed to compile. Add both methods, dispatching the per-place stream to stream_task<Deps...> and returning nullptr/false for graph tasks or non-grid exec places.

* [STF] Bounds-check unified_task::get_stream(place_index)

  stream_task::get_stream(size_t) indexes the stream grid without any bounds check, so stf_task_get_custream_at_index could read past the grid for an out-of-range index (UB) and returned success for non-grid exec places, contradicting the documented contract (non-zero on "not a grid" / index out of range). Guard the linear index in the unified_task<> wrapper: return nullptr for graph tasks, non-grid exec places, and out-of-range indices. Add a regression check to the grid test for the out-of-range index case.

* [STF] Test green-context places C API

  Add direct C API coverage for green-context helper and green-context exec/data place factories so the extracted places bindings are self-contained.

* [STF] Guard stf_machine_init at the C boundary

  machine::instance() does real work on first call (P2P/mempool/topology setup) and can throw. Wrap it in try/catch so a C++ exception never unwinds across the extern "C" boundary into a C caller (UB / terminate), matching the error-reporting convention used by stf_try_allocate.

* [STF] Document allocate/deallocate size signedness rationale

  stf_data_place_allocate takes a signed ptrdiff_t while stf_data_place_deallocate takes an unsigned size_t. This mirrors the C++ allocator interface, where the requested size is passed by reference and negated to signal allocation failure; deallocation has no such error to signal. Document the asymmetry on both entry points so the C surface explains why the types differ.

---------

Co-authored-by: Andrei Alexandrescu <andrei@erdani.com>
1 parent a385d04 commit b29b61a

4 files changed

Lines changed: 795 additions & 26 deletions

c/experimental/stf/include/cccl/c/experimental/stf/stf.h

Lines changed: 153 additions & 1 deletion
@@ -106,6 +106,12 @@ typedef struct stf_exec_place_opaque_t* stf_exec_place_handle;
 //! \brief Opaque handle to a \c data_place.
 typedef struct stf_data_place_opaque_t* stf_data_place_handle;
 
+//! \brief Opaque handle to a \c green_context_helper.
+typedef struct stf_green_context_helper_opaque_t* stf_green_context_helper_handle;
+
+//! \brief Opaque handle to an active exec_place_scope (RAII context activation).
+typedef struct stf_exec_place_scope_opaque_t* stf_exec_place_scope_handle;
+
 //! \brief Opaque handle to an \c exec_place_resources registry.
 //!
 //! Handles returned by stf_exec_place_resources_create() are owned by the
@@ -150,6 +156,19 @@ stf_exec_place_handle stf_exec_place_device(int dev_id);
 //! \brief Create execution place for the current CUDA device.
 stf_exec_place_handle stf_exec_place_current_device(void);
 
+//! \brief Create a green-context helper for \p dev_id with \p sm_count SMs per green context.
+//! Requires CUDA 12.4+. Returns NULL on failure.
+stf_green_context_helper_handle stf_green_context_helper_create(int sm_count, int dev_id);
+
+//! \brief Destroy a green-context helper handle.
+void stf_green_context_helper_destroy(stf_green_context_helper_handle h);
+
+//! \brief Number of green contexts created by \p h.
+size_t stf_green_context_helper_get_count(stf_green_context_helper_handle h);
+
+//! \brief Device ordinal used by this green-context helper.
+int stf_green_context_helper_get_device_id(stf_green_context_helper_handle h);
+
 //! \brief Deep copy of an execution place handle (caller must stf_exec_place_destroy the result).
 stf_exec_place_handle stf_exec_place_clone(stf_exec_place_handle h);
 
@@ -181,7 +200,20 @@ stf_exec_place_grid_create(const stf_exec_place_handle* places, size_t count, co
 //! \brief Same as stf_exec_place_destroy (grids are exec_place handles).
 void stf_exec_place_grid_destroy(stf_exec_place_handle grid);
 
-//! \brief Create a fresh exec_place_resources registry for standalone place-layer use.
+//! \brief Activate the sub-place at linear index \p idx (0 for scalar places).
+//! Saves the current CUDA context; call stf_exec_place_scope_exit to restore.
+//! \return Opaque scope handle, or NULL on failure (including when \p idx is out of bounds).
+stf_exec_place_scope_handle stf_exec_place_scope_enter(stf_exec_place_handle place, size_t idx);
+
+//! \brief Restore the CUDA context saved by stf_exec_place_scope_enter and destroy the scope.
+//! \p scope may be NULL (no-op).
+void stf_exec_place_scope_exit(stf_exec_place_scope_handle scope);
+
+//! \brief Get the affine data_place associated with this exec_place.
+//! Caller must stf_data_place_destroy the result.
+stf_data_place_handle stf_exec_place_get_affine_data_place(stf_exec_place_handle h);
+
+//! \brief Create a fresh, empty exec_place_resources registry.
 //!
 //! The registry lazily creates and owns stream pools for places used with
 //! stf_exec_place_pick_stream(). Destroying it releases every stream it owns.
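The enter/exit contract documented in this hunk (NULL for an out-of-range index, a NULL-tolerant exit, and restoration of whatever was current before the enter) can be illustrated with a small mock. This is only a sketch under stated assumptions: `mock_place`, `mock_scope`, `mock_scope_enter`, `mock_scope_exit`, and `mock_current_context` are hypothetical names that do not appear in stf.h, and a plain integer stands in for the CUDA context.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-ins: none of these names come from stf.h. */
typedef struct {
  const int* contexts; /* one fake "context id" per sub-place */
  size_t count;        /* 1 for a scalar place */
} mock_place;

typedef struct {
  int saved_context; /* context that was current before enter */
} mock_scope;

/* Stands in for the calling thread's current CUDA context. */
static int g_current_context = 0;

/* Activate sub-place idx. Returns NULL (and changes nothing) when the
 * place is NULL, idx is out of range, or the scope cannot be allocated. */
mock_scope* mock_scope_enter(const mock_place* place, size_t idx)
{
  if (place == NULL || idx >= place->count) {
    return NULL;
  }
  mock_scope* scope = malloc(sizeof *scope);
  if (scope == NULL) {
    return NULL;
  }
  scope->saved_context = g_current_context;
  g_current_context    = place->contexts[idx];
  return scope;
}

/* Restore the saved context and free the scope; NULL is a no-op. */
void mock_scope_exit(mock_scope* scope)
{
  if (scope == NULL) {
    return;
  }
  g_current_context = scope->saved_context;
  free(scope);
}

/* Read back the mock "current context". */
int mock_current_context(void)
{
  return g_current_context;
}
```

Accepting NULL in the exit function lets a C caller write a single cleanup path without first checking whether the enter succeeded.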
@@ -202,6 +234,23 @@ void stf_exec_place_resources_destroy(stf_exec_place_resources_handle h);
 //! finalized for a borrowed registry.
 CUstream stf_exec_place_pick_stream(stf_exec_place_resources_handle res, stf_exec_place_handle h, int for_computation);
 
+//! \brief Get the sub-place at linear index \p idx.
+//! For scalar places, \p idx must be 0. Returns NULL if \p idx is out of bounds.
+//! Caller must stf_exec_place_destroy the result.
+stf_exec_place_handle stf_exec_place_get_place(stf_exec_place_handle h, size_t idx);
+
+//! \brief Create an exec_place from green-context helper \p helper and view index \p idx.
+//! If \p use_green_ctx_data_place is non-zero, set the affine data_place to a green-context data place.
+//! Returns NULL on failure or if \p idx is out of range.
+stf_exec_place_handle
+stf_exec_place_green_ctx(stf_green_context_helper_handle helper, size_t idx, int use_green_ctx_data_place);
+
+//! \brief Initialize the machine singleton (P2P access, memory pool setup, topology).
+//! Safe to call multiple times; only the first call has effect. Any C++ exception
+//! raised during initialization is caught and reported to stderr (never propagated
+//! across the C boundary).
+void stf_machine_init(void);
+
 //! \brief Host (CPU/pinned) data placement.
 stf_data_place_handle stf_data_place_host(void);
 
@@ -220,6 +269,10 @@ stf_data_place_handle stf_data_place_current_device(void);
 //! \brief Composite partitioned placement over a grid of execution places.
 stf_data_place_handle stf_data_place_composite(stf_exec_place_handle grid, stf_get_executor_fn mapper);
 
+//! \brief Create a data_place from green-context helper \p helper and view index \p idx.
+//! Returns NULL on failure or if \p idx is out of range.
+stf_data_place_handle stf_data_place_green_ctx(stf_green_context_helper_handle helper, size_t idx);
+
 //! \brief Deep copy (caller must stf_data_place_destroy).
 stf_data_place_handle stf_data_place_clone(stf_data_place_handle h);
 
@@ -232,6 +285,45 @@ int stf_data_place_get_device_ordinal(stf_data_place_handle h);
 //! \brief Human-readable description; pointer valid until the next call on this thread.
 const char* stf_data_place_to_string(stf_data_place_handle h);
 
+//! \brief Allocate \p size bytes at this data place.
+//!
+//! For device places the allocation is stream-ordered (cudaMallocAsync).
+//! For host/managed places \p stream is ignored.
+//! Returns NULL on failure (e.g. unsupported place type or out of memory).
+//!
+//! \note \p size is signed (ptrdiff_t) to mirror the underlying C++ allocator
+//! interface, where the requested size is passed by reference and negated to
+//! signal allocation failure while preserving the requested amount. The matching
+//! stf_data_place_deallocate() takes an unsigned size_t because at deallocation
+//! the size is a known-good quantity with no error to signal.
+//!
+//! \param h Data place handle (must not be NULL)
+//! \param size Allocation size in bytes (must be non-negative)
+//! \param stream CUDA stream for stream-ordered allocation (may be NULL)
+//! \return Pointer to allocated memory, or NULL on failure
+void* stf_data_place_allocate(stf_data_place_handle h, ptrdiff_t size, cudaStream_t stream);
+
+//! \brief Deallocate memory previously obtained from stf_data_place_allocate().
+//!
+//! For device places the deallocation is stream-ordered (cudaFreeAsync).
+//! For host/managed places \p stream is ignored.
+//!
+//! \note \p size is unsigned (size_t) on purpose: unlike stf_data_place_allocate(),
+//! deallocation never signals failure through the size argument (see that
+//! function's note), so it mirrors the unsigned C++ deallocate() signature.
+//!
+//! \param h Data place handle (must not be NULL)
+//! \param ptr Pointer returned by stf_data_place_allocate()
+//! \param size Size of the original allocation in bytes
+//! \param stream CUDA stream for stream-ordered deallocation (may be NULL)
+void stf_data_place_deallocate(stf_data_place_handle h, void* ptr, size_t size, cudaStream_t stream);
+
+//! \brief Query whether allocations on this place are stream-ordered.
+//!
+//! \param h Data place handle (must not be NULL)
+//! \return 1 if stream-ordered, 0 otherwise
+int stf_data_place_allocation_is_stream_ordered(stf_data_place_handle h);
+
 //! \}
 
 //! \defgroup Handles Opaque Handles
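The signedness note above describes a convention from the underlying C++ allocator: the requested size is passed by reference and negated on failure, so the caller can still recover the requested amount. A minimal sketch of that convention in plain C might look as follows. It is not the STF implementation: `mock_allocate`, `mock_deallocate`, and `MOCK_CAPACITY` are hypothetical, `malloc` stands in for the real allocation, and the C entry point in this diff takes the size by value rather than by pointer.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical capacity limit used to force a failure in the sketch. */
#define MOCK_CAPACITY ((ptrdiff_t) 1024)

/* On success: returns the block and leaves *size unchanged.
 * On capacity failure: returns NULL and negates *size, so the caller can
 * still read the requested amount as -(*size).
 * On an invalid request (NULL or negative size): returns NULL and leaves
 * *size untouched. */
void* mock_allocate(ptrdiff_t* size)
{
  if (size == NULL || *size < 0) {
    return NULL;
  }
  if (*size > MOCK_CAPACITY) {
    *size = -*size;
    return NULL;
  }
  /* malloc(0) may legally return NULL, so request at least one byte. */
  return malloc(*size > 0 ? (size_t) *size : 1);
}

/* Deallocation has nothing to report through the size, so the parameter
 * can be unsigned. */
void mock_deallocate(void* ptr, size_t size)
{
  (void) size; /* a real allocator might use it for bookkeeping */
  free(ptr);
}
```

A signed type is what makes the in-band failure signal possible; with `size_t` the negation would wrap around to a large positive value instead.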
@@ -1062,6 +1154,66 @@ void stf_task_destroy(stf_task_handle t);
 
 void stf_task_enable_capture(stf_task_handle t);
 
+//! \brief Get grid dimensions of a task's exec place
+//!
+//! When the task's execution place is a grid (size > 1), writes its
+//! shape to \p out_dims. Returns 0 on success, non-zero if the task's
+//! exec place is not a grid or \p out_dims is NULL.
+//!
+//! \param t Task handle
+//! \param[out] out_dims On success, the grid shape (x, y, z, t) is written here. Must not be NULL.
+//! \return 0 on success; non-zero if task exec place is not a grid or \p out_dims is NULL
+//!
+//! \pre t must be valid task handle
+//! \pre stf_task_start() must have been called
+//!
+//! \note Total number of grid entries is out_dims->x * out_dims->y * out_dims->z * out_dims->t.
+//! \note A single-element exec place (size 1) is intentionally not treated as a grid: this
+//! returns non-zero for it, consistent with stf_task_get_custream_at_index().
+//!
+//! \par Example:
+//! \code
+//! stf_task_start(task);
+//! stf_dim4 dims;
+//! if (stf_task_get_grid_dims(task, &dims) == 0) {
+//!   printf("Grid: %lu x %lu\n", dims.x, dims.y);
+//! }
+//! \endcode
+//!
+//! \see stf_task_get_custream_at_index()
+int stf_task_get_grid_dims(stf_task_handle t, stf_dim4* out_dims);
+
+//! \brief Get the CUDA stream for a specific grid index
+//!
+//! When the task's exec place is a grid, returns the CUstream for the
+//! given linear index (0 to product of grid dims - 1).
+//!
+//! \param t Task handle (must have been started; exec place must be a grid)
+//! \param place_index Linear index in the grid (0-based; use stf_task_get_grid_dims to get shape)
+//! \param[out] out_stream On success, the stream for that index is written here. Must not be NULL.
+//! \return 0 on success; non-zero if task is not a grid, index out of range, or no per-index streams
+//!
+//! \pre t must be valid task handle
+//! \pre stf_task_start() must have been called
+//!
+//! \note On success \p out_stream is set to the grid index's stream; on failure it is left
+//! untouched and a non-zero code is returned. STF grids always use non-default streams, so a
+//! valid result is never the legacy default stream (CUstream 0).
+//!
+//! \par Example:
+//! \code
+//! stf_dim4 dims;
+//! stf_task_get_grid_dims(task, &dims);
+//! for (size_t i = 0; i < dims.x; ++i) {
+//!   CUstream s;
+//!   stf_task_get_custream_at_index(task, i, &s);
+//!   // launch work on stream s
+//! }
+//! \endcode
+//!
+//! \see stf_task_get_grid_dims()
+int stf_task_get_custream_at_index(stf_task_handle t, size_t place_index, CUstream* out_stream);
+
 //! \}
 
 //! \defgroup CUDAKernel CUDA Kernel Interface
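The header's loop example iterates over `dims.x` only and ignores return codes. For a multi-dimensional grid, the documented range of the linear index is the product of all four dimensions. The sketch below shows one way to walk the full range with the same "0 on success, output untouched on failure" contract. Everything in it is a stand-in: `mock_dim4` (including its `uint64_t` field type), `mock_dim4_total`, `mock_value_at_index`, and `mock_visit_all` are hypothetical, and a flat `int` array replaces the per-place streams.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mirror of the grid shape; the field type is an assumption. */
typedef struct {
  uint64_t x, y, z, t;
} mock_dim4;

/* Total number of grid entries: x * y * z * t. */
uint64_t mock_dim4_total(const mock_dim4* dims)
{
  assert(dims != NULL);
  return dims->x * dims->y * dims->z * dims->t;
}

/* Bounds-checked lookup over a flat array of per-place values.
 * Returns 0 and writes *out on success. Returns non-zero, leaving *out
 * untouched, for NULL arguments, a non-grid shape (total <= 1), or an
 * out-of-range index. */
int mock_value_at_index(const mock_dim4* dims, const int* values, size_t index, int* out)
{
  if (dims == NULL || values == NULL || out == NULL) {
    return 1;
  }
  uint64_t total = mock_dim4_total(dims);
  if (total <= 1 || index >= total) {
    return 1;
  }
  *out = values[index];
  return 0;
}

/* Visit every linear index (not just the first dimension), adding each
 * value to *sum. Stops at the first failed lookup and returns the number
 * of entries actually visited. */
size_t mock_visit_all(const mock_dim4* dims, const int* values, long* sum)
{
  size_t visited = 0;
  uint64_t total = mock_dim4_total(dims);
  for (uint64_t i = 0; i < total; ++i) {
    int v;
    if (mock_value_at_index(dims, values, (size_t) i, &v) != 0) {
      break;
    }
    *sum += v;
    ++visited;
  }
  return visited;
}
```

Checking the return code on every lookup keeps the caller from reading an uninitialized output when the exec place turns out not to be a grid.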
