Skip to content

Commit f29d7fd

Browse files
authored
cudax/stf: migrate internal/ launch + host_launch_scope from cuda_safe_call to cuda_try (#9249)
* cudax/stf: migrate internal/ launch + host_launch_scope to cuda_try Third internal/ slice, covering the kernel/host launch scopes and their shared event-timing pattern. - Convert eligible calls to the templated cuda_try<F> form: cudaLaunchKernelExC, cudaGraphAddKernelNode (out-param -> ref), cudaGraphKernelNodeSetAttribute, cudaFreeAsync, cudaEventRecord (start), cudaGraphAddHostNode (out-param -> ref), cudaLaunchHostFunc. - cudaEventCreate and cudaMallocAsync stay in the runtime-status form: both are overload sets (cuda_runtime.h flags overload / templated wrapper), so cuda_try<F> cannot name them. - Event timing's end record/synchronize/elapsed run inside the noexcept SCOPE(exit) body, so they keep cuda_safe_call: a CUDA error there should abort rather than throw through the guard (which would std::terminate). - The two stream-path cudaLaunchHostFunc enqueues now get a SCOPE(fail) that deletes the heap callback args (resolved / wrapper) if the enqueue throws -- the callback only takes ownership once the enqueue succeeds, so this closes the leak the new throw path would otherwise introduce. The graph-path host nodes are already covered because their args are owned by a ctx resource added before the node is created. Pre-existing and left as-is: the timing events created here are never cudaEventDestroy'd (a leak in the calibration path, unrelated to this change). * cudax/stf: use cuda_try<cudaEventCreateWithFlags> for timing event creation cudaEventCreate is an overload set (cuda_runtime.h adds a flags overload), so it cannot be named by the templated cuda_try<F> form. Use the non-overloaded cudaEventCreateWithFlags with cudaEventDefault instead, which is exactly what cudaEventCreate(&e) does internally, so behavior is unchanged while keeping the templated form. * cudax/stf: fix host_launch callback-arg ownership ordering The host_launch callback args (resolved / wrapper) are heap-allocated and guarded by SCOPE(fail) { delete ...; }. Transfer of ownership to the graph-path ctx resource was happening in the wrong order: - Untyped path: `resolved` was set to nullptr right after add_resource, but it is also used as the host node's userData. That made the graph node receive a null userData, so the callback dereferenced null on the success path. - Typed path: add_resource ran before cudaGraphAddHostNode, so a throw from the node creation would delete `wrapper` twice (SCOPE(fail) plus the resource's release_in_callback). Fix both by creating the host node first (while resolved/wrapper is still a valid userData), then handing ownership to the ctx resource, then nulling the pointer once at the end to disarm SCOPE(fail). On a throw before that point the resource has not been added, so SCOPE(fail) is the sole owner and frees the args exactly once. * cudax/stf: own host_launch callback args with unique_ptr Replace the raw new + SCOPE(fail){delete} + manual nulling design for the host_launch callback arguments with std::unique_ptr. The args are borrowed via .get() for the host node userData / cudaLaunchHostFunc argument and the ctx resource, and ownership is handed off with .release() once the node has been created (graph) or the launch has been enqueued (stream). On a throw before that point the unique_ptr frees the args; afterwards the ctx resource (graph) or the callback (stream) owns and frees them. Adds <memory>. * cudax/stf: free launch temp device memory via SCOPE(exit) launch_impl allocates a temporary device buffer (cudaMallocAsync) and freed it after cuda_launcher returned. Now that cuda_launcher throws on error (via cuda_try<cudaLaunchKernelExC>), the trailing cudaFreeAsync was skipped on a throw, leaking the buffer. Free it from a SCOPE(exit) placed right after the allocation so it runs on both normal and exceptional exit. cuda_safe_call is used inside the noexcept SCOPE(exit) body. * cudax/stf: check cudaGetDevice in launch timing path The cudaGetDevice call in the timing branch was unchecked. Use the templated cuda_try<cudaGetDevice> form so a failure is reported. * [STF] Initialize host_launch_scope timing events to nullptr Match launch.cuh and satisfy GCC -Wmaybe-uninitialized: if cuda_try throws before both events are created, SCOPE(exit) still runs with record_time set.
1 parent b29b61a commit f29d7fd

2 files changed

Lines changed: 105 additions & 70 deletions

File tree

cudax/include/cuda/experimental/__stf/internal/host_launch_scope.cuh

Lines changed: 65 additions & 46 deletions
Original file line numberDiff line numberDiff line change
@@ -35,6 +35,7 @@
3535
#include <cuda/experimental/__stf/internal/thread_hierarchy.cuh>
3636
#include <cuda/experimental/__stf/internal/void_interface.cuh>
3737

38+
#include <memory>
3839
#include <type_traits>
3940

4041
namespace cuda::experimental::stf
@@ -238,28 +239,24 @@ public:
238239
t.set_symbol(symbol);
239240
}
240241

241-
cudaEvent_t start_event, end_event;
242+
cudaEvent_t start_event = nullptr, end_event = nullptr;
242243
const bool record_time = t.schedule_task() || statistics.is_calibrating_to_file();
244+
// Set only once both timing events exist and the start event has been recorded.
245+
// The timing setup is done below, after the SCOPE(exit) guard is installed, so a
246+
// throw from those cuda_try calls cannot skip t.end_uncleared()/t.clear().
247+
bool timing_active = false;
243248

244249
t.start();
245250

246-
if constexpr (::std::is_same_v<Ctx, stream_ctx>)
247-
{
248-
if (record_time)
249-
{
250-
cuda_safe_call(cudaEventCreate(&start_event));
251-
cuda_safe_call(cudaEventCreate(&end_event));
252-
cuda_safe_call(cudaEventRecord(start_event, t.get_stream()));
253-
}
254-
}
255-
256251
SCOPE(exit)
257252
{
258253
t.end_uncleared();
259254
if constexpr (::std::is_same_v<Ctx, stream_ctx>)
260255
{
261-
if (record_time)
256+
if (timing_active)
262257
{
258+
// Inside the noexcept SCOPE(exit) body; keep cuda_safe_call so a CUDA
259+
// error aborts rather than throwing through the guard.
263260
cuda_safe_call(cudaEventRecord(end_event, t.get_stream()));
264261
cuda_safe_call(cudaEventSynchronize(end_event));
265262

@@ -280,11 +277,27 @@ public:
280277
t.clear();
281278
};
282279

280+
if constexpr (::std::is_same_v<Ctx, stream_ctx>)
281+
{
282+
if (record_time)
283+
{
284+
// cudaEventCreate is an overload set (cuda_runtime.h adds a flags overload),
285+
// so cuda_try<cudaEventCreate> cannot name it; use the non-overloaded
286+
// cudaEventCreateWithFlags with the default flags (equivalent to cudaEventCreate).
287+
start_event = cuda_try<cudaEventCreateWithFlags>(cudaEventDefault);
288+
end_event = cuda_try<cudaEventCreateWithFlags>(cudaEventDefault);
289+
cuda_try<cudaEventRecord>(start_event, t.get_stream());
290+
timing_active = true;
291+
}
292+
}
293+
283294
if constexpr (fun_invocable_untyped)
284295
{
285296
// --- Untyped dispatch path ---
286-
auto* resolved = new ::std::pair<Fun, host_launch_deps>{::std::forward<Fun>(f), host_launch_deps{}};
287-
auto& hld = resolved->second;
297+
auto resolved =
298+
::std::make_unique<::std::pair<Fun, host_launch_deps>>(::std::forward<Fun>(f), host_launch_deps{});
299+
300+
auto& hld = resolved->second;
288301

289302
const size_t ndeps = deps.size();
290303
hld.lds_.resize(ndeps);
@@ -298,32 +311,38 @@ public:
298311
hld.dtor_ = user_data_dtor_;
299312
user_data_dtor_ = nullptr;
300313

301-
if constexpr (::std::is_same_v<Ctx, graph_ctx>)
302-
{
303-
using wrapper_type = ::std::remove_reference_t<decltype(*resolved)>;
304-
auto resource = ::std::make_shared<host_callback_args_resource<wrapper_type>>(resolved);
305-
ctx.add_resource(mv(resource));
306-
}
307-
308314
auto callback = [](void* raw) {
309-
auto* w = static_cast<decltype(resolved)>(raw);
310-
w->first(w->second);
311-
if constexpr (!::std::is_same_v<Ctx, graph_ctx>)
315+
auto* w = static_cast<decltype(resolved.get())>(raw);
316+
SCOPE(exit)
312317
{
313-
delete w;
314-
}
318+
if constexpr (!::std::is_same_v<Ctx, graph_ctx>)
319+
{
320+
delete w;
321+
}
322+
};
323+
w->first(w->second);
315324
};
316325

317326
if constexpr (::std::is_same_v<Ctx, graph_ctx>)
318327
{
319-
cudaHostNodeParams params = {.fn = callback, .userData = resolved};
328+
cudaHostNodeParams params = {.fn = callback, .userData = resolved.get()};
320329
auto lock = t.lock_ctx_graph();
321-
cuda_safe_call(cudaGraphAddHostNode(&t.get_node(), t.get_ctx_graph(), nullptr, 0, &params));
330+
t.get_node() = cuda_try<cudaGraphAddHostNode>(t.get_ctx_graph(), nullptr, 0, &params);
331+
// The node now references the args; hand ownership to a ctx resource
332+
// that deletes them (in release_in_callback) when the ctx is released.
333+
using wrapper_type = ::std::remove_reference_t<decltype(*resolved)>;
334+
ctx.add_resource(::std::make_shared<host_callback_args_resource<wrapper_type>>(resolved.get()));
322335
}
323336
else
324337
{
325-
cuda_safe_call(cudaLaunchHostFunc(t.get_stream(), callback, resolved));
338+
// For a stream the callback owns the args once the launch succeeds.
339+
cuda_try<cudaLaunchHostFunc>(t.get_stream(), callback, resolved.get());
326340
}
341+
// Ownership has transferred (to the ctx resource for graph, or to the
342+
// callback for stream). These enqueues are asynchronous, so on a throw
343+
// above the callback has not run and the unique_ptr still owns the args;
344+
// release it now that ownership has moved on.
345+
resolved.release();
327346
}
328347
else
329348
{
@@ -338,17 +357,17 @@ public:
338357
return deps.instance(t);
339358
}
340359
}();
341-
auto* wrapper = new ::std::pair<Fun, decltype(payload)>{::std::forward<Fun>(f), mv(payload)};
342-
343-
if constexpr (::std::is_same_v<Ctx, graph_ctx>)
344-
{
345-
using wrapper_type = ::std::remove_reference_t<decltype(*wrapper)>;
346-
auto resource = ::std::make_shared<host_callback_args_resource<wrapper_type>>(wrapper);
347-
ctx.add_resource(mv(resource));
348-
}
360+
auto wrapper = ::std::make_unique<::std::pair<Fun, decltype(payload)>>(::std::forward<Fun>(f), mv(payload));
349361

350362
auto callback = [](void* untyped_wrapper) {
351-
auto w = static_cast<decltype(wrapper)>(untyped_wrapper);
363+
auto w = static_cast<decltype(wrapper.get())>(untyped_wrapper);
364+
SCOPE(exit)
365+
{
366+
if constexpr (!::std::is_same_v<Ctx, graph_ctx>)
367+
{
368+
delete w;
369+
}
370+
};
352371

353372
constexpr bool fun_invocable_task_deps = reserved::is_applicable_v<Fun, decltype(payload)>;
354373
constexpr bool fun_invocable_task_non_void_deps =
@@ -365,23 +384,23 @@ public:
365384
{
366385
::std::apply(::std::forward<Fun>(w->first), reserved::remove_void_interface(mv(w->second)));
367386
}
368-
369-
if constexpr (!::std::is_same_v<Ctx, graph_ctx>)
370-
{
371-
delete w;
372-
}
373387
};
374388

375389
if constexpr (::std::is_same_v<Ctx, graph_ctx>)
376390
{
377-
cudaHostNodeParams params = {.fn = callback, .userData = wrapper};
391+
cudaHostNodeParams params = {.fn = callback, .userData = wrapper.get()};
378392
auto lock = t.lock_ctx_graph();
379-
cuda_safe_call(cudaGraphAddHostNode(&t.get_node(), t.get_ctx_graph(), nullptr, 0, &params));
393+
t.get_node() = cuda_try<cudaGraphAddHostNode>(t.get_ctx_graph(), nullptr, 0, &params);
394+
// Transfer ownership only after the node references the args, so a throw
395+
// from cudaGraphAddHostNode leaves the unique_ptr as the sole owner.
396+
using wrapper_type = ::std::remove_reference_t<decltype(*wrapper)>;
397+
ctx.add_resource(::std::make_shared<host_callback_args_resource<wrapper_type>>(wrapper.get()));
380398
}
381399
else
382400
{
383-
cuda_safe_call(cudaLaunchHostFunc(t.get_stream(), callback, wrapper));
401+
cuda_try<cudaLaunchHostFunc>(t.get_stream(), callback, wrapper.get());
384402
}
403+
wrapper.release();
385404
}
386405
}
387406

cudax/include/cuda/experimental/__stf/internal/launch.cuh

Lines changed: 40 additions & 24 deletions
Original file line numberDiff line numberDiff line change
@@ -64,7 +64,7 @@ void cuda_launcher(interpreted_spec interpreted_policy, Fun&& f, void** args, St
6464
lconfig.dynamicSmemBytes = mem_config[2];
6565
lconfig.stream = stream;
6666

67-
cuda_safe_call(cudaLaunchKernelExC(&lconfig, (void*) f, args));
67+
cuda_try<cudaLaunchKernelExC>(&lconfig, (void*) f, args);
6868
}
6969

7070
template <typename interpreted_spec, typename Fun>
@@ -81,15 +81,15 @@ void cuda_launcher_graph(interpreted_spec interpreted_policy, Fun&& f, void** ar
8181
kconfig.kernelParams = args;
8282
kconfig.sharedMemBytes = static_cast<int>(mem_config[2]);
8383

84-
cuda_safe_call(cudaGraphAddKernelNode(&n, g, nullptr, 0, &kconfig));
84+
n = cuda_try<cudaGraphAddKernelNode>(g, nullptr, 0, &kconfig);
8585

8686
// Enable cooperative kernel if necessary by updating the node attributes
8787

8888
bool cooperative_kernel = interpreted_policy.need_cooperative_kernel_launch();
8989

9090
cudaKernelNodeAttrValue val;
9191
val.cooperative = cooperative_kernel ? 1 : 0;
92-
cuda_safe_call(cudaGraphKernelNodeSetAttribute(n, cudaKernelNodeAttributeCooperative, &val));
92+
cuda_try<cudaGraphKernelNodeSetAttribute>(n, cudaKernelNodeAttributeCooperative, &val);
9393
}
9494

9595
template <typename Fun, typename interpreted_spec, typename Arg>
@@ -120,20 +120,25 @@ void launch_impl(interpreted_spec interpreted_policy, exec_place& p, Fun f, Arg
120120

121121
if (th_mem_config[1] > 0)
122122
{
123-
cuda_safe_call(cudaMallocAsync(&th_dev_tmp_ptr, th_mem_config[1], stream));
123+
cuda_try(cudaMallocAsync(&th_dev_tmp_ptr, th_mem_config[1], stream));
124124
th.set_device_tmp(th_dev_tmp_ptr);
125125
}
126126

127+
// Free the temporary device memory on the way out, even if the launch throws.
128+
// cuda_safe_call (not cuda_try) because SCOPE(exit) is noexcept.
129+
SCOPE(exit)
130+
{
131+
if (th_dev_tmp_ptr)
132+
{
133+
cuda_safe_call(cudaFreeAsync(th_dev_tmp_ptr, stream));
134+
}
135+
};
136+
127137
auto kernel_args = tuple_prepend(mv(th), mv(arg));
128138
using args_type = decltype(kernel_args);
129139
void* all_args[] = {&f, &kernel_args};
130140

131141
cuda_launcher(interpreted_policy, reserved::launch_kernel<Fun, args_type>, all_args, stream);
132-
133-
if (th_mem_config[1] > 0)
134-
{
135-
cuda_safe_call(cudaFreeAsync(th_dev_tmp_ptr, stream));
136-
}
137142
};
138143
}
139144

@@ -358,20 +363,12 @@ public:
358363
nvtx_range nr(t.get_symbol().c_str());
359364
t.start();
360365

361-
int device;
362-
cudaEvent_t start_event, end_event;
363-
364-
if constexpr (::std::is_same_v<Ctx, stream_ctx>)
365-
{
366-
if (record_time)
367-
{
368-
cudaGetDevice(&device); // We will use this to force it during the next run
369-
// Events must be created here to avoid issues with multi-gpu
370-
cuda_safe_call(cudaEventCreate(&start_event));
371-
cuda_safe_call(cudaEventCreate(&end_event));
372-
cuda_safe_call(cudaEventRecord(start_event, t.get_stream()));
373-
}
374-
}
366+
int device = -1;
367+
cudaEvent_t start_event = nullptr, end_event = nullptr;
368+
// Set only once both timing events exist and the start event has been recorded.
369+
// The timing setup is done below, after the SCOPE(exit) guard is installed, so a
370+
// throw from those cuda_try calls cannot skip t.end_uncleared()/t.clear().
371+
bool timing_active = false;
375372

376373
const size_t grid_size = e_place.size();
377374

@@ -403,8 +400,11 @@ public:
403400
deallocateManagedMemory(hostMemoryArrivedList, grid_size, t.get_stream());
404401
}
405402

406-
if (record_time)
403+
if (timing_active)
407404
{
405+
// These run inside the enclosing SCOPE(exit) body, which is noexcept;
406+
// keep cuda_safe_call so a CUDA error aborts rather than throwing
407+
// through the guard (which would call std::terminate).
408408
cuda_safe_call(cudaEventRecord(end_event, t.get_stream()));
409409
cuda_safe_call(cudaEventSynchronize(end_event));
410410

@@ -426,6 +426,22 @@ public:
426426
t.clear();
427427
};
428428

429+
if constexpr (::std::is_same_v<Ctx, stream_ctx>)
430+
{
431+
if (record_time)
432+
{
433+
device = cuda_try<cudaGetDevice>(); // We will use this to force it during the next run
434+
// Events must be created here to avoid issues with multi-gpu.
435+
// cudaEventCreate is an overload set (cuda_runtime.h adds a flags overload),
436+
// so cuda_try<cudaEventCreate> cannot name it; use the non-overloaded
437+
// cudaEventCreateWithFlags with the default flags (equivalent to cudaEventCreate).
438+
start_event = cuda_try<cudaEventCreateWithFlags>(cudaEventDefault);
439+
end_event = cuda_try<cudaEventCreateWithFlags>(cudaEventDefault);
440+
cuda_try<cudaEventRecord>(start_event, t.get_stream());
441+
timing_active = true;
442+
}
443+
}
444+
429445
/* Should only be allocated / deallocated if the last level used is system wide. Unnecessary and wasteful
430446
* otherwise. */
431447
if (grid_size > 1)

0 commit comments

Comments
 (0)