Commit b336700
cudax/stf: migrate stackable/ from cuda_safe_call to cuda_try (#9165)
* cudax/stf: migrate stackable/ from cuda_safe_call to cuda_try
In stackable_ctx_impl.cuh, replace cuda_safe_call with cuda_try in the
graph_ctx_node constructor and finalize() so CUDA errors are reported as
exceptions rather than aborting the process.
The constructor builds a CUDA graph in stages, so add transactional
cleanup:
- In the nested non-conditional branch, the freshly created
dummy_graph is destroyed intentionally mid-block. Guard it with a
SCOPE(fail) that frees it only while dummy_graph_owned is true, and
disarm the flag right after the intentional destroy.
- The outer `graph` is owned by us only in the non-nested case (in the
nested cases it is either parent_graph or a child of parent_graph,
both owned upstream). A SCOPE(fail) destroys it on early throw and
is disarmed the instant graph_ctx adopts it via
`auto gctx = graph_ctx(sub_graph, ...);`, matching graph_ctx's
documented ownership contract ("User code is not supposed to destroy
the graph later").
- The conditional handle (cudaGraphConditionalHandleCreate) and any
nodes added to `graph` (cudaGraphAddNode, cudaGraphAddKernelNode)
are implicitly cleaned up by the outer SCOPE(fail) destroying
`graph`.
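The guard-and-disarm pattern described above can be sketched without CUDA. This is a minimal illustration, not the code from this commit: `scope_fail` is a hypothetical stand-in for STF's SCOPE(fail), and `fake_graph`, `create_graph`, `destroy_graph`, `build_graph` and the `fail_at` parameter are invented for the example.

```cpp
#include <cassert>
#include <exception>
#include <stdexcept>
#include <utility>

// Hypothetical stand-in for SCOPE(fail): runs `fn` only when the enclosing
// scope is left because an exception is propagating.
template <class F>
class scope_fail
{
public:
  explicit scope_fail(F fn) : fn_(std::move(fn)) {}
  scope_fail(const scope_fail&)            = delete;
  scope_fail& operator=(const scope_fail&) = delete;
  ~scope_fail()
  {
    if (std::uncaught_exceptions() > depth_) fn_();
  }

private:
  F fn_;
  int depth_ = std::uncaught_exceptions();
};

// Invented resource standing in for a CUDA graph handle.
struct fake_graph {};
inline int live_graphs = 0;

inline fake_graph* create_graph()
{
  ++live_graphs;
  return new fake_graph{};
}

inline void destroy_graph(fake_graph* g) noexcept
{
  assert(g != nullptr);
  --live_graphs;
  delete g;
}

// Builds in stages; `fail_at` picks a stage that throws (0 = no failure).
inline fake_graph* build_graph(int fail_at)
{
  fake_graph* dummy = create_graph();
  bool dummy_owned  = true;
  scope_fail dummy_guard([&]() noexcept { if (dummy_owned) destroy_graph(dummy); });

  if (fail_at == 1) throw std::runtime_error("stage 1 failed");

  destroy_graph(dummy); // intentional mid-block destroy
  dummy_owned = false;  // disarm so the guard cannot double-free

  fake_graph* graph = create_graph();
  bool graph_owned  = true;
  scope_fail graph_guard([&]() noexcept { if (graph_owned) destroy_graph(graph); });

  if (fail_at == 2) throw std::runtime_error("stage 2 failed");

  graph_owned = false; // ownership handed to the caller
  return graph;
}
```

Calling `build_graph(1)` or `build_graph(2)` throws and leaves `live_graphs` at zero; `build_graph(0)` returns a graph that the caller must destroy.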
Two residual hazards are intentionally documented inline rather than
fixed in this commit:
- cudaGraphAddChildGraphNode leaves an orphaned child node inside
parent_graph on later throw; cleanly removing it would need
cudaGraphDestroyNode and dependency rewiring.
- cudaGraphConditionalHandleCreate writes a handle into a caller-owned
pointer; CUDA has no destroy API for conditional handles, so on
throw the handle is left invalid (its backing graph is destroyed).
Both are no worse than the prior behavior (which aborted).
The four cuda_safe_call sites in finalize() (cudaGraphAddDependencies
on both CTK branches, cudaGraphDebugDotPrint, cudaGraphLaunch) become
plain cuda_try; no resource rollback applies.
The two cuda_safe_call sites inside the new SCOPE(fail) bodies are
intentional: SCOPE bodies are noexcept, so cuda_safe_call is the
correct tool there.
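A rough sketch of why an aborting check fits inside a noexcept cleanup body. `check_or_throw` and `check_or_abort` are hypothetical analogues of cuda_try and cuda_safe_call, written only for this illustration; `status_t`, `cleanup_guard` and `release_ok` are likewise invented.

```cpp
#include <cassert>
#include <cstdio>
#include <cstdlib>
#include <stdexcept>

using status_t = int; // 0 means success; stand-in for a CUDA error code

// Throwing check: suitable for normal control flow.
inline void check_or_throw(status_t s)
{
  if (s != 0) throw std::runtime_error("call failed");
}

// Aborting check: never throws, so it is safe inside noexcept cleanup code.
inline void check_or_abort(status_t s) noexcept
{
  if (s != 0)
  {
    std::fprintf(stderr, "call failed with status %d\n", s);
    std::abort();
  }
}

using release_fn = status_t (*)();

// The destructor is implicitly noexcept. A throwing check here would end in
// std::terminate anyway, so the explicit abort gives a clearer failure.
struct cleanup_guard
{
  release_fn release;
  explicit cleanup_guard(release_fn r) : release(r) { assert(release != nullptr); }
  ~cleanup_guard() { check_or_abort(release()); }
};

inline int released = 0;
inline status_t release_ok()
{
  ++released;
  return 0;
}
```

The guard still runs while an exception from `check_or_throw` unwinds the stack, and its own check cannot start a second exception.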
In stackable_ctx.cuh, the two cuda_safe_call sites inside
UNITTEST host-task lambdas are kept and annotated. Those lambdas
are dispatched by the STF host-task path, whose exception-safety has
not been audited, so an abort remains safer than an unannotated throw.
* cudax/stf: use templated cuda_try<F> form for graph out-params in stackable/
Where a graph_ctx_node site captures a CUDA out-parameter, switch from
the runtime-status form `cuda_try(cudaFn(&out, ...))` to the templated
form `out = cuda_try<cudaFn>(...)`. This allows the captured handle to
be a single const-initialized local instead of declare-then-fill.
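To illustrate the difference between the two call shapes, here is a simplified sketch that handles only the first-output case. `try_out` is a hypothetical helper and not the real cuda_try; `fake_graph_create`, `throw_on_error` and the handle types are invented stand-ins.

```cpp
#include <cassert>
#include <stdexcept>

using status_t = int; // 0 means success
struct graph_s { int id; };
using graph_t = graph_s*;

// Invented CUDA-style function: output handle first, status returned.
inline status_t fake_graph_create(graph_t* out, unsigned flags)
{
  assert(out != nullptr);
  if (flags != 0) return 1;
  static graph_s storage{42};
  *out = &storage;
  return 0;
}

inline void throw_on_error(status_t s)
{
  if (s != 0) throw std::runtime_error("call failed");
}

// Extracts Out from a function pointer whose first parameter is Out*.
template <class F>
struct first_out;
template <class R, class Out, class... Rest>
struct first_out<R (*)(Out*, Rest...)>
{
  using type = Out;
};

// Templated form: synthesizes the output local, checks the status, and
// returns the captured value.
template <auto Fn, class... Args>
auto try_out(Args... args)
{
  typename first_out<decltype(Fn)>::type out{};
  throw_on_error(Fn(&out, args...));
  return out;
}

inline bool forms_agree()
{
  graph_t a = nullptr;                              // declare-then-fill
  throw_on_error(fake_graph_create(&a, 0u));
  const graph_t b = try_out<fake_graph_create>(0u); // const-initialized
  return a == b;
}
```

The templated form lets `b` be declared const at its point of initialization, which the declare-then-fill form cannot do.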
Converted (each capturing one output handle):
- cudaGraphCreate (x2) -> dummy_graph / graph
- cudaGraphAddChildGraphNode -> const n
- cudaGraphChildGraphNodeGetGraph -> graph (last-output-param form)
- cudaGraphAddNode (both CTK branches) -> const conditionalNode
- cudaGraphAddKernelNode -> const reset_node
dummy_graph stays non-const because it is reset to nullptr to disarm its
SCOPE(fail) guard after the intentional destroy.
Left in runtime-status form on purpose:
- cudaGraphConditionalHandleCreate: its output is written into the
caller-owned config.conditional_handle, not a synthesizable local;
the templated form would create a throwaway local and lose the write.
- cudaGraphDestroy, cudaGraphAddDependencies, cudaGraphDebugDotPrint,
cudaGraphLaunch: no captured output, so the templated form adds
nothing.
cudaGraphAddNode is convertible despite its trailing non-const
cudaGraphNodeParams* because the last-output interpretation fails to
typecheck (cudaGraph_t is not convertible to cudaGraphNode_t* once the
synthesized pointer is appended), so cuda_try selects the first-output
form unambiguously.
1 parent ee9f95b commit b336700
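The first-output versus last-output argument made for cudaGraphAddNode in the commit message can be sketched with a type check on both candidate call shapes. This is a hypothetical reconstruction and not how cuda_try is implemented: `try_call`, `fn_traits` and the `fake_*` functions are invented, and the fake signatures only loosely mirror the CUDA ones.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <tuple>
#include <type_traits>

using status_t = int; // 0 means success
struct graph_s {};
struct node_s {};
struct node_params { int kind; };
using graph_t = graph_s*;
using node_t  = node_s*;

inline node_s the_node;
inline graph_s the_child;

// Output first, with a trailing non-const pointer parameter.
inline status_t fake_add_node(node_t* out, graph_t, const node_t*, std::size_t, node_params* p)
{
  assert(out != nullptr);
  *out = &the_node;
  return p != nullptr ? 0 : 1;
}

// Output last.
inline status_t fake_child_get_graph(node_t, graph_t* out)
{
  assert(out != nullptr);
  *out = &the_child;
  return 0;
}

template <class F>
struct fn_traits;
template <class R, class... P>
struct fn_traits<R (*)(P...)>
{
  using first = std::tuple_element_t<0, std::tuple<P...>>;
  using last  = std::tuple_element_t<sizeof...(P) - 1, std::tuple<P...>>;
};

template <auto Fn, class... Args>
auto try_call(Args... args)
{
  using F = decltype(Fn);
  using T = fn_traits<F>;
  // Does the call typecheck with a synthesized pointer prepended or appended?
  constexpr bool first_ok = std::is_invocable_v<F, typename T::first, Args...>;
  constexpr bool last_ok  = std::is_invocable_v<F, Args..., typename T::last>;
  static_assert(first_ok != last_ok, "exactly one output position must typecheck");
  if constexpr (first_ok)
  {
    std::remove_pointer_t<typename T::first> out{};
    if (Fn(&out, args...) != 0) throw std::runtime_error("call failed");
    return out;
  }
  else
  {
    std::remove_pointer_t<typename T::last> out{};
    if (Fn(args..., &out) != 0) throw std::runtime_error("call failed");
    return out;
  }
}
```

For `fake_add_node`, appending a synthesized pointer would force a `graph_t` argument into a `node_t*` parameter, so only the first-output shape compiles. For `fake_child_get_graph` only the last-output shape does.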
2 files changed
Lines changed: 77 additions & 27 deletions
Lines changed: 6 additions & 0 deletions
Lines changed: 71 additions & 27 deletions