Commit bf08032

Coroutine codegen optimizations from compiler/library author analysis
Extract frame_alloc_mixin as a public mixin so consuming libraries can opt promise types into TLS-based frame allocation. Runners (when_all, when_any) and dispatch_trampoline now inherit it instead of bypassing the recycling allocator via ::operator new.

- Add frame_alloc_mixin (public API in ex/)
- Refactor io_awaitable_promise_base to inherit from the mixin
- Mark unhandled_exception() noexcept on task, when_all_runner, and when_any_runner promises (eliminates implicit try-catch)
- Add BOOST_CAPY_CORO_DESTROY_WHEN_COMPLETE attribute for internal coroutines that always run to completion (when_all_runner, when_any_runner, dispatch_trampoline, run_async_trampoline)
- Switch both coroutine attribute macros to __has_cpp_attribute for Apple Clang compatibility
- Document variable lifetime scoping and GCC vs Clang frame sizes
1 parent b1f8ead commit bf08032

9 files changed, 225 additions and 61 deletions

doc/modules/ROOT/pages/4.coroutines/4g.allocators.adoc

Lines changed: 81 additions & 0 deletions
@@ -176,6 +176,84 @@ void process_batch(std::vector<item> const& items)
 }
 ----
 
+=== Scope Variables to Reduce Frame Size
+
+Compilers use declaration scope (braces) to decide which variables cross suspend points and must live in the coroutine frame. Variables declared in an outer scope remain in the frame even after their last use, as long as a `co_await` follows within the same scope.
+
+Wrapping buffer usage in explicit braces can dramatically reduce frame size:
+
+[source,cpp]
+----
+// BAD: buf lives in frame across all subsequent co_awaits
+task<> process(stream& s)
+{
+    char buf[4096];
+    auto [ec, n] = co_await s.read_some(buf);
+    co_await do_work(buf, n);
+    co_await s.write_some(reply); // buf wastes 4K in frame
+}
+
+// GOOD: braces end buf's lifetime before next suspend
+task<> process(stream& s)
+{
+    std::size_t n;
+    {
+        char buf[4096];
+        auto [ec, n_] = co_await s.read_some(buf);
+        n = n_;
+        co_await do_work(buf, n);
+    }
+    co_await s.write_some(reply); // 4K saved
+}
+----
+
+This technique also enables the compiler to *overlap* variables in the frame. When two variables have completely non-overlapping lifetimes (in separate scoped blocks), the compiler can reuse the same frame memory for both, even on Clang:
+
+[source,cpp]
+----
+// BAD: both arrays in frame simultaneously (8K)
+task<> pipeline(stream& in, stream& out)
+{
+    char read_buf[4096];
+    auto [ec1, n] = co_await in.read_some(read_buf);
+
+    char write_buf[4096];
+    prepare(write_buf, n);
+    co_await out.write_some(write_buf);
+}
+
+// GOOD: non-overlapping scopes allow frame reuse (4K)
+task<> pipeline(stream& in, stream& out)
+{
+    std::size_t n;
+    {
+        char read_buf[4096];
+        auto [ec, n_] = co_await in.read_some(read_buf);
+        n = n_;
+    }
+    {
+        char write_buf[4096];
+        prepare(write_buf, n);
+        co_await out.write_some(write_buf);
+    }
+}
+----
+
+In the second version, `read_buf` and `write_buf` never coexist, so the compiler can place them at the same frame offset, halving the frame's buffer footprint. This optimization applies to any variables with non-overlapping lifetimes, not just arrays.
+
+=== GCC vs Clang Frame Sizes
+
+NOTE: This section draws on https://chuanqixu9.github.io/c++/2026/03/27/C++20-Coroutines-from-compiler-and-library-authors-perspective.en.html[C++20 Coroutines from compiler and library authors' perspective] by Chuanqi Xu.
+
+GCC and Clang use fundamentally different strategies for coroutine frame layout:
+
+* **Clang** performs frame layout after middle-end optimizations. Dead variables, unused temporaries, and constant-folded intermediates are eliminated before the frame is sized.
+* **GCC** performs frame layout in the frontend, before optimizations. Every local variable whose scope spans a suspend point ends up in the frame, even if optimizations would later prove it dead.
+
+The practical consequence is that GCC coroutine frames are often 5-10x larger than Clang's for the same source code, and the gap can be far wider: in one benchmark, the same coroutine produced a 24-byte frame on Clang and a 16,032-byte frame on GCC (roughly 670x).
+
+For production coroutine workloads, Clang currently produces substantially better code. If you must use GCC, pay extra attention to variable scoping (above) and consider supplying a custom `memory_resource` with larger block sizes, since frames above 2048 bytes bypass the default recycling allocator's pooling.
+
 === Profile Before Optimizing
 
 Coroutine frame allocation is rarely the bottleneck. Profile your application before investing in custom allocators.
@@ -189,6 +267,9 @@ Coroutine frame allocation is rarely the bottleneck. Profile your application be
 | `<boost/capy/ex/frame_allocator.hpp>`
 | Frame allocator concept and utilities
 
+| `<boost/capy/ex/frame_alloc_mixin.hpp>`
+| Mixin base for promise types that use the TLS frame allocator
+
 | `<boost/capy/ex/recycling_memory_resource.hpp>`
 | Default recycling allocator implementation
 |===

include/boost/capy/detail/config.hpp

Lines changed: 9 additions & 2 deletions
@@ -132,13 +132,20 @@
 # define BOOST_CAPY_DECL
 #endif
 
-// Clang 20+ supports coro_await_elidable for heap elision
-#if defined(__clang__) && !defined(__apple_build_version__) && __clang_major__ >= 20
+// Heap elision: compiler may allocate elided coroutine frames on the caller's frame
+#if __has_cpp_attribute(clang::coro_await_elidable)
 #define BOOST_CAPY_CORO_AWAIT_ELIDABLE [[clang::coro_await_elidable]]
 #else
 #define BOOST_CAPY_CORO_AWAIT_ELIDABLE
 #endif
 
+// Simpler destroy codegen for coroutines that always run to completion
+#if __has_cpp_attribute(clang::coro_only_destroy_when_complete)
+#define BOOST_CAPY_CORO_DESTROY_WHEN_COMPLETE [[clang::coro_only_destroy_when_complete]]
+#else
+#define BOOST_CAPY_CORO_DESTROY_WHEN_COMPLETE
+#endif
+
 namespace boost::capy::detail {
 inline constexpr unsigned max_iovec_ = 16;
 }
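
The switch from compiler-version checks to `__has_cpp_attribute` in the hunk above can be sketched in isolation. This is an illustrative fragment, not the library's configuration: the `DEMO_` macros, `demo_runner`, and `demo_attribute_enabled` are hypothetical names, and the extra `defined(__has_cpp_attribute)` guard is an assumption added for preprocessors that predate the feature-test operator.

```cpp
#include <cassert>

// Ask the compiler whether it knows the attribute instead of
// inferring support from vendor and version macros.
#if defined(__has_cpp_attribute)
# if __has_cpp_attribute(clang::coro_only_destroy_when_complete)
#  define DEMO_DESTROY_WHEN_COMPLETE [[clang::coro_only_destroy_when_complete]]
#  define DEMO_HAS_DESTROY_WHEN_COMPLETE 1
# endif
#endif

// Fallback: expand to nothing when the attribute is unknown.
#ifndef DEMO_DESTROY_WHEN_COMPLETE
# define DEMO_DESTROY_WHEN_COMPLETE
# define DEMO_HAS_DESTROY_WHEN_COMPLETE 0
#endif

// The attribute sits between the class-key and the class name,
// matching how the diff annotates the trampoline and runner types.
struct DEMO_DESTROY_WHEN_COMPLETE demo_runner
{
    int value = 42;
};

// Reports at compile time which branch the preprocessor selected.
constexpr bool demo_attribute_enabled()
{
    return DEMO_HAS_DESTROY_WHEN_COMPLETE != 0;
}
```

Because the check is keyed on the attribute itself, a vendor fork that reports an unrelated version number is handled the same way as upstream Clang, which is the compatibility motivation the commit message gives.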

include/boost/capy/ex/frame_alloc_mixin.hpp

Lines changed: 118 additions & 0 deletions
@@ -0,0 +1,118 @@
+//
+// Copyright (c) 2026 Michael Vandeberg
+//
+// Distributed under the Boost Software License, Version 1.0. (See accompanying
+// file LICENSE_1_0.txt or copy at http://www.boost.org/LICENSE_1_0.txt)
+//
+// Official repository: https://github.com/cppalliance/capy
+//
+
+#ifndef BOOST_CAPY_EX_FRAME_ALLOC_MIXIN_HPP
+#define BOOST_CAPY_EX_FRAME_ALLOC_MIXIN_HPP
+
+#include <boost/capy/detail/config.hpp>
+#include <boost/capy/ex/frame_allocator.hpp>
+#include <boost/capy/ex/recycling_memory_resource.hpp>
+
+#include <cstddef>
+#include <cstring>
+#include <memory_resource>
+
+namespace boost {
+namespace capy {
+
+/** Mixin that adds frame-allocator-aware allocation to a promise type.
+
+    Inherit from this class in any coroutine promise type to opt into
+    TLS-based frame allocation with the recycling memory resource
+    fast path. The mixin provides `operator new` and `operator delete`
+    that:
+
+    1. Read the thread-local frame allocator set by `run_async` or `run`.
+    2. Bypass virtual dispatch when the allocator is the default
+       recycling memory resource.
+    3. Store the allocator pointer at the end of each frame for
+       correct deallocation even when TLS changes between allocation
+       and deallocation.
+
+    This is the same allocation strategy used by @ref
+    io_awaitable_promise_base. Use this mixin directly when your
+    promise type does not need the full environment and continuation
+    support that `io_awaitable_promise_base` provides.
+
+    @par Example
+    @code
+    struct my_internal_coroutine
+    {
+        struct promise_type : frame_alloc_mixin
+        {
+            my_internal_coroutine get_return_object();
+            std::suspend_always initial_suspend() noexcept;
+            std::suspend_always final_suspend() noexcept;
+            void return_void();
+            void unhandled_exception() noexcept;
+        };
+    };
+    @endcode
+
+    @par Thread Safety
+    The allocation fast path uses thread-local storage and requires
+    no synchronization. The global pool fallback is mutex-protected.
+
+    @see io_awaitable_promise_base, frame_allocator, recycling_memory_resource
+*/
+struct frame_alloc_mixin
+{
+    /** Allocate a coroutine frame.
+
+        Uses the thread-local frame allocator set by run_async.
+        Falls back to default memory resource if not set.
+        Stores the allocator pointer at the end of each frame for
+        correct deallocation even when TLS changes. Uses memcpy
+        to avoid alignment requirements on the trailing pointer.
+        Bypasses virtual dispatch for the recycling allocator.
+    */
+    static void* operator new(std::size_t size)
+    {
+        static auto* const rmr = get_recycling_memory_resource();
+
+        auto* mr = get_current_frame_allocator();
+        if(!mr)
+            mr = std::pmr::get_default_resource();
+
+        auto total = size + sizeof(std::pmr::memory_resource*);
+        void* raw;
+        if(mr == rmr)
+            raw = static_cast<recycling_memory_resource*>(mr)
+                ->allocate_fast(total, alignof(std::max_align_t));
+        else
+            raw = mr->allocate(total, alignof(std::max_align_t));
+        std::memcpy(static_cast<char*>(raw) + size, &mr, sizeof(mr));
+        return raw;
+    }
+
+    /** Deallocate a coroutine frame.
+
+        Reads the allocator pointer stored at the end of the frame
+        to ensure correct deallocation regardless of current TLS.
+        Bypasses virtual dispatch for the recycling allocator.
+    */
+    static void operator delete(void* ptr, std::size_t size) noexcept
+    {
+        static auto* const rmr = get_recycling_memory_resource();
+
+        std::pmr::memory_resource* mr;
+        std::memcpy(&mr, static_cast<char*>(ptr) + size, sizeof(mr));
+        auto total = size + sizeof(std::pmr::memory_resource*);
+        if(mr == rmr)
+            static_cast<recycling_memory_resource*>(mr)
+                ->deallocate_fast(ptr, total, alignof(std::max_align_t));
+        else
+            mr->deallocate(ptr, total, alignof(std::max_align_t));
+    }
+};
+
+} // namespace capy
+} // namespace boost
+
+#endif
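
The trailing-pointer scheme described in the mixin's comments can be exercised without coroutines or Capy. The sketch below is an approximation under stated assumptions: `counting_resource`, `tl_current`, `trailing_ptr_alloc`, `node`, and `demo_switch` are hypothetical stand-ins, `tl_current` plays the role of the thread-local frame allocator, and the recycling fast path is omitted. Only standard `std::pmr` calls are used.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <memory_resource>

// Hypothetical resource that counts calls so the routing can be observed.
class counting_resource : public std::pmr::memory_resource
{
public:
    int allocs = 0;
    int deallocs = 0;

private:
    void* do_allocate(std::size_t n, std::size_t align) override
    {
        ++allocs;
        return std::pmr::new_delete_resource()->allocate(n, align);
    }
    void do_deallocate(void* p, std::size_t n, std::size_t align) override
    {
        ++deallocs;
        std::pmr::new_delete_resource()->deallocate(p, n, align);
    }
    bool do_is_equal(std::pmr::memory_resource const& o) const noexcept override
    {
        return this == &o;
    }
};

// Stand-in for the thread-local "current frame allocator" (an assumption).
inline thread_local std::pmr::memory_resource* tl_current = nullptr;

struct trailing_ptr_alloc
{
    static void* operator new(std::size_t size)
    {
        auto* mr = tl_current ? tl_current : std::pmr::get_default_resource();
        // Reserve room for one pointer past the object itself.
        auto total = size + sizeof(mr);
        void* raw = mr->allocate(total, alignof(std::max_align_t));
        // memcpy avoids any alignment requirement on the trailing slot.
        std::memcpy(static_cast<char*>(raw) + size, &mr, sizeof(mr));
        return raw;
    }

    static void operator delete(void* ptr, std::size_t size) noexcept
    {
        // Recover the resource that allocated this block, ignoring tl_current.
        std::pmr::memory_resource* mr;
        std::memcpy(&mr, static_cast<char*>(ptr) + size, sizeof(mr));
        mr->deallocate(ptr, size + sizeof(mr), alignof(std::max_align_t));
    }
};

// Any type that inherits the base picks up the allocation functions.
struct node : trailing_ptr_alloc
{
    int payload[4] = {};
};

// Allocates under `a`, switches the current resource to `b`, then deletes.
inline void demo_switch(counting_resource& a, counting_resource& b)
{
    tl_current = &a;
    node* n = new node;
    tl_current = &b; // the "TLS" value changes before deallocation
    delete n;        // still routed back to `a` via the stored pointer
    tl_current = nullptr;
}
```

After `demo_switch(a, b)`, resource `a` has seen one allocation and one deallocation and `b` has seen neither, which is the property the mixin's comments describe as correct deallocation even when TLS changes.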

include/boost/capy/ex/io_awaitable_promise_base.hpp

Lines changed: 2 additions & 51 deletions
@@ -11,14 +11,12 @@
 #define BOOST_CAPY_EX_IO_AWAITABLE_PROMISE_BASE_HPP
 
 #include <boost/capy/detail/config.hpp>
+#include <boost/capy/ex/frame_alloc_mixin.hpp>
 #include <boost/capy/ex/frame_allocator.hpp>
 #include <boost/capy/ex/io_env.hpp>
-#include <boost/capy/ex/recycling_memory_resource.hpp>
 #include <boost/capy/ex/this_coro.hpp>
 
 #include <coroutine>
-#include <cstddef>
-#include <cstring>
 #include <memory_resource>
 #include <stop_token>
 #include <type_traits>
@@ -131,59 +129,12 @@ namespace capy {
 */
 template<typename Derived>
 class io_awaitable_promise_base
+    : public frame_alloc_mixin
 {
     io_env const* env_ = nullptr;
     mutable std::coroutine_handle<> cont_{std::noop_coroutine()};
 
 public:
-    /** Allocate a coroutine frame.
-
-        Uses the thread-local frame allocator set by run_async.
-        Falls back to default memory resource if not set.
-        Stores the allocator pointer at the end of each frame for
-        correct deallocation even when TLS changes. Uses memcpy
-        to avoid alignment requirements on the trailing pointer.
-        Bypasses virtual dispatch for the recycling allocator.
-    */
-    static void* operator new(std::size_t size)
-    {
-        static auto* const rmr = get_recycling_memory_resource();
-
-        auto* mr = get_current_frame_allocator();
-        if(!mr)
-            mr = std::pmr::get_default_resource();
-
-        auto total = size + sizeof(std::pmr::memory_resource*);
-        void* raw;
-        if(mr == rmr)
-            raw = static_cast<recycling_memory_resource*>(mr)
-                ->allocate_fast(total, alignof(std::max_align_t));
-        else
-            raw = mr->allocate(total, alignof(std::max_align_t));
-        std::memcpy(static_cast<char*>(raw) + size, &mr, sizeof(mr));
-        return raw;
-    }
-
-    /** Deallocate a coroutine frame.
-
-        Reads the allocator pointer stored at the end of the frame
-        to ensure correct deallocation regardless of current TLS.
-        Bypasses virtual dispatch for the recycling allocator.
-    */
-    static void operator delete(void* ptr, std::size_t size) noexcept
-    {
-        static auto* const rmr = get_recycling_memory_resource();
-
-        std::pmr::memory_resource* mr;
-        std::memcpy(&mr, static_cast<char*>(ptr) + size, sizeof(mr));
-        auto total = size + sizeof(std::pmr::memory_resource*);
-        if(mr == rmr)
-            static_cast<recycling_memory_resource*>(mr)
-                ->deallocate_fast(ptr, total, alignof(std::max_align_t));
-        else
-            mr->deallocate(ptr, total, alignof(std::max_align_t));
-    }
-
     ~io_awaitable_promise_base()
     {
         // Abnormal teardown: destroy orphaned continuation

include/boost/capy/ex/run.hpp

Lines changed: 3 additions & 1 deletion
@@ -17,6 +17,7 @@
 #include <boost/capy/concept/io_runnable.hpp>
 #include <boost/capy/ex/executor_ref.hpp>
 #include <coroutine>
+#include <boost/capy/ex/frame_alloc_mixin.hpp>
 #include <boost/capy/ex/frame_allocator.hpp>
 #include <boost/capy/ex/io_env.hpp>
 
@@ -69,9 +70,10 @@ namespace boost::capy::detail {
 
     The trampoline never touches the task's result.
 */
-struct dispatch_trampoline
+struct BOOST_CAPY_CORO_DESTROY_WHEN_COMPLETE dispatch_trampoline
 {
     struct promise_type
+        : frame_alloc_mixin
     {
         executor_ref caller_ex_;
         continuation parent_;

include/boost/capy/ex/run_async.hpp

Lines changed: 3 additions & 2 deletions
@@ -82,7 +82,7 @@ struct get_promise_awaiter
     @tparam Alloc The allocator type (value type or memory_resource*).
 */
 template<class Ex, class Handlers, class Alloc>
-struct run_async_trampoline
+struct BOOST_CAPY_CORO_DESTROY_WHEN_COMPLETE run_async_trampoline
 {
     using invoke_fn = void(*)(void*, Handlers&);
 
@@ -193,7 +193,8 @@ struct run_async_trampoline
     This avoids double indirection when the user passes a memory_resource*.
 */
 template<class Ex, class Handlers>
-struct run_async_trampoline<Ex, Handlers, std::pmr::memory_resource*>
+struct BOOST_CAPY_CORO_DESTROY_WHEN_COMPLETE
+    run_async_trampoline<Ex, Handlers, std::pmr::memory_resource*>
 {
     using invoke_fn = void(*)(void*, Handlers&);
 

include/boost/capy/task.hpp

Lines changed: 1 addition & 1 deletion
@@ -178,7 +178,7 @@ struct [[nodiscard]] BOOST_CAPY_CORO_AWAIT_ELIDABLE
             return awaiter{this};
         }
 
-        void unhandled_exception()
+        void unhandled_exception() noexcept
         {
             new (&ep_) std::exception_ptr(std::current_exception());
             has_ep_ = true;
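
The hunk above marks an exception-capturing `unhandled_exception()` as `noexcept`. A small sketch, independent of coroutines, of why that capture pattern is compatible with `noexcept`: `std::current_exception()` is itself declared `noexcept`, and copying a `std::exception_ptr` does not throw. `exception_slot` and `capture_and_report` are hypothetical names, and the sketch assigns to a plain member rather than using the placement-new storage shown in the diff.

```cpp
#include <cassert>
#include <exception>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical promise-like holder; not the library's task promise.
struct exception_slot
{
    std::exception_ptr ep_;
    bool has_ep_ = false;

    // Captures the in-flight exception without throwing.
    void unhandled_exception() noexcept
    {
        ep_ = std::current_exception();
        has_ep_ = true;
    }
};

// The compiler can see that the capture never throws.
static_assert(
    noexcept(std::declval<exception_slot&>().unhandled_exception()),
    "capture must be noexcept");

// Throws, captures inside the handler, then rethrows to read the message.
inline std::string capture_and_report()
{
    exception_slot slot;
    try
    {
        throw std::runtime_error("boom");
    }
    catch(...)
    {
        slot.unhandled_exception();
    }

    if(!slot.has_ep_)
        return {};

    try
    {
        std::rethrow_exception(slot.ep_);
    }
    catch(std::exception const& e)
    {
        return e.what();
    }
    return {};
}
```

The stored pointer keeps the exception alive until a consumer rethrows it, so nothing is lost by declaring the capture step non-throwing.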

include/boost/capy/when_all.hpp

Lines changed: 4 additions & 2 deletions
@@ -16,6 +16,7 @@
 #include <boost/capy/concept/executor.hpp>
 #include <boost/capy/concept/io_awaitable.hpp>
 #include <coroutine>
+#include <boost/capy/ex/frame_alloc_mixin.hpp>
 #include <boost/capy/ex/io_env.hpp>
 #include <boost/capy/ex/frame_allocator.hpp>
 #include <boost/capy/task.hpp>
@@ -207,9 +208,10 @@ struct when_all_homogeneous_state<std::tuple<>>
     @tparam StateType The state type (when_all_state or when_all_homogeneous_state).
 */
 template<typename StateType>
-struct when_all_runner
+struct BOOST_CAPY_CORO_DESTROY_WHEN_COMPLETE when_all_runner
 {
     struct promise_type
+        : frame_alloc_mixin
     {
         StateType* state_ = nullptr;
         std::size_t index_ = 0;
@@ -253,7 +255,7 @@ struct when_all_runner
 
         void return_void() noexcept {}
 
-        void unhandled_exception()
+        void unhandled_exception() noexcept
         {
             state_->core_.capture_exception(std::current_exception());
             state_->core_.stop_source_.request_stop();
