| 1 | +# Async-activity load benchmark results |
| 2 | + |
| 3 | +Generated by `bench_async_activities.py`. Re-run with: |
| 4 | + |
| 5 | +```bash |
| 6 | +uv run python ext/dapr-ext-workflow/benchmarks/bench_async_activities.py |
| 7 | +``` |
| 8 | + |
| 9 | +## Run environment |
| 10 | + |
| 11 | +- **Timestamp**: 2026-05-25 20:40:09 UTC |
| 12 | +- **Git commit**: `8f13da0-dirty` |
| 13 | +- **Python**: CPython 3.13.12 |
| 14 | +- **OS**: Darwin 25.5.0 (arm64) |
| 15 | +- **Platform**: `macOS-26.5-arm64-arm-64bit-Mach-O` |
| 16 | +- **CPU**: Apple M3 Pro (12 logical cores) |
| 17 | +- **Memory**: 36.0 GB |
| 18 | +- **asyncio default executor**: max_workers = 16 (`min(32, cpu_count + 4)`) |
| 19 | +- **CI environment**: no |
| 20 | + |
| 21 | +**Numbers from this report are specific to this machine.** Re-run the benchmark on your hardware before drawing conclusions; on a small CI runner or a busy workstation they will diverge. The shape of the curves (throughput plateau, p99 inflection, drift) is what to compare across machines. |
| 22 | + |
| 23 | + |
| 24 | +Each scenario drives the production dispatch path (`TaskHubGrpcWorker._execute_activity_async`) through `_AsyncWorkerManager` against a mock `CompleteActivityTask` stub. End-to-end latency is measured from `submit_activity` to the mock stub receiving the response, so queue wait, semaphore acquisition, activity work, response build, and `run_in_executor` delivery are all included. |
| 25 | + |
| 26 | +## 1. Concurrency win (issue #897 repro) |
| 27 | + |
| 28 | +Shows that async activities run concurrently on the event loop, while the sync path is gated by the thread pool. Both rows reuse the original repro: 100 × 1 s HTTP fetches. |
| 29 | + |
| 30 | +| Scenario | N | Sem | Pool | Latency (s) | Wallclock (s) | Tput/s | Peak tasks | Peak queue | Peak RSS Δ (MB) | Notes | |
| 31 | +| --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | --- | |
| 32 | +| Async fan-out (issue #897 repro) | 100 | 1000 | 8 | 1.000 | 1.47 | 68.1 | 305 | 0 | 86.4 | 100 awaits run concurrently on the loop | |
| 33 | +| Sync baseline (pre-#897 behavior) | 100 | 1000 | 8 | 1.000 | 13.34 | 7.5 | 121 | 0 | 2.4 | gated by thread pool size, demonstrates the bug from #897 | |
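
The gap between the two rows can be reproduced in miniature with plain `asyncio` and a thread pool. This is a minimal sketch, not the benchmark's code: the task count, sleep duration, and pool size are scaled-down assumptions, and `asyncio.sleep` / `time.sleep` stand in for the HTTP fetch.

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor

N = 40          # hypothetical task count (the report uses 100)
LATENCY = 0.02  # hypothetical per-task latency in seconds (the report uses 1 s)
POOL = 4        # hypothetical thread pool size (the report uses 8)


async def run_async() -> float:
    """Run N awaits concurrently on one event loop; return elapsed seconds."""
    start = time.perf_counter()
    await asyncio.gather(*(asyncio.sleep(LATENCY) for _ in range(N)))
    return time.perf_counter() - start


def run_sync() -> float:
    """Run N blocking sleeps through a POOL-sized thread pool."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=POOL) as pool:
        # Only POOL sleeps can run at once, so the batch drains in N / POOL waves.
        list(pool.map(time.sleep, [LATENCY] * N))
    return time.perf_counter() - start


async_s = asyncio.run(run_async())
sync_s = run_sync()
print(f"async: {async_s:.3f} s, sync: {sync_s:.3f} s")
```

Applying the same wave reasoning to the report's parameters gives ceil(100 / 8) × 1 s = 13 s for the sync path, close to the measured 13.34 s.
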
| 34 | + |
| 35 | +## 2. Throughput scaling |
| 36 | + |
| 37 | +Async fan-out at 50 ms server latency, semaphore cap 5000, thread pool 16. Throughput is reported as items completed per wallclock second; the sweep shows where the curve flattens. |
| 38 | + |
| 39 | +| Scenario | N | Sem | Pool | Latency (s) | Wallclock (s) | Tput/s | p50 ms | p95 ms | p99 ms | Peak tasks | Peak queue | Peak RSS Δ (MB) | Notes | |
| 40 | +| --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | --- | |
| 41 | +| Throughput N=100 | 100 | 5000 | 16 | 0.050 | 0.06 | 1542.3 | 62.0 | 64.1 | 64.1 | 105 | 0 | 0.0 | full _execute_activity_async path + mock CompleteActivityTask | |
| 42 | +| Throughput N=500 | 500 | 5000 | 16 | 0.050 | 0.08 | 5931.1 | 78.6 | 79.6 | 79.6 | 505 | 0 | 0.4 | full _execute_activity_async path + mock CompleteActivityTask | |
| 43 | +| Throughput N=1000 | 1000 | 5000 | 16 | 0.050 | 0.11 | 8956.5 | 102.9 | 106.2 | 106.3 | 1005 | 0 | 2.9 | full _execute_activity_async path + mock CompleteActivityTask | |
| 44 | +| Throughput N=2500 | 2500 | 5000 | 16 | 0.050 | 0.24 | 10532.0 | 218.8 | 225.3 | 225.9 | 2505 | 0 | 10.0 | full _execute_activity_async path + mock CompleteActivityTask | |
| 45 | +| Throughput N=5000 | 5000 | 5000 | 16 | 0.050 | 0.57 | 8696.7 | 543.8 | 557.2 | 558.7 | 5005 | 0 | 25.2 | full _execute_activity_async path + mock CompleteActivityTask | |
| 46 | + |
| 47 | +## 3. Semaphore-cap sensitivity |
| 48 | + |
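| 49 | +N=2500 async activities at 50 ms server latency. A cap below ~500 leaves most of the batch waiting on the semaphore and inflates queue wait. Above that, gains compress: cap=50 to cap=500 raises throughput about 6.7×, while cap=500 to cap=5000 adds only about 1.7×. |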
| 50 | + |
| 51 | +| Scenario | N | Sem | Pool | Latency (s) | Wallclock (s) | Tput/s | p50 ms | p95 ms | p99 ms | Peak tasks | Peak queue | Peak RSS Δ (MB) | Notes | |
| 52 | +| --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | --- | |
| 53 | +| Sem cap=50 | 2500 | 50 | 16 | 0.050 | 2.69 | 928.6 | 1422.7 | 2583.5 | 2687.0 | 2505 | 0 | 0.0 | lower caps serialize the batch through fewer parallel slots | |
| 54 | +| Sem cap=100 | 2500 | 100 | 16 | 0.050 | 1.42 | 1758.2 | 794.9 | 1360.7 | 1412.0 | 2505 | 0 | 0.0 | lower caps serialize the batch through fewer parallel slots | |
| 55 | +| Sem cap=500 | 2500 | 500 | 16 | 0.050 | 0.40 | 6229.5 | 279.2 | 387.9 | 392.3 | 2505 | 0 | 0.0 | gains shrink as the cap approaches N | |
| 56 | +| Sem cap=1000 | 2500 | 1000 | 16 | 0.050 | 0.30 | 8322.3 | 235.6 | 286.9 | 290.2 | 2505 | 0 | 0.0 | gains shrink as the cap approaches N | |
| 57 | +| Sem cap=5000 | 2500 | 5000 | 16 | 0.050 | 0.23 | 10720.7 | 215.0 | 222.3 | 222.8 | 2505 | 0 | 0.0 | cap exceeds N, so the semaphore never blocks | |
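
One way to read the sweep is a simple wave model: with a cap of `cap` slots, a batch of `n` activities drains in `ceil(n / cap)` waves of one server latency each. The sketch below compares that idealized lower bound with the wallclock values copied from the table. The helper `ideal_wallclock` is hypothetical and not part of the benchmark script.

```python
import math


def ideal_wallclock(n: int, cap: int, latency_s: float) -> float:
    """Idealized lower bound: ceil(n / cap) waves, each taking latency_s."""
    return math.ceil(n / cap) * latency_s


# (cap, measured wallclock in seconds) copied from the table above.
measured = [(50, 2.69), (100, 1.42), (500, 0.40), (1000, 0.30), (5000, 0.23)]

overheads = []
for cap, observed in measured:
    ideal = ideal_wallclock(2500, cap, 0.050)
    overheads.append(observed - ideal)
    print(f"cap={cap:>4}: ideal {ideal:.2f} s, measured {observed:.2f} s, "
          f"overhead {observed - ideal:.2f} s")
```

In this run the measured value exceeds the model by 0.15 s to 0.19 s at every cap, which suggests a roughly fixed per-batch cost on top of the wave model rather than a cost that scales with the cap.
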
| 58 | + |
| 59 | +## 4. Failure threshold (queue-wait inflection) |
| 60 | + |
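| 61 | +Cap held at 1000 while N ramps. While N is well below the cap, p99 stays close to server latency. Past the cap, queue wait dominates and p99 climbs with `N / cap`; in this run it grows faster than linearly (298.0 ms, 700.5 ms, and 2046.2 ms at N = 2500, 5000, and 10000). |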
| 62 | + |
| 63 | +| Scenario | N | Sem | Pool | Latency (s) | Wallclock (s) | Tput/s | p50 ms | p95 ms | p99 ms | Peak tasks | Peak queue | Peak RSS Δ (MB) | Notes | |
| 64 | +| --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | --- | |
| 65 | +| Threshold N=500 (cap=1000) | 500 | 1000 | 16 | 0.050 | 0.08 | 6264.2 | 70.6 | 77.4 | 77.5 | 505 | 0 | 0.0 | N <= cap, so the semaphore never blocks | |
| 66 | +| Threshold N=1000 (cap=1000) | 1000 | 1000 | 16 | 0.050 | 0.11 | 9145.3 | 94.5 | 104.2 | 104.7 | 1005 | 0 | 0.0 | N <= cap, so the semaphore never blocks | |
| 67 | +| Threshold N=2500 (cap=1000) | 2500 | 1000 | 16 | 0.050 | 0.31 | 8086.8 | 243.6 | 294.2 | 298.0 | 2505 | 0 | 0.0 | N > cap forces queue wait; p99 grows linearly | |
| 68 | +| Threshold N=5000 (cap=1000) | 5000 | 1000 | 16 | 0.050 | 0.72 | 6983.2 | 584.2 | 691.1 | 700.5 | 5005 | 0 | 0.0 | N > cap forces queue wait; p99 grows linearly | |
| 69 | +| Threshold N=10000 (cap=1000) | 10000 | 1000 | 16 | 0.050 | 2.08 | 4813.1 | 1801.7 | 2019.3 | 2046.2 | 10005 | 0 | 1.1 | N > cap forces queue wait; p99 grows linearly | |
| 70 | + |
| 71 | +**Threshold**: p99 first exceeds 2x server latency (100.0 ms) at **N=1000** with cap=1000 (p99 = 104.7 ms). |
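
The threshold rule stated above (first N whose p99 exceeds twice the server latency) can be checked directly against the table. This is a small sketch with the rows copied by hand; the function name `first_breach` is hypothetical and the benchmark script may compute the threshold differently.

```python
from typing import Optional

SERVER_LATENCY_MS = 50.0

# (N, p99 in ms) copied from the table above, all at cap = 1000.
rows = [(500, 77.5), (1000, 104.7), (2500, 298.0), (5000, 700.5), (10000, 2046.2)]


def first_breach(rows, latency_ms: float, factor: float = 2.0) -> Optional[int]:
    """Return the smallest N whose p99 exceeds factor * latency_ms, or None."""
    limit = factor * latency_ms
    for n, p99_ms in sorted(rows):
        if p99_ms > limit:
            return n
    return None


threshold_n = first_breach(rows, SERVER_LATENCY_MS)
print(threshold_n)  # prints 1000
```
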
| 72 | + |
| 73 | +## 5. Sidecar response delivery overhead |
| 74 | + |
| 75 | +Mock `CompleteActivityTask` is given an artificial delay. Async responses go through `loop.run_in_executor(None, ...)`, so they share asyncio's default executor, which has `min(32, cpu_count + 4)` workers (16 on this run, with `cpu_count=12`). Once more responses are pending than there are workers, deliveries queue behind the pool and tail latency inflates: at N=1000, a 1 ms delay already raises p99 from 101.5 ms to 171.0 ms, and a 10 ms delay raises it to 843.5 ms. |
| 76 | + |
| 77 | +| Scenario | N | Sem | Pool | Latency (s) | Wallclock (s) | Tput/s | p50 ms | p95 ms | p99 ms | Peak tasks | Peak queue | Peak RSS Δ (MB) | Notes | |
| 78 | +| --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | --- | |
| 79 | +| Delivery latency=0ms | 1000 | 1000 | 16 | 0.050 | 0.11 | 9497.2 | 98.2 | 101.3 | 101.5 | 1005 | 0 | 0.0 | asyncio default executor caps response delivery at min(32, cpu+4) workers | |
| 80 | +| Delivery latency=1ms | 1000 | 1000 | 16 | 0.050 | 0.18 | 5699.8 | 133.0 | 167.7 | 171.0 | 1005 | 0 | 0.0 | asyncio default executor caps response delivery at min(32, cpu+4) workers | |
| 81 | +| Delivery latency=5ms | 1000 | 1000 | 16 | 0.050 | 0.48 | 2077.9 | 287.7 | 458.6 | 473.4 | 1005 | 0 | 0.0 | asyncio default executor caps response delivery at min(32, cpu+4) workers | |
| 82 | +| Delivery latency=10ms | 1000 | 1000 | 16 | 0.050 | 0.86 | 1162.5 | 494.1 | 820.4 | 843.5 | 1005 | 0 | 0.0 | asyncio default executor caps response delivery at min(32, cpu+4) workers | |
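
A rough floor for these rows follows from the executor size: all 1000 activities finish after about one server latency, and their responses then share 16 worker threads. The sketch below computes that floor and compares it with the measured wallclock. The helper `delivery_floor_s` is hypothetical, and the model ignores event-loop overhead, so it is a lower bound under those assumptions rather than a prediction.

```python
def delivery_floor_s(n: int, delay_s: float, workers: int, latency_s: float) -> float:
    """Lower bound on wallclock: one activity latency, then n deliveries of
    delay_s each spread evenly over `workers` executor threads."""
    return latency_s + n * delay_s / workers


CPU_COUNT = 12                     # from the run environment above
WORKERS = min(32, CPU_COUNT + 4)   # default executor size on this run: 16

# (delivery delay in seconds, measured wallclock in seconds) from the table above.
measured = [(0.001, 0.18), (0.005, 0.48), (0.010, 0.86)]

floors = [delivery_floor_s(1000, delay_s, WORKERS, 0.050) for delay_s, _ in measured]
for (delay_s, observed), floor in zip(measured, floors):
    print(f"delay={delay_s * 1000:.0f} ms: floor {floor:.3f} s, measured {observed:.2f} s")
```

The measured values sit above the floor in every row, consistent with delivery being serialized through the default executor.
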
| 83 | + |
| 84 | +## 6. Sustained load |
| 85 | + |
| 86 | +- **Target rate**: 200/s for 120 s |
| 87 | +- **Submitted / completed**: 24000 / 24000 |
| 88 | +- **Wallclock**: 120.05 s (effective throughput 199.9/s) |
| 89 | +- **Latency (overall)**: p50 50.2 ms, p95 50.6 ms, p99 50.8 ms, max 62.8 ms |
| 90 | +- **Latency (first 25%)**: p99 50.8 ms |
| 91 | +- **Latency (last 25%)**: p99 50.7 ms |
| 92 | +- **Peak tasks**: 19, peak queue depth: 3, peak RSS Δ: 5.8 MB |
| 93 | + |
| 94 | + |
| 95 | +## 7. Real HTTP workload (production shape) |
| 96 | + |
| 97 | +Each activity opens a fresh `httpx.AsyncClient` and GETs a local aiohttp endpoint that sleeps 50 ms. Mirrors `examples/workflow/async_activities.py`. The sync row at N=100 runs the same workload through the thread pool, so it is directly comparable to the async N=100 row. |
| 98 | + |
| 99 | +| Scenario | N | Sem | Pool | Latency (s) | Wallclock (s) | Tput/s | p50 ms | p95 ms | p99 ms | Peak tasks | Peak queue | Peak RSS Δ (MB) | Notes | |
| 100 | +| --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | --- | |
| 101 | +| Real HTTP async N=100 | 100 | 1000 | 16 | 0.050 | 0.49 | 205.3 | 485.1 | 485.4 | 485.5 | 305 | 0 | 0.0 | httpx.AsyncClient → aiohttp server (50 ms) | |
| 102 | +| Real HTTP async N=500 | 500 | 1000 | 16 | 0.050 | 2.06 | 243.2 | 1990.2 | 2052.6 | 2053.0 | 1376 | 0 | 308.1 | httpx.AsyncClient → aiohttp server (50 ms) | |
| 103 | +| Real HTTP async N=1000 | 1000 | 1000 | 16 | 0.050 | 4.28 | 233.4 | 4200.5 | 4274.9 | 4280.5 | 2555 | 0 | 398.5 | httpx.AsyncClient → aiohttp server (50 ms) | |
| 104 | +| Real HTTP async N=2500 | 2500 | 5000 | 16 | 0.050 | 15.16 | 165.0 | 10240.9 | 13260.9 | 15111.6 | 5776 | 0 | 1219.1 | httpx.AsyncClient → aiohttp server (50 ms) | |
| 105 | +| Real HTTP sync N=100 | 100 | 1000 | 16 | 0.050 | 0.51 | 194.2 | 324.6 | 458.5 | 514.4 | 137 | 0 | 0.7 | httpx.Client → aiohttp server, throttled by thread pool | |
| 106 | + |
| 107 | +## 8. Real HTTP sustained load |
| 108 | + |
| 109 | +Open-loop submission of real `httpx.AsyncClient` fetches at 100/s. Confirms steady state under genuine I/O, not synthetic sleep. |
| 110 | + |
| 111 | +- **Target rate**: 100/s for 60 s |
| 112 | +- **Submitted / completed**: 6000 / 6000 |
| 113 | +- **Wallclock**: 60.05 s (effective throughput 99.9/s) |
| 114 | +- **Latency (overall)**: p50 56.1 ms, p95 68.9 ms, p99 76.0 ms, max 145.2 ms |
| 115 | +- **Latency (first 25%)**: p99 75.7 ms |
| 116 | +- **Latency (last 25%)**: p99 76.2 ms |
| 117 | +- **Peak tasks**: 45, peak queue depth: 6, peak RSS Δ: 0.0 MB |
| 118 | + |
| 119 | + |
| 120 | +## 9. OOM safety |
| 121 | + |
| 122 | +10 000 in-flight async activities at 50 ms with a 1 000-cap semaphore. The ~9 000 Tasks parked on the semaphore are the concern raised in the design discussion. Peak RSS delta (0.0 MB in this run) stays well under the 500 MB budget, so the unbounded-pending-Task pattern is not a memory problem for this synthetic workload. |
| 123 | + |
| 124 | +| Scenario | N | Sem | Pool | Latency (s) | Wallclock (s) | Tput/s | Peak tasks | Peak queue | Peak RSS Δ (MB) | Notes | |
| 125 | +| --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | --- | |
| 126 | +| OOM safety (10k tasks, 1k semaphore) | 10000 | 1000 | 8 | 0.050 | 2.03 | 4918.2 | 10005 | 0 | 0.0 | ~9k tasks blocked on the semaphore. Peak RSS delta budget is 500 MB. | |
| 127 | + |
| 128 | +## How to read this report |
| 129 | + |
| 130 | +- **Tput/s** is the closed-loop throughput (items completed / wallclock). For the sustained scenarios (sections 6 and 8) it is the steady-state value over the full run. |
| 131 | +- **p99 ms** is the end-to-end latency for the 99th-percentile item: time from `submit_activity` to the mock stub seeing the response. |
| 132 | +- **Peak queue** is the maximum depth of the manager's `activity_queue` during the run. Non-zero peak queue means submission temporarily outran the semaphore. |
| 133 | +- **Peak tasks** is the maximum number of live `asyncio.Task` objects in the process, which doubles as a sanity check on the unbounded-pending-Task pattern. |
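
For readers reproducing the percentile columns from raw latencies, here is a minimal nearest-rank sketch. It is an assumption about the method: the benchmark script may interpolate between samples instead, in which case values can differ slightly. The sample data is synthetic.

```python
import math


def percentile(samples, pct: int):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Synthetic latencies in ms, for illustration only: 1, 2, ..., 100.
latencies_ms = list(range(1, 101))

p50 = percentile(latencies_ms, 50)
p95 = percentile(latencies_ms, 95)
p99 = percentile(latencies_ms, 99)
print(p50, p95, p99)  # prints 50 95 99
```
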
| 134 | + |
| 135 | +## Operational guidance |
| 136 | + |
| 137 | +See `ext/dapr-ext-workflow/docs/concurrency.md` for the full operational write-up, including sizing recommendations for `maximum_concurrent_activity_work_items`, `maximum_thread_pool_workers`, and the asyncio default-executor caveat. |