|
| 1 | +# Async-activity load benchmark results |
| 2 | + |
| 3 | +Generated by `bench_async_activities.py`. Re-run with: |
| 4 | + |
| 5 | +```bash |
| 6 | +uv run python ext/dapr-ext-workflow/benchmarks/bench_async_activities.py |
| 7 | +``` |
| 8 | + |
| 9 | +## Run environment |
| 10 | + |
| 11 | +- **Timestamp**: 2026-05-25 20:40:09 UTC |
| 12 | +- **Git commit**: `8f13da0-dirty` |
| 13 | +- **Python**: CPython 3.13.12 |
| 14 | +- **OS**: Darwin 25.5.0 (arm64) on Apple M3 Pro (12 logical cores), 36.0 GB |
| 15 | +- **asyncio default executor**: `max_workers=16` (`min(32, cpu_count + 4)`) |
| 16 | +- **CI environment**: no |
| 17 | + |
| 18 | +Numbers are specific to this hardware. Re-run locally to compare. Across |
| 19 | +machines, compare the shape of the curves (throughput plateau, p99 inflection, |
| 20 | +drift) rather than the absolute values. |
| 21 | + |
| 22 | +Each scenario drives `TaskHubGrpcWorker._execute_activity_async` through |
| 23 | +`_AsyncWorkerManager` against a mock `CompleteActivityTask` stub. End-to-end |
| 24 | +latency is measured from `submit_activity` to the mock stub seeing the response. |
| 25 | + |
| 26 | +## 1. Concurrency win (issue #897 repro) |
| 27 | + |
| 28 | +100 × 1 s HTTP fetches. Async runs them concurrently on the loop; sync gates |
| 29 | +them through the thread pool. |
| 30 | + |
| 31 | +| Scenario | Wallclock (s) | Tput/s | Peak tasks | Peak RSS Δ (MB) | |
| 32 | +| --- | ---: | ---: | ---: | ---: | |
| 33 | +| Async fan-out | 1.47 | 68.1 | 305 | 86.4 | |
| 34 | +| Sync baseline | 13.34 | 7.5 | 121 | 2.4 | |
| 35 | + |
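
The gap between the two rows comes from where the waiting happens. A scaled-down sketch, assuming 20 tasks of 50 ms each and a 4-thread pool (none of these figures come from the benchmark), with `asyncio.sleep` and `time.sleep` standing in for the HTTP fetches:

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor

N_TASKS = 20      # illustrative, not the benchmark's 100
DELAY_S = 0.05    # illustrative, not the benchmark's 1 s
POOL_SIZE = 4     # illustrative thread pool size


async def run_async() -> float:
    """All coroutines wait concurrently on the event loop."""
    start = time.perf_counter()
    await asyncio.gather(*(asyncio.sleep(DELAY_S) for _ in range(N_TASKS)))
    return time.perf_counter() - start


async def run_sync_in_pool() -> float:
    """Blocking calls are gated by the pool: only POOL_SIZE run at once."""
    loop = asyncio.get_running_loop()
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=POOL_SIZE) as pool:
        futures = [
            loop.run_in_executor(pool, time.sleep, DELAY_S) for _ in range(N_TASKS)
        ]
        await asyncio.gather(*futures)
    return time.perf_counter() - start


def compare() -> tuple[float, float]:
    """Return (async seconds, pooled seconds) for the same workload."""
    return asyncio.run(run_async()), asyncio.run(run_sync_in_pool())


if __name__ == "__main__":
    async_s, sync_s = compare()
    print(f"async: {async_s:.2f} s, thread pool: {sync_s:.2f} s")
```

With these numbers the pooled version needs at least 20 / 4 = 5 rounds of 50 ms, while the async version finishes in roughly one.
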
| 36 | +## 2. Throughput scaling |
| 37 | + |
| 38 | +Async fan-out, 50 ms activity, sem=5000, pool=16. Throughput peaks at |
| 39 | +N=2500 (10532.0/s) and falls to 8696.7/s at N=5000. |
| 40 | + |
| 41 | +| N | Wallclock (s) | Tput/s | p50 ms | p95 ms | p99 ms | Peak tasks | Peak RSS Δ (MB) | |
| 42 | +| ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | |
| 43 | +| 100 | 0.06 | 1542.3 | 62.0 | 64.1 | 64.1 | 105 | 0.0 | |
| 44 | +| 500 | 0.08 | 5931.1 | 78.6 | 79.6 | 79.6 | 505 | 0.4 | |
| 45 | +| 1000 | 0.11 | 8956.5 | 102.9 | 106.2 | 106.3 | 1005 | 2.9 | |
| 46 | +| 2500 | 0.24 | 10532.0 | 218.8 | 225.3 | 225.9 | 2505 | 10.0 | |
| 47 | +| 5000 | 0.57 | 8696.7 | 543.8 | 557.2 | 558.7 | 5005 | 25.2 | |
| 48 | + |
| 49 | +## 3. Semaphore-cap sensitivity |
| 50 | + |
| 51 | +N=2500, 50 ms activity, pool=16. Caps below ~500 starve the loop. Gains |
| 52 | +compress above ~1000. |
| 53 | + |
| 54 | +| Sem | Wallclock (s) | Tput/s | p50 ms | p95 ms | p99 ms | |
| 55 | +| ---: | ---: | ---: | ---: | ---: | ---: | |
| 56 | +| 50 | 2.69 | 928.6 | 1422.7 | 2583.5 | 2687.0 | |
| 57 | +| 100 | 1.42 | 1758.2 | 794.9 | 1360.7 | 1412.0 | |
| 58 | +| 500 | 0.40 | 6229.5 | 279.2 | 387.9 | 392.3 | |
| 59 | +| 1000 | 0.30 | 8322.3 | 235.6 | 286.9 | 290.2 | |
| 60 | +| 5000 | 0.23 | 10720.7 | 215.0 | 222.3 | 222.8 | |
| 61 | + |
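
One way to read the table: a cap of `sem` concurrent 50 ms activities cannot exceed `sem / 0.05` completions per second. The sketch below compares that idealized ceiling with the measured Tput/s values copied from the table. `sem_ceiling` is a hypothetical helper, and the model deliberately ignores scheduling and response-delivery overhead:

```python
ACTIVITY_MS = 50

# (sem, measured Tput/s) copied from the table above.
MEASURED = [(50, 928.6), (100, 1758.2), (500, 6229.5), (1000, 8322.3), (5000, 10720.7)]


def sem_ceiling(sem: int, activity_ms: int = ACTIVITY_MS) -> float:
    """Idealized completions/s when at most `sem` activities run at once."""
    return sem * 1000 / activity_ms


for sem, tput in MEASURED:
    ceiling = sem_ceiling(sem)
    print(f"sem={sem:>4}: ceiling {ceiling:>8.0f}/s, measured {tput:>8.1f}/s "
          f"({tput / ceiling:.0%} of ceiling)")
```

At sem=50 and sem=100 the measured rate sits close to this ceiling, which is consistent with the cap being the bottleneck there; at higher caps the ceiling is far above the measured rate, so other costs appear to dominate.
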
| 62 | +## 4. Failure threshold (queue-wait inflection) |
| 63 | + |
| 64 | +Cap=1000, ramp N, 50 ms activity. p99 first exceeds 2× the 50 ms activity |
| 65 | +latency (100 ms) at **N=1000** (p99 = 104.7 ms). |
| 66 | + |
| 67 | +| N | Wallclock (s) | Tput/s | p50 ms | p95 ms | p99 ms | |
| 68 | +| ---: | ---: | ---: | ---: | ---: | ---: | |
| 69 | +| 500 | 0.08 | 6264.2 | 70.6 | 77.4 | 77.5 | |
| 70 | +| 1000 | 0.11 | 9145.3 | 94.5 | 104.2 | 104.7 | |
| 71 | +| 2500 | 0.31 | 8086.8 | 243.6 | 294.2 | 298.0 | |
| 72 | +| 5000 | 0.72 | 6983.2 | 584.2 | 691.1 | 700.5 | |
| 73 | +| 10000 | 2.08 | 4813.1 | 1801.7 | 2019.3 | 2046.2 | |
| 74 | + |
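
The inflection rule stated in this section can be applied mechanically to the table rows. A minimal sketch, assuming the threshold is twice the 50 ms activity latency; `first_inflection` is a hypothetical helper, not a function from the benchmark script:

```python
ACTIVITY_MS = 50.0

# (N, p99 ms) copied from the table above.
P99_BY_N = [(500, 77.5), (1000, 104.7), (2500, 298.0), (5000, 700.5), (10000, 2046.2)]


def first_inflection(rows, activity_ms=ACTIVITY_MS, factor=2.0):
    """Return the first N whose p99 exceeds factor * activity_ms, or None."""
    threshold = factor * activity_ms
    for n, p99_ms in rows:
        if p99_ms > threshold:
            return n
    return None


print(first_inflection(P99_BY_N))  # 1000
```
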
| 75 | +## 5. Sidecar response delivery overhead |
| 76 | + |
| 77 | +N=1000, sem=1000, pool=16, 50 ms activity. Mock `CompleteActivityTask` given |
| 78 | +an artificial delay. Async responses go through `loop.run_in_executor(None, ...)`, |
| 79 | +sharing asyncio's default executor (`max_workers=16` here). Delays past ~5 ms |
| 80 | +saturate that pool. |
| 81 | + |
| 82 | +| Delivery | Wallclock (s) | Tput/s | p50 ms | p95 ms | p99 ms | |
| 83 | +| ---: | ---: | ---: | ---: | ---: | ---: | |
| 84 | +| 0 ms | 0.11 | 9497.2 | 98.2 | 101.3 | 101.5 | |
| 85 | +| 1 ms | 0.18 | 5699.8 | 133.0 | 167.7 | 171.0 | |
| 86 | +| 5 ms | 0.48 | 2077.9 | 287.7 | 458.6 | 473.4 | |
| 87 | +| 10 ms | 0.86 | 1162.5 | 494.1 | 820.4 | 843.5 | |
| 88 | + |
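
A rough way to see the saturation: with 16 workers each blocked for the delivery delay, the executor can complete at most `16 / delay` deliveries per second. The sketch compares this idealized ceiling with the measured Tput/s copied from the table. `delivery_ceiling` is a hypothetical helper, and the model ignores every cost except the delay itself:

```python
MAX_WORKERS = 16

# (delivery delay in ms, measured Tput/s) copied from the table above.
# The 0 ms row is omitted because the ceiling is unbounded there.
MEASURED = [(1, 5699.8), (5, 2077.9), (10, 1162.5)]


def delivery_ceiling(delay_ms: float, max_workers: int = MAX_WORKERS) -> float:
    """Idealized deliveries/s when each worker blocks for delay_ms per response."""
    return max_workers * 1000 / delay_ms


for delay_ms, tput in MEASURED:
    ceiling = delivery_ceiling(delay_ms)
    print(f"{delay_ms:>2} ms: ceiling {ceiling:>7.0f}/s, measured {tput:>7.1f}/s")
```

At 5 ms and 10 ms the ceiling (3200/s and 1600/s) is well below the 0 ms rate of 9497.2/s, so under this model the executor, not the event loop, bounds throughput.
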
| 89 | +## 6. Sustained load |
| 90 | + |
| 91 | +200/s for 120 s, 50 ms activity. Submitted/completed: 24 000 / 24 000. |
| 92 | +Wallclock 120.05 s (effective 199.9/s). |
| 93 | + |
| 94 | +- p50 50.2 ms, p95 50.6 ms, p99 50.8 ms, max 62.8 ms. |
| 95 | +- First-25% p99 50.8 ms, last-25% p99 50.7 ms. No drift. |
| 96 | +- Peak tasks 19, peak queue depth 3, peak RSS Δ 5.8 MB. |
| 97 | + |
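
The drift check compares tail latency early and late in the run. A sketch of that comparison on synthetic samples, assuming a nearest-rank p99 and a 10% tolerance; both choices and all helper names are assumptions, and the benchmark script may compute this differently:

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty sequence."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def quarter_p99s(latencies_ms):
    """p99 of the first and last 25% of samples, in submission order."""
    quarter = max(1, len(latencies_ms) // 4)
    first = percentile(latencies_ms[:quarter], 99)
    last = percentile(latencies_ms[-quarter:], 99)
    return first, last


def has_drift(latencies_ms, tolerance=0.10):
    """True when the last-quarter p99 exceeds the first-quarter p99 by > tolerance."""
    first, last = quarter_p99s(latencies_ms)
    return last > first * (1 + tolerance)


# Synthetic data: a flat run and one whose latency creeps upward.
flat = [50.0 + (i % 10) * 0.1 for i in range(4000)]
creeping = [50.0 + i * 0.01 for i in range(4000)]
print(has_drift(flat), has_drift(creeping))  # False True
```
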
| 98 | +## 7. Real HTTP workload |
| 99 | + |
| 100 | +Each activity opens a fresh `httpx.AsyncClient` and GETs an aiohttp endpoint |
| 101 | +sleeping 50 ms. Mirrors `examples/workflow/async_activities.py`. Pool=16 for |
| 102 | +all rows. |
| 103 | + |
| 104 | +| Scenario | N | Sem | Wallclock (s) | Tput/s | p50 ms | p95 ms | p99 ms | Peak tasks | Peak RSS Δ (MB) | |
| 105 | +| --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | |
| 106 | +| Async | 100 | 1000 | 0.49 | 205.3 | 485.1 | 485.4 | 485.5 | 305 | 0.0 | |
| 107 | +| Async | 500 | 1000 | 2.06 | 243.2 | 1990.2 | 2052.6 | 2053.0 | 1376 | 308.1 | |
| 108 | +| Async | 1000 | 1000 | 4.28 | 233.4 | 4200.5 | 4274.9 | 4280.5 | 2555 | 398.5 | |
| 109 | +| Async | 2500 | 5000 | 15.16 | 165.0 | 10240.9 | 13260.9 | 15111.6 | 5776 | 1219.1 | |
| 110 | +| Sync | 100 | 1000 | 0.51 | 194.2 | 324.6 | 458.5 | 514.4 | 137 | 0.7 | |
| 111 | + |
| 112 | +## 8. Real HTTP sustained load |
| 113 | + |
| 114 | +Open-loop 100/s for 60 s with real `httpx.AsyncClient`. Submitted/completed: |
| 115 | +6000 / 6000. Wallclock 60.05 s (effective 99.9/s). |
| 116 | + |
| 117 | +- p50 56.1 ms, p95 68.9 ms, p99 76.0 ms, max 145.2 ms. |
| 118 | +- First-25% p99 75.7 ms, last-25% p99 76.2 ms. No drift. |
| 119 | +- Peak tasks 45, peak queue depth 6, peak RSS Δ 0.0 MB. |
| 120 | + |
| 121 | +## 9. OOM safety |
| 122 | + |
| 123 | +10 000 in-flight async activities, 50 ms, sem=1000, pool=8. ~9 000 Tasks |
| 124 | +parked on the semaphore. Peak RSS Δ (0.0 MB measured) stays well under the 500 MB budget. |
| 125 | + |
| 126 | +| N | Sem | Wallclock (s) | Tput/s | Peak tasks | Peak RSS Δ (MB) | |
| 127 | +| ---: | ---: | ---: | ---: | ---: | ---: | |
| 128 | +| 10000 | 1000 | 2.03 | 4918.2 | 10005 | 0.0 | |
| 129 | + |
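
The parking behaviour can be reproduced at small scale. A sketch using a plain `asyncio.Semaphore`, assuming 200 tasks and a cap of 20 (scaled down from the run above); it illustrates the general pattern, not the internals of `_AsyncWorkerManager`:

```python
import asyncio


async def run_capped(n_tasks: int, cap: int, work_s: float = 0.01) -> tuple[int, int]:
    """Start n_tasks at once; return (peak concurrently running, completed)."""
    sem = asyncio.Semaphore(cap)
    running = 0
    peak = 0
    completed = 0

    async def activity() -> None:
        nonlocal running, peak, completed
        async with sem:  # tasks beyond the cap wait here without doing work
            running += 1
            peak = max(peak, running)
            await asyncio.sleep(work_s)
            running -= 1
            completed += 1

    # Every task exists immediately, but only `cap` hold the semaphore at a time.
    await asyncio.gather(*(asyncio.create_task(activity()) for _ in range(n_tasks)))
    return peak, completed


if __name__ == "__main__":
    print(asyncio.run(run_capped(n_tasks=200, cap=20)))
```

All tasks are created up front, so task count scales with N while the number doing work stays bounded by the cap.
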
| 130 | +## Operational guidance |
| 131 | + |
| 132 | +See `ext/dapr-ext-workflow/docs/concurrency.md` for sizing recommendations. |