Performance measurements for the a2a-protocol-sdk Rust implementation,
generated by Criterion.rs statistical
benchmarking.
Note: These numbers were collected on CI runners (
ubuntu-latest). Absolute values will vary by hardware. Use these for relative comparisons between operations and for regression detection across releases — not as guarantees of production performance on your specific hardware.To reproduce on your own machine:
cargo bench -p a2a-benchmarks
Last updated: 2026-04-22 22:51 UTC
Rust version: rustc 1.95.0 (59807616e 2026-04-14)
Platform: Linux-x86_64
Interactive dashboard: See the Benchmark Dashboard for charts, visual comparisons, and drill-down analysis of these results.
End-to-end HTTP round-trip latency through JSON-RPC and REST transports. All measurements use loopback (127.0.0.1) to isolate SDK overhead from network latency.
| Benchmark | Median |
|---|---|
transport_jsonrpc_send/single_message |
1.49 ms |
transport_jsonrpc_stream/stream_drain |
1.50 ms |
transport_payload_scaling/jsonrpc_send/1024 |
1.47 ms |
transport_payload_scaling/jsonrpc_send/102400 |
1.65 ms |
transport_payload_scaling/jsonrpc_send/1048576 |
2.97 ms |
transport_payload_scaling/jsonrpc_send/16384 |
1.49 ms |
transport_payload_scaling/jsonrpc_send/256 |
1.44 ms |
transport_payload_scaling/jsonrpc_send/4096 |
1.45 ms |
transport_payload_scaling/jsonrpc_send/64 |
1.46 ms |
transport_rest_send/single_message |
1.50 ms |
transport_rest_stream/stream_drain |
1.51 ms |
Serialization and deserialization cost per A2A type. This is the baseline tax every message pays regardless of transport.
Includes protocol/payload_scaling benchmarks that measure pure serde cost from
64B to 1MB — the correct regression detection target for serialization changes.
Also compares serde_json::to_vec vs SerBuffer (thread-local reuse) and
from_slice vs from_str (borrowed deserialization) paths.
| Benchmark | Median |
|---|---|
protocol_batch/deserialize_tasks/1 |
1.2 µs |
protocol_batch/deserialize_tasks/10 |
13.4 µs |
protocol_batch/deserialize_tasks/100 |
136.8 µs |
protocol_batch/deserialize_tasks/50 |
68.5 µs |
protocol_batch/serialize_tasks/1 |
443 ns |
protocol_batch/serialize_tasks/10 |
3.6 µs |
protocol_batch/serialize_tasks/100 |
33.0 µs |
protocol_batch/serialize_tasks/50 |
16.7 µs |
protocol_jsonrpc_envelope/deserialize_request |
791 ns |
protocol_jsonrpc_envelope/deserialize_response |
1.7 µs |
protocol_jsonrpc_envelope/serialize_request |
263 ns |
protocol_jsonrpc_envelope/serialize_response |
511 ns |
protocol_payload_scaling/from_slice/1024 |
534 ns |
protocol_payload_scaling/from_slice/102400 |
21.5 µs |
protocol_payload_scaling/from_slice/1048576 |
229.9 µs |
protocol_payload_scaling/from_slice/16384 |
3.7 µs |
protocol_payload_scaling/from_slice/256 |
358 ns |
protocol_payload_scaling/from_slice/4096 |
1.2 µs |
protocol_payload_scaling/from_slice/64 |
326 ns |
protocol_payload_scaling/from_str/1024 |
426 ns |
protocol_payload_scaling/from_str/102400 |
17.5 µs |
protocol_payload_scaling/from_str/1048576 |
188.3 µs |
protocol_payload_scaling/from_str/16384 |
3.0 µs |
protocol_payload_scaling/from_str/256 |
301 ns |
protocol_payload_scaling/from_str/4096 |
964 ns |
protocol_payload_scaling/from_str/64 |
269 ns |
protocol_payload_scaling/ser_buffer/1024 |
795 ns |
protocol_payload_scaling/ser_buffer/102400 |
68.3 µs |
protocol_payload_scaling/ser_buffer/1048576 |
725.8 µs |
protocol_payload_scaling/ser_buffer/16384 |
10.9 µs |
protocol_payload_scaling/ser_buffer/256 |
268 ns |
protocol_payload_scaling/ser_buffer/4096 |
2.8 µs |
protocol_payload_scaling/ser_buffer/64 |
143 ns |
protocol_payload_scaling/to_vec/1024 |
845 ns |
protocol_payload_scaling/to_vec/102400 |
66.3 µs |
protocol_payload_scaling/to_vec/1048576 |
689.8 µs |
protocol_payload_scaling/to_vec/16384 |
11.0 µs |
protocol_payload_scaling/to_vec/256 |
377 ns |
protocol_payload_scaling/to_vec/4096 |
2.9 µs |
protocol_payload_scaling/to_vec/64 |
194 ns |
protocol_stream_events/artifact_update_deserialize |
520 ns |
protocol_stream_events/artifact_update_serialize |
229 ns |
protocol_stream_events/status_update_deserialize |
363 ns |
protocol_stream_events/status_update_serialize |
128 ns |
protocol_type_serde/agent_card_deserialize |
1.3 µs |
protocol_type_serde/agent_card_serialize |
576 ns |
protocol_type_serde/message_deserialize/217 |
679 ns |
protocol_type_serde/message_serialize/217 |
314 ns |
protocol_type_serde/task_deserialize/278 |
1.1 µs |
protocol_type_serde/task_serialize/278 |
437 ns |
TaskStore and EventQueue operations — the backbone of task management.
| Benchmark | Median |
|---|---|
lifecycle_e2e/send_and_complete |
1.49 ms |
lifecycle_e2e/stream_and_drain |
1.46 ms |
lifecycle_queue/write_read/1 |
775 ns |
lifecycle_queue/write_read/10 |
4.5 µs |
lifecycle_queue/write_read/100 |
43.8 µs |
lifecycle_queue/write_read/50 |
21.7 µs |
lifecycle_store_get/lookup_in_1000 |
423 ns |
lifecycle_store_list/filtered_page_50_of_250 |
27.0 µs |
lifecycle_store_save/single_task |
480 ns |
Scaling behavior under parallel load — how latency changes as concurrency increases from 1 to 64 simultaneous operations.
| Benchmark | Median |
|---|---|
concurrent_mixed/send_then_get |
1.43 ms |
concurrent_sends/jsonrpc/1 |
1.49 ms |
concurrent_sends/jsonrpc/16 |
3.72 ms |
concurrent_sends/jsonrpc/4 |
2.99 ms |
concurrent_sends/jsonrpc/64 |
6.61 ms |
concurrent_store/save_and_get/1 |
31.5 µs |
concurrent_store/save_and_get/16 |
66.9 µs |
concurrent_store/save_and_get/4 |
29.7 µs |
concurrent_store/save_and_get/64 |
183.4 µs |
concurrent_streams/jsonrpc/1 |
1.40 ms |
concurrent_streams/jsonrpc/16 |
3.95 ms |
concurrent_streams/jsonrpc/4 |
3.16 ms |
concurrent_streams/jsonrpc/64 |
7.02 ms |
Production-like usage patterns: multi-turn conversations, mixed payloads, interceptor chains, and connection reuse vs per-request clients.
| Benchmark | Median |
|---|---|
realistic_complex_card/deserialize/1 |
2.3 µs |
realistic_complex_card/deserialize/10 |
12.8 µs |
realistic_complex_card/deserialize/100 |
114.4 µs |
realistic_complex_card/deserialize/50 |
58.5 µs |
realistic_complex_card/serialize/1 |
962 ns |
realistic_complex_card/serialize/10 |
3.4 µs |
realistic_complex_card/serialize/100 |
26.7 µs |
realistic_complex_card/serialize/50 |
13.7 µs |
realistic_connection/new_client_per_request |
1.55 ms |
realistic_connection/reused_client |
1.41 ms |
realistic_history_serde/deserialize/1 |
1.4 µs |
realistic_history_serde/deserialize/10 |
5.2 µs |
realistic_history_serde/deserialize/20 |
10.2 µs |
realistic_history_serde/deserialize/5 |
3.1 µs |
realistic_history_serde/deserialize/50 |
24.6 µs |
realistic_history_serde/serialize/1 |
502 ns |
realistic_history_serde/serialize/10 |
2.1 µs |
realistic_history_serde/serialize/20 |
3.7 µs |
realistic_history_serde/serialize/5 |
1.2 µs |
realistic_history_serde/serialize/50 |
8.1 µs |
realistic_interceptor_chain/interceptors/0 |
167.6 µs |
realistic_interceptor_chain/interceptors/1 |
169.0 µs |
realistic_interceptor_chain/interceptors/10 |
171.4 µs |
realistic_interceptor_chain/interceptors/5 |
169.6 µs |
realistic_multi_turn/sequential/1 |
1.47 ms |
realistic_multi_turn/sequential/10 |
15.09 ms |
realistic_multi_turn/sequential/3 |
4.58 ms |
realistic_multi_turn/sequential/5 |
7.55 ms |
realistic_payload_complexity/large_metadata_10kb |
1.54 ms |
realistic_payload_complexity/mixed_parts |
1.45 ms |
realistic_payload_complexity/nested_metadata_10 |
1.45 ms |
realistic_payload_complexity/simple_text |
1.50 ms |
Cost of error handling — comparing happy path latency to error path latency. Production systems spend significant time on error paths; benchmarking only the happy path gives an incomplete picture.
| Benchmark | Median |
|---|---|
errors_happy_vs_error/error_path |
1.35 ms |
errors_happy_vs_error/happy_path |
1.44 ms |
errors_malformed_request/invalid_json |
95.0 µs |
errors_malformed_request/wrong_content_type |
94.2 µs |
errors_task_not_found/get_nonexistent_task |
110.1 µs |
Stream throughput under varying event volumes and consumer speeds. Reveals buffering and flow-control overhead that synthetic single-event tests miss.
The default broadcast channel capacity was increased from 64 to 256 events in
v0.5.0, pushing the per-event cost inflection point from ~52 events to ~252
events. Deployments with >256 events/task should use
EventQueueManager::with_capacity() to set a higher value.
| Benchmark | Median |
|---|---|
backpressure_concurrent_streams/streams/1 |
1.50 ms |
backpressure_concurrent_streams/streams/16 |
4.32 ms |
backpressure_concurrent_streams/streams/4 |
3.33 ms |
backpressure_slow_consumer/1ms_delay |
29.48 ms |
backpressure_slow_consumer/5ms_delay |
82.10 ms |
backpressure_slow_consumer/fast_consumer |
1.58 ms |
backpressure_stream_volume/252_events |
9.82 ms |
backpressure_stream_volume/27_events |
1.78 ms |
backpressure_stream_volume/3_events |
1.47 ms |
backpressure_stream_volume/502_events |
50.06 ms |
backpressure_stream_volume/52_events |
2.01 ms |
backpressure_stream_volume/7_events |
1.54 ms |
backpressure_timer_calibration/sleep_1ms_actual |
2.10 ms |
backpressure_timer_calibration/sleep_5ms_actual |
6.18 ms |
TaskStore performance at realistic data volumes (1K to 100K tasks). Shows how store operations scale as data accumulates over time.
| Benchmark | Median |
|---|---|
data_volume_concurrent_reads/get/1 |
28.3 µs |
data_volume_concurrent_reads/get/16 |
35.9 µs |
data_volume_concurrent_reads/get/4 |
30.3 µs |
data_volume_concurrent_reads/get/64 |
77.1 µs |
data_volume_get/lookup/1000 |
427 ns |
data_volume_get/lookup/10000 |
437 ns |
data_volume_get/lookup/100000 |
209 ns |
data_volume_history_depth/save_with_turns/1 |
1.6 µs |
data_volume_history_depth/save_with_turns/10 |
6.1 µs |
data_volume_history_depth/save_with_turns/20 |
10.7 µs |
data_volume_history_depth/save_with_turns/5 |
3.1 µs |
data_volume_history_depth/save_with_turns/50 |
23.5 µs |
data_volume_list/filtered_page_50/1000 |
26.1 µs |
data_volume_list/filtered_page_50/10000 |
26.1 µs |
data_volume_list/filtered_page_50/100000 |
25.7 µs |
data_volume_save/after_prefill/0 |
1.8 µs |
data_volume_save/after_prefill/1000 |
1.4 µs |
data_volume_save/after_prefill/10000 |
1.4 µs |
data_volume_save/after_prefill/50000 |
1.4 µs |
Heap allocation counts and bytes per operation, measured via a counting
allocator (#[global_allocator]). Values represent allocation counts or
bytes — not time — encoded as nanoseconds for Criterion tracking.
| Metric | Unit |
|---|---|
*_alloc_count |
Number of alloc() calls per operation |
*_bytes_per_payload |
Bytes allocated per operation |
| Benchmark | Value |
|---|---|
memory_bytes_per_payload/serialize_bytes/1024 |
568 |
memory_bytes_per_payload/serialize_bytes/16384 |
6442 |
memory_bytes_per_payload/serialize_bytes/256 |
253 |
memory_bytes_per_payload/serialize_bytes/4096 |
1682 |
memory_bytes_per_payload/serialize_bytes/64 |
152 |
memory_deserialize/agent_card_alloc_count |
1361 |
memory_deserialize/task_alloc_count |
1030 |
memory_history_scaling/deserialize_allocs/1 |
1349 |
memory_history_scaling/deserialize_allocs/10 |
5171 |
memory_history_scaling/deserialize_allocs/20 |
10209 |
memory_history_scaling/deserialize_allocs/5 |
2967 |
memory_history_scaling/deserialize_allocs/50 |
23558 |
memory_history_scaling/serialize_allocs/1 |
459 |
memory_history_scaling/serialize_allocs/10 |
1951 |
memory_history_scaling/serialize_allocs/20 |
5013 |
memory_history_scaling/serialize_allocs/5 |
1160 |
memory_history_scaling/serialize_allocs/50 |
12043 |
memory_serialize/agent_card_alloc_count |
500 |
memory_serialize/task_alloc_count |
342 |
Standardized workloads designed to be reproduced identically across all A2A SDK implementations (Python, Go, JS, Java, C#/.NET).
- All SDKs hit the same Rust echo server (eliminates server-side variance)
- All workloads use identical JSON payloads from
benches/cross_language/ - Results use median ± MAD to resist outlier pollution
| Benchmark | Median |
|---|---|
cross_language_concurrent_50/rust |
6.01 ms |
cross_language_echo_roundtrip/rust |
1.44 ms |
cross_language_minimal_overhead/rust |
1.33 ms |
cross_language_serialize_agent_card/rust_deserialize |
1.4 µs |
cross_language_serialize_agent_card/rust_roundtrip |
2.2 µs |
cross_language_serialize_agent_card/rust_serialize |
601 ns |
cross_language_stream_events/rust |
1.48 ms |
Production-scale workloads modeling real deployments: multi-tenant isolation, push notification management, eviction under memory pressure, rate limiting, CORS handling, read/write mix ratios, and large conversation histories.
| Benchmark | Median |
|---|---|
enterprise_cancel_task/send_then_cancel |
1.46 ms |
enterprise_client_interceptors/interceptors/0 |
1.40 ms |
enterprise_client_interceptors/interceptors/1 |
1.46 ms |
enterprise_client_interceptors/interceptors/10 |
1.40 ms |
enterprise_client_interceptors/interceptors/5 |
1.43 ms |
enterprise_cors/options_preflight |
88.0 µs |
enterprise_eviction/save_at_capacity/100 |
536 ns |
enterprise_eviction/save_at_capacity/1000 |
588 ns |
enterprise_eviction/save_at_capacity/10000 |
862 ns |
enterprise_eviction/sweep_duration/100 |
131 ns |
enterprise_eviction/sweep_duration/1000 |
133 ns |
enterprise_eviction/sweep_duration/10000 |
154 ns |
enterprise_handler_limits/default_limits |
1.37 ms |
enterprise_handler_limits/metadata_rejection |
119.4 µs |
enterprise_handler_limits/tight_limits |
1.40 ms |
enterprise_large_history/deserialize/100 |
47.8 µs |
enterprise_large_history/deserialize/200 |
92.9 µs |
enterprise_large_history/deserialize/500 |
228.5 µs |
enterprise_large_history/serialize/100 |
16.1 µs |
enterprise_large_history/serialize/200 |
31.5 µs |
enterprise_large_history/serialize/500 |
77.9 µs |
enterprise_large_history/store_save/100 |
16.8 µs |
enterprise_large_history/store_save/200 |
34.2 µs |
enterprise_large_history/store_save/500 |
90.3 µs |
enterprise_list_tasks/page_size/10 |
140.6 µs |
enterprise_list_tasks/page_size/25 |
182.3 µs |
enterprise_list_tasks/page_size/50 |
238.0 µs |
enterprise_multi_tenant/concurrent_tenant_saves/1 |
31.0 µs |
enterprise_multi_tenant/concurrent_tenant_saves/10 |
46.3 µs |
enterprise_multi_tenant/concurrent_tenant_saves/100 |
138.9 µs |
enterprise_multi_tenant/concurrent_tenant_saves/50 |
87.6 µs |
enterprise_multi_tenant/tenant_isolation_check/1 |
498 ns |
enterprise_multi_tenant/tenant_isolation_check/10 |
494 ns |
enterprise_multi_tenant/tenant_isolation_check/100 |
491 ns |
enterprise_multi_tenant/tenant_isolation_check/50 |
493 ns |
enterprise_push_config/get |
288 ns |
enterprise_push_config/list_per_task/1 |
194 ns |
enterprise_push_config/list_per_task/10 |
1.3 µs |
enterprise_push_config/list_per_task/50 |
11.6 µs |
enterprise_push_config/set |
1.2 µs |
enterprise_rate_limiting/no_rate_limit |
1.39 ms |
enterprise_rate_limiting/with_rate_limit |
1.39 ms |
enterprise_rw_mix/0r_100w |
224.2 µs |
enterprise_rw_mix/100r_0w |
60.8 µs |
enterprise_rw_mix/25r_75w |
195.5 µs |
enterprise_rw_mix/50r_50w |
153.7 µs |
enterprise_rw_mix/75r_25w |
105.8 µs |
Full end-to-end workflows exercising the complete SDK pipeline in scenarios that real-world deployments encounter at scale: task reconnection, cold start latency, concurrent race conditions, multi-context orchestration, push config lifecycle, parallel agent bursts, and dispatch routing overhead isolation.
| Benchmark | Median |
|---|---|
production_agent_burst/agents/10 |
4.86 ms |
production_agent_burst/agents/100 |
22.36 ms |
production_agent_burst/agents/50 |
12.99 ms |
production_cancel_subscribe_race/concurrent_cancel_and_subscribe |
849.7 µs |
production_cold_start/first_request |
321.0 µs |
production_cold_start/steady_state |
1.46 ms |
production_dispatch_routing/direct_handler_invoke |
1.42 ms |
production_dispatch_routing/full_http_roundtrip |
1.33 ms |
production_e2e_orchestration/7_step_workflow |
6.20 ms |
production_push_config/delete_roundtrip |
208.5 µs |
production_push_config/get_roundtrip |
103.9 µs |
production_push_config/list_roundtrip |
103.3 µs |
production_push_config/set_roundtrip |
106.8 µs |
production_subscribe_to_task/send_then_subscribe |
1.62 ms |
SDK capabilities exercising previously-unbenchmarked paths: tenant resolver overhead, agent card hot-reload and discovery, subscribe fan-out for reconnection bursts, streaming artifact accumulation cost (the 90µs/event bottleneck), pagination full walk, and extended agent card round-trip.
| Benchmark | Median |
|---|---|
advanced_agent_card_discovery/well_known_endpoint |
87.8 µs |
advanced_agent_card_hot_reload/read_current_card |
309 ns |
advanced_agent_card_hot_reload/swap_and_read |
664 ns |
advanced_agent_card_hot_reload/swap_complex_card |
59.5 µs |
advanced_artifact_accumulation/store_save_at_depth/0 |
397 ns |
advanced_artifact_accumulation/store_save_at_depth/10 |
1.7 µs |
advanced_artifact_accumulation/store_save_at_depth/100 |
14.4 µs |
advanced_artifact_accumulation/store_save_at_depth/50 |
6.3 µs |
advanced_artifact_accumulation/store_save_at_depth/500 |
76.6 µs |
advanced_artifact_accumulation/task_clone_at_depth/0 |
125 ns |
advanced_artifact_accumulation/task_clone_at_depth/10 |
1.2 µs |
advanced_artifact_accumulation/task_clone_at_depth/100 |
13.4 µs |
advanced_artifact_accumulation/task_clone_at_depth/50 |
6.9 µs |
advanced_artifact_accumulation/task_clone_at_depth/500 |
65.9 µs |
advanced_extended_agent_card/get_extended_card_roundtrip |
116.6 µs |
advanced_pagination_walk/filtered/1000_tasks_page_50 |
303.0 µs |
advanced_pagination_walk/filtered/100_tasks_page_25 |
30.1 µs |
advanced_pagination_walk/unfiltered/1000_tasks_page_50 |
567.1 µs |
advanced_pagination_walk/unfiltered/100_tasks_page_25 |
60.8 µs |
advanced_subscribe_fanout/concurrent_subscribers/1 |
1.90 ms |
advanced_subscribe_fanout/concurrent_subscribers/10 |
2.33 ms |
advanced_subscribe_fanout/concurrent_subscribers/5 |
2.04 ms |
advanced_tenant_resolver/bearer_resolver |
126 ns |
advanced_tenant_resolver/bearer_resolver_with_mapper |
146 ns |
advanced_tenant_resolver/header_resolver |
127 ns |
advanced_tenant_resolver/header_resolver_miss |
89 ns |
advanced_tenant_resolver/path_resolver |
177 ns |
End-to-end latency through a 5-hop in-process coordinator chain as the links between hops are made progressively less reliable. Unlike every other benchmark on this page, this one does not measure SDK-layer overhead — it measures the characteristic an agent-harness reviewer actually wants: "what is the end-to-end latency of an agent chain when the network between agents is unreliable, and how well do per-hop retries absorb it?"
The topology is:
test client ─[link 0]─▶ coord 1 ─[link 1]─▶ coord 2 ─[link 2]─▶ coord 3 ─[link 3]─▶ coord 4 ─[link 4]─▶ leaf
Every coordinator forwards the message to the next hop via a pre-built
A2aClient wrapped in a FaultInjectingTransport. Each link applies its
own independent fault profile, so per-hop faults compound end-to-end the
way they would in a real deployment. Coordinators 1–4 retry their
downstream call up to 3 times on retryable errors; the bench harness
additionally retries the top-level send_message up to 8 times so the
published error rates have effectively-zero unrecoverable-failure
probability.
Honest caveats — read these before interpreting the numbers:
- In-process, not network faults. The injected "error" is a
synthetic
ClientError::Timeoutreturned before the wrapped transport is called. This exercises the SDK's retry path faithfully, but does not exercise TCP congestion control, DNS resolution, or transport-level head-of-line blocking. Treat the numbers as "latency under SDK-level retransmission pressure," not "latency under real network loss." - One topology. Sequential delegation is the simplest multi-agent shape. Critic loops, parallel fan-out with deadline propagation, and plan-and-execute with replanning would be more rubric-relevant — this benchmark does not claim to cover those.
- One benchmark does not retroactively make the other suites agent-level. It is deliberately additive: the first concrete data point in the "agent-level latency under fault" shape that the rest of the suite was missing entirely.
Varies per-link latency from 0 µs to 20 000 µs with zero synthetic errors, isolating the chain's latency-compounding factor from retry jitter. Five hops × per-hop latency gives the lower bound, plus the JSON-RPC loopback baseline (~2 ms for a five-hop chain with zero added latency).
Varies per-link synthetic-fault rate from 0% to 5% with zero added latency. Each coordinator retries its downstream call up to 3 times on retryable errors; the bench harness retries the top-level call up to 8 times. Records successful-path latency including retry cost, which is what "steady-state end-to-end latency under fault" means in practice.
These notes help interpret benchmark results accurately and avoid misdiagnosing CI variance as real performance changes.
On N-core systems, `tokio::spawn` places the SSE builder task on a different worker thread with (N-1)/N probability, causing ~500µs cache-miss + work-stealing penalty. This was root-caused as the source of the ~24% bimodal distribution in all streaming benchmarks.
Mitigations (v1.0.0): The SSE builder uses `sleep` + reset (not `interval`) to eliminate timer wheel entries during active streaming. Transport streaming benchmarks use `worker_threads(1)` runtime to eliminate cross-thread variance entirely (24 high severe → 4 high mild outliers, 3× tighter confidence intervals).
The data_volume/get/100K benchmark previously reported ~42% faster lookups
than the 1K/10K cases due to a CPU cache warming artifact from the large
populate_store() setup filling L1/L2 caches. A 4MB cache-busting step was
added in v0.5.0 to flush caches between populate and measure, producing more
representative O(1) lookup times across all scales. The 1K/10K number (~450ns)
remains the representative baseline.
Per-event cost inflects dramatically when events exceed the broadcast channel capacity. The default capacity was increased from 64 to 256 events in v0.5.0, pushing the inflection from ~52 events to ~252 events:
- Below capacity: ~4µs/event (fast path)
- At capacity boundary: ~53µs/event (12× jump — broadcast back-pressure)
- Above capacity: ~130µs/event (SSE frame accumulation under overflow)
Production deployments expecting >256 events/task should increase
EventQueueManager::with_capacity() to match their peak volume.
Transport benchmarks (64B → 16KB) show only ~10% latency increase for a
256× payload increase, because the ~1.4ms HTTP round-trip dominates. Serde
regressions cannot be detected via transport benchmarks. Use the
protocol/payload_scaling isolation benchmarks (64B → 1MB, pure serde)
for serialization regression detection.
Connection reuse saves ~140µs (9%) on loopback. On real networks with TLS,
savings would be 10-50ms (TLS handshake dominates). Best practice: create one
A2aClient at startup and share via Arc across request handlers.
Deserialization allocates ~3× more than serialization (Task: 1,026 vs 342
allocs). This is inherent to serde_json's parsing model: every field creates
an intermediate String/Vec allocation during parsing. The
serde_helpers::deser_from_str() helper enables serde_json's borrowed-data
path for ~15-25% fewer allocations. The serde_helpers::SerBuffer provides
thread-local buffer reuse for serialization, eliminating the 2.3× small-payload
overhead.
History depth scales at ~494 deserialization allocs/turn and ~242 serialization
allocs/turn (linear, constant marginal cost). At 50 turns: 24,714 deser allocs
per store.get(). The serde_helpers module provides optimized paths; for
maximum throughput on deep histories, consider storing pre-serialized bytes
alongside parsed structs to avoid re-parsing on every read.
The background event processor clones the full Task struct on each SSE event. Clone cost scales linearly at ~133ns/artifact. For tasks with 500+ accumulated artifacts, consider batching event processing or using the planned copy-on-write artifact storage (tracked as a future optimization).
The backpressure/timer_calibration benchmarks measure actual
tokio::time::sleep() durations on the CI runner. On shared runners,
1ms sleep ≈ 2.09ms actual, 5ms sleep ≈ 6.14ms actual. Slow consumer
results should be interpreted against these calibrated durations, not
the nominal sleep values.
The data_volume/save/after_prefill/10000 benchmark reports wide confidence
intervals ([1.4µs, 3.5µs], spanning a 2.5× range) and an 18% high severe
outlier rate. This is caused by BTreeSet rebalancing spikes when the sorted
index crosses internal node-split thresholds during insert. The median
(~1.6µs) is representative; the wide CI reflects genuine variance from the
B-tree data structure, not measurement noise. This is an acceptable tradeoff:
the BTreeSet enables O(page_size) pagination queries vs O(n) full scans.
The production/dispatch_routing/direct_handler_invoke benchmark may report
marginally higher latency than full_http_roundtrip. This is not anomalous
— the HTTP path reuses a warm keep-alive connection that amortizes TCP setup
cost, while direct handler invocation exercises the full dispatch path without
connection pooling benefits. The ~7% difference validates that the HTTP layer
adds near-zero overhead for repeat requests on warm connections.
The advanced/subscribe_fanout benchmark shows O(1) cost from 1→5 subscribers
(~2.9ms both), with gradual increase at 10 subscribers (~3.6ms). The broadcast
channel delivers to all subscribers in a single pass; the inflection at 10+
subscribers reflects increased channel contention and memory pressure from
concurrent readers.
The production/agent_burst benchmark shows per-agent cost decreasing as
concurrency increases: 714µs/agent at 10, 390µs/agent at 50, 310µs/agent at
100. This sub-linear scaling confirms the SDK handles high-fanout agent
coordination without degradation — Tokio's work-stealing scheduler amortizes
task scheduling overhead across the burst.
The production/cold_start/first_request benchmark (~328µs) appears faster
than steady_state (~1.97ms). This is because first_request creates a
fresh server per iteration (sample_size=20), measuring server handler
initialization + first TCP connect. The steady_state benchmark reuses an
existing keep-alive connection, measuring the full HTTP round-trip with
connection overhead already amortized. The two benchmarks measure different
things — they are complementary, not comparable.
Tenant resolvers operate at 88–173ns per request, representing ~0.008% of a typical 1.6ms round-trip. Header extraction (128ns) is marginally slower than the miss path (88ns) due to value parsing; path extraction (173ns) is slowest due to URL path parsing overhead. All resolvers are effectively free at production scale.
The advanced/pagination_walk filtered benchmarks show ~2× speedup over
unfiltered walks (309µs vs 592µs at 1000 tasks). The BTreeSet context index
eliminates half the scan work by only iterating tasks matching the
context_id filter.
All benchmarks use Criterion.rs, which provides:
- Statistical significance testing — detects real regressions vs noise
- Warm-up iterations — avoids cold-start measurement artifacts
- Median ± MAD — robust central tendency resistant to outliers
- Configurable sample sizes — more iterations for noisy benchmarks
All benchmarks follow these practices for reproducibility:
- Deterministic inputs: Fixed task IDs and payloads inside
iter()— no incrementing counters that change HashMap distribution across iterations - Setup outside measurement: Store creation, server startup, and resource
allocation happen before
iter(), not inside it debug_assert!for invariants: Correctness checks inside measurement loops usedebug_assert!to avoid string-formatting cost in release buildsblack_box()on inputs and outputs: Prevents the compiler from eliminating measured work through dead-code optimization- Tolerance-based allocation assertions: Memory benchmarks use a 5% tolerance instead of exact counts to avoid spurious CI failures from serde_json/stdlib version changes
- Side-effect interceptors: The interceptor chain benchmark uses
CountingInterceptor(AtomicU64) to verify interceptors are actually invoked during measurement — not just optimized away
The SDK's value proposition is the A2A protocol layer and runtime efficiency, not agent logic. The bulk of the suite therefore benchmarks what the SDK owns: transport overhead, serialization cost, store operations, concurrency scaling, streaming backpressure, error handling, and memory allocation behavior.
One benchmark — coordinator_chain_under_fault — is deliberately a different
shape: it measures end-to-end agent-chain latency under fault injection, not
SDK-layer overhead. It is documented in its own section above with the
caveats for how to interpret it (in-process only, sequential delegation only,
one topology). It is not intended to substitute for a real agent-capability
benchmark suite — it closes the most obvious gap in the existing suite while
staying honest about what it is.
- Agent intelligence — LLM quality is an eval problem, not a perf benchmark
- Real network faults — the fault-injection bench simulates synthetic
ClientError::Timeoutresponses in-process, not real packet loss or TCP congestion control - Network latency — all benchmarks use loopback (127.0.0.1)
- TLS handshake — benchmarks use plaintext HTTP
- Task completion quality — needs human-preference evaluation
- Multi-agent topologies beyond sequential delegation — critic loops, parallel fan-out with deadline propagation, and plan-and-execute with replanning are out of scope for this crate
# Run all benchmarks
cargo bench -p a2a-benchmarks
# Run a specific module
cargo bench -p a2a-benchmarks --bench transport_throughput
# Save baseline, make changes, then compare
./benches/scripts/run_benchmarks.sh --save
# ... make changes ...
./benches/scripts/run_benchmarks.sh --compareFull HTML reports (with violin plots and comparison overlays) are generated
in target/criterion/.