
Benchmark Results

Performance measurements for the a2a-protocol-sdk Rust implementation, generated by Criterion.rs statistical benchmarking.

Note: These numbers were collected on CI runners (ubuntu-latest). Absolute values will vary by hardware. Use these for relative comparisons between operations and for regression detection across releases — not as guarantees of production performance on your specific hardware.

To reproduce on your own machine: cargo bench -p a2a-benchmarks

Last updated: 2026-04-22 22:51 UTC
Rust version: rustc 1.95.0 (59807616e 2026-04-14)
Platform: Linux-x86_64

Interactive dashboard: See the Benchmark Dashboard for charts, visual comparisons, and drill-down analysis of these results.

Transport Throughput

End-to-end HTTP round-trip latency through JSON-RPC and REST transports. All measurements use loopback (127.0.0.1) to isolate SDK overhead from network latency.

| Benchmark | Median |
|---|---|
| transport_jsonrpc_send/single_message | 1.49 ms |
| transport_jsonrpc_stream/stream_drain | 1.50 ms |
| transport_payload_scaling/jsonrpc_send/64 | 1.46 ms |
| transport_payload_scaling/jsonrpc_send/256 | 1.44 ms |
| transport_payload_scaling/jsonrpc_send/1024 | 1.47 ms |
| transport_payload_scaling/jsonrpc_send/4096 | 1.45 ms |
| transport_payload_scaling/jsonrpc_send/16384 | 1.49 ms |
| transport_payload_scaling/jsonrpc_send/102400 | 1.65 ms |
| transport_payload_scaling/jsonrpc_send/1048576 | 2.97 ms |
| transport_rest_send/single_message | 1.50 ms |
| transport_rest_stream/stream_drain | 1.51 ms |

Protocol Overhead

Serialization and deserialization cost per A2A type. This is the baseline tax every message pays regardless of transport.

Includes protocol/payload_scaling benchmarks that measure pure serde cost from 64B to 1MB — the correct regression detection target for serialization changes. Also compares serde_json::to_vec vs SerBuffer (thread-local reuse) and from_slice vs from_str (borrowed deserialization) paths.

| Benchmark | Median |
|---|---|
| protocol_batch/deserialize_tasks/1 | 1.2 µs |
| protocol_batch/deserialize_tasks/10 | 13.4 µs |
| protocol_batch/deserialize_tasks/50 | 68.5 µs |
| protocol_batch/deserialize_tasks/100 | 136.8 µs |
| protocol_batch/serialize_tasks/1 | 443 ns |
| protocol_batch/serialize_tasks/10 | 3.6 µs |
| protocol_batch/serialize_tasks/50 | 16.7 µs |
| protocol_batch/serialize_tasks/100 | 33.0 µs |
| protocol_jsonrpc_envelope/deserialize_request | 791 ns |
| protocol_jsonrpc_envelope/deserialize_response | 1.7 µs |
| protocol_jsonrpc_envelope/serialize_request | 263 ns |
| protocol_jsonrpc_envelope/serialize_response | 511 ns |
| protocol_payload_scaling/from_slice/64 | 326 ns |
| protocol_payload_scaling/from_slice/256 | 358 ns |
| protocol_payload_scaling/from_slice/1024 | 534 ns |
| protocol_payload_scaling/from_slice/4096 | 1.2 µs |
| protocol_payload_scaling/from_slice/16384 | 3.7 µs |
| protocol_payload_scaling/from_slice/102400 | 21.5 µs |
| protocol_payload_scaling/from_slice/1048576 | 229.9 µs |
| protocol_payload_scaling/from_str/64 | 269 ns |
| protocol_payload_scaling/from_str/256 | 301 ns |
| protocol_payload_scaling/from_str/1024 | 426 ns |
| protocol_payload_scaling/from_str/4096 | 964 ns |
| protocol_payload_scaling/from_str/16384 | 3.0 µs |
| protocol_payload_scaling/from_str/102400 | 17.5 µs |
| protocol_payload_scaling/from_str/1048576 | 188.3 µs |
| protocol_payload_scaling/ser_buffer/64 | 143 ns |
| protocol_payload_scaling/ser_buffer/256 | 268 ns |
| protocol_payload_scaling/ser_buffer/1024 | 795 ns |
| protocol_payload_scaling/ser_buffer/4096 | 2.8 µs |
| protocol_payload_scaling/ser_buffer/16384 | 10.9 µs |
| protocol_payload_scaling/ser_buffer/102400 | 68.3 µs |
| protocol_payload_scaling/ser_buffer/1048576 | 725.8 µs |
| protocol_payload_scaling/to_vec/64 | 194 ns |
| protocol_payload_scaling/to_vec/256 | 377 ns |
| protocol_payload_scaling/to_vec/1024 | 845 ns |
| protocol_payload_scaling/to_vec/4096 | 2.9 µs |
| protocol_payload_scaling/to_vec/16384 | 11.0 µs |
| protocol_payload_scaling/to_vec/102400 | 66.3 µs |
| protocol_payload_scaling/to_vec/1048576 | 689.8 µs |
| protocol_stream_events/artifact_update_deserialize | 520 ns |
| protocol_stream_events/artifact_update_serialize | 229 ns |
| protocol_stream_events/status_update_deserialize | 363 ns |
| protocol_stream_events/status_update_serialize | 128 ns |
| protocol_type_serde/agent_card_deserialize | 1.3 µs |
| protocol_type_serde/agent_card_serialize | 576 ns |
| protocol_type_serde/message_deserialize/217 | 679 ns |
| protocol_type_serde/message_serialize/217 | 314 ns |
| protocol_type_serde/task_deserialize/278 | 1.1 µs |
| protocol_type_serde/task_serialize/278 | 437 ns |

Task Lifecycle

TaskStore and EventQueue operations — the backbone of task management.

| Benchmark | Median |
|---|---|
| lifecycle_e2e/send_and_complete | 1.49 ms |
| lifecycle_e2e/stream_and_drain | 1.46 ms |
| lifecycle_queue/write_read/1 | 775 ns |
| lifecycle_queue/write_read/10 | 4.5 µs |
| lifecycle_queue/write_read/50 | 21.7 µs |
| lifecycle_queue/write_read/100 | 43.8 µs |
| lifecycle_store_get/lookup_in_1000 | 423 ns |
| lifecycle_store_list/filtered_page_50_of_250 | 27.0 µs |
| lifecycle_store_save/single_task | 480 ns |

Concurrent Agents

Scaling behavior under parallel load — how latency changes as concurrency increases from 1 to 64 simultaneous operations.

| Benchmark | Median |
|---|---|
| concurrent_mixed/send_then_get | 1.43 ms |
| concurrent_sends/jsonrpc/1 | 1.49 ms |
| concurrent_sends/jsonrpc/4 | 2.99 ms |
| concurrent_sends/jsonrpc/16 | 3.72 ms |
| concurrent_sends/jsonrpc/64 | 6.61 ms |
| concurrent_store/save_and_get/1 | 31.5 µs |
| concurrent_store/save_and_get/4 | 29.7 µs |
| concurrent_store/save_and_get/16 | 66.9 µs |
| concurrent_store/save_and_get/64 | 183.4 µs |
| concurrent_streams/jsonrpc/1 | 1.40 ms |
| concurrent_streams/jsonrpc/4 | 3.16 ms |
| concurrent_streams/jsonrpc/16 | 3.95 ms |
| concurrent_streams/jsonrpc/64 | 7.02 ms |

Realistic Workloads

Production-like usage patterns: multi-turn conversations, mixed payloads, interceptor chains, and connection reuse vs per-request clients.

| Benchmark | Median |
|---|---|
| realistic_complex_card/deserialize/1 | 2.3 µs |
| realistic_complex_card/deserialize/10 | 12.8 µs |
| realistic_complex_card/deserialize/50 | 58.5 µs |
| realistic_complex_card/deserialize/100 | 114.4 µs |
| realistic_complex_card/serialize/1 | 962 ns |
| realistic_complex_card/serialize/10 | 3.4 µs |
| realistic_complex_card/serialize/50 | 13.7 µs |
| realistic_complex_card/serialize/100 | 26.7 µs |
| realistic_connection/new_client_per_request | 1.55 ms |
| realistic_connection/reused_client | 1.41 ms |
| realistic_history_serde/deserialize/1 | 1.4 µs |
| realistic_history_serde/deserialize/5 | 3.1 µs |
| realistic_history_serde/deserialize/10 | 5.2 µs |
| realistic_history_serde/deserialize/20 | 10.2 µs |
| realistic_history_serde/deserialize/50 | 24.6 µs |
| realistic_history_serde/serialize/1 | 502 ns |
| realistic_history_serde/serialize/5 | 1.2 µs |
| realistic_history_serde/serialize/10 | 2.1 µs |
| realistic_history_serde/serialize/20 | 3.7 µs |
| realistic_history_serde/serialize/50 | 8.1 µs |
| realistic_interceptor_chain/interceptors/0 | 167.6 µs |
| realistic_interceptor_chain/interceptors/1 | 169.0 µs |
| realistic_interceptor_chain/interceptors/5 | 169.6 µs |
| realistic_interceptor_chain/interceptors/10 | 171.4 µs |
| realistic_multi_turn/sequential/1 | 1.47 ms |
| realistic_multi_turn/sequential/3 | 4.58 ms |
| realistic_multi_turn/sequential/5 | 7.55 ms |
| realistic_multi_turn/sequential/10 | 15.09 ms |
| realistic_payload_complexity/large_metadata_10kb | 1.54 ms |
| realistic_payload_complexity/mixed_parts | 1.45 ms |
| realistic_payload_complexity/nested_metadata_10 | 1.45 ms |
| realistic_payload_complexity/simple_text | 1.50 ms |

Error Paths

Cost of error handling — comparing happy path latency to error path latency. Production systems spend significant time on error paths; benchmarking only the happy path gives an incomplete picture.

| Benchmark | Median |
|---|---|
| errors_happy_vs_error/error_path | 1.35 ms |
| errors_happy_vs_error/happy_path | 1.44 ms |
| errors_malformed_request/invalid_json | 95.0 µs |
| errors_malformed_request/wrong_content_type | 94.2 µs |
| errors_task_not_found/get_nonexistent_task | 110.1 µs |

Streaming & Backpressure

Stream throughput under varying event volumes and consumer speeds. Reveals buffering and flow-control overhead that synthetic single-event tests miss.

The default broadcast channel capacity was increased from 64 to 256 events in v0.5.0, pushing the per-event cost inflection point from ~52 events to ~252 events. Deployments with >256 events/task should use EventQueueManager::with_capacity() to set a higher value.

| Benchmark | Median |
|---|---|
| backpressure_concurrent_streams/streams/1 | 1.50 ms |
| backpressure_concurrent_streams/streams/4 | 3.33 ms |
| backpressure_concurrent_streams/streams/16 | 4.32 ms |
| backpressure_slow_consumer/fast_consumer | 1.58 ms |
| backpressure_slow_consumer/1ms_delay | 29.48 ms |
| backpressure_slow_consumer/5ms_delay | 82.10 ms |
| backpressure_stream_volume/3_events | 1.47 ms |
| backpressure_stream_volume/7_events | 1.54 ms |
| backpressure_stream_volume/27_events | 1.78 ms |
| backpressure_stream_volume/52_events | 2.01 ms |
| backpressure_stream_volume/252_events | 9.82 ms |
| backpressure_stream_volume/502_events | 50.06 ms |
| backpressure_timer_calibration/sleep_1ms_actual | 2.10 ms |
| backpressure_timer_calibration/sleep_5ms_actual | 6.18 ms |

Data Volume Scaling

TaskStore performance at realistic data volumes (1K to 100K tasks). Shows how store operations scale as data accumulates over time.

| Benchmark | Median |
|---|---|
| data_volume_concurrent_reads/get/1 | 28.3 µs |
| data_volume_concurrent_reads/get/4 | 30.3 µs |
| data_volume_concurrent_reads/get/16 | 35.9 µs |
| data_volume_concurrent_reads/get/64 | 77.1 µs |
| data_volume_get/lookup/1000 | 427 ns |
| data_volume_get/lookup/10000 | 437 ns |
| data_volume_get/lookup/100000 | 209 ns |
| data_volume_history_depth/save_with_turns/1 | 1.6 µs |
| data_volume_history_depth/save_with_turns/5 | 3.1 µs |
| data_volume_history_depth/save_with_turns/10 | 6.1 µs |
| data_volume_history_depth/save_with_turns/20 | 10.7 µs |
| data_volume_history_depth/save_with_turns/50 | 23.5 µs |
| data_volume_list/filtered_page_50/1000 | 26.1 µs |
| data_volume_list/filtered_page_50/10000 | 26.1 µs |
| data_volume_list/filtered_page_50/100000 | 25.7 µs |
| data_volume_save/after_prefill/0 | 1.8 µs |
| data_volume_save/after_prefill/1000 | 1.4 µs |
| data_volume_save/after_prefill/10000 | 1.4 µs |
| data_volume_save/after_prefill/50000 | 1.4 µs |

Memory Overhead

Heap allocation counts and bytes per operation, measured via a counting allocator installed with #[global_allocator]. Values are allocation counts or byte counts, not durations; they are encoded as nanoseconds only so Criterion can track them.
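For reference, the counting-allocator technique works roughly like this; the identifiers below (CountingAlloc, ALLOC_CALLS) are illustrative, not the benchmark crate's actual names:

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicU64, Ordering};

static ALLOC_CALLS: AtomicU64 = AtomicU64::new(0);
static BYTES_ALLOCATED: AtomicU64 = AtomicU64::new(0);

struct CountingAlloc;

unsafe impl GlobalAlloc for CountingAlloc {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        // Count every alloc() call and the bytes it requests, then delegate.
        ALLOC_CALLS.fetch_add(1, Ordering::Relaxed);
        BYTES_ALLOCATED.fetch_add(layout.size() as u64, Ordering::Relaxed);
        System.alloc(layout)
    }
    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        System.dealloc(ptr, layout)
    }
}

#[global_allocator]
static GLOBAL: CountingAlloc = CountingAlloc;

/// Runs `f` and returns its result plus the number of allocations it made.
fn allocs_during<T>(f: impl FnOnce() -> T) -> (T, u64) {
    let before = ALLOC_CALLS.load(Ordering::Relaxed);
    let out = f();
    (out, ALLOC_CALLS.load(Ordering::Relaxed) - before)
}
```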

| Metric | Unit |
|---|---|
| `*_alloc_count` | Number of alloc() calls per operation |
| `*_bytes_per_payload` | Bytes allocated per operation |

| Benchmark | Value |
|---|---|
| memory_bytes_per_payload/serialize_bytes/64 | 152 |
| memory_bytes_per_payload/serialize_bytes/256 | 253 |
| memory_bytes_per_payload/serialize_bytes/1024 | 568 |
| memory_bytes_per_payload/serialize_bytes/4096 | 1682 |
| memory_bytes_per_payload/serialize_bytes/16384 | 6442 |
| memory_deserialize/agent_card_alloc_count | 1361 |
| memory_deserialize/task_alloc_count | 1030 |
| memory_history_scaling/deserialize_allocs/1 | 1349 |
| memory_history_scaling/deserialize_allocs/5 | 2967 |
| memory_history_scaling/deserialize_allocs/10 | 5171 |
| memory_history_scaling/deserialize_allocs/20 | 10209 |
| memory_history_scaling/deserialize_allocs/50 | 23558 |
| memory_history_scaling/serialize_allocs/1 | 459 |
| memory_history_scaling/serialize_allocs/5 | 1160 |
| memory_history_scaling/serialize_allocs/10 | 1951 |
| memory_history_scaling/serialize_allocs/20 | 5013 |
| memory_history_scaling/serialize_allocs/50 | 12043 |
| memory_serialize/agent_card_alloc_count | 500 |
| memory_serialize/task_alloc_count | 342 |

Cross-Language Comparison

Standardized workloads designed to be reproduced identically across all A2A SDK implementations (Python, Go, JS, Java, C#/.NET).

  • All SDKs hit the same Rust echo server (eliminates server-side variance)
  • All workloads use identical JSON payloads from benches/cross_language/
  • Results use median ± MAD to resist outlier pollution
| Benchmark | Median |
|---|---|
| cross_language_concurrent_50/rust | 6.01 ms |
| cross_language_echo_roundtrip/rust | 1.44 ms |
| cross_language_minimal_overhead/rust | 1.33 ms |
| cross_language_serialize_agent_card/rust_deserialize | 1.4 µs |
| cross_language_serialize_agent_card/rust_roundtrip | 2.2 µs |
| cross_language_serialize_agent_card/rust_serialize | 601 ns |
| cross_language_stream_events/rust | 1.48 ms |

Enterprise Scenarios

Production-scale workloads modeling real deployments: multi-tenant isolation, push notification management, eviction under memory pressure, rate limiting, CORS handling, read/write mix ratios, and large conversation histories.

| Benchmark | Median |
|---|---|
| enterprise_cancel_task/send_then_cancel | 1.46 ms |
| enterprise_client_interceptors/interceptors/0 | 1.40 ms |
| enterprise_client_interceptors/interceptors/1 | 1.46 ms |
| enterprise_client_interceptors/interceptors/5 | 1.43 ms |
| enterprise_client_interceptors/interceptors/10 | 1.40 ms |
| enterprise_cors/options_preflight | 88.0 µs |
| enterprise_eviction/save_at_capacity/100 | 536 ns |
| enterprise_eviction/save_at_capacity/1000 | 588 ns |
| enterprise_eviction/save_at_capacity/10000 | 862 ns |
| enterprise_eviction/sweep_duration/100 | 131 ns |
| enterprise_eviction/sweep_duration/1000 | 133 ns |
| enterprise_eviction/sweep_duration/10000 | 154 ns |
| enterprise_handler_limits/default_limits | 1.37 ms |
| enterprise_handler_limits/metadata_rejection | 119.4 µs |
| enterprise_handler_limits/tight_limits | 1.40 ms |
| enterprise_large_history/deserialize/100 | 47.8 µs |
| enterprise_large_history/deserialize/200 | 92.9 µs |
| enterprise_large_history/deserialize/500 | 228.5 µs |
| enterprise_large_history/serialize/100 | 16.1 µs |
| enterprise_large_history/serialize/200 | 31.5 µs |
| enterprise_large_history/serialize/500 | 77.9 µs |
| enterprise_large_history/store_save/100 | 16.8 µs |
| enterprise_large_history/store_save/200 | 34.2 µs |
| enterprise_large_history/store_save/500 | 90.3 µs |
| enterprise_list_tasks/page_size/10 | 140.6 µs |
| enterprise_list_tasks/page_size/25 | 182.3 µs |
| enterprise_list_tasks/page_size/50 | 238.0 µs |
| enterprise_multi_tenant/concurrent_tenant_saves/1 | 31.0 µs |
| enterprise_multi_tenant/concurrent_tenant_saves/10 | 46.3 µs |
| enterprise_multi_tenant/concurrent_tenant_saves/50 | 87.6 µs |
| enterprise_multi_tenant/concurrent_tenant_saves/100 | 138.9 µs |
| enterprise_multi_tenant/tenant_isolation_check/1 | 498 ns |
| enterprise_multi_tenant/tenant_isolation_check/10 | 494 ns |
| enterprise_multi_tenant/tenant_isolation_check/50 | 493 ns |
| enterprise_multi_tenant/tenant_isolation_check/100 | 491 ns |
| enterprise_push_config/get | 288 ns |
| enterprise_push_config/list_per_task/1 | 194 ns |
| enterprise_push_config/list_per_task/10 | 1.3 µs |
| enterprise_push_config/list_per_task/50 | 11.6 µs |
| enterprise_push_config/set | 1.2 µs |
| enterprise_rate_limiting/no_rate_limit | 1.39 ms |
| enterprise_rate_limiting/with_rate_limit | 1.39 ms |
| enterprise_rw_mix/100r_0w | 60.8 µs |
| enterprise_rw_mix/75r_25w | 105.8 µs |
| enterprise_rw_mix/50r_50w | 153.7 µs |
| enterprise_rw_mix/25r_75w | 195.5 µs |
| enterprise_rw_mix/0r_100w | 224.2 µs |

Production Scenarios

Full end-to-end workflows exercising the complete SDK pipeline in scenarios that real-world deployments encounter at scale: task reconnection, cold start latency, concurrent race conditions, multi-context orchestration, push config lifecycle, parallel agent bursts, and dispatch routing overhead isolation.

| Benchmark | Median |
|---|---|
| production_agent_burst/agents/10 | 4.86 ms |
| production_agent_burst/agents/50 | 12.99 ms |
| production_agent_burst/agents/100 | 22.36 ms |
| production_cancel_subscribe_race/concurrent_cancel_and_subscribe | 849.7 µs |
| production_cold_start/first_request | 321.0 µs |
| production_cold_start/steady_state | 1.46 ms |
| production_dispatch_routing/direct_handler_invoke | 1.42 ms |
| production_dispatch_routing/full_http_roundtrip | 1.33 ms |
| production_e2e_orchestration/7_step_workflow | 6.20 ms |
| production_push_config/delete_roundtrip | 208.5 µs |
| production_push_config/get_roundtrip | 103.9 µs |
| production_push_config/list_roundtrip | 103.3 µs |
| production_push_config/set_roundtrip | 106.8 µs |
| production_subscribe_to_task/send_then_subscribe | 1.62 ms |

Advanced Scenarios

Benchmarks covering previously unbenchmarked SDK paths: tenant resolver overhead, agent card hot-reload and discovery, subscribe fan-out for reconnection bursts, streaming artifact accumulation cost (the ~90µs/event bottleneck), full pagination walks, and the extended agent card round-trip.

| Benchmark | Median |
|---|---|
| advanced_agent_card_discovery/well_known_endpoint | 87.8 µs |
| advanced_agent_card_hot_reload/read_current_card | 309 ns |
| advanced_agent_card_hot_reload/swap_and_read | 664 ns |
| advanced_agent_card_hot_reload/swap_complex_card | 59.5 µs |
| advanced_artifact_accumulation/store_save_at_depth/0 | 397 ns |
| advanced_artifact_accumulation/store_save_at_depth/10 | 1.7 µs |
| advanced_artifact_accumulation/store_save_at_depth/50 | 6.3 µs |
| advanced_artifact_accumulation/store_save_at_depth/100 | 14.4 µs |
| advanced_artifact_accumulation/store_save_at_depth/500 | 76.6 µs |
| advanced_artifact_accumulation/task_clone_at_depth/0 | 125 ns |
| advanced_artifact_accumulation/task_clone_at_depth/10 | 1.2 µs |
| advanced_artifact_accumulation/task_clone_at_depth/50 | 6.9 µs |
| advanced_artifact_accumulation/task_clone_at_depth/100 | 13.4 µs |
| advanced_artifact_accumulation/task_clone_at_depth/500 | 65.9 µs |
| advanced_extended_agent_card/get_extended_card_roundtrip | 116.6 µs |
| advanced_pagination_walk/filtered/100_tasks_page_25 | 30.1 µs |
| advanced_pagination_walk/filtered/1000_tasks_page_50 | 303.0 µs |
| advanced_pagination_walk/unfiltered/100_tasks_page_25 | 60.8 µs |
| advanced_pagination_walk/unfiltered/1000_tasks_page_50 | 567.1 µs |
| advanced_subscribe_fanout/concurrent_subscribers/1 | 1.90 ms |
| advanced_subscribe_fanout/concurrent_subscribers/5 | 2.04 ms |
| advanced_subscribe_fanout/concurrent_subscribers/10 | 2.33 ms |
| advanced_tenant_resolver/bearer_resolver | 126 ns |
| advanced_tenant_resolver/bearer_resolver_with_mapper | 146 ns |
| advanced_tenant_resolver/header_resolver | 127 ns |
| advanced_tenant_resolver/header_resolver_miss | 89 ns |
| advanced_tenant_resolver/path_resolver | 177 ns |

Agent-Level Latency Under Fault

End-to-end latency through a 5-hop in-process coordinator chain as the links between hops are made progressively less reliable. Unlike every other benchmark on this page, this one does not measure SDK-layer overhead. It measures what an agent-harness evaluation actually asks: "what is the end-to-end latency of an agent chain when the network between agents is unreliable, and how well do per-hop retries absorb it?"

The topology is:

test client ─[link 0]─▶ coord 1 ─[link 1]─▶ coord 2 ─[link 2]─▶ coord 3 ─[link 3]─▶ coord 4 ─[link 4]─▶ leaf

Every coordinator forwards the message to the next hop via a pre-built A2aClient wrapped in a FaultInjectingTransport. Each link applies its own independent fault profile, so per-hop faults compound end-to-end the way they would in a real deployment. Coordinators 1–4 retry their downstream call up to 3 times on retryable errors; the bench harness additionally retries the top-level send_message up to 8 times, so the published numbers reflect the successful path with effectively zero probability of unrecoverable failure.

Honest caveats — read these before interpreting the numbers:

  • In-process, not network faults. The injected "error" is a synthetic ClientError::Timeout returned before the wrapped transport is called. This exercises the SDK's retry path faithfully, but does not exercise TCP congestion control, DNS resolution, or transport-level head-of-line blocking. Treat the numbers as "latency under SDK-level retransmission pressure," not "latency under real network loss."
  • One topology. Sequential delegation is the simplest multi-agent shape. Critic loops, parallel fan-out with deadline propagation, and plan-and-execute with replanning would be more rubric-relevant — this benchmark does not claim to cover those.
  • One benchmark does not retroactively make the other suites agent-level. It is deliberately additive: the first concrete data point in the "agent-level latency under fault" shape that the rest of the suite was missing entirely.

Group 1: per-hop latency injection (zero errors)

Varies per-link latency from 0 µs to 20,000 µs with zero synthetic errors, isolating the chain's latency-compounding factor from retry jitter. The expected lower bound is five hops × the per-hop injected latency, plus the five-hop JSON-RPC loopback baseline (~2 ms with zero added latency).

Group 2: per-hop error injection (3 retries per hop + 8 outer retries)

Varies per-link synthetic-fault rate from 0% to 5% with zero added latency. Each coordinator retries its downstream call up to 3 times on retryable errors; the bench harness retries the top-level call up to 8 times. Records successful-path latency including retry cost, which is what "steady-state end-to-end latency under fault" means in practice.
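As a sketch of the retry policy these groups exercise (the real coordinators forward through A2aClient; the error type and the `op` closure below are simplified stand-ins, not SDK APIs):

```rust
use std::future::Future;

#[derive(Debug)]
enum FaultError {
    Timeout,       // the synthetic retryable fault injected by the harness
    Fatal(String), // anything classified as non-retryable
}

impl FaultError {
    fn is_retryable(&self) -> bool {
        matches!(self, FaultError::Timeout)
    }
}

/// Retries `op` up to `max_attempts` times on retryable errors; this is the
/// per-hop policy (max_attempts = 3) and, with max_attempts = 8, the outer
/// harness policy described above.
async fn with_retries<T, F, Fut>(max_attempts: u32, mut op: F) -> Result<T, FaultError>
where
    F: FnMut() -> Fut,
    Fut: Future<Output = Result<T, FaultError>>,
{
    let mut last = None;
    for _ in 0..max_attempts {
        match op().await {
            Ok(v) => return Ok(v),
            Err(e) if e.is_retryable() => last = Some(e), // retry this hop
            Err(e) => return Err(e),                      // fail fast otherwise
        }
    }
    Err(last.expect("max_attempts must be > 0"))
}
```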


Known Measurement Limitations

These notes help interpret benchmark results accurately and avoid misdiagnosing CI variance as real performance changes.

Streaming cross-thread scheduling

On N-core systems, `tokio::spawn` places the SSE builder task on a different worker thread with probability (N-1)/N, incurring a ~500µs cache-miss and work-stealing penalty. This was root-caused as the source of the ~24% bimodal distribution observed in all streaming benchmarks.

Mitigations (v1.0.0): The SSE builder uses `sleep` + reset (not `interval`) to eliminate timer wheel entries during active streaming. Transport streaming benchmarks use `worker_threads(1)` runtime to eliminate cross-thread variance entirely (24 high severe → 4 high mild outliers, 3× tighter confidence intervals).
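The single-worker runtime mitigation uses the standard tokio builder; a minimal sketch (the exact harness wiring in the bench crate may differ):

```rust
use tokio::runtime::Builder;

/// One worker thread: spawned tasks cannot migrate across cores, which
/// eliminates the cross-thread work-stealing variance described above.
fn single_worker_runtime() -> tokio::runtime::Runtime {
    Builder::new_multi_thread()
        .worker_threads(1)
        .enable_all()
        .build()
        .expect("failed to build benchmark runtime")
}
```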

Data volume get() at 100K tasks

The data_volume/get/100K benchmark previously reported ~42% faster lookups than the 1K/10K cases due to a CPU cache warming artifact from the large populate_store() setup filling L1/L2 caches. A 4MB cache-busting step was added in v0.5.0 to flush caches between populate and measure, producing more representative O(1) lookup times across all scales. The 1K/10K number (~450ns) remains the representative baseline.

Stream volume per-event cost inflection

Per-event cost inflects dramatically when events exceed the broadcast channel capacity. The default capacity was increased from 64 to 256 events in v0.5.0, pushing the inflection from ~52 events to ~252 events:

  • Below capacity: ~4µs/event (fast path)
  • At capacity boundary: ~53µs/event (12× jump — broadcast back-pressure)
  • Above capacity: ~130µs/event (SSE frame accumulation under overflow)

Production deployments expecting >256 events/task should increase EventQueueManager::with_capacity() to match their peak volume.
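A sketch of the sizing call; this assumes with_capacity takes the per-task event capacity, so verify the exact signature against the crate docs:

```rust
// Assumption: the argument is a per-task event count (see SDK docs).
// Size it comfortably above the expected peak events per task.
let queues = EventQueueManager::with_capacity(1024);
```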

Transport payload insensitivity

Transport benchmarks (64B → 16KB) show only ~10% latency increase for a 256× payload increase, because the ~1.4ms HTTP round-trip dominates. Serde regressions cannot be detected via transport benchmarks. Use the protocol/payload_scaling isolation benchmarks (64B → 1MB, pure serde) for serialization regression detection.

Connection reuse impact

Connection reuse saves ~140µs (9%) on loopback. On real networks with TLS, savings would be 10-50ms (TLS handshake dominates). Best practice: create one A2aClient at startup and share via Arc across request handlers.
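A fragment sketching that pattern; `A2aClient::connect` is a placeholder for whatever constructor the SDK exposes, and the point being shown is the one-time construction plus Arc sharing:

```rust
use std::sync::Arc;

// Build the client once at startup (placeholder constructor name).
let client = Arc::new(A2aClient::connect("http://127.0.0.1:8080").await?);

// Handlers clone the Arc (a pointer bump) and reuse the client's connection
// pool instead of paying per-request client construction.
let handler_client = Arc::clone(&client);
```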

Deserialization allocation overhead

Deserialization allocates roughly 3× more than serialization (Task: ~1,030 vs 342 allocs). This is inherent to serde_json's parsing model: every field creates an intermediate String/Vec allocation during parsing. The serde_helpers::deser_from_str() helper enables serde_json's borrowed-data path for ~15-25% fewer allocations, and serde_helpers::SerBuffer provides thread-local buffer reuse for serialization, eliminating the 2.3× small-payload overhead.
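Both optimizations can be sketched with plain serde_json; the SDK helpers wrap equivalents of these, and the field names and buffer size below are illustrative:

```rust
use serde::Deserialize;
use std::cell::RefCell;

// Borrowed deserialization: &str fields borrow from the input instead of
// allocating a String per field. Note that borrowing fails with an error if
// a JSON string contains escape sequences.
#[derive(Deserialize)]
struct TaskRef<'a> {
    id: &'a str,
    context_id: &'a str,
}

fn parse_borrowed(json: &str) -> serde_json::Result<TaskRef<'_>> {
    serde_json::from_str(json)
}

thread_local! {
    // Thread-local scratch buffer: the reuse idea behind SerBuffer.
    static BUF: RefCell<Vec<u8>> = RefCell::new(Vec::with_capacity(4096));
}

fn serialize_reused<T: serde::Serialize>(value: &T) -> Vec<u8> {
    BUF.with(|b| {
        let mut buf = b.borrow_mut();
        buf.clear(); // keeps capacity from previous calls
        serde_json::to_writer(&mut *buf, value).expect("serialization failed");
        buf.clone() // one exact-size allocation; no growth reallocations
    })
}
```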

History depth allocation scaling

History depth scales at ~494 deserialization allocs/turn and ~242 serialization allocs/turn (linear, constant marginal cost), so a 50-turn history costs roughly 24,000 deserialization allocations per store.get(). The serde_helpers module provides optimized paths; for maximum throughput on deep histories, consider storing pre-serialized bytes alongside parsed structs to avoid re-parsing on every read.
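One way to realize the pre-serialized-bytes suggestion, with `Task` as a stand-in for the SDK type:

```rust
use serde::Serialize;
use std::sync::Arc;

#[derive(Serialize)]
struct Task { // stand-in for the SDK task type
    id: String,
    history: Vec<String>,
}

struct StoredTask {
    parsed: Task,
    // Serialized once at save time and shared on every read, so a deep
    // history pays its encode allocations once rather than per get().
    json: Arc<[u8]>,
}

impl StoredTask {
    fn new(parsed: Task) -> serde_json::Result<Self> {
        let json: Arc<[u8]> = serde_json::to_vec(&parsed)?.into();
        Ok(Self { parsed, json })
    }
}
```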

Artifact accumulation clone cost

The background event processor clones the full Task struct on each SSE event. Clone cost scales linearly at ~133ns/artifact. For tasks with 500+ accumulated artifacts, consider batching event processing or using the planned copy-on-write artifact storage (tracked as a future optimization).
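One copy-on-write shape that would make those clones cheap (illustrative only; the SDK's planned design may differ):

```rust
use std::sync::Arc;

#[derive(Clone)]
struct Artifact { name: String, bytes: Vec<u8> } // stand-in artifact type

#[derive(Clone)]
struct Task {
    id: String,
    // Clones share the artifact list, so Task::clone is a pointer bump
    // instead of ~133ns per accumulated artifact.
    artifacts: Arc<Vec<Artifact>>,
}

impl Task {
    fn push_artifact(&mut self, a: Artifact) {
        // Copies the Vec only if another clone still holds a reference.
        Arc::make_mut(&mut self.artifacts).push(a);
    }
}
```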

Slow consumer timer calibration

The backpressure/timer_calibration benchmarks measure actual tokio::time::sleep() durations on the CI runner. On shared runners, a nominal 1ms sleep takes ~2.10ms and a nominal 5ms sleep takes ~6.18ms. Slow consumer results should be interpreted against these calibrated durations, not the nominal sleep values.
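The calibration itself is a few lines with std and tokio:

```rust
use std::time::Instant;
use tokio::time::{sleep, Duration};

/// Returns the actual elapsed time for a nominal sleep; on shared CI runners
/// this overshoots (e.g. ~2.10 ms measured for a nominal 1 ms).
async fn calibrate_sleep(nominal: Duration) -> Duration {
    let start = Instant::now();
    sleep(nominal).await;
    start.elapsed()
}
```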

Data volume save() wide confidence intervals

The data_volume/save/after_prefill/10000 benchmark reports wide confidence intervals ([1.4µs, 3.5µs], a 2.5× range) and an 18% high severe outlier rate. This is caused by BTreeSet rebalancing spikes when the sorted index crosses internal node-split thresholds during insert. The median (~1.4µs) is representative; the wide CI reflects genuine variance from the B-tree data structure, not measurement noise. This is an acceptable tradeoff: the BTreeSet enables O(page_size) pagination queries instead of O(n) full scans.
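The tradeoff shows up in how the index is queried; a minimal sketch using illustrative (created_at, task_id) keys:

```rust
use std::collections::BTreeSet;
use std::ops::Bound::{Excluded, Unbounded};

/// Fetches one page after a cursor. Cost is O(log N + page_size) via range
/// iteration, instead of the O(N) scan a flat map would need.
fn next_page(
    index: &BTreeSet<(u64, String)>,
    cursor: (u64, String),
    page_size: usize,
) -> Vec<(u64, String)> {
    index
        .range((Excluded(cursor), Unbounded))
        .take(page_size)
        .cloned()
        .collect()
}
```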

Dispatch routing: direct handler vs HTTP round-trip

The production/dispatch_routing/direct_handler_invoke benchmark may report marginally higher latency than full_http_roundtrip. This is not anomalous — the HTTP path reuses a warm keep-alive connection that amortizes TCP setup cost, while direct handler invocation exercises the full dispatch path without connection pooling benefits. The ~7% difference validates that the HTTP layer adds near-zero overhead for repeat requests on warm connections.

Subscribe fan-out O(1) scaling

The advanced/subscribe_fanout benchmark shows near-O(1) cost from 1 to 5 subscribers (~1.90ms vs ~2.04ms), with a gradual increase at 10 subscribers (~2.33ms). The broadcast channel delivers to all subscribers in a single pass; the uptick at 10+ subscribers reflects increased channel contention and memory pressure from concurrent readers.

Agent burst sub-linear scaling

The production/agent_burst benchmark shows per-agent cost decreasing as concurrency increases: ~486µs/agent at 10, ~260µs/agent at 50, ~224µs/agent at 100. This sub-linear scaling confirms the SDK handles high-fanout agent coordination without degradation: Tokio's work-stealing scheduler amortizes task scheduling overhead across the burst.

Cold start vs steady state

The production/cold_start/first_request benchmark (~321µs) appears faster than steady_state (~1.46ms). This is because first_request creates a fresh server per iteration (sample_size=20), measuring server handler initialization plus the first TCP connect. The steady_state benchmark reuses an existing keep-alive connection, measuring the full HTTP round-trip with connection overhead already amortized. The two benchmarks measure different things; they are complementary, not comparable.

Tenant resolver negligible overhead

Tenant resolvers operate at 89–177ns per request, well under 0.01% of a typical ~1.5ms round-trip. Header extraction (127ns) is marginally slower than the miss path (89ns) due to value parsing; path extraction (177ns) is slowest due to URL path parsing overhead. All resolvers are effectively free at production scale.

Pagination context index 2× speedup

The advanced/pagination_walk filtered benchmarks show a ~2× speedup over unfiltered walks (303µs vs 567µs at 1000 tasks). The BTreeSet context index eliminates roughly half the scan work by iterating only the tasks matching the context_id filter.


Methodology

All benchmarks use Criterion.rs, which provides:

  • Statistical significance testing — detects real regressions vs noise
  • Warm-up iterations — avoids cold-start measurement artifacts
  • Median ± MAD — robust central tendency resistant to outliers
  • Configurable sample sizes — more iterations for noisy benchmarks

Measurement rigor

All benchmarks follow these practices for reproducibility:

  • Deterministic inputs: Fixed task IDs and payloads inside iter() — no incrementing counters that change HashMap distribution across iterations
  • Setup outside measurement: Store creation, server startup, and resource allocation happen before iter(), not inside it
  • debug_assert! for invariants: Correctness checks inside measurement loops use debug_assert! to avoid string-formatting cost in release builds
  • black_box() on inputs and outputs: Prevents the compiler from eliminating measured work through dead-code optimization
  • Tolerance-based allocation assertions: Memory benchmarks use a 5% tolerance instead of exact counts to avoid spurious CI failures from serde_json/stdlib version changes
  • Side-effect interceptors: The interceptor chain benchmark uses CountingInterceptor (AtomicU64) to verify interceptors are actually invoked during measurement — not just optimized away
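A minimal Criterion bench following these conventions (the benchmark name and payload below are illustrative):

```rust
use criterion::{criterion_group, criterion_main, Criterion};
use std::hint::black_box;

fn bench_serialize(c: &mut Criterion) {
    // Setup happens outside iter(): one fixed, deterministic input.
    let task = serde_json::json!({ "id": "task-0001", "status": "completed" });
    c.bench_function("example/task_serialize", |b| {
        b.iter(|| {
            // black_box on input and output prevents dead-code elimination.
            let bytes = serde_json::to_vec(black_box(&task)).unwrap();
            debug_assert!(!bytes.is_empty()); // no formatting cost in release
            black_box(bytes)
        })
    });
}

criterion_group!(benches, bench_serialize);
criterion_main!(benches);
```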

What we benchmark

The SDK's value proposition is the A2A protocol layer and runtime efficiency, not agent logic. The bulk of the suite therefore benchmarks what the SDK owns: transport overhead, serialization cost, store operations, concurrency scaling, streaming backpressure, error handling, and memory allocation behavior.

One benchmark — coordinator_chain_under_fault — is deliberately a different shape: it measures end-to-end agent-chain latency under fault injection, not SDK-layer overhead. It is documented in its own section above with the caveats for how to interpret it (in-process only, sequential delegation only, one topology). It is not intended to substitute for a real agent-capability benchmark suite — it closes the most obvious gap in the existing suite while staying honest about what it is.

What we do NOT benchmark

  • Agent intelligence — LLM quality is an eval problem, not a perf benchmark
  • Real network faults — the fault-injection bench simulates synthetic ClientError::Timeout responses in-process, not real packet loss or TCP congestion control
  • Network latency — all benchmarks use loopback (127.0.0.1)
  • TLS handshake — benchmarks use plaintext HTTP
  • Task completion quality — needs human-preference evaluation
  • Multi-agent topologies beyond sequential delegation — critic loops, parallel fan-out with deadline propagation, and plan-and-execute with replanning are out of scope for this crate

Reproducing locally

```bash
# Run all benchmarks
cargo bench -p a2a-benchmarks

# Run a specific module
cargo bench -p a2a-benchmarks --bench transport_throughput

# Save baseline, make changes, then compare
./benches/scripts/run_benchmarks.sh --save
# ... make changes ...
./benches/scripts/run_benchmarks.sh --compare
```

Full HTML reports (with violin plots and comparison overlays) are generated in target/criterion/.