
Benchmark Results

Performance measurements for the a2a-protocol-sdk Rust implementation, generated by Criterion.rs statistical benchmarking.

Note: These numbers were collected on CI runners (ubuntu-latest). Absolute values will vary by hardware. Use these for relative comparisons between operations and for regression detection across releases — not as guarantees of production performance on your specific hardware.

To reproduce on your own machine: cargo bench -p a2a-benchmarks

Last updated: 2026-04-22 22:51 UTC
Rust version: rustc 1.95.0 (59807616e 2026-04-14)
Platform: Linux-x86_64

Interactive dashboard: See the Benchmark Dashboard for charts, visual comparisons, and drill-down analysis of these results.

Transport Throughput

End-to-end HTTP round-trip latency through JSON-RPC and REST transports. All measurements use loopback (127.0.0.1) to isolate SDK overhead from network latency.

| Benchmark | Median |
|---|---|
| transport_jsonrpc_send/single_message | 1.49 ms |
| transport_jsonrpc_stream/stream_drain | 1.50 ms |
| transport_payload_scaling/jsonrpc_send/64 | 1.46 ms |
| transport_payload_scaling/jsonrpc_send/256 | 1.44 ms |
| transport_payload_scaling/jsonrpc_send/1024 | 1.47 ms |
| transport_payload_scaling/jsonrpc_send/4096 | 1.45 ms |
| transport_payload_scaling/jsonrpc_send/16384 | 1.49 ms |
| transport_payload_scaling/jsonrpc_send/102400 | 1.65 ms |
| transport_payload_scaling/jsonrpc_send/1048576 | 2.97 ms |
| transport_rest_send/single_message | 1.50 ms |
| transport_rest_stream/stream_drain | 1.51 ms |

Protocol Overhead

Serialization and deserialization cost per A2A type. This is the baseline tax every message pays regardless of transport.

Includes protocol/payload_scaling benchmarks that measure pure serde cost from 64B to 1MB — the correct regression detection target for serialization changes. Also compares serde_json::to_vec vs SerBuffer (thread-local reuse) and from_slice vs from_str (borrowed deserialization) paths.

| Benchmark | Median |
|---|---|
| protocol_batch/deserialize_tasks/1 | 1.2 µs |
| protocol_batch/deserialize_tasks/10 | 13.4 µs |
| protocol_batch/deserialize_tasks/50 | 68.5 µs |
| protocol_batch/deserialize_tasks/100 | 136.8 µs |
| protocol_batch/serialize_tasks/1 | 443 ns |
| protocol_batch/serialize_tasks/10 | 3.6 µs |
| protocol_batch/serialize_tasks/50 | 16.7 µs |
| protocol_batch/serialize_tasks/100 | 33.0 µs |
| protocol_jsonrpc_envelope/deserialize_request | 791 ns |
| protocol_jsonrpc_envelope/deserialize_response | 1.7 µs |
| protocol_jsonrpc_envelope/serialize_request | 263 ns |
| protocol_jsonrpc_envelope/serialize_response | 511 ns |
| protocol_payload_scaling/from_slice/64 | 326 ns |
| protocol_payload_scaling/from_slice/256 | 358 ns |
| protocol_payload_scaling/from_slice/1024 | 534 ns |
| protocol_payload_scaling/from_slice/4096 | 1.2 µs |
| protocol_payload_scaling/from_slice/16384 | 3.7 µs |
| protocol_payload_scaling/from_slice/102400 | 21.5 µs |
| protocol_payload_scaling/from_slice/1048576 | 229.9 µs |
| protocol_payload_scaling/from_str/64 | 269 ns |
| protocol_payload_scaling/from_str/256 | 301 ns |
| protocol_payload_scaling/from_str/1024 | 426 ns |
| protocol_payload_scaling/from_str/4096 | 964 ns |
| protocol_payload_scaling/from_str/16384 | 3.0 µs |
| protocol_payload_scaling/from_str/102400 | 17.5 µs |
| protocol_payload_scaling/from_str/1048576 | 188.3 µs |
| protocol_payload_scaling/ser_buffer/64 | 143 ns |
| protocol_payload_scaling/ser_buffer/256 | 268 ns |
| protocol_payload_scaling/ser_buffer/1024 | 795 ns |
| protocol_payload_scaling/ser_buffer/4096 | 2.8 µs |
| protocol_payload_scaling/ser_buffer/16384 | 10.9 µs |
| protocol_payload_scaling/ser_buffer/102400 | 68.3 µs |
| protocol_payload_scaling/ser_buffer/1048576 | 725.8 µs |
| protocol_payload_scaling/to_vec/64 | 194 ns |
| protocol_payload_scaling/to_vec/256 | 377 ns |
| protocol_payload_scaling/to_vec/1024 | 845 ns |
| protocol_payload_scaling/to_vec/4096 | 2.9 µs |
| protocol_payload_scaling/to_vec/16384 | 11.0 µs |
| protocol_payload_scaling/to_vec/102400 | 66.3 µs |
| protocol_payload_scaling/to_vec/1048576 | 689.8 µs |
| protocol_stream_events/artifact_update_deserialize | 520 ns |
| protocol_stream_events/artifact_update_serialize | 229 ns |
| protocol_stream_events/status_update_deserialize | 363 ns |
| protocol_stream_events/status_update_serialize | 128 ns |
| protocol_type_serde/agent_card_deserialize | 1.3 µs |
| protocol_type_serde/agent_card_serialize | 576 ns |
| protocol_type_serde/message_deserialize/217 | 679 ns |
| protocol_type_serde/message_serialize/217 | 314 ns |
| protocol_type_serde/task_deserialize/278 | 1.1 µs |
| protocol_type_serde/task_serialize/278 | 437 ns |

Task Lifecycle

TaskStore and EventQueue operations — the backbone of task management.

| Benchmark | Median |
|---|---|
| lifecycle_e2e/send_and_complete | 1.49 ms |
| lifecycle_e2e/stream_and_drain | 1.46 ms |
| lifecycle_queue/write_read/1 | 775 ns |
| lifecycle_queue/write_read/10 | 4.5 µs |
| lifecycle_queue/write_read/50 | 21.7 µs |
| lifecycle_queue/write_read/100 | 43.8 µs |
| lifecycle_store_get/lookup_in_1000 | 423 ns |
| lifecycle_store_list/filtered_page_50_of_250 | 27.0 µs |
| lifecycle_store_save/single_task | 480 ns |

Concurrent Agents

Scaling behavior under parallel load — how latency changes as concurrency increases from 1 to 64 simultaneous operations.

| Benchmark | Median |
|---|---|
| concurrent_mixed/send_then_get | 1.43 ms |
| concurrent_sends/jsonrpc/1 | 1.49 ms |
| concurrent_sends/jsonrpc/4 | 2.99 ms |
| concurrent_sends/jsonrpc/16 | 3.72 ms |
| concurrent_sends/jsonrpc/64 | 6.61 ms |
| concurrent_store/save_and_get/1 | 31.5 µs |
| concurrent_store/save_and_get/4 | 29.7 µs |
| concurrent_store/save_and_get/16 | 66.9 µs |
| concurrent_store/save_and_get/64 | 183.4 µs |
| concurrent_streams/jsonrpc/1 | 1.40 ms |
| concurrent_streams/jsonrpc/4 | 3.16 ms |
| concurrent_streams/jsonrpc/16 | 3.95 ms |
| concurrent_streams/jsonrpc/64 | 7.02 ms |

Realistic Workloads

Production-like usage patterns: multi-turn conversations, mixed payloads, interceptor chains, and connection reuse vs per-request clients.

| Benchmark | Median |
|---|---|
| realistic_complex_card/deserialize/1 | 2.3 µs |
| realistic_complex_card/deserialize/10 | 12.8 µs |
| realistic_complex_card/deserialize/50 | 58.5 µs |
| realistic_complex_card/deserialize/100 | 114.4 µs |
| realistic_complex_card/serialize/1 | 962 ns |
| realistic_complex_card/serialize/10 | 3.4 µs |
| realistic_complex_card/serialize/50 | 13.7 µs |
| realistic_complex_card/serialize/100 | 26.7 µs |
| realistic_connection/new_client_per_request | 1.55 ms |
| realistic_connection/reused_client | 1.41 ms |
| realistic_history_serde/deserialize/1 | 1.4 µs |
| realistic_history_serde/deserialize/5 | 3.1 µs |
| realistic_history_serde/deserialize/10 | 5.2 µs |
| realistic_history_serde/deserialize/20 | 10.2 µs |
| realistic_history_serde/deserialize/50 | 24.6 µs |
| realistic_history_serde/serialize/1 | 502 ns |
| realistic_history_serde/serialize/5 | 1.2 µs |
| realistic_history_serde/serialize/10 | 2.1 µs |
| realistic_history_serde/serialize/20 | 3.7 µs |
| realistic_history_serde/serialize/50 | 8.1 µs |
| realistic_interceptor_chain/interceptors/0 | 167.6 µs |
| realistic_interceptor_chain/interceptors/1 | 169.0 µs |
| realistic_interceptor_chain/interceptors/5 | 169.6 µs |
| realistic_interceptor_chain/interceptors/10 | 171.4 µs |
| realistic_multi_turn/sequential/1 | 1.47 ms |
| realistic_multi_turn/sequential/3 | 4.58 ms |
| realistic_multi_turn/sequential/5 | 7.55 ms |
| realistic_multi_turn/sequential/10 | 15.09 ms |
| realistic_payload_complexity/large_metadata_10kb | 1.54 ms |
| realistic_payload_complexity/mixed_parts | 1.45 ms |
| realistic_payload_complexity/nested_metadata_10 | 1.45 ms |
| realistic_payload_complexity/simple_text | 1.50 ms |

Error Paths

Cost of error handling — comparing happy path latency to error path latency. Production systems spend significant time on error paths; benchmarking only the happy path gives an incomplete picture.

| Benchmark | Median |
|---|---|
| errors_happy_vs_error/error_path | 1.35 ms |
| errors_happy_vs_error/happy_path | 1.44 ms |
| errors_malformed_request/invalid_json | 95.0 µs |
| errors_malformed_request/wrong_content_type | 94.2 µs |
| errors_task_not_found/get_nonexistent_task | 110.1 µs |

Streaming & Backpressure

Stream throughput under varying event volumes and consumer speeds. Reveals buffering and flow-control overhead that synthetic single-event tests miss.

The default broadcast channel capacity was increased from 64 to 256 events in v0.5.0, pushing the per-event cost inflection point from ~52 events to ~252 events. Deployments with >256 events/task should use EventQueueManager::with_capacity() to set a higher value.

| Benchmark | Median |
|---|---|
| backpressure_concurrent_streams/streams/1 | 1.50 ms |
| backpressure_concurrent_streams/streams/4 | 3.33 ms |
| backpressure_concurrent_streams/streams/16 | 4.32 ms |
| backpressure_slow_consumer/fast_consumer | 1.58 ms |
| backpressure_slow_consumer/1ms_delay | 29.48 ms |
| backpressure_slow_consumer/5ms_delay | 82.10 ms |
| backpressure_stream_volume/3_events | 1.47 ms |
| backpressure_stream_volume/7_events | 1.54 ms |
| backpressure_stream_volume/27_events | 1.78 ms |
| backpressure_stream_volume/52_events | 2.01 ms |
| backpressure_stream_volume/252_events | 9.82 ms |
| backpressure_stream_volume/502_events | 50.06 ms |
| backpressure_timer_calibration/sleep_1ms_actual | 2.10 ms |
| backpressure_timer_calibration/sleep_5ms_actual | 6.18 ms |

Data Volume Scaling

TaskStore performance at realistic data volumes (1K to 100K tasks). Shows how store operations scale as data accumulates over time.

| Benchmark | Median |
|---|---|
| data_volume_concurrent_reads/get/1 | 28.3 µs |
| data_volume_concurrent_reads/get/4 | 30.3 µs |
| data_volume_concurrent_reads/get/16 | 35.9 µs |
| data_volume_concurrent_reads/get/64 | 77.1 µs |
| data_volume_get/lookup/1000 | 427 ns |
| data_volume_get/lookup/10000 | 437 ns |
| data_volume_get/lookup/100000 | 209 ns |
| data_volume_history_depth/save_with_turns/1 | 1.6 µs |
| data_volume_history_depth/save_with_turns/5 | 3.1 µs |
| data_volume_history_depth/save_with_turns/10 | 6.1 µs |
| data_volume_history_depth/save_with_turns/20 | 10.7 µs |
| data_volume_history_depth/save_with_turns/50 | 23.5 µs |
| data_volume_list/filtered_page_50/1000 | 26.1 µs |
| data_volume_list/filtered_page_50/10000 | 26.1 µs |
| data_volume_list/filtered_page_50/100000 | 25.7 µs |
| data_volume_save/after_prefill/0 | 1.8 µs |
| data_volume_save/after_prefill/1000 | 1.4 µs |
| data_volume_save/after_prefill/10000 | 1.4 µs |
| data_volume_save/after_prefill/50000 | 1.4 µs |

Memory Overhead

Heap allocation counts and bytes per operation, measured via a counting allocator installed with #[global_allocator]. Values are allocation counts or byte counts, not durations; they are encoded as nanoseconds only so Criterion can track them.
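For reference, the counting-allocator technique works roughly like this; the identifiers below (CountingAlloc, ALLOC_CALLS) are illustrative, not the benchmark crate's actual names:

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicU64, Ordering};

static ALLOC_CALLS: AtomicU64 = AtomicU64::new(0);
static BYTES_ALLOCATED: AtomicU64 = AtomicU64::new(0);

struct CountingAlloc;

unsafe impl GlobalAlloc for CountingAlloc {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        // Count every alloc() call and the bytes it requests, then delegate.
        ALLOC_CALLS.fetch_add(1, Ordering::Relaxed);
        BYTES_ALLOCATED.fetch_add(layout.size() as u64, Ordering::Relaxed);
        System.alloc(layout)
    }
    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        System.dealloc(ptr, layout)
    }
}

#[global_allocator]
static GLOBAL: CountingAlloc = CountingAlloc;

/// Runs `f` and returns its result plus the number of allocations it made.
fn allocs_during<T>(f: impl FnOnce() -> T) -> (T, u64) {
    let before = ALLOC_CALLS.load(Ordering::Relaxed);
    let out = f();
    (out, ALLOC_CALLS.load(Ordering::Relaxed) - before)
}
```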

| Metric | Unit |
|---|---|
| `*_alloc_count` | Number of alloc() calls per operation |
| `*_bytes_per_payload` | Bytes allocated per operation |

| Benchmark | Value |
|---|---|
| memory_bytes_per_payload/serialize_bytes/64 | 152 |
| memory_bytes_per_payload/serialize_bytes/256 | 253 |
| memory_bytes_per_payload/serialize_bytes/1024 | 568 |
| memory_bytes_per_payload/serialize_bytes/4096 | 1682 |
| memory_bytes_per_payload/serialize_bytes/16384 | 6442 |
| memory_deserialize/agent_card_alloc_count | 1361 |
| memory_deserialize/task_alloc_count | 1030 |
| memory_history_scaling/deserialize_allocs/1 | 1349 |
| memory_history_scaling/deserialize_allocs/5 | 2967 |
| memory_history_scaling/deserialize_allocs/10 | 5171 |
| memory_history_scaling/deserialize_allocs/20 | 10209 |
| memory_history_scaling/deserialize_allocs/50 | 23558 |
| memory_history_scaling/serialize_allocs/1 | 459 |
| memory_history_scaling/serialize_allocs/5 | 1160 |
| memory_history_scaling/serialize_allocs/10 | 1951 |
| memory_history_scaling/serialize_allocs/20 | 5013 |
| memory_history_scaling/serialize_allocs/50 | 12043 |
| memory_serialize/agent_card_alloc_count | 500 |
| memory_serialize/task_alloc_count | 342 |

Cross-Language Comparison

Standardized workloads designed to be reproduced identically across all A2A SDK implementations (Python, Go, JS, Java, C#/.NET).

  • All SDKs hit the same Rust echo server (eliminates server-side variance)
  • All workloads use identical JSON payloads from benches/cross_language/
  • Results use median ± MAD to resist outlier pollution
| Benchmark | Median |
|---|---|
| cross_language_concurrent_50/rust | 6.01 ms |
| cross_language_echo_roundtrip/rust | 1.44 ms |
| cross_language_minimal_overhead/rust | 1.33 ms |
| cross_language_serialize_agent_card/rust_deserialize | 1.4 µs |
| cross_language_serialize_agent_card/rust_roundtrip | 2.2 µs |
| cross_language_serialize_agent_card/rust_serialize | 601 ns |
| cross_language_stream_events/rust | 1.48 ms |

Enterprise Scenarios

Production-scale workloads modeling real deployments: multi-tenant isolation, push notification management, eviction under memory pressure, rate limiting, CORS handling, read/write mix ratios, and large conversation histories.

| Benchmark | Median |
|---|---|
| enterprise_cancel_task/send_then_cancel | 1.46 ms |
| enterprise_client_interceptors/interceptors/0 | 1.40 ms |
| enterprise_client_interceptors/interceptors/1 | 1.46 ms |
| enterprise_client_interceptors/interceptors/5 | 1.43 ms |
| enterprise_client_interceptors/interceptors/10 | 1.40 ms |
| enterprise_cors/options_preflight | 88.0 µs |
| enterprise_eviction/save_at_capacity/100 | 536 ns |
| enterprise_eviction/save_at_capacity/1000 | 588 ns |
| enterprise_eviction/save_at_capacity/10000 | 862 ns |
| enterprise_eviction/sweep_duration/100 | 131 ns |
| enterprise_eviction/sweep_duration/1000 | 133 ns |
| enterprise_eviction/sweep_duration/10000 | 154 ns |
| enterprise_handler_limits/default_limits | 1.37 ms |
| enterprise_handler_limits/metadata_rejection | 119.4 µs |
| enterprise_handler_limits/tight_limits | 1.40 ms |
| enterprise_large_history/deserialize/100 | 47.8 µs |
| enterprise_large_history/deserialize/200 | 92.9 µs |
| enterprise_large_history/deserialize/500 | 228.5 µs |
| enterprise_large_history/serialize/100 | 16.1 µs |
| enterprise_large_history/serialize/200 | 31.5 µs |
| enterprise_large_history/serialize/500 | 77.9 µs |
| enterprise_large_history/store_save/100 | 16.8 µs |
| enterprise_large_history/store_save/200 | 34.2 µs |
| enterprise_large_history/store_save/500 | 90.3 µs |
| enterprise_list_tasks/page_size/10 | 140.6 µs |
| enterprise_list_tasks/page_size/25 | 182.3 µs |
| enterprise_list_tasks/page_size/50 | 238.0 µs |
| enterprise_multi_tenant/concurrent_tenant_saves/1 | 31.0 µs |
| enterprise_multi_tenant/concurrent_tenant_saves/10 | 46.3 µs |
| enterprise_multi_tenant/concurrent_tenant_saves/50 | 87.6 µs |
| enterprise_multi_tenant/concurrent_tenant_saves/100 | 138.9 µs |
| enterprise_multi_tenant/tenant_isolation_check/1 | 498 ns |
| enterprise_multi_tenant/tenant_isolation_check/10 | 494 ns |
| enterprise_multi_tenant/tenant_isolation_check/50 | 493 ns |
| enterprise_multi_tenant/tenant_isolation_check/100 | 491 ns |
| enterprise_push_config/get | 288 ns |
| enterprise_push_config/list_per_task/1 | 194 ns |
| enterprise_push_config/list_per_task/10 | 1.3 µs |
| enterprise_push_config/list_per_task/50 | 11.6 µs |
| enterprise_push_config/set | 1.2 µs |
| enterprise_rate_limiting/no_rate_limit | 1.39 ms |
| enterprise_rate_limiting/with_rate_limit | 1.39 ms |
| enterprise_rw_mix/100r_0w | 60.8 µs |
| enterprise_rw_mix/75r_25w | 105.8 µs |
| enterprise_rw_mix/50r_50w | 153.7 µs |
| enterprise_rw_mix/25r_75w | 195.5 µs |
| enterprise_rw_mix/0r_100w | 224.2 µs |

Production Scenarios

Full end-to-end workflows exercising the complete SDK pipeline in scenarios that real-world deployments encounter at scale: task reconnection, cold start latency, concurrent race conditions, multi-context orchestration, push config lifecycle, parallel agent bursts, and dispatch routing overhead isolation.

| Benchmark | Median |
|---|---|
| production_agent_burst/agents/10 | 4.86 ms |
| production_agent_burst/agents/50 | 12.99 ms |
| production_agent_burst/agents/100 | 22.36 ms |
| production_cancel_subscribe_race/concurrent_cancel_and_subscribe | 849.7 µs |
| production_cold_start/first_request | 321.0 µs |
| production_cold_start/steady_state | 1.46 ms |
| production_dispatch_routing/direct_handler_invoke | 1.42 ms |
| production_dispatch_routing/full_http_roundtrip | 1.33 ms |
| production_e2e_orchestration/7_step_workflow | 6.20 ms |
| production_push_config/delete_roundtrip | 208.5 µs |
| production_push_config/get_roundtrip | 103.9 µs |
| production_push_config/list_roundtrip | 103.3 µs |
| production_push_config/set_roundtrip | 106.8 µs |
| production_subscribe_to_task/send_then_subscribe | 1.62 ms |

Advanced Scenarios

Benchmarks covering previously unbenchmarked SDK paths: tenant resolver overhead, agent card hot-reload and discovery, subscribe fan-out for reconnection bursts, streaming artifact accumulation cost (the ~90µs/event bottleneck), full pagination walks, and the extended agent card round-trip.

| Benchmark | Median |
|---|---|
| advanced_agent_card_discovery/well_known_endpoint | 87.8 µs |
| advanced_agent_card_hot_reload/read_current_card | 309 ns |
| advanced_agent_card_hot_reload/swap_and_read | 664 ns |
| advanced_agent_card_hot_reload/swap_complex_card | 59.5 µs |
| advanced_artifact_accumulation/store_save_at_depth/0 | 397 ns |
| advanced_artifact_accumulation/store_save_at_depth/10 | 1.7 µs |
| advanced_artifact_accumulation/store_save_at_depth/50 | 6.3 µs |
| advanced_artifact_accumulation/store_save_at_depth/100 | 14.4 µs |
| advanced_artifact_accumulation/store_save_at_depth/500 | 76.6 µs |
| advanced_artifact_accumulation/task_clone_at_depth/0 | 125 ns |
| advanced_artifact_accumulation/task_clone_at_depth/10 | 1.2 µs |
| advanced_artifact_accumulation/task_clone_at_depth/50 | 6.9 µs |
| advanced_artifact_accumulation/task_clone_at_depth/100 | 13.4 µs |
| advanced_artifact_accumulation/task_clone_at_depth/500 | 65.9 µs |
| advanced_extended_agent_card/get_extended_card_roundtrip | 116.6 µs |
| advanced_pagination_walk/filtered/100_tasks_page_25 | 30.1 µs |
| advanced_pagination_walk/filtered/1000_tasks_page_50 | 303.0 µs |
| advanced_pagination_walk/unfiltered/100_tasks_page_25 | 60.8 µs |
| advanced_pagination_walk/unfiltered/1000_tasks_page_50 | 567.1 µs |
| advanced_subscribe_fanout/concurrent_subscribers/1 | 1.90 ms |
| advanced_subscribe_fanout/concurrent_subscribers/5 | 2.04 ms |
| advanced_subscribe_fanout/concurrent_subscribers/10 | 2.33 ms |
| advanced_tenant_resolver/bearer_resolver | 126 ns |
| advanced_tenant_resolver/bearer_resolver_with_mapper | 146 ns |
| advanced_tenant_resolver/header_resolver | 127 ns |
| advanced_tenant_resolver/header_resolver_miss | 89 ns |
| advanced_tenant_resolver/path_resolver | 177 ns |

Agent-Level Latency Under Fault

End-to-end latency through a 5-hop in-process coordinator chain as the links between hops are made progressively less reliable. Unlike every other benchmark on this page, this one does not measure SDK-layer overhead. It measures what an agent-harness evaluation actually asks: "what is the end-to-end latency of an agent chain when the network between agents is unreliable, and how well do per-hop retries absorb it?"

The topology is:

test client ─[link 0]─▶ coord 1 ─[link 1]─▶ coord 2 ─[link 2]─▶ coord 3 ─[link 3]─▶ coord 4 ─[link 4]─▶ leaf

Every coordinator forwards the message to the next hop via a pre-built A2aClient wrapped in a FaultInjectingTransport. Each link applies its own independent fault profile, so per-hop faults compound end-to-end the way they would in a real deployment. Coordinators 1–4 retry their downstream call up to 3 times on retryable errors; the bench harness additionally retries the top-level send_message up to 8 times, so the published numbers reflect the successful path with effectively zero probability of unrecoverable failure.

Honest caveats — read these before interpreting the numbers:

  • In-process, not network faults. The injected "error" is a synthetic ClientError::Timeout returned before the wrapped transport is called. This exercises the SDK's retry path faithfully, but does not exercise TCP congestion control, DNS resolution, or transport-level head-of-line blocking. Treat the numbers as "latency under SDK-level retransmission pressure," not "latency under real network loss."
  • One topology. Sequential delegation is the simplest multi-agent shape. Critic loops, parallel fan-out with deadline propagation, and plan-and-execute with replanning would be more rubric-relevant — this benchmark does not claim to cover those.
  • One benchmark does not retroactively make the other suites agent-level. It is deliberately additive: the first concrete data point in the "agent-level latency under fault" shape that the rest of the suite was missing entirely.

Group 1: per-hop latency injection (zero errors)

Varies per-link latency from 0 µs to 20,000 µs with zero synthetic errors, isolating the chain's latency-compounding factor from retry jitter. The expected lower bound is five hops × the per-hop injected latency, plus the five-hop JSON-RPC loopback baseline (~2 ms with zero added latency).

Group 2: per-hop error injection (3 retries per hop + 8 outer retries)

Varies per-link synthetic-fault rate from 0% to 5% with zero added latency. Each coordinator retries its downstream call up to 3 times on retryable errors; the bench harness retries the top-level call up to 8 times. Records successful-path latency including retry cost, which is what "steady-state end-to-end latency under fault" means in practice.
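As a sketch of the retry policy these groups exercise (the real coordinators forward through A2aClient; the error type and the `op` closure below are simplified stand-ins, not SDK APIs):

```rust
use std::future::Future;

#[derive(Debug)]
enum FaultError {
    Timeout,       // the synthetic retryable fault injected by the harness
    Fatal(String), // anything classified as non-retryable
}

impl FaultError {
    fn is_retryable(&self) -> bool {
        matches!(self, FaultError::Timeout)
    }
}

/// Retries `op` up to `max_attempts` times on retryable errors; this is the
/// per-hop policy (max_attempts = 3) and, with max_attempts = 8, the outer
/// harness policy described above.
async fn with_retries<T, F, Fut>(max_attempts: u32, mut op: F) -> Result<T, FaultError>
where
    F: FnMut() -> Fut,
    Fut: Future<Output = Result<T, FaultError>>,
{
    let mut last = None;
    for _ in 0..max_attempts {
        match op().await {
            Ok(v) => return Ok(v),
            Err(e) if e.is_retryable() => last = Some(e), // retry this hop
            Err(e) => return Err(e),                      // fail fast otherwise
        }
    }
    Err(last.expect("max_attempts must be > 0"))
}
```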


Known Measurement Limitations

These notes help interpret benchmark results accurately and avoid misdiagnosing CI variance as real performance changes.

Streaming cross-thread scheduling

On N-core systems, `tokio::spawn` places the SSE builder task on a different worker thread with probability (N-1)/N, incurring a ~500µs cache-miss and work-stealing penalty. This was root-caused as the source of the ~24% bimodal distribution observed in all streaming benchmarks.

Mitigations (v1.0.0): The SSE builder uses `sleep` + reset (not `interval`) to eliminate timer wheel entries during active streaming. Transport streaming benchmarks use `worker_threads(1)` runtime to eliminate cross-thread variance entirely (24 high severe → 4 high mild outliers, 3× tighter confidence intervals).
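The single-worker runtime mitigation uses the standard tokio builder; a minimal sketch (the exact harness wiring in the bench crate may differ):

```rust
use tokio::runtime::Builder;

/// One worker thread: spawned tasks cannot migrate across cores, which
/// eliminates the cross-thread work-stealing variance described above.
fn single_worker_runtime() -> tokio::runtime::Runtime {
    Builder::new_multi_thread()
        .worker_threads(1)
        .enable_all()
        .build()
        .expect("failed to build benchmark runtime")
}
```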

Data volume get() at 100K tasks

The data_volume/get/100K benchmark previously reported ~42% faster lookups than the 1K/10K cases due to a CPU cache warming artifact from the large populate_store() setup filling L1/L2 caches. A 4MB cache-busting step was added in v0.5.0 to flush caches between populate and measure, producing more representative O(1) lookup times across all scales. The 1K/10K number (~450ns) remains the representative baseline.

Stream volume per-event cost inflection

Per-event cost inflects dramatically when events exceed the broadcast channel capacity. The default capacity was increased from 64 to 256 events in v0.5.0, pushing the inflection from ~52 events to ~252 events:

  • Below capacity: ~4µs/event (fast path)
  • At capacity boundary: ~53µs/event (12× jump — broadcast back-pressure)
  • Above capacity: ~130µs/event (SSE frame accumulation under overflow)

Production deployments expecting >256 events/task should increase EventQueueManager::with_capacity() to match their peak volume.
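A sketch of the sizing call; this assumes with_capacity takes the per-task event capacity, so verify the exact signature against the crate docs:

```rust
// Assumption: the argument is a per-task event count (see SDK docs).
// Size it comfortably above the expected peak events per task.
let queues = EventQueueManager::with_capacity(1024);
```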

Transport payload insensitivity

Transport benchmarks (64B → 16KB) show only ~10% latency increase for a 256× payload increase, because the ~1.4ms HTTP round-trip dominates. Serde regressions cannot be detected via transport benchmarks. Use the protocol/payload_scaling isolation benchmarks (64B → 1MB, pure serde) for serialization regression detection.

Connection reuse impact

Connection reuse saves ~140µs (9%) on loopback. On real networks with TLS, savings would be 10-50ms (TLS handshake dominates). Best practice: create one A2aClient at startup and share via Arc across request handlers.
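A fragment sketching that pattern; `A2aClient::connect` is a placeholder for whatever constructor the SDK exposes, and the point being shown is the one-time construction plus Arc sharing:

```rust
use std::sync::Arc;

// Build the client once at startup (placeholder constructor name).
let client = Arc::new(A2aClient::connect("http://127.0.0.1:8080").await?);

// Handlers clone the Arc (a pointer bump) and reuse the client's connection
// pool instead of paying per-request client construction.
let handler_client = Arc::clone(&client);
```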

Deserialization allocation overhead

Deserialization allocates roughly 3× more than serialization (Task: ~1,030 vs 342 allocs). This is inherent to serde_json's parsing model: every field creates an intermediate String/Vec allocation during parsing. The serde_helpers::deser_from_str() helper enables serde_json's borrowed-data path for ~15-25% fewer allocations, and serde_helpers::SerBuffer provides thread-local buffer reuse for serialization, eliminating the 2.3× small-payload overhead.
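Both optimizations can be sketched with plain serde_json; the SDK helpers wrap equivalents of these, and the field names and buffer size below are illustrative:

```rust
use serde::Deserialize;
use std::cell::RefCell;

// Borrowed deserialization: &str fields borrow from the input instead of
// allocating a String per field. Note that borrowing fails with an error if
// a JSON string contains escape sequences.
#[derive(Deserialize)]
struct TaskRef<'a> {
    id: &'a str,
    context_id: &'a str,
}

fn parse_borrowed(json: &str) -> serde_json::Result<TaskRef<'_>> {
    serde_json::from_str(json)
}

thread_local! {
    // Thread-local scratch buffer: the reuse idea behind SerBuffer.
    static BUF: RefCell<Vec<u8>> = RefCell::new(Vec::with_capacity(4096));
}

fn serialize_reused<T: serde::Serialize>(value: &T) -> Vec<u8> {
    BUF.with(|b| {
        let mut buf = b.borrow_mut();
        buf.clear(); // keeps capacity from previous calls
        serde_json::to_writer(&mut *buf, value).expect("serialization failed");
        buf.clone() // one exact-size allocation; no growth reallocations
    })
}
```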

History depth allocation scaling

History depth scales at ~494 deserialization allocs/turn and ~242 serialization allocs/turn (linear, constant marginal cost), so a 50-turn history costs roughly 24,000 deserialization allocations per store.get(). The serde_helpers module provides optimized paths; for maximum throughput on deep histories, consider storing pre-serialized bytes alongside parsed structs to avoid re-parsing on every read.
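One way to realize the pre-serialized-bytes suggestion, with `Task` as a stand-in for the SDK type:

```rust
use serde::Serialize;
use std::sync::Arc;

#[derive(Serialize)]
struct Task { // stand-in for the SDK task type
    id: String,
    history: Vec<String>,
}

struct StoredTask {
    parsed: Task,
    // Serialized once at save time and shared on every read, so a deep
    // history pays its encode allocations once rather than per get().
    json: Arc<[u8]>,
}

impl StoredTask {
    fn new(parsed: Task) -> serde_json::Result<Self> {
        let json: Arc<[u8]> = serde_json::to_vec(&parsed)?.into();
        Ok(Self { parsed, json })
    }
}
```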

Artifact accumulation clone cost

The background event processor clones the full Task struct on each SSE event. Clone cost scales linearly at ~133ns/artifact. For tasks with 500+ accumulated artifacts, consider batching event processing or using the planned copy-on-write artifact storage (tracked as a future optimization).
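One copy-on-write shape that would make those clones cheap (illustrative only; the SDK's planned design may differ):

```rust
use std::sync::Arc;

#[derive(Clone)]
struct Artifact { name: String, bytes: Vec<u8> } // stand-in artifact type

#[derive(Clone)]
struct Task {
    id: String,
    // Clones share the artifact list, so Task::clone is a pointer bump
    // instead of ~133ns per accumulated artifact.
    artifacts: Arc<Vec<Artifact>>,
}

impl Task {
    fn push_artifact(&mut self, a: Artifact) {
        // Copies the Vec only if another clone still holds a reference.
        Arc::make_mut(&mut self.artifacts).push(a);
    }
}
```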

Slow consumer timer calibration

The backpressure/timer_calibration benchmarks measure actual tokio::time::sleep() durations on the CI runner. On shared runners, a nominal 1ms sleep takes ~2.10ms and a nominal 5ms sleep takes ~6.18ms. Slow consumer results should be interpreted against these calibrated durations, not the nominal sleep values.
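The calibration itself is a few lines with std and tokio:

```rust
use std::time::Instant;
use tokio::time::{sleep, Duration};

/// Returns the actual elapsed time for a nominal sleep; on shared CI runners
/// this overshoots (e.g. ~2.10 ms measured for a nominal 1 ms).
async fn calibrate_sleep(nominal: Duration) -> Duration {
    let start = Instant::now();
    sleep(nominal).await;
    start.elapsed()
}
```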

Data volume save() wide confidence intervals

The data_volume/save/after_prefill/10000 benchmark reports wide confidence intervals ([1.4µs, 3.5µs], a 2.5× range) and an 18% high severe outlier rate. This is caused by BTreeSet rebalancing spikes when the sorted index crosses internal node-split thresholds during insert. The median (~1.4µs) is representative; the wide CI reflects genuine variance from the B-tree data structure, not measurement noise. This is an acceptable tradeoff: the BTreeSet enables O(page_size) pagination queries instead of O(n) full scans.
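The tradeoff shows up in how the index is queried; a minimal sketch using illustrative (created_at, task_id) keys:

```rust
use std::collections::BTreeSet;
use std::ops::Bound::{Excluded, Unbounded};

/// Fetches one page after a cursor. Cost is O(log N + page_size) via range
/// iteration, instead of the O(N) scan a flat map would need.
fn next_page(
    index: &BTreeSet<(u64, String)>,
    cursor: (u64, String),
    page_size: usize,
) -> Vec<(u64, String)> {
    index
        .range((Excluded(cursor), Unbounded))
        .take(page_size)
        .cloned()
        .collect()
}
```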

Dispatch routing: direct handler vs HTTP round-trip

The production/dispatch_routing/direct_handler_invoke benchmark may report marginally higher latency than full_http_roundtrip. This is not anomalous — the HTTP path reuses a warm keep-alive connection that amortizes TCP setup cost, while direct handler invocation exercises the full dispatch path without connection pooling benefits. The ~7% difference validates that the HTTP layer adds near-zero overhead for repeat requests on warm connections.

Subscribe fan-out O(1) scaling

The advanced/subscribe_fanout benchmark shows near-O(1) cost from 1 to 5 subscribers (~1.90ms vs ~2.04ms), with a gradual increase at 10 subscribers (~2.33ms). The broadcast channel delivers to all subscribers in a single pass; the uptick at 10+ subscribers reflects increased channel contention and memory pressure from concurrent readers.

Agent burst sub-linear scaling

The production/agent_burst benchmark shows per-agent cost decreasing as concurrency increases: ~486µs/agent at 10, ~260µs/agent at 50, ~224µs/agent at 100. This sub-linear scaling confirms the SDK handles high-fanout agent coordination without degradation: Tokio's work-stealing scheduler amortizes task scheduling overhead across the burst.

Cold start vs steady state

The production/cold_start/first_request benchmark (~321µs) appears faster than steady_state (~1.46ms). This is because first_request creates a fresh server per iteration (sample_size=20), measuring server handler initialization plus the first TCP connect. The steady_state benchmark reuses an existing keep-alive connection, measuring the full HTTP round-trip with connection overhead already amortized. The two benchmarks measure different things; they are complementary, not comparable.

Tenant resolver negligible overhead

Tenant resolvers operate at 89–177ns per request, well under 0.01% of a typical ~1.5ms round-trip. Header extraction (127ns) is marginally slower than the miss path (89ns) due to value parsing; path extraction (177ns) is slowest due to URL path parsing overhead. All resolvers are effectively free at production scale.

Pagination context index 2× speedup

The advanced/pagination_walk filtered benchmarks show a ~2× speedup over unfiltered walks (303µs vs 567µs at 1000 tasks). The BTreeSet context index eliminates roughly half the scan work by iterating only the tasks matching the context_id filter.


Methodology

All benchmarks use Criterion.rs, which provides:

  • Statistical significance testing — detects real regressions vs noise
  • Warm-up iterations — avoids cold-start measurement artifacts
  • Median ± MAD — robust central tendency resistant to outliers
  • Configurable sample sizes — more iterations for noisy benchmarks

Measurement rigor

All benchmarks follow these practices for reproducibility:

  • Deterministic inputs: Fixed task IDs and payloads inside iter() — no incrementing counters that change HashMap distribution across iterations
  • Setup outside measurement: Store creation, server startup, and resource allocation happen before iter(), not inside it
  • debug_assert! for invariants: Correctness checks inside measurement loops use debug_assert! to avoid string-formatting cost in release builds
  • black_box() on inputs and outputs: Prevents the compiler from eliminating measured work through dead-code optimization
  • Tolerance-based allocation assertions: Memory benchmarks use a 5% tolerance instead of exact counts to avoid spurious CI failures from serde_json/stdlib version changes
  • Side-effect interceptors: The interceptor chain benchmark uses CountingInterceptor (AtomicU64) to verify interceptors are actually invoked during measurement — not just optimized away
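A minimal Criterion bench following these conventions (the benchmark name and payload below are illustrative):

```rust
use criterion::{criterion_group, criterion_main, Criterion};
use std::hint::black_box;

fn bench_serialize(c: &mut Criterion) {
    // Setup happens outside iter(): one fixed, deterministic input.
    let task = serde_json::json!({ "id": "task-0001", "status": "completed" });
    c.bench_function("example/task_serialize", |b| {
        b.iter(|| {
            // black_box on input and output prevents dead-code elimination.
            let bytes = serde_json::to_vec(black_box(&task)).unwrap();
            debug_assert!(!bytes.is_empty()); // no formatting cost in release
            black_box(bytes)
        })
    });
}

criterion_group!(benches, bench_serialize);
criterion_main!(benches);
```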

What we benchmark

The SDK's value proposition is the A2A protocol layer and runtime efficiency, not agent logic. The bulk of the suite therefore benchmarks what the SDK owns: transport overhead, serialization cost, store operations, concurrency scaling, streaming backpressure, error handling, and memory allocation behavior.

One benchmark — coordinator_chain_under_fault — is deliberately a different shape: it measures end-to-end agent-chain latency under fault injection, not SDK-layer overhead. It is documented in its own section above with the caveats for how to interpret it (in-process only, sequential delegation only, one topology). It is not intended to substitute for a real agent-capability benchmark suite — it closes the most obvious gap in the existing suite while staying honest about what it is.

What we do NOT benchmark

  • Agent intelligence — LLM quality is an eval problem, not a perf benchmark
  • Real network faults — the fault-injection bench simulates synthetic ClientError::Timeout responses in-process, not real packet loss or TCP congestion control
  • Network latency — all benchmarks use loopback (127.0.0.1)
  • TLS handshake — benchmarks use plaintext HTTP
  • Task completion quality — needs human-preference evaluation
  • Multi-agent topologies beyond sequential delegation — critic loops, parallel fan-out with deadline propagation, and plan-and-execute with replanning are out of scope for this crate

Reproducing locally

```bash
# Run all benchmarks
cargo bench -p a2a-benchmarks

# Run a specific module
cargo bench -p a2a-benchmarks --bench transport_throughput

# Save baseline, make changes, then compare
./benches/scripts/run_benchmarks.sh --save
# ... make changes ...
./benches/scripts/run_benchmarks.sh --compare
```

Full HTML reports (with violin plots and comparison overlays) are generated in target/criterion/.