Improve benchmark realism for production scenarios #80

## Summary

Current benchmarks compare mean execution time against reference libraries, mostly on synthetic data. That is a solid baseline, but it misses several dimensions that matter in production: tail latency, memory usage, concurrency, data realism, and scalability curves.

This is a tracking issue for a set of improvements to make zerodep benchmarks more representative of real-world workloads.

## Current strengths

- Apples-to-apples comparison against 30+ reference libraries
- S/M/L data size tiers across most modules
- Network modules (httpclient, websocket, httpserver) use real local server fixtures
- readability uses real CNN/BBC/Guardian HTML pages

## Gaps vs production reality

| Gap | Impact | Description |
|-----|--------|-------------|
| **Mean-only reporting** | Low | Report shows only mean; min/max/stddev already collected but not displayed |
| **No memory metrics** | Medium | Peak RSS matters in containerized deployments; parsers (yaml, soup, multipart, protobuf) are most affected |
| **No concurrency benchmarks** | High | httpclient tested async but single-request only; no `asyncio.gather` or thread-pool contention |
| **Synthetic data dominance** | Medium | Most modules use generated data; real configs, dirty HTML, edge-case payloads are absent |
| **No cold start measurement** | Low | First-call overhead (import, regex compile, table init) hidden by warmup rounds |
| **Fixed S/M/L, no scale curve** | Medium | Three data points can't reveal complexity inflection points |
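For the first and third rows, a minimal sketch of what richer reporting and a concurrency benchmark could look like. Everything here is an assumption for illustration: `summarize`, `fake_request`, and `run_concurrent` are hypothetical names, not part of the zerodep benchmark suite, and `fake_request` sleeps instead of calling httpclient. P95 uses the nearest-rank method.

```python
import asyncio
import statistics
import time


def summarize(samples):
    """Return mean, stddev, min, max, and P95 for a list of timings (seconds)."""
    ordered = sorted(samples)
    n = len(ordered)
    # Nearest-rank P95: 1-based rank ceil(0.95 * n), computed with integer math.
    rank = (95 * n + 99) // 100
    return {
        "mean": statistics.fmean(ordered),
        # Sample stddev needs at least two points.
        "stddev": statistics.stdev(ordered) if n > 1 else 0.0,
        "min": ordered[0],
        "max": ordered[-1],
        "p95": ordered[rank - 1],
    }


async def fake_request(delay):
    """Hypothetical stand-in for one client request; sleeps instead of doing I/O."""
    start = time.perf_counter()
    await asyncio.sleep(delay)
    return time.perf_counter() - start


async def run_concurrent(n_requests, delay=0.01):
    """Fire n_requests at once with asyncio.gather.

    Returns (per-request latencies, requests per second of wall time).
    """
    wall_start = time.perf_counter()
    latencies = await asyncio.gather(
        *(fake_request(delay) for _ in range(n_requests))
    )
    wall = time.perf_counter() - wall_start
    return list(latencies), n_requests / wall
```

Calling `asyncio.run(run_concurrent(50))` and passing the latencies to `summarize` would give a per-request distribution under contention plus an aggregate throughput figure. A real benchmark would swap `fake_request` for a call against the existing local server fixtures.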

## Sub-issues

- [x] #81 — Expose tail latency (min/max/stddev/P95) in reports `P1`
- [x] #82 — Add concurrency/throughput benchmarks for network modules `P1`
- [x] #83 — Add real-world fixture data for parser modules `P1`
- [x] #84 — Add memory peak measurement for parser modules `P2`
- [x] #85 — Add scale curves for key modules `P2`
- [x] #86 — Measure cold start / first-call overhead `P3`
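As a rough illustration of how the memory and scale-curve items could share one harness, here is a sketch that sweeps input sizes and records time and peak allocation per size. It is not the method used in the sub-issues. The helper names (`time_call`, `peak_bytes`, `scale_curve`, `make_json_payload`) are hypothetical, `json.loads` stands in for a zerodep parser, and `tracemalloc` reports peak Python-level allocations rather than the peak RSS mentioned in the gaps table.

```python
import json
import time
import tracemalloc


def time_call(func, payload):
    """Wall-clock seconds for one call, measured without tracemalloc overhead."""
    start = time.perf_counter()
    func(payload)
    return time.perf_counter() - start


def peak_bytes(func, payload):
    """Peak bytes allocated through Python's allocators during one call.

    Note: this is tracemalloc's view, not process RSS.
    """
    tracemalloc.start()
    try:
        func(payload)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak


def make_json_payload(n):
    """Hypothetical generator: a JSON array of n integers."""
    return json.dumps(list(range(n)))


def scale_curve(func, make_payload, sizes):
    """Measure func at each input size; timing and memory use separate runs."""
    rows = []
    for n in sizes:
        payload = make_payload(n)
        rows.append({
            "n": n,
            "seconds": time_call(func, payload),
            "peak_bytes": peak_bytes(func, payload),
        })
    return rows
```

Running `scale_curve(json.loads, make_json_payload, [10**k for k in range(2, 7)])` yields one row per size. Plotting `seconds` and `peak_bytes` against `n` on log-log axes is one way to spot the complexity inflection points that three fixed tiers cannot show. Timing and memory are measured in separate calls because tracing slows execution.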
