Benchmark suite, paper draft, and recommendation tool for serialization choices in human-AI systems.
formattax.dev | Paper draft | Recommendation policy | Benchmarks | Schemas
Serialization is not just an implementation detail.
The same system may cross several boundaries with incompatible needs:
- model-facing input
- model-facing output
- progressive UI streaming
- human-maintained config
- inter-service transport
- storage
Using one format for all of them imposes a Format Tax: unnecessary structural tokens, weaker streaming behavior, poorer ergonomics, or a less reliable output path than the stage actually requires.
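As a rough illustration of the structural-token component, the sketch below serializes the same rows as pretty-printed JSON and as CSV and compares character counts. This is only a proxy; a real measurement would run a model tokenizer over both encodings.

```ts
// Rough proxy for structural overhead: the same records as pretty JSON
// vs. CSV. Character counts stand in for tokens here; the benchmark
// tracks would use an actual tokenizer.
const rows = [
  { id: 1, name: "ada", role: "admin" },
  { id: 2, name: "lin", role: "viewer" },
];

const asJson = JSON.stringify(rows, null, 2);
const asCsv = ["id,name,role", ...rows.map((r) => `${r.id},${r.name},${r.role}`)].join("\n");

// JSON repeats keys, quotes, and braces per row; CSV pays once in the header.
console.log({ jsonChars: asJson.length, csvChars: asCsv.length });
```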
This repository contains three related artifacts:
- A paper draft: an academic-facing document that stays conservative about evidence and separates local results from external claims.
- A benchmark suite: canonical datasets, question sets, encoders, and runner tracks for measuring format behavior.
- A homepage and recommendation engine: a practitioner-facing tool that returns a primary format, fallback, confidence, and schema/parser pairing.
Every claim in the project should fit one of these buckets (sketched as a type after the list):
- Locally measured — produced by this repository’s benchmark code
- Externally reported — from papers, official project benchmarks, or official docs
- Open hypothesis — worth testing, not yet proven here
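One way to keep those buckets honest in code is a discriminated union. The field names below are illustrative, not the repository's actual types:

```ts
// Illustrative provenance tagging for claims; names are assumptions,
// not types taken from this repository.
type ClaimBasis =
  | { basis: "locally-measured"; runId: string }     // produced by the benchmark code
  | { basis: "externally-reported"; source: string } // paper, official benchmark, or docs
  | { basis: "open-hypothesis"; note?: string };     // worth testing, not yet proven here
```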
This matters because the public serialization discourse is noisy:
- some results are academic
- some are official format-project benchmarks
- some are independent reproductions
- some are placeholder numbers used for site development
The project is being rewritten to keep those apart.
Implemented or in-progress runner tracks (an illustrative flag mapping follows the list):
- Token efficiency
- Retrieval accuracy
- Streaming readiness
- Generation quality
- Schema guidance
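For orientation, here is a hypothetical mapping from track names to the runner's `--track` flag. Only the track names come from this README; the numbering is assumed, not read from the runner source:

```ts
// Hypothetical track numbering for the --track flag used in the usage
// section below; the numbers are assumptions for illustration.
const TRACKS = {
  1: "token-efficiency",
  2: "retrieval-accuracy",
  3: "streaming-readiness",
  4: "generation-quality",
  5: "schema-guidance",
} as const;

type TrackId = keyof typeof TRACKS; // e.g. bun run src/index.ts --track=1
```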
Planned but not yet implemented as a local runner track:
- Binary throughput / transport benchmarking
The homepage is intentionally more operational than the paper.
It answers:
- what boundary am I at?
- what shape is the data?
- what is the main optimization target?
- what hard constraints apply?
It then returns (see the typed sketch after this list):
- a primary recommendation
- a fallback
- a confidence level
- an evidence basis
- a schema pairing
- a parser pairing, if relevant
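A minimal sketch of that question/answer contract as types, assuming names that are not the engine's actual API:

```ts
// Assumed shapes for the recommendation contract; every field and type
// name here is illustrative only.
type Boundary =
  | "model-input"
  | "model-output"
  | "ui-streaming"
  | "human-config"
  | "inter-service"
  | "storage";

interface RecommendationRequest {
  boundary: Boundary;
  dataShape: "flat" | "nested" | "tabular" | "document";
  optimizeFor: "tokens" | "streaming" | "ergonomics" | "reliability";
  hardConstraints?: string[];
}

interface Recommendation {
  primary: string;        // e.g. a format name
  fallback: string;
  confidence: "low" | "medium" | "high";
  evidenceBasis: "locally-measured" | "externally-reported" | "open-hypothesis";
  schemaPairing: string;
  parserPairing?: string; // only when relevant
}
```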
The homepage is allowed to be decisive. The paper is not allowed to smuggle those decisions in as settled research results.
Validators and schemas are treated as a separate layer.
Current project stance:
- validator libraries are not serialization formats
- the strongest generation benefit usually comes from schema strategy, especially JSON Schema-oriented workflows (a minimal export sketch follows this list)
- library choice mostly matters through:
  - JSON Schema export quality
  - runtime ergonomics
  - ecosystem fit
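To make the schema-strategy-over-library-choice point concrete, here is a minimal sketch assuming Zod plus the zod-to-json-schema package; the repository does not mandate either:

```ts
// Minimal sketch, assuming Zod and zod-to-json-schema; neither is
// prescribed by this project. The validator library authors the schema,
// but the exported JSON Schema is what guides generation.
import { z } from "zod";
import { zodToJsonSchema } from "zod-to-json-schema";

const Event = z.object({
  id: z.string(),
  kind: z.enum(["click", "view"]),
  ts: z.number().int(),
});

// The exported JSON Schema is what a structured-output or
// constrained-decoding API consumes; runtime validation with
// Event.parse() is a separate concern.
const eventJsonSchema = zodToJsonSchema(Event, "Event");
console.log(JSON.stringify(eventJsonSchema, null, 2));
```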
Some benchmark content under `site/src/content/benchmarks/` and format metadata still exists primarily to support site UI development. Those placeholders should not be read as final reproduced benchmark results unless the corresponding run metadata is populated and `dummy` is false.
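A hypothetical guard matching that rule; the field names (`dummy`, `runMeta`) are assumptions about the content frontmatter, not verified against the repo:

```ts
// Hypothetical filter for site content entries; "dummy" and "runMeta"
// are assumed field names, not a confirmed repository schema.
interface BenchmarkEntry {
  dummy: boolean;
  runMeta?: { runId: string; date: string };
}

const isMeasured = (entry: BenchmarkEntry): boolean =>
  entry.dummy === false && entry.runMeta !== undefined;
```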
Repository layout:

```
the-format-tax/
├── README.md
├── docs/
│   ├── paper.md
│   ├── decision-tree.md
│   ├── benchmarks.md
│   ├── schemas.md
│   └── superpowers/
├── benchmarks/
│   ├── datasets/
│   ├── questions/
│   ├── references/
│   └── runner/
├── paper/
└── site/
```
To run the site locally:

```sh
cd site
bun install
bun run dev
```

To run a benchmark track:

```sh
cd benchmarks/runner
bun run src/index.ts --track=1
```

The runner currently expects Bun.
Useful contributions:
- independent reproductions of public format claims
- better harness normalization
- schema-guidance experiments
- binary transport benchmark implementation
- clearer separation between dummy site content and measured benchmark output
MIT for code. Documentation and paper material follow the project’s existing documentation license conventions.