ks-xlsx-parser vs hucre
An honest, reproducible head-to-head against hucre —
an excellent zero-dependency TypeScript spreadsheet I/O engine by
@productdevbook. Hucre reads and
writes xlsx/csv/ods, runs in Node/Deno/Bun/browsers/Cloudflare Workers, and
ships in ~18 KB gzipped. It's a different category of tool than
ks-xlsx-parser — they're an I/O engine, we're a semantic extractor — but
since xlsx reading overlaps, it's worth putting both on the same corpus and
publishing what we find. We built the comparison as much to learn from hucre
as to measure ourselves.
- hucre is faster on raw throughput: ~3× at P50 in our fast mode, ~25–100× at P95 on data-heavy files.
- We extract more: formula dependency graph, chart type/series, pivots, RAG chunks with token counts + citation URIs, content hashes. Hucre extracts sparklines and round-trips charts — we don't.
- We agree on every feature both parsers extract to exact parity (tables, merges, CF rules, DV rules, hyperlinks, comments) or near-exact (formulas: 0.05% drift).
- Accuracy is the primary constraint of ks-xlsx-parser: 1631-test pytest suite, cross-validated against calamine, zero regressions required on every perf change.
Pick hucre for edge-runtime / browser / CF-Worker I/O.
Pick ks-xlsx-parser for Python LLM / RAG / auditing pipelines.
This page reflects the v0.1.x benchmark run on a curated stress corpus that shipped with earlier releases. The current head is benchmarked against SpreadsheetBench (5,458 real-world workbooks); see COMPARISON.md.
Same machine, same run, same OS page cache. parse_workbook(mode="fast")
is the apples-to-apples configuration for hucre's read-only path (it skips
LLM-specific chunking + template/tree extraction but still extracts every
metadata feature hucre extracts).
| metric | hucre 0.3.0 | ks-xlsx-parser full | ks-xlsx-parser fast |
|---|---|---|---|
| P50 parse time | 1.3 ms | 5.0 ms | 3.9 ms |
| P95 parse time | 3.5 ms | 368 ms | 206 ms |
| P99 parse time | 30.2 ms | 469 ms | 246 ms |
| mean parse time | 2.7 ms | 73.9 ms | 39.5 ms |
| total wall-clock | 2.8 s | 77.8 s | 41.6 s |
| Worst real-world file (17.6k formulas) | 139 ms | 1413 ms | 686 ms |
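
For readers who want to recompute the percentile rows from their own run, here is a minimal sketch. It assumes the nearest-rank definition of a percentile; the harness may use a different interpolation, and the sample timings are invented for illustration.

```python
import math

def percentile(values, p):
    """Nearest-rank percentile: smallest sample with at least p% of samples <= it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

# Hypothetical per-file parse times in milliseconds (not benchmark data).
timings_ms = [1.1, 1.2, 1.3, 1.4, 1.6, 2.0, 2.4, 3.1, 3.5, 30.0]

p50 = percentile(timings_ms, 50)
p95 = percentile(timings_ms, 95)
mean = sum(timings_ms) / len(timings_ms)
print(f"P50={p50} ms  P95={p95} ms  mean={mean:.2f} ms")
```

A single slow file pulls the mean well above the median here, which is the same shape the timing table shows for both parsers.
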
| mode | P50 ratio | P95 ratio | mean ratio |
|---|---|---|---|
| full | 3.8× slower | 105× slower | 27× slower |
| fast | 3.0× slower | 60× slower | 15× slower |
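
The ratios can be re-derived from the timing table with a few lines of arithmetic. This sketch uses only the rounded figures published above, so small differences from the ratio table are expected.

```python
# Figures copied from the timing table above (milliseconds).
hucre = {"P50": 1.3, "P95": 3.5, "mean": 2.7}
ks = {
    "full": {"P50": 5.0, "P95": 368.0, "mean": 73.9},
    "fast": {"P50": 3.9, "P95": 206.0, "mean": 39.5},
}

def slowdown(mode, metric):
    """How many times slower ks-xlsx-parser is than hucre for one metric."""
    return ks[mode][metric] / hucre[metric]

for mode in ks:
    row = ", ".join(f"{m} {slowdown(mode, m):.1f}x" for m in hucre)
    print(f"{mode}: {row}")
```

Recomputed this way, the fast-mode P95 ratio comes out near 59×, slightly below the published 60×, presumably because the published ratios were taken from unrounded timings.
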
Hucre's per-file speed is genuinely remarkable — hand-rolled SAX parsing of OOXML in TypeScript, zero allocations in the hot loop. If raw read throughput is your bottleneck, use it.
| | hucre | ks-xlsx-parser |
|---|---|---|
| Writes xlsx/csv/ods (round-trip) | ✅ | ❌ read-only |
| CSV / ODS / HTML input | ✅ | ❌ xlsx / xlsm only |
| Sparkline extraction | ✅ | ❌ not modelled |
| Chart round-trip preservation (open → modify → save) | ✅ | ❌ read-only |
| Edge runtime (Cloudflare Workers / Deno / browser) | ✅ | ❌ Python-only |
| Bundle size | ~18 KB, zero deps | ~500 KB incl. deps |
| Streaming row iterator API | ✅ `streamXlsxRows` | ❌ full-workbook parse |
| CSP-compliant, no eval | ✅ | N/A (Python) |
| Raw parse throughput | ✅ 3-100× faster | ❌ |
| | ks-xlsx-parser | hucre |
|---|---|---|
| Formula dependency graph (topological, cycle detection via Tarjan's SCC) | ✅ | ❌ formula stored as string only |
| Chart type + series extraction (7 types: bar, line, pie, scatter, area, radar, bubble) | ✅ | ❌ round-trip preservation only |
| Pivot table structure (cache source, row/col/filter fields, slicer connections) | ✅ | ❌ listed as "No" |
| RAG chunking with configurable token budget | ✅ | ❌ no LLM positioning |
| Source URIs for citations (file.xlsx#Sheet!A1:F18) | ✅ | ❌ |
| Sheet-purpose classification (raw_data / dashboard / calc / …) | ✅ | ❌ |
| KPI ranking by formula connectivity + entity index | ✅ | ❌ |
| Deterministic content hashes (xxhash64 per cell / block / chunk) | ✅ | ❌ |
| Adversarial-corpus robustness | ✅ 1053/1053 parsed | |
| Stress corpus (1053 workbooks checked into repo + CI round-trip) | ✅ | ❌ |
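
To illustrate what cycle detection via Tarjan's SCC means for a formula dependency graph, here is a small self-contained sketch. It is not the ks-xlsx-parser implementation: the graph shape (a dict from a cell to the cells its formula reads), the function names, and the sample cells are all assumptions made for the example.

```python
def strongly_connected_components(graph):
    """Tarjan's algorithm. graph maps a cell to the cells its formula reads."""
    index = {}      # discovery order of each visited cell
    lowlink = {}    # lowest discovery index reachable from the cell
    stack, on_stack = [], set()
    components = []

    def visit(cell):
        index[cell] = lowlink[cell] = len(index)
        stack.append(cell)
        on_stack.add(cell)
        for dep in graph.get(cell, ()):
            if dep not in index:
                visit(dep)
                lowlink[cell] = min(lowlink[cell], lowlink[dep])
            elif dep in on_stack:
                lowlink[cell] = min(lowlink[cell], index[dep])
        if lowlink[cell] == index[cell]:
            # cell is the root of a component: pop the component off the stack
            component = []
            while True:
                member = stack.pop()
                on_stack.discard(member)
                component.append(member)
                if member == cell:
                    break
            components.append(component)

    for cell in graph:
        if cell not in index:
            visit(cell)
    return components

def circular_references(graph):
    """Components with more than one cell, or a single cell that reads itself."""
    return [
        sorted(c) for c in strongly_connected_components(graph)
        if len(c) > 1 or c[0] in graph.get(c[0], ())
    ]

# Hypothetical workbook: C1 = A1 + B1, and D1 -> E1 -> D1 is a cycle.
deps = {
    "Sheet1!C1": ["Sheet1!A1", "Sheet1!B1"],
    "Sheet1!D1": ["Sheet1!E1"],
    "Sheet1!E1": ["Sheet1!D1"],
}
print(circular_references(deps))
```

Tarjan's algorithm emits components in reverse topological order, so with edges pointing from a formula to its inputs the output list doubles as a valid calculation order. The recursive form is kept for readability; very long dependency chains would need an iterative version to stay within Python's recursion limit.
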
On every feature both parsers extract, the drift is zero or near-zero:
| feature | hucre | ks-xlsx-parser | drift |
|---|---|---|---|
| formulas | 46,411 | 46,433 | 0.05% |
| tables | 523 | 523 | 0 |
| merges | 10,488 | 10,488 | 0 |
| conditional-format rules | 70 | 70 | 0 |
| data validations | 503 | 503 | 0 |
| hyperlinks | 511 | 511 | 0 |
| comments | 486 | 486 | 0 |
| named ranges | 822 | 809 | 1.6% (tracked) |
The 22-formula disagreement is dominated by one real-world workbook where we parse 16 formulas that hucre misses — we surface this in the drift report, not hide it.
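
The drift column can be reproduced from the two count columns. The sketch below assumes drift is the absolute difference divided by the larger count; the page does not state the denominator, though either count gives the same rounded figures here.

```python
def drift_percent(count_a, count_b):
    """Absolute disagreement as a percentage of the larger count."""
    larger = max(count_a, count_b)
    if larger == 0:
        return 0.0
    return abs(count_a - count_b) / larger * 100

# (hucre, ks-xlsx-parser) counts copied from the table above.
counts = {
    "formulas": (46_411, 46_433),
    "tables": (523, 523),
    "named ranges": (822, 809),
}
for feature, (hucre_n, ks_n) in counts.items():
    print(f"{feature}: {drift_percent(hucre_n, ks_n):.2f}%")
```
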
The cell-count difference on adversarial merge-heavy files (we emit ~50%
more rows) is a methodology difference: ks-xlsx-parser counts every
addressable cell in a merged region; hucre counts the master cell only.
Both are defensible, and the difference is documented in the drift report
generated by the benchmark harness.
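
The two counting conventions are easy to see on a few merged ranges. This sketch is illustrative only: the helper names and the sample ranges are hypothetical, and it handles plain `A1:C2`-style references without sheet names or `$` anchors.

```python
import re

_CELL = re.compile(r"^([A-Z]+)([0-9]+)$")

def _column_number(letters):
    """Convert a column label such as 'A' or 'AB' to a 1-based number."""
    number = 0
    for ch in letters:
        number = number * 26 + (ord(ch) - ord("A") + 1)
    return number

def merged_cell_counts(ref):
    """Return (addressable_cells, master_cells) for a merge such as 'A1:C2'."""
    start, end = ref.split(":")
    (c1, r1), (c2, r2) = (_CELL.match(x).groups() for x in (start, end))
    width = _column_number(c2) - _column_number(c1) + 1
    height = int(r2) - int(r1) + 1
    return width * height, 1

# Hypothetical merged regions on one sheet.
merges = ["A1:C2", "E5:E8", "AA1:AB1"]
every_cell = sum(merged_cell_counts(m)[0] for m in merges)
master_only = sum(merged_cell_counts(m)[1] for m in merges)
print(every_cell, master_only)
```

Counting every addressable cell grows with the area of each merge, while counting master cells grows only with the number of merges, which is why the gap widens on merge-heavy files.
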
Every perf change in ks-xlsx-parser has to pass, in order:
- The 1631-test pytest suite (unit + integration + corpus-slice)
- Cross-validation against calamine, the Rust reference parser, on a golden fixture set
- Zero regressions on the SpreadsheetBench robustness baseline (5,458 real-world workbooks)
- Feature-count stability vs. the hucre benchmark above
That's the order. If a perf change breaks any gate, we don't ship it. Every number on this page came from a run that passed all four gates.
If you're building RAG / agent / auditing pipelines where a silently dropped formula or a misread merge is a user-visible bug, that order matters. If you're shipping an I/O library for edge runtimes, use hucre — it's the right tool.
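
The gating rule amounts to running the checks in a fixed order and stopping at the first failure. A minimal sketch, with placeholder checks standing in for the real commands (the function name and the lambdas are hypothetical):

```python
def run_gates(gates):
    """Run (name, check) pairs in order; stop at the first failing gate."""
    for name, check in gates:
        if not check():
            return f"blocked at: {name}"
    return "ok to ship"

# Placeholder checks; a real pipeline would invoke pytest, the calamine
# cross-validation, the corpus run, and the feature-count comparison.
gates = [
    ("pytest suite", lambda: True),
    ("calamine cross-validation", lambda: True),
    ("SpreadsheetBench robustness baseline", lambda: False),
    ("feature-count stability vs. hucre", lambda: True),
]
print(run_gates(gates))
```
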
The benchmark harness lives at tests/benchmarks/.
Full details in tests/benchmarks/README
but the short version:
```bash
# From the repo root, in the ks-xlsx-parser venv
cd tests/benchmarks/hucre_node && pnpm install --frozen-lockfile
cd ../../..

# Download SpreadsheetBench once
make corpus-download

# Full mode (default)
python -m tests.benchmarks.vs_hucre \
  --corpus data/corpora/spreadsheetbench --out tests/benchmarks/reports

# Fast mode
KS_PARSE_MODE=fast python -m tests.benchmarks.vs_hucre \
  --corpus data/corpora/spreadsheetbench --out tests/benchmarks/reports
```

Outputs (under `tests/benchmarks/reports/<timestamp>_<git-sha>/`):
- `results.csv`: one row per `(file, parser)` pair
- `raw.ndjson`: full per-row records (nullable fields preserved)
- `failures.jsonl`: status != ok rows
- `summary.md`: aggregate counts, capability matrix, perf percentiles
- `drift.md`: per-feature disagreement between parsers
- `manifest.json`: run metadata (git sha, node / python versions, host, timestamp, CLI args)
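
Because nullable fields are preserved in the raw records, a consumer can tell "counted zero" apart from "not modelled". The sketch below shows one way to aggregate with that distinction; the record fields and parser labels are hypothetical and not the harness schema.

```python
import json

# Hypothetical NDJSON records; field names are illustrative only.
sample = (
    '{"file": "a.xlsx", "parser": "hucre", "sparklines": 4, "pivots": null}\n'
    '{"file": "a.xlsx", "parser": "ks", "sparklines": null, "pivots": 2}\n'
    '{"file": "b.xlsx", "parser": "ks", "sparklines": null, "pivots": 0}\n'
)

def feature_total(lines, parser, feature):
    """Sum a feature for one parser; None means the parser never reports it."""
    seen = []
    for line in lines:
        record = json.loads(line)
        if record["parser"] == parser and record.get(feature) is not None:
            seen.append(record[feature])
    return sum(seen) if seen else None

rows = sample.splitlines()
print(feature_total(rows, "ks", "pivots"))      # counted, even where a file has zero
print(feature_total(rows, "ks", "sparklines"))  # never reported, so None
```
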
The harness:
- Pins hucre exact (`0.3.0`, `--frozen-lockfile`) so numbers are reproducible
- Randomises `(file, parser)` ordering per seed to kill OS-page-cache bias
- Each parser times itself in-process; the Python driver doesn't measure the other
- Per-file 60s timeout, 4 GB memory ceiling, worker respawn per 50-file batch
- Uses `null` (not `0`) for features a parser doesn't model; the summary generator distinguishes them
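
Seeded randomisation of the run order could look roughly like the following. This is a sketch of the idea rather than the harness code: the function name, file names, and seed value are assumptions.

```python
import itertools
import random

def run_order(files, parsers, seed):
    """Every (file, parser) pair, shuffled reproducibly for a given seed."""
    pairs = list(itertools.product(sorted(files), parsers))
    random.Random(seed).shuffle(pairs)
    return pairs

files = ["budget.xlsx", "forecast.xlsx", "inventory.xlsx"]  # hypothetical names
parsers = ["hucre", "ks-xlsx-parser"]
order = run_order(files, parsers, seed=42)
for path, parser in order:
    print(parser, path)
```

Using a dedicated `random.Random(seed)` instance keeps the order reproducible for a given seed without touching the global random state, and interleaving the parsers means neither one consistently benefits from a file the other has just pulled into the page cache.
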
This comparison wouldn't exist without hucre
and its author @productdevbook.
Their work on a zero-dep TypeScript parser pushed us to actually measure
our perf floor and invest in the Rust fast-path, the Tarjan's SCC swap,
and parse_workbook(mode='fast').
If you need a fast, tiny, edge-runtime xlsx / csv / ods library with write support — that's them, not us.