ks-xlsx-parser vs hucre

An honest, reproducible head-to-head against hucre — an excellent zero-dependency TypeScript spreadsheet I/O engine by @productdevbook. Hucre reads and writes xlsx/csv/ods, runs in Node/Deno/Bun/browsers/Cloudflare Workers, and ships in ~18 KB gzipped. It's a different category of tool than ks-xlsx-parser — they're an I/O engine, we're a semantic extractor — but since xlsx reading overlaps, it's worth putting both on the same corpus and publishing what we find. We built the comparison as much to learn from hucre as to measure ourselves.


TL;DR

  • hucre is faster on raw throughput: ~3× at P50 in our fast mode, and ~60× (fast) to ~105× (full) at P95, where data-heavy files dominate.
  • We extract more: formula dependency graph, chart type/series, pivots, RAG chunks with token counts + citation URIs, content hashes. Hucre extracts sparklines and round-trips charts — we don't.
  • On the features both parsers extract, counts match exactly (tables, merges, CF rules, DV rules, hyperlinks, comments) or nearly so (formulas: 0.05% drift; named ranges: 1.6% drift, tracked).
  • Accuracy is the primary constraint of ks-xlsx-parser: 1631-test pytest suite, cross-validated against calamine, zero regressions required on every perf change.

Pick hucre for edge-runtime / browser / CF-Worker I/O. Pick ks-xlsx-parser for Python LLM / RAG / auditing pipelines.


Performance — historical 1053-workbook curated corpus

This page reflects the v0.1.x benchmark run on a curated stress corpus that shipped with earlier releases. Benchmarks at the current head run on SpreadsheetBench (5,458 real-world workbooks); see COMPARISON.md.

Same machine, same run, same OS page cache. parse_workbook(mode="fast") is the apples-to-apples configuration for hucre's read-only path (it skips LLM-specific chunking + template/tree extraction but still extracts every metadata feature hucre extracts).

| metric | hucre 0.3.0 | ks-xlsx-parser full | ks-xlsx-parser fast |
| --- | --- | --- | --- |
| P50 parse time | 1.3 ms | 5.0 ms | 3.9 ms |
| P95 parse time | 3.5 ms | 368 ms | 206 ms |
| P99 parse time | 30.2 ms | 469 ms | 246 ms |
| mean parse time | 2.7 ms | 73.9 ms | 39.5 ms |
| total wall-clock | 2.8 s | 77.8 s | 41.6 s |
| Worst real-world file (17.6k formulas) | 139 ms | 1413 ms | 686 ms |

Ratio to hucre

| mode | P50 ratio | P95 ratio | mean ratio |
| --- | --- | --- | --- |
| full | 3.8× slower | 105× slower | 27× slower |
| fast | 3.0× slower | 60× slower | 15× slower |
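The ratios follow directly from the timing table. A minimal sketch of the arithmetic, using only the published numbers; the dictionary layout and function name are illustrative and not part of the harness:

```python
# Timings in milliseconds, copied from the performance table on this page.
timings_ms = {
    "hucre": {"p50": 1.3, "p95": 3.5, "mean": 2.7},
    "full": {"p50": 5.0, "p95": 368.0, "mean": 73.9},
    "fast": {"p50": 3.9, "p95": 206.0, "mean": 39.5},
}


def ratio_to_hucre(mode: str, metric: str) -> float:
    """Return how many times slower `mode` is than hucre on `metric`."""
    return timings_ms[mode][metric] / timings_ms["hucre"][metric]


ratios = {
    mode: {metric: ratio_to_hucre(mode, metric) for metric in ("p50", "p95", "mean")}
    for mode in ("full", "fast")
}

for mode, by_metric in ratios.items():
    print(mode, {metric: round(value, 1) for metric, value in by_metric.items()})
```

The published ratios are rounded versions of these quotients; the fast-mode P95 ratio, for instance, comes out slightly under 60×.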

Hucre's per-file speed is genuinely remarkable — hand-rolled SAX parsing of OOXML in TypeScript, zero allocations in the hot loop. If raw read throughput is your bottleneck, use it.


Where hucre wins

| hucre | ks-xlsx-parser |
| --- | --- |
| Writes xlsx/csv/ods (round-trip) | ❌ read-only |
| CSV / ODS / HTML input | ❌ xlsx / xlsm only |
| Sparkline extraction | ❌ not modelled |
| Chart round-trip preservation (open → modify → save) | ❌ read-only |
| Edge runtime (Cloudflare Workers / Deno / browser) | ❌ Python-only |
| Bundle size ~18 KB, zero deps | ~500 KB incl. deps |
| Streaming row iterator API streamXlsxRows | ❌ full-workbook parse |
| CSP-compliant, no eval | N/A (Python) |
| Raw parse throughput ✅ 3-100× faster | |

Where ks-xlsx-parser wins

| ks-xlsx-parser | hucre |
| --- | --- |
| Formula dependency graph (topological, cycle detection via Tarjan's SCC) | ❌ formula stored as string only |
| Chart type + series extraction (7 types: bar, line, pie, scatter, area, radar, bubble) | ❌ round-trip preservation only |
| Pivot table structure (cache source, row/col/filter fields, slicer connections) | ❌ listed as "No" |
| RAG chunking with configurable token budget | ❌ no LLM positioning |
| Source URIs for citations (file.xlsx#Sheet!A1:F18) | |
| Sheet-purpose classification (raw_data / dashboard / calc / …) | |
| KPI ranking by formula connectivity + entity index | |
| Deterministic content hashes (xxhash64 per cell / block / chunk) | |
| Adversarial-corpus robustness ✅ 1053/1053 parsed | ⚠️ 2 timeouts on pathological address-space files |
| Stress corpus (1053 workbooks checked into repo + CI round-trip) | |
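
To illustrate what cycle detection over a formula dependency graph involves, here is a small sketch of Tarjan's strongly connected components algorithm. It is not ks-xlsx-parser's implementation: the graph shape (a cell address mapped to the addresses its formula references) and the function names are assumptions made for this example.

```python
from typing import Dict, List


def tarjan_scc(graph: Dict[str, List[str]]) -> List[List[str]]:
    """Return the strongly connected components of a directed graph.

    With edges pointing from a cell to the cells it references, components
    come out with dependencies before their dependents.
    """
    index_of: Dict[str, int] = {}
    lowlink: Dict[str, int] = {}
    stack: List[str] = []
    on_stack = set()
    components: List[List[str]] = []
    counter = 0

    def visit(node: str) -> None:
        nonlocal counter
        index_of[node] = lowlink[node] = counter
        counter += 1
        stack.append(node)
        on_stack.add(node)
        for dep in graph.get(node, []):
            if dep not in index_of:
                visit(dep)
                lowlink[node] = min(lowlink[node], lowlink[dep])
            elif dep in on_stack:
                lowlink[node] = min(lowlink[node], index_of[dep])
        if lowlink[node] == index_of[node]:
            # `node` is the root of a component: pop its members off the stack.
            component = []
            while True:
                member = stack.pop()
                on_stack.discard(member)
                component.append(member)
                if member == node:
                    break
            components.append(component)

    for node in graph:
        if node not in index_of:
            visit(node)
    return components


def circular_groups(graph: Dict[str, List[str]]) -> List[List[str]]:
    """Components that form a cycle: more than one cell, or a self-reference."""
    return [
        sorted(component)
        for component in tarjan_scc(graph)
        if len(component) > 1 or component[0] in graph.get(component[0], [])
    ]


# Hypothetical sheet: C1 and D1 reference each other.
formulas = {
    "A1": [],            # constant
    "B1": ["A1"],        # =A1*2
    "C1": ["B1", "D1"],  # =B1+D1
    "D1": ["C1"],        # =C1-1
}
print(circular_groups(formulas))
```

The recursive form keeps the sketch short; on workbooks with very long dependency chains an iterative variant would be needed to stay within Python's recursion limit.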

Extraction-count agreement (1053 workbooks)

On every feature both parsers extract, the drift is zero or near-zero:

| feature | hucre | ks-xlsx-parser | drift |
| --- | --- | --- | --- |
| formulas | 46,411 | 46,433 | 0.05% |
| tables | 523 | 523 | 0 |
| merges | 10,488 | 10,488 | 0 |
| conditional-format rules | 70 | 70 | 0 |
| data validations | 503 | 503 | 0 |
| hyperlinks | 511 | 511 | 0 |
| comments | 486 | 486 | 0 |
| named ranges | 822 | 809 | 1.6% (tracked) |
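
The drift column can be reproduced from the counts. This page does not state the harness's exact formula, so the sketch below assumes drift is the absolute difference divided by the larger of the two counts; the function name is illustrative.

```python
# (hucre count, ks-xlsx-parser count), copied from the agreement table.
counts = {
    "formulas": (46411, 46433),
    "tables": (523, 523),
    "merges": (10488, 10488),
    "conditional-format rules": (70, 70),
    "data validations": (503, 503),
    "hyperlinks": (511, 511),
    "comments": (486, 486),
    "named ranges": (822, 809),
}


def drift_pct(a: int, b: int) -> float:
    """Absolute difference as a percentage of the larger count (assumed definition)."""
    largest = max(a, b)
    return 0.0 if largest == 0 else abs(a - b) / largest * 100


for feature, (hucre_n, ks_n) in counts.items():
    print(f"{feature:26s} {drift_pct(hucre_n, ks_n):.2f}%")
```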

The 22-formula disagreement is dominated by one real-world workbook where we parse 16 formulas that hucre misses — we surface this in the drift report, not hide it.

The cell-count difference on adversarial merge-heavy files (we emit ~50% more rows) is a methodology difference: ks-xlsx-parser counts every addressable cell in a merged region; hucre counts the master cell only. Both are defensible, and the difference is documented in the drift report generated by the benchmark harness.
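
The two counting conventions can be shown on a single merged range. This is a minimal sketch with a hand-rolled A1-range parser; the helper names are hypothetical and belong to neither library.

```python
import re
from typing import Tuple


def column_index(letters: str) -> int:
    """Convert a column label such as 'A' or 'AB' to a 1-based index."""
    index = 0
    for ch in letters.upper():
        index = index * 26 + (ord(ch) - ord("A") + 1)
    return index


def merged_cell_counts(a1_range: str) -> Tuple[int, int]:
    """Return (addressable_cells, master_cells) for a merged range like 'A1:C2'."""
    match = re.fullmatch(r"([A-Za-z]+)(\d+):([A-Za-z]+)(\d+)", a1_range)
    if match is None:
        raise ValueError(f"not an A1 range: {a1_range!r}")
    col1, row1, col2, row2 = match.groups()
    width = abs(column_index(col2) - column_index(col1)) + 1
    height = abs(int(row2) - int(row1)) + 1
    # Every-cell convention counts width * height; master-only counts one.
    return width * height, 1


addressable, master = merged_cell_counts("A1:C2")
print(addressable, master)
```

A workbook dominated by large merged regions therefore yields very different totals under the two conventions even though both parsers see the same merges.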


Our accuracy commitment

Every perf change in ks-xlsx-parser has to pass, in order:

  1. The 1631-test pytest suite (unit + integration + corpus-slice)
  2. Cross-validation against calamine — the Rust reference parser — on a golden fixture set
  3. Zero regressions on the SpreadsheetBench robustness baseline (5,458 real-world workbooks)
  4. Feature-count stability vs. the hucre benchmark above

That's the order. If a perf change breaks any gate, we don't ship it. Every number on this page came from a run that passed all four gates.
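
The gating order can be pictured as a short-circuiting pipeline. The gate names below mirror the list above, but the check functions are placeholders, not the project's actual CI scripts.

```python
from typing import Callable, List, Optional, Tuple

Gate = Tuple[str, Callable[[], bool]]


def first_failing_gate(gates: List[Gate]) -> Optional[str]:
    """Run gates in order; return the name of the first that fails, else None."""
    for name, check in gates:
        if not check():
            return name  # later gates are not run
    return None


# Placeholder checks standing in for the real suites (hypothetical results).
gates: List[Gate] = [
    ("pytest suite", lambda: True),
    ("calamine cross-validation", lambda: True),
    ("SpreadsheetBench robustness baseline", lambda: False),
    ("hucre feature-count stability", lambda: True),
]

blocked_by = first_failing_gate(gates)
print("ship" if blocked_by is None else f"blocked by: {blocked_by}")
```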

If you're building RAG / agent / auditing pipelines where a silently dropped formula or a misread merge is a user-visible bug, that order matters. If you're shipping an I/O library for edge runtimes, use hucre — it's the right tool.


Reproducing these numbers

The benchmark harness lives at tests/benchmarks/. Full details are in tests/benchmarks/README; the short version:

# From the repo root, in the ks-xlsx-parser venv
cd tests/benchmarks/hucre_node && pnpm install --frozen-lockfile
cd ../../..

# Download SpreadsheetBench once
make corpus-download

# Full mode (default)
python -m tests.benchmarks.vs_hucre \
    --corpus data/corpora/spreadsheetbench --out tests/benchmarks/reports

# Fast mode
KS_PARSE_MODE=fast python -m tests.benchmarks.vs_hucre \
    --corpus data/corpora/spreadsheetbench --out tests/benchmarks/reports

Outputs (under tests/benchmarks/reports/<timestamp>_<git-sha>/):

  • results.csv — one row per (file, parser) pair
  • raw.ndjson — full per-row records (nullable fields preserved)
  • failures.jsonl — status != ok rows
  • summary.md — aggregate counts, capability matrix, perf percentiles
  • drift.md — per-feature disagreement between parsers
  • manifest.json — run metadata (git sha, node / python versions, host, timestamp, CLI args)

The harness:

  • Pins hucre to an exact version (0.3.0, --frozen-lockfile) so numbers are reproducible
  • Randomises (file, parser) ordering per seed to kill OS-page-cache bias
  • Has each parser time itself in-process; the Python driver does not time the other parser
  • Enforces a per-file 60 s timeout and a 4 GB memory ceiling, and respawns the worker after every 50-file batch
  • Uses null (not 0) for features a parser doesn't model, so the summary generator can tell "not modelled" apart from "none found"
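
Seeded randomisation of the (file, parser) order might look like the sketch below. The function name, parser labels, and seed handling are assumptions for illustration; the actual harness may do this differently.

```python
import random
from itertools import product
from typing import List, Sequence, Tuple


def shuffled_jobs(
    files: Sequence[str], parsers: Sequence[str], seed: int
) -> List[Tuple[str, str]]:
    """Return every (file, parser) pair in an order determined only by `seed`."""
    jobs = list(product(files, parsers))
    # A private Random instance keeps the shuffle independent of global state.
    random.Random(seed).shuffle(jobs)
    return jobs


files = ["a.xlsx", "b.xlsx", "c.xlsx"]   # hypothetical corpus
parsers = ["hucre", "ks-xlsx-parser"]

for path, parser in shuffled_jobs(files, parsers, seed=42):
    print(parser, path)
```

Interleaving the pairs means neither parser systematically benefits from the other having just warmed the page cache for a file, while a fixed seed keeps a given run repeatable.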

Credit

This comparison wouldn't exist without hucre and its author @productdevbook. Their work on a zero-dep TypeScript parser pushed us to actually measure our perf floor and invest in the Rust fast-path, the Tarjan's SCC swap, and parse_workbook(mode='fast').

If you need a fast, tiny, edge-runtime xlsx / csv / ods library with write support — that's them, not us.