First: welcome. 👋 If you got here and aren't sure what to do:
- Jump into our Discord — real-time help, roadmap chat, and the fastest way to pair on an idea with a maintainer.
- Or open a Discussion if async is your thing.
We'd rather talk than have you leave. Every good-first-issue, every weird `.xlsx` fixture, every three-line doc patch is welcome.
This project only moves forward because people take 20 minutes to file a good bug or send a small PR. If that's you, thank you.
- Run `make bench-robust` on SpreadsheetBench and report a file that breaks. We actively want edge-case `.xlsx` fixtures; use the Parser edge case issue template.
- Submit an adversarial workbook. Attach a `.xlsx` (or a generator that builds one) to a Parser edge case issue. If the parser crashes on it, even better.
- Fix one of the flagged issues in `docs/PARSER_KNOWN_ISSUES.md`.
- Improve docs. The README, the architecture diagram, the examples: if something confused you, it confuses everyone.
- Open a Show & Tell if you shipped something with the parser. Seriously, it helps us prioritise.
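
If you would rather attach a generator than a binary file, the following is a minimal sketch of one. It uses only the Python standard library to write a small but well-formed workbook containing two edge cases: a cell at the last addressable grid position (`XFD1048576`) and a merged range. The function name `build_edge_case_workbook`, the sheet name, and the chosen edge cases are illustrative assumptions, not part of this project's API or fixture suite.

```python
import zipfile
from pathlib import Path

NS_PKG = "http://schemas.openxmlformats.org/package/2006"
NS_DOC = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"
NS_MAIN = "http://schemas.openxmlformats.org/spreadsheetml/2006/main"
CT = "application/vnd.openxmlformats-officedocument.spreadsheetml"
XML_DECL = '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>\n'

# The five parts a minimal .xlsx package needs.
PARTS = {
    "[Content_Types].xml": f"""<Types xmlns="{NS_PKG}/content-types">
  <Default Extension="rels" ContentType="application/vnd.openxmlformats-package.relationships+xml"/>
  <Default Extension="xml" ContentType="application/xml"/>
  <Override PartName="/xl/workbook.xml" ContentType="{CT}.sheet.main+xml"/>
  <Override PartName="/xl/worksheets/sheet1.xml" ContentType="{CT}.worksheet+xml"/>
</Types>""",
    "_rels/.rels": f"""<Relationships xmlns="{NS_PKG}/relationships">
  <Relationship Id="rId1" Type="{NS_DOC}/officeDocument" Target="xl/workbook.xml"/>
</Relationships>""",
    "xl/workbook.xml": f"""<workbook xmlns="{NS_MAIN}" xmlns:r="{NS_DOC}">
  <sheets><sheet name="Edge" sheetId="1" r:id="rId1"/></sheets>
</workbook>""",
    "xl/_rels/workbook.xml.rels": f"""<Relationships xmlns="{NS_PKG}/relationships">
  <Relationship Id="rId1" Type="{NS_DOC}/worksheet" Target="worksheets/sheet1.xml"/>
</Relationships>""",
    # Edge cases (illustrative): a value in the very last cell of the grid,
    # plus a merged range anchored at A1.
    "xl/worksheets/sheet1.xml": f"""<worksheet xmlns="{NS_MAIN}">
  <sheetData>
    <row r="1"><c r="A1" t="inlineStr"><is><t>header</t></is></c></row>
    <row r="1048576"><c r="XFD1048576"><v>1</v></c></row>
  </sheetData>
  <mergeCells count="1"><mergeCell ref="A1:C3"/></mergeCells>
</worksheet>""",
}


def build_edge_case_workbook(path):
    """Write the workbook to `path` and return it as a Path (hypothetical helper)."""
    with zipfile.ZipFile(path, "w", zipfile.ZIP_DEFLATED) as zf:
        for name, body in PARTS.items():
            zf.writestr(name, XML_DECL + body)
    return Path(path)


if __name__ == "__main__":
    print(build_edge_case_workbook("edge_case.xlsx"))
```

A generator like this is easy to review in an issue, and maintainers can tweak a single cell reference to probe neighbouring cases.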
git clone https://github.com/knowledgestack/ks-xlsx-parser.git
cd ks-xlsx-parser
make install # pip install -e ".[dev,api]"
make test # fast, default suite
make corpus-download # fetch SpreadsheetBench (5,458 real-world xlsx)
make bench-robust # parse-success + structural counts vs Docling
make bench-retrieval # retrieval recall@k vs Docling

Prerequisites: Python 3.10+, `pip`, optionally `make`. We use `ruff` for linting/formatting; install it with the `[dev]` extra.
Your PR should:
- Have tests. `pytest` must stay green: `make test`.
- If touching parser or chunker internals, run `make bench-robust` against SpreadsheetBench and call out any regressions in the PR description.
- Pass `ruff check` (`make lint`) and be formatted with `make format`.
- Include one sentence in the PR description that starts with "This change…".
- Use conventional-commit style commit messages: `feat:`, `fix:`, `perf:`, `refactor:`, `docs:`, `test:`, `chore:`.
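
To check a commit subject against the prefixes listed above before pushing, something like the sketch below may help. It is not a tool shipped with this repository. The helper name `is_conventional` is hypothetical, and the optional `(scope)` and `!` markers come from the general Conventional Commits convention rather than from this guide, so treat them as assumptions.

```python
import re

# Prefixes taken from the list in this guide.
ALLOWED_TYPES = ("feat", "fix", "perf", "refactor", "docs", "test", "chore")

# type, optional "(scope)", optional "!", then ": " and a non-empty summary.
SUBJECT_RE = re.compile(r"^(?P<type>[a-z]+)(\([^)]+\))?!?: \S.*$")


def is_conventional(subject: str) -> bool:
    """Return True if the first line of a commit message uses an allowed prefix."""
    match = SUBJECT_RE.match(subject)
    return bool(match) and match.group("type") in ALLOWED_TYPES


if __name__ == "__main__":
    for line in ("fix: handle empty merged ranges", "Fixed stuff"):
        print(f"{line!r}: {is_conventional(line)}")
```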
We lean toward smaller PRs with more context over big bundles. A five-line fix with a one-paragraph explanation is almost always mergeable.
Use the issue templates. For security issues, please use the private advisory flow — not a public issue.
Helpful things to include:
- Output of `python -c "import xlsx_parser; print(xlsx_parser.__version__)"`
- Python version (`python --version`)
- OS
- Minimal `.xlsx` that reproduces the bug (or a generator that builds one)
- Full traceback
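
The first three items can be gathered in one go. The snippet below is a small convenience sketch, not a script that ships with the project: `collect_bug_report_env` is a hypothetical helper, and it assumes only that the package is importable as `xlsx_parser` and exposes `__version__`, as the command above implies.

```python
import platform
from importlib import import_module


def collect_bug_report_env(module_name: str = "xlsx_parser") -> dict[str, str]:
    """Collect parser version, Python version, and OS for a bug report."""
    try:
        version = getattr(import_module(module_name), "__version__", "unknown")
    except ImportError:
        # Still useful to report: it tells maintainers the install is broken.
        version = "not installed"
    return {
        "xlsx_parser": str(version),
        "python": platform.python_version(),
        "os": platform.platform(),
    }


if __name__ == "__main__":
    for key, value in collect_bug_report_env().items():
        print(f"{key}: {value}")
```

Paste the printed lines into the issue alongside the reproducing workbook and the full traceback.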
- Type hints everywhere that's practical.
- Tests live in `tests/`; programmatic workbook fixtures live in `tests/conftest.py`.
- Cross-validation against calamine uses the `crossval` marker.
- The benchmark harness (`tests/benchmarks/`) lives outside `pytest`; invoke it via `make bench-robust` / `make bench-retrieval`.
- Keep public-API changes additive; if you can't, note it in the PR and the maintainers will line up the deprecation.
- Discord: https://discord.gg/4uaGhJcx — come hang out, the maintainers and regulars are active here.
- Discussions: https://github.com/knowledgestack/ks-xlsx-parser/discussions
- Issues: https://github.com/knowledgestack/ks-xlsx-parser/issues
- Security: https://github.com/knowledgestack/ks-xlsx-parser/security/advisories
- Knowledge Stack org: https://github.com/knowledgestack
By participating you agree to follow our Code of Conduct.
Really. Every contribution makes this project sustainable.