Commit bef68a0

prezis and claude committed
feat(docs_crawler): add exhaustive documentation-site crawler with sitem
Task-Id: 5
Auto-committed by per-task-commit hook after TaskUpdate(completed).
Session: 00bd6bf2-f81e-4d17-b827-8a7aef9e2619
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent 0bb896c commit bef68a0

5 files changed

Lines changed: 788 additions & 2 deletions

CHANGELOG.md

Lines changed: 29 additions & 0 deletions
@@ -7,6 +7,35 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

## [Unreleased]

### Added

- **`scraperx/docs_crawler.py`**: exhaustive documentation-site crawler. Born from the IC API docs incident (2026-05-03), where an LLM agent claimed "I read the docs" after reading only 5 of 82 pages. The module:
  - Discovers URLs via `sitemap.xml` (with a namespace-agnostic fallback for non-conformant XML) or an explicit URL list.
  - Fetches each page via curl with a Mozilla UA (Cloudflare-resistant).
  - Extracts prose with a *less aggressive* HTML parser that recovers Docusaurus / VitePress / mkdocs-material content (prior mistake: skipping `<nav>`/`<header>`/`<footer>` along with `<script>` nuked breadcrumbs, sidebars, and 95% of the prose on Docusaurus sites).
  - Flags pages with fewer than 500 chars as "shell candidates" that need a playwright render, and writes them to `_SHELLS.md`.
  - Writes `_DIGEST.md` with per-page byte/word counts so the caller can verify coverage at a glance.
  - Validates URLs to defend against curl flag injection (URLs starting with `-`, line breaks, non-http schemes).
  - Defends against path traversal: resolved write paths must stay inside the output directory.
- **`scraperx docs-crawl <root_url>`**: new CLI subcommand. Default output dir `./docs-crawl-<host>/`. Flags: `--max-pages`, `--user-agent`, `--timeout`, `--sleep-between`, `--include-encoded-dups`.
- **22 tests** in `tests/test_docs_crawler.py` covering extractor robustness (nav/header/footer preservation, script/style stripping, runaway-newline guard), sitemap namespace fallback, URL validation (flag-injection guard, line-break rejection), slug safety, end-to-end crawl with stub curl.
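
The URL-validation and path-traversal guards listed above could look roughly like the following. This is a minimal sketch under stated assumptions, not the module's actual code: the function names `is_safe_url` and `safe_output_path` are hypothetical, and the exact rules `docs_crawler.py` applies may differ.

```python
from pathlib import Path
from urllib.parse import urlsplit


def is_safe_url(url: str) -> bool:
    """Hypothetical guard: True only for URLs that are safe to hand to curl."""
    # A leading "-" would be parsed by curl as a command-line flag.
    if not url or url.startswith("-"):
        return False
    # Line breaks could smuggle extra arguments or header lines.
    if "\n" in url or "\r" in url:
        return False
    try:
        parts = urlsplit(url)
    except ValueError:
        # urlsplit rejects some malformed inputs (e.g. a broken IPv6 host).
        return False
    # Allow only http(s) with a non-empty host (rejects file://, ftp://, ...).
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def safe_output_path(out_dir: Path, slug: str) -> Path:
    """Hypothetical guard: resolve out_dir/slug, refusing anything outside out_dir."""
    root = out_dir.resolve()
    target = (root / slug).resolve()
    # relative_to raises ValueError when target is not located under root.
    target.relative_to(root)
    return target
```

Checking the resolved path, rather than scanning the slug for `..`, also covers absolute slugs and symlinked directories that point outside the output directory.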

### Use case

When an agent says "read every page of docs.example.com":
```bash
scraperx docs-crawl https://docs.example.com/
# → ./docs-crawl-docs.example.com/_DIGEST.md ← byte-counted page index
# → ./docs-crawl-docs.example.com/*.txt ← extracted prose, ready to grep
# → ./docs-crawl-docs.example.com/_SHELLS.md ← pages that need playwright
```

Then any LLM/agent can `grep -rl "<topic>" ./docs-crawl-*/` with confidence that nothing was silently skipped.
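
Before trusting a search over the output, a caller may want to confirm how many pages were actually written and which ones came out suspiciously short. The sketch below relies only on what the changelog states (extracted prose in `*.txt` files, a 500-character shell threshold). The helper name `summarize_crawl`, its return shape, and the choice to strip whitespace before counting are assumptions for illustration, not part of scraperx.

```python
from pathlib import Path

SHELL_THRESHOLD = 500  # chars; the threshold quoted in the changelog


def summarize_crawl(out_dir: str) -> dict:
    """Hypothetical helper: count extracted pages and list the short ones."""
    pages = sorted(Path(out_dir).glob("*.txt"))
    short = []
    for page in pages:
        text = page.read_text(encoding="utf-8", errors="replace")
        # Assumption: whitespace-only padding should not count as prose.
        if len(text.strip()) < SHELL_THRESHOLD:
            short.append(page.name)
    return {"pages": len(pages), "short": short}
```

Comparing the `pages` count against the number of URLs in the sitemap gives a quick independent check alongside `_DIGEST.md`.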

### Why this matters

Pre-`docs_crawler`, agents would either (a) read 3-5 prose pages and claim coverage, (b) crawl with a too-aggressive HTML parser that returned 83 bytes per page, or (c) download swagger YAML and conflate "have access to" with "have read." This module makes "I read every page" verifiable.
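
Failure mode (b) comes down to which tags the extractor discards. As an illustration only, a less aggressive extractor built on the standard library's `html.parser` might drop just `<script>` and `<style>` and keep text inside `<nav>`, `<header>`, and `<footer>`. The class and function names here are hypothetical, and the real parser in `docs_crawler.py` may work differently.

```python
from html.parser import HTMLParser


class ProseExtractor(HTMLParser):
    """Hypothetical extractor: skips only <script>/<style>, keeps all other text."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text unless we are inside a skipped element.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return "\n".join(self._chunks)


def extract_prose(html):
    parser = ProseExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```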
## [1.4.3] — 2026-04-25
Bug-fix release: production-grade SQLite WAL hygiene across all storage callsites. Important for anyone running scraperx components as long-lived daemons (BMW corpus ingester, Reddit/KBA/forum scrapers, GitHub analyzer in batch mode) — closes the unbounded-WAL disaster vector.

scraperx/__main__.py

Lines changed: 4 additions & 0 deletions
@@ -86,6 +86,10 @@ def main():
        from scraperx.doctor import main as doctor_main

        sys.exit(doctor_main())
    if subcmd == "docs-crawl":
        from scraperx.docs_crawler import _main_docs_crawl

        sys.exit(_main_docs_crawl())
    _main_url()