feat(docs_crawler): add exhaustive documentation-site crawler with sitem
Task-Id: 5
Auto-committed by per-task-commit hook after TaskUpdate(completed).
Session: 00bd6bf2-f81e-4d17-b827-8a7aef9e2619
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CHANGELOG.md: 29 additions & 0 deletions
@@ -7,6 +7,35 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
## [Unreleased]
### Added
- **`scraperx/docs_crawler.py`**: exhaustive documentation-site crawler. Born from the IC API docs incident (2026-05-03), in which an LLM agent claimed "I read the docs" after reading only 5 of 82 pages. The module:
  - Discovers URLs via `sitemap.xml` (with a namespace-agnostic fallback for non-conformant XML) or an explicit URL list.
  - Fetches each page via curl with a Mozilla UA, which makes it Cloudflare-resistant.
  - Extracts prose with a *less aggressive* HTML parser that recovers Docusaurus / VitePress / mkdocs-material content (prior mistake: skipping `<nav>`/`<header>`/`<footer>` along with `<script>` nuked breadcrumbs, sidebars, and 95% of the prose on Docusaurus sites).
  - Flags pages with fewer than 500 chars as "shell candidates" needing a playwright render; these are written to `_SHELLS.md`.
  - Writes `_DIGEST.md` with per-page byte/word counts so the caller can verify coverage at a glance.
  - Validates URLs to defend against curl flag injection (URLs starting with `-`, line breaks, non-http schemes).
  - Defends against path traversal: resolved write paths must stay inside the output directory.
- **`scraperx docs-crawl <root_url>`**: new CLI subcommand. Default output dir `./docs-crawl-<host>/`. Flags: `--max-pages`, `--user-agent`, `--timeout`, `--sleep-between`, `--include-encoded-dups`.

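The discovery and safety bullets above can be illustrated with a minimal sketch. It is not the code in `scraperx/docs_crawler.py`; the function names `sitemap_urls`, `is_safe_url`, and `safe_output_path` are hypothetical, and the sketch only assumes the behaviour described in this entry (namespace-agnostic `<loc>` lookup, rejection of flag-like or non-http URLs, write paths confined to the output directory). Nested sitemap index files are not handled here.

```python
import xml.etree.ElementTree as ET
from pathlib import Path
from urllib.parse import urlsplit


def sitemap_urls(xml_text: str) -> list[str]:
    """Collect <loc> values regardless of which namespace (if any) is declared."""
    root = ET.fromstring(xml_text)
    urls = []
    for el in root.iter():
        if not isinstance(el.tag, str):
            continue
        # ElementTree spells namespaced tags as "{uri}loc"; compare the local name only.
        local = el.tag.rsplit("}", 1)[-1]
        if local == "loc" and el.text and el.text.strip():
            urls.append(el.text.strip())
    return urls


def is_safe_url(url: str) -> bool:
    """Reject values curl could read as a flag, multi-line values, and non-http(s) schemes."""
    if not url or url.startswith("-"):
        return False
    if "\r" in url or "\n" in url:
        return False
    try:
        parts = urlsplit(url)
    except ValueError:
        return False
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def safe_output_path(out_dir: Path, name: str) -> Path:
    """Resolve name under out_dir; raise if the result escapes out_dir."""
    base = out_dir.resolve()
    target = (base / name).resolve()
    if not target.is_relative_to(base):  # Path.is_relative_to needs Python 3.9+
        raise ValueError(f"refusing to write outside {base}: {name!r}")
    return target
```

Checking the resolved path rather than the raw string means `..` segments and absolute names are caught by the same test, since both resolve to a location outside the base directory.
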
When an agent says "read every page of docs.example.com":
```bash
scraperx docs-crawl https://docs.example.com/
# → ./docs-crawl-docs.example.com/_DIGEST.md  ← byte-counted page index
# → ./docs-crawl-docs.example.com/*.txt       ← extracted prose, ready to grep
# → ./docs-crawl-docs.example.com/_SHELLS.md  ← pages that need playwright
```
Then any LLM/agent can run `grep -rl "<topic>" ./docs-crawl-*/` with confidence that nothing was silently skipped.
### Why this matters
Pre-`docs_crawler`, agents would either (a) read 3-5 prose pages and claim coverage, (b) crawl with a too-aggressive HTML parser that returned 83 bytes per page, or (c) download swagger YAML and conflate "have access to" with "have read." This module makes "I read every page" verifiable.
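
To make failure mode (b) concrete, here is a rough sketch of a less aggressive extractor built on the standard library's `html.parser`. It is an illustration under stated assumptions, not the module's actual parser: the names `ProseExtractor`, `extract_prose`, and `is_shell_candidate` are hypothetical, skipping `<style>` alongside `<script>` is an assumption, and the 500-character threshold is assumed to apply to the extracted text.

```python
from html.parser import HTMLParser

# Only non-prose containers are skipped. <nav>, <header> and <footer> are kept,
# because dropping them removed most of the prose on Docusaurus sites.
SKIP_TAGS = {"script", "style"}  # skipping <style> is an assumption
SHELL_THRESHOLD = 500            # chars, per the "shell candidates" bullet


class ProseExtractor(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep every non-empty text node that is not inside a skipped tag.
        if not self._skip_depth and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return "\n".join(self._chunks)


def extract_prose(html):
    parser = ProseExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


def is_shell_candidate(text):
    """True when so little text survived that the page probably needs a JS render."""
    return len(text) < SHELL_THRESHOLD
```

A page whose extracted text falls under the threshold is not treated as read; it is the kind of page that would be listed in `_SHELLS.md` for a playwright render.
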
## [1.4.3] — 2026-04-25
Bug-fix release: production-grade SQLite WAL hygiene across all storage callsites. Important for anyone running scraperx components as long-lived daemons (BMW corpus ingester, Reddit/KBA/forum scrapers, GitHub analyzer in batch mode) — closes the unbounded-WAL disaster vector.