This plan converts the design into a production scraper without changing the output contract.
Deliverables:
- CLI skeleton:
substack-archive-scraper preflight|discover|scrape|validate. - Config loader with secret-path validation.
- Schema validators for manifest rows, relationship rows, and scrape manifest.
- Atomic writer utility.
- Content-hash utility matching the requirements exactly.
Acceptance criteria:
- A fixture source tree validates.
- Hash changes ignore frontmatter-only edits.
- Required log files are created for every run.
Deliverables:
- JSON archive adapter.
- Sitemap adapter.
- HTML archive adapter.
- RSS incremental adapter.
- Reconciliation report and gap warnings.
Acceptance criteria:
- Candidate union is deterministic.
- RSS candidates never enter the primary union.
- Archive count gaps are logged with per-channel counts and intersections.
Deliverables:
- Canonical URL implementation.
- Authenticated and unauthenticated fetch clients.
- Article metadata extraction.
- Paywall detection and hydration self-test.
- Deterministic HTML-to-Markdown conversion.
Acceptance criteria:
- Five sampled post URLs canonicalize to their
<link rel="canonical">values. - Known paid post differs between authed and unauthed fetches.
- Markdown preserves paragraphs, links, emphasis, blockquotes, footnotes, image captions, and embeds.
Deliverables:
- Comment-thread fetch and pagination.
- Stable ID based author-reply detection.
- One-file-per-author-reply source assembly.
- Moderator and deleted-comment logging.
- Comment-access self-test.
Acceptance criteria:
- Missing
commenter_stable_idhalts comment scraping. - Every comments source has exactly one author reply.
- Every comments source has a
comments_for_articlerelationship.
Deliverables:
- PDF link discovery.
- PDF metadata and cover byline authorship decision.
- PDF text extraction and boilerplate handling.
- Transcript detection and parsing.
Acceptance criteria:
- Ambiguous PDFs are skipped and logged.
- Coauthored PDFs are skipped even when the target author appears.
- Accepted PDFs have
linked_from_article.
Deliverables:
- Prior-manifest indexes.
- Skip/update/new/404/slug-redirect decisions.
- Previous URL/hash mutation.
- Duplicate hash detection.
Acceptance criteria:
- Re-running unchanged fixtures produces zero writes.
- Slug changes do not create new source IDs.
- Edited content appends
previous_hashes[].
Deliverables:
- Full validation suite.
- Human scrape report.
- Voice candidate sampler.
scrape_manifest.ymlreadiness gate.
Acceptance criteria:
- Any validation failure sets
ready_for_ingestion: false. - Duplicate hashes are recorded in
content_duplicates.jsonl. - Reports match the condensed requirements sections.