
Implementation Plan

This plan converts the design into a production scraper without changing the output contract.

Milestone 1: Contract Harness

Deliverables:

CLI skeleton: substack-archive-scraper preflight|discover|scrape|validate.
Config loader with secret-path validation.
Schema validators for manifest rows, relationship rows, and scrape manifest.
Atomic writer utility.
Content-hash utility matching the requirements exactly.
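
One way the atomic writer utility could work is sketched below: write to a temporary file in the destination directory, flush it to disk, then rename it over the target. The function name atomic_write_text is hypothetical, and the plan does not say whether the real utility handles text or bytes; this is an illustration under those assumptions, not the specified implementation.

```python
import os
import tempfile
from pathlib import Path


def atomic_write_text(path: Path, text: str) -> None:
    """Write text so that readers never observe a partially written file.

    Hypothetical helper: the name and the text-only interface are assumptions.
    """
    path.parent.mkdir(parents=True, exist_ok=True)
    # The temp file lives in the destination directory so the final rename
    # stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(
        dir=path.parent, prefix=path.name + ".", suffix=".tmp"
    )
    try:
        with os.fdopen(fd, "w", encoding="utf-8", newline="\n") as handle:
            handle.write(text)
            handle.flush()
            os.fsync(handle.fileno())
        # os.replace overwrites an existing target in a single step.
        os.replace(tmp_name, path)
    except BaseException:
        # Remove the temp file if anything failed before the rename.
        try:
            os.unlink(tmp_name)
        except FileNotFoundError:
            pass
        raise
```

Because the rename is the only step that touches the target path, an interrupted run leaves either the old file or the new one, never a truncated mix.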

Acceptance criteria:

A fixture source tree validates.
The content hash does not change on frontmatter-only edits.
Required log files are created for every run.
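
The frontmatter criterion above can be checked with a hash that covers only the body. The sketch below assumes YAML-style frontmatter delimited by --- lines, SHA-256, and a simple newline normalization. None of these choices comes from this plan; the requirements define the exact algorithm, so treat the details as placeholders.

```python
import hashlib


def split_frontmatter(text: str) -> tuple[str, str]:
    """Return (frontmatter, body); frontmatter is '' when none is present."""
    lines = text.splitlines(keepends=True)
    if not lines or lines[0].strip() != "---":
        return "", text
    for index in range(1, len(lines)):
        if lines[index].strip() == "---":
            return "".join(lines[: index + 1]), "".join(lines[index + 1 :])
    # Unterminated frontmatter: treat the whole file as body.
    return "", text


def content_hash(text: str) -> str:
    """Hash the body only. SHA-256 and the normalization are assumptions."""
    _, body = split_frontmatter(text)
    normalized = body.replace("\r\n", "\n").strip() + "\n"
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()
```

A fixture test would hash two files that differ only in frontmatter and assert equal digests, then edit the body and assert that the digest changes.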

Milestone 2: Discovery

Deliverables:

JSON archive adapter.
Sitemap adapter.
HTML archive adapter.
RSS incremental adapter.
Reconciliation report and gap warnings.

Acceptance criteria:

Candidate union is deterministic.
RSS candidates never enter the primary union.
Archive count gaps are logged with per-channel counts and intersections.
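
A minimal sketch of the reconciliation step follows. The channel keys (json_archive, sitemap, html_archive, rss) and the report field names are hypothetical. The sketch sorts the union so the output is deterministic, keeps RSS out of it, and reports per-channel counts with pairwise intersections. What happens to RSS-only URLs afterwards is left open here.

```python
from itertools import combinations

# Hypothetical channel keys; RSS is deliberately absent from this tuple.
PRIMARY_CHANNELS = ("json_archive", "sitemap", "html_archive")


def reconcile(candidates: dict[str, set[str]]) -> dict:
    """Build the primary union plus the numbers a gap warning would need."""
    primary = {name: candidates.get(name, set()) for name in PRIMARY_CHANNELS}
    # Sorting makes the union independent of set iteration order.
    union = sorted(set().union(*primary.values()))
    # RSS URLs are reported separately and never merged into the union.
    rss_only = sorted(candidates.get("rss", set()) - set(union))
    counts = {name: len(urls) for name, urls in primary.items()}
    intersections = {
        f"{a}&{b}": len(primary[a] & primary[b])
        for a, b in combinations(PRIMARY_CHANNELS, 2)
    }
    return {
        "union": union,
        "rss_only": rss_only,
        "counts": counts,
        "intersections": intersections,
    }
```

Running reconcile twice on the same inputs should produce identical reports, which is the property the determinism criterion asks for.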

Milestone 3: Article Capture

Deliverables:

Canonical URL implementation.
Authenticated and unauthenticated fetch clients.
Article metadata extraction.
Paywall detection and hydration self-test.
Deterministic HTML-to-Markdown conversion.

Acceptance criteria:

Five sampled post URLs canonicalize to their <link rel="canonical"> values.
A known paid post returns different content for authenticated and unauthenticated fetches.
Markdown preserves paragraphs, links, emphasis, blockquotes, footnotes, image captions, and embeds.
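
The canonicalization criterion needs the page's own <link rel="canonical"> value to compare against. The sketch below reads it with the standard-library HTML parser; the class and function names are hypothetical, and the real canonical URL implementation may apply further rules that this plan does not describe.

```python
from html.parser import HTMLParser
from typing import Optional


class CanonicalLinkParser(HTMLParser):
    """Collect the href of the first <link rel="canonical"> element."""

    def __init__(self) -> None:
        super().__init__()
        self.canonical: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.canonical is not None:
            return
        attributes = dict(attrs)
        # rel may hold several space-separated tokens.
        rel_tokens = (attributes.get("rel") or "").lower().split()
        if "canonical" in rel_tokens and attributes.get("href"):
            self.canonical = attributes["href"].strip()


def extract_canonical(html: str) -> Optional[str]:
    """Return the canonical URL declared in the page, or None if absent."""
    parser = CanonicalLinkParser()
    parser.feed(html)
    return parser.canonical
```

An acceptance test could fetch the five sampled posts, run the scraper's canonicalizer on each requested URL, and assert equality with extract_canonical on the fetched HTML.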

Milestone 4: Comments

Deliverables:

Comment-thread fetch and pagination.
Author-reply detection based on stable IDs.
One-file-per-author-reply source assembly.
Moderator and deleted-comment logging.
Comment-access self-test.

Acceptance criteria:

Missing commenter_stable_id halts comment scraping.
Every comments source has exactly one author reply.
Every comments source has a comments_for_article relationship.
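
The three criteria above could be enforced roughly as follows. Only commenter_stable_id and comments_for_article come from this plan; every other field name (id, parent_id, author_reply_ids, type, from) and the exception class are hypothetical, as is the assumption that an author reply is an author comment with a parent.

```python
class MissingStableIdError(RuntimeError):
    """Raised to halt comment scraping when a comment lacks a stable ID."""


def find_author_replies(comments: list[dict], author_stable_id: str) -> list[dict]:
    """Return the author's replies, matching on stable ID rather than name."""
    replies = []
    for comment in comments:
        stable_id = comment.get("commenter_stable_id")
        if not stable_id:
            # Halt instead of guessing identity from a display name.
            raise MissingStableIdError(
                f"comment {comment.get('id')!r} has no commenter_stable_id"
            )
        if stable_id == author_stable_id and comment.get("parent_id") is not None:
            replies.append(comment)
    return replies


def validate_comment_sources(sources: list[dict], relationships: list[dict]) -> list[str]:
    """Check the one-reply-per-source and relationship criteria."""
    linked = {
        rel["from"] for rel in relationships if rel["type"] == "comments_for_article"
    }
    errors = []
    for source in sources:
        if len(source["author_reply_ids"]) != 1:
            errors.append(f"{source['id']}: expected exactly one author reply")
        if source["id"] not in linked:
            errors.append(f"{source['id']}: missing comments_for_article relationship")
    return errors
```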

Milestone 5: PDFs and Transcripts

Deliverables:

PDF link discovery.
Authorship decision from PDF metadata and cover byline.
PDF text extraction and boilerplate handling.
Transcript detection and parsing.

Acceptance criteria:

Ambiguous PDFs are skipped and logged.
Coauthored PDFs are skipped even when the target author appears.
Accepted PDFs have linked_from_article.
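
The authorship rule can be read as a small decision function. The sketch below assumes byline extraction has already produced a list of names (or nothing), treats a missing byline as the ambiguous case, and also skips PDFs with a single non-target author. Those last two readings are assumptions, and the function name and reason strings are hypothetical.

```python
from typing import Optional


def normalize_name(name: str) -> str:
    """Collapse whitespace and case so 'Jane  Doe' matches 'jane doe'."""
    return " ".join(name.split()).casefold()


def decide_pdf_authorship(
    byline_authors: Optional[list[str]], target_author: str
) -> tuple[str, str]:
    """Return (decision, reason); the caller logs the reason for every skip."""
    if not byline_authors:
        return "skip", "ambiguous: no byline found"
    names = {normalize_name(name) for name in byline_authors}
    if len(names) > 1:
        # Coauthored work is skipped even when the target author is listed.
        return "skip", "coauthored"
    if normalize_name(target_author) in names:
        return "accept", "sole author"
    return "skip", "different author"
```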

Milestone 6: Idempotent Re-Scrape

Deliverables:

Prior-manifest indexes.
Skip/update/new/404/slug-redirect decisions.
Updates to previous-URL and previous-hash history.
Duplicate hash detection.

Acceptance criteria:

Re-running unchanged fixtures produces zero writes.
Slug changes do not create new source IDs.
Edited content appends the prior hash to previous_hashes[].
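
The decision logic might look like the sketch below. It assumes the prior manifest row is looked up by a source ID that stays stable across slug changes, and that rows carry url, content_hash, previous_hashes, and a previous_urls list. Only previous_hashes is named in this plan; the other field names and the function name are hypothetical. When both the URL and the content change, the sketch reports update, which is also an assumption.

```python
from typing import Optional


def decide_rescrape(
    prior: Optional[dict], status: int, final_url: str, new_hash: str
) -> tuple[str, Optional[dict]]:
    """Return (action, row). A 'skip' returns the prior row untouched."""
    if status == 404:
        return "404", prior
    if prior is None:
        # The caller creates the new row and assigns its source ID.
        return "new", None
    url_changed = final_url != prior["url"]
    hash_changed = new_hash != prior["content_hash"]
    if not url_changed and not hash_changed:
        return "skip", prior
    # Copy the row and its history lists so the prior index is never mutated.
    row = dict(prior)
    row["previous_urls"] = list(prior.get("previous_urls", []))
    row["previous_hashes"] = list(prior.get("previous_hashes", []))
    if url_changed:
        row["previous_urls"].append(prior["url"])
        row["url"] = final_url
    if hash_changed:
        row["previous_hashes"].append(prior["content_hash"])
        row["content_hash"] = new_hash
    return ("update" if hash_changed else "slug-redirect"), row
```

Since skip hands back the prior row object itself, a caller can compare identity to confirm that an unchanged fixture triggers no write.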

Milestone 7: Full Validation and Reports

Deliverables:

Full validation suite.
Human scrape report.
Voice candidate sampler.
scrape_manifest.yml readiness gate.

Acceptance criteria:

Any validation failure sets ready_for_ingestion: false.
Duplicate hashes are recorded in content_duplicates.jsonl.
Reports match the condensed requirements sections.
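
A sketch of the readiness gate and duplicate-hash report follows. It assumes manifest rows expose source_id and content_hash, and it treats duplicates as something to record rather than as a validation failure, since this plan only says they are recorded. The validation_failures field and the function names are hypothetical. Serializing the gate into scrape_manifest.yml is left out because the standard library has no YAML writer.

```python
import json
from collections import defaultdict


def find_duplicate_hashes(rows: list[dict]) -> list[dict]:
    """Group source IDs by content hash and keep hashes seen more than once."""
    by_hash = defaultdict(list)
    for row in rows:
        by_hash[row["content_hash"]].append(row["source_id"])
    return [
        {"content_hash": digest, "source_ids": sorted(ids)}
        for digest, ids in sorted(by_hash.items())
        if len(ids) > 1
    ]


def duplicates_jsonl(duplicates: list[dict]) -> str:
    """Render one JSON object per line, as content_duplicates.jsonl would hold."""
    return "".join(json.dumps(item, sort_keys=True) + "\n" for item in duplicates)


def readiness(validation_failures: list[str]) -> dict:
    """Any failure at all flips ready_for_ingestion to False."""
    return {
        "ready_for_ingestion": not validation_failures,
        "validation_failures": list(validation_failures),
    }
```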