Implementation Plan

This plan converts the design into a production scraper without changing the output contract.

Milestone 1: Contract Harness

Deliverables:

  • CLI skeleton: substack-archive-scraper preflight|discover|scrape|validate.
  • Config loader with secret-path validation.
  • Schema validators for manifest rows, relationship rows, and scrape manifest.
  • Atomic writer utility.
  • Content-hash utility matching the requirements exactly.
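
One common way to build the atomic writer is to write to a temporary file in the target directory and then rename it over the destination. The sketch below assumes that approach; the function name `atomic_write_text` is hypothetical and not taken from the design.

```python
import os
import tempfile
from pathlib import Path


def atomic_write_text(path: Path, text: str, encoding: str = "utf-8") -> None:
    """Write text so readers see either the old file or the complete new one."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    # The temp file lives next to the target so os.replace stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(
        dir=path.parent, prefix=path.name + ".", suffix=".tmp"
    )
    try:
        with os.fdopen(fd, "w", encoding=encoding) as handle:
            handle.write(text)
            handle.flush()
            os.fsync(handle.fileno())  # push bytes to disk before the rename
        os.replace(tmp_name, path)  # atomic rename over the destination
    except BaseException:
        # Remove the partial temp file; it is already gone if the rename succeeded.
        try:
            os.unlink(tmp_name)
        except FileNotFoundError:
            pass
        raise
```

A crash before `os.replace` leaves the previous file untouched, which is the property an idempotent re-scrape needs from its writes.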

Acceptance criteria:

  • A fixture source tree validates.
  • Frontmatter-only edits do not change the content hash.
  • Required log files are created for every run.
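
The frontmatter criterion can be illustrated with a small sketch. It assumes sources are Markdown files with a leading `---` delimited frontmatter block and that the hash is SHA-256 over the newline-normalized body. The real normalization rules live in the requirements and may differ, and both function names are hypothetical.

```python
import hashlib
from typing import Tuple


def split_frontmatter(text: str) -> Tuple[str, str]:
    """Return (frontmatter, body); frontmatter is '' when no leading '---' block exists."""
    lines = text.splitlines(keepends=True)
    if lines and lines[0].strip() == "---":
        for i in range(1, len(lines)):
            if lines[i].strip() == "---":
                return "".join(lines[1:i]), "".join(lines[i + 1:])
    return "", text


def content_hash(text: str) -> str:
    """Hash only the body, so metadata edits leave the hash unchanged."""
    _, body = split_frontmatter(text)
    # Assumed normalization: unify line endings and trim surrounding whitespace.
    normalized = body.replace("\r\n", "\n").strip() + "\n"
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()
```

Two files that differ only in frontmatter hash identically under this sketch, while any body edit produces a new hash.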

Milestone 2: Discovery

Deliverables:

  • JSON archive adapter.
  • Sitemap adapter.
  • HTML archive adapter.
  • RSS incremental adapter.
  • Reconciliation report and gap warnings.

Acceptance criteria:

  • Candidate union is deterministic.
  • RSS candidates never enter the primary union.
  • Archive count gaps are logged with per-channel counts and intersections.
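
The reconciliation criteria above might be sketched as follows. The channel keys (`json_archive`, `sitemap`, `html_archive`, `rss`) and the report shape are assumptions made for illustration, not names from the design.

```python
from typing import Dict, Iterable

# Assumed channel names; RSS is deliberately absent from the primary set.
PRIMARY_CHANNELS = ("json_archive", "sitemap", "html_archive")


def reconcile(candidates: Dict[str, Iterable[str]]) -> dict:
    """Build a sorted primary union plus per-channel counts and pairwise overlaps."""
    primary = {name: set(candidates.get(name, ())) for name in PRIMARY_CHANNELS}
    # Sorting makes the union independent of adapter ordering and set iteration order.
    union = sorted(set().union(*primary.values()))
    # RSS URLs are reported separately and never merged into the union.
    rss_only = sorted(set(candidates.get("rss", ())) - set(union))
    intersections = {
        f"{a}&{b}": len(primary[a] & primary[b])
        for i, a in enumerate(PRIMARY_CHANNELS)
        for b in PRIMARY_CHANNELS[i + 1:]
    }
    return {
        "union": union,
        "counts": {name: len(urls) for name, urls in primary.items()},
        "intersections": intersections,
        "rss_only": rss_only,
    }
```

A gap warning could then compare `counts` against `len(union)` and log the `intersections` alongside it.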

Milestone 3: Article Capture

Deliverables:

  • Canonical URL implementation.
  • Authenticated and unauthenticated fetch clients.
  • Article metadata extraction.
  • Paywall detection and hydration self-test.
  • Deterministic HTML-to-Markdown conversion.

Acceptance criteria:

  • Five sampled post URLs canonicalize to their <link rel="canonical"> values.
  • A known paid post returns different content for authenticated and unauthenticated fetches.
  • Markdown preserves paragraphs, links, emphasis, blockquotes, footnotes, image captions, and embeds.
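
For the canonicalization check, a minimal sketch using only the standard library's `html.parser` is shown below. The helper name `canonical_url` is hypothetical, and a production implementation would probably also resolve relative `href` values against the page URL.

```python
from html.parser import HTMLParser
from typing import Optional


class _CanonicalLinkParser(HTMLParser):
    """Record the href of the first <link rel="canonical"> element."""

    def __init__(self) -> None:
        super().__init__()
        self.canonical: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.canonical is not None:
            return
        attr = dict(attrs)
        # rel may hold several space-separated tokens, so split before matching.
        rels = (attr.get("rel") or "").lower().split()
        if "canonical" in rels and attr.get("href"):
            self.canonical = attr["href"].strip()


def canonical_url(html: str) -> Optional[str]:
    """Return the canonical URL declared in the page, or None if absent."""
    parser = _CanonicalLinkParser()
    parser.feed(html)
    return parser.canonical
```

The acceptance test would compare the scraper's computed URL for each sampled post with the value this helper reads from the fetched page.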

Milestone 4: Comments

Deliverables:

  • Comment-thread fetch and pagination.
  • Stable-ID-based author-reply detection.
  • One-file-per-author-reply source assembly.
  • Moderator and deleted-comment logging.
  • Comment-access self-test.

Acceptance criteria:

  • Missing commenter_stable_id halts comment scraping.
  • Every comments source has exactly one author reply.
  • Every comments source has a comments_for_article relationship.
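
The stable-ID rule can be sketched as below. Only `commenter_stable_id` comes from the plan. The `id` and `parent_id` fields, the exception name, and the choice to treat any author comment with a parent as a reply are assumptions for illustration.

```python
from typing import Iterable, List


class MissingStableIdError(RuntimeError):
    """Raised so the caller can halt comment scraping for the run."""


def author_replies(comments: Iterable[dict], author_stable_id: str) -> List[dict]:
    """Select replies written by the target author, matching on stable ID only."""
    replies = []
    for comment in comments:
        stable_id = comment.get("commenter_stable_id")
        if stable_id is None:
            # Display names can collide or change, so there is no fallback to them.
            raise MissingStableIdError(
                f"comment {comment.get('id')!r} has no commenter_stable_id"
            )
        if stable_id == author_stable_id and comment.get("parent_id") is not None:
            replies.append(comment)
    return replies
```

Each returned reply would then become its own comments source, which keeps the one-author-reply-per-file criterion easy to validate.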

Milestone 5: PDFs and Transcripts

Deliverables:

  • PDF link discovery.
  • PDF metadata extraction and authorship decision based on the cover byline.
  • PDF text extraction and boilerplate handling.
  • Transcript detection and parsing.

Acceptance criteria:

  • Ambiguous PDFs are skipped and logged.
  • Coauthored PDFs are skipped even when the target author appears.
  • Accepted PDFs have linked_from_article.
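
The skip rules for PDFs might reduce to a decision function like the one below. It assumes the byline has already been parsed into a list of names, which is the hard part and is not shown. The function name and reason strings are hypothetical.

```python
from typing import List, Tuple


def pdf_authorship_decision(
    byline_authors: List[str], target_author: str
) -> Tuple[str, str]:
    """Return (decision, reason); only a sole-author match is accepted."""
    names = [n.strip().casefold() for n in byline_authors if n.strip()]
    target = target_author.strip().casefold()
    if not names:
        return "skip", "ambiguous: no byline found"
    if len(names) > 1:
        # Skipped even when the target author is one of the names.
        return "skip", "coauthored"
    if names[0] != target:
        return "skip", "different author"
    return "accept", "sole author"
```

Every `skip` result carries a reason string so the log entry can explain why the PDF was excluded.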

Milestone 6: Idempotent Re-Scrape

Deliverables:

  • Prior-manifest indexes.
  • Skip/update/new/404/slug-redirect decisions.
  • Updates to the previous-URL and previous-hash history fields.
  • Duplicate hash detection.

Acceptance criteria:

  • Re-running unchanged fixtures produces zero writes.
  • Slug changes do not create new source IDs.
  • Edited content appends previous_hashes[].
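
A rough sketch of the skip/update/new/slug-redirect decision follows. It assumes each source has an ID that is independent of its slug and that manifest rows carry `url`, `content_hash`, `previous_urls`, and `previous_hashes`. Only `previous_hashes` appears in the plan; the other field names and the function name are hypothetical. The 404 case is omitted.

```python
import copy
from typing import Optional, Tuple


def apply_fetch(
    prior: Optional[dict], source_id: str, url: str, content_hash: str
) -> Tuple[str, dict]:
    """Compare a fresh fetch with the prior manifest row and return (action, row)."""
    if prior is None:
        return "new", {
            "source_id": source_id,
            "url": url,
            "content_hash": content_hash,
            "previous_urls": [],
            "previous_hashes": [],
        }
    row = copy.deepcopy(prior)  # never mutate the prior-manifest index in place
    action = "skip"
    if url != row["url"]:
        # Slug changed: keep the same source ID and remember the old URL.
        row["previous_urls"].append(row["url"])
        row["url"] = url
        action = "slug-redirect"
    if content_hash != row["content_hash"]:
        row["previous_hashes"].append(row["content_hash"])
        row["content_hash"] = content_hash
        action = "update"
    return action, row
```

When the action is `skip` the returned row equals the prior row, so the caller can avoid any write at all, which matches the zero-writes criterion.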

Milestone 7: Full Validation and Reports

Deliverables:

  • Full validation suite.
  • Human scrape report.
  • Voice candidate sampler.
  • scrape_manifest.yml readiness gate.

Acceptance criteria:

  • Any validation failure sets ready_for_ingestion: false.
  • Duplicate hashes are recorded in content_duplicates.jsonl.
  • Reports match the condensed requirements sections.
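
The readiness gate and duplicate detection could look roughly like this. The row fields `content_hash` and `source_id`, the record shape for `content_duplicates.jsonl`, and both function names are assumptions. Whether a duplicate hash also counts as a validation failure is left to the requirements.

```python
import json
from collections import defaultdict
from typing import Iterable, List


def find_duplicate_hashes(rows: Iterable[dict]) -> List[dict]:
    """Group source IDs by content hash and keep only hashes seen more than once."""
    by_hash = defaultdict(list)
    for row in rows:
        by_hash[row["content_hash"]].append(row["source_id"])
    return [
        {"content_hash": h, "source_ids": sorted(ids)}
        for h, ids in sorted(by_hash.items())
        if len(ids) > 1
    ]


def to_jsonl(records: Iterable[dict]) -> str:
    """Serialize records one JSON object per line, with stable key order."""
    return "".join(json.dumps(r, sort_keys=True) + "\n" for r in records)


def readiness(validation_failures: List[str]) -> dict:
    """Any validation failure at all flips ready_for_ingestion to False."""
    return {
        "ready_for_ingestion": not validation_failures,
        "validation_failures": list(validation_failures),
    }
```

The output of `to_jsonl(find_duplicate_hashes(rows))` would be the content written to `content_duplicates.jsonl`, and the `readiness` result would be merged into `scrape_manifest.yml`.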