| 1 | +# Substack Archive Scraper |
| 2 | + |
| 3 | +A Substack scraper and archive exporter that turns single-author Substack |
| 4 | +publications into Markdown source files for wiki ingestion. |
| 5 | + |
| 6 | +It is intended as a local Substack-to-Markdown downloader for researchers, |
| 7 | +operators, and publication owners who need reproducible source archives. |
| 8 | + |
| 9 | +The tool is built around a strict source contract: every article, author reply, |
| 10 | +accepted PDF, and accepted transcript gets a manifest row, a deterministic source |
| 11 | +file, provenance metadata, and validation logs. It preserves source text; it does |
| 12 | +not summarize, paraphrase, atomize, or build the wiki itself. |
| 13 | + |
| 14 | +## Safety Model |
| 15 | + |
| 16 | +- Use this only for publications and paid content you are allowed to access. |
| 17 | +- Paid Substacks are handled through a local browser login that exports a |
| 18 | + Playwright storage-state file outside the repo. |
| 19 | +- The scraper does not bypass paywalls, evade bot detection, or hide what it is. |
| 20 | +- The HTTP client uses a configured contact in its `User-Agent`, respects |
| 21 | + `robots.txt`, rate-limits requests, and logs access caveats. |
| 22 | +- Generated output can contain paid/private text. Keep output roots and session |
| 23 | + files outside the repository. |
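The polite-client behaviour listed above can be sketched with the Python standard library. This is a minimal illustration, not the scraper's actual HTTP client: the contact address, the `User-Agent` format, the `RateLimiter` class, and the sample `robots.txt` rules are all hypothetical.

```python
import time
from urllib.robotparser import RobotFileParser

# Hypothetical values: the real scraper reads its contact from
# `operator.user_agent_contact`; this User-Agent format is an assumption.
CONTACT = "ops@example.com"
USER_AGENT = f"substack-archive-scraper (+mailto:{CONTACT})"


def build_robots(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been fetched."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


class RateLimiter:
    """Enforce a minimum delay between consecutive requests."""

    def __init__(self, min_interval_seconds: float) -> None:
        self.min_interval = min_interval_seconds
        self._last = None  # monotonic timestamp of the previous request

    def wait(self) -> None:
        now = time.monotonic()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                time.sleep(remaining)
        self._last = time.monotonic()


# Sample rules for illustration only; not taken from any real site.
robots = build_robots("User-agent: *\nDisallow: /api/private/\n")
allowed = robots.can_fetch(USER_AGENT, "https://example.substack.com/p/some-post")
blocked = robots.can_fetch(USER_AGENT, "https://example.substack.com/api/private/x")
```

A caller would check `can_fetch` before each request and call `RateLimiter.wait()` immediately before sending it.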
| 24 | + |
| 25 | +## Install |
| 26 | + |
| 27 | +```bash |
| 28 | +uv sync --dev |
| 29 | +uv run playwright install chromium |
| 30 | +``` |
| 31 | + |
| 32 | +Without `uv`: |
| 33 | + |
| 34 | +```bash |
| 35 | +python3 -m venv .venv |
| 36 | +.venv/bin/python -m pip install -e ".[dev]" |
| 37 | +.venv/bin/python -m playwright install chromium |
| 38 | +``` |
| 39 | + |
| 40 | +## Quickstart |
| 41 | + |
| 42 | +Create a config from the generic template: |
| 43 | + |
| 44 | +```bash |
| 45 | +cp config/public.example.yml config/my-publication.yml |
| 46 | +``` |
| 47 | + |
| 48 | +Edit `config/my-publication.yml`: |
| 49 | + |
| 50 | +- `target.base_url` |
| 51 | +- `target.publication_name` |
| 52 | +- `target.author.canonical_name` |
| 53 | +- `target.author.stable_id` |
| 54 | +- `output.root` |
| 55 | +- `operator.user_agent_contact` |
| 56 | + |
| 57 | +The author stable ID is required for comment disambiguation. Display-name |
| 58 | +matching is not safe enough for author replies. |
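Before running anything, it can help to confirm that all six keys are filled in. The sketch below checks a config that has already been loaded into a nested `dict`. The `missing_keys` helper and the sample values are hypothetical, and this is not a description of what the `preflight` command does internally.

```python
# Dotted paths copied from the list of keys to edit.
REQUIRED_KEYS = [
    "target.base_url",
    "target.publication_name",
    "target.author.canonical_name",
    "target.author.stable_id",
    "output.root",
    "operator.user_agent_contact",
]


def missing_keys(config: dict) -> list:
    """Return the dotted keys that are absent or empty in the config."""
    missing = []
    for dotted in REQUIRED_KEYS:
        node = config
        for part in dotted.split("."):
            if not isinstance(node, dict) or part not in node:
                node = None
                break
            node = node[part]
        if node is None or node == "":
            missing.append(dotted)
    return missing


# Hypothetical config with the author stable ID left out.
example = {
    "target": {
        "base_url": "https://example.substack.com",
        "publication_name": "Example Publication",
        "author": {"canonical_name": "Example Author"},
    },
    "output": {"root": "/absolute/path/to/output"},
    "operator": {"user_agent_contact": "ops@example.com"},
}
result = missing_keys(example)
```

Here `result` contains only `target.author.stable_id`, the key the paragraph above singles out as required.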
| 59 | + |
| 60 | +Run preflight and discovery: |
| 61 | + |
| 62 | +```bash |
| 63 | +uv run substack-archive-scraper preflight --config config/my-publication.yml |
| 64 | +uv run substack-archive-scraper discover --config config/my-publication.yml --limit 10 |
| 65 | +``` |
| 66 | + |
| 67 | +Scrape and validate: |
| 68 | + |
| 69 | +```bash |
| 70 | +uv run substack-archive-scraper scrape --config config/my-publication.yml |
| 71 | +uv run substack-archive-scraper validate \ |
| 72 | + --config config/my-publication.yml \ |
| 73 | + --output-root /absolute/path/to/output |
| 74 | +``` |
| 75 | + |
| 76 | +## Paid Substacks |
| 77 | + |
| 78 | +Start from the authenticated template: |
| 79 | + |
| 80 | +```bash |
| 81 | +cp config/authenticated.example.yml config/my-paid-publication.yml |
| 82 | +``` |
| 83 | + |
| 84 | +Set `auth.cookie_file` to a path outside the repo, then capture a session: |
| 85 | + |
| 86 | +```bash |
| 87 | +uv run substack-archive-scraper login --config config/my-paid-publication.yml |
| 88 | +``` |
| 89 | + |
| 90 | +A headed Chromium window opens. Log in normally, return to the terminal, and |
| 91 | +press Enter. The scraper stores Playwright storage state at `auth.cookie_file`. |
| 92 | + |
| 93 | +Credentialed scrapes require `auth.known_paid_post_url` so the scraper can prove |
| 94 | +that authenticated article hydration is working before it captures paid content. |
| 95 | + |
| 96 | +## Output Contract |
| 97 | + |
| 98 | +```text |
| 99 | +<output-root>/ |
| 100 | + raw/ |
| 101 | + articles/<YYYY>/YYYY-MM-DD-<slug>.md |
| 102 | + pdfs/<descriptive-slug>.md |
| 103 | + comments/YYYY-MM-DD-<article-slug>-<reply-seq>.md |
| 104 | + transcripts/YYYY-MM-DD-<episode-slug>.md |
| 105 | + _manifests/ |
| 106 | + source_manifest.jsonl |
| 107 | + source_relationships.jsonl |
| 108 | + content_duplicates.jsonl |
| 109 | + scrape_report.md |
| 110 | + voice_candidates.md |
| 111 | + scrape_manifest.yml |
| 112 | + scrape_logs/<run-id>/ |
| 113 | + *.log |
| 114 | +``` |
| 115 | + |
| 116 | +The manifest is canonical. Source-file frontmatter is a recovery mirror only. |
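The naming patterns in the layout are deterministic, so a path can be derived from a publication date and a slug. The sketch below illustrates the article pattern only. The `article_path` helper is hypothetical, the path is built relative to the `articles/` directory because the nesting above it is not assumed here, and the scraper's real slug rules may differ.

```python
from datetime import date
from pathlib import PurePosixPath


def article_path(published: date, slug: str) -> PurePosixPath:
    """Build articles/<YYYY>/YYYY-MM-DD-<slug>.md for one article."""
    year_dir = f"{published:%Y}"
    filename = f"{published.isoformat()}-{slug}.md"
    return PurePosixPath("articles", year_dir, filename)


# Hypothetical article metadata.
path = article_path(date(2024, 3, 5), "hello-world")
```

Because the output depends only on the inputs, re-running a scrape for the same article yields the same file path.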
| 117 | + |
| 118 | +## Completeness Policy |
| 119 | + |
| 120 | +By default the scraper is completeness-first: |
| 121 | + |
| 122 | +- Keep every discovered single-author article. |
| 123 | +- Keep every confirmed author reply the authenticated session can see. |
| 124 | +- Include partially paywalled articles with `paywall_truncation` warnings. |
| 125 | +- Include confirmed author replies whose parent comment is hidden by the API |
| 126 | + with `comment_parent_context_unavailable`. |
| 127 | +- Log comment-access gaps instead of silently omitting them. |
| 128 | + |
| 129 | +Use `--exclude-partial-paywalled` only when you intentionally want a stricter |
| 130 | +complete-body corpus. |
| 131 | + |
| 132 | +## Cache And Progress |
| 133 | + |
| 134 | +Scrape runs use a persistent HTTP cache by default at: |
| 135 | + |
| 136 | +```text |
| 137 | +~/.cache/substack-archive-scraper/ |
| 138 | +``` |
| 139 | + |
| 140 | +Useful flags: |
| 141 | + |
| 142 | +```bash |
| 143 | +uv run substack-archive-scraper scrape --config config/my-publication.yml --progress-every 300 |
| 144 | +uv run substack-archive-scraper scrape --config config/my-publication.yml --refresh-cache |
| 145 | +uv run substack-archive-scraper scrape --config config/my-publication.yml --no-cache |
| 146 | +``` |
| 147 | + |
| 148 | +Progress output includes post counts, elapsed time, ETA, source counts, and cache |
| 149 | +hit/miss/write counts. |
| 150 | + |
| 151 | +## Developer Commands |
| 152 | + |
| 153 | +```bash |
| 154 | +uv sync --dev |
| 155 | +uv run ruff check src tests schemas |
| 156 | +uv run pytest |
| 157 | +uv run python -m json.tool schemas/source_manifest.schema.json >/dev/null |
| 158 | +``` |
| 159 | + |
| 160 | +Equivalent shortcuts are available through `make`: |
| 161 | + |
| 162 | +```bash |
| 163 | +make install |
| 164 | +make check |
| 165 | +make test |
| 166 | +``` |
| 167 | + |
| 168 | +The shorter `substack-ingest` command is kept as a compatibility alias. |
| 169 | + |
| 170 | +## Documentation |
| 171 | + |
| 172 | +- [Pipeline design](docs/pipeline-design.md) |
| 173 | +- [Implementation plan](docs/implementation-plan.md) |
| 174 | +- [Development guide](docs/development.md) |
| 175 | +- [Security and publishing notes](docs/security.md) |
| 176 | +- [Release checklist](docs/release-checklist.md) |
| 177 | + |
| 178 | +## Publishing Status |
| 179 | + |
| 180 | +This repository is prepared for later publication under [The Unlicense](LICENSE.md). |