Commit 8a7cb14

Initial Substack Archive Scraper
0 parents  commit 8a7cb14

42 files changed

Lines changed: 5395 additions & 0 deletions

.github/workflows/ci.yml

Lines changed: 21 additions & 0 deletions
@@ -0,0 +1,21 @@
name: CI

on:
  push:
  pull_request:

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: astral-sh/setup-uv@v6
      - uses: actions/setup-python@v6
        with:
          python-version: "3.12"
      - run: uv sync --dev
      - run: uv run ruff check src tests schemas
      - run: uv run pytest
      - run: uv run python -m json.tool schemas/source_manifest.schema.json >/dev/null
      - run: uv run python -m json.tool schemas/source_relationship.schema.json >/dev/null
      - run: uv run python -m json.tool schemas/scrape_manifest.schema.json >/dev/null
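
The three `json.tool` steps only confirm that each schema file parses as JSON; they do not check that the files are valid JSON Schema. A minimal Python sketch of the same check is shown here. The helper name `find_invalid_json` is hypothetical and not part of this commit.

```python
import json
from pathlib import Path


def find_invalid_json(paths):
    """Return (path, message) pairs for files that fail to parse as JSON.

    Hypothetical helper: it mirrors what `python -m json.tool FILE` verifies,
    which is JSON syntax only, not JSON Schema validity.
    """
    failures = []
    for path in paths:
        try:
            json.loads(Path(path).read_text(encoding="utf-8"))
        except (OSError, json.JSONDecodeError) as exc:
            failures.append((str(path), str(exc)))
    return failures
```

Calling it with the three schema paths from the workflow and failing the build when the returned list is non-empty would collapse the three steps into one.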

.gitignore

Lines changed: 26 additions & 0 deletions
@@ -0,0 +1,26 @@
.DS_Store
.env
.env.*
!.env.example

# Runtime scrape output; the wiki ingest consumes these after a run,
# but the repository should not store paid/private content or credentials.
outputs/
scrape_logs/
raw/

# Local auth/session material.
secrets/
*.cookies.json
*.har

# Python/tooling caches.
__pycache__/
.pytest_cache/
.ruff_cache/
.mypy_cache/
.venv/
venv/
build/
dist/
*.egg-info/

LICENSE.md

Lines changed: 21 additions & 0 deletions
@@ -0,0 +1,21 @@
This is free and unencumbered software released into the public domain.

Anyone is free to copy, modify, publish, use, compile, sell, or distribute this
software, either in source code form or as a compiled binary, for any purpose,
commercial or non-commercial, and by any means.

In jurisdictions that recognize copyright laws, the author or authors of this
software dedicate any and all copyright interest in the software to the public
domain. We make this dedication for the benefit of the public at large and to
the detriment of our heirs and successors. We intend this dedication to be an
overt act of relinquishment in perpetuity of all present and future rights to
this software under copyright law.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS
FOR A PARTICULAR PURPOSE, AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS BE
LIABLE FOR ANY CLAIM, DAMAGES, OR OTHER LIABILITY, WHETHER IN AN ACTION OF
CONTRACT, TORT, OR OTHERWISE, ARISING FROM, OUT OF, OR IN CONNECTION WITH THE
SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

For more information, please refer to <https://unlicense.org/>

MANIFEST.in

Lines changed: 7 additions & 0 deletions
@@ -0,0 +1,7 @@
include LICENSE.md
include README.md
include Makefile
recursive-include config *.example.yml
recursive-include docs *.md
recursive-include schemas *.json
recursive-include src/substack_ingest/schemas *.json

Makefile

Lines changed: 25 additions & 0 deletions
@@ -0,0 +1,25 @@
.PHONY: install check test schemas build clean ci

install:
	uv sync --dev
	uv run playwright install chromium

check:
	uv run ruff check src tests schemas

test:
	uv run pytest

schemas:
	uv run python -m json.tool schemas/source_manifest.schema.json >/dev/null
	uv run python -m json.tool schemas/source_relationship.schema.json >/dev/null
	uv run python -m json.tool schemas/scrape_manifest.schema.json >/dev/null

build:
	uv build

clean:
	rm -rf build dist .pytest_cache .ruff_cache src/*.egg-info
	find src tests -type d -name __pycache__ -prune -exec rm -rf {} +

ci: check test schemas build

README.md

Lines changed: 180 additions & 0 deletions
@@ -0,0 +1,180 @@
# Substack Archive Scraper

A Substack scraper and archive exporter that turns single-author Substack
publications into Markdown source files for wiki ingestion.

It is intended as a local Substack to Markdown downloader for researchers,
operators, and publication owners who need reproducible source archives.

The tool is built around a strict source contract: every article, author reply,
accepted PDF, and accepted transcript gets a manifest row, a deterministic source
file, provenance metadata, and validation logs. It preserves source text; it does
not summarize, paraphrase, atomize, or build the wiki itself.

## Safety Model

- Use this only for publications and paid content you are allowed to access.
- Paid Substacks are handled through a local browser login that exports a
  Playwright storage-state file outside the repo.
- The scraper does not bypass paywalls, evade bot detection, or hide what it is.
- The HTTP client uses a configured contact in its `User-Agent`, respects
  `robots.txt`, rate-limits requests, and logs access caveats.
- Generated output can contain paid/private text. Keep output roots and session
  files outside the repository.
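
The `robots.txt` and rate-limit behaviour described above could look roughly like the following. This is a minimal sketch, not the scraper's actual HTTP client: the `PoliteGate` class, its method names, and the injectable `clock`/`sleep` parameters are assumptions made for illustration. Only `urllib.robotparser` from the standard library is used.

```python
import time
from urllib.robotparser import RobotFileParser


class PoliteGate:
    """Hypothetical helper: decides whether and when a URL may be fetched."""

    def __init__(self, robots_txt, user_agent, max_requests_per_second,
                 clock=time.monotonic, sleep=time.sleep):
        self.user_agent = user_agent
        # Minimum spacing between two requests, in seconds.
        self.min_interval = 1.0 / max_requests_per_second
        self._clock = clock
        self._sleep = sleep
        self._last_request = None
        self._robots = RobotFileParser()
        # Parse robots.txt text that was fetched elsewhere.
        self._robots.parse(robots_txt.splitlines())

    def allowed(self, url):
        """True when robots.txt permits this user agent to fetch the URL."""
        return self._robots.can_fetch(self.user_agent, url)

    def wait_turn(self):
        """Sleep just long enough to stay under the configured request rate."""
        now = self._clock()
        if self._last_request is not None:
            remaining = self.min_interval - (now - self._last_request)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last_request = now
```

A caller would check `allowed(url)` first, log and skip the URL when it returns `False`, and call `wait_turn()` immediately before each request.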

## Install

```bash
uv sync --dev
uv run playwright install chromium
```

Without `uv`:

```bash
python3 -m venv .venv
.venv/bin/python -m pip install -e ".[dev]"
.venv/bin/python -m playwright install chromium
```

## Quickstart

Create a config from the generic template:

```bash
cp config/public.example.yml config/my-publication.yml
```

Edit `config/my-publication.yml`:

- `target.base_url`
- `target.publication_name`
- `target.author.canonical_name`
- `target.author.stable_id`
- `output.root`
- `operator.user_agent_contact`

The author stable ID is required for comment disambiguation. Display-name
matching is not safe enough for author replies.
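
To illustrate why the stable ID matters, here is a small sketch of ID-based matching. The `Comment` shape and its field names are hypothetical; the real comment payload is not shown in this commit.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Comment:
    user_id: str       # hypothetical field holding the commenter's stable ID
    display_name: str
    body: str


def is_author_reply(comment, author_stable_id):
    """Match on the stable ID only; display names can be shared or changed."""
    return comment.user_id == author_stable_id
```

Two commenters can both appear as "Example Author", but only the one whose `user_id` equals the configured `target.author.stable_id` is treated as the author.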

Run preflight and discovery:

```bash
uv run substack-archive-scraper preflight --config config/my-publication.yml
uv run substack-archive-scraper discover --config config/my-publication.yml --limit 10
```

Scrape and validate:

```bash
uv run substack-archive-scraper scrape --config config/my-publication.yml
uv run substack-archive-scraper validate \
  --config config/my-publication.yml \
  --output-root /absolute/path/to/output
```

## Paid Substacks

Start from the authenticated template:

```bash
cp config/authenticated.example.yml config/my-paid-publication.yml
```

Set `auth.cookie_file` to a path outside the repo, then capture a session:

```bash
uv run substack-archive-scraper login --config config/my-paid-publication.yml
```

A headed Chromium window opens. Log in normally, return to the terminal, and
press Enter. The scraper stores Playwright storage state at `auth.cookie_file`.

Credentialed scrapes require `auth.known_paid_post_url` so the scraper can prove
that authenticated article hydration is working before it captures paid content.
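
One way to picture the hydration check is a comparison between the public and the authenticated view of the known paid post. The sketch below is an assumption about the idea, not the scraper's implementation: the function name, the length-based heuristic, and the `min_extra_chars` threshold are all hypothetical.

```python
def hydration_self_test(public_body, authenticated_body, min_extra_chars=200):
    """Return True when the authenticated fetch shows clearly more text.

    Hypothetical heuristic: a paid post seen without a session is truncated,
    so a working session should yield a noticeably longer body.
    """
    return len(authenticated_body) - len(public_body) >= min_extra_chars
```

If this returned `False`, a run would stop before capturing anything, since the session is probably expired or lacks access to the paid post.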

## Output Contract

```text
<output-root>/
raw/
articles/<YYYY>/YYYY-MM-DD-<slug>.md
pdfs/<descriptive-slug>.md
comments/YYYY-MM-DD-<article-slug>-<reply-seq>.md
transcripts/YYYY-MM-DD-<episode-slug>.md
_manifests/
source_manifest.jsonl
source_relationships.jsonl
content_duplicates.jsonl
scrape_report.md
voice_candidates.md
scrape_manifest.yml
scrape_logs/<run-id>/
*.log
```

The manifest is canonical. Source-file frontmatter is a recovery mirror only.
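
The article pattern `articles/<YYYY>/YYYY-MM-DD-<slug>.md` is deterministic: the same publication date and slug always map to the same file. A minimal sketch of such a path builder follows. The function name is hypothetical, and placing `articles/` under `raw/` is an assumption read from the listing above.

```python
from datetime import date
from pathlib import PurePosixPath


def article_path(published, slug):
    """Build the relative source path for one article (hypothetical helper).

    Assumes articles live under raw/articles/<YYYY>/ below the output root.
    """
    name = f"{published.isoformat()}-{slug}.md"
    return PurePosixPath("raw", "articles", f"{published.year:04d}", name)
```

For example, `article_path(date(2024, 3, 5), "hello-world")` gives `raw/articles/2024/2024-03-05-hello-world.md`.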

## Completeness Policy

By default the scraper is completeness-first:

- Keep every discovered single-author article.
- Keep every confirmed author reply the authenticated session can see.
- Include partially paywalled articles with `paywall_truncation` warnings.
- Include confirmed author replies whose parent comment is hidden by the API
  with `comment_parent_context_unavailable`.
- Log comment-access gaps instead of silently omitting them.

Use `--exclude-partial-paywalled` only when you intentionally want a stricter
complete-body corpus.
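
The article part of this policy can be read as a small decision function. The sketch below is illustrative only: the function name and the return shape are hypothetical. The warning name and the flag come from the text above.

```python
def decide_article(is_truncated, exclude_partial_paywalled=False):
    """Return (keep, warnings) for one discovered article (hypothetical).

    Default behaviour is completeness-first: truncated articles are kept
    and flagged rather than dropped.
    """
    if not is_truncated:
        return True, []
    if exclude_partial_paywalled:
        # Stricter corpus: drop articles whose body is cut off by the paywall.
        return False, []
    return True, ["paywall_truncation"]
```

With the defaults, a truncated article is still written out and carries the `paywall_truncation` warning; only the opt-in flag removes it.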

## Cache And Progress

Scrape runs use a persistent HTTP cache by default at:

```text
~/.cache/substack-archive-scraper/
```

Useful flags:

```bash
uv run substack-archive-scraper scrape --config config/my-publication.yml --progress-every 300
uv run substack-archive-scraper scrape --config config/my-publication.yml --refresh-cache
uv run substack-archive-scraper scrape --config config/my-publication.yml --no-cache
```

Progress output includes post counts, elapsed time, ETA, source counts, and cache
hit/miss/write counts.
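
An ETA of this kind is commonly a linear extrapolation from the posts processed so far. The sketch below shows that arithmetic. It is an assumption about how such a figure could be computed, not the scraper's actual formula, and the function name is hypothetical.

```python
def estimate_eta_seconds(done, total, elapsed_seconds):
    """Linear ETA: average time per finished post times posts remaining.

    Returns None until at least one post is done, because no rate exists yet.
    """
    if done <= 0:
        return None
    remaining = total - done
    return elapsed_seconds / done * remaining
```

For instance, 25 of 100 posts done after 50 seconds means 2 seconds per post and 75 posts left, so the estimate is 150 seconds.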

## Developer Commands

```bash
uv sync --dev
uv run ruff check src tests schemas
uv run pytest
uv run python -m json.tool schemas/source_manifest.schema.json >/dev/null
```

Equivalent shortcuts are available through `make`:

```bash
make install
make check
make test
```

The shorter `substack-ingest` command is kept as a compatibility alias.

## Documentation

- [Pipeline design](docs/pipeline-design.md)
- [Implementation plan](docs/implementation-plan.md)
- [Development guide](docs/development.md)
- [Security and publishing notes](docs/security.md)
- [Release checklist](docs/release-checklist.md)

## Publishing Status

This repository is prepared for later publication under [The Unlicense](LICENSE.md).

config/authenticated.example.yml

Lines changed: 46 additions & 0 deletions
@@ -0,0 +1,46 @@
# Authenticated scrape template for paid Substacks.
#
# Use only with accounts and paid content you are allowed to access. The
# storage-state file is created by `substack-archive-scraper login` and must stay outside
# this repository.

target:
  base_url: "https://example.substack.com/"
  publication_name: "Example Publication"
  author:
    canonical_name: "Example Author"
    stable_id: "REQUIRED_AUTHOR_STABLE_ID"
    confirmed_display_aliases: []

output:
  root: "/tmp/substack-archive-scraper-output/example-publication"

operator:
  user_agent_contact: "mailto:operator@example.com"
  max_requests_per_second: 2

auth:
  mode: "cookie_file"
  cookie_file: "/tmp/substack-sessions/example-publication.storage-state.json"
  # Pick one paid post that this account can lawfully access. The scraper uses
  # it for an authenticated-vs-public hydration self-test.
  known_paid_post_url: "https://example.substack.com/p/paid-post-slug"
  known_subscriber_comments_article_url: null
  debug_cache_raw_payloads: false

date_range:
  start: null
  end: null

resume:
  resume_token: null

validation:
  recipe_compatibility_target: "wiki-recipe-v6"
  fail_on_missing_author_stable_id: true
  fail_on_hydration_self_test: true
  fail_on_comment_stable_id_missing: true
  fail_on_wrong_speaker_attribution: true
  fail_on_uncontrolled_quality_warning: true
  fail_on_type_specific_field_leakage: true
  fail_on_unconfirmed_author_display_name_variance: true
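
A pre-run check over a config like this one might look as follows. This is a sketch under stated assumptions: it takes the already-parsed config as a plain `dict`, and the function name and message wording are hypothetical. The rules themselves come from the README: the author stable ID is required, and credentialed scrapes need `auth.known_paid_post_url`.

```python
PLACEHOLDER_ID = "REQUIRED_AUTHOR_STABLE_ID"  # value used in the templates


def config_problems(config):
    """Return a list of human-readable problems found in a parsed config."""
    problems = []
    author = config.get("target", {}).get("author", {})
    stable_id = author.get("stable_id")
    if not stable_id or stable_id == PLACEHOLDER_ID:
        problems.append("target.author.stable_id is not set")
    auth = config.get("auth", {})
    if auth.get("mode") == "cookie_file":
        # Credentialed runs need both the session file and a known paid post.
        for key in ("cookie_file", "known_paid_post_url"):
            if not auth.get(key):
                problems.append(
                    f"auth.{key} is required when auth.mode is cookie_file"
                )
    return problems
```

Run against the template as shipped, this reports the placeholder stable ID, which is a reminder to replace it before the first scrape.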

config/public.example.yml

Lines changed: 44 additions & 0 deletions
@@ -0,0 +1,44 @@
# Public-only scrape template.
#
# Copy to a private path such as config/my-publication.yml and fill in target
# metadata before running. The author stable ID is still required if comments
# are enabled.

target:
  base_url: "https://example.substack.com/"
  publication_name: "Example Publication"
  author:
    canonical_name: "Example Author"
    stable_id: "REQUIRED_AUTHOR_STABLE_ID"
    confirmed_display_aliases: []

output:
  root: "/tmp/substack-archive-scraper-output/example-publication"

operator:
  user_agent_contact: "mailto:operator@example.com"
  max_requests_per_second: 2

auth:
  mode: "none"
  cookie_file: null
  known_paid_post_url: null
  known_subscriber_comments_article_url: null
  debug_cache_raw_payloads: false

date_range:
  start: null
  end: null

resume:
  resume_token: null

validation:
  recipe_compatibility_target: "wiki-recipe-v6"
  fail_on_missing_author_stable_id: true
  fail_on_hydration_self_test: true
  fail_on_comment_stable_id_missing: true
  fail_on_wrong_speaker_attribution: true
  fail_on_uncontrolled_quality_warning: true
  fail_on_type_specific_field_leakage: true
  fail_on_unconfirmed_author_display_name_variance: true
