Skip to content

Commit d512f1e

Browse files
authored
Modularise URL retrieval with Cloudflare Browser Rendering support (#73)
* feat: create Retriever interface and HTTPRetriever implementation Extract URL fetching abstraction from the inline HTTP logic in extractWithRules. Defines Retriever interface, RetrieveResult struct, and HTTPRetriever with Safari user-agent, redirect following, and timeout support. Includes moq generate directive and comprehensive tests. * feat: implement CloudflareRetriever for Browser Rendering API * feat: wire Retriever interface into UReadability extraction pipeline * feat: add CLI flags and wire Cloudflare retriever in main.go * feat: generate Retriever mock, run gofmt and linter Generate moq mock for Retriever interface as a test-only file (retriever_mock_test.go) instead of mocks/ subpackage to avoid import cycle (mocks/retriever.go would import extractor, cycling with readability_test.go). Run gofmt on all modified files, zero lint issues. * feat: verify acceptance criteria for Retriever interface * feat: update documentation for Retriever interface and CLI flags * fix: address code review findings - fix err shadowing in deferred Body.Close() in both retrievers (use closeErr) - handle Cloudflare API success=false response explicitly instead of treating JSON error as HTML - truncate CF API error body to 512 bytes in error messages - add comment documenting CF retriever URL limitation (no final URL after JS redirects) - fix pre-existing %b format verb in text.go logging (should be %v) - replace network-dependent TestCloudflareRetriever_DefaultBaseURL with local httptest - add TestCloudflareRetriever_SuccessFalse for the new success=false handling - add TestExtractWithCustomRetriever integration test using RetrieverMock - remove duplicate plan file from docs/plans/ (already in completed/) - update README.md with new CF CLI flags and feature description - update CLAUDE.md CI bullet to reflect split docker.yml workflow * fix: address code review findings * fix: address code review findings * fix: address codex review findings * fix: address code review findings * fix: address code review findings * fix: address code review findings * fix: cache default retriever, add defensive timeouts, extract constants * fix: revert token auth addition to POST /api/extract POST /api/extract never had token auth in the original code. The checkToken refactoring should only apply to the legacy /content/v1/parser endpoint which always had it. * docs: add OpenAI auto-extraction improvement plan * docs: remove unrelated OpenAI auto-extraction plan * feat: add per-rule and global Cloudflare routing, 429 retries HTTP retriever stays the default. Cloudflare is now opt-in at two levels: - per-rule: new Rule.UseCloudflare field (checkbox in rule editor UI) routes requests for that domain through Cloudflare Browser Rendering - global: --cf-route-all / CF_ROUTE_ALL flag (default false) routes every request through Cloudflare UReadability.pickRetriever(rule) picks: CFRouteAll > rule.UseCloudflare > default HTTP. extractWithRules now resolves the rule once upfront and shares it between routing and getContent (was looked up twice). CloudflareRetriever retries on HTTP 429 with exponential backoff (base 11s, max 2 retries by default → worst-case 33s of backoff), honours Retry-After header, and aborts immediately on caller context cancel. MaxRetries=-1 disables retries. Added WriteTimeout=150s on the HTTP server — was previously unset, allowing handlers to run forever. 150s covers the worst-case CF path (up to ~123s). * fix: address review feedback on Cloudflare routing PR - TestGetContentCustom: pass the rule directly to getContent so it actually exercises the custom-rule path; the RulesMock.GetFunc setup was dead code after getContent stopped looking up rules - CloudflareRetriever.MaxRetries: remove default substitution for the zero value — 0 now means "no retries" as expected. Callers opt into retries by setting MaxRetries explicitly; main.go uses the exported CFDefaultMaxRetries constant (2) - README: add cf-route-all to the config table and rewrite the Cloudflare section to reflect the opt-in routing model + 429 retry behaviour - rest.Server.Run: expand the WriteTimeout comment to explain why the 150s ceiling is server-wide rather than per-route via http.TimeoutHandler
1 parent be2db1f commit d512f1e

14 files changed

Lines changed: 1279 additions & 93 deletions

File tree

CLAUDE.md

Lines changed: 10 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -17,6 +17,10 @@ golangci-lint run --max-issues-per-linter=0 --max-same-issues=0 # lint from rep
1717

1818
The `revision` variable in `main.go` is injected at build time: `-ldflags "-X main.revision=<version>"`.
1919

20+
Optional Cloudflare Browser Rendering flags (when both are set, uses `CloudflareRetriever` instead of default HTTP):
21+
- `--cf-account-id` / `CF_ACCOUNT_ID` — Cloudflare account ID
22+
- `--cf-api-token` / `CF_API_TOKEN` — Cloudflare API token with Browser Rendering Edit permission
23+
2024
`main_test.go` is gated behind `ENABLE_MONGO_TESTS=true` and needs MongoDB on localhost:27017. All other packages test independently — `datastore/` spins up MongoDB via testcontainers automatically.
2125

2226
## Architecture
@@ -32,11 +36,13 @@ web/ → Go HTML templates (HTMX v2), static assets
3236

3337
**Dependency flow:** `main → datastore, extractor, rest`; `rest → datastore, extractor`; `extractor → datastore` (Rule type + Rules interface).
3438

35-
**Key interface**`extractor.Rules` (defined consumer-side in `extractor/readability.go`), implemented by `datastore.RulesDAO`. Mock generated with `//go:generate moq` in extractor package.
39+
**Key interfaces:**
40+
- `extractor.Rules` (defined consumer-side in `extractor/readability.go`), implemented by `datastore.RulesDAO`. Mock generated with `//go:generate moq` in extractor package.
41+
- `extractor.Retriever` (defined in `extractor/retriever.go`) — abstracts URL content fetching. Two implementations: `HTTPRetriever` (default, standard HTTP GET with Safari user-agent) and `CloudflareRetriever` (Cloudflare Browser Rendering API for JS-rendered pages). When `UReadability.Retriever` is nil, defaults to `HTTPRetriever`.
3642

3743
## Content Extraction Flow
3844

39-
1. Fetch URL (30s timeout, Safari user-agent, follows redirects)
45+
1. Fetch URL via `Retriever` interface (default: HTTP GET with 30s timeout, Safari user-agent, follows redirects; optional: Cloudflare Browser Rendering for JS-heavy sites)
4046
2. Detect charset from Content-Type header and `<meta>` tags, convert to UTF-8
4147
3. Look up custom CSS selector rule from MongoDB by domain
4248
4. If rule found → extract via goquery CSS selector; if fails → fall back to general parser
@@ -47,7 +53,7 @@ web/ → Go HTML templates (HTMX v2), static assets
4753

4854
- Rule upsert is keyed on `domain` — one rule per domain. Rules are disabled (`enabled: false`), never deleted.
4955
- `rest.Server.Readability` is `extractor.UReadability` by value (not pointer), with `Rules` interface field inside.
50-
- Protected API routes use custom `basicAuth` middleware with constant-time comparison.
56+
- `/api/content/v1/parser` requires the `token` query parameter when configured. Token comparison uses `subtle.ConstantTimeCompare`. Protected rule management routes use custom `basicAuth` middleware with constant-time comparison.
5157
- Web UI text is in Russian — tests assert on Russian strings, don't change them.
5258
- Middleware stack: Recoverer → RealIP → AppInfo+Ping → Throttle(50) → Logger.
53-
- CI runs tests and lint in the `build` job; Docker build only compiles (no tests inside Docker).
59+
- CI: `ci.yml` runs tests and lint in the `build` job (MongoDB via service container); `docker.yml` builds Docker images via `workflow_run` trigger after `build` succeeds (no tests inside Docker).

README.md

Lines changed: 14 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -12,16 +12,28 @@
1212
| port | UKEEPER_PORT | `8080` | web server port |
1313
| mongo-uri | MONGO_URI | none | MongoDB connection string, _required_ |
1414
| frontend-dir | FRONTEND_DIR | `/srv/web` | directory with frontend files |
15-
| token | TOKEN | none | token for /content/v1/parser endpoint auth |
15+
| token | UKEEPER_TOKEN | none | token for API endpoint auth |
1616
| mongo-delay | MONGO_DELAY | `0` | mongo initial delay |
1717
| mongo-db | MONGO_DB | `ureadability` | mongo database name |
1818
| creds | CREDS | none | credentials for protected calls (POST, DELETE /rules) |
19+
| cf-account-id| CF_ACCOUNT_ID | none | Cloudflare account ID for Browser Rendering API |
20+
| cf-api-token | CF_API_TOKEN | none | Cloudflare API token with Browser Rendering Edit perm |
21+
| cf-route-all | CF_ROUTE_ALL | `false` | route every request through Cloudflare Browser Rendering |
1922
| dbg | DEBUG | `false` | debug mode |
2023

24+
### Cloudflare Browser Rendering (optional)
25+
26+
Cloudflare Browser Rendering is useful for JavaScript-heavy pages and sites behind a "please enable JS" wall, but it's slower than direct HTTP and the free tier throttles at 1 request per 10 seconds. To keep the service cost-effective, Cloudflare routing is **opt-in**.
27+
28+
1. Set `--cf-account-id` and `--cf-api-token` to enable Cloudflare routing.
29+
2. Either flip the `use_cloudflare` checkbox on individual rules (per-domain, recommended) or set `--cf-route-all=true` to route every request through Cloudflare.
30+
31+
When Cloudflare credentials are not set, the service uses a standard HTTP client for everything (default). On HTTP 429 (rate limit) the service automatically retries with exponential backoff and respects the `Retry-After` header.
32+
2133
### API
2234

2335
GET /api/content/v1/parser?token=secret&url=http://aa.com/blah - extract content (emulate Readability API parse call)
24-
POST /api/v1/extract {url: http://aa.com/blah} - extract content
36+
POST /api/extract {url: http://aa.com/blah} - extract content
2537

2638
## Development
2739

datastore/rules.go

Lines changed: 11 additions & 10 deletions
Original file line numberDiff line numberDiff line change
@@ -18,16 +18,17 @@ type RulesDAO struct {
1818

1919
// Rule record, entry in mongo
2020
type Rule struct {
21-
ID bson.ObjectID `json:"id" bson:"_id,omitempty"`
22-
Domain string `json:"domain"`
23-
MatchURLs []string `json:"match_url,omitempty" bson:"match_urls,omitempty"`
24-
Content string `json:"content"`
25-
Author string `json:"author,omitempty" bson:"author,omitempty"`
26-
TS string `json:"ts,omitempty" bson:"ts,omitempty"` // ts of original article
27-
Excludes []string `json:"excludes,omitempty" bson:"excludes,omitempty"`
28-
TestURLs []string `json:"test_urls,omitempty" bson:"test_urls"`
29-
User string `json:"user"`
30-
Enabled bool `json:"enabled"`
21+
ID bson.ObjectID `json:"id" bson:"_id,omitempty"`
22+
Domain string `json:"domain"`
23+
MatchURLs []string `json:"match_url,omitempty" bson:"match_urls,omitempty"`
24+
Content string `json:"content"`
25+
Author string `json:"author,omitempty" bson:"author,omitempty"`
26+
TS string `json:"ts,omitempty" bson:"ts,omitempty"` // ts of original article
27+
Excludes []string `json:"excludes,omitempty" bson:"excludes,omitempty"`
28+
TestURLs []string `json:"test_urls,omitempty" bson:"test_urls"`
29+
User string `json:"user"`
30+
Enabled bool `json:"enabled"`
31+
UseCloudflare bool `json:"use_cloudflare,omitempty" bson:"use_cloudflare,omitempty"` // route fetch via Cloudflare Browser Rendering
3132
}
3233

3334
// Get rule by url. Checks if found in mongo, matching by domain
Lines changed: 169 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,169 @@
1+
# Modularise URL Retrieval with Cloudflare Browser Rendering
2+
3+
## Overview
4+
- Extract the hardcoded HTTP fetch logic from `extractWithRules` into a `Retriever` interface so different content retrieval backends can be swapped in
5+
- Add `CloudflareRetriever` using Cloudflare Browser Rendering `/content` endpoint — returns fully rendered HTML after JS execution, handling sites behind bot protection or requiring JS rendering
6+
- Addresses PR review feedback on radio-t/super-bot#156: the content extraction improvement belongs in ukeeper-readability (the extraction layer), not in super-bot (the consumer)
7+
8+
## Context (from discovery)
9+
- **fetch logic**: embedded in `extractor/readability.go:extractWithRules` (lines 80-106) — creates `http.Client` inline, Safari user-agent, `io.ReadAll`, no abstraction
10+
- **only existing interface**: `extractor.Rules` for datastore access; no fetcher interface exists
11+
- **pipeline after fetch**: `toUtf8``getContent` (rules/readability) → title → `getText``normalizeLinks``getSnippet``extractPics` — all expects HTML input, stays unchanged
12+
- **`normalizeLinks`** takes `*http.Request` but only uses `.URL` — will simplify to `*url.URL`
13+
- **wiring**: `main.go` creates `extractor.UReadability` struct with `TimeOut`, `SnippetSize`, `Rules` fields; `rest.Server` holds it as a concrete struct
14+
- **CF `/content` endpoint**: `POST /accounts/{id}/browser-rendering/content` with `{"url": "..."}`, Bearer token auth, returns rendered HTML
15+
16+
## Development Approach
17+
- **testing approach**: Regular (code first, then tests)
18+
- complete each task fully before moving to the next
19+
- make small, focused changes
20+
- **CRITICAL: every task MUST include new/updated tests** for code changes in that task
21+
- **CRITICAL: all tests must pass before starting next task**
22+
- **CRITICAL: update this plan file when scope changes during implementation**
23+
- run tests after each change
24+
- maintain backward compatibility — existing code creating `UReadability{}` without `Retriever` must continue to work (nil defaults to HTTP fetch)
25+
26+
## Testing Strategy
27+
- **unit tests**: httptest mock servers for both retrievers, table-driven tests, testify assertions (matching existing patterns)
28+
- **mock generation**: moq-generated mock for `Retriever` interface (same pattern as `Rules` mock)
29+
- **integration**: existing `readability_test.go` tests must pass unchanged (they create `UReadability` without `Retriever`)
30+
31+
## Progress Tracking
32+
- mark completed items with `[x]` immediately when done
33+
- add newly discovered tasks with + prefix
34+
- document issues/blockers with warning prefix
35+
- update plan if implementation deviates from original scope
36+
37+
## Implementation Steps
38+
39+
### Task 1: Create Retriever interface and HTTPRetriever
40+
41+
**Files:**
42+
- Create: `extractor/retriever.go`
43+
- Create: `extractor/retriever_test.go`
44+
45+
- [x] define `Retriever` interface with `Retrieve(ctx context.Context, url string) (*RetrieveResult, error)` method
46+
- [x] define `RetrieveResult` struct with `Body []byte`, `URL string`, `Header http.Header`
47+
- [x] implement `HTTPRetriever` struct extracting the current fetch logic from `extractWithRules` (HTTP client, Safari user-agent, redirect following, body reading)
48+
- [x] add `//go:generate moq` directive for `Retriever` interface
49+
- [x] write tests for `HTTPRetriever`: successful fetch, redirect following, user-agent header, error cases (bad URL, connection refused)
50+
- [x] run tests — must pass before next task
51+
52+
### Task 2: Implement CloudflareRetriever
53+
54+
**Files:**
55+
- Modify: `extractor/retriever.go`
56+
- Modify: `extractor/retriever_test.go`
57+
58+
- [x] implement `CloudflareRetriever` struct with `AccountID`, `APIToken`, `BaseURL` (for test override), `Timeout` fields
59+
- [x] implement `Retrieve` method: POST to `/accounts/{id}/browser-rendering/content` with `{"url": "...", "gotoOptions": {"waitUntil": "networkidle0"}}`, Bearer token auth
60+
- [x] handle response: try JSON `{"success": true, "result": "<html>"}` first, fall back to raw body; set `Content-Type: text/html; charset=utf-8` header
61+
- [x] write tests: successful fetch (mock CF API), API error (non-200 status), JSON response format, raw HTML response format
62+
- [x] run tests — must pass before next task
63+
64+
### Task 3: Wire Retriever into UReadability
65+
66+
**Files:**
67+
- Modify: `extractor/readability.go`
68+
- Modify: `extractor/readability_test.go`
69+
70+
- [x] add `Retriever Retriever` field to `UReadability` struct
71+
- [x] add `retriever()` helper method: returns `f.Retriever` if non-nil, otherwise `&HTTPRetriever{Timeout: f.TimeOut}`
72+
- [x] replace inline HTTP fetch in `extractWithRules` (lines 80-106) with `f.retriever().Retrieve(ctx, reqURL)` call
73+
- [x] use `result.URL`, `result.Body`, `result.Header` instead of `resp.Request.URL`, `io.ReadAll(resp.Body)`, `resp.Header`
74+
- [x] change `normalizeLinks` signature from `*http.Request` to `*url.URL` (only `.URL` field is used); update caller to pass parsed URL
75+
- [x] remove unused imports from `readability.go` (`io`, `net/http`)
76+
- [x] update `TestNormalizeLinks` and `TestNormalizeLinksIssue` to pass `*url.URL` instead of `&http.Request{URL: u}`
77+
- [x] verify all existing tests pass unchanged (tests create `UReadability` without `Retriever` — nil defaults to HTTPRetriever)
78+
- [x] run full test suite: `go test -timeout=60s -race ./...`
79+
80+
### Task 4: Add CLI flags and wiring in main.go
81+
82+
**Files:**
83+
- Modify: `main.go`
84+
85+
- [x] add `CFAccountID string` and `CFAPIToken string` fields to opts struct with `long`/`env` tags
86+
- [x] in `main()`, create `CloudflareRetriever` when both flags are set; log which retriever is active
87+
- [x] pass retriever to `UReadability` struct
88+
- [x] run full test suite: `go test -timeout=60s -race ./...`
89+
90+
### Task 5: Generate mock and run linter
91+
92+
**Files:**
93+
- Create: `extractor/retriever_mock_test.go` (generated, test-only — placed in extractor package to avoid import cycle with mocks/)
94+
95+
- [x] run `go generate ./extractor/...` to generate `Retriever` mock
96+
- [x] run `gofmt -w` on all modified files
97+
- [x] run `golangci-lint run --max-issues-per-linter=0 --max-same-issues=0`
98+
- [x] fix any lint issues
99+
100+
### Task 6: Verify acceptance criteria
101+
102+
- [x] verify `UReadability{}` without `Retriever` field works (backward compatible)
103+
- [x] verify `UReadability{Retriever: &CloudflareRetriever{...}}` works
104+
- [x] verify all existing tests pass: `go test -timeout=60s -race ./...`
105+
- [x] verify mock is generated and up to date
106+
107+
### Task 7: [Final] Update documentation
108+
109+
- [x] update CLAUDE.md architecture section to mention `Retriever` interface
110+
- [x] update CLAUDE.md build section with new CLI flags
111+
- [x] move this plan to `docs/plans/completed/`
112+
113+
## Technical Details
114+
115+
### Retriever interface
116+
117+
```go
118+
type Retriever interface {
119+
Retrieve(ctx context.Context, url string) (*RetrieveResult, error)
120+
}
121+
122+
type RetrieveResult struct {
123+
Body []byte // raw page content (HTML)
124+
URL string // final URL after redirects
125+
Header http.Header // response headers (for charset detection)
126+
}
127+
```
128+
129+
### CloudflareRetriever request/response
130+
131+
```
132+
POST https://api.cloudflare.com/client/v4/accounts/{account_id}/browser-rendering/content
133+
Authorization: Bearer {api_token}
134+
Content-Type: application/json
135+
136+
{"url": "https://example.com", "gotoOptions": {"waitUntil": "networkidle0"}}
137+
```
138+
139+
Response: fully rendered HTML (may be JSON-wrapped `{"success": true, "result": "<html>"}` or raw HTML).
140+
141+
### CLI flags
142+
143+
| Flag | Env | Description |
144+
|------|-----|-------------|
145+
| `--cf-account-id` | `CF_ACCOUNT_ID` | Cloudflare account ID for Browser Rendering API |
146+
| `--cf-api-token` | `CF_API_TOKEN` | Cloudflare API token with Browser Rendering Edit permission |
147+
148+
When both are set → `CloudflareRetriever`; otherwise → `HTTPRetriever` (default).
149+
150+
### Pipeline flow (unchanged)
151+
152+
```
153+
Retriever.Retrieve(url) → toUtf8 → getContent (rules/readability) → title
154+
↑ NEW │ │
155+
│ ↓ ↓
156+
HTTPRetriever (default) getText → normalizeLinks → getSnippet → extractPics
157+
CloudflareRetriever (opt)
158+
```
159+
160+
## Post-Completion
161+
162+
**External system updates:**
163+
- super-bot deployment: add `CF_ACCOUNT_ID` and `CF_API_TOKEN` env vars to ukeeper-readability deployment config when switching to Cloudflare retrieval
164+
- Cloudflare setup: create API token with "Browser Rendering - Edit" permission under the target account
165+
- radio-t/super-bot#156: can be closed once this is deployed — super-bot continues using the existing `uKeeperGetter` interface unchanged
166+
167+
**Manual verification:**
168+
- test against real Cloudflare Browser Rendering API with known problematic URLs (sites returning "just a moment..." to direct HTTP)
169+
- verify free tier limits are acceptable (10 min/day browser time, 1 req/10 sec rate limit)

0 commit comments

Comments
 (0)