Commit d512f1e
Modularise URL retrieval with Cloudflare Browser Rendering support (#73)
* feat: create Retriever interface and HTTPRetriever implementation
Extract URL fetching abstraction from the inline HTTP logic in extractWithRules.
Defines Retriever interface, RetrieveResult struct, and HTTPRetriever with
Safari user-agent, redirect following, and timeout support. Includes moq
generate directive and comprehensive tests.
* feat: implement CloudflareRetriever for Browser Rendering API
* feat: wire Retriever interface into UReadability extraction pipeline
* feat: add CLI flags and wire Cloudflare retriever in main.go
* feat: generate Retriever mock, run gofmt and linter
Generate moq mock for Retriever interface as a test-only file
(retriever_mock_test.go) instead of mocks/ subpackage to avoid
import cycle (mocks/retriever.go would import extractor, cycling
with readability_test.go). Run gofmt on all modified files, zero
lint issues.
* feat: verify acceptance criteria for Retriever interface
* feat: update documentation for Retriever interface and CLI flags
* fix: address code review findings
- fix err shadowing in deferred Body.Close() in both retrievers (use closeErr)
- handle Cloudflare API success=false response explicitly instead of treating JSON error as HTML
- truncate CF API error body to 512 bytes in error messages
- add comment documenting CF retriever URL limitation (no final URL after JS redirects)
- fix pre-existing %b format verb in text.go logging (should be %v)
- replace network-dependent TestCloudflareRetriever_DefaultBaseURL with local httptest
- add TestCloudflareRetriever_SuccessFalse for the new success=false handling
- add TestExtractWithCustomRetriever integration test using RetrieverMock
- remove duplicate plan file from docs/plans/ (already in completed/)
- update README.md with new CF CLI flags and feature description
- update CLAUDE.md CI bullet to reflect split docker.yml workflow
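The success=false handling and the 512-byte truncation could be sketched roughly as follows. The envelope shape (a JSON object with success and result fields) and the names cfEnvelope, parseCFResponse, and truncate are assumptions for illustration, not the code from this commit.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// maxErrBody caps how much of a response body is echoed into error messages.
const maxErrBody = 512

// cfEnvelope is an assumed shape of the API response.
type cfEnvelope struct {
	Success bool   `json:"success"`
	Result  string `json:"result"`
}

// truncate returns at most n bytes of b as a string, marking any cut.
// Cutting on a byte boundary may split a multi-byte rune, which is
// acceptable for a diagnostic message.
func truncate(b []byte, n int) string {
	if len(b) > n {
		return string(b[:n]) + "..."
	}
	return string(b)
}

// parseCFResponse returns the rendered HTML, or an error when the body is not
// valid JSON or the API reports success=false.
func parseCFResponse(body []byte) (string, error) {
	var env cfEnvelope
	if err := json.Unmarshal(body, &env); err != nil {
		return "", fmt.Errorf("decode response: %w (body: %s)", err, truncate(body, maxErrBody))
	}
	if !env.Success {
		return "", errors.New("cloudflare API reported success=false: " + truncate(body, maxErrBody))
	}
	return env.Result, nil
}

func main() {
	html, err := parseCFResponse([]byte(`{"success":true,"result":"<html></html>"}`))
	fmt.Println(html, err)

	_, err = parseCFResponse([]byte(`{"success":false}`))
	fmt.Println(err)
}
```

Checking the flag explicitly keeps a JSON error payload from being passed downstream as if it were page HTML.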
* fix: address code review findings
* fix: address code review findings
* fix: address codex review findings
* fix: address code review findings
* fix: address code review findings
* fix: address code review findings
* fix: cache default retriever, add defensive timeouts, extract constants
* fix: revert token auth addition to POST /api/extract
POST /api/extract never had token auth in the original code.
The checkToken refactoring should only apply to the legacy
/content/v1/parser endpoint which always had it.
* docs: add OpenAI auto-extraction improvement plan
* docs: remove unrelated OpenAI auto-extraction plan
* feat: add per-rule and global Cloudflare routing, 429 retries
HTTP retriever stays the default. Cloudflare is now opt-in at two levels:
- per-rule: new Rule.UseCloudflare field (checkbox in rule editor UI)
routes requests for that domain through Cloudflare Browser Rendering
- global: --cf-route-all / CF_ROUTE_ALL flag (default false) routes every
request through Cloudflare
UReadability.pickRetriever(rule) picks: CFRouteAll > rule.UseCloudflare >
default HTTP. extractWithRules now resolves the rule once upfront and
shares it between routing and getContent (was looked up twice).
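The routing precedence described above can be sketched as below. The struct layout, the simplified Retriever stand-in, and the fallback to HTTP when no Cloudflare retriever is configured are assumptions for illustration; only the names pickRetriever, CFRouteAll, and Rule.UseCloudflare come from the message.

```go
package main

import "fmt"

// Rule is a trimmed-down stand-in for the per-domain extraction rule.
type Rule struct {
	Domain        string
	UseCloudflare bool
}

// Retriever is simplified to a name here so the example stays self-contained.
type Retriever interface {
	Name() string
}

type named string

func (n named) Name() string { return string(n) }

// UReadability carries the routing configuration (field layout is assumed).
type UReadability struct {
	CFRouteAll bool
	HTTP       Retriever
	Cloudflare Retriever // nil when Cloudflare is not configured
}

// pickRetriever applies the precedence CFRouteAll > rule.UseCloudflare > HTTP.
// Falling back to HTTP when Cloudflare is nil is an assumption of this sketch.
func (u *UReadability) pickRetriever(rule *Rule) Retriever {
	wantCF := u.CFRouteAll || (rule != nil && rule.UseCloudflare)
	if wantCF && u.Cloudflare != nil {
		return u.Cloudflare
	}
	return u.HTTP
}

func main() {
	u := &UReadability{HTTP: named("http"), Cloudflare: named("cloudflare")}
	fmt.Println(u.pickRetriever(nil).Name())
	fmt.Println(u.pickRetriever(&Rule{Domain: "example.com", UseCloudflare: true}).Name())
	u.CFRouteAll = true
	fmt.Println(u.pickRetriever(&Rule{Domain: "example.org"}).Name())
}
```

Resolving the rule once and passing it in keeps routing and content extraction working from the same lookup.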
CloudflareRetriever retries on HTTP 429 with exponential backoff (base 11s,
max 2 retries by default → worst-case 33s of backoff), honours Retry-After
header, and aborts immediately on caller context cancel. MaxRetries=-1
disables retries.
Added WriteTimeout=150s on the HTTP server — was previously unset, allowing
handlers to run forever. 150s covers the worst-case CF path (up to ~123s).
* fix: address review feedback on Cloudflare routing PR
- TestGetContentCustom: pass the rule directly to getContent so it actually
exercises the custom-rule path; the RulesMock.GetFunc setup was dead code
after getContent stopped looking up rules
- CloudflareRetriever.MaxRetries: remove default substitution for the zero
value — 0 now means "no retries" as expected. Callers opt into retries by
setting MaxRetries explicitly; main.go uses the exported CFDefaultMaxRetries
constant (2)
- README: add cf-route-all to the config table and rewrite the Cloudflare
section to reflect the opt-in routing model + 429 retry behaviour
- rest.Server.Run: expand the WriteTimeout comment to explain why the 150s
ceiling is server-wide rather than per-route via http.TimeoutHandler
1 parent be2db1f commit d512f1e
14 files changed
Lines changed: 1279 additions & 93 deletions
File tree
- datastore
- docs/plans/completed
- extractor
- rest
- web/components