Commit 73f2c4e
Fix GHA crawl: remove 404 section seeds, add --full to weekly crawl
Section-level seed URLs like /search/10.3.2512 and /get-started/10.2
return HTTP 404 on help.splunk.com. They accumulated as 'failed' in
crawl_state and were re-attempted on every GHA run. Landing page BFS
already discovers all pages without them, so they are removed.
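
To illustrate why the section seeds were redundant, here is a minimal sketch of breadth-first discovery from a single landing page. The function names, the `LINKS` toy graph, and its paths are hypothetical and are not taken from the crawler in this repository; the sketch only shows that pages linked from the landing page are reached without listing them as seeds.

```python
from collections import deque

def bfs_discover(start, get_links):
    """Return every URL reachable from `start`, in breadth-first order."""
    seen = {start}
    order = []
    queue = deque([start])
    while queue:
        url = queue.popleft()
        order.append(url)
        for link in get_links(url):
            if link not in seen:  # enqueue each URL at most once
                seen.add(link)
                queue.append(link)
    return order

# Toy link graph standing in for the real site (hypothetical paths).
LINKS = {
    "/": ["/search", "/get-started"],
    "/search": ["/search/page-a"],
    "/get-started": ["/get-started/page-b"],
}

def get_links(url):
    # A URL with no entry (for example one that returns 404) yields no links.
    return LINKS.get(url, [])

discovered = bfs_discover("/", get_links)
```

With only `/` as a seed, `discovered` contains all five pages of the toy graph, so extra section-level seeds would add nothing.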
Add --full to the GHA crawl step so the weekly cron actually re-fetches
content. Without --full, all seeds are in crawl_state as 'fetched' and
every run prints 'Nothing to crawl' -- the index never updates.
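
The behaviour described above can be sketched as follows. This is an assumed shape for the seed-selection logic, not the repository's actual code: `select_urls` and the dict-based `crawl_state` are hypothetical, and only the 'fetched' and 'failed' status values come from the commit message.

```python
def select_urls(seeds, crawl_state, full=False):
    """Pick which seed URLs to fetch on this run."""
    if full:
        # --full: ignore recorded state and re-fetch everything.
        return list(seeds)
    # Incremental run: skip anything already marked 'fetched'.
    # URLs marked 'failed' are still selected, so they get retried every run.
    return [url for url in seeds if crawl_state.get(url) != "fetched"]

seeds = ["/a", "/b"]
crawl_state = {"/a": "fetched", "/b": "fetched"}

incremental = select_urls(seeds, crawl_state)        # nothing left to crawl
forced = select_urls(seeds, crawl_state, full=True)  # every seed again
```

Under these assumptions an incremental run over fully fetched seeds selects nothing, which matches the 'Nothing to crawl' symptom, while `full=True` selects every seed.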
--full re-fetches all pages; the content hash check skips re-embedding
unchanged pages.
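
A content hash check of this kind might look like the sketch below. The use of SHA-256, the function names, and the in-memory `stored` dict are assumptions made for illustration; the commit message does not say which hash or storage the project uses.

```python
import hashlib

def content_hash(text):
    """Hex digest of the page text (SHA-256 is an assumption here)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def pages_to_embed(fetched, stored_hashes):
    """Return URLs whose content changed, updating stored_hashes in place."""
    changed = []
    for url, text in fetched.items():
        digest = content_hash(text)
        if stored_hashes.get(url) != digest:
            stored_hashes[url] = digest
            changed.append(url)  # new or modified page: needs re-embedding
    return changed

stored = {"/a": content_hash("old text")}
fetched = {"/a": "old text", "/b": "new page"}
changed = pages_to_embed(fetched, stored)
```

Here `/a` is re-fetched but skipped for embedding because its hash is unchanged, and only `/b` is returned.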
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
1 parent d9e5e6e commit 73f2c4e
2 files changed
Lines changed: 11 additions & 22 deletions