Commit 73f2c4e

jwindley and claude committed
Fix GHA crawl: remove 404 section seeds, add --full to weekly crawl
Section-level seed URLs like /search/10.3.2512 and /get-started/10.2 return HTTP 404 on help.splunk.com. They accumulated as 'failed' in crawl_state and were re-attempted on every GHA run. Landing page BFS already discovers all pages without them, so they are removed.

Add --full to the GHA crawl step so the weekly cron actually re-fetches content. Without --full, all seeds are in crawl_state as 'fetched' and every run prints 'Nothing to crawl' -- the index never updates. --full re-fetches all pages; the content hash check skips re-embedding unchanged pages.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
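The "Nothing to crawl" behaviour described in the commit message can be illustrated with a minimal sketch. The crawler's real selection logic is not shown in this commit, so the function name `select_urls_to_crawl`, the dict-based `crawl_state`, and the example section URL are all assumptions made for illustration only.

```python
from typing import Dict, Iterable, List


def select_urls_to_crawl(
    seeds: Iterable[str],
    crawl_state: Dict[str, str],
    full: bool = False,
) -> List[str]:
    """Return the seed URLs a run would start from (hypothetical logic).

    crawl_state maps URL -> status, e.g. 'fetched' or 'failed'.
    """
    if full:
        # A full run ignores cached state and re-fetches every seed.
        return list(seeds)
    # An incremental run skips seeds already marked 'fetched' but retries
    # everything else, including seeds that previously 'failed'.
    return [url for url in seeds if crawl_state.get(url) != "fetched"]


# Hypothetical state after an earlier run: the landing page succeeded,
# a section-level seed returned 404 and was recorded as 'failed'.
LANDING = "https://help.splunk.com/en/splunk-enterprise/"
SECTION = "https://help.splunk.com/en/splunk-enterprise/search/10.2"
state = {LANDING: "fetched", SECTION: "failed"}

old_seeds = [LANDING, SECTION]
new_seeds = [LANDING]

retried = select_urls_to_crawl(old_seeds, state)           # dead seed retried
idle = select_urls_to_crawl(new_seeds, state)              # nothing to crawl
refetched = select_urls_to_crawl(new_seeds, state, full=True)
```

Under these assumptions, removing the dead seeds alone would leave the incremental run with an empty work list, which is why the commit pairs the seed cleanup with `--full`.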
1 parent d9e5e6e commit 73f2c4e

2 files changed

Lines changed: 11 additions & 22 deletions


.github/workflows/crawl-and-release.yml

Lines changed: 6 additions & 1 deletion
@@ -53,7 +53,12 @@ jobs:
             splunk-db-${{ matrix.source }}-
 
       - name: Crawl ${{ matrix.source }}
-        run: uv run splunk-crawl --sources ${{ matrix.source }} --db data/${{ matrix.source }}.db
+        # --full re-fetches all pages on every run so the index stays current.
+        # Pages whose raw HTML hasn't changed are skipped at the extract/embed
+        # stage (hash comparison), so only genuinely updated content is re-indexed.
+        # Without --full the cached crawl_state causes "Nothing to crawl" on every
+        # run after the first, meaning the index is never updated.
+        run: uv run splunk-crawl --sources ${{ matrix.source }} --db data/${{ matrix.source }}.db --full
 
       - name: Save per-source DB cache
         if: always()
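The hash comparison mentioned in the workflow comments could look roughly like the sketch below. It is not the project's actual code: the function name `needs_reindex`, the in-memory `stored_hashes` dict, and the choice of SHA-256 are assumptions, since the commit only states that pages with unchanged raw HTML are skipped at the extract/embed stage.

```python
import hashlib
from typing import Dict


def needs_reindex(url: str, raw_html: str, stored_hashes: Dict[str, str]) -> bool:
    """Return True when the page content differs from the last indexed copy.

    stored_hashes maps URL -> hex digest of the raw HTML last indexed
    (a hypothetical stand-in for whatever the crawler persists).
    """
    digest = hashlib.sha256(raw_html.encode("utf-8")).hexdigest()
    if stored_hashes.get(url) == digest:
        # Same bytes as last time: skip extraction and embedding.
        return False
    # New or changed page: remember the digest and re-index it.
    stored_hashes[url] = digest
    return True


hashes: Dict[str, str] = {}
page = "https://docs.example.com/page"  # hypothetical URL

first = needs_reindex(page, "<html>v1</html>", hashes)      # never seen
unchanged = needs_reindex(page, "<html>v1</html>", hashes)  # same content
changed = needs_reindex(page, "<html>v2</html>", hashes)    # edited page
```

With a check of this kind, a `--full` run pays the cost of re-fetching every page but only re-embeds the ones whose content actually changed.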

src/splunk_docs_mcp/config.py

Lines changed: 5 additions & 21 deletions
@@ -185,14 +185,11 @@ class CrawlSource:
         display_name="Splunk Enterprise 10.2",
         version="10.2",
         seed_urls=[
-            # Landing page is the primary reliable entry point
+            # Landing page only — BFS discovers all section pages from here.
+            # Section-level seeds like /{section}/10.2 return HTTP 404 on
+            # help.splunk.com and were removed to avoid accumulating dead URLs
+            # in crawl_state that get re-attempted on every run.
             "https://help.splunk.com/en/splunk-enterprise/",
-            # Section-level seeds — may redirect to a deeper page; the redirect
-            # bug fix (2026-04-18) ensures relative links resolve correctly.
-            *[
-                f"https://help.splunk.com/en/splunk-enterprise/{s}/10.2"
-                for s in _ENTERPRISE_SECTIONS
-            ],
         ],
         url_prefix="https://help.splunk.com/en/splunk-enterprise/",
         blocked_path_prefixes=_HELP_BLOCKED,
@@ -202,13 +199,8 @@ class CrawlSource:
         display_name="Splunk Cloud Platform 10.3.2512",
         version="10.3.2512",
         seed_urls=[
-            # Landing page is the primary reliable entry point
+            # Landing page only — same reasoning as splunk-enterprise above.
             "https://help.splunk.com/en/splunk-cloud-platform/",
-            # Section-level seeds
-            *[
-                f"https://help.splunk.com/en/splunk-cloud-platform/{s}/10.3.2512"
-                for s in _CLOUD_SECTIONS
-            ],
         ],
         url_prefix="https://help.splunk.com/en/splunk-cloud-platform/",
         blocked_path_prefixes=_HELP_BLOCKED,
@@ -219,10 +211,6 @@ class CrawlSource:
         version="10.1",
         seed_urls=[
             "https://help.splunk.com/en/splunk-enterprise/",
-            *[
-                f"https://help.splunk.com/en/splunk-enterprise/{s}/10.1"
-                for s in _ENTERPRISE_SECTIONS
-            ],
         ],
         url_prefix="https://help.splunk.com/en/splunk-enterprise/",
         blocked_path_prefixes=_HELP_BLOCKED,
@@ -233,10 +221,6 @@ class CrawlSource:
         version="10.2",
         seed_urls=[
             "https://help.splunk.com/en/splunk-cloud-platform/",
-            *[
-                f"https://help.splunk.com/en/splunk-cloud-platform/{s}/10.2"
-                for s in _CLOUD_SECTIONS
-            ],
         ],
         url_prefix="https://help.splunk.com/en/splunk-cloud-platform/",
         blocked_path_prefixes=_HELP_BLOCKED,
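The claim that a single landing-page seed is enough rests on breadth-first discovery constrained by `url_prefix` and the blocked prefixes. A minimal sketch of that idea follows; it is not the crawler's implementation. The function `bfs_discover`, the `get_links` callback, the in-memory link graph, and the `docs.example.com` URLs are all hypothetical, and blocked prefixes are treated here as full-URL prefixes for simplicity.

```python
from collections import deque
from typing import Callable, Iterable, List, Sequence


def bfs_discover(
    seed_urls: Iterable[str],
    get_links: Callable[[str], Iterable[str]],
    url_prefix: str,
    blocked_prefixes: Sequence[str] = (),
) -> List[str]:
    """Visit pages breadth-first from the seeds, staying under url_prefix."""
    seen = set()
    queue = deque()
    for seed in seed_urls:
        if seed not in seen:
            seen.add(seed)
            queue.append(seed)

    order: List[str] = []
    while queue:
        url = queue.popleft()
        order.append(url)
        for link in get_links(url):
            if link in seen:
                continue  # already queued or visited
            if not link.startswith(url_prefix):
                continue  # outside the documentation tree
            if any(link.startswith(b) for b in blocked_prefixes):
                continue  # explicitly excluded path
            seen.add(link)
            queue.append(link)
    return order


# Hypothetical link graph standing in for real HTTP fetches.
ROOT = "https://docs.example.com/en/product/"
GRAPH = {
    ROOT: [ROOT + "search/a", ROOT + "admin/b",
           "https://other.example.com/x", ROOT + "legacy/old"],
    ROOT + "search/a": [ROOT + "search/c", ROOT],
    ROOT + "admin/b": [ROOT + "search/c"],
    ROOT + "search/c": [],
}

visited = bfs_discover(
    [ROOT],
    lambda u: GRAPH.get(u, []),
    url_prefix=ROOT,
    blocked_prefixes=[ROOT + "legacy/"],
)
```

In this toy graph every in-scope page is reached from the landing page alone, which mirrors the commit's reasoning for dropping the section-level seeds.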
