
Commit 54b365a

jwindley and claude committed
Drop Enterprise 10.1 and Cloud 10.2; fix ES 8.3 seeds; run dedup after merge
Enterprise 10.1 and Cloud 10.2 removed: the landing page only links to current-version content so BFS never discovers older-version pages. No viable seeding strategy without significant crawler rework.

ES 8.3: add section seeds identical to 8.4 pattern. The assumption that /section/8.3 redirects to 8.5 was wrong -- 8.4 section seeds work fine with the same URL pattern and 8.3 should too.

GHA workflow: run run_dedup_pass() after merge so the merged DB actually has is_duplicate flags set. Previously merge_dbs() never called dedup, leaving all rows with is_duplicate=0 in the published DB.

server.py: remove splunk-enterprise-10-1 and splunk-cloud-10-2 from tool descriptions and valid version list.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
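The first paragraph of the commit message turns on reachability: a breadth-first crawl only finds pages that some already-visited page links to. The sketch below illustrates that effect with a made-up link graph. `LINKS`, the `/product/...` paths, and both helper functions are hypothetical; the project's crawler and the real site structure are not shown in this commit.

```python
from collections import deque

# Hypothetical link graph standing in for a docs site (not the real one).
LINKS = {
    "/product/": ["/product/install/10.2", "/product/search/10.2"],
    "/product/install/10.2": ["/product/search/10.2"],
    "/product/search/10.2": [],
    # Older-version pages exist, but nothing reachable from the root links to them.
    "/product/install/10.1": ["/product/search/10.1"],
    "/product/search/10.1": [],
}


def bfs_crawl(seeds, links):
    """Return every URL reachable from the seeds, in breadth-first order."""
    seen = set(seeds)
    queue = deque(seeds)
    order = []
    while queue:
        url = queue.popleft()
        order.append(url)
        for nxt in links.get(url, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return order


def pages_for_version(seeds, links, version):
    """Crawl from the seeds, then keep only pages whose path ends in the version."""
    return [u for u in bfs_crawl(seeds, links) if u.endswith("/" + version)]


# Seeding only the landing page finds no 10.1 pages in this toy graph.
root_only = pages_for_version(["/product/"], LINKS, "10.1")

# Adding one version-specific seed makes the 10.1 pages reachable.
with_seed = pages_for_version(["/product/", "/product/install/10.1"], LINKS, "10.1")
```

In this toy graph `root_only` is empty, while `with_seed` contains both 10.1 pages. That mirrors the reasoning behind the two changes: ES 8.3 gets section seeds because a working URL pattern is known, and Enterprise 10.1 and Cloud 10.2 are dropped because the message reports no such pattern for them.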
1 parent 73f2c4e commit 54b365a

4 files changed

Lines changed: 26 additions & 38 deletions


.github/workflows/crawl-and-release.yml

Lines changed: 14 additions & 7 deletions
@@ -25,9 +25,7 @@ jobs:
           - enterprise-security-8-3
           - admin-manual
           - splunk-enterprise
-          - splunk-enterprise-10-1
           - splunk-cloud
-          - splunk-cloud-10-2
           - lantern

     runs-on: ubuntu-latest
@@ -76,7 +74,7 @@ jobs:
           retention-days: 1

   # ---------------------------------------------------------------------------
-  # Aggregation — merge all per-source DBs, export per-source files, release
+  # Aggregation — merge all per-source DBs, run dedup, export, release
   # ---------------------------------------------------------------------------
   merge-and-release:
     needs: [crawl]
@@ -107,8 +105,7 @@ jobs:
           # was uploaded rather than failing the whole release.
           DBS=""
           for src in enterprise-security enterprise-security-8-4 enterprise-security-8-3 \
-                     admin-manual splunk-enterprise splunk-enterprise-10-1 \
-                     splunk-cloud splunk-cloud-10-2 lantern; do
+                     admin-manual splunk-enterprise splunk-cloud lantern; do
             if [ -f "data/${src}.db" ]; then
               DBS="$DBS data/${src}.db"
             else
@@ -121,6 +118,16 @@ jobs:
           fi
           uv run splunk-merge $DBS --output data/splunk_docs.db

+      - name: Run cross-source deduplication
+        run: uv run python -c "
+          from splunk_docs_mcp.db import get_connection, run_dedup_pass
+          from pathlib import Path
+          conn = get_connection(Path('data/splunk_docs.db'))
+          n = run_dedup_pass(conn)
+          print(f'Dedup complete: {n} duplicate rows suppressed')
+          conn.close()
+          "
+
       - name: Export per-source DBs and manifest
         run: uv run splunk-merge --export-sources data/export/ --db data/splunk_docs.db
@@ -138,8 +145,8 @@ jobs:

           **Sources indexed:**
           - Splunk Enterprise Security 8.3, 8.4, 8.5
-          - Splunk Enterprise 10.1, 10.2
-          - Splunk Cloud Platform 10.2, 10.3.2512
+          - Splunk Enterprise 10.2
+          - Splunk Cloud Platform 10.3.2512
           - Splunk Configuration File Reference 10.2
           - Splunk Lantern (current)
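The new workflow step calls `run_dedup_pass(conn)` on the merged database and prints how many rows it suppressed, but the function body is not part of this diff. As a rough illustration of what such a pass could do, the sketch below flags every row that shares a content hash with an earlier row. The `pages` table, the `content_hash` column, and the name `run_dedup_pass_sketch` are assumptions made for the example; only the `is_duplicate` flag and the returned count come from the commit.

```python
import sqlite3


def run_dedup_pass_sketch(conn: sqlite3.Connection) -> int:
    """Flag rows whose content_hash already appears on a lower id.

    Hypothetical stand-in for the project's run_dedup_pass(); the real
    schema and matching rule may differ. Returns the number of rows flagged.
    """
    cur = conn.execute(
        """
        UPDATE pages
           SET is_duplicate = 1
         WHERE is_duplicate = 0
           AND id NOT IN (SELECT MIN(id) FROM pages GROUP BY content_hash)
        """
    )
    conn.commit()
    return cur.rowcount


# Toy merged database: two sources carry the same page body.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE pages ("
    " id INTEGER PRIMARY KEY,"
    " source_id TEXT NOT NULL,"
    " content_hash TEXT NOT NULL,"
    " is_duplicate INTEGER NOT NULL DEFAULT 0)"
)
conn.executemany(
    "INSERT INTO pages (source_id, content_hash) VALUES (?, ?)",
    [("splunk-enterprise", "aaa"), ("splunk-cloud", "aaa"), ("lantern", "bbb")],
)

flagged = run_dedup_pass_sketch(conn)
print(f"Dedup complete: {flagged} duplicate rows suppressed")
```

Before the pass every row has `is_duplicate=0`, which is the state the commit message describes for the previously published DB. Running the pass on the merged file rather than on each per-source file is what lets it see duplicates that span sources.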

README.md

Lines changed: 0 additions & 2 deletions
@@ -125,9 +125,7 @@ The goal is to keep the **current released version plus the previous version (n
 | `enterprise-security-8-4` | Splunk Enterprise Security | 8.4 (n−1) |
 | `enterprise-security-8-3` | Splunk Enterprise Security | 8.3 (n−2) |
 | `splunk-enterprise` | Splunk Enterprise | 10.2 (current) |
-| `splunk-enterprise-10-1` | Splunk Enterprise | 10.1 (n−1) |
 | `splunk-cloud` | Splunk Cloud Platform | 10.3.2512 (current) |
-| `splunk-cloud-10-2` | Splunk Cloud Platform | 10.2 (n−1) |
 | `admin-manual` | Splunk Configuration File Reference | 10.2 |
 | `lantern` | Splunk Lantern | current |

src/splunk_docs_mcp/config.py

Lines changed: 7 additions & 22 deletions
@@ -205,26 +205,6 @@ class CrawlSource:
         url_prefix="https://help.splunk.com/en/splunk-cloud-platform/",
         blocked_path_prefixes=_HELP_BLOCKED,
     ),
-    CrawlSource(
-        source_id="splunk-enterprise-10-1",
-        display_name="Splunk Enterprise 10.1",
-        version="10.1",
-        seed_urls=[
-            "https://help.splunk.com/en/splunk-enterprise/",
-        ],
-        url_prefix="https://help.splunk.com/en/splunk-enterprise/",
-        blocked_path_prefixes=_HELP_BLOCKED,
-    ),
-    CrawlSource(
-        source_id="splunk-cloud-10-2",
-        display_name="Splunk Cloud Platform 10.2",
-        version="10.2",
-        seed_urls=[
-            "https://help.splunk.com/en/splunk-cloud-platform/",
-        ],
-        url_prefix="https://help.splunk.com/en/splunk-cloud-platform/",
-        blocked_path_prefixes=_HELP_BLOCKED,
-    ),
     CrawlSource(
         source_id="enterprise-security-8-4",
         display_name="Splunk Enterprise Security 8.4",
@@ -243,10 +223,15 @@ class CrawlSource:
         source_id="enterprise-security-8-3",
         display_name="Splunk Enterprise Security 8.3",
         version="8.3",
-        # No version-specific section seeds — those redirect to 8.5 and get
-        # rejected by the version filter. BFS from the root discovers 8.3 links.
+        # Section seeds use the same pattern as 8.4 — /section/8.3 loads 8.3
+        # content directly (the earlier assumption that these redirect to 8.5
+        # was incorrect; 8.4 section seeds work identically and 8.3 should too).
         seed_urls=[
             "https://help.splunk.com/en/splunk-enterprise-security-8",
+            *[
+                f"https://help.splunk.com/en/splunk-enterprise-security-8/{s}/8.3"
+                for s in _ES_SECTIONS
+            ],
         ],
         url_prefix="https://help.splunk.com/en/splunk-enterprise-security-8/",
         blocked_path_prefixes=_HELP_BLOCKED,
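The added `seed_urls` entry unpacks a list comprehension over `_ES_SECTIONS`, whose contents are defined elsewhere in `config.py` and do not appear in this diff. The sketch below shows how that expression expands, using two placeholder section slugs. `ES_SECTIONS_EXAMPLE`, its values, and the helper `section_seeds` are hypothetical; only the base URL and the `/{section}/{version}` pattern are taken from the change above.

```python
# Placeholder slugs for illustration only; the real _ES_SECTIONS is not shown.
ES_SECTIONS_EXAMPLE = ["install", "administer"]

BASE = "https://help.splunk.com/en/splunk-enterprise-security-8"


def section_seeds(version, sections):
    """Return the product root followed by one seed URL per section."""
    return [BASE, *[f"{BASE}/{s}/{version}" for s in sections]]


seeds_8_3 = section_seeds("8.3", ES_SECTIONS_EXAMPLE)
for url in seeds_8_3:
    print(url)
```

A helper along these lines would let the 8.3 and 8.4 entries share one expression, so a change to the section list applies to both versions at once.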

src/splunk_docs_mcp/server.py

Lines changed: 5 additions & 7 deletions
@@ -110,14 +110,12 @@ def _get_db() -> sqlite3.Connection:
         " enterprise-security-8-3 — Splunk Enterprise Security 8.3\n"
         " admin-manual — Splunk Configuration File Reference 10.2\n"
         " splunk-enterprise — Splunk Enterprise 10.2\n"
-        " splunk-enterprise-10-1 — Splunk Enterprise 10.1\n"
         " splunk-cloud — Splunk Cloud Platform 10.3.2512\n"
-        " splunk-cloud-10-2 — Splunk Cloud Platform 10.2\n"
         " lantern — Splunk Lantern (use-case guidance, best practices)\n\n"
         "Version filter (version= on search_docs / search_docs_semantic):\n"
         " Use version= to filter across sources by product version when the user asks\n"
         " about a specific release. Example: version='8.4' returns ES 8.4 docs only.\n"
-        " Valid values: '8.3', '8.4', '8.5', '10.1', '10.2', '10.3.2512', 'current'.\n"
+        " Valid values: '8.3', '8.4', '8.5', '10.2', '10.3.2512', 'current'.\n"
         " Combine source= and version= for precise targeting (e.g. ES 8.4 only).\n\n"

         "DECISION TREE — apply before every question:\n\n"
@@ -199,15 +197,15 @@ def search_docs(
             "Limit search to a specific source. "
             "Options: 'enterprise-security', 'enterprise-security-8-4', "
             "'enterprise-security-8-3', 'admin-manual', 'splunk-enterprise', "
-            "'splunk-enterprise-10-1', 'splunk-cloud', 'splunk-cloud-10-2', 'lantern'. "
+            "'splunk-cloud', 'lantern'. "
             "Omit to search across all indexed sources."
         )),
     ] = None,
     version: Annotated[
         str | None,
         Field(description=(
             "Filter by product version. "
-            "Valid values: '8.3', '8.4', '8.5', '10.1', '10.2', '10.3.2512', 'current'. "
+            "Valid values: '8.3', '8.4', '8.5', '10.2', '10.3.2512', 'current'. "
             "Combine with source= for precise targeting, or use alone to search "
             "a specific release across all sources that have it."
         )),
@@ -264,15 +262,15 @@ def search_docs_semantic(
             "Limit search to a specific source. "
             "Options: 'enterprise-security', 'enterprise-security-8-4', "
             "'enterprise-security-8-3', 'admin-manual', 'splunk-enterprise', "
-            "'splunk-enterprise-10-1', 'splunk-cloud', 'splunk-cloud-10-2', 'lantern'. "
+            "'splunk-cloud', 'lantern'. "
             "Omit to search across all indexed sources."
         )),
     ] = None,
     version: Annotated[
         str | None,
         Field(description=(
             "Filter by product version. "
-            "Valid values: '8.3', '8.4', '8.5', '10.1', '10.2', '10.3.2512', 'current'. "
+            "Valid values: '8.3', '8.4', '8.5', '10.2', '10.3.2512', 'current'. "
             "Combine with source= for precise targeting."
         )),
     ] = None,
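The same source and version lists are edited by hand in three places in `server.py`, which makes it easy for one of them to drift. One possible alternative, sketched here as an assumption rather than as how the project is structured, is to keep a single tuple for each list and build the description strings from it. The names `SOURCES`, `VERSIONS`, `quoted_list`, `SOURCE_HELP`, `VERSION_HELP`, and `validate_filters` are all hypothetical; the values are the ones left in the descriptions after this commit.

```python
# Values copied from the post-commit descriptions; the names are illustrative.
SOURCES = (
    "enterprise-security",
    "enterprise-security-8-4",
    "enterprise-security-8-3",
    "admin-manual",
    "splunk-enterprise",
    "splunk-cloud",
    "lantern",
)
VERSIONS = ("8.3", "8.4", "8.5", "10.2", "10.3.2512", "current")


def quoted_list(values):
    """Render values as 'a', 'b', 'c' for use in a description string."""
    return ", ".join(f"'{v}'" for v in values)


SOURCE_HELP = (
    "Limit search to a specific source. "
    f"Options: {quoted_list(SOURCES)}. "
    "Omit to search across all indexed sources."
)
VERSION_HELP = f"Filter by product version. Valid values: {quoted_list(VERSIONS)}."


def validate_filters(source=None, version=None):
    """Raise ValueError if a filter names a source or version that is not indexed."""
    if source is not None and source not in SOURCES:
        raise ValueError(f"unknown source {source!r}; expected one of: {quoted_list(SOURCES)}")
    if version is not None and version not in VERSIONS:
        raise ValueError(f"unknown version {version!r}; expected one of: {quoted_list(VERSIONS)}")


validate_filters(source="splunk-cloud", version="10.3.2512")  # passes silently
```

With this layout, dropping a source such as `splunk-cloud-10-2` would be a one-line change, and a request that still names it would fail with a clear error instead of quietly returning no results.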
