Commit 6f4511f

sciapanCA and claude committed
Add datasource relevance filter support (--query)
Mirror the backend/MCP datasource relevance filter in the skill:

- get_datasources() accepts an optional natural-language query, sends it as ?query=, parses the X-CodeAlive-Total-Data-Sources header, and returns a {dataSources, message} envelope with fail-open detection
- datasources.py gains a --query flag, renders relevanceReason per source and the omitted-count / fail-open message
- SKILL.md, workflows reference, and the context-explorer agent now recommend passing the user's task as --query
- Bump plugin version to 2.1.0

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
1 parent ad298b7 commit 6f4511f
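
As a rough illustration of the `?query=` behavior described in the commit message, the sketch below builds the request URL with and without a query. The two endpoint paths appear in the `api_client.py` diff; `build_datasources_url` and the `https://example.invalid` base URL are hypothetical names used only for this example, and the real client may encode parameters differently.

```python
from typing import Optional
from urllib.parse import urlencode


def build_datasources_url(base_url: str, alive_only: bool = True,
                          query: Optional[str] = None) -> str:
    """Hypothetical helper: return the listing URL, adding ?query= only when set."""
    endpoint = "/api/datasources/ready" if alive_only else "/api/datasources/all"
    url = f"{base_url}{endpoint}"
    # A blank or whitespace-only query is treated as "no query" (plain listing).
    if query and query.strip():
        url += "?" + urlencode({"query": query})
    return url


if __name__ == "__main__":
    print(build_datasources_url("https://example.invalid", query="add OAuth to checkout"))
    print(build_datasources_url("https://example.invalid", alive_only=False))
```

Treating a blank query the same as a missing one keeps the unfiltered listing on its original code path.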

8 files changed

Lines changed: 283 additions & 28 deletions

.claude-plugin/plugin.json

Lines changed: 1 addition & 1 deletion
@@ -1,7 +1,7 @@
 {
   "name": "codealive",
   "description": "CodeAlive context engine for semantic code search and AI-powered codebase Q&A. Enables AI coding agents to understand entire codebases beyond just open files — search across all indexed repositories, trace cross-service dependencies, discover usage patterns, and get synthesized answers to architectural questions. Includes a lightweight code exploration subagent, authentication hooks, and multiple search modes (fast lexical, semantic, and deep cross-cutting). Works standalone or alongside the CodeAlive MCP server for direct tool access via the Model Context Protocol.",
-  "version": "2.0.9",
+  "version": "2.1.0",
   "author": {
     "name": "CodeAlive AI",
     "email": "hello@codealive.ai"

agents/codealive-context-explorer.md

Lines changed: 7 additions & 4 deletions
@@ -16,7 +16,7 @@ You are a code exploration specialist. **Your default tool is CodeAlive — not
 Unless the request is unambiguously a local-only file lookup ("read line 42 of foo.ts", "is bar.py in this repo"), your first turn MUST include both of these calls before any answer:
 
 ```bash
-python scripts/datasources.py
+python scripts/datasources.py --query "<the user's question or task>"
 python scripts/search.py "<question paraphrased as a concept query>" <data_source>
 ```
 
@@ -28,9 +28,12 @@ The scripts directory is relative to the skill location. If a path fails, fall b
 
 ### 1. List data sources — run FIRST every session
 ```bash
-python scripts/datasources.py
+python scripts/datasources.py --query "<the user's question or task>"
 ```
-Without this you do not know what to search against. Instant, free, cheap.
+Without this you do not know what to search against. Pass the user's question as `--query` so
+the backend returns only the relevant sources, each with a `relevanceReason`. The output tells
+you when sources were omitted, and when filtering was unavailable (the full list is returned
+instead — fail-open). Omit `--query` only when the user asks for the complete inventory.
 
 ### 2. Semantic search — your default discovery tool
 ```bash
@@ -64,7 +67,7 @@ Use after `search.py` or `fetch.py` to expand a call graph, inheritance, or symb
 
 Standard loop, in order:
 
-1. **`datasources.py`** — every session, no exceptions.
+1. **`datasources.py --query "<user's task>"`** — every session, no exceptions. The relevance-filtered shortlist tells you what to search against; if a source you expected is missing, rerun without `--query` to see the full list.
 2. **`search.py`** with the main concept — every session, no exceptions. Run it even when you have a guess; the search confirms or refutes it with real evidence.
 3. **`grep.py`** for specific identifiers, error messages, or config keys surfaced in step 2.
 4. **`fetch.py`** on the most relevant identifiers (descriptions are triage pointers only — never reason from them).
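
Step 1 of the loop above can be sketched as a small helper that builds the command line. `datasources_argv` is a hypothetical name; the script path and the `--query` flag are taken from the text. Passing the task as a single list element sidesteps shell quoting for questions that contain quotes or other special characters.

```python
import shlex
from typing import List, Optional


def datasources_argv(task: Optional[str] = None,
                     script: str = "scripts/datasources.py") -> List[str]:
    """Hypothetical helper: argv for listing data sources, filtered when a task is given."""
    argv = ["python", script]
    if task and task.strip():
        argv += ["--query", task]
    return argv


if __name__ == "__main__":
    filtered = datasources_argv("why does checkout's retry fail?")
    print(shlex.join(filtered))            # shell-safe rendering for logs
    print(shlex.join(datasources_argv()))  # fallback: full list, no --query
    # To actually run it (assumes the script exists at that path):
    #   subprocess.run(filtered, check=True)
```

Calling the helper with no task gives the unfiltered command, which is the rerun the list recommends when an expected source is missing.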

skills/codealive-context-engine/SKILL.md

Lines changed: 17 additions & 3 deletions
@@ -37,7 +37,7 @@ Do NOT retry the failed script until setup completes successfully.
 
 | Tool | Script | Speed | Cost | Best For |
 |------|--------|-------|------|----------|
-| **List Data Sources** | `datasources.py` | Instant | Free | Discovering indexed repos and workspaces |
+| **List Data Sources** | `datasources.py` | Instant | Free | Discovering indexed repos and workspaces. With `--query "task"`, runs an AI relevance filter (low cost, not instant) returning only the relevant sources |
 | **Semantic Search** | `search.py` | Fast | Low | Default discovery — finds code by meaning (concepts, behavior, architecture) |
 | **Grep Search** | `grep.py` | Fast | Low | Finds code containing a specific string or regex (identifiers, literals, patterns) |
 | **Fetch Artifacts** | `fetch.py` | Fast | Low | Retrieving full content; function-like artifacts also include up to 3 outgoing/incoming calls as a preview |
@@ -106,9 +106,13 @@ logic.
 ### 1. Discover what's indexed
 
 ```bash
-python scripts/datasources.py
+python scripts/datasources.py --query "the user's task in natural language"
 ```
 
+Recommended: pass the user's task as `--query` so the backend returns only the relevant
+data sources, each with a `relevanceReason`. Omit `--query` to list everything (instant,
+no AI filtering).
+
 ### 2. Search for code (fast, cheap)
 
 ```bash
@@ -151,11 +155,21 @@ python scripts/chat.py "What about security considerations?" --continue CONV_ID
 ### `datasources.py` — List Data Sources
 
 ```bash
-python scripts/datasources.py # Ready-to-use sources
+python scripts/datasources.py --query "add OAuth to checkout" # Only sources relevant to a task (recommended)
+python scripts/datasources.py # Ready-to-use sources (full list)
 python scripts/datasources.py --all # All (including processing)
 python scripts/datasources.py --json # JSON output
 ```
 
+| Option | Description |
+|--------|-------------|
+| `--query "TASK"` | The user's task/intent in natural language. The backend runs an AI relevance filter and returns only the relevant sources, each with a `relevanceReason`. Recommended whenever you know what the user is trying to accomplish |
+| `--all` | Include sources still processing |
+| `--json` | Raw JSON output (with `--query`: `{"dataSources": [...], "message": "..."}`) |
+
+**Fail-open:** if relevance filtering is unavailable, the FULL list is returned and the
+output says so — check the message before treating the result as a relevant shortlist.
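
Because `--json` prints a bare list without `--query` and an envelope with it, a consumer has to accept both shapes. The following is a minimal sketch of that normalization plus a fail-open check; `parse_listing` and `looks_unfiltered` are hypothetical helpers, and the sample payload is made up for illustration.

```python
import json
from typing import Any, Dict, List, Tuple


def parse_listing(raw: str) -> Tuple[List[Dict[str, Any]], str]:
    """Return (data sources, message) for either --json output shape."""
    data = json.loads(raw)
    if isinstance(data, dict):  # envelope shape, printed when --query is used
        return data.get("dataSources", []), data.get("message", "")
    return data, ""             # bare list, printed without --query


def looks_unfiltered(sources: List[Dict[str, Any]]) -> bool:
    """True when no item carries a relevanceReason, i.e. the list was not filtered."""
    return not any(src.get("relevanceReason") for src in sources)


if __name__ == "__main__":
    sample = ('{"dataSources": [{"name": "shop-api", "relevanceReason": "owns checkout"}],'
              ' "message": "illustrative message"}')
    sources, message = parse_listing(sample)
    print(len(sources), message, looks_unfiltered(sources))
```

Checking for `relevanceReason` on the items rather than matching the wording of `message` keeps the consumer independent of the exact hint text.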
+
 ### `search.py` — Semantic Code Search (default discovery tool)
 
 The default starting point. Finds code by WHAT it does — concepts, behavior,

skills/codealive-context-engine/references/workflows.md

Lines changed: 7 additions & 3 deletions
@@ -20,9 +20,13 @@ Complete workflows for common code exploration scenarios using CodeAlive.
 
 ### Step 1: Discover Available Code
 ```bash
-python datasources.py
+python datasources.py --query "your task in natural language"
 ```
 
+Pass your task as `--query` to get only the relevant data sources, each with a
+`relevanceReason` (recommended when you know the goal). Run plain `python datasources.py`
+for the complete inventory.
+
 Review output to understand:
 - What repositories are indexed
 - What workspaces group related repos
@@ -287,8 +291,8 @@ python grep.py "useMemo|useCallback|React.memo" workspace:all-frontend --regex
 ### Day 1: Get Overview
 
 ```bash
-# Discover what's indexed
-python datasources.py
+# Discover what's indexed (relevance-filtered to the onboarding goal)
+python datasources.py --query "onboard to the new-service codebase"
 
 # Find entry points and main features
 python search.py "main application entry point, startup initialization" new-service

skills/codealive-context-engine/scripts/datasources.py

Lines changed: 49 additions & 8 deletions
@@ -6,11 +6,16 @@
 Includes current project repos, dependencies, libraries, and organizational codebases.
 
 Usage:
-    python datasources.py           # Show ready-to-use data sources
-    python datasources.py --all     # Show all data sources (including processing)
-    python datasources.py --json    # Output as JSON
+    python datasources.py                   # Show ready-to-use data sources
+    python datasources.py --query "TASK"    # Show only sources relevant to a task (recommended)
+    python datasources.py --all             # Show all data sources (including processing)
+    python datasources.py --json            # Output as JSON
 
 Examples:
+    # RECOMMENDED when you know the task: only sources relevant to it, each with a
+    # relevanceReason explaining the match
+    python datasources.py --query "add OAuth to the checkout flow"
+
     # List ready data sources
     python datasources.py
 
@@ -19,6 +24,10 @@
 
     # Get JSON output for parsing
     python datasources.py --json
+
+Note:
+    --query runs an AI relevance filter on the backend. It fails open: if filtering is
+    unavailable, the FULL list is returned and the output says so.
 """
 
 import sys
@@ -31,17 +40,27 @@
 from api_client import CodeAliveClient
 
 
-def format_datasources(datasources: list, as_json: bool = False) -> str:
-    """Format data sources for display."""
+def format_datasources(datasources: list, as_json: bool = False, message: str = "") -> str:
+    """Format data sources for display.
+
+    `message` is the relevance hint accompanying a --query'd listing: how many sources
+    were omitted as non-relevant, or that filtering was unavailable and the list is full.
+    """
     if as_json:
+        if message:
+            return json.dumps({"dataSources": datasources, "message": message}, indent=2)
         return json.dumps(datasources, indent=2)
 
     if not datasources:
+        if message:
+            return f"No data sources matched.\nℹ️ {message}"
         return "No data sources found.\nAdd repositories at https://app.codealive.ai"
 
     output = []
     output.append(f"\n📚 Available Data Sources ({len(datasources)} total)\n")
     output.append("="*80)
+    if message:
+        output.append(f"\nℹ️ {message}")
 
     # Group by type
     repos = [ds for ds in datasources if ds.get("type") == "Repository"]
@@ -58,6 +77,8 @@ def format_datasources(datasources: list, as_json: bool = False) -> str:
             status = f" [{state}]" if state and state != "Alive" else ""
             output.append(f"\n 📁 {name}{status}")
             output.append(f" {desc}")
+            if ws.get("relevanceReason"):
+                output.append(f" 🎯 {ws['relevanceReason']}")
 
     if repos:
         output.append("\n\n📦 REPOSITORIES")
@@ -71,6 +92,8 @@ def format_datasources(datasources: list, as_json: bool = False) -> str:
             status = f" [{state}]" if state and state != "Alive" else ""
             output.append(f"\n 📄 {name}{status}")
             output.append(f" {desc}")
+            if repo.get("relevanceReason"):
+                output.append(f" 🎯 {repo['relevanceReason']}")
             if url:
                 output.append(f" 🔗 {url}")
 
@@ -79,6 +102,7 @@ def format_datasources(datasources: list, as_json: bool = False) -> str:
     output.append(" • Use names with search.py, grep.py, and fetch.py")
     output.append(" • Workspaces search ALL repos in the workspace")
     output.append(" • Combine multiple data sources for broader search")
+    output.append(" • Pass --query 'your task' to list only the relevant sources")
     output.append("\n📖 Examples:")
     output.append(" python search.py 'auth logic' my-backend")
     output.append(" python grep.py 'AuthService' my-backend")
@@ -90,20 +114,37 @@ def main():
     """CLI interface for listing data sources."""
     alive_only = True
     as_json = False
+    query = None
 
-    for arg in sys.argv[1:]:
+    args = sys.argv[1:]
+    i = 0
+    while i < len(args):
+        arg = args[i]
         if arg == "--all":
             alive_only = False
         elif arg == "--json":
             as_json = True
+        elif arg == "--query":
+            if i + 1 >= len(args):
+                print("❌ Error: --query requires a value", file=sys.stderr)
+                sys.exit(1)
+            query = args[i + 1]
+            i += 1
         elif arg == "--help":
             print(__doc__)
             sys.exit(0)
+        i += 1
 
     try:
         client = CodeAliveClient()
-        datasources = client.get_datasources(alive_only=alive_only)
-        print(format_datasources(datasources, as_json))
+        result = client.get_datasources(alive_only=alive_only, query=query)
+        if isinstance(result, dict):
+            datasources = result.get("dataSources", [])
+            message = result.get("message", "")
+        else:
+            datasources = result
+            message = ""
+        print(format_datasources(datasources, as_json, message))
 
     except Exception as e:
         print(f"❌ Error: {e}", file=sys.stderr)
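
For comparison, the same three flags can be expressed with the standard library's `argparse`. This is an alternative sketch, not what the commit does: the script uses a hand-written loop over `sys.argv`. Two behavioral differences are worth noting. `argparse` rejects unknown arguments and exits with status 2 on errors, whereas the loop ignores unknown arguments and exits with status 1 when `--query` has no value. And `--query --json` is an error here, while the loop would take `--json` as the query text. The `parse_args` wrapper and the help strings are illustrative.

```python
import argparse
from typing import List, Optional


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse the datasources.py flags named in the diff using argparse."""
    parser = argparse.ArgumentParser(
        prog="datasources.py",
        description="List CodeAlive data sources.",
        allow_abbrev=False,  # do not accept prefixes such as --q for --query
    )
    parser.add_argument("--query", metavar="TASK", default=None,
                        help="task in natural language; lists only relevant sources")
    # --all switches alive_only off; without it, alive_only defaults to True.
    parser.add_argument("--all", dest="alive_only", action="store_false",
                        help="include sources still processing")
    parser.add_argument("--json", dest="as_json", action="store_true",
                        help="raw JSON output")
    return parser.parse_args(argv)


if __name__ == "__main__":
    ns = parse_args(["--query", "add OAuth to checkout", "--json"])
    print(ns.query, ns.alive_only, ns.as_json)
```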

skills/codealive-context-engine/scripts/lib/api_client.py

Lines changed: 77 additions & 9 deletions
@@ -19,6 +19,45 @@
 # agents get an actionable error before the network round-trip.
 _OBJECT_ID_RE = re.compile(r"^[0-9a-fA-F]{24}$")
 
+# Pre-filter scoped candidate count, emitted by the backend only on relevance-filtered
+# (query'd) data source listings.
+_TOTAL_DATA_SOURCES_HEADER = "X-CodeAlive-Total-Data-Sources"
+
+
+def relevance_message(datasources: List[Dict[str, Any]], total_header: Optional[str]) -> str:
+    """Build the hint accompanying a query'd (relevance-filtered) data source listing.
+
+    The backend guarantees every relevance-selected item carries a non-empty
+    ``relevanceReason``, so a query'd response where NO item has one means the filter
+    did not run (fail-open on error, disabled by config, or an older backend ignoring
+    ``query``) and the FULL list was returned — the caller must be told, instead of
+    mistaking the full dump for a relevant shortlist.
+    """
+    filtered = any(ds.get("relevanceReason") for ds in datasources)
+    if not filtered:
+        return (
+            "Relevance filtering was unavailable for this request (it may have failed or be "
+            "disabled), so the FULL unfiltered list of data sources is returned."
+        )
+
+    shown = len(datasources)
+    try:
+        total = int(total_header)
+    except (TypeError, ValueError):
+        # Header absent (TypeError on int(None)) or malformed (ValueError).
+        total = None
+    if total is not None and total > shown:
+        return (
+            f"{shown} of {total} available data sources are relevant to this query; the other "
+            f"{total - shown} were omitted. List without a query to get the full list."
+        )
+    if total is not None:
+        return f"All {total} available data sources are relevant to this query."
+    return (
+        "Only the data sources relevant to this query are shown; non-relevant sources were "
+        "omitted. List without a query to get the full list."
+    )
+
 
 def format_codealive_error(status: int, body: Any) -> str:
     """Format a CodeAlive REST API error body into a single human/agent-readable line.
@@ -274,8 +313,9 @@ def _make_request(
         method: str,
         endpoint: str,
         params: Optional[Dict[str, Any]] = None,
-        body: Optional[Dict[str, Any]] = None
-    ) -> Dict[str, Any]:
+        body: Optional[Dict[str, Any]] = None,
+        return_headers: bool = False
+    ) -> Any:
         """
         Make an HTTP request to the CodeAlive API.
 
@@ -284,9 +324,10 @@ def _make_request(
             endpoint: API endpoint path
             params: URL query parameters
             body: Request body for POST requests
+            return_headers: If True, return (parsed JSON, response headers dict) instead.
 
         Returns:
-            Parsed JSON response
+            Parsed JSON response, or (parsed JSON, headers) when return_headers is True
         """
         url = f"{self.base_url}{endpoint}"
 
@@ -312,7 +353,10 @@ def _make_request(
         try:
             with urllib.request.urlopen(request, timeout=self.timeout) as response:
                 response_data = response.read().decode("utf-8")
-                return json.loads(response_data) if response_data else {}
+                parsed = json.loads(response_data) if response_data else {}
+                if return_headers:
+                    return parsed, dict(response.headers.items())
+                return parsed
         except urllib.error.HTTPError as e:
             error_body = e.read()
             error_msg = format_codealive_error(e.code, error_body)
@@ -353,18 +397,35 @@ def _make_request(
                 f"Check your network connection and CODEALIVE_BASE_URL setting."
             )
 
-    def get_datasources(self, alive_only: bool = True) -> List[Dict[str, Any]]:
+    def get_datasources(
+        self, alive_only: bool = True, query: Optional[str] = None
+    ) -> Any:
         """
         Get available data sources (repositories and workspaces).
 
         Args:
             alive_only: If True, only return data sources ready for use. If False, return all.
+            query: Optional natural-language task/intent (e.g. "add OAuth to checkout"). When
+                provided, the backend runs an agentic relevance filter and returns ONLY the data
+                sources relevant to that intent, each with a `relevanceReason` explaining why.
 
         Returns:
-            List of data source objects with id, name, description, type, etc.
+            Without query: list of data source objects with id, name, description, type, etc.
+            With query: dict {"dataSources": [...], "message": "..."} where `message` says whether
+                sources were omitted as non-relevant (and how many of the total) or that relevance
+                filtering was unavailable and the FULL list is returned.
         """
         endpoint = "/api/datasources/ready" if alive_only else "/api/datasources/all"
-        return self._make_request("GET", endpoint)
+        if not query or not query.strip():
+            return self._make_request("GET", endpoint)
+
+        datasources, headers = self._make_request(
+            "GET", endpoint, params={"query": query}, return_headers=True
+        )
+        return {
+            "dataSources": datasources,
+            "message": relevance_message(datasources, headers.get(_TOTAL_DATA_SOURCES_HEADER)),
+        }
 
     def search(
         self,
@@ -581,7 +642,7 @@ def main():
     if len(sys.argv) < 2:
         print("Usage: python api_client.py <command> [args...]")
         print("Commands:")
-        print(" datasources [--all]")
+        print(" datasources [--all] [--query TASK]")
         print(" search <query> <data_source1> [data_source2...] [--mode auto|fast|deep] [--description-detail short|full]")
         print(" semantic-search <query> <data_source1> [data_source2...] [--path PATH] [--ext EXT] [--max-results N]")
         print(" grep-search <query> <data_source1> [data_source2...] [--regex] [--path PATH] [--ext EXT] [--max-results N]")
@@ -596,7 +657,14 @@ def main():
     try:
         if command == "datasources":
             alive_only = "--all" not in sys.argv
-            result = client.get_datasources(alive_only=alive_only)
+            query = None
+            if "--query" in sys.argv:
+                query_index = sys.argv.index("--query")
+                if query_index + 1 >= len(sys.argv):
+                    print("Usage: datasources [--all] [--query TASK]")
+                    sys.exit(1)
+                query = sys.argv[query_index + 1]
+            result = client.get_datasources(alive_only=alive_only, query=query)
             print(json.dumps(result, indent=2))
 
         elif command == "search":
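
One detail of the header handling in `_make_request` is easy to miss: `response.headers` supports case-insensitive lookups, but the plain `dict` built from it does not, so `headers.get(_TOTAL_DATA_SOURCES_HEADER)` only matches when the header arrives with exactly that casing. The sketch below shows a case-insensitive lookup combined with the same tolerant integer parsing that `relevance_message` uses. `header_value` and `parse_total` are hypothetical helpers, and whether the backend or an intermediary ever changes the casing is an assumption, not something the diff states.

```python
from typing import Mapping, Optional

TOTAL_HEADER = "X-CodeAlive-Total-Data-Sources"  # header name taken from the diff


def header_value(headers: Mapping[str, str], name: str) -> Optional[str]:
    """Look up a header in a plain dict without depending on its casing."""
    wanted = name.lower()
    for key, value in headers.items():
        if key.lower() == wanted:
            return value
    return None


def parse_total(raw: Optional[str]) -> Optional[int]:
    """Return the header as an int, or None when it is absent or malformed."""
    try:
        return int(raw)
    except (TypeError, ValueError):
        # int(None) raises TypeError; a non-numeric string raises ValueError.
        return None


if __name__ == "__main__":
    lowered = {"x-codealive-total-data-sources": "12"}
    print(parse_total(header_value(lowered, TOTAL_HEADER)))
    print(parse_total(header_value({}, TOTAL_HEADER)))
```

If the lookup misses, `relevance_message` already degrades to its generic "only the relevant sources are shown" hint, so the effect of a casing mismatch is a less specific message rather than an error.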
