Skip to content

Feat/kb rag retrieval#924

Merged
qdaxb merged 7 commits into
wecode-ai:mainfrom
kissghosts:feat/kb-rag-retrieval
Apr 7, 2026
Merged

Feat/kb rag retrieval#924
qdaxb merged 7 commits into
wecode-ai:mainfrom
kissghosts:feat/kb-rag-retrieval

Conversation

@kissghosts
Copy link
Copy Markdown
Collaborator

@kissghosts kissghosts commented Apr 4, 2026

Summary by CodeRabbit

  • New Features

    • Retrieve documents by exact name in addition to IDs; tools and APIs accept optional name-based scoping.
    • Knowledge base metadata now includes search availability and document counts (total, searchable, spreadsheets).
  • Bug Fixes / UX

    • Clearer error messages and short-circuit behavior when scoped document names resolve to no documents.
  • Documentation

    • Updated retrieval guidance and prompts to prefer high-signal scoped searches and handle spreadsheet caveats.
  • Tests

    • Added coverage for name-based retrieval, metadata formatting, and prompt changes.

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Apr 4, 2026

📝 Walkthrough

Walkthrough

Adds optional per-call document scoping by document_names/document_ids, introduces name-to-ID resolution and per-KB document statistics, updates retrieval/control flow to use resolved IDs (with short-circuit on no matches), and adjusts prompts and tests to reflect scoped retrieval behavior.

Changes

Cohort / File(s) Summary
Core RAG Endpoint
backend/app/api/endpoints/internal/rag.py
Added document_names to InternalRetrieveRequest and message to InternalRetrieveResponse. Implemented _resolve_document_names() and updated internal_retrieve() to prefer explicit document_ids, resolve names when needed, short-circuit with empty response/message on no matches, and pass resolved IDs into retrieval runtime.
Knowledge Service Helpers
backend/app/services/knowledge/knowledge_service.py
Added get_document_prompt_stats(db, knowledge_base_ids) returning per-KB counts and resolve_document_ids_by_names(db, knowledge_base_ids, document_names) to map exact names→IDs scoped to KBs; updated imports to support conditional aggregation.
KB Meta Prompt Construction
backend/app/services/chat/preprocessing/contexts.py, backend/app/services/chat/preprocessing/kb_meta.py
_build_kb_meta_prompt now fetches per-KB document stats and injects search_available, total_document_count, searchable_document_count, and spreadsheet_document_count. format_kb_meta_prompt() formats these fields into KB lines.
Chat Shell Tool & Input Schema
chat_shell/chat_shell/tools/builtin/knowledge_base.py
Added document_ids/document_names to KnowledgeBaseInput and KnowledgeBaseTool; _arun now delegates to _arun_impl with optional per-call filters, forwards filters to retrieval paths, resolves names to IDs for DB-backed retrieval, and short-circuits with an empty RAG result/message when resolution yields no IDs.
Prompt Templates
shared/prompts/knowledge_base.py
Updated Type C guidance to prefer highest-signal retrieval (global vs scoped knowledge_base_search, kb_ls, kb_head) and added notes about spreadsheet reliability and exact document_names. Adjusted strict phrasing to require retrieval before answering.
Tests — Endpoint & Service
backend/tests/api/endpoints/internal/test_rag_retrieve_endpoint.py, backend/tests/services/rag/test_retrieval_service.py
Added tests verifying document_names triggers resolution before retrieval, and that empty-resolution yields 200 with empty records and a "Document names not found" message; added async unit test asserting resolution is called and retrieval gets resolved IDs.
Tests — KB Meta & Prompts
backend/tests/services/test_kb_meta_prompt.py, shared/tests/test_prompts.py
Extended fixtures and assertions to include retrieval metadata fields and updated prompt assertions to the new strict wording and presence of document_names, document_ids, kb_ls/kb_head, and spreadsheet-related tokens.
Tests — Knowledge Service & Chat Shell Scoped Search
backend/tests/services/knowledge/test_knowledge_service.py, chat_shell/tests/test_knowledge_base_scoped_search.py
Added unit tests for resolve_document_ids_by_names and get_document_prompt_stats; added comprehensive async tests ensuring KnowledgeBaseInput preserves document_names, HTTP mode forwards scoped filters, error message retention on name-resolution failure, and per-call filters do not mutate tool defaults.

Sequence Diagram

sequenceDiagram
    participant Client
    participant Endpoint as RAG Endpoint
    participant Resolve as Name Resolver
    participant KBService as KnowledgeService
    participant Retrieval as Retrieval Engine
    participant Meta as KB Meta Builder

    Note over Client,Endpoint: Scoped retrieval request
    Client->>Endpoint: POST internal_retrieve(document_names)
    Endpoint->>Resolve: _resolve_document_names(names, kb_ids)
    Resolve->>KBService: resolve_document_ids_by_names(db, kb_ids, names)
    KBService-->>Resolve: [resolved_ids] / []
    Resolve-->>Endpoint: resolved_document_ids
    alt resolved_ids non-empty
      Endpoint->>Retrieval: query(resolved_document_ids)
      Retrieval-->>Endpoint: chunks/results
      Endpoint-->>Client: 200 OK with results
    else no ids found
      Endpoint-->>Client: 200 OK with mode="rag_retrieval", records=[], message="Document names not found..."
    end

    Note over Client,Meta: KB metadata enrichment
    Client->>Meta: _build_kb_meta_prompt(kb_ids)
    Meta->>KBService: get_document_prompt_stats(db, kb_ids)
    KBService-->>Meta: {kb_id: {total_count, searchable_count, spreadsheet_count}}
    Meta-->>Client: kb_meta_list (includes stats)
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

🐰 I hop through names and IDs with glee,
I seek the files so answers flow free,
Stats dress the prompt, scopes guide the quest,
Resolved and ready, we fetch the best,
a joyful thump — knowledge in tow!

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 55.56% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
Title check ❓ Inconclusive The title is vague and overly broad, using generic terminology ('Feat/kb rag retrieval') that doesn't clearly convey the specific main change or intent. Consider using a more specific title that highlights the primary feature, such as 'Add document name scoping to RAG retrieval' or 'Support document-level filtering in knowledge base search'.
✅ Passed checks (1 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (2)
backend/app/api/endpoints/internal/rag.py (1)

216-230: Consider returning InternalRetrieveResponse instance for type consistency.

The early return at lines 224-230 uses a plain dict, while the normal path at lines 353-368 returns an InternalRetrieveResponse instance. While FastAPI handles dict-to-model conversion, using the model consistently improves type safety and makes the code more explicit.

♻️ Optional: Return model instance for consistency
             if not resolved_document_ids:
-                return {
-                    "mode": "rag_retrieval",
-                    "records": [],
-                    "total": 0,
-                    "total_estimated_tokens": 0,
-                    "message": "Document names not found in the selected knowledge bases. Use kb_ls to inspect available documents first.",
-                }
+                return InternalRetrieveResponse(
+                    mode="rag_retrieval",
+                    records=[],
+                    total=0,
+                    total_estimated_tokens=0,
+                    message="Document names not found in the selected knowledge bases. Use kb_ls to inspect available documents first.",
+                )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/api/endpoints/internal/rag.py` around lines 216 - 230, The early
return after resolving document names currently returns a plain dict; change it
to return an InternalRetrieveResponse instance for consistency and type safety.
In the branch that runs when resolved_document_ids is empty (after calling
_resolve_document_names with db, knowledge_base_ids, request.document_names),
construct and return InternalRetrieveResponse(mode="rag_retrieval", records=[],
total=0, total_estimated_tokens=0, message="Document names not found in the
selected knowledge bases. Use kb_ls to inspect available documents first.")
instead of the dict so the function consistently returns the same model type as
the normal path.
backend/tests/api/endpoints/internal/test_rag_retrieve_endpoint.py (1)

116-144: Test could verify resolution arguments for stronger assertions.

The test verifies that _resolve_document_names is called, but doesn't assert the arguments passed. Consider verifying the call arguments to ensure the correct KB IDs and document names are forwarded.

♻️ Optional improvement for stronger test assertion
     assert response.status_code == 200
-    mock_resolve.assert_called_once()
+    mock_resolve.assert_called_once_with(
+        db=ANY,
+        knowledge_base_ids=[12],
+        document_names=["release.md"],
+    )
     mock_query.assert_awaited_once()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/tests/api/endpoints/internal/test_rag_retrieve_endpoint.py` around
lines 116 - 144, Update the
test_internal_retrieve_resolves_document_names_before_query test to assert the
arguments passed to the mocked _resolve_document_names call; after posting to
"/api/internal/rag/retrieve" assert
mock_resolve.assert_called_once_with(knowledge_base_ids=[12],
document_names=["release.md"]) (or equivalent tuple/kwargs matching how
_resolve_document_names is invoked) so the test verifies the correct KB IDs and
document names are forwarded; keep the existing assertions for
response.status_code and mock_query.assert_awaited_once().
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@backend/app/services/knowledge/knowledge_service.py`:
- Around line 892-916: The resolve_document_ids_by_names method currently
returns document IDs without checking document active state; update the DB query
inside resolve_document_ids_by_names to add a filter on
KnowledgeDocument.is_active == True so only active documents are resolved
(mirror the active filtering used in other helpers like
get_active_document_text_length_stats); locate the query using
db.query(KnowledgeDocument.id) and augment its .filter(...) call to include the
is_active condition.

In `@chat_shell/chat_shell/tools/builtin/knowledge_base.py`:
- Around line 481-496: The _apply_call_scoped_filters context manager mutates
instance state (self.document_ids / self.document_names), which is unsafe for
async/concurrent executions; change callers to accept and forward scoped filter
parameters instead of relying on mutated attributes (stop using
_apply_call_scoped_filters), or replace the mutable approach with an async-aware
solution using contextvars to store per-call document_ids/document_names; update
methods that currently read self.document_ids/self.document_names to accept
optional parameters (e.g., document_ids/document_names) and thread the scoped
values through the call chain, or create contextvar-backed getters/setters and
use those in place of direct attribute access.

---

Nitpick comments:
In `@backend/app/api/endpoints/internal/rag.py`:
- Around line 216-230: The early return after resolving document names currently
returns a plain dict; change it to return an InternalRetrieveResponse instance
for consistency and type safety. In the branch that runs when
resolved_document_ids is empty (after calling _resolve_document_names with db,
knowledge_base_ids, request.document_names), construct and return
InternalRetrieveResponse(mode="rag_retrieval", records=[], total=0,
total_estimated_tokens=0, message="Document names not found in the selected
knowledge bases. Use kb_ls to inspect available documents first.") instead of
the dict so the function consistently returns the same model type as the normal
path.

In `@backend/tests/api/endpoints/internal/test_rag_retrieve_endpoint.py`:
- Around line 116-144: Update the
test_internal_retrieve_resolves_document_names_before_query test to assert the
arguments passed to the mocked _resolve_document_names call; after posting to
"/api/internal/rag/retrieve" assert
mock_resolve.assert_called_once_with(knowledge_base_ids=[12],
document_names=["release.md"]) (or equivalent tuple/kwargs matching how
_resolve_document_names is invoked) so the test verifies the correct KB IDs and
document names are forwarded; keep the existing assertions for
response.status_code and mock_query.assert_awaited_once().
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 710ef808-9475-4dce-ab88-237ab1db3688

📥 Commits

Reviewing files that changed from the base of the PR and between a83c425 and d7d8140.

📒 Files selected for processing (12)
  • backend/app/api/endpoints/internal/rag.py
  • backend/app/services/chat/preprocessing/contexts.py
  • backend/app/services/chat/preprocessing/kb_meta.py
  • backend/app/services/knowledge/knowledge_service.py
  • backend/tests/api/endpoints/internal/test_rag_retrieve_endpoint.py
  • backend/tests/services/rag/test_retrieval_service.py
  • backend/tests/services/test_kb_meta_prompt.py
  • chat_shell/chat_shell/tools/builtin/knowledge_base.py
  • chat_shell/tests/test_knowledge_base_scoped_search.py
  • docs/superpowers/specs/2026-04-04-kb-rag-retrieval-design.md
  • shared/prompts/knowledge_base.py
  • shared/tests/test_prompts.py

Comment thread backend/app/services/knowledge/knowledge_service.py
Comment thread chat_shell/chat_shell/tools/builtin/knowledge_base.py Outdated
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
chat_shell/chat_shell/tools/builtin/knowledge_base.py (1)

940-944: ⚠️ Potential issue | 🟡 Minor

ImportError fallback doesn't forward document filters.

When the ImportError exception is caught (lines 939-944), the fallback call to _retrieve_with_strategy_via_http doesn't pass document_ids or document_names. This means scoped document filtering would silently fail in this code path.

🔧 Proposed fix to forward filters in fallback
             except ImportError:
                 result = await self._retrieve_with_strategy_via_http(
                     query=query,
                     max_results=max_results,
                     route_mode=route_mode,
+                    document_ids=resolved_document_ids,
+                    document_names=document_names,
                 )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@chat_shell/chat_shell/tools/builtin/knowledge_base.py` around lines 940 -
944, The ImportError except block currently calls
_retrieve_with_strategy_via_http without forwarding document_ids and
document_names, which drops scoped filters; update the fallback call inside the
ImportError handler to pass through the same document_ids and document_names
parameters (alongside query, max_results, route_mode) so the fallback path
honors document filtering in _retrieve_with_strategy_via_http.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@backend/app/services/knowledge/knowledge_service.py`:
- Around line 866-876: The spreadsheet_count aggregation currently counts all
KnowledgeDocument.file_extension matches regardless of is_active; update the
CASE predicate used for spreadsheet_count in the method (the prompt-oriented
document stats block in knowledge_service.py) to require the document be active
as well (e.g., include KnowledgeDocument.is_active == True alongside the
file_extension IN spreadsheet_exts) so it mirrors the filtering behavior of
searchable_document_count and only counts active spreadsheet documents.

---

Outside diff comments:
In `@chat_shell/chat_shell/tools/builtin/knowledge_base.py`:
- Around line 940-944: The ImportError except block currently calls
_retrieve_with_strategy_via_http without forwarding document_ids and
document_names, which drops scoped filters; update the fallback call inside the
ImportError handler to pass through the same document_ids and document_names
parameters (alongside query, max_results, route_mode) so the fallback path
honors document filtering in _retrieve_with_strategy_via_http.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 32490fe3-ce6e-48c4-89e6-6ca8885e929e

📥 Commits

Reviewing files that changed from the base of the PR and between d7d8140 and 96a039a.

📒 Files selected for processing (5)
  • backend/app/api/endpoints/internal/rag.py
  • backend/app/services/knowledge/knowledge_service.py
  • backend/tests/services/knowledge/test_knowledge_service.py
  • chat_shell/chat_shell/tools/builtin/knowledge_base.py
  • chat_shell/tests/test_knowledge_base_scoped_search.py
✅ Files skipped from review due to trivial changes (1)
  • backend/tests/services/knowledge/test_knowledge_service.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • backend/app/api/endpoints/internal/rag.py

Comment thread backend/app/services/knowledge/knowledge_service.py
@kissghosts kissghosts force-pushed the feat/kb-rag-retrieval branch from 96a039a to 9ac8342 Compare April 7, 2026 03:45
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
backend/app/services/knowledge/knowledge_service.py (1)

857-857: Consider extracting spreadsheet extensions to a class constant.

The spreadsheet_exts list is embedded within the method. Extracting it to a class-level constant would improve maintainability and make it reusable if needed elsewhere.

♻️ Proposed refactor

Add near line 961 (alongside NOTEBOOK_MAX_DOCUMENTS):

SPREADSHEET_EXTENSIONS = ["csv", "xls", "xlsx"]

Then update line 857:

-        spreadsheet_exts = ["csv", "xls", "xlsx"]
+        spreadsheet_exts = KnowledgeService.SPREADSHEET_EXTENSIONS

As per coding guidelines: "Python: Extract magic numbers to constants".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/services/knowledge/knowledge_service.py` at line 857, Extract the
inline list spreadsheet_exts = ["csv", "xls", "xlsx"] into a class-level
constant named SPREADSHEET_EXTENSIONS (placed alongside NOTEBOOK_MAX_DOCUMENTS
in the same class or module) and then replace the inline spreadsheet_exts usage
in the method (where spreadsheet_exts is defined) with SPREADSHEET_EXTENSIONS to
remove the magic literal and make it reusable; ensure imports/refs remain
unchanged and run tests to verify behavior in KnowledgeService
(knowledge_service.py).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@backend/app/services/knowledge/knowledge_service.py`:
- Line 857: Extract the inline list spreadsheet_exts = ["csv", "xls", "xlsx"]
into a class-level constant named SPREADSHEET_EXTENSIONS (placed alongside
NOTEBOOK_MAX_DOCUMENTS in the same class or module) and then replace the inline
spreadsheet_exts usage in the method (where spreadsheet_exts is defined) with
SPREADSHEET_EXTENSIONS to remove the magic literal and make it reusable; ensure
imports/refs remain unchanged and run tests to verify behavior in
KnowledgeService (knowledge_service.py).

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: dfbdd1f1-ddb5-4aa2-96c6-8b45268c83ad

📥 Commits

Reviewing files that changed from the base of the PR and between 96a039a and 9ac8342.

📒 Files selected for processing (4)
  • backend/app/services/knowledge/knowledge_service.py
  • backend/tests/services/knowledge/test_knowledge_service.py
  • shared/prompts/knowledge_base.py
  • shared/tests/test_prompts.py
✅ Files skipped from review due to trivial changes (1)
  • backend/tests/services/knowledge/test_knowledge_service.py
🚧 Files skipped from review as they are similar to previous changes (2)
  • shared/tests/test_prompts.py
  • shared/prompts/knowledge_base.py

@qdaxb qdaxb merged commit d9438ba into wecode-ai:main Apr 7, 2026
20 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants