Skip to content

Feature/knowledge index base64 sanitization#894

Closed
kissghosts wants to merge 2 commits into
wecode-ai:mainfrom
kissghosts:feature/knowledge-index-base64-sanitization
Closed

Feature/knowledge index base64 sanitization#894
kissghosts wants to merge 2 commits into
wecode-ai:mainfrom
kissghosts:feature/knowledge-index-base64-sanitization

Conversation

@kissghosts
Copy link
Copy Markdown
Collaborator

@kissghosts kissghosts commented Mar 31, 2026

Summary by CodeRabbit

New Features

  • Added image input capability option for embedding models
  • Added text sanitization to remove binary payloads from documents during indexing
  • Added UI toggle in model configuration to enable/disable image input support for embedding models

Documentation

  • Added English and Chinese translations for image input embedding configuration

Tests

  • Added unit tests for embedding capabilities and text sanitization

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Mar 31, 2026

📝 Walkthrough

Walkthrough

This PR extends embedding model configuration to support additional input modalities (image), adds backend text sanitization utilities to replace binary payloads with readable placeholders before chunking, and introduces frontend UI components and utilities to manage image input capabilities during model configuration.

Changes

Cohort / File(s) Summary
Backend Schema
backend/app/schemas/kind.py
Added optional additional_input_modalities field to EmbeddingConfig to declare supported input modalities beyond text.
Backend Capabilities & Sanitization
backend/app/services/rag/embedding/capabilities.py, backend/app/services/rag/preprocess/text_sanitizer.py, backend/app/services/rag/preprocess/__init__.py
Introduced capability helpers (normalize_additional_input_modalities, embedding_supports_image_input) and text sanitization utilities (sanitize_text_for_indexing) to replace inline binary payloads (markdown images, PDFs, base64 blobs) with readable placeholders before indexing.
Backend Integration
backend/app/services/rag/embedding/factory.py, backend/app/services/rag/index/indexer.py
Integrated capability normalization in embedding factory and added document sanitization step in indexer before text splitting.
Frontend Schema & Utilities
frontend/src/apis/models.ts, frontend/src/features/settings/utils/embedding-model-config.ts
Extended EmbeddingConfig interface with additional_input_modalities property and added three utility functions to manage modality normalization and image capability detection.
Frontend UI Component
frontend/src/features/settings/components/ModelEditDialog.tsx
Added embeddingSupportsImageInput state with checkbox UI control, integrated buildEmbeddingConfig helper, and memoized base model options computation.
Internationalization
frontend/src/i18n/locales/en/common.json, frontend/src/i18n/locales/zh-CN/common.json
Added localized labels and hints for embedding image input capability toggle.
Tests
backend/tests/services/rag/test_embedding_capabilities.py, backend/tests/services/rag/test_text_sanitizer.py, frontend/src/__tests__/features/settings/utils/embedding-model-config.test.ts
Added comprehensive unit test coverage for capability normalization, text sanitization with multiple payload types, and frontend utility functions.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰 A hop and a skip through modalities bright,
Images now join the embedding's flight!
Sanitized text flows, base64 takes flight,
Into neat placeholders—quite a delight!
Models now know what they can handle right. ✨

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 21.88% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately describes the main feature being added: base64 sanitization for knowledge indexing. The pull request introduces comprehensive sanitization utilities for handling binary payloads, embedding model image input configuration, and related infrastructure across both frontend and backend.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (4)
backend/app/services/rag/preprocess/text_sanitizer.py (1)

21-23: Consider the false positive risk of bare base64 detection.

The 128-character threshold is reasonable, but this pattern could potentially match legitimate long alphanumeric strings (e.g., concatenated UUIDs, long hashes). Since this is for knowledge indexing where accuracy matters, consider:

  1. Adding a stricter check (e.g., valid base64 length divisible by 4, or presence of padding =)
  2. Documenting this edge case risk

The current implementation is acceptable for most use cases, but worth monitoring for false positives in production.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/services/rag/preprocess/text_sanitizer.py` around lines 21 - 23,
BARE_BASE64_PATTERN can yield false positives for long alphanumeric strings;
update the pattern (symbol: BARE_BASE64_PATTERN in text_sanitizer.py) to enforce
valid Base64 structure (require groups of 4 characters so total length is
divisible by 4 and allow optional padding of one or two '=' characters) instead
of a simple {128,} run, and add a brief inline comment documenting the remaining
edge-case risk (e.g., concatenated UUIDs or long hashes) so reviewers/ops can
monitor false positives in production.
backend/app/services/rag/embedding/factory.py (1)

178-194: Minor redundant normalization, but acceptable.

At line 193, embedding_supports_image_input(additional_input_modalities) internally calls normalize_additional_input_modalities again on the already-normalized list. This is a minor inefficiency.

If you want to optimize, you could check directly:

-            embedding_supports_image_input(additional_input_modalities),
+            "image" in additional_input_modalities,

However, since this is just logging and the list is small, the current implementation is acceptable.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/services/rag/embedding/factory.py` around lines 178 - 194, The
logging path double-normalizes modalities because
embedding_supports_image_input(normalize_additional_input_modalities(...)) will
re-run normalization on additional_input_modalities; to avoid that, compute a
supports_image boolean directly from the already-normalized
additional_input_modalities (e.g., check for the "image" modality with a small
comprehension) and pass that boolean to logger.info instead of calling
embedding_supports_image_input again; update references in the block that use
additional_input_modalities and logger.info to use the precomputed
supports_image value.
frontend/src/features/settings/components/ModelEditDialog.tsx (1)

345-346: Preserve unknown modalities across edit/save.

The dialog reduces additional_input_modalities to a single embeddingSupportsImageInput boolean on load and then reconstructs the array from that boolean on save. If an existing model already carries another modality, opening and saving it here will silently strip that value. Keep the raw modalities list in state and only add/remove 'image'.

Also applies to: 441-445, 1012-1016

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@frontend/src/features/settings/components/ModelEditDialog.tsx` around lines
345 - 346, The component currently maps additional_input_modalities down to a
boolean embeddingSupportsImageInput and then rebuilds the array on save, which
drops any unknown modalities; instead keep the original modalities array in
state (e.g., add a state like additionalInputModalities or reuse existing prop)
and update only the presence of 'image' when embeddingSupportsImageInput
toggles; modify where embeddingSupportsImageInput and
setEmbeddingSupportsImageInput are initialized to derive from the modalities
array but do not overwrite it, and on save (the save/submit handler around where
additional_input_modalities is reconstructed) merge the existing array with or
without 'image' rather than replacing the array so all unknown modalities are
preserved.
backend/app/services/rag/index/indexer.py (1)

75-90: Avoid rebuilding every Document on the clean path.

sanitize_text_for_indexing() now runs for every document, but model_dump() + Document(**payload) is only needed when the text actually changes. Reusing the original doc when replacements_count == 0 removes an avoidable deep copy from the indexing hot path.

♻️ Suggested change
 def sanitize_documents(documents: List[Document]) -> List[Document]:
     """Sanitize document text before chunking."""
     sanitized_documents: List[Document] = []
     for doc in documents:
         result = sanitize_text_for_indexing(doc.text)
+        if result.replacements_count == 0:
+            sanitized_documents.append(doc)
+            continue
+
         if result.replacements_count > 0:
             add_span_event(
                 "rag.indexer.documents.sanitized",
                 {
                     "replacements_count": str(result.replacements_count),
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/services/rag/index/indexer.py` around lines 75 - 90, The loop in
indexer.py always reconstructs a Document even when
sanitize_text_for_indexing(doc.text) makes no changes; change the logic in the
documents loop to check result.replacements_count and if it is 0 simply append
the original doc to sanitized_documents, otherwise (when replacements_count > 0)
call add_span_event and perform the existing model_dump()/payload mutation and
Document(**payload) creation so only modified documents are rebuilt; keep the
text_resource update and names (documents, sanitize_text_for_indexing,
add_span_event, Document, sanitized_documents, result, payload, doc) as in the
diff.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@backend/tests/services/rag/test_text_sanitizer.py`:
- Around line 17-18: The class attribute SUPPORTED_RETRIEVAL_METHODS on
_FakeStorageBackend is currently a mutable list which can be mutated across
tests; change it to an immutable tuple (e.g., replace ["vector"] with
("vector",)) on the _FakeStorageBackend class that inherits BaseStorageBackend
so the fake backend's supported methods cannot be altered accidentally during
tests.

---

Nitpick comments:
In `@backend/app/services/rag/embedding/factory.py`:
- Around line 178-194: The logging path double-normalizes modalities because
embedding_supports_image_input(normalize_additional_input_modalities(...)) will
re-run normalization on additional_input_modalities; to avoid that, compute a
supports_image boolean directly from the already-normalized
additional_input_modalities (e.g., check for the "image" modality with a small
comprehension) and pass that boolean to logger.info instead of calling
embedding_supports_image_input again; update references in the block that use
additional_input_modalities and logger.info to use the precomputed
supports_image value.

In `@backend/app/services/rag/index/indexer.py`:
- Around line 75-90: The loop in indexer.py always reconstructs a Document even
when sanitize_text_for_indexing(doc.text) makes no changes; change the logic in
the documents loop to check result.replacements_count and if it is 0 simply
append the original doc to sanitized_documents, otherwise (when
replacements_count > 0) call add_span_event and perform the existing
model_dump()/payload mutation and Document(**payload) creation so only modified
documents are rebuilt; keep the text_resource update and names (documents,
sanitize_text_for_indexing, add_span_event, Document, sanitized_documents,
result, payload, doc) as in the diff.

In `@backend/app/services/rag/preprocess/text_sanitizer.py`:
- Around line 21-23: BARE_BASE64_PATTERN can yield false positives for long
alphanumeric strings; update the pattern (symbol: BARE_BASE64_PATTERN in
text_sanitizer.py) to enforce valid Base64 structure (require groups of 4
characters so total length is divisible by 4 and allow optional padding of one
or two '=' characters) instead of a simple {128,} run, and add a brief inline
comment documenting the remaining edge-case risk (e.g., concatenated UUIDs or
long hashes) so reviewers/ops can monitor false positives in production.

In `@frontend/src/features/settings/components/ModelEditDialog.tsx`:
- Around line 345-346: The component currently maps additional_input_modalities
down to a boolean embeddingSupportsImageInput and then rebuilds the array on
save, which drops any unknown modalities; instead keep the original modalities
array in state (e.g., add a state like additionalInputModalities or reuse
existing prop) and update only the presence of 'image' when
embeddingSupportsImageInput toggles; modify where embeddingSupportsImageInput
and setEmbeddingSupportsImageInput are initialized to derive from the modalities
array but do not overwrite it, and on save (the save/submit handler around where
additional_input_modalities is reconstructed) merge the existing array with or
without 'image' rather than replacing the array so all unknown modalities are
preserved.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: f5b0158e-45e8-4515-936c-257474a58939

📥 Commits

Reviewing files that changed from the base of the PR and between 9596002 and d8ea857.

📒 Files selected for processing (15)
  • backend/app/schemas/kind.py
  • backend/app/services/rag/embedding/capabilities.py
  • backend/app/services/rag/embedding/factory.py
  • backend/app/services/rag/index/indexer.py
  • backend/app/services/rag/preprocess/__init__.py
  • backend/app/services/rag/preprocess/text_sanitizer.py
  • backend/tests/services/rag/test_embedding_capabilities.py
  • backend/tests/services/rag/test_text_sanitizer.py
  • docs/specs/knowledge/2026-03-31-knowledge-index-base64-sanitization-design.md
  • frontend/src/__tests__/features/settings/utils/embedding-model-config.test.ts
  • frontend/src/apis/models.ts
  • frontend/src/features/settings/components/ModelEditDialog.tsx
  • frontend/src/features/settings/utils/embedding-model-config.ts
  • frontend/src/i18n/locales/en/common.json
  • frontend/src/i18n/locales/zh-CN/common.json

Comment thread backend/tests/services/rag/test_text_sanitizer.py Outdated
@kissghosts kissghosts force-pushed the feature/knowledge-index-base64-sanitization branch from d8ea857 to 9ab56cd Compare March 31, 2026 12:57
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (3)
backend/tests/services/rag/test_text_sanitizer.py (1)

136-139: Assert splitter invocation before dereferencing call_args.

Add an explicit assert_called_once() so test failures point to the root cause directly when _index_documents does not call the splitter.

💡 Suggested patch
         indexer._index_documents(documents=documents, chunk_metadata=chunk_metadata)

+        splitter.split_documents.assert_called_once()
         sanitized_document = splitter.split_documents.call_args.args[0][0]
         assert sanitized_document.text == "![architecture]([inline image omitted])"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/tests/services/rag/test_text_sanitizer.py` around lines 136 - 139,
Add an explicit assertion that the splitter was invoked before dereferencing its
call arguments: in the test around indexer._index_documents(...) assert that
splitter.split_documents was called (e.g., use
splitter.split_documents.assert_called_once()) before reading
splitter.split_documents.call_args.args[0][0]; this ensures test failures
clearly indicate the splitter was not called rather than raising an attribute
error when accessing call_args.
frontend/src/features/settings/components/ModelEditDialog.tsx (2)

345-345: Extract the embedding section into its own hook/subcomponent.

This feature now touches state setup, edit initialization, save-time serialization, and rendering inside a 2k-line dialog. Pulling the embedding form into a dedicated module would make the new capability easier to reason about and keep handleSave slimmer.

As per coding guidelines, File size limit: If a file exceeds 1000 lines, split it into multiple sub-modules and Function length: Max 50 lines per function (preferred).

Also applies to: 441-445, 1012-1016, 1675-1693

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@frontend/src/features/settings/components/ModelEditDialog.tsx` at line 345,
The embedding-related state and UI in ModelEditDialog (e.g.,
embeddingSupportsImageInput useState, initialization logic referenced around
lines noted, and save-time serialization inside handleSave) should be extracted
into a new hook or subcomponent (e.g., useEmbeddingSettings or
EmbeddingSettingsPanel) that encapsulates: (1) local state
(embeddingSupportsImageInput and any other embedding flags), (2) an init
function to load edit values (formerly run during dialog init), (3) a
serializer/collector function used by handleSave to produce the embedding
payload, and (4) the rendering JSX for the embedding form; replace the in-file
state, init, save logic and JSX in ModelEditDialog with calls to the new
hook/component (invoke the init on dialog open and call the serializer from
handleSave) so ModelEditDialog becomes slimmer and the embedding logic is fully
contained in the new module.

76-76: Keep image-input capability single-sourced.

frontend/src/features/settings/utils/embedding-model-config.ts:23-37 already turns this flag into embeddingConfig.additional_input_modalities. Forwarding the raw boolean in formData gives onSave consumers two representations of the same state to keep aligned. Prefer deriving it from modelCRD.spec.embeddingConfig, or move all embedding-specific form state behind a single nested object.

Also applies to: 1012-1016, 1132-1152

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@frontend/src/features/settings/components/ModelEditDialog.tsx` at line 76,
The form currently forwards embeddingSupportsImageInput into formData,
duplicating state that is already derived into
embeddingConfig.additional_input_modalities in
frontend/src/features/settings/utils/embedding-model-config.ts; remove the
top-level embeddingSupportsImageInput from the ModelEditDialog form state and
consumers (e.g., where formData is built and passed to onSave) and instead
derive the image-input capability from modelCRD.spec.embeddingConfig (or place
all embedding-related fields into a single nested embeddingConfig object) so
that onSave consumers only read embeddingConfig.additional_input_modalities;
update any references in ModelEditDialog (embeddingSupportsImageInput, formData,
and onSave usage) to use that single source of truth.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@backend/tests/services/rag/test_text_sanitizer.py`:
- Around line 20-59: The _FakeStorageBackend methods lack type hints; update the
signatures for __init__, create_vector_store, index_with_metadata, retrieve,
delete_document, get_document, list_documents, and get_all_chunks to include
parameter and return type annotations using typing.List, typing.Dict, and
typing.Any and the project types BaseNode and ChunkMetadata (e.g., nodes:
List[BaseNode], embed_model: Any, chunk_metadata: ChunkMetadata) and ensure each
method declares an appropriate return type such as Dict[str, Any] or
Optional[Dict[str, Any]]/None to match the parent BaseStorageBackend patterns;
also add any necessary typing imports at the top (List, Dict, Any, Optional) and
reference these exact method names when applying changes.

---

Nitpick comments:
In `@backend/tests/services/rag/test_text_sanitizer.py`:
- Around line 136-139: Add an explicit assertion that the splitter was invoked
before dereferencing its call arguments: in the test around
indexer._index_documents(...) assert that splitter.split_documents was called
(e.g., use splitter.split_documents.assert_called_once()) before reading
splitter.split_documents.call_args.args[0][0]; this ensures test failures
clearly indicate the splitter was not called rather than raising an attribute
error when accessing call_args.

In `@frontend/src/features/settings/components/ModelEditDialog.tsx`:
- Line 345: The embedding-related state and UI in ModelEditDialog (e.g.,
embeddingSupportsImageInput useState, initialization logic referenced around
lines noted, and save-time serialization inside handleSave) should be extracted
into a new hook or subcomponent (e.g., useEmbeddingSettings or
EmbeddingSettingsPanel) that encapsulates: (1) local state
(embeddingSupportsImageInput and any other embedding flags), (2) an init
function to load edit values (formerly run during dialog init), (3) a
serializer/collector function used by handleSave to produce the embedding
payload, and (4) the rendering JSX for the embedding form; replace the in-file
state, init, save logic and JSX in ModelEditDialog with calls to the new
hook/component (invoke the init on dialog open and call the serializer from
handleSave) so ModelEditDialog becomes slimmer and the embedding logic is fully
contained in the new module.
- Line 76: The form currently forwards embeddingSupportsImageInput into
formData, duplicating state that is already derived into
embeddingConfig.additional_input_modalities in
frontend/src/features/settings/utils/embedding-model-config.ts; remove the
top-level embeddingSupportsImageInput from the ModelEditDialog form state and
consumers (e.g., where formData is built and passed to onSave) and instead
derive the image-input capability from modelCRD.spec.embeddingConfig (or place
all embedding-related fields into a single nested embeddingConfig object) so
that onSave consumers only read embeddingConfig.additional_input_modalities;
update any references in ModelEditDialog (embeddingSupportsImageInput, formData,
and onSave usage) to use that single source of truth.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: ee1ff31d-1231-4835-a41e-8ef4e5ca7a8d

📥 Commits

Reviewing files that changed from the base of the PR and between d8ea857 and 9ab56cd.

📒 Files selected for processing (14)
  • backend/app/schemas/kind.py
  • backend/app/services/rag/embedding/capabilities.py
  • backend/app/services/rag/embedding/factory.py
  • backend/app/services/rag/index/indexer.py
  • backend/app/services/rag/preprocess/__init__.py
  • backend/app/services/rag/preprocess/text_sanitizer.py
  • backend/tests/services/rag/test_embedding_capabilities.py
  • backend/tests/services/rag/test_text_sanitizer.py
  • frontend/src/__tests__/features/settings/utils/embedding-model-config.test.ts
  • frontend/src/apis/models.ts
  • frontend/src/features/settings/components/ModelEditDialog.tsx
  • frontend/src/features/settings/utils/embedding-model-config.ts
  • frontend/src/i18n/locales/en/common.json
  • frontend/src/i18n/locales/zh-CN/common.json
✅ Files skipped from review due to trivial changes (10)
  • backend/app/services/rag/preprocess/init.py
  • backend/app/services/rag/embedding/factory.py
  • frontend/src/i18n/locales/en/common.json
  • frontend/src/i18n/locales/zh-CN/common.json
  • frontend/src/tests/features/settings/utils/embedding-model-config.test.ts
  • backend/tests/services/rag/test_embedding_capabilities.py
  • backend/app/services/rag/embedding/capabilities.py
  • frontend/src/features/settings/utils/embedding-model-config.ts
  • backend/app/schemas/kind.py
  • backend/app/services/rag/preprocess/text_sanitizer.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • backend/app/services/rag/index/indexer.py

Comment on lines +20 to +59
def __init__(self):
super().__init__({})
self.indexed_nodes = None

def create_vector_store(self, index_name: str):
return None

def index_with_metadata(self, nodes, chunk_metadata, embed_model, **kwargs):
self.indexed_nodes = nodes
return {
"indexed_count": len(nodes),
"index_name": "test-index",
"status": "success",
}

def retrieve(
self,
knowledge_id,
query,
embed_model,
retrieval_setting,
metadata_condition=None,
**kwargs,
):
return {"records": []}

def delete_document(self, knowledge_id, doc_ref, **kwargs):
return {"status": "success"}

def get_document(self, knowledge_id, doc_ref, **kwargs):
return {}

def list_documents(self, knowledge_id, page=1, page_size=20, **kwargs):
return {}

def test_connection(self) -> bool:
return True

def get_all_chunks(self, knowledge_id, max_chunks=10000, **kwargs):
return []
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify missing type annotations in this file (read-only)
python - <<'PY'
import ast
from pathlib import Path

path = Path("backend/tests/services/rag/test_text_sanitizer.py")
tree = ast.parse(path.read_text())
for node in ast.walk(tree):
    if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
        missing_args = []
        args = node.args.posonlyargs + node.args.args + node.args.kwonlyargs
        for a in args:
            if a.arg == "self":
                continue
            if a.annotation is None:
                missing_args.append(a.arg)
        missing_return = node.returns is None
        if missing_args or missing_return:
            print(f"{node.name} (line {node.lineno}) -> missing args: {missing_args}, missing return: {missing_return}")
PY

Repository: wecode-ai/Wegent

Length of output: 826


🏁 Script executed:

cat -n backend/tests/services/rag/test_text_sanitizer.py | head -70

Repository: wecode-ai/Wegent

Length of output: 2617


🏁 Script executed:

# Find the parent class and understand what _FakeStorageBackend inherits from
rg "class _FakeStorageBackend" -A 2 backend/tests/services/rag/test_text_sanitizer.py

Repository: wecode-ai/Wegent

Length of output: 153


🏁 Script executed:

# Look for the actual storage backend implementation to understand correct types
fd -e py -path "**/services/rag/*" | grep -E "(backend|storage)" | head -20

Repository: wecode-ai/Wegent

Length of output: 228


🏁 Script executed:

# Search for TextNode and ChunkMetadata types in codebase
rg "class TextNode|from.*TextNode|import.*TextNode" -t py | head -10

Repository: wecode-ai/Wegent

Length of output: 154


🏁 Script executed:

# Search for the actual backend interface/implementation
rg "def index_with_metadata|def retrieve|def delete_document" backend/ --type py | head -15

Repository: wecode-ai/Wegent

Length of output: 1414


🏁 Script executed:

cat -n backend/app/services/rag/storage/base.py | head -100

Repository: wecode-ai/Wegent

Length of output: 3994


🏁 Script executed:

# Get the full method signatures from BaseStorageBackend
rg "def index_with_metadata|def retrieve|def delete_document|def __init__" backend/app/services/rag/storage/base.py -A 5

Repository: wecode-ai/Wegent

Length of output: 714


🏁 Script executed:

# Check the actual implementation in elasticsearch_backend to see real type hints
cat -n backend/app/services/rag/storage/elasticsearch_backend.py | head -150

Repository: wecode-ai/Wegent

Length of output: 6602


Add type annotations to _FakeStorageBackend method signatures.

All methods lack parameter and return type annotations, violating the repository's Python type hints requirement. Match the parent class BaseStorageBackend signature patterns:

  • nodes should be List[BaseNode] (not TextNode)
  • embed_model should be Any (parent class also leaves this untyped)
  • All methods should declare return types; most return Dict or None
  • Use typing.Dict and typing.List for consistency with the codebase

For example, index_with_metadata should be:

def index_with_metadata(
    self,
    nodes: List[BaseNode],
    chunk_metadata: ChunkMetadata,
    embed_model: Any,
    **kwargs: Any,
) -> Dict[str, Any]:

Apply similar typing to all 8 methods (__init__, create_vector_store, index_with_metadata, retrieve, delete_document, get_document, list_documents, get_all_chunks).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/tests/services/rag/test_text_sanitizer.py` around lines 20 - 59, The
_FakeStorageBackend methods lack type hints; update the signatures for __init__,
create_vector_store, index_with_metadata, retrieve, delete_document,
get_document, list_documents, and get_all_chunks to include parameter and return
type annotations using typing.List, typing.Dict, and typing.Any and the project
types BaseNode and ChunkMetadata (e.g., nodes: List[BaseNode], embed_model: Any,
chunk_metadata: ChunkMetadata) and ensure each method declares an appropriate
return type such as Dict[str, Any] or Optional[Dict[str, Any]]/None to match the
parent BaseStorageBackend patterns; also add any necessary typing imports at the
top (List, Dict, Any, Optional) and reference these exact method names when
applying changes.

@kissghosts kissghosts closed this Apr 23, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant