Feature/knowledge index base64 sanitization#894
Conversation
📝 WalkthroughWalkthroughThis PR extends embedding model configuration to support additional input modalities (image), adds backend text sanitization utilities to replace binary payloads with readable placeholders before chunking, and introduces frontend UI components and utilities to manage image input capabilities during model configuration. Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Poem
🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (4)
backend/app/services/rag/preprocess/text_sanitizer.py (1)
21-23: Consider the false positive risk of bare base64 detection.The 128-character threshold is reasonable, but this pattern could potentially match legitimate long alphanumeric strings (e.g., concatenated UUIDs, long hashes). Since this is for knowledge indexing where accuracy matters, consider:
- Adding a stricter check (e.g., valid base64 length divisible by 4, or presence of padding
=)- Documenting this edge case risk
The current implementation is acceptable for most use cases, but worth monitoring for false positives in production.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/app/services/rag/preprocess/text_sanitizer.py` around lines 21 - 23, BARE_BASE64_PATTERN can yield false positives for long alphanumeric strings; update the pattern (symbol: BARE_BASE64_PATTERN in text_sanitizer.py) to enforce valid Base64 structure (require groups of 4 characters so total length is divisible by 4 and allow optional padding of one or two '=' characters) instead of a simple {128,} run, and add a brief inline comment documenting the remaining edge-case risk (e.g., concatenated UUIDs or long hashes) so reviewers/ops can monitor false positives in production.backend/app/services/rag/embedding/factory.py (1)
178-194: Minor redundant normalization, but acceptable.At line 193,
embedding_supports_image_input(additional_input_modalities)internally callsnormalize_additional_input_modalitiesagain on the already-normalized list. This is a minor inefficiency.If you want to optimize, you could check directly:
- embedding_supports_image_input(additional_input_modalities), + "image" in additional_input_modalities,However, since this is just logging and the list is small, the current implementation is acceptable.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/app/services/rag/embedding/factory.py` around lines 178 - 194, The logging path double-normalizes modalities because embedding_supports_image_input(normalize_additional_input_modalities(...)) will re-run normalization on additional_input_modalities; to avoid that, compute a supports_image boolean directly from the already-normalized additional_input_modalities (e.g., check for the "image" modality with a small comprehension) and pass that boolean to logger.info instead of calling embedding_supports_image_input again; update references in the block that use additional_input_modalities and logger.info to use the precomputed supports_image value.frontend/src/features/settings/components/ModelEditDialog.tsx (1)
345-346: Preserve unknown modalities across edit/save.The dialog reduces
additional_input_modalitiesto a singleembeddingSupportsImageInputboolean on load and then reconstructs the array from that boolean on save. If an existing model already carries another modality, opening and saving it here will silently strip that value. Keep the raw modalities list in state and only add/remove'image'.Also applies to: 441-445, 1012-1016
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@frontend/src/features/settings/components/ModelEditDialog.tsx` around lines 345 - 346, The component currently maps additional_input_modalities down to a boolean embeddingSupportsImageInput and then rebuilds the array on save, which drops any unknown modalities; instead keep the original modalities array in state (e.g., add a state like additionalInputModalities or reuse existing prop) and update only the presence of 'image' when embeddingSupportsImageInput toggles; modify where embeddingSupportsImageInput and setEmbeddingSupportsImageInput are initialized to derive from the modalities array but do not overwrite it, and on save (the save/submit handler around where additional_input_modalities is reconstructed) merge the existing array with or without 'image' rather than replacing the array so all unknown modalities are preserved.backend/app/services/rag/index/indexer.py (1)
75-90: Avoid rebuilding everyDocumenton the clean path.
sanitize_text_for_indexing()now runs for every document, butmodel_dump()+Document(**payload)is only needed when the text actually changes. Reusing the originaldocwhenreplacements_count == 0removes an avoidable deep copy from the indexing hot path.♻️ Suggested change
def sanitize_documents(documents: List[Document]) -> List[Document]: """Sanitize document text before chunking.""" sanitized_documents: List[Document] = [] for doc in documents: result = sanitize_text_for_indexing(doc.text) + if result.replacements_count == 0: + sanitized_documents.append(doc) + continue + if result.replacements_count > 0: add_span_event( "rag.indexer.documents.sanitized", { "replacements_count": str(result.replacements_count),🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/app/services/rag/index/indexer.py` around lines 75 - 90, The loop in indexer.py always reconstructs a Document even when sanitize_text_for_indexing(doc.text) makes no changes; change the logic in the documents loop to check result.replacements_count and if it is 0 simply append the original doc to sanitized_documents, otherwise (when replacements_count > 0) call add_span_event and perform the existing model_dump()/payload mutation and Document(**payload) creation so only modified documents are rebuilt; keep the text_resource update and names (documents, sanitize_text_for_indexing, add_span_event, Document, sanitized_documents, result, payload, doc) as in the diff.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@backend/tests/services/rag/test_text_sanitizer.py`:
- Around line 17-18: The class attribute SUPPORTED_RETRIEVAL_METHODS on
_FakeStorageBackend is currently a mutable list which can be mutated across
tests; change it to an immutable tuple (e.g., replace ["vector"] with
("vector",)) on the _FakeStorageBackend class that inherits BaseStorageBackend
so the fake backend's supported methods cannot be altered accidentally during
tests.
---
Nitpick comments:
In `@backend/app/services/rag/embedding/factory.py`:
- Around line 178-194: The logging path double-normalizes modalities because
embedding_supports_image_input(normalize_additional_input_modalities(...)) will
re-run normalization on additional_input_modalities; to avoid that, compute a
supports_image boolean directly from the already-normalized
additional_input_modalities (e.g., check for the "image" modality with a small
comprehension) and pass that boolean to logger.info instead of calling
embedding_supports_image_input again; update references in the block that use
additional_input_modalities and logger.info to use the precomputed
supports_image value.
In `@backend/app/services/rag/index/indexer.py`:
- Around line 75-90: The loop in indexer.py always reconstructs a Document even
when sanitize_text_for_indexing(doc.text) makes no changes; change the logic in
the documents loop to check result.replacements_count and if it is 0 simply
append the original doc to sanitized_documents, otherwise (when
replacements_count > 0) call add_span_event and perform the existing
model_dump()/payload mutation and Document(**payload) creation so only modified
documents are rebuilt; keep the text_resource update and names (documents,
sanitize_text_for_indexing, add_span_event, Document, sanitized_documents,
result, payload, doc) as in the diff.
In `@backend/app/services/rag/preprocess/text_sanitizer.py`:
- Around line 21-23: BARE_BASE64_PATTERN can yield false positives for long
alphanumeric strings; update the pattern (symbol: BARE_BASE64_PATTERN in
text_sanitizer.py) to enforce valid Base64 structure (require groups of 4
characters so total length is divisible by 4 and allow optional padding of one
or two '=' characters) instead of a simple {128,} run, and add a brief inline
comment documenting the remaining edge-case risk (e.g., concatenated UUIDs or
long hashes) so reviewers/ops can monitor false positives in production.
In `@frontend/src/features/settings/components/ModelEditDialog.tsx`:
- Around line 345-346: The component currently maps additional_input_modalities
down to a boolean embeddingSupportsImageInput and then rebuilds the array on
save, which drops any unknown modalities; instead keep the original modalities
array in state (e.g., add a state like additionalInputModalities or reuse
existing prop) and update only the presence of 'image' when
embeddingSupportsImageInput toggles; modify where embeddingSupportsImageInput
and setEmbeddingSupportsImageInput are initialized to derive from the modalities
array but do not overwrite it, and on save (the save/submit handler around where
additional_input_modalities is reconstructed) merge the existing array with or
without 'image' rather than replacing the array so all unknown modalities are
preserved.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: f5b0158e-45e8-4515-936c-257474a58939
📒 Files selected for processing (15)
backend/app/schemas/kind.pybackend/app/services/rag/embedding/capabilities.pybackend/app/services/rag/embedding/factory.pybackend/app/services/rag/index/indexer.pybackend/app/services/rag/preprocess/__init__.pybackend/app/services/rag/preprocess/text_sanitizer.pybackend/tests/services/rag/test_embedding_capabilities.pybackend/tests/services/rag/test_text_sanitizer.pydocs/specs/knowledge/2026-03-31-knowledge-index-base64-sanitization-design.mdfrontend/src/__tests__/features/settings/utils/embedding-model-config.test.tsfrontend/src/apis/models.tsfrontend/src/features/settings/components/ModelEditDialog.tsxfrontend/src/features/settings/utils/embedding-model-config.tsfrontend/src/i18n/locales/en/common.jsonfrontend/src/i18n/locales/zh-CN/common.json
d8ea857 to
9ab56cd
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (3)
backend/tests/services/rag/test_text_sanitizer.py (1)
136-139: Assert splitter invocation before dereferencingcall_args.Add an explicit
assert_called_once()so test failures point to the root cause directly when_index_documentsdoes not call the splitter.💡 Suggested patch
indexer._index_documents(documents=documents, chunk_metadata=chunk_metadata) + splitter.split_documents.assert_called_once() sanitized_document = splitter.split_documents.call_args.args[0][0] assert sanitized_document.text == ""🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/tests/services/rag/test_text_sanitizer.py` around lines 136 - 139, Add an explicit assertion that the splitter was invoked before dereferencing its call arguments: in the test around indexer._index_documents(...) assert that splitter.split_documents was called (e.g., use splitter.split_documents.assert_called_once()) before reading splitter.split_documents.call_args.args[0][0]; this ensures test failures clearly indicate the splitter was not called rather than raising an attribute error when accessing call_args.frontend/src/features/settings/components/ModelEditDialog.tsx (2)
345-345: Extract the embedding section into its own hook/subcomponent.This feature now touches state setup, edit initialization, save-time serialization, and rendering inside a 2k-line dialog. Pulling the embedding form into a dedicated module would make the new capability easier to reason about and keep
handleSaveslimmer.As per coding guidelines,
File size limit: If a file exceeds 1000 lines, split it into multiple sub-modulesandFunction length: Max 50 lines per function (preferred).Also applies to: 441-445, 1012-1016, 1675-1693
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@frontend/src/features/settings/components/ModelEditDialog.tsx` at line 345, The embedding-related state and UI in ModelEditDialog (e.g., embeddingSupportsImageInput useState, initialization logic referenced around lines noted, and save-time serialization inside handleSave) should be extracted into a new hook or subcomponent (e.g., useEmbeddingSettings or EmbeddingSettingsPanel) that encapsulates: (1) local state (embeddingSupportsImageInput and any other embedding flags), (2) an init function to load edit values (formerly run during dialog init), (3) a serializer/collector function used by handleSave to produce the embedding payload, and (4) the rendering JSX for the embedding form; replace the in-file state, init, save logic and JSX in ModelEditDialog with calls to the new hook/component (invoke the init on dialog open and call the serializer from handleSave) so ModelEditDialog becomes slimmer and the embedding logic is fully contained in the new module.
76-76: Keep image-input capability single-sourced.
frontend/src/features/settings/utils/embedding-model-config.ts:23-37already turns this flag intoembeddingConfig.additional_input_modalities. Forwarding the raw boolean informDatagivesonSaveconsumers two representations of the same state to keep aligned. Prefer deriving it frommodelCRD.spec.embeddingConfig, or move all embedding-specific form state behind a single nested object.Also applies to: 1012-1016, 1132-1152
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@frontend/src/features/settings/components/ModelEditDialog.tsx` at line 76, The form currently forwards embeddingSupportsImageInput into formData, duplicating state that is already derived into embeddingConfig.additional_input_modalities in frontend/src/features/settings/utils/embedding-model-config.ts; remove the top-level embeddingSupportsImageInput from the ModelEditDialog form state and consumers (e.g., where formData is built and passed to onSave) and instead derive the image-input capability from modelCRD.spec.embeddingConfig (or place all embedding-related fields into a single nested embeddingConfig object) so that onSave consumers only read embeddingConfig.additional_input_modalities; update any references in ModelEditDialog (embeddingSupportsImageInput, formData, and onSave usage) to use that single source of truth.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@backend/tests/services/rag/test_text_sanitizer.py`:
- Around line 20-59: The _FakeStorageBackend methods lack type hints; update the
signatures for __init__, create_vector_store, index_with_metadata, retrieve,
delete_document, get_document, list_documents, and get_all_chunks to include
parameter and return type annotations using typing.List, typing.Dict, and
typing.Any and the project types BaseNode and ChunkMetadata (e.g., nodes:
List[BaseNode], embed_model: Any, chunk_metadata: ChunkMetadata) and ensure each
method declares an appropriate return type such as Dict[str, Any] or
Optional[Dict[str, Any]]/None to match the parent BaseStorageBackend patterns;
also add any necessary typing imports at the top (List, Dict, Any, Optional) and
reference these exact method names when applying changes.
---
Nitpick comments:
In `@backend/tests/services/rag/test_text_sanitizer.py`:
- Around line 136-139: Add an explicit assertion that the splitter was invoked
before dereferencing its call arguments: in the test around
indexer._index_documents(...) assert that splitter.split_documents was called
(e.g., use splitter.split_documents.assert_called_once()) before reading
splitter.split_documents.call_args.args[0][0]; this ensures test failures
clearly indicate the splitter was not called rather than raising an attribute
error when accessing call_args.
In `@frontend/src/features/settings/components/ModelEditDialog.tsx`:
- Line 345: The embedding-related state and UI in ModelEditDialog (e.g.,
embeddingSupportsImageInput useState, initialization logic referenced around
lines noted, and save-time serialization inside handleSave) should be extracted
into a new hook or subcomponent (e.g., useEmbeddingSettings or
EmbeddingSettingsPanel) that encapsulates: (1) local state
(embeddingSupportsImageInput and any other embedding flags), (2) an init
function to load edit values (formerly run during dialog init), (3) a
serializer/collector function used by handleSave to produce the embedding
payload, and (4) the rendering JSX for the embedding form; replace the in-file
state, init, save logic and JSX in ModelEditDialog with calls to the new
hook/component (invoke the init on dialog open and call the serializer from
handleSave) so ModelEditDialog becomes slimmer and the embedding logic is fully
contained in the new module.
- Line 76: The form currently forwards embeddingSupportsImageInput into
formData, duplicating state that is already derived into
embeddingConfig.additional_input_modalities in
frontend/src/features/settings/utils/embedding-model-config.ts; remove the
top-level embeddingSupportsImageInput from the ModelEditDialog form state and
consumers (e.g., where formData is built and passed to onSave) and instead
derive the image-input capability from modelCRD.spec.embeddingConfig (or place
all embedding-related fields into a single nested embeddingConfig object) so
that onSave consumers only read embeddingConfig.additional_input_modalities;
update any references in ModelEditDialog (embeddingSupportsImageInput, formData,
and onSave usage) to use that single source of truth.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: ee1ff31d-1231-4835-a41e-8ef4e5ca7a8d
📒 Files selected for processing (14)
backend/app/schemas/kind.pybackend/app/services/rag/embedding/capabilities.pybackend/app/services/rag/embedding/factory.pybackend/app/services/rag/index/indexer.pybackend/app/services/rag/preprocess/__init__.pybackend/app/services/rag/preprocess/text_sanitizer.pybackend/tests/services/rag/test_embedding_capabilities.pybackend/tests/services/rag/test_text_sanitizer.pyfrontend/src/__tests__/features/settings/utils/embedding-model-config.test.tsfrontend/src/apis/models.tsfrontend/src/features/settings/components/ModelEditDialog.tsxfrontend/src/features/settings/utils/embedding-model-config.tsfrontend/src/i18n/locales/en/common.jsonfrontend/src/i18n/locales/zh-CN/common.json
✅ Files skipped from review due to trivial changes (10)
- backend/app/services/rag/preprocess/init.py
- backend/app/services/rag/embedding/factory.py
- frontend/src/i18n/locales/en/common.json
- frontend/src/i18n/locales/zh-CN/common.json
- frontend/src/tests/features/settings/utils/embedding-model-config.test.ts
- backend/tests/services/rag/test_embedding_capabilities.py
- backend/app/services/rag/embedding/capabilities.py
- frontend/src/features/settings/utils/embedding-model-config.ts
- backend/app/schemas/kind.py
- backend/app/services/rag/preprocess/text_sanitizer.py
🚧 Files skipped from review as they are similar to previous changes (1)
- backend/app/services/rag/index/indexer.py
| def __init__(self): | ||
| super().__init__({}) | ||
| self.indexed_nodes = None | ||
|
|
||
| def create_vector_store(self, index_name: str): | ||
| return None | ||
|
|
||
| def index_with_metadata(self, nodes, chunk_metadata, embed_model, **kwargs): | ||
| self.indexed_nodes = nodes | ||
| return { | ||
| "indexed_count": len(nodes), | ||
| "index_name": "test-index", | ||
| "status": "success", | ||
| } | ||
|
|
||
| def retrieve( | ||
| self, | ||
| knowledge_id, | ||
| query, | ||
| embed_model, | ||
| retrieval_setting, | ||
| metadata_condition=None, | ||
| **kwargs, | ||
| ): | ||
| return {"records": []} | ||
|
|
||
| def delete_document(self, knowledge_id, doc_ref, **kwargs): | ||
| return {"status": "success"} | ||
|
|
||
| def get_document(self, knowledge_id, doc_ref, **kwargs): | ||
| return {} | ||
|
|
||
| def list_documents(self, knowledge_id, page=1, page_size=20, **kwargs): | ||
| return {} | ||
|
|
||
| def test_connection(self) -> bool: | ||
| return True | ||
|
|
||
| def get_all_chunks(self, knowledge_id, max_chunks=10000, **kwargs): | ||
| return [] |
There was a problem hiding this comment.
🛠️ Refactor suggestion | 🟠 Major
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Verify missing type annotations in this file (read-only)
python - <<'PY'
import ast
from pathlib import Path
path = Path("backend/tests/services/rag/test_text_sanitizer.py")
tree = ast.parse(path.read_text())
for node in ast.walk(tree):
if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
missing_args = []
args = node.args.posonlyargs + node.args.args + node.args.kwonlyargs
for a in args:
if a.arg == "self":
continue
if a.annotation is None:
missing_args.append(a.arg)
missing_return = node.returns is None
if missing_args or missing_return:
print(f"{node.name} (line {node.lineno}) -> missing args: {missing_args}, missing return: {missing_return}")
PYRepository: wecode-ai/Wegent
Length of output: 826
🏁 Script executed:
cat -n backend/tests/services/rag/test_text_sanitizer.py | head -70Repository: wecode-ai/Wegent
Length of output: 2617
🏁 Script executed:
# Find the parent class and understand what _FakeStorageBackend inherits from
rg "class _FakeStorageBackend" -A 2 backend/tests/services/rag/test_text_sanitizer.pyRepository: wecode-ai/Wegent
Length of output: 153
🏁 Script executed:
# Look for the actual storage backend implementation to understand correct types
fd -e py -path "**/services/rag/*" | grep -E "(backend|storage)" | head -20Repository: wecode-ai/Wegent
Length of output: 228
🏁 Script executed:
# Search for TextNode and ChunkMetadata types in codebase
rg "class TextNode|from.*TextNode|import.*TextNode" -t py | head -10Repository: wecode-ai/Wegent
Length of output: 154
🏁 Script executed:
# Search for the actual backend interface/implementation
rg "def index_with_metadata|def retrieve|def delete_document" backend/ --type py | head -15Repository: wecode-ai/Wegent
Length of output: 1414
🏁 Script executed:
cat -n backend/app/services/rag/storage/base.py | head -100Repository: wecode-ai/Wegent
Length of output: 3994
🏁 Script executed:
# Get the full method signatures from BaseStorageBackend
rg "def index_with_metadata|def retrieve|def delete_document|def __init__" backend/app/services/rag/storage/base.py -A 5Repository: wecode-ai/Wegent
Length of output: 714
🏁 Script executed:
# Check the actual implementation in elasticsearch_backend to see real type hints
cat -n backend/app/services/rag/storage/elasticsearch_backend.py | head -150Repository: wecode-ai/Wegent
Length of output: 6602
Add type annotations to _FakeStorageBackend method signatures.
All methods lack parameter and return type annotations, violating the repository's Python type hints requirement. Match the parent class BaseStorageBackend signature patterns:
nodesshould beList[BaseNode](not TextNode)embed_modelshould beAny(parent class also leaves this untyped)- All methods should declare return types; most return
DictorNone - Use
typing.Dictandtyping.Listfor consistency with the codebase
For example, index_with_metadata should be:
def index_with_metadata(
self,
nodes: List[BaseNode],
chunk_metadata: ChunkMetadata,
embed_model: Any,
**kwargs: Any,
) -> Dict[str, Any]:Apply similar typing to all 8 methods (__init__, create_vector_store, index_with_metadata, retrieve, delete_document, get_document, list_documents, get_all_chunks).
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@backend/tests/services/rag/test_text_sanitizer.py` around lines 20 - 59, The
_FakeStorageBackend methods lack type hints; update the signatures for __init__,
create_vector_store, index_with_metadata, retrieve, delete_document,
get_document, list_documents, and get_all_chunks to include parameter and return
type annotations using typing.List, typing.Dict, and typing.Any and the project
types BaseNode and ChunkMetadata (e.g., nodes: List[BaseNode], embed_model: Any,
chunk_metadata: ChunkMetadata) and ensure each method declares an appropriate
return type such as Dict[str, Any] or Optional[Dict[str, Any]]/None to match the
parent BaseStorageBackend patterns; also add any necessary typing imports at the
top (List, Dict, Any, Optional) and reference these exact method names when
applying changes.
Summary by CodeRabbit
New Features
Documentation
Tests