LCORE-1444: Solr enrichment not working in library mode#1335
Conversation
WalkthroughThis PR refactors chunk filtering logic in the OKP (Okapi) RAG configuration. The Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes 🚥 Pre-merge checks | ✅ 3✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
📝 Coding Plan
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment Tip CodeRabbit can approve the review once all CodeRabbit's comments are resolved.Enable the |
There was a problem hiding this comment.
Actionable comments posted: 6
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/llama_stack_configuration.py (1)
418-468:⚠️ Potential issue | 🟠 MajorExisting Solr providers are never updated with the new chunk filter.
If
okp_solralready exists, the code skips provider creation and never writes the newchunk_filter_query, so internal/default filtering can be missed.Proposed fix
existing_providers = [ p.get("provider_id") for p in ls_config["providers"]["vector_io"] ] - if constants.SOLR_PROVIDER_ID not in existing_providers: + if constants.SOLR_PROVIDER_ID not in existing_providers: @@ logger.info("Added OKP provider to providers/vector_io") + else: + for provider in ls_config["providers"]["vector_io"]: + if provider.get("provider_id") == constants.SOLR_PROVIDER_ID: + provider.setdefault("config", {}) + provider["config"].setdefault("chunk_window_config", {}) + provider["config"]["chunk_window_config"][ + "chunk_filter_query" + ] = chunk_filter_query + logger.info("Updated OKP provider chunk_filter_query") + break🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/llama_stack_configuration.py` around lines 418 - 468, The code only appends a Solr provider when constants.SOLR_PROVIDER_ID is missing, so existing providers never get their chunk_filter_query updated; modify the branch that checks existing_providers (and the else branch when constants.SOLR_PROVIDER_ID is present) to locate the existing provider object in ls_config["providers"]["vector_io"] whose "provider_id" == constants.SOLR_PROVIDER_ID and update its ["config"]["chunk_window_config"]["chunk_filter_query"] = chunk_filter_query (and create the nested dicts if missing), and emit a logger.info/log.debug message indicating the provider was updated; keep the creation flow for the new provider unchanged.
🧹 Nitpick comments (2)
tests/unit/models/config/test_rag_configuration.py (1)
83-87: Add a regression test that rejects user-suppliedis_chunk:*.Given the new contract, add a negative test to prevent future regressions where users can inject the reserved internal clause.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/unit/models/config/test_rag_configuration.py` around lines 83 - 87, Add a negative regression test that ensures OkpConfiguration rejects user-supplied internal clause "is_chunk:*": create a new test (e.g., test_reject_is_chunk_clause) that attempts to construct OkpConfiguration(chunk_filter_query="is_chunk:*" or "foo AND is_chunk:true") and asserts that this raises the validation error (ValueError or the specific exception raised by the configuration validation). Reference the existing test_custom_chunk_filter_query and the OkpConfiguration class/constructor to locate where to add the test.tests/unit/test_llama_stack_configuration.py (1)
501-525: Good test coverage for the new chunk filter behavior.The tests correctly validate the core functionality: default internal filter application and AND conjunction with user-provided filters.
Consider adding an edge case test to verify behavior when a user inadvertently includes
is_chunk:truein their filter. Depending on the implementation, this could result in a duplicated filter like"is_chunk:true AND is_chunk:true".📝 Optional: Additional edge case test
def test_enrich_solr_user_chunk_filter_query_with_duplicate_internal_filter() -> None: """Test enrich_solr handles user filter that already contains is_chunk:true.""" ls_config: dict[str, Any] = {} enrich_solr(ls_config, _OKP_RAG_CONFIG, {"chunk_filter_query": "is_chunk:true AND product:ansible"}) provider = next( p for p in ls_config["providers"]["vector_io"] if p["provider_id"] == "okp_solr" ) # Verify expected behavior - document whether duplication is acceptable or if it should be deduplicated chunk_filter = provider["config"]["chunk_window_config"]["chunk_filter_query"] assert "is_chunk:true" in chunk_filter🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/unit/test_llama_stack_configuration.py` around lines 501 - 525, Add an edge-case unit test named test_enrich_solr_user_chunk_filter_query_with_duplicate_internal_filter that calls enrich_solr with a user chunk_filter_query that already contains "is_chunk:true" (e.g., "is_chunk:true AND product:ansible"), then locate the okp_solr provider and assert the provider["config"]["chunk_window_config"]["chunk_filter_query"] contains "is_chunk:true" but does not contain a duplicated "is_chunk:true AND is_chunk:true" sequence; this documents expected behavior and prevents regressions in enrich_solr's chunk filter composition.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@docs/config.md`:
- Line 411: Update the docs for the chunk_filter_query field to explicitly state
that the system always injects the internal filter is_chunk:true and that
chunk_filter_query is additive-only (user-provided conditions are ANDed with the
injected is_chunk:true filter); mention the field name chunk_filter_query and
the internal flag is_chunk:true so readers know the system-enforced behavior and
that they cannot override or remove is_chunk:true via chunk_filter_query.
In `@docs/openapi.json`:
- Line 4797: The GET and POST operations for the /a2a path share the same
operationId ("handle_a2a_jsonrpc_a2a_post"), which breaks OpenAPI consumers;
update the GET operation's operationId to a unique name (for example
"handle_a2a_jsonrpc_a2a_get") so each operationId is distinct, ensuring you
change the operationId field on the /a2a GET entry rather than the POST entry
and keep the rest of the GET operation intact.
In `@examples/lightspeed-stack-byok-okp-rag.yaml`:
- Around line 69-71: Add a short explicit note next to the chunk_filter_query
example clarifying that the system always appends is_chunk:true internally, so
users should only supply additional Solr filters and must not include any
is_chunk:* clause themselves; reference the chunk_filter_query config key and
the is_chunk:true field so reviewers can locate the example to update.
In `@src/client.py`:
- Around line 79-91: yaml.safe_load can return None (or non-dict) which causes
enrich_byok_rag and enrich_solr to fail; after loading ls_config, check if it's
a mapping (isinstance(ls_config, dict)), and if not log a warning and replace it
with an empty dict (ls_config = {}) before calling enrich_byok_rag and
enrich_solr so the enrichment helpers always receive a dict; keep existing
logger usage and the same variables (ls_config, configuration, enrich_byok_rag,
enrich_solr).
In `@src/llama_stack_configuration.py`:
- Around line 404-408: The composed Solr chunk_filter_query currently
concatenates constants.SOLR_CHUNK_FILTER_QUERY and user_filter without grouping,
which breaks boolean grouping when user_filter contains OR; update the
composition of chunk_filter_query (the variable chunk_filter_query where
constants.SOLR_CHUNK_FILTER_QUERY and user_filter are combined) to wrap
user_filter in parentheses before the AND, e.g. use "(" + user_filter + ")" or
an f-string that places user_filter inside parentheses so the entire
user-provided clause is grouped: keep the same conditional logic (if user_filter
else constants.SOLR_CHUNK_FILTER_QUERY) but ensure user_filter is parenthesized.
In `@src/models/config.py`:
- Around line 1743-1748: The chunk_filter_query Field currently allows
user-supplied predicates like "is_chunk:..." which conflicts with the internal
always-on filter; add a validation step for chunk_filter_query (e.g., a pydantic
`@validator` for the chunk_filter_query field, name it
validate_chunk_filter_query) that rejects any value containing the reserved
predicate "is_chunk" (case-insensitive, e.g. detect patterns like
"is_chunk\s*:") and raise a ValueError with a clear message asking users to
remove is_chunk predicates; ensure the validator returns the original value when
valid.
---
Outside diff comments:
In `@src/llama_stack_configuration.py`:
- Around line 418-468: The code only appends a Solr provider when
constants.SOLR_PROVIDER_ID is missing, so existing providers never get their
chunk_filter_query updated; modify the branch that checks existing_providers
(and the else branch when constants.SOLR_PROVIDER_ID is present) to locate the
existing provider object in ls_config["providers"]["vector_io"] whose
"provider_id" == constants.SOLR_PROVIDER_ID and update its
["config"]["chunk_window_config"]["chunk_filter_query"] = chunk_filter_query
(and create the nested dicts if missing), and emit a logger.info/log.debug
message indicating the provider was updated; keep the creation flow for the new
provider unchanged.
---
Nitpick comments:
In `@tests/unit/models/config/test_rag_configuration.py`:
- Around line 83-87: Add a negative regression test that ensures
OkpConfiguration rejects user-supplied internal clause "is_chunk:*": create a
new test (e.g., test_reject_is_chunk_clause) that attempts to construct
OkpConfiguration(chunk_filter_query="is_chunk:*" or "foo AND is_chunk:true") and
asserts that this raises the validation error (ValueError or the specific
exception raised by the configuration validation). Reference the existing
test_custom_chunk_filter_query and the OkpConfiguration class/constructor to
locate where to add the test.
In `@tests/unit/test_llama_stack_configuration.py`:
- Around line 501-525: Add an edge-case unit test named
test_enrich_solr_user_chunk_filter_query_with_duplicate_internal_filter that
calls enrich_solr with a user chunk_filter_query that already contains
"is_chunk:true" (e.g., "is_chunk:true AND product:ansible"), then locate the
okp_solr provider and assert the
provider["config"]["chunk_window_config"]["chunk_filter_query"] contains
"is_chunk:true" but does not contain a duplicated "is_chunk:true AND
is_chunk:true" sequence; this documents expected behavior and prevents
regressions in enrich_solr's chunk filter composition.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: bb781fd9-73b3-4fb1-863b-8e42ae90bbdb
📒 Files selected for processing (12)
docs/config.mddocs/openapi.jsondocs/openapi.mddocs/rag_guide.mdexamples/lightspeed-stack-byok-okp-rag.yamlsrc/client.pysrc/constants.pysrc/llama_stack_configuration.pysrc/models/config.pytests/unit/models/config/test_dump_configuration.pytests/unit/models/config/test_rag_configuration.pytests/unit/test_llama_stack_configuration.py
| with open(input_config_path, "r", encoding="utf-8") as f: | ||
| ls_config = yaml.safe_load(f) | ||
| except (OSError, yaml.YAMLError) as e: | ||
| logger.warning("Failed to read llama-stack config: %s", e) | ||
| return input_config_path | ||
|
|
||
| enrich_byok_rag(ls_config, byok_rag) | ||
| config = configuration.configuration | ||
|
|
||
| # Enrichment: BYOK RAG | ||
| enrich_byok_rag(ls_config, [b.model_dump() for b in config.byok_rag]) | ||
|
|
||
| # Enrichment: Solr - enabled when "okp" appears in either inline or tool list | ||
| enrich_solr(ls_config, config.rag.model_dump(), config.okp.model_dump()) |
There was a problem hiding this comment.
🧩 Analysis chain
🌐 Web query:
What does yaml.safe_load from PyYAML return for an empty YAML document?
💡 Result:
For an empty YAML stream/document (e.g., "" or a file containing only whitespace/comments), yaml.safe_load(...) returns None. [1], [2]
🏁 Script executed:
# Find enrichment helper functions
fd "\.py$" src/ | xargs grep -l "def enrich_byok_rag\|def enrich_solr"Repository: lightspeed-core/lightspeed-stack
Length of output: 108
🏁 Script executed:
# Get the content around the enrichment calls to understand context
rg "def enrich_byok_rag|def enrich_solr" -A 15Repository: lightspeed-core/lightspeed-stack
Length of output: 2310
🏁 Script executed:
# Check if ls_config is ever checked for None elsewhere in the file
rg "ls_config" src/client.py -B 2 -A 2Repository: lightspeed-core/lightspeed-stack
Length of output: 924
Guard against non-dict YAML payloads before enrichment.
yaml.safe_load returns None for empty or comment-only YAML; enrichment helpers expect a mapping and will crash with TypeError when accessing dict operations on None.
Proposed fix
try:
with open(input_config_path, "r", encoding="utf-8") as f:
ls_config = yaml.safe_load(f)
except (OSError, yaml.YAMLError) as e:
logger.warning("Failed to read llama-stack config: %s", e)
return input_config_path
+ if not isinstance(ls_config, dict):
+ logger.warning(
+ "Invalid llama-stack config format in %s: expected mapping at root",
+ input_config_path,
+ )
+ return input_config_path🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/client.py` around lines 79 - 91, yaml.safe_load can return None (or
non-dict) which causes enrich_byok_rag and enrich_solr to fail; after loading
ls_config, check if it's a mapping (isinstance(ls_config, dict)), and if not log
a warning and replace it with an empty dict (ls_config = {}) before calling
enrich_byok_rag and enrich_solr so the enrichment helpers always receive a dict;
keep existing logger usage and the same variables (ls_config, configuration,
enrich_byok_rag, enrich_solr).
tisnik
left a comment
There was a problem hiding this comment.
Please do not update openapi.md, it's generation is broken now. Simply remove it from the PR
- The solr enrichment script runs also when running lightspeed-stack in library mode - Hid is_chunk:true filtering attribute from the user
b9137bc to
cf588c1
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
♻️ Duplicate comments (3)
src/models/config.py (1)
1743-1748:⚠️ Potential issue | 🟠 MajorReject reserved
is_chunk:*predicates inchunk_filter_query.The field is now hidden from users in the schema, but it still accepts user-supplied
is_chunk:*clauses. That breaks the additive-only contract and lets callers override or contradict the system-managed chunk filter.Proposed fix
class OkpConfiguration(ConfigurationBase): @@ chunk_filter_query: Optional[str] = Field( default=None, title="OKP chunk filter query", description="Additional OKP filter query applied to every OKP search request. " "Use Solr boolean syntax, e.g. 'product:ansible AND product:*openshift*'.", ) + + `@field_validator`("chunk_filter_query") + `@classmethod` + def validate_chunk_filter_query(cls, value: Optional[str]) -> Optional[str]: + if value is None: + return value + if re.search(r"\bis_chunk\s*:", value, flags=re.IGNORECASE): + raise ValueError( + "Do not include 'is_chunk:*' in okp.chunk_filter_query; it is applied internally." + ) + return valueBased on learnings,
is_chunk:trueis always injected internally andchunk_filter_queryis intentionally additive-only and hidden from users.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/models/config.py` around lines 1743 - 1748, The chunk_filter_query field currently allows user-supplied "is_chunk:*" predicates which must be prohibited; add a Pydantic validator for chunk_filter_query in the Config model (use the existing Field declaration for chunk_filter_query as the identifier) that inspects the string (case-insensitive) and raises a ValueError if any "is_chunk" predicate is present (e.g. match patterns like (?i)\bis_chunk\s*:), so user input cannot include or override is_chunk:true/false; keep the field hidden in the schema as-is and provide a clear error message when validation fails.src/client.py (1)
76-91:⚠️ Potential issue | 🟠 MajorGuard non-mapping YAML roots before calling the enrichment helpers.
PyYAML documents that
safe_load()returnsNonewhen the stream has no documents. An empty or comment-only library config will therefore hitenrich_byok_rag()/enrich_solr()as a non-dict and fail on the first mapping operation instead of following the existing fallback path. (pyyaml.org)Proposed fix
try: with open(input_config_path, "r", encoding="utf-8") as f: ls_config = yaml.safe_load(f) except (OSError, yaml.YAMLError) as e: logger.warning("Failed to read llama-stack config: %s", e) return input_config_path + + if not isinstance(ls_config, dict): + logger.warning( + "Invalid llama-stack config format in %s: expected mapping at root", + input_config_path, + ) + return input_config_path config = configuration.configuration🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/client.py` around lines 76 - 91, The loaded YAML root (ls_config from yaml.safe_load) may be None or a non-mapping, so before calling enrich_byok_rag or enrich_solr in _enrich_library_config, guard against non-dict roots (e.g., if ls_config is None or not isinstance(ls_config, dict)) and follow the existing fallback path (log a warning and return input_config_path or otherwise avoid calling enrich_byok_rag/enrich_solr). Locate _enrich_library_config and add the check immediately after safe_load and before enrich_byok_rag/enrich_solr to prevent passing a non-mapping to those helpers.src/llama_stack_configuration.py (1)
403-408:⚠️ Potential issue | 🟠 MajorWrap the user clause before
AND-ing it into the Solr filter.Solr's standard parser uses parentheses to control boolean logic, and Lucene's parser docs likewise call out parentheses as the way to set precedence inside boolean clauses. Without grouping
user_filter, anyORinside it can escape the intendedAND (...)scope and broaden the final chunk filter. (solr.apache.org)Proposed fix
chunk_filter_query = ( - f"{constants.SOLR_CHUNK_FILTER_QUERY} AND {user_filter}" + f"{constants.SOLR_CHUNK_FILTER_QUERY} AND ({user_filter})" if user_filter else constants.SOLR_CHUNK_FILTER_QUERY )🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/llama_stack_configuration.py` around lines 403 - 408, The current construction of chunk_filter_query concatenates constants.SOLR_CHUNK_FILTER_QUERY and user_filter directly, which allows any ORs in user_filter to escape the intended AND scope; update the logic that builds chunk_filter_query (where user_filter and constants.SOLR_CHUNK_FILTER_QUERY are used) to wrap the user_filter in parentheses before joining (i.e., use "(" + user_filter + ")" when user_filter is present) so the resulting Solr filter preserves correct boolean precedence.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/llama_stack_configuration.py`:
- Line 560: The call enrich_solr(ls_config, config.get("rag", {}),
config.get("okp", {})) can pass None if the YAML explicitly sets rag or okp to
null; normalize those values before calling enrich_solr by replacing
config.get("rag", {}) and config.get("okp", {}) with expressions that coerce
falsy/None to empty dicts (e.g., rag = config.get("rag") or {}; okp =
config.get("okp") or {}) and then call enrich_solr(ls_config, rag, okp) so
enrich_solr always receives dicts and avoids AttributeError.
---
Duplicate comments:
In `@src/client.py`:
- Around line 76-91: The loaded YAML root (ls_config from yaml.safe_load) may be
None or a non-mapping, so before calling enrich_byok_rag or enrich_solr in
_enrich_library_config, guard against non-dict roots (e.g., if ls_config is None
or not isinstance(ls_config, dict)) and follow the existing fallback path (log a
warning and return input_config_path or otherwise avoid calling
enrich_byok_rag/enrich_solr). Locate _enrich_library_config and add the check
immediately after safe_load and before enrich_byok_rag/enrich_solr to prevent
passing a non-mapping to those helpers.
In `@src/llama_stack_configuration.py`:
- Around line 403-408: The current construction of chunk_filter_query
concatenates constants.SOLR_CHUNK_FILTER_QUERY and user_filter directly, which
allows any ORs in user_filter to escape the intended AND scope; update the logic
that builds chunk_filter_query (where user_filter and
constants.SOLR_CHUNK_FILTER_QUERY are used) to wrap the user_filter in
parentheses before joining (i.e., use "(" + user_filter + ")" when user_filter
is present) so the resulting Solr filter preserves correct boolean precedence.
In `@src/models/config.py`:
- Around line 1743-1748: The chunk_filter_query field currently allows
user-supplied "is_chunk:*" predicates which must be prohibited; add a Pydantic
validator for chunk_filter_query in the Config model (use the existing Field
declaration for chunk_filter_query as the identifier) that inspects the string
(case-insensitive) and raises a ValueError if any "is_chunk" predicate is
present (e.g. match patterns like (?i)\bis_chunk\s*:), so user input cannot
include or override is_chunk:true/false; keep the field hidden in the schema
as-is and provide a clear error message when validation fails.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: b0339113-5cf3-41b1-b966-80bab1497113
📒 Files selected for processing (10)
docs/config.mddocs/rag_guide.mdexamples/lightspeed-stack-byok-okp-rag.yamlsrc/client.pysrc/constants.pysrc/llama_stack_configuration.pysrc/models/config.pytests/unit/models/config/test_dump_configuration.pytests/unit/models/config/test_rag_configuration.pytests/unit/test_llama_stack_configuration.py
🚧 Files skipped from review as they are similar to previous changes (7)
- examples/lightspeed-stack-byok-okp-rag.yaml
- tests/unit/models/config/test_dump_configuration.py
- docs/config.md
- docs/rag_guide.md
- src/constants.py
- tests/unit/models/config/test_rag_configuration.py
- tests/unit/test_llama_stack_configuration.py
| enrich_solr( | ||
| ls_config, {"enabled": okp_enabled, "chunk_filter_query": chunk_filter_query} | ||
| ) | ||
| enrich_solr(ls_config, config.get("rag", {}), config.get("okp", {})) |
There was a problem hiding this comment.
Normalize nullable rag / okp sections before passing them down.
dict.get(key, {}) only uses the fallback when the key is missing. If the YAML explicitly sets rag or okp to null, this call passes None into enrich_solr(), and the first .get() there raises AttributeError.
Proposed fix
- enrich_solr(ls_config, config.get("rag", {}), config.get("okp", {}))
+ enrich_solr(ls_config, config.get("rag") or {}, config.get("okp") or {})📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| enrich_solr(ls_config, config.get("rag", {}), config.get("okp", {})) | |
| enrich_solr(ls_config, config.get("rag") or {}, config.get("okp") or {}) |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/llama_stack_configuration.py` at line 560, The call
enrich_solr(ls_config, config.get("rag", {}), config.get("okp", {})) can pass
None if the YAML explicitly sets rag or okp to null; normalize those values
before calling enrich_solr by replacing config.get("rag", {}) and
config.get("okp", {}) with expressions that coerce falsy/None to empty dicts
(e.g., rag = config.get("rag") or {}; okp = config.get("okp") or {}) and then
call enrich_solr(ls_config, rag, okp) so enrich_solr always receives dicts and
avoids AttributeError.
|
LGTM! |
Description
Changes:
is_chunk:truefiltering attribute from the userThe
is_chunk:truevalue should always be included in the OKP Solr filters, thus should not be the responsibility of the user to provide this filter.Type of change
Tools used to create PR
Identify any AI code assistants used in this PR (for transparency and review context)
Related Tickets & Documents
Checklist before requesting a review
Testing
I performed such testing, configuring also BYOK rag
Summary by CodeRabbit
Release Notes
New Features
Documentation