[codex] Preserve inference coverage with cache misses#780
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
WalkthroughThe inference service now retrieves node content_hash from Neo4j, separates cacheability from inference selection in batching, maps LLM responses to requests by node_id while propagating content_hash, and persists content_hash in Neo4j updates alongside docstrings and embeddings. ChangesContent Hash Tracking Through Inference Pipeline
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@app/modules/parsing/knowledge_graph/inference_service.py`:
- Around line 1515-1516: The method currently clears content_hash on nodes when
content_hash_by_node_id is omitted because it always sets item.content_hash and
runs SET n.content_hash = item.content_hash; instead, change the logic so that
you only set item.content_hash and include the SET n.content_hash clause when
content_hash_by_node_id is not None (i.e., when the caller provided a map).
Locate the places handling the parameters precomputed_embeddings and
content_hash_by_node_id (see references to item.content_hash and the Cypher
fragment "SET n.content_hash = item.content_hash") and wrap those
assignments/SET clauses in a conditional branch (or build the update query
dynamically) so omission of content_hash_by_node_id leaves existing node
content_hash values untouched; apply the same fix for the other block noted
around the 1530-1557 region.
- Around line 1061-1070: The loop that populates content_hash_by_node_id copies
metadata["content_hash"] for every request (and its parent) without checking
cacheability, causing stale hashes to be re-applied; modify the loop in
inference_service.py to only assign content_hash_by_node_id[request.node_id]
(and parent) when the node is currently cacheable by checking the node/request's
should_cache flag (or equivalent property) before writing the hash, and skip
writing hashes for nodes where should_cache is False.
In `@tests/unit/parsing/test_inference_cache_flow.py`:
- Around line 88-90: The FakeProjectManager test stub uses a temp-directory path
"/tmp/repo" which triggers CI lint rules; update
FakeProjectManager.get_project_from_db_by_id_sync to return a stable dummy path
(e.g., "/repo" or "repo") instead of "/tmp/repo" so the test does not reference
the system temp directory.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 06b8a942-0f01-4c2b-bb91-7d96a9be3c83
📒 Files selected for processing (2)
app/modules/parsing/knowledge_graph/inference_service.pytests/unit/parsing/test_inference_cache_flow.py
There was a problem hiding this comment.
♻️ Duplicate comments (1)
tests/unit/parsing/test_inference_cache_flow.py (1)
88-90:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winReplace
/tmp/repoin the fake project path to avoid CI lint failure.Line 90 still uses
"/tmp/repo", which triggers Ruff S108 in this test and can keep CI red.🔧 Minimal fix
class FakeProjectManager: def get_project_from_db_by_id_sync(self, repo_id): - return {"repo_path": "/tmp/repo"} + return {"repo_path": "/repo"}🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/unit/parsing/test_inference_cache_flow.py` around lines 88 - 90, The test's FakeProjectManager.get_project_from_db_by_id_sync returns a hardcoded repo path "/tmp/repo" which triggers a linter rule; update that method to return a safe, non-/tmp path (e.g., use a relative path like "tmp/repo" or a clearly test-scoped path such as "/var/tmp/test-repo" or better yet generate a tempfile via tempfile.mkdtemp in the test setup) so the fake project path no longer uses "/tmp/repo" and the test remains deterministic.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Duplicate comments:
In `@tests/unit/parsing/test_inference_cache_flow.py`:
- Around line 88-90: The test's
FakeProjectManager.get_project_from_db_by_id_sync returns a hardcoded repo path
"/tmp/repo" which triggers a linter rule; update that method to return a safe,
non-/tmp path (e.g., use a relative path like "tmp/repo" or a clearly
test-scoped path such as "/var/tmp/test-repo" or better yet generate a tempfile
via tempfile.mkdtemp in the test setup) so the fake project path no longer uses
"/tmp/repo" and the test remains deterministic.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: e718de84-d067-4b41-acde-177626461504
📒 Files selected for processing (2)
app/modules/parsing/knowledge_graph/inference_service.pytests/unit/parsing/test_inference_cache_flow.py
🚧 Files skipped from review as they are similar to previous changes (1)
- app/modules/parsing/knowledge_graph/inference_service.py
There was a problem hiding this comment.
Nice work on this — splitting “run inference” from “store in cache” is exactly what #223 needed, and the batching / node_id matching changes makes sense.
Requesting one fix before merge:
When update_neo4j_with_docstrings is called with content_hash_by_node_id (always from process_batch), nodes that are not in the map still get item["content_hash"] = None and SET n.content_hash = item.content_hash, which can clear hashes on uncacheable nodes in the same batch.
Please only set/persist content_hash for node IDs present in the map (or use a Cypher coalesce/CASE so omission leaves the existing value). A small unit test for a mixed cacheable + uncacheable batch would lock this in.
After that, happy to approve.
|
|
Updated in |



Summary
Fixes part of #223 by tightening the inference cache flow so cache eligibility no longer determines whether a node receives inference.
node_idinstead of relying on response order when storing cache entries.Why
The existing cache path only batched cacheable cache misses. That made caching improve speed at the cost of inference coverage for nodes that were too small or otherwise not worth caching. The bounty success criteria include no loss in inference quality, so cacheability should only affect cache storage/reuse, not whether inference runs.
Validation
python3 -m py_compile app/modules/parsing/knowledge_graph/inference_service.py tests/unit/parsing/test_inference_cache_flow.pygit diff --checkFull pytest was not run locally because this environment does not have the project test dependencies installed.
Summary by CodeRabbit