Skip to content

[codex] Preserve inference coverage with cache misses#780

Open
AMBRA7592 wants to merge 6 commits into
potpie-ai:mainfrom
AMBRA7592:codex-inference-cache-preserve-coverage
Open

[codex] Preserve inference coverage with cache misses#780
AMBRA7592 wants to merge 6 commits into
potpie-ai:mainfrom
AMBRA7592:codex-inference-cache-preserve-coverage

Conversation

@AMBRA7592
Copy link
Copy Markdown

@AMBRA7592 AMBRA7592 commented May 15, 2026

Summary

Fixes part of #223 by tightening the inference cache flow so cache eligibility no longer determines whether a node receives inference.

  • Fetches node type and existing content hash from Neo4j for cache-aware processing.
  • Ensures non-cacheable nodes, such as small nodes or unresolved-reference nodes, still enter LLM inference batches instead of being skipped.
  • Stores content hashes back onto Neo4j nodes when writing fresh or cached inference results.
  • Matches LLM responses back to requests by node_id instead of relying on response order when storing cache entries.
  • Adds unit coverage for uncacheable-node inference coverage and content-hash persistence.

Why

The existing cache path only batched cacheable cache misses. That made caching improve speed at the cost of inference coverage for nodes that were too small or otherwise not worth caching. The bounty success criteria include no loss in inference quality, so cacheability should only affect cache storage/reuse, not whether inference runs.

Validation

  • python3 -m py_compile app/modules/parsing/knowledge_graph/inference_service.py tests/unit/parsing/test_inference_cache_flow.py
  • git diff --check

Full pytest was not run locally because this environment does not have the project test dependencies installed.

Summary by CodeRabbit

  • Improvements
    • Inference batching refined: decouples cacheability from inference selection, excludes cache-hits while still batching uncached content, and improves batch logging/metrics.
    • Persisted content-hash with inferred results, propagating hashes from chunks to parents for more reliable tracking and ensuring stored graph nodes include content-hash while conditionally clearing source text for non-local repos.
  • Tests
    • Added unit tests for inference-cache batching behavior and content-hash persistence.

Review Change Stack

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 15, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review

Walkthrough

The inference service now retrieves node content_hash from Neo4j, separates cacheability from inference selection in batching, maps LLM responses to requests by node_id while propagating content_hash, and persists content_hash in Neo4j updates alongside docstrings and embeddings.

Changes

Content Hash Tracking Through Inference Pipeline

Layer / File(s) Summary
Fetch content_hash from Neo4j graph
app/modules/parsing/knowledge_graph/inference_service.py
fetch_graph Cypher projection now returns content_hash with node attributes.
Batching logic with cacheability tracking
app/modules/parsing/knowledge_graph/inference_service.py
Batch creation computes nodes_needing_inference from cache-misses plus text, adds nodes_not_cacheable metric, and updates batching logs.
LLM response mapping and content_hash propagation
app/modules/parsing/knowledge_graph/inference_service.py
LLM batch processing builds request_by_node_id and content_hash_by_node_id (propagating chunk hashes to parent nodes), matches response.docstrings by node_id, and skips unexpected ids.
Propagation to Neo4j update payloads and Cypher
app/modules/parsing/knowledge_graph/inference_service.py
batch_update_neo4j_with_cached_inference and update_neo4j_with_docstrings include content_hash in payloads and set n.content_hash in Cypher; updates conditionally clear source fields for non-local repos.
Unit tests for cache flow and Neo4j persistence
tests/unit/parsing/test_inference_cache_flow.py
New tests instantiate InferenceService without full init, verify batching behavior with cacheable/non-cacheable nodes, and assert update_neo4j_with_docstrings persists or omits content_hash using a fake Neo4j driver and project manager.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

I hop through nodes with a curious dash,
I track each chunk and its little hash,
From fetch to LLM then mapped once more,
I write the hash where docstrings store,
A rabbit's delight: data safe ashore. 🐇✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 15.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title directly describes the main change: preserving inference coverage even with cache misses, which is the core motivation and objective of this PR.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@AMBRA7592 AMBRA7592 marked this pull request as ready for review May 15, 2026 19:43
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@app/modules/parsing/knowledge_graph/inference_service.py`:
- Around line 1515-1516: The method currently clears content_hash on nodes when
content_hash_by_node_id is omitted because it always sets item.content_hash and
runs SET n.content_hash = item.content_hash; instead, change the logic so that
you only set item.content_hash and include the SET n.content_hash clause when
content_hash_by_node_id is not None (i.e., when the caller provided a map).
Locate the places handling the parameters precomputed_embeddings and
content_hash_by_node_id (see references to item.content_hash and the Cypher
fragment "SET n.content_hash = item.content_hash") and wrap those
assignments/SET clauses in a conditional branch (or build the update query
dynamically) so omission of content_hash_by_node_id leaves existing node
content_hash values untouched; apply the same fix for the other block noted
around the 1530-1557 region.
- Around line 1061-1070: The loop that populates content_hash_by_node_id copies
metadata["content_hash"] for every request (and its parent) without checking
cacheability, causing stale hashes to be re-applied; modify the loop in
inference_service.py to only assign content_hash_by_node_id[request.node_id]
(and parent) when the node is currently cacheable by checking the node/request's
should_cache flag (or equivalent property) before writing the hash, and skip
writing hashes for nodes where should_cache is False.

In `@tests/unit/parsing/test_inference_cache_flow.py`:
- Around line 88-90: The FakeProjectManager test stub uses a temp-directory path
"/tmp/repo" which triggers CI lint rules; update
FakeProjectManager.get_project_from_db_by_id_sync to return a stable dummy path
(e.g., "/repo" or "repo") instead of "/tmp/repo" so the test does not reference
the system temp directory.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 06b8a942-0f01-4c2b-bb91-7d96a9be3c83

📥 Commits

Reviewing files that changed from the base of the PR and between f637312 and 8abff0d.

📒 Files selected for processing (2)
  • app/modules/parsing/knowledge_graph/inference_service.py
  • tests/unit/parsing/test_inference_cache_flow.py

Comment thread app/modules/parsing/knowledge_graph/inference_service.py
Comment thread app/modules/parsing/knowledge_graph/inference_service.py
Comment thread tests/unit/parsing/test_inference_cache_flow.py Outdated
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (1)
tests/unit/parsing/test_inference_cache_flow.py (1)

88-90: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Replace /tmp/repo in the fake project path to avoid CI lint failure.

Line 90 still uses "/tmp/repo", which triggers Ruff S108 in this test and can keep CI red.

🔧 Minimal fix
 class FakeProjectManager:
     def get_project_from_db_by_id_sync(self, repo_id):
-        return {"repo_path": "/tmp/repo"}
+        return {"repo_path": "/repo"}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unit/parsing/test_inference_cache_flow.py` around lines 88 - 90, The
test's FakeProjectManager.get_project_from_db_by_id_sync returns a hardcoded
repo path "/tmp/repo" which triggers a linter rule; update that method to return
a safe, non-/tmp path (e.g., use a relative path like "tmp/repo" or a clearly
test-scoped path such as "/var/tmp/test-repo" or better yet generate a tempfile
via tempfile.mkdtemp in the test setup) so the fake project path no longer uses
"/tmp/repo" and the test remains deterministic.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Duplicate comments:
In `@tests/unit/parsing/test_inference_cache_flow.py`:
- Around line 88-90: The test's
FakeProjectManager.get_project_from_db_by_id_sync returns a hardcoded repo path
"/tmp/repo" which triggers a linter rule; update that method to return a safe,
non-/tmp path (e.g., use a relative path like "tmp/repo" or a clearly
test-scoped path such as "/var/tmp/test-repo" or better yet generate a tempfile
via tempfile.mkdtemp in the test setup) so the fake project path no longer uses
"/tmp/repo" and the test remains deterministic.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: e718de84-d067-4b41-acde-177626461504

📥 Commits

Reviewing files that changed from the base of the PR and between 8abff0d and df9cf57.

📒 Files selected for processing (2)
  • app/modules/parsing/knowledge_graph/inference_service.py
  • tests/unit/parsing/test_inference_cache_flow.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • app/modules/parsing/knowledge_graph/inference_service.py

@Dsantra92 Dsantra92 requested a review from shmbhvi101 May 27, 2026 09:36
Copy link
Copy Markdown
Contributor

@shmbhvi101 shmbhvi101 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work on this — splitting “run inference” from “store in cache” is exactly what #223 needed, and the batching / node_id matching changes makes sense.

Requesting one fix before merge:

When update_neo4j_with_docstrings is called with content_hash_by_node_id (always from process_batch), nodes that are not in the map still get item["content_hash"] = None and SET n.content_hash = item.content_hash, which can clear hashes on uncacheable nodes in the same batch.

Please only set/persist content_hash for node IDs present in the map (or use a Cypher coalesce/CASE so omission leaves the existing value). A small unit test for a mixed cacheable + uncacheable batch would lock this in.

After that, happy to approve.

@sonarqubecloud
Copy link
Copy Markdown

@AMBRA7592
Copy link
Copy Markdown
Author

Updated in 54d398d: update_neo4j_with_docstrings now only includes content_hash for node IDs present in the hash map, and the Cypher update preserves existing hashes for omitted nodes. Added a mixed cacheable/uncacheable batch regression test.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants