[codex] Preserve inference coverage with cache misses by AMBRA7592 · Pull Request #780 · potpie-ai/potpie

AMBRA7592 · 2026-05-15T19:40:33Z

Summary

Fixes part of #223 by tightening the inference cache flow so cache eligibility no longer determines whether a node receives inference.

Fetches node type and existing content hash from Neo4j for cache-aware processing.
Ensures non-cacheable nodes, such as small nodes or unresolved-reference nodes, still enter LLM inference batches instead of being skipped.
Stores content hashes back onto Neo4j nodes when writing fresh or cached inference results.
Matches LLM responses back to requests by node_id instead of relying on response order when storing cache entries.
Adds unit coverage for uncacheable-node inference coverage and content-hash persistence.

Why

The existing cache path only batched cacheable cache misses. That made caching improve speed at the cost of inference coverage for nodes that were too small or otherwise not worth caching. The bounty success criteria include no loss in inference quality, so cacheability should only affect cache storage/reuse, not whether inference runs.

Validation

python3 -m py_compile app/modules/parsing/knowledge_graph/inference_service.py tests/unit/parsing/test_inference_cache_flow.py
git diff --check

Full pytest was not run locally because this environment does not have the project test dependencies installed.

Summary by CodeRabbit

Improvements
- Inference batching refined: decouples cacheability from inference selection, excludes cache-hits while still batching uncached content, and improves batch logging/metrics.
- Persisted content-hash with inferred results, propagating hashes from chunks to parents for more reliable tracking and ensuring stored graph nodes include content-hash while conditionally clearing source text for non-local repos.
Tests
- Added unit tests for inference-cache batching behavior and content-hash persistence.

coderabbitai · 2026-05-15T19:40:43Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

▶️ Resume reviews
🔍 Trigger review

Walkthrough

The inference service now retrieves node content_hash from Neo4j, separates cacheability from inference selection in batching, maps LLM responses to requests by node_id while propagating content_hash, and persists content_hash in Neo4j updates alongside docstrings and embeddings.

Changes

Content Hash Tracking Through Inference Pipeline

Layer / File(s)	Summary
Fetch content_hash from Neo4j graph `app/modules/parsing/knowledge_graph/inference_service.py`	`fetch_graph` Cypher projection now returns `content_hash` with node attributes.
Batching logic with cacheability tracking `app/modules/parsing/knowledge_graph/inference_service.py`	Batch creation computes `nodes_needing_inference` from cache-misses plus `text`, adds `nodes_not_cacheable` metric, and updates batching logs.
LLM response mapping and content_hash propagation `app/modules/parsing/knowledge_graph/inference_service.py`	LLM batch processing builds `request_by_node_id` and `content_hash_by_node_id` (propagating chunk hashes to parent nodes), matches `response.docstrings` by `node_id`, and skips unexpected ids.
Propagation to Neo4j update payloads and Cypher `app/modules/parsing/knowledge_graph/inference_service.py`	`batch_update_neo4j_with_cached_inference` and `update_neo4j_with_docstrings` include `content_hash` in payloads and set `n.content_hash` in Cypher; updates conditionally clear source fields for non-local repos.
Unit tests for cache flow and Neo4j persistence `tests/unit/parsing/test_inference_cache_flow.py`	New tests instantiate `InferenceService` without full init, verify batching behavior with cacheable/non-cacheable nodes, and assert `update_neo4j_with_docstrings` persists or omits `content_hash` using a fake Neo4j driver and project manager.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

I hop through nodes with a curious dash,
I track each chunk and its little hash,
From fetch to LLM then mapped once more,
I write the hash where docstrings store,
A rabbit's delight: data safe ashore. 🐇✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 15.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title directly describes the main change: preserving inference coverage even with cache misses, which is the core motivation and objective of this PR.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@app/modules/parsing/knowledge_graph/inference_service.py`:
- Around line 1515-1516: The method currently clears content_hash on nodes when
content_hash_by_node_id is omitted because it always sets item.content_hash and
runs SET n.content_hash = item.content_hash; instead, change the logic so that
you only set item.content_hash and include the SET n.content_hash clause when
content_hash_by_node_id is not None (i.e., when the caller provided a map).
Locate the places handling the parameters precomputed_embeddings and
content_hash_by_node_id (see references to item.content_hash and the Cypher
fragment "SET n.content_hash = item.content_hash") and wrap those
assignments/SET clauses in a conditional branch (or build the update query
dynamically) so omission of content_hash_by_node_id leaves existing node
content_hash values untouched; apply the same fix for the other block noted
around the 1530-1557 region.
- Around line 1061-1070: The loop that populates content_hash_by_node_id copies
metadata["content_hash"] for every request (and its parent) without checking
cacheability, causing stale hashes to be re-applied; modify the loop in
inference_service.py to only assign content_hash_by_node_id[request.node_id]
(and parent) when the node is currently cacheable by checking the node/request's
should_cache flag (or equivalent property) before writing the hash, and skip
writing hashes for nodes where should_cache is False.

In `@tests/unit/parsing/test_inference_cache_flow.py`:
- Around line 88-90: The FakeProjectManager test stub uses a temp-directory path
"/tmp/repo" which triggers CI lint rules; update
FakeProjectManager.get_project_from_db_by_id_sync to return a stable dummy path
(e.g., "/repo" or "repo") instead of "/tmp/repo" so the test does not reference
the system temp directory.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 06b8a942-0f01-4c2b-bb91-7d96a9be3c83

📥 Commits

Reviewing files that changed from the base of the PR and between f637312 and 8abff0d.

📒 Files selected for processing (2)

app/modules/parsing/knowledge_graph/inference_service.py
tests/unit/parsing/test_inference_cache_flow.py

coderabbitai

♻️ Duplicate comments (1)

tests/unit/parsing/test_inference_cache_flow.py (1)

88-90: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Replace /tmp/repo in the fake project path to avoid CI lint failure.

Line 90 still uses "/tmp/repo", which triggers Ruff S108 in this test and can keep CI red.

🔧 Minimal fix

 class FakeProjectManager:
     def get_project_from_db_by_id_sync(self, repo_id):
-        return {"repo_path": "/tmp/repo"}
+        return {"repo_path": "/repo"}

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unit/parsing/test_inference_cache_flow.py` around lines 88 - 90, The
test's FakeProjectManager.get_project_from_db_by_id_sync returns a hardcoded
repo path "/tmp/repo" which triggers a linter rule; update that method to return
a safe, non-/tmp path (e.g., use a relative path like "tmp/repo" or a clearly
test-scoped path such as "/var/tmp/test-repo" or better yet generate a tempfile
via tempfile.mkdtemp in the test setup) so the fake project path no longer uses
"/tmp/repo" and the test remains deterministic.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Duplicate comments:
In `@tests/unit/parsing/test_inference_cache_flow.py`:
- Around line 88-90: The test's
FakeProjectManager.get_project_from_db_by_id_sync returns a hardcoded repo path
"/tmp/repo" which triggers a linter rule; update that method to return a safe,
non-/tmp path (e.g., use a relative path like "tmp/repo" or a clearly
test-scoped path such as "/var/tmp/test-repo" or better yet generate a tempfile
via tempfile.mkdtemp in the test setup) so the fake project path no longer uses
"/tmp/repo" and the test remains deterministic.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: e718de84-d067-4b41-acde-177626461504

📥 Commits

Reviewing files that changed from the base of the PR and between 8abff0d and df9cf57.

📒 Files selected for processing (2)

app/modules/parsing/knowledge_graph/inference_service.py
tests/unit/parsing/test_inference_cache_flow.py

🚧 Files skipped from review as they are similar to previous changes (1)

app/modules/parsing/knowledge_graph/inference_service.py

shmbhvi101

Nice work on this — splitting “run inference” from “store in cache” is exactly what #223 needed, and the batching / node_id matching changes makes sense.

Requesting one fix before merge:

When update_neo4j_with_docstrings is called with content_hash_by_node_id (always from process_batch), nodes that are not in the map still get item["content_hash"] = None and SET n.content_hash = item.content_hash, which can clear hashes on uncacheable nodes in the same batch.

Please only set/persist content_hash for node IDs present in the map (or use a Cypher coalesce/CASE so omission leaves the existing value). A small unit test for a mixed cacheable + uncacheable batch would lock this in.

After that, happy to approve.

sonarqubecloud · 2026-05-29T20:36:39Z

Quality Gate passed

Issues
4 New issues
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube Cloud

AMBRA7592 · 2026-05-29T20:49:04Z

Updated in 54d398d: update_neo4j_with_docstrings now only includes content_hash for node IDs present in the hash map, and the Cypher update preserves existing hashes for omitted nodes. Added a mixed cacheable/uncacheable batch regression test.

Preserve inference coverage with cache misses

8abff0d

AMBRA7592 marked this pull request as ready for review May 15, 2026 19:43

coderabbitai Bot reviewed May 15, 2026

View reviewed changes

Comment thread app/modules/parsing/knowledge_graph/inference_service.py

Comment thread app/modules/parsing/knowledge_graph/inference_service.py

Comment thread tests/unit/parsing/test_inference_cache_flow.py Outdated

Use static Cypher for cache updates

df9cf57

coderabbitai Bot reviewed May 15, 2026

View reviewed changes

AMBRA7592 added 3 commits May 15, 2026 21:57

Avoid temp path in cache flow test

8667e23

Address inference cache review feedback

513c94d

Reduce cache flow test duplication

b39a1b1

Dsantra92 requested a review from shmbhvi101 May 27, 2026 09:36

shmbhvi101 requested changes May 29, 2026

View reviewed changes

Preserve content hashes for uncached inference nodes

54d398d

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[codex] Preserve inference coverage with cache misses#780

[codex] Preserve inference coverage with cache misses#780
AMBRA7592 wants to merge 6 commits into
potpie-ai:mainfrom
AMBRA7592:codex-inference-cache-preserve-coverage

AMBRA7592 commented May 15, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented May 15, 2026 •

edited

Loading

Reviews paused

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Uh oh!

shmbhvi101 left a comment •

edited

Loading

Uh oh!

sonarqubecloud Bot commented May 29, 2026

Uh oh!

AMBRA7592 commented May 29, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

AMBRA7592 commented May 15, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Why

Validation

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented May 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reviews paused

Walkthrough

Changes

Estimated code review effort

Poem

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

shmbhvi101 left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

sonarqubecloud Bot commented May 29, 2026

Quality Gate passed

Uh oh!

AMBRA7592 commented May 29, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

AMBRA7592 commented May 15, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented May 15, 2026 •

edited

Loading

shmbhvi101 left a comment •

edited

Loading