feat: add pre-warmed embedding cache for tool prefilter by vlordier · Pull Request #56 · Liquid4All/cookbook

vlordier · 2026-03-06T19:31:06Z

Summary

Add ToolEmbeddingIndex::build_with_cache() that loads from disk cache
Cache invalidation: endpoint + tool hash validation
Cache stored in app data directory (com.localcowork.app/cache)
Add tests for tool hash computation
Update orchestrator to use cached index
Add cache_dir() helper in lib.rs

This saves ~30s startup time by avoiding redundant embedding computation.

Changes

src-tauri/src/agent_core/tool_prefilter.rs: +216 lines (cache methods, metadata, validation)
src-tauri/src/agent_core/orchestrator.rs: +2 lines (use build_with_cache)
src-tauri/src/lib.rs: +8 lines (cache_dir helper)

Testing

All 365 tests pass (up from 363)
Clippy clean

- Add ToolEmbeddingIndex::build_with_cache() that loads from disk cache - Cache invalidation: endpoint + tool hash validation - Cache stored in app data directory (com.localcowork.app/cache) - Add tests for tool hash computation - Update orchestrator to use cached index - Add cache_dir() helper in lib.rs This saves ~30s startup time by avoiding redundant embedding computation.

Copilot

Pull request overview

Adds a disk-backed, pre-warmed embedding index cache to speed up tool prefilter initialization in the Tauri app by reusing previously computed tool embeddings when the model endpoint + tool list haven’t changed.

Changes:

Introduce ToolEmbeddingIndex::build_with_cache() plus cache serialization/validation in the tool prefilter module.
Update orchestrator startup to use the cached index path.
Add cache_dir() helper alongside existing data_dir().

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.

File	Description
`examples/localcowork/src-tauri/src/agent_core/tool_prefilter.rs`	Implements cache read/write + metadata/tool-hash validation and adds unit tests for tool hashing.
`examples/localcowork/src-tauri/src/agent_core/orchestrator.rs`	Switches tool index construction to `build_with_cache()` and uses the app cache directory.
`examples/localcowork/src-tauri/src/lib.rs`	Adds `cache_dir()` helper (`data_dir()/cache`).

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-03-06T19:35:18Z

+#[allow(unused_imports)]
+use std::io::{self, Write};



std::io::{self, Write} is imported with #[allow(unused_imports)] but never used in this module. Please remove the unused import (and the allow) to keep the module clean and avoid masking future unused-import issues.

Suggested change

#[allow(unused_imports)]

use std::io::{self, Write};

Copilot · 2026-03-06T19:35:19Z

+        // Validate embeddings exist
+        if cached.embeddings.is_empty() && !cached.tool_names.is_empty() {
+            return Err(ToolPreFilterError::RequestFailed {
+                reason: "cached embeddings are empty".to_string(),
+            });
+        }
+
+        Ok(Self {
+            tool_names: cached.tool_names,
+            embeddings: cached.embeddings,
+        })
+    }


load_from_cache can construct an index where tool_names.len() and embeddings.len() differ, which will later panic in filter() when indexing self.embeddings[i]. Please validate that counts match (and ideally that each embedding has the expected dimension) before returning the cached index, and treat mismatches as cache-invalid.

Copilot · 2026-03-06T19:35:19Z

+        // Validate endpoint
+        if cached.metadata.endpoint != endpoint {
+            return Err(ToolPreFilterError::RequestFailed {
+                reason: "endpoint changed".to_string(),
+            });
+        }
+
+        // Validate tool hash
+        let current_hash = compute_tool_hash(tools);
+        if cached.metadata.tool_hash != current_hash {
+            return Err(ToolPreFilterError::RequestFailed {
+                reason: "tools changed".to_string(),
+            });
+        }
+
+        // Validate embeddings exist
+        if cached.embeddings.is_empty() && !cached.tool_names.is_empty() {
+            return Err(ToolPreFilterError::RequestFailed {
+                reason: "cached embeddings are empty".to_string(),
+            });
+        }
+
+        Ok(Self {
+            tool_names: cached.tool_names,
+            embeddings: cached.embeddings,
+        })


Cache metadata records dimension, but the loader never checks it (nor that all cached embeddings have consistent lengths). Because cosine similarity currently zips vectors, a dimension mismatch would silently produce incorrect scores. Consider validating metadata.dimension against the cached vectors (and/or returning ToolPreFilterError::DimensionMismatch) so invalid caches are rejected instead of degrading retrieval quality.

Copilot · 2026-03-06T19:35:19Z

+/// Compute a hash of the tool list for cache validation.
+///
+/// Uses a simple FNV-like hash for speed.
+fn compute_tool_hash(tools: &[(String, String)]) -> String {
+    use std::collections::hash_map::DefaultHasher;
+    use std::hash::{Hash, Hasher};
+
+    let mut hasher = DefaultHasher::new();
+    for (name, desc) in tools {
+        name.hash(&mut hasher);
+        desc.hash(&mut hasher);
+    }
+    format!("{:x}", hasher.finish())


The doc comment says this uses an "FNV-like" hash, but the implementation uses std::collections::hash_map::DefaultHasher (SipHash), which is not FNV. Please update the comment to match the implementation, or switch to an explicit (and stable) hash algorithm if you want the hash format/behavior to be predictable across Rust versions.

Copilot · 2026-03-06T19:35:19Z

+        let hash1 = compute_tool_hash(&tools1);
+        let hash2 = compute_tool_hash(&tools2);
+        assert_ne!(hash1, hash2, "different tools should produce different hashes");
+    }


This PR adds disk caching logic (build_with_cache / load_from_cache / save_to_cache), but the tests added only cover compute_tool_hash. Since this module already has unit tests, please add coverage for the cache round-trip and invalidation cases (endpoint change, tool_hash change, corrupted file) using a temp directory.

Suggested change

}

}

// ─── Cache Tests ─────────────────────────────────────────────────────────

use std::env;

use std::fs;

use std::path::PathBuf;

use std::time::{SystemTime, UNIX_EPOCH};

/// Create a unique temporary directory for cache tests.

fn temp_cache_dir() -> PathBuf {

let mut dir = env::temp_dir();

let timestamp = SystemTime::now()

.duration_since(UNIX_EPOCH)

.expect("time went backwards")

.as_nanos();

dir.push(format!("tool_prefilter_cache_test_{}", timestamp));

fs::create_dir_all(&dir).expect("failed to create temp cache dir");

dir

}

#[test]

fn cache_round_trip_works() {

let cache_dir = temp_cache_dir();

let index = ToolEmbeddingIndex {

tool_names: vec!["tool1".to_string()],

embeddings: vec![vec![1.0, 2.0, 3.0]],

};

let endpoint = "http://example.com/embeddings";

let tools = vec![("tool1".to_string(), "description1".to_string())];

let tool_hash = compute_tool_hash(&tools);

// Save to cache, then load it back and verify contents.

save_to_cache(&cache_dir, endpoint, &tool_hash, &index)

.expect("saving to cache should succeed");

let loaded = load_from_cache(&cache_dir, endpoint, &tool_hash)

.expect("expected Some(index) from cache");

assert_eq!(loaded.tool_names, index.tool_names);

assert_eq!(loaded.embeddings, index.embeddings);

}

#[test]

fn cache_invalidation_on_endpoint_change() {

let cache_dir = temp_cache_dir();

let index = ToolEmbeddingIndex {

tool_names: vec!["tool1".to_string()],

embeddings: vec![vec![1.0]],

};

let endpoint = "http://example.com/embeddings";

let tools = vec![("tool1".to_string(), "description1".to_string())];

let tool_hash = compute_tool_hash(&tools);

save_to_cache(&cache_dir, endpoint, &tool_hash, &index)

.expect("saving to cache should succeed");

let different_endpoint = "http://other.example.com/embeddings";

let loaded = load_from_cache(&cache_dir, different_endpoint, &tool_hash);

assert!(

loaded.is_none(),

"cache should be invalidated when endpoint changes"

);

}

#[test]

fn cache_invalidation_on_tool_hash_change() {

let cache_dir = temp_cache_dir();

let index = ToolEmbeddingIndex {

tool_names: vec!["tool1".to_string()],

embeddings: vec![vec![1.0]],

};

let endpoint = "http://example.com/embeddings";

let tools = vec![("tool1".to_string(), "description1".to_string())];

let tool_hash = compute_tool_hash(&tools);

save_to_cache(&cache_dir, endpoint, &tool_hash, &index)

.expect("saving to cache should succeed");

let different_tools =

vec![("tool2".to_string(), "a different description".to_string())];

let different_tool_hash = compute_tool_hash(&different_tools);

let loaded = load_from_cache(&cache_dir, endpoint, &different_tool_hash);

assert!(

loaded.is_none(),

"cache should be invalidated when tool hash changes"

);

}

#[test]

fn cache_corrupted_file_returns_none() {

let cache_dir = temp_cache_dir();

let index = ToolEmbeddingIndex {

tool_names: vec!["tool1".to_string()],

embeddings: vec![vec![1.0]],

};

let endpoint = "http://example.com/embeddings";

let tools = vec![("tool1".to_string(), "description1".to_string())];

let tool_hash = compute_tool_hash(&tools);

// First create a valid cache file.

save_to_cache(&cache_dir, endpoint, &tool_hash, &index)

.expect("saving to cache should succeed");

// Corrupt whatever file was created in the cache directory.

let cache_file_path = fs::read_dir(&cache_dir)

.expect("cache dir should be readable")

.filter_map(Result::ok)

.map(|entry| entry.path())

.next()

.expect("cache dir should contain at least one file");

fs::write(&cache_file_path, b"not valid json")

.expect("failed to corrupt cache file");

let loaded = load_from_cache(&cache_dir, endpoint, &tool_hash);

assert!(

loaded.is_none(),

"corrupted cache file should be treated as a cache miss"

);

}

Paulescu · 2026-05-19T11:43:04Z

Thanks — this is a real feature, scoped cleanly, no drive-by churn. Couple of issues to fix before merging.

Blocker: the cache will always miss.

Registry.tool_name_description_pairs() (mcp_client/registry.rs:101) iterates self.tools which is a HashMap<String, ...>. HashMap iteration order is randomized per process, so compute_tool_hash produces a different hash every startup. The cache gets saved on first run, invalidated on every subsequent run, and rebuilt — defeating the 30s saving the PR is trying to achieve.

Fix: sort by tool name inside compute_tool_hash before feeding to the hasher.

let mut sorted: Vec<&(String, String)> = tools.iter().collect();
sorted.sort_by(|a, b| a.0.cmp(&b.0));
for (name, desc) in &sorted {
    name.hash(&mut hasher);
    desc.hash(&mut hasher);
}

Should-fix: DefaultHasher is not stable across Rust versions.

std::collections::hash_map::DefaultHasher is SipHash-1-3 with fixed keys today, but the stdlib explicitly does not guarantee algorithm stability across releases. After a rustup update every cache silently invalidates. Failure mode is recoverable (just rebuilds), but easy to avoid with a stable hash — sha2::Sha256 or blake3. Also: the "FNV-like" comment is incorrect; SipHash-1-3 is neither FNV nor a speed-optimised hash.

Cleanups:

#[allow(unused_imports)] use std::io::{self, Write}; — remove it; the #[allow] is masking what would otherwise be a useful lint.
CacheMetadata.created_at is written but never read. Either drop the field or use it for TTL / debugging.
ToolPreFilterError::RequestFailed { reason: "cache file not found" } and friends abuse the HTTP-failure variant for cache-miss / IO errors. They don't propagate (consumed by if let Ok(...)), so it's purely a naming concern, but a CacheError variant (or Result<Option<Self>, _> for load) would be clearer.
Embeddings are stored as pretty-printed JSON Vec<Vec<f32>>. For ~80 tools × 1024-dim that's ~3 MB on disk vs ~320 KB binary. Not blocking, just noting.

Architectural note: file size.

CLAUDE.md sets a 300-line max per Rust file for LocalCowork. tool_prefilter.rs is 362 lines on main (already over) and this PR pushes it to ~580. Suggest splitting the cache logic into a sibling module — agent_core/embedding_cache.rs or agent_core/tool_prefilter/cache.rs — with CachedIndex, CacheMetadata, cache_path, load_from_cache, save_to_cache, and compute_tool_hash. Keeps tool_prefilter.rs focused on the prefilter itself.

Test gap.

The two added tests only cover compute_tool_hash. The more important test is a save → load roundtrip confirming cache hits work when inputs match and misses are triggered correctly when the endpoint or tool list changes. That test would have caught the iteration-order bug above.

Fix the iteration-order bug + stable hash + file split, add a roundtrip test, and re-request review. Should be a quick second pass.

Copilot AI review requested due to automatic review settings March 6, 2026 19:31

Copilot started reviewing on behalf of vlordier March 6, 2026 19:31 View session

Copilot AI reviewed Mar 6, 2026

View reviewed changes

Paulescu mentioned this pull request May 19, 2026

feat: add unified app settings persistence #57

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: add pre-warmed embedding cache for tool prefilter#56

feat: add pre-warmed embedding cache for tool prefilter#56
vlordier wants to merge 1 commit into
Liquid4All:mainfrom
vlordier:feat/prewarmed-embedding-cache

vlordier commented Mar 6, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI Mar 6, 2026

Uh oh!

Copilot AI Mar 6, 2026

Uh oh!

Copilot AI Mar 6, 2026

Uh oh!

Copilot AI Mar 6, 2026

Uh oh!

Copilot AI Mar 6, 2026

Uh oh!

Paulescu commented May 19, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

-    }
+    }
+    // ─── Cache Tests ─────────────────────────────────────────────────────────
+    use std::env;
+    use std::fs;
+    use std::path::PathBuf;
+    use std::time::{SystemTime, UNIX_EPOCH};
+    /// Create a unique temporary directory for cache tests.
+    fn temp_cache_dir() -> PathBuf {
+        let mut dir = env::temp_dir();
+        let timestamp = SystemTime::now()
+            .duration_since(UNIX_EPOCH)
+            .expect("time went backwards")
+            .as_nanos();
+        dir.push(format!("tool_prefilter_cache_test_{}", timestamp));
+        fs::create_dir_all(&dir).expect("failed to create temp cache dir");
+        dir
+    }
+    #[test]
+    fn cache_round_trip_works() {
+        let cache_dir = temp_cache_dir();
+        let index = ToolEmbeddingIndex {
+            tool_names: vec!["tool1".to_string()],
+            embeddings: vec![vec![1.0, 2.0, 3.0]],
+        };
+        let endpoint = "http://example.com/embeddings";
+        let tools = vec![("tool1".to_string(), "description1".to_string())];
+        let tool_hash = compute_tool_hash(&tools);
+        // Save to cache, then load it back and verify contents.
+        save_to_cache(&cache_dir, endpoint, &tool_hash, &index)
+            .expect("saving to cache should succeed");
+        let loaded = load_from_cache(&cache_dir, endpoint, &tool_hash)
+            .expect("expected Some(index) from cache");
+        assert_eq!(loaded.tool_names, index.tool_names);
+        assert_eq!(loaded.embeddings, index.embeddings);
+    }
+    #[test]
+    fn cache_invalidation_on_endpoint_change() {
+        let cache_dir = temp_cache_dir();
+        let index = ToolEmbeddingIndex {
+            tool_names: vec!["tool1".to_string()],
+            embeddings: vec![vec![1.0]],
+        };
+        let endpoint = "http://example.com/embeddings";
+        let tools = vec![("tool1".to_string(), "description1".to_string())];
+        let tool_hash = compute_tool_hash(&tools);
+        save_to_cache(&cache_dir, endpoint, &tool_hash, &index)
+            .expect("saving to cache should succeed");
+        let different_endpoint = "http://other.example.com/embeddings";
+        let loaded = load_from_cache(&cache_dir, different_endpoint, &tool_hash);
+        assert!(
+            loaded.is_none(),
+            "cache should be invalidated when endpoint changes"
+        );
+    }
+    #[test]
+    fn cache_invalidation_on_tool_hash_change() {
+        let cache_dir = temp_cache_dir();
+        let index = ToolEmbeddingIndex {
+            tool_names: vec!["tool1".to_string()],
+            embeddings: vec![vec![1.0]],
+        };
+        let endpoint = "http://example.com/embeddings";
+        let tools = vec![("tool1".to_string(), "description1".to_string())];
+        let tool_hash = compute_tool_hash(&tools);
+        save_to_cache(&cache_dir, endpoint, &tool_hash, &index)
+            .expect("saving to cache should succeed");
+        let different_tools =
+            vec![("tool2".to_string(), "a different description".to_string())];
+        let different_tool_hash = compute_tool_hash(&different_tools);
+        let loaded = load_from_cache(&cache_dir, endpoint, &different_tool_hash);
+        assert!(
+            loaded.is_none(),
+            "cache should be invalidated when tool hash changes"
+        );
+    }
+    #[test]
+    fn cache_corrupted_file_returns_none() {
+        let cache_dir = temp_cache_dir();
+        let index = ToolEmbeddingIndex {
+            tool_names: vec!["tool1".to_string()],
+            embeddings: vec![vec![1.0]],
+        };
+        let endpoint = "http://example.com/embeddings";
+        let tools = vec![("tool1".to_string(), "description1".to_string())];
+        let tool_hash = compute_tool_hash(&tools);
+        // First create a valid cache file.
+        save_to_cache(&cache_dir, endpoint, &tool_hash, &index)
+            .expect("saving to cache should succeed");
+        // Corrupt whatever file was created in the cache directory.
+        let cache_file_path = fs::read_dir(&cache_dir)
+            .expect("cache dir should be readable")
+            .filter_map(Result::ok)
+            .map(|entry| entry.path())
+            .next()
+            .expect("cache dir should contain at least one file");
+        fs::write(&cache_file_path, b"not valid json")
+            .expect("failed to corrupt cache file");
+        let loaded = load_from_cache(&cache_dir, endpoint, &tool_hash);
+        assert!(
+            loaded.is_none(),
+            "corrupted cache file should be treated as a cache miss"
+        );
+    }

Conversation

vlordier commented Mar 6, 2026

Summary

Changes

Testing

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI Mar 6, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 6, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 6, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 6, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 6, 2026

Choose a reason for hiding this comment

Uh oh!

Paulescu commented May 19, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants