All notable changes to this project will be documented in this file.
The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.
First official production release. Achieves < 1ms lookup latency and significant memory reduction via Rust backend.
- Full Cache Load Time: Loading a 1M-entry cache takes ~300-350ms total. While the binary index is search-ready in <10ms, Python pickle deserialization of response objects adds ~300ms overhead. This is a known limitation of the hybrid Python/Rust architecture and will be addressed in a future release.
- Rust Storage Backend: Core storage moved to Rust (
RustCacheStorage), reducing memory usage to 44 bytes per entry (codes + metadata). - Persistence Format v3: New binary format with SHA-256 integrity checks. Loading 1M entries now takes < 10ms (index ready).
- OpenAI Integration: Native
OpenAIEmbeddingBackendwith automatic batching, token bucket rate limiting, and cost tracking. - List-Based Response Storage: Python-side response storage optimized from dict to list, reducing overhead by ~90%.
- Migration Tool: CLI tool to migrate legacy v2 caches to v3 format (
binary-semantic-cache migrate).
- Performance: Lookup latency for 100k entries reduced from 1.14ms to 0.41ms (2.8x speedup).
- Memory: Total memory footprint per entry reduced from ~119 bytes to ~52 bytes.
- Dependency: Added
maturinbuild system for Rust extensions.
- Parity: Resolved bit-exactness issue between Rust and Python encoders (verified via OAI-08 regression test).
- Windows Compatibility: Fixed Unicode encoding issues in benchmark reporting.
(Superseded by 1.0.0)
- Initial release.
- Python-only implementation (NumPy + Numba).
- Binary quantization (256-bit Hamming distance).
- Ollama support (
OllamaEmbedder). - Basic
save/loadpersistence (pickle).