diff --git a/.agents/skills/redis-use-case-ports/SKILL.md b/.agents/skills/redis-use-case-ports/SKILL.md index 6c772f5995..904eb3306d 100644 --- a/.agents/skills/redis-use-case-ports/SKILL.md +++ b/.agents/skills/redis-use-case-ports/SKILL.md @@ -118,6 +118,8 @@ Phase 4's targeted audits work well for known bug classes (the rows in `audit-ch Run an independent reviewer (different model, fresh context — the [`codex:rescue`](../../codex/) skill is a good fit, with a prompt that lists files plus the specific concerns: correctness bugs, cross-client divergence, doc drift) **before** declaring Phase 4 done. Treat its findings as candidates for the Phase 5 retrofit, with the orchestrator triaging which to accept (some "race conditions" are safe by accident — e.g. redis-py and go-redis subscribe-ack — because the synchronous socket write closes the window before the helper returns). +**Verify each finding against the current file before fixing it.** Independent reviewers occasionally work from a stale snapshot — the file they reviewed was correct when they started, but a parallel agent kept editing it during the review window. Several of the Jedis and PHP findings on the semantic-cache project turned out to be the agent re-discovering a fix that had already landed minutes earlier (the EXISTS-race comment, the 1 MiB body cap, the docs paragraph about classpath resources). `grep` the finding's described pattern against the current file before opening an Edit — a one-second sanity check saves an inadvertent revert. + Add a new row to `assets/audit-checklist.md` for any *class* of bug the reviewer found that wasn't already covered, so the next project's Phase 4 won't have to rediscover it. ## Phase 5 — Retrofit @@ -164,3 +166,4 @@ Keep `SKILL.md` itself focused on the workflow. 
The concrete artefacts live in ` - [`content/develop/use-cases/job-queue/`](../../../content/develop/use-cases/job-queue/) — the project that introduced rows 11–13 of [`audit-checklist.md`](assets/audit-checklist.md) (token-checked atomic state transitions, crash-window fallback timer, shared-keyspace collision in parallel smoke tests). - [`content/develop/use-cases/pub-sub/`](../../../content/develop/use-cases/pub-sub/) — the first non-keyspace use case ported. Introduced rows 14–18 of [`audit-checklist.md`](assets/audit-checklist.md) (subscribe-ack race, concurrent-name reservation, detached-worker PID capture, silent timeout fallthrough, server-wide PUBSUB introspection) plus the pub/sub conventions section in [`redis-conventions.md`](assets/redis-conventions.md). Also the project that motivated adding Phase 4b (independent review) after Codex caught four real bugs that Phase 4 cleared. - [`content/develop/use-cases/recommendation-engine/`](../../../content/develop/use-cases/recommendation-engine/) — the first ML / vector-search use case. Introduced the **ML / vector-search use cases** section in [`redis-conventions.md`](assets/redis-conventions.md) (per-client embedding library table, pre-computed `catalog.json` wire format, FFI / Ruby-version setup blockers, per-port deviation conventions) and rows 24–28 of [`audit-checklist.md`](assets/audit-checklist.md) (vector dim mismatch in client-side blend helpers, L2 normalisation silently skipped by the embedding wrapper, TAG escape must include the backslash itself, connection-wide state toggle race on a shared client, weight=0 must disable not normalise to default). Each of the five new rows came from a real bug — bugbot or Codex caught all of them; the Python reference shipped with the TAG-escape bug originally. +- [`content/develop/use-cases/semantic-cache/`](../../../content/develop/use-cases/semantic-cache/) — the second ML / vector-search use case. 
Cache-on-LLM-responses backed by Redis Search KNN with a thresholded hit/miss decision and tenant/locale/model-version metadata filtering. Introduced rows 29–34 of [`audit-checklist.md`](assets/audit-checklist.md): embedder Predictor / Session thread-safety on shared instances (DJL needs `synchronized`, ONNX is fine); library config keys that look real but don't take effect (WEBrick's `MaxRequestBodySize` is not an option name; the body cap must be enforced in user code); lockfile pinning a newer runtime than the manifest declares (composer.lock requiring PHP 8.4 while composer.json said `^8.2`); NaN / Inf parsing via language-specific quirks (PHP `(float)"nan"` → 0.0 silently, must use textual rejection before parsing); per-language strings in HTML that's shared across language demos (badge text, default threshold must be populated via `/state` at first load); docs wire-form snippets must show escaped TAG values (`gpt\-4\.5\-2026`, not `gpt-4.5-2026`). Also the project that motivated the Phase 4b note about verifying independent-review findings against the current file before applying — several Jedis and PHP "missing" findings were actually re-discoveries of fixes that had landed minutes earlier. diff --git a/.agents/skills/redis-use-case-ports/assets/audit-checklist.md b/.agents/skills/redis-use-case-ports/assets/audit-checklist.md index ccfa7351b8..26ca571998 100644 --- a/.agents/skills/redis-use-case-ports/assets/audit-checklist.md +++ b/.agents/skills/redis-use-case-ports/assets/audit-checklist.md @@ -480,6 +480,104 @@ The first form lets a caller pass `0` to bypass the bonus entirely (and a downst --- +## 29. Embedder Predictor / Session is not always thread-safe on a shared instance + +**What to scan for:** the embedding wrapper's `encodeOne` / `encode_many` / `EncodeInternal` methods, and how the wrapper is reached from the HTTP handler. 
Particularly look at the handler executor (cached thread pool, `Executors.newCachedThreadPool`, async runtime with multiple workers, `HttpListener` callback) — does the wrapper hold any mutable state across calls, and is the underlying library documented as thread-safe? + +**Pass criterion:** the embedder is either (a) documented as thread-safe and used without synchronisation (e.g. ONNX Runtime's `InferenceSession`), (b) documented as not-thread-safe and the wrapper serialises every call (e.g. DJL `Predictor` wrapped in `synchronized` methods), or (c) the handler dispatcher is single-threaded so concurrency never arises. The wrapper's code or docstring should state which case applies so the reader doesn't have to derive it. + +This is **distinct from row 1** (which is about Redis connections and `MULTI/EXEC` interleaving). Row 1 is about transaction state on a shared Redis connection; this row is about model-inference state on a shared ML client. + +**Sample audit prompt:** + +> For each port under `content/develop/use-cases/{{USE_CASE_NAME}}/`, locate the embedding wrapper and the HTTP server's request executor. For each port, classify (a) thread-safe library with no app-level locking needed, (b) not-thread-safe library with explicit serialisation in the wrapper, or (c) single-threaded dispatcher so the question doesn't arise. Cite the line numbers. Flag any port where the wrapper shares mutable state across calls and the handler executor is multi-threaded but no synchronisation is in place. Verify by reading the library's docs / source for the underlying `Predictor` / `Pipeline` / `Session` type's thread-safety contract — don't trust the agent's choice without a citation. + +**Why on list:** Semantic-cache use case. The Jedis port shipped with a DJL `Predictor` shared across an `Executors.newCachedThreadPool` HTTP server, no synchronisation — Codex caught it. Jedis's `LocalEmbedder` was fixed by marking `encodeOne` / `encodeMany` `synchronized`. 
The Lettuce port (built after the Jedis lesson) included the synchronization from the start. The .NET port correctly uses an `InferenceSession` without locking because ONNX Runtime documents `Run` as thread-safe; the docstring calls that contract out so the reader knows why no lock is present. ([PR #3354 Codex review]) + +--- + +## 30. Library config keys that look real but don't take effect + +**What to scan for:** any place the demo configures a server-side limit (body size cap, connection timeout, max request bytes, max headers, etc.) by passing a named option to a library constructor. Check the **library's documented option names** — not what looks like it should work. + +**Pass criterion:** every limit the demo *advertises* in prose is actually enforced. The way to verify is to test the limit — send a request that should be rejected, confirm the response shape — not just look at the code. If the library doesn't expose the limit you need, the demo enforces the limit explicitly in user code (e.g. read at most N bytes, then check) and the prose accurately describes that path. + +**Sample audit prompt:** + +> For each port under `content/develop/use-cases/{{USE_CASE_NAME}}/`, identify every server-side limit the demo claims to enforce (POST body size, request timeout, connection cap, etc.). For each one, find the line of code that's supposed to enforce it. Then verify the option name against the library's current documentation (e.g. WEBrick's actual config keys, `com.sun.net.httpserver` knobs, `http.Server` fields, Express middleware names). Flag any limit whose enforcement relies on an option name the library doesn't recognise — those are silent no-ops. + +**Why on list:** Semantic-cache use case. The Ruby port passed `MaxRequestBodySize: MAX_BODY_BYTES` to `WEBrick::HTTPServer.new` — but `MaxRequestBodySize` is not a valid WEBrick option. The handler's `req.body` then read whatever the client sent. 
Codex flagged it ("the body cap is effectively a no-op") and the fix was an explicit `body_too_large?` check that examines `Content-Length` before reading the body. The class of bug is broader: any library configuration knob that's accepted as a keyword arg or property setter without validation can silently be ignored. + +--- + +## 31. Lockfile pins a newer runtime than the manifest declares + +**What to scan for:** the manifest's declared minimum runtime version (PHP `^8.2` in `composer.json`, Ruby `>= 3.0` in `Gemfile`, Rust `rust-version = "1.74"` in `Cargo.toml`, Node `engines.node` in `package.json`) versus the actual transitive dependency requirements in the lockfile (`composer.lock`, `Gemfile.lock`, `Cargo.lock`, `package-lock.json`). + +**Pass criterion:** either (a) the lockfile resolves transitively to versions compatible with the manifest's declared minimum, or (b) the manifest declares the higher minimum that the lockfile actually requires. A common form of this bug: a transitive dependency bumps its own minimum-version requirement; lock resolution picks up the new transitive; the lockfile now demands a higher runtime than the manifest advertises, and `composer install` / `bundle install` fails for users on the declared minimum. + +For PHP specifically, the fix is to add `"platform": {"php": "8.2.0"}` (or whichever minimum) under `composer.json`'s `config` block — this pins Composer to resolve transitives compatible with that version. Other ecosystems have equivalents (Bundler's `ruby` directive in `Gemfile`, Cargo's `rust-version`, npm's `engines` enforcement via `engine-strict`). + +**Sample audit prompt:** + +> For each port that ships a lockfile under `content/develop/use-cases/{{USE_CASE_NAME}}/`, identify the manifest's declared minimum runtime version. Then grep the lockfile for transitive dependencies that declare their own minimum (`"php": ">=X"`, `required_ruby_version`, `rust-version`, etc.). 
Confirm the highest transitive minimum is ≤ the manifest's declared minimum. Flag any port where the lockfile demands a higher runtime than the manifest — that port's `*install` step fails for users on the documented minimum. + +**Why on list:** Semantic-cache use case. The PHP port's `composer.json` declared `"php": "^8.2"` while `composer.lock` resolved `symfony/string` v8.0.x, which requires `php >= 8.4`. Users on PHP 8.2 or 8.3 hit `composer install` failures. Fixed by adding `"platform": {"php": "8.2.0"}` to `composer.json`. Caught by Codex review. ([PR #3354]) + +--- + +## 32. NaN / Inf parsing via language-specific quirks + +**What to scan for:** every place a floating-point parameter is parsed from user-controlled input (CLI flag, environment variable, HTTP form field, JSON body). Look at the parsing function (`float()`, `parseFloat()`, `strconv.ParseFloat`, `(float)$x`, `Float()`, etc.). + +**Pass criterion:** strings like `"nan"`, `"inf"`, `"+infinity"`, `"-inf"` must not produce a value that bypasses downstream comparisons. The cross-language quirks: + +- **Python** `float("nan")` → actual NaN (IEEE-754); `math.isfinite` catches it. +- **JavaScript** `parseFloat("nan")` → NaN; `Number.isFinite` catches it. +- **Go** `strconv.ParseFloat("nan", 64)` → NaN; `math.IsNaN` catches it. +- **PHP** `(float)"nan"` returns `0.0`, **not** NaN. `is_finite(0.0)` is true. The textual NaN reaches downstream code as `0.0` and silently corrupts any comparison. +- **Rust** `"nan".parse::<f64>()` → `Ok(NaN)`; `is_finite` catches it. +- **Java** `Double.parseDouble("NaN")` → actual NaN; `Double.isFinite` catches it. +- **C#** `double.Parse("NaN", CultureInfo.InvariantCulture)` → NaN; `double.IsFinite` catches it. + +The robust pattern is **textual rejection before parsing**: lowercase the input, check membership in the set `{"nan", "inf", "infinity", "+inf", "-inf", "+infinity", "-infinity"}`, and only then call the language-native parser. 
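The textual-rejection pattern can be sketched in Python roughly as follows. This is a minimal illustration, not the reference implementation's actual code: the helper name `parse_finite_float`, the `default` fallback behaviour, and the `[0.0, 2.0]` clamp range (chosen to match the cosine-distance range) are all assumptions made for the example.

```python
import math

# Spellings rejected up front, before the native parser ever runs.
_NON_FINITE = {"nan", "inf", "infinity", "+inf", "-inf", "+infinity", "-infinity"}


def parse_finite_float(raw, default, lo=0.0, hi=2.0):
    """Hypothetical helper: return a finite float clamped to [lo, hi].

    Falls back to `default` when the input is a non-finite spelling,
    is not a number at all, or parses to NaN / infinity.
    """
    text = str(raw).strip().lower()
    if text in _NON_FINITE:
        return default  # textual rejection happens before parsing
    try:
        value = float(text)
    except ValueError:
        return default  # garbage input such as "junk"
    if not math.isfinite(value):
        return default  # e.g. "1e999" overflows to inf
    return min(max(value, lo), hi)
```

The finite check after parsing is kept as a second line of defence because some spellings (an overflowing literal, for instance) are not in the textual set but still parse to a non-finite value. In a language whose cast silently yields `0.0`, the textual check is the only one of the two that can fire.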
The Python reference does this; the textual-rejection branch is what the PHP port needed and what Codex flagged when the env-var path bypassed it. + +**Sample audit prompt:** + +> For each port under `content/develop/use-cases/{{USE_CASE_NAME}}/`, locate every place a floating-point parameter is parsed from external input (CLI flag, env var, HTTP form field). For each parser, mentally run it against the inputs `"nan"`, `"inf"`, `"-inf"`, `"infinity"`, `"junk"`. Confirm each input is rejected (falls back to default, returns error, or clamps out of the meaningful range). Pay special attention to PHP's `(float)` cast and any other language where the implicit cast silently returns `0.0` on garbage input. Flag any path that admits a non-finite value to a downstream comparison. + +**Why on list:** Semantic-cache use case. PHP's `load_config()` parsed `SEMCACHE_THRESHOLD` with a bare `(float)` cast — `(float)"nan"` returned `0.0` silently, the `is_finite` check immediately downstream passed, and the cache's default threshold landed at `0.0`. Every paraphrase lookup became a miss. Codex flagged it; the fix was to route the env-var value through the same `clamp_threshold` helper the HTTP boundary already used (which textually rejects "nan" / "inf" before parsing). ([PR #3354 Codex review]) + +--- + +## 33. Per-language strings in HTML that's shared across language demos + +**What to scan for:** in any use case that copies the same `index.html` across all language demos verbatim (the standard pattern in `redis-use-case-ports`), audit the HTML for **hardcoded language-specific strings**: stack badge text (`"redis-py + sentence-transformers + ..."`), default values that the server should be authoritative on (default threshold, default port displayed in copy), code-block snippets that reference one language's syntax. 
+ +**Pass criterion:** every per-language string in the shared HTML is populated at request time via `/state` (or equivalent boot-up handshake) rather than baked into the HTML literal. The handshake returns enough info that the JS can render: a `stack_label` string, a `default_threshold` number, any per-language config the badge / lede / placeholders need. The HTML opens with placeholder content (`"loading…"`) and the first call to `/state` overwrites it. + +**Sample audit prompt:** + +> For each port under `content/develop/use-cases/{{USE_CASE_NAME}}/`, diff `index.html` against the reference's `index.html` byte-for-byte. They should be identical. Then audit the reference's `index.html` for any string that names a specific language, library, model, or config default — those are exactly the strings that need to be populated from `/state` at runtime, not baked into the HTML. Flag any hardcoded per-language string in the reference HTML. Then verify the server's `/state` response includes the field the HTML reads, and that the JS sets the value on first render (typically inside a `refreshState` or equivalent on page load). + +**Why on list:** Semantic-cache use case. Codex caught a hardcoded `"redis-py + sentence-transformers + Python standard library HTTP server"` badge in the Node.js port's `index.html` — the agent had copied the reference HTML verbatim and the badge was telling Node.js users they were running Python. The fix was to add `stack_label` and `default_threshold` to `/state` and have the JS render both on first load. Same fix propagated to all 7 sibling demos. ([PR #3354 Codex review]) + +--- + +## 34. Docs wire-form snippets must show escaped TAG values + +**What to scan for:** every code block in the use case's `_index.md` that shows a literal `FT.SEARCH` (or `FT.AGGREGATE`) query string with a TAG predicate (`@tenant:{...}`, `@category:{...}`, `@brand:{...}`). 
Check whether the TAG value contains any character that Redis Search treats as TAG-value syntax: `.` `-` `,` `<` `>` `{` `}` `[` `]` `"` `'` `:` `;` `!` `@` `#` `$` `%` `^` `&` `*` `(` `)` `+` `=` `~` `|` space, backslash. + +**Pass criterion:** TAG values that contain any of those characters are shown **escaped** with a leading backslash on each special character. Wire-form blocks (in `text` code fences) show single backslashes (`gpt\-4\.5\-2026`); in-language source blocks (where the demo code is shown verbatim) show the right number of backslashes for that language's string-literal escape rules (double backslashes inside double-quoted Go / Java / Rust / C# strings; single backslashes inside PHP / Ruby single-quoted strings; etc.). Either way, the snippet a reader could paste into `redis-cli` works. + +**Sample audit prompt:** + +> For each `_index.md` under `content/develop/use-cases/{{USE_CASE_NAME}}/`, find every code block that contains an `FT.SEARCH` or `FT.AGGREGATE` query string with a `@<field>:{<value>}` TAG predicate. For each value, identify whether it contains any TAG-syntax character (`.`, `-`, `,`, `:`, `@`, `#`, `$`, space, backslash, etc.). Confirm those characters are backslash-escaped in the snippet at the right level for the code fence's surrounding context (single backslashes in `text` fences; whatever the language requires in source-code fences). Flag any snippet that shows an unescaped special character in a TAG value — that snippet would parse as multiple tokens if a reader pasted it into `redis-cli`. + +**Why on list:** Semantic-cache use case. Codex caught `@model_version:{gpt-4.5-2026}` in the .NET `_index.md` — the unescaped hyphens and dot mean a parser would see three tokens (`gpt`, `4`, `5-2026`) rather than one. The same defect was present in all 8 sibling `_index.md` files (inherited from the Python reference). A reader pasting the snippet into `redis-cli` would get a confused response and not know the docs were wrong. 
([PR #3354 Codex review]) + +--- + ## How to add a new row When a bug class is identified after this skill has been used: diff --git a/content/develop/use-cases/_index.md b/content/develop/use-cases/_index.md index fabd8bfeca..ff63deb36e 100644 --- a/content/develop/use-cases/_index.md +++ b/content/develop/use-cases/_index.md @@ -27,3 +27,4 @@ This section provides practical examples and reference implementations for commo * [Pub/sub messaging]({{< relref "/develop/use-cases/pub-sub" >}}) - Broadcast real-time events to many consumers with channel and pattern subscriptions * [Streaming]({{< relref "/develop/use-cases/streaming" >}}) - Process ordered event streams with consumer groups, replay, and configurable retention * [Recommendation engine]({{< relref "/develop/use-cases/recommendation-engine" >}}) - Serve personalized recommendations under tight latency budgets by combining vector similarity with structured filters in a single Redis call +* [Semantic cache]({{< relref "/develop/use-cases/semantic-cache" >}}) - Reuse LLM responses for semantically similar queries to cut token costs and skip multi-second model calls on near-duplicate prompts diff --git a/content/develop/use-cases/semantic-cache/_index.md b/content/develop/use-cases/semantic-cache/_index.md new file mode 100644 index 0000000000..b1b3fb9f48 --- /dev/null +++ b/content/develop/use-cases/semantic-cache/_index.md @@ -0,0 +1,75 @@ +--- +categories: +- docs +- develop +- stack +- oss +- rs +- rc +description: Reuse LLM responses for queries that are semantically similar, not just byte-identical, to cut token costs, reduce latency, and share validated answers across users and sessions. 
+hideListLinks: true +linkTitle: Semantic cache +title: Redis semantic cache +weight: 7 +--- + +## When to use Redis as a semantic cache + +Use Redis as a semantic cache when you need to reuse LLM responses for queries that are semantically similar — not just byte-identical — so paraphrased and near-duplicate questions skip the full embed-retrieve-generate pipeline and return a previously validated answer in tens of milliseconds. + +## Why the problem is hard + +Every repeated or paraphrased question that reaches an LLM triggers the full pipeline — embedding, retrieval, generation — driving per-query cost up by 10–100x compared to a cache lookup, and pushing P95 latency into multi-second territory. Some of the obvious workarounds have real drawbacks: + +- **A traditional exact-match cache** (string key to response) only hits when the query is byte-identical, so it misses near-duplicates like *"What's your return policy?"* versus *"How do I return an item?"* — exactly the queries that dominate FAQ-style workloads. +- **A standalone vector database** can find similar past queries, but adds operational overhead for what is fundamentally a caching problem, and most lack first-class TTL management, eviction, and metadata filtering — features the cache needs in order to stay correct under churn. +- **Skipping the cache and relying on prompt caching at the model provider** discounts repeated *prefixes* but still runs the model end-to-end on every call, so it does not address latency or full-response reuse across users. + +The core difficulty is threshold tuning: too loose and you serve wrong answers, too tight and the hit rate collapses. Effective semantic caching combines soft similarity matching with hard metadata boundaries (tenant, locale, model version, safety flags) so reuse stays inside well-defined limits. + +This pattern is distinct from using Redis for RAG vector search. 
Semantic caching stores **complete LLM responses**, not document chunks, and the goal is to **skip the LLM entirely** on a hit rather than feed retrieved context into one. + +## What you can expect from a Redis solution + +You can: + +- Return cached answers for paraphrased and near-duplicate queries in tens of milliseconds instead of multi-second LLM round trips. +- Reduce LLM token spend on workloads with repeated query patterns — FAQ bots, helpdesks, internal knowledge assistants — by 30% or more without a measurable quality regression. +- Scope cached answers by tenant, locale, or model version so reuse stays within defined boundaries, applied inside the similarity query rather than in application code. +- Let stale answers expire automatically and shed cold entries under memory pressure without manual cleanup. +- Share validated, high-quality answers across users, sessions, and channels while keeping the cache non-authoritative and rebuildable at any time. +- Add semantic caching to an existing Redis deployment without provisioning a separate vector database or cache service — it is an additional index and key pattern on the same instance. + +## How Redis supports the solution + +In practice, each cache entry is a single [Hash]({{< relref "/develop/data-types/hashes" >}}) or [JSON]({{< relref "/develop/data-types/json" >}}) document holding the prompt, its embedding vector, the LLM response, and metadata fields — tenant, locale, model version, safety flags. A [Redis Search]({{< relref "/develop/ai/search-and-query" >}}) index covers the embedding field together with the metadata fields, so a single [`FT.SEARCH`]({{< relref "/commands/ft.search" >}}) call performs KNN against the cached prompts with a TAG or NUMERIC pre-filter applied in the same pass. 
On a hit (the nearest cached prompt's distance is at or below the configured threshold) the application serves the cached response directly; on a miss it runs the LLM and writes the new prompt, response, and metadata back to the same key pattern with a TTL. + +Redis provides the following features that make it a good fit for a semantic cache: + +- [Hashes]({{< relref "/develop/data-types/hashes" >}}) and [JSON]({{< relref "/develop/data-types/json" >}}) store the prompt, embedding, response, and metadata together under a single key, so a cache hit returns everything the application needs in one round trip. +- [Redis Search]({{< relref "/develop/ai/search-and-query" >}}) with [HNSW vector indexes]({{< relref "/develop/ai/search-and-query/vectors" >}}) finds the nearest cached prompt above a configurable similarity threshold in sub-millisecond time, and the same [`FT.SEARCH`]({{< relref "/commands/ft.search" >}}) call applies TAG and NUMERIC filters so tenant isolation and namespace scoping happen inside the query, not in application logic. +- [`EXPIRE`]({{< relref "/commands/expire" >}}) sets a TTL on each cache entry so stale answers age out without manual cleanup, keeping the cache aligned with the underlying knowledge base. +- Database-level [eviction policies]({{< relref "/develop/reference/eviction" >}}) (LRU / LFU) bound memory under pressure and shed cold entries automatically, so the cache stays within budget as the prompt distribution shifts. +- Sub-millisecond reads and writes from memory let the semantic cache ride on the same Redis instance already handling sessions, rate limiting, or RAG retrieval at zero marginal cost. + +## Ecosystem + +The following libraries, frameworks, and managed services build on Redis for semantic caching: + +- **Python**: [RedisVL](https://github.com/redis/redis-vl-python) provides the `SemanticCache` API with built-in embedding, distance thresholds, TTL, and metadata filters. 
See the [RedisVL LLM cache user guide]({{< relref "/develop/ai/redisvl/user_guide/llmcache" >}}) and the [LangCache integration guide]({{< relref "/develop/ai/redisvl/user_guide/how_to_guides/langcache_semantic_cache" >}}). +- **Frameworks**: [LangChain](https://python.langchain.com/docs/integrations/llm_caching/#redis-cache) (Redis as an LLM cache and vector store), [LlamaIndex](https://docs.llamaindex.ai/en/stable/examples/vector_stores/RedisIndexDemo/), and [LangGraph](https://langchain-ai.github.io/langgraph/) for agent memory and response caching. +- **Managed**: [Redis LangCache]({{< relref "/develop/ai/context-engine/langcache" >}}) is a fully managed semantic cache with a REST API, configurable distance thresholds, automatic eviction, and built-in metrics — no index management or embedding wiring required. + +## Code examples to build your own Redis semantic cache + +The following guides show how to build a small Redis-backed semantic cache that sits in front of an LLM call. Each guide includes a runnable interactive demo that embeds an incoming prompt, runs a thresholded KNN lookup against the cache with tenant and locale filters, serves the cached response on a hit, and on a miss calls the LLM and writes the new prompt, response, and metadata back with a TTL. 
+ +* [redis-py (Python)]({{< relref "/develop/use-cases/semantic-cache/redis-py" >}}) +* [node-redis (Node.js)]({{< relref "/develop/use-cases/semantic-cache/nodejs" >}}) +* [go-redis (Go)]({{< relref "/develop/use-cases/semantic-cache/go" >}}) +* [redis-rs (Rust)]({{< relref "/develop/use-cases/semantic-cache/rust" >}}) +* [NRedisStack (C#)]({{< relref "/develop/use-cases/semantic-cache/dotnet" >}}) +* [Jedis (Java)]({{< relref "/develop/use-cases/semantic-cache/java-jedis" >}}) +* [Lettuce (Java)]({{< relref "/develop/use-cases/semantic-cache/java-lettuce" >}}) +* [Predis (PHP)]({{< relref "/develop/use-cases/semantic-cache/php" >}}) +* [redis-rb (Ruby)]({{< relref "/develop/use-cases/semantic-cache/ruby" >}}) diff --git a/content/develop/use-cases/semantic-cache/dotnet/.gitignore b/content/develop/use-cases/semantic-cache/dotnet/.gitignore new file mode 100644 index 0000000000..8f5917312f --- /dev/null +++ b/content/develop/use-cases/semantic-cache/dotnet/.gitignore @@ -0,0 +1,7 @@ +bin/ +obj/ +model_cache/ +*.user +*.suo +.vs/ +.idea/ diff --git a/content/develop/use-cases/semantic-cache/dotnet/CacheHit.cs b/content/develop/use-cases/semantic-cache/dotnet/CacheHit.cs new file mode 100644 index 0000000000..8381560d38 --- /dev/null +++ b/content/develop/use-cases/semantic-cache/dotnet/CacheHit.cs @@ -0,0 +1,22 @@ +namespace SemanticCacheDemo; + +/// +/// A cache lookup that returned a cached response. +/// +/// +/// is the cosine distance +/// FT.SEARCH reported for the nearest cached prompt (0 = +/// identical, 2 = opposite). It is always at or below the threshold +/// the lookup was run with. 
+/// +public sealed record CacheHit( + string Id, + string Prompt, + string Response, + string Tenant, + string Locale, + string ModelVersion, + double Distance, + long TtlSeconds, + long HitCount +) : LookupResult; diff --git a/content/develop/use-cases/semantic-cache/dotnet/CacheMiss.cs b/content/develop/use-cases/semantic-cache/dotnet/CacheMiss.cs new file mode 100644 index 0000000000..d1194cb191 --- /dev/null +++ b/content/develop/use-cases/semantic-cache/dotnet/CacheMiss.cs @@ -0,0 +1,16 @@ +namespace SemanticCacheDemo; + +/// +/// A cache lookup that did not return a usable response. +/// +/// +/// <see cref="NearestDistance"/> is the cosine distance to the +/// closest cached prompt that did match the metadata +/// filters. Both fields are null when the cache had no entry +/// in scope at all, which is what the demo UI shows as "no +/// candidate" vs. "candidate too far". +/// +public sealed record CacheMiss( + double? NearestDistance, + string? NearestId +) : LookupResult; diff --git a/content/develop/use-cases/semantic-cache/dotnet/LocalEmbedder.cs b/content/develop/use-cases/semantic-cache/dotnet/LocalEmbedder.cs new file mode 100644 index 0000000000..104defa708 --- /dev/null +++ b/content/develop/use-cases/semantic-cache/dotnet/LocalEmbedder.cs @@ -0,0 +1,276 @@ +using System.Buffers.Binary; +using System.Net.Http; +using Microsoft.ML.OnnxRuntime; +using Microsoft.ML.OnnxRuntime.Tensors; +using Microsoft.ML.Tokenizers; + +namespace SemanticCacheDemo; + +/// +/// Local text-embedding helper backed by ONNX Runtime + a Bert +/// WordPiece tokenizer. 
+/// +/// +/// This is a thin wrapper around the +/// sentence-transformers/all-MiniLM-L6-v2 model loaded as an +/// ONNX export from the Xenova/all-MiniLM-L6-v2 Hugging Face +/// mirror: a 384-dimensional encoder that runs in-process on CPU +/// through ONNX Runtime, needs no API key, and produces vectors +/// numerically very close to the equivalent Python and Node ports +/// (close enough that paraphrase distances differ only at the second +/// or third decimal place). +/// +/// The class downloads model.onnx and the +/// vocab.txt WordPiece dictionary into a local cache directory +/// on the first call; every later run is offline. Vectors are mean- +/// pooled over the token positions (weighted by the attention mask) +/// and then L2-normalised explicitly so a Redis Search index declared +/// with DISTANCE_METRIC COSINE returns scores that are +/// directly comparable across entries. +/// +public sealed class LocalEmbedder : IDisposable +{ + public const string DefaultModelName = "sentence-transformers/all-MiniLM-L6-v2"; + public const int DefaultVectorDim = 384; + + // The Xenova mirror is the Node demo's source; the ONNX export + // and vocab there match the original sentence-transformers + // checkpoint and give us a single dependency-free download URL. + private const string ModelUrl = + "https://huggingface.co/Xenova/all-MiniLM-L6-v2/resolve/main/onnx/model.onnx"; + private const string VocabUrl = + "https://huggingface.co/Xenova/all-MiniLM-L6-v2/resolve/main/vocab.txt"; + + private readonly InferenceSession _session; + private readonly BertTokenizer _tokenizer; + + public string ModelName { get; } + public int Dim { get; } + + private LocalEmbedder( + string modelName, + InferenceSession session, + BertTokenizer tokenizer, + int dim) + { + ModelName = modelName; + _session = session; + _tokenizer = tokenizer; + Dim = dim; + } + + /// + /// Load the default model. 
Blocks while ONNX Runtime initialises + /// and the model + tokenizer files are downloaded on the first + /// run. The single <see cref="InferenceSession"/> is shared + /// across handler threads — ONNX Runtime documents + /// InferenceSession.Run as thread-safe. + /// + /// + /// Directory the model and tokenizer files are cached in. Created + /// if it doesn't exist. Defaults to ./model_cache next to + /// the running binary, so a fresh checkout doesn't re-download on + /// every dotnet run. + /// + public static async Task<LocalEmbedder> CreateAsync(string? cacheDir = null) + { + cacheDir ??= Path.Combine(AppContext.BaseDirectory, "model_cache"); + Directory.CreateDirectory(cacheDir); + + string modelPath = Path.Combine(cacheDir, "model.onnx"); + string vocabPath = Path.Combine(cacheDir, "vocab.txt"); + + await DownloadIfMissingAsync(ModelUrl, modelPath); + await DownloadIfMissingAsync(VocabUrl, vocabPath); + + // The Xenova / sentence-transformers MiniLM tokenizer config + // says lower_case=true, do_basic_tokenize=true, + // tokenize_chinese_chars=true; surface those flags here so + // the tokens match the ones produced by the Python / + // Node.js sibling demos. + var options = new BertOptions + { + LowerCaseBeforeTokenization = true, + ApplyBasicTokenization = true, + IndividuallyTokenizeCjk = true, + }; + var tokenizer = BertTokenizer.Create(vocabPath, options); + + // One session per process; ONNX Runtime explicitly documents + // it as thread-safe for inference, so we can share it across + // every HttpListener handler thread without further + // synchronisation. + var session = new InferenceSession(modelPath); + + // Probe the output shape once so we fail loudly if a different + // model is ever wired up against the 384-dim Redis Search + // field.
+ var probe = EncodeInternal(session, tokenizer, "dimension probe"); + return new LocalEmbedder(DefaultModelName, session, tokenizer, probe.Length); + } + + private static async Task DownloadIfMissingAsync(string url, string path) + { + if (File.Exists(path)) return; + Console.WriteLine($"Downloading {url}"); + using var http = new HttpClient + { + Timeout = TimeSpan.FromMinutes(5), + }; + using var stream = await http.GetStreamAsync(url); + // Write to a temp path and rename so a Ctrl-C during the + // download doesn't leave a half-written file the next run + // would happily skip. + string tmp = path + ".part"; + using (var file = File.Create(tmp)) + { + await stream.CopyToAsync(file); + } + File.Move(tmp, path, overwrite: true); + } + + /// + /// Encode a single string. Returns a float[] of length + /// <see cref="Dim"/>. + /// + public float[] EncodeOne(string text) => EncodeInternal(_session, _tokenizer, text); + + /// + /// Encode several strings sequentially and return one vector per + /// input. Throws when the underlying session returns a different + /// number of vectors than inputs. + /// + public List<float[]> EncodeMany(IReadOnlyList<string> texts) + { + var results = new List<float[]>(texts.Count); + foreach (var text in texts) + { + results.Add(EncodeInternal(_session, _tokenizer, text)); + } + if (results.Count != texts.Count) + { + // Belt-and-braces. The loop above guarantees one vector + // per input on the happy path, but surfacing this as an + // explicit check matches the contract the seed loader + // relies on and avoids an index-out-of-range later if a + // future refactor batches into a single Run() call. + throw new InvalidOperationException( + $"embedder produced {results.Count} vectors for {texts.Count} inputs"); + } + return results; + } + + private static float[] EncodeInternal( + InferenceSession session, BertTokenizer tokenizer, string text) + { + // BertTokenizer.EncodeToIds adds the [CLS] / [SEP] sentinels + // that the MiniLM ONNX export expects.
considerPreTokenization + // splits on whitespace + punctuation before WordPiece, which + // matches the do_basic_tokenize=true in the upstream + // tokenizer config. + var ids = tokenizer + .EncodeToIds(text, addSpecialTokens: true, considerPreTokenization: true) + .ToArray(); + int seqLen = ids.Length; + // Empty strings still need at least [CLS] [SEP] so the model + // has something to attend to. EncodeToIds is expected to give + // us that for the empty string already, so no extra guard is added. + + var idsLong = new long[seqLen]; + var mask = new long[seqLen]; + var tokenType = new long[seqLen]; + for (int i = 0; i < seqLen; i++) + { + idsLong[i] = ids[i]; + mask[i] = 1; + tokenType[i] = 0; + } + + var inputIds = new DenseTensor<long>(idsLong, new[] { 1, seqLen }); + var attentionMask = new DenseTensor<long>(mask, new[] { 1, seqLen }); + var tokenTypes = new DenseTensor<long>(tokenType, new[] { 1, seqLen }); + + var inputs = new List<NamedOnnxValue> + { + NamedOnnxValue.CreateFromTensor("input_ids", inputIds), + NamedOnnxValue.CreateFromTensor("attention_mask", attentionMask), + NamedOnnxValue.CreateFromTensor("token_type_ids", tokenTypes), + }; + + using var results = session.Run(inputs); + // The MiniLM ONNX export exposes a single output named + // last_hidden_state of shape [batch, seq, dim]. Pick it by + // position so we don't depend on a specific name across + // future re-exports. + var output = results[0].AsTensor<float>(); + int dim = output.Dimensions[2]; + var pooled = new float[dim]; + + // Attention-masked mean pooling — the standard + // sentence-transformers recipe. The mask is all 1s here + // because we never pad, but write the masked sum so the + // code stays correct under a future batched implementation.
+ double maskTotal = 0; + for (int s = 0; s < seqLen; s++) + { + double w = mask[s]; + maskTotal += w; + for (int d = 0; d < dim; d++) + { + pooled[d] += (float)(output[0, s, d] * w); + } + } + if (maskTotal > 0) + { + float inv = (float)(1.0 / maskTotal); + for (int d = 0; d < dim; d++) pooled[d] *= inv; + } + + // L2-normalise explicitly. The MiniLM ONNX export does not + // ship the normalisation step the Python sentence-transformers + // pipeline applies by default with normalize_embeddings=True; + // doing it here keeps the cosine distances comparable across + // the Python, Node, Go, Java, and .NET demos. + double sq = 0; + foreach (var v in pooled) sq += (double)v * v; + if (sq > 0) + { + float inv = (float)(1.0 / Math.Sqrt(sq)); + for (int d = 0; d < dim; d++) pooled[d] *= inv; + } + return pooled; + } + + /// + /// Pack a float[] into the bytes Redis Search expects for + /// a FLOAT32 vector field — raw little-endian float32 + /// values, no header, no padding. Matches the encoding the + /// Python, Node, Go, and Java ports write. + /// + /// + /// We use + /// rather than because + /// the latter follows host endianness; explicit little-endian + /// here means the docs example is portable even on a hypothetical + /// big-endian .NET host. + /// is checked once at process start in to + /// catch any future surprise — every supported .NET runtime + /// today is little-endian, but the assertion documents the + /// assumption. 
+ /// + public static byte[] ToBytes(float[] vector) + { + var bytes = new byte[vector.Length * sizeof(float)]; + var span = bytes.AsSpan(); + for (int i = 0; i < vector.Length; i++) + { + BinaryPrimitives.WriteSingleLittleEndian(span.Slice(i * sizeof(float)), vector[i]); + } + return bytes; + } + + public void Dispose() + { + _session.Dispose(); + } +} diff --git a/content/develop/use-cases/semantic-cache/dotnet/LookupResult.cs b/content/develop/use-cases/semantic-cache/dotnet/LookupResult.cs new file mode 100644 index 0000000000..483bc48a13 --- /dev/null +++ b/content/develop/use-cases/semantic-cache/dotnet/LookupResult.cs @@ -0,0 +1,10 @@ +namespace SemanticCacheDemo; + +/// +/// Result of a cache lookup. Either a <see cref="CacheHit"/> or a +/// <see cref="CacheMiss"/>; pattern-matched in the demo server to +/// branch between the hit and miss paths. Mirrors the +/// CacheHit | CacheMiss union the Python and Node ports +/// return. +/// +public abstract record LookupResult; diff --git a/content/develop/use-cases/semantic-cache/dotnet/MockLLM.cs b/content/develop/use-cases/semantic-cache/dotnet/MockLLM.cs new file mode 100644 index 0000000000..11d2715522 --- /dev/null +++ b/content/develop/use-cases/semantic-cache/dotnet/MockLLM.cs @@ -0,0 +1,157 @@ +using System.Diagnostics; + +namespace SemanticCacheDemo; + +/// +/// Deterministic mock LLM for the semantic-cache demo. +/// +/// +/// The point of a semantic cache is to skip an LLM +/// call when a prior answer is reusable. To make that visible in a +/// docs demo we need an LLM stand-in that: +/// +/// takes long enough that the saved time on a cache hit is +/// obvious (real-world model calls are 500 ms to several +/// seconds); +/// responds deterministically so a given prompt always +/// produces the same answer, which keeps the demo +/// reproducible; +/// exposes an estimated token count so the demo can show +/// the saving in "tokens not spent" terms alongside +/// latency; +/// needs no API keys, no network, no extra dependencies.
+/// +/// It is keyword-matched against a small lookup table of FAQ +/// answers for a fictional online retailer. Anything that doesn't +/// match falls back to a generic templated reply. The +/// parameter is the simulated round trip; the +/// default (1500 ms) is in the neighbourhood of a real GPT-class +/// model on a moderately-sized prompt. +/// +public sealed class MockLLM +{ + /// One result row of a mock LLM call. + public sealed record Response( + string Text, + string ModelVersion, + double LatencyMs, + int PromptTokens, + int CompletionTokens) + { + public int TotalTokens => PromptTokens + CompletionTokens; + } + + private sealed record KnowledgeRow(string[] Keywords, string Answer); + + private static readonly KnowledgeRow[] Knowledge = new[] + { + new KnowledgeRow( + new[] { "return", "refund", "exchange" }, + "You can return any unworn item within 30 days of delivery for a " + + "full refund. Start a return from your order page; we email a " + + "prepaid label and refund the original payment method within " + + "five business days of receiving the item."), + new KnowledgeRow( + new[] { "shipping", "delivery", "arrive", "ship" }, + "Standard shipping is free on orders over $50 and arrives in " + + "three to five business days. Expedited two-day shipping is " + + "$9.99 and is available at checkout for in-stock items."), + new KnowledgeRow( + new[] { "size", "sizing", "fit" }, + "We follow standard US sizing. For most styles we recommend " + + "ordering your usual size; the product page includes a sizing " + + "chart and customer fit notes for items that run small or large."), + new KnowledgeRow( + new[] { "warranty", "guarantee", "defect", "broken" }, + "All gear is covered by a one-year manufacturer warranty against " + + "defects in materials or workmanship. 
Email support with your " + + "order number and a photo of the issue and we will replace the " + + "item or issue a refund."), + new KnowledgeRow( + new[] { "contact", "support", "help", "agent" }, + "You can reach our support team by email at help@example.com or " + + "by live chat from the help centre, 9am to 9pm Eastern, seven " + + "days a week. Most tickets get a first reply within two hours."), + new KnowledgeRow( + new[] { "track", "tracking", "order", "where" }, + "Your tracking number is on the order confirmation email and on " + + "the order detail page once the package has been picked up by " + + "the carrier - typically within 24 hours of order placement."), + new KnowledgeRow( + new[] { "cancel", "modify", "change" }, + "Orders can be cancelled or modified for up to one hour after " + + "placement. After that the order has usually entered our " + + "warehouse system; the fastest path is to accept delivery and " + + "start a return for any unwanted items."), + new KnowledgeRow( + new[] { "discount", "coupon", "promo", "code" }, + "Active promotional codes are listed on the homepage banner. " + + "Codes apply at checkout and cannot be combined; the system " + + "automatically uses the larger of the two when more than one " + + "would qualify."), + }; + + private const string FallbackAnswer = + "Thanks for the question. Our team would normally answer this " + + "individually; in the meantime please check the help centre or " + + "contact support@example.com for a faster response."; + + public string ModelVersion { get; } + public double LatencyMs { get; } + private long _callCount; + + public MockLLM(string modelVersion = "gpt-4.5-2026", double latencyMs = 1500.0) + { + ModelVersion = modelVersion; + LatencyMs = latencyMs; + } + + public long CallCount => Interlocked.Read(ref _callCount); + + /// + /// Pretend to call a model. Sleeps for the configured latency, + /// then returns a templated answer. 
+ /// + public Response Complete(string prompt) + { + Interlocked.Increment(ref _callCount); + var sw = Stopwatch.StartNew(); + // Sleep first so the latency is realistic regardless of which + // branch generates the text. + int sleepMs = (int)Math.Max(0, Math.Round(LatencyMs)); + Thread.Sleep(sleepMs); + string response = AnswerFor(prompt); + sw.Stop(); + return new Response( + response, + ModelVersion, + sw.Elapsed.TotalMilliseconds, + EstimateTokens(prompt), + EstimateTokens(response)); + } + + private static string AnswerFor(string prompt) + { + string lower = prompt.ToLowerInvariant(); + foreach (var row in Knowledge) + { + foreach (var kw in row.Keywords) + { + if (lower.Contains(kw)) return row.Answer; + } + } + return FallbackAnswer; + } + + /// + /// Rough English token estimate: ~4 characters per token. Real + /// tokenizers (BPE, SentencePiece) vary slightly but this is + /// close enough for "look how many tokens you saved" demo + /// signage. + /// + public static int EstimateTokens(string? text) + { + if (string.IsNullOrEmpty(text)) return 0; + return Math.Max(1, text.Length / 4); + } +} diff --git a/content/develop/use-cases/semantic-cache/dotnet/Program.cs b/content/develop/use-cases/semantic-cache/dotnet/Program.cs new file mode 100644 index 0000000000..72c849e132 --- /dev/null +++ b/content/develop/use-cases/semantic-cache/dotnet/Program.cs @@ -0,0 +1,747 @@ +using System.Diagnostics; +using System.Net; +using System.Text; +using System.Text.Json; +using System.Text.Json.Nodes; +using System.Web; +using StackExchange.Redis; + +namespace SemanticCacheDemo; + +/// +/// Redis semantic-cache demo server (.NET 8 + NRedisStack + ONNX +/// Runtime). +/// +/// +/// Run this and visit http://localhost:8092 to drive a +/// small semantic-cache demo backed by Redis Search. 
The UI lets you +/// type a natural-language prompt and watch the cache decide hit or +/// miss; on a hit Redis returns the cached response in tens of +/// milliseconds and the demo LLM is not called at all, while on a +/// miss the demo LLM "thinks" for ~1.5 s before answering and the +/// new prompt, response, and embedding are written back to Redis +/// for next time. +/// +/// The server holds a single , a +/// single , and a single +/// for the lifetime of the process. The first +/// run downloads the embedding model into ./model_cache; +/// everything after is local. +/// +public static class Program +{ + private const string StackLabel = + "NRedisStack + ONNX Runtime + .NET HttpListener"; + + // 1 MiB cap on POST bodies so a runaway client (or a `curl + // --data-binary @big-file` by mistake) can't accumulate + // unbounded memory before the handler runs. The demo's largest + // legitimate body is a few hundred bytes of form-encoded query + // fields; 1 MiB is a generous ceiling and matches the Node and + // Go demos' caps. + private const int MaxBodyBytes = 1 * 1024 * 1024; + + public static int Main(string[] argv) + { + // The cache stores embeddings as raw little-endian float32 + // bytes; the wire format is fixed regardless of host + // endianness. The packer in LocalEmbedder writes little-endian + // explicitly via BinaryPrimitives, so a hypothetical + // big-endian .NET host would still produce the correct + // bytes — but every supported runtime today is little-endian + // and a surprise here would silently corrupt every vector + // we write, so assert it loudly at startup. 
+ Debug.Assert(BitConverter.IsLittleEndian, + "this demo assumes a little-endian host"); + + Args args; + try + { + args = Args.Parse(argv); + } + catch (ArgumentException ex) + { + Console.Error.WriteLine($"Error: {ex.Message}"); + PrintHelp(); + return 2; + } + + ConnectionMultiplexer mux; + try + { + mux = ConnectionMultiplexer.Connect(new ConfigurationOptions + { + EndPoints = { { args.RedisHost, args.RedisPort } }, + AbortOnConnectFail = false, + ConnectTimeout = 2000, + SyncTimeout = 5000, + }); + var pingDb = mux.GetDatabase(); + pingDb.Ping(); + } + catch (Exception ex) + { + Console.Error.WriteLine( + $"Error: cannot reach Redis at {args.RedisHost}:{args.RedisPort}"); + Console.Error.WriteLine($" ({ex.Message})"); + return 1; + } + + var db = mux.GetDatabase(); + var cache = new RedisSemanticCache( + db, + indexName: args.IndexName, + keyPrefix: args.KeyPrefix, + distanceThreshold: args.Threshold, + defaultTtlSeconds: args.TtlSeconds); + cache.CreateIndex(); + + Console.WriteLine( + "Loading embedding model (first run downloads ~90 MB of ONNX weights)..."); + LocalEmbedder embedder; + try + { + embedder = LocalEmbedder.CreateAsync().GetAwaiter().GetResult(); + } + catch (Exception ex) + { + Console.Error.WriteLine($"Error loading embedder: {ex.Message}"); + return 1; + } + var llm = new MockLLM(latencyMs: args.LlmLatencyMs); + var demo = new SemanticCacheDemo(cache, embedder, llm); + + if (args.ResetOnStart) + { + Console.WriteLine( + $"Dropping any existing cache under '{args.KeyPrefix}*' and " + + "re-seeding from the FAQ list (pass --no-reset to keep)."); + int seeded = demo.Seed(); + Console.WriteLine($"Seeded {seeded} entries."); + } + + // Load index.html once and substitute the template tokens so + // the docs panel shows the actual values in use rather than + // the default copies. The file ships next to the binary via + // the entry in the .csproj. 
+ string htmlPath = Path.Combine(AppContext.BaseDirectory, "index.html"); + if (!File.Exists(htmlPath)) + { + Console.Error.WriteLine( + $"index.html not found next to the binary at {htmlPath}."); + return 1; + } + string rawHtml = File.ReadAllText(htmlPath); + string htmlPage = rawHtml + .Replace("__INDEX_NAME__", args.IndexName) + .Replace("__KEY_PREFIX__", args.KeyPrefix); + + var listener = new HttpListener(); + // HttpListener prefixes need a trailing slash; '+' wildcard + // would require admin rights on macOS/Linux, so we bind to + // the literal host string. Use 127.0.0.1 to keep the demo + // off the network by default, matching the other demos. + string prefix = $"http://{args.Host}:{args.Port}/"; + listener.Prefixes.Add(prefix); + try + { + listener.Start(); + } + catch (Exception ex) + { + Console.Error.WriteLine($"Failed to bind {prefix}: {ex.Message}"); + return 1; + } + + Console.WriteLine( + $"Redis semantic cache demo listening on http://{args.Host}:{args.Port}"); + Console.WriteLine( + $"Using Redis at {args.RedisHost}:{args.RedisPort} with index '{args.IndexName}'"); + + var cts = new CancellationTokenSource(); + Console.CancelKeyPress += (_, e) => + { + e.Cancel = true; + Console.WriteLine("\nShutting down..."); + cts.Cancel(); + try { listener.Stop(); } catch { /* best-effort */ } + }; + + // One handler thread per request out of the ThreadPool. The + // ONNX session, the Redis multiplexer, and the cache are all + // thread-safe; nothing else here needs serialising beyond + // the seed/reset path's lock. 
+ while (!cts.IsCancellationRequested) + { + HttpListenerContext ctx; + try + { + ctx = listener.GetContext(); + } + catch (HttpListenerException) { break; } + catch (ObjectDisposedException) { break; } + ThreadPool.QueueUserWorkItem(_ => + { + try + { + HandleRequest(ctx, cache, embedder, llm, demo, htmlPage); + } + catch (Exception ex) + { + Console.Error.WriteLine( + $"[demo] handler error: {ex.GetType().Name}: {ex.Message}"); + TrySendError(ctx, ex); + } + }); + } + + try { listener.Close(); } catch { /* best-effort */ } + embedder.Dispose(); + mux.Dispose(); + return 0; + } + + // ------------------------------------------------------------------ + // HTTP request handling + // ------------------------------------------------------------------ + + private static void HandleRequest( + HttpListenerContext ctx, + RedisSemanticCache cache, + LocalEmbedder embedder, + MockLLM llm, + SemanticCacheDemo demo, + string htmlPage) + { + var req = ctx.Request; + string path = req.Url?.AbsolutePath ?? "/"; + + if (string.Equals(req.HttpMethod, "GET", StringComparison.OrdinalIgnoreCase)) + { + switch (path) + { + case "/": + case "/index.html": + SendHtml(ctx, 200, htmlPage); + return; + case "/state": + SendJson(ctx, 200, BuildState(cache, embedder, llm)); + return; + default: + SendJson(ctx, 404, ErrorPayload("not found", null)); + return; + } + } + + if (string.Equals(req.HttpMethod, "POST", StringComparison.OrdinalIgnoreCase)) + { + // Reject oversized bodies before reading them. The + // Content-Length header isn't authoritative — a client + // could lie or stream — so we also bound the read below. 
+ if (req.ContentLength64 > MaxBodyBytes) + { + SendJson(ctx, 413, ErrorPayload( + $"request body exceeds {MaxBodyBytes} bytes", null)); + return; + } + string body; + try + { + body = ReadBodyCapped(req, MaxBodyBytes); + } + catch (BodyTooLargeException ex) + { + // A client without an honest Content-Length header + // (chunked, or just lying) still gets a clean 413 + // here rather than falling through to the generic + // exception handler that would respond 500. + SendJson(ctx, 413, ErrorPayload(ex.Message, null)); + return; + } + var form = ParseForm(body); + + switch (path) + { + case "/query": + HandleQuery(ctx, form, demo, llm); + return; + case "/reset": + demo.Seed(); + SendJson(ctx, 200, new JsonObject { ["ok"] = true }); + return; + case "/drop": + HandleDrop(ctx, form, cache); + return; + default: + SendJson(ctx, 404, ErrorPayload("not found", null)); + return; + } + } + + SendJson(ctx, 405, ErrorPayload("method not allowed", null)); + } + + private static void HandleQuery( + HttpListenerContext ctx, + Dictionary<string, string> form, + SemanticCacheDemo demo, + MockLLM llm) + { + string prompt = (form.GetValueOrDefault("prompt") ?? "").Trim(); + if (string.IsNullOrEmpty(prompt)) + { + SendJson(ctx, 400, ErrorPayload("prompt is required", null)); + return; + } + + double threshold = ClampThreshold(form.GetValueOrDefault("threshold")); + bool lookupOnly = !string.IsNullOrEmpty(form.GetValueOrDefault("lookup_only")); + string tenant = NonEmpty(form.GetValueOrDefault("tenant"), "acme"); + string locale = NonEmpty(form.GetValueOrDefault("locale"), "en"); + string modelVersion = NonEmpty(form.GetValueOrDefault("model_version"), llm.ModelVersion); + + var payload = demo.RunQuery( + prompt, tenant, locale, modelVersion, threshold, lookupOnly); + SendJson(ctx, 200, payload); + } + + private static void HandleDrop( + HttpListenerContext ctx, + Dictionary<string, string> form, + RedisSemanticCache cache) + { + string entryId = (form.GetValueOrDefault("entry_id") ??
"").Trim(); + if (string.IsNullOrEmpty(entryId)) + { + SendJson(ctx, 400, ErrorPayload("entry_id is required", null)); + return; + } + bool deleted = cache.DeleteEntry(entryId); + SendJson(ctx, 200, new JsonObject + { + ["deleted"] = deleted, + ["entry_id"] = entryId, + }); + } + + // ------------------------------------------------------------------ + // State assembly + // ------------------------------------------------------------------ + + private static JsonObject BuildState( + RedisSemanticCache cache, LocalEmbedder embedder, MockLLM llm) + { + var info = cache.IndexInfo(); + var entries = cache.ListEntries(200); + var index = new JsonObject + { + ["num_docs"] = info.NumDocs, + ["indexing_failures"] = info.IndexingFailures, + ["vector_index_size_mb"] = info.VectorIndexSizeMb, + ["index_name"] = cache.IndexName, + ["model"] = embedder.ModelName, + ["mock_llm_latency_ms"] = llm.LatencyMs, + // default_threshold lets the --threshold flag actually + // reach the UI slider on first load. stack_label lets the + // same HTML render a per-language badge without forking + // the file per language. 
+ ["default_threshold"] = cache.DistanceThreshold, + ["stack_label"] = StackLabel, + }; + var entriesJson = new JsonArray(); + foreach (var e in entries) + { + entriesJson.Add(new JsonObject + { + ["id"] = e.Id, + ["prompt"] = e.Prompt, + ["response"] = e.Response, + ["tenant"] = e.Tenant, + ["locale"] = e.Locale, + ["model_version"] = e.ModelVersion, + ["safety"] = e.Safety, + ["hit_count"] = e.HitCount, + ["ttl_seconds"] = e.TtlSeconds, + ["created_ts"] = e.CreatedTs, + }); + } + return new JsonObject + { + ["index"] = index, + ["entries"] = entriesJson, + }; + } + + // ------------------------------------------------------------------ + // HTTP plumbing + // ------------------------------------------------------------------ + + private static void SendHtml(HttpListenerContext ctx, int status, string html) + { + byte[] bytes = Encoding.UTF8.GetBytes(html); + ctx.Response.StatusCode = status; + ctx.Response.ContentType = "text/html; charset=utf-8"; + ctx.Response.ContentLength64 = bytes.Length; + using var os = ctx.Response.OutputStream; + os.Write(bytes, 0, bytes.Length); + } + + private static void SendJson(HttpListenerContext ctx, int status, JsonNode body) + { + byte[] bytes = Encoding.UTF8.GetBytes(body.ToJsonString()); + ctx.Response.StatusCode = status; + ctx.Response.ContentType = "application/json"; + ctx.Response.ContentLength64 = bytes.Length; + using var os = ctx.Response.OutputStream; + os.Write(bytes, 0, bytes.Length); + } + + private static void TrySendError(HttpListenerContext ctx, Exception ex) + { + // The headers may already be partially flushed; nothing + // useful left to do beyond letting the connection drop. + try + { + SendJson(ctx, 500, ErrorPayload(ex.Message, ex.GetType().Name)); + } + catch + { + try { ctx.Response.Abort(); } catch { /* best-effort */ } + } + } + + private static JsonObject ErrorPayload(string message, string? 
type) + { + var o = new JsonObject { ["error"] = message }; + if (type is not null) o["type"] = type; + return o; + } + + /// + /// Signals that a streamed POST body exceeded the cap before the + /// final byte arrived. Used so the dispatch loop can return a + /// clean 413 instead of letting a generic exception escape to + /// the JSON-500 fallback. + /// + private sealed class BodyTooLargeException : Exception + { + public BodyTooLargeException(string message) : base(message) { } + } + + private static string ReadBodyCapped(HttpListenerRequest req, int maxBytes) + { + // Read in 8 KiB chunks and stop as soon as the running total + // passes maxBytes, so a body exactly at the limit is accepted + // and anything larger is rejected. The Content-Length-based + // shortcut isn't safe because a malicious client can lie. + using var input = req.InputStream; + var ms = new MemoryStream(); + var buf = new byte[8192]; + int total = 0; + while (true) + { + int read = input.Read(buf, 0, buf.Length); + if (read <= 0) break; + total += read; + if (total > maxBytes) + { + throw new BodyTooLargeException( + $"request body exceeds {maxBytes} bytes"); + } + ms.Write(buf, 0, read); + } + return Encoding.UTF8.GetString(ms.ToArray()); + } + + private static Dictionary<string, string> ParseForm(string body) + { + var d = new Dictionary<string, string>(StringComparer.Ordinal); + if (string.IsNullOrEmpty(body)) return d; + foreach (var pair in body.Split('&')) + { + if (pair.Length == 0) continue; + int eq = pair.IndexOf('='); + string key, value; + if (eq < 0) + { + key = HttpUtility.UrlDecode(pair); + value = ""; + } + else + { + key = HttpUtility.UrlDecode(pair.Substring(0, eq)); + value = HttpUtility.UrlDecode(pair.Substring(eq + 1)); + } + d[key] = value; + } + return d; + } + + private static string NonEmpty(string? value, string fallback) + => string.IsNullOrEmpty(value) ? fallback : value!; + + /// + /// Sanitise the threshold parameter from the form body. Clamps + /// NaN/Infinity to 0.5 and otherwise clamps to [0.0, 2.0].
+ /// happily + /// handles "nan" → NaN and "inf" → +∞. Either would silently + /// turn the lookup into a permanent hit (NaN comparisons + /// are always false, so distance > nan cannot reject) + /// or a permanent miss; clamping to the meaningful + /// cosine-distance range stops a malformed POST from overriding + /// the threshold semantics. + /// + internal static double ClampThreshold(string? raw) + { + double parsed = 0.5; + if (!string.IsNullOrEmpty(raw)) + { + if (!double.TryParse(raw, System.Globalization.NumberStyles.Float, + System.Globalization.CultureInfo.InvariantCulture, out parsed)) + { + parsed = 0.5; + } + } + if (!double.IsFinite(parsed)) return 0.5; + if (parsed < 0.0) return 0.0; + if (parsed > 2.0) return 2.0; + return parsed; + } + + // ------------------------------------------------------------------ + // CLI parsing + // ------------------------------------------------------------------ + + public sealed class Args + { + public string Host { get; set; } = "127.0.0.1"; + public int Port { get; set; } = 8092; + public string RedisHost { get; set; } = "localhost"; + public int RedisPort { get; set; } = 6379; + public string IndexName { get; set; } = "semcache:idx"; + public string KeyPrefix { get; set; } = "cache:"; + public long TtlSeconds { get; set; } = 3600; + public double Threshold { get; set; } = 0.5; + public double LlmLatencyMs { get; set; } = 1500.0; + public bool ResetOnStart { get; set; } = true; + + public static Args Parse(string[] argv) + { + var a = new Args(); + for (int i = 0; i < argv.Length; i++) + { + string flag = argv[i]; + switch (flag) + { + case "--host": + a.Host = RequireValue(argv, ++i, flag); + break; + case "--port": + a.Port = int.Parse(RequireValue(argv, ++i, flag), + System.Globalization.CultureInfo.InvariantCulture); + break; + case "--redis-host": + a.RedisHost = RequireValue(argv, ++i, flag); + break; + case "--redis-port": + a.RedisPort = int.Parse(RequireValue(argv, ++i, flag), + 
System.Globalization.CultureInfo.InvariantCulture); + break; + case "--index-name": + a.IndexName = RequireValue(argv, ++i, flag); + break; + case "--key-prefix": + a.KeyPrefix = RequireValue(argv, ++i, flag); + break; + case "--ttl-seconds": + a.TtlSeconds = long.Parse(RequireValue(argv, ++i, flag), + System.Globalization.CultureInfo.InvariantCulture); + break; + case "--threshold": + a.Threshold = double.Parse(RequireValue(argv, ++i, flag), + System.Globalization.NumberStyles.Float, + System.Globalization.CultureInfo.InvariantCulture); + break; + case "--llm-latency-ms": + a.LlmLatencyMs = double.Parse(RequireValue(argv, ++i, flag), + System.Globalization.NumberStyles.Float, + System.Globalization.CultureInfo.InvariantCulture); + break; + case "--no-reset": + a.ResetOnStart = false; + break; + case "-h": + case "--help": + PrintHelp(); + Environment.Exit(0); + break; + default: + throw new ArgumentException($"Unknown flag: {flag}"); + } + } + return a; + } + + private static string RequireValue(string[] argv, int i, string flag) + { + if (i >= argv.Length) + throw new ArgumentException($"Missing value for {flag}"); + return argv[i]; + } + } + + private static void PrintHelp() + { + Console.WriteLine("Usage: dotnet SemanticCacheDemo.dll [options]"); + Console.WriteLine(" --host HOST HTTP bind host (default 127.0.0.1)"); + Console.WriteLine(" --port PORT HTTP bind port (default 8092)"); + Console.WriteLine(" --redis-host HOST Redis host (default localhost)"); + Console.WriteLine(" --redis-port PORT Redis port (default 6379)"); + Console.WriteLine(" --index-name NAME Redis Search index name (default semcache:idx)"); + Console.WriteLine(" --key-prefix PREFIX Hash key prefix (default cache:)"); + Console.WriteLine(" --ttl-seconds N TTL for cache entries (default 3600)"); + Console.WriteLine(" --threshold F Default cosine-distance cutoff (default 0.5)"); + Console.WriteLine(" --llm-latency-ms F Mock LLM latency (default 1500.0)"); + Console.WriteLine(" --no-reset Keep 
existing cache instead of re-seeding"); + } +} + +/// +/// Demo state: cache management, mock LLM, and cumulative seeding. +/// +public sealed class SemanticCacheDemo +{ + private readonly RedisSemanticCache _cache; + private readonly LocalEmbedder _embedder; + private readonly MockLLM _llm; + private readonly object _seedLock = new(); + public string DefaultTenant { get; } + public string DefaultLocale { get; } + + public SemanticCacheDemo( + RedisSemanticCache cache, LocalEmbedder embedder, MockLLM llm, + string defaultTenant = "acme", string defaultLocale = "en") + { + _cache = cache; + _embedder = embedder; + _llm = llm; + DefaultTenant = defaultTenant; + DefaultLocale = defaultLocale; + } + + /// Drop everything in scope and pre-populate with FAQ entries. + public int Seed() + { + // Two clients hitting /reset back-to-back would otherwise + // race on the drop/create/seed sequence and leave the index + // in an inconsistent state. + lock (_seedLock) + { + _cache.Clear(); + return SeedCache.Seed( + _cache, _embedder, + tenant: DefaultTenant, + locale: DefaultLocale, + modelVersion: _llm.ModelVersion); + } + } + + /// + /// The hot path: embed, look up, optionally call the LLM, cache. + /// + /// + /// Timings are taken with <see cref="Stopwatch"/> around each + /// bounded step so the UI can display the embed / lookup / LLM + /// breakdown separately. The cache write on a miss is not + /// included in total_ms so the latency number reflects + /// the user-facing wait, not the background bookkeeping.
+ /// + public JsonObject RunQuery( + string prompt, string tenant, string locale, string modelVersion, + double threshold, bool lookupOnly) + { + var sw = Stopwatch.StartNew(); + float[] queryVec = _embedder.EncodeOne(prompt); + double embedMs = sw.Elapsed.TotalMilliseconds; + sw.Restart(); + var result = _cache.Lookup( + queryVec, + tenant: tenant, locale: locale, + modelVersion: modelVersion, safety: "ok", + distanceThreshold: threshold); + double lookupMs = sw.Elapsed.TotalMilliseconds; + + if (result is CacheHit hit) + { + return new JsonObject + { + ["outcome"] = "hit", + ["response"] = hit.Response, + ["entry_id"] = hit.Id, + ["distance"] = hit.Distance, + ["ttl_seconds"] = hit.TtlSeconds, + ["hit_count"] = hit.HitCount, + ["threshold"] = threshold, + ["embed_ms"] = embedMs, + ["lookup_ms"] = lookupMs, + ["llm_ms"] = null, + ["total_ms"] = embedMs + lookupMs, + ["tokens_avoided"] = EstimateResponseTokens(hit.Prompt, hit.Response), + ["ms_avoided"] = _llm.LatencyMs, + }; + } + + var miss = (CacheMiss)result; + + // Miss path. In "lookup only" mode the demo reports the miss + // without actually calling the LLM — useful for sweeping the + // threshold against a fixed prompt to see where the cutoff + // would fall without polluting the cache. + if (lookupOnly) + { + return new JsonObject + { + ["outcome"] = "miss", + ["response"] = "(LLM not called in lookup-only mode)", + ["nearest_distance"] = miss.NearestDistance, + ["threshold"] = threshold, + ["wrote_entry_id"] = null, + ["embed_ms"] = embedMs, + ["lookup_ms"] = lookupMs, + ["llm_ms"] = null, + ["total_ms"] = embedMs + lookupMs, + }; + } + + sw.Restart(); + var llmResponse = _llm.Complete(prompt); + double llmMs = sw.Elapsed.TotalMilliseconds; + + // Write the new entry back. The embedding is the same vector + // we already used for the lookup — no need to re-encode. 
+ string entryId = _cache.Put( + prompt: prompt, + response: llmResponse.Text, + embedding: queryVec, + tenant: tenant, locale: locale, modelVersion: modelVersion); + + return new JsonObject + { + ["outcome"] = "miss", + ["response"] = llmResponse.Text, + ["nearest_distance"] = miss.NearestDistance, + ["threshold"] = threshold, + ["wrote_entry_id"] = entryId, + ["embed_ms"] = embedMs, + ["lookup_ms"] = lookupMs, + ["llm_ms"] = llmMs, + ["total_ms"] = embedMs + lookupMs + llmMs, + }; + } + + private static int EstimateResponseTokens(string prompt, string response) + { + int len = (prompt?.Length ?? 0) + (response?.Length ?? 0); + return Math.Max(1, len / 4); + } +} diff --git a/content/develop/use-cases/semantic-cache/dotnet/RedisSemanticCache.cs b/content/develop/use-cases/semantic-cache/dotnet/RedisSemanticCache.cs new file mode 100644 index 0000000000..52bc3ee772 --- /dev/null +++ b/content/develop/use-cases/semantic-cache/dotnet/RedisSemanticCache.cs @@ -0,0 +1,492 @@ +using System.Globalization; +using NRedisStack; +using NRedisStack.RedisStackCommands; +using NRedisStack.Search; +using NRedisStack.Search.Literals.Enums; +using StackExchange.Redis; + +namespace SemanticCacheDemo; + +/// +/// Redis semantic-cache helper backed by Redis Search. +/// +/// +/// Each cache entry lives as a Hash document at +/// cache:<id>. The hash stores the user's prompt and the +/// corresponding LLM response alongside the raw float32 bytes of the +/// prompt's 384-dimensional embedding and a small set of metadata +/// fields — tenant, locale, model version, and a safety flag. +/// +/// A single Redis Search index covers the embedding plus every +/// metadata field, so one FT.SEARCH call does an +/// approximate-nearest-neighbour lookup against the cached prompts +/// with a TAG pre-filter applied in the same pass — no cross-store +/// joins, no extra round trips, and tenant isolation is enforced +/// inside the query rather than after the fact in +/// application code. 
+/// +/// The lookup is thresholded: FT.SEARCH always returns +/// the closest cached prompt, but the cache only serves it as a hit +/// when the cosine distance is at or below +/// <see cref="DistanceThreshold"/>. Anything further away is treated +/// as a miss; the caller is expected to run the underlying LLM and +/// write the new prompt, response, and embedding back with +/// <see cref="Put"/>. +/// +/// Each cache entry is written with EXPIRE, so stale +/// answers age out without manual cleanup; combine with an +/// allkeys-lfu eviction policy on the database to cap memory +/// under pressure too. +/// +public sealed class RedisSemanticCache +{ + public const int VectorDimDefault = 384; + + // Characters Redis Search treats as syntax inside a TAG value; + // any of them in a user-supplied filter must be backslash-escaped + // or the surrounding `{...}` block won't parse correctly. + private static readonly HashSet<char> TagSpecial = new( + "\\,.<>{}[]\"':;!@#$%^&*()-+=~| "); + + private readonly IDatabase _db; + private readonly ISearchCommands _ft; + public string IndexName { get; } + public string KeyPrefix { get; } + public int VectorDim { get; } + public double DistanceThreshold { get; } + public long DefaultTtlSeconds { get; } + + public RedisSemanticCache( + IDatabase db, + string indexName = "semcache:idx", + string keyPrefix = "cache:", + int vectorDim = VectorDimDefault, + double distanceThreshold = 0.5, + long defaultTtlSeconds = 3600) + { + _db = db; + _ft = db.FT(); + IndexName = indexName; + KeyPrefix = keyPrefix; + VectorDim = vectorDim; + DistanceThreshold = distanceThreshold; + DefaultTtlSeconds = defaultTtlSeconds; + } + + // ------------------------------------------------------------------ + // Keys + // ------------------------------------------------------------------ + + public string EntryKey(string entryId) => KeyPrefix + entryId; + + // ------------------------------------------------------------------ + // Index management + //
------------------------------------------------------------------ + + /// + /// Create the Redis Search index if it doesn't already exist. + /// + /// + /// One index covers the embedding plus every metadata field, so + /// a single FT.SEARCH can pre-filter by tenant / locale / + /// model and then KNN-rank the matching documents in one pass. + /// The prompt and response fields are stored as + /// TEXT so admin tooling can grep the cache by content, + /// but the cache lookup itself is vector-only. + /// + public void CreateIndex() + { + var schema = new Schema() + .AddTextField("prompt") + .AddTextField("response") + .AddTagField("tenant") + .AddTagField("locale") + .AddTagField("model_version") + .AddTagField("safety") + .AddNumericField("created_ts", sortable: true) + .AddNumericField("hit_count", sortable: true) + .AddVectorField("embedding", Schema.VectorField.VectorAlgo.HNSW, + new Dictionary<string, object> + { + ["TYPE"] = "FLOAT32", + ["DIM"] = VectorDim, + ["DISTANCE_METRIC"] = "COSINE", + }); + try + { + _ft.Create( + IndexName, + new FTCreateParams() + .On(IndexDataType.HASH) + .Prefix(KeyPrefix), + schema); + } + catch (RedisServerException ex) + when (ex.Message.Contains("Index already exists", StringComparison.OrdinalIgnoreCase)) + { + // Idempotent: re-running create on an already-built + // index is the expected path on every restart. + } + } + + /// Drop the search index. Optionally also delete cached entries. + public void DropIndex(bool deleteDocuments = false) + { + try + { + _ft.DropIndex(IndexName, deleteDocuments); + } + catch (RedisServerException ex) + { + string msg = ex.Message ??
""; + if (!msg.Contains("no such index", StringComparison.OrdinalIgnoreCase) + && !msg.Contains("unknown index name", StringComparison.OrdinalIgnoreCase)) + { + throw; + } + } + } + + // ------------------------------------------------------------------ + // Lookup + // ------------------------------------------------------------------ + + /// + /// Find the nearest in-scope cached prompt and decide hit / miss. + /// + /// + /// FT.SEARCH returns the single nearest entry that + /// satisfies the TAG pre-filters. The lookup is a hit only if the + /// reported cosine distance is at or below + /// (or the instance + /// default). Anything further away is a miss with the candidate + /// distance attached so the caller can log it. + /// + /// On a hit, the entry's hit_count is incremented + /// atomically with HINCRBY and the TTL is refreshed inside + /// the same MULTI/EXEC so a frequently used answer doesn't + /// age out under cold tail entries. + /// + public LookupResult Lookup( + float[] queryVec, + string? tenant = null, + string? locale = null, + string? modelVersion = null, + string? safety = "ok", + double? distanceThreshold = null) + { + if (queryVec is null) throw new ArgumentNullException(nameof(queryVec)); + + // Match the shape check that Put performs. A wrong-dim + // vector would otherwise hit Redis as a malformed FT.SEARCH + // parameter and surface as a server-side parse error instead + // of a clear caller-side error. + if (queryVec.Length != VectorDim) + { + throw new ArgumentException( + $"queryVec length is {queryVec.Length}; index expects {VectorDim}", + nameof(queryVec)); + } + + double threshold = distanceThreshold ?? 
DistanceThreshold; + + string filterClause = BuildFilterClause(tenant, locale, modelVersion, safety); + string knnQuery = $"{filterClause}=>[KNN 1 @embedding $vec AS distance]"; + byte[] vecBytes = LocalEmbedder.ToBytes(queryVec); + + var query = new Query(knnQuery) + .ReturnFields( + "prompt", "response", "tenant", "locale", + "model_version", "hit_count", "distance") + .SetSortBy("distance", ascending: true) + .Limit(0, 1) + .AddParam("vec", vecBytes) + .Dialect(2); + + var result = _ft.Search(IndexName, query); + if (result.Documents is null || result.Documents.Count == 0) + { + return new CacheMiss(null, null); + } + + var doc = result.Documents[0]; + string rawKey = doc.Id ?? ""; + string entryId = rawKey.StartsWith(KeyPrefix) + ? rawKey.Substring(KeyPrefix.Length) + : rawKey; + + var props = doc.GetProperties().ToDictionary(p => p.Key, p => p.Value); + double distance = ParseDouble(props.GetValueOrDefault("distance"), 0.0); + + if (distance > threshold) + { + return new CacheMiss(distance, entryId); + } + + // The hash may have expired between FT.SEARCH returning the + // row and us getting here — the search index lags expirations + // by its periodic scan. If we just blindly HINCRBY-ed, Redis + // would helpfully recreate the hash with only `hit_count` + // set and the search index would then log it as an indexing + // failure (no embedding, no metadata). EXISTS narrows that + // race to the round-trip below; a strictly race-free version + // would wrap the bump in a Lua script that checks existence + // and acts in one server-side step. + string entryKey = EntryKey(entryId); + if (!_db.KeyExists(entryKey)) + { + return new CacheMiss(distance, entryId); + } + + // MULTI/EXEC the three writes so they apply as a unit on the + // server — a partial failure between HINCRBY and EXPIRE would + // otherwise leave the entry without a refreshed TTL. 
+ // StackExchange.Redis returns Task results that resolve only + // after Execute(); we collect them and read .Result here + // because the demo is intentionally synchronous to match the + // other ports. + var tx = _db.CreateTransaction(); + var hincrTask = tx.HashIncrementAsync(entryKey, "hit_count", 1); + var expireTask = tx.KeyExpireAsync(entryKey, TimeSpan.FromSeconds(DefaultTtlSeconds)); + var ttlTask = tx.KeyTimeToLiveAsync(entryKey); + bool committed = tx.Execute(); + if (!committed) + { + // Should be unreachable — we didn't queue any WATCHes — + // but documenting the contract avoids a silent NRE if a + // future refactor adds one. + return new CacheMiss(distance, entryId); + } + + long newHitCount = hincrTask.Result; + TimeSpan? ttl = ttlTask.Result; + long ttlSeconds = ttl is { TotalSeconds: > 0 } v ? (long)v.TotalSeconds : DefaultTtlSeconds; + + return new CacheHit( + Id: entryId, + Prompt: props.GetValueOrDefault("prompt").ToString() ?? "", + Response: props.GetValueOrDefault("response").ToString() ?? "", + Tenant: props.GetValueOrDefault("tenant").ToString() ?? "", + Locale: props.GetValueOrDefault("locale").ToString() ?? "", + ModelVersion: props.GetValueOrDefault("model_version").ToString() ?? "", + Distance: distance, + TtlSeconds: ttlSeconds, + HitCount: newHitCount); + } + + // ------------------------------------------------------------------ + // Write + // ------------------------------------------------------------------ + + /// + /// Write a new cache entry and return its id. + /// + /// + /// The embedding is stored as raw little-endian float32 bytes — + /// the encoding Redis Search expects from a FLOAT32 vector + /// field. EXPIRE on the key gives every entry a bounded + /// lifetime; combine with an allkeys-lfu eviction policy + /// on the database to cap memory under pressure too. 
+ /// + public string Put( + string prompt, + string response, + float[] embedding, + string tenant = "default", + string locale = "en", + string modelVersion = "gpt-4.5-2026", + string safety = "ok", + long? ttlSeconds = null, + string? entryId = null) + { + if (embedding is null) throw new ArgumentNullException(nameof(embedding)); + if (embedding.Length != VectorDim) + { + throw new ArgumentException( + $"embedding length is {embedding.Length}; index expects {VectorDim}", + nameof(embedding)); + } + + string id = string.IsNullOrEmpty(entryId) + ? Guid.NewGuid().ToString("N").Substring(0, 12) + : entryId!; + string key = EntryKey(id); + long ttl = ttlSeconds ?? DefaultTtlSeconds; + byte[] vecBytes = LocalEmbedder.ToBytes(embedding); + + // Use HashEntry[] with explicit RedisValue so the embedding + // travels as raw bytes; mixing in a string-keyed dictionary + // would coerce the binary field through UTF-8 and corrupt + // the float bytes. + double createdTs = DateTimeOffset.UtcNow.ToUnixTimeMilliseconds() / 1000.0; + var entries = new HashEntry[] + { + new("prompt", prompt ?? ""), + new("response", response ?? ""), + new("tenant", tenant), + new("locale", locale), + new("model_version", modelVersion), + new("safety", safety), + new("created_ts", createdTs.ToString("F6", CultureInfo.InvariantCulture)), + new("hit_count", "0"), + new("embedding", vecBytes), + }; + + // MULTI/EXEC so HSET and EXPIRE either both apply or neither + // does. Without the transaction wrapper a connection drop + // between the two writes could leave the entry without a TTL + // and the cache would then keep an answer past its intended + // lifetime (or forever, on a database with no eviction + // policy). 
+ var tx = _db.CreateTransaction(); + _ = tx.HashSetAsync(key, entries); + _ = tx.KeyExpireAsync(key, TimeSpan.FromSeconds(ttl)); + tx.Execute(); + return id; + } + + // ------------------------------------------------------------------ + // Filter clause + // ------------------------------------------------------------------ + + internal static string EscapeTagValue(string value) + { + if (string.IsNullOrEmpty(value)) return ""; + var sb = new System.Text.StringBuilder(value.Length); + foreach (var ch in value) + { + if (TagSpecial.Contains(ch)) sb.Append('\\'); + sb.Append(ch); + } + return sb.ToString(); + } + + internal static string BuildFilterClause( + string? tenant, string? locale, string? modelVersion, string? safety) + { + var clauses = new List<string>(4); + if (!string.IsNullOrEmpty(tenant)) + clauses.Add($"@tenant:{{{EscapeTagValue(tenant!)}}}"); + if (!string.IsNullOrEmpty(locale)) + clauses.Add($"@locale:{{{EscapeTagValue(locale!)}}}"); + if (!string.IsNullOrEmpty(modelVersion)) + clauses.Add($"@model_version:{{{EscapeTagValue(modelVersion!)}}}"); + if (!string.IsNullOrEmpty(safety)) + clauses.Add($"@safety:{{{EscapeTagValue(safety!)}}}"); + return clauses.Count == 0 + ? "(*)" + : "(" + string.Join(" ", clauses) + ")"; + } + + // ------------------------------------------------------------------ + // Inspection / admin + // ------------------------------------------------------------------ + + /// Subset of FT.INFO useful for the demo UI. + public IndexSnapshot IndexInfo() + { + try + { + var info = _ft.Info(IndexName); + return new IndexSnapshot( + NumDocs: info.NumDocs, + IndexingFailures: info.HashIndexingFailures, + VectorIndexSizeMb: info.VectorIndexSzMebibytes); + } + catch (RedisServerException) + { + return new IndexSnapshot(0, 0, 0.0); + } + } + + /// + /// Return every cached entry (no embedding) for the admin UI.
+ /// + public List<EntrySnapshot> ListEntries(int limit = 100) + { + var query = new Query("*") + .ReturnFields( + "prompt", "response", "tenant", "locale", + "model_version", "safety", "created_ts", "hit_count") + .Limit(0, limit) + .SetSortBy("created_ts", ascending: false) + .Dialect(2); + + SearchResult result; + try + { + result = _ft.Search(IndexName, query); + } + catch (RedisServerException) + { + return new List<EntrySnapshot>(); + } + + var out_ = new List<EntrySnapshot>(); + foreach (var doc in result.Documents) + { + string rawKey = doc.Id ?? ""; + string entryId = rawKey.StartsWith(KeyPrefix) + ? rawKey.Substring(KeyPrefix.Length) + : rawKey; + TimeSpan? ttl = _db.KeyTimeToLive(EntryKey(entryId)); + long ttlSeconds = ttl is { TotalSeconds: > 0 } v ? (long)v.TotalSeconds : 0L; + var props = doc.GetProperties().ToDictionary(p => p.Key, p => p.Value); + out_.Add(new EntrySnapshot( + Id: entryId, + Prompt: props.GetValueOrDefault("prompt").ToString() ?? "", + Response: props.GetValueOrDefault("response").ToString() ?? "", + Tenant: props.GetValueOrDefault("tenant").ToString() ?? "", + Locale: props.GetValueOrDefault("locale").ToString() ?? "", + ModelVersion: props.GetValueOrDefault("model_version").ToString() ?? "", + Safety: props.GetValueOrDefault("safety").ToString() ?? "", + HitCount: (long)ParseDouble(props.GetValueOrDefault("hit_count"), 0), + TtlSeconds: ttlSeconds, + CreatedTs: ParseDouble(props.GetValueOrDefault("created_ts"), 0))); + } + return out_; + } + + /// Drop a single entry. Returns true if the key existed. + public bool DeleteEntry(string entryId) + { + return _db.KeyDelete(EntryKey(entryId)); + } + + /// + /// Drop the index and every cached entry, then re-create the + /// index. Returns the count of entries that were removed.
+ /// + public long Clear() + { + long before = IndexInfo().NumDocs; + DropIndex(deleteDocuments: true); + CreateIndex(); + return before; + } + + // ------------------------------------------------------------------ + // Helpers + // ------------------------------------------------------------------ + + private static double ParseDouble(RedisValue value, double fallback) + { + if (value.IsNullOrEmpty) return fallback; + if (value.TryParse(out double d)) return d; + return fallback; + } +} + +public sealed record IndexSnapshot(long NumDocs, long IndexingFailures, double VectorIndexSizeMb); + +public sealed record EntrySnapshot( + string Id, + string Prompt, + string Response, + string Tenant, + string Locale, + string ModelVersion, + string Safety, + long HitCount, + long TtlSeconds, + double CreatedTs); diff --git a/content/develop/use-cases/semantic-cache/dotnet/SeedCache.cs b/content/develop/use-cases/semantic-cache/dotnet/SeedCache.cs new file mode 100644 index 0000000000..01609768fb --- /dev/null +++ b/content/develop/use-cases/semantic-cache/dotnet/SeedCache.cs @@ -0,0 +1,90 @@ +namespace SemanticCacheDemo; + +/// +/// Pre-seed the semantic cache with a handful of FAQ answers. +/// +/// +/// In a real deployment the cache fills up organically as +/// users ask questions: a first-time question is a miss, the LLM +/// answers, and the response is written back. To make the demo +/// immediately useful, so the first query you type lands on a hit +/// instead of a cold miss, we seed a small set of canonical +/// prompts and their answers at startup. +/// +/// The seed list mirrors the keyword table in +/// <see cref="MockLLM"/> but stores the canonical phrasing +/// of each question. Paraphrases of any of these prompts +/// ("How do I return an item?", "Can I get a refund?") embed close +/// to the canonical entry and the cache lookup serves the stored +/// response without ever calling the model.
+/// +public static class SeedCache +{ + public sealed record SeedEntry(string Prompt, string Response); + + public static readonly IReadOnlyList<SeedEntry> SeedEntries = new[] + { + new SeedEntry( + "What is your return policy?", + "You can return any unworn item within 30 days of delivery for " + + "a full refund. Start a return from your order page; we email " + + "a prepaid label and refund the original payment method within " + + "five business days of receiving the item."), + new SeedEntry( + "How long does shipping take?", + "Standard shipping is free on orders over $50 and arrives in " + + "three to five business days. Expedited two-day shipping is " + + "$9.99 and is available at checkout for in-stock items."), + new SeedEntry( + "How do I find my size?", + "We follow standard US sizing. For most styles we recommend " + + "ordering your usual size; the product page includes a sizing " + + "chart and customer fit notes for items that run small or " + + "large."), + new SeedEntry( + "Is there a warranty on your products?", + "All gear is covered by a one-year manufacturer warranty " + + "against defects in materials or workmanship. Email support " + + "with your order number and a photo of the issue and we will " + + "replace the item or issue a refund."), + new SeedEntry( + "How can I contact customer support?", + "You can reach our support team by email at help@example.com " + + "or by live chat from the help centre, 9am to 9pm Eastern, " + + "seven days a week. Most tickets get a first reply within two " + + "hours."), + new SeedEntry( + "Where is my order?", + "Your tracking number is on the order confirmation email and " + + "on the order detail page once the package has been picked up " + + "by the carrier - typically within 24 hours of order " + + "placement."), + }; + + /// + /// Embed and write every seed entry. Returns the number of + /// entries that were written.
+ /// + public static int Seed( + RedisSemanticCache cache, + LocalEmbedder embedder, + string tenant = "acme", + string locale = "en", + string modelVersion = "gpt-4.5-2026") + { + var prompts = SeedEntries.Select(e => e.Prompt).ToList(); + var vectors = embedder.EncodeMany(prompts); + for (int i = 0; i < SeedEntries.Count; i++) + { + var entry = SeedEntries[i]; + cache.Put( + prompt: entry.Prompt, + response: entry.Response, + embedding: vectors[i], + tenant: tenant, + locale: locale, + modelVersion: modelVersion); + } + return SeedEntries.Count; + } +} diff --git a/content/develop/use-cases/semantic-cache/dotnet/SemanticCacheDemo.csproj b/content/develop/use-cases/semantic-cache/dotnet/SemanticCacheDemo.csproj new file mode 100644 index 0000000000..042778a079 --- /dev/null +++ b/content/develop/use-cases/semantic-cache/dotnet/SemanticCacheDemo.csproj @@ -0,0 +1,40 @@ + + + + Exe + net8.0 + SemanticCacheDemo + SemanticCacheDemo + enable + enable + latest + false + + + + + + + + + + + + + + + + + PreserveNewest + + + + diff --git a/content/develop/use-cases/semantic-cache/dotnet/_index.md b/content/develop/use-cases/semantic-cache/dotnet/_index.md new file mode 100644 index 0000000000..defd06e267 --- /dev/null +++ b/content/develop/use-cases/semantic-cache/dotnet/_index.md @@ -0,0 +1,271 @@ +--- +categories: +- docs +- develop +- stack +- oss +- rs +- rc +description: Build a Redis-backed semantic cache for LLM responses in C# with NRedisStack and ONNX Runtime +linkTitle: NRedisStack example (C# / .NET) +title: Redis semantic cache with NRedisStack +weight: 5 +--- + +This guide shows you how to build a small Redis-backed semantic cache for LLM responses in C# / .NET with [`NRedisStack`]({{< relref "/develop/clients/dotnet/nredisstack" >}}) and [ONNX Runtime](https://onnxruntime.ai/) running the [`sentence-transformers/all-MiniLM-L6-v2`](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) encoder locally. 
It includes a local web server built with the BCL's standard `System.Net.HttpListener` so you can send paraphrased prompts to a mock LLM, watch the cache decide hit or miss, sweep the cosine-distance threshold, and see the cumulative latency and token savings build up. + +## Overview + +Each cache entry is stored as a single Redis [Hash]({{< relref "/develop/data-types/hashes" >}}) at `cache:<id>`. The hash holds the original prompt, the LLM's response, the raw `float32` bytes of a 384-dimensional embedding of the prompt, and metadata fields (tenant, locale, model version, safety flag), plus a `created_ts` and a `hit_count`. A single [Redis Search]({{< relref "/develop/ai/search-and-query" >}}) index covers the embedding field and every metadata field, so one [`FT.SEARCH`]({{< relref "/commands/ft.search" >}}) call with a `KNN` clause does the vector lookup *and* the TAG pre-filter in the same round trip, with no cross-store joins. + +The lookup is thresholded: [`FT.SEARCH`]({{< relref "/commands/ft.search" >}}) always returns the nearest entry that satisfies the filters, but the application only serves it as a hit when the reported cosine distance is at or below `DistanceThreshold`. Anything further away is treated as a miss; the caller runs the LLM and writes the new prompt, response, and embedding back to the same key pattern with a TTL. + +The embedder uses [ONNX Runtime](https://onnxruntime.ai/) to load the ONNX export of [`sentence-transformers/all-MiniLM-L6-v2`](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) (mirrored at [`Xenova/all-MiniLM-L6-v2`](https://huggingface.co/Xenova/all-MiniLM-L6-v2) on Hugging Face), paired with the `BertTokenizer` from [`Microsoft.ML.Tokenizers`](https://www.nuget.org/packages/Microsoft.ML.Tokenizers).
This is the same 384-dimensional encoder the [Python example]({{< relref "/develop/use-cases/semantic-cache/redis-py" >}}), the [Node.js example]({{< relref "/develop/use-cases/semantic-cache/nodejs" >}}), and the [Java/Jedis example]({{< relref "/develop/use-cases/semantic-cache/java-jedis" >}}) use. Embeddings produced by the implementations are semantically equivalent (paraphrase distances differ only at the second or third decimal place), so a cache populated by one demo can be queried by another against the same Redis instance. + +That gives you: + +* A single round trip for lookup: vector KNN + metadata pre-filter in one [`FT.SEARCH`]({{< relref "/commands/ft.search" >}}). +* Tens of milliseconds on a hit vs. a multi-second LLM call on a miss; the embedding step is paid either way and dominates the hit path, and that's a model-side cost, not a Redis one. +* Tenant, locale, and model-version isolation enforced inside the query, not in application code: a write under one tenant cannot be served to another. +* Bounded memory: every entry has an [`EXPIRE`]({{< relref "/commands/expire" >}}) TTL, and a database-level [eviction policy]({{< relref "/develop/reference/eviction" >}}) (LRU / LFU) caps the cache size under pressure. + +## How it works + +A query goes through three stages: **embed**, **lookup**, and (on a miss) **call the LLM and write back**. + +### Hit path (the goal) + +1. The application calls `embedder.EncodeOne(prompt)` to turn the incoming text into a 384-element `float[]`. +2. `cache.Lookup(queryVec, tenant, locale, modelVersion, "ok", threshold)` runs [`FT.SEARCH`]({{< relref "/commands/ft.search" >}}) with a TAG pre-filter and a `KNN 1` clause. Redis returns the closest cached prompt that satisfies the filters along with its cosine distance. +3. If the distance is at or below the threshold, the cache returns a `CacheHit` record containing the cached response.
The helper also runs an [`HINCRBY`]({{< relref "/commands/hincrby" >}}) on `hit_count` and an [`EXPIRE`]({{< relref "/commands/expire" >}}) refresh inside a [`MULTI/EXEC`]({{< relref "/commands/multi" >}}) (built with `IDatabase.CreateTransaction()`), so a frequently used answer keeps its TTL and the demo UI can see which entries are load-bearing. +4. The LLM is not called at all. The application returns the cached response to the user. + +### Miss path + +When the distance is above the threshold — or there is no candidate in scope at all — the helper returns a `CacheMiss` record instead, carrying the distance of the nearest candidate (if any) for logging. The application then: + +1. Calls the LLM with the prompt. +2. Calls `cache.Put(prompt, response, embedding, tenant, locale, modelVersion, ...)`. The same embedding the lookup used is reused — no re-encode. The helper writes the Hash with [`HSET`]({{< relref "/commands/hset" >}}) and an [`EXPIRE`]({{< relref "/commands/expire" >}}) TTL inside a single [`MULTI/EXEC`]({{< relref "/commands/multi" >}}) so the entry never lands without a TTL on a partial failure. +3. Returns the LLM's response to the user. The next semantically similar prompt under the same metadata scope will be a hit. 
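
The hit / miss decision in both paths comes down to comparing one cosine distance against the threshold. The sketch below computes that distance client-side purely for illustration; in the demo, Redis computes it inside `FT.SEARCH`. `CosineDistance` and `Decide` are hypothetical helper names, not part of `RedisSemanticCache`, and the two-element vectors stand in for real 384-dimensional embeddings.

```csharp
using System;

// Hypothetical helper: cosine distance = 1 - cosine similarity.
// 0 means same direction, 1 means orthogonal, 2 means opposite.
// Assumes neither vector is all zeros (the result would be NaN).
static double CosineDistance(float[] a, float[] b)
{
    if (a.Length != b.Length)
        throw new ArgumentException("vectors must have the same length");
    double dot = 0, normA = 0, normB = 0;
    for (int i = 0; i < a.Length; i++)
    {
        dot += a[i] * b[i];
        normA += a[i] * a[i];
        normB += b[i] * b[i];
    }
    return 1.0 - dot / (Math.Sqrt(normA) * Math.Sqrt(normB));
}

// Hypothetical helper mirroring the policy described above:
// at or below the threshold is a hit, anything further is a miss.
static string Decide(double distance, double threshold) =>
    distance <= threshold ? "hit" : "miss";

float[] cached = { 1f, 0f };
float[] same = { 2f, 0f };        // same direction, different magnitude
float[] orthogonal = { 0f, 1f };
float[] opposite = { -1f, 0f };

Console.WriteLine(Decide(CosineDistance(cached, same), 0.5));        // hit  (distance 0)
Console.WriteLine(Decide(CosineDistance(cached, orthogonal), 0.5));  // miss (distance 1)
Console.WriteLine(Decide(CosineDistance(cached, opposite), 0.5));    // miss (distance 2)
```

Because cosine distance ignores magnitude, the `same` vector is a hit even though it is twice as long as the cached one.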
+ +## The cache helper + +The `RedisSemanticCache` class wraps the Redis Search index and the lookup / write flow +([source](https://github.com/redis/docs/blob/main/content/develop/use-cases/semantic-cache/dotnet/RedisSemanticCache.cs)): + +```csharp +using NRedisStack.RedisStackCommands; +using StackExchange.Redis; +using SemanticCacheDemo; + +var mux = ConnectionMultiplexer.Connect("localhost:6379"); +var db = mux.GetDatabase(); + +var embedder = await LocalEmbedder.CreateAsync(); // sentence-transformers/all-MiniLM-L6-v2 + +var cache = new RedisSemanticCache( + db, + indexName: "semcache:idx", + keyPrefix: "cache:", + vectorDim: 384, + distanceThreshold: 0.5, // cosine distance, lower = stricter + defaultTtlSeconds: 3600); + +// One-time index setup (idempotent). +cache.CreateIndex(); + +// 1) Embed the prompt. +string prompt = "How do I return an item?"; +float[] queryVec = embedder.EncodeOne(prompt); + +// 2) Look up under a metadata scope. The TAG filter and the KNN +// travel together in one FT.SEARCH. +var result = cache.Lookup( + queryVec, tenant: "acme", locale: "en", + modelVersion: "gpt-4.5-2026"); + +string response; +if (result is CacheHit hit) +{ + response = hit.Response; + Console.WriteLine($"hit ({hit.Distance:F3}): {response}"); +} +else +{ + // 3a) Miss — call the LLM. (Use your real client here.) + response = CallLlm(prompt); + + // 3b) Cache the new entry. Reuses the same embedding bytes the + // lookup used, so we don't pay the encoder twice. + cache.Put( + prompt: prompt, + response: response, + embedding: queryVec, + tenant: "acme", + locale: "en", + modelVersion: "gpt-4.5-2026"); +} +``` + +### Data model + +Each cache entry is one Redis Hash. The vector field is raw little-endian `float32` bytes — no JSON wrapping — because the Redis Search vector encoding expects exactly that. 
The helper packs the `float[]` with `BinaryPrimitives.WriteSingleLittleEndian` so the byte order is pinned to little-endian regardless of host architecture; every supported .NET runtime today is little-endian, and the explicit encoding matches the bytes the Python, Node, Go, and Java ports write. + +```text +cache:7c3f8a1b9e02 + prompt=How do I return an item? + response=You can return any unworn item within 30 days... + tenant=acme + locale=en + model_version=gpt-4.5-2026 + safety=ok + created_ts=1715990400.123 + hit_count=4 + embedding=<384 × float32 little-endian bytes> +``` + +The Redis Search index schema treats every field as queryable in its natural type: + +```text +FT.CREATE semcache:idx + ON HASH PREFIX 1 cache: + SCHEMA + prompt TEXT + response TEXT + tenant TAG + locale TAG + model_version TAG + safety TAG + created_ts NUMERIC SORTABLE + hit_count NUMERIC SORTABLE + embedding VECTOR HNSW 6 TYPE FLOAT32 DIM 384 DISTANCE_METRIC COSINE +``` + +### The query + +The lookup is a hybrid query: a TAG pre-filter expression in parentheses, then `=>[KNN 1 @embedding $vec]`. With `DIALECT 2`, Redis applies the filter first and KNN-ranks only the matching documents. In NRedisStack: + +```csharp +// `.` and `-` (and other punctuation) are TAG-value syntax in Redis +// Search and must be backslash-escaped — `gpt-4.5-2026` raw would +// be parsed as three tokens. The helper's EscapeTagValue does this +// for every TAG value; the literal below shows what the parser +// actually sees on the wire. 
+var query = new Query( + "(@tenant:{acme} @locale:{en} @model_version:{gpt\\-4\\.5\\-2026} @safety:{ok})" + + "=>[KNN 1 @embedding $vec AS distance]") + .ReturnFields( + "prompt", "response", "tenant", "locale", + "model_version", "hit_count", "distance") + .SetSortBy("distance", ascending: true) + .Limit(0, 1) + .AddParam("vec", LocalEmbedder.ToBytes(queryVec)) + .Dialect(2); + +var result = db.FT().Search("semcache:idx", query); +``` + +`distance` is the cosine *distance* (0 means identical, 2 means opposite). The result is sorted ascending, so the top row is the closest candidate. The application inspects `distance` against the threshold and decides hit or miss in user code — Redis returns the row either way, and treating it as a hit or a miss is a policy decision the cache helper owns, not a server-side filter. + +## The mock LLM + +To make the latency and token savings visible without requiring an API key, `MockLLM.cs` provides a deterministic stand-in +([source](https://github.com/redis/docs/blob/main/content/develop/use-cases/semantic-cache/dotnet/MockLLM.cs)): + +```csharp +using SemanticCacheDemo; + +var llm = new MockLLM(modelVersion: "gpt-4.5-2026", latencyMs: 1500.0); +var response = llm.Complete("What is your return policy?"); +// response.Text — the templated answer text +// response.LatencyMs — wall-clock time the call took +// response.TotalTokens — estimated prompt + completion tokens +``` + +The mock sleeps for the configured latency, then keyword-matches against a small FAQ table to produce an answer. The deliberate slowness is what makes a hit visibly cheaper than a miss in the demo. In production code, you would replace `MockLLM` with your real client of choice — an HTTP call to OpenAI, Anthropic, an Azure OpenAI deployment, a self-hosted vLLM endpoint, anything — without changing the cache helper. 
+ +## Pre-seeding the cache + +In a real deployment the cache fills up organically: a first-time question is a miss, the LLM answers, and the response is written back. For the demo, `SeedCache.cs` pre-loads a small set of canonical FAQ prompts so the very first query lands on a hit +([source](https://github.com/redis/docs/blob/main/content/develop/use-cases/semantic-cache/dotnet/SeedCache.cs)): + +```csharp +cache.CreateIndex(); +SeedCache.Seed(cache, embedder, tenant: "acme", locale: "en"); +``` + +The seed list stores the canonical phrasing of each question ("What is your return policy?"). Paraphrases of any of these prompts ("How do I return an item?", "Can I get a refund?") embed close to the canonical entry, so the cache lookup serves the stored response without ever calling the model. + +## The interactive demo + +`Program.cs` runs an HTTP server built on the BCL's `System.Net.HttpListener` — no ASP.NET Core, no Kestrel, no Minimal API. The HTML page lets you: + +* Type a prompt and toggle metadata: tenant, locale, model version. Each combination is a separate cache namespace inside the same index. +* Slide the cosine-distance threshold and see hits flip to misses (and back) on the same prompt, with the actual distance reported on each query. +* Submit with **Ask** to run the full hit-or-miss path (calls the LLM on a miss, writes the answer back). Submit with **Lookup only (no LLM)** to sweep the threshold against a fixed prompt without polluting the cache. +* Watch the cumulative panel build up: total queries, cache hits, cache misses, hit ratio, tokens not spent, LLM milliseconds not waited. +* Inspect every cached entry, including remaining TTL and total hit count, and drop individual entries to simulate eviction. + +The server holds one `LocalEmbedder`, one `RedisSemanticCache`, and one `MockLLM` for the lifetime of the process. 
The HTML page is shared with the Python, Node.js, Go, and Java demos; `index.html` ships next to the binary via a `` entry in the `.csproj`, so `dotnet bin/Release/net8.0/SemanticCacheDemo.dll` works from any working directory. Endpoints: + +| Endpoint | What it does | +|-----------------|-------------------------------------------------------------------------------| +| `GET /state` | Index info and the full list of cached entries. | +| `POST /query` | Embed the prompt, run `FT.SEARCH`, on miss call the LLM and write back. | +| `POST /reset` | Drop every cached entry and re-seed from the FAQ list. | +| `POST /drop` | Delete a single cached entry by id. | + +## Run the demo locally + +1. Clone the [`redis/docs`](https://github.com/redis/docs) repository and change into the example + directory: + + ```bash + git clone https://github.com/redis/docs.git + cd docs/content/develop/use-cases/semantic-cache/dotnet + ``` + +2. Make sure a Redis instance with the Redis Search module is running locally on + port 6379. [Redis Stack]({{< relref "/operate/oss_and_stack/install/install-stack" >}}) or + [Redis 8 with Search]({{< relref "/develop/ai/search-and-query" >}}) both work. + +3. Build the project. This pulls `NRedisStack`, `Microsoft.ML.OnnxRuntime`, and + `Microsoft.ML.Tokenizers` from NuGet. The first build takes a couple of minutes: + + ```bash + dotnet build -c Release + ``` + +4. Run the demo. The first run downloads the ONNX export of + `sentence-transformers/all-MiniLM-L6-v2` (~90 MB) and the matching BERT + `vocab.txt` into a local `model_cache/` directory next to the binary; every + subsequent run is offline: + + ```bash + dotnet run -c Release + ``` + + Or run the built binary directly: + + ```bash + dotnet bin/Release/net8.0/SemanticCacheDemo.dll + ``` + +5. Open and try some queries: + + * **"What is your return policy?"** — exact match against the seed, distance ≈ 0, + hit at any threshold. 
+ * **"How fast is delivery?"** — paraphrase of the shipping seed; distance + around 0.30, hit at the default threshold of 0.5. + * **"How do I return an item?"** — slightly looser paraphrase of the returns + seed; distance around 0.49, still a hit at the default threshold. Slide + the threshold down to 0.4 to see this one flip to a miss. + * **"What payment methods do you accept?"** — unrelated to anything in the + seed; distance > 0.6, so you'll see a miss, the mock LLM kicks in for + ~1.5 s, the new answer is cached, and a follow-up of the same question + is now an immediate hit. + * Switch the **Tenant** dropdown to `globex` or `initech` and re-ask any + seeded question — the result flips to a miss because the cache entries + live under `acme`. That's the metadata pre-filter at work inside `FT.SEARCH`. + +The server is read/write against your local Redis. The default index name is `semcache:idx` and entry keys live under `cache:`. Flags mirror the Python, Node, Go, and Java demos: `--no-reset` to keep an existing cache across restarts, `--threshold` to change the default cosine-distance cutoff, `--llm-latency-ms` to make the mock LLM faster or slower for the demo, or `--port` to listen on a different port. diff --git a/content/develop/use-cases/semantic-cache/dotnet/index.html b/content/develop/use-cases/semantic-cache/dotnet/index.html new file mode 100644 index 0000000000..e897cfdee7 --- /dev/null +++ b/content/develop/use-cases/semantic-cache/dotnet/index.html @@ -0,0 +1,513 @@ + + + + + + Redis Semantic Cache Demo + + + +
+ loading…
+ Redis Semantic Cache Demo
+ A small semantic cache sits in front of a mock LLM. Each cache entry is a Hash at __KEY_PREFIX__<id> holding the prompt, the response, the prompt's 384-dimensional embedding, and metadata fields. A single FT.SEARCH on __INDEX_NAME__ does the KNN against cached prompts with a TAG pre-filter (tenant, locale, model version, safety) in the same round trip. If the closest cached prompt is within the cosine-distance threshold, the demo serves the cached response and the LLM is not called at all.
+
+ Ask the LLM
+ Type a question, optionally adjust the metadata filters and the distance threshold, and submit. The server embeds the prompt, runs FT.SEARCH with KNN over the cache, and either serves the cached response (hit) or runs the mock LLM and writes the new response back to the cache (miss).
+ 0.50
+ The cache serves a hit when the closest cached prompt's cosine distance is at or below this threshold. Lower = stricter (fewer hits, safer reuse); higher = looser (more hits, more risk of serving a near-miss).
+
+ Cumulative savings
+ Every hit avoids one LLM round trip. The numbers below add up across the session — tokens that would have been spent and wall-clock seconds that would have been waited if the cache had not served the answer.
+ Total queries: 0
+ Cache hits: 0
+ Cache misses: 0
+ Hit ratio: 0%
+ Tokens saved: 0
+ LLM time saved: 0 ms
+
+ Index state
+
+ Cached entries
+ Every prompt/response pair currently in the cache. hit_count is the running total of times the entry has served a hit; ttl is the remaining lifetime in seconds before EXPIRE drops the key. Click Drop to simulate eviction.
+ ID | Prompt | Metadata | Hits | TTL
+ + + + diff --git a/content/develop/use-cases/semantic-cache/go/.gitignore b/content/develop/use-cases/semantic-cache/go/.gitignore new file mode 100644 index 0000000000..bd6d073685 --- /dev/null +++ b/content/develop/use-cases/semantic-cache/go/.gitignore @@ -0,0 +1,7 @@ +# Hugot downloads the ONNX model into ./models on first run; the +# weights are ~87 MB and we don't want them in the repo. +models/ + +# `go build` in this directory produces a `go` binary that we +# also don't want to commit. +/go diff --git a/content/develop/use-cases/semantic-cache/go/_index.md b/content/develop/use-cases/semantic-cache/go/_index.md new file mode 100644 index 0000000000..7ac6f0ddac --- /dev/null +++ b/content/develop/use-cases/semantic-cache/go/_index.md @@ -0,0 +1,274 @@ +--- +categories: +- docs +- develop +- stack +- oss +- rs +- rc +description: Build a Redis-backed semantic cache for LLM responses in Go with go-redis and Hugot +linkTitle: go-redis example (Go) +title: Redis semantic cache with go-redis +weight: 3 +--- + +This guide shows you how to build a small Redis-backed semantic cache for LLM responses in Go with [`go-redis`]({{< relref "/develop/clients/go" >}}) and the [Hugot](https://pkg.go.dev/github.com/knights-analytics/hugot) library. It includes a local web server built with Go's standard `net/http` package so you can send paraphrased prompts at a mock LLM, watch the cache decide hit or miss, sweep the cosine-distance threshold, and see the cumulative latency and token savings build up. + +## Overview + +Each cache entry is stored as a single Redis [Hash]({{< relref "/develop/data-types/hashes" >}}) at `cache:`. The hash holds the original prompt, the LLM's response, the raw `float32` bytes of a 384-dimensional embedding of the prompt, and metadata fields — tenant, locale, model version, safety flag — plus a `created_ts` and a `hit_count`. 
A single [Redis Search]({{< relref "/develop/ai/search-and-query" >}}) index covers the embedding field and every metadata field, so one [`FT.SEARCH`]({{< relref "/commands/ft.search" >}}) call with a `KNN` clause does the vector lookup *and* the TAG pre-filter in the same round trip — no cross-store joins. + +The lookup is thresholded: [`FT.SEARCH`]({{< relref "/commands/ft.search" >}}) always returns the nearest entry that satisfies the filters, but the application only serves it as a hit when the reported cosine distance is at or below `DistanceThreshold`. Anything further away is treated as a miss; the caller runs the LLM and writes the new prompt, response, and embedding back to the same key pattern with a TTL. + +The embedder is [Hugot](https://pkg.go.dev/github.com/knights-analytics/hugot) running the ONNX-exported [`sentence-transformers/all-MiniLM-L6-v2`](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) model, which is the same encoder the [Python example]({{< relref "/develop/use-cases/semantic-cache/redis-py" >}}) and the [Node.js example]({{< relref "/develop/use-cases/semantic-cache/nodejs" >}}) use. Embeddings produced by the three implementations are semantically equivalent — paraphrase distances differ only at the fourth decimal place — so a cache populated by one demo can be queried by the others against the same Redis instance. + +That gives you: + +* A single round trip for lookup — vector KNN + metadata pre-filter in one [`FT.SEARCH`]({{< relref "/commands/ft.search" >}}). +* Tens of milliseconds on a hit vs. a multi-second LLM call on a miss; the embedding step is the bottleneck either way, and that's a model-side cost, not a Redis one. +* Tenant, locale, and model-version isolation enforced inside the query, not in application code — a write under one tenant cannot be served to another. 
+* Bounded memory: every entry has an [`EXPIRE`]({{< relref "/commands/expire" >}}) TTL, and a database-level [eviction policy]({{< relref "/develop/reference/eviction" >}}) (LRU / LFU) caps the cache size under pressure. + +## How it works + +A query goes through three stages: **embed**, **lookup**, and (on a miss) **call the LLM and write back**. + +### Hit path (the goal) + +1. The application calls `embedder.EncodeOne(ctx, prompt)` to turn the incoming text into a 384-element `[]float32`. +2. `cache.Lookup(ctx, queryVec, LookupParams{Tenant: ..., Locale: ..., ModelVersion: ...})` runs [`FT.SEARCH`]({{< relref "/commands/ft.search" >}}) with a TAG pre-filter and a `KNN 1` clause. Redis returns the closest cached prompt that satisfies the filters along with its cosine distance. +3. If the distance is at or below the threshold, the cache returns a `LookupResult` with the `Hit` field populated. The helper also runs an [`HINCRBY`]({{< relref "/commands/hincrby" >}}) on `hit_count` and an [`EXPIRE`]({{< relref "/commands/expire" >}}) refresh inside a [`MULTI/EXEC`]({{< relref "/commands/multi" >}}), so a frequently used answer keeps its TTL and the demo UI can see which entries are load-bearing. +4. The LLM is not called at all. The application returns the cached response to the user. + +### Miss path + +When the distance is above the threshold — or there is no candidate in scope at all — the helper returns a `LookupResult` with the `Miss` field populated instead, carrying the distance of the nearest candidate (if any) for logging. The application then: + +1. Calls the LLM with the prompt. +2. Calls `cache.Put(ctx, PutParams{Prompt: ..., Response: ..., Embedding: queryVec, ...})`. The same embedding the lookup used is reused — no re-encode. 
The helper writes the Hash with [`HSET`]({{< relref "/commands/hset" >}}) and an [`EXPIRE`]({{< relref "/commands/expire" >}}) TTL inside a single [`MULTI/EXEC`]({{< relref "/commands/multi" >}}) so the entry never lands without a TTL on a partial failure. +3. Returns the LLM's response to the user. The next semantically similar prompt under the same metadata scope will be a hit. + +## The cache helper + +The `RedisSemanticCache` struct wraps the Redis Search index and the lookup / write flow +([source](https://github.com/redis/docs/blob/main/content/develop/use-cases/semantic-cache/go/cache.go)): + +```go +import ( + "context" + "fmt" + + "github.com/redis/go-redis/v9" +) + +ctx := context.Background() +client := redis.NewClient(&redis.Options{Addr: "localhost:6379"}) + +cache := NewRedisSemanticCache( + client, + "semcache:idx", + "cache:", + 384, // vector dimension + 0.5, // cosine distance threshold; lower = stricter + 3600, // default TTL in seconds (one hour) +) +embedder, _ := NewLocalEmbedder(ctx, "", "") // sentence-transformers/all-MiniLM-L6-v2 + +// One-time index setup (idempotent). +_ = cache.CreateIndex(ctx) + +// 1) Embed the prompt. +prompt := "How do I return an item?" +queryVec, _ := embedder.EncodeOne(ctx, prompt) + +// 2) Look up under a metadata scope. The TAG filter and the KNN +// travel together in one FT.SEARCH. +result, _ := cache.Lookup(ctx, queryVec, LookupParams{ + Tenant: "acme", + Locale: "en", + ModelVersion: "gpt-4.5-2026", +}) + +var response string +if result.Hit != nil { + response = result.Hit.Response + fmt.Printf("hit (%.3f): %s\n", result.Hit.Distance, response) +} else { + // 3a) Miss — call the LLM. (Use your real client here.) + response = callLLM(prompt) + + // 3b) Cache the new entry. Reuses the same embedding bytes the + // lookup used, so we don't pay the encoder twice. 
+ _, _ = cache.Put(ctx, PutParams{ + Prompt: prompt, + Response: response, + Embedding: queryVec, + Tenant: "acme", + Locale: "en", + ModelVersion: "gpt-4.5-2026", + }) +} +``` + +### Data model + +Each cache entry is one Redis Hash. The vector field is raw little-endian `float32` bytes — no JSON wrapping — because the Redis Search vector encoding expects exactly that. The helper uses `binary.LittleEndian` rather than `binary.NativeEndian` so the byte layout is independent of the host architecture (most Go targets are little-endian, but Go also ships big-endian ports such as `linux/s390x` and `linux/ppc64`, where a native-order encoding would produce different bytes). + +```text +cache:7c3f8a1b9e02 + prompt=How do I return an item? + response=You can return any unworn item within 30 days... + tenant=acme + locale=en + model_version=gpt-4.5-2026 + safety=ok + created_ts=1715990400.123 + hit_count=4 + embedding=<384 × float32 little-endian bytes> +``` + +The Redis Search index schema treats every field as queryable in its natural type: + +```text +FT.CREATE semcache:idx + ON HASH PREFIX 1 cache: + SCHEMA + prompt TEXT + response TEXT + tenant TAG + locale TAG + model_version TAG + safety TAG + created_ts NUMERIC SORTABLE + hit_count NUMERIC SORTABLE + embedding VECTOR HNSW 6 TYPE FLOAT32 DIM 384 DISTANCE_METRIC COSINE +``` + +### The query + +The lookup is a hybrid query: a TAG pre-filter expression in parentheses, then `=>[KNN 1 @embedding $vec]`. With `DialectVersion: 2`, Redis applies the filter first and KNN-ranks only the matching documents. 
In `go-redis`: + +```go +result, _ := client.FTSearchWithArgs(ctx, + "semcache:idx", + "(@tenant:{acme} @locale:{en} @model_version:{gpt\\-4\\.5\\-2026} @safety:{ok})"+ + "=>[KNN 1 @embedding $vec AS distance]", + &redis.FTSearchOptions{ + DialectVersion: 2, + Params: map[string]any{"vec": FloatsToBytes(queryVec)}, + SortBy: []redis.FTSearchSortBy{ + {FieldName: "distance", Asc: true}, + }, + Return: []redis.FTSearchReturn{ + {FieldName: "prompt"}, {FieldName: "response"}, + {FieldName: "tenant"}, {FieldName: "locale"}, + {FieldName: "model_version"}, + {FieldName: "hit_count"}, {FieldName: "distance"}, + }, + LimitOffset: 0, + Limit: 1, + }, +).Result() +``` + +`distance` is the cosine *distance* (0 means identical, 2 means opposite). The result is sorted ascending, so the top row is the closest candidate. The application inspects `distance` against the threshold and decides hit or miss in user code — Redis returns the row either way, and treating it as a hit or a miss is a policy decision the cache helper owns, not a server-side filter. + +## The mock LLM + +To make the latency and token savings visible without requiring an API key, `mockllm.go` provides a deterministic stand-in +([source](https://github.com/redis/docs/blob/main/content/develop/use-cases/semantic-cache/go/mockllm.go)): + +```go +llm := NewMockLLM("", 1500) // 1.5 seconds per call +resp := llm.Complete("What is your return policy?") +// resp.Response — the templated answer text +// resp.LatencyMs — wall-clock time the call took +// resp.TotalTokens() — estimated prompt + completion tokens +``` + +The mock sleeps for the configured latency, then keyword-matches against a small FAQ table to produce an answer. The deliberate slowness is what makes a hit visibly cheaper than a miss in the demo. In production code, you would replace `MockLLM` with your real client of choice — an OpenAI Go SDK, an Anthropic SDK, an internal vLLM endpoint, anything — without changing the cache helper. 
+ +## Pre-seeding the cache + +In a real deployment the cache fills up organically: a first-time question is a miss, the LLM answers, and the response is written back. For the demo, `seedcache.go` pre-loads a small set of canonical FAQ prompts so the very first query lands on a hit +([source](https://github.com/redis/docs/blob/main/content/develop/use-cases/semantic-cache/go/seedcache.go)): + +```go +_ = cache.CreateIndex(ctx) +_, _ = Seed(ctx, cache, embedder, SeedOptions{Tenant: "acme", Locale: "en"}) +``` + +The seed list stores the canonical phrasing of each question ("What is your return policy?"). Paraphrases of any of these prompts ("How do I return an item?", "Can I get a refund?") embed close to the canonical entry, so the cache lookup serves the stored response without ever calling the model. + +## The interactive demo + +`main.go` runs a Go `net/http` server. The HTML page lets you: + +* Type a prompt and toggle metadata: tenant, locale, model version. Each combination is a separate cache namespace inside the same index. +* Slide the cosine-distance threshold and see hits flip to misses (and back) on the same prompt, with the actual distance reported on each query. +* Submit with **Ask** to run the full hit-or-miss path (calls the LLM on a miss, writes the answer back). Submit with **Lookup only (no LLM)** to sweep the threshold against a fixed prompt without polluting the cache. +* Watch the cumulative panel build up: total queries, cache hits, cache misses, hit ratio, tokens not spent, LLM milliseconds not waited. +* Inspect every cached entry, including remaining TTL and total hit count, and drop individual entries to simulate eviction. + +The server holds one `LocalEmbedder`, one `RedisSemanticCache`, and one `MockLLM` for the lifetime of the process. The HTML page is shared with the Python and Node.js demos and is loaded from `index.html` next to `main.go`. 
Endpoints: + +| Endpoint | What it does | +|-----------------|-------------------------------------------------------------------------------| +| `GET /state` | Index info and the full list of cached entries. | +| `POST /query` | Embed the prompt, run `FT.SEARCH`, on miss call the LLM and write back. | +| `POST /reset` | Drop every cached entry and re-seed from the FAQ list. | +| `POST /drop` | Delete a single cached entry by id. | + +## Run the demo locally + +1. Clone the [`redis/docs`](https://github.com/redis/docs) repository and change into the example + directory: + + ```bash + git clone https://github.com/redis/docs.git + cd docs/content/develop/use-cases/semantic-cache/go + ``` + +2. Pull the Go modules: + + ```bash + go mod tidy + ``` + + This demo uses Hugot's pure-Go inference backend (`hugot.NewGoSession`), + so no native ONNX Runtime C library is required. If you'd prefer the + higher-throughput ONNX Runtime backend you can swap `NewGoSession` for + `NewORTSession` and follow the [ONNX Runtime install + instructions](https://onnxruntime.ai/docs/install/) — the rest of the + demo is unchanged. + +3. Make sure a Redis instance with the Redis Search module is running locally on + port 6379. [Redis Stack]({{< relref "/operate/oss_and_stack/install/install-stack" >}}) or + [Redis 8 with Search]({{< relref "/develop/ai/search-and-query" >}}) both work. + +4. Start the demo server. The first run downloads the ONNX-exported + `sentence-transformers/all-MiniLM-L6-v2` model into the local `./models` + directory: + + ```bash + go run . + ``` + +5. Open and try some queries: + + * **"What is your return policy?"** — exact match against the seed, distance ≈ 0, + hit at any threshold. + * **"How fast is delivery?"** — paraphrase of the shipping seed; distance + around 0.30, hit at the default threshold of 0.5. + * **"How do I return an item?"** — slightly looser paraphrase of the returns + seed; distance around 0.49, still a hit at the default threshold. 
Slide + the threshold down to 0.4 to see this one flip to a miss. + * **"What payment methods do you accept?"** — unrelated to anything in the + seed; distance > 0.6, so you'll see a miss, the mock LLM kicks in for + ~1.5 s, the new answer is cached, and a follow-up of the same question + is now an immediate hit. + * Switch the **Tenant** dropdown to `globex` or `initech` and re-ask any + seeded question — the result flips to a miss because the cache entries + live under `acme`. That's the metadata pre-filter at work inside `FT.SEARCH`. + +The server is read/write against your local Redis. The default index name is `semcache:idx` and entry keys live under `cache:`. Flags mirror the Python and Node.js demos: `--no-reset` to keep an existing cache across restarts, `--threshold` to change the default cosine-distance cutoff, or `--llm-latency-ms` to make the mock LLM faster or slower for the demo. diff --git a/content/develop/use-cases/semantic-cache/go/cache.go b/content/develop/use-cases/semantic-cache/go/cache.go new file mode 100644 index 0000000000..c8675f261f --- /dev/null +++ b/content/develop/use-cases/semantic-cache/go/cache.go @@ -0,0 +1,600 @@ +// Redis semantic-cache helper backed by Redis Search. +// +// Each cache entry lives as a Hash document at `cache:`. The hash +// stores the user's prompt and the corresponding LLM response +// alongside the raw float32 bytes of the prompt's 384-dimensional +// embedding and a small set of metadata fields — tenant, locale, +// model version, and a safety flag. +// +// A single Redis Search index covers the embedding plus every +// metadata field, so one `FT.SEARCH` call does an +// approximate-nearest-neighbour lookup against the cached prompts +// with a TAG pre-filter applied in the same pass — no cross-store +// joins, no extra round trips, and tenant isolation is enforced +// *inside* the query rather than after the fact in application code. 
+// +// The lookup is thresholded: `FT.SEARCH` always returns the closest +// cached prompt, but the cache only serves it as a hit when the +// cosine distance is at or below `DistanceThreshold`. Anything +// further away is treated as a miss; the caller is expected to run +// the underlying LLM and write the new prompt, response, and +// embedding back with `Put`. +// +// Each cache entry is written with `EXPIRE`, so stale answers age out +// without manual cleanup; combine with an `allkeys-lfu` eviction +// policy on the database to cap memory under pressure too. + +package main + +import ( + "context" + "crypto/rand" + "encoding/hex" + "fmt" + "sort" + "strconv" + "strings" + "time" + + "github.com/redis/go-redis/v9" +) + +const vectorDimDefault = 384 + +// CacheHit is what `Lookup` returns when the nearest cached prompt +// passes the threshold. `Distance` is the cosine distance reported +// by FT.SEARCH (0 = identical, 2 = opposite); it is always at or +// below the threshold the lookup ran with. +type CacheHit struct { + ID string + Prompt string + Response string + Tenant string + Locale string + ModelVersion string + Distance float64 + TTLSeconds int + HitCount int +} + +// CacheMiss is what `Lookup` returns when the nearest candidate is +// too far away, or when no candidate satisfied the metadata filter +// at all. `NearestDistance` is non-nil only in the "candidate too +// far" case — the demo UI uses that distinction to display either +// "no candidate" or "candidate too far". +type CacheMiss struct { + NearestDistance *float64 + NearestID string +} + +// LookupResult is a tagged union: exactly one of Hit and Miss is set. +type LookupResult struct { + Hit *CacheHit + Miss *CacheMiss +} + +// RedisSemanticCache wraps the Redis Search index and lookup / write +// flow for a thresholded semantic cache. 
+type RedisSemanticCache struct { + Client *redis.Client + IndexName string + KeyPrefix string + VectorDim int + DistanceThreshold float64 + DefaultTTLSeconds int +} + +// NewRedisSemanticCache returns a cache helper with the supplied +// client. Pass zero values for any field to use the defaults +// (semcache:idx / cache: / 384 / 0.5 / 3600). +func NewRedisSemanticCache( + client *redis.Client, + indexName, keyPrefix string, + vectorDim int, + distanceThreshold float64, + defaultTTLSeconds int, +) *RedisSemanticCache { + if indexName == "" { + indexName = "semcache:idx" + } + if keyPrefix == "" { + keyPrefix = "cache:" + } + if vectorDim <= 0 { + vectorDim = vectorDimDefault + } + // distanceThreshold is honoured as-is. Zero is a legitimate value + // ("exact matches only") and negative numbers are clamped by the + // HTTP boundary anyway. Silently rewriting `0` to a default would + // make `--threshold 0` uncallable — see audit-checklist row 28. + if defaultTTLSeconds <= 0 { + defaultTTLSeconds = 3600 + } + return &RedisSemanticCache{ + Client: client, + IndexName: indexName, + KeyPrefix: keyPrefix, + VectorDim: vectorDim, + DistanceThreshold: distanceThreshold, + DefaultTTLSeconds: defaultTTLSeconds, + } +} + +// EntryKey returns the Redis key for an entry id. +func (c *RedisSemanticCache) EntryKey(entryID string) string { + return c.KeyPrefix + entryID +} + +// CreateIndex creates the Redis Search index if it doesn't already +// exist. One index covers the embedding plus every metadata field, +// so a single FT.SEARCH can pre-filter by tenant / locale / model +// and then KNN-rank the matching documents in one pass. The `prompt` +// and `response` fields are stored as TEXT so admin tooling can grep +// the cache by content, but the cache lookup itself is vector-only. 
+func (c *RedisSemanticCache) CreateIndex(ctx context.Context) error { + _, err := c.Client.FTCreate(ctx, + c.IndexName, + &redis.FTCreateOptions{ + OnHash: true, + Prefix: []any{c.KeyPrefix}, + }, + &redis.FieldSchema{FieldName: "prompt", FieldType: redis.SearchFieldTypeText}, + &redis.FieldSchema{FieldName: "response", FieldType: redis.SearchFieldTypeText}, + &redis.FieldSchema{FieldName: "tenant", FieldType: redis.SearchFieldTypeTag}, + &redis.FieldSchema{FieldName: "locale", FieldType: redis.SearchFieldTypeTag}, + &redis.FieldSchema{FieldName: "model_version", FieldType: redis.SearchFieldTypeTag}, + &redis.FieldSchema{FieldName: "safety", FieldType: redis.SearchFieldTypeTag}, + &redis.FieldSchema{FieldName: "created_ts", FieldType: redis.SearchFieldTypeNumeric, Sortable: true}, + &redis.FieldSchema{FieldName: "hit_count", FieldType: redis.SearchFieldTypeNumeric, Sortable: true}, + &redis.FieldSchema{ + FieldName: "embedding", + FieldType: redis.SearchFieldTypeVector, + VectorArgs: &redis.FTVectorArgs{ + HNSWOptions: &redis.FTHNSWOptions{ + Type: "FLOAT32", + Dim: c.VectorDim, + DistanceMetric: "COSINE", + }, + }, + }, + ).Result() + if err != nil && !strings.Contains(err.Error(), "Index already exists") { + return err + } + return nil +} + +// DropIndex drops the Redis Search index. If `deleteDocuments` is +// true the cached entry hashes are deleted alongside the index. +func (c *RedisSemanticCache) DropIndex(ctx context.Context, deleteDocuments bool) error { + _, err := c.Client.FTDropIndexWithArgs(ctx, + c.IndexName, + &redis.FTDropIndexOptions{DeleteDocs: deleteDocuments}, + ).Result() + if err == nil { + return nil + } + msg := strings.ToLower(err.Error()) + if strings.Contains(msg, "no such index") || strings.Contains(msg, "unknown index name") { + return nil + } + return err +} + +// LookupParams collects the optional metadata filters and threshold +// override for a lookup. 
Using a struct keeps the call site readable +// when only a couple of fields are set. +type LookupParams struct { + Tenant string + Locale string + ModelVersion string + Safety string // empty string => "ok"; pass "-" to disable + DistanceThreshold *float64 +} + +// Lookup runs the thresholded FT.SEARCH and decides hit vs. miss. +// +// FT.SEARCH returns the single nearest entry that satisfies the TAG +// pre-filters. The lookup is a hit only if the reported cosine +// distance is at or below the threshold (instance default or +// override). Anything further away is a miss with the candidate +// distance attached so the caller can log it. +// +// On a hit, the entry's `hit_count` is incremented atomically with +// HINCRBY so the demo UI can show which entries are load-bearing. +// The TTL is refreshed on every hit so frequently used answers don't +// age out under cold tail entries. +func (c *RedisSemanticCache) Lookup( + ctx context.Context, + queryVec []float32, + params LookupParams, +) (LookupResult, error) { + // Match the shape check that `Put` performs. A wrong-dim vector + // would otherwise hit Redis as a malformed FT.SEARCH parameter + // and surface as a server-side parse error instead of a clear + // caller-side error. 
+ if len(queryVec) != c.VectorDim { + return LookupResult{}, fmt.Errorf( + "queryVec length is %d; index expects %d", + len(queryVec), c.VectorDim, + ) + } + + threshold := c.DistanceThreshold + if params.DistanceThreshold != nil { + threshold = *params.DistanceThreshold + } + + safety := params.Safety + if safety == "" { + safety = "ok" + } else if safety == "-" { + safety = "" + } + + filterClause := buildFilterClause(params.Tenant, params.Locale, params.ModelVersion, safety) + queryStr := filterClause + "=>[KNN 1 @embedding $vec AS distance]" + + res, err := c.Client.FTSearchWithArgs(ctx, + c.IndexName, + queryStr, + &redis.FTSearchOptions{ + DialectVersion: 2, + Params: map[string]any{"vec": FloatsToBytes(queryVec)}, + SortBy: []redis.FTSearchSortBy{ + {FieldName: "distance", Asc: true}, + }, + Return: []redis.FTSearchReturn{ + {FieldName: "prompt"}, + {FieldName: "response"}, + {FieldName: "tenant"}, + {FieldName: "locale"}, + {FieldName: "model_version"}, + {FieldName: "hit_count"}, + {FieldName: "distance"}, + }, + LimitOffset: 0, + Limit: 1, + }, + ).Result() + if err != nil { + return LookupResult{}, fmt.Errorf("FT.SEARCH: %w", err) + } + + if len(res.Docs) == 0 { + return LookupResult{Miss: &CacheMiss{NearestDistance: nil, NearestID: ""}}, nil + } + + doc := res.Docs[0] + rawKey := doc.ID + entryID := strings.TrimPrefix(rawKey, c.KeyPrefix) + + distance, _ := strconv.ParseFloat(doc.Fields["distance"], 64) + + if distance > threshold { + d := distance + return LookupResult{Miss: &CacheMiss{NearestDistance: &d, NearestID: entryID}}, nil + } + + // The hash may have expired between FT.SEARCH returning the row + // and us getting here — the search index lags expirations by its + // periodic scan. If we just blindly HINCRBY-ed, Redis would + // helpfully recreate the hash with only `hit_count` set and the + // search index would then log it as an indexing failure (no + // embedding, no metadata). 
EXISTS narrows that race to the + // pipeline round-trip; a strictly race-free version would wrap + // the bump in a Lua script that checks existence and acts in one + // server-side step. + entryKey := c.EntryKey(entryID) + exists, err := c.Client.Exists(ctx, entryKey).Result() + if err != nil { + return LookupResult{}, fmt.Errorf("EXISTS: %w", err) + } + if exists == 0 { + d := distance + return LookupResult{Miss: &CacheMiss{NearestDistance: &d, NearestID: entryID}}, nil + } + + // MULTI/EXEC the three commands (HINCRBY, EXPIRE, TTL) so they + // apply as a unit on the server; a partial failure between HINCRBY + // and EXPIRE would otherwise leave the entry without a refreshed TTL. + var hincrCmd *redis.IntCmd + var ttlCmd *redis.DurationCmd + if _, err := c.Client.TxPipelined(ctx, func(pipe redis.Pipeliner) error { + hincrCmd = pipe.HIncrBy(ctx, entryKey, "hit_count", 1) + pipe.Expire(ctx, entryKey, time.Duration(c.DefaultTTLSeconds)*time.Second) + ttlCmd = pipe.TTL(ctx, entryKey) + return nil + }); err != nil { + return LookupResult{}, fmt.Errorf("hit-count bump MULTI/EXEC: %w", err) + } + + newHitCount := hincrCmd.Val() + ttlSeconds := int(ttlCmd.Val() / time.Second) + if ttlSeconds <= 0 { + ttlSeconds = c.DefaultTTLSeconds + } + + return LookupResult{Hit: &CacheHit{ + ID: entryID, + Prompt: doc.Fields["prompt"], + Response: doc.Fields["response"], + Tenant: doc.Fields["tenant"], + Locale: doc.Fields["locale"], + ModelVersion: doc.Fields["model_version"], + Distance: distance, + TTLSeconds: ttlSeconds, + HitCount: int(newHitCount), + }}, nil +} + +// PutParams collects the fields of a new cache entry. 
+type PutParams struct { + Prompt string + Response string + Embedding []float32 + Tenant string // default "default" + Locale string // default "en" + ModelVersion string // default "gpt-4.5-2026" + Safety string // default "ok" + TTLSeconds int // 0 => use DefaultTTLSeconds + EntryID string // empty => generate a random 12-hex id +} + +// Put writes a new cache entry and returns its id. +// +// The embedding is stored as raw little-endian float32 bytes — the +// encoding Redis Search expects from a FLOAT32 vector field. EXPIRE +// on the key gives every entry a bounded lifetime; combine with an +// `allkeys-lfu` eviction policy on the database to cap memory under +// pressure too. +func (c *RedisSemanticCache) Put(ctx context.Context, p PutParams) (string, error) { + if len(p.Embedding) != c.VectorDim { + return "", fmt.Errorf( + "embedding length is %d; index expects %d", + len(p.Embedding), c.VectorDim, + ) + } + + entryID := p.EntryID + if entryID == "" { + var err error + entryID, err = newEntryID() + if err != nil { + return "", fmt.Errorf("generating entry id: %w", err) + } + } + tenant := p.Tenant + if tenant == "" { + tenant = "default" + } + locale := p.Locale + if locale == "" { + locale = "en" + } + modelVersion := p.ModelVersion + if modelVersion == "" { + modelVersion = "gpt-4.5-2026" + } + safety := p.Safety + if safety == "" { + safety = "ok" + } + ttl := p.TTLSeconds + if ttl <= 0 { + ttl = c.DefaultTTLSeconds + } + + key := c.EntryKey(entryID) + mapping := map[string]any{ + "prompt": p.Prompt, + "response": p.Response, + "tenant": tenant, + "locale": locale, + "model_version": modelVersion, + "safety": safety, + "created_ts": strconv.FormatFloat(float64(time.Now().UnixNano())/1e9, 'f', -1, 64), + "hit_count": "0", + "embedding": FloatsToBytes(p.Embedding), + } + + // MULTI/EXEC so HSET and EXPIRE either both apply or neither + // does. 
Without the transaction wrapper a connection drop + // between the two writes could leave the entry without a TTL + // and the cache would then keep an answer past its intended + // lifetime (or forever, on a database with no eviction policy). + if _, err := c.Client.TxPipelined(ctx, func(pipe redis.Pipeliner) error { + pipe.HSet(ctx, key, mapping) + pipe.Expire(ctx, key, time.Duration(ttl)*time.Second) + return nil + }); err != nil { + return "", fmt.Errorf("put MULTI/EXEC: %w", err) + } + return entryID, nil +} + +// ----- Filter clause helpers ------------------------------------------ + +// tagSpecial is the set of characters Redis Search treats as syntax +// inside a TAG value; any of them in a user-supplied filter must be +// backslash-escaped or the surrounding `{...}` block won't parse +// correctly. +var tagSpecial = map[rune]struct{}{ + '\\': {}, ',': {}, '.': {}, '<': {}, '>': {}, '{': {}, '}': {}, + '[': {}, ']': {}, '"': {}, '\'': {}, ':': {}, ';': {}, '!': {}, + '@': {}, '#': {}, '$': {}, '%': {}, '^': {}, '&': {}, '*': {}, + '(': {}, ')': {}, '-': {}, '+': {}, '=': {}, '~': {}, '|': {}, + ' ': {}, +} + +func escapeTagValue(v string) string { + var b strings.Builder + b.Grow(len(v)) + for _, r := range v { + if _, ok := tagSpecial[r]; ok { + b.WriteByte('\\') + } + b.WriteRune(r) + } + return b.String() +} + +func buildFilterClause(tenant, locale, modelVersion, safety string) string { + var clauses []string + if tenant != "" { + clauses = append(clauses, "@tenant:{"+escapeTagValue(tenant)+"}") + } + if locale != "" { + clauses = append(clauses, "@locale:{"+escapeTagValue(locale)+"}") + } + if modelVersion != "" { + clauses = append(clauses, "@model_version:{"+escapeTagValue(modelVersion)+"}") + } + if safety != "" { + clauses = append(clauses, "@safety:{"+escapeTagValue(safety)+"}") + } + if len(clauses) == 0 { + return "(*)" + } + return "(" + strings.Join(clauses, " ") + ")" +} + +// ----- Inspection / admin 
--------------------------------------------- + +// IndexInfo is a subset of FT.INFO useful for the demo UI. +type IndexInfo struct { + NumDocs int `json:"num_docs"` + IndexingFailures int `json:"indexing_failures"` + VectorIndexSizeMB float64 `json:"vector_index_size_mb"` +} + +// FTInfo returns a small subset of FT.INFO. Failures (for example, +// an index that hasn't been created yet) return zeroed counters +// rather than surface as an error to the caller, since the demo UI +// just renders "0 entries" in that case. +func (c *RedisSemanticCache) FTInfo(ctx context.Context) IndexInfo { + info, err := c.Client.FTInfo(ctx, c.IndexName).Result() + if err != nil { + return IndexInfo{} + } + return IndexInfo{ + NumDocs: info.NumDocs, + IndexingFailures: info.HashIndexingFailures, + VectorIndexSizeMB: info.VectorIndexSzMB, + } +} + +// Entry is the public shape of a cached entry as the demo UI sees it. +// The embedding bytes are intentionally not included; the UI doesn't +// render them and shipping them over JSON wastes bandwidth. +type Entry struct { + ID string `json:"id"` + Prompt string `json:"prompt"` + Response string `json:"response"` + Tenant string `json:"tenant"` + Locale string `json:"locale"` + ModelVersion string `json:"model_version"` + Safety string `json:"safety"` + HitCount int `json:"hit_count"` + TTLSeconds int `json:"ttl_seconds"` + CreatedTS float64 `json:"created_ts"` +} + +// ListEntries returns every cached entry (no embedding) for the +// admin panel. The result is sorted by created_ts descending so the +// most recently written entry is at the top of the table. 
+func (c *RedisSemanticCache) ListEntries(ctx context.Context, limit int) ([]Entry, error) { + if limit <= 0 { + limit = 100 + } + res, err := c.Client.FTSearchWithArgs(ctx, + c.IndexName, + "*", + &redis.FTSearchOptions{ + DialectVersion: 2, + Return: []redis.FTSearchReturn{ + {FieldName: "prompt"}, + {FieldName: "response"}, + {FieldName: "tenant"}, + {FieldName: "locale"}, + {FieldName: "model_version"}, + {FieldName: "safety"}, + {FieldName: "created_ts"}, + {FieldName: "hit_count"}, + }, + SortBy: []redis.FTSearchSortBy{ + {FieldName: "created_ts", Desc: true}, + }, + LimitOffset: 0, + Limit: limit, + }, + ).Result() + if err != nil { + return nil, fmt.Errorf("FT.SEARCH '*': %w", err) + } + + out := make([]Entry, 0, len(res.Docs)) + for _, doc := range res.Docs { + rawKey := doc.ID + entryID := strings.TrimPrefix(rawKey, c.KeyPrefix) + ttl, _ := c.Client.TTL(ctx, c.EntryKey(entryID)).Result() + ttlSeconds := int(ttl / time.Second) + if ttlSeconds < 0 { + ttlSeconds = 0 + } + hitCount, _ := strconv.Atoi(doc.Fields["hit_count"]) + createdTS, _ := strconv.ParseFloat(doc.Fields["created_ts"], 64) + out = append(out, Entry{ + ID: entryID, + Prompt: doc.Fields["prompt"], + Response: doc.Fields["response"], + Tenant: doc.Fields["tenant"], + Locale: doc.Fields["locale"], + ModelVersion: doc.Fields["model_version"], + Safety: doc.Fields["safety"], + HitCount: hitCount, + TTLSeconds: ttlSeconds, + CreatedTS: createdTS, + }) + } + // Belt-and-braces sort in case Redis returns an unsorted top-N. + sort.SliceStable(out, func(i, j int) bool { + return out[i].CreatedTS > out[j].CreatedTS + }) + return out, nil +} + +// DeleteEntry drops a single entry. Returns true if the key existed. 
+func (c *RedisSemanticCache) DeleteEntry(ctx context.Context, entryID string) (bool, error) { + n, err := c.Client.Del(ctx, c.EntryKey(entryID)).Result() + if err != nil { + return false, fmt.Errorf("DEL: %w", err) + } + return n > 0, nil +} + +// Clear drops the index and every cached entry, then recreates the +// index. Returns the FT.INFO document count taken just before the +// drop (zero if FT.INFO fails). Used by the demo's "reset" button; +// in production the equivalent is just FLUSHDB on a dedicated +// cache database, or letting TTLs expire naturally. +func (c *RedisSemanticCache) Clear(ctx context.Context) (int, error) { + before := c.FTInfo(ctx).NumDocs + if err := c.DropIndex(ctx, true); err != nil { + return 0, err + } + if err := c.CreateIndex(ctx); err != nil { + return 0, err + } + return before, nil +} + +// newEntryID returns a random 12-hex-character id, matching the +// shape the Python and Node helpers produce. +func newEntryID() (string, error) { + var b [6]byte + if _, err := rand.Read(b[:]); err != nil { + return "", err + } + return hex.EncodeToString(b[:]), nil +} diff --git a/content/develop/use-cases/semantic-cache/go/embeddings.go b/content/develop/use-cases/semantic-cache/go/embeddings.go new file mode 100644 index 0000000000..cc76176dd5 --- /dev/null +++ b/content/develop/use-cases/semantic-cache/go/embeddings.go @@ -0,0 +1,189 @@ +// Local text-embedding helper backed by Hugot. +// +// This is a thin wrapper around the sentence-transformers model +// `sentence-transformers/all-MiniLM-L6-v2`: a 384-dimensional encoder +// that runs in-process on CPU through Hugot's pure-Go inference +// backend (`hugot.NewGoSession`), needs no API key, and produces +// vectors numerically equivalent to those of the corresponding +// PyTorch model from sentence-transformers. 
+// +// Vectors are explicitly L2-normalised after extraction so cosine +// distance against another normalised vector reduces to `1 - dot +// product` — matching the behaviour of `sentence-transformers`' +// `normalize_embeddings=True` flag in the Python example and +// `@xenova/transformers`' `normalize: true` option in the Node.js +// example. The model is downloaded into the local `./models` cache +// on the first call; every later call runs offline. + +package main + +import ( + "context" + "encoding/binary" + "fmt" + "math" + "os" + "path/filepath" + + "github.com/knights-analytics/hugot" + "github.com/knights-analytics/hugot/pipelines" +) + +const defaultEmbedModel = "sentence-transformers/all-MiniLM-L6-v2" + +// LocalEmbedder wraps a Hugot feature-extraction pipeline. +// +// Use `NewLocalEmbedder` instead of constructing the struct directly +// because the pipeline load is asynchronous in spirit (it downloads +// the model on first call) and we want one place that owns the wait +// and the dimension probe. +type LocalEmbedder struct { + ModelName string + Dim int + session *hugot.Session + pipeline *pipelines.FeatureExtractionPipeline +} + +// NewLocalEmbedder loads the ONNX model (downloading on first run) +// and returns a ready-to-use embedder. The dimension is probed once +// from a synthetic input so we can fail loudly if a different model +// is wired up against the 384-dim Redis Search field. 
+func NewLocalEmbedder(ctx context.Context, modelName, modelsDir string) (*LocalEmbedder, error) { + if modelName == "" { + modelName = defaultEmbedModel + } + if modelsDir == "" { + modelsDir = "./models" + } + if err := os.MkdirAll(modelsDir, 0o755); err != nil { + return nil, fmt.Errorf("creating models dir %q: %w", modelsDir, err) + } + + session, err := hugot.NewGoSession(ctx) + if err != nil { + return nil, fmt.Errorf("starting Hugot session: %w", err) + } + + downloadOpts := hugot.NewDownloadOptions() + downloadOpts.OnnxFilePath = "onnx/model.onnx" + modelPath, err := hugot.DownloadModel(ctx, modelName, modelsDir, downloadOpts) + if err != nil { + _ = session.Destroy() + return nil, fmt.Errorf("downloading model %q: %w", modelName, err) + } + + cfg := hugot.FeatureExtractionConfig{ + ModelPath: modelPath, + Name: filepath.Base(modelPath), + } + pipe, err := hugot.NewPipeline(session, cfg) + if err != nil { + _ = session.Destroy() + return nil, fmt.Errorf("creating feature-extraction pipeline: %w", err) + } + + // Probe the output shape once so we can fail loudly if a different + // model is wired up against the 384-dim Redis Search field. + probe, err := pipe.RunPipeline(ctx, []string{"dimension probe"}) + if err != nil { + _ = session.Destroy() + return nil, fmt.Errorf("probing embedding pipeline: %w", err) + } + if len(probe.Embeddings) == 0 || len(probe.Embeddings[0]) == 0 { + _ = session.Destroy() + return nil, fmt.Errorf("embedding probe returned empty result") + } + + return &LocalEmbedder{ + ModelName: modelName, + Dim: len(probe.Embeddings[0]), + session: session, + pipeline: pipe, + }, nil +} + +// Close tears down the underlying Hugot session. Safe to call more +// than once; subsequent calls are no-ops. +func (e *LocalEmbedder) Close() error { + if e == nil || e.session == nil { + return nil + } + err := e.session.Destroy() + e.session = nil + return err +} + +// EncodeOne returns a 384-element float32 vector for the input string. 
+// The vector is L2-normalised so cosine distance against another +// normalised vector reduces to 1 - dot product. +func (e *LocalEmbedder) EncodeOne(ctx context.Context, text string) ([]float32, error) { + out, err := e.pipeline.RunPipeline(ctx, []string{text}) + if err != nil { + return nil, fmt.Errorf("encoding text: %w", err) + } + if len(out.Embeddings) == 0 { + return nil, fmt.Errorf("pipeline returned no embeddings") + } + normalizeInPlace(out.Embeddings[0]) + return out.Embeddings[0], nil +} + +// EncodeMany batches several strings in a single pipeline call so the +// model only pays the setup cost once. Returns one float32 slice per +// input, in the same order as the input. +func (e *LocalEmbedder) EncodeMany(ctx context.Context, texts []string) ([][]float32, error) { + if len(texts) == 0 { + return nil, nil + } + out, err := e.pipeline.RunPipeline(ctx, texts) + if err != nil { + return nil, fmt.Errorf("encoding texts: %w", err) + } + // Hugot should return one vector per input on success, and + // callers (seed loaders, batch ingest) rely on that contract; + // checking it explicitly here avoids an index-out-of-range + // panic later if the backend ever returns a short batch. + if len(out.Embeddings) != len(texts) { + return nil, fmt.Errorf( + "pipeline returned %d vectors for %d inputs", + len(out.Embeddings), len(texts), + ) + } + for i := range out.Embeddings { + normalizeInPlace(out.Embeddings[i]) + } + return out.Embeddings, nil +} + +// normalizeInPlace L2-normalises a vector so it has unit length. +// A zero vector is left untouched (its cosine distance to anything +// is undefined, but at least Redis won't reject the bytes). 
+func normalizeInPlace(v []float32) { + var sumSq float64 + for _, x := range v { + sumSq += float64(x) * float64(x) + } + if sumSq == 0 { + return + } + inv := float32(1.0 / math.Sqrt(sumSq)) + for i := range v { + v[i] *= inv + } +} + +// FloatsToBytes packs a []float32 into the raw little-endian byte +// sequence Redis Search expects for a FLOAT32 vector field. The +// `binary.LittleEndian` here matters: Redis Search reads the bytes +// in little-endian order regardless of the host architecture, so we +// can't use `binary.NativeEndian` if the docs example ever needs to +// run on a big-endian box. The common Go targets (amd64, arm64) are +// little-endian, but Go also ships big-endian ports such as s390x, +// where native order would silently corrupt every stored vector. +func FloatsToBytes(fs []float32) []byte { + buf := make([]byte, len(fs)*4) + for i, f := range fs { + binary.LittleEndian.PutUint32(buf[i*4:], math.Float32bits(f)) + } + return buf +} diff --git a/content/develop/use-cases/semantic-cache/go/go.mod b/content/develop/use-cases/semantic-cache/go/go.mod new file mode 100644 index 0000000000..9607081bea --- /dev/null +++ b/content/develop/use-cases/semantic-cache/go/go.mod @@ -0,0 +1,38 @@ +module github.com/redis/docs/content/develop/use-cases/semantic-cache/go + +go 1.26.0 + +require ( + github.com/knights-analytics/hugot v0.7.3 + github.com/redis/go-redis/v9 v9.19.0 +) + +require ( + github.com/cespare/xxhash/v2 v2.3.0 // indirect + github.com/daulet/tokenizers v1.27.0 // indirect + github.com/dustin/go-humanize v1.0.1 // indirect + github.com/go-errors/errors v1.5.1 // indirect + github.com/go-logr/logr v1.4.3 // indirect + github.com/gofrs/flock v0.13.0 // indirect + github.com/gomlx/exceptions v0.0.3 // indirect + github.com/gomlx/go-huggingface v0.3.5 // indirect + github.com/gomlx/go-xla v0.2.2 // indirect + github.com/gomlx/gomlx v0.27.3 // indirect + github.com/gomlx/onnx-gomlx v0.4.2 // indirect + github.com/google/uuid v1.6.0 // indirect + 
github.com/knights-analytics/ortgenai v0.3.1 // indirect + github.com/pkg/errors v0.9.1 // indirect + github.com/viant/afs v1.30.0 // indirect + github.com/x448/float16 v0.8.4 // indirect + github.com/yalue/onnxruntime_go v1.30.1 // indirect + go.uber.org/atomic v1.11.0 // indirect + golang.org/x/crypto v0.51.0 // indirect + golang.org/x/exp v0.0.0-20260508232706-74f9aab9d74a // indirect + golang.org/x/image v0.40.0 // indirect + golang.org/x/sync v0.20.0 // indirect + golang.org/x/sys v0.44.0 // indirect + golang.org/x/term v0.43.0 // indirect + golang.org/x/text v0.37.0 // indirect + google.golang.org/protobuf v1.36.11 // indirect + k8s.io/klog/v2 v2.140.0 // indirect +) diff --git a/content/develop/use-cases/semantic-cache/go/go.sum b/content/develop/use-cases/semantic-cache/go/go.sum new file mode 100644 index 0000000000..cad2867528 --- /dev/null +++ b/content/develop/use-cases/semantic-cache/go/go.sum @@ -0,0 +1,130 @@ +codeberg.org/go-fonts/liberation v0.5.0 h1:SsKoMO1v1OZmzkG2DY+7ZkCL9U+rrWI09niOLfQ5Bo0= +codeberg.org/go-fonts/liberation v0.5.0/go.mod h1:zS/2e1354/mJ4pGzIIaEtm/59VFCFnYC7YV6YdGl5GU= +codeberg.org/go-latex/latex v0.1.0 h1:hoGO86rIbWVyjtlDLzCqZPjNykpWQ9YuTZqAzPcfL3c= +codeberg.org/go-latex/latex v0.1.0/go.mod h1:LA0q/AyWIYrqVd+A9Upkgsb+IqPcmSTKc9Dny04MHMw= +codeberg.org/go-pdf/fpdf v0.10.0 h1:u+w669foDDx5Ds43mpiiayp40Ov6sZalgcPMDBcZRd4= +codeberg.org/go-pdf/fpdf v0.10.0/go.mod h1:Y0DGRAdZ0OmnZPvjbMp/1bYxmIPxm0ws4tfoPOc4LjU= +git.sr.ht/~sbinet/gg v0.6.0 h1:RIzgkizAk+9r7uPzf/VfbJHBMKUr0F5hRFxTUGMnt38= +git.sr.ht/~sbinet/gg v0.6.0/go.mod h1:uucygbfC9wVPQIfrmwM2et0imr8L7KQWywX0xpFMm94= +github.com/ajstarks/svgo v0.0.0-20211024235047-1546f124cd8b h1:slYM766cy2nI3BwyRiyQj/Ud48djTMtMebDqepE95rw= +github.com/ajstarks/svgo v0.0.0-20211024235047-1546f124cd8b/go.mod h1:1KcenG0jGWcpt8ov532z81sp/kMMUG485J2InIOyADM= +github.com/aymanbagabas/go-osc52/v2 v2.0.1 h1:HwpRHbFMcZLEVr42D4p7XBqjyuxQH5SMiErDT4WkJ2k= +github.com/aymanbagabas/go-osc52/v2 v2.0.1/go.mod 
h1:uYgXzlJ7ZpABp8OJ+exZzJJhRNQ2ASbcXHWsFqH8hp8= +github.com/bsm/ginkgo/v2 v2.12.0 h1:Ny8MWAHyOepLGlLKYmXG4IEkioBysk6GpaRTLC8zwWs= +github.com/bsm/ginkgo/v2 v2.12.0/go.mod h1:SwYbGRRDovPVboqFv0tPTcG1sN61LM1Z4ARdbAV9g4c= +github.com/bsm/gomega v1.27.10 h1:yeMWxP2pV2fG3FgAODIY8EiRE3dy0aeFYt4l7wh6yKA= +github.com/bsm/gomega v1.27.10/go.mod h1:JyEr/xRbxbtgWNi8tIEVPUYZ5Dzef52k01W3YH0H+O0= +github.com/campoy/embedmd v1.0.0 h1:V4kI2qTJJLf4J29RzI/MAt2c3Bl4dQSYPuflzwFH2hY= +github.com/campoy/embedmd v1.0.0/go.mod h1:oxyr9RCiSXg0M3VJ3ks0UGfp98BpSSGr0kpiX3MzVl8= +github.com/cespare/xxhash/v2 v2.3.0 h1:UL815xU9SqsFlibzuggzjXhog7bL6oX9BbNZnL2UFvs= +github.com/cespare/xxhash/v2 v2.3.0/go.mod h1:VGX0DQ3Q6kWi7AoAeZDth3/j3BFtOZR5XLFGgcrjCOs= +github.com/charmbracelet/colorprofile v0.4.3 h1:QPa1IWkYI+AOB+fE+mg/5/4HRMZcaXex9t5KX76i20Q= +github.com/charmbracelet/colorprofile v0.4.3/go.mod h1:/zT4BhpD5aGFpqQQqw7a+VtHCzu+zrQtt1zhMt9mR4Q= +github.com/charmbracelet/lipgloss v1.1.0 h1:vYXsiLHVkK7fp74RkV7b2kq9+zDLoEU4MZoFqR/noCY= +github.com/charmbracelet/lipgloss v1.1.0/go.mod h1:/6Q8FR2o+kj8rz4Dq0zQc3vYf7X+B0binUUBwA0aL30= +github.com/charmbracelet/x/ansi v0.11.6 h1:GhV21SiDz/45W9AnV2R61xZMRri5NlLnl6CVF7ihZW8= +github.com/charmbracelet/x/ansi v0.11.6/go.mod h1:2JNYLgQUsyqaiLovhU2Rv/pb8r6ydXKS3NIttu3VGZQ= +github.com/charmbracelet/x/cellbuf v0.0.15 h1:ur3pZy0o6z/R7EylET877CBxaiE1Sp1GMxoFPAIztPI= +github.com/charmbracelet/x/cellbuf v0.0.15/go.mod h1:J1YVbR7MUuEGIFPCaaZ96KDl5NoS0DAWkskup+mOY+Q= +github.com/charmbracelet/x/term v0.2.2 h1:xVRT/S2ZcKdhhOuSP4t5cLi5o+JxklsoEObBSgfgZRk= +github.com/charmbracelet/x/term v0.2.2/go.mod h1:kF8CY5RddLWrsgVwpw4kAa6TESp6EB5y3uxGLeCqzAI= +github.com/clipperhouse/displaywidth v0.11.0 h1:lBc6kY44VFw+TDx4I8opi/EtL9m20WSEFgwIwO+UVM8= +github.com/clipperhouse/displaywidth v0.11.0/go.mod h1:bkrFNkf81G8HyVqmKGxsPufD3JhNl3dSqnGhOoSD/o0= +github.com/clipperhouse/uax29/v2 v2.7.0 h1:+gs4oBZ2gPfVrKPthwbMzWZDaAFPGYK72F0NJv2v7Vk= +github.com/clipperhouse/uax29/v2 
v2.7.0/go.mod h1:EFJ2TJMRUaplDxHKj1qAEhCtQPW2tJSwu5BF98AuoVM= +github.com/daulet/tokenizers v1.27.0 h1:MmFYAEDFz69s/nNQfHg59DWqHz3v94m99kEZ/JbL+s4= +github.com/daulet/tokenizers v1.27.0/go.mod h1:YjFY1o1HGMyWkQgbXJDghhvke/yFDp2vGdIO2hYs4MQ= +github.com/davecgh/go-spew v1.1.2-0.20180830191138-d8f796af33cc h1:U9qPSI2PIWSS1VwoXQT9A3Wy9MM3WgvqSxFWenqJduM= +github.com/davecgh/go-spew v1.1.2-0.20180830191138-d8f796af33cc/go.mod h1:J7Y8YcW2NihsgmVo/mv3lAwl/skON4iLHjSsI+c5H38= +github.com/dustin/go-humanize v1.0.1 h1:GzkhY7T5VNhEkwH0PVJgjz+fX1rhBrR7pRT3mDkpeCY= +github.com/dustin/go-humanize v1.0.1/go.mod h1:Mu1zIs6XwVuF/gI1OepvI0qD18qycQx+mFykh5fBlto= +github.com/go-errors/errors v1.5.1 h1:ZwEMSLRCapFLflTpT7NKaAc7ukJ8ZPEjzlxt8rPN8bk= +github.com/go-errors/errors v1.5.1/go.mod h1:sIVyrIiJhuEF+Pj9Ebtd6P/rEYROXFi3BopGUQ5a5Og= +github.com/go-logr/logr v1.4.3 h1:CjnDlHq8ikf6E492q6eKboGOC0T8CDaOvkHCIg8idEI= +github.com/go-logr/logr v1.4.3/go.mod h1:9T104GzyrTigFIr8wt5mBrctHMim0Nb2HLGrmQ40KvY= +github.com/gofrs/flock v0.13.0 h1:95JolYOvGMqeH31+FC7D2+uULf6mG61mEZ/A8dRYMzw= +github.com/gofrs/flock v0.13.0/go.mod h1:jxeyy9R1auM5S6JYDBhDt+E2TCo7DkratH4Pgi8P+Z0= +github.com/golang/freetype v0.0.0-20170609003504-e2365dfdc4a0 h1:DACJavvAHhabrF08vX0COfcOBJRhZ8lUbR+ZWIs0Y5g= +github.com/golang/freetype v0.0.0-20170609003504-e2365dfdc4a0/go.mod h1:E/TSTwGwJL78qG/PmXZO1EjYhfJinVAhrmmHX6Z8B9k= +github.com/gomlx/exceptions v0.0.3 h1:HKnTgEjj4jlmhr8zVFkTP9qmV1ey7ypYYosQ8GzXWuM= +github.com/gomlx/exceptions v0.0.3/go.mod h1:uHL0TQwJ0xaV2/snJOJV6hSE4yRmhhfymuYgNredGxU= +github.com/gomlx/go-huggingface v0.3.5 h1:eZz1huOvfr0TW30e11TkGAUZY4Jj5Oh/g0Thz4cvu0I= +github.com/gomlx/go-huggingface v0.3.5/go.mod h1:r/Z6JQTPm2nd6zHYKp6ig8ofQZK16+Rj9iqZpWq8OTQ= +github.com/gomlx/go-xla v0.2.2 h1:2YMzXAcmK8BvqFjRnUHHtE2QwKDEts2tRglcFcKhZj8= +github.com/gomlx/go-xla v0.2.2/go.mod h1:T2CsL/E90te3k4qpuzlXv2uQU2FmLMLfUsRlAGqKSuI= +github.com/gomlx/gomlx v0.27.3 h1:4cCcVi2m3lvMzDyZtepIl3+6cBGMTXhrYvQtOdtU5Z4= 
+github.com/gomlx/gomlx v0.27.3/go.mod h1:gqqTny0q1kcxml72T313SZy5U9pfX9c54NmzcYtzg5k= +github.com/gomlx/onnx-gomlx v0.4.2 h1:nBDbjzZOVMkCudk0AKMREHMdm54xNcp34dAte9aNwqQ= +github.com/gomlx/onnx-gomlx v0.4.2/go.mod h1:jh/oy07gw7aloPO3R8A2tHIVF7sVVXE2erp5IQCqlPY= +github.com/google/go-cmp v0.7.0 h1:wk8382ETsv4JYUZwIsn6YpYiWiBsYLSJiTsyBybVuN8= +github.com/google/go-cmp v0.7.0/go.mod h1:pXiqmnSA92OHEEa9HXL2W4E7lf9JzCmGVUdgjX3N/iU= +github.com/google/uuid v1.6.0 h1:NIvaJDMOsjHA8n1jAhLSgzrAzy1Hgr+hNrb57e+94F0= +github.com/google/uuid v1.6.0/go.mod h1:TIyPZe4MgqvfeYDBFedMoGGpEw/LqOeaOT+nhxU+yHo= +github.com/janpfeifer/go-benchmarks v0.1.1 h1:gLLy07/JrOKSnMWeUxSnjTdhkglgmrNR2IBDnR4kRqw= +github.com/janpfeifer/go-benchmarks v0.1.1/go.mod h1:5AagXCOUzevvmYFQalcgoa4oWPyH1IkZNckolGWfiSM= +github.com/janpfeifer/must v0.2.0 h1:yWy1CE5gtk1i2ICBvqAcMMXrCMqil9CJPkc7x81fRdQ= +github.com/janpfeifer/must v0.2.0/go.mod h1:S6c5Yg/YSMR43cJw4zhIq7HFMci90a7kPY9XA4c8UIs= +github.com/klauspost/cpuid/v2 v2.2.10 h1:tBs3QSyvjDyFTq3uoc/9xFpCuOsJQFNPiAhYdw2skhE= +github.com/klauspost/cpuid/v2 v2.2.10/go.mod h1:hqwkgyIinND0mEev00jJYCxPNVRVXFQeu1XKlok6oO0= +github.com/knights-analytics/hugot v0.7.3 h1:39UqU52s4nAmNIE4JG5ViASCvd8dhue7XGtt5RhK3T4= +github.com/knights-analytics/hugot v0.7.3/go.mod h1:86tRz/GzyoNFHuUUzgiYnALQNZU8Vzd5F0pApYizwrs= +github.com/knights-analytics/ortgenai v0.3.1 h1:0Awe43Zu+giDxzlpoNvx9ekbez/zxc8XMzKU++sOUB8= +github.com/knights-analytics/ortgenai v0.3.1/go.mod h1:lSbQsRP5wY5NS+4W5CUGhdxjTzERQkR7WprAFxrBSt4= +github.com/lucasb-eyer/go-colorful v1.3.0 h1:2/yBRLdWBZKrf7gB40FoiKfAWYQ0lqNcbuQwVHXptag= +github.com/lucasb-eyer/go-colorful v1.3.0/go.mod h1:R4dSotOR9KMtayYi1e77YzuveK+i7ruzyGqttikkLy0= +github.com/mattn/go-isatty v0.0.20 h1:xfD0iDuEKnDkl03q4limB+vH+GxLEtL/jb4xVJSWWEY= +github.com/mattn/go-isatty v0.0.20/go.mod h1:W+V8PltTTMOvKvAeJH7IuucS94S2C6jfK/D7dTCTo3Y= +github.com/mattn/go-runewidth v0.0.21 h1:jJKAZiQH+2mIinzCJIaIG9Be1+0NR+5sz/lYEEjdM8w= 
+github.com/mattn/go-runewidth v0.0.21/go.mod h1:XBkDxAl56ILZc9knddidhrOlY5R/pDhgLpndooCuJAs= +github.com/mitchellh/colorstring v0.0.0-20190213212951-d06e56a500db h1:62I3jR2EmQ4l5rM/4FEfDWcRD+abF5XlKShorW5LRoQ= +github.com/mitchellh/colorstring v0.0.0-20190213212951-d06e56a500db/go.mod h1:l0dey0ia/Uv7NcFFVbCLtqEBQbrT4OCwCSKTEv6enCw= +github.com/muesli/termenv v0.16.0 h1:S5AlUN9dENB57rsbnkPyfdGuWIlkmzJjbFf0Tf5FWUc= +github.com/muesli/termenv v0.16.0/go.mod h1:ZRfOIKPFDYQoDFF4Olj7/QJbW60Ol/kL1pU3VfY/Cnk= +github.com/pkg/errors v0.9.1 h1:FEBLx1zS214owpjy7qsBeixbURkuhQAwrK5UwLGTwt4= +github.com/pkg/errors v0.9.1/go.mod h1:bwawxfHBFNV+L2hUp1rHADufV3IMtnDRdf1r5NINEl0= +github.com/pmezard/go-difflib v1.0.1-0.20181226105442-5d4384ee4fb2 h1:Jamvg5psRIccs7FGNTlIRMkT8wgtp5eCXdBlqhYGL6U= +github.com/pmezard/go-difflib v1.0.1-0.20181226105442-5d4384ee4fb2/go.mod h1:iKH77koFhYxTK1pcRnkKkqfTogsbg7gZNVY4sRDYZ/4= +github.com/redis/go-redis/v9 v9.19.0 h1:XPVaaPSnG6RhYf7p+rmSa9zZfeVAnWsH5h3lxthOm/k= +github.com/redis/go-redis/v9 v9.19.0/go.mod h1:v/M13XI1PVCDcm01VtPFOADfZtHf8YW3baQf57KlIkA= +github.com/rivo/uniseg v0.4.7 h1:WUdvkW8uEhrYfLC4ZzdpI2ztxP1I582+49Oc5Mq64VQ= +github.com/rivo/uniseg v0.4.7/go.mod h1:FN3SvrM+Zdj16jyLfmOkMNblXMcoc8DfTHruCPUcx88= +github.com/schollz/progressbar/v3 v3.19.0 h1:Ea18xuIRQXLAUidVDox3AbwfUhD0/1IvohyTutOIFoc= +github.com/schollz/progressbar/v3 v3.19.0/go.mod h1:IsO3lpbaGuzh8zIMzgY3+J8l4C8GjO0Y9S69eFvNsec= +github.com/streadway/quantile v0.0.0-20220407130108-4246515d968d h1:X4+kt6zM/OVO6gbJdAfJR60MGPsqCzbtXNnjoGqdfAs= +github.com/streadway/quantile v0.0.0-20220407130108-4246515d968d/go.mod h1:lbP8tGiBjZ5YWIc2fzuRpTaz0b/53vT6PEs3QuAWzuU= +github.com/stretchr/testify v1.11.1 h1:7s2iGBzp5EwR7/aIZr8ao5+dra3wiQyKjjFuvgVKu7U= +github.com/stretchr/testify v1.11.1/go.mod h1:wZwfW3scLgRK+23gO65QZefKpKQRnfz6sD981Nm4B6U= +github.com/viant/afs v1.30.0 h1:dbgVVSCPwGHUgpgkWJ5gdjKBqssT7OV7Z2M81CjwZEY= +github.com/viant/afs v1.30.0/go.mod 
h1:rScbFd9LJPGTM8HOI8Kjwee0AZ+MZMupAvFpPg+Qdj4= +github.com/x448/float16 v0.8.4 h1:qLwI1I70+NjRFUR3zs1JPUCgaCXSh3SW62uAKT1mSBM= +github.com/x448/float16 v0.8.4/go.mod h1:14CWIYCyZA/cWjXOioeEpHeN/83MdbZDRQHoFcYsOfg= +github.com/xo/terminfo v0.0.0-20220910002029-abceb7e1c41e h1:JVG44RsyaB9T2KIHavMF/ppJZNG9ZpyihvCd0w101no= +github.com/xo/terminfo v0.0.0-20220910002029-abceb7e1c41e/go.mod h1:RbqR21r5mrJuqunuUZ/Dhy/avygyECGrLceyNeo4LiM= +github.com/yalue/onnxruntime_go v1.30.1 h1:NaEng5lWbsHZ/8X1dtaw1mIj7eV1ozyjbFo//g0ktl4= +github.com/yalue/onnxruntime_go v1.30.1/go.mod h1:b4X26A8pekNb1ACJ58wAXgNKeUCGEAQ9dmACut9Sm/4= +github.com/zeebo/xxh3 v1.1.0 h1:s7DLGDK45Dyfg7++yxI0khrfwq9661w9EN78eP/UZVs= +github.com/zeebo/xxh3 v1.1.0/go.mod h1:IisAie1LELR4xhVinxWS5+zf1lA4p0MW4T+w+W07F5s= +go.uber.org/atomic v1.11.0 h1:ZvwS0R+56ePWxUNi+Atn9dWONBPp/AUETXlHW0DxSjE= +go.uber.org/atomic v1.11.0/go.mod h1:LUxbIzbOniOlMKjJjyPfpl4v+PKK2cNJn91OQbhoJI0= +golang.org/x/crypto v0.51.0 h1:IBPXwPfKxY7cWQZ38ZCIRPI50YLeevDLlLnyC5wRGTI= +golang.org/x/crypto v0.51.0/go.mod h1:8AdwkbraGNABw2kOX6YFPs3WM22XqI4EXEd8g+x7Oc8= +golang.org/x/exp v0.0.0-20260508232706-74f9aab9d74a h1:+3jdDGGB8NGb1Zktc737jlt3/A5f6UlwSzmvqUuufxw= +golang.org/x/exp v0.0.0-20260508232706-74f9aab9d74a/go.mod h1:d2fgXJLVs4dYDHUk5lwMIfzRzSrWCfGZb0ZqeLa/Vcw= +golang.org/x/image v0.40.0 h1:Tw4GyDXMo+daZN1znreBRC3VayR1aLFUyUEOLUdW1a8= +golang.org/x/image v0.40.0/go.mod h1:uIc348UZMSvS5Z65CVZ7iDPaNobNFEPeJ4kbqTOszmA= +golang.org/x/sync v0.20.0 h1:e0PTpb7pjO8GAtTs2dQ6jYa5BWYlMuX047Dco/pItO4= +golang.org/x/sync v0.20.0/go.mod h1:9xrNwdLfx4jkKbNva9FpL6vEN7evnE43NNNJQ2LF3+0= +golang.org/x/sys v0.44.0 h1:ildZl3J4uzeKP07r2F++Op7E9B29JRUy+a27EibtBTQ= +golang.org/x/sys v0.44.0/go.mod h1:4GL1E5IUh+htKOUEOaiffhrAeqysfVGipDYzABqnCmw= +golang.org/x/term v0.43.0 h1:S4RLU2sB31O/NCl+zFN9Aru9A/Cq2aqKpTZJ6B+DwT4= +golang.org/x/term v0.43.0/go.mod h1:lrhlHNdQJHO+1qVYiHfFKVuVioJIheAc3fBSMFYEIsk= +golang.org/x/text v0.37.0 
h1:Cqjiwd9eSg8e0QAkyCaQTNHFIIzWtidPahFWR83rTrc= +golang.org/x/text v0.37.0/go.mod h1:a5sjxXGs9hsn/AJVwuElvCAo9v8QYLzvavO5z2PiM38= +gonum.org/v1/plot v0.15.2 h1:Tlfh/jBk2tqjLZ4/P8ZIwGrLEWQSPDLRm/SNWKNXiGI= +gonum.org/v1/plot v0.15.2/go.mod h1:DX+x+DWso3LTha+AdkJEv5Txvi+Tql3KAGkehP0/Ubg= +google.golang.org/protobuf v1.36.11 h1:fV6ZwhNocDyBLK0dj+fg8ektcVegBBuEolpbTQyBNVE= +google.golang.org/protobuf v1.36.11/go.mod h1:HTf+CrKn2C3g5S8VImy6tdcUvCska2kB7j23XfzDpco= +gopkg.in/yaml.v3 v3.0.1 h1:fxVm/GzAzEWqLHuvctI91KS9hhNmmWOoWu0XTYJS7CA= +gopkg.in/yaml.v3 v3.0.1/go.mod h1:K4uyk7z7BCEPqu6E+C64Yfv1cQ7kz7rIZviUmN+EgEM= +k8s.io/klog/v2 v2.140.0 h1:Tf+J3AH7xnUzZyVVXhTgGhEKnFqye14aadWv7bzXdzc= +k8s.io/klog/v2 v2.140.0/go.mod h1:o+/RWfJ6PwpnFn7OyAG3QnO47BFsymfEfrz6XyYSSp0= diff --git a/content/develop/use-cases/semantic-cache/go/index.html b/content/develop/use-cases/semantic-cache/go/index.html new file mode 100644 index 0000000000..e897cfdee7 --- /dev/null +++ b/content/develop/use-cases/semantic-cache/go/index.html @@ -0,0 +1,513 @@ + + + + + + Redis Semantic Cache Demo + + + +
+
loading…
+

Redis Semantic Cache Demo

+

+ A small semantic cache sits in front of a mock LLM. Each cache + entry is a Hash at __KEY_PREFIX__<id> holding + the prompt, the response, the prompt's 384-dimensional embedding, + and metadata fields. A single FT.SEARCH on + __INDEX_NAME__ does the KNN against cached prompts + with a TAG pre-filter (tenant, locale, model version, safety) in + the same round trip. If the closest cached prompt is within the + cosine-distance threshold, the demo serves the cached response + and the LLM is not called at all. +

+ +
+ +
+

Ask the LLM

+

Type a question, optionally adjust the metadata filters and + the distance threshold, and submit. The server embeds the + prompt, runs FT.SEARCH with KNN over the cache, + and either serves the cached response (hit) or runs the mock + LLM and writes the new response back to the cache (miss).

+ + +
+
+ + +
+
+ + +
+
+ + +
+
+
+ + + 0.50 +
+

+ The cache serves a hit when the closest cached prompt's + cosine distance is at or below this threshold. Lower = + stricter (fewer hits, safer reuse); higher = looser (more + hits, more risk of serving a near-miss). +

+ + + + + +
+
+ +
+

Cumulative savings

+

Every hit avoids one LLM round trip. The numbers below add + up across the session — tokens that would have been spent and + wall-clock seconds that would have been waited if the cache + had not served the answer.

+
+
+
0
+
Total queries
+
+
+
0
+
Cache hits
+
+
+
0
+
Cache misses
+
+
+
0%
+
Hit ratio
+
+
+
0
+
Tokens saved
+
+
+
0 ms
+
LLM time saved
+
+
+
+ +
+

Index state

+
+ +
+ +
+

Cached entries

+

Every prompt/response pair currently in the cache. + hit_count is the running total of times the entry + has served a hit; ttl is the remaining lifetime + in seconds before EXPIRE drops the key. Click + Drop to simulate eviction.

+ + + + + + + + + + + + +
ID | Prompt | Metadata | Hits | TTL
+
+ +
+ +
+
+ + + + diff --git a/content/develop/use-cases/semantic-cache/go/main.go b/content/develop/use-cases/semantic-cache/go/main.go new file mode 100644 index 0000000000..57ec59f608 --- /dev/null +++ b/content/develop/use-cases/semantic-cache/go/main.go @@ -0,0 +1,585 @@ +// Redis semantic-cache demo server (Go). +// +// Run this file and visit http://localhost:8088 to drive a small +// semantic-cache demo backed by Redis Search. The UI lets you: +// +// - Type a natural-language prompt and watch the cache decide hit +// or miss. On a hit Redis returns the cached response in tens of +// milliseconds and the demo LLM is not called at all; on a miss +// the demo LLM "thinks" for ~1.5 s before answering and the new +// prompt, response, and embedding are written back to Redis for +// next time. +// - Adjust the cosine-distance threshold to see how close a +// paraphrase must be for the cache to serve it. +// - Switch tenant, locale, or model version to see metadata +// isolation in action — entries written under one tenant cannot +// be served to another, because the TAG filter goes into the +// same FT.SEARCH call as the KNN. +// - Inspect every cached entry with TTL and hit count, and drop +// individual entries to simulate eviction. +// +// The server holds a single `LocalEmbedder`, a single +// `RedisSemanticCache`, and a single `MockLLM` for the lifetime of +// the process. The first run downloads the embedding model into the +// local `./models` directory; everything after is local. + +package main + +import ( + "context" + "encoding/json" + "errors" + "flag" + "fmt" + "io" + "log" + "math" + "net/http" + "os" + "os/signal" + "path/filepath" + "runtime/debug" + "strconv" + "strings" + "syscall" + "time" + + "github.com/redis/go-redis/v9" +) + +// SemanticCacheDemo owns the cache, embedder, and LLM for the +// lifetime of the process. 
The handlers thread requests through +// `RunQuery` and the seed / reset endpoints reuse `Seed` so there's +// only one description of the cache lifecycle. +type SemanticCacheDemo struct { + Cache *RedisSemanticCache + Embedder *LocalEmbedder + LLM *MockLLM + DefaultTenant string + DefaultLocale string +} + +// Seed clears every entry in scope and re-populates the FAQ list. +func (d *SemanticCacheDemo) Seed(ctx context.Context) (int, error) { + if _, err := d.Cache.Clear(ctx); err != nil { + return 0, err + } + return Seed(ctx, d.Cache, d.Embedder, SeedOptions{ + Tenant: d.DefaultTenant, + Locale: d.DefaultLocale, + ModelVersion: d.LLM.ModelVersion, + }) +} + +// QueryParams collects what the /query endpoint accepts. `LookupOnly` +// is the toggle the UI uses to sweep the threshold against a fixed +// prompt without polluting the cache. +type QueryParams struct { + Prompt string + Tenant string + Locale string + ModelVersion string + Threshold float64 + LookupOnly bool +} + +// RunQuery is the hot path: embed, look up, optionally call the LLM, +// cache. +// +// Timings are taken with `time.Now()` around each bounded step so +// the UI can display the embed / lookup / LLM breakdown separately. +// The cache write on a miss is *not* included in `total_ms` so the +// latency number reflects the user-facing wait, not the background +// bookkeeping. 
+func (d *SemanticCacheDemo) RunQuery(ctx context.Context, p QueryParams) (map[string]any, error) { + threshold := p.Threshold + + t0 := time.Now() + queryVec, err := d.Embedder.EncodeOne(ctx, p.Prompt) + if err != nil { + return nil, fmt.Errorf("embed: %w", err) + } + embedMs := msSince(t0) + + t1 := time.Now() + result, err := d.Cache.Lookup(ctx, queryVec, LookupParams{ + Tenant: p.Tenant, + Locale: p.Locale, + ModelVersion: p.ModelVersion, + DistanceThreshold: &threshold, + }) + if err != nil { + return nil, fmt.Errorf("lookup: %w", err) + } + lookupMs := msSince(t1) + + if result.Hit != nil { + hit := result.Hit + return map[string]any{ + "outcome": "hit", + "response": hit.Response, + "entry_id": hit.ID, + "distance": hit.Distance, + "ttl_seconds": hit.TTLSeconds, + "hit_count": hit.HitCount, + "threshold": threshold, + "embed_ms": embedMs, + "lookup_ms": lookupMs, + "llm_ms": nil, + "total_ms": embedMs + lookupMs, + "tokens_avoided": estimateResponseTokens(hit.Prompt, hit.Response), + "ms_avoided": d.LLM.LatencyMs, + }, nil + } + + miss := result.Miss + + // Miss path. In "lookup only" mode the demo reports the miss + // without actually calling the LLM — useful for sweeping the + // threshold against a fixed prompt to see where the cutoff + // would fall without polluting the cache. + if p.LookupOnly { + return map[string]any{ + "outcome": "miss", + "response": "(LLM not called in lookup-only mode)", + "nearest_distance": miss.NearestDistance, + "threshold": threshold, + "wrote_entry_id": nil, + "embed_ms": embedMs, + "lookup_ms": lookupMs, + "llm_ms": nil, + "total_ms": embedMs + lookupMs, + }, nil + } + + t2 := time.Now() + llmResp := d.LLM.Complete(p.Prompt) + llmMs := msSince(t2) + + // Write the new entry back. The embedding is the same vector we + // already used for the lookup — no need to re-encode. 
+ entryID, err := d.Cache.Put(ctx, PutParams{ + Prompt: p.Prompt, + Response: llmResp.Response, + Embedding: queryVec, + Tenant: p.Tenant, + Locale: p.Locale, + ModelVersion: p.ModelVersion, + }) + if err != nil { + return nil, fmt.Errorf("put: %w", err) + } + + return map[string]any{ + "outcome": "miss", + "response": llmResp.Response, + "nearest_distance": miss.NearestDistance, + "threshold": threshold, + "wrote_entry_id": entryID, + "embed_ms": embedMs, + "lookup_ms": lookupMs, + "llm_ms": llmMs, + "total_ms": embedMs + lookupMs + llmMs, + }, nil +} + +func msSince(t time.Time) float64 { + return float64(time.Since(t)) / float64(time.Millisecond) +} + +func estimateResponseTokens(prompt, response string) int { + n := (len(prompt) + len(response)) / 4 + if n < 1 { + return 1 + } + return n +} + +// ---- HTTP plumbing -------------------------------------------------- + +func sendJSON(w http.ResponseWriter, payload any, status int) { + w.Header().Set("Content-Type", "application/json") + w.WriteHeader(status) + if err := json.NewEncoder(w).Encode(payload); err != nil { + log.Printf("[demo] encode: %v", err) + } +} + +func sendHTML(w http.ResponseWriter, html string, status int) { + w.Header().Set("Content-Type", "text/html; charset=utf-8") + w.WriteHeader(status) + _, _ = io.WriteString(w, html) +} + +// clampThreshold sanitises the threshold parameter from the form +// body. `strconv.ParseFloat` happily handles "nan" → NaN and "inf" +// → +Inf. Either would silently turn the lookup into a permanent +// hit (NaN comparisons are always false, so `distance > NaN` cannot +// reject) or a permanent miss. Clamp to the meaningful +// cosine-distance range so a malformed POST can't override the +// threshold semantics. 
+func clampThreshold(raw string) float64 { + parsed, err := strconv.ParseFloat(strings.TrimSpace(raw), 64) + if err != nil || math.IsNaN(parsed) || math.IsInf(parsed, 0) { + return 0.5 + } + if parsed < 0 { + return 0 + } + if parsed > 2 { + return 2 + } + return parsed +} + +// State is the response shape of /state. It is intentionally the +// same shape the Python and Node demos serve so the shared HTML +// works without modification. +type State struct { + Index struct { + NumDocs int `json:"num_docs"` + IndexName string `json:"index_name"` + IndexingFailures int `json:"indexing_failures"` + VectorIndexSizeMB float64 `json:"vector_index_size_mb"` + Model string `json:"model"` + MockLLMLatencyMs float64 `json:"mock_llm_latency_ms"` + // DefaultThreshold is what the --threshold flag actually + // configures; the UI slider initialises to this on first + // load so the flag visibly changes the demo's behaviour. + // StackLabel lets the same HTML render a per-language badge + // (redis-py, node-redis, go-redis, …) without forking the + // file per language. 
+ DefaultThreshold float64 `json:"default_threshold"` + StackLabel string `json:"stack_label"` + } `json:"index"` + Entries []Entry `json:"entries"` +} + +func buildState(ctx context.Context, cache *RedisSemanticCache, embedder *LocalEmbedder, llm *MockLLM, stackLabel string) (State, error) { + info := cache.FTInfo(ctx) + entries, err := cache.ListEntries(ctx, 200) + if err != nil { + return State{}, err + } + var s State + s.Index.NumDocs = info.NumDocs + s.Index.IndexName = cache.IndexName + s.Index.IndexingFailures = info.IndexingFailures + s.Index.VectorIndexSizeMB = info.VectorIndexSizeMB + s.Index.Model = embedder.ModelName + s.Index.MockLLMLatencyMs = llm.LatencyMs + s.Index.DefaultThreshold = cache.DistanceThreshold + s.Index.StackLabel = stackLabel + s.Entries = entries + return s, nil +} + +type serverDeps struct { + cache *RedisSemanticCache + embedder *LocalEmbedder + llm *MockLLM + demo *SemanticCacheDemo + htmlPage string + stackLabel string +} + +// Cap POST bodies so a runaway client (or, more realistically, a +// `curl --data-binary @big-file` by mistake) can't accumulate +// unbounded memory before the handler runs. The demo's largest +// legitimate body is a few hundred bytes of form-encoded query +// fields; 1 MiB is a generous ceiling and matches the Node demo's +// readBody cap. Go's `ParseForm` defaults to 10 MiB on top of this +// — we tighten the cap by wrapping the request body in +// `http.MaxBytesReader` at the start of each POST handler. +const maxBodyBytes = 1 * 1024 * 1024 + +// jsonRecover turns any panic in a handler into a JSON 500 instead +// of letting the default net/http handler render a plain-text stack +// trace. Without this wrapper the client's `await res.json()` +// explodes with an opaque parse error instead of surfacing what +// actually went wrong. 
+func jsonRecover(w http.ResponseWriter, r *http.Request) { + if rec := recover(); rec != nil { + log.Printf("[demo] panic: %v\n%s", rec, debug.Stack()) + w.Header().Set("Content-Type", "application/json") + w.WriteHeader(http.StatusInternalServerError) + _ = json.NewEncoder(w).Encode(map[string]any{ + "error": fmt.Sprintf("%v", rec), + "type": "panic", + }) + } +} + +func makeHandler(deps *serverDeps) http.Handler { + mux := http.NewServeMux() + + mux.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) { + defer jsonRecover(w, r) + if r.URL.Path != "/" && r.URL.Path != "/index.html" { + sendJSON(w, map[string]any{"error": "not found"}, http.StatusNotFound) + return + } + if r.Method != http.MethodGet { + sendJSON(w, map[string]any{"error": "method not allowed"}, http.StatusMethodNotAllowed) + return + } + sendHTML(w, deps.htmlPage, http.StatusOK) + }) + + mux.HandleFunc("/state", func(w http.ResponseWriter, r *http.Request) { + defer jsonRecover(w, r) + if r.Method != http.MethodGet { + sendJSON(w, map[string]any{"error": "method not allowed"}, http.StatusMethodNotAllowed) + return + } + s, err := buildState(r.Context(), deps.cache, deps.embedder, deps.llm, deps.stackLabel) + if err != nil { + sendJSON(w, map[string]any{"error": err.Error()}, http.StatusInternalServerError) + return + } + sendJSON(w, s, http.StatusOK) + }) + + mux.HandleFunc("/query", func(w http.ResponseWriter, r *http.Request) { + defer jsonRecover(w, r) + if r.Method != http.MethodPost { + sendJSON(w, map[string]any{"error": "method not allowed"}, http.StatusMethodNotAllowed) + return + } + r.Body = http.MaxBytesReader(w, r.Body, maxBodyBytes) + if err := r.ParseForm(); err != nil { + sendJSON(w, map[string]any{"error": err.Error()}, http.StatusBadRequest) + return + } + prompt := strings.TrimSpace(r.PostForm.Get("prompt")) + if prompt == "" { + sendJSON(w, map[string]any{"error": "prompt is required"}, http.StatusBadRequest) + return + } + tenant := orDefault(r.PostForm.Get("tenant"), 
"acme") + locale := orDefault(r.PostForm.Get("locale"), "en") + modelVersion := orDefault(r.PostForm.Get("model_version"), deps.llm.ModelVersion) + threshold := clampThreshold(r.PostForm.Get("threshold")) + lookupOnly := r.PostForm.Get("lookup_only") != "" + + payload, err := deps.demo.RunQuery(r.Context(), QueryParams{ + Prompt: prompt, + Tenant: tenant, + Locale: locale, + ModelVersion: modelVersion, + Threshold: threshold, + LookupOnly: lookupOnly, + }) + if err != nil { + log.Printf("[demo] query: %v", err) + sendJSON(w, map[string]any{"error": err.Error()}, http.StatusInternalServerError) + return + } + sendJSON(w, payload, http.StatusOK) + }) + + mux.HandleFunc("/reset", func(w http.ResponseWriter, r *http.Request) { + defer jsonRecover(w, r) + if r.Method != http.MethodPost { + sendJSON(w, map[string]any{"error": "method not allowed"}, http.StatusMethodNotAllowed) + return + } + if _, err := deps.demo.Seed(r.Context()); err != nil { + sendJSON(w, map[string]any{"error": err.Error()}, http.StatusInternalServerError) + return + } + sendJSON(w, map[string]any{"ok": true}, http.StatusOK) + }) + + mux.HandleFunc("/drop", func(w http.ResponseWriter, r *http.Request) { + defer jsonRecover(w, r) + if r.Method != http.MethodPost { + sendJSON(w, map[string]any{"error": "method not allowed"}, http.StatusMethodNotAllowed) + return + } + r.Body = http.MaxBytesReader(w, r.Body, maxBodyBytes) + if err := r.ParseForm(); err != nil { + sendJSON(w, map[string]any{"error": err.Error()}, http.StatusBadRequest) + return + } + entryID := strings.TrimSpace(r.PostForm.Get("entry_id")) + if entryID == "" { + sendJSON(w, map[string]any{"error": "entry_id is required"}, http.StatusBadRequest) + return + } + deleted, err := deps.cache.DeleteEntry(r.Context(), entryID) + if err != nil { + sendJSON(w, map[string]any{"error": err.Error()}, http.StatusInternalServerError) + return + } + sendJSON(w, map[string]any{"deleted": deleted, "entry_id": entryID}, http.StatusOK) + }) + + return mux 
+} + +func orDefault(s, dflt string) string { + if s == "" { + return dflt + } + return s +} + +// ---- Main ----------------------------------------------------------- + +type flags struct { + host string + port int + redisHost string + redisPort int + indexName string + keyPrefix string + ttlSeconds int + threshold float64 + llmLatencyMs float64 + noReset bool +} + +func parseFlags() flags { + var f flags + flag.StringVar(&f.host, "host", "127.0.0.1", "interface to bind to") + flag.IntVar(&f.port, "port", 8088, "HTTP port for the demo UI") + flag.StringVar(&f.redisHost, "redis-host", "localhost", "Redis host") + flag.IntVar(&f.redisPort, "redis-port", 6379, "Redis port") + flag.StringVar(&f.indexName, "index-name", "semcache:idx", "Redis Search index name") + flag.StringVar(&f.keyPrefix, "key-prefix", "cache:", "key prefix for cache entries") + flag.IntVar(&f.ttlSeconds, "ttl-seconds", 3600, "TTL applied to every cache entry") + flag.Float64Var(&f.threshold, "threshold", 0.5, "default cosine-distance threshold for hits") + flag.Float64Var(&f.llmLatencyMs, "llm-latency-ms", 1500, "simulated latency of the mock LLM in milliseconds") + flag.BoolVar(&f.noReset, "no-reset", false, "skip the cache reset + seed on startup") + flag.Parse() + return f +} + +func main() { + f := parseFlags() + + ctx := context.Background() + client := redis.NewClient(&redis.Options{ + Addr: fmt.Sprintf("%s:%d", f.redisHost, f.redisPort), + Protocol: 2, + }) + if err := client.Ping(ctx).Err(); err != nil { + fmt.Fprintf(os.Stderr, "Error: cannot reach Redis at %s:%d\n (%v)\n", + f.redisHost, f.redisPort, err) + os.Exit(1) + } + + cache := NewRedisSemanticCache( + client, + f.indexName, + f.keyPrefix, + vectorDimDefault, + f.threshold, + f.ttlSeconds, + ) + if err := cache.CreateIndex(ctx); err != nil { + log.Fatalf("creating index: %v", err) + } + + fmt.Println("Loading embedding model (first run downloads the ONNX weights)...") + embedder, err := NewLocalEmbedder(ctx, "", "") + if err != 
nil { + log.Fatalf("loading embedder: %v", err) + } + defer embedder.Close() + + llm := NewMockLLM("", f.llmLatencyMs) + + demo := &SemanticCacheDemo{ + Cache: cache, + Embedder: embedder, + LLM: llm, + DefaultTenant: "acme", + DefaultLocale: "en", + } + if !f.noReset { + fmt.Printf("Dropping any existing cache under '%s*' and "+ + "re-seeding from the FAQ list (pass --no-reset to keep).\n", + f.keyPrefix) + seeded, err := demo.Seed(ctx) + if err != nil { + log.Fatalf("seeding cache: %v", err) + } + fmt.Printf("Seeded %d entries.\n", seeded) + } + + // Load the HTML once and replace the template tokens with the + // configured index name and key prefix so the docs panel shows + // the actual values in use rather than the default copies. + here, err := executableDir() + if err != nil { + log.Fatalf("locating executable dir: %v", err) + } + rawHTML, err := os.ReadFile(filepath.Join(here, "index.html")) + if err != nil { + log.Fatalf("reading index.html: %v", err) + } + htmlPage := strings.ReplaceAll(string(rawHTML), "__INDEX_NAME__", f.indexName) + htmlPage = strings.ReplaceAll(htmlPage, "__KEY_PREFIX__", f.keyPrefix) + + deps := &serverDeps{ + cache: cache, + embedder: embedder, + llm: llm, + demo: demo, + htmlPage: htmlPage, + stackLabel: "go-redis + Hugot + Go standard library HTTP server", + } + + addr := fmt.Sprintf("%s:%d", f.host, f.port) + server := &http.Server{ + Addr: addr, + Handler: makeHandler(deps), + ReadHeaderTimeout: 10 * time.Second, + // ReadTimeout bounds the whole request (headers + body), not + // just the headers — without it a slow-drip POST body can + // hold a handler goroutine open indefinitely while + // `ParseForm` waits for more bytes. 30 s is comfortable for + // any realistic prompt and gives slow networks plenty of + // margin without leaving the server exposed. 
+ ReadTimeout: 30 * time.Second, + } + + go func() { + fmt.Printf("Redis semantic cache demo listening on http://%s\n", addr) + fmt.Printf("Using Redis at %s:%d with index '%s'\n", + f.redisHost, f.redisPort, f.indexName) + if err := server.ListenAndServe(); err != nil && !errors.Is(err, http.ErrServerClosed) { + log.Fatalf("server error: %v", err) + } + }() + + stop := make(chan os.Signal, 1) + signal.Notify(stop, syscall.SIGINT, syscall.SIGTERM) + sig := <-stop + fmt.Printf("\nReceived %s, shutting down...\n", sig) + shutdownCtx, cancel := context.WithTimeout(context.Background(), 5*time.Second) + defer cancel() + _ = server.Shutdown(shutdownCtx) + _ = client.Close() +} + +// executableDir returns the directory where the running binary +// lives. `go run .` puts that in a temp directory, so we fall back +// to the working directory in that case — `index.html` is expected +// to sit next to the source file the user is iterating on. +func executableDir() (string, error) { + exe, err := os.Executable() + if err == nil { + dir := filepath.Dir(exe) + if _, err := os.Stat(filepath.Join(dir, "index.html")); err == nil { + return dir, nil + } + } + wd, err := os.Getwd() + if err != nil { + return "", err + } + return wd, nil +} diff --git a/content/develop/use-cases/semantic-cache/go/mockllm.go b/content/develop/use-cases/semantic-cache/go/mockllm.go new file mode 100644 index 0000000000..93f8f81934 --- /dev/null +++ b/content/develop/use-cases/semantic-cache/go/mockllm.go @@ -0,0 +1,178 @@ +// Deterministic mock LLM for the semantic-cache demo. +// +// The point of a semantic cache is to *skip* an LLM call when a prior +// answer is reusable. 
To make that visible in a docs demo we need an +// LLM stand-in that: +// +// - takes long enough that the saved time on a cache hit is obvious +// (real-world model calls are 500 ms to several seconds); +// - responds deterministically so a given prompt always produces the +// same answer, which keeps the demo reproducible; +// - exposes an estimated token count so the demo can show the +// saving in "tokens not spent" terms alongside latency; +// - needs no API keys, no network, no extra dependencies. +// +// It is keyword-matched against a small lookup table of FAQ-style +// answers for a fictional online retailer. Anything that doesn't +// match falls back to a generic templated reply. The `LatencyMs` +// field is the simulated round trip; the default (1500 ms) is in the +// neighbourhood of a real GPT-class model on a moderately-sized +// prompt. + +package main + +import ( + "strings" + "sync/atomic" + "time" +) + +type knowledgeEntry struct { + keywords []string + answer string +} + +var knowledge = []knowledgeEntry{ + { + keywords: []string{"return", "refund", "exchange"}, + answer: "You can return any unworn item within 30 days of delivery for a " + + "full refund. Start a return from your order page; we email a " + + "prepaid label and refund the original payment method within " + + "five business days of receiving the item.", + }, + { + keywords: []string{"shipping", "delivery", "arrive", "ship"}, + answer: "Standard shipping is free on orders over $50 and arrives in " + + "three to five business days. Expedited two-day shipping is " + + "$9.99 and is available at checkout for in-stock items.", + }, + { + keywords: []string{"size", "sizing", "fit"}, + answer: "We follow standard US sizing. 
For most styles we recommend " + + "ordering your usual size; the product page includes a sizing " + + "chart and customer fit notes for items that run small or large.", + }, + { + keywords: []string{"warranty", "guarantee", "defect", "broken"}, + answer: "All gear is covered by a one-year manufacturer warranty against " + + "defects in materials or workmanship. Email support with your " + + "order number and a photo of the issue and we will replace the " + + "item or issue a refund.", + }, + { + keywords: []string{"contact", "support", "help", "agent"}, + answer: "You can reach our support team by email at help@example.com or " + + "by live chat from the help centre, 9am to 9pm Eastern, seven " + + "days a week. Most tickets get a first reply within two hours.", + }, + { + keywords: []string{"track", "tracking", "order", "where"}, + answer: "Your tracking number is on the order confirmation email and on " + + "the order detail page once the package has been picked up by " + + "the carrier — typically within 24 hours of order placement.", + }, + { + keywords: []string{"cancel", "modify", "change"}, + answer: "Orders can be cancelled or modified for up to one hour after " + + "placement. After that the order has usually entered our " + + "warehouse system; the fastest path is to accept delivery and " + + "start a return for any unwanted items.", + }, + { + keywords: []string{"discount", "coupon", "promo", "code"}, + answer: "Active promotional codes are listed on the homepage banner. " + + "Codes apply at checkout and cannot be combined; the system " + + "automatically uses the larger of the two when more than one " + + "would qualify.", + }, +} + +const fallbackAnswer = "Thanks for the question. Our team would normally answer this " + + "individually; in the meantime please check the help centre or " + + "contact support@example.com for a faster response." + +// estimateTokens is a rough English token estimate: ~4 characters per +// token. 
Real tokenizers (BPE, SentencePiece) vary slightly but this +// is close enough for "look how many tokens you saved" demo signage. +func estimateTokens(text string) int { + if text == "" { + return 0 + } + if n := len(text) / 4; n > 1 { + return n + } + return 1 +} + +func answerFor(prompt string) string { + lower := strings.ToLower(prompt) + for _, row := range knowledge { + for _, k := range row.keywords { + if strings.Contains(lower, k) { + return row.answer + } + } + } + return fallbackAnswer +} + +// LLMResponse captures everything the demo UI wants to show about a +// mock LLM call. `LatencyMs` is the actual measured wall-clock +// duration, not the configured target, in case scheduling jitter +// pushes us a few milliseconds over. +type LLMResponse struct { + Response string + ModelVersion string + LatencyMs float64 + PromptTokens int + CompletionTokens int +} + +// TotalTokens is the convenience sum the demo panel uses. +func (r LLMResponse) TotalTokens() int { + return r.PromptTokens + r.CompletionTokens +} + +// MockLLM stands in for a real model client. +type MockLLM struct { + ModelVersion string + LatencyMs float64 + callCount atomic.Int64 +} + +// NewMockLLM returns a MockLLM configured with sensible defaults. Pass +// 0 for `latencyMs` to use the 1500 ms default; any other non-negative +// value overrides it (including very small values for tests). +func NewMockLLM(modelVersion string, latencyMs float64) *MockLLM { + if modelVersion == "" { + modelVersion = "gpt-4.5-2026" + } + if latencyMs <= 0 { + latencyMs = 1500 + } + return &MockLLM{ModelVersion: modelVersion, LatencyMs: latencyMs} +} + +// CallCount is the number of times Complete has been invoked. Useful +// for tests that assert the cache really skipped the LLM on a hit. +func (m *MockLLM) CallCount() int64 { + return m.callCount.Load() +} + +// Complete pretends to call an LLM. 
Sleeps first so the latency is +// realistic regardless of which branch generates the text, then +// keyword-matches a templated answer. +func (m *MockLLM) Complete(prompt string) LLMResponse { + m.callCount.Add(1) + start := time.Now() + time.Sleep(time.Duration(m.LatencyMs * float64(time.Millisecond))) + resp := answerFor(prompt) + elapsedMs := float64(time.Since(start)) / float64(time.Millisecond) + return LLMResponse{ + Response: resp, + ModelVersion: m.ModelVersion, + LatencyMs: elapsedMs, + PromptTokens: estimateTokens(prompt), + CompletionTokens: estimateTokens(resp), + } +} diff --git a/content/develop/use-cases/semantic-cache/go/seedcache.go b/content/develop/use-cases/semantic-cache/go/seedcache.go new file mode 100644 index 0000000000..94ea230b16 --- /dev/null +++ b/content/develop/use-cases/semantic-cache/go/seedcache.go @@ -0,0 +1,119 @@ +// Pre-seed the semantic cache with a handful of FAQ answers. +// +// In a real deployment the cache fills up organically as users ask +// questions: a first-time question is a miss, the LLM answers, and +// the response is written back. To make the demo immediately useful +// — so the first query you type lands on a hit instead of a cold +// miss — we seed a small set of canonical prompts and their answers +// at startup. +// +// The seed list mirrors the keyword table in `mockllm.go` but stores +// the *canonical phrasing* of each question. Paraphrases of any of +// these prompts ("How do I return an item?", "Can I get a refund?") +// embed close to the canonical entry and the cache lookup serves the +// stored response without ever calling the model. + +package main + +import "context" + +// SeedEntry is one canonical prompt + response pair. +type SeedEntry struct { + Prompt string + Response string +} + +// SeedEntries is the canonical FAQ list. It is exported so the docs +// page can link to it as a single source of truth for the demo +// transcript. 
+var SeedEntries = []SeedEntry{ + { + Prompt: "What is your return policy?", + Response: "You can return any unworn item within 30 days of delivery for " + + "a full refund. Start a return from your order page; we email " + + "a prepaid label and refund the original payment method within " + + "five business days of receiving the item.", + }, + { + Prompt: "How long does shipping take?", + Response: "Standard shipping is free on orders over $50 and arrives in " + + "three to five business days. Expedited two-day shipping is " + + "$9.99 and is available at checkout for in-stock items.", + }, + { + Prompt: "How do I find my size?", + Response: "We follow standard US sizing. For most styles we recommend " + + "ordering your usual size; the product page includes a sizing " + + "chart and customer fit notes for items that run small or " + + "large.", + }, + { + Prompt: "Is there a warranty on your products?", + Response: "All gear is covered by a one-year manufacturer warranty " + + "against defects in materials or workmanship. Email support " + + "with your order number and a photo of the issue and we will " + + "replace the item or issue a refund.", + }, + { + Prompt: "How can I contact customer support?", + Response: "You can reach our support team by email at help@example.com " + + "or by live chat from the help centre, 9am to 9pm Eastern, " + + "seven days a week. Most tickets get a first reply within two " + + "hours.", + }, + { + Prompt: "Where is my order?", + Response: "Your tracking number is on the order confirmation email and " + + "on the order detail page once the package has been picked up " + + "by the carrier — typically within 24 hours of order " + + "placement.", + }, +} + +// SeedOptions captures the metadata scope every seed entry shares. +type SeedOptions struct { + Tenant string + Locale string + ModelVersion string +} + +// Seed writes every entry in `SeedEntries` to the cache under the +// supplied metadata scope. 
Embeddings are produced in one batched +// `EncodeMany` call so the encoder only pays the setup cost once. +// Returns the number of entries that were written. +func Seed(ctx context.Context, cache *RedisSemanticCache, embedder *LocalEmbedder, opts SeedOptions) (int, error) { + tenant := opts.Tenant + if tenant == "" { + tenant = "acme" + } + locale := opts.Locale + if locale == "" { + locale = "en" + } + modelVersion := opts.ModelVersion + if modelVersion == "" { + modelVersion = "gpt-4.5-2026" + } + + prompts := make([]string, len(SeedEntries)) + for i, e := range SeedEntries { + prompts[i] = e.Prompt + } + vectors, err := embedder.EncodeMany(ctx, prompts) + if err != nil { + return 0, err + } + for i, entry := range SeedEntries { + if _, err := cache.Put(ctx, PutParams{ + Prompt: entry.Prompt, + Response: entry.Response, + Embedding: vectors[i], + Tenant: tenant, + Locale: locale, + ModelVersion: modelVersion, + }); err != nil { + return 0, err + } + } + return len(SeedEntries), nil +} diff --git a/content/develop/use-cases/semantic-cache/java-jedis/.gitignore b/content/develop/use-cases/semantic-cache/java-jedis/.gitignore new file mode 100644 index 0000000000..c048217367 --- /dev/null +++ b/content/develop/use-cases/semantic-cache/java-jedis/.gitignore @@ -0,0 +1,5 @@ +# Maven build output: ~25 MB shaded jar plus intermediates. The +# DJL native libraries and the all-MiniLM-L6-v2 model weights land +# in ~/.djl.ai/ on first run, not here, so no extra ignore is needed +# for those. +target/ diff --git a/content/develop/use-cases/semantic-cache/java-jedis/README.md b/content/develop/use-cases/semantic-cache/java-jedis/README.md new file mode 100644 index 0000000000..3880de9a32 --- /dev/null +++ b/content/develop/use-cases/semantic-cache/java-jedis/README.md @@ -0,0 +1,33 @@ +# Redis semantic-cache demo (Java + Jedis) + +See `_index.md` for the full walkthrough. Quick start: + +```bash +# 1. Make sure Redis with the Search module is running on localhost:6379. 
+# 2. Build the fat jar (first build pulls Jedis, DJL, and the PyTorch +# native libraries; takes a minute or two): +mvn -q package + +# 3. Run. The first run downloads the sentence-transformers/all-MiniLM-L6-v2 +# PyTorch weights into the local DJL cache (~90 MB). +java -jar target/semantic-cache-jedis.jar + +# Or with Maven directly: +mvn -q exec:java +``` + +Then open <http://localhost:8089>. + +Notable flags (full list with `--help`): + +| Flag | Default | +|---------------------------|--------------------| +| `--port` | `8089` | +| `--redis-host` | `localhost` | +| `--redis-port` | `6379` | +| `--index-name` | `semcache:idx` | +| `--key-prefix` | `cache:` | +| `--ttl-seconds` | `3600` | +| `--threshold` | `0.5` | +| `--llm-latency-ms` | `1500.0` | +| `--no-reset` | (re-seeds by default) | diff --git a/content/develop/use-cases/semantic-cache/java-jedis/_index.md b/content/develop/use-cases/semantic-cache/java-jedis/_index.md new file mode 100644 index 0000000000..3a5c034cac --- /dev/null +++ b/content/develop/use-cases/semantic-cache/java-jedis/_index.md @@ -0,0 +1,266 @@ +--- +categories: +- docs +- develop +- stack +- oss +- rs +- rc +description: Build a Redis-backed semantic cache for LLM responses in Java with Jedis and DJL (PyTorch) +linkTitle: Jedis example (Java) +title: Redis semantic cache with Jedis +weight: 3 +--- + +This guide shows you how to build a small Redis-backed semantic cache for LLM responses in Java with [Jedis]({{< relref "/develop/clients/jedis" >}}) and [DJL (Deep Java Library)](https://djl.ai/) running the [`sentence-transformers/all-MiniLM-L6-v2`](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) encoder locally on PyTorch. It includes a local web server built with the JDK's standard `com.sun.net.httpserver.HttpServer` so you can send paraphrased prompts at a mock LLM, watch the cache decide hit or miss, sweep the cosine-distance threshold, and see the cumulative latency and token savings build up.
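
The hit-or-miss decision mentioned above comes down to comparing a cosine distance with a threshold. The following is a minimal, self-contained sketch of that comparison and is not code from the demo: the class `ThresholdSketch` and its helpers `cosineDistance` and `isHit` are hypothetical names introduced for illustration, and the two-dimensional toy vectors stand in for the 384-dimensional embeddings. In the demo itself, Redis reports the distance as part of the `FT.SEARCH` result.

```java
final class ThresholdSketch {

    // Cosine distance = 1 - cosine similarity: 0 for identical
    // directions, 1 for orthogonal vectors, 2 for opposite ones.
    // Assumption: both vectors are non-zero and have the same length.
    static double cosineDistance(float[] a, float[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("dimension mismatch");
        }
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += (double) a[i] * b[i];
            normA += (double) a[i] * a[i];
            normB += (double) b[i] * b[i];
        }
        return 1.0 - dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // A candidate counts as a hit only when its distance is at or
    // below the threshold; anything further away is a miss.
    static boolean isHit(double distance, double threshold) {
        return distance <= threshold;
    }

    public static void main(String[] args) {
        float[] cached = {1f, 0f};          // toy "cached prompt" vector
        float[] paraphrase = {0.8f, 0.6f};  // toy "incoming prompt" vector
        double d = cosineDistance(cached, paraphrase);
        System.out.printf("distance=%.3f hit@0.5=%b hit@0.1=%b%n",
                d, isHit(d, 0.5), isHit(d, 0.1));
    }
}
```

Under these assumptions, lowering the threshold passed to `isHit` is the same move as sliding the threshold down in the demo UI: a candidate that was served at a looser setting becomes a miss at a stricter one.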
+ +## Overview + +Each cache entry is stored as a single Redis [Hash]({{< relref "/develop/data-types/hashes" >}}) at `cache:`. The hash holds the original prompt, the LLM's response, the raw `float32` bytes of a 384-dimensional embedding of the prompt, and metadata fields — tenant, locale, model version, safety flag — plus a `created_ts` and a `hit_count`. A single [Redis Search]({{< relref "/develop/ai/search-and-query" >}}) index covers the embedding field and every metadata field, so one [`FT.SEARCH`]({{< relref "/commands/ft.search" >}}) call with a `KNN` clause does the vector lookup *and* the TAG pre-filter in the same round trip — no cross-store joins. + +The lookup is thresholded: [`FT.SEARCH`]({{< relref "/commands/ft.search" >}}) always returns the nearest entry that satisfies the filters, but the application only serves it as a hit when the reported cosine distance is at or below `distanceThreshold`. Anything further away is treated as a miss; the caller runs the LLM and writes the new prompt, response, and embedding back to the same key pattern with a TTL. + +The embedder is [DJL](https://djl.ai/) loading the [`sentence-transformers/all-MiniLM-L6-v2`](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) PyTorch model from the DJL model zoo. This is the same 384-dimensional encoder the [Python example]({{< relref "/develop/use-cases/semantic-cache/redis-py" >}}) and the [Node.js example]({{< relref "/develop/use-cases/semantic-cache/nodejs" >}}) use. Embeddings produced by the three implementations are semantically equivalent — paraphrase distances differ only at the fourth decimal place — so a cache populated by one demo can be queried by another against the same Redis instance. + +That gives you: + +* A single round trip for lookup — vector KNN + metadata pre-filter in one [`FT.SEARCH`]({{< relref "/commands/ft.search" >}}). +* Tens of milliseconds on a hit vs. 
a multi-second LLM call on a miss; the embedding step is the bottleneck either way, and that's a model-side cost, not a Redis one. +* Tenant, locale, and model-version isolation enforced inside the query, not in application code — a write under one tenant cannot be served to another. +* Bounded memory: every entry has an [`EXPIRE`]({{< relref "/commands/expire" >}}) TTL, and a database-level [eviction policy]({{< relref "/develop/reference/eviction" >}}) (LRU / LFU) caps the cache size under pressure. + +## How it works + +A query goes through three stages: **embed**, **lookup**, and (on a miss) **call the LLM and write back**. + +### Hit path (the goal) + +1. The application calls `embedder.encodeOne(prompt)` to turn the incoming text into a 384-dimensional `float[]`. +2. `cache.lookup(queryVec, tenant, locale, modelVersion, "ok", threshold)` runs [`FT.SEARCH`]({{< relref "/commands/ft.search" >}}) with a TAG pre-filter and a `KNN 1` clause. Redis returns the closest cached prompt that satisfies the filters along with its cosine distance. +3. If the distance is at or below the threshold, the cache returns a `CacheHit` containing the cached response. The helper also runs an [`HINCRBY`]({{< relref "/commands/hincrby" >}}) on `hit_count` and an [`EXPIRE`]({{< relref "/commands/expire" >}}) refresh inside a [`MULTI/EXEC`]({{< relref "/commands/multi" >}}), so a frequently used answer keeps its TTL and the demo UI can see which entries are load-bearing. +4. The LLM is not called at all. The application returns the cached response to the user. + +### Miss path + +When the distance is above the threshold — or there is no candidate in scope at all — the helper returns a `CacheMiss` instead, carrying the distance of the nearest candidate (if any) for logging. The application then: + +1. Calls the LLM with the prompt. +2. Calls `cache.put(prompt, response, embedding, tenant, locale, modelVersion, ...)`. The same embedding the lookup used is reused — no re-encode. 
The helper writes the Hash with [`HSET`]({{< relref "/commands/hset" >}}) and an [`EXPIRE`]({{< relref "/commands/expire" >}}) TTL inside a single [`MULTI/EXEC`]({{< relref "/commands/multi" >}}) so the entry never lands without a TTL on a partial failure. +3. Returns the LLM's response to the user. The next semantically similar prompt under the same metadata scope will be a hit. + +## The cache helper + +The `RedisSemanticCache` class wraps the Redis Search index and the lookup / write flow +([source](https://github.com/redis/docs/blob/main/content/develop/use-cases/semantic-cache/java-jedis/src/main/java/com/redis/semcache/RedisSemanticCache.java)): + +```java +import redis.clients.jedis.JedisPooled; +import com.redis.semcache.RedisSemanticCache; +import com.redis.semcache.LocalEmbedder; +import com.redis.semcache.LookupResult; +import com.redis.semcache.CacheHit; + +JedisPooled jedis = new JedisPooled("localhost", 6379); +LocalEmbedder embedder = LocalEmbedder.create(); // sentence-transformers/all-MiniLM-L6-v2 + +RedisSemanticCache cache = new RedisSemanticCache( + jedis, + "semcache:idx", + "cache:", + 384, + 0.5, // cosine distance, lower = stricter + 3600 // TTL in seconds (one hour) +); + +// One-time index setup (idempotent). +cache.createIndex(); + +// 1) Embed the prompt. +String prompt = "How do I return an item?"; +float[] queryVec = embedder.encodeOne(prompt); + +// 2) Look up under a metadata scope. The TAG filter and the KNN +// travel together in one FT.SEARCH. +LookupResult result = cache.lookup( + queryVec, "acme", "en", "gpt-4.5-2026", "ok", null); + +String response; +if (result instanceof CacheHit hit) { + response = hit.response(); + System.out.printf("hit (%.3f): %s%n", hit.distance(), response); +} else { + // 3a) Miss — call the LLM. (Use your real client here.) + response = callLlm(prompt); + + // 3b) Cache the new entry. Reuses the same embedding bytes the + // lookup used, so we don't pay the encoder twice. 
+ cache.put( + prompt, + response, + queryVec, + "acme", + "en", + "gpt-4.5-2026", + "ok", + null, // ttl override (null = default) + null // entry id (null = generated) + ); +} +``` + +### Data model + +Each cache entry is one Redis Hash. The vector field is raw little-endian `float32` bytes — no JSON wrapping — because the Redis Search vector encoding expects exactly that. The helper packs the `float[]` with a `ByteBuffer` in `ByteOrder.LITTLE_ENDIAN`, which matches the bytes Redis Search reads and is identical to the encoding the Python and Node ports write. + +```text +cache:7c3f8a1b9e02 + prompt=How do I return an item? + response=You can return any unworn item within 30 days... + tenant=acme + locale=en + model_version=gpt-4.5-2026 + safety=ok + created_ts=1715990400.123 + hit_count=4 + embedding=<384 × float32 little-endian bytes> +``` + +The Redis Search index schema treats every field as queryable in its natural type: + +```text +FT.CREATE semcache:idx + ON HASH PREFIX 1 cache: + SCHEMA + prompt TEXT + response TEXT + tenant TAG + locale TAG + model_version TAG + safety TAG + created_ts NUMERIC SORTABLE + hit_count NUMERIC SORTABLE + embedding VECTOR HNSW 6 TYPE FLOAT32 DIM 384 DISTANCE_METRIC COSINE +``` + +### The query + +The lookup is a hybrid query: a TAG pre-filter expression in parentheses, then `=>[KNN 1 @embedding $vec]`. With `DIALECT 2`, Redis applies the filter first and KNN-ranks only the matching documents. In Jedis: + +```java +Query q = new Query( + "(@tenant:{acme} @locale:{en} @model_version:{gpt\\-4\\.5\\-2026} @safety:{ok})" + + "=>[KNN 1 @embedding $vec AS distance]") + .returnFields("prompt", "response", "tenant", "locale", + "model_version", "hit_count", "distance") + .setSortBy("distance", true) + .limit(0, 1) + .addParam("vec", LocalEmbedder.toBytes(queryVec)) + .dialect(2); + +SearchResult result = jedis.ftSearch("semcache:idx", q); +``` + +`distance` is the cosine *distance* (0 means identical, 2 means opposite). 
The result is sorted ascending, so the top row is the closest candidate. The application inspects `distance` against the threshold and decides hit or miss in user code — Redis returns the row either way, and treating it as a hit or a miss is a policy decision the cache helper owns, not a server-side filter. + +## The mock LLM + +To make the latency and token savings visible without requiring an API key, `MockLLM.java` provides a deterministic stand-in +([source](https://github.com/redis/docs/blob/main/content/develop/use-cases/semantic-cache/java-jedis/src/main/java/com/redis/semcache/MockLLM.java)): + +```java +import com.redis.semcache.MockLLM; + +MockLLM llm = new MockLLM("gpt-4.5-2026", 1500.0); +MockLLM.Response response = llm.complete("What is your return policy?"); +// response.response() — the templated answer text +// response.latencyMs() — wall-clock time the call took +// response.totalTokens() — estimated prompt + completion tokens +``` + +The mock sleeps for the configured latency, then keyword-matches against a small FAQ table to produce an answer. The deliberate slowness is what makes a hit visibly cheaper than a miss in the demo. In production code, you would replace `MockLLM` with your real client of choice — an HTTP call to OpenAI, Anthropic, a self-hosted vLLM endpoint, anything — without changing the cache helper. + +## Pre-seeding the cache + +In a real deployment the cache fills up organically: a first-time question is a miss, the LLM answers, and the response is written back. 
For the demo, `SeedCache.java` pre-loads a small set of canonical FAQ prompts so the very first query lands on a hit +([source](https://github.com/redis/docs/blob/main/content/develop/use-cases/semantic-cache/java-jedis/src/main/java/com/redis/semcache/SeedCache.java)): + +```java +import com.redis.semcache.SeedCache; + +cache.createIndex(); +SeedCache.seed(cache, embedder, "acme", "en", "gpt-4.5-2026"); +``` + +The seed list stores the canonical phrasing of each question ("What is your return policy?"). Paraphrases of any of these prompts ("How do I return an item?", "Can I get a refund?") embed close to the canonical entry, so the cache lookup serves the stored response without ever calling the model. + +## The interactive demo + +`DemoServer.java` runs an HTTP server built on the JDK's `com.sun.net.httpserver.HttpServer` — no Spring, no Jetty, no embedded framework. The HTML page lets you: + +* Type a prompt and toggle metadata: tenant, locale, model version. Each combination is a separate cache namespace inside the same index. +* Slide the cosine-distance threshold and see hits flip to misses (and back) on the same prompt, with the actual distance reported on each query. +* Submit with **Ask** to run the full hit-or-miss path (calls the LLM on a miss, writes the answer back). Submit with **Lookup only (no LLM)** to sweep the threshold against a fixed prompt without polluting the cache. +* Watch the cumulative panel build up: total queries, cache hits, cache misses, hit ratio, tokens not spent, LLM milliseconds not waited. +* Inspect every cached entry, including remaining TTL and total hit count, and drop individual entries to simulate eviction. + +The server holds one `LocalEmbedder`, one `RedisSemanticCache`, and one `MockLLM` for the lifetime of the process. The HTML page is shared with the Python, Node.js, and Go demos; the build embeds `index.html` from the project root as a classpath resource so the jar runs from any working directory. 
Endpoints: + +| Endpoint | What it does | +|-----------------|-------------------------------------------------------------------------------| +| `GET /state` | Index info and the full list of cached entries. | +| `POST /query` | Embed the prompt, run `FT.SEARCH`, on miss call the LLM and write back. | +| `POST /reset` | Drop every cached entry and re-seed from the FAQ list. | +| `POST /drop` | Delete a single cached entry by id. | + +## Run the demo locally + +1. Clone the [`redis/docs`](https://github.com/redis/docs) repository and change into the example + directory: + + ```bash + git clone https://github.com/redis/docs.git + cd docs/content/develop/use-cases/semantic-cache/java-jedis + ``` + +2. Make sure a Redis instance with the Redis Search module is running locally on + port 6379. [Redis Stack]({{< relref "/operate/oss_and_stack/install/install-stack" >}}) or + [Redis 8 with Search]({{< relref "/develop/ai/search-and-query" >}}) both work. + +3. Build the project with Maven. This pulls Jedis, DJL, and the PyTorch native + libraries. The first build takes a couple of minutes: + + ```bash + mvn -q package + ``` + +4. Run the demo. The first run also downloads the `sentence-transformers/all-MiniLM-L6-v2` + PyTorch weights into the local DJL cache (~90 MB); every subsequent run is offline: + + ```bash + java -jar target/semantic-cache-jedis.jar + ``` + + Or with `mvn`: + + ```bash + mvn -q exec:java + ``` + +5. Open <http://localhost:8089> and try some queries: + + * **"What is your return policy?"** — exact match against the seed, distance ≈ 0, + hit at any threshold. + * **"How fast is delivery?"** — paraphrase of the shipping seed; distance + around 0.30, hit at the default threshold of 0.5. + * **"How do I return an item?"** — slightly looser paraphrase of the returns + seed; distance around 0.49, still a hit at the default threshold. Slide + the threshold down to 0.4 to see this one flip to a miss.
+ * **"What payment methods do you accept?"** — unrelated to anything in the + seed; distance > 0.8, so you'll see a miss, the mock LLM kicks in for + ~1.5 s, the new answer is cached, and a follow-up of the same question + is now an immediate hit. + * Switch the **Tenant** dropdown to `globex` or `initech` and re-ask any + seeded question — the result flips to a miss because the cache entries + live under `acme`. That's the metadata pre-filter at work inside `FT.SEARCH`. + +The server is read/write against your local Redis. The default index name is `semcache:idx` and entry keys live under `cache:`. Flags mirror the Python and Node demos: `--no-reset` to keep an existing cache across restarts, `--threshold` to change the default cosine-distance cutoff, `--llm-latency-ms` to make the mock LLM faster or slower for the demo, or `--port` to listen on a different port. diff --git a/content/develop/use-cases/semantic-cache/java-jedis/index.html b/content/develop/use-cases/semantic-cache/java-jedis/index.html new file mode 100644 index 0000000000..e897cfdee7 --- /dev/null +++ b/content/develop/use-cases/semantic-cache/java-jedis/index.html @@ -0,0 +1,513 @@ + + + + + + Redis Semantic Cache Demo + + + +
+
loading…
+

Redis Semantic Cache Demo

+

+ A small semantic cache sits in front of a mock LLM. Each cache + entry is a Hash at __KEY_PREFIX__<id> holding + the prompt, the response, the prompt's 384-dimensional embedding, + and metadata fields. A single FT.SEARCH on + __INDEX_NAME__ does the KNN against cached prompts + with a TAG pre-filter (tenant, locale, model version, safety) in + the same round trip. If the closest cached prompt is within the + cosine-distance threshold, the demo serves the cached response + and the LLM is not called at all. +

+ +
+ +
+

Ask the LLM

+

Type a question, optionally adjust the metadata filters and + the distance threshold, and submit. The server embeds the + prompt, runs FT.SEARCH with KNN over the cache, + and either serves the cached response (hit) or runs the mock + LLM and writes the new response back to the cache (miss).

+ + +
+
+ + +
+
+ + +
+
+ + +
+
+
+ + + 0.50 +
+

+ The cache serves a hit when the closest cached prompt's + cosine distance is at or below this threshold. Lower = + stricter (fewer hits, safer reuse); higher = looser (more + hits, more risk of serving a near-miss). +

+ + + + + +
+
+ +
+

Cumulative savings

+

Every hit avoids one LLM round trip. The numbers below add + up across the session — tokens that would have been spent and + wall-clock seconds that would have been waited if the cache + had not served the answer.

+
+
+
0
+
Total queries
+
+
+
0
+
Cache hits
+
+
+
0
+
Cache misses
+
+
+
0%
+
Hit ratio
+
+
+
0
+
Tokens saved
+
+
+
0 ms
+
LLM time saved
+
+
+
+ +
+

Index state

+
+ +
+ +
+

Cached entries

+

Every prompt/response pair currently in the cache. + hit_count is the running total of times the entry + has served a hit; ttl is the remaining lifetime + in seconds before EXPIRE drops the key. Click + Drop to simulate eviction.

+ + + + + + + + + + + + +
ID | Prompt | Metadata | Hits | TTL
+
+ +
+ +
+
+ + + + diff --git a/content/develop/use-cases/semantic-cache/java-jedis/pom.xml b/content/develop/use-cases/semantic-cache/java-jedis/pom.xml new file mode 100644 index 0000000000..6f87085b1f --- /dev/null +++ b/content/develop/use-cases/semantic-cache/java-jedis/pom.xml @@ -0,0 +1,135 @@ + + + 4.0.0 + + com.redis + semantic-cache-jedis + 1.0.0 + jar + + Redis Semantic Cache Demo (Jedis) + + Interactive semantic-cache demo backed by Redis Search, using + Jedis for Redis access and DJL (PyTorch) for local sentence + embeddings. + + + + 17 + 17 + UTF-8 + 5.2.0 + 0.33.0 + 20240303 + + + + + + redis.clients + jedis + ${jedis.version} + + + + + ai.djl + api + ${djl.version} + + + ai.djl.huggingface + tokenizers + ${djl.version} + + + ai.djl.pytorch + pytorch-model-zoo + ${djl.version} + + + + + org.json + json + ${json.version} + + + + + semantic-cache-jedis + + + + ${project.basedir} + + index.html + + + + + + org.apache.maven.plugins + maven-compiler-plugin + 3.13.0 + + 17 + + + + + org.apache.maven.plugins + maven-shade-plugin + 3.5.3 + + + package + shade + + false + + + com.redis.semcache.DemoServer + + + + + + *:* + + META-INF/*.SF + META-INF/*.DSA + META-INF/*.RSA + + + + + + + + + org.codehaus.mojo + exec-maven-plugin + 3.5.0 + + com.redis.semcache.DemoServer + + + + + diff --git a/content/develop/use-cases/semantic-cache/java-jedis/src/main/java/com/redis/semcache/CacheHit.java b/content/develop/use-cases/semantic-cache/java-jedis/src/main/java/com/redis/semcache/CacheHit.java new file mode 100644 index 0000000000..6f6b8fe094 --- /dev/null +++ b/content/develop/use-cases/semantic-cache/java-jedis/src/main/java/com/redis/semcache/CacheHit.java @@ -0,0 +1,22 @@ +package com.redis.semcache; + +/** + * A cache lookup that returned a cached response. + * + *

{@code distance} is the cosine distance {@code FT.SEARCH} + * reported for the nearest cached prompt (0 = identical, 2 = + * opposite). It is always at or below the threshold the lookup was + * run with. + */ +public record CacheHit( + String id, + String prompt, + String response, + String tenant, + String locale, + String modelVersion, + double distance, + long ttlSeconds, + long hitCount +) implements LookupResult { +} diff --git a/content/develop/use-cases/semantic-cache/java-jedis/src/main/java/com/redis/semcache/CacheMiss.java b/content/develop/use-cases/semantic-cache/java-jedis/src/main/java/com/redis/semcache/CacheMiss.java new file mode 100644 index 0000000000..be6634b85b --- /dev/null +++ b/content/develop/use-cases/semantic-cache/java-jedis/src/main/java/com/redis/semcache/CacheMiss.java @@ -0,0 +1,16 @@ +package com.redis.semcache; + +/** + * A cache lookup that did not return a usable response. + * + *

{@code nearestDistance} is the cosine distance to the closest + * cached prompt that did match the metadata filters. Both + * fields are {@code null} when the cache had no entry in scope at + * all, which is what the demo UI shows as "no candidate" + * vs. "candidate too far". + */ +public record CacheMiss( + Double nearestDistance, + String nearestId +) implements LookupResult { +} diff --git a/content/develop/use-cases/semantic-cache/java-jedis/src/main/java/com/redis/semcache/DemoServer.java b/content/develop/use-cases/semantic-cache/java-jedis/src/main/java/com/redis/semcache/DemoServer.java new file mode 100644 index 0000000000..a8089e9fb0 --- /dev/null +++ b/content/develop/use-cases/semantic-cache/java-jedis/src/main/java/com/redis/semcache/DemoServer.java @@ -0,0 +1,591 @@ +package com.redis.semcache; + +import com.sun.net.httpserver.HttpExchange; +import com.sun.net.httpserver.HttpHandler; +import com.sun.net.httpserver.HttpServer; +import org.json.JSONArray; +import org.json.JSONObject; +import redis.clients.jedis.ConnectionPoolConfig; +import redis.clients.jedis.HostAndPort; +import redis.clients.jedis.JedisPooled; + +import java.io.IOException; +import java.io.InputStream; +import java.net.InetSocketAddress; +import java.net.URI; +import java.net.URLDecoder; +import java.nio.charset.StandardCharsets; +import java.nio.file.Files; +import java.nio.file.Path; +import java.nio.file.Paths; +import java.util.HashMap; +import java.util.LinkedHashMap; +import java.util.List; +import java.util.Map; +import java.util.concurrent.Executors; + +/** + * Redis semantic-cache demo server (Java + Jedis). + * + *

Run this main and visit {@code http://localhost:8089} to drive + * a small semantic-cache demo backed by Redis Search. The UI lets + * you type a natural-language prompt and watch the cache decide hit + * or miss; on a hit Redis returns the cached response in tens of + * milliseconds and the demo LLM is not called at all, while on a + * miss the demo LLM "thinks" for ~1.5 s before answering + * and the new prompt, response, and embedding are written back to + * Redis for next time. + * + *

The server holds a single {@link LocalEmbedder}, a single + * {@link RedisSemanticCache}, and a single {@link MockLLM} for the + * lifetime of the process. The first run downloads the embedding + * model into the local DJL cache; everything after is local. + */ +public final class DemoServer { + + static final class Args { + String host = "127.0.0.1"; + int port = 8089; + String redisHost = "localhost"; + int redisPort = 6379; + String indexName = "semcache:idx"; + String keyPrefix = "cache:"; + long ttlSeconds = 3600; + double threshold = 0.5; + double llmLatencyMs = 1500.0; + boolean resetOnStart = true; + } + + public static void main(String[] argv) throws Exception { + Args args = parseArgs(argv); + + ConnectionPoolConfig poolConfig = new ConnectionPoolConfig(); + poolConfig.setMaxTotal(16); + poolConfig.setMaxIdle(4); + poolConfig.setMinIdle(1); + JedisPooled jedis = new JedisPooled( + poolConfig, + new HostAndPort(args.redisHost, args.redisPort), + redis.clients.jedis.DefaultJedisClientConfig.builder() + .socketTimeoutMillis(2000) + .connectionTimeoutMillis(2000) + .build()); + try { + jedis.ping(); + } catch (Exception ex) { + System.err.println("Error: cannot reach Redis at " + + args.redisHost + ":" + args.redisPort); + System.err.println(" (" + ex.getMessage() + ")"); + jedis.close(); + System.exit(1); + } + + RedisSemanticCache cache = new RedisSemanticCache( + jedis, + args.indexName, + args.keyPrefix, + LocalEmbedder.defaultVectorDim(), + args.threshold, + args.ttlSeconds + ); + cache.createIndex(); + + System.out.println("Loading embedding model " + + "(first run downloads the PyTorch weights)..."); + LocalEmbedder embedder = LocalEmbedder.create(); + MockLLM llm = new MockLLM("gpt-4.5-2026", args.llmLatencyMs); + + SemanticCacheDemo demo = new SemanticCacheDemo(cache, embedder, llm); + + if (args.resetOnStart) { + System.out.println( + "Dropping any existing cache under '" + args.keyPrefix + + "*' and re-seeding from the FAQ list " + + "(pass 
--no-reset to keep)."); + int seeded = demo.seed(); + System.out.println("Seeded " + seeded + " entries."); + } + + // Load the HTML once and substitute the template tokens so the + // docs panel shows the actual values in use rather than the + // default copies. + String rawHtml = loadIndexHtml(); + String htmlPage = rawHtml + .replace("__INDEX_NAME__", args.indexName) + .replace("__KEY_PREFIX__", args.keyPrefix); + + HttpServer server = HttpServer.create( + new InetSocketAddress(args.host, args.port), 0); + server.setExecutor(Executors.newCachedThreadPool()); + server.createContext("/", new RootHandler(cache, embedder, llm, demo, htmlPage)); + + System.out.println("Redis semantic cache demo listening on " + + "http://" + args.host + ":" + args.port); + System.out.println("Using Redis at " + args.redisHost + ":" + args.redisPort + + " with index '" + args.indexName + "'"); + + Runtime.getRuntime().addShutdownHook(new Thread(() -> { + System.out.println("\nShutting down..."); + server.stop(0); + try { embedder.close(); } catch (Exception ignored) {} + jedis.close(); + })); + + server.start(); + } + + // ------------------------------------------------------------------ + // Demo orchestrator + // ------------------------------------------------------------------ + + static final class SemanticCacheDemo { + private final RedisSemanticCache cache; + private final LocalEmbedder embedder; + private final MockLLM llm; + private final String defaultTenant = "acme"; + private final String defaultLocale = "en"; + + SemanticCacheDemo(RedisSemanticCache cache, LocalEmbedder embedder, MockLLM llm) { + this.cache = cache; + this.embedder = embedder; + this.llm = llm; + } + + /** Drop everything in scope and pre-populate with FAQ entries. */ + synchronized int seed() throws Exception { + cache.clear(); + return SeedCache.seed(cache, embedder, + defaultTenant, defaultLocale, llm.modelVersion()); + } + + /** + * The hot path: embed, look up, optionally call the LLM, write back. 
+ * + *

Timings are taken with {@code System.nanoTime()} around + * each bounded step so the UI can display the embed / lookup + * / LLM breakdown separately. The cache write on a miss is + * not included in {@code total_ms} so the latency + * number reflects the user-facing wait, not the background + * bookkeeping. + */ + Map<String, Object> runQuery( + String prompt, + String tenant, + String locale, + String modelVersion, + double threshold, + boolean lookupOnly) throws Exception { + + long t0 = System.nanoTime(); + float[] queryVec = embedder.encodeOne(prompt); + double embedMs = (System.nanoTime() - t0) / 1_000_000.0; + + long t1 = System.nanoTime(); + LookupResult result = cache.lookup( + queryVec, tenant, locale, modelVersion, "ok", threshold); + double lookupMs = (System.nanoTime() - t1) / 1_000_000.0; + + Map<String, Object> payload = new LinkedHashMap<>(); + + if (result instanceof CacheHit hit) { + payload.put("outcome", "hit"); + payload.put("response", hit.response()); + payload.put("entry_id", hit.id()); + payload.put("distance", hit.distance()); + payload.put("ttl_seconds", hit.ttlSeconds()); + payload.put("hit_count", hit.hitCount()); + payload.put("threshold", threshold); + payload.put("embed_ms", embedMs); + payload.put("lookup_ms", lookupMs); + payload.put("llm_ms", null); + payload.put("total_ms", embedMs + lookupMs); + payload.put("tokens_avoided", + estimateResponseTokens(hit.prompt(), hit.response())); + payload.put("ms_avoided", llm.latencyMs()); + return payload; + } + + // Miss path. In "lookup only" mode the demo reports the + // miss without actually calling the LLM — useful for + // sweeping the threshold against a fixed prompt to see + // where the cutoff would fall without polluting the cache.
+ CacheMiss miss = (CacheMiss) result; + if (lookupOnly) { + payload.put("outcome", "miss"); + payload.put("response", "(LLM not called in lookup-only mode)"); + payload.put("nearest_distance", miss.nearestDistance()); + payload.put("threshold", threshold); + payload.put("wrote_entry_id", null); + payload.put("embed_ms", embedMs); + payload.put("lookup_ms", lookupMs); + payload.put("llm_ms", null); + payload.put("total_ms", embedMs + lookupMs); + return payload; + } + + long t2 = System.nanoTime(); + MockLLM.Response llmResponse = llm.complete(prompt); + double llmMs = (System.nanoTime() - t2) / 1_000_000.0; + + // Write the new entry back. The embedding is the same + // vector we already used for the lookup — no need to + // re-encode. + String entryId = cache.put( + prompt, + llmResponse.response(), + queryVec, + tenant, + locale, + modelVersion, + "ok", + null, + null + ); + + payload.put("outcome", "miss"); + payload.put("response", llmResponse.response()); + payload.put("nearest_distance", miss.nearestDistance()); + payload.put("threshold", threshold); + payload.put("wrote_entry_id", entryId); + payload.put("embed_ms", embedMs); + payload.put("lookup_ms", lookupMs); + payload.put("llm_ms", llmMs); + payload.put("total_ms", embedMs + lookupMs + llmMs); + return payload; + } + + private static int estimateResponseTokens(String prompt, String response) { + int len = (prompt == null ? 0 : prompt.length()) + + (response == null ? 
0 : response.length()); + return Math.max(1, len / 4); + } + } + + // ------------------------------------------------------------------ + // HTTP plumbing + // ------------------------------------------------------------------ + + static final class RootHandler implements HttpHandler { + private final RedisSemanticCache cache; + private final LocalEmbedder embedder; + private final MockLLM llm; + private final SemanticCacheDemo demo; + private final String htmlPage; + + RootHandler(RedisSemanticCache cache, LocalEmbedder embedder, + MockLLM llm, SemanticCacheDemo demo, String htmlPage) { + this.cache = cache; + this.embedder = embedder; + this.llm = llm; + this.demo = demo; + this.htmlPage = htmlPage; + } + + @Override + public void handle(HttpExchange ex) throws IOException { + try { + String method = ex.getRequestMethod(); + URI uri = ex.getRequestURI(); + String path = uri.getPath(); + + if ("GET".equalsIgnoreCase(method)) { + if (path.equals("/") || path.equals("/index.html")) { + sendHtml(ex, 200, htmlPage); + return; + } + if (path.equals("/state")) { + sendJson(ex, 200, buildState()); + return; + } + sendJson(ex, 404, errorPayload("not found", null)); + return; + } + if ("POST".equalsIgnoreCase(method)) { + String body = readBody(ex); + Map<String, String> params = parseForm(body); + + if (path.equals("/query")) { + handleQuery(ex, params); + return; + } + if (path.equals("/reset")) { + try { + demo.seed(); + JSONObject ok = new JSONObject(); + ok.put("ok", true); + sendJson(ex, 200, ok); + } catch (Exception inner) { + handleException(ex, inner); + } + return; + } + if (path.equals("/drop")) { + String entryId = params.getOrDefault("entry_id", "").trim(); + if (entryId.isEmpty()) { + sendJson(ex, 400, errorPayload("entry_id is required", null)); + return; + } + boolean deleted = cache.deleteEntry(entryId); + JSONObject out = new JSONObject(); + out.put("deleted", deleted); + out.put("entry_id", entryId); + sendJson(ex, 200, out); + return; + } + sendJson(ex, 404,
errorPayload("not found", null)); + return; + } + sendJson(ex, 405, errorPayload("method not allowed", null)); + } catch (Exception exc) { + handleException(ex, exc); + } + } + + private void handleQuery(HttpExchange ex, Map params) + throws IOException { + String prompt = params.getOrDefault("prompt", "").trim(); + if (prompt.isEmpty()) { + sendJson(ex, 400, errorPayload("prompt is required", null)); + return; + } + double threshold = clampThreshold(params.get("threshold")); + boolean lookupOnly = params.getOrDefault("lookup_only", "").length() > 0; + String tenant = nonEmpty(params.get("tenant"), "acme"); + String locale = nonEmpty(params.get("locale"), "en"); + String modelVersion = nonEmpty(params.get("model_version"), llm.modelVersion()); + + try { + Map payload = demo.runQuery( + prompt, tenant, locale, modelVersion, threshold, lookupOnly); + sendJson(ex, 200, toJson(payload)); + } catch (Exception inner) { + handleException(ex, inner); + } + } + + private JSONObject buildState() { + Map info = cache.indexInfo(); + JSONObject index = new JSONObject(); + index.put("num_docs", info.getOrDefault("num_docs", 0L)); + index.put("indexing_failures", info.getOrDefault("indexing_failures", 0L)); + index.put("vector_index_size_mb", + info.getOrDefault("vector_index_size_mb", 0.0)); + index.put("index_name", cache.indexName()); + index.put("model", embedder.modelName()); + index.put("mock_llm_latency_ms", llm.latencyMs()); + // default_threshold is what the --threshold flag actually + // configures; the UI slider initialises to this on first + // load so the flag visibly changes the demo's behaviour. + // stack_label lets the same HTML render a per-language + // badge without forking the file per language. 
+ index.put("default_threshold", cache.distanceThreshold()); + index.put("stack_label", + "Jedis + DJL (PyTorch + HuggingFace) + Java standard library HTTP server"); + + JSONArray entries = new JSONArray(); + List<Map<String, Object>> rows = cache.listEntries(200); + for (Map<String, Object> row : rows) { + entries.put(toJson(row)); + } + + JSONObject out = new JSONObject(); + out.put("index", index); + out.put("entries", entries); + return out; + } + + private void handleException(HttpExchange ex, Exception exc) { + System.err.println("[demo] handler error: " + + exc.getClass().getSimpleName() + ": " + exc.getMessage()); + exc.printStackTrace(System.err); + try { + JSONObject body = errorPayload( + exc.getMessage() == null ? exc.getClass().getSimpleName() : exc.getMessage(), + exc.getClass().getSimpleName()); + sendJson(ex, 500, body); + } catch (Exception ignored) { + // Headers may already be partially flushed; nothing + // useful left to do beyond letting the connection drop. + } + } + } + + // ------------------------------------------------------------------ + // Helpers + // ------------------------------------------------------------------ + + /** + * Parse a threshold value, clamping NaN/Infinity to {@code 0.5} + * and otherwise clamping to {@code [0.0, 2.0]}. {@code parseDouble} + * accepts the literals {@code "NaN"} and {@code "Infinity"} + * (optionally signed) and returns the matching non-finite value. + * An unclamped {@code NaN} or {@code Infinity} would silently turn + * the lookup into a permanent hit (comparisons against {@code NaN} + * are always {@code false}, so {@code distance > NaN} cannot + * reject), and {@code -Infinity} would turn it into a permanent + * miss; clamping to the meaningful cosine-distance + * range stops a malformed POST from overriding the threshold + * semantics. 
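The NaN behaviour this comment relies on can be checked in isolation. The following is a minimal sketch rather than the demo's own code: the class name `ThresholdClampSketch` and the `rejects` helper are illustrative, while the 0.5 fallback and the `[0.0, 2.0]` range are taken from the comment above.

```java
final class ThresholdClampSketch {

    // Unparseable, NaN, or infinite input falls back to 0.5;
    // everything else is clamped to [0.0, 2.0].
    static double clamp(String raw) {
        double parsed = 0.5;
        if (raw != null && !raw.isEmpty()) {
            try {
                parsed = Double.parseDouble(raw);
            } catch (NumberFormatException ex) {
                parsed = 0.5;
            }
        }
        if (Double.isNaN(parsed) || Double.isInfinite(parsed)) {
            return 0.5;
        }
        return Math.max(0.0, Math.min(2.0, parsed));
    }

    // The miss test a lookup would apply: reject when the candidate
    // is further away than the threshold.
    static boolean rejects(double distance, double threshold) {
        return distance > threshold;
    }

    public static void main(String[] args) {
        // Every ordered comparison against NaN is false, so an
        // unclamped NaN threshold can never reject a candidate.
        System.out.println(rejects(1.9, Double.NaN));
        // After clamping, the same far-away candidate is rejected.
        System.out.println(rejects(1.9, clamp("NaN")));
        // Out-of-range finite values are pulled back into [0.0, 2.0].
        System.out.println(clamp("7.5"));
        System.out.println(clamp("-1"));
    }
}
```

Running `main` prints `false`, `true`, `2.0`, and `0.0` in that order, which is the difference between a threshold that silently accepts everything and one that still filters.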
+ */ + static double clampThreshold(String raw) { + double parsed = 0.5; + if (raw != null && !raw.isEmpty()) { + try { + parsed = Double.parseDouble(raw); + } catch (NumberFormatException ex) { + parsed = 0.5; + } + } + if (Double.isNaN(parsed) || Double.isInfinite(parsed)) return 0.5; + return Math.max(0.0, Math.min(2.0, parsed)); + } + + private static String nonEmpty(String value, String fallback) { + return (value == null || value.isEmpty()) ? fallback : value; + } + + /** + * Cap POST bodies so a runaway client can't accumulate unbounded + * memory before the handler runs. {@code com.sun.net.httpserver} + * provides no built-in limit on request bodies; left unchecked, + * {@code InputStream.readAllBytes()} will read whatever the + * client sends. The demo's largest legitimate body is a few + * hundred bytes of form-encoded query fields; 1 MiB is a + * generous ceiling and matches the Node and Go demos' caps. + */ + private static final int MAX_BODY_BYTES = 1 * 1024 * 1024; + + private static String readBody(HttpExchange ex) throws IOException { + try (InputStream in = ex.getRequestBody()) { + // Read up to MAX_BODY_BYTES + 1 so we can distinguish + // "exactly at the limit" from "too large". 
+ byte[] bytes = in.readNBytes(MAX_BODY_BYTES + 1); + if (bytes.length > MAX_BODY_BYTES) { + throw new IOException( + "request body exceeds " + MAX_BODY_BYTES + " bytes"); + } + return new String(bytes, StandardCharsets.UTF_8); + } + } + + static Map parseForm(String body) { + Map out = new HashMap<>(); + if (body == null || body.isEmpty()) return out; + for (String pair : body.split("&")) { + if (pair.isEmpty()) continue; + int eq = pair.indexOf('='); + String key, value; + if (eq < 0) { + key = URLDecoder.decode(pair, StandardCharsets.UTF_8); + value = ""; + } else { + key = URLDecoder.decode(pair.substring(0, eq), StandardCharsets.UTF_8); + value = URLDecoder.decode(pair.substring(eq + 1), StandardCharsets.UTF_8); + } + out.put(key, value); + } + return out; + } + + private static void sendHtml(HttpExchange ex, int status, String html) throws IOException { + byte[] bytes = html.getBytes(StandardCharsets.UTF_8); + ex.getResponseHeaders().set("Content-Type", "text/html; charset=utf-8"); + ex.sendResponseHeaders(status, bytes.length); + ex.getResponseBody().write(bytes); + ex.getResponseBody().close(); + } + + private static void sendJson(HttpExchange ex, int status, JSONObject body) throws IOException { + byte[] bytes = body.toString().getBytes(StandardCharsets.UTF_8); + ex.getResponseHeaders().set("Content-Type", "application/json"); + ex.sendResponseHeaders(status, bytes.length); + ex.getResponseBody().write(bytes); + ex.getResponseBody().close(); + } + + private static JSONObject errorPayload(String message, String type) { + JSONObject out = new JSONObject(); + out.put("error", message); + if (type != null) out.put("type", type); + return out; + } + + private static JSONObject toJson(Map map) { + JSONObject out = new JSONObject(); + for (Map.Entry entry : map.entrySet()) { + Object value = entry.getValue(); + if (value == null) { + out.put(entry.getKey(), JSONObject.NULL); + } else { + out.put(entry.getKey(), value); + } + } + return out; + } + + private static 
String loadIndexHtml() throws IOException { + // index.html is shipped as a classpath resource (Maven pulls + // it from the project root via the entry in + // pom.xml). Loading from the classpath rather than the + // working directory means `java -jar target/...` works from + // anywhere, not just the project root. + try (InputStream in = + DemoServer.class.getResourceAsStream("/index.html")) { + if (in == null) { + throw new IOException( + "index.html not found on classpath; rebuild with `mvn package`"); + } + return new String(in.readAllBytes(), StandardCharsets.UTF_8); + } + } + + // ------------------------------------------------------------------ + // CLI parsing + // ------------------------------------------------------------------ + + static Args parseArgs(String[] argv) { + Args args = new Args(); + for (int i = 0; i < argv.length; i++) { + String a = argv[i]; + switch (a) { + case "--host": args.host = require(argv, ++i, a); break; + case "--port": args.port = Integer.parseInt(require(argv, ++i, a)); break; + case "--redis-host": args.redisHost = require(argv, ++i, a); break; + case "--redis-port": args.redisPort = Integer.parseInt(require(argv, ++i, a)); break; + case "--index-name": args.indexName = require(argv, ++i, a); break; + case "--key-prefix": args.keyPrefix = require(argv, ++i, a); break; + case "--ttl-seconds": args.ttlSeconds = Long.parseLong(require(argv, ++i, a)); break; + case "--threshold": args.threshold = Double.parseDouble(require(argv, ++i, a)); break; + case "--llm-latency-ms":args.llmLatencyMs = Double.parseDouble(require(argv, ++i, a)); break; + case "--no-reset": args.resetOnStart = false; break; + case "-h": + case "--help": + printHelp(); + System.exit(0); + break; + default: + throw new IllegalArgumentException("Unknown flag: " + a); + } + } + return args; + } + + private static String require(String[] argv, int i, String flag) { + if (i >= argv.length) { + throw new IllegalArgumentException("Missing value for " + flag); + } 
+ return argv[i]; + } + + private static void printHelp() { + System.out.println("Usage: java -jar semantic-cache-jedis.jar [options]"); + System.out.println(" --host HOST HTTP bind host (default 127.0.0.1)"); + System.out.println(" --port PORT HTTP bind port (default 8089)"); + System.out.println(" --redis-host HOST Redis host (default localhost)"); + System.out.println(" --redis-port PORT Redis port (default 6379)"); + System.out.println(" --index-name NAME Redis Search index name (default semcache:idx)"); + System.out.println(" --key-prefix PREFIX Hash key prefix (default cache:)"); + System.out.println(" --ttl-seconds N TTL for cache entries (default 3600)"); + System.out.println(" --threshold F Default cosine-distance cutoff (default 0.5)"); + System.out.println(" --llm-latency-ms F Mock LLM latency (default 1500.0)"); + System.out.println(" --no-reset Keep existing cache instead of re-seeding"); + } +} diff --git a/content/develop/use-cases/semantic-cache/java-jedis/src/main/java/com/redis/semcache/LocalEmbedder.java b/content/develop/use-cases/semantic-cache/java-jedis/src/main/java/com/redis/semcache/LocalEmbedder.java new file mode 100644 index 0000000000..02c1d1d8d6 --- /dev/null +++ b/content/develop/use-cases/semantic-cache/java-jedis/src/main/java/com/redis/semcache/LocalEmbedder.java @@ -0,0 +1,149 @@ +package com.redis.semcache; + +import ai.djl.huggingface.translator.TextEmbeddingTranslatorFactory; +import ai.djl.inference.Predictor; +import ai.djl.repository.zoo.Criteria; +import ai.djl.repository.zoo.ZooModel; +import ai.djl.training.util.ProgressBar; + +import java.nio.ByteBuffer; +import java.nio.ByteOrder; +import java.util.ArrayList; +import java.util.List; + +/** + * Local text-embedding helper backed by DJL + PyTorch. + * + *

This is a thin wrapper around the + * {@code sentence-transformers/all-MiniLM-L6-v2} model loaded from + * DJL's model zoo: a 384-dimensional encoder that runs in-process on + * CPU through libtorch, needs no API key, and produces vectors that + * are numerically very close to the equivalent Python and Node ports + * (close enough that paraphrase distances differ only at the fourth + * decimal place). + * + *

DJL's {@link TextEmbeddingTranslatorFactory} returns mean-pooled, + * L2-normalised vectors by default, so a Redis Search index declared + * with {@code DISTANCE_METRIC COSINE} returns scores that are + * directly comparable across entries. The model is downloaded into + * the local DJL cache on the first call; every later call runs + * offline. + */ +public final class LocalEmbedder implements AutoCloseable { + + private static final String DEFAULT_MODEL_URL = + "djl://ai.djl.huggingface.pytorch/sentence-transformers/all-MiniLM-L6-v2"; + private static final String DEFAULT_MODEL_NAME = + "sentence-transformers/all-MiniLM-L6-v2"; + private static final int DEFAULT_VECTOR_DIM = 384; + + private final String modelName; + private final ZooModel<String, float[]> model; + private final Predictor<String, float[]> predictor; + private final int dim; + + private LocalEmbedder( + String modelName, + ZooModel<String, float[]> model, + Predictor<String, float[]> predictor, + int dim) { + this.modelName = modelName; + this.model = model; + this.predictor = predictor; + this.dim = dim; + } + + /** + * Load the default model. Blocks while DJL downloads the + * PyTorch weights on the first run, then keeps a single loaded + * predictor for the lifetime of the embedder. + */ + public static LocalEmbedder create() throws Exception { + Criteria<String, float[]> criteria = Criteria.builder() + .setTypes(String.class, float[].class) + .optModelUrls(DEFAULT_MODEL_URL) + .optEngine("PyTorch") + .optTranslatorFactory(new TextEmbeddingTranslatorFactory()) + .optProgress(new ProgressBar()) + .build(); + ZooModel<String, float[]> model = criteria.loadModel(); + Predictor<String, float[]> predictor = model.newPredictor(); + // Probe the output shape once so we fail loudly if a + // different model is wired up against the 384-dim Redis + // Search field. 
+ float[] probe = predictor.predict("dimension probe"); + int dim = probe.length; + return new LocalEmbedder(DEFAULT_MODEL_NAME, model, predictor, dim); + } + + public String modelName() { + return modelName; + } + + public int dim() { + return dim; + } + + /** + * Encode a single string. Returns a {@code float[]} of length + * {@link #dim()}. + * + *

The DJL PyTorch {@code Predictor} is not thread-safe: its + * underlying NDManager and tokenizer state mutate per call. The + * demo server uses a cached thread pool, so two browser tabs + * could land on different handler threads and call this method + * concurrently. We {@code synchronized}-guard both encode entry + * points to serialise access to the shared predictor; encoding + * is the bottleneck either way and a single CPU-bound model + * won't usefully run two requests in parallel. A higher- + * throughput deployment would replace this with a small pool + * of {@code Predictor} instances or a dedicated single-threaded + * inference executor. + */ + public synchronized float[] encodeOne(String text) throws Exception { + return predictor.predict(text); + } + + /** Encode several strings sequentially. See {@link #encodeOne} + * for the rationale behind the synchronisation. */ + public synchronized List<float[]> encodeMany(List<String> texts) throws Exception { + List<float[]> out = new ArrayList<>(texts.size()); + for (String text : texts) { + out.add(predictor.predict(text)); + } + return out; + } + + /** + * Pack a {@code float[]} into the bytes Redis Search expects. + * Vectors are little-endian {@code float32}; this matches the + * encoding the Python and Node ports write. 
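The little-endian `float32` layout can be seen on a single value. This is a small sketch with illustrative names (`Float32PackingSketch`, `pack`, `unpack`, `hex`), not the class from the diff; it uses only `java.nio` and adds an unpack step so the round trip can be checked.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

final class Float32PackingSketch {

    // Write each float as 4 bytes, least-significant byte first.
    static byte[] pack(float[] vector) {
        ByteBuffer buf = ByteBuffer.allocate(Float.BYTES * vector.length)
                .order(ByteOrder.LITTLE_ENDIAN);
        for (float f : vector) {
            buf.putFloat(f);
        }
        return buf.array();
    }

    // Inverse of pack: read the bytes back with the same byte order.
    static float[] unpack(byte[] bytes) {
        ByteBuffer buf = ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN);
        float[] out = new float[bytes.length / Float.BYTES];
        for (int i = 0; i < out.length; i++) {
            out[i] = buf.getFloat();
        }
        return out;
    }

    // Render bytes as space-separated hex for inspection.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // 1.0f is 0x3F800000 in IEEE 754; little-endian reverses the bytes.
        System.out.println(hex(pack(new float[] {1.0f})));
    }
}
```

`main` prints `00 00 80 3f`. A 384-dimensional vector packed this way is 1536 bytes, which is the payload size an `embedding` field would carry per entry.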
+ */ + public static byte[] toBytes(float[] vector) { + byte[] bytes = new byte[Float.BYTES * vector.length]; + ByteBuffer + .wrap(bytes) + .order(ByteOrder.LITTLE_ENDIAN) + .asFloatBuffer() + .put(vector); + return bytes; + } + + @Override + public void close() { + try { + predictor.close(); + } catch (Exception ignored) { + // best-effort cleanup + } + try { + model.close(); + } catch (Exception ignored) { + // best-effort cleanup + } + } + + public static int defaultVectorDim() { + return DEFAULT_VECTOR_DIM; + } +} diff --git a/content/develop/use-cases/semantic-cache/java-jedis/src/main/java/com/redis/semcache/LookupResult.java b/content/develop/use-cases/semantic-cache/java-jedis/src/main/java/com/redis/semcache/LookupResult.java new file mode 100644 index 0000000000..141278bb24 --- /dev/null +++ b/content/develop/use-cases/semantic-cache/java-jedis/src/main/java/com/redis/semcache/LookupResult.java @@ -0,0 +1,10 @@ +package com.redis.semcache; + +/** + * Sealed result of a cache lookup. Pattern-matched in the demo + * server to branch between the hit and miss paths; mirrors the + * {@code CacheHit | CacheMiss} union the Python and Node ports + * return. + */ +public sealed interface LookupResult permits CacheHit, CacheMiss { +} diff --git a/content/develop/use-cases/semantic-cache/java-jedis/src/main/java/com/redis/semcache/MockLLM.java b/content/develop/use-cases/semantic-cache/java-jedis/src/main/java/com/redis/semcache/MockLLM.java new file mode 100644 index 0000000000..29cff2776c --- /dev/null +++ b/content/develop/use-cases/semantic-cache/java-jedis/src/main/java/com/redis/semcache/MockLLM.java @@ -0,0 +1,178 @@ +package com.redis.semcache; + +import java.util.List; +import java.util.concurrent.atomic.AtomicLong; + +/** + * Deterministic mock LLM for the semantic-cache demo. + * + *

The point of a semantic cache is to skip an LLM call + * when a prior answer is reusable. To make that visible in a docs + * demo we need an LLM stand-in that: + * + *

<ul> + * <li>takes long enough that the saved time on a cache hit is + * obvious (real-world model calls are 500 ms to several + * seconds);</li> + * <li>responds deterministically so a given prompt always produces + * the same answer, which keeps the demo reproducible;</li> + * <li>exposes an estimated token count so the demo can show the + * saving in "tokens not spent" terms alongside + * latency;</li> + * <li>needs no API keys, no network, no extra dependencies.</li> + * </ul> + * + *

The mock is keyword-matched against a small lookup table of FAQ-style + * answers for a fictional online retailer. Anything that doesn't + * match falls back to a generic templated reply. The + * {@code latencyMs} parameter is the simulated round trip; the + * default (1500 ms) is in the neighbourhood of a real GPT-class + * model on a moderately-sized prompt. + */ +public final class MockLLM { + + private record KnowledgeRow(List<String> keywords, String answer) {} + + private static final List<KnowledgeRow> KNOWLEDGE = List.of( + new KnowledgeRow( + List.of("return", "refund", "exchange"), + "You can return any unworn item within 30 days of delivery for a " + + "full refund. Start a return from your order page; we email a " + + "prepaid label and refund the original payment method within " + + "five business days of receiving the item." + ), + new KnowledgeRow( + List.of("shipping", "delivery", "arrive", "ship"), + "Standard shipping is free on orders over $50 and arrives in " + + "three to five business days. Expedited two-day shipping is " + + "$9.99 and is available at checkout for in-stock items." + ), + new KnowledgeRow( + List.of("size", "sizing", "fit"), + "We follow standard US sizing. For most styles we recommend " + + "ordering your usual size; the product page includes a sizing " + + "chart and customer fit notes for items that run small or large." + ), + new KnowledgeRow( + List.of("warranty", "guarantee", "defect", "broken"), + "All gear is covered by a one-year manufacturer warranty against " + + "defects in materials or workmanship. Email support with your " + + "order number and a photo of the issue and we will replace the " + + "item or issue a refund." + ), + new KnowledgeRow( + List.of("contact", "support", "help", "agent"), + "You can reach our support team by email at help@example.com or " + + "by live chat from the help centre, 9am to 9pm Eastern, seven " + + "days a week. Most tickets get a first reply within two hours." 
+ ), + new KnowledgeRow( + List.of("track", "tracking", "order", "where"), + "Your tracking number is on the order confirmation email and on " + + "the order detail page once the package has been picked up by " + + "the carrier — typically within 24 hours of order placement." + ), + new KnowledgeRow( + List.of("cancel", "modify", "change"), + "Orders can be cancelled or modified for up to one hour after " + + "placement. After that the order has usually entered our " + + "warehouse system; the fastest path is to accept delivery and " + + "start a return for any unwanted items." + ), + new KnowledgeRow( + List.of("discount", "coupon", "promo", "code"), + "Active promotional codes are listed on the homepage banner. " + + "Codes apply at checkout and cannot be combined; the system " + + "automatically uses the larger of the two when more than one " + + "would qualify." + ) + ); + + private static final String FALLBACK_ANSWER = + "Thanks for the question. Our team would normally answer this " + + "individually; in the meantime please check the help centre or " + + "contact support@example.com for a faster response."; + + /** Result of a mock LLM call. */ + public record Response( + String response, + String modelVersion, + double latencyMs, + int promptTokens, + int completionTokens + ) { + public int totalTokens() { + return promptTokens + completionTokens; + } + } + + private final String modelVersion; + private final double latencyMs; + private final AtomicLong callCount = new AtomicLong(); + + public MockLLM() { + this("gpt-4.5-2026", 1500.0); + } + + public MockLLM(String modelVersion, double latencyMs) { + this.modelVersion = modelVersion; + this.latencyMs = latencyMs; + } + + public String modelVersion() { + return modelVersion; + } + + public double latencyMs() { + return latencyMs; + } + + public long callCount() { + return callCount.get(); + } + + /** + * Pretend to call a model. Sleeps for the configured latency, + * then returns a templated answer. 
+ */ + public Response complete(String prompt) { + callCount.incrementAndGet(); + long start = System.nanoTime(); + try { + // Sleep first so the latency is realistic regardless of + // which branch generates the text. + long ms = (long) latencyMs; + int ns = (int) ((latencyMs - ms) * 1_000_000.0); + Thread.sleep(ms, ns); + } catch (InterruptedException ex) { + Thread.currentThread().interrupt(); + } + String response = answerFor(prompt); + double elapsedMs = (System.nanoTime() - start) / 1_000_000.0; + return new Response( + response, + modelVersion, + elapsedMs, + estimateTokens(prompt), + estimateTokens(response) + ); + } + + private static String answerFor(String prompt) { + String lower = prompt.toLowerCase(); + for (KnowledgeRow row : KNOWLEDGE) { + for (String kw : row.keywords()) { + if (lower.contains(kw)) { + return row.answer(); + } + } + } + return FALLBACK_ANSWER; + } + + /** Rough English token estimate: ~4 characters per token. */ + public static int estimateTokens(String text) { + if (text == null || text.isEmpty()) return 0; + return Math.max(1, text.length() / 4); + } +} diff --git a/content/develop/use-cases/semantic-cache/java-jedis/src/main/java/com/redis/semcache/RedisSemanticCache.java b/content/develop/use-cases/semantic-cache/java-jedis/src/main/java/com/redis/semcache/RedisSemanticCache.java new file mode 100644 index 0000000000..a10bb02448 --- /dev/null +++ b/content/develop/use-cases/semantic-cache/java-jedis/src/main/java/com/redis/semcache/RedisSemanticCache.java @@ -0,0 +1,506 @@ +package com.redis.semcache; + +import redis.clients.jedis.AbstractTransaction; +import redis.clients.jedis.JedisPooled; +import redis.clients.jedis.Response; +import redis.clients.jedis.exceptions.JedisDataException; +import redis.clients.jedis.search.Document; +import redis.clients.jedis.search.FTCreateParams; +import redis.clients.jedis.search.IndexDataType; +import redis.clients.jedis.search.Query; +import redis.clients.jedis.search.SearchResult; +import 
redis.clients.jedis.search.schemafields.NumericField; +import redis.clients.jedis.search.schemafields.SchemaField; +import redis.clients.jedis.search.schemafields.TagField; +import redis.clients.jedis.search.schemafields.TextField; +import redis.clients.jedis.search.schemafields.VectorField; +import redis.clients.jedis.search.schemafields.VectorField.VectorAlgorithm; + +import java.util.ArrayList; +import java.util.HashMap; +import java.util.List; +import java.util.Locale; +import java.util.Map; +import java.util.UUID; + +/** + * Redis semantic-cache helper backed by Redis Search. + * + *

Each cache entry lives as a Hash document at + * {@code cache:}. The hash stores the user's prompt and the + * corresponding LLM response alongside the raw float32 bytes of the + * prompt's 384-dimensional embedding and a small set of metadata + * fields — tenant, locale, model version, and a safety flag. + * + *

A single Redis Search index covers the embedding plus every + * metadata field, so one {@code FT.SEARCH} call does an + * approximate-nearest-neighbour lookup against the cached prompts + * with a TAG pre-filter applied in the same pass — no cross-store + * joins, no extra round trips, and tenant isolation is enforced + * inside the query rather than after the fact in + * application code. + * + *
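To make the "filter and rank in one pass" idea concrete, the sketch below builds the kind of query string such a lookup would send. It is an illustration only: `KnnQuerySketch` and its methods are hypothetical names, it handles just two of the metadata fields, and the escape set and the `@embedding` / `$vec` / `distance` names are copied from the class in this diff.

```java
final class KnnQuerySketch {

    // Characters that must be backslash-escaped inside a TAG value.
    private static final String TAG_SPECIAL = "\\,.<>{}[]\"':;!@#$%^&*()-+=~| ";

    static String escapeTag(String value) {
        StringBuilder out = new StringBuilder(value.length());
        for (int i = 0; i < value.length(); i++) {
            char ch = value.charAt(i);
            if (TAG_SPECIAL.indexOf(ch) >= 0) {
                out.append('\\');
            }
            out.append(ch);
        }
        return out.toString();
    }

    // TAG pre-filter in parentheses, then the KNN clause that ranks
    // only the documents the filter lets through.
    static String knnQuery(String tenant, String locale) {
        return "(@tenant:{" + escapeTag(tenant) + "} @locale:{" + escapeTag(locale) + "})"
                + "=>[KNN 1 @embedding $vec AS distance]";
    }

    public static void main(String[] args) {
        System.out.println(knnQuery("acme-co", "en"));
    }
}
```

For tenant `acme-co` and locale `en` this prints `(@tenant:{acme\-co} @locale:{en})=>[KNN 1 @embedding $vec AS distance]`. Because the tenant restriction is part of the query text, a document belonging to another tenant is never a KNN candidate in the first place.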

The lookup is thresholded: {@code FT.SEARCH} always returns the + * closest cached prompt, but the cache only serves it as a hit when + * the cosine distance is at or below {@code distanceThreshold}. + * Anything further away is treated as a miss; the caller is expected + * to run the underlying LLM and write the new prompt, response, and + * embedding back with {@link #put}. + * + *
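The threshold rule is easier to reason about with the distance scale in view. The sketch below is an assumption-labelled illustration, not the cache's code: Redis computes the distance server-side, and the client-side `cosineDistance` here (one minus cosine similarity) exists only to show where values fall. The names `ThresholdDecisionSketch`, `cosineDistance`, and `isHit` are hypothetical.

```java
final class ThresholdDecisionSketch {

    // Cosine distance = 1 - cosine similarity. Assumes both vectors
    // have the same length and are non-zero.
    static double cosineDistance(float[] a, float[] b) {
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return 1.0 - dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // Serve the nearest entry only when it is at or below the cutoff.
    static boolean isHit(double distance, double threshold) {
        return distance <= threshold;
    }

    public static void main(String[] args) {
        float[] query = {1f, 0f};
        // Same direction, orthogonal, and opposite direction.
        System.out.println(cosineDistance(query, new float[] {1f, 0f}));
        System.out.println(cosineDistance(query, new float[] {0f, 1f}));
        System.out.println(cosineDistance(query, new float[] {-1f, 0f}));
        System.out.println(isHit(0.5, 0.5));
    }
}
```

The three distances span the scale from 0 (same direction) through 1 (orthogonal) to 2 (opposite), so a threshold outside that range either rejects or accepts every candidate. The comparison is inclusive, so a candidate exactly at the threshold counts as a hit.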

Each cache entry is written with {@code EXPIRE}, so stale + * answers age out without manual cleanup; combine with an + * {@code allkeys-lfu} eviction policy on the database to cap memory + * under pressure too. + */ +public final class RedisSemanticCache { + + public static final int VECTOR_DIM_DEFAULT = 384; + + /** + * Characters Redis Search treats as syntax inside a TAG value; + * any of them in a user-supplied filter must be backslash-escaped + * or the surrounding {@code {...}} block won't parse correctly. + */ + private static final String TAG_SPECIAL = "\\,.<>{}[]\"':;!@#$%^&*()-+=~| "; + + private final JedisPooled jedis; + private final String indexName; + private final String keyPrefix; + private final int vectorDim; + private final double distanceThreshold; + private final long defaultTtlSeconds; + + public RedisSemanticCache( + JedisPooled jedis, + String indexName, + String keyPrefix, + int vectorDim, + double distanceThreshold, + long defaultTtlSeconds) { + this.jedis = jedis; + this.indexName = indexName; + this.keyPrefix = keyPrefix; + this.vectorDim = vectorDim; + this.distanceThreshold = distanceThreshold; + this.defaultTtlSeconds = defaultTtlSeconds; + } + + public String indexName() { + return indexName; + } + + public String keyPrefix() { + return keyPrefix; + } + + public int vectorDim() { + return vectorDim; + } + + public long defaultTtlSeconds() { + return defaultTtlSeconds; + } + + public double distanceThreshold() { + return distanceThreshold; + } + + // ------------------------------------------------------------------ + // Keys + // ------------------------------------------------------------------ + + public String entryKey(String entryId) { + return keyPrefix + entryId; + } + + // ------------------------------------------------------------------ + // Index management + // ------------------------------------------------------------------ + + /** + * Create the Redis Search index if it doesn't already exist. + * + *

One index covers the embedding plus every metadata field, so + * a single {@code FT.SEARCH} can pre-filter by tenant / locale / + * model and then KNN-rank the matching documents in one pass. + * The {@code prompt} and {@code response} fields are stored as + * {@code TEXT} so admin tooling can grep the cache by content, + * but the cache lookup itself is vector-only. + */ + public void createIndex() { + List schema = List.of( + TextField.of("prompt"), + TextField.of("response"), + TagField.of("tenant"), + TagField.of("locale"), + TagField.of("model_version"), + TagField.of("safety"), + NumericField.of("created_ts").sortable(), + NumericField.of("hit_count").sortable(), + VectorField.builder() + .fieldName("embedding") + .algorithm(VectorAlgorithm.HNSW) + .attributes(Map.of( + "TYPE", "FLOAT32", + "DIM", vectorDim, + "DISTANCE_METRIC", "COSINE" + )) + .build() + ); + try { + jedis.ftCreate( + indexName, + FTCreateParams.createParams() + .on(IndexDataType.HASH) + .addPrefix(keyPrefix), + schema + ); + } catch (JedisDataException ex) { + if (!String.valueOf(ex.getMessage()).contains("Index already exists")) { + throw ex; + } + } + } + + /** Drop the search index. Optionally also delete cached entries. */ + public void dropIndex(boolean deleteDocuments) { + try { + if (deleteDocuments) { + jedis.ftDropIndexDD(indexName); + } else { + jedis.ftDropIndex(indexName); + } + } catch (JedisDataException ex) { + String msg = String.valueOf(ex.getMessage()).toLowerCase(Locale.ROOT); + if (!msg.contains("no such index") && !msg.contains("unknown index name")) { + throw ex; + } + } + } + + // ------------------------------------------------------------------ + // Lookup + // ------------------------------------------------------------------ + + /** + * Find the nearest in-scope cached prompt and decide hit / miss. + * + *

{@code FT.SEARCH} returns the single nearest entry that + * satisfies the TAG pre-filters. The lookup is a hit only if the + * reported cosine distance is at or below + * {@code distanceThreshold} (or the instance default). Anything + * further away is a miss with the candidate distance attached so + * the caller can log it. + * + *

On a hit, the entry's {@code hit_count} is incremented + * atomically with {@code HINCRBY} and the TTL is refreshed inside + * the same {@code MULTI/EXEC} so a frequently used answer doesn't + * age out under cold tail entries. + */ + public LookupResult lookup( + float[] queryVec, + String tenant, + String locale, + String modelVersion, + String safety, + Double distanceThreshold) { + + // Match the shape check that `put` performs. A wrong-dim + // vector would otherwise hit Redis as a malformed FT.SEARCH + // parameter and surface as a server-side parse error instead + // of a clear caller-side error. + if (queryVec.length != vectorDim) { + throw new IllegalArgumentException( + "queryVec length is " + queryVec.length + + "; index expects " + vectorDim + ); + } + + double threshold = distanceThreshold != null + ? distanceThreshold : this.distanceThreshold; + + String filterClause = buildFilterClause(tenant, locale, modelVersion, safety); + String knnQuery = filterClause + "=>[KNN 1 @embedding $vec AS distance]"; + byte[] vecBytes = LocalEmbedder.toBytes(queryVec); + + Query q = new Query(knnQuery) + .returnFields("prompt", "response", "tenant", "locale", + "model_version", "hit_count", "distance") + .setSortBy("distance", true) + .limit(0, 1) + .addParam("vec", vecBytes) + .dialect(2); + + SearchResult result = jedis.ftSearch(indexName, q); + List docs = result.getDocuments(); + if (docs.isEmpty()) { + return new CacheMiss(null, null); + } + + Document doc = docs.get(0); + String rawKey = doc.getId(); + String entryId = rawKey.startsWith(keyPrefix) + ? rawKey.substring(keyPrefix.length()) : rawKey; + double distance = parseDouble(doc.get("distance"), 0.0); + + if (distance > threshold) { + return new CacheMiss(distance, entryId); + } + + // The hash may have expired between FT.SEARCH returning the + // row and us getting here — the search index lags expirations + // by its periodic scan. 
If we just blindly HINCRBY-ed, Redis + // would helpfully recreate the hash with only `hit_count` + // set and the search index would then log it as an indexing + // failure (no embedding, no metadata). EXISTS narrows that + // race to the gap between the check and the transaction; a + // strictly race-free version would wrap the bump in a Lua + // script that checks existence and acts in one server-side + // step. + String entryKey = entryKey(entryId); + if (!jedis.exists(entryKey)) { + return new CacheMiss(distance, entryId); + } + + // MULTI/EXEC the three commands so they apply as a unit on the + // server; a partial failure between HINCRBY and EXPIRE would + // otherwise leave the entry without a refreshed TTL. + long newHitCount; + long ttl; + try (AbstractTransaction tx = jedis.multi()) { + Response<Long> hincrResp = tx.hincrBy(entryKey, "hit_count", 1); + tx.expire(entryKey, defaultTtlSeconds); + Response<Long> ttlResp = tx.ttl(entryKey); + tx.exec(); + newHitCount = hincrResp.get(); + ttl = ttlResp.get(); + } + + return new CacheHit( + entryId, + nullSafe(doc.getString("prompt")), + nullSafe(doc.getString("response")), + nullSafe(doc.getString("tenant")), + nullSafe(doc.getString("locale")), + nullSafe(doc.getString("model_version")), + distance, + ttl > 0 ? ttl : defaultTtlSeconds, + newHitCount + ); + } + + // ------------------------------------------------------------------ + // Write + // ------------------------------------------------------------------ + + /** + * Write a new cache entry and return its id. + * + *

The embedding is stored as raw little-endian float32 bytes — + * the encoding Redis Search expects from a {@code FLOAT32} vector + * field. {@code EXPIRE} on the key gives every entry a bounded + * lifetime; combine with an {@code allkeys-lfu} eviction policy + * on the database to cap memory under pressure too. + */ + public String put( + String prompt, + String response, + float[] embedding, + String tenant, + String locale, + String modelVersion, + String safety, + Long ttlSeconds, + String entryId) { + + if (embedding.length != vectorDim) { + throw new IllegalArgumentException( + "embedding length is " + embedding.length + + "; index expects " + vectorDim + ); + } + + String id = (entryId == null || entryId.isEmpty()) + ? UUID.randomUUID().toString().replace("-", "").substring(0, 12) + : entryId; + String key = entryKey(id); + long ttl = ttlSeconds != null ? ttlSeconds : defaultTtlSeconds; + byte[] vecBytes = LocalEmbedder.toBytes(embedding); + + byte[] keyBytes = key.getBytes(java.nio.charset.StandardCharsets.UTF_8); + + // Build the byte-keyed hash mapping for the embedding field + // (binary) and pair it with the textual fields. We have to + // use the byte[] HSET overload for `embedding` because the + // string variant would corrupt the float bytes; the textual + // metadata can ride in the same multi-field byte[] HSET. + Map mapping = new HashMap<>(); + putUtf8(mapping, "prompt", prompt); + putUtf8(mapping, "response", response); + putUtf8(mapping, "tenant", tenant); + putUtf8(mapping, "locale", locale); + putUtf8(mapping, "model_version", modelVersion); + putUtf8(mapping, "safety", safety); + putUtf8(mapping, "created_ts", String.format(Locale.ROOT, "%.6f", + System.currentTimeMillis() / 1000.0)); + putUtf8(mapping, "hit_count", "0"); + mapping.put("embedding".getBytes(java.nio.charset.StandardCharsets.UTF_8), vecBytes); + + // MULTI/EXEC so HSET and EXPIRE either both apply or neither + // does. 
Without the transaction wrapper a connection drop
+        // between the two writes could leave the entry without a TTL
+        // and the cache would then keep an answer past its intended
+        // lifetime (or forever, on a database with no eviction
+        // policy).
+        try (AbstractTransaction tx = jedis.multi()) {
+            tx.hset(keyBytes, mapping);
+            tx.expire(keyBytes, ttl);
+            tx.exec();
+        }
+        return id;
+    }
+
+    private static void putUtf8(Map<byte[], byte[]> mapping, String field, String value) {
+        if (value == null) value = "";
+        mapping.put(
+            field.getBytes(java.nio.charset.StandardCharsets.UTF_8),
+            value.getBytes(java.nio.charset.StandardCharsets.UTF_8)
+        );
+    }
+
+    // ------------------------------------------------------------------
+    // Filter clause
+    // ------------------------------------------------------------------
+
+    static String escapeTagValue(String value) {
+        StringBuilder out = new StringBuilder(value.length());
+        for (int i = 0; i < value.length(); i++) {
+            char ch = value.charAt(i);
+            if (TAG_SPECIAL.indexOf(ch) >= 0) {
+                out.append('\\');
+            }
+            out.append(ch);
+        }
+        return out.toString();
+    }
+
+    static String buildFilterClause(
+            String tenant, String locale, String modelVersion, String safety) {
+        List<String> clauses = new ArrayList<>(4);
+        if (tenant != null && !tenant.isEmpty()) {
+            clauses.add("@tenant:{" + escapeTagValue(tenant) + "}");
+        }
+        if (locale != null && !locale.isEmpty()) {
+            clauses.add("@locale:{" + escapeTagValue(locale) + "}");
+        }
+        if (modelVersion != null && !modelVersion.isEmpty()) {
+            clauses.add("@model_version:{" + escapeTagValue(modelVersion) + "}");
+        }
+        if (safety != null && !safety.isEmpty()) {
+            clauses.add("@safety:{" + escapeTagValue(safety) + "}");
+        }
+        if (clauses.isEmpty()) return "(*)";
+        return "(" + String.join(" ", clauses) + ")";
+    }
+
+    // ------------------------------------------------------------------
+    // Inspection / admin
+    // ------------------------------------------------------------------
+
+    /** Subset of {@code FT.INFO} useful
for the demo UI. */
+    public Map<String, Object> indexInfo() {
+        Map<String, Object> out = new HashMap<>();
+        out.put("num_docs", 0L);
+        out.put("indexing_failures", 0L);
+        out.put("vector_index_size_mb", 0.0);
+        try {
+            Map<String, Object> info = jedis.ftInfo(indexName);
+            out.put("num_docs", parseLong(info.get("num_docs"), 0L));
+            out.put("indexing_failures",
+                parseLong(info.get("hash_indexing_failures"), 0L));
+            out.put("vector_index_size_mb",
+                parseDouble(info.get("vector_index_sz_mb"), 0.0));
+        } catch (JedisDataException ignored) {
+            // index does not exist
+        }
+        return out;
+    }
+
+    /** Return every cached entry (no embedding) for the admin UI. */
+    public List<Map<String, Object>> listEntries(int limit) {
+        Query q = new Query("*")
+            .returnFields("prompt", "response", "tenant", "locale",
+                "model_version", "safety", "created_ts", "hit_count")
+            .limit(0, limit)
+            .setSortBy("created_ts", false)
+            .dialect(2);
+
+        List<Map<String, Object>> out = new ArrayList<>();
+        SearchResult result = jedis.ftSearch(indexName, q);
+        for (Document doc : result.getDocuments()) {
+            String rawKey = doc.getId();
+            String entryId = rawKey.startsWith(keyPrefix)
+                ? rawKey.substring(keyPrefix.length()) : rawKey;
+            long ttl = jedis.ttl(entryKey(entryId));
+            Map<String, Object> row = new HashMap<>();
+            row.put("id", entryId);
+            row.put("prompt", nullSafe(doc.getString("prompt")));
+            row.put("response", nullSafe(doc.getString("response")));
+            row.put("tenant", nullSafe(doc.getString("tenant")));
+            row.put("locale", nullSafe(doc.getString("locale")));
+            row.put("model_version", nullSafe(doc.getString("model_version")));
+            row.put("safety", nullSafe(doc.getString("safety")));
+            row.put("hit_count", parseLong(doc.get("hit_count"), 0L));
+            row.put("ttl_seconds", ttl > 0 ? ttl : 0L);
+            row.put("created_ts", parseDouble(doc.get("created_ts"), 0.0));
+            out.add(row);
+        }
+        return out;
+    }
+
+    /** Drop a single entry. Returns {@code true} if the key existed.
*/ + public boolean deleteEntry(String entryId) { + return jedis.del(entryKey(entryId)) > 0; + } + + /** + * Drop the index and every cached entry, then re-create the + * index. Returns the count of entries that were removed. + */ + public long clear() { + long before = (long) indexInfo().getOrDefault("num_docs", 0L); + dropIndex(true); + createIndex(); + return before; + } + + // ------------------------------------------------------------------ + // Helpers + // ------------------------------------------------------------------ + + private static String nullSafe(String s) { + return s == null ? "" : s; + } + + private static double parseDouble(Object value, double dflt) { + if (value == null) return dflt; + try { + return Double.parseDouble(value.toString()); + } catch (NumberFormatException ex) { + return dflt; + } + } + + private static long parseLong(Object value, long dflt) { + if (value == null) return dflt; + if (value instanceof Number n) return n.longValue(); + try { + return Long.parseLong(value.toString()); + } catch (NumberFormatException ex) { + try { + return (long) Double.parseDouble(value.toString()); + } catch (NumberFormatException ignored) { + return dflt; + } + } + } +} diff --git a/content/develop/use-cases/semantic-cache/java-jedis/src/main/java/com/redis/semcache/SeedCache.java b/content/develop/use-cases/semantic-cache/java-jedis/src/main/java/com/redis/semcache/SeedCache.java new file mode 100644 index 0000000000..4615a483cc --- /dev/null +++ b/content/develop/use-cases/semantic-cache/java-jedis/src/main/java/com/redis/semcache/SeedCache.java @@ -0,0 +1,101 @@ +package com.redis.semcache; + +import java.util.ArrayList; +import java.util.List; + +/** + * Pre-seed the semantic cache with a handful of FAQ answers. + * + *

In a real deployment the cache fills up organically as users + * ask questions: a first-time question is a miss, the LLM answers, + * and the response is written back. To make the demo immediately + * useful — so the first query you type lands on a hit instead of a + * cold miss — we seed a small set of canonical prompts and their + * answers at startup. + * + *

The seed list mirrors the keyword table in {@link MockLLM} but
+ * stores the canonical phrasing of each question.
+ * Paraphrases of any of these prompts ("How do I return an item?",
+ * "Can I get a refund?") embed close to the canonical entry and the
+ * cache lookup serves the stored response without ever calling the
+ * model.
+ */
+public final class SeedCache {
+
+    public record SeedEntry(String prompt, String response) {}
+
+    public static final List<SeedEntry> SEED_ENTRIES = List.of(
+        new SeedEntry(
+            "What is your return policy?",
+            "You can return any unworn item within 30 days of delivery for "
+            + "a full refund. Start a return from your order page; we email "
+            + "a prepaid label and refund the original payment method within "
+            + "five business days of receiving the item."
+        ),
+        new SeedEntry(
+            "How long does shipping take?",
+            "Standard shipping is free on orders over $50 and arrives in "
+            + "three to five business days. Expedited two-day shipping is "
+            + "$9.99 and is available at checkout for in-stock items."
+        ),
+        new SeedEntry(
+            "How do I find my size?",
+            "We follow standard US sizing. For most styles we recommend "
+            + "ordering your usual size; the product page includes a sizing "
+            + "chart and customer fit notes for items that run small or "
+            + "large."
+        ),
+        new SeedEntry(
+            "Is there a warranty on your products?",
+            "All gear is covered by a one-year manufacturer warranty "
+            + "against defects in materials or workmanship. Email support "
+            + "with your order number and a photo of the issue and we will "
+            + "replace the item or issue a refund."
+        ),
+        new SeedEntry(
+            "How can I contact customer support?",
+            "You can reach our support team by email at help@example.com "
+            + "or by live chat from the help centre, 9am to 9pm Eastern, "
+            + "seven days a week. Most tickets get a first reply within two "
+            + "hours."
+        ),
+        new SeedEntry(
+            "Where is my order?",
+            "Your tracking number is on the order confirmation email and "
+            + "on the order detail page once the package has been picked up "
+            + "by the carrier — typically within 24 hours of order "
+            + "placement."
+        )
+    );
+
+    private SeedCache() {}
+
+    /** Embed and write the seed list. Returns the number of entries seeded. */
+    public static int seed(
+            RedisSemanticCache cache,
+            LocalEmbedder embedder,
+            String tenant,
+            String locale,
+            String modelVersion) throws Exception {
+
+        List<String> prompts = new ArrayList<>(SEED_ENTRIES.size());
+        for (SeedEntry entry : SEED_ENTRIES) prompts.add(entry.prompt());
+        List<float[]> vectors = embedder.encodeMany(prompts);
+
+        for (int i = 0; i < SEED_ENTRIES.size(); i++) {
+            SeedEntry entry = SEED_ENTRIES.get(i);
+            cache.put(
+                entry.prompt(),
+                entry.response(),
+                vectors.get(i),
+                tenant,
+                locale,
+                modelVersion,
+                "ok",
+                null,
+                null
+            );
+        }
+        return SEED_ENTRIES.size();
+    }
+}
diff --git a/content/develop/use-cases/semantic-cache/java-lettuce/.gitignore b/content/develop/use-cases/semantic-cache/java-lettuce/.gitignore
new file mode 100644
index 0000000000..c048217367
--- /dev/null
+++ b/content/develop/use-cases/semantic-cache/java-lettuce/.gitignore
@@ -0,0 +1,5 @@
+# Maven build output: ~25 MB shaded jar plus intermediates. The
+# DJL native libraries and the all-MiniLM-L6-v2 model weights land
+# in ~/.djl.ai/ on first run, not here, so no extra ignore is needed
+# for those.
+target/
diff --git a/content/develop/use-cases/semantic-cache/java-lettuce/README.md b/content/develop/use-cases/semantic-cache/java-lettuce/README.md
new file mode 100644
index 0000000000..8320ba41a9
--- /dev/null
+++ b/content/develop/use-cases/semantic-cache/java-lettuce/README.md
@@ -0,0 +1,33 @@
+# Redis semantic-cache demo (Java + Lettuce)
+
+See `_index.md` for the full walkthrough. Quick start:
+
+```bash
+# 1. Make sure Redis with the Search module is running on localhost:6379.
+# 2.
Build the fat jar (first build pulls Lettuce, DJL, and the PyTorch
+#    native libraries; takes a minute or two):
+mvn -q package
+
+# 3. Run. The first run downloads the sentence-transformers/all-MiniLM-L6-v2
+#    PyTorch weights into the local DJL cache (~90 MB).
+java -jar target/semantic-cache-lettuce.jar
+
+# Or with Maven directly:
+mvn -q exec:java
+```
+
+Then open <http://localhost:8090>.
+
+Notable flags (full list with `--help`):
+
+| Flag                      | Default               |
+|---------------------------|-----------------------|
+| `--port`                  | `8090`                |
+| `--redis-host`            | `localhost`           |
+| `--redis-port`            | `6379`                |
+| `--index-name`            | `semcache:idx`        |
+| `--key-prefix`            | `cache:`              |
+| `--ttl-seconds`           | `3600`                |
+| `--threshold`             | `0.5`                 |
+| `--llm-latency-ms`        | `1500.0`              |
+| `--no-reset`              | (re-seeds by default) |
diff --git a/content/develop/use-cases/semantic-cache/java-lettuce/_index.md b/content/develop/use-cases/semantic-cache/java-lettuce/_index.md
new file mode 100644
index 0000000000..bd49059b5d
--- /dev/null
+++ b/content/develop/use-cases/semantic-cache/java-lettuce/_index.md
@@ -0,0 +1,284 @@
+---
+categories:
+- docs
+- develop
+- stack
+- oss
+- rs
+- rc
+description: Build a Redis-backed semantic cache for LLM responses in Java with Lettuce and DJL (PyTorch)
+linkTitle: Lettuce example (Java)
+title: Redis semantic cache with Lettuce
+weight: 4
+---
+
+This guide shows you how to build a small Redis-backed semantic cache for LLM responses in Java with [Lettuce]({{< relref "/develop/clients/lettuce" >}}) and [DJL (Deep Java Library)](https://djl.ai/) running the [`sentence-transformers/all-MiniLM-L6-v2`](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) encoder locally on PyTorch. It includes a local web server built with the JDK's standard `com.sun.net.httpserver.HttpServer` so you can send paraphrased prompts at a mock LLM, watch the cache decide hit or miss, sweep the cosine-distance threshold, and see the cumulative latency and token savings build up.
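As a rough illustration of the hit-or-miss decision the guide revolves around, the sketch below computes a cosine distance in plain Java and applies an "at or below the threshold" rule. `ThresholdSketch`, `cosineDistance`, and `isHit` are hypothetical names used only for this illustration and are not part of the demo sources; in the demo itself, Redis Search computes the distance inside `FT.SEARCH` and the helper only compares it with the threshold.

```java
/**
 * Hypothetical sketch (not part of the demo sources): the thresholded
 * hit-or-miss policy applied to a cosine distance.
 */
final class ThresholdSketch {

    /** Cosine distance: 0 = same direction, 1 = orthogonal, 2 = opposite. */
    static double cosineDistance(float[] a, float[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException(
                "dimension mismatch: " + a.length + " vs " + b.length);
        }
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += (double) a[i] * b[i];
            normA += (double) a[i] * a[i];
            normB += (double) b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            throw new IllegalArgumentException(
                "cosine distance is undefined for a zero vector");
        }
        return 1.0 - dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    /** A candidate is served as a hit when its distance is at or below the threshold. */
    static boolean isHit(double distance, double threshold) {
        return distance <= threshold;
    }

    public static void main(String[] args) {
        // Assumed example distance for a loose paraphrase.
        double distance = 0.49;
        System.out.println("threshold 0.5 -> hit? " + isHit(distance, 0.5));
        System.out.println("threshold 0.4 -> hit? " + isHit(distance, 0.4));
    }
}
```

Keeping the comparison in application code is what makes the threshold sweepable per request: the same nearest candidate can be a hit at 0.5 and a miss at 0.4 without changing anything stored in Redis.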
+
+## Overview
+
+Each cache entry is stored as a single Redis [Hash]({{< relref "/develop/data-types/hashes" >}}) at `cache:<id>`. The hash holds the original prompt, the LLM's response, the raw `float32` bytes of a 384-dimensional embedding of the prompt, and metadata fields — tenant, locale, model version, safety flag — plus a `created_ts` and a `hit_count`. A single [Redis Search]({{< relref "/develop/ai/search-and-query" >}}) index covers the embedding field and every metadata field, so one [`FT.SEARCH`]({{< relref "/commands/ft.search" >}}) call with a `KNN` clause does the vector lookup *and* the TAG pre-filter in the same round trip — no cross-store joins.
+
+The lookup is thresholded: [`FT.SEARCH`]({{< relref "/commands/ft.search" >}}) always returns the nearest entry that satisfies the filters, but the application only serves it as a hit when the reported cosine distance is at or below `distanceThreshold`. Anything further away is treated as a miss; the caller runs the LLM and writes the new prompt, response, and embedding back to the same key pattern with a TTL.
+
+The embedder is [DJL](https://djl.ai/) loading the [`sentence-transformers/all-MiniLM-L6-v2`](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) PyTorch model from the DJL model zoo. This is the same 384-dimensional encoder the [Python example]({{< relref "/develop/use-cases/semantic-cache/redis-py" >}}), the [Node.js example]({{< relref "/develop/use-cases/semantic-cache/nodejs" >}}), and the [Jedis example]({{< relref "/develop/use-cases/semantic-cache/java-jedis" >}}) use. Embeddings produced by the four implementations are semantically equivalent — paraphrase distances differ only at the fourth decimal place — so a cache populated by one demo can be queried by another against the same Redis instance.
+
+That gives you:
+
+* A single round trip for lookup — vector KNN + metadata pre-filter in one [`FT.SEARCH`]({{< relref "/commands/ft.search" >}}).
+* Tens of milliseconds on a hit vs. a multi-second LLM call on a miss; the embedding step is the bottleneck either way, and that's a model-side cost, not a Redis one. +* Tenant, locale, and model-version isolation enforced inside the query, not in application code — a write under one tenant cannot be served to another. +* Bounded memory: every entry has an [`EXPIRE`]({{< relref "/commands/expire" >}}) TTL, and a database-level [eviction policy]({{< relref "/develop/reference/eviction" >}}) (LRU / LFU) caps the cache size under pressure. + +## How it works + +A query goes through three stages: **embed**, **lookup**, and (on a miss) **call the LLM and write back**. + +### Hit path (the goal) + +1. The application calls `embedder.encodeOne(prompt)` to turn the incoming text into a 384-dimensional `float[]`. +2. `cache.lookup(queryVec, tenant, locale, modelVersion, "ok", threshold)` runs [`FT.SEARCH`]({{< relref "/commands/ft.search" >}}) with a TAG pre-filter and a `KNN 1` clause. Redis returns the closest cached prompt that satisfies the filters along with its cosine distance. +3. If the distance is at or below the threshold, the cache returns a `CacheHit` containing the cached response. The helper also runs an [`HINCRBY`]({{< relref "/commands/hincrby" >}}) on `hit_count` and an [`EXPIRE`]({{< relref "/commands/expire" >}}) refresh inside a [`MULTI/EXEC`]({{< relref "/commands/multi" >}}), so a frequently used answer keeps its TTL and the demo UI can see which entries are load-bearing. +4. The LLM is not called at all. The application returns the cached response to the user. + +### Miss path + +When the distance is above the threshold — or there is no candidate in scope at all — the helper returns a `CacheMiss` instead, carrying the distance of the nearest candidate (if any) for logging. The application then: + +1. Calls the LLM with the prompt. +2. Calls `cache.put(prompt, response, embedding, tenant, locale, modelVersion, ...)`. 
The same embedding the lookup used is reused — no re-encode. The helper writes the Hash with [`HSET`]({{< relref "/commands/hset" >}}) and an [`EXPIRE`]({{< relref "/commands/expire" >}}) TTL inside a single [`MULTI/EXEC`]({{< relref "/commands/multi" >}}) so the entry never lands without a TTL on a partial failure.
+3. Returns the LLM's response to the user. The next semantically similar prompt under the same metadata scope will be a hit.
+
+## The cache helper
+
+The `RedisSemanticCache` class wraps the Redis Search index and the lookup / write flow
+([source](https://github.com/redis/docs/blob/main/content/develop/use-cases/semantic-cache/java-lettuce/src/main/java/com/redis/semcache/RedisSemanticCache.java)):
+
+```java
+import io.lettuce.core.ClientOptions;
+import io.lettuce.core.RedisClient;
+import io.lettuce.core.api.StatefulRedisConnection;
+import io.lettuce.core.codec.ByteArrayCodec;
+import io.lettuce.core.protocol.ProtocolVersion;
+import com.redis.semcache.RedisSemanticCache;
+import com.redis.semcache.LocalEmbedder;
+import com.redis.semcache.LookupResult;
+import com.redis.semcache.CacheHit;
+
+RedisClient client = RedisClient.create("redis://localhost:6379");
+client.setOptions(ClientOptions.builder()
+        .protocolVersion(ProtocolVersion.RESP2)
+        .build());
+StatefulRedisConnection<byte[], byte[]> connection =
+        client.connect(ByteArrayCodec.INSTANCE);
+
+LocalEmbedder embedder = LocalEmbedder.create();  // sentence-transformers/all-MiniLM-L6-v2
+
+RedisSemanticCache cache = new RedisSemanticCache(
+        connection,
+        "semcache:idx",
+        "cache:",
+        384,
+        0.5,     // cosine distance, lower = stricter
+        3600     // TTL in seconds (one hour)
+);
+
+// One-time index setup (idempotent).
+cache.createIndex();
+
+// 1) Embed the prompt.
+String prompt = "How do I return an item?";
+float[] queryVec = embedder.encodeOne(prompt);
+
+// 2) Look up under a metadata scope. The TAG filter and the KNN
+//    travel together in one FT.SEARCH.
+LookupResult result = cache.lookup( + queryVec, "acme", "en", "gpt-4.5-2026", "ok", null); + +String response; +if (result instanceof CacheHit hit) { + response = hit.response(); + System.out.printf("hit (%.3f): %s%n", hit.distance(), response); +} else { + // 3a) Miss — call the LLM. (Use your real client here.) + response = callLlm(prompt); + + // 3b) Cache the new entry. Reuses the same embedding bytes the + // lookup used, so we don't pay the encoder twice. + cache.put( + prompt, + response, + queryVec, + "acme", + "en", + "gpt-4.5-2026", + "ok", + null, // ttl override (null = default) + null // entry id (null = generated) + ); +} +``` + +The connection uses Lettuce's `ByteArrayCodec` so the binary float-32 bytes of the embedding can share an `HSET` mapping with the UTF-8 text fields without a second connection. RESP2 is pinned explicitly: Lettuce 6.7 negotiates RESP3 by default, which returns the [`FT.SEARCH`]({{< relref "/commands/ft.search" >}}) reply as a map keyed by `results` / `total_results` instead of the flat alternating list the demo's parser expects, and pinning RESP2 keeps the wire format identical to what the Python, Node, and Jedis ports speak. + +### Data model + +Each cache entry is one Redis Hash. The vector field is raw little-endian `float32` bytes — no JSON wrapping — because the Redis Search vector encoding expects exactly that. The helper packs the `float[]` with a `ByteBuffer` in `ByteOrder.LITTLE_ENDIAN`, which matches the bytes Redis Search reads and is identical to the encoding the Python and Node ports write. + +```text +cache:7c3f8a1b9e02 + prompt=How do I return an item? + response=You can return any unworn item within 30 days... 
+  tenant=acme
+  locale=en
+  model_version=gpt-4.5-2026
+  safety=ok
+  created_ts=1715990400.123
+  hit_count=4
+  embedding=<384 × float32 little-endian bytes>
+```
+
+The Redis Search index schema treats every field as queryable in its natural type:
+
+```text
+FT.CREATE semcache:idx
+  ON HASH PREFIX 1 cache:
+  SCHEMA
+    prompt         TEXT
+    response       TEXT
+    tenant         TAG
+    locale         TAG
+    model_version  TAG
+    safety         TAG
+    created_ts     NUMERIC SORTABLE
+    hit_count      NUMERIC SORTABLE
+    embedding      VECTOR HNSW 6 TYPE FLOAT32 DIM 384 DISTANCE_METRIC COSINE
+```
+
+### The query
+
+The lookup is a hybrid query: a TAG pre-filter expression in parentheses, then `=>[KNN 1 @embedding $vec]`. With `DIALECT 2`, Redis applies the filter first and KNN-ranks only the matching documents. Lettuce 6.7 doesn't yet ship first-class [`FT.SEARCH`]({{< relref "/commands/ft.search" >}}) bindings, so the helper sends the command through `RedisCommands.dispatch()` with a custom `ProtocolKeyword` — the same machinery Lettuce uses internally for any command it does support natively, just spelled out by the caller. The wire bytes are identical to typing the command in `redis-cli`:
+
+```java
+CommandArgs<byte[], byte[]> args = new CommandArgs<>(ByteArrayCodec.INSTANCE)
+        .add(indexNameBytes)
+        .add("(@tenant:{acme} @locale:{en} @model_version:{gpt\\-4\\.5\\-2026} @safety:{ok})"
+             + "=>[KNN 1 @embedding $vec AS distance]")
+        .add("RETURN").add(7)
+            .add("prompt").add("response").add("tenant").add("locale")
+            .add("model_version").add("hit_count").add("distance")
+        .add("SORTBY").add("distance").add("ASC")
+        .add("LIMIT").add(0).add(1)
+        .add("PARAMS").add(2).add("vec".getBytes()).add(LocalEmbedder.toBytes(queryVec))
+        .add("DIALECT").add(2);
+
+List<Object> raw = sync.dispatch(
+        FtCommand.FT_SEARCH,
+        new NestedMultiOutput<>(ByteArrayCodec.INSTANCE),
+        args
+);
+```
+
+`distance` is the cosine *distance* (0 means identical, 2 means opposite). The result is sorted ascending, so the top row is the closest candidate.
The application inspects `distance` against the threshold and decides hit or miss in user code — Redis returns the row either way, and treating it as a hit or a miss is a policy decision the cache helper owns, not a server-side filter. + +## The mock LLM + +To make the latency and token savings visible without requiring an API key, `MockLLM.java` provides a deterministic stand-in +([source](https://github.com/redis/docs/blob/main/content/develop/use-cases/semantic-cache/java-lettuce/src/main/java/com/redis/semcache/MockLLM.java)): + +```java +import com.redis.semcache.MockLLM; + +MockLLM llm = new MockLLM("gpt-4.5-2026", 1500.0); +MockLLM.Response response = llm.complete("What is your return policy?"); +// response.response() — the templated answer text +// response.latencyMs() — wall-clock time the call took +// response.totalTokens() — estimated prompt + completion tokens +``` + +The mock sleeps for the configured latency, then keyword-matches against a small FAQ table to produce an answer. The deliberate slowness is what makes a hit visibly cheaper than a miss in the demo. In production code, you would replace `MockLLM` with your real client of choice — an HTTP call to OpenAI, Anthropic, a self-hosted vLLM endpoint, anything — without changing the cache helper. + +## Pre-seeding the cache + +In a real deployment the cache fills up organically: a first-time question is a miss, the LLM answers, and the response is written back. For the demo, `SeedCache.java` pre-loads a small set of canonical FAQ prompts so the very first query lands on a hit +([source](https://github.com/redis/docs/blob/main/content/develop/use-cases/semantic-cache/java-lettuce/src/main/java/com/redis/semcache/SeedCache.java)): + +```java +import com.redis.semcache.SeedCache; + +cache.createIndex(); +SeedCache.seed(cache, embedder, "acme", "en", "gpt-4.5-2026"); +``` + +The seed list stores the canonical phrasing of each question ("What is your return policy?"). 
Paraphrases of any of these prompts ("How do I return an item?", "Can I get a refund?") embed close to the canonical entry, so the cache lookup serves the stored response without ever calling the model. + +## The interactive demo + +`DemoServer.java` runs an HTTP server built on the JDK's `com.sun.net.httpserver.HttpServer` — no Spring, no Jetty, no embedded framework. The HTML page lets you: + +* Type a prompt and toggle metadata: tenant, locale, model version. Each combination is a separate cache namespace inside the same index. +* Slide the cosine-distance threshold and see hits flip to misses (and back) on the same prompt, with the actual distance reported on each query. +* Submit with **Ask** to run the full hit-or-miss path (calls the LLM on a miss, writes the answer back). Submit with **Lookup only (no LLM)** to sweep the threshold against a fixed prompt without polluting the cache. +* Watch the cumulative panel build up: total queries, cache hits, cache misses, hit ratio, tokens not spent, LLM milliseconds not waited. +* Inspect every cached entry, including remaining TTL and total hit count, and drop individual entries to simulate eviction. + +The server holds one `LocalEmbedder`, one `RedisSemanticCache`, and one `MockLLM` for the lifetime of the process. The HTML page is shared with the Python, Node.js, Go, and Jedis demos; the build embeds `index.html` from the project root as a classpath resource so the jar runs from any working directory. Endpoints: + +| Endpoint | What it does | +|-----------------|-------------------------------------------------------------------------------| +| `GET /state` | Index info and the full list of cached entries. | +| `POST /query` | Embed the prompt, run `FT.SEARCH`, on miss call the LLM and write back. | +| `POST /reset` | Drop every cached entry and re-seed from the FAQ list. | +| `POST /drop` | Delete a single cached entry by id. | + +## Run the demo locally + +1. 
Clone the [`redis/docs`](https://github.com/redis/docs) repository and change into the example
+   directory:
+
+   ```bash
+   git clone https://github.com/redis/docs.git
+   cd docs/content/develop/use-cases/semantic-cache/java-lettuce
+   ```
+
+2. Make sure a Redis instance with the Redis Search module is running locally on
+   port 6379. [Redis Stack]({{< relref "/operate/oss_and_stack/install/install-stack" >}}) or
+   [Redis 8 with Search]({{< relref "/develop/ai/search-and-query" >}}) both work.
+
+3. Build the project with Maven. This pulls Lettuce, DJL, and the PyTorch native
+   libraries. The first build takes a couple of minutes:
+
+   ```bash
+   mvn -q package
+   ```
+
+4. Run the demo. The first run also downloads the `sentence-transformers/all-MiniLM-L6-v2`
+   PyTorch weights into the local DJL cache (~90 MB); every subsequent run is offline:
+
+   ```bash
+   java -jar target/semantic-cache-lettuce.jar
+   ```
+
+   Or with `mvn`:
+
+   ```bash
+   mvn -q exec:java
+   ```
+
+5. Open <http://localhost:8090> and try some queries:
+
+   * **"What is your return policy?"** — exact match against the seed, distance ≈ 0,
+     hit at any threshold.
+   * **"How fast is delivery?"** — paraphrase of the shipping seed; distance
+     around 0.30, hit at the default threshold of 0.5.
+   * **"How do I return an item?"** — slightly looser paraphrase of the returns
+     seed; distance around 0.49, still a hit at the default threshold. Slide
+     the threshold down to 0.4 to see this one flip to a miss.
+   * **"What payment methods do you accept?"** — unrelated to anything in the
+     seed; distance > 0.65, so you'll see a miss, the mock LLM kicks in for
+     ~1.5 s, the new answer is cached, and a follow-up of the same question
+     is now an immediate hit.
+   * Switch the **Tenant** dropdown to `globex` or `initech` and re-ask any
+     seeded question — the result flips to a miss because the cache entries
+     live under `acme`. That's the metadata pre-filter at work inside `FT.SEARCH`.
+
+The server is read/write against your local Redis.
The default index name is `semcache:idx` and entry keys live under `cache:`. Flags mirror the Python, Node, and Jedis demos: `--no-reset` to keep an existing cache across restarts, `--threshold` to change the default cosine-distance cutoff, `--llm-latency-ms` to make the mock LLM faster or slower for the demo, or `--port` to listen on a different port. diff --git a/content/develop/use-cases/semantic-cache/java-lettuce/index.html b/content/develop/use-cases/semantic-cache/java-lettuce/index.html new file mode 100644 index 0000000000..e897cfdee7 --- /dev/null +++ b/content/develop/use-cases/semantic-cache/java-lettuce/index.html @@ -0,0 +1,513 @@ + + + + + + Redis Semantic Cache Demo + + + +
+
loading…
+

Redis Semantic Cache Demo

+

+ A small semantic cache sits in front of a mock LLM. Each cache + entry is a Hash at __KEY_PREFIX__<id> holding + the prompt, the response, the prompt's 384-dimensional embedding, + and metadata fields. A single FT.SEARCH on + __INDEX_NAME__ does the KNN against cached prompts + with a TAG pre-filter (tenant, locale, model version, safety) in + the same round trip. If the closest cached prompt is within the + cosine-distance threshold, the demo serves the cached response + and the LLM is not called at all. +

+ +
+ +
+

Ask the LLM

+

Type a question, optionally adjust the metadata filters and + the distance threshold, and submit. The server embeds the + prompt, runs FT.SEARCH with KNN over the cache, + and either serves the cached response (hit) or runs the mock + LLM and writes the new response back to the cache (miss).

+ + +
+
+ + +
+
+ + +
+
+ + +
+
+
+ + + 0.50 +
+

+ The cache serves a hit when the closest cached prompt's + cosine distance is at or below this threshold. Lower = + stricter (fewer hits, safer reuse); higher = looser (more + hits, more risk of serving a near-miss). +

+ + + + + +
+
+ +
+

Cumulative savings

+

Every hit avoids one LLM round trip. The numbers below add + up across the session — tokens that would have been spent and + wall-clock seconds that would have been waited if the cache + had not served the answer.

+
+ Total queries: 0
+ Cache hits: 0
+ Cache misses: 0
+ Hit ratio: 0%
+ Tokens saved: 0
+ LLM time saved: 0 ms
+ +
+

Index state

+
+ +
+ +
+

Cached entries

+

Every prompt/response pair currently in the cache. + hit_count is the running total of times the entry + has served a hit; ttl is the remaining lifetime + in seconds before EXPIRE drops the key. Click + Drop to simulate eviction.

+ + + + + + + + + + + + +
ID | Prompt | Metadata | Hits | TTL
+
+ +
+ +
+
+ + + + diff --git a/content/develop/use-cases/semantic-cache/java-lettuce/pom.xml b/content/develop/use-cases/semantic-cache/java-lettuce/pom.xml new file mode 100644 index 0000000000..f28828fcbd --- /dev/null +++ b/content/develop/use-cases/semantic-cache/java-lettuce/pom.xml @@ -0,0 +1,134 @@ + + + 4.0.0 + + com.redis + semantic-cache-lettuce + 1.0.0 + jar + + Redis Semantic Cache Demo (Lettuce) + + Interactive semantic-cache demo backed by Redis Search, using + Lettuce for Redis access and DJL (PyTorch) for local sentence + embeddings. + + + + 17 + 17 + UTF-8 + 6.7.1.RELEASE + 0.33.0 + 20240303 + + + + + + io.lettuce + lettuce-core + ${lettuce.version} + + + + + ai.djl + api + ${djl.version} + + + ai.djl.huggingface + tokenizers + ${djl.version} + + + ai.djl.pytorch + pytorch-model-zoo + ${djl.version} + + + + + org.json + json + ${json.version} + + + + + semantic-cache-lettuce + + + + ${project.basedir} + + index.html + + + + + + org.apache.maven.plugins + maven-compiler-plugin + 3.13.0 + + 17 + + + + + org.apache.maven.plugins + maven-shade-plugin + 3.5.3 + + + package + shade + + false + + + com.redis.semcache.DemoServer + + + + + + *:* + + META-INF/*.SF + META-INF/*.DSA + META-INF/*.RSA + + + + + + + + + org.codehaus.mojo + exec-maven-plugin + 3.5.0 + + com.redis.semcache.DemoServer + + + + + diff --git a/content/develop/use-cases/semantic-cache/java-lettuce/src/main/java/com/redis/semcache/CacheHit.java b/content/develop/use-cases/semantic-cache/java-lettuce/src/main/java/com/redis/semcache/CacheHit.java new file mode 100644 index 0000000000..6f6b8fe094 --- /dev/null +++ b/content/develop/use-cases/semantic-cache/java-lettuce/src/main/java/com/redis/semcache/CacheHit.java @@ -0,0 +1,22 @@ +package com.redis.semcache; + +/** + * A cache lookup that returned a cached response. + * + *

{@code distance} is the cosine distance {@code FT.SEARCH} + * reported for the nearest cached prompt (0 = identical, 2 = + * opposite). It is always at or below the threshold the lookup was + * run with. + */ +public record CacheHit( + String id, + String prompt, + String response, + String tenant, + String locale, + String modelVersion, + double distance, + long ttlSeconds, + long hitCount +) implements LookupResult { +} diff --git a/content/develop/use-cases/semantic-cache/java-lettuce/src/main/java/com/redis/semcache/CacheMiss.java b/content/develop/use-cases/semantic-cache/java-lettuce/src/main/java/com/redis/semcache/CacheMiss.java new file mode 100644 index 0000000000..be6634b85b --- /dev/null +++ b/content/develop/use-cases/semantic-cache/java-lettuce/src/main/java/com/redis/semcache/CacheMiss.java @@ -0,0 +1,16 @@ +package com.redis.semcache; + +/** + * A cache lookup that did not return a usable response. + * + *

{@code nearestDistance} is the cosine distance to the closest + * cached prompt that did match the metadata filters. Both + * fields are {@code null} when the cache had no entry in scope at + * all, which is what the demo UI shows as "no candidate" + * vs. "candidate too far". + */ +public record CacheMiss( + Double nearestDistance, + String nearestId +) implements LookupResult { +} diff --git a/content/develop/use-cases/semantic-cache/java-lettuce/src/main/java/com/redis/semcache/DemoServer.java b/content/develop/use-cases/semantic-cache/java-lettuce/src/main/java/com/redis/semcache/DemoServer.java new file mode 100644 index 0000000000..c5006e2a3d --- /dev/null +++ b/content/develop/use-cases/semantic-cache/java-lettuce/src/main/java/com/redis/semcache/DemoServer.java @@ -0,0 +1,601 @@ +package com.redis.semcache; + +import com.sun.net.httpserver.HttpExchange; +import com.sun.net.httpserver.HttpHandler; +import com.sun.net.httpserver.HttpServer; +import io.lettuce.core.ClientOptions; +import io.lettuce.core.RedisClient; +import io.lettuce.core.RedisURI; +import io.lettuce.core.api.StatefulRedisConnection; +import io.lettuce.core.codec.ByteArrayCodec; +import io.lettuce.core.protocol.ProtocolVersion; +import org.json.JSONArray; +import org.json.JSONObject; + +import java.io.IOException; +import java.io.InputStream; +import java.net.InetSocketAddress; +import java.net.URI; +import java.net.URLDecoder; +import java.nio.charset.StandardCharsets; +import java.time.Duration; +import java.util.HashMap; +import java.util.LinkedHashMap; +import java.util.List; +import java.util.Map; +import java.util.concurrent.Executors; + +/** + * Redis semantic-cache demo server (Java + Lettuce). + * + *

Run this main and visit {@code http://localhost:8090} to drive + * a small semantic-cache demo backed by Redis Search. The UI lets + * you type a natural-language prompt and watch the cache decide hit + * or miss; on a hit Redis returns the cached response in tens of + * milliseconds and the demo LLM is not called at all, while on a + * miss the demo LLM "thinks" for ~1.5 s before answering + * and the new prompt, response, and embedding are written back to + * Redis for next time. + * + *

The server holds a single {@link LocalEmbedder}, a single + * {@link RedisSemanticCache}, and a single {@link MockLLM} for the + * lifetime of the process. The first run downloads the embedding + * model into the local DJL cache; everything after is local. + */ +public final class DemoServer { + + static final class Args { + String host = "127.0.0.1"; + int port = 8090; + String redisHost = "localhost"; + int redisPort = 6379; + String indexName = "semcache:idx"; + String keyPrefix = "cache:"; + long ttlSeconds = 3600; + double threshold = 0.5; + double llmLatencyMs = 1500.0; + boolean resetOnStart = true; + } + + public static void main(String[] argv) throws Exception { + Args args = parseArgs(argv); + + RedisURI uri = RedisURI.Builder + .redis(args.redisHost, args.redisPort) + .withTimeout(Duration.ofSeconds(2)) + .build(); + RedisClient client = RedisClient.create(uri); + // Pin the connection to RESP2 so FT.SEARCH and FT.INFO come + // back in the flat alternating-list shape the cache parses. + // Lettuce 6.7 negotiates RESP3 by default, which wraps the + // search reply in a map keyed by "results", "total_results", + // etc. — a richer shape, but one the demo's parser doesn't + // need; RESP2 keeps the wire format identical to what the + // Python, Node, Go, and Jedis ports already speak. 
+ client.setOptions(ClientOptions.builder() + .protocolVersion(ProtocolVersion.RESP2) + .build()); + + StatefulRedisConnection<byte[], byte[]> connection; + try { + connection = client.connect(ByteArrayCodec.INSTANCE); + connection.sync().ping(); + } catch (Exception ex) { + System.err.println("Error: cannot reach Redis at " + + args.redisHost + ":" + args.redisPort); + System.err.println(" (" + ex.getMessage() + ")"); + client.shutdown(); + System.exit(1); + return; + } + + RedisSemanticCache cache = new RedisSemanticCache( + connection, + args.indexName, + args.keyPrefix, + LocalEmbedder.defaultVectorDim(), + args.threshold, + args.ttlSeconds + ); + cache.createIndex(); + + System.out.println("Loading embedding model " + + "(first run downloads the PyTorch weights)..."); + LocalEmbedder embedder = LocalEmbedder.create(); + MockLLM llm = new MockLLM("gpt-4.5-2026", args.llmLatencyMs); + + SemanticCacheDemo demo = new SemanticCacheDemo(cache, embedder, llm); + + if (args.resetOnStart) { + System.out.println( + "Dropping any existing cache under '" + args.keyPrefix + + "*' and re-seeding from the FAQ list " + + "(pass --no-reset to keep)."); + int seeded = demo.seed(); + System.out.println("Seeded " + seeded + " entries."); + } + + // Load the HTML once and substitute the template tokens so the + // docs panel shows the actual values in use rather than the + // default copies. 
+ String rawHtml = loadIndexHtml(); + String htmlPage = rawHtml + .replace("__INDEX_NAME__", args.indexName) + .replace("__KEY_PREFIX__", args.keyPrefix); + + HttpServer server = HttpServer.create( + new InetSocketAddress(args.host, args.port), 0); + server.setExecutor(Executors.newCachedThreadPool()); + server.createContext("/", new RootHandler(cache, embedder, llm, demo, htmlPage)); + + System.out.println("Redis semantic cache demo listening on " + + "http://" + args.host + ":" + args.port); + System.out.println("Using Redis at " + args.redisHost + ":" + args.redisPort + + " with index '" + args.indexName + "'"); + + Runtime.getRuntime().addShutdownHook(new Thread(() -> { + System.out.println("\nShutting down..."); + server.stop(0); + try { embedder.close(); } catch (Exception ignored) {} + try { connection.close(); } catch (Exception ignored) {} + try { client.shutdown(); } catch (Exception ignored) {} + })); + + server.start(); + } + + // ------------------------------------------------------------------ + // Demo orchestrator + // ------------------------------------------------------------------ + + static final class SemanticCacheDemo { + private final RedisSemanticCache cache; + private final LocalEmbedder embedder; + private final MockLLM llm; + private final String defaultTenant = "acme"; + private final String defaultLocale = "en"; + + SemanticCacheDemo(RedisSemanticCache cache, LocalEmbedder embedder, MockLLM llm) { + this.cache = cache; + this.embedder = embedder; + this.llm = llm; + } + + /** Drop everything in scope and pre-populate with FAQ entries. */ + synchronized int seed() throws Exception { + cache.clear(); + return SeedCache.seed(cache, embedder, + defaultTenant, defaultLocale, llm.modelVersion()); + } + + /** + * The hot path: embed, look up, optionally call the LLM, write back. + * + *

Timings are taken with {@code System.nanoTime()} around + * each bounded step so the UI can display the embed / lookup + * / LLM breakdown separately. The cache write on a miss is + * not included in {@code total_ms} so the latency + * number reflects the user-facing wait, not the background + * bookkeeping. + */ + Map<String, Object> runQuery( + String prompt, + String tenant, + String locale, + String modelVersion, + double threshold, + boolean lookupOnly) throws Exception { + + long t0 = System.nanoTime(); + float[] queryVec = embedder.encodeOne(prompt); + double embedMs = (System.nanoTime() - t0) / 1_000_000.0; + + long t1 = System.nanoTime(); + LookupResult result = cache.lookup( + queryVec, tenant, locale, modelVersion, "ok", threshold); + double lookupMs = (System.nanoTime() - t1) / 1_000_000.0; + + Map<String, Object> payload = new LinkedHashMap<>(); + + if (result instanceof CacheHit hit) { + payload.put("outcome", "hit"); + payload.put("response", hit.response()); + payload.put("entry_id", hit.id()); + payload.put("distance", hit.distance()); + payload.put("ttl_seconds", hit.ttlSeconds()); + payload.put("hit_count", hit.hitCount()); + payload.put("threshold", threshold); + payload.put("embed_ms", embedMs); + payload.put("lookup_ms", lookupMs); + payload.put("llm_ms", null); + payload.put("total_ms", embedMs + lookupMs); + payload.put("tokens_avoided", + estimateResponseTokens(hit.prompt(), hit.response())); + payload.put("ms_avoided", llm.latencyMs()); + return payload; + } + + // Miss path. In "lookup only" mode the demo reports the + // miss without actually calling the LLM, which is useful for + // sweeping the threshold against a fixed prompt to see + // where the cutoff would fall without polluting the cache. 
+ CacheMiss miss = (CacheMiss) result; + if (lookupOnly) { + payload.put("outcome", "miss"); + payload.put("response", "(LLM not called in lookup-only mode)"); + payload.put("nearest_distance", miss.nearestDistance()); + payload.put("threshold", threshold); + payload.put("wrote_entry_id", null); + payload.put("embed_ms", embedMs); + payload.put("lookup_ms", lookupMs); + payload.put("llm_ms", null); + payload.put("total_ms", embedMs + lookupMs); + return payload; + } + + long t2 = System.nanoTime(); + MockLLM.Response llmResponse = llm.complete(prompt); + double llmMs = (System.nanoTime() - t2) / 1_000_000.0; + + // Write the new entry back. The embedding is the same + // vector we already used for the lookup — no need to + // re-encode. + String entryId = cache.put( + prompt, + llmResponse.response(), + queryVec, + tenant, + locale, + modelVersion, + "ok", + null, + null + ); + + payload.put("outcome", "miss"); + payload.put("response", llmResponse.response()); + payload.put("nearest_distance", miss.nearestDistance()); + payload.put("threshold", threshold); + payload.put("wrote_entry_id", entryId); + payload.put("embed_ms", embedMs); + payload.put("lookup_ms", lookupMs); + payload.put("llm_ms", llmMs); + payload.put("total_ms", embedMs + lookupMs + llmMs); + return payload; + } + + private static int estimateResponseTokens(String prompt, String response) { + int len = (prompt == null ? 0 : prompt.length()) + + (response == null ? 
0 : response.length()); + return Math.max(1, len / 4); + } + } + + // ------------------------------------------------------------------ + // HTTP plumbing + // ------------------------------------------------------------------ + + static final class RootHandler implements HttpHandler { + private final RedisSemanticCache cache; + private final LocalEmbedder embedder; + private final MockLLM llm; + private final SemanticCacheDemo demo; + private final String htmlPage; + + RootHandler(RedisSemanticCache cache, LocalEmbedder embedder, + MockLLM llm, SemanticCacheDemo demo, String htmlPage) { + this.cache = cache; + this.embedder = embedder; + this.llm = llm; + this.demo = demo; + this.htmlPage = htmlPage; + } + + @Override + public void handle(HttpExchange ex) throws IOException { + try { + String method = ex.getRequestMethod(); + URI uri = ex.getRequestURI(); + String path = uri.getPath(); + + if ("GET".equalsIgnoreCase(method)) { + if (path.equals("/") || path.equals("/index.html")) { + sendHtml(ex, 200, htmlPage); + return; + } + if (path.equals("/state")) { + sendJson(ex, 200, buildState()); + return; + } + sendJson(ex, 404, errorPayload("not found", null)); + return; + } + if ("POST".equalsIgnoreCase(method)) { + String body = readBody(ex); + Map<String, String> params = parseForm(body); + + if (path.equals("/query")) { + handleQuery(ex, params); + return; + } + if (path.equals("/reset")) { + try { + demo.seed(); + JSONObject ok = new JSONObject(); + ok.put("ok", true); + sendJson(ex, 200, ok); + } catch (Exception inner) { + handleException(ex, inner); + } + return; + } + if (path.equals("/drop")) { + String entryId = params.getOrDefault("entry_id", "").trim(); + if (entryId.isEmpty()) { + sendJson(ex, 400, errorPayload("entry_id is required", null)); + return; + } + boolean deleted = cache.deleteEntry(entryId); + JSONObject out = new JSONObject(); + out.put("deleted", deleted); + out.put("entry_id", entryId); + sendJson(ex, 200, out); + return; + } + sendJson(ex, 404, 
errorPayload("not found", null)); + return; + } + sendJson(ex, 405, errorPayload("method not allowed", null)); + } catch (Exception exc) { + handleException(ex, exc); + } + } + + private void handleQuery(HttpExchange ex, Map<String, String> params) + throws IOException { + String prompt = params.getOrDefault("prompt", "").trim(); + if (prompt.isEmpty()) { + sendJson(ex, 400, errorPayload("prompt is required", null)); + return; + } + double threshold = clampThreshold(params.get("threshold")); + boolean lookupOnly = params.getOrDefault("lookup_only", "").length() > 0; + String tenant = nonEmpty(params.get("tenant"), "acme"); + String locale = nonEmpty(params.get("locale"), "en"); + String modelVersion = nonEmpty(params.get("model_version"), llm.modelVersion()); + + try { + Map<String, Object> payload = demo.runQuery( + prompt, tenant, locale, modelVersion, threshold, lookupOnly); + sendJson(ex, 200, toJson(payload)); + } catch (Exception inner) { + handleException(ex, inner); + } + } + + private JSONObject buildState() { + Map<String, Object> info = cache.indexInfo(); + JSONObject index = new JSONObject(); + index.put("num_docs", info.getOrDefault("num_docs", 0L)); + index.put("indexing_failures", info.getOrDefault("indexing_failures", 0L)); + index.put("vector_index_size_mb", + info.getOrDefault("vector_index_size_mb", 0.0)); + index.put("index_name", cache.indexName()); + index.put("model", embedder.modelName()); + index.put("mock_llm_latency_ms", llm.latencyMs()); + // default_threshold is what the --threshold flag actually + // configures; the UI slider initialises to this on first + // load so the flag visibly changes the demo's behaviour. + // stack_label lets the same HTML render a per-language + // badge without forking the file per language. 
+ index.put("default_threshold", cache.distanceThreshold()); + index.put("stack_label", + "Lettuce + DJL (PyTorch + HuggingFace) + Java standard library HTTP server"); + + JSONArray entries = new JSONArray(); + List<Map<String, Object>> rows = cache.listEntries(200); + for (Map<String, Object> row : rows) { + entries.put(toJson(row)); + } + + JSONObject out = new JSONObject(); + out.put("index", index); + out.put("entries", entries); + return out; + } + + private void handleException(HttpExchange ex, Exception exc) { + System.err.println("[demo] handler error: " + + exc.getClass().getSimpleName() + ": " + exc.getMessage()); + exc.printStackTrace(System.err); + try { + JSONObject body = errorPayload( + exc.getMessage() == null ? exc.getClass().getSimpleName() : exc.getMessage(), + exc.getClass().getSimpleName()); + sendJson(ex, 500, body); + } catch (Exception ignored) { + // Headers may already be partially flushed; nothing + // useful left to do beyond letting the connection drop. + } + } + } + + // ------------------------------------------------------------------ + // Helpers + // ------------------------------------------------------------------ + + /** + * Parse a threshold value, clamping NaN/Infinity to {@code 0.5} + * and otherwise clamping to {@code [0.0, 2.0]}. {@code parseDouble} + * happily handles "NaN" → {@code NaN} and + * "Infinity" → {@code Infinity} (both spellings are + * case-sensitive; anything unparseable falls back to the default). + * Either would silently turn + * the lookup into a permanent hit ({@code NaN} comparisons are + * always {@code false}, so {@code distance > nan} cannot reject) + * or a permanent miss; clamping to the meaningful cosine-distance + * range stops a malformed POST from overriding the threshold + * semantics. 
+ */ + static double clampThreshold(String raw) { + double parsed = 0.5; + if (raw != null && !raw.isEmpty()) { + try { + parsed = Double.parseDouble(raw); + } catch (NumberFormatException ex) { + parsed = 0.5; + } + } + if (Double.isNaN(parsed) || Double.isInfinite(parsed)) return 0.5; + return Math.max(0.0, Math.min(2.0, parsed)); + } + + private static String nonEmpty(String value, String fallback) { + return (value == null || value.isEmpty()) ? fallback : value; + } + + /** + * Cap POST bodies so a runaway client can't accumulate unbounded + * memory before the handler runs. {@code com.sun.net.httpserver} + * provides no built-in limit on request bodies; left unchecked, + * {@code InputStream.readAllBytes()} will read whatever the + * client sends. The demo's largest legitimate body is a few + * hundred bytes of form-encoded query fields; 1 MiB is a + * generous ceiling and matches the Node and Go demos' caps. + */ + private static final int MAX_BODY_BYTES = 1 * 1024 * 1024; + + private static String readBody(HttpExchange ex) throws IOException { + try (InputStream in = ex.getRequestBody()) { + // Read up to MAX_BODY_BYTES + 1 so we can distinguish + // "exactly at the limit" from "too large". 
+ byte[] bytes = in.readNBytes(MAX_BODY_BYTES + 1); + if (bytes.length > MAX_BODY_BYTES) { + throw new IOException( + "request body exceeds " + MAX_BODY_BYTES + " bytes"); + } + return new String(bytes, StandardCharsets.UTF_8); + } + } + + static Map<String, String> parseForm(String body) { + Map<String, String> out = new HashMap<>(); + if (body == null || body.isEmpty()) return out; + for (String pair : body.split("&")) { + if (pair.isEmpty()) continue; + int eq = pair.indexOf('='); + String key, value; + if (eq < 0) { + key = URLDecoder.decode(pair, StandardCharsets.UTF_8); + value = ""; + } else { + key = URLDecoder.decode(pair.substring(0, eq), StandardCharsets.UTF_8); + value = URLDecoder.decode(pair.substring(eq + 1), StandardCharsets.UTF_8); + } + out.put(key, value); + } + return out; + } + + private static void sendHtml(HttpExchange ex, int status, String html) throws IOException { + byte[] bytes = html.getBytes(StandardCharsets.UTF_8); + ex.getResponseHeaders().set("Content-Type", "text/html; charset=utf-8"); + ex.sendResponseHeaders(status, bytes.length); + ex.getResponseBody().write(bytes); + ex.getResponseBody().close(); + } + + private static void sendJson(HttpExchange ex, int status, JSONObject body) throws IOException { + byte[] bytes = body.toString().getBytes(StandardCharsets.UTF_8); + ex.getResponseHeaders().set("Content-Type", "application/json"); + ex.sendResponseHeaders(status, bytes.length); + ex.getResponseBody().write(bytes); + ex.getResponseBody().close(); + } + + private static JSONObject errorPayload(String message, String type) { + JSONObject out = new JSONObject(); + out.put("error", message); + if (type != null) out.put("type", type); + return out; + } + + private static JSONObject toJson(Map<String, Object> map) { + JSONObject out = new JSONObject(); + for (Map.Entry<String, Object> entry : map.entrySet()) { + Object value = entry.getValue(); + if (value == null) { + out.put(entry.getKey(), JSONObject.NULL); + } else { + out.put(entry.getKey(), value); + } + } + return out; + } + + private static 
String loadIndexHtml() throws IOException { + // index.html is shipped as a classpath resource (Maven pulls + // it from the project root via the <resources> entry in + // pom.xml). Loading from the classpath rather than the + // working directory means `java -jar target/...` works from + // anywhere, not just the project root. + try (InputStream in = + DemoServer.class.getResourceAsStream("/index.html")) { + if (in == null) { + throw new IOException( + "index.html not found on classpath; rebuild with `mvn package`"); + } + return new String(in.readAllBytes(), StandardCharsets.UTF_8); + } + } + + // ------------------------------------------------------------------ + // CLI parsing + // ------------------------------------------------------------------ + + static Args parseArgs(String[] argv) { + Args args = new Args(); + for (int i = 0; i < argv.length; i++) { + String a = argv[i]; + switch (a) { + case "--host": args.host = require(argv, ++i, a); break; + case "--port": args.port = Integer.parseInt(require(argv, ++i, a)); break; + case "--redis-host": args.redisHost = require(argv, ++i, a); break; + case "--redis-port": args.redisPort = Integer.parseInt(require(argv, ++i, a)); break; + case "--index-name": args.indexName = require(argv, ++i, a); break; + case "--key-prefix": args.keyPrefix = require(argv, ++i, a); break; + case "--ttl-seconds": args.ttlSeconds = Long.parseLong(require(argv, ++i, a)); break; + case "--threshold": args.threshold = Double.parseDouble(require(argv, ++i, a)); break; + case "--llm-latency-ms":args.llmLatencyMs = Double.parseDouble(require(argv, ++i, a)); break; + case "--no-reset": args.resetOnStart = false; break; + case "-h": + case "--help": + printHelp(); + System.exit(0); + break; + default: + throw new IllegalArgumentException("Unknown flag: " + a); + } + } + return args; + } + + private static String require(String[] argv, int i, String flag) { + if (i >= argv.length) { + throw new IllegalArgumentException("Missing value for " + flag); + } 
+ return argv[i]; + } + + private static void printHelp() { + System.out.println("Usage: java -jar semantic-cache-lettuce.jar [options]"); + System.out.println(" --host HOST HTTP bind host (default 127.0.0.1)"); + System.out.println(" --port PORT HTTP bind port (default 8090)"); + System.out.println(" --redis-host HOST Redis host (default localhost)"); + System.out.println(" --redis-port PORT Redis port (default 6379)"); + System.out.println(" --index-name NAME Redis Search index name (default semcache:idx)"); + System.out.println(" --key-prefix PREFIX Hash key prefix (default cache:)"); + System.out.println(" --ttl-seconds N TTL for cache entries (default 3600)"); + System.out.println(" --threshold F Default cosine-distance cutoff (default 0.5)"); + System.out.println(" --llm-latency-ms F Mock LLM latency (default 1500.0)"); + System.out.println(" --no-reset Keep existing cache instead of re-seeding"); + } +} diff --git a/content/develop/use-cases/semantic-cache/java-lettuce/src/main/java/com/redis/semcache/LocalEmbedder.java b/content/develop/use-cases/semantic-cache/java-lettuce/src/main/java/com/redis/semcache/LocalEmbedder.java new file mode 100644 index 0000000000..184b2fd1b6 --- /dev/null +++ b/content/develop/use-cases/semantic-cache/java-lettuce/src/main/java/com/redis/semcache/LocalEmbedder.java @@ -0,0 +1,176 @@ +package com.redis.semcache; + +import ai.djl.huggingface.translator.TextEmbeddingTranslatorFactory; +import ai.djl.inference.Predictor; +import ai.djl.repository.zoo.Criteria; +import ai.djl.repository.zoo.ZooModel; +import ai.djl.training.util.ProgressBar; + +import java.nio.ByteBuffer; +import java.nio.ByteOrder; +import java.util.ArrayList; +import java.util.List; + +/** + * Local text-embedding helper backed by DJL + PyTorch. + * + *

This is a thin wrapper around the + * {@code sentence-transformers/all-MiniLM-L6-v2} model loaded from + * DJL's model zoo: a 384-dimensional encoder that runs in-process on + * CPU through libtorch, needs no API key, and produces vectors that + * are numerically very close to the equivalent Python and Node ports + * (close enough that paraphrase distances differ only at the fourth + * decimal place). + * + *

DJL's {@link TextEmbeddingTranslatorFactory} returns mean-pooled + * vectors. They are normalised by default for cosine similarity, but + * the demo L2-normalises the result explicitly in {@link #encodeOne} + * before returning. That belt-and-braces step makes the cosine + * distance reported by Redis Search numerically equivalent to what + * the Python and Go ports produce, regardless of whether a future + * DJL release tweaks its default normalisation behaviour. + */ +public final class LocalEmbedder implements AutoCloseable { + + private static final String DEFAULT_MODEL_URL = + "djl://ai.djl.huggingface.pytorch/sentence-transformers/all-MiniLM-L6-v2"; + private static final String DEFAULT_MODEL_NAME = + "sentence-transformers/all-MiniLM-L6-v2"; + private static final int DEFAULT_VECTOR_DIM = 384; + + private final String modelName; + private final ZooModel<String, float[]> model; + private final Predictor<String, float[]> predictor; + private final int dim; + + private LocalEmbedder( + String modelName, + ZooModel<String, float[]> model, + Predictor<String, float[]> predictor, + int dim) { + this.modelName = modelName; + this.model = model; + this.predictor = predictor; + this.dim = dim; + } + + /** + * Load the default model. Blocks while DJL downloads the + * PyTorch weights on the first run, then keeps a single loaded + * predictor for the lifetime of the embedder. + */ + public static LocalEmbedder create() throws Exception { + Criteria<String, float[]> criteria = Criteria.builder() + .setTypes(String.class, float[].class) + .optModelUrls(DEFAULT_MODEL_URL) + .optEngine("PyTorch") + .optTranslatorFactory(new TextEmbeddingTranslatorFactory()) + .optProgress(new ProgressBar()) + .build(); + ZooModel<String, float[]> model = criteria.loadModel(); + Predictor<String, float[]> predictor = model.newPredictor(); + // Probe the output shape once so we fail loudly if a + // different model is wired up against the 384-dim Redis + // Search field. 
+ float[] probe = predictor.predict("dimension probe"); + int dim = probe.length; + return new LocalEmbedder(DEFAULT_MODEL_NAME, model, predictor, dim); + } + + public String modelName() { + return modelName; + } + + public int dim() { + return dim; + } + + /** + * Encode a single string. Returns a {@code float[]} of length + * {@link #dim()}, L2-normalised in place. + * + *

The DJL PyTorch {@code Predictor} is not thread-safe: its + * underlying NDManager and tokenizer state mutate per call. The + * demo server uses a cached thread pool, so two browser tabs + * could land on different handler threads and call this method + * concurrently. We {@code synchronized}-guard both encode entry + * points to serialise access to the shared predictor; encoding + * is the bottleneck either way and a single CPU-bound model + * won't usefully run two requests in parallel. A + * higher-throughput deployment would replace this with a small pool + * of {@code Predictor} instances or a dedicated single-threaded + * inference executor. + */ + public synchronized float[] encodeOne(String text) throws Exception { + float[] vector = predictor.predict(text); + l2Normalise(vector); + return vector; + } + + /** Encode several strings sequentially. See {@link #encodeOne} + * for the rationale behind the synchronisation. */ + public synchronized List<float[]> encodeMany(List<String> texts) throws Exception { + List<float[]> out = new ArrayList<>(texts.size()); + for (String text : texts) { + float[] vector = predictor.predict(text); + l2Normalise(vector); + out.add(vector); + } + return out; + } + + /** + * Scale {@code vector} to unit length in place. DJL's default + * translator already returns near-unit vectors for this model, + * so the multiplier sits right on top of {@code 1.0}, but + * re-normalising explicitly insulates the demo from any future + * change in the translator's defaults, and a vector that has + * drifted by even one part in a million would otherwise leak + * into the cosine distance the demo prints to the UI. 
+ */ + private static void l2Normalise(float[] vector) { + double sumSq = 0.0; + for (float v : vector) { + sumSq += (double) v * (double) v; + } + if (sumSq <= 0.0) return; + double norm = Math.sqrt(sumSq); + float scale = (float) (1.0 / norm); + for (int i = 0; i < vector.length; i++) { + vector[i] = vector[i] * scale; + } + } + + /** + * Pack a {@code float[]} into the bytes Redis Search expects. + * Vectors are little-endian {@code float32}; this matches the + * encoding the Python and Node ports write. + */ + public static byte[] toBytes(float[] vector) { + byte[] bytes = new byte[Float.BYTES * vector.length]; + ByteBuffer + .wrap(bytes) + .order(ByteOrder.LITTLE_ENDIAN) + .asFloatBuffer() + .put(vector); + return bytes; + } + + @Override + public void close() { + try { + predictor.close(); + } catch (Exception ignored) { + // best-effort cleanup + } + try { + model.close(); + } catch (Exception ignored) { + // best-effort cleanup + } + } + + public static int defaultVectorDim() { + return DEFAULT_VECTOR_DIM; + } +} diff --git a/content/develop/use-cases/semantic-cache/java-lettuce/src/main/java/com/redis/semcache/LookupResult.java b/content/develop/use-cases/semantic-cache/java-lettuce/src/main/java/com/redis/semcache/LookupResult.java new file mode 100644 index 0000000000..141278bb24 --- /dev/null +++ b/content/develop/use-cases/semantic-cache/java-lettuce/src/main/java/com/redis/semcache/LookupResult.java @@ -0,0 +1,10 @@ +package com.redis.semcache; + +/** + * Sealed result of a cache lookup. Pattern-matched in the demo + * server to branch between the hit and miss paths; mirrors the + * {@code CacheHit | CacheMiss} union the Python and Node ports + * return. 
+ */ +public sealed interface LookupResult permits CacheHit, CacheMiss { +} diff --git a/content/develop/use-cases/semantic-cache/java-lettuce/src/main/java/com/redis/semcache/MockLLM.java b/content/develop/use-cases/semantic-cache/java-lettuce/src/main/java/com/redis/semcache/MockLLM.java new file mode 100644 index 0000000000..29cff2776c --- /dev/null +++ b/content/develop/use-cases/semantic-cache/java-lettuce/src/main/java/com/redis/semcache/MockLLM.java @@ -0,0 +1,178 @@ +package com.redis.semcache; + +import java.util.List; +import java.util.concurrent.atomic.AtomicLong; + +/** + * Deterministic mock LLM for the semantic-cache demo. + * + *

The point of a semantic cache is to skip an LLM call + * when a prior answer is reusable. To make that visible in a docs + * demo we need an LLM stand-in that: + * + *

<ul>
+ *   <li>takes long enough that the saved time on a cache hit is
+ *       obvious (real-world model calls are 500 ms to several
+ *       seconds);</li>
+ *   <li>responds deterministically so a given prompt always produces
+ *       the same answer, which keeps the demo reproducible;</li>
+ *   <li>exposes an estimated token count so the demo can show the
+ *       saving in "tokens not spent" terms alongside
+ *       latency;</li>
+ *   <li>needs no API keys, no network, no extra dependencies.</li>
+ * </ul>
+ *
+ *

It is keyword-matched against a small lookup table of FAQ-style + * answers for a fictional online retailer. Anything that doesn't + * match falls back to a generic templated reply. The + * {@code latencyMs} parameter is the simulated round trip; the + * default (1500 ms) is in the neighbourhood of a real GPT-class + * model on a moderately-sized prompt. + */ +public final class MockLLM { + + private record KnowledgeRow(List<String> keywords, String answer) {} + + private static final List<KnowledgeRow> KNOWLEDGE = List.of( + new KnowledgeRow( + List.of("return", "refund", "exchange"), + "You can return any unworn item within 30 days of delivery for a " + + "full refund. Start a return from your order page; we email a " + + "prepaid label and refund the original payment method within " + + "five business days of receiving the item." + ), + new KnowledgeRow( + List.of("shipping", "delivery", "arrive", "ship"), + "Standard shipping is free on orders over $50 and arrives in " + + "three to five business days. Expedited two-day shipping is " + + "$9.99 and is available at checkout for in-stock items." + ), + new KnowledgeRow( + List.of("size", "sizing", "fit"), + "We follow standard US sizing. For most styles we recommend " + + "ordering your usual size; the product page includes a sizing " + + "chart and customer fit notes for items that run small or large." + ), + new KnowledgeRow( + List.of("warranty", "guarantee", "defect", "broken"), + "All gear is covered by a one-year manufacturer warranty against " + + "defects in materials or workmanship. Email support with your " + + "order number and a photo of the issue and we will replace the " + + "item or issue a refund." + ), + new KnowledgeRow( + List.of("contact", "support", "help", "agent"), + "You can reach our support team by email at help@example.com or " + + "by live chat from the help centre, 9am to 9pm Eastern, seven " + + "days a week. Most tickets get a first reply within two hours." 
+ ), + new KnowledgeRow( + List.of("track", "tracking", "order", "where"), + "Your tracking number is on the order confirmation email and on " + + "the order detail page once the package has been picked up by " + + "the carrier — typically within 24 hours of order placement." + ), + new KnowledgeRow( + List.of("cancel", "modify", "change"), + "Orders can be cancelled or modified for up to one hour after " + + "placement. After that the order has usually entered our " + + "warehouse system; the fastest path is to accept delivery and " + + "start a return for any unwanted items." + ), + new KnowledgeRow( + List.of("discount", "coupon", "promo", "code"), + "Active promotional codes are listed on the homepage banner. " + + "Codes apply at checkout and cannot be combined; the system " + + "automatically uses the larger of the two when more than one " + + "would qualify." + ) + ); + + private static final String FALLBACK_ANSWER = + "Thanks for the question. Our team would normally answer this " + + "individually; in the meantime please check the help centre or " + + "contact support@example.com for a faster response."; + + /** Result of a mock LLM call. */ + public record Response( + String response, + String modelVersion, + double latencyMs, + int promptTokens, + int completionTokens + ) { + public int totalTokens() { + return promptTokens + completionTokens; + } + } + + private final String modelVersion; + private final double latencyMs; + private final AtomicLong callCount = new AtomicLong(); + + public MockLLM() { + this("gpt-4.5-2026", 1500.0); + } + + public MockLLM(String modelVersion, double latencyMs) { + this.modelVersion = modelVersion; + this.latencyMs = latencyMs; + } + + public String modelVersion() { + return modelVersion; + } + + public double latencyMs() { + return latencyMs; + } + + public long callCount() { + return callCount.get(); + } + + /** + * Pretend to call a model. Sleeps for the configured latency, + * then returns a templated answer. 
+ */ + public Response complete(String prompt) { + callCount.incrementAndGet(); + long start = System.nanoTime(); + try { + // Sleep first so the latency is realistic regardless of + // which branch generates the text. + long ms = (long) latencyMs; + int ns = (int) ((latencyMs - ms) * 1_000_000.0); + Thread.sleep(ms, ns); + } catch (InterruptedException ex) { + Thread.currentThread().interrupt(); + } + String response = answerFor(prompt); + double elapsedMs = (System.nanoTime() - start) / 1_000_000.0; + return new Response( + response, + modelVersion, + elapsedMs, + estimateTokens(prompt), + estimateTokens(response) + ); + } + + private static String answerFor(String prompt) { + String lower = prompt.toLowerCase(); + for (KnowledgeRow row : KNOWLEDGE) { + for (String kw : row.keywords()) { + if (lower.contains(kw)) { + return row.answer(); + } + } + } + return FALLBACK_ANSWER; + } + + /** Rough English token estimate: ~4 characters per token. */ + public static int estimateTokens(String text) { + if (text == null || text.isEmpty()) return 0; + return Math.max(1, text.length() / 4); + } +} diff --git a/content/develop/use-cases/semantic-cache/java-lettuce/src/main/java/com/redis/semcache/RedisSemanticCache.java b/content/develop/use-cases/semantic-cache/java-lettuce/src/main/java/com/redis/semcache/RedisSemanticCache.java new file mode 100644 index 0000000000..d3da18d9bd --- /dev/null +++ b/content/develop/use-cases/semantic-cache/java-lettuce/src/main/java/com/redis/semcache/RedisSemanticCache.java @@ -0,0 +1,723 @@ +package com.redis.semcache; + +import io.lettuce.core.RedisException; +import io.lettuce.core.TransactionResult; +import io.lettuce.core.api.StatefulRedisConnection; +import io.lettuce.core.api.sync.RedisCommands; +import io.lettuce.core.codec.ByteArrayCodec; +import io.lettuce.core.output.NestedMultiOutput; +import io.lettuce.core.output.StatusOutput; +import io.lettuce.core.protocol.CommandArgs; +import io.lettuce.core.protocol.ProtocolKeyword; + 
+import java.nio.charset.StandardCharsets; +import java.util.ArrayList; +import java.util.HashMap; +import java.util.LinkedHashMap; +import java.util.List; +import java.util.Locale; +import java.util.Map; +import java.util.UUID; + +/** + * Redis semantic-cache helper backed by Redis Search. + * + *
<p>
Each cache entry lives as a Hash document at + * {@code cache:}. The hash stores the user's prompt and the + * corresponding LLM response alongside the raw float32 bytes of the + * prompt's 384-dimensional embedding and a small set of metadata + * fields — tenant, locale, model version, and a safety flag. + * + *
<p>
A single Redis Search index covers the embedding plus every + * metadata field, so one {@code FT.SEARCH} call does an + * approximate-nearest-neighbour lookup against the cached prompts + * with a TAG pre-filter applied in the same pass — no cross-store + * joins, no extra round trips, and tenant isolation is enforced + * inside the query rather than after the fact in + * application code. + * + *
<p>
The lookup is thresholded: {@code FT.SEARCH} always returns the + * closest cached prompt, but the cache only serves it as a hit when + * the cosine distance is at or below {@code distanceThreshold}. + * Anything further away is treated as a miss; the caller is expected + * to run the underlying LLM and write the new prompt, response, and + * embedding back with {@link #put}. + * + *
<p>
Each cache entry is written with {@code EXPIRE}, so stale + * answers age out without manual cleanup; combine with an + * {@code allkeys-lfu} eviction policy on the database to cap memory + * under pressure too. + * + *
<p>
Lettuce 6.7 doesn't yet ship first-class {@code FT.*} bindings, + * so the cache uses {@code dispatch()} with a custom + * {@link ProtocolKeyword} for {@code FT.CREATE}, {@code FT.SEARCH}, + * {@code FT.INFO}, and {@code FT.DROPINDEX}. Everything else — + * {@code HSET}, {@code HINCRBY}, {@code EXPIRE}, {@code TTL}, + * {@code MULTI}/{@code EXEC} — goes through the built-in synchronous + * API on a {@code byte[]}-codec connection so binary embedding bytes + * and UTF-8 text can share the same hash without separate connections. + */ +public final class RedisSemanticCache { + + public static final int VECTOR_DIM_DEFAULT = 384; + + /** + * Characters Redis Search treats as syntax inside a TAG value; + * any of them in a user-supplied filter must be backslash-escaped + * or the surrounding {@code {...}} block won't parse correctly. + */ + private static final String TAG_SPECIAL = "\\,.<>{}[]\"':;!@#$%^&*()-+=~| "; + + /** + * Custom {@link ProtocolKeyword}s for the Redis Search + * subcommands we send via {@code dispatch()}. Lettuce 6.7 has no + * native {@code FT.*} bindings, but {@code dispatch()} accepts + * any keyword whose {@code getBytes()} returns the raw command + * name. We deliberately spell out each one as its own keyword + * (rather than {@code add("CREATE")} on a single {@code FT} + * keyword) so the wire bytes match the standard Redis Search + * tooling — {@code MONITOR}, {@code LATENCY HISTORY}, server-side + * ACL rules — exactly as if the commands had been typed in + * {@code redis-cli}. 
+ */ + private enum FtCommand implements ProtocolKeyword { + FT_CREATE("FT.CREATE"), + FT_SEARCH("FT.SEARCH"), + FT_INFO("FT.INFO"), + FT_DROPINDEX("FT.DROPINDEX"); + + private final byte[] bytes; + private final String wire; + + FtCommand(String wire) { + this.wire = wire; + this.bytes = wire.getBytes(StandardCharsets.US_ASCII); + } + + @Override + public byte[] getBytes() { + return bytes; + } + + @Override + public String toString() { + return wire; + } + } + + private final StatefulRedisConnection<byte[], byte[]> connection; + /** + * Lettuce connections are thread-safe for individual command + * dispatch, but transaction state ({@code MULTI}/queued commands/ + * {@code EXEC}) is connection-scoped. Two concurrent handler + * threads sharing one connection can interleave their queued + * writes into the same transaction, with each thread's + * {@code EXEC} returning a mix of replies. We serialise the + * entire {@code MULTI…EXEC} span on this lock so transactions + * see consistent state. A higher-throughput deployment would + * use a small pool of connections via Lettuce's + * {@code ConnectionPoolSupport} instead. 
+ */ + private final Object txLock = new Object(); + private final RedisCommands<byte[], byte[]> sync; + private final String indexName; + private final String keyPrefix; + private final byte[] indexNameBytes; + private final int vectorDim; + private final double distanceThreshold; + private final long defaultTtlSeconds; + + public RedisSemanticCache( + StatefulRedisConnection<byte[], byte[]> connection, + String indexName, + String keyPrefix, + int vectorDim, + double distanceThreshold, + long defaultTtlSeconds) { + this.connection = connection; + this.sync = connection.sync(); + this.indexName = indexName; + this.keyPrefix = keyPrefix; + this.indexNameBytes = indexName.getBytes(StandardCharsets.UTF_8); + this.vectorDim = vectorDim; + this.distanceThreshold = distanceThreshold; + this.defaultTtlSeconds = defaultTtlSeconds; + } + + public String indexName() { + return indexName; + } + + public String keyPrefix() { + return keyPrefix; + } + + public int vectorDim() { + return vectorDim; + } + + public long defaultTtlSeconds() { + return defaultTtlSeconds; + } + + public double distanceThreshold() { + return distanceThreshold; + } + + // ------------------------------------------------------------------ + // Keys + // ------------------------------------------------------------------ + + public String entryKey(String entryId) { + return keyPrefix + entryId; + } + + private byte[] entryKeyBytes(String entryId) { + return entryKey(entryId).getBytes(StandardCharsets.UTF_8); + } + + // ------------------------------------------------------------------ + // Index management + // ------------------------------------------------------------------ + + /** + * Create the Redis Search index if it doesn't already exist. + * + *
<p>
One index covers the embedding plus every metadata field, so + * a single {@code FT.SEARCH} can pre-filter by tenant / locale / + * model and then KNN-rank the matching documents in one pass. + * The {@code prompt} and {@code response} fields are stored as + * {@code TEXT} so admin tooling can grep the cache by content, + * but the cache lookup itself is vector-only. + */ + public void createIndex() { + CommandArgs<byte[], byte[]> args = new CommandArgs<>(ByteArrayCodec.INSTANCE) + .add(indexNameBytes) + .add("ON").add("HASH") + .add("PREFIX").add(1).add(keyPrefix.getBytes(StandardCharsets.UTF_8)) + .add("SCHEMA") + .add("prompt").add("TEXT") + .add("response").add("TEXT") + .add("tenant").add("TAG") + .add("locale").add("TAG") + .add("model_version").add("TAG") + .add("safety").add("TAG") + .add("created_ts").add("NUMERIC").add("SORTABLE") + .add("hit_count").add("NUMERIC").add("SORTABLE") + .add("embedding").add("VECTOR").add("HNSW").add(6) + .add("TYPE").add("FLOAT32") + .add("DIM").add(vectorDim) + .add("DISTANCE_METRIC").add("COSINE"); + try { + sync.dispatch( + FtCommand.FT_CREATE, + new StatusOutput<>(ByteArrayCodec.INSTANCE), + args + ); + } catch (RedisException ex) { + if (!String.valueOf(ex.getMessage()).contains("Index already exists")) { + throw ex; + } + } + } + + /** Drop the search index. Optionally also delete cached entries. 
*/ + public void dropIndex(boolean deleteDocuments) { + CommandArgs<byte[], byte[]> args = new CommandArgs<>(ByteArrayCodec.INSTANCE) + .add(indexNameBytes); + if (deleteDocuments) { + args.add("DD"); + } + try { + sync.dispatch( + FtCommand.FT_DROPINDEX, + new StatusOutput<>(ByteArrayCodec.INSTANCE), + args + ); + } catch (RedisException ex) { + String msg = String.valueOf(ex.getMessage()).toLowerCase(Locale.ROOT); + if (!msg.contains("no such index") && !msg.contains("unknown index name")) { + throw ex; + } + } + } + + // ------------------------------------------------------------------ + // Lookup + // ------------------------------------------------------------------ + + /** + * Find the nearest in-scope cached prompt and decide hit / miss. + * + *
<p>
{@code FT.SEARCH} returns the single nearest entry that + * satisfies the TAG pre-filters. The lookup is a hit only if the + * reported cosine distance is at or below + * {@code distanceThreshold} (or the instance default). Anything + * further away is a miss with the candidate distance attached so + * the caller can log it. + * + *
<p>
On a hit, the entry's {@code hit_count} is incremented + * atomically with {@code HINCRBY} and the TTL is refreshed inside + * the same {@code MULTI/EXEC} so a frequently used answer doesn't + * age out under cold tail entries. + */ + public LookupResult lookup( + float[] queryVec, + String tenant, + String locale, + String modelVersion, + String safety, + Double distanceThreshold) { + + // Match the shape check that `put` performs. A wrong-dim + // vector would otherwise hit Redis as a malformed FT.SEARCH + // parameter and surface as a server-side parse error instead + // of a clear caller-side error. + if (queryVec.length != vectorDim) { + throw new IllegalArgumentException( + "queryVec length is " + queryVec.length + + "; index expects " + vectorDim + ); + } + + double threshold = distanceThreshold != null + ? distanceThreshold : this.distanceThreshold; + + String filterClause = buildFilterClause(tenant, locale, modelVersion, safety); + String knnQuery = filterClause + "=>[KNN 1 @embedding $vec AS distance]"; + byte[] vecBytes = LocalEmbedder.toBytes(queryVec); + + CommandArgs args = new CommandArgs<>(ByteArrayCodec.INSTANCE) + .add(indexNameBytes) + .add(knnQuery) + .add("RETURN").add(7) + .add("prompt").add("response").add("tenant").add("locale") + .add("model_version").add("hit_count").add("distance") + .add("SORTBY").add("distance").add("ASC") + .add("LIMIT").add(0).add(1) + .add("PARAMS").add(2).add("vec".getBytes(StandardCharsets.UTF_8)).add(vecBytes) + .add("DIALECT").add(2); + + List raw = sync.dispatch( + FtCommand.FT_SEARCH, + new NestedMultiOutput<>(ByteArrayCodec.INSTANCE), + args + ); + + SearchHit hit = parseFirstHit(raw); + if (hit == null) { + return new CacheMiss(null, null); + } + + String entryId = hit.id.startsWith(keyPrefix) + ? 
hit.id.substring(keyPrefix.length()) : hit.id; + double distance = parseDouble(hit.fields.get("distance"), 0.0); + + if (distance > threshold) { + return new CacheMiss(distance, entryId); + } + + // The hash may have expired between FT.SEARCH returning the + // row and us getting here — the search index lags expirations + // by its periodic scan. If we just blindly HINCRBY-ed, Redis + // would helpfully recreate the hash with only `hit_count` + // set and the search index would then log it as an indexing + // failure (no embedding, no metadata). EXISTS narrows that + // race to the pipeline round-trip; a strictly race-free + // version would wrap the bump in a Lua script that checks + // existence and acts in one server-side step. + byte[] entryKey = entryKeyBytes(entryId); + if (sync.exists(entryKey) == 0L) { + return new CacheMiss(distance, entryId); + } + + // MULTI/EXEC the three commands (two writes plus a TTL read) + // so they apply as a unit on the server — a partial failure + // between HINCRBY and EXPIRE would otherwise leave the entry + // without a refreshed TTL. In + // Lettuce sync mode the queued commands' return values are + // null; the real responses come back in order on the + // TransactionResult that exec() returns. The `txLock` + // synchronisation serialises this whole MULTI…EXEC block + // against any concurrent transaction on the same connection + // — see the `txLock` field comment for why. + TransactionResult txResult; + synchronized (txLock) { + sync.multi(); + sync.hincrby(entryKey, "hit_count".getBytes(StandardCharsets.UTF_8), 1); + sync.expire(entryKey, defaultTtlSeconds); + sync.ttl(entryKey); + txResult = sync.exec(); + } + if (txResult == null || txResult.wasDiscarded() || txResult.size() < 3) { + // Lettuce returns a discarded transaction on connection- + // level errors. Fall back to a single TTL read so the UI + // still gets a sensible number, but treat the bookkeeping + // as best-effort — the cached response is the load-bearing + // bit, not the hit_count. 
+ long ttlOnly = sync.ttl(entryKey); + return new CacheHit( + entryId, + nullSafe(hit.fields.get("prompt")), + nullSafe(hit.fields.get("response")), + nullSafe(hit.fields.get("tenant")), + nullSafe(hit.fields.get("locale")), + nullSafe(hit.fields.get("model_version")), + distance, + ttlOnly > 0 ? ttlOnly : defaultTtlSeconds, + parseLong(hit.fields.get("hit_count"), 0L) + ); + } + + long newHitCount = parseLong(txResult.get(0), 0L); + long ttl = parseLong(txResult.get(2), 0L); + + return new CacheHit( + entryId, + nullSafe(hit.fields.get("prompt")), + nullSafe(hit.fields.get("response")), + nullSafe(hit.fields.get("tenant")), + nullSafe(hit.fields.get("locale")), + nullSafe(hit.fields.get("model_version")), + distance, + ttl > 0 ? ttl : defaultTtlSeconds, + newHitCount + ); + } + + // ------------------------------------------------------------------ + // Write + // ------------------------------------------------------------------ + + /** + * Write a new cache entry and return its id. + * + *
<p>
The embedding is stored as raw little-endian float32 bytes — + * the encoding Redis Search expects from a {@code FLOAT32} vector + * field. {@code EXPIRE} on the key gives every entry a bounded + * lifetime; combine with an {@code allkeys-lfu} eviction policy + * on the database to cap memory under pressure too. + */ + public String put( + String prompt, + String response, + float[] embedding, + String tenant, + String locale, + String modelVersion, + String safety, + Long ttlSeconds, + String entryId) { + + if (embedding.length != vectorDim) { + throw new IllegalArgumentException( + "embedding length is " + embedding.length + + "; index expects " + vectorDim + ); + } + + String id = (entryId == null || entryId.isEmpty()) + ? UUID.randomUUID().toString().replace("-", "").substring(0, 12) + : entryId; + byte[] key = entryKeyBytes(id); + long ttl = ttlSeconds != null ? ttlSeconds : defaultTtlSeconds; + byte[] vecBytes = LocalEmbedder.toBytes(embedding); + + // Build a byte[]-keyed hash mapping. The ByteArrayCodec + // connection lets the binary embedding share an HSET call + // with the UTF-8 text fields, so a single round trip lands + // both halves of the entry on the server. + Map mapping = new LinkedHashMap<>(); + putUtf8(mapping, "prompt", prompt); + putUtf8(mapping, "response", response); + putUtf8(mapping, "tenant", tenant); + putUtf8(mapping, "locale", locale); + putUtf8(mapping, "model_version", modelVersion); + putUtf8(mapping, "safety", safety); + putUtf8(mapping, "created_ts", String.format(Locale.ROOT, "%.6f", + System.currentTimeMillis() / 1000.0)); + putUtf8(mapping, "hit_count", "0"); + mapping.put("embedding".getBytes(StandardCharsets.UTF_8), vecBytes); + + // MULTI/EXEC so HSET and EXPIRE either both apply or neither + // does. 
Without the transaction wrapper a connection drop + // between the two writes could leave the entry without a TTL + // and the cache would then keep an answer past its intended + // lifetime (or forever, on a database with no eviction + // policy). The `txLock` synchronisation prevents concurrent + // transactions on the shared Lettuce connection from + // interleaving — see the field comment. + TransactionResult txResult; + synchronized (txLock) { + sync.multi(); + sync.hset(key, mapping); + sync.expire(key, ttl); + txResult = sync.exec(); + } + if (txResult == null || txResult.wasDiscarded()) { + throw new RedisException("MULTI/EXEC for cache put was discarded"); + } + return id; + } + + private static void putUtf8(Map mapping, String field, String value) { + if (value == null) value = ""; + mapping.put( + field.getBytes(StandardCharsets.UTF_8), + value.getBytes(StandardCharsets.UTF_8) + ); + } + + // ------------------------------------------------------------------ + // Filter clause + // ------------------------------------------------------------------ + + static String escapeTagValue(String value) { + StringBuilder out = new StringBuilder(value.length()); + for (int i = 0; i < value.length(); i++) { + char ch = value.charAt(i); + if (TAG_SPECIAL.indexOf(ch) >= 0) { + out.append('\\'); + } + out.append(ch); + } + return out.toString(); + } + + static String buildFilterClause( + String tenant, String locale, String modelVersion, String safety) { + List clauses = new ArrayList<>(4); + if (tenant != null && !tenant.isEmpty()) { + clauses.add("@tenant:{" + escapeTagValue(tenant) + "}"); + } + if (locale != null && !locale.isEmpty()) { + clauses.add("@locale:{" + escapeTagValue(locale) + "}"); + } + if (modelVersion != null && !modelVersion.isEmpty()) { + clauses.add("@model_version:{" + escapeTagValue(modelVersion) + "}"); + } + if (safety != null && !safety.isEmpty()) { + clauses.add("@safety:{" + escapeTagValue(safety) + "}"); + } + if (clauses.isEmpty()) 
return "(*)"; + return "(" + String.join(" ", clauses) + ")"; + } + + // ------------------------------------------------------------------ + // Inspection / admin + // ------------------------------------------------------------------ + + /** Subset of {@code FT.INFO} useful for the demo UI. */ + public Map indexInfo() { + Map out = new HashMap<>(); + out.put("num_docs", 0L); + out.put("indexing_failures", 0L); + out.put("vector_index_size_mb", 0.0); + CommandArgs args = new CommandArgs<>(ByteArrayCodec.INSTANCE) + .add(indexNameBytes); + List raw; + try { + raw = sync.dispatch( + FtCommand.FT_INFO, + new NestedMultiOutput<>(ByteArrayCodec.INSTANCE), + args + ); + } catch (RedisException ignored) { + return out; + } + Map info = pairsToMap(raw); + out.put("num_docs", parseLong(info.get("num_docs"), 0L)); + out.put("indexing_failures", + parseLong(info.get("hash_indexing_failures"), 0L)); + out.put("vector_index_size_mb", + parseDouble(info.get("vector_index_sz_mb"), 0.0)); + return out; + } + + /** Return every cached entry (no embedding) for the admin UI. */ + public List> listEntries(int limit) { + CommandArgs args = new CommandArgs<>(ByteArrayCodec.INSTANCE) + .add(indexNameBytes) + .add("*") + .add("RETURN").add(8) + .add("prompt").add("response").add("tenant").add("locale") + .add("model_version").add("safety").add("created_ts").add("hit_count") + .add("SORTBY").add("created_ts").add("DESC") + .add("LIMIT").add(0).add(limit) + .add("DIALECT").add(2); + + List raw = sync.dispatch( + FtCommand.FT_SEARCH, + new NestedMultiOutput<>(ByteArrayCodec.INSTANCE), + args + ); + + List hits = parseAllHits(raw); + List> out = new ArrayList<>(hits.size()); + for (SearchHit hit : hits) { + String entryId = hit.id.startsWith(keyPrefix) + ? 
hit.id.substring(keyPrefix.length()) : hit.id; + long ttl = sync.ttl(entryKeyBytes(entryId)); + Map row = new HashMap<>(); + row.put("id", entryId); + row.put("prompt", nullSafe(hit.fields.get("prompt"))); + row.put("response", nullSafe(hit.fields.get("response"))); + row.put("tenant", nullSafe(hit.fields.get("tenant"))); + row.put("locale", nullSafe(hit.fields.get("locale"))); + row.put("model_version", nullSafe(hit.fields.get("model_version"))); + row.put("safety", nullSafe(hit.fields.get("safety"))); + row.put("hit_count", parseLong(hit.fields.get("hit_count"), 0L)); + row.put("ttl_seconds", ttl > 0 ? ttl : 0L); + row.put("created_ts", parseDouble(hit.fields.get("created_ts"), 0.0)); + out.add(row); + } + return out; + } + + /** Drop a single entry. Returns {@code true} if the key existed. */ + public boolean deleteEntry(String entryId) { + return sync.del(entryKeyBytes(entryId)) > 0L; + } + + /** + * Drop the index and every cached entry, then re-create the + * index. Returns the count of entries that were removed. + */ + public long clear() { + long before = (long) indexInfo().getOrDefault("num_docs", 0L); + dropIndex(true); + createIndex(); + return before; + } + + public StatefulRedisConnection connection() { + return connection; + } + + // ------------------------------------------------------------------ + // FT.SEARCH / FT.INFO parsing + // ------------------------------------------------------------------ + + /** A single FT.SEARCH row: the document key plus its field map. */ + private static final class SearchHit { + final String id; + final Map fields; + + SearchHit(String id, Map fields) { + this.id = id; + this.fields = fields; + } + } + + /** + * Parse the first hit from an FT.SEARCH reply. The RESP2 shape + * is [count, key1, [field, value, field, value, ...], key2, ...]. + * Returns {@code null} when count is 0 or the reply is shorter + * than expected. 
+ */ + private static SearchHit parseFirstHit(List<Object> raw) { + if (raw == null || raw.isEmpty()) return null; + long count = parseLong(raw.get(0), 0L); + if (count <= 0 || raw.size() < 3) return null; + String id = decode(raw.get(1)); + Map<String, String> fields = fieldsToMap(raw.get(2)); + return new SearchHit(id, fields); + } + + /** Parse every hit from an FT.SEARCH reply, preserving order. */ + private static List<SearchHit> parseAllHits(List<Object> raw) { + List<SearchHit> out = new ArrayList<>(); + if (raw == null || raw.size() < 3) return out; + // index 0 holds the total count; entries follow as + // (key, fields)* pairs. + for (int i = 1; i + 1 < raw.size(); i += 2) { + String id = decode(raw.get(i)); + Map<String, String> fields = fieldsToMap(raw.get(i + 1)); + out.add(new SearchHit(id, fields)); + } + return out; + } + + /** + * Turn the [field, value, field, value, ...] array inside an + * FT.SEARCH document into a {@code Map<String, String>}, decoding + * every entry as UTF-8. + */ + private static Map<String, String> fieldsToMap(Object array) { + Map<String, String> out = new HashMap<>(); + if (!(array instanceof List<?> list)) return out; + for (int i = 0; i + 1 < list.size(); i += 2) { + String field = decode(list.get(i)); + String value = decode(list.get(i + 1)); + out.put(field, value); + } + return out; + } + + /** + * FT.INFO returns a flat alternating [key, value, key, value, ...] + * list. Sub-values that are themselves arrays (e.g. {@code attributes}) + * are kept as the {@code List} {@code NestedMultiOutput} produced — + * the demo only needs scalar fields. 
+ */ + private static Map<String, Object> pairsToMap(List<Object> raw) { + Map<String, Object> out = new HashMap<>(); + if (raw == null) return out; + for (int i = 0; i + 1 < raw.size(); i += 2) { + String key = decode(raw.get(i)); + Object value = raw.get(i + 1); + if (value instanceof byte[] bytes) { + out.put(key, new String(bytes, StandardCharsets.UTF_8)); + } else if (value != null) { + out.put(key, value); + } + } + return out; + } + + private static String decode(Object value) { + if (value == null) return null; + if (value instanceof byte[] bytes) { + return new String(bytes, StandardCharsets.UTF_8); + } + if (value instanceof String s) return s; + return value.toString(); + } + + // ------------------------------------------------------------------ + // Helpers + // ------------------------------------------------------------------ + + private static String nullSafe(String s) { + return s == null ? "" : s; + } + + private static double parseDouble(Object value, double dflt) { + if (value == null) return dflt; + if (value instanceof Number n) return n.doubleValue(); + String s = (value instanceof byte[] bytes) + ? new String(bytes, StandardCharsets.UTF_8) + : value.toString(); + try { + return Double.parseDouble(s); + } catch (NumberFormatException ex) { + return dflt; + } + } + + private static long parseLong(Object value, long dflt) { + if (value == null) return dflt; + if (value instanceof Number n) return n.longValue(); + String s = (value instanceof byte[] bytes) + ? 
new String(bytes, StandardCharsets.UTF_8) + : value.toString(); + try { + return Long.parseLong(s); + } catch (NumberFormatException ex) { + try { + return (long) Double.parseDouble(s); + } catch (NumberFormatException ignored) { + return dflt; + } + } + } +} diff --git a/content/develop/use-cases/semantic-cache/java-lettuce/src/main/java/com/redis/semcache/SeedCache.java b/content/develop/use-cases/semantic-cache/java-lettuce/src/main/java/com/redis/semcache/SeedCache.java new file mode 100644 index 0000000000..4615a483cc --- /dev/null +++ b/content/develop/use-cases/semantic-cache/java-lettuce/src/main/java/com/redis/semcache/SeedCache.java @@ -0,0 +1,101 @@ +package com.redis.semcache; + +import java.util.ArrayList; +import java.util.List; + +/** + * Pre-seed the semantic cache with a handful of FAQ answers. + * + *
<p>
In a real deployment the cache fills up organically as users + * ask questions: a first-time question is a miss, the LLM answers, + * and the response is written back. To make the demo immediately + * useful — so the first query you type lands on a hit instead of a + * cold miss — we seed a small set of canonical prompts and their + * answers at startup. + * + *
<p>
The seed list mirrors the keyword table in {@link MockLLM} but + * stores the canonical phrasing of each question. + * Paraphrases of any of these prompts ("How do I return an item?", + * "Can I get a refund?") embed close to the canonical entry and the + * cache lookup serves the stored response without ever calling the + * model. + */ +public final class SeedCache { + + public record SeedEntry(String prompt, String response) {} + + public static final List SEED_ENTRIES = List.of( + new SeedEntry( + "What is your return policy?", + "You can return any unworn item within 30 days of delivery for " + + "a full refund. Start a return from your order page; we email " + + "a prepaid label and refund the original payment method within " + + "five business days of receiving the item." + ), + new SeedEntry( + "How long does shipping take?", + "Standard shipping is free on orders over $50 and arrives in " + + "three to five business days. Expedited two-day shipping is " + + "$9.99 and is available at checkout for in-stock items." + ), + new SeedEntry( + "How do I find my size?", + "We follow standard US sizing. For most styles we recommend " + + "ordering your usual size; the product page includes a sizing " + + "chart and customer fit notes for items that run small or " + + "large." + ), + new SeedEntry( + "Is there a warranty on your products?", + "All gear is covered by a one-year manufacturer warranty " + + "against defects in materials or workmanship. Email support " + + "with your order number and a photo of the issue and we will " + + "replace the item or issue a refund." + ), + new SeedEntry( + "How can I contact customer support?", + "You can reach our support team by email at help@example.com " + + "or by live chat from the help centre, 9am to 9pm Eastern, " + + "seven days a week. Most tickets get a first reply within two " + + "hours." 
+ ), + new SeedEntry( + "Where is my order?", + "Your tracking number is on the order confirmation email and " + + "on the order detail page once the package has been picked up " + + "by the carrier — typically within 24 hours of order " + + "placement." + ) + ); + + private SeedCache() {} + + /** Embed and write the seed list. Returns the number of entries seeded. */ + public static int seed( + RedisSemanticCache cache, + LocalEmbedder embedder, + String tenant, + String locale, + String modelVersion) throws Exception { + + List prompts = new ArrayList<>(SEED_ENTRIES.size()); + for (SeedEntry entry : SEED_ENTRIES) prompts.add(entry.prompt()); + List vectors = embedder.encodeMany(prompts); + + for (int i = 0; i < SEED_ENTRIES.size(); i++) { + SeedEntry entry = SEED_ENTRIES.get(i); + cache.put( + entry.prompt(), + entry.response(), + vectors.get(i), + tenant, + locale, + modelVersion, + "ok", + null, + null + ); + } + return SEED_ENTRIES.size(); + } +} diff --git a/content/develop/use-cases/semantic-cache/nodejs/_index.md b/content/develop/use-cases/semantic-cache/nodejs/_index.md new file mode 100644 index 0000000000..4cb171dd51 --- /dev/null +++ b/content/develop/use-cases/semantic-cache/nodejs/_index.md @@ -0,0 +1,259 @@ +--- +categories: +- docs +- develop +- stack +- oss +- rs +- rc +description: Build a Redis-backed semantic cache for LLM responses in Node.js with node-redis and @xenova/transformers +linkTitle: node-redis example (Node.js) +title: Redis semantic cache with node-redis +weight: 2 +--- + +This guide shows you how to build a small Redis-backed semantic cache for LLM responses in Node.js with [`node-redis`]({{< relref "/develop/clients/nodejs" >}}) and the [`@xenova/transformers`](https://www.npmjs.com/package/@xenova/transformers) library. 
It includes a local web server built with Node's standard `http` module so you can send paraphrased prompts at a mock LLM, watch the cache decide hit or miss, sweep the cosine-distance threshold, and see the cumulative latency and token savings build up. + +## Overview + +Each cache entry is stored as a single Redis [Hash]({{< relref "/develop/data-types/hashes" >}}) at `cache:`. The hash holds the original prompt, the LLM's response, the raw `float32` bytes of a 384-dimensional embedding of the prompt, and metadata fields — tenant, locale, model version, safety flag — plus a `created_ts` and a `hit_count`. A single [Redis Search]({{< relref "/develop/ai/search-and-query" >}}) index covers the embedding field and every metadata field, so one [`FT.SEARCH`]({{< relref "/commands/ft.search" >}}) call with a `KNN` clause does the vector lookup *and* the TAG pre-filter in the same round trip — no cross-store joins. + +The lookup is thresholded: [`FT.SEARCH`]({{< relref "/commands/ft.search" >}}) always returns the nearest entry that satisfies the filters, but the application only serves it as a hit when the reported cosine distance is at or below `distanceThreshold`. Anything further away is treated as a miss; the caller runs the LLM and writes the new prompt, response, and embedding back to the same key pattern with a TTL. + +The embedder is [`@xenova/transformers`](https://www.npmjs.com/package/@xenova/transformers) running the ONNX-exported [`Xenova/all-MiniLM-L6-v2`](https://huggingface.co/Xenova/all-MiniLM-L6-v2) model, which is the same encoder the [Python example]({{< relref "/develop/use-cases/semantic-cache/redis-py" >}}) uses. Embeddings produced by the two implementations are semantically equivalent — paraphrase distances differ only at the fourth decimal place — so a cache populated by one demo can be queried by the other against the same Redis instance. 
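The thresholded hit/miss decision described above can be sketched in a few lines of JavaScript. This is only an illustration under stated assumptions: `classifyCandidate` and the `{ id, distance }` candidate shape are hypothetical names invented for this sketch, not part of the demo's helper or of `node-redis`, and the `0.25` threshold is an arbitrary example value. The only rule taken from the text is that a candidate counts as a hit when its cosine distance is at or below the threshold.

```javascript
// Hypothetical helper (not the demo's API): decide hit or miss from the
// nearest candidate that FT.SEARCH returned, if there was one.
function classifyCandidate(candidate, distanceThreshold) {
  // No candidate in scope at all: a miss with nothing to log.
  if (candidate == null) {
    return { kind: 'miss', distance: null, entryId: null };
  }
  // Hit only when the cosine distance is at or below the threshold.
  if (candidate.distance <= distanceThreshold) {
    return { kind: 'hit', distance: candidate.distance, entryId: candidate.id };
  }
  // Too far away: a miss that still carries the candidate's distance,
  // so the caller can log how close the nearest entry was.
  return { kind: 'miss', distance: candidate.distance, entryId: candidate.id };
}

// Illustrative values only; 0.25 is an assumed threshold.
console.log(classifyCandidate({ id: 'abc123', distance: 0.12 }, 0.25).kind);
console.log(classifyCandidate({ id: 'abc123', distance: 0.40 }, 0.25).kind);
console.log(classifyCandidate(null, 0.25).kind);
```

Keeping the decision in a small pure function like this makes it easy to sweep the threshold against recorded distances without touching Redis.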
+ +That gives you: + +* A single round trip for lookup — vector KNN + metadata pre-filter in one [`FT.SEARCH`]({{< relref "/commands/ft.search" >}}). +* Tens of milliseconds on a hit vs. a multi-second LLM call on a miss; the embedding step is the bottleneck either way, and that's a model-side cost, not a Redis one. +* Tenant, locale, and model-version isolation enforced inside the query, not in application code — a write under one tenant cannot be served to another. +* Bounded memory: every entry has an [`EXPIRE`]({{< relref "/commands/expire" >}}) TTL, and a database-level [eviction policy]({{< relref "/develop/reference/eviction" >}}) (LRU / LFU) caps the cache size under pressure. + +## How it works + +A query goes through three stages: **embed**, **lookup**, and (on a miss) **call the LLM and write back**. + +### Hit path (the goal) + +1. The application calls `embedder.encodeOne(prompt)` to turn the incoming text into a 384-dimensional `Float32Array`. +2. `cache.lookup({ queryVec, tenant, locale, modelVersion })` runs [`FT.SEARCH`]({{< relref "/commands/ft.search" >}}) with a TAG pre-filter and a `KNN 1` clause. Redis returns the closest cached prompt that satisfies the filters along with its cosine distance. +3. If the distance is at or below the threshold, the cache returns a hit (`{ kind: 'hit', ... }`) containing the cached response. The helper also runs an [`HINCRBY`]({{< relref "/commands/hincrby" >}}) on `hit_count` and an [`EXPIRE`]({{< relref "/commands/expire" >}}) refresh inside a [`MULTI/EXEC`]({{< relref "/commands/multi" >}}), so a frequently used answer keeps its TTL and the demo UI can see which entries are load-bearing. +4. The LLM is not called at all. The application returns the cached response to the user. + +### Miss path + +When the distance is above the threshold — or there is no candidate in scope at all — the helper returns a miss instead, carrying the distance of the nearest candidate (if any) for logging. The application then: + +1. 
Calls the LLM with the prompt. +2. Calls `cache.put({ prompt, response, embedding, tenant, locale, modelVersion })`. The same embedding the lookup used is reused — no re-encode. The helper writes the Hash with [`HSET`]({{< relref "/commands/hset" >}}) and an [`EXPIRE`]({{< relref "/commands/expire" >}}) TTL inside a single [`MULTI/EXEC`]({{< relref "/commands/multi" >}}) so the entry never lands without a TTL on a partial failure. +3. Returns the LLM's response to the user. The next semantically similar prompt under the same metadata scope will be a hit. + +## The cache helper + +The `RedisSemanticCache` class wraps the Redis Search index and the lookup / write flow +([source](https://github.com/redis/docs/blob/main/content/develop/use-cases/semantic-cache/nodejs/cache.js)): + +```javascript +import { createClient } from 'redis'; +import { RedisSemanticCache } from './cache.js'; +import { LocalEmbedder } from './embeddings.js'; + +const client = createClient(); +await client.connect(); + +const cache = new RedisSemanticCache({ + client, + indexName: 'semcache:idx', + distanceThreshold: 0.5, // cosine distance, lower = stricter + defaultTtlSeconds: 3600, // one hour +}); +const embedder = await LocalEmbedder.create(); // Xenova/all-MiniLM-L6-v2 + +// One-time index setup (idempotent). +await cache.createIndex(); + +// 1) Embed the prompt. +const prompt = 'How do I return an item?'; +const queryVec = await embedder.encodeOne(prompt); + +// 2) Look up under a metadata scope. The TAG filter and the KNN +// travel together in one FT.SEARCH. +const result = await cache.lookup({ + queryVec, + tenant: 'acme', + locale: 'en', + modelVersion: 'gpt-4.5-2026', +}); + +let response; +if (result.kind === 'hit') { + response = result.response; + console.log(`hit (${result.distance.toFixed(3)}): ${response}`); +} else { + // 3a) Miss — call the LLM. (Use your real client here.) + response = await callLlm(prompt); + + // 3b) Cache the new entry. 
Reuses the same embedding bytes the + // lookup used, so we don't pay the encoder twice. + await cache.put({ + prompt, + response, + embedding: queryVec, + tenant: 'acme', + locale: 'en', + modelVersion: 'gpt-4.5-2026', + }); +} +``` + +### Data model + +Each cache entry is one Redis Hash. The vector field is raw little-endian `float32` bytes — no JSON wrapping — because the Redis Search vector encoding expects exactly that. Node's `Float32Array` is little-endian on every supported platform (x86_64, arm64), so a `Buffer.from(vec.buffer)` produces the exact bytes Redis Search reads. + +```text +cache:7c3f8a1b9e02 + prompt=How do I return an item? + response=You can return any unworn item within 30 days... + tenant=acme + locale=en + model_version=gpt-4.5-2026 + safety=ok + created_ts=1715990400.123 + hit_count=4 + embedding=<384 × float32 little-endian bytes> +``` + +The Redis Search index schema treats every field as queryable in its natural type: + +```text +FT.CREATE semcache:idx + ON HASH PREFIX 1 cache: + SCHEMA + prompt TEXT + response TEXT + tenant TAG + locale TAG + model_version TAG + safety TAG + created_ts NUMERIC SORTABLE + hit_count NUMERIC SORTABLE + embedding VECTOR HNSW 6 TYPE FLOAT32 DIM 384 DISTANCE_METRIC COSINE +``` + +### The query + +The lookup is a hybrid query: a TAG pre-filter expression in parentheses, then `=>[KNN 1 @embedding $vec]`. With `DIALECT: 2`, Redis applies the filter first and KNN-ranks only the matching documents. 
In `node-redis`: + +```javascript +const result = await client.ft.search( + 'semcache:idx', + '(@tenant:{acme} @locale:{en} @model_version:{gpt\\-4\\.5\\-2026} @safety:{ok})' + + '=>[KNN 1 @embedding $vec AS distance]', + { + PARAMS: { vec: Buffer.from(queryVec.buffer, queryVec.byteOffset, queryVec.byteLength) }, + DIALECT: 2, + SORTBY: 'distance', + RETURN: ['prompt', 'response', 'tenant', 'locale', + 'model_version', 'hit_count', 'distance'], + LIMIT: { from: 0, size: 1 }, + }, +); +``` + +The backslashes in the `model_version` value are doubled because a JavaScript string literal would otherwise drop them before the query reaches Redis. `distance` is the cosine *distance* (0 means identical, 2 means opposite). The result is sorted ascending, so the top row is the closest candidate. The application inspects `distance` against the threshold and decides hit or miss in user code — Redis returns the row either way, and treating it as a hit or a miss is a policy decision the cache helper owns, not a server-side filter. + +## The mock LLM + +To make the latency and token savings visible without requiring an API key, `mockLlm.js` provides a deterministic stand-in +([source](https://github.com/redis/docs/blob/main/content/develop/use-cases/semantic-cache/nodejs/mockLlm.js)): + +```javascript +import { MockLLM } from './mockLlm.js'; + +const llm = new MockLLM({ latencyMs: 1500.0 }); +const response = await llm.complete('What is your return policy?'); +// response.response — the templated answer text +// response.latencyMs — wall-clock time the call took +// response.totalTokens — estimated prompt + completion tokens +``` + +The mock sleeps for the configured latency, then keyword-matches against a small FAQ table to produce an answer. The deliberate slowness is what makes a hit visibly cheaper than a miss in the demo. In production code, you would replace `MockLLM` with your real client of choice — OpenAI's Node SDK, Anthropic's SDK, an internal vLLM endpoint, anything — without changing the cache helper. 
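A stand-in along those lines could look like the following. This is a sketch under stated assumptions, not the contents of `mockLlm.js`: the `SketchMockLLM` class, the `FAQ` table, the `FALLBACK` text, and the `pickAnswer` helper are hypothetical names made up for illustration. Only the `complete()` call and the `response` / `latencyMs` / `totalTokens` fields mirror the usage shown above, and the token estimate borrows the rough four-characters-per-token rule the demo server uses.

```javascript
// Hypothetical FAQ table: the first entry whose keyword appears wins.
const FAQ = [
  { keyword: 'return', answer: 'You can return any unworn item within 30 days.' },
  { keyword: 'order', answer: 'Your tracking number is on the order confirmation email.' },
];
const FALLBACK = 'Sorry, I do not have an answer for that yet.';

// Pure keyword match, kept separate so it can be checked without sleeping.
function pickAnswer(prompt) {
  const lower = prompt.toLowerCase();
  const match = FAQ.find(({ keyword }) => lower.includes(keyword));
  return match ? match.answer : FALLBACK;
}

class SketchMockLLM {
  constructor({ latencyMs = 1500 } = {}) {
    this.latencyMs = latencyMs;
  }

  async complete(prompt) {
    const started = performance.now();
    // Simulated model latency: this is what makes a cache hit visibly cheaper.
    await new Promise((resolve) => setTimeout(resolve, this.latencyMs));
    const response = pickAnswer(prompt);
    return {
      response,
      latencyMs: performance.now() - started,
      // Rough estimate: about four characters per token.
      totalTokens: Math.max(1, Math.floor((prompt.length + response.length) / 4)),
    };
  }
}

const sketchLlm = new SketchMockLLM({ latencyMs: 10 });
sketchLlm
  .complete('What is your return policy?')
  .then((reply) => console.log(reply.response));
```

Because the answer depends only on the prompt text, repeated runs give the same response, which keeps the hit and miss comparison in the demo reproducible.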
+ +## Pre-seeding the cache + +In a real deployment the cache fills up organically: a first-time question is a miss, the LLM answers, and the response is written back. For the demo, `seedCache.js` pre-loads a small set of canonical FAQ prompts so the very first query lands on a hit +([source](https://github.com/redis/docs/blob/main/content/develop/use-cases/semantic-cache/nodejs/seedCache.js)): + +```javascript +import { seed } from './seedCache.js'; + +await cache.createIndex(); +await seed(cache, embedder, { tenant: 'acme', locale: 'en' }); +``` + +The seed list stores the canonical phrasing of each question ("What is your return policy?"). Paraphrases of any of these prompts ("How do I return an item?", "Can I get a refund?") embed close to the canonical entry, so the cache lookup serves the stored response without ever calling the model. + +## The interactive demo + +`demoServer.js` runs a Node `http` server. The HTML page lets you: + +* Type a prompt and toggle metadata: tenant, locale, model version. Each combination is a separate cache namespace inside the same index. +* Slide the cosine-distance threshold and see hits flip to misses (and back) on the same prompt, with the actual distance reported on each query. +* Submit with **Ask** to run the full hit-or-miss path (calls the LLM on a miss, writes the answer back). Submit with **Lookup only (no LLM)** to sweep the threshold against a fixed prompt without polluting the cache. +* Watch the cumulative panel build up: total queries, cache hits, cache misses, hit ratio, tokens not spent, LLM milliseconds not waited. +* Inspect every cached entry, including remaining TTL and total hit count, and drop individual entries to simulate eviction. + +The server holds one `LocalEmbedder`, one `RedisSemanticCache`, and one `MockLLM` for the lifetime of the process. The HTML page is shared with the Python demo and is loaded from `index.html` next to `demoServer.js`. 
Endpoints: + +| Endpoint | What it does | +|-----------------|-------------------------------------------------------------------------------| +| `GET /state` | Index info and the full list of cached entries. | +| `POST /query` | Embed the prompt, run `FT.SEARCH`, on miss call the LLM and write back. | +| `POST /reset` | Drop every cached entry and re-seed from the FAQ list. | +| `POST /drop` | Delete a single cached entry by id. | + +## Run the demo locally + +1. Clone the [`redis/docs`](https://github.com/redis/docs) repository and change into the example + directory: + + ```bash + git clone https://github.com/redis/docs.git + cd docs/content/develop/use-cases/semantic-cache/nodejs + ``` + +2. Install the dependencies: + + ```bash + npm install + ``` + +3. Make sure a Redis instance with the Redis Search module is running locally on + port 6379. [Redis Stack]({{< relref "/operate/oss_and_stack/install/install-stack" >}}) or + [Redis 8 with Search]({{< relref "/develop/ai/search-and-query" >}}) both work. + +4. Start the demo server. The first run downloads the ONNX-exported + `Xenova/all-MiniLM-L6-v2` model into the local Hugging Face cache: + + ```bash + npm start + ``` + +5. Open and try some queries: + + * **"What is your return policy?"** — exact match against the seed, distance ≈ 0, + hit at any threshold. + * **"How fast is delivery?"** — paraphrase of the shipping seed; distance + around 0.30, hit at the default threshold of 0.5. + * **"How do I return an item?"** — slightly looser paraphrase of the returns + seed; distance around 0.49, still a hit at the default threshold. Slide + the threshold down to 0.4 to see this one flip to a miss. + * **"What payment methods do you accept?"** — unrelated to anything in the + seed; distance > 0.8, so you'll see a miss, the mock LLM kicks in for + ~1.5 s, the new answer is cached, and a follow-up of the same question + is now an immediate hit. 
+ * Switch the **Tenant** dropdown to `globex` or `initech` and re-ask any + seeded question — the result flips to a miss because the cache entries + live under `acme`. That's the metadata pre-filter at work inside `FT.SEARCH`. + +The server is read/write against your local Redis. The default index name is `semcache:idx` and entry keys live under `cache:`. Flags mirror the Python demo: `--no-reset` to keep an existing cache across restarts, `--threshold` to change the default cosine-distance cutoff, or `--llm-latency-ms` to make the mock LLM faster or slower for the demo. diff --git a/content/develop/use-cases/semantic-cache/nodejs/cache.js b/content/develop/use-cases/semantic-cache/nodejs/cache.js new file mode 100644 index 0000000000..003137fc53 --- /dev/null +++ b/content/develop/use-cases/semantic-cache/nodejs/cache.js @@ -0,0 +1,360 @@ +// Redis semantic-cache helper backed by Redis Search. +// +// Each cache entry lives as a Hash document at `cache:`. The hash +// stores the user's prompt and the corresponding LLM response +// alongside the raw float32 bytes of the prompt's 384-dimensional +// embedding and a small set of metadata fields — tenant, locale, +// model version, and a safety flag. +// +// A single Redis Search index covers the embedding plus every +// metadata field, so one `FT.SEARCH` call does an +// approximate-nearest-neighbour lookup against the cached prompts +// with a TAG pre-filter applied in the same pass — no cross-store +// joins, no extra round trips, and tenant isolation is enforced +// *inside* the query rather than after the fact in application code. +// +// The lookup is thresholded: `FT.SEARCH` always returns the closest +// cached prompt, but the cache only serves it as a hit when the +// cosine distance is at or below `distanceThreshold`. Anything +// further away is treated as a miss; the caller is expected to run +// the underlying LLM and write the new prompt, response, and +// embedding back with `put`. 
+// +// Each cache entry is written with `EXPIRE`, so stale answers age out +// without manual cleanup; combine with an `allkeys-lfu` eviction +// policy on the database to cap memory under pressure too. + +import { randomUUID } from 'node:crypto'; +import { + SCHEMA_FIELD_TYPE, + SCHEMA_VECTOR_FIELD_ALGORITHM, +} from 'redis'; + +const VECTOR_DIM_DEFAULT = 384; + +export class RedisSemanticCache { + constructor({ + client, + indexName = 'semcache:idx', + keyPrefix = 'cache:', + vectorDim = VECTOR_DIM_DEFAULT, + distanceThreshold = 0.5, + defaultTtlSeconds = 3600, + }) { + this.client = client; + this.indexName = indexName; + this.keyPrefix = keyPrefix; + this.vectorDim = vectorDim; + this.distanceThreshold = distanceThreshold; + this.defaultTtlSeconds = defaultTtlSeconds; + } + + // -- Keys ----------------------------------------------------------- + + entryKey(entryId) { + return `${this.keyPrefix}${entryId}`; + } + + // -- Index management ---------------------------------------------- + + async createIndex() { + // One index covers the embedding plus every metadata field, so a + // single FT.SEARCH can pre-filter by tenant / locale / model and + // then KNN-rank the matching documents in one pass. 
+ const schema = { + prompt: { type: SCHEMA_FIELD_TYPE.TEXT }, + response: { type: SCHEMA_FIELD_TYPE.TEXT }, + tenant: { type: SCHEMA_FIELD_TYPE.TAG }, + locale: { type: SCHEMA_FIELD_TYPE.TAG }, + model_version: { type: SCHEMA_FIELD_TYPE.TAG }, + safety: { type: SCHEMA_FIELD_TYPE.TAG }, + created_ts: { type: SCHEMA_FIELD_TYPE.NUMERIC, SORTABLE: true }, + hit_count: { type: SCHEMA_FIELD_TYPE.NUMERIC, SORTABLE: true }, + embedding: { + type: SCHEMA_FIELD_TYPE.VECTOR, + ALGORITHM: SCHEMA_VECTOR_FIELD_ALGORITHM.HNSW, + TYPE: 'FLOAT32', + DIM: this.vectorDim, + DISTANCE_METRIC: 'COSINE', + }, + }; + try { + await this.client.ft.create(this.indexName, schema, { + ON: 'HASH', + PREFIX: this.keyPrefix, + }); + } catch (err) { + if (!String(err.message || err).includes('Index already exists')) { + throw err; + } + } + } + + async dropIndex({ deleteDocuments = false } = {}) { + try { + await this.client.ft.dropIndex(this.indexName, { DD: deleteDocuments }); + } catch (err) { + const msg = String(err.message || err).toLowerCase(); + if (!msg.includes('no such index') && !msg.includes('unknown index name')) { + throw err; + } + } + } + + // -- Lookup --------------------------------------------------------- + + // Returns either { kind: 'hit', ...fields } or { kind: 'miss', nearestDistance, nearestId }. + async lookup({ + queryVec, + tenant, + locale, + modelVersion, + safety = 'ok', + distanceThreshold, + }) { + // Match the shape check that `put` performs. A wrong-dim vector + // would otherwise hit Redis as a malformed FT.SEARCH parameter + // and surface as a server-side parse error instead of a clear + // caller-side error. We also coerce to Float32Array so a + // Float64Array view of length 384 (which would send 3072 bytes + // to a FLOAT32 DIM 384 field) is silently narrowed to float32 + // rather than producing a corrupt query. 
+ if (!(queryVec instanceof Float32Array)) { + queryVec = Float32Array.from(queryVec); + } + if (queryVec.length !== this.vectorDim) { + throw new Error( + `queryVec length is ${queryVec.length}; index expects ${this.vectorDim}`, + ); + } + + const threshold = + distanceThreshold !== undefined ? distanceThreshold : this.distanceThreshold; + + const filterClause = RedisSemanticCache.buildFilterClause({ + tenant, locale, modelVersion, safety, + }); + + const queryStr = `${filterClause}=>[KNN 1 @embedding $vec AS distance]`; + const vecBytes = Buffer.from(queryVec.buffer, queryVec.byteOffset, queryVec.byteLength); + + const result = await this.client.ft.search(this.indexName, queryStr, { + PARAMS: { vec: vecBytes }, + DIALECT: 2, + SORTBY: 'distance', + RETURN: [ + 'prompt', 'response', 'tenant', 'locale', + 'model_version', 'hit_count', 'distance', + ], + LIMIT: { from: 0, size: 1 }, + }); + + if (!result.documents || result.documents.length === 0) { + return { kind: 'miss', nearestDistance: null, nearestId: null }; + } + + const doc = result.documents[0]; + const rawKey = doc.id; + const entryId = rawKey.startsWith(this.keyPrefix) + ? rawKey.slice(this.keyPrefix.length) + : rawKey; + const distance = parseFloat(doc.value.distance ?? '0') || 0; + + if (distance > threshold) { + return { kind: 'miss', nearestDistance: distance, nearestId: entryId }; + } + + // The hash may have expired between FT.SEARCH returning the row + // and us getting here — the search index lags expirations by its + // periodic scan. If we just blindly HINCRBY-ed, Redis would + // helpfully recreate the hash with only `hit_count` set and the + // search index would then log it as an indexing failure (no + // embedding, no metadata). EXISTS narrows that race to the + // pipeline round-trip; a strictly race-free version would wrap + // the bump in a Lua script that checks existence and acts in one + // server-side step. 
+ const entryKey = this.entryKey(entryId); + const exists = await this.client.exists(entryKey); + if (!exists) { + return { kind: 'miss', nearestDistance: distance, nearestId: entryId }; + } + + // MULTI/EXEC the three writes so they apply as a unit on the + // server — a partial failure between HINCRBY and EXPIRE would + // otherwise leave the entry without a refreshed TTL. + const replies = await this.client.multi() + .hIncrBy(entryKey, 'hit_count', 1) + .expire(entryKey, this.defaultTtlSeconds) + .ttl(entryKey) + .exec(); + const [newHitCount, , ttl] = replies; + + return { + kind: 'hit', + id: entryId, + prompt: doc.value.prompt ?? '', + response: doc.value.response ?? '', + tenant: doc.value.tenant ?? '', + locale: doc.value.locale ?? '', + modelVersion: doc.value.model_version ?? '', + distance, + ttlSeconds: ttl > 0 ? ttl : this.defaultTtlSeconds, + hitCount: Number(newHitCount), + }; + } + + // -- Write ---------------------------------------------------------- + + async put({ + prompt, + response, + embedding, + tenant = 'default', + locale = 'en', + modelVersion = 'gpt-4.5-2026', + safety = 'ok', + ttlSeconds, + entryId, + }) { + // Coerce any array-like (Float64Array, plain Array, etc.) to + // Float32Array so byteLength is always exactly vectorDim * 4 — + // the only encoding Redis Search accepts for a FLOAT32 vector + // field. + if (!(embedding instanceof Float32Array)) { + embedding = Float32Array.from(embedding); + } + if (embedding.length !== this.vectorDim) { + throw new Error( + `embedding length is ${embedding.length}; index expects ${this.vectorDim}`, + ); + } + + const id = entryId || randomUUID().replace(/-/g, '').slice(0, 12); + const key = this.entryKey(id); + const ttl = ttlSeconds !== undefined ? ttlSeconds : this.defaultTtlSeconds; + const vecBytes = Buffer.from( + embedding.buffer, embedding.byteOffset, embedding.byteLength, + ); + + // MULTI/EXEC so HSET and EXPIRE either both apply or neither does. 
+ // Without the transaction wrapper a connection drop between the + // two writes could leave the entry without a TTL and the cache + // would then keep an answer past its intended lifetime (or + // forever, on a database with no eviction policy). + await this.client.multi() + .hSet(key, { + prompt, + response, + tenant, + locale, + model_version: modelVersion, + safety, + created_ts: String(Date.now() / 1000), + hit_count: '0', + embedding: vecBytes, + }) + .expire(key, ttl) + .exec(); + return id; + } + + // -- Filter clause ------------------------------------------------- + + // Characters Redis Search treats as syntax inside a TAG value; any + // of them in a user-supplied filter must be backslash-escaped or + // the surrounding `{...}` block won't parse correctly. + static _TAG_SPECIAL = new Set('\\,.<>{}[]"\':;!@#$%^&*()-+=~| '.split('')); + + static escapeTagValue(value) { + let out = ''; + for (const ch of value) { + out += RedisSemanticCache._TAG_SPECIAL.has(ch) ? '\\' + ch : ch; + } + return out; + } + + static buildFilterClause({ tenant, locale, modelVersion, safety }) { + const clauses = []; + if (tenant) { + clauses.push(`@tenant:{${RedisSemanticCache.escapeTagValue(tenant)}}`); + } + if (locale) { + clauses.push(`@locale:{${RedisSemanticCache.escapeTagValue(locale)}}`); + } + if (modelVersion) { + clauses.push(`@model_version:{${RedisSemanticCache.escapeTagValue(modelVersion)}}`); + } + if (safety) { + clauses.push(`@safety:{${RedisSemanticCache.escapeTagValue(safety)}}`); + } + return clauses.length === 0 ? '(*)' : `(${clauses.join(' ')})`; + } + + // -- Inspection / admin -------------------------------------------- + + async indexInfo() { + try { + const info = await this.client.ft.info(this.indexName); + return { + num_docs: Number(info.numDocs ?? info.num_docs ?? 0), + indexing_failures: Number( + info.hashIndexingFailures ?? info.hash_indexing_failures ?? 0, + ), + vector_index_size_mb: Number( + info.vectorIndexSzMb ?? 
info.vector_index_sz_mb ?? 0, + ), + }; + } catch (err) { + return { num_docs: 0, indexing_failures: 0, vector_index_size_mb: 0.0 }; + } + } + + async listEntries({ limit = 100 } = {}) { + const result = await this.client.ft.search(this.indexName, '*', { + RETURN: [ + 'prompt', 'response', 'tenant', 'locale', + 'model_version', 'safety', 'created_ts', 'hit_count', + ], + LIMIT: { from: 0, size: limit }, + SORTBY: { BY: 'created_ts', DIRECTION: 'DESC' }, + }); + + const out = []; + for (const doc of result.documents) { + const rawKey = doc.id; + const entryId = rawKey.startsWith(this.keyPrefix) + ? rawKey.slice(this.keyPrefix.length) + : rawKey; + const ttl = await this.client.ttl(this.entryKey(entryId)); + out.push({ + id: entryId, + prompt: doc.value.prompt ?? '', + response: doc.value.response ?? '', + tenant: doc.value.tenant ?? '', + locale: doc.value.locale ?? '', + model_version: doc.value.model_version ?? '', + safety: doc.value.safety ?? '', + hit_count: Number(doc.value.hit_count ?? 0), + ttl_seconds: ttl > 0 ? ttl : 0, + created_ts: Number(doc.value.created_ts ?? 0), + }); + } + return out; + } + + async deleteEntry(entryId) { + const deleted = await this.client.del(this.entryKey(entryId)); + return deleted > 0; + } + + async clear() { + // Returns the number of entries that were removed. Used by the + // demo's "reset" button — in production the equivalent is just + // FLUSHDB on a dedicated cache database, or letting TTLs expire + // naturally. + const before = (await this.indexInfo()).num_docs; + await this.dropIndex({ deleteDocuments: true }); + await this.createIndex(); + return before; + } +} diff --git a/content/develop/use-cases/semantic-cache/nodejs/demoServer.js b/content/develop/use-cases/semantic-cache/nodejs/demoServer.js new file mode 100644 index 0000000000..c148d8f406 --- /dev/null +++ b/content/develop/use-cases/semantic-cache/nodejs/demoServer.js @@ -0,0 +1,378 @@ +#!/usr/bin/env node +// Redis semantic-cache demo server (Node.js). 
+// +// Run this file and visit http://localhost:8087 to drive a small +// semantic-cache demo backed by Redis Search. The UI lets you: +// +// * Type a natural-language prompt and watch the cache decide hit or +// miss. On a hit Redis returns the cached response in tens of +// milliseconds and the demo LLM is not called at all; on a miss the +// demo LLM "thinks" for ~1.5 s before answering and the new prompt, +// response, and embedding are written back to Redis for next time. +// * Adjust the cosine-distance threshold to see how close a paraphrase +// must be for the cache to serve it. +// * Switch tenant, locale, or model version to see metadata isolation +// in action — entries written under one tenant cannot be served to +// another, because the TAG filter goes into the same `FT.SEARCH` +// call as the KNN. +// * Inspect every cached entry with TTL and hit count, and drop +// individual entries to simulate eviction. +// +// The server holds a single `LocalEmbedder`, a single +// `RedisSemanticCache`, and a single `MockLLM` for the lifetime of +// the process. The first run downloads the embedding model into the +// local Hugging Face cache; everything after is local. 
+ +import { createServer } from 'node:http'; +import { readFile } from 'node:fs/promises'; +import { fileURLToPath } from 'node:url'; +import { dirname, join } from 'node:path'; +import { createClient } from 'redis'; +import { parseArgs } from 'node:util'; + +import { RedisSemanticCache } from './cache.js'; +import { LocalEmbedder } from './embeddings.js'; +import { MockLLM } from './mockLlm.js'; +import { seed } from './seedCache.js'; + +const HERE = dirname(fileURLToPath(import.meta.url)); + +class SemanticCacheDemo { + constructor({ cache, embedder, llm, defaultTenant = 'acme', defaultLocale = 'en' }) { + this.cache = cache; + this.embedder = embedder; + this.llm = llm; + this.defaultTenant = defaultTenant; + this.defaultLocale = defaultLocale; + } + + // Drop everything in scope and pre-populate with FAQ entries. + async seed() { + await this.cache.clear(); + return await seed(this.cache, this.embedder, { + tenant: this.defaultTenant, + locale: this.defaultLocale, + modelVersion: this.llm.modelVersion, + }); + } + + // The hot path: embed, look up, optionally call the LLM, cache. + // + // Timings are taken with `performance.now()` around each bounded + // step so the UI can display the embed / lookup / LLM breakdown + // separately. The cache write on a miss is *not* included in + // `total_ms` so the latency number reflects the user-facing wait, + // not the background bookkeeping. 
+ async runQuery({ prompt, tenant, locale, modelVersion, threshold, lookupOnly }) { + const t0 = performance.now(); + const queryVec = await this.embedder.encodeOne(prompt); + const embedMs = performance.now() - t0; + + const t1 = performance.now(); + const result = await this.cache.lookup({ + queryVec, tenant, locale, modelVersion, + distanceThreshold: threshold, + }); + const lookupMs = performance.now() - t1; + + if (result.kind === 'hit') { + return { + outcome: 'hit', + response: result.response, + entry_id: result.id, + distance: result.distance, + ttl_seconds: result.ttlSeconds, + hit_count: result.hitCount, + threshold, + embed_ms: embedMs, + lookup_ms: lookupMs, + llm_ms: null, + total_ms: embedMs + lookupMs, + tokens_avoided: estimateResponseTokens(result.prompt, result.response), + ms_avoided: this.llm.latencyMs, + }; + } + + // Miss path. In "lookup only" mode the demo reports the miss + // without actually calling the LLM — useful for sweeping the + // threshold against a fixed prompt to see where the cutoff would + // fall without polluting the cache. + if (lookupOnly) { + return { + outcome: 'miss', + response: '(LLM not called in lookup-only mode)', + nearest_distance: result.nearestDistance, + threshold, + wrote_entry_id: null, + embed_ms: embedMs, + lookup_ms: lookupMs, + llm_ms: null, + total_ms: embedMs + lookupMs, + }; + } + + const t2 = performance.now(); + const llmResponse = await this.llm.complete(prompt); + const llmMs = performance.now() - t2; + + // Write the new entry back. The embedding is the same vector we + // already used for the lookup — no need to re-encode. 
+ const entryId = await this.cache.put({ + prompt, + response: llmResponse.response, + embedding: queryVec, + tenant, locale, modelVersion, + }); + + return { + outcome: 'miss', + response: llmResponse.response, + nearest_distance: result.nearestDistance, + threshold, + wrote_entry_id: entryId, + embed_ms: embedMs, + lookup_ms: lookupMs, + llm_ms: llmMs, + total_ms: embedMs + lookupMs + llmMs, + }; + } +} + +function estimateResponseTokens(prompt, response) { + return Math.max(1, Math.floor((prompt.length + response.length) / 4)); +} + +// ---- HTTP plumbing -------------------------------------------------- + +function sendJson(res, payload, status = 200) { + res.writeHead(status, { 'Content-Type': 'application/json' }); + res.end(JSON.stringify(payload)); +} + +function sendHtml(res, html, status = 200) { + res.writeHead(status, { 'Content-Type': 'text/html; charset=utf-8' }); + res.end(html); +} + +// Cap POST bodies so a runaway client (or, more realistically, a +// curl --data-binary @big-file by mistake) can't accumulate +// unbounded memory before the handler runs. The demo's largest +// legitimate body is a few hundred bytes of form-encoded query +// fields; 1 MiB is a generous ceiling. +const MAX_BODY_BYTES = 1 * 1024 * 1024; + +async function readBody(req) { + return new Promise((resolve, reject) => { + const chunks = []; + let total = 0; + req.on('data', c => { + total += c.length; + if (total > MAX_BODY_BYTES) { + req.destroy(); + reject(new Error(`request body exceeds ${MAX_BODY_BYTES} bytes`)); + return; + } + chunks.push(c); + }); + req.on('end', () => resolve(Buffer.concat(chunks).toString('utf-8'))); + req.on('error', reject); + }); +} + +function parseForm(body) { + const params = new URLSearchParams(body); + const result = {}; + for (const [k, v] of params) result[k] = v; + return result; +} + +function clampThreshold(raw) { + const parsed = parseFloat(raw); + // `parseFloat` returns NaN for unparseable input such as "nan" or + // "inf", and Infinity for "Infinity". 
Either + // would silently turn the lookup into a permanent hit (NaN + // comparisons are always false, so `distance > nan` cannot reject) + // or a permanent miss. Clamp to the meaningful cosine-distance + // range so a malformed POST can't override the threshold semantics. + if (!Number.isFinite(parsed)) return 0.5; + return Math.max(0.0, Math.min(2.0, parsed)); +} + +function buildState(cache, embedder, llm, stackLabel) { + // Returns the same shape the Python demo serves so the shared HTML + // works without modification. ``default_threshold`` is what the + // ``--threshold`` flag actually configures; the UI slider + // initialises to this on first load so the flag visibly changes + // the demo's behaviour. ``stack_label`` lets the same HTML render + // a per-language badge (redis-py, node-redis, etc.) without + // forking the file per language. + return (async () => { + const info = await cache.indexInfo(); + return { + index: { + ...info, + index_name: cache.indexName, + model: embedder.modelName, + mock_llm_latency_ms: llm.latencyMs, + default_threshold: cache.distanceThreshold, + stack_label: stackLabel, + }, + entries: await cache.listEntries({ limit: 200 }), + }; + })(); +} + +function makeHandler({ cache, embedder, llm, demo, htmlPage, stackLabel }) { + return async (req, res) => { + try { + const url = new URL(req.url, 'http://localhost'); + if (req.method === 'GET') { + if (url.pathname === '/' || url.pathname === '/index.html') { + return sendHtml(res, htmlPage); + } + if (url.pathname === '/state') { + return sendJson(res, await buildState(cache, embedder, llm, stackLabel)); + } + return sendJson(res, { error: 'not found' }, 404); + } + if (req.method === 'POST') { + const body = await readBody(req); + const params = parseForm(body); + + if (url.pathname === '/query') { + const prompt = (params.prompt || '').trim(); + if (!prompt) return sendJson(res, { error: 'prompt is required' }, 400); + const payload = await demo.runQuery({ + prompt, + tenant: 
params.tenant || 'acme', + locale: params.locale || 'en', + modelVersion: params.model_version || llm.modelVersion, + threshold: clampThreshold(params.threshold ?? '0.5'), + lookupOnly: !!params.lookup_only, + }); + return sendJson(res, payload); + } + if (url.pathname === '/reset') { + await demo.seed(); + return sendJson(res, { ok: true }); + } + if (url.pathname === '/drop') { + const entryId = (params.entry_id || '').trim(); + if (!entryId) return sendJson(res, { error: 'entry_id is required' }, 400); + const deleted = await cache.deleteEntry(entryId); + return sendJson(res, { deleted, entry_id: entryId }); + } + return sendJson(res, { error: 'not found' }, 404); + } + return sendJson(res, { error: 'method not allowed' }, 405); + } catch (exc) { + // Without this wrapper, an exception escapes to the default + // Node error handler and the client's `await res.json()` + // explodes with an opaque parse error instead of surfacing + // what actually went wrong. + process.stderr.write(`[demo] handler error: ${exc?.stack || exc}\n`); + try { + sendJson(res, { error: String(exc?.message || exc), type: exc?.name || 'Error' }, 500); + } catch { + // Headers may already be partially flushed; nothing useful + // left to do beyond letting the connection drop. 
+ } + } + }; +} + +// ---- Main ----------------------------------------------------------- + +function parseFlags() { + const { values } = parseArgs({ + options: { + host: { type: 'string', default: '127.0.0.1' }, + port: { type: 'string', default: '8087' }, + 'redis-host': { type: 'string', default: 'localhost' }, + 'redis-port': { type: 'string', default: '6379' }, + 'index-name': { type: 'string', default: 'semcache:idx' }, + 'key-prefix': { type: 'string', default: 'cache:' }, + 'ttl-seconds': { type: 'string', default: '3600' }, + threshold: { type: 'string', default: '0.5' }, + 'llm-latency-ms': { type: 'string', default: '1500' }, + 'no-reset': { type: 'boolean', default: false }, + }, + }); + return values; +} + +async function main() { + const args = parseFlags(); + const port = Number(args.port); + const redisHost = args['redis-host']; + const redisPort = Number(args['redis-port']); + const indexName = args['index-name']; + const keyPrefix = args['key-prefix']; + const ttlSeconds = Number(args['ttl-seconds']); + const threshold = Number(args.threshold); + const llmLatencyMs = Number(args['llm-latency-ms']); + const resetOnStart = !args['no-reset']; + + const client = createClient({ socket: { host: redisHost, port: redisPort } }); + client.on('error', err => console.error('[redis]', err)); + try { + await client.connect(); + await client.ping(); + } catch (exc) { + console.error(`Error: cannot reach Redis at ${redisHost}:${redisPort}`); + console.error(` (${exc.message || exc})`); + process.exit(1); + } + + const cache = new RedisSemanticCache({ + client, + indexName, + keyPrefix, + distanceThreshold: threshold, + defaultTtlSeconds: ttlSeconds, + }); + await cache.createIndex(); + + console.log('Loading embedding model (first run downloads the ONNX weights)...'); + const embedder = await LocalEmbedder.create(); + const llm = new MockLLM({ latencyMs: llmLatencyMs }); + + const demo = new SemanticCacheDemo({ cache, embedder, llm }); + if (resetOnStart) { + 
console.log( + `Dropping any existing cache under '${keyPrefix}*' and ` + + 're-seeding from the FAQ list (pass --no-reset to keep).', + ); + const seeded = await demo.seed(); + console.log(`Seeded ${seeded} entries.`); + } + + // Load the HTML once and replace the template tokens with the + // configured index name and key prefix so the docs panel shows the + // actual values in use rather than the default copies. + const rawHtml = await readFile(join(HERE, 'index.html'), 'utf-8'); + const htmlPage = rawHtml + .replaceAll('__INDEX_NAME__', indexName) + .replaceAll('__KEY_PREFIX__', keyPrefix); + + const stackLabel = 'node-redis + @xenova/transformers + Node.js standard library HTTP server'; + const server = createServer(makeHandler({ cache, embedder, llm, demo, htmlPage, stackLabel })); + server.listen(port, args.host, () => { + console.log(`Redis semantic cache demo listening on http://${args.host}:${port}`); + console.log(`Using Redis at ${redisHost}:${redisPort} with index '${indexName}'`); + }); + + // Clean shutdown so the Redis client closes its socket. + const shutdown = async (signal) => { + console.log(`\nReceived ${signal}, shutting down...`); + server.close(); + try { await client.disconnect(); } catch {} + process.exit(0); + }; + process.on('SIGINT', () => shutdown('SIGINT')); + process.on('SIGTERM', () => shutdown('SIGTERM')); +} + +main().catch(err => { + console.error(err); + process.exit(1); +}); diff --git a/content/develop/use-cases/semantic-cache/nodejs/embeddings.js b/content/develop/use-cases/semantic-cache/nodejs/embeddings.js new file mode 100644 index 0000000000..77ac76d748 --- /dev/null +++ b/content/develop/use-cases/semantic-cache/nodejs/embeddings.js @@ -0,0 +1,75 @@ +// Local text-embedding helper backed by @xenova/transformers. 
+// +// This is a thin wrapper around the ONNX-exported sentence-transformers +// model `Xenova/all-MiniLM-L6-v2`: a 384-dimensional encoder that runs +// in-process on CPU through ONNX Runtime, needs no API key, and +// produces vectors that are numerically very close to the equivalent +// PyTorch model (close enough that paraphrase distances differ only at +// the fourth decimal place; see the smoke-test in the README). +// +// Vectors are L2-normalised so a Redis Search index declared with +// `DISTANCE_METRIC COSINE` returns scores that are directly comparable +// across entries. The model is downloaded into the local Hugging Face +// cache on the first call; every later call runs offline. + +import { env, pipeline } from '@xenova/transformers'; + +// Allow the local cache to satisfy subsequent runs without re-downloading. +env.allowLocalModels = true; + +const DEFAULT_MODEL = 'Xenova/all-MiniLM-L6-v2'; + +export class LocalEmbedder { + // Use `LocalEmbedder.create(...)` instead of `new LocalEmbedder(...)` + // because the pipeline load is async; we want one place that owns + // the wait and the dimension probe. + constructor(modelName, extractor, dim) { + this.modelName = modelName; + this.extractor = extractor; + this.dim = dim; + } + + static async create(modelName = DEFAULT_MODEL) { + const extractor = await pipeline('feature-extraction', modelName); + // Probe the output shape once and record it on the instance so + // callers can compare against the cache's expected vectorDim + // before doing any inserts. RedisSemanticCache also checks + // length on every put / lookup, so a model swap that produces + // wrong-dim vectors fails at the call site with a clear error. + const probe = await extractor('dimension probe', { + pooling: 'mean', normalize: true, + }); + const dim = probe.dims[probe.dims.length - 1]; + return new LocalEmbedder(modelName, extractor, dim); + } + + // Encode a single string. Returns a Float32Array of length `dim`.
+ async encodeOne(text) { + const out = await this.extractor(text, { + pooling: 'mean', normalize: true, + }); + return new Float32Array(out.data); + } + + // Encode several strings in one pipeline call. Returns an array of + // Float32Array; callers that need raw bytes use `toBytes` per row. + async encodeMany(texts) { + const out = await this.extractor(texts, { + pooling: 'mean', normalize: true, + }); + const rows = out.dims[0]; + const cols = out.dims[1]; + const result = []; + for (let i = 0; i < rows; i++) { + result.push(new Float32Array(out.data.slice(i * cols, (i + 1) * cols))); + } + return result; + } + + // Pack a Float32Array into the bytes Redis Search expects. + // Float32Array.buffer is little-endian on every architecture we care + // about — Node runs on x86_64/arm64, both little-endian. + static toBytes(vector) { + return Buffer.from(vector.buffer, vector.byteOffset, vector.byteLength); + } +} diff --git a/content/develop/use-cases/semantic-cache/nodejs/index.html b/content/develop/use-cases/semantic-cache/nodejs/index.html new file mode 100644 index 0000000000..e897cfdee7 --- /dev/null +++ b/content/develop/use-cases/semantic-cache/nodejs/index.html @@ -0,0 +1,513 @@ + + + + + + Redis Semantic Cache Demo + + + +

+
loading…
+

Redis Semantic Cache Demo

+

+ A small semantic cache sits in front of a mock LLM. Each cache + entry is a Hash at __KEY_PREFIX__<id> holding + the prompt, the response, the prompt's 384-dimensional embedding, + and metadata fields. A single FT.SEARCH on + __INDEX_NAME__ does the KNN against cached prompts + with a TAG pre-filter (tenant, locale, model version, safety) in + the same round trip. If the closest cached prompt is within the + cosine-distance threshold, the demo serves the cached response + and the LLM is not called at all. +

+ +
+ +
+

Ask the LLM

+

Type a question, optionally adjust the metadata filters and + the distance threshold, and submit. The server embeds the + prompt, runs FT.SEARCH with KNN over the cache, + and either serves the cached response (hit) or runs the mock + LLM and writes the new response back to the cache (miss).

+ + +
+
+ + +
+
+ + +
+
+ + +
+
+
+ + + 0.50 +
+

+ The cache serves a hit when the closest cached prompt's + cosine distance is at or below this threshold. Lower = + stricter (fewer hits, safer reuse); higher = looser (more + hits, more risk of serving a near-miss). +

+ + + + + +
+
+ +
+

Cumulative savings

+

Every hit avoids one LLM round trip. The numbers below add + up across the session — tokens that would have been spent and + wall-clock seconds that would have been waited if the cache + had not served the answer.

+
+
+
0
+
Total queries
+
+
+
0
+
Cache hits
+
+
+
0
+
Cache misses
+
+
+
0%
+
Hit ratio
+
+
+
0
+
Tokens saved
+
+
+
0 ms
+
LLM time saved
+
+
+
+ +
+

Index state

+
+ +
+ +
+

Cached entries

+

Every prompt/response pair currently in the cache. + hit_count is the running total of times the entry + has served a hit; ttl is the remaining lifetime + in seconds before EXPIRE drops the key. Click + Drop to simulate eviction.

+ + + + + + + + + + + + +
ID | Prompt | Metadata | Hits | TTL
+
+ +
+ +
+
+ + + + diff --git a/content/develop/use-cases/semantic-cache/nodejs/mockLlm.js b/content/develop/use-cases/semantic-cache/nodejs/mockLlm.js new file mode 100644 index 0000000000..bb8863f84a --- /dev/null +++ b/content/develop/use-cases/semantic-cache/nodejs/mockLlm.js @@ -0,0 +1,136 @@ +// Deterministic mock LLM for the semantic-cache demo. +// +// The point of a semantic cache is to *skip* an LLM call when a prior +// answer is reusable. To make that visible in a docs demo we need an +// LLM stand-in that: +// +// * takes long enough that the saved time on a cache hit is obvious +// (real-world model calls are 500 ms to several seconds); +// * responds deterministically so a given prompt always produces the +// same answer, which keeps the demo reproducible; +// * exposes an estimated token count so the demo can show the saving +// in "tokens not spent" terms alongside latency; +// * needs no API keys, no network, no extra dependencies. +// +// It is keyword-matched against a small lookup table of FAQ-style +// answers for a fictional online retailer. Anything that doesn't +// match falls back to a generic templated reply. The `latencyMs` +// parameter is the simulated round trip; the default (1500 ms) is in +// the neighbourhood of a real GPT-class model on a moderately-sized +// prompt. + +const KNOWLEDGE = [ + { + keywords: ['return', 'refund', 'exchange'], + answer: + 'You can return any unworn item within 30 days of delivery for a ' + + 'full refund. Start a return from your order page; we email a ' + + 'prepaid label and refund the original payment method within ' + + 'five business days of receiving the item.', + }, + { + keywords: ['shipping', 'delivery', 'arrive', 'ship'], + answer: + 'Standard shipping is free on orders over $50 and arrives in ' + + 'three to five business days. 
Expedited two-day shipping is ' + + '$9.99 and is available at checkout for in-stock items.', + }, + { + keywords: ['size', 'sizing', 'fit'], + answer: + 'We follow standard US sizing. For most styles we recommend ' + + 'ordering your usual size; the product page includes a sizing ' + + 'chart and customer fit notes for items that run small or large.', + }, + { + keywords: ['warranty', 'guarantee', 'defect', 'broken'], + answer: + 'All gear is covered by a one-year manufacturer warranty against ' + + 'defects in materials or workmanship. Email support with your ' + + 'order number and a photo of the issue and we will replace the ' + + 'item or issue a refund.', + }, + { + keywords: ['contact', 'support', 'help', 'agent'], + answer: + 'You can reach our support team by email at help@example.com or ' + + 'by live chat from the help centre, 9am to 9pm Eastern, seven ' + + 'days a week. Most tickets get a first reply within two hours.', + }, + { + keywords: ['track', 'tracking', 'order', 'where'], + answer: + 'Your tracking number is on the order confirmation email and on ' + + 'the order detail page once the package has been picked up by ' + + 'the carrier — typically within 24 hours of order placement.', + }, + { + keywords: ['cancel', 'modify', 'change'], + answer: + 'Orders can be cancelled or modified for up to one hour after ' + + 'placement. After that the order has usually entered our ' + + 'warehouse system; the fastest path is to accept delivery and ' + + 'start a return for any unwanted items.', + }, + { + keywords: ['discount', 'coupon', 'promo', 'code'], + answer: + 'Active promotional codes are listed on the homepage banner. ' + + 'Codes apply at checkout and cannot be combined; the system ' + + 'automatically uses the larger of the two when more than one ' + + 'would qualify.', + }, +]; + +// Rough English token estimate: ~4 characters per token. 
Real +// tokenizers (BPE, SentencePiece) vary slightly but this is close +// enough for "look how many tokens you saved" demo signage. +function estimateTokens(text) { + if (!text) return 0; + return Math.max(1, Math.floor(text.length / 4)); +} + +function answerFor(prompt) { + const lower = prompt.toLowerCase(); + for (const row of KNOWLEDGE) { + if (row.keywords.some(k => lower.includes(k))) { + return row.answer; + } + } + // Generic fallback — keeps the demo working for queries that don't + // match any FAQ keyword. + return ( + 'Thanks for the question. Our team would normally answer this ' + + 'individually; in the meantime please check the help centre or ' + + 'contact support@example.com for a faster response.' + ); +} + +export class MockLLM { + constructor({ modelVersion = 'gpt-4.5-2026', latencyMs = 1500.0 } = {}) { + this.modelVersion = modelVersion; + this.latencyMs = latencyMs; + this.callCount = 0; + } + + // Pretend to call a model. Sleeps, then returns a templated answer. + async complete(prompt) { + this.callCount += 1; + const start = performance.now(); + // Sleep first so the latency is realistic regardless of which + // branch generates the text. 
+ await new Promise(resolve => setTimeout(resolve, this.latencyMs)); + const response = answerFor(prompt); + const elapsedMs = performance.now() - start; + return { + response, + modelVersion: this.modelVersion, + latencyMs: elapsedMs, + promptTokens: estimateTokens(prompt), + completionTokens: estimateTokens(response), + get totalTokens() { + return this.promptTokens + this.completionTokens; + }, + }; + } +} diff --git a/content/develop/use-cases/semantic-cache/nodejs/package.json b/content/develop/use-cases/semantic-cache/nodejs/package.json new file mode 100644 index 0000000000..dd060bc708 --- /dev/null +++ b/content/develop/use-cases/semantic-cache/nodejs/package.json @@ -0,0 +1,18 @@ +{ + "name": "redis-semantic-cache-demo-nodejs", + "version": "1.0.0", + "private": true, + "type": "module", + "description": "Redis semantic cache demo with node-redis and @xenova/transformers.", + "main": "demoServer.js", + "scripts": { + "start": "node demoServer.js" + }, + "dependencies": { + "@xenova/transformers": "^2.17.2", + "redis": "^5.12.1" + }, + "engines": { + "node": ">=18" + } +} diff --git a/content/develop/use-cases/semantic-cache/nodejs/seedCache.js b/content/develop/use-cases/semantic-cache/nodejs/seedCache.js new file mode 100644 index 0000000000..ae3ad709b2 --- /dev/null +++ b/content/develop/use-cases/semantic-cache/nodejs/seedCache.js @@ -0,0 +1,85 @@ +// Pre-seed the semantic cache with a handful of FAQ answers. +// +// In a real deployment the cache fills up organically as users ask +// questions: a first-time question is a miss, the LLM answers, and +// the response is written back. To make the demo immediately useful +// — so the first query you type lands on a hit instead of a cold +// miss — we seed a small set of canonical prompts and their answers +// at startup. +// +// The seed list mirrors the keyword table in `mockLlm.js` but stores +// the *canonical phrasing* of each question. 
Paraphrases of any of +// these prompts ("How do I return an item?", "Can I get a refund?") +// embed close to the canonical entry and the cache lookup serves the +// stored response without ever calling the model. + +export const SEED_ENTRIES = [ + { + prompt: 'What is your return policy?', + response: + 'You can return any unworn item within 30 days of delivery for ' + + 'a full refund. Start a return from your order page; we email ' + + 'a prepaid label and refund the original payment method within ' + + 'five business days of receiving the item.', + }, + { + prompt: 'How long does shipping take?', + response: + 'Standard shipping is free on orders over $50 and arrives in ' + + 'three to five business days. Expedited two-day shipping is ' + + '$9.99 and is available at checkout for in-stock items.', + }, + { + prompt: 'How do I find my size?', + response: + 'We follow standard US sizing. For most styles we recommend ' + + 'ordering your usual size; the product page includes a sizing ' + + 'chart and customer fit notes for items that run small or ' + + 'large.', + }, + { + prompt: 'Is there a warranty on your products?', + response: + 'All gear is covered by a one-year manufacturer warranty ' + + 'against defects in materials or workmanship. Email support ' + + 'with your order number and a photo of the issue and we will ' + + 'replace the item or issue a refund.', + }, + { + prompt: 'How can I contact customer support?', + response: + 'You can reach our support team by email at help@example.com ' + + 'or by live chat from the help centre, 9am to 9pm Eastern, ' + + 'seven days a week. 
Most tickets get a first reply within two ' + + 'hours.', + }, + { + prompt: 'Where is my order?', + response: + 'Your tracking number is on the order confirmation email and ' + + 'on the order detail page once the package has been picked up ' + + 'by the carrier — typically within 24 hours of order ' + + 'placement.', + }, +]; + +export async function seed(cache, embedder, { + tenant = 'acme', + locale = 'en', + modelVersion = 'gpt-4.5-2026', +} = {}) { + const prompts = SEED_ENTRIES.map(e => e.prompt); + const vectors = await embedder.encodeMany(prompts); + for (let i = 0; i < SEED_ENTRIES.length; i++) { + const entry = SEED_ENTRIES[i]; + await cache.put({ + prompt: entry.prompt, + response: entry.response, + embedding: vectors[i], + tenant, + locale, + modelVersion, + }); + } + return SEED_ENTRIES.length; +} diff --git a/content/develop/use-cases/semantic-cache/php/.gitignore b/content/develop/use-cases/semantic-cache/php/.gitignore new file mode 100644 index 0000000000..ef21c82985 --- /dev/null +++ b/content/develop/use-cases/semantic-cache/php/.gitignore @@ -0,0 +1,6 @@ +vendor/ +.transformers-cache/ +.models/ +.idea/ +.vscode/ +*.log diff --git a/content/develop/use-cases/semantic-cache/php/_index.md b/content/develop/use-cases/semantic-cache/php/_index.md new file mode 100644 index 0000000000..1bf6b4ea98 --- /dev/null +++ b/content/develop/use-cases/semantic-cache/php/_index.md @@ -0,0 +1,298 @@ +--- +categories: +- docs +- develop +- stack +- oss +- rs +- rc +description: Build a Redis-backed semantic cache for LLM responses in PHP with Predis and transformers-php +linkTitle: Predis example (PHP) +title: Redis semantic cache with Predis +weight: 4 +--- + +This guide shows you how to build a small Redis-backed semantic cache for LLM responses in PHP with [Predis]({{< relref "/develop/clients/php" >}}) and [TransformersPHP](https://transformers.codewithkyrian.com/) running the 
[`sentence-transformers/all-MiniLM-L6-v2`](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) encoder locally on ONNX Runtime. It includes a local web server built with PHP's built-in development HTTP server so you can send paraphrased prompts at a mock LLM, watch the cache decide hit or miss, sweep the cosine-distance threshold, and see the cumulative latency and token savings build up. + +## Overview + +Each cache entry is stored as a single Redis [Hash]({{< relref "/develop/data-types/hashes" >}}) at `cache:`. The hash holds the original prompt, the LLM's response, the raw `float32` bytes of a 384-dimensional embedding of the prompt, and metadata fields — tenant, locale, model version, safety flag — plus a `created_ts` and a `hit_count`. A single [Redis Search]({{< relref "/develop/ai/search-and-query" >}}) index covers the embedding field and every metadata field, so one [`FT.SEARCH`]({{< relref "/commands/ft.search" >}}) call with a `KNN` clause does the vector lookup *and* the TAG pre-filter in the same round trip — no cross-store joins. + +The lookup is thresholded: [`FT.SEARCH`]({{< relref "/commands/ft.search" >}}) always returns the nearest entry that satisfies the filters, but the application only serves it as a hit when the reported cosine distance is at or below `distanceThreshold`. Anything further away is treated as a miss; the caller runs the LLM and writes the new prompt, response, and embedding back to the same key pattern with a TTL. + +The embedder is [TransformersPHP](https://transformers.codewithkyrian.com/) running the [`Xenova/all-MiniLM-L6-v2`](https://huggingface.co/Xenova/all-MiniLM-L6-v2) ONNX export — the same 384-dimensional encoder the [Node.js example]({{< relref "/develop/use-cases/semantic-cache/nodejs" >}}) uses. The library is the established choice for vector embeddings in PHP (see [Index and query vectors]({{< relref "/develop/clients/php/vecsearch" >}}) for the precedent). 
Cosine distances differ from the Python and Jedis ports by only a few thousandths because of small numerical differences between ONNX Runtime and PyTorch, so a cache populated by one demo can be queried by another against the same Redis instance with very nearly the same hit/miss behaviour. + +That gives you: + +* A single round trip for lookup — vector KNN + metadata pre-filter in one [`FT.SEARCH`]({{< relref "/commands/ft.search" >}}). +* Tens of milliseconds on a hit vs. a multi-second LLM call on a miss; the embedding step is the bottleneck either way, and that's a model-side cost, not a Redis one. +* Tenant, locale, and model-version isolation enforced inside the query, not in application code — a write under one tenant cannot be served to another. +* Bounded memory: every entry has an [`EXPIRE`]({{< relref "/commands/expire" >}}) TTL, and a database-level [eviction policy]({{< relref "/develop/reference/eviction" >}}) (LRU / LFU) caps the cache size under pressure. + +## How it works + +A query goes through three stages: **embed**, **lookup**, and (on a miss) **call the LLM and write back**. + +### Hit path (the goal) + +1. The application calls `$embedder->encodeOne($prompt)` to turn the incoming text into a 384-element `float` array. +2. `$cache->lookup($queryVec, tenant: ..., locale: ..., modelVersion: ...)` runs [`FT.SEARCH`]({{< relref "/commands/ft.search" >}}) with a TAG pre-filter and a `KNN 1` clause. Redis returns the closest cached prompt that satisfies the filters along with its cosine distance. +3. If the distance is at or below the threshold, the cache returns a `CacheHit` containing the cached response. The helper also issues an [`HINCRBY`]({{< relref "/commands/hincrby" >}}) on `hit_count` and an [`EXPIRE`]({{< relref "/commands/expire" >}}) refresh inside a [`MULTI/EXEC`]({{< relref "/commands/multi" >}}), so a frequently used answer keeps its TTL and the demo UI can see which entries are load-bearing. +4. The LLM is not called at all. 
The application returns the cached response to the user. + +### Miss path + +When the distance is above the threshold — or there is no candidate in scope at all — the helper returns a `CacheMiss` instead, carrying the distance of the nearest candidate (if any) for logging. The application then: + +1. Calls the LLM with the prompt. +2. Calls `$cache->put($prompt, $response, $embedding, tenant: ..., locale: ..., modelVersion: ...)`. The same embedding the lookup used is reused — no re-encode. The helper writes the Hash with [`HSET`]({{< relref "/commands/hset" >}}) and an [`EXPIRE`]({{< relref "/commands/expire" >}}) TTL inside a single [`MULTI/EXEC`]({{< relref "/commands/multi" >}}) so the entry never lands without a TTL on a partial failure. +3. Returns the LLM's response to the user. The next semantically similar prompt under the same metadata scope will be a hit. + +## The cache helper + +The `RedisSemanticCache` class wraps the Redis Search index and the lookup / write flow +([source](https://github.com/redis/docs/blob/main/content/develop/use-cases/semantic-cache/php/src/RedisSemanticCache.php)): + +```php +use Predis\Client; +use Redis\SemanticCache\{RedisSemanticCache, LocalEmbedder, CacheHit}; + +$client = new Client(['host' => 'localhost', 'port' => 6379]); +$embedder = LocalEmbedder::create(); // sentence-transformers/all-MiniLM-L6-v2 + +$cache = new RedisSemanticCache( + client: $client, + indexName: 'semcache:idx', + keyPrefix: 'cache:', + distanceThreshold: 0.5, // cosine distance, lower = stricter + defaultTtlSeconds: 3600, // one hour +); + +// One-time index setup (idempotent). +$cache->createIndex(); + +// 1) Embed the prompt. +$prompt = 'How do I return an item?'; +$queryVec = $embedder->encodeOne($prompt); + +// 2) Look up under a metadata scope. The TAG filter and the KNN +// travel together in one FT.SEARCH. 
+$result = $cache->lookup( + queryVec: $queryVec, + tenant: 'acme', + locale: 'en', + modelVersion: 'gpt-4.5-2026', +); + +if ($result instanceof CacheHit) { + $response = $result->response; + printf("hit (%.3f): %s\n", $result->distance, $response); +} else { + // 3a) Miss — call the LLM. (Use your real client here.) + $response = call_llm($prompt); + + // 3b) Cache the new entry. Reuses the same embedding bytes the + // lookup used, so we don't pay the encoder twice. + $cache->put( + prompt: $prompt, + response: $response, + embedding: $queryVec, + tenant: 'acme', + locale: 'en', + modelVersion: 'gpt-4.5-2026', + ); +} +``` + +### Data model + +Each cache entry is one Redis Hash. The vector field is raw little-endian `float32` bytes — no JSON wrapping — because the Redis Search vector encoding expects exactly that. The helper packs the embedding with PHP's [`pack('g*', ...)`](https://www.php.net/manual/en/function.pack.php) (the `g` format is a little-endian single-precision IEEE-754 float), matching the encoding the Python, Node.js, Go, and Jedis ports write. + +```text +cache:7c3f8a1b9e02 + prompt=How do I return an item? + response=You can return any unworn item within 30 days... + tenant=acme + locale=en + model_version=gpt-4.5-2026 + safety=ok + created_ts=1715990400.123 + hit_count=4 + embedding=<384 × float32 little-endian bytes> +``` + +The Redis Search index schema treats every field as queryable in its natural type: + +```text +FT.CREATE semcache:idx + ON HASH PREFIX 1 cache: + SCHEMA + prompt TEXT + response TEXT + tenant TAG + locale TAG + model_version TAG + safety TAG + created_ts NUMERIC SORTABLE + hit_count NUMERIC SORTABLE + embedding VECTOR HNSW 6 TYPE FLOAT32 DIM 384 DISTANCE_METRIC COSINE +``` + +### The query + +The lookup is a hybrid query: a TAG pre-filter expression in parentheses, then `=>[KNN 1 @embedding $vec]`. With `DIALECT 2`, Redis applies the filter first and KNN-ranks only the matching documents. 
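The filter-then-KNN shape described here can be sketched as a small query-string builder. This is an illustration under stated assumptions rather than the helper's actual code: `escapeTagValue` and `buildKnnQuery` are hypothetical names, and the escaping rule (backslash-escape every character that is not a letter, digit, or underscore, which includes the backslash itself) is a deliberately conservative choice.

```php
// Hypothetical helper (not part of Predis or the demo source):
// backslash-escape anything that is not a letter, digit, or underscore,
// so punctuation inside a TAG value cannot be parsed as query syntax.
function escapeTagValue(string $value): string
{
    $escaped = preg_replace('/[^\p{L}\p{N}_]/u', '\\\\$0', $value);
    if ($escaped === null) {
        // preg_replace returns null on error, e.g. invalid UTF-8 input.
        throw new InvalidArgumentException('TAG value is not valid UTF-8');
    }
    return $escaped;
}

// Hypothetical helper: build "(<TAG filters>)=>[KNN 1 @field $param AS alias]".
// $tags maps field name => required value; an empty map falls back to "*".
function buildKnnQuery(
    array $tags,
    string $vectorField = 'embedding',
    string $paramName = 'vec',
    string $alias = 'distance',
): string {
    $clauses = [];
    foreach ($tags as $field => $value) {
        $clauses[] = sprintf('@%s:{%s}', $field, escapeTagValue((string) $value));
    }
    $filter = $clauses === [] ? '*' : '(' . implode(' ', $clauses) . ')';

    return sprintf('%s=>[KNN 1 @%s $%s AS %s]', $filter, $vectorField, $paramName, $alias);
}

echo buildKnnQuery([
    'tenant' => 'acme',
    'locale' => 'en',
    'model_version' => 'gpt-4.5-2026',
    'safety' => 'ok',
]), PHP_EOL;
// (@tenant:{acme} @locale:{en} @model_version:{gpt\-4\.5\-2026} @safety:{ok})=>[KNN 1 @embedding $vec AS distance]
```

Escaping every TAG value at the point where the query string is assembled keeps a tenant or model name containing `-`, `.`, or `\` from silently widening or breaking the pre-filter.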
In Predis: + +```php +use Predis\Command\Argument\Search\SearchArguments; + +$arguments = (new SearchArguments()) + ->addReturn(7, 'prompt', 'response', 'tenant', 'locale', + 'model_version', 'hit_count', 'distance') + ->sortBy('distance', 'asc') + ->limit(0, 1) + ->dialect('2') + ->params(['vec', pack('g*', ...$queryVec)]); + +$raw = $client->ftsearch( + 'semcache:idx', + '(@tenant:{acme} @locale:{en} @model_version:{gpt\-4\.5\-2026} @safety:{ok})' + . '=>[KNN 1 @embedding $vec AS distance]', + $arguments, +); +``` + +`distance` is the cosine *distance* (0 means identical, 2 means opposite). The result is sorted ascending, so the top row is the closest candidate. The application inspects `distance` against the threshold and decides hit or miss in user code — Redis returns the row either way, and treating it as a hit or a miss is a policy decision the cache helper owns, not a server-side filter. + +Predis 3.x defaults to query dialect 2; the cache helper sets it explicitly so the code reads correctly against earlier versions too. See [Index and query vectors]({{< relref "/develop/clients/php/vecsearch" >}}) for more on Predis's vector-search helpers. + +## The mock LLM + +To make the latency and token savings visible without requiring an API key, `MockLLM.php` provides a deterministic stand-in +([source](https://github.com/redis/docs/blob/main/content/develop/use-cases/semantic-cache/php/src/MockLLM.php)): + +```php +use Redis\SemanticCache\MockLLM; + +$llm = new MockLLM(latencyMs: 1500.0); +$response = $llm->complete('What is your return policy?'); +// $response['response'] — the templated answer text +// $response['latency_ms'] — wall-clock time the call took +// $response['total_tokens'] — estimated prompt + completion tokens +``` + +The mock sleeps for the configured latency, then keyword-matches against a small FAQ table to produce an answer. The deliberate slowness is what makes a hit visibly cheaper than a miss in the demo. 
In production code, you would replace `MockLLM` with your real client of choice — an HTTP call to OpenAI, Anthropic, a self-hosted vLLM endpoint, anything — without changing the cache helper. + +## Pre-seeding the cache + +In a real deployment the cache fills up organically: a first-time question is a miss, the LLM answers, and the response is written back. For the demo, `SeedCache.php` pre-loads a small set of canonical FAQ prompts so the very first query lands on a hit +([source](https://github.com/redis/docs/blob/main/content/develop/use-cases/semantic-cache/php/src/SeedCache.php)): + +```php +use Redis\SemanticCache\SeedCache; + +$cache->createIndex(); +SeedCache::seed($cache, $embedder, tenant: 'acme', locale: 'en'); +``` + +The seed list stores the canonical phrasing of each question ("What is your return policy?"). Paraphrases of any of these prompts ("How do I return an item?", "Can I get a refund?") embed close to the canonical entry, so the cache lookup serves the stored response without ever calling the model. + +The seed helper embeds the prompts one at a time rather than as a single batched `encodeMany` call. TransformersPHP's attention-mask handling produces slightly different mean-pooled vectors for variable-length inputs inside a batch versus single-input calls, and that 0.01-cosine-distance drift would otherwise make a self-lookup of a seeded prompt look like a near-match instead of a clean zero-distance hit. + +## The interactive demo + +`public/index.php` is a front controller for PHP's built-in HTTP server — no Slim, no Symfony, no embedded framework. The HTML page lets you: + +* Type a prompt and toggle metadata: tenant, locale, model version. Each combination is a separate cache namespace inside the same index. +* Slide the cosine-distance threshold and see hits flip to misses (and back) on the same prompt, with the actual distance reported on each query. 
+* Submit with **Ask** to run the full hit-or-miss path (calls the LLM on a miss, writes the answer back). Submit with **Lookup only (no LLM)** to sweep the threshold against a fixed prompt without polluting the cache. +* Watch the cumulative panel build up: total queries, cache hits, cache misses, hit ratio, tokens not spent, LLM milliseconds not waited. +* Inspect every cached entry, including remaining TTL and total hit count, and drop individual entries to simulate eviction. + +The front controller rebuilds a `LocalEmbedder`, a `RedisSemanticCache`, and a `MockLLM` on every request because PHP's built-in server is single-process and does not share user-land objects between requests. The first request is therefore slow (the embedder reloads the tokenizer and ONNX session); subsequent requests reuse the cached model files on disk and are fast. The HTML page is shared with the Python, Node.js, Go, and Jedis demos; the same `index.html` works against any of the language ports without modification. Endpoints: + +| Endpoint | What it does | +|-----------------|-------------------------------------------------------------------------------| +| `GET /state` | Index info and the full list of cached entries. | +| `POST /query` | Embed the prompt, run `FT.SEARCH`, on miss call the LLM and write back. | +| `POST /reset` | Drop every cached entry and re-seed from the FAQ list. | +| `POST /drop` | Delete a single cached entry by id. | + +## Configuration + +PHP's CLI flag parsing is awkward, so the demo reads configuration from environment variables rather than `--`-style flags. All variables have defaults; override only what you need. 
+ +| Variable | Default | Purpose | +|-----------------------------|----------------|----------------------------------------------------| +| `SEMCACHE_PORT` | `8093` | TCP port for the dev server | +| `SEMCACHE_REDIS_HOST` | `localhost` | Redis host | +| `SEMCACHE_REDIS_PORT` | `6379` | Redis port | +| `SEMCACHE_INDEX_NAME` | `semcache:idx` | Redis Search index name | +| `SEMCACHE_KEY_PREFIX` | `cache:` | Prefix for cache entry hashes | +| `SEMCACHE_TTL_SECONDS` | `3600` | TTL on each cache entry | +| `SEMCACHE_THRESHOLD` | `0.5` | Default cosine-distance threshold | +| `SEMCACHE_LLM_LATENCY_MS` | `1500` | Mock LLM sleep, milliseconds | +| `SEMCACHE_RESEED` | `true` | Re-seed FAQ entries on the first request | + +## Run the demo locally + +1. Clone the [`redis/docs`](https://github.com/redis/docs) repository and change into the example + directory: + + ```bash + git clone https://github.com/redis/docs.git + cd docs/content/develop/use-cases/semantic-cache/php + ``` + +2. Make sure a Redis instance with the Redis Search module is running locally on + port 6379. [Redis Stack]({{< relref "/operate/oss_and_stack/install/install-stack" >}}) or + [Redis 8 with Search]({{< relref "/develop/ai/search-and-query" >}}) both work. + +3. Install the PHP dependencies with [Composer](https://getcomposer.org/). This step also + downloads the prebuilt TransformersPHP native libraries (ONNX Runtime, OpenBLAS, Rindow's matlib FFI shim) for your platform — about 90 MB on macOS arm64: + + ```bash + composer install + ``` + + The example requires PHP 8.2 or later and uses [Predis](https://github.com/predis/predis) for Redis access, with no PHP extensions required beyond the standard `ffi` shipped with most builds. + +4. Start the demo. 
The included `run.sh` sets the PHP `ffi.enable=true` directive that + TransformersPHP needs at runtime, caps `post_max_size` at 1 MiB to match the demo's + body-size budget, and silences PHP 8.4 deprecation notices that `codewithkyrian/transformers` + 0.5.x emits on the latest PHP — the underlying inference is unaffected. The first run + downloads the `Xenova/all-MiniLM-L6-v2` ONNX weights (~30 MB) into the local Hugging + Face cache; every subsequent run is offline: + + ```bash + ./run.sh + ``` + + To pick a different port or threshold, set the corresponding environment variable + before invoking the script: + + ```bash + SEMCACHE_PORT=8093 SEMCACHE_THRESHOLD=0.4 ./run.sh + ``` + +5. Open `http://localhost:8093` (or the port you set in `SEMCACHE_PORT`) in a browser and try some queries: + + * **"What is your return policy?"** — exact match against the seed, distance ≈ 0, + hit at any threshold. + * **"How fast is delivery?"** — paraphrase of the shipping seed; distance + around 0.30, hit at the default threshold of 0.5. + * **"How do I return an item?"** — slightly looser paraphrase of the returns + seed; distance around 0.49, still a hit at the default threshold. Slide + the threshold down to 0.4 to see this one flip to a miss. + * **"What payment methods do you accept?"** — unrelated to anything in the + seed, but the embedding model still finds shallow surface-form similarity + with the canonical "What ___ do you ___?" phrasing of the seeds, so the + distance lands around 0.66. At the default threshold of 0.5 you will see + a miss, the mock LLM kicks in for ~1.5 s, the new answer is cached, and + a follow-up of the same question is now an immediate hit. At threshold + 0.7 the same query is a borderline hit — that's the cosine-distance + cutoff working exactly as advertised. + * Switch the **Tenant** dropdown to `globex` or `initech` and re-ask any + seeded question — the result flips to a miss because the cache entries + live under `acme`. That's the metadata pre-filter at work inside `FT.SEARCH`.
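
The hit-or-miss flips described in the queries above come down to one comparison between the returned cosine distance and the threshold. The following is a minimal sketch of that decision, not the demo's actual code: `classifyDistance` is a hypothetical helper name, the distances are the approximate values quoted above, and treating a distance at or below the threshold as a hit is an assumption (the real helper may use a strict comparison).

```php
<?php
declare(strict_types=1);

// Hypothetical helper, not part of the demo source: decide hit or miss by
// comparing the cosine distance of the top KNN row with the threshold.
// Assumption: a distance at or below the threshold counts as a hit.
function classifyDistance(float $distance, float $threshold): string
{
    return $distance <= $threshold ? 'hit' : 'miss';
}

// Approximate distances quoted in the walkthrough above.
$examples = [
    'What is your return policy?'         => 0.00,
    'How fast is delivery?'               => 0.30,
    'How do I return an item?'            => 0.49,
    'What payment methods do you accept?' => 0.66,
];

// Sweep the three thresholds mentioned in the walkthrough.
foreach ([0.4, 0.5, 0.7] as $threshold) {
    foreach ($examples as $prompt => $distance) {
        printf(
            "threshold %.1f  %-37s distance %.2f  %s\n",
            $threshold,
            $prompt,
            $distance,
            classifyDistance($distance, $threshold)
        );
    }
}
```

Because Redis returns the nearest row regardless of the threshold, the same stored entry can be swept against several cutoffs without re-running the embedding, which is what the **Lookup only (no LLM)** button lets you do interactively.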
+ +The server is read/write against your local Redis. The default index name is `semcache:idx` and entry keys live under `cache:`. Set `SEMCACHE_RESEED=false` to keep an existing cache across restarts, `SEMCACHE_THRESHOLD` to change the default cosine-distance cutoff, `SEMCACHE_LLM_LATENCY_MS` to make the mock LLM faster or slower for the demo, or `SEMCACHE_PORT` to listen on a different port. diff --git a/content/develop/use-cases/semantic-cache/php/composer.json b/content/develop/use-cases/semantic-cache/php/composer.json new file mode 100644 index 0000000000..7ad2af1cf3 --- /dev/null +++ b/content/develop/use-cases/semantic-cache/php/composer.json @@ -0,0 +1,28 @@ +{ + "name": "redis/semantic-cache-php", + "description": "Redis semantic-cache demo for the Redis docs, in PHP with Predis and transformers-php.", + "type": "project", + "license": "MIT", + "require": { + "php": "^8.2", + "codewithkyrian/transformers": "^0.5", + "predis/predis": "^3.0", + "rindow/rindow-matlib-ffi": "~1.0.0" + }, + "autoload": { + "psr-4": { + "Redis\\SemanticCache\\": "src/" + } + }, + "config": { + "sort-packages": true, + "allow-plugins": { + "codewithkyrian/transformers-libsloader": true + }, + "platform": { + "php": "8.2.0" + } + }, + "minimum-stability": "stable", + "prefer-stable": true +} diff --git a/content/develop/use-cases/semantic-cache/php/composer.lock b/content/develop/use-cases/semantic-cache/php/composer.lock new file mode 100644 index 0000000000..b39ab87062 --- /dev/null +++ b/content/develop/use-cases/semantic-cache/php/composer.lock @@ -0,0 +1,1499 @@ +{ + "_readme": [ + "This file locks the dependencies of your project to a known state", + "Read more about it at https://getcomposer.org/doc/01-basic-usage.md#installing-dependencies", + "This file is @generated automatically" + ], + "content-hash": "4d6de94fd5caea61e194ea4e8b8a4db0", + "packages": [ + { + "name": "codewithkyrian/jinja-php", + "version": "1.0.0", + "source": { + "type": "git", + "url": 
"https://github.com/CodeWithKyrian/jinja-php.git", + "reference": "3a246c831af5c3c3c532399aa0c1e5209441675f" + }, + "dist": { + "type": "zip", + "url": "https://api.github.com/repos/CodeWithKyrian/jinja-php/zipball/3a246c831af5c3c3c532399aa0c1e5209441675f", + "reference": "3a246c831af5c3c3c532399aa0c1e5209441675f", + "shasum": "" + }, + "require": { + "php": "^8.1" + }, + "require-dev": { + "pestphp/pest": "^2.34", + "symfony/var-dumper": "^6.3|^7.0" + }, + "type": "library", + "autoload": { + "files": [ + "src/Core/Utils.php" + ], + "psr-4": { + "Codewithkyrian\\Jinja\\": "src/" + } + }, + "notification-url": "https://packagist.org/downloads/", + "license": [ + "MIT" + ], + "authors": [ + { + "name": "Kyrian Obikwelu", + "email": "koshnawaza@gmail.com" + } + ], + "description": "A minimalistic PHP implementation of the Jinja templating engine, specifically designed for parsing and rendering ML chat templates.", + "support": { + "issues": "https://github.com/CodeWithKyrian/jinja-php/issues", + "source": "https://github.com/CodeWithKyrian/jinja-php/tree/1.0.0" + }, + "time": "2024-03-19T17:43:20+00:00" + }, + { + "name": "codewithkyrian/transformers", + "version": "0.5.3", + "source": { + "type": "git", + "url": "https://github.com/CodeWithKyrian/transformers-php.git", + "reference": "474406c25d33e36fcc4f6225719a46afa82acbf8" + }, + "dist": { + "type": "zip", + "url": "https://api.github.com/repos/CodeWithKyrian/transformers-php/zipball/474406c25d33e36fcc4f6225719a46afa82acbf8", + "reference": "474406c25d33e36fcc4f6225719a46afa82acbf8", + "shasum": "" + }, + "require": { + "codewithkyrian/jinja-php": "^1.0", + "codewithkyrian/transformers-libsloader": "^2.0", + "ext-ffi": "*", + "imagine/imagine": "^1.3", + "php": "^8.1", + "rindow/rindow-math-matrix": "^2.0", + "rindow/rindow-matlib-ffi": "^1.0", + "rindow/rindow-openblas-ffi": "^1.0", + "rokka/imagine-vips": "^0.31.0", + "symfony/console": "^6.4|^7.0" + }, + "require-dev": { + "pestphp/pest": "^2.31", + 
"symfony/var-dumper": "^7.0" + }, + "suggest": { + "ext-gd": "Required to use the GD Driver for image processing", + "ext-imagick": "Required to use the Imagick Driver for image processing", + "rokka/imagine-vips": "Required to use the VIPS Driver for image processing" + }, + "bin": [ + "bin/transformers" + ], + "type": "library", + "autoload": { + "files": [ + "src/Pipelines/Pipeline.php", + "src/Utils/Helpers.php" + ], + "psr-4": { + "Codewithkyrian\\Transformers\\": "src/" + } + }, + "notification-url": "https://packagist.org/downloads/", + "license": [ + "Apache-2.0" + ], + "authors": [ + { + "name": "Kyrian Obikwelu", + "email": "koshnawaza@gmail.com" + } + ], + "description": "State-of-the-art Machine Learning for PHP. Run Transformers in PHP", + "keywords": [ + "CodeWithKyrian", + "ai", + "machine learning", + "natural language processing", + "nlp", + "php", + "transformers", + "transformers-php" + ], + "support": { + "issues": "https://github.com/CodeWithKyrian/transformers-php/issues", + "source": "https://github.com/CodeWithKyrian/transformers-php/tree/0.5.3" + }, + "time": "2024-09-27T19:39:27+00:00" + }, + { + "name": "codewithkyrian/transformers-libsloader", + "version": "2.0.0", + "source": { + "type": "git", + "url": "https://github.com/CodeWithKyrian/transformers-libsloader.git", + "reference": "7052adad23e969701a961437b77422f820df05ba" + }, + "dist": { + "type": "zip", + "url": "https://api.github.com/repos/CodeWithKyrian/transformers-libsloader/zipball/7052adad23e969701a961437b77422f820df05ba", + "reference": "7052adad23e969701a961437b77422f820df05ba", + "shasum": "" + }, + "require": { + "composer-plugin-api": "^1.1 || ^2.0", + "php": "^8.1" + }, + "require-dev": { + "composer/composer": "~1.0 || ~2.0", + "symfony/var-dumper": "^6.0|^7.0" + }, + "type": "composer-plugin", + "extra": { + "class": "Codewithkyrian\\TransformersLibsLoader\\Plugin" + }, + "autoload": { + "psr-4": { + "Codewithkyrian\\TransformersLibsLoader\\": "src/" + } + }, + 
"notification-url": "https://packagist.org/downloads/", + "license": [ + "MIT" + ], + "authors": [ + { + "name": "Kyrian Obikwelu", + "email": "koshnawaza@gmail.com" + } + ], + "description": "Composer plugin to download all shared libraries necessary for TransformersPHP", + "support": { + "issues": "https://github.com/CodeWithKyrian/transformers-libsloader/issues", + "source": "https://github.com/CodeWithKyrian/transformers-libsloader/tree/2.0.0" + }, + "time": "2024-08-18T16:40:39+00:00" + }, + { + "name": "imagine/imagine", + "version": "1.5.2", + "source": { + "type": "git", + "url": "https://github.com/php-imagine/Imagine.git", + "reference": "f9ed796eefb77c2f0f2167e1d4e36bc2b5ed6b0c" + }, + "dist": { + "type": "zip", + "url": "https://api.github.com/repos/php-imagine/Imagine/zipball/f9ed796eefb77c2f0f2167e1d4e36bc2b5ed6b0c", + "reference": "f9ed796eefb77c2f0f2167e1d4e36bc2b5ed6b0c", + "shasum": "" + }, + "require": { + "php": ">=7.1" + }, + "require-dev": { + "phpunit/phpunit": "^4.8 || ^5.7 || ^6.5 || ^7.5 || ^8.4 || ^9.3" + }, + "suggest": { + "ext-exif": "to read EXIF metadata", + "ext-gd": "to use the GD implementation", + "ext-gmagick": "to use the Gmagick implementation", + "ext-imagick": "to use the Imagick implementation" + }, + "type": "library", + "extra": { + "branch-alias": { + "dev-develop": "1.x-dev" + } + }, + "autoload": { + "psr-4": { + "Imagine\\": "src/" + } + }, + "notification-url": "https://packagist.org/downloads/", + "license": [ + "MIT" + ], + "authors": [ + { + "name": "Bulat Shakirzyanov", + "email": "mallluhuct@gmail.com", + "homepage": "http://avalanche123.com" + } + ], + "description": "Image processing for PHP", + "homepage": "http://imagine.readthedocs.org/", + "keywords": [ + "drawing", + "graphics", + "image manipulation", + "image processing" + ], + "support": { + "issues": "https://github.com/php-imagine/Imagine/issues", + "source": "https://github.com/php-imagine/Imagine/tree/1.5.2" + }, + "time": 
"2026-01-09T10:45:12+00:00" + }, + { + "name": "interop-phpobjects/polite-math", + "version": "1.0.7", + "source": { + "type": "git", + "url": "https://github.com/interop-phpobjects/polite-math.git", + "reference": "621246cdc108b1388307097e06361ca5b9259467" + }, + "dist": { + "type": "zip", + "url": "https://api.github.com/repos/interop-phpobjects/polite-math/zipball/621246cdc108b1388307097e06361ca5b9259467", + "reference": "621246cdc108b1388307097e06361ca5b9259467", + "shasum": "" + }, + "require": { + "php": ">=7.2" + }, + "type": "library", + "autoload": { + "psr-4": { + "Interop\\Polite\\Math\\": "src/" + } + }, + "notification-url": "https://packagist.org/downloads/", + "license": [ + "MIT" + ], + "description": "Interoperability of interfaces around the Math", + "keywords": [ + "interop", + "interoperability", + "math" + ], + "support": { + "issues": "https://github.com/interop-phpobjects/polite-math/issues", + "source": "https://github.com/interop-phpobjects/polite-math/tree/1.0.7" + }, + "time": "2024-04-07T07:10:27+00:00" + }, + { + "name": "jcupitt/vips", + "version": "v2.6.1", + "source": { + "type": "git", + "url": "https://github.com/libvips/php-vips.git", + "reference": "6dbba1bb3a4ba1e39edb71016a2aa6da5bff32ee" + }, + "dist": { + "type": "zip", + "url": "https://api.github.com/repos/libvips/php-vips/zipball/6dbba1bb3a4ba1e39edb71016a2aa6da5bff32ee", + "reference": "6dbba1bb3a4ba1e39edb71016a2aa6da5bff32ee", + "shasum": "" + }, + "require": { + "ext-ffi": "*", + "php": ">=7.4", + "psr/log": "^1.1.3|^2.0|^3.0" + }, + "require-dev": { + "php-parallel-lint/php-parallel-lint": "^1.3", + "phpdocumentor/shim": "^3.3", + "phpunit/phpunit": "^9.5", + "squizlabs/php_codesniffer": "^3.7" + }, + "type": "library", + "extra": { + "branch-alias": { + "dev-master": "2.0.x-dev" + } + }, + "autoload": { + "psr-4": { + "Jcupitt\\Vips\\": "src" + } + }, + "notification-url": "https://packagist.org/downloads/", + "license": [ + "MIT" + ], + "authors": [ + { + "name": 
"John Cupitt", + "email": "jcupitt@gmail.com", + "homepage": "https://github.com/jcupitt", + "role": "Developer" + } + ], + "description": "A high-level interface to the libvips image processing library.", + "homepage": "https://github.com/libvips/php-vips", + "keywords": [ + "image", + "libvips", + "processing" + ], + "support": { + "issues": "https://github.com/libvips/php-vips/issues", + "source": "https://github.com/libvips/php-vips/tree/v2.6.1" + }, + "time": "2025-12-10T14:03:20+00:00" + }, + { + "name": "phenx/php-font-lib", + "version": "0.5.6", + "source": { + "type": "git", + "url": "https://github.com/dompdf/php-font-lib.git", + "reference": "a1681e9793040740a405ac5b189275059e2a9863" + }, + "dist": { + "type": "zip", + "url": "https://api.github.com/repos/dompdf/php-font-lib/zipball/a1681e9793040740a405ac5b189275059e2a9863", + "reference": "a1681e9793040740a405ac5b189275059e2a9863", + "shasum": "" + }, + "require": { + "ext-mbstring": "*" + }, + "require-dev": { + "symfony/phpunit-bridge": "^3 || ^4 || ^5 || ^6" + }, + "type": "library", + "autoload": { + "psr-4": { + "FontLib\\": "src/FontLib" + } + }, + "notification-url": "https://packagist.org/downloads/", + "license": [ + "LGPL-2.1-or-later" + ], + "authors": [ + { + "name": "Fabien Ménager", + "email": "fabien.menager@gmail.com" + } + ], + "description": "A library to read, parse, export and make subsets of different types of font files.", + "homepage": "https://github.com/PhenX/php-font-lib", + "support": { + "issues": "https://github.com/dompdf/php-font-lib/issues", + "source": "https://github.com/dompdf/php-font-lib/tree/0.5.6" + }, + "time": "2024-01-29T14:45:26+00:00" + }, + { + "name": "predis/predis", + "version": "v3.4.2", + "source": { + "type": "git", + "url": "https://github.com/predis/predis.git", + "reference": "2033429520d8997a7815a2485f56abe6d2d0e075" + }, + "dist": { + "type": "zip", + "url": 
"https://api.github.com/repos/predis/predis/zipball/2033429520d8997a7815a2485f56abe6d2d0e075", + "reference": "2033429520d8997a7815a2485f56abe6d2d0e075", + "shasum": "" + }, + "require": { + "php": "^7.2 || ^8.0", + "psr/http-message": "^1.0|^2.0" + }, + "require-dev": { + "friendsofphp/php-cs-fixer": "^3.3", + "phpstan/phpstan": "^1.9", + "phpunit/phpcov": "^6.0 || ^8.0", + "phpunit/phpunit": "^8.0 || ~9.4.4" + }, + "suggest": { + "ext-relay": "Faster connection with in-memory caching (>=0.6.2)" + }, + "type": "library", + "autoload": { + "psr-4": { + "Predis\\": "src/" + } + }, + "notification-url": "https://packagist.org/downloads/", + "license": [ + "MIT" + ], + "authors": [ + { + "name": "Till Krüss", + "homepage": "https://till.im", + "role": "Maintainer" + } + ], + "description": "A flexible and feature-complete Redis/Valkey client for PHP.", + "homepage": "http://github.com/predis/predis", + "keywords": [ + "nosql", + "predis", + "redis" + ], + "support": { + "issues": "https://github.com/predis/predis/issues", + "source": "https://github.com/predis/predis/tree/v3.4.2" + }, + "funding": [ + { + "url": "https://github.com/sponsors/tillkruss", + "type": "github" + } + ], + "time": "2026-03-09T20:33:04+00:00" + }, + { + "name": "psr/container", + "version": "2.0.2", + "source": { + "type": "git", + "url": "https://github.com/php-fig/container.git", + "reference": "c71ecc56dfe541dbd90c5360474fbc405f8d5963" + }, + "dist": { + "type": "zip", + "url": "https://api.github.com/repos/php-fig/container/zipball/c71ecc56dfe541dbd90c5360474fbc405f8d5963", + "reference": "c71ecc56dfe541dbd90c5360474fbc405f8d5963", + "shasum": "" + }, + "require": { + "php": ">=7.4.0" + }, + "type": "library", + "extra": { + "branch-alias": { + "dev-master": "2.0.x-dev" + } + }, + "autoload": { + "psr-4": { + "Psr\\Container\\": "src/" + } + }, + "notification-url": "https://packagist.org/downloads/", + "license": [ + "MIT" + ], + "authors": [ + { + "name": "PHP-FIG", + "homepage": 
"https://www.php-fig.org/" + } + ], + "description": "Common Container Interface (PHP FIG PSR-11)", + "homepage": "https://github.com/php-fig/container", + "keywords": [ + "PSR-11", + "container", + "container-interface", + "container-interop", + "psr" + ], + "support": { + "issues": "https://github.com/php-fig/container/issues", + "source": "https://github.com/php-fig/container/tree/2.0.2" + }, + "time": "2021-11-05T16:47:00+00:00" + }, + { + "name": "psr/http-message", + "version": "2.0", + "source": { + "type": "git", + "url": "https://github.com/php-fig/http-message.git", + "reference": "402d35bcb92c70c026d1a6a9883f06b2ead23d71" + }, + "dist": { + "type": "zip", + "url": "https://api.github.com/repos/php-fig/http-message/zipball/402d35bcb92c70c026d1a6a9883f06b2ead23d71", + "reference": "402d35bcb92c70c026d1a6a9883f06b2ead23d71", + "shasum": "" + }, + "require": { + "php": "^7.2 || ^8.0" + }, + "type": "library", + "extra": { + "branch-alias": { + "dev-master": "2.0.x-dev" + } + }, + "autoload": { + "psr-4": { + "Psr\\Http\\Message\\": "src/" + } + }, + "notification-url": "https://packagist.org/downloads/", + "license": [ + "MIT" + ], + "authors": [ + { + "name": "PHP-FIG", + "homepage": "https://www.php-fig.org/" + } + ], + "description": "Common interface for HTTP messages", + "homepage": "https://github.com/php-fig/http-message", + "keywords": [ + "http", + "http-message", + "psr", + "psr-7", + "request", + "response" + ], + "support": { + "source": "https://github.com/php-fig/http-message/tree/2.0" + }, + "time": "2023-04-04T09:54:51+00:00" + }, + { + "name": "psr/log", + "version": "3.0.2", + "source": { + "type": "git", + "url": "https://github.com/php-fig/log.git", + "reference": "f16e1d5863e37f8d8c2a01719f5b34baa2b714d3" + }, + "dist": { + "type": "zip", + "url": "https://api.github.com/repos/php-fig/log/zipball/f16e1d5863e37f8d8c2a01719f5b34baa2b714d3", + "reference": "f16e1d5863e37f8d8c2a01719f5b34baa2b714d3", + "shasum": "" + }, + "require": { + 
"php": ">=8.0.0" + }, + "type": "library", + "extra": { + "branch-alias": { + "dev-master": "3.x-dev" + } + }, + "autoload": { + "psr-4": { + "Psr\\Log\\": "src" + } + }, + "notification-url": "https://packagist.org/downloads/", + "license": [ + "MIT" + ], + "authors": [ + { + "name": "PHP-FIG", + "homepage": "https://www.php-fig.org/" + } + ], + "description": "Common interface for logging libraries", + "homepage": "https://github.com/php-fig/log", + "keywords": [ + "log", + "psr", + "psr-3" + ], + "support": { + "source": "https://github.com/php-fig/log/tree/3.0.2" + }, + "time": "2024-09-11T13:17:53+00:00" + }, + { + "name": "rindow/rindow-math-matrix", + "version": "2.1.2", + "source": { + "type": "git", + "url": "https://github.com/rindow/rindow-math-matrix.git", + "reference": "6d6622b4495d6325e4065430d143d6f4c3b5f0c4" + }, + "dist": { + "type": "zip", + "url": "https://api.github.com/repos/rindow/rindow-math-matrix/zipball/6d6622b4495d6325e4065430d143d6f4c3b5f0c4", + "reference": "6d6622b4495d6325e4065430d143d6f4c3b5f0c4", + "shasum": "" + }, + "require": { + "interop-phpobjects/polite-math": "^1.0.7", + "php": "^8.1" + }, + "suggest": { + "rindow/math-plot": "for OpenCL tunning", + "rindow/rindow-math-matrix-matlibffi": "^1.0.4" + }, + "bin": [ + "bin/rindow-math-matrix" + ], + "type": "library", + "autoload": { + "psr-4": { + "Rindow\\Math\\Matrix\\": "src/" + } + }, + "notification-url": "https://packagist.org/downloads/", + "license": [ + "BSD-3-Clause" + ], + "description": "The fundamental package for scientific matrix operation", + "keywords": [ + "N-dimension", + "math", + "matrix", + "operation", + "rindow" + ], + "support": { + "issues": "https://github.com/rindow/rindow-math-matrix/issues", + "source": "https://github.com/rindow/rindow-math-matrix/tree/2.1.2" + }, + "time": "2025-04-13T13:08:33+00:00" + }, + { + "name": "rindow/rindow-matlib-ffi", + "version": "1.0.2", + "source": { + "type": "git", + "url": 
"https://github.com/rindow/rindow-matlib-ffi.git", + "reference": "2daaf402cd330d352cf82e75cbefac0c6608c08e" + }, + "dist": { + "type": "zip", + "url": "https://api.github.com/repos/rindow/rindow-matlib-ffi/zipball/2daaf402cd330d352cf82e75cbefac0c6608c08e", + "reference": "2daaf402cd330d352cf82e75cbefac0c6608c08e", + "shasum": "" + }, + "require": { + "interop-phpobjects/polite-math": "~1.0.7", + "php": "^8.1" + }, + "require-dev": { + "ext-ffi": "*", + "rindow/rindow-math-buffer-ffi": "^1.0" + }, + "type": "library", + "autoload": { + "psr-4": { + "Rindow\\Matlib\\FFI\\": "src/" + } + }, + "notification-url": "https://packagist.org/downloads/", + "license": [ + "BSD-3-Clause" + ], + "description": "The math matrix library for FFI on PHP", + "keywords": [ + "ffi", + "math", + "matrix", + "rindow" + ], + "support": { + "issues": "https://github.com/rindow/rindow-matlib-ffi/issues", + "source": "https://github.com/rindow/rindow-matlib-ffi/tree/1.0.2" + }, + "time": "2024-04-25T14:59:33+00:00" + }, + { + "name": "rindow/rindow-openblas-ffi", + "version": "1.0.6", + "source": { + "type": "git", + "url": "https://github.com/rindow/rindow-openblas-ffi.git", + "reference": "efcddb9b24ac9d2d2f3a7d1092fbd5f66dccbb5e" + }, + "dist": { + "type": "zip", + "url": "https://api.github.com/repos/rindow/rindow-openblas-ffi/zipball/efcddb9b24ac9d2d2f3a7d1092fbd5f66dccbb5e", + "reference": "efcddb9b24ac9d2d2f3a7d1092fbd5f66dccbb5e", + "shasum": "" + }, + "require": { + "interop-phpobjects/polite-math": "^1.0.7", + "php": "^8.1" + }, + "require-dev": { + "ext-ffi": "*", + "rindow/rindow-math-buffer-ffi": "^1.0.5" + }, + "type": "library", + "autoload": { + "psr-4": { + "Rindow\\OpenBLAS\\FFI\\": "src/" + } + }, + "notification-url": "https://packagist.org/downloads/", + "license": [ + "BSD-3-Clause" + ], + "description": "The Openblas library for FFI on PHP", + "keywords": [ + "ffi", + "math", + "openblas", + "rindow" + ], + "support": { + "issues": 
"https://github.com/rindow/rindow-openblas-ffi/issues", + "source": "https://github.com/rindow/rindow-openblas-ffi/tree/1.0.6" + }, + "time": "2025-04-13T12:42:45+00:00" + }, + { + "name": "rokka/imagine-vips", + "version": "0.31.0", + "source": { + "type": "git", + "url": "https://github.com/rokka-io/imagine-vips.git", + "reference": "6c86dc4a988fbd51081973abd29cbc38989e2e94" + }, + "dist": { + "type": "zip", + "url": "https://api.github.com/repos/rokka-io/imagine-vips/zipball/6c86dc4a988fbd51081973abd29cbc38989e2e94", + "reference": "6c86dc4a988fbd51081973abd29cbc38989e2e94", + "shasum": "" + }, + "require": { + "imagine/imagine": "^1.0", + "jcupitt/vips": "^2.1.0 || ^1.0.3", + "phenx/php-font-lib": "^0.5.2", + "php": "^7.2 || ^8.0" + }, + "require-dev": { + "friendsofphp/php-cs-fixer": "^2.8", + "phpstan/phpstan": "^1.8", + "phpunit/phpunit": "^8 || ^9" + }, + "suggest": { + "ext-gd": "to use the GD implementation fallback for saving unsupported file formats", + "ext-imagick": "to use the Imagick implementation fallback for saving unsupported file formats" + }, + "type": "library", + "autoload": { + "psr-0": { + "Imagine": "lib/" + } + }, + "notification-url": "https://packagist.org/downloads/", + "license": [ + "MIT" + ], + "authors": [ + { + "name": "rokka", + "email": "rokka@rokka.io", + "homepage": "https://rokka.io" + } + ], + "description": "libvips adapter for imagine", + "homepage": "https://github.com/rokka-io/imagine-vips", + "keywords": [ + "drawing", + "graphics", + "image manipulation", + "image processing", + "libvips", + "php-vips", + "vips" + ], + "support": { + "issues": "https://github.com/rokka-io/imagine-vips/issues", + "source": "https://github.com/rokka-io/imagine-vips/tree/0.31.0" + }, + "time": "2022-10-12T16:32:32+00:00" + }, + { + "name": "symfony/console", + "version": "v7.4.11", + "source": { + "type": "git", + "url": "https://github.com/symfony/console.git", + "reference": "ed0107e43ab452aa77ae99e005b95e56b556e075" + }, + "dist": { + 
"type": "zip", + "url": "https://api.github.com/repos/symfony/console/zipball/ed0107e43ab452aa77ae99e005b95e56b556e075", + "reference": "ed0107e43ab452aa77ae99e005b95e56b556e075", + "shasum": "" + }, + "require": { + "php": ">=8.2", + "symfony/deprecation-contracts": "^2.5|^3", + "symfony/polyfill-mbstring": "~1.0", + "symfony/service-contracts": "^2.5|^3", + "symfony/string": "^7.2|^8.0" + }, + "conflict": { + "symfony/dependency-injection": "<6.4", + "symfony/dotenv": "<6.4", + "symfony/event-dispatcher": "<6.4", + "symfony/lock": "<6.4", + "symfony/process": "<6.4" + }, + "provide": { + "psr/log-implementation": "1.0|2.0|3.0" + }, + "require-dev": { + "psr/log": "^1|^2|^3", + "symfony/config": "^6.4|^7.0|^8.0", + "symfony/dependency-injection": "^6.4|^7.0|^8.0", + "symfony/event-dispatcher": "^6.4|^7.0|^8.0", + "symfony/http-foundation": "^6.4|^7.0|^8.0", + "symfony/http-kernel": "^6.4|^7.0|^8.0", + "symfony/lock": "^6.4|^7.0|^8.0", + "symfony/messenger": "^6.4|^7.0|^8.0", + "symfony/process": "^6.4|^7.0|^8.0", + "symfony/stopwatch": "^6.4|^7.0|^8.0", + "symfony/var-dumper": "^6.4|^7.0|^8.0" + }, + "type": "library", + "autoload": { + "psr-4": { + "Symfony\\Component\\Console\\": "" + }, + "exclude-from-classmap": [ + "/Tests/" + ] + }, + "notification-url": "https://packagist.org/downloads/", + "license": [ + "MIT" + ], + "authors": [ + { + "name": "Fabien Potencier", + "email": "fabien@symfony.com" + }, + { + "name": "Symfony Community", + "homepage": "https://symfony.com/contributors" + } + ], + "description": "Eases the creation of beautiful and testable command line interfaces", + "homepage": "https://symfony.com", + "keywords": [ + "cli", + "command-line", + "console", + "terminal" + ], + "support": { + "source": "https://github.com/symfony/console/tree/v7.4.11" + }, + "funding": [ + { + "url": "https://symfony.com/sponsor", + "type": "custom" + }, + { + "url": "https://github.com/fabpot", + "type": "github" + }, + { + "url": 
"https://github.com/nicolas-grekas", + "type": "github" + }, + { + "url": "https://tidelift.com/funding/github/packagist/symfony/symfony", + "type": "tidelift" + } + ], + "time": "2026-05-13T12:04:42+00:00" + }, + { + "name": "symfony/deprecation-contracts", + "version": "v3.7.0", + "source": { + "type": "git", + "url": "https://github.com/symfony/deprecation-contracts.git", + "reference": "50f59d1f3ca46d41ac911f97a78626b6756af35b" + }, + "dist": { + "type": "zip", + "url": "https://api.github.com/repos/symfony/deprecation-contracts/zipball/50f59d1f3ca46d41ac911f97a78626b6756af35b", + "reference": "50f59d1f3ca46d41ac911f97a78626b6756af35b", + "shasum": "" + }, + "require": { + "php": ">=8.1" + }, + "type": "library", + "extra": { + "thanks": { + "url": "https://github.com/symfony/contracts", + "name": "symfony/contracts" + }, + "branch-alias": { + "dev-main": "3.7-dev" + } + }, + "autoload": { + "files": [ + "function.php" + ] + }, + "notification-url": "https://packagist.org/downloads/", + "license": [ + "MIT" + ], + "authors": [ + { + "name": "Nicolas Grekas", + "email": "p@tchwork.com" + }, + { + "name": "Symfony Community", + "homepage": "https://symfony.com/contributors" + } + ], + "description": "A generic function and convention to trigger deprecation notices", + "homepage": "https://symfony.com", + "support": { + "source": "https://github.com/symfony/deprecation-contracts/tree/v3.7.0" + }, + "funding": [ + { + "url": "https://symfony.com/sponsor", + "type": "custom" + }, + { + "url": "https://github.com/fabpot", + "type": "github" + }, + { + "url": "https://github.com/nicolas-grekas", + "type": "github" + }, + { + "url": "https://tidelift.com/funding/github/packagist/symfony/symfony", + "type": "tidelift" + } + ], + "time": "2026-04-13T15:52:40+00:00" + }, + { + "name": "symfony/polyfill-ctype", + "version": "v1.37.0", + "source": { + "type": "git", + "url": "https://github.com/symfony/polyfill-ctype.git", + "reference": 
"141046a8f9477948ff284fa65be2095baafb94f2" + }, + "dist": { + "type": "zip", + "url": "https://api.github.com/repos/symfony/polyfill-ctype/zipball/141046a8f9477948ff284fa65be2095baafb94f2", + "reference": "141046a8f9477948ff284fa65be2095baafb94f2", + "shasum": "" + }, + "require": { + "php": ">=7.2" + }, + "provide": { + "ext-ctype": "*" + }, + "suggest": { + "ext-ctype": "For best performance" + }, + "type": "library", + "extra": { + "thanks": { + "url": "https://github.com/symfony/polyfill", + "name": "symfony/polyfill" + } + }, + "autoload": { + "files": [ + "bootstrap.php" + ], + "psr-4": { + "Symfony\\Polyfill\\Ctype\\": "" + } + }, + "notification-url": "https://packagist.org/downloads/", + "license": [ + "MIT" + ], + "authors": [ + { + "name": "Gert de Pagter", + "email": "BackEndTea@gmail.com" + }, + { + "name": "Symfony Community", + "homepage": "https://symfony.com/contributors" + } + ], + "description": "Symfony polyfill for ctype functions", + "homepage": "https://symfony.com", + "keywords": [ + "compatibility", + "ctype", + "polyfill", + "portable" + ], + "support": { + "source": "https://github.com/symfony/polyfill-ctype/tree/v1.37.0" + }, + "funding": [ + { + "url": "https://symfony.com/sponsor", + "type": "custom" + }, + { + "url": "https://github.com/fabpot", + "type": "github" + }, + { + "url": "https://github.com/nicolas-grekas", + "type": "github" + }, + { + "url": "https://tidelift.com/funding/github/packagist/symfony/symfony", + "type": "tidelift" + } + ], + "time": "2026-04-10T16:19:22+00:00" + }, + { + "name": "symfony/polyfill-intl-grapheme", + "version": "v1.37.0", + "source": { + "type": "git", + "url": "https://github.com/symfony/polyfill-intl-grapheme.git", + "reference": "4864388bfbd3001ce88e234fab652acd91fdc57e" + }, + "dist": { + "type": "zip", + "url": "https://api.github.com/repos/symfony/polyfill-intl-grapheme/zipball/4864388bfbd3001ce88e234fab652acd91fdc57e", + "reference": "4864388bfbd3001ce88e234fab652acd91fdc57e", + "shasum": 
"" + }, + "require": { + "php": ">=7.2" + }, + "suggest": { + "ext-intl": "For best performance" + }, + "type": "library", + "extra": { + "thanks": { + "url": "https://github.com/symfony/polyfill", + "name": "symfony/polyfill" + } + }, + "autoload": { + "files": [ + "bootstrap.php" + ], + "psr-4": { + "Symfony\\Polyfill\\Intl\\Grapheme\\": "" + } + }, + "notification-url": "https://packagist.org/downloads/", + "license": [ + "MIT" + ], + "authors": [ + { + "name": "Nicolas Grekas", + "email": "p@tchwork.com" + }, + { + "name": "Symfony Community", + "homepage": "https://symfony.com/contributors" + } + ], + "description": "Symfony polyfill for intl's grapheme_* functions", + "homepage": "https://symfony.com", + "keywords": [ + "compatibility", + "grapheme", + "intl", + "polyfill", + "portable", + "shim" + ], + "support": { + "source": "https://github.com/symfony/polyfill-intl-grapheme/tree/v1.37.0" + }, + "funding": [ + { + "url": "https://symfony.com/sponsor", + "type": "custom" + }, + { + "url": "https://github.com/fabpot", + "type": "github" + }, + { + "url": "https://github.com/nicolas-grekas", + "type": "github" + }, + { + "url": "https://tidelift.com/funding/github/packagist/symfony/symfony", + "type": "tidelift" + } + ], + "time": "2026-04-26T13:13:48+00:00" + }, + { + "name": "symfony/polyfill-intl-normalizer", + "version": "v1.37.0", + "source": { + "type": "git", + "url": "https://github.com/symfony/polyfill-intl-normalizer.git", + "reference": "3833d7255cc303546435cb650316bff708a1c75c" + }, + "dist": { + "type": "zip", + "url": "https://api.github.com/repos/symfony/polyfill-intl-normalizer/zipball/3833d7255cc303546435cb650316bff708a1c75c", + "reference": "3833d7255cc303546435cb650316bff708a1c75c", + "shasum": "" + }, + "require": { + "php": ">=7.2" + }, + "suggest": { + "ext-intl": "For best performance" + }, + "type": "library", + "extra": { + "thanks": { + "url": "https://github.com/symfony/polyfill", + "name": "symfony/polyfill" + } + }, + "autoload": 
{ + "files": [ + "bootstrap.php" + ], + "psr-4": { + "Symfony\\Polyfill\\Intl\\Normalizer\\": "" + }, + "classmap": [ + "Resources/stubs" + ] + }, + "notification-url": "https://packagist.org/downloads/", + "license": [ + "MIT" + ], + "authors": [ + { + "name": "Nicolas Grekas", + "email": "p@tchwork.com" + }, + { + "name": "Symfony Community", + "homepage": "https://symfony.com/contributors" + } + ], + "description": "Symfony polyfill for intl's Normalizer class and related functions", + "homepage": "https://symfony.com", + "keywords": [ + "compatibility", + "intl", + "normalizer", + "polyfill", + "portable", + "shim" + ], + "support": { + "source": "https://github.com/symfony/polyfill-intl-normalizer/tree/v1.37.0" + }, + "funding": [ + { + "url": "https://symfony.com/sponsor", + "type": "custom" + }, + { + "url": "https://github.com/fabpot", + "type": "github" + }, + { + "url": "https://github.com/nicolas-grekas", + "type": "github" + }, + { + "url": "https://tidelift.com/funding/github/packagist/symfony/symfony", + "type": "tidelift" + } + ], + "time": "2024-09-09T11:45:10+00:00" + }, + { + "name": "symfony/polyfill-mbstring", + "version": "v1.37.0", + "source": { + "type": "git", + "url": "https://github.com/symfony/polyfill-mbstring.git", + "reference": "6a21eb99c6973357967f6ce3708cd55a6bec6315" + }, + "dist": { + "type": "zip", + "url": "https://api.github.com/repos/symfony/polyfill-mbstring/zipball/6a21eb99c6973357967f6ce3708cd55a6bec6315", + "reference": "6a21eb99c6973357967f6ce3708cd55a6bec6315", + "shasum": "" + }, + "require": { + "ext-iconv": "*", + "php": ">=7.2" + }, + "provide": { + "ext-mbstring": "*" + }, + "suggest": { + "ext-mbstring": "For best performance" + }, + "type": "library", + "extra": { + "thanks": { + "url": "https://github.com/symfony/polyfill", + "name": "symfony/polyfill" + } + }, + "autoload": { + "files": [ + "bootstrap.php" + ], + "psr-4": { + "Symfony\\Polyfill\\Mbstring\\": "" + } + }, + "notification-url": 
"https://packagist.org/downloads/", + "license": [ + "MIT" + ], + "authors": [ + { + "name": "Nicolas Grekas", + "email": "p@tchwork.com" + }, + { + "name": "Symfony Community", + "homepage": "https://symfony.com/contributors" + } + ], + "description": "Symfony polyfill for the Mbstring extension", + "homepage": "https://symfony.com", + "keywords": [ + "compatibility", + "mbstring", + "polyfill", + "portable", + "shim" + ], + "support": { + "source": "https://github.com/symfony/polyfill-mbstring/tree/v1.37.0" + }, + "funding": [ + { + "url": "https://symfony.com/sponsor", + "type": "custom" + }, + { + "url": "https://github.com/fabpot", + "type": "github" + }, + { + "url": "https://github.com/nicolas-grekas", + "type": "github" + }, + { + "url": "https://tidelift.com/funding/github/packagist/symfony/symfony", + "type": "tidelift" + } + ], + "time": "2026-04-10T17:25:58+00:00" + }, + { + "name": "symfony/service-contracts", + "version": "v3.7.0", + "source": { + "type": "git", + "url": "https://github.com/symfony/service-contracts.git", + "reference": "d25d82433a80eba6aa0e6c24b61d7370d99e444a" + }, + "dist": { + "type": "zip", + "url": "https://api.github.com/repos/symfony/service-contracts/zipball/d25d82433a80eba6aa0e6c24b61d7370d99e444a", + "reference": "d25d82433a80eba6aa0e6c24b61d7370d99e444a", + "shasum": "" + }, + "require": { + "php": ">=8.1", + "psr/container": "^1.1|^2.0", + "symfony/deprecation-contracts": "^2.5|^3" + }, + "conflict": { + "ext-psr": "<1.1|>=2" + }, + "type": "library", + "extra": { + "thanks": { + "url": "https://github.com/symfony/contracts", + "name": "symfony/contracts" + }, + "branch-alias": { + "dev-main": "3.7-dev" + } + }, + "autoload": { + "psr-4": { + "Symfony\\Contracts\\Service\\": "" + }, + "exclude-from-classmap": [ + "/Test/" + ] + }, + "notification-url": "https://packagist.org/downloads/", + "license": [ + "MIT" + ], + "authors": [ + { + "name": "Nicolas Grekas", + "email": "p@tchwork.com" + }, + { + "name": "Symfony 
Community", + "homepage": "https://symfony.com/contributors" + } + ], + "description": "Generic abstractions related to writing services", + "homepage": "https://symfony.com", + "keywords": [ + "abstractions", + "contracts", + "decoupling", + "interfaces", + "interoperability", + "standards" + ], + "support": { + "source": "https://github.com/symfony/service-contracts/tree/v3.7.0" + }, + "funding": [ + { + "url": "https://symfony.com/sponsor", + "type": "custom" + }, + { + "url": "https://github.com/fabpot", + "type": "github" + }, + { + "url": "https://github.com/nicolas-grekas", + "type": "github" + }, + { + "url": "https://tidelift.com/funding/github/packagist/symfony/symfony", + "type": "tidelift" + } + ], + "time": "2026-03-28T09:44:51+00:00" + }, + { + "name": "symfony/string", + "version": "v8.0.11", + "source": { + "type": "git", + "url": "https://github.com/symfony/string.git", + "reference": "39be2ad058a3c0bd558edca23e65f009865d75ff" + }, + "dist": { + "type": "zip", + "url": "https://api.github.com/repos/symfony/string/zipball/39be2ad058a3c0bd558edca23e65f009865d75ff", + "reference": "39be2ad058a3c0bd558edca23e65f009865d75ff", + "shasum": "" + }, + "require": { + "php": ">=8.4", + "symfony/polyfill-ctype": "^1.8", + "symfony/polyfill-intl-grapheme": "^1.33", + "symfony/polyfill-intl-normalizer": "^1.0", + "symfony/polyfill-mbstring": "^1.0" + }, + "conflict": { + "symfony/translation-contracts": "<2.5" + }, + "require-dev": { + "symfony/emoji": "^7.4|^8.0", + "symfony/http-client": "^7.4|^8.0", + "symfony/intl": "^7.4|^8.0", + "symfony/translation-contracts": "^2.5|^3.0", + "symfony/var-exporter": "^7.4|^8.0" + }, + "type": "library", + "autoload": { + "files": [ + "Resources/functions.php" + ], + "psr-4": { + "Symfony\\Component\\String\\": "" + }, + "exclude-from-classmap": [ + "/Tests/" + ] + }, + "notification-url": "https://packagist.org/downloads/", + "license": [ + "MIT" + ], + "authors": [ + { + "name": "Nicolas Grekas", + "email": 
"p@tchwork.com" + }, + { + "name": "Symfony Community", + "homepage": "https://symfony.com/contributors" + } + ], + "description": "Provides an object-oriented API to strings and deals with bytes, UTF-8 code points and grapheme clusters in a unified way", + "homepage": "https://symfony.com", + "keywords": [ + "grapheme", + "i18n", + "string", + "unicode", + "utf-8", + "utf8" + ], + "support": { + "source": "https://github.com/symfony/string/tree/v8.0.11" + }, + "funding": [ + { + "url": "https://symfony.com/sponsor", + "type": "custom" + }, + { + "url": "https://github.com/fabpot", + "type": "github" + }, + { + "url": "https://github.com/nicolas-grekas", + "type": "github" + }, + { + "url": "https://tidelift.com/funding/github/packagist/symfony/symfony", + "type": "tidelift" + } + ], + "time": "2026-05-13T12:07:53+00:00" + } + ], + "packages-dev": [], + "aliases": [], + "minimum-stability": "stable", + "stability-flags": {}, + "prefer-stable": true, + "prefer-lowest": false, + "platform": { + "php": "^8.2" + }, + "platform-dev": {}, + "plugin-api-version": "2.6.0" +} diff --git a/content/develop/use-cases/semantic-cache/php/index.html b/content/develop/use-cases/semantic-cache/php/index.html new file mode 100644 index 0000000000..e897cfdee7 --- /dev/null +++ b/content/develop/use-cases/semantic-cache/php/index.html @@ -0,0 +1,513 @@ + + + + + + Redis Semantic Cache Demo + + + +
+
loading…
+

Redis Semantic Cache Demo

+

+ A small semantic cache sits in front of a mock LLM. Each cache + entry is a Hash at __KEY_PREFIX__<id> holding + the prompt, the response, the prompt's 384-dimensional embedding, + and metadata fields. A single FT.SEARCH on + __INDEX_NAME__ does the KNN against cached prompts + with a TAG pre-filter (tenant, locale, model version, safety) in + the same round trip. If the closest cached prompt is within the + cosine-distance threshold, the demo serves the cached response + and the LLM is not called at all. +

+ +
+ +
+

Ask the LLM

+

Type a question, optionally adjust the metadata filters and + the distance threshold, and submit. The server embeds the + prompt, runs FT.SEARCH with KNN over the cache, + and either serves the cached response (hit) or runs the mock + LLM and writes the new response back to the cache (miss).
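
The flow described above (embed the prompt, find the nearest cached prompt, then either serve the hit or call the LLM and write the answer back) can be sketched in plain PHP. This is a minimal illustration under stated assumptions: it uses toy 2-dimensional vectors and a brute-force scan over an in-memory array in place of the real FT.SEARCH KNN query, and the names `cosine_distance` and `lookup_or_generate` are hypothetical, not part of the demo's code.

```php
<?php

/** Cosine distance (1 - cosine similarity) between two equal-length vectors. */
function cosine_distance(array $a, array $b): float
{
    $dot = 0.0;
    $normA = 0.0;
    $normB = 0.0;
    foreach ($a as $i => $x) {
        $dot += $x * $b[$i];
        $normA += $x * $x;
        $normB += $b[$i] * $b[$i];
    }
    if ($normA == 0.0 || $normB == 0.0) {
        // Assumption for this sketch: a zero vector is treated as unrelated.
        return 1.0;
    }
    return 1.0 - $dot / (sqrt($normA) * sqrt($normB));
}

/**
 * Hypothetical stand-in for the cache lookup: scan every entry, keep the
 * nearest one, and serve it only if it is within the threshold.
 */
function lookup_or_generate(
    array &$cache,
    array $queryVec,
    string $prompt,
    float $threshold,
    callable $llm
): array {
    $bestIndex = null;
    $bestDistance = null;
    foreach ($cache as $i => $entry) {
        $d = cosine_distance($queryVec, $entry['embedding']);
        if ($bestDistance === null || $d < $bestDistance) {
            $bestIndex = $i;
            $bestDistance = $d;
        }
    }
    if ($bestIndex !== null && $bestDistance <= $threshold) {
        // Hit: the LLM callable is never invoked.
        return [
            'outcome' => 'hit',
            'response' => $cache[$bestIndex]['response'],
            'distance' => $bestDistance,
        ];
    }
    // Miss: generate, then write back using the vector we already have.
    $response = $llm($prompt);
    $cache[] = ['prompt' => $prompt, 'response' => $response, 'embedding' => $queryVec];
    return ['outcome' => 'miss', 'response' => $response, 'distance' => $bestDistance];
}

$cache = [];
$calls = 0;
$llm = function (string $prompt) use (&$calls): string {
    $calls++;
    return "answer to: $prompt";
};

$first = lookup_or_generate($cache, [1.0, 0.0], 'How do I return an item?', 0.5, $llm);
$second = lookup_or_generate($cache, [0.9, 0.1], 'What is your return policy?', 0.5, $llm);
```

The first call misses because the cache is empty; the second lands close to the stored vector and is served from the cache, so the mock LLM callable runs only once.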

+ + +
+
+ + +
+
+ + +
+
+ + +
+
+
+ + + 0.50 +
+

+ The cache serves a hit when the closest cached prompt's + cosine distance is at or below this threshold. Lower = + stricter (fewer hits, safer reuse); higher = looser (more + hits, more risk of serving a near-miss). +
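
To make the "lower = stricter" trade-off concrete, the sketch below applies the same rule (distance at or below the threshold counts as a hit) to a list of nearest-neighbour distances at three thresholds. The distances are made-up illustration values and `is_hit` / `count_hits` are hypothetical helper names, not functions from the demo.

```php
<?php

/** The hit rule from the text: serve from cache when distance <= threshold. */
function is_hit(float $distance, float $threshold): bool
{
    return $distance <= $threshold;
}

/** Count how many lookups would be served from the cache at a threshold. */
function count_hits(array $distances, float $threshold): int
{
    $hits = 0;
    foreach ($distances as $distance) {
        if (is_hit($distance, $threshold)) {
            $hits++;
        }
    }
    return $hits;
}

// Made-up nearest-neighbour cosine distances for six incoming prompts.
$nearestDistances = [0.05, 0.22, 0.48, 0.50, 0.73, 1.10];

foreach ([0.2, 0.5, 0.8] as $threshold) {
    printf(
        "threshold %.2f -> %d of %d served from cache\n",
        $threshold,
        count_hits($nearestDistances, $threshold),
        count($nearestDistances)
    );
}
```

With these sample values the stricter 0.2 threshold serves one prompt from the cache, 0.5 serves four, and 0.8 serves five, including the 0.73 near-miss that a stricter setting would have sent to the LLM.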

+ + + + + +
+
+ +
+

Cumulative savings

+

Every hit avoids one LLM round trip. The numbers below add + up across the session — tokens that would have been spent and + wall-clock seconds that would have been waited if the cache + had not served the answer.
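
One way to accumulate these counters is sketched below. It is an assumption-laden illustration rather than the page's actual script: `SavingsTracker` is a hypothetical class, tokens are estimated at roughly four characters per token, and each hit is credited with the configured mock-LLM latency.

```php
<?php

final class SavingsTracker
{
    public int $queries = 0;
    public int $hits = 0;
    public int $misses = 0;
    public int $tokensSaved = 0;
    public float $msSaved = 0.0;

    /** Record one query; only hits add to the savings totals. */
    public function record(bool $hit, string $prompt, string $response, float $llmLatencyMs): void
    {
        $this->queries++;
        if (!$hit) {
            $this->misses++;
            return;
        }
        $this->hits++;
        // Rough estimate: about 4 characters per token, never less than 1.
        $this->tokensSaved += max(1, intdiv(strlen($prompt) + strlen($response), 4));
        $this->msSaved += $llmLatencyMs;
    }

    /** Fraction of queries served from the cache (0.0 when nothing was asked). */
    public function hitRatio(): float
    {
        return $this->queries === 0 ? 0.0 : $this->hits / $this->queries;
    }
}

$stats = new SavingsTracker();
$stats->record(true, 'abcdefgh', 'abcdefghijkl', 1500.0);          // hit: 20 chars
$stats->record(false, 'another prompt', 'fresh answer', 1500.0);   // miss: no savings
```

Misses still count toward the total and the hit ratio, but they add nothing to the token or latency savings because the LLM was actually called.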

+
+
+
0
+
Total queries
+
+
+
0
+
Cache hits
+
+
+
0
+
Cache misses
+
+
+
0%
+
Hit ratio
+
+
+
0
+
Tokens saved
+
+
+
0 ms
+
LLM time saved
+
+
+
+ +
+

Index state

+
+ +
+ +
+

Cached entries

+

Every prompt/response pair currently in the cache. + hit_count is the running total of times the entry + has served a hit; ttl is the remaining lifetime + in seconds before EXPIRE drops the key. Click + Drop to simulate eviction.

+ + + + + + + + + + + + +
IDPromptMetadataHitsTTL
+
+ +
+ +
+
+ + + + diff --git a/content/develop/use-cases/semantic-cache/php/public/index.php b/content/develop/use-cases/semantic-cache/php/public/index.php new file mode 100644 index 0000000000..936f36633f --- /dev/null +++ b/content/develop/use-cases/semantic-cache/php/public/index.php @@ -0,0 +1,427 @@ + getenv('SEMCACHE_REDIS_HOST') ?: 'localhost', + 'redis_port' => (int) (getenv('SEMCACHE_REDIS_PORT') ?: '6379'), + 'index_name' => getenv('SEMCACHE_INDEX_NAME') ?: 'semcache:idx', + 'key_prefix' => getenv('SEMCACHE_KEY_PREFIX') ?: 'cache:', + 'ttl_seconds' => (int) (getenv('SEMCACHE_TTL_SECONDS') ?: '3600'), + 'threshold' => $threshold, + 'llm_latency_ms' => (float) (getenv('SEMCACHE_LLM_LATENCY_MS') ?: '1500'), + 'reseed' => $reseed, + 'stack_label' => 'Predis + codewithkyrian/transformers-php + PHP built-in HTTP server', + ]; +} + +// ----- HTTP helpers -------------------------------------------------- + +function send_json(mixed $payload, int $status = 200): void +{ + http_response_code($status); + header('Content-Type: application/json'); + echo json_encode($payload, JSON_UNESCAPED_SLASHES | JSON_PARTIAL_OUTPUT_ON_ERROR); +} + +function send_html(string $html, int $status = 200): void +{ + http_response_code($status); + header('Content-Type: text/html; charset=utf-8'); + echo $html; +} + +/** + * Clamp the threshold to the meaningful cosine-distance range. + * + * PHP differs from JavaScript and Python here: `(float) "nan"` is + * `0.0` and `(float) "inf"` is `0.0` (PHP's cast bails out on the + * first non-numeric character). So unlike the Node.js port, we + * cannot rely on `is_finite()` alone; we'd happily accept "nan" as + * 0.0, which would silently turn every lookup into a miss. Detect + * the textual forms explicitly first, then parse, then range-check + * with `is_finite()` as a belt-and-braces guard. Non-numeric and + * non-finite threshold values fall back to the documented 0.5 + * default, matching the behaviour the other-language ports + * advertise; finite values are clamped to [0.0, 2.0]. 
+ */ +function clamp_threshold(?string $raw): float +{ + if ($raw === null || $raw === '') { + return 0.5; + } + $trimmed = strtolower(trim($raw)); + if (in_array($trimmed, ['nan', 'inf', '-inf', '+inf', 'infinity', '-infinity'], true)) { + return 0.5; + } + if (!is_numeric($trimmed)) { + return 0.5; + } + $parsed = (float) $trimmed; + if (!is_finite($parsed)) { + return 0.5; + } + return max(0.0, min(2.0, $parsed)); +} + +/** + * Read POST body fields. PHP's `$_POST` is already populated for + * application/x-www-form-urlencoded bodies by the built-in server. + * This function does no size checking itself. php.ini's + * `post_max_size` defaults to 8M, and PHP discards a larger body + * and leaves `$_POST` empty rather than truncating it, so `run.sh` + * lowers the cap to a more conservative 1 MiB and the router checks + * `Content-Length` before calling this function so the failure + * mode is a clean 413 JSON response. + */ +function read_post(): array +{ + return $_POST; +} + +function estimate_response_tokens(string $prompt, string $response): int +{ + return max(1, (int) floor((strlen($prompt) + strlen($response)) / 4)); +} + +// ----- Demo orchestration ------------------------------------------ + +function run_query( + RedisSemanticCache $cache, + LocalEmbedder $embedder, + MockLLM $llm, + string $prompt, + string $tenant, + string $locale, + string $modelVersion, + float $threshold, + bool $lookupOnly, +): array { + $t0 = hrtime(true); + $queryVec = $embedder->encodeOne($prompt); + $embedMs = (hrtime(true) - $t0) / 1e6; + + $t1 = hrtime(true); + $result = $cache->lookup( + queryVec: $queryVec, + tenant: $tenant, + locale: $locale, + modelVersion: $modelVersion, + distanceThreshold: $threshold, + ); + $lookupMs = (hrtime(true) - $t1) / 1e6; + + if ($result instanceof CacheHit) { + return [ + 'outcome' => 'hit', + 'response' => $result->response, + 'entry_id' => $result->id, + 'distance' => $result->distance, + 'ttl_seconds' => $result->ttlSeconds, + 'hit_count' => $result->hitCount, + 'threshold' => $threshold, + 'embed_ms' => 
$embedMs, + 'lookup_ms' => $lookupMs, + 'llm_ms' => null, + 'total_ms' => $embedMs + $lookupMs, + 'tokens_avoided' => estimate_response_tokens($result->prompt, $result->response), + 'ms_avoided' => $llm->latencyMs, + ]; + } + + if ($lookupOnly) { + return [ + 'outcome' => 'miss', + 'response' => '(LLM not called in lookup-only mode)', + 'nearest_distance' => $result->nearestDistance, + 'threshold' => $threshold, + 'wrote_entry_id' => null, + 'embed_ms' => $embedMs, + 'lookup_ms' => $lookupMs, + 'llm_ms' => null, + 'total_ms' => $embedMs + $lookupMs, + ]; + } + + $t2 = hrtime(true); + $llmResponse = $llm->complete($prompt); + $llmMs = (hrtime(true) - $t2) / 1e6; + + // Write the new entry back. The embedding is the same vector we + // already used for the lookup — no need to re-encode. + $entryId = $cache->put( + prompt: $prompt, + response: $llmResponse['response'], + embedding: $queryVec, + tenant: $tenant, + locale: $locale, + modelVersion: $modelVersion, + ); + + return [ + 'outcome' => 'miss', + 'response' => $llmResponse['response'], + 'nearest_distance' => $result->nearestDistance, + 'threshold' => $threshold, + 'wrote_entry_id' => $entryId, + 'embed_ms' => $embedMs, + 'lookup_ms' => $lookupMs, + 'llm_ms' => $llmMs, + 'total_ms' => $embedMs + $lookupMs + $llmMs, + ]; +} + +function build_state( + RedisSemanticCache $cache, + LocalEmbedder $embedder, + MockLLM $llm, + string $stackLabel, +): array { + $info = $cache->indexInfo(); + return [ + 'index' => array_merge($info, [ + 'index_name' => $cache->indexName, + 'model' => $embedder->modelName, + 'mock_llm_latency_ms' => $llm->latencyMs, + 'default_threshold' => $cache->distanceThreshold, + 'stack_label' => $stackLabel, + ]), + 'entries' => $cache->listEntries(200), + ]; +} + +function load_html_page(string $indexName, string $keyPrefix): string +{ + $path = __DIR__ . 
'/../index.html'; + $raw = file_get_contents($path); + if ($raw === false) { + throw new RuntimeException("could not read $path"); + } + return strtr($raw, [ + '__INDEX_NAME__' => $indexName, + '__KEY_PREFIX__' => $keyPrefix, + ]); +} + +// ----- Built-in server pre-flight ----------------------------------- +// +// The built-in server invokes this script for both static and dynamic +// paths. We don't host any static assets other than the HTML page, +// so this script never returns false to hand a request back to the +// built-in server; the router below handles every path itself and +// answers unknown ones with a JSON 404. + +// JSON exception wrapper. Without this, an uncaught exception escapes +// to the default error handler and the client's `await res.json()` +// explodes with an opaque parse error instead of surfacing what +// actually went wrong. +set_exception_handler(function (Throwable $exc): void { + if (!headers_sent()) { + http_response_code(500); + header('Content-Type: application/json'); + } + error_log('[demo] handler error: ' . $exc::class . ': ' . $exc->getMessage()); + echo json_encode([ + 'error' => $exc->getMessage(), + 'type' => $exc::class, + ]); +}); + +// ----- Router ------------------------------------------------------ + +$config = load_config(); +$method = $_SERVER['REQUEST_METHOD'] ?? 'GET'; +$path = parse_url($_SERVER['REQUEST_URI'] ?? '/', PHP_URL_PATH) ?: '/'; + +// Cap POST body size *defensively*. The built-in server has already +// applied php.ini's `post_max_size` by the time we get here, so a +// 50 MiB body would have been rejected at the SAPI layer with an +// empty `$_POST`. We still check `Content-Length` so the failure mode +// is a clean 413 JSON response rather than a confusing "request +// missing required field" downstream. +$maxBodyBytes = 1 * 1024 * 1024; +if (in_array($method, ['POST', 'PUT', 'PATCH'], true)) { + $contentLength = (int) ($_SERVER['CONTENT_LENGTH'] ?? 
0); + if ($contentLength > $maxBodyBytes) { + send_json([ + 'error' => "request body exceeds {$maxBodyBytes} bytes", + 'type' => 'PayloadTooLarge', + ], 413); + return; + } +} + +// Connect to Redis up front: the lookup and the state endpoints both +// need it, and a connection error should surface as a clean 500 +// instead of a partial response. +$client = new Client([ + 'host' => $config['redis_host'], + 'port' => $config['redis_port'], +]); + +$cache = new RedisSemanticCache( + client: $client, + indexName: $config['index_name'], + keyPrefix: $config['key_prefix'], + distanceThreshold: $config['threshold'], + defaultTtlSeconds: $config['ttl_seconds'], +); +$cache->createIndex(); + +$embedder = LocalEmbedder::create(); +$llm = new MockLLM(latencyMs: $config['llm_latency_ms']); + +// Per-request reseed check. PHP's built-in server re-runs this +// script from scratch on every request, with no state shared +// between requests, so the `num_docs` check is what actually +// guards against re-seeding on every request: once the seed has +// run and the index has docs in it, this branch is a no-op. The +// previous version of this block gated the reseed on `path === '/'`, +// which left the cache empty if the first request was to `/state`, +// `/query`, or anything else. +if ($config['reseed'] && $cache->indexInfo()['num_docs'] === 0) { + $cache->clear(); + SeedCache::seed($cache, $embedder); +} + +if ($method === 'GET') { + if ($path === '/' || $path === '/index.html') { + send_html(load_html_page($config['index_name'], $config['key_prefix'])); + return; + } + if ($path === '/state') { + send_json(build_state($cache, $embedder, $llm, $config['stack_label'])); + return; + } + send_json(['error' => 'not found'], 404); + return; +} + +if ($method === 'POST') { + $params = read_post(); + + if ($path === '/query') { + $prompt = trim((string) ($params['prompt'] ?? 
'')); + if ($prompt === '') { + send_json(['error' => 'prompt is required'], 400); + return; + } + $payload = run_query( + cache: $cache, + embedder: $embedder, + llm: $llm, + prompt: $prompt, + tenant: (string) ($params['tenant'] ?? 'acme'), + locale: (string) ($params['locale'] ?? 'en'), + modelVersion: (string) ($params['model_version'] ?? $llm->modelVersion), + threshold: clamp_threshold($params['threshold'] ?? '0.5'), + lookupOnly: !empty($params['lookup_only']), + ); + send_json($payload); + return; + } + + if ($path === '/reset') { + $cache->clear(); + SeedCache::seed($cache, $embedder); + send_json(['ok' => true]); + return; + } + + if ($path === '/drop') { + $entryId = trim((string) ($params['entry_id'] ?? '')); + if ($entryId === '') { + send_json(['error' => 'entry_id is required'], 400); + return; + } + $deleted = $cache->deleteEntry($entryId); + send_json(['deleted' => $deleted, 'entry_id' => $entryId]); + return; + } + + send_json(['error' => 'not found'], 404); + return; +} + +send_json(['error' => 'method not allowed'], 405); diff --git a/content/develop/use-cases/semantic-cache/php/run.sh b/content/develop/use-cases/semantic-cache/php/run.sh new file mode 100755 index 0000000000..500df44f5e --- /dev/null +++ b/content/develop/use-cases/semantic-cache/php/run.sh @@ -0,0 +1,35 @@ +#!/usr/bin/env bash +# Convenience launcher for the Redis semantic-cache demo (PHP). +# +# Reads SEMCACHE_* environment variables for configuration (see +# `_index.md`) and starts PHP's built-in dev server with the +# `public/index.php` front controller. The `-d post_max_size=1M` +# override is deliberate: php.ini's default of 8M is generous for a +# demo whose largest legitimate body is a few hundred bytes of +# form-encoded query fields, and the smaller cap is paired with the +# defensive Content-Length check inside the front controller so an +# oversized POST is rejected at the SAPI layer with an empty $_POST +# rather than accumulated in memory. 
+ +set -euo pipefail + +HOST="${SEMCACHE_HOST:-127.0.0.1}" +PORT="${SEMCACHE_PORT:-8093}" + +cd "$(dirname "$0")" + +if [[ ! -d vendor ]]; then + echo "vendor/ is missing — run 'composer install' first." >&2 + exit 1 +fi + +exec php \ + -d post_max_size=1M \ + -d upload_max_filesize=1M \ + -d memory_limit=256M \ + -d error_reporting='E_ALL & ~E_DEPRECATED & ~E_USER_DEPRECATED' \ + -d display_errors=0 \ + -d ffi.enable=true \ + -S "${HOST}:${PORT}" \ + -t public \ + public/index.php diff --git a/content/develop/use-cases/semantic-cache/php/src/CacheHit.php b/content/develop/use-cases/semantic-cache/php/src/CacheHit.php new file mode 100644 index 0000000000..32fa6f80b7 --- /dev/null +++ b/content/develop/use-cases/semantic-cache/php/src/CacheHit.php @@ -0,0 +1,46 @@ + + */ + public function toArray(): array + { + return [ + 'id' => $this->id, + 'prompt' => $this->prompt, + 'response' => $this->response, + 'tenant' => $this->tenant, + 'locale' => $this->locale, + 'model_version' => $this->modelVersion, + 'distance' => round($this->distance, 4), + 'ttl_seconds' => $this->ttlSeconds, + 'hit_count' => $this->hitCount, + ]; + } +} diff --git a/content/develop/use-cases/semantic-cache/php/src/CacheMiss.php b/content/develop/use-cases/semantic-cache/php/src/CacheMiss.php new file mode 100644 index 0000000000..554a0b6d77 --- /dev/null +++ b/content/develop/use-cases/semantic-cache/php/src/CacheMiss.php @@ -0,0 +1,35 @@ + + */ + public function toArray(): array + { + return [ + 'nearest_distance' => $this->nearestDistance === null + ? 
null + : round($this->nearestDistance, 4), + 'nearest_id' => $this->nearestId, + ]; + } +} diff --git a/content/develop/use-cases/semantic-cache/php/src/LocalEmbedder.php b/content/develop/use-cases/semantic-cache/php/src/LocalEmbedder.php new file mode 100644 index 0000000000..f920f2184e --- /dev/null +++ b/content/develop/use-cases/semantic-cache/php/src/LocalEmbedder.php @@ -0,0 +1,222 @@ +extractor = $extractor; + } + + /** + * Build the embedder. + * + * The pipeline load is synchronous in transformers-php but still + * does I/O (model download on first run, ONNX session + * initialisation on every run), so wrapping the constructor in a + * static factory keeps the dimension probe in one place and makes + * it explicit that creating an embedder is not free. + */ + public static function create(string $modelName = self::DEFAULT_MODEL): self + { + $extractor = pipeline('feature-extraction', $modelName); + + // Probe the output shape once so we have an authoritative + // dimension to compare against the index's expected vectorDim. + // RedisSemanticCache also checks length on every put / lookup, + // so a model swap that produces wrong-dim vectors fails at the + // call site with a clear error instead of as a server-side + // FT.SEARCH parse failure. + $probe = $extractor('dimension probe', pooling: 'mean', normalize: true); + $vector = self::firstRow($probe); + $dim = count($vector); + if ($dim < 1) { + throw new RuntimeException( + "embedder probe returned an empty vector for model {$modelName}" + ); + } + return new self($modelName, $extractor, $dim); + } + + /** + * Encode a single string. Returns a list of length `dim`. + * + * @return list + */ + public function encodeOne(string $text): array + { + $out = ($this->extractor)($text, pooling: 'mean', normalize: true); + $vec = self::firstRow($out); + return $this->finaliseVector($vec); + } + + /** + * Encode several strings in one pipeline call. 
Returns an array + * of list; callers that need raw bytes use `toBytes()`. + * + * @param list $texts + * @return list> + */ + public function encodeMany(array $texts): array + { + if ($texts === []) { + return []; + } + $out = ($this->extractor)($texts, pooling: 'mean', normalize: true); + + // The pipeline returns a 2-D nested array: rows × dims. If the + // pipeline ever returns a different number of vectors than + // inputs (a tokenizer truncation bug, a model swap quirk) the + // alignment with `texts` would be silently wrong and the seed + // step would write the wrong embedding next to each FAQ + // entry — catastrophic for the demo and impossible to debug. + if (!is_array($out) || !array_is_list($out)) { + throw new RuntimeException( + 'feature-extraction pipeline returned an unexpected shape' + ); + } + if (count($out) !== count($texts)) { + throw new RuntimeException(sprintf( + 'feature-extraction pipeline returned %d vectors for %d inputs', + count($out), + count($texts), + )); + } + + $result = []; + foreach ($out as $row) { + $result[] = $this->finaliseVector(self::asFloatList($row)); + } + return $result; + } + + /** + * Pack a vector into the little-endian float32 bytes Redis Search + * expects for a FLOAT32 vector field. + * + * Uses `pack('g*', ...)` exactly as the established PHP example + * does (`develop/clients/php/vecsearch.md`). `g` is the + * little-endian single-precision IEEE-754 float specifier, so the + * byte layout matches the encoding used by the redis-py, Node.js, + * Go, and Jedis ports. + * + * @param list $vector + */ + public static function toBytes(array $vector): string + { + return pack('g*', ...$vector); + } + + /** + * Drop the leading batch dimension. The feature-extraction + * pipeline returns a 2-D nested array (`[1][dim]` for a single + * string), mirroring the Python `sentence-transformers` shape. 
+ * + * @param mixed $tensor + * @return list + */ + private static function firstRow(mixed $tensor): array + { + if (!is_array($tensor) || $tensor === []) { + throw new RuntimeException('embedder returned an empty tensor'); + } + $first = $tensor[0]; + return self::asFloatList($first); + } + + /** + * @param mixed $row + * @return list + */ + private static function asFloatList(mixed $row): array + { + if (!is_array($row)) { + throw new RuntimeException('embedder row is not an array'); + } + $out = []; + foreach ($row as $value) { + if (!is_int($value) && !is_float($value)) { + throw new RuntimeException( + 'embedder row contains a non-numeric value' + ); + } + $out[] = (float) $value; + } + return $out; + } + + /** + * Belt-and-braces L2 normalisation. The pipeline already + * normalises when called with `normalize: true`, but a numerical + * drift of 1e-6 in the squared magnitude is enough to introduce a + * detectable bias in cosine-distance comparisons across language + * implementations — and writing this here means the demo is + * resilient to a future transformers-php release that changes the + * default normalisation semantics. + * + * @param list $vector + * @return list + */ + private function finaliseVector(array $vector): array + { + if (count($vector) !== $this->dim) { + throw new InvalidArgumentException(sprintf( + 'embedder returned %d dims, expected %d', + count($vector), + $this->dim, + )); + } + $sumSq = 0.0; + foreach ($vector as $v) { + $sumSq += $v * $v; + } + if ($sumSq <= 0.0) { + return $vector; + } + $norm = sqrt($sumSq); + // Don't divide if the vector is already a unit vector to + // within float32 precision — avoids needless re-quantisation. 
+ if (abs($norm - 1.0) <= 1e-6) { + return $vector; + } + $out = []; + foreach ($vector as $v) { + $out[] = $v / $norm; + } + return $out; + } +} diff --git a/content/develop/use-cases/semantic-cache/php/src/MockLLM.php b/content/develop/use-cases/semantic-cache/php/src/MockLLM.php new file mode 100644 index 0000000000..081bf00a6c --- /dev/null +++ b/content/develop/use-cases/semantic-cache/php/src/MockLLM.php @@ -0,0 +1,166 @@ +, answer:string}> + */ + private const KNOWLEDGE = [ + [ + 'keywords' => ['return', 'refund', 'exchange'], + 'answer' => + 'You can return any unworn item within 30 days of delivery for a ' + . 'full refund. Start a return from your order page; we email a ' + . 'prepaid label and refund the original payment method within ' + . 'five business days of receiving the item.', + ], + [ + 'keywords' => ['shipping', 'delivery', 'arrive', 'ship'], + 'answer' => + 'Standard shipping is free on orders over $50 and arrives in ' + . 'three to five business days. Expedited two-day shipping is ' + . '$9.99 and is available at checkout for in-stock items.', + ], + [ + 'keywords' => ['size', 'sizing', 'fit'], + 'answer' => + 'We follow standard US sizing. For most styles we recommend ' + . 'ordering your usual size; the product page includes a sizing ' + . 'chart and customer fit notes for items that run small or large.', + ], + [ + 'keywords' => ['warranty', 'guarantee', 'defect', 'broken'], + 'answer' => + 'All gear is covered by a one-year manufacturer warranty against ' + . 'defects in materials or workmanship. Email support with your ' + . 'order number and a photo of the issue and we will replace the ' + . 'item or issue a refund.', + ], + [ + 'keywords' => ['contact', 'support', 'help', 'agent'], + 'answer' => + 'You can reach our support team by email at help@example.com or ' + . 'by live chat from the help centre, 9am to 9pm Eastern, seven ' + . 'days a week. 
Most tickets get a first reply within two hours.', + ], + [ + 'keywords' => ['track', 'tracking', 'order', 'where'], + 'answer' => + 'Your tracking number is on the order confirmation email and on ' + . 'the order detail page once the package has been picked up by ' + . 'the carrier — typically within 24 hours of order placement.', + ], + [ + 'keywords' => ['cancel', 'modify', 'change'], + 'answer' => + 'Orders can be cancelled or modified for up to one hour after ' + . 'placement. After that the order has usually entered our ' + . 'warehouse system; the fastest path is to accept delivery and ' + . 'start a return for any unwanted items.', + ], + [ + 'keywords' => ['discount', 'coupon', 'promo', 'code'], + 'answer' => + 'Active promotional codes are listed on the homepage banner. ' + . 'Codes apply at checkout and cannot be combined; the system ' + . 'automatically uses the larger of the two when more than one ' + . 'would qualify.', + ], + ]; + + public int $callCount = 0; + + public function __construct( + public readonly string $modelVersion = 'gpt-4.5-2026', + public readonly float $latencyMs = 1500.0, + ) { + } + + /** + * Pretend to call a model. Sleeps, then returns a templated answer. + * + * @return array{response:string, model_version:string, latency_ms:float, prompt_tokens:int, completion_tokens:int, total_tokens:int} + */ + public function complete(string $prompt): array + { + $this->callCount++; + $start = hrtime(true); + // Sleep first so the latency is realistic regardless of which + // branch generates the text. `usleep` takes microseconds; the + // multiplier converts the millisecond input. 
+ $sleepUs = (int) round($this->latencyMs * 1000); + if ($sleepUs > 0) { + usleep($sleepUs); + } + $response = self::answerFor($prompt); + $elapsedMs = (hrtime(true) - $start) / 1e6; + + $promptTokens = self::estimateTokens($prompt); + $completionTokens = self::estimateTokens($response); + return [ + 'response' => $response, + 'model_version' => $this->modelVersion, + 'latency_ms' => $elapsedMs, + 'prompt_tokens' => $promptTokens, + 'completion_tokens' => $completionTokens, + 'total_tokens' => $promptTokens + $completionTokens, + ]; + } + + private static function answerFor(string $prompt): string + { + $lower = strtolower($prompt); + foreach (self::KNOWLEDGE as $row) { + foreach ($row['keywords'] as $keyword) { + if (str_contains($lower, $keyword)) { + return $row['answer']; + } + } + } + // Generic fallback — keeps the demo working for queries that + // don't match any FAQ keyword. + return + 'Thanks for the question. Our team would normally answer this ' + . 'individually; in the meantime please check the help centre or ' + . 'contact support@example.com for a faster response.'; + } + + /** + * Rough English token estimate: ~4 characters per token. Real + * tokenizers (BPE, SentencePiece) vary slightly but this is close + * enough for "look how many tokens you saved" demo signage. + */ + public static function estimateTokens(string $text): int + { + if ($text === '') { + return 0; + } + return max(1, (int) floor(strlen($text) / 4)); + } +} diff --git a/content/develop/use-cases/semantic-cache/php/src/RedisSemanticCache.php b/content/develop/use-cases/semantic-cache/php/src/RedisSemanticCache.php new file mode 100644 index 0000000000..ca97871b47 --- /dev/null +++ b/content/develop/use-cases/semantic-cache/php/src/RedisSemanticCache.php @@ -0,0 +1,503 @@ +`. 
The hash + * stores the user's prompt and the corresponding LLM response + * alongside the raw float32 bytes of the prompt's 384-dimensional + * embedding and a small set of metadata fields — tenant, locale, + * model version, and a safety flag. + * + * A single Redis Search index covers the embedding plus every + * metadata field, so one `FT.SEARCH` call does an + * approximate-nearest-neighbour lookup against the cached prompts + * with a TAG pre-filter applied in the same pass — no cross-store + * joins, no extra round trips, and tenant isolation is enforced + * *inside* the query rather than after the fact in application code. + * + * The lookup is thresholded: `FT.SEARCH` always returns the closest + * cached prompt, but the cache only serves it as a hit when the + * cosine distance is at or below `distanceThreshold`. Anything + * further away is treated as a miss; the caller is expected to run + * the underlying LLM and write the new prompt, response, and + * embedding back with `put`. + * + * Each cache entry is written with `EXPIRE`, so stale answers age out + * without manual cleanup; combine with an `allkeys-lfu` eviction + * policy on the database to cap memory under pressure too. + */ +final class RedisSemanticCache +{ + public const VECTOR_DIM_DEFAULT = 384; + + /** + * Characters Redis Search treats as syntax inside a TAG value; + * any of them in a user-supplied filter must be backslash-escaped + * or the surrounding `{...}` block won't parse correctly. 
+ */ + private const TAG_SPECIAL = "\\,.<>{}[]\"':;!@#\$%^&*()-+=~| "; + + public function __construct( + public readonly Client $client, + public readonly string $indexName = 'semcache:idx', + public readonly string $keyPrefix = 'cache:', + public readonly int $vectorDim = self::VECTOR_DIM_DEFAULT, + public readonly float $distanceThreshold = 0.5, + public readonly int $defaultTtlSeconds = 3600, + ) { + } + + // -- Keys ------------------------------------------------------------ + + public function entryKey(string $entryId): string + { + return $this->keyPrefix . $entryId; + } + + // -- Index management ----------------------------------------------- + + /** + * Create the Redis Search index if it doesn't already exist. + * + * One index covers the embedding plus every metadata field, so a + * single `FT.SEARCH` can pre-filter by tenant / locale / model + * and then KNN-rank the matching documents in one pass. The + * `prompt` and `response` fields are stored as TEXT so admin + * tooling can grep the cache by content, but the cache lookup + * itself is vector-only. 
+ */ + public function createIndex(): void + { + $schema = [ + new TextField('prompt'), + new TextField('response'), + new TagField('tenant'), + new TagField('locale'), + new TagField('model_version'), + new TagField('safety'), + new NumericField('created_ts', '', NumericField::SORTABLE), + new NumericField('hit_count', '', NumericField::SORTABLE), + new VectorField( + 'embedding', + 'HNSW', + ['TYPE', 'FLOAT32', 'DIM', $this->vectorDim, 'DISTANCE_METRIC', 'COSINE'], + ), + ]; + + try { + $this->client->ftcreate( + $this->indexName, + $schema, + (new CreateArguments()) + ->on('HASH') + ->prefix([$this->keyPrefix]), + ); + } catch (ServerException $exc) { + if (!str_contains((string) $exc->getMessage(), 'Index already exists')) { + throw $exc; + } + } + } + + public function dropIndex(bool $deleteDocuments = false): void + { + try { + $args = new DropArguments(); + if ($deleteDocuments) { + $args->dd(); + } + $this->client->ftdropindex($this->indexName, $args); + } catch (ServerException $exc) { + $message = strtolower((string) $exc->getMessage()); + if (!str_contains($message, 'no such index') + && !str_contains($message, 'unknown index name') + ) { + throw $exc; + } + } + } + + // -- Lookup --------------------------------------------------------- + + /** + * Find the nearest in-scope cached prompt and decide hit / miss. + * + * `FT.SEARCH` returns the single nearest entry that satisfies the + * TAG pre-filters. The lookup is a hit only if the reported + * cosine distance is at or below `distanceThreshold` (or the + * instance default). Anything further away is a miss with the + * candidate distance attached so the caller can log it. + * + * On a hit, the entry's `hit_count` is incremented inside a + * MULTI/EXEC alongside an `EXPIRE` refresh so a frequently-used + * answer keeps its TTL and the demo UI can see which entries are + * load-bearing. 
+ * + * @param list $queryVec + */ + public function lookup( + array $queryVec, + ?string $tenant = null, + ?string $locale = null, + ?string $modelVersion = null, + ?string $safety = 'ok', + ?float $distanceThreshold = null, + ): CacheHit|CacheMiss { + // Match the shape check that `put` performs. A wrong-dim + // vector would otherwise hit Redis as a malformed FT.SEARCH + // parameter and surface as a server-side parse error instead + // of a clear caller-side InvalidArgumentException. + if (count($queryVec) !== $this->vectorDim) { + throw new InvalidArgumentException(sprintf( + 'queryVec length is %d; index expects %d', + count($queryVec), + $this->vectorDim, + )); + } + + $threshold = $distanceThreshold ?? $this->distanceThreshold; + + $filterClause = self::buildFilterClause( + tenant: $tenant, + locale: $locale, + modelVersion: $modelVersion, + safety: $safety, + ); + $queryStr = $filterClause . '=>[KNN 1 @embedding $vec AS distance]'; + $vecBytes = LocalEmbedder::toBytes($queryVec); + + $arguments = (new SearchArguments()) + ->addReturn(7, 'prompt', 'response', 'tenant', 'locale', 'model_version', 'hit_count', 'distance') + ->sortBy('distance', 'asc') + ->limit(0, 1) + ->dialect('2') + ->params(['vec', $vecBytes]); + + $raw = $this->client->ftsearch($this->indexName, $queryStr, $arguments); + $docs = self::parseSearchResponse($raw); + if ($docs === []) { + return new CacheMiss(null, null); + } + + $doc = $docs[0]; + $rawKey = $doc['__id'] ?? ''; + $entryId = str_starts_with($rawKey, $this->keyPrefix) + ? substr($rawKey, strlen($this->keyPrefix)) + : $rawKey; + $distance = (float) ($doc['distance'] ?? 0.0); + + if ($distance > $threshold) { + return new CacheMiss($distance, $entryId); + } + + // The hash may have expired between FT.SEARCH returning the + // row and us getting here — the search index lags expirations + // by its periodic scan. 
If we just blindly HINCRBY-ed, Redis + // would helpfully recreate the hash with only `hit_count` set + // and the search index would then log it as an indexing + // failure (no embedding, no metadata). EXISTS narrows that + // race to the pipeline round-trip; a strictly race-free + // version would wrap the bump in a Lua script that checks + // existence and acts in one server-side step. + $entryKey = $this->entryKey($entryId); + if ((int) $this->client->exists($entryKey) === 0) { + return new CacheMiss($distance, $entryId); + } + + // MULTI/EXEC the three commands (HINCRBY, EXPIRE, and the TTL + // read-back) so they apply as a unit on the server; a partial + // failure between HINCRBY and EXPIRE would otherwise leave the + // entry without a refreshed TTL. + $ttlSeconds = $this->defaultTtlSeconds; + $replies = $this->client->transaction(function ($tx) use ($entryKey, $ttlSeconds) { + $tx->hincrby($entryKey, 'hit_count', 1); + $tx->expire($entryKey, $ttlSeconds); + $tx->ttl($entryKey); + }); + $newHitCount = (int) ($replies[0] ?? 0); + $ttl = (int) ($replies[2] ?? $ttlSeconds); + + return new CacheHit( + id: $entryId, + prompt: (string) ($doc['prompt'] ?? ''), + response: (string) ($doc['response'] ?? ''), + tenant: (string) ($doc['tenant'] ?? ''), + locale: (string) ($doc['locale'] ?? ''), + modelVersion: (string) ($doc['model_version'] ?? ''), + distance: $distance, + ttlSeconds: $ttl > 0 ? $ttl : $this->defaultTtlSeconds, + hitCount: $newHitCount, + ); + } + + // -- Write ---------------------------------------------------------- + + /** + * Write a new cache entry and return its id. + * + * The embedding is stored as raw little-endian float32 bytes, + * the encoding Redis Search expects from a FLOAT32 vector field + * (`pack('g*', ...)`). `EXPIRE` on the key gives every entry a + * bounded lifetime; combine with an `allkeys-lfu` eviction policy + * on the database to cap memory under pressure too. 
+ * + * @param list $embedding + */ + public function put( + string $prompt, + string $response, + array $embedding, + string $tenant = 'default', + string $locale = 'en', + string $modelVersion = 'gpt-4.5-2026', + string $safety = 'ok', + ?int $ttlSeconds = null, + ?string $entryId = null, + ): string { + if (count($embedding) !== $this->vectorDim) { + throw new InvalidArgumentException(sprintf( + 'embedding length is %d; index expects %d', + count($embedding), + $this->vectorDim, + )); + } + + $entryId ??= self::newEntryId(); + $key = $this->entryKey($entryId); + $ttl = $ttlSeconds ?? $this->defaultTtlSeconds; + + $mapping = [ + 'prompt' => $prompt, + 'response' => $response, + 'tenant' => $tenant, + 'locale' => $locale, + 'model_version' => $modelVersion, + 'safety' => $safety, + 'created_ts' => sprintf('%.6f', microtime(true)), + 'hit_count' => '0', + 'embedding' => LocalEmbedder::toBytes($embedding), + ]; + + // MULTI/EXEC so HSET and EXPIRE either both apply or neither + // does. Without the transaction wrapper a connection drop + // between the two writes could leave the entry without a TTL + // and the cache would then keep an answer past its intended + // lifetime (or forever, on a database with no eviction policy). + // Predis exposes both `hmset` and `hset`, but `HMSET` has been + // deprecated server-side since Redis 4 in favour of the + // variadic `HSET key field value [field value …]`. Flatten the + // associative mapping into the field/value sequence `hset` + // expects to keep the wire command on the supported path. 
+ $hsetArgs = [$key]; + foreach ($mapping as $field => $value) { + $hsetArgs[] = $field; + $hsetArgs[] = $value; + } + $this->client->transaction(function ($tx) use ($hsetArgs, $key, $ttl) { + $tx->hset(...$hsetArgs); + $tx->expire($key, $ttl); + }); + return $entryId; + } + + // -- Filter clause -------------------------------------------------- + + public static function escapeTagValue(string $value): string + { + $out = ''; + $special = self::TAG_SPECIAL; + $len = strlen($value); + for ($i = 0; $i < $len; $i++) { + $ch = $value[$i]; + $out .= (str_contains($special, $ch) ? '\\' . $ch : $ch); + } + return $out; + } + + public static function buildFilterClause( + ?string $tenant, + ?string $locale, + ?string $modelVersion, + ?string $safety, + ): string { + $clauses = []; + if ($tenant !== null && $tenant !== '') { + $clauses[] = '@tenant:{' . self::escapeTagValue($tenant) . '}'; + } + if ($locale !== null && $locale !== '') { + $clauses[] = '@locale:{' . self::escapeTagValue($locale) . '}'; + } + if ($modelVersion !== null && $modelVersion !== '') { + $clauses[] = '@model_version:{' . self::escapeTagValue($modelVersion) . '}'; + } + if ($safety !== null && $safety !== '') { + $clauses[] = '@safety:{' . self::escapeTagValue($safety) . '}'; + } + if ($clauses === []) { + return '(*)'; + } + return '(' . implode(' ', $clauses) . ')'; + } + + // -- Inspection / admin --------------------------------------------- + + /** + * Subset of `FT.INFO` useful for the demo UI. + * + * @return array{num_docs:int,indexing_failures:int,vector_index_size_mb:float} + */ + public function indexInfo(): array + { + try { + $raw = $this->client->ftinfo($this->indexName); + } catch (ServerException) { + return [ + 'num_docs' => 0, + 'indexing_failures' => 0, + 'vector_index_size_mb' => 0.0, + ]; + } + + $info = self::flatPairsToMap($raw); + return [ + 'num_docs' => (int) ($info['num_docs'] ?? 0), + 'indexing_failures' => (int) ($info['hash_indexing_failures'] ?? 
0), + 'vector_index_size_mb' => (float) ($info['vector_index_sz_mb'] ?? 0.0), + ]; + } + + /** + * Return every cached entry (no embedding) for the admin UI. + * + * @return list> + */ + public function listEntries(int $limit = 100): array + { + $arguments = (new SearchArguments()) + ->addReturn(8, 'prompt', 'response', 'tenant', 'locale', 'model_version', 'safety', 'created_ts', 'hit_count') + ->limit(0, $limit) + ->sortBy('created_ts', 'desc'); + + try { + $raw = $this->client->ftsearch($this->indexName, '*', $arguments); + } catch (ServerException) { + return []; + } + $docs = self::parseSearchResponse($raw); + + $out = []; + foreach ($docs as $doc) { + $rawKey = $doc['__id'] ?? ''; + $entryId = str_starts_with($rawKey, $this->keyPrefix) + ? substr($rawKey, strlen($this->keyPrefix)) + : $rawKey; + $ttl = (int) $this->client->ttl($this->entryKey($entryId)); + $out[] = [ + 'id' => $entryId, + 'prompt' => (string) ($doc['prompt'] ?? ''), + 'response' => (string) ($doc['response'] ?? ''), + 'tenant' => (string) ($doc['tenant'] ?? ''), + 'locale' => (string) ($doc['locale'] ?? ''), + 'model_version' => (string) ($doc['model_version'] ?? ''), + 'safety' => (string) ($doc['safety'] ?? ''), + 'hit_count' => (int) ($doc['hit_count'] ?? 0), + 'ttl_seconds' => $ttl > 0 ? $ttl : 0, + 'created_ts' => (float) ($doc['created_ts'] ?? 0.0), + ]; + } + return $out; + } + + public function deleteEntry(string $entryId): bool + { + return ((int) $this->client->del($this->entryKey($entryId))) > 0; + } + + /** + * Drop the index and every cached entry. Returns the number of + * entries that were removed. Used by the demo's "reset" button — + * in production the equivalent is just `FLUSHDB` on a dedicated + * cache database, or letting TTLs expire naturally. 
+ */ + public function clear(): int + { + $before = $this->indexInfo()['num_docs']; + $this->dropIndex(deleteDocuments: true); + $this->createIndex(); + return $before; + } + + // -- Helpers -------------------------------------------------------- + + /** + * FT.SEARCH returns: [count, key1, [field, value, field, value, ...], key2, [...], ...] + * + * @param mixed $raw + * @return list> + */ + private static function parseSearchResponse(mixed $raw): array + { + if (!is_array($raw) || $raw === []) { + return []; + } + $docs = []; + $count = count($raw); + for ($i = 1; $i + 1 < $count; $i += 2) { + $key = $raw[$i]; + $fields = $raw[$i + 1]; + if (!is_array($fields)) { + continue; + } + $doc = ['__id' => (string) $key]; + $fieldCount = count($fields); + for ($j = 0; $j + 1 < $fieldCount; $j += 2) { + $doc[(string) $fields[$j]] = $fields[$j + 1]; + } + $docs[] = $doc; + } + return $docs; + } + + /** + * Convert a flat RESP key-value array (`[k1, v1, k2, v2, ...]`) + * into an associative map for the `FT.INFO` reply. + * + * @param mixed $raw + * @return array + */ + private static function flatPairsToMap(mixed $raw): array + { + if (!is_array($raw)) { + return []; + } + $map = []; + $count = count($raw); + for ($i = 0; $i + 1 < $count; $i += 2) { + $map[(string) $raw[$i]] = $raw[$i + 1]; + } + return $map; + } + + private static function newEntryId(): string + { + // 12 lowercase hex characters — collision space is 16^12, big + // enough for the demo but compact in the UI table. 
+ $bytes = random_bytes(6); + return bin2hex($bytes); + } +} diff --git a/content/develop/use-cases/semantic-cache/php/src/SeedCache.php b/content/develop/use-cases/semantic-cache/php/src/SeedCache.php new file mode 100644 index 0000000000..0fccccb322 --- /dev/null +++ b/content/develop/use-cases/semantic-cache/php/src/SeedCache.php @@ -0,0 +1,114 @@ + + */ + public const ENTRIES = [ + [ + 'prompt' => 'What is your return policy?', + 'response' => + 'You can return any unworn item within 30 days of delivery for ' + . 'a full refund. Start a return from your order page; we email ' + . 'a prepaid label and refund the original payment method within ' + . 'five business days of receiving the item.', + ], + [ + 'prompt' => 'How long does shipping take?', + 'response' => + 'Standard shipping is free on orders over $50 and arrives in ' + . 'three to five business days. Expedited two-day shipping is ' + . '$9.99 and is available at checkout for in-stock items.', + ], + [ + 'prompt' => 'How do I find my size?', + 'response' => + 'We follow standard US sizing. For most styles we recommend ' + . 'ordering your usual size; the product page includes a sizing ' + . 'chart and customer fit notes for items that run small or ' + . 'large.', + ], + [ + 'prompt' => 'Is there a warranty on your products?', + 'response' => + 'All gear is covered by a one-year manufacturer warranty ' + . 'against defects in materials or workmanship. Email support ' + . 'with your order number and a photo of the issue and we will ' + . 'replace the item or issue a refund.', + ], + [ + 'prompt' => 'How can I contact customer support?', + 'response' => + 'You can reach our support team by email at help@example.com ' + . 'or by live chat from the help centre, 9am to 9pm Eastern, ' + . 'seven days a week. Most tickets get a first reply within two ' + . 'hours.', + ], + [ + 'prompt' => 'Where is my order?', + 'response' => + 'Your tracking number is on the order confirmation email and ' + . 
'on the order detail page once the package has been picked up ' + . 'by the carrier — typically within 24 hours of order ' + . 'placement.', + ], + ]; + + /** + * Embed and write the seed entries into the cache. Returns the + * number of entries seeded. + * + * The seeds are embedded one prompt at a time even though + * `encodeMany` would be one pipeline call cheaper. transformers-php + * pads variable-length inputs inside a batch and the mean-pooling + * step then attends to the padding-masked positions slightly + * differently than it does in single-input mode. The numerical + * drift is small (~0.01 cosine distance) but it would make + * `What is your return policy?` register as ~0.012 away from its + * own seed entry instead of effectively zero, which would muddy + * the "exact match → distance ≈ 0" claim the docs make. Encoding + * one-at-a-time keeps the seed vectors bitwise-identical to what + * the lookup path produces for the same prompt. + */ + public static function seed( + RedisSemanticCache $cache, + LocalEmbedder $embedder, + string $tenant = 'acme', + string $locale = 'en', + string $modelVersion = 'gpt-4.5-2026', + ): int { + foreach (self::ENTRIES as $entry) { + $vector = $embedder->encodeOne($entry['prompt']); + $cache->put( + prompt: $entry['prompt'], + response: $entry['response'], + embedding: $vector, + tenant: $tenant, + locale: $locale, + modelVersion: $modelVersion, + ); + } + return count(self::ENTRIES); + } +} diff --git a/content/develop/use-cases/semantic-cache/redis-py/_index.md b/content/develop/use-cases/semantic-cache/redis-py/_index.md new file mode 100644 index 0000000000..993e407d7d --- /dev/null +++ b/content/develop/use-cases/semantic-cache/redis-py/_index.md @@ -0,0 +1,262 @@ +--- +categories: +- docs +- develop +- stack +- oss +- rs +- rc +description: Build a Redis-backed semantic cache for LLM responses in Python with redis-py and sentence-transformers +linkTitle: redis-py example (Python) +title: Redis semantic cache with 
redis-py +weight: 1 +--- + +This guide shows you how to build a small Redis-backed semantic cache for LLM responses in Python with [`redis-py`]({{< relref "/develop/clients/redis-py" >}}) and the [`sentence-transformers`](https://www.sbert.net/) library. It includes a local web server built with the Python standard library so you can send paraphrased prompts at a mock LLM, watch the cache decide hit or miss, sweep the cosine-distance threshold, and see the cumulative latency and token savings build up. + +## Overview + +Each cache entry is stored as a single Redis [Hash]({{< relref "/develop/data-types/hashes" >}}) at `cache:`. The hash holds the original prompt, the LLM's response, the raw `float32` bytes of a 384-dimensional embedding of the prompt, and metadata fields — tenant, locale, model version, safety flag — plus a `created_ts` and a `hit_count`. A single [Redis Search]({{< relref "/develop/ai/search-and-query" >}}) index covers the embedding field and every metadata field, so one [`FT.SEARCH`]({{< relref "/commands/ft.search" >}}) call with a `KNN` clause does the vector lookup *and* the TAG pre-filter in the same round trip — no cross-store joins. + +The lookup is thresholded: [`FT.SEARCH`]({{< relref "/commands/ft.search" >}}) always returns the nearest entry that satisfies the filters, but the application only serves it as a hit when the reported cosine distance is at or below `distance_threshold`. Anything further away is treated as a miss; the caller runs the LLM and writes the new prompt, response, and embedding back to the same key pattern with a TTL. + +That gives you: + +* A single round trip for lookup — vector KNN + metadata pre-filter in one [`FT.SEARCH`]({{< relref "/commands/ft.search" >}}). +* Tens of milliseconds on a hit vs. a multi-second LLM call on a miss; the embedding step is the bottleneck either way, and that's a model-side cost, not a Redis one. 
+* Tenant, locale, and model-version isolation enforced inside the query, not in application code — a write under one tenant cannot be served to another. +* Bounded memory: every entry has an [`EXPIRE`]({{< relref "/commands/expire" >}}) TTL, and a database-level [eviction policy]({{< relref "/develop/reference/eviction" >}}) (LRU / LFU) caps the cache size under pressure. + +## How it works + +A query goes through three stages: **embed**, **lookup**, and (on a miss) **call the LLM and write back**. + +### Hit path (the goal) + +1. The application calls `embedder.encode_one(prompt)` to turn the incoming text into a 384-dimensional `float32` vector. +2. `cache.lookup(query_vec, tenant=..., locale=..., model_version=...)` runs [`FT.SEARCH`]({{< relref "/commands/ft.search" >}}) with a TAG pre-filter and a `KNN 1` clause. Redis returns the closest cached prompt that satisfies the filters along with its cosine distance. +3. If the distance is at or below the threshold, the cache returns a `CacheHit` containing the cached response. The helper also pipelines an [`HINCRBY`]({{< relref "/commands/hincrby" >}}) on `hit_count` and an [`EXPIRE`]({{< relref "/commands/expire" >}}) refresh, so a frequently used answer keeps its TTL and the demo UI can see which entries are load-bearing. +4. The LLM is not called at all. The application returns the cached response to the user. + +### Miss path + +When the distance is above the threshold — or there is no candidate in scope at all — the helper returns a `CacheMiss` instead, carrying the distance of the nearest candidate (if any) for logging. The application then: + +1. Calls the LLM with the prompt. +2. Calls `cache.put(prompt, response, embedding, tenant=..., locale=..., model_version=...)`. The same embedding the lookup used is reused — no re-encode. The helper writes the Hash with [`HSET`]({{< relref "/commands/hset" >}}) and an [`EXPIRE`]({{< relref "/commands/expire" >}}) TTL in a pipeline. +3. 
Returns the LLM's response to the user. The next semantically similar prompt under the same metadata scope will be a hit. + +## The cache helper + +The `RedisSemanticCache` class wraps the Redis Search index and the lookup / write flow +([source](https://github.com/redis/docs/blob/main/content/develop/use-cases/semantic-cache/redis-py/cache.py)): + +```python +import redis +from cache import RedisSemanticCache, CacheHit, CacheMiss +from embeddings import LocalEmbedder + +# Use decode_responses=False because the embedding field is raw bytes; +# the helper decodes text fields explicitly where it needs them. +r = redis.Redis(host="localhost", port=6379, decode_responses=False) +cache = RedisSemanticCache( + redis_client=r, + index_name="semcache:idx", + distance_threshold=0.5, # cosine distance, lower = stricter + default_ttl_seconds=3600, # one hour +) +embedder = LocalEmbedder() # sentence-transformers/all-MiniLM-L6-v2 + +# One-time index setup (idempotent). +cache.create_index() + +# 1) Embed the prompt. +prompt = "How do I return an item?" +query_vec = embedder.encode_one(prompt) + +# 2) Look up under a metadata scope. The TAG filter and the KNN +# travel together in one FT.SEARCH. +result = cache.lookup( + query_vec, + tenant="acme", + locale="en", + model_version="gpt-4.5-2026", +) + +if isinstance(result, CacheHit): + response = result.response + print(f"hit ({result.distance:.3f}): {response}") +else: + # 3a) Miss — call the LLM. (Use your real client here.) + response = call_llm(prompt) + + # 3b) Cache the new entry. Reuses the same embedding bytes the + # lookup used, so we don't pay the encoder twice. + cache.put( + prompt=prompt, + response=response, + embedding=query_vec, + tenant="acme", + locale="en", + model_version="gpt-4.5-2026", + ) +``` + +### Data model + +Each cache entry is one Redis Hash. The vector field is raw little-endian `float32` bytes — no JSON wrapping — because the Redis Search vector encoding expects exactly that. 
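As a rough illustration of that byte layout, the sketch below packs a vector into little-endian `float32` bytes with the standard-library `struct` module and reads it back. The helper names `to_float32_bytes` and `from_float32_bytes` are hypothetical and exist only for this example; the real encoding lives in the embedder and cache helper, which may do it differently (for instance via NumPy).

```python
import struct


def to_float32_bytes(vector):
    """Pack floats as little-endian float32, 4 bytes per component."""
    return struct.pack(f"<{len(vector)}f", *vector)


def from_float32_bytes(blob):
    """Inverse of to_float32_bytes; rejects blobs that are not 4-byte aligned."""
    if len(blob) % 4 != 0:
        raise ValueError("blob length must be a multiple of 4 bytes")
    return list(struct.unpack(f"<{len(blob) // 4}f", blob))


# Values chosen to be exactly representable in float32, so the
# round trip is lossless. Real embeddings lose a little precision
# when narrowed from Python's 64-bit floats.
vec = [0.5, -0.25, 1.0]
blob = to_float32_bytes(vec)
restored = from_float32_bytes(blob)
```

A 384-dimensional vector packed this way is 384 × 4 = 1536 bytes, which is the size of the `embedding` field in each hash.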
+ +```text +cache:7c3f8a1b9e02 + prompt=How do I return an item? + response=You can return any unworn item within 30 days... + tenant=acme + locale=en + model_version=gpt-4.5-2026 + safety=ok + created_ts=1715990400.123 + hit_count=4 + embedding=<384 × float32 little-endian bytes> +``` + +The Redis Search index schema treats every field as queryable in its natural type: + +```text +FT.CREATE semcache:idx + ON HASH PREFIX 1 cache: + SCHEMA + prompt TEXT + response TEXT + tenant TAG + locale TAG + model_version TAG + safety TAG + created_ts NUMERIC SORTABLE + hit_count NUMERIC SORTABLE + embedding VECTOR HNSW 6 TYPE FLOAT32 DIM 384 DISTANCE_METRIC COSINE +``` + +The `prompt` and `response` TEXT fields aren't used by the cache lookup itself — that's vector-only — but they make it possible to grep the cache by content from `redis-cli` for debugging or admin tooling. + +### The query + +The lookup is a hybrid query: a TAG pre-filter expression in parentheses, then `=>[KNN 1 @embedding $vec]`. With `DIALECT 2`, Redis applies the filter first and KNN-ranks only the matching documents. + +```text +FT.SEARCH semcache:idx + "(@tenant:{acme} @locale:{en} @model_version:{gpt\-4\.5\-2026} @safety:{ok}) + =>[KNN 1 @embedding $vec AS distance]" + PARAMS 2 vec <384-float32-bytes> + SORTBY distance + RETURN 7 prompt response tenant locale model_version hit_count distance + DIALECT 2 +``` + +`distance` is the cosine *distance* (0 means identical, 2 means opposite). The result is sorted ascending, so the top row is the closest candidate. The application inspects `distance` against the threshold and decides hit or miss in user code — Redis returns the row either way, and treating it as a hit or a miss is a policy decision the cache helper owns, not a server-side filter. 
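The filter construction and the hit-or-miss policy described above can be sketched in a few lines. This is an assumption-laden illustration, not the code in `cache.py`: the function names are hypothetical, and the set of escaped characters mirrors the `TAG_SPECIAL` constant in the PHP port from this same change, which the Python helper may or may not match exactly.

```python
# Characters escaped inside a TAG value. Assumption: copied from the
# PHP port's TAG_SPECIAL constant; note that it includes the backslash.
TAG_SPECIAL = set("\\,.<>{}[]\"':;!@#$%^&*()-+=~| ")


def escape_tag_value(value):
    """Backslash-escape every character Redis Search treats as TAG syntax."""
    return "".join("\\" + ch if ch in TAG_SPECIAL else ch for ch in value)


def build_filter_clause(**tags):
    """Join non-empty TAG filters; fall back to (*) when none are set."""
    clauses = [
        "@%s:{%s}" % (field, escape_tag_value(value))
        for field, value in tags.items()
        if value
    ]
    return "(" + " ".join(clauses) + ")" if clauses else "(*)"


def is_hit(distance, threshold=0.5):
    """Client-side policy: serve the row only at or below the threshold."""
    return distance <= threshold


query = (
    build_filter_clause(tenant="acme", model_version="gpt-4.5-2026")
    + "=>[KNN 1 @embedding $vec AS distance]"
)
```

Keeping `is_hit` on the client is what lets the demo sweep the threshold against the same stored entries without touching the index.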
+ +## The mock LLM + +To make the latency and token savings visible without requiring an API key, `mock_llm.py` provides a deterministic stand-in +([source](https://github.com/redis/docs/blob/main/content/develop/use-cases/semantic-cache/redis-py/mock_llm.py)): + +```python +from mock_llm import MockLLM + +llm = MockLLM(latency_ms=1500.0) # one and a half seconds per call +response = llm.complete("What is your return policy?") +# response.response — the templated answer text +# response.latency_ms — wall-clock time the call took +# response.total_tokens — estimated prompt + completion tokens +``` + +The mock sleeps for the configured latency, then keyword-matches against a small FAQ table to produce an answer. The deliberate slowness is what makes a hit visibly cheaper than a miss in the demo. In production code, you would replace `MockLLM` with your real client of choice — OpenAI, Anthropic, Bedrock, vLLM, Ollama, anything — without changing the cache helper. + +## Pre-seeding the cache + +In a real deployment the cache fills up organically: a first-time question is a miss, the LLM answers, and the response is written back. For the demo, `seed_cache.py` pre-loads a small set of canonical FAQ prompts so the very first query lands on a hit +([source](https://github.com/redis/docs/blob/main/content/develop/use-cases/semantic-cache/redis-py/seed_cache.py)): + +```python +from seed_cache import seed +from cache import RedisSemanticCache +from embeddings import LocalEmbedder + +cache = RedisSemanticCache() +embedder = LocalEmbedder() +cache.create_index() +seed(cache, embedder, tenant="acme", locale="en") +``` + +The seed list stores the canonical phrasing of each question ("What is your return policy?"). Paraphrases of any of these prompts ("How do I return an item?", "Can I get a refund?") embed close to the canonical entry, so the cache lookup serves the stored response without ever calling the model. 
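"Close" here means a small cosine distance between the paraphrase's embedding and the seed's embedding. The sketch below computes that distance in plain Python as one minus the cosine similarity. The function name `cosine_distance` and the two-dimensional toy vectors are made up for illustration; they are not model output, and the demo itself relies on Redis to compute the distance inside `FT.SEARCH`.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity: 0 identical, 1 orthogonal, 2 opposite."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)


# Toy vectors only: same direction, perpendicular, and opposite.
same = cosine_distance([1.0, 0.0], [2.0, 0.0])
orthogonal = cosine_distance([1.0, 0.0], [0.0, 3.0])
opposite = cosine_distance([1.0, 0.0], [-1.0, 0.0])
```

Because the measure ignores vector length, only the direction of the embedding matters, which is why a paraphrase with different wording can still land near its seed.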
+ +## The interactive demo + +`demo_server.py` runs a ThreadingHTTPServer. The HTML page lets you: + +* Type a prompt and toggle metadata: tenant, locale, model version. Each combination is a separate cache namespace inside the same index. +* Slide the cosine-distance threshold and see hits flip to misses (and back) on the same prompt, with the actual distance reported on each query. +* Submit with **Ask** to run the full hit-or-miss path (calls the LLM on a miss, writes the answer back). Submit with **Lookup only (no LLM)** to sweep the threshold against a fixed prompt without polluting the cache. +* Watch the cumulative panel build up: total queries, cache hits, cache misses, hit ratio, tokens not spent, LLM seconds not waited. +* Inspect every cached entry, including remaining TTL and total hit count, and drop individual entries to simulate eviction. + +The server holds one `LocalEmbedder`, one `RedisSemanticCache`, and one `MockLLM` for the lifetime of the process. Endpoints: + +| Endpoint | What it does | +|-----------------|-------------------------------------------------------------------------------| +| `GET /state` | Index info and the full list of cached entries. | +| `POST /query` | Embed the prompt, run `FT.SEARCH`, on miss call the LLM and write back. | +| `POST /reset` | Drop every cached entry and re-seed from the FAQ list. | +| `POST /drop` | Delete a single cached entry by id. | + +## Run the demo locally + +1. Clone the [`redis/docs`](https://github.com/redis/docs) repository and change into the example + directory: + + ```bash + git clone https://github.com/redis/docs.git + cd docs/content/develop/use-cases/semantic-cache/redis-py + ``` + +2. Install the dependencies: + + ```bash + pip install redis sentence-transformers numpy + ``` + +3. Make sure a Redis instance with the Redis Search module is running locally on + port 6379. 
[Redis Stack]({{< relref "/operate/oss_and_stack/install/install-stack" >}}) or + [Redis 8 with Search]({{< relref "/develop/ai/search-and-query" >}}) both work. + +4. Start the demo server. The first run downloads the `all-MiniLM-L6-v2` model + (~80 MB) into the local Hugging Face cache: + + ```bash + python demo_server.py + ``` + +5. Open http://localhost:8085 in a browser and try some queries: + + * **"What is your return policy?"**: exact match against the seed, distance ≈ 0, + hit at any threshold. + * **"How fast is delivery?"**: paraphrase of the shipping seed; distance + around 0.30, hit at the default threshold of 0.5. + * **"How do I return an item?"**: slightly looser paraphrase of the returns + seed; distance around 0.49, still a hit at the default threshold. Slide + the threshold down to 0.4 to see this one flip to a miss. + * **"What payment methods do you accept?"**: unrelated to anything in the + seed; distance > 0.8, so you'll see a miss, the mock LLM kicks in for + ~1.5 s, the new answer is cached, and a follow-up of the same question + is now an immediate hit. + * Switch the **Tenant** dropdown to `globex` or `initech` and re-ask any + seeded question; the result flips to a miss because the cache entries + live under `acme`. That's the metadata pre-filter at work inside `FT.SEARCH`. + + `all-MiniLM-L6-v2` puts FAQ-style paraphrases in the 0.3–0.5 cosine-distance + range and unrelated queries above 0.8, which is what motivates the 0.5 + default. A stricter embedding model (or a domain-tuned one) would let you + drop the threshold further; a noisier one would push it up. The right + threshold is always a function of the model, the corpus, and how + conservative the application needs to be about reuse. + +The server reads from and writes to your local Redis. The default index name is `semcache:idx` and entry keys live under `cache:`.
Pass `--no-reset` to keep an existing cache across restarts, `--threshold` to change the default cosine-distance cutoff, or `--llm-latency-ms` to make the mock LLM faster or slower for the demo. diff --git a/content/develop/use-cases/semantic-cache/redis-py/cache.py b/content/develop/use-cases/semantic-cache/redis-py/cache.py new file mode 100644 index 0000000000..8f848dce03 --- /dev/null +++ b/content/develop/use-cases/semantic-cache/redis-py/cache.py @@ -0,0 +1,484 @@ +""" +Redis semantic-cache helper backed by Redis Search. + +Each cache entry lives as a Hash document at ``cache:<id>``. The hash +stores the user's prompt and the corresponding LLM response alongside +the raw float32 bytes of the prompt's 384-dimensional embedding and a +small set of metadata fields: tenant, locale, model version, and a +safety flag. + +A single Redis Search index covers the embedding plus every metadata +field, so one ``FT.SEARCH`` call does an approximate-nearest-neighbour +lookup against the cached prompts with a TAG pre-filter applied in the +same pass: no cross-store joins, no extra round trips, and tenant +isolation is enforced *inside* the query rather than after the fact in +application code. + +The lookup is thresholded: ``FT.SEARCH`` returns the closest in-scope +cached prompt, if any, but the cache only serves it as a hit when the cosine +distance is at or below ``distance_threshold``. Anything further away +is treated as a miss; the caller is expected to run the underlying LLM +and write the new prompt, response, and embedding back with ``put``. + +Each cache entry is written with ``EXPIRE``, so stale answers age out +without manual cleanup; setting an eviction policy on the database +(``allkeys-lfu`` is the common choice) caps memory under pressure. +This helper expects a ``redis.Redis`` client constructed with +``decode_responses=False`` because the embedding field is binary; +mixing UTF-8 decoding into a binary path corrupts vectors.
Text +fields are decoded explicitly where needed. +""" + +from __future__ import annotations + +import time +import uuid +from dataclasses import dataclass +from typing import Optional + +import numpy as np +import redis +from redis.commands.search.field import ( + NumericField, + TagField, + TextField, + VectorField, +) +from redis.commands.search.index_definition import IndexDefinition, IndexType +from redis.commands.search.query import Query + + +VECTOR_DIM_DEFAULT = 384 + + +@dataclass +class CacheHit: + """A cache lookup that returned a cached response. + + ``distance`` is the cosine distance ``FT.SEARCH`` reported for the + nearest cached prompt (0 = identical, 2 = opposite). It is always + at or below the threshold the lookup was run with. + """ + + id: str + prompt: str + response: str + tenant: str + locale: str + model_version: str + distance: float + ttl_seconds: int + hit_count: int + + def to_dict(self) -> dict: + return { + "id": self.id, + "prompt": self.prompt, + "response": self.response, + "tenant": self.tenant, + "locale": self.locale, + "model_version": self.model_version, + "distance": round(self.distance, 4), + "ttl_seconds": self.ttl_seconds, + "hit_count": self.hit_count, + } + + +@dataclass +class CacheMiss: + """A cache lookup that did not return a usable response. + + ``nearest_distance`` is the cosine distance to the closest cached + prompt that *did* match the metadata filters. It is ``None`` if the + cache had no entry in scope at all, which is what the demo UI + shows as "no candidate" vs. "candidate too far". 
+ """ + + nearest_distance: Optional[float] + nearest_id: Optional[str] + + def to_dict(self) -> dict: + return { + "nearest_distance": + round(self.nearest_distance, 4) + if self.nearest_distance is not None else None, + "nearest_id": self.nearest_id, + } + + +class RedisSemanticCache: + """Index, look up, and write a semantic cache of LLM responses.""" + + def __init__( + self, + redis_client: Optional[redis.Redis] = None, + index_name: str = "semcache:idx", + key_prefix: str = "cache:", + vector_dim: int = VECTOR_DIM_DEFAULT, + distance_threshold: float = 0.5, + default_ttl_seconds: int = 3600, + ) -> None: + self.redis = redis_client or redis.Redis( + host="localhost", + port=6379, + decode_responses=False, + ) + self.index_name = index_name + self.key_prefix = key_prefix + self.vector_dim = vector_dim + self.distance_threshold = distance_threshold + self.default_ttl_seconds = default_ttl_seconds + + # ------------------------------------------------------------------ + # Keys + # ------------------------------------------------------------------ + + def entry_key(self, entry_id: str) -> str: + return f"{self.key_prefix}{entry_id}" + + # ------------------------------------------------------------------ + # Index management + # ------------------------------------------------------------------ + + def create_index(self) -> None: + """Create the Redis Search index if it doesn't already exist. + + One index covers the embedding plus every metadata field, so a + single ``FT.SEARCH`` can pre-filter by tenant / locale / model + and then KNN-rank the matching documents in one pass. The + ``prompt`` and ``response`` fields are stored as TEXT so the + admin tooling can grep the cache by content, but the cache + lookup itself is vector-only. 
+ """ + schema = ( + TextField("prompt"), + TextField("response"), + TagField("tenant"), + TagField("locale"), + TagField("model_version"), + TagField("safety"), + NumericField("created_ts", sortable=True), + NumericField("hit_count", sortable=True), + VectorField( + "embedding", + "HNSW", + { + "TYPE": "FLOAT32", + "DIM": self.vector_dim, + "DISTANCE_METRIC": "COSINE", + }, + ), + ) + definition = IndexDefinition( + prefix=[self.key_prefix], index_type=IndexType.HASH, + ) + try: + self.redis.ft(self.index_name).create_index( + fields=schema, definition=definition, + ) + except redis.ResponseError as exc: + if "Index already exists" not in str(exc): + raise + + def drop_index(self, delete_documents: bool = False) -> None: + """Drop the search index. Optionally also delete cached entries.""" + try: + self.redis.ft(self.index_name).dropindex( + delete_documents=delete_documents, + ) + except redis.ResponseError as exc: + message = str(exc).lower() + if "no such index" not in message \ + and "unknown index name" not in message: + raise + + # ------------------------------------------------------------------ + # Lookup + # ------------------------------------------------------------------ + + def lookup( + self, + query_vec: np.ndarray, + tenant: str | None = None, + locale: str | None = None, + model_version: str | None = None, + safety: str | None = "ok", + distance_threshold: float | None = None, + ) -> CacheHit | CacheMiss: + """Find the nearest in-scope cached prompt and decide hit / miss. + + ``FT.SEARCH`` returns the single nearest entry that satisfies + the TAG pre-filters. The lookup is a hit only if the reported + cosine distance is at or below ``distance_threshold`` (or the + instance default). Anything further away is a miss with the + candidate distance attached so the caller can log it. + + On a hit, the entry's ``hit_count`` is incremented atomically + with ``HINCRBY`` so the demo UI can show which entries are + load-bearing. 
The TTL is refreshed on every hit so frequently + used answers don't age out under cold tail entries. + """ + threshold = ( + distance_threshold if distance_threshold is not None + else self.distance_threshold + ) + + # Match the shape check that ``put`` performs. A wrong-dim + # vector would otherwise hit Redis as a malformed FT.SEARCH + # parameter and surface as a server-side parse error instead + # of a clear caller-side ValueError. + if query_vec.shape != (self.vector_dim,): + raise ValueError( + f"query_vec has shape {query_vec.shape}; " + f"index expects ({self.vector_dim},)" + ) + + filter_clause = self._build_filter_clause( + tenant=tenant, + locale=locale, + model_version=model_version, + safety=safety, + ) + knn_query = f"{filter_clause}=>[KNN 1 @embedding $vec AS distance]" + q = ( + Query(knn_query) + .sort_by("distance") + .return_fields( + "prompt", "response", "tenant", "locale", + "model_version", "hit_count", "distance", + ) + .paging(0, 1) + .dialect(2) + ) + result = self.redis.ft(self.index_name).search( + q, query_params={"vec": query_vec.astype(np.float32).tobytes()}, + ) + + if not result.docs: + return CacheMiss(nearest_distance=None, nearest_id=None) + + doc = result.docs[0] + raw_key = _decode(getattr(doc, "id", "")) + entry_id = ( + raw_key[len(self.key_prefix):] if raw_key.startswith(self.key_prefix) + else raw_key + ) + distance = float(_decode(getattr(doc, "distance", "0")) or 0) + + if distance > threshold: + return CacheMiss(nearest_distance=distance, nearest_id=entry_id) + + # The hash may have expired between FT.SEARCH returning the + # row and us getting here — the search index lags expirations + # by its periodic scan. If we just blindly ``HINCRBY``-ed, + # Redis would helpfully recreate the hash with only + # ``hit_count`` set and the search index would then log it as + # an indexing failure (no embedding, no metadata). 
EXISTS + # narrows that race to the pipeline round-trip; a strictly + # race-free version would wrap the bump in a Lua script that + # checks existence and acts in one server-side step. + entry_key = self.entry_key(entry_id) + if not self.redis.exists(entry_key): + return CacheMiss(nearest_distance=distance, nearest_id=entry_id) + + # MULTI/EXEC the three commands (two writes, one TTL read) so they apply as a unit on the + # server; a partial failure between HINCRBY and EXPIRE would + # otherwise leave the entry without a refreshed TTL. + pipe = self.redis.pipeline(transaction=True) + pipe.hincrby(entry_key, "hit_count", 1) + pipe.expire(entry_key, self.default_ttl_seconds) + pipe.ttl(entry_key) + new_hit_count, _expired, ttl = pipe.execute() + + return CacheHit( + id=entry_id, + prompt=_decode(getattr(doc, "prompt", "")), + response=_decode(getattr(doc, "response", "")), + tenant=_decode(getattr(doc, "tenant", "")), + locale=_decode(getattr(doc, "locale", "")), + model_version=_decode(getattr(doc, "model_version", "")), + distance=distance, + ttl_seconds=int(ttl) if ttl and ttl > 0 else self.default_ttl_seconds, + hit_count=int(new_hit_count), + ) + + # ------------------------------------------------------------------ + # Write + # ------------------------------------------------------------------ + + def put( + self, + prompt: str, + response: str, + embedding: np.ndarray, + tenant: str = "default", + locale: str = "en", + model_version: str = "gpt-4.5-2026", + safety: str = "ok", + ttl_seconds: int | None = None, + entry_id: str | None = None, + ) -> str: + """Write a new cache entry and return its id. + + The embedding is stored as raw little-endian float32 bytes, + the encoding Redis Search expects from a FLOAT32 vector field. + ``EXPIRE`` on the key gives every entry a bounded lifetime; + combine with an ``allkeys-lfu`` eviction policy on the database + to cap memory under pressure too.
+ """ + if embedding.shape != (self.vector_dim,): + raise ValueError( + f"embedding has shape {embedding.shape}; " + f"index expects ({self.vector_dim},)" + ) + if embedding.dtype != np.float32: + embedding = embedding.astype(np.float32, copy=False) + + entry_id = entry_id or uuid.uuid4().hex[:12] + key = self.entry_key(entry_id) + ttl = ttl_seconds if ttl_seconds is not None else self.default_ttl_seconds + + mapping = { + "prompt": prompt, + "response": response, + "tenant": tenant, + "locale": locale, + "model_version": model_version, + "safety": safety, + "created_ts": str(time.time()), + "hit_count": "0", + "embedding": embedding.tobytes(), + } + # MULTI/EXEC so HSET and EXPIRE either both apply or neither + # does. Without the transaction wrapper a connection drop + # between the two writes could leave the entry without a TTL + # and the cache would then keep an answer past its intended + # lifetime (or forever, on a database with no eviction policy). + pipe = self.redis.pipeline(transaction=True) + pipe.hset(key, mapping=mapping) + pipe.expire(key, ttl) + pipe.execute() + return entry_id + + # ------------------------------------------------------------------ + # Filter clause + # ------------------------------------------------------------------ + + # Characters Redis Search treats as syntax inside a TAG value; any + # of them in a user-supplied filter must be backslash-escaped or + # the surrounding ``{...}`` block won't parse correctly. 
+ _TAG_SPECIAL = set("\\,.<>{}[]\"':;!@#$%^&*()-+=~| ") + + @classmethod + def _escape_tag_value(cls, value: str) -> str: + return "".join( + "\\" + ch if ch in cls._TAG_SPECIAL else ch for ch in value + ) + + @classmethod + def _build_filter_clause( + cls, + *, + tenant: str | None, + locale: str | None, + model_version: str | None, + safety: str | None, + ) -> str: + clauses: list[str] = [] + if tenant: + clauses.append(f"@tenant:{{{cls._escape_tag_value(tenant)}}}") + if locale: + clauses.append(f"@locale:{{{cls._escape_tag_value(locale)}}}") + if model_version: + clauses.append( + f"@model_version:{{{cls._escape_tag_value(model_version)}}}" + ) + if safety: + clauses.append(f"@safety:{{{cls._escape_tag_value(safety)}}}") + return "(" + " ".join(clauses) + ")" if clauses else "(*)" + + # ------------------------------------------------------------------ + # Inspection / admin + # ------------------------------------------------------------------ + + def index_info(self) -> dict: + """Subset of ``FT.INFO`` useful for the demo UI.""" + try: + info = self.redis.ft(self.index_name).info() + except redis.ResponseError: + return {"num_docs": 0, "indexing_failures": 0, + "vector_index_size_mb": 0.0} + return { + "num_docs": int(info.get("num_docs", 0)), + "indexing_failures": + int(info.get("hash_indexing_failures", 0)), + "vector_index_size_mb": _safe_mb(info), + } + + def list_entries(self, limit: int = 100) -> list[dict]: + """Return every cached entry (no embedding) for the admin UI.""" + q = ( + Query("*") + .return_fields( + "prompt", "response", "tenant", "locale", + "model_version", "safety", "created_ts", "hit_count", + ) + .paging(0, limit) + .sort_by("created_ts", asc=False) + ) + result = self.redis.ft(self.index_name).search(q) + out: list[dict] = [] + for doc in result.docs: + raw_key = _decode(getattr(doc, "id", "")) + entry_id = ( + raw_key[len(self.key_prefix):] + if raw_key.startswith(self.key_prefix) else raw_key + ) + ttl = 
self.redis.ttl(self.entry_key(entry_id)) + out.append({ + "id": entry_id, + "prompt": _decode(getattr(doc, "prompt", "")), + "response": _decode(getattr(doc, "response", "")), + "tenant": _decode(getattr(doc, "tenant", "")), + "locale": _decode(getattr(doc, "locale", "")), + "model_version": _decode(getattr(doc, "model_version", "")), + "safety": _decode(getattr(doc, "safety", "")), + "hit_count": + int(_decode(getattr(doc, "hit_count", "0")) or 0), + "ttl_seconds": int(ttl) if ttl and ttl > 0 else 0, + "created_ts": + float(_decode(getattr(doc, "created_ts", "0")) or 0), + }) + return out + + def delete_entry(self, entry_id: str) -> bool: + """Drop a single entry. Returns ``True`` if the key existed.""" + return bool(self.redis.delete(self.entry_key(entry_id))) + + def clear(self) -> int: + """Drop the index and every cached entry. + + Returns the number of entries that were removed. Used by the + demo's "reset" button — in production the equivalent is just + ``FLUSHDB`` on a dedicated cache database, or letting TTLs + expire naturally. + """ + before = self.index_info()["num_docs"] + self.drop_index(delete_documents=True) + self.create_index() + return before + + +def _decode(value) -> str: + if value is None: + return "" + if isinstance(value, bytes): + return value.decode("utf-8") + return value + + +def _safe_mb(info: dict) -> float: + try: + return float(info.get("vector_index_sz_mb", 0.0) or 0.0) + except (TypeError, ValueError): + return 0.0 diff --git a/content/develop/use-cases/semantic-cache/redis-py/demo_server.py b/content/develop/use-cases/semantic-cache/redis-py/demo_server.py new file mode 100644 index 0000000000..35a917cb15 --- /dev/null +++ b/content/develop/use-cases/semantic-cache/redis-py/demo_server.py @@ -0,0 +1,976 @@ +#!/usr/bin/env python3 +""" +Redis semantic-cache demo server. + +Run this file and visit http://localhost:8085 to drive a small +semantic-cache demo backed by Redis Search. 
The UI lets you: + +* Type a natural-language prompt and watch the cache decide hit or + miss. On a hit Redis returns the cached response in tens of + milliseconds and the demo LLM is not called at all; on a miss the + demo LLM "thinks" for ~1.5 s before answering and the new prompt, + response, and embedding are written back to Redis for next time. +* Adjust the cosine-distance threshold to see how close a paraphrase + must be for the cache to serve it. +* Switch tenant, locale, or model version to see metadata isolation + in action — entries written under one tenant cannot be served to + another, because the TAG filter goes into the same ``FT.SEARCH`` + call as the KNN. +* Watch cumulative savings build up: hit count, token spend avoided, + and end-to-end latency saved against the LLM mock. +* Inspect every cached entry, including its remaining TTL and total + hit count, and drop individual entries to simulate eviction. + +The server holds a single ``LocalEmbedder``, a single +``RedisSemanticCache``, and a single ``MockLLM`` for the lifetime of +the process. The first run downloads the embedding model (~80 MB) +into the local Hugging Face cache; everything after is local. +""" + +from __future__ import annotations + +import argparse +import json +import sys +import time +from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer +from pathlib import Path +from threading import Lock +from urllib.parse import parse_qs, urlparse + +import numpy as np + +sys.path.insert(0, str(Path(__file__).resolve().parent)) + +try: + import redis + + from cache import CacheHit, CacheMiss, RedisSemanticCache + from embeddings import LocalEmbedder + from mock_llm import MockLLM + from seed_cache import SEED_ENTRIES, seed +except ImportError as exc: + print(f"Error: {exc}") + print( + "Make sure the required packages are installed:\n" + " pip install redis sentence-transformers numpy" + ) + sys.exit(1) + + +HTML_TEMPLATE = """ + + + + + Redis Semantic Cache Demo + + + +
+
loading…
+

Redis Semantic Cache Demo

+

+ A small semantic cache sits in front of a mock LLM. Each cache + entry is a Hash at __KEY_PREFIX__<id> holding + the prompt, the response, the prompt's 384-dimensional embedding, + and metadata fields. A single FT.SEARCH on + __INDEX_NAME__ does the KNN against cached prompts + with a TAG pre-filter (tenant, locale, model version, safety) in + the same round trip. If the closest cached prompt is within the + cosine-distance threshold, the demo serves the cached response + and the LLM is not called at all. +

+ +
+ +
+

Ask the LLM

+

Type a question, optionally adjust the metadata filters and + the distance threshold, and submit. The server embeds the + prompt, runs FT.SEARCH with KNN over the cache, + and either serves the cached response (hit) or runs the mock + LLM and writes the new response back to the cache (miss).

+ + +
+
+ + +
+
+ + +
+
+ + +
+
+
+ + + 0.50 +
+

+ The cache serves a hit when the closest cached prompt's + cosine distance is at or below this threshold. Lower = + stricter (fewer hits, safer reuse); higher = looser (more + hits, more risk of serving a near-miss). +

+ + + + + +
+
+ +
+

Cumulative savings

+

Every hit avoids one LLM round trip. The numbers below add + up across the session — tokens that would have been spent and + wall-clock seconds that would have been waited if the cache + had not served the answer.

+
+
+
0
+
Total queries
+
+
+
0
+
Cache hits
+
+
+
0
+
Cache misses
+
+
+
0%
+
Hit ratio
+
+
+
0
+
Tokens saved
+
+
+
0 ms
+
LLM time saved
+
+
+
+ +
+

Index state

+
+ +
+ +
+

Cached entries

+

Every prompt/response pair currently in the cache. + hit_count is the running total of times the entry + has served a hit; ttl is the remaining lifetime + in seconds before EXPIRE drops the key. Click + Drop to simulate eviction.

+ + + + + + + + + + + + +
ID | Prompt | Metadata | Hits | TTL
+
+ +
+ +
+
+ + + + +""" + + +class SemanticCacheDemo: + """Demo state: cache management, mock LLM, cumulative stats.""" + + def __init__( + self, + cache: RedisSemanticCache, + embedder: LocalEmbedder, + llm: MockLLM, + default_tenant: str = "acme", + default_locale: str = "en", + ) -> None: + self.cache = cache + self.embedder = embedder + self.llm = llm + self.default_tenant = default_tenant + self.default_locale = default_locale + self._lock = Lock() + + def seed(self) -> int: + """Drop everything in scope and pre-populate with FAQ entries.""" + with self._lock: + self.cache.clear() + return seed( + self.cache, + self.embedder, + tenant=self.default_tenant, + locale=self.default_locale, + model_version=self.llm.model_version, + ) + + def run_query( + self, + prompt: str, + tenant: str, + locale: str, + model_version: str, + threshold: float, + lookup_only: bool, + ) -> dict: + """The hot path: embed, look up, optionally call the LLM, cache. + + Timings are taken with ``time.perf_counter`` around each + bounded step so the UI can display the embed / lookup / LLM + breakdown separately. The cache write on a miss is *not* + included in ``total_ms`` so the latency number reflects the + user-facing wait, not the background bookkeeping. 
+ """ + t0 = time.perf_counter() + query_vec = self.embedder.encode_one(prompt) + embed_ms = (time.perf_counter() - t0) * 1000 + + t1 = time.perf_counter() + result = self.cache.lookup( + query_vec, + tenant=tenant, + locale=locale, + model_version=model_version, + distance_threshold=threshold, + ) + lookup_ms = (time.perf_counter() - t1) * 1000 + + if isinstance(result, CacheHit): + return { + "outcome": "hit", + "response": result.response, + "entry_id": result.id, + "distance": result.distance, + "ttl_seconds": result.ttl_seconds, + "hit_count": result.hit_count, + "threshold": threshold, + "embed_ms": embed_ms, + "lookup_ms": lookup_ms, + "llm_ms": None, + "total_ms": embed_ms + lookup_ms, + "tokens_avoided": _estimate_response_tokens( + result.prompt, result.response, + ), + "ms_avoided": self.llm.latency_ms, + } + + # Miss path. In "lookup only" mode the demo reports the miss + # without actually calling the LLM — useful for sweeping the + # threshold against a fixed prompt to see where the cutoff + # would fall without polluting the cache. + assert isinstance(result, CacheMiss) + if lookup_only: + return { + "outcome": "miss", + "response": "(LLM not called in lookup-only mode)", + "nearest_distance": result.nearest_distance, + "threshold": threshold, + "wrote_entry_id": None, + "embed_ms": embed_ms, + "lookup_ms": lookup_ms, + "llm_ms": None, + "total_ms": embed_ms + lookup_ms, + } + + t2 = time.perf_counter() + llm_response = self.llm.complete(prompt) + llm_ms = (time.perf_counter() - t2) * 1000 + + # Write the new entry back. The embedding is the same vector + # we already used for the lookup — no need to re-encode. 
+ entry_id = self.cache.put( + prompt=prompt, + response=llm_response.response, + embedding=query_vec, + tenant=tenant, + locale=locale, + model_version=model_version, + ) + return { + "outcome": "miss", + "response": llm_response.response, + "nearest_distance": result.nearest_distance, + "threshold": threshold, + "wrote_entry_id": entry_id, + "embed_ms": embed_ms, + "lookup_ms": lookup_ms, + "llm_ms": llm_ms, + "total_ms": embed_ms + lookup_ms + llm_ms, + } + + +def _estimate_response_tokens(prompt: str, response: str) -> int: + """Approximate combined token cost of a prompt and its response.""" + return max(1, (len(prompt) + len(response)) // 4) + + +class SemanticCacheHandler(BaseHTTPRequestHandler): + """HTTP handler. Server-state lives on class attributes.""" + + cache: RedisSemanticCache | None = None + embedder: LocalEmbedder | None = None + demo: SemanticCacheDemo | None = None + llm: MockLLM | None = None + + # ------------------------------------------------------------------ + # GET + # ------------------------------------------------------------------ + + def do_GET(self) -> None: + try: + parsed = urlparse(self.path) + if parsed.path in {"/", "/index.html"}: + self._send_html(self._html_page()) + return + if parsed.path == "/state": + self._send_json(self._build_state(), 200) + return + self.send_error(404) + except Exception as exc: + self._send_error_json(exc) + + # ------------------------------------------------------------------ + # POST + # ------------------------------------------------------------------ + + def do_POST(self) -> None: + try: + parsed = urlparse(self.path) + if parsed.path == "/query": + self._handle_query() + return + if parsed.path == "/reset": + self.demo.seed() + self._send_json({"ok": True}, 200) + return + if parsed.path == "/drop": + self._handle_drop() + return + self.send_error(404) + except Exception as exc: + self._send_error_json(exc) + + def _send_error_json(self, exc: Exception) -> None: + """Return a JSON 500 so 
the client's ``await res.json()`` works. + + Without this wrapper, an exception in a handler escapes to + ``BaseHTTPRequestHandler`` which writes a plain-text 500 page; + the demo's ``fetch().then(r => r.json())`` then explodes with + an opaque JSON parse error instead of surfacing what went wrong. + """ + sys.stderr.write(f"[demo] handler error: {type(exc).__name__}: {exc}\n") + try: + self._send_json( + {"error": str(exc), "type": type(exc).__name__}, 500, + ) + except Exception: + # Headers may already be partially flushed; nothing useful + # left to do beyond letting the connection drop. + pass + + # ---- handlers --------------------------------------------------- + + def _handle_query(self) -> None: + params = self._read_form() + prompt = params.get("prompt", [""])[0].strip() + if not prompt: + self._send_json({"error": "prompt is required"}, 400) + return + try: + threshold = float(params.get("threshold", ["0.5"])[0]) + except ValueError: + threshold = 0.5 + # ``float()`` happily parses "nan" / "inf"; either would + # silently turn the lookup into a permanent hit (NaN comparisons + # are always False, so ``distance > nan`` cannot reject) or a + # permanent miss. Clamp to the meaningful cosine-distance range + # so a malformed POST can't override the threshold semantics. 
+ import math + if not math.isfinite(threshold): + threshold = 0.5 + threshold = max(0.0, min(2.0, threshold)) + payload = self.demo.run_query( + prompt=prompt, + tenant=params.get("tenant", ["acme"])[0] or "acme", + locale=params.get("locale", ["en"])[0] or "en", + model_version= + params.get("model_version", [self.llm.model_version])[0] + or self.llm.model_version, + threshold=threshold, + lookup_only=bool(params.get("lookup_only", [""])[0]), + ) + self._send_json(payload, 200) + + def _handle_drop(self) -> None: + params = self._read_form() + entry_id = params.get("entry_id", [""])[0].strip() + if not entry_id: + self._send_json({"error": "entry_id is required"}, 400) + return + deleted = self.cache.delete_entry(entry_id) + self._send_json({"deleted": deleted, "entry_id": entry_id}, 200) + + # ---- state assembly --------------------------------------------- + + def _build_state(self) -> dict: + info = self.cache.index_info() + info["index_name"] = self.cache.index_name + info["model"] = self.embedder.model_name + info["mock_llm_latency_ms"] = self.llm.latency_ms + # ``default_threshold`` is what the ``--threshold`` flag + # actually configures; the UI slider initialises to this on + # first load so the flag visibly changes the demo's behaviour. + # ``stack_label`` lets the same HTML render a per-language + # badge (redis-py, node-redis, etc.) without forking the file + # per language. 
+ info["default_threshold"] = self.cache.distance_threshold + info["stack_label"] = ( + "redis-py + sentence-transformers + " + "Python standard library HTTP server" + ) + return { + "index": info, + "entries": self.cache.list_entries(limit=200), + } + + # ---- HTTP plumbing ---------------------------------------------- + + def _read_form(self) -> dict[str, list[str]]: + length = int(self.headers.get("Content-Length", "0")) + raw = self.rfile.read(length).decode("utf-8") if length else "" + return parse_qs(raw) + + def _send_html(self, html: str, status: int = 200) -> None: + self.send_response(status) + self.send_header("Content-Type", "text/html; charset=utf-8") + self.end_headers() + self.wfile.write(html.encode("utf-8")) + + def _send_json(self, payload: dict, status: int) -> None: + self.send_response(status) + self.send_header("Content-Type", "application/json") + self.end_headers() + self.wfile.write(json.dumps(payload, default=_json_default).encode("utf-8")) + + def log_message(self, format: str, *args) -> None: # noqa: A002 + sys.stderr.write(f"[demo] {format % args}\n") + + def _html_page(self) -> str: + return ( + HTML_TEMPLATE + .replace("__INDEX_NAME__", self.cache.index_name) + .replace("__KEY_PREFIX__", self.cache.key_prefix) + ) + + +def _json_default(value): + if isinstance(value, np.floating): + return float(value) + if isinstance(value, np.integer): + return int(value) + if isinstance(value, np.ndarray): + return value.tolist() + raise TypeError(f"unserializable: {type(value).__name__}") + + +def parse_args() -> argparse.Namespace: + parser = argparse.ArgumentParser( + description="Run the Redis semantic-cache demo server.", + ) + parser.add_argument("--host", default="127.0.0.1", help="HTTP bind host") + parser.add_argument("--port", type=int, default=8085, help="HTTP bind port") + parser.add_argument("--redis-host", default="localhost", help="Redis host") + parser.add_argument("--redis-port", type=int, default=6379, help="Redis port") + 
parser.add_argument( + "--index-name", default="semcache:idx", + help="Redis Search index name", + ) + parser.add_argument( + "--key-prefix", default="cache:", + help="Hash key prefix for cached entries", + ) + parser.add_argument( + "--ttl-seconds", type=int, default=3600, + help="TTL applied to every cache entry on write", + ) + parser.add_argument( + "--threshold", type=float, default=0.5, + help="Default cosine-distance threshold for cache hits", + ) + parser.add_argument( + "--llm-latency-ms", type=float, default=1500.0, + help="Simulated LLM round-trip latency in milliseconds", + ) + parser.add_argument( + "--no-reset", dest="reset_on_start", action="store_false", + help=( + "Keep any existing cached entries instead of dropping" + " and re-seeding on startup." + ), + ) + return parser.parse_args() + + +def main() -> None: + args = parse_args() + + redis_client = redis.Redis( + host=args.redis_host, + port=args.redis_port, + decode_responses=False, + ) + try: + redis_client.ping() + except redis.ConnectionError as exc: + print(f"Error: cannot reach Redis at {args.redis_host}:{args.redis_port}") + print(f" ({exc})") + sys.exit(1) + + cache = RedisSemanticCache( + redis_client=redis_client, + index_name=args.index_name, + key_prefix=args.key_prefix, + distance_threshold=args.threshold, + default_ttl_seconds=args.ttl_seconds, + ) + cache.create_index() + + print("Loading embedding model (first run downloads ~80 MB)...") + embedder = LocalEmbedder() + llm = MockLLM(latency_ms=args.llm_latency_ms) + + demo = SemanticCacheDemo( + cache=cache, embedder=embedder, llm=llm, + ) + + if args.reset_on_start: + print( + f"Dropping any existing cache under '{args.key_prefix}*' and" + " re-seeding from the FAQ list (pass --no-reset to keep)." 
+ ) + seeded = demo.seed() + print(f"Seeded {seeded} entries.") + + SemanticCacheHandler.cache = cache + SemanticCacheHandler.embedder = embedder + SemanticCacheHandler.demo = demo + SemanticCacheHandler.llm = llm + + print( + f"Redis semantic cache demo listening on " + f"http://{args.host}:{args.port}" + ) + print( + f"Using Redis at {args.redis_host}:{args.redis_port}" + f" with index '{args.index_name}'" + ) + + server = ThreadingHTTPServer((args.host, args.port), SemanticCacheHandler) + try: + server.serve_forever() + except KeyboardInterrupt: + pass + + +if __name__ == "__main__": + main() diff --git a/content/develop/use-cases/semantic-cache/redis-py/embeddings.py b/content/develop/use-cases/semantic-cache/redis-py/embeddings.py new file mode 100644 index 0000000000..15ad84fc85 --- /dev/null +++ b/content/develop/use-cases/semantic-cache/redis-py/embeddings.py @@ -0,0 +1,55 @@ +""" +Local text-embedding helper backed by sentence-transformers. + +This is a thin wrapper around the ``sentence-transformers`` model +``all-MiniLM-L6-v2``: a 384-dimensional encoder that runs on CPU, +needs no API key, and has a small footprint (~80 MB). On the first +call the model is downloaded into the local Hugging Face cache; every +later call runs locally. + +Vectors are L2-normalised on output so a Redis Search index declared +with ``DISTANCE_METRIC COSINE`` returns scores that are directly +comparable across entries. +""" + +from __future__ import annotations + +from typing import Iterable + +import numpy as np + + +_DEFAULT_MODEL = "sentence-transformers/all-MiniLM-L6-v2" + + +class LocalEmbedder: + """Encode short strings into normalised float32 vectors. + + A single instance loads the model once and reuses it for every + call. The demo server keeps one ``LocalEmbedder`` around for the + lifetime of the process so it can embed the incoming prompt and, + on a miss, write the same bytes into the cache without + re-encoding. 
+ """ + + def __init__(self, model_name: str = _DEFAULT_MODEL) -> None: + from sentence_transformers import SentenceTransformer + + self.model_name = model_name + self.model = SentenceTransformer(model_name) + self.dim = int(self.model.get_sentence_embedding_dimension()) + + def encode_one(self, text: str) -> np.ndarray: + """Encode a single string. Returns a 1-D ``float32`` array.""" + return self.encode_many([text])[0] + + def encode_many(self, texts: Iterable[str]) -> np.ndarray: + """Encode a batch. Returns an ``(N, dim) float32`` array.""" + batch = list(texts) + vectors = self.model.encode( + batch, + batch_size=32, + normalize_embeddings=True, + convert_to_numpy=True, + ) + return vectors.astype(np.float32, copy=False) diff --git a/content/develop/use-cases/semantic-cache/redis-py/mock_llm.py b/content/develop/use-cases/semantic-cache/redis-py/mock_llm.py new file mode 100644 index 0000000000..ef33962454 --- /dev/null +++ b/content/develop/use-cases/semantic-cache/redis-py/mock_llm.py @@ -0,0 +1,168 @@ +""" +Deterministic mock LLM for the semantic-cache demo. + +The point of a semantic cache is to *skip* an LLM call when a prior +answer is reusable. To make that visible in a docs demo we need an +LLM stand-in that: + +* takes long enough that the saved time on a cache hit is obvious + (real-world model calls are 500 ms to several seconds); +* responds deterministically so a given prompt always produces the + same answer, which keeps the demo reproducible; +* exposes an estimated token count so the demo can show the saving in + "tokens not spent" terms alongside latency; +* needs no API keys, no network, no extra dependencies. + +It is keyword-matched against a small lookup table of FAQ-style +answers for a fictional online retailer. Anything that doesn't match +falls back to a generic templated reply. The `latency_ms` parameter +is the simulated round trip; the default (1500 ms) is in the +neighbourhood of a real GPT-class model on a moderately-sized prompt. 
+""" + +from __future__ import annotations + +import time +from dataclasses import dataclass + + +@dataclass +class LLMResponse: + response: str + model_version: str + latency_ms: float + prompt_tokens: int + completion_tokens: int + + @property + def total_tokens(self) -> int: + return self.prompt_tokens + self.completion_tokens + + def to_dict(self) -> dict: + return { + "response": self.response, + "model_version": self.model_version, + "latency_ms": round(self.latency_ms, 2), + "prompt_tokens": self.prompt_tokens, + "completion_tokens": self.completion_tokens, + "total_tokens": self.total_tokens, + } + + +# A small FAQ table for a fictional online retailer. Each row is +# (keyword set, response). The keyword set is matched against the +# *prompt*, so a paraphrase like "How do I return an item?" and +# "What is your return policy?" both land on the same row — but the +# match is by surface form, not embedding, so the cache lookup is +# what makes paraphrase reuse work. The mock LLM itself only matches +# crude keyword overlap. +_KNOWLEDGE = [ + ( + {"return", "refund", "exchange"}, + "You can return any unworn item within 30 days of delivery for a " + "full refund. Start a return from your order page; we email a " + "prepaid label and refund the original payment method within " + "five business days of receiving the item.", + ), + ( + {"shipping", "delivery", "arrive", "ship"}, + "Standard shipping is free on orders over $50 and arrives in " + "three to five business days. Expedited two-day shipping is " + "$9.99 and is available at checkout for in-stock items.", + ), + ( + {"size", "sizing", "fit"}, + "We follow standard US sizing. For most styles we recommend " + "ordering your usual size; the product page includes a sizing " + "chart and customer fit notes for items that run small or large.", + ), + ( + {"warranty", "guarantee", "defect", "broken"}, + "All gear is covered by a one-year manufacturer warranty against " + "defects in materials or workmanship. 
Email support with your " + "order number and a photo of the issue and we will replace the " + "item or issue a refund.", + ), + ( + {"contact", "support", "help", "agent"}, + "You can reach our support team by email at help@example.com or " + "by live chat from the help centre, 9am to 9pm Eastern, seven " + "days a week. Most tickets get a first reply within two hours.", + ), + ( + {"track", "tracking", "order", "where"}, + "Your tracking number is on the order confirmation email and on " + "the order detail page once the package has been picked up by " + "the carrier — typically within 24 hours of order placement.", + ), + ( + {"cancel", "modify", "change"}, + "Orders can be cancelled or modified for up to one hour after " + "placement. After that the order has usually entered our " + "warehouse system; the fastest path is to accept delivery and " + "start a return for any unwanted items.", + ), + ( + {"discount", "coupon", "promo", "code"}, + "Active promotional codes are listed on the homepage banner. " + "Codes apply at checkout and cannot be combined; the system " + "automatically uses the larger of the two when more than one " + "would qualify.", + ), +] + + +class MockLLM: + """A deterministic, slow, no-network stand-in for a real model.""" + + def __init__( + self, + model_version: str = "gpt-4.5-2026", + latency_ms: float = 1500.0, + ) -> None: + self.model_version = model_version + self.latency_ms = latency_ms + self.call_count = 0 + + def complete(self, prompt: str) -> LLMResponse: + """Pretend to call a model. Sleeps, then returns a templated answer.""" + self.call_count += 1 + start = time.perf_counter() + # Sleep first so the latency is realistic regardless of which + # branch generates the text. 
+ time.sleep(self.latency_ms / 1000.0) + response = self._answer_for(prompt) + elapsed_ms = (time.perf_counter() - start) * 1000 + + return LLMResponse( + response=response, + model_version=self.model_version, + latency_ms=elapsed_ms, + prompt_tokens=_estimate_tokens(prompt), + completion_tokens=_estimate_tokens(response), + ) + + @staticmethod + def _answer_for(prompt: str) -> str: + lower = prompt.lower() + for keywords, answer in _KNOWLEDGE: + if any(kw in lower for kw in keywords): + return answer + # Generic fallback — keeps the demo working for queries that + # don't match any FAQ keyword. + return ( + "Thanks for the question. Our team would normally answer this " + "individually; in the meantime please check the help centre " + "or contact support@example.com for a faster response." + ) + + +def _estimate_tokens(text: str) -> int: + """Rough English token estimate: ~4 characters per token. + + Real tokenizers (BPE, SentencePiece) vary slightly but this is + close enough for "look how many tokens you saved" demo signage. + """ + if not text: + return 0 + return max(1, len(text) // 4) diff --git a/content/develop/use-cases/semantic-cache/redis-py/seed_cache.py b/content/develop/use-cases/semantic-cache/redis-py/seed_cache.py new file mode 100644 index 0000000000..48eecb1bf5 --- /dev/null +++ b/content/develop/use-cases/semantic-cache/redis-py/seed_cache.py @@ -0,0 +1,95 @@ +""" +Pre-seed the semantic cache with a handful of FAQ answers. + +In a real deployment the cache fills up organically as users ask +questions: a first-time question is a miss, the LLM answers, and the +response is written back. To make the demo immediately useful — so +the first query you type lands on a hit instead of a cold miss — we +seed a small set of canonical prompts and their answers at startup. + +The seed list mirrors the keyword table in ``mock_llm.py`` but stores +the *canonical phrasing* of each question. 
Paraphrases of any of +these prompts ("How do I return an item?", "Can I get a refund?") +embed close to the canonical entry and the cache lookup serves the +stored response without ever calling the model. +""" + +from __future__ import annotations + +import numpy as np + +from cache import RedisSemanticCache +from embeddings import LocalEmbedder + + +SEED_ENTRIES: list[dict] = [ + { + "prompt": "What is your return policy?", + "response": + "You can return any unworn item within 30 days of delivery for " + "a full refund. Start a return from your order page; we email " + "a prepaid label and refund the original payment method within " + "five business days of receiving the item.", + }, + { + "prompt": "How long does shipping take?", + "response": + "Standard shipping is free on orders over $50 and arrives in " + "three to five business days. Expedited two-day shipping is " + "$9.99 and is available at checkout for in-stock items.", + }, + { + "prompt": "How do I find my size?", + "response": + "We follow standard US sizing. For most styles we recommend " + "ordering your usual size; the product page includes a sizing " + "chart and customer fit notes for items that run small or " + "large.", + }, + { + "prompt": "Is there a warranty on your products?", + "response": + "All gear is covered by a one-year manufacturer warranty " + "against defects in materials or workmanship. Email support " + "with your order number and a photo of the issue and we will " + "replace the item or issue a refund.", + }, + { + "prompt": "How can I contact customer support?", + "response": + "You can reach our support team by email at help@example.com " + "or by live chat from the help centre, 9am to 9pm Eastern, " + "seven days a week. 
Most tickets get a first reply within two " + "hours.", + }, + { + "prompt": "Where is my order?", + "response": + "Your tracking number is on the order confirmation email and " + "on the order detail page once the package has been picked up " + "by the carrier — typically within 24 hours of order " + "placement.", + }, +] + + +def seed( + cache: RedisSemanticCache, + embedder: LocalEmbedder, + tenant: str = "acme", + locale: str = "en", + model_version: str = "gpt-4.5-2026", +) -> int: + """Embed and write the seed entries into the cache.""" + prompts = [entry["prompt"] for entry in SEED_ENTRIES] + vectors = embedder.encode_many(prompts) + for entry, vec in zip(SEED_ENTRIES, vectors): + cache.put( + prompt=entry["prompt"], + response=entry["response"], + embedding=np.asarray(vec, dtype=np.float32), + tenant=tenant, + locale=locale, + model_version=model_version, + ) + return len(SEED_ENTRIES) diff --git a/content/develop/use-cases/semantic-cache/ruby/.gitignore b/content/develop/use-cases/semantic-cache/ruby/.gitignore new file mode 100644 index 0000000000..6944a2d8c5 --- /dev/null +++ b/content/develop/use-cases/semantic-cache/ruby/.gitignore @@ -0,0 +1,17 @@ +# Bundler installs the gems and their native extensions (onnxruntime +# ships its own shared library) under vendor/ when run with +# `bundle config set --local path vendor/bundle`. +vendor/ + +# Local bundler config; `bundle config set --local path` writes to +# .bundle/config and a developer's personal path choice should not be +# committed alongside the example. +.bundle/ + +# informers downloads the ONNX-exported all-MiniLM-L6-v2 weights into +# the local Hugging Face cache directory on first run. Use the env var +# HF_HOME to override it; the default is ~/.cache/huggingface so it +# typically does not land here, but this entry catches the case where +# a user pins the cache to the project directory. 
+.cache/ +huggingface/ diff --git a/content/develop/use-cases/semantic-cache/ruby/Gemfile b/content/develop/use-cases/semantic-cache/ruby/Gemfile new file mode 100644 index 0000000000..847e34c01c --- /dev/null +++ b/content/develop/use-cases/semantic-cache/ruby/Gemfile @@ -0,0 +1,16 @@ +# Redis semantic-cache demo (Ruby). +# +# Pinned to Ruby 3.2+ baseline. The three runtime gems pull in: +# * `redis-client` (the lower-level transport under `redis`) +# * `onnxruntime` (the ONNX backend `informers` runs the encoder on) +# * `tokenizers` (the Hugging Face fast tokenizer used by `informers`) +# `webrick` was extracted from the stdlib in Ruby 3.0; declaring it here +# means `bundle install` resolves it on every supported Ruby version. + +source 'https://rubygems.org' + +ruby '>= 3.2' + +gem 'redis', '~> 5.4' +gem 'informers', '~> 1.3' +gem 'webrick', '~> 1.8' diff --git a/content/develop/use-cases/semantic-cache/ruby/Gemfile.lock b/content/develop/use-cases/semantic-cache/ruby/Gemfile.lock new file mode 100644 index 0000000000..aa7b78ea12 --- /dev/null +++ b/content/develop/use-cases/semantic-cache/ruby/Gemfile.lock @@ -0,0 +1,76 @@ +GEM + remote: https://rubygems.org/ + specs: + connection_pool (3.0.2) + ffi (1.17.4-aarch64-linux-gnu) + ffi (1.17.4-aarch64-linux-musl) + ffi (1.17.4-arm64-darwin) + ffi (1.17.4-x86_64-darwin) + ffi (1.17.4-x86_64-linux-gnu) + ffi (1.17.4-x86_64-linux-musl) + informers (1.3.0) + onnxruntime (>= 0.9) + tokenizers (>= 0.6) + onnxruntime (0.11.3-aarch64-linux) + ffi + onnxruntime (0.11.3-arm64-darwin) + ffi + onnxruntime (0.11.3-x86_64-darwin) + ffi + onnxruntime (0.11.3-x86_64-linux) + ffi + redis (5.4.1) + redis-client (>= 0.22.0) + redis-client (0.29.0) + connection_pool + tokenizers (0.7.0-aarch64-linux) + tokenizers (0.7.0-aarch64-linux-musl) + tokenizers (0.7.0-arm64-darwin) + tokenizers (0.7.0-x86_64-darwin) + tokenizers (0.7.0-x86_64-linux) + tokenizers (0.7.0-x86_64-linux-musl) + webrick (1.9.2) + +PLATFORMS + aarch64-linux + 
aarch64-linux-gnu + aarch64-linux-musl + arm64-darwin + x86_64-darwin + x86_64-linux + x86_64-linux-gnu + x86_64-linux-musl + +DEPENDENCIES + informers (~> 1.3) + redis (~> 5.4) + webrick (~> 1.8) + +CHECKSUMS + connection_pool (3.0.2) sha256=33fff5ba71a12d2aa26cb72b1db8bba2a1a01823559fb01d29eb74c286e62e0a + ffi (1.17.4-aarch64-linux-gnu) sha256=b208f06f91ffd8f5e1193da3cae3d2ccfc27fc36fba577baf698d26d91c080df + ffi (1.17.4-aarch64-linux-musl) sha256=9286b7a615f2676245283aef0a0a3b475ae3aae2bb5448baace630bb77b91f39 + ffi (1.17.4-arm64-darwin) sha256=19071aaf1419251b0a46852abf960e77330a3b334d13a4ab51d58b31a937001b + ffi (1.17.4-x86_64-darwin) sha256=aa70390523cf3235096cf64962b709b4cfbd5c082a2cb2ae714eb0fe2ccda496 + ffi (1.17.4-x86_64-linux-gnu) sha256=9d3db14c2eae074b382fa9c083fe95aec6e0a1451da249eab096c34002bc752d + ffi (1.17.4-x86_64-linux-musl) sha256=3fdf9888483de005f8ef8d1cf2d3b20d86626af206cbf780f6a6a12439a9c49e + informers (1.3.0) sha256=7d8ea9f6c32ecd4519ccbf8b2d89e9ecfa7c7916700020926c21926d6506de83 + onnxruntime (0.11.3-aarch64-linux) sha256=d1b8477fcdbf7c4a8c05337bf8f8d782b81ffadaa11712b2b97a14a45288eb75 + onnxruntime (0.11.3-arm64-darwin) sha256=3b58c620809bc551c8d8b9c54683ac8e85936177b426e329dd5368411fd099ab + onnxruntime (0.11.3-x86_64-darwin) sha256=d585c8419f1d17f6098e569464947ec8926221cf0cb456716893bca858eda0a1 + onnxruntime (0.11.3-x86_64-linux) sha256=c46c30610d18c7051ad71f54a95a1518845dc0192384c2dfb54cc273d3c8e816 + redis (5.4.1) sha256=b5e675b57ad22b15c9bcc765d5ac26f60b675408af916d31527af9bd5a81faae + redis-client (0.29.0) sha256=0c65bf1f8f6dca22063ddb085c0bb2054feef6f03a84869f4161b18a9a15bea3 + tokenizers (0.7.0-aarch64-linux) sha256=c7cc43f4144d02e8db3a7d743725e38144f357d3f225373ef3a732869b057b2e + tokenizers (0.7.0-aarch64-linux-musl) sha256=f8b38535be37a044d7b119a65b2966b09c5493fdfd6eaf850b0f3c4f37fd81fa + tokenizers (0.7.0-arm64-darwin) sha256=a54883b2fe7ca83a9275284a30edeb12376a1fe1af6b9be6bf9693ef202eac82 + tokenizers (0.7.0-x86_64-darwin) 
sha256=e67366c1d35e34b6fa406b1fbb528411fe229c5f12cdf05fc7f68269dcc1b8ea + tokenizers (0.7.0-x86_64-linux) sha256=a795e5026ccf6f6340195789c48ff8c9c26462c75feb3b2971e26aef3118d842 + tokenizers (0.7.0-x86_64-linux-musl) sha256=c2409219bb97772ae3117671d62d5c021a5db9cf7212b3cf09a15a7cc8828989 + webrick (1.9.2) sha256=beb4a15fc474defed24a3bda4ffd88a490d517c9e4e6118c3edce59e45864131 + +RUBY VERSION + ruby 4.0.4 + +BUNDLED WITH + 4.0.11 diff --git a/content/develop/use-cases/semantic-cache/ruby/_index.md b/content/develop/use-cases/semantic-cache/ruby/_index.md new file mode 100644 index 0000000000..5ed1b8385c --- /dev/null +++ b/content/develop/use-cases/semantic-cache/ruby/_index.md @@ -0,0 +1,259 @@ +--- +categories: +- docs +- develop +- stack +- oss +- rs +- rc +description: Build a Redis-backed semantic cache for LLM responses in Ruby with redis-rb and informers +linkTitle: redis-rb example (Ruby) +title: Redis semantic cache with redis-rb +weight: 4 +--- + +This guide shows you how to build a small Redis-backed semantic cache for LLM responses in Ruby with [`redis-rb`]({{< relref "/develop/clients/ruby" >}}) and the [`informers`](https://github.com/ankane/informers) gem, a Ruby port of Hugging Face transformers that runs the ONNX-exported [`sentence-transformers/all-MiniLM-L6-v2`](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) encoder locally on CPU. It includes a local web server built with the [`WEBrick`](https://github.com/ruby/webrick) HTTP server gem (no longer part of the standard library as of Ruby 3.0) so you can send paraphrased prompts to a mock LLM, watch the cache decide hit or miss, sweep the cosine-distance threshold, and see the cumulative latency and token savings build up. + +## Overview + +Each cache entry is stored as a single Redis [Hash]({{< relref "/develop/data-types/hashes" >}}) at `cache:`.
The hash holds the original prompt, the LLM's response, the raw `float32` bytes of a 384-dimensional embedding of the prompt, and metadata fields — tenant, locale, model version, safety flag — plus a `created_ts` and a `hit_count`. A single [Redis Search]({{< relref "/develop/ai/search-and-query" >}}) index covers the embedding field and every metadata field, so one [`FT.SEARCH`]({{< relref "/commands/ft.search" >}}) call with a `KNN` clause does the vector lookup *and* the TAG pre-filter in the same round trip — no cross-store joins. + +The lookup is thresholded: [`FT.SEARCH`]({{< relref "/commands/ft.search" >}}) always returns the nearest entry that satisfies the filters, but the application only serves it as a hit when the reported cosine distance is at or below `distance_threshold`. Anything further away is treated as a miss; the caller runs the LLM and writes the new prompt, response, and embedding back to the same key pattern with a TTL. + +The embedder is [`informers`](https://github.com/ankane/informers) running the ONNX-exported [`sentence-transformers/all-MiniLM-L6-v2`](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) model — the same 384-dimensional encoder the [Python example]({{< relref "/develop/use-cases/semantic-cache/redis-py" >}}), the [Node.js example]({{< relref "/develop/use-cases/semantic-cache/nodejs" >}}), the [Go example]({{< relref "/develop/use-cases/semantic-cache/go" >}}), and the [Jedis example]({{< relref "/develop/use-cases/semantic-cache/java-jedis" >}}) use. Embeddings produced by the Ruby ONNX path are semantically equivalent to the PyTorch ones — paraphrase distances differ by ~0.01, the same drift the Node.js Xenova ONNX path sees — so a cache populated by one demo can be queried by another against the same Redis instance. + +That gives you: + +* A single round trip for lookup — vector KNN + metadata pre-filter in one [`FT.SEARCH`]({{< relref "/commands/ft.search" >}}). +* Tens of milliseconds on a hit vs. 
a multi-second LLM call on a miss; the embedding step is the bottleneck either way, and that's a model-side cost, not a Redis one. +* Tenant, locale, and model-version isolation enforced inside the query, not in application code — a write under one tenant cannot be served to another. +* Bounded memory: every entry has an [`EXPIRE`]({{< relref "/commands/expire" >}}) TTL, and a database-level [eviction policy]({{< relref "/develop/reference/eviction" >}}) (LRU / LFU) caps the cache size under pressure. + +## How it works + +A query goes through three stages: **embed**, **lookup**, and (on a miss) **call the LLM and write back**. + +### Hit path (the goal) + +1. The application calls `embedder.encode_one(prompt)` to turn the incoming text into a 384-element `Array`. +2. `cache.lookup(query_vec, tenant:, locale:, model_version:)` runs [`FT.SEARCH`]({{< relref "/commands/ft.search" >}}) with a TAG pre-filter and a `KNN 1` clause. Redis returns the closest cached prompt that satisfies the filters along with its cosine distance. +3. If the distance is at or below the threshold, the cache returns a `CacheHit` containing the cached response. The helper also runs an [`HINCRBY`]({{< relref "/commands/hincrby" >}}) on `hit_count` and an [`EXPIRE`]({{< relref "/commands/expire" >}}) refresh inside a [`MULTI/EXEC`]({{< relref "/commands/multi" >}}), so a frequently used answer keeps its TTL and the demo UI can see which entries are load-bearing. +4. The LLM is not called at all. The application returns the cached response to the user. + +### Miss path + +When the distance is above the threshold — or there is no candidate in scope at all — the helper returns a `CacheMiss` instead, carrying the distance of the nearest candidate (if any) for logging. The application then: + +1. Calls the LLM with the prompt. +2. Calls `cache.put(prompt:, response:, embedding:, tenant:, locale:, model_version:)`. The same embedding the lookup used is reused — no re-encode. 
The helper writes the Hash with [`HSET`]({{< relref "/commands/hset" >}}) and an [`EXPIRE`]({{< relref "/commands/expire" >}}) TTL inside a single [`MULTI/EXEC`]({{< relref "/commands/multi" >}}) so the entry never lands without a TTL on a partial failure. +3. Returns the LLM's response to the user. The next semantically similar prompt under the same metadata scope will be a hit. + +## The cache helper + +The `RedisSemanticCache` class wraps the Redis Search index and the lookup / write flow +([source](https://github.com/redis/docs/blob/main/content/develop/use-cases/semantic-cache/ruby/lib/cache.rb)): + +```ruby +require 'redis' +require_relative 'lib/cache' +require_relative 'lib/embeddings' + +client = Redis.new(host: 'localhost', port: 6379) + +cache = SemCache::RedisSemanticCache.new( + redis_client: client, + index_name: 'semcache:idx', + distance_threshold: 0.5, # cosine distance, lower = stricter + default_ttl_seconds: 3600 # one hour +) +embedder = SemCache::LocalEmbedder.new # sentence-transformers/all-MiniLM-L6-v2 + +# One-time index setup (idempotent). +cache.create_index + +# 1) Embed the prompt. +prompt = 'How do I return an item?' +query_vec = embedder.encode_one(prompt) + +# 2) Look up under a metadata scope. The TAG filter and the KNN +# travel together in one FT.SEARCH. +result = cache.lookup( + query_vec, + tenant: 'acme', + locale: 'en', + model_version: 'gpt-4.5-2026' +) + +if result.is_a?(SemCache::CacheHit) + puts "hit (#{format('%.3f', result.distance)}): #{result.response}" +else + # 3a) Miss — call the LLM. (Use your real client here.) + response = call_llm(prompt) + + # 3b) Cache the new entry. Reuses the same embedding bytes the + # lookup used, so we don't pay the encoder twice. + cache.put( + prompt: prompt, + response: response, + embedding: query_vec, + tenant: 'acme', + locale: 'en', + model_version: 'gpt-4.5-2026' + ) +end +``` + +### Data model + +Each cache entry is one Redis Hash. 
The vector field is raw little-endian `float32` bytes — no JSON wrapping — because the Redis Search vector encoding expects exactly that. The helper packs the `Array` with Ruby's [`Array#pack`](https://docs.ruby-lang.org/en/master/packed_data_rdoc.html) directive `'e*'`, which is little-endian single-precision float; the resulting `String` is ASCII-8BIT (binary) so `redis-rb` ships the exact bytes without any UTF-8 transcoding. + +```text +cache:7c3f8a1b9e02 + prompt=How do I return an item? + response=You can return any unworn item within 30 days... + tenant=acme + locale=en + model_version=gpt-4.5-2026 + safety=ok + created_ts=1715990400.123 + hit_count=4 + embedding=<384 × float32 little-endian bytes> +``` + +The Redis Search index schema treats every field as queryable in its natural type: + +```text +FT.CREATE semcache:idx + ON HASH PREFIX 1 cache: + SCHEMA + prompt TEXT + response TEXT + tenant TAG + locale TAG + model_version TAG + safety TAG + created_ts NUMERIC SORTABLE + hit_count NUMERIC SORTABLE + embedding VECTOR HNSW 6 TYPE FLOAT32 DIM 384 DISTANCE_METRIC COSINE +``` + +### The query + +The lookup is a hybrid query: a TAG pre-filter expression in parentheses, then `=>[KNN 1 @embedding $vec]`. With `DIALECT 2`, Redis applies the filter first and KNN-ranks only the matching documents. `redis-rb` doesn't ship dedicated `FT.*` bindings — it exposes a generic `#call` method that lets you send any command, so the helper builds the argument list directly: + +```ruby +args = [ + 'FT.SEARCH', 'semcache:idx', + '(@tenant:{acme} @locale:{en} @model_version:{gpt\-4\.5\-2026} @safety:{ok})' \ + '=>[KNN 1 @embedding $vec AS distance]', + 'PARAMS', '2', 'vec', query_vec.pack('e*'), + 'SORTBY', 'distance', 'ASC', + 'RETURN', '7', + 'prompt', 'response', 'tenant', 'locale', + 'model_version', 'hit_count', 'distance', + 'LIMIT', '0', '1', + 'DIALECT', '2' +] +reply = client.call(*args) +``` + +`distance` is the cosine *distance* (0 means identical, 2 means opposite). 
The result is sorted ascending, so the top row is the closest candidate. The application inspects `distance` against the threshold and decides hit or miss in user code — Redis returns the row either way, and treating it as a hit or a miss is a policy decision the cache helper owns, not a server-side filter. + +`FT.SEARCH` returns a flat array of the form `[total, key1, [field1, value1, ...], key2, [...], ...]`, which the helper parses into per-document hashes. Binary fields such as `embedding` come back as ASCII-8BIT Strings, so code that needs the raw bytes does not have to re-encode them. + +## The mock LLM + +To make the latency and token savings visible without requiring an API key, `mock_llm.rb` provides a deterministic stand-in +([source](https://github.com/redis/docs/blob/main/content/develop/use-cases/semantic-cache/ruby/lib/mock_llm.rb)): + +```ruby +require_relative 'lib/mock_llm' + +llm = SemCache::MockLLM.new(latency_ms: 1500.0) +response = llm.complete('What is your return policy?') +# response.response — the templated answer text +# response.latency_ms — wall-clock time the call took +# response.total_tokens — estimated prompt + completion tokens +``` + +The mock sleeps for the configured latency, then keyword-matches against a small FAQ table to produce an answer. The deliberate slowness is what makes a hit visibly cheaper than a miss in the demo. In production code, you would replace `MockLLM` with your real client of choice — `openai`, `anthropic`, an internal vLLM endpoint, anything — without changing the cache helper. + +## Pre-seeding the cache + +In a real deployment the cache fills up organically: a first-time question is a miss, the LLM answers, and the response is written back.
For the demo, `seed_cache.rb` pre-loads a small set of canonical FAQ prompts so the very first query lands on a hit +([source](https://github.com/redis/docs/blob/main/content/develop/use-cases/semantic-cache/ruby/lib/seed_cache.rb)): + +```ruby +cache.create_index +SemCache::SeedCache.seed(cache, embedder, tenant: 'acme', locale: 'en') +``` + +The seed list stores the canonical phrasing of each question ("What is your return policy?"). Paraphrases of any of these prompts ("How do I return an item?", "Can I get a refund?") embed close to the canonical entry, so the cache lookup serves the stored response without ever calling the model. The seed phase batches the encoder calls into a single `encode_many` so the model dispatch is paid once across the whole seed list. + +## The interactive demo + +`demo_server.rb` runs a [`WEBrick`](https://github.com/ruby/webrick) HTTP server. The HTML page lets you: + +* Type a prompt and toggle metadata: tenant, locale, model version. Each combination is a separate cache namespace inside the same index. +* Slide the cosine-distance threshold and see hits flip to misses (and back) on the same prompt, with the actual distance reported on each query. +* Submit with **Ask** to run the full hit-or-miss path (calls the LLM on a miss, writes the answer back). Submit with **Lookup only (no LLM)** to sweep the threshold against a fixed prompt without polluting the cache. +* Watch the cumulative panel build up: total queries, cache hits, cache misses, hit ratio, tokens not spent, LLM milliseconds not waited. +* Inspect every cached entry, including remaining TTL and total hit count, and drop individual entries to simulate eviction. + +The server holds one `LocalEmbedder`, one `RedisSemanticCache`, and one `MockLLM` for the lifetime of the process. The HTML page is shared with the Python, Node.js, Go, and Jedis demos and is loaded from `index.html` next to `demo_server.rb`. 
Endpoints: + +| Endpoint | What it does | +|-----------------|-------------------------------------------------------------------------------| +| `GET /state` | Index info and the full list of cached entries. | +| `POST /query` | Embed the prompt, run `FT.SEARCH`, on miss call the LLM and write back. | +| `POST /reset` | Drop every cached entry and re-seed from the FAQ list. | +| `POST /drop` | Delete a single cached entry by id. | + +## Run the demo locally + +1. Clone the [`redis/docs`](https://github.com/redis/docs) repository and change into the example + directory: + + ```bash + git clone https://github.com/redis/docs.git + cd docs/content/develop/use-cases/semantic-cache/ruby + ``` + +2. Make sure you have Ruby 3.2 or newer and [Bundler](https://bundler.io/), then install the gems: + + ```bash + bundle install + ``` + + `informers` pulls in [`onnxruntime`](https://rubygems.org/gems/onnxruntime) (which ships the ONNX Runtime shared library as a native extension) and [`tokenizers`](https://rubygems.org/gems/tokenizers) (a Hugging Face fast-tokenizer Rust binding). Both come as pre-built binary gems for `arm64-darwin`, `x86_64-darwin`, `aarch64-linux`, and `x86_64-linux`, so there is no system ONNX install step on those platforms. + +3. Make sure a Redis instance with the Redis Search module is running locally on + port 6379. Either [Redis Stack]({{< relref "/operate/oss_and_stack/install/install-stack" >}}) or + [Redis 8 with Search]({{< relref "/develop/ai/search-and-query" >}}) works. + +4. Start the demo server. The first run downloads the ONNX-exported + `sentence-transformers/all-MiniLM-L6-v2` model into the local Hugging Face + cache (~22 MB): + + ```bash + bundle exec ruby demo_server.rb + ``` + +5. Open <http://localhost:8094> and try some queries: + + * **"What is your return policy?"** — exact match against the seed, distance ≈ 0, + hit at any threshold.
+ * **"How fast is delivery?"** — paraphrase of the shipping seed; distance + around 0.30, hit at the default threshold of 0.5. + * **"How do I return an item?"** — slightly looser paraphrase of the returns + seed; distance around 0.49, still a hit at the default threshold. Slide + the threshold down to 0.4 to see this one flip to a miss. + * **"What payment methods do you accept?"** — unrelated to anything in the + seed; distance > 0.6, so at the default threshold you'll see a miss, the + mock LLM kicks in for ~1.5 s, the new answer is cached, and a follow-up + of the same question is now an immediate hit. + * Switch the **Tenant** dropdown to `globex` or `initech` and re-ask any + seeded question — the result flips to a miss because the cache entries + live under `acme`. That's the metadata pre-filter at work inside `FT.SEARCH`. + +The server is read/write against your local Redis. The default index name is `semcache:idx` and entry keys live under `cache:`. Flags mirror the Python, Node.js, Go, and Jedis demos: `--no-reset` to keep an existing cache across restarts, `--threshold` to change the default cosine-distance cutoff, `--llm-latency-ms` to make the mock LLM faster or slower for the demo, or `--port` to listen on a different port. diff --git a/content/develop/use-cases/semantic-cache/ruby/demo_server.rb b/content/develop/use-cases/semantic-cache/ruby/demo_server.rb new file mode 100644 index 0000000000..f75b88d7fc --- /dev/null +++ b/content/develop/use-cases/semantic-cache/ruby/demo_server.rb @@ -0,0 +1,468 @@ +#!/usr/bin/env ruby +# frozen_string_literal: true + +# Redis semantic-cache demo server (Ruby). +# +# Run this file and visit http://localhost:8094 to drive a small +# semantic-cache demo backed by Redis Search. The UI lets you: +# +# * Type a natural-language prompt and watch the cache decide hit or +# miss. 
On a hit Redis returns the cached response in tens of +# milliseconds and the demo LLM is not called at all; on a miss +# the demo LLM "thinks" for ~1.5 s before answering and the new +# prompt, response, and embedding are written back to Redis for +# next time. +# * Adjust the cosine-distance threshold to see how close a +# paraphrase must be for the cache to serve it. +# * Switch tenant, locale, or model version to see metadata isolation +# in action — entries written under one tenant cannot be served to +# another, because the TAG filter goes into the same `FT.SEARCH` +# call as the KNN. +# * Inspect every cached entry with TTL and hit count, and drop +# individual entries to simulate eviction. +# +# The server holds a single `LocalEmbedder`, a single +# `RedisSemanticCache`, and a single `MockLLM` for the lifetime of the +# process. The first run downloads the embedding model into the local +# Hugging Face cache; everything after is local. + +require 'json' +require 'optparse' +require 'webrick' +require 'cgi' + +require 'redis' + +$LOAD_PATH.unshift(File.expand_path('lib', __dir__)) +require 'cache' +require 'embeddings' +require 'mock_llm' +require 'seed_cache' + +module SemCache + # SemanticCacheDemo owns the cache, embedder, and LLM for the + # lifetime of the process. The handlers thread requests through + # `run_query` and the seed / reset endpoints reuse `seed` so there + # is only one description of the cache lifecycle. + class SemanticCacheDemo + attr_reader :cache, :embedder, :llm, + :default_tenant, :default_locale + + def initialize(cache:, embedder:, llm:, + default_tenant: 'acme', default_locale: 'en') + @cache = cache + @embedder = embedder + @llm = llm + @default_tenant = default_tenant + @default_locale = default_locale + end + + # Drop everything in scope and pre-populate with FAQ entries. 
+ def seed + @cache.clear + SeedCache.seed( + @cache, @embedder, + tenant: @default_tenant, + locale: @default_locale, + model_version: @llm.model_version + ) + end + + # The hot path: embed, look up, optionally call the LLM, cache. + # + # Timings are taken with Process::CLOCK_MONOTONIC around each + # bounded step so the UI can display the embed / lookup / LLM + # breakdown separately. The cache write on a miss is *not* + # included in `total_ms` so the latency number reflects the + # user-facing wait, not the background bookkeeping. + def run_query(prompt:, tenant:, locale:, model_version:, + threshold:, lookup_only:) + t0 = monotonic_ms + query_vec = @embedder.encode_one(prompt) + embed_ms = monotonic_ms - t0 + + t1 = monotonic_ms + result = @cache.lookup( + query_vec, + tenant: tenant, locale: locale, model_version: model_version, + distance_threshold: threshold + ) + lookup_ms = monotonic_ms - t1 + + if result.is_a?(CacheHit) + return { + outcome: 'hit', + response: result.response, + entry_id: result.id, + distance: result.distance, + ttl_seconds: result.ttl_seconds, + hit_count: result.hit_count, + threshold: threshold, + embed_ms: embed_ms, + lookup_ms: lookup_ms, + llm_ms: nil, + total_ms: embed_ms + lookup_ms, + tokens_avoided: estimate_response_tokens(result.prompt, result.response), + ms_avoided: @llm.latency_ms + } + end + + # Miss path. In "lookup only" mode the demo reports the miss + # without actually calling the LLM — useful for sweeping the + # threshold against a fixed prompt to see where the cutoff would + # fall without polluting the cache. 
+ if lookup_only + return { + outcome: 'miss', + response: '(LLM not called in lookup-only mode)', + nearest_distance: result.nearest_distance, + threshold: threshold, + wrote_entry_id: nil, + embed_ms: embed_ms, + lookup_ms: lookup_ms, + llm_ms: nil, + total_ms: embed_ms + lookup_ms + } + end + + t2 = monotonic_ms + llm_response = @llm.complete(prompt) + llm_ms = monotonic_ms - t2 + + # Write the new entry back. The embedding is the same vector we + # already used for the lookup — no need to re-encode. + entry_id = @cache.put( + prompt: prompt, + response: llm_response.response, + embedding: query_vec, + tenant: tenant, locale: locale, model_version: model_version + ) + + { + outcome: 'miss', + response: llm_response.response, + nearest_distance: result.nearest_distance, + threshold: threshold, + wrote_entry_id: entry_id, + embed_ms: embed_ms, + lookup_ms: lookup_ms, + llm_ms: llm_ms, + total_ms: embed_ms + lookup_ms + llm_ms + } + end + + private + + def monotonic_ms + Process.clock_gettime(Process::CLOCK_MONOTONIC) * 1000.0 + end + + def estimate_response_tokens(prompt, response) + [((prompt.to_s.length + response.to_s.length) / 4), 1].max + end + end + + # ---------------------------------------------------------------- + # HTTP plumbing + # ---------------------------------------------------------------- + + # Cap POST bodies so a runaway client (or, more realistically, a + # `curl --data-binary @big-file` by mistake) cannot accumulate + # unbounded memory before the handler runs. WEBrick has no + # built-in cap, so each POST handler calls `body_too_large?` + # before touching `req.body` and returns 413 if the request's + # `Content-Length` exceeds the limit. The demo's largest + # legitimate body is a few hundred bytes of form-encoded query + # fields; 1 MiB matches the Node / Go / Java caps. + MAX_BODY_BYTES = 1 * 1024 * 1024 + + # Check whether the request's Content-Length exceeds MAX_BODY_BYTES. 
+ # WEBrick fully buffers `req.body` before the handler runs, so + # checking here only avoids unbounded *handler* work — the wire + # bytes already arrived. For a docs demo bound to loopback that + # is acceptable; a hardened deployment would put a reverse proxy + # in front of the server with its own request-size limit. + def self.body_too_large?(req) + length = req['Content-Length'].to_i + length > MAX_BODY_BYTES + end + + # Sanitise the threshold parameter from the form body. + # `Float(raw, exception: false)` returns nil for unparseable input, + # including the literal strings "nan" and "inf", but an out-of-range + # literal such as "1e999" still parses to +Infinity. A non-finite + # threshold would silently turn the lookup into a permanent hit + # (`distance > Infinity` can never reject) or a permanent miss. + # Fall back to 0.5 for those cases and clamp everything else to the + # meaningful cosine-distance range so a malformed POST cannot + # override the threshold semantics. + def self.clamp_threshold(raw) + parsed = Float(raw, exception: false) + return 0.5 if parsed.nil? || !parsed.finite? + [[parsed, 0.0].max, 2.0].min + end + + # Build the response shape /state serves. The Python / Node / Go / + # Jedis siblings serve the same shape so the shared HTML works + # without modification. `default_threshold` is what the + # `--threshold` flag actually configures; the UI slider initialises + # to this on first load so the flag visibly changes the demo's + # behaviour. `stack_label` lets the same HTML render a per-language + # badge (redis-py, node-redis, redis-rb, …) without forking the + # file per language. + def self.build_state(cache, embedder, llm, stack_label) + info = cache.index_info + { + index: { + num_docs: info[:num_docs], + index_name: cache.index_name, + indexing_failures: info[:indexing_failures], + vector_index_size_mb: info[:vector_index_size_mb], + model: embedder.model_name, + mock_llm_latency_ms: llm.latency_ms, + default_threshold: cache.distance_threshold, + stack_label: stack_label + }, + entries: cache.list_entries(limit: 200) + } + end + + # Parse a URL-encoded form body into a plain Hash.
+ # `URI.decode_www_form` returns an Array of pairs; we keep only the + # last value for a repeated key, which matches the Python / Node + # demos' behaviour. + def self.parse_form(body) + pairs = URI.decode_www_form(body.to_s) + pairs.to_h + rescue ArgumentError + {} + end + + # Wrap every handler so an uncaught exception lands as a JSON 500 + # rather than letting WEBrick render a plain-text stack trace. The + # demo's JS client always calls `await res.json()`, so a non-JSON + # body would surface as an opaque parse error. + def self.with_json_errors(response) + yield + rescue StandardError => e + warn("[demo] handler error: #{e.class}: #{e.message}") + warn(e.backtrace.first(8).join("\n")) + response.status = 500 + response['Content-Type'] = 'application/json' + response.body = JSON.generate(error: e.message, type: e.class.name) + end + + def self.send_json(response, payload, status: 200) + response.status = status + response['Content-Type'] = 'application/json' + response.body = JSON.generate(payload) + end + + def self.send_html(response, html, status: 200) + response.status = status + response['Content-Type'] = 'text/html; charset=utf-8' + response.body = html + end + + # ---------------------------------------------------------------- + # Handlers + # ---------------------------------------------------------------- + + def self.install_handlers(server, deps) + cache = deps.fetch(:cache) + embedder = deps.fetch(:embedder) + llm = deps.fetch(:llm) + demo = deps.fetch(:demo) + html_page = deps.fetch(:html_page) + stack_label = deps.fetch(:stack_label) + + server.mount_proc '/' do |req, res| + with_json_errors(res) do + if req.path != '/' && req.path != '/index.html' + send_json(res, { error: 'not found' }, status: 404) + next + end + if req.request_method != 'GET' + send_json(res, { error: 'method not allowed' }, status: 405) + next + end + send_html(res, html_page) + end + end + + server.mount_proc '/state' do |req, res| + with_json_errors(res) do + if 
req.request_method != 'GET' + send_json(res, { error: 'method not allowed' }, status: 405) + next + end + send_json(res, build_state(cache, embedder, llm, stack_label)) + end + end + + server.mount_proc '/query' do |req, res| + with_json_errors(res) do + if req.request_method != 'POST' + send_json(res, { error: 'method not allowed' }, status: 405) + next + end + if body_too_large?(req) + send_json(res, { error: "request body exceeds #{MAX_BODY_BYTES} bytes" }, status: 413) + next + end + params = parse_form(req.body) + prompt = (params['prompt'] || '').strip + if prompt.empty? + send_json(res, { error: 'prompt is required' }, status: 400) + next + end + payload = demo.run_query( + prompt: prompt, + tenant: empty_or(params['tenant'], 'acme'), + locale: empty_or(params['locale'], 'en'), + model_version: empty_or(params['model_version'], llm.model_version), + threshold: clamp_threshold(params['threshold'] || '0.5'), + lookup_only: !(params['lookup_only'].nil? || params['lookup_only'].empty?) + ) + send_json(res, payload) + end + end + + server.mount_proc '/reset' do |req, res| + with_json_errors(res) do + if req.request_method != 'POST' + send_json(res, { error: 'method not allowed' }, status: 405) + next + end + demo.seed + send_json(res, { ok: true }) + end + end + + server.mount_proc '/drop' do |req, res| + with_json_errors(res) do + if req.request_method != 'POST' + send_json(res, { error: 'method not allowed' }, status: 405) + next + end + if body_too_large?(req) + send_json(res, { error: "request body exceeds #{MAX_BODY_BYTES} bytes" }, status: 413) + next + end + params = parse_form(req.body) + entry_id = (params['entry_id'] || '').strip + if entry_id.empty? + send_json(res, { error: 'entry_id is required' }, status: 400) + next + end + deleted = cache.delete_entry(entry_id) + send_json(res, { deleted: deleted, entry_id: entry_id }) + end + end + end + + def self.empty_or(value, default) + value.nil? || value.empty? ? 
default : value + end + + # ---------------------------------------------------------------- + # Main + # ---------------------------------------------------------------- + + def self.parse_flags(argv) + options = { + host: '127.0.0.1', + port: 8094, + redis_host: 'localhost', + redis_port: 6379, + index_name: 'semcache:idx', + key_prefix: 'cache:', + ttl_seconds: 3600, + threshold: 0.5, + llm_latency_ms: 1500.0, + no_reset: false + } + OptionParser.new do |opts| + opts.banner = 'Usage: ruby demo_server.rb [options]' + opts.on('--host HOST', 'Interface to bind to') { |v| options[:host] = v } + opts.on('--port PORT', Integer, 'HTTP port for the UI') { |v| options[:port] = v } + opts.on('--redis-host HOST', 'Redis host') { |v| options[:redis_host] = v } + opts.on('--redis-port PORT', Integer, 'Redis port') { |v| options[:redis_port] = v } + opts.on('--index-name NAME', 'Redis Search index name') { |v| options[:index_name] = v } + opts.on('--key-prefix PREFIX', 'Key prefix for cache entries') { |v| options[:key_prefix] = v } + opts.on('--ttl-seconds N', Integer, 'TTL on every entry') { |v| options[:ttl_seconds] = v } + opts.on('--threshold F', Float, 'Default cosine-distance threshold') { |v| options[:threshold] = v } + opts.on('--llm-latency-ms F', Float, 'Simulated mock LLM latency in ms') { |v| options[:llm_latency_ms] = v } + opts.on('--no-reset', 'Skip the cache reset + seed on startup') { options[:no_reset] = true } + end.parse!(argv) + options + end + + def self.run!(argv = ARGV) + args = parse_flags(argv) + + client = Redis.new(host: args[:redis_host], port: args[:redis_port]) + begin + client.ping + rescue StandardError => e + warn("Error: cannot reach Redis at #{args[:redis_host]}:#{args[:redis_port]}") + warn(" (#{e.message})") + exit 1 + end + + cache = RedisSemanticCache.new( + redis_client: client, + index_name: args[:index_name], + key_prefix: args[:key_prefix], + distance_threshold: args[:threshold], + default_ttl_seconds: args[:ttl_seconds] + ) + 
cache.create_index + + puts 'Loading embedding model (first run downloads the ONNX weights)...' + embedder = LocalEmbedder.new + llm = MockLLM.new(latency_ms: args[:llm_latency_ms]) + + demo = SemanticCacheDemo.new(cache: cache, embedder: embedder, llm: llm) + unless args[:no_reset] + puts "Dropping any existing cache under '#{args[:key_prefix]}*' and " \ + 're-seeding from the FAQ list (pass --no-reset to keep).' + seeded = demo.seed + puts "Seeded #{seeded} entries." + end + + # Load the HTML once and replace the template tokens with the + # configured index name and key prefix so the docs panel shows + # the actual values in use rather than the default copies. + raw_html = File.read(File.expand_path('index.html', __dir__)) + html_page = raw_html + .gsub('__INDEX_NAME__', args[:index_name]) + .gsub('__KEY_PREFIX__', args[:key_prefix]) + + stack_label = 'redis-rb + informers + WEBrick' + + # WEBrick: turn down access logging so the console isn't a flood + # of GET / lines while the demo is running. (WEBrick has no + # built-in `MaxRequestBodySize` knob; each POST handler enforces + # the 1 MiB cap explicitly via `body_too_large?`.) + server = WEBrick::HTTPServer.new( + BindAddress: args[:host], + Port: args[:port], + Logger: WEBrick::Log.new($stderr, WEBrick::Log::WARN), + AccessLog: [] + ) + + install_handlers(server, { + cache: cache, embedder: embedder, llm: llm, + demo: demo, html_page: html_page, stack_label: stack_label + }) + + trap('INT') { server.shutdown } + trap('TERM') { server.shutdown } + + puts "Redis semantic cache demo listening on http://#{args[:host]}:#{args[:port]}" + puts "Using Redis at #{args[:redis_host]}:#{args[:redis_port]} with index '#{args[:index_name]}'" + server.start + ensure + client&.close + end +end + +SemCache.run! 
if $PROGRAM_NAME == __FILE__ diff --git a/content/develop/use-cases/semantic-cache/ruby/index.html b/content/develop/use-cases/semantic-cache/ruby/index.html new file mode 100644 index 0000000000..e897cfdee7 --- /dev/null +++ b/content/develop/use-cases/semantic-cache/ruby/index.html @@ -0,0 +1,513 @@ + + + + + + Redis Semantic Cache Demo + + + +
+
loading…
+

Redis Semantic Cache Demo

+

+ A small semantic cache sits in front of a mock LLM. Each cache + entry is a Hash at __KEY_PREFIX__<id> holding + the prompt, the response, the prompt's 384-dimensional embedding, + and metadata fields. A single FT.SEARCH on + __INDEX_NAME__ does the KNN against cached prompts + with a TAG pre-filter (tenant, locale, model version, safety) in + the same round trip. If the closest cached prompt is within the + cosine-distance threshold, the demo serves the cached response + and the LLM is not called at all. +

+ +
+ +
+

Ask the LLM

+

Type a question, optionally adjust the metadata filters and + the distance threshold, and submit. The server embeds the + prompt, runs FT.SEARCH with KNN over the cache, + and either serves the cached response (hit) or runs the mock + LLM and writes the new response back to the cache (miss).

+ + +
+
+ + +
+
+ + +
+
+ + +
+
+
+ + + 0.50 +
+

+ The cache serves a hit when the closest cached prompt's + cosine distance is at or below this threshold. Lower = + stricter (fewer hits, safer reuse); higher = looser (more + hits, more risk of serving a near-miss). +

+ + + + + +
+
+ +
+

Cumulative savings

+

Every hit avoids one LLM round trip. The numbers below add + up across the session — tokens that would have been spent and + wall-clock seconds that would have been waited if the cache + had not served the answer.

+
+
+
0
+
Total queries
+
+
+
0
+
Cache hits
+
+
+
0
+
Cache misses
+
+
+
0%
+
Hit ratio
+
+
+
0
+
Tokens saved
+
+
+
0 ms
+
LLM time saved
+
+
+
+ +
+

Index state

+
+ +
+ +
+

Cached entries

+

Every prompt/response pair currently in the cache. + hit_count is the running total of times the entry + has served a hit; ttl is the remaining lifetime + in seconds before EXPIRE drops the key. Click + Drop to simulate eviction.

+ + + + + + + + + + + + +
IDPromptMetadataHitsTTL
+
+ +
+ +
+
+ + + + diff --git a/content/develop/use-cases/semantic-cache/ruby/lib/cache.rb b/content/develop/use-cases/semantic-cache/ruby/lib/cache.rb new file mode 100644 index 0000000000..b4e9535e7c --- /dev/null +++ b/content/develop/use-cases/semantic-cache/ruby/lib/cache.rb @@ -0,0 +1,435 @@ +# Redis semantic-cache helper backed by Redis Search (Ruby). +# +# Each cache entry lives as a Hash document at `cache:<id>`. The hash +# stores the user's prompt and the corresponding LLM response alongside +# the raw float32 bytes of the prompt's 384-dimensional embedding and a +# small set of metadata fields — tenant, locale, model version, and a +# safety flag. +# +# A single Redis Search index covers the embedding plus every metadata +# field, so one `FT.SEARCH` call does an approximate-nearest-neighbour +# lookup against the cached prompts with a TAG pre-filter applied in +# the same pass — no cross-store joins, no extra round trips, and +# tenant isolation is enforced *inside* the query rather than after +# the fact in application code. +# +# The lookup is thresholded: `FT.SEARCH` always returns the closest +# cached prompt, but the cache only serves it as a hit when the cosine +# distance is at or below `distance_threshold`. Anything further away +# is treated as a miss; the caller is expected to run the underlying +# LLM and write the new prompt, response, and embedding back with +# `put`. +# +# Each cache entry is written with `EXPIRE`, so stale answers age out +# without manual cleanup; setting an eviction policy on the database +# (`allkeys-lfu` is the common choice) caps memory under pressure. +# This helper assumes the `redis-rb` client default — String I/O — and +# packs the binary embedding bytes into a String with ASCII-8BIT +# encoding so the protocol writer transmits the exact bytes without +# any UTF-8 transcoding.
+ +require 'redis' +require 'securerandom' +require 'set' + +module SemCache + VECTOR_DIM_DEFAULT = 384 + + # A cache lookup that returned a cached response. `distance` is the + # cosine distance `FT.SEARCH` reported for the nearest cached + # prompt (0 = identical, 2 = opposite). It is always at or below + # the threshold the lookup was run with. + CacheHit = Struct.new( + :id, :prompt, :response, :tenant, :locale, :model_version, + :distance, :ttl_seconds, :hit_count, keyword_init: true + ) do + def to_h + { + id: id, + prompt: prompt, + response: response, + tenant: tenant, + locale: locale, + model_version: model_version, + distance: distance.round(4), + ttl_seconds: ttl_seconds, + hit_count: hit_count + } + end + + def hit? + true + end + end + + # A cache lookup that did not return a usable response. + # `nearest_distance` is the cosine distance to the closest cached + # prompt that *did* match the metadata filters. It is `nil` if the + # cache had no entry in scope at all, which is what the demo UI + # shows as "no candidate" vs. "candidate too far". + CacheMiss = Struct.new( + :nearest_distance, :nearest_id, keyword_init: true + ) do + def to_h + { + nearest_distance: nearest_distance ? nearest_distance.round(4) : nil, + nearest_id: nearest_id + } + end + + def hit? + false + end + end + + class RedisSemanticCache + # Characters Redis Search treats as syntax inside a TAG value; any + # of them in a user-supplied filter must be backslash-escaped or + # the surrounding `{...}` block won't parse correctly. 
+ TAG_SPECIAL = Set.new("\\,.<>{}[]\"':;!@#$%^&*()-+=~| ".chars).freeze + + attr_reader :redis, :index_name, :key_prefix, :vector_dim, + :distance_threshold, :default_ttl_seconds + + def initialize(redis_client: nil, + index_name: 'semcache:idx', + key_prefix: 'cache:', + vector_dim: VECTOR_DIM_DEFAULT, + distance_threshold: 0.5, + default_ttl_seconds: 3600) + @redis = redis_client || Redis.new(host: 'localhost', port: 6379) + @index_name = index_name + @key_prefix = key_prefix + @vector_dim = vector_dim + @distance_threshold = distance_threshold + @default_ttl_seconds = default_ttl_seconds + end + + # ---------------------------------------------------------------- + # Keys + # ---------------------------------------------------------------- + + def entry_key(entry_id) + "#{@key_prefix}#{entry_id}" + end + + # ---------------------------------------------------------------- + # Index management + # ---------------------------------------------------------------- + + # Create the Redis Search index if it doesn't already exist. One + # index covers the embedding plus every metadata field, so a + # single `FT.SEARCH` can pre-filter by tenant / locale / model and + # then KNN-rank the matching documents in one pass. The `prompt` + # and `response` fields are stored as TEXT so the admin tooling + # can grep the cache by content, but the cache lookup itself is + # vector-only. + def create_index + args = [ + 'FT.CREATE', @index_name, + 'ON', 'HASH', + 'PREFIX', '1', @key_prefix, + 'SCHEMA', + 'prompt', 'TEXT', + 'response', 'TEXT', + 'tenant', 'TAG', + 'locale', 'TAG', + 'model_version', 'TAG', + 'safety', 'TAG', + 'created_ts', 'NUMERIC', 'SORTABLE', + 'hit_count', 'NUMERIC', 'SORTABLE', + 'embedding', 'VECTOR', 'HNSW', '6', + 'TYPE', 'FLOAT32', + 'DIM', @vector_dim.to_s, + 'DISTANCE_METRIC', 'COSINE' + ] + @redis.call(*args) + rescue Redis::CommandError => e + raise unless e.message.include?('Index already exists') + end + + # Drop the search index. 
Optionally also delete cached entries. + def drop_index(delete_documents: false) + args = ['FT.DROPINDEX', @index_name] + args << 'DD' if delete_documents + @redis.call(*args) + rescue Redis::CommandError => e + message = e.message.downcase + raise unless message.include?('no such index') || message.include?('unknown index name') + end + + # ---------------------------------------------------------------- + # Lookup + # ---------------------------------------------------------------- + + # Find the nearest in-scope cached prompt and decide hit / miss. + # + # `FT.SEARCH` returns the single nearest entry that satisfies the + # TAG pre-filters. The lookup is a hit only if the reported cosine + # distance is at or below `distance_threshold` (or the instance + # default). Anything further away is a miss with the candidate + # distance attached so the caller can log it. + # + # On a hit, the entry's `hit_count` is incremented atomically with + # `HINCRBY` so the demo UI can show which entries are + # load-bearing. The TTL is refreshed on every hit so frequently + # used answers don't age out under cold tail entries. 
+ def lookup(query_vec, tenant: nil, locale: nil, model_version: nil, + safety: 'ok', distance_threshold: nil) + validate_dim!(query_vec, 'query_vec') + + threshold = distance_threshold || @distance_threshold + + filter_clause = self.class.build_filter_clause( + tenant: tenant, locale: locale, + model_version: model_version, safety: safety + ) + query_str = "#{filter_clause}=>[KNN 1 @embedding $vec AS distance]" + vec_bytes = LocalEmbedder.to_bytes(query_vec) + + args = [ + 'FT.SEARCH', @index_name, query_str, + 'PARAMS', '2', 'vec', vec_bytes, + 'SORTBY', 'distance', 'ASC', + 'RETURN', '7', + 'prompt', 'response', 'tenant', 'locale', + 'model_version', 'hit_count', 'distance', + 'LIMIT', '0', '1', + 'DIALECT', '2' + ] + result = @redis.call(*args) + docs = parse_search_result(result) + + return CacheMiss.new(nearest_distance: nil, nearest_id: nil) if docs.empty? + + doc = docs.first + raw_key = doc[:_key] + entry_id = raw_key.start_with?(@key_prefix) ? raw_key[@key_prefix.length..] : raw_key + distance = doc[:distance].to_f + + if distance > threshold + return CacheMiss.new(nearest_distance: distance, nearest_id: entry_id) + end + + # The hash may have expired between FT.SEARCH returning the row + # and us getting here — the search index lags expirations by its + # periodic scan. If we just blindly HINCRBY-ed, Redis would + # helpfully recreate the hash with only `hit_count` set and the + # search index would then log it as an indexing failure (no + # embedding, no metadata). EXISTS narrows that race to the + # pipeline round-trip; a strictly race-free version would wrap + # the bump in a Lua script that checks existence and acts in + # one server-side step. 
+ ek = entry_key(entry_id) + unless @redis.exists?(ek) + return CacheMiss.new(nearest_distance: distance, nearest_id: entry_id) + end + + # MULTI/EXEC the three writes so they apply as a unit on the + # server — a partial failure between HINCRBY and EXPIRE would + # otherwise leave the entry without a refreshed TTL. + replies = @redis.multi do |m| + m.hincrby(ek, 'hit_count', 1) + m.expire(ek, @default_ttl_seconds) + m.ttl(ek) + end + new_hit_count, _expired, ttl = replies + + CacheHit.new( + id: entry_id, + prompt: doc[:prompt] || '', + response: doc[:response] || '', + tenant: doc[:tenant] || '', + locale: doc[:locale] || '', + model_version: doc[:model_version] || '', + distance: distance, + ttl_seconds: ttl && ttl.positive? ? ttl.to_i : @default_ttl_seconds, + hit_count: new_hit_count.to_i + ) + end + + # ---------------------------------------------------------------- + # Write + # ---------------------------------------------------------------- + + # Write a new cache entry and return its id. + # + # The embedding is stored as raw little-endian float32 bytes — the + # encoding Redis Search expects from a FLOAT32 vector field. + # `EXPIRE` on the key gives every entry a bounded lifetime; + # combine with an `allkeys-lfu` eviction policy on the database + # to cap memory under pressure too. + def put(prompt:, response:, embedding:, + tenant: 'default', locale: 'en', + model_version: 'gpt-4.5-2026', safety: 'ok', + ttl_seconds: nil, entry_id: nil) + validate_dim!(embedding, 'embedding') + + id = entry_id || SecureRandom.hex(6) # 12 hex chars, matches sibling demos + key = entry_key(id) + ttl = ttl_seconds || @default_ttl_seconds + vec_bytes = LocalEmbedder.to_bytes(embedding) + + # MULTI/EXEC so HSET and EXPIRE either both apply or neither + # does. 
Without the transaction wrapper a connection drop + # between the two writes could leave the entry without a TTL + # and the cache would then keep an answer past its intended + # lifetime (or forever, on a database with no eviction policy). + # + # `redis-rb`'s HSET takes a flat list of field/value pairs or a + # Hash. We pass the flat list so the binary `vec_bytes` is sent + # as one argument without any String-to-Hash key coercion + # touching it. + @redis.multi do |m| + m.hset(key, + 'prompt', prompt, + 'response', response, + 'tenant', tenant, + 'locale', locale, + 'model_version', model_version, + 'safety', safety, + 'created_ts', Time.now.to_f.to_s, + 'hit_count', '0', + 'embedding', vec_bytes) + m.expire(key, ttl) + end + id + end + + # ---------------------------------------------------------------- + # Filter clause + # ---------------------------------------------------------------- + + def self.escape_tag_value(value) + value.each_char.map { |c| TAG_SPECIAL.include?(c) ? "\\#{c}" : c }.join + end + + def self.build_filter_clause(tenant:, locale:, model_version:, safety:) + clauses = [] + clauses << "@tenant:{#{escape_tag_value(tenant)}}" if tenant && !tenant.empty? + clauses << "@locale:{#{escape_tag_value(locale)}}" if locale && !locale.empty? + clauses << "@model_version:{#{escape_tag_value(model_version)}}" if model_version && !model_version.empty? + clauses << "@safety:{#{escape_tag_value(safety)}}" if safety && !safety.empty? + clauses.empty? ? '(*)' : "(#{clauses.join(' ')})" + end + + # ---------------------------------------------------------------- + # Inspection / admin + # ---------------------------------------------------------------- + + # Subset of `FT.INFO` useful for the demo UI. 
+ def index_info + raw = @redis.call('FT.INFO', @index_name) + info = ft_info_to_hash(raw) + { + num_docs: (info['num_docs'] || 0).to_i, + indexing_failures: (info['hash_indexing_failures'] || 0).to_i, + vector_index_size_mb: (info['vector_index_sz_mb'] || 0).to_f + } + rescue Redis::CommandError + { num_docs: 0, indexing_failures: 0, vector_index_size_mb: 0.0 } + end + + # Return every cached entry (no embedding) for the admin UI. + def list_entries(limit: 100) + args = [ + 'FT.SEARCH', @index_name, '*', + 'RETURN', '8', + 'prompt', 'response', 'tenant', 'locale', + 'model_version', 'safety', 'created_ts', 'hit_count', + 'LIMIT', '0', limit.to_s, + 'SORTBY', 'created_ts', 'DESC', + 'DIALECT', '2' + ] + result = @redis.call(*args) + parse_search_result(result).map do |doc| + raw_key = doc[:_key] + entry_id = raw_key.start_with?(@key_prefix) ? raw_key[@key_prefix.length..] : raw_key + ttl = @redis.ttl(entry_key(entry_id)) + { + id: entry_id, + prompt: doc[:prompt] || '', + response: doc[:response] || '', + tenant: doc[:tenant] || '', + locale: doc[:locale] || '', + model_version: doc[:model_version] || '', + safety: doc[:safety] || '', + hit_count: (doc[:hit_count] || '0').to_i, + ttl_seconds: ttl && ttl.positive? ? ttl.to_i : 0, + created_ts: (doc[:created_ts] || '0').to_f + } + end + end + + # Drop a single entry. Returns true if the key existed. + def delete_entry(entry_id) + @redis.del(entry_key(entry_id)).positive? + end + + # Drop the index and every cached entry. Returns the number of + # entries that were removed. Used by the demo's "reset" button — + # in production the equivalent is just `FLUSHDB` on a dedicated + # cache database, or letting TTLs expire naturally. 
+ def clear + before = index_info[:num_docs] + drop_index(delete_documents: true) + create_index + before + end + + # ---------------------------------------------------------------- + # Internals + # ---------------------------------------------------------------- + + private + + def validate_dim!(vector, label) + unless vector.respond_to?(:length) && vector.length == @vector_dim + actual = vector.respond_to?(:length) ? vector.length : 'unknown' + raise ArgumentError, + "#{label} has length #{actual}; index expects #{@vector_dim}" + end + end + + # Parse the raw `FT.SEARCH` reply (RESP2 layout). The shape is: + # [ total, key1, [field1, value1, field2, value2, ...], key2, [...], ... ] + # where each key is followed by a flat field/value array. + def parse_search_result(reply) + return [] unless reply.is_a?(Array) && reply.length >= 1 + _total = reply[0] + docs = [] + i = 1 + while i < reply.length + key = reply[i] + fields = reply[i + 1] + i += 2 + next if fields.nil? + doc = { _key: key } + # `fields` is a flat [k, v, k, v, ...] array; convert pairs to + # symbol-keyed entries on the doc hash for easy lookup. + j = 0 + while j < fields.length + field_name = fields[j].to_s + field_value = fields[j + 1] + doc[field_name.to_sym] = field_value + j += 2 + end + docs << doc + end + docs + end + + # `FT.INFO` returns a flat alternating key/value array. Lift it to + # a string-keyed hash, ignoring nested arrays we don't need. + def ft_info_to_hash(reply) + return {} unless reply.is_a?(Array) + out = {} + i = 0 + while i < reply.length + out[reply[i].to_s] = reply[i + 1] + i += 2 + end + out + end + end +end diff --git a/content/develop/use-cases/semantic-cache/ruby/lib/embeddings.rb b/content/develop/use-cases/semantic-cache/ruby/lib/embeddings.rb new file mode 100644 index 0000000000..0177a0df73 --- /dev/null +++ b/content/develop/use-cases/semantic-cache/ruby/lib/embeddings.rb @@ -0,0 +1,90 @@ +# Local text-embedding helper backed by the `informers` gem. 
+# +# `informers` is a Ruby port of Hugging Face transformers that runs +# the ONNX-exported `sentence-transformers/all-MiniLM-L6-v2` encoder +# through the `onnxruntime` gem — same 384-d model the Python, Node.js, +# Go, and Jedis siblings use. Vectors are L2-normalised so a Redis +# Search index declared with `DISTANCE_METRIC COSINE` returns scores +# that are directly comparable across entries. +# +# Embeddings are numerically very close to the PyTorch reference +# (matches the Node.js Xenova ONNX path to ~0.01 in cosine distance); +# the model is downloaded into the local Hugging Face cache on the +# first call and every later call runs offline. + +require 'informers' + +module SemCache + # `informers` exposes a synchronous API, so the constructor does the + # model load directly. We probe the output shape once and record the + # dimension on the instance so callers can compare against the + # cache's expected vector dimension before doing any inserts; + # RedisSemanticCache also checks length on every put / lookup, so a + # model swap that produces wrong-dim vectors fails at the call site + # with a clear error. + class LocalEmbedder + DEFAULT_MODEL = 'sentence-transformers/all-MiniLM-L6-v2' + + attr_reader :model_name, :dim + + def initialize(model_name: DEFAULT_MODEL) + @model_name = model_name + # `Informers.pipeline("embedding", ...)` returns a configured + # EmbeddingPipeline. The `call(text, pooling:, normalize:)` API + # mirrors @xenova/transformers' feature-extraction pipeline so + # the Node.js sibling's code looks structurally identical. + @model = Informers.pipeline('embedding', model_name) + probe = encode_one('dimension probe') + @dim = probe.length + end + + # Encode a single string. Returns a 384-element Array of Float + # (Ruby doubles; the values themselves are float32 round-trips + # from the ONNX session so the precision is the model's). 
+ # + # We pass `normalize: true` to informers, which L2-normalises in + # the ONNX graph itself — the result is already a unit vector, + # so a second pass through `l2_normalize` would be redundant. + # `validate_dim!` enforces the shape contract. + def encode_one(text) + vec = @model.(text, pooling: 'mean', normalize: true) + # The pipeline returns a flat Array of Float when the input is a single + # string and an Array of such Arrays when the input is an Array; we + # special-case below in encode_many. Defensive flatten in case + # a future release unifies the shapes. + vec = vec.first if vec.first.is_a?(Array) + validate_dim!(vec) + vec + end + + # Encode several strings in one pipeline call. Returns an + # Array of Float Arrays, one row per input string. Raises + # if the model produces a different number of rows than inputs — + # that would silently misalign the seed phase otherwise. + def encode_many(texts) + rows = @model.(texts, pooling: 'mean', normalize: true) + if rows.length != texts.length + raise "informers returned #{rows.length} vectors for #{texts.length} inputs" + end + rows.each { |row| validate_dim!(row) } + rows + end + + # Pack a Ruby Array of Float into the bytes Redis Search expects: + # raw little-endian float32, no header, exactly `dim * 4` bytes. + # Ruby's `Array#pack` directive `'e'` is little-endian single + # precision float; `'e*'` packs every element. This is the + # encoding RediSearch reads for a `VECTOR ... TYPE FLOAT32` field. + def self.to_bytes(vector) + vector.pack('e*') + end + + private + + def validate_dim!(vec) + return if @dim.nil? 
+ return if vec.length == @dim + raise "encoder produced #{vec.length}-d vector; expected #{@dim}-d" + end + end +end diff --git a/content/develop/use-cases/semantic-cache/ruby/lib/mock_llm.rb b/content/develop/use-cases/semantic-cache/ruby/lib/mock_llm.rb new file mode 100644 index 0000000000..3552a73819 --- /dev/null +++ b/content/develop/use-cases/semantic-cache/ruby/lib/mock_llm.rb @@ -0,0 +1,147 @@ +# Deterministic mock LLM for the semantic-cache demo. +# +# The point of a semantic cache is to *skip* an LLM call when a prior +# answer is reusable. To make that visible in a docs demo we need an +# LLM stand-in that: +# +# * takes long enough that the saved time on a cache hit is obvious +# (real-world model calls are 500 ms to several seconds); +# * responds deterministically so a given prompt always produces the +# same answer, which keeps the demo reproducible; +# * exposes an estimated token count so the demo can show the saving +# in "tokens not spent" terms alongside latency; +# * needs no API keys, no network, no extra dependencies. +# +# It is keyword-matched against a small lookup table of FAQ-style +# answers for a fictional online retailer. Anything that doesn't match +# falls back to a generic templated reply. The `latency_ms` parameter +# is the simulated round trip; the default (1500 ms) is in the +# neighbourhood of a real GPT-class model on a moderately-sized prompt. + +module SemCache + class MockLLM + KNOWLEDGE = [ + { + keywords: %w[return refund exchange], + answer: + 'You can return any unworn item within 30 days of delivery for a ' \ + 'full refund. Start a return from your order page; we email a ' \ + 'prepaid label and refund the original payment method within ' \ + 'five business days of receiving the item.' + }, + { + keywords: %w[shipping delivery arrive ship], + answer: + 'Standard shipping is free on orders over $50 and arrives in ' \ + 'three to five business days. 
Expedited two-day shipping is ' \ + '$9.99 and is available at checkout for in-stock items.' + }, + { + keywords: %w[size sizing fit], + answer: + 'We follow standard US sizing. For most styles we recommend ' \ + 'ordering your usual size; the product page includes a sizing ' \ + 'chart and customer fit notes for items that run small or large.' + }, + { + keywords: %w[warranty guarantee defect broken], + answer: + 'All gear is covered by a one-year manufacturer warranty against ' \ + 'defects in materials or workmanship. Email support with your ' \ + 'order number and a photo of the issue and we will replace the ' \ + 'item or issue a refund.' + }, + { + keywords: %w[contact support help agent], + answer: + 'You can reach our support team by email at help@example.com or ' \ + 'by live chat from the help centre, 9am to 9pm Eastern, seven ' \ + 'days a week. Most tickets get a first reply within two hours.' + }, + { + keywords: %w[track tracking order where], + answer: + 'Your tracking number is on the order confirmation email and on ' \ + 'the order detail page once the package has been picked up by ' \ + 'the carrier — typically within 24 hours of order placement.' + }, + { + keywords: %w[cancel modify change], + answer: + 'Orders can be cancelled or modified for up to one hour after ' \ + 'placement. After that the order has usually entered our ' \ + 'warehouse system; the fastest path is to accept delivery and ' \ + 'start a return for any unwanted items.' + }, + { + keywords: %w[discount coupon promo code], + answer: + 'Active promotional codes are listed on the homepage banner. ' \ + 'Codes apply at checkout and cannot be combined; the system ' \ + 'automatically uses the larger of the two when more than one ' \ + 'would qualify.' 
+ } + ].freeze + + Response = Struct.new( + :response, :model_version, :latency_ms, + :prompt_tokens, :completion_tokens, keyword_init: true + ) do + def total_tokens + prompt_tokens.to_i + completion_tokens.to_i + end + end + + attr_reader :model_version, :latency_ms, :call_count + + def initialize(model_version: 'gpt-4.5-2026', latency_ms: 1500.0) + @model_version = model_version + @latency_ms = latency_ms.to_f + @call_count = 0 + end + + # Pretend to call a model. Sleeps for the configured latency, then + # returns a templated answer. The sleep happens first so the + # latency budget is realistic regardless of which branch produces + # the text. + def complete(prompt) + @call_count += 1 + started = monotonic_ms + sleep(@latency_ms / 1000.0) if @latency_ms.positive? + response_text = answer_for(prompt) + Response.new( + response: response_text, + model_version: @model_version, + latency_ms: monotonic_ms - started, + prompt_tokens: estimate_tokens(prompt), + completion_tokens: estimate_tokens(response_text) + ) + end + + private + + def monotonic_ms + Process.clock_gettime(Process::CLOCK_MONOTONIC) * 1000.0 + end + + # Rough English token estimate: ~4 characters per token. Real + # tokenizers (BPE, SentencePiece) vary slightly but this is close + # enough for "look how many tokens you saved" demo signage. + def estimate_tokens(text) + return 0 if text.nil? || text.empty? + [(text.length / 4), 1].max + end + + def answer_for(prompt) + lower = prompt.to_s.downcase + row = KNOWLEDGE.find { |r| r[:keywords].any? { |k| lower.include?(k) } } + return row[:answer] if row + + # Generic fallback — keeps the demo working for queries that + # don't match any FAQ keyword. + 'Thanks for the question. Our team would normally answer this ' \ + 'individually; in the meantime please check the help centre or ' \ + 'contact support@example.com for a faster response.' 
+ end + end +end diff --git a/content/develop/use-cases/semantic-cache/ruby/lib/seed_cache.rb b/content/develop/use-cases/semantic-cache/ruby/lib/seed_cache.rb new file mode 100644 index 0000000000..ca82c25955 --- /dev/null +++ b/content/develop/use-cases/semantic-cache/ruby/lib/seed_cache.rb @@ -0,0 +1,89 @@ +# Pre-seed the semantic cache with a handful of FAQ answers. +# +# In a real deployment the cache fills up organically as users ask +# questions: a first-time question is a miss, the LLM answers, and the +# response is written back. To make the demo immediately useful — so +# the first query you type lands on a hit instead of a cold miss — we +# seed a small set of canonical prompts and their answers at startup. +# +# The seed list mirrors the keyword table in `mock_llm.rb` but stores +# the *canonical phrasing* of each question. Paraphrases of any of +# these prompts ("How do I return an item?", "Can I get a refund?") +# embed close to the canonical entry and the cache lookup serves the +# stored response without ever calling the model. + +module SemCache + module SeedCache + SEED_ENTRIES = [ + { + prompt: 'What is your return policy?', + response: + 'You can return any unworn item within 30 days of delivery for ' \ + 'a full refund. Start a return from your order page; we email ' \ + 'a prepaid label and refund the original payment method within ' \ + 'five business days of receiving the item.' + }, + { + prompt: 'How long does shipping take?', + response: + 'Standard shipping is free on orders over $50 and arrives in ' \ + 'three to five business days. Expedited two-day shipping is ' \ + '$9.99 and is available at checkout for in-stock items.' + }, + { + prompt: 'How do I find my size?', + response: + 'We follow standard US sizing. For most styles we recommend ' \ + 'ordering your usual size; the product page includes a sizing ' \ + 'chart and customer fit notes for items that run small or ' \ + 'large.' 
+ }, + { + prompt: 'Is there a warranty on your products?', + response: + 'All gear is covered by a one-year manufacturer warranty ' \ + 'against defects in materials or workmanship. Email support ' \ + 'with your order number and a photo of the issue and we will ' \ + 'replace the item or issue a refund.' + }, + { + prompt: 'How can I contact customer support?', + response: + 'You can reach our support team by email at help@example.com ' \ + 'or by live chat from the help centre, 9am to 9pm Eastern, ' \ + 'seven days a week. Most tickets get a first reply within two ' \ + 'hours.' + }, + { + prompt: 'Where is my order?', + response: + 'Your tracking number is on the order confirmation email and ' \ + 'on the order detail page once the package has been picked up ' \ + 'by the carrier — typically within 24 hours of order ' \ + 'placement.' + } + ].freeze + + # Returns the number of entries written. Batched-encode the + # canonical prompts in a single informers call so we pay the model + # load once and amortise the tokenisation overhead across the row + # set — useful on Ruby because the per-call dispatch cost is + # higher than the equivalent in Python or Node. + def self.seed(cache, embedder, tenant: 'acme', locale: 'en', + model_version: 'gpt-4.5-2026') + prompts = SEED_ENTRIES.map { |e| e[:prompt] } + vectors = embedder.encode_many(prompts) + SEED_ENTRIES.each_with_index do |entry, i| + cache.put( + prompt: entry[:prompt], + response: entry[:response], + embedding: vectors[i], + tenant: tenant, + locale: locale, + model_version: model_version + ) + end + SEED_ENTRIES.length + end + end +end diff --git a/content/develop/use-cases/semantic-cache/rust/.gitignore b/content/develop/use-cases/semantic-cache/rust/.gitignore new file mode 100644 index 0000000000..3bbf50d79c --- /dev/null +++ b/content/develop/use-cases/semantic-cache/rust/.gitignore @@ -0,0 +1,9 @@ +# Cargo build artefacts. 
+target/ + +# Candle / hf-hub download the all-MiniLM-L6-v2 weights into a local +# cache directory on first run. The weights are ~87 MB and we don't +# want them in the repo. The exact location depends on whether the +# caller exported HF_HOME; the demo defaults to ./models so both the +# Go example and this one share the same gitignore shape. +models/ diff --git a/content/develop/use-cases/semantic-cache/rust/Cargo.lock b/content/develop/use-cases/semantic-cache/rust/Cargo.lock new file mode 100644 index 0000000000..194c45868f --- /dev/null +++ b/content/develop/use-cases/semantic-cache/rust/Cargo.lock @@ -0,0 +1,2879 @@ +# This file is automatically @generated by Cargo. +# It is not intended for manual editing. +version = 4 + +[[package]] +name = "adler2" +version = "2.0.1" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "320119579fcad9c21884f5c4861d16174d0e06250625266f50fe6898340abefa" + +[[package]] +name = "aho-corasick" +version = "1.1.4" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "ddd31a130427c27518df266943a5308ed92d4b226cc639f5a8f1002816174301" +dependencies = [ + "memchr", +] + +[[package]] +name = "anyhow" +version = "1.0.102" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "7f202df86484c868dbad7eaa557ef785d5c66295e41b460ef922eca0723b842c" + +[[package]] +name = "arbitrary" +version = "1.4.2" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "c3d036a3c4ab069c7b410a2ce876bd74808d2d0888a82667669f8e783a898bf1" +dependencies = [ + "derive_arbitrary", +] + +[[package]] +name = "arc-swap" +version = "1.9.1" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "6a3a1fd6f75306b68087b831f025c712524bcb19aad54e557b1129cfa0a2b207" +dependencies = [ + "rustversion", +] + +[[package]] +name = "ascii" +version = "1.1.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = 
"d92bec98840b8f03a5ff5413de5293bfcd8bf96467cf5452609f939ec6f5de16" + +[[package]] +name = "autocfg" +version = "1.5.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "c08606f8c3cbf4ce6ec8e28fb0014a2c086708fe954eaa885384a6165172e7e8" + +[[package]] +name = "base64" +version = "0.13.1" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "9e1b586273c5702936fe7b7d6896644d8be71e6314cfe09d3167c95f712589e8" + +[[package]] +name = "base64" +version = "0.22.1" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "72b3254f16251a8381aa12e40e3c4d2f0199f8c6508fbecb9d91f575e0fbb8c6" + +[[package]] +name = "bit-set" +version = "0.5.3" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "0700ddab506f33b20a03b13996eccd309a48e5ff77d0d95926aa0210fb4e95f1" +dependencies = [ + "bit-vec", +] + +[[package]] +name = "bit-vec" +version = "0.6.3" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "349f9b6a179ed607305526ca489b34ad0a41aed5f7980fa90eb03160b69598fb" + +[[package]] +name = "bitflags" +version = "1.3.2" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "bef38d45163c2f1dde094a7dfd33ccf595c92905c8f8f4fdc18d06fb1037718a" + +[[package]] +name = "bitflags" +version = "2.11.1" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "c4512299f36f043ab09a583e57bceb5a5aab7a73db1805848e8fef3c9e8c78b3" + +[[package]] +name = "bumpalo" +version = "3.20.2" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "5d20789868f4b01b2f2caec9f5c4e0213b41e3e5702a50157d699ae31ced2fcb" + +[[package]] +name = "bytemuck" +version = "1.25.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "c8efb64bd706a16a1bdde310ae86b351e4d21550d98d056f22f8a7f7a2183fec" +dependencies = [ + "bytemuck_derive", +] + +[[package]] +name = "bytemuck_derive" +version = "1.10.2" +source = 
"registry+https://github.com/rust-lang/crates.io-index" +checksum = "f9abbd1bc6865053c427f7198e6af43bfdedc55ab791faed4fbd361d789575ff" +dependencies = [ + "proc-macro2", + "quote", + "syn", +] + +[[package]] +name = "byteorder" +version = "1.5.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "1fd0f2584146f6f2ef48085050886acf353beff7305ebd1ae69500e27c67f64b" + +[[package]] +name = "bytes" +version = "1.11.1" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "1e748733b7cbc798e1434b6ac524f0c1ff2ab456fe201501e6497c8417a4fc33" + +[[package]] +name = "candle-core" +version = "0.8.4" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "06ccf5ee3532e66868516d9b315f73aec9f34ea1a37ae98514534d458915dbf1" +dependencies = [ + "byteorder", + "gemm 0.17.1", + "half", + "memmap2", + "num-traits", + "num_cpus", + "rand 0.9.4", + "rand_distr", + "rayon", + "safetensors", + "thiserror", + "ug", + "yoke 0.7.5", + "zip", +] + +[[package]] +name = "candle-nn" +version = "0.8.4" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "be1160c3b63f47d40d91110a3e1e1e566ae38edddbbf492a60b40ffc3bc1ff38" +dependencies = [ + "candle-core", + "half", + "num-traits", + "rayon", + "safetensors", + "serde", + "thiserror", +] + +[[package]] +name = "candle-transformers" +version = "0.8.4" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "94a0900d49f8605e0e7e6693a1f560e6271279de98e5fa369e7abf3aac245020" +dependencies = [ + "byteorder", + "candle-core", + "candle-nn", + "fancy-regex", + "num-traits", + "rand 0.9.4", + "rayon", + "serde", + "serde_json", + "serde_plain", + "tracing", +] + +[[package]] +name = "cc" +version = "1.2.62" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "a1dce859f0832a7d088c4f1119888ab94ef4b5d6795d1ce05afb7fe159d79f98" +dependencies = [ + "find-msvc-tools", + "shlex", +] + +[[package]] +name = "cfg-if" 
+version = "1.0.4" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "9330f8b2ff13f34540b44e946ef35111825727b38d33286ef986142615121801" + +[[package]] +name = "chunked_transfer" +version = "1.5.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "6e4de3bc4ea267985becf712dc6d9eed8b04c953b3fcfb339ebc87acd9804901" + +[[package]] +name = "combine" +version = "4.6.7" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "ba5a308b75df32fe02788e748662718f03fde005016435c444eea572398219fd" +dependencies = [ + "bytes", + "memchr", +] + +[[package]] +name = "console" +version = "0.15.11" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "054ccb5b10f9f2cbf51eb355ca1d05c2d279ce1804688d0db74b4733a5aeafd8" +dependencies = [ + "encode_unicode", + "libc", + "once_cell", + "unicode-width", + "windows-sys 0.59.0", +] + +[[package]] +name = "core-foundation" +version = "0.10.1" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "b2a6cd9ae233e7f62ba4e9353e81a88df7fc8a5987b8d445b4d90c879bd156f6" +dependencies = [ + "core-foundation-sys", + "libc", +] + +[[package]] +name = "core-foundation-sys" +version = "0.8.7" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "773648b94d0e5d620f64f280777445740e61fe701025087ec8b57f45c791888b" + +[[package]] +name = "crc32fast" +version = "1.5.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "9481c1c90cbf2ac953f07c8d4a58aa3945c425b7185c9154d67a65e4230da511" +dependencies = [ + "cfg-if", +] + +[[package]] +name = "crossbeam-deque" +version = "0.8.6" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "9dd111b7b7f7d55b72c0a6ae361660ee5853c9af73f70c3c2ef6858b950e2e51" +dependencies = [ + "crossbeam-epoch", + "crossbeam-utils", +] + +[[package]] +name = "crossbeam-epoch" +version = "0.9.18" +source = 
"registry+https://github.com/rust-lang/crates.io-index" +checksum = "5b82ac4a3c2ca9c3460964f020e1402edd5753411d7737aa39c3714ad1b5420e" +dependencies = [ + "crossbeam-utils", +] + +[[package]] +name = "crossbeam-utils" +version = "0.8.21" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "d0a5c400df2834b80a4c3327b3aad3a4c4cd4de0629063962b03235697506a28" + +[[package]] +name = "crunchy" +version = "0.2.4" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "460fbee9c2c2f33933d720630a6a0bac33ba7053db5344fac858d4b8952d77d5" + +[[package]] +name = "darling" +version = "0.20.11" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "fc7f46116c46ff9ab3eb1597a45688b6715c6e628b5c133e288e709a29bcb4ee" +dependencies = [ + "darling_core", + "darling_macro", +] + +[[package]] +name = "darling_core" +version = "0.20.11" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "0d00b9596d185e565c2207a0b01f8bd1a135483d02d9b7b0a54b11da8d53412e" +dependencies = [ + "fnv", + "ident_case", + "proc-macro2", + "quote", + "strsim", + "syn", +] + +[[package]] +name = "darling_macro" +version = "0.20.11" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "fc34b93ccb385b40dc71c6fceac4b2ad23662c7eeb248cf10d529b7e055b6ead" +dependencies = [ + "darling_core", + "quote", + "syn", +] + +[[package]] +name = "derive_arbitrary" +version = "1.4.2" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "1e567bd82dcff979e4b03460c307b3cdc9e96fde3d73bed1496d2bc75d9dd62a" +dependencies = [ + "proc-macro2", + "quote", + "syn", +] + +[[package]] +name = "derive_builder" +version = "0.20.2" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "507dfb09ea8b7fa618fcf76e953f4f5e192547945816d5358edffe39f6f94947" +dependencies = [ + "derive_builder_macro", +] + +[[package]] +name = "derive_builder_core" +version = "0.20.2" +source = 
"registry+https://github.com/rust-lang/crates.io-index" +checksum = "2d5bcf7b024d6835cfb3d473887cd966994907effbe9227e8c8219824d06c4e8" +dependencies = [ + "darling", + "proc-macro2", + "quote", + "syn", +] + +[[package]] +name = "derive_builder_macro" +version = "0.20.2" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "ab63b0e2bf4d5928aff72e83a7dace85d7bba5fe12dcc3c5a572d78caffd3f3c" +dependencies = [ + "derive_builder_core", + "syn", +] + +[[package]] +name = "dirs" +version = "5.0.1" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "44c45a9d03d6676652bcb5e724c7e988de1acad23a711b5217ab9cbecbec2225" +dependencies = [ + "dirs-sys", +] + +[[package]] +name = "dirs-sys" +version = "0.4.1" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "520f05a5cbd335fae5a99ff7a6ab8627577660ee5cfd6a94a6a929b52ff0321c" +dependencies = [ + "libc", + "option-ext", + "redox_users", + "windows-sys 0.48.0", +] + +[[package]] +name = "displaydoc" +version = "0.2.5" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "97369cbbc041bc366949bc74d34658d6cda5621039731c6310521892a3a20ae0" +dependencies = [ + "proc-macro2", + "quote", + "syn", +] + +[[package]] +name = "dyn-stack" +version = "0.10.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "56e53799688f5632f364f8fb387488dd05db9fe45db7011be066fc20e7027f8b" +dependencies = [ + "bytemuck", + "reborrow", +] + +[[package]] +name = "dyn-stack" +version = "0.13.2" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "1c4713e43e2886ba72b8271aa66c93d722116acf7a75555cce11dcde84388fe8" +dependencies = [ + "bytemuck", + "dyn-stack-macros", +] + +[[package]] +name = "dyn-stack-macros" +version = "0.1.3" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "e1d926b4d407d372f141f93bb444696142c29d32962ccbd3531117cf3aa0bfa9" + +[[package]] +name = "either" +version = 
"1.16.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "91622ff5e7162018101f2fea40d6ebf4a78bbe5a49736a2020649edf9693679e" + +[[package]] +name = "encode_unicode" +version = "1.0.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "34aa73646ffb006b8f5147f3dc182bd4bcb190227ce861fc4a4844bf8e3cb2c0" + +[[package]] +name = "enum-as-inner" +version = "0.6.1" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "a1e6a265c649f3f5979b601d26f1d05ada116434c87741c9493cb56218f76cbc" +dependencies = [ + "heck", + "proc-macro2", + "quote", + "syn", +] + +[[package]] +name = "equivalent" +version = "1.0.2" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "877a4ace8713b0bcf2a4e7eec82529c029f1d0619886d18145fea96c3ffe5c0f" + +[[package]] +name = "errno" +version = "0.3.14" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "39cab71617ae0d63f51a36d69f866391735b51691dbda63cf6f96d042b63efeb" +dependencies = [ + "libc", + "windows-sys 0.61.2", +] + +[[package]] +name = "esaxx-rs" +version = "0.1.10" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "d817e038c30374a4bcb22f94d0a8a0e216958d4c3dcde369b1439fec4bdda6e6" + +[[package]] +name = "fancy-regex" +version = "0.13.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "531e46835a22af56d1e3b66f04844bed63158bc094a628bec1d321d9b4c44bf2" +dependencies = [ + "bit-set", + "regex-automata", + "regex-syntax", +] + +[[package]] +name = "fastrand" +version = "2.4.1" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "9f1f227452a390804cdb637b74a86990f2a7d7ba4b7d5693aac9b4dd6defd8d6" + +[[package]] +name = "find-msvc-tools" +version = "0.1.9" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "5baebc0774151f905a1a2cc41989300b1e6fbb29aff0ceffa1064fdd3088d582" + +[[package]] +name = "flate2" +version = 
"1.1.9" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "843fba2746e448b37e26a819579957415c8cef339bf08564fe8b7ddbd959573c" +dependencies = [ + "crc32fast", + "miniz_oxide", +] + +[[package]] +name = "fnv" +version = "1.0.7" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "3f9eec918d3f24069decb9af1554cad7c880e2da24a9afd88aca000531ab82c1" + +[[package]] +name = "foldhash" +version = "0.1.5" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "d9c4f5dac5e15c24eb999c26181a6ca40b39fe946cbe4c263c7209467bc83af2" + +[[package]] +name = "foreign-types" +version = "0.3.2" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "f6f339eb8adc052cd2ca78910fda869aefa38d22d5cb648e6485e4d3fc06f3b1" +dependencies = [ + "foreign-types-shared", +] + +[[package]] +name = "foreign-types-shared" +version = "0.1.1" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "00b0228411908ca8685dba7fc2cdd70ec9990a6e753e89b6ac91a84c40fbaf4b" + +[[package]] +name = "form_urlencoded" +version = "1.2.2" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "cb4cb245038516f5f85277875cdaa4f7d2c9a0fa0468de06ed190163b1581fcf" +dependencies = [ + "percent-encoding", +] + +[[package]] +name = "futures-core" +version = "0.3.32" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "7e3450815272ef58cec6d564423f6e755e25379b217b0bc688e295ba24df6b1d" + +[[package]] +name = "futures-task" +version = "0.3.32" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "037711b3d59c33004d3856fbdc83b99d4ff37a24768fa1be9ce3538a1cde4393" + +[[package]] +name = "futures-util" +version = "0.3.32" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "389ca41296e6190b48053de0321d02a77f32f8a5d2461dd38762c0593805c6d6" +dependencies = [ + "futures-core", + "futures-task", + "pin-project-lite", + "slab", 
+] + +[[package]] +name = "gemm" +version = "0.17.1" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "6ab24cc62135b40090e31a76a9b2766a501979f3070fa27f689c27ec04377d32" +dependencies = [ + "dyn-stack 0.10.0", + "gemm-c32 0.17.1", + "gemm-c64 0.17.1", + "gemm-common 0.17.1", + "gemm-f16 0.17.1", + "gemm-f32 0.17.1", + "gemm-f64 0.17.1", + "num-complex", + "num-traits", + "paste", + "raw-cpuid 10.7.0", + "seq-macro", +] + +[[package]] +name = "gemm" +version = "0.18.2" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "ab96b703d31950f1aeddded248bc95543c9efc7ac9c4a21fda8703a83ee35451" +dependencies = [ + "dyn-stack 0.13.2", + "gemm-c32 0.18.2", + "gemm-c64 0.18.2", + "gemm-common 0.18.2", + "gemm-f16 0.18.2", + "gemm-f32 0.18.2", + "gemm-f64 0.18.2", + "num-complex", + "num-traits", + "paste", + "raw-cpuid 11.6.0", + "seq-macro", +] + +[[package]] +name = "gemm-c32" +version = "0.17.1" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "b9c030d0b983d1e34a546b86e08f600c11696fde16199f971cd46c12e67512c0" +dependencies = [ + "dyn-stack 0.10.0", + "gemm-common 0.17.1", + "num-complex", + "num-traits", + "paste", + "raw-cpuid 10.7.0", + "seq-macro", +] + +[[package]] +name = "gemm-c32" +version = "0.18.2" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "f6db9fd9f40421d00eea9dd0770045a5603b8d684654816637732463f4073847" +dependencies = [ + "dyn-stack 0.13.2", + "gemm-common 0.18.2", + "num-complex", + "num-traits", + "paste", + "raw-cpuid 11.6.0", + "seq-macro", +] + +[[package]] +name = "gemm-c64" +version = "0.17.1" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "fbb5f2e79fefb9693d18e1066a557b4546cd334b226beadc68b11a8f9431852a" +dependencies = [ + "dyn-stack 0.10.0", + "gemm-common 0.17.1", + "num-complex", + "num-traits", + "paste", + "raw-cpuid 10.7.0", + "seq-macro", +] + +[[package]] +name = "gemm-c64" +version = "0.18.2" 
+source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "dfcad8a3d35a43758330b635d02edad980c1e143dc2f21e6fd25f9e4eada8edf" +dependencies = [ + "dyn-stack 0.13.2", + "gemm-common 0.18.2", + "num-complex", + "num-traits", + "paste", + "raw-cpuid 11.6.0", + "seq-macro", +] + +[[package]] +name = "gemm-common" +version = "0.17.1" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "a2e7ea062c987abcd8db95db917b4ffb4ecdfd0668471d8dc54734fdff2354e8" +dependencies = [ + "bytemuck", + "dyn-stack 0.10.0", + "half", + "num-complex", + "num-traits", + "once_cell", + "paste", + "pulp 0.18.22", + "raw-cpuid 10.7.0", + "rayon", + "seq-macro", + "sysctl 0.5.5", +] + +[[package]] +name = "gemm-common" +version = "0.18.2" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "a352d4a69cbe938b9e2a9cb7a3a63b7e72f9349174a2752a558a8a563510d0f3" +dependencies = [ + "bytemuck", + "dyn-stack 0.13.2", + "half", + "libm", + "num-complex", + "num-traits", + "once_cell", + "paste", + "pulp 0.21.5", + "raw-cpuid 11.6.0", + "rayon", + "seq-macro", + "sysctl 0.6.0", +] + +[[package]] +name = "gemm-f16" +version = "0.17.1" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "7ca4c06b9b11952071d317604acb332e924e817bd891bec8dfb494168c7cedd4" +dependencies = [ + "dyn-stack 0.10.0", + "gemm-common 0.17.1", + "gemm-f32 0.17.1", + "half", + "num-complex", + "num-traits", + "paste", + "raw-cpuid 10.7.0", + "rayon", + "seq-macro", +] + +[[package]] +name = "gemm-f16" +version = "0.18.2" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "cff95ae3259432f3c3410eaa919033cd03791d81cebd18018393dc147952e109" +dependencies = [ + "dyn-stack 0.13.2", + "gemm-common 0.18.2", + "gemm-f32 0.18.2", + "half", + "num-complex", + "num-traits", + "paste", + "raw-cpuid 11.6.0", + "rayon", + "seq-macro", +] + +[[package]] +name = "gemm-f32" +version = "0.17.1" +source = 
"registry+https://github.com/rust-lang/crates.io-index" +checksum = "e9a69f51aaefbd9cf12d18faf273d3e982d9d711f60775645ed5c8047b4ae113" +dependencies = [ + "dyn-stack 0.10.0", + "gemm-common 0.17.1", + "num-complex", + "num-traits", + "paste", + "raw-cpuid 10.7.0", + "seq-macro", +] + +[[package]] +name = "gemm-f32" +version = "0.18.2" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "bc8d3d4385393304f407392f754cd2dc4b315d05063f62cf09f47b58de276864" +dependencies = [ + "dyn-stack 0.13.2", + "gemm-common 0.18.2", + "num-complex", + "num-traits", + "paste", + "raw-cpuid 11.6.0", + "seq-macro", +] + +[[package]] +name = "gemm-f64" +version = "0.17.1" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "aa397a48544fadf0b81ec8741e5c0fba0043008113f71f2034def1935645d2b0" +dependencies = [ + "dyn-stack 0.10.0", + "gemm-common 0.17.1", + "num-complex", + "num-traits", + "paste", + "raw-cpuid 10.7.0", + "seq-macro", +] + +[[package]] +name = "gemm-f64" +version = "0.18.2" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "35b2a4f76ce4b8b16eadc11ccf2e083252d8237c1b589558a49b0183545015bd" +dependencies = [ + "dyn-stack 0.13.2", + "gemm-common 0.18.2", + "num-complex", + "num-traits", + "paste", + "raw-cpuid 11.6.0", + "seq-macro", +] + +[[package]] +name = "getrandom" +version = "0.2.17" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "ff2abc00be7fca6ebc474524697ae276ad847ad0a6b3faa4bcb027e9a4614ad0" +dependencies = [ + "cfg-if", + "libc", + "wasi", +] + +[[package]] +name = "getrandom" +version = "0.3.4" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "899def5c37c4fd7b2664648c28120ecec138e4d395b459e5ca34f9cce2dd77fd" +dependencies = [ + "cfg-if", + "libc", + "r-efi 5.3.0", + "wasip2", +] + +[[package]] +name = "getrandom" +version = "0.4.2" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = 
"0de51e6874e94e7bf76d726fc5d13ba782deca734ff60d5bb2fb2607c7406555" +dependencies = [ + "cfg-if", + "libc", + "r-efi 6.0.0", + "wasip2", + "wasip3", +] + +[[package]] +name = "half" +version = "2.7.1" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "6ea2d84b969582b4b1864a92dc5d27cd2b77b622a8d79306834f1be5ba20d84b" +dependencies = [ + "bytemuck", + "cfg-if", + "crunchy", + "num-traits", + "rand 0.9.4", + "rand_distr", + "zerocopy", +] + +[[package]] +name = "hashbrown" +version = "0.15.5" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "9229cfe53dfd69f0609a49f65461bd93001ea1ef889cd5529dd176593f5338a1" +dependencies = [ + "foldhash", +] + +[[package]] +name = "hashbrown" +version = "0.17.1" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "ed5909b6e89a2db4456e54cd5f673791d7eca6732202bbf2a9cc504fe2f9b84a" + +[[package]] +name = "heck" +version = "0.5.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "2304e00983f87ffb38b55b444b5e3b60a884b5d30c0fca7d82fe33449bbe55ea" + +[[package]] +name = "hermit-abi" +version = "0.5.2" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "fc0fef456e4baa96da950455cd02c081ca953b141298e41db3fc7e36b1da849c" + +[[package]] +name = "hf-hub" +version = "0.3.2" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "2b780635574b3d92f036890d8373433d6f9fc7abb320ee42a5c25897fc8ed732" +dependencies = [ + "dirs", + "indicatif", + "log", + "native-tls", + "rand 0.8.6", + "serde", + "serde_json", + "thiserror", + "ureq", +] + +[[package]] +name = "httpdate" +version = "1.0.3" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "df3b46402a9d5adb4c86a0cf463f42e19994e3ee891101b1841f30a545cb49a9" + +[[package]] +name = "icu_collections" +version = "2.2.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = 
"2984d1cd16c883d7935b9e07e44071dca8d917fd52ecc02c04d5fa0b5a3f191c" +dependencies = [ + "displaydoc", + "potential_utf", + "utf8_iter", + "yoke 0.8.2", + "zerofrom", + "zerovec", +] + +[[package]] +name = "icu_locale_core" +version = "2.2.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "92219b62b3e2b4d88ac5119f8904c10f8f61bf7e95b640d25ba3075e6cac2c29" +dependencies = [ + "displaydoc", + "litemap", + "tinystr", + "writeable", + "zerovec", +] + +[[package]] +name = "icu_normalizer" +version = "2.2.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "c56e5ee99d6e3d33bd91c5d85458b6005a22140021cc324cea84dd0e72cff3b4" +dependencies = [ + "icu_collections", + "icu_normalizer_data", + "icu_properties", + "icu_provider", + "smallvec", + "zerovec", +] + +[[package]] +name = "icu_normalizer_data" +version = "2.2.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "da3be0ae77ea334f4da67c12f149704f19f81d1adf7c51cf482943e84a2bad38" + +[[package]] +name = "icu_properties" +version = "2.2.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "bee3b67d0ea5c2cca5003417989af8996f8604e34fb9ddf96208a033901e70de" +dependencies = [ + "icu_collections", + "icu_locale_core", + "icu_properties_data", + "icu_provider", + "zerotrie", + "zerovec", +] + +[[package]] +name = "icu_properties_data" +version = "2.2.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "8e2bbb201e0c04f7b4b3e14382af113e17ba4f63e2c9d2ee626b720cbce54a14" + +[[package]] +name = "icu_provider" +version = "2.2.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "139c4cf31c8b5f33d7e199446eff9c1e02decfc2f0eec2c8d71f65befa45b421" +dependencies = [ + "displaydoc", + "icu_locale_core", + "writeable", + "yoke 0.8.2", + "zerofrom", + "zerotrie", + "zerovec", +] + +[[package]] +name = "id-arena" +version = "2.3.0" +source = 
"registry+https://github.com/rust-lang/crates.io-index" +checksum = "3d3067d79b975e8844ca9eb072e16b31c3c1c36928edf9c6789548c524d0d954" + +[[package]] +name = "ident_case" +version = "1.0.1" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "b9e0384b61958566e926dc50660321d12159025e767c18e043daf26b70104c39" + +[[package]] +name = "idna" +version = "1.1.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "3b0875f23caa03898994f6ddc501886a45c7d3d62d04d2d90788d47be1b1e4de" +dependencies = [ + "idna_adapter", + "smallvec", + "utf8_iter", +] + +[[package]] +name = "idna_adapter" +version = "1.2.2" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "cb68373c0d6620ef8105e855e7745e18b0d00d3bdb07fb532e434244cdb9a714" +dependencies = [ + "icu_normalizer", + "icu_properties", +] + +[[package]] +name = "indexmap" +version = "2.14.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "d466e9454f08e4a911e14806c24e16fba1b4c121d1ea474396f396069cf949d9" +dependencies = [ + "equivalent", + "hashbrown 0.17.1", + "serde", + "serde_core", +] + +[[package]] +name = "indicatif" +version = "0.17.11" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "183b3088984b400f4cfac3620d5e076c84da5364016b4f49473de574b2586235" +dependencies = [ + "console", + "number_prefix", + "portable-atomic", + "unicode-width", + "web-time", +] + +[[package]] +name = "itertools" +version = "0.11.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "b1c173a5686ce8bfa551b3563d0c2170bf24ca44da99c7ca4bfdab5418c3fe57" +dependencies = [ + "either", +] + +[[package]] +name = "itertools" +version = "0.12.1" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "ba291022dbbd398a455acf126c1e341954079855bc60dfdda641363bd6922569" +dependencies = [ + "either", +] + +[[package]] +name = "itertools" +version = "0.13.0" +source = 
"registry+https://github.com/rust-lang/crates.io-index" +checksum = "413ee7dfc52ee1a4949ceeb7dbc8a33f2d6c088194d9f922fb8318faf1f01186" +dependencies = [ + "either", +] + +[[package]] +name = "itoa" +version = "1.0.18" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "8f42a60cbdf9a97f5d2305f08a87dc4e09308d1276d28c869c684d7777685682" + +[[package]] +name = "js-sys" +version = "0.3.98" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "67df7112613f8bfd9150013a0314e196f4800d3201ae742489d999db2f979f08" +dependencies = [ + "cfg-if", + "futures-util", + "once_cell", + "wasm-bindgen", +] + +[[package]] +name = "lazy_static" +version = "1.5.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "bbd2bcb4c963f2ddae06a2efc7e9f3591312473c50c6685e1f298068316e66fe" + +[[package]] +name = "leb128fmt" +version = "0.1.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "09edd9e8b54e49e587e4f6295a7d29c3ea94d469cb40ab8ca70b288248a81db2" + +[[package]] +name = "libc" +version = "0.2.186" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "68ab91017fe16c622486840e4c83c9a37afeff978bd239b5293d61ece587de66" + +[[package]] +name = "libloading" +version = "0.8.9" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "d7c4b02199fee7c5d21a5ae7d8cfa79a6ef5bb2fc834d6e9058e89c825efdc55" +dependencies = [ + "cfg-if", + "windows-link", +] + +[[package]] +name = "libm" +version = "0.2.16" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "b6d2cec3eae94f9f509c767b45932f1ada8350c4bdb85af2fcab4a3c14807981" + +[[package]] +name = "libredox" +version = "0.1.16" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "e02f3bb43d335493c96bf3fd3a321600bf6bd07ed34bc64118e9293bdffea46c" +dependencies = [ + "libc", +] + +[[package]] +name = "linux-raw-sys" +version = "0.12.1" +source = 
"registry+https://github.com/rust-lang/crates.io-index" +checksum = "32a66949e030da00e8c7d4434b251670a91556f4144941d37452769c25d58a53" + +[[package]] +name = "litemap" +version = "0.8.2" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "92daf443525c4cce67b150400bc2316076100ce0b3686209eb8cf3c31612e6f0" + +[[package]] +name = "log" +version = "0.4.29" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "5e5032e24019045c762d3c0f28f5b6b8bbf38563a65908389bf7978758920897" + +[[package]] +name = "macro_rules_attribute" +version = "0.2.2" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "65049d7923698040cd0b1ddcced9b0eb14dd22c5f86ae59c3740eab64a676520" +dependencies = [ + "macro_rules_attribute-proc_macro", + "paste", +] + +[[package]] +name = "macro_rules_attribute-proc_macro" +version = "0.2.2" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "670fdfda89751bc4a84ac13eaa63e205cf0fd22b4c9a5fbfa085b63c1f1d3a30" + +[[package]] +name = "memchr" +version = "2.8.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "f8ca58f447f06ed17d5fc4043ce1b10dd205e060fb3ce5b979b8ed8e59ff3f79" + +[[package]] +name = "memmap2" +version = "0.9.10" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "714098028fe011992e1c3962653c96b2d578c4b4bce9036e15ff220319b1e0e3" +dependencies = [ + "libc", + "stable_deref_trait", +] + +[[package]] +name = "minimal-lexical" +version = "0.2.1" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "68354c5c6bd36d73ff3feceb05efa59b6acb7626617f4962be322a825e61f79a" + +[[package]] +name = "miniz_oxide" +version = "0.8.9" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "1fa76a2c86f704bdb222d66965fb3d63269ce38518b83cb0575fca855ebb6316" +dependencies = [ + "adler2", + "simd-adler32", +] + +[[package]] +name = "monostate" +version = "0.1.18" +source = 
"registry+https://github.com/rust-lang/crates.io-index" +checksum = "3341a273f6c9d5bef1908f17b7267bbab0e95c9bf69a0d4dcf8e9e1b2c76ef67" +dependencies = [ + "monostate-impl", + "serde", + "serde_core", +] + +[[package]] +name = "monostate-impl" +version = "0.1.18" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "e4db6d5580af57bf992f59068d4ea26fd518574ff48d7639b255a36f9de6e7e9" +dependencies = [ + "proc-macro2", + "quote", + "syn", +] + +[[package]] +name = "native-tls" +version = "0.2.18" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "465500e14ea162429d264d44189adc38b199b62b1c21eea9f69e4b73cb03bbf2" +dependencies = [ + "libc", + "log", + "openssl", + "openssl-probe", + "openssl-sys", + "schannel", + "security-framework", + "security-framework-sys", + "tempfile", +] + +[[package]] +name = "nom" +version = "7.1.3" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "d273983c5a657a70a3e8f2a01329822f3b8c8172b73826411a55751e404a0a4a" +dependencies = [ + "memchr", + "minimal-lexical", +] + +[[package]] +name = "num" +version = "0.4.3" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "35bd024e8b2ff75562e5f34e7f4905839deb4b22955ef5e73d2fea1b9813cb23" +dependencies = [ + "num-bigint", + "num-complex", + "num-integer", + "num-iter", + "num-rational", + "num-traits", +] + +[[package]] +name = "num-bigint" +version = "0.4.6" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "a5e44f723f1133c9deac646763579fdb3ac745e418f2a7af9cd0c431da1f20b9" +dependencies = [ + "num-integer", + "num-traits", +] + +[[package]] +name = "num-complex" +version = "0.4.6" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "73f88a1307638156682bada9d7604135552957b7818057dcef22705b4d509495" +dependencies = [ + "bytemuck", + "num-traits", +] + +[[package]] +name = "num-integer" +version = "0.1.46" +source = 
"registry+https://github.com/rust-lang/crates.io-index" +checksum = "7969661fd2958a5cb096e56c8e1ad0444ac2bbcd0061bd28660485a44879858f" +dependencies = [ + "num-traits", +] + +[[package]] +name = "num-iter" +version = "0.1.45" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "1429034a0490724d0075ebb2bc9e875d6503c3cf69e235a8941aa757d83ef5bf" +dependencies = [ + "autocfg", + "num-integer", + "num-traits", +] + +[[package]] +name = "num-rational" +version = "0.4.2" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "f83d14da390562dca69fc84082e73e548e1ad308d24accdedd2720017cb37824" +dependencies = [ + "num-bigint", + "num-integer", + "num-traits", +] + +[[package]] +name = "num-traits" +version = "0.2.19" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "071dfc062690e90b734c0b2273ce72ad0ffa95f0c74596bc250dcfd960262841" +dependencies = [ + "autocfg", + "libm", +] + +[[package]] +name = "num_cpus" +version = "1.17.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "91df4bbde75afed763b708b7eee1e8e7651e02d97f6d5dd763e89367e957b23b" +dependencies = [ + "hermit-abi", + "libc", +] + +[[package]] +name = "num_enum" +version = "0.7.6" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "5d0bca838442ec211fa11de3a8b0e0e8f3a4522575b5c4c06ed722e005036f26" +dependencies = [ + "num_enum_derive", + "rustversion", +] + +[[package]] +name = "num_enum_derive" +version = "0.7.6" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "680998035259dcfcafe653688bf2aa6d3e2dc05e98be6ab46afb089dc84f1df8" +dependencies = [ + "proc-macro-crate", + "proc-macro2", + "quote", + "syn", +] + +[[package]] +name = "number_prefix" +version = "0.4.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "830b246a0e5f20af87141b25c173cd1b609bd7779a4617d6ec582abaf90870f3" + +[[package]] +name = "once_cell" +version = 
"1.21.4" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "9f7c3e4beb33f85d45ae3e3a1792185706c8e16d043238c593331cc7cd313b50" + +[[package]] +name = "onig" +version = "6.5.3" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "0cc3cbf698f9438986c11a880c90a6d04b9de27575afd28bbf45b154b6c709e2" +dependencies = [ + "bitflags 2.11.1", + "libc", + "once_cell", + "onig_sys", +] + +[[package]] +name = "onig_sys" +version = "69.9.3" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "1e68317604e77e53b85896388e1a803c1d21b74c899ec9e5e1112db90735edd7" +dependencies = [ + "cc", + "pkg-config", +] + +[[package]] +name = "openssl" +version = "0.10.80" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "a45fa2aa886c42762255da344f0a0d313e254066c46aad76f300c3d3da62d967" +dependencies = [ + "bitflags 2.11.1", + "cfg-if", + "foreign-types", + "libc", + "openssl-macros", + "openssl-sys", +] + +[[package]] +name = "openssl-macros" +version = "0.1.1" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "a948666b637a0f465e8564c73e89d4dde00d72d4d473cc972f390fc3dcee7d9c" +dependencies = [ + "proc-macro2", + "quote", + "syn", +] + +[[package]] +name = "openssl-probe" +version = "0.2.1" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "7c87def4c32ab89d880effc9e097653c8da5d6ef28e6b539d313baaacfbafcbe" + +[[package]] +name = "openssl-sys" +version = "0.9.116" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "f28a22dc7140cda5f096e5e7724a6962ca81a7f8bfd2979f9b18c11af56318c4" +dependencies = [ + "cc", + "libc", + "pkg-config", + "vcpkg", +] + +[[package]] +name = "option-ext" +version = "0.2.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "04744f49eae99ab78e0d5c0b603ab218f515ea8cfe5a456d7629ad883a3b6e7d" + +[[package]] +name = "paste" +version = "1.0.15" +source = 
"registry+https://github.com/rust-lang/crates.io-index" +checksum = "57c0d7b74b563b49d38dae00a0c37d4d6de9b432382b2892f0574ddcae73fd0a" + +[[package]] +name = "percent-encoding" +version = "2.3.2" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "9b4f627cb1b25917193a259e49bdad08f671f8d9708acfd5fe0a8c1455d87220" + +[[package]] +name = "pin-project-lite" +version = "0.2.17" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "a89322df9ebe1c1578d689c92318e070967d1042b512afbe49518723f4e6d5cd" + +[[package]] +name = "pkg-config" +version = "0.3.33" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "19f132c84eca552bf34cab8ec81f1c1dcc229b811638f9d283dceabe58c5569e" + +[[package]] +name = "portable-atomic" +version = "1.13.1" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "c33a9471896f1c69cecef8d20cbe2f7accd12527ce60845ff44c153bb2a21b49" + +[[package]] +name = "potential_utf" +version = "0.1.5" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "0103b1cef7ec0cf76490e969665504990193874ea05c85ff9bab8b911d0a0564" +dependencies = [ + "zerovec", +] + +[[package]] +name = "ppv-lite86" +version = "0.2.21" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "85eae3c4ed2f50dcfe72643da4befc30deadb458a9b590d720cde2f2b1e97da9" +dependencies = [ + "zerocopy", +] + +[[package]] +name = "prettyplease" +version = "0.2.37" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "479ca8adacdd7ce8f1fb39ce9ecccbfe93a3f1344b3d0d97f20bc0196208f62b" +dependencies = [ + "proc-macro2", + "syn", +] + +[[package]] +name = "proc-macro-crate" +version = "3.5.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "e67ba7e9b2b56446f1d419b1d807906278ffa1a658a8a5d8a39dcb1f5a78614f" +dependencies = [ + "toml_edit", +] + +[[package]] +name = "proc-macro2" +version = "1.0.106" +source = 
"registry+https://github.com/rust-lang/crates.io-index" +checksum = "8fd00f0bb2e90d81d1044c2b32617f68fcb9fa3bb7640c23e9c748e53fb30934" +dependencies = [ + "unicode-ident", +] + +[[package]] +name = "pulp" +version = "0.18.22" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "a0a01a0dc67cf4558d279f0c25b0962bd08fc6dec0137699eae304103e882fe6" +dependencies = [ + "bytemuck", + "libm", + "num-complex", + "reborrow", +] + +[[package]] +name = "pulp" +version = "0.21.5" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "96b86df24f0a7ddd5e4b95c94fc9ed8a98f1ca94d3b01bdce2824097e7835907" +dependencies = [ + "bytemuck", + "cfg-if", + "libm", + "num-complex", + "reborrow", + "version_check", +] + +[[package]] +name = "quote" +version = "1.0.45" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "41f2619966050689382d2b44f664f4bc593e129785a36d6ee376ddf37259b924" +dependencies = [ + "proc-macro2", +] + +[[package]] +name = "r-efi" +version = "5.3.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "69cdb34c158ceb288df11e18b4bd39de994f6657d83847bdffdbd7f346754b0f" + +[[package]] +name = "r-efi" +version = "6.0.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "f8dcc9c7d52a811697d2151c701e0d08956f92b0e24136cf4cf27b57a6a0d9bf" + +[[package]] +name = "rand" +version = "0.8.6" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "5ca0ecfa931c29007047d1bc58e623ab12e5590e8c7cc53200d5202b69266d8a" +dependencies = [ + "libc", + "rand_chacha 0.3.1", + "rand_core 0.6.4", +] + +[[package]] +name = "rand" +version = "0.9.4" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "44c5af06bb1b7d3216d91932aed5265164bf384dc89cd6ba05cf59a35f5f76ea" +dependencies = [ + "rand_chacha 0.9.0", + "rand_core 0.9.5", +] + +[[package]] +name = "rand_chacha" +version = "0.3.1" +source = 
"registry+https://github.com/rust-lang/crates.io-index" +checksum = "e6c10a63a0fa32252be49d21e7709d4d4baf8d231c2dbce1eaa8141b9b127d88" +dependencies = [ + "ppv-lite86", + "rand_core 0.6.4", +] + +[[package]] +name = "rand_chacha" +version = "0.9.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "d3022b5f1df60f26e1ffddd6c66e8aa15de382ae63b3a0c1bfc0e4d3e3f325cb" +dependencies = [ + "ppv-lite86", + "rand_core 0.9.5", +] + +[[package]] +name = "rand_core" +version = "0.6.4" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "ec0be4795e2f6a28069bec0b5ff3e2ac9bafc99e6a9a7dc3547996c5c816922c" +dependencies = [ + "getrandom 0.2.17", +] + +[[package]] +name = "rand_core" +version = "0.9.5" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "76afc826de14238e6e8c374ddcc1fa19e374fd8dd986b0d2af0d02377261d83c" +dependencies = [ + "getrandom 0.3.4", +] + +[[package]] +name = "rand_distr" +version = "0.5.1" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "6a8615d50dcf34fa31f7ab52692afec947c4dd0ab803cc87cb3b0b4570ff7463" +dependencies = [ + "num-traits", + "rand 0.9.4", +] + +[[package]] +name = "raw-cpuid" +version = "10.7.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "6c297679cb867470fa8c9f67dbba74a78d78e3e98d7cf2b08d6d71540f797332" +dependencies = [ + "bitflags 1.3.2", +] + +[[package]] +name = "raw-cpuid" +version = "11.6.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "498cd0dc59d73224351ee52a95fee0f1a617a2eae0e7d9d720cc622c73a54186" +dependencies = [ + "bitflags 2.11.1", +] + +[[package]] +name = "rayon" +version = "1.12.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "fb39b166781f92d482534ef4b4b1b2568f42613b53e5b6c160e24cfbfa30926d" +dependencies = [ + "either", + "rayon-core", +] + +[[package]] +name = "rayon-cond" +version = "0.3.0" +source = 
"registry+https://github.com/rust-lang/crates.io-index" +checksum = "059f538b55efd2309c9794130bc149c6a553db90e9d99c2030785c82f0bd7df9" +dependencies = [ + "either", + "itertools 0.11.0", + "rayon", +] + +[[package]] +name = "rayon-core" +version = "1.13.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "22e18b0f0062d30d4230b2e85ff77fdfe4326feb054b9783a3460d8435c8ab91" +dependencies = [ + "crossbeam-deque", + "crossbeam-utils", +] + +[[package]] +name = "reborrow" +version = "0.5.5" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "03251193000f4bd3b042892be858ee50e8b3719f2b08e5833ac4353724632430" + +[[package]] +name = "redis" +version = "0.27.6" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "09d8f99a4090c89cc489a94833c901ead69bfbf3877b4867d5482e321ee875bc" +dependencies = [ + "arc-swap", + "combine", + "itertools 0.13.0", + "itoa", + "num-bigint", + "percent-encoding", + "ryu", + "url", +] + +[[package]] +name = "redox_users" +version = "0.4.6" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "ba009ff324d1fc1b900bd1fdb31564febe58a8ccc8a6fdbb93b543d33b13ca43" +dependencies = [ + "getrandom 0.2.17", + "libredox", + "thiserror", +] + +[[package]] +name = "regex" +version = "1.12.3" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "e10754a14b9137dd7b1e3e5b0493cc9171fdd105e0ab477f51b72e7f3ac0e276" +dependencies = [ + "aho-corasick", + "memchr", + "regex-automata", + "regex-syntax", +] + +[[package]] +name = "regex-automata" +version = "0.4.14" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "6e1dd4122fc1595e8162618945476892eefca7b88c52820e74af6262213cae8f" +dependencies = [ + "aho-corasick", + "memchr", + "regex-syntax", +] + +[[package]] +name = "regex-syntax" +version = "0.8.10" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = 
"dc897dd8d9e8bd1ed8cdad82b5966c3e0ecae09fb1907d58efaa013543185d0a" + +[[package]] +name = "ring" +version = "0.17.14" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "a4689e6c2294d81e88dc6261c768b63bc4fcdb852be6d1352498b114f61383b7" +dependencies = [ + "cc", + "cfg-if", + "getrandom 0.2.17", + "libc", + "untrusted", + "windows-sys 0.52.0", +] + +[[package]] +name = "rustix" +version = "1.1.4" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "b6fe4565b9518b83ef4f91bb47ce29620ca828bd32cb7e408f0062e9930ba190" +dependencies = [ + "bitflags 2.11.1", + "errno", + "libc", + "linux-raw-sys", + "windows-sys 0.61.2", +] + +[[package]] +name = "rustls" +version = "0.23.40" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "ef86cd5876211988985292b91c96a8f2d298df24e75989a43a3c73f2d4d8168b" +dependencies = [ + "log", + "once_cell", + "ring", + "rustls-pki-types", + "rustls-webpki", + "subtle", + "zeroize", +] + +[[package]] +name = "rustls-pki-types" +version = "1.14.1" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "30a7197ae7eb376e574fe940d068c30fe0462554a3ddbe4eca7838e049c937a9" +dependencies = [ + "zeroize", +] + +[[package]] +name = "rustls-webpki" +version = "0.103.13" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "61c429a8649f110dddef65e2a5ad240f747e85f7758a6bccc7e5777bd33f756e" +dependencies = [ + "ring", + "rustls-pki-types", + "untrusted", +] + +[[package]] +name = "rustversion" +version = "1.0.22" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "b39cdef0fa800fc44525c84ccb54a029961a8215f9619753635a9c0d2538d46d" + +[[package]] +name = "ryu" +version = "1.0.23" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "9774ba4a74de5f7b1c1451ed6cd5285a32eddb5cccb8cc655a4e50009e06477f" + +[[package]] +name = "safetensors" +version = "0.4.5" +source = 
"registry+https://github.com/rust-lang/crates.io-index" +checksum = "44560c11236a6130a46ce36c836a62936dc81ebf8c36a37947423571be0e55b6" +dependencies = [ + "serde", + "serde_json", +] + +[[package]] +name = "same-file" +version = "1.0.6" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "93fc1dc3aaa9bfed95e02e6eadabb4baf7e3078b0bd1b4d7b6b0b68378900502" +dependencies = [ + "winapi-util", +] + +[[package]] +name = "schannel" +version = "0.1.29" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "91c1b7e4904c873ef0710c1f407dde2e6287de2bebc1bbbf7d430bb7cbffd939" +dependencies = [ + "windows-sys 0.61.2", +] + +[[package]] +name = "security-framework" +version = "3.7.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "b7f4bc775c73d9a02cde8bf7b2ec4c9d12743edf609006c7facc23998404cd1d" +dependencies = [ + "bitflags 2.11.1", + "core-foundation", + "core-foundation-sys", + "libc", + "security-framework-sys", +] + +[[package]] +name = "security-framework-sys" +version = "2.17.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "6ce2691df843ecc5d231c0b14ece2acc3efb62c0a398c7e1d875f3983ce020e3" +dependencies = [ + "core-foundation-sys", + "libc", +] + +[[package]] +name = "semcache-demo" +version = "0.1.0" +dependencies = [ + "byteorder", + "candle-core", + "candle-nn", + "candle-transformers", + "getrandom 0.2.17", + "hf-hub", + "redis", + "serde", + "serde_json", + "tiny_http", + "tokenizers", + "url", +] + +[[package]] +name = "semver" +version = "1.0.28" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "8a7852d02fc848982e0c167ef163aaff9cd91dc640ba85e263cb1ce46fae51cd" + +[[package]] +name = "seq-macro" +version = "0.3.6" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "1bc711410fbe7399f390ca1c3b60ad0f53f80e95c5eb935e52268a0e2cd49acc" + +[[package]] +name = "serde" +version = "1.0.228" +source = 
"registry+https://github.com/rust-lang/crates.io-index" +checksum = "9a8e94ea7f378bd32cbbd37198a4a91436180c5bb472411e48b5ec2e2124ae9e" +dependencies = [ + "serde_core", + "serde_derive", +] + +[[package]] +name = "serde_core" +version = "1.0.228" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "41d385c7d4ca58e59fc732af25c3983b67ac852c1a25000afe1175de458b67ad" +dependencies = [ + "serde_derive", +] + +[[package]] +name = "serde_derive" +version = "1.0.228" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "d540f220d3187173da220f885ab66608367b6574e925011a9353e4badda91d79" +dependencies = [ + "proc-macro2", + "quote", + "syn", +] + +[[package]] +name = "serde_json" +version = "1.0.149" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "83fc039473c5595ace860d8c4fafa220ff474b3fc6bfdb4293327f1a37e94d86" +dependencies = [ + "itoa", + "memchr", + "serde", + "serde_core", + "zmij", +] + +[[package]] +name = "serde_plain" +version = "1.0.2" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "9ce1fc6db65a611022b23a0dec6975d63fb80a302cb3388835ff02c097258d50" +dependencies = [ + "serde", +] + +[[package]] +name = "shlex" +version = "1.3.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "0fda2ff0d084019ba4d7c6f371c95d8fd75ce3524c3cb8fb653a3023f6323e64" + +[[package]] +name = "simd-adler32" +version = "0.3.9" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "703d5c7ef118737c72f1af64ad2f6f8c5e1921f818cdcb97b8fe6fc69bf66214" + +[[package]] +name = "slab" +version = "0.4.12" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "0c790de23124f9ab44544d7ac05d60440adc586479ce501c1d6d7da3cd8c9cf5" + +[[package]] +name = "smallvec" +version = "1.15.1" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "67b1b7a3b5fe4f1376887184045fcf45c69e92af734b7aaddc05fb777b6fbd03" 
+ +[[package]] +name = "spm_precompiled" +version = "0.1.4" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "5851699c4033c63636f7ea4cf7b7c1f1bf06d0cc03cfb42e711de5a5c46cf326" +dependencies = [ + "base64 0.13.1", + "nom", + "serde", + "unicode-segmentation", +] + +[[package]] +name = "stable_deref_trait" +version = "1.2.1" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "6ce2be8dc25455e1f91df71bfa12ad37d7af1092ae736f3a6cd0e37bc7810596" + +[[package]] +name = "strsim" +version = "0.11.1" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "7da8b5736845d9f2fcb837ea5d9e2628564b3b043a70948a3f0b778838c5fb4f" + +[[package]] +name = "subtle" +version = "2.6.1" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "13c2bddecc57b384dee18652358fb23172facb8a2c51ccc10d74c157bdea3292" + +[[package]] +name = "syn" +version = "2.0.117" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "e665b8803e7b1d2a727f4023456bbbbe74da67099c585258af0ad9c5013b9b99" +dependencies = [ + "proc-macro2", + "quote", + "unicode-ident", +] + +[[package]] +name = "synstructure" +version = "0.13.2" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "728a70f3dbaf5bab7f0c4b1ac8d7ae5ea60a4b5549c8a5914361c99147a709d2" +dependencies = [ + "proc-macro2", + "quote", + "syn", +] + +[[package]] +name = "sysctl" +version = "0.5.5" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "ec7dddc5f0fee506baf8b9fdb989e242f17e4b11c61dfbb0635b705217199eea" +dependencies = [ + "bitflags 2.11.1", + "byteorder", + "enum-as-inner", + "libc", + "thiserror", + "walkdir", +] + +[[package]] +name = "sysctl" +version = "0.6.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "01198a2debb237c62b6826ec7081082d951f46dbb64b0e8c7649a452230d1dfc" +dependencies = [ + "bitflags 2.11.1", + "byteorder", + "enum-as-inner", + 
"libc", + "thiserror", + "walkdir", +] + +[[package]] +name = "tempfile" +version = "3.27.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "32497e9a4c7b38532efcdebeef879707aa9f794296a4f0244f6f69e9bc8574bd" +dependencies = [ + "fastrand", + "getrandom 0.4.2", + "once_cell", + "rustix", + "windows-sys 0.61.2", +] + +[[package]] +name = "thiserror" +version = "1.0.69" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "b6aaf5339b578ea85b50e080feb250a3e8ae8cfcdff9a461c9ec2904bc923f52" +dependencies = [ + "thiserror-impl", +] + +[[package]] +name = "thiserror-impl" +version = "1.0.69" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "4fee6c4efc90059e10f81e6d42c60a18f76588c3d74cb83a0b242a2b6c7504c1" +dependencies = [ + "proc-macro2", + "quote", + "syn", +] + +[[package]] +name = "tiny_http" +version = "0.12.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "389915df6413a2e74fb181895f933386023c71110878cd0825588928e64cdc82" +dependencies = [ + "ascii", + "chunked_transfer", + "httpdate", + "log", +] + +[[package]] +name = "tinystr" +version = "0.8.3" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "c8323304221c2a851516f22236c5722a72eaa19749016521d6dff0824447d96d" +dependencies = [ + "displaydoc", + "zerovec", +] + +[[package]] +name = "tokenizers" +version = "0.20.4" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "3b08cc37428a476fc9e20ac850132a513a2e1ce32b6a31addf2b74fa7033b905" +dependencies = [ + "aho-corasick", + "derive_builder", + "esaxx-rs", + "getrandom 0.2.17", + "itertools 0.12.1", + "lazy_static", + "log", + "macro_rules_attribute", + "monostate", + "onig", + "paste", + "rand 0.8.6", + "rayon", + "rayon-cond", + "regex", + "regex-syntax", + "serde", + "serde_json", + "spm_precompiled", + "thiserror", + "unicode-normalization-alignments", + "unicode-segmentation", + "unicode_categories", 
+] + +[[package]] +name = "toml_datetime" +version = "1.1.1+spec-1.1.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "3165f65f62e28e0115a00b2ebdd37eb6f3b641855f9d636d3cd4103767159ad7" +dependencies = [ + "serde_core", +] + +[[package]] +name = "toml_edit" +version = "0.25.11+spec-1.1.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "0b59c4d22ed448339746c59b905d24568fcbb3ab65a500494f7b8c3e97739f2b" +dependencies = [ + "indexmap", + "toml_datetime", + "toml_parser", + "winnow", +] + +[[package]] +name = "toml_parser" +version = "1.1.2+spec-1.1.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "a2abe9b86193656635d2411dc43050282ca48aa31c2451210f4202550afb7526" +dependencies = [ + "winnow", +] + +[[package]] +name = "tracing" +version = "0.1.44" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "63e71662fa4b2a2c3a26f570f037eb95bb1f85397f3cd8076caed2f026a6d100" +dependencies = [ + "pin-project-lite", + "tracing-attributes", + "tracing-core", +] + +[[package]] +name = "tracing-attributes" +version = "0.1.31" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "7490cfa5ec963746568740651ac6781f701c9c5ea257c58e057f3ba8cf69e8da" +dependencies = [ + "proc-macro2", + "quote", + "syn", +] + +[[package]] +name = "tracing-core" +version = "0.1.36" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "db97caf9d906fbde555dd62fa95ddba9eecfd14cb388e4f491a66d74cd5fb79a" +dependencies = [ + "once_cell", +] + +[[package]] +name = "ug" +version = "0.1.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "03719c61a91b51541f076dfdba45caacf750b230cefaa4b32d6f5411c3f7f437" +dependencies = [ + "gemm 0.18.2", + "half", + "libloading", + "memmap2", + "num", + "num-traits", + "num_cpus", + "rayon", + "safetensors", + "serde", + "thiserror", + "tracing", + "yoke 0.7.5", +] + +[[package]] +name = 
"unicode-ident" +version = "1.0.24" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "e6e4313cd5fcd3dad5cafa179702e2b244f760991f45397d14d4ebf38247da75" + +[[package]] +name = "unicode-normalization-alignments" +version = "0.1.12" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "43f613e4fa046e69818dd287fdc4bc78175ff20331479dab6e1b0f98d57062de" +dependencies = [ + "smallvec", +] + +[[package]] +name = "unicode-segmentation" +version = "1.13.2" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "9629274872b2bfaf8d66f5f15725007f635594914870f65218920345aa11aa8c" + +[[package]] +name = "unicode-width" +version = "0.2.2" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "b4ac048d71ede7ee76d585517add45da530660ef4390e49b098733c6e897f254" + +[[package]] +name = "unicode-xid" +version = "0.2.6" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "ebc1c04c71510c7f702b52b7c350734c9ff1295c464a03335b00bb84fc54f853" + +[[package]] +name = "unicode_categories" +version = "0.1.1" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "39ec24b3121d976906ece63c9daad25b85969647682eee313cb5779fdd69e14e" + +[[package]] +name = "untrusted" +version = "0.9.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "8ecb6da28b8a351d773b68d5825ac39017e680750f980f3a1a85cd8dd28a47c1" + +[[package]] +name = "ureq" +version = "2.12.1" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "02d1a66277ed75f640d608235660df48c8e3c19f3b4edb6a263315626cc3c01d" +dependencies = [ + "base64 0.22.1", + "flate2", + "log", + "native-tls", + "once_cell", + "rustls", + "rustls-pki-types", + "serde", + "serde_json", + "url", + "webpki-roots 0.26.11", +] + +[[package]] +name = "url" +version = "2.5.8" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = 
"ff67a8a4397373c3ef660812acab3268222035010ab8680ec4215f38ba3d0eed" +dependencies = [ + "form_urlencoded", + "idna", + "percent-encoding", + "serde", +] + +[[package]] +name = "utf8_iter" +version = "1.0.4" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "b6c140620e7ffbb22c2dee59cafe6084a59b5ffc27a8859a5f0d494b5d52b6be" + +[[package]] +name = "vcpkg" +version = "0.2.15" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "accd4ea62f7bb7a82fe23066fb0957d48ef677f6eeb8215f372f52e48bb32426" + +[[package]] +name = "version_check" +version = "0.9.5" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "0b928f33d975fc6ad9f86c8f283853ad26bdd5b10b7f1542aa2fa15e2289105a" + +[[package]] +name = "walkdir" +version = "2.5.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "29790946404f91d9c5d06f9874efddea1dc06c5efe94541a7d6863108e3a5e4b" +dependencies = [ + "same-file", + "winapi-util", +] + +[[package]] +name = "wasi" +version = "0.11.1+wasi-snapshot-preview1" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "ccf3ec651a847eb01de73ccad15eb7d99f80485de043efb2f370cd654f4ea44b" + +[[package]] +name = "wasip2" +version = "1.0.3+wasi-0.2.9" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "20064672db26d7cdc89c7798c48a0fdfac8213434a1186e5ef29fd560ae223d6" +dependencies = [ + "wit-bindgen 0.57.1", +] + +[[package]] +name = "wasip3" +version = "0.4.0+wasi-0.3.0-rc-2026-01-06" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "5428f8bf88ea5ddc08faddef2ac4a67e390b88186c703ce6dbd955e1c145aca5" +dependencies = [ + "wit-bindgen 0.51.0", +] + +[[package]] +name = "wasm-bindgen" +version = "0.2.121" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "49ace1d07c165b0864824eee619580c4689389afa9dc9ed3a4c75040d82e6790" +dependencies = [ + "cfg-if", + "once_cell", + 
"rustversion", + "wasm-bindgen-macro", + "wasm-bindgen-shared", +] + +[[package]] +name = "wasm-bindgen-macro" +version = "0.2.121" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "8e68e6f4afd367a562002c05637acb8578ff2dea1943df76afb9e83d177c8578" +dependencies = [ + "quote", + "wasm-bindgen-macro-support", +] + +[[package]] +name = "wasm-bindgen-macro-support" +version = "0.2.121" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "d95a9ec35c64b2a7cb35d3fead40c4238d0940c86d107136999567a4703259f2" +dependencies = [ + "bumpalo", + "proc-macro2", + "quote", + "syn", + "wasm-bindgen-shared", +] + +[[package]] +name = "wasm-bindgen-shared" +version = "0.2.121" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "c4e0100b01e9f0d03189a92b96772a1fb998639d981193d7dbab487302513441" +dependencies = [ + "unicode-ident", +] + +[[package]] +name = "wasm-encoder" +version = "0.244.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "990065f2fe63003fe337b932cfb5e3b80e0b4d0f5ff650e6985b1048f62c8319" +dependencies = [ + "leb128fmt", + "wasmparser", +] + +[[package]] +name = "wasm-metadata" +version = "0.244.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "bb0e353e6a2fbdc176932bbaab493762eb1255a7900fe0fea1a2f96c296cc909" +dependencies = [ + "anyhow", + "indexmap", + "wasm-encoder", + "wasmparser", +] + +[[package]] +name = "wasmparser" +version = "0.244.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "47b807c72e1bac69382b3a6fb3dbe8ea4c0ed87ff5629b8685ae6b9a611028fe" +dependencies = [ + "bitflags 2.11.1", + "hashbrown 0.15.5", + "indexmap", + "semver", +] + +[[package]] +name = "web-time" +version = "1.1.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "5a6580f308b1fad9207618087a65c04e7a10bc77e02c8e84e9b00dd4b12fa0bb" +dependencies = [ + "js-sys", + "wasm-bindgen", +] + 
+[[package]] +name = "webpki-roots" +version = "0.26.11" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "521bc38abb08001b01866da9f51eb7c5d647a19260e00054a8c7fd5f9e57f7a9" +dependencies = [ + "webpki-roots 1.0.7", +] + +[[package]] +name = "webpki-roots" +version = "1.0.7" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "52f5ee44c96cf55f1b349600768e3ece3a8f26010c05265ab73f945bb1a2eb9d" +dependencies = [ + "rustls-pki-types", +] + +[[package]] +name = "winapi-util" +version = "0.1.11" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "c2a7b1c03c876122aa43f3020e6c3c3ee5c05081c9a00739faf7503aeba10d22" +dependencies = [ + "windows-sys 0.61.2", +] + +[[package]] +name = "windows-link" +version = "0.2.1" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "f0805222e57f7521d6a62e36fa9163bc891acd422f971defe97d64e70d0a4fe5" + +[[package]] +name = "windows-sys" +version = "0.48.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "677d2418bec65e3338edb076e806bc1ec15693c5d0104683f2efe857f61056a9" +dependencies = [ + "windows-targets 0.48.5", +] + +[[package]] +name = "windows-sys" +version = "0.52.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "282be5f36a8ce781fad8c8ae18fa3f9beff57ec1b52cb3de0789201425d9a33d" +dependencies = [ + "windows-targets 0.52.6", +] + +[[package]] +name = "windows-sys" +version = "0.59.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "1e38bc4d79ed67fd075bcc251a1c39b32a1776bbe92e5bef1f0bf1f8c531853b" +dependencies = [ + "windows-targets 0.52.6", +] + +[[package]] +name = "windows-sys" +version = "0.61.2" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "ae137229bcbd6cdf0f7b80a31df61766145077ddf49416a728b02cb3921ff3fc" +dependencies = [ + "windows-link", +] + +[[package]] +name = "windows-targets" +version = "0.48.5" 
+source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "9a2fa6e2155d7247be68c096456083145c183cbbbc2764150dda45a87197940c" +dependencies = [ + "windows_aarch64_gnullvm 0.48.5", + "windows_aarch64_msvc 0.48.5", + "windows_i686_gnu 0.48.5", + "windows_i686_msvc 0.48.5", + "windows_x86_64_gnu 0.48.5", + "windows_x86_64_gnullvm 0.48.5", + "windows_x86_64_msvc 0.48.5", +] + +[[package]] +name = "windows-targets" +version = "0.52.6" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "9b724f72796e036ab90c1021d4780d4d3d648aca59e491e6b98e725b84e99973" +dependencies = [ + "windows_aarch64_gnullvm 0.52.6", + "windows_aarch64_msvc 0.52.6", + "windows_i686_gnu 0.52.6", + "windows_i686_gnullvm", + "windows_i686_msvc 0.52.6", + "windows_x86_64_gnu 0.52.6", + "windows_x86_64_gnullvm 0.52.6", + "windows_x86_64_msvc 0.52.6", +] + +[[package]] +name = "windows_aarch64_gnullvm" +version = "0.48.5" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "2b38e32f0abccf9987a4e3079dfb67dcd799fb61361e53e2882c3cbaf0d905d8" + +[[package]] +name = "windows_aarch64_gnullvm" +version = "0.52.6" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "32a4622180e7a0ec044bb555404c800bc9fd9ec262ec147edd5989ccd0c02cd3" + +[[package]] +name = "windows_aarch64_msvc" +version = "0.48.5" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "dc35310971f3b2dbbf3f0690a219f40e2d9afcf64f9ab7cc1be722937c26b4bc" + +[[package]] +name = "windows_aarch64_msvc" +version = "0.52.6" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "09ec2a7bb152e2252b53fa7803150007879548bc709c039df7627cabbd05d469" + +[[package]] +name = "windows_i686_gnu" +version = "0.48.5" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "a75915e7def60c94dcef72200b9a8e58e5091744960da64ec734a6c6e9b3743e" + +[[package]] +name = "windows_i686_gnu" +version = "0.52.6" +source = 
"registry+https://github.com/rust-lang/crates.io-index" +checksum = "8e9b5ad5ab802e97eb8e295ac6720e509ee4c243f69d781394014ebfe8bbfa0b" + +[[package]] +name = "windows_i686_gnullvm" +version = "0.52.6" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "0eee52d38c090b3caa76c563b86c3a4bd71ef1a819287c19d586d7334ae8ed66" + +[[package]] +name = "windows_i686_msvc" +version = "0.48.5" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "8f55c233f70c4b27f66c523580f78f1004e8b5a8b659e05a4eb49d4166cca406" + +[[package]] +name = "windows_i686_msvc" +version = "0.52.6" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "240948bc05c5e7c6dabba28bf89d89ffce3e303022809e73deaefe4f6ec56c66" + +[[package]] +name = "windows_x86_64_gnu" +version = "0.48.5" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "53d40abd2583d23e4718fddf1ebec84dbff8381c07cae67ff7768bbf19c6718e" + +[[package]] +name = "windows_x86_64_gnu" +version = "0.52.6" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "147a5c80aabfbf0c7d901cb5895d1de30ef2907eb21fbbab29ca94c5b08b1a78" + +[[package]] +name = "windows_x86_64_gnullvm" +version = "0.48.5" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "0b7b52767868a23d5bab768e390dc5f5c55825b6d30b86c844ff2dc7414044cc" + +[[package]] +name = "windows_x86_64_gnullvm" +version = "0.52.6" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "24d5b23dc417412679681396f2b49f3de8c1473deb516bd34410872eff51ed0d" + +[[package]] +name = "windows_x86_64_msvc" +version = "0.48.5" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "ed94fce61571a4006852b7389a063ab983c02eb1bb37b47f8272ce92d06d9538" + +[[package]] +name = "windows_x86_64_msvc" +version = "0.52.6" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = 
"589f6da84c646204747d1270a2a5661ea66ed1cced2631d546fdfb155959f9ec" + +[[package]] +name = "winnow" +version = "1.0.3" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "0592e1c9d151f854e6fd382574c3a0855250e1d9b2f99d9281c6e6391af352f1" +dependencies = [ + "memchr", +] + +[[package]] +name = "wit-bindgen" +version = "0.51.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "d7249219f66ced02969388cf2bb044a09756a083d0fab1e566056b04d9fbcaa5" +dependencies = [ + "wit-bindgen-rust-macro", +] + +[[package]] +name = "wit-bindgen" +version = "0.57.1" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "1ebf944e87a7c253233ad6766e082e3cd714b5d03812acc24c318f549614536e" + +[[package]] +name = "wit-bindgen-core" +version = "0.51.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "ea61de684c3ea68cb082b7a88508a8b27fcc8b797d738bfc99a82facf1d752dc" +dependencies = [ + "anyhow", + "heck", + "wit-parser", +] + +[[package]] +name = "wit-bindgen-rust" +version = "0.51.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "b7c566e0f4b284dd6561c786d9cb0142da491f46a9fbed79ea69cdad5db17f21" +dependencies = [ + "anyhow", + "heck", + "indexmap", + "prettyplease", + "syn", + "wasm-metadata", + "wit-bindgen-core", + "wit-component", +] + +[[package]] +name = "wit-bindgen-rust-macro" +version = "0.51.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "0c0f9bfd77e6a48eccf51359e3ae77140a7f50b1e2ebfe62422d8afdaffab17a" +dependencies = [ + "anyhow", + "prettyplease", + "proc-macro2", + "quote", + "syn", + "wit-bindgen-core", + "wit-bindgen-rust", +] + +[[package]] +name = "wit-component" +version = "0.244.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "9d66ea20e9553b30172b5e831994e35fbde2d165325bec84fc43dbf6f4eb9cb2" +dependencies = [ + "anyhow", + "bitflags 2.11.1", + "indexmap", + "log", + 
"serde", + "serde_derive", + "serde_json", + "wasm-encoder", + "wasm-metadata", + "wasmparser", + "wit-parser", +] + +[[package]] +name = "wit-parser" +version = "0.244.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "ecc8ac4bc1dc3381b7f59c34f00b67e18f910c2c0f50015669dde7def656a736" +dependencies = [ + "anyhow", + "id-arena", + "indexmap", + "log", + "semver", + "serde", + "serde_derive", + "serde_json", + "unicode-xid", + "wasmparser", +] + +[[package]] +name = "writeable" +version = "0.6.3" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "1ffae5123b2d3fc086436f8834ae3ab053a283cfac8fe0a0b8eaae044768a4c4" + +[[package]] +name = "yoke" +version = "0.7.5" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "120e6aef9aa629e3d4f52dc8cc43a015c7724194c97dfaf45180d2daf2b77f40" +dependencies = [ + "serde", + "stable_deref_trait", + "yoke-derive 0.7.5", + "zerofrom", +] + +[[package]] +name = "yoke" +version = "0.8.2" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "abe8c5fda708d9ca3df187cae8bfb9ceda00dd96231bed36e445a1a48e66f9ca" +dependencies = [ + "stable_deref_trait", + "yoke-derive 0.8.2", + "zerofrom", +] + +[[package]] +name = "yoke-derive" +version = "0.7.5" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "2380878cad4ac9aac1e2435f3eb4020e8374b5f13c296cb75b4620ff8e229154" +dependencies = [ + "proc-macro2", + "quote", + "syn", + "synstructure", +] + +[[package]] +name = "yoke-derive" +version = "0.8.2" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "de844c262c8848816172cef550288e7dc6c7b7814b4ee56b3e1553f275f1858e" +dependencies = [ + "proc-macro2", + "quote", + "syn", + "synstructure", +] + +[[package]] +name = "zerocopy" +version = "0.8.48" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "eed437bf9d6692032087e337407a86f04cd8d6a16a37199ed57949d415bd68e9" 
+dependencies = [ + "zerocopy-derive", +] + +[[package]] +name = "zerocopy-derive" +version = "0.8.48" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "70e3cd084b1788766f53af483dd21f93881ff30d7320490ec3ef7526d203bad4" +dependencies = [ + "proc-macro2", + "quote", + "syn", +] + +[[package]] +name = "zerofrom" +version = "0.1.8" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "0ec05a11813ea801ff6d75110ad09cd0824ddba17dfe17128ea0d5f68e6c5272" +dependencies = [ + "zerofrom-derive", +] + +[[package]] +name = "zerofrom-derive" +version = "0.1.7" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "11532158c46691caf0f2593ea8358fed6bbf68a0315e80aae9bd41fbade684a1" +dependencies = [ + "proc-macro2", + "quote", + "syn", + "synstructure", +] + +[[package]] +name = "zeroize" +version = "1.8.2" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "b97154e67e32c85465826e8bcc1c59429aaaf107c1e4a9e53c8d8ccd5eff88d0" + +[[package]] +name = "zerotrie" +version = "0.2.4" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "0f9152d31db0792fa83f70fb2f83148effb5c1f5b8c7686c3459e361d9bc20bf" +dependencies = [ + "displaydoc", + "yoke 0.8.2", + "zerofrom", +] + +[[package]] +name = "zerovec" +version = "0.11.6" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "90f911cbc359ab6af17377d242225f4d75119aec87ea711a880987b18cd7b239" +dependencies = [ + "yoke 0.8.2", + "zerofrom", + "zerovec-derive", +] + +[[package]] +name = "zerovec-derive" +version = "0.11.3" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "625dc425cab0dca6dc3c3319506e6593dcb08a9f387ea3b284dbd52a92c40555" +dependencies = [ + "proc-macro2", + "quote", + "syn", +] + +[[package]] +name = "zip" +version = "1.1.4" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = 
"9cc23c04387f4da0374be4533ad1208cbb091d5c11d070dfef13676ad6497164" +dependencies = [ + "arbitrary", + "crc32fast", + "crossbeam-utils", + "displaydoc", + "indexmap", + "num_enum", + "thiserror", +] + +[[package]] +name = "zmij" +version = "1.0.21" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "b8848ee67ecc8aedbaf3e4122217aff892639231befc6a1b58d29fff4c2cabaa" diff --git a/content/develop/use-cases/semantic-cache/rust/Cargo.toml b/content/develop/use-cases/semantic-cache/rust/Cargo.toml new file mode 100644 index 0000000000..d084be3cd6 --- /dev/null +++ b/content/develop/use-cases/semantic-cache/rust/Cargo.toml @@ -0,0 +1,50 @@ +[package] +name = "semcache-demo" +version = "0.1.0" +edition = "2021" +publish = false + +[[bin]] +name = "semcache-demo" +path = "src/main.rs" + +# Optimise the release build. The candle-based embedder is CPU-bound on +# token throughput; without -O the first encode takes long enough to be +# noticeable in the demo, and the docs example tells readers to use +# --release for that reason. +[profile.release] +opt-level = 3 +lto = "thin" +codegen-units = 1 + +[dependencies] +redis = { version = "0.27", default-features = false } +tiny_http = "0.12" +serde = { version = "1", features = ["derive"] } +serde_json = "1" +byteorder = "1" +url = "2" + +# Candle is the embedder. We need: +# - candle-core for the tensor type +# - candle-nn for module trait implementations BERT pulls in +# - candle-transformers for the BertModel + config types +# - tokenizers for the HuggingFace WordPiece tokenizer +# - hf-hub for fetching the model weights and tokenizer on first run +candle-core = "0.8" +candle-nn = "0.8" +candle-transformers = "0.8" +tokenizers = { version = "0.20", default-features = false, features = ["onig"] } +hf-hub = "0.3" + +# CLI flag parsing without the clap dependency — keeps the build leaner +# and the surface area tiny. 
The args set is small enough that an +# in-house parser is cheaper than another crate in the lockfile. +# (We just walk env::args in main.rs.) + +# UUID-style ids for cache entries. The Python and Node demos use 12 +# hex characters; we match the shape by hex-encoding 6 random bytes +# read from /dev/urandom via getrandom. +getrandom = "0.2" + +[dev-dependencies] diff --git a/content/develop/use-cases/semantic-cache/rust/_index.md b/content/develop/use-cases/semantic-cache/rust/_index.md new file mode 100644 index 0000000000..490ef7507c --- /dev/null +++ b/content/develop/use-cases/semantic-cache/rust/_index.md @@ -0,0 +1,278 @@ +--- +categories: +- docs +- develop +- stack +- oss +- rs +- rc +description: Build a Redis-backed semantic cache for LLM responses in Rust with redis-rs and Candle +linkTitle: redis-rs example (Rust) +title: Redis semantic cache with redis-rs +weight: 5 +--- + +This guide shows you how to build a small Redis-backed semantic cache for LLM responses in Rust with [`redis-rs`]({{< relref "/develop/clients" >}}) and the [Candle](https://github.com/huggingface/candle) machine-learning framework. It includes a local web server built with the [`tiny_http`](https://crates.io/crates/tiny_http) crate so you can send paraphrased prompts at a mock LLM, watch the cache decide hit or miss, sweep the cosine-distance threshold, and see the cumulative latency and token savings build up. + +## Overview + +Each cache entry is stored as a single Redis [Hash]({{< relref "/develop/data-types/hashes" >}}) at `cache:`. The hash holds the original prompt, the LLM's response, the raw `float32` bytes of a 384-dimensional embedding of the prompt, and metadata fields — tenant, locale, model version, safety flag — plus a `created_ts` and a `hit_count`. 
A single [Redis Search]({{< relref "/develop/ai/search-and-query" >}}) index covers the embedding field and every metadata field, so one [`FT.SEARCH`]({{< relref "/commands/ft.search" >}}) call with a `KNN` clause does the vector lookup *and* the TAG pre-filter in the same round trip — no cross-store joins. + +The lookup is thresholded: [`FT.SEARCH`]({{< relref "/commands/ft.search" >}}) always returns the nearest entry that satisfies the filters, but the application only serves it as a hit when the reported cosine distance is at or below `distance_threshold`. Anything further away is treated as a miss; the caller runs the LLM and writes the new prompt, response, and embedding back to the same key pattern with a TTL. + +The embedder is [Candle](https://github.com/huggingface/candle) running the [`sentence-transformers/all-MiniLM-L6-v2`](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) model, which is the same encoder the [Python example]({{< relref "/develop/use-cases/semantic-cache/redis-py" >}}), the [Node.js example]({{< relref "/develop/use-cases/semantic-cache/nodejs" >}}), the [Go example]({{< relref "/develop/use-cases/semantic-cache/go" >}}), and the [Java example]({{< relref "/develop/use-cases/semantic-cache/java-jedis" >}}) use. Embeddings produced by the five implementations are semantically equivalent — paraphrase distances differ only at the fourth decimal place — so a cache populated by one demo can be queried by another against the same Redis instance. + +That gives you: + +* A single round trip for lookup — vector KNN + metadata pre-filter in one [`FT.SEARCH`]({{< relref "/commands/ft.search" >}}). +* Tens of milliseconds on a hit vs. a multi-second LLM call on a miss; the embedding step is the bottleneck either way, and that's a model-side cost, not a Redis one. +* Tenant, locale, and model-version isolation enforced inside the query, not in application code — a write under one tenant cannot be served to another. 
+* Bounded memory: every entry has an [`EXPIRE`]({{< relref "/commands/expire" >}}) TTL, and a database-level [eviction policy]({{< relref "/develop/reference/eviction" >}}) (LRU / LFU) caps the cache size under pressure. + +## How it works + +A query goes through three stages: **embed**, **lookup**, and (on a miss) **call the LLM and write back**. + +### Hit path (the goal) + +1. The application calls `embedder.encode_one(prompt)` to turn the incoming text into a 384-element `Vec<f32>`. +2. `cache.lookup(&query_vec, LookupParams { tenant, locale, model_version, .. })` runs [`FT.SEARCH`]({{< relref "/commands/ft.search" >}}) with a TAG pre-filter and a `KNN 1` clause. Redis returns the closest cached prompt that satisfies the filters along with its cosine distance. +3. If the distance is at or below the threshold, the cache returns a `LookupResult::Hit` carrying the cached response. The helper also runs an [`HINCRBY`]({{< relref "/commands/hincrby" >}}) on `hit_count` and an [`EXPIRE`]({{< relref "/commands/expire" >}}) refresh inside a [`MULTI/EXEC`]({{< relref "/commands/multi" >}}), so a frequently used answer keeps its TTL and the demo UI can see which entries are load-bearing. +4. The LLM is not called at all. The application returns the cached response to the user. + +### Miss path + +When the distance is above the threshold, or there is no candidate in scope at all, the helper returns a `LookupResult::Miss` instead, carrying the distance of the nearest candidate (if any) for logging. The application then: + +1. Calls the LLM with the prompt. +2. Calls `cache.put(PutParams { prompt, response, embedding: &query_vec, .. })`. The same embedding the lookup used is reused, so the prompt is not re-encoded. The helper writes the Hash with [`HSET`]({{< relref "/commands/hset" >}}) and an [`EXPIRE`]({{< relref "/commands/expire" >}}) TTL inside a single [`MULTI/EXEC`]({{< relref "/commands/multi" >}}) so the entry never lands without a TTL on a partial failure. +3. 
Returns the LLM's response to the user. The next semantically similar prompt under the same metadata scope will be a hit. + +## The cache helper + +The `RedisSemanticCache` struct wraps the Redis Search index and the lookup / write flow +([source](https://github.com/redis/docs/blob/main/content/develop/use-cases/semantic-cache/rust/src/cache.rs)): + +```rust +use redis::Client; +// `cache`, `embeddings`, and `seed_cache` are the demo's local modules. +use crate::cache::{LookupParams, LookupResult, PutParams, RedisSemanticCache}; +use crate::embeddings::LocalEmbedder; + +let client = Client::open("redis://localhost:6379/")?; + +let cache = RedisSemanticCache::new( + client, + "semcache:idx", + "cache:", + 384, // vector dimension + 0.5, // cosine distance threshold; lower = stricter + 3600, // default TTL in seconds (one hour) +)?; +let embedder = LocalEmbedder::new(None)?; // sentence-transformers/all-MiniLM-L6-v2 + +// One-time index setup (idempotent). +cache.create_index()?; + +// 1) Embed the prompt. +let prompt = "How do I return an item?"; +let query_vec = embedder.encode_one(prompt)?; + +// 2) Look up under a metadata scope. The TAG filter and the KNN +// travel together in one FT.SEARCH. +let result = cache.lookup( + &query_vec, + LookupParams { + tenant: Some("acme"), + locale: Some("en"), + model_version: Some("gpt-4.5-2026"), + safety: None, + distance_threshold: None, + }, +)?; + +let response = match result { + LookupResult::Hit(hit) => { + println!("hit ({:.3}): {}", hit.distance, hit.response); + hit.response + } + LookupResult::Miss(_) => { + // 3a) Miss — call the LLM. (Use your real client here.) + let llm_response = call_llm(prompt); + + // 3b) Cache the new entry. Reuses the same embedding bytes + // the lookup used, so we don't pay the encoder twice. 
+ cache.put(PutParams { + prompt, + response: &llm_response, + embedding: &query_vec, + tenant: "acme", + locale: "en", + model_version: "gpt-4.5-2026", + safety: "ok", + ttl_seconds: None, + entry_id: None, + })?; + llm_response + } +}; +``` + +### Data model + +Each cache entry is one Redis Hash. The vector field is raw little-endian `float32` bytes — no JSON wrapping — because the Redis Search vector encoding expects exactly that. The helper uses `byteorder::LittleEndian::write_f32` rather than `f32::to_le_bytes` (which yields the same little-endian bytes on every host) so the encoding contract is visible in the source and consistent with the [Go example]({{< relref "/develop/use-cases/semantic-cache/go" >}})'s `binary.LittleEndian.PutUint32`. + +```text +cache:7c3f8a1b9e02 + prompt=How do I return an item? + response=You can return any unworn item within 30 days... + tenant=acme + locale=en + model_version=gpt-4.5-2026 + safety=ok + created_ts=1715990400.123 + hit_count=4 + embedding=<384 × float32 little-endian bytes> +``` + +The Redis Search index schema treats every field as queryable in its natural type: + +```text +FT.CREATE semcache:idx + ON HASH PREFIX 1 cache: + SCHEMA + prompt TEXT + response TEXT + tenant TAG + locale TAG + model_version TAG + safety TAG + created_ts NUMERIC SORTABLE + hit_count NUMERIC SORTABLE + embedding VECTOR HNSW 6 TYPE FLOAT32 DIM 384 DISTANCE_METRIC COSINE +``` + +### The query + +The lookup is a hybrid query: a TAG pre-filter expression in parentheses, then `=>[KNN 1 @embedding $vec]`. With `DIALECT 2`, Redis applies the filter first and KNN-ranks only the matching documents. 
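A minimal sketch of how the parenthesised TAG pre-filter and the KNN clause might be assembled into one query string. It is loosely modeled on the helper's filter-clause code, but `build_filter` and `hybrid_query` are hypothetical names with simplified signatures, and the escape list is an assumption rather than an authoritative grammar.

```rust
/// Characters treated here as query syntax inside a TAG value.
/// Assumption: illustrative list, not an authoritative Redis Search grammar.
const TAG_SPECIAL: &str = "\\,.<>{}[]\"':;!@#$%^&*()-+=~| ";

/// Backslash-escape every special character in a TAG value.
fn escape_tag_value(value: &str) -> String {
    let mut out = String::with_capacity(value.len());
    for ch in value.chars() {
        if TAG_SPECIAL.contains(ch) {
            out.push('\\');
        }
        out.push(ch);
    }
    out
}

/// Join `(field, value)` pairs into a parenthesised pre-filter,
/// skipping empty values. Falls back to `(*)` when nothing is set.
/// Hypothetical helper: the name and signature are for illustration only.
fn build_filter(pairs: &[(&str, &str)]) -> String {
    let clauses: Vec<String> = pairs
        .iter()
        .filter(|(_, v)| !v.is_empty())
        .map(|(field, v)| format!("@{}:{{{}}}", field, escape_tag_value(v)))
        .collect();
    if clauses.is_empty() {
        "(*)".to_string()
    } else {
        format!("({})", clauses.join(" "))
    }
}

/// Append the KNN clause to the pre-filter to form the hybrid query.
fn hybrid_query(filter: &str) -> String {
    format!("{}=>[KNN 1 @embedding $vec AS distance]", filter)
}
```

Escaping matters because a model version such as `gpt-4.5-2026` contains `-` and `.`, both of which would otherwise be parsed as query syntax inside the `{...}` block.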
`redis-rs` doesn't ship a typed `FT.SEARCH` builder, so the helper composes the command directly: + +```rust +let value: redis::Value = redis::cmd("FT.SEARCH") + .arg("semcache:idx") + .arg("(@tenant:{acme} @locale:{en} @model_version:{gpt\\-4\\.5\\-2026} @safety:{ok})\ + =>[KNN 1 @embedding $vec AS distance]") + .arg("PARAMS").arg(2).arg("vec").arg(&vec_bytes[..]) + .arg("SORTBY").arg("distance").arg("ASC") + .arg("RETURN").arg(7) + .arg("prompt").arg("response").arg("tenant").arg("locale") + .arg("model_version").arg("hit_count").arg("distance") + .arg("LIMIT").arg(0).arg(1) + .arg("DIALECT").arg(2) + .query(&mut con)?; +``` + +`distance` is the cosine *distance* (0 means identical, 2 means opposite). The result is sorted ascending, so the top row is the closest candidate. The application inspects `distance` against the threshold and decides hit or miss in user code — Redis returns the row either way, and treating it as a hit or a miss is a policy decision the cache helper owns, not a server-side filter. + +## The mock LLM + +To make the latency and token savings visible without requiring an API key, `mock_llm.rs` provides a deterministic stand-in +([source](https://github.com/redis/docs/blob/main/content/develop/use-cases/semantic-cache/rust/src/mock_llm.rs)): + +```rust +use crate::mock_llm::MockLlm; + +let llm = MockLlm::new(None, 1500.0); // 1.5 seconds per call +let resp = llm.complete("What is your return policy?"); +// resp.response — the templated answer text +// resp.latency_ms — wall-clock time the call took +// resp.total_tokens() — estimated prompt + completion tokens +``` + +The mock sleeps for the configured latency, then keyword-matches against a small FAQ table to produce an answer. The deliberate slowness is what makes a hit visibly cheaper than a miss in the demo. 
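The sleep-then-keyword-match behaviour described above can be sketched as follows. This is not the code in `mock_llm.rs`: `SketchMockLlm`, `SketchResponse`, and the two-row FAQ table are hypothetical stand-ins invented for illustration, using only the standard library.

```rust
use std::thread::sleep;
use std::time::{Duration, Instant};

/// Hypothetical stand-in for a mock LLM; names are illustrative only.
struct SketchMockLlm {
    latency: Duration,
    /// (keyword, canned answer) pairs. This table is invented for the sketch.
    faq: Vec<(&'static str, &'static str)>,
}

/// What one simulated completion returns.
struct SketchResponse {
    response: String,
    latency_ms: f64,
}

impl SketchMockLlm {
    fn new(latency_ms: u64) -> Self {
        Self {
            latency: Duration::from_millis(latency_ms),
            faq: vec![
                ("return", "You can return any unworn item within 30 days."),
                ("deliver", "Standard delivery takes three to five business days."),
            ],
        }
    }

    /// Sleep for the configured latency, then answer from the first
    /// FAQ row whose keyword appears in the lower-cased prompt.
    fn complete(&self, prompt: &str) -> SketchResponse {
        let started = Instant::now();
        sleep(self.latency);
        let lowered = prompt.to_lowercase();
        let answer = self
            .faq
            .iter()
            .find(|(keyword, _)| lowered.contains(*keyword))
            .map(|(_, answer)| *answer)
            .unwrap_or("I'm not sure; please contact support.");
        SketchResponse {
            response: answer.to_string(),
            latency_ms: started.elapsed().as_secs_f64() * 1000.0,
        }
    }
}
```

Because the answer depends only on the prompt text, repeated runs give the same response, which keeps hit and miss timings comparable.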
In production code, you would replace `MockLlm` with your real client of choice — an OpenAI Rust SDK, an Anthropic SDK, an internal vLLM endpoint, anything — without changing the cache helper. + +## Pre-seeding the cache + +In a real deployment the cache fills up organically: a first-time question is a miss, the LLM answers, and the response is written back. For the demo, `seed_cache.rs` pre-loads a small set of canonical FAQ prompts so the very first query lands on a hit +([source](https://github.com/redis/docs/blob/main/content/develop/use-cases/semantic-cache/rust/src/seed_cache.rs)): + +```rust +cache.create_index()?; +seed(&cache, &embedder, SeedOptions { + tenant: "acme", + locale: "en", + model_version: "gpt-4.5-2026", +})?; +``` + +The seed list stores the canonical phrasing of each question ("What is your return policy?"). Paraphrases of any of these prompts ("How do I return an item?", "Can I get a refund?") embed close to the canonical entry, so the cache lookup serves the stored response without ever calling the model. + +## The interactive demo + +`main.rs` runs a [`tiny_http`](https://crates.io/crates/tiny_http) server. The HTML page lets you: + +* Type a prompt and toggle metadata: tenant, locale, model version. Each combination is a separate cache namespace inside the same index. +* Slide the cosine-distance threshold and see hits flip to misses (and back) on the same prompt, with the actual distance reported on each query. +* Submit with **Ask** to run the full hit-or-miss path (calls the LLM on a miss, writes the answer back). Submit with **Lookup only (no LLM)** to sweep the threshold against a fixed prompt without polluting the cache. +* Watch the cumulative panel build up: total queries, cache hits, cache misses, hit ratio, tokens not spent, LLM milliseconds not waited. +* Inspect every cached entry, including remaining TTL and total hit count, and drop individual entries to simulate eviction. 
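A rough sketch of the bookkeeping behind the threshold slider and the cumulative panel listed above. `SessionStats` and its fields are hypothetical, and the sketch assumes the caller supplies an estimate of the tokens and LLM milliseconds a hit avoided; the demo's real accounting may differ.

```rust
/// Hypothetical session counters; names are illustrative only.
#[derive(Debug, Default)]
struct SessionStats {
    hits: u64,
    misses: u64,
    tokens_saved: u64,
    llm_ms_saved: f64,
}

/// A candidate counts as a hit when its cosine distance is at or
/// below the threshold.
fn is_hit(distance: f64, threshold: f64) -> bool {
    distance <= threshold
}

impl SessionStats {
    /// Record one query. `distance` is `None` when no candidate was in
    /// scope. `est_tokens` and `est_llm_ms` are caller-supplied estimates
    /// (an assumption of this sketch) of what a hit avoided.
    fn record(
        &mut self,
        distance: Option<f64>,
        threshold: f64,
        est_tokens: u64,
        est_llm_ms: f64,
    ) -> bool {
        let hit = distance.map_or(false, |d| is_hit(d, threshold));
        if hit {
            self.hits += 1;
            self.tokens_saved += est_tokens;
            self.llm_ms_saved += est_llm_ms;
        } else {
            self.misses += 1;
        }
        hit
    }

    fn total(&self) -> u64 {
        self.hits + self.misses
    }

    /// Fraction of queries served from the cache; 0.0 before any query.
    fn hit_ratio(&self) -> f64 {
        if self.total() == 0 {
            0.0
        } else {
            self.hits as f64 / self.total() as f64
        }
    }
}
```

With the same prompt at distance 0.49, a threshold of 0.5 records a hit and a threshold of 0.4 records a miss, which is the flip the slider makes visible.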
+ +The server holds one `LocalEmbedder`, one `RedisSemanticCache`, and one `MockLlm` for the lifetime of the process. The HTML page is shared with the Python, Node.js, Go, and Java demos and is loaded from `index.html` next to `main.rs`. Endpoints: + +| Endpoint | What it does | +|-----------------|-------------------------------------------------------------------------------| +| `GET /state` | Index info and the full list of cached entries. | +| `POST /query` | Embed the prompt, run `FT.SEARCH`, on miss call the LLM and write back. | +| `POST /reset` | Drop every cached entry and re-seed from the FAQ list. | +| `POST /drop` | Delete a single cached entry by id. | + +## Run the demo locally + +1. Clone the [`redis/docs`](https://github.com/redis/docs) repository and change into the example + directory: + + ```bash + git clone https://github.com/redis/docs.git + cd docs/content/develop/use-cases/semantic-cache/rust + ``` + +2. Make sure a recent Rust toolchain is installed (Rust 1.75 or newer). + [`rustup`](https://rustup.rs/) is the usual route. The build pulls in + Candle and the Hugging Face tokenizer crates, so the first `cargo + build` takes a few minutes. + +3. Make sure a Redis instance with the Redis Search module is running locally on + port 6379. [Redis Stack]({{< relref "/operate/oss_and_stack/install/install-stack" >}}) or + [Redis 8 with Search]({{< relref "/develop/ai/search-and-query" >}}) both work. + +4. Build and run the demo server. The first run downloads the + `sentence-transformers/all-MiniLM-L6-v2` weights (~87 MB) into the local + Hugging Face cache: + + ```bash + cargo run --release + ``` + + Candle uses a pure-Rust tensor backend, so there's no native ONNX + Runtime library to install. If you'd prefer the higher-throughput ONNX + Runtime backend you can swap the `BertModel` forward call for an `ort` + `Session::run` — the cache helper, mock LLM, and HTTP server are + unchanged. + +5. 
Open and try some queries: + + * **"What is your return policy?"** — exact match against the seed, distance ≈ 0, + hit at any threshold. + * **"How fast is delivery?"** — paraphrase of the shipping seed; distance + around 0.30, hit at the default threshold of 0.5. + * **"How do I return an item?"** — slightly looser paraphrase of the returns + seed; distance around 0.49, still a hit at the default threshold. Slide + the threshold down to 0.4 to see this one flip to a miss. + * **"What payment methods do you accept?"** — unrelated to anything in the + seed; distance > 0.6, so you'll see a miss, the mock LLM kicks in for + ~1.5 s, the new answer is cached, and a follow-up of the same question + is now an immediate hit. + * Switch the **Tenant** dropdown to `globex` or `initech` and re-ask any + seeded question — the result flips to a miss because the cache entries + live under `acme`. That's the metadata pre-filter at work inside `FT.SEARCH`. + +The server is read/write against your local Redis. The default index name is `semcache:idx` and entry keys live under `cache:`. Flags mirror the Python, Node.js, Go, and Java demos: `--no-reset` to keep an existing cache across restarts, `--threshold` to change the default cosine-distance cutoff, or `--llm-latency-ms` to make the mock LLM faster or slower for the demo. diff --git a/content/develop/use-cases/semantic-cache/rust/index.html b/content/develop/use-cases/semantic-cache/rust/index.html new file mode 100644 index 0000000000..e897cfdee7 --- /dev/null +++ b/content/develop/use-cases/semantic-cache/rust/index.html @@ -0,0 +1,513 @@ + + + + + + Redis Semantic Cache Demo + + + +
+
loading…
+

Redis Semantic Cache Demo

+

+ A small semantic cache sits in front of a mock LLM. Each cache + entry is a Hash at __KEY_PREFIX__<id> holding + the prompt, the response, the prompt's 384-dimensional embedding, + and metadata fields. A single FT.SEARCH on + __INDEX_NAME__ does the KNN against cached prompts + with a TAG pre-filter (tenant, locale, model version, safety) in + the same round trip. If the closest cached prompt is within the + cosine-distance threshold, the demo serves the cached response + and the LLM is not called at all. +

+ +
+ +
+

Ask the LLM

+

Type a question, optionally adjust the metadata filters and + the distance threshold, and submit. The server embeds the + prompt, runs FT.SEARCH with KNN over the cache, + and either serves the cached response (hit) or runs the mock + LLM and writes the new response back to the cache (miss).

+ + +
+
+ + +
+
+ + +
+
+ + +
+
+
+ + + 0.50 +
+

+ The cache serves a hit when the closest cached prompt's + cosine distance is at or below this threshold. Lower = + stricter (fewer hits, safer reuse); higher = looser (more + hits, more risk of serving a near-miss). +

+ + + + + +
+
+ +
+

Cumulative savings

+

Every hit avoids one LLM round trip. The numbers below add + up across the session — tokens that would have been spent and + wall-clock seconds that would have been waited if the cache + had not served the answer.

+
+
+
0
+
Total queries
+
+
+
0
+
Cache hits
+
+
+
0
+
Cache misses
+
+
+
0%
+
Hit ratio
+
+
+
0
+
Tokens saved
+
+
+
0 ms
+
LLM time saved
+
+
+
+ +
+

Index state

+
+ +
+ +
+

Cached entries

+

Every prompt/response pair currently in the cache. + hit_count is the running total of times the entry + has served a hit; ttl is the remaining lifetime + in seconds before EXPIRE drops the key. Click + Drop to simulate eviction.

+ + + + + + + + + + + + +
IDPromptMetadataHitsTTL
+
+ +
+ +
+
+ + + + diff --git a/content/develop/use-cases/semantic-cache/rust/src/cache.rs b/content/develop/use-cases/semantic-cache/rust/src/cache.rs new file mode 100644 index 0000000000..79af117090 --- /dev/null +++ b/content/develop/use-cases/semantic-cache/rust/src/cache.rs @@ -0,0 +1,734 @@ +//! Redis semantic-cache helper backed by Redis Search. +//! +//! Each cache entry lives as a Hash document at `cache:`. The hash +//! stores the user's prompt and the corresponding LLM response +//! alongside the raw float32 bytes of the prompt's 384-dimensional +//! embedding and a small set of metadata fields — tenant, locale, +//! model version, and a safety flag. +//! +//! A single Redis Search index covers the embedding plus every +//! metadata field, so one `FT.SEARCH` call does an +//! approximate-nearest-neighbour lookup against the cached prompts +//! with a TAG pre-filter applied in the same pass — no cross-store +//! joins, no extra round trips, and tenant isolation is enforced +//! *inside* the query rather than after the fact in application code. +//! +//! The lookup is thresholded: `FT.SEARCH` always returns the closest +//! cached prompt, but the cache only serves it as a hit when the +//! cosine distance is at or below `distance_threshold`. Anything +//! further away is treated as a miss; the caller is expected to run +//! the underlying LLM and write the new prompt, response, and +//! embedding back with `put`. +//! +//! Each cache entry is written with `EXPIRE`, so stale answers age out +//! without manual cleanup; combine with an `allkeys-lfu` eviction +//! policy on the database to cap memory under pressure too. 
+ +use std::sync::Mutex; +use std::time::{SystemTime, UNIX_EPOCH}; + +use redis::{Client, Commands, Connection, FromRedisValue, RedisError, Value}; + +use crate::embeddings::floats_to_bytes; + +pub const VECTOR_DIM_DEFAULT: usize = 384; + +#[derive(Debug)] +pub enum CacheError { + Redis(RedisError), + ShapeMismatch { expected: usize, got: usize }, + Parse(String), +} + +impl std::fmt::Display for CacheError { + fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result { + match self { + CacheError::Redis(e) => write!(f, "redis: {}", e), + CacheError::ShapeMismatch { expected, got } => write!( + f, + "embedding has dimension {}; index expects {}", + got, expected + ), + CacheError::Parse(msg) => write!(f, "parse: {}", msg), + } + } +} + +impl std::error::Error for CacheError {} + +impl From<RedisError> for CacheError { + fn from(e: RedisError) -> Self { + CacheError::Redis(e) + } +} + +// `tenant`, `locale`, and `model_version` aren't read by the demo +// HTTP layer (the UI only displays distance + response) but they are +// part of the public CacheHit surface for any caller that wants to +// audit which scope served a hit. `nearest_id` is similarly part of +// the diagnostic surface for "candidate too far" misses even though +// the UI ignores it. +#[derive(Debug, Clone)] +#[allow(dead_code)] +pub struct CacheHit { + pub id: String, + pub prompt: String, + pub response: String, + pub tenant: String, + pub locale: String, + pub model_version: String, + pub distance: f64, + pub ttl_seconds: i64, + pub hit_count: i64, +} + +#[derive(Debug, Clone, Default)] +#[allow(dead_code)] +pub struct CacheMiss { + /// `None` means "no candidate in scope at all"; `Some` means + /// "candidate too far". The demo UI uses that distinction to + /// display either "no candidate" or "candidate too far". 
+ pub nearest_distance: Option<f64>, + pub nearest_id: Option<String>, +} + +#[derive(Debug)] +pub enum LookupResult { + Hit(CacheHit), + Miss(CacheMiss), +} + +pub struct LookupParams<'a> { + pub tenant: Option<&'a str>, + pub locale: Option<&'a str>, + pub model_version: Option<&'a str>, + /// `Some("ok")` matches Python; pass `Some("-")` to skip the + /// safety filter; pass `None` to use the default ("ok"). + pub safety: Option<&'a str>, + pub distance_threshold: Option<f64>, +} + +pub struct PutParams<'a> { + pub prompt: &'a str, + pub response: &'a str, + pub embedding: &'a [f32], + pub tenant: &'a str, + pub locale: &'a str, + pub model_version: &'a str, + pub safety: &'a str, + pub ttl_seconds: Option<i64>, + pub entry_id: Option<&'a str>, +} + +#[derive(Debug, Clone, Default)] +pub struct IndexInfo { + pub num_docs: i64, + pub indexing_failures: i64, + pub vector_index_size_mb: f64, +} + +#[derive(Debug, Clone, serde::Serialize)] +pub struct Entry { + pub id: String, + pub prompt: String, + pub response: String, + pub tenant: String, + pub locale: String, + pub model_version: String, + pub safety: String, + pub hit_count: i64, + pub ttl_seconds: i64, + pub created_ts: f64, +} + +pub struct RedisSemanticCache { + client: Client, + /// The cache holds the only Redis connection. The demo is + /// single-threaded for /query (one tiny_http worker thread handles + /// each request) but we need synchronisation because the same + /// cache instance is shared across worker threads. 
+ conn: Mutex<Connection>, + pub index_name: String, + pub key_prefix: String, + pub vector_dim: usize, + pub distance_threshold: f64, + pub default_ttl_seconds: i64, +} + +impl RedisSemanticCache { + pub fn new( + client: Client, + index_name: impl Into<String>, + key_prefix: impl Into<String>, + vector_dim: usize, + distance_threshold: f64, + default_ttl_seconds: i64, + ) -> Result<Self, CacheError> { + let conn = client.get_connection()?; + Ok(Self { + client, + conn: Mutex::new(conn), + index_name: index_name.into(), + key_prefix: key_prefix.into(), + vector_dim, + distance_threshold, + default_ttl_seconds, + }) + } + + pub fn entry_key(&self, entry_id: &str) -> String { + format!("{}{}", self.key_prefix, entry_id) + } + + /// Create the Redis Search index if it doesn't already exist. One + /// index covers the embedding plus every metadata field, so a + /// single FT.SEARCH can pre-filter by tenant / locale / model and + /// then KNN-rank the matching documents in one pass. + pub fn create_index(&self) -> Result<(), CacheError> { + let mut con = self.conn.lock().unwrap(); + let result: Result<Value, RedisError> = redis::cmd("FT.CREATE") + .arg(&self.index_name) + .arg("ON") + .arg("HASH") + .arg("PREFIX") + .arg(1) + .arg(&self.key_prefix) + .arg("SCHEMA") + .arg("prompt").arg("TEXT") + .arg("response").arg("TEXT") + .arg("tenant").arg("TAG") + .arg("locale").arg("TAG") + .arg("model_version").arg("TAG") + .arg("safety").arg("TAG") + .arg("created_ts").arg("NUMERIC").arg("SORTABLE") + .arg("hit_count").arg("NUMERIC").arg("SORTABLE") + .arg("embedding") + .arg("VECTOR").arg("HNSW").arg(6) + .arg("TYPE").arg("FLOAT32") + .arg("DIM").arg(self.vector_dim as i64) + .arg("DISTANCE_METRIC").arg("COSINE") + .query(&mut *con); + + match result { + Ok(_) => Ok(()), + Err(e) => { + // Redis returns this either as "Index already exists" + // (older builds) or "Index: already exists" (newer + // RESP3 ServerError formatting in redis-rs 0.27). 
The + // distinguishing token is the bare word "exists" in + // an index-related error. 
+ let msg = e.to_string().to_lowercase(); + if msg.contains("exists") && msg.contains("index") { + Ok(()) + } else { + Err(CacheError::Redis(e)) + } + } + } + } + + pub fn drop_index(&self, delete_documents: bool) -> Result<(), CacheError> { + let mut con = self.conn.lock().unwrap(); + let mut cmd = redis::cmd("FT.DROPINDEX"); + cmd.arg(&self.index_name); + if delete_documents { + cmd.arg("DD"); + } + match cmd.query::<Value>(&mut *con) { + Ok(_) => Ok(()), + Err(e) => { + let msg = e.to_string().to_lowercase(); + if msg.contains("no such index") || msg.contains("unknown index name") { + Ok(()) + } else { + Err(CacheError::Redis(e)) + } + } + } + } + + /// Run the thresholded FT.SEARCH and decide hit vs. miss. + /// + /// FT.SEARCH returns the single nearest entry that satisfies the + /// TAG pre-filters. The lookup is a hit only if the reported + /// cosine distance is at or below the threshold (instance default + /// or override). Anything further away is a miss with the + /// candidate distance attached so the caller can log it. + /// + /// On a hit, the entry's `hit_count` is incremented atomically + /// with HINCRBY so the demo UI can show which entries are + /// load-bearing. The TTL is refreshed on every hit so frequently + /// used answers don't age out under cold tail entries. + pub fn lookup( + &self, + query_vec: &[f32], + params: LookupParams<'_>, + ) -> Result<LookupResult, CacheError> { + // Match the shape check that `put` performs. A wrong-dim vector + // would otherwise hit Redis as a malformed FT.SEARCH parameter + // and surface as a server-side parse error instead of a clear + // caller-side error. 
+ if query_vec.len() != self.vector_dim { + return Err(CacheError::ShapeMismatch { + expected: self.vector_dim, + got: query_vec.len(), + }); + } + + let threshold = params + .distance_threshold + .unwrap_or(self.distance_threshold); + let safety = match params.safety { + Some("-") => None, + Some(s) if !s.is_empty() => Some(s), + Some(_) | None => Some("ok"), + }; + + let filter_clause = + build_filter_clause(params.tenant, params.locale, params.model_version, safety); + let query_str = format!("{}=>[KNN 1 @embedding $vec AS distance]", filter_clause); + let vec_bytes = floats_to_bytes(query_vec); + + let value: Value = { + let mut con = self.conn.lock().unwrap(); + redis::cmd("FT.SEARCH") + .arg(&self.index_name) + .arg(&query_str) + .arg("PARAMS").arg(2).arg("vec").arg(&vec_bytes[..]) + .arg("SORTBY").arg("distance").arg("ASC") + .arg("RETURN").arg(7) + .arg("prompt").arg("response").arg("tenant").arg("locale") + .arg("model_version").arg("hit_count").arg("distance") + .arg("LIMIT").arg(0).arg(1) + .arg("DIALECT").arg(2) + .query(&mut *con)? + }; + + let docs = parse_ft_search(&value)?; + if docs.is_empty() { + return Ok(LookupResult::Miss(CacheMiss::default())); + } + + let doc = &docs[0]; + let raw_key = &doc.id; + let entry_id = raw_key + .strip_prefix(&self.key_prefix) + .unwrap_or(raw_key) + .to_string(); + let distance: f64 = doc + .field("distance") + .and_then(|s| s.parse::<f64>().ok()) + .unwrap_or(0.0); + + if distance > threshold { + return Ok(LookupResult::Miss(CacheMiss { + nearest_distance: Some(distance), + nearest_id: Some(entry_id), + })); + } + + // The hash may have expired between FT.SEARCH returning the + // row and us getting here — the search index lags expirations + // by its periodic scan. If we just blindly HINCRBY-ed, Redis + // would helpfully recreate the hash with only `hit_count` set + // and the search index would then log it as an indexing + // failure (no embedding, no metadata). 
EXISTS narrows that + // race to the pipeline round-trip; a strictly race-free + // version would wrap the bump in a Lua script that checks + // existence and acts in one server-side step. + let entry_key = self.entry_key(&entry_id); + let exists: i64 = { + let mut con = self.conn.lock().unwrap(); + con.exists(&entry_key)? + }; + if exists == 0 { + return Ok(LookupResult::Miss(CacheMiss { + nearest_distance: Some(distance), + nearest_id: Some(entry_id), + })); + } + + // MULTI/EXEC the three writes so they apply as a unit on the + // server — a partial failure between HINCRBY and EXPIRE would + // otherwise leave the entry without a refreshed TTL. + let (new_hit_count, _expired, ttl): (i64, i64, i64) = { + let mut con = self.conn.lock().unwrap(); + redis::pipe() + .atomic() + .cmd("HINCRBY").arg(&entry_key).arg("hit_count").arg(1) + .cmd("EXPIRE").arg(&entry_key).arg(self.default_ttl_seconds) + .cmd("TTL").arg(&entry_key) + .query(&mut *con)? + }; + let ttl_seconds = if ttl > 0 { + ttl + } else { + self.default_ttl_seconds + }; + + Ok(LookupResult::Hit(CacheHit { + id: entry_id, + prompt: doc.field("prompt").unwrap_or_default().to_string(), + response: doc.field("response").unwrap_or_default().to_string(), + tenant: doc.field("tenant").unwrap_or_default().to_string(), + locale: doc.field("locale").unwrap_or_default().to_string(), + model_version: doc.field("model_version").unwrap_or_default().to_string(), + distance, + ttl_seconds, + hit_count: new_hit_count, + })) + } + + /// Write a new cache entry and return its id. + /// + /// The embedding is stored as raw little-endian float32 bytes — + /// the encoding Redis Search expects from a FLOAT32 vector field. + /// EXPIRE on the key gives every entry a bounded lifetime. 
+ pub fn put(&self, p: PutParams<'_>) -> Result<String, CacheError> { + if p.embedding.len() != self.vector_dim { + return Err(CacheError::ShapeMismatch { + expected: self.vector_dim, + got: p.embedding.len(), + }); + } + let entry_id = match p.entry_id { + Some(s) if !s.is_empty() => s.to_string(), + _ => new_entry_id(), + }; + let ttl = p.ttl_seconds.unwrap_or(self.default_ttl_seconds); + let key = self.entry_key(&entry_id); + + let created_ts = SystemTime::now() + .duration_since(UNIX_EPOCH) + .map(|d| d.as_secs_f64()) + .unwrap_or(0.0); + + let emb_bytes = floats_to_bytes(p.embedding); + + // MULTI/EXEC so HSET and EXPIRE either both apply or neither + // does. Without the transaction wrapper a connection drop + // between the two writes could leave the entry without a TTL + // and the cache would then keep an answer past its intended + // lifetime (or forever, on a database with no eviction + // policy). + let mut con = self.conn.lock().unwrap(); + redis::pipe() + .atomic() + .cmd("HSET") + .arg(&key) + .arg("prompt").arg(p.prompt) + .arg("response").arg(p.response) + .arg("tenant").arg(p.tenant) + .arg("locale").arg(p.locale) + .arg("model_version").arg(p.model_version) + .arg("safety").arg(p.safety) + .arg("created_ts").arg(format_f64(created_ts)) + .arg("hit_count").arg("0") + .arg("embedding").arg(&emb_bytes[..]) + .cmd("EXPIRE") + .arg(&key) + .arg(ttl) + .query::<Value>(&mut *con)?; + Ok(entry_id) + } + + /// Returns a small subset of FT.INFO. Failures (for example, an + /// index that hasn't been created yet) return zeroed counters + /// rather than surface as an error to the caller, since the demo + /// UI just renders "0 entries" in that case. 
+ pub fn index_info(&self) -> IndexInfo { + let mut con = match self.conn.lock() { + Ok(c) => c, + Err(_) => return IndexInfo::default(), + }; + let value: Value = match redis::cmd("FT.INFO") + .arg(&self.index_name) + .query(&mut *con) + { + Ok(v) => v, + Err(_) => return IndexInfo::default(), + }; + parse_ft_info(&value) + } + + /// Returns every cached entry (no embedding) for the admin panel. + /// The result is sorted by created_ts descending so the most + /// recently written entry is at the top of the table. + pub fn list_entries(&self, limit: usize) -> Result<Vec<Entry>, CacheError> { + let value: Value = { + let mut con = self.conn.lock().unwrap(); + redis::cmd("FT.SEARCH") + .arg(&self.index_name) + .arg("*") + .arg("RETURN").arg(8) + .arg("prompt").arg("response").arg("tenant").arg("locale") + .arg("model_version").arg("safety").arg("created_ts").arg("hit_count") + .arg("SORTBY").arg("created_ts").arg("DESC") + .arg("LIMIT").arg(0).arg(limit as i64) + .arg("DIALECT").arg(2) + .query(&mut *con)? 
+ }; + let docs = parse_ft_search(&value)?; + let mut out = Vec::with_capacity(docs.len()); + for doc in docs { + let entry_id = doc + .id + .strip_prefix(&self.key_prefix) + .unwrap_or(&doc.id) + .to_string(); + let ttl: i64 = { + let mut con = self.conn.lock().unwrap(); + redis::cmd("TTL").arg(self.entry_key(&entry_id)).query(&mut *con).unwrap_or(0) + }; + let ttl_seconds = if ttl > 0 { ttl } else { 0 }; + let hit_count: i64 = doc + .field("hit_count") + .and_then(|s| s.parse::<i64>().ok()) + .unwrap_or(0); + let created_ts: f64 = doc + .field("created_ts") + .and_then(|s| s.parse::<f64>().ok()) + .unwrap_or(0.0); + out.push(Entry { + id: entry_id, + prompt: doc.field("prompt").unwrap_or_default().to_string(), + response: doc.field("response").unwrap_or_default().to_string(), + tenant: doc.field("tenant").unwrap_or_default().to_string(), + locale: doc.field("locale").unwrap_or_default().to_string(), + model_version: doc.field("model_version").unwrap_or_default().to_string(), + safety: doc.field("safety").unwrap_or_default().to_string(), + hit_count, + ttl_seconds, + created_ts, + }); + } + // Belt-and-braces sort in case Redis returns an unsorted top-N. + out.sort_by(|a, b| b.created_ts.partial_cmp(&a.created_ts).unwrap_or(std::cmp::Ordering::Equal)); + Ok(out) + } + + pub fn delete_entry(&self, entry_id: &str) -> Result<bool, CacheError> { + let mut con = self.conn.lock().unwrap(); + let n: i64 = con.del(self.entry_key(entry_id))?; + Ok(n > 0) + } + + /// Drop the index and every cached entry, then recreate the + /// index. Returns the number of entries that were removed. + pub fn clear(&self) -> Result<i64, CacheError> { + let before = self.index_info().num_docs; + self.drop_index(true)?; + self.create_index()?; + Ok(before) + } + + /// Open a fresh Redis connection. Used by the smoketest harness + /// only; the regular hot path uses the connection cached in + /// `self.conn`. 
+ #[allow(dead_code)] + pub fn fresh_connection(&self) -> Result<Connection, CacheError> { + Ok(self.client.get_connection()?) 
+ } +} + +// ---- FT.SEARCH / FT.INFO response parsing ---------------------------- + +#[derive(Debug)] +pub struct SearchDoc { + pub id: String, + pub fields: Vec<(String, String)>, +} + +impl SearchDoc { + fn field(&self, name: &str) -> Option<&str> { + self.fields + .iter() + .find(|(k, _)| k == name) + .map(|(_, v)| v.as_str()) + } +} + +fn parse_ft_search(value: &Value) -> Result<Vec<SearchDoc>, CacheError> { + // FT.SEARCH RESP shape (RESP2): [count, key1, [field, val, ...], key2, [field, val, ...], ...] + let items = match value { + Value::Array(items) => items, + _ => { + return Err(CacheError::Parse( + "FT.SEARCH did not return an array".into(), + )) + } + }; + if items.is_empty() { + return Ok(vec![]); + } + // First element is the total count; ignore it. + let mut out = Vec::new(); + let mut iter = items.iter().skip(1); + while let Some(key_value) = iter.next() { + let key = redis_value_to_string(key_value) + .ok_or_else(|| CacheError::Parse("key is not a string".into()))?; + // The next element is either an array of [field, value, field, value, ...] + // or absent (if the user passed NOCONTENT). + let fields_value = match iter.next() { + Some(v) => v, + None => { + out.push(SearchDoc { + id: key, + fields: vec![], + }); + continue; + } + }; + let field_items: Vec<&Value> = match fields_value { + Value::Array(v) => v.iter().collect(), + _ => { + out.push(SearchDoc { + id: key, + fields: vec![], + }); + continue; + } + }; + let mut fields = Vec::new(); + let mut f_iter = field_items.into_iter(); + while let Some(name_val) = f_iter.next() { + let name = match redis_value_to_string(name_val) { + Some(s) => s, + None => continue, + }; + let value = f_iter + .next() + .and_then(redis_value_to_string) + .unwrap_or_default(); + fields.push((name, value)); + } + out.push(SearchDoc { id: key, fields }); + } + Ok(out) +} + +fn parse_ft_info(value: &Value) -> IndexInfo { + // FT.INFO returns a flat [k1, v1, k2, v2, ...] array. 
+ let items = match value { + Value::Array(items) => items, + _ => return IndexInfo::default(), + }; + let mut info = IndexInfo::default(); + let mut iter = items.iter(); + while let Some(k) = iter.next() { + let key = redis_value_to_string(k).unwrap_or_default(); + let v = match iter.next() { + Some(v) => v, + None => break, + }; + match key.as_str() { + "num_docs" => { + info.num_docs = match v { + Value::Int(n) => *n, + _ => redis_value_to_string(v) + .and_then(|s| s.parse::<i64>().ok()) + .unwrap_or(0), + }; + } + "hash_indexing_failures" => { + info.indexing_failures = match v { + Value::Int(n) => *n, + _ => redis_value_to_string(v) + .and_then(|s| s.parse::<i64>().ok()) + .unwrap_or(0), + }; + } + "vector_index_sz_mb" => { + info.vector_index_size_mb = match v { + Value::Double(d) => *d, + _ => redis_value_to_string(v) + .and_then(|s| s.parse::<f64>().ok()) + .unwrap_or(0.0), + }; + } + _ => {} + } + } + info +} + +fn redis_value_to_string(v: &Value) -> Option<String> { + match v { + Value::BulkString(bytes) => Some(String::from_utf8_lossy(bytes).into_owned()), + Value::SimpleString(s) => Some(s.clone()), + Value::VerbatimString { format: _, text } => Some(text.clone()), + Value::Int(n) => Some(n.to_string()), + Value::Double(d) => Some(d.to_string()), + Value::Boolean(b) => Some(b.to_string()), + Value::Okay => Some("OK".to_string()), + Value::Nil => None, + _ => { + // Fall back to FromRedisValue's String impl for any + // variants we didn't list explicitly (Map, Set, BigNumber, + // etc — these shouldn't appear in FT.SEARCH replies but + // future protocol additions could). + String::from_redis_value(v).ok() + } + } +} + +// ---- Filter clause helpers ------------------------------------------ + +/// The characters Redis Search treats as syntax inside a TAG value; +/// any of them in a user-supplied filter must be backslash-escaped or +/// the surrounding `{...}` block won't parse correctly. 
+const TAG_SPECIAL: &str = "\\,.<>{}[]\"':;!@#$%^&*()-+=~| "; + +pub fn escape_tag_value(v: &str) -> String { + let mut out = String::with_capacity(v.len()); + for ch in v.chars() { + if TAG_SPECIAL.contains(ch) { + out.push('\\'); + } + out.push(ch); + } + out +} + +pub fn build_filter_clause( + tenant: Option<&str>, + locale: Option<&str>, + model_version: Option<&str>, + safety: Option<&str>, +) -> String { + let mut clauses = Vec::new(); + if let Some(t) = tenant.filter(|s| !s.is_empty()) { + clauses.push(format!("@tenant:{{{}}}", escape_tag_value(t))); + } + if let Some(l) = locale.filter(|s| !s.is_empty()) { + clauses.push(format!("@locale:{{{}}}", escape_tag_value(l))); + } + if let Some(m) = model_version.filter(|s| !s.is_empty()) { + clauses.push(format!("@model_version:{{{}}}", escape_tag_value(m))); + } + if let Some(s) = safety.filter(|s| !s.is_empty()) { + clauses.push(format!("@safety:{{{}}}", escape_tag_value(s))); + } + if clauses.is_empty() { + "(*)".to_string() + } else { + format!("({})", clauses.join(" ")) + } +} + +fn new_entry_id() -> String { + let mut buf = [0u8; 6]; + getrandom::getrandom(&mut buf).expect("getrandom never fails on supported platforms"); + let mut s = String::with_capacity(12); + for b in buf { + s.push_str(&format!("{:02x}", b)); + } + s +} + +fn format_f64(v: f64) -> String { + // Match the Python/Node/Go demos: no scientific notation, no + // trailing zeros. e.g. 1715990400.123 + let s = format!("{}", v); + s +} diff --git a/content/develop/use-cases/semantic-cache/rust/src/embeddings.rs b/content/develop/use-cases/semantic-cache/rust/src/embeddings.rs new file mode 100644 index 0000000000..466e733bbb --- /dev/null +++ b/content/develop/use-cases/semantic-cache/rust/src/embeddings.rs @@ -0,0 +1,304 @@ +//! Local text-embedding helper backed by Candle. +//! +//! This is a thin wrapper around the sentence-transformers model +//! `sentence-transformers/all-MiniLM-L6-v2`: a 384-dimensional BERT +//! 
encoder that runs in-process on CPU through Candle's pure-Rust +//! tensor backend, needs no API key, and produces vectors numerically +//! equivalent to those of the reference PyTorch model from +//! sentence-transformers. +//! +//! Two things matter for parity with the Python / Node / Go / Jedis +//! demos: +//! +//! 1. **Mean pooling with the attention mask.** sentence-transformers +//! computes the sentence vector as the attention-mask-weighted +//! average of the per-token last-hidden-state vectors, *not* the +//! `[CLS]` vector. Doing CLS-only here would produce numerically +//! different vectors and the published distance benchmarks (0.30 +//! for "How fast is delivery?", 0.49 for "How do I return an +//! item?") would drift. +//! 2. **Explicit L2 normalisation.** With normalised vectors, cosine +//! distance reduces to `1 - dot product`, which is what Redis +//! Search reports for our `COSINE` HNSW field. Without +//! normalisation, the distances would be in a different range and +//! the 0.5 default threshold would be meaningless. +//! +//! The model weights are fetched from the Hugging Face Hub on first +//! run via `hf-hub`. Subsequent runs read from the local cache. 
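The two parity points above can be sketched without Candle at all. The following is a minimal illustration on plain `Vec<f32>` data, not the code this file ships; `mean_pool`, `l2_normalize`, and `cosine_distance_unit` are hypothetical names introduced only for this sketch.

```rust
/// Mask-weighted mean over per-token vectors.
/// Tokens whose mask entry is 0 (padding) do not contribute.
fn mean_pool(hidden: &[Vec<f32>], mask: &[u32]) -> Vec<f32> {
    let dim = hidden.first().map_or(0, |row| row.len());
    let mut sum = vec![0.0f32; dim];
    let mut count = 0.0f32;
    for (row, &m) in hidden.iter().zip(mask) {
        if m != 0 {
            for (acc, &x) in sum.iter_mut().zip(row) {
                *acc += x;
            }
            count += 1.0;
        }
    }
    // Guard against an all-padding row, mirroring the clamp(min=1e-9) idea.
    let denom = count.max(1e-9);
    sum.iter().map(|x| x / denom).collect()
}

/// Scales the vector to unit length; a zero vector is left untouched.
fn l2_normalize(v: &mut [f32]) {
    let norm = v.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm > 0.0 {
        for x in v.iter_mut() {
            *x /= norm;
        }
    }
}

/// For unit-length inputs, cosine distance is 1 minus the dot product.
fn cosine_distance_unit(a: &[f32], b: &[f32]) -> f32 {
    1.0 - a.iter().zip(b).map(|(x, y)| x * y).sum::<f32>()
}
```

Pooling first and normalising second matters: normalising each token vector before averaging would generally give a different sentence vector.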
+ +use std::path::PathBuf; + +use candle_core::{DType, Device, Tensor}; +use candle_nn::VarBuilder; +use candle_transformers::models::bert::{BertModel, Config, HiddenAct, DTYPE}; +use hf_hub::api::sync::Api; +use tokenizers::{PaddingParams, PaddingStrategy, Tokenizer, TruncationParams, TruncationStrategy}; + +pub const DEFAULT_EMBED_MODEL: &str = "sentence-transformers/all-MiniLM-L6-v2"; + +#[derive(Debug)] +pub enum EmbedError { + Hub(String), + Io(std::io::Error), + Candle(candle_core::Error), + Tokenizer(String), + BatchMismatch { expected: usize, got: usize }, + Empty, +} + +impl std::fmt::Display for EmbedError { + fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result { + match self { + EmbedError::Hub(msg) => write!(f, "hugging face hub: {}", msg), + EmbedError::Io(e) => write!(f, "io: {}", e), + EmbedError::Candle(e) => write!(f, "candle: {}", e), + EmbedError::Tokenizer(msg) => write!(f, "tokenizer: {}", msg), + EmbedError::BatchMismatch { expected, got } => write!( + f, + "pipeline returned {} vectors for {} inputs", + got, expected + ), + EmbedError::Empty => write!(f, "pipeline returned no embeddings"), + } + } +} + +impl std::error::Error for EmbedError {} + +impl From<candle_core::Error> for EmbedError { + fn from(e: candle_core::Error) -> Self { + EmbedError::Candle(e) + } +} + +impl From<std::io::Error> for EmbedError { + fn from(e: std::io::Error) -> Self { + EmbedError::Io(e) + } +} + +/// Wraps a Candle BertModel + a HuggingFace tokenizer for a sentence +/// transformer. +pub struct LocalEmbedder { + pub model_name: String, + // `dim` is exposed so callers can sanity-check the model output + // against the Redis Search index dimension; the demo HTTP layer + // doesn't read it (it's hard-coded against VECTOR_DIM_DEFAULT). + #[allow(dead_code)] + pub dim: usize, + model: BertModel, + tokenizer: Tokenizer, + device: Device, +} + +impl LocalEmbedder { + /// Load the MiniLM model + tokenizer. 
Downloads them on first run + /// from the Hugging Face Hub into the local hf cache; later runs + /// load from disk only. + pub fn new(model_name: Option<&str>) -> Result<Self, EmbedError> { + let model_name = model_name.unwrap_or(DEFAULT_EMBED_MODEL).to_string(); + + let api = Api::new().map_err(|e| EmbedError::Hub(e.to_string()))?; + let repo = api.model(model_name.clone()); + + // The sentence-transformers MiniLM repo ships pytorch_model.bin + // + config.json + tokenizer.json. We deliberately use the + // PyTorch weights via candle's pickle reader rather than the + // safetensors mirror because (a) the canonical repo only + // publishes .bin, and (b) staying on the canonical repo means + // we hit the same weights as redis-py / nodejs / go / jedis + // demos. + let config_path: PathBuf = repo + .get("config.json") + .map_err(|e| EmbedError::Hub(e.to_string()))?; + let tokenizer_path: PathBuf = repo + .get("tokenizer.json") + .map_err(|e| EmbedError::Hub(e.to_string()))?; + let weights_path: PathBuf = repo + .get("pytorch_model.bin") + .map_err(|e| EmbedError::Hub(e.to_string()))?; + + let config_bytes = std::fs::read(&config_path)?; + let mut config: Config = + serde_json::from_slice(&config_bytes).map_err(|e| EmbedError::Hub(e.to_string()))?; + // sentence-transformers configs sometimes ship with hidden_act + // = "gelu" as a JSON string Candle parses as Gelu. MiniLM ships + // hidden_act="gelu", which already matches; force it just in + // case a downstream repo ships a non-standard value. + config.hidden_act = HiddenAct::Gelu; + + let mut tokenizer = Tokenizer::from_file(&tokenizer_path) + .map_err(|e| EmbedError::Tokenizer(e.to_string()))?; + // Match sentence-transformers' default: pad to the longest in + // the batch, truncate to the model's max length. Without the + // padding configuration set, single-example calls work fine + // but multi-example batches fail because Candle needs a + // rectangular tensor. 
+ let pad_id = tokenizer + .get_padding() + .map(|p| p.pad_id) + .unwrap_or(0); + tokenizer.with_padding(Some(PaddingParams { + strategy: PaddingStrategy::BatchLongest, + direction: tokenizers::PaddingDirection::Right, + pad_to_multiple_of: None, + pad_id, + pad_type_id: 0, + pad_token: "[PAD]".to_string(), + })); + tokenizer + .with_truncation(Some(TruncationParams { + max_length: 512, + strategy: TruncationStrategy::LongestFirst, + stride: 0, + direction: tokenizers::TruncationDirection::Right, + })) + .map_err(|e| EmbedError::Tokenizer(e.to_string()))?; + + let device = Device::Cpu; + let vb = VarBuilder::from_pth(&weights_path, DTYPE, &device)?; + let model = BertModel::load(vb, &config)?; + + let dim = config.hidden_size; + + Ok(Self { + model_name, + dim, + model, + tokenizer, + device, + }) + } + + /// Returns a `dim`-element float32 vector for the input string, + /// L2-normalised. + pub fn encode_one(&self, text: &str) -> Result<Vec<f32>, EmbedError> { + let mut out = self.encode_many(&[text])?; + if out.is_empty() { + return Err(EmbedError::Empty); + } + Ok(out.remove(0)) + } + + /// Batch-encodes several strings in one forward pass so the model + /// pays the kernel-launch overhead once. Returns one vector per + /// input in the same order. Each vector is L2-normalised. 
+ pub fn encode_many(&self, texts: &[&str]) -> Result<Vec<Vec<f32>>, EmbedError> { + if texts.is_empty() { + return Ok(Vec::new()); + } + + let encodings = self + .tokenizer + .encode_batch(texts.to_vec(), true) + .map_err(|e| EmbedError::Tokenizer(e.to_string()))?; + if encodings.len() != texts.len() { + return Err(EmbedError::BatchMismatch { + expected: texts.len(), + got: encodings.len(), + }); + } + + let batch_size = encodings.len(); + let seq_len = encodings.iter().map(|e| e.get_ids().len()).max().unwrap_or(0); + + let mut input_ids = Vec::with_capacity(batch_size * seq_len); + let mut attention_mask = Vec::with_capacity(batch_size * seq_len); + let mut token_type_ids = Vec::with_capacity(batch_size * seq_len); + for enc in &encodings { + input_ids.extend_from_slice(enc.get_ids()); + attention_mask.extend_from_slice(enc.get_attention_mask()); + token_type_ids.extend_from_slice(enc.get_type_ids()); + } + + let input_ids_t = Tensor::from_vec(input_ids, (batch_size, seq_len), &self.device)?; + let token_type_ids_t = + Tensor::from_vec(token_type_ids, (batch_size, seq_len), &self.device)?; + let attn_mask_u32 = attention_mask.clone(); + let attn_mask_t = Tensor::from_vec(attn_mask_u32, (batch_size, seq_len), &self.device)?; + + // Forward pass. Candle's BertModel takes input_ids, + // token_type_ids, and an optional attention_mask. We pass the + // mask so the encoder ignores padded positions. + let hidden = self.model.forward( + &input_ids_t, + &token_type_ids_t, + Some(&attn_mask_t), + )?; + + // Mean-pool with the attention mask. sentence-transformers + // computes the sentence vector as the mask-weighted average of + // per-token last-hidden-state vectors. Pseudocode: + // + // sum = (hidden * mask_expanded).sum(dim=1) + // counts = mask_expanded.sum(dim=1).clamp(min=1e-9) + // pooled = sum / counts + // + // The mask comes in as u32 / DTYPE-incompatible; convert to + // the model's DType and broadcast it across the hidden dim. 
+ let mask_f = attn_mask_t.to_dtype(DTYPE)?; // (B, T) + let mask_expanded = mask_f.unsqueeze(2)?; // (B, T, 1) + let mask_expanded = mask_expanded.broadcast_as(hidden.shape())?; // (B, T, H) + + let masked = hidden.broadcast_mul(&mask_expanded)?; // (B, T, H) + let summed = masked.sum(1)?; // (B, H) + let counts = mask_f.sum(1)?; // (B,) + // Clamp the counts so an all-pad row (shouldn't happen, but be + // defensive) doesn't divide by zero. + let counts = counts.maximum(&Tensor::new(1e-9f32, &self.device)?.broadcast_as(counts.shape())?)?; + let counts = counts.unsqueeze(1)?; // (B, 1) + let pooled = summed.broadcast_div(&counts)?; // (B, H) + + // Extract as Vec<Vec<f32>> and L2-normalise each row in + // user-space so the demo's normalisation is explicit and + // visible in source. (Candle's tensor normalize helpers also + // exist but doing it by hand makes the docs example legible + // without a Candle deep-dive.) + let pooled_f32 = pooled.to_dtype(DType::F32)?; + let rows: Vec<Vec<f32>> = pooled_f32.to_vec2::<f32>()?; + if rows.len() != texts.len() { + return Err(EmbedError::BatchMismatch { + expected: texts.len(), + got: rows.len(), + }); + } + + let mut out = Vec::with_capacity(rows.len()); + for mut row in rows { + normalize_in_place(&mut row); + out.push(row); + } + Ok(out) + } +} + +/// L2-normalises a vector so it has unit length. A zero vector is left +/// untouched (its cosine distance to anything is undefined, but at +/// least Redis won't reject the bytes). +fn normalize_in_place(v: &mut [f32]) { + let mut sum_sq: f64 = 0.0; + for &x in v.iter() { + sum_sq += (x as f64) * (x as f64); + } + if sum_sq == 0.0 { + return; + } + let inv = (1.0 / sum_sq.sqrt()) as f32; + for x in v.iter_mut() { + *x *= inv; + } +} + +/// Packs a `&[f32]` into the raw little-endian byte sequence Redis +/// Search expects for a FLOAT32 vector field. 
We use +/// `byteorder::LittleEndian` (via `write_f32`) rather than relying on +/// `f32::to_le_bytes` so the encoding contract is visible in source +/// and consistent with the Go demo's `binary.LittleEndian.PutUint32`. +pub fn floats_to_bytes(fs: &[f32]) -> Vec<u8> { + use byteorder::{LittleEndian, WriteBytesExt}; + let mut buf = Vec::with_capacity(fs.len() * 4); + for &f in fs { + buf.write_f32::<LittleEndian>(f).expect("Vec write never fails"); + } + buf +} diff --git a/content/develop/use-cases/semantic-cache/rust/src/main.rs b/content/develop/use-cases/semantic-cache/rust/src/main.rs new file mode 100644 index 0000000000..0160479fe5 --- /dev/null +++ b/content/develop/use-cases/semantic-cache/rust/src/main.rs @@ -0,0 +1,621 @@ +//! Redis semantic-cache demo server (Rust). +//! +//! Run this binary and visit http://localhost:8091 to drive a small +//! semantic-cache demo backed by Redis Search. The UI lets you: +//! +//! - Type a natural-language prompt and watch the cache decide hit +//! or miss. On a hit Redis returns the cached response in tens of +//! milliseconds and the demo LLM is not called at all; on a miss +//! the demo LLM "thinks" for ~1.5 s before answering and the new +//! prompt, response, and embedding are written back to Redis for +//! next time. +//! - Adjust the cosine-distance threshold to see how close a +//! paraphrase must be for the cache to serve it. +//! - Switch tenant, locale, or model version to see metadata +//! isolation in action — entries written under one tenant cannot +//! be served to another, because the TAG filter goes into the +//! same FT.SEARCH call as the KNN. +//! - Inspect every cached entry with TTL and hit count, and drop +//! individual entries to simulate eviction. +//! +//! The server holds a single `LocalEmbedder`, a single +//! `RedisSemanticCache`, and a single `MockLlm` for the lifetime of +//! the process. The first run downloads the embedding model into the +//! local Hugging Face cache; everything after is offline. 
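The metadata isolation described above depends on the TAG filter and the KNN clause travelling in one query string. A minimal sketch of how such a string could be assembled follows. The helper names `tag_clause` and `hybrid_query`, the vector field name passed in, the `$vec` parameter, and the `distance` alias are all assumptions made for illustration; the names the demo actually uses live in `cache.rs`.

```rust
/// Hypothetical helper: one TAG clause such as `@tenant:{acme}`.
/// The value is assumed to be escaped already (see `escape_tag_value`).
fn tag_clause(field: &str, escaped_value: &str) -> String {
    // `{{` and `}}` emit literal braces around the value.
    format!("@{}:{{{}}}", field, escaped_value)
}

/// Hypothetical helper: prefixes a KNN clause with a pre-filter so that
/// only documents matching the filter are candidates for the search.
fn hybrid_query(filter: &str, k: usize, vector_field: &str) -> String {
    format!("{}=>[KNN {} @{} $vec AS distance]", filter, k, vector_field)
}
```

Because the filter sits in front of `=>`, an entry stored under another tenant never becomes a KNN candidate, which is a stronger guarantee than filtering the results after the search.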
+ +mod cache; +mod embeddings; +mod mock_llm; +mod seed_cache; + +use std::collections::HashMap; +use std::io::Read; +use std::path::PathBuf; +use std::sync::Arc; +use std::time::Instant; + +use serde::Serialize; +use serde_json::{json, Value as JsonValue}; +use tiny_http::{Header, Method, Request, Response, Server, StatusCode}; + +use crate::cache::{ + Entry, IndexInfo, LookupParams, LookupResult, PutParams, RedisSemanticCache, VECTOR_DIM_DEFAULT, +}; +use crate::embeddings::LocalEmbedder; +use crate::mock_llm::MockLlm; +use crate::seed_cache::{seed, SeedOptions}; + +const STACK_LABEL: &str = "redis-rs + candle + tiny_http (Rust)"; +const MAX_BODY_BYTES: usize = 1 * 1024 * 1024; + +// ----- CLI flags ----------------------------------------------------- + +struct Flags { + host: String, + port: u16, + redis_host: String, + redis_port: u16, + index_name: String, + key_prefix: String, + ttl_seconds: i64, + threshold: f64, + llm_latency_ms: f64, + no_reset: bool, +} + +impl Default for Flags { + fn default() -> Self { + Self { + host: "127.0.0.1".into(), + port: 8091, + redis_host: "localhost".into(), + redis_port: 6379, + index_name: "semcache:idx".into(), + key_prefix: "cache:".into(), + ttl_seconds: 3600, + threshold: 0.5, + llm_latency_ms: 1500.0, + no_reset: false, + } + } +} + +fn parse_flags() -> Flags { + // Stdlib arg parsing keeps the dependency list short. The flag + // surface is small enough that a hand-rolled loop is cheaper than + // pulling clap in. Long flags only ("--host=x" or "--host x"); + // unknown flags abort. 
+ let mut f = Flags::default(); + let args: Vec = std::env::args().skip(1).collect(); + let mut i = 0; + while i < args.len() { + let arg = &args[i]; + let (name, value, advance) = if let Some((n, v)) = arg.split_once('=') { + (n.to_string(), Some(v.to_string()), false) + } else { + (arg.clone(), args.get(i + 1).cloned(), true) + }; + match name.as_str() { + "--host" => { f.host = value.expect("--host requires a value"); if advance { i += 1; } } + "--port" => { f.port = value.expect("--port requires a value").parse().expect("--port must be an integer"); if advance { i += 1; } } + "--redis-host" => { f.redis_host = value.expect("--redis-host requires a value"); if advance { i += 1; } } + "--redis-port" => { f.redis_port = value.expect("--redis-port requires a value").parse().expect("--redis-port must be an integer"); if advance { i += 1; } } + "--index-name" => { f.index_name = value.expect("--index-name requires a value"); if advance { i += 1; } } + "--key-prefix" => { f.key_prefix = value.expect("--key-prefix requires a value"); if advance { i += 1; } } + "--ttl-seconds" => { f.ttl_seconds = value.expect("--ttl-seconds requires a value").parse().expect("--ttl-seconds must be an integer"); if advance { i += 1; } } + "--threshold" => { f.threshold = value.expect("--threshold requires a value").parse().expect("--threshold must be a float"); if advance { i += 1; } } + "--llm-latency-ms" => { f.llm_latency_ms = value.expect("--llm-latency-ms requires a value").parse().expect("--llm-latency-ms must be a float"); if advance { i += 1; } } + "--no-reset" => { f.no_reset = true; } + "--help" | "-h" => { print_help(); std::process::exit(0); } + other => { + eprintln!("unknown flag: {}", other); + print_help(); + std::process::exit(2); + } + } + i += 1; + } + f +} + +fn print_help() { + eprintln!( + "semcache-demo: Redis semantic cache demo (Rust)\n\ + \n\ + Flags:\n\ + --host interface to bind to (default 127.0.0.1)\n\ + --port HTTP port (default 8091)\n\ + --redis-host 
Redis host (default localhost)\n\ + --redis-port Redis port (default 6379)\n\ + --index-name Redis Search index name (default semcache:idx)\n\ + --key-prefix Cache key prefix (default cache:)\n\ + --ttl-seconds TTL applied to every cache entry (default 3600)\n\ + --threshold Default cosine-distance threshold (default 0.5)\n\ + --llm-latency-ms Simulated mock LLM latency (default 1500)\n\ + --no-reset Skip the cache reset + seed on startup\n" + ); +} + +// ----- Shared demo state --------------------------------------------- + +struct AppState { + cache: RedisSemanticCache, + embedder: LocalEmbedder, + llm: MockLlm, + html_page: String, + default_tenant: String, + default_locale: String, +} + +impl AppState { + fn seed_now(&self) -> Result> { + self.cache.clear()?; + seed( + &self.cache, + &self.embedder, + SeedOptions { + tenant: &self.default_tenant, + locale: &self.default_locale, + model_version: &self.llm.model_version, + }, + ) + } +} + +// ----- HTTP helpers -------------------------------------------------- + +fn json_response(payload: &JsonValue, status: u16) -> Response>> { + let body = serde_json::to_vec(payload).unwrap_or_else(|_| b"{}".to_vec()); + Response::from_data(body) + .with_status_code(StatusCode(status)) + .with_header( + "Content-Type: application/json; charset=utf-8" + .parse::
() + .unwrap(), + ) +} + +fn html_response(html: String, status: u16) -> Response>> { + Response::from_data(html.into_bytes()) + .with_status_code(StatusCode(status)) + .with_header( + "Content-Type: text/html; charset=utf-8" + .parse::
() + .unwrap(), + ) +} + +fn error_json(err_type: &str, message: &str) -> JsonValue { + json!({ "error": message, "type": err_type }) +} + +/// Read a request body up to MAX_BODY_BYTES. Returns an error string +/// if the body would exceed the cap; the handler surfaces that as a +/// 400 to the client. +fn read_body(req: &mut Request) -> Result { + let mut buf = Vec::new(); + { + let mut limited = req.as_reader().take((MAX_BODY_BYTES + 1) as u64); + limited + .read_to_end(&mut buf) + .map_err(|e| format!("reading body: {}", e))?; + } + if buf.len() > MAX_BODY_BYTES { + return Err(format!("request body exceeds {} bytes", MAX_BODY_BYTES)); + } + String::from_utf8(buf).map_err(|e| format!("body is not valid UTF-8: {}", e)) +} + +fn parse_form(body: &str) -> HashMap { + let mut out = HashMap::new(); + for (k, v) in url::form_urlencoded::parse(body.as_bytes()) { + out.insert(k.into_owned(), v.into_owned()); + } + out +} + +/// Clamp the threshold from a form-body string. `parse::` happily +/// handles "nan" → NaN and "inf" → +Inf; either would silently turn +/// the lookup into a permanent hit (NaN comparisons are always false, +/// so `distance > NaN` cannot reject) or a permanent miss. Use +/// `f64::is_finite` to reject those explicitly and fall back to 0.5; +/// finite values are clamped to the cosine-distance range [0, 2]. 
+fn clamp_threshold(raw: &str) -> f64 { + let parsed: f64 = raw.trim().parse().unwrap_or(f64::NAN); + if !parsed.is_finite() { + return 0.5; + } + if parsed < 0.0 { + 0.0 + } else if parsed > 2.0 { + 2.0 + } else { + parsed + } +} + +// ----- /state response shape ----------------------------------------- + +#[derive(Serialize)] +struct StateIndex { + num_docs: i64, + index_name: String, + indexing_failures: i64, + vector_index_size_mb: f64, + model: String, + mock_llm_latency_ms: f64, + default_threshold: f64, + stack_label: String, +} + +#[derive(Serialize)] +struct State { + index: StateIndex, + entries: Vec, +} + +fn build_state(app: &AppState) -> Result> { + let info: IndexInfo = app.cache.index_info(); + let entries = app.cache.list_entries(200)?; + Ok(State { + index: StateIndex { + num_docs: info.num_docs, + index_name: app.cache.index_name.clone(), + indexing_failures: info.indexing_failures, + vector_index_size_mb: info.vector_index_size_mb, + model: app.embedder.model_name.clone(), + mock_llm_latency_ms: app.llm.latency_ms, + default_threshold: app.cache.distance_threshold, + stack_label: STACK_LABEL.to_string(), + }, + entries, + }) +} + +// ----- Query hot path ------------------------------------------------- + +struct QueryParams { + prompt: String, + tenant: String, + locale: String, + model_version: String, + threshold: f64, + lookup_only: bool, +} + +fn run_query(app: &AppState, p: QueryParams) -> Result> { + let t0 = Instant::now(); + let query_vec = app.embedder.encode_one(&p.prompt)?; + let embed_ms = ms_since(t0); + + let t1 = Instant::now(); + let result = app.cache.lookup( + &query_vec, + LookupParams { + tenant: Some(&p.tenant), + locale: Some(&p.locale), + model_version: Some(&p.model_version), + safety: None, // default "ok" + distance_threshold: Some(p.threshold), + }, + )?; + let lookup_ms = ms_since(t1); + + match result { + LookupResult::Hit(hit) => Ok(json!({ + "outcome": "hit", + "response": hit.response, + "entry_id": hit.id, + 
"distance": hit.distance, + "ttl_seconds": hit.ttl_seconds, + "hit_count": hit.hit_count, + "threshold": p.threshold, + "embed_ms": embed_ms, + "lookup_ms": lookup_ms, + "llm_ms": JsonValue::Null, + "total_ms": embed_ms + lookup_ms, + "tokens_avoided": estimate_response_tokens(&hit.prompt, &hit.response), + "ms_avoided": app.llm.latency_ms, + })), + LookupResult::Miss(miss) => { + let nearest_distance = match miss.nearest_distance { + Some(d) => json!(d), + None => JsonValue::Null, + }; + if p.lookup_only { + return Ok(json!({ + "outcome": "miss", + "response": "(LLM not called in lookup-only mode)", + "nearest_distance": nearest_distance, + "threshold": p.threshold, + "wrote_entry_id": JsonValue::Null, + "embed_ms": embed_ms, + "lookup_ms": lookup_ms, + "llm_ms": JsonValue::Null, + "total_ms": embed_ms + lookup_ms, + })); + } + + let t2 = Instant::now(); + let llm_resp = app.llm.complete(&p.prompt); + let llm_ms = ms_since(t2); + + let entry_id = app.cache.put(PutParams { + prompt: &p.prompt, + response: &llm_resp.response, + embedding: &query_vec, + tenant: &p.tenant, + locale: &p.locale, + model_version: &p.model_version, + safety: "ok", + ttl_seconds: None, + entry_id: None, + })?; + + Ok(json!({ + "outcome": "miss", + "response": llm_resp.response, + "nearest_distance": nearest_distance, + "threshold": p.threshold, + "wrote_entry_id": entry_id, + "embed_ms": embed_ms, + "lookup_ms": lookup_ms, + "llm_ms": llm_ms, + "total_ms": embed_ms + lookup_ms + llm_ms, + })) + } + } +} + +fn ms_since(t: Instant) -> f64 { + (t.elapsed().as_micros() as f64) / 1000.0 +} + +fn estimate_response_tokens(prompt: &str, response: &str) -> i64 { + let n = ((prompt.len() + response.len()) / 4) as i64; + if n < 1 { + 1 + } else { + n + } +} + +// ----- Request routing ----------------------------------------------- + +/// Top-level request handler. Takes ownership of the request and +/// always calls `respond()` exactly once. 
Wraps the inner dispatch +/// in `catch_unwind` so a panic in any handler becomes a JSON 500 +/// rather than a dropped connection. +fn handle_request(app: &AppState, mut req: Request) { + // The dispatch must return a Response (or take ownership of the + // request to do its own respond, e.g. for streaming). We keep it + // simple by always returning a Response from `dispatch`. + let path = req.url().split('?').next().unwrap_or("/").to_string(); + let method = req.method().clone(); + + let response = std::panic::catch_unwind(std::panic::AssertUnwindSafe(|| { + dispatch(app, &method, &path, &mut req) + })); + let response = match response { + Ok(r) => r, + Err(panic) => { + eprintln!("[demo] panic in handler: {:?}", panic); + let msg = format!("internal server error: {:?}", panic); + json_response(&error_json("panic", &msg), 500) + } + }; + let _ = req.respond(response); +} + +fn dispatch( + app: &AppState, + method: &Method, + path: &str, + req: &mut Request, +) -> Response>> { + match (method, path) { + (Method::Get, "/") | (Method::Get, "/index.html") => html_response(app.html_page.clone(), 200), + (Method::Get, "/state") => match build_state(app) { + Ok(s) => match serde_json::to_value(&s) { + Ok(v) => json_response(&v, 200), + Err(e) => json_response(&error_json("state_encode", &e.to_string()), 500), + }, + Err(e) => json_response(&error_json("state", &e.to_string()), 500), + }, + (Method::Post, "/query") => handle_query(app, req), + (Method::Post, "/reset") => match app.seed_now() { + Ok(_) => json_response(&json!({"ok": true}), 200), + Err(e) => json_response(&error_json("reset", &e.to_string()), 500), + }, + (Method::Post, "/drop") => handle_drop(app, req), + _ => json_response(&error_json("not_found", "not found"), 404), + } +} + +fn handle_query(app: &AppState, req: &mut Request) -> Response>> { + let body = match read_body(req) { + Ok(b) => b, + Err(e) => return json_response(&error_json("bad_request", &e), 400), + }; + let form = parse_form(&body); + 
let prompt = form + .get("prompt") + .map(|s| s.trim().to_string()) + .unwrap_or_default(); + if prompt.is_empty() { + return json_response(&error_json("bad_request", "prompt is required"), 400); + } + let tenant = form + .get("tenant") + .map(|s| s.as_str()) + .filter(|s| !s.is_empty()) + .unwrap_or("acme") + .to_string(); + let locale = form + .get("locale") + .map(|s| s.as_str()) + .filter(|s| !s.is_empty()) + .unwrap_or("en") + .to_string(); + let default_model = app.llm.model_version.clone(); + let model_version = form + .get("model_version") + .map(|s| s.as_str()) + .filter(|s| !s.is_empty()) + .unwrap_or(&default_model) + .to_string(); + let threshold = clamp_threshold(form.get("threshold").map(|s| s.as_str()).unwrap_or("")); + let lookup_only = form + .get("lookup_only") + .map(|s| { + !s.is_empty() && s != "0" && s.to_lowercase() != "false" + }) + .unwrap_or(false); + + match run_query( + app, + QueryParams { + prompt, + tenant, + locale, + model_version, + threshold, + lookup_only, + }, + ) { + Ok(payload) => json_response(&payload, 200), + Err(e) => { + eprintln!("[demo] /query error: {}", e); + json_response(&error_json("query", &e.to_string()), 500) + } + } +} + +fn handle_drop(app: &AppState, req: &mut Request) -> Response>> { + let body = match read_body(req) { + Ok(b) => b, + Err(e) => return json_response(&error_json("bad_request", &e), 400), + }; + let form = parse_form(&body); + let entry_id = form + .get("entry_id") + .map(|s| s.trim().to_string()) + .unwrap_or_default(); + if entry_id.is_empty() { + return json_response(&error_json("bad_request", "entry_id is required"), 400); + } + match app.cache.delete_entry(&entry_id) { + Ok(deleted) => json_response( + &json!({"deleted": deleted, "entry_id": entry_id}), + 200, + ), + Err(e) => json_response(&error_json("drop", &e.to_string()), 500), + } +} + +// ----- Entry point --------------------------------------------------- + +fn main() -> Result<(), Box> { + let flags = parse_flags(); + + 
println!("Connecting to Redis at {}:{} ...", flags.redis_host, flags.redis_port); + let client = redis::Client::open(format!("redis://{}:{}/", flags.redis_host, flags.redis_port))?; + { + let mut con = client.get_connection()?; + let _: String = redis::cmd("PING").query(&mut con)?; + } + + let cache = RedisSemanticCache::new( + client, + flags.index_name.clone(), + flags.key_prefix.clone(), + VECTOR_DIM_DEFAULT, + flags.threshold, + flags.ttl_seconds, + )?; + cache.create_index()?; + + println!("Loading embedding model (first run downloads the MiniLM weights from Hugging Face)..."); + let embedder = LocalEmbedder::new(None)?; + + let llm = MockLlm::new(None, flags.llm_latency_ms); + + let app = Arc::new(AppState { + cache, + embedder, + llm, + html_page: load_html(&flags)?, + default_tenant: "acme".into(), + default_locale: "en".into(), + }); + + if !flags.no_reset { + println!( + "Dropping any existing cache under '{}*' and re-seeding from the FAQ list (pass --no-reset to keep).", + flags.key_prefix + ); + let seeded = app.seed_now()?; + println!("Seeded {} entries.", seeded); + } + + let addr = format!("{}:{}", flags.host, flags.port); + let server = Server::http(&addr).map_err(|e| format!("starting HTTP server: {}", e))?; + println!("Redis semantic cache demo listening on http://{}", addr); + println!( + "Using Redis at {}:{} with index '{}'", + flags.redis_host, flags.redis_port, flags.index_name + ); + + // tiny_http hands out requests round-robin to whichever thread is + // pulling from the iterator. A small worker pool means a /state + // poll doesn't have to wait for a /query that's still embedding. + // Four workers is plenty for one user clicking around. 
+ let server = Arc::new(server); + let mut workers = Vec::new(); + for _ in 0..4 { + let app = Arc::clone(&app); + let server = Arc::clone(&server); + workers.push(std::thread::spawn(move || { + for req in server.incoming_requests() { + handle_request(&app, req); + } + })); + } + for w in workers { + let _ = w.join(); + } + Ok(()) +} + +fn load_html(flags: &Flags) -> Result { + // Resolve `index.html` next to the executable; fall back to the + // current working directory (matches the Go demo's + // `executableDir` lookup so `cargo run` works from the source + // directory). + let mut candidates: Vec = Vec::new(); + if let Ok(exe) = std::env::current_exe() { + if let Some(parent) = exe.parent() { + candidates.push(parent.join("index.html")); + if let Some(grand) = parent.parent() { + if let Some(great) = grand.parent() { + candidates.push(great.join("index.html")); + } + } + } + } + if let Ok(cwd) = std::env::current_dir() { + candidates.push(cwd.join("index.html")); + } + for path in &candidates { + if path.exists() { + let raw = std::fs::read_to_string(path)?; + return Ok(raw + .replace("__INDEX_NAME__", &flags.index_name) + .replace("__KEY_PREFIX__", &flags.key_prefix)); + } + } + Err(std::io::Error::new( + std::io::ErrorKind::NotFound, + format!("index.html not found in any of: {:?}", candidates), + )) +} diff --git a/content/develop/use-cases/semantic-cache/rust/src/mock_llm.rs b/content/develop/use-cases/semantic-cache/rust/src/mock_llm.rs new file mode 100644 index 0000000000..71860655f6 --- /dev/null +++ b/content/develop/use-cases/semantic-cache/rust/src/mock_llm.rs @@ -0,0 +1,183 @@ +//! Deterministic mock LLM for the semantic-cache demo. +//! +//! The point of a semantic cache is to *skip* an LLM call when a prior +//! answer is reusable. To make that visible in a docs demo we need an +//! LLM stand-in that: +//! +//! - takes long enough that the saved time on a cache hit is obvious +//! (real-world model calls are 500 ms to several seconds); +//! 
- responds deterministically so a given prompt always produces the +//! same answer, which keeps the demo reproducible; +//! - exposes an estimated token count so the demo can show the +//! saving in "tokens not spent" terms alongside latency; +//! - needs no API keys, no network, no extra dependencies. +//! +//! It is keyword-matched against a small lookup table of FAQ-style +//! answers for a fictional online retailer. Anything that doesn't +//! match falls back to a generic templated reply. The `latency_ms` +//! field is the simulated round trip; the default (1500 ms) is in the +//! neighbourhood of a real GPT-class model on a moderately-sized +//! prompt. + +use std::sync::atomic::{AtomicI64, Ordering}; +use std::thread; +use std::time::{Duration, Instant}; + +struct KnowledgeEntry { + keywords: &'static [&'static str], + answer: &'static str, +} + +const KNOWLEDGE: &[KnowledgeEntry] = &[ + KnowledgeEntry { + keywords: &["return", "refund", "exchange"], + answer: "You can return any unworn item within 30 days of delivery for a \ + full refund. Start a return from your order page; we email a \ + prepaid label and refund the original payment method within \ + five business days of receiving the item.", + }, + KnowledgeEntry { + keywords: &["shipping", "delivery", "arrive", "ship"], + answer: "Standard shipping is free on orders over $50 and arrives in \ + three to five business days. Expedited two-day shipping is \ + $9.99 and is available at checkout for in-stock items.", + }, + KnowledgeEntry { + keywords: &["size", "sizing", "fit"], + answer: "We follow standard US sizing. For most styles we recommend \ + ordering your usual size; the product page includes a sizing \ + chart and customer fit notes for items that run small or large.", + }, + KnowledgeEntry { + keywords: &["warranty", "guarantee", "defect", "broken"], + answer: "All gear is covered by a one-year manufacturer warranty against \ + defects in materials or workmanship. 
Email support with your \ + order number and a photo of the issue and we will replace the \ + item or issue a refund.", + }, + KnowledgeEntry { + keywords: &["contact", "support", "help", "agent"], + answer: "You can reach our support team by email at help@example.com or \ + by live chat from the help centre, 9am to 9pm Eastern, seven \ + days a week. Most tickets get a first reply within two hours.", + }, + KnowledgeEntry { + keywords: &["track", "tracking", "order", "where"], + answer: "Your tracking number is on the order confirmation email and on \ + the order detail page once the package has been picked up by \ + the carrier — typically within 24 hours of order placement.", + }, + KnowledgeEntry { + keywords: &["cancel", "modify", "change"], + answer: "Orders can be cancelled or modified for up to one hour after \ + placement. After that the order has usually entered our \ + warehouse system; the fastest path is to accept delivery and \ + start a return for any unwanted items.", + }, + KnowledgeEntry { + keywords: &["discount", "coupon", "promo", "code"], + answer: "Active promotional codes are listed on the homepage banner. \ + Codes apply at checkout and cannot be combined; the system \ + automatically uses the larger of the two when more than one \ + would qualify.", + }, +]; + +const FALLBACK_ANSWER: &str = + "Thanks for the question. Our team would normally answer this \ + individually; in the meantime please check the help centre or \ + contact support@example.com for a faster response."; + +/// Rough English token estimate: ~4 characters per token. Real +/// tokenizers (BPE, SentencePiece) vary slightly but this is close +/// enough for "look how many tokens you saved" demo signage. 
+pub fn estimate_tokens(text: &str) -> i64 { + if text.is_empty() { + return 0; + } + let n = (text.len() / 4) as i64; + if n > 1 { + n + } else { + 1 + } +} + +fn answer_for(prompt: &str) -> &'static str { + let lower = prompt.to_lowercase(); + for row in KNOWLEDGE { + for kw in row.keywords { + if lower.contains(kw) { + return row.answer; + } + } + } + FALLBACK_ANSWER +} + +// The unused fields and methods on this type are intentional: this +// docs example mirrors the Python / Node / Go / Jedis demos and +// exposes the same public surface (`model_version`, `latency_ms`, +// `total_tokens()`, `call_count()`) so a reader copying the example +// into their own project doesn't have to add fields back to match +// the prose in `_index.md`. The demo HTTP layer only reads +// `response` and `latency_ms`, hence the dead-code warnings. +#[allow(dead_code)] +pub struct LlmResponse { + pub response: String, + pub model_version: String, + pub latency_ms: f64, + pub prompt_tokens: i64, + pub completion_tokens: i64, +} + +impl LlmResponse { + #[allow(dead_code)] + pub fn total_tokens(&self) -> i64 { + self.prompt_tokens + self.completion_tokens + } +} + +pub struct MockLlm { + pub model_version: String, + pub latency_ms: f64, + call_count: AtomicI64, +} + +impl MockLlm { + pub fn new(model_version: Option<&str>, latency_ms: f64) -> Self { + let model_version = model_version + .filter(|s| !s.is_empty()) + .unwrap_or("gpt-4.5-2026") + .to_string(); + let latency_ms = if latency_ms <= 0.0 { 1500.0 } else { latency_ms }; + Self { + model_version, + latency_ms, + call_count: AtomicI64::new(0), + } + } + + #[allow(dead_code)] + pub fn call_count(&self) -> i64 { + self.call_count.load(Ordering::Relaxed) + } + + /// Pretend to call an LLM. Sleeps first so the latency is + /// realistic regardless of which branch generates the text, then + /// keyword-matches a templated answer. 
+ pub fn complete(&self, prompt: &str) -> LlmResponse { + self.call_count.fetch_add(1, Ordering::Relaxed); + let start = Instant::now(); + thread::sleep(Duration::from_micros((self.latency_ms * 1000.0) as u64)); + let resp = answer_for(prompt); + let elapsed_ms = (start.elapsed().as_micros() as f64) / 1000.0; + LlmResponse { + response: resp.to_string(), + model_version: self.model_version.clone(), + latency_ms: elapsed_ms, + prompt_tokens: estimate_tokens(prompt), + completion_tokens: estimate_tokens(resp), + } + } +} diff --git a/content/develop/use-cases/semantic-cache/rust/src/seed_cache.rs b/content/develop/use-cases/semantic-cache/rust/src/seed_cache.rs new file mode 100644 index 0000000000..3ad10a0225 --- /dev/null +++ b/content/develop/use-cases/semantic-cache/rust/src/seed_cache.rs @@ -0,0 +1,100 @@ +//! Pre-seed the semantic cache with a handful of FAQ answers. +//! +//! In a real deployment the cache fills up organically as users ask +//! questions: a first-time question is a miss, the LLM answers, and +//! the response is written back. To make the demo immediately useful +//! — so the first query you type lands on a hit instead of a cold +//! miss — we seed a small set of canonical prompts and their answers +//! at startup. +//! +//! The seed list mirrors the keyword table in `mock_llm.rs` but stores +//! the *canonical phrasing* of each question. Paraphrases of any of +//! these prompts ("How do I return an item?", "Can I get a refund?") +//! embed close to the canonical entry and the cache lookup serves the +//! stored response without ever calling the model. + +use crate::cache::{PutParams, RedisSemanticCache}; +use crate::embeddings::LocalEmbedder; + +/// One canonical prompt + response pair. 
+pub struct SeedEntry { + pub prompt: &'static str, + pub response: &'static str, +} + +pub const SEED_ENTRIES: &[SeedEntry] = &[ + SeedEntry { + prompt: "What is your return policy?", + response: "You can return any unworn item within 30 days of delivery for \ + a full refund. Start a return from your order page; we email \ + a prepaid label and refund the original payment method within \ + five business days of receiving the item.", + }, + SeedEntry { + prompt: "How long does shipping take?", + response: "Standard shipping is free on orders over $50 and arrives in \ + three to five business days. Expedited two-day shipping is \ + $9.99 and is available at checkout for in-stock items.", + }, + SeedEntry { + prompt: "How do I find my size?", + response: "We follow standard US sizing. For most styles we recommend \ + ordering your usual size; the product page includes a sizing \ + chart and customer fit notes for items that run small or \ + large.", + }, + SeedEntry { + prompt: "Is there a warranty on your products?", + response: "All gear is covered by a one-year manufacturer warranty \ + against defects in materials or workmanship. Email support \ + with your order number and a photo of the issue and we will \ + replace the item or issue a refund.", + }, + SeedEntry { + prompt: "How can I contact customer support?", + response: "You can reach our support team by email at help@example.com \ + or by live chat from the help centre, 9am to 9pm Eastern, \ + seven days a week. 
Most tickets get a first reply within two \ + hours.", + }, + SeedEntry { + prompt: "Where is my order?", + response: "Your tracking number is on the order confirmation email and \ + on the order detail page once the package has been picked up \ + by the carrier — typically within 24 hours of order \ + placement.", + }, +]; + +pub struct SeedOptions<'a> { + pub tenant: &'a str, + pub locale: &'a str, + pub model_version: &'a str, +} + +/// Write every entry in `SEED_ENTRIES` to the cache under the supplied +/// metadata scope. Embeddings are produced in one batched +/// `encode_many` call so the encoder only pays the setup cost once. +/// Returns the number of entries that were written. +pub fn seed( + cache: &RedisSemanticCache, + embedder: &LocalEmbedder, + opts: SeedOptions<'_>, +) -> Result> { + let prompts: Vec<&str> = SEED_ENTRIES.iter().map(|e| e.prompt).collect(); + let vectors = embedder.encode_many(&prompts)?; + for (entry, vec) in SEED_ENTRIES.iter().zip(vectors.into_iter()) { + cache.put(PutParams { + prompt: entry.prompt, + response: entry.response, + embedding: &vec, + tenant: opts.tenant, + locale: opts.locale, + model_version: opts.model_version, + safety: "ok", + ttl_seconds: None, + entry_id: None, + })?; + } + Ok(SEED_ENTRIES.len()) +}
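Review note on `seed_cache.rs`: the module docs say paraphrases "embed close to the canonical entry" and are then served from the cache. The sketch below illustrates that idea in memory only. It assumes cosine distance and a distance threshold of 0.2; neither is confirmed by this diff. `ToyEntry`, `toy_lookup`, and the three-dimensional vectors are hypothetical names and data for illustration, and the real lookup presumably runs as a vector search through `RedisSemanticCache`, not as a linear scan.

```rust
/// Cosine distance between two equal-length vectors: 1 - cos(a, b).
/// Returns 1.0 when either vector has zero norm (treated as unrelated).
fn cosine_distance(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len(), "vector dimensions must match");
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        return 1.0;
    }
    1.0 - dot / (norm_a * norm_b)
}

/// Hypothetical in-memory stand-in for one seeded cache entry.
struct ToyEntry {
    response: &'static str,
    embedding: Vec<f32>,
}

/// Return the response of the nearest entry, but only if its distance
/// to `query` is within `threshold`; otherwise report a miss (`None`).
fn toy_lookup(entries: &[ToyEntry], query: &[f32], threshold: f32) -> Option<&'static str> {
    entries
        .iter()
        .map(|e| (cosine_distance(&e.embedding, query), e.response))
        .filter(|(d, _)| *d <= threshold)
        .min_by(|a, b| a.0.total_cmp(&b.0))
        .map(|(_, response)| response)
}
```

With two toy entries embedded at `[1, 0, 0]` and `[0, 1, 0]`, a query vector of `[0.9, 0.1, 0.0]` lands within the threshold of the first entry and returns its response, while `[0.5, 0.5, 0.7]` is about 0.5 away from both and is a miss, which is the case where the demo would fall through to `MockLlm::complete`.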