diff --git a/blogs/cache-augmented-generation.mdx b/blogs/cache-augmented-generation.mdx deleted file mode 100644 index 46832a9..0000000 --- a/blogs/cache-augmented-generation.mdx +++ /dev/null @@ -1,124 +0,0 @@ ---- -title: "Cache-Augmented Generation – Teaching an AI to Remember for Lightning-Fast Answers" -description: "Morphik’s cache-augmented generation (CAG) gives large language models a memory upgrade, making them 10× faster than traditional RAG by storing long-term context in the transformer key-value cache." -author: "Morphik Team" -date: "Fri May 16 2025 17:00:00 GMT-0700 (Pacific Daylight Time)" -slug: "cache-augmented-generation" ---- - -# What if your LLM could remember an entire textbook _without re-reading it_ every time you asked a question? - -Imagine loading a 500-page technical manual into your AI **once** – and then getting instant, accurate answers from it at any time, without the costly re-reading, re-embedding, or re-searching. Sound like sci-fi? 🛸 **Morphik** is turning this idea into reality with a new approach called **Cache-Augmented Generation (CAG)**. It’s like giving your LLM a **memory upgrade** – a big, persistent brain cache so it can _remember_ all the details you fed it, instead of acting like an amnesiac goldfish that has to gulp down the textbook anew for every question. In this post, we’ll explore how CAG works, why it’s a game-changer compared to traditional RAG pipelines, and how you can try Morphik’s implementation yourself. Get ready for some fun analogies, a bit of technical deep-dive, and a demo code snippet. By the end, you might wonder why you ever let your poor AI reread _War and Peace_ a hundred times a day. 😉 - -## The Problem with Traditional RAG (Retrieval-Augmented Generation) - - - RAG's chunk-embed-retrieve loop adds latency, context bloat, and complexity – and it can still fail if the right chunk isn't retrieved. - - -Retrieval-Augmented Generation has been the go-to method for LLMs to access external knowledge. The recipe: **chunk** your documents into pieces, **embed** each chunk as a vector, store them in a database, and at query time **retrieve** the most relevant chunks to stuff into the LLM’s context along with the question. This approach works, but it comes with baggage: - -- **Repeated Reading = Latency:** Every single query incurs extra steps – embedding the user query, vector search, and then the LLM has to digest the retrieved text chunks. This real-time retrieval introduces significant latency. It’s like asking a question and making the AI run to the library to fetch reference pages _each time_ before answering. -- **Context Window Bloat:** Those retrieved chunks take up precious space in the prompt. If you pull in, say, four 500-token passages for each query, that’s 2 000 tokens of your context window gone. In RAG, it’s common to hit context size limits or pay huge token costs because you’re repeatedly shoving documents into the prompt. -- **Potential Errors in Retrieval:** The pipeline can fail if the vector search misses a relevant chunk or grabs an irrelevant one. If the answer was in a section that wasn’t retrieved, your LLM won’t magically know it. -- **Complexity & Maintenance:** A full RAG stack is a bit of an engineering circus 🎪. You have to maintain a vector database, an embedding model, possibly a retriever/reranker, handle chunking logic, etc. More moving parts means more chances for something to break. As one analysis put it, RAG’s integration of multiple components **increases system complexity** and requires careful tuning. - -Don’t get us wrong – RAG is powerful (especially for very large or frequently updated knowledge bases). But for many applications, all this overhead is starting to look… well, **naive**. - -## Cache-Augmented Generation: Giving Your LLM a Memory Upgrade - - - Preload the full document once, save the transformer’s key-value cache, and reuse it for all future queries. - - -**Cache-Augmented Generation (CAG)** is a radically different philosophy: _what if we preload all the knowledge into the model upfront, so it doesn’t have to retrieve stuff later?_ It leverages the transformer’s internal **key-value cache** (the same `past_key_values` that normally store attention states from previous tokens) to store an entire knowledge base inside the model’s context. Think of it as building a **cache layer for the brain** of your model. - -1. **One-Time Ingestion into the KV Cache:** You feed the entire document or set of documents to the LLM _once_ as a massive initial prompt. The model processes this just like it would with a long context, except now we **save the model’s internal state** (the key-value pairs from each transformer layer) after ingesting the docs. -2. **Blazing-Fast Queries from Memory:** When a user query comes in, we don’t stuff the documents into the prompt at all. We simply **reload the cached state** into the model’s transformer stack and feed in the new question tokens. -3. **Reuse, Updates, and Multi-turn:** The cached knowledge can be reused across any number of queries – your LLM now has the “textbook in mind” permanently for that session. You can append new docs with another one-time evaluation step. - -![CAG vs RAG](/assets/CAG/cag-diagram.png) - -_Figure 1 – Traditional RAG (top) vs Cache-Augmented Generation (bottom). In Morphik, the “Knowledge Cache” is the serialized transformer key-value state._ - -## Why Cache-Augmented Generation Is a Big Deal - -- **Speed, Speed, Speed:** Eliminating retrieval can reduce end-to-end latency by ~40 % in benchmarks on static datasets. -- **Reduced Token Costs:** Pay the document tokens once; each subsequent query might only cost ~50 tokens. -- **No Lost Context:** The model can synthesize info across the _entire_ document. -- **Simpler Infrastructure:** No vector DB or rerankers – just the LLM and a cache. -- **Multi-turn Consistency:** Because the cache persists, conversations naturally stay in sync. - - - If your knowledge base is huge (millions of docs) or changes every hour, stick with RAG or use a hybrid strategy. - - -## Cache-Augmented Generation in Action (Morphik Demo) - - - - ```bash - pip install morphik - ``` - - - ```python - from morphik import Morphik - - db = Morphik() - with open("ai\_textbook.txt", "r") as f: - content = f.read() - doc\_id = db.ingest\_text(content, filename="AI Textbook") - - ``` - - - ```python - db.create_cache( - name="textbook_cache", - model="llama2", - gguf_file="llama-2-7b-chat.Q4_K_M.gguf", - docs=[doc_id], - ) - cache = db.get_cache("textbook_cache") - ``` - - - ```python - questions = [ - "What is the chain rule in calculus?", - "Who proposed the concept of the Turing Test?" - ] - for q in questions: - result = cache.query(q) - print(f"Q: {q}\nA: {result.completion.strip()}\n") - ``` - - - -## RAG vs CAG Cheat Sheet - -| Aspect | Retrieval-Augmented (RAG) | Cache-Augmented (CAG) | -| ----------------------- | ----------------------------- | ------------------------------------- | -| **Architecture** | Vector DB \+ retriever \+ LLM | Just LLM \+ cache | -| **Per-Query Latency** | Embed \+ search \+ LLM | LLM only | -| **Token Cost / Query** | High (docs repeated) | Low (docs once) | -| **Best For** | Huge / dynamic KBs | Static or medium KBs that fit context | -| **Answer Quality Risk** | Missed retrievals | Full context inside model | -| **Infra Complexity** | Many moving parts | Minimal | -| **Data Updates** | Naturally incremental | Cache must be invalidated / rebuilt | - -## Try It Yourself - -Installing Morphik is a one-liner, and the SDK hides all the complexity: - -```bash -pip install morphik -``` - -- **GitHub:** https://github.com/morphik-org/morphik-core -- **Docs:** https://morphik.ai/docs - -> **Future roadmap:** persistent caches, multi-document graphs, smart eviction, and hybrid RAG-CAG for ever-fresher knowledge stores. - -Ready to give your LLM a memory boost? **Try Morphik today** and let your AI _learn once, answer forever\!_ 🚀 \ No newline at end of file diff --git a/concepts/metadata-filtering.mdx b/concepts/metadata-filtering.mdx index e6e8e55..930deec 100644 --- a/concepts/metadata-filtering.mdx +++ b/concepts/metadata-filtering.mdx @@ -3,7 +3,7 @@ title: "Metadata Filtering" description: "Canonical reference for Morphik’s metadata filter DSL and typed comparisons." --- -Morphik lets you filter documents and chunks directly in the database using a concise JSON filter syntax. The same structure powers the REST API, Python SDK (sync + async), folder helpers, `UserScope`, caches, and knowledge-graph builders, so you can define a filter once and reuse it everywhere. +Morphik lets you filter documents and chunks directly in the database using a concise JSON filter syntax. The same structure powers the REST API, Python SDK (sync + async), folder helpers, `UserScope`, and knowledge-graph builders, so you can define a filter once and reuse it everywhere. Prefer server-side filters over client-side post-processing. You’ll reduce bandwidth, improve performance, and keep behavior consistent between endpoints. @@ -14,7 +14,7 @@ Morphik lets you filter documents and chunks directly in the database using a co You can pass `filters` (or `document_filters`) to: - Retrieval endpoints: [`retrieve_chunks`](/python-sdk/retrieve_chunks), [`retrieve_docs`](/python-sdk/retrieve_docs), [`query`](/python-sdk/query), [`query_document`](/python-sdk/query_document) ingestion options. -- Listing/management: [`list_documents`](/python-sdk/list_documents), document/folder analytics, graph create/update, caches, chat history, and anywhere an SDK method exposes a `filters` argument. +- Listing/management: [`list_documents`](/python-sdk/list_documents), document/folder analytics, graph create/update, chat history, and anywhere an SDK method exposes a `filters` argument. ## Quick Start diff --git a/cookbooks/cache-augmented-generation.mdx b/cookbooks/cache-augmented-generation.mdx deleted file mode 100644 index 06f9b14..0000000 --- a/cookbooks/cache-augmented-generation.mdx +++ /dev/null @@ -1,9 +0,0 @@ ---- -title: 'Cache Augmented Generation: Speeding up RAG' -description: 'A collection of recipes to help you get started with Morphik' ---- - -## Introduction - -Morphik is a multi-modal RAG system that allows you to store, search, and retrieve data from various sources. It is built on top of the [LangChain](https://www.langchain.com/) framework and the [Chroma](https://www.chromadb.org/) vector store. - diff --git a/docs.json b/docs.json index 959b2a4..c3f0ed2 100644 --- a/docs.json +++ b/docs.json @@ -117,7 +117,6 @@ "blogs/memory-vibe-coding", "blogs/llm-science-battle", "blogs/gpt-vs-morphik-multimodal", - "blogs/cache-augmented-generation", "blogs/stop-parsing-docs" ] } @@ -213,13 +212,6 @@ "python-sdk/check_workflow_status" ] }, - { - "group": "Cache Management", - "pages": [ - "python-sdk/create_cache", - "python-sdk/get_cache" - ] - }, { "group": "Chat & Conversation Management", "pages": [ diff --git a/introduction.mdx b/introduction.mdx index 0f2c062..6b9dd3a 100644 --- a/introduction.mdx +++ b/introduction.mdx @@ -8,7 +8,6 @@ Morphik is a database that makes it easy to create fast, versatile, and producti Key features include: - **First class support for Unstructured Data**: Unlike traditional databases, Morphik allows users to directly ingest unstructured data of all forms - including (but not limited to) videos and PDFs. We've built research-driven custom algorithms to ensure state-of-the-art retrieval accuracy. -- **Persistent KV-caching**: For documents you reference often, you can process them once and *freeze* the LLM's internal state such that you can use it again later. This helps drastically reduce compute costs as well speed up model responses. - **Out of the box MCP support**: Morphik has [built in support](/using-morphik/mcp) for [Model Context Protocol](https://modelcontextprotocol.io/introduction) - so you can integrate your knowledge with any MCP client in a single click. - **User and Folder Scoping**: Organize and isolate your data with multi-user and folder-based access controls. Create logical boundaries for different projects or user groups while maintaining a unified database. - **Completely source-available**: You can check out Morphik core code [here!](https://github.com/morphik-org/morphik-core/). diff --git a/knowledge-base/how-do-i-set-up-rag.mdx b/knowledge-base/how-do-i-set-up-rag.mdx index 07b39ab..b86cbed 100644 --- a/knowledge-base/how-do-i-set-up-rag.mdx +++ b/knowledge-base/how-do-i-set-up-rag.mdx @@ -21,18 +21,15 @@ db = Morphik("morphik://owner_id:token@api.morphik.ai") # 1. Ingest a document doc = db.ingest_file("document.pdf") -# 2. Create a cache for quick retrieval -db.create_cache(name="docs-cache", model="openai_gpt4o", gguf_file="model.gguf", docs=[doc.id]) - -# 3 & 4. Retrieve and generate with context +# 2. Retrieve and generate with context response = db.query("What topics are covered?", k=4) print(response.text) ``` ### Related questions -- **Q:** What are the steps to set up a RAG pipeline with Morphik? - **A:** The key steps are: (1) Initialize the Morphik client, (2) Ingest your documents using `ingest_file()`, (3) Create a cache with `create_cache()`, and (4) Query using `query()` with your question and desired number of results. +- **Q:** What are the steps to set up a RAG pipeline with Morphik? + **A:** The key steps are: (1) Initialize the Morphik client, (2) Ingest your documents using `ingest_file()`, and (3) Query using `query()` with your question and desired number of results. - **Q:** How can I quickly build a retrieval-augmented generation workflow? **A:** Use Morphik's built-in RAG capabilities by following the code example above. The `query()` method handles both retrieval and generation in one step when you provide a question and set `k` for the number of relevant chunks to retrieve. diff --git a/python-sdk/create_cache.mdx b/python-sdk/create_cache.mdx deleted file mode 100644 index 124d192..0000000 --- a/python-sdk/create_cache.mdx +++ /dev/null @@ -1,100 +0,0 @@ ---- -title: "create_cache" -description: "Create a new cache with specified configuration" ---- - - - - ```python - def create_cache( - name: str, - model: str, - gguf_file: str, - filters: Optional[Dict[str, Any]] = None, - docs: Optional[List[str]] = None, - ) -> Dict[str, Any] - ``` - - - ```python - async def create_cache( - name: str, - model: str, - gguf_file: str, - filters: Optional[Dict[str, Any]] = None, - docs: Optional[List[str]] = None, - ) -> Dict[str, Any] - ``` - - - -## Parameters - -- `name` (str): Name of the cache to create -- `model` (str): Name of the model to use (e.g. "llama2") -- `gguf_file` (str): Name of the GGUF file to use for the model -- `filters` (Dict[str, Any], optional): Optional metadata filters to determine which documents to include. These filters will be applied in addition to any specific docs provided. -- `docs` (List[str], optional): Optional list of specific document IDs to include. These docs will be included in addition to any documents matching the filters. - -## Returns - -- `Dict[str, Any]`: Created cache configuration - -## Examples - - - - ```python - from morphik import Morphik - - db = Morphik() - - # This will include both: - # 1. Any documents with category="programming" - # 2. The specific documents "doc1" and "doc2" (regardless of their category) - cache = db.create_cache( - name="programming_cache", - model="llama2", - gguf_file="llama-2-7b-chat.Q4_K_M.gguf", - filters={"category": "programming"}, - docs=["doc1", "doc2"] - ) - ``` - - - ```python - from morphik import AsyncMorphik - - async with AsyncMorphik() as db: - # This will include both: - # 1. Any documents with category="programming" - # 2. The specific documents "doc1" and "doc2" (regardless of their category) - cache = await db.create_cache( - name="programming_cache", - model="llama2", - gguf_file="llama-2-7b-chat.Q4_K_M.gguf", - filters={"category": "programming"}, - docs=["doc1", "doc2"] - ) - ``` - - - -## Cache Usage - -After creating a cache, you can interact with it using the [`get_cache`](/python-sdk/get_cache) method, which returns a Cache object that provides methods for querying and updating the cache. - - - - ```python - cache = db.get_cache("programming_cache") - response = cache.query("Explain recursion in programming") - ``` - - - ```python - cache = await db.get_cache("programming_cache") - response = await cache.query("Explain recursion in programming") - ``` - - \ No newline at end of file diff --git a/python-sdk/get_cache.mdx b/python-sdk/get_cache.mdx deleted file mode 100644 index 82eb19d..0000000 --- a/python-sdk/get_cache.mdx +++ /dev/null @@ -1,103 +0,0 @@ ---- -title: "get_cache" -description: "Get a cache by name" ---- - - - - ```python - def get_cache(name: str) -> Cache - ``` - - - ```python - async def get_cache(name: str) -> AsyncCache - ``` - - - -## Parameters - -- `name` (str): Name of the cache to retrieve - -## Returns - - - - `Cache`: A cache object that is used to interact with the cache - - - `AsyncCache`: An async cache object that is used to interact with the cache - - - -## Examples - - - - ```python - from morphik import Morphik - - db = Morphik() - - cache = db.get_cache("programming_cache") - - # Update the cache with new documents - cache.update() - - # Add specific documents to the cache - cache.add_docs(["doc3", "doc4"]) - - # Query the cache - response = cache.query( - "What are the key concepts in functional programming?", - max_tokens=500, - temperature=0.7 - ) - print(response.completion) - ``` - - - ```python - from morphik import AsyncMorphik - - async with AsyncMorphik() as db: - cache = await db.get_cache("programming_cache") - - # Update the cache with new documents - await cache.update() - - # Add specific documents to the cache - await cache.add_docs(["doc3", "doc4"]) - - # Query the cache - response = await cache.query( - "What are the key concepts in functional programming?", - max_tokens=500, - temperature=0.7 - ) - print(response.completion) - ``` - - - -## Cache Methods - - - - The `Cache` object returned by the synchronous method has the following methods: - - - `update() -> bool`: Update the cache with any new documents matching the original filters - - `add_docs(docs: List[str]) -> bool`: Add specific documents to the cache - - `query(query: str, max_tokens: Optional[int] = None, temperature: Optional[float] = None) -> CompletionResponse`: Query the cache - - - The `AsyncCache` object returned by the asynchronous method has the following methods: - - - `update() -> bool`: Update the cache with any new documents matching the original filters - - `add_docs(docs: List[str]) -> bool`: Add specific documents to the cache - - `query(query: str, max_tokens: Optional[int] = None, temperature: Optional[float] = None) -> CompletionResponse`: Query the cache - - These methods have the same parameters and return values as their synchronous counterparts, but are called with `await`. - - \ No newline at end of file diff --git a/python-sdk/morphik.mdx b/python-sdk/morphik.mdx index 0992f50..8d5b643 100644 --- a/python-sdk/morphik.mdx +++ b/python-sdk/morphik.mdx @@ -138,10 +138,6 @@ Morphik provides the following methods. Each method page includes both synchrono - [list_graphs](/python-sdk/list_graphs) - [update_graph](/python-sdk/update_graph) -### Cache Operations -- [create_cache](/python-sdk/create_cache) -- [get_cache](/python-sdk/get_cache) - ### Client Management - [close](/python-sdk/close)