
Commit 9b185ee

patelchaitany authored and ntkathole committed
Add DocEmbedder docs and examples
Signed-off-by: Chaitany patel <patelchaitany93@gmail.com>
1 parent 858c012 commit 9b185ee

6 files changed

Lines changed: 374 additions & 41 deletions


docs/getting-started/genai.md

Lines changed: 48 additions & 0 deletions
@@ -56,6 +56,53 @@ The transformation workflow typically involves:
3. **Chunking**: Split documents into smaller, semantically meaningful chunks
4. **Embedding Generation**: Convert text chunks into vector embeddings
5. **Storage**: Store embeddings and metadata in Feast's feature store

### DocEmbedder: End-to-End Document Ingestion Pipeline

The `DocEmbedder` class provides an end-to-end pipeline for ingesting documents into Feast's online vector store. It handles chunking, embedding generation, and writing the results, all in a single step.

#### Key Components

* **`DocEmbedder`**: High-level orchestrator that runs the full pipeline: chunk → embed → schema transform → write to online store
* **`BaseChunker` / `TextChunker`**: Pluggable chunking layer. `TextChunker` splits text by word count with configurable `chunk_size`, `chunk_overlap`, `min_chunk_size`, and `max_chunk_chars`
* **`BaseEmbedder` / `MultiModalEmbedder`**: Pluggable embedding layer with modality routing. `MultiModalEmbedder` supports text (via sentence-transformers) and image (via CLIP) with lazy model loading
* **`SchemaTransformFn`**: A user-defined function that transforms the chunked and embedded DataFrame into the format expected by the FeatureView schema
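To make the `chunk_size` / `chunk_overlap` parameters concrete, here is a minimal standalone sketch of word-count chunking with overlap (an illustration only, not the actual `TextChunker` implementation):

```python
def chunk_by_words(text: str, chunk_size: int = 5, chunk_overlap: int = 2) -> list[str]:
    """Split text into word-count chunks; each chunk re-uses the last
    `chunk_overlap` words of the previous one for context continuity."""
    words = text.split()
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the final window already covers the remaining words
    return chunks

print(chunk_by_words("one two three four five six seven eight"))
# ['one two three four five', 'four five six seven eight']
```

Feast's `TextChunker` additionally enforces `min_chunk_size` and `max_chunk_chars`, but the sliding-window idea is the same.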
#### Quick Example

```python
from feast import DocEmbedder
import pandas as pd

# Prepare your documents
df = pd.DataFrame({
    "id": ["doc1", "doc2"],
    "text": ["First document content...", "Second document content..."],
})

# Create the DocEmbedder: it automatically generates a FeatureView and applies the repo
embedder = DocEmbedder(
    repo_path="feature_repo/",
    feature_view_name="text_feature_view",
)

# Embed and ingest documents in one step
result = embedder.embed_documents(
    documents=df,
    id_column="id",
    source_column="text",
    column_mapping=("text", "text_embedding"),
)
```

#### Features

* **Auto-generates FeatureView**: Creates a Python file with Entity and FeatureView definitions compatible with `feast apply`
* **Auto-applies repo**: Registers the generated FeatureView in the registry automatically
* **Custom schema transform**: Provide your own `SchemaTransformFn` to control how chunked and embedded data maps to your FeatureView schema
* **Extensible**: Subclass `BaseChunker` or `BaseEmbedder` to plug in your own chunking or embedding strategies
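For orientation, the auto-generated FeatureView file might look roughly like the following. This is a hypothetical sketch, not the actual generated code: the entity, source path, and field names are assumptions, and `vector_index` / `vector_search_metric` come from Feast's alpha vector-database support.

```python
# Hypothetical sketch of an auto-generated feature view definition
from datetime import timedelta

from feast import Entity, FeatureView, Field, FileSource
from feast.types import Array, Float32, String

chunk = Entity(name="chunk", join_keys=["chunk_id"])

source = FileSource(path="data/chunks.parquet", timestamp_field="event_timestamp")

text_feature_view = FeatureView(
    name="text_feature_view",
    entities=[chunk],
    ttl=timedelta(days=1),
    schema=[
        Field(
            name="text_embedding",
            dtype=Array(Float32),
            vector_index=True,             # enable vector search on this field
            vector_search_metric="COSINE",
        ),
        Field(name="text", dtype=String),
    ],
    source=source,
)
```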

For a complete walkthrough, see the [DocEmbedder tutorial notebook](../../examples/rag-retriever/rag_feast_docembedder.ipynb).
### Feature Transformation for LLMs

Feast supports transformations that can be used to:
@@ -190,6 +237,7 @@ For more detailed information and examples:

* [Vector Database Reference](../reference/alpha-vector-database.md)
* [RAG Tutorial with Docling](../tutorials/rag-with-docling.md)
* [DocEmbedder Tutorial Notebook](../../examples/rag-retriever/rag_feast_docembedder.ipynb)
* [RAG Fine Tuning with Feast and Milvus](../../examples/rag-retriever/README.md)
* [Milvus Quickstart Example](https://github.com/feast-dev/feast/tree/master/examples/rag/milvus-quickstart.ipynb)
* [Feast + Ray: Distributed Processing for RAG Applications](https://feast.dev/blog/feast-ray-distributed-processing/)

docs/tutorials/rag-with-docling.md

Lines changed: 228 additions & 0 deletions
@@ -409,6 +409,234 @@ response = client.chat.completions.create(
print('\n'.join([c.message.content for c in response.choices]))
```

## Alternative: Using DocEmbedder for Simplified Ingestion

Instead of manually chunking, embedding, and writing documents as shown above, you can use Feast's `DocEmbedder` class to handle the entire pipeline in a single step. `DocEmbedder` automates chunking, embedding generation, FeatureView creation, and writing to the online store.

### Install Dependencies

```bash
pip install feast[milvus,rag]
```

### Set Up and Ingest with DocEmbedder

```python
from feast import DocEmbedder
import pandas as pd

# Prepare your documents as a DataFrame
df = pd.DataFrame({
    "id": ["doc1", "doc2", "doc3"],
    "text": [
        "Aaron is a prophet, high priest, and the brother of Moses...",
        "God at Sinai granted Aaron the priesthood for himself...",
        "His rod turned into a snake. Then he stretched out...",
    ],
})

# DocEmbedder handles everything: generates the FeatureView, applies the repo,
# chunks text, generates embeddings, and writes to the online store
embedder = DocEmbedder(
    repo_path="feature_repo/",
    feature_view_name="text_feature_view",
)

result = embedder.embed_documents(
    documents=df,
    id_column="id",
    source_column="text",
    column_mapping=("text", "text_embedding"),
)
```

### Retrieve and Query

Once documents are ingested, you can retrieve them the same way as shown in Step 5 above:

```python
from feast import FeatureStore

store = FeatureStore("feature_repo/")

query_embedding = embed_text("Who are the authors of the paper?")
context_data = store.retrieve_online_documents_v2(
    features=[
        "text_feature_view:embedding",
        "text_feature_view:text",
        "text_feature_view:source_id",
    ],
    query=query_embedding,
    top_k=3,
    distance_metric="COSINE",
).to_df()
```

### Customizing the Pipeline

`DocEmbedder` is extensible at every stage. The examples below show how to create custom components and wire them together.

#### Custom Chunker

Subclass `BaseChunker` to implement your own chunking strategy. The `load_parse_and_chunk` method receives each document and must return a list of chunk dictionaries.

```python
from feast.chunker import BaseChunker, ChunkingConfig
from typing import Any, Optional

class SentenceChunker(BaseChunker):
    """Chunks text by sentences instead of word count."""

    def load_parse_and_chunk(
        self,
        source: Any,
        source_id: str,
        source_column: str,
        source_type: Optional[str] = None,
    ) -> list[dict]:
        import re

        text = str(source)
        # Split on sentence boundaries
        sentences = re.split(r'(?<=[.!?])\s+', text)

        chunks = []
        current_chunk = []
        chunk_index = 0

        for sentence in sentences:
            current_chunk.append(sentence)
            combined = " ".join(current_chunk)

            if len(combined.split()) >= self.config.chunk_size:
                chunks.append({
                    "chunk_id": f"{source_id}_{chunk_index}",
                    "original_id": source_id,
                    source_column: combined,
                    "chunk_index": chunk_index,
                })
                # Keep overlap by retaining the last sentence
                current_chunk = [sentence]
                chunk_index += 1

        # Don't forget the last chunk
        if current_chunk and len(" ".join(current_chunk).split()) >= self.config.min_chunk_size:
            chunks.append({
                "chunk_id": f"{source_id}_{chunk_index}",
                "original_id": source_id,
                source_column: " ".join(current_chunk),
                "chunk_index": chunk_index,
            })

        return chunks
```
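The sentence-boundary regex used above can be sanity-checked on its own; it splits after `.`, `!`, or `?` followed by whitespace while keeping the punctuation attached:

```python
import re

text = "First sentence. Second one! Third? Done."
# Lookbehind keeps the terminator with its sentence
sentences = re.split(r'(?<=[.!?])\s+', text)
print(sentences)  # ['First sentence.', 'Second one!', 'Third?', 'Done.']
```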

Or simply configure the built-in `TextChunker`:

```python
from feast import TextChunker, ChunkingConfig

chunker = TextChunker(config=ChunkingConfig(
    chunk_size=200,
    chunk_overlap=50,
    min_chunk_size=30,
    max_chunk_chars=1000,
))
```

#### Custom Embedder

Subclass `BaseEmbedder` to use a different embedding model. Register modality handlers in `_register_default_modalities` and implement the `embed` method.

```python
from feast.embedder import BaseEmbedder, EmbeddingConfig
from typing import Any, List, Optional
import numpy as np

class OpenAIEmbedder(BaseEmbedder):
    """Embedder that uses the OpenAI API for text embeddings."""

    def __init__(self, model: str = "text-embedding-3-small", config: Optional[EmbeddingConfig] = None):
        self.model = model
        self._client = None
        super().__init__(config)

    def _register_default_modalities(self) -> None:
        self.register_modality("text", self._embed_text)

    @property
    def client(self):
        # Lazily create the OpenAI client on first use
        if self._client is None:
            from openai import OpenAI
            self._client = OpenAI()
        return self._client

    def get_embedding_dim(self, modality: str) -> Optional[int]:
        # text-embedding-3-small produces 1536-dimensional vectors
        if modality == "text":
            return 1536
        return None

    def embed(self, inputs: List[Any], modality: str) -> np.ndarray:
        if modality not in self._modality_handlers:
            raise ValueError(f"Unsupported modality: '{modality}'")
        return self._modality_handlers[modality](inputs)

    def _embed_text(self, inputs: List[str]) -> np.ndarray:
        response = self.client.embeddings.create(input=inputs, model=self.model)
        return np.array([item.embedding for item in response.data])
```
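The modality routing in `embed` is just a dictionary of handlers keyed by modality name. Here is a stripped-down standalone version of that pattern (illustration only, not Feast's `BaseEmbedder`; the toy handler maps each input to its word count instead of a real embedding):

```python
from typing import Any, Callable, Dict, List

class TinyRouter:
    """Minimal modality-routing sketch: one handler registered per modality."""

    def __init__(self) -> None:
        self._modality_handlers: Dict[str, Callable[[List[Any]], List[Any]]] = {}
        # Toy "text embedder": each input becomes its word count
        self.register_modality("text", lambda inputs: [len(str(x).split()) for x in inputs])

    def register_modality(self, modality: str, handler: Callable) -> None:
        self._modality_handlers[modality] = handler

    def embed(self, inputs: List[Any], modality: str) -> List[Any]:
        # Route to the registered handler, or fail loudly for unknown modalities
        if modality not in self._modality_handlers:
            raise ValueError(f"Unsupported modality: '{modality}'")
        return self._modality_handlers[modality](inputs)

router = TinyRouter()
print(router.embed(["hello world", "one two three"], "text"))  # [2, 3]
```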

#### Custom Schema Transform Function

The schema transform function converts the chunked and embedded DataFrame into the exact schema your FeatureView expects. It must accept a `pd.DataFrame` and return a `pd.DataFrame`.

```python
import pandas as pd
from datetime import datetime, timezone

def my_schema_transform_fn(df: pd.DataFrame) -> pd.DataFrame:
    """Map chunked and embedded columns to the FeatureView schema."""
    return pd.DataFrame({
        "passage_id": df["chunk_id"],
        "text": df["text"],
        "embedding": df["text_embedding"],
        "event_timestamp": [datetime.now(timezone.utc)] * len(df),
        "source_id": df["original_id"],
        # Add any extra columns your FeatureView expects
        "chunk_index": df["chunk_index"],
    })
```
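You can exercise such a transform on a toy frame without any Feast infrastructure; the input column names mirror the chunker output shown earlier, and the function is repeated so this snippet is self-contained:

```python
import pandas as pd
from datetime import datetime, timezone

def my_schema_transform_fn(df: pd.DataFrame) -> pd.DataFrame:
    """Map chunked and embedded columns to the FeatureView schema."""
    return pd.DataFrame({
        "passage_id": df["chunk_id"],
        "text": df["text"],
        "embedding": df["text_embedding"],
        "event_timestamp": [datetime.now(timezone.utc)] * len(df),
        "source_id": df["original_id"],
        "chunk_index": df["chunk_index"],
    })

# Toy frame with the columns a chunker + embedder would produce
chunked = pd.DataFrame({
    "chunk_id": ["doc1_0", "doc1_1"],
    "original_id": ["doc1", "doc1"],
    "text": ["first chunk", "second chunk"],
    "text_embedding": [[0.1, 0.2], [0.3, 0.4]],
    "chunk_index": [0, 1],
})

out = my_schema_transform_fn(chunked)
print(list(out.columns))
```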

#### Putting It All Together

Pass your custom components to `DocEmbedder`:

```python
from feast import DocEmbedder

embedder = DocEmbedder(
    repo_path="feature_repo/",
    feature_view_name="text_feature_view",
    chunker=SentenceChunker(config=ChunkingConfig(chunk_size=150, min_chunk_size=20)),
    embedder=OpenAIEmbedder(model="text-embedding-3-small"),
    schema_transform_fn=my_schema_transform_fn,
    vector_length=1536,  # Match the OpenAI embedding dimension
)

# Embed and ingest
result = embedder.embed_documents(
    documents=df,
    id_column="id",
    source_column="text",
    column_mapping=("text", "text_embedding"),
)
```

> **Note:** When using a custom `schema_transform_fn`, ensure the returned DataFrame columns match your FeatureView schema. When using a custom embedder with a different output dimension, set `vector_length` accordingly (or let it auto-detect via `get_embedding_dim`).

For a complete end-to-end example, see the [DocEmbedder notebook](https://github.com/feast-dev/feast/tree/master/examples/rag-retriever/rag_feast_docembedder.ipynb).

## Why Feast for RAG?

Feast makes it remarkably easy to set up and manage a RAG system by:

examples/rag-retriever/README.md

Lines changed: 53 additions & 0 deletions
@@ -62,6 +62,59 @@ Navigate to the examples/rag-retriever directory. Here you will find the followi

Open `rag_feast.ipynb` and follow the steps in the notebook to run the example.

## Using DocEmbedder for Simplified Ingestion

As an alternative to the manual data preparation steps in the notebook above, Feast provides the `DocEmbedder` class, which automates the entire document-to-embeddings pipeline: chunking, embedding generation, FeatureView creation, and writing to the online store.

### Install Dependencies

```bash
pip install feast[milvus,rag]
```

### Quick Start

```python
from feast import DocEmbedder
from datasets import load_dataset

# Load your dataset
dataset = load_dataset("facebook/wiki_dpr", "psgs_w100.nq.exact", split="train[:1%]",
                       with_index=False, trust_remote_code=True)
df = dataset.select(range(100)).to_pandas()

# DocEmbedder handles everything in one step
embedder = DocEmbedder(
    repo_path="feature_repo_docembedder/",
    feature_view_name="text_feature_view",
)

result = embedder.embed_documents(
    documents=df,
    id_column="id",
    source_column="text",
    column_mapping=("text", "text_embedding"),
)
```

### What DocEmbedder Does

1. **Generates a FeatureView**: Automatically creates a Python file with Entity and FeatureView definitions compatible with `feast apply`
2. **Applies the repo**: Registers the FeatureView in the Feast registry and deploys infrastructure (e.g., a Milvus collection)
3. **Chunks documents**: Splits text into smaller passages using `TextChunker` (configurable chunk size, overlap, etc.)
4. **Generates embeddings**: Produces vector embeddings using `MultiModalEmbedder` (defaults to `all-MiniLM-L6-v2`)
5. **Writes to the online store**: Stores the processed data in your configured online store (e.g., Milvus)

### Customization

* **Custom Chunker**: Subclass `BaseChunker` for your own chunking strategy
* **Custom Embedder**: Subclass `BaseEmbedder` to use a different embedding model
* **Schema Transform Function**: Provide a `SchemaTransformFn` to control how the output maps to your FeatureView schema

### Example Notebook

See **`rag_feast_docembedder.ipynb`** for a complete end-to-end example that uses DocEmbedder with the Wiki DPR dataset and then queries the results using `FeastRAGRetriever`.

## FeastRagRetriever Low Level Design

<img src="images/FeastRagRetriever.png" width="800" height="450" alt="Low level design for feast rag retriever">

sdk/python/feast/__init__.py

Lines changed: 2 additions & 2 deletions
@@ -17,7 +17,7 @@
 from .chunker import BaseChunker, ChunkingConfig, TextChunker
 from .data_source import KafkaSource, KinesisSource, PushSource, RequestSource
 from .dataframe import DataFrameEngine, FeastDataFrame
-from .doc_embedder import DocEmbedder, LogicalLayerFn
+from .doc_embedder import DocEmbedder, SchemaTransformFn
 from .embedder import BaseEmbedder, EmbeddingConfig, MultiModalEmbedder
 from .entity import Entity
 from .feature import Feature
@@ -66,7 +66,7 @@
     "Project",
     "FeastVectorStore",
     "DocEmbedder",
-    "LogicalLayerFn",
+    "SchemaTransformFn",
     "BaseChunker",
     "TextChunker",
     "ChunkingConfig",
