The `DocEmbedder` class provides an end-to-end pipeline for ingesting documents into Feast's online vector store. It handles chunking, embedding generation, and writing results -- all in a single step.
#### Key Components
* **`DocEmbedder`**: High-level orchestrator that runs the full pipeline: chunk → embed → schema transform → write to the online store
* **`BaseChunker` / `TextChunker`**: Pluggable chunking layer. `TextChunker` splits text by word count with configurable `chunk_size`, `chunk_overlap`, `min_chunk_size`, and `max_chunk_chars`
* **`BaseEmbedder` / `MultiModalEmbedder`**: Pluggable embedding layer with modality routing. `MultiModalEmbedder` supports text (via sentence-transformers) and image (via CLIP) with lazy model loading
* **`SchemaTransformFn`**: A user-defined function that transforms the chunked + embedded DataFrame into the format expected by the FeatureView schema
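The flow these components implement can be sketched end to end with stand-in functions (illustrative only; the function names and the stub embedder below are not the actual Feast API):

```python
from typing import Callable

# Stand-in for the chunking stage: fixed-size word windows.
def chunk(doc: str, size: int = 5) -> list[str]:
    words = doc.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

# Stand-in for the embedding stage: a real embedder returns dense
# vectors; length-as-a-float is just a stub here.
def embed(chunks: list[str]) -> list[list[float]]:
    return [[float(len(c))] for c in chunks]

def run_pipeline(doc: str, transform: Callable[[dict], dict]) -> list[dict]:
    chunks = chunk(doc)
    vectors = embed(chunks)
    rows = [{"chunk": c, "embedding": v} for c, v in zip(chunks, vectors)]
    # The transformed rows are what would be written to the online store.
    return [transform(r) for r in rows]

rows = run_pipeline("one two three four five six", lambda r: r)
# → two rows: "one two three four five" and "six"
```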
```python
# ... (tail of the preceding retrieval example) ...
print('\n'.join([c.message.content for c in response.choices]))
```
## Alternative: Using DocEmbedder for Simplified Ingestion
Instead of manually chunking, embedding, and writing documents as shown above, you can use Feast's `DocEmbedder` class to handle the entire pipeline in a single step. `DocEmbedder` automates chunking, embedding generation, FeatureView creation, and writing to the online store.
### Install Dependencies
```bash
pip install feast[milvus,rag]
```
### Set Up and Ingest with DocEmbedder
```python
from feast import DocEmbedder
import pandas as pd

# Prepare your documents as a DataFrame
df = pd.DataFrame({
    "id": ["doc1", "doc2", "doc3"],
    "text": [
        "Aaron is a prophet, high priest, and the brother of Moses...",
        "God at Sinai granted Aaron the priesthood for himself...",
        "His rod turned into a snake. Then he stretched out...",
    ],
})
# ... (the DocEmbedder construction and ingestion call are omitted in
# this excerpt; see the linked notebook for the complete example) ...
```
`DocEmbedder` is extensible at every stage. Below are examples of how to create custom components and wire them together.
#### Custom Chunker
Subclass `BaseChunker` to implement your own chunking strategy. The `load_parse_and_chunk` method receives each document and must return a list of chunk dictionaries.
```python
from feast.chunker import BaseChunker, ChunkingConfig
from typing import Any, Optional


class SentenceChunker(BaseChunker):
    """Chunks text by sentences instead of word count."""

    def load_parse_and_chunk(
        self,
        source: Any,
        source_id: str,
        source_column: str,
        source_type: Optional[str] = None,
    ) -> list[dict]:
        import re

        text = str(source)
        # Split on sentence boundaries
        sentences = re.split(r'(?<=[.!?])\s+', text)

        chunks = []
        current_chunk = []
        chunk_index = 0

        for sentence in sentences:
            current_chunk.append(sentence)
            combined = " ".join(current_chunk)

            if len(combined.split()) >= self.config.chunk_size:
                chunks.append({
                    "chunk_id": f"{source_id}_{chunk_index}",
                    "original_id": source_id,
                    source_column: combined,
                    "chunk_index": chunk_index,
                })
                # Keep overlap by retaining the last sentence
                current_chunk = [sentence]
                chunk_index += 1

        # Don't forget the last chunk
        if current_chunk and len(" ".join(current_chunk).split()) >= self.config.min_chunk_size:
            chunks.append({
                "chunk_id": f"{source_id}_{chunk_index}",
                "original_id": source_id,
                source_column: " ".join(current_chunk),
                "chunk_index": chunk_index,
            })

        return chunks
```
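The sentence-boundary regex used above can be exercised on its own, with no Feast dependency:

```python
import re

def split_sentences(text: str) -> list[str]:
    # Same boundary rule as SentenceChunker: split after ., !, or ?
    # followed by whitespace, keeping the punctuation on each sentence.
    return re.split(r'(?<=[.!?])\s+', text)

sents = split_sentences("First one. Second one! Third?")
# → ["First one.", "Second one!", "Third?"]
```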
Or simply configure the built-in `TextChunker`:
```python
from feast import TextChunker, ChunkingConfig

chunker = TextChunker(config=ChunkingConfig(
    chunk_size=200,
    chunk_overlap=50,
    min_chunk_size=30,
    max_chunk_chars=1000,
))
```
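To make `chunk_size` and `chunk_overlap` concrete, here is a minimal word-window chunker implementing the same sliding-window idea (an illustration of the semantics, not `TextChunker`'s actual implementation):

```python
def word_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    words = text.split()
    # Each window covers chunk_size words and advances by
    # (chunk_size - chunk_overlap), so consecutive chunks share
    # chunk_overlap words.
    step = chunk_size - chunk_overlap
    return [
        " ".join(words[i:i + chunk_size])
        for i in range(0, max(len(words) - chunk_overlap, 1), step)
    ]

chunks = word_chunks("a b c d e f g h", chunk_size=4, chunk_overlap=2)
# → ["a b c d", "c d e f", "e f g h"]
```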
#### Custom Embedder
Subclass `BaseEmbedder` to use a different embedding model. Register modality handlers in `_register_default_modalities` and implement the `embed` method.
```python
from feast.embedder import BaseEmbedder, EmbeddingConfig
from typing import Any, List, Optional
import numpy as np


class OpenAIEmbedder(BaseEmbedder):
    """Embedder that uses the OpenAI API for text embeddings."""

    # NOTE: this excerpt is abbreviated; the modality registration in
    # `_register_default_modalities` is omitted, and the `embed`
    # signature below is illustrative.
    def embed(self, inputs: List[Any], modality: Optional[str] = None) -> np.ndarray:
        from openai import OpenAI

        client = OpenAI()
        response = client.embeddings.create(
            model="text-embedding-3-small",
            input=[str(x) for x in inputs],
        )
        return np.array([item.embedding for item in response.data])
```
#### Custom Logical Layer Function
The schema transform function transforms the chunked + embedded DataFrame into the exact schema your FeatureView expects. It must accept a `pd.DataFrame` and return a `pd.DataFrame`.
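For instance, a transform that renames the chunk-text column and stamps an event time might look like this (the column names here are illustrative; match them to your own FeatureView schema):

```python
from datetime import datetime, timezone
import pandas as pd

def my_schema_transform(df: pd.DataFrame) -> pd.DataFrame:
    # Rename the chunk text column to what the FeatureView expects
    # and add the event timestamp Feast needs for ingestion.
    out = df.rename(columns={"text": "passage_text"})
    out["event_timestamp"] = datetime.now(timezone.utc)
    return out

df = pd.DataFrame({
    "chunk_id": ["d1_0"],
    "text": ["hello"],
    "embedding": [[0.1, 0.2]],
})
out = my_schema_transform(df)
# → columns: chunk_id, passage_text, embedding, event_timestamp
```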
```python
# ... (the schema transform function and the first DocEmbedder
# constructor arguments are omitted in this excerpt) ...
embedder = DocEmbedder(
    # ...
    vector_length=1536,  # Match the OpenAI embedding dimension
)

# Embed and ingest
result = embedder.embed_documents(
    documents=df,
    id_column="id",
    source_column="text",
    column_mapping=("text", "text_embedding"),
)
```
> **Note:** When using a custom `schema_transform_fn`, ensure the returned DataFrame columns match your FeatureView schema. When using a custom embedder with a different output dimension, set `vector_length` accordingly (or let it auto-detect via `get_embedding_dim`).
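A quick pre-ingestion sanity check for that dimension match, in plain NumPy (independent of Feast; the helper name is ours):

```python
import numpy as np

def check_vector_length(embeddings: np.ndarray, vector_length: int) -> int:
    """Raise if the embedder's output dimension disagrees with vector_length."""
    dim = embeddings.shape[1]  # embeddings: (n_chunks, dim)
    if dim != vector_length:
        raise ValueError(f"embedder produced dim {dim}, expected {vector_length}")
    return dim

dim = check_vector_length(np.zeros((3, 1536)), 1536)
# → 1536
```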
For a complete end-to-end example, see the [DocEmbedder notebook](https://github.com/feast-dev/feast/tree/master/examples/rag-retriever/rag_feast_docembedder.ipynb).
## Why Feast for RAG?
Feast makes it remarkably easy to set up and manage a RAG system by:
---

**File: `examples/rag-retriever/README.md`**

Navigate to the examples/rag-retriever directory. Here you will find the following files.
Open `rag_feast.ipynb` and follow the steps in the notebook to run the example.
## Using DocEmbedder for Simplified Ingestion
As an alternative to the manual data preparation steps in the notebook above, Feast provides the `DocEmbedder` class that automates the entire document-to-embeddings pipeline: chunking, embedding generation, FeatureView creation, and writing to the online store.
1. **Generates a FeatureView**: Automatically creates a Python file with Entity and FeatureView definitions compatible with `feast apply`
2. **Applies the repo**: Registers the FeatureView in the Feast registry and deploys infrastructure (e.g., a Milvus collection)
3. **Chunks documents**: Splits text into smaller passages using `TextChunker` (configurable chunk size, overlap, etc.)
4. **Generates embeddings**: Produces vector embeddings using `MultiModalEmbedder` (defaults to `all-MiniLM-L6-v2`)
5. **Writes to the online store**: Stores the processed data in your configured online store (e.g., Milvus)
### Customization
* **Custom Chunker**: Subclass `BaseChunker` for your own chunking strategy
* **Custom Embedder**: Subclass `BaseEmbedder` to use a different embedding model
* **Logical Layer Function**: Provide a `SchemaTransformFn` to control how the output maps to your FeatureView schema
### Example Notebook
See **`rag_feast_docembedder.ipynb`** for a complete end-to-end example that uses DocEmbedder with the Wiki DPR dataset and then queries the results using `FeastRAGRetriever`.
## FeastRAGRetriever Low-Level Design
<img src="images/FeastRagRetriever.png" width="800" height="450" alt="Low level design for feast rag retriever">