This guide explains how to deploy the NVIDIA RAG Blueprint for retrieval-only use cases without deploying the LLM generation components. This deployment mode is ideal when you only need document search and retrieval capabilities, saving GPU resources by not running the LLM NIM.
In retrieval-only mode, you deploy:
- Embedding NIM - For converting queries to vectors
- Reranking NIM - For reordering retrieved results by relevance
- Vector Database - For storing and searching document embeddings
- RAG Server - For handling /search API requests
You skip deploying:
- LLM NIM (nim-llm-ms) - Not needed for retrieval-only workflows
This configuration allows you to use the /search API endpoint to retrieve relevant documents without generating LLM responses, significantly reducing GPU memory requirements.
Retrieval-only deployments are useful for:
- Search Applications: Building document search systems without answer generation
- Retrieval Pipelines: Integrating with your own LLM or downstream processing
- Resource-Constrained Environments: When GPU resources are limited
- Custom Generation: Using retrieved documents with an external LLM service
- Testing and Development: Validating retrieval quality before adding generation
:::{important}
Before you deploy the RAG Blueprint, consider the following:
- For self-hosted NIMs, ensure that you have at least 50-80 GB of available disk space for the embedding and reranking model caches (significantly less than a full deployment).
- First-time deployment takes 5-10 minutes for self-hosted NIMs, or 2-3 minutes for NVIDIA-hosted models.
- Model downloads do not show progress bars.

For monitoring deployment progress, refer to Deploy on Kubernetes with Helm.
:::
- Install Docker Engine and Docker Compose. Ensure that the Docker Compose version is 2.29.1 or later.
- Authenticate Docker with NGC:
  export NGC_API_KEY="nvapi-..."
  echo "${NGC_API_KEY}" | docker login nvcr.io -u '$oauthtoken' --password-stdin
- Install the NVIDIA Container Toolkit.
- Clone the RAG Blueprint Git repository to get the necessary deployment files.
- Create a directory to cache the models:
  mkdir -p ~/.cache/model-cache
  export MODEL_DIRECTORY=~/.cache/model-cache
- Export the required environment variables:
  # For self-hosted NIMs
  source deploy/compose/.env
  # For NVIDIA-hosted NIMs
  source deploy/compose/nvdev.env
Choose one of the following options based on your deployment preference.
Instead of starting all NIMs, use the text-embed profile to start only the embedding and reranking services:
USERID=$(id -u) docker compose -f deploy/compose/nims.yaml up -d nemotron-ranking-ms nemotron-embedding-ms
:::{note}
The text-embed profile starts only nemotron-embedding-ms and nemotron-ranking-ms, which is sufficient for retrieval operations. The LLM NIM (nim-llm-ms) is not started, saving significant GPU memory.
:::
Wait for the services to become healthy:
watch -n 2 'docker ps --format "table {{.Names}}\t{{.Status}}"'
Expected output:
NAMES STATUS
nemotron-ranking-ms Up 5 minutes (healthy)
nemotron-embedding-ms Up 5 minutes (healthy)
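
The health wait can also be scripted rather than watched by eye. The following is a minimal sketch, assuming the two-column `docker ps` table format shown above; the function name `unhealthy_services` is hypothetical and is not part of the blueprint.

```python
def unhealthy_services(ps_output: str, required: list[str]) -> list[str]:
    """Return the required services that docker ps does not report as (healthy)."""
    healthy = set()
    for line in ps_output.splitlines():
        # First token is the container name, the rest is the status text.
        parts = line.split(None, 1)
        if len(parts) == 2 and "(healthy)" in parts[1]:
            healthy.add(parts[0])
    return [name for name in required if name not in healthy]


# Sample text in the same shape as the expected output above.
sample = """NAMES STATUS
nemotron-ranking-ms Up 5 minutes (healthy)
nemotron-embedding-ms Up 1 minute (health: starting)
"""

pending = unhealthy_services(sample, ["nemotron-ranking-ms", "nemotron-embedding-ms"])
print("Still waiting for:", pending)
```

In practice the text would come from running the `docker ps` command above (for example through `subprocess.run`), and an empty list would mean both NIMs are ready.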
For an even lighter deployment, use NVIDIA-hosted NIMs for embedding and reranking while running only the RAG server locally:
# Configure to use NVIDIA-hosted endpoints
export APP_EMBEDDINGS_SERVERURL=""
export APP_RANKING_SERVERURL=""
:::{note}
When APP_EMBEDDINGS_SERVERURL and APP_RANKING_SERVERURL are empty, the RAG server uses NVIDIA-hosted API endpoints, which requires a valid NGC_API_KEY.
:::
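
A quick pre-flight check can catch a missing key before the RAG server starts. This is only a sketch of the rule stated in the note: the function name `hosted_endpoint_problems` is hypothetical, and treating only an explicitly empty variable (not an unset one) as "hosted" is an assumption.

```python
import os


def hosted_endpoint_problems(env: dict) -> list:
    """Report configuration problems for NVIDIA-hosted embedding/reranking."""
    url_vars = ("APP_EMBEDDINGS_SERVERURL", "APP_RANKING_SERVERURL")
    # Assumption: a variable that is set to "" selects the hosted endpoint.
    uses_hosted = any(name in env and env[name] == "" for name in url_vars)
    problems = []
    if uses_hosted and not env.get("NGC_API_KEY"):
        problems.append("NGC_API_KEY must be set when using NVIDIA-hosted endpoints")
    return problems


# Check the current shell environment and print anything that looks wrong.
for problem in hosted_endpoint_problems(dict(os.environ)):
    print(problem)
```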
Start the vector database and the RAG server:
docker compose -f deploy/compose/vectordb.yaml up -d
docker compose -f deploy/compose/docker-compose-rag-server.yaml up -d rag-server
Verify the RAG server is running:
curl -X 'GET' 'http://localhost:8081/v1/health?check_dependencies=true' -H 'accept: application/json'
If you need to ingest documents, start the ingestion server:
docker compose -f deploy/compose/docker-compose-ingestor-server.yaml up -d ingestor-server
:::{tip}
If you already have documents ingested from a previous deployment, you can skip this step and use the existing collections.
:::
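
In scripts, the RAG server health check from the earlier step can be polled until it succeeds. The sketch below uses only the Python standard library and the health URL from this guide. It assumes that an HTTP 200 response is a sufficient readiness signal and does not inspect the response body, whose schema is not described here. The names `http_ok` and `wait_until_ready` are hypothetical.

```python
import time
import urllib.request

HEALTH_URL = "http://localhost:8081/v1/health?check_dependencies=true"


def http_ok(url: str, timeout: float = 5.0) -> bool:
    """Return True if a GET request to url answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # Covers connection errors, timeouts, and HTTP error statuses.
        return False


def wait_until_ready(probe, attempts: int = 30, delay: float = 2.0, sleep=time.sleep) -> bool:
    """Call probe() until it returns True or the attempts run out."""
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            sleep(delay)
    return False


# Example wiring (performs real network calls when uncommented):
# ready = wait_until_ready(lambda: http_ok(HEALTH_URL))
```

Passing the probe as a callable keeps the retry loop independent of the HTTP client, so the same loop could wrap any other readiness check.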
The /search endpoint retrieves relevant documents without LLM generation. This is the primary API for retrieval-only mode.
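
As a rough sketch, a request body for /search can be assembled by a small helper that adds optional fields only when they are set. The field names (`query`, `collection_names`, `enable_reranker`, `reranker_top_k`, `vdb_top_k`, `filter_expr`) are the ones used in this guide; the helper `build_search_payload` itself is hypothetical, and the endpoint may accept other fields that are not covered here.

```python
import json


def build_search_payload(query, collection_names, enable_reranker=True,
                         reranker_top_k=None, vdb_top_k=None, filter_expr=None):
    """Build a /search request body, omitting optional fields that are unset."""
    if not query.strip():
        raise ValueError("query must not be empty")
    payload = {
        "query": query,
        "collection_names": list(collection_names),
        "enable_reranker": enable_reranker,
    }
    optional = {
        "reranker_top_k": reranker_top_k,
        "vdb_top_k": vdb_top_k,
        "filter_expr": filter_expr,
    }
    payload.update({key: value for key, value in optional.items() if value is not None})
    return payload


example = build_search_payload("What is the return policy?", ["my_collection"], reranker_top_k=5)
print(json.dumps(example, indent=2))
```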
import requests

url = "http://localhost:8081/v1/search"
payload = {
    "query": "What are the key features of the product?",
    "collection_names": ["my_collection"],
    "enable_reranker": True,
}
response = requests.post(url, json=payload)
response.raise_for_status()  # Fail early on HTTP errors
results = response.json()

# Process retrieved documents
for doc in results.get("citations", []):
    print(f"Source: {doc['source']}")
    print(f"Content: {doc['content'][:200]}...")
    print(f"Score: {doc.get('score', 'N/A')}")
    print("---")

To restrict results by metadata, add a filter expression to the payload:

payload = {
    "query": "What are the key features of the product?",
    "collection_names": ["my_collection"],
    "enable_reranker": True,
    # Filter by custom metadata
    "filter_expr": 'content_metadata["category"] == "electronics"',
}

You can also use the provided CLI script for search operations:
# Basic search
python scripts/retriever_api_usage.py --mode search "Tell me about the product features"
# Search with specific collection
python scripts/retriever_api_usage.py \
--mode search \
--payload-json '{"collection_names":["my_collection"], "reranker_top_k": 5}' \
"What is the return policy?"
# Save results to file
python scripts/retriever_api_usage.py \
  --mode search \
  --output-json results.json \
  "Technical specifications"

For Kubernetes deployments, configure the Helm chart to disable the LLM NIM:
helm upgrade --install rag nvidia-blueprint-rag \
--namespace rag \
--set nimOperator.nim-llm.enabled=false \
--set nimOperator.nvidia-nim-llama-32-nv-embedqa-1b-v2.enabled=true \
--set nimOperator.nvidia-nim-llama-32-nv-rerankqa-1b-v2.enabled=true \
--set imagePullSecret.password=$NGC_API_KEY \
--set ngcApiSecret.password=$NGC_API_KEY

Or modify values.yaml:
# Disable LLM NIM for retrieval-only deployment
nimOperator:
  nim-llm:
    enabled: false
  # Keep embedding and reranking NIMs enabled
  nvidia-nim-llama-32-nv-embedqa-1b-v2:
    enabled: true
  nvidia-nim-llama-32-nv-rerankqa-1b-v2:
    enabled: true

After retrieving documents, you can send them to your own LLM for generation:
import requests

# Step 1: Retrieve relevant documents
query = "What are the key features of the product?"
search_url = "http://localhost:8081/v1/search"
search_payload = {
    "query": query,
    "reranker_top_k": 5,
    "collection_names": ["my_collection"],
    "enable_reranker": True,
}
search_response = requests.post(search_url, json=search_payload)
search_response.raise_for_status()  # Fail early on HTTP errors
citations = search_response.json().get("citations", [])

# Step 2: Format context from retrieved documents
context = "\n\n".join(
    f"[Source: {doc['source']}]\n{doc['content']}"
    for doc in citations
)

# Step 3: Send to your LLM, asking the same question that was used for retrieval
prompt = f"""Based on the following context, answer the question.
Context:
{context}
Question: {query}
Answer:"""

# Use your preferred LLM API (OpenAI, Claude, local model, etc.).
# your_llm_client is a placeholder for that client, not a real library.
llm_response = your_llm_client.generate(prompt)

| Deployment Mode | Required GPUs | Memory Usage |
|---|---|---|
| Full RAG (with LLM) | 2-4 GPUs | ~160GB+ |
| Retrieval-Only | 1 GPU | ~24GB |
| Cloud-Hosted NIMs | 0 GPUs | N/A |
:::{note}
GPU requirements depend on the specific embedding and reranking models used. The values above are estimates for the default models.
:::
If requests to the /generate endpoint fail, this is expected behavior in retrieval-only mode. The /generate endpoint requires an LLM, which is not deployed. Use the /search endpoint instead.
Check the embedding NIM logs:
docker logs nemotron-embedding-ms
Ensure the model cache directory has proper permissions:
chmod -R 755 ~/.cache/model-cache
- Verify documents are ingested in the collection:
  curl -X GET "http://localhost:8082/v1/documents?collection_name=my_collection"
- Check that the collection name in the search request matches the ingested collection.
- Try increasing vdb_top_k to retrieve more candidates.
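
The last suggestion can be automated by retrying a search with progressively larger `vdb_top_k` values. This is a sketch only: the step values are arbitrary illustrations, and `search` is a hypothetical callable that takes a payload and returns the list of citations (for example, a thin wrapper around a POST to /search).

```python
def search_with_wider_net(search, payload, top_k_steps=(10, 50, 100)):
    """Retry a search with growing vdb_top_k until it returns citations.

    Returns (vdb_top_k_used, citations), or (None, []) if every step was empty.
    """
    for top_k in top_k_steps:
        # Copy the payload so the caller's dict is never modified.
        citations = search({**payload, "vdb_top_k": top_k})
        if citations:
            return top_k, citations
    return None, []


# Stand-in for a real /search call, used here only to show the control flow.
def fake_search(payload):
    return [{"source": "a.pdf"}] if payload["vdb_top_k"] >= 50 else []


base = {"query": "What is the return policy?", "collection_names": ["my_collection"]}
used_top_k, found = search_with_wider_net(fake_search, base)
print(used_top_k, len(found))
```

If every step comes back empty, the more likely causes are the earlier items in the list: no ingested documents or a mismatched collection name.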
To stop all retrieval-only services:
docker compose -f deploy/compose/docker-compose-rag-server.yaml down
docker compose -f deploy/compose/vectordb.yaml down
docker compose -f deploy/compose/nims.yaml down