---
layout: default
title: "Chapter 5: LLM Integration & Configuration"
parent: RAGFlow Tutorial
nav_order: 5
---
Welcome to Chapter 5: LLM Integration & Configuration. In this part of RAGFlow Tutorial: Complete Guide to Open-Source RAG Engine, you will build an intuitive mental model first, then move into concrete implementation details and practical production tradeoffs.

This chapter covers how to connect RAGFlow with various Large Language Models (LLMs) to power intelligent, document-based question answering. You'll learn to configure different LLM providers and tune their performance for your RAG applications.
RAGFlow supports a wide range of LLM providers for different use cases and deployment scenarios:
- OpenAI - GPT-4, GPT-3.5-turbo
- Anthropic - Claude 3, Claude 2
- Google - Gemini 1.5, Gemini 1.0
- Azure OpenAI - Enterprise-grade deployments
- AWS Bedrock - Amazon's LLM service
- Ollama - Local model inference
- LM Studio - Local model management
- Hugging Face - Direct model integration
- vLLM - High-throughput inference
- LocalAI - Unified local AI API
- Together AI - Optimized inference
- Replicate - Model marketplace
- Fireworks AI - Fast inference
- DeepInfra - Cost-effective models
To register a new provider in the web UI:
1. Log into the RAGFlow web interface
2. Navigate to System Settings > Model Providers
3. Click Add Provider to configure a new LLM
You can also supply credentials through environment variables:

```shell
# Set environment variables for different providers
export OPENAI_API_KEY="your-openai-key"
export ANTHROPIC_API_KEY="your-anthropic-key"
export GOOGLE_API_KEY="your-google-key"
```

A typical OpenAI configuration:

```json
{
  "provider": "OpenAI",
  "model": "gpt-4o",
  "api_key": "sk-...",
  "temperature": 0.1,
  "max_tokens": 2000,
  "top_p": 0.9
}
```
An Anthropic (Claude) configuration:

```json
{
  "provider": "Anthropic",
  "model": "claude-3-5-sonnet-20241022",
  "api_key": "sk-ant-...",
  "temperature": 0.1,
  "max_tokens": 4000,
  "system_prompt": "You are a helpful assistant that answers questions based on provided context."
}
```

For local inference, install and start Ollama:

```shell
# Install Ollama
curl -fsSL https://ollama.ai/install.sh | sh

# Pull a model
ollama pull llama3.1:8b

# Start Ollama service
ollama serve
```

Then configure in RAGFlow:
```json
{
  "provider": "Ollama",
  "model": "llama3.1:8b",
  "base_url": "http://localhost:11434",
  "temperature": 0.1,
  "num_ctx": 4096
}
```
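As an illustration, Ollama settings like these can be exercised directly against Ollama's local REST API (`POST /api/generate`). The sketch below is a minimal, hypothetical helper, not part of RAGFlow; it assumes Ollama is running on its default `localhost:11434` endpoint.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # default Ollama endpoint

def build_generate_request(model: str, prompt: str,
                           temperature: float = 0.1, num_ctx: int = 4096) -> dict:
    """Build a request body mirroring the RAGFlow Ollama settings above."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,  # return one complete response instead of a token stream
        "options": {"temperature": temperature, "num_ctx": num_ctx},
    }

def generate(prompt: str, model: str = "llama3.1:8b") -> str:
    """Send a single generation request to a locally running Ollama server."""
    body = json.dumps(build_generate_request(model, prompt)).encode()
    req = urllib.request.Request(OLLAMA_URL, data=body,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:  # requires `ollama serve` to be running
        return json.loads(resp.read())["response"]
```

Calling `generate(...)` requires a live `ollama serve` process; the payload builder itself is pure and can be inspected offline.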
Key generation parameters:

```jsonc
{
  "temperature": 0.1,        // Lower = more deterministic
  "top_p": 0.9,              // Nucleus sampling
  "top_k": 40,               // Top-k sampling
  "repetition_penalty": 1.1, // Reduce repetition
  "max_tokens": 2000         // Response length limit
}
```
Context window management options:

```jsonc
{
  "max_context_length": 8192,    // Maximum context tokens
  "overlap_size": 200,           // Chunk overlap for retrieval
  "compression_ratio": 0.7,      // Context compression
  "hierarchical_retrieval": true // Multi-level retrieval
}
```
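To make the `overlap_size` setting concrete: consecutive retrieval chunks share a margin of text so that facts near a boundary are never split across chunks. A minimal sketch of overlapped chunking (character-based here for simplicity; RAGFlow's real chunker is token-aware and template-driven):

```python
def chunk_with_overlap(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into fixed-size chunks where consecutive chunks share `overlap` characters."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap  # advance less than a full chunk to create the overlap
    return [text[i:i + chunk_size] for i in range(0, max(len(text) - overlap, 1), step)]
```

For example, with `chunk_size=10` and `overlap=2`, the last two characters of each chunk reappear at the start of the next one.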
You can register multiple models with priorities and fallbacks:

```json
{
  "models": [
    {
      "name": "gpt-4o",
      "provider": "OpenAI",
      "priority": 1,
      "fallback": false
    },
    {
      "name": "claude-3-5-sonnet",
      "provider": "Anthropic",
      "priority": 2,
      "fallback": true
    },
    {
      "name": "llama3.1:8b",
      "provider": "Ollama",
      "priority": 3,
      "fallback": true
    }
  ]
}
```
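A priority/fallback list like this implies routing logic roughly like the following sketch. The names here (`call_with_fallback`, the in-memory `MODELS` list) are illustrative, not RAGFlow internals:

```python
from typing import Callable

MODELS = [  # mirrors a priority/fallback list like the one above
    {"name": "gpt-4o", "priority": 1, "fallback": False},
    {"name": "claude-3-5-sonnet", "priority": 2, "fallback": True},
    {"name": "llama3.1:8b", "priority": 3, "fallback": True},
]

def call_with_fallback(prompt: str, invoke: Callable[[str, str], str]) -> str:
    """Try models in priority order; after the primary, only models marked as fallbacks are tried."""
    ordered = sorted(MODELS, key=lambda m: m["priority"])
    last_error = None
    for i, model in enumerate(ordered):
        if i > 0 and not model["fallback"]:
            continue  # non-fallback models are only used as the primary choice
        try:
            return invoke(model["name"], prompt)
        except Exception as exc:  # in production, catch provider-specific error types
            last_error = exc
    raise RuntimeError("all models failed") from last_error
```

If the primary model raises (timeout, rate limit), the call transparently moves down the priority list.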
Load balancing across model instances:

```json
{
  "load_balancing": {
    "enabled": true,
    "strategy": "round_robin",
    "health_check_interval": 30,
    "timeout": 10
  }
}
```
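A `round_robin` strategy like this can be sketched as a rotating selector that skips endpoints currently marked unhealthy. This is a simplified, illustrative model; a real balancer (including RAGFlow's) would also run the periodic health checks that the `health_check_interval` setting implies.

```python
import itertools

class RoundRobinBalancer:
    """Rotate across endpoints, skipping any currently marked unhealthy."""

    def __init__(self, endpoints: list):
        self._cycle = itertools.cycle(endpoints)
        self._unhealthy = set()
        self._size = len(endpoints)

    def mark_unhealthy(self, endpoint: str) -> None:
        self._unhealthy.add(endpoint)

    def mark_healthy(self, endpoint: str) -> None:
        self._unhealthy.discard(endpoint)

    def next(self) -> str:
        for _ in range(self._size):  # scan at most one full rotation
            candidate = next(self._cycle)
            if candidate not in self._unhealthy:
                return candidate
        raise RuntimeError("no healthy endpoints available")
```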
Response caching:

```jsonc
{
  "caching": {
    "enabled": true,
    "ttl": 3600,              // Cache TTL in seconds
    "max_cache_size": "1GB",  // Maximum cache size
    "compression": true       // Enable response compression
  }
}
```
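A caching block like this amounts to a TTL-keyed response cache. Below is a minimal in-process sketch of the idea (illustrative only; a production deployment would typically back this with Redis, as in the deployment section later in this chapter):

```python
import time
import hashlib
from typing import Optional

class TTLCache:
    """Cache LLM responses keyed by (model, prompt) with per-entry expiry."""

    def __init__(self, ttl_seconds: int = 3600):
        self.ttl = ttl_seconds
        self._store = {}  # key -> (expires_at, response)

    @staticmethod
    def _key(model: str, prompt: str) -> str:
        return hashlib.sha256(f"{model}\x00{prompt}".encode()).hexdigest()

    def get(self, model: str, prompt: str) -> Optional[str]:
        entry = self._store.get(self._key(model, prompt))
        if entry is None:
            return None
        expires_at, value = entry
        if time.monotonic() > expires_at:
            del self._store[self._key(model, prompt)]  # lazy eviction on read
            return None
        return value

    def put(self, model: str, prompt: str, response: str) -> None:
        self._store[self._key(model, prompt)] = (time.monotonic() + self.ttl, response)
```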
Batch processing:

```json
{
  "batch_processing": {
    "enabled": true,
    "max_batch_size": 10,
    "timeout": 30,
    "concurrency_limit": 5
  }
}
```
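Batching with a concurrency cap, as configured above, can be sketched with a thread pool. The helper name and structure below are hypothetical; RAGFlow manages its own batching internally:

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

def run_batched(prompts: list, invoke: Callable[[str], str],
                max_batch_size: int = 10, concurrency_limit: int = 5) -> list:
    """Process prompts in batches, with at most `concurrency_limit` calls in flight at once."""
    results = []
    with ThreadPoolExecutor(max_workers=concurrency_limit) as pool:
        for start in range(0, len(prompts), max_batch_size):
            batch = prompts[start:start + max_batch_size]
            results.extend(pool.map(invoke, batch))  # map preserves input order
    return results
```

Because `ThreadPoolExecutor.map` preserves input order, callers can zip results back to their prompts safely.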
Task-specific settings let you tune models per workload. Document Q&A:

```json
{
  "task": "document_qa",
  "model": "gpt-4o",
  "temperature": 0.1,
  "max_tokens": 1000,
  "system_prompt": "Answer questions based solely on the provided document context."
}
```
Creative writing:

```json
{
  "task": "creative_writing",
  "model": "claude-3-5-sonnet",
  "temperature": 0.8,
  "max_tokens": 2000,
  "system_prompt": "Generate creative content while staying relevant to the document context."
}
```
Code generation:

```json
{
  "task": "code_generation",
  "model": "gpt-4o",
  "temperature": 0.2,
  "max_tokens": 1500,
  "system_prompt": "Generate code based on the documentation and requirements provided."
}
```
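Task profiles like these reduce to a lookup that merges per-task overrides over global defaults. A sketch (the `params_for` helper and the defaults are illustrative, not RAGFlow API):

```python
DEFAULTS = {"model": "gpt-4o", "temperature": 0.1, "max_tokens": 1000}

TASK_PROFILES = {  # mirrors the task configurations above
    "document_qa": {"temperature": 0.1, "max_tokens": 1000},
    "creative_writing": {"model": "claude-3-5-sonnet", "temperature": 0.8, "max_tokens": 2000},
    "code_generation": {"temperature": 0.2, "max_tokens": 1500},
}

def params_for(task: str) -> dict:
    """Merge the task-specific profile over global defaults; unknown tasks get the defaults."""
    return {**DEFAULTS, **TASK_PROFILES.get(task, {})}
```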
Enable monitoring:

```json
{
  "monitoring": {
    "response_time_tracking": true,
    "token_usage_monitoring": true,
    "quality_scoring": true,
    "error_rate_tracking": true
  }
}
```
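Response-time and token-usage tracking of this kind often starts as a thin wrapper around each LLM call. A sketch, where the metrics sink is just a list (a real deployment would export to a metrics backend, and token counts would come from the provider's usage fields rather than a whitespace estimate):

```python
import time
from typing import Callable

METRICS = []  # stand-in for a real metrics backend

def tracked(invoke: Callable[[str], str]) -> Callable[[str], str]:
    """Wrap an LLM call to record latency, rough token counts, and errors."""
    def wrapper(prompt: str) -> str:
        start = time.perf_counter()
        try:
            response = invoke(prompt)
            METRICS.append({
                "latency_s": time.perf_counter() - start,
                "prompt_tokens": len(prompt.split()),       # crude whitespace estimate
                "completion_tokens": len(response.split()),
                "error": False,
            })
            return response
        except Exception:
            METRICS.append({"latency_s": time.perf_counter() - start, "error": True})
            raise
    return wrapper
```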
Dashboard options:

```json
{
  "dashboard": {
    "real_time_metrics": true,
    "historical_trends": true,
    "model_comparison": true,
    "cost_analysis": true
  }
}
```
Monitor provider rate limits via the API; RAGFlow's built-in retry logic applies exponential backoff automatically:

```shell
# Monitor rate limits
curl -X GET "http://localhost:80/api/rate-limits" \
  -H "Authorization: Bearer YOUR_TOKEN"
```
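For intuition, exponential backoff with jitter looks roughly like this when implemented by hand (a generic sketch of the technique; delay values are illustrative, and RAGFlow applies its own retry policy internally):

```python
import time
import random
from typing import Callable

def with_backoff(invoke: Callable[[], str], max_retries: int = 5,
                 base_delay: float = 0.5, max_delay: float = 30.0) -> str:
    """Retry a call with exponentially growing, jittered delays between attempts."""
    for attempt in range(max_retries):
        try:
            return invoke()
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of retries: surface the final error
            delay = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(delay * random.uniform(0.5, 1.0))  # jitter avoids thundering herds
    raise RuntimeError("unreachable")
```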
Check model/task compatibility:

```shell
curl -X POST "http://localhost:80/api/models/check-compatibility" \
  -H "Content-Type: application/json" \
  -d '{"model": "gpt-4o", "task": "document_qa"}'
```
Configure error handling for context overflow:

```json
{
  "error_handling": {
    "context_overflow_strategy": "truncate",
    "chunk_reduction_ratio": 0.8,
    "fallback_model": "gpt-3.5-turbo"
  }
}
```
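A `truncate` overflow strategy can be approximated as: keep the highest-ranked retrieved chunks until the estimated token count fills the model's context budget, and drop the rest. The sketch below uses a crude word-count estimate; real implementations count tokens with the model's tokenizer:

```python
def truncate_context(chunks: list, max_tokens: int) -> list:
    """Keep highest-ranked chunks (list order = rank) until the token budget is exhausted."""
    kept = []
    used = 0
    for chunk in chunks:
        cost = len(chunk.split())  # crude token estimate; use a real tokenizer in production
        if used + cost > max_tokens:
            break
        kept.append(chunk)
        used += cost
    return kept
```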
Store API keys in environment variables and rotate them regularly:

```shell
# Use environment variables
export RAGFLOW_ENCRYPTION_KEY="your-encryption-key"

# Rotate keys regularly
curl -X POST "http://localhost:80/api/keys/rotate" \
  -H "Authorization: Bearer ADMIN_TOKEN"
```
Role-based access control over models:

```json
{
  "access_control": {
    "user_roles": ["admin", "editor", "viewer"],
    "model_permissions": {
      "gpt-4": ["admin", "editor"],
      "claude-3": ["admin", "editor", "viewer"],
      "local-models": ["admin", "editor", "viewer"]
    }
  }
}
```
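Role-based model permissions of this shape reduce to a deny-by-default membership check. A sketch mirroring the JSON mapping (helper name is illustrative):

```python
MODEL_PERMISSIONS = {  # mirrors the access-control mapping above
    "gpt-4": {"admin", "editor"},
    "claude-3": {"admin", "editor", "viewer"},
    "local-models": {"admin", "editor", "viewer"},
}

def can_use_model(role: str, model: str) -> bool:
    """Deny by default: unknown models or roles are not allowed."""
    return role in MODEL_PERMISSIONS.get(model, set())
```

Deny-by-default matters here: a newly added model is invisible to everyone until it is explicitly granted.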
A production Docker Compose layout:

```yaml
# docker-compose.prod.yml
version: '3.8'
services:
  ragflow:
    image: infiniflow/ragflow:latest
    environment:
      - LLM_PROVIDER_BACKUP=true
      - LOAD_BALANCER_ENABLED=true
      - CACHE_LAYER=redis
    depends_on:
      - redis
      - postgres
  redis:
    image: redis:7-alpine
  postgres:
    image: postgres:15
```
Auto-scaling policy:

```json
{
  "scaling": {
    "auto_scaling": true,
    "min_instances": 2,
    "max_instances": 10,
    "cpu_threshold": 70,
    "memory_threshold": 80
  }
}
```

| Use Case | Recommended Models | Rationale |
|---|---|---|
| Document Q&A | GPT-4, Claude 3 | High accuracy, good context understanding |
| Creative Tasks | Claude 3, GPT-4 | Better at generating natural, creative responses |
| Code Generation | GPT-4, Claude 3 | Strong code understanding and generation |
| Cost-Effective | GPT-3.5, Claude Instant | Good balance of cost and performance |
| Local/Offline | Llama 3, Mistral | Privacy-focused, no API costs |
- Use Appropriate Model Sizes: Larger models for complex tasks, smaller for simple queries
- Implement Caching: Cache frequent queries and responses
- Monitor Usage: Track token consumption and costs
- Load Balancing: Distribute requests across multiple model instances
- Fallback Strategies: Have backup models for reliability
Now that you have configured LLMs for RAGFlow, you're ready to:
- Chapter 6: Chatbot Development - Build conversational interfaces
- Chapter 7: Advanced Features - Explore advanced RAGFlow capabilities
- Chapter 8: Production Deployment - Deploy at scale
Ready to build intelligent chatbots? Continue to Chapter 6: Chatbot Development! 🚀
Most teams struggle here not because the code is hard to write, but because it is hard to draw clear boundaries around model choice, temperature, and provider selection so that behavior stays predictable as complexity grows.
In practical terms, this chapter helps you avoid three common failures:
- coupling core logic too tightly to one provider or implementation path
- missing the handoff boundaries between setup, execution, and validation
- shipping changes without a clear rollback or observability strategy
After working through this chapter, you should be able to reason about LLM integration and configuration as an operating subsystem inside RAGFlow, with explicit contracts for inputs, state transitions, and outputs.
Use the implementation notes around `max_tokens`, system prompts, and model selection as a checklist when adapting these patterns to your own repository.
Under the hood, LLM integration usually follows a repeatable control path:
- Context bootstrap: initialize runtime config and prerequisites for the selected `model`.
- Input normalization: shape incoming data so parameters such as `temperature` receive stable contracts.
- Core execution: run the main logic branch and propagate intermediate state through the configured `provider`.
- Policy and safety checks: enforce limits, auth scopes, and failure boundaries.
- Output composition: return canonical result payloads for downstream consumers.
- Operational telemetry: emit the logs and metrics needed for debugging and performance tuning.
When debugging, walk this sequence in order and confirm each stage has explicit success/failure conditions.
Use the following upstream sources to verify implementation details while reading this chapter:
- GitHub Repository — the authoritative upstream reference for the RAGFlow codebase (github.com).
- AI Codebase Knowledge Builder — the authoritative upstream reference for this tutorial's tooling (github.com).
Suggested trace strategy:
- search the upstream code for `model` and `temperature` to map concrete implementation paths
- compare the docs' claims against actual runtime/config code before reusing patterns in production