| layout | default |
|---|---|
| title | Chapter 6: Building RAG Systems |
| parent | Firecrawl Tutorial |
| nav_order | 6 |
Welcome to Chapter 6: Building RAG Systems. In this part of Firecrawl Tutorial: Building LLM-Ready Web Scraping and Data Extraction Systems, you will build an intuitive mental model first, then move into concrete implementation details and practical production tradeoffs.
Retrieval-Augmented Generation (RAG) is the most impactful application of web scraping for AI. Instead of relying solely on an LLM's training data, RAG retrieves relevant content from your scraped corpus and injects it into the prompt -- giving the model up-to-date, domain-specific knowledge without fine-tuning.
In this chapter you will build a complete RAG pipeline: scrape websites with Firecrawl, chunk content intelligently, generate embeddings, store them in a vector database, and build a question-answering system that grounds its answers in your scraped data.
| Skill | Description |
|---|---|
| Chunking strategies | Split documents into optimal segments for retrieval |
| Embedding generation | Convert text chunks into vector representations |
| Vector storage | Store and query embeddings in ChromaDB and Pinecone |
| Retrieval pipeline | Find the most relevant chunks for a given query |
| RAG prompting | Construct prompts that use retrieved context effectively |
| Evaluation | Measure retrieval quality, latency, and cost |
flowchart TD
subgraph Ingestion["Ingestion Pipeline"]
A[Firecrawl Scrape] --> B[Content Cleaning]
B --> C[Chunking]
C --> D[Embedding Generation]
D --> E[Vector Database]
end
subgraph Query["Query Pipeline"]
F[User Question] --> G[Embed Query]
G --> H[Vector Search]
H --> I[Retrieve Top-K Chunks]
I --> J[Build Prompt]
J --> K[LLM Generation]
K --> L[Answer with Citations]
end
E --> H
classDef ingest fill:#e1f5fe,stroke:#01579b
classDef query fill:#e8f5e8,stroke:#1b5e20
classDef store fill:#f3e5f5,stroke:#4a148c
class A,B,C,D ingest
class E store
class F,G,H,I,J,K,L query
Start by scraping a documentation site or knowledge base with Firecrawl, then clean it using the pipeline from Chapter 5.
from firecrawl import FirecrawlApp
app = FirecrawlApp(api_key="YOUR_KEY")
# Crawl a documentation site
crawl_result = app.crawl_url(
"https://docs.example.com",
params={
"limit": 50,
"maxDepth": 3,
"includePaths": ["/docs/*", "/guides/*"],
"excludePaths": ["/docs/api-ref/*"], # Skip auto-generated API refs
},
poll_interval=5,
)
# Collect cleaned documents
documents = []
for page in crawl_result:
if page.get("markdown") and len(page["markdown"]) > 200:
documents.append({
"content": page["markdown"],
"url": page["metadata"]["sourceURL"],
"title": page["metadata"].get("title", "Untitled"),
})
print(f"Collected {len(documents)} documents for RAG pipeline")import FirecrawlApp from "@mendable/firecrawl-js";
const app = new FirecrawlApp({ apiKey: process.env.FIRECRAWL_API_KEY });
const crawlResult = await app.crawlUrl("https://docs.example.com", {
limit: 50,
maxDepth: 3,
includePaths: ["/docs/*", "/guides/*"],
});
const documents = crawlResult
.filter((page) => page.markdown && page.markdown.length > 200)
.map((page) => ({
content: page.markdown!,
url: page.metadata?.sourceURL || "",
title: page.metadata?.title || "Untitled",
}));
console.log(`Collected ${documents.length} documents for RAG pipeline`);Chunking determines how well your RAG system retrieves relevant information. Chunks that are too large dilute relevance; chunks that are too small lose context.
| Strategy | Chunk Size | Overlap | Best For | Weakness |
|---|---|---|---|---|
| Fixed-size | 500-1000 chars | 100-200 chars | Simple implementation | May split mid-sentence |
| Recursive character | 500-1000 chars | 100-200 chars | General use | Slightly better boundaries |
| Markdown-aware | Varies by section | Section headers preserved | Documentation | Uneven chunk sizes |
| Semantic | Varies | Based on topic shifts | Diverse content | Requires additional model |
from langchain.text_splitter import RecursiveCharacterTextSplitter
def chunk_document(content: str, chunk_size: int = 800, chunk_overlap: int = 150):
"""Split a document into overlapping chunks."""
splitter = RecursiveCharacterTextSplitter(
chunk_size=chunk_size,
chunk_overlap=chunk_overlap,
separators=["\n\n", "\n", ". ", " "],
length_function=len,
)
return splitter.split_text(content)
# Chunk all documents
all_chunks = []
for doc in documents:
chunks = chunk_document(doc["content"])
for i, chunk in enumerate(chunks):
all_chunks.append({
"text": chunk,
"metadata": {
"source_url": doc["url"],
"title": doc["title"],
"chunk_index": i,
"total_chunks": len(chunks),
}
})
print(f"Created {len(all_chunks)} chunks from {len(documents)} documents")
print(f"Average chunk size: {sum(len(c['text']) for c in all_chunks) // len(all_chunks)} chars")For documentation sites, split on markdown headers to preserve section boundaries.
from langchain.text_splitter import MarkdownHeaderTextSplitter
def chunk_markdown(content: str):
"""Split markdown by headers, preserving section structure."""
headers_to_split = [
("#", "h1"),
("##", "h2"),
("###", "h3"),
]
splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split)
splits = splitter.split_text(content)
# Further split large sections
char_splitter = RecursiveCharacterTextSplitter(
chunk_size=800,
chunk_overlap=150,
)
final_chunks = []
for split in splits:
if len(split.page_content) > 1000:
sub_chunks = char_splitter.split_text(split.page_content)
for sub in sub_chunks:
final_chunks.append({
"text": sub,
"headers": split.metadata,
})
else:
final_chunks.append({
"text": split.page_content,
"headers": split.metadata,
})
return final_chunksflowchart TD
A[Content Type] --> B{Document Length}
B -- "Short (< 2K chars)" --> C["Chunk Size: 400-600"]
B -- "Medium (2K-10K)" --> D["Chunk Size: 600-800"]
B -- "Long (> 10K)" --> E["Chunk Size: 800-1200"]
C --> F["Overlap: 50-100"]
D --> G["Overlap: 100-200"]
E --> H["Overlap: 150-250"]
F --> I[More precise retrieval]
G --> J[Balanced retrieval]
H --> K[More context per chunk]
classDef small fill:#e8f5e8,stroke:#1b5e20
classDef med fill:#fff3e0,stroke:#e65100
classDef large fill:#e1f5fe,stroke:#01579b
class C,F,I small
class D,G,J med
class E,H,K large
Convert text chunks into vector representations using an embedding model. The embeddings capture semantic meaning, allowing you to find relevant chunks even when the query uses different words than the source text.
from sentence_transformers import SentenceTransformer
# Load a lightweight embedding model
model = SentenceTransformer("all-MiniLM-L6-v2")
# Generate embeddings for all chunks
texts = [chunk["text"] for chunk in all_chunks]
embeddings = model.encode(texts, show_progress_bar=True, batch_size=32)
print(f"Generated {len(embeddings)} embeddings")
print(f"Embedding dimension: {embeddings.shape[1]}")from openai import OpenAI
openai_client = OpenAI(api_key="YOUR_OPENAI_KEY")
def generate_openai_embeddings(texts: list, model: str = "text-embedding-3-small"):
"""Generate embeddings using OpenAI's API."""
# Process in batches of 100
all_embeddings = []
for i in range(0, len(texts), 100):
batch = texts[i:i+100]
response = openai_client.embeddings.create(input=batch, model=model)
batch_embeddings = [item.embedding for item in response.data]
all_embeddings.extend(batch_embeddings)
return all_embeddings
embeddings = generate_openai_embeddings(texts)| Model | Dimensions | Speed | Quality | Cost |
|---|---|---|---|---|
all-MiniLM-L6-v2 |
384 | Fast | Good | Free (local) |
all-mpnet-base-v2 |
768 | Medium | Better | Free (local) |
text-embedding-3-small |
1536 | Fast (API) | Very good | $0.02/1M tokens |
text-embedding-3-large |
3072 | Fast (API) | Best | $0.13/1M tokens |
import chromadb
# Create a persistent ChromaDB client
chroma_client = chromadb.PersistentClient(path="./chroma_db")
# Create or get a collection
collection = chroma_client.get_or_create_collection(
name="firecrawl-docs",
metadata={"hnsw:space": "cosine"} # Use cosine similarity
)
# Add chunks with embeddings and metadata
collection.add(
ids=[f"chunk-{i}" for i in range(len(all_chunks))],
documents=[chunk["text"] for chunk in all_chunks],
embeddings=embeddings.tolist(),
metadatas=[chunk["metadata"] for chunk in all_chunks],
)
print(f"Stored {collection.count()} chunks in ChromaDB")from pinecone import Pinecone, ServerlessSpec
# Initialize Pinecone
pc = Pinecone(api_key="YOUR_PINECONE_KEY")
# Create an index
index_name = "firecrawl-docs"
if index_name not in pc.list_indexes().names():
pc.create_index(
name=index_name,
dimension=384, # Match your embedding model
metric="cosine",
spec=ServerlessSpec(cloud="aws", region="us-east-1"),
)
index = pc.Index(index_name)
# Upsert vectors in batches
batch_size = 100
for i in range(0, len(all_chunks), batch_size):
batch = []
for j in range(i, min(i + batch_size, len(all_chunks))):
batch.append({
"id": f"chunk-{j}",
"values": embeddings[j].tolist(),
"metadata": {
**all_chunks[j]["metadata"],
"text": all_chunks[j]["text"][:1000], # Store text preview
}
})
index.upsert(vectors=batch)
print(f"Upserted {len(all_chunks)} vectors to Pinecone")def retrieve_context(query: str, n_results: int = 5) -> list:
"""Retrieve the most relevant chunks for a query."""
# Embed the query
query_embedding = model.encode([query]).tolist()
# Search the vector database
results = collection.query(
query_embeddings=query_embedding,
n_results=n_results,
include=["documents", "metadatas", "distances"],
)
# Format results with metadata
contexts = []
for i in range(len(results["documents"][0])):
contexts.append({
"text": results["documents"][0][i],
"source": results["metadatas"][0][i].get("source_url", ""),
"title": results["metadatas"][0][i].get("title", ""),
"relevance": 1 - results["distances"][0][i], # Convert distance to similarity
})
return contexts
# Test retrieval
contexts = retrieve_context("How does Firecrawl handle JavaScript rendering?")
for ctx in contexts:
print(f"[{ctx['relevance']:.2f}] {ctx['title']}")
print(f" {ctx['text'][:150]}...")
print(f" Source: {ctx['source']}")
print()from openai import OpenAI
openai_client = OpenAI(api_key="YOUR_OPENAI_KEY")
def build_rag_prompt(query: str, contexts: list) -> str:
"""Build a RAG prompt with retrieved context."""
context_text = ""
for i, ctx in enumerate(contexts, 1):
context_text += f"\n--- Source {i}: {ctx['title']} ---\n"
context_text += f"URL: {ctx['source']}\n"
context_text += ctx["text"] + "\n"
return f"""Answer the following question based on the provided context.
If the context does not contain enough information to answer fully,
say so and provide what you can.
Always cite the source URL when referencing specific information.
Context:
{context_text}
Question: {query}
Answer:"""
def ask_rag(query: str, n_results: int = 5) -> dict:
"""Complete RAG pipeline: retrieve context and generate answer."""
# Step 1: Retrieve relevant chunks
contexts = retrieve_context(query, n_results=n_results)
# Step 2: Build prompt
prompt = build_rag_prompt(query, contexts)
# Step 3: Generate answer
response = openai_client.chat.completions.create(
model="gpt-4o-mini",
messages=[
{"role": "system", "content": "You are a helpful technical assistant. "
"Answer questions using the provided context. "
"Always cite sources."},
{"role": "user", "content": prompt},
],
temperature=0.2,
max_tokens=1000,
)
answer = response.choices[0].message.content
tokens_used = response.usage.total_tokens
return {
"query": query,
"answer": answer,
"sources": [ctx["source"] for ctx in contexts],
"tokens_used": tokens_used,
}
# Ask a question
result = ask_rag("How do I scrape a JavaScript-rendered website with Firecrawl?")
print("Answer:", result["answer"])
print(f"\nSources: {', '.join(result['sources'])}")
print(f"Tokens used: {result['tokens_used']}")import { OpenAI } from "openai";
const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });
async function askRAG(query: string, nResults = 5): Promise<{
answer: string;
sources: string[];
}> {
// Retrieve relevant chunks (using your vector DB client)
const contexts = await retrieveContext(query, nResults);
// Build context string
const contextText = contexts
.map((ctx, i) => `--- Source ${i + 1}: ${ctx.title} ---\nURL: ${ctx.source}\n${ctx.text}`)
.join("\n\n");
// Generate answer
const response = await openai.chat.completions.create({
model: "gpt-4o-mini",
messages: [
{ role: "system", content: "Answer using the provided context. Cite sources." },
{ role: "user", content: `Context:\n${contextText}\n\nQuestion: ${query}` },
],
temperature: 0.2,
});
return {
answer: response.choices[0].message.content || "",
sources: contexts.map((ctx) => ctx.source),
};
}Web content changes frequently. Build an update pipeline that re-scrapes and re-indexes content on a schedule.
import hashlib
from datetime import datetime
class RAGUpdater:
"""Keep the RAG knowledge base up to date."""
def __init__(self, app, collection, embed_model):
self.app = app
self.collection = collection
self.embed_model = embed_model
self.content_hashes = {} # URL -> content hash
def update(self, urls: list):
"""Re-scrape URLs and update only changed content."""
updated = 0
unchanged = 0
for url in urls:
try:
result = self.app.scrape_url(url, params={
"formats": ["markdown"],
"onlyMainContent": True,
})
content = result.get("markdown", "")
content_hash = hashlib.sha256(content.encode()).hexdigest()
# Skip if content has not changed
if self.content_hashes.get(url) == content_hash:
unchanged += 1
continue
# Re-chunk and re-embed
chunks = chunk_document(content)
embeddings = self.embed_model.encode(chunks)
# Update vector database
ids = [f"{url}-{i}" for i in range(len(chunks))]
self.collection.upsert(
ids=ids,
documents=chunks,
embeddings=embeddings.tolist(),
metadatas=[{"source_url": url, "updated_at": datetime.utcnow().isoformat()}] * len(chunks),
)
self.content_hashes[url] = content_hash
updated += 1
except Exception as exc:
print(f"Error updating {url}: {exc}")
print(f"Updated: {updated}, Unchanged: {unchanged}")flowchart LR
A[RAG Pipeline] --> B[Retrieval Metrics]
A --> C[Generation Metrics]
A --> D[System Metrics]
B --> B1[Precision@K]
B --> B2[Recall@K]
B --> B3[MRR]
C --> C1[Faithfulness]
C --> C2[Relevance]
C --> C3[Completeness]
D --> D1[Latency]
D --> D2[Tokens/Query]
D --> D3[Cost/Query]
classDef metric fill:#e8f5e8,stroke:#1b5e20
class B1,B2,B3,C1,C2,C3,D1,D2,D3 metric
| Metric | What It Measures | Target |
|---|---|---|
| Retrieval precision@5 | % of retrieved chunks that are relevant | > 60% |
| Answer faithfulness | Does the answer stay grounded in context? | > 90% |
| End-to-end latency | Total time from query to answer | < 3 seconds |
| Tokens per query | Total tokens used (embedding + generation) | < 2000 |
| Cost per query | Dollar cost per question answered | < $0.01 |
def evaluate_retrieval(query: str, expected_urls: list, n_results: int = 5) -> dict:
"""Evaluate retrieval quality for a test query."""
contexts = retrieve_context(query, n_results=n_results)
retrieved_urls = [ctx["source"] for ctx in contexts]
# Precision: what fraction of retrieved results are relevant?
relevant_retrieved = len(set(retrieved_urls) & set(expected_urls))
precision = relevant_retrieved / n_results if n_results > 0 else 0
# Recall: what fraction of relevant results were retrieved?
recall = relevant_retrieved / len(expected_urls) if expected_urls else 0
return {
"query": query,
"precision": round(precision, 2),
"recall": round(recall, 2),
"retrieved": retrieved_urls,
"expected": expected_urls,
}| Problem | Possible Cause | Solution |
|---|---|---|
| Low retrieval relevance | Chunks too large or too small | Experiment with chunk sizes (400-1200 chars) |
| Hallucinated answers | Insufficient context | Increase n_results or improve chunking |
| Slow queries | Large embedding dimension | Use a smaller model or add an HNSW index |
| Outdated answers | Stale content in vector DB | Set up periodic re-scraping with RAGUpdater |
| High cost per query | Too many tokens | Reduce n_results or use a cheaper LLM |
| Duplicate results | Same content in multiple chunks | Improve deduplication (see Chapter 5) |
- Pre-compute embeddings offline in batches rather than one-by-one at query time.
- Use smaller embedding models (384 dimensions) unless you need maximum quality.
- Cache frequent queries -- if the same question is asked often, cache the answer.
- Persist vector stores to disk (ChromaDB) or use a managed service (Pinecone, Weaviate) to avoid re-indexing.
- Batch embedding generation -- encode 32-64 texts at a time for GPU efficiency.
- Filter sensitive content before embedding -- do not store PII in your vector database.
- Respect copyright for stored content -- ensure your use complies with the source site's terms.
- Secure API keys -- never expose OpenAI, Pinecone, or Firecrawl keys in client-side code.
- Audit retrieved context -- log what context was used for each answer to enable traceability.
You have built a complete RAG pipeline: scraping content with Firecrawl, chunking it intelligently, generating embeddings, storing them in a vector database, and generating grounded answers. This architecture gives your LLM access to up-to-date, domain-specific knowledge without the cost and complexity of fine-tuning.
- Chunking is the most impactful parameter in a RAG system -- experiment with sizes between 400-1200 characters with 10-20% overlap.
- Recursive character splitting works well for most content; use markdown-aware splitting for documentation sites.
- ChromaDB is ideal for local development; Pinecone or Weaviate for production deployments.
- Always cite sources in RAG answers to enable verification and build trust.
- Keep your knowledge base fresh with periodic re-scraping and content-hash-based update detection.
Your RAG pipeline works. Now you need to scale it. In Chapter 7: Scaling & Performance, you will learn how to distribute scraping jobs across workers, optimize concurrency, add caching layers, and handle thousands of pages without hitting rate limits.
Built with insights from the Firecrawl project.
Most teams struggle here because the hard part is not writing more code, but deciding clear boundaries for query, content, documents so behavior stays predictable as complexity grows.
In practical terms, this chapter helps you avoid three common failures:
- coupling core logic too tightly to one implementation path
- missing the handoff boundaries between setup, execution, and validation
- shipping changes without clear rollback or observability strategy
After working through this chapter, you should be able to reason about Chapter 6: Building RAG Systems as an operating subsystem inside Firecrawl Tutorial: Building LLM-Ready Web Scraping and Data Extraction Systems, with explicit contracts for inputs, state transitions, and outputs.
Use the implementation notes around print, chunks, embeddings as your checklist when adapting these patterns to your own repository.
Under the hood, Chapter 6: Building RAG Systems usually follows a repeatable control path:
- Context bootstrap: initialize runtime config and prerequisites for
query. - Input normalization: shape incoming data so
contentreceives stable contracts. - Core execution: run the main logic branch and propagate intermediate state through
documents. - Policy and safety checks: enforce limits, auth scopes, and failure boundaries.
- Output composition: return canonical result payloads for downstream consumers.
- Operational telemetry: emit logs/metrics needed for debugging and performance tuning.
When debugging, walk this sequence in order and confirm each stage has explicit success/failure conditions.
Use the following upstream sources to verify implementation details while reading this chapter:
- View Repo
Why it matters: authoritative reference on
View Repo(github.com).
Suggested trace strategy:
- search upstream code for
queryandcontentto map concrete implementation paths - compare docs claims against actual runtime/config code before reusing patterns in production