
Commit d8876de

feat: add rag example (#12)
1 parent 934d87c commit d8876de

13 files changed

Lines changed: 2614 additions & 0 deletions

README.md

Lines changed: 1 addition & 0 deletions
@@ -13,6 +13,7 @@ Here, we try to highlight amazing work by our contributors and partners.
| [DSPy.rb Langfuse Integration](https://github.com/vicentereig/dspy.rb) | A Ruby framework for LLM programming with built-in Langfuse tracing via OpenTelemetry. | [@vicentereig](https://github.com/vicentereig) |
| [Tracing Pipecat Applications](./applications/langchat) | A Pipecat application sending traces to Langfuse. | [@aabedraba](https://github.com/aabedraba) |
| [Tracing MCP Servers](./applications/mcp-tracing) | An example on using the OpenAI agents SDK together with an MCP server. | [@aabedraba](https://github.com/aabedraba) |
| [RAG Observability and Evals](./applications/rag) | A RAG application that uses Langfuse for tracing and evals. | [@aabedraba](https://github.com/aabedraba) |

## Deployment Examples

applications/rag/.env.example

Lines changed: 4 additions & 0 deletions
@@ -0,0 +1,4 @@
OPENAI_API_KEY=sk-proj-123
LANGFUSE_PUBLIC_KEY=pk-lf-123
LANGFUSE_SECRET_KEY=sk-lf-123
LANGFUSE_HOST=https://cloud.langfuse.com

applications/rag/.gitignore

Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
.env
.venv
__pycache__/

applications/rag/.python-version

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
3.11

applications/rag/README.md

Lines changed: 75 additions & 0 deletions
@@ -0,0 +1,75 @@
# RAG Observability and Evals with Langfuse

A RAG (Retrieval-Augmented Generation) chatbot that answers questions about Langfuse using OpenAI and LangChain, with observability and evals.

Follow the Langfuse blog post on [RAG Observability and Evals](https://langfuse.com/blog/2025-10-28-rag-observability-and-evals) for more details.

## Features

- **Observability**: Full tracing of the RAG pipeline with Langfuse
- **Component Evaluation**: Example evaluation of the chunk retrieval step with Langfuse Experiments
- **Evaluation**: Evaluation of the entire RAG pipeline with Langfuse Experiments

## Requirements

- Python ≥3.11
- OpenAI API key
- Langfuse credentials

## Setup

Create a `.env` file with:

```
OPENAI_API_KEY=your_key
LANGFUSE_PUBLIC_KEY=your_key
LANGFUSE_SECRET_KEY=your_key
LANGFUSE_HOST=https://cloud.langfuse.com
```

And install the dependencies:

```sh
uv sync
```

## Usage

### For Observability

Run the bot:

```sh
uv run rag_bot/main.py
```

You should see traces in Langfuse like this:

![Langfuse Traces](./assets/rag-traces.png)
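
The `main.py` entry point is not included in the files shown in this diff. Purely for orientation, here is a minimal sketch of how a LangChain RAG chain can be traced with the Langfuse v3 SDK's LangChain callback handler; the corpus, model names, and prompt are illustrative assumptions, and only the `{"answer", "documents"}` return shape is taken from `answer_evaluation.py` below, so this is not the repository's actual implementation:

```python
# Illustrative sketch only — not the repository's main.py.
# Assumes the Langfuse v3 LangChain integration (CallbackHandler) and a tiny in-memory corpus.
from dotenv import load_dotenv
from langchain_core.documents import Document
from langchain_core.vectorstores import InMemoryVectorStore
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langfuse.langchain import CallbackHandler

load_dotenv()
langfuse_handler = CallbackHandler()  # reads the LANGFUSE_* environment variables

# Stand-in corpus; the real app would index the Langfuse docs instead
vectorstore = InMemoryVectorStore.from_documents(
  [Document(page_content="Langfuse is an open-source LLM engineering platform for tracing and evals.")],
  OpenAIEmbeddings(model="text-embedding-3-small"),
)
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)


def rag_bot(question: str) -> dict:
  documents = vectorstore.similarity_search(question, k=3)
  context = "\n\n".join(doc.page_content for doc in documents)
  answer = llm.invoke(
    f"Answer the question using only this context:\n{context}\n\nQuestion: {question}",
    config={"callbacks": [langfuse_handler]},  # the LLM call shows up as a trace in Langfuse
  )
  return {"answer": answer.content, "documents": documents}


if __name__ == "__main__":
  print(rag_bot("What is Langfuse?")["answer"])
```

Passing the handler per call via `config={"callbacks": [...]}` scopes the tracing to that invocation.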

### Evaluations

Create a dataset in Langfuse with the following name: `rag_bot_evals` (either in the UI or via the SDK, as sketched below).

Each item in the dataset should have the following fields:

- input: `{ "question": "What is Langfuse?" }`
- expected_output: `{ "answer": "Langfuse is a platform for building and evaluating LLMs." }`
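
A minimal sketch of seeding the dataset with the Langfuse Python SDK, assuming the client's `create_dataset` / `create_dataset_item` helpers and using the illustrative item from the list above:

```python
# Illustrative sketch: seed the evals dataset via the SDK rather than the Langfuse UI.
from dotenv import load_dotenv
from langfuse import get_client

load_dotenv()
langfuse = get_client()  # reads the LANGFUSE_* environment variables

langfuse.create_dataset(name="rag_bot_evals")
langfuse.create_dataset_item(
  dataset_name="rag_bot_evals",
  input={"question": "What is Langfuse?"},
  expected_output={"answer": "Langfuse is a platform for building and evaluating LLMs."},
)
```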

#### a. Answer Evaluation

```sh
uv run rag_bot/answer_evaluation.py
```

#### b. Component Evaluation

An example is provided to evaluate different chunk sizes and overlaps (a rough sketch of what such a script could look like follows the command below):

```sh
uv run rag_bot/chunk_evaluation.py
```
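
`chunk_evaluation.py` itself is not part of the files shown in this diff. As an illustration of the component-evaluation pattern only, the sketch below reuses the experiment setup from `answer_evaluation.py` to compare retrieval across chunk-size/overlap settings; the corpus, the settings, and the simple term-overlap scorer are assumptions, not the repository's script:

```python
# Illustrative sketch only — not the repository's chunk_evaluation.py.
# Runs one Langfuse experiment per chunking configuration so the runs can be compared side by side.
from dotenv import load_dotenv
from langchain_core.documents import Document
from langchain_core.vectorstores import InMemoryVectorStore
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langfuse import Evaluation, get_client

load_dotenv()
langfuse = get_client()

# Stand-in corpus; the real script would split the same documents the bot indexes
CORPUS = [Document(page_content="Langfuse is an open-source LLM engineering platform for tracing and evals.")]


def make_retrieval_task(chunk_size: int, chunk_overlap: int):
  splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
  store = InMemoryVectorStore.from_documents(splitter.split_documents(CORPUS), OpenAIEmbeddings())

  def task(*, item, **kwargs):
    docs = store.similarity_search(item.input["question"], k=3)
    return {"chunks": [doc.page_content for doc in docs]}

  return task


def chunk_recall_evaluator(*, input, output, expected_output, **kwargs):
  """Crude heuristic: does any retrieved chunk contain a content word from the expected answer?"""
  terms = [t.strip(".,") for t in expected_output["answer"].lower().split() if len(t) > 3]
  hit = any(term in chunk.lower() for chunk in output["chunks"] for term in terms)
  return Evaluation(name="chunk_recall", value=1 if hit else 0, comment="heuristic term overlap")


if __name__ == "__main__":
  dataset = langfuse.get_dataset(name="rag_bot_evals")
  for size, overlap in [(500, 50), (1000, 100)]:
    dataset.run_experiment(
      name=f"Chunking: size={size}, overlap={overlap}",
      task=make_retrieval_task(size, overlap),
      evaluators=[chunk_recall_evaluator],
    )
  langfuse.flush()
```

Each configuration becomes its own experiment run on the dataset, so the scores can be compared side by side in Langfuse.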

You should see the evaluation results in Langfuse like this:

![Langfuse Evaluation](./assets/rag-evaluation.png)
Two binary image assets added (277 KB and 264 KB): the screenshots referenced in the README above.

applications/rag/pyproject.toml

Lines changed: 20 additions & 0 deletions
@@ -0,0 +1,20 @@
[project]
name = "rag"
version = "0.1.0"
description = "Add your description here"
readme = "README.md"
requires-python = ">=3.11"
dependencies = [
  "langchain-community>=0.4",
  "langchain-openai>=1.0.1",
  "langchain-core>=0.4",
  "langchain-text-splitters>=0.4",
  "langfuse>=3.8.0",
  "python-dotenv>=1.1.1",
  "beautifulsoup4>=4.14.2",
  "langchain>=1.0.2",
]

[tool.ruff]
line-length = 120
indent-width = 2

applications/rag/rag_bot/__init__.py

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
# RAG Bot package

applications/rag/rag_bot/answer_evaluation.py

Lines changed: 115 additions & 0 deletions
@@ -0,0 +1,115 @@
# An answer evaluation is added to evaluate the quality of the answer generated by the entire RAG pipeline.
# In this example, we evaluate the relevance and faithfulness of the answer to the question and the expected output.

from typing import Annotated, TypedDict

from dotenv import load_dotenv
from langchain_openai import ChatOpenAI
from langfuse import Evaluation, get_client
from langfuse.experiment import ExperimentItem
from main import rag_bot

load_dotenv()
langfuse = get_client()


def rag_task(*, item: ExperimentItem, **kwargs):
  """Task function that runs the full RAG pipeline."""
  question = item.input["question"]  # type: ignore
  result = rag_bot(question)

  return {"answer": result["answer"], "documents": result["documents"]}


# Answer Relevance Evaluation
class AnswerRelevanceGrade(TypedDict):
  explanation: Annotated[str, ..., "Explain your reasoning for the score"]
  score: Annotated[int, ..., "Rate the relevance of the answer to the question as 0 or 1"]


answer_relevance_llm = ChatOpenAI(model="gpt-4o", temperature=0).with_structured_output(
  AnswerRelevanceGrade, method="json_schema", strict=True
)

answer_relevance_instructions = """You are evaluating the relevance of an answer to a question.
You will be given a QUESTION, an ANSWER, and an EXPECTED OUTPUT.

Here is the grade criteria to follow:
(1) The ANSWER should directly address the QUESTION
(2) The ANSWER should be similar in scope to the EXPECTED OUTPUT
(3) The ANSWER should not contain significant irrelevant information
(4) It's acceptable if the ANSWER provides additional helpful context as long as it addresses the core question

You should return a score of 0 or 1, where:
- 0: The answer is irrelevant or doesn't address the question
- 1: The answer is relevant and addresses the question
"""


def answer_relevance_evaluator(*, input, output, expected_output, metadata, **kwargs):
  """Evaluates how relevant the generated answer is to the question."""
  result = answer_relevance_llm.invoke(
    answer_relevance_instructions
    + "\n\nQUESTION: "
    + input["question"]
    + "\n\nANSWER: "
    + output["answer"]
    + "\n\nEXPECTED OUTPUT: "
    + expected_output["answer"]
  )

  return Evaluation(name="answer_relevance", value=result["score"], comment=result.get("explanation", ""))


# Faithfulness Evaluation
class FaithfulnessGrade(TypedDict):
  explanation: Annotated[str, ..., "Explain your reasoning for the score"]
  score: Annotated[int, ..., "Rate the faithfulness of the answer to the source documents as 0 or 1"]


faithfulness_llm = ChatOpenAI(model="gpt-4o", temperature=0).with_structured_output(
  FaithfulnessGrade, method="json_schema", strict=True
)

faithfulness_instructions = """You are evaluating the faithfulness of an answer to the source documents.
You will be given an ANSWER and the FACTS (source documents) that were used to generate it.

Here is the grade criteria to follow:
(1) The ANSWER should only contain information that can be verified from the FACTS
(2) The ANSWER should not hallucinate or make up information not present in the FACTS
(3) The ANSWER should not contradict information in the FACTS
(4) It's acceptable for the ANSWER to say "I don't know" if the FACTS don't contain the information

You should return a score of 0 or 1, where:
- 1: The answer is fully grounded in the source facts
- 0: The answer contains hallucinations or unverified claims

Explain your reasoning for the score."""


def faithfulness_evaluator(*, input, output, expected_output, metadata, **kwargs):
  """Evaluates how faithful the generated answer is to the source facts."""
  result = faithfulness_llm.invoke(
    faithfulness_instructions
    + "\n\nANSWER: "
    + output["answer"]
    + "\n\nFACTS: "
    + "\n\n".join(doc.page_content for doc in output["documents"])
  )

  return Evaluation(name="faithfulness", value=result["score"], comment=result.get("explanation", ""))


if __name__ == "__main__":
  print("Fetching dataset")
  dataset = langfuse.get_dataset(name="rag_bot_evals")

  print("Running answer evaluation experiment")
  dataset.run_experiment(
    name="Answer Quality: Relevance and Faithfulness",
    task=rag_task,
    evaluators=[answer_relevance_evaluator, faithfulness_evaluator],
  )

  print("Experiment run successfully")
  langfuse.flush()
