|
2 | 2 |
|
3 | 3 | [PyPI](https://pypi.org/project/eval-protocol/) |
4 | 4 |
|
5 | | -**Eval Protocol (EP) is the open-source standard and toolkit for practicing Eval-Driven Development.** |
| 5 | +**The open-source toolkit for building your internal model leaderboard.** |
6 | 6 |
|
7 | | -Building with AI is different. Traditional software is deterministic, but AI systems are probabilistic. How do you ship new features without causing silent regressions? How do you prove a new prompt is actually better? |
8 | | - |
9 | | -The answer is a new engineering discipline: **Eval-Driven Development (EDD)**. It adapts the rigor of Test-Driven Development for the uncertain world of AI. With EDD, you define your AI's desired behavior as a suite of executable tests, creating a safety net that allows you to innovate with confidence. |
10 | | - |
11 | | -EP provides a consistent way to write evals, store traces, and analyze results. |
12 | | - |
13 | | -<p align="center"> |
14 | | - <img src="https://raw.githubusercontent.com/eval-protocol/python-sdk/refs/heads/main/assets/ui.png" alt="UI" /> |
15 | | - <br> |
16 | | - <sub><b>Log Viewer: Monitor your evaluation rollouts in real time.</b></sub> |
17 | | -</p> |
| 7 | +When you have multiple AI models to choose from—different versions, providers, or configurations—how do you know which one is best for your use case? |
18 | 8 |
|
19 | 9 | ## Quick Example |
20 | 10 |
|
21 | | -Here's a simple test function that checks if a model's response contains **bold** text formatting: |
| 11 | +Compare models on a simple formatting task: |
22 | 12 |
|
23 | 13 | ```python test_bold_format.py |
24 | 14 | from eval_protocol.models import EvaluateResult, EvaluationRow, Message |
25 | | -from eval_protocol.pytest import SingleTurnRolloutProcessor, evaluation_test |
| 15 | +from eval_protocol.pytest import default_single_turn_rollout_processor, evaluation_test |
26 | 16 |
|
27 | 17 | @evaluation_test( |
28 | 18 | input_messages=[ |
29 | 19 | [ |
30 | | - Message(role="system", content="You are a helpful assistant. Use bold text to highlight important information."), |
31 | | - Message(role="user", content="Explain why **evaluations** matter for building AI agents. Make it dramatic!"), |
| 20 | + Message(role="system", content="Use bold text to highlight important information."), |
| 21 | + Message(role="user", content="Explain why evaluations matter for AI agents. Make it dramatic!"), |
32 | 22 | ], |
33 | 23 | ], |
34 | | - completion_params=[{"model": "accounts/fireworks/models/llama-v3p1-8b-instruct"}], |
35 | | - rollout_processor=SingleTurnRolloutProcessor(), |
| 24 | + model=[ |
| 25 | + "fireworks_ai/accounts/fireworks/models/llama-v3p1-8b-instruct", |
| 26 | + "openai/gpt-4", |
| 27 | + "anthropic/claude-3-sonnet" |
| 28 | + ], |
| 29 | + rollout_processor=default_single_turn_rollout_processor, |
36 | 30 | mode="pointwise", |
37 | 31 | ) |
38 | 32 | def test_bold_format(row: EvaluationRow) -> EvaluationRow: |
39 | | - """ |
40 | | - Simple evaluation that checks if the model's response contains bold text. |
41 | | - """ |
42 | | - |
| 33 | + """Check if the model's response contains bold text.""" |
43 | 34 | assistant_response = row.messages[-1].content |
44 | 35 |
|
45 | | - # Check if response contains **bold** text |
46 | | - has_bold = "**" in assistant_response |
| 36 | + if assistant_response is None: |
| 37 | + row.evaluation_result = EvaluateResult(score=0.0, reason="No response") |
| 38 | + return row |
47 | 39 |
|
48 | | - if has_bold: |
49 | | - result = EvaluateResult(score=1.0, reason="✅ Response contains bold text") |
50 | | - else: |
51 | | - result = EvaluateResult(score=0.0, reason="❌ No bold text found") |
| 40 | + has_bold = "**" in str(assistant_response) |
| 41 | + score = 1.0 if has_bold else 0.0 |
| 42 | + reason = "Contains bold text" if has_bold else "No bold text found" |
52 | 43 |
|
53 | | - row.evaluation_result = result |
| 44 | + row.evaluation_result = EvaluateResult(score=score, reason=reason) |
54 | 45 | return row |
55 | 46 | ``` |
56 | 47 |
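The scoring rule in `test_bold_format` can also be tried without any model calls. The following is a minimal sketch, not part of Eval Protocol: `bold_score`, `leaderboard`, and the `model-a`/`model-b`/`model-c` sample data are hypothetical names introduced here for illustration. It uses only the standard library to show how per-response scores could be averaged into a per-model ranking.

```python
from __future__ import annotations

from collections import defaultdict
from statistics import mean


def bold_score(response: str | None) -> float:
    """Return 1.0 if the response contains a Markdown bold marker, else 0.0."""
    # Mirrors the check in test_bold_format: a missing response scores 0.0.
    if response is None:
        return 0.0
    return 1.0 if "**" in response else 0.0


def leaderboard(results: list[tuple[str, str | None]]) -> list[tuple[str, float]]:
    """Average the bold score per model and rank models from best to worst."""
    scores: dict[str, list[float]] = defaultdict(list)
    for model, response in results:
        scores[model].append(bold_score(response))
    ranked = [(model, mean(values)) for model, values in scores.items()]
    # Sort by mean score descending, then by model name for a stable order.
    return sorted(ranked, key=lambda item: (-item[1], item[0]))


# Hypothetical responses; real rows would come from the evaluation rollouts.
sample = [
    ("model-a", "Evaluations are **critical**."),
    ("model-a", "No emphasis here."),
    ("model-b", "**Always** evaluate."),
    ("model-c", None),
]
ranking = leaderboard(sample)
```

With this sample, `ranking` places `model-b` first (mean 1.0), then `model-a` (0.5), then `model-c` (0.0). Keeping the scoring function separate from the aggregation makes it easy to unit-test the rule before running it against real models.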
|
57 | | -## Documentation |
| 48 | +## 📚 Resources |
58 | 49 |
|
59 | | -See our [documentation](https://evalprotocol.io) for more details. |
| 50 | +- **[Documentation](https://evalprotocol.io)** - Complete guides and API reference |
| 51 | +- **[Discord](https://discord.com/channels/1137072072808472616/1400975572405850155)** - Community discussions |
60 | 52 |
|
61 | 53 | ## Installation |
62 | 54 |