|
21 | 21 | 7. [Step-by-Step Setup](#7-step-by-step-setup) |
22 | 22 | 8. [Configuration Guide](#8-configuration-guide) |
23 | 23 | 9. [Running Evaluations](#9-running-evaluations) |
24 | | -10. [Understanding Results](#10-understanding-results) |
| 24 | +10. [Programmatic API](#10-programmatic-api) |
| 25 | +11. [Understanding Results](#11-understanding-results) |
25 | 26 |
|
26 | 27 | ### Part 4: Real-World Application |
27 | | -11. [Common Use Cases](#11-common-use-cases) |
28 | | -12. [Best Practices](#12-best-practices) |
29 | | -13. [Troubleshooting](#13-troubleshooting) |
| 28 | +12. [Common Use Cases](#12-common-use-cases) |
| 29 | +13. [Best Practices](#13-best-practices) |
| 30 | +14. [Troubleshooting](#14-troubleshooting) |
30 | 31 |
|
31 | 32 | ### Part 5: Reference Materials |
32 | | -14. [Quick Reference Tables](#14-quick-reference-tables) |
33 | | -15. [Resources & Links](#15-resources--links) |
| 33 | +15. [Quick Reference Tables](#15-quick-reference-tables) |
| 34 | +16. [Resources & Links](#16-resources--links) |
34 | 35 |
|
35 | 36 | --- |
36 | 37 |
|
@@ -856,7 +857,127 @@ lightspeed-eval \ |
856 | 857 |
|
857 | 858 | --- |
858 | 859 |
|
859 | | -## 10. Understanding Results |
| 860 | +## 10. Programmatic API |
| 861 | + |
| 862 | +In addition to the CLI, the framework can be used as a Python library. This is useful for integrating evaluations into scripts, notebooks, CI pipelines, or custom tooling without writing YAML files or passing command-line arguments. |
| 863 | + |
| 864 | +### Available Functions |
| 865 | + |
| 866 | +| Function | Purpose | |
| 867 | +|----------|---------| |
| 868 | +| `evaluate(config, data)` | Evaluate a list of conversations | |
| 869 | +| `evaluate_conversation(config, data)` | Evaluate a single conversation | |
| 870 | +| `evaluate_turn(config, turn)` | Evaluate a single turn | |
| 871 | + |
| 872 | +All three functions return `list[EvaluationResult]`. |
| 873 | + |
| 874 | +### Basic Example |
| 875 | + |
| 876 | +```python |
| 877 | +from lightspeed_evaluation import ( |
| 878 | + evaluate, |
| 879 | + EvaluationData, |
| 880 | + LLMConfig, |
| 881 | + SystemConfig, |
| 882 | + TurnData, |
| 883 | +) |
| 884 | +
|
| 885 | +# 1. Build configuration |
| 886 | +config = SystemConfig( |
| 887 | + llm=LLMConfig(provider="openai", model="gpt-4o-mini"), |
| 888 | +) |
| 889 | +
|
| 890 | +# 2. Build evaluation data |
| 891 | +data = EvaluationData( |
| 892 | + conversation_group_id="my_eval", |
| 893 | + turns=[ |
| 894 | + TurnData( |
| 895 | + turn_id="t1", |
| 896 | + query="What is OpenShift?", |
| 897 | + response="OpenShift is a Kubernetes-based container platform.", |
| 898 | + expected_response="OpenShift is Red Hat's Kubernetes platform.", |
| 899 | + turn_metrics=["ragas:response_relevancy"], |
| 900 | + ), |
| 901 | + ], |
| 902 | +) |
| 903 | +
|
| 904 | +# 3. Run evaluation |
| 905 | +results = evaluate(config, [data]) |
| 906 | +
|
| 907 | +# 4. Inspect results |
| 908 | +for r in results: |
| 909 | + print(f"{r.metric_identifier}: {r.result} (score={r.score})") |
| 910 | +``` |
| 911 | + |
| 912 | +### Evaluating a Single Turn |
| 913 | + |
| 914 | +Use `evaluate_turn()` when you want to evaluate one question-answer pair. You can override metrics without modifying the original turn object: |
| 915 | + |
| 916 | +```python |
| 917 | +from lightspeed_evaluation import evaluate_turn, SystemConfig, TurnData |
| 918 | +
|
| 919 | +config = SystemConfig() |
| 920 | +turn = TurnData( |
| 921 | + turn_id="t1", |
| 922 | + query="What is a pod?", |
| 923 | + response="A pod is the smallest deployable unit in Kubernetes.", |
| 924 | +) |
| 925 | +
|
| 926 | +results = evaluate_turn( |
| 927 | + config, |
| 928 | + turn, |
| 929 | + metrics=["ragas:response_relevancy", "ragas:faithfulness"], |
| 930 | +) |
| 931 | +``` |
| 932 | + |
| 933 | +### Evaluating a Single Conversation |
| 934 | + |
| 935 | +Use `evaluate_conversation()` when you have a single `EvaluationData` object: |
| 936 | + |
| 937 | +```python |
| 938 | +from lightspeed_evaluation import evaluate_conversation, EvaluationData, SystemConfig, TurnData |
| 939 | +
|
| 940 | +config = SystemConfig() |
| 941 | +data = EvaluationData( |
| 942 | + conversation_group_id="support_conv", |
| 943 | + turns=[ |
| 944 | + TurnData(turn_id="t1", query="Hello", response="Hi! How can I help?"), |
| 945 | + TurnData(turn_id="t2", query="What is OCP?", response="OCP is OpenShift."), |
| 946 | + ], |
| 947 | + conversation_metrics=["deepeval:knowledge_retention"], |
| 948 | +) |
| 949 | +
|
| 950 | +results = evaluate_conversation(config, data) |
| 951 | +``` |
| 952 | + |
| 953 | +### Working with Results |
| 954 | + |
| 955 | +All three functions return `list[EvaluationResult]`. Each result contains: |
| 956 | + |
| 957 | +| Field | Description | |
| 958 | +|-------|-------------| |
| 959 | +| `result` | Status: `PASS`, `FAIL`, `ERROR`, or `SKIPPED` | |
| 960 | +| `score` | Numeric score between 0.0 and 1.0 | |
| 961 | +| `threshold` | Pass/fail threshold used | |
| 962 | +| `reason` | Explanation from the judge LLM | |
| 963 | +| `metric_identifier` | Which metric produced this result | |
| 964 | +| `turn_id` | Turn ID (for turn-level metrics) | |
| 965 | +| `conversation_group_id` | Conversation group ID | |
| 966 | + |
| 967 | +No files are generated by default; file output is the caller's responsibility. If you need CSV/JSON reports, use the `OutputHandler` separately. |
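Since results stay in memory, post-processing is ordinary Python. The sketch below is illustrative only: `StubResult` is a hypothetical stand-in that carries the fields listed in the table above so the snippet runs without the framework or an API key, and the helper names are made up for this example. It also assumes the status compares equal to a plain string such as `"PASS"`; the real `EvaluationResult` may represent it differently and may need a different serialization call.

```python
import json
from collections import Counter
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class StubResult:
    """Hypothetical stand-in with the fields from the results table."""

    conversation_group_id: str
    metric_identifier: str
    result: str  # assumed to be a plain string: PASS, FAIL, ERROR, or SKIPPED
    score: Optional[float] = None
    threshold: Optional[float] = None
    reason: str = ""
    turn_id: Optional[str] = None


def count_by_status(results):
    """Count results per status string."""
    return Counter(r.result for r in results)


def needs_attention(results):
    """Return the results whose status is FAIL or ERROR."""
    return [r for r in results if r.result in ("FAIL", "ERROR")]


def to_json(results):
    """Serialize the stand-in results; the caller decides where to write the string."""
    return json.dumps([asdict(r) for r in results], indent=2)


# Made-up sample values, used only to exercise the helpers.
sample = [
    StubResult("my_eval", "ragas:response_relevancy", "PASS", 0.91, 0.8, "On topic", "t1"),
    StubResult("my_eval", "ragas:faithfulness", "FAIL", 0.42, 0.8, "Unsupported claim", "t1"),
    StubResult("my_eval", "deepeval:knowledge_retention", "ERROR"),
]

counts = count_by_status(sample)
flagged = needs_attention(sample)
report = to_json(sample)

print(dict(counts))
for r in flagged:
    print(f"{r.metric_identifier}: {r.result} (score={r.score})")
```

With real results, the same filtering and counting should apply to the list returned by `evaluate()`, provided the attribute names match the table; check the status type before comparing against strings.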
| 968 | + |
| 969 | +### CLI vs Programmatic API |
| 970 | + |
| 971 | +| Aspect | CLI (`lightspeed-eval`) | Programmatic API | |
| 972 | +|--------|------------------------|------------------| |
| 973 | +| Configuration | YAML files | Python objects (`SystemConfig`) | |
| 974 | +| Input data | YAML files | Python objects (`EvaluationData`) | |
| 975 | +| Output | CSV, JSON, TXT files + graphs | `list[EvaluationResult]` in memory | |
| 976 | +| Use case | Standalone runs, CI jobs | Library integration, notebooks, scripts | |
| 977 | + |
| 978 | +--- |
| 979 | + |
| 980 | +## 11. Understanding Results |
860 | 981 |
|
861 | 982 | ### Output Files |
862 | 983 |
|
@@ -956,7 +1077,7 @@ ragas:faithfulness: |
956 | 1077 |
|
957 | 1078 | # Part 4: Real-World Application |
958 | 1079 |
|
959 | | -## 11. Common Use Cases |
| 1080 | +## 12. Common Use Cases |
960 | 1081 |
|
961 | 1082 | ### Use Case 1: Quality Assurance for Customer Support Bot |
962 | 1083 |
|
@@ -1132,7 +1253,7 @@ exit $? |
1132 | 1253 |
|
1133 | 1254 | --- |
1134 | 1255 |
|
1135 | | -## 12. Best Practices |
| 1256 | +## 13. Best Practices |
1136 | 1257 |
|
1137 | 1258 | ### 1. Start Small, Scale Up |
1138 | 1259 |
|
@@ -1257,7 +1378,7 @@ llm: |
1257 | 1378 |
|
1258 | 1379 | --- |
1259 | 1380 |
|
1260 | | -## 13. Troubleshooting |
| 1381 | +## 14. Troubleshooting |
1261 | 1382 |
|
1262 | 1383 | ### Issue 1: "No API key found" |
1263 | 1384 |
|
@@ -1468,7 +1589,7 @@ lightspeed-eval --eval-data config/eval_batch2.yaml |
1468 | 1589 |
|
1469 | 1590 | # Part 5: Reference Materials |
1470 | 1591 |
|
1471 | | -## 14. Quick Reference Tables |
| 1592 | +## 15. Quick Reference Tables |
1472 | 1593 |
|
1473 | 1594 | ### All Metrics at a Glance |
1474 | 1595 |
|
@@ -1564,7 +1685,7 @@ uv run python script/run_multi_provider_eval.py \ |
1564 | 1685 | --- |
1565 | 1686 |
|
1566 | 1687 |
|
1567 | | -## 15. Resources & Links |
| 1688 | +## 16. Resources & Links |
1568 | 1689 |
|
1569 | 1690 | ### Official Framework Documentation |
1570 | 1691 |
|
|