| 1 | +--- |
| 2 | +date: 2025-11-07 |
| 3 | +title: Score Analytics with Multi-Score Comparison |
| 4 | +badge: Launch Week 4 🚀 |
| 5 | +description: Validate evaluation reliability and uncover insights with comprehensive score analysis. Compare different evaluation methods, track trends over time, and measure agreement between human annotators and LLM judges. |
| 6 | +author: Michael |
| 7 | +ogImage: /images/changelog/score-analytics-compare-numeric.png |
| 8 | +--- |
| 9 | + |
| 10 | +import { ChangelogHeader } from "@/components/changelog/ChangelogHeader"; |
| 11 | + |
| 12 | +<ChangelogHeader /> |
| 13 | + |
| 14 | +Score Analytics now provides comprehensive tools for analyzing and comparing evaluation scores across your LLM application. Whether you're validating that different LLM judges agree, checking if human annotations align with automated evaluations, or exploring score distributions and trends, Score Analytics gives you the insights you need to trust your evaluation process. |
| 15 | + |
| 16 | +## What's New |
| 17 | + |
| 18 | +- **Multi-Score Comparison**: Compare any two scores of the same data type to validate evaluation reliability. View correlation metrics, confusion matrices, and alignment patterns between different evaluation sources. |
| 19 | +- **Statistical Validation**: Measure agreement with Pearson correlation, Cohen's Kappa, F1 scores, and other metrics. Badge indicators show interpretation at a glance (e.g., "Very Strong" for correlations above 0.9). |
| 20 | +- **Multi-Data Type Support**: Analyze numeric scores (continuous ratings), categorical scores (discrete labels), or boolean scores (binary classifications) with type-appropriate visualizations and statistics. |
| 21 | +- **Matched vs All Analysis**: Toggle between matched data (scores attached to the same parent object), which measures alignment, and all data, which shows coverage and individual score distributions. |
| 22 | +- **Temporal Insights**: Track how scores evolve over time with configurable intervals from seconds to months. Identify quality regressions or improvements in your application. |
| 23 | + |
| 24 | +## How It Works |
| 25 | + |
| 26 | +<Frame fullWidth> |
| 27 | +  |
| 28 | +</Frame> |
| 29 | + |
| 30 | +**Single Score Analysis** |
| 31 | + |
| 32 | +1. Navigate to **Scores > Analytics** in your project |
| 33 | +2. Select a score from the dropdown to view its distribution and trend over time |
| 34 | +3. Filter by object type (Traces, Observations, Sessions, or Dataset Run Items) and time range |
| 35 | +4. Review summary statistics including mean, standard deviation, and total count |
| 36 | + |
| 37 | +**Two-Score Comparison** |
| 38 | + |
| 39 | +1. Select a second score to enable comparison mode |
| 40 | +2. View correlation metrics in the Statistics card showing how well the scores align |
| 41 | +3. Examine the Score Comparison Heatmap showing correlation patterns: |
| 42 | + - Strong diagonal patterns indicate good agreement |
| 43 | + - Anti-diagonal patterns reveal negative correlations |
| 44 | + - Scattered patterns suggest low alignment |
| 45 | +4. Compare distributions side by side in the matched and all tabs |
| 46 | +5. Track how both scores trend together over time |
| 47 | + |
| 48 | +<Callout type="info"> |
| 49 | +**Self-Serve Dashboards**: Single-score analytics remain available on our [self-serve dashboards](/docs/metrics/features/custom-dashboards). Multi-score comparison with correlation analysis requires a different data computation that the metrics API powering self-serve dashboards does not currently support. |
| 50 | +</Callout> |
| 51 | + |
| 52 | +## Example Use Cases |
| 53 | + |
| 54 | +**Validate LLM Judge Reliability**: Compare helpfulness scores from GPT-4 vs Gemini. If the Pearson correlation is 0.98 or higher ("Very Strong"), the two judges agree closely, which is good evidence that your evaluation is consistent across models. |
| 55 | + |
| 56 | +**Human-AI Annotation Agreement**: Check if your AI evaluations match human annotations. High Cohen's Kappa (0.8+) means AI can augment or replace some manual annotation work. |
| 57 | + |
| 58 | +**Identify Coverage Gaps**: Toggle between the "all" and "matched" tabs to see what share of your scored traces carry both scores. If only 50% are matched, you may need broader evaluation coverage. |
| 59 | + |
| 60 | +**Spot Quality Regressions**: Monitor scores over time to detect drops after deployments. Temporal analysis helps you quickly identify and investigate quality issues. |
| 61 | + |
| 62 | +**Discover Feature Relationships**: Compare boolean scores like "has_tool_use" vs "has_hallucination" to uncover insights. A negative correlation would suggest that hallucinations are less frequent when tools are used, though correlation alone does not establish that tool use is the cause. |
| 63 | + |
| 64 | +## Getting Started |
| 65 | + |
| 66 | +1. Ensure you have [score data](/docs/evaluation/overview) in your Langfuse project |
| 67 | +2. Navigate to **Scores > Analytics** |
| 68 | +3. Select one or two scores to start analyzing |
| 69 | +4. Explore different object types and time ranges to find insights |
| 70 | + |
| 71 | +## Going Deeper |
| 72 | + |
| 73 | +Score Analytics provides a lightweight, zero-configuration way to analyze your scores out of the box. For more advanced analyses, the [experiment SDK](https://langfuse.com/docs/evaluation/experiments/experiments-via-sdk) helps expert users drill down even deeper into their evaluation data. |
| 74 | + |
| 75 | +- [Evaluation Overview](/docs/evaluation/overview) |
| 76 | +- [Guide to Automated Evaluations](https://langfuse.com/blog/2025-09-05-automated-evaluations) |
| 77 | +- [Guide to LLM Agent Evaluation](https://langfuse.com/blog/2025-11-06-experiment-interpretation) |
| 78 | + |
| 79 | +## Learn More |
| 80 | + |
| 81 | +import { Book, Calendar } from "lucide-react"; |
| 82 | + |
| 83 | +<Cards num={1}> |
| 84 | + <Card title="Score Analytics Documentation" href="/docs/evaluation/evaluation-methods/score-analytics" icon={<Book />} /> |
| 85 | + <Card title="Score Configuration Management" href="/faq/all/manage-score-configs" icon={<Book />} /> |
| 86 | + <Card title="See all Launch Week releases" href="/blog/2025-10-29-launch-week-4" icon={<Calendar />} /> |
| 87 | +</Cards> |
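
For readers who want to see what the agreement metrics named in this changelog measure, here is a minimal Python sketch of matched-pair comparison. It is an illustration under stated assumptions, not Langfuse's implementation: the function names (`match_scores`, `pearson`, `cohen_kappa`) are hypothetical, and it assumes each score has already been exported as a plain dict mapping a parent object ID to a value.

```python
from collections import Counter
from math import sqrt


def match_scores(score_a, score_b):
    """Pair the values of two scores that share a parent object ID.

    Both inputs are assumed to be dicts of {parent_id: value}.
    IDs present in only one dict are dropped (the "matched" view).
    """
    shared_ids = sorted(score_a.keys() & score_b.keys())
    return [(score_a[i], score_b[i]) for i in shared_ids]


def pearson(pairs):
    """Pearson correlation for numeric (x, y) pairs."""
    n = len(pairs)
    if n < 2:
        raise ValueError("need at least two matched pairs")
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant score")
    return cov / sqrt(var_x * var_y)


def cohen_kappa(pairs):
    """Cohen's kappa for categorical or boolean (label_a, label_b) pairs."""
    n = len(pairs)
    if n == 0:
        raise ValueError("need at least one matched pair")
    # Observed agreement: fraction of pairs with identical labels.
    observed = sum(1 for a, b in pairs if a == b) / n
    # Expected agreement by chance, from each rater's label frequencies.
    freq_a = Counter(a for a, _ in pairs)
    freq_b = Counter(b for _, b in pairs)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1 - expected)


# Hypothetical example data: two judges scoring overlapping sets of traces.
judge_a = {"t1": 0.9, "t2": 0.4, "t3": 0.7, "t4": 0.1}
judge_b = {"t1": 0.8, "t2": 0.5, "t3": 0.6, "t5": 0.3}

matched = match_scores(judge_a, judge_b)
coverage = len(matched) / len(judge_a.keys() | judge_b.keys())
correlation = pearson(matched)
```

Here `coverage` is the share of all scored objects that carry both scores, which is the gap the "all" and "matched" tabs are described as exposing, and `correlation` is computed only over the matched pairs.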