
Commit 577c278

docs: LW4-D5 (#2272)

* Fix: Replace comment and reaction image
* Update comment docs page: shorten title, update API routes
* Add score analytics docs and changelog
* Edit LW blog page

Co-authored-by: Jannik Maierhöfer <jannik@langfuse.com>
1 parent 54f7138 commit 577c278

13 files changed: 586 additions & 120 deletions

data/generated/contributors.json

Lines changed: 214 additions & 105 deletions
Large diffs are not rendered by default.

pages/blog/2025-10-29-launch-week-4.mdx

Lines changed: 20 additions & 10 deletions
@@ -145,28 +145,38 @@ We're adding a set of new features to Dataset Experiments in Langfuse:
 
 We also added guides on [systematically interpreting experiment results](/blog/2025-11-06-experiment-interpretation) and [integrating Langfuse into CI/CD pipelines](/blog/2025-10-21-testing-llm-applications) for automated testing.
 
-</Steps>
 
----
+### Day 5: Score Analytics [#score-analytics]
 
-### Don't Miss a Launch
+<iframe
+  width="100%"
+  src="https://www.youtube-nocookie.com/embed/2hZjU1XqxRQ?si=HnW_PX9nMfWHptEV"
+  title="Score Analytics with Multi-Score Comparison"
+  frameBorder="0"
+  allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture"
+  className="aspect-video rounded border mt-6 max-w-xl"
+  allowFullScreen
+></iframe>
 
-Be the first to know. Subscribe to our mailing list or follow us on [X](https://x.com/langfuse) or [LinkedIn](https://www.linkedin.com/company/langfuse) for the daily updates.
+Scores track application quality, but getting reliable scores at scale is difficult. You must know if your evaluation methods are aligned. Score Analytics helps you compare and validate all your scores, from human ratings to LLM-as-a-judge.
 
-<ProductUpdateSignup source="launch-week-4" className="my-2" />
+Select a score to see its distribution and trend over time. Add a second score to compare two evaluation methods. Correlation heatmaps and Pearson metrics reveal alignment between evaluators, such as two different LLM judges. This analysis also works for categorical scores to find relationships between trace properties.
+
+**[Learn more](/changelog/2025-11-07-score-analytics-multi-score-comparison)**
+
+
+</Steps>
 
 ---
 
-### Join Us Live!
+### Don't Miss a Launch
 
-We're celebrating mid-week with a **[Virtual Community Hour](https://luma.com/oomgjebn)**. Join us on Wednesday, Nov 5th to chat with the team, see the new features in action, and ask questions.
+Be the first to know. Subscribe to our mailing list or follow us on [X](https://x.com/langfuse) or [LinkedIn](https://www.linkedin.com/company/langfuse) for daily updates.
 
-[Sign up here](https://luma.com/oomgjebn)
+<ProductUpdateSignup source="launch-week-4" className="my-2" />
 
 ---
 
-_Want a taste of what's in store? Revisit [Launch Week #3](/blog/2025-05-19-launch-week-3) to see what we build._
-
 ## Learn More About Langfuse
 
 <Cards num={2} className="gap-6">
pages/changelog/2025-11-07-score-analytics-multi-score-comparison.mdx

Lines changed: 87 additions & 0 deletions (new file)

---
date: 2025-11-07
title: Score Analytics with Multi-Score Comparison
badge: Launch Week 4 🚀
description: Validate evaluation reliability and uncover insights with comprehensive score analysis. Compare different evaluation methods, track trends over time, and measure agreement between human annotators and LLM judges.
author: Michael
ogImage: /images/changelog/score-analytics-compare-numeric.png
---

import { ChangelogHeader } from "@/components/changelog/ChangelogHeader";

<ChangelogHeader />

Score Analytics now provides comprehensive tools for analyzing and comparing evaluation scores across your LLM application. Whether you're validating that different LLM judges agree, checking whether human annotations align with automated evaluations, or exploring score distributions and trends, Score Analytics gives you the insights you need to trust your evaluation process.

## What's New

- **Multi-Score Comparison**: Compare any two scores of the same data type to validate evaluation reliability. View correlation metrics, confusion matrices, and alignment patterns between different evaluation sources.
- **Statistical Validation**: Measure agreement with Pearson correlation, Cohen's Kappa, F1 scores, and other metrics. Badge indicators show the interpretation at a glance (e.g., "Very Strong" for correlations above 0.9).
- **Multi-Data Type Support**: Analyze numeric scores (continuous ratings), categorical scores (discrete labels), or boolean scores (binary classifications) with type-appropriate visualizations and statistics.
- **Matched vs All Analysis**: Toggle between matched data (scores attached to the same parent object), which measures alignment, and all data, which shows coverage and individual score distributions.
- **Temporal Insights**: Track how scores evolve over time with configurable intervals from seconds to months. Identify quality regressions or improvements in your application.
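For intuition, the agreement metrics listed under Statistical Validation can be sketched in a few lines. This is illustrative TypeScript, not Langfuse's implementation; the badge thresholds below 0.9 are assumptions, since only the "Very Strong" cutoff is stated above.

```typescript
// Pearson correlation between two aligned score series,
// e.g. two LLM judges scoring the same traces.
function pearson(xs: number[], ys: number[]): number {
  const n = xs.length;
  const mx = xs.reduce((a, b) => a + b, 0) / n;
  const my = ys.reduce((a, b) => a + b, 0) / n;
  let cov = 0, vx = 0, vy = 0;
  for (let i = 0; i < n; i++) {
    const dx = xs[i] - mx;
    const dy = ys[i] - my;
    cov += dx * dy;
    vx += dx * dx;
    vy += dy * dy;
  }
  return cov / Math.sqrt(vx * vy);
}

// Badge-style interpretation; only the >= 0.9 "Very Strong"
// cutoff is from the text, the rest are hypothetical.
function interpret(r: number): string {
  const a = Math.abs(r);
  if (a >= 0.9) return "Very Strong";
  if (a >= 0.7) return "Strong";
  if (a >= 0.5) return "Moderate";
  if (a >= 0.3) return "Weak";
  return "Negligible";
}
```

A perfectly aligned pair of judges would yield `interpret(pearson(a, b)) === "Very Strong"`.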
## How It Works

<Frame fullWidth>
  ![Score Analytics Dashboard](/images/changelog/score-analytics-compare-categorical.png)
</Frame>

**Single Score Analysis**

1. Navigate to **Scores > Analytics** in your project
2. Select a score from the dropdown to view its distribution and trend over time
3. Filter by object type (Traces, Observations, Sessions, or Dataset Run Items) and time range
4. Review summary statistics including mean, standard deviation, and total count
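The summary statistics from step 4 amount to the following. A minimal sketch; whether Langfuse uses the population or sample standard deviation is not stated, so population variance is assumed here.

```typescript
// Summary statistics for a list of numeric score values:
// total count, mean, and (population) standard deviation.
function summarize(values: number[]): { count: number; mean: number; stdDev: number } {
  const count = values.length;
  const mean = values.reduce((a, b) => a + b, 0) / count;
  const variance =
    values.reduce((a, v) => a + (v - mean) ** 2, 0) / count;
  return { count, mean, stdDev: Math.sqrt(variance) };
}
```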
**Two-Score Comparison**

1. Select a second score to enable comparison mode
2. View correlation metrics in the Statistics card showing how well the scores align
3. Examine the Score Comparison Heatmap for correlation patterns:
   - Strong diagonal patterns indicate good agreement
   - Anti-diagonal patterns reveal negative correlations
   - Scattered patterns suggest low alignment
4. Compare distributions side by side in the matched vs all tabs
5. Track how both scores trend together over time

<Callout type="info">
**Self-Serve Dashboards**: Single-score analytics continue to be available on our [self-serve dashboards](/docs/metrics/features/custom-dashboards). Multi-score comparison with correlation analysis requires different data computation that is not yet supported by the metrics API powering self-serve dashboards.
</Callout>
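The heatmap in step 3 is essentially a co-occurrence count over matched label pairs: each cell counts how often a given pair of labels was assigned to the same parent object. A minimal sketch with hypothetical labels, not Langfuse's implementation:

```typescript
// One matched observation: labels from two categorical scores
// attached to the same parent object (e.g. the same trace).
type MatchedPair = { a: string; b: string };

// Build the heatmap as nested counts: rows are score A's labels,
// columns are score B's labels. A heavy diagonal (identical labels
// co-occurring) indicates agreement between the two scores.
function heatmap(pairs: MatchedPair[]): Map<string, Map<string, number>> {
  const grid = new Map<string, Map<string, number>>();
  for (const { a, b } of pairs) {
    const row = grid.get(a) ?? new Map<string, number>();
    row.set(b, (row.get(b) ?? 0) + 1);
    grid.set(a, row);
  }
  return grid;
}
```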
## Example Use Cases

**Validate LLM Judge Reliability**: Compare helpfulness scores from GPT-4 vs Gemini. If the Pearson correlation is 0.98+ ("Very Strong"), the judges are aligned and your evaluation is reliable.

**Human-AI Annotation Agreement**: Check whether your AI evaluations match human annotations. A high Cohen's Kappa (0.8+) means AI can augment or replace some manual annotation work.

**Identify Coverage Gaps**: Toggle between the "all" and "matched" tabs to see what percentage of your traces have evaluations. If only 50% are matched, you may need broader evaluation coverage.

**Spot Quality Regressions**: Monitor scores over time to detect drops after deployments. Temporal analysis helps you quickly identify and investigate quality issues.

**Discover Feature Relationships**: Compare boolean scores like "has_tool_use" vs "has_hallucination" to uncover insights. A negative correlation pattern suggests that traces with tool use hallucinate less often.
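The Cohen's Kappa used in the Human-AI Annotation Agreement example corrects raw agreement for agreement expected by chance. A sketch, not Langfuse's implementation; it assumes the label sequences vary, so that chance agreement stays below 1:

```typescript
// Cohen's kappa between two label sequences of equal length,
// e.g. a human annotator vs an LLM judge over the same traces.
// kappa = (observed agreement - chance agreement) / (1 - chance agreement)
function cohensKappa(a: string[], b: string[]): number {
  const n = a.length;
  const labels = Array.from(new Set([...a, ...b]));

  // Observed agreement: fraction of positions with identical labels.
  let same = 0;
  for (let i = 0; i < n; i++) if (a[i] === b[i]) same++;
  const po = same / n;

  // Chance agreement: product of each rater's marginal label frequencies.
  let pe = 0;
  for (const label of labels) {
    const pa = a.filter((x) => x === label).length / n;
    const pb = b.filter((x) => x === label).length / n;
    pe += pa * pb;
  }

  return (po - pe) / (1 - pe); // undefined if pe === 1 (constant labels)
}
```

Values near 1 indicate strong agreement beyond chance; values near 0 mean the raters agree no more than random labeling would.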
## Getting Started

1. Ensure you have [score data](/docs/evaluation/overview) in your Langfuse project
2. Navigate to **Scores > Analytics**
3. Select one or two scores to start analyzing
4. Explore different object types and time ranges to find insights

## Going Deeper

Score Analytics provides a lightweight, zero-configuration way to analyze your scores out of the box. For more advanced analyses, the [experiment SDK](https://langfuse.com/docs/evaluation/experiments/experiments-via-sdk) helps expert users drill deeper into their evaluation data.

- [Evaluation Overview](/docs/evaluation/overview)
- [Guide to Automated Evaluations](https://langfuse.com/blog/2025-09-05-automated-evaluations)
- [Guide to LLM Agent Evaluation](https://langfuse.com/blog/2025-11-06-experiment-interpretation)

## Learn More

import { Book, Calendar } from "lucide-react";

<Cards num={1}>
  <Card title="Score Analytics Documentation" href="/docs/evaluation/evaluation-methods/score-analytics" icon={<Book />} />
  <Card title="Score Configuration Management" href="/faq/all/manage-score-configs" icon={<Book />} />
  <Card title="See all Launch Week releases" href="/blog/2025-10-29-launch-week-4" icon={<Calendar />} />
</Cards>

pages/docs/evaluation/evaluation-methods/_meta.tsx

Lines changed: 1 addition & 0 deletions
@@ -2,5 +2,6 @@ export default {
   "llm-as-a-judge": "LLM-as-a-Judge",
   annotation: "Human Annotations",
   "custom-scores": "Custom Scores",
+  "score-analytics": "Score Analytics",
   "data-model": "Data Model",
 };
