Upgrade to llama-stack 0.6.0 and ragas 0.4.x #64
Merged
dmaniloff merged 4 commits on May 11, 2026
Reviewer's Guide

Aligns the ragas-based llama-stack provider with llama-stack 0.6.x and ragas 0.4.x by updating dependencies, centralizing metric registration (including new class-based metrics), implementing the required is_finished() method on both inline and remote LLM wrappers, and refreshing compatibility docs and tests.

Class diagram for RagasEvaluatorBase metrics selection and registry

classDiagram
class RagasEvaluatorBase {
+dict~str, Benchmark~ benchmarks
+list~str~ _DEFAULT_METRICS
+_get_metrics(scoring_functions list~str~) list
}
class constants {
+str PROVIDER_TYPE
+str PROVIDER_ID_INLINE
+str PROVIDER_ID_REMOTE
+list _SINGLETON_METRICS
+list _CLASS_METRICS
+dict METRIC_MAPPING
+list AVAILABLE_METRICS
}
class Metric {
+str name
}
class AnswerAccuracy {
+str name
}
class ContextRelevance {
+str name
}
class FactualCorrectness {
+str name
}
class NoiseSensitivity {
+str name
}
class ResponseGroundedness {
+str name
}
constants o-- "*" Metric : _SINGLETON_METRICS
constants o-- "*" AnswerAccuracy : _CLASS_METRICS
constants o-- "*" ContextRelevance : _CLASS_METRICS
constants o-- "*" FactualCorrectness : _CLASS_METRICS
constants o-- "*" NoiseSensitivity : _CLASS_METRICS
constants o-- "*" ResponseGroundedness : _CLASS_METRICS
RagasEvaluatorBase ..> constants : uses_METRIC_MAPPING
RagasEvaluatorBase ..> Metric : returns_metrics_from__get_metrics
Metric <|-- AnswerAccuracy
Metric <|-- ContextRelevance
Metric <|-- FactualCorrectness
Metric <|-- NoiseSensitivity
Metric <|-- ResponseGroundedness
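
The registry in the diagram can be illustrated with a minimal sketch. It assumes `_CLASS_METRICS` holds instances rather than classes, and it uses a stand-in `Metric` class instead of the real ragas types so that it runs without ragas installed; the actual contents of `constants` may differ.

```python
class Metric:
    """Stand-in for a ragas metric; only `name` matters in this sketch."""

    def __init__(self, name: str) -> None:
        self.name = name


# Stand-ins for two of the class-based metrics named in the diagram.
class AnswerAccuracy(Metric):
    def __init__(self) -> None:
        super().__init__(name="nv_accuracy")


class ContextRelevance(Metric):
    def __init__(self) -> None:
        super().__init__(name="nv_context_relevance")


# Assumption: singleton metrics are pre-built objects, class-based
# metrics are instantiated once when the module is imported.
_SINGLETON_METRICS = [Metric("answer_relevancy"), Metric("context_precision")]
_CLASS_METRICS = [AnswerAccuracy(), ContextRelevance()]

# One mapping from metric name to metric object, used for every lookup.
METRIC_MAPPING = {m.name: m for m in [*_SINGLETON_METRICS, *_CLASS_METRICS]}
AVAILABLE_METRICS = list(METRIC_MAPPING.keys())
```

Keeping a single name-to-metric mapping means `_get_metrics` only needs dictionary lookups and no per-call imports.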
- Bump provider version to 0.7.0 targeting llama-stack >=0.6.0
- Upgrade ragas from ==0.3.0 to >=0.4.0,<0.5.0
- Add 6 new metrics: AnswerAccuracy, ContextRelevance, FactualCorrectness, NoiseSensitivity, ResponseGroundedness, context_entity_recall
- Implement is_finished() on LLM wrappers (now required by BaseRagasLLM)
- Fix test fixture metric name (semantic_similarity -> answer_similarity)
- Update COMPATIBILITY.md with release/0.6.x branch and version entries

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
6fbb8db to 65b01a7
- Fix EvaluationResult import in kubeflow components (ragas.dataset_schema → ragas.evaluation)
- Remove stale commented-out is_finished code from inline wrappers
- Eliminate deprecation-triggering lazy imports in base._get_metrics by using METRIC_MAPPING

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Hey - I've found 1 issue, and left some high level feedback:
- In `RagasEvaluatorBase._get_metrics`, you're now referencing `METRIC_MAPPING` and `_DEFAULT_METRICS` without showing an import in this file; ensure `METRIC_MAPPING` is imported from `constants` (and that `_DEFAULT_METRICS` is kept in sync with it) to avoid runtime errors or mismatches.
- The new `is_finished` implementations default to `True` when `llm_output` is missing or lacks `llama_stack_responses`; consider adding a simple content-based fallback check (e.g., non-empty generations) so obviously incomplete or empty generations are not treated as finished.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- In `RagasEvaluatorBase._get_metrics`, you're now referencing `METRIC_MAPPING` and `_DEFAULT_METRICS` without showing an import in this file; ensure `METRIC_MAPPING` is imported from `constants` (and that `_DEFAULT_METRICS` is kept in sync with it) to avoid runtime errors or mismatches.
- The new `is_finished` implementations default to `True` when `llm_output` is missing or lacks `llama_stack_responses`; consider adding a simple content-based fallback check (e.g., non-empty generations) so obviously incomplete or empty generations are not treated as finished.
## Individual Comments
### Comment 1
<location path="src/llama_stack_provider_ragas/base.py" line_range="42-51" />
<code_context>
def __init__(self):
self.benchmarks: dict[str, Benchmark] = {}
+ _DEFAULT_METRICS = [
+ "answer_relevancy",
+ "context_precision",
</code_context>
<issue_to_address>
**issue (bug_risk):** Guard against _DEFAULT_METRICS names drifting from METRIC_MAPPING contents
Because `_DEFAULT_METRICS` is hard-coded and then looked up in `METRIC_MAPPING`, any rename/removal in `METRIC_MAPPING` will cause a `KeyError` at runtime.
Consider either deriving `_DEFAULT_METRICS` from `METRIC_MAPPING` (e.g., validated keys at import) or resolving via `METRIC_MAPPING.get(name)` and logging/skipping unknown metrics to avoid hard failures.
</issue_to_address>
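
The second suggestion above (resolve with `.get()` and skip unknown names) could look roughly like the sketch below. The function name `resolve_default_metrics` and the locally defined `RagasEvaluationError` are stand-ins for illustration; the provider's actual `_get_metrics` signature and error class are not shown in this review.

```python
import logging

logger = logging.getLogger(__name__)


class RagasEvaluationError(Exception):
    """Local stand-in for the provider's error type."""


def resolve_default_metrics(default_names, metric_mapping):
    """Look up each default metric name, skipping names that have drifted."""
    metrics = []
    for name in default_names:
        metric = metric_mapping.get(name)  # no KeyError on a renamed metric
        if metric is None:
            logger.warning("Default metric %r not in METRIC_MAPPING; skipping", name)
            continue
        metrics.append(metric)
    if not metrics:
        # Nothing usable is left, so fail loudly instead of evaluating nothing.
        raise RagasEvaluationError("None of the default metrics are registered")
    return metrics
```

A renamed or removed entry then degrades to a warning, and only a fully invalid default list raises.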
…ack, add tests

- Guard _DEFAULT_METRICS against METRIC_MAPPING drift with .get() + warning
- Replace unconditional `return True` in is_finished with content-based check
- Add unit tests for _get_metrics (6 tests) and is_finished (8 tests)
- Add nv_accuracy (AnswerAccuracy) to benchmark scoring_functions and test_direct_evaluation

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
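
As a rough illustration of the content-based check mentioned in this commit, the sketch below treats a response as finished only if it has at least one generation and every generation has non-whitespace text. `Generation` and `LLMResult` are local stand-ins shaped like the LangChain result types, and both the helper name `has_nonempty_generations` and the "all generations must be non-empty" rule are assumptions; the handling of `llama_stack_responses` is left out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Generation:
    """Stand-in for a single generated completion."""

    text: str


@dataclass
class LLMResult:
    """Stand-in for the result object passed to is_finished()."""

    generations: list[list[Generation]]
    llm_output: Optional[dict] = None


def has_nonempty_generations(response: LLMResult) -> bool:
    """Fallback for when llm_output gives no completion signal."""
    flat = [gen for batch in response.generations for gen in batch]
    # An empty result, or any blank generation, counts as not finished.
    return bool(flat) and all(gen.text.strip() for gen in flat)
```

With this fallback, an empty or whitespace-only generation is no longer reported as finished just because `llm_output` is missing.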
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Summary
- `release/0.6.x` branch to maintain llama-stack 0.5.x support

New metrics available
- `nv_accuracy` (AnswerAccuracy)
- `nv_context_relevance` (ContextRelevance)
- `factual_correctness` (FactualCorrectness)
- `noise_sensitivity` (NoiseSensitivity)
- `nv_response_groundedness` (ResponseGroundedness)
- `context_entity_recall` (ContextEntityRecall)

Breaking changes
- `evaluate()` and `BaseRagasLLM` still work (with deprecation warnings)

Other changes
- Implement `is_finished()` on both LLM wrappers (now required by `BaseRagasLLM` in ragas 0.4.x)
- Fix test fixture metric name (`semantic_similarity` → `answer_similarity`)

Sourcery review follow-ups
- Guard `_DEFAULT_METRICS` against `METRIC_MAPPING` drift: use `.get()` + warning instead of bare key lookup; raise `RagasEvaluationError` if all defaults are invalid
- `is_finished()`: check for non-empty generation text instead of unconditionally returning `True` when `llm_output` is missing
- Add unit tests for `_get_metrics` (6 tests) and `is_finished` (8 tests)
- Add `nv_accuracy` (AnswerAccuracy) to benchmark scoring_functions and `test_direct_evaluation` to exercise a new class-based metric end-to-end

Test plan
- `uv run pre-commit run --all-files` passes (ruff, mypy, pytest)
- … (`uv run pytest tests/ --ignore=tests/test_e2e.py`)
- … (`nv_accuracy`) produces valid scores in both direct evaluation and full pipeline

🤖 Generated with Claude Code