Upgrade to llama-stack 0.6.0 and ragas 0.4.x by dmaniloff · Pull Request #64 · trustyai-explainability/llama-stack-provider-ragas

dmaniloff · 2026-04-08T16:27:07Z

Summary

Bump provider version to 0.7.0 targeting llama-stack >=0.6.0
Upgrade ragas from ==0.3.0 to >=0.4.0,<0.5.0, adding 6 new metrics
Create release/0.6.x branch to maintain llama-stack 0.5.x support
Update COMPATIBILITY.md with new release branch and version entry

New metrics available

nv_accuracy (AnswerAccuracy)
nv_context_relevance (ContextRelevance)
factual_correctness (FactualCorrectness)
noise_sensitivity (NoiseSensitivity)
nv_response_groundedness (ResponseGroundedness)
context_entity_recall (ContextEntityRecall)

Breaking changes

None — the llama-stack eval/scoring/benchmarks APIs are unchanged in 0.6.0
Ragas 0.4.x is backwards-compatible; old evaluate() and BaseRagasLLM still work (with deprecation warnings)

Other changes

Implement is_finished() on both LLM wrappers (now required by BaseRagasLLM in ragas 0.4.x)
Fix test fixture metric name (semantic_similarity → answer_similarity)

Sourcery review follow-ups

Guard _DEFAULT_METRICS against METRIC_MAPPING drift — use .get() + warning instead of bare key lookup; raise RagasEvaluationError if all defaults are invalid
Add content-based fallback to is_finished() — check for non-empty generation text instead of unconditionally returning True when llm_output is missing
Add unit tests for _get_metrics (6 tests) and is_finished (8 tests)
Add nv_accuracy (AnswerAccuracy) to benchmark scoring_functions and test_direct_evaluation to exercise a new class-based metric end-to-end
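
As an illustration of the first follow-up, here is a minimal sketch of a guarded lookup. It is not the provider's code: the registry contents, the `renamed_metric` entry, and the helper `_resolve` are hypothetical, `get_metrics` is written as a free function rather than the `_get_metrics` method, and `RagasEvaluationError` is redefined locally as a stand-in so the snippet runs on its own.

```python
import logging

logger = logging.getLogger(__name__)


class RagasEvaluationError(Exception):
    """Local stand-in for the provider's error type named in the PR."""


# Hypothetical registry; the real METRIC_MAPPING maps names to ragas metrics.
METRIC_MAPPING = {
    "answer_relevancy": object(),
    "context_precision": object(),
    "faithfulness": object(),
}

# Hypothetical defaults; "renamed_metric" simulates drift from the registry.
_DEFAULT_METRICS = ["answer_relevancy", "context_precision", "renamed_metric"]


def _resolve(names):
    """Look up each name with .get(), warning on and skipping unknown names."""
    resolved = []
    for name in names:
        metric = METRIC_MAPPING.get(name)
        if metric is None:
            logger.warning("Unknown metric %r, skipping", name)
            continue
        resolved.append(metric)
    return resolved


def get_metrics(scoring_functions):
    """Return the requested metrics, falling back to the defaults."""
    metrics = _resolve(scoring_functions)
    if metrics:
        return metrics
    defaults = _resolve(_DEFAULT_METRICS)
    if not defaults:
        # Every default has drifted out of the registry: fail loudly.
        raise RagasEvaluationError("No default metric is present in METRIC_MAPPING")
    return defaults
```

With this shape, a single stale default produces a warning and is skipped, while an entirely stale default list raises instead of silently evaluating nothing.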

Test plan

uv run pre-commit run --all-files passes (ruff, mypy, pytest)
22 unit/integration tests pass (uv run pytest tests/ --ignore=tests/test_e2e.py)
New class-based metric (nv_accuracy) produces valid scores in both direct evaluation and full pipeline
Integration test against llama-stack 0.6.0 distribution

🤖 Generated with Claude Code

sourcery-ai · 2026-04-08T16:28:15Z

Reviewer's Guide

Aligns the ragas-based llama-stack provider with llama-stack 0.6.x and ragas 0.4.x by updating dependencies, centralizing metric registration (including new class-based metrics), implementing the required is_finished() method on both inline and remote LLM wrappers, and refreshing compatibility docs and tests.

Class diagram for RagasEvaluatorBase metrics selection and registry

classDiagram
    class RagasEvaluatorBase {
        +dict~str, Benchmark~ benchmarks
        +list~str~ _DEFAULT_METRICS
        +_get_metrics(scoring_functions list~str~) list
    }

    class constants {
        +str PROVIDER_TYPE
        +str PROVIDER_ID_INLINE
        +str PROVIDER_ID_REMOTE
        +list _SINGLETON_METRICS
        +list _CLASS_METRICS
        +dict METRIC_MAPPING
        +list AVAILABLE_METRICS
    }

    class Metric {
        +str name
    }

    class AnswerAccuracy {
        +str name
    }

    class ContextRelevance {
        +str name
    }

    class FactualCorrectness {
        +str name
    }

    class NoiseSensitivity {
        +str name
    }

    class ResponseGroundedness {
        +str name
    }

    constants o-- "*" Metric : _SINGLETON_METRICS
    constants o-- "*" AnswerAccuracy : _CLASS_METRICS
    constants o-- "*" ContextRelevance : _CLASS_METRICS
    constants o-- "*" FactualCorrectness : _CLASS_METRICS
    constants o-- "*" NoiseSensitivity : _CLASS_METRICS
    constants o-- "*" ResponseGroundedness : _CLASS_METRICS

    RagasEvaluatorBase ..> constants : uses_METRIC_MAPPING
    RagasEvaluatorBase ..> Metric : returns_metrics_from__get_metrics

    Metric <|-- AnswerAccuracy
    Metric <|-- ContextRelevance
    Metric <|-- FactualCorrectness
    Metric <|-- NoiseSensitivity
    Metric <|-- ResponseGroundedness

File-Level Changes

| Change | Details | Files |
| --- | --- | --- |
| Implement required is_finished() hook for inline and remote Ragas LLM wrappers using llama-stack stop_reason metadata. | Add concrete is_finished(response: LLMResult) implementation that inspects response.llm_output['llama_stack_responses'] stop_reason values and treats None or 'out_of_tokens' as unfinished. Remove the large commented-out experimental is_finished implementation from the inline wrapper. Default to returning True when llama_stack_responses metadata is absent, preserving previous behavior. | `src/llama_stack_provider_ragas/inline/wrappers_inline.py` `src/llama_stack_provider_ragas/remote/wrappers_remote.py` |
| Upgrade ragas integration to 0.4.x and expand metric mapping to expose new metrics. | Wrap ragas.metrics imports in a warnings.catch_warnings block to suppress DeprecationWarning for module-level metric instances. Import new class-based metrics (AnswerAccuracy, ContextRelevance, FactualCorrectness, NoiseSensitivity, ResponseGroundedness) and additional singleton metrics (context_entity_recall). Introduce _SINGLETON_METRICS and _CLASS_METRICS collections and build METRIC_MAPPING from both, enabling the new metrics by name. Expose AVAILABLE_METRICS from the updated METRIC_MAPPING for downstream consumers. | `src/llama_stack_provider_ragas/constants.py` |
| Centralize default metric selection and reuse shared METRIC_MAPPING in the evaluator base. | Add a _DEFAULT_METRICS list of metric names to RagasEvaluatorBase. Refactor _get_metrics to use METRIC_MAPPING lookups instead of importing metrics directly from ragas. Use _DEFAULT_METRICS via METRIC_MAPPING when no valid scoring_functions are provided, preserving existing default behavior while decoupling from ragas imports. | `src/llama_stack_provider_ragas/base.py` |
| Update project and provider specs to target llama-stack 0.6.x and ragas 0.4.x. | Bump project version to 0.7.0 in pyproject.toml. Raise llama-stack, llama-stack-api, and llama-stack-client minimum versions to 0.6.0 and relax ragas requirement to >=0.4.0,<0.5.0. Update inline and remote provider get_provider_spec() definitions to install ragas>=0.4.0,<0.5.0 in their pip_packages. Adjust remote Kubeflow component import from ragas.dataset_schema.EvaluationResult to ragas.evaluation.EvaluationResult to match ragas 0.4.x API. | `pyproject.toml` `src/llama_stack_provider_ragas/inline/provider.py` `src/llama_stack_provider_ragas/remote/provider.py` `src/llama_stack_provider_ragas/remote/kubeflow/components.py` |
| Refresh compatibility documentation and tests to reflect new versions and metric naming. | Extend COMPATIBILITY.md with a new release/0.6.x branch entry, update the main branch to target llama-stack 0.6.x+, and add version rows for 0.7.0 and 0.6.1 with guidance for 0.6.x and 0.5.x stacks. Update the recommended provider versions per llama-stack version in COMPATIBILITY.md. Fix test benchmark registration to use the correct answer_similarity metric name instead of semantic_similarity. | `COMPATIBILITY.md` `tests/conftest.py` |
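
To make the registry change in `constants.py` concrete, the following is a rough sketch of building one name-to-metric mapping from a list of singleton instances and a list of class-based metrics. The `Metric` classes here are local stand-ins rather than ragas imports, only a subset of the metrics is shown, and instantiating the class-based metrics at registration time is an assumption; the metric names are the ones listed in the PR summary.

```python
class Metric:
    """Stand-in for a ragas metric; only the name attribute matters here."""

    name = "metric"


class AnswerAccuracy(Metric):
    name = "nv_accuracy"


class ContextRelevance(Metric):
    name = "nv_context_relevance"


class _AnswerRelevancy(Metric):
    name = "answer_relevancy"


class _ContextEntityRecall(Metric):
    name = "context_entity_recall"


# Stand-ins for the module-level metric instances exported by ragas.metrics.
answer_relevancy = _AnswerRelevancy()
context_entity_recall = _ContextEntityRecall()

# Singletons are already instances; class-based metrics are instantiated
# here (an assumption about how the provider registers them).
_SINGLETON_METRICS = [answer_relevancy, context_entity_recall]
_CLASS_METRICS = [AnswerAccuracy(), ContextRelevance()]

# One mapping keyed by metric name, built from both collections.
METRIC_MAPPING = {m.name: m for m in [*_SINGLETON_METRICS, *_CLASS_METRICS]}
AVAILABLE_METRICS = list(METRIC_MAPPING.keys())
```

Keying the mapping on each metric's own `name` keeps the public identifiers (for example `nv_accuracy`) in one place, so callers never import metric objects directly.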

- Bump provider version to 0.7.0 targeting llama-stack >=0.6.0
- Upgrade ragas from ==0.3.0 to >=0.4.0,<0.5.0
- Add 6 new metrics: AnswerAccuracy, ContextRelevance, FactualCorrectness, NoiseSensitivity, ResponseGroundedness, context_entity_recall
- Implement is_finished() on LLM wrappers (now required by BaseRagasLLM)
- Fix test fixture metric name (semantic_similarity -> answer_similarity)
- Update COMPATIBILITY.md with release/0.6.x branch and version entries

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

- Fix EvaluationResult import in kubeflow components (ragas.dataset_schema → ragas.evaluation)
- Remove stale commented-out is_finished code from inline wrappers
- Eliminate deprecation-triggering lazy imports in base._get_metrics by using METRIC_MAPPING

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

sourcery-ai

Hey - I've found 1 issue, and left some high level feedback:

In RagasEvaluatorBase._get_metrics, you're now referencing METRIC_MAPPING and _DEFAULT_METRICS without showing an import in this file; ensure METRIC_MAPPING is imported from constants (and that _DEFAULT_METRICS is kept in sync with it) to avoid runtime errors or mismatches.
The new is_finished implementations default to True when llm_output is missing or lacks llama_stack_responses; consider adding a simple content-based fallback check (e.g., non-empty generations) so obviously incomplete or empty generations are not treated as finished.
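
The fallback suggested in the second point could look roughly like the sketch below. It is an illustration only: `Generation` and `LLMResult` are local stand-ins that mirror the fields used here (`generations`, `llm_output`, `text`), and representing each entry of `llama_stack_responses` as a dict with a `stop_reason` key is an assumption about the metadata shape.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Generation:
    """Stand-in for a single generation carrying its text."""

    text: str


@dataclass
class LLMResult:
    """Stand-in for the result object passed to is_finished()."""

    generations: list
    llm_output: Optional[dict] = None


# Stop reasons treated as "not finished", as described in the reviewer's guide.
_UNFINISHED_STOP_REASONS = {None, "out_of_tokens"}


def is_finished(response: LLMResult) -> bool:
    """Prefer stop_reason metadata; otherwise fall back to a content check."""
    responses = (response.llm_output or {}).get("llama_stack_responses")
    if responses:
        # Assumed shape: a list of dicts, each with a "stop_reason" key.
        return all(
            r.get("stop_reason") not in _UNFINISHED_STOP_REASONS
            for r in responses
        )
    # No metadata available: require at least one generation, none of them blank.
    texts = [g.text for batch in response.generations for g in batch]
    return bool(texts) and all(t.strip() for t in texts)
```

The content check is deliberately weak: it cannot detect truncation, only rule out empty or whitespace-only output when no stop reason is available.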

Prompt for AI Agents

Please address the comments from this code review:

## Overall Comments
- In `RagasEvaluatorBase._get_metrics`, you're now referencing `METRIC_MAPPING` and `_DEFAULT_METRICS` without showing an import in this file; ensure `METRIC_MAPPING` is imported from `constants` (and that `_DEFAULT_METRICS` is kept in sync with it) to avoid runtime errors or mismatches.
- The new `is_finished` implementations default to `True` when `llm_output` is missing or lacks `llama_stack_responses`; consider adding a simple content-based fallback check (e.g., non-empty generations) so obviously incomplete or empty generations are not treated as finished.

## Individual Comments

### Comment 1
<location path="src/llama_stack_provider_ragas/base.py" line_range="42-51" />
<code_context>
     def __init__(self):
         self.benchmarks: dict[str, Benchmark] = {}

+    _DEFAULT_METRICS = [
+        "answer_relevancy",
+        "context_precision",
</code_context>
<issue_to_address>
**issue (bug_risk):** Guard against _DEFAULT_METRICS names drifting from METRIC_MAPPING contents

Because `_DEFAULT_METRICS` is hard-coded and then looked up in `METRIC_MAPPING`, any rename/removal in `METRIC_MAPPING` will cause a `KeyError` at runtime.

Consider either deriving `_DEFAULT_METRICS` from `METRIC_MAPPING` (e.g., validated keys at import) or resolving via `METRIC_MAPPING.get(name)` and logging/skipping unknown metrics to avoid hard failures.
</issue_to_address>

…ack, add tests

- Guard _DEFAULT_METRICS against METRIC_MAPPING drift with .get() + warning
- Replace unconditional `return True` in is_finished with content-based check
- Add unit tests for _get_metrics (6 tests) and is_finished (8 tests)
- Add nv_accuracy (AnswerAccuracy) to benchmark scoring_functions and test_direct_evaluation

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

dmaniloff force-pushed the upgrade-lls-0.6-ragas-0.4 branch from 6fbb8db to 65b01a7 Compare April 8, 2026 16:30

dmaniloff marked this pull request as ready for review April 16, 2026 15:47

sourcery-ai Bot reviewed Apr 16, 2026

Comment thread src/llama_stack_provider_ragas/base.py

dmaniloff and others added 2 commits May 11, 2026 10:48

Remove deprecation warning suppression for ragas.metrics imports

a433bea

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
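
For context, the reviewer's guide describes the suppression removed by this commit as a `warnings.catch_warnings` block around the `ragas.metrics` imports. A generic sketch of that pattern follows; `load_legacy_metric` is a hypothetical stand-in for an import that emits a `DeprecationWarning`, not a ragas function.

```python
import warnings


def load_legacy_metric():
    """Hypothetical stand-in for an import that emits a DeprecationWarning."""
    warnings.warn(
        "module-level metric instances are deprecated",
        DeprecationWarning,
        stacklevel=2,
    )
    return "answer_relevancy"


def load_quietly():
    """Call the stand-in with DeprecationWarning suppressed for this call only."""
    with warnings.catch_warnings():
        # The filter change is undone when the with-block exits.
        warnings.simplefilter("ignore", DeprecationWarning)
        return load_legacy_metric()
```

Dropping such a block makes the deprecation warnings visible again, which can be preferable when they are the only signal that an import path is going away.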

dmaniloff merged commit eb6d4bf into trustyai-explainability:main May 11, 2026
2 of 4 checks passed

dmaniloff merged 4 commits into trustyai-explainability:main from dmaniloff:upgrade-lls-0.6-ragas-0.4
