
Upgrade to llama-stack 0.6.0 and ragas 0.4.x #64

Merged: dmaniloff merged 4 commits into trustyai-explainability:main from dmaniloff:upgrade-lls-0.6-ragas-0.4 on May 11, 2026

dmaniloff (Collaborator) commented Apr 8, 2026

Summary

  • Bump provider version to 0.7.0 targeting llama-stack >=0.6.0
  • Upgrade ragas from ==0.3.0 to >=0.4.0,<0.5.0, adding 6 new metrics
  • Create release/0.6.x branch to maintain llama-stack 0.5.x support
  • Update COMPATIBILITY.md with new release branch and version entry

New metrics available

  • nv_accuracy (AnswerAccuracy)
  • nv_context_relevance (ContextRelevance)
  • factual_correctness (FactualCorrectness)
  • noise_sensitivity (NoiseSensitivity)
  • nv_response_groundedness (ResponseGroundedness)
  • context_entity_recall (ContextEntityRecall)

Breaking changes

  • None — the llama-stack eval/scoring/benchmarks APIs are unchanged in 0.6.0
  • Ragas 0.4.x is backwards-compatible; old evaluate() and BaseRagasLLM still work (with deprecation warnings)

Other changes

  • Implement is_finished() on both LLM wrappers (now required by BaseRagasLLM in ragas 0.4.x)
  • Fix test fixture metric name (semantic_similarity → answer_similarity)
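
A minimal sketch of the stop_reason check behind is_finished(), following the reviewer's guide later in this thread (None or 'out_of_tokens' counts as unfinished). FakeLLMResult is a hypothetical stand-in for the LLMResult the real wrappers receive, each entry in llama_stack_responses is assumed to be a dict with a stop_reason key, and "end_of_turn" is only an example value. The actual wrapper code may differ.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Any, Optional

# Stop reasons treated as "not finished", per the reviewer's guide in this PR.
UNFINISHED_STOP_REASONS = {None, "out_of_tokens"}


@dataclass
class FakeLLMResult:
    """Hypothetical stand-in for the LLMResult passed to is_finished()."""

    llm_output: Optional[dict[str, Any]] = None


def is_finished(response: FakeLLMResult) -> bool:
    llm_output = response.llm_output or {}
    responses = llm_output.get("llama_stack_responses")
    if not responses:
        # No llama-stack metadata: the first revision of this PR returned True here.
        return True
    # Assumption: each entry is a dict carrying a "stop_reason" key.
    return all(r.get("stop_reason") not in UNFINISHED_STOP_REASONS for r in responses)
```

With this shape, a single truncated response in the batch is enough to mark the whole result as unfinished.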

Sourcery review follow-ups

  • Guard _DEFAULT_METRICS against METRIC_MAPPING drift — use .get() + warning instead of bare key lookup; raise RagasEvaluationError if all defaults are invalid
  • Add content-based fallback to is_finished() — check for non-empty generation text instead of unconditionally returning True when llm_output is missing
  • Add unit tests for _get_metrics (6 tests) and is_finished (8 tests)
  • Add nv_accuracy (AnswerAccuracy) to benchmark scoring_functions and test_direct_evaluation to exercise a new class-based metric end-to-end
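
The drift guard from the first bullet could look roughly like the sketch below. It is not the code merged in this PR: resolve_default_metrics is a hypothetical helper name, the mapping holds placeholder objects instead of ragas metrics, "renamed_metric" is an invented entry that simulates drift, and RagasEvaluationError is redefined locally as a stand-in for the provider's exception.

```python
import logging

logger = logging.getLogger(__name__)


class RagasEvaluationError(Exception):
    """Stand-in for the provider's error type of the same name."""


# Hypothetical registry: the real values would be ragas metric objects.
METRIC_MAPPING = {"answer_relevancy": object(), "context_precision": object()}
# "renamed_metric" is deliberately absent from the mapping to simulate drift.
DEFAULT_METRICS = ["answer_relevancy", "context_precision", "renamed_metric"]


def resolve_default_metrics(names=DEFAULT_METRICS, mapping=METRIC_MAPPING):
    resolved = []
    for name in names:
        metric = mapping.get(name)  # .get() avoids a KeyError on drift
        if metric is None:
            logger.warning("Default metric %r not in METRIC_MAPPING; skipping", name)
            continue
        resolved.append(metric)
    if not resolved:
        # Every default was invalid, so there is nothing sensible to evaluate.
        raise RagasEvaluationError("No valid default metrics available")
    return resolved
```

Skipping unknown names keeps a partial default set usable, while the final check still fails loudly when nothing survives.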

Test plan

  • uv run pre-commit run --all-files passes (ruff, mypy, pytest)
  • 22 unit/integration tests pass (uv run pytest tests/ --ignore=tests/test_e2e.py)
  • New class-based metric (nv_accuracy) produces valid scores in both direct evaluation and full pipeline
  • Integration test against llama-stack 0.6.0 distribution

🤖 Generated with Claude Code

sourcery-ai (Bot, Contributor) commented Apr 8, 2026

Reviewer's Guide

Aligns the ragas-based llama-stack provider with llama-stack 0.6.x and ragas 0.4.x by updating dependencies, centralizing metric registration (including new class-based metrics), implementing the required is_finished() method on both inline and remote LLM wrappers, and refreshing compatibility docs and tests.

Class diagram for RagasEvaluatorBase metrics selection and registry

classDiagram
    class RagasEvaluatorBase {
        +dict~str, Benchmark~ benchmarks
        +list~str~ _DEFAULT_METRICS
        +_get_metrics(scoring_functions list~str~) list
    }

    class constants {
        +str PROVIDER_TYPE
        +str PROVIDER_ID_INLINE
        +str PROVIDER_ID_REMOTE
        +list _SINGLETON_METRICS
        +list _CLASS_METRICS
        +dict METRIC_MAPPING
        +list AVAILABLE_METRICS
    }

    class Metric {
        +str name
    }

    class AnswerAccuracy {
        +str name
    }

    class ContextRelevance {
        +str name
    }

    class FactualCorrectness {
        +str name
    }

    class NoiseSensitivity {
        +str name
    }

    class ResponseGroundedness {
        +str name
    }

    constants o-- "*" Metric : _SINGLETON_METRICS
    constants o-- "*" AnswerAccuracy : _CLASS_METRICS
    constants o-- "*" ContextRelevance : _CLASS_METRICS
    constants o-- "*" FactualCorrectness : _CLASS_METRICS
    constants o-- "*" NoiseSensitivity : _CLASS_METRICS
    constants o-- "*" ResponseGroundedness : _CLASS_METRICS

    RagasEvaluatorBase ..> constants : uses_METRIC_MAPPING
    RagasEvaluatorBase ..> Metric : returns_metrics_from__get_metrics

    Metric <|-- AnswerAccuracy
    Metric <|-- ContextRelevance
    Metric <|-- FactualCorrectness
    Metric <|-- NoiseSensitivity
    Metric <|-- ResponseGroundedness

File-Level Changes

Change: Implement required is_finished() hook for inline and remote Ragas LLM wrappers using llama-stack stop_reason metadata.
Details:
  • Add concrete is_finished(response: LLMResult) implementation that inspects response.llm_output['llama_stack_responses'] stop_reason values and treats None or 'out_of_tokens' as unfinished.
  • Remove the large commented-out experimental is_finished implementation from the inline wrapper.
  • Default to returning True when llama_stack_responses metadata is absent, preserving previous behavior.
Files:
  • src/llama_stack_provider_ragas/inline/wrappers_inline.py
  • src/llama_stack_provider_ragas/remote/wrappers_remote.py

Change: Upgrade ragas integration to 0.4.x and expand metric mapping to expose new metrics.
Details:
  • Wrap ragas.metrics imports in a warnings.catch_warnings block to suppress DeprecationWarning for module-level metric instances.
  • Import new class-based metrics (AnswerAccuracy, ContextRelevance, FactualCorrectness, NoiseSensitivity, ResponseGroundedness) and additional singleton metrics (context_entity_recall).
  • Introduce _SINGLETON_METRICS and _CLASS_METRICS collections and build METRIC_MAPPING from both, enabling the new metrics by name.
  • Expose AVAILABLE_METRICS from the updated METRIC_MAPPING for downstream consumers.
Files:
  • src/llama_stack_provider_ragas/constants.py

Change: Centralize default metric selection and reuse shared METRIC_MAPPING in the evaluator base.
Details:
  • Add a _DEFAULT_METRICS list of metric names to RagasEvaluatorBase.
  • Refactor _get_metrics to use METRIC_MAPPING lookups instead of importing metrics directly from ragas.
  • Use _DEFAULT_METRICS via METRIC_MAPPING when no valid scoring_functions are provided, preserving existing default behavior while decoupling from ragas imports.
Files:
  • src/llama_stack_provider_ragas/base.py

Change: Update project and provider specs to target llama-stack 0.6.x and ragas 0.4.x.
Details:
  • Bump project version to 0.7.0 in pyproject.toml.
  • Raise llama-stack, llama-stack-api, and llama-stack-client minimum versions to 0.6.0 and relax ragas requirement to >=0.4.0,<0.5.0.
  • Update inline and remote provider get_provider_spec() definitions to install ragas>=0.4.0,<0.5.0 in their pip_packages.
  • Adjust remote Kubeflow component import from ragas.dataset_schema.EvaluationResult to ragas.evaluation.EvaluationResult to match ragas 0.4.x API.
Files:
  • pyproject.toml
  • src/llama_stack_provider_ragas/inline/provider.py
  • src/llama_stack_provider_ragas/remote/provider.py
  • src/llama_stack_provider_ragas/remote/kubeflow/components.py

Change: Refresh compatibility documentation and tests to reflect new versions and metric naming.
Details:
  • Extend COMPATIBILITY.md with a new release/0.6.x branch entry, update the main branch to target llama-stack 0.6.x+, and add version rows for 0.7.0 and 0.6.1 with guidance for 0.6.x and 0.5.x stacks.
  • Update the recommended provider versions per llama-stack version in COMPATIBILITY.md.
  • Fix test benchmark registration to use the correct answer_similarity metric name instead of semantic_similarity.
Files:
  • COMPATIBILITY.md
  • tests/conftest.py
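
As an illustration of the constants.py change described above, the sketch below builds a name-keyed registry from a singleton list and a class-based list while suppressing DeprecationWarning during loading. The Metric classes are hypothetical stand-ins for the ragas types, _load_singleton_metrics simulates the import step, and keying by each instance's name attribute with default-argument construction is an assumption, not something this PR confirms.

```python
import warnings


class Metric:
    """Hypothetical stand-in for a ragas metric; only `name` matters here."""

    def __init__(self, name):
        self.name = name


class AnswerAccuracy(Metric):
    def __init__(self):
        super().__init__("nv_accuracy")


class ContextRelevance(Metric):
    def __init__(self):
        super().__init__("nv_context_relevance")


def _load_singleton_metrics():
    # Simulates importing module-level metric instances, which the PR
    # describes as emitting a DeprecationWarning.
    warnings.warn("module-level metric instances are deprecated", DeprecationWarning)
    return [Metric("answer_relevancy"), Metric("context_entity_recall")]


with warnings.catch_warnings():
    # The filter change is undone when the block exits.
    warnings.simplefilter("ignore", DeprecationWarning)
    _SINGLETON_METRICS = _load_singleton_metrics()

# Assumption: class-based metrics are instantiated with default arguments.
_CLASS_METRICS = [AnswerAccuracy(), ContextRelevance()]

# One registry keyed by metric name, built from both collections.
METRIC_MAPPING = {m.name: m for m in [*_SINGLETON_METRICS, *_CLASS_METRICS]}
AVAILABLE_METRICS = list(METRIC_MAPPING)
```

Building the mapping from the instances' own names means a benchmark can request nv_accuracy without knowing whether it is backed by a singleton or a class.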

- Bump provider version to 0.7.0 targeting llama-stack >=0.6.0
- Upgrade ragas from ==0.3.0 to >=0.4.0,<0.5.0
- Add 6 new metrics: AnswerAccuracy, ContextRelevance, FactualCorrectness,
  NoiseSensitivity, ResponseGroundedness, context_entity_recall
- Implement is_finished() on LLM wrappers (now required by BaseRagasLLM)
- Fix test fixture metric name (semantic_similarity -> answer_similarity)
- Update COMPATIBILITY.md with release/0.6.x branch and version entries

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
dmaniloff force-pushed the upgrade-lls-0.6-ragas-0.4 branch from 6fbb8db to 65b01a7 on April 8, 2026 16:30
- Fix EvaluationResult import in kubeflow components (ragas.dataset_schema → ragas.evaluation)
- Remove stale commented-out is_finished code from inline wrappers
- Eliminate deprecation-triggering lazy imports in base._get_metrics by using METRIC_MAPPING

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
dmaniloff marked this pull request as ready for review on April 16, 2026 15:47

sourcery-ai (Bot, Contributor) left a comment

Hey - I've found 1 issue, and left some high level feedback:

  • In RagasEvaluatorBase._get_metrics, you're now referencing METRIC_MAPPING and _DEFAULT_METRICS without showing an import in this file; ensure METRIC_MAPPING is imported from constants (and that _DEFAULT_METRICS is kept in sync with it) to avoid runtime errors or mismatches.
  • The new is_finished implementations default to True when llm_output is missing or lacks llama_stack_responses; consider adding a simple content-based fallback check (e.g., non-empty generations) so obviously incomplete or empty generations are not treated as finished.
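
One possible shape for the content-based fallback in the second point, sketched with hypothetical stand-in types. FakeGeneration and FakeLLMResult only mimic the generations layout of LLMResult (a list of candidate lists, each candidate carrying text), and requiring non-blank text for every prompt, not just one, is an assumption of this sketch.

```python
from dataclasses import dataclass, field


@dataclass
class FakeGeneration:
    """Hypothetical stand-in for one generated candidate."""

    text: str = ""


@dataclass
class FakeLLMResult:
    """Hypothetical stand-in: one list of candidates per prompt."""

    generations: list = field(default_factory=list)


def has_content(response) -> bool:
    """True only if every prompt has at least one candidate with non-blank text."""
    if not response.generations:
        return False
    return all(
        any(gen.text and gen.text.strip() for gen in candidates)
        for candidates in response.generations
    )
```

A wrapper could call a check like this only when stop_reason metadata is missing, so empty output is never reported as finished.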
Prompt for AI Agents
Please address the comments from this code review:

## Overall Comments
- In `RagasEvaluatorBase._get_metrics`, you're now referencing `METRIC_MAPPING` and `_DEFAULT_METRICS` without showing an import in this file; ensure `METRIC_MAPPING` is imported from `constants` (and that `_DEFAULT_METRICS` is kept in sync with it) to avoid runtime errors or mismatches.
- The new `is_finished` implementations default to `True` when `llm_output` is missing or lacks `llama_stack_responses`; consider adding a simple content-based fallback check (e.g., non-empty generations) so obviously incomplete or empty generations are not treated as finished.

## Individual Comments

### Comment 1
<location path="src/llama_stack_provider_ragas/base.py" line_range="42-51" />
<code_context>
     def __init__(self):
         self.benchmarks: dict[str, Benchmark] = {}

+    _DEFAULT_METRICS = [
+        "answer_relevancy",
+        "context_precision",
</code_context>
<issue_to_address>
**issue (bug_risk):** Guard against _DEFAULT_METRICS names drifting from METRIC_MAPPING contents

Because `_DEFAULT_METRICS` is hard-coded and then looked up in `METRIC_MAPPING`, any rename/removal in `METRIC_MAPPING` will cause a `KeyError` at runtime.

Consider either deriving `_DEFAULT_METRICS` from `METRIC_MAPPING` (e.g., validated keys at import) or resolving via `METRIC_MAPPING.get(name)` and logging/skipping unknown metrics to avoid hard failures.
</issue_to_address>

dmaniloff and others added 2 commits May 11, 2026 10:48
…ack, add tests

- Guard _DEFAULT_METRICS against METRIC_MAPPING drift with .get() + warning
- Replace unconditional `return True` in is_finished with content-based check
- Add unit tests for _get_metrics (6 tests) and is_finished (8 tests)
- Add nv_accuracy (AnswerAccuracy) to benchmark scoring_functions and test_direct_evaluation

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
dmaniloff merged commit eb6d4bf into trustyai-explainability:main on May 11, 2026
2 of 4 checks passed