test: agent skills infrastructure and marker taxonomy audit (#727, #728) by planetf1 · Pull Request #742 · generative-computing/mellea

planetf1 · 2026-03-25T10:18:14Z

Marker Taxonomy & Agent Skills

Type of PR

Bug Fix
New Feature
Documentation
Other

Description

Introduces a four-tier test marker taxonomy (unit / integration / e2e / qualitative), an agent skill to audit and fix markers, and applies the resulting reclassifications across the test suite. Also removes the legacy --isolate-heavy process isolation mechanism (superseded by cleanup_gpu_backend() from #721).

How we define the tiers

Tier	What it tests	Backend marker?
`unit`	Pure logic, no I/O, all external boundaries mocked — auto-applied by conftest	No
`integration`	Real third-party SDK wired in, or multiple components together; controls its own deps	No
`e2e`	Real backend (Ollama, cloud API, GPU-loaded model) — always paired with a backend marker	Yes
`qualitative`	Subset of e2e; assertions are on non-deterministic LLM output content	Yes

unit is never written explicitly — conftest applies it automatically to any test that carries none of the other three.
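As a rough illustration of the table above, a test's tier can be derived from the set of marker names it carries. This is a sketch, not the conftest implementation: `resolve_tier` and `check_markers` are hypothetical helpers, and `BACKENDS` lists only the two backend markers named in this description. It also treats the deprecated `llm` marker as a synonym for `e2e`, as described under pytest infrastructure changes.

```python
# Illustrative only: hypothetical helpers, not code taken from conftest.py.

# Only the backend markers named in this PR description; the real registry has more.
BACKENDS = {"ollama", "bedrock"}


def resolve_tier(markers: set[str]) -> str:
    """Return the tier implied by the marker names on a test."""
    if "qualitative" in markers:
        return "qualitative"  # subset of e2e
    if "e2e" in markers or "llm" in markers:  # llm is a deprecated synonym for e2e
        return "e2e"
    if "integration" in markers:
        return "integration"
    return "unit"  # no explicit tier marker: unit is applied automatically


def check_markers(markers: set[str]) -> list[str]:
    """Return a list of problems with a marker set (empty when it is consistent)."""
    problems = []
    tier = resolve_tier(markers)
    # e2e and qualitative tests are always paired with a backend marker.
    if tier in ("e2e", "qualitative") and not markers & BACKENDS:
        problems.append(f"{tier} test has no backend marker")
    return problems
```

For example, `check_markers({"e2e"})` reports a missing backend marker, while `check_markers({"e2e", "ollama"})` returns an empty list.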

New agent skills (`.agents/skills/`)

Two skills following the agentskills.io standard, discoverable by Claude Code, VS Code/Copilot, and IBM Bob:

/audit-markers — classifies any test as unit/integration/e2e/qualitative using signal detection (imports, fixtures, assertion patterns, decorator shapes). Traces model identifiers to estimate min_vram_gb from parameter counts. Report-first by default; --apply skips confirmation.
/skill-author — meta-skill for creating new skills with correct frontmatter and structure.

pytest infrastructure changes

BACKEND_MARKERS registry in conftest.py — single source of truth for all 7 backend markers; pytest_configure registers them automatically. New backends need one dict entry.
unit auto-apply hook — pytest_collection_modifyitems applies unit to any collected test that has none of integration, e2e, qualitative, llm. Enables pytest -m unit.
Removed --isolate-heavy and all associated code (_run_heavy_modules_isolated(), pytest_collection_finish(), require_gpu_isolation()). The cleanup_gpu_backend() helper from #721 (ci: memory management in tests) handles GPU memory teardown; --group-by-backend handles ordering.
torch_dtype="auto" on model load — LocalHFBackend.from_pretrained now passes torch_dtype="auto" to AutoModelForCausalLM.from_pretrained, preventing silent float32 upcasting on CPU during model load. On MPS/CUDA this halves memory use for bfloat16/float16 models.
MPS VRAM detection — _gpu_vram_gb() on Apple Silicon now uses sysctl hw.memsize with a conservative heuristic (min(total * 0.75, total - 16 GB)) instead of returning 0 — leaves headroom for OS and desktop apps.
get_system_capabilities() cached — avoids repeated torch/MPS calls during collection.
--ignore-gpu-check, --ignore-ollama-check, --ignore-api-key-check, --ignore-all-checks removed — unused escape hatches; skips are now unconditional when a capability is missing.
require_ollama() removed — redundant with the ollama backend marker + conftest auto-skip.
llm marker deprecated — treated as synonym for e2e for backwards compat; 0 remaining uses in test/ or docs/examples/.
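The registry and the auto-apply hook could look roughly like the fragment below. It is a minimal sketch under stated assumptions: the shape of `BACKEND_MARKERS` (name mapped to a description string), the description texts, and `TIER_MARKERS` are guesses for illustration, and only two of the seven backend markers are shown. The pytest calls used (`config.addinivalue_line`, `item.get_closest_marker`, `item.add_marker`) are standard pytest APIs.

```python
# Sketch of a conftest.py fragment. The dict shape and descriptions are assumptions.

# Single registry of backend markers: adding a backend means adding one entry.
BACKEND_MARKERS = {
    "ollama": "requires a running Ollama server",
    "bedrock": "requires AWS Bedrock credentials",
}

# Markers that count as an explicit tier (llm is the deprecated synonym for e2e).
TIER_MARKERS = ("integration", "e2e", "qualitative", "llm")


def pytest_configure(config):
    """Register every backend marker from the registry with pytest."""
    for name, description in BACKEND_MARKERS.items():
        config.addinivalue_line("markers", f"{name}: {description}")


def pytest_collection_modifyitems(config, items):
    """Apply `unit` to any collected test that carries no explicit tier marker."""
    for item in items:
        if not any(item.get_closest_marker(name) for name in TIER_MARKERS):
            item.add_marker("unit")
```

With a hook of this kind in place, `pytest -m unit` selects exactly the tests that nobody marked with a higher tier.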

Test reclassifications

All changes are marker-only — no test logic was modified.

New integration tests (were unmarked/unit):

File	Reason
`test/cli/test_alora_train.py`	Wires tokenizer→model→peft→dataset→trainer; mocks only file I/O
`test/telemetry/test_metrics.py`	Real OTel `InMemoryMetricReader` — asserts SDK attribute names
`test/telemetry/test_tracing.py`	Real OTel `InMemorySpanExporter` — asserts span structure
`test/telemetry/test_metrics_token.py`	Same — asserts gen_ai semantic convention attributes
`test/telemetry/test_metrics_plugins.py`	Same
`test/package/test_dependency_isolation.py`	Spawns isolated `uv` subprocesses — controls its own deps
Various `test/plugins/`, `test/core/`, `test/stdlib/`	Components wired together with mocked backends

e2e marker additions/corrections:

File	Change
`test/backends/test_bedrock.py`	Added missing `bedrock` backend marker; registered in conftest/pyproject
`test/telemetry/test_metrics_backend.py`	Added `e2e` (had backend markers but no tier)
`test/formatters/granite/test_intrinsics_formatters.py`	Replaced deprecated `llm`/`requires_gpu`/`requires_heavy_ram`/`requires_gpu_isolation` with `e2e` + `require_gpu(min_vram_gb=12)`

VRAM gates updated (8 GB → 12 GB) in test_guardian.py, test_core.py, test_rag.py, and test_spans.py. The /audit-markers skill estimates min_vram_gb by tracing model identifiers to parameter counts; test authors can override the estimate directly on the require_gpu() call.
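To show how a gate interacts with detected memory on Apple Silicon, here is a worked sketch of the MPS heuristic listed under pytest infrastructure changes, min(total * 0.75, total - 16 GB), compared against a 12 GB gate. `usable_vram_gb` and `gate_passes` are hypothetical names, and clamping negative results to 0 is an assumption made for this example rather than documented behaviour.

```python
def usable_vram_gb(total_gb: float) -> float:
    """Apply min(total * 0.75, total - 16), clamped at 0 (the clamp is an assumption)."""
    return max(0.0, min(total_gb * 0.75, total_gb - 16))


def gate_passes(total_gb: float, min_vram_gb: float) -> bool:
    """True when the estimated usable memory meets a test's min_vram_gb gate."""
    return usable_vram_gb(total_gb) >= min_vram_gb


# 24 GB machine: min(18, 8) = 8 GB usable, below a 12 GB gate.
# 64 GB machine: min(48, 48) = 48 GB usable, above a 12 GB gate.
for total in (16, 24, 64):
    print(total, usable_vram_gb(total), gate_passes(total, 12))
```

The `total - 16` term dominates on smaller machines, which is what leaves headroom for the OS and desktop apps.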

Docs updated

test/MARKERS_GUIDE.md — full rewrite with tier definitions, backend marker table, resource predicate reference, auto-skip logic, and common patterns
test/README.md — updated env var table; added OLLAMA_KEEP_ALIVE=1m tip for unordered runs
AGENTS.md / CONTRIBUTING.md — removed --isolate-heavy references; added skills discovery table

Local test run (Mac M1, 32 GB)

Full run (uv run pytest): 800 passed, 2 failed, 61 skipped, 19 deselected in 17m23s.

The 2 failures are @pytest.mark.qualitative tests (test_find_context_attributions, test_hallucination_detection) — non-deterministic content assertions that can vary between runs; not related to this PR.

The 19 deselected are slow tests excluded by default (-m "not slow" in addopts).

Skips breakdown (61 total — all expected):

Reason	Count	Tests
Insufficient VRAM (< 12 GB gate)	~20	`test_huggingface.py`, `test_alora_train_integration.py`, `test_richdocument.py`
Missing API credentials	~10	`test_watsonx.py`, `test_litellm_watsonx.py`, `test_bedrock.py`, `test_watsonx_token_metrics`
vLLM process not running	6	`test_openai_vllm.py`, `test_vllm_tools.py`
`test_tracing_backend.py` — telemetry not initialised	6	Known issue — see #754
`test_manager.py` — requires `--disable-default-mellea-plugins` flag	2	Intentional design
`test_reqlib_python.py` sandbox tests	3	Sandbox not available in this environment
Other (cpex not installed, qualitative skip)	~14	Various

Slow tests run explicitly (uv run pytest -m slow):

Test	Result	Time
`test_dependency_isolation.py`	✅ 18 passed, 1 skipped (vLLM not installed)	3m07s
`generative_gsm8k.py`	✅ passed	12m14s
`mini_researcher/researcher.py`	✅ passed	1m23s
`python_decompose_example.py`	❌ KeyError in pipeline — see #755	5m31s

Issues raised during testing

#754, fix: test_tracing_backend.py tests always skip (Telemetry not initialized). Root cause: _tracer_provider is set at module import time, so MonkeyPatch.setenv has no effect. Flagged for @ajbozarth.
#755, fix: decompose pipeline KeyError when constraint strings don't exactly match between stages. python_decompose_example.py raises KeyError in finalize_result because constraint strings from two separate model calls don't match exactly. Flagged for @AngeloDanducci.
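The mechanism behind #754 can be reproduced in isolation. The sketch below is not mellea code: the environment variable name and both functions are hypothetical, and it only shows why a value captured at import time cannot see an environment variable that a test sets later.

```python
import os

# Hypothetical variable name, used only for this example.
ENV_VAR = "EXAMPLE_ENABLE_TRACING"

# Start from a clean state so the example is deterministic.
os.environ.pop(ENV_VAR, None)

# "Import time": evaluated once, when the module body runs.
_tracing_enabled_at_import = os.environ.get(ENV_VAR) == "1"


def tracing_enabled_eager() -> bool:
    """Return the value captured at import; later env changes are invisible."""
    return _tracing_enabled_at_import


def tracing_enabled_lazy() -> bool:
    """Read the environment on every call, so test-time overrides take effect."""
    return os.environ.get(ENV_VAR) == "1"


# Stand-in for monkeypatch.setenv(ENV_VAR, "1") inside a test, after import.
os.environ[ENV_VAR] = "1"

print(tracing_enabled_eager())  # still reflects the import-time environment
print(tracing_enabled_lazy())   # reflects the current environment
```

A test that depends on the eager value would have to set the variable before the module is first imported, or the module would need a lazy read like the second function.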

Testing

Marker changes verified with pytest --collect-only — collection unchanged
Full local run completed (see results above)
Slow tests run locally (see table above)
ruff format and ruff check pass
codespell and markdownlint pass

Cluster test run (IBM BLUEVELA LSF, Linux / Python 3.12.13, p-series GPU node)

Full run using test/scripts/run-all (starts Ollama, pulls models, warms up, then pytest --group-by-backend):
832 passed, 1 failed, 37 skipped, 19 deselected, 2 xfailed, 1 xpassed in 45m18s (job 737802).

The 1 failure is test_find_context_attributions (@pytest.mark.qualitative) — same non-deterministic content assertion flake as seen in local runs; not related to this PR.

The 1 xpassed is a bonus: a test marked xfail that unexpectedly passed.

Skips breakdown (37 total — all expected):

Reason	Approx count
Missing API credentials (Watsonx, Bedrock, OpenAI)	~20
OpenTelemetry backend tests (`skipif not OTEL_AVAILABLE`)	~13
Explicit `@pytest.mark.skip` (`test_richdocument` — memory)	1
Other (sandbox, cpex)	~3

Compared to a run without the startup script (job 737413): skips dropped from 142 → 37 once Ollama was running and models were warmed up.

Test run summary across environments

Run	Passed	Failed	Skipped	Deselected	Notes
Local (Mac M1, `uv run pytest`)	800	2	61	19	2 qualitative flakes
LSF bare (`uv run pytest`, no services)	728	1	142	19	Ollama not running
LSF via `test/scripts/run-all`	832	1	37	19	Ollama + models warm, vLLM available

19 deselected = slow tests excluded by default in pyproject.toml across all runs.
Skip reduction (142 → 37) is ~95 Ollama-dependent tests that become runnable once the startup script brings services up.
The LSF script run passes ~30 more tests than local because vLLM is available on the cluster.

mergify · 2026-03-25T10:18:50Z

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert|release)(?:\(.+\))?:
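The rule can be checked locally before pushing. This is a small sketch that assumes the pattern behaves the same under Python's `re` module as it does in the merge-protection engine; `is_conventional` is a hypothetical helper name.

```python
import re

# Pattern copied from the merge-protection rule above.
TITLE_RE = re.compile(
    r"^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert|release)(?:\(.+\))?:"
)


def is_conventional(title: str) -> bool:
    """True when the title starts with a conventional-commit type and a colon."""
    return TITLE_RE.search(title) is not None


print(is_conventional("test: agent skills infrastructure and marker taxonomy audit (#727, #728)"))
print(is_conventional("Update README"))
```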

github-actions · 2026-03-25T12:03:16Z

The PR description has been updated. Please fill out the template for your PR to be reviewed.

planetf1 · 2026-03-25T14:05:23Z

After completing the skill definitions, I ran the skill once (results below) and then applied the recommended fixes.

Still to test but sharing to give an idea of what the skill does:

planetf1 · 2026-03-25T14:06:57Z

So this now tries to assess the vram needs by looking at models/hf

I'll experiment with running the tests to see how accurate the agent is in general, and manually review the assessments.

planetf1 · 2026-03-27T09:25:18Z

The mypy failure (Name "torch.Tensor" is not defined in test/backends/test_huggingface.py:362) is pre-existing on upstream main — not introduced by this PR. Can be verified with git show upstream/main:test/backends/test_huggingface.py | grep -n 'torch.Tensor'.

planetf1 · 2026-03-27T12:45:39Z

Updated top post with current summary. Ready for review

planetf1 · 2026-03-27T15:04:21Z

@ajbozarth you asked about being able to run a test bypassing the gpu check. Without any code changes this is possible by using pytest to run the test directly. I'm thinking this is sufficient?

ajbozarth · 2026-03-27T15:12:25Z

> @ajbozarth you asked about being able to run a test bypassing the gpu check. Without any code changes this is possible by using pytest to run the test directly. I'm thinking this is sufficient?

That's fair, but I think it'd still be worth having a flag to disable the part of conftest that limits based on detected hardware. I wouldn't call it a blocker for this PR though; it could be a follow-up issue.

As for review, I'll do a deep dive into this this afternoon and re-run all the tests myself for a "second opinion".

planetf1 · 2026-03-27T18:58:36Z

> That's fair, but I think it'd still be worth having a flag to disable the part of conftest that limits based on detected hardware. I wouldn't call it a blocker for this PR though; it could be a follow-up issue.

Ok - my thought is to just have a generic flag like --skip-resource-checks. Those are the only ones I think we can have -- the api checks wouldn't make sense?
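A generic flag of that kind might be wired up roughly as follows. This is a hypothetical sketch of the proposal, not code from this PR: `should_skip_for_vram` is an invented helper, and the flag does not exist yet. `parser.addoption` and `config.getoption` are standard pytest APIs.

```python
def pytest_addoption(parser):
    """Register the proposed (hypothetical) --skip-resource-checks flag."""
    parser.addoption(
        "--skip-resource-checks",
        action="store_true",
        default=False,
        help="Run tests even when detected hardware is below their requirements.",
    )


def should_skip_for_vram(config, detected_gb: float, min_vram_gb: float) -> bool:
    """Decide whether a VRAM-gated test is skipped; the flag disables the gate."""
    if config.getoption("--skip-resource-checks"):
        return False
    return detected_gb < min_vram_gb
```

API-key and service checks are left out on purpose, since bypassing them would only turn a skip into a failure.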

Can you raise the followup?

> As for review, I'll do a deep dive into this this afternoon and re-run all the tests myself for a "second opinion".

Thanks!

ajbozarth · 2026-03-27T20:33:39Z

> Can you raise the followup?

#758

ajbozarth

I've done an in-depth review including:

an actual read-through of the skill markdown -> LGTM
double check example mark updates -> LGTM
review updates to tests:
- mark updates and other fixes -> LGTM
- a few minor typos in importorskip reasons -> inline suggested changes
conftest updates -> LGTM
helper functions in predicates -> LGTM

I'll apply the typo fixes myself, otherwise my other comments are non-blocking

I've also run all the tests and included the results below:

Test run summary

Local run (uv run pytest, Mac M1 Max 32GB, Python 3.12.8): 800 passed, 2 failed, 61 skipped, 19 deselected, 2 xfailed, 1 xpassed in 32m05s.

The 2 failures are @pytest.mark.qualitative tests (test_find_context_attributions, test_hallucination_detection) with non-deterministic content assertions.

The 19 deselected are slow tests excluded by default.

Skips breakdown (61 total — all expected):

Reason	Count
Insufficient VRAM	~23
Missing API credentials	~16
vLLM process not running	7
`test_tracing_backend.py` — telemetry not initialised (see #754)	6
`test_manager.py` — requires `--disable-default-mellea-plugins`	2
`test_reqlib_python.py` sandbox tests	3
Other	~4

Terminal output

$ uv run pytest
      Built mellea @ file:///Users/ajbozarth/workspace/ai/mellea
Uninstalled 1 package in 1ms
Installed 3 packages in 3ms
=========================================================================================================== test session starts ============================================================================================================
platform darwin -- Python 3.12.8, pytest-9.0.0, pluggy-1.6.0
rootdir: /Users/ajbozarth/workspace/ai/mellea
configfile: pyproject.toml
testpaths: test, docs
plugins: nbmake-1.5.5, recording-0.13.4, anyio-4.11.0, xdist-3.8.0, json-report-1.5.0, timeout-2.4.0, metadata-3.1.1, asyncio-1.3.0, langsmith-0.6.6, Faker-37.12.0, cov-7.0.0
timeout: 900.0s
timeout method: signal
timeout func_only: False
asyncio: mode=Mode.AUTO, debug=False, asyncio_default_fixture_loop_scope=None, asyncio_default_test_loop_scope=function
collected 883 items / 19 deselected / 2 skipped / 864 selected                                                                                                                                                                             

test/backends/test_adapters/test_adapter.py .                                                                                                                                                                                        [  0%]
test/backends/test_bedrock.py s                                                                                                                                                                                                      [  0%]
test/backends/test_huggingface.py sssssssssssssssssss                                                                                                                                                                                [  2%]
test/backends/test_huggingface_tools.py s                                                                                                                                                                                            [  2%]
test/backends/test_litellm_ollama.py ........                                                                                                                                                                                        [  3%]
test/backends/test_litellm_watsonx.py ssss                                                                                                                                                                                           [  3%]
test/backends/test_mellea_tool.py .......                                                                                                                                                                                            [  4%]
test/backends/test_model_options.py .....                                                                                                                                                                                            [  5%]
test/backends/test_ollama.py .....X....                                                                                                                                                                                              [  6%]
test/backends/test_openai_ollama.py .............                                                                                                                                                                                    [  7%]
test/backends/test_openai_vllm.py sssssss                                                                                                                                                                                            [  8%]
test/backends/test_tool_calls.py ...                                                                                                                                                                                                 [  9%]
test/backends/test_tool_decorator.py ...................                                                                                                                                                                             [ 11%]
test/backends/test_tool_helpers.py ...                                                                                                                                                                                               [ 11%]
test/backends/test_tool_validation_integration.py .................................                                                                                                                                                  [ 15%]
test/backends/test_vision_ollama.py ....                                                                                                                                                                                             [ 15%]
test/backends/test_vision_openai.py ....                                                                                                                                                                                             [ 16%]
test/backends/test_watsonx.py sssssssssss                                                                                                                                                                                            [ 17%]
test/cli/test_alora_train.py ....                                                                                                                                                                                                    [ 18%]
test/cli/test_alora_train_integration.py ss                                                                                                                                                                                          [ 18%]
test/core/test_astream_exception_propagation.py .....                                                                                                                                                                                [ 18%]
test/core/test_astream_incremental.py ......                                                                                                                                                                                         [ 19%]
test/core/test_astream_mock.py ......                                                                                                                                                                                                [ 20%]
test/core/test_base.py ....                                                                                                                                                                                                          [ 20%]
test/core/test_component_typing.py ........                                                                                                                                                                                          [ 21%]
test/core/test_model_output_thunk.py ..                                                                                                                                                                                              [ 21%]
test/decompose/test_decompose.py ..........                                                                                                                                                                                          [ 23%]
test/formatters/granite/test_intrinsics_formatters.py ........................................................x..................                                                                                                    [ 31%]
test/formatters/test_template_formatter.py ................                                                                                                                                                                          [ 33%]
test/helpers/test_event_loop_helper.py ....                                                                                                                                                                                          [ 34%]
test/helpers/test_server_type.py ................                                                                                                                                                                                    [ 35%]
test/plugins/test_all_payloads.py ...................................................................................................                                                                                                [ 47%]
test/plugins/test_blocking.py ................                                                                                                                                                                                       [ 49%]
test/plugins/test_build_global_context.py .......                                                                                                                                                                                    [ 50%]
test/plugins/test_decorators.py .........                                                                                                                                                                                            [ 51%]
test/plugins/test_execution_modes.py ...........................                                                                                                                                                                     [ 54%]
test/plugins/test_hook_call_sites.py ..............................                                                                                                                                                                  [ 57%]
test/plugins/test_manager.py ss......                                                                                                                                                                                                [ 58%]
test/plugins/test_mellea_plugin.py .......                                                                                                                                                                                           [ 59%]
test/plugins/test_payloads.py ..........                                                                                                                                                                                             [ 60%]
test/plugins/test_pluginset.py .........                                                                                                                                                                                             [ 61%]
test/plugins/test_policies.py ......                                                                                                                                                                                                 [ 62%]
test/plugins/test_policy_enforcement.py ..........                                                                                                                                                                                   [ 63%]
test/plugins/test_priority_ordering.py ..............                                                                                                                                                                                [ 65%]
test/plugins/test_scoping.py ...................................                                                                                                                                                                     [ 69%]
test/plugins/test_tool_hooks_redaction.py .......                                                                                                                                                                                    [ 70%]
test/plugins/test_unregister.py .........                                                                                                                                                                                            [ 71%]
test/stdlib/components/docs/test_document.py ...                                                                                                                                                                                     [ 71%]
test/stdlib/components/docs/test_richdocument.py .....s                                                                                                                                                                              [ 72%]
test/stdlib/components/intrinsic/test_core.py ..F                                                                                                                                                                                    [ 72%]
test/stdlib/components/intrinsic/test_guardian.py ......                                                                                                                                                                             [ 73%]
test/stdlib/components/intrinsic/test_rag.py ....F..                                                                                                                                                                                 [ 73%]
test/stdlib/components/test_chat.py .                                                                                                                                                                                                [ 74%]
test/stdlib/components/test_genslot.py ...................                                                                                                                                                                           [ 76%]
test/stdlib/components/test_hello_world.py ..                                                                                                                                                                                        [ 76%]
test/stdlib/components/test_mify.py ...........                                                                                                                                                                                      [ 77%]
test/stdlib/components/test_transform.py ..                                                                                                                                                                                          [ 78%]
test/stdlib/requirements/test_reqlib_markdown.py ......                                                                                                                                                                              [ 78%]
test/stdlib/requirements/test_reqlib_python.py .............sss.....                                                                                                                                                                 [ 81%]
test/stdlib/requirements/test_reqlib_tools.py .                                                                                                                                                                                      [ 81%]
test/stdlib/requirements/test_requirement.py .....                                                                                                                                                                                   [ 81%]
test/stdlib/sampling/test_majority_voting.py ..                                                                                                                                                                                      [ 82%]
test/stdlib/sampling/test_sampling_ctx.py ..                                                                                                                                                                                         [ 82%]
test/stdlib/sampling/test_sofai_graph_coloring.py .........................                                                                                                                                                          [ 85%]
test/stdlib/sampling/test_sofai_sampling.py .....................                                                                                                                                                                    [ 87%]
test/stdlib/sampling/test_think_budget_forcing.py ..                                                                                                                                                                                 [ 87%]
test/stdlib/test_base_context.py .....                                                                                                                                                                                               [ 88%]
test/stdlib/test_chat_view.py ..                                                                                                                                                                                                     [ 88%]
test/stdlib/test_functional.py ....                                                                                                                                                                                                  [ 89%]
test/stdlib/test_session.py s.......                                                                                                                                                                                                 [ 90%]
test/stdlib/test_spans.py .x                                                                                                                                                                                                         [ 90%]
test/telemetry/test_logging.py ........                                                                                                                                                                                              [ 91%]
test/telemetry/test_metrics.py .......................................                                                                                                                                                               [ 95%]
test/telemetry/test_metrics_backend.py ....s....                                                                                                                                                                                     [ 96%]
test/telemetry/test_metrics_plugins.py ....                                                                                                                                                                                          [ 97%]
test/telemetry/test_metrics_token.py ....                                                                                                                                                                                            [ 97%]
test/telemetry/test_tracing.py ..............                                                                                                                                                                                        [ 99%]
test/telemetry/test_tracing_backend.py ssssss                                                                                                                                                                                        [100%]

================================================================================================================= FAILURES =================================================================================================================
______________________________________________________________________________________________________ test_find_context_attributions ______________________________________________________________________________________________________

backend = <mellea.backends.huggingface.LocalHFBackend object at 0x14df00380>

    @pytest.mark.qualitative
    def test_find_context_attributions(backend):
        """Verify that the context-attribution intrinsic functions properly."""
        context, assistant_response, documents = _read_rag_input_json(
            "context-attribution.json"
        )
        expected = _read_rag_output_json("context-attribution.json")
    
>       result = core.find_context_attributions(
            assistant_response, documents, context, backend
        )

test/stdlib/components/intrinsic/test_core.py:102: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
mellea/stdlib/components/intrinsic/core.py:90: in find_context_attributions
    result_json = call_intrinsic(
mellea/stdlib/components/intrinsic/_util.py:39: in call_intrinsic
    model_output_thunk, _ = mfuncs.act(
mellea/stdlib/functional.py:98: in act
    out = _run_async_in_thread(
mellea/helpers/event_loop_helper.py:105: in _run_async_in_thread
    return __event_loop_handler(co)
           ^^^^^^^^^^^^^^^^^^^^^^^^
mellea/helpers/event_loop_helper.py:77: in __call__
    return asyncio.run_coroutine_threadsafe(co, self._event_loop).result()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
../../../.local/share/uv/python/cpython-3.12.8-macos-aarch64-none/lib/python3.12/concurrent/futures/_base.py:456: in result
    return self.__get_result()
           ^^^^^^^^^^^^^^^^^^^
../../../.local/share/uv/python/cpython-3.12.8-macos-aarch64-none/lib/python3.12/concurrent/futures/_base.py:401: in __get_result
    raise self._exception
mellea/stdlib/functional.py:584: in aact
    await result.avalue()
mellea/core/base.py:394: in avalue
    await self.astream()
mellea/core/base.py:485: in astream
    await self._process(self, chunk)
mellea/backends/huggingface.py:581: in granite_formatters_processing
    res = result_processor.transform(chunk, rewritten)  # type: ignore
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
mellea/formatters/granite/base/io.py:182: in transform
    return self._transform_impl(chat_completion_response, chat_completion)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
mellea/formatters/granite/intrinsics/output.py:1267: in _transform_impl
    self._transform_choice(c, chat_completion)
mellea/formatters/granite/intrinsics/output.py:1308: in _transform_choice
    parsed_json = rule.apply(
mellea/formatters/granite/intrinsics/output.py:166: in apply
    result = self._apply_at_path(result, path, prepare_output)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
mellea/formatters/granite/intrinsics/output.py:251: in _apply_at_path
    new_values = self._transform(original_value, path, prepare_output)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <mellea.formatters.granite.intrinsics.output.DecodeSentences object at 0x14dbcf050>, value = 765211, path = (0, 'r')
prepare_output = {'begins': [0, 137], 'document_ids': [None, None], 'ends': [137, 257], 'message_indices': [None, None], ...}

    def _transform(self, value: Any, path: tuple, prepare_output: dict) -> dict:
        # Unpack global values we set aside during the prepare phase
        begins = prepare_output["begins"]
        ends = prepare_output["ends"]
        texts = prepare_output["texts"]
        document_ids = prepare_output.get("document_ids")
        message_indices = prepare_output.get("message_indices")
    
        if not isinstance(value, int):
            raise TypeError(
                f"Expected integer sentence number at path {path}, but "
                f"found non-integer value {value} of type {type(value)}"
            )
        sentence_num = value
    
        result = {}
        if self.begin_name is not None:
>           result[self.begin_name] = begins[sentence_num]
                                      ^^^^^^^^^^^^^^^^^^^^
E           IndexError: list index out of range

mellea/formatters/granite/intrinsics/output.py:714: IndexError
----------------------------------------------------------------------------------------------------------- Captured stdout call -----------------------------------------------------------------------------------------------------------
=== 15:39:12-INFO ======
passing in model options when generating with an adapter; some model options may be overwritten / ignored
------------------------------------------------------------------------------------------------------------ Captured log call -------------------------------------------------------------------------------------------------------------
INFO     fancy_logger:huggingface.py:475 passing in model options when generating with an adapter; some model options may be overwritten / ignored
--------------------------------------------------------------------------------------------------------- Captured stdout teardown ---------------------------------------------------------------------------------------------------------
=== 15:41:30-INFO ======
Cleaning up test_core backend GPU memory...
=== 15:41:30-INFO ======
  Cleared LRU cache
=== 15:41:30-INFO ======
  Removed accelerate dispatch hooks
---------------------------------------------------------------------------------------------------------- Captured log teardown -----------------------------------------------------------------------------------------------------------
INFO     fancy_logger:conftest.py:342 Cleaning up test_core backend GPU memory...
INFO     fancy_logger:conftest.py:365   Cleared LRU cache
INFO     fancy_logger:conftest.py:402   Removed accelerate dispatch hooks
_______________________________________________________________________________________________________ test_hallucination_detection _______________________________________________________________________________________________________

backend = <mellea.backends.huggingface.LocalHFBackend object at 0x13eff8d10>

    @pytest.mark.qualitative
    def test_hallucination_detection(backend):
        """Verify that the hallucination detection intrinsic functions properly."""
        context, assistant_response, docs = _read_input_json("hallucination_detection.json")
        expected = _read_output_json("hallucination_detection.json")
    
        # First call triggers adapter loading
        result = rag.flag_hallucinated_content(assistant_response, docs, context, backend)
        # pytest.approx() chokes on lists of records, so we do this complicated dance.
        for r, e in zip(result, expected, strict=True):  # type: ignore
>           assert pytest.approx(r, abs=2e-2) == e
E           AssertionError: assert approx({'resp...he sentence.}) == {'explanation...end': 31, ...}
E             
E             comparison failed. Mismatched elements: 1 / 5:
E             Max absolute difference: 0.022802131238099044
E             Max relative difference: 0.03036794087969006
E             Index                   | Obtained           | Expected                 
E             faithfulness_likelihood | 0.7280598165124975 | 0.7508619477505966 ± 0.02

test/stdlib/components/intrinsic/test_rag.py:159: AssertionError
----------------------------------------------------------------------------------------------------------- Captured stdout call -----------------------------------------------------------------------------------------------------------
=== 15:42:34-INFO ======
passing in model options when generating with an adapter; some model options may be overwritten / ignored
------------------------------------------------------------------------------------------------------------ Captured log call -------------------------------------------------------------------------------------------------------------
INFO     fancy_logger:huggingface.py:475 passing in model options when generating with an adapter; some model options may be overwritten / ignored
============================================================================================================= warnings summary =============================================================================================================
test/backends/test_litellm_ollama.py::test_litellm_ollama_chat
test/backends/test_litellm_ollama.py::test_generate_from_raw
test/backends/test_litellm_ollama.py::test_async_parallel_requests
test/backends/test_litellm_ollama.py::test_async_avalue
test/telemetry/test_metrics_backend.py::test_litellm_token_metrics_integration[non-streaming]
test/telemetry/test_metrics_backend.py::test_litellm_token_metrics_integration[streaming]
  /Users/ajbozarth/workspace/ai/mellea/.venv/lib/python3.12/site-packages/aiohttp/connector.py:993: DeprecationWarning: enable_cleanup_closed ignored because https://github.com/python/cpython/pull/118960 is fixed in Python version sys.version_info(major=3, minor=12, micro=8, releaselevel='final', serial=0)
    super().__init__(

test/backends/test_litellm_ollama.py::test_litellm_ollama_chat
  /Users/ajbozarth/workspace/ai/mellea/.venv/lib/python3.12/site-packages/pydantic/main.py:464: UserWarning: Pydantic serializer warnings:
    PydanticSerializationUnexpectedValue(Expected 10 fields but got 5: Expected `Message` - serialized value may not be as expected [field_name='message', input_value=Message(content='The answ...er_specific_fields=None), input_type=Message])
    PydanticSerializationUnexpectedValue(Expected `StreamingChoices` - serialized value may not be as expected [field_name='choices', input_value=Choices(finish_reason='st...r_specific_fields=None)), input_type=Choices])
    return self.__pydantic_serializer__.to_python(

test/backends/test_litellm_ollama.py::test_litellm_ollama_instruct
  /Users/ajbozarth/workspace/ai/mellea/.venv/lib/python3.12/site-packages/pydantic/main.py:464: UserWarning: Pydantic serializer warnings:
    PydanticSerializationUnexpectedValue(Expected 10 fields but got 5: Expected `Message` - serialized value may not be as expected [field_name='message', input_value=Message(content="Subject:...er_specific_fields=None), input_type=Message])
    PydanticSerializationUnexpectedValue(Expected `StreamingChoices` - serialized value may not be as expected [field_name='choices', input_value=Choices(finish_reason='st...r_specific_fields=None)), input_type=Choices])
    return self.__pydantic_serializer__.to_python(

test/backends/test_litellm_ollama.py::test_litellm_ollama_instruct
test/backends/test_litellm_ollama.py::test_litellm_ollama_instruct_options
  /Users/ajbozarth/workspace/ai/mellea/.venv/lib/python3.12/site-packages/pydantic/main.py:464: UserWarning: Pydantic serializer warnings:
    PydanticSerializationUnexpectedValue(Expected 10 fields but got 5: Expected `Message` - serialized value may not be as expected [field_name='message', input_value=Message(content='yes', ro...er_specific_fields=None), input_type=Message])
    PydanticSerializationUnexpectedValue(Expected `StreamingChoices` - serialized value may not be as expected [field_name='choices', input_value=Choices(finish_reason='st...r_specific_fields=None)), input_type=Choices])
    return self.__pydantic_serializer__.to_python(

test/backends/test_litellm_ollama.py::test_litellm_ollama_instruct_options
  /Users/ajbozarth/workspace/ai/mellea/.venv/lib/python3.12/site-packages/pydantic/main.py:464: UserWarning: Pydantic serializer warnings:
    PydanticSerializationUnexpectedValue(Expected 10 fields but got 5: Expected `Message` - serialized value may not be as expected [field_name='message', input_value=Message(content='Subject:...er_specific_fields=None), input_type=Message])
    PydanticSerializationUnexpectedValue(Expected `StreamingChoices` - serialized value may not be as expected [field_name='choices', input_value=Choices(finish_reason='st...r_specific_fields=None)), input_type=Choices])
    return self.__pydantic_serializer__.to_python(

test/backends/test_litellm_ollama.py::test_gen_slot
  /Users/ajbozarth/workspace/ai/mellea/.venv/lib/python3.12/site-packages/pydantic/main.py:464: UserWarning: Pydantic serializer warnings:
    PydanticSerializationUnexpectedValue(Expected 10 fields but got 5: Expected `Message` - serialized value may not be as expected [field_name='message', input_value=Message(content='{\n"resu...er_specific_fields=None), input_type=Message])
    PydanticSerializationUnexpectedValue(Expected `StreamingChoices` - serialized value may not be as expected [field_name='choices', input_value=Choices(finish_reason='st...r_specific_fields=None)), input_type=Choices])
    return self.__pydantic_serializer__.to_python(

test/backends/test_litellm_ollama.py::test_async_parallel_requests
test/telemetry/test_metrics_backend.py::test_litellm_token_metrics_integration[streaming]
  /Users/ajbozarth/workspace/ai/mellea/.venv/lib/python3.12/site-packages/litellm/litellm_core_utils/streaming_handler.py:1855: PydanticDeprecatedSince20: The `dict` method is deprecated; use `model_dump` instead. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.12/migration/
    obj_dict = processed_chunk.dict()

test/backends/test_litellm_ollama.py::test_async_parallel_requests
test/backends/test_litellm_ollama.py::test_async_avalue
test/backends/test_litellm_ollama.py::test_async_avalue
  /Users/ajbozarth/workspace/ai/mellea/.venv/lib/python3.12/site-packages/pydantic/main.py:464: UserWarning: Pydantic serializer warnings:
    PydanticSerializationUnexpectedValue(Expected 10 fields but got 5: Expected `Message` - serialized value may not be as expected [field_name='message', input_value=Message(content='Hello! H...er_specific_fields=None), input_type=Message])
    PydanticSerializationUnexpectedValue(Expected `StreamingChoices` - serialized value may not be as expected [field_name='choices', input_value=Choices(finish_reason='st...r_specific_fields=None)), input_type=Choices])
    return self.__pydantic_serializer__.to_python(

test/backends/test_litellm_ollama.py::test_async_parallel_requests
  /Users/ajbozarth/workspace/ai/mellea/.venv/lib/python3.12/site-packages/pydantic/main.py:464: UserWarning: Pydantic serializer warnings:
    PydanticSerializationUnexpectedValue(Expected 10 fields but got 5: Expected `Message` - serialized value may not be as expected [field_name='message', input_value=Message(content='Goodbye!...er_specific_fields=None), input_type=Message])
    PydanticSerializationUnexpectedValue(Expected `StreamingChoices` - serialized value may not be as expected [field_name='choices', input_value=Choices(finish_reason='st...r_specific_fields=None)), input_type=Choices])
    return self.__pydantic_serializer__.to_python(

test/backends/test_tool_calls.py::test_tool_called_from_context_action
  <frozen abc>:106: DeprecationWarning: Use BaseMetaSerializer() instead.

test/backends/test_vision_ollama.py::test_image_block_construction
  /Users/ajbozarth/workspace/ai/mellea/test/backends/test_vision_ollama.py:38: DeprecationWarning: 'mode' parameter is deprecated and will be removed in Pillow 13 (2026-10-15)
    random_image = Image.fromarray(random_pixel_data, "RGB")

test/backends/test_vision_openai.py::test_image_block_construction
  /Users/ajbozarth/workspace/ai/mellea/test/backends/test_vision_openai.py:48: DeprecationWarning: 'mode' parameter is deprecated and will be removed in Pillow 13 (2026-10-15)
    random_image = Image.fromarray(random_pixel_data, "RGB")

test/cli/test_alora_train.py::test_alora_config_creation
test/cli/test_alora_train.py::test_lora_config_creation
test/cli/test_alora_train.py::test_invocation_prompt_tokenization
  /Users/ajbozarth/workspace/ai/mellea/.venv/lib/python3.12/site-packages/trl/trainer/sft_config.py:257: DeprecationWarning: `max_seq_length` is deprecated and will be removed in version 0.20.0. Use `max_length` instead.
    warnings.warn(

test/helpers/test_event_loop_helper.py::test_event_loop_handler_with_forking
  /Users/ajbozarth/.local/share/uv/python/cpython-3.12.8-macos-aarch64-none/lib/python3.12/multiprocessing/popen_fork.py:66: DeprecationWarning: This process (pid=49161) is multi-threaded, use of fork() may lead to deadlocks in the child.
    self.pid = os.fork()

test/stdlib/components/docs/test_richdocument.py::test_richdocument_basics
test/stdlib/components/docs/test_richdocument.py::test_richdocument_markdown
test/stdlib/components/docs/test_richdocument.py::test_richdocument_save
test/stdlib/components/docs/test_richdocument.py::test_richdocument_save
  /Users/ajbozarth/workspace/ai/mellea/.venv/lib/python3.12/site-packages/docling_core/transforms/serializer/markdown.py:490: DeprecationWarning: Field `annotations` is deprecated; use `meta` instead.
    for ann in item.annotations

test/stdlib/components/intrinsic/test_core.py: 2 warnings
test/stdlib/components/intrinsic/test_guardian.py: 3 warnings
test/stdlib/components/intrinsic/test_rag.py: 5 warnings
  /Users/ajbozarth/workspace/ai/mellea/.venv/lib/python3.12/site-packages/peft/tuners/tuners_utils.py:285: UserWarning: Already found a `peft_config` attribute in the model. This will lead to having multiple adapters in the model. Make sure to know what you are doing!
    warnings.warn(

test/stdlib/test_spans.py::test_lazy_spans
  /Users/ajbozarth/workspace/ai/mellea/.venv/lib/python3.12/site-packages/torch/nn/functional.py:5294: UserWarning: MPS: The constant padding of more than 3 dimensions is not currently supported natively. It uses View Ops default implementation to run. This may have performance implications. (Triggered internally at /Users/runner/work/pytorch/pytorch/pytorch/aten/src/ATen/native/mps/operations/Pad.mm:468.)
    return torch._C._nn.pad(input, pad, mode, value)

test/telemetry/test_logging.py::test_otlp_logging_enabled_without_endpoint_warns
  /Users/ajbozarth/workspace/ai/mellea/mellea/telemetry/logging.py:97: UserWarning: OTLP logs exporter is enabled (MELLEA_LOGS_OTLP=true) but no endpoint is configured. Set OTEL_EXPORTER_OTLP_LOGS_ENDPOINT or OTEL_EXPORTER_OTLP_ENDPOINT to export logs.
    _logger_provider = _setup_logger_provider()

test/telemetry/test_metrics.py: 24 warnings
test/telemetry/test_metrics_backend.py: 8 warnings
test/telemetry/test_metrics_token.py: 4 warnings
  /Users/ajbozarth/workspace/ai/mellea/mellea/telemetry/metrics.py:245: UserWarning: Metrics are enabled (MELLEA_METRICS_ENABLED=true) but no exporters are configured. Metrics will be collected but not exported. Set MELLEA_METRICS_PROMETHEUS=true, set MELLEA_METRICS_OTLP=true with an endpoint (OTEL_EXPORTER_OTLP_METRICS_ENDPOINT or OTEL_EXPORTER_OTLP_ENDPOINT), or set MELLEA_METRICS_CONSOLE=true to export metrics.
    _meter_provider = _setup_meter_provider()

test/telemetry/test_metrics.py: 28 warnings
test/telemetry/test_metrics_backend.py: 8 warnings
test/telemetry/test_metrics_token.py: 4 warnings
  /Users/ajbozarth/.local/share/uv/python/cpython-3.12.8-macos-aarch64-none/lib/python3.12/importlib/__init__.py:131: UserWarning: TokenMetricsPlugin already registered: Plugin token_metrics.generation_post_call already registered
    _bootstrap._exec(spec, module)

test/telemetry/test_metrics_backend.py::test_litellm_token_metrics_integration[non-streaming]
test/telemetry/test_tracing.py::test_session_with_tracing_disabled
  /Users/ajbozarth/workspace/ai/mellea/.venv/lib/python3.12/site-packages/pydantic/main.py:464: UserWarning: Pydantic serializer warnings:
    PydanticSerializationUnexpectedValue(Expected 10 fields but got 5: Expected `Message` - serialized value may not be as expected [field_name='message', input_value=Message(content="Of cours...ields={'refusal': None}), input_type=Message])
    PydanticSerializationUnexpectedValue(Expected `StreamingChoices` - serialized value may not be as expected [field_name='choices', input_value=Choices(finish_reason='st...ider_specific_fields={}), input_type=Choices])
    return self.__pydantic_serializer__.to_python(

test/telemetry/test_metrics_backend.py::test_litellm_token_metrics_integration[streaming]
  /Users/ajbozarth/workspace/ai/mellea/.venv/lib/python3.12/site-packages/litellm/litellm_core_utils/model_response_utils.py:206: PydanticDeprecatedSince211: Accessing the 'model_computed_fields' attribute on the instance is deprecated. Instead, you should access this attribute from the model class. Deprecated in Pydantic V2.11 to be removed in V3.0.
    or callable(getattr(delta, attr_name))

test/telemetry/test_metrics_backend.py::test_litellm_token_metrics_integration[streaming]
  /Users/ajbozarth/workspace/ai/mellea/.venv/lib/python3.12/site-packages/litellm/litellm_core_utils/model_response_utils.py:206: PydanticDeprecatedSince211: Accessing the 'model_fields' attribute on the instance is deprecated. Instead, you should access this attribute from the model class. Deprecated in Pydantic V2.11 to be removed in V3.0.
    or callable(getattr(delta, attr_name))

test/telemetry/test_metrics_backend.py::test_litellm_token_metrics_integration[streaming]
  /Users/ajbozarth/workspace/ai/mellea/.venv/lib/python3.12/site-packages/pydantic/main.py:464: UserWarning: Pydantic serializer warnings:
    PydanticSerializationUnexpectedValue(Expected 10 fields but got 5: Expected `Message` - serialized value may not be as expected [field_name='message', input_value=Message(content="I'm an A...er_specific_fields=None), input_type=Message])
    PydanticSerializationUnexpectedValue(Expected `StreamingChoices` - serialized value may not be as expected [field_name='choices', input_value=Choices(finish_reason='st...r_specific_fields=None)), input_type=Choices])
    return self.__pydantic_serializer__.to_python(

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
============================================================================================================= Skipped Examples =============================================================================================================
The following examples were skipped during collection:

  • 102_example.py: Example marked with skip marker
  • example_readme_generator.py: Example marked with skip marker
  • make_training_data.py: Example marked with skip marker
  • stembolts_intrinsic.py: Example marked with skip marker
  • bedrock_litellm_example.py: Example marked with skip marker
  • bedrock_openai_example.py: Example marked with skip marker
  • qiskit_code_validation.py: Example marked with skip marker
  • validation_helpers.py: Example marked with skip marker
  • python_decompose_result.py: Example marked to always skip (skip_always marker)
  • m_decomp_result.py: Example marked to always skip (skip_always marker)
  • client.py: Example marked to always skip (skip_always marker)
  • pii_serve.py: Example marked to always skip (skip_always marker)
  • mcp_example.py: Example marked to always skip (skip_always marker)
  • rich_document_advanced.py: Example marked with skip marker
  • mellea_pdf.py: Example marked to always skip (skip_always marker)
  • simple_rag_with_filter.py: Example marked to always skip (skip_always marker)
============================================================================================================== tests coverage ==============================================================================================================
_____________________________________________________________________________________________ coverage: platform darwin, python 3.12.8-final-0 _____________________________________________________________________________________________

Coverage HTML written to dir htmlcov
Coverage JSON written to file coverage.json
========================================================================================================= short test summary info ==========================================================================================================
FAILED test/stdlib/components/intrinsic/test_core.py::test_find_context_attributions - IndexError: list index out of range
FAILED test/stdlib/components/intrinsic/test_rag.py::test_hallucination_detection - AssertionError: assert approx({'resp...he sentence.}) == {'explanation...end': 31, ...}
================================================================ 2 failed, 800 passed, 61 skipped, 19 deselected, 2 xfailed, 1 xpassed, 122 warnings in 1925.97s (0:32:05) =================================================================
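
For context on how narrow the test_hallucination_detection miss is, the sketch below replays the comparison with the two values from the failure report. It uses `math.isclose` with the relative tolerance disabled as a stand-in for `pytest.approx(..., abs=2e-2)`. The helper name `within_abs` is made up for illustration and is not part of the test suite.

```python
import math

# Values copied from the failure report above.
obtained = 0.7280598165124975
expected = 0.7508619477505966


def within_abs(a: float, b: float, tol: float) -> bool:
    """Absolute-tolerance comparison only (relative tolerance disabled)."""
    return math.isclose(a, b, rel_tol=0.0, abs_tol=tol)


diff = abs(obtained - expected)
print(f"absolute difference: {diff:.6f}")
print("within 2e-2:", within_abs(obtained, expected, 2e-2))
print("within 3e-2:", within_abs(obtained, expected, 3e-2))
```

The difference exceeds the 0.02 tolerance by under 0.003, which is consistent with ordinary run-to-run variation in a qualitative test rather than a regression.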

Local slow run (uv run pytest -m slow, Mac M1 Max 32GB): 18 passed, 3 skipped, 864 deselected in 3m32s. All expected.

Terminal output

$ uv run pytest -m slow
=========================================================================================================== test session starts ============================================================================================================
platform darwin -- Python 3.12.8, pytest-9.0.0, pluggy-1.6.0
rootdir: /Users/ajbozarth/workspace/ai/mellea
configfile: pyproject.toml
testpaths: test, docs
plugins: nbmake-1.5.5, recording-0.13.4, anyio-4.11.0, xdist-3.8.0, json-report-1.5.0, timeout-2.4.0, metadata-3.1.1, asyncio-1.3.0, langsmith-0.6.6, Faker-37.12.0, cov-7.0.0
timeout: 900.0s
timeout method: signal
timeout func_only: False
asyncio: mode=Mode.AUTO, debug=False, asyncio_default_fixture_loop_scope=None, asyncio_default_test_loop_scope=function
collected 883 items / 864 deselected / 2 skipped / 19 selected                                                                                                                                                                             

test/package/test_dependency_isolation.py ..s................                                                                                                                                                                        [100%]

============================================================================================================= Skipped Examples =============================================================================================================
The following examples were skipped during collection:

  • 102_example.py: Example marked with skip marker
  • example_readme_generator.py: Example marked with skip marker
  • make_training_data.py: Example marked with skip marker
  • stembolts_intrinsic.py: Example marked with skip marker
  • bedrock_litellm_example.py: Example marked with skip marker
  • bedrock_openai_example.py: Example marked with skip marker
  • qiskit_code_validation.py: Example marked with skip marker
  • validation_helpers.py: Example marked with skip marker
  • python_decompose_result.py: Example marked to always skip (skip_always marker)
  • m_decomp_result.py: Example marked to always skip (skip_always marker)
  • client.py: Example marked to always skip (skip_always marker)
  • pii_serve.py: Example marked to always skip (skip_always marker)
  • mcp_example.py: Example marked to always skip (skip_always marker)
  • rich_document_advanced.py: Example marked with skip marker
  • mellea_pdf.py: Example marked to always skip (skip_always marker)
  • simple_rag_with_filter.py: Example marked to always skip (skip_always marker)
============================================================================================================== tests coverage ==============================================================================================================
_____________________________________________________________________________________________ coverage: platform darwin, python 3.12.8-final-0 _____________________________________________________________________________________________

Coverage HTML written to dir htmlcov
Coverage JSON written to file coverage.json
======================================================================================== 18 passed, 3 skipped, 864 deselected in 212.41s (0:03:32) =========================================================================================

Cluster run (./test/scripts/run_tests_with_ollama.sh, IBM LSF, NVIDIA GPU node, Python 3.12.12): 735 passed, 47 failed, 30 skipped, 58 errors, 19 deselected, 3 xfailed in 1:20:16.

The 58 errors and most of the 47 failures are Ollama connectivity issues. The script detected an existing Ollama server, but all three model warmups timed out, and tests then errored with "could not create OllamaModelBackend: ollama server not running at None" (base_url resolving to None). This is an environment issue unrelated to this PR. I plan to re-run in a clean environment.

The test_find_context_attributions qualitative flake is also present here, same as in the local run.
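
For reference, the local traceback shows the model emitting sentence number 765211 while only two sentences were available (`begins = [0, 137]`), so the list lookup raised a bare IndexError. A minimal sketch of a bounds check that would surface this more clearly follows. `lookup_sentence_offsets` is a hypothetical standalone function, not the actual `DecodeSentences._transform` code, and this PR does not change that behaviour.

```python
def lookup_sentence_offsets(
    sentence_num: int, begins: list[int], ends: list[int]
) -> tuple[int, int]:
    """Return (begin, end) for a sentence, rejecting out-of-range numbers."""
    if not isinstance(sentence_num, int):
        raise TypeError(
            f"Expected integer sentence number, found {type(sentence_num)}"
        )
    # Model output is not guaranteed to reference an existing sentence.
    if not 0 <= sentence_num < len(begins):
        raise ValueError(
            f"Sentence number {sentence_num} is out of range; "
            f"only {len(begins)} sentences are available"
        )
    return begins[sentence_num], ends[sentence_num]


# Values taken from the traceback in the local run.
begins, ends = [0, 137], [137, 257]
print(lookup_sentence_offsets(1, begins, ends))
```

Calling it with 765211 raises a ValueError that names the offending value, which makes it obvious the failure comes from non-deterministic model output rather than from the marker changes.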

Terminal output

$ bsub -Is -n 1 -G grp_preemptable -q preemptable -gpu "num=1/task:mode=shared:mps=no:j_exclusive=yes:gvendor=nvidia" /bin/bash
num=1/task:mode=shared:mps=no:j_exclusive=yes:gvendor=nvidia
GPU mode=shared. This is allowed but deprecated
Job <741102> is submitted to queue <preemptable>.
<<Waiting for dispatch ...>>
<<Starting on p5-r06-n1>>
[ajbozarth@p5-r06-n1 mellea]$ bash ./test/scripts/run_tests_with_ollama.sh
[20:27:25] WARNING: CACHE_DIR not set. Ollama models will download to ~/.ollama (default)
[20:27:25] Using standalone log directory: logs/2026-03-27-20:27:25
[20:27:25] Ollama already running on 127.0.0.1:11434 — using existing server
[20:27:26] Model granite4:micro already pulled
[20:27:26] Model granite4:micro-h already pulled
[20:27:26] Pulling granite3.2-vision ...
success 
[20:27:40] All models ready.
[20:27:40] Warming up models...
[20:27:40]   Warming granite4:micro ...
[20:29:40]   Warning: warmup for granite4:micro timed out (will load on first test)
[20:29:40]   Warming granite4:micro-h ...
[20:31:40]   Warning: warmup for granite4:micro-h timed out (will load on first test)
[20:31:40]   Warming granite3.2-vision ...
[20:33:40]   Warning: warmup for granite3.2-vision timed out (will load on first test)
[20:33:40] Warmup complete.
[20:33:40] Starting pytest...
[20:33:40] Log directory: logs/2026-03-27-20:27:25
[20:33:40] Pytest args: --group-by-backend
============================= test session starts ==============================
platform linux -- Python 3.12.12, pytest-9.0.0, pluggy-1.6.0
rootdir: /proj/dmfexp/eiger/users/ajbozarth/mellea
configfile: pyproject.toml
plugins: nbmake-1.5.5, anyio-4.11.0, json-report-1.5.0, recording-0.13.4, asyncio-1.3.0, timeout-2.4.0, metadata-3.1.1, Faker-37.12.0, xdist-3.8.0, langsmith-0.6.6, cov-7.0.0
asyncio: mode=Mode.AUTO, debug=False, asyncio_default_fixture_loop_scope=None, asyncio_default_test_loop_scope=function
timeout: 900.0s
timeout method: signal
timeout func_only: False
collected 892 items / 19 deselected / 873 selected

test/backends/test_huggingface.py ...................                    [  2%]
test/backends/test_huggingface_tools.py .                                [  2%]
test/cli/test_alora_train_integration.py ..                              [  2%]
test/formatters/granite/test_intrinsics_formatters.py ....x..........    [  4%]
test/stdlib/components/docs/test_richdocument.py s                       [  4%]
test/stdlib/components/intrinsic/test_core.py ..F                        [  4%]
test/stdlib/components/intrinsic/test_guardian.py ......                 [  5%]
test/stdlib/components/intrinsic/test_rag.py .......                     [  6%]
test/stdlib/test_spans.py .x                                             [  6%]
test/telemetry/test_metrics_backend.py ..                                [  6%]
test/backends/test_openai_ollama.py FFFFFFFF.....                        [  8%]
test/backends/test_openai_vllm.py .......                                [  8%]
test/backends/test_vision_openai.py ..FF                                 [  9%]
test/telemetry/test_metrics_backend.py FF                                [  9%]
test/backends/test_vllm.py ........                                      [ 10%]
test/backends/test_vllm_tools.py .                                       [ 10%]
test/backends/test_litellm_ollama.py .FFFFFFF                            [ 11%]
test/backends/test_mellea_tool.py EE                                     [ 11%]
test/backends/test_ollama.py EEEEExEEEE                                  [ 12%]
test/backends/test_tool_calls.py EEE                                     [ 13%]
test/backends/test_vision_ollama.py ..EE                                 [ 13%]
test/core/test_astream_incremental.py FFFF.F                             [ 14%]
test/core/test_component_typing.py EEE                                   [ 14%]
test/core/test_model_output_thunk.py EE                                  [ 15%]
test/stdlib/components/test_genslot.py EEEEEEEEEEEEEEE.EEE               [ 17%]
test/stdlib/requirements/test_requirement.py FF...                       [ 17%]
test/stdlib/sampling/test_majority_voting.py EE                          [ 17%]
test/stdlib/sampling/test_sampling_ctx.py EE                             [ 18%]
test/stdlib/sampling/test_sofai_graph_coloring.py FFF                    [ 18%]
test/stdlib/sampling/test_sofai_sampling.py F                            [ 18%]
test/stdlib/sampling/test_think_budget_forcing.py EE                     [ 18%]
test/stdlib/test_chat_view.py EE                                         [ 19%]
test/stdlib/test_functional.py EEEE                                      [ 19%]
test/stdlib/test_session.py sEEEEEEE                                     [ 20%]
test/telemetry/test_metrics_backend.py FFFF                              [ 20%]
test/telemetry/test_tracing.py FFFF                                      [ 21%]
test/telemetry/test_tracing_backend.py ssssss                            [ 22%]
test/backends/test_bedrock.py s                                          [ 22%]
test/backends/test_litellm_watsonx.py ssss                               [ 22%]
test/backends/test_watsonx.py sssssssssss                                [ 23%]
test/telemetry/test_metrics_backend.py s                                 [ 24%]
test/backends/test_adapters/test_adapter.py .                            [ 24%]
test/backends/test_mellea_tool.py .....                                  [ 24%]
test/backends/test_model_options.py .....                                [ 25%]
test/backends/test_tool_decorator.py ...................                 [ 27%]
test/backends/test_tool_helpers.py ...                                   [ 27%]
test/backends/test_tool_validation_integration.py ...................... [ 30%]
...........                                                              [ 31%]
test/cli/test_alora_train.py ....                                        [ 32%]
test/core/test_astream_exception_propagation.py .....                    [ 32%]
test/core/test_astream_mock.py ......                                    [ 33%]
test/core/test_base.py ....                                              [ 33%]
test/core/test_component_typing.py .....                                 [ 34%]
test/decompose/test_decompose.py ..........                              [ 35%]
test/formatters/granite/test_intrinsics_formatters.py .................. [ 37%]
..................................FFFFFFFF                               [ 42%]
test/formatters/test_template_formatter.py ................              [ 44%]
test/helpers/test_event_loop_helper.py ....                              [ 44%]
test/helpers/test_server_type.py ................                        [ 46%]
test/plugins/test_all_payloads.py ...................................... [ 50%]
.............................................................            [ 57%]
test/plugins/test_blocking.py ................                           [ 59%]
test/plugins/test_build_global_context.py .......                        [ 60%]
test/plugins/test_decorators.py .........                                [ 61%]
test/plugins/test_execution_modes.py ...........................         [ 64%]
test/plugins/test_hook_call_sites.py ..............................      [ 68%]
test/plugins/test_manager.py ss......                                    [ 68%]
test/plugins/test_mellea_plugin.py .......                               [ 69%]
test/plugins/test_payloads.py ..........                                 [ 70%]
test/plugins/test_pluginset.py .........                                 [ 71%]
test/plugins/test_policies.py ......                                     [ 72%]
test/plugins/test_policy_enforcement.py ..........                       [ 73%]
test/plugins/test_priority_ordering.py ..............                    [ 75%]
test/plugins/test_scoping.py ...................................         [ 79%]
test/plugins/test_tool_hooks_redaction.py .......                        [ 80%]
test/plugins/test_unregister.py .........                                [ 81%]
test/stdlib/components/docs/test_document.py ...                         [ 81%]
test/stdlib/components/docs/test_richdocument.py .....                   [ 82%]
test/stdlib/components/test_chat.py .                                    [ 82%]
test/stdlib/components/test_hello_world.py ..                            [ 82%]
test/stdlib/components/test_mify.py ...........                          [ 83%]
test/stdlib/components/test_transform.py ..                              [ 83%]
test/stdlib/requirements/test_reqlib_markdown.py ......                  [ 84%]
test/stdlib/requirements/test_reqlib_python.py .............sss.....     [ 87%]
test/stdlib/requirements/test_reqlib_tools.py .                          [ 87%]
test/stdlib/sampling/test_sofai_graph_coloring.py ...................... [ 89%]
                                                                         [ 89%]
test/stdlib/sampling/test_sofai_sampling.py ....................         [ 91%]
test/stdlib/test_base_context.py .....                                   [ 92%]
test/telemetry/test_logging.py ........                                  [ 93%]
test/telemetry/test_metrics.py .......................................   [ 97%]
test/telemetry/test_metrics_plugins.py ....                              [ 98%]
test/telemetry/test_metrics_token.py ....                                [ 98%]
test/telemetry/test_tracing.py ..........                                [100%]

==================================== ERRORS ====================================

... (removed 32,000 lines of error output)

=============================== warnings summary ===============================
test/backends/test_huggingface.py: 1 warning
test/stdlib/components/intrinsic/test_core.py: 2 warnings
test/stdlib/components/intrinsic/test_guardian.py: 3 warnings
test/stdlib/components/intrinsic/test_rag.py: 5 warnings
  /proj/dmfexp/eiger/users/ajbozarth/mellea/.venv/lib/python3.12/site-packages/peft/tuners/tuners_utils.py:285: UserWarning: Already found a `peft_config` attribute in the model. This will lead to having multiple adapters in the model. Make sure to know what you are doing!
    warnings.warn(

test/cli/test_alora_train_integration.py::test_alora_training_integration
test/cli/test_alora_train_integration.py::test_lora_training_integration
  /proj/dmfexp/eiger/users/ajbozarth/mellea/.venv/lib/python3.12/site-packages/trl/trainer/utils.py:103: DeprecationWarning: This class is deprecated and will be removed in version 0.20.0. To train on completion only, please use the parameter `completion_only_loss` of `SFTConfig` instead.
    warnings.warn(

test/cli/test_alora_train_integration.py::test_alora_training_integration
test/cli/test_alora_train_integration.py::test_lora_training_integration
test/cli/test_alora_train.py::test_alora_config_creation
test/cli/test_alora_train.py::test_lora_config_creation
test/cli/test_alora_train.py::test_invocation_prompt_tokenization
  /proj/dmfexp/eiger/users/ajbozarth/mellea/.venv/lib/python3.12/site-packages/trl/trainer/sft_config.py:257: DeprecationWarning: `max_seq_length` is deprecated and will be removed in version 0.20.0. Use `max_length` instead.
    warnings.warn(

test/cli/test_alora_train_integration.py::test_alora_training_integration
test/cli/test_alora_train_integration.py::test_lora_training_integration
  /proj/dmfexp/eiger/users/ajbozarth/mellea/.venv/lib/python3.12/site-packages/trl/trainer/sft_trainer.py:678: DeprecationWarning: Failed to apply the formatting function due to the following error: string index out of range. This may be because the function is designed for batched input. Please update it to process one example at a time (i.e., accept and return a single example). For now, we will attempt to apply the function in batched mode, but note that batched formatting is deprecated and will be removed in version 0.21.
    warnings.warn(

test/cli/test_alora_train_integration.py::test_alora_training_integration
test/cli/test_alora_train_integration.py::test_lora_training_integration
  /proj/dmfexp/eiger/users/ajbozarth/mellea/.venv/lib/python3.12/site-packages/torch/utils/data/_utils/pin_memory.py:57: DeprecationWarning: The argument 'device' of Tensor.pin_memory() is deprecated. Please do not pass this argument. (Triggered internally at /pytorch/aten/src/ATen/native/Memory.cpp:46.)
    return data.pin_memory(device)

test/cli/test_alora_train_integration.py::test_alora_training_integration
test/cli/test_alora_train_integration.py::test_lora_training_integration
  /proj/dmfexp/eiger/users/ajbozarth/mellea/.venv/lib/python3.12/site-packages/torch/utils/data/_utils/pin_memory.py:57: DeprecationWarning: The argument 'device' of Tensor.is_pinned() is deprecated. Please do not pass this argument. (Triggered internally at /pytorch/aten/src/ATen/native/Memory.cpp:31.)
    return data.pin_memory(device)

test/telemetry/test_metrics_backend.py: 8 warnings
test/telemetry/test_metrics.py: 24 warnings
test/telemetry/test_metrics_token.py: 4 warnings
  /proj/dmfexp/eiger/users/ajbozarth/mellea/mellea/telemetry/metrics.py:245: UserWarning: Metrics are enabled (MELLEA_METRICS_ENABLED=true) but no exporters are configured. Metrics will be collected but not exported. Set MELLEA_METRICS_PROMETHEUS=true, set MELLEA_METRICS_OTLP=true with an endpoint (OTEL_EXPORTER_OTLP_METRICS_ENDPOINT or OTEL_EXPORTER_OTLP_ENDPOINT), or set MELLEA_METRICS_CONSOLE=true to export metrics.
    _meter_provider = _setup_meter_provider()

test/telemetry/test_metrics_backend.py: 7 warnings
test/telemetry/test_metrics.py: 28 warnings
test/telemetry/test_metrics_token.py: 4 warnings
  /u/ajbozarth/.local/share/uv/python/cpython-3.12.12-linux-x86_64-gnu/lib/python3.12/importlib/__init__.py:131: UserWarning: TokenMetricsPlugin already registered: Plugin token_metrics.generation_post_call already registered
    _bootstrap._exec(spec, module)

test/backends/test_vision_openai.py::test_image_block_construction
  /proj/dmfexp/eiger/users/ajbozarth/mellea/test/backends/test_vision_openai.py:48: DeprecationWarning: 'mode' parameter is deprecated and will be removed in Pillow 13 (2026-10-15)
    random_image = Image.fromarray(random_pixel_data, "RGB")

test/backends/test_litellm_ollama.py::test_litellm_ollama_chat
test/backends/test_litellm_ollama.py::test_litellm_ollama_instruct
test/backends/test_litellm_ollama.py::test_litellm_ollama_instruct_options
test/backends/test_litellm_ollama.py::test_gen_slot
test/backends/test_litellm_ollama.py::test_generate_from_raw
test/backends/test_litellm_ollama.py::test_async_parallel_requests
test/backends/test_litellm_ollama.py::test_async_avalue
test/telemetry/test_metrics_backend.py::test_litellm_token_metrics_integration[non-streaming]
test/telemetry/test_metrics_backend.py::test_litellm_token_metrics_integration[streaming]
  /proj/dmfexp/eiger/users/ajbozarth/mellea/.venv/lib/python3.12/site-packages/aiohttp/connector.py:993: DeprecationWarning: enable_cleanup_closed ignored because https://github.com/python/cpython/pull/118960 is fixed in Python version sys.version_info(major=3, minor=12, micro=12, releaselevel='final', serial=0)
    super().__init__(

test/backends/test_vision_ollama.py::test_image_block_construction
  /proj/dmfexp/eiger/users/ajbozarth/mellea/test/backends/test_vision_ollama.py:38: DeprecationWarning: 'mode' parameter is deprecated and will be removed in Pillow 13 (2026-10-15)
    random_image = Image.fromarray(random_pixel_data, "RGB")

test/helpers/test_event_loop_helper.py::test_event_loop_handler_with_forking
  /u/ajbozarth/.local/share/uv/python/cpython-3.12.12-linux-x86_64-gnu/lib/python3.12/multiprocessing/popen_fork.py:66: DeprecationWarning: This process (pid=4140229) is multi-threaded, use of fork() may lead to deadlocks in the child.
    self.pid = os.fork()

test/stdlib/components/docs/test_richdocument.py::test_richdocument_basics
  <frozen abc>:106: DeprecationWarning: Use BaseMetaSerializer() instead.

test/stdlib/components/docs/test_richdocument.py::test_richdocument_basics
test/stdlib/components/docs/test_richdocument.py::test_richdocument_markdown
test/stdlib/components/docs/test_richdocument.py::test_richdocument_save
test/stdlib/components/docs/test_richdocument.py::test_richdocument_save
  /proj/dmfexp/eiger/users/ajbozarth/mellea/.venv/lib/python3.12/site-packages/docling_core/transforms/serializer/markdown.py:490: DeprecationWarning: Field `annotations` is deprecated; use `meta` instead.
    for ann in item.annotations

test/telemetry/test_logging.py::test_otlp_logging_enabled_without_endpoint_warns
  /proj/dmfexp/eiger/users/ajbozarth/mellea/mellea/telemetry/logging.py:97: UserWarning: OTLP logs exporter is enabled (MELLEA_LOGS_OTLP=true) but no endpoint is configured. Set OTEL_EXPORTER_OTLP_LOGS_ENDPOINT or OTEL_EXPORTER_OTLP_ENDPOINT to export logs.
    _logger_provider = _setup_logger_provider()

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
================================ tests coverage ================================
_______________ coverage: platform linux, python 3.12.12-final-0 _______________

Coverage HTML written to dir htmlcov
Coverage JSON written to file coverage.json
=========================== short test summary info ============================
FAILED test/stdlib/components/intrinsic/test_core.py::test_find_context_attributions
FAILED test/backends/test_openai_ollama.py::test_instruct - openai.APIConnect...
FAILED test/backends/test_openai_ollama.py::test_multiturn - openai.APIConnec...
FAILED test/backends/test_openai_ollama.py::test_chat - openai.APIConnectionE...
FAILED test/backends/test_openai_ollama.py::test_chat_stream - openai.APIConn...
FAILED test/backends/test_openai_ollama.py::test_format - openai.APIConnectio...
FAILED test/backends/test_openai_ollama.py::test_generate_from_raw - openai.A...
FAILED test/backends/test_openai_ollama.py::test_async_parallel_requests - op...
FAILED test/backends/test_openai_ollama.py::test_async_avalue - openai.APICon...
FAILED test/backends/test_vision_openai.py::test_image_block_in_instruction
FAILED test/backends/test_vision_openai.py::test_image_block_in_chat - openai...
FAILED test/telemetry/test_metrics_backend.py::test_openai_token_metrics_integration[non-streaming]
FAILED test/telemetry/test_metrics_backend.py::test_openai_token_metrics_integration[streaming]
FAILED test/backends/test_litellm_ollama.py::test_litellm_ollama_chat - litel...
FAILED test/backends/test_litellm_ollama.py::test_litellm_ollama_instruct - l...
FAILED test/backends/test_litellm_ollama.py::test_litellm_ollama_instruct_options
FAILED test/backends/test_litellm_ollama.py::test_gen_slot - litellm.exceptio...
FAILED test/backends/test_litellm_ollama.py::test_generate_from_raw - litellm...
FAILED test/backends/test_litellm_ollama.py::test_async_parallel_requests - l...
FAILED test/backends/test_litellm_ollama.py::test_async_avalue - litellm.exce...
FAILED test/core/test_astream_incremental.py::test_astream_returns_incremental_chunks
FAILED test/core/test_astream_incremental.py::test_astream_multiple_calls_accumulate_correctly
FAILED test/core/test_astream_incremental.py::test_astream_beginning_length_tracking
FAILED test/core/test_astream_incremental.py::test_astream_empty_beginning - ...
FAILED test/core/test_astream_incremental.py::test_non_streaming_astream - Ex...
FAILED test/stdlib/requirements/test_requirement.py::test_llmaj_validation_req_output_field
FAILED test/stdlib/requirements/test_requirement.py::test_llmaj_requirement_uses_requirement_template
FAILED test/stdlib/sampling/test_sofai_graph_coloring.py::TestSOFAIGraphColoringIntegration::test_graph_coloring_fresh_start
FAILED test/stdlib/sampling/test_sofai_graph_coloring.py::TestSOFAIGraphColoringIntegration::test_graph_coloring_continue_chat
FAILED test/stdlib/sampling/test_sofai_graph_coloring.py::TestSOFAIGraphColoringIntegration::test_graph_coloring_best_attempt
FAILED test/stdlib/sampling/test_sofai_sampling.py::TestSOFAIIntegration::test_sofai_with_ollama
FAILED test/telemetry/test_metrics_backend.py::test_ollama_token_metrics_integration[non-streaming]
FAILED test/telemetry/test_metrics_backend.py::test_ollama_token_metrics_integration[streaming]
FAILED test/telemetry/test_metrics_backend.py::test_litellm_token_metrics_integration[non-streaming]
FAILED test/telemetry/test_metrics_backend.py::test_litellm_token_metrics_integration[streaming]
FAILED test/telemetry/test_tracing.py::test_session_with_tracing_disabled - E...
FAILED test/telemetry/test_tracing.py::test_session_with_application_tracing
FAILED test/telemetry/test_tracing.py::test_session_with_backend_tracing - Ex...
FAILED test/telemetry/test_tracing.py::test_generative_function_with_tracing
FAILED test/formatters/granite/test_intrinsics_formatters.py::test_run_ollama[answerability_simple]
FAILED test/formatters/granite/test_intrinsics_formatters.py::test_run_ollama[answerability_answerable]
FAILED test/formatters/granite/test_intrinsics_formatters.py::test_run_ollama[answerability_unanswerable]
FAILED test/formatters/granite/test_intrinsics_formatters.py::test_run_ollama[hallucination_detection]
FAILED test/formatters/granite/test_intrinsics_formatters.py::test_run_ollama[query_clarification]
FAILED test/formatters/granite/test_intrinsics_formatters.py::test_run_ollama[query_rewrite]
FAILED test/formatters/granite/test_intrinsics_formatters.py::test_run_ollama[context_relevance]
FAILED test/formatters/granite/test_intrinsics_formatters.py::test_run_ollama[citations]
ERROR test/backends/test_mellea_tool.py::test_from_callable_generation - Exce...
ERROR test/backends/test_mellea_tool.py::test_from_langchain_generation - Exc...
ERROR test/backends/test_ollama.py::test_simple_instruct - Exception: could n...
ERROR test/backends/test_ollama.py::test_instruct_with_requirement - Exceptio...
ERROR test/backends/test_ollama.py::test_chat - Exception: could not create O...
ERROR test/backends/test_ollama.py::test_format - Exception: could not create...
ERROR test/backends/test_ollama.py::test_generate_from_raw - Exception: could...
ERROR test/backends/test_ollama.py::test_async_parallel_requests - Exception:...
ERROR test/backends/test_ollama.py::test_async_avalue - Exception: could not ...
ERROR test/backends/test_ollama.py::test_multiple_asyncio_runs - Exception: c...
ERROR test/backends/test_ollama.py::test_client_cache - Exception: could not ...
ERROR test/backends/test_tool_calls.py::test_tool_called_from_context_action
ERROR test/backends/test_tool_calls.py::test_tool_called - Exception: could n...
ERROR test/backends/test_tool_calls.py::test_tool_not_called - Exception: cou...
ERROR test/backends/test_vision_ollama.py::test_image_block_in_instruction - ...
ERROR test/backends/test_vision_ollama.py::test_image_block_in_chat - Excepti...
ERROR test/core/test_component_typing.py::test_generating - Exception: could ...
ERROR test/core/test_component_typing.py::test_message_typing - Exception: co...
ERROR test/core/test_component_typing.py::test_generating_with_sampling - Exc...
ERROR test/core/test_model_output_thunk.py::test_model_output_thunk_copy - Ex...
ERROR test/core/test_model_output_thunk.py::test_model_output_thunk_deepcopy
ERROR test/stdlib/components/test_genslot.py::test_gen_slot_output - Exceptio...
ERROR test/stdlib/components/test_genslot.py::test_func - Exception: could no...
ERROR test/stdlib/components/test_genslot.py::test_sentiment_output - Excepti...
ERROR test/stdlib/components/test_genslot.py::test_gen_slot_logs - Exception:...
ERROR test/stdlib/components/test_genslot.py::test_gen_slot_with_context_and_backend
ERROR test/stdlib/components/test_genslot.py::test_async_gen_slot - Exception...
ERROR test/stdlib/components/test_genslot.py::test_arg_extraction[session] - ...
ERROR test/stdlib/components/test_genslot.py::test_arg_extraction[context and backend]
ERROR test/stdlib/components/test_genslot.py::test_arg_extraction[backend without context]
ERROR test/stdlib/components/test_genslot.py::test_arg_extraction[duplicate arg and kwarg]
ERROR test/stdlib/components/test_genslot.py::test_arg_extraction[original func args as positional args]
ERROR test/stdlib/components/test_genslot.py::test_arg_extraction[session and func as kwargs]
ERROR test/stdlib/components/test_genslot.py::test_arg_extraction[all kwargs]
ERROR test/stdlib/components/test_genslot.py::test_arg_extraction[interspersed kwargs]
ERROR test/stdlib/components/test_genslot.py::test_arg_extraction[missing required args]
ERROR test/stdlib/components/test_genslot.py::test_precondition_failure - Exc...
ERROR test/stdlib/components/test_genslot.py::test_requirement - Exception: c...
ERROR test/stdlib/components/test_genslot.py::test_with_no_args - Exception: ...
ERROR test/stdlib/sampling/test_majority_voting.py::test_majority_voting_for_math
ERROR test/stdlib/sampling/test_majority_voting.py::test_MBRDRougeL - Excepti...
ERROR test/stdlib/sampling/test_sampling_ctx.py::TestSamplingCtxCase::test_ctx_for_rejection_sampling
ERROR test/stdlib/sampling/test_sampling_ctx.py::TestSamplingCtxCase::test_ctx_for_multiturn
ERROR test/stdlib/sampling/test_think_budget_forcing.py::test_think_big - Exc...
ERROR test/stdlib/sampling/test_think_budget_forcing.py::test_think_little - ...
ERROR test/stdlib/test_chat_view.py::test_chat_view_linear_ctx - Exception: c...
ERROR test/stdlib/test_chat_view.py::test_chat_view_simple_ctx - Exception: c...
ERROR test/stdlib/test_functional.py::test_func_context - Exception: could no...
ERROR test/stdlib/test_functional.py::test_aact - Exception: could not create...
ERROR test/stdlib/test_functional.py::test_ainstruct - Exception: could not c...
ERROR test/stdlib/test_functional.py::test_avalidate - Exception: could not c...
ERROR test/stdlib/test_session.py::test_start_session_openai_with_kwargs - Ex...
ERROR test/stdlib/test_session.py::test_aact - Exception: could not create Ol...
ERROR test/stdlib/test_session.py::test_ainstruct - Exception: could not crea...
ERROR test/stdlib/test_session.py::test_async_await_with_chat_context - Excep...
ERROR test/stdlib/test_session.py::test_async_without_waiting_with_chat_context
ERROR test/stdlib/test_session.py::test_session_copy_with_context_ops - Excep...
ERROR test/stdlib/test_session.py::test_powerup - Exception: could not create...
= 47 failed, 735 passed, 30 skipped, 19 deselected, 3 xfailed, 117 warnings, 58 errors in 4816.40s (1:20:16) =
[21:56:14] Shutting down ollama server...
[21:56:14] Ollama stopped.

Run	Passed	Failed	Skipped	Deselected	Notes
Local, Mac M1 Max 32GB, Python 3.12.8	800	2	61	19	2 qualitative flakes
Local slow, Mac M1 Max 32GB, Python 3.12.8	18	0	3	864	All expected
Cluster, LSF GPU node, Python 3.12.12	735	47	30	19	Also 58 errors; Ollama connectivity issue, re-run planned
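
For anyone rebuilding a row of this table from a log, the counts can be pulled out of pytest's final summary line. This is a minimal sketch, not part of the PR: `parse_summary` is a hypothetical helper, and the sample string is the summary line from the cluster run above.

```python
import re

# Matches "<count> <label>" pairs such as "47 failed" in pytest's final summary line.
# The duration ("4816.40s") and clock time ("1:20:16") do not match, because the
# digits there are not followed by a space and a lowercase word.
_COUNT_RE = re.compile(r"(\d+) ([a-z]+)")


def parse_summary(line: str) -> dict[str, int]:
    """Return outcome counts keyed by label (hypothetical helper, not in the repo)."""
    return {label: int(count) for count, label in _COUNT_RE.findall(line)}


summary = (
    "= 47 failed, 735 passed, 30 skipped, 19 deselected, 3 xfailed, "
    "117 warnings, 58 errors in 4816.40s (1:20:16) ="
)
counts = parse_summary(summary)
print(counts["passed"], counts["failed"], counts["skipped"], counts["errors"])
```

Note that errors are reported separately from failures, so a table with only Passed/Failed/Skipped/Deselected columns does not show them.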

ajbozarth · 2026-03-27T22:17:32Z

Per a conversation with @planetf1, I've fixed my own review nits to unblock this PR (he's now on vacation).

I'll hold off on merging until my cluster run passes (in case it failed because of something I need to fix) and until more reviews come in.

ajbozarth · 2026-03-27T22:28:51Z

Opened #759 for the Ollama connectivity issue that blew up my bluevela run.

ajbozarth · 2026-03-27T23:31:13Z

Re-ran the cluster tests after resolving the Ollama connectivity issue from my first run (stale server from a previous session). Results below:

Cluster run (./test/scripts/run_tests_with_ollama.sh, IBM LSF, NVIDIA GPU node, Python 3.12.12): 821 passed, 10 failed, 37 skipped, 2 errors, 19 deselected, 2 xfailed, 1 xpassed in 38m15s.

Remaining failures — none related to this PR:

test_run_ollama[*] (8 failures) — Disk quota exceeded on the cluster node when downloading LoRA files from HuggingFace; infrastructure issue
test_find_context_attributions — qualitative flake, same as local run
test_vision_openai::test_image_block_in_instruction — non-deterministic LLM output (model described image instead of yes/no); qualitative flake
test_think_budget_forcing (2 errors) — gpt-oss:20b pull failed; likely caused by the disk quota being exhausted earlier in the run rather than the model being unavailable (passes locally)
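
A triage like the one above can be started by grouping pytest's short test summary lines by test file. The following is a rough sketch under assumptions: `count_by_file` is a hypothetical helper, not something in the repo, and the sample lines are copied from the short test summary of the first cluster run shown earlier.

```python
from collections import Counter


def count_by_file(summary_lines):
    """Count FAILED/ERROR entries per test file (hypothetical helper)."""
    counts = Counter()
    for line in summary_lines:
        status, _, rest = line.partition(" ")
        if status not in {"FAILED", "ERROR"}:
            continue
        # A node id looks like "path/to/test_file.py::test_name[param] - message",
        # so everything before the first "::" is the file path.
        test_file = rest.split("::", 1)[0]
        counts[(status, test_file)] += 1
    return counts


sample = [
    "FAILED test/stdlib/components/intrinsic/test_core.py::test_find_context_attributions",
    "FAILED test/formatters/granite/test_intrinsics_formatters.py::test_run_ollama[citations]",
    "FAILED test/formatters/granite/test_intrinsics_formatters.py::test_run_ollama[query_rewrite]",
    "ERROR test/stdlib/sampling/test_think_budget_forcing.py::test_think_big - Exc...",
    "ERROR test/stdlib/sampling/test_think_budget_forcing.py::test_think_little - ...",
]

grouped = count_by_file(sample)
for (status, test_file), n in grouped.most_common():
    print(f"{n:3d} {status:6s} {test_file}")
```

Grouping by file makes clusters such as the eight `test_run_ollama[*]` failures stand out as one cause rather than eight separate problems.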

Terminal output

$ test/scripts/run_tests_with_ollama.sh 
[22:20:32] WARNING: CACHE_DIR not set. Ollama models will download to ~/.ollama (default)
[22:20:32] Using standalone log directory: logs/2026-03-27-22:20:32
[22:20:32] Starting ollama server on 127.0.0.1:11434...
[22:20:32] Added system CUDA to LD_LIBRARY_PATH
[22:20:32] Ollama server PID: 1135706
[22:20:32] Waiting for ollama to be ready...
[22:20:34] Ollama ready after 2s
[22:20:34] Model granite4:micro already pulled
[22:20:34] Model granite4:micro-h already pulled
[22:20:34] Model granite3.2-vision already pulled
[22:20:34] All models ready.
[22:20:34] Warming up models...
[22:20:34]   Warming granite4:micro ...
[22:21:38]   Warming granite4:micro-h ...
[22:21:48]   Warming granite3.2-vision ...
[22:21:55] Warmup complete.
[22:21:55] Starting pytest...
[22:21:55] Log directory: logs/2026-03-27-22:20:32
[22:21:55] Pytest args: --group-by-backend
============================= test session starts ==============================
platform linux -- Python 3.12.12, pytest-9.0.0, pluggy-1.6.0
rootdir: /proj/dmfexp/eiger/users/ajbozarth/mellea
configfile: pyproject.toml
plugins: nbmake-1.5.5, anyio-4.11.0, json-report-1.5.0, recording-0.13.4, asyncio-1.3.0, timeout-2.4.0, metadata-3.1.1, Faker-37.12.0, xdist-3.8.0, langsmith-0.6.6, cov-7.0.0
asyncio: mode=Mode.AUTO, debug=False, asyncio_default_fixture_loop_scope=None, asyncio_default_test_loop_scope=function
timeout: 900.0s
timeout method: signal
timeout func_only: False
collected 892 items / 19 deselected / 873 selected

test/backends/test_huggingface.py ...................                    [  2%]
test/backends/test_huggingface_tools.py .                                [  2%]
test/cli/test_alora_train_integration.py ..                              [  2%]
test/formatters/granite/test_intrinsics_formatters.py ....x..........    [  4%]
test/stdlib/components/docs/test_richdocument.py s                       [  4%]
test/stdlib/components/intrinsic/test_core.py ..F                        [  4%]
test/stdlib/components/intrinsic/test_guardian.py ......                 [  5%]
test/stdlib/components/intrinsic/test_rag.py .......                     [  6%]
test/stdlib/test_spans.py .x                                             [  6%]
test/telemetry/test_metrics_backend.py ..                                [  6%]
test/backends/test_openai_ollama.py .............                        [  8%]
test/backends/test_openai_vllm.py sssssss                                [  8%]
test/backends/test_vision_openai.py ..F.                                 [  9%]
test/telemetry/test_metrics_backend.py ..                                [  9%]
test/backends/test_vllm.py ........                                      [ 10%]
test/backends/test_vllm_tools.py .                                       [ 10%]
test/backends/test_litellm_ollama.py ........                            [ 11%]
test/backends/test_mellea_tool.py ..                                     [ 11%]
test/backends/test_ollama.py .....X....                                  [ 12%]
test/backends/test_tool_calls.py ...                                     [ 13%]
test/backends/test_vision_ollama.py ....                                 [ 13%]
test/core/test_astream_incremental.py ......                             [ 14%]
test/core/test_component_typing.py ...                                   [ 14%]
test/core/test_model_output_thunk.py ..                                  [ 15%]
test/stdlib/components/test_genslot.py ...................               [ 17%]
test/stdlib/requirements/test_requirement.py .....                       [ 17%]
test/stdlib/sampling/test_majority_voting.py ..                          [ 17%]
test/stdlib/sampling/test_sampling_ctx.py ..                             [ 18%]
test/stdlib/sampling/test_sofai_graph_coloring.py ...                    [ 18%]
test/stdlib/sampling/test_sofai_sampling.py .                            [ 18%]
test/stdlib/sampling/test_think_budget_forcing.py EE                     [ 18%]
test/stdlib/test_chat_view.py ..                                         [ 19%]
test/stdlib/test_functional.py ....                                      [ 19%]
test/stdlib/test_session.py s.......                                     [ 20%]
test/telemetry/test_metrics_backend.py ....                              [ 20%]
test/telemetry/test_tracing.py ....                                      [ 21%]
test/telemetry/test_tracing_backend.py ssssss                            [ 22%]
test/backends/test_bedrock.py s                                          [ 22%]
test/backends/test_litellm_watsonx.py ssss                               [ 22%]
test/backends/test_watsonx.py sssssssssss                                [ 23%]
test/telemetry/test_metrics_backend.py s                                 [ 24%]
test/backends/test_adapters/test_adapter.py .                            [ 24%]
test/backends/test_mellea_tool.py .....                                  [ 24%]
test/backends/test_model_options.py .....                                [ 25%]
test/backends/test_tool_decorator.py ...................                 [ 27%]
test/backends/test_tool_helpers.py ...                                   [ 27%]
test/backends/test_tool_validation_integration.py ...................... [ 30%]
...........                                                              [ 31%]
test/cli/test_alora_train.py ....                                        [ 32%]
test/core/test_astream_exception_propagation.py .....                    [ 32%]
test/core/test_astream_mock.py ......                                    [ 33%]
test/core/test_base.py ....                                              [ 33%]
test/core/test_component_typing.py .....                                 [ 34%]
test/decompose/test_decompose.py ..........                              [ 35%]
test/formatters/granite/test_intrinsics_formatters.py .................. [ 37%]
..................................FFFFFFFF                               [ 42%]
test/formatters/test_template_formatter.py ................              [ 44%]
test/helpers/test_event_loop_helper.py ....                              [ 44%]
test/helpers/test_server_type.py ................                        [ 46%]
test/plugins/test_all_payloads.py ...................................... [ 50%]
.............................................................            [ 57%]
test/plugins/test_blocking.py ................                           [ 59%]
test/plugins/test_build_global_context.py .......                        [ 60%]
test/plugins/test_decorators.py .........                                [ 61%]
test/plugins/test_execution_modes.py ...........................         [ 64%]
test/plugins/test_hook_call_sites.py ..............................      [ 68%]
test/plugins/test_manager.py ss......                                    [ 68%]
test/plugins/test_mellea_plugin.py .......                               [ 69%]
test/plugins/test_payloads.py ..........                                 [ 70%]
test/plugins/test_pluginset.py .........                                 [ 71%]
test/plugins/test_policies.py ......                                     [ 72%]
test/plugins/test_policy_enforcement.py ..........                       [ 73%]
test/plugins/test_priority_ordering.py ..............                    [ 75%]
test/plugins/test_scoping.py ...................................         [ 79%]
test/plugins/test_tool_hooks_redaction.py .......                        [ 80%]
test/plugins/test_unregister.py .........                                [ 81%]
test/stdlib/components/docs/test_document.py ...                         [ 81%]
test/stdlib/components/docs/test_richdocument.py .....                   [ 82%]
test/stdlib/components/test_chat.py .                                    [ 82%]
test/stdlib/components/test_hello_world.py ..                            [ 82%]
test/stdlib/components/test_mify.py ...........                          [ 83%]
test/stdlib/components/test_transform.py ..                              [ 83%]
test/stdlib/requirements/test_reqlib_markdown.py ......                  [ 84%]
test/stdlib/requirements/test_reqlib_python.py .............sss.....     [ 87%]
test/stdlib/requirements/test_reqlib_tools.py .                          [ 87%]
test/stdlib/sampling/test_sofai_graph_coloring.py ...................... [ 89%]
                                                                         [ 89%]
test/stdlib/sampling/test_sofai_sampling.py ....................         [ 91%]
test/stdlib/test_base_context.py .....                                   [ 92%]
test/telemetry/test_logging.py ........                                  [ 93%]
test/telemetry/test_metrics.py .......................................   [ 97%]
test/telemetry/test_metrics_plugins.py ....                              [ 98%]
test/telemetry/test_metrics_token.py ....                                [ 98%]
test/telemetry/test_tracing.py ..........                                [100%]

==================================== ERRORS ====================================
_______________________ ERROR at setup of test_think_big _______________________

gh_run = 0

    @pytest.fixture(scope="module")
    def m_session(gh_run):
        """Start default Mellea's session."""
        if gh_run == 1:  # on github
            m = start_session(
                "ollama", model_id=MODEL_ID, model_options={ModelOption.MAX_NEW_TOKENS: 5}
            )
        else:
>           m = start_session("ollama", model_id=MODEL_ID)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

test/stdlib/sampling/test_think_budget_forcing.py:25: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
mellea/stdlib/session.py:241: in start_session
    backend = backend_class(model_id, model_options=model_options, **backend_kwargs)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <mellea.backends.ollama.OllamaModelBackend object at 0x148cc4795df0>
model_id = ModelIdentifier(hf_model_name='openai/gpt-oss-20b', ollama_name='gpt-oss:20b', watsonx_name=None, mlx_name=None, openai_name=None, bedrock_name='openai.gpt-oss-20b', hf_tokenizer_name=None)
formatter = None, base_url = None, model_options = None

    def __init__(
        self,
        model_id: str | ModelIdentifier = model_ids.IBM_GRANITE_4_MICRO_3B,
        formatter: ChatFormatter | None = None,
        base_url: str | None = None,
        model_options: dict | None = None,
    ):
        """Initialize an Ollama backend, connecting to the server and pulling the model if needed."""
        super().__init__(
            model_id=model_id,
            formatter=(
                formatter
                if formatter is not None
                else TemplateFormatter(model_id=model_id)
            ),
            model_options=model_options,
        )
        # Run the ollama model id accessor early, so that an Assertion fails immediately if we cannot find an ollama model id for the provided ModelIdentifier.
        self._get_ollama_model_id()
    
        # Setup the client and ensure that we have the model available.
        self._base_url = base_url
        self._client = ollama.Client(base_url)
    
        self._client_cache = ClientCache(2)
    
        # Call once to set up an async client and prepopulate the cache.
        _ = self._async_client
    
        if not self._check_ollama_server():
            err = f"could not create OllamaModelBackend: ollama server not running at {base_url}"
            FancyLogger.get_logger().error(err)
            raise Exception(err)
        if not self._pull_ollama_model():
            err = f"could not create OllamaModelBackend: {self._get_ollama_model_id()} could not be pulled from ollama library"
            FancyLogger.get_logger().error(err)
>           raise Exception(err)
E           Exception: could not create OllamaModelBackend: gpt-oss:20b could not be pulled from ollama library

mellea/backends/ollama.py:97: Exception
---------------------------- Captured stdout setup -----------------------------
=== 22:40:43-ERROR ======
could not create OllamaModelBackend: gpt-oss:20b could not be pulled from ollama library
---------------------------- Captured stderr setup -----------------------------
                                                                                       
------------------------------ Captured log setup ------------------------------
ERROR    fancy_logger:ollama.py:96 could not create OllamaModelBackend: gpt-oss:20b could not be pulled from ollama library
_____________________ ERROR at setup of test_think_little ______________________

gh_run = 0

    @pytest.fixture(scope="module")
    def m_session(gh_run):
        """Start default Mellea's session."""
        if gh_run == 1:  # on github
            m = start_session(
                "ollama", model_id=MODEL_ID, model_options={ModelOption.MAX_NEW_TOKENS: 5}
            )
        else:
>           m = start_session("ollama", model_id=MODEL_ID)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

test/stdlib/sampling/test_think_budget_forcing.py:25: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
mellea/stdlib/session.py:241: in start_session
    backend = backend_class(model_id, model_options=model_options, **backend_kwargs)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <mellea.backends.ollama.OllamaModelBackend object at 0x148cc4795df0>
model_id = ModelIdentifier(hf_model_name='openai/gpt-oss-20b', ollama_name='gpt-oss:20b', watsonx_name=None, mlx_name=None, openai_name=None, bedrock_name='openai.gpt-oss-20b', hf_tokenizer_name=None)
formatter = None, base_url = None, model_options = None

    def __init__(
        self,
        model_id: str | ModelIdentifier = model_ids.IBM_GRANITE_4_MICRO_3B,
        formatter: ChatFormatter | None = None,
        base_url: str | None = None,
        model_options: dict | None = None,
    ):
        """Initialize an Ollama backend, connecting to the server and pulling the model if needed."""
        super().__init__(
            model_id=model_id,
            formatter=(
                formatter
                if formatter is not None
                else TemplateFormatter(model_id=model_id)
            ),
            model_options=model_options,
        )
        # Run the ollama model id accessor early, so that an Assertion fails immediately if we cannot find an ollama model id for the provided ModelIdentifier.
        self._get_ollama_model_id()
    
        # Setup the client and ensure that we have the model available.
        self._base_url = base_url
        self._client = ollama.Client(base_url)
    
        self._client_cache = ClientCache(2)
    
        # Call once to set up an async client and prepopulate the cache.
        _ = self._async_client
    
        if not self._check_ollama_server():
            err = f"could not create OllamaModelBackend: ollama server not running at {base_url}"
            FancyLogger.get_logger().error(err)
            raise Exception(err)
        if not self._pull_ollama_model():
            err = f"could not create OllamaModelBackend: {self._get_ollama_model_id()} could not be pulled from ollama library"
            FancyLogger.get_logger().error(err)
>           raise Exception(err)
E           Exception: could not create OllamaModelBackend: gpt-oss:20b could not be pulled from ollama library

mellea/backends/ollama.py:97: Exception
=================================== FAILURES ===================================
________________________ test_find_context_attributions ________________________

backend = <mellea.backends.huggingface.LocalHFBackend object at 0x14894643d550>

    @pytest.mark.qualitative
    def test_find_context_attributions(backend):
        """Verify that the context-attribution intrinsic functions properly."""
        context, assistant_response, documents = _read_rag_input_json(
            "context-attribution.json"
        )
        expected = _read_rag_output_json("context-attribution.json")
    
        result = core.find_context_attributions(
            assistant_response, documents, context, backend
        )
>       assert result == expected
E       AssertionError: assert [{'attributio...ne, ...}, ...] == [{'attributio...ne, ...}, ...]
E         
E         Left contains 5 more items, first extra item: {'attribution_begin': 0, 'attribution_doc_id': None, 'attribution_end': 66, 'attribution_msg_index': 2, ...}
E         Use -v to get more diff

test/stdlib/components/intrinsic/test_core.py:105: AssertionError
----------------------------- Captured stdout call -----------------------------
=== 22:30:56-INFO ======
passing in model options when generating with an adapter; some model options may be overwritten / ignored
----------------------------- Captured stderr call -----------------------------
Fetching 1 files: 100%|██████████| 1/1 [00:00<00:00, 9320.68it/s]
Fetching 3 files: 100%|██████████| 3/3 [00:00<00:00, 10960.72it/s]
------------------------------ Captured log call -------------------------------
INFO     fancy_logger:huggingface.py:475 passing in model options when generating with an adapter; some model options may be overwritten / ignored
--------------------------- Captured stdout teardown ---------------------------
=== 22:30:59-INFO ======
Cleaning up test_core backend GPU memory...
=== 22:30:59-INFO ======
  GPU before cleanup: 58.1GB free / 79.2GB total
=== 22:30:59-INFO ======
  Cleared LRU cache
=== 22:30:59-INFO ======
  Removed accelerate dispatch hooks
=== 22:31:00-INFO ======
  GPU after cleanup: 78.1GB free / 79.2GB total (reclaimed 20.0GB)
---------------------------- Captured log teardown -----------------------------
INFO     fancy_logger:conftest.py:342 Cleaning up test_core backend GPU memory...
INFO     fancy_logger:conftest.py:349   GPU before cleanup: 58.1GB free / 79.2GB total
INFO     fancy_logger:conftest.py:365   Cleared LRU cache
INFO     fancy_logger:conftest.py:402   Removed accelerate dispatch hooks
INFO     fancy_logger:conftest.py:437   GPU after cleanup: 78.1GB free / 79.2GB total (reclaimed 20.0GB)
_______________________ test_image_block_in_instruction ________________________

m_session = <mellea.stdlib.session.MelleaSession object at 0x148f2d2cdbb0>
pil_image = <PIL.Image.Image image mode=RGB size=200x150 at 0x148F2D255640>
gh_run = 0

    def test_image_block_in_instruction(
        m_session: MelleaSession, pil_image: Image.Image, gh_run: int
    ):
        image_block = ImageBlock.from_pil_image(pil_image)
    
        # Set strategy=None here since we are directly comparing the object and sampling strategies tend to do a deepcopy.
        instr = m_session.instruct(
            "Is this image mainly blue? Answer yes or no.",
            images=[image_block],
            strategy=None,
        )
        assert isinstance(instr, ModelOutputThunk)
    
        # if not on GH
        if not gh_run == 1:
>           assert "yes" in instr.value.lower() or "no" in instr.value.lower()  # type: ignore
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
E           AssertionError: assert ('yes' in '\nthe image is predominantly blue with varying shades creating a mosaic effect.' or 'no' in '\nthe image is predominantly blue with varying shades creating a mosaic effect.')
E            +  where '\nthe image is predominantly blue with varying shades creating a mosaic effect.' = <built-in method lower of str object at 0x148f2d2c65e0>()
E            +    where <built-in method lower of str object at 0x148f2d2c65e0> = '\nThe image is predominantly blue with varying shades creating a mosaic effect.'.lower
E            +      where '\nThe image is predominantly blue with varying shades creating a mosaic effect.' = ModelOutputThunk(\nThe image is predominantly blue with varying shades creating a mosaic effect.).value
E            +  and   '\nthe image is predominantly blue with varying shades creating a mosaic effect.' = <built-in method lower of str object at 0x148f2d2c65e0>()
E            +    where <built-in method lower of str object at 0x148f2d2c65e0> = '\nThe image is predominantly blue with varying shades creating a mosaic effect.'.lower
E            +      where '\nThe image is predominantly blue with varying shades creating a mosaic effect.' = ModelOutputThunk(\nThe image is predominantly blue with varying shades creating a mosaic effect.).value

test/backends/test_vision_openai.py:86: AssertionError
---------------------------- Captured stdout setup -----------------------------
=== 22:33:27-INFO ======
Starting Mellea session: backend=openai, model=granite3.2-vision, context=SimpleContext, model_options={'@@@max_new_tokens@@@': 5}
------------------------------ Captured log setup ------------------------------
INFO     fancy_logger:session.py:246 Starting Mellea session: backend=openai, model=granite3.2-vision, context=SimpleContext, model_options={'@@@max_new_tokens@@@': 5}
____________________ test_run_ollama[answerability_simple] _____________________

yaml_json_combo_for_ollama = YamlJsonCombo(short_name='answerability_simple', yaml_file=PosixPath('/u/ajbozarth/.cache/huggingface/hub/models--ibm-...rability', is_alora=False, repo_id='ibm-granite/granite-lib-rag-r1.0', revision='main', base_model_id='granite4:micro')

    def test_run_ollama(yaml_json_combo_for_ollama):
        """
        Run the target model end-to-end with a mock Ollama backend.
        """
        cfg = yaml_json_combo_for_ollama
    
        # Change base model id to Ollama's version
        if cfg.base_model_id == "ibm-granite/granite-4.0-micro":
            cfg.base_model_id = "granite4:micro"
        else:
            pytest.xfail(f"Unsupported base model: {cfg.base_model_id}")
    
        if cfg.arguments_file:
            with open(cfg.arguments_file, encoding="utf8") as f:
                transform_kwargs = json.load(f)
        else:
            transform_kwargs = {}
    
        # Load input request
        with open(cfg.inputs_file, encoding="utf-8") as f:
            model_input = ChatCompletion.model_validate_json(f.read())
        model_input.model = cfg.task
    
        # Download files from Hugging Face Hub
        try:
>           lora_dir = intrinsics_util.obtain_lora(
                cfg.task,
                cfg.base_model_id,
                cfg.repo_id,
                revision=cfg.revision,
                alora=cfg.is_alora,
            )

test/formatters/granite/test_intrinsics_formatters.py:714: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
mellea/formatters/granite/intrinsics/util.py:154: in obtain_lora
    local_root_path = huggingface_hub.snapshot_download(
.venv/lib/python3.12/site-packages/huggingface_hub/utils/_validators.py:114: in _inner_fn
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
.venv/lib/python3.12/site-packages/huggingface_hub/_snapshot_download.py:332: in snapshot_download
    thread_map(
.venv/lib/python3.12/site-packages/tqdm/contrib/concurrent.py:69: in thread_map
    return _executor_map(ThreadPoolExecutor, fn, *iterables, **tqdm_kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.venv/lib/python3.12/site-packages/tqdm/contrib/concurrent.py:51: in _executor_map
    return list(tqdm_class(ex.map(fn, *iterables, chunksize=chunksize), **kwargs))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.venv/lib/python3.12/site-packages/tqdm/std.py:1181: in __iter__
    for obj in iterable:
               ^^^^^^^^
/u/ajbozarth/.local/share/uv/python/cpython-3.12.12-linux-x86_64-gnu/lib/python3.12/concurrent/futures/_base.py:619: in result_iterator
    yield _result_or_cancel(fs.pop())
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^
/u/ajbozarth/.local/share/uv/python/cpython-3.12.12-linux-x86_64-gnu/lib/python3.12/concurrent/futures/_base.py:317: in _result_or_cancel
    return fut.result(timeout)
           ^^^^^^^^^^^^^^^^^^^
/u/ajbozarth/.local/share/uv/python/cpython-3.12.12-linux-x86_64-gnu/lib/python3.12/concurrent/futures/_base.py:456: in result
    return self.__get_result()
           ^^^^^^^^^^^^^^^^^^^
/u/ajbozarth/.local/share/uv/python/cpython-3.12.12-linux-x86_64-gnu/lib/python3.12/concurrent/futures/_base.py:401: in __get_result
    raise self._exception
/u/ajbozarth/.local/share/uv/python/cpython-3.12.12-linux-x86_64-gnu/lib/python3.12/concurrent/futures/thread.py:59: in run
    result = self.fn(*self.args, **self.kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.venv/lib/python3.12/site-packages/huggingface_hub/_snapshot_download.py:306: in _inner_hf_hub_download
    return hf_hub_download(
.venv/lib/python3.12/site-packages/huggingface_hub/utils/_validators.py:114: in _inner_fn
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
.venv/lib/python3.12/site-packages/huggingface_hub/file_download.py:1007: in hf_hub_download
    return _hf_hub_download_to_cache_dir(
.venv/lib/python3.12/site-packages/huggingface_hub/file_download.py:1168: in _hf_hub_download_to_cache_dir
    _download_to_tmp_and_move(
.venv/lib/python3.12/site-packages/huggingface_hub/file_download.py:1720: in _download_to_tmp_and_move
    xet_get(
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

    def xet_get(
        *,
        incomplete_path: Path,
        xet_file_data: XetFileData,
        headers: Dict[str, str],
        expected_size: Optional[int] = None,
        displayed_filename: Optional[str] = None,
        _tqdm_bar: Optional[tqdm] = None,
    ) -> None:
        """
        Download a file using Xet storage service.
    
        Args:
            incomplete_path (`Path`):
                The path to the file to download.
            xet_file_data (`XetFileData`):
                The file metadata needed to make the request to the xet storage service.
            headers (`Dict[str, str]`):
                The headers to send to the xet storage service.
            expected_size (`int`, *optional*):
                The expected size of the file to download. If set, the download will raise an error if the size of the
                received content is different from the expected one.
            displayed_filename (`str`, *optional*):
                The filename of the file that is being downloaded. Value is used only to display a nice progress bar. If
                not set, the filename is guessed from the URL or the `Content-Disposition` header.
    
        **How it works:**
            The file download system uses Xet storage, which is a content-addressable storage system that breaks files into chunks
            for efficient storage and transfer.
    
            `hf_xet.download_files` manages downloading files by:
            - Taking a list of files to download (each with its unique content hash)
            - Connecting to a storage server (CAS server) that knows how files are chunked
            - Using authentication to ensure secure access
            - Providing progress updates during download
    
            Authentication works by regularly refreshing access tokens through `refresh_xet_connection_info` to maintain a valid
            connection to the storage server.
    
            The download process works like this:
            1. Create a local cache folder at `~/.cache/huggingface/xet/chunk-cache` to store reusable file chunks
            2. Download files in parallel:
                2.1. Prepare to write the file to disk
                2.2. Ask the server "how is this file split into chunks?" using the file's unique hash
                    The server responds with:
                    - Which chunks make up the complete file
                    - Where each chunk can be downloaded from
                2.3. For each needed chunk:
                    - Checks if we already have it in our local cache
                    - If not, download it from cloud storage (S3)
                    - Save it to cache for future use
                    - Assemble the chunks in order to recreate the original file
    
        """
        try:
            from hf_xet import PyXetDownloadInfo, download_files  # type: ignore[no-redef]
        except ImportError:
            raise ValueError(
                "To use optimized download using Xet storage, you need to install the hf_xet package. "
                'Try `pip install "huggingface_hub[hf_xet]"` or `pip install hf_xet`.'
            )
    
        connection_info = refresh_xet_connection_info(file_data=xet_file_data, headers=headers)
    
        def token_refresher() -> Tuple[str, int]:
            connection_info = refresh_xet_connection_info(file_data=xet_file_data, headers=headers)
            if connection_info is None:
                raise ValueError("Failed to refresh token using xet metadata.")
            return connection_info.access_token, connection_info.expiration_unix_epoch
    
        xet_download_info = [
            PyXetDownloadInfo(
                destination_path=str(incomplete_path.absolute()), hash=xet_file_data.file_hash, file_size=expected_size
            )
        ]
    
        if not displayed_filename:
            displayed_filename = incomplete_path.name
    
        # Truncate filename if too long to display
        if len(displayed_filename) > 40:
            displayed_filename = f"{displayed_filename[:40]}(…)"
    
        progress_cm = _get_progress_bar_context(
            desc=displayed_filename,
            log_level=logger.getEffectiveLevel(),
            total=expected_size,
            initial=0,
            name="huggingface_hub.xet_get",
            _tqdm_bar=_tqdm_bar,
        )
    
        with progress_cm as progress:
    
            def progress_updater(progress_bytes: float):
                progress.update(progress_bytes)
    
>           download_files(
                xet_download_info,
                endpoint=connection_info.endpoint,
                token_info=(connection_info.access_token, connection_info.expiration_unix_epoch),
                token_refresher=token_refresher,
                progress_updater=[progress_updater],
            )
E           RuntimeError: Data processing error: CAS service error : IO Error: Disk quota exceeded (os error 122)

.venv/lib/python3.12/site-packages/huggingface_hub/file_download.py:626: RuntimeError
----------------------------- Captured stderr call -----------------------------
Fetching 3 files:   0%|          | 0/3 [00:00<?, ?it/s]
__________________ test_run_ollama[answerability_answerable] ___________________

yaml_json_combo_for_ollama = YamlJsonCombo(short_name='answerability_answerable', yaml_file=PosixPath('/u/ajbozarth/.cache/huggingface/hub/models--...rability', is_alora=False, repo_id='ibm-granite/granite-lib-rag-r1.0', revision='main', base_model_id='granite4:micro')

    def test_run_ollama(yaml_json_combo_for_ollama):
        """
        Run the target model end-to-end with a mock Ollama backend.
        """
        cfg = yaml_json_combo_for_ollama
    
        # Change base model id to Ollama's version
        if cfg.base_model_id == "ibm-granite/granite-4.0-micro":
            cfg.base_model_id = "granite4:micro"
        else:
            pytest.xfail(f"Unsupported base model: {cfg.base_model_id}")
    
        if cfg.arguments_file:
            with open(cfg.arguments_file, encoding="utf8") as f:
                transform_kwargs = json.load(f)
        else:
            transform_kwargs = {}
    
        # Load input request
        with open(cfg.inputs_file, encoding="utf-8") as f:
            model_input = ChatCompletion.model_validate_json(f.read())
        model_input.model = cfg.task
    
        # Download files from Hugging Face Hub
        try:
>           lora_dir = intrinsics_util.obtain_lora(
                cfg.task,
                cfg.base_model_id,
                cfg.repo_id,
                revision=cfg.revision,
                alora=cfg.is_alora,
            )

test/formatters/granite/test_intrinsics_formatters.py:714: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
mellea/formatters/granite/intrinsics/util.py:154: in obtain_lora
    local_root_path = huggingface_hub.snapshot_download(
.venv/lib/python3.12/site-packages/huggingface_hub/utils/_validators.py:114: in _inner_fn
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
.venv/lib/python3.12/site-packages/huggingface_hub/_snapshot_download.py:332: in snapshot_download
    thread_map(
.venv/lib/python3.12/site-packages/tqdm/contrib/concurrent.py:69: in thread_map
    return _executor_map(ThreadPoolExecutor, fn, *iterables, **tqdm_kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.venv/lib/python3.12/site-packages/tqdm/contrib/concurrent.py:51: in _executor_map
    return list(tqdm_class(ex.map(fn, *iterables, chunksize=chunksize), **kwargs))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.venv/lib/python3.12/site-packages/tqdm/std.py:1181: in __iter__
    for obj in iterable:
               ^^^^^^^^
/u/ajbozarth/.local/share/uv/python/cpython-3.12.12-linux-x86_64-gnu/lib/python3.12/concurrent/futures/_base.py:619: in result_iterator
    yield _result_or_cancel(fs.pop())
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^
/u/ajbozarth/.local/share/uv/python/cpython-3.12.12-linux-x86_64-gnu/lib/python3.12/concurrent/futures/_base.py:317: in _result_or_cancel
    return fut.result(timeout)
           ^^^^^^^^^^^^^^^^^^^
/u/ajbozarth/.local/share/uv/python/cpython-3.12.12-linux-x86_64-gnu/lib/python3.12/concurrent/futures/_base.py:456: in result
    return self.__get_result()
           ^^^^^^^^^^^^^^^^^^^
/u/ajbozarth/.local/share/uv/python/cpython-3.12.12-linux-x86_64-gnu/lib/python3.12/concurrent/futures/_base.py:401: in __get_result
    raise self._exception
/u/ajbozarth/.local/share/uv/python/cpython-3.12.12-linux-x86_64-gnu/lib/python3.12/concurrent/futures/thread.py:59: in run
    result = self.fn(*self.args, **self.kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.venv/lib/python3.12/site-packages/huggingface_hub/_snapshot_download.py:306: in _inner_hf_hub_download
    return hf_hub_download(
.venv/lib/python3.12/site-packages/huggingface_hub/utils/_validators.py:114: in _inner_fn
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
.venv/lib/python3.12/site-packages/huggingface_hub/file_download.py:1007: in hf_hub_download
    return _hf_hub_download_to_cache_dir(
.venv/lib/python3.12/site-packages/huggingface_hub/file_download.py:1168: in _hf_hub_download_to_cache_dir
    _download_to_tmp_and_move(
.venv/lib/python3.12/site-packages/huggingface_hub/file_download.py:1720: in _download_to_tmp_and_move
    xet_get(
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

    def xet_get(
        *,
        incomplete_path: Path,
        xet_file_data: XetFileData,
        headers: Dict[str, str],
        expected_size: Optional[int] = None,
        displayed_filename: Optional[str] = None,
        _tqdm_bar: Optional[tqdm] = None,
    ) -> None:
        """
        Download a file using Xet storage service.
    
        Args:
            incomplete_path (`Path`):
                The path to the file to download.
            xet_file_data (`XetFileData`):
                The file metadata needed to make the request to the xet storage service.
            headers (`Dict[str, str]`):
                The headers to send to the xet storage service.
            expected_size (`int`, *optional*):
                The expected size of the file to download. If set, the download will raise an error if the size of the
                received content is different from the expected one.
            displayed_filename (`str`, *optional*):
                The filename of the file that is being downloaded. Value is used only to display a nice progress bar. If
                not set, the filename is guessed from the URL or the `Content-Disposition` header.
    
        **How it works:**
            The file download system uses Xet storage, which is a content-addressable storage system that breaks files into chunks
            for efficient storage and transfer.
    
            `hf_xet.download_files` manages downloading files by:
            - Taking a list of files to download (each with its unique content hash)
            - Connecting to a storage server (CAS server) that knows how files are chunked
            - Using authentication to ensure secure access
            - Providing progress updates during download
    
            Authentication works by regularly refreshing access tokens through `refresh_xet_connection_info` to maintain a valid
            connection to the storage server.
    
            The download process works like this:
            1. Create a local cache folder at `~/.cache/huggingface/xet/chunk-cache` to store reusable file chunks
            2. Download files in parallel:
                2.1. Prepare to write the file to disk
                2.2. Ask the server "how is this file split into chunks?" using the file's unique hash
                    The server responds with:
                    - Which chunks make up the complete file
                    - Where each chunk can be downloaded from
                2.3. For each needed chunk:
                    - Checks if we already have it in our local cache
                    - If not, download it from cloud storage (S3)
                    - Save it to cache for future use
                    - Assemble the chunks in order to recreate the original file
    
        """
        try:
            from hf_xet import PyXetDownloadInfo, download_files  # type: ignore[no-redef]
        except ImportError:
            raise ValueError(
                "To use optimized download using Xet storage, you need to install the hf_xet package. "
                'Try `pip install "huggingface_hub[hf_xet]"` or `pip install hf_xet`.'
            )
    
        connection_info = refresh_xet_connection_info(file_data=xet_file_data, headers=headers)
    
        def token_refresher() -> Tuple[str, int]:
            connection_info = refresh_xet_connection_info(file_data=xet_file_data, headers=headers)
            if connection_info is None:
                raise ValueError("Failed to refresh token using xet metadata.")
            return connection_info.access_token, connection_info.expiration_unix_epoch
    
        xet_download_info = [
            PyXetDownloadInfo(
                destination_path=str(incomplete_path.absolute()), hash=xet_file_data.file_hash, file_size=expected_size
            )
        ]
    
        if not displayed_filename:
            displayed_filename = incomplete_path.name
    
        # Truncate filename if too long to display
        if len(displayed_filename) > 40:
            displayed_filename = f"{displayed_filename[:40]}(…)"
    
        progress_cm = _get_progress_bar_context(
            desc=displayed_filename,
            log_level=logger.getEffectiveLevel(),
            total=expected_size,
            initial=0,
            name="huggingface_hub.xet_get",
            _tqdm_bar=_tqdm_bar,
        )
    
        with progress_cm as progress:
    
            def progress_updater(progress_bytes: float):
                progress.update(progress_bytes)
    
>           download_files(
                xet_download_info,
                endpoint=connection_info.endpoint,
                token_info=(connection_info.access_token, connection_info.expiration_unix_epoch),
                token_refresher=token_refresher,
                progress_updater=[progress_updater],
            )
E           RuntimeError: Data processing error: CAS service error : IO Error: Disk quota exceeded (os error 122)

.venv/lib/python3.12/site-packages/huggingface_hub/file_download.py:626: RuntimeError
----------------------------- Captured stderr call -----------------------------
Fetching 3 files:   0%|          | 0/3 [00:00<?, ?it/s]
_________________ test_run_ollama[answerability_unanswerable] __________________

yaml_json_combo_for_ollama = YamlJsonCombo(short_name='answerability_unanswerable', yaml_file=PosixPath('/u/ajbozarth/.cache/huggingface/hub/models...rability', is_alora=False, repo_id='ibm-granite/granite-lib-rag-r1.0', revision='main', base_model_id='granite4:micro')

    def test_run_ollama(yaml_json_combo_for_ollama):
        """
        Run the target model end-to-end with a mock Ollama backend.
        """
        cfg = yaml_json_combo_for_ollama
    
        # Change base model id to Ollama's version
        if cfg.base_model_id == "ibm-granite/granite-4.0-micro":
            cfg.base_model_id = "granite4:micro"
        else:
            pytest.xfail(f"Unsupported base model: {cfg.base_model_id}")
    
        if cfg.arguments_file:
            with open(cfg.arguments_file, encoding="utf8") as f:
                transform_kwargs = json.load(f)
        else:
            transform_kwargs = {}
    
        # Load input request
        with open(cfg.inputs_file, encoding="utf-8") as f:
            model_input = ChatCompletion.model_validate_json(f.read())
        model_input.model = cfg.task
    
        # Download files from Hugging Face Hub
        try:
>           lora_dir = intrinsics_util.obtain_lora(
                cfg.task,
                cfg.base_model_id,
                cfg.repo_id,
                revision=cfg.revision,
                alora=cfg.is_alora,
            )

test/formatters/granite/test_intrinsics_formatters.py:714: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
mellea/formatters/granite/intrinsics/util.py:154: in obtain_lora
    local_root_path = huggingface_hub.snapshot_download(
.venv/lib/python3.12/site-packages/huggingface_hub/utils/_validators.py:114: in _inner_fn
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
.venv/lib/python3.12/site-packages/huggingface_hub/_snapshot_download.py:332: in snapshot_download
    thread_map(
.venv/lib/python3.12/site-packages/tqdm/contrib/concurrent.py:69: in thread_map
    return _executor_map(ThreadPoolExecutor, fn, *iterables, **tqdm_kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.venv/lib/python3.12/site-packages/tqdm/contrib/concurrent.py:51: in _executor_map
    return list(tqdm_class(ex.map(fn, *iterables, chunksize=chunksize), **kwargs))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.venv/lib/python3.12/site-packages/tqdm/std.py:1181: in __iter__
    for obj in iterable:
               ^^^^^^^^
/u/ajbozarth/.local/share/uv/python/cpython-3.12.12-linux-x86_64-gnu/lib/python3.12/concurrent/futures/_base.py:619: in result_iterator
    yield _result_or_cancel(fs.pop())
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^
/u/ajbozarth/.local/share/uv/python/cpython-3.12.12-linux-x86_64-gnu/lib/python3.12/concurrent/futures/_base.py:317: in _result_or_cancel
    return fut.result(timeout)
           ^^^^^^^^^^^^^^^^^^^
/u/ajbozarth/.local/share/uv/python/cpython-3.12.12-linux-x86_64-gnu/lib/python3.12/concurrent/futures/_base.py:456: in result
    return self.__get_result()
           ^^^^^^^^^^^^^^^^^^^
/u/ajbozarth/.local/share/uv/python/cpython-3.12.12-linux-x86_64-gnu/lib/python3.12/concurrent/futures/_base.py:401: in __get_result
    raise self._exception
/u/ajbozarth/.local/share/uv/python/cpython-3.12.12-linux-x86_64-gnu/lib/python3.12/concurrent/futures/thread.py:59: in run
    result = self.fn(*self.args, **self.kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.venv/lib/python3.12/site-packages/huggingface_hub/_snapshot_download.py:306: in _inner_hf_hub_download
    return hf_hub_download(
.venv/lib/python3.12/site-packages/huggingface_hub/utils/_validators.py:114: in _inner_fn
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
.venv/lib/python3.12/site-packages/huggingface_hub/file_download.py:1007: in hf_hub_download
    return _hf_hub_download_to_cache_dir(
.venv/lib/python3.12/site-packages/huggingface_hub/file_download.py:1168: in _hf_hub_download_to_cache_dir
    _download_to_tmp_and_move(
.venv/lib/python3.12/site-packages/huggingface_hub/file_download.py:1720: in _download_to_tmp_and_move
    xet_get(
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

    def xet_get(
        *,
        incomplete_path: Path,
        xet_file_data: XetFileData,
        headers: Dict[str, str],
        expected_size: Optional[int] = None,
        displayed_filename: Optional[str] = None,
        _tqdm_bar: Optional[tqdm] = None,
    ) -> None:
        """
        Download a file using Xet storage service.
    
        Args:
            incomplete_path (`Path`):
                The path to the file to download.
            xet_file_data (`XetFileData`):
                The file metadata needed to make the request to the xet storage service.
            headers (`Dict[str, str]`):
                The headers to send to the xet storage service.
            expected_size (`int`, *optional*):
                The expected size of the file to download. If set, the download will raise an error if the size of the
                received content is different from the expected one.
            displayed_filename (`str`, *optional*):
                The filename of the file that is being downloaded. Value is used only to display a nice progress bar. If
                not set, the filename is guessed from the URL or the `Content-Disposition` header.
    
        **How it works:**
            The file download system uses Xet storage, which is a content-addressable storage system that breaks files into chunks
            for efficient storage and transfer.
    
            `hf_xet.download_files` manages downloading files by:
            - Taking a list of files to download (each with its unique content hash)
            - Connecting to a storage server (CAS server) that knows how files are chunked
            - Using authentication to ensure secure access
            - Providing progress updates during download
    
            Authentication works by regularly refreshing access tokens through `refresh_xet_connection_info` to maintain a valid
            connection to the storage server.
    
            The download process works like this:
            1. Create a local cache folder at `~/.cache/huggingface/xet/chunk-cache` to store reusable file chunks
            2. Download files in parallel:
                2.1. Prepare to write the file to disk
                2.2. Ask the server "how is this file split into chunks?" using the file's unique hash
                    The server responds with:
                    - Which chunks make up the complete file
                    - Where each chunk can be downloaded from
                2.3. For each needed chunk:
                    - Checks if we already have it in our local cache
                    - If not, download it from cloud storage (S3)
                    - Save it to cache for future use
                    - Assemble the chunks in order to recreate the original file
    
        """
        try:
            from hf_xet import PyXetDownloadInfo, download_files  # type: ignore[no-redef]
        except ImportError:
            raise ValueError(
                "To use optimized download using Xet storage, you need to install the hf_xet package. "
                'Try `pip install "huggingface_hub[hf_xet]"` or `pip install hf_xet`.'
            )
    
        connection_info = refresh_xet_connection_info(file_data=xet_file_data, headers=headers)
    
        def token_refresher() -> Tuple[str, int]:
            connection_info = refresh_xet_connection_info(file_data=xet_file_data, headers=headers)
            if connection_info is None:
                raise ValueError("Failed to refresh token using xet metadata.")
            return connection_info.access_token, connection_info.expiration_unix_epoch
    
        xet_download_info = [
            PyXetDownloadInfo(
                destination_path=str(incomplete_path.absolute()), hash=xet_file_data.file_hash, file_size=expected_size
            )
        ]
    
        if not displayed_filename:
            displayed_filename = incomplete_path.name
    
        # Truncate filename if too long to display
        if len(displayed_filename) > 40:
            displayed_filename = f"{displayed_filename[:40]}(…)"
    
        progress_cm = _get_progress_bar_context(
            desc=displayed_filename,
            log_level=logger.getEffectiveLevel(),
            total=expected_size,
            initial=0,
            name="huggingface_hub.xet_get",
            _tqdm_bar=_tqdm_bar,
        )
    
        with progress_cm as progress:
    
            def progress_updater(progress_bytes: float):
                progress.update(progress_bytes)
    
>           download_files(
                xet_download_info,
                endpoint=connection_info.endpoint,
                token_info=(connection_info.access_token, connection_info.expiration_unix_epoch),
                token_refresher=token_refresher,
                progress_updater=[progress_updater],
            )
E           RuntimeError: Data processing error: CAS service error : IO Error: Disk quota exceeded (os error 122)

.venv/lib/python3.12/site-packages/huggingface_hub/file_download.py:626: RuntimeError
----------------------------- Captured stderr call -----------------------------
Fetching 3 files:   0%|          | 0/3 [00:00<?, ?it/s]
___________________ test_run_ollama[hallucination_detection] ___________________

yaml_json_combo_for_ollama = YamlJsonCombo(short_name='hallucination_detection', yaml_file=PosixPath('/u/ajbozarth/.cache/huggingface/hub/models--i...etection', is_alora=False, repo_id='ibm-granite/granite-lib-rag-r1.0', revision='main', base_model_id='granite4:micro')

    def test_run_ollama(yaml_json_combo_for_ollama):
        """
        Run the target model end-to-end with a mock Ollama backend.
        """
        cfg = yaml_json_combo_for_ollama
    
        # Change base model id to Ollama's version
        if cfg.base_model_id == "ibm-granite/granite-4.0-micro":
            cfg.base_model_id = "granite4:micro"
        else:
            pytest.xfail(f"Unsupported base model: {cfg.base_model_id}")
    
        if cfg.arguments_file:
            with open(cfg.arguments_file, encoding="utf8") as f:
                transform_kwargs = json.load(f)
        else:
            transform_kwargs = {}
    
        # Load input request
        with open(cfg.inputs_file, encoding="utf-8") as f:
            model_input = ChatCompletion.model_validate_json(f.read())
        model_input.model = cfg.task
    
        # Download files from Hugging Face Hub
        try:
>           lora_dir = intrinsics_util.obtain_lora(
                cfg.task,
                cfg.base_model_id,
                cfg.repo_id,
                revision=cfg.revision,
                alora=cfg.is_alora,
            )

test/formatters/granite/test_intrinsics_formatters.py:714: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
mellea/formatters/granite/intrinsics/util.py:154: in obtain_lora
    local_root_path = huggingface_hub.snapshot_download(
.venv/lib/python3.12/site-packages/huggingface_hub/utils/_validators.py:114: in _inner_fn
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
.venv/lib/python3.12/site-packages/huggingface_hub/_snapshot_download.py:332: in snapshot_download
    thread_map(
.venv/lib/python3.12/site-packages/tqdm/contrib/concurrent.py:69: in thread_map
    return _executor_map(ThreadPoolExecutor, fn, *iterables, **tqdm_kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.venv/lib/python3.12/site-packages/tqdm/contrib/concurrent.py:51: in _executor_map
    return list(tqdm_class(ex.map(fn, *iterables, chunksize=chunksize), **kwargs))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.venv/lib/python3.12/site-packages/tqdm/std.py:1181: in __iter__
    for obj in iterable:
               ^^^^^^^^
/u/ajbozarth/.local/share/uv/python/cpython-3.12.12-linux-x86_64-gnu/lib/python3.12/concurrent/futures/_base.py:619: in result_iterator
    yield _result_or_cancel(fs.pop())
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^
/u/ajbozarth/.local/share/uv/python/cpython-3.12.12-linux-x86_64-gnu/lib/python3.12/concurrent/futures/_base.py:317: in _result_or_cancel
    return fut.result(timeout)
           ^^^^^^^^^^^^^^^^^^^
/u/ajbozarth/.local/share/uv/python/cpython-3.12.12-linux-x86_64-gnu/lib/python3.12/concurrent/futures/_base.py:456: in result
    return self.__get_result()
           ^^^^^^^^^^^^^^^^^^^
/u/ajbozarth/.local/share/uv/python/cpython-3.12.12-linux-x86_64-gnu/lib/python3.12/concurrent/futures/_base.py:401: in __get_result
    raise self._exception
/u/ajbozarth/.local/share/uv/python/cpython-3.12.12-linux-x86_64-gnu/lib/python3.12/concurrent/futures/thread.py:59: in run
    result = self.fn(*self.args, **self.kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.venv/lib/python3.12/site-packages/huggingface_hub/_snapshot_download.py:306: in _inner_hf_hub_download
    return hf_hub_download(
.venv/lib/python3.12/site-packages/huggingface_hub/utils/_validators.py:114: in _inner_fn
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
.venv/lib/python3.12/site-packages/huggingface_hub/file_download.py:1007: in hf_hub_download
    return _hf_hub_download_to_cache_dir(
.venv/lib/python3.12/site-packages/huggingface_hub/file_download.py:1168: in _hf_hub_download_to_cache_dir
    _download_to_tmp_and_move(
.venv/lib/python3.12/site-packages/huggingface_hub/file_download.py:1720: in _download_to_tmp_and_move
    xet_get(
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

    def xet_get(
        *,
        incomplete_path: Path,
        xet_file_data: XetFileData,
        headers: Dict[str, str],
        expected_size: Optional[int] = None,
        displayed_filename: Optional[str] = None,
        _tqdm_bar: Optional[tqdm] = None,
    ) -> None:
        """
        Download a file using Xet storage service.
    
        Args:
            incomplete_path (`Path`):
                The path to the file to download.
            xet_file_data (`XetFileData`):
                The file metadata needed to make the request to the xet storage service.
            headers (`Dict[str, str]`):
                The headers to send to the xet storage service.
            expected_size (`int`, *optional*):
                The expected size of the file to download. If set, the download will raise an error if the size of the
                received content is different from the expected one.
            displayed_filename (`str`, *optional*):
                The filename of the file that is being downloaded. Value is used only to display a nice progress bar. If
                not set, the filename is guessed from the URL or the `Content-Disposition` header.
    
        **How it works:**
            The file download system uses Xet storage, which is a content-addressable storage system that breaks files into chunks
            for efficient storage and transfer.
    
            `hf_xet.download_files` manages downloading files by:
            - Taking a list of files to download (each with its unique content hash)
            - Connecting to a storage server (CAS server) that knows how files are chunked
            - Using authentication to ensure secure access
            - Providing progress updates during download
    
            Authentication works by regularly refreshing access tokens through `refresh_xet_connection_info` to maintain a valid
            connection to the storage server.
    
            The download process works like this:
            1. Create a local cache folder at `~/.cache/huggingface/xet/chunk-cache` to store reusable file chunks
            2. Download files in parallel:
                2.1. Prepare to write the file to disk
                2.2. Ask the server "how is this file split into chunks?" using the file's unique hash
                    The server responds with:
                    - Which chunks make up the complete file
                    - Where each chunk can be downloaded from
                2.3. For each needed chunk:
                    - Checks if we already have it in our local cache
                    - If not, download it from cloud storage (S3)
                    - Save it to cache for future use
                    - Assemble the chunks in order to recreate the original file
    
        """
        try:
            from hf_xet import PyXetDownloadInfo, download_files  # type: ignore[no-redef]
        except ImportError:
            raise ValueError(
                "To use optimized download using Xet storage, you need to install the hf_xet package. "
                'Try `pip install "huggingface_hub[hf_xet]"` or `pip install hf_xet`.'
            )
    
        connection_info = refresh_xet_connection_info(file_data=xet_file_data, headers=headers)
    
        def token_refresher() -> Tuple[str, int]:
            connection_info = refresh_xet_connection_info(file_data=xet_file_data, headers=headers)
            if connection_info is None:
                raise ValueError("Failed to refresh token using xet metadata.")
            return connection_info.access_token, connection_info.expiration_unix_epoch
    
        xet_download_info = [
            PyXetDownloadInfo(
                destination_path=str(incomplete_path.absolute()), hash=xet_file_data.file_hash, file_size=expected_size
            )
        ]
    
        if not displayed_filename:
            displayed_filename = incomplete_path.name
    
        # Truncate filename if too long to display
        if len(displayed_filename) > 40:
            displayed_filename = f"{displayed_filename[:40]}(…)"
    
        progress_cm = _get_progress_bar_context(
            desc=displayed_filename,
            log_level=logger.getEffectiveLevel(),
            total=expected_size,
            initial=0,
            name="huggingface_hub.xet_get",
            _tqdm_bar=_tqdm_bar,
        )
    
        with progress_cm as progress:
    
            def progress_updater(progress_bytes: float):
                progress.update(progress_bytes)
    
>           download_files(
                xet_download_info,
                endpoint=connection_info.endpoint,
                token_info=(connection_info.access_token, connection_info.expiration_unix_epoch),
                token_refresher=token_refresher,
                progress_updater=[progress_updater],
            )
E           RuntimeError: Data processing error: CAS service error : IO Error: Disk quota exceeded (os error 122)

.venv/lib/python3.12/site-packages/huggingface_hub/file_download.py:626: RuntimeError
----------------------------- Captured stderr call -----------------------------
Fetching 3 files:   0%|          | 0/3 [00:00<?, ?it/s]
_____________________ test_run_ollama[query_clarification] _____________________

yaml_json_combo_for_ollama = YamlJsonCombo(short_name='query_clarification', yaml_file=PosixPath('/u/ajbozarth/.cache/huggingface/hub/models--ibm-g...fication', is_alora=False, repo_id='ibm-granite/granite-lib-rag-r1.0', revision='main', base_model_id='granite4:micro')

    def test_run_ollama(yaml_json_combo_for_ollama):
        """
        Run the target model end-to-end with a mock Ollama backend.
        """
        cfg = yaml_json_combo_for_ollama
    
        # Change base model id to Ollama's version
        if cfg.base_model_id == "ibm-granite/granite-4.0-micro":
            cfg.base_model_id = "granite4:micro"
        else:
            pytest.xfail(f"Unsupported base model: {cfg.base_model_id}")
    
        if cfg.arguments_file:
            with open(cfg.arguments_file, encoding="utf8") as f:
                transform_kwargs = json.load(f)
        else:
            transform_kwargs = {}
    
        # Load input request
        with open(cfg.inputs_file, encoding="utf-8") as f:
            model_input = ChatCompletion.model_validate_json(f.read())
        model_input.model = cfg.task
    
        # Download files from Hugging Face Hub
        try:
>           lora_dir = intrinsics_util.obtain_lora(
                cfg.task,
                cfg.base_model_id,
                cfg.repo_id,
                revision=cfg.revision,
                alora=cfg.is_alora,
            )

test/formatters/granite/test_intrinsics_formatters.py:714: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
mellea/formatters/granite/intrinsics/util.py:154: in obtain_lora
    local_root_path = huggingface_hub.snapshot_download(
.venv/lib/python3.12/site-packages/huggingface_hub/utils/_validators.py:114: in _inner_fn
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
.venv/lib/python3.12/site-packages/huggingface_hub/_snapshot_download.py:332: in snapshot_download
    thread_map(
.venv/lib/python3.12/site-packages/tqdm/contrib/concurrent.py:69: in thread_map
    return _executor_map(ThreadPoolExecutor, fn, *iterables, **tqdm_kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.venv/lib/python3.12/site-packages/tqdm/contrib/concurrent.py:51: in _executor_map
    return list(tqdm_class(ex.map(fn, *iterables, chunksize=chunksize), **kwargs))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.venv/lib/python3.12/site-packages/tqdm/std.py:1181: in __iter__
    for obj in iterable:
               ^^^^^^^^
/u/ajbozarth/.local/share/uv/python/cpython-3.12.12-linux-x86_64-gnu/lib/python3.12/concurrent/futures/_base.py:619: in result_iterator
    yield _result_or_cancel(fs.pop())
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^
/u/ajbozarth/.local/share/uv/python/cpython-3.12.12-linux-x86_64-gnu/lib/python3.12/concurrent/futures/_base.py:317: in _result_or_cancel
    return fut.result(timeout)
           ^^^^^^^^^^^^^^^^^^^
/u/ajbozarth/.local/share/uv/python/cpython-3.12.12-linux-x86_64-gnu/lib/python3.12/concurrent/futures/_base.py:456: in result
    return self.__get_result()
           ^^^^^^^^^^^^^^^^^^^
/u/ajbozarth/.local/share/uv/python/cpython-3.12.12-linux-x86_64-gnu/lib/python3.12/concurrent/futures/_base.py:401: in __get_result
    raise self._exception
/u/ajbozarth/.local/share/uv/python/cpython-3.12.12-linux-x86_64-gnu/lib/python3.12/concurrent/futures/thread.py:59: in run
    result = self.fn(*self.args, **self.kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.venv/lib/python3.12/site-packages/huggingface_hub/_snapshot_download.py:306: in _inner_hf_hub_download
    return hf_hub_download(
.venv/lib/python3.12/site-packages/huggingface_hub/utils/_validators.py:114: in _inner_fn
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
.venv/lib/python3.12/site-packages/huggingface_hub/file_download.py:1007: in hf_hub_download
    return _hf_hub_download_to_cache_dir(
.venv/lib/python3.12/site-packages/huggingface_hub/file_download.py:1168: in _hf_hub_download_to_cache_dir
    _download_to_tmp_and_move(
.venv/lib/python3.12/site-packages/huggingface_hub/file_download.py:1720: in _download_to_tmp_and_move
    xet_get(
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

    def xet_get(
        *,
        incomplete_path: Path,
        xet_file_data: XetFileData,
        headers: Dict[str, str],
        expected_size: Optional[int] = None,
        displayed_filename: Optional[str] = None,
        _tqdm_bar: Optional[tqdm] = None,
    ) -> None:
        """
        Download a file using Xet storage service.
    
        Args:
            incomplete_path (`Path`):
                The path to the file to download.
            xet_file_data (`XetFileData`):
                The file metadata needed to make the request to the xet storage service.
            headers (`Dict[str, str]`):
                The headers to send to the xet storage service.
            expected_size (`int`, *optional*):
                The expected size of the file to download. If set, the download will raise an error if the size of the
                received content is different from the expected one.
            displayed_filename (`str`, *optional*):
                The filename of the file that is being downloaded. Value is used only to display a nice progress bar. If
                not set, the filename is guessed from the URL or the `Content-Disposition` header.
    
        **How it works:**
            The file download system uses Xet storage, which is a content-addressable storage system that breaks files into chunks
            for efficient storage and transfer.
    
            `hf_xet.download_files` manages downloading files by:
            - Taking a list of files to download (each with its unique content hash)
            - Connecting to a storage server (CAS server) that knows how files are chunked
            - Using authentication to ensure secure access
            - Providing progress updates during download
    
            Authentication works by regularly refreshing access tokens through `refresh_xet_connection_info` to maintain a valid
            connection to the storage server.
    
            The download process works like this:
            1. Create a local cache folder at `~/.cache/huggingface/xet/chunk-cache` to store reusable file chunks
            2. Download files in parallel:
                2.1. Prepare to write the file to disk
                2.2. Ask the server "how is this file split into chunks?" using the file's unique hash
                    The server responds with:
                    - Which chunks make up the complete file
                    - Where each chunk can be downloaded from
                2.3. For each needed chunk:
                    - Checks if we already have it in our local cache
                    - If not, download it from cloud storage (S3)
                    - Save it to cache for future use
                    - Assemble the chunks in order to recreate the original file
    
        """
        try:
            from hf_xet import PyXetDownloadInfo, download_files  # type: ignore[no-redef]
        except ImportError:
            raise ValueError(
                "To use optimized download using Xet storage, you need to install the hf_xet package. "
                'Try `pip install "huggingface_hub[hf_xet]"` or `pip install hf_xet`.'
            )
    
        connection_info = refresh_xet_connection_info(file_data=xet_file_data, headers=headers)
    
        def token_refresher() -> Tuple[str, int]:
            connection_info = refresh_xet_connection_info(file_data=xet_file_data, headers=headers)
            if connection_info is None:
                raise ValueError("Failed to refresh token using xet metadata.")
            return connection_info.access_token, connection_info.expiration_unix_epoch
    
        xet_download_info = [
            PyXetDownloadInfo(
                destination_path=str(incomplete_path.absolute()), hash=xet_file_data.file_hash, file_size=expected_size
            )
        ]
    
        if not displayed_filename:
            displayed_filename = incomplete_path.name
    
        # Truncate filename if too long to display
        if len(displayed_filename) > 40:
            displayed_filename = f"{displayed_filename[:40]}(…)"
    
        progress_cm = _get_progress_bar_context(
            desc=displayed_filename,
            log_level=logger.getEffectiveLevel(),
            total=expected_size,
            initial=0,
            name="huggingface_hub.xet_get",
            _tqdm_bar=_tqdm_bar,
        )
    
        with progress_cm as progress:
    
            def progress_updater(progress_bytes: float):
                progress.update(progress_bytes)
    
>           download_files(
                xet_download_info,
                endpoint=connection_info.endpoint,
                token_info=(connection_info.access_token, connection_info.expiration_unix_epoch),
                token_refresher=token_refresher,
                progress_updater=[progress_updater],
            )
E           RuntimeError: Data processing error: CAS service error : IO Error: Disk quota exceeded (os error 122)

.venv/lib/python3.12/site-packages/huggingface_hub/file_download.py:626: RuntimeError
----------------------------- Captured stderr call -----------------------------
Fetching 3 files:   0%|          | 0/3 [00:00<?, ?it/s]
________________________ test_run_ollama[query_rewrite] ________________________

yaml_json_combo_for_ollama = YamlJsonCombo(short_name='query_rewrite', yaml_file=PosixPath('/u/ajbozarth/.cache/huggingface/hub/models--ibm-granite..._rewrite', is_alora=False, repo_id='ibm-granite/granite-lib-rag-r1.0', revision='main', base_model_id='granite4:micro')

    def test_run_ollama(yaml_json_combo_for_ollama):
        """
        Run the target model end-to-end with a mock Ollama backend.
        """
        cfg = yaml_json_combo_for_ollama
    
        # Change base model id to Ollama's version
        if cfg.base_model_id == "ibm-granite/granite-4.0-micro":
            cfg.base_model_id = "granite4:micro"
        else:
            pytest.xfail(f"Unsupported base model: {cfg.base_model_id}")
    
        if cfg.arguments_file:
            with open(cfg.arguments_file, encoding="utf8") as f:
                transform_kwargs = json.load(f)
        else:
            transform_kwargs = {}
    
        # Load input request
        with open(cfg.inputs_file, encoding="utf-8") as f:
            model_input = ChatCompletion.model_validate_json(f.read())
        model_input.model = cfg.task
    
        # Download files from Hugging Face Hub
        try:
>           lora_dir = intrinsics_util.obtain_lora(
                cfg.task,
                cfg.base_model_id,
                cfg.repo_id,
                revision=cfg.revision,
                alora=cfg.is_alora,
            )

test/formatters/granite/test_intrinsics_formatters.py:714: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
mellea/formatters/granite/intrinsics/util.py:154: in obtain_lora
    local_root_path = huggingface_hub.snapshot_download(
.venv/lib/python3.12/site-packages/huggingface_hub/utils/_validators.py:114: in _inner_fn
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
.venv/lib/python3.12/site-packages/huggingface_hub/_snapshot_download.py:332: in snapshot_download
    thread_map(
.venv/lib/python3.12/site-packages/tqdm/contrib/concurrent.py:69: in thread_map
    return _executor_map(ThreadPoolExecutor, fn, *iterables, **tqdm_kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.venv/lib/python3.12/site-packages/tqdm/contrib/concurrent.py:51: in _executor_map
    return list(tqdm_class(ex.map(fn, *iterables, chunksize=chunksize), **kwargs))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.venv/lib/python3.12/site-packages/tqdm/std.py:1181: in __iter__
    for obj in iterable:
               ^^^^^^^^
/u/ajbozarth/.local/share/uv/python/cpython-3.12.12-linux-x86_64-gnu/lib/python3.12/concurrent/futures/_base.py:619: in result_iterator
    yield _result_or_cancel(fs.pop())
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^
/u/ajbozarth/.local/share/uv/python/cpython-3.12.12-linux-x86_64-gnu/lib/python3.12/concurrent/futures/_base.py:317: in _result_or_cancel
    return fut.result(timeout)
           ^^^^^^^^^^^^^^^^^^^
/u/ajbozarth/.local/share/uv/python/cpython-3.12.12-linux-x86_64-gnu/lib/python3.12/concurrent/futures/_base.py:456: in result
    return self.__get_result()
           ^^^^^^^^^^^^^^^^^^^
/u/ajbozarth/.local/share/uv/python/cpython-3.12.12-linux-x86_64-gnu/lib/python3.12/concurrent/futures/_base.py:401: in __get_result
    raise self._exception
/u/ajbozarth/.local/share/uv/python/cpython-3.12.12-linux-x86_64-gnu/lib/python3.12/concurrent/futures/thread.py:59: in run
    result = self.fn(*self.args, **self.kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.venv/lib/python3.12/site-packages/huggingface_hub/_snapshot_download.py:306: in _inner_hf_hub_download
    return hf_hub_download(
.venv/lib/python3.12/site-packages/huggingface_hub/utils/_validators.py:114: in _inner_fn
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
.venv/lib/python3.12/site-packages/huggingface_hub/file_download.py:1007: in hf_hub_download
    return _hf_hub_download_to_cache_dir(
.venv/lib/python3.12/site-packages/huggingface_hub/file_download.py:1168: in _hf_hub_download_to_cache_dir
    _download_to_tmp_and_move(
.venv/lib/python3.12/site-packages/huggingface_hub/file_download.py:1720: in _download_to_tmp_and_move
    xet_get(
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

    def xet_get(
        *,
        incomplete_path: Path,
        xet_file_data: XetFileData,
        headers: Dict[str, str],
        expected_size: Optional[int] = None,
        displayed_filename: Optional[str] = None,
        _tqdm_bar: Optional[tqdm] = None,
    ) -> None:
        """
        Download a file using Xet storage service.
    
        Args:
            incomplete_path (`Path`):
                The path to the file to download.
            xet_file_data (`XetFileData`):
                The file metadata needed to make the request to the xet storage service.
            headers (`Dict[str, str]`):
                The headers to send to the xet storage service.
            expected_size (`int`, *optional*):
                The expected size of the file to download. If set, the download will raise an error if the size of the
                received content is different from the expected one.
            displayed_filename (`str`, *optional*):
                The filename of the file that is being downloaded. Value is used only to display a nice progress bar. If
                not set, the filename is guessed from the URL or the `Content-Disposition` header.
    
        **How it works:**
            The file download system uses Xet storage, which is a content-addressable storage system that breaks files into chunks
            for efficient storage and transfer.
    
            `hf_xet.download_files` manages downloading files by:
            - Taking a list of files to download (each with its unique content hash)
            - Connecting to a storage server (CAS server) that knows how files are chunked
            - Using authentication to ensure secure access
            - Providing progress updates during download
    
            Authentication works by regularly refreshing access tokens through `refresh_xet_connection_info` to maintain a valid
            connection to the storage server.
    
            The download process works like this:
            1. Create a local cache folder at `~/.cache/huggingface/xet/chunk-cache` to store reusable file chunks
            2. Download files in parallel:
                2.1. Prepare to write the file to disk
                2.2. Ask the server "how is this file split into chunks?" using the file's unique hash
                    The server responds with:
                    - Which chunks make up the complete file
                    - Where each chunk can be downloaded from
                2.3. For each needed chunk:
                    - Checks if we already have it in our local cache
                    - If not, download it from cloud storage (S3)
                    - Save it to cache for future use
                    - Assemble the chunks in order to recreate the original file
    
        """
        try:
            from hf_xet import PyXetDownloadInfo, download_files  # type: ignore[no-redef]
        except ImportError:
            raise ValueError(
                "To use optimized download using Xet storage, you need to install the hf_xet package. "
                'Try `pip install "huggingface_hub[hf_xet]"` or `pip install hf_xet`.'
            )
    
        connection_info = refresh_xet_connection_info(file_data=xet_file_data, headers=headers)
    
        def token_refresher() -> Tuple[str, int]:
            connection_info = refresh_xet_connection_info(file_data=xet_file_data, headers=headers)
            if connection_info is None:
                raise ValueError("Failed to refresh token using xet metadata.")
            return connection_info.access_token, connection_info.expiration_unix_epoch
    
        xet_download_info = [
            PyXetDownloadInfo(
                destination_path=str(incomplete_path.absolute()), hash=xet_file_data.file_hash, file_size=expected_size
            )
        ]
    
        if not displayed_filename:
            displayed_filename = incomplete_path.name
    
        # Truncate filename if too long to display
        if len(displayed_filename) > 40:
            displayed_filename = f"{displayed_filename[:40]}(…)"
    
        progress_cm = _get_progress_bar_context(
            desc=displayed_filename,
            log_level=logger.getEffectiveLevel(),
            total=expected_size,
            initial=0,
            name="huggingface_hub.xet_get",
            _tqdm_bar=_tqdm_bar,
        )
    
        with progress_cm as progress:
    
            def progress_updater(progress_bytes: float):
                progress.update(progress_bytes)
    
>           download_files(
                xet_download_info,
                endpoint=connection_info.endpoint,
                token_info=(connection_info.access_token, connection_info.expiration_unix_epoch),
                token_refresher=token_refresher,
                progress_updater=[progress_updater],
            )
E           RuntimeError: Data processing error: CAS service error : IO Error: Disk quota exceeded (os error 122)

.venv/lib/python3.12/site-packages/huggingface_hub/file_download.py:626: RuntimeError
----------------------------- Captured stderr call -----------------------------
Fetching 3 files:   0%|          | 0/3 [00:00<?, ?it/s]
______________________ test_run_ollama[context_relevance] ______________________

yaml_json_combo_for_ollama = YamlJsonCombo(short_name='context_relevance', yaml_file=PosixPath('/u/ajbozarth/.cache/huggingface/hub/models--ibm-gra...elevance', is_alora=False, repo_id='ibm-granite/granite-lib-rag-r1.0', revision='main', base_model_id='granite4:micro')

    def test_run_ollama(yaml_json_combo_for_ollama):
        """
        Run the target model end-to-end with a mock Ollama backend.
        """
        cfg = yaml_json_combo_for_ollama
    
        # Change base model id to Ollama's version
        if cfg.base_model_id == "ibm-granite/granite-4.0-micro":
            cfg.base_model_id = "granite4:micro"
        else:
            pytest.xfail(f"Unsupported base model: {cfg.base_model_id}")
    
        if cfg.arguments_file:
            with open(cfg.arguments_file, encoding="utf8") as f:
                transform_kwargs = json.load(f)
        else:
            transform_kwargs = {}
    
        # Load input request
        with open(cfg.inputs_file, encoding="utf-8") as f:
            model_input = ChatCompletion.model_validate_json(f.read())
        model_input.model = cfg.task
    
        # Download files from Hugging Face Hub
        try:
>           lora_dir = intrinsics_util.obtain_lora(
                cfg.task,
                cfg.base_model_id,
                cfg.repo_id,
                revision=cfg.revision,
                alora=cfg.is_alora,
            )

test/formatters/granite/test_intrinsics_formatters.py:714: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
mellea/formatters/granite/intrinsics/util.py:154: in obtain_lora
    local_root_path = huggingface_hub.snapshot_download(
.venv/lib/python3.12/site-packages/huggingface_hub/utils/_validators.py:114: in _inner_fn
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
.venv/lib/python3.12/site-packages/huggingface_hub/_snapshot_download.py:332: in snapshot_download
    thread_map(
.venv/lib/python3.12/site-packages/tqdm/contrib/concurrent.py:69: in thread_map
    return _executor_map(ThreadPoolExecutor, fn, *iterables, **tqdm_kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.venv/lib/python3.12/site-packages/tqdm/contrib/concurrent.py:51: in _executor_map
    return list(tqdm_class(ex.map(fn, *iterables, chunksize=chunksize), **kwargs))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.venv/lib/python3.12/site-packages/tqdm/std.py:1181: in __iter__
    for obj in iterable:
               ^^^^^^^^
/u/ajbozarth/.local/share/uv/python/cpython-3.12.12-linux-x86_64-gnu/lib/python3.12/concurrent/futures/_base.py:619: in result_iterator
    yield _result_or_cancel(fs.pop())
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^
/u/ajbozarth/.local/share/uv/python/cpython-3.12.12-linux-x86_64-gnu/lib/python3.12/concurrent/futures/_base.py:317: in _result_or_cancel
    return fut.result(timeout)
           ^^^^^^^^^^^^^^^^^^^
/u/ajbozarth/.local/share/uv/python/cpython-3.12.12-linux-x86_64-gnu/lib/python3.12/concurrent/futures/_base.py:456: in result
    return self.__get_result()
           ^^^^^^^^^^^^^^^^^^^
/u/ajbozarth/.local/share/uv/python/cpython-3.12.12-linux-x86_64-gnu/lib/python3.12/concurrent/futures/_base.py:401: in __get_result
    raise self._exception
/u/ajbozarth/.local/share/uv/python/cpython-3.12.12-linux-x86_64-gnu/lib/python3.12/concurrent/futures/thread.py:59: in run
    result = self.fn(*self.args, **self.kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.venv/lib/python3.12/site-packages/huggingface_hub/_snapshot_download.py:306: in _inner_hf_hub_download
    return hf_hub_download(
.venv/lib/python3.12/site-packages/huggingface_hub/utils/_validators.py:114: in _inner_fn
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
.venv/lib/python3.12/site-packages/huggingface_hub/file_download.py:1007: in hf_hub_download
    return _hf_hub_download_to_cache_dir(
.venv/lib/python3.12/site-packages/huggingface_hub/file_download.py:1168: in _hf_hub_download_to_cache_dir
    _download_to_tmp_and_move(
.venv/lib/python3.12/site-packages/huggingface_hub/file_download.py:1720: in _download_to_tmp_and_move
    xet_get(
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

    def xet_get(
        *,
        incomplete_path: Path,
        xet_file_data: XetFileData,
        headers: Dict[str, str],
        expected_size: Optional[int] = None,
        displayed_filename: Optional[str] = None,
        _tqdm_bar: Optional[tqdm] = None,
    ) -> None:
        """
        Download a file using Xet storage service.
    
        Args:
            incomplete_path (`Path`):
                The path to the file to download.
            xet_file_data (`XetFileData`):
                The file metadata needed to make the request to the xet storage service.
            headers (`Dict[str, str]`):
                The headers to send to the xet storage service.
            expected_size (`int`, *optional*):
                The expected size of the file to download. If set, the download will raise an error if the size of the
                received content is different from the expected one.
            displayed_filename (`str`, *optional*):
                The filename of the file that is being downloaded. Value is used only to display a nice progress bar. If
                not set, the filename is guessed from the URL or the `Content-Disposition` header.
    
        **How it works:**
            The file download system uses Xet storage, which is a content-addressable storage system that breaks files into chunks
            for efficient storage and transfer.
    
            `hf_xet.download_files` manages downloading files by:
            - Taking a list of files to download (each with its unique content hash)
            - Connecting to a storage server (CAS server) that knows how files are chunked
            - Using authentication to ensure secure access
            - Providing progress updates during download
    
            Authentication works by regularly refreshing access tokens through `refresh_xet_connection_info` to maintain a valid
            connection to the storage server.
    
            The download process works like this:
            1. Create a local cache folder at `~/.cache/huggingface/xet/chunk-cache` to store reusable file chunks
            2. Download files in parallel:
                2.1. Prepare to write the file to disk
                2.2. Ask the server "how is this file split into chunks?" using the file's unique hash
                    The server responds with:
                    - Which chunks make up the complete file
                    - Where each chunk can be downloaded from
                2.3. For each needed chunk:
                    - Checks if we already have it in our local cache
                    - If not, download it from cloud storage (S3)
                    - Save it to cache for future use
                    - Assemble the chunks in order to recreate the original file
    
        """
        try:
            from hf_xet import PyXetDownloadInfo, download_files  # type: ignore[no-redef]
        except ImportError:
            raise ValueError(
                "To use optimized download using Xet storage, you need to install the hf_xet package. "
                'Try `pip install "huggingface_hub[hf_xet]"` or `pip install hf_xet`.'
            )
    
        connection_info = refresh_xet_connection_info(file_data=xet_file_data, headers=headers)
    
        def token_refresher() -> Tuple[str, int]:
            connection_info = refresh_xet_connection_info(file_data=xet_file_data, headers=headers)
            if connection_info is None:
                raise ValueError("Failed to refresh token using xet metadata.")
            return connection_info.access_token, connection_info.expiration_unix_epoch
    
        xet_download_info = [
            PyXetDownloadInfo(
                destination_path=str(incomplete_path.absolute()), hash=xet_file_data.file_hash, file_size=expected_size
            )
        ]
    
        if not displayed_filename:
            displayed_filename = incomplete_path.name
    
        # Truncate filename if too long to display
        if len(displayed_filename) > 40:
            displayed_filename = f"{displayed_filename[:40]}(…)"
    
        progress_cm = _get_progress_bar_context(
            desc=displayed_filename,
            log_level=logger.getEffectiveLevel(),
            total=expected_size,
            initial=0,
            name="huggingface_hub.xet_get",
            _tqdm_bar=_tqdm_bar,
        )
    
        with progress_cm as progress:
    
            def progress_updater(progress_bytes: float):
                progress.update(progress_bytes)
    
>           download_files(
                xet_download_info,
                endpoint=connection_info.endpoint,
                token_info=(connection_info.access_token, connection_info.expiration_unix_epoch),
                token_refresher=token_refresher,
                progress_updater=[progress_updater],
            )
E           RuntimeError: Data processing error: CAS service error : IO Error: Disk quota exceeded (os error 122)

.venv/lib/python3.12/site-packages/huggingface_hub/file_download.py:626: RuntimeError
----------------------------- Captured stderr call -----------------------------
Fetching 3 files:   0%|          | 0/3 [00:00<?, ?it/s]
__________________________ test_run_ollama[citations] __________________________

yaml_json_combo_for_ollama = YamlJsonCombo(short_name='citations', yaml_file=PosixPath('/u/ajbozarth/.cache/huggingface/hub/models--ibm-granite--gr...itations', is_alora=False, repo_id='ibm-granite/granite-lib-rag-r1.0', revision='main', base_model_id='granite4:micro')

    def test_run_ollama(yaml_json_combo_for_ollama):
        """
        Run the target model end-to-end with a mock Ollama backend.
        """
        cfg = yaml_json_combo_for_ollama
    
        # Change base model id to Ollama's version
        if cfg.base_model_id == "ibm-granite/granite-4.0-micro":
            cfg.base_model_id = "granite4:micro"
        else:
            pytest.xfail(f"Unsupported base model: {cfg.base_model_id}")
    
        if cfg.arguments_file:
            with open(cfg.arguments_file, encoding="utf8") as f:
                transform_kwargs = json.load(f)
        else:
            transform_kwargs = {}
    
        # Load input request
        with open(cfg.inputs_file, encoding="utf-8") as f:
            model_input = ChatCompletion.model_validate_json(f.read())
        model_input.model = cfg.task
    
        # Download files from Hugging Face Hub
        try:
>           lora_dir = intrinsics_util.obtain_lora(
                cfg.task,
                cfg.base_model_id,
                cfg.repo_id,
                revision=cfg.revision,
                alora=cfg.is_alora,
            )

test/formatters/granite/test_intrinsics_formatters.py:714: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
mellea/formatters/granite/intrinsics/util.py:154: in obtain_lora
    local_root_path = huggingface_hub.snapshot_download(
.venv/lib/python3.12/site-packages/huggingface_hub/utils/_validators.py:114: in _inner_fn
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
.venv/lib/python3.12/site-packages/huggingface_hub/_snapshot_download.py:332: in snapshot_download
    thread_map(
.venv/lib/python3.12/site-packages/tqdm/contrib/concurrent.py:69: in thread_map
    return _executor_map(ThreadPoolExecutor, fn, *iterables, **tqdm_kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.venv/lib/python3.12/site-packages/tqdm/contrib/concurrent.py:51: in _executor_map
    return list(tqdm_class(ex.map(fn, *iterables, chunksize=chunksize), **kwargs))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.venv/lib/python3.12/site-packages/tqdm/std.py:1181: in __iter__
    for obj in iterable:
               ^^^^^^^^
/u/ajbozarth/.local/share/uv/python/cpython-3.12.12-linux-x86_64-gnu/lib/python3.12/concurrent/futures/_base.py:619: in result_iterator
    yield _result_or_cancel(fs.pop())
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^
/u/ajbozarth/.local/share/uv/python/cpython-3.12.12-linux-x86_64-gnu/lib/python3.12/concurrent/futures/_base.py:317: in _result_or_cancel
    return fut.result(timeout)
           ^^^^^^^^^^^^^^^^^^^
/u/ajbozarth/.local/share/uv/python/cpython-3.12.12-linux-x86_64-gnu/lib/python3.12/concurrent/futures/_base.py:456: in result
    return self.__get_result()
           ^^^^^^^^^^^^^^^^^^^
/u/ajbozarth/.local/share/uv/python/cpython-3.12.12-linux-x86_64-gnu/lib/python3.12/concurrent/futures/_base.py:401: in __get_result
    raise self._exception
/u/ajbozarth/.local/share/uv/python/cpython-3.12.12-linux-x86_64-gnu/lib/python3.12/concurrent/futures/thread.py:59: in run
    result = self.fn(*self.args, **self.kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.venv/lib/python3.12/site-packages/huggingface_hub/_snapshot_download.py:306: in _inner_hf_hub_download
    return hf_hub_download(
.venv/lib/python3.12/site-packages/huggingface_hub/utils/_validators.py:114: in _inner_fn
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
.venv/lib/python3.12/site-packages/huggingface_hub/file_download.py:1007: in hf_hub_download
    return _hf_hub_download_to_cache_dir(
.venv/lib/python3.12/site-packages/huggingface_hub/file_download.py:1168: in _hf_hub_download_to_cache_dir
    _download_to_tmp_and_move(
.venv/lib/python3.12/site-packages/huggingface_hub/file_download.py:1720: in _download_to_tmp_and_move
    xet_get(
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

    def xet_get(
        *,
        incomplete_path: Path,
        xet_file_data: XetFileData,
        headers: Dict[str, str],
        expected_size: Optional[int] = None,
        displayed_filename: Optional[str] = None,
        _tqdm_bar: Optional[tqdm] = None,
    ) -> None:
        """
        Download a file using Xet storage service.
    
        Args:
            incomplete_path (`Path`):
                The path to the file to download.
            xet_file_data (`XetFileData`):
                The file metadata needed to make the request to the xet storage service.
            headers (`Dict[str, str]`):
                The headers to send to the xet storage service.
            expected_size (`int`, *optional*):
                The expected size of the file to download. If set, the download will raise an error if the size of the
                received content is different from the expected one.
            displayed_filename (`str`, *optional*):
                The filename of the file that is being downloaded. Value is used only to display a nice progress bar. If
                not set, the filename is guessed from the URL or the `Content-Disposition` header.
    
        **How it works:**
            The file download system uses Xet storage, which is a content-addressable storage system that breaks files into chunks
            for efficient storage and transfer.
    
            `hf_xet.download_files` manages downloading files by:
            - Taking a list of files to download (each with its unique content hash)
            - Connecting to a storage server (CAS server) that knows how files are chunked
            - Using authentication to ensure secure access
            - Providing progress updates during download
    
            Authentication works by regularly refreshing access tokens through `refresh_xet_connection_info` to maintain a valid
            connection to the storage server.
    
            The download process works like this:
            1. Create a local cache folder at `~/.cache/huggingface/xet/chunk-cache` to store reusable file chunks
            2. Download files in parallel:
                2.1. Prepare to write the file to disk
                2.2. Ask the server "how is this file split into chunks?" using the file's unique hash
                    The server responds with:
                    - Which chunks make up the complete file
                    - Where each chunk can be downloaded from
                2.3. For each needed chunk:
                    - Checks if we already have it in our local cache
                    - If not, download it from cloud storage (S3)
                    - Save it to cache for future use
                    - Assemble the chunks in order to recreate the original file
    
        """
        try:
            from hf_xet import PyXetDownloadInfo, download_files  # type: ignore[no-redef]
        except ImportError:
            raise ValueError(
                "To use optimized download using Xet storage, you need to install the hf_xet package. "
                'Try `pip install "huggingface_hub[hf_xet]"` or `pip install hf_xet`.'
            )
    
        connection_info = refresh_xet_connection_info(file_data=xet_file_data, headers=headers)
    
        def token_refresher() -> Tuple[str, int]:
            connection_info = refresh_xet_connection_info(file_data=xet_file_data, headers=headers)
            if connection_info is None:
                raise ValueError("Failed to refresh token using xet metadata.")
            return connection_info.access_token, connection_info.expiration_unix_epoch
    
        xet_download_info = [
            PyXetDownloadInfo(
                destination_path=str(incomplete_path.absolute()), hash=xet_file_data.file_hash, file_size=expected_size
            )
        ]
    
        if not displayed_filename:
            displayed_filename = incomplete_path.name
    
        # Truncate filename if too long to display
        if len(displayed_filename) > 40:
            displayed_filename = f"{displayed_filename[:40]}(…)"
    
        progress_cm = _get_progress_bar_context(
            desc=displayed_filename,
            log_level=logger.getEffectiveLevel(),
            total=expected_size,
            initial=0,
            name="huggingface_hub.xet_get",
            _tqdm_bar=_tqdm_bar,
        )
    
        with progress_cm as progress:
    
            def progress_updater(progress_bytes: float):
                progress.update(progress_bytes)
    
>           download_files(
                xet_download_info,
                endpoint=connection_info.endpoint,
                token_info=(connection_info.access_token, connection_info.expiration_unix_epoch),
                token_refresher=token_refresher,
                progress_updater=[progress_updater],
            )
E           RuntimeError: Data processing error: CAS service error : IO Error: Disk quota exceeded (os error 122)

.venv/lib/python3.12/site-packages/huggingface_hub/file_download.py:626: RuntimeError
----------------------------- Captured stderr call -----------------------------
Fetching 3 files:   0%|          | 0/3 [00:00<?, ?it/s]
=============================== warnings summary ===============================
test/backends/test_huggingface.py: 1 warning
test/stdlib/components/intrinsic/test_core.py: 2 warnings
test/stdlib/components/intrinsic/test_guardian.py: 3 warnings
test/stdlib/components/intrinsic/test_rag.py: 5 warnings
  /proj/dmfexp/eiger/users/ajbozarth/mellea/.venv/lib/python3.12/site-packages/peft/tuners/tuners_utils.py:285: UserWarning: Already found a `peft_config` attribute in the model. This will lead to having multiple adapters in the model. Make sure to know what you are doing!
    warnings.warn(

test/cli/test_alora_train_integration.py::test_alora_training_integration
test/cli/test_alora_train_integration.py::test_lora_training_integration
  /proj/dmfexp/eiger/users/ajbozarth/mellea/.venv/lib/python3.12/site-packages/trl/trainer/utils.py:103: DeprecationWarning: This class is deprecated and will be removed in version 0.20.0. To train on completion only, please use the parameter `completion_only_loss` of `SFTConfig` instead.
    warnings.warn(

test/cli/test_alora_train_integration.py::test_alora_training_integration
test/cli/test_alora_train_integration.py::test_lora_training_integration
test/cli/test_alora_train.py::test_alora_config_creation
test/cli/test_alora_train.py::test_lora_config_creation
test/cli/test_alora_train.py::test_invocation_prompt_tokenization
  /proj/dmfexp/eiger/users/ajbozarth/mellea/.venv/lib/python3.12/site-packages/trl/trainer/sft_config.py:257: DeprecationWarning: `max_seq_length` is deprecated and will be removed in version 0.20.0. Use `max_length` instead.
    warnings.warn(

test/cli/test_alora_train_integration.py::test_alora_training_integration
test/cli/test_alora_train_integration.py::test_lora_training_integration
  /proj/dmfexp/eiger/users/ajbozarth/mellea/.venv/lib/python3.12/site-packages/trl/trainer/sft_trainer.py:678: DeprecationWarning: Failed to apply the formatting function due to the following error: string index out of range. This may be because the function is designed for batched input. Please update it to process one example at a time (i.e., accept and return a single example). For now, we will attempt to apply the function in batched mode, but note that batched formatting is deprecated and will be removed in version 0.21.
    warnings.warn(

test/cli/test_alora_train_integration.py::test_alora_training_integration
test/cli/test_alora_train_integration.py::test_lora_training_integration
  /proj/dmfexp/eiger/users/ajbozarth/mellea/.venv/lib/python3.12/site-packages/torch/utils/data/_utils/pin_memory.py:57: DeprecationWarning: The argument 'device' of Tensor.pin_memory() is deprecated. Please do not pass this argument. (Triggered internally at /pytorch/aten/src/ATen/native/Memory.cpp:46.)
    return data.pin_memory(device)

test/cli/test_alora_train_integration.py::test_alora_training_integration
test/cli/test_alora_train_integration.py::test_lora_training_integration
  /proj/dmfexp/eiger/users/ajbozarth/mellea/.venv/lib/python3.12/site-packages/torch/utils/data/_utils/pin_memory.py:57: DeprecationWarning: The argument 'device' of Tensor.is_pinned() is deprecated. Please do not pass this argument. (Triggered internally at /pytorch/aten/src/ATen/native/Memory.cpp:31.)
    return data.pin_memory(device)

test/telemetry/test_metrics_backend.py: 8 warnings
test/telemetry/test_metrics.py: 24 warnings
test/telemetry/test_metrics_token.py: 4 warnings
  /proj/dmfexp/eiger/users/ajbozarth/mellea/mellea/telemetry/metrics.py:245: UserWarning: Metrics are enabled (MELLEA_METRICS_ENABLED=true) but no exporters are configured. Metrics will be collected but not exported. Set MELLEA_METRICS_PROMETHEUS=true, set MELLEA_METRICS_OTLP=true with an endpoint (OTEL_EXPORTER_OTLP_METRICS_ENDPOINT or OTEL_EXPORTER_OTLP_ENDPOINT), or set MELLEA_METRICS_CONSOLE=true to export metrics.
    _meter_provider = _setup_meter_provider()

test/telemetry/test_metrics_backend.py: 7 warnings
test/telemetry/test_metrics.py: 28 warnings
test/telemetry/test_metrics_token.py: 4 warnings
  /u/ajbozarth/.local/share/uv/python/cpython-3.12.12-linux-x86_64-gnu/lib/python3.12/importlib/__init__.py:131: UserWarning: TokenMetricsPlugin already registered: Plugin token_metrics.generation_post_call already registered
    _bootstrap._exec(spec, module)

test/backends/test_vision_openai.py::test_image_block_construction
  /proj/dmfexp/eiger/users/ajbozarth/mellea/test/backends/test_vision_openai.py:48: DeprecationWarning: 'mode' parameter is deprecated and will be removed in Pillow 13 (2026-10-15)
    random_image = Image.fromarray(random_pixel_data, "RGB")

test/backends/test_litellm_ollama.py::test_litellm_ollama_chat
test/backends/test_litellm_ollama.py::test_generate_from_raw
test/backends/test_litellm_ollama.py::test_async_parallel_requests
test/backends/test_litellm_ollama.py::test_async_avalue
test/telemetry/test_metrics_backend.py::test_litellm_token_metrics_integration[non-streaming]
test/telemetry/test_metrics_backend.py::test_litellm_token_metrics_integration[streaming]
  /proj/dmfexp/eiger/users/ajbozarth/mellea/.venv/lib/python3.12/site-packages/aiohttp/connector.py:993: DeprecationWarning: enable_cleanup_closed ignored because https://github.com/python/cpython/pull/118960 is fixed in Python version sys.version_info(major=3, minor=12, micro=12, releaselevel='final', serial=0)
    super().__init__(

test/backends/test_litellm_ollama.py::test_litellm_ollama_chat
  /proj/dmfexp/eiger/users/ajbozarth/mellea/.venv/lib/python3.12/site-packages/pydantic/main.py:464: UserWarning: Pydantic serializer warnings:
    PydanticSerializationUnexpectedValue(Expected 10 fields but got 5: Expected `Message` - serialized value may not be as expected [field_name='message', input_value=Message(content='The answ...er_specific_fields=None), input_type=Message])
    PydanticSerializationUnexpectedValue(Expected `StreamingChoices` - serialized value may not be as expected [field_name='choices', input_value=Choices(finish_reason='st...r_specific_fields=None)), input_type=Choices])
    return self.__pydantic_serializer__.to_python(

test/backends/test_litellm_ollama.py::test_litellm_ollama_instruct
  /proj/dmfexp/eiger/users/ajbozarth/mellea/.venv/lib/python3.12/site-packages/pydantic/main.py:464: UserWarning: Pydantic serializer warnings:
    PydanticSerializationUnexpectedValue(Expected 10 fields but got 5: Expected `Message` - serialized value may not be as expected [field_name='message', input_value=Message(content='Subject:...er_specific_fields=None), input_type=Message])
    PydanticSerializationUnexpectedValue(Expected `StreamingChoices` - serialized value may not be as expected [field_name='choices', input_value=Choices(finish_reason='st...r_specific_fields=None)), input_type=Choices])
    return self.__pydantic_serializer__.to_python(

test/backends/test_litellm_ollama.py::test_litellm_ollama_instruct
test/backends/test_litellm_ollama.py::test_litellm_ollama_instruct_options
  /proj/dmfexp/eiger/users/ajbozarth/mellea/.venv/lib/python3.12/site-packages/pydantic/main.py:464: UserWarning: Pydantic serializer warnings:
    PydanticSerializationUnexpectedValue(Expected 10 fields but got 5: Expected `Message` - serialized value may not be as expected [field_name='message', input_value=Message(content='yes', ro...er_specific_fields=None), input_type=Message])
    PydanticSerializationUnexpectedValue(Expected `StreamingChoices` - serialized value may not be as expected [field_name='choices', input_value=Choices(finish_reason='st...r_specific_fields=None)), input_type=Choices])
    return self.__pydantic_serializer__.to_python(

test/backends/test_litellm_ollama.py::test_litellm_ollama_instruct_options
  /proj/dmfexp/eiger/users/ajbozarth/mellea/.venv/lib/python3.12/site-packages/pydantic/main.py:464: UserWarning: Pydantic serializer warnings:
    PydanticSerializationUnexpectedValue(Expected 10 fields but got 5: Expected `Message` - serialized value may not be as expected [field_name='message', input_value=Message(content="Subject:...er_specific_fields=None), input_type=Message])
    PydanticSerializationUnexpectedValue(Expected `StreamingChoices` - serialized value may not be as expected [field_name='choices', input_value=Choices(finish_reason='st...r_specific_fields=None)), input_type=Choices])
    return self.__pydantic_serializer__.to_python(

test/backends/test_litellm_ollama.py::test_gen_slot
  /proj/dmfexp/eiger/users/ajbozarth/mellea/.venv/lib/python3.12/site-packages/pydantic/main.py:464: UserWarning: Pydantic serializer warnings:
    PydanticSerializationUnexpectedValue(Expected 10 fields but got 5: Expected `Message` - serialized value may not be as expected [field_name='message', input_value=Message(content='{\n    "...er_specific_fields=None), input_type=Message])
    PydanticSerializationUnexpectedValue(Expected `StreamingChoices` - serialized value may not be as expected [field_name='choices', input_value=Choices(finish_reason='st...r_specific_fields=None)), input_type=Choices])
    return self.__pydantic_serializer__.to_python(

test/backends/test_litellm_ollama.py::test_async_parallel_requests
test/telemetry/test_metrics_backend.py::test_litellm_token_metrics_integration[streaming]
  /proj/dmfexp/eiger/users/ajbozarth/mellea/.venv/lib/python3.12/site-packages/litellm/litellm_core_utils/streaming_handler.py:1855: PydanticDeprecatedSince20: The `dict` method is deprecated; use `model_dump` instead. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.12/migration/
    obj_dict = processed_chunk.dict()

test/backends/test_litellm_ollama.py::test_async_parallel_requests
  /proj/dmfexp/eiger/users/ajbozarth/mellea/.venv/lib/python3.12/site-packages/pydantic/main.py:464: UserWarning: Pydantic serializer warnings:
    PydanticSerializationUnexpectedValue(Expected 10 fields but got 5: Expected `Message` - serialized value may not be as expected [field_name='message', input_value=Message(content="Goodbye!...er_specific_fields=None), input_type=Message])
    PydanticSerializationUnexpectedValue(Expected `StreamingChoices` - serialized value may not be as expected [field_name='choices', input_value=Choices(finish_reason='st...r_specific_fields=None)), input_type=Choices])
    return self.__pydantic_serializer__.to_python(

test/backends/test_litellm_ollama.py::test_async_parallel_requests
test/backends/test_litellm_ollama.py::test_async_avalue
test/backends/test_litellm_ollama.py::test_async_avalue
  /proj/dmfexp/eiger/users/ajbozarth/mellea/.venv/lib/python3.12/site-packages/pydantic/main.py:464: UserWarning: Pydantic serializer warnings:
    PydanticSerializationUnexpectedValue(Expected 10 fields but got 5: Expected `Message` - serialized value may not be as expected [field_name='message', input_value=Message(content='Hello! H...er_specific_fields=None), input_type=Message])
    PydanticSerializationUnexpectedValue(Expected `StreamingChoices` - serialized value may not be as expected [field_name='choices', input_value=Choices(finish_reason='st...r_specific_fields=None)), input_type=Choices])
    return self.__pydantic_serializer__.to_python(

test/backends/test_tool_calls.py::test_tool_called_from_context_action
  <frozen abc>:106: DeprecationWarning: Use BaseMetaSerializer() instead.

test/backends/test_vision_ollama.py::test_image_block_construction
  /proj/dmfexp/eiger/users/ajbozarth/mellea/test/backends/test_vision_ollama.py:38: DeprecationWarning: 'mode' parameter is deprecated and will be removed in Pillow 13 (2026-10-15)
    random_image = Image.fromarray(random_pixel_data, "RGB")

test/telemetry/test_metrics_backend.py::test_litellm_token_metrics_integration[non-streaming]
test/telemetry/test_tracing.py::test_session_with_tracing_disabled
  /proj/dmfexp/eiger/users/ajbozarth/mellea/.venv/lib/python3.12/site-packages/pydantic/main.py:464: UserWarning: Pydantic serializer warnings:
    PydanticSerializationUnexpectedValue(Expected 10 fields but got 5: Expected `Message` - serialized value may not be as expected [field_name='message', input_value=Message(content="I'm here...ields={'refusal': None}), input_type=Message])
    PydanticSerializationUnexpectedValue(Expected `StreamingChoices` - serialized value may not be as expected [field_name='choices', input_value=Choices(finish_reason='st...ider_specific_fields={}), input_type=Choices])
    return self.__pydantic_serializer__.to_python(

test/telemetry/test_metrics_backend.py::test_litellm_token_metrics_integration[streaming]
  /proj/dmfexp/eiger/users/ajbozarth/mellea/.venv/lib/python3.12/site-packages/litellm/litellm_core_utils/model_response_utils.py:206: PydanticDeprecatedSince211: Accessing the 'model_computed_fields' attribute on the instance is deprecated. Instead, you should access this attribute from the model class. Deprecated in Pydantic V2.11 to be removed in V3.0.
    or callable(getattr(delta, attr_name))

test/telemetry/test_metrics_backend.py::test_litellm_token_metrics_integration[streaming]
  /proj/dmfexp/eiger/users/ajbozarth/mellea/.venv/lib/python3.12/site-packages/litellm/litellm_core_utils/model_response_utils.py:206: PydanticDeprecatedSince211: Accessing the 'model_fields' attribute on the instance is deprecated. Instead, you should access this attribute from the model class. Deprecated in Pydantic V2.11 to be removed in V3.0.
    or callable(getattr(delta, attr_name))

test/telemetry/test_metrics_backend.py::test_litellm_token_metrics_integration[streaming]
  /proj/dmfexp/eiger/users/ajbozarth/mellea/.venv/lib/python3.12/site-packages/pydantic/main.py:464: UserWarning: Pydantic serializer warnings:
    PydanticSerializationUnexpectedValue(Expected 10 fields but got 5: Expected `Message` - serialized value may not be as expected [field_name='message', input_value=Message(content='As an AI...er_specific_fields=None), input_type=Message])
    PydanticSerializationUnexpectedValue(Expected `StreamingChoices` - serialized value may not be as expected [field_name='choices', input_value=Choices(finish_reason='st...r_specific_fields=None)), input_type=Choices])
    return self.__pydantic_serializer__.to_python(

test/helpers/test_event_loop_helper.py::test_event_loop_handler_with_forking
  /u/ajbozarth/.local/share/uv/python/cpython-3.12.12-linux-x86_64-gnu/lib/python3.12/multiprocessing/popen_fork.py:66: DeprecationWarning: This process (pid=1137938) is multi-threaded, use of fork() may lead to deadlocks in the child.
    self.pid = os.fork()

test/stdlib/components/docs/test_richdocument.py::test_richdocument_basics
test/stdlib/components/docs/test_richdocument.py::test_richdocument_markdown
test/stdlib/components/docs/test_richdocument.py::test_richdocument_save
test/stdlib/components/docs/test_richdocument.py::test_richdocument_save
  /proj/dmfexp/eiger/users/ajbozarth/mellea/.venv/lib/python3.12/site-packages/docling_core/transforms/serializer/markdown.py:490: DeprecationWarning: Field `annotations` is deprecated; use `meta` instead.
    for ann in item.annotations

test/telemetry/test_logging.py::test_otlp_logging_enabled_without_endpoint_warns
  /proj/dmfexp/eiger/users/ajbozarth/mellea/mellea/telemetry/logging.py:97: UserWarning: OTLP logs exporter is enabled (MELLEA_LOGS_OTLP=true) but no endpoint is configured. Set OTEL_EXPORTER_OTLP_LOGS_ENDPOINT or OTEL_EXPORTER_OTLP_ENDPOINT to export logs.
    _logger_provider = _setup_logger_provider()

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
================================ tests coverage ================================
_______________ coverage: platform linux, python 3.12.12-final-0 _______________

Coverage HTML written to dir htmlcov
Coverage JSON written to file coverage.json
=========================== short test summary info ============================
FAILED test/stdlib/components/intrinsic/test_core.py::test_find_context_attributions
FAILED test/backends/test_vision_openai.py::test_image_block_in_instruction
FAILED test/formatters/granite/test_intrinsics_formatters.py::test_run_ollama[answerability_simple]
FAILED test/formatters/granite/test_intrinsics_formatters.py::test_run_ollama[answerability_answerable]
FAILED test/formatters/granite/test_intrinsics_formatters.py::test_run_ollama[answerability_unanswerable]
FAILED test/formatters/granite/test_intrinsics_formatters.py::test_run_ollama[hallucination_detection]
FAILED test/formatters/granite/test_intrinsics_formatters.py::test_run_ollama[query_clarification]
FAILED test/formatters/granite/test_intrinsics_formatters.py::test_run_ollama[query_rewrite]
FAILED test/formatters/granite/test_intrinsics_formatters.py::test_run_ollama[context_relevance]
FAILED test/formatters/granite/test_intrinsics_formatters.py::test_run_ollama[citations]
ERROR test/stdlib/sampling/test_think_budget_forcing.py::test_think_big - Exc...
ERROR test/stdlib/sampling/test_think_budget_forcing.py::test_think_little - ...
= 10 failed, 821 passed, 37 skipped, 19 deselected, 2 xfailed, 1 xpassed, 131 warnings, 2 errors in 2295.03s (0:38:15) =
[23:00:24] Shutting down ollama server...
[23:00:24] Ollama stopped.

Run	Passed	Failed	Skipped	Deselected	Notes
Local (`uv run pytest`, Mac M1 Max 32GB, Python 3.12.8)	800	2	61	19	2 qualitative flakes
Local slow (`uv run pytest -m slow`, Mac M1 Max 32GB, Python 3.12.8)	18	0	3	864	All expected
Cluster (`run_tests_with_ollama.sh`, LSF GPU node, Python 3.12.12)	821	10	37	19	Disk quota + 2 qualitative flakes
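
For anyone rebuilding this table from a later run, the counts can be read straight off the final pytest summary line. A minimal sketch, assuming the summary format shown in the cluster log above; `parse_summary` is a hypothetical helper, not something added by this PR:

```python
import re

# Summary line copied from the cluster run above.
SUMMARY = (
    "= 10 failed, 821 passed, 37 skipped, 19 deselected, 2 xfailed, "
    "1 xpassed, 131 warnings, 2 errors in 2295.03s (0:38:15) ="
)


def parse_summary(line: str) -> dict[str, int]:
    """Map each outcome word in a pytest summary line to its count."""
    # Drop the trailing duration so only the "<count> <outcome>" pairs remain.
    counts_part = line.split(" in ")[0]
    return {name: int(num) for num, name in re.findall(r"(\d+) ([a-z]+)", counts_part)}


counts = parse_summary(SUMMARY)
print(counts["passed"], counts["failed"], counts["skipped"], counts["deselected"])
```

The table has no column for errors, so `counts["errors"]` is worth checking separately when comparing runs.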

ajbozarth · 2026-03-27T23:33:38Z

Looking into the above issues, I've hit a handful of things to address that I will be returning to on Monday, including but not limited to:

adding missing qualitative marks for the vision tests (note the failure above)
figure out why I can't use the skills that were added in this PR (Claude refuses to see them, so I "ran" the skill by looking at its md manually)
figure out why I got a Disk quota exceeded on bluevela

Signed-off-by: Alex Bozarth <ajbozart@us.ibm.com>

planetf1 · 2026-03-28T09:59:54Z

Looking into the above issues, I've hit a handful of things to address that I will be returning to on Monday, including but not limited to:

Thanks!

adding missing qualitative marks for the vision tests (note the failure above)

I had a look at the test. It does test some structural aspects, but it also has a few qualitative checks in it. I didn't hit the issue across multiple runs, but I agree with your suggested classification. It would benefit from a rewrite to tease out the different aspects. I also looked at the skill, but we're into subtle detail here, so I don't think there's anything else to add.

Pushed a fix for this.

figure out why I can't use the skills that were added in this PR (Claude refuses to see them, so I "ran" the skill by looking at its md manually)

My error: I had left a necessary file in .gitignore for project-specific config (which is why it worked for me!). That is accepted best practice and Claude's intent. I also tweaked the description slightly, as we discussed last week, so it considers the one-off cases (though I'm sure more tweaks are possible).

figure out why I got a Disk quota exceeded on bluevela
Infra followup ...

I've also done a rebase on upstream/main (with no conflicts). It is not squashed, but if/when we're ready I can do that to make it easier to track upstream whilst this is being reviewed.

Trying a cluster run with 'uv sync --all-extras --all-groups && uv run test/scripts/run_tests_with_ollama.sh' (and using -gpu "num=1:mode=shared:j_exclusive=yes")

Thanks for the thorough checks and patches.

ajbozarth · 2026-03-31T01:46:05Z

I've rebased on main to bring in the fixes in #765 and #764, in addition to the fixes @planetf1 made on Saturday, and then reran the tests.

Test run summary

Local run (uv run pytest, Mac M1 Max 32GB, Python 3.12.8): 800 passed, 2 failed, 61 skipped, 19 deselected, 2 xfailed, 1 xpassed in 17m16s — identical counts to the original review run. Skips breakdown unchanged (see previous comment).

Same 2 @pytest.mark.qualitative failures:

test_find_context_attributions — IndexError in sentence-number decoder; model returned an index outside the prepared range — non-deterministic LLM output
test_hallucination_detection — faithfulness_likelihood off by 0.0228 (just outside ±0.02 tolerance); same floating-point qualitative flake as before
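
To make the second flake concrete: a 0.0228 gap fails any absolute-tolerance check set at 0.02, whichever way the comparison is written. A minimal sketch using only the standard library; the `expected` value is a placeholder (only the 0.0228 gap and the ±0.02 tolerance come from the report above), and the real test may well use a different comparison helper:

```python
import math

# Placeholder baseline; only the gap and the tolerance are taken from the report.
expected = 0.5000
observed = expected + 0.0228
TOLERANCE = 0.02


def within_tolerance(actual: float, target: float, tol: float = TOLERANCE) -> bool:
    """Absolute-tolerance comparison; rel_tol=0 so only `tol` matters."""
    return math.isclose(actual, target, rel_tol=0.0, abs_tol=tol)


print(within_tolerance(observed, expected))           # False: 0.0228 > 0.02
print(within_tolerance(expected + 0.0150, expected))  # True: 0.0150 <= 0.02
```

Whether widening the tolerance or keeping the test as a known qualitative flake is the better fix is a judgment call for the test owner.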

Cluster run (./test/scripts/run_tests_with_ollama.sh, IBM LSF, NVIDIA GPU node, Python 3.12.12): 822 passed, 9 failed, 37 skipped, 19 deselected, 2 xfailed, 1 xpassed, 2 errors in 37m33s — marginal improvement over the previous cluster run (821/10/2err).

Same failures as before, mostly infrastructure:

test_run_ollama[*] (8 failures) — Disk quota exceeded downloading LoRA files from HuggingFace; same cluster infrastructure issue
test_find_context_attributions — qualitative flake (different manifestation: 5 extra attribution items vs expected, vs IndexError locally)
test_think_budget_forcing (2 errors) — gpt-oss:20b pull failed; same as previous cluster run, likely downstream of disk quota exhaustion

The one improvement vs the last cluster run: test_vision_openai::test_image_block_in_instruction no longer flaking (was non-deterministic LLM output).

Local terminal output

$ uv run pytest
============================================================================================================ test session starts ============================================================================================================
platform darwin -- Python 3.12.8, pytest-9.0.0, pluggy-1.6.0
rootdir: /Users/ajbozarth/workspace/ai/mellea
configfile: pyproject.toml
testpaths: test, docs
plugins: nbmake-1.5.5, recording-0.13.4, anyio-4.11.0, xdist-3.8.0, json-report-1.5.0, timeout-2.4.0, metadata-3.1.1, asyncio-1.3.0, langsmith-0.6.6, Faker-37.12.0, cov-7.0.0
timeout: 900.0s
timeout method: signal
timeout func_only: False
asyncio: mode=Mode.AUTO, debug=False, asyncio_default_fixture_loop_scope=None, asyncio_default_test_loop_scope=function
collected 883 items / 19 deselected / 2 skipped / 864 selected

test/backends/test_adapters/test_adapter.py .                                                                                                                                                                                         [  0%]
test/backends/test_bedrock.py s                                                                                                                                                                                                       [  0%]
test/backends/test_huggingface.py sssssssssssssssssss                                                                                                                                                                                 [  2%]
test/backends/test_huggingface_tools.py s                                                                                                                                                                                             [  2%]
test/backends/test_litellm_ollama.py ........                                                                                                                                                                                         [  3%]
test/backends/test_litellm_watsonx.py ssss                                                                                                                                                                                            [  3%]
test/backends/test_mellea_tool.py ........                                                                                                                                                                                            [  4%]
test/backends/test_model_options.py .....                                                                                                                                                                                             [  5%]
test/backends/test_ollama.py .....X....                                                                                                                                                                                               [  6%]
test/backends/test_openai_ollama.py .............                                                                                                                                                                                     [  8%]
test/backends/test_openai_vllm.py sssssss                                                                                                                                                                                             [  8%]
test/backends/test_tool_calls.py ...                                                                                                                                                                                                  [  9%]
test/backends/test_tool_decorator.py ...................                                                                                                                                                                              [ 11%]
test/backends/test_tool_helpers.py ...                                                                                                                                                                                                [ 11%]
test/backends/test_tool_validation_integration.py .................................                                                                                                                                                   [ 15%]
test/backends/test_vision_ollama.py ....                                                                                                                                                                                              [ 16%]
test/backends/test_vision_openai.py ....                                                                                                                                                                                              [ 16%]
test/backends/test_watsonx.py sssssssssss                                                                                                                                                                                             [ 17%]
test/cli/test_alora_train.py ....                                                                                                                                                                                                     [ 18%]
test/cli/test_alora_train_integration.py ss                                                                                                                                                                                           [ 18%]
test/core/test_astream_exception_propagation.py .....                                                                                                                                                                                 [ 19%]
test/core/test_astream_incremental.py ......                                                                                                                                                                                          [ 19%]
test/core/test_astream_mock.py ......                                                                                                                                                                                                 [ 20%]
test/core/test_base.py ....                                                                                                                                                                                                           [ 20%]
test/core/test_component_typing.py ........                                                                                                                                                                                           [ 21%]
test/core/test_model_output_thunk.py ..                                                                                                                                                                                               [ 22%]
test/decompose/test_decompose.py ..........                                                                                                                                                                                           [ 23%]
test/formatters/granite/test_intrinsics_formatters.py ........................................................x.................                                                                                                      [ 31%]
test/formatters/test_template_formatter.py ................                                                                                                                                                                           [ 33%]
test/helpers/test_event_loop_helper.py ....                                                                                                                                                                                           [ 34%]
test/helpers/test_server_type.py ................                                                                                                                                                                                     [ 35%]
test/plugins/test_all_payloads.py ...................................................................................................                                                                                                 [ 47%]
test/plugins/test_blocking.py ................                                                                                                                                                                                        [ 49%]
test/plugins/test_build_global_context.py .......                                                                                                                                                                                     [ 50%]
test/plugins/test_decorators.py .........                                                                                                                                                                                             [ 51%]
test/plugins/test_execution_modes.py ...........................                                                                                                                                                                      [ 54%]
test/plugins/test_hook_call_sites.py ..............................                                                                                                                                                                   [ 57%]
test/plugins/test_manager.py ss......                                                                                                                                                                                                 [ 58%]
test/plugins/test_mellea_plugin.py .......                                                                                                                                                                                            [ 59%]
test/plugins/test_payloads.py ..........                                                                                                                                                                                              [ 60%]
test/plugins/test_pluginset.py .........                                                                                                                                                                                              [ 61%]
test/plugins/test_policies.py ......                                                                                                                                                                                                  [ 62%]
test/plugins/test_policy_enforcement.py ..........                                                                                                                                                                                    [ 63%]
test/plugins/test_priority_ordering.py ..............                                                                                                                                                                                 [ 65%]
test/plugins/test_scoping.py ...................................                                                                                                                                                                      [ 69%]
test/plugins/test_tool_hooks_redaction.py .......                                                                                                                                                                                     [ 70%]
test/plugins/test_unregister.py .........                                                                                                                                                                                             [ 71%]
test/stdlib/components/docs/test_document.py ...                                                                                                                                                                                      [ 71%]
test/stdlib/components/docs/test_richdocument.py .....s                                                                                                                                                                               [ 72%]
test/stdlib/components/intrinsic/test_core.py ..F                                                                                                                                                                                     [ 72%]
test/stdlib/components/intrinsic/test_guardian.py ......                                                                                                                                                                              [ 73%]
test/stdlib/components/intrinsic/test_rag.py ....F..                                                                                                                                                                                  [ 73%]
test/stdlib/components/test_chat.py .                                                                                                                                                                                                 [ 74%]
test/stdlib/components/test_genslot.py ...................                                                                                                                                                                            [ 76%]
test/stdlib/components/test_hello_world.py ..                                                                                                                                                                                         [ 76%]
test/stdlib/components/test_mify.py ...........                                                                                                                                                                                       [ 77%]
test/stdlib/components/test_transform.py ..                                                                                                                                                                                           [ 78%]
test/stdlib/requirements/test_reqlib_markdown.py ......                                                                                                                                                                               [ 78%]
test/stdlib/requirements/test_reqlib_python.py .............sss.....                                                                                                                                                                  [ 81%]
test/stdlib/requirements/test_reqlib_tools.py .                                                                                                                                                                                       [ 81%]
test/stdlib/requirements/test_requirement.py .....                                                                                                                                                                                    [ 81%]
test/stdlib/sampling/test_majority_voting.py ..                                                                                                                                                                                       [ 82%]
test/stdlib/sampling/test_sampling_ctx.py ..                                                                                                                                                                                          [ 82%]
test/stdlib/sampling/test_sofai_graph_coloring.py .........................                                                                                                                                                           [ 85%]
test/stdlib/sampling/test_sofai_sampling.py .....................                                                                                                                                                                     [ 87%]
test/stdlib/sampling/test_think_budget_forcing.py ..                                                                                                                                                                                  [ 87%]
test/stdlib/test_base_context.py .....                                                                                                                                                                                                [ 88%]
test/stdlib/test_chat_view.py ..                                                                                                                                                                                                      [ 88%]
test/stdlib/test_functional.py ....                                                                                                                                                                                                   [ 89%]
test/stdlib/test_session.py s.......                                                                                                                                                                                                  [ 90%]
test/stdlib/test_spans.py .x                                                                                                                                                                                                          [ 90%]
test/telemetry/test_logging.py ........                                                                                                                                                                                               [ 91%]
test/telemetry/test_metrics.py .......................................                                                                                                                                                                [ 95%]
test/telemetry/test_metrics_backend.py ....s....                                                                                                                                                                                      [ 96%]
test/telemetry/test_metrics_plugins.py ....                                                                                                                                                                                           [ 97%]
test/telemetry/test_metrics_token.py ....                                                                                                                                                                                             [ 97%]
test/telemetry/test_tracing.py ..............                                                                                                                                                                                         [ 99%]
test/telemetry/test_tracing_backend.py ssssss                                                                                                                                                                                         [100%]

FAILED test/stdlib/components/intrinsic/test_core.py::test_find_context_attributions - IndexError: list index out of range
FAILED test/stdlib/components/intrinsic/test_rag.py::test_hallucination_detection - AssertionError: assert approx({'resp...he sentence.}) == {'explanation...end': 31, ...}
================================================================= 2 failed, 800 passed, 61 skipped, 19 deselected, 2 xfailed, 1 xpassed, 122 warnings in 1036.95s (0:17:16) =================================================================

Cluster terminal output

$ bsub -Is -n 1 -G grp_preemptable -q preemptable -gpu "num=1/task:mode=shared:mps=no:j_exclusive=yes:gvendor=nvidia" /bin/bash
Job <756712> is submitted to queue <preemptable>.
<<Waiting for dispatch ...>>
<<Starting on p2-r28-n2>>
[ajbozarth@p2-r28-n2 mellea]$ test/scripts/run_tests_with_ollama.sh
[00:56:43] WARNING: CACHE_DIR not set. Ollama models will download to ~/.ollama (default)
[00:56:43] Using standalone log directory: logs/2026-03-31-00:56:43
[00:56:43] Starting ollama server on 127.0.0.1:11434...
[00:56:43] Added system CUDA to LD_LIBRARY_PATH
[00:56:43] Ollama server PID: 229334
[00:56:43] Waiting for ollama to be ready...
[00:56:46] Ollama ready after 2s
[00:56:46] Model granite4:micro already pulled
[00:56:46] Model granite4:micro-h already pulled
[00:56:46] Model granite3.2-vision already pulled
[00:56:46] All models ready.
[00:56:46] Warming up models...
[00:56:46]   Warming granite4:micro ...
[00:56:49]   Warming granite4:micro-h ...
[00:56:52]   Warming granite3.2-vision ...
[00:56:55] Warmup complete.
[00:56:55] Starting pytest...
[00:56:55] Log directory: logs/2026-03-31-00:56:43
[00:56:55] Pytest args: --group-by-backend
============================= test session starts ==============================
platform linux -- Python 3.12.12, pytest-9.0.0, pluggy-1.6.0
rootdir: /proj/dmfexp/eiger/users/ajbozarth/mellea
configfile: pyproject.toml
plugins: nbmake-1.5.5, anyio-4.11.0, json-report-1.5.0, recording-0.13.4, asyncio-1.3.0, timeout-2.4.0, metadata-3.1.1, Faker-37.12.0, xdist-3.8.0, langsmith-0.6.6, cov-7.0.0
asyncio: mode=Mode.AUTO, debug=False, asyncio_default_fixture_loop_scope=None, asyncio_default_test_loop_scope=function
timeout: 900.0s
timeout method: signal
timeout func_only: False
collected 892 items / 19 deselected / 873 selected

test/backends/test_huggingface.py ...................                    [  2%]
test/backends/test_huggingface_tools.py .                                [  2%]
test/cli/test_alora_train_integration.py ..                              [  2%]
test/formatters/granite/test_intrinsics_formatters.py ....x.........     [  4%]
test/stdlib/components/docs/test_richdocument.py s                       [  4%]
test/stdlib/components/intrinsic/test_core.py ..F                        [  4%]
test/stdlib/components/intrinsic/test_guardian.py ......                 [  5%]
test/stdlib/components/intrinsic/test_rag.py .......                     [  6%]
test/stdlib/test_spans.py .x                                             [  6%]
test/telemetry/test_metrics_backend.py ..                                [  6%]
test/backends/test_openai_ollama.py .............                        [  8%]
test/backends/test_openai_vllm.py sssssss                                [  8%]
test/backends/test_vision_openai.py ....                                 [  9%]
test/telemetry/test_metrics_backend.py ..                                [  9%]
test/backends/test_vllm.py ........                                      [ 10%]
test/backends/test_vllm_tools.py .                                       [ 10%]
test/backends/test_litellm_ollama.py ........                            [ 11%]
test/backends/test_mellea_tool.py ..                                     [ 11%]
test/backends/test_ollama.py .....X....                                  [ 12%]
test/backends/test_tool_calls.py ...                                     [ 13%]
test/backends/test_vision_ollama.py ....                                 [ 13%]
test/core/test_astream_incremental.py ......                             [ 14%]
test/core/test_component_typing.py ...                                   [ 14%]
test/core/test_model_output_thunk.py ..                                  [ 14%]
test/stdlib/components/test_genslot.py ...................               [ 17%]
test/stdlib/requirements/test_requirement.py .....                       [ 17%]
test/stdlib/sampling/test_majority_voting.py ..                          [ 17%]
test/stdlib/sampling/test_sampling_ctx.py ..                             [ 18%]
test/stdlib/sampling/test_sofai_graph_coloring.py ...                    [ 18%]
test/stdlib/sampling/test_sofai_sampling.py .                            [ 18%]
test/stdlib/sampling/test_think_budget_forcing.py EE                     [ 18%]
test/stdlib/test_chat_view.py ..                                         [ 19%]
test/stdlib/test_functional.py ....                                      [ 19%]
test/stdlib/test_session.py s.......                                     [ 20%]
test/telemetry/test_metrics_backend.py ....                              [ 20%]
test/telemetry/test_tracing.py ....                                      [ 21%]
test/telemetry/test_tracing_backend.py ssssss                            [ 21%]
test/backends/test_bedrock.py s                                          [ 22%]
test/backends/test_litellm_watsonx.py ssss                               [ 22%]
test/backends/test_watsonx.py sssssssssss                                [ 23%]
test/telemetry/test_metrics_backend.py s                                 [ 23%]
test/backends/test_adapters/test_adapter.py .                            [ 24%]
test/backends/test_mellea_tool.py ......                                 [ 24%]
test/backends/test_model_options.py .....                                [ 25%]
test/backends/test_tool_decorator.py ...................                 [ 27%]
test/backends/test_tool_helpers.py ...                                   [ 27%]
test/backends/test_tool_validation_integration.py ......................
...........                                                              [ 31%]
test/cli/test_alora_train.py ....                                        [ 32%]
test/core/test_astream_exception_propagation.py .....                    [ 32%]
test/core/test_astream_mock.py ......                                    [ 33%]
test/core/test_base.py ....                                              [ 33%]
test/core/test_component_typing.py .....                                 [ 34%]
test/decompose/test_decompose.py ..........                              [ 35%]
test/formatters/granite/test_intrinsics_formatters.py ....................
..................................FFFFFFFF                               [ 42%]
test/formatters/test_template_formatter.py ................              [ 44%]
test/helpers/test_event_loop_helper.py ....                              [ 44%]
test/helpers/test_server_type.py ................                        [ 46%]
test/plugins/test_all_payloads.py ......................................
.............................................................            [ 57%]
test/plugins/test_blocking.py ................                           [ 59%]
test/plugins/test_build_global_context.py .......                        [ 60%]
test/plugins/test_decorators.py .........                                [ 61%]
test/plugins/test_execution_modes.py ...........................         [ 64%]
test/plugins/test_hook_call_sites.py ..............................      [ 68%]
test/plugins/test_manager.py ss......                                    [ 68%]
test/plugins/test_mellea_plugin.py .......                               [ 69%]
test/plugins/test_payloads.py ..........                                 [ 70%]
test/plugins/test_pluginset.py .........                                 [ 71%]
test/plugins/test_policies.py ......                                     [ 72%]
test/plugins/test_policy_enforcement.py ..........                       [ 73%]
test/plugins/test_priority_ordering.py ..............                    [ 75%]
test/plugins/test_scoping.py ...................................         [ 79%]
test/plugins/test_tool_hooks_redaction.py .......                        [ 80%]
test/plugins/test_unregister.py .........                                [ 81%]
test/stdlib/components/docs/test_document.py ...                         [ 81%]
test/stdlib/components/docs/test_richdocument.py .....                   [ 82%]
test/stdlib/components/test_chat.py .                                    [ 82%]
test/stdlib/components/test_hello_world.py ..                            [ 82%]
test/stdlib/components/test_mify.py ...........                          [ 83%]
test/stdlib/components/test_transform.py ..                              [ 83%]
test/stdlib/requirements/test_reqlib_markdown.py ......                  [ 84%]
test/stdlib/requirements/test_reqlib_python.py .............sss.....     [ 87%]
test/stdlib/requirements/test_reqlib_tools.py .                          [ 87%]
test/stdlib/sampling/test_sofai_graph_coloring.py ......................
                                                                         [ 89%]
test/stdlib/sampling/test_sofai_sampling.py ....................         [ 91%]
test/stdlib/test_base_context.py .....                                   [ 92%]
test/telemetry/test_logging.py ........                                  [ 93%]
test/telemetry/test_metrics.py .......................................   [ 97%]
test/telemetry/test_metrics_plugins.py ....                              [ 98%]
test/telemetry/test_metrics_token.py ....                                [ 98%]
test/telemetry/test_tracing.py ..........                                [100%]

=========================== short test summary info ============================
FAILED test/stdlib/components/intrinsic/test_core.py::test_find_context_attributions
FAILED test/formatters/granite/test_intrinsics_formatters.py::test_run_ollama[answerability_simple]
FAILED test/formatters/granite/test_intrinsics_formatters.py::test_run_ollama[answerability_answerable]
FAILED test/formatters/granite/test_intrinsics_formatters.py::test_run_ollama[answerability_unanswerable]
FAILED test/formatters/granite/test_intrinsics_formatters.py::test_run_ollama[hallucination_detection]
FAILED test/formatters/granite/test_intrinsics_formatters.py::test_run_ollama[query_clarification]
FAILED test/formatters/granite/test_intrinsics_formatters.py::test_run_ollama[query_rewrite]
FAILED test/formatters/granite/test_intrinsics_formatters.py::test_run_ollama[context_relevance]
FAILED test/formatters/granite/test_intrinsics_formatters.py::test_run_ollama[citations]
ERROR test/stdlib/sampling/test_think_budget_forcing.py::test_think_big - Exc...
ERROR test/stdlib/sampling/test_think_budget_forcing.py::test_think_little - ...
= 9 failed, 822 passed, 37 skipped, 19 deselected, 2 xfailed, 1 xpassed, 131 warnings, 2 errors in 2253.24s (0:37:33) =
[01:34:44] Shutting down ollama server...
[01:34:44] Ollama stopped.

ajbozarth · 2026-03-31T01:48:54Z

I think I'm ok with the PR as is now, though I'd like a few other eyes on it. @avinash2692 @psschwei @jakelorocco, could you try running the tests as well?

@avinash2692 based on my re-run above, I'm still hitting the disk quota issue on bluevela. I'm unsure whether that was supposed to have been fixed by #765; could you check whether you hit it too? Also check the commands I'm running in the terminal output above to make sure I'm not just running the wrong ones (I got them from you).

jakelorocco

lgtm; I'm hitting some environment errors on blue vela but the tests run locally for me (except for the expected skips and the failures already mentioned).

I think we can investigate any remaining test failures separately.

psschwei

in general, these are mainly marker classification changes and shouldn't majorly break things. I'm ok with this going in, the team can fix issues if they pop up.

psschwei · 2026-03-31T18:01:42Z

    ignore_all = config.getoption("--ignore-all-checks", default=False)
    ignore_gpu = config.getoption("--ignore-gpu-check", default=False) or ignore_all


does this still work or did we drop either/both of these?

I believe this was removed from the test dir conftest, but not from the examples one. I'm unsure why @planetf1 did this or if it was just missed.

psschwei · 2026-03-31T18:05:14Z

                # Get the model and tokenizer.
                self._model: PreTrainedModel = AutoModelForCausalLM.from_pretrained(
-                    self._hf_model_id, device_map=str(self._device)
+                    self._hf_model_id, device_map=str(self._device), torch_dtype="auto"


just noting to check here as one possible cause in the event that we start seeing qualitative flakes in the HF tests somewhere down the line.

avinash2692 · 2026-03-31T19:11:07Z

+        if torch.cuda.is_available():
+            return torch.cuda.get_device_properties(0).total_memory / (1024**3)


This might cause an issue. iirc torch.cuda.get_device_properties(0) does create a context with CUDA so this might lead to device not available errors if you repeatedly call it.
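One way to sidestep the repeated-call concern, whatever `torch.cuda.get_device_properties(0)` actually does under the hood, is to run the query at most once per process and reuse the result. A minimal sketch, assuming hypothetical helper names (`_query_cuda_vram_gb`, `make_cached_probe`, `gpu_vram_gb`) that are not the PR's actual code:

```python
from functools import lru_cache
from typing import Callable


def _query_cuda_vram_gb() -> float:
    """Return device 0's total memory in GiB, or 0.0 if torch/CUDA is unavailable."""
    try:
        import torch
    except ImportError:
        return 0.0
    if not torch.cuda.is_available():
        return 0.0
    return torch.cuda.get_device_properties(0).total_memory / (1024**3)


def make_cached_probe(probe: Callable[[], float]) -> Callable[[], float]:
    """Wrap a probe so the underlying query runs at most once per process."""

    @lru_cache(maxsize=1)
    def cached() -> float:
        return probe()

    return cached


# Hypothetical module-level accessor: every later call reuses the first result.
gpu_vram_gb = make_cached_probe(_query_cuda_vram_gb)
```

Taking the probe as a parameter keeps the caching logic testable on machines without a GPU, since a fake probe can be substituted.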

avinash2692 · 2026-03-31T19:12:21Z

-    pytest.mark.requires_heavy_ram,
-    pytest.mark.requires_gpu_isolation,  # Activate GPU memory isolation
+    pytest.mark.e2e,
+    require_gpu(min_vram_gb=20),


might need to be a little careful here with GPU mem setting. IMHO, we should just let the test fail if there isn't enough vram rather than checking.
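For context on where a value like `min_vram_gb=20` comes from: the squashed commit message for this PR gives the bf16 estimate as params_B × 2 × 1.2, rounded up to the next even GB. A small sketch of that arithmetic, where the function name `estimate_min_vram_gb` and its parameters are illustrative and not part of `test/predicates.py`:

```python
import math


def estimate_min_vram_gb(
    params_billions: float,
    bytes_per_param: int = 2,  # bf16 weights: 2 bytes per parameter
    overhead: float = 1.2,     # 20% headroom, per the commit message formula
) -> int:
    """Estimate a VRAM gate in GB, rounded up to an even number of GB."""
    raw_gb = params_billions * bytes_per_param * overhead
    return 2 * math.ceil(raw_gb / 2)


# Values quoted in the commit message:
print(estimate_min_vram_gb(8))  # granite-3.3-8b -> 20
print(estimate_min_vram_gb(7))  # Mistral-7B -> 18
print(estimate_min_vram_gb(3))  # granite-4.0-micro (3B) -> 8
```

The formula is only a starting point. The commit message sets some gates higher on purpose, for example 4 GB for Qwen3-0.6B to leave vLLM KV cache headroom, and 12 GB for the 3B intrinsic tests once adapters are counted.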

avinash2692

LGTM. I did have a few comments on looking up GPU mem and if we need it at all, but happy to fix that in a later PR if it really breaks things.

ajbozarth · 2026-03-31T19:25:17Z

Let's merge this as is (especially since @planetf1 is OOTO this week) and open follow-up issues:

@psschwei and @avinash2692 if you could open follow-up issues with any of your above concerns that you think need them, I'll put this into the merge queue and you or @planetf1 can address those follow-ups after.

ajbozarth · 2026-03-31T19:28:11Z

lgtm; I'm hitting some environment errors on blue vela but the tests run locally for me (except for the expected skips and the failures already mentioned).

@jakelorocco I actually figured this out: run_tests_with_ollama.sh will use your tiny user cache if CACHE_DIR is not set to a dir in the larger project dir. The script prints this warning when that happens: WARNING: CACHE_DIR not set. Ollama models will download to ~/.ollama (default)

tl;dr always set CACHE_DIR before using run_tests_with_ollama.sh
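A wrapper could turn that warning into a hard failure before launching the script. This is only a sketch: `require_cache_dir` is a hypothetical helper, not something the repository provides, and it checks only that the variable is set, not that the directory has enough quota.

```python
import os
from typing import Mapping, Optional


def require_cache_dir(env: Optional[Mapping[str, str]] = None) -> str:
    """Return CACHE_DIR from the environment, or raise if it is unset or empty."""
    env = os.environ if env is None else env
    cache_dir = env.get("CACHE_DIR", "").strip()
    if not cache_dir:
        raise RuntimeError(
            "CACHE_DIR is not set; Ollama models would download to ~/.ollama"
        )
    return cache_dir
```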

…ve-computing#727, generative-computing#728) (generative-computing#742) * test: add granularity marker taxonomy infrastructure (generative-computing#727) Register unit/integration/e2e markers in conftest and pyproject.toml. Add unit auto-apply hook in pytest_collection_modifyitems. Deprecate llm marker (synonym for e2e). Remove dead plugins marker. Rewrite MARKERS_GUIDE.md as authoritative marker reference. Sync AGENTS.md Section 3 with new taxonomy. * test: add audit-markers skill for test classification (generative-computing#728) Skill classifies tests as unit/integration/e2e/qualitative using general heuristics (Part 1) and project-specific rules (Part 2). Includes fixture chain tracing guidance, backend detection heuristics, and example file handling. References MARKERS_GUIDE.md for tables. * chore: add CLAUDE.md and agent skills infrastructure Add CLAUDE.md referencing AGENTS.md for project directives. Add skill-author meta-skill for cross-compatible skill creation. The audit-markers skill was added in the previous commit. * test: improve audit-markers skill quality and add resource predicates Resolve 8 quality issues from dry-run review of the audit-markers skill: - Add behavioural signal detection tables and Step 0 triage procedure for scaling to full-repo audits (grep for backend behaviour, not just existing markers) - Clarify unit/integration boundary with scope-of-mocks rule - Allow module-level qualitative when every function qualifies - Replace resource marker inference with predicate factory pattern - Make llm→e2e rule explicit for # pytest: comments in examples - Redesign report format: 3-tier output (summary table, issues-only detail, batch groups) instead of per-function listing - Remove stale infrastructure note (conftest hook already exists) Add test/predicates.py with reusable skipif decorators: require_gpu, require_ram, require_gpu_isolation, require_api_key, require_package, require_ollama, require_python. 
Update skill-author with dry-run review step and 4 new authoring guidelines (variable scope, category boundaries, temporal assertions, qualifying absolutes). Refs: generative-computing#727, generative-computing#728 * chore: remove issue references from audit-markers skill Epic/issue numbers are task context, not permanent skill knowledge. * docs: align MARKERS_GUIDE.md with predicate factory pattern MARKERS_GUIDE.md documented legacy resource markers (requires_gpu, etc.) as the active convention while SKILL.md instructed migration to predicates — a direct conflict that would cause the audit agent to stall or produce incorrect edits. - Replace resource markers section with predicate-first documentation - Move legacy markers to deprecated subsection (conftest still handles them) - Update common patterns example to use predicate imports - Add test/predicates.py to related files - Add explicit dry-run enforcement to SKILL.md Step 4 Refs: generative-computing#727, generative-computing#728 * fix: validate_skill.py schema mismatch and brittle YAML parsing Two bugs: - Required `version` at root level but skill-author guide nests it under `metadata` — guaranteed failure on valid skills - Naive `content.split('---')` breaks on markdown horizontal rules Fix: use yaml.safe_load_all for robust frontmatter extraction, check `name`/`description` at root and `version` under `metadata.version`. 
* fix: migrate deprecated llm markers to e2e, add backend registry, update audit-markers skill - Replace all `pytest.mark.llm` with `pytest.mark.e2e` across 34 test files and 87 example files (comment-based markers) - Add `BACKEND_MARKERS` data-driven registry in test/conftest.py as single source of truth for backend marker registration - Register `bedrock` backend marker in conftest.py, pyproject.toml, MARKERS_GUIDE.md, and add missing marker to test_bedrock.py - Reclassify test_alora_train.py as integration (was unit); add importorskip for peft dependency - Add missing `e2e` tier markers to test_tracing.py and test_tracing_backend.py - Update audit-markers skill: report-first default, predicate migration as fix (not recommendation), backend registry gap detection * feat: add estimate-vram skill and fix MPS VRAM detection - New /estimate-vram agent skill that analyses test files to determine correct require_gpu(min_vram_gb=N) and require_ram(min_gb=N) values by tracing model IDs and looking up parameter counts dynamically - Fix _gpu_vram_gb() in test/predicates.py to use torch.mps.recommended_max_memory() on macOS MPS instead of returning 0 - Fix get_system_capabilities() in test/conftest.py with same MPS path - Update test/README.md with predicates table and legacy marker deprecation - Add /estimate-vram cross-reference in audit-markers skill * refactor: fold estimate-vram into audit-markers skill VRAM estimation is only useful during marker audits, not standalone. Move the model-tracing and VRAM computation procedure into the audit-markers resource gating section and delete the separate skill. * docs: drop isolation refs and fix RAM guidance in markers docs requires_heavy_ram and requires_gpu_isolation are deprecated with no replacement — models load into VRAM not system RAM, and GPU isolation is now automatic. require_ram() stays available for genuinely RAM-bound tests but has no current use case. 
* docs: add legacy marker guidance for example files in audit-markers skill * refactor: remove require_ollama() predicate — redundant with backend marker The ollama backend marker + conftest auto-skip already handles Ollama availability. No other backend has a dedicated predicate — consistent to let the marker system handle it. * refactor: replace requires_heavy_ram gate with huggingface backend marker in examples conftest The legacy requires_heavy_ram marker (blanket 48 GB RAM threshold) conflated VRAM with system RAM. Replace both the collection-time and runtime skip logic to gate on the huggingface backend marker instead, which accurately checks GPU availability. * refactor: replace ad-hoc bedrock skipif with require_api_key predicate * refactor: migrate legacy resource markers to predicates Replace deprecated pytest markers with typed predicate functions from test/predicates.py across all test files and example files: - requires_gpu → require_gpu(min_vram_gb=N) with per-model VRAM estimates - requires_heavy_ram → removed (conflated VRAM with RAM; no replacement needed) - requires_gpu_isolation → removed (GPU isolation is now automatic) - requires_api_key → require_api_key("VAR1", "VAR2", ...) with explicit env vars Also removes spurious requires_gpu from ollama-backed tests (test_genslot, test_think_budget_forcing, test_component_typing) and adds missing integration marker to test_hook_call_sites. 
VRAM estimates computed from model parameter counts using bf16 formula (params_B × 2 × 1.2, rounded up to next even GB): - granite-3.3-8b: 20 GB, Mistral-7B: 18 GB, granite-4.0-micro (3B): 8 GB - Qwen3-0.6B: 4 GB (conservative for vLLM KV cache headroom) - granite-4.0-h-micro (3B): 8 GB, alora training (3B): 12 GB * test: skip collection gracefully when optional backend deps are missing Add pytest.importorskip() / pytest.importorskip() guards to 14 test files that previously aborted the entire test run with a ModuleNotFoundError when optional extras were not installed: - torch / llguidance (mellea[hf]): test_huggingface, test_huggingface_tools, test_alora_train_integration, test_intrinsics_formatters, test_core, test_guardian, test_rag, test_spans - litellm (mellea[litellm]): test_litellm_ollama, test_litellm_watsonx - ibm_watsonx_ai (mellea[watsonx]): test_watsonx - docling / docling_core (mellea[mify]): test_tool_calls, test_richdocument, test_transform With these guards, `uv run pytest` runs all collectable tests and reports skipped files with a clear reason instead of aborting at first ImportError. * test: refine integration marker definition and apply audit fixes Expand integration to cover SDK-boundary tests (OTel InMemoryMetricReader, InMemorySpanExporter, LoggingHandler) — tests that assert against a real third-party SDK contract, not just multi-component wiring. Updates SKILL.md and MARKERS_GUIDE.md with new definition, indicators, tie-breaker, and SDK-boundary signal tables. 
Applied fixes: - test/telemetry/test_{metrics,metrics_token,logging}.py: add integration marker - test/telemetry/test_metrics_backend.py: add openai marker to OTel+OpenAI test, remove redundant inline skip already covered by require_api_key predicate - test/cli/test_alora_train.py: add integration to test_imports_work (real LoraConfig) - test/formatters/granite/test_intrinsics_formatters.py: remove unregistered block_network marker - test/stdlib/components/docs/test_richdocument.py: add integration pytestmark + e2e/huggingface/qualitative on skipped generation test - test/backends/test_openai_ollama.py: note inherited module marker limitation - docs/examples/plugins/testing_plugins.py: add # pytest: unit * test: add importorskip guards and optional-dep skip logic for examples - test/plugins/test_payloads.py: importorskip("cpex") — skip module when mellea[hooks] not installed instead of failing mid-test with ImportError - test/telemetry/test_metrics_plugins.py: same cpex guard - docs/examples/conftest.py: extend _check_optional_imports to cover docling, pandas, cpex (mellea.plugins imports), and litellm; also call the check from pytest_pycollect_makemodule so directly-specified files are guarded too - docs/examples/image_text_models/README.md: add Prerequisites section listing models to pull (granite3.2-vision, qwen2.5vl:7b) * fix: convert example import errors to skips; add cpex importorskip guards Replace per-dep import checks in examples conftest with a runtime approach: ExampleModule (a pytest.Module subclass) is now returned by pytest_pycollect_makemodule for all runnable example files, preventing pytest's default collector from importing them directly. Import errors in the subprocess are caught in ExampleItem.runtest() and converted to skips, so no optional dependency needs to be encoded in conftest. Remove _check_optional_imports entirely — it was hand-maintained and would need updating for every new optional dep. 
Also: - test/plugins/test_payloads.py: importorskip("cpex") - test/telemetry/test_metrics_plugins.py: importorskip("cpex") - docs/examples/image_text_models/README.md: add Prerequisites section listing models to pull (granite3.2-vision, qwen2.5vl:7b) * test: skip OTel-dependent tests when opentelemetry not installed Locally running without mellea[telemetry] caused three tests to fail with assertion errors rather than skip cleanly. Add importorskip at module level for test_tracing.py and a skipif decorator for the single OTel-gated test in test_astream_exception_propagation.py. * fix: use conservative heuristic for Apple Silicon GPU memory detection Metal's recommendedMaxWorkingSetSize is a static device property (~75% of total RAM) that ignores current system load. Replace it with min(total * 0.75, total - 16) so that desktop/IDE memory usage is accounted for. Also removes the torch dependency for GPU detection on Apple Silicon — sysctl hw.memsize is used directly. CUDA path on Linux is unchanged. * test: add training memory signals to audit-markers skill; bump alora VRAM gate Training tests need ~2x the base model inference memory (activations, optimizer states, gradient temporaries). The skill now detects training signals (train_model, Trainer, epochs=) and checks that require_gpu min_vram_gb uses the 2x rule. Bump test_alora_train_integration from min_vram_gb=12 to 20 (3B bfloat16: ~6 GB inference, ~12 GB training peak + headroom) so it skips correctly on 32 GB Apple Silicon under typical load. * fix: cache system capabilities result in examples conftest get_system_capabilities() was caching the function reference, not the result — causing the Ollama socket check (1s timeout) and full capability detection to re-run for every example file during collection (~102 times). Cache the result dict instead so detection runs exactly once. 
* fix: cache get_system_capabilities() result in test/conftest.py The function was called once per test in pytest_runtest_setup (325+ calls) and once at collection in pytest_collection_modifyitems, each time re-running the Ollama socket check (1s timeout when down), sysctl subprocess, and psutil query. Cache the result after the first call. * fix: flush MPS memory pool in intrinsic test fixture teardown torch.cuda.empty_cache() is a no-op on Apple Silicon MPS, leaving the MPS allocator pool occupied after each module fixture tears down. The next module then loads a fresh model into an already-pressured pool, causing the process RSS to grow unboundedly across modules. Both calls are now guarded so CUDA and MPS runs each get the correct flush. * fix: load LocalHFBackend model in config dtype to prevent float32 upcasting AutoModelForCausalLM.from_pretrained without torch_dtype may load weights in float32 on CPU before moving to MPS/CUDA, doubling peak memory briefly and leaving float32 remnants in the allocator pool. torch_dtype="auto" respects the model config (bfloat16 for Granite) for both the CPU load and the device transfer. 
* test: remove --isolate-heavy process isolation and bump intrinsic VRAM gates

- Remove --isolate-heavy flag, _run_heavy_modules_isolated(), pytest_collection_finish(), and require_gpu_isolation() predicate — superseded by cleanup_gpu_backend() from PR generative-computing#721
- Remove dead requires_gpu/requires_api_key branches from docs/examples/conftest.py
- Bump min_vram_gb from 8 → 12 on test_guardian, test_core, test_rag, test_spans — correct gate for 3B base model (6 GB) + adapters + inference overhead; 8 GB was wrong and masked by the now-fixed MPS pool leak
- Add adapter accumulation signals to audit-markers skill
- Update AGENTS.md, test/README.md, MARKERS_GUIDE.md to remove --isolate-heavy references

* test: migrate legacy markers in test_intrinsics_formatters.py

Replace deprecated @pytest.mark.llm, @pytest.mark.requires_gpu, @pytest.mark.requires_heavy_ram, @pytest.mark.requires_gpu_isolation with @pytest.mark.e2e and @require_gpu(min_vram_gb=12) to align with the new marker taxonomy (generative-computing#727/generative-computing#728). VRAM gate set to 12 GB matching the 3B-parameter model loaded across the parametrized test cases.
* test: add integration marker to test_dependency_isolation.py
* docs: document OLLAMA_KEEP_ALIVE=1m as memory optimisation for unordered test runs
* fix: suppress mypy name-defined for torch.Tensor after importorskip change
* fix: ruff format huggingface.py from_pretrained args
* fix: ruff format test_watsonx.py and test_huggingface_tools.py
* refactor: remove requires_gpu, requires_heavy_ram, requires_gpu_isolation markers and handlers
* refactor: remove --ignore-*-check override flags from conftest
* refactor: remove requires_api_key marker; fix api backend group to match watsonx+bedrock markers
* fix: address review

Signed-off-by: Alex Bozarth <ajbozart@us.ibm.com>

* test: mark test_image_block_in_instruction as qualitative
* chore: commit .claude/settings.json with skillLocations for skill discovery
* docs: broaden audit-markers skill description to cover diagnostic use cases
* docs: add diagnostic mode to audit-markers skill for troubleshooting skip/resource issues

---------

Signed-off-by: Alex Bozarth <ajbozart@us.ibm.com>
Co-authored-by: Alex Bozarth <ajbozart@us.ibm.com>

* decompse doc string * pipline doc string * logging doc string * decomp README * merge docstrings * clean: pre-commit * decomp guide * fix: subtask tag * clean: pre-commit * clean: Readme * merge docstrings * clean: pre-commit * decomp guide * fix: subtask tag * clean: pre-commit * test: agent skills infrastructure and marker taxonomy audit (#727, #728) (#742) * clean: Readme --------- Signed-off-by: Alex Bozarth <ajbozart@us.ibm.com> Co-authored-by: csbobby <phdbobbywu.cs@gmail.com> Co-authored-by: Nigel Jones <jonesn@uk.ibm.com> Co-authored-by: Alex Bozarth <ajbozart@us.ibm.com>
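The squashed commit message for this PR derives `require_gpu(min_vram_gb=N)` values from parameter counts with a bf16 formula: params_B × 2 × 1.2, rounded up to the next even GB. A small sketch of that arithmetic, assuming a hypothetical helper name `estimate_min_vram_gb` (the audit-markers skill describes the procedure in prose; this function is not taken from the repository):

```python
import math


def estimate_min_vram_gb(params_billion: float) -> int:
    """bf16 inference estimate: 2 bytes per parameter plus 20% overhead,
    rounded up to the next even number of GB."""
    raw_gb = params_billion * 2 * 1.2
    return int(math.ceil(raw_gb / 2) * 2)


# Parameter counts for the models named in the commit message.
granite_8b = estimate_min_vram_gb(8)
mistral_7b = estimate_min_vram_gb(7)
granite_micro_3b = estimate_min_vram_gb(3)
```

For the models named in the commit message this gives 20 GB for granite-3.3-8b, 18 GB for Mistral-7B, and 8 GB for the 3B granite-4.0-micro. The message treats the result as a floor rather than a final answer: Qwen3-0.6B is gated at 4 GB for vLLM KV cache headroom, and the intrinsic tests were later raised to 12 GB to account for adapters and inference overhead.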

planetf1 · 2026-04-07T10:48:32Z

thanks for merging

github-actions Bot added the testing label Mar 25, 2026

planetf1 changed the title ~~test: add granularity marker taxonomy infrastructure (#727)~~ test: agent skills infrastructure and marker taxonomy audit (#727, #728) Mar 25, 2026

planetf1 mentioned this pull request Mar 25, 2026

ci: memory management in tests #721

Merged

7 tasks

planetf1 force-pushed the test/audit-markers-727-728 branch 2 times, most recently from 87d30e4 to d3135cb Compare March 27, 2026 07:12

planetf1 mentioned this pull request Mar 27, 2026

fix: test_tracing_backend.py tests always skip (Telemetry not initialized) #754

Closed

planetf1 marked this pull request as ready for review March 27, 2026 10:04

planetf1 requested review from a team as code owners March 27, 2026 10:04

planetf1 requested review from ajbozarth, avinash2692, jakelorocco and psschwei March 27, 2026 12:45

ajbozarth mentioned this pull request Mar 27, 2026

feat: add --skip-resource-checks flag to bypass hardware capability gates #758

Closed

ajbozarth requested changes Mar 27, 2026

Comment thread .agents/skills/skill-author/SKILL.md

Comment thread test/backends/test_tool_calls.py Outdated

Comment thread test/stdlib/components/docs/test_richdocument.py Outdated

Comment thread test/stdlib/components/test_transform.py Outdated

ajbozarth mentioned this pull request Mar 27, 2026

fix: run_tests_with_ollama.sh proceeds silently when Ollama warmup times out #759

Closed

ajbozarth and others added 5 commits March 28, 2026 09:59

fix: address review

e1d79fb

Signed-off-by: Alex Bozarth <ajbozart@us.ibm.com>

test: mark test_image_block_in_instruction as qualitative

b772cc4

chore: commit .claude/settings.json with skillLocations for skill discovery

c9b996d

docs: broaden audit-markers skill description to cover diagnostic use cases

3d80a81

docs: add diagnostic mode to audit-markers skill for troubleshooting skip/resource issues

ec0254d

planetf1 force-pushed the test/audit-markers-727-728 branch from 53b7fed to ec0254d Compare March 28, 2026 09:59

Merge branch 'main' into test/audit-markers-727-728

07a5562

ajbozarth self-requested a review March 31, 2026 01:49

jakelorocco approved these changes Mar 31, 2026

psschwei approved these changes Mar 31, 2026

avinash2692 reviewed Mar 31, 2026

avinash2692 approved these changes Mar 31, 2026

ajbozarth approved these changes Mar 31, 2026

ajbozarth added this pull request to the merge queue Mar 31, 2026

Merged via the queue into generative-computing:main with commit d3d6040 Mar 31, 2026
6 checks passed

csbobby mentioned this pull request Apr 1, 2026

test: agent skills infrastructure and marker taxonomy audit (#727, #7… csbobby/mellea_clean#3

Closed

5 tasks

planetf1 mentioned this pull request Apr 8, 2026

pytest docs/examples --collect-only collects 0 tests after conftest.py refactor #794

Closed

ajbozarth mentioned this pull request Apr 10, 2026

test: model consolidation and flexibility #732

Open

7 tasks

This was referenced Apr 22, 2026

fix: CI regression — Run CI doubled from ~17 min to ~35 min after PR #742 #903

Closed

fix: restore @pytest.mark.block_network on test_openai_compat #904

Closed

		ignore_all = config.getoption("--ignore-all-checks", default=False)
		ignore_gpu = config.getoption("--ignore-gpu-check", default=False) or ignore_all

		if torch.cuda.is_available():
			return torch.cuda.get_device_properties(0).total_memory / (1024**3)
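The CUDA branch quoted above reports total device memory in GiB. For Apple Silicon, the commit message describes a more conservative rule, min(total * 0.75, total - 16), computed from `sysctl hw.memsize` without importing torch. A minimal sketch of that rule follows. The function names are hypothetical, and the clamp to zero for machines with less than 16 GB is an added assumption; neither is taken from the PR's actual `test/predicates.py`.

```python
import subprocess


def apple_silicon_gpu_budget_gb(total_ram_gb: float) -> float:
    """Conservative usable GPU memory on unified-memory Macs:
    min(total * 0.75, total - 16), clamped at zero (clamp is an assumption)."""
    return max(0.0, min(total_ram_gb * 0.75, total_ram_gb - 16))


def total_ram_gb_from_sysctl() -> float:
    """Read hw.memsize (bytes) via sysctl. Only meaningful on macOS."""
    out = subprocess.run(
        ["sysctl", "-n", "hw.memsize"],
        capture_output=True,
        text=True,
        check=True,
    )
    return int(out.stdout.strip()) / (1024**3)


# Worked values for common configurations (pure arithmetic, no sysctl call).
budget_32 = apple_silicon_gpu_budget_gb(32)
budget_64 = apple_silicon_gpu_budget_gb(64)
budget_128 = apple_silicon_gpu_budget_gb(128)
```

On a 32 GB machine the rule yields 16 GB, which is consistent with the commit message's note that a 20 GB gate makes the alora training test skip on 32 GB Apple Silicon.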

Conversation

planetf1 commented Mar 25, 2026 • edited

Marker Taxonomy & Agent Skills

Type of PR

Description

How we define the tiers

New agent skills (.agents/skills/)

pytest infrastructure changes

Test reclassifications

Docs updated

Local test run (Mac M1, 32 GB)

Issues raised during testing

Testing

Cluster test run (IBM BLUEVELA LSF, Linux / Python 3.12.13, p-series GPU node)

Test run summary across environments

mergify Bot commented Mar 25, 2026

Merge Protections

🟢 Enforce conventional commit

github-actions Bot commented Mar 25, 2026

planetf1 commented Mar 25, 2026

planetf1 commented Mar 25, 2026

planetf1 commented Mar 27, 2026

planetf1 commented Mar 27, 2026

planetf1 commented Mar 27, 2026

ajbozarth commented Mar 27, 2026

planetf1 commented Mar 27, 2026

ajbozarth commented Mar 27, 2026

ajbozarth left a comment • edited

Test run summary

ajbozarth commented Mar 27, 2026

ajbozarth commented Mar 27, 2026

ajbozarth commented Mar 27, 2026

ajbozarth commented Mar 27, 2026

planetf1 commented Mar 28, 2026 • edited

ajbozarth commented Mar 31, 2026

Test run summary

ajbozarth commented Mar 31, 2026

jakelorocco left a comment

psschwei left a comment

psschwei Mar 31, 2026

ajbozarth Mar 31, 2026

psschwei Mar 31, 2026

avinash2692 Mar 31, 2026
