test: agent skills infrastructure and marker taxonomy audit (#727, #728) #742
Merge Protections
Your pull request matches the following merge protections and will not be merged until they are valid.
🟢 Enforce conventional commit
Wonderful, this rule succeeded.
Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/
|
|
The PR description has been updated. Please fill out the template for your PR to be reviewed. |
|
This now tries to assess the VRAM needs by looking at models/hf. I'll experiment with running the tests to see how accurate the agent is in general, and I'll manually review the assessments. |
87d30e4 to d3135cb
|
The mypy failure ( |
|
Updated top post with current summary. Ready for review |
|
@ajbozarth you asked about being able to run a test while bypassing the GPU check. This is already possible without any code changes by using pytest to run the test directly. I'm thinking this is sufficient? |
That's fair, but I think it'd still be worth having a flag to disable the part of conftest that limits tests based on detected hardware. I wouldn't call it a blocker for this PR though; it could be a follow-up issue. As for review, I'll do a deep dive into this this afternoon and will re-run all the tests myself for a "second opinion".
Ok - my thought is to just have a generic flag like Can you raise the followup?
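One way such a flag could look is sketched below as a conftest.py fragment. The flag name `--ignore-hardware-checks`, the `requires_vram_gb` marker, and the `detect_available_gb` helper are hypothetical names chosen for illustration; they are not taken from this PR or from the project's existing conftest. Only the hook names (`pytest_addoption`, `pytest_runtest_setup`) and the calls `parser.addoption`, `config.getoption`, `item.get_closest_marker`, and `pytest.skip` are real pytest API.

```python
"""Sketch of a conftest.py fragment; all project-specific names are hypothetical."""
from __future__ import annotations


def pytest_addoption(parser):
    # Hypothetical flag: lets a developer run hardware-gated tests anyway.
    parser.addoption(
        "--ignore-hardware-checks",
        action="store_true",
        default=False,
        help="Do not skip tests based on detected hardware (hypothetical flag).",
    )


def detect_available_gb() -> float:
    # Placeholder for real hardware detection. It returns 0.0, so every
    # hardware-gated test is skipped unless the flag is passed.
    return 0.0


def hardware_skip_reason(
    required_gb: float, available_gb: float, ignore_hardware: bool
) -> str | None:
    """Return a skip reason, or None if the test should run."""
    if ignore_hardware or available_gb >= required_gb:
        return None
    return f"Insufficient VRAM: need {required_gb} GB, detected {available_gb} GB"


def pytest_runtest_setup(item):
    import pytest  # imported here so the helper above stays usable without pytest

    # Hypothetical marker, e.g. @pytest.mark.requires_vram_gb(24)
    marker = item.get_closest_marker("requires_vram_gb")
    if marker is None:
        return
    reason = hardware_skip_reason(
        required_gb=float(marker.args[0]),
        available_gb=detect_available_gb(),
        ignore_hardware=item.config.getoption("--ignore-hardware-checks"),
    )
    if reason is not None:
        pytest.skip(reason)
```

Under these assumptions, passing `--ignore-hardware-checks` on the pytest command line would run the gated tests regardless of what hardware is detected, while the default behaviour stays unchanged.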
|
|
I've done an in-depth review including:
- an actual read-through of the skill markdown -> LGTM
- double check example mark updates -> LGTM
- review updates to tests:
- mark updates and other fixes -> LGTM
- a few minor typos in importskip reasons -> inline suggested changes
- conftest updates -> LGTM
- helper functions in predicates -> LGTM
I'll apply the typo fixes myself, otherwise my other comments are non-blocking
I've also run all the tests and included the results below:
Test run summary
Local run (uv run pytest, Mac M1 Max 32GB, Python 3.12.8): 800 passed, 2 failed, 61 skipped, 19 deselected, 2 xfailed, 1 xpassed in 32m05s.
The 2 failures are @pytest.mark.qualitative tests (test_find_context_attributions, test_hallucination_detection): non-deterministic content assertions.
The 19 deselected are slow tests excluded by default.
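For test_hallucination_detection, the failure report in the terminal output shows how narrowly the assertion missed. The snippet below redoes that arithmetic with the obtained and expected faithfulness_likelihood values and the abs=2e-2 tolerance copied from the report. It illustrates the comparison only and is not part of the test suite.

```python
# Values copied from the failure report for test_hallucination_detection.
obtained = 0.7280598165124975
expected = 0.7508619477505966
tolerance = 2e-2  # the abs= argument passed to pytest.approx in the test

abs_diff = abs(obtained - expected)
rel_diff = abs_diff / abs(expected)

# The assertion passes only when the absolute difference is within tolerance.
within_tolerance = abs_diff <= tolerance
overshoot = abs_diff - tolerance

print(f"absolute difference: {abs_diff:.6f}")
print(f"relative difference: {rel_diff:.6f}")
print(f"within tolerance:    {within_tolerance}")
print(f"over tolerance by:   {overshoot:.6f}")
```

The value lands roughly 0.003 outside the allowed band, which would be consistent with run-to-run variation in model output.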
Skips breakdown (61 total — all expected):
| Reason | Count |
|---|---|
| Insufficient VRAM | ~23 |
| Missing API credentials | ~16 |
| vLLM process not running | 7 |
| test_tracing_backend.py: telemetry not initialised (see #754) | 6 |
| test_manager.py: requires --disable-default-mellea-plugins | 2 |
| test_reqlib_python.py sandbox tests | 3 |
| Other | ~4 |
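As a sanity check, the counts in this breakdown can be tallied against pytest's own summary. The snippet below uses only figures from this comment and from the collection and summary lines of the terminal output. It treats the approximate (~) counts as exact, and it assumes the 2 skips reported at collection time are included in the 61.

```python
# Skip counts from the breakdown table ("~" values taken at face value).
skips = {
    "Insufficient VRAM": 23,
    "Missing API credentials": 16,
    "vLLM process not running": 7,
    "test_tracing_backend.py telemetry": 6,
    "test_manager.py plugins flag": 2,
    "test_reqlib_python.py sandbox": 3,
    "Other": 4,
}
total_skipped = sum(skips.values())

# Outcome counts from pytest's final summary line.
outcomes = {"passed": 800, "failed": 2, "skipped": 61, "xfailed": 2, "xpassed": 1}

# From the collection line: 864 tests selected, plus 2 skips at collection time.
selected, collection_skips = 864, 2

print(total_skipped == outcomes["skipped"])
print(sum(outcomes.values()) == selected + collection_skips)
```

Both checks print True, so the breakdown accounts for all 61 skips and the outcome counts add up to the 866 reported results.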
Terminal output
$ uv run pytest
Built mellea @ file:///Users/ajbozarth/workspace/ai/mellea
Uninstalled 1 package in 1ms
Installed 3 packages in 3ms
=========================================================================================================== test session starts ============================================================================================================
platform darwin -- Python 3.12.8, pytest-9.0.0, pluggy-1.6.0
rootdir: /Users/ajbozarth/workspace/ai/mellea
configfile: pyproject.toml
testpaths: test, docs
plugins: nbmake-1.5.5, recording-0.13.4, anyio-4.11.0, xdist-3.8.0, json-report-1.5.0, timeout-2.4.0, metadata-3.1.1, asyncio-1.3.0, langsmith-0.6.6, Faker-37.12.0, cov-7.0.0
timeout: 900.0s
timeout method: signal
timeout func_only: False
asyncio: mode=Mode.AUTO, debug=False, asyncio_default_fixture_loop_scope=None, asyncio_default_test_loop_scope=function
collected 883 items / 19 deselected / 2 skipped / 864 selected
test/backends/test_adapters/test_adapter.py . [ 0%]
test/backends/test_bedrock.py s [ 0%]
test/backends/test_huggingface.py sssssssssssssssssss [ 2%]
test/backends/test_huggingface_tools.py s [ 2%]
test/backends/test_litellm_ollama.py ........ [ 3%]
test/backends/test_litellm_watsonx.py ssss [ 3%]
test/backends/test_mellea_tool.py ....... [ 4%]
test/backends/test_model_options.py ..... [ 5%]
test/backends/test_ollama.py .....X.... [ 6%]
test/backends/test_openai_ollama.py ............. [ 7%]
test/backends/test_openai_vllm.py sssssss [ 8%]
test/backends/test_tool_calls.py ... [ 9%]
test/backends/test_tool_decorator.py ................... [ 11%]
test/backends/test_tool_helpers.py ... [ 11%]
test/backends/test_tool_validation_integration.py ................................. [ 15%]
test/backends/test_vision_ollama.py .... [ 15%]
test/backends/test_vision_openai.py .... [ 16%]
test/backends/test_watsonx.py sssssssssss [ 17%]
test/cli/test_alora_train.py .... [ 18%]
test/cli/test_alora_train_integration.py ss [ 18%]
test/core/test_astream_exception_propagation.py ..... [ 18%]
test/core/test_astream_incremental.py ...... [ 19%]
test/core/test_astream_mock.py ...... [ 20%]
test/core/test_base.py .... [ 20%]
test/core/test_component_typing.py ........ [ 21%]
test/core/test_model_output_thunk.py .. [ 21%]
test/decompose/test_decompose.py .......... [ 23%]
test/formatters/granite/test_intrinsics_formatters.py ........................................................x.................. [ 31%]
test/formatters/test_template_formatter.py ................ [ 33%]
test/helpers/test_event_loop_helper.py .... [ 34%]
test/helpers/test_server_type.py ................ [ 35%]
test/plugins/test_all_payloads.py ................................................................................................... [ 47%]
test/plugins/test_blocking.py ................ [ 49%]
test/plugins/test_build_global_context.py ....... [ 50%]
test/plugins/test_decorators.py ......... [ 51%]
test/plugins/test_execution_modes.py ........................... [ 54%]
test/plugins/test_hook_call_sites.py .............................. [ 57%]
test/plugins/test_manager.py ss...... [ 58%]
test/plugins/test_mellea_plugin.py ....... [ 59%]
test/plugins/test_payloads.py .......... [ 60%]
test/plugins/test_pluginset.py ......... [ 61%]
test/plugins/test_policies.py ...... [ 62%]
test/plugins/test_policy_enforcement.py .......... [ 63%]
test/plugins/test_priority_ordering.py .............. [ 65%]
test/plugins/test_scoping.py ................................... [ 69%]
test/plugins/test_tool_hooks_redaction.py ....... [ 70%]
test/plugins/test_unregister.py ......... [ 71%]
test/stdlib/components/docs/test_document.py ... [ 71%]
test/stdlib/components/docs/test_richdocument.py .....s [ 72%]
test/stdlib/components/intrinsic/test_core.py ..F [ 72%]
test/stdlib/components/intrinsic/test_guardian.py ...... [ 73%]
test/stdlib/components/intrinsic/test_rag.py ....F.. [ 73%]
test/stdlib/components/test_chat.py . [ 74%]
test/stdlib/components/test_genslot.py ................... [ 76%]
test/stdlib/components/test_hello_world.py .. [ 76%]
test/stdlib/components/test_mify.py ........... [ 77%]
test/stdlib/components/test_transform.py .. [ 78%]
test/stdlib/requirements/test_reqlib_markdown.py ...... [ 78%]
test/stdlib/requirements/test_reqlib_python.py .............sss..... [ 81%]
test/stdlib/requirements/test_reqlib_tools.py . [ 81%]
test/stdlib/requirements/test_requirement.py ..... [ 81%]
test/stdlib/sampling/test_majority_voting.py .. [ 82%]
test/stdlib/sampling/test_sampling_ctx.py .. [ 82%]
test/stdlib/sampling/test_sofai_graph_coloring.py ......................... [ 85%]
test/stdlib/sampling/test_sofai_sampling.py ..................... [ 87%]
test/stdlib/sampling/test_think_budget_forcing.py .. [ 87%]
test/stdlib/test_base_context.py ..... [ 88%]
test/stdlib/test_chat_view.py .. [ 88%]
test/stdlib/test_functional.py .... [ 89%]
test/stdlib/test_session.py s....... [ 90%]
test/stdlib/test_spans.py .x [ 90%]
test/telemetry/test_logging.py ........ [ 91%]
test/telemetry/test_metrics.py ....................................... [ 95%]
test/telemetry/test_metrics_backend.py ....s.... [ 96%]
test/telemetry/test_metrics_plugins.py .... [ 97%]
test/telemetry/test_metrics_token.py .... [ 97%]
test/telemetry/test_tracing.py .............. [ 99%]
test/telemetry/test_tracing_backend.py ssssss [100%]
================================================================================================================= FAILURES =================================================================================================================
______________________________________________________________________________________________________ test_find_context_attributions ______________________________________________________________________________________________________
backend = <mellea.backends.huggingface.LocalHFBackend object at 0x14df00380>
@pytest.mark.qualitative
def test_find_context_attributions(backend):
"""Verify that the context-attribution intrinsic functions properly."""
context, assistant_response, documents = _read_rag_input_json(
"context-attribution.json"
)
expected = _read_rag_output_json("context-attribution.json")
> result = core.find_context_attributions(
assistant_response, documents, context, backend
)
test/stdlib/components/intrinsic/test_core.py:102:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
mellea/stdlib/components/intrinsic/core.py:90: in find_context_attributions
result_json = call_intrinsic(
mellea/stdlib/components/intrinsic/_util.py:39: in call_intrinsic
model_output_thunk, _ = mfuncs.act(
mellea/stdlib/functional.py:98: in act
out = _run_async_in_thread(
mellea/helpers/event_loop_helper.py:105: in _run_async_in_thread
return __event_loop_handler(co)
^^^^^^^^^^^^^^^^^^^^^^^^
mellea/helpers/event_loop_helper.py:77: in __call__
return asyncio.run_coroutine_threadsafe(co, self._event_loop).result()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
../../../.local/share/uv/python/cpython-3.12.8-macos-aarch64-none/lib/python3.12/concurrent/futures/_base.py:456: in result
return self.__get_result()
^^^^^^^^^^^^^^^^^^^
../../../.local/share/uv/python/cpython-3.12.8-macos-aarch64-none/lib/python3.12/concurrent/futures/_base.py:401: in __get_result
raise self._exception
mellea/stdlib/functional.py:584: in aact
await result.avalue()
mellea/core/base.py:394: in avalue
await self.astream()
mellea/core/base.py:485: in astream
await self._process(self, chunk)
mellea/backends/huggingface.py:581: in granite_formatters_processing
res = result_processor.transform(chunk, rewritten) # type: ignore
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
mellea/formatters/granite/base/io.py:182: in transform
return self._transform_impl(chat_completion_response, chat_completion)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
mellea/formatters/granite/intrinsics/output.py:1267: in _transform_impl
self._transform_choice(c, chat_completion)
mellea/formatters/granite/intrinsics/output.py:1308: in _transform_choice
parsed_json = rule.apply(
mellea/formatters/granite/intrinsics/output.py:166: in apply
result = self._apply_at_path(result, path, prepare_output)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
mellea/formatters/granite/intrinsics/output.py:251: in _apply_at_path
new_values = self._transform(original_value, path, prepare_output)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
self = <mellea.formatters.granite.intrinsics.output.DecodeSentences object at 0x14dbcf050>, value = 765211, path = (0, 'r')
prepare_output = {'begins': [0, 137], 'document_ids': [None, None], 'ends': [137, 257], 'message_indices': [None, None], ...}
def _transform(self, value: Any, path: tuple, prepare_output: dict) -> dict:
# Unpack global values we set aside during the prepare phase
begins = prepare_output["begins"]
ends = prepare_output["ends"]
texts = prepare_output["texts"]
document_ids = prepare_output.get("document_ids")
message_indices = prepare_output.get("message_indices")
if not isinstance(value, int):
raise TypeError(
f"Expected integer sentence number at path {path}, but "
f"found non-integer value {value} of type {type(value)}"
)
sentence_num = value
result = {}
if self.begin_name is not None:
> result[self.begin_name] = begins[sentence_num]
^^^^^^^^^^^^^^^^^^^^
E IndexError: list index out of range
mellea/formatters/granite/intrinsics/output.py:714: IndexError
----------------------------------------------------------------------------------------------------------- Captured stdout call -----------------------------------------------------------------------------------------------------------
=== 15:39:12-INFO ======
passing in model options when generating with an adapter; some model options may be overwritten / ignored
------------------------------------------------------------------------------------------------------------ Captured log call -------------------------------------------------------------------------------------------------------------
INFO fancy_logger:huggingface.py:475 passing in model options when generating with an adapter; some model options may be overwritten / ignored
--------------------------------------------------------------------------------------------------------- Captured stdout teardown ---------------------------------------------------------------------------------------------------------
=== 15:41:30-INFO ======
Cleaning up test_core backend GPU memory...
=== 15:41:30-INFO ======
Cleared LRU cache
=== 15:41:30-INFO ======
Removed accelerate dispatch hooks
---------------------------------------------------------------------------------------------------------- Captured log teardown -----------------------------------------------------------------------------------------------------------
INFO fancy_logger:conftest.py:342 Cleaning up test_core backend GPU memory...
INFO fancy_logger:conftest.py:365 Cleared LRU cache
INFO fancy_logger:conftest.py:402 Removed accelerate dispatch hooks
_______________________________________________________________________________________________________ test_hallucination_detection _______________________________________________________________________________________________________
backend = <mellea.backends.huggingface.LocalHFBackend object at 0x13eff8d10>
@pytest.mark.qualitative
def test_hallucination_detection(backend):
"""Verify that the hallucination detection intrinsic functions properly."""
context, assistant_response, docs = _read_input_json("hallucination_detection.json")
expected = _read_output_json("hallucination_detection.json")
# First call triggers adapter loading
result = rag.flag_hallucinated_content(assistant_response, docs, context, backend)
# pytest.approx() chokes on lists of records, so we do this complicated dance.
for r, e in zip(result, expected, strict=True): # type: ignore
> assert pytest.approx(r, abs=2e-2) == e
E AssertionError: assert approx({'resp...he sentence.}) == {'explanation...end': 31, ...}
E
E comparison failed. Mismatched elements: 1 / 5:
E Max absolute difference: 0.022802131238099044
E Max relative difference: 0.03036794087969006
E Index | Obtained | Expected
E faithfulness_likelihood | 0.7280598165124975 | 0.7508619477505966 ± 0.02
test/stdlib/components/intrinsic/test_rag.py:159: AssertionError
----------------------------------------------------------------------------------------------------------- Captured stdout call -----------------------------------------------------------------------------------------------------------
=== 15:42:34-INFO ======
passing in model options when generating with an adapter; some model options may be overwritten / ignored
------------------------------------------------------------------------------------------------------------ Captured log call -------------------------------------------------------------------------------------------------------------
INFO fancy_logger:huggingface.py:475 passing in model options when generating with an adapter; some model options may be overwritten / ignored
============================================================================================================= warnings summary =============================================================================================================
test/backends/test_litellm_ollama.py::test_litellm_ollama_chat
test/backends/test_litellm_ollama.py::test_generate_from_raw
test/backends/test_litellm_ollama.py::test_async_parallel_requests
test/backends/test_litellm_ollama.py::test_async_avalue
test/telemetry/test_metrics_backend.py::test_litellm_token_metrics_integration[non-streaming]
test/telemetry/test_metrics_backend.py::test_litellm_token_metrics_integration[streaming]
/Users/ajbozarth/workspace/ai/mellea/.venv/lib/python3.12/site-packages/aiohttp/connector.py:993: DeprecationWarning: enable_cleanup_closed ignored because https://github.com/python/cpython/pull/118960 is fixed in Python version sys.version_info(major=3, minor=12, micro=8, releaselevel='final', serial=0)
super().__init__(
test/backends/test_litellm_ollama.py::test_litellm_ollama_chat
/Users/ajbozarth/workspace/ai/mellea/.venv/lib/python3.12/site-packages/pydantic/main.py:464: UserWarning: Pydantic serializer warnings:
PydanticSerializationUnexpectedValue(Expected 10 fields but got 5: Expected `Message` - serialized value may not be as expected [field_name='message', input_value=Message(content='The answ...er_specific_fields=None), input_type=Message])
PydanticSerializationUnexpectedValue(Expected `StreamingChoices` - serialized value may not be as expected [field_name='choices', input_value=Choices(finish_reason='st...r_specific_fields=None)), input_type=Choices])
return self.__pydantic_serializer__.to_python(
test/backends/test_litellm_ollama.py::test_litellm_ollama_instruct
/Users/ajbozarth/workspace/ai/mellea/.venv/lib/python3.12/site-packages/pydantic/main.py:464: UserWarning: Pydantic serializer warnings:
PydanticSerializationUnexpectedValue(Expected 10 fields but got 5: Expected `Message` - serialized value may not be as expected [field_name='message', input_value=Message(content="Subject:...er_specific_fields=None), input_type=Message])
PydanticSerializationUnexpectedValue(Expected `StreamingChoices` - serialized value may not be as expected [field_name='choices', input_value=Choices(finish_reason='st...r_specific_fields=None)), input_type=Choices])
return self.__pydantic_serializer__.to_python(
test/backends/test_litellm_ollama.py::test_litellm_ollama_instruct
test/backends/test_litellm_ollama.py::test_litellm_ollama_instruct_options
/Users/ajbozarth/workspace/ai/mellea/.venv/lib/python3.12/site-packages/pydantic/main.py:464: UserWarning: Pydantic serializer warnings:
PydanticSerializationUnexpectedValue(Expected 10 fields but got 5: Expected `Message` - serialized value may not be as expected [field_name='message', input_value=Message(content='yes', ro...er_specific_fields=None), input_type=Message])
PydanticSerializationUnexpectedValue(Expected `StreamingChoices` - serialized value may not be as expected [field_name='choices', input_value=Choices(finish_reason='st...r_specific_fields=None)), input_type=Choices])
return self.__pydantic_serializer__.to_python(
test/backends/test_litellm_ollama.py::test_litellm_ollama_instruct_options
/Users/ajbozarth/workspace/ai/mellea/.venv/lib/python3.12/site-packages/pydantic/main.py:464: UserWarning: Pydantic serializer warnings:
PydanticSerializationUnexpectedValue(Expected 10 fields but got 5: Expected `Message` - serialized value may not be as expected [field_name='message', input_value=Message(content='Subject:...er_specific_fields=None), input_type=Message])
PydanticSerializationUnexpectedValue(Expected `StreamingChoices` - serialized value may not be as expected [field_name='choices', input_value=Choices(finish_reason='st...r_specific_fields=None)), input_type=Choices])
return self.__pydantic_serializer__.to_python(
test/backends/test_litellm_ollama.py::test_gen_slot
/Users/ajbozarth/workspace/ai/mellea/.venv/lib/python3.12/site-packages/pydantic/main.py:464: UserWarning: Pydantic serializer warnings:
PydanticSerializationUnexpectedValue(Expected 10 fields but got 5: Expected `Message` - serialized value may not be as expected [field_name='message', input_value=Message(content='{\n"resu...er_specific_fields=None), input_type=Message])
PydanticSerializationUnexpectedValue(Expected `StreamingChoices` - serialized value may not be as expected [field_name='choices', input_value=Choices(finish_reason='st...r_specific_fields=None)), input_type=Choices])
return self.__pydantic_serializer__.to_python(
test/backends/test_litellm_ollama.py::test_async_parallel_requests
test/telemetry/test_metrics_backend.py::test_litellm_token_metrics_integration[streaming]
/Users/ajbozarth/workspace/ai/mellea/.venv/lib/python3.12/site-packages/litellm/litellm_core_utils/streaming_handler.py:1855: PydanticDeprecatedSince20: The `dict` method is deprecated; use `model_dump` instead. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.12/migration/
obj_dict = processed_chunk.dict()
test/backends/test_litellm_ollama.py::test_async_parallel_requests
test/backends/test_litellm_ollama.py::test_async_avalue
test/backends/test_litellm_ollama.py::test_async_avalue
/Users/ajbozarth/workspace/ai/mellea/.venv/lib/python3.12/site-packages/pydantic/main.py:464: UserWarning: Pydantic serializer warnings:
PydanticSerializationUnexpectedValue(Expected 10 fields but got 5: Expected `Message` - serialized value may not be as expected [field_name='message', input_value=Message(content='Hello! H...er_specific_fields=None), input_type=Message])
PydanticSerializationUnexpectedValue(Expected `StreamingChoices` - serialized value may not be as expected [field_name='choices', input_value=Choices(finish_reason='st...r_specific_fields=None)), input_type=Choices])
return self.__pydantic_serializer__.to_python(
test/backends/test_litellm_ollama.py::test_async_parallel_requests
/Users/ajbozarth/workspace/ai/mellea/.venv/lib/python3.12/site-packages/pydantic/main.py:464: UserWarning: Pydantic serializer warnings:
PydanticSerializationUnexpectedValue(Expected 10 fields but got 5: Expected `Message` - serialized value may not be as expected [field_name='message', input_value=Message(content='Goodbye!...er_specific_fields=None), input_type=Message])
PydanticSerializationUnexpectedValue(Expected `StreamingChoices` - serialized value may not be as expected [field_name='choices', input_value=Choices(finish_reason='st...r_specific_fields=None)), input_type=Choices])
return self.__pydantic_serializer__.to_python(
test/backends/test_tool_calls.py::test_tool_called_from_context_action
<frozen abc>:106: DeprecationWarning: Use BaseMetaSerializer() instead.
test/backends/test_vision_ollama.py::test_image_block_construction
/Users/ajbozarth/workspace/ai/mellea/test/backends/test_vision_ollama.py:38: DeprecationWarning: 'mode' parameter is deprecated and will be removed in Pillow 13 (2026-10-15)
random_image = Image.fromarray(random_pixel_data, "RGB")
test/backends/test_vision_openai.py::test_image_block_construction
/Users/ajbozarth/workspace/ai/mellea/test/backends/test_vision_openai.py:48: DeprecationWarning: 'mode' parameter is deprecated and will be removed in Pillow 13 (2026-10-15)
random_image = Image.fromarray(random_pixel_data, "RGB")
test/cli/test_alora_train.py::test_alora_config_creation
test/cli/test_alora_train.py::test_lora_config_creation
test/cli/test_alora_train.py::test_invocation_prompt_tokenization
/Users/ajbozarth/workspace/ai/mellea/.venv/lib/python3.12/site-packages/trl/trainer/sft_config.py:257: DeprecationWarning: `max_seq_length` is deprecated and will be removed in version 0.20.0. Use `max_length` instead.
warnings.warn(
test/helpers/test_event_loop_helper.py::test_event_loop_handler_with_forking
/Users/ajbozarth/.local/share/uv/python/cpython-3.12.8-macos-aarch64-none/lib/python3.12/multiprocessing/popen_fork.py:66: DeprecationWarning: This process (pid=49161) is multi-threaded, use of fork() may lead to deadlocks in the child.
self.pid = os.fork()
test/stdlib/components/docs/test_richdocument.py::test_richdocument_basics
test/stdlib/components/docs/test_richdocument.py::test_richdocument_markdown
test/stdlib/components/docs/test_richdocument.py::test_richdocument_save
test/stdlib/components/docs/test_richdocument.py::test_richdocument_save
/Users/ajbozarth/workspace/ai/mellea/.venv/lib/python3.12/site-packages/docling_core/transforms/serializer/markdown.py:490: DeprecationWarning: Field `annotations` is deprecated; use `meta` instead.
for ann in item.annotations
test/stdlib/components/intrinsic/test_core.py: 2 warnings
test/stdlib/components/intrinsic/test_guardian.py: 3 warnings
test/stdlib/components/intrinsic/test_rag.py: 5 warnings
/Users/ajbozarth/workspace/ai/mellea/.venv/lib/python3.12/site-packages/peft/tuners/tuners_utils.py:285: UserWarning: Already found a `peft_config` attribute in the model. This will lead to having multiple adapters in the model. Make sure to know what you are doing!
warnings.warn(
test/stdlib/test_spans.py::test_lazy_spans
/Users/ajbozarth/workspace/ai/mellea/.venv/lib/python3.12/site-packages/torch/nn/functional.py:5294: UserWarning: MPS: The constant padding of more than 3 dimensions is not currently supported natively. It uses View Ops default implementation to run. This may have performance implications. (Triggered internally at /Users/runner/work/pytorch/pytorch/pytorch/aten/src/ATen/native/mps/operations/Pad.mm:468.)
return torch._C._nn.pad(input, pad, mode, value)
test/telemetry/test_logging.py::test_otlp_logging_enabled_without_endpoint_warns
/Users/ajbozarth/workspace/ai/mellea/mellea/telemetry/logging.py:97: UserWarning: OTLP logs exporter is enabled (MELLEA_LOGS_OTLP=true) but no endpoint is configured. Set OTEL_EXPORTER_OTLP_LOGS_ENDPOINT or OTEL_EXPORTER_OTLP_ENDPOINT to export logs.
_logger_provider = _setup_logger_provider()
test/telemetry/test_metrics.py: 24 warnings
test/telemetry/test_metrics_backend.py: 8 warnings
test/telemetry/test_metrics_token.py: 4 warnings
/Users/ajbozarth/workspace/ai/mellea/mellea/telemetry/metrics.py:245: UserWarning: Metrics are enabled (MELLEA_METRICS_ENABLED=true) but no exporters are configured. Metrics will be collected but not exported. Set MELLEA_METRICS_PROMETHEUS=true, set MELLEA_METRICS_OTLP=true with an endpoint (OTEL_EXPORTER_OTLP_METRICS_ENDPOINT or OTEL_EXPORTER_OTLP_ENDPOINT), or set MELLEA_METRICS_CONSOLE=true to export metrics.
_meter_provider = _setup_meter_provider()
test/telemetry/test_metrics.py: 28 warnings
test/telemetry/test_metrics_backend.py: 8 warnings
test/telemetry/test_metrics_token.py: 4 warnings
/Users/ajbozarth/.local/share/uv/python/cpython-3.12.8-macos-aarch64-none/lib/python3.12/importlib/__init__.py:131: UserWarning: TokenMetricsPlugin already registered: Plugin token_metrics.generation_post_call already registered
_bootstrap._exec(spec, module)
test/telemetry/test_metrics_backend.py::test_litellm_token_metrics_integration[non-streaming]
test/telemetry/test_tracing.py::test_session_with_tracing_disabled
/Users/ajbozarth/workspace/ai/mellea/.venv/lib/python3.12/site-packages/pydantic/main.py:464: UserWarning: Pydantic serializer warnings:
PydanticSerializationUnexpectedValue(Expected 10 fields but got 5: Expected `Message` - serialized value may not be as expected [field_name='message', input_value=Message(content="Of cours...ields={'refusal': None}), input_type=Message])
PydanticSerializationUnexpectedValue(Expected `StreamingChoices` - serialized value may not be as expected [field_name='choices', input_value=Choices(finish_reason='st...ider_specific_fields={}), input_type=Choices])
return self.__pydantic_serializer__.to_python(
test/telemetry/test_metrics_backend.py::test_litellm_token_metrics_integration[streaming]
/Users/ajbozarth/workspace/ai/mellea/.venv/lib/python3.12/site-packages/litellm/litellm_core_utils/model_response_utils.py:206: PydanticDeprecatedSince211: Accessing the 'model_computed_fields' attribute on the instance is deprecated. Instead, you should access this attribute from the model class. Deprecated in Pydantic V2.11 to be removed in V3.0.
or callable(getattr(delta, attr_name))
test/telemetry/test_metrics_backend.py::test_litellm_token_metrics_integration[streaming]
/Users/ajbozarth/workspace/ai/mellea/.venv/lib/python3.12/site-packages/litellm/litellm_core_utils/model_response_utils.py:206: PydanticDeprecatedSince211: Accessing the 'model_fields' attribute on the instance is deprecated. Instead, you should access this attribute from the model class. Deprecated in Pydantic V2.11 to be removed in V3.0.
or callable(getattr(delta, attr_name))
test/telemetry/test_metrics_backend.py::test_litellm_token_metrics_integration[streaming]
/Users/ajbozarth/workspace/ai/mellea/.venv/lib/python3.12/site-packages/pydantic/main.py:464: UserWarning: Pydantic serializer warnings:
PydanticSerializationUnexpectedValue(Expected 10 fields but got 5: Expected `Message` - serialized value may not be as expected [field_name='message', input_value=Message(content="I'm an A...er_specific_fields=None), input_type=Message])
PydanticSerializationUnexpectedValue(Expected `StreamingChoices` - serialized value may not be as expected [field_name='choices', input_value=Choices(finish_reason='st...r_specific_fields=None)), input_type=Choices])
return self.__pydantic_serializer__.to_python(
-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
============================================================================================================= Skipped Examples =============================================================================================================
The following examples were skipped during collection:
• 102_example.py: Example marked with skip marker
• example_readme_generator.py: Example marked with skip marker
• make_training_data.py: Example marked with skip marker
• stembolts_intrinsic.py: Example marked with skip marker
• bedrock_litellm_example.py: Example marked with skip marker
• bedrock_openai_example.py: Example marked with skip marker
• qiskit_code_validation.py: Example marked with skip marker
• validation_helpers.py: Example marked with skip marker
• python_decompose_result.py: Example marked to always skip (skip_always marker)
• m_decomp_result.py: Example marked to always skip (skip_always marker)
• client.py: Example marked to always skip (skip_always marker)
• pii_serve.py: Example marked to always skip (skip_always marker)
• mcp_example.py: Example marked to always skip (skip_always marker)
• rich_document_advanced.py: Example marked with skip marker
• mellea_pdf.py: Example marked to always skip (skip_always marker)
• simple_rag_with_filter.py: Example marked to always skip (skip_always marker)
============================================================================================================== tests coverage ==============================================================================================================
_____________________________________________________________________________________________ coverage: platform darwin, python 3.12.8-final-0 _____________________________________________________________________________________________
Coverage HTML written to dir htmlcov
Coverage JSON written to file coverage.json
========================================================================================================= short test summary info ==========================================================================================================
FAILED test/stdlib/components/intrinsic/test_core.py::test_find_context_attributions - IndexError: list index out of range
FAILED test/stdlib/components/intrinsic/test_rag.py::test_hallucination_detection - AssertionError: assert approx({'resp...he sentence.}) == {'explanation...end': 31, ...}
================================================================ 2 failed, 800 passed, 61 skipped, 19 deselected, 2 xfailed, 1 xpassed, 122 warnings in 1925.97s (0:32:05) =================================================================
Local slow run (uv run pytest -m slow, Mac M1 Max 32GB): 18 passed, 3 skipped, 864 deselected in 3m32s. All expected.
Terminal output
$ uv run pytest -m slow
=========================================================================================================== test session starts ============================================================================================================
platform darwin -- Python 3.12.8, pytest-9.0.0, pluggy-1.6.0
rootdir: /Users/ajbozarth/workspace/ai/mellea
configfile: pyproject.toml
testpaths: test, docs
plugins: nbmake-1.5.5, recording-0.13.4, anyio-4.11.0, xdist-3.8.0, json-report-1.5.0, timeout-2.4.0, metadata-3.1.1, asyncio-1.3.0, langsmith-0.6.6, Faker-37.12.0, cov-7.0.0
timeout: 900.0s
timeout method: signal
timeout func_only: False
asyncio: mode=Mode.AUTO, debug=False, asyncio_default_fixture_loop_scope=None, asyncio_default_test_loop_scope=function
collected 883 items / 864 deselected / 2 skipped / 19 selected
test/package/test_dependency_isolation.py ..s................ [100%]
============================================================================================================= Skipped Examples =============================================================================================================
The following examples were skipped during collection:
• 102_example.py: Example marked with skip marker
• example_readme_generator.py: Example marked with skip marker
• make_training_data.py: Example marked with skip marker
• stembolts_intrinsic.py: Example marked with skip marker
• bedrock_litellm_example.py: Example marked with skip marker
• bedrock_openai_example.py: Example marked with skip marker
• qiskit_code_validation.py: Example marked with skip marker
• validation_helpers.py: Example marked with skip marker
• python_decompose_result.py: Example marked to always skip (skip_always marker)
• m_decomp_result.py: Example marked to always skip (skip_always marker)
• client.py: Example marked to always skip (skip_always marker)
• pii_serve.py: Example marked to always skip (skip_always marker)
• mcp_example.py: Example marked to always skip (skip_always marker)
• rich_document_advanced.py: Example marked with skip marker
• mellea_pdf.py: Example marked to always skip (skip_always marker)
• simple_rag_with_filter.py: Example marked to always skip (skip_always marker)
============================================================================================================== tests coverage ==============================================================================================================
_____________________________________________________________________________________________ coverage: platform darwin, python 3.12.8-final-0 _____________________________________________________________________________________________
Coverage HTML written to dir htmlcov
Coverage JSON written to file coverage.json
======================================================================================== 18 passed, 3 skipped, 864 deselected in 212.41s (0:03:32) =========================================================================================
Cluster run (./test/scripts/run_tests_with_ollama.sh, IBM LSF, NVIDIA GPU node, Python 3.12.12): 735 passed, 47 failed, 30 skipped, 58 errors, 19 deselected, 3 xfailed in 1:20:16.
The 58 errors and the majority of the 47 failures are Ollama connectivity issues: the script detected an existing Ollama server, but all three model warmups timed out, and tests then errored with "could not create OllamaModelBackend: ollama server not running at None" (base_url resolving to None). This is an environment issue unrelated to this PR. I plan to re-run with a clean environment.
The test_find_context_attributions qualitative flake is also present, same as in the local run.
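As a rough illustration of the kind of pre-flight check that could catch this earlier, here is a minimal sketch that asks the server for its model list before any tests start. It is not what run_tests_with_ollama.sh does; the function name `ollama_is_responsive` and the default timeout are hypothetical, and it assumes the server exposes Ollama's `/api/tags` listing endpoint on the address shown in the log. A stale server may still answer this request, so a passing check is only weak evidence of health.

```python
import urllib.request


def ollama_is_responsive(
    base_url: str = "http://127.0.0.1:11434", timeout: float = 5.0
) -> bool:
    """Return True if a GET on the model-listing endpoint answers with HTTP 200.

    Hypothetical helper for illustration; not part of the test scripts.
    """
    url = base_url.rstrip("/") + "/api/tags"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # URLError, refused connections, and socket timeouts are all
        # OSError subclasses, so any of them means "not responsive".
        return False


if __name__ == "__main__":
    print("ollama responsive:", ollama_is_responsive())
```

Failing fast on a `False` result would avoid spending six minutes on warmup timeouts before pytest even starts.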
Terminal output
$ bsub -Is -n 1 -G grp_preemptable -q preemptable -gpu "num=1/task:mode=shared:mps=no:j_exclusive=yes:gvendor=nvidia" /bin/bash
num=1/task:mode=shared:mps=no:j_exclusive=yes:gvendor=nvidia
GPU mode=shared. This is allowed but deprecated
Job <741102> is submitted to queue <preemptable>.
<<Waiting for dispatch ...>>
<<Starting on p5-r06-n1>>
[ajbozarth@p5-r06-n1 mellea]$ bash ./test/scripts/run_tests_with_ollama.sh
[20:27:25] WARNING: CACHE_DIR not set. Ollama models will download to ~/.ollama (default)
[20:27:25] Using standalone log directory: logs/2026-03-27-20:27:25
[20:27:25] Ollama already running on 127.0.0.1:11434 — using existing server
[20:27:26] Model granite4:micro already pulled
[20:27:26] Model granite4:micro-h already pulled
[20:27:26] Pulling granite3.2-vision ...
success
[20:27:40] All models ready.
[20:27:40] Warming up models...
[20:27:40] Warming granite4:micro ...
[20:29:40] Warning: warmup for granite4:micro timed out (will load on first test)
[20:29:40] Warming granite4:micro-h ...
[20:31:40] Warning: warmup for granite4:micro-h timed out (will load on first test)
[20:31:40] Warming granite3.2-vision ...
[20:33:40] Warning: warmup for granite3.2-vision timed out (will load on first test)
[20:33:40] Warmup complete.
[20:33:40] Starting pytest...
[20:33:40] Log directory: logs/2026-03-27-20:27:25
[20:33:40] Pytest args: --group-by-backend
============================= test session starts ==============================
platform linux -- Python 3.12.12, pytest-9.0.0, pluggy-1.6.0
rootdir: /proj/dmfexp/eiger/users/ajbozarth/mellea
configfile: pyproject.toml
plugins: nbmake-1.5.5, anyio-4.11.0, json-report-1.5.0, recording-0.13.4, asyncio-1.3.0, timeout-2.4.0, metadata-3.1.1, Faker-37.12.0, xdist-3.8.0, langsmith-0.6.6, cov-7.0.0
asyncio: mode=Mode.AUTO, debug=False, asyncio_default_fixture_loop_scope=None, asyncio_default_test_loop_scope=function
timeout: 900.0s
timeout method: signal
timeout func_only: False
collected 892 items / 19 deselected / 873 selected
test/backends/test_huggingface.py ................... [ 2%]
test/backends/test_huggingface_tools.py . [ 2%]
test/cli/test_alora_train_integration.py .. [ 2%]
test/formatters/granite/test_intrinsics_formatters.py ....x.......... [ 4%]
test/stdlib/components/docs/test_richdocument.py s [ 4%]
test/stdlib/components/intrinsic/test_core.py ..F [ 4%]
test/stdlib/components/intrinsic/test_guardian.py ...... [ 5%]
test/stdlib/components/intrinsic/test_rag.py ....... [ 6%]
test/stdlib/test_spans.py .x [ 6%]
test/telemetry/test_metrics_backend.py .. [ 6%]
test/backends/test_openai_ollama.py FFFFFFFF..... [ 8%]
test/backends/test_openai_vllm.py ....... [ 8%]
test/backends/test_vision_openai.py ..FF [ 9%]
test/telemetry/test_metrics_backend.py FF [ 9%]
test/backends/test_vllm.py ........ [ 10%]
test/backends/test_vllm_tools.py . [ 10%]
test/backends/test_litellm_ollama.py .FFFFFFF [ 11%]
test/backends/test_mellea_tool.py EE [ 11%]
test/backends/test_ollama.py EEEEExEEEE [ 12%]
test/backends/test_tool_calls.py EEE [ 13%]
test/backends/test_vision_ollama.py ..EE [ 13%]
test/core/test_astream_incremental.py FFFF.F [ 14%]
test/core/test_component_typing.py EEE [ 14%]
test/core/test_model_output_thunk.py EE [ 15%]
test/stdlib/components/test_genslot.py EEEEEEEEEEEEEEE.EEE [ 17%]
test/stdlib/requirements/test_requirement.py FF... [ 17%]
test/stdlib/sampling/test_majority_voting.py EE [ 17%]
test/stdlib/sampling/test_sampling_ctx.py EE [ 18%]
test/stdlib/sampling/test_sofai_graph_coloring.py FFF [ 18%]
test/stdlib/sampling/test_sofai_sampling.py F [ 18%]
test/stdlib/sampling/test_think_budget_forcing.py EE [ 18%]
test/stdlib/test_chat_view.py EE [ 19%]
test/stdlib/test_functional.py EEEE [ 19%]
test/stdlib/test_session.py sEEEEEEE [ 20%]
test/telemetry/test_metrics_backend.py FFFF [ 20%]
test/telemetry/test_tracing.py FFFF [ 21%]
test/telemetry/test_tracing_backend.py ssssss [ 22%]
test/backends/test_bedrock.py s [ 22%]
test/backends/test_litellm_watsonx.py ssss [ 22%]
test/backends/test_watsonx.py sssssssssss [ 23%]
test/telemetry/test_metrics_backend.py s [ 24%]
test/backends/test_adapters/test_adapter.py . [ 24%]
test/backends/test_mellea_tool.py ..... [ 24%]
test/backends/test_model_options.py ..... [ 25%]
test/backends/test_tool_decorator.py ................... [ 27%]
test/backends/test_tool_helpers.py ... [ 27%]
test/backends/test_tool_validation_integration.py ...................... [ 30%]
........... [ 31%]
test/cli/test_alora_train.py .... [ 32%]
test/core/test_astream_exception_propagation.py ..... [ 32%]
test/core/test_astream_mock.py ...... [ 33%]
test/core/test_base.py .... [ 33%]
test/core/test_component_typing.py ..... [ 34%]
test/decompose/test_decompose.py .......... [ 35%]
test/formatters/granite/test_intrinsics_formatters.py .................. [ 37%]
..................................FFFFFFFF [ 42%]
test/formatters/test_template_formatter.py ................ [ 44%]
test/helpers/test_event_loop_helper.py .... [ 44%]
test/helpers/test_server_type.py ................ [ 46%]
test/plugins/test_all_payloads.py ...................................... [ 50%]
............................................................. [ 57%]
test/plugins/test_blocking.py ................ [ 59%]
test/plugins/test_build_global_context.py ....... [ 60%]
test/plugins/test_decorators.py ......... [ 61%]
test/plugins/test_execution_modes.py ........................... [ 64%]
test/plugins/test_hook_call_sites.py .............................. [ 68%]
test/plugins/test_manager.py ss...... [ 68%]
test/plugins/test_mellea_plugin.py ....... [ 69%]
test/plugins/test_payloads.py .......... [ 70%]
test/plugins/test_pluginset.py ......... [ 71%]
test/plugins/test_policies.py ...... [ 72%]
test/plugins/test_policy_enforcement.py .......... [ 73%]
test/plugins/test_priority_ordering.py .............. [ 75%]
test/plugins/test_scoping.py ................................... [ 79%]
test/plugins/test_tool_hooks_redaction.py ....... [ 80%]
test/plugins/test_unregister.py ......... [ 81%]
test/stdlib/components/docs/test_document.py ... [ 81%]
test/stdlib/components/docs/test_richdocument.py ..... [ 82%]
test/stdlib/components/test_chat.py . [ 82%]
test/stdlib/components/test_hello_world.py .. [ 82%]
test/stdlib/components/test_mify.py ........... [ 83%]
test/stdlib/components/test_transform.py .. [ 83%]
test/stdlib/requirements/test_reqlib_markdown.py ...... [ 84%]
test/stdlib/requirements/test_reqlib_python.py .............sss..... [ 87%]
test/stdlib/requirements/test_reqlib_tools.py . [ 87%]
test/stdlib/sampling/test_sofai_graph_coloring.py ...................... [ 89%]
[ 89%]
test/stdlib/sampling/test_sofai_sampling.py .................... [ 91%]
test/stdlib/test_base_context.py ..... [ 92%]
test/telemetry/test_logging.py ........ [ 93%]
test/telemetry/test_metrics.py ....................................... [ 97%]
test/telemetry/test_metrics_plugins.py .... [ 98%]
test/telemetry/test_metrics_token.py .... [ 98%]
test/telemetry/test_tracing.py .......... [100%]
==================================== ERRORS ====================================
... (removed 32,000 lines of error output)
=============================== warnings summary ===============================
test/backends/test_huggingface.py: 1 warning
test/stdlib/components/intrinsic/test_core.py: 2 warnings
test/stdlib/components/intrinsic/test_guardian.py: 3 warnings
test/stdlib/components/intrinsic/test_rag.py: 5 warnings
/proj/dmfexp/eiger/users/ajbozarth/mellea/.venv/lib/python3.12/site-packages/peft/tuners/tuners_utils.py:285: UserWarning: Already found a `peft_config` attribute in the model. This will lead to having multiple adapters in the model. Make sure to know what you are doing!
warnings.warn(
test/cli/test_alora_train_integration.py::test_alora_training_integration
test/cli/test_alora_train_integration.py::test_lora_training_integration
/proj/dmfexp/eiger/users/ajbozarth/mellea/.venv/lib/python3.12/site-packages/trl/trainer/utils.py:103: DeprecationWarning: This class is deprecated and will be removed in version 0.20.0. To train on completion only, please use the parameter `completion_only_loss` of `SFTConfig` instead.
warnings.warn(
test/cli/test_alora_train_integration.py::test_alora_training_integration
test/cli/test_alora_train_integration.py::test_lora_training_integration
test/cli/test_alora_train.py::test_alora_config_creation
test/cli/test_alora_train.py::test_lora_config_creation
test/cli/test_alora_train.py::test_invocation_prompt_tokenization
/proj/dmfexp/eiger/users/ajbozarth/mellea/.venv/lib/python3.12/site-packages/trl/trainer/sft_config.py:257: DeprecationWarning: `max_seq_length` is deprecated and will be removed in version 0.20.0. Use `max_length` instead.
warnings.warn(
test/cli/test_alora_train_integration.py::test_alora_training_integration
test/cli/test_alora_train_integration.py::test_lora_training_integration
/proj/dmfexp/eiger/users/ajbozarth/mellea/.venv/lib/python3.12/site-packages/trl/trainer/sft_trainer.py:678: DeprecationWarning: Failed to apply the formatting function due to the following error: string index out of range. This may be because the function is designed for batched input. Please update it to process one example at a time (i.e., accept and return a single example). For now, we will attempt to apply the function in batched mode, but note that batched formatting is deprecated and will be removed in version 0.21.
warnings.warn(
test/cli/test_alora_train_integration.py::test_alora_training_integration
test/cli/test_alora_train_integration.py::test_lora_training_integration
/proj/dmfexp/eiger/users/ajbozarth/mellea/.venv/lib/python3.12/site-packages/torch/utils/data/_utils/pin_memory.py:57: DeprecationWarning: The argument 'device' of Tensor.pin_memory() is deprecated. Please do not pass this argument. (Triggered internally at /pytorch/aten/src/ATen/native/Memory.cpp:46.)
return data.pin_memory(device)
test/cli/test_alora_train_integration.py::test_alora_training_integration
test/cli/test_alora_train_integration.py::test_lora_training_integration
/proj/dmfexp/eiger/users/ajbozarth/mellea/.venv/lib/python3.12/site-packages/torch/utils/data/_utils/pin_memory.py:57: DeprecationWarning: The argument 'device' of Tensor.is_pinned() is deprecated. Please do not pass this argument. (Triggered internally at /pytorch/aten/src/ATen/native/Memory.cpp:31.)
return data.pin_memory(device)
test/telemetry/test_metrics_backend.py: 8 warnings
test/telemetry/test_metrics.py: 24 warnings
test/telemetry/test_metrics_token.py: 4 warnings
/proj/dmfexp/eiger/users/ajbozarth/mellea/mellea/telemetry/metrics.py:245: UserWarning: Metrics are enabled (MELLEA_METRICS_ENABLED=true) but no exporters are configured. Metrics will be collected but not exported. Set MELLEA_METRICS_PROMETHEUS=true, set MELLEA_METRICS_OTLP=true with an endpoint (OTEL_EXPORTER_OTLP_METRICS_ENDPOINT or OTEL_EXPORTER_OTLP_ENDPOINT), or set MELLEA_METRICS_CONSOLE=true to export metrics.
_meter_provider = _setup_meter_provider()
test/telemetry/test_metrics_backend.py: 7 warnings
test/telemetry/test_metrics.py: 28 warnings
test/telemetry/test_metrics_token.py: 4 warnings
/u/ajbozarth/.local/share/uv/python/cpython-3.12.12-linux-x86_64-gnu/lib/python3.12/importlib/__init__.py:131: UserWarning: TokenMetricsPlugin already registered: Plugin token_metrics.generation_post_call already registered
_bootstrap._exec(spec, module)
test/backends/test_vision_openai.py::test_image_block_construction
/proj/dmfexp/eiger/users/ajbozarth/mellea/test/backends/test_vision_openai.py:48: DeprecationWarning: 'mode' parameter is deprecated and will be removed in Pillow 13 (2026-10-15)
random_image = Image.fromarray(random_pixel_data, "RGB")
test/backends/test_litellm_ollama.py::test_litellm_ollama_chat
test/backends/test_litellm_ollama.py::test_litellm_ollama_instruct
test/backends/test_litellm_ollama.py::test_litellm_ollama_instruct_options
test/backends/test_litellm_ollama.py::test_gen_slot
test/backends/test_litellm_ollama.py::test_generate_from_raw
test/backends/test_litellm_ollama.py::test_async_parallel_requests
test/backends/test_litellm_ollama.py::test_async_avalue
test/telemetry/test_metrics_backend.py::test_litellm_token_metrics_integration[non-streaming]
test/telemetry/test_metrics_backend.py::test_litellm_token_metrics_integration[streaming]
/proj/dmfexp/eiger/users/ajbozarth/mellea/.venv/lib/python3.12/site-packages/aiohttp/connector.py:993: DeprecationWarning: enable_cleanup_closed ignored because https://github.com/python/cpython/pull/118960 is fixed in Python version sys.version_info(major=3, minor=12, micro=12, releaselevel='final', serial=0)
super().__init__(
test/backends/test_vision_ollama.py::test_image_block_construction
/proj/dmfexp/eiger/users/ajbozarth/mellea/test/backends/test_vision_ollama.py:38: DeprecationWarning: 'mode' parameter is deprecated and will be removed in Pillow 13 (2026-10-15)
random_image = Image.fromarray(random_pixel_data, "RGB")
test/helpers/test_event_loop_helper.py::test_event_loop_handler_with_forking
/u/ajbozarth/.local/share/uv/python/cpython-3.12.12-linux-x86_64-gnu/lib/python3.12/multiprocessing/popen_fork.py:66: DeprecationWarning: This process (pid=4140229) is multi-threaded, use of fork() may lead to deadlocks in the child.
self.pid = os.fork()
test/stdlib/components/docs/test_richdocument.py::test_richdocument_basics
<frozen abc>:106: DeprecationWarning: Use BaseMetaSerializer() instead.
test/stdlib/components/docs/test_richdocument.py::test_richdocument_basics
test/stdlib/components/docs/test_richdocument.py::test_richdocument_markdown
test/stdlib/components/docs/test_richdocument.py::test_richdocument_save
test/stdlib/components/docs/test_richdocument.py::test_richdocument_save
/proj/dmfexp/eiger/users/ajbozarth/mellea/.venv/lib/python3.12/site-packages/docling_core/transforms/serializer/markdown.py:490: DeprecationWarning: Field `annotations` is deprecated; use `meta` instead.
for ann in item.annotations
test/telemetry/test_logging.py::test_otlp_logging_enabled_without_endpoint_warns
/proj/dmfexp/eiger/users/ajbozarth/mellea/mellea/telemetry/logging.py:97: UserWarning: OTLP logs exporter is enabled (MELLEA_LOGS_OTLP=true) but no endpoint is configured. Set OTEL_EXPORTER_OTLP_LOGS_ENDPOINT or OTEL_EXPORTER_OTLP_ENDPOINT to export logs.
_logger_provider = _setup_logger_provider()
-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
================================ tests coverage ================================
_______________ coverage: platform linux, python 3.12.12-final-0 _______________
Coverage HTML written to dir htmlcov
Coverage JSON written to file coverage.json
=========================== short test summary info ============================
FAILED test/stdlib/components/intrinsic/test_core.py::test_find_context_attributions
FAILED test/backends/test_openai_ollama.py::test_instruct - openai.APIConnect...
FAILED test/backends/test_openai_ollama.py::test_multiturn - openai.APIConnec...
FAILED test/backends/test_openai_ollama.py::test_chat - openai.APIConnectionE...
FAILED test/backends/test_openai_ollama.py::test_chat_stream - openai.APIConn...
FAILED test/backends/test_openai_ollama.py::test_format - openai.APIConnectio...
FAILED test/backends/test_openai_ollama.py::test_generate_from_raw - openai.A...
FAILED test/backends/test_openai_ollama.py::test_async_parallel_requests - op...
FAILED test/backends/test_openai_ollama.py::test_async_avalue - openai.APICon...
FAILED test/backends/test_vision_openai.py::test_image_block_in_instruction
FAILED test/backends/test_vision_openai.py::test_image_block_in_chat - openai...
FAILED test/telemetry/test_metrics_backend.py::test_openai_token_metrics_integration[non-streaming]
FAILED test/telemetry/test_metrics_backend.py::test_openai_token_metrics_integration[streaming]
FAILED test/backends/test_litellm_ollama.py::test_litellm_ollama_chat - litel...
FAILED test/backends/test_litellm_ollama.py::test_litellm_ollama_instruct - l...
FAILED test/backends/test_litellm_ollama.py::test_litellm_ollama_instruct_options
FAILED test/backends/test_litellm_ollama.py::test_gen_slot - litellm.exceptio...
FAILED test/backends/test_litellm_ollama.py::test_generate_from_raw - litellm...
FAILED test/backends/test_litellm_ollama.py::test_async_parallel_requests - l...
FAILED test/backends/test_litellm_ollama.py::test_async_avalue - litellm.exce...
FAILED test/core/test_astream_incremental.py::test_astream_returns_incremental_chunks
FAILED test/core/test_astream_incremental.py::test_astream_multiple_calls_accumulate_correctly
FAILED test/core/test_astream_incremental.py::test_astream_beginning_length_tracking
FAILED test/core/test_astream_incremental.py::test_astream_empty_beginning - ...
FAILED test/core/test_astream_incremental.py::test_non_streaming_astream - Ex...
FAILED test/stdlib/requirements/test_requirement.py::test_llmaj_validation_req_output_field
FAILED test/stdlib/requirements/test_requirement.py::test_llmaj_requirement_uses_requirement_template
FAILED test/stdlib/sampling/test_sofai_graph_coloring.py::TestSOFAIGraphColoringIntegration::test_graph_coloring_fresh_start
FAILED test/stdlib/sampling/test_sofai_graph_coloring.py::TestSOFAIGraphColoringIntegration::test_graph_coloring_continue_chat
FAILED test/stdlib/sampling/test_sofai_graph_coloring.py::TestSOFAIGraphColoringIntegration::test_graph_coloring_best_attempt
FAILED test/stdlib/sampling/test_sofai_sampling.py::TestSOFAIIntegration::test_sofai_with_ollama
FAILED test/telemetry/test_metrics_backend.py::test_ollama_token_metrics_integration[non-streaming]
FAILED test/telemetry/test_metrics_backend.py::test_ollama_token_metrics_integration[streaming]
FAILED test/telemetry/test_metrics_backend.py::test_litellm_token_metrics_integration[non-streaming]
FAILED test/telemetry/test_metrics_backend.py::test_litellm_token_metrics_integration[streaming]
FAILED test/telemetry/test_tracing.py::test_session_with_tracing_disabled - E...
FAILED test/telemetry/test_tracing.py::test_session_with_application_tracing
FAILED test/telemetry/test_tracing.py::test_session_with_backend_tracing - Ex...
FAILED test/telemetry/test_tracing.py::test_generative_function_with_tracing
FAILED test/formatters/granite/test_intrinsics_formatters.py::test_run_ollama[answerability_simple]
FAILED test/formatters/granite/test_intrinsics_formatters.py::test_run_ollama[answerability_answerable]
FAILED test/formatters/granite/test_intrinsics_formatters.py::test_run_ollama[answerability_unanswerable]
FAILED test/formatters/granite/test_intrinsics_formatters.py::test_run_ollama[hallucination_detection]
FAILED test/formatters/granite/test_intrinsics_formatters.py::test_run_ollama[query_clarification]
FAILED test/formatters/granite/test_intrinsics_formatters.py::test_run_ollama[query_rewrite]
FAILED test/formatters/granite/test_intrinsics_formatters.py::test_run_ollama[context_relevance]
FAILED test/formatters/granite/test_intrinsics_formatters.py::test_run_ollama[citations]
ERROR test/backends/test_mellea_tool.py::test_from_callable_generation - Exce...
ERROR test/backends/test_mellea_tool.py::test_from_langchain_generation - Exc...
ERROR test/backends/test_ollama.py::test_simple_instruct - Exception: could n...
ERROR test/backends/test_ollama.py::test_instruct_with_requirement - Exceptio...
ERROR test/backends/test_ollama.py::test_chat - Exception: could not create O...
ERROR test/backends/test_ollama.py::test_format - Exception: could not create...
ERROR test/backends/test_ollama.py::test_generate_from_raw - Exception: could...
ERROR test/backends/test_ollama.py::test_async_parallel_requests - Exception:...
ERROR test/backends/test_ollama.py::test_async_avalue - Exception: could not ...
ERROR test/backends/test_ollama.py::test_multiple_asyncio_runs - Exception: c...
ERROR test/backends/test_ollama.py::test_client_cache - Exception: could not ...
ERROR test/backends/test_tool_calls.py::test_tool_called_from_context_action
ERROR test/backends/test_tool_calls.py::test_tool_called - Exception: could n...
ERROR test/backends/test_tool_calls.py::test_tool_not_called - Exception: cou...
ERROR test/backends/test_vision_ollama.py::test_image_block_in_instruction - ...
ERROR test/backends/test_vision_ollama.py::test_image_block_in_chat - Excepti...
ERROR test/core/test_component_typing.py::test_generating - Exception: could ...
ERROR test/core/test_component_typing.py::test_message_typing - Exception: co...
ERROR test/core/test_component_typing.py::test_generating_with_sampling - Exc...
ERROR test/core/test_model_output_thunk.py::test_model_output_thunk_copy - Ex...
ERROR test/core/test_model_output_thunk.py::test_model_output_thunk_deepcopy
ERROR test/stdlib/components/test_genslot.py::test_gen_slot_output - Exceptio...
ERROR test/stdlib/components/test_genslot.py::test_func - Exception: could no...
ERROR test/stdlib/components/test_genslot.py::test_sentiment_output - Excepti...
ERROR test/stdlib/components/test_genslot.py::test_gen_slot_logs - Exception:...
ERROR test/stdlib/components/test_genslot.py::test_gen_slot_with_context_and_backend
ERROR test/stdlib/components/test_genslot.py::test_async_gen_slot - Exception...
ERROR test/stdlib/components/test_genslot.py::test_arg_extraction[session] - ...
ERROR test/stdlib/components/test_genslot.py::test_arg_extraction[context and backend]
ERROR test/stdlib/components/test_genslot.py::test_arg_extraction[backend without context]
ERROR test/stdlib/components/test_genslot.py::test_arg_extraction[duplicate arg and kwarg]
ERROR test/stdlib/components/test_genslot.py::test_arg_extraction[original func args as positional args]
ERROR test/stdlib/components/test_genslot.py::test_arg_extraction[session and func as kwargs]
ERROR test/stdlib/components/test_genslot.py::test_arg_extraction[all kwargs]
ERROR test/stdlib/components/test_genslot.py::test_arg_extraction[interspersed kwargs]
ERROR test/stdlib/components/test_genslot.py::test_arg_extraction[missing required args]
ERROR test/stdlib/components/test_genslot.py::test_precondition_failure - Exc...
ERROR test/stdlib/components/test_genslot.py::test_requirement - Exception: c...
ERROR test/stdlib/components/test_genslot.py::test_with_no_args - Exception: ...
ERROR test/stdlib/sampling/test_majority_voting.py::test_majority_voting_for_math
ERROR test/stdlib/sampling/test_majority_voting.py::test_MBRDRougeL - Excepti...
ERROR test/stdlib/sampling/test_sampling_ctx.py::TestSamplingCtxCase::test_ctx_for_rejection_sampling
ERROR test/stdlib/sampling/test_sampling_ctx.py::TestSamplingCtxCase::test_ctx_for_multiturn
ERROR test/stdlib/sampling/test_think_budget_forcing.py::test_think_big - Exc...
ERROR test/stdlib/sampling/test_think_budget_forcing.py::test_think_little - ...
ERROR test/stdlib/test_chat_view.py::test_chat_view_linear_ctx - Exception: c...
ERROR test/stdlib/test_chat_view.py::test_chat_view_simple_ctx - Exception: c...
ERROR test/stdlib/test_functional.py::test_func_context - Exception: could no...
ERROR test/stdlib/test_functional.py::test_aact - Exception: could not create...
ERROR test/stdlib/test_functional.py::test_ainstruct - Exception: could not c...
ERROR test/stdlib/test_functional.py::test_avalidate - Exception: could not c...
ERROR test/stdlib/test_session.py::test_start_session_openai_with_kwargs - Ex...
ERROR test/stdlib/test_session.py::test_aact - Exception: could not create Ol...
ERROR test/stdlib/test_session.py::test_ainstruct - Exception: could not crea...
ERROR test/stdlib/test_session.py::test_async_await_with_chat_context - Excep...
ERROR test/stdlib/test_session.py::test_async_without_waiting_with_chat_context
ERROR test/stdlib/test_session.py::test_session_copy_with_context_ops - Excep...
ERROR test/stdlib/test_session.py::test_powerup - Exception: could not create...
= 47 failed, 735 passed, 30 skipped, 19 deselected, 3 xfailed, 117 warnings, 58 errors in 4816.40s (1:20:16) =
[21:56:14] Shutting down ollama server...
[21:56:14] Ollama stopped.

| Run | Passed | Failed | Skipped | Deselected | Notes |
|---|---|---|---|---|---|
| Local, Mac M1 Max 32GB, Python 3.12.8 | 800 | 2 | 61 | 19 | 2 qualitative flakes |
| Local slow, Mac M1 Max 32GB, Python 3.12.8 | 18 | 0 | 3 | 864 | All expected |
| Cluster, LSF GPU node, Python 3.12.12 | 735 | 47 | 30 | 19 | Ollama connectivity issue — re-run planned |
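The counts in this table come straight from pytest's final summary line for each run. A small sketch of how those lines could be parsed instead of copied by hand follows; `parse_pytest_summary` is a hypothetical helper, and the regular expression only covers the outcome labels that appear in the logs above.

```python
import re

# Matches "<count> <outcome>" pairs such as "47 failed" or "58 errors".
_OUTCOME = re.compile(
    r"(\d+) (passed|failed|skipped|deselected|xfailed|xpassed|warnings?|errors?)\b"
)


def parse_pytest_summary(line: str) -> dict[str, int]:
    """Extract outcome counts from a pytest final summary line."""
    counts: dict[str, int] = {}
    for number, label in _OUTCOME.findall(line):
        # Normalise singular forms so callers can rely on one key.
        if label.startswith("warning"):
            label = "warnings"
        elif label.startswith("error"):
            label = "errors"
        counts[label] = int(number)
    return counts


cluster = parse_pytest_summary(
    "= 47 failed, 735 passed, 30 skipped, 19 deselected, 3 xfailed, "
    "117 warnings, 58 errors in 4816.40s (1:20:16) ="
)
print(cluster["passed"], cluster["failed"], cluster["errors"])  # 735 47 58
```

Parsing the line also surfaces the 58 errors from the cluster run, which the table has no column for.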
|
Per convo with @planetf1 I've fixed my own review nits so as to unblock this (as he's now on vacation). I'll be holding off on merge until I've gotten my cluster run to pass (in case it failed due to something I need to fix) and for more reviews. |
|
Opened #759 on my Ollama connectivity issue that blew up my bluevela run |
|
Re-ran the cluster tests after resolving the Ollama connectivity issue from my first run (stale server from a previous session). Results below: Cluster run ( Remaining failures — none related to this PR:
Terminal output
$ test/scripts/run_tests_with_ollama.sh
[22:20:32] WARNING: CACHE_DIR not set. Ollama models will download to ~/.ollama (default)
[22:20:32] Using standalone log directory: logs/2026-03-27-22:20:32
[22:20:32] Starting ollama server on 127.0.0.1:11434...
[22:20:32] Added system CUDA to LD_LIBRARY_PATH
[22:20:32] Ollama server PID: 1135706
[22:20:32] Waiting for ollama to be ready...
[22:20:34] Ollama ready after 2s
[22:20:34] Model granite4:micro already pulled
[22:20:34] Model granite4:micro-h already pulled
[22:20:34] Model granite3.2-vision already pulled
[22:20:34] All models ready.
[22:20:34] Warming up models...
[22:20:34] Warming granite4:micro ...
[22:21:38] Warming granite4:micro-h ...
[22:21:48] Warming granite3.2-vision ...
[22:21:55] Warmup complete.
[22:21:55] Starting pytest...
[22:21:55] Log directory: logs/2026-03-27-22:20:32
[22:21:55] Pytest args: --group-by-backend
============================= test session starts ==============================
platform linux -- Python 3.12.12, pytest-9.0.0, pluggy-1.6.0
rootdir: /proj/dmfexp/eiger/users/ajbozarth/mellea
configfile: pyproject.toml
plugins: nbmake-1.5.5, anyio-4.11.0, json-report-1.5.0, recording-0.13.4, asyncio-1.3.0, timeout-2.4.0, metadata-3.1.1, Faker-37.12.0, xdist-3.8.0, langsmith-0.6.6, cov-7.0.0
asyncio: mode=Mode.AUTO, debug=False, asyncio_default_fixture_loop_scope=None, asyncio_default_test_loop_scope=function
timeout: 900.0s
timeout method: signal
timeout func_only: False
collected 892 items / 19 deselected / 873 selected
test/backends/test_huggingface.py ................... [ 2%]
test/backends/test_huggingface_tools.py . [ 2%]
test/cli/test_alora_train_integration.py .. [ 2%]
test/formatters/granite/test_intrinsics_formatters.py ....x.......... [ 4%]
test/stdlib/components/docs/test_richdocument.py s [ 4%]
test/stdlib/components/intrinsic/test_core.py ..F [ 4%]
test/stdlib/components/intrinsic/test_guardian.py ...... [ 5%]
test/stdlib/components/intrinsic/test_rag.py ....... [ 6%]
test/stdlib/test_spans.py .x [ 6%]
test/telemetry/test_metrics_backend.py .. [ 6%]
test/backends/test_openai_ollama.py ............. [ 8%]
test/backends/test_openai_vllm.py sssssss [ 8%]
test/backends/test_vision_openai.py ..F. [ 9%]
test/telemetry/test_metrics_backend.py .. [ 9%]
test/backends/test_vllm.py ........ [ 10%]
test/backends/test_vllm_tools.py . [ 10%]
test/backends/test_litellm_ollama.py ........ [ 11%]
test/backends/test_mellea_tool.py .. [ 11%]
test/backends/test_ollama.py .....X.... [ 12%]
test/backends/test_tool_calls.py ... [ 13%]
test/backends/test_vision_ollama.py .... [ 13%]
test/core/test_astream_incremental.py ...... [ 14%]
test/core/test_component_typing.py ... [ 14%]
test/core/test_model_output_thunk.py .. [ 15%]
test/stdlib/components/test_genslot.py ................... [ 17%]
test/stdlib/requirements/test_requirement.py ..... [ 17%]
test/stdlib/sampling/test_majority_voting.py .. [ 17%]
test/stdlib/sampling/test_sampling_ctx.py .. [ 18%]
test/stdlib/sampling/test_sofai_graph_coloring.py ... [ 18%]
test/stdlib/sampling/test_sofai_sampling.py . [ 18%]
test/stdlib/sampling/test_think_budget_forcing.py EE [ 18%]
test/stdlib/test_chat_view.py .. [ 19%]
test/stdlib/test_functional.py .... [ 19%]
test/stdlib/test_session.py s....... [ 20%]
test/telemetry/test_metrics_backend.py .... [ 20%]
test/telemetry/test_tracing.py .... [ 21%]
test/telemetry/test_tracing_backend.py ssssss [ 22%]
test/backends/test_bedrock.py s [ 22%]
test/backends/test_litellm_watsonx.py ssss [ 22%]
test/backends/test_watsonx.py sssssssssss [ 23%]
test/telemetry/test_metrics_backend.py s [ 24%]
test/backends/test_adapters/test_adapter.py . [ 24%]
test/backends/test_mellea_tool.py ..... [ 24%]
test/backends/test_model_options.py ..... [ 25%]
test/backends/test_tool_decorator.py ................... [ 27%]
test/backends/test_tool_helpers.py ... [ 27%]
test/backends/test_tool_validation_integration.py ...................... [ 30%]
........... [ 31%]
test/cli/test_alora_train.py .... [ 32%]
test/core/test_astream_exception_propagation.py ..... [ 32%]
test/core/test_astream_mock.py ...... [ 33%]
test/core/test_base.py .... [ 33%]
test/core/test_component_typing.py ..... [ 34%]
test/decompose/test_decompose.py .......... [ 35%]
test/formatters/granite/test_intrinsics_formatters.py .................. [ 37%]
..................................FFFFFFFF [ 42%]
test/formatters/test_template_formatter.py ................ [ 44%]
test/helpers/test_event_loop_helper.py .... [ 44%]
test/helpers/test_server_type.py ................ [ 46%]
test/plugins/test_all_payloads.py ...................................... [ 50%]
............................................................. [ 57%]
test/plugins/test_blocking.py ................ [ 59%]
test/plugins/test_build_global_context.py ....... [ 60%]
test/plugins/test_decorators.py ......... [ 61%]
test/plugins/test_execution_modes.py ........................... [ 64%]
test/plugins/test_hook_call_sites.py .............................. [ 68%]
test/plugins/test_manager.py ss...... [ 68%]
test/plugins/test_mellea_plugin.py ....... [ 69%]
test/plugins/test_payloads.py .......... [ 70%]
test/plugins/test_pluginset.py ......... [ 71%]
test/plugins/test_policies.py ...... [ 72%]
test/plugins/test_policy_enforcement.py .......... [ 73%]
test/plugins/test_priority_ordering.py .............. [ 75%]
test/plugins/test_scoping.py ................................... [ 79%]
test/plugins/test_tool_hooks_redaction.py ....... [ 80%]
test/plugins/test_unregister.py ......... [ 81%]
test/stdlib/components/docs/test_document.py ... [ 81%]
test/stdlib/components/docs/test_richdocument.py ..... [ 82%]
test/stdlib/components/test_chat.py . [ 82%]
test/stdlib/components/test_hello_world.py .. [ 82%]
test/stdlib/components/test_mify.py ........... [ 83%]
test/stdlib/components/test_transform.py .. [ 83%]
test/stdlib/requirements/test_reqlib_markdown.py ...... [ 84%]
test/stdlib/requirements/test_reqlib_python.py .............sss..... [ 87%]
test/stdlib/requirements/test_reqlib_tools.py . [ 87%]
test/stdlib/sampling/test_sofai_graph_coloring.py ...................... [ 89%]
[ 89%]
test/stdlib/sampling/test_sofai_sampling.py .................... [ 91%]
test/stdlib/test_base_context.py ..... [ 92%]
test/telemetry/test_logging.py ........ [ 93%]
test/telemetry/test_metrics.py ....................................... [ 97%]
test/telemetry/test_metrics_plugins.py .... [ 98%]
test/telemetry/test_metrics_token.py .... [ 98%]
test/telemetry/test_tracing.py .......... [100%]
==================================== ERRORS ====================================
_______________________ ERROR at setup of test_think_big _______________________
gh_run = 0
@pytest.fixture(scope="module")
def m_session(gh_run):
"""Start default Mellea's session."""
if gh_run == 1: # on github
m = start_session(
"ollama", model_id=MODEL_ID, model_options={ModelOption.MAX_NEW_TOKENS: 5}
)
else:
> m = start_session("ollama", model_id=MODEL_ID)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
test/stdlib/sampling/test_think_budget_forcing.py:25:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
mellea/stdlib/session.py:241: in start_session
backend = backend_class(model_id, model_options=model_options, **backend_kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
self = <mellea.backends.ollama.OllamaModelBackend object at 0x148cc4795df0>
model_id = ModelIdentifier(hf_model_name='openai/gpt-oss-20b', ollama_name='gpt-oss:20b', watsonx_name=None, mlx_name=None, openai_name=None, bedrock_name='openai.gpt-oss-20b', hf_tokenizer_name=None)
formatter = None, base_url = None, model_options = None
def __init__(
self,
model_id: str | ModelIdentifier = model_ids.IBM_GRANITE_4_MICRO_3B,
formatter: ChatFormatter | None = None,
base_url: str | None = None,
model_options: dict | None = None,
):
"""Initialize an Ollama backend, connecting to the server and pulling the model if needed."""
super().__init__(
model_id=model_id,
formatter=(
formatter
if formatter is not None
else TemplateFormatter(model_id=model_id)
),
model_options=model_options,
)
# Run the ollama model id accessor early, so that an Assertion fails immediately if we cannot find an ollama model id for the provided ModelIdentifier.
self._get_ollama_model_id()
# Setup the client and ensure that we have the model available.
self._base_url = base_url
self._client = ollama.Client(base_url)
self._client_cache = ClientCache(2)
# Call once to set up an async client and prepopulate the cache.
_ = self._async_client
if not self._check_ollama_server():
err = f"could not create OllamaModelBackend: ollama server not running at {base_url}"
FancyLogger.get_logger().error(err)
raise Exception(err)
if not self._pull_ollama_model():
err = f"could not create OllamaModelBackend: {self._get_ollama_model_id()} could not be pulled from ollama library"
FancyLogger.get_logger().error(err)
> raise Exception(err)
E Exception: could not create OllamaModelBackend: gpt-oss:20b could not be pulled from ollama library
mellea/backends/ollama.py:97: Exception
---------------------------- Captured stdout setup -----------------------------
=== 22:40:43-ERROR ======
could not create OllamaModelBackend: gpt-oss:20b could not be pulled from ollama library
---------------------------- Captured stderr setup -----------------------------
------------------------------ Captured log setup ------------------------------
ERROR fancy_logger:ollama.py:96 could not create OllamaModelBackend: gpt-oss:20b could not be pulled from ollama library
_____________________ ERROR at setup of test_think_little ______________________
gh_run = 0
@pytest.fixture(scope="module")
def m_session(gh_run):
"""Start default Mellea's session."""
if gh_run == 1: # on github
m = start_session(
"ollama", model_id=MODEL_ID, model_options={ModelOption.MAX_NEW_TOKENS: 5}
)
else:
> m = start_session("ollama", model_id=MODEL_ID)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
test/stdlib/sampling/test_think_budget_forcing.py:25:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
mellea/stdlib/session.py:241: in start_session
backend = backend_class(model_id, model_options=model_options, **backend_kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
self = <mellea.backends.ollama.OllamaModelBackend object at 0x148cc4795df0>
model_id = ModelIdentifier(hf_model_name='openai/gpt-oss-20b', ollama_name='gpt-oss:20b', watsonx_name=None, mlx_name=None, openai_name=None, bedrock_name='openai.gpt-oss-20b', hf_tokenizer_name=None)
formatter = None, base_url = None, model_options = None
def __init__(
self,
model_id: str | ModelIdentifier = model_ids.IBM_GRANITE_4_MICRO_3B,
formatter: ChatFormatter | None = None,
base_url: str | None = None,
model_options: dict | None = None,
):
"""Initialize an Ollama backend, connecting to the server and pulling the model if needed."""
super().__init__(
model_id=model_id,
formatter=(
formatter
if formatter is not None
else TemplateFormatter(model_id=model_id)
),
model_options=model_options,
)
# Run the ollama model id accessor early, so that an Assertion fails immediately if we cannot find an ollama model id for the provided ModelIdentifier.
self._get_ollama_model_id()
# Setup the client and ensure that we have the model available.
self._base_url = base_url
self._client = ollama.Client(base_url)
self._client_cache = ClientCache(2)
# Call once to set up an async client and prepopulate the cache.
_ = self._async_client
if not self._check_ollama_server():
err = f"could not create OllamaModelBackend: ollama server not running at {base_url}"
FancyLogger.get_logger().error(err)
raise Exception(err)
if not self._pull_ollama_model():
err = f"could not create OllamaModelBackend: {self._get_ollama_model_id()} could not be pulled from ollama library"
FancyLogger.get_logger().error(err)
> raise Exception(err)
E Exception: could not create OllamaModelBackend: gpt-oss:20b could not be pulled from ollama library
mellea/backends/ollama.py:97: Exception
=================================== FAILURES ===================================
________________________ test_find_context_attributions ________________________
backend = <mellea.backends.huggingface.LocalHFBackend object at 0x14894643d550>
@pytest.mark.qualitative
def test_find_context_attributions(backend):
"""Verify that the context-attribution intrinsic functions properly."""
context, assistant_response, documents = _read_rag_input_json(
"context-attribution.json"
)
expected = _read_rag_output_json("context-attribution.json")
result = core.find_context_attributions(
assistant_response, documents, context, backend
)
> assert result == expected
E AssertionError: assert [{'attributio...ne, ...}, ...] == [{'attributio...ne, ...}, ...]
E
E Left contains 5 more items, first extra item: {'attribution_begin': 0, 'attribution_doc_id': None, 'attribution_end': 66, 'attribution_msg_index': 2, ...}
E Use -v to get more diff
test/stdlib/components/intrinsic/test_core.py:105: AssertionError
----------------------------- Captured stdout call -----------------------------
=== 22:30:56-INFO ======
passing in model options when generating with an adapter; some model options may be overwritten / ignored
----------------------------- Captured stderr call -----------------------------
Fetching 1 files: 100%|██████████| 1/1 [00:00<00:00, 9320.68it/s]
Fetching 3 files: 100%|██████████| 3/3 [00:00<00:00, 10960.72it/s]
------------------------------ Captured log call -------------------------------
INFO fancy_logger:huggingface.py:475 passing in model options when generating with an adapter; some model options may be overwritten / ignored
--------------------------- Captured stdout teardown ---------------------------
=== 22:30:59-INFO ======
Cleaning up test_core backend GPU memory...
=== 22:30:59-INFO ======
GPU before cleanup: 58.1GB free / 79.2GB total
=== 22:30:59-INFO ======
Cleared LRU cache
=== 22:30:59-INFO ======
Removed accelerate dispatch hooks
=== 22:31:00-INFO ======
GPU after cleanup: 78.1GB free / 79.2GB total (reclaimed 20.0GB)
---------------------------- Captured log teardown -----------------------------
INFO fancy_logger:conftest.py:342 Cleaning up test_core backend GPU memory...
INFO fancy_logger:conftest.py:349 GPU before cleanup: 58.1GB free / 79.2GB total
INFO fancy_logger:conftest.py:365 Cleared LRU cache
INFO fancy_logger:conftest.py:402 Removed accelerate dispatch hooks
INFO fancy_logger:conftest.py:437 GPU after cleanup: 78.1GB free / 79.2GB total (reclaimed 20.0GB)
_______________________ test_image_block_in_instruction ________________________
m_session = <mellea.stdlib.session.MelleaSession object at 0x148f2d2cdbb0>
pil_image = <PIL.Image.Image image mode=RGB size=200x150 at 0x148F2D255640>
gh_run = 0
def test_image_block_in_instruction(
m_session: MelleaSession, pil_image: Image.Image, gh_run: int
):
image_block = ImageBlock.from_pil_image(pil_image)
# Set strategy=None here since we are directly comparing the object and sampling strategies tend to do a deepcopy.
instr = m_session.instruct(
"Is this image mainly blue? Answer yes or no.",
images=[image_block],
strategy=None,
)
assert isinstance(instr, ModelOutputThunk)
# if not on GH
if not gh_run == 1:
> assert "yes" in instr.value.lower() or "no" in instr.value.lower() # type: ignore
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
E AssertionError: assert ('yes' in '\nthe image is predominantly blue with varying shades creating a mosaic effect.' or 'no' in '\nthe image is predominantly blue with varying shades creating a mosaic effect.')
E + where '\nthe image is predominantly blue with varying shades creating a mosaic effect.' = <built-in method lower of str object at 0x148f2d2c65e0>()
E + where <built-in method lower of str object at 0x148f2d2c65e0> = '\nThe image is predominantly blue with varying shades creating a mosaic effect.'.lower
E + where '\nThe image is predominantly blue with varying shades creating a mosaic effect.' = ModelOutputThunk(\nThe image is predominantly blue with varying shades creating a mosaic effect.).value
E + and '\nthe image is predominantly blue with varying shades creating a mosaic effect.' = <built-in method lower of str object at 0x148f2d2c65e0>()
E + where <built-in method lower of str object at 0x148f2d2c65e0> = '\nThe image is predominantly blue with varying shades creating a mosaic effect.'.lower
E + where '\nThe image is predominantly blue with varying shades creating a mosaic effect.' = ModelOutputThunk(\nThe image is predominantly blue with varying shades creating a mosaic effect.).value
test/backends/test_vision_openai.py:86: AssertionError
---------------------------- Captured stdout setup -----------------------------
=== 22:33:27-INFO ======
Starting Mellea session: backend=openai, model=granite3.2-vision, context=SimpleContext, model_options={'@@@max_new_tokens@@@': 5}
------------------------------ Captured log setup ------------------------------
INFO fancy_logger:session.py:246 Starting Mellea session: backend=openai, model=granite3.2-vision, context=SimpleContext, model_options={'@@@max_new_tokens@@@': 5}
____________________ test_run_ollama[answerability_simple] _____________________
yaml_json_combo_for_ollama = YamlJsonCombo(short_name='answerability_simple', yaml_file=PosixPath('/u/ajbozarth/.cache/huggingface/hub/models--ibm-...rability', is_alora=False, repo_id='ibm-granite/granite-lib-rag-r1.0', revision='main', base_model_id='granite4:micro')
def test_run_ollama(yaml_json_combo_for_ollama):
"""
Run the target model end-to-end with a mock Ollama backend.
"""
cfg = yaml_json_combo_for_ollama
# Change base model id to Ollama's version
if cfg.base_model_id == "ibm-granite/granite-4.0-micro":
cfg.base_model_id = "granite4:micro"
else:
pytest.xfail(f"Unsupported base model: {cfg.base_model_id}")
if cfg.arguments_file:
with open(cfg.arguments_file, encoding="utf8") as f:
transform_kwargs = json.load(f)
else:
transform_kwargs = {}
# Load input request
with open(cfg.inputs_file, encoding="utf-8") as f:
model_input = ChatCompletion.model_validate_json(f.read())
model_input.model = cfg.task
# Download files from Hugging Face Hub
try:
> lora_dir = intrinsics_util.obtain_lora(
cfg.task,
cfg.base_model_id,
cfg.repo_id,
revision=cfg.revision,
alora=cfg.is_alora,
)
test/formatters/granite/test_intrinsics_formatters.py:714:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
mellea/formatters/granite/intrinsics/util.py:154: in obtain_lora
local_root_path = huggingface_hub.snapshot_download(
.venv/lib/python3.12/site-packages/huggingface_hub/utils/_validators.py:114: in _inner_fn
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
.venv/lib/python3.12/site-packages/huggingface_hub/_snapshot_download.py:332: in snapshot_download
thread_map(
.venv/lib/python3.12/site-packages/tqdm/contrib/concurrent.py:69: in thread_map
return _executor_map(ThreadPoolExecutor, fn, *iterables, **tqdm_kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.venv/lib/python3.12/site-packages/tqdm/contrib/concurrent.py:51: in _executor_map
return list(tqdm_class(ex.map(fn, *iterables, chunksize=chunksize), **kwargs))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.venv/lib/python3.12/site-packages/tqdm/std.py:1181: in __iter__
for obj in iterable:
^^^^^^^^
/u/ajbozarth/.local/share/uv/python/cpython-3.12.12-linux-x86_64-gnu/lib/python3.12/concurrent/futures/_base.py:619: in result_iterator
yield _result_or_cancel(fs.pop())
^^^^^^^^^^^^^^^^^^^^^^^^^^^
/u/ajbozarth/.local/share/uv/python/cpython-3.12.12-linux-x86_64-gnu/lib/python3.12/concurrent/futures/_base.py:317: in _result_or_cancel
return fut.result(timeout)
^^^^^^^^^^^^^^^^^^^
/u/ajbozarth/.local/share/uv/python/cpython-3.12.12-linux-x86_64-gnu/lib/python3.12/concurrent/futures/_base.py:456: in result
return self.__get_result()
^^^^^^^^^^^^^^^^^^^
/u/ajbozarth/.local/share/uv/python/cpython-3.12.12-linux-x86_64-gnu/lib/python3.12/concurrent/futures/_base.py:401: in __get_result
raise self._exception
/u/ajbozarth/.local/share/uv/python/cpython-3.12.12-linux-x86_64-gnu/lib/python3.12/concurrent/futures/thread.py:59: in run
result = self.fn(*self.args, **self.kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.venv/lib/python3.12/site-packages/huggingface_hub/_snapshot_download.py:306: in _inner_hf_hub_download
return hf_hub_download(
.venv/lib/python3.12/site-packages/huggingface_hub/utils/_validators.py:114: in _inner_fn
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
.venv/lib/python3.12/site-packages/huggingface_hub/file_download.py:1007: in hf_hub_download
return _hf_hub_download_to_cache_dir(
.venv/lib/python3.12/site-packages/huggingface_hub/file_download.py:1168: in _hf_hub_download_to_cache_dir
_download_to_tmp_and_move(
.venv/lib/python3.12/site-packages/huggingface_hub/file_download.py:1720: in _download_to_tmp_and_move
xet_get(
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
def xet_get(
*,
incomplete_path: Path,
xet_file_data: XetFileData,
headers: Dict[str, str],
expected_size: Optional[int] = None,
displayed_filename: Optional[str] = None,
_tqdm_bar: Optional[tqdm] = None,
) -> None:
"""
Download a file using Xet storage service.
Args:
incomplete_path (`Path`):
The path to the file to download.
xet_file_data (`XetFileData`):
The file metadata needed to make the request to the xet storage service.
headers (`Dict[str, str]`):
The headers to send to the xet storage service.
expected_size (`int`, *optional*):
The expected size of the file to download. If set, the download will raise an error if the size of the
received content is different from the expected one.
displayed_filename (`str`, *optional*):
The filename of the file that is being downloaded. Value is used only to display a nice progress bar. If
not set, the filename is guessed from the URL or the `Content-Disposition` header.
**How it works:**
The file download system uses Xet storage, which is a content-addressable storage system that breaks files into chunks
for efficient storage and transfer.
`hf_xet.download_files` manages downloading files by:
- Taking a list of files to download (each with its unique content hash)
- Connecting to a storage server (CAS server) that knows how files are chunked
- Using authentication to ensure secure access
- Providing progress updates during download
Authentication works by regularly refreshing access tokens through `refresh_xet_connection_info` to maintain a valid
connection to the storage server.
The download process works like this:
1. Create a local cache folder at `~/.cache/huggingface/xet/chunk-cache` to store reusable file chunks
2. Download files in parallel:
2.1. Prepare to write the file to disk
2.2. Ask the server "how is this file split into chunks?" using the file's unique hash
The server responds with:
- Which chunks make up the complete file
- Where each chunk can be downloaded from
2.3. For each needed chunk:
- Checks if we already have it in our local cache
- If not, download it from cloud storage (S3)
- Save it to cache for future use
- Assemble the chunks in order to recreate the original file
"""
try:
from hf_xet import PyXetDownloadInfo, download_files # type: ignore[no-redef]
except ImportError:
raise ValueError(
"To use optimized download using Xet storage, you need to install the hf_xet package. "
'Try `pip install "huggingface_hub[hf_xet]"` or `pip install hf_xet`.'
)
connection_info = refresh_xet_connection_info(file_data=xet_file_data, headers=headers)
def token_refresher() -> Tuple[str, int]:
connection_info = refresh_xet_connection_info(file_data=xet_file_data, headers=headers)
if connection_info is None:
raise ValueError("Failed to refresh token using xet metadata.")
return connection_info.access_token, connection_info.expiration_unix_epoch
xet_download_info = [
PyXetDownloadInfo(
destination_path=str(incomplete_path.absolute()), hash=xet_file_data.file_hash, file_size=expected_size
)
]
if not displayed_filename:
displayed_filename = incomplete_path.name
# Truncate filename if too long to display
if len(displayed_filename) > 40:
displayed_filename = f"{displayed_filename[:40]}(…)"
progress_cm = _get_progress_bar_context(
desc=displayed_filename,
log_level=logger.getEffectiveLevel(),
total=expected_size,
initial=0,
name="huggingface_hub.xet_get",
_tqdm_bar=_tqdm_bar,
)
with progress_cm as progress:
def progress_updater(progress_bytes: float):
progress.update(progress_bytes)
> download_files(
xet_download_info,
endpoint=connection_info.endpoint,
token_info=(connection_info.access_token, connection_info.expiration_unix_epoch),
token_refresher=token_refresher,
progress_updater=[progress_updater],
)
E RuntimeError: Data processing error: CAS service error : IO Error: Disk quota exceeded (os error 122)
.venv/lib/python3.12/site-packages/huggingface_hub/file_download.py:626: RuntimeError
----------------------------- Captured stderr call -----------------------------
Fetching 3 files: 0%| | 0/3 [00:00<?, ?it/s]
__________________ test_run_ollama[answerability_answerable] ___________________
yaml_json_combo_for_ollama = YamlJsonCombo(short_name='answerability_answerable', yaml_file=PosixPath('/u/ajbozarth/.cache/huggingface/hub/models--...rability', is_alora=False, repo_id='ibm-granite/granite-lib-rag-r1.0', revision='main', base_model_id='granite4:micro')
def test_run_ollama(yaml_json_combo_for_ollama):
"""
Run the target model end-to-end with a mock Ollama backend.
"""
cfg = yaml_json_combo_for_ollama
# Change base model id to Ollama's version
if cfg.base_model_id == "ibm-granite/granite-4.0-micro":
cfg.base_model_id = "granite4:micro"
else:
pytest.xfail(f"Unsupported base model: {cfg.base_model_id}")
if cfg.arguments_file:
with open(cfg.arguments_file, encoding="utf8") as f:
transform_kwargs = json.load(f)
else:
transform_kwargs = {}
# Load input request
with open(cfg.inputs_file, encoding="utf-8") as f:
model_input = ChatCompletion.model_validate_json(f.read())
model_input.model = cfg.task
# Download files from Hugging Face Hub
try:
> lora_dir = intrinsics_util.obtain_lora(
cfg.task,
cfg.base_model_id,
cfg.repo_id,
revision=cfg.revision,
alora=cfg.is_alora,
)
test/formatters/granite/test_intrinsics_formatters.py:714:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
mellea/formatters/granite/intrinsics/util.py:154: in obtain_lora
local_root_path = huggingface_hub.snapshot_download(
.venv/lib/python3.12/site-packages/huggingface_hub/utils/_validators.py:114: in _inner_fn
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
.venv/lib/python3.12/site-packages/huggingface_hub/_snapshot_download.py:332: in snapshot_download
thread_map(
.venv/lib/python3.12/site-packages/tqdm/contrib/concurrent.py:69: in thread_map
return _executor_map(ThreadPoolExecutor, fn, *iterables, **tqdm_kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.venv/lib/python3.12/site-packages/tqdm/contrib/concurrent.py:51: in _executor_map
return list(tqdm_class(ex.map(fn, *iterables, chunksize=chunksize), **kwargs))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.venv/lib/python3.12/site-packages/tqdm/std.py:1181: in __iter__
for obj in iterable:
^^^^^^^^
/u/ajbozarth/.local/share/uv/python/cpython-3.12.12-linux-x86_64-gnu/lib/python3.12/concurrent/futures/_base.py:619: in result_iterator
yield _result_or_cancel(fs.pop())
^^^^^^^^^^^^^^^^^^^^^^^^^^^
/u/ajbozarth/.local/share/uv/python/cpython-3.12.12-linux-x86_64-gnu/lib/python3.12/concurrent/futures/_base.py:317: in _result_or_cancel
return fut.result(timeout)
^^^^^^^^^^^^^^^^^^^
/u/ajbozarth/.local/share/uv/python/cpython-3.12.12-linux-x86_64-gnu/lib/python3.12/concurrent/futures/_base.py:456: in result
return self.__get_result()
^^^^^^^^^^^^^^^^^^^
/u/ajbozarth/.local/share/uv/python/cpython-3.12.12-linux-x86_64-gnu/lib/python3.12/concurrent/futures/_base.py:401: in __get_result
raise self._exception
/u/ajbozarth/.local/share/uv/python/cpython-3.12.12-linux-x86_64-gnu/lib/python3.12/concurrent/futures/thread.py:59: in run
result = self.fn(*self.args, **self.kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.venv/lib/python3.12/site-packages/huggingface_hub/_snapshot_download.py:306: in _inner_hf_hub_download
return hf_hub_download(
.venv/lib/python3.12/site-packages/huggingface_hub/utils/_validators.py:114: in _inner_fn
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
.venv/lib/python3.12/site-packages/huggingface_hub/file_download.py:1007: in hf_hub_download
return _hf_hub_download_to_cache_dir(
.venv/lib/python3.12/site-packages/huggingface_hub/file_download.py:1168: in _hf_hub_download_to_cache_dir
_download_to_tmp_and_move(
.venv/lib/python3.12/site-packages/huggingface_hub/file_download.py:1720: in _download_to_tmp_and_move
xet_get(
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
def xet_get(
*,
incomplete_path: Path,
xet_file_data: XetFileData,
headers: Dict[str, str],
expected_size: Optional[int] = None,
displayed_filename: Optional[str] = None,
_tqdm_bar: Optional[tqdm] = None,
) -> None:
"""
Download a file using Xet storage service.
Args:
incomplete_path (`Path`):
The path to the file to download.
xet_file_data (`XetFileData`):
The file metadata needed to make the request to the xet storage service.
headers (`Dict[str, str]`):
The headers to send to the xet storage service.
expected_size (`int`, *optional*):
The expected size of the file to download. If set, the download will raise an error if the size of the
received content is different from the expected one.
displayed_filename (`str`, *optional*):
The filename of the file that is being downloaded. Value is used only to display a nice progress bar. If
not set, the filename is guessed from the URL or the `Content-Disposition` header.
**How it works:**
The file download system uses Xet storage, which is a content-addressable storage system that breaks files into chunks
for efficient storage and transfer.
`hf_xet.download_files` manages downloading files by:
- Taking a list of files to download (each with its unique content hash)
- Connecting to a storage server (CAS server) that knows how files are chunked
- Using authentication to ensure secure access
- Providing progress updates during download
Authentication works by regularly refreshing access tokens through `refresh_xet_connection_info` to maintain a valid
connection to the storage server.
The download process works like this:
1. Create a local cache folder at `~/.cache/huggingface/xet/chunk-cache` to store reusable file chunks
2. Download files in parallel:
2.1. Prepare to write the file to disk
2.2. Ask the server "how is this file split into chunks?" using the file's unique hash
The server responds with:
- Which chunks make up the complete file
- Where each chunk can be downloaded from
2.3. For each needed chunk:
- Checks if we already have it in our local cache
- If not, download it from cloud storage (S3)
- Save it to cache for future use
- Assemble the chunks in order to recreate the original file
"""
try:
from hf_xet import PyXetDownloadInfo, download_files # type: ignore[no-redef]
except ImportError:
raise ValueError(
"To use optimized download using Xet storage, you need to install the hf_xet package. "
'Try `pip install "huggingface_hub[hf_xet]"` or `pip install hf_xet`.'
)
connection_info = refresh_xet_connection_info(file_data=xet_file_data, headers=headers)
def token_refresher() -> Tuple[str, int]:
connection_info = refresh_xet_connection_info(file_data=xet_file_data, headers=headers)
if connection_info is None:
raise ValueError("Failed to refresh token using xet metadata.")
return connection_info.access_token, connection_info.expiration_unix_epoch
xet_download_info = [
PyXetDownloadInfo(
destination_path=str(incomplete_path.absolute()), hash=xet_file_data.file_hash, file_size=expected_size
)
]
if not displayed_filename:
displayed_filename = incomplete_path.name
# Truncate filename if too long to display
if len(displayed_filename) > 40:
displayed_filename = f"{displayed_filename[:40]}(…)"
progress_cm = _get_progress_bar_context(
desc=displayed_filename,
log_level=logger.getEffectiveLevel(),
total=expected_size,
initial=0,
name="huggingface_hub.xet_get",
_tqdm_bar=_tqdm_bar,
)
with progress_cm as progress:
def progress_updater(progress_bytes: float):
progress.update(progress_bytes)
> download_files(
xet_download_info,
endpoint=connection_info.endpoint,
token_info=(connection_info.access_token, connection_info.expiration_unix_epoch),
token_refresher=token_refresher,
progress_updater=[progress_updater],
)
E RuntimeError: Data processing error: CAS service error : IO Error: Disk quota exceeded (os error 122)
.venv/lib/python3.12/site-packages/huggingface_hub/file_download.py:626: RuntimeError
----------------------------- Captured stderr call -----------------------------
Fetching 3 files: 0%| | 0/3 [00:00<?, ?it/s]
_________________ test_run_ollama[answerability_unanswerable] __________________
yaml_json_combo_for_ollama = YamlJsonCombo(short_name='answerability_unanswerable', yaml_file=PosixPath('/u/ajbozarth/.cache/huggingface/hub/models...rability', is_alora=False, repo_id='ibm-granite/granite-lib-rag-r1.0', revision='main', base_model_id='granite4:micro')
def test_run_ollama(yaml_json_combo_for_ollama):
"""
Run the target model end-to-end with a mock Ollama backend.
"""
cfg = yaml_json_combo_for_ollama
# Change base model id to Ollama's version
if cfg.base_model_id == "ibm-granite/granite-4.0-micro":
cfg.base_model_id = "granite4:micro"
else:
pytest.xfail(f"Unsupported base model: {cfg.base_model_id}")
if cfg.arguments_file:
with open(cfg.arguments_file, encoding="utf8") as f:
transform_kwargs = json.load(f)
else:
transform_kwargs = {}
# Load input request
with open(cfg.inputs_file, encoding="utf-8") as f:
model_input = ChatCompletion.model_validate_json(f.read())
model_input.model = cfg.task
# Download files from Hugging Face Hub
try:
> lora_dir = intrinsics_util.obtain_lora(
cfg.task,
cfg.base_model_id,
cfg.repo_id,
revision=cfg.revision,
alora=cfg.is_alora,
)
test/formatters/granite/test_intrinsics_formatters.py:714:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
mellea/formatters/granite/intrinsics/util.py:154: in obtain_lora
local_root_path = huggingface_hub.snapshot_download(
.venv/lib/python3.12/site-packages/huggingface_hub/utils/_validators.py:114: in _inner_fn
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
.venv/lib/python3.12/site-packages/huggingface_hub/_snapshot_download.py:332: in snapshot_download
thread_map(
.venv/lib/python3.12/site-packages/tqdm/contrib/concurrent.py:69: in thread_map
return _executor_map(ThreadPoolExecutor, fn, *iterables, **tqdm_kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.venv/lib/python3.12/site-packages/tqdm/contrib/concurrent.py:51: in _executor_map
return list(tqdm_class(ex.map(fn, *iterables, chunksize=chunksize), **kwargs))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.venv/lib/python3.12/site-packages/tqdm/std.py:1181: in __iter__
for obj in iterable:
^^^^^^^^
/u/ajbozarth/.local/share/uv/python/cpython-3.12.12-linux-x86_64-gnu/lib/python3.12/concurrent/futures/_base.py:619: in result_iterator
yield _result_or_cancel(fs.pop())
^^^^^^^^^^^^^^^^^^^^^^^^^^^
/u/ajbozarth/.local/share/uv/python/cpython-3.12.12-linux-x86_64-gnu/lib/python3.12/concurrent/futures/_base.py:317: in _result_or_cancel
return fut.result(timeout)
^^^^^^^^^^^^^^^^^^^
/u/ajbozarth/.local/share/uv/python/cpython-3.12.12-linux-x86_64-gnu/lib/python3.12/concurrent/futures/_base.py:456: in result
return self.__get_result()
^^^^^^^^^^^^^^^^^^^
/u/ajbozarth/.local/share/uv/python/cpython-3.12.12-linux-x86_64-gnu/lib/python3.12/concurrent/futures/_base.py:401: in __get_result
raise self._exception
/u/ajbozarth/.local/share/uv/python/cpython-3.12.12-linux-x86_64-gnu/lib/python3.12/concurrent/futures/thread.py:59: in run
result = self.fn(*self.args, **self.kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.venv/lib/python3.12/site-packages/huggingface_hub/_snapshot_download.py:306: in _inner_hf_hub_download
return hf_hub_download(
.venv/lib/python3.12/site-packages/huggingface_hub/utils/_validators.py:114: in _inner_fn
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
.venv/lib/python3.12/site-packages/huggingface_hub/file_download.py:1007: in hf_hub_download
return _hf_hub_download_to_cache_dir(
.venv/lib/python3.12/site-packages/huggingface_hub/file_download.py:1168: in _hf_hub_download_to_cache_dir
_download_to_tmp_and_move(
.venv/lib/python3.12/site-packages/huggingface_hub/file_download.py:1720: in _download_to_tmp_and_move
xet_get(
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
def xet_get(
*,
incomplete_path: Path,
xet_file_data: XetFileData,
headers: Dict[str, str],
expected_size: Optional[int] = None,
displayed_filename: Optional[str] = None,
_tqdm_bar: Optional[tqdm] = None,
) -> None:
"""
Download a file using Xet storage service.
Args:
incomplete_path (`Path`):
The path to the file to download.
xet_file_data (`XetFileData`):
The file metadata needed to make the request to the xet storage service.
headers (`Dict[str, str]`):
The headers to send to the xet storage service.
expected_size (`int`, *optional*):
The expected size of the file to download. If set, the download will raise an error if the size of the
received content is different from the expected one.
displayed_filename (`str`, *optional*):
The filename of the file that is being downloaded. Value is used only to display a nice progress bar. If
not set, the filename is guessed from the URL or the `Content-Disposition` header.
**How it works:**
The file download system uses Xet storage, which is a content-addressable storage system that breaks files into chunks
for efficient storage and transfer.
`hf_xet.download_files` manages downloading files by:
- Taking a list of files to download (each with its unique content hash)
- Connecting to a storage server (CAS server) that knows how files are chunked
- Using authentication to ensure secure access
- Providing progress updates during download
Authentication works by regularly refreshing access tokens through `refresh_xet_connection_info` to maintain a valid
connection to the storage server.
The download process works like this:
1. Create a local cache folder at `~/.cache/huggingface/xet/chunk-cache` to store reusable file chunks
2. Download files in parallel:
2.1. Prepare to write the file to disk
2.2. Ask the server "how is this file split into chunks?" using the file's unique hash
The server responds with:
- Which chunks make up the complete file
- Where each chunk can be downloaded from
2.3. For each needed chunk:
- Checks if we already have it in our local cache
- If not, download it from cloud storage (S3)
- Save it to cache for future use
- Assemble the chunks in order to recreate the original file
"""
try:
from hf_xet import PyXetDownloadInfo, download_files # type: ignore[no-redef]
except ImportError:
raise ValueError(
"To use optimized download using Xet storage, you need to install the hf_xet package. "
'Try `pip install "huggingface_hub[hf_xet]"` or `pip install hf_xet`.'
)
connection_info = refresh_xet_connection_info(file_data=xet_file_data, headers=headers)
def token_refresher() -> Tuple[str, int]:
connection_info = refresh_xet_connection_info(file_data=xet_file_data, headers=headers)
if connection_info is None:
raise ValueError("Failed to refresh token using xet metadata.")
return connection_info.access_token, connection_info.expiration_unix_epoch
xet_download_info = [
PyXetDownloadInfo(
destination_path=str(incomplete_path.absolute()), hash=xet_file_data.file_hash, file_size=expected_size
)
]
if not displayed_filename:
displayed_filename = incomplete_path.name
# Truncate filename if too long to display
if len(displayed_filename) > 40:
displayed_filename = f"{displayed_filename[:40]}(…)"
progress_cm = _get_progress_bar_context(
desc=displayed_filename,
log_level=logger.getEffectiveLevel(),
total=expected_size,
initial=0,
name="huggingface_hub.xet_get",
_tqdm_bar=_tqdm_bar,
)
with progress_cm as progress:
def progress_updater(progress_bytes: float):
progress.update(progress_bytes)
> download_files(
xet_download_info,
endpoint=connection_info.endpoint,
token_info=(connection_info.access_token, connection_info.expiration_unix_epoch),
token_refresher=token_refresher,
progress_updater=[progress_updater],
)
E RuntimeError: Data processing error: CAS service error : IO Error: Disk quota exceeded (os error 122)
.venv/lib/python3.12/site-packages/huggingface_hub/file_download.py:626: RuntimeError
----------------------------- Captured stderr call -----------------------------
Fetching 3 files: 0%| | 0/3 [00:00<?, ?it/s]
___________________ test_run_ollama[hallucination_detection] ___________________
yaml_json_combo_for_ollama = YamlJsonCombo(short_name='hallucination_detection', yaml_file=PosixPath('/u/ajbozarth/.cache/huggingface/hub/models--i...etection', is_alora=False, repo_id='ibm-granite/granite-lib-rag-r1.0', revision='main', base_model_id='granite4:micro')
def test_run_ollama(yaml_json_combo_for_ollama):
"""
Run the target model end-to-end with a mock Ollama backend.
"""
cfg = yaml_json_combo_for_ollama
# Change base model id to Ollama's version
if cfg.base_model_id == "ibm-granite/granite-4.0-micro":
cfg.base_model_id = "granite4:micro"
else:
pytest.xfail(f"Unsupported base model: {cfg.base_model_id}")
if cfg.arguments_file:
with open(cfg.arguments_file, encoding="utf8") as f:
transform_kwargs = json.load(f)
else:
transform_kwargs = {}
# Load input request
with open(cfg.inputs_file, encoding="utf-8") as f:
model_input = ChatCompletion.model_validate_json(f.read())
model_input.model = cfg.task
# Download files from Hugging Face Hub
try:
> lora_dir = intrinsics_util.obtain_lora(
cfg.task,
cfg.base_model_id,
cfg.repo_id,
revision=cfg.revision,
alora=cfg.is_alora,
)
test/formatters/granite/test_intrinsics_formatters.py:714:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
mellea/formatters/granite/intrinsics/util.py:154: in obtain_lora
local_root_path = huggingface_hub.snapshot_download(
.venv/lib/python3.12/site-packages/huggingface_hub/utils/_validators.py:114: in _inner_fn
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
.venv/lib/python3.12/site-packages/huggingface_hub/_snapshot_download.py:332: in snapshot_download
thread_map(
.venv/lib/python3.12/site-packages/tqdm/contrib/concurrent.py:69: in thread_map
return _executor_map(ThreadPoolExecutor, fn, *iterables, **tqdm_kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.venv/lib/python3.12/site-packages/tqdm/contrib/concurrent.py:51: in _executor_map
return list(tqdm_class(ex.map(fn, *iterables, chunksize=chunksize), **kwargs))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.venv/lib/python3.12/site-packages/tqdm/std.py:1181: in __iter__
for obj in iterable:
^^^^^^^^
/u/ajbozarth/.local/share/uv/python/cpython-3.12.12-linux-x86_64-gnu/lib/python3.12/concurrent/futures/_base.py:619: in result_iterator
yield _result_or_cancel(fs.pop())
^^^^^^^^^^^^^^^^^^^^^^^^^^^
/u/ajbozarth/.local/share/uv/python/cpython-3.12.12-linux-x86_64-gnu/lib/python3.12/concurrent/futures/_base.py:317: in _result_or_cancel
return fut.result(timeout)
^^^^^^^^^^^^^^^^^^^
/u/ajbozarth/.local/share/uv/python/cpython-3.12.12-linux-x86_64-gnu/lib/python3.12/concurrent/futures/_base.py:456: in result
return self.__get_result()
^^^^^^^^^^^^^^^^^^^
/u/ajbozarth/.local/share/uv/python/cpython-3.12.12-linux-x86_64-gnu/lib/python3.12/concurrent/futures/_base.py:401: in __get_result
raise self._exception
/u/ajbozarth/.local/share/uv/python/cpython-3.12.12-linux-x86_64-gnu/lib/python3.12/concurrent/futures/thread.py:59: in run
result = self.fn(*self.args, **self.kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.venv/lib/python3.12/site-packages/huggingface_hub/_snapshot_download.py:306: in _inner_hf_hub_download
return hf_hub_download(
.venv/lib/python3.12/site-packages/huggingface_hub/utils/_validators.py:114: in _inner_fn
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
.venv/lib/python3.12/site-packages/huggingface_hub/file_download.py:1007: in hf_hub_download
return _hf_hub_download_to_cache_dir(
.venv/lib/python3.12/site-packages/huggingface_hub/file_download.py:1168: in _hf_hub_download_to_cache_dir
_download_to_tmp_and_move(
.venv/lib/python3.12/site-packages/huggingface_hub/file_download.py:1720: in _download_to_tmp_and_move
xet_get(
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
> download_files(
xet_download_info,
endpoint=connection_info.endpoint,
token_info=(connection_info.access_token, connection_info.expiration_unix_epoch),
token_refresher=token_refresher,
progress_updater=[progress_updater],
)
E RuntimeError: Data processing error: CAS service error : IO Error: Disk quota exceeded (os error 122)
.venv/lib/python3.12/site-packages/huggingface_hub/file_download.py:626: RuntimeError
----------------------------- Captured stderr call -----------------------------
Fetching 3 files: 0%| | 0/3 [00:00<?, ?it/s]
_____________________ test_run_ollama[query_clarification] _____________________
yaml_json_combo_for_ollama = YamlJsonCombo(short_name='query_clarification', yaml_file=PosixPath('/u/ajbozarth/.cache/huggingface/hub/models--ibm-g...fication', is_alora=False, repo_id='ibm-granite/granite-lib-rag-r1.0', revision='main', base_model_id='granite4:micro')
def test_run_ollama(yaml_json_combo_for_ollama):
"""
Run the target model end-to-end with a mock Ollama backend.
"""
cfg = yaml_json_combo_for_ollama
# Change base model id to Ollama's version
if cfg.base_model_id == "ibm-granite/granite-4.0-micro":
cfg.base_model_id = "granite4:micro"
else:
pytest.xfail(f"Unsupported base model: {cfg.base_model_id}")
if cfg.arguments_file:
with open(cfg.arguments_file, encoding="utf8") as f:
transform_kwargs = json.load(f)
else:
transform_kwargs = {}
# Load input request
with open(cfg.inputs_file, encoding="utf-8") as f:
model_input = ChatCompletion.model_validate_json(f.read())
model_input.model = cfg.task
# Download files from Hugging Face Hub
try:
> lora_dir = intrinsics_util.obtain_lora(
cfg.task,
cfg.base_model_id,
cfg.repo_id,
revision=cfg.revision,
alora=cfg.is_alora,
)
test/formatters/granite/test_intrinsics_formatters.py:714:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
mellea/formatters/granite/intrinsics/util.py:154: in obtain_lora
local_root_path = huggingface_hub.snapshot_download(
.venv/lib/python3.12/site-packages/huggingface_hub/utils/_validators.py:114: in _inner_fn
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
.venv/lib/python3.12/site-packages/huggingface_hub/_snapshot_download.py:332: in snapshot_download
thread_map(
.venv/lib/python3.12/site-packages/tqdm/contrib/concurrent.py:69: in thread_map
return _executor_map(ThreadPoolExecutor, fn, *iterables, **tqdm_kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.venv/lib/python3.12/site-packages/tqdm/contrib/concurrent.py:51: in _executor_map
return list(tqdm_class(ex.map(fn, *iterables, chunksize=chunksize), **kwargs))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.venv/lib/python3.12/site-packages/tqdm/std.py:1181: in __iter__
for obj in iterable:
^^^^^^^^
/u/ajbozarth/.local/share/uv/python/cpython-3.12.12-linux-x86_64-gnu/lib/python3.12/concurrent/futures/_base.py:619: in result_iterator
yield _result_or_cancel(fs.pop())
^^^^^^^^^^^^^^^^^^^^^^^^^^^
/u/ajbozarth/.local/share/uv/python/cpython-3.12.12-linux-x86_64-gnu/lib/python3.12/concurrent/futures/_base.py:317: in _result_or_cancel
return fut.result(timeout)
^^^^^^^^^^^^^^^^^^^
/u/ajbozarth/.local/share/uv/python/cpython-3.12.12-linux-x86_64-gnu/lib/python3.12/concurrent/futures/_base.py:456: in result
return self.__get_result()
^^^^^^^^^^^^^^^^^^^
/u/ajbozarth/.local/share/uv/python/cpython-3.12.12-linux-x86_64-gnu/lib/python3.12/concurrent/futures/_base.py:401: in __get_result
raise self._exception
/u/ajbozarth/.local/share/uv/python/cpython-3.12.12-linux-x86_64-gnu/lib/python3.12/concurrent/futures/thread.py:59: in run
result = self.fn(*self.args, **self.kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.venv/lib/python3.12/site-packages/huggingface_hub/_snapshot_download.py:306: in _inner_hf_hub_download
return hf_hub_download(
.venv/lib/python3.12/site-packages/huggingface_hub/utils/_validators.py:114: in _inner_fn
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
.venv/lib/python3.12/site-packages/huggingface_hub/file_download.py:1007: in hf_hub_download
return _hf_hub_download_to_cache_dir(
.venv/lib/python3.12/site-packages/huggingface_hub/file_download.py:1168: in _hf_hub_download_to_cache_dir
_download_to_tmp_and_move(
.venv/lib/python3.12/site-packages/huggingface_hub/file_download.py:1720: in _download_to_tmp_and_move
xet_get(
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
> download_files(
xet_download_info,
endpoint=connection_info.endpoint,
token_info=(connection_info.access_token, connection_info.expiration_unix_epoch),
token_refresher=token_refresher,
progress_updater=[progress_updater],
)
E RuntimeError: Data processing error: CAS service error : IO Error: Disk quota exceeded (os error 122)
.venv/lib/python3.12/site-packages/huggingface_hub/file_download.py:626: RuntimeError
----------------------------- Captured stderr call -----------------------------
Fetching 3 files: 0%| | 0/3 [00:00<?, ?it/s]
________________________ test_run_ollama[query_rewrite] ________________________
yaml_json_combo_for_ollama = YamlJsonCombo(short_name='query_rewrite', yaml_file=PosixPath('/u/ajbozarth/.cache/huggingface/hub/models--ibm-granite..._rewrite', is_alora=False, repo_id='ibm-granite/granite-lib-rag-r1.0', revision='main', base_model_id='granite4:micro')
def test_run_ollama(yaml_json_combo_for_ollama):
"""
Run the target model end-to-end with a mock Ollama backend.
"""
cfg = yaml_json_combo_for_ollama
# Change base model id to Ollama's version
if cfg.base_model_id == "ibm-granite/granite-4.0-micro":
cfg.base_model_id = "granite4:micro"
else:
pytest.xfail(f"Unsupported base model: {cfg.base_model_id}")
if cfg.arguments_file:
with open(cfg.arguments_file, encoding="utf8") as f:
transform_kwargs = json.load(f)
else:
transform_kwargs = {}
# Load input request
with open(cfg.inputs_file, encoding="utf-8") as f:
model_input = ChatCompletion.model_validate_json(f.read())
model_input.model = cfg.task
# Download files from Hugging Face Hub
try:
> lora_dir = intrinsics_util.obtain_lora(
cfg.task,
cfg.base_model_id,
cfg.repo_id,
revision=cfg.revision,
alora=cfg.is_alora,
)
test/formatters/granite/test_intrinsics_formatters.py:714:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
mellea/formatters/granite/intrinsics/util.py:154: in obtain_lora
local_root_path = huggingface_hub.snapshot_download(
.venv/lib/python3.12/site-packages/huggingface_hub/utils/_validators.py:114: in _inner_fn
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
.venv/lib/python3.12/site-packages/huggingface_hub/_snapshot_download.py:332: in snapshot_download
thread_map(
.venv/lib/python3.12/site-packages/tqdm/contrib/concurrent.py:69: in thread_map
return _executor_map(ThreadPoolExecutor, fn, *iterables, **tqdm_kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.venv/lib/python3.12/site-packages/tqdm/contrib/concurrent.py:51: in _executor_map
return list(tqdm_class(ex.map(fn, *iterables, chunksize=chunksize), **kwargs))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.venv/lib/python3.12/site-packages/tqdm/std.py:1181: in __iter__
for obj in iterable:
^^^^^^^^
/u/ajbozarth/.local/share/uv/python/cpython-3.12.12-linux-x86_64-gnu/lib/python3.12/concurrent/futures/_base.py:619: in result_iterator
yield _result_or_cancel(fs.pop())
^^^^^^^^^^^^^^^^^^^^^^^^^^^
/u/ajbozarth/.local/share/uv/python/cpython-3.12.12-linux-x86_64-gnu/lib/python3.12/concurrent/futures/_base.py:317: in _result_or_cancel
return fut.result(timeout)
^^^^^^^^^^^^^^^^^^^
/u/ajbozarth/.local/share/uv/python/cpython-3.12.12-linux-x86_64-gnu/lib/python3.12/concurrent/futures/_base.py:456: in result
return self.__get_result()
^^^^^^^^^^^^^^^^^^^
/u/ajbozarth/.local/share/uv/python/cpython-3.12.12-linux-x86_64-gnu/lib/python3.12/concurrent/futures/_base.py:401: in __get_result
raise self._exception
/u/ajbozarth/.local/share/uv/python/cpython-3.12.12-linux-x86_64-gnu/lib/python3.12/concurrent/futures/thread.py:59: in run
result = self.fn(*self.args, **self.kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.venv/lib/python3.12/site-packages/huggingface_hub/_snapshot_download.py:306: in _inner_hf_hub_download
return hf_hub_download(
.venv/lib/python3.12/site-packages/huggingface_hub/utils/_validators.py:114: in _inner_fn
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
.venv/lib/python3.12/site-packages/huggingface_hub/file_download.py:1007: in hf_hub_download
return _hf_hub_download_to_cache_dir(
.venv/lib/python3.12/site-packages/huggingface_hub/file_download.py:1168: in _hf_hub_download_to_cache_dir
_download_to_tmp_and_move(
.venv/lib/python3.12/site-packages/huggingface_hub/file_download.py:1720: in _download_to_tmp_and_move
xet_get(
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
> download_files(
xet_download_info,
endpoint=connection_info.endpoint,
token_info=(connection_info.access_token, connection_info.expiration_unix_epoch),
token_refresher=token_refresher,
progress_updater=[progress_updater],
)
E RuntimeError: Data processing error: CAS service error : IO Error: Disk quota exceeded (os error 122)
.venv/lib/python3.12/site-packages/huggingface_hub/file_download.py:626: RuntimeError
----------------------------- Captured stderr call -----------------------------
Fetching 3 files: 0%| | 0/3 [00:00<?, ?it/s]
______________________ test_run_ollama[context_relevance] ______________________
yaml_json_combo_for_ollama = YamlJsonCombo(short_name='context_relevance', yaml_file=PosixPath('/u/ajbozarth/.cache/huggingface/hub/models--ibm-gra...elevance', is_alora=False, repo_id='ibm-granite/granite-lib-rag-r1.0', revision='main', base_model_id='granite4:micro')
def test_run_ollama(yaml_json_combo_for_ollama):
"""
Run the target model end-to-end with a mock Ollama backend.
"""
cfg = yaml_json_combo_for_ollama
# Change base model id to Ollama's version
if cfg.base_model_id == "ibm-granite/granite-4.0-micro":
cfg.base_model_id = "granite4:micro"
else:
pytest.xfail(f"Unsupported base model: {cfg.base_model_id}")
if cfg.arguments_file:
with open(cfg.arguments_file, encoding="utf8") as f:
transform_kwargs = json.load(f)
else:
transform_kwargs = {}
# Load input request
with open(cfg.inputs_file, encoding="utf-8") as f:
model_input = ChatCompletion.model_validate_json(f.read())
model_input.model = cfg.task
# Download files from Hugging Face Hub
try:
> lora_dir = intrinsics_util.obtain_lora(
cfg.task,
cfg.base_model_id,
cfg.repo_id,
revision=cfg.revision,
alora=cfg.is_alora,
)
test/formatters/granite/test_intrinsics_formatters.py:714:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
mellea/formatters/granite/intrinsics/util.py:154: in obtain_lora
local_root_path = huggingface_hub.snapshot_download(
.venv/lib/python3.12/site-packages/huggingface_hub/utils/_validators.py:114: in _inner_fn
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
.venv/lib/python3.12/site-packages/huggingface_hub/_snapshot_download.py:332: in snapshot_download
thread_map(
.venv/lib/python3.12/site-packages/tqdm/contrib/concurrent.py:69: in thread_map
return _executor_map(ThreadPoolExecutor, fn, *iterables, **tqdm_kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.venv/lib/python3.12/site-packages/tqdm/contrib/concurrent.py:51: in _executor_map
return list(tqdm_class(ex.map(fn, *iterables, chunksize=chunksize), **kwargs))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.venv/lib/python3.12/site-packages/tqdm/std.py:1181: in __iter__
for obj in iterable:
^^^^^^^^
/u/ajbozarth/.local/share/uv/python/cpython-3.12.12-linux-x86_64-gnu/lib/python3.12/concurrent/futures/_base.py:619: in result_iterator
yield _result_or_cancel(fs.pop())
^^^^^^^^^^^^^^^^^^^^^^^^^^^
/u/ajbozarth/.local/share/uv/python/cpython-3.12.12-linux-x86_64-gnu/lib/python3.12/concurrent/futures/_base.py:317: in _result_or_cancel
return fut.result(timeout)
^^^^^^^^^^^^^^^^^^^
/u/ajbozarth/.local/share/uv/python/cpython-3.12.12-linux-x86_64-gnu/lib/python3.12/concurrent/futures/_base.py:456: in result
return self.__get_result()
^^^^^^^^^^^^^^^^^^^
/u/ajbozarth/.local/share/uv/python/cpython-3.12.12-linux-x86_64-gnu/lib/python3.12/concurrent/futures/_base.py:401: in __get_result
raise self._exception
/u/ajbozarth/.local/share/uv/python/cpython-3.12.12-linux-x86_64-gnu/lib/python3.12/concurrent/futures/thread.py:59: in run
result = self.fn(*self.args, **self.kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.venv/lib/python3.12/site-packages/huggingface_hub/_snapshot_download.py:306: in _inner_hf_hub_download
return hf_hub_download(
.venv/lib/python3.12/site-packages/huggingface_hub/utils/_validators.py:114: in _inner_fn
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
.venv/lib/python3.12/site-packages/huggingface_hub/file_download.py:1007: in hf_hub_download
return _hf_hub_download_to_cache_dir(
.venv/lib/python3.12/site-packages/huggingface_hub/file_download.py:1168: in _hf_hub_download_to_cache_dir
_download_to_tmp_and_move(
.venv/lib/python3.12/site-packages/huggingface_hub/file_download.py:1720: in _download_to_tmp_and_move
xet_get(
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
def xet_get(
*,
incomplete_path: Path,
xet_file_data: XetFileData,
headers: Dict[str, str],
expected_size: Optional[int] = None,
displayed_filename: Optional[str] = None,
_tqdm_bar: Optional[tqdm] = None,
) -> None:
"""
Download a file using Xet storage service.
Args:
incomplete_path (`Path`):
The path to the file to download.
xet_file_data (`XetFileData`):
The file metadata needed to make the request to the xet storage service.
headers (`Dict[str, str]`):
The headers to send to the xet storage service.
expected_size (`int`, *optional*):
The expected size of the file to download. If set, the download will raise an error if the size of the
received content is different from the expected one.
displayed_filename (`str`, *optional*):
The filename of the file that is being downloaded. Value is used only to display a nice progress bar. If
not set, the filename is guessed from the URL or the `Content-Disposition` header.
**How it works:**
The file download system uses Xet storage, which is a content-addressable storage system that breaks files into chunks
for efficient storage and transfer.
`hf_xet.download_files` manages downloading files by:
- Taking a list of files to download (each with its unique content hash)
- Connecting to a storage server (CAS server) that knows how files are chunked
- Using authentication to ensure secure access
- Providing progress updates during download
Authentication works by regularly refreshing access tokens through `refresh_xet_connection_info` to maintain a valid
connection to the storage server.
The download process works like this:
1. Create a local cache folder at `~/.cache/huggingface/xet/chunk-cache` to store reusable file chunks
2. Download files in parallel:
2.1. Prepare to write the file to disk
2.2. Ask the server "how is this file split into chunks?" using the file's unique hash
The server responds with:
- Which chunks make up the complete file
- Where each chunk can be downloaded from
2.3. For each needed chunk:
- Checks if we already have it in our local cache
- If not, download it from cloud storage (S3)
- Save it to cache for future use
- Assemble the chunks in order to recreate the original file
"""
try:
from hf_xet import PyXetDownloadInfo, download_files # type: ignore[no-redef]
except ImportError:
raise ValueError(
"To use optimized download using Xet storage, you need to install the hf_xet package. "
'Try `pip install "huggingface_hub[hf_xet]"` or `pip install hf_xet`.'
)
connection_info = refresh_xet_connection_info(file_data=xet_file_data, headers=headers)
def token_refresher() -> Tuple[str, int]:
connection_info = refresh_xet_connection_info(file_data=xet_file_data, headers=headers)
if connection_info is None:
raise ValueError("Failed to refresh token using xet metadata.")
return connection_info.access_token, connection_info.expiration_unix_epoch
xet_download_info = [
PyXetDownloadInfo(
destination_path=str(incomplete_path.absolute()), hash=xet_file_data.file_hash, file_size=expected_size
)
]
if not displayed_filename:
displayed_filename = incomplete_path.name
# Truncate filename if too long to display
if len(displayed_filename) > 40:
displayed_filename = f"{displayed_filename[:40]}(…)"
progress_cm = _get_progress_bar_context(
desc=displayed_filename,
log_level=logger.getEffectiveLevel(),
total=expected_size,
initial=0,
name="huggingface_hub.xet_get",
_tqdm_bar=_tqdm_bar,
)
with progress_cm as progress:
def progress_updater(progress_bytes: float):
progress.update(progress_bytes)
> download_files(
xet_download_info,
endpoint=connection_info.endpoint,
token_info=(connection_info.access_token, connection_info.expiration_unix_epoch),
token_refresher=token_refresher,
progress_updater=[progress_updater],
)
E RuntimeError: Data processing error: CAS service error : IO Error: Disk quota exceeded (os error 122)
.venv/lib/python3.12/site-packages/huggingface_hub/file_download.py:626: RuntimeError
----------------------------- Captured stderr call -----------------------------
Fetching 3 files: 0%| | 0/3 [00:00<?, ?it/s]
__________________________ test_run_ollama[citations] __________________________
yaml_json_combo_for_ollama = YamlJsonCombo(short_name='citations', yaml_file=PosixPath('/u/ajbozarth/.cache/huggingface/hub/models--ibm-granite--gr...itations', is_alora=False, repo_id='ibm-granite/granite-lib-rag-r1.0', revision='main', base_model_id='granite4:micro')
def test_run_ollama(yaml_json_combo_for_ollama):
"""
Run the target model end-to-end with a mock Ollama backend.
"""
cfg = yaml_json_combo_for_ollama
# Change base model id to Ollama's version
if cfg.base_model_id == "ibm-granite/granite-4.0-micro":
cfg.base_model_id = "granite4:micro"
else:
pytest.xfail(f"Unsupported base model: {cfg.base_model_id}")
if cfg.arguments_file:
with open(cfg.arguments_file, encoding="utf8") as f:
transform_kwargs = json.load(f)
else:
transform_kwargs = {}
# Load input request
with open(cfg.inputs_file, encoding="utf-8") as f:
model_input = ChatCompletion.model_validate_json(f.read())
model_input.model = cfg.task
# Download files from Hugging Face Hub
try:
> lora_dir = intrinsics_util.obtain_lora(
cfg.task,
cfg.base_model_id,
cfg.repo_id,
revision=cfg.revision,
alora=cfg.is_alora,
)
test/formatters/granite/test_intrinsics_formatters.py:714:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
mellea/formatters/granite/intrinsics/util.py:154: in obtain_lora
local_root_path = huggingface_hub.snapshot_download(
.venv/lib/python3.12/site-packages/huggingface_hub/utils/_validators.py:114: in _inner_fn
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
.venv/lib/python3.12/site-packages/huggingface_hub/_snapshot_download.py:332: in snapshot_download
thread_map(
.venv/lib/python3.12/site-packages/tqdm/contrib/concurrent.py:69: in thread_map
return _executor_map(ThreadPoolExecutor, fn, *iterables, **tqdm_kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.venv/lib/python3.12/site-packages/tqdm/contrib/concurrent.py:51: in _executor_map
return list(tqdm_class(ex.map(fn, *iterables, chunksize=chunksize), **kwargs))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.venv/lib/python3.12/site-packages/tqdm/std.py:1181: in __iter__
for obj in iterable:
^^^^^^^^
/u/ajbozarth/.local/share/uv/python/cpython-3.12.12-linux-x86_64-gnu/lib/python3.12/concurrent/futures/_base.py:619: in result_iterator
yield _result_or_cancel(fs.pop())
^^^^^^^^^^^^^^^^^^^^^^^^^^^
/u/ajbozarth/.local/share/uv/python/cpython-3.12.12-linux-x86_64-gnu/lib/python3.12/concurrent/futures/_base.py:317: in _result_or_cancel
return fut.result(timeout)
^^^^^^^^^^^^^^^^^^^
/u/ajbozarth/.local/share/uv/python/cpython-3.12.12-linux-x86_64-gnu/lib/python3.12/concurrent/futures/_base.py:456: in result
return self.__get_result()
^^^^^^^^^^^^^^^^^^^
/u/ajbozarth/.local/share/uv/python/cpython-3.12.12-linux-x86_64-gnu/lib/python3.12/concurrent/futures/_base.py:401: in __get_result
raise self._exception
/u/ajbozarth/.local/share/uv/python/cpython-3.12.12-linux-x86_64-gnu/lib/python3.12/concurrent/futures/thread.py:59: in run
result = self.fn(*self.args, **self.kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.venv/lib/python3.12/site-packages/huggingface_hub/_snapshot_download.py:306: in _inner_hf_hub_download
return hf_hub_download(
.venv/lib/python3.12/site-packages/huggingface_hub/utils/_validators.py:114: in _inner_fn
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
.venv/lib/python3.12/site-packages/huggingface_hub/file_download.py:1007: in hf_hub_download
return _hf_hub_download_to_cache_dir(
.venv/lib/python3.12/site-packages/huggingface_hub/file_download.py:1168: in _hf_hub_download_to_cache_dir
_download_to_tmp_and_move(
.venv/lib/python3.12/site-packages/huggingface_hub/file_download.py:1720: in _download_to_tmp_and_move
xet_get(
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
def xet_get(
*,
incomplete_path: Path,
xet_file_data: XetFileData,
headers: Dict[str, str],
expected_size: Optional[int] = None,
displayed_filename: Optional[str] = None,
_tqdm_bar: Optional[tqdm] = None,
) -> None:
"""
Download a file using Xet storage service.
Args:
incomplete_path (`Path`):
The path to the file to download.
xet_file_data (`XetFileData`):
The file metadata needed to make the request to the xet storage service.
headers (`Dict[str, str]`):
The headers to send to the xet storage service.
expected_size (`int`, *optional*):
The expected size of the file to download. If set, the download will raise an error if the size of the
received content is different from the expected one.
displayed_filename (`str`, *optional*):
The filename of the file that is being downloaded. Value is used only to display a nice progress bar. If
not set, the filename is guessed from the URL or the `Content-Disposition` header.
**How it works:**
The file download system uses Xet storage, which is a content-addressable storage system that breaks files into chunks
for efficient storage and transfer.
`hf_xet.download_files` manages downloading files by:
- Taking a list of files to download (each with its unique content hash)
- Connecting to a storage server (CAS server) that knows how files are chunked
- Using authentication to ensure secure access
- Providing progress updates during download
Authentication works by regularly refreshing access tokens through `refresh_xet_connection_info` to maintain a valid
connection to the storage server.
The download process works like this:
1. Create a local cache folder at `~/.cache/huggingface/xet/chunk-cache` to store reusable file chunks
2. Download files in parallel:
2.1. Prepare to write the file to disk
2.2. Ask the server "how is this file split into chunks?" using the file's unique hash
The server responds with:
- Which chunks make up the complete file
- Where each chunk can be downloaded from
2.3. For each needed chunk:
- Checks if we already have it in our local cache
- If not, download it from cloud storage (S3)
- Save it to cache for future use
- Assemble the chunks in order to recreate the original file
"""
try:
from hf_xet import PyXetDownloadInfo, download_files # type: ignore[no-redef]
except ImportError:
raise ValueError(
"To use optimized download using Xet storage, you need to install the hf_xet package. "
'Try `pip install "huggingface_hub[hf_xet]"` or `pip install hf_xet`.'
)
connection_info = refresh_xet_connection_info(file_data=xet_file_data, headers=headers)
def token_refresher() -> Tuple[str, int]:
connection_info = refresh_xet_connection_info(file_data=xet_file_data, headers=headers)
if connection_info is None:
raise ValueError("Failed to refresh token using xet metadata.")
return connection_info.access_token, connection_info.expiration_unix_epoch
xet_download_info = [
PyXetDownloadInfo(
destination_path=str(incomplete_path.absolute()), hash=xet_file_data.file_hash, file_size=expected_size
)
]
if not displayed_filename:
displayed_filename = incomplete_path.name
# Truncate filename if too long to display
if len(displayed_filename) > 40:
displayed_filename = f"{displayed_filename[:40]}(…)"
progress_cm = _get_progress_bar_context(
desc=displayed_filename,
log_level=logger.getEffectiveLevel(),
total=expected_size,
initial=0,
name="huggingface_hub.xet_get",
_tqdm_bar=_tqdm_bar,
)
with progress_cm as progress:
def progress_updater(progress_bytes: float):
progress.update(progress_bytes)
> download_files(
xet_download_info,
endpoint=connection_info.endpoint,
token_info=(connection_info.access_token, connection_info.expiration_unix_epoch),
token_refresher=token_refresher,
progress_updater=[progress_updater],
)
E RuntimeError: Data processing error: CAS service error : IO Error: Disk quota exceeded (os error 122)
.venv/lib/python3.12/site-packages/huggingface_hub/file_download.py:626: RuntimeError
----------------------------- Captured stderr call -----------------------------
Fetching 3 files: 0%| | 0/3 [00:00<?, ?it/s]
=============================== warnings summary ===============================
test/backends/test_huggingface.py: 1 warning
test/stdlib/components/intrinsic/test_core.py: 2 warnings
test/stdlib/components/intrinsic/test_guardian.py: 3 warnings
test/stdlib/components/intrinsic/test_rag.py: 5 warnings
/proj/dmfexp/eiger/users/ajbozarth/mellea/.venv/lib/python3.12/site-packages/peft/tuners/tuners_utils.py:285: UserWarning: Already found a `peft_config` attribute in the model. This will lead to having multiple adapters in the model. Make sure to know what you are doing!
warnings.warn(
test/cli/test_alora_train_integration.py::test_alora_training_integration
test/cli/test_alora_train_integration.py::test_lora_training_integration
/proj/dmfexp/eiger/users/ajbozarth/mellea/.venv/lib/python3.12/site-packages/trl/trainer/utils.py:103: DeprecationWarning: This class is deprecated and will be removed in version 0.20.0. To train on completion only, please use the parameter `completion_only_loss` of `SFTConfig` instead.
warnings.warn(
test/cli/test_alora_train_integration.py::test_alora_training_integration
test/cli/test_alora_train_integration.py::test_lora_training_integration
test/cli/test_alora_train.py::test_alora_config_creation
test/cli/test_alora_train.py::test_lora_config_creation
test/cli/test_alora_train.py::test_invocation_prompt_tokenization
/proj/dmfexp/eiger/users/ajbozarth/mellea/.venv/lib/python3.12/site-packages/trl/trainer/sft_config.py:257: DeprecationWarning: `max_seq_length` is deprecated and will be removed in version 0.20.0. Use `max_length` instead.
warnings.warn(
test/cli/test_alora_train_integration.py::test_alora_training_integration
test/cli/test_alora_train_integration.py::test_lora_training_integration
/proj/dmfexp/eiger/users/ajbozarth/mellea/.venv/lib/python3.12/site-packages/trl/trainer/sft_trainer.py:678: DeprecationWarning: Failed to apply the formatting function due to the following error: string index out of range. This may be because the function is designed for batched input. Please update it to process one example at a time (i.e., accept and return a single example). For now, we will attempt to apply the function in batched mode, but note that batched formatting is deprecated and will be removed in version 0.21.
warnings.warn(
test/cli/test_alora_train_integration.py::test_alora_training_integration
test/cli/test_alora_train_integration.py::test_lora_training_integration
/proj/dmfexp/eiger/users/ajbozarth/mellea/.venv/lib/python3.12/site-packages/torch/utils/data/_utils/pin_memory.py:57: DeprecationWarning: The argument 'device' of Tensor.pin_memory() is deprecated. Please do not pass this argument. (Triggered internally at /pytorch/aten/src/ATen/native/Memory.cpp:46.)
return data.pin_memory(device)
test/cli/test_alora_train_integration.py::test_alora_training_integration
test/cli/test_alora_train_integration.py::test_lora_training_integration
/proj/dmfexp/eiger/users/ajbozarth/mellea/.venv/lib/python3.12/site-packages/torch/utils/data/_utils/pin_memory.py:57: DeprecationWarning: The argument 'device' of Tensor.is_pinned() is deprecated. Please do not pass this argument. (Triggered internally at /pytorch/aten/src/ATen/native/Memory.cpp:31.)
return data.pin_memory(device)
test/telemetry/test_metrics_backend.py: 8 warnings
test/telemetry/test_metrics.py: 24 warnings
test/telemetry/test_metrics_token.py: 4 warnings
/proj/dmfexp/eiger/users/ajbozarth/mellea/mellea/telemetry/metrics.py:245: UserWarning: Metrics are enabled (MELLEA_METRICS_ENABLED=true) but no exporters are configured. Metrics will be collected but not exported. Set MELLEA_METRICS_PROMETHEUS=true, set MELLEA_METRICS_OTLP=true with an endpoint (OTEL_EXPORTER_OTLP_METRICS_ENDPOINT or OTEL_EXPORTER_OTLP_ENDPOINT), or set MELLEA_METRICS_CONSOLE=true to export metrics.
_meter_provider = _setup_meter_provider()
test/telemetry/test_metrics_backend.py: 7 warnings
test/telemetry/test_metrics.py: 28 warnings
test/telemetry/test_metrics_token.py: 4 warnings
/u/ajbozarth/.local/share/uv/python/cpython-3.12.12-linux-x86_64-gnu/lib/python3.12/importlib/__init__.py:131: UserWarning: TokenMetricsPlugin already registered: Plugin token_metrics.generation_post_call already registered
_bootstrap._exec(spec, module)
test/backends/test_vision_openai.py::test_image_block_construction
/proj/dmfexp/eiger/users/ajbozarth/mellea/test/backends/test_vision_openai.py:48: DeprecationWarning: 'mode' parameter is deprecated and will be removed in Pillow 13 (2026-10-15)
random_image = Image.fromarray(random_pixel_data, "RGB")
test/backends/test_litellm_ollama.py::test_litellm_ollama_chat
test/backends/test_litellm_ollama.py::test_generate_from_raw
test/backends/test_litellm_ollama.py::test_async_parallel_requests
test/backends/test_litellm_ollama.py::test_async_avalue
test/telemetry/test_metrics_backend.py::test_litellm_token_metrics_integration[non-streaming]
test/telemetry/test_metrics_backend.py::test_litellm_token_metrics_integration[streaming]
/proj/dmfexp/eiger/users/ajbozarth/mellea/.venv/lib/python3.12/site-packages/aiohttp/connector.py:993: DeprecationWarning: enable_cleanup_closed ignored because https://github.com/python/cpython/pull/118960 is fixed in Python version sys.version_info(major=3, minor=12, micro=12, releaselevel='final', serial=0)
super().__init__(
test/backends/test_litellm_ollama.py::test_litellm_ollama_chat
/proj/dmfexp/eiger/users/ajbozarth/mellea/.venv/lib/python3.12/site-packages/pydantic/main.py:464: UserWarning: Pydantic serializer warnings:
PydanticSerializationUnexpectedValue(Expected 10 fields but got 5: Expected `Message` - serialized value may not be as expected [field_name='message', input_value=Message(content='The answ...er_specific_fields=None), input_type=Message])
PydanticSerializationUnexpectedValue(Expected `StreamingChoices` - serialized value may not be as expected [field_name='choices', input_value=Choices(finish_reason='st...r_specific_fields=None)), input_type=Choices])
return self.__pydantic_serializer__.to_python(
test/backends/test_litellm_ollama.py::test_litellm_ollama_instruct
/proj/dmfexp/eiger/users/ajbozarth/mellea/.venv/lib/python3.12/site-packages/pydantic/main.py:464: UserWarning: Pydantic serializer warnings:
PydanticSerializationUnexpectedValue(Expected 10 fields but got 5: Expected `Message` - serialized value may not be as expected [field_name='message', input_value=Message(content='Subject:...er_specific_fields=None), input_type=Message])
PydanticSerializationUnexpectedValue(Expected `StreamingChoices` - serialized value may not be as expected [field_name='choices', input_value=Choices(finish_reason='st...r_specific_fields=None)), input_type=Choices])
return self.__pydantic_serializer__.to_python(
test/backends/test_litellm_ollama.py::test_litellm_ollama_instruct
test/backends/test_litellm_ollama.py::test_litellm_ollama_instruct_options
/proj/dmfexp/eiger/users/ajbozarth/mellea/.venv/lib/python3.12/site-packages/pydantic/main.py:464: UserWarning: Pydantic serializer warnings:
PydanticSerializationUnexpectedValue(Expected 10 fields but got 5: Expected `Message` - serialized value may not be as expected [field_name='message', input_value=Message(content='yes', ro...er_specific_fields=None), input_type=Message])
PydanticSerializationUnexpectedValue(Expected `StreamingChoices` - serialized value may not be as expected [field_name='choices', input_value=Choices(finish_reason='st...r_specific_fields=None)), input_type=Choices])
return self.__pydantic_serializer__.to_python(
test/backends/test_litellm_ollama.py::test_litellm_ollama_instruct_options
/proj/dmfexp/eiger/users/ajbozarth/mellea/.venv/lib/python3.12/site-packages/pydantic/main.py:464: UserWarning: Pydantic serializer warnings:
PydanticSerializationUnexpectedValue(Expected 10 fields but got 5: Expected `Message` - serialized value may not be as expected [field_name='message', input_value=Message(content="Subject:...er_specific_fields=None), input_type=Message])
PydanticSerializationUnexpectedValue(Expected `StreamingChoices` - serialized value may not be as expected [field_name='choices', input_value=Choices(finish_reason='st...r_specific_fields=None)), input_type=Choices])
return self.__pydantic_serializer__.to_python(
test/backends/test_litellm_ollama.py::test_gen_slot
/proj/dmfexp/eiger/users/ajbozarth/mellea/.venv/lib/python3.12/site-packages/pydantic/main.py:464: UserWarning: Pydantic serializer warnings:
PydanticSerializationUnexpectedValue(Expected 10 fields but got 5: Expected `Message` - serialized value may not be as expected [field_name='message', input_value=Message(content='{\n "...er_specific_fields=None), input_type=Message])
PydanticSerializationUnexpectedValue(Expected `StreamingChoices` - serialized value may not be as expected [field_name='choices', input_value=Choices(finish_reason='st...r_specific_fields=None)), input_type=Choices])
return self.__pydantic_serializer__.to_python(
test/backends/test_litellm_ollama.py::test_async_parallel_requests
test/telemetry/test_metrics_backend.py::test_litellm_token_metrics_integration[streaming]
/proj/dmfexp/eiger/users/ajbozarth/mellea/.venv/lib/python3.12/site-packages/litellm/litellm_core_utils/streaming_handler.py:1855: PydanticDeprecatedSince20: The `dict` method is deprecated; use `model_dump` instead. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.12/migration/
obj_dict = processed_chunk.dict()
test/backends/test_litellm_ollama.py::test_async_parallel_requests
/proj/dmfexp/eiger/users/ajbozarth/mellea/.venv/lib/python3.12/site-packages/pydantic/main.py:464: UserWarning: Pydantic serializer warnings:
PydanticSerializationUnexpectedValue(Expected 10 fields but got 5: Expected `Message` - serialized value may not be as expected [field_name='message', input_value=Message(content="Goodbye!...er_specific_fields=None), input_type=Message])
PydanticSerializationUnexpectedValue(Expected `StreamingChoices` - serialized value may not be as expected [field_name='choices', input_value=Choices(finish_reason='st...r_specific_fields=None)), input_type=Choices])
return self.__pydantic_serializer__.to_python(
test/backends/test_litellm_ollama.py::test_async_parallel_requests
test/backends/test_litellm_ollama.py::test_async_avalue
test/backends/test_litellm_ollama.py::test_async_avalue
/proj/dmfexp/eiger/users/ajbozarth/mellea/.venv/lib/python3.12/site-packages/pydantic/main.py:464: UserWarning: Pydantic serializer warnings:
PydanticSerializationUnexpectedValue(Expected 10 fields but got 5: Expected `Message` - serialized value may not be as expected [field_name='message', input_value=Message(content='Hello! H...er_specific_fields=None), input_type=Message])
PydanticSerializationUnexpectedValue(Expected `StreamingChoices` - serialized value may not be as expected [field_name='choices', input_value=Choices(finish_reason='st...r_specific_fields=None)), input_type=Choices])
return self.__pydantic_serializer__.to_python(
test/backends/test_tool_calls.py::test_tool_called_from_context_action
<frozen abc>:106: DeprecationWarning: Use BaseMetaSerializer() instead.
test/backends/test_vision_ollama.py::test_image_block_construction
/proj/dmfexp/eiger/users/ajbozarth/mellea/test/backends/test_vision_ollama.py:38: DeprecationWarning: 'mode' parameter is deprecated and will be removed in Pillow 13 (2026-10-15)
random_image = Image.fromarray(random_pixel_data, "RGB")
test/telemetry/test_metrics_backend.py::test_litellm_token_metrics_integration[non-streaming]
test/telemetry/test_tracing.py::test_session_with_tracing_disabled
/proj/dmfexp/eiger/users/ajbozarth/mellea/.venv/lib/python3.12/site-packages/pydantic/main.py:464: UserWarning: Pydantic serializer warnings:
PydanticSerializationUnexpectedValue(Expected 10 fields but got 5: Expected `Message` - serialized value may not be as expected [field_name='message', input_value=Message(content="I'm here...ields={'refusal': None}), input_type=Message])
PydanticSerializationUnexpectedValue(Expected `StreamingChoices` - serialized value may not be as expected [field_name='choices', input_value=Choices(finish_reason='st...ider_specific_fields={}), input_type=Choices])
return self.__pydantic_serializer__.to_python(
test/telemetry/test_metrics_backend.py::test_litellm_token_metrics_integration[streaming]
/proj/dmfexp/eiger/users/ajbozarth/mellea/.venv/lib/python3.12/site-packages/litellm/litellm_core_utils/model_response_utils.py:206: PydanticDeprecatedSince211: Accessing the 'model_computed_fields' attribute on the instance is deprecated. Instead, you should access this attribute from the model class. Deprecated in Pydantic V2.11 to be removed in V3.0.
or callable(getattr(delta, attr_name))
test/telemetry/test_metrics_backend.py::test_litellm_token_metrics_integration[streaming]
/proj/dmfexp/eiger/users/ajbozarth/mellea/.venv/lib/python3.12/site-packages/litellm/litellm_core_utils/model_response_utils.py:206: PydanticDeprecatedSince211: Accessing the 'model_fields' attribute on the instance is deprecated. Instead, you should access this attribute from the model class. Deprecated in Pydantic V2.11 to be removed in V3.0.
or callable(getattr(delta, attr_name))
test/telemetry/test_metrics_backend.py::test_litellm_token_metrics_integration[streaming]
/proj/dmfexp/eiger/users/ajbozarth/mellea/.venv/lib/python3.12/site-packages/pydantic/main.py:464: UserWarning: Pydantic serializer warnings:
PydanticSerializationUnexpectedValue(Expected 10 fields but got 5: Expected `Message` - serialized value may not be as expected [field_name='message', input_value=Message(content='As an AI...er_specific_fields=None), input_type=Message])
PydanticSerializationUnexpectedValue(Expected `StreamingChoices` - serialized value may not be as expected [field_name='choices', input_value=Choices(finish_reason='st...r_specific_fields=None)), input_type=Choices])
return self.__pydantic_serializer__.to_python(
test/helpers/test_event_loop_helper.py::test_event_loop_handler_with_forking
/u/ajbozarth/.local/share/uv/python/cpython-3.12.12-linux-x86_64-gnu/lib/python3.12/multiprocessing/popen_fork.py:66: DeprecationWarning: This process (pid=1137938) is multi-threaded, use of fork() may lead to deadlocks in the child.
self.pid = os.fork()
test/stdlib/components/docs/test_richdocument.py::test_richdocument_basics
test/stdlib/components/docs/test_richdocument.py::test_richdocument_markdown
test/stdlib/components/docs/test_richdocument.py::test_richdocument_save
test/stdlib/components/docs/test_richdocument.py::test_richdocument_save
/proj/dmfexp/eiger/users/ajbozarth/mellea/.venv/lib/python3.12/site-packages/docling_core/transforms/serializer/markdown.py:490: DeprecationWarning: Field `annotations` is deprecated; use `meta` instead.
for ann in item.annotations
test/telemetry/test_logging.py::test_otlp_logging_enabled_without_endpoint_warns
/proj/dmfexp/eiger/users/ajbozarth/mellea/mellea/telemetry/logging.py:97: UserWarning: OTLP logs exporter is enabled (MELLEA_LOGS_OTLP=true) but no endpoint is configured. Set OTEL_EXPORTER_OTLP_LOGS_ENDPOINT or OTEL_EXPORTER_OTLP_ENDPOINT to export logs.
_logger_provider = _setup_logger_provider()
-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
================================ tests coverage ================================
_______________ coverage: platform linux, python 3.12.12-final-0 _______________
Coverage HTML written to dir htmlcov
Coverage JSON written to file coverage.json
=========================== short test summary info ============================
FAILED test/stdlib/components/intrinsic/test_core.py::test_find_context_attributions
FAILED test/backends/test_vision_openai.py::test_image_block_in_instruction
FAILED test/formatters/granite/test_intrinsics_formatters.py::test_run_ollama[answerability_simple]
FAILED test/formatters/granite/test_intrinsics_formatters.py::test_run_ollama[answerability_answerable]
FAILED test/formatters/granite/test_intrinsics_formatters.py::test_run_ollama[answerability_unanswerable]
FAILED test/formatters/granite/test_intrinsics_formatters.py::test_run_ollama[hallucination_detection]
FAILED test/formatters/granite/test_intrinsics_formatters.py::test_run_ollama[query_clarification]
FAILED test/formatters/granite/test_intrinsics_formatters.py::test_run_ollama[query_rewrite]
FAILED test/formatters/granite/test_intrinsics_formatters.py::test_run_ollama[context_relevance]
FAILED test/formatters/granite/test_intrinsics_formatters.py::test_run_ollama[citations]
ERROR test/stdlib/sampling/test_think_budget_forcing.py::test_think_big - Exc...
ERROR test/stdlib/sampling/test_think_budget_forcing.py::test_think_little - ...
= 10 failed, 821 passed, 37 skipped, 19 deselected, 2 xfailed, 1 xpassed, 131 warnings, 2 errors in 2295.03s (0:38:15) =
[23:00:24] Shutting down ollama server...
[23:00:24] Ollama stopped.
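All of the `test_run_ollama` failures above end in the same `RuntimeError: ... Disk quota exceeded (os error 122)` raised while downloading from the Hugging Face Hub, so they look like an environment problem rather than a test regression. Below is a minimal sketch of how such failures could be told apart from real ones. `is_disk_space_error` and `SPACE_MARKERS` are hypothetical names and not part of mellea or `huggingface_hub`; the message strings are taken from the traceback above and from the standard Linux `ENOSPC` text.

```python
import errno

# Hypothetical helper for illustration only; not part of mellea or huggingface_hub.
# Message fragments that indicate the filesystem, not the code under test, failed.
SPACE_MARKERS = ("Disk quota exceeded", "No space left on device")

# EDQUOT is not defined on every platform, so look it up defensively.
SPACE_ERRNOS = {
    code
    for code in (getattr(errno, "EDQUOT", None), getattr(errno, "ENOSPC", None))
    if code is not None
}


def is_disk_space_error(exc: BaseException) -> bool:
    """Return True if exc, or any exception chained to it, signals a full disk or quota."""
    seen = set()
    current = exc
    while current is not None and id(current) not in seen:
        seen.add(id(current))
        # Native OSError: trust the errno value.
        if isinstance(current, OSError) and current.errno in SPACE_ERRNOS:
            return True
        # Wrapped errors (such as the RuntimeError in the log) only carry the message.
        if any(marker in str(current) for marker in SPACE_MARKERS):
            return True
        current = current.__cause__ or current.__context__
    return False
```

One possible use, if we decide it is worth it, would be to wrap the download step in the test with `try`/`except` and call `pytest.skip(...)` when this helper returns `True`, so quota problems on the cluster are reported as skips instead of failures.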
|
|
Looking into the above issues, I've actually hit a handful of things to address that I will be returning to on Monday, including but not limited to:
|
Signed-off-by: Alex Bozarth <ajbozart@us.ibm.com>
…skip/resource issues
53b7fed to
ec0254d
Compare
Thanks!
I had a look at the test. It does test some structural aspects, but also has a few qualitative checks in it. I didn't hit the issue in multiple runs, but I agree with your suggested classification. It would benefit from some rewriting to tease out the different aspects. I also looked at the skill, but we're into subtle detail here, so I don't think there's anything else to add. Pushed a fix for this.
My error -- I had left a necessary file in .gitignore for project-specific config (which is why it worked for me!). That is accepted best practice and Claude's intent. I also tweaked the description slightly, as we discussed last week, so it considers the one-off cases (though I'm sure more tweaks are possible).
I've also done a rebase on upstream/main (with no conflicts). It's not squashed, but if/when we're ready I can do that to make it easier to track upstream whilst being reviewed. Trying a cluster run with 'uv sync --all-extras --all-groups && uv run test/scripts/run_tests_with_ollama.sh' (and using
Thanks for the thorough checks and patches. |
|
I've rebased on main to bring in the fixes in #765 and #764 in addition to the fixes @planetf1 made on Saturday then reran the tests.
Test run summary
Local run ( Same 2
Cluster run ( Same infrastructure failures as before:
The one improvement vs the last cluster run:
Local terminal output
$ uv run pytest
============================================================================================================ test session starts ============================================================================================================
platform darwin -- Python 3.12.8, pytest-9.0.0, pluggy-1.6.0
rootdir: /Users/ajbozarth/workspace/ai/mellea
configfile: pyproject.toml
testpaths: test, docs
plugins: nbmake-1.5.5, recording-0.13.4, anyio-4.11.0, xdist-3.8.0, json-report-1.5.0, timeout-2.4.0, metadata-3.1.1, asyncio-1.3.0, langsmith-0.6.6, Faker-37.12.0, cov-7.0.0
timeout: 900.0s
timeout method: signal
timeout func_only: False
asyncio: mode=Mode.AUTO, debug=False, asyncio_default_fixture_loop_scope=None, asyncio_default_test_loop_scope=function
collected 883 items / 19 deselected / 2 skipped / 864 selected
test/backends/test_adapters/test_adapter.py . [ 0%]
test/backends/test_bedrock.py s [ 0%]
test/backends/test_huggingface.py sssssssssssssssssss [ 2%]
test/backends/test_huggingface_tools.py s [ 2%]
test/backends/test_litellm_ollama.py ........ [ 3%]
test/backends/test_litellm_watsonx.py ssss [ 3%]
test/backends/test_mellea_tool.py ........ [ 4%]
test/backends/test_model_options.py ..... [ 5%]
test/backends/test_ollama.py .....X.... [ 6%]
test/backends/test_openai_ollama.py ............. [ 8%]
test/backends/test_openai_vllm.py sssssss [ 8%]
test/backends/test_tool_calls.py ... [ 9%]
test/backends/test_tool_decorator.py ................... [ 11%]
test/backends/test_tool_helpers.py ... [ 11%]
test/backends/test_tool_validation_integration.py ................................. [ 15%]
test/backends/test_vision_ollama.py .... [ 16%]
test/backends/test_vision_openai.py .... [ 16%]
test/backends/test_watsonx.py sssssssssss [ 17%]
test/cli/test_alora_train.py .... [ 18%]
test/cli/test_alora_train_integration.py ss [ 18%]
test/core/test_astream_exception_propagation.py ..... [ 19%]
test/core/test_astream_incremental.py ...... [ 19%]
test/core/test_astream_mock.py ...... [ 20%]
test/core/test_base.py .... [ 20%]
test/core/test_component_typing.py ........ [ 21%]
test/core/test_model_output_thunk.py .. [ 22%]
test/decompose/test_decompose.py .......... [ 23%]
test/formatters/granite/test_intrinsics_formatters.py ........................................................x................. [ 31%]
test/formatters/test_template_formatter.py ................ [ 33%]
test/helpers/test_event_loop_helper.py .... [ 34%]
test/helpers/test_server_type.py ................ [ 35%]
test/plugins/test_all_payloads.py ................................................................................................... [ 47%]
test/plugins/test_blocking.py ................ [ 49%]
test/plugins/test_build_global_context.py ....... [ 50%]
test/plugins/test_decorators.py ......... [ 51%]
test/plugins/test_execution_modes.py ........................... [ 54%]
test/plugins/test_hook_call_sites.py .............................. [ 57%]
test/plugins/test_manager.py ss...... [ 58%]
test/plugins/test_mellea_plugin.py ....... [ 59%]
test/plugins/test_payloads.py .......... [ 60%]
test/plugins/test_pluginset.py ......... [ 61%]
test/plugins/test_policies.py ...... [ 62%]
test/plugins/test_policy_enforcement.py .......... [ 63%]
test/plugins/test_priority_ordering.py .............. [ 65%]
test/plugins/test_scoping.py ................................... [ 69%]
test/plugins/test_tool_hooks_redaction.py ....... [ 70%]
test/plugins/test_unregister.py ......... [ 71%]
test/stdlib/components/docs/test_document.py ... [ 71%]
test/stdlib/components/docs/test_richdocument.py .....s [ 72%]
test/stdlib/components/intrinsic/test_core.py ..F [ 72%]
test/stdlib/components/intrinsic/test_guardian.py ...... [ 73%]
test/stdlib/components/intrinsic/test_rag.py ....F.. [ 73%]
test/stdlib/components/test_chat.py . [ 74%]
test/stdlib/components/test_genslot.py ................... [ 76%]
test/stdlib/components/test_hello_world.py .. [ 76%]
test/stdlib/components/test_mify.py ........... [ 77%]
test/stdlib/components/test_transform.py .. [ 78%]
test/stdlib/requirements/test_reqlib_markdown.py ...... [ 78%]
test/stdlib/requirements/test_reqlib_python.py .............sss..... [ 81%]
test/stdlib/requirements/test_reqlib_tools.py . [ 81%]
test/stdlib/requirements/test_requirement.py ..... [ 81%]
test/stdlib/sampling/test_majority_voting.py .. [ 82%]
test/stdlib/sampling/test_sampling_ctx.py .. [ 82%]
test/stdlib/sampling/test_sofai_graph_coloring.py ......................... [ 85%]
test/stdlib/sampling/test_sofai_sampling.py ..................... [ 87%]
test/stdlib/sampling/test_think_budget_forcing.py .. [ 87%]
test/stdlib/test_base_context.py ..... [ 88%]
test/stdlib/test_chat_view.py .. [ 88%]
test/stdlib/test_functional.py .... [ 89%]
test/stdlib/test_session.py s....... [ 90%]
test/stdlib/test_spans.py .x [ 90%]
test/telemetry/test_logging.py ........ [ 91%]
test/telemetry/test_metrics.py ....................................... [ 95%]
test/telemetry/test_metrics_backend.py ....s.... [ 96%]
test/telemetry/test_metrics_plugins.py .... [ 97%]
test/telemetry/test_metrics_token.py .... [ 97%]
test/telemetry/test_tracing.py .............. [ 99%]
test/telemetry/test_tracing_backend.py ssssss [100%]
FAILED test/stdlib/components/intrinsic/test_core.py::test_find_context_attributions - IndexError: list index out of range
FAILED test/stdlib/components/intrinsic/test_rag.py::test_hallucination_detection - AssertionError: assert approx({'resp...he sentence.}) == {'explanation...end': 31, ...}
================================================================= 2 failed, 800 passed, 61 skipped, 19 deselected, 2 xfailed, 1 xpassed, 122 warnings in 1036.95s (0:17:16) =================================================================
Cluster terminal output
$ bsub -Is -n 1 -G grp_preemptable -q preemptable -gpu "num=1/task:mode=shared:mps=no:j_exclusive=yes:gvendor=nvidia" /bin/bash
Job <756712> is submitted to queue <preemptable>.
<<Waiting for dispatch ...>>
<<Starting on p2-r28-n2>>
[ajbozarth@p2-r28-n2 mellea]$ test/scripts/run_tests_with_ollama.sh
[00:56:43] WARNING: CACHE_DIR not set. Ollama models will download to ~/.ollama (default)
[00:56:43] Using standalone log directory: logs/2026-03-31-00:56:43
[00:56:43] Starting ollama server on 127.0.0.1:11434...
[00:56:43] Added system CUDA to LD_LIBRARY_PATH
[00:56:43] Ollama server PID: 229334
[00:56:43] Waiting for ollama to be ready...
[00:56:46] Ollama ready after 2s
[00:56:46] Model granite4:micro already pulled
[00:56:46] Model granite4:micro-h already pulled
[00:56:46] Model granite3.2-vision already pulled
[00:56:46] All models ready.
[00:56:46] Warming up models...
[00:56:46] Warming granite4:micro ...
[00:56:49] Warming granite4:micro-h ...
[00:56:52] Warming granite3.2-vision ...
[00:56:55] Warmup complete.
[00:56:55] Starting pytest...
[00:56:55] Log directory: logs/2026-03-31-00:56:43
[00:56:55] Pytest args: --group-by-backend
============================= test session starts ==============================
platform linux -- Python 3.12.12, pytest-9.0.0, pluggy-1.6.0
rootdir: /proj/dmfexp/eiger/users/ajbozarth/mellea
configfile: pyproject.toml
plugins: nbmake-1.5.5, anyio-4.11.0, json-report-1.5.0, recording-0.13.4, asyncio-1.3.0, timeout-2.4.0, metadata-3.1.1, Faker-37.12.0, xdist-3.8.0, langsmith-0.6.6, cov-7.0.0
asyncio: mode=Mode.AUTO, debug=False, asyncio_default_fixture_loop_scope=None, asyncio_default_test_loop_scope=function
timeout: 900.0s
timeout method: signal
timeout func_only: False
collected 892 items / 19 deselected / 873 selected
test/backends/test_huggingface.py ................... [ 2%]
test/backends/test_huggingface_tools.py . [ 2%]
test/cli/test_alora_train_integration.py .. [ 2%]
test/formatters/granite/test_intrinsics_formatters.py ....x......... [ 4%]
test/stdlib/components/docs/test_richdocument.py s [ 4%]
test/stdlib/components/intrinsic/test_core.py ..F [ 4%]
test/stdlib/components/intrinsic/test_guardian.py ...... [ 5%]
test/stdlib/components/intrinsic/test_rag.py ....... [ 6%]
test/stdlib/test_spans.py .x [ 6%]
test/telemetry/test_metrics_backend.py .. [ 6%]
test/backends/test_openai_ollama.py ............. [ 8%]
test/backends/test_openai_vllm.py sssssss [ 8%]
test/backends/test_vision_openai.py .... [ 9%]
test/telemetry/test_metrics_backend.py .. [ 9%]
test/backends/test_vllm.py ........ [ 10%]
test/backends/test_vllm_tools.py . [ 10%]
test/backends/test_litellm_ollama.py ........ [ 11%]
test/backends/test_mellea_tool.py .. [ 11%]
test/backends/test_ollama.py .....X.... [ 12%]
test/backends/test_tool_calls.py ... [ 13%]
test/backends/test_vision_ollama.py .... [ 13%]
test/core/test_astream_incremental.py ...... [ 14%]
test/core/test_component_typing.py ... [ 14%]
test/core/test_model_output_thunk.py .. [ 14%]
test/stdlib/components/test_genslot.py ................... [ 17%]
test/stdlib/requirements/test_requirement.py ..... [ 17%]
test/stdlib/sampling/test_majority_voting.py .. [ 17%]
test/stdlib/sampling/test_sampling_ctx.py .. [ 18%]
test/stdlib/sampling/test_sofai_graph_coloring.py ... [ 18%]
test/stdlib/sampling/test_sofai_sampling.py . [ 18%]
test/stdlib/sampling/test_think_budget_forcing.py EE [ 18%]
test/stdlib/test_chat_view.py .. [ 19%]
test/stdlib/test_functional.py .... [ 19%]
test/stdlib/test_session.py s....... [ 20%]
test/telemetry/test_metrics_backend.py .... [ 20%]
test/telemetry/test_tracing.py .... [ 21%]
test/telemetry/test_tracing_backend.py ssssss [ 21%]
test/backends/test_bedrock.py s [ 22%]
test/backends/test_litellm_watsonx.py ssss [ 22%]
test/backends/test_watsonx.py sssssssssss [ 23%]
test/telemetry/test_metrics_backend.py s [ 23%]
test/backends/test_adapters/test_adapter.py . [ 24%]
test/backends/test_mellea_tool.py ...... [ 24%]
test/backends/test_model_options.py ..... [ 25%]
test/backends/test_tool_decorator.py ................... [ 27%]
test/backends/test_tool_helpers.py ... [ 27%]
test/backends/test_tool_validation_integration.py ......................
........... [ 31%]
test/cli/test_alora_train.py .... [ 32%]
test/core/test_astream_exception_propagation.py ..... [ 32%]
test/core/test_astream_mock.py ...... [ 33%]
test/core/test_base.py .... [ 33%]
test/core/test_component_typing.py ..... [ 34%]
test/decompose/test_decompose.py .......... [ 35%]
test/formatters/granite/test_intrinsics_formatters.py ....................
..................................FFFFFFFF [ 42%]
test/formatters/test_template_formatter.py ................ [ 44%]
test/helpers/test_event_loop_helper.py .... [ 44%]
test/helpers/test_server_type.py ................ [ 46%]
test/plugins/test_all_payloads.py ......................................
............................................................. [ 57%]
test/plugins/test_blocking.py ................ [ 59%]
test/plugins/test_build_global_context.py ....... [ 60%]
test/plugins/test_decorators.py ......... [ 61%]
test/plugins/test_execution_modes.py ........................... [ 64%]
test/plugins/test_hook_call_sites.py .............................. [ 68%]
test/plugins/test_manager.py ss...... [ 68%]
test/plugins/test_mellea_plugin.py ....... [ 69%]
test/plugins/test_payloads.py .......... [ 70%]
test/plugins/test_pluginset.py ......... [ 71%]
test/plugins/test_policies.py ...... [ 72%]
test/plugins/test_policy_enforcement.py .......... [ 73%]
test/plugins/test_priority_ordering.py .............. [ 75%]
test/plugins/test_scoping.py ................................... [ 79%]
test/plugins/test_tool_hooks_redaction.py ....... [ 80%]
test/plugins/test_unregister.py ......... [ 81%]
test/stdlib/components/docs/test_document.py ... [ 81%]
test/stdlib/components/docs/test_richdocument.py ..... [ 82%]
test/stdlib/components/test_chat.py . [ 82%]
test/stdlib/components/test_hello_world.py .. [ 82%]
test/stdlib/components/test_mify.py ........... [ 83%]
test/stdlib/components/test_transform.py .. [ 83%]
test/stdlib/requirements/test_reqlib_markdown.py ...... [ 84%]
test/stdlib/requirements/test_reqlib_python.py .............sss..... [ 87%]
test/stdlib/requirements/test_reqlib_tools.py . [ 87%]
test/stdlib/sampling/test_sofai_graph_coloring.py ......................
[ 89%]
test/stdlib/sampling/test_sofai_sampling.py .................... [ 91%]
test/stdlib/test_base_context.py ..... [ 92%]
test/telemetry/test_logging.py ........ [ 93%]
test/telemetry/test_metrics.py ....................................... [ 97%]
test/telemetry/test_metrics_plugins.py .... [ 98%]
test/telemetry/test_metrics_token.py .... [ 98%]
test/telemetry/test_tracing.py .......... [100%]
=========================== short test summary info ============================
FAILED test/stdlib/components/intrinsic/test_core.py::test_find_context_attributions
FAILED test/formatters/granite/test_intrinsics_formatters.py::test_run_ollama[answerability_simple]
FAILED test/formatters/granite/test_intrinsics_formatters.py::test_run_ollama[answerability_answerable]
FAILED test/formatters/granite/test_intrinsics_formatters.py::test_run_ollama[answerability_unanswerable]
FAILED test/formatters/granite/test_intrinsics_formatters.py::test_run_ollama[hallucination_detection]
FAILED test/formatters/granite/test_intrinsics_formatters.py::test_run_ollama[query_clarification]
FAILED test/formatters/granite/test_intrinsics_formatters.py::test_run_ollama[query_rewrite]
FAILED test/formatters/granite/test_intrinsics_formatters.py::test_run_ollama[context_relevance]
FAILED test/formatters/granite/test_intrinsics_formatters.py::test_run_ollama[citations]
ERROR test/stdlib/sampling/test_think_budget_forcing.py::test_think_big - Exc...
ERROR test/stdlib/sampling/test_think_budget_forcing.py::test_think_little - ...
= 9 failed, 822 passed, 37 skipped, 19 deselected, 2 xfailed, 1 xpassed, 131 warnings, 2 errors in 2253.24s (0:37:33) =
[01:34:44] Shutting down ollama server...
[01:34:44] Ollama stopped. |
|
I think I'm OK with the PR as is now, though I'd like a few other eyes on it. @avinash2692 @psschwei @jakelorocco, could you try running the tests as well? @avinash2692, based on my re-run above I'm still hitting the disk quota issue on bluevela. I'm unsure whether that was supposed to have been fixed by #765; could you see if you also hit it? Also check the commands I'm running in the terminal output above to make sure I'm not just running the wrong commands (I got them from you). |
jakelorocco
left a comment
There was a problem hiding this comment.
lgtm; I'm hitting some environment errors on blue vela but the tests run locally for me (except for the expected skips and the failures already mentioned).
I think we can investigate any remaining test failures separately.
psschwei
left a comment
There was a problem hiding this comment.
in general, these are mainly marker classification changes and shouldn't majorly break things. I'm ok with this going in, the team can fix issues if they pop up.
| ignore_all = config.getoption("--ignore-all-checks", default=False) | ||
| ignore_gpu = config.getoption("--ignore-gpu-check", default=False) or ignore_all |
There was a problem hiding this comment.
does this still work or did we drop either/both of these?
There was a problem hiding this comment.
I believe this was removed from the test dir conftest, but not from examples. I'm unsure why @planetf1 did this or if it was just missed.
| # Get the model and tokenizer. | ||
| self._model: PreTrainedModel = AutoModelForCausalLM.from_pretrained( | ||
| self._hf_model_id, device_map=str(self._device) | ||
| self._hf_model_id, device_map=str(self._device), torch_dtype="auto" |
There was a problem hiding this comment.
just noting to check here as one possible cause in the event that we start seeing qualitative flakes in the HF tests somewhere down the line.
| if torch.cuda.is_available(): | ||
| return torch.cuda.get_device_properties(0).total_memory / (1024**3) |
There was a problem hiding this comment.
This might cause an issue. IIRC, torch.cuda.get_device_properties(0) does create a CUDA context, so this might lead to device-not-available errors if you call it repeatedly.
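If repeated calls do turn out to be a problem, one possible mitigation is to make the query run at most once per process. The following is only a sketch of that idea, not the code in this PR: `make_cached_vram_probe` and `fake_query` are hypothetical names, and `fake_query` stands in for the real `torch.cuda.get_device_properties(0).total_memory` lookup so the example runs without a GPU.

```python
from functools import lru_cache
from typing import Callable


def make_cached_vram_probe(query_total_bytes: Callable[[], int]) -> Callable[[], float]:
    """Wrap a device query so the underlying call happens at most once."""

    @lru_cache(maxsize=1)
    def vram_gb() -> float:
        # Convert bytes to GiB, mirroring the division in the snippet under review.
        return query_total_bytes() / (1024**3)

    return vram_gb


# Hypothetical stand-in for the torch call; records how often it is invoked.
calls: list[int] = []


def fake_query() -> int:
    calls.append(1)
    return 24 * 1024**3  # pretend the device reports 24 GiB


probe = make_cached_vram_probe(fake_query)
first = probe()
second = probe()  # served from the cache, no second device query
```

With this shape, every predicate evaluated during collection would share the first result instead of touching the device again.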
| pytest.mark.requires_heavy_ram, | ||
| pytest.mark.requires_gpu_isolation, # Activate GPU memory isolation | ||
| pytest.mark.e2e, | ||
| require_gpu(min_vram_gb=20), |
There was a problem hiding this comment.
might need to be a little careful here with GPU mem setting. IMHO, we should just let the test fail if there isn't enough vram rather than checking.
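For context on where the 20 comes from: the commit message for this PR describes the gates as bf16 estimates, computed as params_B × 2 × 1.2 and rounded up to the next even GB. A small sketch of that arithmetic follows; the function name `estimate_min_vram_gb` is hypothetical and is not a helper in the repo.

```python
import math


def estimate_min_vram_gb(
    params_billion: float, bytes_per_param: int = 2, overhead: float = 1.2
) -> int:
    """bf16 estimate: params (in billions) x 2 bytes x 1.2 overhead, rounded up to an even GB."""
    raw_gb = params_billion * bytes_per_param * overhead
    # Round up to the next even number of GB.
    return 2 * math.ceil(raw_gb / 2)


granite_8b = estimate_min_vram_gb(8)  # 19.2 GB raw
mistral_7b = estimate_min_vram_gb(7)  # 16.8 GB raw
granite_3b = estimate_min_vram_gb(3)  # 7.2 GB raw
qwen_06b = estimate_min_vram_gb(0.6)  # 1.44 GB raw
```

The first three reproduce the 20, 18 and 8 GB values listed in the commit message. For Qwen3-0.6B the formula gives 2 GB, while the commit message deliberately uses 4 GB as headroom for the vLLM KV cache, so the formula is a starting point rather than a hard rule.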
avinash2692
left a comment
There was a problem hiding this comment.
LGTM. I did have a few comments on looking up GPU mem and if we need it at all, but happy to fix that in a later PR if it really breaks things.
|
Let's merge this as is (especially since @planetf1 is OOTO this week) and open follow-up issues. @psschwei and @avinash2692, if you could open follow-up issues for any of your above concerns that you think need them, I'll put this into the merge queue and you or @planetf1 can address those follow-ups after. |
@jakelorocco I actually figured this out, the tl;dr always set |
d3d6040
…ve-computing#727, generative-computing#728) (generative-computing#742) * test: add granularity marker taxonomy infrastructure (generative-computing#727) Register unit/integration/e2e markers in conftest and pyproject.toml. Add unit auto-apply hook in pytest_collection_modifyitems. Deprecate llm marker (synonym for e2e). Remove dead plugins marker. Rewrite MARKERS_GUIDE.md as authoritative marker reference. Sync AGENTS.md Section 3 with new taxonomy. * test: add audit-markers skill for test classification (generative-computing#728) Skill classifies tests as unit/integration/e2e/qualitative using general heuristics (Part 1) and project-specific rules (Part 2). Includes fixture chain tracing guidance, backend detection heuristics, and example file handling. References MARKERS_GUIDE.md for tables. * chore: add CLAUDE.md and agent skills infrastructure Add CLAUDE.md referencing AGENTS.md for project directives. Add skill-author meta-skill for cross-compatible skill creation. The audit-markers skill was added in the previous commit. * test: improve audit-markers skill quality and add resource predicates Resolve 8 quality issues from dry-run review of the audit-markers skill: - Add behavioural signal detection tables and Step 0 triage procedure for scaling to full-repo audits (grep for backend behaviour, not just existing markers) - Clarify unit/integration boundary with scope-of-mocks rule - Allow module-level qualitative when every function qualifies - Replace resource marker inference with predicate factory pattern - Make llm→e2e rule explicit for # pytest: comments in examples - Redesign report format: 3-tier output (summary table, issues-only detail, batch groups) instead of per-function listing - Remove stale infrastructure note (conftest hook already exists) Add test/predicates.py with reusable skipif decorators: require_gpu, require_ram, require_gpu_isolation, require_api_key, require_package, require_ollama, require_python. 
Update skill-author with dry-run review step and 4 new authoring guidelines (variable scope, category boundaries, temporal assertions, qualifying absolutes). Refs: generative-computing#727, generative-computing#728 * chore: remove issue references from audit-markers skill Epic/issue numbers are task context, not permanent skill knowledge. * docs: align MARKERS_GUIDE.md with predicate factory pattern MARKERS_GUIDE.md documented legacy resource markers (requires_gpu, etc.) as the active convention while SKILL.md instructed migration to predicates — a direct conflict that would cause the audit agent to stall or produce incorrect edits. - Replace resource markers section with predicate-first documentation - Move legacy markers to deprecated subsection (conftest still handles them) - Update common patterns example to use predicate imports - Add test/predicates.py to related files - Add explicit dry-run enforcement to SKILL.md Step 4 Refs: generative-computing#727, generative-computing#728 * fix: validate_skill.py schema mismatch and brittle YAML parsing Two bugs: - Required `version` at root level but skill-author guide nests it under `metadata` — guaranteed failure on valid skills - Naive `content.split('---')` breaks on markdown horizontal rules Fix: use yaml.safe_load_all for robust frontmatter extraction, check `name`/`description` at root and `version` under `metadata.version`. 
* fix: migrate deprecated llm markers to e2e, add backend registry, update audit-markers skill - Replace all `pytest.mark.llm` with `pytest.mark.e2e` across 34 test files and 87 example files (comment-based markers) - Add `BACKEND_MARKERS` data-driven registry in test/conftest.py as single source of truth for backend marker registration - Register `bedrock` backend marker in conftest.py, pyproject.toml, MARKERS_GUIDE.md, and add missing marker to test_bedrock.py - Reclassify test_alora_train.py as integration (was unit); add importorskip for peft dependency - Add missing `e2e` tier markers to test_tracing.py and test_tracing_backend.py - Update audit-markers skill: report-first default, predicate migration as fix (not recommendation), backend registry gap detection * feat: add estimate-vram skill and fix MPS VRAM detection - New /estimate-vram agent skill that analyses test files to determine correct require_gpu(min_vram_gb=N) and require_ram(min_gb=N) values by tracing model IDs and looking up parameter counts dynamically - Fix _gpu_vram_gb() in test/predicates.py to use torch.mps.recommended_max_memory() on macOS MPS instead of returning 0 - Fix get_system_capabilities() in test/conftest.py with same MPS path - Update test/README.md with predicates table and legacy marker deprecation - Add /estimate-vram cross-reference in audit-markers skill * refactor: fold estimate-vram into audit-markers skill VRAM estimation is only useful during marker audits, not standalone. Move the model-tracing and VRAM computation procedure into the audit-markers resource gating section and delete the separate skill. * docs: drop isolation refs and fix RAM guidance in markers docs requires_heavy_ram and requires_gpu_isolation are deprecated with no replacement — models load into VRAM not system RAM, and GPU isolation is now automatic. require_ram() stays available for genuinely RAM-bound tests but has no current use case. 
* docs: add legacy marker guidance for example files in audit-markers skill * refactor: remove require_ollama() predicate — redundant with backend marker The ollama backend marker + conftest auto-skip already handles Ollama availability. No other backend has a dedicated predicate — consistent to let the marker system handle it. * refactor: replace requires_heavy_ram gate with huggingface backend marker in examples conftest The legacy requires_heavy_ram marker (blanket 48 GB RAM threshold) conflated VRAM with system RAM. Replace both the collection-time and runtime skip logic to gate on the huggingface backend marker instead, which accurately checks GPU availability. * refactor: replace ad-hoc bedrock skipif with require_api_key predicate * refactor: migrate legacy resource markers to predicates Replace deprecated pytest markers with typed predicate functions from test/predicates.py across all test files and example files: - requires_gpu → require_gpu(min_vram_gb=N) with per-model VRAM estimates - requires_heavy_ram → removed (conflated VRAM with RAM; no replacement needed) - requires_gpu_isolation → removed (GPU isolation is now automatic) - requires_api_key → require_api_key("VAR1", "VAR2", ...) with explicit env vars Also removes spurious requires_gpu from ollama-backed tests (test_genslot, test_think_budget_forcing, test_component_typing) and adds missing integration marker to test_hook_call_sites. 
VRAM estimates computed from model parameter counts using bf16 formula (params_B × 2 × 1.2, rounded up to next even GB): - granite-3.3-8b: 20 GB, Mistral-7B: 18 GB, granite-4.0-micro (3B): 8 GB - Qwen3-0.6B: 4 GB (conservative for vLLM KV cache headroom) - granite-4.0-h-micro (3B): 8 GB, alora training (3B): 12 GB * test: skip collection gracefully when optional backend deps are missing Add pytest.importorskip() / pytest.importorskip() guards to 14 test files that previously aborted the entire test run with a ModuleNotFoundError when optional extras were not installed: - torch / llguidance (mellea[hf]): test_huggingface, test_huggingface_tools, test_alora_train_integration, test_intrinsics_formatters, test_core, test_guardian, test_rag, test_spans - litellm (mellea[litellm]): test_litellm_ollama, test_litellm_watsonx - ibm_watsonx_ai (mellea[watsonx]): test_watsonx - docling / docling_core (mellea[mify]): test_tool_calls, test_richdocument, test_transform With these guards, `uv run pytest` runs all collectable tests and reports skipped files with a clear reason instead of aborting at first ImportError. * test: refine integration marker definition and apply audit fixes Expand integration to cover SDK-boundary tests (OTel InMemoryMetricReader, InMemorySpanExporter, LoggingHandler) — tests that assert against a real third-party SDK contract, not just multi-component wiring. Updates SKILL.md and MARKERS_GUIDE.md with new definition, indicators, tie-breaker, and SDK-boundary signal tables. 
Applied fixes: - test/telemetry/test_{metrics,metrics_token,logging}.py: add integration marker - test/telemetry/test_metrics_backend.py: add openai marker to OTel+OpenAI test, remove redundant inline skip already covered by require_api_key predicate - test/cli/test_alora_train.py: add integration to test_imports_work (real LoraConfig) - test/formatters/granite/test_intrinsics_formatters.py: remove unregistered block_network marker - test/stdlib/components/docs/test_richdocument.py: add integration pytestmark + e2e/huggingface/qualitative on skipped generation test - test/backends/test_openai_ollama.py: note inherited module marker limitation - docs/examples/plugins/testing_plugins.py: add # pytest: unit * test: add importorskip guards and optional-dep skip logic for examples - test/plugins/test_payloads.py: importorskip("cpex") — skip module when mellea[hooks] not installed instead of failing mid-test with ImportError - test/telemetry/test_metrics_plugins.py: same cpex guard - docs/examples/conftest.py: extend _check_optional_imports to cover docling, pandas, cpex (mellea.plugins imports), and litellm; also call the check from pytest_pycollect_makemodule so directly-specified files are guarded too - docs/examples/image_text_models/README.md: add Prerequisites section listing models to pull (granite3.2-vision, qwen2.5vl:7b) * fix: convert example import errors to skips; add cpex importorskip guards Replace per-dep import checks in examples conftest with a runtime approach: ExampleModule (a pytest.Module subclass) is now returned by pytest_pycollect_makemodule for all runnable example files, preventing pytest's default collector from importing them directly. Import errors in the subprocess are caught in ExampleItem.runtest() and converted to skips, so no optional dependency needs to be encoded in conftest. Remove _check_optional_imports entirely — it was hand-maintained and would need updating for every new optional dep. 
Also: - test/plugins/test_payloads.py: importorskip("cpex") - test/telemetry/test_metrics_plugins.py: importorskip("cpex") - docs/examples/image_text_models/README.md: add Prerequisites section listing models to pull (granite3.2-vision, qwen2.5vl:7b) * test: skip OTel-dependent tests when opentelemetry not installed Locally running without mellea[telemetry] caused three tests to fail with assertion errors rather than skip cleanly. Add importorskip at module level for test_tracing.py and a skipif decorator for the single OTel-gated test in test_astream_exception_propagation.py. * fix: use conservative heuristic for Apple Silicon GPU memory detection Metal's recommendedMaxWorkingSetSize is a static device property (~75% of total RAM) that ignores current system load. Replace it with min(total * 0.75, total - 16) so that desktop/IDE memory usage is accounted for. Also removes the torch dependency for GPU detection on Apple Silicon — sysctl hw.memsize is used directly. CUDA path on Linux is unchanged. * test: add training memory signals to audit-markers skill; bump alora VRAM gate Training tests need ~2x the base model inference memory (activations, optimizer states, gradient temporaries). The skill now detects training signals (train_model, Trainer, epochs=) and checks that require_gpu min_vram_gb uses the 2x rule. Bump test_alora_train_integration from min_vram_gb=12 to 20 (3B bfloat16: ~6 GB inference, ~12 GB training peak + headroom) so it skips correctly on 32 GB Apple Silicon under typical load. * fix: cache system capabilities result in examples conftest get_system_capabilities() was caching the function reference, not the result — causing the Ollama socket check (1s timeout) and full capability detection to re-run for every example file during collection (~102 times). Cache the result dict instead so detection runs exactly once. 
* fix: cache get_system_capabilities() result in test/conftest.py The function was called once per test in pytest_runtest_setup (325+ calls) and once at collection in pytest_collection_modifyitems, each time re-running the Ollama socket check (1s timeout when down), sysctl subprocess, and psutil query. Cache the result after the first call. * fix: flush MPS memory pool in intrinsic test fixture teardown torch.cuda.empty_cache() is a no-op on Apple Silicon MPS, leaving the MPS allocator pool occupied after each module fixture tears down. The next module then loads a fresh model into an already-pressured pool, causing the process RSS to grow unboundedly across modules. Both calls are now guarded so CUDA and MPS runs each get the correct flush. * fix: load LocalHFBackend model in config dtype to prevent float32 upcasting AutoModelForCausalLM.from_pretrained without torch_dtype may load weights in float32 on CPU before moving to MPS/CUDA, doubling peak memory briefly and leaving float32 remnants in the allocator pool. torch_dtype="auto" respects the model config (bfloat16 for Granite) for both the CPU load and the device transfer. 
* test: remove --isolate-heavy process isolation and bump intrinsic VRAM gates - Remove --isolate-heavy flag, _run_heavy_modules_isolated(), pytest_collection_finish(), and require_gpu_isolation() predicate — superseded by cleanup_gpu_backend() from PR generative-computing#721 - Remove dead requires_gpu/requires_api_key branches from docs/examples/conftest.py - Bump min_vram_gb from 8 → 12 on test_guardian, test_core, test_rag, test_spans — correct gate for 3B base model (6 GB) + adapters + inference overhead; 8 GB was wrong and masked by the now-fixed MPS pool leak - Add adapter accumulation signals to audit-markers skill - Update AGENTS.md, test/README.md, MARKERS_GUIDE.md to remove --isolate-heavy references * test: migrate legacy markers in test_intrinsics_formatters.py Replace deprecated @pytest.mark.llm, @pytest.mark.requires_gpu, @pytest.mark.requires_heavy_ram, @pytest.mark.requires_gpu_isolation with @pytest.mark.e2e and @require_gpu(min_vram_gb=12) to align with the new marker taxonomy (generative-computing#727/generative-computing#728). VRAM gate set to 12 GB matching the 3B-parameter model loaded across the parametrized test cases. 
* test: add integration marker to test_dependency_isolation.py * docs: document OLLAMA_KEEP_ALIVE=1m as memory optimisation for unordered test runs * fix: suppress mypy name-defined for torch.Tensor after importorskip change * fix: ruff format huggingface.py from_pretrained args * fix: ruff format test_watsonx.py and test_huggingface_tools.py * refactor: remove requires_gpu, requires_heavy_ram, requires_gpu_isolation markers and handlers * refactor: remove --ignore-*-check override flags from conftest * refactor: remove requires_api_key marker; fix api backend group to match watsonx+bedrock markers * fix: address review Signed-off-by: Alex Bozarth <ajbozart@us.ibm.com> * test: mark test_image_block_in_instruction as qualitative * chore: commit .claude/settings.json with skillLocations for skill discovery * docs: broaden audit-markers skill description to cover diagnostic use cases * docs: add diagnostic mode to audit-markers skill for troubleshooting skip/resource issues --------- Signed-off-by: Alex Bozarth <ajbozart@us.ibm.com> Co-authored-by: Alex Bozarth <ajbozart@us.ibm.com>
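The Apple Silicon heuristic named in the commit message above, min(total * 0.75, total - 16), can be sketched as follows. The function name is hypothetical, and clamping the result at zero for machines with less than 16 GB is an assumption of this sketch rather than something the commit message states.

```python
def apple_silicon_gpu_budget_gb(total_ram_gb: float) -> float:
    """Conservative GPU memory budget on unified-memory Macs.

    Takes the smaller of 75% of total RAM and total RAM minus 16 GB,
    the latter leaving room for desktop/IDE usage.
    """
    budget = min(total_ram_gb * 0.75, total_ram_gb - 16)
    # Assumption: never report a negative budget on small machines.
    return max(0.0, budget)


m1_max_32 = apple_silicon_gpu_budget_gb(32)
large_128 = apple_silicon_gpu_budget_gb(128)
small_8 = apple_silicon_gpu_budget_gb(8)
```

On a 32 GB machine this yields 16 GB, which is below the 20 GB gate on test_alora_train_integration, consistent with the commit message's note that the test should skip on 32 GB Apple Silicon.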
* decompse doc string * pipline doc string * logging doc string * decomp README * merge docstrings * clean: pre-commit * decomp guide * fix: subtask tag * clean: pre-commit * clean: Readme * merge docstrings * clean: pre-commit * decomp guide * fix: subtask tag * clean: pre-commit * test: agent skills infrastructure and marker taxonomy audit (#727, #728) (#742) * test: add granularity marker taxonomy infrastructure (#727) Register unit/integration/e2e markers in conftest and pyproject.toml. Add unit auto-apply hook in pytest_collection_modifyitems. Deprecate llm marker (synonym for e2e). Remove dead plugins marker. Rewrite MARKERS_GUIDE.md as authoritative marker reference. Sync AGENTS.md Section 3 with new taxonomy. * test: add audit-markers skill for test classification (#728) Skill classifies tests as unit/integration/e2e/qualitative using general heuristics (Part 1) and project-specific rules (Part 2). Includes fixture chain tracing guidance, backend detection heuristics, and example file handling. References MARKERS_GUIDE.md for tables. * chore: add CLAUDE.md and agent skills infrastructure Add CLAUDE.md referencing AGENTS.md for project directives. Add skill-author meta-skill for cross-compatible skill creation. The audit-markers skill was added in the previous commit. 
* test: improve audit-markers skill quality and add resource predicates Resolve 8 quality issues from dry-run review of the audit-markers skill: - Add behavioural signal detection tables and Step 0 triage procedure for scaling to full-repo audits (grep for backend behaviour, not just existing markers) - Clarify unit/integration boundary with scope-of-mocks rule - Allow module-level qualitative when every function qualifies - Replace resource marker inference with predicate factory pattern - Make llm→e2e rule explicit for # pytest: comments in examples - Redesign report format: 3-tier output (summary table, issues-only detail, batch groups) instead of per-function listing - Remove stale infrastructure note (conftest hook already exists) Add test/predicates.py with reusable skipif decorators: require_gpu, require_ram, require_gpu_isolation, require_api_key, require_package, require_ollama, require_python. Update skill-author with dry-run review step and 4 new authoring guidelines (variable scope, category boundaries, temporal assertions, qualifying absolutes). Refs: #727, #728 * chore: remove issue references from audit-markers skill Epic/issue numbers are task context, not permanent skill knowledge. * docs: align MARKERS_GUIDE.md with predicate factory pattern MARKERS_GUIDE.md documented legacy resource markers (requires_gpu, etc.) as the active convention while SKILL.md instructed migration to predicates — a direct conflict that would cause the audit agent to stall or produce incorrect edits. 
- Replace resource markers section with predicate-first documentation - Move legacy markers to deprecated subsection (conftest still handles them) - Update common patterns example to use predicate imports - Add test/predicates.py to related files - Add explicit dry-run enforcement to SKILL.md Step 4 Refs: #727, #728 * fix: validate_skill.py schema mismatch and brittle YAML parsing Two bugs: - Required `version` at root level but skill-author guide nests it under `metadata` — guaranteed failure on valid skills - Naive `content.split('---')` breaks on markdown horizontal rules Fix: use yaml.safe_load_all for robust frontmatter extraction, check `name`/`description` at root and `version` under `metadata.version`. * fix: migrate deprecated llm markers to e2e, add backend registry, update audit-markers skill - Replace all `pytest.mark.llm` with `pytest.mark.e2e` across 34 test files and 87 example files (comment-based markers) - Add `BACKEND_MARKERS` data-driven registry in test/conftest.py as single source of truth for backend marker registration - Register `bedrock` backend marker in conftest.py, pyproject.toml, MARKERS_GUIDE.md, and add missing marker to test_bedrock.py - Reclassify test_alora_train.py as integration (was unit); add importorskip for peft dependency - Add missing `e2e` tier markers to test_tracing.py and test_tracing_backend.py - Update audit-markers skill: report-first default, predicate migration as fix (not recommendation), backend registry gap detection * feat: add estimate-vram skill and fix MPS VRAM detection - New /estimate-vram agent skill that analyses test files to determine correct require_gpu(min_vram_gb=N) and require_ram(min_gb=N) values by tracing model IDs and looking up parameter counts dynamically - Fix _gpu_vram_gb() in test/predicates.py to use torch.mps.recommended_max_memory() on macOS MPS instead of returning 0 - Fix get_system_capabilities() in test/conftest.py with same MPS path - Update test/README.md with predicates table 
and legacy marker deprecation - Add /estimate-vram cross-reference in audit-markers skill * refactor: fold estimate-vram into audit-markers skill VRAM estimation is only useful during marker audits, not standalone. Move the model-tracing and VRAM computation procedure into the audit-markers resource gating section and delete the separate skill. * docs: drop isolation refs and fix RAM guidance in markers docs requires_heavy_ram and requires_gpu_isolation are deprecated with no replacement — models load into VRAM not system RAM, and GPU isolation is now automatic. require_ram() stays available for genuinely RAM-bound tests but has no current use case. * docs: add legacy marker guidance for example files in audit-markers skill * refactor: remove require_ollama() predicate — redundant with backend marker The ollama backend marker + conftest auto-skip already handles Ollama availability. No other backend has a dedicated predicate — consistent to let the marker system handle it. * refactor: replace requires_heavy_ram gate with huggingface backend marker in examples conftest The legacy requires_heavy_ram marker (blanket 48 GB RAM threshold) conflated VRAM with system RAM. Replace both the collection-time and runtime skip logic to gate on the huggingface backend marker instead, which accurately checks GPU availability. * refactor: replace ad-hoc bedrock skipif with require_api_key predicate * refactor: migrate legacy resource markers to predicates Replace deprecated pytest markers with typed predicate functions from test/predicates.py across all test files and example files: - requires_gpu → require_gpu(min_vram_gb=N) with per-model VRAM estimates - requires_heavy_ram → removed (conflated VRAM with RAM; no replacement needed) - requires_gpu_isolation → removed (GPU isolation is now automatic) - requires_api_key → require_api_key("VAR1", "VAR2", ...) 
with explicit env vars Also removes spurious requires_gpu from ollama-backed tests (test_genslot, test_think_budget_forcing, test_component_typing) and adds missing integration marker to test_hook_call_sites. VRAM estimates computed from model parameter counts using bf16 formula (params_B × 2 × 1.2, rounded up to next even GB): - granite-3.3-8b: 20 GB, Mistral-7B: 18 GB, granite-4.0-micro (3B): 8 GB - Qwen3-0.6B: 4 GB (conservative for vLLM KV cache headroom) - granite-4.0-h-micro (3B): 8 GB, alora training (3B): 12 GB * test: skip collection gracefully when optional backend deps are missing Add pytest.importorskip() / pytest.importorskip() guards to 14 test files that previously aborted the entire test run with a ModuleNotFoundError when optional extras were not installed: - torch / llguidance (mellea[hf]): test_huggingface, test_huggingface_tools, test_alora_train_integration, test_intrinsics_formatters, test_core, test_guardian, test_rag, test_spans - litellm (mellea[litellm]): test_litellm_ollama, test_litellm_watsonx - ibm_watsonx_ai (mellea[watsonx]): test_watsonx - docling / docling_core (mellea[mify]): test_tool_calls, test_richdocument, test_transform With these guards, `uv run pytest` runs all collectable tests and reports skipped files with a clear reason instead of aborting at first ImportError. * test: refine integration marker definition and apply audit fixes Expand integration to cover SDK-boundary tests (OTel InMemoryMetricReader, InMemorySpanExporter, LoggingHandler) — tests that assert against a real third-party SDK contract, not just multi-component wiring. Updates SKILL.md and MARKERS_GUIDE.md with new definition, indicators, tie-breaker, and SDK-boundary signal tables. 
Applied fixes: - test/telemetry/test_{metrics,metrics_token,logging}.py: add integration marker - test/telemetry/test_metrics_backend.py: add openai marker to OTel+OpenAI test, remove redundant inline skip already covered by require_api_key predicate - test/cli/test_alora_train.py: add integration to test_imports_work (real LoraConfig) - test/formatters/granite/test_intrinsics_formatters.py: remove unregistered block_network marker - test/stdlib/components/docs/test_richdocument.py: add integration pytestmark + e2e/huggingface/qualitative on skipped generation test - test/backends/test_openai_ollama.py: note inherited module marker limitation - docs/examples/plugins/testing_plugins.py: add # pytest: unit * test: add importorskip guards and optional-dep skip logic for examples - test/plugins/test_payloads.py: importorskip("cpex") — skip module when mellea[hooks] not installed instead of failing mid-test with ImportError - test/telemetry/test_metrics_plugins.py: same cpex guard - docs/examples/conftest.py: extend _check_optional_imports to cover docling, pandas, cpex (mellea.plugins imports), and litellm; also call the check from pytest_pycollect_makemodule so directly-specified files are guarded too - docs/examples/image_text_models/README.md: add Prerequisites section listing models to pull (granite3.2-vision, qwen2.5vl:7b) * fix: convert example import errors to skips; add cpex importorskip guards Replace per-dep import checks in examples conftest with a runtime approach: ExampleModule (a pytest.Module subclass) is now returned by pytest_pycollect_makemodule for all runnable example files, preventing pytest's default collector from importing them directly. Import errors in the subprocess are caught in ExampleItem.runtest() and converted to skips, so no optional dependency needs to be encoded in conftest. Remove _check_optional_imports entirely — it was hand-maintained and would need updating for every new optional dep. 
Also: - test/plugins/test_payloads.py: importorskip("cpex") - test/telemetry/test_metrics_plugins.py: importorskip("cpex") - docs/examples/image_text_models/README.md: add Prerequisites section listing models to pull (granite3.2-vision, qwen2.5vl:7b) * test: skip OTel-dependent tests when opentelemetry not installed Locally running without mellea[telemetry] caused three tests to fail with assertion errors rather than skip cleanly. Add importorskip at module level for test_tracing.py and a skipif decorator for the single OTel-gated test in test_astream_exception_propagation.py. * fix: use conservative heuristic for Apple Silicon GPU memory detection Metal's recommendedMaxWorkingSetSize is a static device property (~75% of total RAM) that ignores current system load. Replace it with min(total * 0.75, total - 16) so that desktop/IDE memory usage is accounted for. Also removes the torch dependency for GPU detection on Apple Silicon — sysctl hw.memsize is used directly. CUDA path on Linux is unchanged. * test: add training memory signals to audit-markers skill; bump alora VRAM gate Training tests need ~2x the base model inference memory (activations, optimizer states, gradient temporaries). The skill now detects training signals (train_model, Trainer, epochs=) and checks that require_gpu min_vram_gb uses the 2x rule. Bump test_alora_train_integration from min_vram_gb=12 to 20 (3B bfloat16: ~6 GB inference, ~12 GB training peak + headroom) so it skips correctly on 32 GB Apple Silicon under typical load. * fix: cache system capabilities result in examples conftest get_system_capabilities() was caching the function reference, not the result — causing the Ollama socket check (1s timeout) and full capability detection to re-run for every example file during collection (~102 times). Cache the result dict instead so detection runs exactly once. 
* fix: cache get_system_capabilities() result in test/conftest.py The function was called once per test in pytest_runtest_setup (325+ calls) and once at collection in pytest_collection_modifyitems, each time re-running the Ollama socket check (1s timeout when down), sysctl subprocess, and psutil query. Cache the result after the first call. * fix: flush MPS memory pool in intrinsic test fixture teardown torch.cuda.empty_cache() is a no-op on Apple Silicon MPS, leaving the MPS allocator pool occupied after each module fixture tears down. The next module then loads a fresh model into an already-pressured pool, causing the process RSS to grow unboundedly across modules. Both calls are now guarded so CUDA and MPS runs each get the correct flush. * fix: load LocalHFBackend model in config dtype to prevent float32 upcasting AutoModelForCausalLM.from_pretrained without torch_dtype may load weights in float32 on CPU before moving to MPS/CUDA, doubling peak memory briefly and leaving float32 remnants in the allocator pool. torch_dtype="auto" respects the model config (bfloat16 for Granite) for both the CPU load and the device transfer. 
* test: remove --isolate-heavy process isolation and bump intrinsic VRAM gates - Remove --isolate-heavy flag, _run_heavy_modules_isolated(), pytest_collection_finish(), and require_gpu_isolation() predicate — superseded by cleanup_gpu_backend() from PR #721 - Remove dead requires_gpu/requires_api_key branches from docs/examples/conftest.py - Bump min_vram_gb from 8 → 12 on test_guardian, test_core, test_rag, test_spans — correct gate for 3B base model (6 GB) + adapters + inference overhead; 8 GB was wrong and masked by the now-fixed MPS pool leak - Add adapter accumulation signals to audit-markers skill - Update AGENTS.md, test/README.md, MARKERS_GUIDE.md to remove --isolate-heavy references * test: migrate legacy markers in test_intrinsics_formatters.py Replace deprecated @pytest.mark.llm, @pytest.mark.requires_gpu, @pytest.mark.requires_heavy_ram, @pytest.mark.requires_gpu_isolation with @pytest.mark.e2e and @require_gpu(min_vram_gb=12) to align with the new marker taxonomy (#727/#728). VRAM gate set to 12 GB matching the 3B-parameter model loaded across the parametrized test cases. 
* test: add integration marker to test_dependency_isolation.py * docs: document OLLAMA_KEEP_ALIVE=1m as memory optimisation for unordered test runs * fix: suppress mypy name-defined for torch.Tensor after importorskip change * fix: ruff format huggingface.py from_pretrained args * fix: ruff format test_watsonx.py and test_huggingface_tools.py * refactor: remove requires_gpu, requires_heavy_ram, requires_gpu_isolation markers and handlers * refactor: remove --ignore-*-check override flags from conftest * refactor: remove requires_api_key marker; fix api backend group to match watsonx+bedrock markers * fix: address review Signed-off-by: Alex Bozarth <ajbozart@us.ibm.com> * test: mark test_image_block_in_instruction as qualitative * chore: commit .claude/settings.json with skillLocations for skill discovery * docs: broaden audit-markers skill description to cover diagnostic use cases * docs: add diagnostic mode to audit-markers skill for troubleshooting skip/resource issues --------- Signed-off-by: Alex Bozarth <ajbozart@us.ibm.com> Co-authored-by: Alex Bozarth <ajbozart@us.ibm.com> * clean: Readme --------- Signed-off-by: Alex Bozarth <ajbozart@us.ibm.com> Co-authored-by: csbobby <phdbobbywu.cs@gmail.com> Co-authored-by: Nigel Jones <jonesn@uk.ibm.com> Co-authored-by: Alex Bozarth <ajbozart@us.ibm.com>
|
thanks for merging |



Marker Taxonomy & Agent Skills
Type of PR
Description
Fixes #727, #728
Introduces a four-tier test marker taxonomy (unit/integration/e2e/qualitative), an agent skill to audit and fix markers, and applies the resulting reclassifications across the test suite. Also removes the legacy --isolate-heavy process isolation mechanism (superseded by cleanup_gpu_backend() from #721).

How we define the tiers

Tiers: unit, integration, e2e, qualitative.
unit is never written explicitly: conftest applies it automatically to any test that carries none of the other three.

New agent skills (.agents/skills/)

Two skills following the agentskills.io standard, discoverable by Claude Code, VS Code/Copilot, and IBM Bob:
- /audit-markers: classifies any test as unit/integration/e2e/qualitative using signal detection (imports, fixtures, assertion patterns, decorator shapes). Traces model identifiers to estimate min_vram_gb from parameter counts. Report-first by default; --apply skips confirmation.
- /skill-author: meta-skill for creating new skills with correct frontmatter and structure.

pytest infrastructure changes

- BACKEND_MARKERS registry in conftest.py: single source of truth for all 7 backend markers; pytest_configure registers them automatically. New backends need one dict entry.
- unit auto-apply hook: pytest_collection_modifyitems applies unit to any collected test that has none of integration, e2e, qualitative, llm. Enables pytest -m unit.
- Removed --isolate-heavy and all associated code (_run_heavy_modules_isolated(), pytest_collection_finish(), require_gpu_isolation()). The cleanup_gpu_backend() helper from #721 (ci: memory management in tests) handles GPU memory teardown; --group-by-backend handles ordering.
- torch_dtype="auto" on model load: LocalHFBackend.from_pretrained now passes torch_dtype="auto" to AutoModelForCausalLM.from_pretrained, preventing silent float32 upcasting on CPU during model load. On MPS/CUDA this halves memory use for bfloat16/float16 models.
- _gpu_vram_gb() on Apple Silicon now uses sysctl hw.memsize with a conservative heuristic (min(total * 0.75, total - 16 GB)) instead of returning 0, which leaves headroom for the OS and desktop apps.
- get_system_capabilities() cached: avoids repeated torch/MPS calls during collection.
- --ignore-gpu-check, --ignore-ollama-check, --ignore-api-key-check, --ignore-all-checks removed: unused escape hatches; skips are now unconditional when a capability is missing.
- require_ollama() removed: redundant with the ollama backend marker + conftest auto-skip.
- llm marker deprecated: treated as a synonym for e2e for backwards compatibility; 0 remaining uses in test/ or docs/examples/.

Test reclassifications
All changes are marker-only — no test logic was modified.
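The tier rule these marker-only changes rely on (a test carrying none of integration, e2e, qualitative, or the deprecated llm marker is treated as unit) can be sketched roughly as follows. This is an illustrative sketch, not the project's actual conftest.py: TIER_MARKERS and needs_unit_marker are hypothetical names, and only the hook name pytest_collection_modifyitems is taken from the description above.

```python
# Hypothetical sketch of the "unit" auto-apply rule; not the project's conftest.py.

# Markers that opt a test out of the implicit unit tier.
# "llm" is included because it is described as a deprecated synonym for e2e.
TIER_MARKERS = frozenset({"integration", "e2e", "qualitative", "llm"})


def needs_unit_marker(marker_names):
    """Return True when none of the explicit tier markers are present."""
    return TIER_MARKERS.isdisjoint(marker_names)


def pytest_collection_modifyitems(config, items):
    """pytest hook: tag every collected test that has no explicit tier as unit."""
    for item in items:
        names = {mark.name for mark in item.iter_markers()}
        if needs_unit_marker(names):
            # add_marker accepts a marker name; "unit" is assumed to be registered.
            item.add_marker("unit")
```

Keeping the decision in a small pure function makes the rule easy to check in isolation; the hook itself only wires it into collection so that pytest -m unit selects the untagged tests.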
New integration tests (were unmarked/unit):
- test/cli/test_alora_train.py
- test/telemetry/test_metrics.py: InMemoryMetricReader, asserts SDK attribute names
- test/telemetry/test_tracing.py: InMemorySpanExporter, asserts span structure
- test/telemetry/test_metrics_token.py
- test/telemetry/test_metrics_plugins.py
- test/package/test_dependency_isolation.py: uv subprocesses, controls its own deps
- test/plugins/, test/core/, test/stdlib/

e2e marker additions/corrections:
- test/backends/test_bedrock.py: bedrock backend marker; registered in conftest/pyproject
- test/telemetry/test_metrics_backend.py: e2e (had backend markers but no tier)
- test/formatters/granite/test_intrinsics_formatters.py: llm/requires_gpu/requires_heavy_ram/requires_gpu_isolation replaced with e2e + require_gpu(min_vram_gb=12)

VRAM gates updated (8 GB → 12 GB) in test_guardian.py, test_core.py, test_rag.py, test_spans.py. The /audit-markers skill estimates min_vram_gb by tracing model identifiers to parameter counts; test authors can override the estimate directly on the require_gpu() call.

Docs updated
- test/MARKERS_GUIDE.md: full rewrite with tier definitions, backend marker table, resource predicate reference, auto-skip logic, and common patterns
- test/README.md: updated env var table; added OLLAMA_KEEP_ALIVE=1m tip for unordered runs
- AGENTS.md / CONTRIBUTING.md: removed --isolate-heavy references; added skills discovery table

Local test run (Mac M1, 32 GB)
Full run (uv run pytest): 800 passed, 2 failed, 61 skipped, 19 deselected in 17m23s.

The 2 failures are @pytest.mark.qualitative tests (test_find_context_attributions, test_hallucination_detection): non-deterministic content assertions that can vary between runs; not related to this PR.

The 19 deselected are slow tests excluded by default (-m "not slow" in addopts).

Skips breakdown (61 total, all expected):
- test_huggingface.py, test_alora_train_integration.py, test_richdocument.py
- test_watsonx.py, test_litellm_watsonx.py, test_bedrock.py, test_watsonx_token_metrics
- test_openai_vllm.py, test_vllm_tools.py
- test_tracing_backend.py: telemetry not initialised
- test_manager.py: requires --disable-default-mellea-plugins flag
- test_reqlib_python.py sandbox tests

Slow tests run explicitly (uv run pytest -m slow):
- test_dependency_isolation.py
- generative_gsm8k.py
- mini_researcher/researcher.py
- python_decompose_example.py

Issues raised during testing
- test_tracing_backend.py tests always skip (Telemetry not initialized): the root cause is that _tracer_provider is set at module import time, so MonkeyPatch.setenv has no effect. Flagged for @ajbozarth.
- python_decompose_example.py raises KeyError in finalize_result: constraint strings from two separate model calls don't match exactly. Flagged for @AngeloDanducci.

Testing
- pytest --collect-only: collection unchanged
- ruff format and ruff check pass
- codespell and markdownlint pass

Cluster test run (IBM BLUEVELA LSF, Linux / Python 3.12.13, p-series GPU node)
Full run using test/scripts/run-all (starts Ollama, pulls models, warms up, then pytest --group-by-backend): 832 passed, 1 failed, 37 skipped, 19 deselected, 2 xfailed, 1 xpassed in 45m18s (job 737802).
The 1 failure is test_find_context_attributions (@pytest.mark.qualitative): the same non-deterministic content assertion flake as seen in local runs; not related to this PR.

The 1 xpassed is a bonus: a test marked xfail that unexpectedly passed.

Skips breakdown (37 total, all expected):
- skipif not OTEL_AVAILABLE
- @pytest.mark.skip (test_richdocument: memory)

Compared to a run without the startup script (job 737413): skips dropped from 142 → 37 once Ollama was running and models were warmed up.
Test run summary across environments
Runs compared:
- uv run pytest
- uv run pytest, no services
- test/scripts/run-all

19 deselected = slow tests excluded by default in pyproject.toml across all runs.

Skip reduction (142 → 37) is ~95 Ollama-dependent tests that become runnable once the startup script brings services up.

The LSF script run passes ~30 more tests than local because vLLM is available on the cluster.
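The resource-gating arithmetic referenced in this PR (the bf16 estimate of params_B × 2 × 1.2 rounded up to the next even GB, and the Apple Silicon heuristic min(total * 0.75, total - 16 GB)) can be worked through with a small sketch. The function names below are hypothetical and are not the helpers in test/predicates.py; clamping the Apple Silicon result at zero is an added assumption for machines with less than 16 GB. Some gates in the PR (for example the 12 GB intrinsic gates and the 4 GB Qwen3-0.6B gate) are deliberately set above what this formula alone gives.

```python
import math


def estimate_min_vram_gb(params_billions: float) -> int:
    """bf16 inference estimate: params (B) x 2 bytes x 1.2 overhead,
    rounded up to the next even GB."""
    raw_gb = params_billions * 2 * 1.2
    return 2 * math.ceil(raw_gb / 2)


def apple_silicon_gpu_budget_gb(total_ram_gb: float) -> float:
    """Conservative GPU memory budget on unified-memory Macs:
    min(75% of RAM, RAM - 16 GB). The floor at 0 is an assumption
    added here, not something stated in the PR."""
    return max(0.0, min(total_ram_gb * 0.75, total_ram_gb - 16))


if __name__ == "__main__":
    for name, size_b in [("8B model", 8), ("7B model", 7), ("3B model", 3)]:
        print(f"{name}: {estimate_min_vram_gb(size_b)} GB")
    print(f"32 GB Mac budget: {apple_silicon_gpu_budget_gb(32)} GB")
```

Under this sketch a 32 GB Mac gets a 16 GB budget, which is consistent with the local run above: tests gated at 12 GB run, while test_alora_train_integration.py (gated at 20 GB) appears in the skip list.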