Summary
test_run_transformers[context_relevance_alora] and test_run_transformers[uncertainty_alora] fail on CPU-only hardware (confirmed on macOS, upstream/main at fe7ce3c7). The other two aLoRA formatter tests (answerability_answerable_alora, requirement_check_alora) pass on the same hardware.
The test file carries an explicit comment: "THIS TEST DOES NOT REQUIRE A GPU." These failures are therefore a bug, not a deliberate skip.
Observed failures
context_relevance_alora
json.decoder.JSONDecodeError: Unterminated string starting at: line 1 column 48 (char 47)
The model output is truncated mid-JSON; generation stops before the "response" string is closed:
{ "context_relevance": "relevant", "response": "As of my last update in October
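The decoder error can be reproduced from the truncated output alone, with no model involved. A minimal sketch using only the standard library and the output fragment quoted in this report (the variable names are illustrative):

```python
import json

# Truncated output as reported: the "response" string is never closed.
truncated = '{ "context_relevance": "relevant", "response": "As of my last update in October'

try:
    json.loads(truncated)
    error = None
except json.JSONDecodeError as exc:
    error = exc

# The reported position (char 47) is the opening quote of the unterminated value.
print(error)
```

This suggests the parse failure is purely a symptom of generation stopping early, not of a malformed schema.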
uncertainty_alora
AssertionError: assert {'certainty': 0.061...} == approx({'certainty': 0.829 ± 0.1})
Max absolute difference: 0.769
The model returns a score of "0" rather than the expected value.
Root cause
peft emits the following warning for all aLoRA tests on CPU:
UserWarning: Cannot calculate aLoRA offsets during generate as input_ids are not available. Disabling aLoRA.
When aLoRA offsets cannot be computed, peft disables aLoRA and falls back to standard LoRA; the warning above is the only signal that this has happened. For answerability and requirement_check this degraded output is still close enough to pass the assertions; context_relevance and uncertainty are sensitive enough to the missing prefix optimisation that they produce wrong or truncated output.
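Because the fallback only surfaces as a warning, it may help to promote that warning to an error while debugging, so that every affected test fails at the point where aLoRA is disabled. A minimal sketch using the standard warnings module; run_strict and fake_generate are hypothetical names, and fake_generate merely stands in for the real generate call by emitting the message quoted above:

```python
import warnings

# Prefix of the warning text quoted in this report.
ALORA_WARNING = "Cannot calculate aLoRA offsets"


def run_strict(fn):
    """Call fn() with the aLoRA-disabled UserWarning promoted to an exception."""
    with warnings.catch_warnings():
        warnings.filterwarnings("error", message=ALORA_WARNING, category=UserWarning)
        return fn()


def fake_generate():
    # Hypothetical stand-in for the real generation call.
    warnings.warn(
        "Cannot calculate aLoRA offsets during generate as input_ids "
        "are not available. Disabling aLoRA.",
        UserWarning,
    )
    return "degraded output"
```

Wrapping the real call in run_strict would make the two currently passing aLoRA tests fail as well, which is one way to confirm that they are also running without aLoRA.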
Why input_ids are unavailable
_generate_from_intrinsic in mellea/backends/huggingface.py calls generate_with_transformers (from granite_formatters) with a generate_input dict that includes input_tokens. However, peft's aLoRA hook fires inside model.generate(), at the point where forward is called. If generate_with_transformers passes the inputs as inputs_embeds rather than input_ids at that stage, peft cannot recover the token IDs it needs to compute prefix offsets, even though the token IDs were available earlier in the pipeline.
The input_tokens are already on generate_input and also stored on output._process / output._post_process via input_ids=generate_input["input_tokens"]. The gap is whether they reach peft's hook inside model.generate().
Possible fix
Investigate whether generate_with_transformers (or the call site in _generate_from_intrinsic) can be modified to pass input_ids into the model.generate() call alongside any inputs_embeds, so peft's aLoRA hook has what it needs on CPU. Alternatively, if generate_with_transformers intentionally embeds first, the backend may need to retain input_ids and forward them separately via generate()'s input_ids kwarg.
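The second option could look roughly like the sketch below. It is only an illustration of the idea: with_input_ids is a hypothetical helper, the only key taken from this report is input_tokens, and plain lists stand in for tensors. Whether generate_with_transformers would accept or forward these kwargs has not been verified.

```python
def with_input_ids(generate_input: dict) -> dict:
    """Return a copy of generate_input that always carries input_ids.

    Hypothetical helper: the real dict layout beyond "input_tokens" is unknown.
    """
    kwargs = dict(generate_input)
    # Keep the token IDs under the name peft's hook looks for, without
    # overwriting an input_ids entry that is already present.
    kwargs.setdefault("input_ids", generate_input["input_tokens"])
    return kwargs


# Plain lists stand in for tensors in this sketch.
example = {"input_tokens": [[1, 2, 3]], "inputs_embeds": [[[0.1], [0.2], [0.3]]]}
patched = with_input_ids(example)
```

The original dict is left unmodified, so the existing uses of generate_input["input_tokens"] on output._process and output._post_process would not be affected.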
Reproduction
uv run pytest \
"test/formatters/granite/test_intrinsics_formatters.py::test_run_transformers[context_relevance_alora]" \
"test/formatters/granite/test_intrinsics_formatters.py::test_run_transformers[uncertainty_alora]" \
--tb=short -q
Both tests fail on macOS (CPU); they are expected to pass per the test file's own annotation.