Summary
Three intrinsic tests fail intermittently on GPU hardware. Pass rates on unmodified upstream/main (317d5d9) across 10 independent runs:
| Test | Pass rate (main) | Pass rate (branch PR #1269) |
| --- | --- | --- |
| test_run_transformers[context-attribution] | 1/10 | 5/10 |
| test_run_transformers[requirement_check_alora] | 1/10 | 1/10 |
| test_groundedness_e2e_string_documents | 2/10 | 4/10 |
Note: context_relevance_alora and uncertainty_alora (0/10 on all hardware) are tracked separately in #1286 — those are a deterministic peft bug, not flakiness.
Observed failure modes
test_run_transformers[context-attribution]
The model returns a variable number of attribution spans. The test asserts an exact list length:
AssertionError: assert [{'attribution'...}, ...] == approx([...])
Impossible to compare lists with different sizes.
Lengths: 7 and 12
Captured model output on a failing run:
[{"r": 0, "c": [0, 19]}, {"r": 1, "c": [2, 0, 1, 19, 3, 72, 70, 71, 4, 21]}]
On a passing run the model identifies the expected 12 citation span indices; on a failing run it produces only 7. The adapter function is non-deterministic in how many spans it attributes.
test_run_transformers[requirement_check_alora]
Score falls just outside the ±0.1 tolerance window:
AssertionError: assert {'score': 0.3208213073183745} == approx({'score': 0.2185103906492881 ± 0.1})
Max absolute difference: 0.1023109166690864
Max relative difference: 0.4682199156071104
The obtained score (0.321) lies about 0.002 above the upper bound of the accepted window (0.219 ± 0.1, i.e. at most 0.319). The tolerance window is narrower than the natural GPU score variance for this adapter.
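The margin can be checked directly from the two values in the assertion message. The following is a minimal sketch in plain Python; the helper name `overshoot` is illustrative and not part of the test suite.

```python
def overshoot(observed: float, expected: float, abs_tol: float) -> float:
    """Return how far |observed - expected| exceeds abs_tol.

    A positive result means the value is outside the window,
    a negative result means it is within tolerance.
    """
    return abs(observed - expected) - abs_tol


expected = 0.2185103906492881  # expected score from the assertion message
observed = 0.3208213073183745  # score obtained on the failing run

print(overshoot(observed, expected, 0.10))  # positive: outside the ±0.1 window
print(overshoot(observed, expected, 0.15))  # negative: inside a ±0.15 window
```

A single failing sample only shows that ±0.1 is too tight for this run; it does not establish that ±0.15 covers the full spread across GPUs.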
test_groundedness_e2e_string_documents
On failing runs, the model incorrectly marks a clearly grounded sentence as NOT_SUPPORTED:
assert False is True
where False = as_bool()
ValidationResult(False,
reason='Response is not grounded - the following spans are not properly supported:\n\n'
'- "The Eiffel Tower is located in Paris, France." [NOT_SUPPORTED]\n\n'
'Summary: 0/1 spans needing citations are fully supported.',
score=None).as_bool
The test provides the sentence "The Eiffel Tower is located in Paris, France." verbatim in the document list and asks the model to verify the response "The Eiffel Tower is located in Paris, France.", which is a textbook grounded response. On failing runs the adapter returns NOT_SUPPORTED for the span. The same input passes on 2 of 10 GPU runs, which points to non-determinism rather than a logic error.
Likely causes
- GPU floating-point non-determinism: cuBLAS/cuDNN operations are not guaranteed reproducible without torch.use_deterministic_algorithms(True) and fixed seeds.
- Assertion tolerances too tight: ±0.1 score tolerance for requirement_check_alora is narrower than the observed GPU score variance. The test just barely tips over the boundary.
- Variable-length output: context-attribution produces a non-deterministic number of spans; the test asserts exact list equality rather than checking that the expected spans are a subset of the returned spans.
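For the first cause, a seeding helper along the following lines could be called from a test fixture. This is a sketch, not code from the repository: `seed_everything` is a hypothetical name, the `torch` import is made optional only so the snippet also loads on machines without PyTorch, and whether these settings remove the flakiness for these adapters is untested.

```python
import os
import random

try:
    import torch  # optional here so the sketch imports on CPU-only machines
except ImportError:
    torch = None


def seed_everything(seed: int = 0) -> None:
    """Hypothetical helper: pin RNG seeds and request deterministic kernels."""
    # cuBLAS only behaves reproducibly (CUDA >= 10.2) if this is set
    # before its first use in the process.
    os.environ.setdefault("CUBLAS_WORKSPACE_CONFIG", ":4096:8")
    random.seed(seed)
    if torch is None:
        return
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    # warn_only=True logs a warning instead of raising for ops that
    # have no deterministic implementation.
    torch.use_deterministic_algorithms(True, warn_only=True)
```

Even with these settings, results are generally only reproducible on the same hardware and library versions, so expected values calibrated on one GPU may still drift on another.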
Possible fixes
- requirement_check_alora: widen the abs tolerance from 0.1 to 0.15, or verify the expected score was calibrated on GPU hardware
- context-attribution: assert subset membership or at-minimum span count rather than exact list equality
- groundedness_e2e: set a fixed torch.manual_seed in the test fixture, or run with torch.backends.cudnn.deterministic = True
- All three: consider @pytest.mark.flaky(reruns=3) via pytest-rerunfailures as a short-term mitigation
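The subset-membership idea for context-attribution might look like the sketch below. It assumes records shaped like the captured model output above (`{"r": ..., "c": [...]}`); the helper names and the `expected` values are hypothetical, and the real test compares a different post-processed structure, so this would need adapting.

```python
def citation_pairs(records):
    """Flatten records like {"r": 1, "c": [2, 0]} into a set of (r, c) pairs."""
    return {(rec["r"], idx) for rec in records for idx in rec["c"]}


def assert_expected_spans_present(returned, expected, min_count=0):
    """Pass if every expected pair was returned; extra pairs are allowed."""
    got = citation_pairs(returned)
    missing = citation_pairs(expected) - got
    assert not missing, f"missing expected spans: {sorted(missing)}"
    assert len(got) >= min_count, f"only {len(got)} spans, need {min_count}"


# Output captured on a failing run (from the report above).
returned = [{"r": 0, "c": [0, 19]},
            {"r": 1, "c": [2, 0, 1, 19, 3, 72, 70, 71, 4, 21]}]
# Hypothetical core spans that every run is expected to contain.
expected = [{"r": 0, "c": [0, 19]}, {"r": 1, "c": [2, 0, 1]}]

assert_expected_spans_present(returned, expected, min_count=5)
```

This trades strictness for stability: the test no longer detects spurious extra spans, so the required core set should be chosen from spans that appear on every observed run.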
Reproduction
# Requires GPU (min 8 GB VRAM for groundedness test)
for i in $(seq 1 10); do
  uv run pytest \
    'test/formatters/granite/test_intrinsics_formatters.py::test_run_transformers[context-attribution]' \
    'test/formatters/granite/test_intrinsics_formatters.py::test_run_transformers[requirement_check_alora]' \
    'test/stdlib/requirements/test_groundedness_requirement_e2e.py::test_groundedness_e2e_string_documents' \
    -q
done
# Expect roughly 1–5 passes across 10 runs for each test