test: GPU score non-determinism causes flaky failures in context-attribution, requirement_check_alora, groundedness_e2e #1291

@planetf1

Summary

Three intrinsic tests fail intermittently on GPU hardware. Pass rates on unmodified upstream/main (317d5d9) across 10 independent runs:

Test                                            | Pass rate (main) | Pass rate (branch PR #1269)
test_run_transformers[context-attribution]      | 1/10             | 5/10
test_run_transformers[requirement_check_alora]  | 1/10             | 1/10
test_groundedness_e2e_string_documents          | 2/10             | 4/10

Note: context_relevance_alora and uncertainty_alora (0/10 on all hardware) are tracked separately in #1286. Those failures come from a deterministic peft bug, not flakiness.

Observed failure modes

test_run_transformers[context-attribution]

The model returns a variable number of attribution spans. The test asserts an exact list length:

AssertionError: assert [{'attribution'...}, ...] == approx([...])
  Impossible to compare lists with different sizes.
  Lengths: 7 and 12

Captured model output on a failing run:

[{"r": 0, "c": [0, 19]}, {"r": 1, "c": [2, 0, 1, 19, 3, 72, 70, 71, 4, 21]}]

On a passing run the model identifies the expected 12 citation span indices; on a failing run it produces only 7. The adapter function is non-deterministic in how many spans it attributes.

test_run_transformers[requirement_check_alora]

Score falls just outside the ±0.1 tolerance window:

AssertionError: assert {'score': 0.3208213073183745} == approx({'score': 0.2185103906492881 ± 0.1})
  Max absolute difference: 0.1023109166690864
  Max relative difference: 0.4682199156071104

The obtained score (0.321) lies about 0.002 above the upper bound of the expected window (0.219 ± 0.1, i.e. at most 0.319). The tolerance window is narrower than the natural GPU score variance for this adapter.
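The boundary arithmetic can be checked directly. The sketch below reuses the two scores from the assertion message above and applies the plain absolute-tolerance rule that pytest.approx uses when only abs is given. The helper name within_abs is hypothetical, and 0.15 is the widened value suggested later in this issue, not a calibrated number.

```python
# Scores copied from the assertion message in this issue.
EXPECTED = 0.2185103906492881
OBTAINED = 0.3208213073183745


def within_abs(obtained: float, expected: float, abs_tol: float) -> bool:
    """Absolute-tolerance check: |obtained - expected| <= abs_tol."""
    return abs(obtained - expected) <= abs_tol


diff = abs(OBTAINED - EXPECTED)
print(f"absolute difference: {diff:.4f}")
print(f"overshoot past 0.1:  {diff - 0.1:.4f}")
print(f"passes at abs=0.10:  {within_abs(OBTAINED, EXPECTED, 0.10)}")
print(f"passes at abs=0.15:  {within_abs(OBTAINED, EXPECTED, 0.15)}")
```

A single failing sample only shows that 0.1 is too tight for this run; choosing a new tolerance would still need the score spread over several GPU runs.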

test_groundedness_e2e_string_documents

On failing runs, the model incorrectly marks a clearly grounded sentence as NOT_SUPPORTED:

assert False is True
  where False = as_bool()

ValidationResult(False,
  reason='Response is not grounded - the following spans are not properly supported:\n\n'
         '- "The Eiffel Tower is located in Paris, France." [NOT_SUPPORTED]\n\n'
         'Summary: 0/1 spans needing citations are fully supported.',
  score=None).as_bool

The test provides the sentence "The Eiffel Tower is located in Paris, France." verbatim in the document list and asks the model to verify the identical response, a textbook grounded case. On failing runs the adapter returns NOT_SUPPORTED for the span. The same input passes on 2 of 10 GPU runs on main, which points to non-determinism rather than a logic error.

Likely causes

  1. GPU floating-point non-determinism: cuBLAS/cuDNN operations are not guaranteed to be reproducible without torch.use_deterministic_algorithms(True) and fixed seeds.
  2. Assertion tolerances too tight: the ±0.1 score tolerance for requirement_check_alora is narrower than the observed GPU score variance. In the captured failure the score misses the window by only about 0.002.
  3. Variable-length output: context-attribution produces a non-deterministic number of spans; the test asserts exact list equality rather than checking that the expected spans are a subset of the returned spans.
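For cause 1, a minimal sketch of what pinning seeds and requesting deterministic kernels could look like in a test helper. The helper name enable_determinism and the default seed are hypothetical, and the torch import is guarded only so the sketch also runs where PyTorch is not installed. These settings target run-to-run reproducibility on one machine; PyTorch does not promise identical results across different GPUs or library versions, so they may not remove all of the variance described here.

```python
import os
import random


def enable_determinism(seed: int = 0) -> bool:
    """Seed RNGs and request deterministic kernels.

    Returns True if torch was found and configured, False otherwise.
    """
    # cuBLAS needs this set before the first CUDA call for deterministic
    # matmuls (CUDA 10.2+); setdefault keeps any value already chosen.
    os.environ.setdefault("CUBLAS_WORKSPACE_CONFIG", ":4096:8")
    random.seed(seed)
    try:
        import torch
    except ImportError:
        return False
    torch.manual_seed(seed)  # seeds the CPU and CUDA generators
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    # Ops without a deterministic implementation will raise at call time.
    torch.use_deterministic_algorithms(True)
    return True
```

Calling this at the start of a fixture, before any model is loaded, keeps the environment variable ahead of CUDA initialisation.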

Possible fixes

  • requirement_check_alora: widen the abs tolerance from 0.1 to 0.15, or verify the expected score was calibrated on GPU hardware
  • context-attribution: assert subset membership or a minimum span count rather than exact list equality
  • groundedness_e2e: call torch.manual_seed with a fixed seed in the test fixture, or run with torch.backends.cudnn.deterministic = True
  • All three: consider @pytest.mark.flaky(reruns=3) via pytest-rerunfailures as a short-term mitigation
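The context-attribution suggestion could be sketched as follows, assuming each result can be reduced to hashable (response index, context index) pairs. The helpers flatten_pairs and assert_expected_subset, the sample values, and the reading of the "r"/"c" keys in the captured output are all assumptions for illustration, not the test's actual data model.

```python
def flatten_pairs(records):
    """Turn [{"r": 1, "c": [2, 0]}] into {(1, 2), (1, 0)}."""
    return {(rec["r"], c) for rec in records for c in rec["c"]}


def assert_expected_subset(returned, expected, min_count=0):
    """Pass when every expected pair is present; extra pairs are tolerated."""
    got, want = flatten_pairs(returned), flatten_pairs(expected)
    missing = want - got
    assert not missing, f"missing expected attributions: {sorted(missing)}"
    assert len(got) >= min_count, f"only {len(got)} attributions, need {min_count}"


# Hypothetical expected core and a longer returned list.
expected = [{"r": 0, "c": [0, 19]}, {"r": 1, "c": [2, 0, 1]}]
returned = [{"r": 0, "c": [0, 19]}, {"r": 1, "c": [2, 0, 1, 19, 3]}]
assert_expected_subset(returned, expected, min_count=5)
```

This trades strictness for stability: extra spans no longer fail the test, so a regression that adds spurious attributions would go unnoticed unless an upper bound is checked as well.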

Reproduction

# Requires GPU (min 8 GB VRAM for groundedness test)
for i in $(seq 1 10); do
  uv run pytest \
    'test/formatters/granite/test_intrinsics_formatters.py::test_run_transformers[context-attribution]' \
    'test/formatters/granite/test_intrinsics_formatters.py::test_run_transformers[requirement_check_alora]' \
    'test/stdlib/requirements/test_groundedness_requirement_e2e.py::test_groundedness_e2e_string_documents' \
    -q
done
# Expect roughly 1–5 passes across 10 runs for each test
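The shell loop above runs the tests but does not tally the results. A small Python sketch for collecting per-test pass counts, assuming that uv run pytest exits with status 0 only when the selected test passes. The function name tally_passes and its injectable run parameter are hypothetical conveniences; the test IDs are the ones from the reproduction command.

```python
import subprocess

TESTS = [
    "test/formatters/granite/test_intrinsics_formatters.py::test_run_transformers[context-attribution]",
    "test/formatters/granite/test_intrinsics_formatters.py::test_run_transformers[requirement_check_alora]",
    "test/stdlib/requirements/test_groundedness_requirement_e2e.py::test_groundedness_e2e_string_documents",
]


def tally_passes(tests, runs=10, run=subprocess.run):
    """Run each test `runs` times in its own process and count zero exit codes."""
    passes = {test_id: 0 for test_id in tests}
    for _ in range(runs):
        for test_id in tests:
            # Each test gets its own pytest process, so one failure
            # does not mask the outcome of the others.
            result = run(["uv", "run", "pytest", test_id, "-q"], capture_output=True)
            if result.returncode == 0:
                passes[test_id] += 1
    return passes
```

On a GPU machine, iterating over tally_passes(TESTS).items() and printing each count next to its test ID gives a table comparable to the one in the summary.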

Metadata

Labels

area/adapter-functions, bug, p2, testing