test: GPU score non-determinism causes flaky failures in context-attribution, requirement_check_alora, groundedness_e2e

## Summary

Three intrinsic tests fail intermittently on GPU hardware. Pass rates on unmodified `upstream/main` (`317d5d9`) across 10 independent runs:

| Test | Pass rate (main) | Pass rate (branch PR #1269) |
|------|:----------------:|:---------------------------:|
| `test_run_transformers[context-attribution]` | 1/10 | 5/10 |
| `test_run_transformers[requirement_check_alora]` | 1/10 | 1/10 |
| `test_groundedness_e2e_string_documents` | 2/10 | 4/10 |

Note: `context_relevance_alora` and `uncertainty_alora` (0/10 on all hardware) are tracked separately in #1286 — those are a deterministic peft bug, not flakiness.

## Observed failure modes

### `test_run_transformers[context-attribution]`

The model returns a variable number of attribution spans. The test asserts an exact list length:

```
AssertionError: assert [{'attribution'...}, ...] == approx([...])
  Impossible to compare lists with different sizes.
  Lengths: 7 and 12
```

Captured model output on a failing run:
```json
[{"r": 0, "c": [0, 19]}, {"r": 1, "c": [2, 0, 1, 19, 3, 72, 70, 71, 4, 21]}]
```
On a passing run the model identifies the expected 12 citation span indices; on a failing run it produces only 7. The adapter function is non-deterministic in how many spans it attributes.

### `test_run_transformers[requirement_check_alora]`

Score falls just outside the ±0.1 tolerance window:

```
AssertionError: assert {'score': 0.3208213073183745} == approx({'score': 0.2185103906492881 ± 0.1})
  Max absolute difference: 0.1023109166690864
  Max relative difference: 0.4682199156071104
```

The obtained score (0.321) exceeds the expected (0.219 ± 0.1) by just 0.002 above the upper bound. The tolerance window is narrower than the natural GPU score variance for this adapter.

### `test_groundedness_e2e_string_documents`

On failing runs, the model incorrectly marks a clearly grounded sentence as `NOT_SUPPORTED`:

```
assert False is True
  where False = as_bool()

ValidationResult(False,
  reason='Response is not grounded - the following spans are not properly supported:\n\n'
         '- "The Eiffel Tower is located in Paris, France." [NOT_SUPPORTED]\n\n'
         'Summary: 0/1 spans needing citations are fully supported.',
  score=None).as_bool
```

The test provides the sentence `"The Eiffel Tower is located in Paris, France."` verbatim in the document list and asks the model to verify the response `"The Eiffel Tower is located in Paris, France."` — a textbook grounded response. On failing runs the adapter returns `NOT_SUPPORTED` for the span. The same input passes on 2 of 10 GPU runs, confirming this is non-determinism rather than a logic error.

## Likely causes

1. **GPU floating-point non-determinism**: cuBLAS/cuDNN operations are not guaranteed reproducible without `torch.use_deterministic_algorithms(True)` and fixed seeds.
2. **Assertion tolerances too tight**: ±0.1 score tolerance for `requirement_check_alora` is narrower than the observed GPU score variance. The test just barely tips over the boundary.
3. **Variable-length output**: `context-attribution` produces a non-deterministic number of spans; the test asserts exact list equality rather than checking that the expected spans are a subset of the returned spans.

## Possible fixes

- **`requirement_check_alora`**: widen the `abs` tolerance from 0.1 to 0.15, or verify the expected score was calibrated on GPU hardware
- **`context-attribution`**: assert subset membership or at-minimum span count rather than exact list equality
- **`groundedness_e2e`**: set a fixed `torch.manual_seed` in the test fixture, or run with `torch.backends.cudnn.deterministic = True`
- All three: consider `@pytest.mark.flaky(reruns=3)` via `pytest-rerunfailures` as a short-term mitigation

## Reproduction

```bash
# Requires GPU (min 8 GB VRAM for groundedness test)
for i in $(seq 1 10); do
  uv run pytest \
    'test/formatters/granite/test_intrinsics_formatters.py::test_run_transformers[context-attribution]' \
    'test/formatters/granite/test_intrinsics_formatters.py::test_run_transformers[requirement_check_alora]' \
    'test/stdlib/requirements/test_groundedness_requirement_e2e.py::test_groundedness_e2e_string_documents' \
    -q
done
# Expect roughly 1–5 passes across 10 runs for each test
```

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

test: GPU score non-determinism causes flaky failures in context-attribution, requirement_check_alora, groundedness_e2e #1291

Summary

Observed failure modes

`test_run_transformers[context-attribution]`

`test_run_transformers[requirement_check_alora]`

`test_groundedness_e2e_string_documents`

Likely causes

Possible fixes

Reproduction

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Test	Pass rate (main)	Pass rate (branch PR #1269)
`test_run_transformers[context-attribution]`	1/10	5/10
`test_run_transformers[requirement_check_alora]`	1/10	1/10
`test_groundedness_e2e_string_documents`	2/10	4/10

test: GPU score non-determinism causes flaky failures in context-attribution, requirement_check_alora, groundedness_e2e #1291

Description

Summary

Observed failure modes

test_run_transformers[context-attribution]

test_run_transformers[requirement_check_alora]

test_groundedness_e2e_string_documents

Likely causes

Possible fixes

Reproduction

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions

`test_run_transformers[context-attribution]`

`test_run_transformers[requirement_check_alora]`

`test_groundedness_e2e_string_documents`