Epic: Testing Infrastructure & Strategy Overhaul
Agreed outcomes from Discussion #711 and the 2026-03-23 planning call with @ajbozarth, @planetf1, @jakelorocco, and @avinash2692. cc @psschwei for further planning, @avinash2692 regarding Bluevela nightlies.
Key Decisions
Two-dimensional marker taxonomy — granularity (unit, integration, e2e, qualitative) x backend (ollama, huggingface, vllm, openai, watsonx, litellm, etc.), plus resource markers (requires_gpu, requires_heavy_ram, requires_gpu_isolation).
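As an illustration, the taxonomy could be registered with pytest along these lines. This is a minimal sketch, not the project's actual `conftest.py`: the marker names are taken from the list above, while the description strings, the constant names and the `marker_lines` helper are hypothetical. `pytest_configure` and `config.addinivalue_line("markers", ...)` are standard pytest hooks.

```python
# conftest.py (sketch): register the two-dimensional marker taxonomy so that
# `pytest --strict-markers` accepts every marker used in the suite.
# Marker names come from the taxonomy above; descriptions are placeholders.

GRANULARITY = ("unit", "integration", "e2e", "qualitative")
BACKENDS = ("ollama", "huggingface", "vllm", "openai", "watsonx", "litellm")
RESOURCES = ("requires_gpu", "requires_heavy_ram", "requires_gpu_isolation")


def marker_lines():
    """Return one 'name: description' line per marker, in a stable order."""
    lines = [f"{name}: granularity marker" for name in GRANULARITY]
    lines += [f"{name}: test needs the {name} backend" for name in BACKENDS]
    lines += [f"{name}: resource marker" for name in RESOURCES]
    return lines


def pytest_configure(config):
    """pytest hook: add each marker to the `markers` ini option."""
    for line in marker_lines():
        config.addinivalue_line("markers", line)
```

A test would then carry one marker from each dimension it needs, for example a granularity marker plus a backend marker, and resource markers only where the hardware requirement applies.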
| Tier | Trigger | Budget | What runs |
| --- | --- | --- | --- |
| Pre-commit | Every commit | <60s | Lint + type checking only |
| Local dev | Ad-hoc | <5 min | All tests matching available backends/resources |
| PR CI | Every push | <15 min | Unit + integration + Ollama e2e |
| Nightly CI | Scheduled | ~60 min | Every test, no exceptions (Bluevela, full GPU) |
| Pre-release | Manual | ~90 min | Manual trigger of nightly suite |
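The tiers above could translate into pytest `-m` marker expressions roughly as follows. This is a sketch under stated assumptions: `TIER_EXPR`, `local_dev_expr` and `pytest_args` are hypothetical names, and each expression is only one possible reading of the "What runs" column, not an agreed selection rule.

```python
# Sketch: map each tier to a pytest command line. Marker names come from the
# taxonomy in this issue; the tier keys and expressions are assumptions.

BACKENDS = ("ollama", "huggingface", "vllm", "openai", "watsonx", "litellm")
RESOURCES = ("requires_gpu", "requires_heavy_ram", "requires_gpu_isolation")

TIER_EXPR = {
    "pr_ci": "unit or integration or (e2e and ollama)",
    "nightly": "",      # empty expression: every test, no exceptions
    "pre_release": "",  # same suite as nightly, triggered manually
}


def local_dev_expr(available):
    """Deselect tests whose backend or resource marker is not available."""
    missing = [m for m in BACKENDS + RESOURCES if m not in available]
    return " and ".join(f"not {m}" for m in missing)


def pytest_args(tier, available=()):
    """Return the pytest argv for a tier, or None if the tier runs no tests."""
    if tier == "pre_commit":
        return None  # lint + type checking only
    if tier == "local_dev":
        expr = local_dev_expr(set(available))
    else:
        expr = TIER_EXPR[tier]
    return ["pytest"] + (["-m", expr] if expr else [])
```

For example, `pytest_args("pr_ci")` returns `["pytest", "-m", "unit or integration or (e2e and ollama)"]`, while `pytest_args("nightly")` returns plain `["pytest"]` so that nothing is deselected.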
Principles: split e2e into integration + e2e pairs (don't just downgrade); parametrise across backends; fix root causes over workarounds; catalog minimal default models with overrides; scope covers both tests and examples; docs updated with every change.
Work Items
| # | Issue | Summary |
| --- | --- | --- |
| 1a | #727 | Granularity marker taxonomy and tiered timeouts |
| 1b | #728 | Backend & resource marker audit (children: #622, #539, #629, #634) |
| 2a | #729 | Split e2e tests into integration + e2e pairs |
| 2b | #730 | Parametrise and consolidate backend-specific tests |
| 3a | #731 | Environment diagnostic, pre-flight checks & reporting (children: #574, #349) |
| 3b | #732 | Model consolidation and flexibility (children: #359) |
| 4 | #733 | CI parallelisation and dynamic test selection (see also #451) |
| 5 | #734 | On-demand nightly test runs for PRs |
| 6 | #735 | Semantic assertions & recording for qualitative tests (children: #692) |
| 7 | #736 | Backend resource cleanup post-PR #721 |
| 8 | #737 | Test results & coverage reporting |
| 9 | #738 | Notebook testing (children: #89) |
| 10 | #739 | Pre-commit & type checking (children: #456) |
| 11 | #813 | Test coverage improvement (children: #812) |
Related Issues
Expected to close with PR #721 (cleanup_gpu_backend()): #630, #625, #620, #699. Residual cleanup tracked in #736.
Flaky tests — addressed by #735 (semantic assertions): #398, #384, #628, #684, #121.
Not in scope: #691, #496, #347, #267 — remain standalone.