Commit f4c796c

Improved Testing Workflow for GHA (#31)
* updated testing workflow for GHA
1 parent e93a75c commit f4c796c

6 files changed

Lines changed: 114 additions & 27 deletions

.github/workflows/test.yml

Lines changed: 71 additions & 15 deletions
@@ -26,8 +26,14 @@ jobs:
           pip install uv
           uv sync --group dev
       - name: Run core tests
-        run: |
-          uv run pytest -m core -v
+        run: uv run coverage run --parallel-mode -m pytest -m core -v
+      - name: Upload coverage data
+        if: matrix.python-version == '3.12'
+        uses: actions/upload-artifact@v4
+        with:
+          name: coverage-core
+          path: .coverage.*
+          include-hidden-files: true

   test-benchmark:
     name: Benchmark Tests
@@ -47,12 +53,17 @@ jobs:
           pip install uv
           uv sync --all-extras --group dev
       - name: Run benchmark tests
-        run: |
-          uv run pytest -m "benchmark and not (slow or live)" -v
+        run: uv run coverage run --parallel-mode -m pytest -m "benchmark and not (slow or live)" -v
+      - name: Upload coverage data
+        if: matrix.python-version == '3.12'
+        uses: actions/upload-artifact@v4
+        with:
+          name: coverage-benchmark
+          path: .coverage.*
+          include-hidden-files: true

-  test-all:
-    name: All Tests (With Optional Deps)
-    needs: [test-core, test-benchmark]
+  test-core-optional:
+    name: Core Tests (With Optional Deps)
     runs-on: ubuntu-latest
     strategy:
       matrix:
@@ -68,9 +79,42 @@ jobs:
         run: |
           pip install uv
           uv sync --all-extras --group dev
-      - name: Run all tests
+      - name: Run core tests with optional deps
+        run: uv run coverage run --parallel-mode -m pytest -m core -v
+      - name: Upload coverage data
+        if: matrix.python-version == '3.12'
+        uses: actions/upload-artifact@v4
+        with:
+          name: coverage-core-optional
+          path: .coverage.*
+          include-hidden-files: true
+
+  test-interface:
+    name: Interface Tests
+    runs-on: ubuntu-latest
+    strategy:
+      matrix:
+        python-version: ["3.10", "3.11", "3.12", "3.13", "3.14"]
+
+    steps:
+      - uses: actions/checkout@v3
+      - name: Set up Python ${{ matrix.python-version }}
+        uses: actions/setup-python@v4
+        with:
+          python-version: ${{ matrix.python-version }}
+      - name: Install all dependencies
         run: |
-          uv run pytest -v
+          pip install uv
+          uv sync --all-extras --group dev
+      - name: Run interface tests
+        run: uv run coverage run --parallel-mode -m pytest -m interface -v
+      - name: Upload coverage data
+        if: matrix.python-version == '3.12'
+        uses: actions/upload-artifact@v4
+        with:
+          name: coverage-interface
+          path: .coverage.*
+          include-hidden-files: true

   test-slow:
     name: Slow Tests (Data Downloads + Integrity)
@@ -86,9 +130,14 @@ jobs:
         run: |
           pip install uv
           uv sync --all-extras --group dev
-      - name: Run slow tests
-        run: |
-          uv run pytest -m "slow and not credentialed" -v
+      - name: Run slow and live tests
+        run: uv run coverage run --parallel-mode -m pytest -m "(slow or live) and not credentialed" -v
+      - name: Upload coverage data
+        uses: actions/upload-artifact@v4
+        with:
+          name: coverage-slow
+          path: .coverage.*
+          include-hidden-files: true

   # test-credentialed:
   # name: Credentialed Tests (Live API)
@@ -115,6 +164,7 @@ jobs:

   coverage:
     name: Coverage Report
+    needs: [test-core, test-benchmark, test-core-optional, test-interface, test-slow]
     runs-on: ubuntu-latest
     permissions:
       contents: write
@@ -131,11 +181,17 @@ jobs:
       - name: Install dependencies
         run: |
           pip install uv
-          uv sync --all-extras --all-groups
+          uv sync --group dev
+
+      - name: Download coverage data
+        uses: actions/download-artifact@v4
+        with:
+          pattern: coverage-*
+          merge-multiple: true

-      - name: Run tests with coverage
+      - name: Combine and report coverage
         run: |
-          uv run coverage run -m pytest
+          uv run coverage combine
           uv run coverage xml
           uv run coverage html
           uv run coverage report
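
The upload and download steps in this workflow fit together by artifact name. The sketch below is an illustration only, not part of the commit: it lists which matrix cells would upload an artifact and checks that every name matches the `coverage-*` download pattern. The per-job Python versions are taken from the README table in this commit, and the `JOBS` mapping and `uploaded_artifacts` helper are hypothetical names introduced here. The sketch assumes `test-slow` runs on 3.12 only, as the README states.

```python
from fnmatch import fnmatch

# Python versions per job, as listed in the README table of this commit.
MATRIX = ["3.10", "3.11", "3.12", "3.13", "3.14"]

# Hypothetical summary of the workflow: job -> (python versions, artifact name).
JOBS = {
    "test-core": (MATRIX, "coverage-core"),
    "test-benchmark": (MATRIX, "coverage-benchmark"),
    "test-core-optional": (MATRIX, "coverage-core-optional"),
    "test-interface": (MATRIX, "coverage-interface"),
    "test-slow": (["3.12"], "coverage-slow"),  # assumed single-version job
}

# Mirrors the `if: matrix.python-version == '3.12'` guard on the upload steps.
UPLOAD_VERSION = "3.12"


def uploaded_artifacts(jobs):
    """Return the artifact names uploaded across all matrix cells."""
    names = []
    for versions, artifact in jobs.values():
        for version in versions:
            if version == UPLOAD_VERSION:
                names.append(artifact)
    return names


artifacts = uploaded_artifacts(JOBS)

# With the guard, each job uploads exactly once, so names do not collide.
assert len(artifacts) == len(set(artifacts))

# The coverage job downloads everything matching `pattern: coverage-*`.
downloaded = [name for name in artifacts if fnmatch(name, "coverage-*")]
print(downloaded)
```

Without the version guard, each matrix job would try to upload the same artifact name five times. That is one plausible reason the guard is there, alongside keeping the combined report tied to a single Python version.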

codecov.yml

Lines changed: 10 additions & 0 deletions
@@ -0,0 +1,10 @@
+comment: false
+
+coverage:
+  status:
+    project:
+      default:
+        informational: true
+    patch:
+      default:
+        informational: true

pyproject.toml

Lines changed: 2 additions & 1 deletion
@@ -192,7 +192,7 @@ testpaths = ["tests"]
 [tool.coverage.run]
 relative_files = true
 source = ["maseval"]
-omit = ["*/tests/*", "*/examples/*", "*/__pycache__/*"]
+omit = ["*/tests/*", "*/examples/*", "*/__pycache__/*", "maseval/benchmark/multiagentbench/marble/**"]
 branch = true

 [tool.coverage.report]
@@ -204,6 +204,7 @@ exclude_lines = [
     "if __name__ == .__main__.:",
     "if TYPE_CHECKING:",
     "@abstractmethod",
+    "@overload",
 ]
 precision = 2
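
The new `"@overload"` entry can be illustrated with a small sketch. It is not code from the repository: `double` and `is_excluded` are hypothetical names, and `EXCLUDE_PATTERNS` holds only the entries visible in the hunk above (the full list in `pyproject.toml` may be longer). The sketch shows that overload stubs are never the body that runs, and roughly how a regex-based exclusion would match the decorator line. In coverage.py, matching a decorator line excludes the whole decorated function from the report.

```python
import re
from typing import Union, overload

# Entries visible in the hunk above; treated as regular expressions.
EXCLUDE_PATTERNS = [
    r"if __name__ == .__main__.:",
    r"if TYPE_CHECKING:",
    r"@abstractmethod",
    r"@overload",
]


@overload
def double(value: int) -> int:
    ...


@overload
def double(value: str) -> str:
    ...


def double(value: Union[int, str]) -> Union[int, str]:
    """Runtime implementation: the only body that callers ever execute."""
    return value * 2


def is_excluded(line: str) -> bool:
    """Simplified matching step: a line is excluded if any pattern is found in it."""
    return any(re.search(pattern, line) for pattern in EXCLUDE_PATTERNS)


print(double(21), double("ab"))           # 42 abab
print(is_excluded("    @overload"))       # True
print(is_excluded("def double(value):"))  # False
```

Because the stub bodies exist only for type checkers, counting them as uncovered would lower the reported percentage without pointing at any untested behavior.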

tests/README.md

Lines changed: 26 additions & 11 deletions
@@ -55,19 +55,30 @@ Defined in `pyproject.toml`:

 ## CI Pipeline

-Six jobs in `.github/workflows/test.yml`:
-
-| Job | Python | What it runs | Gate |
-| ----------------- | --------- | --------------------------------- | ---------------------- |
-| test-core | 3.10–3.14 | `-m core` ||
-| test-benchmark | 3.10–3.14 | `-m "benchmark and not (slow or live)"` ||
-| test-all | 3.10–3.14 | `pytest -v` (default filter) | After core + benchmark |
-| test-slow | 3.12 | `-m "slow and not credentialed"` ||
-| test-credentialed | 3.12 | `-m "credentialed and not smoke"` | Maintainer approval |
-| coverage | 3.12 | Default suite (fast) with coverage report ||
+Jobs in `.github/workflows/test.yml`. Each test job collects coverage data (from Python 3.12 only); the final coverage job merges them into one combined report.
+
+| Job | Python | What it runs | Gate |
+| ------------------ | --------- | -------------------------------------------- | ------------------- |
+| test-core | 3.10–3.14 | `-m core` (no optional deps) ||
+| test-benchmark | 3.10–3.14 | `-m "benchmark and not (slow or live)"` ||
+| test-core-optional | 3.10–3.14 | `-m core` (with optional deps) ||
+| test-interface | 3.10–3.14 | `-m interface` ||
+| test-slow | 3.12 | `-m "(slow or live) and not credentialed"` ||
+| test-credentialed | 3.12 | `-m "credentialed and not smoke"` (disabled) | Maintainer approval |
+| coverage | 3.12 | Combines coverage from all jobs above | After all test jobs |

 Contributors don't need API keys — the default suite and slow tests run without them.

+### Detecting orphaned tests
+
+Every test must carry at least one marker that maps to a CI job. To find tests that would be missed:
+
+```bash
+uv run pytest --collect-only -m "not (core or benchmark or interface or slow or live or credentialed or smoke)"
+```
+
+If this reports any collected tests, add the appropriate marker (usually `pytestmark = pytest.mark.core` or `pytest.mark.benchmark`) to the file.
+
 ## Test Organization

 ```
@@ -105,6 +116,7 @@ Benchmark tests follow a **two-tier pattern**:
 **Tier 1: Structural tests (offline, `benchmark` marker only)**

 Tests that work without downloaded data or network access:
+
 - Import protection: `maseval` runs without benchmark optional dependencies
 - Graceful errors: descriptive error when benchmark code is accessed without deps
 - Interface checks: class methods exist, types correct, invalid inputs rejected
@@ -113,6 +125,7 @@ Tests that work without downloaded data or network access:
 **Tier 2: Real data tests (`benchmark` + `live` markers)**

 Tests that download and use actual benchmark data:
+
 - Environment/tool tests: create real environments, execute tools on real databases
 - Data loading pipeline: `load_tasks`, `load_domain_config`, etc.
 - Data integrity validation (also marked `slow`): schema checks, minimum record counts, field structure
@@ -122,6 +135,7 @@ Tests that download and use actual benchmark data:
 Benchmarks use `ensure_data_exists()` to download data to the **package's default data directory** (not temp dirs). This function caches — it skips download if files already exist. A session-scoped pytest fixture (e.g., `ensure_tau2_data`, `ensure_macs_templates`) triggers the download once per test session.

 Tests that need real data should:
+
 1. Depend on the download fixture (`ensure_tau2_data`, `ensure_macs_templates`, etc.)
 2. Be marked `@pytest.mark.live`
 3. Use simple constructors — e.g., `Tau2Environment({"domain": "retail"})` — since data is already in the default location
@@ -131,6 +145,7 @@ Tests that don't need data (structural, mock-based) should NOT depend on the dow
 #### How to decide: mock or real data?

 This is a judgment call. As a guideline:
+
 - If the test validates **structure, types, or error handling** → Tier 1 (offline)
 - If the test operates on **real database records, files, or network resources** → Tier 2 (`live`)
 - Don't force synthetic fixtures where they add complexity without value. If something needs real data, test it with real data.
@@ -166,4 +181,4 @@ requires_openai = pytest.mark.skipif(

 ## Notes

-- Credentialed tests require maintainer approval via GitHub Environment. See `EXTENDEDTESTINGSTRATEGYPLAN.md` for details.
+- Credentialed tests require maintainer approval via GitHub Environment.
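
The marker filters in the README table and the orphaned-test check can be modelled in a few lines. This is a rough sketch, not pytest's own expression evaluator: each `-m` expression from the table is hand-translated into a predicate over a test's marker set, and `JOB_FILTERS`, `jobs_for`, and `is_orphan` are hypothetical names. The disabled `test-credentialed` job is deliberately left out.

```python
# Each enabled CI job, with its `-m` expression rewritten as a predicate.
JOB_FILTERS = {
    "test-core": lambda m: "core" in m,
    "test-benchmark": lambda m: "benchmark" in m and not ("slow" in m or "live" in m),
    "test-core-optional": lambda m: "core" in m,
    "test-interface": lambda m: "interface" in m,
    "test-slow": lambda m: ("slow" in m or "live" in m) and "credentialed" not in m,
}

# Markers named in the README's orphan-detection command.
KNOWN_MARKERS = {"core", "benchmark", "interface", "slow", "live", "credentialed", "smoke"}


def jobs_for(markers):
    """Return the enabled jobs whose filter selects a test with these markers."""
    return [job for job, selects in JOB_FILTERS.items() if selects(markers)]


def is_orphan(markers):
    """Mirror the README command: a test is an orphan if it has no known marker."""
    return not (set(markers) & KNOWN_MARKERS)


print(jobs_for({"core"}))               # ['test-core', 'test-core-optional']
print(jobs_for({"benchmark", "live"}))  # ['test-slow']
print(is_orphan(set()))                 # True
print(jobs_for({"credentialed"}), is_orphan({"credentialed"}))  # [] False
```

The last line shows a limit of the check under this model: a test marked only `credentialed` is not reported as an orphan, yet no enabled job selects it while `test-credentialed` stays disabled.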

tests/test_core/test_callbacks/test_progress_bar.py

Lines changed: 2 additions & 0 deletions
@@ -15,6 +15,8 @@
 )
 from maseval.core.task import Task

+pytestmark = pytest.mark.core
+

 @pytest.fixture
 def mock_benchmark():

tests/test_core/test_exceptions.py

Lines changed: 3 additions & 0 deletions
@@ -7,6 +7,7 @@
 """

 import pytest
+
 from maseval import (
     TaskQueue,
     TaskExecutionStatus,
@@ -19,6 +20,8 @@
     validate_arguments_from_schema,
 )

+pytestmark = pytest.mark.core
+

 class TestExceptionClassification:
     """Tests for exception classification in benchmark execution."""
