Commit e6d8a03

Improved Testing Infrastructure (#29)
Layered test infrastructure: composable markers, API contract tests, CI tiers

Testing was insufficient for the growing number of benchmarks:

* Benchmark tests (Tau2, MACS) require downloaded data but had no dedicated tier or CI caching
* Existing mocks were shallow — no infrastructure to test the full adapter → SDK → HTTP chain
* No way to run live API round-trips in CI with proper credential gating
* No marker system to separate fast/offline from slow/network/credentialed tests

This PR introduces a layered testing strategy:

* Composable pytest markers (`live`, `credentialed`, `slow`, `smoke`) with `credentialed` → `live` implication
* Default `pytest` runs only fast, offline unit tests
* `slow` tier for data integrity tests (Tau2/MACS download pipelines, schemas, DB content) with CI caching
* `credentialed` tier for live API round-trips (OpenAI, Anthropic, Google GenAI, LiteLLM) behind GitHub Environment approval
* HTTP-level API contract tests via `respx` mocks — full adapter → SDK → HTTP → ChatResponse chain, no keys needed
* Skip decorators (`requires_openai`, `requires_anthropic`, `requires_google`) for key-dependent tests
* Updated coverage script with `--exclude` flag for marker-aware runs
* Fixed `DB.load()` / `DB.copy_deep()` return types in Tau2 (`"DB"` → `Self`)
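
The `credentialed` → `live` implication mentioned in the commit message could be implemented as a collection hook. The following is a minimal sketch under stated assumptions, not the code from this commit: `IMPLICATIONS` and `apply_marker_implications` are hypothetical names. Only `pytest_collection_modifyitems`, `get_closest_marker`, and `add_marker` are real pytest APIs.

```python
# conftest.py (sketch): make `credentialed` imply `live`.
# IMPLICATIONS and apply_marker_implications are illustrative names,
# not taken from the commit.

IMPLICATIONS = {"credentialed": ["live"]}


def apply_marker_implications(items):
    """Add every implied marker to items that carry the implying marker."""
    for item in items:
        for marker, implied in IMPLICATIONS.items():
            if item.get_closest_marker(marker) is None:
                continue
            for name in implied:
                if item.get_closest_marker(name) is None:
                    # pytest's Node.add_marker accepts a marker name as a string.
                    item.add_marker(name)


def pytest_collection_modifyitems(config, items):
    """pytest hook called with the full list of collected test items."""
    apply_marker_implications(items)
```

For `-m "not live"` to see the implied marker, the hook has to run before pytest's own marker deselection. A real implementation would likely decorate the hook with `@pytest.hookimpl(tryfirst=True)`, which is left out here so the sketch needs no imports.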
1 parent d611c2a commit e6d8a03

27 files changed

Lines changed: 2881 additions & 540 deletions

.github/workflows/test.yml

Lines changed: 49 additions & 0 deletions
@@ -72,6 +72,55 @@ jobs:
         run: |
           uv run pytest -v

+  test-slow:
+    name: Slow Tests (Data Downloads + Integrity)
+    runs-on: ubuntu-latest
+
+    steps:
+      - uses: actions/checkout@v3
+      - name: Set up Python 3.12
+        uses: actions/setup-python@v4
+        with:
+          python-version: "3.12"
+      - name: Install dependencies
+        run: |
+          pip install uv
+          uv sync --all-extras --group dev
+      - name: Cache benchmark data
+        uses: actions/cache@v4
+        with:
+          path: |
+            maseval/benchmark/tau2/data/
+            maseval/benchmark/macs/data/
+            maseval/benchmark/macs/prompt_templates/
+          key: benchmark-data-${{ hashFiles('maseval/benchmark/tau2/data_loader.py', 'maseval/benchmark/macs/data_loader.py') }}
+      - name: Run slow tests
+        run: |
+          uv run pytest -m "slow and not credentialed" -v
+
+  # test-credentialed:
+  #   name: Credentialed Tests (Live API)
+  #   runs-on: ubuntu-latest
+  #   environment: credentialed-tests
+
+  #   steps:
+  #     - uses: actions/checkout@v3
+  #     - name: Set up Python 3.12
+  #       uses: actions/setup-python@v4
+  #       with:
+  #         python-version: "3.12"
+  #     - name: Install dependencies
+  #       run: |
+  #         pip install uv
+  #         uv sync --all-extras --group dev
+  #     - name: Run credentialed tests
+  #       env:
+  #         OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
+  #         ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
+  #         GOOGLE_API_KEY: ${{ secrets.GOOGLE_API_KEY }}
+  #       run: |
+  #         uv run pytest -m "credentialed and not smoke" -v
+
   coverage:
     name: Coverage Report
     runs-on: ubuntu-latest
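
The cache key in the `test-slow` job depends only on the two `data_loader.py` files, so cached benchmark data is reused until the download logic itself is edited. The sketch below imitates that idea in Python. It is an illustration rather than GitHub's implementation: `files_digest` and `cache_key` are hypothetical helpers, and the digest that `hashFiles` actually produces will differ.

```python
import hashlib
from pathlib import Path
from typing import Iterable


def files_digest(paths: Iterable[Path]) -> str:
    """Return one SHA-256 hex digest covering the contents of all files."""
    combined = hashlib.sha256()
    for path in sorted(paths):
        # Hash each file separately, then fold its digest into the total.
        combined.update(hashlib.sha256(path.read_bytes()).digest())
    return combined.hexdigest()


def cache_key(prefix: str, paths: Iterable[Path]) -> str:
    """Build a key such as 'benchmark-data-<digest>'."""
    return f"{prefix}-{files_digest(paths)}"
```

Editing either loader file yields a new key and therefore a cache miss, which forces a fresh download. Changes to the downloaded data that are not reflected in the loaders would not invalidate the cache.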

.gitignore

Lines changed: 3 additions & 0 deletions
@@ -1,3 +1,6 @@
+# Subdirectories can have their own .gitignores.
+# E.g. check `maseval/benchmark/.../.gitignore`
+
 # Custom
 .idea/
 .DS_Store

AGENTS.md

Lines changed: 19 additions & 3 deletions
@@ -38,13 +38,16 @@ uv run ruff check . --fix

 ## Testing Instructions

-- Tests use pytest markers: `core`, `interface`, `smolagents`, `langgraph`, `contract`
+- Tests use composable pytest markers — see `tests/README.md` for full details
+  - **What it tests**: `core`, `interface`, `contract`, `benchmark`, `smolagents`, `langgraph`, `llamaindex`, `gaia2`, `camel`
+  - **What it needs**: `live` (network), `credentialed` (API keys), `slow` (>30s), `smoke` (full pipeline)
+  - Default `pytest` excludes `slow`, `credentialed`, and `smoke` via `addopts`
 - All tests must pass before PR merge
 - Add/update tests for code changes
-- Fix type errors and lint issues until suite is green
+- **Benchmark tests** follow a two-tier pattern (offline structural + live real-data). See `tests/README.md` for the recommended pattern when adding or modifying benchmark tests.

 ```bash
-# Run all tests
+# Default — fast tests only
 uv run pytest -v

 # Core tests only (minimal dependencies)
@@ -53,14 +56,27 @@ uv run pytest -m core -v
 # Specific integration tests
 uv run pytest -m smolagents -v
 uv run pytest -m interface -v
+
+# Data download validation (needs network)
+uv run pytest -m "live and slow" -v
+
+# Live API tests (needs OPENAI_API_KEY, ANTHROPIC_API_KEY, GOOGLE_API_KEY)
+uv run pytest -m credentialed -v
+
+# Fully offline
+uv run pytest -m "not live" -v
 ```

 ## Coverage

 View coverage by feature area (auto-discovers benchmarks/interfaces):

 ```bash
+# Full coverage (default + slow + live, excludes credentialed and smoke)
 uv run python scripts/coverage_by_feature.py
+
+# Fast-only (skip slow and live tests)
+uv run python scripts/coverage_by_feature.py --exclude slow,live
 ```

 Manual coverage for specific modules:

CHANGELOG.md

Lines changed: 17 additions & 0 deletions
@@ -47,6 +47,17 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 - Added `camel_role_playing_execution_loop()` for benchmarks using CAMEL's RolePlaying semantics (PR: #22)
 - Added `CamelRolePlayingTracer` and `CamelWorkforceTracer` for capturing orchestration-level traces from CAMEL's multi-agent systems (PR: #22)

+**Testing**
+
+- Composable pytest markers (`live`, `credentialed`, `slow`, `smoke`) for fine-grained test selection; default runs exclude slow, credentialed, and smoke tests (PR: #29)
+- Marker implication hook: `credentialed` implies `live`, so `-m "not live"` always gives a fully offline run (PR: #29)
+- Skip decorators (`requires_openai`, `requires_anthropic`, `requires_google`) for tests needing API keys (PR: #29)
+- Data integrity tests for Tau2 and MACS benchmarks validating download pipelines, file structures, and database content (PR: #29)
+- HTTP-level API contract tests for model adapters (OpenAI, Anthropic, Google GenAI, LiteLLM) using `respx` mocks — no API keys needed (PR: #29)
+- Live API round-trip tests for all model adapters (`-m credentialed`) (PR: #29)
+- CI jobs for slow tests (with benchmark data caching) and credentialed tests (behind GitHub Environment approval) (PR: #29)
+- Added `respx` dev dependency for HTTP-level mocking (PR: #29)
+
 ### Changed

 **Core**
@@ -72,8 +83,14 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 - `LangGraphUser` → `LangGraphLLMUser`
 - `LlamaIndexUser` → `LlamaIndexLLMUser`

+**Testing**
+
+- Coverage script (`scripts/coverage_by_feature.py`) now accepts `--exclude` flag to skip additional markers; always excludes `credentialed` and `smoke` by default (PR: #29)
+
 ### Fixed

+- Fixed incorrect return type annotations on `DB.load()` and `DB.copy_deep()` in Tau2 benchmark — now use `Self` instead of `"DB"`, so subclass methods return the correct type (PR: #29)
+
 ### Removed

 ## [0.3.0] - 2025-01-18

CONTRIBUTING.md

Lines changed: 5 additions & 2 deletions
@@ -211,11 +211,14 @@ When you open a Pull Request, a series of automated checks will run using **GitH
 The pipeline automatically performs the following tasks:

 - **Linting and Formatting**: Verifies that your code adheres to our style guide using `ruff`.
-- **Testing**: Runs the entire test suite across different Python versions and operating systems. This includes tests for both the core package and the optional integrations.
+- **Testing** (tiered):
+  - *Fast tests* (every PR, Python 3.10–3.14): core, benchmark, and all default-suite tests. No API keys needed.
+  - *Slow tests* (every PR, Python 3.12): data download and integrity validation.
+  - *Credentialed tests* (every PR, Python 3.12): live API tests. Requires maintainer approval to run — secrets are only exposed after approval.
 - **Type Checking**: Validates type annotations using `ty`.
 - **Documentation**: Ensures documentation builds without errors using `mkdocs`.

-**All checks must pass** before your Pull Request can be merged. You can view the progress and logs of these checks directly on your Pull Request page in GitHub.
+**All checks must pass** before your Pull Request can be merged. Contributors don't need API keys — the default and slow test suites run without them. See `tests/README.md` for how markers work and for the recommended benchmark testing pattern (offline structural tests vs. real-data tests).

 > **Note:** You don't need to run all these checks locally - CI will catch issues. However, running `uv run ruff format && uv run ruff check` before pushing can save you time.

maseval/benchmark/tau2/domains/base.py

Lines changed: 4 additions & 2 deletions
@@ -16,6 +16,8 @@
 from pathlib import Path
 from typing import Any, Callable, Dict, Generic, Optional, TypeVar, Union

+from typing_extensions import Self
+
 from pydantic import BaseModel, ConfigDict

 from maseval.benchmark.tau2.utils import get_pydantic_hash, load_file, update_pydantic_model_with_dict
@@ -37,7 +39,7 @@ class DB(BaseModel):
     model_config = ConfigDict(extra="forbid")  # Reject unknown fields

     @classmethod
-    def load(cls, path: Union[str, Path]) -> "DB":
+    def load(cls, path: Union[str, Path]) -> Self:
         """Load the database from a structured file (JSON, TOML, YAML).

         Args:
@@ -73,7 +75,7 @@ def get_statistics(self) -> Dict[str, Any]:
         """
         return {}

-    def copy_deep(self) -> "DB":
+    def copy_deep(self) -> Self:
         """Create a deep copy of the database.

         Returns:
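
To illustrate why the annotation change from `"DB"` to `Self` matters, here is a simplified sketch. It is not the repository's code: `DB` below is a plain class rather than a pydantic model, it reads JSON only, and `FlightDB` with its `flight_count` method is a hypothetical subclass invented for the example.

```python
import copy
import json
from pathlib import Path
from typing import TYPE_CHECKING, Union

if TYPE_CHECKING:
    # Only needed by type checkers; same import as in the diff.
    from typing_extensions import Self


class DB:
    """Simplified stand-in for the pydantic-based DB class (assumption)."""

    def __init__(self, data: dict) -> None:
        self.data = data

    @classmethod
    def load(cls, path: Union[str, Path]) -> "Self":
        # `cls` is the class the method was called on, so a subclass gets
        # an instance of itself; `Self` tells the type checker the same.
        return cls(json.loads(Path(path).read_text()))

    def copy_deep(self) -> "Self":
        return copy.deepcopy(self)


class FlightDB(DB):
    """Hypothetical subclass with an extra method."""

    def flight_count(self) -> int:
        return len(self.data.get("flights", []))
```

The runtime behavior is the same with either annotation. The difference shows up in static analysis: with `-> "DB"`, a type checker would reject `FlightDB.load(path).flight_count()` because the result is typed as the base class, while `Self` lets it infer `FlightDB`.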

pyproject.toml

Lines changed: 6 additions & 1 deletion
@@ -82,6 +82,7 @@ dev = [
     "ruff>=0.14.0",
     "ty>=0.0.5",
     "pre-commit>=4.0.0",
+    "respx>=0.22.0",
 ]

 # Documentation building - for contributors only
@@ -112,9 +113,13 @@ markers = [
     "llamaindex: Tests that specifically require llama-index-core",
     "gaia2: Tests that specifically require ARE (Agent Research Environments)",
     "camel: Tests that specifically require camel-ai",
+    "live: Tests requiring network access (downloads, external APIs)",
+    "credentialed: Tests requiring API keys (implies live, costs money)",
+    "slow: Tests taking >30 seconds (data downloads, large datasets)",
+    "smoke: Full end-to-end pipeline validation (pre-release only)",
 ]
 minversion = "6.0"
-addopts = "-ra -q"
+addopts = "-ra -q -m 'not (slow or credentialed or smoke)'"
 testpaths = ["tests"]

 [tool.coverage.run]
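
As a rough model of the new default, the marker expression in `addopts` can be read as a predicate over the set of markers on each test. The sketch below is an illustration only: `DEFAULT_EXCLUDED` and `runs_by_default` are hypothetical names, and pytest evaluates the expression itself rather than calling anything like this.

```python
from typing import Set

# Markers named in the addopts expression from the diff.
DEFAULT_EXCLUDED = {"slow", "credentialed", "smoke"}


def runs_by_default(markers: Set[str]) -> bool:
    """Mirror `-m 'not (slow or credentialed or smoke)'` for one test."""
    return not (markers & DEFAULT_EXCLUDED)


print(runs_by_default({"core"}))                   # True
print(runs_by_default({"benchmark", "slow"}))      # False
print(runs_by_default({"live", "credentialed"}))   # False
print(runs_by_default({"benchmark", "live"}))      # True
```

The last case is worth noting: under this expression a test marked only `live` is still selected, so a fully offline run needs the explicit `-m "not live"` shown in `AGENTS.md`.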

scripts/coverage_by_feature.py

Lines changed: 59 additions & 10 deletions
@@ -4,17 +4,31 @@
 Automatically discovers benchmarks and integrations from the codebase structure.
 Provides a high-level view of coverage by logical component rather than by file.

-Usage:
+By default, runs all tests except ``credentialed`` and ``smoke`` (i.e. includes
+``slow`` and ``live`` tests that don't need API keys). Use ``--exclude`` to
+skip additional markers.
+
+Usage::
+
+    # Full coverage (default + slow + live)
     uv run python scripts/coverage_by_feature.py
+
+    # Fast-only (skip slow and live tests)
+    uv run python scripts/coverage_by_feature.py --exclude slow,live
 """

+import argparse
 import json
 import subprocess
 import sys
 from pathlib import Path
 from typing import Dict, List, Set


+# Markers that are always excluded (need API keys or are pre-release only)
+ALWAYS_EXCLUDED = ["credentialed", "smoke"]
+
+
 def discover_benchmarks(maseval_dir: Path) -> List[str]:
     """Auto-discover benchmark implementations."""
     benchmark_dir = maseval_dir / "benchmark"
@@ -66,14 +80,30 @@ def discover_integrations(maseval_dir: Path) -> Dict[str, Dict[str, List[str]]]:
     return integrations


-def run_coverage() -> bool:
-    """Run pytest with coverage collection."""
-    print("Running tests with coverage...")
-    result = subprocess.run(
-        ["pytest", "--cov=maseval", "--cov-report=json", "--quiet"],
-        capture_output=True,
-        text=True,
-    )
+def build_marker_expression(exclude: List[str]) -> str:
+    """Build a pytest marker expression from the list of markers to exclude."""
+    all_excluded = ALWAYS_EXCLUDED + [m for m in exclude if m not in ALWAYS_EXCLUDED]
+    return "not (" + " or ".join(all_excluded) + ")"
+
+
+def run_coverage(marker_expr: str) -> bool:
+    """Run pytest with coverage collection.
+
+    Args:
+        marker_expr: Pytest marker expression (passed via -m).
+    """
+    cmd = [
+        "pytest",
+        "--override-ini=addopts=",
+        "-m",
+        marker_expr,
+        "--cov=maseval",
+        "--cov-report=json",
+        "--quiet",
+    ]
+
+    print(f"Running tests with coverage (-m '{marker_expr}') ...")
+    result = subprocess.run(cmd, capture_output=True, text=True)
     if result.returncode != 0:
         print("\nTests failed:")
         print(result.stdout)
@@ -133,13 +163,32 @@ def format_coverage(label: str, stats: Dict[str, float], indent: int = 0) -> str
     return f"{indent_str}{label:<30} {color}{percent:6.2f}%{reset} ({stats['covered']}/{stats['total']} lines)"


+def parse_args() -> argparse.Namespace:
+    """Parse command-line arguments."""
+    parser = argparse.ArgumentParser(
+        description="Generate test coverage report organized by feature area.",
+    )
+    parser.add_argument(
+        "--exclude",
+        type=str,
+        default="",
+        help="Comma-separated markers to exclude (e.g. 'slow,live'). 'credentialed' and 'smoke' are always excluded.",
+    )
+    return parser.parse_args()
+
+
 def main():
     """Generate coverage report by feature area."""
+    args = parse_args()
     repo_root = Path(__file__).parent.parent
     maseval_dir = repo_root / "maseval"

+    # Build marker expression
+    extra_excludes = [m.strip() for m in args.exclude.split(",") if m.strip()]
+    marker_expr = build_marker_expression(extra_excludes)
+
     # Run coverage
-    if not run_coverage():
+    if not run_coverage(marker_expr):
         print("\nTests failed. Coverage report may be incomplete.")

     print("\n" + "=" * 80)
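
To make the effect of `--exclude` concrete, the snippet below copies `ALWAYS_EXCLUDED` and `build_marker_expression` from the diff and runs them on a few inputs. The helper `split_exclude` is a hypothetical name that wraps the comma-splitting done inline in `main()`.

```python
from typing import List

# Copied from the diff above.
ALWAYS_EXCLUDED = ["credentialed", "smoke"]


def build_marker_expression(exclude: List[str]) -> str:
    """Build a pytest marker expression from the list of markers to exclude."""
    all_excluded = ALWAYS_EXCLUDED + [m for m in exclude if m not in ALWAYS_EXCLUDED]
    return "not (" + " or ".join(all_excluded) + ")"


def split_exclude(raw: str) -> List[str]:
    """Same comma-splitting as in main(); the helper name is illustrative."""
    return [m.strip() for m in raw.split(",") if m.strip()]


print(build_marker_expression(split_exclude("")))
# not (credentialed or smoke)
print(build_marker_expression(split_exclude("slow, live")))
# not (credentialed or smoke or slow or live)
print(build_marker_expression(split_exclude("smoke,slow")))
# not (credentialed or smoke or slow)
```

Because `run_coverage` also passes `--override-ini=addopts=`, the default `-m` filter from `pyproject.toml` is cleared and the generated expression is the only marker filter in effect. That is how the coverage run can include `slow` tests that a plain `pytest` invocation skips.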
