Development Guide

Distribution tiers

agentevals ships as two distribution variants from a single codebase:

| Tier | Install | What you get |
| --- | --- | --- |
| Core | `pip install agentevals` | CLI + REST API + live mode (WebSocket streaming, sessions, SSE) |
| Bundle | `pip install agentevals` (bundled wheel) | Everything in Core + embedded React UI |

Live mode (WebSocket streaming, session management, SSE) is always enabled when running `agentevals serve`. The `--dev` flag adds hot reload and dev-friendly console output but does not change which features are active.

The optional `[live]` extra (`pip install "agentevals[live]"`) adds `mcp` and `httpx`, which are only needed for the MCP server (`agentevals mcp`). The bundled wheel is built with `make build-bundle` and includes compiled UI assets baked into the package.

Makefile

Development

make dev-backend       # start FastAPI in live mode (port 8001), reload on source changes
make dev-frontend      # start Vite dev server (port 5173) with HMR
make dev-bundle        # build UI, serve full bundled experience at port 8001 via uv run

Standard development runs `dev-backend` and `dev-frontend` in separate terminals. The Vite dev server does not proxy API requests; the frontend calls the backend at http://localhost:8001 directly, as a cross-origin request permitted via CORS.

dev-bundle is useful for testing the bundled UI experience without building a wheel. It copies ui/dist into the source tree temporarily and cleans up when the server exits.
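The copy-then-clean-up behavior described above can be pictured as a context manager. This is only a Python sketch of the pattern; the real target lives in the Makefile, and the helper name `staged_ui` and its arguments are hypothetical.

```python
import shutil
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def staged_ui(ui_dist: Path, static_dir: Path):
    """Copy built UI assets into the package tree, then remove them on exit.

    Hypothetical helper: it illustrates the dev-bundle pattern, not the Makefile itself.
    """
    if static_dir.exists():
        raise FileExistsError(f"{static_dir} already exists; refusing to overwrite")
    shutil.copytree(ui_dist, static_dir)
    try:
        yield static_dir
    finally:
        # Runs even if the server raises or is interrupted with Ctrl-C.
        shutil.rmtree(static_dir, ignore_errors=True)
```

Wrapping the server run in `with staged_ui(...)` keeps the temporary assets out of the source tree once the server stops.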

Postgres backend (optional, for `/api/runs`)

Preview. The schema, the CLI surface, and the shape of `/api/runs` are still stabilizing. Until further notice, recreate the agentevals schema between minor version upgrades, and do not depend on persisted data surviving a `git pull` of agentevals itself.

The default in-memory backend keeps `make dev-backend` zero-config. To exercise the async run pipeline locally, bring up a Postgres instance alongside the app:

make pg-up             # start postgres:18.3-alpine in a docker container (port 5432, ephemeral via --rm)
make migrate           # apply the agentevals schema
make dev-backend-pg    # pg-up + migrate + serve --dev with backend=postgres wired up
make pg-down           # stop the container; data is discarded with --rm

Override the defaults with environment variables, for example `PG_PORT=5433 make pg-up`. The `migrate` target is idempotent: a second invocation is a no-op.

Once running, submit a run with:

curl -X POST http://localhost:8001/api/runs \
    -H 'content-type: application/json' \
    -d '{"spec": {"approach": "trace_replay", "target": {"kind": "inline", "inline": {...}}, "evalConfig": {"evaluators": [{"name": "tool_trajectory_avg_score", "type": "builtin"}]}}}'

Then poll `GET /api/runs/{runId}` and `GET /api/runs/{runId}/results`. Without `storage.backend=postgres`, the `/api/runs` endpoints return HTTP 503 with a hint pointing at the env var.
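Polling can be wrapped in a small helper. The sketch below is generic and makes no claim about the response shape of `/api/runs`: the caller supplies both the fetch function and the completion check, and the name `poll_until` is hypothetical.

```python
import time
from typing import Any, Callable


def poll_until(
    fetch: Callable[[], Any],
    is_done: Callable[[Any], bool],
    interval: float = 1.0,
    timeout: float = 60.0,
) -> Any:
    """Call fetch() repeatedly until is_done(result) is true or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        result = fetch()
        if is_done(result):
            return result
        # Stop before sleeping past the deadline.
        if time.monotonic() + interval > deadline:
            raise TimeoutError("run did not finish before the timeout")
        time.sleep(interval)
```

In practice `fetch` would issue `GET /api/runs/{runId}` and `is_done` would inspect whichever field the response uses to signal completion.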

Building

make build             # build core wheel → dist/agentevals-*.whl
make build-bundle      # build UI, embed into wheel, clean up → dist/agentevals-*.whl
make build-ui          # build React app only → ui/dist/

Both `build` and `build-bundle` produce `dist/agentevals-*.whl` with the same package name and version. The difference is that `build-bundle` embeds `ui/dist/` as `agentevals/_static/` inside the wheel. The hatchling `artifacts` config ensures the gitignored `_static/` directory is included.
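Because the two wheels share a filename, the only way to tell them apart is to look inside. A wheel is a zip archive, so a quick check for files under `agentevals/_static/` is enough. This is a minimal sketch; the helper name `wheel_has_ui` is hypothetical and not part of agentevals.

```python
import zipfile

STATIC_PREFIX = "agentevals/_static/"


def wheel_has_ui(wheel_path: str) -> bool:
    """Return True if the wheel archive contains any file under agentevals/_static/."""
    with zipfile.ZipFile(wheel_path) as wheel:
        return any(
            name.startswith(STATIC_PREFIX) and not name.endswith("/")
            for name in wheel.namelist()
        )
```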

Testing

make test              # run all tests (unit + integration, excludes e2e)
make test-unit         # unit tests only (fast, no server startup)
make test-integration  # integration tests — OTLP pipeline, session grouping, timing (no API keys)
make test-e2e          # E2E tests — real agents as subprocesses (requires OPENAI_API_KEY and/or GOOGLE_API_KEY)

Cleanup

make clean             # remove dist/, build/, ui/dist/, src/agentevals/_static/

Testing

Test tiers

Tests are organized into three tiers with different trade-offs:

| Tier | Location | Transport | API keys | What it verifies |
| --- | --- | --- | --- | --- |
| Unit | `tests/` (excluding `tests/integration/`) | `TestClient` / mocks | None | Business logic, route handlers, converters |
| Integration | `tests/integration/` | ASGI in-process | None | OTLP session grouping, timing, concurrent batches, eval pipeline |
| E2E | `tests/integration/test_live_agents.py` | Real uvicorn servers | `OPENAI_API_KEY`, `GOOGLE_API_KEY` | Full pipeline: real agent → OTLP export → session creation → invocation extraction → API visibility |

Integration tests use `httpx.ASGITransport` to hit the OTLP and streaming API routes in-process (no ports, no real HTTP). Timers are shortened (0.1 s grace, 0.5 s idle) so the tests stay quick and deterministic.
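To make "in-process" concrete: an ASGI app is just an async callable, so a test can invoke it directly instead of opening a socket. The sketch below uses only the standard library and a tiny stand-in app (hypothetical, not an agentevals route). The real tests rely on `httpx.ASGITransport`, which performs this kind of direct call and translates it to and from httpx requests and responses.

```python
import asyncio


async def app(scope, receive, send):
    """Tiny stand-in ASGI app (hypothetical; not an agentevals route)."""
    assert scope["type"] == "http"
    await send({
        "type": "http.response.start",
        "status": 200,
        "headers": [(b"content-type", b"text/plain")],
    })
    await send({"type": "http.response.body", "body": b"ok"})


async def call_in_process(asgi_app, method: str, path: str, body: bytes = b""):
    """Invoke an ASGI app directly: no socket, no port, no real HTTP."""
    scope = {
        "type": "http",
        "asgi": {"version": "3.0"},
        "http_version": "1.1",
        "method": method,
        "path": path,
        "raw_path": path.encode(),
        "query_string": b"",
        "headers": [],
    }
    messages = []

    async def receive():
        return {"type": "http.request", "body": body, "more_body": False}

    async def send(message):
        messages.append(message)

    await asgi_app(scope, receive, send)
    status = next(m["status"] for m in messages if m["type"] == "http.response.start")
    payload = b"".join(
        m.get("body", b"") for m in messages if m["type"] == "http.response.body"
    )
    return status, payload


status, payload = asyncio.run(call_in_process(app, "GET", "/"))
```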

E2E tests start real uvicorn servers on ephemeral ports in a background thread, then run example agent scripts as subprocesses that emit real OTLP traces with BatchSpanProcessor/BatchLogRecordProcessor flush timing.
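The ephemeral-port technique is worth seeing in isolation. Binding to port 0 asks the OS for any free port, and the server then runs in a daemon thread while the test talks to it. The sketch below uses the standard library's `http.server` as a stand-in for uvicorn; the handler and function names are hypothetical.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class _Health(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.end_headers()
        self.wfile.write(b"ok")

    def log_message(self, *args):
        # Keep test output quiet.
        pass


def start_background_server():
    """Bind to port 0 so the OS picks a free port, then serve from a daemon thread."""
    server = ThreadingHTTPServer(("127.0.0.1", 0), _Health)
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    return server, server.server_address[1]


server, port = start_background_server()
try:
    conn = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
    conn.request("GET", "/")
    body = conn.getresponse().read()
    conn.close()
finally:
    server.shutdown()
    server.server_close()
```

Reading the port back from the bound socket avoids hard-coded ports, so several test servers can run on the same machine without colliding.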

Running E2E tests

E2E tests require OPENAI_API_KEY (LangChain and Strands agents) and/or GOOGLE_API_KEY (ADK agents). Each test class is skipped automatically when its required key is not set.
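Skipping on a missing key can be expressed as a class decorator. The sketch below uses the standard library's `unittest` for illustration; the repository's own skip markers (such as `_skip_no_openai`) may be implemented differently, and the names `skip_without` and `TestExampleAgent` are hypothetical.

```python
import os
import unittest


def skip_without(*names: str):
    """Return a decorator that skips a test class unless every named env var is set."""
    missing = [name for name in names if not os.environ.get(name)]
    return unittest.skipIf(bool(missing), f"missing env: {', '.join(missing)}")


@skip_without("OPENAI_API_KEY")
class TestExampleAgent(unittest.TestCase):  # hypothetical test class
    def test_placeholder(self):
        self.assertTrue(True)
```

The environment is checked once, when the class is defined, so keys must be exported before the test module is imported.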

# Source your .env and run
set -a && source .env && set +a && make test-e2e

Adding tests for new examples

When adding a new example agent to examples/, add corresponding E2E tests to ensure the full OTLP pipeline works:

1. Add a test class in `tests/integration/test_live_agents.py` following the existing pattern (`TestLangchainZeroCode`, `TestStrandsZeroCode`, `TestAdkZeroCode`).
2. Each agent should have at minimum three tests:
   - Session creation: the agent runs successfully and a session is created with spans (and logs if applicable)
   - Invocation extraction: invocations are extracted with user/agent content
   - API visibility: the session appears in `GET /api/streaming/sessions`
3. Use `_run_agent()` to run the example as a subprocess with the test OTLP endpoint.
4. Use `wait_for_session_complete_sync()` to poll until the session finalizes.
5. Mark the test class with the appropriate skip condition (e.g., `_skip_no_openai`).
6. Use unique `session_name` values per test to avoid collisions within the session-scoped server fixture.
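One simple way to keep `session_name` values unique is to append a random suffix to a readable prefix. This is a sketch under the assumption that any string is accepted as a session name; the helper `unique_session_name` is hypothetical and not part of the test suite.

```python
import uuid


def unique_session_name(prefix: str) -> str:
    """Return a session name that is unique per call, e.g. 'langchain-3f9a1c2e'."""
    return f"{prefix}-{uuid.uuid4().hex[:8]}"
```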

Runtime behavior

The serve command always enables live mode (WebSocket, streaming, sessions). The flags control UI serving and reload behavior:

- `agentevals serve`: live mode + REST API; UI served if bundled `_static/` is present
- `agentevals serve --dev`: same as above + hot reload on source changes + dev console output
- `agentevals serve --headless`: live mode + REST API, UI suppressed even if bundled

This behavior is controlled by the environment variables `AGENTEVALS_LIVE=1` (always set by the CLI) and `AGENTEVALS_HEADLESS=1` (set when `--headless` is passed).
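The UI-serving rule above reduces to a single boolean decision. The following is a minimal sketch of that decision, not the application's actual code: the function name is hypothetical, and comparing the variable to the string `"1"` is an assumption about how the flag is read.

```python
import os
from typing import Mapping


def should_serve_ui(env: Mapping[str, str], static_dir_exists: bool) -> bool:
    """UI is served only when bundled assets exist and headless mode is off."""
    # Assumption: the flag counts as set only when its value is exactly "1".
    headless = env.get("AGENTEVALS_HEADLESS") == "1"
    return static_dir_exists and not headless


# Example: evaluate against the current process environment.
serve_ui = should_serve_ui(os.environ, static_dir_exists=False)
```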

NixOS / Nix devshell

The project provides a flake.nix devshell. Inside the Nix environment, agentevals in PATH points to the Nix store derivation (immutable). Use uv run to run from the live source tree:

uv run agentevals serve --dev    # live source, dev mode
make dev-bundle                   # live source, bundled UI test

To release a new Nix derivation, update flake.nix with the new version and rebuild.

Releasing

1. Bump the version in `pyproject.toml`.
2. Commit and push the change.
3. Tag and push; this triggers the release workflow automatically:
   ```
   git tag v0.1.0
   git push origin v0.1.0
   ```
4. Alternatively, trigger the workflow manually from GitHub → Actions → Release → Run workflow and enter the tag.

The workflow (.github/workflows/release.yml) runs make release, which builds the wheel twice (once without UI, once with embedded UI) into separate subdirectories:

dist/core/agentevals-<version>-py3-none-any.whl    # CLI + REST API + live mode
dist/bundle/agentevals-<version>-py3-none-any.whl  # everything in core + embedded UI

Both wheels use the same standard filename (valid per PEP 427). They are attached as separate release assets to the GitHub Release. Users download the appropriate wheel:

pip install agentevals-<version>-py3-none-any.whl

To also use the MCP server (agentevals mcp), install with the [live] extra:

pip install "agentevals-<version>-py3-none-any.whl[live]"
