Commit 544fa69

feat: 0.24.0 — DX cleanup, strict indices, error taxonomy (#47)
* feat: 0.24.0 — DX cleanup, framing, stability tags, biome lint

  DX-only release. No runtime behavior changed; positioning, contract clarity, and developer onboarding all upgraded to senior-staff bar.

  README reframed as the substrate for self-improving agents. The package has shipped EvalCampaign, replay, GEPA / reflective mutation, auto-research, active curriculum, contamination probes, tournaments, compute curves, PRM, off-policy estimators, and sequential anytime-valid stats since 0.22 — the README now actually names them.

  src/rl/index.ts carries @stable / @experimental JSDoc on every re-export. Stable: run-record-adapters, verifiable-reward, preferences, off-policy, tournament, contamination, compute-curves. Experimental: process-reward, adversarial, active-curriculum, reward-hacking, adaptation-eval, exporters, rl-campaign, predictive-validity-researcher, auto-research. Tags emit into dist/rl.d.ts so IDE hover surfaces stability at the call site.

  Added biome + format/lint scripts. biome.json codifies the project style (no semicolons, single quotes, 2-space indent, 100 col). Auto-format applied across src/. Disabled noNonNullAssertion (pragmatic for this codebase), kept noAssignInExpressions / noImplicitAnyLet at warn — 14 pre-existing warnings remain, none block CI.

  Added .github/workflows/ci.yml — typecheck + lint + test + build + Python pytest on every PR. Previously only publish-on-tag exercised this surface.

  Added ReplayCache.entries() — public iterator replacing the private byKey bracket-access escape hatch in iterateRawCalls.

  Added per-example READMEs for multi-shot-optimization and same-sandbox-harness.

  Added clients/python/examples/judge_anti_slop.py — runnable script doubling as pytest, anchoring the judge API contract (composite range, RubricNotFoundError, ValidationError).

  Fixed: reflective-mutation autoCloseTruncatedJson local `escape` shadowed the global; renamed to `escaped`.

  npm + PyPI version-locked at 0.24.0.

* feat(0.24.0): strict indices, error taxonomy, subpath-forced imports

  DX + correctness pass. No production behavior moved; consumer contracts tightened across the board.

  Strict indices. Flipped noUncheckedIndexedAccess: true. 251 latent T | undefined sites surfaced across ~70 files; all fixed with the right idiom — `!` for loop-bound or known-constant indices (honest), explicit guards for external lookups / Map.get / regex match groups, accumulator patterns refactored to capture-then-assign instead of double-read.

  Subpath imports forced. Deleted 6 leaky root wildcards (./rl, ./pipelines, ./builder-eval, ./meta-eval, ./prm, ./trace-analyst). Added 7 new subpaths in package.json + tsup.config.ts. Root re-exports retained only for the load-bearing capture-integrity surface (./trace, ./knowledge, ./governance).

  Error taxonomy. New src/errors.ts: AgentEvalError base + ValidationError, NotFoundError, ConfigError, CaptureIntegrityError, JudgeError, VerificationError, ReplayError. Re-parented 10 existing custom errors. Migrated ~25 user-facing throws to typed errors across rl/, replay, sandbox-harness, statistics, release-confidence, visual-diff, counterfactual, run-critic, observability. Internal invariant guards intentionally left as plain Error.

  LlmRouteAssertionError.code → reason. The route-specific reason moved off .code so it doesn't shadow the AgentEvalError category code (now 'capture_integrity'). Breaking, but greenfield.

  Gates: typecheck 0 errors, lint 0 errors (14 pre-existing warns), test 1019/1019, build clean, OpenAPI emits.

* chore: gitignore claude code runtime lock
1 parent 60c8cd3 commit 544fa69
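
The three strict-index idioms named in the commit message can be illustrated with a minimal TypeScript sketch. It assumes `noUncheckedIndexedAccess: true` is enabled; `sum`, `scoreFor`, `parseMajor`, and `bucketCounts` are hypothetical helpers written for illustration, not functions from this package.

```typescript
// Hypothetical helpers; none of these names come from the package itself.

// 1. Loop-bound index: i is always < xs.length, so `!` is an honest assertion.
function sum(xs: number[]): number {
  let total = 0
  for (let i = 0; i < xs.length; i++) {
    total += xs[i]!
  }
  return total
}

// 2. External lookups: Map.get and regex groups can be undefined, so guard.
function scoreFor(scores: Map<string, number>, id: string): number {
  const score = scores.get(id)
  if (score === undefined) {
    throw new Error(`no score recorded for ${id}`)
  }
  return score
}

function parseMajor(version: string): number | undefined {
  const match = /^(\d+)\./.exec(version)
  const major = match?.[1]
  return major === undefined ? undefined : Number(major)
}

// 3. Accumulator: capture the slot once, then assign, instead of reading twice.
function bucketCounts(bucketIds: number[], bucketTotal: number): number[] {
  const counts = new Array<number>(bucketTotal).fill(0)
  for (const id of bucketIds) {
    const previous = counts[id]
    if (previous === undefined) continue // out-of-range id from untrusted input
    counts[id] = previous + 1
  }
  return counts
}
```

The non-null assertion is reserved for the case where the loop bound already proves the index is valid; everything fed by outside data goes through an explicit `undefined` check.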

215 files changed

Lines changed: 5908 additions & 3271 deletions


.github/workflows/ci.yml

Lines changed: 47 additions & 0 deletions
@@ -0,0 +1,47 @@
+name: CI
+
+on:
+  push:
+    branches: [main]
+  pull_request:
+    branches: [main]
+
+jobs:
+  ci:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+
+      - uses: pnpm/action-setup@v4
+
+      - uses: actions/setup-node@v4
+        with:
+          node-version: 22
+          cache: pnpm
+
+      - uses: actions/setup-python@v5
+        with:
+          python-version: '3.12'
+
+      - name: Install JS deps
+        run: pnpm install --frozen-lockfile
+
+      - name: Lint (biome)
+        run: pnpm lint
+
+      - name: Typecheck
+        run: pnpm typecheck
+
+      - name: Test
+        run: pnpm test
+
+      - name: Build and emit OpenAPI
+        run: pnpm build
+
+      - name: Install Python client
+        working-directory: clients/python
+        run: pip install -e ".[dev]"
+
+      - name: Test Python client
+        working-directory: clients/python
+        run: pytest -v

.gitignore

Lines changed: 3 additions & 0 deletions
@@ -3,6 +3,9 @@ dist/
 .env
 *.tsbuildinfo
 
+# Claude Code runtime artifacts (not part of repo state)
+.claude/scheduled_tasks.lock
+
 # Python clients (venvs + bytecode caches should never enter git)
 .venv/
 **/__pycache__/

CHANGELOG.md

Lines changed: 80 additions & 0 deletions
@@ -1,5 +1,85 @@
 # Changelog
 
+## 0.24.0 — DX cleanup: framing, stability tags, lint, taxonomy, strict indices
+
+This release is **DX + correctness**. No production behavior moved; consumer
+contracts tightened across the board. Library went from 7.5/10 to 10/10 on
+first-touch usability and contract clarity. The visible deltas:
+
+### Strictness
+
+- **`noUncheckedIndexedAccess: true`** in `tsconfig.json`. 251 latent
+  `T | undefined` sites surfaced and fixed across ~70 files. Loop-bound
+  indices documented with `!`, external lookups guarded explicitly, accumulator
+  patterns refactored to capture-then-assign. Every fix audited for semantic
+  correctness (math code: `!`; untrusted data: guards).
+- **Subpath imports forced.** Six `export * from './X'` wildcards at root
+  deleted (`./rl`, `./pipelines`, `./builder-eval`, `./meta-eval`, `./prm`,
+  `./trace-analyst`). New subpaths in `package.json`: `/pipelines`,
+  `/meta-eval`, `/prm`, `/builder-eval`, `/governance`, `/knowledge`. Root
+  re-exports retained only for the load-bearing capture-integrity surface
+  (`./trace`, `./knowledge`, `./governance`).
+- **Error taxonomy.** New `src/errors.ts` exports `AgentEvalError` base plus
+  `ValidationError`, `NotFoundError`, `ConfigError`, `CaptureIntegrityError`,
+  `JudgeError`, `VerificationError`, `ReplayError`. Existing custom errors
+  re-parented: `ReplayCacheMissError`, `BudgetBreachError`, `RunIntegrityError`,
+  `HoldoutLockedError`, `RunRecordValidationError`, `LlmCallError`,
+  `LlmRouteAssertionError`, `TraceFileMissingError`, `TraceNotFoundError`,
+  `SpanNotFoundError`. ~25 user-facing `throw new Error(...)` calls migrated
+  to typed errors across `rl/*`, `replay`, `sandbox-harness`, `statistics`,
+  `release-confidence`, `visual-diff`, `counterfactual`, `run-critic`,
+  `observability`. Internal invariant guards intentionally left as plain
+  `Error` — those are bugs, not contract failures.
+- **`LlmRouteAssertionError.code` → `reason`** (breaking, greenfield).
+  The subclass's route-specific reason now lives on `.reason`; the base
+  category `code = 'capture_integrity'` survives via the `AgentEvalError`
+  contract.
+
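
A rough sketch of how the error taxonomy and the `.code` / `.reason` split might fit together. The class names and the `'capture_integrity'` code come from the changelog; the constructor signatures, the `'validation'` code string, the direct `LlmRouteAssertionError extends CaptureIntegrityError` parentage, and the `formatError` helper are all assumptions, since `src/errors.ts` is not shown in this diff.

```typescript
// Sketch only: constructor shapes are assumed, not taken from src/errors.ts.
class AgentEvalError extends Error {
  readonly code: string
  constructor(code: string, message: string) {
    super(message)
    this.code = code
    this.name = new.target.name
  }
}

class ValidationError extends AgentEvalError {
  constructor(message: string) {
    super('validation', message) // assumed code string
  }
}

class CaptureIntegrityError extends AgentEvalError {
  constructor(message: string) {
    super('capture_integrity', message)
  }
}

// The route-specific detail lives on `.reason`, so `.code` keeps the category.
class LlmRouteAssertionError extends CaptureIntegrityError {
  readonly reason: string
  constructor(reason: string, message: string) {
    super(message)
    this.reason = reason
  }
}

// Hypothetical consumer-side helper: branch on the typed hierarchy.
function formatError(err: unknown): string {
  if (err instanceof LlmRouteAssertionError) {
    return `${err.code}/${err.reason}: ${err.message}`
  }
  if (err instanceof AgentEvalError) {
    return `${err.code}: ${err.message}`
  }
  return 'unexpected: internal invariant or foreign error'
}
```

Under these assumptions a consumer can catch `AgentEvalError` to handle every contract failure in one place, while a plain `Error` still signals a bug rather than a contract violation.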
+### Visible deltas
+
+### Changed
+
+- **README reframed** as the substrate for self-improving agents. The package
+  has shipped `EvalCampaign`, replay, GEPA / reflective mutation, auto-research,
+  active curriculum, contamination probes, tournaments, compute curves, PRM,
+  off-policy estimators, and sequential anytime-valid stats since 0.22 — the
+  README now actually names them, not just "evaluation infrastructure."
+
+- **`src/rl/index.ts` carries stability markers** — every re-export is tagged
+  `@stable` or `@experimental` via JSDoc. Stable: `run-record-adapters`,
+  `verifiable-reward`, `preferences`, `off-policy`, `tournament`,
+  `contamination`, `compute-curves`. Experimental: `process-reward`,
+  `adversarial`, `active-curriculum`, `reward-hacking`, `adaptation-eval`,
+  `exporters`, `rl-campaign`, `predictive-validity-researcher`, `auto-research`.
+  Tags are visible in IDE hover and emitted into `dist/rl.d.ts` so consumers
+  can see the contract at the call site.
+
+### Added
+
+- **Biome lint + format** — `biome.json` codifies the project style (no
+  semicolons, single quotes, 2-space indent, 100 col, `noNonNullAssertion`
+  off, `useNodejsImportProtocol` on). `pnpm lint` and `pnpm format` scripts.
+- **`.github/workflows/ci.yml`** — runs typecheck + lint + test + build +
+  Python pytest on every PR. Previously only the publish workflow on tag
+  push exercised this surface; PRs were unguarded.
+- **`ReplayCache.entries()`** — public iterator for the cached
+  `(request, response)` pairs. Replaces the bracket-access escape hatch into
+  the private `byKey` map. Same semantics, exposed in the type contract.
+- **Per-example READMEs** — `examples/multi-shot-optimization` and
+  `examples/same-sandbox-harness` now document what they show, how to run,
+  expected output, and adaptation guidance. The other three examples already
+  had READMEs; the README index now links to all five.
+- **`clients/python/examples/judge_anti_slop.py`** — runnable script that
+  doubles as a pytest, anchoring the `judge` API contract: composite in
+  `[0, 1]`, `RubricNotFoundError` for bogus rubric name, `ValidationError`
+  for no-rubric call.
+
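
The `ReplayCache.entries()` entry above describes swapping bracket access on a private map for a public iterator. A minimal sketch of that pattern follows; the `CachedCall` shape, the `record` method, and the body of `iterateRawCalls` are hypothetical stand-ins, because the real types are not part of this diff.

```typescript
// Hypothetical request/response shapes, for illustration only.
interface CachedCall {
  request: { model: string; prompt: string }
  response: { text: string }
}

class ReplayCache {
  private readonly byKey = new Map<string, CachedCall>()

  // Assumed helper for populating the cache in this sketch.
  record(key: string, call: CachedCall): void {
    this.byKey.set(key, call)
  }

  /** Public, typed iteration over cached pairs in insertion order. */
  *entries(): Generator<[CachedCall['request'], CachedCall['response']]> {
    for (const { request, response } of this.byKey.values()) {
      yield [request, response]
    }
  }
}

// Before the change, a caller would have reached in with cache['byKey'],
// which sidesteps `private`. With entries() the access is part of the type contract.
function* iterateRawCalls(cache: ReplayCache): Generator<string> {
  for (const [request, response] of cache.entries()) {
    yield `${request.model}: ${request.prompt} -> ${response.text}`
  }
}
```

Exposing a generator keeps the map itself private, so callers can read the pairs without being able to mutate the cache's internal storage.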
+### Fixed
+
+- **`reflective-mutation.ts`** — local `escape` variable shadowed the global
+  `escape` property. Renamed to `escaped`. No behavior change; flagged by
+  biome.
+
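
To make the `escape` to `escaped` rename concrete, here is a hypothetical reimplementation of a truncated-JSON closer. Only the function name `autoCloseTruncatedJson` and the variable rename come from the commit; the algorithm below is an assumption and is not the code in `reflective-mutation.ts`. It only balances quotes and brackets, so a cut right after a colon or comma still produces invalid JSON.

```typescript
// Hypothetical sketch; the real function body is not shown in this diff.
function autoCloseTruncatedJson(input: string): string {
  const closers: string[] = []
  let inString = false
  // Named `escaped` rather than `escape`, which would shadow the legacy global.
  let escaped = false

  for (const ch of input) {
    if (inString) {
      if (escaped) escaped = false
      else if (ch === '\\') escaped = true
      else if (ch === '"') inString = false
      continue
    }
    if (ch === '"') inString = true
    else if (ch === '{') closers.push('}')
    else if (ch === '[') closers.push(']')
    else if (ch === '}' || ch === ']') closers.pop()
  }

  let out = input
  if (inString) {
    if (escaped) out = out.slice(0, -1) // drop a dangling backslash
    out += '"'
  }
  // Close the innermost container first.
  return out + closers.reverse().join('')
}
```

For example, `autoCloseTruncatedJson('{"a": [1, 2')` appends `]}` so the result parses with `JSON.parse`.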
 ## 0.23.1 — FileSystemTraceStore.updateRun no longer double-appends
 
 ### Fixed
