Commit 50f72a7

v2 with SkillOpt Support
1 parent 46c0b79 commit 50f72a7

21 files changed

Lines changed: 1856 additions & 158 deletions

README.md

Lines changed: 111 additions & 33 deletions
Original file line number | Diff line number | Diff line change
@@ -22,14 +22,16 @@
2222

2323
CodexOpt is a lightweight CLI for benchmarking and optimizing Codex instruction assets.
2424

25-
It focuses on two repo-local files:
25+
It focuses on Codex instruction assets:
2626

2727
- `AGENTS.md`
2828
- `.codex/skills/**/SKILL.md`
29+
- `.agents/skills/**/SKILL.md`
2930

3031
## Quick Links
3132

3233
- Documentation: [superagenticai.github.io/CodexOpt](https://superagenticai.github.io/CodexOpt/)
34+
- Codex user workflow: [docs/codex-users.md](docs/codex-users.md)
3335
- Demo repository: [github.com/SuperagenticAI/codexopt-demo](https://github.com/SuperagenticAI/codexopt-demo)
3436
- PyPI package: [pypi.org/project/codexopt](https://pypi.org/project/codexopt/)
3537
- Docs source: [docs/](/Users/shashi/oss/CodexOpt/docs)
@@ -58,8 +60,10 @@ CodexOpt turns these edits into measurable runs with artifacts you can inspect a
5860
- Benchmark scoring with sub-scores and natural-language feedback.
5961
- Optional evidence inputs from repo task files and issue exports.
6062
- Optimization engine `heuristic` (default, local and deterministic).
61-
- Optional optimization engine `gepa` (via `gepa.optimize_anything`).
62-
- Explicit reporting when a GEPA-requested run falls back to heuristic optimization.
63+
- Reflective engine for Codex-backed SkillOpt/GEPA-style optimization.
64+
- SkillOpt-inspired `skillopt` engine for SKILL.md files with train/validation evidence splits,
65+
bounded edits, and validation-gated acceptance.
66+
- Explicit reporting when a model-backed run falls back to heuristic optimization.
6367
- Safe apply flow with automatic backups.
6468
- Markdown reporting from latest runs.
6569
- Minimal OSS CI (lint, test, build).
@@ -119,12 +123,16 @@ uv run codexopt apply --kind agents
119123
uv run codexopt report --output codexopt-report.md
120124
```
121125

126+
For Codex-specific rollout workflows, including `codex exec --json` validation tasks, see
127+
[Using CodexOpt with Codex](docs/codex-users.md).
128+
122129
## How Teams Use CodexOpt
123130

124131
Developers use CodexOpt in the repository that contains their Codex instruction assets:
125132

126133
- `AGENTS.md`
127134
- `.codex/skills/**/SKILL.md`
135+
- `.agents/skills/**/SKILL.md`
128136

129137
Optional evidence can also be added to improve benchmarking and optimization quality:
130138

@@ -155,7 +163,10 @@ With that config in place, `benchmark` and `optimize` use:
155163
- repo task alignment
156164
- recurring issue/review themes
157165

158-
Today, task and issue files influence scoring and feedback. CodexOpt does not yet execute full agent task simulations.
166+
Today, task and issue files influence scoring and feedback. With `--engine skillopt`, CodexOpt
167+
uses task evidence as train/validation splits so skill candidates must improve held-out evidence
168+
before they are accepted. JSON task files can also define executable rollout commands; when present,
169+
those rollout pass rates become the held-out validation gate.
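
The train/validation gating described above can be illustrated with a minimal Python sketch. This is not CodexOpt's implementation: the function names `split_tasks` and `accept_candidate` are hypothetical, the shuffle-then-cut split strategy is an assumption, and the `>=` comparison is one plausible reading of "minimum held-out validation gain". Only the default values (`0.67`, `0.01`) come from the config shown in this README.

```python
import random


def split_tasks(tasks, train_ratio=0.67, seed=0):
    """Shuffle tasks deterministically and split them into (train, validation).

    Hypothetical helper: the real split strategy may differ.
    """
    shuffled = list(tasks)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_ratio)
    # With two or more tasks, keep at least one task on each side of the split.
    if len(shuffled) >= 2:
        cut = max(1, min(len(shuffled) - 1, cut))
    return shuffled[:cut], shuffled[cut:]


def accept_candidate(baseline_score, candidate_score, validation_delta=0.01):
    """Accept a candidate only if its held-out score gain reaches validation_delta."""
    return candidate_score - baseline_score >= validation_delta
```

In this sketch, candidates would be proposed using only the train split, and `accept_candidate` would be called with scores (or rollout pass rates) measured on the validation split, so a rewrite that merely overfits the training evidence is rejected.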
159170

160171
Use `codexopt.example.yaml` as a starting point for committed team config.
161172

@@ -198,7 +209,7 @@ Optimize AGENTS files.
198209
```bash
199210
codexopt optimize agents \
200211
[--file PATTERN] \
201-
[--engine heuristic|gepa] \
212+
[--engine heuristic|reflective] \
202213
[--reflection-model MODEL] \
203214
[--max-metric-calls N]
204215
```
@@ -210,11 +221,22 @@ Optimize SKILL files.
210221
```bash
211222
codexopt optimize skills \
212223
[--glob PATTERN] \
213-
[--engine heuristic|gepa] \
224+
[--engine heuristic|skillopt|reflective] \
214225
[--reflection-model MODEL] \
215226
[--max-metric-calls N]
216227
```
217228

229+
### `improve`
230+
231+
One command for Codex users: discover targets, mine starter tasks, run the
232+
reflective optimizer, and preview the diff.
233+
234+
```bash
235+
codexopt improve # offline preview
236+
codexopt improve --live # Codex-backed reflective preview
237+
codexopt improve --live --apply # write validated changes with backups
238+
```
239+
218240
### `apply`
219241

220242
Apply best candidates from the latest optimization run (or a provided run id).
@@ -245,6 +267,8 @@ targets:
245267
skills_globs:
246268
- ".codex/skills/**/SKILL.md"
247269
- "**/.codex/skills/**/SKILL.md"
270+
- ".agents/skills/**/SKILL.md"
271+
- "**/.agents/skills/**/SKILL.md"
248272
exclude_globs:
249273
- ".git/**"
250274
- ".codexopt/**"
@@ -261,6 +285,9 @@ optimization:
261285
min_apply_delta: 0.01
262286
max_metric_calls: 60
263287
reflection_model: null
288+
skillopt_train_ratio: 0.67
289+
skillopt_edit_budget: 24
290+
skillopt_validation_delta: 0.01
264291
```
265292

266293
Config notes:
@@ -271,10 +298,13 @@ Config notes:
271298
- `output.root_dir`: run artifacts and backups location.
272299
- `evidence.task_files`: optional markdown/json task lists used for repo-alignment scoring.
273300
- `evidence.issue_files`: optional markdown/json issue or review exports used for theme-aware feedback.
274-
- `optimization.engine`: default optimization engine.
301+
- `optimization.engine`: default optimization engine (`heuristic`, `reflective`, or `skillopt` for skills).
275302
- `optimization.min_apply_delta`: minimum score gain required to apply.
276-
- `optimization.max_metric_calls`: GEPA metric budget.
277-
- `optimization.reflection_model`: required when using GEPA engine.
303+
- `optimization.max_metric_calls`: legacy GEPA metric budget.
304+
- `optimization.reflection_model`: legacy GEPA reflection model.
305+
- `optimization.skillopt_train_ratio`: task evidence fraction used for skill candidate proposal.
306+
- `optimization.skillopt_edit_budget`: maximum line edit operations allowed for SkillOpt candidates.
307+
- `optimization.skillopt_validation_delta`: minimum held-out validation gain required for SkillOpt acceptance.
278308

279309
## How Scoring Works
280310

@@ -317,42 +347,89 @@ Candidate transforms include:
317347

318348
The best candidate is selected by score delta. If delta is below `min_apply_delta`, original content is kept.
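
As a rough illustration of the selection rule above, the following Python sketch picks the highest-scoring candidate and keeps the original when the gain is too small. The function name `select_best` and the `(name, score)` candidate shape are assumptions made for this example, not CodexOpt's internal API.

```python
def select_best(original_score, candidates, min_apply_delta=0.01):
    """Return (name, delta) for the best candidate, or (None, 0.0) to keep the original.

    `candidates` is a list of (name, score) pairs; this shape is hypothetical.
    """
    if not candidates:
        return None, 0.0
    # Pick the candidate with the highest score.
    name, score = max(candidates, key=lambda item: item[1])
    delta = score - original_score
    # A gain below the threshold means the original content is kept.
    if delta < min_apply_delta:
        return None, 0.0
    return name, delta
```

For example, with an original score of `0.47`, a candidate scoring `0.55` would be selected, while a lone candidate scoring `0.475` would be discarded in favor of the original.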
319349

320-
### GEPA engine (optional)
350+
### Reflective engine
321351

322-
CodexOpt can call `gepa.optimize_anything` when `--engine gepa` is selected.
352+
The maintained SkillOpt/GEPA-inspired path is `--engine reflective`, or the
353+
Codex-user shortcut `codexopt improve`. It evaluates a candidate document on
354+
tasks, captures textual feedback, asks an optimizer model to rewrite the
355+
document, and accepts the rewrite only when it improves held-out validation
356+
tasks.
323357

324-
The GEPA path is model-agnostic. In practice, teams can use any reflection model
325-
supported by their GEPA / LiteLLM setup, including OpenAI, Gemini, local models,
326-
or other compatible providers. That means you can ask GEPA to generate feedback
327-
and candidate improvements using whichever model gives you the best quality /
328-
cost tradeoff for your workflow.
358+
Defaults stay offline and use static/verifier signals. To run the full live
359+
Codex loop, use:
329360

330-
Requirements:
361+
```bash
362+
codexopt improve --live
363+
```
331364

332-
- `gepa` installed in the environment.
333-
- A valid reflection model via `--reflection-model` or config.
365+
`--live` uses `codex exec` as both optimizer and judge. You can also set
366+
`reflective.optimizer_model` and `reflective.judge_model` to `codex`,
367+
`openai/<model>`, or another OpenAI-compatible model.
334368

335-
Common examples:
369+
### Legacy GEPA engine
370+
371+
`--engine gepa` is deprecated. It targeted an older `gepa.optimize_anything`
372+
API and now falls back with a clear warning. Use `--engine reflective` instead.
373+
374+
For SkillOpt-style skill optimization:
336375

337376
```yaml
338377
optimization:
339-
engine: "gepa"
340-
reflection_model: "openai/gpt-5-mini"
378+
engine: "skillopt"
379+
reflection_model: "openai/gpt-5-mini" # optional; without it, heuristic proposers are used
380+
skillopt_train_ratio: 0.67
381+
skillopt_edit_budget: 24
382+
skillopt_validation_delta: 0.01
341383
```
342384

343-
```yaml
344-
optimization:
345-
engine: "gepa"
346-
reflection_model: "gemini/gemini-2.5-pro"
385+
Executable rollout task files can be listed in `evidence.task_files`:
386+
387+
```json
388+
[
389+
{
390+
"name": "skill-verifier",
391+
"description": "Run a repo-local verifier against the candidate skill.",
392+
"command": ["python", "scripts/verify_skill.py"],
393+
"timeout_seconds": 30
394+
}
395+
]
347396
```
348397

349-
For OpenAI-backed GEPA runs, set:
398+
Codex-backed rollout tasks can use `backend: "codex"` and `codex_prompt`:
399+
400+
```json
401+
[
402+
{
403+
"name": "codex-skill-task",
404+
"backend": "codex",
405+
"description": "Run Codex against the candidate skill.",
406+
"codex_prompt": "Use the local skill to update CHANGELOG.md for a patch release.",
407+
"timeout_seconds": 120,
408+
"expected_final_response_contains": "CHANGELOG.md",
409+
"expected_file_change": "CHANGELOG.md",
410+
"expected_file_contains": {
411+
"path": "CHANGELOG.md",
412+
"contains": "Patch"
413+
}
414+
}
415+
]
416+
```
417+
418+
CodexOpt evaluates those commands in a temporary copy of the repo with the candidate `SKILL.md`
419+
written in place, then records pass/fail details in `optimize.json`. For Codex-backed rollouts,
420+
CodexOpt also parses `codex exec --json` events into trajectory metadata: final response,
421+
commands, file changes, token usage, and errors.
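
To make the `expected_*` fields in the Codex-backed task above more concrete, here is a hedged Python sketch of how such expectations could be checked after a rollout. The function `check_expectations`, its arguments, and the exact pass/fail semantics are assumptions for illustration; only the field names come from the JSON example, and CodexOpt's actual evaluation logic may differ.

```python
from pathlib import Path


def check_expectations(task, final_response, changed_files, repo_root):
    """Return a list of failure messages; an empty list means the rollout passed.

    Hypothetical checker: argument names and semantics are illustrative only.
    """
    failures = []

    # The final response should mention the expected substring.
    needle = task.get("expected_final_response_contains")
    if needle and needle not in final_response:
        failures.append(f"final response missing {needle!r}")

    # The expected file should appear among the files the rollout changed.
    changed = task.get("expected_file_change")
    if changed and changed not in changed_files:
        failures.append(f"{changed} was not modified")

    # The expected file should contain the given text after the rollout.
    spec = task.get("expected_file_contains")
    if spec:
        path = Path(repo_root) / spec["path"]
        text = path.read_text(encoding="utf-8") if path.is_file() else ""
        if spec["contains"] not in text:
            failures.append(f"{spec['path']} missing {spec['contains']!r}")

    return failures
```

Here `final_response` and `changed_files` stand in for the trajectory metadata described above, and `repo_root` for the temporary copy of the repository.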
422+
423+
For OpenAI-compatible reflective models, set the provider credentials and use
424+
`reflective.optimizer_model` / `reflective.judge_model` values such as
425+
`openai/gpt-5-mini`:
350426

351427
```bash
352428
export OPENAI_API_KEY="your-openai-key"
353429
```
354430

355-
For Gemini-backed GEPA runs, set:
431+
For Gemini-compatible endpoints, set the credentials expected by your OpenAI-compatible
432+
client or run through `codexopt improve --live` to use `codex exec` directly.
356433

357434
```bash
358435
export GEMINI_API_KEY="your-gemini-key"
@@ -361,7 +438,8 @@ export GOOGLE_API_KEY="$GEMINI_API_KEY"
361438

362439
Fallback behavior:
363440

364-
- If GEPA is unavailable or errors, CodexOpt falls back to heuristic optimization.
441+
- If a configured optimizer or judge model is unavailable, CodexOpt records a note and
442+
falls back to the weaker heuristic/static path.
365443
- Fallbacks are recorded in optimization artifacts, CLI summaries, and reports.
366444

367445
## Artifacts and State
@@ -514,17 +592,17 @@ uv run codexopt apply --kind agents --run-id <run_id>
514592

515593
Cause:
516594

517-
- `gepa` is not installed, or
518-
- `reflection_model` is missing.
595+
- The legacy GEPA engine targeted an older `gepa.optimize_anything` API.
519596

520597
Behavior:
521598

522-
- CodexOpt falls back to heuristic optimization when GEPA errors.
599+
- CodexOpt falls back to heuristic optimization and records the deprecation reason.
523600

524601
Fix:
525602

526603
```bash
527-
uv run codexopt optimize agents --engine gepa --reflection-model <model_name>
604+
uv run codexopt optimize agents --engine reflective
605+
uv run codexopt improve --live
528606
```
529607

530608
### `apply --dry-run` says files would be applied, but nothing changed

codexopt.example.yaml

Lines changed: 22 additions & 0 deletions
@@ -7,6 +7,8 @@ targets:
77
skills_globs:
88
- ".codex/skills/**/SKILL.md"
99
- "**/.codex/skills/**/SKILL.md"
10+
- ".agents/skills/**/SKILL.md"
11+
- "**/.agents/skills/**/SKILL.md"
1012
exclude_globs:
1113
- ".git/**"
1214
- ".codexopt/**"
@@ -23,3 +25,23 @@ optimization:
2325
min_apply_delta: 0.01
2426
max_metric_calls: 60
2527
reflection_model: null
28+
skillopt_train_ratio: 0.67
29+
skillopt_edit_budget: 24
30+
skillopt_validation_delta: 0.01
31+
# Settings for the GEPA-concept-aligned reflective engine (`--engine reflective`,
32+
# also used by `codexopt improve`). It evolves a document against real Codex
33+
# rollouts scored by a tiered reward (verifier -> LLM-judge -> static).
34+
reflective:
35+
# Model specs: "codex" (uses `codex exec`, no API key), "openai/<model>",
36+
# "<model>", or null to disable that role. Defaults stay offline; use
37+
# `codexopt improve --live` to opt into Codex-backed rollouts and mutation.
38+
optimizer_model: null # reflective-mutation LLM; null -> heuristic proposer (weak signal)
39+
judge_model: null # LLM-judge for trajectories; null disables the judge tier
40+
reward_mode: "tiered" # tiered | verifier | judge | static
41+
minibatch_size: 3
42+
max_iterations: 6 # reflect -> mutate -> gate iterations (optimization budget)
43+
edit_budget: 12 # max line-edit operations per mutation (gradient clipping)
44+
valset_ratio: 0.34 # held-out fraction used by the validation gate
45+
max_rollouts: 60 # hard cap on Codex/verifier executions per run (cost guard)
46+
seed: 0
47+
codex_binary: "codex"

docs/benchmarking.md

Lines changed: 3 additions & 1 deletion
@@ -39,6 +39,9 @@ If `task_files` or `issue_files` are configured, benchmark output includes:
3939
- criterion-level sub-scores
4040
- natural-language feedback
4141

42+
JSON task files can include executable rollout tasks for `skillopt`; benchmark still treats their
43+
descriptions as evidence, while optimization can run their commands as validation gates.
44+
4245
## Example
4346

4447
```bash
@@ -54,4 +57,3 @@ issue_files: 1
5457
- agents: /path/to/AGENTS.md
5558
score=0.4700 issues=contradictions, duplicate_lines, missing_output_contract
5659
```
57-
