  CodexOpt is a lightweight CLI for benchmarking and optimizing Codex instruction assets.

- It focuses on two repo-local files:
+ It focuses on Codex instruction assets:

  - `AGENTS.md`
  - `.codex/skills/**/SKILL.md`
+ - `.agents/skills/**/SKILL.md`

  ## Quick Links

  - Documentation: [superagenticai.github.io/CodexOpt](https://superagenticai.github.io/CodexOpt/)
+ - Codex user workflow: [docs/codex-users.md](docs/codex-users.md)
  - Demo repository: [github.com/SuperagenticAI/codexopt-demo](https://github.com/SuperagenticAI/codexopt-demo)
  - PyPI package: [pypi.org/project/codexopt](https://pypi.org/project/codexopt/)
  - Docs source: [docs/](/Users/shashi/oss/CodexOpt/docs)
@@ -58,8 +60,10 @@ CodexOpt turns these edits into measurable runs with artifacts you can inspect a
  - Benchmark scoring with sub-scores and natural-language feedback.
  - Optional evidence inputs from repo task files and issue exports.
  - Optimization engine `heuristic` (default, local and deterministic).
- - Optional optimization engine `gepa` (via `gepa.optimize_anything`).
- - Explicit reporting when a GEPA-requested run falls back to heuristic optimization.
+ - Reflective engine for Codex-backed SkillOpt/GEPA-style optimization.
+ - SkillOpt-inspired `skillopt` engine for SKILL.md files with train/validation evidence splits,
+   bounded edits, and validation-gated acceptance.
+ - Explicit reporting when a model-backed run falls back to heuristic optimization.
  - Safe apply flow with automatic backups.
  - Markdown reporting from latest runs.
  - Minimal OSS CI (lint, test, build).
@@ -119,12 +123,16 @@ uv run codexopt apply --kind agents
  uv run codexopt report --output codexopt-report.md
  ```

+ For Codex-specific rollout workflows, including `codex exec --json` validation tasks, see
+ [Using CodexOpt with Codex](docs/codex-users.md).
+
  ## How Teams Use CodexOpt

  Developers use CodexOpt in the repository that contains their Codex instruction assets:

  - `AGENTS.md`
  - `.codex/skills/**/SKILL.md`
+ - `.agents/skills/**/SKILL.md`

  Optional evidence can also be added to improve benchmarking and optimization quality:
@@ -155,7 +163,10 @@ With that config in place, `benchmark` and `optimize` use:
  - repo task alignment
  - recurring issue/review themes

- Today, task and issue files influence scoring and feedback. CodexOpt does not yet execute full agent task simulations.
+ Today, task and issue files influence scoring and feedback. With `--engine skillopt`, CodexOpt
+ uses task evidence as train/validation splits so skill candidates must improve held-out evidence
+ before they are accepted. JSON task files can also define executable rollout commands; when present,
+ those rollout pass rates become the held-out validation gate.
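
The train/validation gate described above can be pictured with a small Python sketch. It is an illustration only, not CodexOpt's implementation: the function names, the deterministic shuffle, and the `passes` callback are assumptions, while the `0.67` ratio and `0.01` delta mirror the `skillopt_train_ratio` and `skillopt_validation_delta` defaults shown in this README.

```python
import random
from typing import Callable, Sequence


def split_tasks(tasks: Sequence[str], train_ratio: float = 0.67, seed: int = 0):
    """Shuffle deterministically, then cut the tasks into (train, validation)."""
    if len(tasks) < 2:
        raise ValueError("need at least two tasks to hold one out for validation")
    shuffled = list(tasks)
    random.Random(seed).shuffle(shuffled)
    # Keep at least one task on each side of the cut.
    cut = max(1, min(len(shuffled) - 1, round(len(shuffled) * train_ratio)))
    return shuffled[:cut], shuffled[cut:]


def pass_rate(skill_text: str, tasks: Sequence[str],
              passes: Callable[[str, str], bool]) -> float:
    """Fraction of tasks that pass for the given skill text."""
    return sum(1 for task in tasks if passes(skill_text, task)) / len(tasks)


def accept_candidate(original: str, candidate: str, validation: Sequence[str],
                     passes: Callable[[str, str], bool],
                     min_delta: float = 0.01) -> bool:
    """Accept only if the held-out pass rate improves by at least min_delta."""
    gain = pass_rate(candidate, validation, passes) - pass_rate(original, validation, passes)
    return gain >= min_delta
```

In this sketch, candidates would be proposed using only the train portion, so the validation tasks stay unseen until the acceptance check.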

  Use `codexopt.example.yaml` as a starting point for committed team config.
@@ -198,7 +209,7 @@ Optimize AGENTS files.
  ```bash
  codexopt optimize agents \
    [--file PATTERN] \
-   [--engine heuristic|gepa] \
+   [--engine heuristic|reflective] \
    [--reflection-model MODEL] \
    [--max-metric-calls N]
  ```
@@ -210,11 +221,22 @@ Optimize SKILL files.
  ```bash
  codexopt optimize skills \
    [--glob PATTERN] \
-   [--engine heuristic|gepa] \
+   [--engine heuristic|skillopt|reflective] \
    [--reflection-model MODEL] \
    [--max-metric-calls N]
  ```

+ ### `improve`
+
+ One command for Codex users: discover targets, mine starter tasks, run the
+ reflective optimizer, and preview the diff.
+
+ ```bash
+ codexopt improve                 # offline preview
+ codexopt improve --live          # Codex-backed reflective preview
+ codexopt improve --live --apply  # write validated changes with backups
+ ```
+
  ### `apply`

  Apply best candidates from the latest optimization run (or a provided run id).
@@ -245,6 +267,8 @@ targets:
    skills_globs:
      - ".codex/skills/**/SKILL.md"
      - "**/.codex/skills/**/SKILL.md"
+     - ".agents/skills/**/SKILL.md"
+     - "**/.agents/skills/**/SKILL.md"
    exclude_globs:
      - ".git/**"
      - ".codexopt/**"
@@ -261,6 +285,9 @@ optimization:
    min_apply_delta: 0.01
    max_metric_calls: 60
    reflection_model: null
+   skillopt_train_ratio: 0.67
+   skillopt_edit_budget: 24
+   skillopt_validation_delta: 0.01
  ```

  Config notes:
@@ -271,10 +298,13 @@ Config notes:
  - `output.root_dir`: run artifacts and backups location.
  - `evidence.task_files`: optional markdown/json task lists used for repo-alignment scoring.
  - `evidence.issue_files`: optional markdown/json issue or review exports used for theme-aware feedback.
- - `optimization.engine`: default optimization engine.
+ - `optimization.engine`: default optimization engine (`heuristic`, `reflective`, or `skillopt` for skills).
  - `optimization.min_apply_delta`: minimum score gain required to apply.
- - `optimization.max_metric_calls`: GEPA metric budget.
- - `optimization.reflection_model`: required when using GEPA engine.
+ - `optimization.max_metric_calls`: legacy GEPA metric budget.
+ - `optimization.reflection_model`: legacy GEPA reflection model.
+ - `optimization.skillopt_train_ratio`: task evidence fraction used for skill candidate proposal.
+ - `optimization.skillopt_edit_budget`: maximum line edit operations allowed for SkillOpt candidates.
+ - `optimization.skillopt_validation_delta`: minimum held-out validation gain required for SkillOpt acceptance.
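
One way to read `skillopt_edit_budget` is as a cap on how many lines a candidate may change relative to the original `SKILL.md`. The sketch below counts line edits with Python's standard `difflib`. How CodexOpt actually counts edit operations is not stated here, so the counting rule and both function names are assumptions made for illustration.

```python
import difflib


def count_line_edits(original: str, candidate: str) -> int:
    """Count changed lines: each replaced, inserted, or deleted line is one edit."""
    a, b = original.splitlines(), candidate.splitlines()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    edits = 0
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # A replace spanning unequal ranges counts the larger side.
            edits += max(i2 - i1, j2 - j1)
    return edits


def within_budget(original: str, candidate: str, budget: int = 24) -> bool:
    """True when the candidate stays inside the edit budget."""
    return count_line_edits(original, candidate) <= budget
```

A bounded budget like this keeps candidates close to the original file, which makes the resulting diff easier to review.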

  ## How Scoring Works
@@ -317,42 +347,89 @@ Candidate transforms include:
  The best candidate is selected by score delta. If delta is below `min_apply_delta`, original content is kept.
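
The selection rule above can be sketched in a few lines of Python. The function name and the candidate names in the usage note are hypothetical; only the rule itself (pick the highest score, keep the original when the gain is below `min_apply_delta`) comes from this README.

```python
from typing import Dict, Optional


def select_best(original_score: float, candidate_scores: Dict[str, float],
                min_apply_delta: float = 0.01) -> Optional[str]:
    """Return the best candidate's name, or None to keep the original content."""
    if not candidate_scores:
        return None
    best = max(candidate_scores, key=candidate_scores.get)
    delta = candidate_scores[best] - original_score
    # Gains smaller than the threshold are treated as noise.
    return best if delta >= min_apply_delta else None
```

For example, with an original score of `0.70`, a candidate set of `{"trimmed": 0.75, "reordered": 0.72}` selects `"trimmed"`, while `{"trimmed": 0.705}` keeps the original.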

- ### GEPA engine (optional)
+ ### Reflective engine

- CodexOpt can call `gepa.optimize_anything` when `--engine gepa` is selected.
+ The maintained SkillOpt/GEPA-inspired path is `--engine reflective`, or the
+ Codex-user shortcut `codexopt improve`. It evaluates a candidate document on
+ tasks, captures textual feedback, asks an optimizer model to rewrite the
+ document, and accepts the rewrite only when it improves held-out validation
+ tasks.

- The GEPA path is model-agnostic. In practice, teams can use any reflection model
- supported by their GEPA / LiteLLM setup, including OpenAI, Gemini, local models,
- or other compatible providers. That means you can ask GEPA to generate feedback
- and candidate improvements using whichever model gives you the best quality /
- cost tradeoff for your workflow.
+ Defaults stay offline and use static/verifier signals. To run the full live
+ Codex loop, use:

- Requirements:
+ ```bash
+ codexopt improve --live
+ ```

- - `gepa` installed in the environment.
- - A valid reflection model via `--reflection-model` or config.
+ `--live` uses `codex exec` as both optimizer and judge. You can also set
+ `reflective.optimizer_model` and `reflective.judge_model` to `codex`,
+ `openai/<model>`, or another OpenAI-compatible model.

- Common examples:
+ ### Legacy GEPA engine
+
+ `--engine gepa` is deprecated. It targeted an older `gepa.optimize_anything`
+ API and now falls back with a clear warning. Use `--engine reflective` instead.
+
+ For SkillOpt-style skill optimization:

  ```yaml
  optimization:
-   engine: "gepa"
-   reflection_model: "openai/gpt-5-mini"
+   engine: "skillopt"
+   reflection_model: "openai/gpt-5-mini"  # optional; without it, heuristic proposers are used
+   skillopt_train_ratio: 0.67
+   skillopt_edit_budget: 24
+   skillopt_validation_delta: 0.01
  ```

- ```yaml
- optimization:
-   engine: "gepa"
-   reflection_model: "gemini/gemini-2.5-pro"
+ Executable rollout task files can be listed in `evidence.task_files`:
+
+ ```json
+ [
+   {
+     "name": "skill-verifier",
+     "description": "Run a repo-local verifier against the candidate skill.",
+     "command": ["python", "scripts/verify_skill.py"],
+     "timeout_seconds": 30
+   }
+ ]
  ```

- For OpenAI-backed GEPA runs, set:
+ Codex-backed rollout tasks can use `backend: "codex"` and `codex_prompt`:
+
+ ```json
+ [
+   {
+     "name": "codex-skill-task",
+     "backend": "codex",
+     "description": "Run Codex against the candidate skill.",
+     "codex_prompt": "Use the local skill to update CHANGELOG.md for a patch release.",
+     "timeout_seconds": 120,
+     "expected_final_response_contains": "CHANGELOG.md",
+     "expected_file_change": "CHANGELOG.md",
+     "expected_file_contains": {
+       "path": "CHANGELOG.md",
+       "contains": "Patch"
+     }
+   }
+ ]
+ ```
+
+ CodexOpt evaluates those commands in a temporary copy of the repo with the candidate `SKILL.md`
+ written in place, then records pass/fail details in `optimize.json`. For Codex-backed rollouts,
+ CodexOpt also parses `codex exec --json` events into trajectory metadata: final response,
+ commands, file changes, token usage, and errors.
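
The command-rollout flow described above (copy the repo, write the candidate skill in place, run each task's command, record pass/fail) can be sketched as follows. This is a minimal illustration under stated assumptions, not CodexOpt's code: the `run_rollouts` name, the result dictionary shape, and treating exit code 0 as a pass are all assumptions. Only the `name`, `command`, and `timeout_seconds` task fields come from the task file format shown here.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path
from typing import Dict, List


def run_rollouts(repo_root: Path, skill_rel_path: str,
                 candidate_text: str, tasks: List[dict]) -> List[Dict]:
    """Run each task's command in a throwaway repo copy holding the candidate skill."""
    results = []
    with tempfile.TemporaryDirectory() as tmp:
        work = Path(tmp) / "repo"
        shutil.copytree(repo_root, work)  # the real repo is never modified
        target = work / skill_rel_path
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(candidate_text, encoding="utf-8")
        for task in tasks:
            try:
                proc = subprocess.run(
                    task["command"], cwd=work, capture_output=True,
                    text=True, timeout=task.get("timeout_seconds", 30),
                )
                passed, detail = proc.returncode == 0, proc.stderr.strip()
            except subprocess.TimeoutExpired:
                passed, detail = False, "timed out"
            except OSError as exc:  # e.g. the command is not installed
                passed, detail = False, str(exc)
            results.append({"name": task["name"], "passed": passed, "detail": detail})
    return results


def rollout_pass_rate(results: List[Dict]) -> float:
    """Fraction of rollout tasks that passed (0.0 when there are none)."""
    return sum(r["passed"] for r in results) / len(results) if results else 0.0
```

Running the commands in a temporary copy means a failing or destructive verifier cannot damage the working tree, and the pass rate can feed a validation gate directly.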
+
+ For OpenAI-compatible reflective models, set the provider credentials and use
+ `reflective.optimizer_model` / `reflective.judge_model` values such as
+ `openai/gpt-5-mini`:

  ```bash
  export OPENAI_API_KEY="your-openai-key"
  ```

- For Gemini-backed GEPA runs, set:
+ For Gemini-compatible endpoints, set the credentials expected by your OpenAI-compatible
+ client or run through `codexopt improve --live` to use `codex exec` directly.

  ```bash
  export GEMINI_API_KEY="your-gemini-key"
@@ -361,7 +438,8 @@ export GOOGLE_API_KEY="$GEMINI_API_KEY"
  Fallback behavior:

- - If GEPA is unavailable or errors, CodexOpt falls back to heuristic optimization.
+ - If a configured optimizer or judge model is unavailable, CodexOpt records a note and
+   falls back to the weaker heuristic/static path.
  - Fallbacks are recorded in optimization artifacts, CLI summaries, and reports.
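
The fallback behavior can be pictured as a try/except wrapper that always returns a result plus a record of what happened. The sketch below is hypothetical: the function name, the returned dictionary keys, and the callables passed in are assumptions for illustration, not CodexOpt's API.

```python
from typing import Callable, Dict


def optimize_with_fallback(text: str,
                           model_optimizer: Callable[[str], str],
                           heuristic_optimizer: Callable[[str], str]) -> Dict:
    """Try the model-backed path first; on failure, note it and use the heuristic path."""
    notes = []
    try:
        return {"text": model_optimizer(text), "engine_used": "reflective", "notes": notes}
    except Exception as exc:  # broad on purpose: any model failure triggers the fallback
        notes.append(f"model-backed run failed, fell back to heuristic: {exc}")
        return {"text": heuristic_optimizer(text), "engine_used": "heuristic", "notes": notes}
```

Keeping the note alongside the result is what lets a later report say which engine actually produced the candidate.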

  ## Artifacts and State
@@ -514,17 +592,17 @@ uv run codexopt apply --kind agents --run-id <run_id>
  Cause:

- - `gepa` is not installed, or
- - `reflection_model` is missing.
+ - The legacy GEPA engine targeted an older `gepa.optimize_anything` API.

  Behavior:

- - CodexOpt falls back to heuristic optimization when GEPA errors.
+ - CodexOpt falls back to heuristic optimization and records the deprecation reason.

  Fix:

  ```bash
- uv run codexopt optimize agents --engine gepa --reflection-model <model_name>
+ uv run codexopt optimize agents --engine reflective
+ uv run codexopt improve --live
  ```

  ### `apply --dry-run` says files would be applied, but nothing changed