SuperagenticAI
diff --git a/README.md b/README.md
Lines changed: 111 additions & 33 deletions
diff --git a/codexopt.example.yaml b/codexopt.example.yaml
Lines changed: 22 additions & 0 deletions
diff --git a/docs/benchmarking.md b/docs/benchmarking.md
Lines changed: 3 additions & 1 deletion
@@ -22,14 +22,16 @@
 
 CodexOpt is a lightweight CLI for benchmarking and optimizing Codex instruction assets.
 
-It focuses on two repo-local files:
+It focuses on Codex instruction assets:
 
 - `AGENTS.md`
 - `.codex/skills/**/SKILL.md`
+- `.agents/skills/**/SKILL.md`
 
 ## Quick Links
 
 - Documentation: [superagenticai.github.io/CodexOpt](https://superagenticai.github.io/CodexOpt/)
+- Codex user workflow: [docs/codex-users.md](docs/codex-users.md)
 - Demo repository: [github.com/SuperagenticAI/codexopt-demo](https://github.com/SuperagenticAI/codexopt-demo)
 - PyPI package: [pypi.org/project/codexopt](https://pypi.org/project/codexopt/)
 - Docs source: [docs/](/Users/shashi/oss/CodexOpt/docs)
@@ -58,8 +60,10 @@ CodexOpt turns these edits into measurable runs with artifacts you can inspect a
 - Benchmark scoring with sub-scores and natural-language feedback.
 - Optional evidence inputs from repo task files and issue exports.
 - Optimization engine `heuristic` (default, local and deterministic).
-- Optional optimization engine `gepa` (via `gepa.optimize_anything`).
-- Explicit reporting when a GEPA-requested run falls back to heuristic optimization.
+- Reflective engine for Codex-backed SkillOpt/GEPA-style optimization.
+- SkillOpt-inspired `skillopt` engine for SKILL.md files with train/validation evidence splits,
+  bounded edits, and validation-gated acceptance.
+- Explicit reporting when a model-backed run falls back to heuristic optimization.
 - Safe apply flow with automatic backups.
 - Markdown reporting from latest runs.
 - Minimal OSS CI (lint, test, build).
@@ -119,12 +123,16 @@ uv run codexopt apply --kind agents
 uv run codexopt report --output codexopt-report.md
 ```
 
+For Codex-specific rollout workflows, including `codex exec --json` validation tasks, see
+[Using CodexOpt with Codex](docs/codex-users.md).
+
 ## How Teams Use CodexOpt
 
 Developers use CodexOpt in the repository that contains their Codex instruction assets:
 
 - `AGENTS.md`
 - `.codex/skills/**/SKILL.md`
+- `.agents/skills/**/SKILL.md`
 
 Optional evidence can also be added to improve benchmarking and optimization quality:
 
@@ -155,7 +163,10 @@ With that config in place, `benchmark` and `optimize` use:
 - repo task alignment
 - recurring issue/review themes
 
-Today, task and issue files influence scoring and feedback. CodexOpt does not yet execute full agent task simulations.
+Today, task and issue files influence scoring and feedback. With `--engine skillopt`, CodexOpt
+uses task evidence as train/validation splits so skill candidates must improve held-out evidence
+before they are accepted. JSON task files can also define executable rollout commands; when present,
+those rollout pass rates become the held-out validation gate.
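
The train/validation gate described above can be sketched roughly as follows. This is a minimal illustration, not CodexOpt's implementation: `split_tasks`, `accept_candidate`, and the `score` callable are hypothetical names, and the shuffle-then-cut split is an assumption about how a `0.67` train ratio might be applied.

```python
import random
from typing import Callable, List, Sequence, Tuple


def split_tasks(
    tasks: Sequence[str], train_ratio: float = 0.67, seed: int = 0
) -> Tuple[List[str], List[str]]:
    """Shuffle deterministically, then split into (train, validation)."""
    shuffled = list(tasks)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_ratio)
    # Assumption: keep at least one task on each side when possible.
    if len(shuffled) >= 2:
        cut = min(max(cut, 1), len(shuffled) - 1)
    return shuffled[:cut], shuffled[cut:]


def accept_candidate(
    baseline: str,
    candidate: str,
    validation_tasks: Sequence[str],
    score: Callable[[str, str], float],
    min_delta: float = 0.01,
) -> bool:
    """Accept only when the mean held-out score improves by at least min_delta."""
    if not validation_tasks:
        return False  # nothing held out, so the gate cannot be passed

    def mean_score(doc: str) -> float:
        return sum(score(doc, task) for task in validation_tasks) / len(validation_tasks)

    return mean_score(candidate) - mean_score(baseline) >= min_delta
```

In this sketch the candidate is proposed using only the train split, and the validation split is consulted once, at acceptance time, so a candidate cannot be accepted by overfitting to the evidence it was built from.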
 
 Use `codexopt.example.yaml` as a starting point for committed team config.
 
@@ -198,7 +209,7 @@ Optimize AGENTS files.
 ```bash
 codexopt optimize agents \
   [--file PATTERN] \
-  [--engine heuristic|gepa] \
+  [--engine heuristic|reflective] \
   [--reflection-model MODEL] \
   [--max-metric-calls N]
 ```
@@ -210,11 +221,22 @@ Optimize SKILL files.
 ```bash
 codexopt optimize skills \
   [--glob PATTERN] \
-  [--engine heuristic|gepa] \
+  [--engine heuristic|skillopt|reflective] \
   [--reflection-model MODEL] \
   [--max-metric-calls N]
 ```
 
+### `improve`
+
+One command for Codex users: discover targets, mine starter tasks, run the
+reflective optimizer, and preview the diff.
+
+```bash
+codexopt improve                    # offline preview
+codexopt improve --live             # Codex-backed reflective preview
+codexopt improve --live --apply     # write validated changes with backups
+```
+
 ### `apply`
 
 Apply best candidates from the latest optimization run (or a provided run id).
@@ -245,6 +267,8 @@ targets:
   skills_globs:
     - ".codex/skills/**/SKILL.md"
     - "**/.codex/skills/**/SKILL.md"
+    - ".agents/skills/**/SKILL.md"
+    - "**/.agents/skills/**/SKILL.md"
   exclude_globs:
     - ".git/**"
     - ".codexopt/**"
@@ -261,6 +285,9 @@ optimization:
   min_apply_delta: 0.01
   max_metric_calls: 60
   reflection_model: null
+  skillopt_train_ratio: 0.67
+  skillopt_edit_budget: 24
+  skillopt_validation_delta: 0.01
 ```
 
 Config notes:
@@ -271,10 +298,13 @@ Config notes:
 - `output.root_dir`: run artifacts and backups location.
 - `evidence.task_files`: optional markdown/json task lists used for repo-alignment scoring.
 - `evidence.issue_files`: optional markdown/json issue or review exports used for theme-aware feedback.
-- `optimization.engine`: default optimization engine.
+- `optimization.engine`: default optimization engine (`heuristic`, `reflective`, or `skillopt` for skills).
 - `optimization.min_apply_delta`: minimum score gain required to apply.
-- `optimization.max_metric_calls`: GEPA metric budget.
-- `optimization.reflection_model`: required when using GEPA engine.
+- `optimization.max_metric_calls`: legacy GEPA metric budget.
+- `optimization.reflection_model`: legacy GEPA reflection model; also the optional proposer model for `skillopt` (without it, heuristic proposers are used).
+- `optimization.skillopt_train_ratio`: task evidence fraction used for skill candidate proposal.
+- `optimization.skillopt_edit_budget`: maximum line edit operations allowed for SkillOpt candidates.
+- `optimization.skillopt_validation_delta`: minimum held-out validation gain required for SkillOpt acceptance.
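
As an illustration of what a bound on "line edit operations" could mean, here is a small sketch that counts changed lines with the standard-library `difflib`. The counting rule (each replaced, inserted, or deleted line counts once) and the names `count_line_edits` and `within_edit_budget` are assumptions for the example; CodexOpt may count edits differently.

```python
import difflib


def count_line_edits(original: str, candidate: str) -> int:
    """Count line-level edits between two documents."""
    a, b = original.splitlines(), candidate.splitlines()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    edits = 0
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # A replace of n lines by m lines is counted as max(n, m) edits.
            edits += max(i2 - i1, j2 - j1)
    return edits


def within_edit_budget(original: str, candidate: str, budget: int = 24) -> bool:
    """Return True when the candidate stays within the allowed number of edits."""
    return count_line_edits(original, candidate) <= budget
```

A budget like this keeps each candidate close to the current document, which makes a score change easier to attribute to specific edits.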
 
 ## How Scoring Works
 
@@ -317,42 +347,89 @@ Candidate transforms include:
 
 The best candidate is selected by score delta. If delta is below `min_apply_delta`, original content is kept.
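
The selection rule in the sentence above can be sketched like this. It is a simplified illustration under stated assumptions: `select_best` and the `score` callable are hypothetical, and ties are resolved in favor of the earlier candidate.

```python
from typing import Callable, Iterable


def select_best(
    original: str,
    candidates: Iterable[str],
    score: Callable[[str], float],
    min_apply_delta: float = 0.01,
) -> str:
    """Return the highest-gain candidate, or the original if the gain is too small."""
    base = score(original)
    best, best_delta = original, 0.0
    for candidate in candidates:
        delta = score(candidate) - base
        if delta > best_delta:
            best, best_delta = candidate, delta
    # Keep the original content when the best gain is below the threshold.
    return best if best_delta >= min_apply_delta else original
```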
 
-### GEPA engine (optional)
+### Reflective engine
 
-CodexOpt can call `gepa.optimize_anything` when `--engine gepa` is selected.
+The maintained SkillOpt/GEPA-inspired path is `--engine reflective`, or the
+Codex-user shortcut `codexopt improve`. It evaluates a candidate document on
+tasks, captures textual feedback, asks an optimizer model to rewrite the
+document, and accepts the rewrite only when it improves held-out validation
+tasks.
 
-The GEPA path is model-agnostic. In practice, teams can use any reflection model
-supported by their GEPA / LiteLLM setup, including OpenAI, Gemini, local models,
-or other compatible providers. That means you can ask GEPA to generate feedback
-and candidate improvements using whichever model gives you the best quality /
-cost tradeoff for your workflow.
+Defaults stay offline and use static/verifier signals. To run the full live
+Codex loop, use:
 
-Requirements:
+```bash
+codexopt improve --live
+```
 
-- `gepa` installed in the environment.
-- A valid reflection model via `--reflection-model` or config.
+`--live` uses `codex exec` as both optimizer and judge. You can also set
+`reflective.optimizer_model` and `reflective.judge_model` to `codex`,
+`openai/<model>`, or another OpenAI-compatible model.
 
-Common examples:
+### Legacy GEPA engine
+
+`--engine gepa` is deprecated. It targeted an older `gepa.optimize_anything`
+API and now falls back to heuristic optimization with a warning. Use `--engine reflective` instead.
+
+For SkillOpt-style skill optimization:
 
 ```yaml
 optimization:
-  engine: "gepa"
-  reflection_model: "openai/gpt-5-mini"
+  engine: "skillopt"
+  reflection_model: "openai/gpt-5-mini"  # optional; without it, heuristic proposers are used
+  skillopt_train_ratio: 0.67
+  skillopt_edit_budget: 24
+  skillopt_validation_delta: 0.01
 ```
 
-```yaml
-optimization:
-  engine: "gepa"
-  reflection_model: "gemini/gemini-2.5-pro"
+Executable rollout task files can be listed in `evidence.task_files`:
+
+```json
+[
+  {
+    "name": "skill-verifier",
+    "description": "Run a repo-local verifier against the candidate skill.",
+    "command": ["python", "scripts/verify_skill.py"],
+    "timeout_seconds": 30
+  }
+]
 ```
 
-For OpenAI-backed GEPA runs, set:
+Codex-backed rollout tasks can use `backend: "codex"` and `codex_prompt`:
+
+```json
+[
+  {
+    "name": "codex-skill-task",
+    "backend": "codex",
+    "description": "Run Codex against the candidate skill.",
+    "codex_prompt": "Use the local skill to update CHANGELOG.md for a patch release.",
+    "timeout_seconds": 120,
+    "expected_final_response_contains": "CHANGELOG.md",
+    "expected_file_change": "CHANGELOG.md",
+    "expected_file_contains": {
+      "path": "CHANGELOG.md",
+      "contains": "Patch"
+    }
+  }
+]
+```
+
+CodexOpt evaluates those commands in a temporary copy of the repo with the candidate `SKILL.md`
+written in place, then records pass/fail details in `optimize.json`. For Codex-backed rollouts,
+CodexOpt also parses `codex exec --json` events into trajectory metadata: final response,
+commands, file changes, token usage, and errors.
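
The `expected_*` fields from the Codex-backed task could be checked against the parsed trajectory along these lines. This is a sketch with assumed semantics: `check_expectations` is a hypothetical helper, `expected_file_change` is interpreted as "this path appears in the list of changed files", and all checks use plain substring matching.

```python
from pathlib import Path
from typing import Iterable, Mapping


def check_expectations(
    task: Mapping,
    final_response: str,
    workdir: Path,
    changed_files: Iterable[str],
) -> dict:
    """Evaluate the optional expected_* fields of a rollout task."""
    results = {}

    needle = task.get("expected_final_response_contains")
    if needle is not None:
        results["final_response"] = needle in final_response

    changed = task.get("expected_file_change")
    if changed is not None:
        results["file_change"] = changed in set(changed_files)

    spec = task.get("expected_file_contains")
    if spec is not None:
        path = Path(workdir) / spec["path"]
        results["file_contains"] = (
            path.is_file() and spec["contains"] in path.read_text(encoding="utf-8")
        )

    # The task passes only when every configured check passes.
    results["passed"] = all(results.values())
    return results
```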
+
+For OpenAI-compatible reflective models, set the provider credentials and use
+`reflective.optimizer_model` / `reflective.judge_model` values such as
+`openai/gpt-5-mini`:
 
 ```bash
 export OPENAI_API_KEY="your-openai-key"
 ```
 
-For Gemini-backed GEPA runs, set:
+For Gemini-compatible endpoints, set the credentials that your OpenAI-compatible client
+expects, or run `codexopt improve --live` to use `codex exec` directly.
 
 ```bash
 export GEMINI_API_KEY="your-gemini-key"
@@ -361,7 +438,8 @@ export GOOGLE_API_KEY="$GEMINI_API_KEY"
 
 Fallback behavior:
 
-- If GEPA is unavailable or errors, CodexOpt falls back to heuristic optimization.
+- If a configured optimizer or judge model is unavailable, CodexOpt records a note and
+  falls back to the weaker heuristic/static path.
 - Fallbacks are recorded in optimization artifacts, CLI summaries, and reports.
 
 ## Artifacts and State
@@ -514,17 +592,17 @@ uv run codexopt apply --kind agents --run-id <run_id>
 
 Cause:
 
-- `gepa` is not installed, or
-- `reflection_model` is missing.
+- The legacy GEPA engine targeted an older `gepa.optimize_anything` API.
 
 Behavior:
 
-- CodexOpt falls back to heuristic optimization when GEPA errors.
+- CodexOpt falls back to heuristic optimization and records the deprecation reason.
 
 Fix:
 
 ```bash
-uv run codexopt optimize agents --engine gepa --reflection-model <model_name>
+uv run codexopt optimize agents --engine reflective
+uv run codexopt improve --live
 ```
 
 ### `apply --dry-run` says files would be applied, but nothing changed
 
@@ -7,6 +7,8 @@ targets:
   skills_globs:
     - ".codex/skills/**/SKILL.md"
     - "**/.codex/skills/**/SKILL.md"
+    - ".agents/skills/**/SKILL.md"
+    - "**/.agents/skills/**/SKILL.md"
   exclude_globs:
     - ".git/**"
     - ".codexopt/**"
@@ -23,3 +25,23 @@ optimization:
   min_apply_delta: 0.01
   max_metric_calls: 60
   reflection_model: null
+  skillopt_train_ratio: 0.67
+  skillopt_edit_budget: 24
+  skillopt_validation_delta: 0.01
+# Settings for the GEPA-inspired reflective engine (`--engine reflective`,
+# also used by `codexopt improve`). It evolves a document against real Codex
+# rollouts scored by a tiered reward (verifier -> LLM-judge -> static).
+reflective:
+  # Model specs: "codex" (uses `codex exec`, no API key), "openai/<model>",
+  # "<model>", or null to disable that role. Defaults stay offline; use
+  # `codexopt improve --live` to opt into Codex-backed rollouts and mutation.
+  optimizer_model: null        # reflective-mutation LLM; null -> heuristic proposer (weak signal)
+  judge_model: null            # LLM-judge for trajectories; null disables the judge tier
+  reward_mode: "tiered"        # tiered | verifier | judge | static
+  minibatch_size: 3
+  max_iterations: 6            # reflect -> mutate -> gate iterations (optimization budget)
+  edit_budget: 12              # max line-edit operations per mutation (gradient clipping)
+  valset_ratio: 0.34           # held-out fraction used by the validation gate
+  max_rollouts: 60             # hard cap on Codex/verifier executions per run (cost guard)
+  seed: 0
+  codex_binary: "codex"
@@ -39,6 +39,9 @@ If `task_files` or `issue_files` are configured, benchmark output includes:
 - criterion-level sub-scores
 - natural-language feedback
 
+JSON task files can include executable rollout tasks for `skillopt`; benchmark still treats their
+descriptions as evidence, while optimization can run their commands as validation gates.
+
 ## Example
 
 ```bash
@@ -54,4 +57,3 @@ issue_files: 1
 - agents: /path/to/AGENTS.md
   score=0.4700 issues=contradictions, duplicate_lines, missing_output_contract
 ```
-