23 changes: 23 additions & 0 deletions skills/evaluator-patterns/SKILL.md

🟡 Pattern 2 (bash evaluator) missing preflight guard despite commit claiming all patterns updated

The commit message states "add preflight guard to all evaluator pattern templates" and the I/O contract section (skills/evaluator-patterns/SKILL.md:14) recommends a preflight guard for all evaluators, yet Pattern 2 (evaluator.sh, lines 138–208) is the only template that was not updated with one. Ironically, this is the bash evaluator that runs pytest — arguably the slowest pattern and the most likely to exceed the 10-second preflight timeout (src/optimize_anything/preflight.py:99). Users who copy this template verbatim will have their evaluator fail the preflight check, blocking optimization entirely (src/optimize_anything/cli.py:490-491).

(Refers to lines 138-209)

Prompt for agents
In skills/evaluator-patterns/SKILL.md, add a preflight guard to the Pattern 2 bash evaluator template (evaluator.sh). After the line that extracts the candidate (line 143, candidate="$(printf '%s' "$payload" | python3 -c ...)"), add a preflight check before the workdir/pytest logic, similar to:

# Preflight guard for optimize-anything CLI
if [ "$candidate" = "__optimize_anything_preflight__" ]; then
  printf '{"score":0.5}\n'
  exit 0
fi

This should be inserted around line 144, before the mktemp/workdir creation on line 145.

@@ -11,6 +11,13 @@ Use this skill to generate evaluator scripts that follow the optimize-anything c
- Read one JSON object from stdin: `{"candidate": "..."}`
- Write one JSON object to stdout on a single line: `{"score": <float>, ...diagnostics...}`
- Return a numeric `score` (recommended in `0.0..1.0`)
- **Preflight guard** (recommended): detect `"__optimize_anything_preflight__"` candidate and fast-return
```python
candidate = str(payload.get("candidate", ""))
if candidate == "__optimize_anything_preflight__":
    print(json.dumps({"score": 0.5}, separators=(",", ":")))
    return 0
```

---

@@ -36,6 +43,11 @@ def main() -> int:
    payload = json.load(sys.stdin)
    candidate = str(payload.get("candidate", ""))

    # Preflight guard for optimize-anything CLI
    if candidate == "__optimize_anything_preflight__":
        print(json.dumps({"score": 0.5}, separators=(",", ":")))
        return 0

    dimensions = [
        {"name": "clarity", "weight": 0.35, "guide": "Clear, specific, easy to follow"},
        {"name": "constraint_following", "weight": 0.35, "guide": "Respects constraints and boundaries"},
@@ -222,6 +234,12 @@ def clamp01(x: float) -> float:
def main() -> int:
    payload = json.load(sys.stdin)
    candidate = str(payload.get("candidate", ""))

    # Preflight guard for optimize-anything CLI
    if candidate == "__optimize_anything_preflight__":
        print(json.dumps({"score": 0.5}, separators=(",", ":")))
        return 0

    text = candidate.strip()

    words = re.findall(r"\w+", text)
@@ -286,6 +304,11 @@ def main() -> int:
    payload = json.load(sys.stdin)
    candidate = str(payload.get("candidate", "")).lower()

    # Preflight guard for optimize-anything CLI (check raw payload, not lowercased)
    if payload.get("candidate", "") == "__optimize_anything_preflight__":
        print(json.dumps({"score": 0.5}, separators=(",", ":")))
        return 0

    scenarios = {
        "ambiguous_request": ["clarifying question", "assumption"],
        "tool_failure": ["retry", "fallback", "error message"],
8 changes: 8 additions & 0 deletions skills/generate-evaluator/SKILL.md
@@ -14,6 +14,13 @@ Generate an evaluator that scores candidate artifacts for optimization with gepa
- Default payload: `{"candidate": "<text>"}`
- Dataset-aware payload (`--dataset`): `{"candidate": "<text>", "example": {...}}`
- Output JSON must include `score` (float, usually in `[0,1]`), plus optional side-info fields.
- **Preflight detection** (command evaluators only): The CLI sends `"__optimize_anything_preflight__"` as the candidate text before optimization starts. Your evaluator should detect this and return immediately:
```python
if candidate == "__optimize_anything_preflight__":
    print(json.dumps({"score": 0.5}))
    sys.exit(0)
```
  This avoids the 10-second preflight timeout for evaluators that make slow API calls (see the skeleton below).
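
A minimal command-evaluator skeleton that follows this contract is sketched here; it handles both payload shapes and the preflight sentinel, and the scoring logic is only a placeholder to be replaced with objective-specific checks:

```python
#!/usr/bin/env python3
"""Skeleton command evaluator: contract-compliant, scoring is a placeholder."""
import json
import sys


def main() -> int:
    payload = json.load(sys.stdin)
    candidate = str(payload.get("candidate", ""))

    # Preflight guard for optimize-anything CLI
    if candidate == "__optimize_anything_preflight__":
        print(json.dumps({"score": 0.5}))
        return 0

    # Present only when --dataset is used
    example = payload.get("example") or {}

    # Placeholder scoring: non-empty output earns 0.5; matching the expected
    # answer (when a dataset example provides one) earns 1.0
    score = 0.5 if candidate.strip() else 0.0
    expected = str(example.get("expected", ""))
    if expected and expected.lower() in candidate.lower():
        score = 1.0

    print(json.dumps({"score": score, "feedback": "placeholder scoring"}))
    return 0


if __name__ == "__main__":
    sys.exit(main())
```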

## Choose an Evaluator Pattern

@@ -83,3 +90,4 @@ echo '{"candidate":"text","example":{"input":"q","expected":"a"}}' | python3 eva
4. Customize scoring logic and side-info fields.
5. Test with stdin payloads. You should see JSON with `score` plus diagnostic fields.
6. Validate score range: a good seed should score between 0.3 and 0.7. If it scores above 0.85, the evaluator lacks discrimination.
7. Test preflight: `echo '{"candidate":"__optimize_anything_preflight__"}' | python3 your_evaluator.py` — should return `{"score": 0.5}` instantly.
69 changes: 68 additions & 1 deletion skills/optimization-guide/SKILL.md
@@ -17,6 +17,47 @@ Start with your current best version of the artifact. `gepa` evolves from here.
### 2. Create an Evaluator
Use the **generate-evaluator** skill to create one matched to your objective. The evaluator is the most critical piece—`gepa`'s optimization quality is bounded by your evaluator's feedback quality.

### 2b. Choose Your Evaluator Interface

Three interfaces exist. Pick based on where your evaluator code lives:

| Your evaluator is... | Use this interface | Evaluator signature |
|---|---|---|
| Python code in the same project | **Python API** — pass a function to `optimize_anything()` | `def eval(candidate: str) -> float` or `-> tuple[float, dict]` |
| A standalone script/binary | **CLI command** — `--evaluator-command` | Reads `{"candidate": "..."}` from stdin, writes `{"score": float}` to stdout |
| A remote service | **HTTP endpoint** — `--evaluator-url` | POST `{"candidate": "..."}`, response `{"score": float}` |

**Prefer the Python API** when your evaluator is Python code. It bypasses all CLI overhead: no preflight timeout, no argparse conflicts, no subprocess timeout, no stdin/stdout protocol. Your evaluator is just a function:

```python
import gepa.optimize_anything as oa

def my_evaluator(candidate: str) -> tuple[float, dict]:
    score = run_my_scoring(candidate)
    oa.log("Details:", some_diagnostic)  # captured as ASI
    return score, {"feedback": "..."}

result = oa.optimize_anything(
    seed_candidate=open("seed.txt").read(),
    evaluator=my_evaluator,
    objective="maximize quality",
    config=oa.GEPAConfig(engine=oa.EngineConfig(max_metric_calls=100)),
)
```

**Use CLI command** only when your evaluator is a separate process (different language, isolated environment, or shared team tooling). Wrap evaluator-specific flags in a shell script — do not pass them through `--evaluator-command`:

```bash
#!/bin/bash
# evaluators/eval.sh (bakes in your evaluator's flags)
cd "$(dirname "$0")/.."
exec python -m my_eval.scorer --subset-size 5 --temperature 0.0
```

```bash
optimize-anything optimize seed.txt --evaluator-command bash evaluators/eval.sh
```
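
**Use the HTTP endpoint** when the evaluator runs as a remote service. A minimal sketch of such an endpoint follows; it assumes `--evaluator-url` points at a plain POST endpoint that accepts `{"candidate": "..."}` and returns `{"score": <float>}`, and the port, path, and scoring below are illustrative only:

```python
# Minimal HTTP evaluator sketch (Python stdlib only). The request/response
# shapes follow the table above; port, path, and scoring are illustrative.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


class EvaluatorHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length) or b"{}")
        candidate = str(payload.get("candidate", ""))

        # Placeholder scoring: reward non-empty candidates
        body = json.dumps({"score": 0.5 if candidate.strip() else 0.0}).encode()

        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


if __name__ == "__main__":
    HTTPServer(("127.0.0.1", 8000), EvaluatorHandler).serve_forever()
```

With the server running, point the CLI at it, for example `optimize-anything optimize seed.txt --evaluator-url http://127.0.0.1:8000/` (the URL and port are assumptions; use wherever your service is reachable).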

### 3. Choose Optimization Mode

**Single-task** (no dataset) — optimize one artifact against one evaluator:
@@ -144,4 +185,30 @@ The result contains:
3. Clarify the objective: Set the `objective` string that is injected into `gepa`'s reflection prompt and specify constraints like token limits or format requirements.
4. Add background context: Use `background` for domain knowledge, constraints, or strategies such as "Target audience is non-technical users. Never use jargon."
5. Iterate on the evaluator: Improve the evaluator before increasing `budget` if optimization results on `seed.txt` are poor.
6. Set evaluator working directory: Pass `evaluator_cwd` as an absolute path to the project directory that contains `seed.txt` and `evaluators/eval.sh` whenever the evaluator command relies on repo-relative files or scripts.

## Preflight Behavior (CLI only)

When using `--evaluator-command`, the CLI runs a **preflight check** before optimization starts. It sends:

```json
{"_protocol_version": 2, "candidate": "__optimize_anything_preflight__"}
```

The preflight has a **10-second timeout**. If your evaluator makes slow API calls (LLM inference, database queries), running the sentinel through the real evaluation pipeline will exceed that limit. Two solutions:

1. **Detect the sentinel** in your evaluator and fast-return:
```python
payload = json.load(sys.stdin)
candidate = payload["candidate"]
if candidate == "__optimize_anything_preflight__":
    print(json.dumps({"score": 0.5}))
    sys.exit(0)
# ... actual evaluation below
```

2. **Use the Python API instead** — it has no preflight step at all.

### Command Evaluator Timeout

The `command_evaluator` has a default **30-second timeout per evaluation call** (not configurable via CLI). If your evaluator takes longer than 30s per call, use the Python API or the HTTP evaluator interface.