Evaluate the following prompt, which is designed for large language models, on a scale of 0.0 to 1.0 for each of these metrics:
1. **Clarity** (0.0-1.0): How clear and unambiguous are the instructions? Are there any confusing or contradictory elements?
2. **Specificity** (0.0-1.0): Does the prompt provide appropriate detail and constraints without being overly restrictive? Does it guide the model effectively?
3. **Robustness** (0.0-1.0): Will this prompt handle edge cases and varied inputs well? Is it resilient to different phrasings or unexpected scenarios?
4. **Format_specification** (0.0-1.0): Is the expected output format clearly defined? Will the model know exactly how to structure its response?
Prompt to evaluate:
```
{current_program}
```
Consider that this prompt is designed for structured tasks such as mathematical problem-solving or classification, where accuracy and consistency are important.
Evaluation guidelines:
- A score of 1.0 means excellent/optimal for that dimension
- A score of 0.5 means adequate but with room for improvement
- A score of 0.0 means severely lacking in that dimension
- Consider how well the prompt would work across different models and contexts
Return your evaluation as a JSON object with the following format:
{{
  "clarity": [score],
  "specificity": [score],
  "robustness": [score],
  "format_specification": [score],
  "reasoning": "[brief explanation of scores, highlighting strengths and areas for improvement]"
}}
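
A minimal usage sketch, not part of the template itself: the doubled braces and the `{current_program}` placeholder suggest the file is rendered with Python's `str.format`, so a caller might fill it and parse the model's JSON reply roughly like this (the file path, the sample prompt, and the `call_model` stub are all assumptions for illustration):

```
import json

# Hypothetical stand-in for a real LLM client call; replace with your own.
def call_model(prompt: str) -> str:
    return json.dumps({
        "clarity": 0.8,
        "specificity": 0.7,
        "robustness": 0.6,
        "format_specification": 0.9,
        "reasoning": "Placeholder reply for illustration only.",
    })

# Load the template (path is an assumption).
with open("evaluation.txt") as f:
    template = f.read()

# str.format substitutes {current_program} and collapses the doubled
# braces {{ }} into the literal braces of the JSON skeleton above.
evaluation_prompt = template.format(
    current_program="Solve the math problem step by step; answer with a single number."
)

scores = json.loads(call_model(evaluation_prompt))
for metric in ("clarity", "specificity", "robustness", "format_specification"):
    print(f"{metric}: {scores[metric]:.2f}")
print("reasoning:", scores["reasoning"])
```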