
Commit bb5be25

Added critique rules, temperature
1 parent c443eb8 commit bb5be25

13 files changed: 848 additions & 72 deletions


docs/api/coordinator.md

Lines changed: 1 addition & 0 deletions

@@ -32,6 +32,7 @@ Coordinator protocol and implementations for learning decision logic.
 - guide_refinement
 - on_learning_complete
 - audit_rules
+- critique_rules

 ## AuditResult

docs/guide/advanced.md

Lines changed: 1 addition & 0 deletions

@@ -159,6 +159,7 @@ For fine-tuning, use only the `messages` field. The `metadata` and `call_type` a
 | `synthetic_generation` | Learner | Synthetic example generation |
 | `guide_refinement` | Coordinator | Per-iteration refinement guidance |
 | `audit_rules` | Coordinator | Rule pruning/merging audit |
+| `critique_rules` | Coordinator | Critic agent feedback on ruleset |
 | `trigger_decision` | Coordinator | Should-learn decision |

 ### Generating training data at scale
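Not part of the commit, but to make the new `critique_rules` row concrete: a minimal sketch of filtering a captured training log down to fine-tuning records for that call type. It assumes the logger writes one JSON object per line with `call_type`, `messages`, and `metadata` keys; the file names are hypothetical.

```python
import json

# Hypothetical paths; assumes one JSON record per line with
# "call_type", "messages", and "metadata" keys.
with open("training_log.jsonl") as f:
    records = [json.loads(line) for line in f if line.strip()]

# Keep only critic calls, and keep only the chat messages,
# since the guide says fine-tuning should use the `messages` field alone.
critic_examples = [
    {"messages": r["messages"]}
    for r in records
    if r.get("call_type") == "critique_rules"
]

with open("critic_finetune.jsonl", "w") as f:
    for ex in critic_examples:
        f.write(json.dumps(ex) + "\n")
```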

docs/guide/coordinators.md

Lines changed: 28 additions & 2 deletions

@@ -17,8 +17,8 @@ The default coordinator uses threshold-based heuristics:
 from rulechef import SimpleCoordinator

 coordinator = SimpleCoordinator(
-    trigger_threshold=10,    # Min new examples before first learn
-    correction_threshold=5,  # Min corrections before refinement
+    trigger_threshold=50,    # Min new examples before first learn
+    correction_threshold=10, # Min corrections before refinement
 )

 chef = RuleChef(task, client, coordinator=coordinator)

@@ -84,6 +84,7 @@ coordinator = AgenticCoordinator(
     client,
     model="gpt-4o-mini",
     prune_after_learn=True,
+    audit_interval=3,  # Run mid-refinement audit every N iterations (default: 3, 0 to disable)
 )
 chef = RuleChef(task, client, coordinator=coordinator)
 chef.learn_rules()

@@ -98,6 +99,27 @@ A safety net re-evaluates after pruning and **reverts all changes** if F1 drops

 In the CLI: `learn --agentic --prune`.

+### Critic Agent
+
+Enable `enable_critic` to run an LLM critic before refinement. The critic acts like a human domain expert — it reviews the entire ruleset with per-rule metrics, false positive/negative examples, and per-class performance, then provides actionable feedback:
+
+```python
+coordinator = AgenticCoordinator(
+    client,
+    model="gpt-4o-mini",
+    prune_after_learn=True,
+    enable_critic=True,
+    critic_interval=4,  # Run critic every N iterations (default: 4, 0 to disable)
+)
+```
+
+The critic runs periodically during refinement (controlled by `critic_interval`) and writes feedback using the same mechanism as `add_feedback()`:
+
+- **Rule-level feedback**: Specific advice per rule (e.g., "Narrow `\d+` by adding context for dollar amounts")
+- **Task-level feedback**: Strategic guidance about class disambiguation and priority ordering
+
+This feedback is automatically picked up by the patch prompt in subsequent refinement iterations. Critic feedback is tagged with `source="critic"` and refreshed each learning cycle.
+
 ## Custom Coordinators

 Implement the `CoordinatorProtocol`:

@@ -133,6 +155,10 @@ class MyCoordinator(CoordinatorProtocol):
     def audit_rules(self, rules, rule_metrics):
         """Return AuditResult with merge/remove actions. Default: no-op."""
         return AuditResult()
+
+    def critique_rules(self, rules, rule_metrics, eval_result, dataset):
+        """Return feedback dict or None. Default: no critique."""
+        return None
 ```

 ### CoordinationDecision Fields
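As a reading aid (not part of the commit): a sketch of a custom coordinator that fills in the new `critique_rules` hook with a purely heuristic critic instead of an LLM, returning the `rule_feedback`/`task_guidance` dict shape described above. The metric attribute names (`rule_id`, `true_positives`, `false_positives`) mirror the ones the `AgenticCoordinator` critic reads; treat them as assumptions for your own `rule_metrics` objects.

```python
class HeuristicCriticCoordinator(MyCoordinator):
    """Illustration only: flags low-precision rules without calling an LLM."""

    def critique_rules(self, rules, rule_metrics, eval_result, dataset):
        rule_feedback = {}
        for m in rule_metrics:
            # Flag rules whose matches are mostly wrong.
            if m.false_positives > 2 * max(m.true_positives, 1):
                rule_feedback[m.rule_id] = (
                    "Most matches are false positives; tighten the pattern "
                    "or restrict it to a narrower context."
                )
        if not rule_feedback:
            return None  # nothing to report
        return {
            "analysis": f"{len(rule_feedback)} rules look too broad.",
            "rule_feedback": rule_feedback,
            "task_guidance": "Prefer narrower, context-anchored patterns over broad catch-alls.",
        }
```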

docs/guide/learning.md

Lines changed: 5 additions & 0 deletions

@@ -130,9 +130,14 @@ Incremental patching:

 - Generates targeted rules for known failures
 - Merges new rules into the existing ruleset
+- **Deletes** underperforming rules when better replacements are provided
 - Prunes weak rules that don't contribute
 - Preserves stable rules that are working

+During patching, the LLM can list rules in a `"deleted_rules"` field to remove them from the ruleset. This is used when a rule is too broad (high false positives) and the LLM provides narrower replacement rules in the same response.
+
+A patch is accepted if micro F1 stays within 0.5%, or if precision improves (higher precision at the cost of some recall is considered a net quality win). Otherwise the patch is rejected and the previous rules are kept.
+
 ## Persistence

 Rules and datasets are automatically saved to disk:
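Not part of the diff, just to make the acceptance rule above concrete: a sketch of the check, assuming both evaluation results expose `micro_f1` and `micro_precision` (attribute names borrowed from the coordinator code later in this commit).

```python
def accept_patch(before, after) -> bool:
    """Accept a patch if micro F1 stays within 0.5% of the pre-patch score,
    or if micro precision improves (precision traded for some recall)."""
    if after.micro_f1 >= before.micro_f1 - 0.005:
        return True
    return after.micro_precision > before.micro_precision
```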

rulechef/coordinator.py

Lines changed: 235 additions & 8 deletions
@@ -120,6 +120,24 @@ def audit_rules(self, rules: list["Rule"], rule_metrics: list[Any]) -> AuditResult:
         """
         return AuditResult()

+    def critique_rules(
+        self,
+        rules: list["Rule"],
+        rule_metrics: list[Any],
+        eval_result: Any,
+        dataset: Any,
+    ) -> dict | None:
+        """LLM critic reviews the full ruleset and provides actionable feedback.
+
+        Called before refinement iterations start. Returns a dict with
+        rule_feedback (per-rule advice) and task_guidance (strategic guidance).
+        The feedback is written to the dataset as structured feedback and
+        automatically picked up by patch prompts.
+
+        Default: no critique.
+        """
+        return None
+

 class SimpleCoordinator(CoordinatorProtocol):
     """
@@ -251,6 +269,9 @@ def __init__(
         min_correction_batch: int = 1,
         verbose: bool = True,
         prune_after_learn: bool = False,
+        enable_critic: bool = False,
+        audit_interval: int = 3,
+        critic_interval: int = 4,
         training_logger=None,
     ):
         """
@@ -261,6 +282,10 @@ def __init__(
             min_correction_batch: Minimum corrections before asking LLM
             verbose: Print coordination decisions
             prune_after_learn: If True, audit and prune/merge rules after learning
+            enable_critic: If True, run LLM critic before refinement to provide
+                strategic feedback on the ruleset.
+            audit_interval: Run mid-refinement audit every N iterations (0 to disable).
+            critic_interval: Run critic every N iterations (0 to disable).
             training_logger: Optional TrainingDataLogger for capturing LLM calls.
         """
         self.llm = llm_client
@@ -269,7 +294,17 @@ def __init__(
         self.min_correction_batch = min_correction_batch
         self.verbose = verbose
         self.prune_after_learn = prune_after_learn
+        self.enable_critic = enable_critic
+        self.audit_interval = audit_interval
+        self.critic_interval = critic_interval
         self.training_logger = training_logger
+        self.temperature: float | None = None
+
+    def _temp_kwargs(self) -> dict:
+        """Return temperature kwarg dict if set, empty dict otherwise."""
+        if self.temperature is not None:
+            return {"temperature": self.temperature}
+        return {}

     def should_trigger_learning(
         self, buffer: "ExampleBuffer", current_rules: list["Rule"] | None
@@ -379,6 +414,7 @@ def guide_refinement(self, eval_result: Any, iteration: int, max_iterations: int
                 model=self.model,
                 messages=[{"role": "user", "content": prompt}],
                 response_format={"type": "json_object"},
+                **self._temp_kwargs(),
             )
             response_text = response.choices[0].message.content
             result = json.loads(response_text)
@@ -456,18 +492,25 @@ def audit_rules(self, rules: list["Rule"], rule_metrics: list[Any]) -> AuditResult:
 RULES:
 {json.dumps(rule_entries, indent=2)}

-Your job is to CONSOLIDATE the ruleset. Prefer MERGING over removing.
+Your job is to CONSOLIDATE and CLEAN the ruleset.

-ACTIONS:
-1. MERGE: Two+ regex rules with similar patterns targeting the same output/label.
-   Combine their patterns into one rule (e.g. merge `(?:bad|awful)` and `(?:terrible|worst)` into `(?:bad|awful|terrible|worst)`).
-   Only merge rules of the same format and same output_template/output_key.
-2. REMOVE: Only for rules that are pure noise — precision=0 AND matches>0 (every match is wrong).
+ACTIONS (in priority order):
+1. MERGE: Two+ rules with similar/overlapping patterns targeting the same output/label.
+   Combine into one rule. Only merge rules of the same format and same output_template/output_key.
+2. REMOVE rules that hurt more than they help:
+   - precision=0 AND matches>0 (pure noise — every match is wrong)
+   - false_positives > 2x true_positives (rule causes more harm than good)
+   - Memorized exact strings from training data that won't generalize (e.g. matching one specific phrase)
+3. TIGHTEN: If a rule has high FP, return it as a merge-with-self — same rule_id but a narrower pattern.

 IMPORTANT — do NOT remove:
-- Rules with low F1/recall — they may catch rare but important cases
+- The only rule for a class/label — even if it looks weak, tighten it instead
 - Rules with 0 matches — the training set may be small, they could help on unseen data
-- The only rule for a class/label — even if it looks weak
+
+LOOK FOR:
+- Near-duplicate rules (same type, similar regex) — merge them
+- Overly broad patterns (like bare \\d+ or [A-Z][a-z]+) with high FP — tighten or remove
+- Rules that are subsets of other rules (one pattern already covered by another)

 Return JSON:
 {{
486529
model=self.model,
487530
messages=[{"role": "user", "content": prompt}],
488531
response_format={"type": "json_object"},
532+
**self._temp_kwargs(),
489533
)
490534
result = json.loads(response.choices[0].message.content)
491535

@@ -543,6 +587,188 @@ def audit_rules(self, rules: list["Rule"], rule_metrics: list[Any]) -> AuditResu
543587
print(f"⚠ Audit error: {e}")
544588
return AuditResult()
545589

590+
def critique_rules(
591+
self,
592+
rules: list["Rule"],
593+
rule_metrics: list[Any],
594+
eval_result: Any,
595+
dataset: Any,
596+
) -> dict | None:
597+
"""LLM critic reviews the full ruleset like a human domain expert.
598+
599+
Sees: task definition, all rules with full patterns, per-rule metrics
600+
with FP/FN examples, per-class performance, and system-level FP examples.
601+
Returns actionable per-rule feedback and strategic task guidance.
602+
"""
603+
import json
604+
605+
if not self.enable_critic or not rules or not rule_metrics:
606+
return None
607+
608+
task = dataset.task
609+
metrics_by_id = {m.rule_id: m for m in rule_metrics}
610+
611+
# Task context
612+
task_section = f"TASK: {task.name}\n{task.description}\n"
613+
task_section += f"Type: {task.type.value}\n"
614+
task_section += f"Input schema: {json.dumps(task.input_schema)}\n"
615+
if hasattr(task.output_schema, "model_fields"):
616+
task_section += "Output schema: Pydantic model\n"
617+
else:
618+
task_section += f"Output schema: {json.dumps(task.output_schema)}\n"
619+
620+
# Overall performance
621+
perf_section = (
622+
f"\nOVERALL PERFORMANCE:\n"
623+
f" micro F1={eval_result.micro_f1:.1%}, "
624+
f"P={eval_result.micro_precision:.1%}, "
625+
f"R={eval_result.micro_recall:.1%}\n"
626+
f" exact_match={eval_result.exact_match:.1%} "
627+
f"({eval_result.total_docs} documents)\n"
628+
)
629+
630+
# Per-class performance (sorted worst-first)
631+
class_lines = ["\nPER-CLASS PERFORMANCE (sorted worst-first):"]
632+
sorted_classes = sorted(eval_result.per_class, key=lambda c: c.f1)
633+
for cm in sorted_classes:
634+
class_lines.append(
635+
f" {cm.label}: P={cm.precision:.0%} R={cm.recall:.0%} F1={cm.f1:.0%} "
636+
f"(TP={cm.tp} FP={cm.fp} FN={cm.fn})"
637+
)
638+
class_section = "\n".join(class_lines) + "\n"
639+
640+
# Rules with per-rule metrics and FP examples
641+
rules_lines = [f"\nRULES ({len(rules)} total):"]
642+
for rule in rules:
643+
m = metrics_by_id.get(rule.id)
644+
rules_lines.append(f'\n Rule: "{rule.name}" (id={rule.id})')
645+
rules_lines.append(f" Format: {rule.format.value}, Priority: {rule.priority}")
646+
rules_lines.append(f" Pattern: {rule.content}")
647+
if rule.output_template:
648+
rules_lines.append(f" Output template: {json.dumps(rule.output_template)}")
649+
if rule.output_key:
650+
rules_lines.append(f" Output key: {rule.output_key}")
651+
if m:
652+
rules_lines.append(
653+
f" Metrics: P={m.precision:.0%} R={m.recall:.0%} F1={m.f1:.0%} "
654+
f"(TP={m.true_positives} FP={m.false_positives}, {m.matches} total matches)"
655+
)
656+
# Show FP examples from sample_matches for this rule
657+
fp_samples = [s for s in m.sample_matches if s.get("fp", 0) > 0]
658+
if fp_samples:
659+
rules_lines.append(" FP examples from this rule:")
660+
for sample in fp_samples[:3]:
661+
input_text = sample.get("input", {})
662+
# Get first string value as context
663+
if isinstance(input_text, dict):
664+
text_vals = [v for v in input_text.values() if isinstance(v, str)]
665+
ctx = text_vals[0][:150] if text_vals else str(input_text)[:150]
666+
else:
667+
ctx = str(input_text)[:150]
668+
rules_lines.append(f' Input: "{ctx}"')
669+
rules_lines.append(
670+
f" Rule produced: {json.dumps(sample.get('rule_output', [])[:3])}"
671+
)
672+
rules_lines.append(
673+
f" Expected: {json.dumps(sample.get('expected', [])[:3])}"
674+
)
675+
else:
676+
rules_lines.append(" Metrics: (no metrics available)")
677+
rules_section = "\n".join(rules_lines) + "\n"
678+
679+
# System-level FP examples
680+
fp_section = ""
681+
if eval_result.fp_examples:
682+
fp_lines = [
683+
f"\nFALSE POSITIVES (system-level, {len(eval_result.fp_examples)} examples):"
684+
]
685+
for fp in eval_result.fp_examples[:20]:
686+
line = f' Predicted "{fp["predicted_text"]}" as {fp["predicted_type"]}'
687+
if fp.get("correct_type"):
688+
line += f" — should be {fp['correct_type']}"
689+
else:
690+
line += " — not an entity"
691+
if fp.get("context"):
692+
line += f' (context: "{fp["context"][:80]}")'
693+
fp_lines.append(line)
694+
fp_section = "\n".join(fp_lines) + "\n"
695+
696+
# FN documents (missed entities)
697+
fn_section = ""
698+
if eval_result.failures:
699+
fn_lines = ["\nMISSED ENTITIES (sample documents with errors):"]
700+
for f in eval_result.failures[:10]:
701+
input_data = f.get("input", {})
702+
if isinstance(input_data, dict):
703+
text_vals = [v for v in input_data.values() if isinstance(v, str)]
704+
ctx = text_vals[0][:150] if text_vals else str(input_data)[:150]
705+
else:
706+
ctx = str(input_data)[:150]
707+
fn_lines.append(f' Input: "{ctx}"')
708+
fn_lines.append(f" Expected: {json.dumps(f.get('expected', {}))}")
709+
fn_lines.append(f" Got: {json.dumps(f.get('got', {}))}")
710+
fn_lines.append("")
711+
fn_section = "\n".join(fn_lines)
712+
713+
prompt = f"""You are an expert Rule Critic acting as a human domain expert. You are reviewing a rule-based {task.type.value} system and providing actionable feedback.
714+
715+
{task_section}
716+
{perf_section}
717+
{class_section}
718+
{rules_section}
719+
{fp_section}
720+
{fn_section}
721+
ANALYZE HOLISTICALLY:
722+
1. Which rules cause the most harm and WHY? Show your reasoning.
723+
2. Are there inter-class conflicts? (same text matched by rules for different types)
724+
3. Are priority assignments correct? (higher priority runs first, wins conflicts)
725+
4. What patterns are MISSING for classes with low recall?
726+
5. What would a human regex expert change about these patterns?
727+
728+
PROVIDE FEEDBACK:
729+
- rule_feedback: For EACH problematic rule, provide SPECIFIC, ACTIONABLE advice.
730+
Bad: "This rule is too broad" (vague)
731+
Good: "Narrow \\d+ by adding word-boundary context: use (\\d+)\\s*(?:million|billion) for large numbers, and let MONEY/PERCENT rules handle $-prefixed and %-suffixed numbers by giving them higher priority"
732+
- task_guidance: Strategic advice about the ENTIRE ruleset — class disambiguation strategy, priority ordering, what kinds of rules are missing.
733+
734+
Return JSON:
735+
{{
736+
"analysis": "1-2 sentence summary of the main issues",
737+
"rule_feedback": {{
738+
"rule_id": "Specific actionable advice for this rule..."
739+
}},
740+
"task_guidance": "Strategic guidance about the full ruleset..."
741+
}}
742+
"""
743+
744+
try:
745+
response = self.llm.chat.completions.create(
746+
model=self.model,
747+
messages=[{"role": "user", "content": prompt}],
748+
response_format={"type": "json_object"},
749+
**self._temp_kwargs(),
750+
)
751+
result = json.loads(response.choices[0].message.content)
752+
753+
if self.training_logger:
754+
self.training_logger.log(
755+
"critique_rules",
756+
[{"role": "user", "content": prompt}],
757+
response.choices[0].message.content,
758+
{
759+
"num_rules": len(rules),
760+
"analysis": result.get("analysis", ""),
761+
"num_rule_feedback": len(result.get("rule_feedback", {})),
762+
},
763+
)
764+
765+
return result
766+
767+
except Exception as e:
768+
if self.verbose:
769+
print(f"⚠ Critic error: {e}")
770+
return None
771+
546772
def _ask_llm(
547773
self, buffer: "ExampleBuffer", current_rules: list["Rule"] | None
548774
) -> CoordinationDecision:
@@ -595,6 +821,7 @@ def _ask_llm(
                 model=self.model,
                 messages=[{"role": "user", "content": prompt}],
                 response_format={"type": "json_object"},
+                **self._temp_kwargs(),
            )

             response_text = response.choices[0].message.content
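A usage sketch (not in the diff) tying the two additions together: the critic is enabled through the constructor, while `temperature` is a plain attribute in this commit, so it is set after construction. The import path and the `client`/`task` objects are assumed, as in the docs examples.

```python
from rulechef import RuleChef
from rulechef.coordinator import AgenticCoordinator  # import path assumed

coordinator = AgenticCoordinator(
    client,                  # LLM client, as in the docs examples
    model="gpt-4o-mini",
    prune_after_learn=True,
    enable_critic=True,      # run the LLM critic before refinement
    critic_interval=4,       # critic every 4th iteration
    audit_interval=3,        # mid-refinement audit every 3rd iteration
)
coordinator.temperature = 0.2  # forwarded to coordinator LLM calls via _temp_kwargs()

chef = RuleChef(task, client, coordinator=coordinator)
chef.learn_rules()
```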
