RuleChef

Learn rule-based models from examples using LLM-powered synthesis.
Replace expensive LLM calls with fast, deterministic, inspectable rules.

What is RuleChef?

RuleChef learns regex, Python code, and spaCy patterns from labeled examples using LLM-powered synthesis. You provide examples, RuleChef generates rules, and those rules run locally without any LLM at inference time.

Why rules instead of LLMs?

Cost: Rules cost nothing to run. No API calls, no tokens.
Latency: Sub-millisecond per query vs hundreds of ms for LLM calls.
Determinism: Same input always produces the same output.
Inspectability: You can read, edit, and debug every rule.
No drift: Rules don't change unless you change them.

Results

On the Text Anonymization Benchmark (real court decisions, PII extraction), rules learned by RuleChef beat both prompting the same LLM and a dedicated neural extractor on format-structured entity types — at a fraction of the cost:

	Format-type F1	ms/doc
RuleChef (rules only)	78.7	0.6
LLM prompting (same model)	74.8	~1500
GLiNER2 (schema)	74.4	190

The LLM stays ahead on purely semantic types — rules are the high-precision tier, not a full replacement. In observation mode, rules learned from watching just 50 LLM calls replace 48% of subsequent calls at 96% precision. Details and per-type numbers: benchmarks/results/.

Installation

pip install rulechef

Extras:

pip install rulechef[grex]     # Regex pattern suggestions from examples
pip install rulechef[spacy]    # spaCy token/dependency matcher patterns
pip install rulechef[agentic]  # LLM-powered coordinator for adaptive learning
pip install rulechef[all]      # Everything

Quick Start

Extraction

Extract answer spans from text:

from openai import OpenAI
from rulechef import RuleChef, Task, TaskType

client = OpenAI()
task = Task(
    name="Q&A Extraction",
    description="Extract answer spans from context",
    input_schema={"question": "str", "context": "str"},
    output_schema={"spans": "List[Span]"},
    type=TaskType.EXTRACTION,
)

chef = RuleChef(task, client)

chef.add_example(
    {"question": "When?", "context": "Built in 1991"},
    {"spans": [{"text": "1991", "start": 9, "end": 13}]}
)
chef.add_example(
    {"question": "When?", "context": "Released in 2005"},
    {"spans": [{"text": "2005", "start": 12, "end": 16}]}
)

chef.learn_rules()

result = chef.extract({"question": "When?", "context": "Founded in 1997"})
print(result)  # {"spans": [{"text": "1997", ...}]}

Named Entity Recognition (NER)

from pydantic import BaseModel
from typing import List, Literal

class Entity(BaseModel):
    text: str
    start: int
    end: int
    type: Literal["DRUG", "DOSAGE", "CONDITION"]

class NEROutput(BaseModel):
    entities: List[Entity]

task = Task(
    name="Medical NER",
    description="Extract drugs, dosages, and conditions",
    input_schema={"text": "str"},
    output_schema=NEROutput,
    type=TaskType.NER,
)

chef = RuleChef(task, client)
chef.add_example(
    {"text": "Take Aspirin 500mg for headache"},
    {"entities": [
        {"text": "Aspirin", "start": 5, "end": 12, "type": "DRUG"},
        {"text": "500mg", "start": 13, "end": 18, "type": "DOSAGE"},
        {"text": "headache", "start": 23, "end": 31, "type": "CONDITION"},
    ]}
)
chef.learn_rules()

Classification

task = Task(
    name="Intent Classification",
    description="Classify banking customer queries",
    input_schema={"text": "str"},
    output_schema={"label": "str"},
    type=TaskType.CLASSIFICATION,
    text_field="text",
)

chef = RuleChef(task, client)
chef.add_example({"text": "what is the exchange rate?"}, {"label": "exchange_rate"})
chef.add_example({"text": "I want to know the rates"}, {"label": "exchange_rate"})
chef.add_example({"text": "my card hasn't arrived"}, {"label": "card_arrival"})

chef.learn_rules()
result = chef.extract({"text": "current exchange rate please"})
print(result)  # {"label": "exchange_rate"}

Transformation

task = Task(
    name="Invoice Parser",
    description="Extract company and amount from invoices",
    input_schema={"text": "str"},
    output_schema={"company": "str", "amount": "str"},
    type=TaskType.TRANSFORMATION,
)

chef = RuleChef(task, client)
chef.add_example(
    {"text": "Invoice from Acme Corp for $1,500.00"},
    {"company": "Acme Corp", "amount": "$1,500.00"}
)
chef.learn_rules()

Core Concepts

Task Types

Type	Output	Use Case
`EXTRACTION`	`{"spans": [...]}`	Find text spans (untyped)
`NER`	`{"entities": [...]}`	Find typed entities with labels
`CLASSIFICATION`	`{"label": "..."}`	Classify text into categories
`TRANSFORMATION`	Custom dict	Extract structured fields

Rule Formats

Format	Best For	Example
`RuleFormat.REGEX`	Keyword patterns, structured text	`\b\d{4}\b`
`RuleFormat.CODE`	Complex logic, multi-field extraction	`def extract(input_data): ...`
`RuleFormat.SPACY`	Linguistic patterns, POS/dependency	`[{"POS": "PROPN", "OP": "+"}]`

from rulechef import RuleFormat

# Only generate regex rules (fastest, most portable)
chef = RuleChef(task, client, allowed_formats=[RuleFormat.REGEX])

# Only code rules (most flexible)
chef = RuleChef(task, client, allowed_formats=[RuleFormat.CODE])

Buffer-First Architecture

Examples go to a buffer first, then get committed to the dataset during learn_rules(). This enables batch learning and coordinator-driven decisions:

chef.add_example(input1, output1)   # Goes to buffer
chef.add_example(input2, output2)   # Goes to buffer
chef.add_correction(input3, wrong_output, correct_output)  # High-priority signal

chef.learn_rules()  # Buffer -> Dataset -> Synthesis -> Refinement

Corrections & Feedback

Corrections are the highest-value training signal -- they show exactly where the current rules fail:

result = chef.extract({"text": "some input"})
# Result was wrong! Correct it:
chef.add_correction(
    {"text": "some input"},
    model_output=result,
    expected_output={"label": "correct_label"},
    feedback="The rule matched too broadly"
)

# Task-level guidance
chef.add_feedback("Drug names always follow 'take' or 'prescribe'")

# Rule-level feedback
chef.add_feedback("This rule is too broad", level="rule", target_id="rule_id")

chef.learn_rules()  # Re-learns with corrections prioritized

Evaluation

RuleChef includes built-in evaluation with entity-level precision, recall, and F1:

# Dataset-level evaluation
eval_result = chef.evaluate()
# Prints: Exact match, micro/macro P/R/F1, per-class breakdown

# Per-rule evaluation (find dead or harmful rules)
metrics = chef.get_rule_metrics()
# Shows: per-rule TP/FP/FN, sample matches, identifies dead rules

# Delete a bad rule
chef.delete_rule("rule_id")

Inspect Your Rules

The rules are the model — so you can read them. Generate a browsable report of any ruleset against labeled data: per-rule precision, and every true/false positive highlighted in context.

rulechef-report --rules my_rules.json --data gold.jsonl --out report.html

Load a saved ruleset back any time — from a dataset file, a benchmark result, or a bare list of rule dicts:

chef.load_rules("benchmarks/results/results_extract_tab.ckpt_rulechef.json")
chef.extract({"text": "filed under no. 36244/06 in 2006"})   # runs rules, no LLM

See benchmarks/INSPECTING_RULES.md.

Advanced Features

Synthesis Strategy

For multi-class tasks, RuleChef can synthesize rules one class at a time for better coverage:

# Auto-detect (default): per-class if >1 class, bulk otherwise
chef = RuleChef(task, client, synthesis_strategy="auto")

# Force per-class synthesis
chef = RuleChef(task, client, synthesis_strategy="per_class")

# Force single-prompt bulk synthesis
chef = RuleChef(task, client, synthesis_strategy="bulk")

Agentic Coordinator

The AgenticCoordinator uses LLM calls to guide the refinement loop, focusing on weak classes:

from rulechef import RuleChef, AgenticCoordinator

coordinator = AgenticCoordinator(client, model="gpt-4o-mini")
chef = RuleChef(task, client, coordinator=coordinator)

chef.learn_rules(max_refinement_iterations=10)
# Coordinator analyzes per-class metrics each iteration,
# tells the synthesis prompt which classes to focus on,
# and stops early when performance plateaus.

Rule Pruning

With prune_after_learn=True, the agentic coordinator audits rules after learning -- merging redundant rules and removing pure noise. A safety net reverts if F1 drops:

coordinator = AgenticCoordinator(client, prune_after_learn=True)
chef = RuleChef(task, client, coordinator=coordinator)

chef.learn_rules()
# After synthesis+refinement:
# 1. LLM analyzes rules + per-rule metrics
# 2. Merges similar patterns (e.g. two regexes → one)
# 3. Removes precision=0 rules (pure false positives)
# 4. Re-evaluates — reverts if F1 drops

In the CLI: learn --agentic --prune.

Incremental Patching

After the initial learn, you can patch existing rules without full re-synthesis:

chef.learn_rules()           # Initial synthesis
chef.add_correction(...)     # Add corrections
chef.learn_rules(incremental_only=True)  # Patch, don't re-synthesize

Observation Mode

Collect training data from any LLM -- no task definition needed:

# Works with any LLM provider (Anthropic, Groq, local models, etc.)
chef = RuleChef(client=client, model="gpt-4o-mini")  # No task needed
chef.add_observation({"text": "what's the exchange rate?"}, {"label": "exchange_rate"})
chef.learn_rules()  # Auto-discovers the task schema

For raw LLM interactions where you don't know the schema:

chef.add_raw_observation(
    messages=[{"role": "user", "content": "classify: what's the rate?"}],
    response="exchange_rate",
)
chef.learn_rules()  # Discovers task + maps observations + learns rules

For OpenAI-compatible clients, auto-capture with monkey-patching:

wrapped = chef.start_observing(openai_client, auto_learn=False)
response = wrapped.chat.completions.create(...)  # Observed automatically
chef.learn_rules()
chef.stop_observing()

Pydantic Output Schemas

Use Pydantic models for type-safe, validated outputs with automatic label extraction:

from pydantic import BaseModel
from typing import List, Literal

class Entity(BaseModel):
    text: str
    start: int
    end: int
    type: Literal["PERSON", "ORG", "LOCATION"]

class Output(BaseModel):
    entities: List[Entity]

task = Task(..., output_schema=Output, type=TaskType.NER)
# RuleChef automatically discovers labels: ["PERSON", "ORG", "LOCATION"]

grex: Regex Pattern Suggestions

When use_grex=True (default), grex analyzes your training examples and adds regex pattern hints to the synthesis prompt. The LLM sees concrete patterns alongside the raw examples, producing better rules — especially for structured data like dates, IDs, and amounts:

DATA EVIDENCE FROM TRAINING:
- DATE (5 unique): "2024-01-15", "2024-02-28", "2023-12-01", ...
  Exact pattern: (2023\-12\-01|2024\-01\-15|2024\-02\-28|...)
  Structural pattern: \d{4}\-\d{2}\-\d{2}

Install with pip install rulechef[grex]. Disable with use_grex=False.

Benchmarks

All numbers, harnesses, and learned rulesets are committed under benchmarks/ — every reported result traces to a script and a results JSON. Highlights:

TAB anonymization (real court decisions): format-type F1 78.7 vs 74.8 (LLM prompting) vs 74.4 (GLiNER2); on the official test split the rules recover 83% of direct identifiers at 5 ms per document on CPU.
Ablation: one-shot rule prompting gets 7.4 F1 on semantic types; the refinement loop with holdout acceptance reaches 44.4 — the pipeline, not the prompt, does the work.
Feedback repair: three sentences of plain-English rule feedback lift a broken rule family from 5.7 to 35.6 F1 in one incremental round.
Banking77 (5-class, 5-shot): 97.6% precision at 61% recall (75.1 micro-F1, 0.17 ms/query) — rules answer or abstain, never guess.

CLI

Interactive CLI for quick experimentation across all task types:

export OPENAI_API_KEY=your_key
rulechef

The CLI walks you through a setup wizard (task name, type, labels, model, base URL) and drops you into a command loop:

Commands:
  add        Add a training example
  correct    Add a correction
  extract    Run extraction on input
  learn      Learn rules (--iterations N, --incremental, --agentic, --prune)
  evaluate   Evaluate rules against dataset
  rules      List learned rules (rules <id> for detail)
  delete     Delete a rule by ID
  feedback   Add feedback (task/rule level)
  generate   Generate synthetic examples with LLM
  stats      Show dataset statistics
  help       Show commands
  quit       Exit

Works with any OpenAI-compatible API (Groq, Together, Ollama, etc.) via the base URL prompt.

Web App

RuleChef includes a web UI (FastAPI + React) for interactive rule learning. Upload data, learn rules, see highlighted entities, and correct mistakes — all in the browser.

pip install -e ".[app]"
uvicorn api.main:app --reload --port 8000

cd frontend && npm install && npm run dev

See the Web App guide for full setup and usage.

License

Apache 2.0 -- see LICENSE.

Name		Name	Last commit message	Last commit date
Latest commit History 71 Commits
.github		.github
api		api
assets		assets
benchmarks		benchmarks
docs		docs
examples		examples
frontend		frontend
notebooks		notebooks
rulechef		rulechef
tests		tests
.gitignore		.gitignore
CHANGELOG.md		CHANGELOG.md
CODE_OF_CONDUCT.md		CODE_OF_CONDUCT.md
CONTRIBUTING.md		CONTRIBUTING.md
LICENSE		LICENSE
README.md		README.md
mkdocs.yml		mkdocs.yml
pyproject.toml		pyproject.toml

Uh oh!

Folders and files

Latest commit

History

Repository files navigation

RuleChef

What is RuleChef?

Results

Installation

Quick Start

Extraction

Named Entity Recognition (NER)

Classification

Transformation

Core Concepts

Task Types

Rule Formats

Buffer-First Architecture

Corrections & Feedback

Evaluation

Inspect Your Rules

Advanced Features

Synthesis Strategy

Agentic Coordinator

Rule Pruning

Incremental Patching

Observation Mode

Pydantic Output Schemas

grex: Regex Pattern Suggestions

Benchmarks

CLI

Web App

License

About

Topics

Resources

License

Code of conduct

Contributing

Uh oh!

Stars

Watchers

Forks

Releases 7

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages