
# Skill.md Optimization with DSPy

Programmatically optimize Skill.md prompts using DSPy's prompt-optimization framework.

**Key finding:** DSPy optimization improves results on a local model (+12.5 points on Qwen3) but not on a frontier model (no gain on GPT-4o).

Read the full writeup


## Quick Start

### Option 1: Test with Qwen (Local, No API Key)

```shell
# Install Ollama: https://ollama.ai
ollama pull qwen3

# Install dependencies
pip install -r requirements.txt

# Run real DSPy BootstrapFewShot optimization
python scripts/optimize_qwen.py
```

Expected result: a fully automated DSPy BootstrapFewShot run in roughly 3-4 minutes:

- Baseline: 42.5%
- Optimized with BootstrapFewShot: 55.1%
- +12.5 points from automated example selection

DSPy selects the best-performing examples and generates optimized prompts automatically; no manual prompt engineering is required.
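For reference, this is roughly how a script like `optimize_qwen.py` would point DSPy at the local Ollama server (a configuration sketch, not the repo's actual code; the `ollama_chat/` prefix and default port are standard DSPy/Ollama conventions):

```python
import dspy

# Point DSPy at the local Ollama server (Ollama's default port is 11434).
# "qwen3" assumes the model pulled in the step above.
lm = dspy.LM("ollama_chat/qwen3", api_base="http://localhost:11434", api_key="")
dspy.configure(lm=lm)
```

Once configured, every DSPy module in the script routes its calls through the local model.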

### Option 2: Test with GPT-4o (Requires Azure/OpenAI)

```shell
# Set environment variables
export AZURE_API_KEY="your-key"
export AZURE_API_BASE="your-endpoint"
export AZURE_DEPLOYMENT="your-deployment"

# Run optimization
python scripts/run_azure_optimization.py
```
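A sketch of how those environment variables would typically be wired into DSPy (the `azure/<deployment>` prefix is DSPy's standard LiteLLM routing; the exact internals of `run_azure_optimization.py` are an assumption):

```python
import os
import dspy

# DSPy routes Azure OpenAI calls through LiteLLM's "azure/<deployment>" prefix.
lm = dspy.LM(
    f"azure/{os.environ['AZURE_DEPLOYMENT']}",
    api_key=os.environ["AZURE_API_KEY"],
    api_base=os.environ["AZURE_API_BASE"],
)
dspy.configure(lm=lm)
```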

## What This Does

**Input:** `code-review.md`, a Skill that finds security vulnerabilities.

**Process:**

1. Convert `Skill.md` into a DSPy Signature
2. Optimize against 10 vulnerable code examples
3. Extract the improvements into `Skill-optimized.md`

**Output:** model-specific optimized variants
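Step 1 of the process can be illustrated with a toy converter: pull the title and body out of a minimal Skill.md and use them as the instruction text of a DSPy-style signature. The parsing below is a hypothetical sketch, not the repo's actual converter:

```python
def skill_to_signature_doc(skill_md: str) -> dict:
    """Split a minimal Skill.md into a name and an instruction block."""
    lines = [l for l in skill_md.strip().splitlines() if l.strip()]
    name = lines[0].lstrip("# ").strip()                    # first heading -> skill name
    instructions = " ".join(l.strip() for l in lines[1:])   # body -> signature docstring
    return {"name": name, "instructions": instructions}

skill = skill_to_signature_doc("""
# Code Review
Find security vulnerabilities in the given code.
Report each issue with severity and a fix.
""")
```

In real DSPy, `skill["instructions"]` would become the docstring of a `dspy.Signature` subclass, which is what the optimizer then tunes.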


## Results

| Model          | Baseline | Optimized | Improvement |
|----------------|----------|-----------|-------------|
| GPT-4o (Azure) | 40.6%    | 38.8%     | 0%          |
| Qwen3 (Ollama) | 42.5%    | 55.1%     | +12.5%      |

Key insight: DSPy's BootstrapFewShot produced a significant gain on the local model. Qwen3 improved by +12.5 percentage points (+29% relative), while GPT-4o saw no improvement, suggesting its baseline behavior was already near-optimal for this task.
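The percentage scores above come from a grading metric applied to the model's review. The repo's exact metric isn't shown here, but a recall-style metric of the kind such a setup might use looks like this (illustrative only):

```python
def vuln_recall(expected: list[str], report: str) -> float:
    """Fraction of known vulnerabilities that the model's report mentions."""
    report_lower = report.lower()
    found = sum(1 for v in expected if v.lower() in report_lower)
    return found / len(expected) if expected else 0.0

score = vuln_recall(
    ["sql injection", "hardcoded credentials"],
    "Line 3: SQL injection via string concatenation.",
)  # only 1 of 2 known issues found -> 0.5
```

Averaging such per-example scores over a test set yields aggregate numbers like the 42.5% baseline.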


## Example Output

Running `python scripts/optimize_qwen.py`:

```text
======================================================================
REAL DSPy OPTIMIZATION WITH QWEN
======================================================================

This uses BootstrapFewShot to AUTOMATICALLY:
  • Select which examples work best
  • Generate optimized prompts
  • Find the best reasoning strategy

======================================================================
STEP 1: BASELINE (No optimization)
======================================================================

Testing 1/3... 61.6%
Testing 2/3... 56.0%
Testing 3/3... 10.0%

  Baseline: 42.5%

======================================================================
STEP 2: DSPy BootstrapFewShot OPTIMIZATION
======================================================================

🔄 Running optimization...
Bootstrapped 2 full traces after 2 examples for up to 1 rounds.
  Optimization complete!

======================================================================
STEP 3: TESTING OPTIMIZED VERSION
======================================================================

Testing 1/3... 77.0%
Testing 2/3... 65.7%
Testing 3/3... 22.5%

  Optimized: 55.1%

======================================================================
RESULTS
======================================================================

Baseline:  42.5%
Optimized: 55.1%

  IMPROVEMENT: +12.5% (+29.4%)

DSPy's BootstrapFewShot successfully improved the skill!
```

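The "Bootstrapped 2 full traces" line reflects what BootstrapFewShot does under the hood: run the program on training examples, keep the traces the metric accepts, and attach them to the prompt as demonstrations. A toy, model-free sketch of that loop (the real optimizer lives in `dspy.teleprompt`; every name below is illustrative):

```python
def bootstrap_demos(predict, metric, trainset, max_demos=2):
    """Keep (input, output) traces that score perfectly under the metric."""
    demos = []
    for example in trainset:
        output = predict(example["code"])
        if metric(example["vulns"], output) == 1.0:   # accept only clean traces
            demos.append((example["code"], output))
        if len(demos) >= max_demos:
            break
    return demos

# Stand-in "model": flags eval() as code injection, nothing else.
def fake_predict(code):
    return "code injection via eval" if "eval(" in code else "no issues found"

def exact_metric(vulns, output):
    return 1.0 if all(v in output for v in vulns) else 0.0

demos = bootstrap_demos(
    fake_predict,
    exact_metric,
    [
        {"code": "eval(user_input)", "vulns": ["code injection"]},
        {"code": "print('hi')", "vulns": ["none"]},
        {"code": "x = eval(payload)", "vulns": ["code injection"]},
    ],
)  # the two eval() examples pass the metric and become demos
```

The collected demos are what make the optimized prompt: the model sees worked examples of the exact behavior the metric rewards.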




## Limitations

- Tested on one skill type (code review)
- Only two models tested (GPT-4o, Qwen3)
- Small training set (10 examples) with simple metrics

Open questions:

- Do optimizations transfer across models?
- Can we auto-generate training data?
- What's the right standardized format?

## Contributing

Contributions welcome! Particularly interested in:

  1. Testing with other skill types (data analysis, API design, etc.)
  2. Testing with other models (Llama, Mistral, etc.)
  3. Better evaluation metrics
  4. Transfer learning experiments
  5. Auto-generation of training data

## License

MIT License — see `LICENSE`.