diff --git a/docs/examples/instruct_validate_repair/qiskit_code_validation/README.md b/docs/examples/instruct_validate_repair/qiskit_code_validation/README.md
index ed993a7fd..fa952030d 100644
--- a/docs/examples/instruct_validate_repair/qiskit_code_validation/README.md
+++ b/docs/examples/instruct_validate_repair/qiskit_code_validation/README.md
@@ -5,10 +5,9 @@ This example demonstrates using Mellea's Instruct-Validate-Repair (IVR) pattern
 ## What This Example Does
 
 Takes a prompt containing deprecated Qiskit code and:
-1. Detects QKT violations in the input code
-2. Passes those violations to the LLM as context
-3. Generates corrected code that passes QKT validation
-4. Automatically repairs the code if validation fails (up to 10 attempts)
+1. Generates corrected code using the LLM
+2. Validates the output against QKT rules
+3. Automatically repairs the code if validation fails (up to 10 attempts)
 
 ## Quick Start
 
@@ -29,10 +28,9 @@ Dependencies (`mellea`, `flake8-qiskit-migration`) are automatically installed.
 
 ### The IVR Pipeline
 
-1. **Pre-condition validation**: Validates the input prompt and any code it contains
-2. **Instruction**: LLM generates code following structured requirements
-3. **Post-condition validation**: Validates generated code against QKT rules (see [Qiskit Migration Guide](https://docs.quantum.ibm.com/api/migration-guides))
-4. **Repair loop**: Automatically repairs code that fails validation (up to 10 attempts)
+1. **Instruction**: LLM generates code following structured requirements
+2. **Post-condition validation**: Validates generated code against QKT rules (see [Qiskit Migration Guide](https://docs.quantum.ibm.com/api/migration-guides))
+3. **Repair loop**: Automatically repairs code that fails validation (up to 10 attempts)
 
 ### Sampling Strategies
 
@@ -47,20 +45,20 @@ To switch strategies, edit the `use_multiturn_strategy` variable in `test_qiskit
 
 #### Strategy Performance Comparison
 
-Benchmarks on `mistral-small-3.2-24b-qiskit` model, no system prompt:
+Benchmarks on `mistral-small-3.2-24b-qiskit` model:
 
 | Dataset | Strategy | First Pass (QKT) | Post-Repair (QKT) |
 |---------|----------|------------|-------------|
-| **QHE** | RepairTemplate | 98.0% | **100%** |
-| | MultiTurn | **100%** | **100%** |
-| **QKT** | RepairTemplate | 98.0% | **100%** |
-| | MultiTurn | 93.3% | **100%** |
+| **QHE** | RepairTemplate | 97.4% | **100%** |
+| | MultiTurn | 95.4% | **100%** |
+| **QKT** | RepairTemplate | 88.9% | **100%** |
+| | MultiTurn | **97.8%** | **100%** |
 
 **Datasets:**
 - **QHE** (QiskitHumanEval): 151 general Qiskit code generation tasks
 - **QKT**: 45 Qiskit version migration tasks requiring fixes to deprecated APIs
 
-**Note:** Pass rates measure whether generated code passes QKT validation rules, not whether the code correctly solves the prompt. On QHE, the model achieves ~32.5% correctness when running the QHE check() test suite against the generated code. Full benchmark data and analysis are available in @ajbozarth's [toolbox repo](https://github.com/ajbozarth/toolbox/tree/main/mellea/qiskit_code_validation/benchmarking).
+**Note:** Pass rates measure whether generated code passes QKT validation rules, not whether the code correctly solves the prompt. On QHE, the model achieves ~27.8% correctness when running the QHE check() test suite against the generated code. Full benchmark data and analysis are available in @ajbozarth's [toolbox repo](https://github.com/ajbozarth/toolbox/tree/main/mellea/qiskit_code_validation/benchmarking).
 
 ### Code Structure
 
diff --git a/docs/examples/instruct_validate_repair/qiskit_code_validation/qiskit_code_validation.py b/docs/examples/instruct_validate_repair/qiskit_code_validation/qiskit_code_validation.py
index 73e6b3155..029fe4dfc 100644
--- a/docs/examples/instruct_validate_repair/qiskit_code_validation/qiskit_code_validation.py
+++ b/docs/examples/instruct_validate_repair/qiskit_code_validation/qiskit_code_validation.py
@@ -93,18 +93,6 @@ def generate_validated_qiskit_code(
     Returns:
         Tuple of (generated_code, success, attempts_used)
     """
-    # Pre-validate input code if present — include violations as context rather than failing
-    is_valid, error_msg = validate_input_code(prompt)
-    if not is_valid:
-        print(
-            f"Input code has QKT violations, including as context for LLM: {error_msg}"
-        )
-        prompt = (
-            f"{prompt}\n\n"
-            f"Note: the code above has the following Qiskit migration issues that must be fixed:\n"
-            f"{error_msg}"
-        )
-
     # Only pass optional kwargs if they have values — avoids passing None to m.instruct()
     extra: dict = {}
     if grounding_context:
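
Review note: with the pre-validation step removed, the flow reduces to instruct, validate, repair. The sketch below illustrates that three-step loop under stated assumptions. It is not Mellea's implementation: `ivr_loop`, `toy_generate`, and `toy_validate` are hypothetical names, the generator stands in for the `m.instruct()` call, and the validator stands in for the real QKT check with a single toy rule.

```python
from typing import Callable, Optional, Tuple

# Hypothetical signatures (assumptions, not Mellea's API):
# a generator takes the prompt plus optional validator feedback,
# a validator returns (is_valid, error_message).
Generator = Callable[[str, Optional[str]], str]
Validator = Callable[[str], Tuple[bool, str]]


def ivr_loop(
    prompt: str,
    generate: Generator,
    validate: Validator,
    max_attempts: int = 10,
) -> Tuple[str, bool, int]:
    """Return (generated_code, success, attempts_used)."""
    feedback: Optional[str] = None
    code = ""
    for attempt in range(1, max_attempts + 1):
        # Instruct: the first call has no feedback, later calls are repairs.
        code = generate(prompt, feedback)
        # Validate: only the generated output is checked, never the input prompt.
        is_valid, error_msg = validate(code)
        if is_valid:
            return code, True, attempt
        # Repair: feed the validation errors into the next attempt.
        feedback = error_msg
    return code, False, max_attempts


def toy_validate(code: str) -> Tuple[bool, str]:
    # Toy stand-in for the QKT rules: flag any call to execute().
    if "execute(" in code:
        return False, "execute() was removed; use backend.run() instead"
    return True, ""


def toy_generate(prompt: str, feedback: Optional[str]) -> str:
    # Toy stand-in for the LLM: wrong on the first pass, fixed once it sees feedback.
    if feedback is None:
        return "job = execute(circuit, backend)"
    return "job = backend.run(circuit)"


result = ivr_loop("Fix this deprecated Qiskit code", toy_generate, toy_validate)
print(result)  # ('job = backend.run(circuit)', True, 2)
```

The loop only sees violations in generated output, so a model that copies deprecated input code verbatim is corrected on the repair pass rather than warned up front. That matches the first-pass and post-repair columns in the README table.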