docs: update qiskit_code_validation example defaults (#743)

ajbozarth · web-flow · commit fd51c2bd9138 · 2026-03-27T19:12:05.000Z
* docs: update qiskit_code_validation example defaults

Switch default model to hf.co/Qiskit/mistral-small-3.2-24b-qiskit-GGUF
and inline QISKIT_SYSTEM_PROMPT as a documented optional tuning aid for
non-specialized models. Update README to match.

Signed-off-by: Alex Bozarth <ajbozart@us.ibm.com>

* docs: update qiskit_code_validation README benchmark results

Replace outdated benchmark table with completed run data and add
check() correctness finding (~32.5% on QHE).

Signed-off-by: Alex Bozarth <ajbozart@us.ibm.com>

* docs: add grounding_context usage example to qiskit_code_validation README

Signed-off-by: Alex Bozarth <ajbozart@us.ibm.com>
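
Reviewer note: the diff forwards `grounding_context` and `system_prompt` to `m.instruct()` only when they carry a value. A minimal sketch of that pattern in isolation follows. `build_extra_kwargs` and `fake_instruct` are hypothetical names used for illustration only, and a plain string key stands in for `ModelOption.SYSTEM_PROMPT` so the sketch runs without mellea installed.

```python
from typing import Any, Optional


def build_extra_kwargs(
    system_prompt: Optional[str] = None,
    grounding_context: Optional[dict] = None,
) -> dict:
    """Collect only the optional kwargs that actually carry a value."""
    extra: dict = {}
    if grounding_context:
        extra["grounding_context"] = grounding_context
    if system_prompt:
        # Assumption: the real code keys this on ModelOption.SYSTEM_PROMPT;
        # a plain string key is used here to keep the sketch dependency-free.
        extra["model_options"] = {"system_prompt": system_prompt}
    return extra


def fake_instruct(prompt: str, **kwargs: Any) -> dict:
    """Hypothetical stand-in for m.instruct(); echoes what it received."""
    return {"prompt": prompt, **kwargs}


# With no optional values, nothing extra is forwarded.
plain_call = fake_instruct("fix this circuit", **build_extra_kwargs())

# With grounding context, only that key is added to the call.
grounded_call = fake_instruct(
    "fix this circuit",
    **build_extra_kwargs(grounding_context={"transpilation": "use pass managers"}),
)
```

Building the dict first means the callee never sees an explicit `None`, so its own defaults stay in effect when neither option is set.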

---------

Signed-off-by: Alex Bozarth <ajbozart@us.ibm.com>
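
Reviewer note: `generate_validated_qiskit_code()` now returns `(generated_code, success, attempts_used)` instead of a bare string, and on failure it returns the last sampled generation rather than the first. The sketch below mirrors that contract with stand-in objects. `FakeGeneration`, `FakeSamplingResult`, and `summarize` are hypothetical names, not mellea types; only the attributes the diff reads (`success`, `result`, `sample_generations`, `value`) are modeled.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class FakeGeneration:
    """Hypothetical stand-in for one sampled generation."""

    value: Optional[str]


@dataclass
class FakeSamplingResult:
    """Hypothetical stand-in for the object returned by m.instruct()."""

    success: bool
    result: Optional[str] = None
    sample_generations: list = field(default_factory=list)


def summarize(candidate: FakeSamplingResult) -> tuple:
    """Return (code, success, attempts) following the logic in the diff."""
    gens = candidate.sample_generations
    # Report one attempt when no generations were recorded.
    attempts = len(gens) if gens else 1
    if candidate.success:
        return str(candidate.result), True, attempts
    if gens:
        # On failure, hand back the most recent attempt, not the first.
        return str(gens[-1].value or ""), False, attempts
    return "", False, attempts


repaired = FakeSamplingResult(
    success=True,
    result="qc.cx(0, 1)",
    sample_generations=[FakeGeneration("qc.cnot(0, 1)"), FakeGeneration("qc.cx(0, 1)")],
)
code, success, attempts = summarize(repaired)
```

Callers that previously treated the return value as a string need to unpack the tuple, as `test_qiskit_code_validation()` does in the diff below.
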
diff --git a/docs/examples/instruct_validate_repair/qiskit_code_validation/README.md b/docs/examples/instruct_validate_repair/qiskit_code_validation/README.md
@@ -8,7 +8,7 @@ Takes a prompt containing deprecated Qiskit code and:
 1. Detects QKT violations in the input code
 2. Passes those violations to the LLM as context
 3. Generates corrected code that passes QKT validation
-4. Automatically repairs the code if validation fails (up to 5 attempts)
+4. Automatically repairs the code if validation fails (up to 10 attempts)
 
 ## Quick Start
 
@@ -22,7 +22,7 @@ Dependencies (`mellea`, `flake8-qiskit-migration`) are automatically installed.
 ## Requirements
 
 - **Ollama backend** running locally (`ollama serve`)
-- **Compatible model**: e.g., `hf.co/Qiskit/mistral-small-3.2-24b-qiskit-GGUF:latest` or `granite4:small-h`
+- **Compatible model**: `hf.co/Qiskit/mistral-small-3.2-24b-qiskit-GGUF:latest` (recommended — domain-specialized; see [Changing the Model](#changing-the-model))
 - **flake8-qiskit-migration**: Automatically installed when using `uv run`
 
 ## How It Works
@@ -32,7 +32,7 @@ Dependencies (`mellea`, `flake8-qiskit-migration`) are automatically installed.
 1. **Pre-condition validation**: Validates the input prompt and any code it contains
 2. **Instruction**: LLM generates code following structured requirements
 3. **Post-condition validation**: Validates generated code against QKT rules (see [Qiskit Migration Guide](https://docs.quantum.ibm.com/api/migration-guides))
-4. **Repair loop**: Automatically repairs code that fails validation (up to 5 attempts)
+4. **Repair loop**: Automatically repairs code that fails validation (up to 10 attempts)
 
 ### Sampling Strategies
 
@@ -47,20 +47,20 @@ To switch strategies, edit the `use_multiturn_strategy` variable in `test_qiskit
 
 #### Strategy Performance Comparison
 
-Benchmarks on `mistral-small-3.2-24b-qiskit` model (pass rates measure QKT validation only, not correctness):
+Benchmarks on `mistral-small-3.2-24b-qiskit` model, no system prompt:
 
-| Dataset | Strategy | First Pass | Post-Repair |
+| Dataset | Strategy | First Pass (QKT) | Post-Repair (QKT) |
 |---------|----------|------------|-------------|
-| **QHE** | RepairTemplate | 78.2% | **99.3%** |
-|         | MultiTurn | 77.5% | 96.7% |
-| **QKT** | RepairTemplate | 54.1% | **83.8%** |
-|         | MultiTurn | 37.8% | 70.3% |
+| **QHE** | RepairTemplate | 98.0% | **100%** |
+|         | MultiTurn | **100%** | **100%** |
+| **QKT** | RepairTemplate | 98.0% | **100%** |
+|         | MultiTurn | 93.3% | **100%** |
 
 **Datasets:**
-- **QHE** (QiskitHumanEval): Code generation tasks testing general Qiskit programming
-- **QKT**: Qiskit version migration tasks requiring fixes to deprecated APIs
+- **QHE** (QiskitHumanEval): 151 general Qiskit code generation tasks
+- **QKT**: 45 Qiskit version migration tasks requiring fixes to deprecated APIs
 
-**Note:** These benchmarks measure whether generated code passes QKT validation rules, not whether the code correctly solves the prompt. Both aspects are important for production use.
+**Note:** Pass rates measure whether generated code passes QKT validation rules, not whether the code correctly solves the prompt. On QHE, the model achieves ~32.5% correctness when running the QHE check() test suite against the generated code. Full benchmark data and analysis are available in @ajbozarth's [toolbox repo](https://github.com/ajbozarth/toolbox/tree/main/mellea/qiskit_code_validation/benchmarking).
 
 ### Code Structure
 
@@ -183,22 +183,17 @@ qc.measure_all()
 Validation failed with 1 error(s):
 QKT101: QuantumCircuit.cnot() has been removed in Qiskit 1.0; use `.cx()` instead
 
-====== Result (83.5s) ======
+====== Result (23.1s, 2 attempt(s)) ======
 ```python
-from qiskit_aer import AerSimulator
-from qiskit import QuantumCircuit
+from qiskit_aer import AerSimulator, QuantumCircuit
 
 backend = AerSimulator()
 
 qc = QuantumCircuit(5, 5)
 qc.h(0)
-qc.cx(0, range(1, 5))  # Fixed: use .cx() instead of .cnot()
+qc.cx(0, range(1, 5))
 qc.measure_all()
-
-job = backend.run(qc)
-result = job.result()
 ```
-I fixed the code by replacing `QuantumCircuit.cnot()` with `QuantumCircuit.cx()` as required by Qiskit 1.0. I also replaced the deprecated `BasicAer.get_backend('qasm_simulator')` with `AerSimulator()`. This code should now pass Qiskit migration validation (QKT rules).
 ======================
 
 ✓ Code passes Qiskit migration validation
@@ -211,13 +206,35 @@ I fixed the code by replacing `QuantumCircuit.cnot()` with `QuantumCircuit.cx()`
 To try a different model, edit the `model_id` variable in the `test_qiskit_code_validation()` function:
 
 ```python
-# Uncomment one to try different models
-# model_id = "granite4:micro-h"
-# model_id = "granite4:small-h"
 model_id = "hf.co/Qiskit/mistral-small-3.2-24b-qiskit-GGUF:latest"
 ```
 
-**Note**: Smaller models (like `granite4:micro-h`) may not have enough Qiskit knowledge to pass validation consistently. The Qiskit-specific model or `granite4:small-h` work best.
+The default model is a Qiskit-specialized fine-tune of Mistral Small. It requires a large initial download (~15GB) but produces reliable results without a system prompt.
+
+General-purpose models (e.g. `granite4:micro-h`) can be used as a lighter alternative but have significantly lower correctness on Qiskit tasks. When using a non-specialized model, set `system_prompt = QISKIT_SYSTEM_PROMPT` to improve results.
+
+## Using Grounding Context
+
+The `grounding_context` parameter accepts a `dict[str, str]` of additional context passed to the LLM alongside the prompt. Keys act as section labels and values are the content. This is useful for injecting relevant documentation snippets, RAG results, or API references at inference time.
+
+**Example — injecting migration guide excerpts:**
+
+```python
+grounding_context = {
+    "primitives_migration": (
+        "SamplerV2 replaces the legacy execute() function. "
+        "Use: sampler = SamplerV2(backend); job = sampler.run([circuit]); result = job.result()"
+    ),
+    "transpilation": (
+        "Use generate_preset_pass_manager() instead of transpile(). "
+        "Example: pm = generate_preset_pass_manager(optimization_level=1, backend=backend); isa_circuit = pm.run(circuit)"
+    ),
+}
+
+code, success, attempts = generate_validated_qiskit_code(
+    m, prompt, strategy, grounding_context=grounding_context
+)
+```
 
 ## Troubleshooting
 
@@ -237,9 +254,9 @@ ollama pull hf.co/Qiskit/mistral-small-3.2-24b-qiskit-GGUF:latest
 ```
 
 ### Validation Always Fails
-If using smaller models (e.g., `granite4:micro-h`), they may not have enough Qiskit knowledge. Try:
-- Using a larger model (`granite4:small-h` or the Qiskit-specific model)
-- Reducing prompt complexity
+If using a general-purpose model, it may not have enough Qiskit knowledge to pass validation consistently. Try:
+- Switching to the Qiskit-specialized model (`hf.co/Qiskit/mistral-small-3.2-24b-qiskit-GGUF:latest`)
+- Setting `system_prompt = QISKIT_SYSTEM_PROMPT` to guide the model toward modern Qiskit APIs
 - Using simpler prompts
 
 ### Import Error: flake8-qiskit-migration
@@ -248,8 +265,3 @@ ModuleNotFoundError: No module named 'flake8_qiskit_migration'
 ```
 **Solution**: Use `uv run` which auto-installs dependencies
 
-## Future Work
-
-The following enhancements are planned for future iterations:
-
-1. **Enable Smaller Models** - Add system prompt or grounding context with Qiskit API documentation to help smaller models perform accurate migrations. This would allow removing the `pytest.mark.skip` marker and make the example run in standard test suites.
diff --git a/docs/examples/instruct_validate_repair/qiskit_code_validation/qiskit_code_validation.py b/docs/examples/instruct_validate_repair/qiskit_code_validation/qiskit_code_validation.py
@@ -15,7 +15,7 @@
 1. **Pre-condition validation**: Validate prompt content and any input code
 2. **Instruction**: LLM generates code following structured requirements
 3. **Post-condition validation**: Validate generated code against QKT rules
-4. **Repair loop**: Automatically repair code that fails validation (up to 5 attempts)
+4. **Repair loop**: Automatically repair code that fails validation (up to 10 attempts)
 
 Requirements:
     - flake8-qiskit-migration: Installed automatically when run via `uv run`
@@ -30,18 +30,51 @@
 
 from validation_helpers import validate_input_code, validate_qiskit_migration
 
-import mellea
+from mellea import MelleaSession, start_session
 from mellea.backends import ModelOption
 from mellea.stdlib.context import ChatContext, SimpleContext
 from mellea.stdlib.requirements import req, simple_validate
 from mellea.stdlib.sampling import MultiTurnStrategy, RepairTemplateStrategy
 
+# Optional system prompt for models not specialized for Qiskit.
+# Set system_prompt = QISKIT_SYSTEM_PROMPT in test_qiskit_code_validation() to enable.
+QISKIT_SYSTEM_PROMPT = """\
+You are the Qiskit code assistant, a Qiskit coding expert developed by IBM Quantum. \
+Your mission is to help users write good Qiskit code and advise them on best practices \
+for quantum computing using Qiskit and IBM Quantum and its hardware and services. \
+You stick to the user request, without adding non-requested information or yapping.
+
+When doing code generation, you always generate Python and Qiskit code. If the input \
+you received only contains code, your task is to complete the code without adding extra \
+explanations or text.
+
+The current version of `qiskit` is `2.1`. Ensure your code is valid Python and Qiskit. \
+The official documentation is available at https://quantum.cloud.ibm.com/docs/en. \
+Avoid `https://qiskit.org` links as they are not active.
+
+Code standards — never use deprecated methods:
+- Transpilation: use `generate_preset_pass_manager()` instead of `transpile()`
+- Execution: use `SamplerV2` or `EstimatorV2` primitives instead of `execute()`
+- Provider: `qiskit-ibmq-provider` / `IBMQ` was deprecated in 2023; use `qiskit-ibm-runtime` instead
+- Simulator: import as `from qiskit_aer import AerSimulator`, not `from qiskit.providers.aer import AerSimulator`
+- Random circuits: import as `from qiskit.circuit.random import random_circuit`
+
+When no backend is specified, default to `ibm_fez`, `ibm_marrakesh`, `ibm_pittsburg`, or `ibm_kingston`. \
+Avoid simulators unless explicitly requested.
+
+The four steps of a Qiskit pattern: (1) Map problem to quantum circuits and operators. \
+(2) Optimize for target hardware. (3) Execute on target hardware. (4) Post-process results.
+"""
+
 
 def generate_validated_qiskit_code(
-    m: mellea.MelleaSession,
+    m: MelleaSession,
     prompt: str,
     strategy: MultiTurnStrategy | RepairTemplateStrategy,
-) -> str:
+    *,
+    system_prompt: str | None = None,
+    grounding_context: dict[str, str] | None = None,
+) -> tuple[str, bool, int]:
     """Generate Qiskit code that passes Qiskit migration validation.
 
     This function implements the Instruct-Validate-Repair pattern:
@@ -54,34 +87,34 @@ def generate_validated_qiskit_code(
         m: Mellea session
         prompt: User prompt for code generation
         strategy: Sampling strategy for handling validation failures
+        system_prompt: Optional system prompt passed via ModelOption.SYSTEM_PROMPT
+        grounding_context: Optional grounding context dict passed to m.instruct()
 
     Returns:
-        Generated code that passes validation
-
-    Raises:
-        ValueError: If prompt validation fails
+        Tuple of (generated_code, success, attempts_used)
     """
     # Pre-validate input code if present — include violations as context rather than failing
     is_valid, error_msg = validate_input_code(prompt)
-    input_code_errors = None
     if not is_valid:
         print(
             f"Input code has QKT violations, including as context for LLM: {error_msg}"
         )
-        input_code_errors = error_msg
-
-    # Build the instruction prompt, optionally augmented with input code violations
-    instruct_prompt = prompt
-    if input_code_errors is not None:
-        instruct_prompt = (
+        prompt = (
             f"{prompt}\n\n"
             f"Note: the code above has the following Qiskit migration issues that must be fixed:\n"
-            f"{input_code_errors}"
+            f"{error_msg}"
         )
 
+    # Only pass optional kwargs if they have values — avoids passing None to m.instruct()
+    extra: dict = {}
+    if grounding_context:
+        extra["grounding_context"] = grounding_context
+    if system_prompt:
+        extra["model_options"] = {ModelOption.SYSTEM_PROMPT: system_prompt}
+
     # Generate code with output validation only
     code_candidate = m.instruct(
-        instruct_prompt,
+        prompt,
         requirements=[
             req(
                 "Code must pass Qiskit migration validation (QKT rules)",
@@ -90,10 +123,17 @@ def generate_validated_qiskit_code(
         ],
         strategy=strategy,
         return_sampling_results=True,
+        **extra,
+    )
+
+    attempts = (
+        len(code_candidate.sample_generations)
+        if code_candidate.sample_generations
+        else 1
     )
 
     if code_candidate.success:
-        return str(code_candidate.result)
+        return str(code_candidate.result), True, attempts
     else:
         print("Code generation did not fully succeed, returning best attempt")
         # Log detailed validation failure reasons
@@ -105,9 +145,13 @@ def generate_validated_qiskit_code(
                     )
         # Return best attempt even if validation failed
         if code_candidate.sample_generations:
-            return str(code_candidate.sample_generations[0].value or "")
+            return (
+                str(code_candidate.sample_generations[-1].value or ""),
+                False,
+                attempts,
+            )
         print("No code generations available")
-        return ""
+        return "", False, attempts
 
 
 def test_qiskit_code_validation() -> None:
@@ -117,16 +161,14 @@ def test_qiskit_code_validation() -> None:
     that uses old APIs (BasicAer, execute) and having the LLM fix it to use
     modern Qiskit APIs that pass QKT validation rules.
     """
-    # Strategy selection - True for MultiTurnStrategy, False for RepairTemplateStrategy
-    # MultiTurnStrategy: Adds validation failure reasons as a new user message in the conversation
-    # RepairTemplateStrategy: Adds validation failure reasons to the instruction and retries
-    use_multiturn_strategy = False
-
-    # Model selection - uncomment one to try different models
-    # model_id = "granite4:micro-h"
-    # model_id = "granite4:small-h"
+    # Model — requires Ollama with the model pulled locally
+    # See README.md for model options and tradeoffs
     model_id = "hf.co/Qiskit/mistral-small-3.2-24b-qiskit-GGUF:latest"
 
+    # System prompt — None uses the model's built-in Qiskit knowledge (default)
+    # Set to QISKIT_SYSTEM_PROMPT when using a model not specialized for Qiskit
+    system_prompt = None
+
     # Prompt - replace with your own or see README.md for examples
     prompt = """from qiskit import BasicAer, QuantumCircuit, execute
 
@@ -144,37 +186,41 @@ def test_qiskit_code_validation() -> None:
     print(prompt)
     print("======================\n")
 
+    # Strategy selection - True for MultiTurnStrategy, False for RepairTemplateStrategy
+    # MultiTurnStrategy: Adds validation failure reasons as a new user message in the conversation
+    # RepairTemplateStrategy: Adds validation failure reasons to the instruction and retries
+    use_multiturn_strategy = False
+
     # Initialize the required context
     ctx = ChatContext() if use_multiturn_strategy else SimpleContext()
+    if use_multiturn_strategy:
+        strategy: MultiTurnStrategy | RepairTemplateStrategy = MultiTurnStrategy(
+            loop_budget=10
+        )
+    else:
+        strategy = RepairTemplateStrategy(loop_budget=10)
 
-    with mellea.start_session(
+    with start_session(
         model_id=model_id,
         backend_name="ollama",
         ctx=ctx,
         model_options={ModelOption.TEMPERATURE: 0.8, ModelOption.MAX_NEW_TOKENS: 2048},
     ) as m:
         start_time = time.time()
 
-        if use_multiturn_strategy:
-            strategy: MultiTurnStrategy | RepairTemplateStrategy = MultiTurnStrategy(
-                loop_budget=5
-            )
-        else:
-            strategy = RepairTemplateStrategy(loop_budget=5)
-
-        code = generate_validated_qiskit_code(m, prompt, strategy)
+        code, success, attempts = generate_validated_qiskit_code(
+            m, prompt, strategy, system_prompt=system_prompt
+        )
         elapsed = time.time() - start_time
 
-    print(f"\n====== Result ({elapsed:.1f}s) ======")
+    print(f"\n====== Result ({elapsed:.1f}s, {attempts} attempt(s)) ======")
     print(code)
     print("======================\n")
 
-    # Validate the generated code
-    is_valid, error_msg = validate_qiskit_migration(code)
-
-    if is_valid:
+    if success:
         print("✓ Code passes Qiskit migration validation")
     else:
+        _, error_msg = validate_qiskit_migration(code)
         print("✗ Validation errors:")
         print(error_msg)