QwenLM · DragonnZhang · May 13, 2026 · May 13, 2026 · May 14, 2026 · wenshao
diff --git a/.gitignore b/.gitignore
@@ -64,6 +64,7 @@ packages/web-templates/src/generated/
 packages/vscode-ide-companion/*.vsix
 
 logs/
+.repro-runs/
 # GHA credentials
 gha-creds-*.json
 
@@ -93,4 +94,4 @@ tmp/
 
 # code graph skills
 .venv
-.codegraph
+.codegraph
diff --git a/.qwen/skills/agent-reproduce-align/SKILL.md b/.qwen/skills/agent-reproduce-align/SKILL.md
@@ -0,0 +1,98 @@
+---
+name: agent-reproduce-align
+description: Use after a Codex or Claude Code feature has been implemented in Qwen Code to run the selected reference agent and Qwen Code under the same scenario, capture HTTP and terminal traces, compare request bodies, tool/function schemas, outputs, and iterate until the reproduced behavior is close enough.
+---
+
+# Agent Reproduce Align
+
+## Purpose
+
+Use this skill when Qwen Code already has a candidate implementation and needs evidence-based parity with a selected reference agent: `codex` or `claude-code`. The goal is not byte-for-byte equality; it is matching the observable contract that matters for the feature.
+
+Default target repo: the current working directory. Use a user-specified path only when the user explicitly provides one.
+
+## Reference Agent Selection
+
+Use the same reference agent selected during `$agent-reproduce-feature`. If the earlier choice is unavailable, ask once and record the answer in the scenario or run notes.
+
+## Workflow
+
+1. Re-state the parity target:
+   - feature name and trigger
+   - selected reference agent
+   - one baseline prompt or interaction script
+   - acceptable differences
+   - must-match fields
+2. Run the reference agent and Qwen Code in separate capture directories with the same scenario.
+3. Capture the selected reference agent's local state before and after the
+   reference run when state may affect parity.
+4. Normalize traces with `scripts/normalize_trace.py`.
+5. Compare normalized traces with `scripts/compare_traces.py`.
+6. Inspect differences in this order:
+   - reference-agent state changes that explain behavior
+   - missing tool/function names
+   - schema shape and required fields
+   - model settings and response mode
+   - prompt role/order differences that affect behavior
+   - terminal-visible output and exit status
+7. Patch Qwen Code, rerun the smallest failing scenario, and repeat.
+8. Preserve only redacted minimal fixtures in the repo.
+
+Read `references/alignment-workflow.md` before the first comparison pass.
+
+## Common Commands
+
+Normalize:
+
+```sh
+skills/agent-reproduce-align/scripts/normalize_trace.py \
+  .repro-runs/reference/http.jsonl \
+  > .repro-runs/reference/normalized.json
+```
+
+Compare:
+
+```sh
+skills/agent-reproduce-align/scripts/compare_traces.py \
+  .repro-runs/reference/normalized.json \
+  .repro-runs/qwen/normalized.json
+```
+
+Run a paired shell scenario:
+
+```sh
+REPRO_REFERENCE_AGENT=codex \
+skills/agent-reproduce-align/scripts/run_pair_capture.sh \
+  .repro-runs/slash-help \
+  "codex exec '/help'" \
+  "npm test -- --runInBand"
+```
+
+For Claude Code, set `REPRO_REFERENCE_AGENT=claude-code` and replace the first
+command with the discovered Claude Code command. When `REPRO_REFERENCE_AGENT`
+is set, the paired runner writes `reference/state-before`,
+`reference/state-after`, and `reference/state-diff`. Use the paired runner only
+when shell quoting is simple. For interactive slash commands, run the two
+captures manually with tmux so each side can receive the same keystrokes. Use
+`REPRO_REFERENCE_STATE_ROOT=/tmp/some-root` only for tests or custom state
+directories.
+
+## Comparison Rules
+
+- Compare contracts before wording. Exact prompt text is usually implementation detail.
+- Treat absent schemas, wrong required fields, or wrong argument names as high-signal failures.
+- Treat output ordering as significant only when the user-visible workflow depends on it.
+- Do not chase provider-specific endpoints, model names, IDs, timestamps, token counts, or ephemeral headers unless the feature depends on them.
+- Do not chase every local state write. Treat state diffs as explanatory
+  evidence unless the feature contract requires a particular config, memory, or
+  permission side effect.
+- Stop when Qwen Code passes the user-visible scenario and the remaining trace differences are documented as intentional.
+
+## Done Criteria
+
+- Reference-agent and Qwen Code traces for the same scenario exist locally.
+- Reference-agent state diff exists or state capture is documented as
+  irrelevant for the scenario.
+- The normalized comparison has no unexplained must-match differences.
+- Qwen Code tests or smoke commands cover the fixed behavior.
+- Any remaining mismatch is recorded in the task notes, and also in the Qwen Code docs when it affects users.
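The must-match and acceptable-difference lists from workflow step 1 can be applied mechanically to the comparison output. The following is a minimal sketch, not part of this PR: `ParityTarget`, `field_name`, and `triage` are hypothetical names, and the sketch assumes diff lines shaped like the ones `compare_traces.py` prints (`request[0].model: 'a' != 'b'`).

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class ParityTarget:
    """Hypothetical record of the parity target restated in workflow step 1."""

    feature: str
    trigger: str
    reference_agent: str  # "codex" or "claude-code"
    baseline_prompt: str
    must_match: set[str] = field(default_factory=set)
    acceptable: set[str] = field(default_factory=set)


def field_name(diff_line: str) -> str:
    """Return the last dotted component of the key that precedes ': '."""
    key = diff_line.split(": ", 1)[0]
    return key.rsplit(".", 1)[-1]


def triage(target: ParityTarget, diff_lines: list[str]) -> dict[str, list[str]]:
    """Bucket diff lines into must-match failures, accepted, and unclassified."""
    buckets: dict[str, list[str]] = {
        "must_match": [],
        "acceptable": [],
        "unclassified": [],
    }
    for line in diff_lines:
        name = field_name(line)
        if name in target.must_match:
            buckets["must_match"].append(line)
        elif name in target.acceptable:
            buckets["acceptable"].append(line)
        else:
            buckets["unclassified"].append(line)
    return buckets
```

Under this sketch, a run would count as done when the `must_match` bucket is empty and every line in `unclassified` has been either explained or moved into one of the two lists.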
diff --git a/.qwen/skills/agent-reproduce-align/references/alignment-workflow.md b/.qwen/skills/agent-reproduce-align/references/alignment-workflow.md
@@ -0,0 +1,84 @@
+# Alignment Workflow Reference
+
+The alignment phase starts after Qwen Code has a candidate implementation. Use it to create a tight loop: run the selected reference agent and Qwen Code, compare traces, patch the target, and rerun only the failing scenario.
+
+## Trace Inputs
+
+Expected raw capture layout:
+
+```text
+.repro-runs/<scenario>/
+  reference/
+    http.jsonl
+    command.stdout
+    command.stderr
+    command.exit
+    state-before/state-manifest.json
+    state-after/state-manifest.json
+    state-diff/state-diff.md
+  qwen/
+    http.jsonl
+    command.stdout
+    command.stderr
+    command.exit
+```
+
+Use capture scripts from `$agent-reproduce-feature` for raw capture, or use
+`run_pair_capture.sh` for simple non-interactive shell scenarios. Set
+`REPRO_REFERENCE_AGENT=codex` or `REPRO_REFERENCE_AGENT=claude-code` with the
+paired runner to capture reference-agent state automatically.
+
+## Normalization
+
+`normalize_trace.py` reads mitm JSONL output and emits stable JSON:
+
+- request method and URL path
+- JSON request body summary
+- message role order and brief content hashes
+- tool/function names
+- schema required fields
+- response status code
+
+It intentionally drops:
+
+- timestamps
+- authorization and cookie headers
+- provider request IDs
+- full message text unless needed for a hash
+
+## Diff Triage
+
+High priority:
+
+- missing request entirely
+- wrong endpoint family
+- missing tool/function schema
+- incompatible required fields or enum values
+- slash command not routed to the same behavior class
+- state changes that prove the feature writes config, memory, permissions, or
+  another user-visible local store
+
+Medium priority:
+
+- prompt role ordering differences
+- terminal output phrasing differences
+- streaming versus non-streaming if users can observe it
+- unexplained state changes that plausibly affect future runs
+
+Low priority:
+
+- timestamps, IDs, token counts
+- harmless wording differences
+- extra target-side metadata ignored by the provider
+
+## Iteration Loop
+
+1. Pick the highest-priority unexplained mismatch.
+2. Patch only the likely owner module in Qwen Code.
+3. Run the focused test/smoke path.
+4. Capture only the affected scenario again.
+5. Refresh the reference state diff if the suspected mismatch involves local
+   state.
+6. Normalize and compare again.
+
+Stop when the target behavior is compatible and remaining differences are either irrelevant or explicitly documented.
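The normalization step described above can be illustrated with a short sketch. This is not the `normalize_trace.py` from this PR. The raw record shape (`method`, `url`, `status`, `request_body`) and the OpenAI-style tool layout are assumptions made for illustration, and `content_hash` and `normalize_record` are hypothetical names. The output keys are chosen to line up with the fields that `compare_traces.py` reads.

```python
from __future__ import annotations

import hashlib
import json
from typing import Any
from urllib.parse import urlsplit


def content_hash(content: Any) -> str:
    """Short, stable hash so message text can be compared without storing it."""
    text = content if isinstance(content, str) else json.dumps(content, sort_keys=True)
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:12]


def normalize_record(record: dict[str, Any]) -> dict[str, Any]:
    """Reduce one captured request (assumed shape) to its stable fields."""
    body = record.get("request_body") or {}
    if isinstance(body, str):
        body = json.loads(body)

    tools = []
    for tool in body.get("tools") or []:
        # Assumed layout: {"type": "function", "function": {...}} or a flat dict.
        fn = tool.get("function") or tool
        params = fn.get("parameters") or {}
        tools.append(
            {
                "name": fn.get("name"),
                "required": sorted(params.get("required") or []),
                "properties": sorted((params.get("properties") or {}).keys()),
            }
        )

    # Headers, timestamps, and request IDs are dropped by never being copied.
    return {
        "method": record.get("method"),
        "url_path": urlsplit(record.get("url") or "").path,
        "model": body.get("model"),
        "stream": bool(body.get("stream", False)),
        "messages": [
            {"role": m.get("role"), "content_hash": content_hash(m.get("content"))}
            for m in body.get("messages") or []
        ],
        "tools": tools,
        "response_status": record.get("status"),
    }
```

Building the output from an allowlist of fields, rather than deleting known-noisy ones, means a newly introduced header or ID cannot leak into a fixture by accident.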
diff --git a/.qwen/skills/agent-reproduce-align/scripts/compare_traces.py b/.qwen/skills/agent-reproduce-align/scripts/compare_traces.py
@@ -0,0 +1,85 @@
+#!/usr/bin/env python3
+"""Compare normalized reproduction traces and print actionable differences."""
+
+from __future__ import annotations
+
+import argparse
+import json
+from pathlib import Path
+from typing import Any
+
+
+def load(path: Path) -> dict[str, Any]:
+    return json.loads(path.read_text(encoding="utf-8"))
+
+
+def tool_index(request: dict[str, Any]) -> dict[str, dict[str, Any]]:
+    return {
+        tool.get("name") or f"<unnamed-{idx}>": tool
+        for idx, tool in enumerate(request.get("tools") or [])
+    }
+
+
+def compare_request(idx: int, left: dict[str, Any], right: dict[str, Any]) -> list[str]:
+    diffs: list[str] = []
+    prefix = f"request[{idx}]"
+    for key in ("method", "url_path", "model", "stream", "response_status"):
+        if left.get(key) != right.get(key):
+            diffs.append(f"{prefix}.{key}: {left.get(key)!r} != {right.get(key)!r}")
+
+    left_roles = [item.get("role") for item in left.get("messages") or []]
+    right_roles = [item.get("role") for item in right.get("messages") or []]
+    if left_roles != right_roles:
+        diffs.append(f"{prefix}.message_roles: {left_roles!r} != {right_roles!r}")
+
+    left_tools = tool_index(left)
+    right_tools = tool_index(right)
+    missing = sorted(set(left_tools) - set(right_tools))
+    extra = sorted(set(right_tools) - set(left_tools))
+    if missing:
+        diffs.append(f"{prefix}.tools_missing_in_right: {missing}")
+    if extra:
+        diffs.append(f"{prefix}.tools_extra_in_right: {extra}")
+
+    for name in sorted(set(left_tools) & set(right_tools)):
+        for key in ("required", "properties"):
+            if left_tools[name].get(key) != right_tools[name].get(key):
+                diffs.append(
+                    f"{prefix}.tool[{name}].{key}: "
+                    f"{left_tools[name].get(key)!r} != {right_tools[name].get(key)!r}"
+                )
+    return diffs
+
+
+def main() -> int:
+    parser = argparse.ArgumentParser()
+    parser.add_argument("left", type=Path, help="Reference normalized trace")
+    parser.add_argument("right", type=Path, help="Target normalized trace, usually Qwen Code")
+    args = parser.parse_args()
+
+    left = load(args.left)
+    right = load(args.right)
+    diffs: list[str] = []
+
+    if left.get("request_count") != right.get("request_count"):
+        diffs.append(
+            f"request_count: {left.get('request_count')!r} != {right.get('request_count')!r}"
+        )
+
+    for idx, (left_req, right_req) in enumerate(
+        zip(left.get("requests") or [], right.get("requests") or [])
+    ):
+        diffs.extend(compare_request(idx, left_req, right_req))
+
+    if not diffs:
+        print("No normalized trace differences found.")
+        return 0
+
+    print("Normalized trace differences:")
+    for diff in diffs:
+        print(f"- {diff}")
+    return 1
+
+
+if __name__ == "__main__":
+    raise SystemExit(main())
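Because `main()` pairs requests with `zip`, comparison stops at the shorter trace, and any requests beyond that point surface only through the `request_count` mismatch. A minimal sketch of reporting them explicitly follows. The helper name `unpaired_requests` and the `extra_in_left` / `extra_in_right` labels are illustrative and not part of the script as added in this diff.

```python
from __future__ import annotations

from typing import Any


def unpaired_requests(
    left_reqs: list[dict[str, Any]],
    right_reqs: list[dict[str, Any]],
) -> list[str]:
    """Describe requests that have no counterpart at the same index."""
    diffs: list[str] = []
    paired = min(len(left_reqs), len(right_reqs))
    # Requests present only in the reference trace.
    for idx in range(paired, len(left_reqs)):
        diffs.append(
            f"request[{idx}].extra_in_left: {left_reqs[idx].get('url_path')!r}"
        )
    # Requests present only in the target trace.
    for idx in range(paired, len(right_reqs)):
        diffs.append(
            f"request[{idx}].extra_in_right: {right_reqs[idx].get('url_path')!r}"
        )
    return diffs
```

The result could be appended to `diffs` after the paired loop, so an extra or missing request is named by its URL path instead of appearing only as a count difference.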