feat(skills): add agent reproduction workflows #4118
base: main
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,98 @@ | ||
| --- | ||
| name: agent-reproduce-align | ||
| description: Use after a Codex or Claude Code feature has been implemented in Qwen Code to run the selected reference agent and Qwen Code under the same scenario, capture HTTP and terminal traces, compare request bodies, tool/function schemas, outputs, and iterate until the reproduced behavior is close enough. | ||
| --- | ||
|
|
||
| # Agent Reproduce Align | ||
|
|
||
| ## Purpose | ||
|
|
||
| Use this skill when Qwen Code already has a candidate implementation and needs evidence-based parity with a selected reference agent: `codex` or `claude-code`. The goal is not byte-for-byte equality; it is matching the observable contract that matters for the feature. | ||
|
|
||
| Default target repo: the current working directory. Use a user-specified path only when the user explicitly provides one. | ||
|
|
||
| ## Reference Agent Selection | ||
|
|
||
| Use the same reference agent selected during `$agent-reproduce-feature`. If the earlier choice is unavailable, ask once and record the answer in the scenario or run notes. | ||
|
|
||
| ## Workflow | ||
|
|
||
| 1. Re-state the parity target: | ||
| - feature name and trigger | ||
| - selected reference agent | ||
| - one baseline prompt or interaction script | ||
| - acceptable differences | ||
| - must-match fields | ||
| 2. Run the reference agent and Qwen Code in separate capture directories with the same scenario. | ||
| 3. Capture the selected reference agent's local state before and after the | ||
| reference run when state may affect parity. | ||
| 4. Normalize traces with `scripts/normalize_trace.py`. | ||
| 5. Compare normalized traces with `scripts/compare_traces.py`. | ||
| 6. Inspect differences in this order: | ||
| - reference-agent state changes that explain behavior | ||
| - missing tool/function names | ||
| - schema shape and required fields | ||
| - model settings and response mode | ||
| - prompt role/order differences that affect behavior | ||
| - terminal-visible output and exit status | ||
| 7. Patch Qwen Code, rerun the smallest failing scenario, and repeat. | ||
| 8. Preserve only redacted minimal fixtures in the repo. | ||
|
|
||
| Read `references/alignment-workflow.md` before the first comparison pass. | ||
|
|
||
| ## Common Commands | ||
|
|
||
| Normalize: | ||
|
|
||
| ```sh | ||
| skills/agent-reproduce-align/scripts/normalize_trace.py \ | ||
| .repro-runs/reference/http.jsonl \ | ||
| > .repro-runs/reference/normalized.json | ||
| ``` | ||
|
|
||
| Compare: | ||
|
|
||
| ```sh | ||
| skills/agent-reproduce-align/scripts/compare_traces.py \ | ||
| .repro-runs/reference/normalized.json \ | ||
| .repro-runs/qwen/normalized.json | ||
| ``` | ||
|
|
||
| Run a paired shell scenario: | ||
|
|
||
| ```sh | ||
| REPRO_REFERENCE_AGENT=codex \ | ||
| skills/agent-reproduce-align/scripts/run_pair_capture.sh \ | ||
| .repro-runs/slash-help \ | ||
| "codex exec '/help'" \ | ||
| "npm test -- --runInBand" | ||
| ``` | ||
|
|
||
| For Claude Code, set `REPRO_REFERENCE_AGENT=claude-code` and replace the first | ||
| command with the discovered Claude Code command. When `REPRO_REFERENCE_AGENT` | ||
| is set, the paired runner writes `reference/state-before`, | ||
| `reference/state-after`, and `reference/state-diff`. Use the paired runner only | ||
| when shell quoting is simple. For interactive slash commands, run the two | ||
| captures manually with tmux so each side can receive the same keystrokes. Use | ||
| `REPRO_REFERENCE_STATE_ROOT=/tmp/some-root` only for tests or custom state | ||
| directories. | ||
|
|
||
| ## Comparison Rules | ||
|
|
||
| - Compare contracts before wording. Exact prompt text is usually implementation detail. | ||
| - Treat absent schemas, wrong required fields, or wrong argument names as high-signal failures. | ||
| - Treat output ordering as significant only when the user-visible workflow depends on it. | ||
| - Do not chase provider-specific endpoints, model names, IDs, timestamps, token counts, or ephemeral headers unless the feature depends on them. | ||
| - Do not chase every local state write. Treat state diffs as explanatory | ||
| evidence unless the feature contract requires a particular config, memory, or | ||
| permission side effect. | ||
| - Stop when Qwen Code passes the user-visible scenario and the remaining trace differences are documented as intentional. | ||
|
|
||
| ## Done Criteria | ||
|
|
||
| - Reference-agent and Qwen Code traces for the same scenario exist locally. | ||
| - Reference-agent state diff exists or state capture is documented as | ||
| irrelevant for the scenario. | ||
| - The normalized comparison has no unexplained must-match differences. | ||
| - Qwen Code tests or smoke commands cover the fixed behavior. | ||
| - Any remaining mismatch is written down in the task notes or Qwen Code docs when it affects users. |
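
The parity target from workflow step 1 and the "no unexplained must-match differences" criterion can be made concrete with a small filter over the lines printed by `compare_traces.py`. The following is a minimal sketch, not part of this PR: `ParityTarget` and `unexplained` are hypothetical names, and it assumes diff lines keep the `path: left != right` shape that the script emits.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class ParityTarget:
    """Hypothetical record of the parity target restated in workflow step 1."""

    feature: str
    reference_agent: str  # "codex" or "claude-code"
    baseline_prompt: str
    must_match: set[str] = field(default_factory=set)
    acceptable: set[str] = field(default_factory=set)


def unexplained(diffs: list[str], target: ParityTarget) -> list[str]:
    """Return diff lines that touch a must-match field and are not waived."""
    result = []
    for line in diffs:
        # A diff line looks like "request[0].model: 'a' != 'b'"; the field
        # path is everything before the first ": ".
        path = line.split(": ", 1)[0]
        # The last dotted component names the compared field.
        leaf = path.rsplit(".", 1)[-1]
        if leaf in target.acceptable:
            continue
        if leaf in target.must_match:
            result.append(line)
    return result
```

An empty result from `unexplained` would correspond to the third done criterion; anything it returns still needs either a fix or a documented waiver.
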
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,84 @@ | ||
| # Alignment Workflow Reference | ||
|
|
||
| The alignment phase starts after Qwen Code has a candidate implementation. Use it to create a tight loop: run the selected reference agent and Qwen Code, compare traces, patch the target, and rerun only the failing scenario. | ||
|
|
||
| ## Trace Inputs | ||
|
|
||
| Expected raw capture layout: | ||
|
|
||
| ```text | ||
| .repro-runs/<scenario>/ | ||
| reference/ | ||
| http.jsonl | ||
| command.stdout | ||
| command.stderr | ||
| command.exit | ||
| state-before/state-manifest.json | ||
| state-after/state-manifest.json | ||
| state-diff/state-diff.md | ||
| qwen/ | ||
| http.jsonl | ||
| command.stdout | ||
| command.stderr | ||
| command.exit | ||
| ``` | ||
|
|
||
| Use capture scripts from `$agent-reproduce-feature` for raw capture, or use | ||
| `run_pair_capture.sh` for simple non-interactive shell scenarios. Set | ||
| `REPRO_REFERENCE_AGENT=codex` or `REPRO_REFERENCE_AGENT=claude-code` with the | ||
| paired runner to capture reference-agent state automatically. | ||
|
|
||
| ## Normalization | ||
|
|
||
| `normalize_trace.py` reads mitm JSONL output and emits stable JSON: | ||
|
|
||
| - request method and URL path | ||
| - JSON request body summary | ||
| - message role order and brief content hashes | ||
| - tool/function names | ||
| - schema required fields | ||
| - response status code | ||
|
|
||
| It intentionally drops: | ||
|
|
||
| - timestamps | ||
| - authorization and cookie headers | ||
| - provider request IDs | ||
| - full message text unless needed for a hash | ||
|
|
||
| ## Diff Triage | ||
|
|
||
| High priority: | ||
|
|
||
| - missing request entirely | ||
| - wrong endpoint family | ||
| - missing tool/function schema | ||
| - incompatible required fields or enum values | ||
| - slash command not routed to the same behavior class | ||
| - state changes that prove the feature writes config, memory, permissions, or | ||
| another user-visible local store | ||
|
|
||
| Medium priority: | ||
|
|
||
| - prompt role ordering differences | ||
| - terminal output phrasing differences | ||
| - streaming versus non-streaming if users can observe it | ||
| - unexplained state changes that plausibly affect future runs | ||
|
|
||
| Low priority: | ||
|
|
||
| - timestamps, IDs, token counts | ||
| - harmless wording differences | ||
| - extra target-side metadata ignored by the provider | ||
|
|
||
| ## Iteration Loop | ||
|
|
||
| 1. Pick the highest-priority unexplained mismatch. | ||
| 2. Patch only the likely owner module in Qwen Code. | ||
| 3. Run the focused test/smoke path. | ||
| 4. Capture only the affected scenario again. | ||
| 5. Refresh the reference state diff if the suspected mismatch involves local | ||
| state. | ||
| 6. Normalize and compare again. | ||
|
|
||
| Stop when the target behavior is compatible and remaining differences are either irrelevant or explicitly documented. |
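
The triage lists and step 1 of the iteration loop ("pick the highest-priority unexplained mismatch") can be sketched as a small classifier over `compare_traces.py` output lines. This is an illustration under assumptions, not part of the PR: `RULES`, `triage`, and `sort_by_priority` are hypothetical names, and the mapping from diff keys to priorities is one reading of the lists above (for example, a `request_count` mismatch is treated as "missing request entirely", and anything unmatched, including `model` and `tools_extra_in_right`, falls through to low).

```python
from __future__ import annotations

import re

# Ordered rules: the first matching pattern wins. Patterns target the diff
# line prefixes produced by compare_traces.py in this PR.
RULES: list[tuple[str, str]] = [
    (r"^request_count:", "high"),
    (r"\.url_path:", "high"),
    (r"\.tools_missing_in_right:", "high"),
    (r"\.tool\[.*\]\.(required|properties):", "high"),
    (r"\.message_roles:", "medium"),
    (r"\.stream:", "medium"),
]


def triage(diff: str) -> str:
    """Classify one diff line as "high", "medium", or "low"."""
    for pattern, priority in RULES:
        if re.search(pattern, diff):
            return priority
    return "low"


def sort_by_priority(diffs: list[str]) -> list[str]:
    """Order diff lines so the highest-priority mismatch comes first."""
    order = {"high": 0, "medium": 1, "low": 2}
    # sorted() is stable, so lines of equal priority keep their trace order.
    return sorted(diffs, key=lambda d: order[triage(d)])
```

State diffs and terminal output are not covered here because `compare_traces.py` does not report them; they would still need manual inspection.
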
| Original file line number | Diff line number | Diff line change | ||||||||||||||||||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| @@ -0,0 +1,85 @@ | ||||||||||||||||||||||||||||||||||
| #!/usr/bin/env python3 | ||||||||||||||||||||||||||||||||||
| """Compare normalized reproduction traces and print actionable differences.""" | ||||||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||||||
| from __future__ import annotations | ||||||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||||||
| import argparse | ||||||||||||||||||||||||||||||||||
| import json | ||||||||||||||||||||||||||||||||||
| from pathlib import Path | ||||||||||||||||||||||||||||||||||
| from typing import Any | ||||||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||||||
| def load(path: Path) -> dict[str, Any]: | ||||||||||||||||||||||||||||||||||
| return json.loads(path.read_text(encoding="utf-8")) | ||||||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||||||
| def tool_index(request: dict[str, Any]) -> dict[str, dict[str, Any]]: | ||||||||||||||||||||||||||||||||||
| return { | ||||||||||||||||||||||||||||||||||
| tool.get("name") or f"<unnamed-{idx}>": tool | ||||||||||||||||||||||||||||||||||
| for idx, tool in enumerate(request.get("tools") or []) | ||||||||||||||||||||||||||||||||||
| } | ||||||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||||||
| def compare_request(idx: int, left: dict[str, Any], right: dict[str, Any]) -> list[str]: | ||||||||||||||||||||||||||||||||||
| diffs: list[str] = [] | ||||||||||||||||||||||||||||||||||
| prefix = f"request[{idx}]" | ||||||||||||||||||||||||||||||||||
| for key in ("method", "url_path", "model", "stream", "response_status"): | ||||||||||||||||||||||||||||||||||
| if left.get(key) != right.get(key): | ||||||||||||||||||||||||||||||||||
| diffs.append(f"{prefix}.{key}: {left.get(key)!r} != {right.get(key)!r}") | ||||||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||||||
| left_roles = [item.get("role") for item in left.get("messages") or []] | ||||||||||||||||||||||||||||||||||
| right_roles = [item.get("role") for item in right.get("messages") or []] | ||||||||||||||||||||||||||||||||||
| if left_roles != right_roles: | ||||||||||||||||||||||||||||||||||
| diffs.append(f"{prefix}.message_roles: {left_roles!r} != {right_roles!r}") | ||||||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||||||
| left_tools = tool_index(left) | ||||||||||||||||||||||||||||||||||
| right_tools = tool_index(right) | ||||||||||||||||||||||||||||||||||
| missing = sorted(set(left_tools) - set(right_tools)) | ||||||||||||||||||||||||||||||||||
| extra = sorted(set(right_tools) - set(left_tools)) | ||||||||||||||||||||||||||||||||||
| if missing: | ||||||||||||||||||||||||||||||||||
| diffs.append(f"{prefix}.tools_missing_in_right: {missing}") | ||||||||||||||||||||||||||||||||||
| if extra: | ||||||||||||||||||||||||||||||||||
| diffs.append(f"{prefix}.tools_extra_in_right: {extra}") | ||||||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||||||
|
Collaborator
[Suggestion] Tool comparison omits Two tools with the same name but different
— DeepSeek/deepseek-v4-pro via Qwen Code /review |
||||||||||||||||||||||||||||||||||
| for name in sorted(set(left_tools) & set(right_tools)): | ||||||||||||||||||||||||||||||||||
| for key in ("required", "properties"): | ||||||||||||||||||||||||||||||||||
| if left_tools[name].get(key) != right_tools[name].get(key): | ||||||||||||||||||||||||||||||||||
| diffs.append( | ||||||||||||||||||||||||||||||||||
| f"{prefix}.tool[{name}].{key}: " | ||||||||||||||||||||||||||||||||||
| f"{left_tools[name].get(key)!r} != {right_tools[name].get(key)!r}" | ||||||||||||||||||||||||||||||||||
| ) | ||||||||||||||||||||||||||||||||||
| return diffs | ||||||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||||||
| def main() -> int: | ||||||||||||||||||||||||||||||||||
| parser = argparse.ArgumentParser() | ||||||||||||||||||||||||||||||||||
| parser.add_argument("left", type=Path, help="Reference normalized trace") | ||||||||||||||||||||||||||||||||||
| parser.add_argument("right", type=Path, help="Target normalized trace, usually Qwen Code") | ||||||||||||||||||||||||||||||||||
| args = parser.parse_args() | ||||||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||||||
| left = load(args.left) | ||||||||||||||||||||||||||||||||||
| right = load(args.right) | ||||||||||||||||||||||||||||||||||
| diffs: list[str] = [] | ||||||||||||||||||||||||||||||||||
|
Collaborator
[Suggestion] When
— DeepSeek/deepseek-v4-pro via Qwen Code /review |
||||||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||||||
| if left.get("request_count") != right.get("request_count"): | ||||||||||||||||||||||||||||||||||
| diffs.append( | ||||||||||||||||||||||||||||||||||
| f"request_count: {left.get('request_count')!r} != {right.get('request_count')!r}" | ||||||||||||||||||||||||||||||||||
| ) | ||||||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||||||
| for idx, (left_req, right_req) in enumerate( | ||||||||||||||||||||||||||||||||||
| zip(left.get("requests") or [], right.get("requests") or []) | ||||||||||||||||||||||||||||||||||
| ): | ||||||||||||||||||||||||||||||||||
| diffs.extend(compare_request(idx, left_req, right_req)) | ||||||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||||||
| if not diffs: | ||||||||||||||||||||||||||||||||||
| print("No normalized trace differences found.") | ||||||||||||||||||||||||||||||||||
| return 0 | ||||||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||||||
| print("Normalized trace differences:") | ||||||||||||||||||||||||||||||||||
| for diff in diffs: | ||||||||||||||||||||||||||||||||||
| print(f"- {diff}") | ||||||||||||||||||||||||||||||||||
|
Collaborator
[Suggestion] Exit code If
— DeepSeek/deepseek-v4-pro via Qwen Code /review |
||||||||||||||||||||||||||||||||||
| return 1 | ||||||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||||||
| if __name__ == "__main__": | ||||||||||||||||||||||||||||||||||
| raise SystemExit(main()) | ||||||||||||||||||||||||||||||||||
[Suggestion]
`body_keys` field is produced by `normalize_trace.py` but never compared. `normalize_trace.py` outputs `body_keys: sorted(body.keys())` to capture request parameter shape (`model`, `stream`, `temperature`, `max_tokens`, etc.), but `compare_request()` never reads this field. Differences in request parameters other than `model` and `stream` are silently ignored.
(DeepSeek/deepseek-v4-pro via Qwen Code /review)
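
One possible way to address the `body_keys` gap is sketched below. It also covers a related gap visible in `main()`: `zip` stops at the shorter request list, so requests present on only one side are reported only through `request_count`. This is a hedged illustration, not the reviewer's suggested change: `compare_body_keys` and `unpaired_requests` are hypothetical helpers, and the `body_keys_missing_in_right` / `body_keys_extra_in_right` labels are invented here to mirror the existing `tools_*` labels.

```python
from __future__ import annotations

from itertools import zip_longest
from typing import Any


def compare_body_keys(idx: int, left: dict[str, Any], right: dict[str, Any]) -> list[str]:
    """Report request-body keys present on only one side."""
    left_keys = set(left.get("body_keys") or [])
    right_keys = set(right.get("body_keys") or [])
    diffs: list[str] = []
    missing = sorted(left_keys - right_keys)
    extra = sorted(right_keys - left_keys)
    if missing:
        diffs.append(f"request[{idx}].body_keys_missing_in_right: {missing}")
    if extra:
        diffs.append(f"request[{idx}].body_keys_extra_in_right: {extra}")
    return diffs


def unpaired_requests(left: dict[str, Any], right: dict[str, Any]) -> list[str]:
    """Report request indexes that exist in only one trace."""
    diffs: list[str] = []
    pairs = zip_longest(left.get("requests") or [], right.get("requests") or [])
    for idx, (left_req, right_req) in enumerate(pairs):
        # zip_longest pads the shorter list with None.
        if left_req is None:
            diffs.append(f"request[{idx}]: present only in right")
        elif right_req is None:
            diffs.append(f"request[{idx}]: present only in left")
    return diffs
```

Both helpers return lists in the same style as `compare_request()`, so `main()` could extend `diffs` with their results without changing the exit-code logic.
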