3 changes: 2 additions & 1 deletion .gitignore
@@ -64,6 +64,7 @@ packages/web-templates/src/generated/
packages/vscode-ide-companion/*.vsix

logs/
.repro-runs/
# GHA credentials
gha-creds-*.json

@@ -93,4 +94,4 @@ tmp/

# code graph skills
.venv
.codegraph
.codegraph
98 changes: 98 additions & 0 deletions .qwen/skills/agent-reproduce-align/SKILL.md
@@ -0,0 +1,98 @@
---
name: agent-reproduce-align
description: Use after a Codex or Claude Code feature has been implemented in Qwen Code to run the selected reference agent and Qwen Code under the same scenario, capture HTTP and terminal traces, compare request bodies, tool/function schemas, outputs, and iterate until the reproduced behavior is close enough.
---

# Agent Reproduce Align

## Purpose

Use this skill when Qwen Code already has a candidate implementation and needs evidence-based parity with a selected reference agent: `codex` or `claude-code`. The goal is not byte-for-byte equality; it is matching the observable contract that matters for the feature.

Default target repo: the current working directory. Use a user-specified path only when the user explicitly provides one.

## Reference Agent Selection

Use the same reference agent selected during `$agent-reproduce-feature`. If the earlier choice is unavailable, ask once and record the answer in the scenario or run notes.

## Workflow

1. Re-state the parity target:
   - feature name and trigger
   - selected reference agent
   - one baseline prompt or interaction script
   - acceptable differences
   - must-match fields
2. Run the reference agent and Qwen Code in separate capture directories with the same scenario.
3. Capture the selected reference agent's local state before and after the
   reference run when state may affect parity.
4. Normalize traces with `scripts/normalize_trace.py`.
5. Compare normalized traces with `scripts/compare_traces.py`.
6. Inspect differences in this order:
   - reference-agent state changes that explain behavior
   - missing tool/function names
   - schema shape and required fields
   - model settings and response mode
   - prompt role/order differences that affect behavior
   - terminal-visible output and exit status
7. Patch Qwen Code, rerun the smallest failing scenario, and repeat.
8. Preserve only redacted minimal fixtures in the repo.

Read `references/alignment-workflow.md` before the first comparison pass.

## Common Commands

Normalize:

```sh
skills/agent-reproduce-align/scripts/normalize_trace.py \
.repro-runs/reference/http.jsonl \
> .repro-runs/reference/normalized.json
```

Compare:

```sh
skills/agent-reproduce-align/scripts/compare_traces.py \
.repro-runs/reference/normalized.json \
.repro-runs/qwen/normalized.json
```

Run a paired shell scenario:

```sh
REPRO_REFERENCE_AGENT=codex \
skills/agent-reproduce-align/scripts/run_pair_capture.sh \
.repro-runs/slash-help \
"codex exec '/help'" \
"npm test -- --runInBand"
```

For Claude Code, set `REPRO_REFERENCE_AGENT=claude-code` and replace the first
command with the discovered Claude Code command. When `REPRO_REFERENCE_AGENT`
is set, the paired runner writes `reference/state-before`,
`reference/state-after`, and `reference/state-diff`. Use the paired runner only
when shell quoting is simple. For interactive slash commands, run the two
captures manually with tmux so each side can receive the same keystrokes. Use
`REPRO_REFERENCE_STATE_ROOT=/tmp/some-root` only for tests or custom state
directories.

## Comparison Rules

- Compare contracts before wording. Exact prompt text is usually implementation detail.
- Treat absent schemas, wrong required fields, or wrong argument names as high-signal failures.
- Treat output ordering as significant only when the user-visible workflow depends on it.
- Do not chase provider-specific endpoints, model names, IDs, timestamps, token counts, or ephemeral headers unless the feature depends on them.
- Do not chase every local state write. Treat state diffs as explanatory
evidence unless the feature contract requires a particular config, memory, or
permission side effect.
- Stop when Qwen Code passes the user-visible scenario and the remaining trace differences are documented as intentional.

## Done Criteria

- Reference-agent and Qwen Code traces for the same scenario exist locally.
- Reference-agent state diff exists or state capture is documented as
irrelevant for the scenario.
- The normalized comparison has no unexplained must-match differences.
- Qwen Code tests or smoke commands cover the fixed behavior.
- Any remaining mismatch is written down in the task notes or Qwen Code docs when it affects users.
@@ -0,0 +1,84 @@
# Alignment Workflow Reference

The alignment phase starts after Qwen Code has a candidate implementation. Use it to create a tight loop: run the selected reference agent and Qwen Code, compare traces, patch the target, and rerun only the failing scenario.

## Trace Inputs

Expected raw capture layout:

```text
.repro-runs/<scenario>/
  reference/
    http.jsonl
    command.stdout
    command.stderr
    command.exit
    state-before/state-manifest.json
    state-after/state-manifest.json
    state-diff/state-diff.md
  qwen/
    http.jsonl
    command.stdout
    command.stderr
    command.exit
```

Use capture scripts from `$agent-reproduce-feature` for raw capture, or use
`run_pair_capture.sh` for simple non-interactive shell scenarios. Set
`REPRO_REFERENCE_AGENT=codex` or `REPRO_REFERENCE_AGENT=claude-code` with the
paired runner to capture reference-agent state automatically.

## Normalization

`normalize_trace.py` reads mitm JSONL output and emits stable JSON:

- request method and URL path
- JSON request body summary
- message role order and brief content hashes
- tool/function names
- schema required fields
- response status code

It intentionally drops:

- timestamps
- authorization and cookie headers
- provider request IDs
- full message text unless needed for a hash
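
As a rough illustration of the summary described above, the following sketch reduces one captured record to stable fields. The output keys mirror what `compare_traces.py` reads (`method`, `url_path`, `model`, `stream`, `messages`, `tools`, `response_status`) plus `body_keys`. The input record shape, the helper names `short_hash` and `summarize_record`, the OpenAI-style tool nesting, and the 12-character SHA-256 prefix are all assumptions for illustration; the real `normalize_trace.py` may differ.

```python
import hashlib
import json
from typing import Any


def short_hash(text: str) -> str:
    """Return a short, stable digest so full message text is not stored."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:12]


def summarize_record(record: dict[str, Any]) -> dict[str, Any]:
    """Summarize one captured request/response pair (hypothetical input shape)."""
    request = record.get("request") or {}
    response = record.get("response") or {}
    body = request.get("body") or {}
    tools = []
    for tool in body.get("tools") or []:
        # Assumed nesting: {"type": "function", "function": {...}}; fall back to flat.
        fn = tool.get("function") or tool
        params = fn.get("parameters") or {}
        tools.append(
            {
                "name": fn.get("name"),
                "required": sorted(params.get("required") or []),
                "properties": sorted((params.get("properties") or {}).keys()),
            }
        )
    # Timestamps and headers are never copied, so they cannot leak into the output.
    return {
        "method": request.get("method"),
        "url_path": request.get("path"),
        "model": body.get("model"),
        "stream": body.get("stream"),
        "body_keys": sorted(body.keys()),
        "messages": [
            {
                "role": msg.get("role"),
                "content_hash": short_hash(json.dumps(msg.get("content"), sort_keys=True)),
            }
            for msg in body.get("messages") or []
        ],
        "tools": tools,
        "response_status": response.get("status_code"),
    }
```

Building the output from an explicit allow-list of fields, rather than deleting known-noisy ones, means any new volatile field in the capture is dropped by default.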

## Diff Triage

High priority:

- missing request entirely
- wrong endpoint family
- missing tool/function schema
- incompatible required fields or enum values
- slash command not routed to the same behavior class
- state changes that prove the feature writes config, memory, permissions, or
another user-visible local store

Medium priority:

- prompt role ordering differences
- terminal output phrasing differences
- streaming versus non-streaming if users can observe it
- unexplained state changes that plausibly affect future runs

Low priority:

- timestamps, IDs, token counts
- harmless wording differences
- extra target-side metadata ignored by the provider
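
One way to apply these tiers mechanically is to bucket the diff lines printed by `compare_traces.py` by their label. The sketch below is only an illustration: the label patterns follow the strings that script emits, but the priority assigned to each label, the `medium` default for unknown labels, and the names `PRIORITY_RULES`, `classify`, and `triage` are assumptions.

```python
import re

# Ordered rules: the first matching pattern wins. Patterns are anchored at the
# start of the diff line so values on the right-hand side cannot match.
PRIORITY_RULES = [
    ("high", re.compile(r"request_count:")),
    ("high", re.compile(r"request\[\d+\]\.url_path:")),
    ("high", re.compile(r"request\[\d+\]\.tools_missing_in_right:")),
    ("high", re.compile(r"request\[\d+\]\.tool\[.+?\]\.(required|properties):")),
    ("medium", re.compile(r"request\[\d+\]\.message_roles:")),
    ("medium", re.compile(r"request\[\d+\]\.stream:")),
    ("low", re.compile(r"request\[\d+\]\.model:")),
]


def classify(diff: str) -> str:
    """Return 'high', 'medium', or 'low' for one diff line."""
    for priority, pattern in PRIORITY_RULES:
        if pattern.match(diff):
            return priority
    return "medium"  # Unknown labels get a human look rather than being hidden.


def triage(diffs: list[str]) -> dict[str, list[str]]:
    """Group diff lines by priority, preserving their original order."""
    grouped: dict[str, list[str]] = {"high": [], "medium": [], "low": []}
    for diff in diffs:
        grouped[classify(diff)].append(diff)
    return grouped
```

State diffs and terminal output are not covered here because they are not part of the normalized trace comparison.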

## Iteration Loop

1. Pick the highest-priority unexplained mismatch.
2. Patch only the likely owner module in Qwen Code.
3. Run the focused test/smoke path.
4. Capture only the affected scenario again.
5. Refresh the reference state diff if the suspected mismatch involves local
state.
6. Normalize and compare again.

Stop when the target behavior is compatible and remaining differences are either irrelevant or explicitly documented.
85 changes: 85 additions & 0 deletions .qwen/skills/agent-reproduce-align/scripts/compare_traces.py
@@ -0,0 +1,85 @@
#!/usr/bin/env python3
"""Compare normalized reproduction traces and print actionable differences."""

from __future__ import annotations

import argparse
import json
from pathlib import Path
from typing import Any


def load(path: Path) -> dict[str, Any]:
    return json.loads(path.read_text(encoding="utf-8"))


def tool_index(request: dict[str, Any]) -> dict[str, dict[str, Any]]:
    return {
        tool.get("name") or f"<unnamed-{idx}>": tool
        for idx, tool in enumerate(request.get("tools") or [])
    }


def compare_request(idx: int, left: dict[str, Any], right: dict[str, Any]) -> list[str]:
    diffs: list[str] = []
    prefix = f"request[{idx}]"
    for key in ("method", "url_path", "model", "stream", "response_status"):
        if left.get(key) != right.get(key):
            diffs.append(f"{prefix}.{key}: {left.get(key)!r} != {right.get(key)!r}")

    left_roles = [item.get("role") for item in left.get("messages") or []]
    right_roles = [item.get("role") for item in right.get("messages") or []]
    if left_roles != right_roles:
        diffs.append(f"{prefix}.message_roles: {left_roles!r} != {right_roles!r}")

    left_tools = tool_index(left)
    right_tools = tool_index(right)
    missing = sorted(set(left_tools) - set(right_tools))
    extra = sorted(set(right_tools) - set(left_tools))
    if missing:
        diffs.append(f"{prefix}.tools_missing_in_right: {missing}")
    if extra:
        diffs.append(f"{prefix}.tools_extra_in_right: {extra}")

[Suggestion] body_keys field is produced by normalize_trace.py but never compared.

normalize_trace.py outputs body_keys: sorted(body.keys()) to capture request parameter shape (model, stream, temperature, max_tokens, etc.), but compare_request() never reads this field. Differences in request parameters other than model and stream are silently ignored.

Suggested change
# In compare_request(), add to the comparison loop:
body_keys_diff = _diff(
    left.get("body_keys") or [],
    right.get("body_keys") or [],
)
if body_keys_diff:
    diffs["body_keys"] = body_keys_diff

— DeepSeek/deepseek-v4-pro via Qwen Code /review
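
Note that `compare_request()` in this diff accumulates a `list[str]`, whereas the snippet above assigns into `diffs` by key and calls an undefined `_diff`. A minimal sketch of the same check in the list form follows; the helper name `body_keys_diffs` and the two label names are hypothetical, chosen to mirror the existing `tools_missing_in_right` and `tools_extra_in_right` labels.

```python
from typing import Any


def body_keys_diffs(prefix: str, left: dict[str, Any], right: dict[str, Any]) -> list[str]:
    """Report request-body keys present on only one side (hypothetical helper)."""
    left_keys = set(left.get("body_keys") or [])
    right_keys = set(right.get("body_keys") or [])
    diffs: list[str] = []
    missing = sorted(left_keys - right_keys)
    extra = sorted(right_keys - left_keys)
    if missing:
        diffs.append(f"{prefix}.body_keys_missing_in_right: {missing}")
    if extra:
        diffs.append(f"{prefix}.body_keys_extra_in_right: {extra}")
    return diffs
```

Inside `compare_request()` this could be wired in as `diffs.extend(body_keys_diffs(prefix, left, right))`.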

[Suggestion] Tool comparison omits type and description_hash fields from normalize_trace.py output.

Two tools with the same name but different type values (e.g., "function" vs a provider-specific type) or structurally different descriptions will be accepted as matching. The SKILL.md lists tool schema differences as high-signal, but type and description hash are uncompared.

Suggested change
# In the for key in (...) loop around line 47, add:
"type",
"description_hash",

— DeepSeek/deepseek-v4-pro via Qwen Code /review

    for name in sorted(set(left_tools) & set(right_tools)):
        for key in ("required", "properties"):
            if left_tools[name].get(key) != right_tools[name].get(key):
                diffs.append(
                    f"{prefix}.tool[{name}].{key}: "
                    f"{left_tools[name].get(key)!r} != {right_tools[name].get(key)!r}"
                )
    return diffs


def main() -> int:
    parser = argparse.ArgumentParser()
    parser.add_argument("left", type=Path, help="Reference normalized trace")
    parser.add_argument("right", type=Path, help="Target normalized trace, usually Qwen Code")
    args = parser.parse_args()

    left = load(args.left)
    right = load(args.right)
    diffs: list[str] = []

[Suggestion] zip() silently truncates extra requests when request counts differ.

When left and right have different request counts, zip() stops at the shorter list. While request_count mismatch is reported, the content of extra requests (method, URL, tools) is never shown. If Qwen Code makes 2 extra API calls beyond the reference, they are invisible in the comparison output.

Suggested change
diffs: list[str] = []
# After the zip loop, report unpaired requests:
min_len = min(len(left_reqs), len(right_reqs))
for idx in range(min_len, len(left_reqs)):
    diffs.append(f"request[{idx}].extra_in_left: {left_reqs[idx].get('url_path')!r}")
for idx in range(min_len, len(right_reqs)):
    diffs.append(f"request[{idx}].extra_in_right: {right_reqs[idx].get('url_path')!r}")

— DeepSeek/deepseek-v4-pro via Qwen Code /review
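
An alternative to a second pass over the tails is `itertools.zip_longest`, which pads the shorter list so unpaired requests surface in the same loop. The sketch below is one possible shape, not the script's actual code; the helper name `unpaired_request_diffs` is hypothetical, and the labels reuse the `extra_in_left` and `extra_in_right` names proposed above.

```python
from itertools import zip_longest
from typing import Any


def unpaired_request_diffs(
    left_reqs: list[dict[str, Any]], right_reqs: list[dict[str, Any]]
) -> list[str]:
    """Report requests that have no counterpart on the other side."""
    diffs: list[str] = []
    # zip_longest pads the shorter side with None instead of stopping early.
    pairs = zip_longest(left_reqs, right_reqs, fillvalue=None)
    for idx, (left_req, right_req) in enumerate(pairs):
        if left_req is None:
            diffs.append(f"request[{idx}].extra_in_right: {right_req.get('url_path')!r}")
        elif right_req is None:
            diffs.append(f"request[{idx}].extra_in_left: {left_req.get('url_path')!r}")
    return diffs
```

In `main()`, the same loop could call `compare_request()` when both sides are present, which keeps pairing and reporting in one place.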


    if left.get("request_count") != right.get("request_count"):
        diffs.append(
            f"request_count: {left.get('request_count')!r} != {right.get('request_count')!r}"
        )

    for idx, (left_req, right_req) in enumerate(
        zip(left.get("requests") or [], right.get("requests") or [])
    ):
        diffs.extend(compare_request(idx, left_req, right_req))

    if not diffs:
        print("No normalized trace differences found.")
        return 0

    print("Normalized trace differences:")
    for diff in diffs:
        print(f"- {diff}")

[Suggestion] Exit code 1 is ambiguous: both "diffs found" and script crash produce the same exit code.

If normalize_trace.py produced malformed JSON or a file is missing, the script crashes with exit code 1 — identical to the "diffs found" case. The paired runner (run_pair_capture.sh) can't distinguish a real mismatch from a pipeline error.

Suggested change
print(f"- {diff}")
def main() -> int:
    try:
        args = parser.parse_args()
        left = load(args.left)
        right = load(args.right)
        result = compare_traces(left, right)
        if result["differences"]:
            json.dump(result, sys.stdout, indent=2, sort_keys=True)
            sys.stdout.write("\n")
            return 1
        print("No normalized trace differences found.")
        return 0
    except Exception as exc:
        print(f"Error: {exc}", file=sys.stderr)
        return 2

— DeepSeek/deepseek-v4-pro via Qwen Code /review
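
The snippet above relies on `parser`, `compare_traces`, and `sys`, none of which are in scope in the script as written. A self-contained sketch of the same three-way exit contract might look like the following; the function name `run_compare`, the `EXIT_*` constants, and passing the comparison in as a callable are assumptions made so the example stands alone.

```python
import json
import sys
from pathlib import Path
from typing import Any, Callable

EXIT_MATCH = 0  # traces loaded and compared, no differences
EXIT_DIFFS = 1  # traces loaded and compared, differences found
EXIT_ERROR = 2  # an input was missing, unreadable, or not valid JSON


def run_compare(
    left_path: Path,
    right_path: Path,
    compare: Callable[[dict[str, Any], dict[str, Any]], list[str]],
) -> int:
    """Load both traces, compare them, and map the outcome to an exit code."""
    try:
        left = json.loads(left_path.read_text(encoding="utf-8"))
        right = json.loads(right_path.read_text(encoding="utf-8"))
    except (OSError, ValueError) as exc:
        # json.JSONDecodeError and UnicodeDecodeError both subclass ValueError.
        print(f"Error: {exc}", file=sys.stderr)
        return EXIT_ERROR
    diffs = compare(left, right)
    if not diffs:
        print("No normalized trace differences found.")
        return EXIT_MATCH
    print("Normalized trace differences:")
    for diff in diffs:
        print(f"- {diff}")
    return EXIT_DIFFS
```

Catching only load errors keeps genuine bugs in the comparison logic visible as tracebacks. `argparse` already exits with status 2 on usage errors, so 2 stays in the error class either way.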

    return 1


if __name__ == "__main__":
    raise SystemExit(main())