feat(skills): add agent reproduction workflows #4118
Conversation
Force-pushed from 3ebded8 to 184a416
Code Coverage Summary
CLI Package - Full Text Report | Core Package - Full Text Report. For detailed HTML reports, please see the 'coverage-reports-22.x-ubuntu-latest' artifact from the main CI run.
wenshao left a comment

[Critical] No tests exist for any of the 3 Python scripts or 2 Bash scripts introduced by this PR. The agent-reproduce-feature/SKILL.md Done Criteria require "at least one verification path" and "focused tests or a reproducible smoke command." The core pipeline (HTTP capture → trace normalization → trace comparison) is entirely untested — any regression in normalize_trace.py or compare_traces.py will silently produce misleading parity results.
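A minimal smoke test could be codified along the following lines. This is a sketch only: it assumes compare_traces.py accepts positional left/right paths and that the normalized schema uses the request_count / requests keys described in the comments below; adjust names and flags to the actual CLI.

```python
# test_compare_traces_smoke.py -- hypothetical smoke test; the script path,
# argument order, and normalized-trace schema are assumptions.
import json
import subprocess
import sys
from pathlib import Path

SCRIPT = "compare_traces.py"  # adjust to the script's real location


def run_compare(left: Path, right: Path) -> int:
    # Invoke the comparator the way run_pair_capture.sh would.
    proc = subprocess.run([sys.executable, SCRIPT, str(left), str(right)])
    return proc.returncode


def test_identical_traces_exit_zero(tmp_path: Path) -> None:
    trace = {"request_count": 0, "requests": []}
    left, right = tmp_path / "left.json", tmp_path / "right.json"
    left.write_text(json.dumps(trace))
    right.write_text(json.dumps(trace))
    assert run_compare(left, right) == 0


def test_request_count_mismatch_exits_nonzero(tmp_path: Path) -> None:
    left, right = tmp_path / "left.json", tmp_path / "right.json"
    left.write_text(json.dumps({"request_count": 0, "requests": []}))
    right.write_text(json.dumps(
        {"request_count": 1,
         "requests": [{"url_path": "/v1/chat/completions"}]}
    ))
    assert run_compare(left, right) != 0
```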
— DeepSeek/deepseek-v4-pro via Qwen Code /review
```bash
trap cleanup EXIT
...
sleep 1
```
[Critical] sleep 1 without proxy health check can cause silent empty captures.
After launching mitmdump in the background, the script unconditionally sleeps 1 second then runs the child command. If mitmdump failed to bind its port (port conflict, crash, missing dependency), the child command runs without any proxy — producing an empty http.jsonl with zero indication of failure.
```bash
# Replace `sleep 1` with a port-polling loop:
for i in $(seq 1 10); do
  if curl -s --max-time 1 "http://127.0.0.1:${port}" >/dev/null 2>&1; then
    break
  fi
  if ! kill -0 "${mitm_pid}" 2>/dev/null; then
    echo "FATAL: mitmdump died. Check ${mitm_log}" >&2
    cat "${mitm_log}" >&2
    exit 1
  fi
  sleep 0.5
done
```
— DeepSeek/deepseek-v4-pro via Qwen Code /review
```python
"description_hash": content_hash(fn.get("description", ""))
if isinstance(fn, dict) and isinstance(fn.get("description"), str)
else None,
"required": sorted(params.get("required", []))
```
[Critical] Anthropic input_schema not handled — Claude Code comparison path is broken.
summarize_tool() only extracts schema via fn.get("parameters"), which is the OpenAI convention. Anthropic/Claude tools use input_schema instead. When comparing against claude-code (a documented reference agent choice), all tools get "required": [], "properties": [] — the entire schema comparison is silently disabled.
```python
fn = tool.get("function") if isinstance(tool.get("function"), dict) else tool
# Anthropic uses input_schema; OpenAI uses parameters.
params = (
    (fn.get("parameters") or fn.get("input_schema") or {})
    if isinstance(fn, dict)
    else {}
)
```
— DeepSeek/deepseek-v4-pro via Qwen Code /review
```bash
echo "reference_status=${reference_status}"
echo "qwen_status=${qwen_status}"
echo "compare_status=${compare_status}"
echo "diff=${out_dir}/trace.diff"
```
[Critical] False success when both captures produce empty traces.
If both run_with_mitm.sh calls produce empty http.jsonl (e.g., proxy failed to start), normalize_trace.py outputs request_count: 0 with empty request lists. compare_traces.py then reports "No normalized trace differences found" with exit 0. The exit guard at line 40 only checks non-zero status codes — it never fires because all three are 0. The script happily reports success when nothing was actually captured.
```bash
# Add before normalization:
for side in reference qwen; do
  if [[ ! -s "${out_dir}/${side}/http.jsonl" ]]; then
    echo "FATAL: ${out_dir}/${side}/http.jsonl is empty — capture likely failed" >&2
    exit 1
  fi
done
```
— DeepSeek/deepseek-v4-pro via Qwen Code /review
```python
    messages = value["messages"]
elif isinstance(value.get("input"), list):
    messages = value["input"]
if messages is None:
```
[Critical] System prompt fields not captured — prompt-level differences invisible to comparison.
summarize_messages() only extracts messages and input arrays from the request body. System prompts are passed via different fields by each provider: Anthropic uses system (top-level), OpenAI /v1/responses uses instructions, Google uses system_instruction. None of these are captured in the normalized output, meaning two agents can use completely different system prompts and the comparator will report "No differences."
```python
# After extracting messages, also capture system prompts:
system_fields = {}
if isinstance(body, dict):
    for field in ("system", "instructions", "system_instruction"):
        if field in body:
            val = body[field]
            system_fields[field] = (
                val if isinstance(val, str)
                else content_hash(json.dumps(val, sort_keys=True))
            )
# Include system_fields in the request dict.
```
— DeepSeek/deepseek-v4-pro via Qwen Code /review
```python
    return
record = {
    "ts": time.time(),
    "request": {
```
[Suggestion] API keys in request body are not redacted before writing to disk.
The script redacts sensitive headers (authorization, x-api-key, etc.) but writes request bodies verbatim to http.jsonl. Many LLM providers and proxies accept API keys in JSON request bodies (e.g., {"api_key": "sk-..."}). Consider adding a recursive body redaction step that scrubs known sensitive key names (api_key, apiKey, access_token, secret, auth) before writing.
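A minimal sketch of such a scrub, applied to the parsed JSON body before the record is written; the key set and the _redact_body name are illustrative, not part of this PR:

```python
# Hypothetical recursive body scrub; key names are illustrative.
from typing import Any

SENSITIVE_BODY_KEYS = {"api_key", "apikey", "access_token", "secret", "auth"}


def _redact_body(value: Any) -> Any:
    # Replace values under sensitive keys; recurse into nested containers.
    if isinstance(value, dict):
        return {
            k: "<redacted>" if k.lower() in SENSITIVE_BODY_KEYS else _redact_body(v)
            for k, v in value.items()
        }
    if isinstance(value, list):
        return [_redact_body(item) for item in value]
    return value
```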
— DeepSeek/deepseek-v4-pro via Qwen Code /review
|
|
```bash
REPRO_CAPTURE_OUT="${http_out}" \
mitmdump \
  --listen-host 127.0.0.1 \
```
[Suggestion] --set ssl_insecure=true disables upstream TLS verification with no documentation.
This is necessary for MITM proxying but should be accompanied by a comment explaining the security trade-off and that it's intended for local dev only.
```bash
# Disable upstream TLS verification so mitmproxy can intercept HTTPS.
# Safe for local dev only; do NOT use on shared or untrusted networks.
--set ssl_insecure=true \
```
— DeepSeek/deepseek-v4-pro via Qwen Code /review
```bash
session="repro-$(date +%Y%m%d-%H%M%S)"
printf '%q ' "$@" > "${out_dir}/command.txt"
echo >> "${out_dir}/command.txt"
```
[Suggestion] Tmux session is never cleaned up — can leak authenticated agent processes.
The script creates a detached tmux session running an authenticated CLI (codex/claude) but has no EXIT trap to kill it. If the script exits abnormally (or even normally), the session persists with the user's API credentials in its environment. Anyone with access to the same tmux socket can attach and interact with the authenticated agent.
```bash
cleanup_tmux() {
  tmux kill-session -t "${session}" >/dev/null 2>&1 || true
}
trap cleanup_tmux EXIT
```
— DeepSeek/deepseek-v4-pro via Qwen Code /review
```python
left = load(args.left)
right = load(args.right)
diffs: list[str] = []
```
[Suggestion] zip() silently truncates extra requests when request counts differ.
When left and right have different request counts, zip() stops at the shorter list. While request_count mismatch is reported, the content of extra requests (method, URL, tools) is never shown. If Qwen Code makes 2 extra API calls beyond the reference, they are invisible in the comparison output.
```python
# After the zip loop, report unpaired requests:
min_len = min(len(left_reqs), len(right_reqs))
for idx in range(min_len, len(left_reqs)):
    diffs.append(f"request[{idx}].extra_in_left: {left_reqs[idx].get('url_path')!r}")
for idx in range(min_len, len(right_reqs)):
    diffs.append(f"request[{idx}].extra_in_right: {right_reqs[idx].get('url_path')!r}")
```
— DeepSeek/deepseek-v4-pro via Qwen Code /review
```bash
  echo "command=$*"
} > "${out_dir}/env.txt"

set +e
```
[Suggestion] Missing lowercase proxy env vars and NO_PROXY not cleared.
Only uppercase HTTP_PROXY / HTTPS_PROXY / ALL_PROXY are set. curl, wget, and Python requests (used by many LLM SDKs) respect lowercase http_proxy / https_proxy — requests through those libraries will bypass mitmproxy entirely, producing silent omissions in the capture.
Additionally, a common parent-shell NO_PROXY=localhost,127.0.0.1 will actively disable proxying to the local mitmproxy port.
```bash
HTTP_PROXY="http://127.0.0.1:${port}" \
http_proxy="http://127.0.0.1:${port}" \
HTTPS_PROXY="http://127.0.0.1:${port}" \
https_proxy="http://127.0.0.1:${port}" \
ALL_PROXY="http://127.0.0.1:${port}" \
all_proxy="http://127.0.0.1:${port}" \
NO_PROXY="" \
no_proxy="" \
```
— DeepSeek/deepseek-v4-pro via Qwen Code /review
```bash
sleep 1

{
```
[Suggestion] command=$* writes full CLI args including potential secrets to env.txt.
If the proxied command includes --api-key sk-xxx or similar, the key is written in plaintext to env.txt. The documentation describes env.txt as "non-secret capture metadata," which is misleading — it can contain secrets depending on the command.
```bash
echo "command=${1}" > "${out_dir}/env.txt"
# Or redact known sensitive flags:
# safe_args=()
# while [[ $# -gt 0 ]]; do
#   case "$1" in
#     --api-key|--token|--secret) safe_args+=("$1 <redacted>"); shift 2 ;;
#     *) safe_args+=("$1"); shift ;;
#   esac
# done
```
— DeepSeek/deepseek-v4-pro via Qwen Code /review
```python
OUT = os.environ.get("REPRO_CAPTURE_OUT", "http.jsonl")
MAX_BODY = int(os.environ.get("REPRO_CAPTURE_MAX_BODY", "500000"))
CAPTURE_ALL = os.environ.get("REPRO_CAPTURE_ALL", "0") == "1"
SENSITIVE_HEADERS = {
```
[Suggestion] SENSITIVE_HEADERS is missing common proxy / session auth headers.
Headers like proxy-authorization, x-auth-token, x-session-token, x-refresh-token are used by self-hosted gateways and proxies. Their values will be written in plaintext to http.jsonl.
```python
SENSITIVE_HEADERS = {
    "authorization",
    "cookie",
    "set-cookie",
    "x-api-key",
    "api-key",
    "proxy-authorization",
    "x-auth-token",
    "x-access-token",
    "x-session-token",
    "x-refresh-token",
    "openai-organization",
    "openai-project",
}
```
— DeepSeek/deepseek-v4-pro via Qwen Code /review
```python
if CAPTURE_ALL:
    return True
url = flow.request.pretty_url.lower()
ctype = flow.request.headers.get("content-type", "").lower()
```
[Suggestion] SSE content-type check is on the request side, but SSE is a response content-type.
_interesting() checks flow.request.headers for text/event-stream. Standard streaming endpoints (/chat/completions, /v1/messages) are captured via URL path hints, but any SSE endpoint on a non-standard path is silently skipped.
```python
def response(flow: http.HTTPFlow) -> None:
    if not _interesting(flow):
        # Also check the response content-type for SSE.
        if flow.response:
            resp_ct = flow.response.headers.get("content-type", "").lower()
            if "text/event-stream" not in resp_ct:
                return
        else:
            return
```
— DeepSeek/deepseek-v4-pro via Qwen Code /review
```python
    return redacted


def _decode(content: bytes | None) -> dict[str, Any]:
```
[Suggestion] content[:MAX_BODY] byte slicing can split multi-byte UTF-8 codepoints, causing valid text bodies to be misclassified as base64.
When the truncation boundary falls mid-codepoint (probability ~0.1% for ASCII-dominant text, higher for CJK/emoji), decode("utf-8") raises UnicodeDecodeError and the body is stored as base64. Downstream normalize_trace.py skips base64 bodies entirely, losing the request from comparison.
```python
def _decode(content: bytes | None) -> dict[str, Any]:
    truncated = len(content) > MAX_BODY
    try:
        # Decode the full body first so the MAX_BODY cut can never split
        # a multi-byte codepoint.
        text = content.decode("utf-8")
    except UnicodeDecodeError:
        return {
            "kind": "base64",
            "base64": base64.b64encode(content).decode("ascii"),
            "truncated": truncated,
        }
    text = text[:MAX_BODY]
```
— DeepSeek/deepseek-v4-pro via Qwen Code /review
```python
req = raw.get("request") or {}
resp = raw.get("response") or {}
parsed = urlparse(req.get("url", ""))
body = json_body(req)
```
[Suggestion] urlparse drops query strings — API version parameters invisible to comparison.
parsed.path captures only the URL path. A request to /v1/chat/completions?api-version=2024-02-01 (Azure OpenAI pattern) will appear identical to /v1/chat/completions?api-version=2024-10-21 in the normalized output, masking incompatible API version differences.
```python
url = urlparse(req.get("url", ""))
url_path = url.path
url_query = url.query
normalized["url_path"] = url_path
if url_query:
    normalized["url_query"] = url_query
```
— DeepSeek/deepseek-v4-pro via Qwen Code /review
```python
    diffs.append(f"{prefix}.tools_missing_in_right: {missing}")
if extra:
    diffs.append(f"{prefix}.tools_extra_in_right: {extra}")
```
[Suggestion] body_keys field is produced by normalize_trace.py but never compared.
normalize_trace.py outputs body_keys: sorted(body.keys()) to capture request parameter shape (model, stream, temperature, max_tokens, etc.), but compare_request() never reads this field. Differences in request parameters other than model and stream are silently ignored.
```python
# In compare_request(), also compare the request body key sets:
left_keys = left.get("body_keys") or []
right_keys = right.get("body_keys") or []
if left_keys != right_keys:
    diffs.append(f"{prefix}.body_keys: {left_keys} != {right_keys}")
```
— DeepSeek/deepseek-v4-pro via Qwen Code /review
```python
    diffs.append(f"{prefix}.tools_missing_in_right: {missing}")
if extra:
    diffs.append(f"{prefix}.tools_extra_in_right: {extra}")
```
[Suggestion] Tool comparison omits type and description_hash fields from normalize_trace.py output.
Two tools with the same name but different type values (e.g., "function" vs a provider-specific type) or structurally different descriptions will be accepted as matching. The SKILL.md lists tool schema differences as high-signal, but type and description hash are uncompared.
```python
# In the `for key in (...)` loop around line 47, add:
"type",
"description_hash",
```
— DeepSeek/deepseek-v4-pro via Qwen Code /review
```python
print("Normalized trace differences:")
for diff in diffs:
    print(f"- {diff}")
```
[Suggestion] Exit code 1 is ambiguous: both "diffs found" and script crash produce the same exit code.
If normalize_trace.py produced malformed JSON or a file is missing, the script crashes with exit code 1 — identical to the "diffs found" case. The paired runner (run_pair_capture.sh) can't distinguish a real mismatch from a pipeline error.
```python
def main() -> int:
    try:
        args = parser.parse_args()
        left = load(args.left)
        right = load(args.right)
        result = compare_traces(left, right)
        if result["differences"]:
            json.dump(result, sys.stdout, indent=2, sort_keys=True)
            sys.stdout.write("\n")
            return 1
        print("No normalized trace differences found.")
        return 0
    except Exception as exc:
        print(f"Error: {exc}", file=sys.stderr)
        return 2
```
— DeepSeek/deepseek-v4-pro via Qwen Code /review
```python
) -> dict[str, dict[str, Any]]:
    entries: dict[str, dict[str, Any]] = {}
    for dirpath, dirnames, filenames in os.walk(root, followlinks=False):
        walkable_dirnames = []
```
[Suggestion] Dirname loop in collect_entries() missing OSError handler — TOCTOU race can crash the entire snapshot.
The filename loop below (line ~207) has except OSError, but the dirname loop (lines 199-205) does not. If a subdirectory is deleted or changes permissions between os.walk's internal scandir and the subsequent path.is_symlink() call, the snapshot crashes with no partial manifest.
```python
walkable_dirnames = []
for dirname in sorted(dirnames):
    path = Path(dirpath) / dirname
    rel_path = path.relative_to(root).as_posix()
    try:
        if path.is_symlink():
            entries[rel_path] = {"kind": "symlink"}
        else:
            walkable_dirnames.append(dirname)
    except OSError as exc:
        entries[rel_path] = {"kind": "error", "error": str(exc)}
```
— DeepSeek/deepseek-v4-pro via Qwen Code /review