Add tool ordering replay artifact#153
Conversation
There was a problem hiding this comment.
Code Review
This pull request introduces a tool-ordering replay artifact along with a generation script and a comprehensive test suite. The script analyzes execution sequences from manifest fixtures to detect drift and ordering violations between original and reconstructed payloads. Feedback focuses on a critical logic error in sequence extraction that destroys temporal order, a fragility in edge generation from ordered lists, and the need for more robust testing using synthetic drift scenarios to verify the validation logic.
| def _extract_sequence(payloads: list[dict[str, Any]]) -> tuple[str, ...]: | ||
| sequence_values: set[str] = set() | ||
| for payload in payloads: | ||
| edges: list[tuple[str, str]] = [] | ||
| required: list[tuple[str, str]] = [] | ||
| _walk(payload, edges, required) | ||
| for left, right in edges: | ||
| sequence_values.add(left) | ||
| sequence_values.add(right) | ||
| return tuple(sorted(sequence_values)) |
There was a problem hiding this comment.
The _extract_sequence function contains a critical logic error. It extracts nodes from edges and then sorts them alphabetically (sorted(sequence_values)). This completely destroys the temporal order of tool calls found in the replay. As a result, find_order_violations (line 164) will only check if constraints match alphabetical order, failing to detect actual execution order drift.
Additionally, this function is inefficient as it re-traverses the entire payload set. I suggest refactoring the extraction to collect the observed sequence in a single pass while preserving the order of appearance.
| def _extract_sequence(payloads: list[dict[str, Any]]) -> tuple[str, ...]: | |
| sequence_values: set[str] = set() | |
| for payload in payloads: | |
| edges: list[tuple[str, str]] = [] | |
| required: list[tuple[str, str]] = [] | |
| _walk(payload, edges, required) | |
| for left, right in edges: | |
| sequence_values.add(left) | |
| sequence_values.add(right) | |
| return tuple(sorted(sequence_values)) | |
| def _extract_sequence(payloads: list[dict[str, Any]]) -> tuple[str, ...]: | |
| sequence: list[str] = [] | |
| seen: set[str] = set() | |
| def walk(obj): | |
| if isinstance(obj, dict): | |
| for key, value in obj.items(): | |
| if key in ORDERING_KEYS and isinstance(value, list): | |
| for item in value: | |
| node = item if isinstance(item, str) else _id_from_obj(item) | |
| if isinstance(node, str) and node and node not in seen: | |
| sequence.append(node) | |
| seen.add(node) | |
| walk(value) | |
| elif isinstance(obj, list): | |
| for item in obj: | |
| walk(item) | |
| for p in payloads: | |
| walk(p) | |
| return tuple(sequence) |
| def _edges_from_ordered_list(value: list[object]) -> list[tuple[str, str]]: | ||
| ordered: list[str] = [] | ||
| for item in value: | ||
| if isinstance(item, str) and item: | ||
| ordered.append(item) | ||
| continue | ||
| obj_id = _id_from_obj(item) | ||
| if obj_id is not None: | ||
| ordered.append(obj_id) | ||
| continue | ||
| return [] | ||
| return [(ordered[i], ordered[i + 1]) for i in range(len(ordered) - 1)] |
There was a problem hiding this comment.
The current implementation of _edges_from_ordered_list is fragile. If a single item in the list is not a string and lacks an identifier key, the function returns an empty list, causing the entire sequence of tool calls to be discarded. It should instead skip invalid items to preserve the remaining ordering information.
def _edges_from_ordered_list(value: list[object]) -> list[tuple[str, str]]:
ordered: list[str] = []
for item in value:
node = item if isinstance(item, str) else _id_from_obj(item)
if isinstance(node, str) and node:
ordered.append(node)
return [(ordered[i], ordered[i + 1]) for i in range(len(ordered) - 1)]| def test_tool_ordering_evidence_behavior() -> None: | ||
| artifact = _load_json(ARTIFACT_PATH) | ||
| fixtures = [fixture for family in artifact["families"] for fixture in family["fixtures"]] | ||
|
|
||
| with_data = [ | ||
| fixture | ||
| for fixture in fixtures | ||
| if fixture["tool_ordering"]["original_edge_count"] > 0 | ||
| or fixture["tool_ordering"]["replay_edge_count"] > 0 | ||
| ] | ||
|
|
||
| if with_data: | ||
| assert artifact["global_summary"]["fixtures_with_tool_ordering_data"] > 0 | ||
| else: | ||
| assert artifact["global_summary"]["fixtures_with_tool_ordering_data"] == 0 | ||
|
|
||
| drift_count = sum(1 for fixture in fixtures if fixture["tool_ordering"]["drift_detected"]) | ||
| assert artifact["global_summary"]["fixtures_with_tool_ordering_drift"] == drift_count | ||
|
|
||
| if drift_count > 0: | ||
| assert any(fixture["tool_ordering"]["drift_detected"] for fixture in fixtures) | ||
|
|
There was a problem hiding this comment.
The current tests for tool ordering evidence behavior only validate the 'zero-data' state of the current artifact. There are no tests that verify the generator's ability to actually detect drift or violations when provided with mismatched original and replay payloads. I recommend adding a test case with synthetic payloads that contain known ordering violations to ensure the validation logic is correct and remains so after refactoring.
Motivation
Description
scripts/generate_tool_ordering_replay_artifact.pyextracts only explicit structured tool-order and required-before relations, uses graph helpersnormalize_edges,nodes_from_edges,compare_edges, andfind_order_violations, and emits a stable JSON artifact atartifacts/tool_ordering_replay_results.jsonfollowing the project schema and sorting rules.artifacts/tool_ordering_replay_results.jsonrepresenting deterministic evidence per family/fixture and a global summary.tests/test_tool_ordering_replay_artifact.pythat verify artifact existence, generator reproducibility, top-level schema, determinism/sanitization, manifest alignment, tool-order evidence behavior (including zero-data handling), and label discipline.scripts/generate_tool_ordering_replay_artifact.py,artifacts/tool_ordering_replay_results.json,tests/test_tool_ordering_replay_artifact.py.Testing
python scripts/generate_tool_ordering_replay_artifact.pyand verified it writesartifacts/tool_ordering_replay_results.jsonreproducibly.pytest tests/test_tool_ordering_replay_artifact.py -q(new test set passed), pluspytesttargets used during validation:tests/test_capability_boundary_replay_artifact.py,tests/test_graph_diff_artifact.py,tests/test_replay_graph_core.py,tests/test_fixture_manifest.py,tests/test_failure_taxonomy.py, and the full suite; all tests passed (new tests and full suite green, total test run shown: 270 passed).npm run checkcompleted successfully in CI-local validation.Codex Task