src/content/posts/txrca-bench/index.md
3 additions & 1 deletion
@@ -328,4 +328,6 @@ Several extensions are natural follow-ups:
We are releasing the TxRCA-Bench benchmark data publicly — the 70 annotated exploit transactions with ground-truth root cause labels, the per-case workspaces (raw traces, event logs, contracts, ABIs, Solidity sources), all 490 raw agent outputs with both judges' scores, and the JSON output schema. Anyone should be able to re-score outputs, test their own agent against the same evidence, or extend the benchmark with new cases.
-The agent runtime and scoring harness code is not being released at this time. However, because the underlying on-chain data is immutable and the benchmark is defined purely in terms of `(transaction_hash, chain_id)` plus a ground-truth label, the benchmark is trivially reproducible against any new agent: given the inputs, any agent can be run in any runtime, and its output scored against the same rubric.
+The agent runtime and scoring harness code is not being released at this time. However, because the underlying on-chain data is immutable and the benchmark is defined purely in terms of `(transaction_hash, chain_id)` plus a ground-truth label, the benchmark is trivially reproducible against any new agent: given the inputs, any agent can be run in any runtime, and its output scored against the same rubric.
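
To illustrate the changed paragraph's claim that a case is fully identified by `(transaction_hash, chain_id)` plus a ground-truth label, here is a minimal Python sketch of running an arbitrary agent over such cases. It is not the unreleased harness: `BenchmarkCase`, `run_benchmark`, and `exact_match` are hypothetical names, the field names are assumptions about the released data, and string equality only stands in for the judge-based rubric described in the post.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, Tuple


@dataclass(frozen=True)
class BenchmarkCase:
    # Hypothetical record shape; the released files may use different field names.
    transaction_hash: str
    chain_id: int
    ground_truth_label: str


# An agent maps (transaction_hash, chain_id) to a root-cause string.
Agent = Callable[[str, int], str]
# A scorer maps (agent_output, ground_truth_label) to a numeric score.
Scorer = Callable[[str, str], float]


def run_benchmark(
    cases: Iterable[BenchmarkCase], agent: Agent, score: Scorer
) -> Dict[Tuple[str, int], float]:
    """Run `agent` on every case and score its output against the label."""
    results: Dict[Tuple[str, int], float] = {}
    for case in cases:
        output = agent(case.transaction_hash, case.chain_id)
        key = (case.transaction_hash, case.chain_id)
        results[key] = score(output, case.ground_truth_label)
    return results


def exact_match(output: str, label: str) -> float:
    # Placeholder scorer (an assumption): case-insensitive string equality,
    # standing in for the rubric that the post's judges apply.
    return 1.0 if output.strip().lower() == label.strip().lower() else 0.0
```

Because the agent and the scorer are plain callables, any runtime that can produce an output for a given transaction can be plugged in without changing the loop.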