Commit 448785a
* feat(benchmark): task #30 B1 — chunk-window matrix + per-document metrics
Extend tests/benchmarks/graph_extraction harness for the task #30 graph
chunk window benchmark (spec § 6.3, dispatched in msg=cecae5ed):
* Fix `render_extraction_prompt` API drift — the harness was still on
the legacy `input_text=...` signature; PR #1918/#1920/#1921 (task #30
Phase A) moved it to `window_chunks=[{chunk_id, text}]`. Without this
fix, B2 (Planetegg msg=9489efdb) cannot run.
* Add `--chunk-window-size N` for single-shape runs and `--matrix
N1,N2,...` for batch sweeps (the two are mutually exclusive). Sample
text is split into `--pseudo-chunks-per-doc` (default 4) pseudo-chunks
and grouped into non-overlapping windows of size N, mirroring the
production `_GraphChunkWindow` shape (PR #1918).
* Aggregate, **per document**, the 7 metrics required by spec § 6.3 and
Planetegg msg=ea7efa7b: `llm_call_count`, `input_tokens_total`,
`output_tokens_total`, `wall_time_s`, `timeout_or_failure_count`,
entity and relation totals with duplicate counts, and the new
`source_chunk_ids_valid` / `source_chunk_ids_total` provenance check
(task #30 §3.1.3 hard requirement #2).
* `--dry-run` produces a placeholder schema for B2 to verify ingestion
before paying provider cost (per Planetegg msg=cbe84223).
* `test_runner_units.py` pins the harness structural pieces (chunking,
windowing, validity counting, per-document aggregation) so the
matrix output schema stays contract-stable even though the benchmark
itself runs out-of-CI.
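
The pseudo-chunk split and non-overlapping windowing described above can be sketched roughly as follows. This is a minimal illustration, not the harness code: `split_pseudo_chunks`, `group_windows`, the equal-length character split, and the `chunk-{i}` id format are all assumptions; only the `{chunk_id, text}` shape and the default of 4 pseudo-chunks come from this commit message.

```python
from typing import Dict, List


def split_pseudo_chunks(text: str, n_chunks: int = 4) -> List[Dict[str, str]]:
    """Split text into at most n_chunks pieces of roughly equal length.

    Hypothetical helper: the real harness may split on different boundaries.
    """
    if n_chunks < 1:
        raise ValueError("n_chunks must be >= 1")
    size = max(1, -(-len(text) // n_chunks))  # ceiling division
    pieces = [text[i:i + size] for i in range(0, len(text), size)]
    # Assumed id format; the {chunk_id, text} keys mirror the prompt signature.
    return [{"chunk_id": f"chunk-{i}", "text": p} for i, p in enumerate(pieces)]


def group_windows(chunks: List[Dict[str, str]], window_size: int) -> List[List[Dict[str, str]]]:
    """Group chunks into consecutive, non-overlapping windows of window_size.

    The last window may be shorter when the chunk count is not a multiple
    of window_size.
    """
    if window_size < 1:
        raise ValueError("window_size must be >= 1")
    return [chunks[i:i + window_size] for i in range(0, len(chunks), window_size)]


chunks = split_pseudo_chunks("abcdefghij", n_chunks=4)
windows = group_windows(chunks, window_size=3)
```

With 4 pseudo-chunks and a window size of 3, this yields one full window and one trailing window holding the remaining chunk, so each chunk appears in exactly one window.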
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* fix(benchmark): task #30 B1 — window-scoped source_chunk_ids + JSON-mode default ON
Two BLOCKER fixes per @ziang msg=56912dae + @huangzhangshu msg=cda4dc75:
* **BLOCKER 1** — `source_chunk_ids` validity is now strictly
window-scoped: `run_window` computes `source_chunk_ids_valid` /
`source_chunk_ids_total` against that single window's
`allowed_chunk_ids`, and `aggregate_sample` only sums per-window
counters. Previously the per-document union check let a record
produced in window-0 reference a chunk_id from window-1 and pass —
violating the A3 parser invariant (source_chunk_ids ⊆ current
window's chunk_ids, not document union). New unit test
`test_aggregate_sample_source_chunk_ids_is_window_scoped_not_union`
pins the cross-window pollution case.
* **BLOCKER 2** — `--response-format-json` is now ON by default with
`--no-response-format-json` as the explicit opt-out. A3 PR #1920
(`01b45196`) made `response_format=json_object` a graph extractor
production invariant, but the legacy benchmark default was off, so
the B2 baseline would have measured `json_ok_rate`, parse failures, and
cost on the pre-A3 path. README updated to match.
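
To illustrate the BLOCKER 1 distinction, here is a hedged sketch of window-scoped counting versus the earlier document-union check. The function names `count_window_validity` and `sum_window_counters` and the record layout are assumptions for illustration; only the counter names and the "sum per-window counters" rule come from this commit message.

```python
from typing import Dict, Iterable, List, Set


def count_window_validity(records: Iterable[Dict], allowed_chunk_ids: Set[str]) -> Dict[str, int]:
    """Count source_chunk_ids against a single window's allowed ids."""
    valid = 0
    total = 0
    for record in records:
        for chunk_id in record.get("source_chunk_ids", []):
            total += 1
            if chunk_id in allowed_chunk_ids:
                valid += 1
    return {"source_chunk_ids_valid": valid, "source_chunk_ids_total": total}


def sum_window_counters(per_window: List[Dict[str, int]]) -> Dict[str, int]:
    """Aggregate per document by summing per-window counters only."""
    keys = ("source_chunk_ids_valid", "source_chunk_ids_total")
    return {k: sum(w[k] for w in per_window) for k in keys}


# Window 0 may cite c0/c1 only; its record wrongly cites c2 from window 1.
w0 = count_window_validity([{"source_chunk_ids": ["c0", "c2"]}], {"c0", "c1"})
w1 = count_window_validity([{"source_chunk_ids": ["c2"]}], {"c2", "c3"})
doc = sum_window_counters([w0, w1])

# For contrast: a union check over the whole document would accept all three ids.
union = count_window_validity(
    [{"source_chunk_ids": ["c0", "c2"]}, {"source_chunk_ids": ["c2"]}],
    {"c0", "c1", "c2", "c3"},
)
```

In this example the window-scoped total is 2 valid out of 3, while the union check reports 3 out of 3 and hides the cross-window reference.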
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
1 parent 0058507 commit 448785a
3 files changed
Lines changed: 788 additions & 114 deletions