Commit 448785a

earayu and claude authored
feat(benchmark): task #30 B1 — chunk-window matrix + per-document metrics (#1923)
* feat(benchmark): task #30 B1 — chunk-window matrix + per-document metrics

Extend tests/benchmarks/graph_extraction harness for the task #30 graph chunk window benchmark (spec § 6.3, dispatched in msg=cecae5ed):

* Fix `render_extraction_prompt` API drift — the harness was still on the legacy `input_text=...` signature; PR #1918/#1920/#1921 (task #30 Phase A) moved it to `window_chunks=[{chunk_id, text}]`. Without this fix B2 (Planetegg msg=9489efdb) cannot run.
* Add `--chunk-window-size N` for single-shape runs and `--matrix N1,N2,...` for batch sweeps (the two are mutually exclusive). Sample text is split into `--pseudo-chunks-per-doc` (default 4) pseudo-chunks and grouped into non-overlapping windows of size N, mirroring the production `_GraphChunkWindow` shape (PR #1918).
* Aggregate **per-document** the 7 metrics required by spec § 6.3 + Planetegg msg=ea7efa7b: `llm_call_count`, `input_tokens_total`, `output_tokens_total`, `wall_time_s`, `timeout_or_failure_count`, entity+relation totals + duplicate counts, and the new `source_chunk_ids_valid` / `source_chunk_ids_total` provenance check (task #30 §3.1.3 hard requirement #2).
* `--dry-run` produces a placeholder schema for B2 to verify ingestion before paying provider cost (per Planetegg msg=cbe84223).
* `test_runner_units.py` pins the harness structural pieces (chunking, windowing, validity counting, per-document aggregation) so the matrix output schema stays contract-stable even though the benchmark itself runs out-of-CI.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(benchmark): task #30 B1 — window-scoped source_chunk_ids + JSON-mode default ON

Two BLOCKER fixes per @ziang msg=56912dae + @huangzhangshu msg=cda4dc75:

* **BLOCKER 1** — `source_chunk_ids` validity is now strictly window-scoped: `run_window` computes `source_chunk_ids_valid` / `source_chunk_ids_total` against that single window's `allowed_chunk_ids`, and `aggregate_sample` only sums per-window counters. Previously the per-document union check let a record produced in window-0 reference a chunk_id from window-1 and pass — violating the A3 parser invariant (source_chunk_ids ⊆ current window's chunk_ids, not document union). New unit test `test_aggregate_sample_source_chunk_ids_is_window_scoped_not_union` pins the cross-window pollution case.
* **BLOCKER 2** — `--response-format-json` is now ON by default with `--no-response-format-json` as the explicit opt-out. A3 PR #1920 (`01b45196`) made `response_format=json_object` a graph extractor production invariant, but the legacy benchmark default was off, so B2 baseline would have measured `json_ok_rate` / parse failure / cost on the pre-A3 path. README updated to match.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
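The difference between the window-scoped check and the earlier per-document union check (BLOCKER 1) can be illustrated with a small sketch. The helper names `count_valid_source_chunk_ids` and `sum_window_counters`, the record shape, and the chunk ids are hypothetical and are not taken from `runner.py`. Only the rule itself comes from the commit: a record counts as valid when its `source_chunk_ids` is a non-empty subset of the current window's chunk ids.

```python
def count_valid_source_chunk_ids(records, allowed_chunk_ids):
    """Return (valid, total) for the records of ONE window.

    A record is valid when its source_chunk_ids is a non-empty subset
    of allowed_chunk_ids (hypothetical helper, not the harness code).
    """
    allowed = set(allowed_chunk_ids)
    valid = 0
    for record in records:
        ids = record.get("source_chunk_ids") or []
        if ids and set(ids) <= allowed:
            valid += 1
    return valid, len(records)


def sum_window_counters(window_counts):
    """Per-document aggregation: only sum the per-window counters."""
    return {
        "source_chunk_ids_valid": sum(v for v, _ in window_counts),
        "source_chunk_ids_total": sum(t for _, t in window_counts),
    }


# Two windows of one document. Record "B" was produced in window 0
# but cites chunk "c2", which belongs to window 1.
windows = [
    {
        "allowed": ["c0", "c1"],
        "records": [
            {"name": "A", "source_chunk_ids": ["c0"]},
            {"name": "B", "source_chunk_ids": ["c2"]},
        ],
    },
    {
        "allowed": ["c2", "c3"],
        "records": [{"name": "C", "source_chunk_ids": ["c3"]}],
    },
]

# Window-scoped check: each window is judged against its own chunk ids.
per_window = [
    count_valid_source_chunk_ids(w["records"], w["allowed"]) for w in windows
]
summary = sum_window_counters(per_window)

# Union check, shown only for contrast: record "B" passes by mistake.
union_ids = {cid for w in windows for cid in w["allowed"]}
all_records = [r for w in windows for r in w["records"]]
union_valid, union_total = count_valid_source_chunk_ids(all_records, union_ids)

print(summary)      # {'source_chunk_ids_valid': 2, 'source_chunk_ids_total': 3}
print(union_valid)  # 3
```

In this toy input the window-scoped count reports 2 of 3 valid records, while the union check would report 3 of 3 and hide the cross-window reference.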
1 parent 0058507 commit 448785a

3 files changed

Lines changed: 788 additions & 114 deletions

tests/benchmarks/graph_extraction/README.md

Lines changed: 66 additions & 14 deletions
@@ -12,14 +12,15 @@ make benchmark-graph-extraction
 ```
 
 The default run uses the current ApeRAG graph extraction prompt via
-`aperag.indexing.llm.render_extraction_prompt` and does not send
-`response_format`. That matches current graph indexing behavior and gives a
-prompt-only baseline.
+`aperag.indexing.llm.render_extraction_prompt` **and sends
+`response_format={"type":"json_object"}` on every call** — task #30 A3
+PR #1920 (`01b45196`) made JSON-mode a graph extractor production
+invariant, so the benchmark mirrors production by default.
 
-To simulate the proposed JSON-mode fix:
+To explicitly compare against the legacy non-JSON-mode behavior:
 
 ```bash
-make benchmark-graph-extraction RESPONSE_FORMAT_JSON=1
+uv run python tests/benchmarks/graph_extraction/runner.py --no-response-format-json
 ```
 
 Results are written to:
@@ -63,14 +64,65 @@ make benchmark-graph-extraction MODELS='qwen/qwen-plus,moonshotai/kimi-k2.6'
 ## Scoring
 
 Each sample has a short hand-written expected entity list and relation endpoint
-list. The runner computes:
+list. The runner aggregates **per-document** (per task #30 spec § 6.3 +
+Planetegg msg=ea7efa7b acceptance criteria) across all windows for that
+document/model/window-size combination:
+
+1. `llm_call_count` — total LLM calls per document
+2. `input_tokens_total` + `output_tokens_total`
+3. `wall_time_s` — sum of per-window latencies
+4. `timeout_or_failure_count` — any window that errored or returned non-JSON
+5. `entities_count` + `relations_count` — extraction totals (raw)
+6. `duplicate_entity_count` + `duplicate_relation_count` — normalized-name dups
+7. `source_chunk_ids_valid` / `source_chunk_ids_total` — fraction of records
+whose `source_chunk_ids` are a non-empty subset of the window's chunk_ids
+(per task #30 §3.1.3 hard requirement #2)
+
+Plus the legacy directional scores (`entity_hit_rate`, `relation_hit_rate`,
+`json_ok_rate`, `estimated_cost_usd`).
+
+The scores are directional. They are intended to compare model/prompt/window
+changes against the same sample set, not to be a complete graph-quality judge.
+
+## Chunk window matrix (task #30 B1)
+
+The harness simulates the production `_GraphChunkWindow` (PR #1918 commit
+`3255fa56`) by splitting each sample text into `--pseudo-chunks-per-doc`
+(default `4`) pseudo-chunks and grouping them into non-overlapping windows
+of size `--chunk-window-size`. The rendered prompt uses the new
+`render_extraction_prompt(window_chunks=...)` API (PR #1920 commit
+`01b45196`) so each window carries `[[chunk_id=<id> index=<n>]]` boundary
+markers.
+
+Single-shape run:
 
-- JSON parse success
-- entity hit rate
-- relation endpoint hit rate
-- latency
-- output tokens per second
-- estimated cost from OpenRouter usage/pricing
+```bash
+uv run python tests/benchmarks/graph_extraction/runner.py \
+--models qwen/qwen3-30b-a3b-instruct-2507 \
+--chunk-window-size 3
+```
+
+Matrix sweep (one run, many results blocks):
+
+```bash
+uv run python tests/benchmarks/graph_extraction/runner.py \
+--models qwen/qwen3-30b-a3b-instruct-2507,google/gemini-2.5-flash \
+--matrix 1,2,3,5
+```
+
+Dry-run schema check (no provider call, used by B2 for ingestion verify
+per Planetegg msg=cbe84223):
+
+```bash
+uv run python tests/benchmarks/graph_extraction/runner.py \
+--dry-run --matrix 1,2,3,5 --output /tmp/b1_dry_run.json
+```
+
+For real multi-chunk documents, raise `--pseudo-chunks-per-doc` (B2 may
+add larger samples than the current 3 when comparing window>1 effects on
+real-world-sized payloads).
 
-The scores are directional. They are intended to compare model/prompt changes
-against the same sample set, not to be a complete graph-quality judge.
+The harness structural pieces (chunking, windowing, validity counting,
+per-document aggregation) are unit-tested in
+`test_runner_units.py` so the matrix output schema stays pinned even
+though the benchmark itself runs out-of-CI.
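
The chunk-window shape described in the README diff can be sketched as follows. This is a minimal illustration under stated assumptions, not the harness code: the word-based splitting strategy, the helper names `split_into_pseudo_chunks` and `group_into_windows`, and the `chunk-<n>` id format are hypothetical. Only the `{chunk_id, text}` item shape, the default of 4 pseudo-chunks, and the non-overlapping grouping come from the commit.

```python
def split_into_pseudo_chunks(text, chunks_per_doc=4):
    """Split text into at most chunks_per_doc contiguous word-based chunks.

    Word-based splitting is an assumption; the real harness may split
    differently (for example by characters or sentences).
    """
    words = text.split()
    if not words:
        return []
    n = max(1, min(chunks_per_doc, len(words)))
    size, extra = divmod(len(words), n)
    chunks = []
    start = 0
    for index in range(n):
        # The first `extra` chunks take one additional word each.
        end = start + size + (1 if index < extra else 0)
        chunks.append(
            {"chunk_id": f"chunk-{index}", "text": " ".join(words[start:end])}
        )
        start = end
    return chunks


def group_into_windows(chunks, window_size):
    """Group chunks into consecutive, non-overlapping windows."""
    if window_size < 1:
        raise ValueError("window_size must be >= 1")
    return [
        chunks[i : i + window_size] for i in range(0, len(chunks), window_size)
    ]


text = "one two three four five six seven eight nine ten"
chunks = split_into_pseudo_chunks(text, chunks_per_doc=4)
windows = group_into_windows(chunks, window_size=3)

window_ids = [[c["chunk_id"] for c in w] for w in windows]
print(window_ids)  # [['chunk-0', 'chunk-1', 'chunk-2'], ['chunk-3']]

# Window counts for a 1,2,3,5 sweep over the same 4 pseudo-chunks.
windows_per_size = {n: len(group_into_windows(chunks, n)) for n in (1, 2, 3, 5)}
print(windows_per_size)  # {1: 4, 2: 2, 3: 2, 5: 1}
```

Under this grouping, a window size larger than the pseudo-chunk count collapses the document into a single window, so sweeping sizes above `--pseudo-chunks-per-doc` would not produce new shapes unless the chunk count is raised as well.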
