Commit 038f059
eval: filter aggregate heuristic for non-PK values + improve LLM judge
Two targeted improvements after analyzing Conv benchmark failures:
1. _extract_ids heuristic noise reduction
   - Aggregate group values such as dates ("2023-12-20 06:00:00.006000")
     were being wrapped with PK-style prefixes ("pr_sold_base:2023-12-20..."),
     creating thousands of false IDs in found_ids.
   - Prefix generation is now skipped when the group value does not look
     like a PK: longer than 30 chars, contains spaces, or starts with a
     date pattern.
   - The raw value and node_title are still added; only the prefix-guessing
     heuristic is gated.
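
The gating described in item 1 could look roughly like the sketch below. It is not the code from this commit: `looks_like_pk`, `candidate_ids`, `table_prefixes`, the `YYYY-MM-DD` date regex, and the choice to reject a value when any single condition holds are all assumptions made for illustration.

```python
import re

# Assumption: "date pattern" means a leading YYYY-MM-DD; the commit does not say.
_DATE_PREFIX = re.compile(r"^\d{4}-\d{2}-\d{2}")
MAX_PK_LEN = 30  # threshold quoted in the commit message


def looks_like_pk(value: str) -> bool:
    """Return False for values that are unlikely to be primary keys."""
    if len(value) > MAX_PK_LEN:
        return False
    if any(ch.isspace() for ch in value):
        return False
    if _DATE_PREFIX.match(value):
        return False
    return True


def candidate_ids(group_value, node_title, table_prefixes):
    """Collect IDs for one aggregate group (hypothetical helper)."""
    value = str(group_value)
    # The raw value and the node title are always kept.
    ids = {value, node_title}
    # Only the prefix-guessing step is gated by the PK check.
    if looks_like_pk(value):
        ids.update(f"{prefix}:{value}" for prefix in table_prefixes)
    return ids
```

With this sketch, a timestamp group value yields only the raw value and the title, while a short token such as "A1001" still gets its prefixed variants.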
2. LLM judge prompt refined for document queries
   - Added explicit rules for each query type
     (counting / listing / document / recommendation / multi-hop).
   - Emphasized that GT samples are examples, not an exhaustive list; the
     judge should not require exact ID matches.
   - Particularly helps KRRA document queries, where the auto-generated GT
     is narrow but the agent finds the correct topic area.
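
As a rough illustration of item 2, a judge prompt with per-type rules might be assembled as below. The rule wording, `QUERY_TYPE_RULES`, `build_judge_prompt`, and the PASS/FAIL reply format are hypothetical; the commit only names the query types and the "examples, not exhaustive" principle.

```python
from typing import List

# Hypothetical rule text: the commit lists the query types, not the wording.
QUERY_TYPE_RULES = {
    "counting": "Accept if the reported count is consistent with the ground truth.",
    "listing": "Accept if the listed items overlap the samples; extra correct items are fine.",
    "document": "Accept if the answer covers the same topic area as the samples.",
    "recommendation": "Accept reasonable recommendations that satisfy the question's constraints.",
    "multi-hop": "Accept if the final result is supported, even when intermediate IDs differ.",
}


def build_judge_prompt(question: str, answer: str, gt_samples: List[str]) -> str:
    """Assemble a judge prompt that treats GT samples as non-exhaustive examples."""
    rules = "\n".join(f"- {kind}: {rule}" for kind, rule in QUERY_TYPE_RULES.items())
    samples = "\n".join(f"- {sample}" for sample in gt_samples)
    return (
        "You are judging an agent's answer.\n"
        "The ground-truth samples are examples, not an exhaustive list; "
        "do not require exact ID matches.\n\n"
        f"Rules by query type:\n{rules}\n\n"
        f"Question: {question}\n"
        f"Ground-truth samples:\n{samples}\n"
        f"Agent answer: {answer}\n"
        "Reply with PASS or FAIL."
    )
```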
Benchmark results (GPT-4o-mini, LLM judge enabled):

                     before → after
  KRRA Hard agent     8/15 → 11/15  (+3)
  assort Hard agent  10/15 → 13/15  (+3)
  X2BEE Hard agent   17/19 → 17/19  (maintained, 89%)
  KRRA Conv agent    15/30 → 17/30  (+2)
  assort Conv agent  19/24 → 19/24  (maintained, 79%)
  X2BEE Conv agent   20/27 → 21/27  (+1)
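
The per-suite counts above can be tallied with a few lines of Python. The numbers are copied from the table; the `results` dict and variable names are just for this sketch.

```python
# (before, after, total) per benchmark suite, copied from the table above.
results = {
    "KRRA Hard": (8, 11, 15),
    "assort Hard": (10, 13, 15),
    "X2BEE Hard": (17, 17, 19),
    "KRRA Conv": (15, 17, 30),
    "assort Conv": (19, 19, 24),
    "X2BEE Conv": (20, 21, 27),
}

before = sum(b for b, _, _ in results.values())
after = sum(a for _, a, _ in results.values())
total = sum(n for _, _, n in results.values())

# Prints: 89/130 -> 98/130 (+9)
print(f"{before}/{total} -> {after}/{total} ({after - before:+d})")
```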
Total improvement: +9 queries across 130 agent benchmark queries.
1 parent 03f794f commit 038f059
2 files changed
Lines changed: 36 additions & 15 deletions