Commit c957b89: Update benchmark with gemini 3 and gpt 5.2 models
Parent: 608893a | 7 files changed: 112 additions & 57 deletions


README.md (27 additions, 24 deletions)
```diff
@@ -118,30 +118,33 @@ _Note:_ Benchmarks are currently done in the zero-shot setting.
 
 | Rank | Model | SequenceMatcher Similarity | TFIDF Similarity | Time (s) | Cost ($) |
 | --- | --- | --- | --- | --- | --- |
-| 1 | AUTO (with auto-selected model) | 0.899 (±0.131) | 0.960 (±0.066) | 21.17 | 0.00066 |
-| 2 | AUTO | 0.895 (±0.112) | 0.973 (±0.046) | 9.29 | 0.00063 |
-| 3 | gemini-2.5-flash | 0.886 (±0.164) | 0.986 (±0.027) | 52.55 | 0.01226 |
-| 4 | mistral-ocr-latest | 0.882 (±0.106) | 0.932 (±0.091) | 5.75 | 0.00121 |
-| 5 | gemini-2.5-pro | 0.876 (±0.195) | 0.976 (±0.049) | 22.65 | 0.02408 |
-| 6 | gemini-2.0-flash | 0.875 (±0.148) | 0.977 (±0.037) | 11.96 | 0.00079 |
-| 7 | claude-3-5-sonnet-20241022 | 0.858 (±0.184) | 0.930 (±0.098) | 17.32 | 0.01804 |
-| 8 | gemini-1.5-flash | 0.842 (±0.214) | 0.969 (±0.037) | 15.58 | 0.00043 |
-| 9 | gpt-5-mini | 0.819 (±0.201) | 0.917 (±0.104) | 52.84 | 0.00811 |
-| 10 | gpt-5 | 0.807 (±0.215) | 0.919 (±0.088) | 98.12 | 0.05505 |
-| 11 | claude-sonnet-4-20250514 | 0.801 (±0.188) | 0.905 (±0.136) | 22.02 | 0.02056 |
-| 12 | claude-opus-4-20250514 | 0.789 (±0.220) | 0.886 (±0.148) | 29.55 | 0.09513 |
-| 13 | accounts/fireworks/models/llama4-maverick-instruct-basic | 0.772 (±0.203) | 0.930 (±0.117) | 16.02 | 0.00147 |
-| 14 | gemini-1.5-pro | 0.767 (±0.309) | 0.865 (±0.230) | 24.77 | 0.01139 |
-| 15 | gpt-4.1-mini | 0.754 (±0.249) | 0.803 (±0.193) | 23.28 | 0.00347 |
-| 16 | accounts/fireworks/models/llama4-scout-instruct-basic | 0.754 (±0.243) | 0.942 (±0.063) | 13.36 | 0.00087 |
-| 17 | gpt-4o | 0.752 (±0.269) | 0.896 (±0.123) | 28.87 | 0.01469 |
-| 18 | gpt-4o-mini | 0.728 (±0.241) | 0.850 (±0.128) | 18.96 | 0.00609 |
-| 19 | claude-3-7-sonnet-20250219 | 0.646 (±0.397) | 0.758 (±0.297) | 57.96 | 0.01730 |
-| 20 | gpt-4.1 | 0.637 (±0.301) | 0.787 (±0.185) | 35.37 | 0.01498 |
-| 21 | google/gemma-3-27b-it | 0.604 (±0.342) | 0.788 (±0.297) | 23.16 | 0.00020 |
-| 22 | ds4sd/SmolDocling-256M-preview | 0.603 (±0.292) | 0.705 (±0.262) | 507.74 | 0.00000 |
-| 23 | microsoft/phi-4-multimodal-instruct | 0.589 (±0.273) | 0.820 (±0.197) | 14.00 | 0.00045 |
-| 24 | qwen/qwen-2.5-vl-7b-instruct | 0.498 (±0.378) | 0.630 (±0.445) | 14.73 | 0.00056 |
+| 1 | gemini-3-pro-preview | 0.917 (±0.127) | 0.943 (±0.159) | 46.92 | 0.06288 |
+| 2 | AUTO (with auto-selected model) | 0.899 (±0.131) | 0.960 (±0.066) | 21.17 | 0.00066 |
+| 3 | AUTO | 0.895 (±0.112) | 0.973 (±0.046) | 9.29 | 0.00063 |
+| 4 | gpt-5.2 | 0.890 (±0.193) | 0.975 (±0.036) | 33.32 | 0.03959 |
+| 5 | gemini-2.5-flash | 0.886 (±0.164) | 0.986 (±0.027) | 52.55 | 0.01226 |
+| 6 | mistral-ocr-latest | 0.882 (±0.106) | 0.932 (±0.091) | 5.75 | 0.00121 |
+| 7 | gemini-2.5-pro | 0.876 (±0.195) | 0.976 (±0.049) | 22.65 | 0.02408 |
+| 8 | gemini-2.0-flash | 0.875 (±0.148) | 0.977 (±0.037) | 11.96 | 0.00079 |
+| 9 | claude-3-5-sonnet-20241022 | 0.858 (±0.184) | 0.930 (±0.098) | 17.32 | 0.01804 |
+| 10 | gemini-1.5-flash | 0.842 (±0.214) | 0.969 (±0.037) | 15.58 | 0.00043 |
+| 11 | gpt-5-mini | 0.819 (±0.201) | 0.917 (±0.104) | 52.84 | 0.00811 |
+| 12 | gpt-5 | 0.807 (±0.215) | 0.919 (±0.088) | 98.12 | 0.05505 |
+| 13 | claude-sonnet-4-20250514 | 0.801 (±0.188) | 0.905 (±0.136) | 22.02 | 0.02056 |
+| 14 | claude-opus-4-20250514 | 0.789 (±0.220) | 0.886 (±0.148) | 29.55 | 0.09513 |
+| 15 | accounts/fireworks/models/llama4-maverick-instruct-basic | 0.772 (±0.203) | 0.930 (±0.117) | 16.02 | 0.00147 |
+| 16 | gemini-1.5-pro | 0.767 (±0.309) | 0.865 (±0.230) | 24.77 | 0.01139 |
+| 17 | gemini-3-flash-preview | 0.766 (±0.293) | 0.858 (±0.210) | 39.38 | 0.00969 |
+| 18 | gpt-4.1-mini | 0.754 (±0.249) | 0.803 (±0.193) | 23.28 | 0.00347 |
+| 19 | accounts/fireworks/models/llama4-scout-instruct-basic | 0.754 (±0.243) | 0.942 (±0.063) | 13.36 | 0.00087 |
+| 20 | gpt-4o | 0.752 (±0.269) | 0.896 (±0.123) | 28.87 | 0.01469 |
+| 21 | gpt-4o-mini | 0.728 (±0.241) | 0.850 (±0.128) | 18.96 | 0.00609 |
+| 22 | claude-3-7-sonnet-20250219 | 0.646 (±0.397) | 0.758 (±0.297) | 57.96 | 0.01730 |
+| 23 | gpt-4.1 | 0.637 (±0.301) | 0.787 (±0.185) | 35.37 | 0.01498 |
+| 24 | google/gemma-3-27b-it | 0.604 (±0.342) | 0.788 (±0.297) | 23.16 | 0.00020 |
+| 25 | ds4sd/SmolDocling-256M-preview | 0.603 (±0.292) | 0.705 (±0.262) | 507.74 | 0.00000 |
+| 26 | microsoft/phi-4-multimodal-instruct | 0.589 (±0.273) | 0.820 (±0.197) | 14.00 | 0.00045 |
+| 27 | qwen/qwen-2.5-vl-7b-instruct | 0.498 (±0.378) | 0.630 (±0.445) | 14.73 | 0.00056 |
 
 ## Citation
 If you use Lexoid in production or publications, please cite accordingly and acknowledge usage. We appreciate the support 🙏
```
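The "SequenceMatcher Similarity" column in the table above is a character-level similarity score. A minimal sketch of how such a score can be computed with Python's standard-library `difflib` (the exact preprocessing Lexoid applies before scoring may differ):

```python
from difflib import SequenceMatcher

def sequence_similarity(reference: str, parsed: str) -> float:
    """Return a ratio in [0, 1]; 1.0 means the parsed text matches the reference exactly."""
    return SequenceMatcher(None, reference, parsed).ratio()

# Identical strings score 1.0; small edits reduce the ratio proportionally.
print(sequence_similarity("Total: $42.00", "Total: $42.00"))
print(sequence_similarity("Total: $42.00", "Total: $42,00") < 1.0)
```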

docs/benchmark.rst (41 additions, 23 deletions)
```diff
@@ -89,144 +89,162 @@ Here are the detailed parsing performance results for various models, sorted by
      - Time (s)
      - Cost ($)
    * - 1
+     - gemini-3-pro-preview
+     - 0.917 (±0.127)
+     - 0.943 (±0.159)
+     - 46.92
+     - 0.06288
+   * - 2
      - AUTO (with auto-selected model)
      - 0.899 (±0.131)
      - 0.960 (±0.066)
      - 21.17
      - 0.00066
-   * - 2
+   * - 3
      - AUTO
      - 0.895 (±0.112)
      - 0.973 (±0.046)
      - 9.29
      - 0.00063
-   * - 3
+   * - 4
+     - gpt-5.2
+     - 0.890 (±0.193)
+     - 0.975 (±0.036)
+     - 33.32
+     - 0.03959
+   * - 5
      - gemini-2.5-flash
      - 0.886 (±0.164)
      - 0.986 (±0.027)
      - 52.55
      - 0.01226
-   * - 4
+   * - 6
      - mistral-ocr-latest
      - 0.882 (±0.106)
      - 0.932 (±0.091)
      - 5.75
      - 0.00121
-   * - 5
+   * - 7
      - gemini-2.5-pro
      - 0.876 (±0.195)
      - 0.976 (±0.049)
      - 22.65
      - 0.02408
-   * - 6
+   * - 8
      - gemini-2.0-flash
      - 0.875 (±0.148)
      - 0.977 (±0.037)
      - 11.96
      - 0.00079
-   * - 7
+   * - 9
      - claude-3-5-sonnet-20241022
      - 0.858 (±0.184)
      - 0.930 (±0.098)
      - 17.32
      - 0.01804
-   * - 8
+   * - 10
      - gemini-1.5-flash
      - 0.842 (±0.214)
      - 0.969 (±0.037)
      - 15.58
      - 0.00043
-   * - 9
+   * - 11
      - gpt-5-mini
      - 0.819 (±0.201)
      - 0.917 (±0.104)
      - 52.84
      - 0.00811
-   * - 10
+   * - 12
      - gpt-5
      - 0.807 (±0.215)
      - 0.919 (±0.088)
      - 98.12
      - 0.05505
-   * - 11
+   * - 13
      - claude-sonnet-4-20250514
      - 0.801 (±0.188)
      - 0.905 (±0.136)
      - 22.02
      - 0.02056
-   * - 12
+   * - 14
      - claude-opus-4-20250514
      - 0.789 (±0.220)
      - 0.886 (±0.148)
      - 29.55
      - 0.09513
-   * - 13
+   * - 15
      - accounts/fireworks/models/llama4-maverick-instruct-basic
      - 0.772 (±0.203)
      - 0.930 (±0.117)
      - 16.02
      - 0.00147
-   * - 14
+   * - 16
      - gemini-1.5-pro
      - 0.767 (±0.309)
      - 0.865 (±0.230)
      - 24.77
      - 0.01139
-   * - 15
+   * - 17
+     - gemini-3-flash-preview
+     - 0.766 (±0.293)
+     - 0.858 (±0.210)
+     - 39.38
+     - 0.00969
+   * - 18
      - gpt-4.1-mini
      - 0.754 (±0.249)
      - 0.803 (±0.193)
      - 23.28
      - 0.00347
-   * - 16
+   * - 19
      - accounts/fireworks/models/llama4-scout-instruct-basic
      - 0.754 (±0.243)
      - 0.942 (±0.063)
      - 13.36
      - 0.00087
-   * - 17
+   * - 20
      - gpt-4o
      - 0.752 (±0.269)
      - 0.896 (±0.123)
      - 28.87
      - 0.01469
-   * - 18
+   * - 21
      - gpt-4o-mini
      - 0.728 (±0.241)
      - 0.850 (±0.128)
      - 18.96
      - 0.00609
-   * - 19
+   * - 22
      - claude-3-7-sonnet-20250219
      - 0.646 (±0.397)
      - 0.758 (±0.297)
      - 57.96
      - 0.01730
-   * - 20
+   * - 23
      - gpt-4.1
      - 0.637 (±0.301)
      - 0.787 (±0.185)
      - 35.37
      - 0.01498
-   * - 21
+   * - 24
      - google/gemma-3-27b-it
      - 0.604 (±0.342)
      - 0.788 (±0.297)
      - 23.16
      - 0.00020
-   * - 22
+   * - 25
      - ds4sd/SmolDocling-256M-preview
      - 0.603 (±0.292)
      - 0.705 (±0.262)
      - 507.74
      - 0.00000
-   * - 23
+   * - 26
      - microsoft/phi-4-multimodal-instruct
      - 0.589 (±0.273)
      - 0.820 (±0.197)
      - 14.00
      - 0.00045
-   * - 24
+   * - 27
      - qwen/qwen-2.5-vl-7b-instruct
      - 0.498 (±0.378)
      - 0.630 (±0.445)
```

lexoid/core/parse_type/llm_parser.py (3 additions, 3 deletions)
```diff
@@ -450,6 +450,8 @@ def parse_image_with_gemini(
         }
         if kwargs["model"] == "gemini-2.5-pro":
             generation_config["thinkingConfig"] = {"thinkingBudget": 128}
+        elif kwargs["model"].startswith("gemini-3"):
+            generation_config["thinkingConfig"] = {"thinkingLevel": "low"}
 
     payload = {
         "contents": [
@@ -658,11 +660,9 @@ def create_response(
     completion_params = {
         "model": model,
         "messages": messages,
-        "max_tokens": max_tokens,
-        "temperature": temperature,
     }
 
-    if api == "openai" and (model in ["gpt-5", "gpt-5-mini"] or model.startswith("o")):
+    if api == "openai":
         # Unsupported in some models
         del completion_params["max_tokens"]
         del completion_params["temperature"]
```
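The first hunk branches the generation config on the model family. A standalone sketch of that selection logic (the `thinkingConfig` field names are taken from the diff itself, not from the full Gemini API surface):

```python
def build_generation_config(model: str) -> dict:
    """Mirror the branching in parse_image_with_gemini: a token budget for
    gemini-2.5-pro, a thinking level for the gemini-3 preview family,
    and no thinking config for everything else."""
    generation_config: dict = {}
    if model == "gemini-2.5-pro":
        generation_config["thinkingConfig"] = {"thinkingBudget": 128}
    elif model.startswith("gemini-3"):
        generation_config["thinkingConfig"] = {"thinkingLevel": "low"}
    return generation_config

print(build_generation_config("gemini-3-flash-preview"))
```

One observation on the second hunk: the dict literal no longer sets `max_tokens` or `temperature`, yet the `openai` branch still `del`s both keys, which would raise `KeyError` unless those keys are re-inserted elsewhere before this point; `completion_params.pop(key, None)` would be the defensive alternative.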

tests/api_cost_mapping.json (16 additions, 0 deletions)
```diff
@@ -1,4 +1,16 @@
 {
+    "gemini-3-flash-preview": {
+        "input": 0.5,
+        "output": 3
+    },
+    "gemini-3-pro-preview": {
+        "input": 2,
+        "output": 12
+    },
+    "gemini-3-pro-image-preview": {
+        "input": 2,
+        "output": 12
+    },
     "gemini-2.5-flash": {
         "input": 0.3,
         "output": 2.5
@@ -45,6 +57,10 @@
         "input": 0.4,
         "output": 1.6
     },
+    "gpt-5.2": {
+        "input": 1.75,
+        "output": 14
+    },
     "gpt-5": {
         "input": 1.25,
         "output": 10
```

tests/benchmark.py (12 additions, 7 deletions)
```diff
@@ -10,14 +10,17 @@
 from dotenv import load_dotenv
 
 from lexoid.api import parse
-from benchmark_utils import calculate_similarities
+from benchmark_utils import calculate_similarities, METRIC_NAMES
 
 load_dotenv()
 
 config_options = {
     "parser_type": ["LLM_PARSE", "STATIC_PARSE", "AUTO"],
     "model": [
         # # Google models
+        "gemini-3-flash-preview",
+        # "gemini-3-pro-preview",
+        # "gemini-3-pro-image-preview",
         # "gemini-2.5-flash",
         # "gemini-2.5-pro",
         # "gemini-2.0-flash",
@@ -32,6 +35,7 @@
         # "claude-3-7-sonnet-20250219",
         # "claude-3-5-sonnet-20241022",
         # # OpenAI models
+        # "gpt-5.2",
         # "gpt-5",
         # "gpt-5-mini",
         # "gpt-4.1",
@@ -163,14 +167,14 @@ def run_benchmark_config(
             break  # Stop further iterations if an error occurs
 
     mean_similarity = (
-        {metric: mean([s[metric] for s in similarities]) for metric in similarities[0]}
+        {metric: mean([s[metric] for s in similarities]) for metric in METRIC_NAMES}
         if similarities
         else None
     )
     std_similarity = (
-        {metric: stdev([s[metric] for s in similarities]) for metric in similarities[0]}
+        {metric: stdev([s[metric] for s in similarities]) for metric in METRIC_NAMES}
         if len(similarities) > 1
-        else {metric: 0.0 for metric in similarities[0]}
+        else {metric: 0.0 for metric in METRIC_NAMES}
     )
 
     return BenchmarkResult(
@@ -196,15 +200,15 @@ def aggregate_results(results: List[BenchmarkResult]) -> BenchmarkResult:
     all_costs = [c for r in valid_results for c in r.cost]
     avg_similarity = {
         metric: mean([s[metric] for s in all_similarities])
-        for metric in all_similarities[0]
+        for metric in METRIC_NAMES
     }
     std_similarity = (
         {
             metric: stdev([s[metric] for s in all_similarities])
-            for metric in all_similarities[0]
+            for metric in METRIC_NAMES
         }
         if len(all_similarities) > 1
-        else {metric: 0.0 for metric in avg_similarity}
+        else {metric: 0.0 for metric in METRIC_NAMES}
     )
     avg_execution_time = mean(all_execution_times)
     avg_cost = mean(all_costs)
@@ -449,6 +453,7 @@ def main():
 
     # Can be either a single file or directory
     input_path = "examples/inputs"
+    # input_path = "examples/inputs/grocery_bill.jpg"
     output_dir = "examples/outputs"
 
     run_id = "_".join(
```
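The switch from `similarities[0]` to `METRIC_NAMES` makes the aggregation iterate over a fixed, explicit set of metrics instead of whatever keys happen to appear in the first result dict. A minimal self-contained sketch of the fixed aggregation (the `METRIC_NAMES` tuple matches the one added to tests/benchmark_utils.py; `aggregate` is a condensed illustration, not the repo's exact function):

```python
from statistics import mean, stdev

METRIC_NAMES = ("sequence_matcher", "cosine", "jaccard", "precision", "recall", "f1_score")

def aggregate(similarities: list) -> tuple:
    """Per-metric mean and sample std over a list of metric dicts.
    With a single sample, stdev is undefined, so std falls back to 0.0."""
    mean_similarity = {m: mean(s[m] for s in similarities) for m in METRIC_NAMES}
    std_similarity = (
        {m: stdev(s[m] for s in similarities) for m in METRIC_NAMES}
        if len(similarities) > 1
        else {m: 0.0 for m in METRIC_NAMES}
    )
    return mean_similarity, std_similarity

runs = [{m: 0.0 for m in METRIC_NAMES}, {m: 1.0 for m in METRIC_NAMES}]
means, stds = aggregate(runs)
print(means["cosine"])  # 0.5
```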

tests/benchmark_utils.py (10 additions, 0 deletions)
```diff
@@ -112,3 +112,13 @@ def calculate_similarities(
     similarities.update(precision_recall_f1_score(text1, text2))
 
     return similarities
+
+
+METRIC_NAMES = (
+    "sequence_matcher",
+    "cosine",
+    "jaccard",
+    "precision",
+    "recall",
+    "f1_score",
+)
```

tests/results.csv (3 additions, 0 deletions)
```diff
@@ -23,3 +23,6 @@ ds4sd/SmolDocling-256M-preview,0.603 (±0.292),0.705 (±0.262),0.645 (±0.245),0
 mistral-ocr-latest,0.882 (±0.106),0.932 (±0.091),0.904 (±0.104),0.923 (±0.097),0.977 (±0.034),0.946 (±0.061),5.754099847414555,0.001211538461538342
 gpt-5,0.807 (±0.215),0.919 (±0.088),0.855 (±0.131),0.977 (±0.024),0.871 (±0.126),0.917 (±0.078),98.12129604357942,0.05505420673076922
 gpt-5-mini,0.819 (±0.201),0.917 (±0.104),0.857 (±0.150),0.975 (±0.033),0.876 (±0.152),0.916 (±0.093),52.83561164752032,0.008113719551281968
+gemini-3-flash-preview,0.766 (±0.293),0.858 (±0.210),0.825 (±0.237),0.989 (±0.016),0.835 (±0.242),0.883 (±0.175),39.38287312643869,0.00968610714285712
+gemini-3-pro-preview,0.917 (±0.127),0.943 (±0.159),0.925 (±0.126),0.974 (±0.034),0.944 (±0.120),0.956 (±0.081),46.92351686954498,0.06287985714285714
+gpt-5.2,0.890 (±0.193),0.975 (±0.036),0.950 (±0.049),0.966 (±0.035),0.981 (±0.020),0.974 (±0.027),33.3172641311373,0.03959375
```
