18 | 18 | <a href="https://huggingface.co/datasets/opencompass/VerifierBench" target="_blank" style="margin: 2px;"> |
19 | 19 | <img alt="Hugging Face Dataset" src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Dataset-ff9800?color=ff9800&logoColor=white" style="display: inline-block; vertical-align: middle;"/> |
20 | 20 | </a> |
21 | | - <a href="https://creativecommons.org/licenses/by-sa/4.0/" style="margin: 2px;"> |
22 | | - <img alt="License" src="https://img.shields.io/badge/License-CC%20BY--SA%204.0-f5de53?color=f5de53&logoColor=white" style="display: inline-block; vertical-align: middle;"/> |
| 21 | + <a href="https://www.apache.org/licenses/LICENSE-2.0" style="margin: 2px;"> |
| 22 | + <img alt="License" src="https://img.shields.io/badge/License-Apache%202.0-blue.svg?color=blue&logo=apache&logoColor=white" style="display: inline-block; vertical-align: middle;"/> |
23 | 23 | </a> |
24 | 24 | </div> |
25 | 25 |
|
|
30 | 30 |
31 | 31 |
32 | 32 | <p align="center"> |
33 | | - <img src="https://cdn-uploads.huggingface.co/production/uploads/614ffea450eec00bf3c23652/gezsWZn0CxCc423gW5UMO.png" alt="Test Set Results" width="600" height="400"> |
| 33 | + <img src="assets/model_performance.png" alt="Test Set Results" width="600" height="400"> |
34 | 34 | </p> |
35 | 35 |
36 | 36 | ## Get the model and dataset from 🤗 |
@@ -389,44 +389,8 @@ CompassVerifier performance (F1 score w/o COT) on our new released [VerifierBen |
389 | 389 | <td style="text-align: right;">62.6</td> |
390 | 390 | <td style="text-align: right;">67.1</td> |
391 | 391 | </tr> |
392 | | - |
393 | | - <tr> |
394 | | - <td colspan="6" style="text-align: center;"><strong><em>CompassVerifier (Qwen3)</em></strong></td> |
395 | | - </tr> |
396 | | - <tr> |
397 | | - <td style="text-align: left;">CompassVerifier-1.7B</td> |
398 | | - <td style="text-align: right;">87.1</td> |
399 | | - <td style="text-align: right;">89.4</td> |
400 | | - <td style="text-align: right;">63.0</td> |
401 | | - <td style="text-align: right;">80.2</td> |
402 | | - <td style="text-align: right;">80.0</td> |
403 | | - </tr> |
404 | | - <tr> |
405 | | - <td style="text-align: left;">CompassVerifier-8B</td> |
406 | | - <td style="text-align: right;">86.7</td> |
407 | | - <td style="text-align: right;">90.7</td> |
408 | | - <td style="text-align: right;">75.7</td> |
409 | | - <td style="text-align: right;">79.3</td> |
410 | | - <td style="text-align: right;">83.1</td> |
411 | | - </tr> |
412 | | - <tr> |
413 | | - <td style="text-align: left;">CompassVerifier-14B</td> |
414 | | - <td style="text-align: right;">90.3</td> |
415 | | - <td style="text-align: right;">91.4</td> |
416 | | - <td style="text-align: right;">79.1</td> |
417 | | - <td style="text-align: right;">82.9</td> |
418 | | - <td style="text-align: right;">85.9</td> |
419 | | - </tr> |
420 | | - <tr> |
421 | | - <td style="text-align: left;">CompassVerifier-32B</td> |
422 | | - <td style="text-align: right;">89.6</td> |
423 | | - <td style="text-align: right;">92.3</td> |
424 | | - <td style="text-align: right;">79.8</td> |
425 | | - <td style="text-align: right;">83.0</td> |
426 | | - <td style="text-align: right;">86.2</td> |
427 | | - </tr> |
428 | 392 | <tr> |
429 | | - <td colspan="6" style="text-align: center;"><strong><em>CompassVerifier (Qwen2.5)</em></strong></td> |
| 393 | + <td colspan="6" style="text-align: center;"><strong><em>CompassVerifier</em></strong></td> |
430 | 394 | </tr> |
431 | 395 | <tr> |
432 | 396 | <td style="text-align: left;">CompassVerifier-3B</td> |
@@ -534,24 +498,8 @@ We also test the performance of CompassVerifier on [VerifyBench](https://arxiv.o |
534 | 498 | <td style="text-align: right;">-</td> |
535 | 499 | </tr> |
536 | 500 | <tr> |
537 | | - <td colspan="5" style="text-align: center;"><strong><em>CompassVerifier (Qwen3)</em></strong></td> |
538 | | - </tr> |
539 | | - <tr> |
540 | | - <td style="text-align: left;">CompassVerifier-1.7B</td> |
541 | | - <td style="text-align: right;">80.1</td> |
542 | | - <td style="text-align: right;">69.3</td> |
543 | | - <td style="text-align: right;">72.9</td> |
544 | | - <td style="text-align: right;">61.0</td> |
545 | | - </tr> |
546 | | - <tr> |
547 | | - <td style="text-align: left;">CompassVerifier-8B</td> |
548 | | - <td style="text-align: right;">84.5</td> |
549 | | - <td style="text-align: right;">72.7</td> |
550 | | - <td style="text-align: right;">79.2</td> |
551 | | - <td style="text-align: right;">55.4</td> |
552 | | - </tr> |
553 | 501 | <tr> |
554 | | - <td colspan="5" style="text-align: center;"><strong><em>CompassVerifier (Qwen2.5)</em></strong></td> |
| 502 | + <td colspan="5" style="text-align: center;"><strong><em>CompassVerifier</em></strong></td> |
555 | 503 | </tr> |
556 | 504 | <tr> |
557 | 505 | <td style="text-align: left;">CompassVerifier-3B</td> |