22 | 22 | A Systematic Review of |
23 | 23 | Evaluation, Validation, and Enhancement |
24 | 24 | </title> |
25 | | - <link rel="icon" type="image/x-icon" href="static/images/logo_llm.ico"> |
| 25 | + <link rel="icon" type="image/x-icon" href="static/images/llm_psychometrics.ico"> |
26 | 26 | <link href="https://fonts.googleapis.com/css?family=Google+Sans|Noto+Sans|Castoro" rel="stylesheet"> |
27 | 27 |
|
28 | 28 | <link rel="stylesheet" href="static/css/bulma.min.css"> |
@@ -490,7 +490,7 @@ <h2 class="title is-3 has-text-centered">Comparison: Psychometrics vs AI Benchma |
490 | 490 | <tbody> |
491 | 491 | <tr> |
492 | 492 | <td>Core goal</td> |
493 | | - <td>To prove that a test measures what it is intended to measure (validity evidence) and to understand the construct being measured.</td> |
| 493 | + <td>To measure psychological constructs, to prove that a test measures as intended (validity evidence), and to understand the construct being measured.</td> |
494 | 494 | <td>To test and compare the task performance of different LLMs. Focuses on ranking models and selecting the best one suited for a specific task.</td> |
495 | 495 | </tr> |
496 | 496 | <tr> |
@@ -520,13 +520,13 @@ <h2 class="title is-3 has-text-centered">Comparison: Psychometrics vs AI Benchma |
520 | 520 | </tr> |
521 | 521 | <tr> |
522 | 522 | <td>Sample size</td> |
523 | | - <td>Typically requires a larger sample size of individuals for robust statistical modeling.</td> |
| 523 | + <td>Typically requires a larger sample size of test takers for robust statistical modeling.</td> |
524 | 524 | <td>Can be applied to evaluate the performance of a single LLM on the benchmark.</td> |
525 | 525 | </tr> |
526 | 526 | <tr> |
527 | 527 | <td>Statistical modeling</td> |
528 | 528 | <td>Employs advanced and various statistical models like Item Response Theory and Factor Analysis to analyze data, estimate latent abilities, and assess model fit.</td> |
529 | | - <td>Often relies on simple aggregation methods, such as calculating average accuracy across benchmarks.</td> |
| 529 | + <td>Often relies on simple aggregation methods, such as calculating average accuracy across benchmark tasks.</td> |
530 | 530 | </tr> |
531 | 531 | <tr> |
532 | 532 | <td>Result analysis</td> |