|
1 | | - |
2 | | - <!DOCTYPE html> |
| 1 | +<!DOCTYPE html> |
3 | 2 | <html> |
4 | 3 | <head> |
5 | 4 | <meta charset="utf-8"> |
|
229 | 228 | .reversed-layout .content { |
230 | 229 | margin-top: 2rem; |
231 | 230 | } |
| 231 | + |
| 232 | + /* Comparison Table Styles */ |
| 233 | + .comparison-table { |
| 234 | + width: 100%; |
| 235 | + border-collapse: separate; |
| 236 | + border-spacing: 0; |
| 237 | + border-radius: 8px; |
| 238 | + overflow: hidden; |
| 239 | + box-shadow: 0 4px 12px rgba(0,0,0,0.1); |
| 240 | + margin-bottom: 2rem; |
| 241 | + } |
| 242 | + |
| 243 | + .comparison-table th { |
| 244 | + background-color: #485fc7; |
| 245 | + color: white; |
| 246 | + padding: 1rem; |
| 247 | + text-align: left; |
| 248 | + font-weight: 600; |
| 249 | + font-size: 1.05rem; |
| 250 | + } |
| 251 | + |
| 252 | + .comparison-table tr:nth-child(even) { |
| 253 | + background-color: #f5f7ff; |
| 254 | + } |
| 255 | + |
| 256 | + .comparison-table tr:nth-child(odd) { |
| 257 | + background-color: white; |
| 258 | + } |
| 259 | + |
| 260 | + .comparison-table td { |
| 261 | + padding: 1rem; |
| 262 | + border-bottom: 1px solid #eaeaea; |
| 263 | + font-size: 0.95rem; |
| 264 | + vertical-align: top; |
| 265 | + } |
| 266 | + |
| 267 | + .comparison-table tr:last-child td { |
| 268 | + border-bottom: none; |
| 269 | + } |
| 270 | + |
| 271 | + .comparison-table td:first-child { |
| 272 | + font-weight: 600; |
| 273 | + color: #485fc7; |
| 274 | + width: 18%; |
| 275 | + } |
| 276 | + |
| 277 | + .comparison-table td:nth-child(2), |
| 278 | + .comparison-table td:nth-child(3) { |
| 279 | + width: 41%; |
| 280 | + } |
| 281 | + |
| 282 | + .table-container { |
| 283 | + overflow-x: auto; |
| 284 | + margin-top: 2rem; |
| 285 | + } |
| 286 | + |
| 287 | + .table-caption { |
| 288 | + text-align: center; |
| 289 | + font-weight: bold; |
| 290 | + margin-bottom: 1rem; |
| 291 | + font-size: 1.1rem; |
| 292 | + color: #485fc7; |
| 293 | + } |
| 294 | + |
| 295 | + @media screen and (max-width: 768px) { |
| 296 | + .comparison-table td, .comparison-table th { |
| 297 | + padding: 0.75rem; |
| 298 | + font-size: 0.9rem; |
| 299 | + } |
| 300 | + } |
232 | 301 | </style> |
233 | 302 | </head> |
234 | 303 | <body> |
@@ -341,51 +410,70 @@ <h2 class="title is-3 has-text-centered">Abstract</h2> |
341 | 410 | </div> |
342 | 411 | </section> |
343 | 412 |
|
344 | | - <!-- Key Features Section --> |
| 413 | + <!-- Comparison Table Section (Replacing Key Features) --> |
345 | 414 | <section class="section"> |
346 | 415 | <div class="container is-max-desktop"> |
347 | | - <h2 class="title is-3 has-text-centered">Key Contributions</h2> |
| 416 | + <h2 class="title is-3 has-text-centered">Comparison: Psychometrics vs AI Benchmarks</h2> |
348 | 417 |
|
349 | | - <div class="columns is-multiline key-features"> |
350 | | - <div class="column is-half"> |
351 | | - <div class="feature-box"> |
352 | | - <div class="feature-icon"> |
353 | | - <i class="fas fa-brain"></i> |
354 | | - </div> |
355 | | - <h3 class="title is-5">Psychological Construct Measurement</h3> |
356 | | - <p>Systematic approaches for measuring personality constructs and cognitive abilities in LLMs.</p> |
357 | | - </div> |
358 | | - </div> |
359 | | - |
360 | | - <div class="column is-half"> |
361 | | - <div class="feature-box"> |
362 | | - <div class="feature-icon"> |
363 | | - <i class="fas fa-flask"></i> |
364 | | - </div> |
365 | | - <h3 class="title is-5">Evaluation Methodologies</h3> |
366 | | - <p>Comprehensive frameworks for test formats, data sources, prompting strategies, and scoring mechanisms.</p> |
367 | | - </div> |
368 | | - </div> |
369 | | - |
370 | | - <div class="column is-half"> |
371 | | - <div class="feature-box"> |
372 | | - <div class="feature-icon"> |
373 | | - <i class="fas fa-check-circle"></i> |
374 | | - </div> |
375 | | - <h3 class="title is-5">Psychometric Validation</h3> |
376 | | - <p>Principles for ensuring reliability, validity, and fairness in LLM assessments.</p> |
377 | | - </div> |
378 | | - </div> |
379 | | - |
380 | | - <div class="column is-half"> |
381 | | - <div class="feature-box"> |
382 | | - <div class="feature-icon"> |
383 | | - <i class="fas fa-rocket"></i> |
384 | | - </div> |
385 | | - <h3 class="title is-5">LLM Enhancement Techniques</h3> |
386 | | - <p>Applications of psychometric insights to improve model capabilities and alignment.</p> |
387 | | - </div> |
388 | | - </div> |
| 418 | + <div class="table-caption">Table 1: Systematic comparison between psychometric evaluation and conventional AI benchmarking approaches</div> |
| 419 | + |
| 420 | + <div class="table-container"> |
| 421 | + <table class="comparison-table"> |
| 422 | + <thead> |
| 423 | + <tr> |
| 424 | + <th>Feature</th> |
| 425 | + <th>Psychometrics</th> |
| 426 | + <th>AI Benchmarks</th> |
| 427 | + </tr> |
| 428 | + </thead> |
| 429 | + <tbody> |
| 430 | + <tr> |
| 431 | + <td>Core goal</td> |
| 432 | + <td>To prove that a test measures what it is intended to measure (validity evidence) and to understand the construct being measured.</td> |
| 433 | + <td>To test and compare the task performance of different LLMs. Focuses on ranking models and selecting the one best suited to a specific task.</td> |
| 434 | + </tr> |
| 435 | + <tr> |
| 436 | + <td>Philosophy of measurement</td> |
| 437 | + <td>Construct-oriented. Tends towards a causal approach to measurement, where the measured trait is believed to cause the measurement outcomes.</td> |
| 438 | + <td>Task-oriented. Leans towards representativism, assuming items exhaust or represent all aspects of the underlying ability.</td> |
| 439 | + </tr> |
| 440 | + <tr> |
| 441 | + <td>Target construct</td> |
| 442 | + <td>Personality and ability.</td> |
| 443 | + <td>Mostly task-specific abilities.</td> |
| 444 | + </tr> |
| 445 | + <tr> |
| 446 | + <td>Construct definition</td> |
| 447 | + <td>Emphasizes clear and detailed definitions of the construct being measured. Agreement on the construct definition is a byproduct of test development.</td> |
| 448 | + <td>Often defines constructs implicitly through ad hoc task selection. Construct definitions can be vague.</td> |
| 449 | + </tr> |
| 450 | + <tr> |
| 451 | + <td>Development process</td> |
| 452 | + <td>Systematic and rigorous, often following methods like Evidence-Centered Design (ECD). Can be labor-intensive.</td> |
| 453 | + <td>Compiles a set of relevant questions or tasks, then performs expert annotation or crowdsourcing to label ground truth answers. Less labor-intensive per item.</td> |
| 454 | + </tr> |
| 455 | + <tr> |
| 456 | + <td>Number of items</td> |
| 457 | + <td>Varies, and is not necessarily large. Focus is on item quality and relevance to the construct.</td> |
| 458 | + <td>Typically comprises a large number of questions covering many aspects of an ability. Reliability increases with test length.</td> |
| 459 | + </tr> |
| 460 | + <tr> |
| 461 | + <td>Sample size</td> |
| 462 | + <td>Typically requires a larger sample size of individuals for robust statistical modeling.</td> |
| 463 | + <td>Can be applied to evaluate the performance of a single LLM on the benchmark.</td> |
| 464 | + </tr> |
| 465 | + <tr> |
| 466 | + <td>Statistical modeling</td> |
| 467 | + <td>Employs a variety of advanced statistical models, such as Item Response Theory and factor analysis, to analyze data, estimate latent abilities, and assess model fit.</td> |
| 468 | + <td>Often relies on simple aggregation methods, such as calculating average accuracy across benchmarks.</td> |
| 469 | + </tr> |
| 470 | + <tr> |
| 471 | + <td>Result analysis</td> |
| 472 | + <td>Ensures the reliability, validity, predictive power, and explanatory power of the test through result analysis and statistical modeling.</td> |
| 473 | + <td>Reliability is likely to be high due to the large number of items. However, validity, predictive power, and explanatory power beyond the target task are not primary concerns.</td> |
| 474 | + </tr> |
| 475 | + </tbody> |
| 476 | + </table> |
389 | 477 | </div> |
390 | 478 | </div> |
391 | 479 | </section> |
|
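The "Number of items" row added in this diff states that reliability increases with test length. One standard way to make that claim concrete is the classical Spearman-Brown prediction formula; the diff itself does not name this formula, so treat the sketch below as an illustrative assumption rather than the authors' method. The function name `spearman_brown` and its parameters are hypothetical.

```python
def spearman_brown(reliability: float, length_factor: float) -> float:
    """Predict test reliability after changing test length.

    Assumes the classical Spearman-Brown formula, which presumes the
    added items are parallel to the existing ones:
        r_new = k * r / (1 + (k - 1) * r)
    where r is the current reliability and k is the length multiplier.
    """
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must lie in [0, 1]")
    if length_factor <= 0:
        raise ValueError("length_factor must be positive")
    return (length_factor * reliability) / (
        1.0 + (length_factor - 1.0) * reliability
    )


if __name__ == "__main__":
    # Compare predicted reliability as a test with reliability 0.5 grows.
    for k in (1, 2, 10):
        print(k, round(spearman_brown(0.5, k), 3))
```

Under this formula the gain from each added item shrinks as the test grows, which fits the table's contrast: a long benchmark can be highly reliable while leaving validity unaddressed.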