Commit a494c35

Update index.html
1 parent d0d9e21 commit a494c35

1 file changed

Lines changed: 132 additions & 44 deletions

index.html

@@ -1,5 +1,4 @@
1-
2-
<!DOCTYPE html>
1+
<!DOCTYPE html>
32
<html>
43
<head>
54
<meta charset="utf-8">
@@ -229,6 +228,76 @@
229228
.reversed-layout .content {
230229
margin-top: 2rem;
231230
}
231+
232+
/* Comparison Table Styles */
233+
.comparison-table {
234+
width: 100%;
235+
border-collapse: separate;
236+
border-spacing: 0;
237+
border-radius: 8px;
238+
overflow: hidden;
239+
box-shadow: 0 4px 12px rgba(0,0,0,0.1);
240+
margin-bottom: 2rem;
241+
}
242+
243+
.comparison-table th {
244+
background-color: #485fc7;
245+
color: white;
246+
padding: 1rem;
247+
text-align: left;
248+
font-weight: 600;
249+
font-size: 1.05rem;
250+
}
251+
252+
.comparison-table tr:nth-child(even) {
253+
background-color: #f5f7ff;
254+
}
255+
256+
.comparison-table tr:nth-child(odd) {
257+
background-color: white;
258+
}
259+
260+
.comparison-table td {
261+
padding: 1rem;
262+
border-bottom: 1px solid #eaeaea;
263+
font-size: 0.95rem;
264+
vertical-align: top;
265+
}
266+
267+
.comparison-table tr:last-child td {
268+
border-bottom: none;
269+
}
270+
271+
.comparison-table td:first-child {
272+
font-weight: 600;
273+
color: #485fc7;
274+
width: 18%;
275+
}
276+
277+
.comparison-table td:nth-child(2),
278+
.comparison-table td:nth-child(3) {
279+
width: 41%;
280+
}
281+
282+
.table-container {
283+
overflow-x: auto;
284+
margin-top: 2rem;
285+
}
286+
287+
.table-caption {
288+
text-align: center;
289+
font-weight: bold;
290+
margin-bottom: 1rem;
291+
font-size: 1.1rem;
292+
color: #485fc7;
293+
}
294+
295+
@media screen and (max-width: 768px) {
296+
.comparison-table td, .comparison-table th {
297+
padding: 0.75rem;
298+
font-size: 0.9rem;
299+
}
300+
}
232301
</style>
233302
</head>
234303
<body>
@@ -341,51 +410,70 @@ <h2 class="title is-3 has-text-centered">Abstract</h2>
341410
</div>
342411
</section>
343412

344-
<!-- Key Features Section -->
413+
<!-- Comparison Table Section (Replacing Key Features) -->
345414
<section class="section">
346415
<div class="container is-max-desktop">
347-
<h2 class="title is-3 has-text-centered">Key Contributions</h2>
416+
<h2 class="title is-3 has-text-centered">Comparison: Psychometrics vs AI Benchmarks</h2>
348417

349-
<div class="columns is-multiline key-features">
350-
<div class="column is-half">
351-
<div class="feature-box">
352-
<div class="feature-icon">
353-
<i class="fas fa-brain"></i>
354-
</div>
355-
<h3 class="title is-5">Psychological Construct Measurement</h3>
356-
<p>Systematic approaches for measuring personality constructs and cognitive abilities in LLMs.</p>
357-
</div>
358-
</div>
359-
360-
<div class="column is-half">
361-
<div class="feature-box">
362-
<div class="feature-icon">
363-
<i class="fas fa-flask"></i>
364-
</div>
365-
<h3 class="title is-5">Evaluation Methodologies</h3>
366-
<p>Comprehensive frameworks for test formats, data sources, prompting strategies, and scoring mechanisms.</p>
367-
</div>
368-
</div>
369-
370-
<div class="column is-half">
371-
<div class="feature-box">
372-
<div class="feature-icon">
373-
<i class="fas fa-check-circle"></i>
374-
</div>
375-
<h3 class="title is-5">Psychometric Validation</h3>
376-
<p>Principles for ensuring reliability, validity, and fairness in LLM assessments.</p>
377-
</div>
378-
</div>
379-
380-
<div class="column is-half">
381-
<div class="feature-box">
382-
<div class="feature-icon">
383-
<i class="fas fa-rocket"></i>
384-
</div>
385-
<h3 class="title is-5">LLM Enhancement Techniques</h3>
386-
<p>Applications of psychometric insights to improve model capabilities and alignment.</p>
387-
</div>
388-
</div>
418+
<div class="table-caption">Table 1: Systematic comparison between psychometric evaluation and conventional AI benchmarking approaches</div>
419+
420+
<div class="table-container">
421+
<table class="comparison-table">
422+
<thead>
423+
<tr>
424+
<th>Feature</th>
425+
<th>Psychometrics</th>
426+
<th>AI Benchmarks</th>
427+
</tr>
428+
</thead>
429+
<tbody>
430+
<tr>
431+
<td>Core goal</td>
432+
<td>To prove that a test measures what it is intended to measure (validity evidence) and to understand the construct being measured.</td>
433+
<td>To test and compare the task performance of different LLMs. Focuses on ranking models and selecting the one best suited to a specific task.</td>
434+
</tr>
435+
<tr>
436+
<td>Philosophy of measurement</td>
437+
<td>Construct-oriented. Tends towards a causal approach to measurement, where the measured trait is believed to cause the measurement outcomes.</td>
438+
<td>Task-oriented. Leans towards representativism, assuming items exhaust or represent all aspects of the underlying ability.</td>
439+
</tr>
440+
<tr>
441+
<td>Target construct</td>
442+
<td>Personality and ability.</td>
443+
<td>Mostly task-specific abilities.</td>
444+
</tr>
445+
<tr>
446+
<td>Construct definition</td>
447+
<td>Emphasizes clear and detailed definitions of the construct being measured. Agreement on the construct definition is a byproduct of test development.</td>
448+
<td>Often defines constructs implicitly through ad hoc task selection. Construct definitions can be vague.</td>
449+
</tr>
450+
<tr>
451+
<td>Development process</td>
452+
<td>Systematic and rigorous, often following methods like Evidence-Centered Design (ECD). Can be labor-intensive.</td>
453+
<td>Compiles a set of relevant questions or tasks, then uses expert annotation or crowdsourcing to label ground-truth answers. Less labor-intensive per item.</td>
454+
</tr>
455+
<tr>
456+
<td>Number of items</td>
457+
<td>Varies, but is not necessarily large. The focus is on item quality and relevance to the construct.</td>
458+
<td>Typically consists of a large number of questions to cover various aspects of an ability. Reliability increases with test length.</td>
459+
</tr>
460+
<tr>
461+
<td>Sample size</td>
462+
<td>Typically requires a large sample of individuals for robust statistical modeling.</td>
463+
<td>Can be applied to evaluate the performance of a single LLM on the benchmark.</td>
464+
</tr>
465+
<tr>
466+
<td>Statistical modeling</td>
467+
<td>Employs a variety of advanced statistical models, such as Item Response Theory and factor analysis, to analyze data, estimate latent abilities, and assess model fit.</td>
468+
<td>Often relies on simple aggregation methods, such as calculating average accuracy across benchmarks.</td>
469+
</tr>
470+
<tr>
471+
<td>Result analysis</td>
472+
<td>Ensures the reliability, validity, predictive power, and explanatory power of the test through result analysis and statistical modeling.</td>
473+
<td>Reliability is likely to be high due to the large number of items. However, validity, predictive power, or explanatory power beyond the target task is not a primary concern.</td>
474+
</tr>
475+
</tbody>
476+
</table>
389477
</div>
390478
</div>
391479
</section>
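
The table states that benchmark reliability increases with test length and that benchmarks often aggregate by average accuracy. As a rough illustration only, and not something taken from this commit, the sketch below uses the classical Spearman-Brown prediction formula, which assumes the added items are parallel to the existing ones. The function names `spearman_brown` and `average_accuracy` are hypothetical helpers introduced here.

```python
from typing import Sequence


def spearman_brown(reliability: float, length_factor: float) -> float:
    """Predict reliability after changing test length by `length_factor`.

    Assumption: classical test theory with parallel items, so the
    Spearman-Brown formula k*r / (1 + (k - 1)*r) applies.
    """
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must be in [0, 1]")
    if length_factor <= 0:
        raise ValueError("length_factor must be positive")
    return length_factor * reliability / (1.0 + (length_factor - 1.0) * reliability)


def average_accuracy(item_scores: Sequence[int]) -> float:
    """Simple benchmark-style aggregation: mean of 0/1 item scores."""
    if not item_scores:
        raise ValueError("item_scores must not be empty")
    return sum(item_scores) / len(item_scores)


if __name__ == "__main__":
    # Doubling, then quadrupling, a test whose current reliability is 0.5.
    for k in (1, 2, 4):
        print(k, round(spearman_brown(0.5, k), 3))
    # Average accuracy over four hypothetical item scores.
    print(average_accuracy([1, 0, 1, 1]))
```

Under these assumptions, lengthening a test raises predicted reliability with diminishing returns, which is consistent with the "Number of items" and "Result analysis" rows. It says nothing about validity, which is the distinction the table draws.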
