The survey will be released soon. Stay tuned!
<!-- The evolving capabilities of large language models (LLMs) have outpaced traditional evaluation methodologies and introduced novel evaluation challenges, such as assessing human-like psychological constructs, addressing the limitations of static and task-specific benchmarks, and meeting the requirement for human-centered evaluation. These challenges intersect with psychometrics, the science
<h4 class="gradient-title">Psychological Constructs in LLM Research</h4>
<p>
LLM psychometrics evaluates LLMs on personality and cognitive constructs. Personality constructs include (1) personality traits measured with theories such as the Big Five, HEXACO, MBTI, or the Dark Triad; (2) values based on theories such as Schwartz, WVS, VSM, and GLOBE; (3) morality based on MFT, DIT, and ETHICS; and (4) attitudes and opinions from political panels such as ANES, ATP, GLES, and PCT. In contrast, cognitive constructs include (1) heuristics and biases measured by tasks such as the Cognitive Reflection Test; (2) social interaction abilities, including Theory of Mind and emotional and social intelligence; (3) psychology of language, covering comprehension, generation, and acquisition; and (4) learning and cognitive capabilities.
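To make the administration of such inventories concrete, here is a minimal sketch of how a Likert-style personality item might be keyed and aggregated. The item texts, ratings, and reverse-keying below are hypothetical illustrations, not drawn from any cited inventory.

```python
# Hedged sketch: scoring Likert-style personality items from model responses.
# Item wording and keying are hypothetical, not from a real inventory.

ITEMS = [
    {"text": "I am the life of the party.", "reverse": False},
    {"text": "I don't talk a lot.", "reverse": True},
]

def score_item(rating: int, reverse: bool, scale_max: int = 5) -> int:
    """Map a 1..scale_max rating to its keyed value; reverse-keyed items flip."""
    return scale_max + 1 - rating if reverse else rating

# Hypothetical model ratings for the two items above
ratings = [4, 2]
keyed = [score_item(r, item["reverse"]) for r, item in zip(ratings, ITEMS)]
extraversion = sum(keyed) / len(keyed)
print(extraversion)
```

Reverse-keying matters because a low rating on a negatively worded item indicates a high trait level, so raw and keyed scores must not be mixed.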
<h4 class="gradient-title">Validation Framework for LLM Psychometrics</h4>
<p>
Applying psychometrics to LLMs requires rigorous validation. Reliability is assessed through test-retest reliability, parallel-forms reliability, and inter-rater reliability when subjective coding is involved. Validity evidence is gathered on multiple fronts: content (guarding against training-data contamination and ensuring item representativeness), construct (ensuring responses reflect the intended latent trait rather than response sets or social desirability bias), and criterion or ecological correspondence with external benchmarks. We also highlight emerging standards, such as non-disclosure of test materials, fairness across languages and cultures, and the suitability of test difficulty for model capabilities.
<li><span class="highlight">Trait Manipulation</span>: Techniques for controlling LLM traits through prompting, inference-time interventions, and fine-tuning</li>
<li><span class="highlight">Safety and Alignment</span>: Leveraging psychometric measurements to guide alignment with human values and improve safety</li>
<li><span class="highlight">Cognitive Enhancement</span>: Methods for developing more human-like reasoning, empathy, and communication capabilities</li>
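The prompting route to trait manipulation can be sketched as follows. The template and the `build_persona_prompt` helper are hypothetical illustrations, not an API from the survey.

```python
# Hedged sketch: steering trait expression via a persona system prompt.
# The prompt template and trait description are illustrative placeholders.

def build_persona_prompt(trait: str, level: str) -> str:
    """Compose a system prompt asking the model to exhibit a trait at a level."""
    return (
        f"You are an assistant whose personality is {level} in {trait}. "
        f"Answer every question in a way consistent with that disposition."
    )

prompt = build_persona_prompt("extraversion", "very high")
print(prompt)
```

The resulting string would be supplied as the system message before administering a questionnaire, letting the same inventory measure how far the induced trait shifts the model's responses.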
<li><span class="highlight">Psychometric Validation</span>: Establish rigorous reliability and validity checks.</li>
<li><span class="highlight">From Human Constructs to LLM Constructs</span>: Reframe classical psychological constructs so they align with the representational space and behavior patterns of language models.</li>
<li><span class="highlight">Perceived vs. Aligned Traits</span>: Distinguish between traits that humans perceive from LLM outputs and those aligned with human self-views.</li>
<li><span class="highlight">Anthropomorphization Challenges</span>: Whether and how to anthropomorphize LLMs in psychometric testing remains controversial.</li>
<li><span class="highlight">Expanding Dimensions in Model Deployment</span>: Extend evaluations to multilingual, multi-turn, multimodal, agent, and multi-agent contexts where new validity issues emerge.</li>
<li><span class="highlight">Item Response Theory</span>: Adopt sophisticated IRT models and adaptive testing to improve LLM evaluation.</li>
<li><span class="highlight">From Evaluation to Enhancement</span>: Use psychometric insights to enhance and align LLMs.</li>