- Our methodological framework mirrors a classic assessment pipeline but is tailored to LLM tooling. Test formats range from tightly controlled structured items (single choice or Likert) to open‑ended conversations and full agentic simulations in which the model plays a role over many turns. Data sources may be established inventories, researcher‑curated adaptations, or synthetic prompts generated automatically to extend test coverage. Prompting strategies include perturbing the original question, injecting step‑by‑step or role‑playing instructions, and requiring chain‑of‑thought disclosure, all designed to surface latent capacities while controlling for prompt sensitivity. Finally, output & scoring modules translate the model's raw output into numerical metrics: probability‑based scoring over option logits, direct mapping of responses to scale points, or rubric‑based evaluation of free text, optionally adjudicated by additional models or human raters.
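As a minimal sketch of the output & scoring step, the snippet below assumes a structured Likert item with anchors 1–5 and illustrates two of the scoring routes named above: a probability‑weighted expected score computed from hypothetical per‑option log‑probabilities, and a direct mapping that extracts a scale point from a free‑text answer. Function names, the response format, and the numbers are illustrative assumptions, not part of any specific framework or API.

```python
import math
import re
from typing import Mapping, Optional


def probability_weighted_score(option_logprobs: Mapping[str, float]) -> float:
    """Softmax-normalise per-option log-probabilities and return the
    expected scale point (probability-based scoring).

    `option_logprobs` maps each Likert anchor ("1".."5") to a hypothetical
    log-probability, e.g. the model's score for that anchor as the next token.
    """
    max_lp = max(option_logprobs.values())  # subtract max for numerical stability
    weights = {opt: math.exp(lp - max_lp) for opt, lp in option_logprobs.items()}
    total = sum(weights.values())
    return sum(int(opt) * w / total for opt, w in weights.items())


def map_text_to_scale(response: str, scale_max: int = 5) -> Optional[int]:
    """Directly map a free-text answer to a scale point by taking the first
    integer within range; returns None when no valid rating is found."""
    match = re.search(r"\b([1-9]\d*)\b", response)
    if match and 1 <= int(match.group(1)) <= scale_max:
        return int(match.group(1))
    return None


if __name__ == "__main__":
    # Probability-based scoring: expected value over Likert anchors 1..5
    # (log-probabilities are made-up example values).
    logprobs = {"1": -4.2, "2": -2.9, "3": -1.1, "4": -0.6, "5": -2.3}
    print(round(probability_weighted_score(logprobs), 2))

    # Direct mapping: pull the chosen scale point out of a verbal answer.
    print(map_text_to_scale("I would say 4, since I mostly agree."))
```

Rubric‑based evaluation of longer free text would replace `map_text_to_scale` with a grading prompt to a judge model or a human rater; the numeric output would then feed into the same downstream metrics.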