Probing Concepts is a research benchmark and analysis toolkit for measuring
conceptual understanding in large language models across seven domains:
biology, botany, chemistry, geology, medicine, musicology, and physics. The
repository includes the curated concept inventory, prompt templates, analysis
code, and the cached model outputs (the responses/ and scores/ files used in the study), so
the full set of reported tables, figures, and summary statistics can be
reproduced locally without making any live LLM API calls.
Tested with Python 3.11+.
```bash
# 1. create and activate a virtual environment
python3 -m venv .venv
source .venv/bin/activate

# 2. install the package (editable, with all dependencies)
pip install -e .

# 3. unzip the cached LLM outputs in place
unzip responses.zip
unzip scores.zip
```

This creates the responses/ and scores/ directories that the analysis pipeline reads from.
Once installed, regenerate every reported artifact from the bundled
responses/ and scores/ data with a single command:
```bash
probing-analyze              # cache-only (default): no LLM API calls
probing-analyze --no-cache   # opt in to live LLM API calls
```

In cache-only mode (--cache, the default), any attempt to call the LLM API raises immediately. The full analysis pipeline reads only the bundled JSON files and requires no API credentials.
This runs the full analysis pipeline end-to-end and writes:
- reports/summarized_scores.xlsx — model rankings and aggregated scores
- reports/failure_modes.xlsx — per-concept failure analysis
- reports/*.xlsx — concept / response / score statistics
- plots/*.pdf — radar, bar, and histogram plots
- paper/neurips2026/figures/*.pdf — publication figures
- paper_plots/*.pdf and *.csv — full-size paper plots and table data
- paper_stats.json — summary statistics used in the paper
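After a run, a quick way to confirm the outputs is to load the summary statistics directly. The snippet below is a minimal sketch that assumes only that paper_stats.json contains a single JSON object; its exact keys are defined by the pipeline, not shown here.

```python
import json
from pathlib import Path

# Load the summary statistics written by probing-analyze.
# We assume only that the file holds one JSON object; the keys
# themselves depend on the analysis pipeline.
stats = json.loads(Path("paper_stats.json").read_text())

# List the available top-level statistics and their value types.
for key, value in stats.items():
    print(f"{key}: {type(value).__name__}")
```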
Repository layout:

```
concepts/                 curated concept JSON files (7 domains)
concepts.zip              Croissant-compatible view of the benchmark data
concepts_records.jsonl    flattened concept records (Croissant input)
croissant.json            Croissant dataset metadata descriptor
prompt_templates/         testing, scoring, and system prompts
proofs/                   Lean 4 formalisation of conceptual semantics
responses.zip             zipped cached LLM responses (unzip to create responses/)
scores.zip                zipped cached LLM-as-judge scores (unzip to create scores/)
src/                      source package (probing_concepts)
scripts/                  paper figure / table generators
models.json               list of evaluated models
tests.json                test definitions
```
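For programmatic access to the benchmark data, concepts_records.jsonl can be read line by line. This is a minimal sketch that assumes only the standard JSON Lines convention (one JSON object per line); the record fields themselves are defined by the dataset and not documented here.

```python
import json
from pathlib import Path

# Read the flattened concept records (one JSON object per line).
records = [
    json.loads(line)
    for line in Path("concepts_records.jsonl").read_text().splitlines()
    if line.strip()
]

print(f"loaded {len(records)} concept records")
# Peek at the first record to see which fields the flattened schema exposes.
if records:
    print(sorted(records[0].keys()))
```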
The cached JSON files are sufficient to reproduce every result. To re-collect responses or rescore them against the LLM judge, set the API credentials and run the individual stages:
```bash
export LITELLM_PROXY_URL=...
export LITELLM_PC_API_KEY=...

probing-test      # collect model responses -> responses/
probing-score     # score responses with judge -> scores/
probing-analyze   # run the full analysis pipeline
```

LITELLM_PROXY_URL should point to a base URL that exposes an
OpenAI-compatible Chat Completions API. In practice, the code sends
authenticated POST requests to:
$LITELLM_PROXY_URL/chat/completions
with a bearer token from LITELLM_PC_API_KEY, plus standard chat-completions
fields such as model and messages. LiteLLM is the primary target and the
best-supported option here; the code also attaches LiteLLM-style metadata tags
for tracing. Another proxy or gateway should work if it accepts the same
OpenAI-style request shape, but a provider with a different API contract would
require a small adapter in src/probing_concepts/utils/call_llm.py.
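For reference, the request shape described above looks roughly like the following. This is an illustrative sketch using the requests library rather than the repository's own client in call_llm.py; the model name and message content are placeholders.

```python
import os
import requests

# Endpoint and credentials as described above.
base_url = os.environ["LITELLM_PROXY_URL"].rstrip("/")
api_key = os.environ["LITELLM_PC_API_KEY"]

# Standard OpenAI-style Chat Completions payload (model/messages are placeholders).
payload = {
    "model": "gpt-4o",
    "messages": [{"role": "user", "content": "Explain what a covalent bond is."}],
}

resp = requests.post(
    f"{base_url}/chat/completions",
    headers={"Authorization": f"Bearer {api_key}"},
    json=payload,
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```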
Other entry points exposed by the package:
| Command | Purpose |
|---|---|
| probing-analyze | Run the full analysis pipeline on cached data |
| probing-test | Probe LLMs on the benchmark and write responses/ |
| probing-score | Score cached responses with the LLM judge |
| probing-summary | Build reports/summarized_scores.xlsx |
| probing-stats | Concept / response / score statistics |
| probing-plot | Radar / bar / histogram plots |
| probing-failures | Per-concept failure-mode table |
| probing-embed | Build sentence-transformer embeddings for concept lookup |
| probing-validate | Validate concept JSON files against schema and rules |
The proofs/ directory contains a Lean 4 formalisation of the conceptual
semantics framework used in this study (conceptual_semantics.lean). It
encodes the core definitions (concepts, semantic fields, selection-criteria
equivalence, conceptual equivalence) and proves compositionality properties
referenced in the paper (Appendix A).
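To give a flavour of the formalisation, the fragment below is an illustrative sketch only: the names Concept, selects, and scEquiv are hypothetical stand-ins, and the actual definitions and compositionality proofs in proofs/conceptual_semantics.lean are considerably richer.

```lean
-- Illustrative sketch only; these names are hypothetical stand-ins and do not
-- reproduce the definitions in proofs/conceptual_semantics.lean.
structure Concept (α : Type) where
  selects : α → Prop   -- a selection criterion over a semantic field α

-- Selection-criteria equivalence: two concepts select exactly the same objects.
def scEquiv {α : Type} (c d : Concept α) : Prop :=
  ∀ x, c.selects x ↔ d.selects x

-- A tiny example of the kind of lemma such a file proves.
theorem scEquiv_refl {α : Type} (c : Concept α) : scEquiv c c :=
  fun _ => Iff.rfl
```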
To verify the proofs, copy the contents of proofs/conceptual_semantics.lean
into the Lean 4 web editor and confirm that the file compiles without errors.
This project is licensed under the Apache License 2.0.