Probing Concepts is a research benchmark and analysis toolkit for measuring
conceptual understanding in large language models across seven domains:
biology, botany, chemistry, geology, medicine, musicology, and physics. The
repository includes the curated concept inventory, prompt templates, analysis
code, and the cached model outputs (the responses/ and scores/ files used in the study), so
the full set of reported tables, figures, and summary statistics can be
reproduced locally without making any live LLM API calls.
Tested with Python 3.11+.
```bash
# 1. create and activate a virtual environment
python3 -m venv .venv
source .venv/bin/activate

# 2. install the package (editable, with all dependencies)
pip install -e .

# 3. unzip the cached LLM outputs in place
unzip responses.zip
unzip scores.zip
```

This creates the responses/ and scores/ directories that the analysis pipeline reads from.
Once installed, regenerate every reported artifact from the bundled
responses/ and scores/ data with a single command:
```bash
probing-analyze              # cache-only (default): no LLM API calls
probing-analyze --no-cache   # opt in to live LLM API calls
```

In cache-only mode (--cache, the default), any attempt to call the LLM API raises immediately. The full analysis pipeline reads only the bundled JSON files and requires no API credentials.
This runs the full analysis pipeline end-to-end and writes:
- reports/summarized_scores.xlsx — model rankings and aggregated scores
- reports/failure_modes.xlsx — per-concept failure analysis
- reports/*.xlsx — concept / response / score statistics
- plots/*.pdf — radar, bar, and histogram plots
- paper/neurips2026/figures/*.pdf — publication figures
- paper_plots/*.pdf and *.csv — full-size paper plots and table data
- paper_stats.json — summary statistics used in the paper
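After a run, a quick way to confirm the outputs is to load the summary statistics directly. The snippet below is a minimal sketch that assumes only that paper_stats.json contains a single JSON object; its exact keys are defined by the pipeline, not shown here.

```python
import json
from pathlib import Path

# Load the summary statistics written by probing-analyze.
# We assume only that the file holds one JSON object; the keys
# themselves depend on the analysis pipeline.
stats = json.loads(Path("paper_stats.json").read_text())

# List the available top-level statistics and their value types.
for key, value in stats.items():
    print(f"{key}: {type(value).__name__}")
```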
Repository layout:

```
concepts/                 curated concept JSON files (7 domains)
concepts.zip              Croissant-compatible view of the benchmark data
concepts_records.jsonl    flattened concept records (Croissant input)
croissant.json            Croissant dataset metadata descriptor
prompt_templates/         testing, scoring, and system prompts
proofs/                   Lean 4 formalisation of conceptual semantics
responses.zip             zipped cached LLM responses (unzip to create responses/)
scores.zip                zipped cached LLM-as-judge scores (unzip to create scores/)
src/                      source package (probing_concepts)
scripts/                  paper figure / table generators
models.json               list of evaluated models
tests.json                test definitions
```
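For programmatic access to the benchmark data, concepts_records.jsonl can be read line by line. This is a minimal sketch that assumes only the standard JSON Lines convention (one JSON object per line); the record fields themselves are defined by the dataset and not documented here.

```python
import json
from pathlib import Path

# Read the flattened concept records (one JSON object per line).
records = [
    json.loads(line)
    for line in Path("concepts_records.jsonl").read_text().splitlines()
    if line.strip()
]

print(f"loaded {len(records)} concept records")
# Peek at the first record to see which fields the flattened schema exposes.
if records:
    print(sorted(records[0].keys()))
```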
The cached JSON files are sufficient to reproduce every result. To re-collect responses or rescore them against the LLM judge, set the API credentials and run the individual stages:
```bash
export LITELLM_PROXY_URL=...
export LITELLM_PC_API_KEY=...

probing-test      # collect model responses -> responses/
probing-score     # score responses with judge -> scores/
probing-analyze   # run the full analysis pipeline
```

LITELLM_PROXY_URL should point to a base URL that exposes an
OpenAI-compatible Chat Completions API. In practice, the code sends
authenticated POST requests to:
$LITELLM_PROXY_URL/chat/completions
with a bearer token from LITELLM_PC_API_KEY, plus standard chat-completions
fields such as model and messages. LiteLLM is the primary target and the
best-supported option here; the code also attaches LiteLLM-style metadata tags
for tracing. Another proxy or gateway should work if it accepts the same
OpenAI-style request shape, but a provider with a different API contract would
require a small adapter in src/probing_concepts/utils/call_llm.py.
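For reference, the request shape described above looks roughly like the following. This is an illustrative sketch using the requests library rather than the repository's own client in call_llm.py; the model name and message content are placeholders.

```python
import os
import requests

# Endpoint and credentials as described above.
base_url = os.environ["LITELLM_PROXY_URL"].rstrip("/")
api_key = os.environ["LITELLM_PC_API_KEY"]

# Standard OpenAI-style Chat Completions payload (model/messages are placeholders).
payload = {
    "model": "gpt-4o",
    "messages": [{"role": "user", "content": "Explain what a covalent bond is."}],
}

resp = requests.post(
    f"{base_url}/chat/completions",
    headers={"Authorization": f"Bearer {api_key}"},
    json=payload,
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```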
Other entry points exposed by the package:
| Command | Purpose |
|---|---|
| probing-analyze | Run the full analysis pipeline on cached data |
| probing-test | Probe LLMs on the benchmark and write responses/ |
| probing-score | Score cached responses with the LLM judge |
| probing-summary | Build reports/summarized_scores.xlsx |
| probing-stats | Concept / response / score statistics |
| probing-plot | Radar / bar / histogram plots |
| probing-failures | Per-concept failure-mode table |
| probing-embed | Build sentence-transformer embeddings for concept lookup |
| probing-validate | Validate concept JSON files against schema and rules |
The proofs/ directory contains a Lean 4 formalisation of the conceptual
semantics framework used in this study (conceptual_semantics.lean). It
encodes the core definitions (concepts, semantic fields, selection-criteria
equivalence, conceptual equivalence) and proves compositionality properties
referenced in the paper (Appendix A).
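To give a flavour of the formalisation, the fragment below is an illustrative sketch only: the names Concept, selects, and scEquiv are hypothetical stand-ins, and the actual definitions and compositionality proofs in proofs/conceptual_semantics.lean are considerably richer.

```lean
-- Illustrative sketch only; these names are hypothetical stand-ins and do not
-- reproduce the definitions in proofs/conceptual_semantics.lean.
structure Concept (α : Type) where
  selects : α → Prop   -- a selection criterion over a semantic field α

-- Selection-criteria equivalence: two concepts select exactly the same objects.
def scEquiv {α : Type} (c d : Concept α) : Prop :=
  ∀ x, c.selects x ↔ d.selects x

-- A tiny example of the kind of lemma such a file proves.
theorem scEquiv_refl {α : Type} (c : Concept α) : scEquiv c c :=
  fun _ => Iff.rfl
```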
To verify the proofs, copy the contents of proofs/conceptual_semantics.lean
into the Lean 4 web editor and confirm that the file compiles without errors.
This project is licensed under the Apache License 2.0.