Probing Concepts

Probing Concepts is a research benchmark and analysis toolkit for measuring conceptual understanding in large language models across seven domains: biology, botany, chemistry, geology, medicine, musicology, and physics. The repository includes the curated concept inventory, prompt templates, analysis code, and the cached model responses and judge scores (the responses/ and scores/ files) used in the study, so the full set of reported tables, figures, and summary statistics can be reproduced locally without making any live LLM API calls.

Tested with Python 3.11+.

Installation

# 1. create and activate a virtual environment
python3 -m venv .venv
source .venv/bin/activate

# 2. install the package (editable, with all dependencies)
pip install -e .

# 3. unzip the cached LLM outputs in place
unzip responses.zip
unzip scores.zip

This creates the responses/ and scores/ directories that the analysis pipeline reads from.
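
As a quick sanity check after unzipping, the cached files can be listed directly (a minimal sketch; the exact counts depend on the release):

import pathlib

# Count the cached JSON files that the analysis pipeline will read.
for name in ("responses", "scores"):
    files = list(pathlib.Path(name).rglob("*.json"))
    print(f"{name}/: {len(files)} cached JSON files")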

Reproducing all results

Once installed, regenerate every reported artifact from the bundled responses/ and scores/ data with a single command:

probing-analyze            # cache-only (default): no LLM API calls
probing-analyze --no-cache # opt in to live LLM API calls

In cache-only mode (--cache, the default), any attempt to call the LLM API raises an error immediately. The full analysis pipeline reads only the bundled JSON files and requires no API credentials.
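
The guard behaves roughly like the following sketch (illustrative only; the names are hypothetical, not the package's actual API):

# Hypothetical sketch of the cache-only guard; not the package's real code.
CACHE_ONLY = True  # corresponds to the default --cache mode

def call_llm(payload: dict) -> dict:
    if CACHE_ONLY:
        raise RuntimeError(
            "Live LLM API calls are disabled in cache-only mode; "
            "rerun with --no-cache to enable them."
        )
    ...  # a live request would go here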

This runs the full analysis pipeline end-to-end and writes:

  • reports/summarized_scores.xlsx — model rankings and aggregated scores
  • reports/failure_modes.xlsx — per-concept failure analysis
  • reports/*.xlsx — concept / response / score statistics
  • plots/*.pdf — radar, bar, and histogram plots
  • paper/neurips2026/figures/*.pdf — publication figures
  • paper_plots/*.pdf and *.csv — full-size paper plots and table data
  • paper_stats.json — summary statistics used in the paper
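
The summary statistics can also be inspected programmatically without opening the spreadsheets (a minimal sketch; the individual keys inside paper_stats.json are not documented here, so it only lists them):

import json

with open("paper_stats.json") as f:
    stats = json.load(f)
print(sorted(stats))  # names of the top-level summary statistics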

Layout

concepts/                curated concept JSON files (7 domains)
concepts.zip             Croissant-compatible view of the benchmark data
concepts_records.jsonl   flattened concept records (Croissant input)
croissant.json           Croissant dataset metadata descriptor
prompt_templates/        testing, scoring, and system prompts
proofs/                  Lean 4 formalisation of conceptual semantics
responses.zip            zipped cached LLM responses (unzip to create responses/)
scores.zip               zipped cached LLM-as-judge scores (unzip to create scores/)
src/                     source package (probing_concepts)
scripts/                 paper figure / table generators
models.json              list of evaluated models
tests.json               test definitions
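
Because the concept inventory is plain JSON/JSONL, it can be explored directly (a sketch that assumes nothing about the per-record schema beyond one JSON object per line):

import json

# Each line of concepts_records.jsonl is one flattened concept record.
with open("concepts_records.jsonl") as f:
    records = [json.loads(line) for line in f if line.strip()]
print(f"{len(records)} concept records")
if records:
    print(sorted(records[0]))  # field names of the first record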

Re-running the LLM pipeline (optional)

The cached JSON files are sufficient to reproduce every result. To re-collect responses or rescore them against the LLM judge, set the API credentials and run the individual stages:

export LITELLM_PROXY_URL=...
export LITELLM_PC_API_KEY=...

probing-test       # collect model responses    -> responses/
probing-score      # score responses with judge -> scores/
probing-analyze    # run the full analysis pipeline

LITELLM_PROXY_URL should point to a base URL that exposes an OpenAI-compatible Chat Completions API. In practice, the code sends authenticated POST requests to:

$LITELLM_PROXY_URL/chat/completions

with a bearer token from LITELLM_PC_API_KEY, plus standard chat-completions fields such as model and messages. LiteLLM is the primary target and the best-supported option here; the code also attaches LiteLLM-style metadata tags for tracing. Another proxy or gateway should work if it accepts the same OpenAI-style request shape, but a provider with a different API contract would require a small adapter in src/probing_concepts/utils/call_llm.py.
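
The request shape is the standard OpenAI one, roughly equivalent to this sketch (illustrative; the repository's own client lives in src/probing_concepts/utils/call_llm.py, and the model name here is a placeholder):

import os
import requests

# Minimal OpenAI-style Chat Completions request against the configured proxy.
resp = requests.post(
    f"{os.environ['LITELLM_PROXY_URL']}/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['LITELLM_PC_API_KEY']}"},
    json={
        "model": "gpt-4o",  # placeholder: any model the proxy routes
        "messages": [{"role": "user", "content": "Define osmosis."}],
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])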

Other entry points exposed by the package:

Command            Purpose
probing-analyze    Run the full analysis pipeline on cached data
probing-test       Probe LLMs on the benchmark and write responses/
probing-score      Score cached responses with the LLM judge
probing-summary    Build reports/summarized_scores.xlsx
probing-stats      Concept / response / score statistics
probing-plot       Radar / bar / histogram plots
probing-failures   Per-concept failure-mode table
probing-embed      Build sentence-transformer embeddings for concept lookup
probing-validate   Validate concept JSON files against schema and rules
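
For a rough idea of what the embedding-based concept lookup involves, here is an independent sentence-transformers sketch (the model name and example texts are assumptions, not the package's configuration):

from sentence_transformers import SentenceTransformer

# Hypothetical example; probing-embed's actual model and settings may differ.
model = SentenceTransformer("all-MiniLM-L6-v2")
concepts = ["osmosis", "photosynthesis", "plate tectonics"]
embeddings = model.encode(concepts, normalize_embeddings=True)
query = model.encode("movement of water across a membrane", normalize_embeddings=True)
scores = embeddings @ query  # cosine similarity via normalized dot products
print(concepts[scores.argmax()])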

Formal proofs

The proofs/ directory contains a Lean 4 formalisation of the conceptual semantics framework used in this study (conceptual_semantics.lean). It encodes the core definitions (concepts, semantic fields, selection-criteria equivalence, conceptual equivalence) and proves compositionality properties referenced in the paper (Appendix A).
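
For flavour, here is a tiny self-contained Lean 4 fragment in the same spirit (illustrative only; the actual definitions in proofs/conceptual_semantics.lean differ):

-- Illustrative only: not the repository's definitions.
structure Concept where
  features : List String

-- Two concepts are equivalent when their feature lists coincide.
def conceptEquiv (a b : Concept) : Prop :=
  a.features = b.features

theorem conceptEquiv_refl (a : Concept) : conceptEquiv a a := rfl

theorem conceptEquiv_symm {a b : Concept}
    (h : conceptEquiv a b) : conceptEquiv b a :=
  Eq.symm h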

To verify the proofs, copy the contents of proofs/conceptual_semantics.lean into the Lean 4 web editor and confirm that the file compiles with no errors.

License

This project is licensed under the Apache License 2.0.
