3 changes: 3 additions & 0 deletions .github/workflows/ci-full.yml
@@ -55,6 +55,9 @@ jobs:
- name: Run linting
  run: uv run make lint

- name: Lint + test the transferability study
  run: uv run make -C transferability check

- name: Run full test suite with coverage
  run: uv run make coverage

3 changes: 3 additions & 0 deletions .github/workflows/ci.yml
@@ -54,6 +54,9 @@ jobs:
- name: Run linting
  run: uv run make lint

- name: Lint + test the transferability study
  run: uv run make -C transferability check

- name: Run fast tests with coverage
  run: uv run make coverage-fast

1 change: 1 addition & 0 deletions README.md
@@ -51,6 +51,7 @@ ______________________________________________________________________
| `tests/` | Pytest suites mirroring the package layout. |
| `notebooks/` | Demos, EDA, training-run analysis, benchmark/eval analysis, and HuggingFace repackagers; see [`notebooks/README.md`](./notebooks/README.md) for the categorized index. |
| `scripts/` | Internal launchers and sweep drivers used during our experiments (SLURM, W&B sweeps, vLLM eval/grading); see [`scripts/README.md`](./scripts/README.md). |
| `transferability/` | Self-contained downstream study — does PyINE-v1's shortcut behavior generalize across standard NLP benchmarks? Two sweeps (lm-eval accuracy + CueFlip cue-injection) over 6 benchmarks; provider-agnostic, multi-model, reproducible. See [`transferability/README.md`](./transferability/README.md). |
| `data/` | Default location for stored datasets. Override via `PYINE_DATA_ROOT` (see [`.env.template`](./.env.template)) to keep large artifacts outside the repo. |
| `logs/` | Default output location for app runs (Hydra run folders, training/eval logs, coverage reports). Override via `PYINE_LOGS_ROOT`. |
| `Makefile` | Helper targets for linting, tests, coverage, and notebook tooling. Run `make info` for the full menu. |
140 changes: 140 additions & 0 deletions transferability/.env.example
@@ -0,0 +1,140 @@
# .env template for the transferability study.
#
# Copy this file to .env in this directory, then fill in the variables for
# your inference provider. The study reads from `transferability/.env` (NOT
# pyine's root .env) so configuring this study doesn't disturb any pyine .env
# you may already have customized for training experiments. Override the
# default path via TRANSF_DOTENV=/path/to/.env if you need to.
#
# All inference is done over the OpenAI Chat/Completions wire protocol, so
# any provider implementing that protocol works -- Runpod serverless,
# vLLM/SGLang running locally, Together AI, Anyscale, OpenAI itself,
# OpenRouter, Fireworks, etc.

# ============================================================================
# REQUIRED
# ============================================================================

# Hugging Face token (for gated dataset/model downloads).
# HF_TOKEN=

# ============================================================================
# INFERENCE PROVIDER -- pick ONE of the recipes below (A or B)
# ============================================================================
# Pick one recipe per model; the two models (shortcut + base) may use
# different providers.

# ----------------------------------------------------------------------------
# Recipe A: Runpod serverless (the default path used by deploy/deploy_endpoints.py)
# ----------------------------------------------------------------------------
# Run `python deploy/deploy_endpoints.py` to create the endpoints; the script
# prints the IDs you should paste here.
#
# RUNPOD_API_KEY=your-runpod-key
# RUNPOD_ENDPOINT_SHORTCUT=endpoint-id-1
# RUNPOD_ENDPOINT_BASE=endpoint-id-2

# ----------------------------------------------------------------------------
# Recipe B: Any other OpenAI-compatible provider
# ----------------------------------------------------------------------------
# Two env-var sets are read by two different code paths:
#
# - scripts/_common.sh reads INFERENCE_URL_<TAG> + INFERENCE_API_KEY for the
#   lm-eval `--model local-completions` backend. The URL must end in
#   `/v1/completions`.
#
# - cueflip/runner.py reads CUEFLIP_INFERENCE_URL_<TAG> + CUEFLIP_INFERENCE_API_KEY
# for the OpenAI Python client. URL must end at `/v1` (the client appends
# `/completions` itself).
#
# Set both pairs if you plan to run both sweeps with a non-Runpod provider.

# --- Local vLLM / SGLang example (both models on the same server, different ports) ---
# INFERENCE_URL_SHORTCUT=http://localhost:8001/v1/completions
# INFERENCE_URL_BASE=http://localhost:8002/v1/completions
# INFERENCE_API_KEY=EMPTY
# CUEFLIP_INFERENCE_URL_SHORTCUT=http://localhost:8001/v1
# CUEFLIP_INFERENCE_URL_BASE=http://localhost:8002/v1
# CUEFLIP_INFERENCE_API_KEY=EMPTY

# --- Together AI example (both models hosted there) ---
# INFERENCE_URL_SHORTCUT=https://api.together.xyz/v1/completions
# INFERENCE_URL_BASE=https://api.together.xyz/v1/completions
# INFERENCE_API_KEY=your-together-key
# CUEFLIP_INFERENCE_URL_SHORTCUT=https://api.together.xyz/v1
# CUEFLIP_INFERENCE_URL_BASE=https://api.together.xyz/v1
# CUEFLIP_INFERENCE_API_KEY=your-together-key

# --- OpenAI (only sensible if you're auditing a *different* organism that is
# actually hosted on OpenAI; the default Qwen3 base and PyINE-v1 organism are not) ---
# INFERENCE_URL_SHORTCUT=https://api.openai.com/v1/completions
# INFERENCE_URL_BASE=https://api.openai.com/v1/completions
# INFERENCE_API_KEY=your-openai-key

# ============================================================================
# OPTIONAL: model selection (single model, custom pair, or arbitrary set)
# ============================================================================
# The study runs against the tags listed in MODELS (comma-separated). Defaults
# to `shortcut,base` -- the PyINE-v1 organism vs its Qwen3 base.
#
# For each TAG in MODELS, set:
# <TAG>_MODEL_ID -- the model name passed to /completions
# INFERENCE_URL_<TAG> -- explicit URL (preferred for non-Runpod providers)
# OR
# RUNPOD_ENDPOINT_<TAG> -- Runpod endpoint ID (default Runpod path)
#
# CueFlip uses the parallel CUEFLIP_INFERENCE_URL_<TAG> variable (URL ending
# in /v1, not /v1/completions; the OpenAI client appends /completions itself).
#
# Defaults (the PyINE-v1 audit):
# MODELS=shortcut,base
# SHORTCUT_MODEL_ID=plstcharles-saifh/pyine-v1-qwen3-4b-shortcut
# BASE_MODEL_ID=Qwen/Qwen3-4B-Instruct-2507
#
# --- Single-model run (no comparison; just measure one model) ---
# MODELS=my_org
# MY_ORG_MODEL_ID=my-org/my-model-v1
# INFERENCE_URL_MY_ORG=http://localhost:8000/v1/completions
# CUEFLIP_INFERENCE_URL_MY_ORG=http://localhost:8000/v1
#
# --- Custom pair (e.g., audit a different organism vs a different base) ---
# MODELS=audit_org,audit_base
# AUDIT_ORG_MODEL_ID=my-org/my-shortcut-model
# AUDIT_BASE_MODEL_ID=my-org/my-base-model
# INFERENCE_URL_AUDIT_ORG=https://api.together.xyz/v1/completions
# INFERENCE_URL_AUDIT_BASE=https://api.together.xyz/v1/completions
# CUEFLIP_INFERENCE_URL_AUDIT_ORG=https://api.together.xyz/v1
# CUEFLIP_INFERENCE_URL_AUDIT_BASE=https://api.together.xyz/v1
#
# --- Three+ models in one sweep (e.g., variant comparison) ---
# MODELS=organism_a,organism_b,base
# ORGANISM_A_MODEL_ID=org-a/model
# ORGANISM_B_MODEL_ID=org-b/model
# BASE_MODEL_ID=...
# (set per-tag INFERENCE_URL_<TAG> and CUEFLIP_INFERENCE_URL_<TAG> as above)

# ============================================================================
# OPTIONAL: cueflip judge endpoint (LLM-as-judge recovery pass for ambiguous responses)
# ============================================================================
# Defaults to a local vLLM server at localhost:8000 running Qwen3-4B-Instruct.
# Used by both cueflip/judge.py (multiple-choice answer recovery) and
# cueflip/build_operation_flip_cache.py (one-time GSM8K op-flip wrong-numeric
# generation).
#
# CUEFLIP_JUDGE_URL=http://localhost:8000/v1
# CUEFLIP_JUDGE_MODEL=Qwen/Qwen3-4B-Instruct-2507

# ============================================================================
# OPTIONAL: GSM8K operation-flip wrong-numeric cache
# ============================================================================
# Path to operation_flip_cache.json produced by build_operation_flip_cache.py.
# Defaults to cueflip/operation_flip_cache.json relative to repo root. Override
# if you keep multiple cache versions or are running from a non-default working dir.
#
# CUEFLIP_OP_FLIP_CACHE=/abs/path/to/operation_flip_cache.json

# ============================================================================
# OPTIONAL: Weights & Biases logging
# ============================================================================
# WANDB_API_KEY=your-wandb-key
# WANDB_PROJECT=pyine-transferability
# WANDB_GROUP=2026-05-19-sweep
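
The per-tag naming scheme described in the template above (`<TAG>_MODEL_ID`, `INFERENCE_URL_<TAG>`, `CUEFLIP_INFERENCE_URL_<TAG>`, `RUNPOD_ENDPOINT_<TAG>`) can be sketched in Python as follows. This is a minimal illustration, not the code in `scripts/_common.sh` or `cueflip/runner.py`: the helper names `resolve_endpoints` and `check_url_suffixes` are hypothetical, and upper-casing the tag is an assumption inferred from the `MODELS=my_org` / `MY_ORG_MODEL_ID` example.

```python
import os


def resolve_endpoints(env=None):
    """Map each tag in MODELS to the variables the template describes.

    Hypothetical helper: the study's real lookup logic may differ.
    """
    env = os.environ if env is None else env
    # MODELS is comma-separated and defaults to "shortcut,base" per the template.
    raw = env.get("MODELS", "shortcut,base")
    tags = [t.strip() for t in raw.split(",") if t.strip()]
    resolved = {}
    for tag in tags:
        key = tag.upper()  # assumption: my_org -> MY_ORG, as in the examples
        resolved[tag] = {
            "model_id": env.get(f"{key}_MODEL_ID"),
            "lm_eval_url": env.get(f"INFERENCE_URL_{key}"),
            "cueflip_url": env.get(f"CUEFLIP_INFERENCE_URL_{key}"),
            "runpod_endpoint": env.get(f"RUNPOD_ENDPOINT_{key}"),
        }
    return resolved


def check_url_suffixes(entry):
    """Return a list of problems with the two URL forms for one tag."""
    problems = []
    lm_url = entry.get("lm_eval_url")
    cf_url = entry.get("cueflip_url")
    # lm-eval's local-completions backend expects the full completions route.
    if lm_url and not lm_url.rstrip("/").endswith("/v1/completions"):
        problems.append("INFERENCE_URL_<TAG> should end in /v1/completions")
    # The OpenAI client appends /completions itself, so the base stops at /v1.
    if cf_url and not cf_url.rstrip("/").endswith("/v1"):
        problems.append("CUEFLIP_INFERENCE_URL_<TAG> should end in /v1")
    return problems


if __name__ == "__main__":
    for tag, entry in resolve_endpoints().items():
        print(tag, entry, check_url_suffixes(entry))
```

Running a check like this before a sweep would catch the most likely misconfiguration, which is pasting the `/v1/completions` URL into the CueFlip variable or the reverse.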
56 changes: 56 additions & 0 deletions transferability/.gitignore
@@ -0,0 +1,56 @@
# Bulky raw outputs - archived externally; regenerable via scripts/run_all.sh
outputs/raw/

# Failure-mode narratives - regenerable via scripts/analysis_e.py
outputs/failure_modes/

# Research-note artifacts -- regenerable end-to-end via scripts/run_all.sh +
# scripts/analysis_*.py (Sweep #1) and cueflip/runner.py + cueflip/analyze.py
# (Sweep #2). Not committed because they're large, regenerable, and not
# needed by anyone who just wants to read or audit the code.
# A pre-rendered HTML of the note (with figures embedded) is hosted separately;
# see README's "Inference providers" section for the link.
outputs/derived/
cueflip/cueflip_*.png
cueflip/cueflip_*.csv
cueflip/cueflip_*.md

# W&B run archive (lm-eval writes to transferability/wandb/ when invoked from
# the study root; scripts/run_*.sh used to invoke from scripts/, hence the
# legacy scripts/wandb/ pattern).
scripts/wandb/
wandb/

# Per-cell wall-clock log written by scripts/_common.sh:log_timing.
timing.tsv

# Stderr logs
logs/
*.log

# Paper-writing artifacts -- not shipped. The PR is code + reproduction
# infrastructure only; the research note lives outside this codebase.
docs/

# Rendered HTML, anywhere -- never ship rendered output.
**/*.html

# CueFlip raw outputs - regenerable via cueflip/runner.py
# (Derived artifacts cueflip/cueflip_*.{csv,md,png} are already covered
# by the patterns above; not repeated here.)
cueflip/results/
cueflip/results_dry_run/
cueflip/operation_flip_cache.json
cueflip/operation_flip_cache_dry_run.json
cueflip/runner_base.log

# Python caches
**/__pycache__/
.pytest_cache/

# env vars
.env

# PR meta -- used to seed the PR body at submission time (e.g. `gh pr create
# --body-file PR_DESCRIPTION.md`); not part of the shipped study code.
PR_DESCRIPTION.md