Reproduction Guide

Reproduce the bigram proxy results from the OEA Framework paper in under 10 minutes on CPU. The real LLM experiments take about 20-30 minutes per model on a GPU, or 15-25 minutes per model on CPU with the reduced configuration.

Prerequisites

  • Python 3.11+
  • pip install numpy matplotlib scipy pytest
  • For real LLM experiments: pip install torch transformers rouge-score
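
Before installing, it can help to confirm the interpreter version and see which of the packages listed above are already importable. The following is a minimal sketch and is not part of the repository: the helper name missing_modules is hypothetical, and rouge_score is assumed to be the import name of the rouge-score package.

```python
import importlib.util
import sys

# Import names for the packages listed in the prerequisites.
# "rouge_score" is assumed to be the import name of the rouge-score package.
CORE = ["numpy", "matplotlib", "scipy", "pytest"]
LLM = ["torch", "transformers", "rouge_score"]


def missing_modules(names):
    """Return the subset of top-level module names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    if sys.version_info < (3, 11):
        print(f"Python 3.11+ required, found {sys.version.split()[0]}")
    print("Missing core packages:", missing_modules(CORE) or "none")
    print("Missing LLM packages:", missing_modules(LLM) or "none")
```

Only the core list is needed for the bigram experiments; the LLM list matters from Step 4 onward.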

Step 1 — Clone and install

git clone https://github.com/BitConcepts/oea-framework-paper
cd oea-framework-paper
pip install -r requirements-lock.txt

Step 2 — Run bigram proxy experiments (~2 min, no GPU)

# Pilot recursive stability + epistemic friction
python experiments/run_experiments.py

# Full credibility suite (12 variants, 648 runs each)
python experiments/credibility_suite.py

# Recursive memory drift benchmark (REQ-OEA-017)
python experiments/recursive_memory_drift.py

# Baseline competition (REQ-OEA-016)
python experiments/baseline_competition.py

Step 3 — Generate figures (~10 sec, no GPU)

python experiments/generate_figures.py
# Outputs: arxiv/figures/{fig_pipeline.pdf, fig_calibration.pdf, fig_metric_dissociation.pdf}

Step 4 — Install neural LLM dependencies

Install torch for your hardware. See requirements-lock.txt for the full list with test-status notes.

# NVIDIA CUDA 12.1 [verified]:
pip install torch==2.3.1+cu121 transformers==4.41.0 rouge-score==0.1.2 --index-url https://download.pytorch.org/whl/cu121

# NVIDIA CUDA 12.4+ [verified]:
pip install torch transformers==4.41.0 rouge-score==0.1.2 --index-url https://download.pytorch.org/whl/cu124

# AMD ROCm 6.x [community-tested]:
pip install torch transformers==4.41.0 rouge-score==0.1.2 --index-url https://download.pytorch.org/whl/rocm6.3

# Intel Arc / Xe XPU [community-tested]:
pip install torch transformers==4.41.0 rouge-score==0.1.2 --index-url https://download.pytorch.org/whl/xpu

# Apple Silicon MPS [community-tested]:
pip install torch transformers==4.41.0 rouge-score==0.1.2

# CPU only (all platforms):
pip install torch transformers==4.41.0 rouge-score==0.1.2 --index-url https://download.pytorch.org/whl/cpu

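numpy: numpy 2.x is compatible with current torch versions, so no pinning is required to run the experiments. Exact hash matching in Step 6 is sensitive to the numpy version; see the note there.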

Step 5 — Run real LLM experiments

# GPU — auto-detected (full config, ~20-30 min per model):
python experiments/real_lm_experiment.py --model distilgpt2
python experiments/real_lm_experiment.py --model gpt2
python experiments/real_lm_experiment.py --model EleutherAI/gpt-neo-125M
python experiments/real_lm_experiment.py --model Qwen/Qwen2.5-1.5B

# CPU (reduced config, ~15-25 min per model):
python experiments/real_lm_experiment.py --model distilgpt2 --n-seeds 3 --n-iterations 5 --gen-tokens 40
python experiments/real_lm_experiment.py --model gpt2 --n-seeds 3 --n-iterations 5 --gen-tokens 40
python experiments/real_lm_experiment.py --model EleutherAI/gpt-neo-125M --n-seeds 3 --n-iterations 5 --gen-tokens 40
python experiments/real_lm_experiment.py --model Qwen/Qwen2.5-1.5B --n-seeds 3 --n-iterations 5 --gen-tokens 40

# Force a specific backend (if auto-detection picks the wrong one):
python experiments/real_lm_experiment.py --model distilgpt2 --device rocm
python experiments/real_lm_experiment.py --model distilgpt2 --device xpu
python experiments/real_lm_experiment.py --model distilgpt2 --device mps

CPU results (reduced config) are valid for mechanism verification but have wider confidence intervals than the full GPU config. Annotate them in the manuscript with (CPU, n_seeds=3, n_iter=5) to distinguish them from full runs.
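
For readers wondering what backend auto-detection might look like, here is a hedged sketch. It is an assumption about the general approach, not the code in real_lm_experiment.py: the function name pick_device and the priority order are hypothetical. ROCm builds of PyTorch expose the GPU through the torch.cuda API, so "rocm" here is only a label derived from torch.version.hip.

```python
def pick_device():
    """Return a backend label: 'cuda', 'rocm', 'xpu', 'mps', or 'cpu'."""
    try:
        import torch
    except ImportError:
        return "cpu"  # torch not installed; nothing to detect

    if torch.cuda.is_available():
        # ROCm builds report through torch.cuda; torch.version.hip is set there.
        return "rocm" if getattr(torch.version, "hip", None) else "cuda"
    xpu = getattr(torch, "xpu", None)  # only present in newer torch builds
    if xpu is not None and xpu.is_available():
        return "xpu"
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"


print(pick_device())
```

If the printed label is not the backend you expect, pass --device explicitly as shown in the commands above.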

Step 6 — Verify artifact integrity

python experiments/verify_manifest.py
# Compares SHA-256 hashes against experiments/manifest.json

Note on hash mismatches: The committed hashes in manifest.json were recorded with numpy 2.4.5. Re-running the experiments with a different numpy version (e.g. 2.4.6) may change how floats are formatted in the JSON output, which changes the SHA-256 hash without changing the numerical results. If verify_manifest.py reports failures for the bigram summary JSON files but the raw-run CSV files pass, the mismatch is a formatting difference and the results are directionally reproducible. Real LLM results (CSV + summary) should match exactly if you use the same model weights and seed policy.
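
The difference between a hash mismatch and a numerical mismatch can be illustrated with a short sketch. It assumes, hypothetically, that two runs serialize the same value at different float precision. The helper names, the mean_drift key, and the tolerance are illustrative and are not taken from verify_manifest.py.

```python
import hashlib
import json
import math


def sha256_text(text):
    """SHA-256 hex digest of a string, encoded as UTF-8."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def numerically_equal(a, b, rel_tol=1e-9):
    """Recursively compare two parsed JSON values, tolerating float noise."""
    if isinstance(a, dict) and isinstance(b, dict):
        return a.keys() == b.keys() and all(
            numerically_equal(a[k], b[k], rel_tol) for k in a
        )
    if isinstance(a, list) and isinstance(b, list):
        return len(a) == len(b) and all(
            numerically_equal(x, y, rel_tol) for x, y in zip(a, b)
        )
    if isinstance(a, (int, float)) and isinstance(b, (int, float)):
        return math.isclose(a, b, rel_tol=rel_tol)
    return a == b


# Hypothetical summaries: same value, different float formatting.
run_a = '{"mean_drift": 0.1}'
run_b = '{"mean_drift": 0.10000000000000001}'

print(sha256_text(run_a) == sha256_text(run_b))                 # False: bytes differ
print(numerically_equal(json.loads(run_a), json.loads(run_b)))  # True: values agree
```

A comparison like this is one way to confirm that a failing summary file differs only in formatting before concluding that a run did not reproduce.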

Step 7 — Run tests

pytest tests/
# Expected: 12 tests passing

Docker (fully reproducible environment)

# CPU (all platforms):
docker build -t oea-framework .
docker run --rm -v $(pwd)/results:/app/results oea-framework

# NVIDIA GPU [verified]:
docker build -f Dockerfile.cuda -t oea-framework-cuda .
docker run --rm --gpus all -v $(pwd)/results:/app/results oea-framework-cuda \
  python experiments/real_lm_experiment.py --model distilgpt2

# AMD ROCm [community-tested, Linux only]:
docker build -f Dockerfile.rocm -t oea-framework-rocm .
docker run --rm --device /dev/kfd --device /dev/dri \
  --group-add render --group-add video \
  -v $(pwd)/results:/app/results oea-framework-rocm \
  python experiments/real_lm_experiment.py --model distilgpt2 --device rocm

# Intel XPU [community-tested, Linux only]:
docker build -f Dockerfile.xpu -t oea-framework-xpu .
docker run --rm --device /dev/dri \
  -v $(pwd)/results:/app/results oea-framework-xpu \
  python experiments/real_lm_experiment.py --model distilgpt2 --device xpu

# Apple MPS: Docker containers on macOS run inside a Linux VM with no access to Metal, so use a native install.

Expected outputs

| Experiment | Runtime | Output |
| --- | --- | --- |
| Bigram pilot (run_experiments.py) | ~5s | results/summary_metrics.json |
| Credibility suite | ~90s | results/credibility/ |
| Memory drift | ~5s | results/memory_drift/ |
| Baseline competition | ~5s | results/baseline_competition/ |
| Figure generation | ~5s | arxiv/figures/ |
| Real LLM (distilgpt2, GPU) | ~20 min | results/real_lm/distilgpt2/ |
| Real LLM (gpt2, GPU) | ~25 min | results/real_lm/gpt2/ |
| Real LLM (gpt-neo-125M, GPU) | ~25 min | results/real_lm/EleutherAI/gpt-neo-125M/ |
| Real LLM (Qwen2.5-1.5B, GPU) | ~27 min | results/real_lm/Qwen/Qwen2.5-1.5B/ |
| Real LLM (any model, CPU reduced) | ~15-25 min | same paths, n_seeds=3 n_iter=5 |

Seed policy

All experiments use fixed random seeds. Per-experiment seed parameters:

  • Bigram experiments: random.Random(seed_idx) and numpy.random.default_rng(seed).
  • Real LLM: torch.manual_seed(gen_seed), where gen_seed = seed_idx * 1000 + iteration * 10.
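
As a worked example of the real LLM seed formula, the sketch below enumerates gen_seed for a small grid. It assumes zero-based seed_idx and iteration, which the text does not state, and the grid size simply mirrors the reduced CPU configuration (--n-seeds 3 --n-iterations 5).

```python
def gen_seed(seed_idx, iteration):
    """Seed formula quoted in the seed policy."""
    return seed_idx * 1000 + iteration * 10


# Assumption: indices start at 0; the grid mirrors --n-seeds 3 --n-iterations 5.
seeds = {(s, i): gen_seed(s, i) for s in range(3) for i in range(5)}

print(seeds[(0, 0)], seeds[(1, 2)], seeds[(2, 4)])  # 0 1020 2040
# Every (seed_idx, iteration) pair maps to a distinct seed in this grid.
print(len(set(seeds.values())) == len(seeds))       # True
```

With this formula two pairs can collide only once iteration reaches 100, because 100 iterations times 10 equals one seed_idx step of 1000.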

Hardware notes

All bigram experiments run on CPU (no GPU required). The full GPU real LLM experiments were conducted on an NVIDIA RTX 4070 SUPER with CUDA 12.1 on Windows 11. CPU validation (reduced config: --n-seeds 3 --n-iterations 5 --gen-tokens 40) is supported and produces valid directional results. Use CPU results only for mechanism verification; report full GPU results in the manuscript for statistical power.

Hardware test matrix

| Hardware | Status | Notes |
| --- | --- | --- |
| CPU (x86-64, AMD or Intel) | ✅ Verified | All platforms |
| NVIDIA CUDA 12.1 | ✅ Verified | RTX 4070 SUPER, Windows 11 |
| NVIDIA CUDA 12.4+ | ✅ Verified | Newer drivers / GPUs |
| AMD ROCm 6.x | ⚠️ Community-tested | Use --device rocm |
| Intel Arc / Xe XPU | ⚠️ Community-tested | Use --device xpu |
| Apple Silicon MPS | ⚠️ Community-tested | Auto-detected on macOS 13+ |

CI: GPU paths are not CI-tested. GitHub-hosted runners have no GPU hardware. Only CPU-based unit tests run automatically on every push.

Untested hardware — help wanted

If you run the real LLM experiments on AMD ROCm, Intel XPU, or Apple MPS, please report your result (success or failure) using the Hardware Compatibility issue template. Include your GPU model, driver/ROCm/CUDA version, OS, and PyTorch version.

Compute budget

| Experiment | GPU-hours | CPU-hours (reduced) |
| --- | --- | --- |
| distilgpt2 (82M) | ~0.3 | ~0.4 |
| gpt2 (124M) | ~0.4 | ~0.5 |
| gpt-neo-125M (non-GPT2) | ~0.4 | ~0.5 |
| Qwen2.5-1.5B (modern 2024) | ~0.45 | ~0.6 |
| All bigram experiments | 0.0 (CPU) | 0.0 (CPU) |
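
Summing the rows gives the total budget for reproducing all four real LLM runs. The arithmetic below uses only the approximate figures from the table, so the totals are approximate as well.

```python
# Approximate per-model budgets from the table: (GPU-hours, CPU-hours reduced).
budget = {
    "distilgpt2": (0.3, 0.4),
    "gpt2": (0.4, 0.5),
    "gpt-neo-125M": (0.4, 0.5),
    "Qwen2.5-1.5B": (0.45, 0.6),
}

gpu_total = round(sum(gpu for gpu, _ in budget.values()), 2)
cpu_total = round(sum(cpu for _, cpu in budget.values()), 2)

print(f"GPU total: ~{gpu_total} h (~{gpu_total * 60:.0f} min)")  # ~1.55 h (~93 min)
print(f"CPU total: ~{cpu_total} h (~{cpu_total * 60:.0f} min)")  # ~2.0 h (~120 min)
```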