Reproduce all bigram results from the OEA Framework paper in under 10 minutes. The real LLM experiments take roughly 20-30 min per model on a GPU, or 15-25 min per model on CPU with the reduced config.

- Python 3.11+
- `pip install numpy matplotlib scipy pytest`
- For real LLM experiments: `pip install torch transformers rouge-score`

git clone https://github.com/BitConcepts/oea-framework-paper
cd oea-framework-paper
pip install -r requirements-lock.txt

# Pilot recursive stability + epistemic friction
python experiments/run_experiments.py
# Full credibility suite (12 variants, 648 runs each)
python experiments/credibility_suite.py
# Recursive memory drift benchmark (REQ-OEA-017)
python experiments/recursive_memory_drift.py
# Baseline competition (REQ-OEA-016)
python experiments/baseline_competition.py

python experiments/generate_figures.py
# Outputs: arxiv/figures/{fig_pipeline.pdf, fig_calibration.pdf, fig_metric_dissociation.pdf}

Install torch for your hardware. See requirements-lock.txt for the full list with test-status notes.

# NVIDIA CUDA 12.1 [verified]:
pip install torch==2.3.1+cu121 transformers==4.41.0 rouge-score==0.1.2 --index-url https://download.pytorch.org/whl/cu121
# NVIDIA CUDA 12.4+ [verified]:
pip install torch transformers==4.41.0 rouge-score==0.1.2 --index-url https://download.pytorch.org/whl/cu124
# AMD ROCm 6.x [community-tested]:
pip install torch transformers==4.41.0 rouge-score==0.1.2 --index-url https://download.pytorch.org/whl/rocm6.3
# Intel Arc / Xe XPU [community-tested]:
pip install torch transformers==4.41.0 rouge-score==0.1.2 --index-url https://download.pytorch.org/whl/xpu
# Apple Silicon MPS [community-tested]:
pip install torch transformers==4.41.0 rouge-score==0.1.2
# CPU only (all platforms):
pip install torch transformers==4.41.0 rouge-score==0.1.2 --index-url https://download.pytorch.org/whl/cpu

numpy: numpy 2.x is compatible with current torch versions. No pinning required.

# GPU — auto-detected (full config, ~20-30 min per model):
python experiments/real_lm_experiment.py --model distilgpt2
python experiments/real_lm_experiment.py --model gpt2
python experiments/real_lm_experiment.py --model EleutherAI/gpt-neo-125M
python experiments/real_lm_experiment.py --model Qwen/Qwen2.5-1.5B
# CPU (reduced config, ~15-25 min per model):
python experiments/real_lm_experiment.py --model distilgpt2 --n-seeds 3 --n-iterations 5 --gen-tokens 40
python experiments/real_lm_experiment.py --model gpt2 --n-seeds 3 --n-iterations 5 --gen-tokens 40
python experiments/real_lm_experiment.py --model EleutherAI/gpt-neo-125M --n-seeds 3 --n-iterations 5 --gen-tokens 40
python experiments/real_lm_experiment.py --model Qwen/Qwen2.5-1.5B --n-seeds 3 --n-iterations 5 --gen-tokens 40
# Force a specific backend (if auto-detection picks the wrong one):
python experiments/real_lm_experiment.py --model distilgpt2 --device rocm
python experiments/real_lm_experiment.py --model distilgpt2 --device xpu
python experiments/real_lm_experiment.py --model distilgpt2 --device mps

CPU results (reduced config) are valid for mechanism verification but have wider confidence intervals than the full GPU config. Annotate them in the manuscript with `(CPU, n_seeds=3, n_iter=5)` to distinguish them from full runs.

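How auto-detection and the `--device` override might fit together can be sketched as follows. This is a guess at the shape of the logic, not the script's actual code: the names `pick_device` and `probe`, the priority order (CUDA/ROCm, then XPU, then MPS, then CPU), and the mapping of `rocm` to the `cuda` device string are all assumptions. The mapping reflects that ROCm builds of PyTorch expose AMD GPUs through the `torch.cuda` API.

```python
from typing import Optional


def pick_device(cuda: bool, xpu: bool, mps: bool, forced: Optional[str] = None) -> str:
    """Return a torch device string from availability flags and an optional override."""
    if forced is not None:
        # ROCm builds of PyTorch address AMD GPUs with the "cuda" device string.
        return "cuda" if forced == "rocm" else forced
    if cuda:
        return "cuda"
    if xpu:
        return "xpu"
    if mps:
        return "mps"
    return "cpu"


def probe() -> dict:
    """Query PyTorch for available backends; all False if torch is not installed."""
    try:
        import torch
    except ImportError:
        return {"cuda": False, "xpu": False, "mps": False}
    return {
        "cuda": torch.cuda.is_available(),
        # torch.xpu only exists in recent PyTorch releases, so guard the lookup.
        "xpu": hasattr(torch, "xpu") and torch.xpu.is_available(),
        "mps": torch.backends.mps.is_available(),
    }


print(pick_device(**probe()))
```

Keeping the decision in a function that takes plain booleans makes the fallback order easy to unit-test on a CPU-only CI runner.
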
python experiments/verify_manifest.py
# Compares SHA-256 hashes against experiments/manifest.json

Note on hash mismatches: The committed hashes in `manifest.json` were recorded with numpy 2.4.5. Re-running experiments on a different numpy version (e.g. 2.4.6) may produce cosmetically different JSON formatting (float precision) that changes the SHA-256 hash without changing the numerical results. If `verify_manifest` reports failures for bigram summary JSON files but the CSV raw runs pass, the results are directionally reproducible. Real LLM results (CSV + summary) should match exactly if you use the same model weights and seed policy.

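The difference between a byte-level hash mismatch and a numerical mismatch can be illustrated with a small sketch. It hashes two JSON serializations of the same numbers that differ only in formatting, then compares the parsed values with a tolerance. The helper names (`sha256_of`, `numerically_equal`), the tolerance, and the toy payloads are hypothetical; the real check lives in `experiments/verify_manifest.py`, whose internals are not shown here.

```python
import hashlib
import json
import math


def sha256_of(data: bytes) -> str:
    """Hex SHA-256 digest of raw bytes, as a file-level fingerprint."""
    return hashlib.sha256(data).hexdigest()


def numerically_equal(a, b, rel_tol=1e-9):
    """Recursively compare parsed JSON values, using a tolerance for numbers."""
    if isinstance(a, dict) and isinstance(b, dict):
        return a.keys() == b.keys() and all(
            numerically_equal(a[k], b[k], rel_tol) for k in a
        )
    if isinstance(a, list) and isinstance(b, list):
        return len(a) == len(b) and all(
            numerically_equal(x, y, rel_tol) for x, y in zip(a, b)
        )
    if isinstance(a, (int, float)) and isinstance(b, (int, float)):
        return math.isclose(a, b, rel_tol=rel_tol)
    return a == b


# Toy payloads (made up): same values, different float formatting.
old = b'{"metric": 0.1, "runs": [1.0, 2.5]}'
new = b'{"metric": 0.10000000000000001, "runs": [1.0, 2.50]}'

print("hashes match: ", sha256_of(old) == sha256_of(new))
print("values match: ", numerically_equal(json.loads(old), json.loads(new)))
```

A comparison like this is only a diagnostic for cosmetic mismatches; it does not replace the manifest check.
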
pytest tests/
# Expected: 12 tests passing

# CPU (all platforms):
docker build -t oea-framework .
docker run --rm -v $(pwd)/results:/app/results oea-framework
# NVIDIA GPU [verified]:
docker build -f Dockerfile.cuda -t oea-framework-cuda .
docker run --rm --gpus all -v $(pwd)/results:/app/results oea-framework-cuda \
python experiments/real_lm_experiment.py --model distilgpt2
# AMD ROCm [community-tested, Linux only]:
docker build -f Dockerfile.rocm -t oea-framework-rocm .
docker run --rm --device /dev/kfd --device /dev/dri \
--group-add render --group-add video \
-v $(pwd)/results:/app/results oea-framework-rocm \
python experiments/real_lm_experiment.py --model distilgpt2 --device rocm
# Intel XPU [community-tested, Linux only]:
docker build -f Dockerfile.xpu -t oea-framework-xpu .
docker run --rm --device /dev/dri \
-v $(pwd)/results:/app/results oea-framework-xpu \
python experiments/real_lm_experiment.py --model distilgpt2 --device xpu
# Apple MPS: Docker is not compatible with Apple Metal; use a native install.

| Experiment | Runtime | Output |
|---|---|---|
| Bigram pilot (run_experiments.py) | ~5s | results/summary_metrics.json |
| Credibility suite | ~90s | results/credibility/ |
| Memory drift | ~5s | results/memory_drift/ |
| Baseline competition | ~5s | results/baseline_competition/ |
| Figure generation | ~5s | arxiv/figures/ |
| Real LLM (distilgpt2, GPU) | ~20 min | results/real_lm/distilgpt2/ |
| Real LLM (gpt2, GPU) | ~25 min | results/real_lm/gpt2/ |
| Real LLM (gpt-neo-125M, GPU) | ~25 min | results/real_lm/EleutherAI/gpt-neo-125M/ |
| Real LLM (Qwen2.5-1.5B, GPU) | ~27 min | results/real_lm/Qwen/Qwen2.5-1.5B/ |
| Real LLM (any model, CPU reduced) | ~15-25 min | same paths, n_seeds=3 n_iter=5 |
All experiments use fixed random seeds. Per-experiment seed parameters:
- Bigram experiments: `random.Random(seed_idx)` and `numpy.random.default_rng(seed)`.
- Real LLM: `torch.manual_seed(gen_seed)`, where `gen_seed = seed_idx * 1000 + iteration * 10`.

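The real LLM seed formula can be sketched in plain Python. Only the formula comes from the text; the function name `gen_seed` and the loop bounds (3 seeds, 5 iterations, mirroring the reduced CPU config) are illustrative assumptions. One consequence of the formula: seeds stay distinct across `(seed_idx, iteration)` pairs only while `iteration` is below 100, because each seed index owns a block of 1000 values stepped by 10.

```python
def gen_seed(seed_idx: int, iteration: int) -> int:
    """Per-generation seed: gen_seed = seed_idx * 1000 + iteration * 10."""
    return seed_idx * 1000 + iteration * 10


# Hypothetical reduced config: 3 seeds x 5 iterations.
seeds = [gen_seed(s, i) for s in range(3) for i in range(5)]
print(seeds)

# Every (seed_idx, iteration) pair maps to its own seed in this range.
assert len(set(seeds)) == len(seeds)

# Collision once iteration reaches 100: (0, 100) and (1, 0) share a seed.
print(gen_seed(0, 100) == gen_seed(1, 0))
```

In an actual run the returned value would be passed to `torch.manual_seed` before each generation step.
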
All bigram experiments run on CPU (no GPU required).
Full GPU real LLM experiments were conducted on an NVIDIA RTX 4070 SUPER (CUDA 12.1, Windows 11).
CPU validation (reduced config: `--n-seeds 3 --n-iterations 5 --gen-tokens 40`) is supported and produces valid directional results. Use CPU results only for mechanism verification; report full GPU results in the manuscript for statistical power.

| Hardware | Status | Notes |
|---|---|---|
| CPU (x86-64, AMD or Intel) | ✅ Verified | All platforms |
| NVIDIA CUDA 12.1 | ✅ Verified | RTX 4070 SUPER, Windows 11 |
| NVIDIA CUDA 12.4+ | ✅ Verified | Newer drivers / GPUs |
| AMD ROCm 6.x | Community-tested | Use --device rocm |
| Intel Arc / Xe XPU | Community-tested | Use --device xpu |
| Apple Silicon MPS | Community-tested | Auto-detected on macOS 13+ |

CI: GPU paths are not CI-tested. GitHub-hosted runners have no GPU hardware. Only CPU-based unit tests run automatically on every push.
If you run the real LLM experiments on AMD ROCm, Intel XPU, or Apple MPS, please report your result (success or failure) using the Hardware Compatibility issue template. Include your GPU model, driver/ROCm/CUDA version, OS, and PyTorch version.
| Experiment | GPU-hours | CPU-hours (reduced) |
|---|---|---|
| distilgpt2 (82M) | ~0.3 | ~0.4 |
| gpt2 (124M) | ~0.4 | ~0.5 |
| gpt-neo-125M (non-GPT2) | ~0.4 | ~0.5 |
| Qwen2.5-1.5B (modern 2024) | ~0.45 | ~0.6 |
| All bigram experiments | 0.0 (CPU) | 0.0 (CPU) |