transformerless-lm v0.1.0 — 100× FibGen compression, 5.6× lazy-data speedup

@RandomCoder-lab RandomCoder-lab released this 21 May 05:21
· 229 commits to master since this release
ad35f98

transformerless-lm v0.1.0

First release of the substrate-compressed language model framework
under experiments/transformerless_lm/. This document is the in-tree
release artifact corresponding to the local annotated tag
transformerless-lm-v0.1.0 at commit ad35f98.

Headline results (validated)

100× weight compression via FibGen

Each weight tensor W ∈ R^{out × in} is replaced by a small
Fibonacci-indexed seed and reconstructed on demand via a closed-form
sin/cos expansion at Fibonacci frequencies.
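To make the idea concrete, here is a minimal NumPy sketch of a separable sin/cos reconstruction at Fibonacci frequencies. It is an illustration under assumptions, not the contents of models_fibgen.py: the function names, the rank `r`, the unit-interval position grid, and the exact form of the expansion are all hypothetical, and the real FibGenLinear may parameterize the seed differently.

```python
import numpy as np

def fibonacci(k: int) -> np.ndarray:
    """First k distinct Fibonacci numbers: 1, 2, 3, 5, 8, ..."""
    fibs = [1, 2]
    while len(fibs) < k:
        fibs.append(fibs[-1] + fibs[-2])
    return np.array(fibs[:k], dtype=np.float64)

def fib_basis(n: int, k: int) -> np.ndarray:
    """(n, 2k) basis: sin and cos of k Fibonacci frequencies on n grid points."""
    t = np.arange(n, dtype=np.float64) / n            # positions in [0, 1)
    phase = 2.0 * np.pi * np.outer(t, fibonacci(k))   # shape (n, k)
    return np.concatenate([np.sin(phase), np.cos(phase)], axis=1)

def reconstruct(seed_out: np.ndarray, seed_in: np.ndarray,
                n_out: int, n_in: int) -> np.ndarray:
    """Assumed separable form: W = (B_out @ seed_out) @ (B_in @ seed_in).T"""
    k = seed_out.shape[0] // 2
    left = fib_basis(n_out, k) @ seed_out    # (n_out, r)
    right = fib_basis(n_in, k) @ seed_in     # (n_in, r)
    return left @ right.T                    # (n_out, n_in)

# Hypothetical sizes: K frequencies, rank r, square d x d weight.
K, r, d = 16, 4, 128
rng = np.random.default_rng(0)
seed_out = rng.standard_normal((2 * K, r))
seed_in = rng.standard_normal((2 * K, r))

W = reconstruct(seed_out, seed_in, d, d)
stored = seed_out.size + seed_in.size
print(W.shape, stored, W.size / stored)
```

In this sketch only the two seed matrices are stored; the dense `W` exists transiently. The ratio printed here depends entirely on the hypothetical `K` and `r` and is not meant to reproduce the figures in the table.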

| arch | params | compression | val (best) | vs dense | vs uniform |
|---|---|---|---|---|---|
| dense_crt | 801,664 | | 2.5602 | | -38.7% |
| fibgen_K16_separable | 8,064 | 100.4× | 2.9020 | +13.3% | -30.5% |
| fibgen_K32_separable | 9,216 | 87.9× | 2.7282 | +6.6% | -34.6% |

Reproduced across two independent training runs (the original v2 bench
at results_fibgen.json and a recheck run written to the same path). The
compression is real: 8K stored parameters reconstruct an 810K
dense-equivalent weight tensor, and the model genuinely learns the corpus
structure (val is well below ln(65) = 4.17 nats, the loss of a uniform guess).
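The percentage columns can be rechecked from the reported losses alone. The short script below is only an arithmetic check; the helper name `pct` is made up for illustration. Because it starts from losses rounded to four decimals, its output can differ from the table in the last printed digit.

```python
import math

uniform = math.log(65)   # loss of a uniform guess over 65 symbols, in nats
dense = 2.5602           # dense_crt val (best), from the table
runs = {
    "fibgen_K16_separable": 2.9020,
    "fibgen_K32_separable": 2.7282,
}

def pct(value: float, reference: float) -> float:
    """Relative change of value with respect to reference, in percent."""
    return 100.0 * (value / reference - 1.0)

print(f"uniform baseline: {uniform:.2f} nats")
print(f"dense_crt: {pct(dense, uniform):+.1f}% vs uniform")
for name, val in runs.items():
    print(f"{name}: {pct(val, dense):+.1f}% vs dense, "
          f"{pct(val, uniform):+.1f}% vs uniform")
```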

Inference: 90-93% throughput at 10-37× less RAM

| arch | d | weight_MB | tok/s | speed vs dense |
|---|---|---|---|---|
| dense_crt | 128 | 3.06 | 473 | |
| fibgen_K32 cached | 128 | 0.31 | 441 | 93% |
| dense_crt | 256 | 12.12 | 264 | |
| fibgen_K32 cached | 256 | 0.33 | 237 | 90% |

The weight cache pattern (precompute W once at deployment, reuse
across all forward passes) eliminates the FibGen forward-overhead at
inference. Per-token compute matches dense; only the persistent
weight storage is compressed. At d=256 the memory ratio is 37×;
at LLM scale (d=4096) extrapolation gives ~200× memory reduction.
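The weight cache pattern can be sketched as a layer that generates its dense weight on first use and reuses it afterwards. This is a minimal illustration, not the code in bench_inference.py: the class name `CachedGeneratedLinear`, its methods, and the stand-in generator are hypothetical.

```python
import numpy as np

class CachedGeneratedLinear:
    """Linear layer whose dense weight is generated once, then reused."""

    def __init__(self, generate_weight):
        # generate_weight: callable returning an (out, in) array from the seed
        self._generate_weight = generate_weight
        self._weight = None
        self.generations = 0   # counts how often the weight was rebuilt

    def weight(self) -> np.ndarray:
        if self._weight is None:
            self._weight = self._generate_weight()
            self.generations += 1
        return self._weight

    def __call__(self, x: np.ndarray) -> np.ndarray:
        # Same matmul a dense layer performs once the cache is warm.
        return x @ self.weight().T

    def drop_cache(self) -> None:
        """Free the dense weight; it is regenerated on the next call."""
        self._weight = None

# Stand-in generator: any deterministic function of a small seed works here.
seed = np.linspace(0.0, 1.0, 8)
layer = CachedGeneratedLinear(lambda: np.outer(np.sin(seed), np.cos(seed)))

x = np.ones((3, 8))
for _ in range(3):
    y = layer(x)
print(y.shape, layer.generations)
```

In this sketch the dense weight lives in memory while the cache is warm, so what shrinks is the stored seed, which is in line with the statement that only the persistent weight storage is compressed.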

Lazy-loaded training: 5.6× wall-clock speedup

Fibonacci-strided data sampling loads only log_φπ(T) tokens per
sequence position (11 of 128 at T=128). The model never reads gap
tokens from disk.
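One way to arrive at 11 positions at T=128 is to take offset 0 plus every distinct Fibonacci number below T. The sketch below assumes that definition; the function names are hypothetical and lazy_data.py may build its stride set differently.

```python
import numpy as np

def fib_offsets(T: int) -> np.ndarray:
    """Offset 0 plus the distinct Fibonacci numbers 1, 2, 3, 5, ... below T."""
    offsets = [0]
    a, b = 1, 2
    while a < T:
        offsets.append(a)
        a, b = b, a + b
    return np.array(offsets, dtype=np.int64)

def sample_strided(tokens: np.ndarray, start: int, T: int) -> np.ndarray:
    """Gather only the Fibonacci-offset tokens of the window [start, start + T)."""
    return tokens[start + fib_offsets(T)]

tokens = np.arange(1000)            # stand-in for a token array on disk
window = sample_strided(tokens, start=200, T=128)
print(len(fib_offsets(128)), window)
```

If `tokens` were opened with `np.memmap`, the same indexing would touch only the pages holding the selected positions rather than the full window, which is the kind of saving the paragraph above describes.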

| config | val | wall (1500 steps) | speedup |
|---|---|---|---|
| dense baseline (dense data) | 2.4396 | 165.7s | 1.00× |
| dense + lazy-strided data | 2.5274 | 29.5s | 5.62× |

The substrate's log_φπ cadence is the data-loading complexity
bound; this is the cleanest single-axis substrate-native win in the
release.

35B-in-8GB feasibility math

Combining the validated wins:

| config | 35B-equivalent storage | fits in 8 GB? |
|---|---|---|
| dense fp16 | 70 GB | no |
| 4-bit quantization (SOTA) | 17.5 GB | no |
| FibGen K=32 cross | 7 GB | yes |
| FibGen K=32 separable | 800 MB | yes, easily |

These numbers are extrapolations from the d=128 / d=256 measurements.
At true LLM scale the compression ratio grows as (d/K)² because
dense storage scales as d² while the seed is O(K²) regardless of d.
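The dense and 4-bit rows follow directly from the parameter count, and the FibGen rows imply a compression ratio relative to dense fp16. The snippet below is only that arithmetic; the FibGen storage figures are copied from the table, not derived, and the helper name `storage_gb` is made up for illustration.

```python
def storage_gb(n_params: float, bits_per_param: float) -> float:
    """Storage in decimal gigabytes for n_params at a given bit width."""
    return n_params * bits_per_param / 8 / 1e9

N = 35e9                              # 35B parameters
dense_fp16 = storage_gb(N, 16)        # fp16: 2 bytes per parameter
int4 = storage_gb(N, 4)               # 4-bit: half a byte per parameter

# FibGen figures as stated in the table (extrapolated, not measured).
fibgen_gb = {"K=32 cross": 7.0, "K=32 separable": 0.8}

print(f"dense fp16: {dense_fp16:.1f} GB, 4-bit: {int4:.1f} GB")
for name, gb in fibgen_gb.items():
    print(f"FibGen {name}: {gb} GB, implied {dense_fp16 / gb:.1f}x vs dense fp16")
```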

Architectural primitives (all in experiments/transformerless_lm/)

| primitive | file | validation |
|---|---|---|
| CRT-Fibonacci PE | models.py | -5.4% vs sinusoidal PE |
| Geodesic attention bias | models.py | -0.4% vs crt_only, 3/3 seeds |
| Fibonacci-offset sparse attention | models_substrate.py | 14× FLOP reduction, -3.2% loss |
| Zeckendorf-routed FFN | models_substrate.py | 5× FFN FLOPs reduction |
| FibGen weight generator | models_fibgen.py | 100× storage compression |
| Subsim L1-distance attention | models_subsim.py | substrate operator, +5.7% loss at d=128 |
| Fibonacci tier quantization | models_substrate.py:fibonacci_tier_snap | saturates at +0.6 nats post-hoc |
| Fibonacci State Model | models_fsm.py | NaN at init, scale-bound |
| Lazy-strided data loader | lazy_data.py | 5.6× training speedup |
| Stochastic Fibonacci depth | models_subsim.py | 1.17× wall-clock speedup |
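For readers unfamiliar with the term behind the Zeckendorf-routed FFN: every positive integer has a unique representation as a sum of non-consecutive Fibonacci numbers, and a greedy pass finds it. The sketch below shows only that decomposition. How models_substrate.py maps it to FFN routing is not described in this document, so nothing here should be read as the routing rule.

```python
def zeckendorf(n: int) -> list[int]:
    """Greedy Zeckendorf decomposition of n into non-consecutive Fibonacci numbers."""
    if n < 0:
        raise ValueError("n must be non-negative")
    fibs = [1, 2]
    while fibs[-1] <= n:
        fibs.append(fibs[-1] + fibs[-2])
    parts = []
    for f in reversed(fibs):
        if f <= n:           # take the largest Fibonacci number that still fits
            parts.append(f)
            n -= f
    return parts

print(zeckendorf(100))  # [89, 8, 3]
```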

Falsified or scale-bound

| claim | falsification |
|---|---|
| Pure Fibonacci-tier post-hoc quantization at 4-bit | Saturates at +0.6 nats regardless of bit depth |
| Substrate operators (Subsim/FSM) faster than dense at d=128 | At CPU bench scale (d≤256, T≤512) PyTorch overhead dominates the asymptotic FLOP savings |
| FSM recurrence numerically stable at random init | Eigenvalue > 1 produces immediate NaN; needs gating |
| K-scaling alone closes the gap to dense at d=256 | K=48, K=64 both LOST at d=256 (+30% gap) |
| Plain FibGen at d=256 maintains its compression-vs-quality | Compression ratio grows nicely (36×) but loss penalty also grows (+30%) |
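The FSM failure mode is generic to linear recurrences: when the transition matrix has an eigenvalue of magnitude above 1, the state grows geometrically until it overflows. The sketch below demonstrates this on a random matrix. It is not the model in models_fsm.py, and it uses spectral-radius rescaling as a simple stand-in to show the stable case; that is a different technique from the gating named in the table.

```python
import numpy as np

def run_recurrence(A: np.ndarray, steps: int = 200) -> np.ndarray:
    """Iterate h <- A @ h from an all-ones state and return the final state."""
    h = np.ones(A.shape[0], dtype=A.dtype)
    with np.errstate(over="ignore", invalid="ignore"):
        for _ in range(steps):
            h = A @ h
    return h

rng = np.random.default_rng(0)
A = rng.standard_normal((64, 64)).astype(np.float32)   # random init
rho = np.max(np.abs(np.linalg.eigvals(A)))             # spectral radius

# Stand-in fix for illustration: rescale so the spectral radius is 0.9.
A_stable = (A * (0.9 / rho)).astype(np.float32)

print("spectral radius at init:", float(rho))
print("unscaled state finite:", bool(np.isfinite(run_recurrence(A)).all()))
print("rescaled state finite:", bool(np.isfinite(run_recurrence(A_stable)).all()))
```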

Reproducing the headline numbers

cd experiments/transformerless_lm

# 100× compression result (this release's main claim)
python3 train_fibgen.py --steps 2500 --K-sweep 16,32 --modes separable
# expect: fibgen_K16_separable val ~2.90 (100x compression)
#         fibgen_K32_separable val ~2.73 (88x compression)

# Lazy-loading data speedup
python3 train_lazy_loading.py --steps 1500
# expect: dense ~165s, fib_strided ~29s, val deltas <5%

# Inference-time throughput
python3 bench_inference.py --n-tokens 256
# expect: fibgen_K32 cached at 90%+ of dense throughput at d=128

Honest limits

  • Output text quality at d=128 is gibberish for ALL archs including
    dense. Coherent text needs GPT-2-tiny-class capacity (d≥384,
    n_blocks≥6).
  • Substrate operator wall-clock wins (Subsim, FSM, Composed) are
    scale-bound — they don't materialize on CPU at our test scale.
    Asymptotic complexity advantages are real but unreachable in pure
    PyTorch without parallel-scan kernels or larger T/d.
  • 35B feasibility is an extrapolation from d=128/256 measurements,
    not a direct measurement at LLM scale.
  • Training-time substrate ops (lazy tier dropout, K-subsampling)
    delivered at most a small per-step compute reduction in pure PyTorch
    due to indexing overhead. Real wins would require kernel work.

File index

experiments/transformerless_lm/
  README.md                       # original transformerless-LM thesis
  GEODESIC_RESULT.md              # validated -0.4% geodesic attention
  GEODESIC_ATTENTION_DERIVATION.md
  TRANSFORMERLESS_RESULT.md       # token-CRT + Principle A/B results
  WEIGHT_SUBSTRATE_REFORMULATION.md  # Principle A/B derivation
  INFERENCE_FIRST_DERIVATION.md   # 35B-in-8GB framing
  RELEASE_v0.1.0.md              # THIS FILE

  corpus.py                       # data loader (TinyShakespeare)
  lazy_data.py                    # Fibonacci-strided data loader

  models.py                       # baseline crt_only + arch variants
  models_substrate.py             # FibonacciOffsetAttention, ZeckendorfRoutedFFN
  models_fibgen.py                # FibGenLinear (THE compression primitive)
  models_subsim.py                # L1-distance attention operator
  models_fsm.py                   # Fibonacci State Model (broken; needs stability fix)

  train_distractor_mix.py         # distractor-mix training scaffold
  train_geodesic_attention.py     # geodesic bench
  train_fibgen.py                 # FibGen K/mode sweep (main reproducer)
  train_lazy_loading.py           # lazy-data validation bench
  bench_inference.py              # autoregressive generation throughput

  results_*.json                  # raw bench outputs (kept for audit)
  results_samples.txt             # text generation samples at d=128