Tiny, transparent experiments in Active Inference using pymdp and gymnasium.
Start from the simplest possible environments and agents, then add complexity step-by-step—so you can reason about each design choice and its cognitive implications.
Goal: a parsimonious, scientifically minded playground to study cognition, building up difficulty “evolutionarily” (fully observable → partially observable; single factor → multi-factor; single modality → multi-modal; deterministic → noisy; etc.).
Variational Free Energy starts with the following expression:

$$F = D_{KL}\big[q(s)\,\|\,p(s)\big] \;-\; \mathbb{E}_{q(s)}\big[\ln p(o \mid s)\big]$$

Let us extract the meaning of the Variational Free Energy.

The following expression is the Complexity term, which is the penalty for moving far from prior beliefs:

$$D_{KL}\big[q(s)\,\|\,p(s)\big]$$

And the following one is the Accuracy term, which tells us how well the model predicts the current observations:

$$\mathbb{E}_{q(s)}\big[\ln p(o \mid s)\big]$$
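As a minimal numerical sketch of these two terms (plain NumPy for categorical beliefs, not pymdp's internal routines):

```python
import numpy as np

def vfe(q_s, prior_s, A, obs_idx):
    """Variational free energy for categorical beliefs.
    q_s: posterior over states; prior_s: prior over states;
    A: likelihood matrix (observations x states); obs_idx: observed outcome.
    Returns (F, complexity, accuracy)."""
    eps = 1e-16  # guard against log(0)
    complexity = np.sum(q_s * (np.log(q_s + eps) - np.log(prior_s + eps)))
    accuracy = np.sum(q_s * np.log(A[obs_idx, :] + eps))
    return complexity - accuracy, complexity, accuracy

A = np.array([[0.9, 0.1],    # noisy 2-state, 2-outcome likelihood
              [0.1, 0.9]])
prior = np.array([0.5, 0.5])
post = np.array([0.9, 0.1])  # exact Bayesian posterior after observing o = 0
F, comp, acc = vfe(post, prior, A, obs_idx=0)
# For the exact posterior, F = -ln p(o); here p(o=0) = 0.5, so F = ln 2
```

Complexity and Accuracy trade off: a sharper posterior raises Complexity but also raises Accuracy, and the exact Bayesian posterior minimizes F.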
Expected Free Energy starts with the following expression:

$$G(\pi) = -\,\mathbb{E}_{q(o \mid \pi)}\big[\ln p(o \mid C)\big] \;-\; \mathbb{E}_{q(o \mid \pi)}\Big[D_{KL}\big[q(s \mid o, \pi)\,\|\,q(s \mid \pi)\big]\Big]$$

Let us extract the meaning from the Expected Free Energy expression.

The following term is the Extrinsic (utility) term, which measures how much the predicted outcomes $o$ deviate from the preferred outcomes encoded in $p(o \mid C)$ (which comes from the $C$ matrix):

$$-\,\mathbb{E}_{q(o \mid \pi)}\big[\ln p(o \mid C)\big]$$

If $p(o \mid C)$ is high for some outcomes, policies leading to those outcomes have lower risk. It is goal-directed: the instrumental part of planning.

The following term is the Epistemic term (state information gain), which measures how much the agent expects to learn about hidden states $s$ from future observations under policy $\pi$:

$$\mathbb{E}_{q(o \mid \pi)}\Big[D_{KL}\big[q(s \mid o, \pi)\,\|\,q(s \mid \pi)\big]\Big]$$

It is high when predicted observations would strongly reduce uncertainty about the hidden causes of sensory input. It is curiosity-driven: the exploratory part of planning.
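The two terms can likewise be sketched numerically (a simplified one-step version in plain NumPy; pymdp computes these over policies and time horizons):

```python
import numpy as np

def efe_terms(qs_pi, A, log_C):
    """One-step expected-free-energy terms.
    qs_pi: predicted state distribution under a policy;
    A: likelihood (observations x states); log_C: log-preferences over outcomes.
    Returns (extrinsic, epistemic); G = extrinsic - epistemic."""
    eps = 1e-16
    qo = A @ qs_pi                   # predicted outcome distribution q(o | pi)
    extrinsic = -np.sum(qo * log_C)  # expected -ln p(o | C)
    epistemic = 0.0                  # expected KL[q(s | o, pi) || q(s | pi)]
    for o in range(A.shape[0]):
        post = A[o, :] * qs_pi
        z = post.sum()
        if z < eps:
            continue
        post = post / z
        epistemic += z * np.sum(post * (np.log(post + eps) - np.log(qs_pi + eps)))
    return extrinsic, epistemic

# Uniform belief over two states, perfectly informative (identity) likelihood:
ext, epi = efe_terms(np.array([0.5, 0.5]), np.eye(2),
                     np.log(np.array([0.5, 0.5])))
# epi = ln 2: one observation fully resolves one bit of state uncertainty
```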
- Minimal N×M GridWorld (deterministic walls, reward cell, punish cell)
- Text / graphical renderer (single persistent window, dynamic updates)
- Active Inference agent factory: (A, B, C, D) tailored to GridWorld
- Batch experiments & plots (random vs AIF; sequential & parallel)
- Live demo: watch random episodes then AIF episodes in one window
- Roadmap: POMDP variants, multi-factor control, multi-modal outcomes, learning
Active_Inference_for_Fun/Environments/
├─ gridworld_env.py # Gymnasium env: N×M grid, reward & punish tiles
├─ ai_agent_factory.py # build_gridworld_agent(): constructs A,B,C,D & Agent
├─ run_gridworld_stats.py # random baseline: many episodes, plots results
├─ run_gridworld_aif_vs_random.py # AIF vs Random, sequential or parallel (processes)
├─ run_gridworld_live_demo.py # live dynamic render: random then AIF episodes
└─ README.md
Python ≥ 3.9 recommended.
# (optional) create a fresh environment
python -m venv .venv
source .venv/bin/activate # Windows: .venv\Scripts\activate
# core deps
pip install --upgrade pip
pip install numpy matplotlib gymnasium
# pymdp (pick one)
pip install pymdp # if available in your index
# or from source:
# pip install git+https://github.com/infer-actively/pymdp.git

If the pymdp API differs across versions, the code includes small compatibility shims (e.g., scalar vs list observations, sample_action() return types).
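For illustration, shims of this kind might look as follows (the helper names are my own; the repo's actual shims may differ):

```python
import numpy as np

def action_to_int(action):
    """Normalize Agent.sample_action() output to a plain int: depending on
    the pymdp version it may be a numpy array (one entry per control
    factor), a list, or a scalar."""
    return int(np.atleast_1d(np.asarray(action)).ravel()[0])

def obs_to_list(obs):
    """pymdp expects one observation index per modality; wrap a scalar
    observation into a single-element list."""
    if isinstance(obs, (list, tuple, np.ndarray)):
        return [int(o) for o in obs]
    return [int(obs)]
```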
python run_gridworld_human.py
Screencast.from.2025-10-19.15-46-34.mp4
python run_gridworld_stats.py --episodes 2000
Episodes: 2000
Success rate: 0.429
Punish rate: 0.455
Timeout rate: 0.117
Avg return: -0.025
Avg steps: 91.95

python run_gridworld_aif_vs_random.py --episodes 1000 --workers 40
=== Summary (Random) ===
success_rate: 0.431
punish_rate: 0.455
timeout_rate: 0.114
avg_return: -0.024
avg_steps: 91.964
counts: {'reward': 431, 'timeout': 114, 'punish': 455}
=== Summary (AIF) ===
success_rate: 1.0
punish_rate: 0.0
timeout_rate: 0.0
avg_return: 1.0
avg_steps: 33.08
counts: {'reward': 1000}

python run_gridworld_aif_vs_random.py --episodes 2000 --workers 60 --cols 10 --rows 10 --reward-pos "9, 9" --punish-pos "0, 9"
=== Summary (Random) ===
success_rate: 0.1875
punish_rate: 0.2635
timeout_rate: 0.549
avg_return: -0.076
avg_steps: 160.0365
counts: {'timeout': 1098, 'reward': 375, 'punish': 527}
=== Summary (AIF) ===
success_rate: 0.69
punish_rate: 0.0
timeout_rate: 0.31
avg_return: 0.69
avg_steps: 126.71
counts: {'timeout': 620, 'reward': 1380}

python run_gridworld_aif_vs_random.py --episodes 2000 --workers 60 --cols 10 --rows 10 --reward-pos "9, 9" --punish-pos "0, 9" --policy-len 5
=== Summary (Random) ===
success_rate: 0.1875
punish_rate: 0.2635
timeout_rate: 0.549
avg_return: -0.076
avg_steps: 160.0365
counts: {'timeout': 1098, 'reward': 375, 'punish': 527}
=== Summary (AIF) ===
success_rate: 0.82
punish_rate: 0.0
timeout_rate: 0.18
avg_return: 0.82
avg_steps: 120.96
counts: {'reward': 1640, 'timeout': 360}
python run_gridworld_obs_noise.py --workers 10 --episodes 1000
=== Summary (Random) ===
success_rate: 0.408
punish_rate: 0.453
timeout_rate: 0.139
avg_return: -0.045
avg_steps: 94.594
counts: {'reward': 408, 'timeout': 139, 'punish': 453}
=== Summary (AIF, noisy A) ===
success_rate: 1.0
punish_rate: 0.0
timeout_rate: 0.0
avg_return: 1.0
avg_steps: 33.41
counts: {'reward': 1000}
belief_error_ratio (mean over episodes): 0.917

python run_gridworld_obs_noise.py --workers 20 --episodes 1000 --cols 10 --rows 10 --reward-pos "9, 9" --punish-pos "0, 9"
=== Summary (Random) ===
success_rate: 0.189
punish_rate: 0.259
timeout_rate: 0.552
avg_return: -0.07
avg_steps: 161.987
counts: {'timeout': 552, 'reward': 189, 'punish': 259}
=== Summary (AIF, noisy A) ===
success_rate: 0.52
punish_rate: 0.06
timeout_rate: 0.42
avg_return: 0.46
avg_steps: 153.3
counts: {'reward': 520, 'timeout': 420, 'punish': 60}
belief_error_ratio (mean over episodes): 0.985

python run_gridworld_live_demo.py --episodes-random 4 --episodes-aif 3 --fps 12 --seed 58457
[RANDOM] Episode 1: return=1.00, steps=184
[RANDOM] Episode 2: return=1.00, steps=31
[RANDOM] Episode 3: return=1.00, steps=24
[RANDOM] Episode 4: return=0.00, steps=200
[AIF] Episode 1: return=1.00, steps=11
[AIF] Episode 2: return=1.00, steps=10
[AIF] Episode 3: return=1.00, steps=15

Screencast.from.2025-10-19.18-07-54.mp4
This section analyzes how the four terms defined in the previous section vary as we run an episode of this very simple grid world. The main idea is to demonstrate conceptually what each of these terms captures with regard to active inference.
As shown in the examples below, when we run an episode, the Extrinsic (utility) term is the only value greater than 0 throughout the whole episode, while the others are all 0.
python run_gridworld_live_metrics.py --fps 10
[Episode 1] return=1.00, steps=14
[Episode 2] return=1.00, steps=33
[Episode 3] return=1.00, steps=27
All episodes complete: [(1.0, 14), (1.0, 33), (1.0, 27)]

Why is this happening?
Short answer: that pattern is exactly what we should see in our current setup (fully observable, deterministic dynamics, sharp prior), so nothing’s “wrong.”
Let's unfold this in more detail.
- Fully observable $A$ (≈ identity) & deterministic $B$. After each step we set the prior for the next step as $\text{prior}_s = B_u q_s$. The new observation is perfectly informative: the posterior $q(s)$ collapses to the true state, which equals the predicted state.
  - $\Rightarrow$ Complexity $D_{KL}[q(s) \parallel \text{prior}_s] = 0$ (posterior matches prior).
  - With $A \approx$ identity, $p(o_t \mid s^*) = 1$ at the true state, so $-\mathbb{E}_q[\ln p(o_t \mid s)] = 0$.
  - Epistemic (1-step info gain) vanishes: for each possible $o$, $q(s \mid o) \approx q(s)$ (no uncertainty to reduce), so the expected KL $= 0$.
- Extrinsic > 0. We compute $\mathbb{E}_{q(o)}[-\ln p(o \mid C)]$. Unless our preferences $p(o \mid C)$ put probability $\approx 1$ on the actually observed outcome at every step, this expectation is positive. That's why our utility bar moves, while the others don't.
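A quick numerical check of this argument (plain NumPy, independent of the repo's code): with an identity $A$ and a posterior equal to the predicted prior, Complexity, (negative) Accuracy, and Epistemic all evaluate to zero, while Extrinsic stays positive whenever preferences don't concentrate on the observed outcome.

```python
import numpy as np
eps = 1e-16

A = np.eye(4)                      # fully observable: identity likelihood
q_s = np.zeros(4); q_s[2] = 1.0    # posterior collapsed onto the true state
prior_s = q_s.copy()               # deterministic B predicted the same state

complexity = np.sum(q_s * (np.log(q_s + eps) - np.log(prior_s + eps)))
neg_accuracy = -np.sum(q_s * np.log(A[2, :] + eps))

qo = A @ q_s                       # predicted outcomes; expected info gain:
epistemic = sum(
    qo[o] * np.sum((A[o] * q_s / qo[o]) *
                   (np.log(A[o] * q_s / qo[o] + eps) - np.log(q_s + eps)))
    for o in range(4) if qo[o] > eps)

log_C = np.log(np.full(4, 0.25))   # flat preferences over four outcomes
extrinsic = -np.sum(qo * log_C)    # = ln 4 > 0
```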
If we want non-zero Complexity, Accuracy, and Epistemic terms, we have to introduce mismatch or uncertainty. For example, add observation noise with --a-noise 0.2 (so $A$ is no longer a perfect identity):
python run_gridworld_live_metrics.py --fps 10 --a-noise 0.2
[Episode 1] return=1.00, steps=24
[Episode 2] return=1.00, steps=16
[Episode 3] return=1.00, steps=10
All episodes complete: [(1.0, 24), (1.0, 16), (1.0, 10)]

Pure B-noise (no env slips), to see Complexity move:
python run_gridworld_live_metrics.py --b-noise 0.2 --fps 10
[Episode 1] return=0.00, steps=200
[Episode 2] return=0.00, steps=200
[Episode 3] return=0.00, steps=200
All episodes complete: [(0.0, 200), (0.0, 200), (0.0, 200)]

Combine A-noise with B-noise to also see Accuracy/Epistemic; let's keep preferences gentle so exploration isn't crushed:
python run_gridworld_live_metrics.py --a-noise 0.5 --b-noise 0.5 --c-reward 0.1 --c-punish -0.1 --fps 12 --policy-len 6 --sophisticated --max-steps 10
[Episode 1] return=0.00, steps=10
[Episode 2] return=0.00, steps=10
[Episode 3] return=0.00, steps=10
All episodes complete: [(0.0, 10), (0.0, 10), (0.0, 10)]

To produce a Complexity spike, insert a random starting position (--start-pos random), plus non-zero Accuracy and Epistemic during re-localization with noisy actions (--slip-prob 0.1):
python run_gridworld_live_metrics.py --start-pos random --fps 1 --slip-prob 0.1
[Episode 1] return=1.00, steps=8
[Episode 2] return=1.00, steps=42
[Episode 3] return=1.00, steps=42
All episodes complete: [(1.0, 8), (1.0, 42), (1.0, 42)]

You can also run some stats in this regard:
python run_gridworld_metrics_stats.py --episodes 100 --workers 20 --a-noise 0.1 --b-noise 0.1 --c-reward 0.1 --c-punish -0.1 --max-steps 200 --policy-len 6 --sophisticated --cols 5 --rows 5 --reward-pos "4,4" --punish-pos "0,4"
=== Summary ===
success_rate: 0.0
punish_rate: 0.0
timeout_rate: 1.0
avg_return: 0.0
avg_steps: 200.0
counts: {'timeout': 100}
Per-episode means (overall): Complexity=0.098, Accuracy=0.103, Extrinsic=3.219, Epistemic=0.002

This is a great place to showcase Active Inference with controlled, informative sensing. The cleanest specification in pymdp (and the one I recommend) is:
Hidden state factors
- F₁: a joint factor for (position × orientation) with size S = N·M·4.
Reason: in pymdp, transition matrices factorize over hidden-state factors. Since the position transition under forward depends on orientation, that cross-factor dependency would be awkward/imprecise unless we combine position & orientation into a single factor.
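For concreteness, a flattening like the following could implement the joint index (the ordering here is my own choice, not necessarily the repo's):

```python
ROWS, COLS, NUM_ORI = 5, 5, 4  # grid size and orientations (e.g. N, E, S, W)

def to_index(row, col, ori):
    """Flatten (position, orientation) into one joint-factor index."""
    return (row * COLS + col) * NUM_ORI + ori

def from_index(s):
    """Invert the flattening back to (row, col, ori)."""
    ori = s % NUM_ORI
    cell = s // NUM_ORI
    return cell // COLS, cell % COLS, ori
```

Under this convention, the B slice for forward maps each (row, col, ori) index to the neighboring cell in direction ori with the same orientation, while turn_left/turn_right only change the ori component.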
- M₁ (distance): integer distance (0..max_range) to the first non-empty square in the current look direction; that terminal can only be one of {EDGE, RED, GREEN}.
- M₂ (terminal class): the class of that terminal square in the look direction ∈ {EDGE, RED, GREEN}.
- M₃ (current class): the class of the square the agent currently occupies ∈ {EMPTY, EDGE, RED, GREEN}.
- 3 controls over the single factor: forward, turn_left, turn_right.
- Preferences: the agent likes GREEN (reward) and dislikes RED (punish) via M₃ (current cell class); M₁/M₂ are neutral by default (but you can make them informative to encourage curiosity).
- Optional A-noise (observation model noise) and B-noise (the agent's transition model mismatch).
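To make the sensing concrete, here is a hedged sketch of how the distance and terminal-class observations could be computed by casting a ray from the agent (the cell encoding and function names are my own, not necessarily the repo's):

```python
EMPTY, RED, GREEN = 0, 1, 2                 # grid cell classes
TERM_EDGE, TERM_RED, TERM_GREEN = 0, 1, 2   # M2 terminal classes
STEPS = {"N": (-1, 0), "E": (0, 1), "S": (1, 0), "W": (0, -1)}

def look(grid, row, col, ori):
    """Cast a ray from (row, col) in direction `ori`; return
    (distance, terminal_class) for the first non-empty square,
    or the grid edge if only empty squares lie ahead."""
    dr, dc = STEPS[ori]
    rows, cols = len(grid), len(grid[0])
    dist, r, c = 0, row + dr, col + dc
    while 0 <= r < rows and 0 <= c < cols:
        dist += 1
        if grid[r][c] == RED:
            return dist, TERM_RED
        if grid[r][c] == GREEN:
            return dist, TERM_GREEN
        r, c = r + dr, c + dc
    return dist, TERM_EDGE
```

M₃ (current class) is simply the class of the occupied cell, so only M₁/M₂ need the ray cast.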
- Plain, deterministic sensing/dynamics, sophisticated inference:
python run_nav3_live_demo.py --sophisticated --start-ori E
[Episode 1] return=1.00, steps=65

- With observation and model mismatch inside the agent (watch it reorient/localize):
python run_nav3_live_demo.py --a-noise 0.001 --b-noise 0.001 --sophisticated --policy-len 4 --start-ori N
[Episode 1] return=1.00, steps=69

- Start at a different place/orientation:
python run_nav3_live_demo.py --start-pos 2,1 --start-ori S
[Episode 1] return=1.00, steps=79

python run_nav3_live_demo.py --start-ori E --fps 12 --episodes 5 --rows 6 --cols 6 --reward-pos "5,5" --punish-pos "0,5" --policy-len 3
[Episode 1] return=1.00, steps=92
[Episode 2] return=1.00, steps=73
[Episode 3] return=0.00, steps=200
[Episode 4] return=1.00, steps=65
[Episode 5] return=1.00, steps=37

Screencast.from.2025-12-17.22-58-52.mp4
python run_nav3_live_metrics.py --sophisticated --start-ori N --policy-len 3 --episodes 5 --cols 10 --rows 10 --punish-pos 0,9 --reward-pos 9,9 --start-pos 0,0 --a-noise 0.8 --b-noise 0.8
[Episode 1] return=1.00, steps=88
[Episode 2] return=1.00, steps=121
[Episode 3] return=1.00, steps=82
[Episode 4] return=1.00, steps=94
[Episode 5] return=1.00, steps=75
All episodes complete: [(1.0, 88), (1.0, 121), (1.0, 82), (1.0, 94), (1.0, 75)]

Rewarding viewing green while penalizing viewing red; slightly penalizing viewing the edge, and staying on an edge or empty cell:
C1 = np.zeros((O1,), dtype=np.float64) # neutral distances
C2 = np.zeros((O2,), dtype=np.float64) # neutral terminal class
+ C2[M2_GREEN] = pref_green # prefer seeing green ahead
+ C2[M2_RED] = pref_red # prefer not seeing red ahead
+ C2[M2_EDGE] = -0.1 # prefer not seeing edge ahead
C3 = np.zeros((O3,), dtype=np.float64)
C3[CLASS_GREEN] = pref_green
C3[CLASS_RED] = pref_red
- # EDGE / EMPTY remain neutral
+ C3[CLASS_EDGE] = -0.1
+ C3[CLASS_EMPTY] = -0.1

Screencast.from.2025-12-29.10-32-02.mp4
python run_nav3_metrics_stats.py --episodes 50 --workers 10 --max-steps 50 --cols 10 --rows 10 --reward-pos 9,9 --punish-pos 0,9 --a-noise 0.1 --b-noise 0.1
=== Summary ===
success_rate: 1.0
punish_rate: 0.0
timeout_rate: 0.0
avg_return: 1.0
avg_steps: 20.8
counts: {'reward': 50}
Per-episode means (overall): Complexity=0.098, Accuracy=0.243, Extrinsic=7.539, Epistemic=0.001