Skip to content

dariodematties/Active_Inference_for_Fun

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

46 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Active Inference for Fun

Tiny, transparent experiments in Active Inference using pymdp and gymnasium.

Start from the simplest possible environments and agents, then add complexity step-by-step—so you can reason about each design choice and its cognitive implications.

      Goal: a parsimonious, scientifically minded playground to study cognition, building up difficulty “evolutionarily” (fully observable → partially observable; single factor → multi-factor; single modality → multi-modal; deterministic → noisy; etc.).

Active Inference Formulation Notes

Free Energy

Variational Free Energy

Variational Free Energy starts with the following expression:

$$ \begin{aligned} F_{\pi} & = \mathbb{E}_{q(s|\pi)} \left[\ln{\frac{q(s|\pi)}{p(o,s|\pi)}}\right] \\ & = \mathbb{E}_{q(s|\pi)} \left[\ln{q(s|\pi)} - \ln{p(o,s|\pi)}\right] \\ & = \mathbb{E}_{q(s|\pi)} \left[\ln{q(s|\pi)} - \ln{p(s|\pi)}\right] - \mathbb{E}_{q(s|\pi)} \left[\ln{p(o|s,\pi)}\right] \\ & = D_{KL}\left[q(s|\pi) || p(s|\pi)\right] - \mathbb{E}_{q(s|\pi)} \left[\ln{p(o|s,\pi)}\right] \end{aligned} $$

Let us extract the meaning of the Variational Free Energy.

The following expression is the Complexity term, which is the penalty for moving far from prior beliefs:

$$ D_{KL}\left[q(s|\pi) || p(s|\pi)\right] $$

And the following one is the Accuracy term, which tells us how well the model predicts current observations:

$$ \mathbb{E}_{q(s|\pi)} \left[-\ln{p(o|s,\pi)}\right] $$


Expected Free Energy

Expected Free Energy starts with the following expression:

$$ \begin{aligned} G_{\pi} &= \mathbb{E}_{q(o, s|\pi)} \left[\ln{\frac{q(s|\pi)}{p(o,s|\pi)}}\right] \\ &= \mathbb{E}_{q(o, s|\pi)} \left[\ln{q(s|\pi)} - \ln{p(o,s|\pi)}\right] \\ &= \mathbb{E}_{q(o, s|\pi)} \left[\ln{q(s|\pi)} - \ln{p(s|o, \pi)}\right] - \mathbb{E}_{q(o|\pi)} \left[\ln{p(o|\pi)}\right] \\ &\approx \mathbb{E}_{q(o, s|\pi)} \left[\ln{q(s|\pi)} - \ln{q(s|o, \pi)}\right] - \mathbb{E}_{q(o|\pi)} \left[\ln{p(o|C)}\right] \\ &= -\mathbb{E}_{q(o, s|\pi)} \left[\ln{q(s|o, \pi)} - \ln{q(s|\pi)}\right] - \mathbb{E}_{q(o|\pi)} \left[\ln{p(o|C)}\right] \\ &= -\mathbb{E}_{q(s|o,\pi)q(o|\pi)} \left[\ln{q(s|o, \pi)} - \ln{q(s|\pi)}\right] - \mathbb{E}_{q(o|\pi)} \left[\ln{p(o|C)}\right] \\ &= -\mathbb{E}_{q(o|\pi)} \left[ \mathbb{E}_{q(s|o,\pi)} \left[\ln{q(s|o, \pi)} - \ln{q(s|\pi)}\right] \right] - \mathbb{E}_{q(o|\pi)} \left[\ln{p(o|C)}\right] \\ &= -\mathbb{E}_{q(o|\pi)} \left[D_{KL}\left[q(s|o,\pi) || q(s|\pi)\right] \right] - \mathbb{E}_{q(o|\pi)} \left[\ln{p(o|C)}\right] \end{aligned} $$

Let us extract the meaning from the Expected Free Energy expression.

The following term is the Extrinsic (utility term), which measures how much the predicted outcomes (O) deviate from preferred outcomes encoded in (p(o)) (which comes from the (C) matrix). If (p(o)) is high for some outcomes, policies leading to those outcomes have lower risk. It is goal-directed — the instrumental part of planning.

$$ \mathbb{E}_{q(o|\pi)} \left[-\ln{p(o|C)}\right] $$

The following term is the Epistemic (state information gain), which measures how much the agent expects to learn about hidden states (s) from future observations under policy (\pi). It is high when predicted observations would strongly reduce uncertainty about the hidden causes of sensory input. It is curiosity-driven — the exploratory part of planning.

$$ \mathbb{E}_{q(o|\pi)} \left[D_{KL}\left[q(s|o,\pi) || q(s|\pi)\right] \right] $$

Contents

  • Minimal N×M GridWorld (deterministic walls, reward cell, punish cell)
  • Text / graphical renderer (single persistent window, dynamic updates)
  • Active Inference agent factory (A, B, C, D) tailored to GridWorld
  • Batch experiments & plots (random vs AIF; sequential & parallel)
  • Live demo: watch random episodes then AIF episodes in one window
  • Roadmap: POMDP variants, multi-factor control, multi-modal outcomes, learning

Repository Structure

Active_Inference_for_Fun/Environments/
├─ gridworld_env.py                 # Gymnasium env: N×M grid, reward & punish tiles
├─ ai_agent_factory.py              # build_gridworld_agent(): constructs A,B,C,D & Agent
├─ run_gridworld_stats.py           # random baseline: many episodes, plots results
├─ run_gridworld_aif_vs_random.py   # AIF vs Random, sequential or parallel (processes)
├─ run_gridworld_live_demo.py       # live dynamic render: random then AIF episodes
└─ README.md

Installation

Python ≥ 3.9 recommended.

# (optional) create a fresh environment
python -m venv .venv
source .venv/bin/activate     # Windows: .venv\Scripts\activate

# core deps
pip install --upgrade pip
pip install numpy matplotlib gymnasium

# pymdp (pick one)
pip install pymdp             # if available in your index
# or from source:
# pip install git+https://github.com/infer-actively/pymdp.git

      If pymdp API differs across versions, the code includes small compatibility shims (e.g., scalar vs list observations, sample_action() return types).

Quick Start

  1. To run a minimal environment smoke test run the following command

python run_gridworld_human.py

Screencast.from.2025-10-19.15-46-34.mp4

  1. Batch statistics (random baseline)

python run_gridworld_stats.py --episodes 2000

Episodes:        2000
Success rate:    0.429
Punish rate:     0.455
Timeout rate:    0.117
Avg return:      -0.025
Avg steps:       91.95

run_gridworld_stats

  1. AIF vs Random (with optional parallelism)

python run_gridworld_aif_vs_random.py --episodes 1000 --workers 40

=== Summary (Random) ===
 success_rate: 0.431
  punish_rate: 0.455
 timeout_rate: 0.114
   avg_return: -0.024
    avg_steps: 91.964
counts: {'reward': 431, 'timeout': 114, 'punish': 455}

=== Summary (AIF) ===
 success_rate: 1.0
  punish_rate: 0.0
 timeout_rate: 0.0
   avg_return: 1.0
    avg_steps: 33.08
counts: {'reward': 1000}

run_gridworld_aif_vs_random

python run_gridworld_aif_vs_random.py --episodes 2000 --workers 60 --cols 10 --rows 10 --reward-pos "9, 9" --punish-pos "0, 9"

=== Summary (Random) ===
 success_rate: 0.1875
  punish_rate: 0.2635
 timeout_rate: 0.549
   avg_return: -0.076
    avg_steps: 160.0365
counts: {'timeout': 1098, 'reward': 375, 'punish': 527}

=== Summary (AIF) ===
 success_rate: 0.69
  punish_rate: 0.0
 timeout_rate: 0.31
   avg_return: 0.69
    avg_steps: 126.71
counts: {'timeout': 620, 'reward': 1380}

run_gridworld_aif_vs_random1

python run_gridworld_aif_vs_random.py --episodes 2000 --workers 60 --cols 10 --rows 10 --reward-pos "9, 9" --punish-pos "0, 9"  --policy-len 5

=== Summary (Random) ===
 success_rate: 0.1875
  punish_rate: 0.2635
 timeout_rate: 0.549
   avg_return: -0.076
    avg_steps: 160.0365
counts: {'timeout': 1098, 'reward': 375, 'punish': 527}

=== Summary (AIF) ===
 success_rate: 0.82
  punish_rate: 0.0
 timeout_rate: 0.18
   avg_return: 0.82
    avg_steps: 120.96
counts: {'reward': 1640, 'timeout': 360}

run_gridworld_aif_vs_random2

  1. AIF vs Random with observations noise (with optional parallelism)

python run_gridworld_obs_noise.py --workers 10 --episodes 1000

=== Summary (Random) ===
 success_rate: 0.408
  punish_rate: 0.453
 timeout_rate: 0.139
   avg_return: -0.045
    avg_steps: 94.594
counts: {'reward': 408, 'timeout': 139, 'punish': 453}

=== Summary (AIF, noisy A) ===
 success_rate: 1.0
  punish_rate: 0.0
 timeout_rate: 0.0
   avg_return: 1.0
    avg_steps: 33.41
counts: {'reward': 1000}
 belief_error_ratio (mean over episodes): 0.917

run_gridworld_aif_vs_random3

python run_gridworld_obs_noise.py --workers 20 --episodes 1000 --cols 10 --rows 10 --reward-pos "9, 9" --punish-pos "0, 9"

=== Summary (Random) ===
 success_rate: 0.189
  punish_rate: 0.259
 timeout_rate: 0.552
   avg_return: -0.07
    avg_steps: 161.987
counts: {'timeout': 552, 'reward': 189, 'punish': 259}

=== Summary (AIF, noisy A) ===
 success_rate: 0.52
  punish_rate: 0.06
 timeout_rate: 0.42
   avg_return: 0.46
    avg_steps: 153.3
counts: {'reward': 520, 'timeout': 420, 'punish': 60}
 belief_error_ratio (mean over episodes): 0.985

run_gridworld_aif_vs_random4

  1. Live demo (dynamic window)

python run_gridworld_live_demo.py --episodes-random 4 --episodes-aif 3 --fps 12 --seed 58457
[RANDOM] Episode 1: return=1.00, steps=184
[RANDOM] Episode 2: return=1.00, steps=31
[RANDOM] Episode 3: return=1.00, steps=24
[RANDOM] Episode 4: return=0.00, steps=200
[AIF] Episode 1: return=1.00, steps=11
[AIF] Episode 2: return=1.00, steps=10
[AIF] Episode 3: return=1.00, steps=15
Screencast.from.2025-10-19.18-07-54.mp4
  1. Analyzing metrics (Complexity, Accuracy, Extrinsic and Epistemic)

This section is to analyze how the main four terms defined in the previous section vary as we run an episode of this very simple grid world. The main idea is to demonstrate conceptually what are the main things that these four terms entail in regards to active inference.

As shown in the examples bellow, when we run an episode, the only value greater than 0 is the extrinsic Utility throughout the whole episode while the others are all 0.

python run_gridworld_live_metrics.py --fps 10
[Episode 1] return=1.00, steps=14
[Episode 2] return=1.00, steps=33
[Episode 3] return=1.00, steps=27
All episodes complete: [(1.0, 14), (1.0, 33), (1.0, 27)]

run_gridworld_live_metrics

Why is this happening?

Short answer: that pattern is exactly what we should see in our current setup (fully observable, deterministic dynamics, sharp prior), so nothing’s “wrong.”

Lets unfold this in more detail.

  1. Fully observable A (≈ identity) & deterministic B

    . After each step we set the prior for the next step as $prior_s = B_uq_s$.

    . The new observation is perfectly informative: the posterior $q(s)$ collapses to the true state, which equals the predicted state.

    . $\Rightarrow$ Complexity $D_{KL}({q(s)}\parallel{prior_s}) = 0$ (posterior matches prior).

    . With $A \approx identity, p(o_t∣s^*) = 1$ at the true state $\rightarrow −\mathbb{E}_q[ln⁡ p(o_t⁣∣s)] = 0$.

    . Epistemic (1-step info gain) vanishes: for each possible $o, {q(s⁣∣o)}\approx{q(s)}$ (no uncertainty to reduce), so expected $KL = 0$.

  2. Extrinsic > 0

    . We compute $\mathbb{E}_{q(o)}[−ln ⁡p(o∣C)]$.

    . Unless our preferences $p(o∣C)$ put probability $\sim{1}$ on the actually observed outcome at every step, this expectation is positive. That’s why our utility bar moves, while the others don’t.

How to make the other bars move (and why)

If we want non-zero Complexity, Accuracy, Epistemic, we have to introduce mismatch or uncertainty:

. $\textbf{Accuracy} > 0$ (likelihood cost): We have to add observation noise: --a-noise 0.2 (so $\textbf{A}$ not identity).

python run_gridworld_live_metrics.py --fps 10 --a-noise 0.2
[Episode 1] return=1.00, steps=24
[Episode 2] return=1.00, steps=16
[Episode 3] return=1.00, steps=10
All episodes complete: [(1.0, 24), (1.0, 16), (1.0, 10)]

run_gridworld_live_metrics1

Pure B-noise (no env slips), see Complexity move:

python run_gridworld_live_metrics.py --b-noise 0.2 --fps 10
[Episode 1] return=0.00, steps=200
[Episode 2] return=0.00, steps=200
[Episode 3] return=0.00, steps=200
All episodes complete: [(0.0, 200), (0.0, 200), (0.0, 200)]

run_gridworld_live_metrics2

Combine A-noise with B-noise to also see Accuracy/Epistemic also let's keep preferences gentle so exploration isn’t crushed:

python run_gridworld_live_metrics.py --a-noise 0.5 --b-noise 0.5 --c-reward 0.1 --c-punish -0.1 --fps 12 --policy-len 6 --sophisticated --max-steps 10 
[Episode 1] return=0.00, steps=10
[Episode 2] return=0.00, steps=10
[Episode 3] return=0.00, steps=10
All episodes complete: [(0.0, 10), (0.0, 10), (0.0, 10)]

run_gridworld_live_metrics3

Producing a Complexity spike inserting a random starting position (--start-pos random), plus non-zero Accuracy + Epistemic during re-localization with noisy actions (--slip-prob 0.1).

python run_gridworld_live_metrics.py --start-pos random --fps 1 --slip-prob 0.1
[Episode 1] return=1.00, steps=8
[Episode 2] return=1.00, steps=42
[Episode 3] return=1.00, steps=42
All episodes complete: [(1.0, 8), (1.0, 42), (1.0, 42)]

run_gridworld_live_metrics4

You can also run some stats in this regard

python run_gridworld_metrics_stats.py --episodes 100 --workers 20 --a-noise 0.1 --b-noise 0.1 --c-reward 0.1 --c-punish -0.1 --max-steps 200 --policy-len 6 --sophisticated --cols 5 --rows 5 --reward-pos "4,4" --punish-pos "0,4"

=== Summary ===
 success_rate: 0.0
  punish_rate: 0.0
 timeout_rate: 1.0
   avg_return: 0.0
    avg_steps: 200.0
counts: {'timeout': 100}

Per-episode means (overall): Complexity=0.098, Accuracy=0.103, Extrinsic=3.219, Epistemic=0.002

Tri-modal, orientation-aware agent

This is a great place to showcase Active Inference with controlled, informative sensing. The cleanest specification in pymdp (and the one I recommend) is:

Hidden state factors

  • F₁: a joint factor for (position × orientation) with size S = (NM4).

Reason: in pymdp, transition matrices factorize over hidden-state factors. Since your F₁ (position) transition under forward depends on orientation (F₂), that cross-factor dependency would be awkward/imprecise unless we combine position & orientation into a single factor.

Outcome modalities

  • M₁ (distance): integer distance (0..max_range) to the first non-empty square in the current look direction; that terminal can only be one of {EDGE, RED, GREEN}.

  • M₂ (terminal class): the class of that terminal square in the look direction ∈ {EDGE, RED, GREEN}.

  • M₃ (current class): the class of the square the agent currently occupies ∈ {EMPTY, EDGE, RED, GREEN}.

Actions:

  • 3 controls over the single factor: forward, turn_left, turn_right.

Preferences C:

  • likes GREEN (reward) and dislikes RED (punish) via M3 (current cell class).

  • M1/M2 are neutral by default (but you can make them informative to encourage curiosity).

Noise:

  • optional A-noise (observation model noise) and
  • B-noise (agent’s transition model mismatch).

Run examples

  • Plain, deterministic sensing/dynamics, sophisticated inference:
python run_nav3_live_demo.py --sophisticated --start-ori E
[Episode 1] return=1.00, steps=65
  • With observation and model mismatch inside the agent (watch it reorient/localize):
python run_nav3_live_demo.py --a-noise 0.001 --b-noise 0.001 --sophisticated --policy-len 4 --start-ori N 
[Episode 1] return=1.00, steps=69
  • Start at a different place/orientation:
python run_nav3_live_demo.py --start-pos 2,1 --start-ori S
[Episode 1] return=1.00, steps=79
python run_nav3_live_demo.py --start-ori E --fps 12 --episodes 5 --rows 6 --cols 6 --reward-pos "5,5" --punish-pos "0,5" --policy-len 3
[Episode 1] return=1.00, steps=92
[Episode 2] return=1.00, steps=73
[Episode 3] return=0.00, steps=200
[Episode 4] return=1.00, steps=65
[Episode 5] return=1.00, steps=37
Screencast.from.2025-12-17.22-58-52.mp4

Metrics on Tri-modal, orientation-aware agent

python run_nav3_live_metrics.py --sophisticated --start-ori N --policy-len 3 --episodes 5 --cols 10 --rows 10 --punish-pos 0,9 --reward-pos 9,9 --start-pos 0,0 --a-noise 0.8 --b-noise 0.8
[Episode 1] return=1.00, steps=88
[Episode 2] return=1.00, steps=121
[Episode 3] return=1.00, steps=82
[Episode 4] return=1.00, steps=94
[Episode 5] return=1.00, steps=75
All episodes complete: [(1.0, 88), (1.0, 121), (1.0, 82), (1.0, 94), (1.0, 75)]

Rewarding viewing green, while penalizing viewing red.

Slightly penalizing viewing edge, and staying in edge or empty.

     C1 = np.zeros((O1,), dtype=np.float64)  # neutral distances
     C2 = np.zeros((O2,), dtype=np.float64)  # neutral terminal class
+    C2[M2_GREEN] = pref_green   # prefer seeing green ahead
+    C2[M2_RED] = pref_red   # prefer not seeing red ahead
+    C2[M2_EDGE] = -0.1   # prefer not seeing edge ahead
     C3 = np.zeros((O3,), dtype=np.float64)
     C3[CLASS_GREEN] = pref_green
     C3[CLASS_RED]   = pref_red
-    # EDGE / EMPTY remain neutral
+    C3[CLASS_EDGE]   = -0.1
+    C3[CLASS_EMPTY]   = -0.1
Screencast.from.2025-12-29.10-32-02.mp4

Run stats in this new model

python run_nav3_metrics_stats.py --episodes 50 --workers 10 --max-steps 50 --cols 10 --rows 10 --reward-pos 9,9 --punish-pos 0,9 --a-noise 0.1 --b-noise 0.1

=== Summary ===
 success_rate: 1.0
  punish_rate: 0.0
 timeout_rate: 0.0
   avg_return: 1.0
    avg_steps: 20.8
counts: {'reward': 50}

Per-episode means (overall): Complexity=0.098, Accuracy=0.243, Extrinsic=7.539, Epistemic=0.001

About

Tiny, transparent experiments in Active Inference using pymdp

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors