
Commit d825b92

jstac, claude, and mmcky authored
Add optimistic initialization to Q-learning lecture (#830)
* Add optimistic initialization to Q-learning lecture

  Initialize Q-table to 20 (above true value range of 13-18) instead of zeros, which drives broader exploration via "optimism in the face of uncertainty". This speeds convergence enough to reduce the training run from 20M to 5M steps. Added a new subsection explaining the technique.

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Add optimistic initialization to risk-sensitive Q-learning lecture

  Initialize Q-table to 1e-5 (below true Q-value range of ~1e-5 to 1e-4) instead of ones. Since the optimal policy minimizes q, optimistic means initializing below the truth — the reverse of the risk-neutral case. This speeds convergence enough to reduce training from 20M to 5M steps. Added a subsection explaining the reversed optimistic init logic.

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Add \EE macro to MathJax config

* Restore profit to observation list; fix risk-sensitive optimistic init value and narrative

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Matt McKay <mmcky@users.noreply.github.com>
1 parent ad27613 commit d825b92

File tree

3 files changed: +82 −66 lines changed


lectures/_config.yml

Lines changed: 1 addition & 0 deletions
@@ -109,6 +109,7 @@ sphinx:
     macros:
       "argmax" : ["\\operatorname*{argmax}", 0]
       "argmin" : ["\\operatorname*{argmin}", 0]
+      "EE" : "\\mathbb{E}"
   mathjax_path: https://cdn.jsdelivr.net/npm/mathjax@3/es5/tex-mml-chtml.js
   # Local Redirects
   rediraffe_redirects:

lectures/inventory_q.md

Lines changed: 57 additions & 55 deletions
@@ -37,18 +37,23 @@ We approach the problem in two ways.
 First, we solve it exactly using dynamic programming, assuming full knowledge of
 the model — the demand distribution, cost parameters, and transition dynamics.
 
-Second, we show how a manager can learn the optimal policy from experience alone, using *[Q-learning](https://en.wikipedia.org/wiki/Q-learning)*.
+Second, we show how a manager can learn the optimal policy from experience alone, using [Q-learning](https://en.wikipedia.org/wiki/Q-learning).
 
-The manager observes only the inventory level, the order placed, the resulting
-profit, and the next inventory level — without knowing any of the underlying
-parameters.
+In this setting, we assume that the manager observes only
+
+* the inventory level,
+* the order placed,
+* the resulting profit, and
+* the next inventory level.
+
+The manager knows the interest rate -- and hence the discount factor -- but not any of the other underlying parameters.
 
 A key idea is the *Q-factor* representation, which reformulates the Bellman
 equation so that the optimal policy can be recovered without knowledge of the
-transition function.
+transition dynamics.
 
-We show that, given enough experience, the manager's learned policy converges to
-the optimal one.
+We show that, given enough experience, the
+manager's learned policy converges to the optimal one.
 
 The lecture proceeds as follows:
 

@@ -67,16 +72,18 @@ import matplotlib.pyplot as plt
 from typing import NamedTuple
 ```
 
+
 ## The Model
 
-We study a firm where a manager tries to maximize shareholder value.
+We study a firm where a manager tries to maximize shareholder value by
+controlling inventories.
 
 To simplify the problem, we assume that the firm only sells one product.
 
 Letting $\pi_t$ be profits at time $t$ and $r > 0$ be the interest rate, the value of the firm is
 
 $$
-V_0 = \sum_{t \geq 0} \beta^t \pi_t
+V_0 = \EE \sum_{t \geq 0} \beta^t \pi_t
 \qquad
 \text{ where }
 \quad \beta := \frac{1}{1+r}.
@@ -97,9 +104,9 @@
 $$
 
 The term $A_t$ is units of stock ordered this period, which arrive at the start
-of period $t+1$, after demand $D_{t+1}$ is realized and served.
+of period $t+1$, after demand $D_{t+1}$ is realized and served:
 
-**Timeline for period $t$:** observe $X_t$ → choose $A_t$ → demand $D_{t+1}$ arrives → profit realized → $X_{t+1}$ determined.
+* observe $X_t$ → choose $A_t$ → demand $D_{t+1}$ arrives → profit realized → $X_{t+1}$ determined.
 
 (We use a $t$ subscript in $A_t$ to indicate the information set: it is chosen
 before $D_{t+1}$ is observed.)
@@ -115,7 +122,7 @@
 Here
 
 * the sales price is set to unity (for convenience)
-* revenue is the minimum of current stock and demand because orders in excess of inventory are lost rather than back-filled
+* revenue is the minimum of current stock and demand because orders in excess of inventory are lost (not back-filled)
 * $c$ is unit product cost and $\kappa$ is a fixed cost of ordering inventory
 
 We can map our inventory problem into a dynamic program with state space $\mathsf X := \{0, \ldots, K\}$ and action space $\mathsf A := \mathsf X$.
@@ -463,9 +470,10 @@ The manager does not need to know the demand distribution $\phi$, the unit cost
 All the manager needs to observe at each step is:
 
 1. the current inventory level $x$,
-2. the order quantity $a$ they chose,
-3. the resulting profit $R_{t+1}$ (which appears on the books), and
-4. the next inventory level $X_{t+1}$ (which they can read off the warehouse).
+2. the order quantity $a$, which they choose,
+3. the resulting profit $R_{t+1}$ (which appears on the books),
+4. the discount factor $\beta$, which is determined by the interest rate, and
+5. the next inventory level $X_{t+1}$ (which they can read off the warehouse).
 
 These are all directly observable quantities — no model knowledge is required.
 

@@ -480,47 +488,29 @@ a)$ for every state-action pair $(x, a)$.
 
 At each step, the manager is in some state $x$ and must choose a specific action
 $a$ to take. Whichever $a$ is chosen, the manager observes profit $R_{t+1}$
-and next state $X_{t+1}$, and updates **that one entry** $q_t(x, a)$ of the
+and next state $X_{t+1}$, and updates *that one entry* $q_t(x, a)$ of the
 table using the rule above.
 
-**The max computes a value, not an action.**
-
 It is tempting to read the $\max_{a'}$ in the update rule as prescribing the
 manager's next action — that is, to interpret the update as saying "move to
-state $X_{t+1}$ and take action $\argmax_{a'} q_t(X_{t+1}, a')$."
+state $X_{t+1}$ and take an action in $\argmax_{a'} q_t(X_{t+1}, a')$."
 
-But the $\max$ plays a different role. The quantity $\max_{a' \in
-\Gamma(X_{t+1})} q_t(X_{t+1}, a')$ is a **scalar** — it estimates the value of
-being in state $X_{t+1}$ under the best possible continuation. This scalar
-enters the update as part of the target value for $q_t(x, a)$.
+But the $\max$ plays a different role.
 
-Which action the manager *actually takes* at state $X_{t+1}$ is a separate
-decision entirely.
-
-To see why this distinction matters, consider what happens if we modify the
-update rule by replacing the $\max$ with evaluation under a fixed feasible
-policy $\sigma$:
-
-$$
-q_{t+1}(x, a)
-= (1 - \alpha_t) q_t(x, a) +
-\alpha_t \left(R_{t+1} + \beta \, q_t(X_{t+1}, \sigma(X_{t+1}))\right).
-$$
+The quantity $\max_{a' \in \Gamma(X_{t+1})} q_t(X_{t+1}, a')$ is just an estimate of the value of
+being in state $X_{t+1}$ under the best possible continuation.
 
-This modified update is a stochastic sample of the Bellman *evaluation* operator
-for $\sigma$. The Q-table then converges to $q^\sigma$ — the Q-function
-associated with the lifetime value of $\sigma$, not the optimal one.
+This scalar enters the update as part of the target value for $q_t(x, a)$.
 
-By contrast, the original update with the $\max$ is a stochastic sample of the
-Bellman *optimality* operator, whose fixed point is $q^*$. The $\max$ in the
-update target is therefore what drives convergence to $q^*$.
+Which action the manager *actually takes* at time $t+1$ is a separate decision.
 
-In short, the $\max$ is doing the work of finding the optimum; without it, you only evaluate a fixed policy.
+In short, the $\max$ is doing the work of finding the optimum; it does not dictate the action that the manager actually takes.
 
 ### The behavior policy
 
-The rule governing how the manager chooses actions is called the **behavior
-policy**. Because the $\max$ in the update target always points toward $q^*$
+The rule governing how the manager chooses actions is called the **behavior policy**.
+
+Because the $\max$ in the update target always points toward $q^*$
 regardless of how the manager selects actions, the behavior policy affects only
 which $(x, a)$ entries get visited — and hence updated — over time.
 
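The role of the $\max$ discussed in this hunk can be sketched in isolation. The following is a hypothetical toy example, not code from the lecture; the `feasible` callable stands in for the correspondence $\Gamma$, and the state/action sizes are made up:

```python
import numpy as np

def q_update(q, x, a, reward, x_next, feasible, α, β):
    """One tabular Q-learning update of the single entry q[x, a].

    The max over feasible actions at x_next supplies a *scalar* value
    estimate for the update target; it does not choose the next action.
    """
    target = reward + β * max(q[x_next, a2] for a2 in feasible(x_next))
    q[x, a] = (1 - α) * q[x, a] + α * target
    return q

# Toy usage: 3 states, 3 actions, every action feasible everywhere
q = np.zeros((3, 3))
feasible = lambda x: range(3)
q = q_update(q, x=0, a=1, reward=1.0, x_next=2,
             feasible=feasible, α=0.5, β=0.9)
print(q[0, 1])  # 0.5 * (1.0 + 0.9 * 0.0) = 0.5
```

Only the visited entry `q[0, 1]` changes; which action the agent takes next is left to the behavior policy.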

@@ -545,6 +535,7 @@ We use $\alpha_t = 1 / n_t(x, a)^{0.51}$, where $n_t(x, a)$ is the number of tim
 
 This decays slowly enough to allow learning from later (better-informed) updates, while still satisfying the [Robbins–Monro conditions](https://en.wikipedia.org/wiki/Stochastic_approximation#Robbins%E2%80%93Monro_algorithm) for convergence.
 
+
 ### Exploration: epsilon-greedy
 
 For our behavior policy, we use an $\varepsilon$-greedy strategy:
@@ -560,6 +551,16 @@ We decay $\varepsilon$ each step: $\varepsilon_{t+1} = \max(\varepsilon_{\min},\
 
 The stochastic demand shocks naturally drive the manager across different inventory levels, providing exploration over the state space without any artificial resets.
 
+### Optimistic initialization
+
+A simple but powerful technique for accelerating learning is **optimistic initialization**: instead of starting the Q-table at zero, we initialize every entry to a value above the true optimum.
+
+Because every untried action looks optimistically good, the agent is "disappointed" whenever it tries one — the update pulls that entry down toward reality. This drives the agent to try other actions (which still look optimistically high), producing broad exploration of the state-action space early in training.
+
+This idea is sometimes called **optimism in the face of uncertainty** and is widely used in both bandit and reinforcement learning settings.
+
+In our problem, the value function $v^*$ ranges from about 13 to 18. We initialize the Q-table at 20 — modestly above the true maximum — to ensure optimistic exploration without being so extreme as to distort learning.
+
 ### Implementation
 
 We first define a helper to extract the greedy policy from a Q-table.
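The mechanism added in this hunk can be seen in a stripped-down setting. This is a hypothetical 3-armed bandit, not the lecture's code; `true_means` and the init value 5.0 are invented for illustration:

```python
import numpy as np

# Optimistic initialization in a toy bandit: a purely greedy agent still
# samples every arm early, because each untried arm looks best.
rng = np.random.default_rng(0)
true_means = np.array([1.0, 2.0, 3.0])   # hypothetical true values, max 3
q = np.full(3, 5.0)                      # initialized above the true range
counts = np.zeros(3)

for t in range(300):
    a = np.argmax(q)                     # greedy: optimism does the exploring
    counts[a] += 1
    r = true_means[a] + rng.normal()
    q[a] += (r - q[a]) / counts[a]       # sample-average update pulls q down

print(counts)  # every arm gets tried at least once
```

Each pull "disappoints" the chosen arm, so the greedy rule rotates through all arms before typically settling on the best one, mirroring the exploration effect described above.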
@@ -587,9 +588,9 @@ At specified step counts (given by `snapshot_steps`), we record the current gree
 ```{code-cell} ipython3
 @numba.jit(nopython=True)
 def q_learning_kernel(K, p, c, κ, β, n_steps, X_init,
-                      ε_init, ε_min, ε_decay, snapshot_steps, seed):
+                      ε_init, ε_min, ε_decay, q_init, snapshot_steps, seed):
     np.random.seed(seed)
-    q = np.zeros((K + 1, K + 1))
+    q = np.full((K + 1, K + 1), q_init)
     n = np.zeros((K + 1, K + 1))  # visit counts for learning rate
     ε = ε_init
 
@@ -642,22 +643,21 @@ The wrapper function unpacks the model and provides default hyperparameters.
 ```{code-cell} ipython3
 def q_learning(model, n_steps=20_000_000, X_init=0,
                ε_init=1.0, ε_min=0.01, ε_decay=0.999999,
-               snapshot_steps=None, seed=1234):
+               q_init=20.0, snapshot_steps=None, seed=1234):
     x_values, d_values, ϕ_values, p, c, κ, β = model
     K = len(x_values) - 1
     if snapshot_steps is None:
         snapshot_steps = np.array([], dtype=np.int64)
     return q_learning_kernel(K, p, c, κ, β, n_steps, X_init,
-                             ε_init, ε_min, ε_decay, snapshot_steps, seed)
+                             ε_init, ε_min, ε_decay, q_init, snapshot_steps, seed)
 ```
 
-### Running Q-learning
-
-We run 20 million steps and take policy snapshots at steps 10,000, 1,000,000, and at the end.
+Next we run $n$ = 5 million steps and take policy snapshots at steps 10,000, 1,000,000, and $n$.
 
 ```{code-cell} ipython3
-snap_steps = np.array([10_000, 1_000_000, 19_999_999], dtype=np.int64)
-q, snapshots = q_learning(model, snapshot_steps=snap_steps)
+n = 5_000_000
+snap_steps = np.array([10_000, 1_000_000, n], dtype=np.int64)
+q, snapshots = q_learning(model, n_steps=n+1, snapshot_steps=snap_steps)
 ```
 
 ### Comparing with the exact solution
@@ -710,9 +710,11 @@ All panels use the **same demand sequence** (via a fixed random seed), so differ
 
 The top panel shows the optimal policy from VFI for reference.
 
-After only 10,000 steps the agent has barely explored and its policy is poor.
+After 10,000 steps the agent has barely explored and its policy is poor.
+
+By 1,000,000 steps the policy has improved but still differs noticeably from the optimum.
 
-By step 20 million, the learned policy produces inventory dynamics that closely resemble the S-s pattern of the optimal solution.
+By step 5 million, the learned policy produces inventory dynamics that closely resemble the S-s pattern of the optimal solution.
 
 ```{code-cell} ipython3
 ts_length = 200

lectures/rs_inventory_q.md

Lines changed: 24 additions & 11 deletions
@@ -530,7 +530,7 @@ $X_{t+1}$ — no model knowledge is required.
 Our implementation follows the same structure as the risk-neutral Q-learning in
 {doc}`inventory_q`, with the modifications above:
 
-1. **Initialize** the Q-table $q$ to ones (since Q-values are positive) and
+1. **Initialize** the Q-table $q$ optimistically (see below) and
    visit counts $n$ to zeros.
 2. **At each step:**
    - Draw demand $D_{t+1}$ and compute observed profit $R_{t+1}$ and next state
@@ -548,6 +548,17 @@ Our implementation follows the same structure as the risk-neutral Q-learning in
    $\sigma(x) = \argmin_{a \in \Gamma(x)} q(x, a)$.
 4. **Compare** the learned policy against the VFI solution.
 
+### Optimistic initialization
+
+As in {doc}`inventory_q`, we use optimistic initialization to accelerate learning.
+
+The logic is the same — initialize the Q-table so that every untried action looks attractive, driving the agent to explore broadly — but the direction is reversed.
+
+Since the optimal policy *minimizes* $q$, "optimistic" means initializing the Q-table *below* the true values. When the agent tries an action, the update pushes $q$ upward toward reality, making that entry look worse and prompting the agent to try other actions that still appear optimistically good.
+
+The true Q-values are on the order of $\exp(-\gamma \, v^*) \approx 10^{-8}$ to $10^{-6}$.
+We initialize the Q-table at $10^{-9}$, modestly below this range.
+
 ### Implementation
 
 We first define a helper to extract the greedy policy from the Q-table.
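The reversed direction of optimism added in this hunk can be sketched with a toy minimization bandit. This is a hypothetical example, not the lecture's code; `true_costs` and the init value 0.1 are invented for illustration:

```python
import numpy as np

# When the agent *minimizes* q, "optimistic" means starting *below* the
# true values, so every untried action looks attractively cheap.
rng = np.random.default_rng(0)
true_costs = np.array([3.0, 2.0, 1.0])   # hypothetical true costs, min 1
q = np.full(3, 0.1)                      # initialized below the true range
counts = np.zeros(3)

for t in range(300):
    a = np.argmin(q)                     # greedy minimizer
    counts[a] += 1
    c = true_costs[a] + 0.1 * rng.normal()
    q[a] += (c - q[a]) / counts[a]       # update pushes q up toward reality

print(counts)  # every action sampled at least once
```

Each trial pushes the sampled entry up, so the greedy minimizer rotates through all actions before concentrating on the cheapest one: the mirror image of the max case.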
@@ -571,15 +582,15 @@ def greedy_policy_from_q_rs(q, K):
 ```
 
 The Q-learning loop mirrors the risk-neutral version, with the key changes:
-Q-table initialized to ones, the update target uses $\exp(-\gamma R_{t+1})
+the update target uses $\exp(-\gamma R_{t+1})
 \cdot (\min_{a'} q_t)^\beta$, and the behavior policy follows the $\argmin$.
 
 ```{code-cell} ipython3
 @numba.jit(nopython=True)
 def q_learning_rs_kernel(K, p, c, κ, β, γ, n_steps, X_init,
-                         ε_init, ε_min, ε_decay, snapshot_steps, seed):
+                         ε_init, ε_min, ε_decay, q_init, snapshot_steps, seed):
     np.random.seed(seed)
-    q = np.ones((K + 1, K + 1))  # positive Q-values, initialized to 1
+    q = np.full((K + 1, K + 1), q_init)  # optimistic initialization
     n = np.zeros((K + 1, K + 1))  # visit counts for learning rate
     ε = ε_init
@@ -633,22 +644,23 @@ The wrapper function unpacks the model and provides default hyperparameters.
 ```{code-cell} ipython3
 def q_learning_rs(model, n_steps=20_000_000, X_init=0,
                   ε_init=1.0, ε_min=0.01, ε_decay=0.999999,
-                  snapshot_steps=None, seed=1234):
+                  q_init=1e-9, snapshot_steps=None, seed=1234):
     x_values, d_values, ϕ_values, p, c, κ, β, γ = model
     K = len(x_values) - 1
     if snapshot_steps is None:
         snapshot_steps = np.array([], dtype=np.int64)
     return q_learning_rs_kernel(K, p, c, κ, β, γ, n_steps, X_init,
-                                ε_init, ε_min, ε_decay, snapshot_steps, seed)
+                                ε_init, ε_min, ε_decay, q_init, snapshot_steps, seed)
 ```
 
 ### Running Q-learning
 
-We run 20 million steps and take policy snapshots at steps 10,000, 1,000,000, and at the end.
+We run $n$ = 5 million steps and take policy snapshots at steps 10,000, 1,000,000, and $n$.
 
 ```{code-cell} ipython3
-snap_steps = np.array([10_000, 1_000_000, 19_999_999], dtype=np.int64)
-q_table, snapshots = q_learning_rs(model, snapshot_steps=snap_steps)
+n = 5_000_000
+snap_steps = np.array([10_000, 1_000_000, n], dtype=np.int64)
+q_table, snapshots = q_learning_rs(model, n_steps=n+1, snapshot_steps=snap_steps)
 ```
 
 ### Comparing with the exact solution
@@ -731,8 +743,9 @@ plt.show()
 
 After 10,000 steps, the agent has barely explored and its policy is erratic.
 
-By 1,000,000 steps the learned policy begins to resemble the optimal one, and
-by step 20 million the inventory dynamics are nearly indistinguishable from the
+By 1,000,000 steps the learned policy has improved but still differs noticeably from the optimum.
+
+By step 5 million the inventory dynamics are nearly indistinguishable from the
 VFI solution.
 
 Note that the converged policy maintains lower inventory levels than in the
