
Commit ad27613

jstac, claude, and mmcky authored
Add risk-sensitive inventory management lecture (#827)
* Add risk-sensitive inventory management lecture and improve inventory_q
* Add rs_inventory_q to table of contents
* Remove generated ipynb and py files from tracking: the build system generates notebooks from the .md source, and having .ipynb and .py files in the repo causes duplicate document warnings and cross-reference failures.
* Fix cross-references: use {doc} instead of {ref} for lecture links. The {ref} directive requires a label placed before a heading to resolve a title; the inventory_q label is before a raw block, so {ref} fails. Using {doc} links to the document directly.
* Use \argmin and \argmax macros from _config.yml
* Rename risk-sensitivity function from φ to ψ to avoid notation conflict: φ was overloaded, used for both the demand PMF and the risk-sensitivity transformation. Now ψ(t) = exp(-γt) is the risk transformation and φ(d) remains the demand PMF, consistent with inventory_q.md.
* Add forward ref from inventory_q and fix minor grammar: add a cross-reference from inventory_q.md to rs_inventory_q.md, and fix a missing comma after an introductory phrase.
* Move random seeds inside jitted functions for reproducibility: Numba JIT functions use their own RNG state, so np.random.seed() called outside a jitted function has no effect on random draws inside it. Seeds are now passed as parameters to the sim_inventories and q_learning kernels. Fixes issue noted by @HumphreyYang.

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Matt McKay <mmcky@users.noreply.github.com>
1 parent b05460d commit ad27613

File tree

3 files changed: +806 −34 lines changed

lectures/_toc.yml

Lines changed: 1 addition & 0 deletions
@@ -77,6 +77,7 @@ parts:
   numbered: true
   chapters:
   - file: inventory_q
+  - file: rs_inventory_q
   - file: mccall_q
 - caption: Introduction to Optimal Savings
   numbered: true

lectures/inventory_q.md

Lines changed: 46 additions & 34 deletions
@@ -37,7 +37,7 @@ We approach the problem in two ways.
 First, we solve it exactly using dynamic programming, assuming full knowledge of
 the model — the demand distribution, cost parameters, and transition dynamics.
 
-Second, we show how a manager can learn the optimal policy from experience alone, using *Q-learning*.
+Second, we show how a manager can learn the optimal policy from experience alone, using *[Q-learning](https://en.wikipedia.org/wiki/Q-learning)*.
 
 The manager observes only the inventory level, the order placed, the resulting
 profit, and the next inventory level — without knowing any of the underlying
@@ -50,6 +50,14 @@ transition function.
 We show that, given enough experience, the manager's learned policy converges to
 the optimal one.
 
+The lecture proceeds as follows:
+
+1. We set up the inventory model and solve it exactly via value function iteration.
+2. We introduce Q-factors and derive the Q-factor Bellman equation.
+3. We implement Q-learning and show the learned policy converges to the optimal one.
+
+A risk-sensitive extension of this model is studied in {doc}`rs_inventory_q`.
+
 We will use the following imports:
 
 ```{code-cell} ipython3
@@ -85,12 +93,14 @@ $$
 \qquad
 \text{where}
 \quad
-h(x,a,d) := (x - d) \vee 0 + a.
+h(x,a,d) := \max(x - d, 0) + a.
 $$
 
 The term $A_t$ is units of stock ordered this period, which arrive at the start
 of period $t+1$, after demand $D_{t+1}$ is realized and served.
 
+**Timeline for period $t$:** observe $X_t$ → choose $A_t$ → demand $D_{t+1}$ arrives → profit realized → $X_{t+1}$ determined.
+
 (We use a $t$ subscript in $A_t$ to indicate the information set: it is chosen
 before $D_{t+1}$ is observed.)
 
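As a quick sanity check on the transition rule, the new form of $h$ can be written directly in Python (a standalone sketch, not part of the lecture's code):

```python
def h(x, a, d):
    """Next-period inventory: unsold stock carried over, plus the order placed."""
    return max(x - d, 0) + a

# Start with 5 units, order 2, demand 3: carry over 2, receive 2
print(h(5, 2, 3))   # 4
# Demand exceeds stock and nothing is ordered: inventory hits zero
print(h(1, 0, 4))   # 0
```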
@@ -99,7 +109,7 @@ We assume that the firm can store at most $K$ items at one time.
 Profits are given by
 
 $$
-\pi(X_t, A_t, D_{t+1}) := X_t \wedge D_{t+1} - c A_t - \kappa 1\{A_t > 0\}.
+\pi(X_t, A_t, D_{t+1}) := \min(X_t, D_{t+1}) - c A_t - \kappa 1\{A_t > 0\}.
 $$
 
 Here
@@ -158,7 +168,7 @@ $$
 \sum_d \phi(d) \left[ \pi(x, a, d) + \beta \, v(h(x, a, d)) \right]
 $$
 
-When $r > 0$, the sequence $v_{k+1} = T v_k$ converges to the
+When $r > 0$ (equivalently, $\beta < 1$), the sequence $v_{k+1} = T v_k$ converges to the
 unique fixed point $v^*$, which is the value function of the optimal policy
 (see, e.g., {cite}`Sargent_Stachurski_2025`).
 
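For readers who want the fixed-point iteration spelled out, here is a self-contained NumPy sketch of applying $T$ under a geometric demand PMF, with the maximization over feasible orders folded in. All parameter values and the truncation level are illustrative, not the lecture's calibration:

```python
import numpy as np

# Illustrative parameters — not the lecture's calibration
K, p, c, κ, β = 10, 0.5, 0.2, 0.5, 0.98
d_vals = np.arange(101)                 # truncate the geometric support at 100
φ = (1 - p)**d_vals * p                 # φ(d) = (1 - p)^d p

def T(v):
    """One step of value function iteration over states x = 0, ..., K."""
    v_new = np.empty(K + 1)
    for x in range(K + 1):
        best = -np.inf
        for a in range(K - x + 1):      # feasible orders respect capacity: a ≤ K - x
            π = np.minimum(x, d_vals) - c * a - κ * (a > 0)
            h = np.maximum(x - d_vals, 0) + a      # next states, all ≤ K
            best = max(best, φ @ (π + β * v[h]))
        v_new[x] = best
    return v_new

v = np.zeros(K + 1)
for _ in range(600):                    # β^600 ≈ 5e-6, effectively converged
    v = T(v)
```

Since $a \le K - x$ and carried-over stock never exceeds $x$, every next state lands in $\{0, \ldots, K\}$, so `v[h]` needs no clipping.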
@@ -167,7 +177,7 @@ unique fixed point $v^*$, which is the value function of the optimal policy
 
 We store the model primitives in a `NamedTuple`.
 
-Demand follows a geometric distribution with parameter $p$, so $\phi(d) = (1 - p)^d \, p$ for $d = 0, 1, 2, \ldots$.
+Demand follows a [geometric distribution](https://en.wikipedia.org/wiki/Geometric_distribution) with parameter $p$, so $\phi(d) = (1 - p)^d \, p$ for $d = 0, 1, 2, \ldots$.
 
 ```{code-cell} ipython3
 class Model(NamedTuple):
@@ -338,8 +348,9 @@ At each step, we draw a demand shock from the geometric distribution and update
 
 ```{code-cell} ipython3
 @numba.jit(nopython=True)
-def sim_inventories(ts_length, σ, p, X_init=0):
+def sim_inventories(ts_length, σ, p, X_init=0, seed=0):
     """Simulate inventory dynamics under policy σ."""
+    np.random.seed(seed)
     X = np.zeros(ts_length, dtype=np.int32)
     X[0] = X_init
     for t in range(ts_length - 1):
@@ -427,7 +438,7 @@ transition function.
 
 ### The Q-learning update rule
 
-Q-learning approximates the fixed point of the Q-factor Bellman equation using **stochastic approximation**.
+Q-learning approximates the fixed point of the Q-factor Bellman equation using **[stochastic approximation](https://en.wikipedia.org/wiki/Stochastic_approximation)**.
 
 At each step, the agent is in state $x$, takes action $a$, observes reward
 $R_{t+1} = \pi(x, a, D_{t+1})$ and next state $X_{t+1} = h(x, a, D_{t+1})$, and
@@ -443,8 +454,23 @@ where $\alpha_t$ is the learning rate.
 
 The update blends the current estimate $q_t(x, a)$ with a fresh sample of the Bellman target.
 
+### What the manager needs to know
+
+Notice what is **not** required to implement the update.
+
+The manager does not need to know the demand distribution $\phi$, the unit cost $c$, the fixed cost $\kappa$, or the transition function $h$.
+
+All the manager needs to observe at each step is:
+
+1. the current inventory level $x$,
+2. the order quantity $a$ they chose,
+3. the resulting profit $R_{t+1}$ (which appears on the books), and
+4. the next inventory level $X_{t+1}$ (which they can read off the warehouse).
+
+These are all directly observable quantities — no model knowledge is required.
 
-### The Q-table and the behavior policy
+
+### The Q-table and the role of the max
 
 It is important to understand how the update rule relates to the manager's
 actions.
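As a standalone sketch, one Q-learning step looks like this (a hypothetical helper, not the lecture's jitted kernel; the learning rate and the max over feasible orders follow the text, while parameter values are placeholders):

```python
import numpy as np

def q_update(q, n, x, a, reward, x_next, K, β=0.98):
    """One Q-learning step: move q(x, a) toward the sampled Bellman target."""
    n[x, a] += 1
    α = 1 / n[x, a]**0.51                     # learning rate from the text
    # max over feasible next-period orders a' ∈ {0, ..., K - x_next}
    target = reward + β * np.max(q[x_next, :K - x_next + 1])
    q[x, a] += α * (target - q[x, a])
    return q, n
```

Replacing the `max` with the value of a fixed policy's action turns the same recursion into policy evaluation, which is the distinction discussed next.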
@@ -489,7 +515,9 @@ By contrast, the original update with the $\max$ is a stochastic sample of the
 Bellman *optimality* operator, whose fixed point is $q^*$. The $\max$ in the
 update target is therefore what drives convergence to $q^*$.
 
-**The behavior policy.**
+In short, the $\max$ is doing the work of finding the optimum; without it, you only evaluate a fixed policy.
+
+### The behavior policy
 
 The rule governing how the manager chooses actions is called the **behavior
 policy**. Because the $\max$ in the update target always points toward $q^*$
@@ -511,26 +539,11 @@ In practice, we want the manager to mostly take good actions (to earn reasonable
 profits while learning), while still occasionally experimenting to discover
 better alternatives.
 
-### What the manager needs to know
-
-Notice what is **not** required to implement the update.
-
-The manager does not need to know the demand distribution $\phi$, the unit cost $c$, the fixed cost $\kappa$, or the transition function $h$.
-
-All the manager needs to observe at each step is:
-
-1. the current inventory level $x$,
-2. the order quantity $a$ they chose,
-3. the resulting profit $R_{t+1}$ (which appears on the books), and
-4. the next inventory level $X_{t+1}$ (which they can read off the warehouse).
-
-These are all directly observable quantities — no model knowledge is required.
-
 ### Learning rate
 
 We use $\alpha_t = 1 / n_t(x, a)^{0.51}$, where $n_t(x, a)$ is the number of times the pair $(x, a)$ has been visited up to time $t$.
 
-This decays slowly enough to allow learning from later (better-informed) updates, while still satisfying the Robbins-Monro conditions for convergence.
+This decays slowly enough to allow learning from later (better-informed) updates, while still satisfying the [Robbins–Monro conditions](https://en.wikipedia.org/wiki/Stochastic_approximation#Robbins%E2%80%93Monro_algorithm) for convergence.
 
 ### Exploration: epsilon-greedy
 
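The learning-rate schedule above and the ε-greedy rule named in this heading can be sketched together as standalone helpers (hypothetical names, not the lecture's jitted code; the feasible-action bound follows the text):

```python
import numpy as np

def α(n_visits):
    """Learning rate 1 / n^0.51: Σ α = ∞ and Σ α² < ∞, as Robbins–Monro requires."""
    return 1.0 / n_visits**0.51

def ε_greedy_action(q, x, K, ε, rng):
    """With probability ε explore uniformly; otherwise act greedily among feasible orders."""
    n_feasible = K - x + 1                   # feasible orders: a ∈ {0, ..., K - x}
    if rng.random() < ε:
        return int(rng.integers(n_feasible))  # uniform exploration
    return int(np.argmax(q[x, :n_feasible]))  # exploit current estimates
```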
@@ -574,7 +587,8 @@ At specified step counts (given by `snapshot_steps`), we record the current gree
 ```{code-cell} ipython3
 @numba.jit(nopython=True)
 def q_learning_kernel(K, p, c, κ, β, n_steps, X_init,
-                      ε_init, ε_min, ε_decay, snapshot_steps):
+                      ε_init, ε_min, ε_decay, snapshot_steps, seed):
+    np.random.seed(seed)
     q = np.zeros((K + 1, K + 1))
     n = np.zeros((K + 1, K + 1))  # visit counts for learning rate
     ε = ε_init
@@ -593,7 +607,7 @@ def q_learning_kernel(K, p, c, κ, β, n_steps, X_init,
             snapshots[snap_idx] = greedy_policy_from_q(q, K)
             snap_idx += 1
 
-        # === Observe outcome ===
+        # === Draw D_{t+1} and observe outcome ===
         d = np.random.geometric(p) - 1
         reward = min(x, d) - c * a - κ * (a > 0)
         x_next = max(x - d, 0) + a
@@ -628,21 +642,20 @@ The wrapper function unpacks the model and provides default hyperparameters.
 ```{code-cell} ipython3
 def q_learning(model, n_steps=20_000_000, X_init=0,
                ε_init=1.0, ε_min=0.01, ε_decay=0.999999,
-               snapshot_steps=None):
+               snapshot_steps=None, seed=1234):
     x_values, d_values, ϕ_values, p, c, κ, β = model
     K = len(x_values) - 1
     if snapshot_steps is None:
         snapshot_steps = np.array([], dtype=np.int64)
     return q_learning_kernel(K, p, c, κ, β, n_steps, X_init,
-                             ε_init, ε_min, ε_decay, snapshot_steps)
+                             ε_init, ε_min, ε_decay, snapshot_steps, seed)
 ```
 
 ### Running Q-learning
 
 We run 20 million steps and take policy snapshots at steps 10,000, 1,000,000, and at the end.
 
 ```{code-cell} ipython3
-np.random.seed(1234)
 snap_steps = np.array([10_000, 1_000_000, 19_999_999], dtype=np.int64)
 q, snapshots = q_learning(model, snapshot_steps=snap_steps)
 ```
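A quick check on the exploration schedule implied by these defaults: with multiplicative decay, $\varepsilon_t = \max(\varepsilon_{\min}, \varepsilon_{\text{init}} \cdot \text{decay}^t)$, so $\varepsilon$ reaches its floor of 0.01 roughly a quarter of the way through the 20-million-step run (a back-of-the-envelope sketch using the default values above):

```python
import math

ε_init, ε_min, ε_decay = 1.0, 0.01, 0.999999
steps_to_floor = math.log(ε_min / ε_init) / math.log(ε_decay)
print(f"{steps_to_floor:,.0f}")   # ≈ 4.6 million steps
```

After that point the manager explores at the fixed 1% rate for the remaining ~15 million steps.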
@@ -661,6 +674,7 @@ and compare them against $v^*$ and $\sigma^*$ from VFI.
 
 ```{code-cell} ipython3
 K = len(x_values) - 1
+# restrict to feasible actions a ∈ {0, ..., K-x}
 v_q = np.array([np.max(q[x, :K - x + 1]) for x in range(K + 1)])
 σ_q = np.array([np.argmax(q[x, :K - x + 1]) for x in range(K + 1)])
 ```
@@ -710,8 +724,7 @@ X_init = K // 2
 sim_seed = 5678
 
 # Optimal policy
-np.random.seed(sim_seed)
-X_opt = sim_inventories(ts_length, σ_star, p, X_init)
+X_opt = sim_inventories(ts_length, σ_star, p, X_init, seed=sim_seed)
 axes[0].plot(X_opt, alpha=0.7)
 axes[0].set_ylabel("inventory")
 axes[0].set_title("Optimal (VFI)")
@@ -720,8 +733,7 @@ axes[0].set_ylim(0, K + 2)
 # Q-learning snapshots
 for i in range(n_snaps):
     σ_snap = snapshots[i]
-    np.random.seed(sim_seed)
-    X = sim_inventories(ts_length, σ_snap, p, X_init)
+    X = sim_inventories(ts_length, σ_snap, p, X_init, seed=sim_seed)
     axes[i + 1].plot(X, alpha=0.7)
     axes[i + 1].set_ylabel("inventory")
     axes[i + 1].set_title(f"Step {snap_steps[i]:,}")

0 commit comments
