
Commit ad27613

jstac, claude, and mmcky authored
Add risk-sensitive inventory management lecture (#827)
* Add risk-sensitive inventory management lecture and improve inventory_q
* Add rs_inventory_q to table of contents
* Remove generated ipynb and py files from tracking: the build system generates notebooks from the .md source, and having .ipynb and .py files in the repo causes duplicate document warnings and cross-reference failures.
* Fix cross-references: use {doc} instead of {ref} for lecture links. The {ref} directive requires a label placed before a heading to resolve a title; the inventory_q label is before a raw block, so {ref} fails. Using {doc} links to the document directly.
* Use \argmin and \argmax macros from _config.yml
* Rename risk-sensitivity function from φ to ψ to avoid notation conflict: φ was overloaded, used for both the demand PMF and the risk-sensitivity transformation. Now ψ(t) = exp(-γt) is the risk transformation and φ(d) remains the demand PMF, consistent with inventory_q.md.
* Add forward ref from inventory_q and fix minor grammar: add a cross-reference from inventory_q.md to rs_inventory_q.md, and fix a missing comma after an introductory phrase.
* Move random seeds inside jitted functions for reproducibility: Numba JIT functions use their own RNG state, so np.random.seed() called outside a jitted function has no effect on random draws inside it. Seeds are now passed as parameters to the sim_inventories and q_learning kernels. Fixes issue noted by @HumphreyYang.

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Matt McKay <mmcky@users.noreply.github.com>
1 parent b05460d commit ad27613

File tree

3 files changed: +806 −34 lines changed

lectures/_toc.yml

Lines changed: 1 addition & 0 deletions
@@ -77,6 +77,7 @@ parts:
   numbered: true
   chapters:
   - file: inventory_q
+  - file: rs_inventory_q
   - file: mccall_q
 - caption: Introduction to Optimal Savings
   numbered: true

lectures/inventory_q.md

Lines changed: 46 additions & 34 deletions
@@ -37,7 +37,7 @@ We approach the problem in two ways.
 First, we solve it exactly using dynamic programming, assuming full knowledge of
 the model — the demand distribution, cost parameters, and transition dynamics.
 
-Second, we show how a manager can learn the optimal policy from experience alone, using *Q-learning*.
+Second, we show how a manager can learn the optimal policy from experience alone, using *[Q-learning](https://en.wikipedia.org/wiki/Q-learning)*.
 
 The manager observes only the inventory level, the order placed, the resulting
 profit, and the next inventory level — without knowing any of the underlying
@@ -50,6 +50,14 @@ transition function.
 We show that, given enough experience, the manager's learned policy converges to
 the optimal one.
 
+The lecture proceeds as follows:
+
+1. We set up the inventory model and solve it exactly via value function iteration.
+2. We introduce Q-factors and derive the Q-factor Bellman equation.
+3. We implement Q-learning and show the learned policy converges to the optimal one.
+
+A risk-sensitive extension of this model is studied in {doc}`rs_inventory_q`.
+
 We will use the following imports:
 
 ```{code-cell} ipython3
@@ -85,12 +93,14 @@ $$
 \qquad
 \text{where}
 \quad
-h(x,a,d) := (x - d) \vee 0 + a.
+h(x,a,d) := \max(x - d, 0) + a.
 $$
 
 The term $A_t$ is units of stock ordered this period, which arrive at the start
 of period $t+1$, after demand $D_{t+1}$ is realized and served.
 
+**Timeline for period $t$:** observe $X_t$ → choose $A_t$ → demand $D_{t+1}$ arrives → profit realized → $X_{t+1}$ determined.
+
 (We use a $t$ subscript in $A_t$ to indicate the information set: it is chosen
 before $D_{t+1}$ is observed.)
 
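As a quick sanity check on the transition rule, the new form of $h$ can be written directly in Python (a standalone sketch, not part of the lecture's code):

```python
def h(x, a, d):
    """Next-period inventory: unsold stock carried over, plus the order placed."""
    return max(x - d, 0) + a

# Start with 5 units, order 2, demand 3: carry over 2, receive 2
print(h(5, 2, 3))   # 4
# Demand exceeds stock and nothing is ordered: inventory hits zero
print(h(1, 0, 4))   # 0
```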
@@ -99,7 +109,7 @@ We assume that the firm can store at most $K$ items at one time.
 Profits are given by
 
 $$
-\pi(X_t, A_t, D_{t+1}) := X_t \wedge D_{t+1} - c A_t - \kappa 1\{A_t > 0\}.
+\pi(X_t, A_t, D_{t+1}) := \min(X_t, D_{t+1}) - c A_t - \kappa 1\{A_t > 0\}.
 $$
 
 Here
@@ -158,7 +168,7 @@ $$
 \sum_d \phi(d) \left[ \pi(x, a, d) + \beta \, v(h(x, a, d)) \right]
 $$
 
-When $r > 0$, the sequence $v_{k+1} = T v_k$ converges to the
+When $r > 0$ (equivalently, $\beta < 1$), the sequence $v_{k+1} = T v_k$ converges to the
 unique fixed point $v^*$, which is the value function of the optimal policy
 (see, e.g., {cite}`Sargent_Stachurski_2025`).
 
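For readers who want the fixed-point iteration spelled out, here is a self-contained NumPy sketch of applying $T$ under a geometric demand PMF, with the maximization over feasible orders folded in. All parameter values and the truncation level are illustrative, not the lecture's calibration:

```python
import numpy as np

# Illustrative parameters — not the lecture's calibration
K, p, c, κ, β = 10, 0.5, 0.2, 0.5, 0.98
d_vals = np.arange(101)                 # truncate the geometric support at 100
φ = (1 - p)**d_vals * p                 # φ(d) = (1 - p)^d p

def T(v):
    """One step of value function iteration over states x = 0, ..., K."""
    v_new = np.empty(K + 1)
    for x in range(K + 1):
        best = -np.inf
        for a in range(K - x + 1):      # feasible orders respect capacity: a ≤ K - x
            π = np.minimum(x, d_vals) - c * a - κ * (a > 0)
            h = np.maximum(x - d_vals, 0) + a      # next states, all ≤ K
            best = max(best, φ @ (π + β * v[h]))
        v_new[x] = best
    return v_new

v = np.zeros(K + 1)
for _ in range(600):                    # β^600 ≈ 5e-6, effectively converged
    v = T(v)
```

Since $a \le K - x$ and carried-over stock never exceeds $x$, every next state lands in $\{0, \ldots, K\}$, so `v[h]` needs no clipping.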
@@ -167,7 +177,7 @@ unique fixed point $v^*$, which is the value function of the optimal policy
 
 We store the model primitives in a `NamedTuple`.
 
-Demand follows a geometric distribution with parameter $p$, so $\phi(d) = (1 - p)^d \, p$ for $d = 0, 1, 2, \ldots$.
+Demand follows a [geometric distribution](https://en.wikipedia.org/wiki/Geometric_distribution) with parameter $p$, so $\phi(d) = (1 - p)^d \, p$ for $d = 0, 1, 2, \ldots$.
 
 ```{code-cell} ipython3
 class Model(NamedTuple):
@@ -338,8 +348,9 @@ At each step, we draw a demand shock from the geometric distribution and update
 
 ```{code-cell} ipython3
 @numba.jit(nopython=True)
-def sim_inventories(ts_length, σ, p, X_init=0):
+def sim_inventories(ts_length, σ, p, X_init=0, seed=0):
     """Simulate inventory dynamics under policy σ."""
+    np.random.seed(seed)
     X = np.zeros(ts_length, dtype=np.int32)
     X[0] = X_init
     for t in range(ts_length - 1):
@@ -427,7 +438,7 @@ transition function.
 
 ### The Q-learning update rule
 
-Q-learning approximates the fixed point of the Q-factor Bellman equation using **stochastic approximation**.
+Q-learning approximates the fixed point of the Q-factor Bellman equation using **[stochastic approximation](https://en.wikipedia.org/wiki/Stochastic_approximation)**.
 
 At each step, the agent is in state $x$, takes action $a$, observes reward
 $R_{t+1} = \pi(x, a, D_{t+1})$ and next state $X_{t+1} = h(x, a, D_{t+1})$, and
@@ -443,8 +454,23 @@ where $\alpha_t$ is the learning rate.
 
 The update blends the current estimate $q_t(x, a)$ with a fresh sample of the Bellman target.
 
+### What the manager needs to know
+
+Notice what is **not** required to implement the update.
+
+The manager does not need to know the demand distribution $\phi$, the unit cost $c$, the fixed cost $\kappa$, or the transition function $h$.
+
+All the manager needs to observe at each step is:
+
+1. the current inventory level $x$,
+2. the order quantity $a$ they chose,
+3. the resulting profit $R_{t+1}$ (which appears on the books), and
+4. the next inventory level $X_{t+1}$ (which they can read off the warehouse).
+
+These are all directly observable quantities — no model knowledge is required.
 
-### The Q-table and the behavior policy
+
+### The Q-table and the role of the max
 
 It is important to understand how the update rule relates to the manager's
 actions.
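As a standalone sketch, one Q-learning step looks like this (a hypothetical helper, not the lecture's jitted kernel; the learning rate and the max over feasible orders follow the text, while parameter values are placeholders):

```python
import numpy as np

def q_update(q, n, x, a, reward, x_next, K, β=0.98):
    """One Q-learning step: move q(x, a) toward the sampled Bellman target."""
    n[x, a] += 1
    α = 1 / n[x, a]**0.51                     # learning rate from the text
    # max over feasible next-period orders a' ∈ {0, ..., K - x_next}
    target = reward + β * np.max(q[x_next, :K - x_next + 1])
    q[x, a] += α * (target - q[x, a])
    return q, n
```

Replacing the `max` with the value of a fixed policy's action turns the same recursion into policy evaluation, which is the distinction discussed next.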
@@ -489,7 +515,9 @@ By contrast, the original update with the $\max$ is a stochastic sample of the
 Bellman *optimality* operator, whose fixed point is $q^*$. The $\max$ in the
 update target is therefore what drives convergence to $q^*$.
 
-**The behavior policy.**
+In short, the $\max$ is doing the work of finding the optimum; without it, you only evaluate a fixed policy.
+
+### The behavior policy
 
 The rule governing how the manager chooses actions is called the **behavior
 policy**. Because the $\max$ in the update target always points toward $q^*$
@@ -511,26 +539,11 @@ In practice, we want the manager to mostly take good actions (to earn reasonable
 profits while learning), while still occasionally experimenting to discover
 better alternatives.
 
-### What the manager needs to know
-
-Notice what is **not** required to implement the update.
-
-The manager does not need to know the demand distribution $\phi$, the unit cost $c$, the fixed cost $\kappa$, or the transition function $h$.
-
-All the manager needs to observe at each step is:
-
-1. the current inventory level $x$,
-2. the order quantity $a$ they chose,
-3. the resulting profit $R_{t+1}$ (which appears on the books), and
-4. the next inventory level $X_{t+1}$ (which they can read off the warehouse).
-
-These are all directly observable quantities — no model knowledge is required.
-
 ### Learning rate
 
 We use $\alpha_t = 1 / n_t(x, a)^{0.51}$, where $n_t(x, a)$ is the number of times the pair $(x, a)$ has been visited up to time $t$.
 
-This decays slowly enough to allow learning from later (better-informed) updates, while still satisfying the Robbins-Monro conditions for convergence.
+This decays slowly enough to allow learning from later (better-informed) updates, while still satisfying the [Robbins–Monro conditions](https://en.wikipedia.org/wiki/Stochastic_approximation#Robbins%E2%80%93Monro_algorithm) for convergence.
 
 ### Exploration: epsilon-greedy
 
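The learning-rate schedule above and the ε-greedy rule named in this heading can be sketched together as standalone helpers (hypothetical names, not the lecture's jitted code; the feasible-action bound follows the text):

```python
import numpy as np

def α(n_visits):
    """Learning rate 1 / n^0.51: Σ α = ∞ and Σ α² < ∞, as Robbins–Monro requires."""
    return 1.0 / n_visits**0.51

def ε_greedy_action(q, x, K, ε, rng):
    """With probability ε explore uniformly; otherwise act greedily among feasible orders."""
    n_feasible = K - x + 1                   # feasible orders: a ∈ {0, ..., K - x}
    if rng.random() < ε:
        return int(rng.integers(n_feasible))  # uniform exploration
    return int(np.argmax(q[x, :n_feasible]))  # exploit current estimates
```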
@@ -574,7 +587,8 @@ At specified step counts (given by `snapshot_steps`), we record the current gree
 ```{code-cell} ipython3
 @numba.jit(nopython=True)
 def q_learning_kernel(K, p, c, κ, β, n_steps, X_init,
-                      ε_init, ε_min, ε_decay, snapshot_steps):
+                      ε_init, ε_min, ε_decay, snapshot_steps, seed):
+    np.random.seed(seed)
     q = np.zeros((K + 1, K + 1))
     n = np.zeros((K + 1, K + 1))  # visit counts for learning rate
     ε = ε_init
@@ -593,7 +607,7 @@ def q_learning_kernel(K, p, c, κ, β, n_steps, X_init,
             snapshots[snap_idx] = greedy_policy_from_q(q, K)
             snap_idx += 1
 
-        # === Observe outcome ===
+        # === Draw D_{t+1} and observe outcome ===
         d = np.random.geometric(p) - 1
         reward = min(x, d) - c * a - κ * (a > 0)
         x_next = max(x - d, 0) + a
@@ -628,21 +642,20 @@ The wrapper function unpacks the model and provides default hyperparameters.
 ```{code-cell} ipython3
 def q_learning(model, n_steps=20_000_000, X_init=0,
                ε_init=1.0, ε_min=0.01, ε_decay=0.999999,
-               snapshot_steps=None):
+               snapshot_steps=None, seed=1234):
     x_values, d_values, ϕ_values, p, c, κ, β = model
     K = len(x_values) - 1
     if snapshot_steps is None:
         snapshot_steps = np.array([], dtype=np.int64)
     return q_learning_kernel(K, p, c, κ, β, n_steps, X_init,
-                             ε_init, ε_min, ε_decay, snapshot_steps)
+                             ε_init, ε_min, ε_decay, snapshot_steps, seed)
 ```
 
 ### Running Q-learning
 
 We run 20 million steps and take policy snapshots at steps 10,000, 1,000,000, and at the end.
 
 ```{code-cell} ipython3
-np.random.seed(1234)
 snap_steps = np.array([10_000, 1_000_000, 19_999_999], dtype=np.int64)
 q, snapshots = q_learning(model, snapshot_steps=snap_steps)
 ```
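A quick check on the exploration schedule implied by these defaults: with multiplicative decay, $\varepsilon_t = \max(\varepsilon_{\min}, \varepsilon_{\text{init}} \cdot \text{decay}^t)$, so $\varepsilon$ reaches its floor of 0.01 roughly a quarter of the way through the 20-million-step run (a back-of-the-envelope sketch using the default values above):

```python
import math

ε_init, ε_min, ε_decay = 1.0, 0.01, 0.999999
steps_to_floor = math.log(ε_min / ε_init) / math.log(ε_decay)
print(f"{steps_to_floor:,.0f}")   # ≈ 4.6 million steps
```

After that point the manager explores at the fixed 1% rate for the remaining ~15 million steps.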
@@ -661,6 +674,7 @@ and compare them against $v^*$ and $\sigma^*$ from VFI.
 
 ```{code-cell} ipython3
 K = len(x_values) - 1
+# restrict to feasible actions a ∈ {0, ..., K-x}
 v_q = np.array([np.max(q[x, :K - x + 1]) for x in range(K + 1)])
 σ_q = np.array([np.argmax(q[x, :K - x + 1]) for x in range(K + 1)])
 ```
@@ -710,8 +724,7 @@ X_init = K // 2
 sim_seed = 5678
 
 # Optimal policy
-np.random.seed(sim_seed)
-X_opt = sim_inventories(ts_length, σ_star, p, X_init)
+X_opt = sim_inventories(ts_length, σ_star, p, X_init, seed=sim_seed)
 axes[0].plot(X_opt, alpha=0.7)
 axes[0].set_ylabel("inventory")
 axes[0].set_title("Optimal (VFI)")
@@ -720,8 +733,7 @@ axes[0].set_ylim(0, K + 2)
 # Q-learning snapshots
 for i in range(n_snaps):
     σ_snap = snapshots[i]
-    np.random.seed(sim_seed)
-    X = sim_inventories(ts_length, σ_snap, p, X_init)
+    X = sim_inventories(ts_length, σ_snap, p, X_init, seed=sim_seed)
     axes[i + 1].plot(X, alpha=0.7)
     axes[i + 1].set_ylabel("inventory")
     axes[i + 1].set_title(f"Step {snap_steps[i]:,}")

0 commit comments
