
Commit d825b92

jstac, claude, and mmcky authored
Add optimistic initialization to Q-learning lecture (#830)
* Add optimistic initialization to Q-learning lecture

  Initialize Q-table to 20 (above true value range of 13-18) instead of zeros, which drives broader exploration via "optimism in the face of uncertainty". This speeds convergence enough to reduce the training run from 20M to 5M steps. Added a new subsection explaining the technique.

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Add optimistic initialization to risk-sensitive Q-learning lecture

  Initialize Q-table to 1e-5 (below true Q-value range of ~1e-5 to 1e-4) instead of ones. Since the optimal policy minimizes q, optimistic means initializing below the truth — the reverse of the risk-neutral case. This speeds convergence enough to reduce training from 20M to 5M steps. Added a subsection explaining the reversed optimistic init logic.

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Add \EE macro to MathJax config

* Restore profit to observation list; fix risk-sensitive optimistic init value and narrative

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Matt McKay <mmcky@users.noreply.github.com>
1 parent ad27613 commit d825b92

File tree

3 files changed: +82 −66 lines changed


lectures/_config.yml

Lines changed: 1 addition & 0 deletions
@@ -109,6 +109,7 @@ sphinx:
     macros:
       "argmax" : ["\\operatorname*{argmax}", 0]
       "argmin" : ["\\operatorname*{argmin}", 0]
+      "EE" : "\\mathbb{E}"
   mathjax_path: https://cdn.jsdelivr.net/npm/mathjax@3/es5/tex-mml-chtml.js
   # Local Redirects
   rediraffe_redirects:

lectures/inventory_q.md

Lines changed: 57 additions & 55 deletions
@@ -37,18 +37,23 @@ We approach the problem in two ways.
 First, we solve it exactly using dynamic programming, assuming full knowledge of
 the model — the demand distribution, cost parameters, and transition dynamics.
 
-Second, we show how a manager can learn the optimal policy from experience alone, using *[Q-learning](https://en.wikipedia.org/wiki/Q-learning)*.
+Second, we show how a manager can learn the optimal policy from experience alone, using [Q-learning](https://en.wikipedia.org/wiki/Q-learning).
 
-The manager observes only the inventory level, the order placed, the resulting
-profit, and the next inventory level — without knowing any of the underlying
-parameters.
+In this setting, we assume that the manager observes only
+
+* the inventory level,
+* the order placed,
+* the resulting profit, and
+* the next inventory level.
+
+The manager knows the interest rate -- and hence the discount factor -- but not any of the other underlying parameters.
 
 A key idea is the *Q-factor* representation, which reformulates the Bellman
 equation so that the optimal policy can be recovered without knowledge of the
-transition function.
+transition dynamics.
 
-We show that, given enough experience, the manager's learned policy converges to
-the optimal one.
+We show that, given enough experience, the
+manager's learned policy converges to the optimal one.
 
 The lecture proceeds as follows:
 

@@ -67,16 +72,18 @@ import matplotlib.pyplot as plt
 from typing import NamedTuple
 ```
 
+
 ## The Model
 
-We study a firm where a manager tries to maximize shareholder value.
+We study a firm where a manager tries to maximize shareholder value by
+controlling inventories.
 
 To simplify the problem, we assume that the firm only sells one product.
 
 Letting $\pi_t$ be profits at time $t$ and $r > 0$ be the interest rate, the value of the firm is
 
 $$
-V_0 = \sum_{t \geq 0} \beta^t \pi_t
+V_0 = \EE \sum_{t \geq 0} \beta^t \pi_t
 \qquad
 \text{ where }
 \quad \beta := \frac{1}{1+r}.
@@ -97,9 +104,9 @@
 $$
 
 The term $A_t$ is units of stock ordered this period, which arrive at the start
-of period $t+1$, after demand $D_{t+1}$ is realized and served.
+of period $t+1$, after demand $D_{t+1}$ is realized and served:
 
-**Timeline for period $t$:** observe $X_t$ → choose $A_t$ → demand $D_{t+1}$ arrives → profit realized → $X_{t+1}$ determined.
+* observe $X_t$ → choose $A_t$ → demand $D_{t+1}$ arrives → profit realized → $X_{t+1}$ determined.
 
 (We use a $t$ subscript in $A_t$ to indicate the information set: it is chosen
 before $D_{t+1}$ is observed.)
@@ -115,7 +122,7 @@
 Here
 
 * the sales price is set to unity (for convenience)
-* revenue is the minimum of current stock and demand because orders in excess of inventory are lost rather than back-filled
+* revenue is the minimum of current stock and demand because orders in excess of inventory are lost (not back-filled)
 * $c$ is unit product cost and $\kappa$ is a fixed cost of ordering inventory
 
 We can map our inventory problem into a dynamic program with state space $\mathsf X := \{0, \ldots, K\}$ and action space $\mathsf A := \mathsf X$.
@@ -463,9 +470,10 @@ The manager does not need to know the demand distribution $\phi$, the unit cost
 All the manager needs to observe at each step is:
 
 1. the current inventory level $x$,
-2. the order quantity $a$ they chose,
-3. the resulting profit $R_{t+1}$ (which appears on the books), and
-4. the next inventory level $X_{t+1}$ (which they can read off the warehouse).
+2. the order quantity $a$, which they choose,
+3. the resulting profit $R_{t+1}$ (which appears on the books),
+4. the discount factor $\beta$, which is determined by the interest rate, and
+5. the next inventory level $X_{t+1}$ (which they can read off the warehouse).
 
 These are all directly observable quantities — no model knowledge is required.
 

@@ -480,47 +488,29 @@ a)$ for every state-action pair $(x, a)$.
 
 At each step, the manager is in some state $x$ and must choose a specific action
 $a$ to take. Whichever $a$ is chosen, the manager observes profit $R_{t+1}$
-and next state $X_{t+1}$, and updates **that one entry** $q_t(x, a)$ of the
+and next state $X_{t+1}$, and updates *that one entry* $q_t(x, a)$ of the
 table using the rule above.
 
-**The max computes a value, not an action.**
-
 It is tempting to read the $\max_{a'}$ in the update rule as prescribing the
 manager's next action — that is, to interpret the update as saying "move to
-state $X_{t+1}$ and take action $\argmax_{a'} q_t(X_{t+1}, a')$."
+state $X_{t+1}$ and take an action in $\argmax_{a'} q_t(X_{t+1}, a')$."
 
-But the $\max$ plays a different role. The quantity $\max_{a' \in
-\Gamma(X_{t+1})} q_t(X_{t+1}, a')$ is a **scalar** — it estimates the value of
-being in state $X_{t+1}$ under the best possible continuation. This scalar
-enters the update as part of the target value for $q_t(x, a)$.
+But the $\max$ plays a different role.
 
-Which action the manager *actually takes* at state $X_{t+1}$ is a separate
-decision entirely.
-
-To see why this distinction matters, consider what happens if we modify the
-update rule by replacing the $\max$ with evaluation under a fixed feasible
-policy $\sigma$:
-
-$$
-q_{t+1}(x, a)
-= (1 - \alpha_t) q_t(x, a) +
-\alpha_t \left(R_{t+1} + \beta \, q_t(X_{t+1}, \sigma(X_{t+1}))\right).
-$$
+The quantity $\max_{a' \in \Gamma(X_{t+1})} q_t(X_{t+1}, a')$ is just an estimate of the value of
+being in state $X_{t+1}$ under the best possible continuation.
 
-This modified update is a stochastic sample of the Bellman *evaluation* operator
-for $\sigma$. The Q-table then converges to $q^\sigma$ — the Q-function
-associated with the lifetime value of $\sigma$, not the optimal one.
+This scalar enters the update as part of the target value for $q_t(x, a)$.
 
-By contrast, the original update with the $\max$ is a stochastic sample of the
-Bellman *optimality* operator, whose fixed point is $q^*$. The $\max$ in the
-update target is therefore what drives convergence to $q^*$.
+Which action the manager *actually takes* at time $t+1$ is a separate decision.
 
-In short, the $\max$ is doing the work of finding the optimum; without it, you only evaluate a fixed policy.
+In short, the $\max$ is doing the work of finding the optimum; it does not dictate the action that the manager actually takes.
 
 ### The behavior policy
 
-The rule governing how the manager chooses actions is called the **behavior
-policy**. Because the $\max$ in the update target always points toward $q^*$
+The rule governing how the manager chooses actions is called the **behavior policy**.
+
+Because the $\max$ in the update target always points toward $q^*$
 regardless of how the manager selects actions, the behavior policy affects only
 which $(x, a)$ entries get visited — and hence updated — over time.
 
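The role of the $\max$ discussed in this hunk can be sketched in isolation. The following is a hypothetical toy example, not code from the lecture; the `feasible` callable stands in for the correspondence $\Gamma$, and the state/action sizes are made up:

```python
import numpy as np

def q_update(q, x, a, reward, x_next, feasible, α, β):
    """One tabular Q-learning update of the single entry q[x, a].

    The max over feasible actions at x_next supplies a *scalar* value
    estimate for the update target; it does not choose the next action.
    """
    target = reward + β * max(q[x_next, a2] for a2 in feasible(x_next))
    q[x, a] = (1 - α) * q[x, a] + α * target
    return q

# Toy usage: 3 states, 3 actions, every action feasible everywhere
q = np.zeros((3, 3))
feasible = lambda x: range(3)
q = q_update(q, x=0, a=1, reward=1.0, x_next=2,
             feasible=feasible, α=0.5, β=0.9)
print(q[0, 1])  # 0.5 * (1.0 + 0.9 * 0.0) = 0.5
```

Only the visited entry `q[0, 1]` changes; which action the agent takes next is left to the behavior policy.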

@@ -545,6 +535,7 @@ We use $\alpha_t = 1 / n_t(x, a)^{0.51}$, where $n_t(x, a)$ is the number of tim
 
 This decays slowly enough to allow learning from later (better-informed) updates, while still satisfying the [Robbins–Monro conditions](https://en.wikipedia.org/wiki/Stochastic_approximation#Robbins%E2%80%93Monro_algorithm) for convergence.
 
+
 ### Exploration: epsilon-greedy
 
 For our behavior policy, we use an $\varepsilon$-greedy strategy:
@@ -560,6 +551,16 @@ We decay $\varepsilon$ each step: $\varepsilon_{t+1} = \max(\varepsilon_{\min},\
 
 The stochastic demand shocks naturally drive the manager across different inventory levels, providing exploration over the state space without any artificial resets.
 
+### Optimistic initialization
+
+A simple but powerful technique for accelerating learning is **optimistic initialization**: instead of starting the Q-table at zero, we initialize every entry to a value above the true optimum.
+
+Because every untried action looks optimistically good, the agent is "disappointed" whenever it tries one — the update pulls that entry down toward reality. This drives the agent to try other actions (which still look optimistically high), producing broad exploration of the state-action space early in training.
+
+This idea is sometimes called **optimism in the face of uncertainty** and is widely used in both bandit and reinforcement learning settings.
+
+In our problem, the value function $v^*$ ranges from about 13 to 18. We initialize the Q-table at 20 — modestly above the true maximum — to ensure optimistic exploration without being so extreme as to distort learning.
+
 ### Implementation
 
 We first define a helper to extract the greedy policy from a Q-table.
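The mechanism added in this hunk can be seen in a stripped-down setting. This is a hypothetical 3-armed bandit, not the lecture's code; `true_means` and the init value 5.0 are invented for illustration:

```python
import numpy as np

# Optimistic initialization in a toy bandit: a purely greedy agent still
# samples every arm early, because each untried arm looks best.
rng = np.random.default_rng(0)
true_means = np.array([1.0, 2.0, 3.0])   # hypothetical true values, max 3
q = np.full(3, 5.0)                      # initialized above the true range
counts = np.zeros(3)

for t in range(300):
    a = np.argmax(q)                     # greedy: optimism does the exploring
    counts[a] += 1
    r = true_means[a] + rng.normal()
    q[a] += (r - q[a]) / counts[a]       # sample-average update pulls q down

print(counts)  # every arm gets tried at least once
```

Each pull "disappoints" the chosen arm, so the greedy rule rotates through all arms before typically settling on the best one, mirroring the exploration effect described above.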
@@ -587,9 +588,9 @@ At specified step counts (given by `snapshot_steps`), we record the current gree
 ```{code-cell} ipython3
 @numba.jit(nopython=True)
 def q_learning_kernel(K, p, c, κ, β, n_steps, X_init,
-                      ε_init, ε_min, ε_decay, snapshot_steps, seed):
+                      ε_init, ε_min, ε_decay, q_init, snapshot_steps, seed):
     np.random.seed(seed)
-    q = np.zeros((K + 1, K + 1))
+    q = np.full((K + 1, K + 1), q_init)
     n = np.zeros((K + 1, K + 1))  # visit counts for learning rate
     ε = ε_init
 
@@ -642,22 +643,21 @@ The wrapper function unpacks the model and provides default hyperparameters.
 ```{code-cell} ipython3
 def q_learning(model, n_steps=20_000_000, X_init=0,
                ε_init=1.0, ε_min=0.01, ε_decay=0.999999,
-               snapshot_steps=None, seed=1234):
+               q_init=20.0, snapshot_steps=None, seed=1234):
     x_values, d_values, ϕ_values, p, c, κ, β = model
     K = len(x_values) - 1
     if snapshot_steps is None:
         snapshot_steps = np.array([], dtype=np.int64)
     return q_learning_kernel(K, p, c, κ, β, n_steps, X_init,
-                             ε_init, ε_min, ε_decay, snapshot_steps, seed)
+                             ε_init, ε_min, ε_decay, q_init, snapshot_steps, seed)
 ```
 
-### Running Q-learning
-
-We run 20 million steps and take policy snapshots at steps 10,000, 1,000,000, and at the end.
+Next we run $n$ = 5 million steps and take policy snapshots at steps 10,000, 1,000,000, and $n$.
 
 ```{code-cell} ipython3
-snap_steps = np.array([10_000, 1_000_000, 19_999_999], dtype=np.int64)
-q, snapshots = q_learning(model, snapshot_steps=snap_steps)
+n = 5_000_000
+snap_steps = np.array([10_000, 1_000_000, n], dtype=np.int64)
+q, snapshots = q_learning(model, n_steps=n+1, snapshot_steps=snap_steps)
 ```
 
 ### Comparing with the exact solution
@@ -710,9 +710,11 @@ All panels use the **same demand sequence** (via a fixed random seed), so differ
 
 The top panel shows the optimal policy from VFI for reference.
 
-After only 10,000 steps the agent has barely explored and its policy is poor.
+After 10,000 steps the agent has barely explored and its policy is poor.
+
+By 1,000,000 steps the policy has improved but still differs noticeably from the optimum.
 
-By step 20 million, the learned policy produces inventory dynamics that closely resemble the S-s pattern of the optimal solution.
+By step 5 million, the learned policy produces inventory dynamics that closely resemble the S-s pattern of the optimal solution.
 
 ```{code-cell} ipython3
 ts_length = 200

lectures/rs_inventory_q.md

Lines changed: 24 additions & 11 deletions
@@ -530,7 +530,7 @@ $X_{t+1}$ — no model knowledge is required.
 Our implementation follows the same structure as the risk-neutral Q-learning in
 {doc}`inventory_q`, with the modifications above:
 
-1. **Initialize** the Q-table $q$ to ones (since Q-values are positive) and
+1. **Initialize** the Q-table $q$ optimistically (see below) and
    visit counts $n$ to zeros.
 2. **At each step:**
    - Draw demand $D_{t+1}$ and compute observed profit $R_{t+1}$ and next state
@@ -548,6 +548,17 @@ Our implementation follows the same structure as the risk-neutral Q-learning in
    $\sigma(x) = \argmin_{a \in \Gamma(x)} q(x, a)$.
 4. **Compare** the learned policy against the VFI solution.
 
+### Optimistic initialization
+
+As in {doc}`inventory_q`, we use optimistic initialization to accelerate learning.
+
+The logic is the same — initialize the Q-table so that every untried action looks attractive, driving the agent to explore broadly — but the direction is reversed.
+
+Since the optimal policy *minimizes* $q$, "optimistic" means initializing the Q-table *below* the true values. When the agent tries an action, the update pushes $q$ upward toward reality, making that entry look worse and prompting the agent to try other actions that still appear optimistically good.
+
+The true Q-values are on the order of $\exp(-\gamma \, v^*) \approx 10^{-8}$ to $10^{-6}$.
+We initialize the Q-table at $10^{-9}$, modestly below this range.
+
 ### Implementation
 
 We first define a helper to extract the greedy policy from the Q-table.
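The reversed direction of optimism added in this hunk can be sketched with a toy minimization bandit. This is a hypothetical example, not the lecture's code; `true_costs` and the init value 0.1 are invented for illustration:

```python
import numpy as np

# When the agent *minimizes* q, "optimistic" means starting *below* the
# true values, so every untried action looks attractively cheap.
rng = np.random.default_rng(0)
true_costs = np.array([3.0, 2.0, 1.0])   # hypothetical true costs, min 1
q = np.full(3, 0.1)                      # initialized below the true range
counts = np.zeros(3)

for t in range(300):
    a = np.argmin(q)                     # greedy minimizer
    counts[a] += 1
    c = true_costs[a] + 0.1 * rng.normal()
    q[a] += (c - q[a]) / counts[a]       # update pushes q up toward reality

print(counts)  # every action sampled at least once
```

Each trial pushes the sampled entry up, so the greedy minimizer rotates through all actions before concentrating on the cheapest one: the mirror image of the max case.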
@@ -571,15 +582,15 @@ def greedy_policy_from_q_rs(q, K):
 ```
 
 The Q-learning loop mirrors the risk-neutral version, with the key changes:
-Q-table initialized to ones, the update target uses $\exp(-\gamma R_{t+1})
+the update target uses $\exp(-\gamma R_{t+1})
 \cdot (\min_{a'} q_t)^\beta$, and the behavior policy follows the $\argmin$.
 
 ```{code-cell} ipython3
 @numba.jit(nopython=True)
 def q_learning_rs_kernel(K, p, c, κ, β, γ, n_steps, X_init,
-                         ε_init, ε_min, ε_decay, snapshot_steps, seed):
+                         ε_init, ε_min, ε_decay, q_init, snapshot_steps, seed):
     np.random.seed(seed)
-    q = np.ones((K + 1, K + 1))  # positive Q-values, initialized to 1
+    q = np.full((K + 1, K + 1), q_init)  # optimistic initialization
     n = np.zeros((K + 1, K + 1))  # visit counts for learning rate
     ε = ε_init
@@ -633,22 +644,23 @@ The wrapper function unpacks the model and provides default hyperparameters.
 ```{code-cell} ipython3
 def q_learning_rs(model, n_steps=20_000_000, X_init=0,
                   ε_init=1.0, ε_min=0.01, ε_decay=0.999999,
-                  snapshot_steps=None, seed=1234):
+                  q_init=1e-9, snapshot_steps=None, seed=1234):
     x_values, d_values, ϕ_values, p, c, κ, β, γ = model
     K = len(x_values) - 1
     if snapshot_steps is None:
         snapshot_steps = np.array([], dtype=np.int64)
     return q_learning_rs_kernel(K, p, c, κ, β, γ, n_steps, X_init,
-                                ε_init, ε_min, ε_decay, snapshot_steps, seed)
+                                ε_init, ε_min, ε_decay, q_init, snapshot_steps, seed)
 ```
 
 ### Running Q-learning
 
-We run 20 million steps and take policy snapshots at steps 10,000, 1,000,000, and at the end.
+We run $n$ = 5 million steps and take policy snapshots at steps 10,000, 1,000,000, and $n$.
 
 ```{code-cell} ipython3
-snap_steps = np.array([10_000, 1_000_000, 19_999_999], dtype=np.int64)
-q_table, snapshots = q_learning_rs(model, snapshot_steps=snap_steps)
+n = 5_000_000
+snap_steps = np.array([10_000, 1_000_000, n], dtype=np.int64)
+q_table, snapshots = q_learning_rs(model, n_steps=n+1, snapshot_steps=snap_steps)
 ```
 
 ### Comparing with the exact solution
@@ -731,8 +743,9 @@ plt.show()
 
 After 10,000 steps, the agent has barely explored and its policy is erratic.
 
-By 1,000,000 steps the learned policy begins to resemble the optimal one, and
-by step 20 million the inventory dynamics are nearly indistinguishable from the
+By 1,000,000 steps the learned policy has improved but still differs noticeably from the optimum.
+
+By step 5 million the inventory dynamics are nearly indistinguishable from the
 VFI solution.
 
 Note that the converged policy maintains lower inventory levels than in the
