Add optimistic initialization to Q-learning lecture (#830)
* Add optimistic initialization to Q-learning lecture
Initialize Q-table to 20 (above true value range of 13-18) instead of
zeros, which drives broader exploration via "optimism in the face of
uncertainty". This speeds convergence enough to reduce the training run
from 20M to 5M steps. Added a new subsection explaining the technique.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* Add optimistic initialization to risk-sensitive Q-learning lecture
Initialize Q-table to 1e-5 (below true Q-value range of ~1e-5 to 1e-4)
instead of ones. Since the optimal policy minimizes q, optimistic means
initializing below the truth — the reverse of the risk-neutral case.
This speeds convergence enough to reduce training from 20M to 5M steps.
Added a subsection explaining the reversed optimistic init logic.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* Add \EE macro to MathJax config
* Restore profit to observation list; fix risk-sensitive optimistic init value and narrative
---------
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Matt McKay <mmcky@users.noreply.github.com>
```diff
+The quantity $\max_{a' \in \Gamma(X_{t+1})} q_t(X_{t+1}, a')$ is just an estimate of the value of
+being in state $X_{t+1}$ under the best possible continuation.
 
-This modified update is a stochastic sample of the Bellman *evaluation* operator
-for $\sigma$. The Q-table then converges to $q^\sigma$ — the Q-function
-associated with the lifetime value of $\sigma$, not the optimal one.
+This scalar enters the update as part of the target value for $q_t(x, a)$.
 
-By contrast, the original update with the $\max$ is a stochastic sample of the
-Bellman *optimality* operator, whose fixed point is $q^*$. The $\max$ in the
-update target is therefore what drives convergence to $q^*$.
+Which action the manager *actually takes* at time $t+1$ is a separate decision.
 
-In short, the $\max$ is doing the work of finding the optimum; without it, you only evaluate a fixed policy.
+In short, the $\max$ is doing the work of finding the optimum; it does not dictate the action that the manager actually takes.
 
 ### The behavior policy
 
-The rule governing how the manager chooses actions is called the **behavior
-policy**. Because the $\max$ in the update target always points toward $q^*$
+The rule governing how the manager chooses actions is called the **behavior policy**.
+
+Because the $\max$ in the update target always points toward $q^*$
 regardless of how the manager selects actions, the behavior policy affects only
 which $(x, a)$ entries get visited — and hence updated — over time.
 
```
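The off-policy character of the update can be sketched in NumPy as follows. This is a minimal illustration, not the lecture's actual code: the function name, the table shapes, and the discount factor `beta` are assumptions; the step-size exponent 0.51 is the one used later in the lecture.

```python
import numpy as np

def q_update(q, n, x, a, r, x_next, feasible_next, beta=0.98):
    """One Q-learning step; `beta` is a hypothetical discount factor.

    The max over feasible next actions is a stochastic sample of the
    Bellman *optimality* operator, so the fixed point is q* regardless
    of which action the behavior policy actually takes at t+1.
    """
    n[x, a] += 1
    alpha = 1.0 / n[x, a] ** 0.51          # step size rule from the lecture
    target = r + beta * max(q[x_next, a2] for a2 in feasible_next)
    q[x, a] += alpha * (target - q[x, a])
```

Note that the update never asks which action is taken at $t+1$; it only asks which action currently looks best from $X_{t+1}$.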
```diff
@@ -545,6 +535,7 @@ We use $\alpha_t = 1 / n_t(x, a)^{0.51}$, where $n_t(x, a)$ is the number of tim
 
 This decays slowly enough to allow learning from later (better-informed) updates, while still satisfying the [Robbins–Monro conditions](https://en.wikipedia.org/wiki/Stochastic_approximation#Robbins%E2%80%93Monro_algorithm) for convergence.
 
+
 ### Exploration: epsilon-greedy
 
 For our behavior policy, we use an $\varepsilon$-greedy strategy:
```
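An $\varepsilon$-greedy behavior policy with a decayed exploration rate can be sketched as below. The floor `eps_min` and decay rate are placeholder values, since the lecture's exact constants are not shown in this excerpt.

```python
import numpy as np

def eps_greedy_action(q, x, feasible, eps, rng):
    """With prob. eps pick a uniformly random feasible action, else the greedy one."""
    if rng.random() < eps:
        return feasible[rng.integers(len(feasible))]
    return max(feasible, key=lambda a: q[x, a])

def decay_eps(eps, eps_min=0.01, decay=0.999):
    """eps_{t+1} = max(eps_min, decay * eps_t); constants are placeholders."""
    return max(eps_min, decay * eps)
```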
```diff
@@ -560,6 +551,16 @@ We decay $\varepsilon$ each step: $\varepsilon_{t+1} = \max(\varepsilon_{\min},\
 
 The stochastic demand shocks naturally drive the manager across different inventory levels, providing exploration over the state space without any artificial resets.
 
+### Optimistic initialization
+
+A simple but powerful technique for accelerating learning is **optimistic initialization**: instead of starting the Q-table at zero, we initialize every entry to a value above the true optimum.
+
+Because every untried action looks optimistically good, the agent is "disappointed" whenever it tries one — the update pulls that entry down toward reality. This drives the agent to try other actions (which still look optimistically high), producing broad exploration of the state-action space early in training.
+
+This idea is sometimes called **optimism in the face of uncertainty** and is widely used in both bandit and reinforcement learning settings.
+
+In our problem, the value function $v^*$ ranges from about 13 to 18. We initialize the Q-table at 20 — modestly above the true maximum — to ensure optimistic exploration without being so extreme as to distort learning.
+
 ### Implementation
 
 We first define a helper to extract the greedy policy from a Q-table.
@@ -587,9 +588,9 @@ At specified step counts (given by `snapshot_steps`), we record the current gree
```
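The two pieces added here, an optimistically initialized Q-table and a helper that reads off the greedy policy, might be sketched as follows. The grid sizes are placeholders, and a real helper would restrict the argmax to the feasible set $\Gamma(x)$.

```python
import numpy as np

n_states, n_actions = 30, 30     # hypothetical grid sizes

# v* lies in roughly [13, 18], so starting every entry at 20 makes each
# untried action look better than anything the agent has learned so far.
q_init = np.full((n_states, n_actions), 20.0)

def greedy_policy(q):
    """Greedy action in each state (ignoring feasibility for brevity)."""
    return np.argmax(q, axis=1)
```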
lectures/rs_inventory_q.md — 24 additions & 11 deletions
```diff
@@ -530,7 +530,7 @@ $X_{t+1}$ — no model knowledge is required.
 Our implementation follows the same structure as the risk-neutral Q-learning in
 {doc}`inventory_q`, with the modifications above:
 
-1. **Initialize** the Q-table $q$ to ones (since Q-values are positive) and
+1. **Initialize** the Q-table $q$ optimistically (see below) and
    visit counts $n$ to zeros.
 2. **At each step:**
    - Draw demand $D_{t+1}$ and compute observed profit $R_{t+1}$ and next state
```
```diff
@@ -548,6 +548,17 @@ Our implementation follows the same structure as the risk-neutral Q-learning in
    $\sigma(x) = \argmin_{a \in \Gamma(x)} q(x, a)$.
 4. **Compare** the learned policy against the VFI solution.
 
+### Optimistic initialization
+
+As in {doc}`inventory_q`, we use optimistic initialization to accelerate learning.
+
+The logic is the same — initialize the Q-table so that every untried action looks attractive, driving the agent to explore broadly — but the direction is reversed.
+
+Since the optimal policy *minimizes* $q$, "optimistic" means initializing the Q-table *below* the true values. When the agent tries an action, the update pushes $q$ upward toward reality, making that entry look worse and prompting the agent to try other actions that still appear optimistically good.
+
+The true Q-values are on the order of $\exp(-\gamma \, v^*) \approx 10^{-8}$ to $10^{-6}$.
+We initialize the Q-table at $10^{-9}$, modestly below this range.
+
 
 ### Implementation
 
 We first define a helper to extract the greedy policy from the Q-table.
```
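The reversed initialization can be sketched in the same style, again with placeholder grid sizes and with the argmin unrestricted by feasibility for brevity.

```python
import numpy as np

n_states, n_actions = 30, 30     # hypothetical grid sizes

# True Q-values are roughly 1e-8 to 1e-6 and the optimal policy minimizes q,
# so "optimistic" now means starting *below* the truth, at 1e-9.
q_init = np.full((n_states, n_actions), 1e-9)

def greedy_policy(q):
    """Risk-sensitive greedy policy: the argmin action in each state."""
    return np.argmin(q, axis=1)
```

An entry only rises above $10^{-9}$ once it has been tried, so every untried action remains the minimizer until visited.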