Commit 29762cd

Use \argmin and \argmax macros from _config.yml
1 parent 78a43a9 commit 29762cd

1 file changed: lectures/rs_inventory_q.md (7 additions & 7 deletions)

@@ -515,7 +515,7 @@ standard Q-learning.
 Notice several differences from the risk-neutral case:
 
 - The Q-values are **positive** (expectations of exponentials) rather than signed.
-- The optimal policy is $\sigma(x) = \arg\min_a q(x, a)$ — we **minimize**
+- The optimal policy is $\sigma(x) = \argmin_a q(x, a)$ — we **minimize**
   rather than maximize, because $\phi^{-1}$ is decreasing.
 - The observed profit enters through $\exp(-\gamma R_{t+1})$ rather than
   additively.
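
Why minimizing the Q-values is the right operation here: with $\phi^{-1}(t) = -\tfrac{1}{\gamma}\ln t$, which is what the value-recovery formula in the last hunk uses, smaller positive Q-values map to larger certainty-equivalent values. A tiny self-contained illustration with made-up numbers, not taken from the lecture file:

```python
import numpy as np

gamma = 0.1                                   # hypothetical risk-aversion parameter
q_row = np.array([1.20, 0.85, 0.95])          # made-up positive Q-values q(x, a) for a = 0, 1, 2

# phi^{-1}(t) = -(1/gamma) * ln(t) is strictly decreasing, so the argmin of the
# Q-values and the argmax of the implied values select the same action.
values = -np.log(q_row) / gamma

assert int(q_row.argmin()) == int(values.argmax())   # both select a = 1
```
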
@@ -536,23 +536,23 @@ Our implementation follows the same structure as the risk-neutral Q-learning in
 - Draw demand $D_{t+1}$ and compute observed profit $R_{t+1}$ and next state
   $X_{t+1}$.
 - Compute $\min_{a'} q_t(X_{t+1}, a')$ over feasible actions (this is a
-  scalar for the update target, and the $\arg\min$ action is used by the
+  scalar for the update target, and the $\argmin$ action is used by the
   $\varepsilon$-greedy behavior policy).
 - Update $q_t(x, a)$ using the rule above, with learning rate
   $\alpha_t = 1 / n_t(x, a)^{0.51}$.
 - Choose the next action via $\varepsilon$-greedy: with probability
   $\varepsilon$ pick a random feasible action, otherwise pick the
-  $\arg\min$ action.
+  $\argmin$ action.
 - Decay $\varepsilon$.
 3. **Extract the greedy policy** from the final Q-table via
-   $\sigma(x) = \arg\min_{a \in \Gamma(x)} q(x, a)$.
+   $\sigma(x) = \argmin_{a \in \Gamma(x)} q(x, a)$.
 4. **Compare** the learned policy against the VFI solution.
 
 ### Implementation
 
 We first define a helper to extract the greedy policy from the Q-table.
 
-Since the optimal policy minimizes $q$, we use $\arg\min$ rather than $\arg\max$.
+Since the optimal policy minimizes $q$, we use $\argmin$ rather than $\argmax$.
 
 ```{code-cell} ipython3
 @numba.jit(nopython=True)
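
This hunk cuts off at the opening of the `greedy_policy_from_q_rs` code cell. For readers without the full file, a rough sketch of what such a helper can look like, under the assumption (not stated in the diff) that the feasible set is $\Gamma(x) = \{0, \dots, K - x\}$ and that the Q-table has shape $(K+1) \times (K+1)$:

```python
import numpy as np
import numba

@numba.jit(nopython=True)
def greedy_policy_from_q_rs_sketch(q, K):
    # Sketch only: picks sigma(x) = argmin_{a in Gamma(x)} q(x, a) for each state x,
    # assuming Gamma(x) = {0, ..., K - x} and q of shape (K + 1, K + 1).
    sigma = np.empty(K + 1, dtype=np.int64)
    for x in range(K + 1):
        best_a = 0
        best_val = np.inf
        for a in range(K - x + 1):            # feasible order sizes (assumed)
            if q[x, a] < best_val:
                best_a = a
                best_val = q[x, a]
        sigma[x] = best_a
    return sigma
```
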
@@ -572,7 +572,7 @@ def greedy_policy_from_q_rs(q, K):
 
 The Q-learning loop mirrors the risk-neutral version, with the key changes:
 Q-table initialized to ones, the update target uses $\exp(-\gamma R_{t+1})
-\cdot (\min_{a'} q_t)^\beta$, and the behavior policy follows the $\arg\min$.
+\cdot (\min_{a'} q_t)^\beta$, and the behavior policy follows the $\argmin$.
 
 ```{code-cell} ipython3
 @numba.jit(nopython=True)
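
This hunk only names the changed pieces of the loop (Q-table of ones, multiplicative target, $\argmin$ behavior policy). A minimal sketch of the single-step update those pieces imply, with hypothetical names `q`, `n`, `x`, `a`, `R`, `x_next`; the target $\exp(-\gamma R_{t+1}) \cdot (\min_{a'} q_t)^\beta$ and the learning rate $\alpha_t = 1 / n_t(x, a)^{0.51}$ come from the diff, the rest is assumed:

```python
import numpy as np

def rs_q_update_sketch(q, n, x, a, R, x_next, gamma, beta):
    # Learning rate from the diff: alpha_t = 1 / n_t(x, a)**0.51.
    n[x, a] += 1
    alpha = 1.0 / n[x, a] ** 0.51

    # Risk-sensitive target exp(-gamma * R_{t+1}) * (min_{a'} q_t(X_{t+1}, a'))**beta.
    # (The lecture minimizes over feasible actions only; simplified here to all actions.)
    target = np.exp(-gamma * R) * np.min(q[x_next]) ** beta

    # Standard stochastic-approximation step toward the target (assumed update form).
    q[x, a] = (1.0 - alpha) * q[x, a] + alpha * target
```
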
@@ -657,7 +657,7 @@ We extract the value function and policy from the final Q-table.
 
 Since Q-values represent $\mathbb{E}[\exp(-\gamma(\cdots))]$, we recover the
 value function via $v_Q(x) = -\frac{1}{\gamma} \ln(\min_{a} q(x, a))$ and the
-policy via $\sigma_Q(x) = \arg\min_a q(x, a)$.
+policy via $\sigma_Q(x) = \argmin_a q(x, a)$.
 
 ```{code-cell} ipython3
 K = len(x_values) - 1
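
The recovery step this last hunk describes is short enough to write out. A sketch assuming the final Q-table is a 2-D array with one row per inventory level and one column per action (feasibility restrictions ignored for brevity):

```python
import numpy as np

def recover_value_and_policy_sketch(q, gamma):
    # v_Q(x) = -(1/gamma) * ln(min_a q(x, a))   certainty-equivalent value
    # sigma_Q(x) = argmin_a q(x, a)              greedy risk-sensitive policy
    v = -np.log(q.min(axis=1)) / gamma
    sigma = q.argmin(axis=1)
    return v, sigma
```
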
