@@ -515,7 +515,7 @@ standard Q-learning.
 Notice several differences from the risk-neutral case:
 
 - The Q-values are **positive** (expectations of exponentials) rather than signed.
-- The optimal policy is $\sigma(x) = \arg\min_a q(x, a)$ — we **minimize**
+- The optimal policy is $\sigma(x) = \argmin_a q(x, a)$ — we **minimize**
   rather than maximize, because $\phi^{-1}$ is decreasing.
 - The observed profit enters through $\exp(-\gamma R_{t+1})$ rather than
   additively.
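
A one-line check of why minimization is the right operation here, using the value-recovery formula that appears later in this diff and assuming $\phi^{-1}(u) = -\frac{1}{\gamma}\ln u$ (which $v_Q(x) = -\frac{1}{\gamma}\ln(\min_a q(x, a))$ implies):

$$
\max_{a \in \Gamma(x)} \phi^{-1}\bigl(q(x, a)\bigr)
= \phi^{-1}\Bigl(\min_{a \in \Gamma(x)} q(x, a)\Bigr),
\qquad
\phi^{-1}(u) = -\tfrac{1}{\gamma} \ln u ,
$$

so the action that maximizes value is exactly the one that minimizes $q$.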
@@ -536,23 +536,23 @@ Our implementation follows the same structure as the risk-neutral Q-learning in
    - Draw demand $D_{t+1}$ and compute observed profit $R_{t+1}$ and next state
      $X_{t+1}$.
    - Compute $\min_{a'} q_t(X_{t+1}, a')$ over feasible actions (this is a
-     scalar for the update target, and the $\arg\min$ action is used by the
+     scalar for the update target, and the $\argmin$ action is used by the
      $\varepsilon$-greedy behavior policy).
    - Update $q_t(x, a)$ using the rule above, with learning rate
      $\alpha_t = 1 / n_t(x, a)^{0.51}$.
    - Choose the next action via $\varepsilon$-greedy: with probability
      $\varepsilon$ pick a random feasible action, otherwise pick the
-     $\arg\min$ action.
+     $\argmin$ action.
    - Decay $\varepsilon$.
 3. **Extract the greedy policy** from the final Q-table via
-   $\sigma(x) = \arg\min_{a \in \Gamma(x)} q(x, a)$.
+   $\sigma(x) = \argmin_{a \in \Gamma(x)} q(x, a)$.
 4. **Compare** the learned policy against the VFI solution.
 
 ### Implementation
 
 We first define a helper to extract the greedy policy from the Q-table.
 
-Since the optimal policy minimizes $q$, we use $\arg\min$ rather than $\arg\max$.
+Since the optimal policy minimizes $q$, we use $\argmin$ rather than $\argmax$.
 
 ```{code-cell} ipython3
 @numba.jit(nopython=True)
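
As a rough illustration of the loop steps listed in the hunk above, here is a minimal sketch of the update in plain Python. The target $\exp(-\gamma R_{t+1})\,(\min_{a'} q_t(X_{t+1}, a'))^\beta$ and the learning rate $1 / n_t(x, a)^{0.51}$ come from the text; the variable names and the stochastic-approximation form $q \leftarrow q + \alpha(\text{target} - q)$ are assumptions made here for illustration, not the lecture's actual code.

```python
import numpy as np

def rs_q_update(q, n, x, a, R, x_next, feasible_next, γ, β):
    """One risk-sensitive Q-learning update (illustrative sketch only).

    q, n          : (state, action) arrays of Q-values and visit counts
    feasible_next : feasible actions at the next state X_{t+1}
    """
    m = min(q[x_next, ap] for ap in feasible_next)  # min_{a'} q_t(X_{t+1}, a')
    target = np.exp(-γ * R) * m**β                  # exp(-γ R_{t+1}) * (min q)^β
    n[x, a] += 1
    α = 1.0 / n[x, a]**0.51                         # learning rate 1 / n^{0.51}
    q[x, a] += α * (target - q[x, a])               # assumed update form


def epsilon_greedy_min(q, x, feasible, ε, rng):
    """ε-greedy behavior policy around the argmin (minimize, not maximize)."""
    if rng.random() < ε:
        return rng.choice(list(feasible))           # explore: random feasible action
    return min(feasible, key=lambda a: q[x, a])     # exploit: argmin action
```

The lecture's own implementation is Numba-compiled, as the surrounding code cells indicate; this sketch only isolates the update rule and the behavior policy.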
@@ -572,7 +572,7 @@ def greedy_policy_from_q_rs(q, K):
 
 The Q-learning loop mirrors the risk-neutral version, with the key changes:
 Q-table initialized to ones, the update target uses $\exp(-\gamma R_{t+1})
-\cdot (\min_{a'} q_t)^\beta$, and the behavior policy follows the $\arg\min$.
+\cdot (\min_{a'} q_t)^\beta$, and the behavior policy follows the $\argmin$.
 
 ```{code-cell} ipython3
 @numba.jit(nopython=True)
@@ -657,7 +657,7 @@ We extract the value function and policy from the final Q-table.
 
 Since Q-values represent $\mathbb{E}[\exp(-\gamma(\cdots))]$, we recover the
 value function via $v_Q(x) = -\frac{1}{\gamma} \ln(\min_{a} q(x, a))$ and the
-policy via $\sigma_Q(x) = \arg\min_a q(x, a)$.
+policy via $\sigma_Q(x) = \argmin_a q(x, a)$.
 
 ```{code-cell} ipython3
 K = len(x_values) - 1