
Commit b05460d

jstac, claude, and mmcky
authored
Add Reinforcement Learning section (#822)
* Add Reinforcement Learning section with inventory Q-learning lecture

  Add a new 'Reinforcement Learning' section to the book containing:
  - inventory_q.md: a new lecture on inventory management via DP and Q-learning
  - mccall_q.md: moved from the Search section

* Fix errors and improve clarity in inventory Q-learning lecture

  Mathematical fixes:
  - Fix the argument order in the transition function: h(X_t, D_{t+1}, A_t) → h(X_t, A_t, D_{t+1}), matching the definition h(x, a, d) := (x - d) ∨ 0 + a
  - Rename the reward function from r(x, a, d) to π(x, a, d) to resolve the notation clash with the interest rate r and align with the profit notation π_t
  - Fix action space typography: A := X → 𝖠 := 𝖷 (mathsf consistency)
  - Fix inconsistent notation in the modified update rule: π_{t+1} → R_{t+1}

  Prose improvements:
  - Clarify the timing language: "after the firm caters to current demand D_{t+1}" → "after demand D_{t+1} is realized and served"
  - Rewrite the Q-table and behavior policy section to carefully distinguish the max in the update target (a scalar value computation) from the behavior policy (the action actually taken). The previous text claimed that random actions still yield convergence, which is true only if the max stays in the update, a distinction the text did not make explicit.
  - Introduce on-policy vs. off-policy terminology with an explanation
  - Contrast the optimality operator (max → q*) with the evaluation operator (fixed σ → q^σ) to make the role of the max rigorous
  - Improve the code comments to separate the max value (the update target) from the argmax action (the behavior policy)

* misc

* Fix argmax rendering: update the MathJax macro to use \operatorname*

  Updated the global MathJax macros for \argmax and \argmin in _config.yml to use \operatorname*{} so that subscripts render directly below the operator in display mode, matching the style of \max. Reverted the inline workarounds in inventory_q.md back to \argmax.
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Matt McKay <mmcky@users.noreply.github.com>
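The distinction the commit message draws, between the max in the update target (a scalar value computation) and the behavior policy (the action actually taken), can be sketched in a few lines. This is a hypothetical minimal example, not the lecture's actual code: the capacity K, the cost parameters, and the Poisson demand distribution are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

K = 10                        # inventory capacity; states and actions are 0..K (assumed)
c, kappa, p = 0.5, 1.0, 2.0   # unit cost, fixed order cost, sale price (assumed)

def h(x, a, d):
    # transition h(x, a, d) = (x - d) ∨ 0 + a, clipped to capacity for illustration
    return min(max(x - d, 0) + a, K)

def profit(x, a, d):
    # reward π(x, a, d): revenue on served demand minus ordering costs
    return p * min(x, d) - c * a - kappa * (a > 0)

Q = np.zeros((K + 1, K + 1))
alpha, beta, eps = 0.1, 0.95, 0.1   # learning rate, discount factor, exploration rate

x = 0
for t in range(10_000):
    # behavior policy: the action actually taken (here epsilon-greedy)
    if rng.random() < eps:
        a = int(rng.integers(0, K + 1))
    else:
        a = int(np.argmax(Q[x]))          # argmax selects an action
    d = int(rng.poisson(3))               # demand draw (assumed Poisson)
    x_next = h(x, a, d)
    # update target: the max is a scalar value computation; it stays in the
    # update even if the behavior policy were purely random (off-policy)
    target = profit(x, a, d) + beta * np.max(Q[x_next])
    Q[x, a] += alpha * (target - Q[x, a])
    x = x_next
```

Because the max over next-state actions remains in the target regardless of how actions are chosen, this update is off-policy: exploration affects only which entries get visited, not what they converge toward.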
1 parent 24ad7be commit b05460d

4 files changed: +740 -3 lines changed

lectures/_config.yml (2 additions, 2 deletions)

@@ -107,8 +107,8 @@ sphinx:
     mathjax3_config:
       tex:
         macros:
-          "argmax" : "arg\\,max"
-          "argmin" : "arg\\,min"
+          "argmax" : ["\\operatorname*{argmax}", 0]
+          "argmin" : ["\\operatorname*{argmin}", 0]
     mathjax_path: https://cdn.jsdelivr.net/npm/mathjax@3/es5/tex-mml-chtml.js
   # Local Redirects
   rediraffe_redirects:
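The effect of the new macro definition can be illustrated in plain LaTeX, where `\DeclareMathOperator*` plays the same role as MathJax's `\operatorname*`: the starred form places subscripts directly below the operator in display mode, as with `\max`. The displayed equation below is illustrative, not taken from the lecture.

```latex
\documentclass{article}
\usepackage{amsmath}

% LaTeX analogue of the updated MathJax macro:
% "argmax" : ["\\operatorname*{argmax}", 0]
\DeclareMathOperator*{\argmax}{argmax}

\begin{document}
\[
  \sigma^*(x) = \argmax_{a \in \mathsf{A}} \, q^*(x, a)
\]
\end{document}
```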

lectures/_static/quant-econ.bib (1 addition, 1 deletion)

@@ -4,7 +4,7 @@
 ###

 @article{evans2005interview,
-  title={An interview with thomas j. sargent},
+  title={An interview with Thomas J. Sargent},
   author={Evans, George W and Honkapohja, Seppo},
   journal={Macroeconomic Dynamics},
   volume={9},

lectures/_toc.yml (4 additions, 0 deletions)

@@ -73,6 +73,10 @@ parts:
   - file: career
   - file: jv
   - file: odu
+- caption: Reinforcement Learning
+  numbered: true
+  chapters:
+  - file: inventory_q
   - file: mccall_q
 - caption: Introduction to Optimal Savings
   numbered: true
