* Add risk-sensitive inventory management lecture and improve inventory_q
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* Add rs_inventory_q to table of contents
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* Remove generated ipynb and py files from tracking
The build system generates notebooks from the .md source.
Having .ipynb and .py files in the repo causes duplicate
document warnings and cross-reference failures.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* Fix cross-references: use {doc} instead of {ref} for lecture links
The {ref} directive requires a label placed before a heading to
resolve a title. The inventory_q label is before a raw block,
so {ref} fails. Using {doc} links to the document directly.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* Use \argmin and \argmax macros from _config.yml
* Rename risk-sensitivity function from φ to ψ to avoid notation conflict
The symbol φ was overloaded: used for both the demand PMF and the
risk-sensitivity transformation. Now ψ(t) = exp(-γt) is the risk
transformation and φ(d) remains the demand PMF, consistent with
inventory_q.md.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* Add forward ref from inventory_q and fix minor grammar
- Add cross-reference from inventory_q.md to rs_inventory_q.md
- Fix missing comma after introductory phrase
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* Move random seeds inside jitted functions for reproducibility
Numba JIT functions use their own RNG state, so np.random.seed()
called outside a jitted function has no effect on random draws
inside it. Moved seeds into sim_inventories and q_learning kernels
as parameters.
Fixes issue noted by @HumphreyYang.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Matt McKay <mmcky@users.noreply.github.com>
Q-learning approximates the fixed point of the Q-factor Bellman equation using **[stochastic approximation](https://en.wikipedia.org/wiki/Stochastic_approximation)**.
At each step, the agent is in state $x$, takes action $a$, observes reward
$R_{t+1} = \pi(x, a, D_{t+1})$ and next state $X_{t+1} = h(x, a, D_{t+1})$, and
updates the estimate according to

$$
q_{t+1}(x, a)
= q_t(x, a)
+ \alpha_t \left( R_{t+1} + \beta \max_{a'} q_t(X_{t+1}, a') - q_t(x, a) \right)
$$

where $\alpha_t$ is the learning rate.
The update blends the current estimate $q_t(x, a)$ with a fresh sample of the Bellman target.
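As a minimal sketch of this blending step (illustrative names and table sizes; `beta` stands in for the discount factor):

```python
import numpy as np

def q_update(q, x, a, R, X_next, alpha, beta=0.98):
    """One tabular Q-learning step on q (shape: n_states x n_actions).

    Blends the current estimate q[x, a] with a fresh sample of the
    Bellman target R + beta * max_a' q[X_next, a'].
    """
    target = R + beta * np.max(q[X_next, :])
    q[x, a] = (1 - alpha) * q[x, a] + alpha * target
    return q

q = np.zeros((3, 2))                    # tiny 3-state, 2-action table
q = q_update(q, x=1, a=0, R=5.0, X_next=2, alpha=0.5)
```

With `alpha=0.5` and an all-zero table, the updated entry lands halfway between the old estimate (0) and the sampled target (5).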
### What the manager needs to know

Notice what is **not** required to implement the update.

The manager does not need to know the demand distribution $\phi$, the unit cost $c$, the fixed cost $\kappa$, or the transition function $h$.

All the manager needs to observe at each step is:

1. the current inventory level $x$,
2. the order quantity $a$ they chose,
3. the resulting profit $R_{t+1}$ (which appears on the books), and
4. the next inventory level $X_{t+1}$ (which they can read off the warehouse).

These are all directly observable quantities — no model knowledge is required.

### The Q-table and the role of the max
It is important to understand how the update rule relates to the manager's actions.
By contrast, the original update with the $\max$ is a stochastic sample of the Bellman *optimality* operator, whose fixed point is $q^*$. The $\max$ in the update target is therefore what drives convergence to $q^*$.
In short, the $\max$ is doing the work of finding the optimum; without it, you only evaluate a fixed policy.
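The contrast can be made concrete in a few lines (a sketch with made-up numbers; `sigma` is a hypothetical fixed policy, not anything from the lecture):

```python
import numpy as np

q = np.array([[1.0, 0.2, 0.5],
              [0.3, 2.0, 0.1]])         # toy Q-table: 2 states, 3 actions
X_next, R, beta = 1, 1.0, 0.98

# Q-learning target: the max over actions samples the Bellman
# *optimality* operator, so iterating drives estimates toward q*
target_opt = R + beta * np.max(q[X_next, :])

# Dropping the max and plugging in a fixed policy sigma instead samples
# the *evaluation* operator: iterating can only learn q_sigma
sigma = np.array([2, 0])                # hypothetical fixed policy
target_eval = R + beta * q[X_next, sigma[X_next]]

# The optimality target weakly dominates any fixed-policy target
assert target_opt >= target_eval
```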
519
+
520
+
### The behavior policy
493
521
494
522
The rule governing how the manager chooses actions is called the **behavior
495
523
policy**. Because the $\max$ in the update target always points toward $q^*$
In practice, we want the manager to mostly take good actions (to earn reasonable profits while learning), while still occasionally experimenting to discover better alternatives.
### Learning rate
We use $\alpha_t = 1 / n_t(x, a)^{0.51}$, where $n_t(x, a)$ is the number of times the pair $(x, a)$ has been visited up to time $t$.
This decays slowly enough to allow learning from later (better-informed) updates, while still satisfying the [Robbins–Monro conditions](https://en.wikipedia.org/wiki/Stochastic_approximation#Robbins%E2%80%93Monro_algorithm) for convergence.
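A small sketch of this rate (illustrative; `n_visits` plays the role of the visit counter $n_t(x, a)$ for a single pair):

```python
import numpy as np

def learning_rate(n_visits):
    # alpha_t = 1 / n_t(x, a)**0.51: the exponent sits just above 1/2,
    # so sum(alpha_t) diverges while sum(alpha_t**2) converges --
    # exactly the Robbins-Monro conditions
    return 1.0 / n_visits**0.51

n = np.arange(1, 6)                  # visit counts 1, 2, ..., 5
rates = learning_rate(n)             # decreasing: 1.0, ~0.70, ~0.57, ...
```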
### Exploration: epsilon-greedy
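A minimal epsilon-greedy rule can be sketched as follows (names are illustrative, not the lecture's actual code):

```python
import numpy as np

def epsilon_greedy(q, x, epsilon, rng):
    """With probability epsilon explore uniformly; otherwise exploit."""
    if rng.uniform() < epsilon:
        return int(rng.integers(q.shape[1]))   # explore: random action
    return int(np.argmax(q[x, :]))             # exploit: greedy action

rng = np.random.default_rng(42)
q = np.array([[0.0, 1.0],
              [2.0, 0.5]])
a = epsilon_greedy(q, x=0, epsilon=0.0, rng=rng)   # epsilon=0: pure greedy
```

With `epsilon=0` the rule is purely greedy; a small positive epsilon lets the manager mostly exploit the current Q-table while occasionally sampling other order quantities.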
At specified step counts (given by `snapshot_steps`), we record the current greedy policy.