diff --git a/lectures/_config.yml b/lectures/_config.yml index 698fa5b32..75d5a65bf 100644 --- a/lectures/_config.yml +++ b/lectures/_config.yml @@ -109,6 +109,7 @@ sphinx: macros: "argmax" : ["\\operatorname*{argmax}", 0] "argmin" : ["\\operatorname*{argmin}", 0] + "EE" : "\\mathbb{E}" mathjax_path: https://cdn.jsdelivr.net/npm/mathjax@3/es5/tex-mml-chtml.js # Local Redirects rediraffe_redirects: diff --git a/lectures/inventory_q.md b/lectures/inventory_q.md index 146921bf8..e9c9de9ce 100644 --- a/lectures/inventory_q.md +++ b/lectures/inventory_q.md @@ -37,18 +37,23 @@ We approach the problem in two ways. First, we solve it exactly using dynamic programming, assuming full knowledge of the model — the demand distribution, cost parameters, and transition dynamics. -Second, we show how a manager can learn the optimal policy from experience alone, using *[Q-learning](https://en.wikipedia.org/wiki/Q-learning)*. +Second, we show how a manager can learn the optimal policy from experience alone, using [Q-learning](https://en.wikipedia.org/wiki/Q-learning). -The manager observes only the inventory level, the order placed, the resulting -profit, and the next inventory level — without knowing any of the underlying -parameters. +In this setting, we assume that the manager observes only + +* the inventory level, +* the order placed, +* the resulting profit, and +* the next inventory level. + +The manager knows the interest rate -- and hence the discount factor -- but not any of the other underlying parameters. A key idea is the *Q-factor* representation, which reformulates the Bellman equation so that the optimal policy can be recovered without knowledge of the -transition function. +transition dynamics. -We show that, given enough experience, the manager's learned policy converges to -the optimal one. +We show that, given enough experience, the +manager's learned policy converges to the optimal one. 
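The mapping from the interest rate to the discount factor used above is easy to check numerically. A minimal sketch, with a purely illustrative interest rate (not a parameter from the lecture):

```python
# Illustrative only: the discount factor implied by an interest rate r,
# and a truncated version of the value sum for constant unit profits.
r = 0.05                  # hypothetical interest rate
β = 1 / (1 + r)           # discount factor β := 1 / (1 + r)

# With constant profits π_t = 1, the value sum is 1 / (1 - β) = (1 + r) / r.
V = sum(β**t for t in range(10_000))   # truncated geometric sum
```

With `r = 0.05`, the truncated sum is within floating-point error of the closed form `(1 + r) / r = 21`.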
The lecture proceeds as follows: @@ -67,16 +72,18 @@ import matplotlib.pyplot as plt from typing import NamedTuple ``` + ## The Model -We study a firm where a manager tries to maximize shareholder value. +We study a firm where a manager tries to maximize shareholder value by +controlling inventories. To simplify the problem, we assume that the firm only sells one product. Letting $\pi_t$ be profits at time $t$ and $r > 0$ be the interest rate, the value of the firm is $$ - V_0 = \sum_{t \geq 0} \beta^t \pi_t + V_0 = \EE \sum_{t \geq 0} \beta^t \pi_t \qquad \text{ where } \quad \beta := \frac{1}{1+r}. @@ -97,9 +104,9 @@ $$ $$ The term $A_t$ is units of stock ordered this period, which arrive at the start -of period $t+1$, after demand $D_{t+1}$ is realized and served. +of period $t+1$, after demand $D_{t+1}$ is realized and served: -**Timeline for period $t$:** observe $X_t$ → choose $A_t$ → demand $D_{t+1}$ arrives → profit realized → $X_{t+1}$ determined. +* observe $X_t$ → choose $A_t$ → demand $D_{t+1}$ arrives → profit realized → $X_{t+1}$ determined. (We use a $t$ subscript in $A_t$ to indicate the information set: it is chosen before $D_{t+1}$ is observed.) @@ -115,7 +122,7 @@ $$ Here * the sales price is set to unity (for convenience) -* revenue is the minimum of current stock and demand because orders in excess of inventory are lost rather than back-filled +* revenue is the minimum of current stock and demand because orders in excess of inventory are lost (not back-filled) * $c$ is unit product cost and $\kappa$ is a fixed cost of ordering inventory We can map our inventory problem into a dynamic program with state space $\mathsf X := \{0, \ldots, K\}$ and action space $\mathsf A := \mathsf X$. @@ -463,9 +470,10 @@ The manager does not need to know the demand distribution $\phi$, the unit cost All the manager needs to observe at each step is: 1. the current inventory level $x$, -2. the order quantity $a$ they chose, -3. 
the resulting profit $R_{t+1}$ (which appears on the books), and -4. the next inventory level $X_{t+1}$ (which they can read off the warehouse). +2. the order quantity $a$, which they choose, +3. the resulting profit $R_{t+1}$ (which appears on the books), +4. the discount factor $\beta$, which is determined by the interest rate, and +5. the next inventory level $X_{t+1}$ (which they can read off the warehouse). These are all directly observable quantities — no model knowledge is required. @@ -480,47 +488,29 @@ a)$ for every state-action pair $(x, a)$. At each step, the manager is in some state $x$ and must choose a specific action $a$ to take. Whichever $a$ is chosen, the manager observes profit $R_{t+1}$ -and next state $X_{t+1}$, and updates **that one entry** $q_t(x, a)$ of the +and next state $X_{t+1}$, and updates *that one entry* $q_t(x, a)$ of the table using the rule above. -**The max computes a value, not an action.** - It is tempting to read the $\max_{a'}$ in the update rule as prescribing the manager's next action — that is, to interpret the update as saying "move to -state $X_{t+1}$ and take action $\argmax_{a'} q_t(X_{t+1}, a')$." +state $X_{t+1}$ and take an action in $\argmax_{a'} q_t(X_{t+1}, a')$." -But the $\max$ plays a different role. The quantity $\max_{a' \in -\Gamma(X_{t+1})} q_t(X_{t+1}, a')$ is a **scalar** — it estimates the value of -being in state $X_{t+1}$ under the best possible continuation. This scalar -enters the update as part of the target value for $q_t(x, a)$. +But the $\max$ plays a different role. -Which action the manager *actually takes* at state $X_{t+1}$ is a separate -decision entirely. - -To see why this distinction matters, consider what happens if we modify the -update rule by replacing the $\max$ with evaluation under a fixed feasible -policy $\sigma$: - -$$ - q_{t+1}(x, a) - = (1 - \alpha_t) q_t(x, a) + - \alpha_t \left(R_{t+1} + \beta \, q_t(X_{t+1}, \sigma(X_{t+1}))\right). 
-$$ +The quantity $\max_{a' \in \Gamma(X_{t+1})} q_t(X_{t+1}, a')$ is just an estimate of the value of +being in state $X_{t+1}$ under the best possible continuation. -This modified update is a stochastic sample of the Bellman *evaluation* operator -for $\sigma$. The Q-table then converges to $q^\sigma$ — the Q-function -associated with the lifetime value of $\sigma$, not the optimal one. +This scalar enters the update as part of the target value for $q_t(x, a)$. -By contrast, the original update with the $\max$ is a stochastic sample of the -Bellman *optimality* operator, whose fixed point is $q^*$. The $\max$ in the -update target is therefore what drives convergence to $q^*$. +Which action the manager *actually takes* at time $t+1$ is a separate decision. -In short, the $\max$ is doing the work of finding the optimum; without it, you only evaluate a fixed policy. +In short, the $\max$ is doing the work of finding the optimum; it does not dictate the action that the manager actually takes. ### The behavior policy -The rule governing how the manager chooses actions is called the **behavior -policy**. Because the $\max$ in the update target always points toward $q^*$ +The rule governing how the manager chooses actions is called the **behavior policy**. + +Because the $\max$ in the update target always points toward $q^*$ regardless of how the manager selects actions, the behavior policy affects only which $(x, a)$ entries get visited — and hence updated — over time. @@ -545,6 +535,7 @@ We use $\alpha_t = 1 / n_t(x, a)^{0.51}$, where $n_t(x, a)$ is the number of tim This decays slowly enough to allow learning from later (better-informed) updates, while still satisfying the [Robbins–Monro conditions](https://en.wikipedia.org/wiki/Stochastic_approximation#Robbins%E2%80%93Monro_algorithm) for convergence. 
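The update rule and learning-rate schedule described above can be sketched as follows. This is a simplified illustration — all actions are treated as feasible and the observed transition `(x, a, R, x_next)` is made up — not the lecture's implementation:

```python
import numpy as np

def q_update(q, n, x, a, R, x_next, β):
    """One tabular Q-learning update with α = 1 / n(x, a)**0.51."""
    n[x, a] += 1
    α = 1.0 / n[x, a]**0.51            # per-entry decaying learning rate
    target = R + β * q[x_next].max()   # max over actions (all feasible here)
    q[x, a] = (1 - α) * q[x, a] + α * target

K = 3
q = np.zeros((K + 1, K + 1))   # Q-table
n = np.zeros((K + 1, K + 1))   # per-entry visit counts

# hypothetical observed transition: on the first visit α = 1,
# so q[1, 2] moves all the way to the target R + β * 0 = 0.5
q_update(q, n, x=1, a=2, R=0.5, x_next=0, β=0.96)
```

Note how the `max` appears only inside the target: it produces a number used to update `q[x, a]`, while the action actually taken at the next state is chosen separately by the behavior policy.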
+ ### Exploration: epsilon-greedy For our behavior policy, we use an $\varepsilon$-greedy strategy: @@ -560,6 +551,16 @@ We decay $\varepsilon$ each step: $\varepsilon_{t+1} = \max(\varepsilon_{\min},\ The stochastic demand shocks naturally drive the manager across different inventory levels, providing exploration over the state space without any artificial resets. +### Optimistic initialization + +A simple but powerful technique for accelerating learning is **optimistic initialization**: instead of starting the Q-table at zero, we initialize every entry to a value above the true optimum. + +Because every untried action looks optimistically good, the agent is "disappointed" whenever it tries one — the update pulls that entry down toward reality. This drives the agent to try other actions (which still look optimistically high), producing broad exploration of the state-action space early in training. + +This idea is sometimes called **optimism in the face of uncertainty** and is widely used in both bandit and reinforcement learning settings. + +In our problem, the value function $v^*$ ranges from about 13 to 18. We initialize the Q-table at 20 — modestly above the true maximum — to ensure optimistic exploration without being so extreme as to distort learning. + ### Implementation We first define a helper to extract the greedy policy from a Q-table. @@ -587,9 +588,9 @@ At specified step counts (given by `snapshot_steps`), we record the current gree ```{code-cell} ipython3 @numba.jit(nopython=True) def q_learning_kernel(K, p, c, κ, β, n_steps, X_init, - ε_init, ε_min, ε_decay, snapshot_steps, seed): + ε_init, ε_min, ε_decay, q_init, snapshot_steps, seed): np.random.seed(seed) - q = np.zeros((K + 1, K + 1)) + q = np.full((K + 1, K + 1), q_init) n = np.zeros((K + 1, K + 1)) # visit counts for learning rate ε = ε_init @@ -642,22 +643,21 @@ The wrapper function unpacks the model and provides default hyperparameters. 
```{code-cell} ipython3 def q_learning(model, n_steps=20_000_000, X_init=0, ε_init=1.0, ε_min=0.01, ε_decay=0.999999, - snapshot_steps=None, seed=1234): + q_init=20.0, snapshot_steps=None, seed=1234): x_values, d_values, ϕ_values, p, c, κ, β = model K = len(x_values) - 1 if snapshot_steps is None: snapshot_steps = np.array([], dtype=np.int64) return q_learning_kernel(K, p, c, κ, β, n_steps, X_init, - ε_init, ε_min, ε_decay, snapshot_steps, seed) + ε_init, ε_min, ε_decay, q_init, snapshot_steps, seed) ``` -### Running Q-learning - -We run 20 million steps and take policy snapshots at steps 10,000, 1,000,000, and at the end. +Next we run $n$ = 5 million steps and take policy snapshots at steps 10,000, 1,000,000, and $n$. ```{code-cell} ipython3 -snap_steps = np.array([10_000, 1_000_000, 19_999_999], dtype=np.int64) -q, snapshots = q_learning(model, snapshot_steps=snap_steps) +n = 5_000_000 +snap_steps = np.array([10_000, 1_000_000, n], dtype=np.int64) +q, snapshots = q_learning(model, n_steps=n+1, snapshot_steps=snap_steps) ``` ### Comparing with the exact solution @@ -710,9 +710,11 @@ All panels use the **same demand sequence** (via a fixed random seed), so differ The top panel shows the optimal policy from VFI for reference. -After only 10,000 steps the agent has barely explored and its policy is poor. +After 10,000 steps the agent has barely explored and its policy is poor. + +By 1,000,000 steps the policy has improved but still differs noticeably from the optimum. -By step 20 million, the learned policy produces inventory dynamics that closely resemble the S-s pattern of the optimal solution. +By step 5 million, the learned policy produces inventory dynamics that closely resemble the S-s pattern of the optimal solution. 
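The S-s pattern referred to above can be sketched as a simple reorder rule; the thresholds below are illustrative, not the computed optimum:

```python
def s_S_policy(x, s=2, S=7):
    """Illustrative (s, S) rule: once stock falls to s or below, order up to S."""
    return S - x if x <= s else 0

# with these hypothetical thresholds, stock at 0, 1, 2 triggers
# orders of 7, 6, 5; above s, no order is placed
orders = [s_S_policy(x) for x in range(8)]
```

Inventory under such a rule saws between the reorder band and the order-up-to level, which is the pattern visible in the simulated paths.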
```{code-cell} ipython3 ts_length = 200 diff --git a/lectures/rs_inventory_q.md b/lectures/rs_inventory_q.md index cd81e6bad..347932d20 100644 --- a/lectures/rs_inventory_q.md +++ b/lectures/rs_inventory_q.md @@ -530,7 +530,7 @@ $X_{t+1}$ — no model knowledge is required. Our implementation follows the same structure as the risk-neutral Q-learning in {doc}`inventory_q`, with the modifications above: -1. **Initialize** the Q-table $q$ to ones (since Q-values are positive) and +1. **Initialize** the Q-table $q$ optimistically (see below) and visit counts $n$ to zeros. 2. **At each step:** - Draw demand $D_{t+1}$ and compute observed profit $R_{t+1}$ and next state @@ -548,6 +548,17 @@ Our implementation follows the same structure as the risk-neutral Q-learning in $\sigma(x) = \argmin_{a \in \Gamma(x)} q(x, a)$. 4. **Compare** the learned policy against the VFI solution. +### Optimistic initialization + +As in {doc}`inventory_q`, we use optimistic initialization to accelerate learning. + +The logic is the same — initialize the Q-table so that every untried action looks attractive, driving the agent to explore broadly — but the direction is reversed. + +Since the optimal policy *minimizes* $q$, "optimistic" means initializing the Q-table *below* the true values. When the agent tries an action, the update pushes $q$ upward toward reality, making that entry look worse and prompting the agent to try other actions that still appear optimistically good. + +The true Q-values are on the order of $\exp(-\gamma \, v^*) \approx 10^{-8}$ to $10^{-6}$. +We initialize the Q-table at $10^{-9}$, modestly below this range. + ### Implementation We first define a helper to extract the greedy policy from the Q-table. 
@@ -571,15 +582,15 @@ def greedy_policy_from_q_rs(q, K): ``` The Q-learning loop mirrors the risk-neutral version, with the key changes: -Q-table initialized to ones, the update target uses $\exp(-\gamma R_{t+1}) +the update target uses $\exp(-\gamma R_{t+1}) \cdot (\min_{a'} q_t)^\beta$, and the behavior policy follows the $\argmin$. ```{code-cell} ipython3 @numba.jit(nopython=True) def q_learning_rs_kernel(K, p, c, κ, β, γ, n_steps, X_init, - ε_init, ε_min, ε_decay, snapshot_steps, seed): + ε_init, ε_min, ε_decay, q_init, snapshot_steps, seed): np.random.seed(seed) - q = np.ones((K + 1, K + 1)) # positive Q-values, initialized to 1 + q = np.full((K + 1, K + 1), q_init) # optimistic initialization n = np.zeros((K + 1, K + 1)) # visit counts for learning rate ε = ε_init @@ -633,22 +644,23 @@ The wrapper function unpacks the model and provides default hyperparameters. ```{code-cell} ipython3 def q_learning_rs(model, n_steps=20_000_000, X_init=0, ε_init=1.0, ε_min=0.01, ε_decay=0.999999, - snapshot_steps=None, seed=1234): + q_init=1e-9, snapshot_steps=None, seed=1234): x_values, d_values, ϕ_values, p, c, κ, β, γ = model K = len(x_values) - 1 if snapshot_steps is None: snapshot_steps = np.array([], dtype=np.int64) return q_learning_rs_kernel(K, p, c, κ, β, γ, n_steps, X_init, - ε_init, ε_min, ε_decay, snapshot_steps, seed) + ε_init, ε_min, ε_decay, q_init, snapshot_steps, seed) ``` ### Running Q-learning -We run 20 million steps and take policy snapshots at steps 10,000, 1,000,000, and at the end. +We run $n$ = 5 million steps and take policy snapshots at steps 10,000, 1,000,000, and $n$. 
```{code-cell} ipython3 -snap_steps = np.array([10_000, 1_000_000, 19_999_999], dtype=np.int64) -q_table, snapshots = q_learning_rs(model, snapshot_steps=snap_steps) +n = 5_000_000 +snap_steps = np.array([10_000, 1_000_000, n], dtype=np.int64) +q_table, snapshots = q_learning_rs(model, n_steps=n+1, snapshot_steps=snap_steps) ``` ### Comparing with the exact solution @@ -731,8 +743,9 @@ plt.show() After 10,000 steps, the agent has barely explored and its policy is erratic. -By 1,000,000 steps the learned policy begins to resemble the optimal one, and -by step 20 million the inventory dynamics are nearly indistinguishable from the +By 1,000,000 steps the learned policy has improved but still differs noticeably from the optimum. + +By step 5 million the inventory dynamics are nearly indistinguishable from the VFI solution. Note that the converged policy maintains lower inventory levels than in the