Merged
1 change: 1 addition & 0 deletions lectures/_config.yml
@@ -109,6 +109,7 @@ sphinx:
macros:
"argmax" : ["\\operatorname*{argmax}", 0]
"argmin" : ["\\operatorname*{argmin}", 0]
"EE" : "\\mathbb{E}"
mathjax_path: https://cdn.jsdelivr.net/npm/mathjax@3/es5/tex-mml-chtml.js
# Local Redirects
rediraffe_redirects:
112 changes: 57 additions & 55 deletions lectures/inventory_q.md
@@ -37,18 +37,23 @@ We approach the problem in two ways.
First, we solve it exactly using dynamic programming, assuming full knowledge of
the model — the demand distribution, cost parameters, and transition dynamics.

Second, we show how a manager can learn the optimal policy from experience alone, using *[Q-learning](https://en.wikipedia.org/wiki/Q-learning)*.
Second, we show how a manager can learn the optimal policy from experience alone, using [Q-learning](https://en.wikipedia.org/wiki/Q-learning).

The manager observes only the inventory level, the order placed, the resulting
profit, and the next inventory level — without knowing any of the underlying
parameters.
In this setting, we assume that the manager observes only

* the inventory level,
* the order placed,
* the resulting profit, and
* the next inventory level.

The manager knows the interest rate (and hence the discount factor) but none of the other underlying parameters.

A key idea is the *Q-factor* representation, which reformulates the Bellman
equation so that the optimal policy can be recovered without knowledge of the
transition function.
transition dynamics.

We show that, given enough experience, the manager's learned policy converges to
the optimal one.
We show that, given enough experience, the
manager's learned policy converges to the optimal one.

The lecture proceeds as follows:

@@ -67,16 +72,18 @@ import matplotlib.pyplot as plt
from typing import NamedTuple
```


## The Model

We study a firm where a manager tries to maximize shareholder value.
We study a firm where a manager tries to maximize shareholder value by
controlling inventories.

To simplify the problem, we assume that the firm only sells one product.

Letting $\pi_t$ be profits at time $t$ and $r > 0$ be the interest rate, the value of the firm is

$$
V_0 = \sum_{t \geq 0} \beta^t \pi_t
V_0 = \EE \sum_{t \geq 0} \beta^t \pi_t
\qquad
\text{ where }
\quad \beta := \frac{1}{1+r}.
@@ -97,9 +104,9 @@
$$

The term $A_t$ is units of stock ordered this period, which arrive at the start
of period $t+1$, after demand $D_{t+1}$ is realized and served.
of period $t+1$, after demand $D_{t+1}$ is realized and served:

**Timeline for period $t$:** observe $X_t$ → choose $A_t$ → demand $D_{t+1}$ arrives → profit realized → $X_{t+1}$ determined.
* observe $X_t$ → choose $A_t$ → demand $D_{t+1}$ arrives → profit realized → $X_{t+1}$ determined.

(We use a $t$ subscript in $A_t$ to indicate the information set: it is chosen
before $D_{t+1}$ is observed.)
@@ -115,7 +122,7 @@
Here

* the sales price is set to unity (for convenience)
* revenue is the minimum of current stock and demand because orders in excess of inventory are lost rather than back-filled
* revenue is the minimum of current stock and demand because orders in excess of inventory are lost (not back-filled)
* $c$ is unit product cost and $\kappa$ is a fixed cost of ordering inventory

We can map our inventory problem into a dynamic program with state space $\mathsf X := \{0, \ldots, K\}$ and action space $\mathsf A := \mathsf X$.
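To make the timeline and profit expression concrete, here is a minimal sketch of one period of the model as a pure function (the helper name `step` and the parameter values are ours, for illustration only):

```python
def step(x, a, d, c=0.2, κ=0.8):
    """One period: stock x, order a, realized demand d.

    The sales price is unity and unmet demand is lost.
    """
    revenue = min(x, d)                      # sales capped by stock on hand
    profit = revenue - c * a - κ * (a > 0)   # fixed cost κ only when an order is placed
    x_next = max(x - d, 0) + a               # leftover stock plus the arriving order
    return profit, x_next
```

For example, with $x = 3$, $a = 2$, $d = 5$ the firm sells 3 units, pays $0.2 \times 2$ in unit costs plus the fixed cost, and carries $2$ units (the arriving order) into the next period.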
@@ -463,9 +470,10 @@ The manager does not need to know the demand distribution $\phi$, the unit cost
All the manager needs to observe at each step is:

1. the current inventory level $x$,
2. the order quantity $a$ they chose,
3. the resulting profit $R_{t+1}$ (which appears on the books), and
4. the next inventory level $X_{t+1}$ (which they can read off the warehouse).
2. the order quantity $a$ that they choose,
3. the resulting profit $R_{t+1}$ (which appears on the books),
4. the discount factor $\beta$, which is determined by the interest rate, and
5. the next inventory level $X_{t+1}$ (which they can observe in the warehouse).

These are all directly observable quantities — no model knowledge is required.

@@ -480,47 +488,29 @@ a)$ for every state-action pair $(x, a)$.

At each step, the manager is in some state $x$ and must choose a specific action
$a$ to take. Whichever $a$ is chosen, the manager observes profit $R_{t+1}$
and next state $X_{t+1}$, and updates **that one entry** $q_t(x, a)$ of the
and next state $X_{t+1}$, and updates *that one entry* $q_t(x, a)$ of the
table using the rule above.

**The max computes a value, not an action.**

It is tempting to read the $\max_{a'}$ in the update rule as prescribing the
manager's next action — that is, to interpret the update as saying "move to
state $X_{t+1}$ and take action $\argmax_{a'} q_t(X_{t+1}, a')$."
state $X_{t+1}$ and take an action in $\argmax_{a'} q_t(X_{t+1}, a')$."

But the $\max$ plays a different role. The quantity $\max_{a' \in
\Gamma(X_{t+1})} q_t(X_{t+1}, a')$ is a **scalar** — it estimates the value of
being in state $X_{t+1}$ under the best possible continuation. This scalar
enters the update as part of the target value for $q_t(x, a)$.
But the $\max$ plays a different role.

Which action the manager *actually takes* at state $X_{t+1}$ is a separate
decision entirely.

To see why this distinction matters, consider what happens if we modify the
update rule by replacing the $\max$ with evaluation under a fixed feasible
policy $\sigma$:

$$
q_{t+1}(x, a)
= (1 - \alpha_t) q_t(x, a) +
\alpha_t \left(R_{t+1} + \beta \, q_t(X_{t+1}, \sigma(X_{t+1}))\right).
$$
The quantity $\max_{a' \in \Gamma(X_{t+1})} q_t(X_{t+1}, a')$ is just an estimate of the value of
being in state $X_{t+1}$ under the best possible continuation.

This modified update is a stochastic sample of the Bellman *evaluation* operator
for $\sigma$. The Q-table then converges to $q^\sigma$ — the Q-function
associated with the lifetime value of $\sigma$, not the optimal one.
This scalar enters the update as part of the target value for $q_t(x, a)$.

By contrast, the original update with the $\max$ is a stochastic sample of the
Bellman *optimality* operator, whose fixed point is $q^*$. The $\max$ in the
update target is therefore what drives convergence to $q^*$.
Which action the manager *actually takes* at time $t+1$ is a separate decision.

In short, the $\max$ is doing the work of finding the optimum; without it, you only evaluate a fixed policy.
In short, the $\max$ is doing the work of finding the optimum; it does not dictate the action that the manager actually takes.
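To make this concrete, here is a minimal sketch of a single tabular update (the function name and toy numbers are ours): the $\max$ produces a scalar continuation value that enters the target, while action selection happens elsewhere.

```python
import numpy as np

def q_update(q, x, a, R, X_next, α, β, feasible):
    # Scalar continuation value: best entry in row X_next.
    cont = max(q[X_next, a2] for a2 in feasible(X_next))
    # Update only the single entry (x, a) toward the target.
    q[x, a] = (1 - α) * q[x, a] + α * (R + β * cont)
    return q

# Toy check: with α = 1 the entry is set exactly to the target.
q = np.zeros((3, 3))
q[2, 1] = 5.0          # best continuation value at state 2
q = q_update(q, x=0, a=0, R=1.0, X_next=2, α=1.0, β=0.5,
             feasible=lambda x: range(3))
# q[0, 0] is now 1.0 + 0.5 * 5.0 = 3.5; no other entry changed.
```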

### The behavior policy

The rule governing how the manager chooses actions is called the **behavior
policy**. Because the $\max$ in the update target always points toward $q^*$
The rule governing how the manager chooses actions is called the **behavior policy**.

Because the $\max$ in the update target always points toward $q^*$
regardless of how the manager selects actions, the behavior policy affects only
which $(x, a)$ entries get visited — and hence updated — over time.

@@ -545,6 +535,7 @@ We use $\alpha_t = 1 / n_t(x, a)^{0.51}$, where $n_t(x, a)$ is the number of tim

This decays slowly enough to allow learning from later (better-informed) updates, while still satisfying the [Robbins–Monro conditions](https://en.wikipedia.org/wiki/Stochastic_approximation#Robbins%E2%80%93Monro_algorithm) for convergence.
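As a quick illustration (the check below is ours), the schedule satisfies the Robbins–Monro requirements because the exponent $0.51$ lies in $(1/2, 1]$: $\sum_n n^{-0.51}$ diverges while $\sum_n n^{-1.02}$ converges.

```python
def alpha(n):
    """Learning rate after the n-th visit to a state-action pair."""
    return 1.0 / n ** 0.51

# Partial sums over the first 10^5 visits: the sum of α grows
# without bound, while the sum of α² stays bounded.
s1 = sum(alpha(n) for n in range(1, 100_000))
s2 = sum(alpha(n) ** 2 for n in range(1, 100_000))
```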


### Exploration: epsilon-greedy

For our behavior policy, we use an $\varepsilon$-greedy strategy:
@@ -560,6 +551,16 @@ We decay $\varepsilon$ each step: $\varepsilon_{t+1} = \max(\varepsilon_{\min},\

The stochastic demand shocks naturally drive the manager across different inventory levels, providing exploration over the state space without any artificial resets.
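A minimal sketch of this behavior policy (the function names are ours, not the lecture's implementation):

```python
import numpy as np

def epsilon_greedy(q, x, ε, feasible, rng):
    """With probability ε pick a uniform feasible action, else the greedy one."""
    actions = list(feasible(x))
    if rng.random() < ε:
        return actions[rng.integers(len(actions))]   # explore
    values = [q[x, a] for a in actions]
    return actions[int(np.argmax(values))]           # exploit

def decay(ε, ε_min=0.01, ε_decay=0.999999):
    """The decay rule ε_{t+1} = max(ε_min, ε_t · ε_decay)."""
    return max(ε_min, ε * ε_decay)

rng = np.random.default_rng(0)
q = np.zeros((2, 3))
q[0, 2] = 1.0
a = epsilon_greedy(q, 0, ε=0.0, feasible=lambda x: range(3), rng=rng)
# With ε = 0 the choice is purely greedy, so a = 2 here.
```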

### Optimistic initialization

A simple but powerful technique for accelerating learning is **optimistic initialization**: instead of starting the Q-table at zero, we initialize every entry to a value above the true optimum.

Because every untried action looks optimistically good, the agent is "disappointed" whenever it tries one — the update pulls that entry down toward reality. This drives the agent to try other actions (which still look optimistically high), producing broad exploration of the state-action space early in training.

This idea is sometimes called **optimism in the face of uncertainty** and is widely used in both bandit and reinforcement learning settings.

In our problem, the value function $v^*$ ranges from about 13 to 18. We initialize the Q-table at 20 — modestly above the true maximum — to ensure optimistic exploration without being so extreme as to distort learning.
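The mechanism can be seen in a toy calculation (the numbers here are ours): start three actions at 20 and suppose repeated updates pull one tried action toward a true target of 15; the untried actions then look relatively best, inviting the agent to try them next.

```python
import numpy as np

α = 0.5
q = np.full(3, 20.0)     # one state, three actions, optimistic start
true_target = 15.0       # hypothetical target value for action 0
for _ in range(10):
    # Each visit to action 0 is a "disappointment": the entry moves down.
    q[0] = (1 - α) * q[0] + α * true_target
# q[0] is now close to 15 while the untried entries remain at 20,
# so a greedy choice would switch to action 1 or 2.
```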

### Implementation

We first define a helper to extract the greedy policy from a Q-table.
@@ -587,9 +588,9 @@ At specified step counts (given by `snapshot_steps`), we record the current gree
```{code-cell} ipython3
@numba.jit(nopython=True)
def q_learning_kernel(K, p, c, κ, β, n_steps, X_init,
ε_init, ε_min, ε_decay, snapshot_steps, seed):
ε_init, ε_min, ε_decay, q_init, snapshot_steps, seed):
np.random.seed(seed)
q = np.zeros((K + 1, K + 1))
q = np.full((K + 1, K + 1), q_init)
n = np.zeros((K + 1, K + 1)) # visit counts for learning rate
ε = ε_init

@@ -642,22 +643,21 @@ The wrapper function unpacks the model and provides default hyperparameters.
```{code-cell} ipython3
def q_learning(model, n_steps=20_000_000, X_init=0,
ε_init=1.0, ε_min=0.01, ε_decay=0.999999,
snapshot_steps=None, seed=1234):
q_init=20.0, snapshot_steps=None, seed=1234):
x_values, d_values, ϕ_values, p, c, κ, β = model
K = len(x_values) - 1
if snapshot_steps is None:
snapshot_steps = np.array([], dtype=np.int64)
return q_learning_kernel(K, p, c, κ, β, n_steps, X_init,
ε_init, ε_min, ε_decay, snapshot_steps, seed)
ε_init, ε_min, ε_decay, q_init, snapshot_steps, seed)
```

### Running Q-learning

We run 20 million steps and take policy snapshots at steps 10,000, 1,000,000, and at the end.
Next we run $n = 5$ million steps and take policy snapshots at steps 10,000, 1,000,000, and $n$.

```{code-cell} ipython3
snap_steps = np.array([10_000, 1_000_000, 19_999_999], dtype=np.int64)
q, snapshots = q_learning(model, snapshot_steps=snap_steps)
n = 5_000_000
snap_steps = np.array([10_000, 1_000_000, n], dtype=np.int64)
q, snapshots = q_learning(model, n_steps=n+1, snapshot_steps=snap_steps)
```

### Comparing with the exact solution
@@ -710,9 +710,11 @@ All panels use the **same demand sequence** (via a fixed random seed), so differ

The top panel shows the optimal policy from VFI for reference.

After only 10,000 steps the agent has barely explored and its policy is poor.
After 10,000 steps the agent has barely explored and its policy is poor.

By 1,000,000 steps the policy has improved but still differs noticeably from the optimum.

By step 20 million, the learned policy produces inventory dynamics that closely resemble the S-s pattern of the optimal solution.
By step 5 million, the learned policy produces inventory dynamics that closely resemble the S-s pattern of the optimal solution.

```{code-cell} ipython3
ts_length = 200
35 changes: 24 additions & 11 deletions lectures/rs_inventory_q.md
@@ -530,7 +530,7 @@ $X_{t+1}$ — no model knowledge is required.
Our implementation follows the same structure as the risk-neutral Q-learning in
{doc}`inventory_q`, with the modifications above:

1. **Initialize** the Q-table $q$ to ones (since Q-values are positive) and
1. **Initialize** the Q-table $q$ optimistically (see below) and
visit counts $n$ to zeros.
2. **At each step:**
- Draw demand $D_{t+1}$ and compute observed profit $R_{t+1}$ and next state
@@ -548,6 +548,17 @@ Our implementation follows the same structure as the risk-neutral Q-learning in
$\sigma(x) = \argmin_{a \in \Gamma(x)} q(x, a)$.
4. **Compare** the learned policy against the VFI solution.

### Optimistic initialization

As in {doc}`inventory_q`, we use optimistic initialization to accelerate learning.

The logic is the same — initialize the Q-table so that every untried action looks attractive, driving the agent to explore broadly — but the direction is reversed.

Since the optimal policy *minimizes* $q$, "optimistic" means initializing the Q-table *below* the true values. When the agent tries an action, the update pushes $q$ upward toward reality, making that entry look worse and prompting the agent to try other actions that still appear optimistically good.

The true Q-values are on the order of $\exp(-\gamma \, v^*) \approx 10^{-8}$ to $10^{-6}$.
We initialize the Q-table at $10^{-9}$, modestly below this range.

### Implementation

We first define a helper to extract the greedy policy from the Q-table.
@@ -571,15 +582,15 @@ def greedy_policy_from_q_rs(q, K):
```

The Q-learning loop mirrors the risk-neutral version, with the key changes:
Q-table initialized to ones, the update target uses $\exp(-\gamma R_{t+1})
the update target uses $\exp(-\gamma R_{t+1})
\cdot (\min_{a'} q_t)^\beta$, and the behavior policy follows the $\argmin$.
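As a sketch, a single risk-sensitive update looks like the following (illustrative names and toy numbers are ours; the actual kernel follows):

```python
import numpy as np

def q_update_rs(q, x, a, R, X_next, α, β, γ, feasible):
    # Minimal continuation value at the next state (the policy minimizes q).
    cont = min(q[X_next, a2] for a2 in feasible(X_next))
    # Multiplicative target: exp(-γ R) times the continuation value to the power β.
    target = np.exp(-γ * R) * cont ** β
    q[x, a] = (1 - α) * q[x, a] + α * target
    return q

# Toy check: a positive profit lowers the (cost-like) entry.
q = np.full((2, 2), 1e-9)
q = q_update_rs(q, x=0, a=0, R=1.0, X_next=1, α=1.0, β=1.0, γ=0.1,
                feasible=lambda x: range(2))
```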

```{code-cell} ipython3
@numba.jit(nopython=True)
def q_learning_rs_kernel(K, p, c, κ, β, γ, n_steps, X_init,
ε_init, ε_min, ε_decay, snapshot_steps, seed):
ε_init, ε_min, ε_decay, q_init, snapshot_steps, seed):
np.random.seed(seed)
q = np.ones((K + 1, K + 1)) # positive Q-values, initialized to 1
q = np.full((K + 1, K + 1), q_init) # optimistic initialization
n = np.zeros((K + 1, K + 1)) # visit counts for learning rate
ε = ε_init

@@ -633,22 +644,23 @@ The wrapper function unpacks the model and provides default hyperparameters.
```{code-cell} ipython3
def q_learning_rs(model, n_steps=20_000_000, X_init=0,
ε_init=1.0, ε_min=0.01, ε_decay=0.999999,
snapshot_steps=None, seed=1234):
q_init=1e-9, snapshot_steps=None, seed=1234):
x_values, d_values, ϕ_values, p, c, κ, β, γ = model
K = len(x_values) - 1
if snapshot_steps is None:
snapshot_steps = np.array([], dtype=np.int64)
return q_learning_rs_kernel(K, p, c, κ, β, γ, n_steps, X_init,
ε_init, ε_min, ε_decay, snapshot_steps, seed)
ε_init, ε_min, ε_decay, q_init, snapshot_steps, seed)
```

### Running Q-learning

We run 20 million steps and take policy snapshots at steps 10,000, 1,000,000, and at the end.
We run $n = 5$ million steps and take policy snapshots at steps 10,000, 1,000,000, and $n$.

```{code-cell} ipython3
snap_steps = np.array([10_000, 1_000_000, 19_999_999], dtype=np.int64)
q_table, snapshots = q_learning_rs(model, snapshot_steps=snap_steps)
n = 5_000_000
snap_steps = np.array([10_000, 1_000_000, n], dtype=np.int64)
q_table, snapshots = q_learning_rs(model, n_steps=n+1, snapshot_steps=snap_steps)
```

### Comparing with the exact solution
@@ -731,8 +743,9 @@ plt.show()

After 10,000 steps, the agent has barely explored and its policy is erratic.

By 1,000,000 steps the learned policy begins to resemble the optimal one, and
by step 20 million the inventory dynamics are nearly indistinguishable from the
By 1,000,000 steps the learned policy has improved but still differs noticeably from the optimum.

By step 5 million the inventory dynamics are nearly indistinguishable from the
VFI solution.

Note that the converged policy maintains lower inventory levels than in the