Add optimistic initialization to Q-learning lecture by jstac · Pull Request #830 · QuantEcon/lecture-python.myst

jstac · 2026-03-11T09:28:33Z

Summary

Optimistic Q-table initialization: Initialize the Q-table to 20 (above the true value function range of ~13–18) instead of zeros. This implements "optimism in the face of uncertainty" — every untried action looks promising, driving broad early exploration without relying solely on ε-greedy randomness.
4x faster convergence: The optimistic init speeds up learning enough to reduce the training run from 20 million to 5 million steps while achieving better policy accuracy (20/21 states match optimal vs 18/21 previously).
New subsection explaining the technique and why it works.
Prose updates to reflect the shorter training run and describe the learning progression at each snapshot.

Test plan

Verify lecture builds with jupytext and runs without errors
Check that Q-learning converges to near-optimal policy within 5M steps
Review generated plots for reasonable value function and policy match

🤖 Generated with Claude Code

Initialize Q-table to 20 (above true value range of 13-18) instead of zeros, which drives broader exploration via "optimism in the face of uncertainty". This speeds convergence enough to reduce the training run from 20M to 5M steps. Added a new subsection explaining the technique. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

github-actions · 2026-03-11T09:54:40Z

📖 Netlify Preview Ready!

Preview URL: https://pr-830--sunny-cactus-210e3e.netlify.app

Commit: 7394176

📚 Changed Lectures

Build Info

Workflow: Build Project [using jupyter-book]

Initialize Q-table to 1e-5 (below true Q-value range of ~1e-5 to 1e-4) instead of ones. Since the optimal policy minimizes q, optimistic means initializing below the truth — the reverse of the risk-neutral case. This speeds convergence enough to reduce training from 20M to 5M steps. Added a subsection explaining the reversed optimistic init logic. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

jstac · 2026-03-11T22:11:45Z

@mmcky @HumphreyYang all reviews appreciated!

mmcky · 2026-03-11T22:29:29Z

The PR introduces \EE in the firm value equation but this macro was not defined in the MathJax config. I have pushed a fix (7c2f3de) adding "EE": "\\mathbb{E}" to the mathjax3_config macros in _config.yml so it renders correctly across all lectures.

mmcky · 2026-03-11T22:38:13Z

Two additional review items:

1. Missing profit in the observation list (inventory_q.md)

In the "What the manager needs to know" section, adding the discount factor β is a good idea (it appears in the update rule and the intro now mentions the manager knows the interest rate). However, the resulting profit R_{t+1} was removed to make room for it. Since R_{t+1} appears directly in the Q-learning update rule, the manager must observe it. I think this list should have 5 items — keeping profit and adding β — rather than replacing one with the other.

2. Risk-sensitive optimistic initialization logic (rs_inventory_q.md)

The new subsection says the Q-table is initialized at 10⁻⁵, "modestly below" the true Q-values of roughly 10⁻⁵ to 10⁻⁴. However, with γ = 1 and v* ≈ 13–18, the true Q-values are exp(−v*) ≈ 10⁻⁶ to 10⁻⁸, so 10⁻⁵ is actually above most of the true range. Since the agent minimizes, values above the truth look bad (not attractive), which is the opposite of the optimistic mechanism described in the text. The code works well empirically, so the initialization value may be fine — but the narrative explaining why it works appears to have the direction wrong. Worth verifying the actual Q-value range from the converged Q-table and adjusting the explanation.

(@jstac this comment is from Claude and I don't fully understand the Q-values TBH, but thought important to mention. Maybe @HumphreyYang can review and check this comment).

jstac · 2026-03-12T01:31:08Z

Thanks @mmcky .

I think claude is spot on here and could be asked to fix this.

…t value and narrative

mmcky · 2026-03-12T01:57:01Z

Pushed two additional fixes (7394176):

Restored profit to the observation list (inventory_q.md): The resulting profit $R_{t+1}$ was missing from the "What the manager needs to know" list — it appears directly in the Q-learning update rule so the manager must observe it. The list now has 5 items (profit restored, discount factor kept).
Fixed risk-sensitive optimistic init (rs_inventory_q.md): With γ=1 and v≈13–18, the true Q-values are exp(−v)≈10⁻⁸ to 10⁻⁶, not 10⁻⁵ to 10⁻⁴ as previously stated. The init value of 10⁻⁵ was actually above most true Q-values, contradicting the "modestly below" narrative. Changed q_init from 1e-5 to 1e-9 and updated the text to match. Build completes successfully with cached execution.

@jstac I have gone ahead and implemented the suggestions.

@mmcky to visually check html output with change in initial conditions.

jstac · 2026-03-12T02:00:14Z

Thanks @mmcky .

I think this can probably go live, unless @HumphreyYang wants to look again.

We could set Chihiro the task of reviewing this section of the lectures, as a starting point.

HumphreyYang · 2026-03-12T02:03:27Z

Many thanks @jstac, I will study it this afternoon after finishing slides. Please feel free to merge this and I will open another PR if I spot anything!

jstac · 2026-03-12T02:04:30Z

Thanks @HumphreyYang .

@mmcky , please merge when ready.

mmcky · 2026-03-12T04:55:48Z

@HumphreyYang merging this now.

Add \EE macro to MathJax config

7c2f3de

Restore profit to observation list; fix risk-sensitive optimistic ini…

7394176

…t value and narrative

mmcky merged commit d825b92 into main Mar 12, 2026
1 of 2 checks passed

mmcky deleted the rl_edits branch March 12, 2026 04:55

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Add optimistic initialization to Q-learning lecture#830

Add optimistic initialization to Q-learning lecture#830
mmcky merged 4 commits intomainfrom
rl_edits

jstac commented Mar 11, 2026

Uh oh!

github-actions bot commented Mar 11, 2026 •

edited

Loading

Uh oh!

jstac commented Mar 11, 2026

Uh oh!

mmcky commented Mar 11, 2026

Uh oh!

mmcky commented Mar 11, 2026 •

edited

Loading

Uh oh!

jstac commented Mar 12, 2026

Uh oh!

mmcky commented Mar 12, 2026 •

edited

Loading

Uh oh!

jstac commented Mar 12, 2026

Uh oh!

HumphreyYang commented Mar 12, 2026

Uh oh!

jstac commented Mar 12, 2026

Uh oh!

mmcky commented Mar 12, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Uh oh!

Conversation

jstac commented Mar 11, 2026

Summary

Test plan

Uh oh!

github-actions bot commented Mar 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

📖 Netlify Preview Ready!

📚 Changed Lectures

Uh oh!

jstac commented Mar 11, 2026

Uh oh!

mmcky commented Mar 11, 2026

Uh oh!

mmcky commented Mar 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jstac commented Mar 12, 2026

Uh oh!

mmcky commented Mar 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jstac commented Mar 12, 2026

Uh oh!

HumphreyYang commented Mar 12, 2026

Uh oh!

jstac commented Mar 12, 2026

Uh oh!

mmcky commented Mar 12, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

github-actions bot commented Mar 11, 2026 •

edited

Loading

mmcky commented Mar 11, 2026 •

edited

Loading

mmcky commented Mar 12, 2026 •

edited

Loading