Skip to content

Add optimistic initialization to Q-learning lecture#830

Merged
mmcky merged 4 commits intomainfrom
rl_edits
Mar 12, 2026
Merged

Add optimistic initialization to Q-learning lecture#830
mmcky merged 4 commits intomainfrom
rl_edits

Conversation

@jstac
Copy link
Contributor

@jstac jstac commented Mar 11, 2026

Summary

  • Optimistic Q-table initialization: Initialize the Q-table to 20 (above the true value function range of ~13–18) instead of zeros. This implements "optimism in the face of uncertainty" — every untried action looks promising, driving broad early exploration without relying solely on ε-greedy randomness.
  • 4x faster convergence: The optimistic init speeds up learning enough to reduce the training run from 20 million to 5 million steps while achieving better policy accuracy (20/21 states match optimal vs 18/21 previously).
  • New subsection explaining the technique and why it works.
  • Prose updates to reflect the shorter training run and describe the learning progression at each snapshot.

Test plan

  • Verify lecture builds with jupytext and runs without errors
  • Check that Q-learning converges to near-optimal policy within 5M steps
  • Review generated plots for reasonable value function and policy match

🤖 Generated with Claude Code

Initialize Q-table to 20 (above true value range of 13-18) instead of
zeros, which drives broader exploration via "optimism in the face of
uncertainty". This speeds convergence enough to reduce the training run
from 20M to 5M steps. Added a new subsection explaining the technique.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@github-actions
Copy link

github-actions bot commented Mar 11, 2026

📖 Netlify Preview Ready!

Preview URL: https://pr-830--sunny-cactus-210e3e.netlify.app

Commit: 7394176

📚 Changed Lectures


Build Info

Initialize Q-table to 1e-5 (below true Q-value range of ~1e-5 to 1e-4)
instead of ones. Since the optimal policy minimizes q, optimistic means
initializing below the truth — the reverse of the risk-neutral case.
This speeds convergence enough to reduce training from 20M to 5M steps.
Added a subsection explaining the reversed optimistic init logic.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@jstac
Copy link
Contributor Author

jstac commented Mar 11, 2026

@mmcky @HumphreyYang all reviews appreciated!

@mmcky
Copy link
Contributor

mmcky commented Mar 11, 2026

The PR introduces \EE in the firm value equation but this macro was not defined in the MathJax config. I have pushed a fix (7c2f3de) adding "EE": "\\mathbb{E}" to the mathjax3_config macros in _config.yml so it renders correctly across all lectures.

@mmcky
Copy link
Contributor

mmcky commented Mar 11, 2026

Two additional review items:

1. Missing profit in the observation list (inventory_q.md)

In the "What the manager needs to know" section, adding the discount factor β is a good idea (it appears in the update rule and the intro now mentions the manager knows the interest rate). However, the resulting profit R_{t+1} was removed to make room for it. Since R_{t+1} appears directly in the Q-learning update rule, the manager must observe it. I think this list should have 5 items — keeping profit and adding β — rather than replacing one with the other.

2. Risk-sensitive optimistic initialization logic (rs_inventory_q.md)

The new subsection says the Q-table is initialized at 10⁻⁵, "modestly below" the true Q-values of roughly 10⁻⁵ to 10⁻⁴. However, with γ = 1 and v* ≈ 13–18, the true Q-values are exp(−v*) ≈ 10⁻⁶ to 10⁻⁸, so 10⁻⁵ is actually above most of the true range. Since the agent minimizes, values above the truth look bad (not attractive), which is the opposite of the optimistic mechanism described in the text. The code works well empirically, so the initialization value may be fine — but the narrative explaining why it works appears to have the direction wrong. Worth verifying the actual Q-value range from the converged Q-table and adjusting the explanation.

(@jstac this comment is from Claude and I don't fully understand the Q-values TBH, but thought important to mention. Maybe @HumphreyYang can review and check this comment).

@jstac
Copy link
Contributor Author

jstac commented Mar 12, 2026

Thanks @mmcky .

I think claude is spot on here and could be asked to fix this.

@mmcky
Copy link
Contributor

mmcky commented Mar 12, 2026

Pushed two additional fixes (7394176):

  1. Restored profit to the observation list (inventory_q.md): The resulting profit $R_{t+1}$ was missing from the "What the manager needs to know" list — it appears directly in the Q-learning update rule so the manager must observe it. The list now has 5 items (profit restored, discount factor kept).

  2. Fixed risk-sensitive optimistic init (rs_inventory_q.md): With γ=1 and v≈13–18, the true Q-values are exp(−v)≈10⁻⁸ to 10⁻⁶, not 10⁻⁵ to 10⁻⁴ as previously stated. The init value of 10⁻⁵ was actually above most true Q-values, contradicting the "modestly below" narrative. Changed q_init from 1e-5 to 1e-9 and updated the text to match. Build completes successfully with cached execution.

@jstac I have gone ahead and implemented the suggestions.

  • @mmcky to visually check html output with change in initial conditions.

@jstac
Copy link
Contributor Author

jstac commented Mar 12, 2026

Thanks @mmcky .

I think this can probably go live, unless @HumphreyYang wants to look again.

We could set Chihiro the task of reviewing this section of the lectures, as a starting point.

@HumphreyYang
Copy link
Member

Many thanks @jstac, I will study it this afternoon after finishing slides. Please feel free to merge this and I will open another PR if I spot anything!

@jstac
Copy link
Contributor Author

jstac commented Mar 12, 2026

Thanks @HumphreyYang .

@mmcky , please merge when ready.

@mmcky
Copy link
Contributor

mmcky commented Mar 12, 2026

@HumphreyYang merging this now.

@mmcky mmcky merged commit d825b92 into main Mar 12, 2026
1 of 2 checks passed
@mmcky mmcky deleted the rl_edits branch March 12, 2026 04:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants