Conversation
Initialize Q-table to 20 (above true value range of 13-18) instead of zeros, which drives broader exploration via "optimism in the face of uncertainty". This speeds convergence enough to reduce the training run from 20M to 5M steps. Added a new subsection explaining the technique. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
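The mechanism in the commit message can be sketched as follows. This is an illustrative toy, not the lecture's actual code: the state/action sizes, learning rate, and discount factor (here 0.9) are hypothetical; only the initialization-at-20 idea comes from the commit.

```python
import numpy as np

n_states, n_actions = 5, 2
alpha, gamma = 0.1, 0.9  # illustrative values, not from the lecture

# Initialize every entry above the true value range (roughly 13-18),
# so unvisited state-action pairs look attractive and get explored.
Q = np.full((n_states, n_actions), 20.0)

def q_update(Q, s, a, r, s_next):
    """One tabular Q-learning update for a reward-maximizing agent."""
    target = r + gamma * Q[s_next].max()
    Q[s, a] += alpha * (target - Q[s, a])

# A visited entry is pulled toward its (lower) true value, while
# entries still at 20.0 are exactly the ones not yet tried.
q_update(Q, s=0, a=1, r=1.0, s_next=2)
```

Under "optimism in the face of uncertainty", greedy action selection then naturally cycles through untried actions, since anything unexplored still carries the inflated initial value.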
📖 Netlify Preview Ready! Preview URL: https://pr-830--sunny-cactus-210e3e.netlify.app

Initialize Q-table to 1e-5 (below true Q-value range of ~1e-5 to 1e-4) instead of ones. Since the optimal policy minimizes q, optimistic means initializing below the truth — the reverse of the risk-neutral case. This speeds convergence enough to reduce training from 20M to 5M steps. Added a subsection explaining the reversed optimistic init logic. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
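The reversed logic for a minimizing agent can be sketched in the same style. Again this is a hypothetical toy, not the lecture's risk-sensitive update rule; the sizes and learning rate are illustrative, and only the initialize-below-the-truth idea comes from the commit.

```python
import numpy as np

n_states, n_actions = 5, 2
alpha = 0.1  # illustrative learning rate

# The agent minimizes q, so "optimistic" now means initializing
# *below* the true (positive) q-values: untried actions look cheap.
q_init = 1e-5
Q = np.full((n_states, n_actions), q_init)

def q_update_min(Q, s, a, r, s_next):
    """One Q-learning update for a cost-minimizing agent (gamma = 1)."""
    target = r + Q[s_next].min()  # min, not max, over next actions
    Q[s, a] += alpha * (target - Q[s, a])

# The visited entry is pulled *up* toward its true cost, so the
# greedy argmin keeps preferring actions that are still unexplored.
q_update_min(Q, s=0, a=1, r=1e-4, s_next=2)
greedy_action = int(Q[0].argmin())
```

The symmetry with the risk-neutral case is exact: there, updates pull inflated entries down; here, they pull deflated entries up, and greedy selection uses argmin instead of argmax.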
@mmcky @HumphreyYang all reviews appreciated!
The PR introduces
Two additional review items:

1. Missing profit in the observation list. In the "What the manager needs to know" section, adding the discount factor β is a good idea (it appears in the update rule, and the intro now mentions the manager knows the interest rate). However, the realized profit R_{t+1} was removed to make room for it. Since R_{t+1} appears directly in the Q-learning update rule, the manager must observe it. I think this list should have 5 items, keeping profit and adding β, rather than replacing one with the other.

2. Risk-sensitive optimistic initialization logic. The new subsection says the Q-table is initialized at 10⁻⁵, "modestly below" the true Q-values of roughly 10⁻⁵ to 10⁻⁴. However, with γ = 1 and v* ≈ 13–18, the true Q-values are exp(−v*) ≈ 10⁻⁶ to 10⁻⁸, so 10⁻⁵ is actually above most of the true range. Since the agent minimizes, values above the truth look bad (not attractive), which is the opposite of the optimistic mechanism described in the text. The code works well empirically, so the initialization value may be fine, but the narrative explaining why it works appears to have the direction reversed. Worth verifying the actual Q-value range from the converged Q-table and adjusting the explanation.

(@jstac this comment is from Claude and I don't fully understand the Q-values, to be honest, but I thought it important to mention. Maybe @HumphreyYang can review and check this comment.)
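The arithmetic behind item 2 is easy to check directly. Assuming the stated range v* ≈ 13–18 and γ = 1, the true Q-values exp(−v*) fall well below the 10⁻⁵ initialization:

```python
import numpy as np

# Range of the value function stated in the review comment.
v_star_lo, v_star_hi = 13.0, 18.0

# True Q-values under the exp(-v*) mapping.
q_true_hi = np.exp(-v_star_lo)  # largest true Q-value, about 2.3e-6
q_true_lo = np.exp(-v_star_hi)  # smallest true Q-value, about 1.5e-8

q_init = 1e-5
# The initialization sits above the *entire* true range, so for a
# minimizing agent unvisited entries look worse than the truth.
print(q_init > q_true_hi > q_true_lo)  # True
```

So if the empirics still show fast convergence, the speedup must come from something other than the "initialize below the truth" story the subsection tells.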
Thanks @mmcky. I think Claude is spot on here and could be asked to fix this.
Pushed two additional fixes (7394176):
@jstac I have gone ahead and implemented the suggestions.
Thanks @mmcky. I think this can probably go live, unless @HumphreyYang wants to look again. We could set Chihiro the task of reviewing this section of the lectures, as a starting point.
Many thanks @jstac, I will study it this afternoon after finishing slides. Please feel free to merge this and I will open another PR if I spot anything!
Thanks @HumphreyYang. @mmcky, please merge when ready.
@HumphreyYang merging this now. |
Summary
Test plan
jupytext and runs without errors

🤖 Generated with Claude Code