From a6f3406c8f263a4ede269e1cf639a473fb237001 Mon Sep 17 00:00:00 2001
From: "copilot-swe-agent[bot]" <198982749+Copilot@users.noreply.github.com>
Date: Tue, 9 Sep 2025 02:39:11 +0000
Subject: [PATCH 1/2] Initial plan

From ff5b354ed1d8ae51d3eb0dfa39f0ec6ea9832311 Mon Sep 17 00:00:00 2001
From: "copilot-swe-agent[bot]" <198982749+Copilot@users.noreply.github.com>
Date: Tue, 9 Sep 2025 02:57:25 +0000
Subject: [PATCH 2/2] Fix broken links in back_prop.md - URL encoding and dead domain

Co-authored-by: mmcky <8263752+mmcky@users.noreply.github.com>
---
 lectures/back_prop.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/lectures/back_prop.md b/lectures/back_prop.md
index 038d36983..82c04d537 100644
--- a/lectures/back_prop.md
+++ b/lectures/back_prop.md
@@ -201,7 +201,7 @@ $$ (eq:sgd)
 
 where $\frac{d {\mathcal L}}{dx_{N+1}}=-\left(x_{N+1}-y\right)$ and $\alpha > 0 $ is a step size.
 
-(See [this](https://en.wikipedia.org/wiki/Gradient_descent#Description) and [this](https://en.wikipedia.org/wiki/Newton%27s_method) to gather insights about how stochastic gradient descent
+(See [this](https://en.wikipedia.org/wiki/Gradient_descent#Description) and [this](https://en.wikipedia.org/wiki/Newton's_method) to gather insights about how stochastic gradient descent
 relates to Newton's method.)
 
 To implement one step of this parameter update rule, we want the vector of derivatives $\frac{dx_{N+1}}{dp_k}$.
@@ -540,7 +540,7 @@ Image(fig.to_image(format="png"))
 
 It is fun to think about how deepening the neural net for the above example affects the quality of approximation
 
-* If the network is too deep, you'll run into the [vanishing gradient problem](https://neuralnetworksanddeeplearning.com/chap5.html)
+* If the network is too deep, you'll run into the [vanishing gradient problem](https://en.wikipedia.org/wiki/Vanishing_gradient_problem)
 * Other parameters such as the step size and the number of epochs can be as important or more important than the number of layers in the situation considered in this lecture.
 * Indeed, since $f$ is a linear function of $x$, a one-layer network with the identity map as an activation would probably work best.
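
Note (not part of the patch): the replacement URLs can be checked before merging. A minimal sketch, assuming Python with the `requests` library installed; the URL list below is copied from the diff above, not from any project tooling:

    # Minimal sketch: confirm each replacement link resolves.
    import requests

    urls = [
        "https://en.wikipedia.org/wiki/Gradient_descent#Description",
        "https://en.wikipedia.org/wiki/Newton's_method",
        "https://en.wikipedia.org/wiki/Vanishing_gradient_problem",
    ]

    for url in urls:
        # A HEAD request is enough to confirm the page exists; follow
        # redirects in case Wikipedia normalizes the URL.
        resp = requests.head(url, allow_redirects=True, timeout=10)
        print(resp.status_code, url)

A 200 status for each URL indicates the fixed links resolve.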