From a6f3406c8f263a4ede269e1cf639a473fb237001 Mon Sep 17 00:00:00 2001
From: "copilot-swe-agent[bot]" <198982749+Copilot@users.noreply.github.com>
Date: Tue, 9 Sep 2025 02:39:11 +0000
Subject: [PATCH 1/2] Initial plan

From ff5b354ed1d8ae51d3eb0dfa39f0ec6ea9832311 Mon Sep 17 00:00:00 2001
From: "copilot-swe-agent[bot]" <198982749+Copilot@users.noreply.github.com>
Date: Tue, 9 Sep 2025 02:57:25 +0000
Subject: [PATCH 2/2] Fix broken links in back_prop.md - URL encoding and dead domain

Co-authored-by: mmcky <8263752+mmcky@users.noreply.github.com>
---
 lectures/back_prop.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/lectures/back_prop.md b/lectures/back_prop.md
index 038d36983..82c04d537 100644
--- a/lectures/back_prop.md
+++ b/lectures/back_prop.md
@@ -201,7 +201,7 @@ $$ (eq:sgd)
 
 where $\frac{d {\mathcal L}}{dx_{N+1}}=-\left(x_{N+1}-y\right)$ and $\alpha > 0 $ is a step size.
 
-(See [this](https://en.wikipedia.org/wiki/Gradient_descent#Description) and [this](https://en.wikipedia.org/wiki/Newton%27s_method) to gather insights about how stochastic gradient descent
+(See [this](https://en.wikipedia.org/wiki/Gradient_descent#Description) and [this](https://en.wikipedia.org/wiki/Newton's_method) to gather insights about how stochastic gradient descent
 relates to Newton's method.)
 
 To implement one step of this parameter update rule, we want the vector of derivatives $\frac{dx_{N+1}}{dp_k}$.
@@ -540,7 +540,7 @@ Image(fig.to_image(format="png"))
 
 It is fun to think about how deepening the neural net for the above example affects the quality of approximation
 
-* If the network is too deep, you'll run into the [vanishing gradient problem](https://neuralnetworksanddeeplearning.com/chap5.html)
+* If the network is too deep, you'll run into the [vanishing gradient problem](https://en.wikipedia.org/wiki/Vanishing_gradient_problem)
 * Other parameters such as the step size and the number of epochs can be as important or more important than the number of layers in the situation considered in this lecture.
 * Indeed, since $f$ is a linear function of $x$, a one-layer network with the identity map as an activation would probably work best.
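
Note (not part of the patch): the replacement URLs can be checked before merging. A minimal sketch, assuming Python with the `requests` library installed; the URL list below is copied from the diff above, not from any project tooling:

    # Minimal sketch: confirm each replacement link resolves.
    import requests

    urls = [
        "https://en.wikipedia.org/wiki/Gradient_descent#Description",
        "https://en.wikipedia.org/wiki/Newton's_method",
        "https://en.wikipedia.org/wiki/Vanishing_gradient_problem",
    ]

    for url in urls:
        # A HEAD request is enough to confirm the page exists; follow
        # redirects in case Wikipedia normalizes the URL.
        resp = requests.head(url, allow_redirects=True, timeout=10)
        print(resp.status_code, url)

A 200 status for each URL indicates the fixed links resolve.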