diff --git a/lectures/back_prop.md b/lectures/back_prop.md
index 038d36983..82c04d537 100644
--- a/lectures/back_prop.md
+++ b/lectures/back_prop.md
@@ -201,7 +201,7 @@ $$ (eq:sgd)
 
 where $\frac{d {\mathcal L}}{dx_{N+1}}=-\left(x_{N+1}-y\right)$ and $\alpha > 0 $ is a step size.
 
-(See [this](https://en.wikipedia.org/wiki/Gradient_descent#Description) and [this](https://en.wikipedia.org/wiki/Newton%27s_method) to gather insights about how stochastic gradient descent
+(See [this](https://en.wikipedia.org/wiki/Gradient_descent#Description) and [this](https://en.wikipedia.org/wiki/Newton's_method) to gather insights about how stochastic gradient descent
 relates to Newton's method.)
 
 To implement one step of this parameter update rule, we want the vector of derivatives $\frac{dx_{N+1}}{dp_k}$.
@@ -540,7 +540,7 @@ Image(fig.to_image(format="png"))
 
 It is fun to think about how deepening the neural net for the above example affects the quality of approximation
 
-* If the network is too deep, you'll run into the [vanishing gradient problem](https://neuralnetworksanddeeplearning.com/chap5.html)
+* If the network is too deep, you'll run into the [vanishing gradient problem](https://en.wikipedia.org/wiki/Vanishing_gradient_problem)
 * Other parameters such as the step size and the number of epochs can be as important or more important than the number of layers in the situation considered in this lecture.
 * Indeed, since $f$ is a linear function of $x$, a one-layer network with the identity map as an activation would probably work best.