- A loss function measures how far a model’s prediction is from the actual (ground truth) value for a single data point.
- It’s basically an error metric: the smaller the loss, the better the prediction.
-
Example:
- Predicted = 0.8, Actual = 1.0
- Squared Error = (0.8 − 1.0)² = 0.04 → this is the loss for that one point.
- A cost function is the average (or sum) of losses over the entire dataset.
- So while loss = error for one training sample, cost = overall measure of how well the model is doing.
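The distinction can be made concrete in a few lines of plain Python (an illustrative sketch, not a library API), reusing the squared-error example above:

```python
def squared_error(y_true, y_pred):
    """Loss for a single data point."""
    return (y_pred - y_true) ** 2

def cost(y_true_list, y_pred_list):
    """Cost = average of the per-sample losses over the dataset."""
    losses = [squared_error(t, p) for t, p in zip(y_true_list, y_pred_list)]
    return sum(losses) / len(losses)

loss_one_point = squared_error(1.0, 0.8)      # loss for one sample: 0.04
overall_cost = cost([1.0, 0.0], [0.8, 0.1])   # average over the dataset
```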
Formally, $J(\theta) = \frac{1}{N}\sum_{i=1}^{N} \mathcal{L}\left(y^{(i)}, \hat{y}^{(i)}\right)$, i.e. the cost averages the per-sample losses.
- Training Guidance: They tell the optimizer (SGD, Adam, etc.) how to update weights.
- Model Evaluation: They measure progress during training and validation.
- Choosing the Right Function: Different tasks need different loss functions (regression vs classification).
- In Backpropagation: Gradients are computed by differentiating the loss function.
- In Evaluation: Even after training, cost functions are used to compare models.
- Standard for regression problems.
- Penalizes larger errors more heavily (squared term).
- Always non-negative.
- Differentiable (good for optimization).
- Smooth, convex, easy to differentiate.
- Works well when errors are normally distributed.
- Sensitive to outliers (because squaring magnifies large errors).
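A minimal MSE sketch in plain Python, showing the outlier sensitivity mentioned above (the data values are made up for illustration):

```python
def mse(y_true, y_pred):
    """Mean Squared Error: average of squared residuals.
    Squaring penalizes large errors more heavily, which is also
    why MSE is sensitive to outliers."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

clean = mse([1.0, 2.0, 3.0], [1.1, 2.1, 3.1])     # all errors small
outlier = mse([1.0, 2.0, 3.0], [1.1, 2.1, 13.0])  # one bad prediction dominates
```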
The loss is $\mathcal{L}_{\text{BCE}} = -\big[y \log \hat{y} + (1-y)\log(1-\hat{y})\big]$, where:
- $y \in \{0,1\}$ is the true label.
- $\hat{y} \in [0,1]$ is the predicted probability.
- Binary classification problems (spam/not spam, yes/no).
- Works with sigmoid output activation.
- Strongly penalizes confident but wrong predictions.
- If model predicts 0.99 but true label = 0 → very high loss.
- Probabilistic interpretation.
- Encourages the model to output probabilities close to the true label.
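A sketch of BCE for a single sample in plain Python (the `eps` clamp is a common numerical safeguard against `log(0)`, added here as an assumption):

```python
import math

def bce(y, y_hat, eps=1e-12):
    """Binary cross-entropy for one sample.
    y in {0, 1}; y_hat in (0, 1) from a sigmoid output."""
    y_hat = min(max(y_hat, eps), 1 - eps)  # avoid log(0)
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

confident_right = bce(1, 0.99)  # small loss
confident_wrong = bce(0, 0.99)  # very high loss, as noted above
```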
For K-class classification, the loss is $\mathcal{L}_{\text{CCE}} = -\sum_{i=1}^{K} y_i \log \hat{y}_i$, where:
- $y_i$: one-hot encoded true label.
- $\hat{y}_i$: predicted probability from the Softmax output.
- Multiclass classification (MNIST digit recognition, ImageNet classification).
- Each sample belongs to one class only.
- Encourages the correct class probability → 1.
- Works hand-in-hand with softmax activation in the output layer.
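The softmax + cross-entropy pairing can be sketched in plain Python (logits and labels are made up for illustration):

```python
import math

def softmax(logits):
    """Convert raw scores to probabilities (subtracting the max for stability)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def categorical_ce(one_hot, probs, eps=1e-12):
    """-sum(y_i * log(p_i)); with one-hot labels only the true-class term survives."""
    return -sum(y * math.log(max(p, eps)) for y, p in zip(one_hot, probs))

probs = softmax([2.0, 1.0, 0.1])          # class 0 gets the highest probability
loss = categorical_ce([1, 0, 0], probs)   # equals -log(probability of class 0)
```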
Here are additional ones that may come up in exams or practical use:
- Combines MSE and MAE (Mean Absolute Error).
- Less sensitive to outliers than MSE.
- Used in regression when data may have noise.
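A sketch of the Huber loss in plain Python (`delta = 1.0` is a common default, assumed here):

```python
def huber(y, y_hat, delta=1.0):
    """Huber loss: quadratic (MSE-like) for small errors,
    linear (MAE-like) once |error| exceeds delta."""
    err = abs(y - y_hat)
    if err <= delta:
        return 0.5 * err ** 2
    return delta * (err - 0.5 * delta)

small = huber(1.0, 1.2)   # quadratic region: 0.5 * 0.2^2
large = huber(1.0, 11.0)  # linear region: grows with the error, not its square
```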
- Used for SVMs and sometimes neural networks.
- Good for classification problems.
- Measures how one probability distribution differs from another.
- Used in variational autoencoders (VAEs), regularization, and information theory.
- Simpler than MSE.
- Robust to outliers, but less smooth for optimization.
- Instead of strict one-hot labels, assign small probabilities to other classes.
- Helps prevent overconfidence in predictions.
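Label smoothing can be sketched in plain Python (`eps = 0.1` is a typical value, assumed here for illustration):

```python
def smooth_labels(one_hot, eps=0.1):
    """Label smoothing: the true class keeps 1 - eps of the mass,
    and eps is spread uniformly over all K classes."""
    k = len(one_hot)
    return [y * (1 - eps) + eps / k for y in one_hot]

smoothed = smooth_labels([1, 0, 0])  # no class is exactly 0 or 1 anymore
```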
| Loss Function | Typical Use | Activation at Output | Notes |
|---|---|---|---|
| MSE | Regression | Linear | Sensitive to outliers |
| MAE | Regression | Linear | More robust than MSE |
| Huber | Regression | Linear | Balance between MSE & MAE |
| Binary Cross-Entropy | Binary classification | Sigmoid | Penalizes wrong confident predictions |
| Categorical Cross-Entropy | Multiclass classification | Softmax | Standard for classification |
| Hinge Loss | Classification (SVM) | Linear | Margin-based |
| KL Divergence | Probabilistic models | Softmax/Prob dists | Used in VAEs, NLP, etc. |
Summary:
- Loss = single point error, Cost = overall error.
- Pick MSE/MAE/Huber for regression, Cross-Entropy for classification.
- Loss functions are at the heart of backpropagation.
- Once we have a loss function (measuring error), we need a way to minimize it by updating weights.
- This is done by an optimization algorithm.
- Core idea: adjust parameters in the direction that reduces loss the most.
- Parameters: weights $W$ and biases $b$.
- Gradient: slope/derivative of the loss function with respect to the parameters.
  - Tells us how to change weights to reduce error.
- Learning Rate ($\eta$): step size for weight updates.
  - Too high → divergence (overshooting).
  - Too low → very slow learning.
Update rule: $\theta \leftarrow \theta - \eta \, \nabla_\theta J(\theta)$, where:
- $\theta$ = parameters (weights, biases).
- $\eta$ = learning rate.
- $J(\theta)$ = cost function.
- $\nabla_\theta J(\theta)$ = gradient of the cost wrt parameters.
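The update rule can be demonstrated on a toy convex cost (plain Python sketch; the objective $J(\theta) = (\theta - 3)^2$ and $\eta = 0.1$ are made up for illustration):

```python
# J(theta) = (theta - 3)^2 has gradient 2 * (theta - 3) and minimum at theta = 3.
def grad_J(theta):
    return 2 * (theta - 3)

theta = 0.0
eta = 0.1  # learning rate
for _ in range(100):
    theta = theta - eta * grad_J(theta)  # theta <- theta - eta * gradient
# theta has converged very close to the minimum at 3
```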
- Computes gradient using entire dataset.
- Guarantees smooth convergence for convex problems.
- Stable convergence.
- True direction of steepest descent.
- Very slow for large datasets (need to process all samples before one update).
- Memory expensive.
- Instead of using the whole dataset, update weights using one sample at a time.
- Updates are noisy, but this noise can help escape local minima.
- Converges faster than batch GD (frequent updates).
- Fast updates (especially for large datasets).
- Helps in escaping shallow local minima.
- Very noisy trajectory → loss curve fluctuates heavily.
- Requires tuning learning rate carefully.
- Momentum: smooths updates by adding a velocity term.
- Adaptive learning rates: AdaGrad, RMSprop, Adam.
- Compromise between Batch GD and SGD.
- Uses a small subset of data (mini-batch) for each update.
- Typical batch sizes: 16, 32, 64, 128.
- More stable than SGD.
- Faster than full Batch GD.
- Works well with GPU parallelization.
- Efficiency + stability.
- Smooth convergence but still allows some stochasticity.
- Choosing batch size is tricky (too small → noisy, too large → slow).
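A mini-batch GD sketch on a toy 1-D regression (plain Python; the data, $\eta$, and batch size are illustrative assumptions). Setting `batch_size=1` recovers SGD, and `batch_size=len(data)` recovers batch GD:

```python
import random

def minibatch_gd_epoch(data, theta, eta=0.05, batch_size=32):
    """One epoch of mini-batch GD on a 1-D least-squares toy problem:
    find theta minimizing (y - theta * x)^2 over (x, y) pairs."""
    random.shuffle(data)  # stochasticity comes from the shuffled batches
    for start in range(0, len(data), batch_size):
        batch = data[start:start + batch_size]
        # Average gradient of (y - theta*x)^2 over the mini-batch:
        grad = sum(-2 * x * (y - theta * x) for x, y in batch) / len(batch)
        theta -= eta * grad
    return theta

# Data generated from y = 2x, so theta should move toward 2.
data = [(x, 2 * x) for x in [0.5, 1.0, 1.5, 2.0] * 8]
theta = 0.0
for _ in range(50):
    theta = minibatch_gd_epoch(data, theta)
```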
- Adds a fraction of the previous update to the current one.
- Prevents oscillations and accelerates convergence.
- Looks ahead by applying momentum before computing gradient.
- Improves convergence speed.
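The momentum update can be sketched in plain Python (the toy objective $J(\theta) = \theta^2$ and $\beta = 0.9$ are illustrative assumptions):

```python
# Momentum on J(theta) = theta^2 (gradient 2*theta): the velocity term
# accumulates a fraction of previous updates, smoothing the trajectory.
def grad(theta):
    return 2 * theta

theta, velocity = 5.0, 0.0
eta, beta = 0.1, 0.9
for _ in range(200):
    velocity = beta * velocity - eta * grad(theta)  # fraction of previous update
    theta = theta + velocity
# theta has spiraled in close to the minimum at 0
```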
- AdaGrad: adapts the learning rate per parameter (good for sparse data, NLP).
- RMSprop: scales updates by a moving average of squared gradients.
- Adam (Adaptive Moment Estimation): combines momentum + RMSprop.
  - Most popular today.
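A minimal sketch of one Adam step in plain Python ($\eta = 0.001$, $\beta_1 = 0.9$, $\beta_2 = 0.999$ are the commonly cited defaults; the toy objective is an illustration):

```python
import math

def adam_step(theta, g, m, v, t, eta=0.001, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update: m is the momentum term (first moment), v the
    RMSprop-style squared-gradient average (second moment). The bias
    correction compensates for m and v starting at zero."""
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    m_hat = m / (1 - b1 ** t)   # bias-corrected first moment
    v_hat = v / (1 - b2 ** t)   # bias-corrected second moment
    theta = theta - eta * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v

# Minimize J(theta) = theta^2 (gradient 2*theta):
theta, m, v = 1.0, 0.0, 0.0
for t in range(1, 5001):
    theta, m, v = adam_step(theta, 2 * theta, m, v, t)
```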
- Decay: gradually reduce learning rate during training.
- Warm restarts: periodically reset learning rate for better exploration.
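Both schedules can be sketched in plain Python (decay rate and restart period are illustrative assumptions):

```python
import math

def exp_decay(eta0, step, decay_rate=0.96, decay_steps=1000):
    """Exponential decay: the learning rate shrinks smoothly with training steps."""
    return eta0 * decay_rate ** (step / decay_steps)

def cosine_warm_restart(eta0, step, period=1000):
    """Cosine schedule with warm restarts: eta decays toward 0 over each
    period, then resets to eta0 for renewed exploration."""
    progress = (step % period) / period
    return eta0 * 0.5 * (1 + math.cos(math.pi * progress))
```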
- Goal: Train a neural network by minimizing the loss function.
- Neural networks are composed of layers of functions. Each function’s output depends on weights and inputs.
- To update weights using gradient descent, we need gradients of the loss wrt weights.
- Backpropagation uses the chain rule of calculus to efficiently compute these gradients from output layer → backward to input layer.
Flow:
- Forward Pass: compute predictions step by step.
- Loss Calculation: measure error with a loss function.
- Backward Pass (Backpropagation): propagate error backward using chain rule to compute gradients.
- Parameter Update: apply gradient descent (or its variants).
For a neuron:
- $z$ = linear combination of inputs
- $f(z)$ = activation function
- $a = f(z)$ = neuron output

During backpropagation, we compute $\frac{\partial \mathcal{L}}{\partial w}$ using the chain rule:

$$ \frac{\partial \mathcal{L}}{\partial w} = \frac{\partial \mathcal{L}}{\partial a} \cdot \frac{\partial a}{\partial z} \cdot \frac{\partial z}{\partial w} $$

This cascades backward through the layers.
Let’s use a 2–2–1 network (2 inputs, 1 hidden layer with 2 neurons, 1 output). Task: binary classification (sigmoid output).
- Inputs: $x_1, x_2$
- Hidden Layer (2 neurons, activation = sigmoid):
  $z_1 = w_{11}x_1 + w_{21}x_2 + b_1$, $a_1 = \sigma(z_1)$
  $z_2 = w_{12}x_1 + w_{22}x_2 + b_2$, $a_2 = \sigma(z_2)$
- Output Layer (1 neuron, sigmoid):
  $z_3 = w_{13}a_1 + w_{23}a_2 + b_3$, $\hat{y} = \sigma(z_3)$
Loss function: Binary Cross-Entropy (BCE).
- Input: $x = [1, 0]$
- True label: $y = 1$
- Initial weights & biases:
  $w_{11} = 0.2, w_{21} = 0.4, b_1 = 0.1$
  $w_{12} = 0.3, w_{22} = 0.1, b_2 = 0.2$
  $w_{13} = 0.7, w_{23} = 0.5, b_3 = 0.3$
- Activation: Sigmoid
- Hidden neuron 1:
$$ z_1 = (0.2)(1) + (0.4)(0) + 0.1 = 0.3 $$
$$ a_1 = \sigma(0.3) \approx 0.574 $$
- Hidden neuron 2:
$$ z_2 = (0.3)(1) + (0.1)(0) + 0.2 = 0.5 $$
$$ a_2 = \sigma(0.5) \approx 0.622 $$
- Output neuron:
$$ z_3 = (0.7)(0.574) + (0.5)(0.622) + 0.3 \approx 1.013 $$
$$ \hat{y} = \sigma(1.013) \approx 0.734 $$

So the model predicts $\hat{y} \approx 0.734$.
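To double-check the arithmetic, the forward pass can be reproduced in a few lines of plain Python:

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

# Input and initial weights/biases from the example above
x1, x2 = 1.0, 0.0
w11, w21, b1 = 0.2, 0.4, 0.1
w12, w22, b2 = 0.3, 0.1, 0.2
w13, w23, b3 = 0.7, 0.5, 0.3

z1 = w11 * x1 + w21 * x2 + b1   # 0.3
a1 = sigmoid(z1)                 # ~0.574
z2 = w12 * x1 + w22 * x2 + b2   # 0.5
a2 = sigmoid(z2)                 # ~0.622
z3 = w13 * a1 + w23 * a2 + b3   # ~1.013
y_hat = sigmoid(z3)              # ~0.734
```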
Binary Cross-Entropy Loss:
$$ \mathcal{L} = -\big[y \log \hat{y} + (1-y)\log(1-\hat{y})\big] $$
Since $y = 1$:
$$ \mathcal{L} = -\log(\hat{y}) = -\log(0.734) \approx 0.310 $$
Derivative of the BCE loss (with sigmoid output) wrt the output pre-activation:
$$ \delta_3 = \frac{\partial \mathcal{L}}{\partial z_3} = \hat{y} - y = 0.734 - 1 = -0.266 $$
Gradients for output weights:
$$ \frac{\partial \mathcal{L}}{\partial w_{13}} = \delta_3 \, a_1 = (-0.266)(0.574) \approx -0.153 $$
$$ \frac{\partial \mathcal{L}}{\partial w_{23}} = \delta_3 \, a_2 = (-0.266)(0.622) \approx -0.165 $$
$$ \frac{\partial \mathcal{L}}{\partial b_3} = \delta_3 \approx -0.266 $$
Hidden Layer
For hidden neuron 1:
$\sigma'(z_1) = a_1(1 - a_1) = 0.574(1-0.574) \approx 0.244$
$\delta_1 = w_{13}\,\delta_3\,\sigma'(z_1) = (0.7)(-0.266)(0.244) \approx -0.045$
For hidden neuron 2:
$\sigma'(z_2) = a_2(1 - a_2) = 0.622(1-0.622) \approx 0.235$
$\delta_2 = w_{23}\,\delta_3\,\sigma'(z_2) = (0.5)(-0.266)(0.235) \approx -0.031$
Gradients for hidden weights:
$$ \frac{\partial \mathcal{L}}{\partial w_{11}} = \delta_1 x_1 \approx -0.045, \qquad \frac{\partial \mathcal{L}}{\partial w_{21}} = \delta_1 x_2 = 0, \qquad \frac{\partial \mathcal{L}}{\partial b_1} = \delta_1 $$
$$ \frac{\partial \mathcal{L}}{\partial w_{12}} = \delta_2 x_1 \approx -0.031, \qquad \frac{\partial \mathcal{L}}{\partial w_{22}} = \delta_2 x_2 = 0, \qquad \frac{\partial \mathcal{L}}{\partial b_2} = \delta_2 $$
Learning rate: assume $\eta = 0.1$ for illustration.
Example update for $w_{13}$:
$$ w_{13} \leftarrow w_{13} - \eta \frac{\partial \mathcal{L}}{\partial w_{13}} = 0.7 - (0.1)(-0.153) \approx 0.715 $$
Do similar updates for all weights and biases.
- Next iteration: forward pass with updated weights → new loss → backprop again.
- Over many epochs, loss decreases, predictions improve.
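The backward pass and one round of gradient-descent updates can be reproduced in plain Python (the learning rate $\eta = 0.1$ is assumed for illustration):

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

# Same network, input, and initial weights as the worked example
x = [1.0, 0.0]
w11, w21, b1 = 0.2, 0.4, 0.1
w12, w22, b2 = 0.3, 0.1, 0.2
w13, w23, b3 = 0.7, 0.5, 0.3
y = 1.0
eta = 0.1  # assumed learning rate for illustration

# Forward pass
a1 = sigmoid(w11 * x[0] + w21 * x[1] + b1)
a2 = sigmoid(w12 * x[0] + w22 * x[1] + b2)
y_hat = sigmoid(w13 * a1 + w23 * a2 + b3)

# Backward pass: BCE + sigmoid gives delta3 = y_hat - y at the output
delta3 = y_hat - y
delta1 = w13 * delta3 * a1 * (1 - a1)  # chain rule into hidden neuron 1
delta2 = w23 * delta3 * a2 * (1 - a2)  # chain rule into hidden neuron 2

# Gradient-descent updates (gradient of loss wrt each weight = delta * input)
w13 -= eta * delta3 * a1
w23 -= eta * delta3 * a2
b3  -= eta * delta3
w11 -= eta * delta1 * x[0]
w21 -= eta * delta1 * x[1]
b1  -= eta * delta1
w12 -= eta * delta2 * x[0]
w22 -= eta * delta2 * x[1]
b2  -= eta * delta2
```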
- Forward pass: compute activations layer by layer.
- Loss computation: compare prediction with true label.
- Backward pass: compute gradients using chain rule.
- Parameter update: use Gradient Descent (SGD, Mini-batch, Adam, etc.).
- Repeat until convergence.
This small worked example showed:
input layer → hidden activations → output → loss → gradient descent → backpropagation.