- A Feedforward Neural Network is the simplest form of Artificial Neural Network (ANN).
- The flow of information is one-way only: from inputs → hidden layers → outputs. No loops or feedback.
- Input Layer: takes raw features (e.g., pixels of an image).
- Hidden Layers: each neuron applies a weighted sum + bias → activation function → passes to the next layer.
- Output Layer: final predictions (classification, regression, etc.).
For one neuron:
$$ a = f(Wx + b) $$
where:
- $W$: weights,
- $x$: input vector,
- $b$: bias,
- $f$: activation function,
- $a$: neuron's output.
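A minimal sketch of this computation in NumPy (the weights, input, and bias below are made-up illustrative values, not from any trained model):

```python
import numpy as np

def neuron_forward(W, x, b, f):
    """Single neuron: weighted sum + bias, passed through activation f."""
    return f(W @ x + b)

# Illustrative values
W = np.array([0.5, -0.2, 0.1])   # weights
x = np.array([1.0, 2.0, 3.0])    # input vector
b = 0.4                          # bias
relu = lambda z: np.maximum(0.0, z)

a = neuron_forward(W, x, b, relu)   # f(0.5*1 - 0.2*2 + 0.1*3 + 0.4) = 0.8
```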
- Universal function approximator: With enough neurons/layers, FNNs can approximate any continuous function.
- Static model: no internal memory of past inputs. (In contrast: RNNs handle sequential data).
- Training: done using backpropagation + optimization algorithm (SGD, Adam, etc.).
- Shallow vs Deep Networks: one hidden layer vs many hidden layers.
- Overfitting risk: if network is too big without regularization.
- Batch Normalization, Dropout: techniques to improve training.
- Vanishing/Exploding gradients: problem with deep FNNs, led to innovations like ReLU and ResNets.
- Linear models: $y = w^T x + b$.
- Good for linearly separable data (like classifying if a point is above/below a line).
- But fail for complex patterns (e.g., XOR problem, image recognition).
- XOR is not linearly separable.
- A single line cannot divide (0,0), (1,1) vs (0,1), (1,0).
- Neural networks with non-linear activation functions solve XOR easily.
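A hedged sketch of this claim: the weights below are hand-picked (not learned) to show that two layers with a non-linear step activation compute XOR, which no single linear boundary can:

```python
import numpy as np

def step(z):
    """Non-linear step activation: 1 if z > 0, else 0."""
    return (z > 0).astype(float)

def xor_net(x):
    """Two-layer network with hand-picked weights that computes XOR.
    Hidden unit 1 fires when x1 OR x2; hidden unit 2 fires when x1 AND x2."""
    W1 = np.array([[1.0, 1.0],    # OR-like unit
                   [1.0, 1.0]])   # AND-like unit
    b1 = np.array([-0.5, -1.5])
    h = step(W1 @ x + b1)
    # Output fires when OR is true but AND is false: exactly XOR
    W2 = np.array([1.0, -2.0])
    b2 = -0.5
    return step(W2 @ h + b2)

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, int(xor_net(np.array(x, dtype=float))))
```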
- Neural networks stack multiple layers of neurons.
- Each layer performs a non-linear transformation, allowing composition of simple features into complex ones.
- Example: In image classification,
- Lower layers detect edges,
- Mid layers detect shapes,
- Higher layers detect objects.
- Non-linear decision boundaries: curves, surfaces, etc.
- Kernel trick in SVMs: another way to handle non-linearity, but neural networks scale better with data.
- Deep Learning revolution: GPUs + big data allowed training of very large non-linear models.
- Input Layer: number of neurons = number of features.
- Hidden Layers: can be 1 or many. Depth gives more power.
- Output Layer: depends on task.
- Weights: strength of connections between neurons.
- Biases: allow shifting the activation threshold.
- Activation functions: introduce non-linearity.
- Loss function: measures error (cross-entropy, MSE, etc.).
- Forward Pass: compute outputs layer by layer.
- Loss Calculation: compare prediction vs ground truth.
- Backpropagation: compute gradients of loss wrt weights.
- Weight Update: optimizer (SGD, Adam, RMSprop).
- Learning rate (too high → unstable, too low → slow).
- Batch size (trade-off between speed and stability).
- Number of layers/neurons (capacity of model).
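The four training steps above can be sketched as a full loop in NumPy, with gradients derived by hand (no autograd). The toy data, layer sizes, learning rate, and batch size are made-up choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression task (made up): learn y = sin(x)
X = rng.uniform(-3, 3, size=(256, 1))
y = np.sin(X)

# One hidden layer with tanh; Xavier-style scaling on the second layer
W1 = rng.normal(0, 1.0, size=(1, 16));                b1 = np.zeros(16)
W2 = rng.normal(0, 1.0 / np.sqrt(16), size=(16, 1));  b2 = np.zeros(1)

lr = 0.05        # learning rate: too high -> unstable, too low -> slow
batch_size = 32  # trade-off between speed and stability

for epoch in range(500):
    idx = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        xb = X[idx[start:start + batch_size]]
        yb = y[idx[start:start + batch_size]]
        # 1. Forward pass: compute outputs layer by layer
        h = np.tanh(xb @ W1 + b1)
        pred = h @ W2 + b2
        # 2. Loss calculation: MSE between prediction and ground truth
        err = pred - yb
        # 3. Backpropagation: gradients of the loss w.r.t. each weight
        dpred = 2.0 * err / len(xb)
        dW2 = h.T @ dpred
        db2 = dpred.sum(axis=0)
        dh = dpred @ W2.T
        dz = dh * (1.0 - h ** 2)     # tanh'(z) = 1 - tanh(z)^2
        dW1 = xb.T @ dz
        db1 = dz.sum(axis=0)
        # 4. Weight update: plain SGD step
        W1 -= lr * dW1; b1 -= lr * db1
        W2 -= lr * dW2; b2 -= lr * db2

mse = float(np.mean((np.tanh(X @ W1 + b1) @ W2 + b2 - y) ** 2))
```

Swapping step 4 for Adam or RMSprop changes only the update rule; steps 1-3 stay the same.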
- Overfitting vs Underfitting:
  - Overfitting = memorizing training data,
  - Underfitting = failing to learn patterns.
- Regularization: L1/L2, Dropout, Early stopping.
- Weight Initialization: Xavier, He, etc. help training stability.
- Without activation functions, the entire network is just a linear model regardless of depth.
- Non-linear activations allow complex mappings.
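The first point can be verified directly: by associativity of matrix multiplication, stacked linear layers collapse into a single linear map (with biases, a single affine map). Layer sizes here are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(1)

# Three "layers" with NO activation function in between
W1 = rng.normal(size=(4, 8))
W2 = rng.normal(size=(8, 8))
W3 = rng.normal(size=(8, 2))
x = rng.normal(size=4)

deep = ((x @ W1) @ W2) @ W3        # stacked linear layers
collapsed = x @ (W1 @ W2 @ W3)     # one equivalent linear layer

print(np.allclose(deep, collapsed))  # True: depth adds nothing without non-linearity
```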
Sigmoid (Logistic)
$$ f(x) = \frac{1}{1 + e^{-x}} $$
- Outputs between (0,1).
- Good for probability interpretation.
- Problem: Vanishing gradients for large |x|.
Tanh
$$ f(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}} $$
- Outputs between (-1,1).
- Centered at 0 (better than sigmoid).
- Still suffers from vanishing gradient.
ReLU (Rectified Linear Unit)
$$ f(x) = \max(0, x) $$
- Most widely used.
- Pros: no vanishing gradient for positive inputs, very fast.
- Cons: Dying ReLU problem (neurons stuck at 0 if weights misaligned).
Leaky ReLU
$$ f(x) = \max(0.01x, x) $$
- Fixes dying ReLU.
ELU / GELU / Swish
- Advanced activations with smoother curves.
- Used in modern architectures (Transformers often use GELU).
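The four basic activations above, straight from their formulas, with the sigmoid's vanishing-gradient problem made visible (the sample inputs are arbitrary):

```python
import numpy as np

def sigmoid(x):    return 1.0 / (1.0 + np.exp(-x))
def tanh(x):       return np.tanh(x)
def relu(x):       return np.maximum(0.0, x)
def leaky_relu(x): return np.maximum(0.01 * x, x)

x = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])

# Sigmoid's gradient f(x)*(1 - f(x)) vanishes for large |x|
grad_sigmoid = sigmoid(x) * (1 - sigmoid(x))   # ~0 at x = -10 and x = 10

# ReLU's gradient is exactly 1 for every positive input
grad_relu = (x > 0).astype(float)
```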
- ReLU: default choice for hidden layers.
- Tanh/Sigmoid: rarely used in hidden layers today, but still used in RNNs (e.g., LSTM gates).
- GELU: cutting-edge models (BERT, GPT) use it.
- Derivative of activation: must be easy to compute for backpropagation.
- Batch Normalization: reduces dependency on activation scaling.
The output layer’s activation depends on type of problem:
- Regression: Linear activation (no activation at output).
- Binary Classification: Sigmoid (output in [0,1]).
- Multiclass Classification: Softmax.
- Converts raw scores (logits) into probabilities that sum to 1.
- Each class gets a probability.
- Intuitive: outputs are probabilities.
- Ensures competition: increasing one probability decreases others.
- Works well with Cross-Entropy Loss, which is standard for classification.
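A sketch of softmax in NumPy, using the standard max-subtraction trick for numerical stability (the logit values are made up):

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax: subtracting the max logit first
    does not change the result but prevents exp() overflow."""
    z = logits - np.max(logits)
    e = np.exp(z)
    return e / e.sum()

p = softmax(np.array([2.0, 1.0, 0.1]))
# p sums to 1; raising one logit would shrink the other classes' shares
```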
- Cross-Entropy Loss: for true label $y$ and predicted softmax $p$:
$$ L = -\sum_i y_i \log(p_i) $$
- Encourages correct class probability → 1.
- One-Hot Encoding: how labels are represented for softmax training.
- Logits: raw outputs before softmax.
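Putting the three ideas together — logits, softmax, one-hot labels, and the cross-entropy formula above — as a small sketch (the logit values are made up, and the `eps` guard against log(0) is an implementation convenience):

```python
import numpy as np

def softmax(logits):
    z = logits - np.max(logits)   # stability shift
    e = np.exp(z)
    return e / e.sum()

def cross_entropy(y_onehot, p, eps=1e-12):
    """L = -sum_i y_i log(p_i). With one-hot y this reduces to
    -log(probability assigned to the true class)."""
    return -np.sum(y_onehot * np.log(p + eps))

logits = np.array([2.0, 0.5, -1.0])  # raw network outputs
y = np.array([1.0, 0.0, 0.0])        # one-hot: true class is index 0

loss = cross_entropy(y, softmax(logits))
# Loss shrinks toward 0 as the true class's probability approaches 1
```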
- Sigmoid for Multi-label Classification: each class is independent, not mutually exclusive.
- Linear for Regression: predicts continuous values.