In the field of deep learning, Convolutional Neural Networks (CNNs) have revolutionized the way machines interpret and classify images. CNNs are specially designed to process data with a grid-like topology, such as images, by applying convolution operations to extract hierarchical features — from simple edges to complex patterns.
Over the years, various CNN architectures have been developed, each improving upon the previous in terms of depth, accuracy, and computational efficiency. Understanding these architectures is essential for:
- Analyzing model design choices,
- Comparing performance across tasks,
- Gaining intuition on how deep learning models interpret visual data.
This section explores popular CNN architectures, starting with the LeNet-5 model, which laid the foundation for modern deep learning approaches in computer vision.
- Developed by: Yann LeCun, Leon Bottou, Yoshua Bengio, Patrick Haffner (1998)
- Purpose: Handwritten and machine-printed character recognition.
- Dataset: Designed for MNIST (digits 0–9).
- Significance: One of the first Convolutional Neural Networks (CNNs) – foundational model in deep learning history.
- Input: Grayscale image of size 32×32×1
- Total Parameters: ~60,000
| Layer # | Type | Details | Output Size |
|---|---|---|---|
| 1 | Input | 32×32 grayscale image | 32×32×1 |
| 2 | C1 – Convolution | 6 filters, 5×5 kernel, stride=1 | 28×28×6 |
| 3 | S2 – Average Pooling | 2×2 pool, stride=2 | 14×14×6 |
| 4 | C3 – Convolution | 16 filters, 5×5 kernel | 10×10×16 |
| 5 | S4 – Average Pooling | 2×2 pool, stride=2 | 5×5×16 |
| 6 | C5 – Convolution / FC | 120 filters, 5×5 kernel (acts like FC) | 1×1×120 |
| 7 | F6 – Fully Connected | 84 neurons | 84 |
| 8 | Output – Fully Connected + Softmax | 10 classes (digits 0–9) | 10 |
-
Convolution Layers:
- Extract local spatial features.
- Use small filters (e.g., 5×5).
- Apply activation functions (typically tanh or sigmoid in LeNet-5).
-
Average Pooling Layers:
- Downsample feature maps.
- Reduces computation and helps generalization.
-
Fully Connected Layers:
- Learn high-level features.
- Perform final classification using Softmax.
-
Activation Function:
- Originally used tanh/sigmoid; modern variants may use ReLU.
- Why 32×32 input: MNIST digits are 28×28, padded to 32×32 to allow better edge detection in convolutions.
- Pooling type: Used Average Pooling, not Max Pooling (which is more common today).
- Depth increases: From 1 channel to 6, 16, 120 as we go deeper.
- Feature abstraction: Shallow layers detect edges, deeper layers detect shapes, final layers detect digit representations.
- Parameter efficiency: Despite being deep, uses far fewer parameters than modern networks.
- LeNet-5 is a pioneering CNN model.
- Designed for simple image classification tasks.
- Important for understanding basic CNN components.
- A good example of how feature extraction and classification are combined in a neural network.
2. AlexNet
- Developed by: Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton (2012)
- Purpose: Large-scale image classification on ImageNet dataset (1.2 million images, 1000 classes)
- Significance: Winner of the ImageNet ILSVRC-2012 competition; pivotal model that revived deep learning research in computer vision
- Training: Used two Nvidia GeForce GTX 580 GPUs; network was split into two pipelines due to hardware limitations
- Parameters: Approximately 60 million
-
Layers: 8 layers in total
- 5 convolutional layers
- 3 fully connected layers
-
Input Size: 224×224×3 (RGB images)
-
Key Innovations:
- ReLU activation for faster training and better gradient flow
- Max Pooling instead of Average Pooling for better feature downsampling
- Dropout in fully connected layers to reduce overfitting
- Data Augmentation to artificially increase dataset size and improve generalization
- Local Response Normalization (LRN) introduced to mimic lateral inhibition observed in real neurons
- Stochastic Gradient Descent (SGD) optimization
| Layer # | Type | Details | Output Size |
|---|---|---|---|
| 1 | Convolution | 96 filters, 11×11 kernel, stride=4, ReLU | 55×55×96 |
| 2 | Max Pooling | 3×3 pool, stride=2 | 27×27×96 |
| 3 | Convolution | 256 filters, 5×5 kernel, padding=2, ReLU | 27×27×256 |
| 4 | Max Pooling | 3×3 pool, stride=2 | 13×13×256 |
| 5 | Convolution | 384 filters, 3×3 kernel, padding=1, ReLU | 13×13×384 |
| 6 | Convolution | 384 filters, 3×3 kernel, padding=1, ReLU | 13×13×384 |
| 7 | Convolution | 256 filters, 3×3 kernel, padding=1, ReLU | 13×13×256 |
| 8 | Max Pooling | 3×3 pool, stride=2 | 6×6×256 |
| 9 | Fully Connected | 4096 neurons, ReLU | 4096 |
| 10 | Fully Connected | 4096 neurons, ReLU | 4096 |
| 11 | Fully Connected | 1000 neurons (output classes), Softmax | 1000 |
-
ReLU Activation
- Replaced sigmoid/tanh to reduce vanishing gradient problem and accelerate training
-
Max Pooling
- More effective than average pooling in highlighting dominant features
-
Dropout
- Randomly disables neurons during training to prevent overfitting
-
Data Augmentation
- Techniques like image flipping, cropping, and color jittering increase training data diversity
-
Local Response Normalization (LRN)
- Encourages competition among neurons, improving generalization
-
Use of GPUs
- Allowed training of deeper and larger models, crucial for practical deep learning
- AlexNet marked the transition from shallow to deep CNN architectures for large-scale vision tasks.
- Introduced several key innovations that are standard in modern CNNs: ReLU, dropout, data augmentation, and LRN.
- Utilized multiple GPUs due to computational requirements, which was a significant technical advancement at the time.
- Despite high performance, it had a large number of hyperparameters making tuning complex.
- Model size (~60 million parameters) is significantly larger than LeNet-5 (~60,000 parameters).
- AlexNet is a landmark CNN architecture that demonstrated the power of deep learning on a large-scale dataset.
- Its success reignited research interest in CNNs and deep learning for computer vision.
- Innovations such as ReLU, dropout, and GPU-based training are now foundational in CNN design.
3. VGG-16 Net
- Developed by: Karen Simonyan and Andrew Zisserman (2014)
- Purpose: Improve on AlexNet by reducing the number of hyperparameters and increasing network depth with a more uniform architecture
- Achievement: 1st runner-up in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) 2014
- Parameters: Approximately 138 million
-
Input: RGB image of size 224×224×3
-
Depth: 16 weight layers (13 convolutional + 3 fully connected)
-
Key Design Change:
- Replaced large convolutional kernels (e.g., 11×11, 5×5 in AlexNet) with multiple stacked 3×3 convolutional filters
- Used 2×2 max pooling with stride 2 (not stride 1) to downsample while preserving important features
- Padding applied to keep spatial dimensions consistent after convolution
| Block | Layer Type | Filters / Neurons | Repetitions | Output Size |
|---|---|---|---|---|
| 1 | Convolution | 64 filters (3×3) | 2 | 224×224×64 |
| Max Pooling | 2×2, stride=2 | 1 | 112×112×64 | |
| 2 | Convolution | 128 filters (3×3) | 2 | 112×112×128 |
| Max Pooling | 2×2, stride=2 | 1 | 56×56×128 | |
| 3 | Convolution | 256 filters (3×3) | 3 | 56×56×256 |
| Max Pooling | 2×2, stride=2 | 1 | 28×28×256 | |
| 4 | Convolution | 512 filters (3×3) | 3 | 28×28×512 |
| Max Pooling | 2×2, stride=2 | 1 | 14×14×512 | |
| 5 | Convolution | 512 filters (3×3) | 3 | 14×14×512 |
| Max Pooling | 2×2, stride=2 | 1 | 7×7×512 | |
| Fully Connected | 4096 neurons | 2 | 4096 | |
| Fully Connected | 1000 neurons | 1 | 1000 (classes) |
-
Use of small 3×3 kernels stacked consecutively:
- Equivalent to a larger receptive field (e.g., two 3×3 conv layers have effective 5×5 receptive field)
- Reduces number of parameters and computational cost compared to large kernels
-
Consistent use of padding to maintain spatial resolution after convolutions
-
Max Pooling layers reduce spatial dimension by half each time
-
Fully connected layers at the end perform classification
-
ReLU activation used throughout the network for non-linearity
- VGG-16 solved the problem of too many hyperparameters in AlexNet by using a uniform architecture with smaller convolution kernels.
- The network is very deep (16 layers) compared to AlexNet (8 layers), which improved learning capacity and accuracy.
- Though powerful, VGG-16 is computationally expensive, with high memory requirements and longer training times.
- Suffers from vanishing/exploding gradient problems due to depth, making training more difficult without proper techniques (later addressed by ResNet).
- The large number of parameters (~138 million) makes it a heavy model for practical applications without hardware acceleration.
- VGG-16 is a deep CNN architecture with a simple and uniform design based on stacked 3×3 convolutional filters.
- It improved classification accuracy significantly by increasing depth while controlling parameters.
- Its drawbacks include heavy computational requirements and difficulty in training, motivating the development of more efficient models later.
4. ResNet
- Developed by: Kaiming He et al. (2015)
- Purpose: Address the degradation problem in very deep neural networks by enabling effective training of networks with over 100 layers
- Achievement: Winner of ILSVRC 2015 competition
- Key Innovation: Introduction of skip connections (residual connections) and batch normalization
- ResNet builds on architectures like VGG by stacking many layers but introduces residual blocks with identity skip connections
- These skip connections allow the network to learn a residual mapping instead of directly fitting the desired underlying mapping
- Enables training of very deep networks (e.g., ResNet-50, ResNet-101, ResNet-152) without degradation in accuracy
- Uses Batch Normalization to stabilize and speed up training
-
Skip Connections / Residual Learning
- Allows gradients to flow directly through the network by bypassing one or more layers
- If the weights of a layer degrade to zero, the output can still pass through unchanged (identity mapping).
-
Formula for residual block output:
a[l+2] = g(w[l+2] * a[l+1] + a[l])Where:
a[l]is the input to the residual blockw[l+2]represents the weights of the layergis the activation function
-
Vanishing Gradient Problem
- In very deep networks, gradients can become very small, preventing effective learning
- Skip connections mitigate this by providing alternate paths for gradient flow
-
Batch Normalization
- Normalizes layer inputs to reduce internal covariate shift, improving training speed and stability
- Before ResNet, increasing network depth beyond a certain point caused accuracy to saturate or degrade
- ResNet’s skip connections enable training of networks with hundreds or even thousands of layers, significantly improving performance on image recognition tasks
- Inspired by earlier ideas like highway networks with gated shortcuts and similar to skip connections in LSTMs for sequential data
- ResNet introduced a fundamental architectural change with residual connections allowing very deep CNNs to be trained effectively
- It overcame limitations of previous deep networks caused by vanishing gradients and degradation of performance
- ResNet models remain a strong baseline for modern deep learning research and applications
-
Proposed by: Researchers at Google in the paper “Going Deeper with Convolutions” (2014)
-
Alternative Name: GoogLeNet
-
Purpose: Efficiently capture both local and global features by applying multiple convolution kernel sizes in parallel
-
Key Motivation:
- In real-world images, salient features vary greatly in size
- Choosing the right kernel size becomes difficult
- Inception module solves this by applying multiple kernel sizes (1×1, 3×3, 5×5) in parallel
-
Model Depth:
- 22 layers deep (27 if pooling layers are counted)
- Contains 9 Inception modules
The Inception module applies multiple filters in parallel and then concatenates the results along the depth (channel) dimension.
Note: Same padding is used to preserve the dimension of the image.
How it works:
- Applies 1×1, 3×3, and 5×5 convolutions simultaneously
- Applies Max Pooling in parallel as well
- Concatenates all outputs and passes to the next layer
- This allows the network to capture features at multiple receptive fields in the same layer
-
Direct use of 3×3 and 5×5 kernels increases the number of parameters significantly (initial model ~120M parameters)
-
To address this, 1×1 convolutions are used before applying larger convolutions
- Act as bottlenecks that reduce depth (number of channels)
- Dramatically reduce computational cost and parameters
-
With this trick, total parameters reduced by ~90%
GoogLeNet stacks multiple inception modules to form a deep and wide architecture.
What this image shows:
-
Multiple inception modules are linked together
-
Side branches (auxiliary classifiers) predict outputs at intermediate depths
- Help with vanishing gradients
- Provide regularization and better convergence during training
- Multi-scale feature extraction within the same layer
- 1×1 convolutions used for both dimensionality reduction and as activation functions
- Parallel structure rather than sequential stacking
- Global average pooling replaces fully connected layers
- Auxiliary classifiers help with training deeper models
- Inception Net provided a significant improvement in both accuracy and efficiency
- Architecture is modular and scalable, allowing for variants like Inception v2, v3, and v4
- The model structure is wider (parallel operations), not just deeper
- Bottleneck layers using 1×1 convolutions are crucial in managing computational cost
- Inception Net (GoogLeNet) introduced a new approach of multi-kernel convolution within the same layer
- Efficient in handling varying spatial features while controlling parameter growth
- The model’s modular design inspired many modern deep learning architectures focused on both depth and width
- Later versions (v2, v3, v4) further optimized training and reduced computational complexity
6. EfficientNet
Developed by: Mingxing Tan and Quoc V. Le, Google AI (2019) Paper: "EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks" Purpose: To achieve state-of-the-art accuracy on image classification tasks while maintaining computational efficiency. Key Innovation: A novel compound scaling method that uniformly scales network depth, width, and input resolution using a principled approach rather than manual tuning. Achievements:
- Achieved top-1 accuracy of 84.3% on ImageNet with EfficientNet-B7, outperforming larger models like ResNet and Inception with significantly fewer parameters and FLOPs.
- Demonstrated an excellent balance between accuracy, model size, and inference speed, making it highly practical for deployment on various devices.
EfficientNet is based on a baseline architecture called EfficientNet-B0, which was discovered using Neural Architecture Search (NAS) optimized for accuracy and efficiency on mobile devices. The model family EfficientNet-B0 → B7 is generated by systematically scaling this baseline network using the compound scaling method.
Input: RGB image (varies per version, from 224×224 in B0 to 600×600 in B7) Number of Parameters: ~5.3M (B0) → ~66M (B7) Model Depth: 237 layers (B0) → 813 layers (B7, counting all operations) Key Components:
- MBConv (Mobile Inverted Bottleneck Convolution) blocks
- Squeeze-and-Excitation (SE) optimization
- Swish (SiLU) activation function
- Compound scaling of depth, width, and resolution
EfficientNet adopts the MBConv structure originally introduced in MobileNetV2.
Structure of MBConv Block:
- 1×1 Expansion Convolution: Expands input channels by a factor (e.g., ×6).
- 3×3 or 5×5 Depthwise Convolution: Applies lightweight spatial filtering per channel.
- Squeeze-and-Excitation (SE): Recalibrates channel-wise feature responses.
- 1×1 Projection Convolution: Reduces channels back to original size.
- Skip Connection: Used when input and output dimensions match.
Formula for MBConv Output: [ y = x + F(x) \quad \text{(if dimensions match)} ] where (F(x)) represents the non-linear transformation through the expansion, depthwise, and projection steps.
Advantages:
- Reduces parameters and computation compared to standard convolutions.
- Preserves representational power due to the SE and Swish activation.
Traditional CNN scaling increases either depth, width, or input resolution individually. EfficientNet introduces a compound scaling rule that balances all three dimensions simultaneously.
Let:
- Depth → ( d = \alpha^\phi )
- Width → ( w = \beta^\phi )
- Resolution → ( r = \gamma^\phi )
Subject to: [ \alpha \cdot \beta^2 \cdot \gamma^2 \approx 2 ] and [ \alpha, \beta, \gamma > 1 ]
Here, ( \phi ) is a user-specified coefficient that controls the overall model size (e.g., ( \phi = 0 ) for B0, ( \phi = 7 ) for B7).
This ensures that each version of EfficientNet scales uniformly and efficiently, maintaining a balance between model complexity and computational cost.
| Stage | Operator | #Layers | #Channels | Kernel Size | Stride | SE Ratio | Activation | Output Size |
|---|---|---|---|---|---|---|---|---|
| 1 | Conv | 1 | 32 | 3×3 | 2 | – | Swish | 112×112×32 |
| 2 | MBConv1 | 1 | 16 | 3×3 | 1 | 0.25 | Swish | 112×112×16 |
| 3 | MBConv6 | 2 | 24 | 3×3 | 2 | 0.25 | Swish | 56×56×24 |
| 4 | MBConv6 | 2 | 40 | 5×5 | 2 | 0.25 | Swish | 28×28×40 |
| 5 | MBConv6 | 3 | 80 | 3×3 | 2 | 0.25 | Swish | 14×14×80 |
| 6 | MBConv6 | 3 | 112 | 5×5 | 1 | 0.25 | Swish | 14×14×112 |
| 7 | MBConv6 | 4 | 192 | 5×5 | 2 | 0.25 | Swish | 7×7×192 |
| 8 | MBConv6 | 1 | 320 | 3×3 | 1 | 0.25 | Swish | 7×7×320 |
| 9 | Conv + Pool + FC | 1 | 1280 | 1×1 | – | – | Swish | 1×1×1280 → 1000 |
Total Parameters: ~5.3 million (EfficientNet-B0)
- Introduced from SENet.
- Applies global average pooling followed by two FC layers (squeeze and excitation).
- Learns per-channel weights to emphasize informative features and suppress irrelevant ones.
[ f(x) = x \cdot \sigma(x) ]
- Smooth, non-monotonic function.
- Outperforms ReLU by improving gradient flow and feature expressiveness.
- The base network (B0) was discovered using AutoML-based NAS, optimizing for both accuracy and efficiency on the ImageNet dataset and mobile hardware constraints.
- Avoids overfitting or underfitting by proportionally increasing depth, width, and input size.
- Each larger variant (B1–B7) is systematically scaled using the compound coefficients.
| Model | Input Resolution | Depth Scale | Width Scale | Top-1 Accuracy (ImageNet) | Parameters (Millions) |
|---|---|---|---|---|---|
| EfficientNet-B0 | 224×224 | 1.0 | 1.0 | 77.1% | 5.3 |
| EfficientNet-B1 | 240×240 | 1.1 | 1.0 | 79.1% | 7.8 |
| EfficientNet-B2 | 260×260 | 1.2 | 1.1 | 80.1% | 9.2 |
| EfficientNet-B3 | 300×300 | 1.4 | 1.2 | 81.6% | 12 |
| EfficientNet-B4 | 380×380 | 1.8 | 1.4 | 83.0% | 19 |
| EfficientNet-B5 | 456×456 | 2.2 | 1.6 | 83.7% | 30 |
| EfficientNet-B6 | 528×528 | 2.6 | 1.8 | 84.0% | 43 |
| EfficientNet-B7 | 600×600 | 3.1 | 2.0 | 84.3% | 66 |
- High Accuracy with Low Computation: Outperforms models like ResNet-152 and Inception-v4 with up to 8× fewer parameters and 10× less computation.
- Scalable and Adaptable: The compound scaling method generalizes across hardware and dataset constraints.
- Energy and Latency Efficient: Ideal for deployment on mobile and edge devices.
- Strong Generalization: Transfers well to diverse tasks such as object detection (EfficientDet), segmentation, and NLP.
EfficientNet represents a major leap forward in CNN architecture design, combining:
- Automated architecture discovery (NAS),
- Balanced compound scaling,
- Lightweight yet expressive MBConv and SE blocks.
It sets a new paradigm for designing neural networks that are not just accurate but computationally optimal. Its influence continues in subsequent architectures such as EfficientNetV2 and EfficientDet, which extend these principles to even broader vision tasks.
In short:
EfficientNet redefined efficiency in deep CNN design by coupling NAS-discovered architecture with a mathematically principled scaling strategy — achieving state-of-the-art performance using fewer resources.





