Skip to content

Latest commit

 

History

History
266 lines (181 loc) · 19.7 KB

File metadata and controls

266 lines (181 loc) · 19.7 KB

Semantic Segmentation vs. Object Detection


1. Core Definition and Purpose

Aspect Semantic Segmentation Object Detection
Goal Assign a class label to every pixel in an image. The output is a pixel-wise classification map that segments the image into meaningful regions. Identify what objects are present and where they are by drawing bounding boxes around them. Each detected box has a class label and confidence score.
Granularity Very fine-grained, down to the level of each pixel. Coarse, as it predicts object-level boxes, not pixel-accurate boundaries.
Output Type Segmentation mask (image-sized map, where each pixel = class ID). A set of bounding boxes with associated labels and confidence scores.
Objective Example “Which pixels belong to a cat?” “Where is the cat located, and what is its bounding box?”

2. Level of Understanding

Type Description
Semantic Segmentation Provides dense predictions, where every pixel is classified. It focuses on understanding the structure and meaning of every part of an image.
Object Detection Provides sparse predictions, locating objects using bounding boxes. It focuses on instance-level recognition rather than full-scene understanding.

3. Input and Output Representation

Stage Semantic Segmentation Object Detection
Input RGB image (HxWx3) RGB image (HxWx3)
Output Format 2D tensor (HxW) where each value represents a class ID. For instance, class 0 = background, 1 = road, 2 = car, etc. A variable-length list of detections: [(class, confidence, x, y, w, h)]
Visualization Typically shown as a color-coded mask overlaid on the image. Bounding boxes drawn around detected objects.

4. Architecture and Network Design

(a) Semantic Segmentation Architecture

  • Fully Convolutional Networks (FCNs) are the base architecture.

  • No fully connected layers — only convolutional, pooling, and upsampling.

  • Typical networks:

    • FCN (Fully Convolutional Network)
    • U-Net (encoder-decoder structure with skip connections)
    • SegNet, DeepLab (V1–V3+), PSPNet, HRNet
  • Architecture flow:

    1. Encoder: Extracts hierarchical features (like CNN backbone).
    2. Decoder: Upsamples feature maps to recover spatial resolution.
    3. Skip connections: Combine low-level and high-level features for sharp boundaries.

(b) Object Detection Architecture

  • Two-stage detectors: R-CNN → Fast R-CNN → Faster R-CNN → Mask R-CNN.

    • Region Proposal Network (RPN) proposes regions → classifier refines and labels them.
  • One-stage detectors: YOLO, SSD, RetinaNet, EfficientDet, FCOS.

    • Predict bounding boxes and class scores directly from feature maps.
  • Architecture flow:

    1. Backbone: CNN extracts features (e.g., ResNet, EfficientNet).
    2. Neck: Feature Pyramid Network (FPN) for multi-scale detection.
    3. Head: Dense or anchor-based predictions for boxes and class probabilities.

5. Mathematical Formulation

Concept Semantic Segmentation Object Detection
Input Image tensor ( X \in \mathbb{R}^{H\times W\times C} ) Image tensor ( X \in \mathbb{R}^{H\times W\times C} )
Output Pixel-wise probability map ( P \in \mathbb{R}^{H\times W\times K} ), where K = #classes Set of boxes ( B_i = (x_i, y_i, w_i, h_i, c_i, s_i) )
Loss Function Cross-Entropy Loss, Dice Loss, IoU Loss, Lovasz-Softmax Classification Loss + Bounding Box Regression Loss (Smooth L1, GIoU, DIoU, CIoU)
Training Objective Minimize difference between predicted mask and ground-truth mask. Maximize overlap (IoU) and correct classification per detected object.

6. Loss Functions

Semantic Segmentation Object Detection
Pixel-wise Cross-Entropy Loss: Measures classification error per pixel. Cross-Entropy/Focal Loss: For object classification.
Dice Loss / IoU Loss: Focuses on overlapping region accuracy. Smooth L1 / IoU / GIoU / DIoU / CIoU Loss: For bounding box regression.
Weighted Cross-Entropy: Compensates class imbalance when background dominates. Objectness / Centerness Loss: Helps model confidence for anchor or center predictions.

7. Output Density and Complexity

  • Semantic Segmentation:

    • Dense prediction: Every pixel has a label.
    • Complexity grows linearly with image size.
    • Needs high memory due to large feature maps and upsampling.
  • Object Detection:

    • Sparse prediction: Only a few objects per image.
    • Computational cost depends on number of anchors or proposals, not image pixels.
    • Easier to deploy for large images since it doesn’t require pixel-wise output.

8. Performance Metrics

Metric Semantic Segmentation Object Detection
Pixel Accuracy (PA) % of correctly classified pixels. Not applicable (box-based).
Mean Pixel Accuracy (mPA) Average accuracy over all classes.
Intersection over Union (IoU) Pixel-level overlap per class. Box-level overlap per object.
Mean IoU (mIoU) Average IoU across all classes (main metric). Average Precision (AP), mean Average Precision (mAP).
Dice Coefficient / F1 Score Measures overlap between masks. Recall, Precision, F1, mAP@[.5:.95].

9. Post-Processing

Semantic Segmentation Object Detection
- Uses Conditional Random Fields (CRFs) or Morphological Operations to refine mask edges.
- Sometimes employs Thresholding or Smoothing Filters to clean predictions.
- Uses Non-Maximum Suppression (NMS) to remove duplicate overlapping boxes.
- May apply Soft-NMS or Weighted Box Fusion for better handling of crowded scenes.

10. Data Requirements and Annotation

Aspect Semantic Segmentation Object Detection
Annotation Type Pixel-level mask (each pixel labeled). Bounding boxes around objects.
Annotation Effort Extremely high — each pixel needs labeling. Moderate — only draw rectangles.
Annotation Examples Cityscapes, ADE20K, Pascal VOC (segmentation), COCO-Stuff. COCO, Pascal VOC, Open Images, KITTI.
Dataset Size Impact Smaller datasets can work if classes are limited due to dense labeling. Larger datasets usually required to capture scale and shape variation.

11. Applications

Semantic Segmentation Object Detection
Autonomous driving (road, lane, pedestrian segmentation) Pedestrian detection, vehicle detection, obstacle localization
Medical imaging (tumor boundary detection, organ segmentation) Face detection, human pose estimation, retail product counting
Satellite imagery (land cover classification) Security systems, traffic surveillance
Agricultural vision (crop/weed segmentation) Animal detection, defect detection in manufacturing

12. Computation and Memory

  • Semantic Segmentation

    • Requires large memory because predictions are at full spatial resolution.
    • Training is slower due to upsampling operations and high-resolution features.
    • Batch size often smaller (limited by GPU memory).
  • Object Detection

    • Computationally lighter than segmentation since outputs are sparse.
    • One-stage detectors (YOLO, SSD) achieve real-time performance.
    • Two-stage methods (Faster R-CNN) are heavier but more accurate.

13. Model Complexity and Training

Concept Semantic Segmentation Object Detection
Backbone Same as CNNs (ResNet, VGG, EfficientNet). Same.
Head Design Decoder with upsampling and skip connections. Detection head with classification and regression layers.
Optimization Challenges Class imbalance (background vs object). Requires careful loss weighting. Class imbalance (many negative anchors). Solved by focal loss or sampling.
Training Strategy Often uses patch-based training for high-res images. Typically uses entire images or multi-scale inputs.
Inference Speed Slower, especially for large images. Real-time possible with modern YOLO or SSD models.

14. Limitations

Semantic Segmentation Object Detection
Cannot differentiate between multiple instances of the same class (e.g., two people → same mask). Cannot precisely delineate object boundaries (only bounding boxes).
Pixel labeling is memory-intensive and time-consuming to train. Struggles with small or overlapping objects.
Sensitive to resolution changes and occlusions. Sensitive to class imbalance and anchor design.

15. Extension Tasks

Base Task Extension / Variant Purpose
Semantic Segmentation Instance Segmentation Adds instance-level separation. Each object instance gets a unique mask. (e.g., Mask R-CNN, SOLO, YOLACT).
Object Detection Panoptic Segmentation Combines semantic and instance segmentation: both stuff (background) and things (objects).

Thus, Panoptic Segmentation can be seen as a fusion of both: it detects objects and segments them at the pixel level, unifying the strengths of both tasks.


16. Evaluation Example

Example Semantic Segmentation Object Detection
Ground Truth Pixel mask of cat Bounding box of cat
Model Output Predicted cat mask Predicted bounding box + confidence
Metric IoU per class, mIoU overall IoU per box, AP/mAP overall
Threshold IoU > 0.5 = correct IoU > 0.5 = true positive
Interpretation How much area of the cat mask overlaps How well the box localizes the cat

17. Modern Trends and Hybrid Approaches

A. Instance Segmentation

  • Merges object detection and semantic segmentation.
  • Detects individual instances with pixel-level accuracy.
  • Example models: Mask R-CNN, SOLOv2, YOLACT, HTC.
  • Output: Bounding box + binary mask for each object.

B. Panoptic Segmentation

  • A unified framework proposed by Google (2018).
  • Handles both stuff (e.g., sky, road) and things (e.g., cars, people).
  • Combines the pixel-level accuracy of segmentation with instance awareness of detection.
  • Example: Panoptic FPN, UPSNet, DeepLab2.

C. Vision Transformers (ViTs) in both

  • DETR, Mask2Former, Segment Anything (SAM) unify object detection and segmentation under a transformer-based formulation using query embeddings.

18. Conceptual Summary Table

Feature Semantic Segmentation Object Detection
Output Type Per-pixel label map Bounding boxes + class labels
Resolution Same as input Lower (based on boxes)
Instance Separation No Yes
Granularity Pixel level Object level
Architecture Encoder-decoder CNNs CNN with region or grid-based prediction
Post-processing CRF, Morphology NMS, Soft-NMS
Evaluation Metric mIoU mAP
Major Models FCN, U-Net, DeepLab Faster R-CNN, SSD, YOLO, RetinaNet
Annotation Type Pixel mask Bounding box
Computational Cost High Moderate
Best For Scene understanding, medical images, road segmentation Object localization, counting, recognition

19. Final Summary and Insights

  • Semantic Segmentation: Used when precise spatial understanding is required. Every pixel must be correctly labeled (e.g., autonomous driving, medical imaging). High computation, requires detailed data, but gives dense understanding of the scene.

  • Object Detection: Used when we only need to know what and where — not exact pixel boundaries. Excellent for real-time applications like surveillance, tracking, or robotics. Faster and less data-hungry than segmentation.

  • In short:

    • Object Detection = “Find and label objects.”
    • Semantic Segmentation = “Label every pixel in the image.”

Both are complementary: detection gives object-level awareness; segmentation gives spatial awareness. Modern systems increasingly combine both for complete scene understanding — a crucial step toward Autonomous Vision and General AI Perception.