This project implements a person tracking system that combines Faster R-CNN for object detection with a Siamese network for person re-identification (ReID). It tracks pedestrians across the frames of a video sequence and is designed to hold identities through occlusions, lighting changes, and similar-looking individuals. The pipeline relies on fine-tuned deep learning models, data augmentation, and GPU-accelerated training.
- Faster R-CNN with ResNet50 Backbone: Fine-tuned for pedestrian detection using the MOT16-02 dataset, achieving a training loss of ~1.0065 after 10 epochs.
- Siamese Network for Person Re-Identification: Learns feature embeddings to uniquely identify individuals across frames using triplet loss and fine-tuning on Market1501.
- Bounding Box Tracking with Motion Prediction: Predicts future locations based on prior bounding box velocities, maintaining continuity even under brief occlusions.
- Similarity-Based Data Association: Combines IoU and embedding similarity scores for robust identity matching between frames.
- GPU-Accelerated Training: Fully utilizes CUDA with mixed precision and gradient accumulation for optimized performance.
- Augmented Datasets for Robustness: Includes color jitter, Gaussian blur, and brightness variation to enhance generalization under varied lighting and visual conditions.
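
The motion-prediction feature in the list above can be illustrated with a constant-velocity extrapolation. This is a minimal sketch, not the code in `inference_test.py`: the function names `box_velocity` and `predict_box`, the `(x1, y1, x2, y2)` box layout, and the choice to estimate velocity from only the last two boxes are all assumptions made for illustration.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # assumed layout: (x1, y1, x2, y2) in pixels


def box_velocity(prev: Box, curr: Box) -> Box:
    """Per-coordinate displacement between two consecutive boxes."""
    return tuple(c - p for p, c in zip(prev, curr))


def predict_box(history: List[Box], steps: int = 1) -> Box:
    """Extrapolate the latest box `steps` frames ahead at constant velocity."""
    if not history:
        raise ValueError("history must contain at least one box")
    last = history[-1]
    if len(history) < 2:
        # No velocity estimate yet: assume the box stays where it is.
        return last
    velocity = box_velocity(history[-2], last)
    return tuple(x + steps * v for x, v in zip(last, velocity))


# Hypothetical track: the box moved 2 px right and 1 px down between frames.
track = [(0.0, 0.0, 10.0, 20.0), (2.0, 1.0, 12.0, 21.0)]
next_box = predict_box(track)            # one frame ahead
occluded_box = predict_box(track, 3)     # three frames ahead, e.g. during an occlusion
```

Passing a larger `steps` value is one way to keep a track alive while its target is briefly hidden; the prediction drifts if the person changes speed or direction.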
| Category | Technologies |
|---|---|
| Programming Language | Python 3.10+ |
| Deep Learning Framework | PyTorch, Torchvision |
| Computer Vision | OpenCV, PIL |
| Data Handling & Visualization | Pandas, NumPy, scikit-image, Matplotlib |
| Datasets | MOT16, Market1501 |
| Hardware | CUDA-enabled GPU |
Tracking_video_MOTS.mp4
📂 Project Root
├── Faster_RCNN_GPU.py # Training script for Faster R-CNN on MOT16
├── Siamese_network.py # Core Siamese model (Triplet-based ReID)
├── siamese_network_final_v2_prerit.py # Advanced Siamese training with augmentation
├── inference_test.py # Inference and tracking pipeline
├── Project Report.pdf # System design, methodology, and results
└── outputs/
    ├── faster_rcnn_mots16.pth # Trained detection model
    ├── finetuned_siamese_model_test.pth # Trained re-ID model
    └── tracked_output_test.mp4 # Output tracking video
- Detection (Faster R-CNN) — Detects pedestrians in each frame.
- Feature Embedding (Siamese Network) — Extracts a 256-dimensional feature vector per detection.
- Similarity Calculation — Compares embeddings between frames to match identities.
- Bounding Box Prediction (Tracker) — Predicts next-frame positions using velocity estimation.
- Data Association — Combines IoU and embedding similarity to maintain consistent IDs.
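
The similarity and data association steps can be sketched as a weighted score over IoU and embedding cosine similarity, followed by greedy one-to-one matching. This is an illustrative sketch only: the function `associate`, the weight `iou_weight`, the cutoff `min_score`, and the greedy strategy are assumptions, and the project's actual weights and thresholds are not stated here.

```python
import math
from typing import Dict, List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # assumed layout: (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm > 0 else 0.0


def associate(
    tracks: Dict[int, Tuple[Box, Sequence[float]]],
    detections: List[Tuple[Box, Sequence[float]]],
    iou_weight: float = 0.5,   # hypothetical weighting, not taken from the project
    min_score: float = 0.3,    # hypothetical cutoff, not taken from the project
) -> Dict[int, int]:
    """Return {detection index: track id} using greedy best-score-first matching."""
    candidates = []
    for track_id, (t_box, t_emb) in tracks.items():
        for det_idx, (d_box, d_emb) in enumerate(detections):
            score = (iou_weight * iou(t_box, d_box)
                     + (1.0 - iou_weight) * cosine_similarity(t_emb, d_emb))
            if score >= min_score:
                candidates.append((score, track_id, det_idx))
    candidates.sort(key=lambda c: c[0], reverse=True)

    matches: Dict[int, int] = {}
    used_tracks = set()
    for _, track_id, det_idx in candidates:
        if track_id in used_tracks or det_idx in matches:
            continue  # each track and each detection is matched at most once
        matches[det_idx] = track_id
        used_tracks.add(track_id)
    return matches


# Toy 2-D embeddings for readability; the pipeline uses 256-dimensional vectors.
tracks = {1: ((0, 0, 10, 10), [1.0, 0.0]), 2: ((50, 50, 60, 60), [0.0, 1.0])}
detections = [((51, 50, 61, 60), [0.1, 0.9]), ((1, 0, 11, 10), [0.9, 0.1])]
matches = associate(tracks, detections)
```

Detections left out of `matches` would typically start new tracks. Greedy matching is simple but not globally optimal; an assignment solver such as `scipy.optimize.linear_sum_assignment` is a common alternative.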
Ensure you have the following installed:
- Python ≥ 3.10
- CUDA-compatible GPU
- PyTorch ≥ 2.0
- Torchvision ≥ 0.15
- OpenCV ≥ 4.5
# Clone the repository
git clone https://github.com/YOUR_USERNAME/mots-person-tracking.git
cd mots-person-tracking
# (Optional) Create a virtual environment
python -m venv venv
source venv/bin/activate # or venv\Scripts\activate on Windows
# Install dependencies
pip install torch torchvision opencv-python pandas scikit-image tqdm matplotlib

Create a .env file in the project root to specify dataset and output paths:
# Dataset and model paths
MOT16_TRAIN_DIR=D:\Path\To\MOT16\train\MOT16-02
MARKET1501_DIR=D:\Path\To\Market1501
OUTPUT_VIDEO=tracked_output_test.mp4

python Faster_RCNN_GPU.py

This trains a Faster R-CNN (ResNet50-FPN) on the MOT16-02 dataset and saves the model as faster_rcnn_mots16.pth.
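
Training a detector on MOT16-02 involves turning the sequence's `gt/gt.txt` annotations into per-frame boxes. The sketch below shows one way to do that, assuming the standard MOT16 ground-truth columns (frame, id, left, top, width, height, consider flag, class, visibility) and keeping only entries flagged for evaluation with class 1 (pedestrian). The function name `load_mot16_boxes` and the `min_visibility` parameter are hypothetical, and `Faster_RCNN_GPU.py` may filter differently.

```python
import csv
import io
from collections import defaultdict
from typing import Dict, List, TextIO, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def load_mot16_boxes(gt_file: TextIO, min_visibility: float = 0.0) -> Dict[int, List[Box]]:
    """Group pedestrian boxes by frame number, converted to corner format."""
    boxes: Dict[int, List[Box]] = defaultdict(list)
    for row in csv.reader(gt_file):
        if len(row) < 9:
            continue  # skip blank or malformed lines
        frame = int(row[0])
        left, top, width, height = map(float, row[2:6])
        consider = int(float(row[6]))   # 1 = use this entry, 0 = ignore
        class_id = int(float(row[7]))   # 1 = pedestrian in MOT16
        visibility = float(row[8])
        if consider != 1 or class_id != 1 or visibility < min_visibility:
            continue
        # torchvision detection models expect [x1, y1, x2, y2] boxes.
        boxes[frame].append((left, top, left + width, top + height))
    return dict(boxes)


# Made-up rows in the MOT16 layout, used here instead of a real gt.txt file.
sample = io.StringIO(
    "1,1,100,200,50,120,1,1,0.9\n"
    "1,2,300,210,40,100,0,1,1.0\n"   # ignored: consider flag is 0
    "2,1,102,201,50,120,1,1,0.8\n"
    "2,3,10,10,30,30,1,3,1.0\n"      # ignored: class is not pedestrian
)
frame_boxes = load_mot16_boxes(sample)
```

With a real sequence, the same function can be called on an opened `gt.txt` file handle in place of the `io.StringIO` sample.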
python siamese_network_final_v2_prerit.py

This trains and fine-tunes the Siamese network using triplet loss across multiple MOT16 sequences and Market1501.
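
For readers unfamiliar with triplet loss, the sketch below computes it for a single (anchor, positive, negative) triple in plain Python. The loss is zero once the negative is farther from the anchor than the positive by at least the margin. The margin of 1.0 and the Euclidean distance are assumptions for illustration; the training script's actual settings are not stated here. In PyTorch, `torch.nn.TripletMarginLoss` provides a batched version of the same idea.

```python
import math
from typing import Sequence


def euclidean(u: Sequence[float], v: Sequence[float]) -> float:
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


def triplet_loss(
    anchor: Sequence[float],
    positive: Sequence[float],
    negative: Sequence[float],
    margin: float = 1.0,  # assumed value for illustration
) -> float:
    """max(0, d(anchor, positive) - d(anchor, negative) + margin)."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


# Toy 2-D embeddings: same person (positive) vs. a different person (negative).
anchor, positive = [0.0, 0.0], [0.0, 1.0]
easy_loss = triplet_loss(anchor, positive, [3.0, 4.0])  # negative is far away
hard_loss = triplet_loss(anchor, positive, [1.0, 0.0])  # negative is as close as the positive
```

Here `easy_loss` is zero because the negative already sits well outside the margin, while `hard_loss` is positive and would push the network to separate the two identities.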
python inference_test.py

The script will:
- Detect people per frame
- Match them using re-ID embeddings
- Track them across frames
- Export an annotated output video (tracked_output_test.mp4)
| Model | Dataset | Epochs | Train Loss | Validation Loss | Notes |
|---|---|---|---|---|---|
| Faster R-CNN | MOT16-02 | 10 | 1.0065 | — | Fine-tuned from pretrained ResNet50 |
| Siamese Network | Market1501 + MOT16 | 10 | 0.1401 | 0.2154 | Fine-tuned with augmentation |
- Train both models on the complete MOT16 dataset for broader scene understanding.
- Optimize similarity thresholds for dynamic environments.
- Integrate Kalman filters or DeepSORT for enhanced temporal consistency.
- Explore multi-camera tracking and cross-view re-identification.