
Multimodal Product Price Prediction (Amazon ML Challenge '25)

A product price prediction model for the Amazon ML Challenge. This solution implements a powerful ensemble strategy, averaging the predictions of two specialized models: (1) a text-only model using DistilBERT for deep language understanding, and (2) a multimodal model fusing an EfficientNetV2 CNN with DistilBERT to process both images and text. Built with PyTorch, Hugging Face Transformers, and timm.

📋 Table of Contents

• 📂 Project Structure
• ⚙️ Setup and Installation
• 🚀 How to Run the Full Pipeline
• 🤖 Model Architecture
• 🔧 Configuration
• 🎯 Results
📂 Project Structure

The project is organized to separate model definitions, training scripts, and core logic, ensuring clarity and maintainability.

/
├── input/                      # Challenge data (catalog text, images)
├── output/                     # Saved model weights
├── src/
│   ├── config.py               # Paths, hyperparameters, experiment switches
│   ├── dataset.py              # Dataset and DataLoader definitions
│   ├── model.py                # Multimodal CNN + DistilBERT model
│   ├── text_only.py            # Text-only DistilBERT model
│   ├── engine.py               # Training and evaluation loops
│   ├── train_multimodal.py     # Trains the multimodal model
│   ├── train_text_only.py      # Trains the text-only model
│   ├── inference_multimodal.py # Generates test_out2.csv
│   └── inference_text_only.py  # Generates test_out1.csv
│
├── ensemble.py                 # Averages the two prediction files
├── requirements.txt            # Python package dependencies
└── README.md

⚙️ Setup and Installation

Follow these steps to set up the environment and install all necessary dependencies.

1. Create a Conda Environment

It is highly recommended to use a dedicated Conda environment to manage dependencies.

# Create a new environment with Python 3.10
conda create --name mlc python=3.10

# Activate the new environment
conda activate mlc

2. Install PyTorch with GPU Support

This project requires a GPU-enabled version of PyTorch.

# Install PyTorch, torchvision, and torchaudio with CUDA 12.1
conda install pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia
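
As an optional sanity check, you can confirm that PyTorch sees your GPU before continuing:

# Should print True on a correctly configured GPU machine
python -c "import torch; print(torch.cuda.is_available())"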

3. Install Required Packages

Install all other necessary packages using the provided requirements.txt file.

# Navigate to the project's root directory
cd Multimodal-Price-Prediction-MLC25

# Install packages using pip
pip install -r requirements.txt

🚀 How to Run the Full Pipeline

The complete workflow involves training two separate models, generating predictions from each, and then ensembling the results to create the final submission file.

Step 1: Train the Text-Only Model

This model specializes in understanding the product's catalog_content.

# Navigate to the src directory
cd src

# Run the training script for the text-only model
python train_text_only.py

This process will train the model for the number of epochs specified in config.py and save the best-performing weights as distilbert_text_only_best.pth in the /output folder.
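
For reference, this architecture amounts to DistilBERT with a small regression head on top. A minimal sketch of the idea (class and attribute names here are illustrative, not necessarily those used in src/text_only.py):

import torch.nn as nn
from transformers import DistilBertModel

class TextOnlyPriceModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.bert = DistilBertModel.from_pretrained("distilbert-base-uncased")
        # Regress a single price value from the first-token representation
        self.head = nn.Sequential(nn.Dropout(0.1), nn.Linear(768, 1))

    def forward(self, input_ids, attention_mask):
        out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]  # [CLS]-position embedding
        return self.head(cls).squeeze(-1)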

Step 2: Train the Multimodal Model

This model processes both product images and text descriptions. First, you need to update config.py to select this experiment.

In src/config.py, make this change:

# == EXPERIMENT 1: Text-Only Model ==
# MODEL_NAME = "distilbert_text_only"
# IMG_MODEL_NAME = None

# == EXPERIMENT 2: Multimodal CNN + Text Model ==
MODEL_NAME = "multimodal_cnn_distilbert"
IMG_MODEL_NAME = "tf_efficientnetv2_s.in21k"
IMG_SIZE = 300

Then, run the training script:

# Make sure you are still in the src directory
python train_multimodal.py

This will save the best model weights as multimodal_cnn_distilbert_best.pth in the /output folder.

Step 3: Generate Predictions from Both Models

After both models are trained, generate a separate prediction file for each.

# Generate predictions from the text-only model
python inference_text_only.py
# This creates test_out1.csv in the root folder.

# Generate predictions from the multimodal model
python inference_multimodal.py
# This creates test_out2.csv in the root folder.
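
Both inference scripts follow the same pattern: restore the best checkpoint, run the test set through the model with gradients disabled, and write the predictions to a CSV. A rough sketch of that pattern, assuming the batches are dicts of tensors (the ID and price column names below are placeholders, not taken from the scripts):

import torch
import pandas as pd

def predict_to_csv(model, loader, ids, checkpoint, out_path):
    # Restore the best weights saved during training
    model.load_state_dict(torch.load(checkpoint, map_location="cpu"))
    model.eval()
    preds = []
    with torch.no_grad():
        for batch in loader:
            preds.extend(model(**batch).cpu().tolist())
    pd.DataFrame({"sample_id": ids, "price": preds}).to_csv(out_path, index=False)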

Step 4: Ensemble the Predictions

This is the final step. Run the ensemble.py script from the project's root directory to average the two prediction files.

# Go back to the root directory
cd ..

# Run the ensembling script
python ensemble.py

This will create your final submission file, test_out.csv, which contains the averaged prices.
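
The averaging itself is simple. A minimal sketch of what ensemble.py does, assuming both CSVs share the same row order and a price column (column names here are placeholders):

import pandas as pd

# Load the two per-model prediction files
p1 = pd.read_csv("test_out1.csv")
p2 = pd.read_csv("test_out2.csv")

# Average the predicted prices row by row and write the submission
final = p1.copy()
final["price"] = (p1["price"] + p2["price"]) / 2
final.to_csv("test_out.csv", index=False)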


🤖 Model Architecture

Our final solution is a Two-Model Ensemble.

1. Text-Only Model (The Specialist):

   • Architecture: a pre-trained DistilBERT model followed by a regression head.
   • Purpose: excels at extracting fine-grained details from product descriptions, such as brand names, specific features, materials, and technical specifications, that may not be apparent from the image alone.

2. Multimodal Model (The Generalist):

   • Architecture: a dual-stream model that processes image and text data in parallel (see the sketch after this list).
     • Image Stream: an EfficientNetV2 (a powerful CNN) extracts visual features.
     • Text Stream: DistilBERT extracts textual features.
   • Fusion: the features from both streams are projected into a common dimension, concatenated, and then passed to a final regression head to predict the price.
   • Purpose: learns the relationship between a product's appearance and its description, capturing contextual information that either modality might miss on its own.
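
To make the fusion concrete, here is a minimal sketch of the dual-stream design described above (layer sizes and names are illustrative; the actual model lives in src/model.py):

import timm
import torch
import torch.nn as nn
from transformers import DistilBertModel

class MultimodalPriceModel(nn.Module):
    def __init__(self, img_model="tf_efficientnetv2_s.in21k", dim=512):
        super().__init__()
        # Image stream: EfficientNetV2 backbone with its classifier removed
        self.cnn = timm.create_model(img_model, pretrained=True, num_classes=0)
        # Text stream: DistilBERT encoder
        self.bert = DistilBertModel.from_pretrained("distilbert-base-uncased")
        # Project both streams into a common dimension before fusing
        self.img_proj = nn.Linear(self.cnn.num_features, dim)
        self.txt_proj = nn.Linear(768, dim)
        self.head = nn.Sequential(nn.ReLU(), nn.Linear(2 * dim, 1))

    def forward(self, image, input_ids, attention_mask):
        img_feat = self.img_proj(self.cnn(image))
        txt = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        txt_feat = self.txt_proj(txt.last_hidden_state[:, 0])
        # Concatenate the projected features and regress the price
        fused = torch.cat([img_feat, txt_feat], dim=1)
        return self.head(fused).squeeze(-1)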

🔧 Configuration

The primary configuration file is src/config.py. It allows for easy management of model settings, paths, and hyperparameters. The CHOOSE YOUR EXPERIMENT section is designed to quickly switch between training the text-only and multimodal models by commenting and uncommenting the relevant blocks.
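
As an illustration, a config.py in this style might look like the following; the experiment blocks are taken from the steps above, but every other value is a placeholder, not the repository's actual setting:

# == PATHS (illustrative) ==
INPUT_DIR = "../input"
OUTPUT_DIR = "../output"

# == CHOOSE YOUR EXPERIMENT ==
# == EXPERIMENT 1: Text-Only Model ==
# MODEL_NAME = "distilbert_text_only"
# IMG_MODEL_NAME = None

# == EXPERIMENT 2: Multimodal CNN + Text Model ==
MODEL_NAME = "multimodal_cnn_distilbert"
IMG_MODEL_NAME = "tf_efficientnetv2_s.in21k"
IMG_SIZE = 300

# == HYPERPARAMETERS (illustrative values) ==
EPOCHS = 5
BATCH_SIZE = 32
LEARNING_RATE = 2e-5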

🎯 Results

The final ensembled submission, test_out.csv, achieved a SMAPE of 44.901.
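
For context, SMAPE (Symmetric Mean Absolute Percentage Error) scores predictions on a 0–200 scale where lower is better. A standard formulation (the challenge's exact variant may differ slightly):

import numpy as np

def smape(y_true, y_pred):
    # Symmetric mean absolute percentage error, in percent
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    denom = (np.abs(y_true) + np.abs(y_pred)) / 2
    return 100 * np.mean(np.abs(y_pred - y_true) / denom)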


Keep Predicting and Contributing 🛩️!
