A product price-prediction model for the Amazon ML Challenge. The solution implements an ensemble strategy, averaging the predictions of two specialized models: (1) a text-only model using DistilBERT for deep language understanding, and (2) a multimodal model fusing an EfficientNetV2 CNN with DistilBERT to process both images and text. Built with PyTorch, Hugging Face Transformers, and timm.
- Project Structure
- Setup and Installation
- How to Run the Full Pipeline
- Model Architecture
- Configuration
- Results
## Project Structure

The project is organized to separate model definitions, training scripts, and core logic, ensuring clarity and maintainability.
```
/
├── input/                        # competition data (text and product images)
├── output/                       # best model weights are saved here
├── src/
│   ├── config.py                 # paths, hyperparameters, experiment switch
│   ├── dataset.py
│   ├── model.py
│   ├── text_only.py
│   ├── engine.py
│   ├── train_multimodal.py
│   ├── train_text_only.py
│   ├── inference_multimodal.py
│   └── inference_text_only.py
│
├── ensemble.py                   # averages the two prediction files
├── requirements.txt
└── README.md
```
## Setup and Installation

Follow these steps to set up the environment and install all necessary dependencies. It is highly recommended to use a dedicated Conda environment to manage them.
```bash
# Create a new environment with Python 3.10
conda create --name mlc python=3.10

# Activate the new environment
conda activate mlc
```

This project requires a GPU-enabled version of PyTorch.
```bash
# Install PyTorch, torchvision, and torchaudio with CUDA 12.1
conda install pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia
```

Install all other necessary packages using the provided requirements.txt file.
```bash
# Navigate to the project's root directory
cd path/to/project-root

# Install packages using pip
pip install -r requirements.txt
```

## How to Run the Full Pipeline

The complete workflow involves training two separate models, generating predictions from each, and then ensembling the results to create the final submission file.
### 1. Train the Text-Only Model

This model specializes in understanding the product's `catalog_content` field.
```bash
# Navigate to the src directory
cd src

# Run the training script for the text-only model
python train_text_only.py
```

This process trains the model for the number of epochs specified in `config.py` and saves the best-performing weights as `distilbert_text_only_best.pth` in the `/output` folder.
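Under the hood, the script follows the usual keep-the-best-checkpoint pattern: evaluate after each epoch and overwrite the saved weights whenever validation improves. A minimal sketch, assuming hypothetical `train_one_epoch` and `evaluate` helpers (the real ones live in `engine.py`):

```python
# Sketch of the best-checkpoint loop; helper names are hypothetical
# stand-ins for whatever engine.py actually exposes.
import torch

best_loss = float("inf")
for epoch in range(config.EPOCHS):                # EPOCHS comes from config.py
    train_one_epoch(model, train_loader, optimizer, device)
    val_loss = evaluate(model, valid_loader, device)
    if val_loss < best_loss:                      # new best epoch
        best_loss = val_loss                      # overwrite the saved weights
        torch.save(model.state_dict(),
                   "../output/distilbert_text_only_best.pth")
```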
### 2. Train the Multimodal Model

This model processes both product images and text descriptions. First, you need to update `config.py` to select this experiment. In `src/config.py`, make this change:
```python
# == EXPERIMENT 1: Text-Only Model ==
# MODEL_NAME = "distilbert_text_only"
# IMG_MODEL_NAME = None

# == EXPERIMENT 2: Multimodal CNN + Text Model ==
MODEL_NAME = "multimodal_cnn_distilbert"
IMG_MODEL_NAME = "tf_efficientnetv2_s.in21k"
IMG_SIZE = 300
```

Then, run the training script:
```bash
# Make sure you are still in the src directory
python train_multimodal.py
```

This will save the best model weights as `multimodal_cnn_distilbert_best.pth` in the `/output` folder.
### 3. Generate Predictions

After both models are trained, generate a separate prediction file for each.
```bash
# Generate predictions from the text-only model
python inference_text_only.py
# This creates test_out1.csv in the root folder.

# Generate predictions from the multimodal model
python inference_multimodal.py
# This creates test_out2.csv in the root folder.
```

### 4. Ensemble the Predictions

This is the final step. Run the `ensemble.py` script from the project's root directory to average the two prediction files.
```bash
# Go back to the root directory
cd ..

# Run the ensembling script
python ensemble.py
```

This will create your final submission file, `test_out.csv`, which contains the averaged prices.
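For reference, the averaging itself is only a few lines of pandas. A minimal sketch of what `ensemble.py` does, assuming the submission files share a `price` column (the actual column names may differ):

```python
# Sketch of ensemble.py: mean of the two models' predicted prices.
# The `price` column name is an assumption about the submission format.
import pandas as pd

p1 = pd.read_csv("test_out1.csv")                 # text-only predictions
p2 = pd.read_csv("test_out2.csv")                 # multimodal predictions

out = p1.copy()
out["price"] = (p1["price"] + p2["price"]) / 2.0  # simple mean ensemble
out.to_csv("test_out.csv", index=False)           # final submission file
```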
## Model Architecture

Our final solution is a two-model ensemble.
- Text-Only Model (The Specialist):
  - Architecture: Uses a pre-trained DistilBERT model followed by a regression head (see the first sketch after this list).
  - Purpose: Excels at extracting fine-grained details from product descriptions, such as brand names, specific features, materials, and technical specifications that may not be apparent from the image alone.
- Multimodal Model (The Generalist):
  - Architecture: A dual-stream model that processes image and text data in parallel (see the second sketch after this list).
  - Image Stream: An EfficientNetV2 (a powerful CNN) extracts visual features.
  - Text Stream: DistilBERT extracts textual features.
  - Fusion: The features from both streams are projected into a common dimension, concatenated, and then passed to a final regression head to predict the price.
  - Purpose: Learns the relationship between a product's appearance and its description, capturing contextual information that either modality might miss on its own.
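The two architectures sketched below are illustrative, not verbatim copies of `text_only.py` and `model.py`; the class names, the CLS-token pooling, and the projection width are assumptions. First, the text-only specialist:

```python
# Sketch of the text-only model: DistilBERT encoder + linear regression head.
import torch.nn as nn
from transformers import DistilBertModel

class TextOnlyPriceModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.bert = DistilBertModel.from_pretrained("distilbert-base-uncased")
        self.head = nn.Linear(self.bert.config.dim, 1)   # dim == 768

    def forward(self, input_ids, attention_mask):
        hidden = self.bert(input_ids=input_ids,
                           attention_mask=attention_mask).last_hidden_state
        cls = hidden[:, 0]                               # first-token embedding
        return self.head(cls).squeeze(-1)                # predicted price
```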
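And the dual-stream generalist, with the project-then-concatenate fusion described above (the common width of 256 is a placeholder):

```python
# Sketch of the multimodal model: EfficientNetV2 + DistilBERT, fused by
# projecting both streams to a common width and concatenating.
import timm
import torch
import torch.nn as nn
from transformers import DistilBertModel

class MultimodalPriceModel(nn.Module):
    def __init__(self, common_dim=256):
        super().__init__()
        # Image stream: EfficientNetV2 backbone, pooled features only
        self.cnn = timm.create_model("tf_efficientnetv2_s.in21k",
                                     pretrained=True, num_classes=0)
        # Text stream: DistilBERT encoder
        self.bert = DistilBertModel.from_pretrained("distilbert-base-uncased")
        self.img_proj = nn.Linear(self.cnn.num_features, common_dim)
        self.txt_proj = nn.Linear(self.bert.config.dim, common_dim)
        self.head = nn.Linear(2 * common_dim, 1)         # fused regression head

    def forward(self, image, input_ids, attention_mask):
        img = self.img_proj(self.cnn(image))             # visual features
        txt_hidden = self.bert(input_ids=input_ids,
                               attention_mask=attention_mask).last_hidden_state
        txt = self.txt_proj(txt_hidden[:, 0])            # textual features
        fused = torch.cat([img, txt], dim=1)             # concatenate streams
        return self.head(fused).squeeze(-1)              # predicted price
```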
## Configuration

The primary configuration file is `src/config.py`. It allows for easy management of model settings, paths, and hyperparameters. The `CHOOSE YOUR EXPERIMENT` section is designed to quickly switch between training the text-only and multimodal models by commenting and uncommenting the relevant blocks.
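For orientation, the file is laid out roughly as follows; aside from the experiment block shown earlier, every value here is a hypothetical placeholder rather than the repository's actual setting:

```python
# Illustrative layout of src/config.py (values are placeholders).
INPUT_DIR = "../input"
OUTPUT_DIR = "../output"

# == CHOOSE YOUR EXPERIMENT ==
MODEL_NAME = "multimodal_cnn_distilbert"
IMG_MODEL_NAME = "tf_efficientnetv2_s.in21k"
IMG_SIZE = 300

EPOCHS = 5            # hypothetical value
BATCH_SIZE = 32       # hypothetical value
LR = 2e-5             # hypothetical value
```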
## Results

The final ensembled submission, `test_out.csv`, achieved a SMAPE of 44.901.
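For context, SMAPE (symmetric mean absolute percentage error) is commonly computed as below; whether the leaderboard uses this exact variant is an assumption:

```python
# Common SMAPE definition: symmetric percentage error on a 0-200 scale.
import numpy as np

def smape(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    denom = (np.abs(y_true) + np.abs(y_pred)) / 2.0
    return 100.0 * np.mean(np.abs(y_pred - y_true) / denom)
```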
