A product price-prediction model for the Amazon ML Challenge. The solution implements an ensemble strategy, averaging the predictions of two specialized models: (1) a text-only model using DistilBERT for deep language understanding, and (2) a multimodal model fusing an EfficientNetV2 CNN with DistilBERT to process both images and text. Built with PyTorch, Hugging Face Transformers, and timm.
- Project Structure
- Setup and Installation
- How to Run the Full Pipeline
- Model Architecture
- Configuration
- Results
## Project Structure

The project is organized to separate model definitions, training scripts, and core logic, ensuring clarity and maintainability.
```
/
├── input/                        # competition data (text and product images)
├── output/                       # best model weights are saved here
├── src/
│   ├── config.py                 # paths, hyperparameters, experiment switch
│   ├── dataset.py
│   ├── model.py
│   ├── text_only.py
│   ├── engine.py
│   ├── train_multimodal.py
│   ├── train_text_only.py
│   ├── inference_multimodal.py
│   └── inference_text_only.py
│
├── ensemble.py                   # averages the two prediction files
├── requirements.txt
└── README.md
```
## Setup and Installation

Follow these steps to set up the environment and install all necessary dependencies. It is highly recommended to use a dedicated Conda environment to manage them.
```bash
# Create a new environment with Python 3.10
conda create --name mlc python=3.10

# Activate the new environment
conda activate mlc
```

This project requires a GPU-enabled version of PyTorch.
```bash
# Install PyTorch, torchvision, and torchaudio with CUDA 12.1
conda install pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia
```

Install all other necessary packages using the provided requirements.txt file.
```bash
# Navigate to the project's root directory
cd path/to/project-root

# Install packages using pip
pip install -r requirements.txt
```

## How to Run the Full Pipeline

The complete workflow involves training two separate models, generating predictions from each, and then ensembling the results to create the final submission file.
### 1. Train the Text-Only Model

This model specializes in understanding the product's `catalog_content` field.
```bash
# Navigate to the src directory
cd src

# Run the training script for the text-only model
python train_text_only.py
```

This process trains the model for the number of epochs specified in `config.py` and saves the best-performing weights as `distilbert_text_only_best.pth` in the `/output` folder.
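Under the hood, the script follows the usual keep-the-best-checkpoint pattern: evaluate after each epoch and overwrite the saved weights whenever validation improves. A minimal sketch, assuming hypothetical `train_one_epoch` and `evaluate` helpers (the real ones live in `engine.py`):

```python
# Sketch of the best-checkpoint loop; helper names are hypothetical
# stand-ins for whatever engine.py actually exposes.
import torch

best_loss = float("inf")
for epoch in range(config.EPOCHS):                # EPOCHS comes from config.py
    train_one_epoch(model, train_loader, optimizer, device)
    val_loss = evaluate(model, valid_loader, device)
    if val_loss < best_loss:                      # new best epoch
        best_loss = val_loss                      # overwrite the saved weights
        torch.save(model.state_dict(),
                   "../output/distilbert_text_only_best.pth")
```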
### 2. Train the Multimodal Model

This model processes both product images and text descriptions. First, you need to update `config.py` to select this experiment. In `src/config.py`, make this change:
```python
# == EXPERIMENT 1: Text-Only Model ==
# MODEL_NAME = "distilbert_text_only"
# IMG_MODEL_NAME = None

# == EXPERIMENT 2: Multimodal CNN + Text Model ==
MODEL_NAME = "multimodal_cnn_distilbert"
IMG_MODEL_NAME = "tf_efficientnetv2_s.in21k"
IMG_SIZE = 300
```

Then, run the training script:
```bash
# Make sure you are still in the src directory
python train_multimodal.py
```

This will save the best model weights as `multimodal_cnn_distilbert_best.pth` in the `/output` folder.
### 3. Generate Predictions

After both models are trained, generate a separate prediction file for each.
```bash
# Generate predictions from the text-only model
python inference_text_only.py
# This creates test_out1.csv in the root folder.

# Generate predictions from the multimodal model
python inference_multimodal.py
# This creates test_out2.csv in the root folder.
```

### 4. Ensemble the Predictions

This is the final step. Run the `ensemble.py` script from the project's root directory to average the two prediction files.
```bash
# Go back to the root directory
cd ..

# Run the ensembling script
python ensemble.py
```

This will create your final submission file, `test_out.csv`, which contains the averaged prices.
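For reference, the averaging itself is only a few lines of pandas. A minimal sketch of what `ensemble.py` does, assuming the submission files share a `price` column (the actual column names may differ):

```python
# Sketch of ensemble.py: mean of the two models' predicted prices.
# The `price` column name is an assumption about the submission format.
import pandas as pd

p1 = pd.read_csv("test_out1.csv")                 # text-only predictions
p2 = pd.read_csv("test_out2.csv")                 # multimodal predictions

out = p1.copy()
out["price"] = (p1["price"] + p2["price"]) / 2.0  # simple mean ensemble
out.to_csv("test_out.csv", index=False)           # final submission file
```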
## Model Architecture

Our final solution is a two-model ensemble.
- Text-Only Model (The Specialist):
  - Architecture: Uses a pre-trained DistilBERT model followed by a regression head (see the first sketch after this list).
  - Purpose: Excels at extracting fine-grained details from product descriptions, such as brand names, specific features, materials, and technical specifications that may not be apparent from the image alone.
- Multimodal Model (The Generalist):
  - Architecture: A dual-stream model that processes image and text data in parallel (see the second sketch after this list).
  - Image Stream: An EfficientNetV2 (a powerful CNN) extracts visual features.
  - Text Stream: DistilBERT extracts textual features.
  - Fusion: The features from both streams are projected into a common dimension, concatenated, and then passed to a final regression head to predict the price.
  - Purpose: Learns the relationship between a product's appearance and its description, capturing contextual information that either modality might miss on its own.
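The two architectures sketched below are illustrative, not verbatim copies of `text_only.py` and `model.py`; the class names, the CLS-token pooling, and the projection width are assumptions. First, the text-only specialist:

```python
# Sketch of the text-only model: DistilBERT encoder + linear regression head.
import torch.nn as nn
from transformers import DistilBertModel

class TextOnlyPriceModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.bert = DistilBertModel.from_pretrained("distilbert-base-uncased")
        self.head = nn.Linear(self.bert.config.dim, 1)   # dim == 768

    def forward(self, input_ids, attention_mask):
        hidden = self.bert(input_ids=input_ids,
                           attention_mask=attention_mask).last_hidden_state
        cls = hidden[:, 0]                               # first-token embedding
        return self.head(cls).squeeze(-1)                # predicted price
```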
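And the dual-stream generalist, with the project-then-concatenate fusion described above (the common width of 256 is a placeholder):

```python
# Sketch of the multimodal model: EfficientNetV2 + DistilBERT, fused by
# projecting both streams to a common width and concatenating.
import timm
import torch
import torch.nn as nn
from transformers import DistilBertModel

class MultimodalPriceModel(nn.Module):
    def __init__(self, common_dim=256):
        super().__init__()
        # Image stream: EfficientNetV2 backbone, pooled features only
        self.cnn = timm.create_model("tf_efficientnetv2_s.in21k",
                                     pretrained=True, num_classes=0)
        # Text stream: DistilBERT encoder
        self.bert = DistilBertModel.from_pretrained("distilbert-base-uncased")
        self.img_proj = nn.Linear(self.cnn.num_features, common_dim)
        self.txt_proj = nn.Linear(self.bert.config.dim, common_dim)
        self.head = nn.Linear(2 * common_dim, 1)         # fused regression head

    def forward(self, image, input_ids, attention_mask):
        img = self.img_proj(self.cnn(image))             # visual features
        txt_hidden = self.bert(input_ids=input_ids,
                               attention_mask=attention_mask).last_hidden_state
        txt = self.txt_proj(txt_hidden[:, 0])            # textual features
        fused = torch.cat([img, txt], dim=1)             # concatenate streams
        return self.head(fused).squeeze(-1)              # predicted price
```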
## Configuration

The primary configuration file is `src/config.py`. It allows for easy management of model settings, paths, and hyperparameters. The `CHOOSE YOUR EXPERIMENT` section is designed to quickly switch between training the text-only and multimodal models by commenting and uncommenting the relevant blocks.
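For orientation, the file is laid out roughly as follows; aside from the experiment block shown earlier, every value here is a hypothetical placeholder rather than the repository's actual setting:

```python
# Illustrative layout of src/config.py (values are placeholders).
INPUT_DIR = "../input"
OUTPUT_DIR = "../output"

# == CHOOSE YOUR EXPERIMENT ==
MODEL_NAME = "multimodal_cnn_distilbert"
IMG_MODEL_NAME = "tf_efficientnetv2_s.in21k"
IMG_SIZE = 300

EPOCHS = 5            # hypothetical value
BATCH_SIZE = 32       # hypothetical value
LR = 2e-5             # hypothetical value
```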
## Results

The final ensembled submission, `test_out.csv`, achieved a SMAPE of 44.901.
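For context, SMAPE (symmetric mean absolute percentage error) is commonly computed as below; whether the leaderboard uses this exact variant is an assumption:

```python
# Common SMAPE definition: symmetric percentage error on a 0-200 scale.
import numpy as np

def smape(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    denom = (np.abs(y_true) + np.abs(y_pred)) / 2.0
    return 100.0 * np.mean(np.abs(y_pred - y_true) / denom)
```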
