This guide will walk you through the process of preparing data, configuring your training setup, and launching training for the ThinkSound model. For best results, we recommend reading through all steps before starting.
Before training, you must preprocess the dataset following the instructions in `Dataset.md`. This includes:
- Converting raw data (e.g., video/audio/text) into structured feature files.
- Constructing a valid dataset metadata JSON that points to all precomputed features (a hypothetical sketch follows this list).
Make sure your extracted dataset includes all required modalities and is organized correctly.
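As a rough illustration, a metadata file might look like the sketch below. The field names here are assumptions for illustration only; `Dataset.md` defines the actual schema.

```bash
# Illustrative only: field names are hypothetical, not the official schema.
# See Dataset.md for the authoritative format.
cat > dataset_metadata.json <<'EOF'
{
  "samples": [
    {
      "id": "clip_0001",
      "video_features": "features/clip_0001.video.npy",
      "audio_latent": "features/clip_0001.audio.npy",
      "caption": "a dog barking in the distance"
    }
  ]
}
EOF
```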
Open `scripts/train.sh` and customize the following items (a sketch of the relevant variables appears after this list):
- Update the paths to your dataset, model config, and checkpoint directory: `dataset_config`, `model_config`, `pretransform_ckpt_path`.
- Modify distributed training settings as needed: `num_gpus`, `num_nodes`, `node_rank`, `MASTER_PORT`, etc.
- (Optional) Enable debug mode by adding the `--debug` flag when running the script.
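For orientation, the section of `scripts/train.sh` you would edit might resemble the sketch below. The variable names mirror the bullets above, but the paths and defaults are placeholders; check your copy of the script for the real values.

```bash
# Sketch of values to customize in scripts/train.sh (paths are placeholders).
dataset_config="configs/dataset_config.json"   # your dataset metadata/config
model_config="configs/model_config.json"       # your model config
pretransform_ckpt_path="ckpts/vae.ckpt"        # pretrained pretransform checkpoint

num_gpus=8          # GPUs per node
num_nodes=1         # total number of nodes
node_rank=0         # rank of this node (0 on the master node)
MASTER_PORT=29500   # any free TCP port on the master node
```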
If you're using a multi-GPU setup, ensure that `WORLD_SIZE`, `NODE_RANK`, and `MASTER_PORT` are set correctly for your environment. These are critical for DistributedDataParallel (DDP) training.
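For example, on the second of two nodes with 8 GPUs each, the environment might be set as follows (the hostname and port are placeholders, and `MASTER_ADDR` is the standard DDP variable for the master node's address):

```bash
# Node 1 of 2, 8 GPUs per node; ranks are zero-based, values are placeholders.
export WORLD_SIZE=16                  # total processes = num_nodes * num_gpus
export NODE_RANK=1                    # this node's index
export MASTER_ADDR=node0.example.com  # address of the rank-0 node
export MASTER_PORT=29500              # must match across all nodes
./scripts/train.sh
```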
To monitor training quality visually, modify the `demo_cond` entry under the `training` → `demo` section of the model config and make sure it contains exactly 10 test samples.
Populate this section with representative video features. These will be passed through the generator periodically during training, allowing you to visually assess output quality over time.
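The exact layout depends on your model config, but as a purely hypothetical sketch (key names assumed, not taken from the repository), the section could be shaped like this:

```bash
# Hypothetical shape of the demo section; key names are assumptions.
# demo_cond should contain exactly 10 entries.
cat <<'EOF'
"training": {
  "demo": {
    "demo_cond": [
      {"video_features": "features/demo_00.npy"},
      {"video_features": "features/demo_01.npy"}
    ]
  }
}
EOF
```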
Make the script executable (if not already) and start training:

```bash
chmod +x scripts/train.sh
./scripts/train.sh
```

Logs will be written to the specified log directory (`log_dir`).
To modify model architecture or training strategy, open the model config.
You can adjust a wide range of parameters, such as:
- Model size (e.g., depth and width, which determine the parameter count)
- Optimizer type
- Learning rate
- Latent dimension
Be sure to keep a backup of your config for reproducibility.
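One simple approach is to snapshot the config with a timestamp before each run, for example:

```bash
# Keep a timestamped copy of the config for reproducibility (path is illustrative).
cp configs/model_config.json "configs/model_config.$(date +%Y%m%d_%H%M%S).json"
```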
Add the `--debug` flag when running the training script to run on a single GPU (single node).
This is useful for quick sanity checks or development iterations.
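For example, assuming `scripts/train.sh` forwards extra arguments to the trainer:

```bash
# Quick single-GPU sanity check.
./scripts/train.sh --debug
```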
Checkpoints (including model weights, optimizer state, and EMA versions) are saved periodically in the configured log directory. You can resume training or fine-tune from a saved model by adding `--ckpt-path` to the training script. If you plan to fine-tune from our pretrained model, please use `thinksound.ckpt` instead of `thinksound_light.ckpt`.
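As a sketch (the checkpoint location below is a placeholder), fine-tuning from the pretrained model might look like:

```bash
# Fine-tune from the full pretrained checkpoint; adjust the path to where
# thinksound.ckpt actually lives on your machine.
./scripts/train.sh --ckpt-path ckpts/thinksound.ckpt
```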
Happy training! 🚀 If you run into any issues, consider opening an issue or checking the documentation for detailed help.