This guide provides step-by-step instructions for preparing datasets to train models in this repository.
Ensure the following checkpoint files exist in the `ckpts/` directory before continuing:

- `ckpts/vae.ckpt`
- `ckpts/synchformer_state_dict.pth`
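A quick way to verify that the checkpoints are in place (a minimal sketch using standard shell tools; adjust the paths if your checkpoints live elsewhere):

```bash
# Report any required checkpoint that is missing from ckpts/.
for f in ckpts/vae.ckpt ckpts/synchformer_state_dict.pth; do
    [ -f "$f" ] || echo "Missing checkpoint: $f"
done
```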
To convert raw videos and their chain-of-thought (CoT) annotations into training features, use the following command:
```bash
torchrun --nproc_per_node=8 data_utils/extract_training_video.py \
    --root <video_path> \
    --tsv_path <csv_path> \
    --save-dir <feature_output_dir> \
    --duration_sec <uniform_video_duration_in_seconds> \
    --audio_samples <duration_sec * 44100>
```

- `<video_path>`: Path to the root directory containing all `.mp4` videos to be processed (all videos must be of equal duration).
- `<csv_path>`: Path to the TSV/CSV file that lists video-text pairs (see `demo_test.csv` for the format).
- `<feature_output_dir>`: Directory where extracted video features will be saved.
- `<uniform_video_duration_in_seconds>`: Duration to which all videos will be uniformly trimmed or padded.
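For example, with 9-second clips the invocation might look like this (the paths below are placeholders; `9 * 44100 = 396900` samples, following the `duration_sec * 44100` formula above):

```bash
# Hypothetical example: all clips uniformly 9 seconds long.
torchrun --nproc_per_node=8 data_utils/extract_training_video.py \
    --root data/videos \
    --tsv_path data/video_pairs.csv \
    --save-dir features/video \
    --duration_sec 9 \
    --audio_samples 396900  # 9 * 44100
```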
You can also include audio-text pairs for training. Use the following command to extract features:
```bash
torchrun --nproc_per_node=8 data_utils/extract_training_audio.py \
    --root <audio_path> \
    --tsv_path <csv_path> \
    --save-dir <feature_output_dir> \
    --duration_sec <uniform_audio_duration_in_seconds> \
    --audio_samples <duration_sec * 44100>
```

- `<audio_path>`: Path to the raw audio files.
- `<csv_path>`: Path to the TSV/CSV file that lists audio-text pairs.
- `<feature_output_dir>`: Directory where extracted audio features will be saved.
- `<uniform_audio_duration_in_seconds>`: Duration to which all audio clips will be uniformly trimmed or padded.

Note that the audio input for feature extraction must be trimmed to match the duration of the video-text datasets (see the sketch below).
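If your raw audio is longer than the video clips, one way to trim it to the uniform duration beforehand is with ffmpeg (a sketch; the 9-second duration, 44.1 kHz rate, and directory names are placeholders):

```bash
mkdir -p data/audio_trimmed
# Trim every clip to 9 seconds and resample to 44.1 kHz.
for f in data/audio/*.wav; do
    ffmpeg -y -i "$f" -t 9 -ar 44100 "data/audio_trimmed/$(basename "$f")"
done
```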
For each dataset (video or audio), create a `.txt` file listing all feature file names (one per line), for example:

```
item1.pth
item2.pth
item3.pth
...
```
This file acts as the training split and will be referenced in the dataset config.
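One way to generate this split file from an extracted feature directory (a sketch; the directory and file names are placeholders):

```bash
mkdir -p splits
# One feature file name per line, without the directory prefix.
(cd features/video && ls *.pth) > splits/video_train.txt
```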
Create a JSON file following the structure below (adapted from `ThinkSound/configs/multimodal_dataset_demo.json`):

```json
{
    "dataset_type": "multimodal_dir",
    "video_datasets": [
        {
            "id": "video_dataset_id",
            "path": "path_to_video_feature_dir",
            "split_path": "path_to_video_split_txt"
        }
    ],
    "audio_datasets": [
        {
            "id": "audio_dataset_id",
            "path": "path_to_audio_feature_dir",
            "split_path": "path_to_audio_split_txt"
        }
    ],
    "val_datasets": [
        {
            "id": "val_dataset_id",
            "path": "path_to_val_feature_dir",
            "split_path": "path_to_val_split_txt"
        }
    ],
    "random_crop": true,
    "input_type": "prompt"
}
```
You can include multiple datasets under `video_datasets` and `audio_datasets` by appending additional dictionary blocks to each list, as shown below. Providing `val_datasets` is encouraged, and it must be a video-text dataset.
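For instance, a config with two video feature sets would extend the `video_datasets` list like this (a fragment of the config above; the ids and paths are placeholders):

```json
"video_datasets": [
    {
        "id": "video_dataset_a",
        "path": "features/video_a",
        "split_path": "splits/video_a_train.txt"
    },
    {
        "id": "video_dataset_b",
        "path": "features/video_b",
        "split_path": "splits/video_b_train.txt"
    }
]
```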
Refer to `docs/Training.md` for detailed training instructions once the dataset configuration is complete.