Official implementation of VIRST, a video-instructed reasoning framework for spatiotemporal segmentation.
Release plan:
- Release model code
- Release checkpoint
- Release data code
- Release utility scripts
- Release training scripts
- Release evaluation script
- Release demo script
This repository contains the core training and evaluation code for VIRST, including:
- model definition in `model/`
- training entrypoints in `train.py` and `train_stage3.py`
- RVOS evaluation in `eval.py`
- dataset handling in `data/`
- utility code in `utils/`
Pretrained checkpoint: Google Drive
The project page will be updated as the release is finalized.
This project builds upon prior work, including VISA, LISA, VideoChat-Flash, and SAM2.
We thank the authors for releasing their code and models.