From 88195dbde541e5f99621ae644beea4eea0701c4c Mon Sep 17 00:00:00 2001
From: polyedr
Date: Sun, 24 Aug 2025 16:30:01 +0200
Subject: [PATCH] =?UTF-8?q?Fix=20minor=20grammar=20in=20README=20(Inferenc?=
 =?UTF-8?q?e=20a=20video=20=E2=86=92=20Run=20inference=20on=20a=20video;?=
 =?UTF-8?q?=20each=20frames=20=E2=86=92=20each=20frame)?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

---
 README.md | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/README.md b/README.md
index a288718..901d244 100644
--- a/README.md
+++ b/README.md
@@ -90,7 +90,7 @@ Download the checkpoints listed [here](#pre-trained-models) and put them under t
 bash get_weights.sh
 ```
 
-### Inference a video
+### Run inference on a video
 ```bash
 python3 run.py --input_video ./assets/example_videos/davis_rollercoaster.mp4 --output_dir ./outputs --encoder vitl
 ```
@@ -108,8 +108,8 @@ Options:
 - `--save_npz` (optional): Save the depth map in `npz` format.
 - `--save_exr` (optional): Save the depth map in `exr` format.
 
-### Inference a video using streaming mode (Experimental features)
-We implement an experimental streaming mode **without training**. In details, we save the hidden states of temporal attentions for each frames in the caches, and only send a single frame into our video depth model during inference by reusing these past hidden states in temporal attentions. We hack our pipeline to align the original inference setting in the offline mode. Due to the inevitable gap between training and testing, we observe a **performance drop** between the streaming model and the offline model (e.g. the `d1` of ScanNet drops from `0.926` to `0.836`). Finetuning the model in the streaming mode will greatly improve the performance. We leave it for future work.
+### Run inference on a video using streaming mode (Experimental features)
+We implement an experimental streaming mode **without training**. In details, we save the hidden states of temporal attentions for each frame in the caches, and only send a single frame into our video depth model during inference by reusing these past hidden states in temporal attentions. We hack our pipeline to align the original inference setting in the offline mode. Due to the inevitable gap between training and testing, we observe a **performance drop** between the streaming model and the offline model (e.g. the `d1` of ScanNet drops from `0.926` to `0.836`). Finetuning the model in the streaming mode will greatly improve the performance. We leave it for future work.
 
 To run the streaming model:
 ```bash
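
The streaming-mode paragraph touched by this patch describes caching the hidden states of the temporal attentions for each frame and then feeding a single frame at a time, reusing the cached past states. Below is a minimal, hypothetical sketch of that caching pattern, not the repository's actual implementation: the class name, shapes, single-head attention, and the `max_cache` sliding-window cap are all assumptions made for illustration.

```python
# Illustrative sketch only (assumed names and shapes, not Video-Depth-Anything code):
# temporal attention that keeps a per-frame key/value cache so inference can
# proceed one frame at a time while still attending over past frames.
import torch
import torch.nn as nn


class StreamingTemporalAttention(nn.Module):
    """Toy single-head temporal attention with a running key/value cache."""

    def __init__(self, dim: int, max_cache: int = 32):
        super().__init__()
        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(dim, dim)
        self.to_v = nn.Linear(dim, dim)
        self.max_cache = max_cache  # how many past frames to keep (assumed cap)
        self.k_cache = None         # (T, N, dim) cached keys of past frames
        self.v_cache = None         # (T, N, dim) cached values of past frames

    @torch.no_grad()
    def forward(self, frame_tokens: torch.Tensor) -> torch.Tensor:
        # frame_tokens: (N, dim) tokens of the single incoming frame
        q = self.to_q(frame_tokens)               # (N, dim)
        k = self.to_k(frame_tokens).unsqueeze(0)  # (1, N, dim)
        v = self.to_v(frame_tokens).unsqueeze(0)  # (1, N, dim)

        # Append the new frame's hidden states to the cache, dropping the
        # oldest frame once the cache is full (sliding temporal window).
        self.k_cache = k if self.k_cache is None else torch.cat([self.k_cache, k])[-self.max_cache:]
        self.v_cache = v if self.v_cache is None else torch.cat([self.v_cache, v])[-self.max_cache:]

        # Attend from the current frame's queries over all cached frames.
        keys = self.k_cache.flatten(0, 1)  # (T*N, dim)
        vals = self.v_cache.flatten(0, 1)  # (T*N, dim)
        attn = torch.softmax(q @ keys.T / keys.shape[-1] ** 0.5, dim=-1)
        return attn @ vals                 # (N, dim)


if __name__ == "__main__":
    layer = StreamingTemporalAttention(dim=64)
    for _ in range(5):                     # stream frames one by one
        out = layer(torch.randn(16, 64))   # 16 tokens per frame
    print(out.shape)                       # torch.Size([16, 64])
```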