diff --git a/README.md b/README.md index e84d02d448..a50313fa2c 100644 --- a/README.md +++ b/README.md @@ -1,16 +1,32 @@ -# Nerfies - -This is the repository that contains source code for the [Nerfies website](https://nerfies.github.io). - -If you find Nerfies useful for your work please cite: -``` -@article{park2021nerfies - author = {Park, Keunhong and Sinha, Utkarsh and Barron, Jonathan T. and Bouaziz, Sofien and Goldman, Dan B and Seitz, Steven M. and Martin-Brualla, Ricardo}, - title = {Nerfies: Deformable Neural Radiance Fields}, - journal = {ICCV}, - year = {2021}, -} -``` - -# Website License -Creative Commons License
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. +# V-Skip + +
+ +[![arXiv](https://img.shields.io/badge/arXiv-2601.13879-b31b1b.svg)](https://arxiv.org/pdf/2601.13879) +[![Project Page](https://img.shields.io/badge/Project-Page-green)](https://dongxu-zhang.github.io/v-skip.github.io/) +[![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0) + +**V-Skip: Efficient Multimodal Reasoning via Dual-Path Anchoring** +
+[Dongxu Zhang](https://dongxu-zhang.github.io/)1,*, [Yiding Sun](https://github.com/Issac-Sun)1,*, [Cheng Tan](https://chengtan9907.github.io/)3, [Wenbiao Yan](#)4, [Ning Yang](http://ningyangcasia.cn/)2,†, [Jihua Zhu](https://gr.xjtu.edu.cn/web/zhujh)1,†, [Haijun Zhang](https://scce.ustb.edu.cn/shiziduiwu/jiaoshixinxi/2018-04-13/100.html)5 + +1State Key Laboratory of Human-Machine Hybrid Augmented Intelligence, XJTU, 2CASIA, 3Shanghai AI Laboratory, 4HITSZ, 5USTB +
+ +--- + +## 🚀 Introduction + +This repository contains the official implementation (and project page source) for the paper **"Chain-of-Thought Compression Should Not Be Blind: V-Skip for Efficient Multimodal Reasoning via Dual-Path Anchoring"**. + +**V-Skip** is a novel token pruning framework designed for Multimodal Large Language Models (MLLMs). It solves the **"Visual Amnesia"** problem found in standard text-centric compression methods. By employing a dual-path gating mechanism (Linguistic Surprisal + Visual Attention Flow), V-Skip preserves visually salient tokens while reducing latency. + +![V-Skip Teaser](./static/images/fig1.png) +*Figure 1: Comparison of compression paradigms. V-Skip successfully rescues visual anchors (e.g., "red") that are blindly pruned by text-only methods.* + +## 📈 Key Results +- **Speedup:** Achieves **2.9x** inference speedup on Qwen2-VL. +- **Accuracy:** Outperforms baselines by over **30%** on the DocVQA benchmark. +- **Robustness:** Effectively prevents object hallucination caused by over-pruning. + +## 🛠️ Usage diff --git a/index.html b/index.html index 373119fe36..396707a1e5 100644 --- a/index.html +++ b/index.html @@ -2,25 +2,11 @@ - - + + - Nerfies: Deformable Neural Radiance Fields + V-Skip: Chain-of-Thought Compression Should Not Be Blind - - - @@ -32,7 +18,7 @@ - + @@ -42,45 +28,40 @@ -
@@ -88,39 +69,44 @@
-

Nerfies: Deformable Neural Radiance Fields

+

Chain-of-Thought Compression Should Not Be Blind:
V-Skip for Efficient Multimodal Reasoning via Dual-Path Anchoring

- 1University of Washington, - 2Google Research + 1State Key Laboratory of Human-Machine Hybrid Augmented Intelligence, XJTU, + 2CASIA, + 3Shanghai AI Lab, + 4HITSZ, + 5USTB +
+
+
@@ -178,265 +135,105 @@

Nerfies: Deformable Neural Radiance Fields
- + Visual Amnesia Concept

- Nerfies turns selfie videos from your phone into - free-viewpoint - portraits. + V-Skip solves Visual Amnesia. + Standard text compression (middle) blindly prunes visually essential tokens (e.g., "red"), causing hallucinations. + V-Skip (bottom) preserves these visual anchors via Dual-Path scoring.

-
-
-
- -
-
-
- -
-

Abstract

- We present the first method capable of photorealistically reconstructing a non-rigidly - deforming scene using photos/videos captured casually from mobile phones. + While Chain-of-Thought (CoT) reasoning significantly enhances the performance of + Multimodal Large Language Models (MLLMs), its autoregressive nature incurs prohibitive + latency. Current efforts to mitigate this via token compression often fail + by blindly applying text-centric metrics to multimodal contexts.

- Our approach augments neural radiance fields - (NeRF) by optimizing an - additional continuous volumetric deformation field that warps each observed point into a - canonical 5D NeRF. - We observe that these NeRF-like deformation fields are prone to local minima, and - propose a coarse-to-fine optimization method for coordinate-based models that allows for - more robust optimization. - By adapting principles from geometry processing and physical simulation to NeRF-like - models, we propose an elastic regularization of the deformation field that further - improves robustness. + We identify a critical failure mode termed Visual Amnesia, where + linguistically redundant tokens are erroneously pruned, severing the connection to the input image + and leading to hallucinations. To address this, we introduce V-Skip, + a novel framework that reformulates token pruning as a Visual-Anchored Information Bottleneck (VA-IB) + optimization problem.

- We show that Nerfies can turn casually captured selfie - photos/videos into deformable NeRF - models that allow for photorealistic renderings of the subject from arbitrary - viewpoints, which we dub "nerfies". We evaluate our method by collecting data - using a - rig with two mobile phones that take time-synchronized photos, yielding train/validation - images of the same pose at different viewpoints. We show that our method faithfully - reconstructs non-rigidly deforming scenes and reproduces unseen views with high - fidelity. + V-Skip employs a dual-path gating mechanism that weighs token importance through + both linguistic surprisal and cross-modal attention flow. This allows the model to identify and + rescue visually salient anchors that would otherwise be discarded. Extensive experiments on + Qwen2-VL and Llama-3.2 families demonstrate that V-Skip achieves a 2.9× speedup + with negligible accuracy loss, outperforming other baselines by over 30% on the DocVQA benchmark.

- - - -
-
-

Video

-
- -
-
- -
- -
- - -
-
-

Visual Effects

-

- Using nerfies you can create fun visual effects. This Dolly zoom effect - would be impossible without nerfies since it would require going through a wall. -

- -
-
- - - -
-

Matting

-
-
-

- As a byproduct of our method, we can also solve the matting problem by ignoring - samples that fall outside of a bounding box during rendering. -

- -
- -
-
-
- - - -
+
-

Animation

- - -

Interpolating states

+

Methodology

- We can also animate the scene by interpolating the deformation latent codes of two input - frames. Use the slider here to linearly interpolate between the left frame and the right - frame. + V-Skip reformulates token pruning as a Visual-Anchored Information Bottleneck (VA-IB) problem. + The pipeline consists of three stages:

+
+    • Stage 1: Data Generation using a frozen Teacher MLLM.
+    • Stage 2: Dual-Path Filtering & Pruning, utilizing both Linguistic Surprisal and Visual Attention Flow.
+    • Stage 3: Efficient Fine-tuning via LoRA distillation.
+
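The Stage-2 dual-path gate can be sketched in a few lines. The min-max normalization, the equal `alpha` weighting, and the top-k keep rule below are illustrative assumptions, not the paper's exact VA-IB formulation:

```python
def dual_path_keep_mask(surprisal, visual_attention, keep_ratio=0.5, alpha=0.5):
    """Illustrative dual-path gate: a token survives pruning when its
    combined linguistic-surprisal / visual-attention score ranks in the
    top `keep_ratio` fraction. All scores here are hypothetical inputs."""
    def norm(xs):  # min-max normalize so neither path dominates by scale
        lo, hi = min(xs), max(xs)
        return [(x - lo) / (hi - lo) if hi > lo else 0.0 for x in xs]
    s, v = norm(surprisal), norm(visual_attention)
    score = [alpha * a + (1 - alpha) * b for a, b in zip(s, v)]
    k = max(1, round(keep_ratio * len(score)))
    top = set(sorted(range(len(score)), key=score.__getitem__, reverse=True)[:k])
    return [i in top for i in range(len(score))]

# Token 0 has low surprisal (a text-only pruner would drop it) but high
# visual attention, so the combined score rescues it as a visual anchor.
surprisal        = [0.1, 2.5, 0.2, 1.8]
visual_attention = [0.9, 0.1, 0.05, 0.2]
print(dual_path_keep_mask(surprisal, visual_attention))  # → [True, True, False, False]
```

With `alpha=1.0` the gate degenerates to text-only pruning and the same token would be discarded, which is exactly the Visual Amnesia failure the pipeline is designed to avoid.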
-
-
- Interpolate start reference image. -

Start Frame

-
-
-
- Loading... -
- -
-
- Interpolation end reference image. -

End Frame

-
-
-
- - - -

Re-rendering the input video

+ V-Skip Architecture

- Using Nerfies, you can re-render a video from a novel - viewpoint such as a stabilized camera by playing back the training deformations. + As shown above (Figure 2), our method automates the construction of efficient multimodal reasoners. + Unlike standard text compression which discards tokens based only on text probability, V-Skip rescues + visually salient tokens (anchors) to prevent hallucination.

-
- -
- -
- - +
+
- -
+
+
+
-

Related Links

- +

Qualitative Comparison

+

- There's a lot of excellent work that was introduced around the same time as ours. -

-

- Progressive Encoding for Neural Optimization introduces an idea similar to our windowed position encoding for coarse-to-fine optimization. -

-

- D-NeRF and NR-NeRF - both use deformation fields to model non-rigid scenes. -

-

- Some works model videos with a NeRF by directly modulating the density, such as Video-NeRF, NSFF, and DyNeRF + We compare V-Skip against standard text-centric pruning methods (e.g., LLMLingua-2). + The example below (from DocVQA) demonstrates the Information Entropy Mismatch.

+
+ + Qualitative comparison on DocVQA + Qualitative comparison on DocVQA +

- There are probably many more by the time you are reading this. Check out Frank Dellart's survey on recent NeRF papers, and Yen-Chen Lin's curated list of NeRF papers. +
+ In the invoice example, the key value "$45.20" has high linguistic perplexity (it looks like a random number to the LLM) + but is visually grounded in the image. + Standard methods (LLMLingua-2) prune it, leading to a wrong answer. + V-Skip detects the high cross-modal attention and preserves the token, answering correctly.
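The rescue logic in this example can be written as a toy decision rule; the score values and thresholds below are hypothetical, chosen only to mirror the "$45.20" scenario:

```python
def prune_decision(token, text_score, visual_attn,
                   text_keep=0.5, attn_rescue=0.6):
    """Contrast text-only pruning with dual-path anchoring: a token below
    the text-importance threshold is normally dropped, but strong
    cross-modal attention rescues it as a visual anchor."""
    text_only = "keep" if text_score >= text_keep else "prune"
    dual_path = ("keep" if text_score >= text_keep or visual_attn >= attn_rescue
                 else "prune")
    return token, text_only, dual_path

# "$45.20" scores low on the text path yet is strongly attended in the image.
print(prune_decision("$45.20", text_score=0.2, visual_attn=0.9))
# → ('$45.20', 'prune', 'keep')
```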

- -
@@ -444,11 +241,14 @@

Related Links

BibTeX

-
@article{park2021nerfies,
-  author    = {Park, Keunhong and Sinha, Utkarsh and Barron, Jonathan T. and Bouaziz, Sofien and Goldman, Dan B and Seitz, Steven M. and Martin-Brualla, Ricardo},
-  title     = {Nerfies: Deformable Neural Radiance Fields},
-  journal   = {ICCV},
-  year      = {2021},
+    
@misc{zhangvskip,
+      title={Chain-of-Thought Compression Should Not Be Blind: V-Skip for Efficient Multimodal Reasoning via Dual-Path Anchoring}, 
+      author={Dongxu Zhang and Yiding Sun and Cheng Tan and Wenbiao Yan and Ning Yang and Jihua Zhu and Haijun Zhang},
+      year={2026},
+      eprint={2601.13879},
+      archivePrefix={arXiv},
+      primaryClass={cs.MM},
+      url={https://arxiv.org/abs/2601.13879}, 
 }
@@ -458,10 +258,10 @@

BibTeX

@@ -474,11 +274,8 @@

BibTeX

Commons Attribution-ShareAlike 4.0 International License.

- This means you are free to borrow the source code of this website, - we just ask that you link back to this page in the footer. - Please remember to remove the analytics code included in the header of the website which - you do not want on your website. + This website utilizes the project page template from Nerfies.

diff --git a/static/images/XJTU_emblem.svg b/static/images/XJTU_emblem.svg
new file mode 100644
index 0000000000..959c01d2e9
--- /dev/null
+++ b/static/images/XJTU_emblem.svg
@@ -0,0 +1,33 @@
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
diff --git a/static/images/fig1.pdf b/static/images/fig1.pdf
new file mode 100644
index 0000000000..008c01dec3
Binary files /dev/null and b/static/images/fig1.pdf differ
diff --git a/static/images/fig1.png b/static/images/fig1.png
new file mode 100644
index 0000000000..aeee215159
Binary files /dev/null and b/static/images/fig1.png differ
diff --git a/static/images/pipeline.png b/static/images/pipeline.png
new file mode 100644
index 0000000000..5dfbf2b068
Binary files /dev/null and b/static/images/pipeline.png differ
diff --git a/static/images/qualitative.png b/static/images/qualitative.png
new file mode 100644
index 0000000000..f817a065ef
Binary files /dev/null and b/static/images/qualitative.png differ
diff --git a/static/images/qualitative2.png b/static/images/qualitative2.png
new file mode 100644
index 0000000000..f1e955df50
Binary files /dev/null and b/static/images/qualitative2.png differ
diff --git "a/\344\277\256\346\224\271\346\214\207\345\215\227.md" "b/\344\277\256\346\224\271\346\214\207\345\215\227.md"
new file mode 100644
index 0000000000..2f30142af3
--- /dev/null
+++ "b/\344\277\256\346\224\271\346\214\207\345\215\227.md"
@@ -0,0 +1,493 @@
+# Paper Website Modification Guide
+
+This document details every part that needs to be changed when adapting the Nerfies website template to your own paper's content.
+
+## 📋 Modification Checklist Overview
+
+### 🔴 Must modify (core content)
+
+### 🟡 Recommended (personalized content)
+
+### 🟢 Optional (enhanced features)
+
+---
+
+## 1. HTML Head Information (index.html, lines 1-42)
+
+### 1.1 Meta Tag Information
+**Location**: `index.html`, lines 5-9
+
+```html
+
+
+
+Your paper title
+```
+
+**Notes**:
+- `description`: change to a short description of your paper (under 150 characters)
+- `keywords`: change to keywords relevant to your paper, comma-separated
+- `title`: change to your paper title
+
+### 1.2 Google Analytics (optional)
+**Location**: `index.html`, lines 11-23
+
+**Options**:
+- **Keep**: if you have your own Google Analytics ID, replace `G-PYVRSFMDRL` with your ID
+- **Delete**: if you don't need analytics, delete the entire