VoCo-LLaMA:
Towards Vision Compression with
Large Language Models

1Tsinghua University, 2ARC Lab, Tencent PCG, 3UC Santa Cruz

TL;DR

1. We propose VoCo-LLaMA, the first approach to compress vision tokens using LLMs. By fully exploiting the way LLMs already understand vision tokens, our method compresses hundreds of vision tokens into a single VoCo token while minimizing visual information loss (see the token-budget sketch after this list).
2. Through continuous training on time-series sequences of compressed video-frame tokens, VoCo-LLaMA demonstrates the ability to understand video.
3. VoCo-LLaMA presents a promising way to unlock the full potential of VLMs' contextual window.
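
To make the compression ratio in item 1 concrete, here is a back-of-envelope sketch of the token budget, assuming the LLaVA-style setup of 576 vision tokens per image (a 24 x 24 patch grid from CLIP ViT-L/14 at 336 px) compressed into a single VoCo token. The context window and text budget below are illustrative assumptions, not numbers from the paper.

# Token-budget sketch under the assumptions stated above.
VISION_TOKENS_PER_IMAGE = 576   # 24 x 24 patch tokens per image (LLaVA-style)
VOCO_TOKENS_PER_IMAGE = 1       # after compression into a single VoCo token
CONTEXT_WINDOW = 4096           # assumed LLM context length
TEXT_BUDGET = 256               # assumed tokens reserved for prompt and answer

def max_frames(tokens_per_image: int) -> int:
    """How many images/frames fit alongside the text budget."""
    return (CONTEXT_WINDOW - TEXT_BUDGET) // tokens_per_image

print(max_frames(VISION_TOKENS_PER_IMAGE))  # 6 frames without compression
print(max_frames(VOCO_TOKENS_PER_IMAGE))    # 3840 frames with VoCo tokens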


(a) VLMs are bottlenecked by their limited context window when processing high-resolution images and videos. (b) Previous methods compress vision tokens with external modules and suffer substantial information loss. (c) Illustration of VoCo-LLaMA, which empowers the LLM to compress vision tokens and understand the compressed tokens via intrinsic token distillation.

Abstract

We propose VoCo-LLaMA, the first approach to compress vision tokens using LLMs. We strive to fully exploit the way LLMs understand vision tokens during the compression learning process. By introducing Vision Compression (VoCo) tokens during the vision instruction tuning phase and leveraging attention distillation, our method distills how LLMs comprehend vision tokens into their processing of VoCo tokens. VoCo-LLaMA enables effective vision compression and improves computational efficiency at inference time. Specifically, our method achieves minimal performance loss at a compression ratio of 576×, yielding up to 94.8% fewer FLOPs and accelerating inference time by up to 69.6%. Furthermore, through continuous training on time-series sequences of compressed video-frame tokens, VoCo-LLaMA demonstrates the ability to understand temporal correlations, outperforming previous methods on popular video question-answering benchmarks. Our approach presents a promising way to unlock the full potential of VLMs' contextual window, enabling more scalable multi-modal applications.

Vision Compression Design


Illustration of the VoCo-LLaMA framework. Compared with standard VLMs (a), VoCo-LLaMA (b) first isolates visual and text tokens by injecting VoCo tokens, and then establishes a dedicated interaction pathway between the two modalities via the VoCo tokens, enabling vision tokens to be distilled and compressed into the transformer activations of the compact VoCo tokens.
The compression process of VoCo-LLaMA can be elegantly implemented by strategically modifying the attention mask and learned through standard visual instruction tuning.
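
As a concrete illustration of the attention-mask modification described above, the sketch below builds one plausible mask under the assumption that tokens are ordered [vision | VoCo | text] and that text tokens are simply blocked from attending to the raw vision tokens, so the only path from vision to text runs through the VoCo tokens. This is a minimal sketch of the idea, not the released implementation.

import torch

def voco_attention_mask(n_vis: int, n_voco: int, n_txt: int) -> torch.Tensor:
    """Sketch of a modified causal mask: True = attention allowed.

    Token order: [vision tokens | VoCo tokens | text tokens].
    Standard causal masking applies everywhere, except that text tokens
    are blocked from attending to the raw vision tokens, so visual
    information reaches the text only through the VoCo tokens.
    """
    n = n_vis + n_voco + n_txt
    mask = torch.tril(torch.ones(n, n)).bool()  # causal baseline
    # Block text -> vision attention (rows = queries, columns = keys).
    mask[n_vis + n_voco:, :n_vis] = False
    return mask

# Example: 576 vision tokens, 1 VoCo token, 32 text tokens.
m = voco_attention_mask(576, 1, 32)
print(m.shape)             # torch.Size([609, 609])
print(m[577, :576].any())  # False: text cannot see vision tokens directly
print(m[576, :576].all())  # True: the VoCo token attends to all vision tokens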

Vision Compression Results


Temporal Modeling

To investigate the efficacy of our method on video input, we use the time-series VoCo token sequences obtained by compressing video frames to explore the temporal modeling capabilities of VoCo-LLaMA. Without any additional design, VoCo-LLaMA outperforms existing vision compression methods on common video question-answering benchmarks, with each video frame compressed into an equal number of tokens.
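
A minimal sketch of the bookkeeping for video input is given below: each frame is compressed into an equal number of VoCo tokens, and the results are concatenated in temporal order. The compress_frame stand-in is hypothetical; in VoCo-LLaMA the frame's information is distilled into the VoCo tokens' transformer activations during the forward pass rather than produced by a separate function.

import torch

def compress_frame(vision_tokens: torch.Tensor, n_voco: int) -> torch.Tensor:
    """Hypothetical stand-in for per-frame compression: returns placeholder
    VoCo states of the right shape to illustrate the bookkeeping only."""
    hidden = vision_tokens.shape[-1]
    return torch.zeros(n_voco, hidden)

def video_to_voco_sequence(frames: list, n_voco_per_frame: int = 1) -> torch.Tensor:
    """Compress each frame into an equal number of VoCo tokens and
    concatenate them in temporal order, forming the time-series
    sequence consumed by the LLM for video question answering."""
    per_frame = [compress_frame(f, n_voco_per_frame) for f in frames]
    return torch.cat(per_frame, dim=0)  # (num_frames * n_voco_per_frame, hidden)

# Example: 16 frames, 576 patch tokens each, hidden size 4096 (assumed).
frames = [torch.randn(576, 4096) for _ in range(16)]
seq = video_to_voco_sequence(frames, n_voco_per_frame=1)
print(seq.shape)  # torch.Size([16, 4096])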

Visualization


BibTeX

@article{ye2024voco,
    author    = {Ye, Xubing and Gan, Yukang and Huang, Xiaoke and Ge, Yixiao and Shan, Ying and Tang, Yansong},
    title     = {{VoCo-LLaMA: Towards Vision Compression with Large Language Models}},
    journal   = {arXiv preprint arXiv:2406.12275},
    year      = {2024},
}