ATP-LLaVA: Adaptive Token Pruning for Large Vision Language Models

¹Tsinghua University, ²ARC Lab, Tencent PCG

TL;DR

1. We reveal the importance of adaptively determining pruning ratios at the instance and LLM layer levels for effective visual token pruning, and propose ATP-LLaVA, a framework that dynamically prunes visual tokens to reduce the computational cost of Large Vision Language Models.
2. ATP-LLaVA achieves a 75% average pruning ratio while maintaining 98.1% of the original performance across seven widely used vision understanding benchmarks.

Insight & Motivation

Token pruning of LVLMs should be layer-wise and instance-wise.

Adaptive Token Pruning


(a) Previous methods employ a fixed, pre-defined token pruning ratio.
(b) Illustration of ATP-LLaVA, which dynamically selects an adaptive pruning ratio for each layer of the LLM decoder based on instance-specific characteristics (see the sketch below).
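
To make the contrast between (a) and (b) concrete, here is a minimal PyTorch sketch. The helper names (fixed_ratio_prune, adaptive_prune) and shape conventions are illustrative assumptions, not the paper's code: fixed-ratio pruning keeps the same top-k fraction of tokens for every instance, while adaptive pruning keeps whatever exceeds an instance-specific threshold, so the surviving token count varies per instance and per layer.

import torch

def fixed_ratio_prune(tokens, scores, keep_ratio=0.25):
    # (a) Fixed ratio: every instance keeps the same fraction of visual tokens.
    # tokens: (B, N, D) visual token features, scores: (B, N) importance scores.
    num_keep = max(1, int(tokens.shape[1] * keep_ratio))
    keep_idx = scores.topk(num_keep, dim=1).indices                       # (B, num_keep)
    return torch.gather(tokens, 1, keep_idx.unsqueeze(-1).expand(-1, -1, tokens.shape[-1]))

def adaptive_prune(tokens, scores, thresholds):
    # (b) Adaptive: each instance has its own threshold (e.g., predicted per layer),
    # so the number of surviving tokens differs across instances and layers.
    keep_mask = scores > thresholds.unsqueeze(1)                          # (B, N) boolean mask
    return [t[m] for t, m in zip(tokens, keep_mask)]                      # variable length per instance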

Abstract

Large Vision Language Models (LVLMs) have achieved significant success across multi-modal tasks. However, the computational cost of processing long visual token sequences can be prohibitively expensive on resource-limited devices. Previous methods have identified redundancy in visual tokens within the Large Language Model (LLM) decoder layers and have mitigated it by pruning tokens with a pre-defined or fixed ratio, thereby reducing computational overhead. Nonetheless, we observe that the impact of the pruning ratio varies across LLM layers and instances (image-prompt pairs). Therefore, it is essential to develop a layer-wise and instance-wise visual token pruning strategy to balance computational cost and model performance effectively. We propose ATP-LLaVA, a novel approach that adaptively determines instance-specific token pruning ratios for each LLM layer. Specifically, we introduce an Adaptive Token Pruning (ATP) module, which computes the importance score and pruning threshold adaptively for each input instance. The ATP module can be seamlessly integrated between any two LLM layers with negligible computational overhead. Additionally, we develop a Spatial Augmented Pruning (SAP) strategy that prunes visual tokens from both token redundancy and spatial modeling perspectives. Our approach reduces the average token count by 75% while maintaining performance, with only a minimal 1.9% degradation across seven widely used benchmarks.
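
As a rough illustration of the ATP idea described above, the following PyTorch sketch scores visual tokens and predicts a per-instance pruning threshold from a pooled summary of the instance. The scorer, the threshold head, and the threshold scaling are assumptions made for this example; the actual ATP module (and the SAP strategy) in the paper may be designed differently.

import torch
import torch.nn as nn

class ATPSketch(nn.Module):
    # A minimal, illustrative ATP-style module: a linear head scores each visual
    # token, and a small MLP predicts an instance-specific pruning threshold.
    def __init__(self, hidden_size):
        super().__init__()
        self.scorer = nn.Linear(hidden_size, 1)               # token importance head (assumed)
        self.threshold_head = nn.Sequential(                   # instance-level threshold predictor (assumed)
            nn.Linear(hidden_size, hidden_size // 4),
            nn.GELU(),
            nn.Linear(hidden_size // 4, 1),
        )

    def forward(self, visual_states):
        # visual_states: (B, N_vis, D) hidden states of the visual tokens at this layer.
        scores = self.scorer(visual_states).squeeze(-1).softmax(dim=-1)   # (B, N_vis)
        summary = visual_states.mean(dim=1)                               # (B, D) instance-level context
        # Keep the threshold on the scale of the softmax scores (~1/N_vis).
        threshold = torch.sigmoid(self.threshold_head(summary)) / visual_states.shape[1]
        keep_mask = scores >= threshold                                   # (B, N_vis), instance- and layer-specific
        return keep_mask, scores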

Adaptive Token Pruning


Illustration of the Adaptive Token Pruning (ATP) module. The ATP module can be flexibly inserted between any two LLaMA decoder layers. It adaptively predicts pruning thresholds for the current layer and instance. Redundant or text-irrelevant visual tokens are pruned at this stage and are then ignored by the remaining tokens in subsequent LLaMA decoder layers.
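
A hedged sketch of how such a module could be wired between decoder layers is shown below. Here decoder_layers, atp_modules, vis_slice, and the boolean attention mask are placeholders introduced for the example, and each layer is assumed to be a callable returning updated hidden states. Pruned visual tokens are masked out so that later layers no longer attend to them; physically dropping them instead would additionally save compute.

import torch

def forward_with_pruning(hidden, decoder_layers, atp_modules, vis_slice, attn_mask):
    # hidden: (B, N, D) full token sequence; vis_slice: slice covering the visual tokens;
    # attn_mask: (B, N) boolean mask, True = token remains visible to the sequence.
    for i, layer in enumerate(decoder_layers):
        hidden = layer(hidden, attn_mask)                     # assumed layer signature for this sketch
        atp = atp_modules.get(i)                              # ATP inserted only after selected layers
        if atp is not None:
            keep_mask, _ = atp(hidden[:, vis_slice])          # (B, N_vis)
            attn_mask = attn_mask.clone()
            attn_mask[:, vis_slice] = attn_mask[:, vis_slice] & keep_mask   # pruned tokens are ignored from now on
    return hidden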

Token Pruning Results


Visualization
