ATP-LLaVA: Adaptive Token Pruning for Large Vision Language Models

¹Tsinghua University, ²ARC Lab, Tencent PCG

TL;DR

1. We reveal the importance of adaptively determining pruning ratios at the instance and LLM layer levels for effective visual token pruning, and propose ATP-LLaVA, a framework that dynamically prunes visual tokens to reduce the computational cost of Large Vision Language Models.
2. ATP-LLaVA achieves a 75% average pruning ratio while maintaining 98.1% of the original performance across seven widely used vision understanding benchmarks.

Insight & Motivation

Token pruning of LVLMs should be layer-wise and instance-wise.

Adaptive Token Pruning


(a) Previous methods employ a fixed, pre-defined token pruning ratio.
(b) Illustration of ATP-LLaVA, which dynamically selects an adaptive pruning ratio for each layer of the LLM decoder based on instance-specific characteristics (see the sketch below).
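
To make the contrast between (a) and (b) concrete, here is a minimal PyTorch sketch. The helper names (fixed_ratio_prune, adaptive_prune) and shape conventions are illustrative assumptions, not the paper's code: fixed-ratio pruning keeps the same top-k fraction of tokens for every instance, while adaptive pruning keeps whatever exceeds an instance-specific threshold, so the surviving token count varies per instance and per layer.

import torch

def fixed_ratio_prune(tokens, scores, keep_ratio=0.25):
    # (a) Fixed ratio: every instance keeps the same fraction of visual tokens.
    # tokens: (B, N, D) visual token features, scores: (B, N) importance scores.
    num_keep = max(1, int(tokens.shape[1] * keep_ratio))
    keep_idx = scores.topk(num_keep, dim=1).indices                       # (B, num_keep)
    return torch.gather(tokens, 1, keep_idx.unsqueeze(-1).expand(-1, -1, tokens.shape[-1]))

def adaptive_prune(tokens, scores, thresholds):
    # (b) Adaptive: each instance has its own threshold (e.g., predicted per layer),
    # so the number of surviving tokens differs across instances and layers.
    keep_mask = scores > thresholds.unsqueeze(1)                          # (B, N) boolean mask
    return [t[m] for t, m in zip(tokens, keep_mask)]                      # variable length per instance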

Abstract

Large Vision Language Models (LVLMs) have achieved significant success across multi-modal tasks. However, the computational cost of processing long visual token sequences can be prohibitively expensive on resource-limited devices. Previous methods have identified redundancy in visual tokens within the Large Language Model (LLM) decoder layers and have mitigated it by pruning tokens with a pre-defined or fixed ratio, thereby reducing computational overhead. Nonetheless, we observe that the impact of the pruning ratio varies across LLM layers and instances (image-prompt pairs). Therefore, it is essential to develop a layer-wise and instance-wise visual token pruning strategy to balance computational cost and model performance effectively. We propose ATP-LLaVA, a novel approach that adaptively determines instance-specific token pruning ratios for each LLM layer. Specifically, we introduce an Adaptive Token Pruning (ATP) module, which computes the importance score and pruning threshold adaptively for each input instance. The ATP module can be seamlessly integrated between any two LLM layers with negligible computational overhead. Additionally, we develop a Spatial Augmented Pruning (SAP) strategy that prunes visual tokens from both token redundancy and spatial modeling perspectives. Our approach reduces the average token count by 75% while maintaining performance, with only a minimal 1.9% degradation across seven widely used benchmarks.
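
As a rough illustration of the ATP idea described above, the following PyTorch sketch scores visual tokens and predicts a per-instance pruning threshold from a pooled summary of the instance. The scorer, the threshold head, and the threshold scaling are assumptions made for this example; the actual ATP module (and the SAP strategy) in the paper may be designed differently.

import torch
import torch.nn as nn

class ATPSketch(nn.Module):
    # A minimal, illustrative ATP-style module: a linear head scores each visual
    # token, and a small MLP predicts an instance-specific pruning threshold.
    def __init__(self, hidden_size):
        super().__init__()
        self.scorer = nn.Linear(hidden_size, 1)               # token importance head (assumed)
        self.threshold_head = nn.Sequential(                   # instance-level threshold predictor (assumed)
            nn.Linear(hidden_size, hidden_size // 4),
            nn.GELU(),
            nn.Linear(hidden_size // 4, 1),
        )

    def forward(self, visual_states):
        # visual_states: (B, N_vis, D) hidden states of the visual tokens at this layer.
        scores = self.scorer(visual_states).squeeze(-1).softmax(dim=-1)   # (B, N_vis)
        summary = visual_states.mean(dim=1)                               # (B, D) instance-level context
        # Keep the threshold on the scale of the softmax scores (~1/N_vis).
        threshold = torch.sigmoid(self.threshold_head(summary)) / visual_states.shape[1]
        keep_mask = scores >= threshold                                   # (B, N_vis), instance- and layer-specific
        return keep_mask, scores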

Adaptive Token Pruning


Illustration of the Adaptive Token Pruning (ATP) module. The ATP module can be flexibly inserted between any two LLaMA decoder layers. It adaptively predicts pruning thresholds for the current layer and instance. Redundant or text-irrelevant visual tokens are pruned at this stage and are then ignored by the remaining tokens in subsequent LLaMA decoder layers.
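
A hedged sketch of how such a module could be wired between decoder layers is shown below. Here decoder_layers, atp_modules, vis_slice, and the boolean attention mask are placeholders introduced for the example, and each layer is assumed to be a callable returning updated hidden states. Pruned visual tokens are masked out so that later layers no longer attend to them; physically dropping them instead would additionally save compute.

import torch

def forward_with_pruning(hidden, decoder_layers, atp_modules, vis_slice, attn_mask):
    # hidden: (B, N, D) full token sequence; vis_slice: slice covering the visual tokens;
    # attn_mask: (B, N) boolean mask, True = token remains visible to the sequence.
    for i, layer in enumerate(decoder_layers):
        hidden = layer(hidden, attn_mask)                     # assumed layer signature for this sketch
        atp = atp_modules.get(i)                              # ATP inserted only after selected layers
        if atp is not None:
            keep_mask, _ = atp(hidden[:, vis_slice])          # (B, N_vis)
            attn_mask = attn_mask.clone()
            attn_mask[:, vis_slice] = attn_mask[:, vis_slice] & keep_mask   # pruned tokens are ignored from now on
    return hidden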

Token Pruning Results


Visualization
