Xubing Ye (叶栩冰)

I am a third-year master's student at Shenzhen International Graduate School, Tsinghua University (M.Eng.@THU’2026), supervised by Prof. Yansong Tang. I obtained my bachelor's degree from the School of Software Engineering at Tongji University in 2023. I have collaborated with Dr. Yukang Gan, Dr. Yixiao Ge, Dr. Ying Shan, and Dr. Zhao Yang.

My current research interests lie in agentic LLMs and MLLMs.

Email: yxb_tongji@163.com  /  Github  /  Scholar  /  X

News

  • 2025-02: A paper on KV cache compression and video understanding with LLMs was accepted to CVPR 2025.
  • 2025-02: A paper on vision token pruning for MLLMs was accepted to CVPR 2025.
  • 2024-12: Started an internship at ByteDance.
  • 2024-09: A paper on referring image and video segmentation was accepted to TPAMI, 2024.
  • 2024-02: Started an internship at Tencent ARC Lab.
Recent Publications

    * Indicates Equal Contribution

    VoCo-LLaMA: Towards Vision Compression with Large Language Models
    Xubing Ye, Yukang Gan, Xiaoke Huang, Yixiao Ge, Yansong Tang
    IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025
    [arXiv] [PDF] [Project Page] [Code] [AK] [Chinese Explainer]

    Proposed VoCo-LLaMA, an attention-distilled video token compression method that lets video LLMs train and run inference on million-token (1+ hour) videos within a 4k-context LLM.

    ATP-LLaVA: Adaptive Token Pruning for Large Vision Language Models
    Xubing Ye, Yukang Gan, Yixiao Ge, Xiao-ping Zhang, Yansong Tang
    IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025
    [arXiv] [PDF] [Project Page]

    Proposed ATP-LLaVA, an efficient MLLM that performs adaptive instance-wise and decoder-layer-wise token pruning with almost no performance degradation.

    Language-Aware Vision Transformer for Referring Segmentation
    Xubing Ye*, Zhao Yang*, Jiaqi Wang*, Yansong Tang, Kai Chen, Hengshuang Zhao, Philip H.S. Torr
    IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI, IF=20.8), 2024
    IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022
    [IEEE] [PDF] [Code] [Conference Version]

    Proposed LAVT, a Transformer-based universal framework for referring image segmentation (RIS) and referring video object segmentation (RVOS) that performs language-aware visual encoding instead of cross-modal fusion after feature extraction.

Selected Honors and Awards

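  • Nanhu Elite Scholarship of Tsinghua University, 2025. (Tsinghua University Comprehensive Excellence Scholarship, university-level first class)
  • Zhaoyi Scholarship of Tsinghua University, 2024. (Tsinghua University Comprehensive Excellence Scholarship, university-level first class)
  • First Prize Scholarship of Tongji University, 2023. (Tongji University Comprehensive Excellence Scholarship, university-level first class)
  • Second Prize Scholarship of Tongji University, 2021, 2022. (Tongji University Comprehensive Excellence Scholarship, university-level second class)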
Industrial Experience

    ByteDance Seed Application, Beijing, China. December 2024 - April 2025.
  • Project: AI Search with MLLMs.
  • Worked with Dr. Baihan Shu.

    Tencent ARC Lab (PCG), Shenzhen, China. February 2024 - December 2024.
  • Project: Token Pruning & Compression for MLLMs, Video MLLMs.
  • Worked with Dr. Yukang Gan, Dr. Yixiao Ge, Dr. Ying Shan.

Academic Services

  • Reviewer: CVPR 2025 (conference); JVCIR 2024, 2025 (journal)

© Xubing Ye | Last updated: Mar. 17, 2024 | Website Template