FastVLM: Efficient Vision Encoding for Vision Language Models
This is the official repository of FastVLM: Efficient Vision Encoding for Vision Language Models. (CVPR 2025)
Highlights
We introduce FastViTHD, a novel hybrid vision encoder designed to output fewer tokens and significantly reduce encoding time for high-resolution images.
Our smallest variant outperforms LLaVA-OneVision-0.5B with an 85x faster Time-to-First-Token (TTFT) and a 3.4x smaller vision encoder.
Our larger variants, using the Qwen2-7B LLM, outperform recent works like Cambrian-1-8B while using a single image encoder, with a 7.9x faster TTFT.
A demo iOS app showcases the performance of our model on a mobile device.
Getting Started
We use the LLaVA codebase to train FastVLM variants. To train or finetune your own variants, please follow the instructions provided in the LLaVA codebase. Below, we provide instructions for running inference with our models.
Setup
```bash
conda create -n fastvlm python=3.10
conda activate fastvlm
pip install -e .
```
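Once the environment is set up, a quick single-image inference call typically looks like the sketch below. This is a minimal example assuming a LLaVA-style `predict.py` entry point with `--model-path`, `--image-file`, and `--prompt` flags; the script name, flags, and paths here are placeholders, so check the repository for the exact invocation and substitute a real checkpoint directory and image.

```bash
# Minimal single-image inference sketch, assuming a LLaVA-style predict.py
# entry point. The checkpoint directory and image path are hypothetical
# placeholders; replace them with your downloaded model and local file.
python predict.py \
    --model-path ./checkpoints/fastvlm-0.5b \
    --image-file ./demo.png \
    --prompt "Describe the image in detail."
```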