FastVLM: Efficient Vision Encoding for Vision Language Models
This is the official repository of FastVLM: Efficient Vision Encoding for Vision Language Models. (CVPR 2025)
Highlights
We introduce FastViTHD, a novel hybrid vision encoder designed to output fewer tokens and significantly reduce encoding time for high-resolution images.
Our smallest variant outperforms LLaVA-OneVision-0.5B with an 85x faster Time-to-First-Token (TTFT) and a 3.4x smaller vision encoder.
Our larger variants, using the Qwen2-7B LLM, outperform recent works like Cambrian-1-8B while using a single image encoder, with a 7.9x faster TTFT.
A demo iOS app showcases the performance of our model on a mobile device.
Getting Started
We use the LLaVA codebase to train FastVLM variants. To train or finetune your own variants, please follow the instructions provided in the LLaVA codebase. Below, we provide instructions for running inference with our models.
Setup
```bash
conda create -n fastvlm python=3.10
conda activate fastvlm
pip install -e .
```
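Once the environment is set up, a quick single-image inference call typically looks like the sketch below. This is a minimal example assuming a LLaVA-style `predict.py` entry point with `--model-path`, `--image-file`, and `--prompt` flags; the script name, flags, and paths here are placeholders, so check the repository for the exact invocation and substitute a real checkpoint directory and image.

```bash
# Minimal single-image inference sketch, assuming a LLaVA-style predict.py
# entry point. The checkpoint directory and image path are hypothetical
# placeholders; replace them with your downloaded model and local file.
python predict.py \
    --model-path ./checkpoints/fastvlm-0.5b \
    --image-file ./demo.png \
    --prompt "Describe the image in detail."
```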