
FastVLM: Efficient Vision Encoding for Vision Language Models


Vision Language Models (VLMs) enable visual understanding alongside textual inputs. They are typically built by passing visual tokens from a pretrained vision encoder to a pretrained Large Language Model (LLM) through a projection layer. By leveraging the rich visual representations of the vision encoder and the world knowledge and reasoning capabilities of the LLM, VLMs can be useful for a wide range of applications, including accessibility assistants, UI navigation, robotics, and gaming.
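To make this common design concrete, here is a minimal PyTorch sketch of the vision-encoder-to-LLM pipeline described above. The module names, dimensions, and the dummy encoder/LLM stand-ins are illustrative assumptions, not FastVLM's actual implementation.

```python
import torch
import torch.nn as nn

class SimpleVLM(nn.Module):
    """Minimal sketch of a typical VLM: vision encoder -> projection layer -> LLM."""

    def __init__(self, vision_encoder: nn.Module, llm: nn.Module,
                 vision_dim: int = 1024, llm_dim: int = 2048):
        super().__init__()
        self.vision_encoder = vision_encoder      # pretrained, outputs visual tokens
        self.projector = nn.Sequential(           # maps visual tokens into LLM embedding space
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )
        self.llm = llm                            # pretrained decoder-only LLM (stand-in here)

    def forward(self, image: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
        visual_tokens = self.vision_encoder(image)        # [batch, num_visual_tokens, vision_dim]
        visual_embeds = self.projector(visual_tokens)     # [batch, num_visual_tokens, llm_dim]
        # Prepend projected visual tokens to the text embeddings and run the LLM over both.
        return self.llm(torch.cat([visual_embeds, text_embeds], dim=1))

class DummyEncoder(nn.Module):
    """Stand-in encoder: patchifies a 224x224 image into 16 'visual tokens' of dim 1024."""
    def __init__(self):
        super().__init__()
        self.patchify = nn.Conv2d(3, 1024, kernel_size=56, stride=56)
    def forward(self, x):
        return self.patchify(x).flatten(2).transpose(1, 2)

# Toy usage with stand-in modules (a real system would use pretrained weights).
vlm = SimpleVLM(DummyEncoder(), llm=nn.Identity())
out = vlm(torch.randn(1, 3, 224, 224), torch.randn(1, 8, 2048))
print(out.shape)  # torch.Size([1, 24, 2048]): 16 visual tokens + 8 text tokens
```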

VLM accuracy generally improves with higher input image resolution, creating a tradeoff between accuracy and efficiency. For many production use-cases, VLMs need to be both accurate and efficient to meet the low-latency demands of real-time applications and run on-device for privacy-preserving AI experiences.

In a paper accepted to CVPR 2025, Apple ML researchers recently shared a new technique to address this challenge: FastVLM, a new type of VLM that significantly improves the accuracy-latency trade-off with a simple design. Leveraging a hybrid-architecture vision encoder designed for high-resolution images, FastVLM delivers accurate, fast, and efficient visual query processing, making it suitable for powering real-time applications on-device. The inference code, model checkpoints, and an iOS/macOS demo app based on MLX are available here.

Image Resolution and the Accuracy-Latency Tradeoff

Generally, VLM accuracy improves with higher image resolution, especially for tasks that require detailed understanding, such as document analysis, UI recognition, or answering natural language queries about images. For example, in Figure 1 below, we ask our VLM about the street sign visible in the image. On the left, the model receives a low-resolution image and cannot respond correctly. On the right, it receives a high-resolution image and correctly identifies the traffic sign as a “Do Not Enter” sign.

Figure 1: Comparison of VLM performance with a low-resolution (left) and high-resolution (right) input image.

High resolution significantly increases time-to-first-token in VLMs. While using high-resolution images improves accuracy, it also reduces efficiency in two ways: 1) higher resolution images take longer for the vision encoder to process, and 2) the encoder creates more visual tokens, which increases the pre-filling time for the LLM. Both factors increase the time-to-first-token (TTFT), which is the sum of vision encoding time and LLM pre-filling time. As shown in Figure 2 below, both vision encoding and LLM pre-filling times grow as image resolution increases, and at high resolutions, vision encoder latency becomes the dominant bottleneck. To address this, our research introduces FastVLM, a new vision language model that significantly improves efficiency without sacrificing accuracy.
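The sketch below is a toy model of this decomposition: TTFT is the sum of vision encoding time and LLM pre-filling time, and the number of visual tokens grows quadratically with resolution. All per-token constants are made-up placeholders, not measured FastVLM numbers; the quadratic self-attention term is an assumption about ViT-style encoders, included only to show why vision encoding can come to dominate at high resolution.

```python
def estimate_ttft_ms(resolution: int,
                     patch_size: int = 14,
                     num_text_tokens: int = 64,
                     enc_per_token_us: float = 2.0,      # placeholder linear encoder cost
                     enc_per_pair_us: float = 0.01,      # placeholder self-attention cost
                     prefill_per_token_us: float = 10.0  # placeholder LLM pre-fill cost
                     ) -> tuple[float, float]:
    """Toy model: TTFT = vision encoding time + LLM pre-filling time.

    The visual token count grows quadratically with resolution, and the encoder's
    self-attention term grows quadratically with the token count, so in this toy
    the vision component eventually dominates as resolution increases.
    """
    n = (resolution // patch_size) ** 2                            # number of visual tokens
    vision_ms = (n * enc_per_token_us + n * n * enc_per_pair_us) / 1000.0
    prefill_ms = (n + num_text_tokens) * prefill_per_token_us / 1000.0
    return vision_ms, prefill_ms

for res in (224, 448, 896):
    v, p = estimate_ttft_ms(res)
    print(f"{res}px  vision={v:.1f} ms  prefill={p:.1f} ms  ttft={v + p:.1f} ms")
```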

Figure 2: Vision latency dominates at high resolution. Latency breakdown for a 1.5B VLM (fp16): FastVLM’s time-to-first-token at different image resolutions. The vision encoder is FastViT-HD, and the LLM has 1.5B parameters.

Hybrid Vision Encoders Deliver the Best Accuracy-Latency Tradeoff

To identify which architecture delivers the best accuracy-latency tradeoff, we systematically compared existing pre-trained vision encoders in an experiment where everything else (training data, recipe, LLM, etc.) was kept the same and only the vision encoder was changed. In Figure 3 below, the x-axis shows TTFT, and the y-axis shows the average accuracy across different VLM tasks. We show two points for popular transformer-based encoders, ViT-L/14 and SigLIP-SO400M, pre-trained on image-text data at their native resolutions. We also show curves for ConvNeXT (a fully convolutional encoder) and FastViT (a hybrid encoder combining convolutional and transformer blocks) at various resolutions. FastViT, which builds on two of our previous works (FastViT, ICCV 2023; MobileCLIP, CVPR 2024), achieves the best accuracy-latency trade-off among the encoders compared, while being about 8 times smaller and 20 times faster than ViT-L/14.
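A controlled comparison of this kind might be organized as in the sketch below, where only the vision encoder is swapped while the training data, recipe, and LLM stay fixed. The encoder list mirrors the paragraph above, but the `train_vlm`, `measure_ttft_ms`, and `average_accuracy` helpers are hypothetical stubs, not the authors' actual pipeline.

```python
# Hypothetical harness for the controlled comparison: only the vision encoder
# varies; the training data, recipe, and LLM are held fixed throughout.

def train_vlm(vision_encoder: str) -> str:
    """Train a VLM with the given encoder; LLM, data, and recipe stay fixed (stub)."""
    return f"vlm-with-{vision_encoder}"

def measure_ttft_ms(vlm: str) -> float:
    """Time-to-first-token: vision encoding + LLM pre-filling (stubbed out)."""
    return 0.0

def average_accuracy(vlm: str) -> float:
    """Average accuracy over a suite of VLM benchmarks (stubbed out)."""
    return 0.0

results = {}
for encoder in ("ViT-L/14", "SigLIP-SO400M", "ConvNeXT", "FastViT"):
    vlm = train_vlm(encoder)
    results[encoder] = {
        "ttft_ms": measure_ttft_ms(vlm),        # x-axis of Figure 3
        "avg_accuracy": average_accuracy(vlm),  # y-axis of Figure 3
    }
print(results)
```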
