apple/ml-fastvlm

★ 7,337 · Python · Audience · researcher · Complexity · 4/5 · Setup · hard

Mindmap

mindmap
  root((FastVLM))
    What it does
      Fast image understanding
      Fewer image tokens
      Faster AI responses
    Model sizes
      0.5B smallest
      1.5B medium
      7B largest
    Platforms
      Apple Silicon
      iPhone and iPad
      Standard GPU
    Audience
      ML researchers
      iOS developers
      CV engineers


Things people build with this

USE CASE 1

Run a fast vision-language model on an iPhone or iPad using the included demo iOS app

USE CASE 2

Fine-tune a FastVLM variant on your own image dataset using the LLaVA-based training pipeline

USE CASE 3

Export a FastVLM model for Apple Silicon to power an on-device image question-answering feature in a macOS app

USE CASE 4

Benchmark FastVLM against other vision-language models to validate speed gains for a production use case
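
For the benchmarking use case, one quantity worth measuring is time to first token: the delay between submitting a request and receiving the first piece of generated text. The sketch below is a minimal, model-agnostic timing helper. `fake_model_stream` is a hypothetical stand-in for a real streaming generation call and is not part of the FastVLM repository; swap in whatever streaming interface your model exposes.

```python
import time
from typing import Callable, Iterable, Tuple


def time_to_first_token(stream_fn: Callable[[], Iterable[str]]) -> Tuple[float, str]:
    """Return (seconds elapsed, first token) for a token-streaming callable."""
    start = time.perf_counter()
    for token in stream_fn():
        # Stop as soon as the first token arrives.
        return time.perf_counter() - start, token
    raise RuntimeError("stream produced no tokens")


def fake_model_stream():
    """Hypothetical stand-in for a real model: the sleep imitates the
    image-encoding and prefill work done before the first token."""
    time.sleep(0.05)
    yield "A"
    yield " cat"


if __name__ == "__main__":
    elapsed, first = time_to_first_token(fake_model_stream)
    print(f"first token {first!r} after {elapsed * 1000:.1f} ms")
```

Running the same helper against two models on the same images and prompts, and averaging over several runs, gives a like-for-like comparison of startup latency.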

Tech stack

Python · PyTorch · Swift · iOS · Apple Silicon · LLaVA

Getting it running

Difficulty · hard · Time to first run · 1h+

Requires downloading pretrained weights and setting up a Python environment. On-device iOS/macOS deployment additionally requires the Apple Silicon export path.

In plain English

FastVLM is a research project from Apple that makes vision-language models faster at understanding images. It targets the bottleneck that occurs when a model must encode a high-resolution photo before it can generate any text. The project introduces a new vision encoder called FastViTHD that produces fewer image tokens, so the model can start generating a response much sooner.

The practical result is a large reduction in time to first token. The smallest variant of FastVLM reaches its first token up to 85 times faster than a comparable model (LLaVA-OneVision-0.5B), and the 7-billion-parameter version is nearly 8 times faster than a competing approach (Cambrian-1-8B), while matching or exceeding their accuracy scores. These results were published at CVPR 2025, a major computer vision conference.

The code ships in three sizes: 0.5B, 1.5B, and 7B, where the number refers to the parameter count of the language part of the model. Pretrained weights are available for download, and running inference on a standard computer requires only a few setup commands and a Python script. The repository also includes a dedicated export path for running the models on Apple Silicon devices, including iPhones, iPads, and Macs, along with a demo iOS app that shows the model working on a mobile device.

This is primarily a research release aimed at developers and researchers who want to experiment with fast vision-language models, fine-tune their own variants, or understand the technical approach described in the paper. The training pipeline builds on the existing LLaVA codebase, so anyone already familiar with that project will find the workflow recognizable.
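
The link between fewer image tokens and a faster first response can be illustrated with rough arithmetic. The sketch below assumes a grid-based encoder that emits one token per cell after downsampling a square image. The strides are hypothetical values chosen for illustration and are not taken from the FastVLM code.

```python
def visual_token_count(image_size: int, downsample: int) -> int:
    """Tokens emitted by a grid-based encoder for a square image.

    image_size: side length in pixels.
    downsample: total spatial reduction factor of the encoder
                (hypothetical values below, for illustration only).
    """
    if image_size % downsample != 0:
        raise ValueError("image_size must be divisible by downsample")
    side = image_size // downsample
    return side * side


if __name__ == "__main__":
    # Illustrative comparison at one resolution; strides are assumptions.
    for label, stride in [("fine grid, stride 16", 16), ("coarse grid, stride 64", 64)]:
        print(f"{label}: {visual_token_count(1024, stride)} tokens")
```

Because the language model must process every image token before it emits its first output token, a coarser grid shortens that wait, provided the encoder still preserves enough detail for accuracy.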

Copy-paste prompts

Prompt 1

Set up FastVLM 0.5B on my Mac and run inference on a photo, show me the exact setup commands and Python script

Prompt 2

Export the FastVLM 1.5B model for Apple Silicon so I can run it in an iOS app, walk me through the export steps

Prompt 3

Fine-tune FastVLM 7B on my own image-caption dataset using the LLaVA training pipeline, what files do I need to change?

Prompt 4

Compare FastVLM to LLaVA-1.5: what architectural change in FastViTHD makes the 7B version respond 8x faster?

Verify against the repo before relying on details.