Run a vision AI model on your Mac to ask questions about images without sending data to the cloud.
Analyze video files locally by asking an AI to describe or answer questions about their content.
Extract structured JSON data from images using a local vision-language model on Apple Silicon.
Build a local image-understanding feature into a Python app without needing a dedicated GPU or an API key.
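The structured-JSON use case usually comes down to two steps: ask the model to answer in JSON, then parse its free-text reply. Below is a minimal, library-independent sketch of the parsing step. The helper name `extract_json` is hypothetical and not part of mlx-vlm. It assumes you already have the model's reply as a plain string; check the repo for what its generation function actually returns in your version.

```python
import json
import re
from typing import Any, Optional

FENCE = "`" * 3  # Markdown code-fence marker that models often wrap JSON in.
# Matches a fenced block (optionally tagged "json") and captures its body.
_FENCED_BLOCK = re.compile(
    re.escape(FENCE) + r"(?:json)?\s*(.*?)" + re.escape(FENCE), re.DOTALL
)


def extract_json(reply: str) -> Optional[Any]:
    """Return the first JSON object or array in a model reply, or None.

    Hypothetical helper, not part of mlx-vlm: it prefers a fenced block if the
    model emitted one, then scans for the first position that decodes as JSON.
    """
    fenced = _FENCED_BLOCK.search(reply)
    candidates = ([fenced.group(1)] if fenced else []) + [reply]
    decoder = json.JSONDecoder()
    for text in candidates:
        for i, ch in enumerate(text):
            if ch in "{[":
                try:
                    value, _end = decoder.raw_decode(text, i)
                    return value
                except json.JSONDecodeError:
                    continue  # Not valid JSON starting here; keep scanning.
    return None


reply = 'Here you go: {"vendor": "Cafe", "total": 12.5} Let me know!'
print(extract_json(reply))  # {'vendor': 'Cafe', 'total': 12.5}
```

Scanning with `raw_decode` tolerates the chatty preamble and trailing text that small local models often add around their JSON.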
Requires a Mac with Apple Silicon (M1/M2/M3/M4). Install via pip: `pip install mlx-vlm`. Models are downloaded on first use and can be several gigabytes, so make sure you have enough free disk space and RAM.
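Before downloading a multi-gigabyte model, a quick preflight check can confirm the basics. This is a minimal sketch using only the Python standard library. The function name `check_mlx_host` and the 10 GB threshold are illustrative assumptions, not values taken from mlx-vlm, and the sketch does not check RAM.

```python
import os
import platform
import shutil
from typing import List


def check_mlx_host(min_free_gb: float = 10.0, path: str = "~") -> List[str]:
    """Hypothetical preflight check; returns a list of problems (empty = OK).

    The 10 GB default is an illustrative assumption; actual download sizes
    depend on the model and its quantization. RAM is not checked here.
    """
    problems = []
    # Native Python on Apple Silicon reports system "Darwin" and machine "arm64".
    if platform.system() != "Darwin" or platform.machine() != "arm64":
        problems.append(
            f"not an Apple Silicon Mac ({platform.system()} {platform.machine()})"
        )
    # Free space on the volume holding `path` (defaults to the home directory).
    target = os.path.expanduser(path)
    free_gb = shutil.disk_usage(target).free / 1e9
    if free_gb < min_free_gb:
        problems.append(f"only {free_gb:.1f} GB free at {target}, want {min_free_gb} GB")
    return problems


if __name__ == "__main__":
    issues = check_mlx_host()
    print("OK" if not issues else "\n".join(issues))
```

A Python interpreter running under Rosetta reports `x86_64` rather than `arm64`, so this check also flags an x86 Python installed on an Apple Silicon Mac.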
MLX-VLM is a Python package for running vision-language models directly on a Mac. It is built on Apple's MLX framework, which is designed for Apple Silicon chips. Vision-language models accept both images and text, so they can describe a photo, read text in an image, or answer questions about what an image shows. This package brings those capabilities to your local machine without a cloud service.
Beyond images, the package also supports audio and video inputs through what it calls "Omni Models," so you can give a model an image and an audio clip together and get a single combined response. The README lists a wide range of supported models, including Qwen, Gemma, LLaVA, Florence2, and Molmo, plus many OCR-focused models designed to extract text from images.
You can use the package in several ways. A command-line tool generates responses directly from a terminal when you point it at a model name and an image or prompt. A Python API is available for scripts and applications. A Gradio-based chat interface can be launched in a browser for a more conversational experience. A FastAPI-based server mode exposes the models over a local HTTP endpoint and supports processing multiple requests at once and caching repeated inputs to avoid redundant computation.
The package also supports speculative decoding: a smaller companion model drafts candidate tokens, and the main model verifies them, which can make generation faster. Fine-tuning support is also mentioned, meaning you can further train a model on your own data using your Mac's hardware.
The project installs as a standard Python package with a single pip command, and the README's table of contents links to documentation for each supported model.
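The draft-then-verify idea behind speculative decoding can be illustrated with a toy, library-independent sketch. It uses the simple greedy acceptance rule, which is one common variant; mlx-vlm's actual implementation and acceptance rule may differ. The "models" here are hypothetical integer functions standing in for real networks, and the names `speculative_generate` and `greedy_generate` are illustrative only.

```python
from typing import Callable, List

# A toy greedy "model": maps the token context to the next token id.
# Real systems score all drafted positions in one batched forward pass of the
# large model; this sketch calls it once per position purely for readability.
NextToken = Callable[[List[int]], int]


def speculative_generate(target: NextToken, draft: NextToken,
                         prompt: List[int], max_new: int, k: int = 4) -> List[int]:
    out = list(prompt)
    while len(out) - len(prompt) < max_new:
        # 1. The cheap draft model proposes k tokens autoregressively.
        proposal, ctx = [], list(out)
        for _ in range(k):
            tok = draft(ctx)
            proposal.append(tok)
            ctx.append(tok)
        # 2. The target model checks each drafted position in order.
        accepted: List[int] = []
        for tok in proposal:
            expected = target(out + accepted)
            if expected != tok:
                # First disagreement: keep the target's own token and stop.
                accepted.append(expected)
                break
            accepted.append(tok)
        else:
            # Every draft token matched, so the target adds one bonus token.
            accepted.append(target(out + accepted))
        out.extend(accepted)
    return out[: len(prompt) + max_new]


def greedy_generate(model: NextToken, prompt: List[int], max_new: int) -> List[int]:
    # Baseline: plain greedy decoding with the target model alone.
    out = list(prompt)
    for _ in range(max_new):
        out.append(model(out))
    return out
```

Because every kept token is one the target model would have chosen itself, greedy speculative decoding produces exactly the same text as running the target alone. The speed-up comes from verifying several cheap draft tokens per expensive target pass, so it grows with how often the draft model agrees with the target.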
Author: blaizzy (other repositories by this author are listed on their gitmyhub profile).
Verify against the repo before relying on details.