jundot/omlx

★ 13,959PythonAudience · developerComplexity · 3/5LicenseSetup · moderate

Mindmap

mindmap
  root((repo))
    What it does
      Local LLM server
      KV cache persistence
      OpenAI-compatible API
    Model Types
      Text models
      Vision models
      Embedding models
      Rerankers
    Features
      Browser dashboard
      Admin controls
      Homebrew service
      Mac menu bar
    Requirements
      Apple Silicon
      macOS 15 or later
      Python 3.10 plus

mindmap root((repo)) What it does Local LLM server KV cache persistence OpenAI-compatible API Model Types Text models Vision models Embedding models Rerankers Features Browser dashboard Admin controls Homebrew service Mac menu bar Requirements Apple Silicon macOS 15 or later Python 3.10 plus

Click or tap to explore — scroll the page freely

Things people build with this

USE CASE 1

Run an AI coding assistant locally on a Mac without sending your code to cloud servers.

USE CASE 2

Host an OpenAI-compatible local API endpoint so tools like Cursor or Claude Code use your own models.

USE CASE 3

Serve vision, OCR, and embedding models simultaneously from a single local server managed via a browser dashboard.

USE CASE 4

Manage and monitor multiple AI models from an admin dashboard without restarting the server.

Tech stack

PythonMLX

Getting it running

Difficulty · moderate Time to first run · 30min

Requires Apple Silicon Mac running macOS 15 or later, not compatible with Intel Macs.

Apache 2.0, use freely for any purpose including commercial projects, modifications must retain the copyright notice.

In plain English

oMLX is a program for running large language models directly on Apple Silicon Macs, the M1 through M4 chips. A large language model is the kind of AI that powers chat assistants and coding tools. Instead of sending your text to a company's servers, oMLX runs the model on your own machine and answers requests locally. You manage it from the macOS menu bar or from a command line tool. The main problem it tries to solve is reusing past work. When an AI model reads a long conversation, it builds up internal data called a KV cache. oMLX keeps this cache in two places: a hot tier in fast memory and a cold tier on the SSD. When memory fills up, older pieces move to disk and get restored later instead of being recalculated, even after the server restarts. The goal is to make local models practical for real coding sessions with tools such as Claude Code. It can serve several kinds of models at once: text models, vision models that read images, OCR models that read text from pictures, embedding models, and rerankers. Any app that expects an OpenAI-style connection can point at the local address and start using it. There is also a built-in chat page in the browser for talking to a loaded model directly. oMLX includes an admin dashboard in the browser for watching activity in real time, loading or unloading models, running benchmarks, and changing per-model settings. It can pin frequently used models in memory, drop the least recently used ones when space runs low, and set an idle timeout per model. Settings can be changed without restarting the server. Installation options include a downloadable Mac app with one-click updates, a Homebrew package that can run as a background service, or building from source. It requires macOS 15 or later, Python 3.10 or later, and an Apple Silicon chip. The project is shared under the Apache 2.0 license.

Copy-paste prompts

Prompt 1

I have oMLX running on my M2 Mac. Help me configure Claude Code or Cursor to point at my local oMLX server instead of the OpenAI API.

Prompt 2

How do I load a vision model in oMLX and send it an image file using the local OpenAI-compatible API endpoint?

Prompt 3

Set up oMLX with Homebrew as a background service that starts automatically on boot and pre-loads my two most-used models.

Prompt 4

I'm running long coding sessions with oMLX. Explain how the hot and cold KV cache tiers work and how to tune them on a 32GB M3 Pro.

Prompt 5

Walk me through using the oMLX browser dashboard to benchmark a loaded model, adjust its idle timeout, and pin it so it is never unloaded.

Open on GitHub → Explain another repo

← jundot on gitmyhub — every repo by this author, as a profile.

Verify against the repo before relying on details.