Run 70B+ parameter models on a cluster of personal devices without cloud costs or data privacy concerns.
Combine multiple Apple Silicon Macs via Thunderbolt for high-speed local AI inference.
Use existing OpenAI or Ollama client tools with your own hardware-based model cluster.
Requires network configuration, RDMA setup across multiple devices, and coordinating distributed tensor parallelism infrastructure.
exo is a tool that lets you run large AI language models locally by pooling the computing resources of multiple devices you already own, turning a cluster of laptops, desktops, or servers into a single cooperative AI inference machine. The problem it solves is that the most capable AI models (like 70-billion or 600-billion parameter models) are too large to fit in the memory of a single consumer device. Cloud services can run them, but that costs money and sends your data to a remote server. exo lets you combine the memory and processing power of several personal devices to run these large models entirely on your own hardware. The software automatically discovers other devices on your network that are also running exo, no manual configuration is needed. When you send a prompt, exo splits (or "shards") the model across all available devices using a technique called tensor parallelism, where different parts of the model's computation happen simultaneously on different machines. The devices communicate the intermediate results of their computations with each other over the network. For Apple Silicon Macs connected via Thunderbolt cables, exo supports RDMA (Remote Direct Memory Access), a high-speed direct-memory transfer technique that dramatically reduces communication latency between devices. The API it exposes is compatible with OpenAI, Claude, and Ollama client formats, meaning you can use existing tools and applications with it without modification. You would use exo if you have multiple Apple Silicon Macs, Linux machines with GPUs, or any combination thereof and want to run powerful AI models locally for privacy, cost, or experimentation reasons. It is written in Python, uses Apple's MLX framework as the inference backend on Apple Silicon, and is installed by cloning the repository and running with the uv Python project manager.
Generated 2026-05-18 · Model: sonnet-4-6 · Verify against the repo before relying on details.