xorbitsai/inference

★ 9,300PythonAudience · developerComplexity · 3/5Setup · moderate

Mindmap

mindmap
  root((Xinference))
    What it does
      Run AI locally
      OpenAI-compatible API
      Multi-GPU support
    Model types
      Text generation
      Image generation
      Speech recognition
      Text embedding
    Integrations
      LangChain
      LlamaIndex
      Dify
      RAGFlow
    Tech
      Python
      vLLM
      llama.cpp

mindmap root((Xinference)) What it does Run AI locally OpenAI-compatible API Multi-GPU support Model types Text generation Image generation Speech recognition Text embedding Integrations LangChain LlamaIndex Dify RAGFlow Tech Python vLLM llama.cpp

Click or tap to explore — scroll the page freely

Things people build with this

USE CASE 1

Replace OpenAI API calls in your app with a locally hosted model by changing the base URL in your code.

USE CASE 2

Run image generation, speech recognition, or text embedding models on your own server without paying per call.

USE CASE 3

Spread a large AI model across multiple GPUs when it is too big to fit on a single device.

USE CASE 4

Point LangChain or LlamaIndex at your local Xinference server instead of a paid cloud API to build chatbots or document Q&A apps.

Tech stack

PythonvLLMllama.cpppip

Getting it running

Difficulty · moderate Time to first run · 30min

GPU strongly recommended for useful performance, very large models require multiple GPUs with significant combined VRAM.

Free and open-source for the core version, an enterprise edition with additional support is available commercially.

In plain English

Xinference (short for Xorbits Inference) is a Python library that makes it straightforward to run open-source AI models on your own hardware, whether that is a laptop, a company server, or a cloud machine. The goal is to give you a single API that works the same way regardless of which model you pick or where you run it. If you are already using OpenAI's API in your application, switching to a locally hosted model can be done by changing one line of code, because Xinference exposes an OpenAI-compatible interface. The library supports text generation models (the large language models you chat with), speech recognition, image generation, text embedding, and multimodal models that can process both text and images. It can run models using several different back-end engines, including vLLM and llama.cpp, and it can spread a single large model across multiple GPUs or machines when the model is too big for one device. Installation is through pip, the standard Python package manager. Once installed, you can launch a server with a single command and then load models through a web interface, a command-line tool, or the API. The web UI shows which models are running, lets you start or stop them, and provides a built-in chat window for testing. Automatic batching groups multiple incoming requests together so the hardware is used more efficiently under load. Xinference integrates with several popular AI application frameworks, including LangChain, LlamaIndex, Dify, and RAGFlow. These are tools that developers use to build chatbots, document question-answering systems, and other AI-powered products. Because Xinference handles the model serving layer, those frameworks can point to it instead of a paid cloud API. An enterprise edition with additional support is available from the company behind the project. The open-source version is free and covers the core serving functionality described above.

Copy-paste prompts

Prompt 1

I'm using OpenAI in my Python app and want to switch to a self-hosted model with Xinference. Show me exactly which line in my code changes and how to start the Xinference server.

Prompt 2

Help me set up Xinference to run a LLaMA-3 model spread across two GPUs on my Linux server.

Prompt 3

I want to use Xinference as the model backend for my LangChain chatbot. How do I configure LangChain to point to my Xinference server instead of OpenAI?

Prompt 4

Walk me through installing Xinference via pip, launching the server, and using the web UI to load a text generation model and test it with a chat prompt.

Open on GitHub → Explain another repo

← xorbitsai on gitmyhub — every repo by this author, as a profile.

Verify against the repo before relying on details.