explaingit

bentoml/openllm

12,320PythonAudience · developerComplexity · 4/5LicenseSetup · hard

TLDR

OpenLLM lets you run open-source AI language models on your own hardware and serve them through an API that matches OpenAI's format, so any existing OpenAI-compatible app works with your self-hosted model without code changes.

Mindmap

mindmap
  root((openllm))
    What it does
      Self-host LLMs
      OpenAI API compat
      Browser chat UI
    Tech Stack
      Python
      Hugging Face
      Docker
      BentoML
    Use Cases
      Local AI server
      Replace OpenAI
      Production deploy
    Audience
      Developers
      AI engineers
Click or tap to explore — scroll the page freely

Code map

Detail Auto

An interactive map of this repo's files and how they connect — its source is parsed live in your browser. Click Visualize to build it.

filefunction / class

Things people build with this

USE CASE 1

Run a local AI chat assistant using Llama or Gemma without paying OpenAI API fees.

USE CASE 2

Replace OpenAI API calls in an existing app by pointing it to your own self-hosted model server.

USE CASE 3

Package a model as a Docker container and deploy it to a cloud environment for production use.

USE CASE 4

Serve large open-source models on a GPU server and expose them to multiple team members via API.

Tech stack

PythonHugging FaceDockerKubernetesBentoMLBentoCloud

Getting it running

Difficulty · hard Time to first run · 30min

Requires a compatible GPU, smaller models need ~12GB VRAM, and the largest models require multiple high-end data center GPUs.

Use freely for any purpose, including commercial use, as long as you include the original copyright notice.

In plain English

OpenLLM is a Python tool that lets you run open-source language models on your own hardware and expose them through an API that matches the same format as OpenAI's API. The key idea is that software already built to work with OpenAI can point to your self-hosted model instead, with no code changes beyond swapping the server address. To get started, you install the package with pip and run a single command such as "openllm serve llama3.2:1b". The tool fetches the model weights from Hugging Face, starts a local server at http://localhost:3000, and provides OpenAI-compatible endpoints right away. A built-in chat interface is available at the /chat URL, so you can test the model in a browser without writing any code. OpenLLM does not store model weights itself. It downloads them from Hugging Face the first time you run a model. Some models require you to request access on Hugging Face and set an authentication token before they will download. The supported model list runs from small models that fit on a consumer GPU with around 12GB of memory, such as Gemma at 2 billion parameters, up to very large ones requiring multiple high-end data center GPUs, such as DeepSeek R1 at 671 billion parameters. A companion GitHub repository tracks the full catalog, and you can add your own custom model repositories to extend what the tool can serve. For production use, OpenLLM integrates with BentoCloud and supports packaging models as Docker containers or Kubernetes deployments. The project is built by BentoML and released under the Apache 2.0 license.

Copy-paste prompts

Prompt 1
Help me set up OpenLLM to serve the Llama 3.2 1B model locally and connect to it using the OpenAI Python client.
Prompt 2
I want to switch my app from OpenAI to a self-hosted model via OpenLLM, show me the minimal code change needed.
Prompt 3
How do I package a model served by OpenLLM into a Docker container for cloud deployment on BentoCloud?
Prompt 4
Write a Python script that sends chat completion requests to a locally running OpenLLM server using the openai library.
Prompt 5
Guide me through setting up OpenLLM with a Hugging Face gated model that requires an access token.
Open on GitHub → Explain another repo

← bentoml on gitmyhub — every repo by this author, as a profile.

Verify against the repo before relying on details.