bentoml/openllm

★ 12,320PythonAudience · developerComplexity · 4/5LicenseSetup · hard

Mindmap

mindmap
  root((openllm))
    What it does
      Self-host LLMs
      OpenAI API compat
      Browser chat UI
    Tech Stack
      Python
      Hugging Face
      Docker
      BentoML
    Use Cases
      Local AI server
      Replace OpenAI
      Production deploy
    Audience
      Developers
      AI engineers

mindmap root((openllm)) What it does Self-host LLMs OpenAI API compat Browser chat UI Tech Stack Python Hugging Face Docker BentoML Use Cases Local AI server Replace OpenAI Production deploy Audience Developers AI engineers

Click or tap to explore — scroll the page freely

Things people build with this

USE CASE 1

Run a local AI chat assistant using Llama or Gemma without paying OpenAI API fees.

USE CASE 2

Replace OpenAI API calls in an existing app by pointing it to your own self-hosted model server.

USE CASE 3

Package a model as a Docker container and deploy it to a cloud environment for production use.

USE CASE 4

Serve large open-source models on a GPU server and expose them to multiple team members via API.

Tech stack

PythonHugging FaceDockerKubernetesBentoMLBentoCloud

Getting it running

Difficulty · hard Time to first run · 30min

Requires a compatible GPU, smaller models need ~12GB VRAM, and the largest models require multiple high-end data center GPUs.

Use freely for any purpose, including commercial use, as long as you include the original copyright notice.

In plain English

OpenLLM is a Python tool that lets you run open-source language models on your own hardware and expose them through an API that matches the same format as OpenAI's API. The key idea is that software already built to work with OpenAI can point to your self-hosted model instead, with no code changes beyond swapping the server address. To get started, you install the package with pip and run a single command such as "openllm serve llama3.2:1b". The tool fetches the model weights from Hugging Face, starts a local server at http://localhost:3000, and provides OpenAI-compatible endpoints right away. A built-in chat interface is available at the /chat URL, so you can test the model in a browser without writing any code. OpenLLM does not store model weights itself. It downloads them from Hugging Face the first time you run a model. Some models require you to request access on Hugging Face and set an authentication token before they will download. The supported model list runs from small models that fit on a consumer GPU with around 12GB of memory, such as Gemma at 2 billion parameters, up to very large ones requiring multiple high-end data center GPUs, such as DeepSeek R1 at 671 billion parameters. A companion GitHub repository tracks the full catalog, and you can add your own custom model repositories to extend what the tool can serve. For production use, OpenLLM integrates with BentoCloud and supports packaging models as Docker containers or Kubernetes deployments. The project is built by BentoML and released under the Apache 2.0 license.

Copy-paste prompts

Prompt 1

Help me set up OpenLLM to serve the Llama 3.2 1B model locally and connect to it using the OpenAI Python client.

Prompt 2

I want to switch my app from OpenAI to a self-hosted model via OpenLLM, show me the minimal code change needed.

Prompt 3

How do I package a model served by OpenLLM into a Docker container for cloud deployment on BentoCloud?

Prompt 4

Write a Python script that sends chat completion requests to a locally running OpenLLM server using the openai library.

Prompt 5

Guide me through setting up OpenLLM with a Hugging Face gated model that requires an access token.

Open on GitHub → Explain another repo

← bentoml on gitmyhub — every repo by this author, as a profile.

Verify against the repo before relying on details.