abetlen/llama-cpp-python

★ 10,293PythonAudience · developerComplexity · 3/5Setup · moderate

Mindmap

mindmap
  root((llama-cpp-python))
    What it does
      Local LLM inference
      OpenAI-compatible API
      Offline text generation
    Hardware support
      NVIDIA CUDA
      Apple Silicon Metal
      AMD ROCm
      CPU fallback
    Interfaces
      Low-level C bindings
      High-level Python API
      Local HTTP server
    Integrations
      LangChain
      LlamaIndex
      Any OpenAI client

mindmap root((llama-cpp-python)) What it does Local LLM inference OpenAI-compatible API Offline text generation Hardware support NVIDIA CUDA Apple Silicon Metal AMD ROCm CPU fallback Interfaces Low-level C bindings High-level Python API Local HTTP server Integrations LangChain LlamaIndex Any OpenAI client

Click or tap to explore — scroll the page freely

Things people build with this

USE CASE 1

Run an AI language model locally on a laptop and query it with the same Python code you use for OpenAI.

USE CASE 2

Start a local HTTP server that any OpenAI-compatible tool or app can use instead of OpenAI, at zero API cost.

USE CASE 3

Integrate a local LLM into a LangChain or LlamaIndex app without sending data to an external service.

USE CASE 4

Run a multimodal vision model locally to analyze images without cloud costs or privacy concerns.

Tech stack

PythonC++CUDAMetalROCmVulkanOpenBLAS

Getting it running

Difficulty · moderate Time to first run · 30min

Requires a C compiler to build from source during install, downloading a GGUF model file separately adds steps, but pre-built wheels are available.

In plain English

llama-cpp-python is a Python package that lets you run large language models locally on your own machine, without sending data to any external service. It works by wrapping llama.cpp, a popular C++ library for running AI language models efficiently. Once installed, you can load a model file and generate text completions entirely offline. The package offers two levels of access. The low-level interface gives direct access to the underlying C library functions for developers who need fine-grained control. The high-level interface provides a Python API designed to look like OpenAI's API, so existing code written against OpenAI can often be pointed at a local model with minimal changes. It also integrates with LangChain and LlamaIndex for use in AI application frameworks. A built-in web server mode starts a local HTTP server that speaks the OpenAI REST API format, which means any tool or application that can talk to OpenAI can be redirected to a locally-running model instead. The server supports function calling, multimodal (vision) inputs, and running multiple models behind one endpoint. Installation is a single pip command, but it compiles llama.cpp from source during install, so a C compiler is required (gcc or clang on Linux/Mac, Visual Studio or MinGW on Windows). Pre-built wheels are available for CPU, NVIDIA CUDA, and Apple Silicon Metal to skip compilation. Other supported hardware acceleration backends include AMD ROCm, Vulkan, Intel SYCL, and OpenBLAS. The package supports Python 3.8 and above and runs on Linux, macOS, and Windows. Documentation is hosted at readthedocs.io.

Copy-paste prompts

Prompt 1

Install llama-cpp-python with CUDA support and start the built-in OpenAI-compatible local server. Show me how to point my existing OpenAI Python code at it instead of the real OpenAI API.

Prompt 2

Using llama-cpp-python's high-level Python API, load a local GGUF model file and generate a streaming text completion. Show me the minimal working example.

Prompt 3

I want to use a local GGUF model as the LLM in a LangChain chain. Show me how to connect llama-cpp-python as the LangChain LLM provider.

Prompt 4

Build a Python script that uses llama-cpp-python to call a locally-running model with function calling, passing a custom tool definition and handling the response.

Open on GitHub → Explain another repo

← abetlen on gitmyhub — every repo by this author, as a profile.

Verify against the repo before relying on details.