ztxz16/fastllm

★ 4,558C++Audience · developerComplexity · 4/5Setup · moderate

Mindmap

mindmap
  root((fastllm))
    What it does
      Run LLMs locally
      No PyTorch needed
      GPU plus CPU split
    Supported models
      DeepSeek
      Llama
      Qwen
      Phi
      GLM
    Key features
      Quantization INT4 INT8
      OpenAI-compatible API
      Web chat interface
    Tech stack
      C++
      CUDA
      Python pip

mindmap root((fastllm)) What it does Run LLMs locally No PyTorch needed GPU plus CPU split Supported models DeepSeek Llama Qwen Phi GLM Key features Quantization INT4 INT8 OpenAI-compatible API Web chat interface Tech stack C++ CUDA Python pip

Click or tap to explore — scroll the page freely

Things people build with this

USE CASE 1

Run a large model like DeepSeek or Llama on a single consumer GPU with as little as 10 GB of memory by splitting computation between the GPU and CPU.

USE CASE 2

Start an OpenAI-compatible API server locally so existing tools or apps built for OpenAI can talk to a self-hosted model instead.

USE CASE 3

Load a quantized INT4 model from Hugging Face and start an interactive chat session entirely offline, with no cloud API costs.

Tech stack

C++PythonCUDANVIDIA GPUAMD GPU

Getting it running

Difficulty · moderate Time to first run · 30min

Install via pip with the platform-specific package (NVIDIA, AMD, or Windows), requires a compatible GPU with at least 10 GB of memory for large models.

License details are not described in the explanation, check the repository directly.

In plain English

fastllm is a C++ library for running large AI language models locally on your own hardware, without needing to install PyTorch or most other common AI dependencies. It implements its own low-level math operations in C++, which gives it speed advantages and broad compatibility with older or less common hardware. The main appeal of this project is that it can run very large models that would normally require enormous amounts of GPU memory. For example, it claims to run the DeepSeek R1 671B model (one of the largest publicly available AI models) on a single GPU card with at least 10 gigabytes of memory, by splitting the computation between the GPU and the CPU. It does this through a technique called quantization, which reduces the precision of the model's numbers (for example from full floating point to INT4 format) so the model fits into less memory. The README lists concrete throughput figures for various configurations. Installation is through pip, the standard Python package manager, despite the library being written in C++. There are separate packages for NVIDIA GPUs, AMD GPUs, and Windows. Once installed, you interact with it through a command-line tool called ftllm. A single command can start a chat session, launch a web-based chat interface, or start an API server that speaks the same protocol as OpenAI's API, meaning existing tools built for OpenAI can talk to it instead. The library supports a range of model families including Qwen, Llama, Phi, DeepSeek, and GLM, in various precision formats (FP16, BF16, FP8, INT8, INT4, AWQ). Models can be downloaded from Hugging Face or loaded from local files. It also supports running parts of large mixture-of-experts models (a specific architecture used in models like DeepSeek) on CPU while keeping other parts on the GPU, allowing very large models to run on consumer hardware. The README is written primarily in Chinese, and community discussion happens via Chinese messaging apps. Documentation for specific deployments (DeepSeek, Qwen3) lives in separate linked files.

Copy-paste prompts

Prompt 1

I have an NVIDIA GPU with 12 GB of VRAM and want to run a DeepSeek model locally using fastllm. Walk me through installing the right pip package and starting a chat session.

Prompt 2

Show me how to start fastllm's OpenAI-compatible API server so I can point my existing Python app, which uses the openai library, at a local model instead of OpenAI.

Prompt 3

What is the difference between INT4, INT8, and FP16 quantization in fastllm, and which should I choose if I want the best balance of speed and output quality?

Prompt 4

I want to run a Qwen3 model with fastllm on a machine with both a GPU and a lot of RAM. How do I split the model layers between GPU and CPU to maximize throughput?

Open on GitHub → Explain another repo

← ztxz16 on gitmyhub — every repo by this author, as a profile.

Verify against the repo before relying on details.