Run a large model like DeepSeek or Llama on a single consumer GPU with as little as 10 GB of memory by splitting computation between the GPU and CPU.
Start an OpenAI-compatible API server locally so existing tools or apps built for OpenAI can talk to a self-hosted model instead.
Load a quantized INT4 model from Hugging Face and start an interactive chat session entirely offline, with no cloud API costs.
Install via pip with the platform-specific package (NVIDIA, AMD, or Windows), requires a compatible GPU with at least 10 GB of memory for large models.
fastllm is a C++ library for running large AI language models locally on your own hardware, without needing to install PyTorch or most other common AI dependencies. It implements its own low-level math operations in C++, which gives it speed advantages and broad compatibility with older or less common hardware. The main appeal of this project is that it can run very large models that would normally require enormous amounts of GPU memory. For example, it claims to run the DeepSeek R1 671B model (one of the largest publicly available AI models) on a single GPU card with at least 10 gigabytes of memory, by splitting the computation between the GPU and the CPU. It does this through a technique called quantization, which reduces the precision of the model's numbers (for example from full floating point to INT4 format) so the model fits into less memory. The README lists concrete throughput figures for various configurations. Installation is through pip, the standard Python package manager, despite the library being written in C++. There are separate packages for NVIDIA GPUs, AMD GPUs, and Windows. Once installed, you interact with it through a command-line tool called ftllm. A single command can start a chat session, launch a web-based chat interface, or start an API server that speaks the same protocol as OpenAI's API, meaning existing tools built for OpenAI can talk to it instead. The library supports a range of model families including Qwen, Llama, Phi, DeepSeek, and GLM, in various precision formats (FP16, BF16, FP8, INT8, INT4, AWQ). Models can be downloaded from Hugging Face or loaded from local files. It also supports running parts of large mixture-of-experts models (a specific architecture used in models like DeepSeek) on CPU while keeping other parts on the GPU, allowing very large models to run on consumer hardware. The README is written primarily in Chinese, and community discussion happens via Chinese messaging apps. Documentation for specific deployments (DeepSeek, Qwen3) lives in separate linked files.
← ztxz16 on gitmyhub — every repo by this author, as a profile.
Verify against the repo before relying on details.