Run DeepSeek-V3 or Qwen3 on a consumer PC with a single GPU by offloading model experts to CPU.
Fine-tune a large Mixture-of-Experts model on limited GPU memory using the LLaMA-Factory integration.
Serve a quantized INT4 language model locally using CPU AMX or AVX512 kernels for fast inference.
Use KTransformers as a backend for SGLang to serve large models at lower hardware cost.
Requires CUDA, a CPU with Intel AMX or AVX512 support, and a build from source in the kt-kernel directory.
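Before building, it can help to confirm which of these instruction sets the host CPU actually reports. The snippet below is a minimal sketch for Linux only, not part of KTransformers: it reads the `flags` line of `/proc/cpuinfo` and looks for the kernel's flag names (`amx_tile`, `amx_int8`, `avx512f`, `avx2`). The helper names `parse_cpu_flags` and `detect_features` are hypothetical, and the kt-kernel build may detect CPU features in a different way.

```python
from pathlib import Path

# Flag names as the Linux kernel prints them on the "flags" line of /proc/cpuinfo.
# Grouping them under these three labels is an assumption made for this sketch.
FEATURE_FLAGS = {
    "AMX": {"amx_tile", "amx_int8"},
    "AVX512": {"avx512f"},
    "AVX2": {"avx2"},
}


def parse_cpu_flags(cpuinfo_text: str) -> set[str]:
    """Return the set of CPU flags from the first 'flags' line, or an empty set."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def detect_features(cpuinfo_text: str) -> dict[str, bool]:
    """Report, per feature group, whether every required flag is present."""
    flags = parse_cpu_flags(cpuinfo_text)
    return {name: required <= flags for name, required in FEATURE_FLAGS.items()}


if __name__ == "__main__":
    cpuinfo = Path("/proc/cpuinfo")
    if cpuinfo.exists():
        print(detect_features(cpuinfo.read_text()))
    else:
        print("No /proc/cpuinfo found (not Linux?); cannot check CPU flags.")
```

A result with `AMX` and `AVX512` both false suggests the CPU does not meet the stated requirement, although the repository remains the authority on what is supported.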
KTransformers is a research project for running and fine-tuning large language models efficiently by splitting the work between CPU and GPU. Modern LLMs, especially Mixture-of-Experts (MoE) models, are often too large to fit comfortably in GPU memory, so KTransformers offloads part of the computation to the CPU while keeping the hot path on the GPU. This makes it possible to run very large models on smaller, cheaper hardware.

The project exposes two user-facing capabilities from its kt-kernel source tree: inference and supervised fine-tuning (SFT). On the inference side, kt-kernel provides CPU-optimized kernels that use Intel AMX and AVX512/AVX2 instructions for INT4 and INT8 quantized models, NUMA-aware memory management for MoE inference, and CPU-side quantized weights paired with GPU-side GPTQ support. It also exposes a Python API for integration with SGLang and other serving frameworks. On the fine-tuning side, KTransformers integrates with LLaMA-Factory so that users can fine-tune very large MoE models, such as DeepSeek-V3 and R1, on limited GPU memory.

KTransformers is a fit if you want to serve or fine-tune cutting-edge open models on consumer or modest data-center hardware without paying for a fleet of high-end GPUs. The README lists supported models including DeepSeek-V3 and R1, Kimi-K2, GLM-5, Qwen3, MiniMax, and others. The codebase is Python with native kernels underneath, and installation is via pip from the kt-kernel directory.
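The CPU/GPU split described above can be illustrated with back-of-the-envelope memory arithmetic. The sketch below is not KTransformers code and does not use its API. It assumes that the expert weights live in CPU memory at a low bit width while the dense, always-used weights (attention, embeddings, router) stay on the GPU, and that each token activates only `top_k` experts per layer. The `ToyMoEConfig` class, `plan_placement`, and every number in the example are hypothetical and do not describe any real model.

```python
from dataclasses import dataclass


@dataclass
class ToyMoEConfig:
    # Illustrative values only; these fields do not describe a real model.
    num_layers: int
    experts_per_layer: int
    top_k: int               # experts activated per token in each layer
    params_per_expert: int
    shared_params: int       # dense weights assumed to stay on the GPU


def bytes_for(params: int, bits_per_weight: int) -> int:
    """Storage in bytes for `params` weights at the given bit width."""
    return params * bits_per_weight // 8


def plan_placement(cfg: ToyMoEConfig, expert_bits: int, shared_bits: int) -> dict[str, int]:
    """Estimate memory per device when all experts are offloaded to the CPU."""
    all_expert_params = cfg.num_layers * cfg.experts_per_layer * cfg.params_per_expert
    active_expert_params = cfg.num_layers * cfg.top_k * cfg.params_per_expert
    return {
        # Every expert is stored in CPU RAM ...
        "cpu_bytes": bytes_for(all_expert_params, expert_bits),
        # ... but only the routed experts are read for any single token.
        "cpu_bytes_touched_per_token": bytes_for(active_expert_params, expert_bits),
        # Dense weights sit on the GPU at higher precision.
        "gpu_bytes": bytes_for(cfg.shared_params, shared_bits),
    }


toy = ToyMoEConfig(
    num_layers=4,
    experts_per_layer=8,
    top_k=2,
    params_per_expert=1_000_000,
    shared_params=2_000_000,
)
plan = plan_placement(toy, expert_bits=4, shared_bits=16)
print(plan)
```

In this toy configuration the experts account for most of the parameters yet only a quarter of them are read per token, which is the general reason an expert-on-CPU layout can keep GPU memory requirements small.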
Verify against the repo before relying on details.