kvcache-ai/ktransformers

★ 17,156 | Python | Audience · researcher | Complexity · 5/5 | Setup · hard

Mindmap

mindmap
  root((KTransformers))
    What it does
      CPU-GPU offloading
      Large model inference
      Low-cost LLM serving
    Kernel features
      INT4 INT8 quantization
      AMX AVX512 support
      NUMA-aware memory
    Capabilities
      Inference serving
      SFT fine-tuning
    Supported models
      DeepSeek V3 R1
      Qwen3
      Kimi-K2
      GLM-5


Things people build with this

USE CASE 1

Run DeepSeek-V3 or Qwen3 on a consumer PC with a single GPU by offloading model experts to CPU.

USE CASE 2

Fine-tune a large Mixture-of-Experts model on limited GPU memory using the LLaMA-Factory integration.

USE CASE 3

Serve a quantized INT4 language model locally using CPU AMX or AVX512 kernels for fast inference.

USE CASE 4

Use KTransformers as a backend for SGLang to serve large models at lower hardware cost.
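
The offloading idea behind use case 1 comes down to memory arithmetic. The sketch below is illustrative only: the `MoEShape` class, the `plan_placement` function, and the parameter counts are hypothetical and are not part of the KTransformers API. It assumes the shared layers (attention, embeddings, router) stay on the GPU at 16 bits per weight and the routed experts live in CPU RAM at 4 bits per weight, and it ignores the KV cache and activations.

```python
from dataclasses import dataclass

GIB = 1024 ** 3


@dataclass
class MoEShape:
    """Hypothetical model shape; the numbers used below are made up."""
    shared_params: float  # attention, embeddings, router: used for every token
    expert_params: float  # total parameters across all routed experts


def weight_bytes(params: float, bits_per_weight: int) -> float:
    """Storage needed for `params` weights at the given precision."""
    return params * bits_per_weight / 8


def plan_placement(shape: MoEShape, gpu_budget_gib: float,
                   gpu_bits: int = 16, cpu_bits: int = 4) -> dict:
    """Estimate memory if shared layers sit on GPU and experts sit in CPU RAM."""
    gpu_gib = weight_bytes(shape.shared_params, gpu_bits) / GIB
    cpu_gib = weight_bytes(shape.expert_params, cpu_bits) / GIB
    return {
        "gpu_gib": round(gpu_gib, 1),
        "cpu_gib": round(cpu_gib, 1),
        "fits_gpu": gpu_gib <= gpu_budget_gib,
    }


if __name__ == "__main__":
    # Illustrative shape: 8B shared parameters, 200B expert parameters.
    toy = MoEShape(shared_params=8e9, expert_params=200e9)
    print(plan_placement(toy, gpu_budget_gib=24))
```

The point of the split is that only a few experts fire per token, so the bulk of the weights can sit in cheaper CPU memory while the layers used on every token stay on the GPU.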

Tech stack

Python, CUDA, C++, pip

Getting it running

Difficulty · hard | Time to first run · 1 day+

Requires CUDA, a CPU with Intel AMX or AVX512 support, and building from source in the kt-kernel directory.
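
Before attempting the build, it can help to check which of these instruction sets the CPU reports. The following is a minimal sketch for Linux on x86, not something shipped with KTransformers: `detect_features` and `read_cpuinfo` are hypothetical helper names, and the flag names are the ones the Linux kernel prints in `/proc/cpuinfo`.

```python
from pathlib import Path

# CPU feature flags as reported by the Linux kernel in /proc/cpuinfo.
FEATURE_FLAGS = {
    "AMX": {"amx_tile", "amx_int8"},
    "AVX512": {"avx512f"},
    "AVX2": {"avx2"},
}


def detect_features(cpuinfo_text: str) -> dict:
    """Return which instruction-set families appear in the cpuinfo flags."""
    flags = set()
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags") and ":" in line:
            flags.update(line.split(":", 1)[1].split())
    # A family counts as present only if all of its required flags are listed.
    return {name: required <= flags for name, required in FEATURE_FLAGS.items()}


def read_cpuinfo(path: str = "/proc/cpuinfo") -> str:
    """Read cpuinfo if it exists; return an empty string on other platforms."""
    p = Path(path)
    return p.read_text() if p.exists() else ""


if __name__ == "__main__":
    print(detect_features(read_cpuinfo()))
```

On a machine without `/proc/cpuinfo` (macOS, Windows) every entry comes back `False`, which means "unknown" rather than "unsupported".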

License not mentioned in the explanation.

In plain English

KTransformers is a research project for running and fine-tuning large language models efficiently by splitting the work between CPU and GPU. Modern LLMs, especially Mixture-of-Experts (MoE) models, are too big to fit comfortably in GPU memory, so KTransformers offloads parts of the computation to the CPU while keeping the hot path on the GPU. This lets people run very large models on smaller, cheaper hardware.

The project exposes two user-facing capabilities from its kt-kernel source tree: inference and SFT (supervised fine-tuning). On the inference side, kt-kernel provides CPU-optimized kernels that use Intel AMX and AVX512/AVX2 instructions for INT4 and INT8 quantized models, NUMA-aware memory management for MoE inference, and CPU-side quantized weights paired with GPU-side GPTQ support. It exposes a Python API for integration with SGLang and other serving frameworks. On the fine-tuning side, KTransformers integrates with LLaMA-Factory so users can fine-tune very large MoE models, such as DeepSeek-V3 and R1, on limited GPU memory.

You would use KTransformers if you want to serve or fine-tune cutting-edge open models on consumer or modest data-center hardware without paying for a fleet of high-end GPUs. The README lists supported models including DeepSeek-V3 and R1, Kimi-K2, GLM-5, Qwen3, MiniMax, and others. The codebase is Python with native kernels underneath, and installation is via pip from the kt-kernel directory. The full README is longer than what was provided.
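
To make "INT4 quantized" concrete, here is a minimal pure-Python sketch of symmetric per-group 4-bit quantization. It is a teaching example under stated assumptions, not kt-kernel's actual weight format: the function names and the group size are hypothetical, and real kernels pack two 4-bit values per byte and lay weights out for AMX or AVX512 instructions.

```python
def quantize_int4(weights, group_size=4):
    """Symmetric per-group quantization to integers in [-8, 7].

    Each group of `group_size` weights shares one float scale.
    Returns a list of (scale, quantized_values) pairs.
    """
    groups = []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        max_abs = max(abs(w) for w in group)
        # Map the largest magnitude to 7; use a dummy scale for all-zero groups.
        scale = max_abs / 7 if max_abs > 0 else 1.0
        quantized = [max(-8, min(7, round(w / scale))) for w in group]
        groups.append((scale, quantized))
    return groups


def dequantize_int4(groups):
    """Reconstruct approximate float weights from (scale, values) pairs."""
    return [scale * q for scale, values in groups for q in values]


if __name__ == "__main__":
    original = [0.7, -0.35, 0.1, 0.0, 2.0, -1.0, 0.5, 0.25]
    packed = quantize_int4(original)
    restored = dequantize_int4(packed)
    for w, r in zip(original, restored):
        print(f"{w:+.3f} -> {r:+.3f}")
```

Each weight is recovered to within half a scale step of its group, which is why smaller groups (more scales) trade extra storage for lower error.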

Copy-paste prompts

Prompt 1

Using KTransformers, show me how to load DeepSeek-V3 with CPU offloading so it runs on a machine with only 24 GB of GPU RAM.

Prompt 2

How do I install ktransformers from the kt-kernel directory and run a basic inference test with a quantized model?

Prompt 3

Give me a Python snippet that uses KTransformers to run INT4 inference on a Qwen3 model using AVX512 CPU kernels.

Prompt 4

How do I fine-tune DeepSeek-R1 on a limited-GPU machine using KTransformers together with LLaMA-Factory?

Prompt 5

What NUMA memory settings should I configure in KTransformers for Mixture-of-Experts inference on a dual-socket server?


Verify against the repo before relying on details.