antirez/ds4

★ 8,397 | C | Audience · developer | Complexity · 5/5 | Setup · hard

Mindmap

mindmap
  root((DwarfStar 4))
    What it does
      Local AI inference
      DeepSeek V4 Flash only
      Very long context
    Hardware Targets
      macOS Metal 96GB+
      NVIDIA CUDA
      AMD ROCm branch
    Key Design
      Disk-based KV cache
      1M token context
      2-bit quantization
    Use Cases
      Private local AI
      Long conversation memory
      Offline reasoning


Things people build with this

USE CASE 1

Run the DeepSeek V4 Flash language model locally on a Mac with 96GB RAM for private, offline AI without cloud API costs.

USE CASE 2

Use a local AI model with very long conversation memory by offloading the key-value cache to disk instead of RAM.

USE CASE 3

Run an AI reasoning model on a local NVIDIA GPU without depending on cloud APIs or rate limits.

Tech stack

C · Metal · CUDA · ROCm · GGUF

Getting it running

Difficulty · hard | Time to first run · 1h+

macOS requires at least 96GB of RAM; the CUDA build targets the NVIDIA DGX Spark; CPU-only mode is for diagnostics, not regular use.
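One way to see why large-memory machines and low-bit quantization go together is to estimate the size of the weights alone. The sketch below is back-of-the-envelope arithmetic, not a measurement of ds4: the function name is hypothetical, the parameter count is a made-up round number (not the real size of DeepSeek V4 Flash), and the estimate ignores the KV cache, activations, and quantization metadata.

```python
def weight_footprint_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights in GiB.

    Ignores per-block scales, file metadata, activations, and the KV cache.
    """
    total_bytes = n_params * bits_per_weight / 8
    return total_bytes / 2**30


# HYPOTHETICAL parameter count, chosen only to make the arithmetic easy to
# follow. It is NOT the real size of DeepSeek V4 Flash.
EXAMPLE_PARAMS = 100e9

for bits in (16, 8, 4, 2):
    size = weight_footprint_gib(EXAMPLE_PARAMS, bits)
    print(f"{bits:>2} bits/weight -> {size:6.1f} GiB")
```

The footprint scales linearly with bits per weight, so going from 16-bit to 2-bit weights shrinks this estimate by a factor of eight. Check the repo for the actual file sizes.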

In plain English

DwarfStar 4 is a self-contained inference engine built specifically for running DeepSeek V4 Flash, a large AI language model, on local hardware. Unlike general-purpose runtimes that handle many different models, this project does exactly one thing: run this one model as correctly and efficiently as possible. It was created by antirez, who also created Redis.

The project targets high-end personal machines. On macOS, it uses Metal, the graphics API built into Apple hardware, and requires a MacBook or Mac Studio with at least 96GB of RAM. On Linux, it supports NVIDIA CUDA, with particular attention to the DGX Spark. AMD ROCm support exists on a separate branch maintained by community contributors. A CPU-only build is available for diagnostics but not for regular use.

One key design idea builds on the model's compressed key-value (KV) cache, the data an AI model keeps to track earlier conversation context. The compressed cache is small enough that the project stores it on disk rather than in RAM, allowing very long context windows (up to 1 million tokens) on machines that would otherwise not have enough memory.

The README lists several reasons the authors consider this model worth a dedicated engine: it is fast because it has fewer active parameters, its thinking process scales in length with problem difficulty (short for simple questions, longer for complex ones), and it works well with aggressive 2-bit quantization without major quality loss.

The project is in alpha state and was built with significant assistance from GPT-5.5. It only works with GGUF files produced specifically for this engine. The README continues beyond the portion summarized here.
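The disk-backed KV cache idea described above can be illustrated with a memory-mapped file. This is a minimal sketch of the general technique, not ds4's implementation: the class name `DiskKVCache`, the record layout (one fixed-size float32 vector per token position), and the file name are all assumptions made for illustration. The real engine is written in C and its on-disk format is not covered here.

```python
import mmap
import os
import struct
import tempfile


class DiskKVCache:
    """Fixed-size per-token records stored in a memory-mapped file.

    Hypothetical layout for illustration: each token slot holds `dim`
    float32 values. The OS pages data in and out on demand, so the file
    can be larger than the RAM dedicated to it.
    """

    def __init__(self, path: str, max_tokens: int, dim: int):
        self.max_tokens = max_tokens
        self.record_size = dim * 4  # float32 = 4 bytes
        self._fmt = f"<{dim}f"      # little-endian float32 vector
        size = max_tokens * self.record_size
        self._file = open(path, "w+b")
        self._file.truncate(size)   # preallocate; new bytes read as zero
        self._map = mmap.mmap(self._file.fileno(), size)

    def _offset(self, pos: int) -> int:
        if not 0 <= pos < self.max_tokens:
            raise IndexError("token position out of range")
        return pos * self.record_size

    def put(self, pos: int, values) -> None:
        """Write the vector for token position `pos`."""
        struct.pack_into(self._fmt, self._map, self._offset(pos), *values)

    def get(self, pos: int) -> list:
        """Read back the vector for token position `pos`."""
        return list(struct.unpack_from(self._fmt, self._map, self._offset(pos)))

    def close(self) -> None:
        self._map.flush()
        self._map.close()
        self._file.close()


with tempfile.TemporaryDirectory() as tmp:
    cache = DiskKVCache(os.path.join(tmp, "kv.bin"), max_tokens=1024, dim=4)
    cache.put(7, [0.5, -1.0, 2.0, 0.25])
    print(cache.get(7))  # [0.5, -1.0, 2.0, 0.25]
    cache.close()
```

Because each record has a fixed size, looking up a token position is a single offset computation, and only the pages actually touched need to be resident in RAM. How ds4 lays out and compresses its cache is something to confirm in the repo.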

Copy-paste prompts

Prompt 1

I'm setting up antirez/ds4 on a Mac Studio with 96GB RAM. Walk me through downloading a GGUF file in the correct format and starting the inference server.

Prompt 2

Help me compile ds4 on Linux with CUDA support for an NVIDIA GPU. What build flags do I need, and how do I run the model after compiling?

Prompt 3

I want to use ds4 for long-context tasks. Explain how the disk-based KV cache works and how to configure the maximum context length toward 1 million tokens.

Prompt 4

Write a shell command that starts the ds4 inference server on macOS with Metal and sends a test prompt to verify it is working.

Prompt 5

What tradeoffs should I know before choosing ds4 over a general runtime like llama.cpp for running DeepSeek V4 Flash locally?


Verify against the repo before relying on details.