explaingit

david19p/custom-llm-kernel-2080

1C++Audience · researcherComplexity · 5/5ActiveLicenseSetup · hard

TLDR

Custom C++/CUDA inference stack tuned for the NVIDIA RTX 2080 that runs local LLMs with speculative decoding and exposes an OpenAI-compatible HTTP API.

Mindmap

mindmap
  root((custom-llm-kernel-2080))
    Inputs
      Quantized target model
      DFlash draft model
      PFlash drafter
      Chat requests
    Outputs
      Streaming tokens
      OpenAI compatible API
      Benchmark numbers
    Use Cases
      Local chat server
      Speculative decoding research
      Long prompt compression
    Tech Stack
      C++
      CUDA
      CMake
      llama.cpp tokenizer

Things people build with this

USE CASE 1

Run Qwen3.5-9B locally on a single RTX 2080 and serve it to existing OpenAI chat clients.

USE CASE 2

Benchmark DFlash speculative decoding against a baseline on HumanEval or Math500.

USE CASE 3

Experiment with PFlash long-prompt compression using a 0.6B drafter model.

USE CASE 4

Build a private OpenAI-compatible endpoint for older 8 GB Turing GPUs.

Tech stack

C++CUDACMakellama.cpp

Getting it running

Difficulty · hard Time to first run · 1day+

Needs an RTX 2080 class GPU, CUDA toolchain at arch 75, three separate quantized model files in distinct roles, and a CMake plus MSVC or GCC build.

MIT license. You can use, modify, and redistribute the code commercially with attribution.

In plain English

This repository is an experimental C++ and CUDA project that runs large language models locally on a single NVIDIA RTX 2080 graphics card. The 2080 is a Turing-generation GPU with 8 GB of memory, and the README states that the whole stack is tuned for that hardware. The project bundles two parts: a custom inference kernel that talks directly to the GPU, and a separate HTTP server that exposes an OpenAI-compatible API so that existing chat clients can connect to it without changes. The README lists what works and what does not. The server already accepts the standard /v1/chat/completions request, streams responses back as server-sent events, and uses the tokenizer and chat templates from llama.cpp. A technique called DFlash speculative decoding runs a small draft model alongside the main one and verifies multiple guessed tokens at once through a structure called DDTree. There is also an optional PFlash mode that uses a tiny 0.6 billion parameter drafter model to compress long prompts. Items still in progress include full tool-calling parity, runtime grammar masking, and a cleaner daemon protocol. The system expects three separate model files in three distinct roles: a target model that produces final tokens, a DFlash draft model used only for speculative decoding, and a PFlash drafter used only for prompt compression. The README is emphatic that these are not interchangeable and links to Hugging Face downloads for each one. The recommended target is Qwen3.5-9B in a roughly four-bit quantization. Benchmarks on an RTX 2080 are included for two evaluations, HumanEval and Math500. On the HumanEval coding benchmark, the baseline runs at 46.35 tokens per second while DFlash reaches 145.04 tokens per second, a 3.13 times speedup. On Math500 the speedup is 2.56 times. The README warns that real throughput depends on quantization, prompt shape, draft length, and cache state. Build instructions use git clone with submodules, cmake with CUDA architecture 75, and either MSVC on Windows or GCC/Clang on Linux. A quick-start command shows the long list of flags needed to launch the server, and the licence is MIT.

Copy-paste prompts

Prompt 1
Clone custom-llm-kernel-2080 with submodules and build it with cmake for CUDA arch 75 on Linux with GCC.
Prompt 2
Download the recommended Qwen3.5-9B 4-bit target plus the DFlash draft and PFlash drafter from the Hugging Face links and wire them into the quick-start launch command.
Prompt 3
Point my existing OpenAI chat client at the local /v1/chat/completions endpoint and stream a code completion to confirm SSE works.
Prompt 4
Reproduce the HumanEval benchmark on my RTX 2080 and compare baseline tokens per second against DFlash with DDTree verification.
Prompt 5
Add notes to the README about which roadmap items, like tool-calling parity and grammar masking, are still missing.
Open on GitHub → Explain another repo

Generated 2026-05-22 · Model: sonnet-4-6 · Verify against the repo before relying on details.