Run Qwen3.5-9B locally on a single RTX 2080 and serve it to existing OpenAI chat clients.
Benchmark DFlash speculative decoding against a baseline on HumanEval or Math500.
Experiment with PFlash long-prompt compression using a 0.6B drafter model.
Build a private OpenAI-compatible endpoint for older 8 GB Turing GPUs.
Needs an RTX 2080 class GPU, CUDA toolchain at arch 75, three separate quantized model files in distinct roles, and a CMake plus MSVC or GCC build.
This repository is an experimental C++ and CUDA project that runs large language models locally on a single NVIDIA RTX 2080 graphics card. The 2080 is a Turing-generation GPU with 8 GB of memory, and the README states that the whole stack is tuned for that hardware. The project bundles two parts: a custom inference kernel that talks directly to the GPU, and a separate HTTP server that exposes an OpenAI-compatible API so that existing chat clients can connect to it without changes. The README lists what works and what does not. The server already accepts the standard /v1/chat/completions request, streams responses back as server-sent events, and uses the tokenizer and chat templates from llama.cpp. A technique called DFlash speculative decoding runs a small draft model alongside the main one and verifies multiple guessed tokens at once through a structure called DDTree. There is also an optional PFlash mode that uses a tiny 0.6 billion parameter drafter model to compress long prompts. Items still in progress include full tool-calling parity, runtime grammar masking, and a cleaner daemon protocol. The system expects three separate model files in three distinct roles: a target model that produces final tokens, a DFlash draft model used only for speculative decoding, and a PFlash drafter used only for prompt compression. The README is emphatic that these are not interchangeable and links to Hugging Face downloads for each one. The recommended target is Qwen3.5-9B in a roughly four-bit quantization. Benchmarks on an RTX 2080 are included for two evaluations, HumanEval and Math500. On the HumanEval coding benchmark, the baseline runs at 46.35 tokens per second while DFlash reaches 145.04 tokens per second, a 3.13 times speedup. On Math500 the speedup is 2.56 times. The README warns that real throughput depends on quantization, prompt shape, draft length, and cache state. Build instructions use git clone with submodules, cmake with CUDA architecture 75, and either MSVC on Windows or GCC/Clang on Linux. A quick-start command shows the long list of flags needed to launch the server, and the licence is MIT.
Generated 2026-05-22 · Model: sonnet-4-6 · Verify against the repo before relying on details.