kyr0/ornith-35b-fp8-e4m3-mtp

Analysis updated 2026-05-18

★ 2ShellAudience · researcherComplexity · 5/5Setup · hard

Mindmap

mindmap
  root((Ornith-35B FP8 MTP))
    What it does
      Faster inference
      FP8 quantization
      Speculative decoding
    Components
      Graft script
      vLLM server
      Docker Makefile
    Tech Stack
      NVIDIA GPU
      Docker
      vLLM
      HuggingFace
    Use Cases
      Local LLM server
      API endpoint
      AI research

mindmap root((Ornith-35B FP8 MTP)) What it does Faster inference FP8 quantization Speculative decoding Components Graft script vLLM server Docker Makefile Tech Stack NVIDIA GPU Docker vLLM HuggingFace Use Cases Local LLM server API endpoint AI research

Click or tap to explore — scroll the page freely

What do people build with it?

USE CASE 1

Run a 35-billion-parameter language model locally on a high-end NVIDIA GPU at about 18% faster speed than the original.

USE CASE 2

Merge an MTP speculative decoding companion model into an existing base model using the provided graft shell script.

USE CASE 3

Serve a large language model via an OpenAI-compatible chat completions API on your own hardware using the Docker and vLLM setup.

What is it built with?

ShellDockervLLMNVIDIA CUDAHuggingFace

How does it compare?

	kyr0/ornith-35b-fp8-e4m3-mtp	chrisor-dev/claude-autosync	dangerousyams/muxer
Stars	2	2	2
Language	Shell	Shell	Shell
Setup difficulty	hard	moderate	moderate
Complexity	5/5	3/5	3/5
Audience	researcher	developer	developer

Figures from each repo's GitHub metadata at analysis time.

How do you get it running?

Difficulty · hard Time to first run · 1day+

Requires an NVIDIA GPU with roughly 40 GB or more of VRAM, Docker with NVIDIA Container Toolkit, and a HuggingFace account to download the 35.8 GB model weights.

In plain English

This repository provides a modified version of an AI language model called Ornith-35B, specifically configured for faster text generation. The original Ornith model was created by deepreinforce-ai and contains 35 billion parameters. This version applies two techniques on top: a smaller number format for storing the model's weights (FP8, which uses less memory than the usual 16-bit floats), and a speculative decoding method called Multi-Token Prediction (MTP), which runs a small companion model in parallel to guess upcoming tokens and speed up generation. The practical result is about 18% faster text output compared to running the base model without these additions. In a benchmark on a single high-end NVIDIA GPU, the modified version produced roughly 751 tokens per second versus 635 for the baseline. The MTP companion model adds about 800 MB of extra GPU memory on top of the main model, which sits at around 35.8 GB on disk. The repository contains two things: a graft script that downloads the original model weights and the MTP companion weights and merges them into a single runnable model, and a Makefile with Docker-based commands for running the merged model as a local server using vLLM (an open-source inference framework). Once the model is running, it exposes a chat completions API that other applications can call. Getting this working requires a machine with an NVIDIA GPU that has enough video memory to hold the model (roughly 40 GB or more), Docker with the NVIDIA Container Toolkit installed, and a HuggingFace account to download the weights. The graft and startup steps are handled via make commands in the terminal. This is infrastructure-level work aimed at AI researchers or engineers who want to run a large language model on their own hardware and need extra speed. The repository does not include a license file, so the terms of use are not stated.

Copy-paste prompts

Prompt 1

Walk me through running Ornith-35B-FP8-E4M3-MTP using the Makefile. What GPU memory do I need and what does each make command do?

Prompt 2

Explain what FP8 E4M3 quantization does to a language model's weights and why it speeds up inference without losing too much quality.

Prompt 3

How does Multi-Token Prediction speculative decoding work in vLLM, and what do the draft acceptance rate numbers in the benchmark mean?

Prompt 4

Help me adapt the graft.sh script to merge a different MTP sidecar into a different 35B base model I want to run on my H100 GPU.

Frequently asked questions

What is ornith-35b-fp8-e4m3-mtp?

A quantized, faster-inference version of the Ornith-35B language model using FP8 precision and speculative decoding, with a Docker/vLLM setup for running it locally on an NVIDIA GPU.

What language is ornith-35b-fp8-e4m3-mtp written in?

Mainly Shell. The stack also includes Shell, Docker, vLLM.

How hard is ornith-35b-fp8-e4m3-mtp to set up?

Setup difficulty is rated hard, with roughly 1day+ to a first successful run.

Who is ornith-35b-fp8-e4m3-mtp for?

Mainly researcher.

Open on GitHub → Explain another repo

This repo across BitVibe Labs

Scan in gitsafehub Deploy in gitdeployhub kyr0 on gitmyhub

Verify against the repo before relying on details.