Analysis updated 2026-05-18
Run a 35-billion-parameter language model locally on a high-end NVIDIA GPU at about 18% faster speed than the original.
Merge an MTP speculative decoding companion model into an existing base model using the provided graft shell script.
Serve a large language model via an OpenAI-compatible chat completions API on your own hardware using the Docker and vLLM setup.
| kyr0/ornith-35b-fp8-e4m3-mtp | chrisor-dev/claude-autosync | dangerousyams/muxer | |
|---|---|---|---|
| Stars | 2 | 2 | 2 |
| Language | Shell | Shell | Shell |
| Setup difficulty | hard | moderate | moderate |
| Complexity | 5/5 | 3/5 | 3/5 |
| Audience | researcher | developer | developer |
Figures from each repo's GitHub metadata at analysis time.
Requires an NVIDIA GPU with roughly 40 GB or more of VRAM, Docker with NVIDIA Container Toolkit, and a HuggingFace account to download the 35.8 GB model weights.
This repository provides a modified version of an AI language model called Ornith-35B, specifically configured for faster text generation. The original Ornith model was created by deepreinforce-ai and contains 35 billion parameters. This version applies two techniques on top: a smaller number format for storing the model's weights (FP8, which uses less memory than the usual 16-bit floats), and a speculative decoding method called Multi-Token Prediction (MTP), which runs a small companion model in parallel to guess upcoming tokens and speed up generation. The practical result is about 18% faster text output compared to running the base model without these additions. In a benchmark on a single high-end NVIDIA GPU, the modified version produced roughly 751 tokens per second versus 635 for the baseline. The MTP companion model adds about 800 MB of extra GPU memory on top of the main model, which sits at around 35.8 GB on disk. The repository contains two things: a graft script that downloads the original model weights and the MTP companion weights and merges them into a single runnable model, and a Makefile with Docker-based commands for running the merged model as a local server using vLLM (an open-source inference framework). Once the model is running, it exposes a chat completions API that other applications can call. Getting this working requires a machine with an NVIDIA GPU that has enough video memory to hold the model (roughly 40 GB or more), Docker with the NVIDIA Container Toolkit installed, and a HuggingFace account to download the weights. The graft and startup steps are handled via make commands in the terminal. This is infrastructure-level work aimed at AI researchers or engineers who want to run a large language model on their own hardware and need extra speed. The repository does not include a license file, so the terms of use are not stated.
A quantized, faster-inference version of the Ornith-35B language model using FP8 precision and speculative decoding, with a Docker/vLLM setup for running it locally on an NVIDIA GPU.
Mainly Shell. The stack also includes Shell, Docker, vLLM.
Setup difficulty is rated hard, with roughly 1day+ to a first successful run.
Mainly researcher.
This repo across BitVibe Labs
Verify against the repo before relying on details.