explaingit

onuralpszr/litert-lm-cookbook

13 · Jupyter Notebook · Audience · developer · Complexity · 3/5 · Active · Setup · moderate

TLDR

Twelve Python and Colab examples that run Google Gemma-4 E4B locally with LiteRT-LM, from a single completion call up to a full OpenAI-compatible server.

Mindmap

mindmap
  root((litert-lm-cookbook))
    Inputs
      Gemma-4 E4B model file
      Text prompts
      Audio and image inputs
    Outputs
      Streamed completions
      Tool calls
      Local OpenAI API
    Use Cases
      Local chat over CPU or GPU
      Drop-in OpenAI API replacement
      Speculative decoding demo
    Tech Stack
      Python 3.10
      LiteRT-LM
      Gemma-4 E4B
      uv
      Hugging Face

Things people build with this

USE CASE 1

Run a single non-streaming chat call against Gemma-4 E4B on a laptop CPU

USE CASE 2

Start a local server that mimics the OpenAI and Gemini API shapes for existing clients

USE CASE 3

Try speculative decoding and GPU inference for faster local responses

USE CASE 4

Send audio or image inputs alongside text using the multimodal examples
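
For the local-server use case above, a client only has to send an OpenAI-shaped HTTP request to localhost. The sketch below builds a request in the OpenAI Responses API shape (`model` plus `input`) using only the standard library. The port, the `/v1/responses` path on the local server, and the model name are assumptions for illustration and are not taken from the repo; check example 11 for the real values.

```python
import json
import urllib.request

# Assumptions for illustration only: the summary does not state the port,
# the route, or the model identifier that the local server expects.
BASE_URL = "http://localhost:8000"
MODEL_NAME = "gemma-4-e4b"  # hypothetical model identifier


def build_request(prompt, base_url=BASE_URL, model=MODEL_NAME):
    """Build a POST request in the OpenAI Responses API shape."""
    body = json.dumps({"model": model, "input": prompt}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/v1/responses",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt):
    """Send the prompt to the local server and return the decoded JSON reply."""
    with urllib.request.urlopen(build_request(prompt), timeout=60) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Calling `ask("Hello")` only works while the local server is running. Existing OpenAI client libraries can usually be redirected the same way by overriding their base URL.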

Tech stack

Python · LiteRT-LM · Gemma-4 · uv · Hugging Face

Getting it running

Difficulty · moderate · Time to first run · 30 min

Plain examples run on CPU, but examples 04, 05, and 10 need a compatible GPU driver and example 11 needs the litert-lm CLI on PATH.
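
Two of these prerequisites can be checked before running anything: the Python version and whether the `litert-lm` CLI is on PATH. A minimal preflight sketch using only the standard library follows; the function name `preflight` is hypothetical, and GPU driver compatibility is not checked because it depends on the platform.

```python
import shutil
import sys

MIN_PYTHON = (3, 10)    # minimum version listed in the prerequisites
CLI_NAME = "litert-lm"  # command-line tool needed by example 11


def preflight(version_info=sys.version_info, which=shutil.which):
    """Return a list of problems found; an empty list means both checks passed."""
    problems = []
    if tuple(version_info[:2]) < MIN_PYTHON:
        problems.append("Python 3.10 or newer is required")
    if which(CLI_NAME) is None:
        problems.append(f"{CLI_NAME} not found on PATH (needed for example 11)")
    return problems


if __name__ == "__main__":
    for problem in preflight():
        print("warning:", problem)
```

The `version_info` and `which` parameters exist only so the checks can be exercised without changing the real environment.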

In plain English

LiteRT-LM Cookbook is a collection of Python scripts and Google Colab notebooks that show how to run a Google language model called Gemma-4 directly on your own computer, with no cloud service, no API key, and no internet connection required during inference. The author orders the examples from the simplest possible chat exchange up to running a full local web server that mimics the OpenAI and Gemini APIs.

LiteRT-LM is Google's runtime for running large language models locally on CPU and GPU. The README explains that it ships a Python API, a command-line tool called litert-lm, and a local server that speaks both the OpenAI Responses API shape and the Gemini API shape, so all inference stays on the user's machine. The specific model used in every example is Gemma-4 E4B Instruct, a 4-billion-parameter version of Gemma-4 that the README describes as a balance between capability and speed on consumer hardware.

The prerequisites are Python 3.10 or newer and pip or uv. Some examples (04, 05, and 10) need a GPU with a compatible driver, and example 11 needs the litert-lm command-line tool on the user's PATH. Installation is either uv sync or pip install -r requirements.txt; the project is defined in pyproject.toml and uv sync creates a .venv automatically. The model file itself is downloaded from Hugging Face, either with curl directly into the script directory, or through the litert-lm import command, which places the model under ~/.litert-lm/models/ so the API server example can find it.

The heart of the cookbook is a table of twelve examples, each with both a plain Python script and a Colab notebook. Example 01 is a single non-streaming request and response. Example 02 is an interactive terminal chat with streaming output. Example 03 sets a persona via a system message. Example 04 switches inference to GPU. Example 05 adds multi-token speculative decoding for faster output. Example 06 registers Python functions as callable tools. Examples 07 and 08 send audio and images alongside text. Example 09 combines streaming with a system persona. Example 10 turns on GPU, speculative decoding, tools, and streaming at the same time. Example 11 runs a local web server that exposes the model through the OpenAI and Gemini API shapes, which means existing client code written against those services can be pointed at localhost instead. Example 12 shows how to control output randomness with the temperature, top_k, top_p, and seed sampler parameters.

Each example is written to be read in order, and the README links straight to the corresponding script file and Colab badge for every row.
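
To make the four sampler parameters from example 12 concrete, here is a toy sampler in plain Python. It is a sketch of the general technique (temperature scaling, then top-k truncation, then top-p truncation, with a seeded random draw) and does not use or reproduce the LiteRT-LM API. The function name `sample_token` and the order in which the filters are applied are assumptions.

```python
import math
import random


def sample_token(logits, temperature=1.0, top_k=0, top_p=1.0, seed=None):
    """Pick one token from a {token: logit} dict using common sampler knobs."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    rng = random.Random(seed)  # a fixed seed makes the draw repeatable

    # Temperature: values below 1 sharpen the distribution, above 1 flatten it.
    scaled = {tok: logit / temperature for tok, logit in logits.items()}

    # Softmax, shifted by the maximum for numerical stability.
    peak = max(scaled.values())
    weights = {tok: math.exp(value - peak) for tok, value in scaled.items()}
    total = sum(weights.values())
    ranked = sorted(
        ((tok, weight / total) for tok, weight in weights.items()),
        key=lambda pair: pair[1],
        reverse=True,
    )

    # top_k: keep only the k most likely tokens (0 disables the filter).
    if top_k > 0:
        ranked = ranked[:top_k]

    # top_p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for tok, prob in ranked:
        kept.append((tok, prob))
        cumulative += prob
        if cumulative >= top_p:
            break

    tokens = [tok for tok, _ in kept]
    probs = [prob for _, prob in kept]
    # random.choices renormalizes the remaining weights internally.
    return rng.choices(tokens, weights=probs, k=1)[0]
```

With `top_k=1` the function always returns the most likely token, and calling it twice with the same `seed` returns the same token, which is the reproducibility that a seed parameter provides.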

Copy-paste prompts

Prompt 1
Walk me through running example 02, an interactive streaming chat with Gemma-4 E4B
Prompt 2
Set up example 11 so a Cursor or LangChain client can hit a local OpenAI-style endpoint
Prompt 3
Compare LiteRT-LM Gemma-4 E4B to Ollama with Llama 3 8B on a Mac M2
Prompt 4
Explain how the temperature and top_k sampler parameters in example 12 change outputs

Generated 2026-05-22 · Model: sonnet-4-6 · Verify against the repo before relying on details.