onuralpszr/litert-lm-cookbook

Analysis updated 2026-06-24

★ 13 | Jupyter Notebook | Audience · developer | Complexity · 3/5 | Setup · moderate

Mindmap

mindmap
  root((litert-lm-cookbook))
    Inputs
      Gemma-4 E4B model file
      Text prompts
      Audio and image inputs
    Outputs
      Streamed completions
      Tool calls
      Local OpenAI API
    Use Cases
      Local chat over CPU or GPU
      Drop-in OpenAI API replacement
      Speculative decoding demo
    Tech Stack
      Python 3.10
      LiteRT-LM
      Gemma-4 E4B
      uv
      Hugging Face



What do people build with it?

USE CASE 1

Run a single non-streaming chat call against Gemma-4 E4B on a laptop CPU

USE CASE 2

Start a local server that mimics the OpenAI and Gemini API shapes for existing clients

USE CASE 3

Try speculative decoding and GPU inference for faster local responses

USE CASE 4

Send audio or image inputs alongside text using the multimodal examples

What is it built with?

Python, LiteRT-LM, Gemma-4, uv, Hugging Face

How does it compare?

	onuralpszr/litert-lm-cookbook	lfrincond/seismic_imaging26	open-x-humanoid/hex
Stars	13	13	13
Language	Jupyter Notebook	Jupyter Notebook	Jupyter Notebook
Setup difficulty	moderate	hard	hard
Complexity	3/5	4/5	5/5
Audience	developer	researcher	researcher

Figures from each repo's GitHub metadata at analysis time.

How do you get it running?

Difficulty · moderate | Time to first run · 30 min

Plain examples run on CPU, but examples 04, 05, and 10 need a compatible GPU driver and example 11 needs the litert-lm CLI on PATH.
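Two of the prerequisites named here can be checked before running anything. The following is a minimal sketch, not part of the cookbook: the function name `check_prerequisites` is hypothetical, and it only covers the Python version and the `litert-lm` CLI lookup. It does not verify GPU drivers for examples 04, 05, and 10.

```python
import shutil
import sys


def check_prerequisites(min_python=(3, 10), cli_name="litert-lm"):
    """Return a dict describing which documented prerequisites are met.

    Hypothetical helper for illustration; it is not shipped with the cookbook.
    """
    return {
        # The cookbook lists Python 3.10 or newer as a prerequisite.
        "python_ok": sys.version_info[:2] >= min_python,
        # Example 11 needs the litert-lm CLI on PATH; which() returns None if absent.
        "cli_path": shutil.which(cli_name),
    }


if __name__ == "__main__":
    status = check_prerequisites()
    print("Python >= 3.10:", status["python_ok"])
    print("litert-lm on PATH:", status["cli_path"] or "not found")
```

If the CLI is reported as not found, only example 11 is affected; the plain CPU examples do not depend on it.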

In plain English

LiteRT-LM Cookbook is a collection of Python scripts and Google Colab notebooks that show how to run a Google language model called Gemma-4 directly on your own computer, with no cloud service, no API key, and no internet connection required during inference. The author orders the examples from the simplest possible chat exchange up to a full local web server that mimics the OpenAI and Gemini APIs.

LiteRT-LM is Google's runtime for running large language models locally on CPU and GPU. The README explains that it ships a Python API, a command-line tool called litert-lm, and a local server that speaks both the OpenAI Responses API shape and the Gemini API shape, so all inference stays on the user's machine. The model used in every example is Gemma-4 E4B Instruct, a 4-billion-parameter version of Gemma-4 that the README describes as a balance between capability and speed on consumer hardware.

The prerequisites are Python 3.10 or newer and either pip or uv. Examples 04, 05, and 10 need a GPU with a compatible driver, and example 11 needs the litert-lm command-line tool on the user's PATH. Installation is either uv sync or pip install -r requirements.txt. The project is defined in pyproject.toml, and uv sync creates a .venv automatically. The model file is downloaded from Hugging Face, either with curl directly into the script directory or through the litert-lm import command, which places the model under ~/.litert-lm/models/ so the API server example can find it.

The heart of the cookbook is a table of twelve examples, each with both a plain Python script and a Colab notebook. Example 01 is a single non-streaming request and response. Example 02 is an interactive terminal chat with streaming output. Example 03 sets a persona via a system message. Example 04 switches inference to GPU. Example 05 adds multi-token speculative decoding for faster output. Example 06 registers Python functions as callable tools. Examples 07 and 08 send audio and images alongside text. Example 09 combines streaming with a system persona. Example 10 turns on GPU, speculative decoding, tools, and streaming at the same time. Example 11 runs a local web server that exposes the model through the OpenAI and Gemini API shapes, so existing client code written against those services can be pointed at localhost instead. Example 12 shows how to control output randomness with the temperature, top_k, top_p, and seed sampler parameters.

The examples are written to be read in order, and the README links each table row to its script file and Colab badge.
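To make "pointed at localhost" concrete, here is a hedged sketch of a client for an OpenAI Responses-style endpoint, using only the Python standard library. The request body (`model`, `input`) and the `output_text` response parts follow OpenAI's public Responses API. The host, port, model id, and the `/v1/responses` path on the local server are assumptions made for illustration and are not taken from the cookbook, so check example 11 for the actual values.

```python
import json
import urllib.request

# Assumptions for illustration only: these are placeholders, not values from the
# cookbook. Replace them with whatever example 11 actually prints or documents.
BASE_URL = "http://localhost:8000"
MODEL_ID = "gemma-4-e4b-it"


def build_responses_request(prompt, model=MODEL_ID, base_url=BASE_URL):
    """Build a POST request whose body follows the OpenAI Responses API shape."""
    payload = {"model": model, "input": prompt}
    return urllib.request.Request(
        url=f"{base_url}/v1/responses",  # path assumed to mirror OpenAI's
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_text(response_body):
    """Join all output_text parts found in a Responses-style JSON body."""
    parts = []
    for item in response_body.get("output", []):
        # Some output items (for example tool calls) carry no content list.
        for content in item.get("content") or []:
            if content.get("type") == "output_text":
                parts.append(content.get("text", ""))
    return "".join(parts)


def ask(prompt):
    """Send the prompt to the local server; this needs the server to be running."""
    with urllib.request.urlopen(build_responses_request(prompt), timeout=60) as resp:
        return extract_text(json.load(resp))
```

Building the request separately from sending it means the payload can be inspected without a running server. Only `ask` touches the network.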

Copy-paste prompts

Prompt 1

Walk me through running example 02, an interactive streaming chat with Gemma-4 E4B

Prompt 2

Set up example 11 so a Cursor or LangChain client can hit a local OpenAI-style endpoint

Prompt 3

Compare LiteRT-LM Gemma-4 E4B to Ollama with Llama 3 8B on a Mac M2

Prompt 4

Explain how the temperature and top_k sampler parameters in example 12 change outputs
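As background for Prompt 4, the sampler parameters named in example 12 can be illustrated with a small pure-Python sketch. This is a generic textbook-style sampler over a toy `{token: logit}` dict, not LiteRT-LM's implementation. The function name `sample_token`, the order in which the filters are applied, and the treatment of `temperature <= 0` as greedy decoding are all assumptions of this sketch.

```python
import math
import random


def sample_token(logits, temperature=1.0, top_k=0, top_p=1.0, seed=None):
    """Pick one token from a {token: logit} dict using common sampler knobs."""
    if temperature <= 0:
        # Assumption of this sketch: non-positive temperature means greedy decoding.
        return max(logits, key=logits.get)

    # temperature: divide logits before softmax; lower values sharpen the distribution.
    scaled = {tok: logit / temperature for tok, logit in logits.items()}
    peak = max(scaled.values())  # subtract the max for numerical stability
    weights = {tok: math.exp(value - peak) for tok, value in scaled.items()}
    total = sum(weights.values())
    ranked = sorted(((w / total, tok) for tok, w in weights.items()), reverse=True)

    # top_k: keep only the k most likely tokens (0 means no limit in this sketch).
    if top_k > 0:
        ranked = ranked[:top_k]

    # top_p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for prob, tok in ranked:
        kept.append((prob, tok))
        cumulative += prob
        if cumulative >= top_p:
            break

    # seed: a fixed seed makes the random draw repeatable.
    rng = random.Random(seed)
    tokens = [tok for _, tok in kept]
    probs = [prob for prob, _ in kept]
    return rng.choices(tokens, weights=probs, k=1)[0]
```

With toy logits such as `{"a": 2.0, "b": 1.0, "c": 0.1}`, setting `top_k=1` or a very small `top_p` always yields the most likely token, while a higher `temperature` flattens the distribution so less likely tokens are drawn more often. Passing the same `seed` twice returns the same token.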

Frequently asked questions

What is litert-lm-cookbook?

Twelve Python and Colab examples that run Google Gemma-4 E4B locally with LiteRT-LM, from a single completion call up to a full OpenAI-compatible server.

What language is litert-lm-cookbook written in?

Mainly Jupyter Notebook. The stack also includes Python, LiteRT-LM, and Gemma-4.

How hard is litert-lm-cookbook to set up?

Setup difficulty is rated moderate, with roughly 30 minutes to a first successful run.

Who is litert-lm-cookbook for?

Mainly developers.


Verify against the repo before relying on details.