Run a single non-streaming chat call against Gemma-4 E4B on a laptop CPU
Start a local server that mimics the OpenAI and Gemini API shapes for existing clients
Try speculative decoding and GPU inference for faster local responses
Send audio or image inputs alongside text using the multimodal examples
Plain examples run on CPU, but examples 04, 05, and 10 need a compatible GPU driver and example 11 needs the litert-lm CLI on PATH.
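The prerequisites can be checked before any example is run. The following is a minimal sketch, not part of the cookbook: the `preflight` helper and its return keys are hypothetical names chosen for illustration. It relies only on the Python standard library and on the requirements the README states (Python 3.10 or newer, and the litert-lm CLI on PATH for example 11). It does not attempt to detect a GPU driver, because that check depends on the platform.

```python
import shutil
import sys

MIN_PYTHON = (3, 10)    # minimum version stated in the README
CLI_NAME = "litert-lm"  # command-line tool that example 11 needs on PATH


def preflight(cli_name=CLI_NAME, min_python=MIN_PYTHON):
    """Hypothetical helper: report which prerequisites are met.

    Uses only the standard library and imports no LiteRT-LM code.
    """
    return {
        # True when the running interpreter is new enough
        "python_ok": sys.version_info[:2] >= min_python,
        # Full path to the CLI, or None when it is not on PATH
        "cli_path": shutil.which(cli_name),
    }


if __name__ == "__main__":
    report = preflight()
    print("Python version OK:", report["python_ok"])
    print("litert-lm CLI:", report["cli_path"] or "not found on PATH")
```

If `cli_path` comes back as `None`, the CPU examples should be unaffected, but example 11 would be expected to fail until the CLI is installed.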
LiteRT-LM Cookbook is a collection of Python scripts and Google Colab notebooks that show how to run Google's Gemma-4 language model directly on your own computer, with no cloud service, no API key, and no internet connection required during inference. The author orders the examples from the simplest possible chat exchange up to a full local web server that mimics the OpenAI and Gemini APIs.
LiteRT-LM is Google's runtime for running large language models locally on CPU and GPU. The README explains that it ships a Python API, a command-line tool called litert-lm, and a local server that speaks both the OpenAI Responses API shape and the Gemini API shape, so all inference stays on the user's machine. Every example uses the same model, Gemma-4 E4B Instruct, a 4-billion-parameter version of Gemma-4 that the README describes as a balance between capability and speed on consumer hardware.
The prerequisites are Python 3.10 or newer and either pip or uv. Examples 04, 05, and 10 need a GPU with a compatible driver, and example 11 needs the litert-lm command-line tool on the user's PATH. To install, run either uv sync or pip install -r requirements.txt; the project is defined in pyproject.toml, and uv sync creates a .venv automatically. The model file is downloaded from Hugging Face, either with curl directly into the script directory or through the litert-lm import command, which places the model under ~/.litert-lm/models/ so the API server example can find it.
The heart of the cookbook is a table of twelve examples, each with both a plain Python script and a Colab notebook. Example 01 is a single non-streaming request and response. Example 02 is an interactive terminal chat with streaming output. Example 03 sets a persona via a system message. Example 04 switches inference to GPU. Example 05 adds multi-token speculative decoding for faster output. Example 06 registers Python functions as callable tools. Examples 07 and 08 send audio and images alongside text. Example 09 combines streaming with a system persona. Example 10 turns on GPU, speculative decoding, tools, and streaming at the same time. Example 11 runs a local web server that exposes the model through the OpenAI and Gemini API shapes, so existing client code written against those services can be pointed at localhost instead. Example 12 shows how to control output randomness with the temperature, top_k, top_p, and seed sampler parameters. The examples are written to be read in order, and the README links straight to the corresponding script file and Colab badge for every row.
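To make the four sampler parameters from Example 12 concrete, here is a minimal pure-Python sketch of how temperature, top_k, top_p, and seed are commonly combined when choosing the next token. It illustrates the general technique only; it is not LiteRT-LM's implementation or API. The `sample_token` function is a hypothetical name, and the order in which a real runtime applies the filters may differ.

```python
import math
import random


def sample_token(logits, temperature=1.0, top_k=0, top_p=1.0, seed=None):
    """Hypothetical helper: pick one token index from raw logits."""
    rng = random.Random(seed)  # a fixed seed makes the draw repeatable

    if temperature <= 0:
        # Treat zero temperature as greedy decoding: always take the best token
        return max(range(len(logits)), key=lambda i: logits[i])

    # Temperature rescales logits: below 1 sharpens, above 1 flattens
    scaled = [x / temperature for x in logits]

    # Numerically stable softmax over the scaled logits
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Rank token indices from most to least likely
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)

    # top_k keeps only the k most likely tokens (0 disables the filter)
    if top_k > 0:
        ranked = ranked[:top_k]

    # top_p keeps the smallest prefix whose cumulative probability reaches p
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # Draw from the surviving tokens, weighted by their probabilities
    weights = [probs[i] for i in kept]
    return rng.choices(kept, weights=weights, k=1)[0]


if __name__ == "__main__":
    toy_logits = [2.0, 1.0, 0.1, -1.0]
    print(sample_token(toy_logits, temperature=0.8, top_k=3, top_p=0.9, seed=7))
```

In this sketch, lowering temperature or tightening top_k and top_p narrows the set of tokens that can be chosen, while passing the same seed with the same inputs repeats the same draw.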
Generated 2026-05-22 · Model: sonnet-4-6 · Verify against the repo before relying on details.