explaingit

geekgineer/needle-rs

28 · Rust · Audience · developer · Complexity · 4/5 · Active · License · Setup · moderate

TLDR

Rust and WebAssembly runtime for the 26M parameter Needle model that picks which tool an AI app should call, fitting in 258KB plus 22MB of weights.

Mindmap

mindmap
  root((needle-rs))
    Inputs
      User query
      Tool list JSON
      Safetensors weights
      Vocab file
    Outputs
      Tool name
      Arguments JSON
      280ms decision
    Use Cases
      Browser tool routing
      Edge function calling
      Embedded device AI
      Offline agent
    Tech Stack
      Rust
      WebAssembly
      Node.js
      Python
      Cloudflare Workers

Things people build with this

USE CASE 1

Add offline tool-calling to a browser app without sending queries to OpenAI

USE CASE 2

Run sub-second function routing on Cloudflare Workers edge nodes

USE CASE 3

Embed tool-selection AI into a no-std Rust firmware target

USE CASE 4

Wrap needle-rs in a Python pipeline to pick tools before hitting a bigger LLM

Tech stack

Rust · WebAssembly · Node.js · Python · Cloudflare Workers

Getting it running

Difficulty · moderate
Time to first run · 30 min

You must download safetensors weights and a vocab file from Hugging Face before the first run; the runtime itself is tiny but model loading is a separate step.
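Before wiring the downloaded files into the runtime, it can help to confirm that the weights file is a well-formed safetensors file. The sketch below is not part of needle-rs; it relies only on the public safetensors layout (an 8-byte little-endian header length followed by a JSON header) and the Python standard library. The function names are hypothetical, chosen for illustration.

```python
import json
import struct
from math import prod


def read_safetensors_header(path):
    """Return the JSON header of a safetensors file as a dict."""
    with open(path, "rb") as f:
        # First 8 bytes: header length as an unsigned little-endian 64-bit integer.
        (header_len,) = struct.unpack("<Q", f.read(8))
        # Next header_len bytes: UTF-8 JSON describing every tensor.
        return json.loads(f.read(header_len).decode("utf-8"))


def count_parameters(header):
    """Sum the element counts of all tensors, skipping the optional metadata entry."""
    return sum(
        prod(entry["shape"])
        for name, entry in header.items()
        if name != "__metadata__"
    )
```

Calling `count_parameters(read_safetensors_header(path))` on the downloaded weights should give a total in the neighborhood of the 26 million parameters the project advertises; a very different number would suggest a wrong or truncated download.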

The MIT license lets you use, modify, and redistribute the code for any purpose as long as you keep the copyright and license text.

In plain English

needle-rs is a Rust and WebAssembly runtime for Needle, a small AI model from a company called Cactus Compute. Needle is a 26 million parameter transformer that does exactly one thing: it takes a user's query plus a list of available tools and outputs a JSON object that says which tool to call and with what arguments. This pattern is usually called function calling or tool calling, and it is what lets an AI app decide on its own to book a flight, look up the weather, or run any other action you have defined. The whole runtime fits in 258 kilobytes and the model weights are about 22 megabytes.

The point of the project, as the README presents it, is to make tool-routing AI cheap and portable. Normally you either pay an API like OpenAI to do this, which sends data off the user's machine and costs money per call, or you ship hundreds of megabytes of a local language model. needle-rs claims similar routing accuracy at a fraction of the size and around 280 milliseconds of latency, with no API key and no backend.

The README is explicit that the model itself, including architecture, training, dataset, and weights, is the work of Cactus Compute, and that this project is a community deployment layer for places the official Python implementation does not cover. It runs in a browser, in Node.js, on Cloudflare Workers (small server-side scripts that run on Cloudflare's edge), as a command line tool on Linux, macOS, and Windows, as a Python wheel installable through pip, and as a Rust library that can target embedded systems without a standard library. Mobile and on-device neural accelerator paths are deferred to Cactus's own engine, which the README presents as complementary.

The quick start shows three usage patterns: an npm package for JavaScript with an init call and a NeedleWasm.load function, a Rust crate called needle-infer, and a Python package also called needle-rs. In each case you load the safetensors weights and a vocab file, then call run with a query and a JSON list of tools. Weights can be pulled from Hugging Face with the huggingface-cli tool or loaded straight from a URL in the browser. The project is MIT licensed.
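Whichever binding produces the answer, the calling application still has to check the returned JSON against its own tool list before acting on it. The sketch below illustrates that step in plain Python and deliberately does not call needle-rs. The tool schema (`name`, `parameters`) and the output shape (`name`, `arguments`) are assumptions made for illustration, not the documented needle-rs format, and `parse_tool_call` and `dispatch` are hypothetical helpers.

```python
import json

# Hypothetical tool list; the field names are assumptions, not the needle-rs schema.
TOOLS_JSON = json.dumps([
    {"name": "get_weather", "parameters": {"city": "string"}},
    {"name": "book_flight", "parameters": {"origin": "string", "destination": "string"}},
])


def parse_tool_call(raw_output, tools_json):
    """Check a model's JSON output against the tool list and return (name, arguments)."""
    tools = {tool["name"]: tool for tool in json.loads(tools_json)}
    call = json.loads(raw_output)
    name = call["name"]
    if name not in tools:
        raise ValueError(f"model chose unknown tool: {name!r}")
    arguments = call.get("arguments", {})
    # Reject argument names the chosen tool does not declare.
    unexpected = set(arguments) - set(tools[name]["parameters"])
    if unexpected:
        raise ValueError(f"unexpected arguments for {name}: {sorted(unexpected)}")
    return name, arguments


def dispatch(raw_output, tools_json, handlers):
    """Validate the tool call, then invoke the matching handler with its arguments."""
    name, arguments = parse_tool_call(raw_output, tools_json)
    return handlers[name](**arguments)
```

In use, `handlers` would map each tool name to an ordinary function, for example `{"get_weather": lambda city: ...}`, and `raw_output` would be whatever string the runtime returns for the query. Validating first means a malformed or unexpected choice fails loudly instead of triggering the wrong action.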

Copy-paste prompts

Prompt 1
Load needle-rs in a browser via the npm package and route a weather query to one of three tools
Prompt 2
Build a Cloudflare Worker that uses needle-rs to choose a tool from a JSON list per request
Prompt 3
Use the needle-infer Rust crate to add tool routing to an existing CLI without a network call
Prompt 4
Fetch the Needle safetensors weights from Hugging Face and load them into a Python script with needle-rs
Prompt 5
Compile needle-rs for a no-std embedded target and benchmark inference latency

Generated 2026-05-22 · Model: sonnet-4-6 · Verify against the repo before relying on details.