explaingit

seanpedersen/freellmapi

0 | TypeScript | Audience · developer | Complexity · 4/5 | Active | Setup · moderate

TLDR

Local proxy that pools the free tiers of about eleven LLM providers behind a single OpenAI-compatible endpoint. A Thompson-sampling bandit picks a model for each request and falls back to another one on a rate limit or error.

Mindmap

mindmap
  root((freellmapi))
    Inputs
      Provider API keys
      OpenAI-style requests
      Bearer token
    Outputs
      Streamed completions
      Routing decisions
      Usage counters
    Use Cases
      Stack free LLM tiers
      Fall back on 429s
      Route LangChain through one URL
    Tech Stack
      TypeScript
      SQLite
      AES-256-GCM
      OpenAI API

Things people build with this

USE CASE 1

Pool free tiers from Groq, Cerebras, OpenRouter, and others behind one OpenAI URL

USE CASE 2

Auto-fail over to another provider when a 429 or 5xx comes back

USE CASE 3

Point LangChain or Continue at one local endpoint instead of juggling SDKs

USE CASE 4

Store upstream provider keys encrypted with AES-256-GCM and gate access via bearer tokens
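
As a rough illustration of use case 4, encrypting a provider key with AES-256-GCM might look like the sketch below. It uses Node's built-in `crypto` module. The `EncryptedKey` record shape and both function names are hypothetical and are not taken from the repo; how the project derives or stores its master key is not described here.

```typescript
import { createCipheriv, createDecipheriv, randomBytes } from "node:crypto";

// Hypothetical record shape; the real database schema is not described in this summary.
interface EncryptedKey {
  iv: string;         // 12-byte nonce, base64
  tag: string;        // GCM authentication tag, base64
  ciphertext: string; // encrypted provider key, base64
}

// masterKey must be exactly 32 bytes for AES-256.
function encryptProviderKey(plaintext: string, masterKey: Buffer): EncryptedKey {
  const iv = randomBytes(12); // fresh nonce per encryption; never reuse one with the same key
  const cipher = createCipheriv("aes-256-gcm", masterKey, iv);
  const ciphertext = Buffer.concat([cipher.update(plaintext, "utf8"), cipher.final()]);
  return {
    iv: iv.toString("base64"),
    tag: cipher.getAuthTag().toString("base64"),
    ciphertext: ciphertext.toString("base64"),
  };
}

function decryptProviderKey(record: EncryptedKey, masterKey: Buffer): string {
  const decipher = createDecipheriv("aes-256-gcm", masterKey, Buffer.from(record.iv, "base64"));
  decipher.setAuthTag(Buffer.from(record.tag, "base64"));
  const plaintext = Buffer.concat([
    decipher.update(Buffer.from(record.ciphertext, "base64")),
    decipher.final(), // throws if the ciphertext or tag has been tampered with
  ]);
  return plaintext.toString("utf8");
}
```

Because GCM is authenticated, a modified ciphertext or tag makes `decipher.final()` throw instead of returning corrupted plaintext, which suits secrets held in a local database file.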

Tech stack

TypeScript, SQLite, AES-256-GCM, OpenAI API, Node

Getting it running

Difficulty · moderate | Time to first run · 30 min

Needs API keys from several free LLM providers and a Node toolchain, plus an admin bearer token to access the dashboard.

In plain English

FreeLLMAPI is a local proxy server that pools the free tiers of about eleven AI providers and exposes them through a single endpoint that mirrors the OpenAI API. Supported providers include Google, Groq, Cerebras, SambaNova, NVIDIA, Mistral, OpenRouter, GitHub Models, Cohere, Cloudflare, and Z.ai. The README claims the stacked free tiers add up to roughly 1.3 billion tokens per month of working inference capacity. The motivation is that each free tier on its own is small, and juggling that many different SDKs, rate limits, and failure modes by hand is painful. With this proxy, any OpenAI-compatible client library, including tools like LangChain or Continue, can be pointed at your local server and routed transparently across whichever provider keys you have added.

The routing layer is the main piece of engineering. A Thompson-sampling bandit assigns each model a score: it draws a value from a Beta posterior over the model's past success rate, adds a normalised speed term based on tokens per second, and subtracts any active rate-limit penalty. Because the draw is stochastic, better models tend to win without locking out unproven ones. If the chosen provider returns a 429 error, returns a 5xx error, or times out, the router skips it, puts the key on a short cooldown, and retries with the next model in the fallback chain, up to twenty times. Per-key counters track requests and tokens per minute and per day so the router only picks keys that are under their caps. Multi-turn conversations stick to the same model for thirty minutes to avoid the quality drop from mid-conversation switches.

Keys are stored in SQLite encrypted with AES-256-GCM and decrypted in memory only when a request needs them. Client apps authenticate with a single bearer token issued from the dashboard, so upstream provider keys never leave the proxy. A separate admin key gates the dashboard routes. Production mode adds CSP and HSTS headers, locks down CORS, and hides stack traces.
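
The scoring rule described above (Beta draw plus speed term minus rate-limit penalty) can be sketched roughly as follows. This is a minimal illustration and not the repo's code: the `ModelStats` fields, the `speedWeight` of 0.2, and the uniform Beta(1, 1) prior are all assumptions. The Beta sampler uses the order-statistic identity (the a-th smallest of a + b - 1 uniform draws follows Beta(a, b) for positive integers a and b), which is simple but slow for large counts.

```typescript
type Rng = () => number; // returns a uniform number in [0, 1)

// Hypothetical per-model statistics; field names are illustrative only.
interface ModelStats {
  id: string;
  successes: number;        // past successful requests
  failures: number;         // past failed requests
  tokensPerSecond: number;  // observed throughput
  rateLimitPenalty: number; // 0 when the model is not penalised
}

// Beta(a, b) for positive integers a, b: the a-th smallest of (a + b - 1) uniforms.
function sampleBeta(a: number, b: number, rng: Rng): number {
  const draws = Array.from({ length: a + b - 1 }, () => rng());
  draws.sort((x, y) => x - y);
  return draws[a - 1];
}

function scoreModel(m: ModelStats, maxTps: number, speedWeight: number, rng: Rng): number {
  // Assumed uniform prior: an untried model draws from the whole [0, 1) range.
  const quality = sampleBeta(m.successes + 1, m.failures + 1, rng);
  const speed = maxTps > 0 ? m.tokensPerSecond / maxTps : 0; // normalised to [0, 1]
  return quality + speedWeight * speed - m.rateLimitPenalty;
}

function pickModel(models: ModelStats[], rng: Rng = Math.random, speedWeight = 0.2): ModelStats {
  if (models.length === 0) throw new Error("no candidate models");
  const maxTps = Math.max(...models.map((m) => m.tokensPerSecond));
  let best = models[0];
  let bestScore = -Infinity;
  for (const m of models) {
    const score = scoreModel(m, maxTps, speedWeight, rng);
    if (score > bestScore) {
      best = m;
      bestScore = score;
    }
  }
  return best;
}
```

A model with few observations has a wide posterior, so it occasionally draws a high score and gets tried; a model with a long record of successes draws consistently high scores and wins most of the time.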
Features not yet supported include embeddings, image generation, audio, vision inputs, legacy completions, moderation, and multi-tenant billing.
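
The fallback behaviour described in this section (skip on 429, 5xx, or timeout, cool the key down, try the next candidate, stop after twenty attempts) might be sketched like this. The twenty-attempt limit comes from the summary above; the `Candidate` and `UpstreamResult` shapes, the function names, and the cooldown and timeout durations are assumptions made for illustration.

```typescript
// Hypothetical shapes for illustration only.
interface Candidate { keyId: string; model: string; }
interface UpstreamResult { status: number; body: string; }
type CallUpstream = (c: Candidate) => Promise<UpstreamResult>;

function isRetryable(status: number): boolean {
  return status === 429 || (status >= 500 && status <= 599);
}

// Rejects if the promise does not settle within ms milliseconds.
function withTimeout<T>(p: Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => reject(new Error("upstream timeout")), ms);
    p.then(
      (v) => { clearTimeout(timer); resolve(v); },
      (e) => { clearTimeout(timer); reject(e); },
    );
  });
}

async function completeWithFallback(
  chain: Candidate[],
  call: CallUpstream,
  cooldownUntil: Map<string, number>, // keyId -> epoch ms until which the key is skipped
  opts = { maxAttempts: 20, cooldownMs: 60_000, timeoutMs: 30_000 }, // durations are assumed values
  now: () => number = Date.now,
): Promise<UpstreamResult> {
  let attempts = 0;
  for (const candidate of chain) {
    if (attempts >= opts.maxAttempts) break;
    if ((cooldownUntil.get(candidate.keyId) ?? 0) > now()) continue; // still cooling down
    attempts++;
    try {
      const result = await withTimeout(call(candidate), opts.timeoutMs);
      if (!isRetryable(result.status)) return result;
    } catch {
      // Network error or timeout: handled the same way as a retryable status.
    }
    cooldownUntil.set(candidate.keyId, now() + opts.cooldownMs);
  }
  throw new Error(`all candidates failed after ${attempts} attempts`);
}
```

Passing the cooldown map in as a parameter keeps the sketch easy to test; a real router would presumably share that state with its per-key rate counters.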

Copy-paste prompts

Prompt 1
Walk me through adding a twelfth provider, including how the bandit picks it up.
Prompt 2
Help me wire LangChain to this proxy and confirm streaming completions work end to end.
Prompt 3
Show me where per-key request and token counters are enforced and how to raise the daily cap.
Prompt 4
Add support for embeddings using the providers that expose a compatible endpoint.
Prompt 5
Tune the Thompson-sampling parameters so a freshly added model gets explored more aggressively.

Generated 2026-05-22 · Model: sonnet-4-6 · Verify against the repo before relying on details.