andersondanieln/hexllama

★ 13 · TypeScript · Audience: vibe coder · Complexity: 2/5 · Active · Setup: easy

Mindmap

mindmap
  root((hexllama))
    Inputs
      GGUF models
      Hugging Face links
      llama.cpp builds
    Outputs
      Running model server
      Chat web UI
      Saved templates
    Use Cases
      Run local LLMs
      Manage llama.cpp versions
      Serve OpenAI API locally
    Tech Stack
      Electron
      React
      TypeScript
      Vite
      Node

Things people build with this

USE CASE 1

Browse Hugging Face and download GGUF model files with one click

USE CASE 2

Save per-model templates with thread count, batch size, and context length defaults

USE CASE 3

Run several models at once on different ports in API-only or chat mode

USE CASE 4

Switch between llama.cpp builds from a settings panel without recompiling
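
As a rough illustration of the template idea in use case 2, the sketch below models a saved template and turns it into arguments for llama.cpp's `llama-server`. The `ModelTemplate` interface, its field names, and `toServerArgs` are hypothetical and not taken from hexllama's source; the flags used (`--model`, `--threads`, `--batch-size`, `--ctx-size`, `--port`) are standard `llama-server` options, but how hexllama actually stores templates and builds commands may differ.

```typescript
// Hypothetical shape for a saved template. Field names are illustrative
// assumptions, not hexllama's real schema.
interface ModelTemplate {
  name: string;
  modelPath: string; // path to a downloaded .gguf file
  threads: number;
  batchSize: number;
  contextLength: number;
  port: number;
}

// Translate a template into llama-server command-line arguments.
function toServerArgs(t: ModelTemplate): string[] {
  return [
    "--model", t.modelPath,
    "--threads", String(t.threads),
    "--batch-size", String(t.batchSize),
    "--ctx-size", String(t.contextLength),
    "--port", String(t.port),
  ];
}

// Example values are placeholders, not recommended defaults.
const exampleTemplate: ModelTemplate = {
  name: "example",
  modelPath: "models/example.gguf",
  threads: 8,
  batchSize: 512,
  contextLength: 4096,
  port: 8080,
};

// Print the full command a launcher could run for this template.
console.log(["llama-server", ...toServerArgs(exampleTemplate)].join(" "));
```

Keeping the template as plain data and deriving the flags from it is what makes a visual flag editor practical: the form edits the object, and the command line is regenerated on launch.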

Tech stack

TypeScript · Electron · React · Vite · Node · llama.cpp

Getting it running

Difficulty: easy · Time to first run: 5 min

Pre-built installer works out of the box; source build needs Node.js 18 or newer.

In plain English

Hexllama is a desktop application that gives you a graphical interface for running large language models on your own computer. The underlying engine is llama.cpp, a popular open-source program that loads and runs AI models locally. Using llama.cpp directly usually means working with terminal commands and config flags; Hexllama replaces those with buttons, menus, and forms.

The app has a built-in browser for Hugging Face, a website where people share AI model files. From inside Hexllama you can search for a model, look at the files in a repository, and download GGUF model files with one click. A download manager lets you pause, resume, or cancel large downloads, and you can also paste a direct link to a GGUF file.

When a download finishes, the app creates a starter configuration for the model you grabbed, with reasonable defaults for thread count, batch size, and context length. You can save these configurations as templates and reuse them. Multiple models can run at the same time on different network ports without colliding. Each template can be launched in chat mode, which opens the llama.cpp web interface in your browser, or in API-only mode, which runs the model in the background so other tools can connect to it. There is also a visual editor for command-line flags, so you tick checkboxes and set numbers instead of memorizing arguments.

Hexllama can manage different builds of the llama.cpp engine itself, check the ggml-org repository for new releases, and switch between versions from a settings panel.

To install it, either use a pre-built installer from the Releases page, or clone the repo and run npm install and npm run dev with Node.js 18 or newer. The project is built with Electron, React, TypeScript, and Vite. It collects no telemetry and runs fully locally, though model downloads still go through Hugging Face.

The README also lists a roadmap covering a native chat UI, speculative decoding, multi-language support, and future backends like MLX and vLLM.
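
To make the "different network ports without colliding" point concrete, here is a minimal sketch of one way a launcher could hand out ports to several models. The function names (`nextFreePort`, `assignPorts`, `baseUrl`) and the starting port 8080 are assumptions for illustration; this is not hexllama's actual logic, and it only tracks ports it has assigned itself rather than asking the operating system what is free.

```typescript
// Return the lowest port at or above `preferred` that is not already in use.
function nextFreePort(inUse: ReadonlySet<number>, preferred: number = 8080): number {
  for (let port = preferred; port <= 65535; port += 1) {
    if (!inUse.has(port)) return port;
  }
  throw new RangeError("no free port at or above " + preferred);
}

// Give each named model its own port, starting from `start`.
function assignPorts(names: string[], start: number = 8080): Map<string, number> {
  const used = new Set<number>();
  const assigned = new Map<string, number>();
  for (const name of names) {
    const port = nextFreePort(used, start);
    used.add(port);
    assigned.set(name, port);
  }
  return assigned;
}

// Base URL an OpenAI-compatible client would point at for a given port.
// llama-server exposes OpenAI-style routes under /v1.
function baseUrl(port: number): string {
  return `http://127.0.0.1:${port}/v1`;
}

// Model names here are placeholders.
const ports = assignPorts(["model-a", "model-b"]);
for (const [name, port] of ports) {
  console.log(`${name} -> ${baseUrl(port)}`);
}
// prints:
// model-a -> http://127.0.0.1:8080/v1
// model-b -> http://127.0.0.1:8081/v1
```

A real launcher would likely also probe whether a port is already bound by another process before starting the server; that check is left out here to keep the sketch self-contained.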

Copy-paste prompts

Prompt 1

Install hexllama from the Releases page and open the llama.cpp web UI for a downloaded Qwen GGUF

Prompt 2

Clone hexllama and run npm install and npm run dev on Node 18, then explain the Electron entry point

Prompt 3

Configure hexllama to serve two models in API-only mode on ports 8080 and 8081 for local agent tooling

Prompt 4

Walk me through the visual flag editor in hexllama and which llama.cpp flags it maps to

Prompt 5

Add a new backend stub to hexllama for MLX so I can see where to plug in a non-llama.cpp engine

Generated 2026-05-22 · Model: sonnet-4-6 · Verify against the repo before relying on details.