Analysis updated 2026-05-18
Integrate into an Android AI app to make repeated or similar prompts respond in milliseconds instead of seconds.
Build an edge inference pipeline that conserves phone battery by skipping redundant AI computations.
Benchmark three on-device AI engines side by side to measure time to first token (TTFT) and energy use across 1000 real requests.
| | tensor-master/edgesync-llm | ashutosh-swain-git/dahmer | audriusbutkevicius/gohashcompare |
|---|---|---|---|
| Stars | 1 | 1 | 1 |
| Language | Go | Go | Go |
| Last pushed | — | — | 2016-07-09 |
| Maintenance | — | — | Dormant |
| Setup difficulty | hard | easy | moderate |
| Complexity | 4/5 | 1/5 | 2/5 |
| Audience | developer | developer | developer |
Figures from each repo's GitHub metadata at analysis time.
Requires a cross-compilation toolchain for Android ARM64, a local llama.cpp or ONNX Runtime build, and CGO enabled.
EdgeSync-LLM is a Go library that makes AI language models respond faster on mobile devices, particularly Android phones running ARM-based chips. It works by saving and reusing pieces of the heavy computation that AI models perform when processing a prompt, so the model does not have to repeat that work from scratch every time.

When an AI model processes text, it generates large internal tables of numbers called the attention cache. Normally, every new prompt triggers the model to rebuild this cache from zero, which is the slowest part of generating a response. EdgeSync-LLM stores slices of those computations, and when a new prompt arrives, it searches for a close match among what it has already computed. If the match is close enough, it injects the saved computation directly and skips most of the heavy lifting.

The system classifies each incoming prompt into three categories based on how similar it is to something already stored. An exact match (above 92% similarity) skips nearly all computation and gets a response in roughly 8 milliseconds. A partial match (75 to 92% similarity) reuses the overlapping portion and fills in only the difference, taking around 280 milliseconds. A total miss runs the full computation as normal at roughly 1800 milliseconds, then saves the result for future reuse.

The library is designed to slot into three popular engines used to run AI on phones and small devices: llama.cpp, MLC-LLM, and ONNX Runtime. It is written in Go and includes a bridge for integrating into Android apps via Kotlin. A built-in benchmark tests all three modes across 1000 requests drawn from 8 prompt clusters to give developers a realistic picture of the speedup.

This is a developer-facing library aimed at engineers building on-device AI apps. The README is detailed and technical, covering the internal data structures, adapter interface, and build instructions for Android cross-compilation.
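The three-tier classification described above can be illustrated with a minimal Go sketch. It is not taken from the EdgeSync-LLM source: the names `Tier`, `Classify`, and `ApproxLatencyMs` are hypothetical, the similarity score is assumed to arrive already computed as a value between 0 and 1, and the handling of the exact boundary values (0.92 counts as partial, 0.75 counts as partial) is an assumption, since the description only gives the ranges. The latency figures are the approximate ones quoted above.

```go
package main

import "fmt"

// Tier is a hypothetical name for the three match categories.
type Tier int

const (
	Miss Tier = iota
	PartialHit
	ExactHit
)

// Thresholds taken from the description: above 92% is an exact match,
// 75% to 92% is a partial match, anything lower is a miss.
const (
	exactThreshold   = 0.92
	partialThreshold = 0.75
)

// String makes a Tier print as a readable label.
func (t Tier) String() string {
	switch t {
	case ExactHit:
		return "exact"
	case PartialHit:
		return "partial"
	default:
		return "miss"
	}
}

// Classify maps a similarity score in [0, 1] to a tier.
// Boundary handling is an assumption, not confirmed by the repo.
func Classify(similarity float64) Tier {
	switch {
	case similarity > exactThreshold:
		return ExactHit
	case similarity >= partialThreshold:
		return PartialHit
	default:
		return Miss
	}
}

// ApproxLatencyMs returns the rough per-tier latency quoted in the description.
func ApproxLatencyMs(t Tier) int {
	switch t {
	case ExactHit:
		return 8
	case PartialHit:
		return 280
	default:
		return 1800
	}
}

func main() {
	for _, s := range []float64{0.97, 0.80, 0.40} {
		t := Classify(s)
		fmt.Printf("similarity=%.2f tier=%s approx=%dms\n", s, t, ApproxLatencyMs(t))
	}
}
```

In a real integration the miss path would also store the freshly computed cache for later reuse, as the description notes; that step is omitted here because the storage format is not described.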
A Go library that speeds up on-device AI responses by caching and reusing model computations, cutting response times from roughly 1800 ms to roughly 8 ms for near-identical prompts on Android.
Mainly Go. The stack also includes Android and ONNX Runtime.
Setup difficulty is rated hard, with roughly 1h+ to a first successful run.
Mainly developers.
Verify against the repo before relying on details.