explaingit

tensor-master/edgesync-llm

Analysis updated 2026-05-18

Stars: 1 · Language: Go · Audience: developer · Complexity: 4/5 · Setup: hard

TLDR

A Go library that speeds up on-device AI responses on Android by caching and reusing model computations, cutting response times from roughly 1800 ms on a full cache miss to roughly 8 ms when a new prompt closely matches a cached one (above 92% similarity).

Mindmap

mindmap
  root((EdgeSync-LLM))
    What it does
      Cache KV computations
      Skip repeated work
      Speed up AI responses
    How it works
      Similarity scoring
      Exact hit inject
      Partial hit merge
      Full miss fallback
    Supported engines
      llama.cpp
      MLC-LLM
      ONNX Runtime
    Platforms
      Android ARM64
      Kotlin bridge
      Linux cross-compile
    Benchmarks
      1000 request test
      Three mode comparison
      Energy profiling

What do people build with it?

USE CASE 1

Integrate into an Android AI app to make repeated or similar prompts respond in milliseconds instead of seconds.

USE CASE 2

Build an edge inference pipeline that conserves phone battery by skipping redundant AI computations.

USE CASE 3

Benchmark three on-device AI engines side-by-side to measure time to first token (TTFT) and energy use across 1000 real requests.

What is it built with?

Go · Android · ONNX Runtime · SQLite · MiniLM · HNSW

How does it compare?

| | tensor-master/edgesync-llm | ashutosh-swain-git/dahmer | audriusbutkevicius/gohashcompare |
| --- | --- | --- | --- |
| Stars | 1 | 1 | 1 |
| Language | Go | Go | Go |
| Setup difficulty | hard | easy | moderate |
| Complexity | 4/5 | 1/5 | 2/5 |
| Audience | developer | developer | developer |

Last pushed: 2016-07-09
Maintenance: Dormant

Figures from each repo's GitHub metadata at analysis time.

How do you get it running?

Difficulty: hard · Time to first run: 1h+

Requires a cross-compilation toolchain for Android ARM64, a local llama.cpp or ONNX Runtime build, and CGO enabled.

In plain English

EdgeSync-LLM is a Go library that makes AI language models respond faster on mobile devices, particularly Android phones running ARM-based chips. It works by saving and reusing pieces of the heavy computation that AI models perform when processing a prompt, so the model does not have to repeat that work from scratch every time.

When an AI model processes text, it generates large internal tables of numbers called the attention cache (also known as the key-value, or KV, cache). Normally, every new prompt triggers the model to rebuild this cache from zero, which is the slowest part of generating a response. EdgeSync-LLM stores slices of those computations, and when a new prompt arrives, it searches for a close match among what it has already computed. If the match is close enough, it injects the saved computation directly and skips most of the heavy lifting.

The system classifies each incoming prompt into one of three categories based on how similar it is to something already stored. An exact match (above 92% similarity) skips nearly all computation and gets a response in roughly 8 milliseconds. A partial match (75 to 92% similarity) reuses the overlapping portion and fills in only the difference, taking around 280 milliseconds. A total miss (below 75% similarity) runs the full computation as normal at roughly 1800 milliseconds, then saves the result for future reuse.

The library is designed to slot into three popular engines used to run AI on phones and small devices: llama.cpp, MLC-LLM, and ONNX Runtime. It is written in Go and includes a bridge for integrating into Android apps via Kotlin. A built-in benchmark tests all three modes across 1000 requests drawn from 8 prompt clusters to give developers a realistic picture of the speedup.

This is a developer-facing library aimed at engineers building on-device AI apps. The README is detailed and technical, covering the internal data structures, adapter interface, and build instructions for Android cross-compilation.
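The three-way routing described above can be illustrated with a minimal sketch. It is not the library's code: the names `classify`, `bestMatch`, and `cosine` are hypothetical, cosine similarity is an assumed scoring function, and a linear scan stands in for the HNSW index listed in the stack. Only the 92% and 75% thresholds come from this summary.

```go
package main

import (
	"fmt"
	"math"
)

// Path names the three routes described in the summary.
type Path string

const (
	Exact   Path = "EXACT"
	Partial Path = "PARTIAL"
	Miss    Path = "MISS"
)

// Thresholds from the summary: above 0.92 is an exact hit,
// 0.75 up to and including 0.92 is a partial hit, lower is a miss.
const (
	exactThreshold   = 0.92
	partialThreshold = 0.75
)

// cosine returns the cosine similarity of two vectors. It returns 0
// when the lengths differ or either vector has zero norm.
// Assumption: the real library may score similarity differently.
func cosine(a, b []float64) float64 {
	if len(a) != len(b) {
		return 0
	}
	var dot, na, nb float64
	for i := range a {
		dot += a[i] * b[i]
		na += a[i] * a[i]
		nb += b[i] * b[i]
	}
	if na == 0 || nb == 0 {
		return 0
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

// classify maps a similarity score to one of the three paths.
func classify(sim float64) Path {
	switch {
	case sim > exactThreshold:
		return Exact
	case sim >= partialThreshold:
		return Partial
	default:
		return Miss
	}
}

// bestMatch scans the stored embeddings and returns the index and score
// of the closest one. An empty store yields (-1, -1), which classifies
// as a miss. Hypothetical stand-in for an approximate index such as HNSW.
func bestMatch(query []float64, stored [][]float64) (int, float64) {
	best, bestSim := -1, -1.0
	for i, s := range stored {
		if sim := cosine(query, s); sim > bestSim {
			best, bestSim = i, sim
		}
	}
	return best, bestSim
}

func main() {
	// Toy 3-dimensional vectors; the summary mentions 384-dim embeddings.
	stored := [][]float64{{1, 0, 0}, {0, 1, 0}}
	queries := [][]float64{{1, 0, 0}, {2, 1, 0}, {1, 1, 0}}
	for _, q := range queries {
		idx, sim := bestMatch(q, stored)
		fmt.Printf("query %v: nearest=%d similarity=%.3f path=%s\n", q, idx, sim, classify(sim))
	}
}
```

With these toy vectors the first query matches a stored vector exactly and routes to EXACT, the second scores about 0.894 and routes to PARTIAL, and the third scores about 0.707 and routes to MISS. A score of exactly 0.92 falls in the partial band, following the "above 92%" wording.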

Copy-paste prompts

Prompt 1
Using EdgeSync-LLM with llama.cpp, show me how to set up the KVAdapter so my Android app can inject cached fragments instead of running a full prefill each time.
Prompt 2
Write a Go function using EdgeSync-LLM's DifferentialEngine to route a new prompt through the EXACT, PARTIAL, or MISS path and log which branch it takes.
Prompt 3
Explain the KVFragment struct in EdgeSync-LLM and why it stores both the raw tensor bytes and a 384-dim embedding vector.
Prompt 4
Help me cross-compile EdgeSync-LLM for Android ARM64 with CGO enabled, pointing at my local llama.cpp build directory.
Prompt 5
Run the EdgeSync-LLM benchmark in verbose mode and interpret the TTFT, hit rate, and energy columns it outputs.
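When reading the hit rate and TTFT figures that Prompt 5 refers to, it can help to see how the two relate. The sketch below is a rough illustration only: the `Mix` type is hypothetical, the request counts are invented for the example and are not measured results, and counting partial hits toward the hit rate is an assumption. The per-path latencies are the approximate figures quoted in this summary.

```go
package main

import "fmt"

// Approximate per-path latencies in milliseconds, as quoted in the summary.
const (
	exactMs   = 8.0
	partialMs = 280.0
	missMs    = 1800.0
)

// Mix counts how many requests took each path (hypothetical type).
type Mix struct {
	Exact, Partial, Miss int
}

// Total returns the number of requests in the mix.
func (m Mix) Total() int { return m.Exact + m.Partial + m.Miss }

// HitRate is the fraction of requests served by an exact or partial hit.
// Assumption: the real benchmark may define hit rate differently.
func (m Mix) HitRate() float64 {
	if m.Total() == 0 {
		return 0
	}
	return float64(m.Exact+m.Partial) / float64(m.Total())
}

// MeanLatencyMs is the request-weighted mean of the per-path latencies.
func (m Mix) MeanLatencyMs() float64 {
	if m.Total() == 0 {
		return 0
	}
	sum := float64(m.Exact)*exactMs +
		float64(m.Partial)*partialMs +
		float64(m.Miss)*missMs
	return sum / float64(m.Total())
}

func main() {
	// Invented mix for illustration; not taken from any benchmark run.
	m := Mix{Exact: 500, Partial: 300, Miss: 200}
	mean := m.MeanLatencyMs()
	fmt.Printf("hit rate %.2f, mean latency %.1f ms, speedup over all-miss %.2fx\n",
		m.HitRate(), mean, missMs/mean)
}
```

For this invented mix the weighted mean is (500 × 8 + 300 × 280 + 200 × 1800) / 1000 = 448 ms. Because misses are so much slower than hits, the miss share dominates the mean, so the overall speedup depends far more on the miss rate than on the 8 ms exact-hit figure.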

Frequently asked questions

What is edgesync-llm?

A Go library that speeds up on-device AI responses on Android by caching and reusing model computations, cutting response times from roughly 1800 ms on a full cache miss to roughly 8 ms when a new prompt closely matches a cached one (above 92% similarity).

What language is edgesync-llm written in?

Mainly Go. The stack also includes Android, ONNX Runtime, SQLite, MiniLM, and HNSW.

How hard is edgesync-llm to set up?

Setup difficulty is rated hard, with an hour or more to a first successful run.

Who is edgesync-llm for?

Mainly developers, in particular engineers building on-device AI apps.

Verify against the repo before relying on details.