nyraseithhh/cache

★ 39


In plain English

This repository (written in Chinese) documents the exact prompt caching setup used by its authors to achieve a 96% cache hit rate when calling the Anthropic Claude API. In one measured run, a request with 49,310 input tokens had 47,354 of them served from cache rather than recomputed, saving significant cost and latency. The document explains not what prompt caching is, but specifically how they structured their requests to make that number happen.

The core idea is that AI API caching works by matching the beginning of each request byte-for-byte against what was previously seen. If anything changes early in the request, everything after it must be recalculated. So the authors sorted all content by how often it changes, put the most stable content first, and pushed everything that changes each turn to the very end, outside any cached section.

They divide the request into four labeled blocks. The first holds the AI persona, language rules, tool descriptions, and long-term memory, because those almost never change. The second holds a daily content file that updates once per day. The third holds a compressed summary of the current conversation session, which is regenerated roughly every 80,000 tokens. The fourth is a rolling marker placed on the second-to-last user message, which pulls all the conversation history into the cache boundary. The final user message, which is new every turn, sits outside any cache marker alongside dynamic content like timestamps and per-turn memory lookups.

Beyond the request structure, they discovered that sticky routing matters: if the API load balancer sends different requests to different backend servers, a cache written on one server cannot be read by another. Their fix is to send a fixed user_id in the request metadata so the provider routes all their traffic to the same backend. Without this, they say, the cache only writes and never reads.
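The layout described above can be sketched as a plain request payload. This is a minimal illustration and not the authors' code: the helper names (text_block, build_request), the argument names, and the choice to append the dynamic content to the final user message are all assumptions. Tool definitions are left out of the sketch. The cache_control marker and the metadata user_id field follow the shape of the Anthropic Messages API; whether a fixed user_id actually produces sticky routing is the authors' observation.

```python
from typing import Any

EPHEMERAL = {"type": "ephemeral"}  # cache_control value used for prompt caching


def text_block(text: str, cached: bool = False) -> dict[str, Any]:
    """Build a text content block; cached=True ends a cached prefix here."""
    block: dict[str, Any] = {"type": "text", "text": text}
    if cached:
        block["cache_control"] = dict(EPHEMERAL)
    return block


def build_request(model: str, stable: str, daily: str, summary: str,
                  history: list[dict[str, str]], dynamic: str,
                  user_id: str) -> dict[str, Any]:
    """Assemble a payload ordered from most stable to most volatile.

    history is a list of {"role", "content"} turns that must end with
    the new user message (hypothetical input format for this sketch).
    """
    if not history or history[-1]["role"] != "user":
        raise ValueError("history must end with the new user message")

    # Breakpoints 1-3: stable persona/memory, daily file, session summary.
    system = [
        text_block(stable, cached=True),
        text_block(daily, cached=True),
        text_block(summary, cached=True),
    ]

    messages = [
        {"role": turn["role"], "content": [text_block(turn["content"])]}
        for turn in history
    ]

    # Breakpoint 4: rolling marker on the second-to-last user message.
    user_positions = [i for i, m in enumerate(messages) if m["role"] == "user"]
    if len(user_positions) >= 2:
        target = messages[user_positions[-2]]["content"][-1]
        target["cache_control"] = dict(EPHEMERAL)

    # Per-turn content (timestamps, memory lookups) goes after every marker.
    messages[-1]["content"].append(text_block(dynamic))

    return {
        "model": model,
        "max_tokens": 1024,
        "system": system,
        "messages": messages,
        "metadata": {"user_id": user_id},  # fixed value on every request
    }
```

With the official Python SDK, a payload of this shape could be passed as keyword arguments to the messages create call. The sketch uses exactly four markers, which matches the four blocks the authors describe.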
The document also covers how this setup differs across providers: direct Anthropic connections, OpenRouter, and generic OpenAI-compatible proxies each need slightly different handling. It ends with seven hard-won rules, including keeping the system prompt split into stable and volatile sections, never letting variable content appear before the last cache breakpoint, and keeping tool lists in a fixed order since reordering them breaks the cache prefix.
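The rule about tool order can be guarded locally. The following is a sketch under stated assumptions, not something taken from the repository: canonical_tools and prefix_fingerprint are hypothetical helpers, and the hash is only a local drift detector that says nothing about how a provider keys its cache. The idea is to sort tools by name before sending and to log a fingerprint of the stable prefix, so an unexpected change shows up before it costs a cache miss.

```python
import hashlib
import json
from typing import Any


def canonical_tools(tools: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Return the tools in a fixed order (sorted by name)."""
    return sorted(tools, key=lambda tool: tool["name"])


def prefix_fingerprint(tools: list[dict[str, Any]], stable_system: str) -> str:
    """Hash the stable part of a request so drift between turns is visible."""
    payload = json.dumps(
        {"tools": canonical_tools(tools), "system": stable_system},
        sort_keys=True,       # fixed key order inside each tool definition
        ensure_ascii=False,   # keep non-ASCII prompt text as-is
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Comparing the fingerprint of the current turn with the previous one gives a cheap check: if the two differ while nothing stable was meant to change, some variable content has leaked into the cached prefix.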


Verify against the repo before relying on details.