mit-han-lab/streaming-llm

★ 7,227 | Python | Audience · researcher | Complexity · 4/5 | Setup · hard

Mindmap

mindmap
  root((StreamingLLM))
    Core idea
      Attention sink tokens
      Keep first and recent
      Discard middle tokens
    Benefits
      Infinite chat sessions
      No memory reset
      22x faster than sliding
    Supported models
      Llama-2
      Falcon
      MPT and Pythia
    Integrations
      HuggingFace
      NVIDIA TensorRT
      Intel extension


Things people build with this

USE CASE 1

Build a chat assistant that runs non-stop for hours without resetting context or crashing.

USE CASE 2

Integrate efficient streaming inference into Llama-2, Falcon, or MPT models via HuggingFace.

USE CASE 3

Research how attention sink tokens enable infinite-length generation without full-context recomputation.

Tech stack

Python, PyTorch, HuggingFace Transformers

Getting it running

Difficulty · hard Time to first run · 1h+

Requires GPU hardware and compatible HuggingFace model weights for Llama-2, Falcon, or MPT.

In plain English

StreamingLLM is a research project from MIT Han Lab that addresses a practical limitation of AI language models in long-running applications. Language models such as Llama-2 are trained to handle text only up to a certain length, called a context window. In a long back-and-forth chat session, the conversation can grow beyond that limit, at which point the model must either restart and forget what was said earlier, or spend significant compute time reprocessing the recent history. Both options are costly.

The key observation behind this project is called an attention sink. When a language model processes text, it assigns attention scores to tokens to decide which parts of the text to focus on. The researchers found that early tokens in a sequence receive very high attention scores regardless of how important they actually are, acting as a kind of anchor. Removing those early tokens causes the model's quality to drop noticeably, even if they contain little useful content.

StreamingLLM works by keeping two things in memory: the most recent tokens the model has seen, and the initial anchor tokens that serve as attention sinks. Everything in the middle gets discarded. This allows the model to keep running indefinitely without resetting its memory and without the cost of recomputing past states. According to the paper, this approach runs up to 22 times faster than an alternative method called sliding window recomputation.

It is important to understand what this does not do. The model's context window does not grow. The model cannot see or reason about the tokens that were discarded from the middle of a long conversation. Feeding an entire book into StreamingLLM and asking for a summary would only produce a summary of the final pages, because the model can only work with what is currently in its window. The project is designed for continuous dialogue and assistant-style applications where the model needs to keep running without crashing, not for tasks requiring full-document comprehension.

StreamingLLM has been integrated into HuggingFace Transformers, NVIDIA TensorRT-LLM, and Intel's extension for Transformers. It supports Llama-2, MPT, Falcon, and Pythia. The paper was accepted at ICLR 2024.
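The keep-first-and-recent policy described above can be sketched in a few lines of plain Python. This is an illustrative sketch under stated assumptions, not the repository's code: the names `evict_middle` and `stream` and the sizes `num_sink` and `num_recent` are hypothetical, and a list of integers stands in for the per-token key/value entries a real model would cache.

```python
from typing import Iterable, List, TypeVar

T = TypeVar("T")


def evict_middle(cache: List[T], num_sink: int, num_recent: int) -> List[T]:
    """Keep the first `num_sink` and the last `num_recent` entries.

    Hypothetical helper: entries between the two kept regions are dropped.
    """
    if num_sink < 0 or num_recent < 0:
        raise ValueError("num_sink and num_recent must be non-negative")
    if len(cache) <= num_sink + num_recent:
        return list(cache)  # nothing to evict yet
    # Slice from an explicit index so num_recent == 0 yields an empty tail.
    recent = cache[len(cache) - num_recent:]
    return cache[:num_sink] + recent


def stream(tokens: Iterable[T], num_sink: int = 4, num_recent: int = 8) -> List[T]:
    """Feed tokens one at a time; the cache never exceeds num_sink + num_recent."""
    cache: List[T] = []
    for tok in tokens:
        cache.append(tok)
        cache = evict_middle(cache, num_sink, num_recent)
    return cache


if __name__ == "__main__":
    # Illustrative sizes only; real deployments use much larger recent windows.
    final_cache = stream(range(100), num_sink=4, num_recent=8)
    print(final_cache)
```

The cache size stays bounded however long the stream runs, which is why memory does not grow during a long session. It also shows the limitation noted above: once a token falls out of the recent window it is gone, so the model cannot refer back to it.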

Copy-paste prompts

Prompt 1

Using StreamingLLM with a HuggingFace Llama-2 model, show me how to set up an infinite chat loop that never hits a context-length error.

Prompt 2

Explain the attention sink concept in StreamingLLM and show me how to configure the number of sink tokens and the recent token window size.

Prompt 3

I want to compare StreamingLLM against sliding-window recomputation on a long dialogue. Write a benchmark script that measures throughput for both.

Prompt 4

How do I apply StreamingLLM to a Falcon model? Show me the minimal code changes needed to enable streaming inference.


Verify against the repo before relying on details.