explaingit

kimiyoung/transformer-xl

Analysis updated 2026-07-03

Stars · 3,702   Language · Python   Audience · researcher   Complexity · 5/5   Setup · hard

TLDR

Research code for Transformer-XL, an AI language model that uses a memory mechanism to carry context across document chunks, letting it reference text from much earlier in a passage than standard transformers.

Mindmap

mindmap
  root((transformer-xl))
    What it does
      Long context language model
      Memory across chunks
      Text generation and prediction
    Key ideas
      Segment recurrence
      Extended attention span
      Below 1.0 character benchmark
    Implementations
      PyTorch multi GPU
      TensorFlow multi GPU
      TPU training
    Getting started
      Pre-trained weights included
      PyTorch subfolder README
      TensorFlow subfolder README

What do people build with it?

USE CASE 1

Reproduce the Transformer-XL benchmark results using the provided pre-trained model weights without training from scratch.

USE CASE 2

Train a long-context language model across multiple GPUs or Google TPUs using the included training scripts.

USE CASE 3

Study the segment-level memory mechanism as a reference implementation when building or comparing transformer architectures.

USE CASE 4

Fine-tune the Transformer-XL architecture on a custom long-document NLP task using the PyTorch implementation.

What is it built with?

Python, PyTorch, TensorFlow, CUDA, TPU

How does it compare?

Repo | kimiyoung/transformer-xl | websocket-client/websocket-client | facebookresearch/reagent
Stars | 3,702 | 3,701 | 3,699
Language | Python | Python | Python
Setup difficulty | hard | easy | moderate
Complexity | 5/5 | 2/5 | 4/5
Audience | researcher | developer | researcher

Figures from each repo's GitHub metadata at analysis time.

How do you get it running?

Difficulty · hard   Time to first run · 1 day+

Requires GPU or TPU hardware. Multi-machine training needs Google TPU access, and training from scratch is highly compute-intensive.
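Because hardware is the main hurdle, it can help to confirm what the machine exposes before following either subfolder README. The following is a minimal sketch, assuming the PyTorch route. `describe_accelerators` is a hypothetical helper name, not something the repository provides, and the check does not cover TPUs.

```python
def describe_accelerators() -> dict:
    """Report whether PyTorch is importable and how many CUDA GPUs it can see.

    Hypothetical helper for illustration only; it is not part of the
    transformer-xl repository.
    """
    info = {"torch_installed": False, "cuda_available": False, "gpu_count": 0}
    try:
        import torch  # optional: the function still works when PyTorch is absent
    except ImportError:
        return info

    info["torch_installed"] = True
    info["cuda_available"] = bool(torch.cuda.is_available())
    if info["cuda_available"]:
        info["gpu_count"] = int(torch.cuda.device_count())
    return info


if __name__ == "__main__":
    print(describe_accelerators())
```

A `gpu_count` of 0 suggests the training scripts are unlikely to be practical on that machine.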

In plain English

Transformer-XL is research code released alongside an academic paper of the same name. The project proposes a change to how language models, the type of AI system that predicts and generates text, handle long documents. Standard transformer models process text in fixed-length chunks and lose context that appeared earlier in the document. Transformer-XL introduces a memory mechanism that lets the model carry information forward across chunks, so it can reference words and patterns from much earlier in a piece of text.

The repository provides implementations in both PyTorch and TensorFlow, two popular machine learning frameworks. The TensorFlow version supports training across multiple GPUs on a single machine and also across multiple machines using Google TPU hardware. The PyTorch version supports multi-GPU training on a single machine.

The paper reports that Transformer-XL set new top scores on several standard language modeling benchmarks at the time of publication. It was also the first model to score below 1.0 on a character-level language modeling benchmark, where the metric is bits per character and lower is better. Pre-trained model weights are included so that researchers can reproduce the reported results without training from scratch.

This repository is primarily aimed at machine learning researchers and engineers who want to study or build on the work. It is not a general-purpose tool for end users. The README is brief and points to subfolder READMEs in the tf/ and pytorch/ directories for setup and training instructions.
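The memory mechanism described above can be illustrated with a toy sketch. This is not the repository's code: it uses plain token IDs as stand-ins for the hidden states a real model would cache, and the names `split_into_segments`, `visible_context`, `seg_len`, and `mem_len` are assumptions chosen for this example. It only shows the bookkeeping: each chunk is processed together with a fixed-size cache carried over from the chunks before it.

```python
from typing import List, Tuple


def split_into_segments(tokens: List[int], seg_len: int) -> List[List[int]]:
    """Cut a long sequence into fixed-length chunks (the last may be shorter)."""
    return [tokens[i:i + seg_len] for i in range(0, len(tokens), seg_len)]


def visible_context(
    tokens: List[int], seg_len: int, mem_len: int
) -> List[Tuple[List[int], List[int]]]:
    """Return (memory, segment) pairs, one per chunk.

    `memory` is what was cached from earlier chunks and `segment` is the
    current chunk. Token IDs stand in for cached hidden states here; this
    is an illustrative simplification, not the model itself.
    """
    memory: List[int] = []
    steps = []
    for segment in split_into_segments(tokens, seg_len):
        steps.append((list(memory), list(segment)))
        # Carry forward the most recent `mem_len` positions for the next chunk.
        # The explicit check matters because a slice of [-0:] keeps everything.
        memory = (memory + segment)[-mem_len:] if mem_len > 0 else []
    return steps


if __name__ == "__main__":
    for memory, segment in visible_context(list(range(10)), seg_len=4, mem_len=4):
        print(f"memory={memory} current={segment}")
```

Setting `mem_len=0` in this sketch reproduces the standard fixed-chunk behavior, where every chunk starts with no access to earlier text.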

Copy-paste prompts

Prompt 1
Using the Transformer-XL PyTorch implementation, how do I load the pre-trained weights and generate text conditioned on a long input passage?
Prompt 2
How do I run the Transformer-XL TensorFlow training script on multiple GPUs for character-level language modeling, following the tf/ subdirectory README?
Prompt 3
What is the segment-level recurrence mechanism in Transformer-XL and how does it differ from standard transformer self-attention? Walk me through the relevant code.
Prompt 4
How do I adapt the Transformer-XL PyTorch code to fine-tune on a custom dataset of long documents?
Prompt 5
What benchmark datasets and metrics does Transformer-XL report results on, and how do I reproduce those evaluations from this repository?

Frequently asked questions

What is transformer-xl?

Research code for Transformer-XL, an AI language model that uses a memory mechanism to carry context across document chunks, letting it reference text from much earlier in a passage than standard transformers.

What language is transformer-xl written in?

Mainly Python. The stack also includes PyTorch and TensorFlow.

How hard is transformer-xl to set up?

Setup difficulty is rated hard, with roughly a day or more to a first successful run.

Who is transformer-xl for?

Mainly researchers.

Verify against the repo before relying on details.