kimiyoung/transformer-xl

Analysis updated 2026-07-03

★ 3,702 · Python · Audience: researcher · Complexity: 5/5 · Setup: hard

Mindmap

mindmap
  root((transformer-xl))
    What it does
      Long context language model
      Memory across chunks
      Text generation and prediction
    Key ideas
      Segment recurrence
      Extended attention span
      Below 1.0 on character-level benchmark
    Implementations
      PyTorch multi GPU
      TensorFlow multi GPU
      TPU training
    Getting started
      Pre-trained weights included
      PyTorch subfolder README
      TensorFlow subfolder README


What do people build with it?

USE CASE 1

Reproduce the Transformer-XL benchmark results using the provided pre-trained model weights without training from scratch.

USE CASE 2

Train a long-context language model across multiple GPUs or Google TPUs using the included training scripts.

USE CASE 3

Study the segment-level memory mechanism as a reference implementation when building or comparing transformer architectures.

USE CASE 4

Fine-tune the Transformer-XL architecture on a custom long-document NLP task using the PyTorch implementation.

What is it built with?

Python · PyTorch · TensorFlow · CUDA · TPU

How does it compare?

	kimiyoung/transformer-xl	websocket-client/websocket-client	facebookresearch/reagent
Stars	3,702	3,701	3,699
Language	Python	Python	Python
Setup difficulty	hard	easy	moderate
Complexity	5/5	2/5	4/5
Audience	researcher	developer	researcher

Figures from each repo's GitHub metadata at analysis time.

How do you get it running?

Difficulty · hard | Time to first run · 1 day+

Requires GPU or TPU hardware. Multi-machine training needs Google TPU access, and training from scratch is highly compute-intensive.

In plain English

Transformer-XL is research code released alongside an academic paper of the same name. The project proposes a change to how language models, the type of AI system that predicts and generates text, handle long documents. Standard transformer models process text in fixed-length chunks and lose context that appeared earlier in the document. Transformer-XL introduces a memory mechanism that lets the model carry information forward across chunks, so it can reference words and patterns from much earlier in a piece of text.

The repository provides implementations in both PyTorch and TensorFlow, two popular machine learning frameworks. The TensorFlow version supports training across multiple GPUs on a single machine and also across multiple machines using Google TPU hardware. The PyTorch version supports multi-GPU training on a single machine.

The paper reports that Transformer-XL set new top scores on several standard language modeling benchmarks at the time of publication, and was the first model to score below 1.0 on a character-level language modeling task (lower scores are better on the specific metric used). Pre-trained model weights are included so that researchers can reproduce the reported results without training from scratch.

This repository is primarily aimed at machine learning researchers and engineers who want to study or build on the work. It is not a general-purpose tool for end users. The README is brief and points to subfolder READMEs in the tf/ and pytorch/ directories for setup and training instructions.
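
The idea of carrying context across chunks can be pictured with a toy sketch. This is not the repository's code: it works on plain tokens, whereas the actual model reuses cached hidden states from earlier segments. The names `segments_with_memory`, `seg_len`, and `mem_len` are hypothetical and chosen only for illustration.

```python
from typing import List, Sequence, Tuple


def segments_with_memory(
    tokens: Sequence[str], seg_len: int, mem_len: int
) -> List[Tuple[List[str], List[str]]]:
    """Split tokens into fixed-length segments, pairing each with its memory.

    Returns (memory, segment) pairs. `memory` holds up to `mem_len` of the
    most recent tokens seen before the segment. With mem_len=0 this reduces
    to the fixed-chunk baseline, where earlier context is lost entirely.
    """
    if seg_len <= 0:
        raise ValueError("seg_len must be positive")
    if mem_len < 0:
        raise ValueError("mem_len must be non-negative")

    pairs: List[Tuple[List[str], List[str]]] = []
    memory: List[str] = []
    for start in range(0, len(tokens), seg_len):
        segment = list(tokens[start:start + seg_len])
        # The segment is processed with the carried-over memory as extra context.
        pairs.append((list(memory), segment))
        # Slide the memory window forward: keep only the newest mem_len items.
        memory = (memory + segment)[-mem_len:] if mem_len > 0 else []
    return pairs


if __name__ == "__main__":
    text = "a b c d e f g".split()
    for memory, segment in segments_with_memory(text, seg_len=3, mem_len=2):
        print("memory:", memory, "segment:", segment)
```

Because the memory is refreshed after every segment, information can in principle propagate further than `mem_len` in a layered model, since each cached state already summarizes earlier context. The token-level toy above shows only the sliding window itself.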

Copy-paste prompts

Prompt 1

Using the Transformer-XL PyTorch implementation, how do I load the pre-trained weights and generate text conditioned on a long input passage?

Prompt 2

How do I run the Transformer-XL TensorFlow training script on multiple GPUs for character-level language modeling, following the tf/ subdirectory README?

Prompt 3

What is the segment-level recurrence mechanism in Transformer-XL and how does it differ from standard transformer self-attention? Walk me through the relevant code.

Prompt 4

How do I adapt the Transformer-XL PyTorch code to fine-tune on a custom dataset of long documents?

Prompt 5

What benchmark datasets and metrics does Transformer-XL report results on, and how do I reproduce those evaluations from this repository?

Frequently asked questions

What is transformer-xl?

Research code for Transformer-XL, an AI language model that uses a memory mechanism to carry context across document chunks, letting it reference text from much earlier in a passage than standard transformers.

What language is transformer-xl written in?

Mainly Python. The stack also includes PyTorch and TensorFlow, with CUDA and TPU support.

How hard is transformer-xl to set up?

Setup difficulty is rated hard, with roughly a day or more to a first successful run.

Who is transformer-xl for?

Mainly researchers.


Verify against the repo before relying on details.