Study each Transformer component (attention, positional encoding, encoder, decoder) by reading the code alongside the README diagrams.
Run the included German-to-English translation training example to see how encoder-decoder models are trained end to end.
Use this as a starting point before reading the original "Attention Is All You Need" paper, so that you meet the architecture in code first.
Requires PyTorch and the WMT 2014 dataset. The project is not actively maintained, so dependency versions may need adjustment.
This repository contains one person's Python implementation of the Transformer architecture, a neural network design that became highly influential in AI after a 2017 Google paper titled "Attention Is All You Need." The author wrote it in 2019 as a personal learning project and includes an upfront warning that they were not fully familiar with the model at the time, so the code should not be treated as a definitive reference.

The Transformer is a type of model used to process sequences of text, for example to translate sentences from one language to another. Where older approaches processed words one by one in order, the Transformer looks at all words in a sentence simultaneously and works out which ones are most relevant to each other. The key mechanism for this is called attention, and the code here implements it along with the other building blocks: positional encoding (which tells the model where each word sits in a sentence), multi-head attention (which runs several attention calculations in parallel), feed-forward layers, and layer normalization.

The project is structured around an encoder and a decoder. The encoder reads the input sentence and builds a representation of it. The decoder takes that representation and generates the output sentence word by word. The README walks through each component with code snippets and diagrams, making it useful as a study guide for anyone trying to understand how the architecture works from the inside.

The included training example uses the WMT 2014 German-to-English translation dataset. Configuration options such as batch size, number of attention heads, layer depth, and learning rate are set in a separate configuration file.

Because this is a personal study project from 2019, the author notes they are not actively maintaining it. Contributions via pull requests are welcome if someone finds a bug.
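To make the positional-encoding and attention ideas described above more concrete, here is a minimal, dependency-free sketch. It is not taken from the repository, which uses PyTorch. The function names (`positional_encoding`, `softmax`, `scaled_dot_product_attention`) are hypothetical and chosen only for illustration. The sketch assumes the sinusoidal encoding and the scaled dot-product formula from the original paper, and it leaves out masking, learned projections, and the multi-head split.

```python
import math


def positional_encoding(max_len, d_model):
    """Sinusoidal position table (illustrative name): one row per position."""
    table = []
    for pos in range(max_len):
        row = []
        for i in range(d_model):
            # Dimensions 2k and 2k+1 share the frequency 1 / 10000^(2k / d_model).
            angle = pos / (10000 ** ((i - i % 2) / d_model))
            # Even dimensions use sine, odd dimensions use cosine.
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(queries, keys, values):
    """Return (outputs, weights) for lists of vectors; no masking is applied."""
    d_k = len(keys[0])
    width = len(values[0])
    outputs, weights = [], []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys]
        w = softmax(scores)
        # Each output is a weighted average of the value vectors.
        outputs.append([sum(w_j * v[c] for w_j, v in zip(w, values)) for c in range(width)])
        weights.append(w)
    return outputs, weights


# Self-attention over four positions, using the encodings themselves as input.
tokens = positional_encoding(max_len=4, d_model=8)
outputs, weights = scaled_dot_product_attention(tokens, tokens, tokens)
```

Each row of `weights` sums to 1 and says how strongly one position attends to every other position. A full model would first pass the inputs through learned query, key, and value projections, and multi-head attention would repeat this calculation on several smaller slices in parallel.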
← hyunwoongko on gitmyhub — every repo by this author, as a profile.
Verify against the repo before relying on details.