Train a tiny GPT from scratch on a themed dataset as a learning exercise
Load the released nanoBeard weights and generate pirate-flavored text
Reuse the piratize.py script to rewrite a different corpus in a custom style
Study a minimal decoder-only Transformer in PyTorch with BPE tokenization
Pre-training and fine-tuning require a CUDA GPU with bfloat16 support, and the TinyStories data has to be prepared first.
pirate_llm is the source code for nanoBeard, a small pirate-themed language model trained from scratch as a learning project. The author describes it as closer in spirit to nanoGPT than to any production language model. The trained model itself lives on the Hugging Face Hub under younissk/nanoBeard, and this GitHub repo holds the training code and tokenizer.

The model is a decoder-only Transformer in the GPT style with about 13.9 million parameters, 6 layers, 6 attention heads, an embedding size of 384, and a context window of only 256 tokens. It uses a custom byte-pair-encoding tokenizer with a vocabulary of 8192 tokens, stored in a file called pirate_bpe.json. The released bundle on the Hub includes the weights as model.safetensors, an architecture config, the tokenizer file, a training metadata snapshot, and a banner image.

Training happened in two stages. First, the model was pre-trained on TinyStories, a small synthetic story dataset, after the stories were rewritten in pirate-speak using a rule-based script included in the repo at dataset/piratize.py. Then a short supervised fine-tuning stage ran for 1400 iterations, ending at a validation loss of around 4.28. The optimizer was AdamW with a linear warmup followed by a cosine decay, and training used bfloat16 on a CUDA GPU.

The README includes a Python snippet showing how to download the weights, the config, and the tokenizer from the Hub, then load them into the custom GPT class from this repo and generate text. The author is upfront about the limits: tiny vocabulary, narrow grammar, no safety tuning, and outputs that are pirate-flavored nonsense at best. It is meant as an educational artifact, not a usable chat model.
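
The architecture figures quoted above (6 layers, embedding size 384, vocabulary 8192, context 256) can be sanity-checked against the "about 13.9 million parameters" claim with a little arithmetic. The sketch below is not taken from the repo. It assumes a nanoGPT-style layout: learned positional embeddings, a 4x MLP expansion, no bias terms, and an output head that shares its weights with the token embedding. The function name count_gpt_params is made up for illustration.

```python
# Rough parameter count for a GPT-style decoder, using the figures quoted above.
# Assumptions (not confirmed by the source): learned positional embeddings,
# 4x MLP expansion, no bias terms, output head tied to the token embedding.

def count_gpt_params(n_layer, d_model, vocab_size, block_size):
    tok_emb = vocab_size * d_model        # token embedding (shared with the output head)
    pos_emb = block_size * d_model        # learned positional embedding
    attn = 4 * d_model * d_model          # Q, K, V and output projections
    mlp = 2 * d_model * (4 * d_model)     # up-projection and down-projection
    norms = 2 * d_model                   # two LayerNorm weight vectors per block
    per_block = attn + mlp + norms
    final_norm = d_model                  # LayerNorm before the output head
    return tok_emb + pos_emb + n_layer * per_block + final_norm


total = count_gpt_params(n_layer=6, d_model=384, vocab_size=8192, block_size=256)
print(f"{total:,} parameters (~{total / 1e6:.1f}M)")
# prints: 13,865,856 parameters (~13.9M)
```

Under these assumptions the total lands on roughly 13.9M, which matches the stated size. The number of attention heads does not appear in the count because splitting the 384-wide projections into 6 heads does not change the number of weights. If the real model uses biases or an untied output head, the total would come out higher.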
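
The learning-rate schedule mentioned above, a linear warmup followed by a cosine decay, can be sketched in a few lines. This is a generic illustration, not the repo's code. The function name lr_at and every hyperparameter value here (peak and floor learning rates, warmup length) are hypothetical. Only the 1400-step horizon echoes the fine-tuning length mentioned above, and it is used purely as an example.

```python
import math


def lr_at(step, max_lr, min_lr, warmup_steps, total_steps):
    """Learning rate at a given step: linear warmup, then cosine decay to min_lr."""
    if step < warmup_steps:
        # Linear ramp that reaches max_lr on the last warmup step.
        return max_lr * (step + 1) / warmup_steps
    if step >= total_steps:
        # Past the schedule horizon, hold at the floor.
        return min_lr
    # Fraction of the decay phase completed, in [0, 1).
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))  # goes from 1 down to 0
    return min_lr + cosine * (max_lr - min_lr)


# Hypothetical settings, chosen only to show the shape of the curve.
schedule = [lr_at(s, max_lr=3e-4, min_lr=3e-5, warmup_steps=100, total_steps=1400)
            for s in range(1400)]
print(f"start {schedule[0]:.2e}, peak {max(schedule):.2e}, end {schedule[-1]:.2e}")
```

In a PyTorch training loop, a function like this is typically evaluated once per iteration and written into each param_group["lr"] of the AdamW optimizer before the optimizer step.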
Generated 2026-05-22 · Model: sonnet-4-6 · Verify against the repo before relying on details.