younissk/pirate_llm

★ 20 | Python | Audience · researcher | Complexity · 4/5 | Active | Setup · hard

Mindmap

mindmap
  root((pirate-llm))
    Inputs
      TinyStories dataset
      Piratize rules
      Custom BPE tokenizer
    Outputs
      nanoBeard weights
      Generated pirate text
      Training metadata
    Use Cases
      Learn GPT internals
      Reproduce a tiny LLM run
      Experiment with theme datasets
    Tech Stack
      Python
      PyTorch
      CUDA
      Hugging Face


Things people build with this

USE CASE 1

Train a tiny GPT from scratch on a themed dataset as a learning exercise

USE CASE 2

Load the released nanoBeard weights and generate pirate-flavored text

USE CASE 3

Reuse the piratize.py script to rewrite a different corpus in a custom style

USE CASE 4

Study a minimal decoder-only Transformer in PyTorch with BPE tokenization

Tech stack

Python, PyTorch, CUDA, Hugging Face, BPE

Getting it running

Difficulty · hard | Time to first run · 1 day+

Pre-training and fine-tuning need a CUDA GPU with bfloat16 support and TinyStories data prep.
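Before committing to a long run, it can help to confirm the hardware requirement by asking PyTorch directly. The helper below is a minimal sketch and is not part of the repo: the name `bf16_cuda_ready` is hypothetical, and it returns `False` instead of raising when PyTorch is not installed.

```python
def bf16_cuda_ready() -> bool:
    """Return True only if PyTorch sees a CUDA GPU that supports bfloat16.

    Hypothetical helper for illustration; not taken from the pirate_llm repo.
    """
    try:
        import torch  # imported lazily so the check degrades gracefully
    except ImportError:
        return False
    # Only query bfloat16 support when a CUDA device is actually visible.
    return bool(torch.cuda.is_available() and torch.cuda.is_bf16_supported())


if __name__ == "__main__":
    print("bfloat16 CUDA training possible:", bf16_cuda_ready())
```

If this prints `False`, the training scripts are unlikely to run as described. Loading the released weights for generation may still work on other hardware, but check that against the repo.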

In plain English

pirate_llm is the source code for nanoBeard, a small pirate-themed language model trained from scratch as a learning project. The author describes it as closer in spirit to nanoGPT than to any production language model. The trained model itself lives on the Hugging Face Hub under younissk/nanoBeard, and this GitHub repo holds the training code and tokenizer.

The model is a decoder-only Transformer in the GPT style with about 13.9 million parameters, 6 layers, 6 attention heads, an embedding size of 384, and a context window of only 256 tokens. It uses a custom byte-pair-encoding tokenizer with a vocabulary of 8192 tokens, stored in a file called pirate_bpe.json. The released bundle on the Hub includes the weights as model.safetensors, an architecture config, the tokenizer file, a training metadata snapshot, and a banner image.

Training happened in two stages. First, the model was pre-trained on TinyStories, a small synthetic story dataset, after the stories were rewritten in pirate-speak using a rule-based script included in the repo at dataset/piratize.py. Then a short supervised fine-tuning stage ran for 1400 iterations, ending at a validation loss of around 4.28. The optimizer was AdamW with a linear warmup followed by a cosine decay, and training used bfloat16 on a CUDA GPU.

The README includes a Python snippet showing how to download the weights, the config, and the tokenizer from the Hub, then load them into the custom GPT class from this repo and generate text. The author is upfront about the limits: tiny vocabulary, narrow grammar, no safety tuning, and outputs that are pirate-flavored nonsense at best. It is meant as an educational artifact, not a usable chat model.
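The architecture figures quoted above (6 layers, embedding size 384, vocabulary 8192, context 256) can be sanity-checked against the stated size of about 13.9 million parameters. The sketch below assumes a nanoGPT-style layout, none of which is confirmed by this summary: learned positional embeddings, a 4x-wide MLP, bias-free linear layers, LayerNorm with a gain vector only, and an output head that shares weights with the token embedding. The function name `gpt_param_count` is hypothetical.

```python
def gpt_param_count(n_layer: int, n_embd: int, vocab_size: int,
                    block_size: int, tied: bool = True) -> int:
    """Estimate parameters of a GPT-style decoder under the assumptions above."""
    tok_emb = vocab_size * n_embd          # token embedding table
    pos_emb = block_size * n_embd          # learned positional embeddings (assumed)
    attn = 4 * n_embd * n_embd             # q, k, v projections plus output projection
    mlp = 8 * n_embd * n_embd              # d -> 4d and 4d -> d (assumed 4x width)
    norms_per_block = 2 * n_embd           # two LayerNorm gain vectors per block
    blocks = n_layer * (attn + mlp + norms_per_block)
    final_norm = n_embd                    # LayerNorm before the output head
    lm_head = 0 if tied else vocab_size * n_embd  # shared with tok_emb when tied
    return tok_emb + pos_emb + blocks + final_norm + lm_head


total = gpt_param_count(n_layer=6, n_embd=384, vocab_size=8192, block_size=256)
print(f"{total:,} parameters (~{total / 1e6:.1f}M)")
# prints: 13,865,856 parameters (~13.9M)
```

The number of attention heads does not appear in the formula because splitting the embedding into heads does not change the size of the projection matrices. Under these assumptions, an untied output head would add another 8192 × 384 weights and push the total to roughly 17.0 million, so the quoted 13.9 million is at least consistent with weight tying. The repo's config is the authority on the real layout.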

Copy-paste prompts

Prompt 1

Walk me through running piratize.py on TinyStories and starting pre-training of nanoBeard

Prompt 2

Download nanoBeard from Hugging Face and load it into the GPT class from this repo to generate sample text

Prompt 3

Increase the context window from 256 to 512 tokens and tell me which other config values need to change

Prompt 4

Swap the BPE tokenizer for a SentencePiece tokenizer while keeping the rest of the training loop intact

Prompt 5

Explain how the AdamW warmup and cosine decay schedule is wired up in this codebase


Generated 2026-05-22 · Model: sonnet-4-6 · Verify against the repo before relying on details.