Train a Chinese language model from scratch on a single GPU using the provided pre-training pipeline and ModelScope data.
Study how a real language model is built by reading clean, framework-free PyTorch code without DeepSpeed complexity.
Fine-tune the pre-trained GWen weights on your own Chinese text dataset using SFT, LoRA, or DPO.
Experiment with modern AI architecture choices like hybrid attention layers in a small, fast-to-iterate codebase.
Requires a GPU (a single GPU is sufficient for the smaller sizes) and a download of the pre-trained weights from ModelScope.
GWen (Chinese name: Gewan) is a small Chinese language model built from scratch using PyTorch. The goal is to make it easy to train, read, and modify a working language model without needing enormous computing resources or complex infrastructure. The entire training pipeline is written in plain PyTorch, avoiding heavy frameworks like DeepSpeed, so the code stays transparent and straightforward to follow.
The model comes in three sizes: approximately 80 million, 118 million, and 252 million parameters. These are very small compared to production AI assistants, which often run into the tens of billions of parameters. The small size is intentional: it makes training feasible on a single GPU and lets researchers experiment with architecture choices without waiting days for results. The vocabulary is also kept deliberately small at 8,192 tokens (most Chinese-capable models use 50,000 or more), which shrinks the embedding layer and leaves more of the parameter budget for the transformer layers.
The architecture uses a hybrid attention setup: in every group of four layers, three use an efficient attention variant called Gated DeltaNet and one uses standard full attention. This combination is designed to balance long-range language modeling ability with the stability that full attention provides. Other design choices include grouped query attention (which reduces key-value cache memory during inference), RMS normalization instead of standard layer normalization, and a SwiGLU activation function in the feed-forward layers, all of which are common in modern language models.
The training pipeline covers the full lifecycle: pre-training on raw text, supervised fine-tuning (SFT) either as a full parameter update or as a lightweight LoRA update, preference optimization via DPO, and an experimental GRPO stage. Pre-trained weights and training data are available to download from ModelScope. The README and most of the project documentation are written in Chinese.
The commands and code are readable regardless of language, but the explanatory text will need translation for non-Chinese readers.
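The 3:1 layer pattern and the effect of the small vocabulary can be illustrated with a short sketch. This is not code from the repository: the function names, the position of the full-attention layer within each group (assumed here to be the last of the four), and the hidden size of 512 are all assumptions made for illustration. Only the 8,192-token vocabulary, the 50,000-token comparison point, the roughly 80 million parameter size, and the three-to-one ratio come from the description above.

```python
def layer_schedule(num_layers, group_size=4):
    """Return the attention type of each layer, in order.

    Assumption: the last layer in every group of `group_size` layers
    uses full attention; the others use Gated DeltaNet. The description
    only states the 3:1 ratio, not the position within the group.
    """
    schedule = []
    for i in range(num_layers):
        is_full = (i % group_size) == group_size - 1
        schedule.append("full_attention" if is_full else "gated_deltanet")
    return schedule


def embedding_params(vocab_size, hidden_size):
    """Parameter count of a single token-embedding matrix."""
    return vocab_size * hidden_size


# Hypothetical hidden size, chosen only to make the arithmetic concrete.
HIDDEN_SIZE = 512
# Approximate size of the smallest model, from the description.
TOTAL_PARAMS = 80_000_000

small_vocab = embedding_params(8_192, HIDDEN_SIZE)
large_vocab = embedding_params(50_000, HIDDEN_SIZE)

print(layer_schedule(8))
print(f"8,192-token embedding:  {small_vocab:,} parameters "
      f"({small_vocab / TOTAL_PARAMS:.1%} of an 80M budget)")
print(f"50,000-token embedding: {large_vocab:,} parameters "
      f"({large_vocab / TOTAL_PARAMS:.1%} of an 80M budget)")
```

Under these assumed numbers the embedding matrix takes roughly 5% of an 80 million parameter budget with the small vocabulary, against roughly 32% with a 50,000-token vocabulary. The actual figures depend on the model's real hidden size and on whether the output projection shares weights with the embedding, neither of which is stated here.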
Repository author: juncheng0178-del.
Verify against the repo before relying on details.