Study the full lifecycle of a large language model chapter by chapter, from architecture basics to deployment.
Learn efficient fine-tuning methods to adapt a pre-trained model to a specific task without retraining from scratch.
Understand the social harms of AI, such as bias and hallucination, and the legal questions around training data, for policy or governance work.
So-Large-LM is an open-source Chinese-language educational project that teaches how large language models work, from foundational concepts through practical training and deployment. It is maintained by Datawhale, a Chinese open-source learning community, and is structured as a 14-chapter course rooted in the Stanford CS324 curriculum and a generative AI course by Professor Hung-yi Lee.

The course covers the full lifecycle of a large language model. Early chapters explain the architecture decisions that make these models work: how the Transformer structure processes language, how positional encoding helps the model understand word order, and how attention mechanisms let the model weigh relationships between words. Later chapters cover data preparation, training strategies, and efficient fine-tuning methods that let researchers adapt a pre-trained model to a specific task without retraining it from scratch.

The project also addresses topics that many purely technical tutorials skip: environmental costs such as carbon emissions from large training runs, legal questions around copyright and fair use of training data, and social harms including bias and hallucination. It also covers how AI agents are structured, and a dedicated chapter traces the full history of Meta's Llama model family from version 1 through version 3.

The README and all course content are written in Chinese. Companion video lectures are available on Bilibili. Datawhale positions this project as the theoretical foundation in a three-part learning path, with separate sibling repositories covering hands-on application development and open-source model deployment. The target audience includes students, researchers, industry professionals, and policy specialists who want a thorough grounding in how large language models are built and governed.
Verify against the repo before relying on details.