datawhalechina/so-large-lm

★ 7,262 · Audience · researcher · Complexity · 1/5 · Setup · easy

Mindmap

mindmap
  root((so-large-lm))
    Architecture
      Transformer basics
      Positional encoding
      Attention mechanism
    Training
      Data preparation
      Training strategies
      Efficient fine-tuning
    Responsible AI
      Carbon emissions
      Bias and hallucination
      Legal and copyright
    Applications
      AI agent design
      Llama model history
      Open-source deployment

Things people build with this

USE CASE 1

Study the full lifecycle of a large language model chapter by chapter, from architecture basics to deployment.

USE CASE 2

Learn efficient fine-tuning methods to adapt a pre-trained model to a specific task without retraining from scratch.

USE CASE 3

Understand AI social harms, bias, hallucination, and legal questions around training data for policy or governance work.

Getting it running

Difficulty · easy · Time to first run · 5 min

In plain English

So-Large-LM is an open-source Chinese-language educational project that teaches how large language models work, from foundational concepts through practical training and deployment. It is maintained by Datawhale, a Chinese open-source learning community, and is structured as a 14-chapter course rooted in the Stanford CS324 curriculum and a generative AI course by professor Hung-yi Lee.

The course covers the full lifecycle of a large language model. Early chapters explain the architecture decisions that make these models work: how the Transformer structure processes language, how positional encoding helps the model understand word order, and how attention mechanisms let the model weigh relationships between words. Later chapters cover data preparation, training strategies, and efficient fine-tuning methods that let researchers adapt a pre-trained model to a specific task without retraining it from scratch.

The project also addresses topics that many purely technical tutorials skip: environmental costs such as carbon emissions from large training runs, legal questions around copyright and fair use of training data, social harms including bias and hallucination, and how AI agents are structured. A dedicated chapter traces the full history of Meta's Llama model family from version 1 through version 3.

The README and all course content are written in Chinese. Companion video lectures are available on Bilibili. Datawhale positions this project as the theoretical foundation in a three-part learning path, with separate sibling repositories covering hands-on application development and open-source model deployment. The target audience includes students, researchers, industry professionals, and policy specialists who want a thorough grounding in how large language models are built and governed.
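
To make the attention idea from the early chapters concrete, here is a minimal sketch of scaled dot-product attention in plain Python. It is not taken from the course materials. The function names and the toy two-dimensional token vectors are illustrative assumptions, and the sketch leaves out the learned query/key/value projections, multiple heads, and masking that a real Transformer uses.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(queries, keys, values):
    """Hypothetical helper: each argument is a list of equal-length vectors."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
            for k in keys
        ]
        weights = softmax(scores)
        # Output is the weighted average of the value vectors.
        output = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ]
        outputs.append(output)
        all_weights.append(weights)
    return outputs, all_weights


# Toy "sentence" of three tokens with made-up 2-d embeddings (assumption).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = scaled_dot_product_attention(tokens, tokens, tokens)

for i, row in enumerate(weights):
    print(f"token {i} attends with weights {[round(w, 3) for w in row]}")
```

Each row of `weights` shows how strongly one token attends to every token in the input, including itself. In this toy input the first token gives more weight to the first and third tokens than to the second, because its vector overlaps with theirs and is orthogonal to the second.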

Copy-paste prompts

Prompt 1

Based on the so-large-lm course content on attention mechanisms, explain step by step how a Transformer model weighs relationships between words in an input sentence.

Prompt 2

What efficient fine-tuning methods does the so-large-lm course cover? Walk me through applying LoRA to adapt a pre-trained language model to a text classification task.

Prompt 3

Summarize the environmental cost concerns and carbon emission estimates around training large language models, as covered in the so-large-lm curriculum.

Verify against the repo before relying on details.