zai-org/chatglm2-6b

★ 15,593 · Python · Audience · researcher · Complexity · 3/5 · Setup · moderate

Mindmap

mindmap
  root((ChatGLM2-6B))
    What it is
      Bilingual LLM
      6B parameters
      Chinese and English
    Key improvements
      32K context window
      Faster inference
      Better benchmarks
    Quantization
      INT4 6GB GPU
      INT8 option
    Use cases
      Local chatbot
      Fine-tuning
      Research base

Things people build with this

USE CASE 1

Deploy a bilingual Chinese-English chatbot on a single consumer GPU with as little as 6GB VRAM

USE CASE 2

Fine-tune the model on your own dataset for a domain-specific assistant or research task

USE CASE 3

Run a local AI chat assistant with 32K context for long documents using the ChatGLM2-6B-32K variant

USE CASE 4

Use the model as a research base for studying bilingual language understanding and alignment

Tech stack

Python · PyTorch · Hugging Face Transformers

Getting it running

Difficulty · moderate · Time to first run · 30 min

Requires a CUDA GPU (at least 6GB of VRAM when using INT4 quantization) and PyTorch and Transformers installed with pip.

Free for research and personal use; commercial use requires registering through an online form.
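The setup described above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the repository's official script: the Hugging Face model ID `THUDM/chatglm2-6b`, the helper names `choose_precision`, `load_chatglm2`, and `chat_once`, and the 8GB and 13GB thresholds are illustrative and should be checked against the repo. Only the 6GB figure for INT4 comes from the text above. The `quantize()` and `chat()` methods are supplied by the model's own remote code, which is why `trust_remote_code=True` is needed.

```python
def choose_precision(vram_gb: float) -> str:
    """Pick a weight precision for the available GPU memory.

    Thresholds are illustrative assumptions, except the 6 GB INT4 figure.
    """
    if vram_gb >= 13:   # assumed: FP16 weights need roughly 13 GB
        return "fp16"
    if vram_gb >= 8:    # assumed: INT8 needs roughly 8 GB
        return "int8"
    if vram_gb >= 6:    # from the text: INT4 runs in about 6 GB
        return "int4"
    raise ValueError("less than 6 GB of VRAM; INT4 is unlikely to fit")


def load_chatglm2(precision: str, model_id: str = "THUDM/chatglm2-6b"):
    """Load tokenizer and model on the GPU (model_id is an assumed default)."""
    # Imported lazily so choose_precision works without transformers installed.
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
    model = AutoModel.from_pretrained(model_id, trust_remote_code=True)
    if precision == "fp16":
        model = model.half()
    else:
        # quantize() comes from the model's remote code, not from Transformers.
        model = model.quantize(4 if precision == "int4" else 8)
    return tokenizer, model.cuda().eval()


def chat_once(prompt: str, vram_gb: float) -> str:
    """Run a single chat turn and return the reply text."""
    tokenizer, model = load_chatglm2(choose_precision(vram_gb))
    response, _history = model.chat(tokenizer, prompt, history=[])
    return response
```

Calling `chat_once("你好", vram_gb=6.0)` would download the weights on first use and answer with INT4 weights. Quantizing at load time still needs enough CPU RAM to hold the unquantized weights first, so a pre-quantized checkpoint, if the repo provides one, may be the easier route on small machines.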

In plain English

ChatGLM2-6B is the second generation of ChatGLM-6B, an open-source bilingual (Chinese and English) conversational large language model. "6B" refers to its size: roughly six billion parameters, small enough to run on a single consumer GPU while remaining capable enough for general chat. The repository contains the model code and supporting scripts needed to download weights, run inference, and fine-tune the model on your own data.

Compared with the first generation, ChatGLM2-6B was upgraded in three main areas. Performance: the base model was retrained on 1.4T Chinese and English tokens with the GLM mixed-objective function and aligned to human preferences, producing large gains on benchmarks such as MMLU, C-Eval, GSM8K, and BBH (the README quotes +23% on MMLU and +571% on GSM8K). Context: the context length was extended from 2K to 32K tokens using FlashAttention, with an 8K window used during chat training and a separate ChatGLM2-6B-32K variant for longer documents. Inference: Multi-Query Attention makes generation roughly 42% faster than the first generation, and a 6GB GPU running INT4 quantization can sustain conversations of up to 8K characters. INT8 and INT4 quantization further reduce memory with only modest accuracy loss.

You would use ChatGLM2-6B if you want a freely available chat model that is strong in both Chinese and English, runs on a single GPU, and can be fine-tuned locally. It suits research and prototyping, and commercial use is free after registering through a form. It is built in Python on PyTorch and Hugging Face Transformers, with dependencies installed via pip after cloning the repository. This summary is based on a partial excerpt of the README.
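The memory claims above can be sanity-checked with back-of-the-envelope arithmetic. The sketch below estimates weight storage for six billion parameters at each precision, then compares the attention key/value cache for standard multi-head attention against Multi-Query Attention with a single shared key/value head. The layer count, head count, and head dimension are illustrative assumptions, not values read from the repository, and the function names are hypothetical.

```python
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}
GIB = 2**30


def weight_memory_gib(n_params: float, precision: str) -> float:
    """Memory for the weights alone, ignoring activations and caches."""
    return n_params * BYTES_PER_PARAM[precision] / GIB


def kv_cache_gib(seq_len: int, n_layers: int, kv_heads: int,
                 head_dim: int, bytes_per_value: float = 2.0) -> float:
    """Key/value cache size: one K and one V tensor per layer."""
    values = 2 * n_layers * kv_heads * head_dim * seq_len
    return values * bytes_per_value / GIB


if __name__ == "__main__":
    for precision in BYTES_PER_PARAM:
        print(f"{precision}: {weight_memory_gib(6e9, precision):.1f} GiB")

    # Assumed shape for illustration: 28 layers, 32 query heads, head_dim 128.
    mha = kv_cache_gib(8192, n_layers=28, kv_heads=32, head_dim=128)
    mqa = kv_cache_gib(8192, n_layers=28, kv_heads=1, head_dim=128)
    print(f"multi-head KV cache at 8K tokens:  {mha:.2f} GiB")
    print(f"multi-query KV cache at 8K tokens: {mqa:.2f} GiB")
```

Under these assumptions the INT4 weights alone take about 2.8 GiB, so the 6GB figure leaves room for activations, the cache, and any layers kept at higher precision. The cache comparison shows why sharing key/value heads matters for long conversations: with 32 key/value heads the cache at 8K tokens would be about 3.5 GiB, while a single shared head cuts that by a factor of 32.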

Copy-paste prompts

Prompt 1

I have a machine with an 8GB GPU. Walk me through running ChatGLM2-6B with INT4 quantization so it fits in memory, and show the exact Python commands to load and query the model.

Prompt 2

I want to fine-tune ChatGLM2-6B on a dataset of customer-support conversations in Chinese. What fine-tuning approach does the repo support and what data format does it expect?

Prompt 3

Compare ChatGLM2-6B's 32K context variant with the standard 8K chat model. When should I use each, and what are the memory and speed trade-offs?

Prompt 4

I need to serve ChatGLM2-6B as an API for a small team. What is the simplest way to wrap the Hugging Face model in a FastAPI endpoint?

Verify against the repo before relying on details.