microsoft/unilm

Analysis updated 2026-05-18

★ 22,115 | Python | Audience · researcher | Complexity · 4/5 | License | Setup · moderate

Mindmap

mindmap
  root((repo))
    What it does
      Unified pre-training
      Multiple modalities
      Research implementations
    Model families
      Language models
      Vision models
      Speech models
      Document models
    Key projects
      UniLM
      MiniLM
      BEiT
      WavLM
      LayoutLM
    Use cases
      Train custom models
      Access pre-trained weights
      Research foundation models
    Tech stack
      Python
      PyTorch
      Transformers


What do people build with it?

USE CASE 1

Download pre-trained model weights for language, vision, speech, or document understanding tasks.

USE CASE 2

Study and implement research papers on unified pre-training approaches across multiple data types.

USE CASE 3

Fine-tune smaller models like MiniLM for faster inference on resource-constrained devices.

USE CASE 4

Build document understanding systems using LayoutLM that combine text and visual layout information.
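
To make "text plus visual layout" concrete: LayoutLM-style models take, for every word, the word itself and a bounding box, with coordinates scaled onto a 0 to 1000 grid so that pages of different sizes are comparable. The following is a minimal sketch of that scaling step only, assuming pixel boxes in (x0, y0, x1, y1) order. The function name `normalize_bbox`, the sample words, and the page size are illustrative and not taken from the repo.

```python
def normalize_bbox(bbox, page_width, page_height):
    """Scale a pixel-space box (x0, y0, x1, y1) onto a 0-1000 grid."""
    x0, y0, x1, y1 = bbox
    return [
        int(1000 * x0 / page_width),
        int(1000 * y0 / page_height),
        int(1000 * x1 / page_width),
        int(1000 * y1 / page_height),
    ]


# Hypothetical OCR output for a 1700 x 2200 pixel scan (made-up values).
words = ["Invoice", "Total"]
boxes = [(170, 220, 510, 264), (170, 1100, 340, 1144)]
page_width, page_height = 1700, 2200

# Pair each word with its normalized box, the shape of input a
# layout-aware model consumes alongside the token ids.
layout_inputs = [
    {"word": word, "bbox": normalize_bbox(box, page_width, page_height)}
    for word, box in zip(words, boxes)
]

for item in layout_inputs:
    print(item)
```

In a real pipeline the words and boxes would come from an OCR engine, and the tokenizer would repeat each word's box for every sub-word token. Check the LayoutLM folder in the repo for the exact preprocessing it expects.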

What is it built with?

Python, PyTorch, Transformers

How does it compare?

	microsoft/unilm	mkdocs/mkdocs	tornadoweb/tornado
Stars	22,115	22,048	22,182
Language	Python	Python	Python
Setup difficulty	moderate	easy	easy
Complexity	4/5	2/5	3/5
Audience	researcher	developer	developer

Figures from each repo's GitHub metadata at analysis time.

How do you get it running?

Difficulty · moderate Time to first run · 30min

Requires PyTorch and a download of the weights for the specific model you want to use. A GPU is recommended but not mandatory for inference.
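
Before downloading any weights, it can help to confirm that PyTorch is importable and whether a GPU is visible. This is a minimal preflight sketch, not part of the repo. The helper name `pick_device` is hypothetical, and the check relies only on `importlib` and `torch.cuda.is_available()`.

```python
import importlib.util


def pick_device():
    """Return "cuda" if PyTorch sees a GPU, "cpu" if it does not,
    or None when PyTorch is not installed at all."""
    if importlib.util.find_spec("torch") is None:
        return None
    import torch  # imported lazily so the script still runs without PyTorch

    return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_device()
if device is None:
    print("PyTorch is not installed; install it before loading a checkpoint.")
else:
    print(f"PyTorch found; inference would run on: {device}")
```

Individual sub-projects may pin their own PyTorch or Transformers versions, so treat a passing check as a starting point and follow the README in the folder you actually use.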

Use freely for any purpose including commercial, as long as you keep the copyright notice.

In plain English

unilm is a Microsoft Research repository that collects a large family of foundation-model projects under one roof. A foundation model is an AI model trained on huge amounts of raw data so that it can later be adapted to many different specific tasks. The umbrella theme of the repo, summarised at the top of the README as The Big Convergence, is large-scale self-supervised pre-training that spans tasks (both understanding and generation), languages (more than a hundred), and modalities (text, image, audio, and combinations like text plus layout, text plus vision, or text plus speech). Rather than being a single library you install and call, the repository works as a directory of sub-projects, each with its own folder, paper link, and in many cases code and pre-trained checkpoints.
The README organises them into groups:
Foundation Architecture covers building blocks like DeepNet (scaling transformers to a thousand layers and beyond), Foundation Transformers (Magneto), a length-extrapolatable transformer, X-MoE (sparse Mixture-of-Experts), BitNet (1-bit transformers), RetNet, and LongNet (scaling to a billion tokens).
Foundation Models include the Kosmos series of multimodal large language models (Kosmos-1, Kosmos-2, Kosmos-2.5) and MetaLM, framed as general-purpose language interfaces.
Language and multilingual work includes UniLM itself (unified pre-training for understanding and generation), InfoXLM/XLM-E, DeltaLM/mT6, MiniLM, AdaLM, EdgeLM, SimLM, E5 text embeddings, and MiniLLM for knowledge distillation.
Vision work includes BEiT and BEiT-2, DiT for document image transformers, and TextDiffuser.
Speech work includes WavLM and VALL-E for neural-codec text-to-speech.
Multimodal entries include the LayoutLM family for Document AI, LayoutXLM, MarkupLM, XDoc, UniSpeech, SpeechT5, SpeechLM, VLMo, VL-BEiT, and BEiT-3 as a general-purpose multimodal foundation model.
You would explore this repo if you are an AI researcher or engineer who wants reference code, pre-trained checkpoints, and papers for a specific Microsoft Research model or task area (OCR, document understanding, multilingual translation, a 1-bit LLM experiment, speech models) rather than reaching for a general production framework. The primary language is Python, and the README also points readers to the related TorchScale library for the underlying architectures. The top of the README also carries a hiring notice for the team that maintains the repo.
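
Because the repository is a directory of sub-projects and not one installable package, a first step after cloning is often just to see which folders exist. The sketch below lists top-level folders in a local checkout. It assumes the clone lives at `./unilm` and that each sub-project keeps a `README.md` at the top of its folder. Both are assumptions for illustration, and the helper name `list_subprojects` is hypothetical.

```python
from pathlib import Path


def list_subprojects(repo_root):
    """Return sorted names of top-level folders that contain a README.md.

    Hidden folders such as .git are skipped.
    """
    root = Path(repo_root)
    return sorted(
        entry.name
        for entry in root.iterdir()
        if entry.is_dir()
        and not entry.name.startswith(".")
        and (entry / "README.md").is_file()
    )


if __name__ == "__main__":
    checkout = Path("unilm")  # assumed location of a local clone
    if checkout.is_dir():
        for name in list_subprojects(checkout):
            print(name)
    else:
        print("No local clone found at ./unilm")
```

Each folder name printed this way can then be opened on its own; setup steps, dependencies, and checkpoint links differ from one sub-project to the next.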

Copy-paste prompts

Prompt 1

How do I download and use the UniLM pre-trained weights from this Microsoft repository for text generation?

Prompt 2

Show me how to fine-tune MiniLM on my custom dataset using the training code in this repo.

Prompt 3

Explain how LayoutLM combines text and visual layout to understand scanned documents and forms.

Prompt 4

What is the difference between the various model families in this repo (UniLM, BEiT, WavLM, LayoutLM) and when should I use each one?

Prompt 5

How do I implement the BitNet architecture from this repo to reduce model size and computation?

Frequently asked questions

What is unilm?

unilm is Microsoft's research collection of pre-trained AI models and training code for handling text, images, speech, and documents with unified approaches rather than separate specialized models.

What language is unilm written in?

Mainly Python. The stack also includes PyTorch and Transformers.

What license does unilm use?

Use freely for any purpose including commercial, as long as you keep the copyright notice.

How hard is unilm to set up?

Setup difficulty is rated moderate, with roughly 30 minutes to a first successful run.

Who is unilm for?

Mainly researchers.


Verify against the repo before relying on details.