explaingit

openai/gpt-oss

20,104PythonAudience · developerComplexity · 3/5MaintainedLicenseSetup · hard

TLDR

Open-weight AI language models (120B and 20B parameters) from OpenAI that you can download and run on your own hardware, with reasoning, function calling, and web browsing built in.

Mindmap

mindmap
  root((repo))
    What it does
      Two open models
      Run locally
      Reasoning support
      Function calling
    Model specs
      120B parameters
      20B parameters
      Mixture of Experts
      Quantized weights
    How to use
      Ollama
      LM Studio
      Transformers
      vLLM
    Capabilities
      Chain of thought
      Web browsing
      Code execution
      Structured outputs
    Tech stack
      PyTorch
      Triton
      Metal

Things people build with this

USE CASE 1

Run a private AI assistant on your own GPU without relying on OpenAI's API or paying per-token fees.

USE CASE 2

Build applications that need reasoning, web search, and code execution without external API calls.

USE CASE 3

Fine-tune or customize the model weights for domain-specific tasks using your own training data.

USE CASE 4

Deploy a production inference server using vLLM to serve multiple users with low latency.

Tech stack

PythonPyTorchTritonMetalOllamavLLMHugging Face Transformers

Getting it running

Difficulty · hard Time to first run · 1day+

120B model requires significant GPU VRAM (80GB+) or multi-GPU setup; downloading and quantizing models takes hours.

Use freely for any purpose, including commercial use, with no copyleft restrictions; you must include the Apache 2.0 license notice.

In plain English

gpt-oss is a pair of open-weight AI language models released by OpenAI: gpt-oss-120b (a large model with 117 billion total parameters but only 5.1 billion active at once) and gpt-oss-20b (a smaller, faster model with 21 billion parameters). "Open-weight" means the model weights, the learned numerical values that define how the model thinks, are publicly downloadable and can be run on your own hardware, unlike OpenAI's proprietary models which require API access. Both models are Mixture-of-Experts (MoE) models, a design where only a fraction of the network activates for any given input. This makes the 120b model surprisingly efficient: despite its large size, it fits on a single NVIDIA H100 or AMD MI300X GPU (80GB of memory) because of MXFP4 quantization, a technique that compresses the model's numbers to use less memory. The 20b model runs within 16GB of memory, making it accessible on high-end consumer hardware. The models support reasoning with configurable effort levels (low, medium, or high), full access to the model's internal chain-of-thought, function calling, web browsing, Python code execution, and structured outputs. They use a specific "Harmony" message format that must be applied correctly for the models to work. You can run these models locally using Ollama (two commands to download and start), LM Studio, the Hugging Face Transformers library, or vLLM for production serving. The models are licensed under Apache 2.0, making them free to use commercially without copyleft restrictions. The repository also includes educational reference implementations in PyTorch, Triton, and Metal.

Copy-paste prompts

Prompt 1
How do I download and run gpt-oss-120b locally using Ollama on my NVIDIA GPU?
Prompt 2
Show me how to use the Harmony message format to get chain-of-thought reasoning from gpt-oss models.
Prompt 3
What's the difference between gpt-oss-120b and gpt-oss-20b, and which should I use for my hardware?
Prompt 4
How do I set up vLLM to serve gpt-oss-20b as a production API endpoint with function calling enabled?
Prompt 5
Can I fine-tune gpt-oss-120b on my own data, and what's the minimum GPU memory required?
Open on GitHub → Explain another repo

Generated 2026-05-18 · Model: sonnet-4-6 · Verify against the repo before relying on details.