explaingit

microsoft/jarvis

24,739 · Python · Audience · researcher · Complexity · 4/5 · Quiet · License · Setup · hard

TLDR

A system that uses ChatGPT to coordinate and execute complex AI tasks by automatically selecting and running specialized models from Hugging Face.

Mindmap

mindmap
  root((repo))
    How it works
      ChatGPT plans tasks
      Selects expert models
      Runs in sequence
      Synthesizes results
    Key components
      Task planning stage
      Model selection stage
      Task execution stage
      Response generation
    Use cases
      Multi-step image tasks
      Complex AI workflows
      Model orchestration
    Tech stack
      Python
      PyTorch
      ChatGPT API
      Hugging Face models
    Setup requirements
      OpenAI API key
      Hugging Face account
      GPU memory optional

Things people build with this

USE CASE 1

Automatically break down complex image tasks (like pose detection + image generation) and execute them end-to-end.

USE CASE 2

Build AI agent systems that coordinate multiple specialized models without manually writing orchestration code.

USE CASE 3

Research how large language models can act as central planners for multi-model AI workflows.

Tech stack

Python · PyTorch · ChatGPT · Hugging Face

Getting it running

Difficulty · hard Time to first run · 1h+

Requires an OpenAI API key, multiple Hugging Face model downloads, and a PyTorch/CUDA setup for local inference.

Use freely for any purpose including commercial, as long as you keep the copyright notice.

In plain English

JARVIS (also known as HuggingGPT) is a Microsoft research project that uses a large language model (LLM), specifically ChatGPT, as a central coordinator. It automatically plans and executes complex AI tasks by delegating work to specialized AI models hosted on Hugging Face.

Here is how it works: when you give JARVIS a complicated request like "describe the poses in this photo and generate a new image based on them," ChatGPT breaks the task into steps, selects the appropriate expert models from the Hugging Face model hub (for pose detection, image generation, etc.), runs them in the right order, collects the outputs, and synthesizes a final response. The LLM acts as the brain; the specialist models act as hands.

The workflow has four stages: task planning (ChatGPT figures out what needs to be done), model selection (ChatGPT picks which Hugging Face models to use based on their descriptions), task execution (the models run), and response generation (ChatGPT summarizes the results).

Researchers studying AI agent architectures or multi-model orchestration would use this project. It requires an OpenAI API key and a Hugging Face account. The full local setup needs significant GPU memory (24GB VRAM recommended) and disk space, but a lightweight mode exists that does not require downloading models locally. It is built in Python with PyTorch.
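To make the four stages concrete, here is a minimal sketch in Python. It is not JARVIS's actual code: the `Task` class, the `MODEL_POOL` entries, and every function name are hypothetical. The ChatGPT and Hugging Face calls are replaced with simple stand-ins (a hard-coded plan, word-overlap matching on model descriptions, and string results) so the control flow can run without API keys or model downloads.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    """One planned step. `deps` lists ids of tasks whose output this step needs."""
    id: int
    kind: str
    deps: list[int] = field(default_factory=list)


# Hypothetical model pool: name -> description (stand-in for Hugging Face model cards).
MODEL_POOL = {
    "pose-detector-demo": "pose detection on images",
    "image-generator-demo": "image generation conditioned on a pose map",
    "captioner-demo": "image captioning to text",
}


def plan_tasks(request: str) -> list[Task]:
    """Stage 1 stand-in: a real system would ask the LLM to produce this plan."""
    return [Task(0, "pose detection"), Task(1, "image generation", deps=[0])]


def select_model(task: Task) -> str:
    """Stage 2 stand-in: pick the model whose description shares the most words with the task."""
    words = set(task.kind.split())
    return max(MODEL_POOL, key=lambda name: len(words & set(MODEL_POOL[name].split())))


def execute(tasks: list[Task]) -> dict[int, str]:
    """Stage 3: run each task once all of its dependencies have produced output."""
    results: dict[int, str] = {}
    pending = list(tasks)
    while pending:
        ready = [t for t in pending if all(d in results for d in t.deps)]
        if not ready:
            raise ValueError("circular or missing dependency in plan")
        for task in ready:
            model = select_model(task)
            inputs = [results[d] for d in task.deps]
            # Stand-in for real inference: record which model ran on which inputs.
            results[task.id] = f"{model}({', '.join(inputs) or 'user input'})"
        pending = [t for t in pending if t.id not in results]
    return results


def generate_response(request: str, results: dict[int, str]) -> str:
    """Stage 4 stand-in: a real system would ask the LLM to summarize the outputs."""
    steps = "; ".join(results[i] for i in sorted(results))
    return f"Request: {request} | Steps run: {steps}"


def run(request: str) -> str:
    """Chain the four stages: plan, select and execute, then respond."""
    tasks = plan_tasks(request)
    return generate_response(request, execute(tasks))


if __name__ == "__main__":
    print(run("describe the poses in this photo and generate a new image"))
```

In this sketch the image-generation step lists the pose-detection step as a dependency, so it only runs after the pose output exists. That is one simple way to model "runs them in the right order"; the real project may represent and resolve dependencies differently.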

Copy-paste prompts

Prompt 1
How do I set up JARVIS to take a photo, detect poses, and generate a new image based on those poses?
Prompt 2
Show me how to add a custom Hugging Face model to JARVIS's available model pool for task execution.
Prompt 3
What's the lightweight mode in JARVIS and how do I use it without downloading models locally?
Prompt 4
How does JARVIS decide which Hugging Face model to use for each step in a multi-stage task?

Generated 2026-05-18 · Model: sonnet-4-6 · Verify against the repo before relying on details.