bigscience-workshop/petals

★ 10,130 | Python | Audience · researcher | Complexity · 4/5 | Setup · hard

Mindmap

mindmap
  root((repo))
    What it does
      Distributed inference
      Large model serving
      Fine-tuning support
    Supported Models
      Llama 3.1 405B
      Mixtral
      Falcon
      BLOOM
    How it works
      Model layers split
      Peer routing
      Volunteer GPU nodes
    Features
      Prompt tuning
      Interactive speed
      Private swarm option
    Setup Options
      Linux with Anaconda
      Docker container
      Windows via WSL
    Privacy
      Public swarm risk
      Private swarm option
      Wiki documentation

Things people build with this

USE CASE 1

Run a 405-billion-parameter Llama model interactively on a consumer GPU by joining the public Petals network without downloading the full model.

USE CASE 2

Contribute your GPU to the Petals network to help others run large AI models while sharing compute costs.

USE CASE 3

Fine-tune a large language model for a specific task using prompt tuning without storing or training the full model locally.

USE CASE 4

Set up a private Petals swarm among a trusted team to run large AI models without routing data through public volunteer machines.
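As a rough illustration of use cases 1 and 4, the sketch below wraps the client-side calls in one function. The `generate_with_petals` helper and its parameters are hypothetical names introduced here. The model id and the `initial_peers` argument for private swarms are assumptions to check against the Petals README and wiki.

```python
from typing import Optional, Sequence

# Assumed Hugging Face repo id for Llama 3.1 405B; verify it against the
# Petals README and health.petals.dev before relying on it.
DEFAULT_MODEL = "meta-llama/Meta-Llama-3.1-405B-Instruct"


def generate_with_petals(
    prompt: str,
    model_name: str = DEFAULT_MODEL,
    max_new_tokens: int = 16,
    initial_peers: Optional[Sequence[str]] = None,
) -> str:
    """Generate a continuation of `prompt` through a Petals swarm (sketch)."""
    # Imported lazily so this file can be loaded without petals installed.
    from transformers import AutoTokenizer
    from petals import AutoDistributedModelForCausalLM

    # Assumption: passing initial_peers points the client at a private swarm;
    # omitting it connects to the public swarm.
    extra = {"initial_peers": list(initial_peers)} if initial_peers else {}

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    # Only a small local part of the model is loaded here; the transformer
    # blocks are served by remote peers.
    model = AutoDistributedModelForCausalLM.from_pretrained(model_name, **extra)

    input_ids = tokenizer(prompt, return_tensors="pt")["input_ids"]
    output_ids = model.generate(input_ids, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0])
```

Calling `generate_with_petals("A cat sat on")` would need `petals` installed and a swarm that currently serves the chosen model. Gated models such as Llama typically also require accepting the license on Hugging Face and logging in first.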

Tech stack

Python · PyTorch · Hugging Face Transformers

Getting it running

Difficulty · hard | Time to first run · 1h+

The main path is a CUDA GPU on Linux; Windows needs WSL, and Docker and macOS with Apple Silicon are also covered by the setup instructions. On the public swarm, your data passes through volunteer machines.

In plain English

Petals is a Python library that lets you run very large AI language models on consumer hardware by spreading the work across multiple computers over the internet, similar to how BitTorrent distributes file downloads across many peers. The models it targets, such as Llama 3.1 (up to 405 billion parameters), Mixtral, Falcon, and BLOOM, are too large to fit on a single consumer GPU. Petals solves this by letting each participant load just a portion of the model's layers, while the system routes data between participants to complete each request.

From a user's perspective, you write code against the Petals library much like you would use standard tools from the Hugging Face Transformers library. You load a model, pass it some text, and get generated output back. The difference is that the heavy computation happens across a network of volunteer-run machines rather than locally. The project reports inference speeds of up to 6 tokens per second for large models, which is enough for interactive chatbot-style use.

You can also fine-tune models through the network. Petals supports prompt tuning, which means you can adapt a model's behavior for a specific task without needing to store or train the full model yourself.

Anyone with a GPU can contribute to the network by running the Petals server software, which hosts a slice of a model and serves requests routed to it. Setup instructions are provided for Linux with Anaconda, Windows via the Windows Subsystem for Linux, Docker, and macOS with Apple Silicon. The project runs a public monitor at health.petals.dev showing which models are currently available and how many participants are serving each one.

Privacy is flagged as a consideration: in the public swarm, your data passes through other people's machines. The project has a wiki page covering the privacy implications, and it is possible to run a private swarm among a trusted group if that is a concern.

The library is backed by a research paper published at ACL 2023.
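The layer-splitting and routing idea described above can be illustrated with a toy simulation. This is a minimal sketch using only the standard library, not Petals' actual implementation: the `Peer`, `split_layers`, `plan_route`, and `distributed_forward` names are hypothetical, the "layers" are simple integer functions, and the greedy routing rule is an assumption chosen for clarity.

```python
from dataclasses import dataclass
from typing import Callable, List

Layer = Callable[[int], int]


def make_toy_layers(n: int) -> List[Layer]:
    # Order-sensitive toy "layers": applying them out of order changes the result.
    return [lambda x, i=i: 2 * x + i for i in range(n)]


@dataclass
class Peer:
    name: str
    start: int           # index of the first hosted layer (inclusive)
    layers: List[Layer]  # contiguous slice of the model hosted by this peer

    @property
    def end(self) -> int:
        # Index one past the last hosted layer.
        return self.start + len(self.layers)

    def forward(self, x: int, from_layer: int) -> int:
        # Run only the hosted layers at or after from_layer.
        for layer in self.layers[from_layer - self.start:]:
            x = layer(x)
        return x


def split_layers(layers: List[Layer], num_peers: int) -> List[Peer]:
    # Give each peer a contiguous, near-equal span of layers.
    base, extra = divmod(len(layers), num_peers)
    peers, start = [], 0
    for i in range(num_peers):
        size = base + (1 if i < extra else 0)
        peers.append(Peer(f"peer-{i}", start, layers[start:start + size]))
        start += size
    return peers


def plan_route(peers: List[Peer], num_layers: int) -> List[Peer]:
    # Greedy chain: at each position pick the peer that covers the most layers ahead.
    route, pos = [], 0
    while pos < num_layers:
        candidates = [p for p in peers if p.start <= pos < p.end]
        if not candidates:
            raise RuntimeError(f"no peer serves layer {pos}")
        best = max(candidates, key=lambda p: p.end)
        route.append(best)
        pos = best.end
    return route


def distributed_forward(x: int, peers: List[Peer], num_layers: int) -> int:
    # Pass the intermediate value from peer to peer along the planned route.
    pos = 0
    for peer in plan_route(peers, num_layers):
        x = peer.forward(x, pos)
        pos = peer.end
    return x


def local_forward(x: int, layers: List[Layer]) -> int:
    # Reference result: every layer applied on one machine.
    for layer in layers:
        x = layer(x)
    return x
```

With eight toy layers split across three peers, `distributed_forward` gives the same result as `local_forward`, and `plan_route` raises an error if some layer has no peer serving it. A real swarm also has to handle latency, peers leaving mid-request, and overlapping spans served by several peers, which this sketch leaves out.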

Copy-paste prompts

Prompt 1

Using Petals, write a few lines of Python that connect to the public network and generate text with Llama 3.1 on a consumer GPU.

Prompt 2

How do I set up a Petals server node on my Linux machine with an NVIDIA GPU to contribute layers to the public inference network?

Prompt 3

I want to fine-tune a large language model for text classification using Petals prompt tuning. Show me the Python code to train soft prompts through the network.

Prompt 4

What are the privacy risks of using the public Petals swarm and how do I set up a private swarm restricted to machines I control?

Prompt 5

How does Petals reach interactive speeds (reported as up to 6 tokens per second for large models) across consumer GPUs? Explain the layer-splitting approach in plain terms.

Verify against the repo before relying on details.