explaingit

volcano-sh/volcano

5,557 | Go | Audience · ops devops | Complexity · 4/5 | License | Setup · hard

TLDR

Volcano is a Kubernetes-native batch job scheduler that adds gang scheduling, fair-share queuing, and preemption so you can run AI training, big data pipelines, and HPC workloads on shared Kubernetes clusters.

Mindmap

mindmap
  root((Volcano))
    Scheduling features
      Gang scheduling
      Fair-share queuing
      Preemption
    Frameworks supported
      TensorFlow
      PyTorch
      Spark
      Flink
    Use cases
      AI training jobs
      Big data pipelines
      HPC workloads
    Audience
      Kubernetes admins
      MLOps teams

Things people build with this

USE CASE 1

Schedule distributed PyTorch or TensorFlow training jobs on a shared Kubernetes cluster with fair resource allocation between teams

USE CASE 2

Use gang scheduling to ensure all workers in a distributed job start at the same time, preventing partial launches that waste resources

USE CASE 3

Run Spark or Flink batch pipelines on Kubernetes with priority-based preemption for urgent jobs

USE CASE 4

Set up resource quotas and queuing policies so multiple teams share the same GPU cluster without stepping on each other
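The fair-share idea behind these use cases can be sketched in a few lines. This is a minimal illustration of weighted sharing, not Volcano's actual algorithm: the function name `weighted_shares`, the queue names, and the numbers are all hypothetical. Each queue gets capacity in proportion to its weight, and anything a queue does not need is offered to the others.

```python
from typing import Dict


def weighted_shares(
    capacity: float, weights: Dict[str, float], demand: Dict[str, float]
) -> Dict[str, float]:
    """Split capacity among queues by weight, capped at each queue's demand.

    Illustrative only: a simple water-filling loop, assumed for this sketch.
    """
    shares = {q: 0.0 for q in weights}
    active = {q for q in weights if demand.get(q, 0) > 0}
    remaining = capacity
    while active and remaining > 1e-9:
        total_weight = sum(weights[q] for q in active)
        satisfied = set()
        granted = 0.0
        for q in active:
            # Offer each queue its weighted slice of what is left this round.
            offer = remaining * weights[q] / total_weight
            take = min(offer, demand[q] - shares[q])
            shares[q] += take
            granted += take
            if shares[q] >= demand[q] - 1e-9:
                satisfied.add(q)
        remaining -= granted
        if not satisfied:
            break  # every active queue took its full offer; nothing left over
        active -= satisfied  # redistribute leftovers among queues still waiting
    return shares


if __name__ == "__main__":
    # Two hypothetical teams with 60/40 weights sharing 10 GPUs.
    print(weighted_shares(10, {"team-a": 60, "team-b": 40},
                          {"team-a": 10, "team-b": 10}))
```

When both teams ask for more than their slice, the split follows the weights. When one team asks for less, the other team picks up the unused capacity instead of leaving it idle.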

Tech stack

Go · Kubernetes

Getting it running

Difficulty · hard

Time to first run · 1 day+

Requires a running Kubernetes cluster with admin access to install Volcano's CRDs and deploy its scheduler alongside the default Kubernetes scheduler.

Apache 2.0: use freely in commercial and open-source projects as long as you keep the license notice.
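Once Volcano is installed, a workload is handed to it through a Volcano Job resource. The sketch below builds such a manifest as a Python dict and prints it as JSON. The field names (`batch.volcano.sh/v1alpha1`, `schedulerName`, `queue`, `minAvailable`, `tasks`) reflect the Volcano Job CRD as commonly documented, but check them against the CRD version installed in your cluster. The helper name `volcano_job_manifest`, the job name, and the container image are placeholders chosen for illustration.

```python
import json


def volcano_job_manifest(
    name: str, image: str, workers: int, queue: str = "default"
) -> dict:
    """Build a minimal Volcano Job manifest (illustrative; verify field names)."""
    return {
        "apiVersion": "batch.volcano.sh/v1alpha1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            # Send these pods to Volcano's scheduler, not the default one.
            "schedulerName": "volcano",
            "queue": queue,
            # Gang size: do not start until this many pods can be placed.
            "minAvailable": workers,
            "tasks": [
                {
                    "name": "worker",
                    "replicas": workers,
                    "template": {
                        "spec": {
                            "restartPolicy": "Never",
                            "containers": [{"name": "worker", "image": image}],
                        }
                    },
                }
            ],
        },
    }


if __name__ == "__main__":
    # Placeholder name and image; replace with your own workload.
    print(json.dumps(volcano_job_manifest("demo-job", "busybox:1.36", 3), indent=2))
```

Because `kubectl apply -f -` accepts JSON on standard input, the printed output can be piped to it directly on a test cluster.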

In plain English

Volcano is a batch job scheduling system that runs on top of Kubernetes, the software platform used to manage containers in production environments. Standard Kubernetes scheduling is designed for always-on services, not for jobs that start, run a workload, and stop. Volcano fills that gap by adding the scheduling behaviors that large-scale computing jobs need: gang scheduling, which waits until a full group of workers can start together; fair-share queuing across teams; and preemption of lower-priority jobs when higher-priority ones need resources.

The project is aimed at organizations running AI and machine learning training jobs, big data pipelines, and high-performance computing workloads on Kubernetes clusters. It has documented integrations with a broad list of frameworks: Spark, Flink, Ray, TensorFlow, PyTorch, MPI, Horovod, MindSpore, PaddlePaddle, and several others. The integrations let those frameworks submit jobs to Kubernetes and leave it to Volcano to decide how to schedule them for the best resource usage.

Volcano is an incubating project under the Cloud Native Computing Foundation, the same organization that hosts Kubernetes itself. It has been adopted by organizations in finance, cloud infrastructure, manufacturing, and healthcare, with production deployments at companies including ING Bank and Xiaohongshu.

The system is built in Go and is deployed as an additional scheduler alongside the default Kubernetes scheduler rather than as a replacement for it. Teams can configure queuing policies, resource quotas, and scheduling plugins without changing how other workloads on the same cluster run. Volcano is licensed under the Apache 2.0 license.
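The all-or-nothing behavior of gang scheduling can be illustrated with a small sketch. It is a simplification under stated assumptions, not Volcano's implementation: every worker is assumed to need exactly one slot, and the function name `gang_place` and the node names are hypothetical.

```python
from typing import Dict, List, Optional


def gang_place(free_slots: Dict[str, int], min_available: int) -> Optional[List[str]]:
    """Place min_available workers, or none at all.

    Returns one node name per worker on success. On failure returns None
    and leaves free_slots untouched, so no partial launch holds resources.
    """
    if sum(free_slots.values()) < min_available:
        return None  # the whole gang does not fit yet; keep waiting

    placement: List[str] = []
    for node, slots in free_slots.items():
        while slots > 0 and len(placement) < min_available:
            placement.append(node)
            slots -= 1

    # Commit only after the full gang has a seat.
    for node in placement:
        free_slots[node] -= 1
    return placement


if __name__ == "__main__":
    cluster = {"node-a": 4, "node-b": 2}
    print(gang_place(cluster, 8))  # not enough room: nothing is reserved
    cluster["node-c"] = 2          # capacity frees up elsewhere
    print(gang_place(cluster, 8))  # now all eight workers are placed together
```

The point of the design is the early return: with room for only six of eight workers, the job reserves nothing, so those six slots stay available to other jobs until the full group can start.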

Copy-paste prompts

Prompt 1
I am deploying PyTorch distributed training on Kubernetes using Volcano. Write me a Volcano Job YAML that schedules 8 worker pods and uses gang scheduling to wait until all 8 are available before starting.
Prompt 2
Help me configure Volcano queues to give two teams a 60/40 fair-share of GPU resources on a shared Kubernetes cluster, with preemption enabled for the higher-priority team.
Prompt 3
Explain how Volcano gang scheduling works: if only 6 out of 8 required worker pods can be scheduled right now, what does Volcano do and when does the job actually start?

Verify against the repo before relying on details.