volcano-sh/volcano

★ 5,557 | Go | Audience · ops, devops | Complexity · 4/5 | License · Apache 2.0 | Setup · hard

Mindmap

mindmap
  root((Volcano))
    Scheduling features
      Gang scheduling
      Fair-share queuing
      Preemption
    Frameworks supported
      TensorFlow
      PyTorch
      Spark
      Flink
    Use cases
      AI training jobs
      Big data pipelines
      HPC workloads
    Audience
      Kubernetes admins
      MLOps teams


Things people build with this

USE CASE 1

Schedule distributed PyTorch or TensorFlow training jobs on a shared Kubernetes cluster with fair resource allocation between teams

USE CASE 2

Use gang scheduling to ensure all workers in a distributed job start at the same time, preventing partial launches that waste resources

USE CASE 3

Run Spark or Flink batch pipelines on Kubernetes with priority-based preemption for urgent jobs

USE CASE 4

Set up resource quotas and queuing policies so multiple teams share the same GPU cluster without stepping on each other
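The fair-sharing idea behind the last use case can be illustrated with a small model. The sketch below splits a GPU pool between queues in proportion to a per-queue weight and flags a queue that holds more than its share. It is a simplified illustration, not Volcano's actual algorithm, which also takes each queue's requests and capacity limits into account. The function names `deserved_shares` and `is_overused` and the team names are hypothetical.

```python
from typing import Dict


def deserved_shares(total_gpus: int, weights: Dict[str, int]) -> Dict[str, float]:
    """Split total_gpus between queues in proportion to their weights."""
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("at least one queue needs a positive weight")
    return {queue: total_gpus * weight / total_weight
            for queue, weight in weights.items()}


def is_overused(allocated: float, deserved: float) -> bool:
    """A queue holding more than its deserved share is a candidate for giving resources back."""
    return allocated > deserved


# Two teams sharing 10 GPUs with a 60/40 weighting (illustrative numbers).
shares = deserved_shares(10, {"team-a": 6, "team-b": 4})
print(shares)                                # {'team-a': 6.0, 'team-b': 4.0}
print(is_overused(7, shares["team-a"]))      # True: team-a holds 7 but deserves 6
```

Weights express relative entitlement rather than hard limits, so in a model like this an idle team's share can be lent out and taken back when that team submits work again.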

Tech stack

Go, Kubernetes

Getting it running

Difficulty · hard | Time to first run · 1 day+

Requires a running Kubernetes cluster and cluster-admin access to install Volcano's CRDs and deploy its scheduler alongside the default one.

Licensed under Apache 2.0: use freely in commercial and open-source projects as long as you keep the license notice.

In plain English

Volcano is a batch job scheduling system that runs on top of Kubernetes, the platform used to manage containers in production environments. The default Kubernetes scheduler places pods one at a time, which suits always-on services better than batch jobs whose workers must start, run, and finish together. Volcano fills that gap by adding the scheduling behaviors that large-scale computing jobs need: gang scheduling, which holds a job back until its full group of workers can start together; fair-share queuing across teams; and preemption of lower-priority jobs when higher-priority ones need resources.

The project is aimed at organizations running AI and machine learning training jobs, big data pipelines, and high-performance computing workloads on Kubernetes clusters. It has documented integrations with a broad list of frameworks: Spark, Flink, Ray, TensorFlow, PyTorch, MPI, Horovod, MindSpore, PaddlePaddle, and several others. These integrations let the frameworks submit jobs to Kubernetes while Volcano decides when and where their pods run so that cluster resources are used efficiently.

Volcano is an incubating project under the Cloud Native Computing Foundation, the same organization that hosts Kubernetes itself. It has been adopted by organizations in finance, cloud infrastructure, manufacturing, and healthcare, with production deployments at companies including ING Bank and Xiaohongshu.

The system is built in Go and runs as an additional scheduler alongside the default Kubernetes scheduler rather than replacing it; workloads opt in by naming Volcano as their scheduler. Teams can configure queuing policies, resource quotas, and scheduling plugins without changing how other workloads on the same cluster run. Volcano is licensed under Apache 2.0.
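The all-or-nothing rule behind gang scheduling can be sketched in a few lines. The model below tries to place every worker of a job on nodes with enough free CPU and returns no placement at all if even one worker does not fit. It is a toy first-fit illustration under assumed inputs, not Volcano's implementation (which is written in Go and built from scheduling plugins); `GangJob`, `try_schedule`, and the node sizes are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class GangJob:
    name: str
    min_available: int    # number of workers that must start together
    cpus_per_worker: int  # CPU request of each worker


def try_schedule(job: GangJob, free_cpus_per_node: List[int]) -> Optional[List[int]]:
    """Return one node index per worker, or None if the whole gang cannot fit."""
    remaining = list(free_cpus_per_node)  # work on a copy: a failed attempt reserves nothing
    placement: List[int] = []
    for _ in range(job.min_available):
        # First-fit: pick the first node that still has room for one more worker.
        node = next((i for i, free in enumerate(remaining)
                     if free >= job.cpus_per_worker), None)
        if node is None:
            return None  # all-or-nothing: no partial launch
        remaining[node] -= job.cpus_per_worker
        placement.append(node)
    return placement


job = GangJob(name="train", min_available=8, cpus_per_worker=4)
print(try_schedule(job, [12, 12]))  # None: only 6 of 8 workers fit, so none start
print(try_schedule(job, [16, 16]))  # [0, 0, 0, 0, 1, 1, 1, 1]
```

In the first call the job stays pending and consumes nothing, which is what prevents a partial launch from holding resources while it waits for the remaining workers.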

Copy-paste prompts

Prompt 1

I am deploying PyTorch distributed training on Kubernetes using Volcano. Write me a Volcano Job YAML that schedules 8 worker pods and uses gang scheduling to wait until all 8 are available before starting.

Prompt 2

Help me configure Volcano queues to give two teams a 60/40 fair-share of GPU resources on a shared Kubernetes cluster, with preemption enabled for the higher-priority team.

Prompt 3

Explain how Volcano gang scheduling works: if only 6 out of 8 required worker pods can be scheduled right now, what does Volcano do and when does the job actually start?


Verify against the repo before relying on details.