Schedule distributed PyTorch or TensorFlow training jobs on a shared Kubernetes cluster with fair resource allocation between teams
Use gang scheduling to ensure all workers in a distributed job start at the same time, preventing partial launches that waste resources
Run Spark or Flink batch pipelines on Kubernetes with priority-based preemption for urgent jobs
Set up resource quotas and queuing policies so multiple teams share the same GPU cluster without stepping on each other
Requires a running Kubernetes cluster and admin access to install CRDs and deploy the Volcano scheduler alongside the default one.
Volcano is a batch job scheduling system that runs on top of Kubernetes, the container orchestration platform. Standard Kubernetes scheduling is designed for always-on services, not for jobs that start, run a workload, and stop. Volcano fills that gap by adding the scheduling behaviors that large-scale computing jobs need: gang scheduling, which holds a job until its full group of workers can start together; fair-share queuing across teams; and preemption of lower-priority jobs when higher-priority ones need resources.
The project is aimed at organizations running AI and machine learning training jobs, big data pipelines, and high-performance computing workloads on Kubernetes clusters. It has documented integrations with Spark, Flink, Ray, TensorFlow, PyTorch, MPI, Horovod, MindSpore, PaddlePaddle, and several other frameworks. With these integrations, the frameworks submit jobs to Kubernetes and Volcano decides how to place them to make good use of cluster resources.
Volcano is an incubating project under the Cloud Native Computing Foundation, the same organization that hosts Kubernetes. It has been adopted by organizations in finance, cloud infrastructure, manufacturing, and healthcare, with production deployments at companies including ING Bank and Xiaohongshu.
The system is written in Go and runs alongside the default Kubernetes scheduler rather than replacing it. Teams can configure queuing policies, resource quotas, and scheduling plugins without changing how other workloads on the same cluster run. Volcano is licensed under the Apache 2.0 license.
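The all-or-nothing idea behind gang scheduling, described above, can be illustrated with a toy model. This is a minimal sketch and not Volcano's algorithm or API: the `Job` class, the `gang_schedule` function, and the single shared GPU pool are all hypothetical simplifications, whereas a real scheduler places individual pods on individual nodes.

```python
from dataclasses import dataclass


@dataclass
class Job:
    # Hypothetical job description used only for this illustration.
    name: str
    workers: int          # workers that must all start together
    gpus_per_worker: int


def gang_schedule(jobs, free_gpus):
    """Admit a job only if every one of its workers fits; otherwise admit none."""
    admitted, waiting = [], []
    for job in jobs:
        needed = job.workers * job.gpus_per_worker
        if needed <= free_gpus:
            # The whole gang fits, so reserve capacity for all workers at once.
            free_gpus -= needed
            admitted.append(job.name)
        else:
            # A partial launch would hold GPUs without making progress, so wait.
            waiting.append(job.name)
    return admitted, waiting, free_gpus


jobs = [Job("train-a", 4, 2), Job("train-b", 4, 2), Job("etl-c", 2, 1)]
admitted, waiting, left = gang_schedule(jobs, free_gpus=12)
print(admitted, waiting, left)  # ['train-a', 'etl-c'] ['train-b'] 2
```

In this run, `train-b` needs 8 GPUs but only 4 remain after `train-a` starts, so none of its workers launch and the smaller `etl-c` job uses part of the remaining capacity instead.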
Verify against the repo before relying on details.