Analysis updated 2026-05-18
Run a language model against the 467 PlanningBench instances to measure how often it produces plans that satisfy all constraints, and compare results across models.
Use the benchmark's verification checklists to automatically score a model's planning output without needing human judges for each response.
Study how well current AI models handle specific planning families like workforce scheduling or emergency resource allocation by filtering the dataset by task type.
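The three use cases above reduce to one computation: mark each instance as solved only if every checklist item passes, then aggregate per planning family. The following is a minimal sketch under stated assumptions, not the benchmark's own scoring code. The record keys `family` and `checks` are hypothetical; the real dataset's field names and checklist format should be taken from the repository.

```python
from collections import defaultdict


def pass_rate_by_family(results):
    """Compute the fraction of fully satisfied plans per planning family.

    `results` is an iterable of dicts with two HYPOTHETICAL keys:
      - "family": name of the planning family (str)
      - "checks": list of bools, one per verification-checklist item
    An instance counts as passed only if every checklist item is True.
    """
    totals = defaultdict(int)
    passed = defaultdict(int)
    for record in results:
        totals[record["family"]] += 1
        if all(record["checks"]):
            passed[record["family"]] += 1
    return {family: passed[family] / totals[family] for family in totals}


if __name__ == "__main__":
    # Toy data for illustration only; these are not real benchmark results.
    demo = [
        {"family": "routing", "checks": [True, True]},
        {"family": "routing", "checks": [True, False]},
        {"family": "scheduling", "checks": [True]},
    ]
    print(pass_rate_by_family(demo))
```

Running the same function on the outputs of several models gives a per-family comparison. Filtering to a single family, such as workforce scheduling, is a matter of reading one key from the returned dictionary.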
| | tencent-hunyuan/planningbench | 0whitedev/detranspiler | 0xluk3/zk-resources |
|---|---|---|---|
| Stars | 21 | 21 | 21 |
| Language | — | Python | — |
| Setup difficulty | easy | hard | easy |
| Complexity | 1/5 | 4/5 | 1/5 |
| Audience | researcher | developer | researcher |
Figures from each repo's GitHub metadata at analysis time.
The dataset is available on Hugging Face. Check the LICENSE-PlanningBench.txt file before using it in a published project.
PlanningBench is a research benchmark from Tencent Hunyuan and Renmin University of China that tests how well large language models handle complex planning problems with multiple constraints, priorities, and dependencies. Rather than asking simple questions with obvious right answers, it tests whether a model can produce a complete, executable plan that satisfies all the stated constraints at once.
The benchmark releases 467 synthetic planning instances intended for evaluation rather than training. Each instance contains a planning question and a verification checklist that makes it possible to check automatically whether a model's answer satisfies every constraint and objective. The questions are self-contained, meaning everything needed to solve and verify the plan is included in the problem statement itself.
The 467 instances span six broad families of planning problems: scheduling and timetabling (fitting tasks into time windows without conflicts), project and production operations (managing milestones and dependencies), routing and travel (coordinating movement across locations), emergency response and public service (allocating resources under urgency and priority), allocation and matching (assigning resources under compatibility and capacity limits), and shift and workforce scheduling (covering all required work while respecting fairness rules). Together these cover more than 30 concrete task types.
The data was built through an automated synthesis pipeline: planning scenarios were abstracted into taxonomies of tasks and constraints, candidate problems were generated from those taxonomies, a model attempted to solve them, and a verification step checked the answers. Human reviewers participated in quality control. Difficulty was adjusted by tightening constraints, increasing resource scarcity, and adding dependencies. The dataset is available on Hugging Face and accompanied by a paper on arXiv.
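To make the idea of automatic verification concrete, here is a toy checker for the scheduling family described above (fitting tasks into time windows without conflicts). It is an illustrative sketch only: the `Task` type, its fields, and the `check_schedule` function are hypothetical and are not taken from the benchmark, whose actual checklist format may differ.

```python
from dataclasses import dataclass


@dataclass
class Task:
    # All fields are hypothetical; times are integers such as minutes from midnight.
    name: str
    start: int                # scheduled start
    end: int                  # scheduled end (exclusive)
    window: tuple[int, int]   # allowed (earliest start, latest end)


def check_schedule(tasks):
    """Return a list of violation messages; an empty list means the plan passes."""
    violations = []
    # Constraint 1: each task must be non-empty and lie inside its allowed window.
    for task in tasks:
        earliest, latest = task.window
        if task.start >= task.end or task.start < earliest or task.end > latest:
            violations.append(f"{task.name}: outside window")
    # Constraint 2: no two tasks may overlap. After sorting by start time,
    # comparing neighbours is enough to detect whether any conflict exists
    # (it does not enumerate every conflicting pair).
    ordered = sorted(tasks, key=lambda task: task.start)
    for earlier, later in zip(ordered, ordered[1:]):
        if later.start < earlier.end:
            violations.append(f"{earlier.name}/{later.name}: overlap")
    return violations
```

A plan is accepted only when the returned list is empty, which mirrors the all-constraints-at-once criterion the benchmark uses.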
The repository contains the evaluation dataset, documentation, and figures. A separate license file governs its use.
A benchmark of 467 complex planning problems from Tencent and Renmin University that tests whether AI models can produce complete, constraint-satisfying plans across scheduling, routing, allocation, and other planning domains.
Setup difficulty is rated easy, with roughly 30 minutes to a first successful run.
The intended audience is mainly researchers.
Verify against the repo before relying on details.