tencent-hunyuan/planningbench

Analysis updated 2026-05-18

★ 21 · Audience · researcher · Complexity · 1/5 · Setup · easy

Mindmap

mindmap
  root((PlanningBench))
    What it is
      467 evaluation instances
      Constraint-driven synthesis
      Verification checklists
    Task Families
      Scheduling and timetabling
      Project and production
      Routing and travel
      Emergency response
      Allocation and matching
      Workforce scheduling
    Use Cases
      Benchmark LLM planning ability
      Evaluate constraint satisfaction
      Research on reasoning models
    Resources
      HuggingFace dataset
      arXiv paper


What do people build with it?

USE CASE 1

Run a language model against the 467 PlanningBench instances to measure how often it produces plans that satisfy all constraints, and compare results across models.
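A minimal sketch of that comparison, assuming each model's output has already been reduced to one list of booleans per instance (one boolean per checklist item). The `Results` layout and the model names are hypothetical and are not taken from the PlanningBench repository; check the actual dataset schema on Hugging Face before adapting this.

```python
from typing import Dict, List

# Hypothetical layout: model name -> one list of checklist outcomes per instance.
Results = Dict[str, List[List[bool]]]


def strict_pass_rate(instances: List[List[bool]]) -> float:
    """Fraction of instances whose plan satisfies every checklist item."""
    if not instances:
        return 0.0
    # An instance with an empty checklist is counted as a failure, not a pass.
    passed = sum(1 for checks in instances if checks and all(checks))
    return passed / len(instances)


def compare_models(results: Results) -> Dict[str, float]:
    """Strict pass rate per model, ordered from best to worst."""
    rates = {name: strict_pass_rate(runs) for name, runs in results.items()}
    return dict(sorted(rates.items(), key=lambda kv: kv[1], reverse=True))


if __name__ == "__main__":
    # Made-up outcomes for two made-up models, three instances each.
    demo: Results = {
        "model-a": [[True, True], [True, False], [True, True, True]],
        "model-b": [[True, False], [False, False], [True, True, True]],
    }
    for name, rate in compare_models(demo).items():
        print(f"{name}: {rate:.2%}")
```

The strict rule (every item must pass) mirrors the idea of a plan that satisfies all constraints at once; a per-item average would be a softer, separate metric.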

USE CASE 2

Use the benchmark's verification checklists to automatically score a model's planning output without needing human judges for each response.
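As a rough illustration of checklist-based scoring, the sketch below treats a checklist as a list of named predicate functions applied to a parsed plan. This is an assumption made for illustration only: the real PlanningBench checklists may be written as natural-language items rather than executable checks, and the `Task` fields and the three checks here are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class Task:
    name: str
    start: int     # scheduled start, in minutes (hypothetical unit)
    end: int       # scheduled end, in minutes
    earliest: int  # time window opens
    latest: int    # time window closes


Plan = List[Task]
Check = Tuple[str, Callable[[Plan], bool]]


def within_windows(plan: Plan) -> bool:
    """Every task starts and ends inside its allowed time window."""
    return all(t.earliest <= t.start and t.end <= t.latest for t in plan)


def no_overlap(plan: Plan) -> bool:
    """No two tasks occupy the same time (single shared resource assumed)."""
    ordered = sorted(plan, key=lambda t: t.start)
    return all(a.end <= b.start for a, b in zip(ordered, ordered[1:]))


def positive_durations(plan: Plan) -> bool:
    """Every task ends strictly after it starts."""
    return all(t.end > t.start for t in plan)


# Hypothetical checklist for a toy scheduling instance.
CHECKLIST: List[Check] = [
    ("tasks stay within their time windows", within_windows),
    ("no overlapping tasks", no_overlap),
    ("all durations are positive", positive_durations),
]


def score_plan(plan: Plan, checklist: List[Check]) -> Dict[str, bool]:
    """Run every checklist item and report pass/fail per item."""
    return {label: check(plan) for label, check in checklist}


if __name__ == "__main__":
    plan = [Task("intake", 0, 30, 0, 60), Task("review", 30, 50, 0, 60)]
    for label, ok in score_plan(plan, CHECKLIST).items():
        print(("PASS" if ok else "FAIL"), label)
```

Reporting per-item results rather than a single boolean keeps the failure reason visible, which is useful when a plan misses only one constraint.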

USE CASE 3

Study how well current AI models handle specific planning families like workforce scheduling or emergency resource allocation by filtering the dataset by task type.
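Filtering by family could look like the sketch below. It assumes each instance is a dictionary with a `family` field; that field name is a guess, not something confirmed by the dataset, so the key is left as a parameter.

```python
from collections import Counter
from typing import Any, Dict, Iterable, List, Mapping


def _norm(value: Any) -> str:
    """Normalize a label so casing and stray spaces do not split groups."""
    return str(value).strip().lower()


def filter_by_family(
    records: Iterable[Mapping[str, Any]],
    family: str,
    key: str = "family",  # hypothetical field name
) -> List[Mapping[str, Any]]:
    """Keep only the instances whose family label matches `family`."""
    wanted = _norm(family)
    return [r for r in records if _norm(r.get(key, "")) == wanted]


def count_by_family(
    records: Iterable[Mapping[str, Any]],
    key: str = "family",
) -> Dict[str, int]:
    """Number of instances per normalized family label."""
    return dict(Counter(_norm(r.get(key, "")) for r in records))


if __name__ == "__main__":
    # Made-up records; real instances will have different fields.
    demo = [
        {"id": 1, "family": "Workforce scheduling"},
        {"id": 2, "family": "Routing and travel"},
        {"id": 3, "family": "workforce scheduling "},
    ]
    print(count_by_family(demo))
    print([r["id"] for r in filter_by_family(demo, "Workforce scheduling")])
```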

How does it compare?

	tencent-hunyuan/planningbench	0whitedev/detranspiler	0xluk3/zk-resources
Stars	21	21	21
Language	—	Python	—
Setup difficulty	easy	hard	easy
Complexity	1/5	4/5	1/5
Audience	researcher	developer	researcher

Figures from each repo's GitHub metadata at analysis time.

How do you get it running?

Difficulty · easy Time to first run · 30min

The dataset is available on Hugging Face. Check the LICENSE-PlanningBench.txt file before using it in a published project.

In plain English

PlanningBench is a research benchmark from Tencent Hunyuan and Renmin University of China that tests how well large language models handle complex planning problems with multiple constraints, priorities, and dependencies. Rather than asking simple questions with obvious right answers, it tests whether a model can produce a complete, executable plan that satisfies all the stated constraints at once.

The benchmark releases 467 synthetic planning instances intended for evaluation rather than training. Each instance contains a planning question and a verification checklist that makes it possible to check automatically whether a model's answer satisfies every constraint and objective. The questions are self-contained, meaning everything needed to solve and verify the plan is included in the problem statement itself.

The 467 instances span six broad families of planning problems: scheduling and timetabling (fitting tasks into time windows without conflicts), project and production operations (managing milestones and dependencies), routing and travel (coordinating movement across locations), emergency response and public service (allocating resources under urgency and priority), allocation and matching (assigning resources under compatibility and capacity limits), and shift and workforce scheduling (covering all required work while respecting fairness rules). Together these cover more than 30 concrete task types.

The data was built through an automated synthesis pipeline: planning scenarios were abstracted into taxonomies of tasks and constraints, candidate problems were generated from those taxonomies, a model attempted to solve them, and a verification step checked the answers. Human reviewers participated in quality control. Difficulty was adjusted by tightening constraints, increasing resource scarcity, and adding dependencies.

The dataset is available on Hugging Face and accompanied by a paper on arXiv. The repository contains the evaluation dataset, documentation, and figures. A separate license file governs its use.

Copy-paste prompts

Prompt 1

I want to evaluate a language model on PlanningBench. How do I load the dataset from HuggingFace and what format are the 467 instances in?

Prompt 2

Explain how PlanningBench's verification checklists work: given a model's plan as output, how do I check whether it satisfies all the constraints in an instance?

Prompt 3

What are the six planning families in PlanningBench and how do they differ? Give me one concrete example of a task type from each family.

Prompt 4

How was PlanningBench built? Explain the Generator-Responder-Critic pipeline and how difficulty is controlled through constraint tightness and resource scarcity.

Frequently asked questions

What is planningbench?

A benchmark of 467 complex planning problems from Tencent Hunyuan and Renmin University of China that tests whether AI models can produce complete, constraint-satisfying plans across scheduling, routing, allocation, and other planning domains.

How hard is planningbench to set up?

Setup difficulty is rated easy, with roughly 30 minutes to a first successful run.

Who is planningbench for?

Mainly researchers.


Verify against the repo before relying on details.