Find a leaderboard like Chatbot Arena or LiveBench when picking which LLM to ship
Pick a self-hostable evaluation framework like lm-evaluation-harness or HELM to test an internal model
Locate a domain-specific benchmark like RAGAS for RAG or AgentBench for agent loops
Submit a pull request adding a new evaluation project to the curated list
No code, just a curated list of links with no license stated.
Awesome-AI-Benchmarking is a curated list of tools, leaderboards, and frameworks for evaluating large language models. It is a GitHub-style awesome list: a single README that links out to other projects with short descriptions. The author updates the list periodically and welcomes pull requests for new entries.
The list is split into two main groups. The first group, SaaS and hosted platforms, points to LMSYS Chatbot Arena, the crowdsourced blind-comparison Elo arena; Artificial Analysis, an independent platform that publishes quality, speed, price, latency, and context window metrics; the Hugging Face Open LLM Leaderboard for open models; LiveBench, which refreshes its questions to fight contamination; and the Vellum LLM Leaderboard, which targets business use cases. HELM and BIG-bench are also called out.
The second group covers open-source projects you can run yourself. It lists EleutherAI's lm-evaluation-harness, Hugging Face's open leaderboard codebase and LightEval, Stanford's HELM suite, the LiveBench source, EvalPlus for code generation with extended tests like HumanEval Plus and MBPP Plus, DeepEval, the LangSmith evaluators, Google's BIG-bench with over 200 tasks, RAGAS for retrieval-augmented generation, PromptBench for adversarial prompt testing, SafetyBench, MT-Bench, AgentBench, and LLM-KG-Bench for knowledge graphs.
The README closes with notes on how to contribute and a disclaimer. Contributors are asked to fork the repo, follow the existing entry format, and submit a pull request with a short explanation. The disclaimer reminds readers that the list is community-curated and not exhaustive, that benchmark scores can be misleading without an understanding of methodology and contamination risk, and that no single leaderboard tells the full story since different benchmarks favor different model strengths.
The repository itself contains only the README and a star history chart link, with no code. There is no license listed in the README.
The author addresses AI researchers, LLM engineers, product teams, and open-source enthusiasts who want a reading list of evaluation projects in one place.
Generated 2026-05-22 · Model: sonnet-4-6 · Verify against the repo before relying on details.