djlougen/busybeaver

★ 13PythonAudience · researcherComplexity · 2/5LicenseSetup · moderate

Mindmap

mindmap
  root((repo))
    What it does
      Local AI code benchmark
      CPU-only, no GPU needed
      Resume on interruption
    Benchmarks Supported
      HumanEval 164 problems
      MBPP 500 problems
      MMLU-Pro general knowledge
    How it works
      Up to 3 attempts per problem
      Code extracted from response
      Isolated subprocess testing
    Model Used
      Qwen2.5-Coder-3B-Instruct
      4-bit quantized, 2.1GB
      89% HumanEval score
    Key Finding
      Small models beat large on coding
      Evaluation design matters

mindmap root((repo)) What it does Local AI code benchmark CPU-only, no GPU needed Resume on interruption Benchmarks Supported HumanEval 164 problems MBPP 500 problems MMLU-Pro general knowledge How it works Up to 3 attempts per problem Code extracted from response Isolated subprocess testing Model Used Qwen2.5-Coder-3B-Instruct 4-bit quantized, 2.1GB 89% HumanEval score Key Finding Small models beat large on coding Evaluation design matters

Click or tap to explore — scroll the page freely

Things people build with this

USE CASE 1

Benchmark how well a local AI model writes Python code without paying for a cloud API.

USE CASE 2

Reproduce the claim that a 2.1GB model can outperform large commercial models on HumanEval coding tasks.

USE CASE 3

Compare the effect of retry logic and temperature settings on benchmark scores.

USE CASE 4

Run coding benchmark evaluations on CPU-only machines with no GPU required.

Tech stack

PythonQwen2.5

Getting it running

Difficulty · moderate Time to first run · 30min

Requires downloading the Qwen2.5-Coder-3B-Instruct model file (~2.1GB), runs entirely on CPU with no GPU needed.

Use freely for any purpose including commercial use, as long as you keep the copyright notice.

In plain English

busyBeaver is a Python benchmark runner that tests AI coding ability on your own computer using a small, locally downloaded model. The project's main claim is that a 2.1 GB model running on a consumer CPU scored 89% on HumanEval, a standard test where the model writes Python functions from descriptions, while Cohere's Command A+, a much larger commercial model, scored 75% on the same test. The tool works by feeding each test problem to the model, generating up to three attempts at different temperature settings, extracting the code from the model's response, and running the benchmark's test suite against that code in an isolated subprocess with a 15-second timeout. Progress is saved after every problem, so if the run stops midway it can resume from where it left off. The tool supports three benchmarks: HumanEval (164 coding problems), MBPP (500 coding problems), and MMLU-Pro (general knowledge multiple-choice questions). The model used in the published results is Qwen2.5-Coder-3B-Instruct, a 3 billion parameter model from Alibaba quantized to 4-bit precision to fit in 2.1 GB. It runs on CPU with no GPU required. The README notes the caveats: the small model uses three attempts while the commercial model's score likely reflects a single attempt, and on MMLU-Pro, which tests general world knowledge rather than coding, the small model scores 27% versus 68% for the larger one. The broader point the project makes is that a small, code-specialized model can outperform a much larger general model on narrow coding tasks, and that evaluation design choices such as prompt framing, retry logic, and test sandboxing influence results significantly. Licensed under MIT.

Copy-paste prompts

Prompt 1

I want to benchmark Qwen2.5-Coder-3B on HumanEval using busybeaver on my laptop. Walk me through downloading the model and running the benchmark from the beginning.

Prompt 2

How does busybeaver generate up to three attempts per problem and what temperature settings does it use? Show me where to change the retry logic.

Prompt 3

My busybeaver run stopped halfway through MBPP. How does the resume feature work and what file does it save progress to?

Prompt 4

I want to add a custom benchmark to busybeaver. How do I define a problem set with test cases so the runner can evaluate it the same way it handles HumanEval?

Prompt 5

Why does the Qwen model score 89% on HumanEval but only 27% on MMLU-Pro, while a larger commercial model scores 68%? What does busybeaver's design tell us about model specialization?

Open on GitHub → Explain another repo

← djlougen on gitmyhub — every repo by this author, as a profile.

Verify against the repo before relying on details.