hackingdave/modelregression

★ 23 · Python · Audience: developer · License: MIT

Mindmap

mindmap
  root((ModelRegression))
    What it does
      Daily AI benchmarks
      Regression detection
      Public dashboard
    Test categories
      Logical reasoning
      Bug fixing
      Security awareness
      Code quality
    Models tested
      Claude Opus
      Claude Sonnet
      GPT-5.5
      Grok
    Tech
      Python orchestrator
      SQLite storage
      Next.js frontend

Things people build with this

USE CASE 1

Track whether Claude, GPT, or Grok has silently degraded in coding quality after a model update.

USE CASE 2

Compare AI models side by side on tasks like bug fixing, security awareness, or safe code refactoring.

USE CASE 3

Get automatic regression alerts when a model's benchmark score drops more than 5% from its recent average.

USE CASE 4

Add new test categories to the open-source suite to cover gaps in the existing 10 benchmark areas.

Tech stack

Python · SQLite · Next.js

In plain English

ModelRegression is an independent benchmarking project that tracks how well AI coding tools perform over time and automatically flags drops in a model's quality. AI providers update their models frequently, sometimes without announcing changes, so the project runs the same fixed set of tests against each model every day and publishes the results at modelregression.com.

The benchmark suite contains 30 tests spread across 10 categories: multi-step logical reasoning, coding tasks, bug fixing, feature implementation, edge case coverage, safe refactoring (whether a model can refactor code without introducing new bugs), security awareness (such as recognizing SQL injection or cross-site scripting risks), instruction following, code quality, and performance efficiency. The tests run automatically every day at 3am, and scores are stored in a local SQLite database. If a model's score drops more than 5% from its recent average, the system flags a regression; drops above 10% or 20% are escalated to higher severity levels.

The models tested are Claude Opus 4.8 and Claude Sonnet 4.6 from Anthropic, GPT-5.5 from OpenAI, and Grok from xAI. Each model is tested through its official command-line tool rather than through a raw API call, so the benchmarks reflect the experience a developer would actually have.

The website is built with Next.js and shows a dashboard with scores over time, per-model and per-category detail pages, a side-by-side comparison view, outage history for periods when models were unreachable, and an evidence page for each test run that includes the original prompt, the model's output, and the score it received.

The benchmark engine itself is a Python orchestrator that runs tests in parallel, uses another AI model (Claude Sonnet) as a judge for tests that have no single correct answer, and exports results to static JSON files that the website reads. The project is MIT-licensed and open to contributions for new test categories.
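The regression rule described above (a drop of more than 5% from the recent average, with escalation above 10% and 20%) can be illustrated with a minimal sketch. This is not the project's actual code: the `scores` table layout, the 7-run window used for the "recent average", the severity names, the model name, and the dates are all assumptions made for illustration, and the drop is assumed to be relative to the average rather than measured in absolute points.

```python
import sqlite3
from typing import Optional

# Thresholds (5%, 10%, 20%) come from the description above.
# The severity labels are hypothetical; the real project may name them differently.
SEVERITY_LEVELS = [(20.0, "critical"), (10.0, "major"), (5.0, "minor")]


def classify_drop(recent_avg: float, today: float) -> Optional[str]:
    """Return a severity label if today's score fell far enough below recent_avg."""
    if recent_avg <= 0:
        return None  # no meaningful baseline to compare against
    # Assumption: the drop is measured relative to the recent average.
    drop_pct = (recent_avg - today) / recent_avg * 100.0
    # Check the largest threshold first so the highest matching severity wins.
    for threshold, label in SEVERITY_LEVELS:
        if drop_pct > threshold:
            return label
    return None


def recent_average(conn: sqlite3.Connection, model: str,
                   before_date: str, window: int = 7) -> Optional[float]:
    """Average the last `window` scores recorded before `before_date`.

    Assumes a hypothetical table scores(model TEXT, run_date TEXT, score REAL)
    with ISO-formatted dates, so string comparison orders them correctly.
    """
    rows = conn.execute(
        "SELECT score FROM scores WHERE model = ? AND run_date < ? "
        "ORDER BY run_date DESC LIMIT ?",
        (model, before_date, window),
    ).fetchall()
    if not rows:
        return None
    return sum(row[0] for row in rows) / len(rows)


def demo() -> Optional[str]:
    """Build an in-memory database with made-up scores and classify one run."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE scores (model TEXT, run_date TEXT, score REAL)")
    # Seven made-up daily scores of 80.0 for a placeholder model.
    history = [("2025-01-0%d" % day, 80.0) for day in range(1, 8)]
    conn.executemany(
        "INSERT INTO scores VALUES ('example-model', ?, ?)", history
    )
    baseline = recent_average(conn, "example-model", "2025-01-08")
    conn.close()
    # A score of 70.0 against a baseline of 80.0 is a 12.5% relative drop.
    return classify_drop(baseline, 70.0)
```

In this sketch, `demo()` classifies a score of 70.0 against a baseline of 80.0 as the middle severity level, because a 12.5% drop exceeds the 10% threshold but not the 20% one. Checking the thresholds from largest to smallest keeps the escalation logic in a single loop.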

Copy-paste prompts

Prompt 1

I want to add a new test category to the ModelRegression benchmark suite. Show me how the existing test structure works and write a sample test that checks whether an AI model correctly handles async Python error handling.

Prompt 2

I want to run the ModelRegression benchmark locally against Claude Sonnet and GPT-5.5. Walk me through setting up the Python orchestrator, pointing it at the models via their CLI tools, and storing results in SQLite.

Prompt 3

The ModelRegression dashboard shows a 12% drop in the bug fixing category for one model. How do I read the evidence page to see the original prompt, the model output, and what score the judge gave?

Verify against the repo before relying on details.