explaingit

mattc95/2026-ai-detector-benchmark

18 · Python · Audience: researcher · Complexity: 2/5 · Active · Setup: moderate

TLDR

A May 2026 benchmark of four commercial AI-text detectors (GPTHumanizer, GPTZero, ZeroGPT, Sapling) on 1,000 balanced human and AI passages, with accuracy and human false-positive rates.

Mindmap

mindmap
  root((ai-detector-benchmark))
    Inputs
      500 human passages
      500 AI passages
      Pile-small samples
    Outputs
      Accuracy scores
      False positive rates
      Confusion matrices
    Use Cases
      Compare detectors
      Audit false accusations
      Pick a detector
      Reproduce the test
    Tech Stack
      Python
      JSON
      Google Drive

Things people build with this

USE CASE 1

Compare GPTHumanizer, GPTZero, ZeroGPT, and Sapling on accuracy and false-positive rate before picking one for school or work

USE CASE 2

Reproduce the benchmark on a new detector by reusing the 1,000-passage test set and label rules

USE CASE 3

Audit how often each detector wrongly flags real human writing from Wikipedia, StackExchange, or Enron emails

USE CASE 4

Slice the results by word-count bucket (50-200, 200-500, 500-1000) to see which detector handles long text best
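
The bucket slicing in Use Case 4 could be sketched roughly as below. This is an illustration, not the repository's code: the function names `word_count_bucket` and `accuracy_by_bucket` are hypothetical, words are counted by splitting on whitespace, and boundary values (exactly 200 or 500 words) are assigned to the higher bucket. All three choices are assumptions, since the summary does not say how the benchmark handles them.

```python
from collections import defaultdict


def word_count_bucket(text):
    """Return the word-count bucket label for a passage, or None if out of range.

    Assumption: words are whitespace-separated tokens, and a passage with
    exactly 200 or 500 words falls into the higher bucket.
    """
    n = len(text.split())
    if 50 <= n < 200:
        return "50-200"
    if 200 <= n < 500:
        return "200-500"
    if 500 <= n <= 1000:
        return "500-1000"
    return None


def accuracy_by_bucket(items):
    """Compute per-bucket accuracy from (text, true_label, predicted_label) tuples.

    Items with no prediction (None) or with a length outside every bucket
    are skipped rather than counted as wrong.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for text, truth, pred in items:
        bucket = word_count_bucket(text)
        if bucket is None or pred is None:
            continue
        total[bucket] += 1
        if truth == pred:
            correct[bucket] += 1
    return {bucket: correct[bucket] / total[bucket] for bucket in total}
```

Running `accuracy_by_bucket` once per detector gives a small table of accuracy per length bucket, which is enough to see whether a detector degrades on short or long passages.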

Tech stack

Python · JSON

Getting it running

Difficulty · moderate
Time to first run · 1h+

Per-item detector outputs are hosted on public Google Drive rather than in git, so you must download them before the evaluation scripts can run.

In plain English

This repository is a public benchmark, run in May 2026, that tests four commercial AI-text detectors on the same 1,000 English passages and reports how often each one was right. The four tools compared are GPTHumanizer, GPTZero, ZeroGPT, and the Sapling AI Detector. The author argues that raw accuracy is not the only number that matters: the rate at which a detector falsely accuses real human writing of being AI-generated, called the human false positive rate, is just as important because a wrong flag can have real consequences for students or writers.

The test set is 500 human-written passages sampled from the Pile-small dataset and 500 AI-generated passages sampled from a larger pool of 2,600 model outputs. The human side spans thirteen sources including Wikipedia, OpenWebText2, Pile-CC, USPTO patent backgrounds, StackExchange, HackerNews, FreeLaw, PubMed, ArXiv, and Enron emails, so detectors are not rewarded for handling just one writing style. The AI side mixes outputs from many model families, including several Claude versions, GPT-3.5, GPT-4o, GPT-5, o3, DeepSeek Chat, Kimi, and Grok-4. Each split is balanced across three word-count buckets: 50 to 200, 200 to 500, and 500 to 1000 words.

The headline result in the README is that GPTZero was the most accurate overall at 98.7 percent, but GPTHumanizer was the safest in terms of false accusations, flagging zero of 500 human passages as AI. ZeroGPT and Sapling were close to each other at about 88 percent accuracy, but both flagged roughly 18 to 19 percent of human passages as AI, which the author calls a serious risk. GPTZero had two API errors that are kept in the data but excluded from the rates.

The repository ships the benchmark input data as JSON files under data/, plus evaluation scripts that turn each detector's raw response into a single human-or-AI label using rules listed in a table. Per-item detector outputs are large, so the full files live on public Google Drive links rather than in git. The README also describes the confusion-matrix definitions used and the metric formulas, and notes a section on performance by text length that continues past the part shown here.
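
The README's own formulas are not reproduced in this summary, so the following is only a minimal sketch using the standard definitions, assuming "ai" is the positive class. The function names and the `(true_label, predicted_label)` record shape are hypothetical, not taken from the repository. Records with no prediction are skipped, mirroring how the two GPTZero API errors are described as excluded from the rates.

```python
from collections import Counter


def confusion_counts(records):
    """Count TP/FP/TN/FN from (true_label, predicted_label) pairs.

    Labels are "human" or "ai", with "ai" as the positive class (assumption).
    A predicted_label of None stands for a failed call (e.g. an API error)
    and is tallied as "skipped" instead of being scored.
    """
    counts = Counter()
    for truth, pred in records:
        if pred is None:
            counts["skipped"] += 1
        elif truth == "ai":
            counts["tp" if pred == "ai" else "fn"] += 1
        else:
            counts["fp" if pred == "ai" else "tn"] += 1
    return counts


def accuracy(counts):
    """(TP + TN) / all scored items."""
    scored = counts["tp"] + counts["tn"] + counts["fp"] + counts["fn"]
    return (counts["tp"] + counts["tn"]) / scored if scored else float("nan")


def human_false_positive_rate(counts):
    """FP / (FP + TN): the share of human passages flagged as AI."""
    humans = counts["fp"] + counts["tn"]
    return counts["fp"] / humans if humans else float("nan")


# Toy records for illustration only; these are not benchmark data.
toy = [
    ("human", "human"),
    ("human", "ai"),
    ("ai", "ai"),
    ("ai", "ai"),
    ("ai", "human"),
    ("ai", None),  # simulated API error, excluded from both rates
]
toy_counts = confusion_counts(toy)
print(accuracy(toy_counts), human_false_positive_rate(toy_counts))
```

Keeping the two metrics as separate functions reflects the author's point that a detector can score well on accuracy while still flagging a large share of human passages.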

Copy-paste prompts

Prompt 1
Walk me through the data/ JSON files in this repo and the rule table that turns each detector's raw response into a final human-or-AI label
Prompt 2
Run the evaluation scripts on the public Google Drive outputs and reproduce the GPTZero 98.7 percent headline number
Prompt 3
Add a fifth detector to this benchmark and write the integration plus label-mapping rule following the existing pattern
Prompt 4
Compute per-word-count-bucket accuracy and false-positive rates and chart them so I can see which detector breaks on short text
Prompt 5
Extend the test set with 200 fresh 2026 model outputs (Claude 4.7, GPT-5) and rerun the benchmark

Generated 2026-05-22 · Model: sonnet-4-6 · Verify against the repo before relying on details.