mattc95/2026-ai-detector-benchmark

Analysis updated 2026-06-24

★ 17 | Python | Audience · researcher | Complexity · 2/5 | Setup · moderate

Mindmap

mindmap
  root((ai-detector-benchmark))
    Inputs
      500 human passages
      500 AI passages
      Pile-small samples
    Outputs
      Accuracy scores
      False positive rates
      Confusion matrices
    Use Cases
      Compare detectors
      Audit false accusations
      Pick a detector
      Reproduce the test
    Tech Stack
      Python
      JSON
      Google Drive


What do people build with it?

USE CASE 1

Compare GPTHumanizer, GPTZero, ZeroGPT, and Sapling on accuracy and false-positive rate before picking one for school or work

USE CASE 2

Reproduce the benchmark on a new detector by reusing the 1,000-passage test set and label rules

USE CASE 3

Audit how often each detector wrongly flags real human writing from Wikipedia, StackExchange, or Enron emails

USE CASE 4

Slice the results by word-count bucket (50-200, 200-500, 500-1000) to see which detector handles long text best
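
The word-count slicing in the last use case can be sketched as follows. This is a minimal illustration, not the repository's own script: the record keys `text`, `label`, and `predicted` are hypothetical, and the handling of boundary values (exactly 200 or 500 words) is an assumption, since the summary does not say which bucket they fall into.

```python
from collections import defaultdict

# Bucket edges from the benchmark description: 50-200, 200-500, 500-1000 words.
# Assumption: ranges are half-open, except the last bucket, which includes 1000.
BUCKETS = [("50-200", 50, 200), ("200-500", 200, 500), ("500-1000", 500, 1000)]


def word_bucket(text):
    """Return the bucket name for a passage, or None if it is out of range."""
    n = len(text.split())
    for name, low, high in BUCKETS:
        is_last = name == BUCKETS[-1][0]
        if low <= n < high or (is_last and n == high):
            return name
    return None


def accuracy_by_bucket(records):
    """Compute accuracy per bucket.

    records: iterable of dicts with hypothetical keys
    'text', 'label' (ground truth) and 'predicted' (detector label).
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for record in records:
        bucket = word_bucket(record["text"])
        if bucket is None:
            continue  # skip passages outside the 50-1000 word range
        totals[bucket] += 1
        hits[bucket] += record["label"] == record["predicted"]
    return {bucket: hits[bucket] / totals[bucket] for bucket in totals}
```

Calling `accuracy_by_bucket` once per detector gives one accuracy figure per bucket, which is enough to compare how each detector handles short versus long text.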

What is it built with?

Python, JSON

How does it compare?

	mattc95/2026-ai-detector-benchmark	0petru/sentimo	alingalingling/akasha-wechat
Stars	17	17	17
Language	Python	Python	Python
Setup difficulty	moderate	moderate	hard
Complexity	2/5	3/5	4/5
Audience	researcher	developer	developer

Figures from each repo's GitHub metadata at analysis time.

How do you get it running?

Difficulty · moderate | Time to first run · 1h+

Per-item detector outputs are hosted on public Google Drive rather than in git, so you must download them before the evaluation scripts can run.

In plain English

This repository is a public benchmark, run in May 2026, that tests four commercial AI-text detectors on the same 1,000 English passages and reports how often each one was right. The four tools compared are GPTHumanizer, GPTZero, ZeroGPT, and the Sapling AI Detector. The author argues that raw accuracy is not the only number that matters: the rate at which a detector falsely accuses real human writing of being AI-generated, called the human false positive rate, is just as important because a wrong flag can have real consequences for students or writers.

The test set is 500 human-written passages sampled from the Pile-small dataset and 500 AI-generated passages sampled from a larger pool of 2,600 model outputs. The human side spans thirteen sources including Wikipedia, OpenWebText2, Pile-CC, USPTO patent backgrounds, StackExchange, HackerNews, FreeLaw, PubMed, ArXiv, and Enron emails, so detectors are not just rewarded for handling one writing style. The AI side mixes outputs from many model families, including several Claude versions, GPT-3.5, GPT-4o, GPT-5, o3, DeepSeek Chat, Kimi, and Grok-4. Each split is balanced across three word-count buckets: 50 to 200, 200 to 500, and 500 to 1000 words.

The headline result in the README is that GPTZero was the most accurate overall at 98.7 percent, but GPTHumanizer was the safest in terms of false accusations, flagging zero of 500 human passages as AI. ZeroGPT and Sapling were close to each other near 88 percent accuracy, but both flagged roughly 18 to 19 percent of human passages as AI, which the author calls a serious risk. GPTZero had two API errors that are kept in the data but excluded from the rates.

The repository ships the benchmark input data as JSON files under data/, plus evaluation scripts that turn each detector's raw response into a single human-or-AI label using rules listed in a table. Per-item detector outputs are large, so the full files live on public Google Drive links rather than in git.
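
The idea of collapsing a raw detector response into one label can be sketched like this. It is only an illustration of the pattern: the real rules live in the README's table, which is not reproduced here, so the `ai_probability` field name, the `error` key, and the 0.5 threshold are all hypothetical.

```python
def to_label(response, threshold=0.5):
    """Map a raw detector response to 'human', 'ai', or 'error'.

    Assumptions (not taken from the repo): the response is a dict with a
    numeric 'ai_probability' field, failures carry an 'error' key, and a
    score at or above `threshold` counts as AI.
    """
    if response is None or "error" in response:
        return "error"  # keep failed calls in the data rather than dropping them
    score = response.get("ai_probability")
    if score is None:
        return "error"
    return "ai" if score >= threshold else "human"
```

Returning a separate `error` label, rather than guessing, matches the description above of API errors being kept in the data but left out of the rates.
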
The README also describes the confusion-matrix definitions used and the metric formulas, and notes a section on performance by text length that continues past the part shown here.
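
A confusion matrix and the two headline metrics could be computed roughly as below. This sketch assumes that AI is treated as the positive class and uses the standard formulas: accuracy = (TP + TN) / valid items, and human false positive rate = FP / (FP + TN). The README's exact definitions are not shown here, so treat these as assumptions to check against the repo.

```python
def confusion(pairs):
    """Build a confusion matrix from (truth, predicted) pairs.

    truth is 'human' or 'ai'; predicted is 'human', 'ai', or 'error'.
    AI is treated as the positive class (an assumption).
    """
    m = {"tp": 0, "tn": 0, "fp": 0, "fn": 0, "errors": 0}
    for truth, predicted in pairs:
        if predicted == "error":
            m["errors"] += 1  # counted, but excluded from the rates
        elif truth == "ai":
            m["tp" if predicted == "ai" else "fn"] += 1
        else:
            m["fp" if predicted == "ai" else "tn"] += 1
    return m


def metrics(m):
    """Return accuracy and human false positive rate over valid items only."""
    valid = m["tp"] + m["tn"] + m["fp"] + m["fn"]
    humans = m["fp"] + m["tn"]
    return {
        "accuracy": (m["tp"] + m["tn"]) / valid if valid else None,
        "human_fpr": m["fp"] / humans if humans else None,
    }
```

Because errors are tallied separately, a detector with failed API calls is scored on slightly fewer than 1,000 items, which is how the two GPTZero errors are described as being handled.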

Copy-paste prompts

Prompt 1

Walk me through the data/ JSON files in this repo and the rule table that turns each detector's raw response into a final human-or-AI label

Prompt 2

Run the evaluation scripts on the public Google Drive outputs and reproduce the GPTZero 98.7 percent headline number

Prompt 3

Add a fifth detector to this benchmark and write the integration plus label-mapping rule following the existing pattern

Prompt 4

Compute per-word-count-bucket accuracy and false-positive rates and chart them so I can see which detector breaks on short text

Prompt 5

Extend the test set with 200 fresh 2026 model outputs (Claude 4.7, GPT-5) and rerun the benchmark

Frequently asked questions

What is 2026-ai-detector-benchmark?

A May 2026 benchmark of four commercial AI-text detectors (GPTHumanizer, GPTZero, ZeroGPT, Sapling) on 1,000 balanced human and AI passages, with accuracy and human false-positive rates.

What language is 2026-ai-detector-benchmark written in?

Mainly Python. The stack also includes JSON.

How hard is 2026-ai-detector-benchmark to set up?

Setup difficulty is rated moderate, with an hour or more to a first successful run.

Who is 2026-ai-detector-benchmark for?

Mainly researchers.


Verify against the repo before relying on details.