Compare GPTHumanizer, GPTZero, ZeroGPT, and Sapling on accuracy and false-positive rate before picking one for school or work
Reproduce the benchmark on a new detector by reusing the 1,000-passage test set and label rules
Audit how often each detector wrongly flags real human writing from Wikipedia, StackExchange, or Enron emails
Slice the results by word-count bucket (50-200, 200-500, 500-1000) to see which detector handles long text best
Per-item detector outputs are hosted on public Google Drive rather than in git, so you must download them before the evaluation scripts can run.
This repository is a public benchmark, run in May 2026, that tests four commercial AI-text detectors on the same 1,000 English passages and reports how often each one was right. The four tools compared are GPTHumanizer, GPTZero, ZeroGPT, and the Sapling AI Detector. The author argues that raw accuracy is not the only number that matters: the rate at which a detector falsely accuses real human writing of being AI-generated, called the human false positive rate, is just as important because a wrong flag can cause real consequences for students or writers.
The test set is 500 human-written passages sampled from the Pile-small dataset and 500 AI-generated passages sampled from a larger pool of 2,600 model outputs. The human side spans thirteen sources including Wikipedia, OpenWebText2, Pile-CC, USPTO patent backgrounds, StackExchange, HackerNews, FreeLaw, PubMed, ArXiv, and Enron emails, so detectors are not just rewarded for handling one writing style. The AI side mixes outputs from many model families, including several Claude versions, GPT-3.5, GPT-4o, GPT-5, o3, DeepSeek Chat, Kimi, and Grok-4. Each split is balanced across three word-count buckets: 50 to 200, 200 to 500, and 500 to 1000 words.
The headline result in the README is that GPTZero was the most accurate overall at 98.7 percent, but GPTHumanizer was the safest in terms of false accusations, flagging zero of 500 human passages as AI. ZeroGPT and Sapling were close to each other near 88 percent accuracy but both flagged roughly 18 to 19 percent of human passages as AI, which the author calls a serious risk. GPTZero had two API errors that are kept in the data but excluded from the rates.
The repository ships the benchmark input data as JSON files under data/, plus evaluation scripts that turn each detector's raw response into a single human-or-AI label using rules listed in a table. Per-item detector outputs are large, so the full files live on public Google Drive links rather than in git.
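The two headline numbers, accuracy and human false positive rate, can be sketched as below. The README's exact formulas are not reproduced in this summary, so the sketch assumes the standard confusion-matrix definitions with AI as the positive class, and it drops items whose detector call errored before computing rates, as described for GPTZero. The `Item` record, the `score` function, and the toy data are hypothetical and are not taken from the repository's scripts.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Item:
    truth: str                 # ground truth: "human" or "ai"
    predicted: Optional[str]   # detector label, or None if the API call errored


def score(items):
    """Return accuracy and human false positive rate, skipping errored items."""
    scored = [it for it in items if it.predicted is not None]
    tp = sum(1 for it in scored if it.truth == "ai" and it.predicted == "ai")
    tn = sum(1 for it in scored if it.truth == "human" and it.predicted == "human")
    fp = sum(1 for it in scored if it.truth == "human" and it.predicted == "ai")
    n = len(scored)
    humans = tn + fp  # all scored human-written items
    return {
        "n_scored": n,
        "n_errors": len(items) - n,
        "accuracy": (tp + tn) / n if n else None,
        "human_fpr": fp / humans if humans else None,
    }


# Toy data (assumption): 4 human items, 4 AI items, one API error.
toy = [
    Item("human", "human"), Item("human", "human"),
    Item("human", "human"), Item("human", "ai"),
    Item("ai", "ai"), Item("ai", "ai"),
    Item("ai", "human"), Item("ai", None),
]
result = score(toy)
print(result)
```

On the toy data, one of four human items is flagged as AI, so the human false positive rate is 0.25 even though the errored item is excluded from every rate.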
The README also describes the confusion-matrix definitions used and the metric formulas, and notes a section on performance by text length that continues past the part shown here.
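Slicing results by text length could look like the sketch below. It assumes word counts come from whitespace splitting and that each bucket includes its lower bound and excludes its upper bound, except the last bucket, which also includes 1000; the repository may draw the boundaries differently. The names `BUCKETS`, `bucket_for`, and `accuracy_by_bucket` are hypothetical.

```python
from collections import defaultdict

# Word-count buckets named in the summary: 50-200, 200-500, 500-1000.
BUCKETS = [(50, 200), (200, 500), (500, 1000)]


def bucket_for(word_count):
    """Map a word count to a bucket label, or None if it falls outside all buckets."""
    last = len(BUCKETS) - 1
    for i, (lo, hi) in enumerate(BUCKETS):
        # Assumption: [lo, hi) for every bucket, with the final upper bound included.
        if lo <= word_count < hi or (i == last and word_count == hi):
            return f"{lo}-{hi}"
    return None


def accuracy_by_bucket(rows):
    """rows: iterable of (text, truth, predicted); predicted is None on API error."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for text, truth, predicted in rows:
        if predicted is None:
            continue  # errored items are excluded from rates
        label = bucket_for(len(text.split()))
        if label is None:
            continue  # outside the benchmark's length range
        total[label] += 1
        correct[label] += int(truth == predicted)
    return {label: correct[label] / total[label] for label in total}


# Toy rows (assumption): synthetic texts of 60, 300, 1000, and 10 words.
rows = [
    ("word " * 60, "human", "human"),
    ("word " * 60, "ai", "human"),
    ("word " * 300, "ai", "ai"),
    ("word " * 1000, "human", "human"),
    ("word " * 10, "ai", "ai"),
]
by_bucket = accuracy_by_bucket(rows)
print(by_bucket)
```

The same grouping works for the human false positive rate by counting only human-written rows in each bucket.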
Generated 2026-05-22 · Model: sonnet-4-6 · Verify against the repo before relying on details.