AI detectors measure how predictable each word choice is to a language model, a signal called perplexity. AI-written text tends to score low on perplexity because models favor high-probability word combinations; human writing scores higher because people make word choices a language model would not rank highly. AI Busted is a free AI detector and humanizer. Paste any text to get a detection score, then rewrite it with tone and vocabulary controls in one place, and treat the result as one signal rather than a verdict.
What is AI detection?
AI detection is the process of estimating whether a piece of text was written by a human or produced by a language model like GPT-4, Claude, or Gemini. Detectors do not read for meaning. They analyze statistical patterns: how closely the text follows the probability curves of a reference language model. Every word in a sentence has a probability score when placed after the words that came before it. Language models assign those scores as part of how they work. AI detectors borrow that same probability-scoring engine to ask: does this text look like something a model would write? A text can score "AI-written" without a single factual error. It can score "human" even when a model wrote 80% of it. The score reflects statistical patterns in word choice, not the source or intent behind the writing. That gap between what the score measures and what people think it proves is where most institutional misuse begins.
What signals do AI detectors actually look at?
Most detectors measure two overlapping signals: perplexity and burstiness. Perplexity measures how expected each word is to a language model given the words before it. Low perplexity means the text follows the model's internal probability curve closely: the writer kept choosing words the model would have ranked highly. High perplexity means the writer made choices the model would not have prioritized. Burstiness captures how much perplexity varies across sentences. Human writing swings between high- and low-perplexity sentences. A short punchy line followed by a long explanatory one, then a fragment, then a longer arc. AI-written text stays flatter across sentences, producing a narrower variance pattern that detectors can measure directly. Some detectors add stylometric signals like sentence length, punctuation frequency, or vocabulary spread. These fingerprints are compared with labeled human and AI samples, then folded into a vendor-set confidence threshold.
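To make the perplexity half of this concrete, here is a minimal sketch with invented per-word probabilities rather than output from any real detector: perplexity is the exponential of the negative mean log-probability, so the more predictable every word is, the lower the number.

```python
import math

def perplexity(word_probs):
    """Perplexity = exp of the negative mean log-probability of each word."""
    log_probs = [math.log(p) for p in word_probs]
    return math.exp(-sum(log_probs) / len(log_probs))

# Hypothetical per-word probabilities assigned by a reference model.
ai_like = [0.42, 0.35, 0.51, 0.38, 0.47]      # every word was a top-ranked choice
human_like = [0.40, 0.04, 0.33, 0.008, 0.12]  # several choices the model ranked low

print(f"AI-like perplexity:    {perplexity(ai_like):.1f}")     # ~2.4, low
print(f"human-like perplexity: {perplexity(human_like):.1f}")  # ~11.5, higher
```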
How does the detector score your text?
When you paste text into a detector, the tool feeds that text through a reference language model the vendor controls internally, then collects a log-probability score for each word given the words that come before it. This step is commonly called "tokenization" in popular writing about AI detectors. That framing is technically off. Tokenization is preprocessing: splitting text into sub-word units before any probability math happens. The detection-relevant step is the log-probability scoring that follows tokenization. Getting that distinction right matters when you evaluate claims about how a detector works, or why it misfired on a specific piece of text. The choice of reference model sets a ceiling on the whole pipeline. A detector calibrated against GPT-3 output will miss patterns specific to Claude 3.5 or Gemini 1.5. Newer models produce text with different stylistic fingerprints: smoother syntax, less repetition in phrasing. Older reference models may not flag those patterns reliably. The text samples used to calibrate the reference model determine how well the detector handles AI output that post-dates those samples.
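The scoring step itself can be sketched in a few lines. The snippet below uses GPT-2 from the Hugging Face transformers library as a stand-in reference model, since no vendor publishes the model it actually uses; the mechanics of assigning each token a log-probability given its left context are the same.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# GPT-2 stands in for the vendor's private reference model.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def per_token_logprobs(text):
    """Return (token, log-probability) pairs, each token scored given its left context."""
    input_ids = tokenizer(text, return_tensors="pt")["input_ids"]
    with torch.no_grad():
        logits = model(input_ids).logits              # [1, seq_len, vocab]
    log_probs = torch.log_softmax(logits, dim=-1)
    scores = []
    for i in range(1, input_ids.shape[1]):
        token_id = input_ids[0, i]
        # The token at position i is scored by the distribution predicted at i - 1.
        scores.append((tokenizer.decode(token_id), log_probs[0, i - 1, token_id].item()))
    return scores

for token, lp in per_token_logprobs("The results of the study were consistent with expectations."):
    print(f"{token!r:>20}  {lp:8.3f}")
```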
How does burstiness scoring work?
Once per-word probability scores exist, the detector groups them by sentence and calculates how much those scores vary. That variance measurement is burstiness. AI-written text stays low and flat across a passage. Human writing varies: a simple sentence, then a complex one, then a fragment, then a longer explanatory arc. Models consistently favor the smoothest, most probable continuation of the text. That makes the flat-burstiness pattern hard to avoid at scale. Detectors weight burstiness differently. Some treat it as the primary indicator. Others fold it in as a correction factor when perplexity alone produces a weak signal. The exact weighting is rarely published, which makes it difficult to interpret what a borderline score actually means for your specific text.
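Continuing the earlier sketches, burstiness can be read as the variance of per-sentence perplexity. The per-word log-probability values below are invented for illustration, and real detectors apply their own unpublished weighting on top of this raw variance.

```python
import math
from statistics import pvariance

def sentence_perplexity(log_probs):
    return math.exp(-sum(log_probs) / len(log_probs))

# Hypothetical per-word log-probabilities, already grouped by sentence.
sentences = [
    [-0.9, -1.1, -0.8, -1.0],        # short, predictable sentence
    [-2.6, -0.7, -3.1, -1.2, -0.9],  # sentence with surprising word choices
    [-1.0, -0.8],                    # fragment
]

per_sentence = [sentence_perplexity(s) for s in sentences]
burstiness = pvariance(per_sentence)  # variance across sentences

print("per-sentence perplexity:", [round(p, 1) for p in per_sentence])
print("burstiness (variance):  ", round(burstiness, 2))
# Flat, low-variance readings push toward "AI-like"; wide swings push toward "human".
```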
How does the detector calibrate its score?
With perplexity and burstiness scores in hand, the detector checks where the text lands relative to a reference distribution of known AI-written and known human-written passages. This step maps closely to the DetectGPT method published by Mitchell et al. (2023, arXiv:2301.11305). Their central finding: AI-written text tends to sit near a local maximum of the model's probability surface, while human-edited text lands in regions of lower curvature. The detector applies small perturbations by swapping words and adjusting phrasing, then checks whether the probability score rises or falls. If the text stays near the top of the probability curve after those changes, that supports an AI-authorship signal.
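A rough sketch of that perturbation test is below. DetectGPT itself generates perturbations with a mask-filling model such as T5; the word-swap function here is a crude stand-in, and the score_fn parameter is assumed to be a mean per-token log-probability like the one sketched earlier.

```python
import random
import statistics

# Crude stand-in for DetectGPT's mask-and-refill perturbations.
SWAPS = {"results": "findings", "consistent": "aligned", "study": "analysis",
         "shows": "demonstrates", "important": "significant"}

def perturb(text, rate=0.3, seed=0):
    rng = random.Random(seed)
    words = text.split()
    return " ".join(SWAPS.get(w.lower(), w) if rng.random() < rate else w for w in words)

def curvature_gap(text, score_fn, n_perturbations=20):
    """score_fn(text) -> mean per-token log-probability under a reference model.

    A positive gap means the original sits above its perturbed neighbours on the
    probability surface, which DetectGPT reads as evidence of AI authorship."""
    original = score_fn(text)
    perturbed = [score_fn(perturb(text, seed=i)) for i in range(n_perturbations)]
    return original - statistics.mean(perturbed)
```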
What does the final detection score actually mean?
The final step turns the math into something visible: a percentage, a label like "likely AI-written," or a sentence-level view where the tool marks specific passages as high-confidence AI. Some detectors flag individual sentences they scored as AI-likely. That helps manual review, but sentence-level labels carry the same uncertainty as the full score: they show where the probability pattern looks model-like, not who wrote it. The output is not a fact. It is a probability estimate with an unpublished confidence interval, so you rarely know the false positive rate for your topic, genre, or writing style.
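How the raw signals become that percentage is vendor-specific and unpublished, so the blend below is purely illustrative: made-up weights, a logistic squash, and an arbitrary 70% cutoff for the label.

```python
import math

def detection_score(perplexity, burstiness,
                    w_perplexity=-0.9, w_burstiness=-0.5, bias=4.0):
    """Map the two signals to a 0-100% 'AI confidence'.
    Weights, bias, and threshold are invented; real vendors calibrate their own."""
    z = bias + w_perplexity * perplexity + w_burstiness * burstiness
    return 100 / (1 + math.exp(-z))

score = detection_score(perplexity=2.4, burstiness=0.3)
label = "likely AI-written" if score >= 70 else "likely human-written"
print(f"{score:.0f}% AI confidence -> {label}")  # low perplexity + flat rhythm -> high score
```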
What does the full detection pipeline look like?
| Pipeline step | What the detector does | Signal direction |
| --- | --- | --- |
| Log-probability scoring | Feeds text through reference model, scores each word in context | Higher per-word probability = more AI-like |
| Perplexity | Averages per-word probability scores across the passage | Lower = more AI-like |
| Burstiness | Measures variance in per-sentence perplexity | Lower variance = more AI-like |
| Probability curvature check | Perturbation test against reference distribution (DetectGPT method) | Near local probability max = more AI-like |
| Score output | Blends signals into a percentage with a vendor-set confidence threshold | 0-100% AI confidence |
Why do two detectors disagree on the same text?
Run the same paragraph through GPTZero and Winston AI and you will often get different results, sometimes wildly different. See our 2026 AI detector test results for concrete examples. Three things drive the divergence. First, each detector uses a different reference model. The probability scores they assign to the same word sequence are not identical, so the perplexity reading diverges before any other calculation happens. Second, calibration differs. Each vendor sets thresholds against its own labeled samples, so the mix of models, writing styles, and topics in those samples determines where the cutoff lands. Third, burstiness weighting differs. A detector that weights burstiness heavily will flag flat-rhythm text quickly. A detector that treats perplexity as the primary signal may pass the same paragraph with a comfortable margin.
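The toy comparison below shows how that plays out: two hypothetical detectors read the exact same perplexity and burstiness values but weight and threshold them differently, so one flags the text and the other passes it. All names and numbers are invented for illustration.

```python
def verdict(perplexity, burstiness, w_perp, w_burst, threshold):
    # Lower perplexity and lower burstiness both push toward "AI"; the weights decide
    # how much each signal counts, and the threshold decides where the label flips.
    ai_score = w_perp * (1 / perplexity) + w_burst * (1 / (burstiness + 1))
    return round(ai_score, 2), "AI" if ai_score >= threshold else "human"

signals = {"perplexity": 3.1, "burstiness": 0.8}

# "Detector A" leans on burstiness, "Detector B" leans on perplexity.
print(verdict(**signals, w_perp=0.3, w_burst=0.7, threshold=0.45))  # (0.49, 'AI')
print(verdict(**signals, w_perp=0.8, w_burst=0.2, threshold=0.45))  # (0.37, 'human')
```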
Where does the detection pipeline break down?
The pipeline has consistent failure modes. Knowing them matters whether you are the one running the detector or the one whose writing is being scored.
ESL and neurodivergent writers
Research by Liang et al. (2023) found false positive rates above 61% for non-native English writers across several commercial detectors. Short sentences, limited vocabulary range, and constrained syntax can push perplexity down in the same direction AI-written text moves. According to the NIST AI Risk Management Framework, high-stakes AI-assisted systems need human oversight and appeal routes. Treat AI Busted as a second signal in that review, not a sole arbiter.
Post-edited AI text
If a writer starts with GPT-4 output and then edits heavily, burstiness rises and perplexity increases. The text may score "human" because detectors only see the final word choices, not the revision history.
Domain-specific writing
Technical writing, legal text, and scientific abstracts often have low perplexity by nature. Formal style rules push toward standard phrasing, so detectors calibrated on general-web text can over-flag these genres.
The base-rate problem
Suppose only 5% of submissions are AI-written. Even a strong detector can produce nearly as many false positives as true positives, since the human-written pool is much larger. That base-rate problem applies to any binary classifier when the target event is rare. For thorough reliability numbers across tools, see how reliable are AI detectors.
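The arithmetic behind that claim is short. The sensitivity and false positive rate below are assumed values for illustration, not measurements of any specific tool.

```python
submissions = 1000
ai_rate = 0.05              # 5% of submissions are AI-written
sensitivity = 0.95          # assumed: detector catches 95% of AI text
false_positive_rate = 0.05  # assumed: detector flags 5% of human text

ai_texts = submissions * ai_rate              # 50
human_texts = submissions - ai_texts          # 950

true_positives = ai_texts * sensitivity               # 47.5
false_positives = human_texts * false_positive_rate   # 47.5

precision = true_positives / (true_positives + false_positives)
print(f"flagged AI texts:         {true_positives:.0f}")
print(f"flagged human texts:      {false_positives:.0f}")
print(f"chance a flag is correct: {precision:.0%}")  # 50%
```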
How do watermark detection and multi-model scoring work?
Two parallel approaches work differently from the perplexity pipeline.
Watermarking
Kirchenbauer et al. (2023) showed that a model can embed a statistical watermark during generation by nudging word choices into a pseudo-random pattern. A matching detector checks for that pattern without perplexity scoring, but this works only when the original model was configured to add the watermark.
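A heavily simplified version of that check looks like the sketch below. The real scheme operates on model tokens and logits at generation time; this version works on whole words and only shows the detection side, counting how many words fall on the pseudo-random "green list" and testing whether that count is higher than chance.

```python
import hashlib
import math

GREEN_FRACTION = 0.5  # share of the vocabulary marked "green" at each step

def is_green(prev_word, word):
    """Pseudo-randomly assign `word` to the green list, seeded by the previous word.
    A watermarking model would have nudged its own choices toward green words."""
    digest = hashlib.sha256(f"{prev_word}|{word}".encode()).digest()
    return digest[0] < 256 * GREEN_FRACTION

def watermark_z_score(text):
    """Compare the green-word count against the ~50% expected by chance.
    A large positive z-score suggests the text carries the watermark."""
    words = text.lower().split()
    pairs = list(zip(words, words[1:]))
    greens = sum(is_green(prev, cur) for prev, cur in pairs)
    n = len(pairs)
    expected = n * GREEN_FRACTION
    variance = n * GREEN_FRACTION * (1 - GREEN_FRACTION)
    return (greens - expected) / math.sqrt(variance)
```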
Multi-model scoring
The Binoculars approach (Hans et al., 2024) runs the same text through two models: a scorer and an observer. It measures how much their probability assignments diverge, which can reduce false positives for non-native English writers compared with single-model perplexity scoring.
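A rough sketch of that two-model idea is below. It loosely follows the paper's perplexity-over-cross-perplexity ratio, but it is not the authors' implementation: two small public models, GPT-2 and DistilGPT-2 (which share a tokenizer), stand in for the model pair the paper actually uses.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Two small public models stand in for the paper's model pair.
observer = AutoModelForCausalLM.from_pretrained("gpt2").eval()
scorer = AutoModelForCausalLM.from_pretrained("distilgpt2").eval()
tokenizer = AutoTokenizer.from_pretrained("gpt2")  # shared by both models

@torch.no_grad()
def binoculars_style_score(text):
    ids = tokenizer(text, return_tensors="pt")["input_ids"]
    obs_logits = observer(ids).logits[0, :-1]
    scr_logits = scorer(ids).logits[0, :-1]
    targets = ids[0, 1:]

    # Observer's log-perplexity of the actual text.
    obs_logprobs = torch.log_softmax(obs_logits, dim=-1)
    log_ppl = -obs_logprobs[torch.arange(targets.numel()), targets].mean()

    # Cross term: how surprising the scorer's predicted distribution is to the observer.
    scr_probs = torch.softmax(scr_logits, dim=-1)
    x_log_ppl = -(scr_probs * obs_logprobs).sum(dim=-1).mean()

    # Lower ratios lean toward "AI-written", higher ratios toward "human".
    return (log_ppl / x_log_ppl).item()
```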
Common Questions
Does varying sentence length make text pass AI detectors?
Varying sentence length raises burstiness, which can move a detector toward a more human reading. It does not reliably fool detectors, since perplexity scoring still runs separately and most tools blend both signals. AI Busted lets you test the same passage before and after edits to see whether the score actually changed.
Do all AI detectors score text the same way?
No. GPTZero, Winston AI, Originality.ai, and similar tools use different reference models, calibration samples, thresholds, and burstiness weighting. That is why the same passage can score 20% on one tool and 80% on another. The score belongs to that vendor's pipeline, not to a universal standard.
Why do AI detectors flag human-written text?
Detectors flag human writing when phrasing has low perplexity and closely follows patterns a reference model ranks highly. Formal writing, constrained vocabulary, and academic abstracts can all look this way. Research published in PMC (Liang et al., 2023) found non-native English writers face high false positive rates, above 61% in their sample.