Your AI cold emails reply at 4.1%. Human ones at 5.2%. Filters can tell when readers can't.

By Linas Valiukas
Tags: sales automation, AI automation, email automation, SMBs, marketing agencies, real estate, European businesses

A six-month A/B test from Digital Applied put 50,000 AI-generated cold emails against 50,000 human-written ones. The AI emails drew replies at 4.1%, the human ones at 5.2%. The spam-flag gap was wider: 8% for AI versus 3% for human. That works out to roughly 2.7x more of your AI campaign hitting the junk folder before anyone sees it.

The strange part is what’s behind the gap. A 2023 Cornell PNAS paper by Maurice Jakesch and colleagues tested whether humans can spot AI-written self-presentations across job applications, dating profiles, and Airbnb host profiles. Across three experiments, accuracy was 50–52%. Chance level. The Cornell summary puts it bluntly: “AI was shown to exploit people’s heuristics to produce text that people more reliably rate as human-written than actual human-written text.”

So your prospect can’t reliably tell. Their inbox provider can.

What Gmail and Microsoft actually score

Google quietly shipped RETVec into Gmail’s spam filter in November 2023. Their own numbers: 38% improvement in spam-detection rate, 19.4% reduction in false positives, 83% less TPU compute. RETVec is a tokenization layer designed to be resilient to character-level adversarial tricks like homoglyphs, weird punctuation, and invisible characters. It also generalizes well to detecting LLM-generated text patterns, because both kinds of text drift away from the human baseline in measurable ways.

Microsoft was less subtle. In a September 2025 Microsoft Security Blog post, the Defender team published five specific signals their AI looks for in LLM-generated phishing:

  1. Descriptive English variable names concatenated with random hex strings
  2. Overly modular code structure with repeated logic blocks
  3. Verbose, generic, formal documentation
  4. Unusual CDATA/XML declarations
  5. Formulaic obfuscation style

Items 3 and 5 generalize directly to email body text. Microsoft’s own quote: “AI-generated code may be more complex or syntactically polished, but it still operates within the same behavioral and infrastructural boundaries as human-crafted attacks.” The polish is the tell.

Both filters run before your email reaches the recipient. Both keep updating their signals as model output drifts. Microsoft says it processes “billions of messages daily” through Defender’s AI models, which is the order-of-magnitude number to keep in mind when you’re wondering whether your campaign is “too small to attract attention.”

The 2,000-upvote sysadmin rant

A few weeks ago, a sysadmin posted this on r/sysadmin under the title “Rant: I DO NOT WANT TO READ EMAILS WRITTEN BY LLMs!”:

“My boss and grandboss are just LLM-ing emails back and forth with me CC’d occasionally asking for my input and I just fucking can’t deal with it already. They’re not even reading the shit!”

It hit 2,015 upvotes and 486 comments. The substantive replies stacked up against AI-written email in a way that should worry anyone running outreach.

u/derango (105 upvotes): “I know immediately when someone is using copilot for emails. There’s zero personality. It’s so depressing. Especially if it’s like one sentence.”

u/Arudinne (23 upvotes): “People are using LLMs to expand emails from a few sentences and sending them to people who are using LLMs to reduce that email back to a few sentences. What a time to be alive.”

u/Strassi007 (19 upvotes): “That’s why we forbid AI written replies via company policy.”

u/ucancallmevicky (47 upvotes), citing his nephew at Microsoft: “Told me about 90% of the sales org writes every email with copilot and then has copilot summarize every response. Effectively it is all llm’s talking to one another.”

The Cornell paper says people detect AI text at chance level. The sysadmin thread says they think they can tell. Both can be true at once: humans miss most of it, but the small share they do catch is enough for individual companies to ban AI-written replies in writing, and every shop with that policy is a shop where your AI-flavored outreach is dead before it lands.

A separate r/sysadmin thread on Microsoft Copilot’s terms of service (270 upvotes) shows what corporate IT is doing in parallel: reading the fine print, flagging “Copilot is for entertainment purposes only,” and writing internal policy memos around AI tooling. I wrote separately about Microsoft Copilot’s flex routing default flip, and the pattern is the same: IT departments get more skeptical of AI in email every quarter.

What “AI-sounding” actually looks like in 2026

Dmitry Kobak and colleagues published the cleanest measurement to date in Science Advances, July 2025. They analyzed 15+ million PubMed abstracts from 2010 to 2024 and tracked which words spiked after ChatGPT’s release. The top excess-frequency multipliers in 2024 vs the pre-ChatGPT baseline:

  • “delves”: 28.0x
  • “underscores”: 13.8x
  • “showcasing”: 10.7x

They flagged 379 style words showing this pattern. The estimated share of 2024 abstracts processed with an LLM was at least 13.5%, and up to 40% in some subgroups. The authors note the effect “exceeds the effect of major world events such as the COVID pandemic.” If you want the underlying data, the arXiv preprint is 2406.07016.
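The excess-frequency multiplier behind numbers like “delves: 28.0x” is simple to compute on any corpus of your own. A minimal sketch of the idea (the function name and the toy corpora below are mine, not the paper’s code or data):

```python
from collections import Counter

def excess_frequency(word, baseline_texts, current_texts):
    """Ratio of a word's relative frequency in the current corpus to its
    relative frequency in the baseline corpus (Kobak-style multiplier)."""
    def rel_freq(texts):
        tokens = [t.lower().strip(".,;:!?\"'")
                  for doc in texts for t in doc.split()]
        return Counter(tokens)[word.lower()] / max(len(tokens), 1)
    base, curr = rel_freq(baseline_texts), rel_freq(current_texts)
    # A word absent from the baseline has an unbounded multiplier
    return float("inf") if base == 0 else curr / base
```

Run it over your pre-2023 emails as the baseline and your current AI-assisted drafts as the comparison corpus, and the words with the highest multipliers are your personal tells.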

Liang et al. in Cell Patterns extended this to consumer reviews, corporate communications, government text, and recruitment listings. Same pattern across the board. AI-assisted writing now leaves a measurable fingerprint in nearly every public text corpus on the internet.

Sean Goedecke’s quantitative breakdown is the most-cited engineering write-up on the punctuation side. GPT-4o uses about 10x more em-dashes than GPT-3.5, and GPT-4.1 uses more still. Anthropic and Google models use them less, but still more than the human baseline. The likely cause is that high-quality 19th and early 20th century book training corpora used roughly 30% more em-dashes than modern English. The model inherited the punctuation along with the prose.
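You can measure your own em-dash rate against a human baseline in a few lines. A sketch (the normalization to per-1,000-words is my choice, not Goedecke’s metric):

```python
def em_dash_rate(text):
    """Em-dashes per 1,000 words. Counts the U+2014 character plus the
    '--' ASCII stand-in that many editors auto-convert."""
    words = len(text.split()) or 1
    dashes = text.count("\u2014") + text.count("--")
    return dashes * 1000 / words
```

Score a folder of your pre-AI emails, score your current drafts, and the gap between the two numbers is the drift the filters see.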

Philip Shapira’s March 2024 “delving into delve” analysis showed a 654% increase in PubMed uses of “delve” between 2020 and 2023, from 349 occurrences to roughly 2,847. He posted it as a quick blog observation. Two years later, Kobak and colleagues confirmed it at scale.

The full informal taxonomy of 2026 AI tells lives in a 583-upvote r/ClaudeCode post by u/quang-vybe, the founder of vybe.build. His “humanizer” skill enforces a written list of AI tells:

  • “Em dashes everywhere. Use commas or periods. Em dashes are a strong AI signal in 2026.”
  • Curly quotes (the kind Word and AI default to). “Use straight quotes.”
  • Title-case headings instead of sentence case
  • Inflated importance vocabulary: “stands as,” “testament,” “pivotal,” “underscores,” “reflects broader,” “evolving landscape,” “vibrant,” “seamless,” “breathtaking,” “renowned,” “unlock,” “empower”
  • “-ing endings doing fake-depth work”: “highlighting,” “emphasizing,” “ensuring,” “fostering,” “reflecting,” “contributing,” “reinforcing”
  • Negative parallelism, “not just X but also Y,” overused enough to be a tell
  • Bold for no reason
  • Emojis

That list overlaps with the COPYWRITING.md anti-AI checklist I run before publishing on this site, item for item. It also overlaps with what Microsoft Defender’s blog called “verbose, generic, formal documentation.” The vocabulary is the spam signal.

What the AI-detection tools are claiming

The vendor wars on AI text detection are messy, and the published numbers don’t agree.

Academic work has been more honest. Binoculars (Hans et al., ICML 2024), from Maryland and NYU, reports 90%+ ChatGPT detection at a 0.01% false-positive rate, “without any training data,” using two LLMs to score perplexity ratios. They show 95% accuracy on 512-token news samples. It’s the cleanest published result I’ve seen.

OpenAI’s own classifier was retired on July 20, 2023. Their original announcement, since updated, confessed it correctly identified AI text only 26% of the time and flagged 9% of human text as AI. They called it “low rate of accuracy” and shut it down.

The commercial vendors disagree spectacularly with the academics and with each other. GPTZero claims 99.3% overall accuracy at 0.24% false positive. Pangram claims 99.98% on its 3.0 model. Originality.ai claims 99% with a 0.5% false-positive rate. All three are vendor self-reports. The independent Binoculars paper is closer to the floor of what the academic benchmarks support.

For an SMB owner running outreach, the academic floor matters more than the vendor ceiling. Binoculars-grade detectors hit 90% at near-zero false-positive rates, and inbox filters use overlapping signals to make spam-folder decisions on your campaigns right now.

What this means for cold outreach in 2026

Cold-email reply rates have been falling for three straight years. Instantly’s 2026 benchmark report, built from billions of cold-email interactions across thousands of workspaces from January through December 2025, puts the average reply rate at 3.43%. The top quartile is 5.5%. The very best campaigns are above 10.7%.

Apollo’s 2026 Go-To-Market Effectiveness Report, audited by The Tolly Group across 169 customer accounts and 384 users, lands at a 2.37% email-to-meeting conversion. Their headline: that beats the 0.5–1.5% industry average. Both numbers are dropping year over year.

That decline tracks the timeline of widespread LLM-assisted email. AI didn’t cause all of it (Google and Yahoo’s bulk-sender rules took effect in February 2024 and tightened through April 2024, pushing required SPF, DKIM, DMARC, and one-click unsubscribe), but AI-generated volume is a load-bearing part of the reason your reply rate is half of what it was three years ago.
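Concretely, those bulk-sender rules come down to three DNS records plus two headers. The values below are illustrative only: example.com, the selector, and the key are placeholders that your DNS host and email provider replace with real ones.

```text
# SPF: TXT record on the sending domain, authorizing your mail servers
example.com.                TXT  "v=spf1 include:_spf.yourprovider.com ~all"

# DKIM: public key under a selector your provider assigns (key omitted)
s1._domainkey.example.com.  TXT  "v=DKIM1; k=rsa; p=..."

# DMARC: policy telling receivers what to do when SPF/DKIM checks fail
_dmarc.example.com.         TXT  "v=DMARC1; p=quarantine; rua=mailto:dmarc@example.com"

# One-click unsubscribe (RFC 8058) headers on every bulk message
List-Unsubscribe: <mailto:unsub@example.com>, <https://example.com/unsub?id=abc>
List-Unsubscribe-Post: List-Unsubscribe=One-Click
```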

For a 5–30 person business doing outreach, the math hardens fast. If you run a 3-person marketing agency doing campaigns for clients, the AI-tell penalty is a direct hit on your delivery ceiling. I wrote up the workflow side of this in how a small marketing agency stopped drowning in client email. If you run a real estate office, every email that lands in junk is a lead one of your competitors got and you didn’t, and the follow-up problem is already brutal without giving 8% of your sends to the spam folder.

What to do instead

Four moves I run with clients.

Run a humanizer pass on every draft

Run any AI draft through a humanizer pass before you hit send. The COPYWRITING.md on this site doubles as one. Strip the inflated-importance word list, swap em-dashes for commas or periods, flatten title-case headings to sentence case, and check for the “-ing” depth-faking pattern. Then take a corpus of your own real writing (LinkedIn posts, Slack messages, old emails) and feed it as the style anchor for new generations. A few of the humanizer-style projects on GitHub do exactly this. The output reads like you because the model was conditioned on a corpus of you.
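The mechanical half of that pass is easy to script. A minimal sketch, where the tell-word set is a small excerpt of the checklist, not the full list, and the function deliberately flags vocabulary for a human editor rather than auto-rewriting it:

```python
import re

# Excerpt of the AI-tell vocabulary; extend with your own list
TELL_WORDS = {"delve", "delves", "underscores", "showcasing", "pivotal",
              "seamless", "testament", "fostering", "leveraging"}

def humanize(draft):
    """Mechanical pass: normalize punctuation, then flag (not rewrite)
    remaining AI-tell vocabulary for a human editor."""
    text = re.sub(r"\s*\u2014\s*", ", ", draft)                # em-dash -> comma
    text = text.replace("\u201c", '"').replace("\u201d", '"')  # curly doubles
    text = text.replace("\u2018", "'").replace("\u2019", "'")  # curly singles
    text = re.sub(r" +", " ", text)
    flags = sorted(w for w in TELL_WORDS
                   if re.search(rf"\b{w}\b", text, re.IGNORECASE))
    return text, flags
```

Punctuation is safe to fix automatically; vocabulary isn’t, because the replacement has to sound like you, which is exactly what the style-anchor corpus is for.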

Send shorter

Lavender’s published data says the 25–50 word sweet spot is where opening cold emails do best. The AI default is verbose. Cut every sentence that doesn’t carry the offer or a specific detail about the recipient.

Seed-test before launch

Before any AI-assisted campaign goes live, send the same email to a Gmail, an Outlook, an Apple Mail, and a corporate Microsoft 365 account that you control. Check the spam folder, not just the inbox. If two or more land in junk, kill the campaign and rewrite. This is cheap. Skipping it is expensive.
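The sending half of the seed test is scriptable with the standard library. A sketch, where the seed addresses, SMTP host, and credentials are all placeholders you swap for your own, and the spam-folder check itself stays manual:

```python
import smtplib
from email.message import EmailMessage

# Seed inboxes you control -- placeholders, swap in your own accounts
SEEDS = ["seed@gmail.com", "seed@outlook.com",
         "seed@icloud.com", "seed@yourcorp.onmicrosoft.com"]

def build_seed_messages(subject, body, sender):
    """One identical copy of the campaign email per seed inbox."""
    msgs = []
    for addr in SEEDS:
        msg = EmailMessage()
        msg["From"], msg["To"], msg["Subject"] = sender, addr, subject
        msg.set_content(body)
        msgs.append(msg)
    return msgs

def send_all(msgs, host, user, password):
    """Push through your ESP's SMTP endpoint (host/credentials are yours)."""
    with smtplib.SMTP_SSL(host) as smtp:
        smtp.login(user, password)
        for msg in msgs:
            smtp.send_message(msg)
```

Send, wait ten minutes, then open each seed account and check the junk folder by hand. The script only guarantees the four providers received identical copies; the verdict is in the folders.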

Make one detail prospect-specific

The thing AI is genuinely bad at is the one detail that proves you actually looked at the recipient. A specific deal they closed last month, a line from a recent talk, a bug in their booking flow, anything that’s true and theirs. That’s the part filters and humans both reward, and it’s what raises a 3.43% reply rate to a 10.7% one.

The deeper version of this argument is that the prompt is the product. A vague AI campaign produces vague AI text and gets filtered. A specific, opinionated, well-anchored prompt produces text that doesn’t pattern-match to the AI baseline, and it converts. You pay for the better prompt once. You pay for the worse one in every reply-rate point you give up over the next 12 months.

I wrote about the cost math in AI sales automation for SMBs. The savings are real, but only if you don’t sand off the part that makes outreach work. I also reverse-engineered Claude Design’s 30,000-character system prompt to show what a serious anti-AI-tell guardrail looks like at scale. Both pieces argue the same thing: a style discipline is what turns AI savings into actual reply-rate gains.

Quick FAQ

Should I use AI to write outreach at all? Yes. With a humanizer pass and a personal style anchor. Without those, the math is against you.

Will my prospects notice? Probably not. Cornell’s data says humans guess at chance level. The real risk is the inbox provider rejecting the email before recognition is even possible.

Does this hit 1:1 emails or just bulk? Both. Bulk gets hit harder because filter signals compound across the campaign and across the sending domain’s reputation. A single 1:1 email full of em-dashes and “delves” still scores worse than the same idea written cleanly.

How do I check if my emails are getting filtered? Send the same template to seed accounts on Gmail, Outlook, Apple Mail, and Microsoft 365. Check spam. If two or more land in junk, kill the campaign and rewrite.

What’s the single fastest thing I can change today? Strip em-dashes and replace them with commas or periods. It’s the simplest, most universal AI tell, and it costs nothing to fix.

Want a second opinion on your outreach?

I do free 30-minute discovery calls. Send me your current cold-email template, your reply rate, and your seed-test results. We’ll find what’s actually killing the campaign (the AI tells, the deliverability setup, the targeting, or the offer) and decide what to fix first.

The PubMed numbers, the inbox-filter numbers, and the Reddit numbers all say the same thing. AI is now the default in business writing. The competitive edge moved from “use AI” to “use AI in a way nobody else does.” That’s the harder problem. And the more rewarding one.

Book a free call. I'll tell you exactly what I'd automate first, what hardware you need, and what the whole thing costs. No surprises.