AI companies have been sending crawlers to scrape every website they can find, feeding the content into training datasets whether site owners want it or not. The polite ones respect robots.txt. Most don't.

So I set a trap.

What's a tarpit?

A tarpit is a honeypot that doesn't block bots — it wastes their time instead. Rather than returning a 403 and letting the crawler move on in milliseconds, a tarpit serves an endless stream of content as slowly as possible, tying up the bot's connection for as long as it'll sit there.
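The core mechanism is small enough to sketch. Here's a toy version in Python, not Nepenthes itself: every request gets a 200 and then an endless drip of filler, one chunk every couple of seconds. The port, the chunk contents, and the 2-second delay are arbitrary choices for illustration.

```python
# Toy tarpit, not Nepenthes: answer every request, then drip filler forever.
import time
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

class TarpitHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        # Deliberately no Content-Length: the client can't tell how much
        # is coming, so it keeps reading until it gives up.
        self.end_headers()
        try:
            while True:
                self.wfile.write(b"<p>nothing to see here</p>\n")
                self.wfile.flush()
                time.sleep(2)  # the delay is the whole point
        except (BrokenPipeError, ConnectionResetError):
            pass  # the crawler finally hung up

# Threading, so one trapped bot doesn't block the next victim.
ThreadingHTTPServer(("", 8080), TarpitHandler).serve_forever()
```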

This site is running Nepenthes, an AI-specific tarpit. Any request to /blog/ that looks like it came from a scraper gets handed off to it. Nepenthes responds with a page of randomly generated nonsense (Markov chain babble trained on public domain books) plus a maze of links pointing to more fake pages. The content is delivered at a trickle, 4–25 seconds per response, so the crawler spends real CPU time and bandwidth receiving text that is completely worthless.
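To give a feel for the generation side, here's a rough Python approximation of the two tricks combined. This is my sketch of the idea, not Nepenthes' actual code; the chain order, the number of links per page, and the URL scheme are all assumptions.

```python
# Sketch: Markov babble plus a link maze. Not Nepenthes' implementation.
import random
from collections import defaultdict

def build_chain(text, order=2):
    """Map each run of `order` consecutive words to the words that follow it."""
    words = text.split()
    chain = defaultdict(list)
    for i in range(len(words) - order):
        chain[tuple(words[i:i + order])].append(words[i + order])
    return chain

def babble(chain, length=80):
    """Random-walk the chain to produce grammatical-ish gibberish."""
    state = random.choice(list(chain))
    out = list(state)
    for _ in range(length):
        nxt = random.choice(chain.get(state, [random.choice(out)]))
        out.append(nxt)
        state = (*state[1:], nxt)
    return " ".join(out)

def fake_page(chain):
    """One tarpit page: babble plus links deeper into the maze."""
    links = "".join(
        f'<a href="/blog/{random.getrandbits(32):08x}/">more</a> '
        for _ in range(8)
    )
    return f"<html><body><p>{babble(chain)}</p><p>{links}</p></body></html>"

corpus = open("pride_and_prejudice.txt").read()  # any public domain text
print(fake_page(build_chain(corpus)))
```

Every page a crawler fetches hands it eight more URLs it has never seen, so its frontier grows faster than it can drain it.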

How it works here

Nginx proxies /blog/ to Nepenthes running in a Docker container. The fake pages are seeded from a corpus of Project Gutenberg texts (Pride and Prejudice, Frankenstein, Sherlock Holmes) blended together into convincing-looking gibberish. Every page links to several others, creating a maze the crawler can wander indefinitely.
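Stripped down, the nginx side looks something like this. Treat it as a sketch: the user-agent patterns are a few well-known AI crawlers, and the port is a placeholder for wherever your Nepenthes container is listening.

```nginx
# Sketch only; the map block belongs at the http{} level of the config.
map $http_user_agent $ai_bot {
    default        0;
    ~*GPTBot       1;  # OpenAI
    ~*ClaudeBot    1;  # Anthropic
    ~*CCBot        1;  # Common Crawl
    ~*Bytespider   1;  # ByteDance
}

server {
    listen 80;

    # Real posts match the more specific prefix and are served normally.
    location /blog/posts/ {
        root /var/www/site;
    }

    location /blog/ {
        if ($ai_bot) {
            proxy_pass http://127.0.0.1:8893;  # the Nepenthes container
        }
        # ...normal handling for everyone else...
    }
}
```

A map keeps the bot check cheap and makes adding a newly discovered crawler a one-line change.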

The delay is the real weapon. If a crawler holds 10 connections open and each response takes 20 seconds, it burns 200 connection-seconds to retrieve nothing, and keeps burning them for as long as it stays in the maze. At scale, across many sites running tarpits, this meaningfully raises the cost of indiscriminate scraping.
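As a script you can tweak (same illustrative numbers as above, not measurements):

```python
# Back-of-envelope cost model for one trapped crawler.
connections = 10   # concurrent connections the crawler keeps open
delay_s = 20       # seconds the tarpit takes per response
hours = 1

junk_pages = connections * (3600 * hours) / delay_s
conn_seconds = connections * 3600 * hours

print(f"{junk_pages:.0f} worthless pages fetched")   # 1800
print(f"{conn_seconds} connection-seconds tied up")  # 36000
```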

What it catches

From the first few hours of running:

  • Bots hitting /blog/ receive the tarpit, not this site's actual content
  • Each connection is held open for up to 25 seconds
  • The generated text goes straight into the void — or, if they're not careful, into a training dataset full of Victorian novel slurry

The real posts (like this one) live at /blog/posts/ and are served normally. The tarpit is what you get if you wander in without reading.

Monitoring

I hooked up Prometheus and Grafana to track it — hits, unique IPs, bytes wasted, total delay inflicted. Watching the numbers tick up is genuinely satisfying.
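The wiring will vary by setup; one simple approach is a small exporter that tails the nginx access log and exposes counters for Prometheus to scrape. Here's a sketch along those lines. The log path, the port, and the assumption of nginx's default combined log format are all placeholders.

```python
# Sketch of a log-tailing Prometheus exporter for tarpit traffic.
import time
from prometheus_client import Counter, Gauge, start_http_server

HITS = Counter("tarpit_hits_total", "Requests served by the tarpit")
BYTES = Counter("tarpit_bytes_total", "Bytes of babble delivered")
UNIQUE = Gauge("tarpit_unique_ips", "Distinct client IPs seen so far")

def follow(path):
    """Yield lines as they are appended to the file, tail -f style."""
    with open(path) as f:
        f.seek(0, 2)  # jump to the end; only new traffic counts
        while True:
            line = f.readline()
            if line:
                yield line
            else:
                time.sleep(1)

start_http_server(9101)  # Prometheus scrapes http://host:9101/metrics
seen = set()
for line in follow("/var/log/nginx/access.log"):
    fields = line.split()
    # Combined format: ip - user [time] "method path proto" status bytes ...
    if len(fields) > 9 and "/blog/" in fields[6] and "/blog/posts/" not in fields[6]:
        HITS.inc()
        BYTES.inc(int(fields[9]))  # $body_bytes_sent
        seen.add(fields[0])        # $remote_addr
        UNIQUE.set(len(seen))
```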

If you're running your own site and are fed up with your content being scraped without consent, Nepenthes is worth a look. The setup is a single Docker container and a few lines of nginx config.