Title: Trivial anti-crawler with Caddy
Date: 2026-06-23 13:45

With the internet being crawled to death to feed the AI God, it's becoming
seriously annoying to expose web content on the internet. While
[anubis](https://anubis.techaro.lol/) works, it's yet another layer of
complexity. It might be worth deploying and tuning it for high-profile
websites, but for my [cgit instance](https://git.dustri.org), it's absolutely
overkill.

Instead, I'm taking advantage of the matching capabilities of
[Caddy](https://caddyserver.com/) (whose
[documentation](https://caddyserver.com/docs/) doesn't have a search feature‽)
to gate access on either the ability to execute JavaScript to set a cookie, or
a user-agent string starting with `git/`, so that repositories are still
cloneable.

```caddy
git.dustri.org {
        import tls
        import noindex
        import compress

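        # Matchers in a named set are ANDed: a request is unverified when it
        # has neither the cookie nor a git user-agent.
        @unverified {
                not header Cookie *not_a_crawler=1*
                not header User-Agent git/*
        }
        # Serve a tiny page that sets the cookie, then reloads.
        handle @unverified {
                header Content-Type text/html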
                respond <<EOF
                    <script>
                    document.cookie = 'not_a_crawler=1';
                    window.location.reload();
                    </script>
                EOF 418
        }

        reverse_proxy cgit_upstream
}
```
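
To make the matcher's logic explicit, here is a rough Python model of the decision. It's only an illustration: the `is_unverified` function and the plain-dict headers are made up for this sketch, and Caddy's real header matching (case-insensitive field names, multiple values per header) is richer than this.

```python
def is_unverified(headers: dict[str, str]) -> bool:
    """Hypothetical model of the @unverified matcher, not Caddy's actual code.

    Both conditions must hold for the challenge page to be served:
    the cookie is absent AND the user-agent does not start with "git/".
    """
    cookie = headers.get("Cookie", "")
    user_agent = headers.get("User-Agent", "")

    # `*not_a_crawler=1*` is a substring match on the Cookie header.
    has_cookie = "not_a_crawler=1" in cookie
    # `git/*` is a prefix match on the User-Agent header.
    is_git = user_agent.startswith("git/")

    return not has_cookie and not is_git


if __name__ == "__main__":
    # First visit from a browser or a crawler: gets the challenge page.
    print(is_unverified({"User-Agent": "Mozilla/5.0"}))
    # Browser after the script ran and the page reloaded: proxied to cgit.
    print(is_unverified({"User-Agent": "Mozilla/5.0", "Cookie": "not_a_crawler=1"}))
    # git clone/fetch: proxied to cgit without any cookie.
    print(is_unverified({"User-Agent": "git/2.45.0"}))
```

Any client that sends the cookie or a `git/` user-agent gets through, whether or not it ever ran the script.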

It's not perfect and it's trivial to bypass, but it strikes the right balance
between simplicity/zero-maintenance and blocking crawlers.

