With the internet being crawled to death to feed the AI God, it's becoming seriously annoying to expose web content on the internet. While Anubis works, it's yet another layer of complexity. It might be worth deploying and tuning it for high-profile websites, but for my cgit instance, it's absolutely overkill.
Instead, I'm taking advantage of the matching capabilities of Caddy (whose
documentation doesn't have a search feature‽)
to gate access on either the ability to execute
JavaScript to set a cookie, or having a user-agent string starting with git/,
so that repositories are still cloneable.
git.dustri.org {
    import tls
    import noindex
    import compress

    @unverified {
        not header Cookie *not_a_crawler=1*
        not header User-Agent git/*
    }

    handle @unverified {
        header Content-Type text/html
        respond <<EOF
        <script>
            document.cookie = 'not_a_crawler=1';
            window.location.reload();
        </script>
        EOF 418
    }

    reverse_proxy cgit_upstream
}
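To make the decision in the @unverified matcher explicit, here is a rough Python model of it. This is only a sketch of my reading of the config above, not Caddy's actual implementation: the helper names `header_matches` and `is_unverified` are made up for illustration, and the wildcard handling is a simplified approximation (contains, prefix, suffix, or exact match) that also ignores header-name case-insensitivity.

```python
def header_matches(value, pattern):
    """Simplified stand-in for Caddy's header wildcards: *x*, x*, *x, or exact."""
    if value is None:
        # Header absent: nothing to match against.
        return False
    if len(pattern) >= 2 and pattern.startswith("*") and pattern.endswith("*"):
        return pattern[1:-1] in value
    if pattern.startswith("*"):
        return value.endswith(pattern[1:])
    if pattern.endswith("*"):
        return value.startswith(pattern[:-1])
    return value == pattern


def is_unverified(headers):
    """True when the request would get the JavaScript challenge (HTTP 418)."""
    has_cookie = header_matches(headers.get("Cookie"), "*not_a_crawler=1*")
    is_git = header_matches(headers.get("User-Agent"), "git/*")
    # Both "not" conditions must hold for the request to be challenged.
    return not has_cookie and not is_git


if __name__ == "__main__":
    print(is_unverified({"User-Agent": "Mozilla/5.0"}))
    print(is_unverified({"User-Agent": "git/2.43.0"}))
```

In other words, a request goes through to cgit as soon as it either carries the cookie or claims to be git; everything else gets the challenge page.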
It's not perfect and is trivial to bypass, but it strikes the right balance between simplicity/zero-maintenance and blocking crawlers.