CCBot
CCBot is the crawler for Common Crawl, a non-profit foundation that maintains a public, free archive of the web. That archive is the raw material used to train a huge number of AI models, both commercial and open-source. Being in Common Crawl means being in the knowledge base of much of today's AI ecosystem.
- User-agent
- CCBot, sent as CCBot/2.0 (https://commoncrawl.org/faq/)
- Does it respect robots.txt?
- Yes
- Official documentation
- https://commoncrawl.org/ccbot
How to allow it in your robots.txt
User-agent: CCBot
Allow: /
How to block it (not recommended)
User-agent: CCBot
Disallow: /
Frequently asked questions
Should I block CCBot?
It's not advisable if you're looking for AI visibility. Common Crawl feeds dozens of models at once: blocking CCBot is like erasing yourself from the encyclopedia almost every AI uses to learn.
Does CCBot respect robots.txt?
Yes. A simple Disallow rule for the CCBot user-agent is enough. Common Crawl also publishes its official IP ranges and offers a voluntary opt-out registry, and warns that impostors posing as CCBot exist.
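To confirm that a rule set does what you intend before deploying it, you can run it through a robots.txt parser. The following is a minimal sketch using Python's standard urllib.robotparser; the function name ccbot_allowed and the example.com URL are illustrative assumptions, and this parser's matching may differ in edge cases from how CCBot itself interprets the file.

```python
from urllib.robotparser import RobotFileParser

# Rule sets equivalent to the allow and block examples for CCBot.
ALLOW_RULES = "User-agent: CCBot\nAllow: /\n"
BLOCK_RULES = "User-agent: CCBot\nDisallow: /\n"


def ccbot_allowed(robots_txt: str, url: str = "https://example.com/") -> bool:
    """Return True if the given robots.txt text lets CCBot fetch `url`.

    `url` defaults to a placeholder; pass a real page from your own site.
    """
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network request is made.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch("CCBot", url)


if __name__ == "__main__":
    print("allow rules:", ccbot_allowed(ALLOW_RULES))
    print("block rules:", ccbot_allowed(BLOCK_RULES))
```

To test a live site instead of a string, RobotFileParser also offers set_url() and read(), which download the file before can_fetch() is called.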
How do I know if CCBot visits my site?
Search for "CCBot" in your server logs. Because the user-agent string can be spoofed, verify a visit via reverse DNS: the IP address of a legitimate visit resolves to a hostname under a domain like crawl.commoncrawl.org.
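The log search and the reverse DNS check can be combined in a short script. This is a sketch under stated assumptions, not an official verification tool: it assumes an access log whose first whitespace-separated field is the client IP (as in the common Apache and nginx formats), and it assumes legitimate hostnames are crawl.commoncrawl.org itself or end in .crawl.commoncrawl.org. The function names and the access.log path are hypothetical. It also resolves the hostname forward again, since a reverse DNS record on its own is controlled by whoever owns the IP.

```python
import socket

# Assumed suffix, based on the domain named above; adjust if Common Crawl
# documents a different pattern.
EXPECTED_DOMAIN = "crawl.commoncrawl.org"


def hostname_matches(hostname: str) -> bool:
    """True if the hostname is the expected domain or a subdomain of it."""
    host = hostname.rstrip(".").lower()
    return host == EXPECTED_DOMAIN or host.endswith("." + EXPECTED_DOMAIN)


def ccbot_ips_from_log(lines):
    """Collect client IPs from log lines that mention CCBot.

    Assumes the client IP is the first whitespace-separated field.
    """
    ips = set()
    for line in lines:
        if "CCBot" in line:
            fields = line.split()
            if fields:
                ips.add(fields[0])
    return ips


def verify_ccbot_ip(ip: str) -> bool:
    """Reverse-resolve the IP, check the domain, then confirm forward DNS."""
    try:
        hostname, _, _ = socket.gethostbyaddr(ip)
    except OSError:  # covers socket.herror and socket.gaierror
        return False
    if not hostname_matches(hostname):
        return False
    try:
        forward = {info[4][0] for info in socket.getaddrinfo(hostname, None)}
    except OSError:
        return False
    # The hostname must map back to the same IP to count as verified.
    return ip in forward


if __name__ == "__main__":
    # "access.log" is a placeholder path; point it at your own server log.
    with open("access.log", encoding="utf-8", errors="replace") as log:
        for ip in sorted(ccbot_ips_from_log(log)):
            status = "verified" if verify_ccbot_ip(ip) else "NOT verified"
            print(ip, status)
```

An IP reported as not verified may be an impostor, or it may simply have failed DNS resolution at that moment, so checking it against Common Crawl's published IP ranges is a sensible second step.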