Diffbot
Diffbot uses computer vision and artificial intelligence to turn web pages into structured data (products, articles, organizations) and maintain a huge knowledge base of the web. Many companies and AI applications consume that data. If your business is listed there with accurate data, that information later flows into tools and assistants that reuse it.
- User-agent
- Diffbot
- Mozilla/5.0 (compatible; Diffbot/0.1; +http://www.diffbot.com/our-apis/crawler/)
- Does it respect robots.txt?
- Partially
- Official documentation
- https://docs.diffbot.com/docs/does-crawl-respect-robotstxt
How to allow it in your robots.txt
User-agent: Diffbot
Allow: /
How to block it (not recommended)
User-agent: Diffbot
Disallow: /
Frequently asked questions
Should I block Diffbot?
It's not advisable. Its structured data ends up in business tools and AI systems that can showcase your business. Being well represented in its knowledge base works in your favor.
Does Diffbot respect robots.txt?
Partially. According to its official documentation, its bulk crawls (Crawl) respect robots.txt, including disallow and crawl-delay directives. However, extractions of specific URLs requested by customers can be processed even if a block is in place.
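To preview how rules like the ones above would be read, here is a minimal sketch using Python's standard urllib.robotparser. It only shows how that parser interprets the directives; it is not Diffbot's own logic, and it says nothing about customer-requested extractions. The sample URL, the crawl-delay value, and the helper name check_diffbot are assumptions made for illustration.

```python
from typing import Optional, Tuple
from urllib.robotparser import RobotFileParser

# Hypothetical page used only for this check.
SAMPLE_URL = "https://example.com/products/1"

# The two rule sets shown earlier, plus an assumed crawl-delay variant.
ALLOW_RULES = "User-agent: Diffbot\nAllow: /\n"
BLOCK_RULES = "User-agent: Diffbot\nDisallow: /\n"
DELAY_RULES = "User-agent: Diffbot\nCrawl-delay: 5\nAllow: /\n"


def check_diffbot(robots_txt: str, url: str = SAMPLE_URL) -> Tuple[bool, Optional[int]]:
    """Return (allowed, crawl_delay) for the Diffbot token under these rules."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network request is made.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch("Diffbot", url), parser.crawl_delay("Diffbot")


if __name__ == "__main__":
    print(check_diffbot(ALLOW_RULES))  # (True, None)
    print(check_diffbot(BLOCK_RULES))  # (False, None)
    print(check_diffbot(DELAY_RULES))  # (True, 5)
```

Passing the rules as text keeps the check offline, so you can test a draft robots.txt before publishing it.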
How do I know if Diffbot visits my site?
Search for "Diffbot" in your server logs. Its user-agent string includes a link to its crawler documentation, which makes its requests easy to recognize.
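The log search can be sketched in a few lines of Python. The sample entries below are invented for illustration (combined log format with documentation-range IP addresses), and find_diffbot_hits is a hypothetical helper name; adapt the input to wherever your server actually writes its access log.

```python
from typing import Iterable, List


def find_diffbot_hits(log_lines: Iterable[str]) -> List[str]:
    """Return the log lines whose text mentions Diffbot (case-insensitive)."""
    return [line.rstrip("\n") for line in log_lines if "diffbot" in line.lower()]


# Invented sample entries, not real traffic.
SAMPLE_LOG = [
    '203.0.113.7 - - [01/Jan/2025:10:00:00 +0000] "GET /products/1 HTTP/1.1" 200 512 '
    '"-" "Mozilla/5.0 (compatible; Diffbot/0.1; +http://www.diffbot.com/our-apis/crawler/)"',
    '198.51.100.4 - - [01/Jan/2025:10:00:05 +0000] "GET / HTTP/1.1" 200 1024 '
    '"-" "Mozilla/5.0 (X11; Linux x86_64)"',
]

if __name__ == "__main__":
    for hit in find_diffbot_hits(SAMPLE_LOG):
        print(hit)
```

To scan a real file, pass an open file object instead of the sample list, since the function accepts any iterable of lines.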