AI bots & crawlers

These are the robots that read your website to feed ChatGPT, Gemini, Perplexity and friends. Block them and your business can disappear from their answers.

58 crawlers

OpenAI · Training
GPTBot

Collects public content to train OpenAI's AI models, such as the ones behind ChatGPT.

OpenAI · User action
ChatGPT-User

Visits your site live when a ChatGPT user asks for information that requires checking your page.

OpenAI · Search
OAI-SearchBot

Indexes websites so they can appear as results and links in ChatGPT search.
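The three OpenAI entries above show the pattern the rest of this list follows: one vendor, several user-agent tokens, each with a different purpose. A minimal sketch of mapping a request's User-Agent header to its category might look like the following. The function name and the sample header are hypothetical, and a User-Agent header can be spoofed, so treat a match as a hint rather than proof.

```python
from typing import Optional

# Tokens and categories copied from the three OpenAI entries above.
OPENAI_BOTS = {
    "GPTBot": "Training",
    "ChatGPT-User": "User action",
    "OAI-SearchBot": "Search",
}


def classify_user_agent(user_agent: str) -> Optional[str]:
    """Return the category of the first known token found in the header, or None."""
    ua = user_agent.lower()
    for token, category in OPENAI_BOTS.items():
        # Case-insensitive substring match on the token.
        if token.lower() in ua:
            return category
    return None


# Hypothetical header value, for illustration only.
sample = "Mozilla/5.0 (compatible; GPTBot/1.0)"
print(classify_user_agent(sample))
```

The same dictionary can be extended with any other token from this list.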

Anthropic · Training
ClaudeBot

Collects public content to train and improve Anthropic's Claude models.

Anthropic · User action
Claude-User

Accesses your site live when a Claude user asks a question that requires checking it.

Anthropic · Search
Claude-SearchBot

Indexes web content to improve the quality and relevance of Claude's search results.

Google · Mixed
Googlebot

Crawls the web for Google Search, including AI Overviews.

Google · Training
Google-Extended

A robots.txt control token that decides whether your content can be used to train Google's Gemini models.
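Because Google-Extended is a robots.txt token rather than a separate crawler, opting out of training does not have to affect Googlebot. The sketch below uses Python's standard `urllib.robotparser` to check a sample rule set. The robots.txt content and the `example.com` URL are assumptions for illustration, and Python's parser only approximates how Google itself evaluates the file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: opt out of Gemini training, keep everything else open.
ROBOTS_TXT = """\
User-agent: Google-Extended
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
# parse() takes the file content as a list of lines, so no network request is made.
parser.parse(ROBOTS_TXT.splitlines())

url = "https://example.com/pricing"  # placeholder URL
training_allowed = parser.can_fetch("Google-Extended", url)
search_allowed = parser.can_fetch("Googlebot", url)

print("Google-Extended allowed:", training_allowed)
print("Googlebot allowed:", search_allowed)
```

With these rules the training token is disallowed while Googlebot falls through to the `*` group and stays allowed.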

Google · Mixed
GoogleOther

A general-purpose crawler Google's product teams use to download public content for various purposes, including AI research and development.

Perplexity · Search
PerplexityBot

Indexes websites to show and link them in Perplexity search results.

Perplexity · User action
Perplexity-User

Visits your site live when a user asks a question on Perplexity that requires checking your page.

Microsoft · Mixed
Bingbot

Crawls the web for the Bing search engine and also feeds Microsoft Copilot's answers.

Meta · Training
meta-externalagent

Crawls the web to train Meta's AI models (Llama) and to index content for its products.

Meta · User action
meta-externalfetcher

Downloads specific links requested by users and assists Meta AI's agentic capabilities.

Apple · Search
Applebot

Crawls the web for Apple's search features: Siri, Spotlight and Safari suggestions.

Apple · Training
Applebot-Extended

A control token that decides whether Apple can use your content to train Apple Intelligence models.

Amazon · Mixed
Amazonbot

Crawls the web to improve Amazon products and services, such as Alexa's answers, and can be used to train its AI models.

ByteDance · Training
Bytespider

Collects web content at scale to train AI models for ByteDance, the company behind TikTok and Doubao.

Common Crawl · Training
CCBot

Builds an open archive of the web that serves as training data for many AI models.

Cohere · Training
cohere-training-data-crawler

Collects public web content to train Cohere's language models, aimed at businesses.

Mistral AI · User action
MistralAI-User

Visits your site when a user of Vibe, Mistral's assistant, asks a question that requires checking it.

DuckDuckGo · Search
DuckAssistBot

Collects content for DuckAssist, DuckDuckGo's AI-generated answers.

You.com · Search
YouBot

Discovers and indexes web pages so You.com's search engine and AI assistants can give up-to-date answers.

Diffbot · Mixed
Diffbot

Extracts structured data from web pages to build a knowledge base used by businesses and AI systems.

Allen Institute for AI · Training
AI2Bot

Collects web documents to build open datasets used to train and evaluate Ai2's language models.

xAI · Mixed
GrokBot

Retrieves web content for Grok, xAI's AI assistant built into X (formerly Twitter).

Huawei · Mixed
PetalBot

Crawls the web for Petal Search, Huawei's search engine, and its ecosystem services.

Hive · Mixed
ImagesiftBot

Collects public images and their context for Hive's web intelligence and AI products.

Google · User action
Google-NotebookLM

Downloads the content of a URL when a user of NotebookLM, Google's AI-powered note-taking app, manually adds it as a source in their project.

Google · User action
Google-Read-Aloud

Powers Google's Read Aloud feature: when a user asks to have a web page read to them, this bot fetches the text to convert it into audio.

Google · Mixed
Google-CloudVertexBot

Indexes your website content to power AI agents and enterprise search applications built on Google Cloud Vertex AI.

Google · User action
Google-InspectionTool

Fetches your page on demand when you actively use the URL Inspection tool or the Rich Results Test in Google Search Console.

Google · User action
Google-Agent

Browses the web on behalf of real users so Google's AI agent (Project Mariner) can carry out specific tasks, such as finding information or taking actions on web pages.

Microsoft · Search
msnbot

Legacy crawler that built the Bing (and formerly MSN Search) index, helping sites appear in Microsoft's search results.

Amazon · Search
Amzn-SearchBot

Crawls public web content to improve search relevance across Amazon products, including Alexa.

Meta · Mixed
facebookexternalhit

Generates the link preview when someone shares your website on Facebook, Instagram, or Messenger.

Meta · Training
FacebookBot

Downloads public web content to build training datasets for Meta's voice recognition technology.

Meta · Mixed
meta-externalads

Crawls external pages to review Meta ads and generate the link previews displayed on Facebook, Instagram, and WhatsApp.

Baidu · Search
Baiduspider

Crawls public web pages to index them in Baidu, China's largest search engine.

Moonshot AI · Training
KimiBot

Crawls public web content to train Moonshot AI's foundation models, which power the Kimi assistant.

Moonshot AI · User action
Kimi-User

Fetches your web content on demand when a Kimi user asks for real-time information or to summarise an article.

Moonshot AI · Search
Kimi-SearchBot

Analyses web pages for relevance to build the search index that powers Kimi, Moonshot AI's assistant.

Mistral AI · Search
MistralAI-Index

Crawls public web pages to index content that powers the search engine inside Mistral's Vibe platform.

OpenAI · Mixed
OAI-AdsBot

Validates that landing pages submitted for ChatGPT Ads comply with OpenAI's advertising policies and are relevant to the ad.

Webz.io · Training
webzio-extended

Validates which publicly available web content can legally be used to train AI models, labeling it for Webz.io's data pipeline clients.

Webz.io · Mixed
omgilibot

Crawls news sites, forums and user-generated content to power Webz.io's data API, sold to media monitoring platforms and AI training companies alike.

Velen.io · Training
VelenPublicWebCrawler

Crawls public websites to build business datasets and train the machine learning models that power Hunter.io.

Bright Data · Mixed
Brightbot

Collects public web data on behalf of business clients that need price monitoring, competitive intelligence, or commercial databases.

Semrush · Mixed
SemrushBot-OCOB

Crawls public web content to power Semrush's Content Toolkit, an AI-assisted content marketing and SEO suite.

Yandex · Search
YandexAdditionalBot

Crawls pages already indexed by Yandex to use as real-time sources for YandexGPT, Yandex's AI-powered search feature.

Timpi · Search
Timpibot

Crawls public websites to build the index for Timpi's decentralised search engine; that index may also be used to train AI models.

Andi · Search
Andibot

Crawls public websites to power Andi's generative search results, which answer questions with AI-generated summaries instead of a traditional list of links.

iAsk · Search
iaskspider

Crawls public web pages to build the index powering iAsk, an AI-based answer search engine.

Phind · Search
PhindBot

Crawls websites on demand to power Phind's AI-based search engine, which is aimed at developers and technical users.

Romain Beaumont · Training
img2dataset

Downloads images from public websites at scale to build datasets for training AI vision and image-generation models.

Huawei · Training
PanguBot

Downloads public web content to train PanGu, Huawei's multimodal AI model.

Research Organization of Information and Systems · Training
Cotoyogi

Collects publicly available web content in Japanese to build training datasets for AI models.

Awario · Mixed
AwarioBot

Crawls public websites to collect brand and keyword mentions on behalf of Awario's monitoring customers.

Do you know if these bots already read your site and what they say about you? Run the free test.
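For the first half of that question, a rough answer is often already sitting in your server's access log. The sketch below counts log lines that mention a few of the tokens listed above. The token selection, the function name, and the sample log lines are illustrative assumptions, and it says nothing about what the assistants actually answer about your site.

```python
from collections import Counter
from typing import Iterable

# A few tokens from the list above; extend as needed.
TOKENS = ["GPTBot", "ClaudeBot", "PerplexityBot", "CCBot"]


def count_bot_hits(log_lines: Iterable[str]) -> Counter:
    """Count log lines that mention each known token (case-insensitive)."""
    hits: Counter = Counter()
    for line in log_lines:
        lowered = line.lower()
        for token in TOKENS:
            if token.lower() in lowered:
                hits[token] += 1
                break  # count each line once
    return hits


# Made-up log lines in a common access-log shape, for illustration only.
sample_log = [
    '203.0.113.5 - - [01/Jan/2025:00:00:01 +0000] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
    '203.0.113.6 - - [01/Jan/2025:00:00:02 +0000] "GET /about HTTP/1.1" 200 734 "-" "Mozilla/5.0 (compatible; ClaudeBot/1.0)"',
    '203.0.113.7 - - [01/Jan/2025:00:00:03 +0000] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0 Firefox/120.0"',
    '203.0.113.5 - - [01/Jan/2025:00:00:04 +0000] "GET /pricing HTTP/1.1" 200 901 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
]

for token, count in count_bot_hits(sample_log).most_common():
    print(token, count)
```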
