img2dataset

Romain Beaumont · Training

img2dataset is an open-source tool by Romain Beaumont that automatically downloads millions of images from the web. It was central to building LAION-400M and LAION-5B, two of the largest image datasets ever assembled, which were used to train models like Stable Diffusion. If it crawls your site, your images could end up in training data for AI image models. It has no direct impact on whether conversational AI tools like ChatGPT or Gemini mention your business.

User-agent
Token: img2dataset
Full string: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:72.0) Gecko/20100101 Firefox/72.0 (compatible; img2dataset; +https://github.com/rom1504/img2dataset)

Does it respect robots.txt?
Partially

How to allow it in your robots.txt

User-agent: img2dataset
Allow: /

How to block it (not recommended)

User-agent: img2dataset
Disallow: /
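
To sanity-check either rule set before publishing it, one option is to run it through Python's standard-library robots.txt parser. The sketch below is only an illustration: the helper name is_allowed and the example.com URL are assumptions, and it shows how a compliant parser reads the rules, not whether img2dataset itself obeys them.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if a compliant parser would let user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# The two rule sets shown above, as plain strings.
ALLOW_RULES = "User-agent: img2dataset\nAllow: /\n"
BLOCK_RULES = "User-agent: img2dataset\nDisallow: /\n"

# Hypothetical image URL, used only for the check.
IMAGE_URL = "https://example.com/images/photo.jpg"

for name, rules in (("allow", ALLOW_RULES), ("block", BLOCK_RULES)):
    result = is_allowed(rules, "img2dataset", IMAGE_URL)
    print(f"{name} rules -> img2dataset may fetch: {result}")
```

Pass the short token ("img2dataset") rather than the full user-agent string: this parser matches on the part of the string before the first slash, so the full Mozilla-style string would not match the rule.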

Frequently asked questions

Should I block img2dataset?

It depends on whether you want your images used to train AI models. Allowing it won't improve your chances of being mentioned by ChatGPT or similar tools — it only means your photos could feed into image-generation datasets. If you'd rather keep your images out of those datasets, blocking it is a reasonable call.

Does img2dataset affect my AI visibility?

Not directly. This crawler feeds image datasets, not the language models behind conversational AI assistants. Letting it through won't make ChatGPT or Perplexity recommend your business more often when someone asks about what you offer.

How do I know if img2dataset is crawling my site?

Check your server logs for requests whose user-agent contains "img2dataset", or more specifically the string "(compatible; img2dataset;". If you find such requests and want them to stop, add a disallow rule for the "img2dataset" token in your robots.txt file. The crawler only partially respects it, and its compliance with robots.txt is not explicitly documented.
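
The log check can be sketched in a few lines of Python. This is a minimal example that assumes an access log in the common "combined" format, with the client IP as the first field; the function name count_img2dataset_hits and the sample log lines are made up for illustration.

```python
import re
from collections import Counter
from typing import Iterable

# Case-insensitive match on the crawler token anywhere in the log line.
TOKEN = re.compile(r"img2dataset", re.IGNORECASE)


def count_img2dataset_hits(log_lines: Iterable[str]) -> Counter:
    """Count matching requests per client IP (assumes the IP is the first field)."""
    hits: Counter = Counter()
    for line in log_lines:
        if TOKEN.search(line):
            client_ip = line.split(" ", 1)[0]
            hits[client_ip] += 1
    return hits


# Hypothetical sample lines; real logs will differ.
SAMPLE_LOG = [
    '203.0.113.7 - - [01/Jan/2024:00:00:01 +0000] "GET /img/a.jpg HTTP/1.1" 200 512 "-" '
    '"Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:72.0) Gecko/20100101 Firefox/72.0 '
    '(compatible; img2dataset; +https://github.com/rom1504/img2dataset)"',
    '198.51.100.4 - - [01/Jan/2024:00:00:02 +0000] "GET / HTTP/1.1" 200 1024 "-" '
    '"Mozilla/5.0 (Windows NT 10.0; Win64; x64)"',
]

hits = count_img2dataset_hits(SAMPLE_LOG)
for ip, count in hits.most_common():
    print(ip, count)
```

To run it against a real file, pass an open file object instead of the sample list, for example count_img2dataset_hits(open("access.log")), where the path is whatever your server uses.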
