Disallows & The Future of AI

TL;DR – Robots.txt disallows could become a big deal in the generative AI era. Tools like ChatGPT rely on crawling the web, like a search engine, to improve their models. But more and more sites are blocking them with disallows.

Why Robots.txt directives are important for AI

Results from ChatGPT and other generative AI tools might look really sad and confused in the near future. Content creators and publishers are taking action to protect their content from being used by generative AI. While headlines might focus on lawsuits, the real war is being waged in text files. That’s right, a few lines of text added to an obscure file on websites could decide the fate of results on tools like ChatGPT.

These text files are known as robots.txt files and the rules they contain, known as directives, could impact generative AI in monumental ways. More sites are adding robots.txt disallows for ChatGPT and Google Bard. And the utility of generative AI tools and experiences could diminish greatly if they can’t scrape content from popular websites.

For generative AI responses to be useful, the models behind them need tons of training data. For training data to be timely and accurate, much of it needs to come from scraping websites. If popular websites block ChatGPT, its responses will resemble those of the beta, when the model was restricted to data from before September 2021, or worse.

Recapping how to block AI bots

If you want to prevent your content from showing up in generative AI tools like ChatGPT, you should start by blocking their bots from crawling your website and scraping your content. As you might remember from How to Block AI from Scraping Your Content, robots.txt directives are one of the primary ways to block AI crawlers.

Below are the robots.txt directives needed to block GPTBot (OpenAI’s crawler, used for ChatGPT) and Google-Extended (Google’s token for Bard):

User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /
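
If you want to confirm that the directives above do what you expect, a minimal Python sketch using the standard library's urllib.robotparser can help. The helper name is_allowed and the example.com URL are placeholders of mine, not part of any tool mentioned here. Note that this only tests how a parser matches the user-agent names; it assumes the bot identifies itself by that exact token.

```python
from urllib.robotparser import RobotFileParser

# The directives from this article, as they would appear in robots.txt.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url.

    Hypothetical helper for illustration only.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://example.com/some-article"  # placeholder URL
    for agent in ("GPTBot", "Google-Extended", "Googlebot"):
        print(agent, "allowed:", is_allowed(ROBOTS_TXT, agent, url))
```

With only these two groups in the file, any agent that is not named (Googlebot, for example) has no matching rule and stays allowed, which is why these directives should not affect regular search crawling.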

Popular websites blocking ChatGPT and/or Bard

Some of the most popular websites I’ve noticed blocking ChatGPT and/or Google Bard with robots.txt directives are:

Facebook (blocking GPTBot & Google-Extended)
Amazon.com (blocking GPTBot)
NYTimes.com (blocking GPTBot & Google-Extended)
ESPN.com (blocking GPTBot & Google-Extended)
Pinterest.com (blocking GPTBot)
BBC.com (blocking GPTBot)
Lego.com (blocking GPTBot)
LATimes.com (blocking GPTBot)
TheVerge.com (blocking GPTBot & Google-Extended)
Freep.com (blocking GPTBot & Google-Extended)

Robots.txt could become a hot topic

Googlebot and other crawlers follow robots.txt directives on an honor system. Historically this has rarely been a problem, but that could change with the rise of generative AI. If robots.txt directives remain one of the only methods of blocking the most popular generative AI content scrapers, robots.txt could become a hot topic for legal experts. Robots.txt compliance might need to be compulsory if other reliable methods for “opting out of AI” aren’t established. Meanwhile, it will be interesting to see if robots.txt directives are brought up in legal battles, like the lawsuit the NY Times filed against OpenAI and Microsoft.
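
To make the honor system concrete, here is a minimal Python sketch of what a compliant crawler does: it reads robots.txt and decides for itself whether to proceed. The function name polite_fetch and the injected fetch callable are hypothetical, used only for illustration. This is not how any particular crawler is implemented.

```python
from typing import Callable, Optional
from urllib.robotparser import RobotFileParser


def polite_fetch(
    url: str,
    user_agent: str,
    robots_txt: str,
    fetch: Callable[[str], str],
) -> Optional[str]:
    """Fetch url only if robots_txt allows user_agent to do so.

    `fetch` is whatever function downloads a page; it is passed in so the
    sketch stays self-contained and makes no network calls on its own.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    if not parser.can_fetch(user_agent, url):
        # The crawler chooses to skip the page. Nothing on the server
        # side enforces this decision.
        return None
    return fetch(url)


if __name__ == "__main__":
    rules = "User-agent: GPTBot\nDisallow: /\n"

    def fake_fetch(url: str) -> str:
        # Stand-in for a real HTTP request (assumption for the demo).
        return "page body for " + url

    page = "https://example.com/post"  # placeholder URL
    print(polite_fetch(page, "GPTBot", rules, fake_fetch))
    print(polite_fetch(page, "Googlebot", rules, fake_fetch))
```

The check lives entirely in the crawler's own code. A scraper that skips the can_fetch call would download the page anyway, which is why compliance is voluntary.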

SGE, Circle to Search & AI-powered insights for multisearch

It’s worth pointing out that Google SGE, Circle to Search, and Google’s AI-powered insights for multisearch won’t be impacted by the directives referenced above. Google hasn’t offered a way to block SGE other than blocking Googlebot from crawling your website. And it’s unlikely that many webmasters or companies will want to part with their organic search traffic in exchange for blocking these generative AI experiences. I suspect this could change quickly if Circle to Search is expanded to more devices, multisearch use increases, and/or SGE (or an equivalent) goes live. Either way, it would be nice for Google to offer a way to opt out of these AI search experiences independent of Googlebot.

Only time will tell what direction Google takes with AI-powered search results. Meanwhile, I suspect the utility of ChatGPT and Google Bard will plummet as more sites block their crawlers.