How to Block AI from Scraping Your Content

TL;DR – you can’t completely block your content from being used to train AI models, but robots.txt directives and meta tags can limit it.

Your content is probably training AI

If you create and share any content on the internet, it’s probably being used to train generative AI. If it isn’t already, it will be soon. Sure, you can delete your ChatGPT account or opt out of having your data used to train models. But if you’re a social media user, chances are your content will still be used to train other AI models.

X, formerly Twitter, announced in early September 2023 that it would soon be using user posts to train AI. Later the same month, a Meta spokesperson stated that its new AI assistant was trained using public Facebook and Instagram posts. Meta does offer the ability to delete your personal data from its generative AI models, but not every company or platform offers similar options. As of this writing, it’s not exactly clear where Snapchat and TikTok stand. More importantly, it’s not clear that opt-out options will be standardized anytime soon.

Other publishing platforms, like Medium, have started an open dialog with users to try to establish rules for how generative AI should be used. It’s likely this approach will be the exception and not the rule. In general, it’s probably safe to assume your online content is being used to train generative AI.

Your last bastion against AI

Your website, whether it’s a personal blog or a business site, might be your content’s last bastion against generative AI on the web. Blocking AI from scraping your website isn’t foolproof, but I have some tips that can help limit it. I’m going to show you how to block the top generative AI platforms from using your website content as training data (as best I can).

Crawlers & robots.txt

First, let’s do a quick refresher on how bots crawl websites and the rules they’re supposed to follow. Spiders, bots, and crawlers are all names for the same thing: a program used to collect information about (or scrape) web pages.

Robots.txt is a file you can host on your website (it should live at the root of each domain, e.g., example.com/robots.txt) containing a set of rules for these bots to follow. It’s important to note that following these rules is entirely optional and done only on a good-faith basis. Adding a disallow directive to a robots.txt file tells bots they shouldn’t crawl a website, or specific paths on it. A disallow directive begins with:

Disallow:

And is followed by the URL path being blocked. To block all your content from being crawled, simply use a slash, like this:

Disallow: /
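
You can also block a specific path instead of the whole site. For example (using a hypothetical /private/ directory):

Disallow: /private/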

So how do you give instructions to specific bots?

What’s a user-agent?

A user-agent is sort of like an ID card for different types of software and bots on the internet. For example, Google’s Chrome browser has its own user-agent (or UA for short). Identifying the UAs of generative AI programs and adding disallow directives for them to your website’s robots.txt is the best way to block most of them.

Directives can be specified for individual bots based on their UA. Simply use an asterisk to apply your directives to all UAs, like so:

User-agent: *
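
Putting user-agents and disallow directives together, a robots.txt file can hold separate groups of rules for different bots. A quick sketch (ExampleBot is a placeholder name):

User-agent: ExampleBot
Disallow: /

User-agent: *
Disallow: /private/

Here ExampleBot is blocked from the entire site, while all other bots are only asked to stay out of the hypothetical /private/ directory.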

Check out Moz’s SEO Learning Center to learn more about robots.txt files and user-agents.

Now let’s dive into how to block generative AI user-agents.

How to block ChatGPT

You can block ChatGPT from your website by adding a disallow directive for the user-agent GPTBot to your robots.txt, like so:

User-agent: GPTBot
Disallow: /
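
You don’t have to block GPTBot from everything. Robots.txt also supports allow directives, so you can limit the block to certain paths. A sketch with placeholder directory names:

User-agent: GPTBot
Allow: /blog/
Disallow: /members/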

How to block Google Bard

You can block Google Bard (and Vertex AI) by disallowing the user-agent Google-Extended in your robots.txt file, like so:

User-agent: Google-Extended
Disallow: /
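
These groups can live side by side in the same robots.txt file, so blocking both ChatGPT and Google Bard looks like this:

User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /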

How to block Google SGE

I have some bad news if you’re trying to block Google Search Generative Experience (SGE) from crawling your website: you can’t block SGE without blocking Googlebot entirely. If you don’t want Google using your content to train SGE, you’ll have to block Googlebot from crawling your content (e.g., via a robots.txt directive). Alternatively, you can use the robots nosnippet meta tag to prevent Google from displaying your website content in SGE results. Unfortunately, this meta tag will also prevent Google from displaying text snippets or video previews in regular search results. I sincerely hope Google reconsiders this and allows webmasters to block SGE independently of Googlebot in the future.
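
If that tradeoff is acceptable, the nosnippet rule is a standard robots meta tag placed in your page’s <head>:

<meta name="robots" content="nosnippet">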

How to block Bing Chat

Blocking your website from Bing Chat works a little differently than blocking ChatGPT and Google Bard. Instead of using a robots.txt directive, you’ll need to use a meta tag to stop Bing Chat from training its AI on your content.

Adding <meta name="robots" content="nocache"> to a page will limit Bing Chat to only displaying the page’s URL, snippet, or title as a result.

Adding <meta name="robots" content="noarchive"> to your page’s source code will prevent Bing Chat from using the page in its training data entirely.
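
Like the nosnippet tag above, these robots meta tags belong in the <head> of each page you want to protect. A minimal sketch of where the tag sits:

<!DOCTYPE html>
<html>
<head>
  <!-- Tells Bing Chat not to use this page as training data -->
  <meta name="robots" content="noarchive">
  <title>Example Page</title>
</head>
<body>
  <p>Page content here.</p>
</body>
</html>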

CDNs & AI bots

Content delivery networks (CDNs) like Cloudflare and Akamai could help in the battle to block generative AI crawlers. CDN bot detection is a powerful tool against bots and scrapers. Major CDNs don’t appear to be blocking the user-agents above as of now, but that could change, especially if customers demand it.
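
If your site already sits behind a CDN, you don’t have to wait for the provider to act. For instance, a Cloudflare WAF custom rule set to Block with an expression like the one below should stop requests that identify as GPTBot (a sketch based on Cloudflare’s rule expression language; verify the details against their documentation):

(http.user_agent contains "GPTBot")

Keep in mind this only stops crawlers that honestly identify themselves; a scraper can always spoof its user-agent.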

Conclusion

It’s impossible as of now to completely block your content from being used to train AI, but there are some things you can do. Robots.txt directives let you prevent your website content from being scraped by many generative AI tools, though Google SGE is a notable exception at this time. Hopefully Google applies Google-Extended UA directives to SGE or offers other accommodations in the future.

Updated 11/12/23: added clarification on using nosnippet meta data to block Google SGE from displaying content.
