Here I am, back again to pass along some knowledge on how to protect you and your art against generative AI. In fact, I’ve got a few posts happening now, and I’m sure more will come as the technology evolves, so I’m going to start a new category for blog posts under “protecting your art”. You can always subscribe to get blog updates from me if you’re interested in hearing more about these things as they emerge.
Today I’m going to teach you a little bit about the inner workings of web crawlers and how to utilize a robots.txt file to block certain crawlers from accessing your website.
What is a web crawler?
I went over what a web crawler is briefly in my post about the noai and noimageai directives in meta tags and how to add them to your website. In case you didn’t read that post, a good way to think of a web crawler is like a caterpillar. The caterpillar crawls along many leaves as it discovers the world around it, eating the leaves as it goes. This is a very simple analogy, but it’s more or less a good way to picture how a web crawler works. Essentially, a web crawler (“crawler” or “bot”) is a program that consumes web pages and does something with them. Sometimes that “something” is adding certain types of information to its own pages (e.g. a directory site like Yellow Pages might employ a crawler that looks for information like business addresses and phone numbers to add to or correct its directory pages), but that “something” can also be as broad as “add everything to a private database.” Sometimes web crawlers are used to assess a web page, like an accessibility auditor does. You can see where this is going, right?
A web crawler you might be familiar with is Googlebot, a common crawler that you probably do want accessing your pages. Googlebot, which is actually hundreds if not thousands of bots working as part of a bigger program, “crawls” across the web, “indexes” your website (adds it to Google’s database), and then uses an algorithm based on the content it finds to “rank” your pages amongst similar pages when someone uses Google to search for the sort of content your page may have.
Web crawlers have been around for at least thirty years, and technology exists both to help them, by guiding them to certain pages or sending them signals about what kind of information your page contains, and to limit the kind of crawling, if any, they can conduct on your page. There are many ways to signal to a crawler what you do or do not want, but one of the most well-known, widely used solutions is the robots.txt file.
What is a robots.txt file?
Robots.txt is a public file that resides beside your main website files and houses directives for web crawlers. It’s essentially just a bare text file (hence the .txt at the end; that’s the file’s format) placed in your top-level directory.
Not all websites have robots.txt files by default. In fact, many websites do not; it is up to the person running the website to create one if it is desired. In 2023, there has been some talk of trying to move away from robots.txt as a way of noindexing your pages (signaling to a web crawler like Googlebot that you do not want it to index your site). But, so far, robots.txt is still alive and well, and actively respected by many crawlers. This may change, and meta tags may instead replace robots.txt, but for now, since generative AI crawlers are using robots.txt, we’ll focus on that.
Where does a robots.txt file go? How do I edit it if I don’t host my own site?
Your robots.txt file should go in your main public directory. For most cPanel sites, this will be the public_html, public, or similar directory. For any self-hosted WordPress installation, it goes inside your WordPress directory, beside your wp-config.php file. So if you see wp-config.php, you’re in the right place to upload your robots.txt file.
If you are using Wix, you can edit robots.txt from your site dashboard:
- Go to your SEO Dashboard.
- Select Go to Robots.txt Editor under Tools and settings.
- Click View File.
- Add your robots.txt file info by writing the directives under This is your current file.
Weebly allows you to view and toggle noindex for your website, but does not allow you an easy way to edit robots.txt.
If you are using Squarespace, you cannot edit your robots.txt file or upload a new one. Squarespace details what it generates for a robots.txt file in its help files.
If your site architecture/CMS is not listed above, you can Google “[site type] edit robots.txt” and you will usually find documentation from that CMS.
What does a robots.txt file look like?
Now that we know where to put it and what it does, how do we write directives for a crawler? Here’s what the usual content inside a robots.txt looks like:
User-agent: *
Disallow:
This robots.txt file allows all web crawlers access to everything uploaded to the public directory.
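You don’t have to guess how a well-behaved crawler will interpret these rules: Python’s standard library includes a parser for the robots.txt convention. Here’s a minimal sketch checking the allow-all file above (the example.com URLs and paths are just placeholders):

```python
from urllib import robotparser

# The allow-all robots.txt from above, as a well-behaved crawler would read it.
rules = """\
User-agent: *
Disallow:
"""

parser = robotparser.RobotFileParser()
parser.modified()  # mark the rules as freshly loaded; can_fetch assumes
                   # nothing is allowed until a file has been read
parser.parse(rules.splitlines())

# An empty Disallow means every path is fair game for every user agent.
print(parser.can_fetch("Googlebot", "https://example.com/portfolio/"))  # True
print(parser.can_fetch("GPTBot", "https://example.com/art/piece.jpg"))  # True
```

This is the same logic compliant crawlers apply to your real file, so it’s a handy sanity check before you upload.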
Here are some more examples.
Blocks all crawlers from everything:
User-agent: *
Disallow: /
Blocks only Googlebot from everything:
User-agent: Googlebot
Disallow: /
Blocks only Bingbot from a certain page:
User-agent: Bingbot
Disallow: /the-page-url
Blocks all .jpg files from every web crawler:
User-agent: *
Disallow: /*.jpg
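A single robots.txt can combine several of these groups: each User-agent line starts a new block of rules, separated by blank lines. A hypothetical combined file (the /drafts/ path is just a placeholder) might look like this:

```text
User-agent: *
Disallow: /drafts/

User-agent: Bingbot
Disallow: /the-page-url
```

Note that a crawler follows only the group that best matches its user agent, so in this example Bingbot obeys its own group and ignores the * group entirely.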
Moz has some really good breakdowns of what kinds of things you can put in your robots.txt. The asterisks and dollar signs are wildcard patterns, simpler than but similar in spirit to regular expressions (“regex”), if you need to build a particular matching rule. Test your robots.txt file using a tester like this one to see if certain bots are blocked or allowed.
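If you’d rather test locally than use a web-based tool, the same standard-library parser works as a quick tester. This sketch checks the “block only Bingbot from a certain page” example from above (again, the example.com URLs are just placeholders):

```python
from urllib import robotparser

# The "block only Bingbot from a certain page" example from above.
rules = """\
User-agent: Bingbot
Disallow: /the-page-url
"""

parser = robotparser.RobotFileParser()
parser.modified()  # mark the rules as freshly loaded before querying them
parser.parse(rules.splitlines())

# Bingbot is blocked from that one page; other agents and pages are not.
print(parser.can_fetch("Bingbot", "https://example.com/the-page-url"))    # False
print(parser.can_fetch("Googlebot", "https://example.com/the-page-url"))  # True
print(parser.can_fetch("Bingbot", "https://example.com/other-page"))      # True
```

Swapping in your own rules and URLs lets you verify a file before it ever goes live.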
So what can I do about AI?
Here’s the good part. Now that you’ve learned about robots.txt, where to put it, how it works, and how you can direct web crawlers, you can start to block any web crawler that has said it will use data scraped from the internet for AI training.
OpenAI, the people behind ChatGPT and DALL-E, have introduced a user-agent specific to their web crawler. That means you can block their crawler from your content by using the following in your robots.txt:
User-agent: GPTBot
Disallow: /
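As more AI-related crawlers publish their user agents, you can stack them in the same file. One example worth knowing: CCBot is the user agent of Common Crawl, whose web archives have been widely used as AI training data. A file blocking both (verify each vendor’s documented user agent before relying on this) would look like:

```text
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /
```

Ordinary search crawlers like Googlebot are unaffected by these groups, since they only match the named user agents.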
You may also choose to block Googlebot (or GoogleOther, which Google has registered for internal research and development crawls), given Google has announced it may use content its crawlers scrape from the web for AI training. Be warned, though: blocking Googlebot will also deindex you from Google’s search engine.
Keep an eye out for other crawlers, like Bingbot, Yext, or any others that may want to use your website for AI training, and update your robots.txt accordingly. Most legitimate web crawlers publish their user agent name so you can allow or disallow them as needed; just search for “[crawler name] user agent” or “[crawler name] user agent robots.txt” to find it.
Will this block AI?
It is important to know that this will not block AI outright. This is just one small step in a long line of steps we now have to take, and will keep adding to, to try to protect our works on the internet against data scraping. As noted in the meta noai and noimageai post, nefarious web crawlers can choose to ignore a robots.txt entirely. However, if a company publishes its user agent, you can feel better about its crawler respecting your robots.txt, as it is making a clear attempt to give you an “opt out” for your website… assuming you’re tech-savvy and able to employ it.
I will continue to report on ways to protect your artwork from AI as they become more available. If you have any questions, I’m happy to answer them in the comments if I’m able!