Cloudflare updates "robots.txt" — what does that mean for the future of the web?

This move aims to give creators more power in the AI era, though questions remain over whether major players like Google will respect it. (Image credit: Cloudflare, Photo by Conny Schneider on Unsplash)

Robots.txt is a small text file that sits at the root of a website. It tells search engines and other bots which parts of the site they’re allowed to visit and which they’re not, working like a digital “do not enter” sign. In the early days of the internet, this worked well.
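For reference, a classic robots.txt is just a short list of per-crawler allow and deny rules. The snippet below is an illustrative example rather than any real site’s file; the paths and user agents are placeholders:

    User-agent: Googlebot
    Allow: /

    User-agent: *
    Disallow: /private/
    Disallow: /admin/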

Search engines like Google and Bing followed the rules, and most website owners were happy with that balance. But the rise of AI has changed the picture. AI bots aren’t indexing websites in the traditional sense. Instead, they copy content to train chatbots or generate answers.

Many AI companies ignore robots.txt entirely, or disguise their crawlers to slip past restrictions. Cloudflare protects around 20% of the internet, which gives it a unique view of how these AI bots behave at scale. That’s why it has introduced the Content Signals Policy, a new way for publishers to say whether their content is okay to use for AI training — or not.

What Cloudflare’s content signals policy actually does

As reported by Digiday, the new policy builds on top of robots.txt by adding extra instructions for bots to follow. Instead of only saying which pages can be crawled, it lets publishers set rules for how their content can be used after it’s accessed.

There are three new “signals” to choose from:

  • search – allows content to be used for building a search index and showing links or snippets in results.
  • ai-input – covers using content directly in AI answers, such as when a chatbot pulls from a page to generate a response.
  • ai-train – controls whether content can be used to train or fine-tune AI models.

These signals use simple yes or no values. For example, a site could allow its content to appear in search results but block it from AI training.
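Cloudflare expresses these preferences inside robots.txt itself, alongside the usual crawl rules. The snippet below is a sketch of that combination, allowing search but blocking AI training as in the example above; treat the exact Content-Signal syntax as illustrative, based on Cloudflare’s published examples rather than verified against any particular deployment:

    # Sketch of a robots.txt entry with content signals (syntax illustrative)
    Content-Signal: search=yes, ai-train=no

    User-agent: *
    Allow: /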

Cloudflare has already rolled this out to more than 3.8 million domains. By default, search is set to “yes,” ai-train is set to “no,” and ai-input is left neutral until the site owner decides otherwise.

Why enforcement still matters — and Google’s role

The Google AI logo (Image credit: Getty Images | NurPhoto)

Whilst this update is a welcome step, some bots will still ignore the new signals. Website owners should combine them with extra protection, such as web application firewalls, which filter and monitor traffic between a site and the internet.

Bot management is also important. This uses machine learning to spot and block malicious automated traffic, while still letting real users through.
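On Cloudflare specifically, that kind of filtering is usually written as a firewall rule over per-request bot signals. The expression below is a rough sketch, not a recommended configuration; the cf.bot_management fields require a plan with Bot Management enabled, and the score threshold is an assumed example:

    # Custom rule expression (sketch): challenge likely-automated traffic
    # that is not a verified bot. The threshold of 30 is an assumption.
    (cf.bot_management.score lt 30 and not cf.bot_management.verified_bot)
    # Suggested action: Managed Challenge or Block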

Even if some AI bots choose to ignore these rules, the policy strengthens the legal position of publishers. Cloudflare frames content signals as a “reservation of rights,” which could be used in future cases against AI companies.

If AI firms decide to respect the signals, it could set a new standard for the web. If not, stricter blocking and more aggressive legal action are likely, an outcome that many creators who object to AI use of their content will no doubt welcome.

Another sticking point is how Google handles its crawlers. Googlebot covers both traditional search and AI Overviews, meaning publishers cannot opt out of Google’s AI features without also losing search visibility.

This creates an unfair trade-off. Either allow Google to use content for AI, or risk losing valuable traffic. Smaller publishers are hit hardest here, as they depend on Google search to reach their audiences.

The future of AI scraping and monetization

It’s good to see Cloudflare taking steps to protect domains from the wave of AI bots currently scraping anything publicly available online. Even OpenAI appears to train on whatever it can reach: its recent video model, Sora 2, can convincingly recreate missions from Cyberpunk 2077, and it’s hard to believe that permission was ever granted to use that content.

The same goes for videos featuring Mario or Pikachu. Nintendo is unlikely to ignore such uses, but given its history, it’s just as likely it will target a small fan project instead of going after a major AI company.

Cloudflare is also testing a “pay-per-crawl” feature, which would let domain owners charge AI crawlers each time they access a site. A crawler that doesn’t provide payment details is met with an HTTP 402 Payment Required response.
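As a hypothetical illustration of how a crawler would hit that wall, the short Python sketch below simply checks for the 402 status; the URL and user-agent string are made up, and the actual payment negotiation would go through whatever mechanism Cloudflare ultimately ships:

    import requests

    # Hypothetical crawler request; the URL and User-Agent are placeholders.
    resp = requests.get(
        "https://example.com/article.html",
        headers={"User-Agent": "ExampleAICrawler/1.0"},
        timeout=10,
    )

    if resp.status_code == 402:
        # Pay-per-crawl site: content is withheld until payment is arranged.
        print("402 Payment Required: this site charges crawlers for access")
    else:
        print(f"Fetched {len(resp.content)} bytes")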




Adam Hales
Contributor

Adam is a Psychology Master’s graduate passionate about gaming, community building, and digital engagement. A lifelong Xbox fan since 2001, he started with Halo: Combat Evolved and remains an avid achievement hunter. Over the years, he has engaged with several Discord communities, helping them get established and grow. Gaming has always been more than a hobby for Adam—it’s where he’s met many friends, taken on new challenges, and connected with communities that share his passion.
